Setting reasonable priors for computational modeling

Note: This blog is based on information available as of May 2024 and may be updated periodically.

In an earlier blog (titled: “Differences between Bayesian and Frequentist approaches”), I talked about the need to define priors for model parameters estimated within a Bayesian framework. To briefly recapitulate: Priors are combined with the likelihood of the observed data to update beliefs and form posterior distributions of parameters. We define priors to summarize our a priori assumptions (i.e., before we look at the data) about possible parameter values. The choice of priors should be informed by the context of the data (laboratory, field, or game experiments; longitudinal vs. cross-sectional data), prior applications in the same research field (e.g., diffusion decision models, Ornstein-Uhlenbeck models, leaky accumulator models), and the scale of variables (e.g., restriction to positive values, rate parameters). Below, I provide a summary of common priors, including those used for parameters of the diffusion decision model (DDM; Ratcliff, 1978).

Disclosure:

We want to choose priors wisely because they can influence model convergence and the extent of bias in our interpretations of estimated parameters. Below, I provide a list of commonly used priors, including explanations of when and why we use them. These tips are not exhaustive and are drawn from my own experience as well as the extensive resources provided by the Stan community and other experts in the field. These recommendations reflect our current knowledge and should therefore be treated with caution. I will update them as our understanding evolves.

General advice:

Use weakly informative priors rather than flat uniform priors. Uniform priors are NOT non-informative: they impose the assumption that all values within the defined range are equally plausible and that values outside it are impossible. Moreover, hard constraints should only be used if they represent true constraints, and even then it’s better to transform your parameters to an unconstrained scale so that the sampler does not get stuck at the boundary.
Standardize your data so that covariates are on the same scale. This makes coefficients easier to compare and interpret, and it helps model convergence because a default prior (e.g., one centered at zero with a fixed scale) is then appropriately scaled and centered relative to the data.
When setting up models, we typically specify one prior for each model parameter. By doing so, we assume that parameters are a priori independent of each other. We do so because the model is then easier to handle and to understand. However, we need to be careful about model parameterization, and we should check this assumption of prior independence (this will be covered in a future blog).
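
The standardization step mentioned above can be sketched as follows. This is only a minimal illustration in Python using the standard library (the plotting code for this blog is in R); the function name standardize and the example values are hypothetical.

```python
import statistics


def standardize(values):
    """Return z-scores: subtract the mean, divide by the sample standard deviation."""
    mean = statistics.fmean(values)
    sd = statistics.stdev(values)  # sample SD (n - 1 in the denominator)
    return [(v - mean) / sd for v in values]


# Hypothetical covariate on a large raw scale (e.g., reaction times in ms).
raw = [420.0, 515.0, 610.0, 480.0, 730.0, 555.0]
z = standardize(raw)

# After standardization the covariate has mean 0 and SD 1, so a prior such as
# Normal(0, 2.5) on its coefficient is on a sensible scale.
print([round(v, 2) for v in z])
print(round(statistics.fmean(z), 6), round(statistics.stdev(z), 6))
```

Whether to divide by one or two standard deviations, or to standardize the outcome as well, is a modeling choice that this sketch does not settle.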

Common weakly informative priors (alphabetical order):

Below is a list of common priors with a brief description of when we use them. I generally recommend always plotting the specified priors to get a better handle on them. I provide the corresponding R code for the plots shown below on my GitHub account (under the section "Visualizing Priors").

Beta distribution as Prior:

The Beta distribution is commonly used for parameters that are probabilities or proportions and are therefore constrained to the interval [0, 1]. The beta distribution (in Stan notation) is defined as follows: x ~ beta(alpha, beta), where alpha and beta represent the shape parameters. Note that if both parameters are set to 1, the beta distribution reduces to a uniform distribution. For a correlation parameter, the Stan community recommends a Beta(2,2) (a correlation lies in [-1, 1], so it first has to be rescaled to [0, 1]). This prior keeps point estimates away from the boundaries but still allows the likelihood to get close to them if this is what the data suggest. If we choose 2 for both alpha and beta, we obtain a mean of 0.5 and a moderate variance (0.05, i.e., a standard deviation of about 0.22). In Bayesian inference, the Beta distribution is conjugate to the binomial likelihood, which means the posterior distribution is also a Beta distribution. This property can simplify analytical and computational procedures. In the DDM context, we can use the beta distribution as a prior for the starting point parameter (z). In this case, we could center the distribution around 0.5 if we want to impose the initial assumption that there is no starting point bias in our data. Some have argued that starting points range between 0.3 and 0.7, but I think that this is highly context-dependent and depends on whether you apply a DDM to a value-based decision-making task (for which I have already seen wider ranges of values) or to a perceptual task.
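
To make the mean, the variance, and the conjugacy property of the Beta prior concrete, here is a minimal Python sketch (standard library only). The function names and the data (14 successes in 20 trials) are made up for illustration.

```python
def beta_mean_var(alpha, beta):
    """Mean and variance of a Beta(alpha, beta) distribution."""
    total = alpha + beta
    mean = alpha / total
    var = alpha * beta / (total ** 2 * (total + 1))
    return mean, var


def beta_binomial_update(alpha, beta, successes, trials):
    """Conjugate update: Beta prior + binomial data -> Beta posterior."""
    return alpha + successes, beta + (trials - successes)


prior = (2.0, 2.0)
prior_mean, prior_var = beta_mean_var(*prior)  # mean 0.5, variance 0.05

# Hypothetical data: 14 successes in 20 trials.
posterior = beta_binomial_update(*prior, successes=14, trials=20)  # Beta(16, 8)
post_mean, post_var = beta_mean_var(*posterior)

print(prior_mean, prior_var)
print(posterior, round(post_mean, 3), round(post_var, 4))
```

The posterior mean (16/24, about 0.67) sits between the prior mean of 0.5 and the observed proportion of 0.7, which is the mild regularization a Beta(2,2) prior is meant to provide.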

Exponential distribution as Prior:

This prior is useful for parameters that are strictly positive and that represent a rate (as in a Poisson process). We also often use an exponential prior for variance components; it is favored over the commonly used half-Cauchy prior because the latter tends to place too much mass on extreme values, particularly in logistic regressions (Bürkner, 2017; McElreath, 2016). The exponential prior assumes that large values of the parameter are increasingly unlikely. The exponential distribution (in Stan notation) is defined as follows: x ~ exponential(lambda), with lambda representing the rate parameter. A smaller lambda results in a wider spread (higher mean), and a larger lambda concentrates more mass near zero. The mean of the distribution can guide the choice of lambda. Specifically, the mean of an exponential distribution is 1/lambda. Let’s assume we expect the values of our parameter to lie between 0 and 1.3. Furthermore, we expect the mean to lie roughly in the middle of that range. So, we could choose a mean of around 0.65, which implies: lambda = 1/0.65 ≈ 1.54. Note that the exponential prior does not enforce the upper end of this range: with this lambda, about 13.5% of the prior mass still lies above 1.3.
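
The arithmetic for choosing lambda from a target mean can be checked with a few lines of Python (standard library only). The helper names are my own and purely illustrative.

```python
import math


def exponential_rate_from_mean(mean):
    """The mean of exponential(lambda) is 1 / lambda, so lambda = 1 / mean."""
    return 1.0 / mean


def exponential_tail(rate, x):
    """P(X > x) for X ~ exponential(rate)."""
    return math.exp(-rate * x)


rate = exponential_rate_from_mean(0.65)  # about 1.54
tail = exponential_tail(rate, 1.3)       # exp(-2), about 0.135

print(round(rate, 2), round(tail, 3))
```

If less mass above the expected upper end is desired, a larger lambda (i.e., a smaller prior mean) would be one way to get there.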

Gamma distribution as Prior:

This prior is useful for parameters that represent rates or scales of non-negative continuous data. The shape and rate (or scale) parameters of the gamma distribution can be adjusted to reflect different levels of knowledge and uncertainty about the parameter’s expected value. The Stan community recommends Gamma(2,0.1) for scale parameters in a hierarchical model to keep the mode away from 0 while still allowing it to be arbitrarily close to 0 if the likelihood suggests that based on the data. However, this prior can induce positive biases in estimates when the number of groups is small. Therefore, they suggest Gamma(2, 1/A) for small groups, where A is a scale parameter representing how high the outcome variable can be. The gamma distribution (in Stan notation) is defined as follows: x ~ gamma(alpha, beta), with alpha representing the shape and beta representing the rate. In the DDM context, we can use gamma distributions as priors for boundary separation (a), which typically takes values between 0.5 and 2, with higher values indicating a greater emphasis on accuracy over speed.
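
To get a feel for what a given shape and rate imply, the following Python sketch (standard library only) computes the mean, standard deviation, and mode of a gamma prior, and inverts the relationship to find shape and rate from a target mean and standard deviation. The function names and the target values for boundary separation (mean 1.25, SD 0.5) are assumptions chosen for illustration, not recommendations.

```python
import math


def gamma_summary(shape, rate):
    """Mean, SD, and mode of gamma(shape, rate) in the shape/rate parameterization."""
    mean = shape / rate
    sd = math.sqrt(shape) / rate
    mode = (shape - 1.0) / rate if shape >= 1.0 else 0.0
    return mean, sd, mode


def gamma_from_mean_sd(mean, sd):
    """Moment matching: shape = (mean / sd)^2, rate = mean / sd^2."""
    return (mean / sd) ** 2, mean / sd ** 2


# Gamma(2, 0.1): mean 20, mode 10, so the mode is kept well away from zero.
print(gamma_summary(2.0, 0.1))

# Hypothetical prior for boundary separation centered within the 0.5 to 2 range.
shape, rate = gamma_from_mean_sd(1.25, 0.5)  # shape 6.25, rate 5.0
print(shape, rate, gamma_summary(shape, rate))
```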

Half-Normal distribution as Prior:

This refers to a normal distribution restricted to non-negative values. Because its density is highest at zero, it expresses the assumption that the actual parameter value could be close to zero. Sigma refers to the standard deviation of the underlying normal distribution and determines the range of plausible parameter values. The half-normal distribution can be written as x ~ Halfnormal(μ,σ), for example with μ=0 and σ=2.5. Note that Stan has no dedicated half-normal function: there, the parameter is declared with a lower bound of zero (real<lower=0>) and given a normal(0, 2.5) prior. In the DDM context, we can use this prior for across-trial variability parameters, whose expected ranges are more context-specific but whose values are typically small relative to their respective main parameters.
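
As a quick check of what σ = 2.5 implies, this Python sketch (standard library only) computes the mean and a few cumulative probabilities of a zero-centered half-normal. The function names are illustrative.

```python
import math


def half_normal_mean(sigma):
    """Mean of a half-normal with scale sigma: sigma * sqrt(2 / pi)."""
    return sigma * math.sqrt(2.0 / math.pi)


def half_normal_cdf(x, sigma):
    """P(X <= x) for a zero-centered normal restricted to x >= 0."""
    if x <= 0.0:
        return 0.0
    return math.erf(x / (sigma * math.sqrt(2.0)))


sigma = 2.5
print(round(half_normal_mean(sigma), 3))      # about 1.995
print(round(half_normal_cdf(2.5, sigma), 3))  # mass below 1 sigma, about 0.683
print(round(half_normal_cdf(5.0, sigma), 3))  # mass below 2 sigma, about 0.954
```

So with σ = 2.5, roughly two thirds of the prior mass lies below 2.5. For variability parameters that are expected to be small, a smaller σ may be more appropriate.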

Log-Normal distribution as Prior:

This prior is useful if we expect a parameter to vary over several orders of magnitude. It assumes that the logarithm of the parameter follows a normal distribution and is therefore suitable when the parameter can be both small and very large. For instance, imagine we assume that the mean of our prior is around 1.5 and that possible values can be 2 standard deviations above and below that mean, but they cannot go negative. In this case, a log-normal prior seems suitable because it allows us to specify a distribution that is strictly positive while also centering it around our desired mean with a large variance. The lognormal distribution (in Stan notation) is defined as follows: x ~ lognormal(mu, sigma), where mu and sigma represent the mean and standard deviation of log(x), not of x itself. In the DDM context, we can use this prior for the nondecision time parameter (Ter), particularly if we assume that the range of possible values is wide (e.g., different conditions that present stimuli across different modalities such as visual and auditory).
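
Because mu and sigma live on the log scale, a desired mean on the natural scale has to be converted. A minimal Python sketch (standard library only) of this moment-matching step follows; the function names and the target standard deviation of 0.75 are assumptions for illustration.

```python
import math


def lognormal_params_from_mean_sd(mean, sd):
    """Find (mu, sigma) on the log scale that give the desired mean and SD of x."""
    sigma_sq = math.log(1.0 + (sd / mean) ** 2)
    mu = math.log(mean) - sigma_sq / 2.0
    return mu, math.sqrt(sigma_sq)


def lognormal_mean_median(mu, sigma):
    """Mean is exp(mu + sigma^2 / 2); median is exp(mu)."""
    return math.exp(mu + sigma ** 2 / 2.0), math.exp(mu)


# Target on the natural scale: mean 1.5 (as in the text), SD 0.75 (hypothetical).
mu, sigma = lognormal_params_from_mean_sd(1.5, 0.75)
mean, median = lognormal_mean_median(mu, sigma)

print(round(mu, 3), round(sigma, 3))
print(round(mean, 3), round(median, 3))
```

Note that the median is below the mean, reflecting the right skew: simply setting mu = log(1.5) would center the median, not the mean, at 1.5.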

Choosing between a gamma and lognormal as prior:

Choosing between a gamma and a lognormal distribution as a prior depends on the specific characteristics of the parameter and the context of the problem. Both the gamma and the lognormal distribution are defined on the positive real line, but the latter has a heavier right tail and can accommodate stronger skewness. If the parameter is strictly positive and there is no need to accommodate extreme skewness or heavy tails, a gamma distribution seems more appropriate due to its simplicity and tractability. If, instead, there is prior knowledge that the parameter's distribution is strongly right-skewed, the lognormal distribution may be preferred. In some cases, parameters have natural interpretations on a logarithmic scale. For example, parameters related to rates, scales, or elasticities might be more naturally interpreted on a logarithmic scale. In such cases, a lognormal prior might be more interpretable and align better with the theoretical understanding of the parameter. However, if the likelihood function and other priors in the model allow for conjugacy with the gamma distribution, the gamma might be preferred for computational and analytical convenience.
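
One way to see the difference in tail behavior is to match a gamma and a lognormal on mean and standard deviation and compare how much mass each leaves in the far right tail. The Python sketch below (standard library only) does this for a hypothetical mean of 1 and SD of 0.5. It uses an integer shape so that the gamma tail has a closed form (Erlang), which is a convenience of this sketch rather than a requirement of the prior.

```python
import math
from statistics import NormalDist


def erlang_tail(shape, rate, x):
    """P(X > x) for a gamma distribution with integer shape (Erlang)."""
    terms = sum((rate * x) ** k / math.factorial(k) for k in range(shape))
    return math.exp(-rate * x) * terms


def lognormal_tail(mu, sigma, x):
    """P(X > x) for X ~ lognormal(mu, sigma)."""
    return 1.0 - NormalDist(mu, sigma).cdf(math.log(x))


# Both priors have mean 1 and SD 0.5 (hypothetical values).
shape, rate = 4, 4.0                          # gamma: mean 4/4, SD sqrt(4)/4
sigma = math.sqrt(math.log(1.0 + 0.5 ** 2))   # lognormal moment matching
mu = -sigma ** 2 / 2.0                        # log(1) - sigma^2 / 2

for x in (3.0, 4.0):
    print(x, erlang_tail(shape, rate, x), lognormal_tail(mu, sigma, x))
```

At both cut-offs the lognormal leaves more prior mass in the tail than the moment-matched gamma, which is the sense in which it is the heavier-tailed choice.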

Normal distribution as Prior:

The normal distribution is commonly used as a prior for parameters expected to be around a certain value with some uncertainty. In hierarchical modeling, it is often used for group-level parameters to pool information across groups, enhancing parameter estimation accuracy. Because we often center and scale the data (to put variables on the same scale), we commonly use Normal(0,2.5), which provides a weakly informative prior centered at zero with a standard deviation of 2.5. Centering the prior at zero mirrors the null hypothesis in the Frequentist approach, where we assume the parameter is equal to zero. This prior setting helps to regularize the estimates, preventing overfitting while allowing sufficient flexibility for the data to inform the parameter estimates. In the DDM context, empirical applications often find drift rates in the range of -5 to 5 (but again, this can vary depending on the task and individual differences). In this case, a Normal(0,2.5) prior seems suitable because it places about 95% of its mass within that range, as shown in the plot below.
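
How much prior mass Normal(0,2.5) places on the range -5 to 5 can be checked directly with Python's standard library. The helper name central_mass is my own.

```python
from statistics import NormalDist

prior = NormalDist(mu=0.0, sigma=2.5)


def central_mass(dist, lower, upper):
    """Probability mass that dist places on the interval [lower, upper]."""
    return dist.cdf(upper) - dist.cdf(lower)


mass = central_mass(prior, -5.0, 5.0)            # +/- 2 SD, about 0.954
lo, hi = prior.inv_cdf(0.025), prior.inv_cdf(0.975)  # central 95% interval

print(round(mass, 3))
print(round(lo, 2), round(hi, 2))
```

The same two lines can be reused to sanity-check any other normal prior against the range of values reported in the literature for a given parameter.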

t-distribution as Prior:

The t-distribution is useful if we expect a parameter to have a distribution with heavier tails than the normal distribution, as shown in the plot. In that case, we can use a Student's t-distribution, whose fatter tails leave more room for large values. For instance, we could set: student_t(3,0,2), i.e., 3 degrees of freedom, a location of 0, and a scale of 2.

Cauchy distribution as a special case:

The Cauchy is a special case of the t-distribution with only 1 degree of freedom. It has an even heavier tail than other t-distributions. It is often used as a prior for location parameters, particularly when we know very little a priori about the scale of the parameter. However, note that while this heavier tail makes the prior less likely to pull estimates toward a priori values, it also makes estimates more prone to being influenced by extreme values. Moreover, unlike many other distributions, the Cauchy is generally not a conjugate prior, which makes analytical solutions for the posterior more difficult. Note that the Stan community has moved away from the Cauchy and now often uses normal(0, 2.5) as a default prior for parameters whose data have been normalized.
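
The difference in tail weight between these priors can be quantified by the prior mass each places beyond a large value. The Python sketch below (standard library only) compares normal(0, 2.5), student_t(3, 0, 2), and a Cauchy. The Cauchy scale of 2 and the cut-off of 10 are assumptions chosen for comparison, and the t CDF used here is the closed form that holds for 3 degrees of freedom only.

```python
import math
from statistics import NormalDist


def student_t3_cdf(x, loc=0.0, scale=1.0):
    """CDF of a Student's t with 3 degrees of freedom (closed form for df = 3 only)."""
    u = (x - loc) / scale / math.sqrt(3.0)
    return 0.5 + (u / (1.0 + u * u) + math.atan(u)) / math.pi


def cauchy_cdf(x, loc=0.0, scale=1.0):
    """CDF of a Cauchy distribution (a t-distribution with 1 degree of freedom)."""
    return 0.5 + math.atan((x - loc) / scale) / math.pi


def two_sided_tail(cdf_value):
    """P(|X| > x) for a distribution symmetric around zero, given its CDF at x."""
    return 2.0 * (1.0 - cdf_value)


x = 10.0
tails = {
    "normal(0, 2.5)": two_sided_tail(NormalDist(0.0, 2.5).cdf(x)),
    "student_t(3, 0, 2)": two_sided_tail(student_t3_cdf(x, 0.0, 2.0)),
    "cauchy(0, 2)": two_sided_tail(cauchy_cdf(x, 0.0, 2.0)),
}
for name, mass in tails.items():
    print(f"{name}: P(|x| > {x}) = {mass:.5f}")
```

The tail mass grows by orders of magnitude from the normal to the t to the Cauchy, which illustrates why the Cauchy allows extreme parameter values that the normal effectively rules out.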

Priors for hierarchical covariance matrices:

The Lewandowski-Kurowicka-Joe (LKJ) distribution is used as a prior for correlation matrices in Bayesian hierarchical models (Lewandowski et al., 2009). It is particularly useful for specifying priors on the correlation structure of multivariate normal distributions. The LKJ distribution ensures that the resulting correlation matrix is positive definite and allows for control over the concentration of correlations around zero. When using the LKJ prior in practice, you typically set a shape parameter 𝜂; 𝜂=1 implies a uniform prior over correlation matrices, while 𝜂>1 concentrates the prior around the identity matrix, favoring weaker correlations. For a hierarchical covariance matrix, a Wishart (not inverse-Wishart) is sometimes also suggested (see link to Stan community website below).
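
For the simplest case of a 2x2 correlation matrix with a single correlation rho, the LKJ density is proportional to det(R)^(𝜂-1) = (1 - rho^2)^(𝜂-1), so the effect of 𝜂 can be inspected without any special library. The Python sketch below (standard library only) integrates this density numerically. The function names and the chosen bound of 0.5 are illustrative, and the sketch does not cover larger matrices.

```python
def lkj_2x2_density(rho, eta):
    """Unnormalized LKJ density for the single correlation of a 2x2 matrix."""
    return (1.0 - rho * rho) ** (eta - 1.0)


def mass_within(bound, eta, steps=20000):
    """P(|rho| < bound) under LKJ(eta), using midpoint-rule integration."""
    def integrate(a, b):
        h = (b - a) / steps
        return h * sum(lkj_2x2_density(a + (i + 0.5) * h, eta) for i in range(steps))
    return integrate(-bound, bound) / integrate(-1.0, 1.0)


for eta in (1.0, 2.0, 4.0):
    print(eta, round(mass_within(0.5, eta), 4))
```

With 𝜂 = 1 exactly half of the prior mass lies in |rho| < 0.5 (the uniform case), and the share increases as 𝜂 grows, which is the concentration around weaker correlations described above.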

Frequently asked questions:

Why should I care about conjugacy of priors?

A prior is called conjugate to a likelihood if the resulting posterior distribution belongs to the same family as the prior. This choice simplifies the calculations because the posterior has the same functional form as the prior and can be obtained in closed form. For example, if the likelihood is normal with known variance, choosing a normal prior for the mean results in a normal posterior for the mean. However, in many cases, practitioners may choose non-conjugate priors for various reasons, such as flexibility or capturing specific characteristics of the data more accurately. For instance, lognormal distributions are not conjugate to many likelihood functions, but they still offer certain analytical conveniences and computational advantages over more complex distributions.
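
The normal example can be written out in a few lines. This Python sketch (standard library only) performs the closed-form update for the mean of normally distributed data with known standard deviation; the function name, the data, and the assumed data SD of 1 are hypothetical.

```python
def normal_known_sd_update(prior_mean, prior_sd, data, data_sd):
    """Conjugate update for the mean of normal data with known SD.

    Precisions (1 / variance) add up, and the posterior mean is the
    precision-weighted average of the prior mean and the sample mean.
    """
    prior_precision = 1.0 / prior_sd ** 2
    data_precision = len(data) / data_sd ** 2
    post_precision = prior_precision + data_precision
    sample_mean = sum(data) / len(data)
    post_mean = (prior_precision * prior_mean + data_precision * sample_mean) / post_precision
    return post_mean, post_precision ** -0.5


# Hypothetical observations, combined with a Normal(0, 2.5) prior on the mean.
data = [1.2, 0.8, 1.5, 1.1]
post_mean, post_sd = normal_known_sd_update(0.0, 2.5, data, data_sd=1.0)

print(round(post_mean, 3), round(post_sd, 3))
```

The posterior is again normal, with a mean pulled slightly from the sample mean (1.15) toward the prior mean (0) and a standard deviation much smaller than the prior's. With a non-conjugate prior, the same posterior would have to be approximated numerically, for example by MCMC.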

Useful Links:

https://stackoverflow.com/questions/61670240/how-to-decide-on-what-priors-distributions-to-use-for-parameters-in-pymc3

https://github.com/stan-dev/stan/wiki/Prior-Choice-Recommendations

References:

General advice on priors:

Bürkner, P.-C. (2017a). brms: An R Package for Bayesian Multilevel Models Using Stan. Journal of Statistical Software, 80, 1–28. https://doi.org/10.18637/jss.v080.i01
Bürkner, P.-C. (2017b). Advanced Bayesian Multilevel Modeling with the R Package brms. arXiv:1705.11123 [stat]. http://arxiv.org/abs/1705.11123
Gelman, A., & Hill, J. (2006). Data Analysis Using Regression and Multilevel/Hierarchical Models. Cambridge University Press.
Kruschke, J. (2014). Doing Bayesian Data Analysis: A Tutorial with R, JAGS, and Stan. Academic Press.
Lewandowski, D., Kurowicka, D., & Joe, H. (2009). Generating random correlation matrices based on vines and extended onion method. Journal of Multivariate Analysis, 100(9), 1989–2001. https://doi.org/10.1016/j.jmva.2009.04.008
McElreath, R. (2016). Statistical Rethinking: A Bayesian Course with Examples in R and Stan. Chapman and Hall/CRC. https://doi.org/10.1201/9781315372495

For setting priors for the DDM:

Myers, C. E., Interian, A., & Moustafa, A. A. (2022). A practical introduction to using the drift diffusion model of decision-making in cognitive psychology, neuroscience, and health sciences. Frontiers in Psychology, 13, 1039172.
Tran, N. H., Van Maanen, L., Heathcote, A., & Matzke, D. (2021). Systematic parameter reviews in cognitive modeling: Towards a robust and cumulative characterization of psychological processes in the diffusion decision model. Frontiers in Psychology, 11, 608287.
To see the initial priors used for HDDM: Wiecki, T. V., Sofer, I., & Frank, M. J. (2013). HDDM: Hierarchical Bayesian estimation of the drift-diffusion model in Python. Frontiers in Neuroinformatics, 7, 55610.

For general information about the DDM and estimating model parameters:

Ratcliff, R. (1978). A theory of memory retrieval. Psychological Review, 85(2), 59.
Ratcliff, R., & Tuerlinckx, F. (2002). Estimating parameters of the diffusion model: Approaches to dealing with contaminant reaction times and parameter variability. Psychonomic Bulletin & Review, 9(3), 438–481.