Baby’s first Gibb’s Sampler

Published:

My first rotation during graduate school was in the lab I ultimately joined, and my project was to estimate heritability in a group of diverse outbred mice, conveniently named the Diversity Outbred mouse population. Not knowing much about statistics, MCMC, Bayesian methods, R, the list goes on, I probably made a lot of mistakes.

The code for the Gibb’s sampler is written up in R, and I wrote up a little report about the endeavor as it will likely never get published. Here it is:


Background

Estimates of heritability and genetic correlation can be derived from studies in structured populations with known characteristics. Diversity Outbred (DO) mice are a large, heterogeneous population derived from eight inbred founder strains bred randomly over many generations. The resulting mice contain extensive genetic diversity from random matings.

Aims

In this project, we use data generated from two populations of DO mice to estimate the parameters of heritability and genetic correlation. In particular, we aim to determine not only the point estimates of heritability and genetic correlation but also interval estimates. Typically there is large uncertainty for estimates of these parameters and reports of point estimates may be misleading on their own.

  • DO1: population of 315 male and female mice tested with sex as a covariate; \(h^2\) was calculated on 10 phenotypes
  • DO2: separate population of 287 female mice fed either a chrolic acid or high protein diet, which was included in the models for 8 phenotypes

Statement of Model

Narrow-sense heritability, \(h^2\), is calculated as \(\frac{Var(\text{Additive})}{Var(\text{Phenotype})}\). Under such a definition, \[ h^2 = \frac{\tau^2}{\tau^2 + \sigma^2} \] where \(\tau^2\) is a parameter of variance due to inherited genetic factors and \(\sigma^2\) is a parameter of residual variance that is due to noise.

Likelihood

\[ \mathbf{y} \sim \mathrm{N}(\mathbf{X}\boldsymbol{\beta}+\mathbf{u}, \mathbf{I}\sigma^2) \] \[ \boldsymbol{\beta} \sim \mathrm{N}(0,\mathbf{I}\sigma^2_\beta) \]

where \(\textbf{y}\) is an \(\textit{n}\)-vector containing a phenotype for \(\textit{n}\) individuals, \(\textbf{R}\) is an \(\mathit{n}\times\mathit{n}\) similarity kernel for those individuals, and \(\textbf{X}\) is an \(\mathit{n}\times\mathit{p}\) design matrix of \(\textit{p}\) covariates. \(\textbf{u}\) represents a vector of additive genetic effects that impact expression of the phenotype and is influenced by an individual’s genetic similarity to the rest of the cohort.

\[ \mathbf{u} \sim \mathrm{N}(0,\mathbf{R}\tau^2) \]

Sampling of \(h^2\) was done through Markov Chain Monte Carlo procedures via Gibbs sampling to find \[ p(h^2|\mathbf{y}) \] The \(h^2\) mean estimate is calculated with 350 samples from the Gibbs sampler from an original 5000 iterations (1500 burn-in and thinning by 10).

Priors

Instead of explicitly putting a prior on \(h^2\), we can choose additive genetic and residual components of variance that follow independent inverse gamma distributions.

\[ \tau^{-2}\sim \mathrm{Gamma}(a_\tau,b) \] \[ \sigma^{-2}\sim \mathrm{Gamma}(a_\sigma,b) \]

Using an auxillary random variable, \(U\), which can be defined as \(U =\tau^{-2} + \sigma^{-2}\), we can make the one-to-one transformation

\[ h^2 = f_1(\tau,\sigma) = \frac{\tau^2}{\tau^2 + \sigma^2} = \frac{\sigma^{-2}}{\tau^{-2} + \sigma^{-2}}, \; 0<h^2<1 \]

\[ U = f_2(\tau,\sigma) =\tau^{-2} + \sigma^{-2}, \; U>0 \]

In the case that \(\tau^{-2}\) and \(\sigma^{-2}\) share the same shape parameter, \(b\), the prior distribution of \(h^2\) reduces to the density of a beta distribution with parameters \(a_\sigma\) and \(a_\tau\).

\(h^2\) ranges from 0-1 so we can choose an implicit prior distribution that is uniform over 0 to 1.

\[ h^2 \sim \mathrm{Beta}(a_\tau = 1,a_\sigma=1) \]

Model for genetic correlation

\[\begin{bmatrix} \mathbf{y_1} \\ \mathbf{y_2} \end{bmatrix} \sim \mathrm{N} \Bigg( \begin{bmatrix} \boldsymbol{\mu}_1+\mathbf{u_1} \\ \boldsymbol{\mu}_2+\mathbf{u_2} \end{bmatrix} , \begin{bmatrix} \mathbf{I} \sigma_1^{2} & \rho_\epsilon\mathbf{I} \sigma_1\sigma_2 \\ \rho_\epsilon\mathbf{I}\sigma_1\sigma_2 & \mathbf{I}\sigma_2^{2} \end{bmatrix} \Bigg)\]

where \(\mathbf{y_1}\) and \(\mathbf{y_2}\) contain two phenotypes measured on the same set of n individuals, and \(\textbf{R}\) is an \(n \times n\) genetic relatedness matrix for those individuals.

\[\begin{bmatrix} \mathbf{u_1}\\ \mathbf{u_2} \end{bmatrix} \sim \mathrm{N} \Bigg( \begin{bmatrix} \mathbf{0} \\ \mathbf{0} \end{bmatrix} , \begin{bmatrix} \mathbf{R} \tau_1^{2} & \rho_R\mathbf{R} \tau_1\tau_2 \\ \rho_R\mathbf{R}\tau_1\tau_2 & \mathbf{R}\tau_2^{2} \end{bmatrix} \Bigg)\]

The genetic correlation is defined as the correlation \(\rho_R\) in the bivariate mixed model.

We calculated genetic correlation between traits using the \(\mathbf{u}\) vectors from Gibbs sampling for two different phenotypes according to the model and Bayesian methods as written for heritability.

Some results

… are shown in this poster that I presented during the Gordon Research Seminar/Conference on Quantitative Genetics and Genomics in 2017.

Caveats

Little did I know that INLA does the same thing… faster, better (probably). Oh well – learning!

More background (sorry, I’m long-winded)