Differential Expression Analysis • ngstan

library(ngstan)
library(posterior)
#> This is posterior version 1.6.1
#> 
#> Attaching package: 'posterior'
#> The following objects are masked from 'package:stats':
#> 
#>     mad, sd, var
#> The following objects are masked from 'package:base':
#> 
#>     %in%, match

Introduction

In this vignette, we’ll replicate a portion of the example differential expression analysis from the Harvard Chan Bioinformatics Core, using data from Kenny, et al.. The package comes with a filtered version of this data set (genes that average less than 2 counts per million across samples are filtered out), called mov10.

head(mov10)
#> # A tibble: 6 × 9
#>      geneid low_1 low_2 high_1 high_2 high_3 control_1 control_2 control_3
#>       <dbl> <dbl> <dbl>  <dbl>  <dbl>  <dbl>     <dbl>     <dbl>     <dbl>
#> 1    653635   840   511    627    606    327       567       480       418
#> 2 100996442    88    68     80     85     42        87        76        67
#> 3    729737   302   188    233    246    111       214       192       172
#> 4 102723897   970   573    712    685    362       657       540       473
#> 5 100132287   402   265    307    301    144       282       270       212
#> 6 113219467 11902  6792   7325   6808   4589      7004      5619      4263

Below, we’ll set up a differential expression analysis that compares the Mov10 Overexpression samples (high_1, etc.) to the control samples (control_1, etc.).

Preparing the model inputs

Specifying the experimental design

We represent the experimental design as a model matrix. Since we’re focusing on a single treatment condition (Mov10 Overexpression), this is just a two-column matrix.

# the first three samples are in the treatment condition
mov10_overexpression <- c(1, 1, 1, 0, 0, 0)
X_g <- model.matrix(~mov10_overexpression)
print(X_g)
#>   (Intercept) mov10_overexpression
#> 1           1                    1
#> 2           1                    1
#> 3           1                    1
#> 4           1                    0
#> 5           1                    0
#> 6           1                    0
#> attr(,"assign")
#> [1] 0 1

Identifying sharp nulls

If we want to test sharp null hypotheses (i.e. there is literally no difference in expression between treatment and control for gene $g$ ), we need to identify the model parameters that are associated with the sharp null hypotheses so that the model uses a mixture prior for those parameters.

We do this with an indicator vector:

mixture_probabilities <- c(1, 0.2)

The order corresponds to the order of columns in the model matrix. We are not interested in testing whether or not the log-scale intercept is exactly 0, so the first element of mixture_probabilities is 0 - the model will not use a mixture prior for the intercept. Since we would like to test $H_0^g: \text{there is no difference in expression level between treatment and control for gene } g,$ we use a mixture prior for the parameter associated with the second column of the model matrix - the parameter that represents the difference in (log) expression level between treatment and control.

Arguments list

args <- list(
  y = as.matrix(mov10[, 4:ncol(mov10)]), # don't want geneid
  G = nrow(mov10), # number of genes / tags
  X_g = X_g,
  Z_g = matrix(data = numeric(), ncol = 0),
  mixture_probabilities = mixture_probabilities
)

Running the model

As an aside, this model is huge and will take a long time to sample. If you have access to a computing cluster, I would recommend setting up a model fitting and analysis pipeline to run on your cluster. See the vignette "Computational costs" for more information about how to minimize model sampling time and how to use the alternatives to MCMC sampling.

fit <- do.call(run_hpl_glmm_mix_model, args)

Analyzing the model output

draws <- fit$draws()