Differential Expression Analysis
differential_expression_analysis.Rmd
library(ngstan)
library(posterior)
#> This is posterior version 1.6.1
#>
#> Attaching package: 'posterior'
#> The following objects are masked from 'package:stats':
#>
#> mad, sd, var
#> The following objects are masked from 'package:base':
#>
#> %in%, match
Introduction
In this vignette, we’ll replicate a portion of the example
differential expression analysis from the Harvard Chan
Bioinformatics Core, using data from Kenny,
et al.. The package comes with a filtered version of this data set
(genes that average less than 2 counts per million across samples are
filtered out), called mov10
.
head(mov10)
#> # A tibble: 6 × 9
#> geneid low_1 low_2 high_1 high_2 high_3 control_1 control_2 control_3
#> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 653635 840 511 627 606 327 567 480 418
#> 2 100996442 88 68 80 85 42 87 76 67
#> 3 729737 302 188 233 246 111 214 192 172
#> 4 102723897 970 573 712 685 362 657 540 473
#> 5 100132287 402 265 307 301 144 282 270 212
#> 6 113219467 11902 6792 7325 6808 4589 7004 5619 4263
Below, we’ll set up a differential expression analysis that compares
the Mov10 Overexpression samples (high_1
, etc.) to the
control samples (control_1
, etc.).
Preparing the model inputs
Specifying the experimental design
We represent the experimental design as a model matrix. Since we’re focusing on a single treatment condition (Mov10 Overexpression), this is just a two-column matrix.
# the first three samples are in the treatment condition
mov10_overexpression <- c(1, 1, 1, 0, 0, 0)
X_g <- model.matrix(~mov10_overexpression)
print(X_g)
#> (Intercept) mov10_overexpression
#> 1 1 1
#> 2 1 1
#> 3 1 1
#> 4 1 0
#> 5 1 0
#> 6 1 0
#> attr(,"assign")
#> [1] 0 1
Identifying sharp nulls
If we want to test sharp null hypotheses (i.e. there is literally no difference in expression between treatment and control for gene ), we need to identify the model parameters that are associated with the sharp null hypotheses so that the model uses a mixture prior for those parameters.
We do this with an indicator vector:
mixture_probabilities <- c(1, 0.2)
The order corresponds to the order of columns in the model matrix. We
are not interested in testing whether or not the log-scale intercept is
exactly 0, so the first element of mixture_probabilities
is
0 - the model will not use a mixture prior for the intercept. Since we
would like to test
we use a mixture prior for the
parameter associated with the second column of the model matrix - the
parameter that represents the difference in (log) expression level
between treatment and control.
Running the model
As an aside, this model is huge and will take a long time to sample.
If you have access to a computing cluster, I would recommend setting up
a model fitting and analysis pipeline to run on your cluster. See the
vignette "Computational costs"
for more information about
how to minimize model sampling time and how to use the alternatives to
MCMC sampling.
fit <- do.call(run_hpl_glmm_mix_model, args)