Chapter 13 Marker gene detection

13.1 Motivation

To interpret our clustering results from Chapter ???, we identify the genes that drive separation between clusters. These marker genes allow us to assign biological meaning to each cluster based on their functional annotation. In the most obvious case, the marker genes for each cluster are a priori associated with particular cell types, allowing us to treat the clustering as a proxy for cell type identity. The same principle can be applied to more subtle differences in activation status or differentiation state.

Identification of marker genes is usually based around the retrospective detection of differential expression between clusters. Genes that are more strongly DE are more likely to have driven cluster separation in the first place. Several different statistical tests are available to quantify the differences in expression profiles, and different approaches can be used to consolidate test results into a single ranking of genes for each cluster. These choices parametrize the theroetical differences between the various marker detection strategies presented in this chapter. We will demonstrate using the 10X PBMC dataset:

### loading ###
library(BiocFileCache)
bfc <- BiocFileCache("raw_data", ask = FALSE)
raw.path <- bfcrpath(bfc, file.path("http://cf.10xgenomics.com/samples",
    "cell-exp/2.1.0/pbmc4k/pbmc4k_raw_gene_bc_matrices.tar.gz"))
untar(raw.path, exdir=file.path(tempdir(), "pbmc4k"))

library(DropletUtils)
fname <- file.path(tempdir(), "pbmc4k/raw_gene_bc_matrices/GRCh38")
sce.pbmc <- read10xCounts(fname, col.names=TRUE)

### gene-annotation ###
library(scater)
rownames(sce.pbmc) <- uniquifyFeatureNames(rowData(sce.pbmc)$ID, rowData(sce.pbmc)$Symbol)

library(EnsDb.Hsapiens.v86)
location <- mapIds(EnsDb.Hsapiens.v86, keys=rowData(sce.pbmc)$ID, 
    column="SEQNAME", keytype="GENEID")

### cell-detection ###
set.seed(100)
e.out <- emptyDrops(counts(sce.pbmc))
sce.pbmc <- sce.pbmc[,which(e.out$FDR <= 0.001)]

### quality-control ###
sce.pbmc <- calculateQCMetrics(sce.pbmc, feature_controls=list(Mito=which(location=="MT")))
high.mito <- isOutlier(sce.pbmc$pct_counts_Mito, nmads=3, type="higher")
sce.pbmc <- sce.pbmc[,!high.mito]

### normalization ###
library(scran)
set.seed(1000)
clusters <- quickCluster(sce.pbmc)
sce.pbmc <- computeSumFactors(sce.pbmc, min.mean=0.1, cluster=clusters)
sce.pbmc <- normalize(sce.pbmc)

### feature-selection ###
fit.pbmc <- trendVar(sce.pbmc, use.spikes=FALSE)
dec.pbmc <- decomposeVar(fit=fit.pbmc)
o <- order(dec.pbmc$bio, decreasing=TRUE)
chosen.hvgs <- rownames(dec.pbmc)[head(o, 2000)]

### dimensionality-reduction ###
set.seed(10000)
sce.pbmc <- runPCA(sce.pbmc, feature_set=chosen.hvgs, ncomponents=25,
    BSPARAM=BiocSingular::IrlbaParam())

set.seed(100000)
sce.pbmc <- runTSNE(sce.pbmc, use_dimred="PCA")

set.seed(1000000)
sce.pbmc <- runUMAP(sce.pbmc, use_dimred="PCA")

### clustering ###
g <- buildSNNGraph(sce.pbmc, k=10, use.dimred = 'PCA')
clust <- igraph::cluster_walktrap(g)$membership
sce.pbmc$cluster <- factor(clust)
## class: SingleCellExperiment 
## dim: 33694 3922 
## metadata(1): log.exprs.offset
## assays(2): counts logcounts
## rownames(33694): RP11-34P13.3 FAM138A ... AC213203.1 FAM231B
## rowData names(10): ID Symbol ... total_counts log10_total_counts
## colnames(3922): AAACCTGAGAAGGCCT-1 AAACCTGAGACAGACC-1 ...
##   TTTGTCACAGGTCCAC-1 TTTGTCATCCCAAGAT-1
## colData names(39): Sample Barcode ...
##   pct_counts_in_top_500_features_Mito cluster
## reducedDimNames(3): PCA TSNE UMAP
## spikeNames(0):

13.2 Using pairwise \(t\)-tests

13.2.1 Standard application

The Welch \(t\)-test is an obvious choice of statistical method to test for differences in expression between clusters. It is quickly computed and has good statistical properties for large numbers of cells [soneson2018bias]. We use the findMarkers() function to perform pairwise comparisons between clusters for each gene. This yields a list of DataFrames contained ranked candidate markers for each cluster.

## DataFrameList of length 12
## names(12): 1 2 3 4 5 6 7 8 9 10 11 12

To demonstrate, we use cluster 9 as our cluster of interest for this section. The relevant DataFrame contains log2-fold changes of expression in cluster 9 over each other cluster, along with several statistics obtained by combining \(p\)-values (Simes 1986) across the pairwise comparisons involving 9.

##  [1] "Top"      "p.value"  "FDR"      "logFC.1"  "logFC.2"  "logFC.3" 
##  [7] "logFC.4"  "logFC.5"  "logFC.6"  "logFC.7"  "logFC.8"  "logFC.10"
## [13] "logFC.11" "logFC.12"

Of particular interest is the Top field, which contains the highest17 rank for each gene across all pairwise comparisons involving cluster 9. The set of genes with Top values of 1 contains the gene with the lowest \(p\)-value from each comparison. Similarly, the set of genes with Top values less than or equal to 10 contains the top 10 genes from each comparison. Each DataFrame produced by findMarkers() will order genes based on the Top value.

## DataFrame with 10 rows and 3 columns
##                Top               p.value                   FDR
##          <integer>             <numeric>             <numeric>
## LEF1             1 6.20939421496419e-167 2.09219328679006e-162
## GZMA             1 2.85174646001885e-163 4.80433726119381e-159
## ACAP1            1 6.15083691785694e-163  6.9082099703425e-159
## IGHD             1 7.47621860433727e-135 5.03807419309093e-131
## TGFBI            1  5.05258071032724e-82  4.25604136134411e-79
## FCGR3A           1  2.52077860009527e-70   1.4395782059595e-67
## HLA-DQA1         1  3.97224203642212e-60  1.48711914639119e-57
## SEC61B           1  6.04861464944905e-38  7.25274099638915e-36
## PF4              1  2.35117342678427e-28  1.53231020197426e-26
## LYAR             2 4.83449973651891e-131 2.71489390203784e-127

We use the Top value to identify a set of genes that is guaranteed to distinguish cluster 9 from any other cluster. Here, we examine the top 5 genes from each pairwise comparison (Figure 13.1). Some inspection of the most upregulated genes suggest that cluster 9 contains platelets or their precursors, based on the expression of platelet factor 4 (PF4) and pro-platelet basic protein (PPBP).

Heatmap of log-fold changes for cluster `r chosen` over all other clusters. Colours are capped at -5 and 5 to preserve dynamic range.

Figure 13.1: Heatmap of log-fold changes for cluster r chosen over all other clusters. Colours are capped at -5 and 5 to preserve dynamic range.

13.2.2 Using the log-fold change

Our previous findMarkers() call considers both up- and downregulated genes to be potential markers. However, downregulated genes are less appealing as markers as it is more difficult to interpret and experimentally validate an absence of expression. To focus on up-regulated markers, we can instead perform a one-sided \(t\)-test to identify genes that are upregulated in each cluster compared to the others. This is achieved by setting direction="up" in the findMarkers() call.

## DataFrame with 10 rows and 3 columns
##                 Top              p.value                  FDR
##           <integer>            <numeric>            <numeric>
## PF4               1 1.17558671339212e-28 3.96102187210342e-24
## TMSB4X            2 1.64674064685164e-25 2.77426396775095e-21
## SDPR              2 3.55136790784062e-23 3.98865967622607e-19
## GPX1              3 4.58923373210941e-22 3.50460237357863e-18
## PPBP              3 5.20063271439818e-22 3.50460237357863e-18
## NRGN              4 1.28482173823981e-21 7.21513060804201e-18
## TAGLN2            4  4.3101937823381e-21 2.07468099002999e-17
## GNG11             5 8.00725077665062e-20 3.37245384585583e-16
## HIST1H2AC         6 1.02722989599352e-19 3.84572045728949e-16
## TUBB1             7 1.52910079823675e-19 5.15215222957892e-16

The \(t\)-test also allows us to specify a non-zero log-fold change as the null hypothesis. This allows us to consider the magnitude of the log-fold change in our \(p\)-value calculations, in a manner that is more rigorous than simply filtering directly on the log-fold changes (McCarthy and Smyth 2009). (Specifically, a simple threshold does not consider the variance and can enrich for genes that have both large log-fold changes and large variances.) We perform this by setting lfc= in our findMarkers() call - when combined with direction=, this tests for genes with log-fold changes that are significantly greater than 1:

## DataFrame with 10 rows and 3 columns
##                 Top              p.value                  FDR
##           <integer>            <numeric>            <numeric>
## PF4               1 6.02669690930859e-25 2.03063525662244e-20
## TMSB4X            2 7.97170089366702e-19 8.95328299704057e-15
## SDPR              2 3.73547685699784e-19 6.29315786098425e-15
## GPX1              3 6.07463154377111e-18 4.09357270471648e-14
## PPBP              3 1.33107797557639e-18 1.12123353272677e-14
## NRGN              4 4.05970433364242e-17 2.27979463029579e-13
## TAGLN2            4 2.25018630384581e-16 1.08311110459687e-12
## GNG11             5 5.50609782505896e-16 2.31903075146921e-12
## HIST1H2AC         6 6.53594082728356e-16 2.44691100260547e-12
## TUBB1             7 2.07146248785668e-15 6.97958570658431e-12

These two settings yield a more focused set of candidate marker genes that are upregulated in cluster 9 (Figure ??).

Heatmap of log-fold changes for cluster 9 over all other clusters. Colours are capped at -5 and 5 to preserve dynamic range.

Figure 13.2: Heatmap of log-fold changes for cluster 9 over all other clusters. Colours are capped at -5 and 5 to preserve dynamic range.

Of course, this increased stringency is not without cost. If only upregulated genes are requested from findMarkers(), any cluster defined by downregulation of a marker gene will not contain that gene among the top set of features in its DataFrame. This is occasionally relevant for subtypes or other states that are distinguished by high versus low expression of particular genes18. Similarly, setting an excessively high log-fold change threshold may discard otherwise useful genes. For example, a gene upregulated in a small proportion of cells of a cluster will have a small log-fold change but can still be an effective marker if the focus is on specificity rather than sensitivity.

13.2.3 Finding cluster-specific markers

By default, findMarkers() will give a high ranking to genes that are differentially expressed in any pairwise comparison. This is because a gene only needs a very low \(p\)-value in a single pairwise comparison to achieve a low Top value. A more stringent approach would onlly consider genes that are differentially expressed in all pairwise comparisons involving the cluster of interest. To achieve this, we set pval.type="all" in findMarkers() to use an intersection-union test (Berger and Hsu 1996) where the combined \(p\)-value for each gene is the maximum of the \(p\)-values from all pairwise comparisons. A gene will only achieve a low combined \(p\)-value if it is strongly DE in all comparisons to other clusters.

## DataFrame with 10 rows and 2 columns
##                        p.value                  FDR
##                      <numeric>            <numeric>
## PF4       1.80014139122098e-28 6.06539640357997e-24
## SDPR      3.55136790784062e-23 5.98298951433908e-19
## PPBP      6.48169985592535e-22 7.27981316485164e-18
## NRGN      5.86340715476335e-21 4.93904101681492e-17
## GNG11     8.95542696697954e-20 6.03488312450816e-16
## HIST1H2AC 1.39989539303288e-19   7.360217470827e-16
## TUBB1     1.52910079823675e-19   7.360217470827e-16
## TMSB4X    1.11117788046074e-16 4.68000343803052e-13
## ACRBP     7.90465526790601e-15 2.95932727329806e-11
## TAGLN2    1.13799040801938e-14  3.8343448807805e-11

This strategy will only report genes that are highly specific to the cluster of interest. When it works, it can be highly effective as it generates a small focused set of candidate markers. However, any gene that is expressed at the same level in two or more clusters will simply not be detected. This is likely to discard many interesting genes, especially if the clusters are finely resolved with weak separation. To give a concrete example, consider a mixed population of CD4+-only, CD8+-only, double-positive and double-negative T cells. With pval.type="all", neither Cd4 or Cd8 would be detected as subpopulation-specific markers because each gene is expressed in two subpopulations. In comparison, pval.type="any" will detect both of these genes as they will be DE between at least one pair of subpopulations. This reliability motivates the use of the latter setting as the default in findMarkers().

13.3 Alternative testing regimes

13.3.1 Using the Wilcoxon rank sum test

The Wilcoxon rank sum test (also known as the Wilcoxon-Mann-Whitney test, or WMW test) is another widely used method for pairwise comparisons between groups of observations. Its strength lies in the fact that it directly assesses separation between the expression distributions of different clusters. The WMW test statistic is proportional to the area-under-the-curve (AUC), i.e., the concordance probability, which is the probability of a random cell from one cluster having higher expression than a random cell from another cluster. In a pairwise comparison, AUCs of 1 or 0 indicate that the two clusters have perfectly separated expression distributions. Thus, the WMW test directly addresses the most desirable property of a candidate marker gene, while the \(t\) test only does so indirectly via the difference in the means and the intra-group variance.

We perform WMW tests using the overlapExprs() function. This returns a list of DataFrames containing ranked candidate markers for each cluster. The direction=, lfc= and pval.type= arguments can be specified and have the same interpretation as described for \(t\) tests. We demonstrate below by detecting upregulated genes in each cluster with direction="up".

##  [1] "1"  "2"  "3"  "4"  "5"  "6"  "7"  "8"  "9"  "10" "11" "12"

To explore the results in more detail, we focus on the DataFrame for cluster 9. The interpretation of Top is the same as described for \(t\) tests, and Simes’ method is again used to combine \(p\)-values across pairwise comparisons.

## DataFrame with 10 rows and 3 columns
##                 Top               p.value                   FDR
##           <integer>             <numeric>             <numeric>
## PF4               1  1.0202929049051e-187 3.43777491378731e-183
## TMSB4X            1  1.01281466445067e-33  3.25007402895248e-31
## SDPR              2  3.5648749185145e-184 6.00574477522145e-180
## PPBP              3 4.69409849189378e-163 5.27209848619572e-159
## TUBB1             3  8.5842665088848e-162 7.23095689375918e-158
## GNG11             3 3.27662779920279e-159 2.20805394132683e-155
## NRGN              3 1.42552621019204e-155  8.0052800210352e-152
## CLU               5 1.29715673377618e-149 5.46329987348186e-146
## HIST1H2AC         5 2.48168081125816e-106 5.57451688363549e-103
## TAGLN2            5  2.32628433861945e-34  7.60988587431489e-32

The DataFrame contains the AUCs from comparing cluster 9 to every other cluster (Figure 13.3). A value greater than 0.5 indicates that the gene is upregulated in the current cluster compared to the other cluster, while values less than 0.5 correspond to downregulation. We would typically expect AUCs of 0.7-0.8 for a strongly upregulated candidate marker.

Heatmap of AUCs for cluster `r chosen` over all other clusters.

Figure 13.3: Heatmap of AUCs for cluster r chosen over all other clusters.

The main disadvantage of the WMW test is that the AUCs are much slower to compute compared to \(t\)-statistics. This may be inconvenient for interactive analyses involving multiple iterations of marker detection. We can mitigate this to some extent by parallelizing these calculations using the BPPARAM= argument in overlapExprs().

13.3.2 Using a binomial test

The binomial test identifies genes that differ in the proportion of expressing cells between clusters. (For the purposes of this section, a cell is considered to express a gene simply if it has non-zero expression for that gene.) This represents a much more stringent definition of marker genes compared to the other methods, as differences in expression between clusters are effectively ignored if both distributions of expression values are not near zero. The premise is that genes are more likely to contribute to important biological decisions if they were active in one cluster and silent in another, compared to more subtle “tuning” effects from changing the expression of an active gene. From a practical perspective, a binary measure of presence/absence is easier to validate.

We perform pairwise binomial tests between clusters using the pairwiseBinom() function. This returns a list of DataFrames containing marker statistics for each cluster, such as the Top rank and its \(p\)-value. Here, the effect size is reported as the log-fold change in this proportion between each pair of clusters. Large positive log-fold changes indicate that the gene is more frequently expressed in one cluster compared to the other. Here, we focus on genes that are upregulated in each cluster compared to the others by setting direction="up".

##  [1] "1"  "2"  "3"  "4"  "5"  "6"  "7"  "8"  "9"  "10" "11" "12"
##  [1] "Top"      "p.value"  "FDR"      "logFC.1"  "logFC.2"  "logFC.3" 
##  [7] "logFC.4"  "logFC.5"  "logFC.6"  "logFC.7"  "logFC.8"  "logFC.10"
## [13] "logFC.11" "logFC.12"

Figure (fig:viol-de-binom) confirms that the top genes exhibit strong differences in the proportion of expressing cells in cluster 9 compared to the others.

Distribution of log-normalized expression values for the top 10 DE genes involving cluster 9 with the binomial test, stratified by cluster assignment and coloured by the plate of origin for each cell.

Figure 13.4: Distribution of log-normalized expression values for the top 10 DE genes involving cluster 9 with the binomial test, stratified by cluster assignment and coloured by the plate of origin for each cell.

The disadvantage of the binomial test is that its increased stringency can lead to the loss of good candidate markers. For example, GCG is a known marker for pancreatic alpha cells but is expressed in almost every other cell of the Lawlor et al. (2017) pancreas data (Figure 13.5) and would not be highly ranked by the binomial test.

Distribution of log-normalized expression values for _GCG_ across different pancreatic cell types in GSE86469.

Figure 13.5: Distribution of log-normalized expression values for GCG across different pancreatic cell types in GSE86469.

Another property of the binomial test is that it will not respond to scaling normalization. Systematic differences in library size between clusters will not be considered when computing \(p\)-values or effect sizes. This is not necessarily problematic for marker gene detection - users can treat this as retaining information about the total RNA content, analogous to spike-in normalization.

13.3.3 Using custom DE methods

It is possible to perform marker gene detection based on precomputed DE statistics. This allows us to take advantage of more sophisticated tests in dedicated DE analysis packages. To demonstrate, consider the voom() approach from the limma package (Law et al. 2014). We first process our SingleCellExperiment to obtain a fit object as shown below.

##  [1] "cluster1"  "cluster2"  "cluster3"  "cluster4"  "cluster5" 
##  [6] "cluster6"  "cluster7"  "cluster8"  "cluster9"  "cluster10"
## [11] "cluster11" "cluster12"
##    Mode   FALSE    TRUE 
## logical   29387    4307

We then perform pairwise comparisons between clusters using the TREAT strategy (McCarthy and Smyth 2009) to test for log-fold changes that are significantly greater than 0.5. For each comparison, we store the corresponding data frame of statistics in all.results, along with the identities of the clusters involved in all.pairs.

These custom results are consolidated into a single marker list for each cluster with the combineMarkers() function. This combines test statistics across all pairwise comparisons involving a single cluster, yielding a per-cluster DataFrame that can be interpreted in the same manner as discussed previously.

##  [1] "Top"             "p.value"         "FDR"            
##  [4] "logFC.cluster1"  "logFC.cluster2"  "logFC.cluster3" 
##  [7] "logFC.cluster4"  "logFC.cluster5"  "logFC.cluster6" 
## [10] "logFC.cluster7"  "logFC.cluster8"  "logFC.cluster10"
## [13] "logFC.cluster11" "logFC.cluster12"
## DataFrame with 6 rows and 3 columns
##                Top   p.value       FDR
##          <integer> <numeric> <numeric>
## PDZK1IP1         1         0         0
## RGS18            1         0         0
## SLC40A1          1         0         0
## SDPR             1         0         0
## PF4              1         0         0
## PPBP             1         0         0

By default, we do not use custom DE methods to perform marker detection, for several reasons. Many of these methods rely on empirical Bayes shrinkage to share information across genes in the presence of limited replication. However, this is unnecessary when one treats the many cells in a group as “replicates” of each other (see Section ???). These methods also make stronger assumptions about the data (e.g., equal variances for linear models, the distribution of variances during empirical Bayes) that are more likely to be violated in noisy scRNA-seq contexts. From a practical perspective, they require more work to set up and take more time to run. Nonetheless, some custom methods (e.g., MAST) may provide a useful point of difference from the simpler tests, in which case they can be converted into a marker detection scheme as described above.

13.4 Handling blocking factors

13.5 Using the block= argument

Large studies may contain factors of variation that are known and not interesting (e.g., batch effects, sex differences). If these are not modelled, they can interfere with marker gene detection - most obviously by inflating the variance within each cluster, but also by distorting the log-fold changes if the cluster composition varies across levels of the blocking factor. To avoid these issues, we set the block= argument in the findMarkers() call, as demonstrated below for the 416B data set.

### loading ###
library(scRNAseq)
sce.416b <- LunSpikeInData(which="416b") 

### gene-annotation ###
library(AnnotationHub)
ens.mm.v97 <- AnnotationHub()[["AH73905"]]
rowData(sce.416b)$ENSEMBL <- rownames(sce.416b)
rowData(sce.416b)$SYMBOL <- mapIds(ens.mm.v97, keys=rownames(sce.416b),
    keytype="GENEID", column="SYMBOL")
rowData(sce.416b)$SEQNAME <- mapIds(ens.mm.v97, keys=rownames(sce.416b),
    keytype="GENEID", column="SEQNAME")

library(scater)
rownames(sce.416b) <- uniquifyFeatureNames(rowData(sce.416b)$ENSEMBL, 
    rowData(sce.416b)$SYMBOL)

### quality-control ###
mito <- which(rowData(sce.416b)$SEQNAME=="MT")
sce.416b <- calculateQCMetrics(sce.416b, feature_controls=list(Mt=mito))

combined <- paste0(sce.416b$block, ":", sce.416b$phenotype)
libsize.drop <- isOutlier(sce.416b$total_counts, nmads=3, type="lower", 
    log=TRUE, batch=combined)
feature.drop <- isOutlier(sce.416b$total_features_by_counts, nmads=3, 
    type="lower", log=TRUE, batch=combined)
spike.drop <- isOutlier(sce.416b$pct_counts_ERCC, nmads=3, type="higher",
    batch=combined)

keep <- !(libsize.drop | feature.drop | spike.drop)
sce.416b <- sce.416b[,keep]

### normalization ###
library(scran)
sce.416b <- computeSumFactors(sce.416b)
sce.416b <- computeSpikeFactors(sce.416b, general.use=FALSE)
sce.416b <- normalize(sce.416b)

### variance-modelling ###
fit.416b <- trendVar(sce.416b, block=sce.416b$block)
dec.416b <- decomposeVar(sce.416b, fit.416b)

# Taking all of the genes with positive biological components.
chosen.hvgs <- rownames(dec.416b)[dec.416b$bio > 0]

### batch-correction ###
# Composition of the plates is expected to be the same,
# hence the use of a simple 'removeBatchEffect'.
library(limma)
assay(sce.416b, "corrected") <- removeBatchEffect(logcounts(sce.416b), 
    design=model.matrix(~sce.416b$phenotype), batch=sce.416b$block)

### dimensionality-reduction ###
sce.416b <- denoisePCA(sce.416b, technical=dec.416b, 
    subset.row=chosen.hvgs, assay.type="corrected")

set.seed(1010)
sce.416b <- runTSNE(sce.416b, use_dimred="PCA", perplexity=10)

### clustering ###
my.dist <- dist(reducedDim(sce.416b, "PCA"))
my.tree <- hclust(my.dist, method="ward.D2")

library(dynamicTreeCut)
my.clusters <- unname(cutreeDynamic(my.tree, distM=as.matrix(my.dist),
    minClusterSize=10, verbose=0))
sce.416b$cluster <- factor(my.clusters)

For each gene, each pairwise comparion between clusters is performed separately in each level of the blocking factor - in this case, the plate of origin. The function will then combine \(p\)-values from different plates using Stouffer’s Z method to obtain a single \(p\)-value per pairwise comparison. (These \(p\)-values are further combined across comparisons to obtain a single \(p\)-value per gene, using either Simes’ method or an intersection-union test depending on the value of pval.type=.) This approach favours genes that exhibit consistent DE in the same direction in each plate.

## DataFrame with 24 rows and 3 columns
##                          Top              p.value                  FDR
##                    <integer>            <numeric>            <numeric>
## Cenpk                      1 3.64842075891759e-38 1.70031001048595e-33
## Uhrf1                      1 1.27436092404157e-31 2.96951582520165e-27
## Pkp4                       1 2.10780869575213e-27 1.63720527428053e-23
## ENSMUSG00000082279         1 1.62145081562279e-23 5.03697869568323e-20
## Hbb-bt                     1 1.47303160270372e-21 2.14528640038763e-18
## ...                      ...                  ...                  ...
## Rfc5                       5 1.43735740166193e-26 8.37332554338155e-23
## Elp4                       5 1.85254656015266e-23 5.07859293466794e-20
## ENSMUSG00000082043         5 4.23815536237289e-19 3.46517530715835e-16
## Cxcr5                      5 2.68967679921673e-15 9.35445504109675e-13
## Kcnk1                      5  7.8504116345369e-13 1.35880722663217e-10

The block= argument works with all tests shown above and is robust to difference in the log-fold changes or variance between batches. However, it assumes that each pair of clusters is present in at least one batch. In scenarios where cells from two clusters never co-occur in the same batch, the comparison will be impossible and NAs will be reported in the output.

13.6 Using the design= argument

Another approach is to define a design matrix containing the batch of origin as the sole factor. findMarkers() will then fit a linear model to the log-expression values, similar to the use of limma for bulk RNA sequencing data (Ritchie et al. 2015). This handles situations where multiple batches contain unique clusters, as comparisons can be implicitly performed via shared cell types in each batch. There is also a slight increase in power when information is shared across clusters for variance estimation.

## DataFrame with 18 rows and 3 columns
##                          Top              p.value                  FDR
##                    <integer>            <numeric>            <numeric>
## ENSMUSG00000081664         1 1.91285355952216e-41 8.91466272879706e-37
## Pmf1                       1 1.12621582896182e-36  1.3121540623234e-32
## Ccna2                      1  1.4320290224626e-35 1.11230467604745e-31
## Hbb-bt                     1   1.144389533167e-28 1.90475463584696e-25
## ENSMUSG00000082610         1 9.85538881311309e-27 1.17769369293929e-23
## ...                      ...                  ...                  ...
## Tuba8                      4 1.12209603758146e-17 3.09432921511517e-15
## Top2a                      5 2.69761684182245e-36 2.51439470592585e-32
## ENSMUSG00000084220         5 1.41054844370574e-29 2.98805453047556e-26
## Mfsd2b                     5 1.77368012038614e-28 2.85036511484399e-25
## ENSMUSG00000084340         5 3.14925058189259e-13  4.0882360478697e-11

The use of a linear model makes some strong assumptions, necessitating some caution when interpreting the results. If the batch effect is not consistent across clusters, the variance will be inflated and the log-fold change estimates will be distorted. Variances are also assumed to be equal across groups, which is not true in general. In particular, the presence of clusters in which a gene is silent will shrink the residual variance towards zero, preventing the model from penalizing genes with high variance in other clusters. Thus, we generally recommend the use of block= where possible.

13.7 Invalidity of \(p\)-values

13.7.1 From data snooping

All of our DE strategies for detecting marker genes between clusters are statistically flawed to some extent. The DE analysis is performed on the same data used to obtain the clusters, which represents “data dredging” (also known as fishing or data snooping). The hypothesis of interest - that are there differences between clusters? - is formulated from the data, so we are more likely to get a positive result when we re-use the data set to test that hypothesis.

The practical effect of data dredging is best illustrated with a simple simulation. We simulate i.i.d. normal values, perform \(k\)-means clustering and test for DE between clusters of cells with findMarkers(). The resulting distribution of \(p\)-values is heavily skewed towards low values (Figure 13.6). Thus, we can detect “significant” differences between clusters even in the absence of any real substructure in the data. This effect arises from the fact that clustering, by definition, yields groups of cells that are separated in expression space. Testing for DE genes between clusters will inevitably yield some significant results as that is how the clusters were defined.

Distribution of $p$-values from a DE analysis between two clusters in a simulation with no true subpopulation structure.

Figure 13.6: Distribution of \(p\)-values from a DE analysis between two clusters in a simulation with no true subpopulation structure.

For marker gene detection, this effect is largely harmless as the \(p\)-values are used only for ranking. However, it becomes an issue when the \(p\)-values are used to define “significant differences” between clusters with respect to an error rate threshold. Meaningful interpretation of error rates require consideration of the long-run behaviour, i.e., the rate of incorrect rejections if the experiment were repeated many times. The concept of statistical significance for differences between clusters is not applicable if clusters and their interpretations are not stably reproducible across (hypothetical) replicate experiments.

13.7.2 Nature of replication

The naive application of DE analysis methods will treat counts from the same cluster of cells as replicate observations. This is not the most relevant level of replication when cells are derived from the same biological sample (i.e., cell culture, animal or patient). DE analyses that treat cells as replicates fail to properly model the sample-to-sample variability (A. T. L. Lun and Marioni 2017). The latter is arguably the more important level of replication as different samples will necessarily be generated if the experiment is to be replicated. Indeed, the use of cells as replicates only masks the fact that the sample size is actually one in an experiment involving a single biological sample. This reinforces the inappropriateness of using the marker gene \(p\)-values to make statements about statistical inference.

13.8 Session Info

R version 3.6.0 (2019-04-26)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 14.04.6 LTS

Matrix products: default
BLAS/LAPACK: /app/easybuild/software/OpenBLAS/0.2.18-GCC-5.4.0-2.26-LAPACK-3.6.1/lib/libopenblas_prescottp-r0.2.18.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices utils     datasets 
[8] methods   base     

other attached packages:
 [1] limma_3.41.15               scater_1.13.9              
 [3] ggplot2_3.2.0               pheatmap_1.0.12            
 [5] scran_1.13.9                SingleCellExperiment_1.7.0 
 [7] SummarizedExperiment_1.15.5 DelayedArray_0.11.4        
 [9] BiocParallel_1.19.0         matrixStats_0.54.0         
[11] Biobase_2.45.0              GenomicRanges_1.37.14      
[13] GenomeInfoDb_1.21.1         IRanges_2.19.10            
[15] S4Vectors_0.23.17           BiocGenerics_0.31.5        
[17] BiocStyle_2.13.2            Cairo_1.5-10               

loaded via a namespace (and not attached):
 [1] viridis_0.5.1            dynamicTreeCut_1.63-1   
 [3] edgeR_3.27.9             BiocSingular_1.1.5      
 [5] viridisLite_0.3.0        DelayedMatrixStats_1.7.1
 [7] assertthat_0.2.1         statmod_1.4.32          
 [9] BiocManager_1.30.4       dqrng_0.2.1             
[11] GenomeInfoDbData_1.2.1   vipor_0.4.5             
[13] yaml_2.2.0               pillar_1.4.2            
[15] lattice_0.20-38          glue_1.3.1              
[17] digest_0.6.20            RColorBrewer_1.1-2      
[19] XVector_0.25.0           colorspace_1.4-1        
[21] htmltools_0.3.6          Matrix_1.2-17           
[23] pkgconfig_2.0.2          bookdown_0.12           
[25] zlibbioc_1.31.0          purrr_0.3.2             
[27] scales_1.0.0             tibble_2.1.3            
[29] withr_2.1.2              lazyeval_0.2.2          
[31] magrittr_1.5             crayon_1.3.4            
[33] evaluate_0.14            beeswarm_0.2.3          
[35] tools_3.6.0              stringr_1.4.0           
[37] munsell_0.5.0            locfit_1.5-9.1          
[39] irlba_2.3.3              compiler_3.6.0          
[41] rsvd_1.0.2               rlang_0.4.0             
[43] grid_3.6.0               RCurl_1.95-4.12         
[45] BiocNeighbors_1.3.3      igraph_1.2.4.1          
[47] bitops_1.0-6             rmarkdown_1.14          
[49] gtable_0.3.0             R6_2.4.0                
[51] gridExtra_2.3            knitr_1.23              
[53] dplyr_0.8.3              stringi_1.4.3           
[55] ggbeeswarm_0.6.0         Rcpp_1.0.2              
[57] tidyselect_0.2.5         xfun_0.8                

Bibliography

Berger, R. L., and J. C. Hsu. 1996. “Bioequivalence Trials, Intersection-Union Tests and Equivalence Confidence Sets.” Statist. Sci. 11 (4). The Institute of Mathematical Statistics:283–319. https://doi.org/10.1214/ss/1032280304.

Law, C. W., Y. Chen, W. Shi, and G. K. Smyth. 2014. “voom: Precision weights unlock linear model analysis tools for RNA-seq read counts.” Genome Biol. 15 (2):R29.

Lawlor, N., J. George, M. Bolisetty, R. Kursawe, L. Sun, V. Sivakamasundari, I. Kycia, P. Robson, and M. L. Stitzel. 2017. “Single-cell transcriptomes identify human islet cell signatures and reveal cell-type-specific expression changes in type 2 diabetes.” Genome Res. 27 (2):208–22.

Lun, A. T. L., and J. C. Marioni. 2017. “Overcoming confounding plate effects in differential expression analyses of single-cell RNA-seq data.” Biostatistics 18 (3):451–64.

McCarthy, D. J., and G. K. Smyth. 2009. “Testing significance relative to a fold-change threshold is a TREAT.” Bioinformatics 25 (6):765–71.

Ritchie, M. E., B. Phipson, D. Wu, Y. Hu, C. W. Law, W. Shi, and G. K. Smyth. 2015. “limma powers differential expression analyses for RNA-sequencing and microarray studies.” Nucleic Acids Res. 43 (7):e47.

Simes, R. J. 1986. “An Improved Bonferroni Procedure for Multiple Tests of Significance.” Biometrika 73 (3):751–54.


  1. That is, the lowest rank value. So 1st is 1, and 2nd is 2, so 1 is higher than 2. Geez.

  2. Standard operating procedure is to (i) experience a brief but crushing bout of disappointment due to the poor quality of upregulated candidate markers, (ii) rage-quit, and (iii) remember to check the genes that are changing in the other direction.