# Chapter 12 Cell type annotation

## 12.1 Motivation

The most challenging task in scRNA-seq data analysis is arguably the interpretation of the results. Obtaining clusters of cells is fairly straightforward, but it is more difficult to determine what biological state is represented by each of those clusters. Doing so requires us to bridge the gap between the current dataset and prior biological knowledge, and the latter is not always available in a consistent and quantitative manner15. Indeed, even the concept of a “cell type” is not clearly defined, with most practitioners possessing a “I’ll know it when I see it” intuition that is not amenable to computational analysis. As such, intepretation of scRNA-seq data is often manual and a common bottleneck in the analysis workflow.

To expedite this step, we can use various computational approaches that exploit prior information to assign meaning to an uncharacterized scRNA-seq dataset. The most obvious sources of prior information are the curated gene sets associated with particular biological processes, e.g., from the Gene Ontology (GO) or the Kyoto Encyclopedia of Genes and Genomes (KEGG) collections. Alternatively, we can directly compare our expression profiles to published reference datasets where each sample or cell has already been annotated with its putative biological state by domain experts. Here, we will demonstrate both approaches with several different scRNA-seq datasets.

## 12.2 Assigning cell labels from reference data

### 12.2.1 Overview

A conceptually straightforward annotation approach is to compare the single-cell expression profiles with previously annotated reference datasets. Labels can then be assigned to each cell in our uncharacterized test dataset based on the most similar reference sample(s), for some definition of “similar”. This is a standard classification challenge that can be tackled by standard machine learning techniques such as random forests and support vector machines. Any published and labelled RNA-seq dataset (bulk or single-cell) can be used as a reference, though its reliability depends greatly on the expertise of the original authors who assigned the labels in the first place.

In this section, we will demonstrate the use of the SingleR method (Aran et al. 2019) for cell type annotation. This method assigns labels to cells based on the reference samples with the highest Spearman rank correlations, and thus can be considered a rank-based variant of $$k$$-nearest-neighbor classification. To reduce noise, SingleR identifies marker genes between pairs of labels and computes the correlation using only those markers. It also performs a fine-tuning step for each cell where calculation of the correlations is repeated with just the marker genes for the top-scoring labels. This aims to resolve any ambiguity between those labels by removing noise from irrelevant markers for other labels.

### 12.2.2 Using the in-built references

SingleR contains a number of built-in reference datasets, mostly assembled from bulk RNA-seq or microarray data of sorted cell types. These built-in references are often good enough for most applications, provided that they contain the cell types that are expected in the test population. We will demonstrate on the 10X PBMC dataset using a reference constructed from Blueprint and ENCODE data (Martens and Stunnenberg 2013; The ENCODE Project Consortium 2012).

### loading ###
library(BiocFileCache)
bfc <- BiocFileCache("raw_data", ask = FALSE)
raw.path <- bfcrpath(bfc, file.path("http://cf.10xgenomics.com/samples",
"cell-exp/2.1.0/pbmc4k/pbmc4k_raw_gene_bc_matrices.tar.gz"))
untar(raw.path, exdir=file.path(tempdir(), "pbmc4k"))

library(DropletUtils)
fname <- file.path(tempdir(), "pbmc4k/raw_gene_bc_matrices/GRCh38")

### gene-annotation ###
library(scater)
rownames(sce.pbmc) <- uniquifyFeatureNames(
rowData(sce.pbmc)$ID, rowData(sce.pbmc)$Symbol)

library(EnsDb.Hsapiens.v86)
location <- mapIds(EnsDb.Hsapiens.v86, keys=rowData(sce.pbmc)$ID, column="SEQNAME", keytype="GENEID") ### cell-detection ### set.seed(100) e.out <- emptyDrops(counts(sce.pbmc)) sce.pbmc <- sce.pbmc[,which(e.out$FDR <= 0.001)]

### quality-control ###
stats <- perCellQCMetrics(sce.pbmc, subsets=list(Mito=which(location=="MT")))
high.mito <- isOutlier(stats$subsets_Mito_percent, nmads=3, type="higher") sce.pbmc <- sce.pbmc[,!high.mito] ### normalization ### library(scran) set.seed(1000) clusters <- quickCluster(sce.pbmc) sce.pbmc <- computeSumFactors(sce.pbmc, min.mean=0.1, cluster=clusters) sce.pbmc <- logNormCounts(sce.pbmc) ### variance-modelling ### set.seed(1001) dec.pbmc <- modelGeneVarByPoisson(sce.pbmc) ### dimensionality-reduction ### set.seed(10000) sce.pbmc <- denoisePCA(sce.pbmc, technical=dec.pbmc, BSPARAM=BiocSingular::IrlbaParam()) set.seed(100000) sce.pbmc <- runTSNE(sce.pbmc, use_dimred="PCA") set.seed(1000000) sce.pbmc <- runUMAP(sce.pbmc, use_dimred="PCA") ### clustering ### g <- buildSNNGraph(sce.pbmc, k=10, use.dimred = 'PCA') clust <- igraph::cluster_walktrap(g)$membership
sce.pbmc$cluster <- factor(clust) sce.pbmc ## class: SingleCellExperiment ## dim: 33694 3922 ## metadata(0): ## assays(2): counts logcounts ## rownames(33694): RP11-34P13.3 FAM138A ... AC213203.1 FAM231B ## rowData names(2): ID Symbol ## colnames(3922): AAACCTGAGAAGGCCT-1 AAACCTGAGACAGACC-1 ... ## TTTGTCACAGGTCCAC-1 TTTGTCATCCCAAGAT-1 ## colData names(3): Sample Barcode cluster ## reducedDimNames(3): PCA TSNE UMAP ## spikeNames(0): ## altExpNames(0): We label our PBMCs using the SingleR() function with the main cell type labels in the reference. This returns a DataFrame where each row corresponds to a cell in the test dataset and contains its label assignments. Alternatively, we could use the labels in ref$label.fine, which provide more resolution at the cost of speed and increased ambiguity in the assignments.

library(SingleR)
ref <- BlueprintEncodeData()
pred <- SingleR(test=sce.pbmc, ref=ref, labels=ref$label.main) table(pred$labels)
##
##      B-cells CD4+ T-cells CD8+ T-cells           DC  Eosinophils
##          524          733         1293            1            1
## Erythrocytes          HSC    Monocytes     NK cells
##            6           16         1113          235

We inspect the results using a heatmap of the per-cell and label scores (Figure 12.1). Ideally, each cell should exhibit a high score in one label relative to all of the others, indicating that the assignment to that label was unambiguous. This is largely the case for monocytes and B cells, whereas we see more ambiguity between CD4+ and CD8+ T cells (and to a lesser extent, NK cells).

plotScoreHeatmap(pred)

SingleR() will attempt to prune out low-quality assignments by marking them as NA. This is done based on the difference $$\Delta_{med}$$ of the assigned label’s score from the median score within each cell. Small $$\Delta_{med}$$ values indicate that the cell assignment was so uncertain that the reported label is not much better than the bulk of other labels in the reference. We set a minimum threshold on the acceptable $$\Delta_{med}$$ using an outlier-based approach for each label, where labels with $$\Delta_{med}$$ that are substantially lower than the majority of values for a given label are marked as NA (Figure 12.2). If necessary, more control over the pruning can be achieved by supplying custom parameters to the pruneScores() function.

sum(is.na(pred$pruned.labels)) ## [1] 78 plotScoreDistribution(pred) We compare the assignments with the clustering results to determine the identity of each cluster. Ideally, clusters and labels would have a 1:1 relationship, though some nesting is likely depending on the resolution of the clustering algorithm. For example, several clusters are nested within the monocyte and B cell labels, suggesting the the former represent finer subdivisions within the latter. Interestingly, our clustering does not effectively distinguish between CD4+ and CD8+ T cell labels. We attribute this to the presence of other factors of heterogeneity within the T cell subpopulation (specifically, naive versus stimulated) that have a stronger influence on unsupervised methods than the a priori expected CD4/CD8 distinction. table(Assigned=pred$pruned.labels, Cluster=sce.pbmc$cluster) ## Cluster ## Assigned 1 2 3 4 5 6 7 8 9 10 11 12 13 ## B-cells 0 0 0 0 1 502 0 0 0 0 0 0 14 ## CD4+ T-cells 0 0 1 1 229 0 0 417 0 76 0 2 0 ## CD8+ T-cells 1 5 0 501 280 10 0 407 0 74 0 2 0 ## DC 0 0 0 0 0 0 1 0 0 0 0 0 0 ## Eosinophils 0 0 0 0 0 0 0 0 1 0 0 0 0 ## Erythrocytes 0 0 0 0 0 0 0 0 6 0 0 0 0 ## HSC 0 0 0 0 9 0 0 0 0 0 0 1 6 ## Monocytes 779 0 55 0 0 0 127 0 0 0 92 0 16 ## NK cells 0 190 0 36 1 1 0 0 0 0 0 0 0 This episode highlights some of the differences between reference-based annotation and unsupervised clustering. The former explicitly focuses on aspects of the data that are known to be interesting, simplifying the process of biological interpretation. However, the cost is that the downstream analysis is restricted by the diversity and resolution of the available labels. We suggest applying both strategies and, if major disagreements are present between reference label and cluster assignments, using those discrepancies as the basis for further investigation to discover novel effects. ### 12.2.3 Using custom references It is also straightforward to apply SingleR to user-supplied reference datasets. For bulk references, the same code shown above can be used, while some additional work is required for single-cell references. To illustrate, we will use the Muraro et al. (2016) human pancreas dataset as our reference. ### loading ### library(scRNAseq) sce.muraro <- MuraroPancreasData() ### gene-annotation ### library(AnnotationHub) edb <- AnnotationHub()[["AH73881"]] gene.symb <- sub("__chr.*$", "", rownames(sce.muraro))
gene.ids <- mapIds(edb, keys=gene.symb,
keytype="SYMBOL", column="GENEID")

# Removing duplicated genes or genes without Ensembl IDs.
keep <- !is.na(gene.ids) & !duplicated(gene.ids)
sce.muraro <- sce.muraro[keep,]
rownames(sce.muraro) <- gene.ids[keep]

### quality-control ###
library(scater)
stats <- perCellQCMetrics(sce.muraro)
sce.muraro <- sce.muraro[,!qc$discard] ### normalization ### library(scran) set.seed(1000) clusters <- quickCluster(sce.muraro) sce.muraro <- computeSumFactors(sce.muraro, min.mean=0.1, clusters=clusters) sce.muraro <- logNormCounts(sce.muraro) sce.muraro ## class: SingleCellExperiment ## dim: 16940 2346 ## metadata(0): ## assays(2): counts logcounts ## rownames(16940): ENSG00000268895 ENSG00000121410 ... ## ENSG00000159840 ENSG00000074755 ## rowData names(2): symbol chr ## colnames(2346): D28-1_1 D28-1_2 ... D30-8_93 D30-8_94 ## colData names(3): label donor plate ## reducedDimNames(0): ## spikeNames(0): ## altExpNames(1): ERCC sce.muraro <- sce.muraro[,!is.na(sce.muraro$label) &
sce.muraro$label!="unclear"] table(sce.muraro$label)
##
##      acinar       alpha        beta       delta        duct endothelial
##         218         803         446         191         242          20
##     epsilon mesenchymal          pp
##           3          80          99

We use methods from scran to generate marker sets for every reference label (Chapter 28.2.8). Here, we use the pairwiseWilcox() and getTopMarkers() functions to take the top 10 upregulated genes from each pairwise comparison between reference labels, using the Wilcoxon test with a log-fold change threshold of 1. This enables us to select a custom set of marker genes that is both discriminative and concise, which is faster to run and often more effective for single-cell references than the default marker set that is auto-generated inside SingleR(). (We could also use this custom approach for our bulk references, but the default is more convenient and often works well enough.)

library(scran)
pairwise.muraro <- pairwiseWilcox(logcounts(sce.muraro),
sce.muraro$label, direction="up", lfc=1) muraro.markers <- getTopMarkers(pairwise.muraro$statistics,
pairwise.muraro$pairs, n=10) muraro.markers ## List of length 9 ## names(9): acinar alpha beta delta duct endothelial epsilon mesenchymal pp We use these newly identified markers in SingleR() to assign labels to our test dataset from Segerstolpe et al. (2016). As it so happens, we are in the fortunate position where our test dataset also contains independently defined labels. We see strong consistency between the two sets of labels (Figure ??), indicating that our automatic annotation is comparable to that generated manually by domain experts. ### loading ### library(scRNAseq) sce.seger <- SegerstolpePancreasData() ### gene-annotation ### library(AnnotationHub) edb <- AnnotationHub()[["AH73881"]] symbols <- rowData(sce.seger)$symbol
ens.id <- mapIds(edb, keys=symbols, keytype="SYMBOL", column="GENEID")
ens.id <- ifelse(is.na(ens.id), symbols, ens.id)

# Removing duplicated rows.
keep <- !duplicated(ens.id)
sce.seger <- sce.seger[keep,]
rownames(sce.seger) <- ens.id[keep]

### sample-annotation ###
emtab.meta <- colData(sce.seger)[,c("cell type",
"individual", "single cell well quality")]
colnames(emtab.meta) <- c("CellType", "Donor", "Quality")
colData(sce.seger) <- emtab.meta

sce.seger$CellType <- gsub(" cell", "", sce.seger$CellType)
sce.seger$CellType <- paste0( toupper(substr(sce.seger$CellType, 1, 1)),
substring(sce.seger$CellType, 2)) ### quality-control ### low.qual <- sce.seger$Quality == "low quality cell"

library(scater)
stats <- perCellQCMetrics(sce.seger)
sce.seger <- sce.seger[,!(qc$discard | low.qual)] ### normalization ### library(scran) clusters <- quickCluster(sce.seger) sce.seger <- computeSumFactors(sce.seger, clusters=clusters) sce.seger <- logNormCounts(sce.seger, use_altexps=FALSE) pred.seger <- SingleR(test=sce.seger, ref=sce.muraro, labels=sce.muraro$label, genes=muraro.markers)
table(pred.seger$labels) ## ## acinar alpha beta delta duct endothelial ## 193 897 301 110 392 17 ## epsilon mesenchymal pp ## 7 54 198 tab <- table(pred.seger$pruned.labels, sce.seger$CellType) library(pheatmap) pheatmap(log2(tab+1), color=colorRampPalette(c("white", "blue"))(101)) An interesting question is - given a single-cell reference dataset, is it better to use it directly or convert it to pseudo-bulk values? A single-cell reference preserves the “shape” of the subpopulation in high-dimensional expression space, potentially yielding more accurate predictions when the differences between labels are subtle (or at least capturing ambiguity more accurately to avoid grossly incorrect predictions). However, it also requires more computational work to assign each cell in the test dataset. We tend to prefer using a single-cell reference directly when one is available, though it is unlikely to make much difference when the labels are well-separated. ## 12.3 Assigning cell labels from gene sets A related strategy is to explicitly identify sets of marker genes that are highly expressed in each individual cell. This does not require matching of individual cells to the expression values of the reference dataset, which is faster and more convenient when only the identities of the markers are available. We demonstrate this approach using neuronal cell type markers derived from the Zeisel et al. (2015) study. ### loading ### library(scRNAseq) sce.zeisel <- ZeiselBrainData() sce.zeisel <- sce.zeisel[rowData(sce.zeisel)$featureType!="repeat",]

library(scater)
sce.zeisel <- aggregateAcrossFeatures(sce.zeisel,
id=sub("_loc[0-9]+$", "", rownames(sce.zeisel))) ### gene-annotation ### library(org.Mm.eg.db) ensembl <- mapIds(org.Mm.eg.db, keys=rownames(sce.zeisel), keytype="SYMBOL", column="ENSEMBL") rowData(sce.zeisel)$ENSEMBL <- ensembl

### quality-control ###
stats <- perCellQCMetrics(sce.zeisel)
sce.zeisel <- sce.zeisel[,!qc$discard] ### normalization ### library(scran) set.seed(1000) clusters <- quickCluster(sce.zeisel) sce.zeisel <- computeSumFactors(sce.zeisel, cluster=clusters, min.mean=0.1) sce.zeisel <- logNormCounts(sce.zeisel) library(scran) wilcox.z <- pairwiseWilcox(logcounts(sce.zeisel), sce.zeisel$level1class, lfc=1, direction="up")
markers.z <- getTopMarkers(wilcox.z$statistics, wilcox.z$pairs,
pairwise=FALSE, n=50)
lengths(markers.z)
## astrocytes_ependymal    endothelial-mural         interneurons
##                   74                   88                  121
##            microglia     oligodendrocytes        pyramidal CA1
##                   70                   80                  124
##         pyramidal SS
##                  155

Our test dataset will be another brain scRNA-seq experiment from Tasic et al. (2016).

library(scRNAseq)
sce.tasic <- TasicBrainData()
sce.tasic
## class: SingleCellExperiment
## dim: 24058 1809
## assays(1): counts
## rownames(24058): 0610005C13Rik 0610007C21Rik ... mt_X57780
##   tdTomato
## rowData names(0):
## colnames(1809): Calb2_tdTpositive_cell_1 Calb2_tdTpositive_cell_2
##   ... Rbp4_CTX_250ng_2 Trib2_CTX_250ng_1
## colData names(13): sample_title mouse_line ... secondary_type
##   aibs_vignette_id
## reducedDimNames(0):
## spikeNames(0):
## altExpNames(1): ERCC

We use the AUCell package to identify marker sets that are highly expressed in each cell. This method ranks genes by their expression values within each cell and constructs a response curve of the number of genes from each marker set that are present with increasing rank. It then computes the area under the curve (AUC) for each marker set, quantifying the enrichment of those markers among the most highly expressed genes in that cell. This is roughly similar to performing a Wilcoxon rank sum test between genes in and outside of the set, but involving only the top ranking genes by expression in each cell.

library(GSEABase)
all.sets <- lapply(names(markers.z), function(x) {
GeneSet(markers.z[[x]], setName=x)
})
all.sets <- GeneSetCollection(all.sets)

library(AUCell)
rankings <- AUCell_buildRankings(counts(sce.tasic),
plotStats=FALSE, verbose=FALSE)
cell.aucs <- AUCell_calcAUC(all.sets, rankings)
results <- t(assay(cell.aucs))
head(results)
##                           gene sets
## cells                      astrocytes_ependymal endothelial-mural
##   Calb2_tdTpositive_cell_1               0.1353           0.04777
##   Calb2_tdTpositive_cell_2               0.1328           0.04763
##   Calb2_tdTpositive_cell_3               0.1084           0.08739
##   Calb2_tdTpositive_cell_4               0.1285           0.04869
##   Calb2_tdTpositive_cell_5               0.1612           0.06984
##   Calb2_tdTpositive_cell_6               0.1302           0.09979
##                           gene sets
## cells                      interneurons microglia oligodendrocytes
##   Calb2_tdTpositive_cell_1       0.5247   0.05451           0.1444
##   Calb2_tdTpositive_cell_2       0.4417   0.02645           0.1226
##   Calb2_tdTpositive_cell_3       0.3412   0.03533           0.1518
##   Calb2_tdTpositive_cell_4       0.5052   0.05313           0.1499
##   Calb2_tdTpositive_cell_5       0.4769   0.07673           0.1347
##   Calb2_tdTpositive_cell_6       0.3346   0.03157           0.1568
##                           gene sets
## cells                      pyramidal CA1 pyramidal SS
##   Calb2_tdTpositive_cell_1        0.2242       0.3452
##   Calb2_tdTpositive_cell_2        0.1939       0.2760
##   Calb2_tdTpositive_cell_3        0.3009       0.5137
##   Calb2_tdTpositive_cell_4        0.2357       0.3487
##   Calb2_tdTpositive_cell_5        0.2032       0.3046
##   Calb2_tdTpositive_cell_6        0.3892       0.5244

We assign cell type identity to each cell in the test dataset by taking the marker set with the top AUC as the label for that cell. Our new labels mostly agree with the original annotation from Tasic et al. (2016), which is encouraging. The only exception involves miassignment of oligodendrocyte precusors to astrocytes, which may be understandable given that they are derived from a common lineage. In the absence of prior annotation, a more general diagnostic check is to compare the assigned labels to cluster identities, under the expectation that most cells of a single cluster would have the same label (or, if multiple labels are present, they should at least represent closely related cell states).

new.labels <- colnames(results)[max.col(results)]
tab <- table(new.labels, sce.tasic$broad_type) tab ## ## new.labels Astrocyte Endothelial Cell GABA-ergic Neuron ## astrocytes_ependymal 43 2 0 ## endothelial-mural 0 27 0 ## interneurons 0 0 759 ## microglia 0 0 0 ## oligodendrocytes 0 0 1 ## pyramidal SS 0 0 1 ## ## new.labels Glutamatergic Neuron Microglia Oligodendrocyte ## astrocytes_ependymal 0 0 0 ## endothelial-mural 0 0 0 ## interneurons 2 0 0 ## microglia 0 22 0 ## oligodendrocytes 0 0 38 ## pyramidal SS 810 0 0 ## ## new.labels Oligodendrocyte Precursor Cell Unclassified ## astrocytes_ependymal 21 4 ## endothelial-mural 0 2 ## interneurons 0 15 ## microglia 0 1 ## oligodendrocytes 1 0 ## pyramidal SS 0 60 Another simple diagnostic metric is the difference $$\Delta_{AUC}$$ between the maximum and median AUCs for each cell. An umambiguous assignment should manifest as a large $$\Delta_{AUC}$$ for that cell (Figure 12.4), while small differences indicate that the assignment is uncertain. If necessary, we can remove uncertain assignments by applying a minimum threshold on the $$\Delta_{AUC}$$, e.g., to achieve greater agreement with the clustering results or prior annotation. The example below identifies small outlier $$\Delta_{AUC}$$ values under the assumption that most cells are correctly assigned and that there is only modest heterogeneity within each label. library(scater) library(DelayedMatrixStats) deltas <- rowMaxs(results) - rowMedians(results) discard <- isOutlier(deltas, nmads=3, type="lower", batch=new.labels) table(new.labels[discard]) ## ## astrocytes_ependymal endothelial-mural interneurons ## 25 3 7 ## oligodendrocytes pyramidal SS ## 9 20 par(mar=c(10,4,1,1)) boxplot(split(deltas, new.labels), las=2) points(attr(discard, "thresholds")[1,], col="red", pch=4, cex=2) Interpretation of the AUCell results is most straightforward when the marker sets are mutually exclusive, as shown above for the cell type markers. In other applications, one might consider computing AUCs for gene sets associated with signalling or metabolic pathways. It is likely that multiple pathways will be active in any given cell, and it is tempting to use the AUCs to quantify this activity for comparison across cells. However, such comparisons must be interpreted with much caution as the AUCs are competitive values - any increase in one pathway’s activity will naturally reduce the AUCs for all other pathways, potentially resulting in spurious differences across the population. As we mentioned previously, the advantage of the AUCell approach is that it does not require reference expression values. This is particularly useful when dealing with gene sets derived from the literature or other qualitative forms of biological knowledge. (In this particular example, we do have the original expression values, so we could have used SingleR directly. However, this may not always be the case.) The flipside is that information on relative expression is lost when only the marker identities are used. The net effect of ignoring expression values is difficult to predict - it may reduce performance for resolving more subtle cell types, but may also improve performance if the per-cell expression was too noisy to be useful. ## 12.4 Assigning cluster labels from markers Yet another strategy for annotation is to perform a gene set enrichment analysis on the marker genes defining each cluster. This identifies the pathways and processes that are (relatively) active in each cluster based on upregulation of the associated genes compared to other clusters. We demonstrate on the mouse mammary dataset from Bach et al. (2017), using markers that are identified by findMarkers() as being upregulated at a log-fold change threshold of 1. ### loading ### library(BiocFileCache) bfc <- BiocFileCache() base.path <- "ftp://ftp.ncbi.nlm.nih.gov/geo/samples/GSM2834nnn/GSM2834500/suppl" barcode.fname <- bfcrpath(bfc, file.path(base.path, "GSM2834500%5FG%5F1%5Fbarcodes%2Etsv%2Egz")) gene.fname <- bfcrpath(bfc, file.path(base.path, "GSM2834500%5FG%5F1%5Fgenes%2Etsv%2Egz")) counts.fname <- bfcrpath(bfc, file.path(base.path, "GSM2834500%5FG%5F1%5Fmatrix%2Emtx%2Egz")) library(Matrix) library(SingleCellExperiment) gene.info <- read.table(gene.fname, stringsAsFactors=FALSE) colnames(gene.info) <- c("Ensembl", "Symbol") sce.mam <- SingleCellExperiment( list(counts=as(readMM(counts.fname), "dgCMatrix")), rowData=DataFrame(gene.info), colData=DataFrame(Barcode=readLines(barcode.fname)) ) ### gene-annotation ### library(scater) rownames(sce.mam) <- uniquifyFeatureNames( rowData(sce.mam)$Ensembl, rowData(sce.mam)$Symbol) library(AnnotationHub) ens.mm.v97 <- AnnotationHub()[["AH73905"]] rowData(sce.mam)$SEQNAME <- mapIds(ens.mm.v97, keys=rowData(sce.mam)$Ensembl, keytype="GENEID", column="SEQNAME") ### quality-control ### is.mito <- rowData(sce.mam)$SEQNAME == "MT"
stats <- perCellQCMetrics(sce.mam, subsets=list(Mito=which(is.mito)))
sce.mam <- sce.mam[,!qc$discard] ### normalization ### library(scran) set.seed(101000110) clusters <- quickCluster(sce.mam) sce.mam <- computeSumFactors(sce.mam, clusters=clusters, min.mean=0.1) sce.mam <- logNormCounts(sce.mam) ### variance-modelling ### set.seed(00010101) dec.mam <- modelGeneVarByPoisson(sce.mam) ### dimensionality-reduction ### library(BiocSingular) set.seed(101010011) sce.mam <- denoisePCA(sce.mam, technical=dec.mam, BSPARAM=IrlbaParam()) sce.mam <- runTSNE(sce.mam, dimred="PCA") ### clustering ### snn.gr <- buildSNNGraph(sce.mam, use.dimred="PCA", k=25) sce.mam$cluster <- factor(igraph::cluster_walktrap(snn.gr)$membership) ### marker-detection ### markers.mam <- findMarkers(sce.mam, cluster=sce.mam$cluster,
direction="up", lfc=1)
markers.mam
## List of length 11
## names(11): 1 2 3 4 5 6 7 8 9 10 11

As an example, we obtain annotations for the marker genes that define cluster 4. We will use gene sets defined by the Gene Ontology (GO) project, which describe a comprehensive range of biological processes and functions. We define our subset of relevant marker genes at a FDR of 5% and apply the goana() function from the limma package. This performs a hypergeometric test to identify GO terms that are overrepresented in our marker subset. (The log-fold change threshold mentioned above is useful here, as it avoids including an excessive number of genes from the overpowered nature of per-cell DE comparisons.)

chosen <- "4"
cur.markers <- markers.mam[[chosen]]
is.de <- cur.markers$FDR <= 0.05 summary(is.de) ## Mode FALSE TRUE ## logical 27824 174 # goana() requires Entrez IDs, some of which map to multiple # symbols - hence the unique() in the call below. library(org.Mm.eg.db) entrez.ids <- mapIds(org.Mm.eg.db, keys=rownames(cur.markers), column="ENTREZID", keytype="SYMBOL") library(limma) go.out <- goana(unique(entrez.ids[is.de]), species="Mm", universe=unique(entrez.ids)) # Only keeping biological process terms that are not overly general. go.out <- go.out[order(go.out$P.DE),]
go.useful <- go.out[go.out$Ont=="BP" & go.out$N <= 200,]
head(go.useful, 20)
##                                                                                    Term
## GO:0022408                                    negative regulation of cell-cell adhesion
## GO:0006119                                                    oxidative phosphorylation
## GO:0006641                                               triglyceride metabolic process
## GO:0035148                                                               tube formation
## GO:0050729                                 positive regulation of inflammatory response
## GO:0042775                       mitochondrial ATP synthesis coupled electron transport
## GO:0042773                                     ATP synthesis coupled electron transport
## GO:0006639                                               acylglycerol metabolic process
## GO:0006638                                              neutral lipid metabolic process
## GO:0071404               cellular response to low-density lipoprotein particle stimulus
## GO:0019432                                            triglyceride biosynthetic process
## GO:0032760                      positive regulation of tumor necrosis factor production
## GO:1903557 positive regulation of tumor necrosis factor superfamily cytokine production
## GO:0046460                                           neutral lipid biosynthetic process
## GO:0046463                                            acylglycerol biosynthetic process
## GO:0019915                                                                lipid storage
## GO:0045333                                                         cellular respiration
## GO:0022904                                         respiratory electron transport chain
## GO:0042098                                                         T cell proliferation
## GO:0001838                                          embryonic epithelial tube formation
##            Ont   N DE      P.DE
## GO:0022408  BP 183 11 6.782e-08
## GO:0006119  BP  85  8 1.424e-07
## GO:0006641  BP  95  8 3.391e-07
## GO:0035148  BP 172 10 3.709e-07
## GO:0050729  BP 135  9 4.572e-07
## GO:0042775  BP  51  6 1.446e-06
## GO:0042773  BP  52  6 1.625e-06
## GO:0006639  BP 118  8 1.782e-06
## GO:0006638  BP 120  8 2.023e-06
## GO:0071404  BP  15  4 2.913e-06
## GO:0019432  BP  34  5 3.640e-06
## GO:0032760  BP  92  7 3.725e-06
## GO:1903557  BP  93  7 4.005e-06
## GO:0046460  BP  37  5 5.606e-06
## GO:0046463  BP  37  5 5.606e-06
## GO:0019915  BP  68  6 7.970e-06
## GO:0045333  BP 148  8 9.619e-06
## GO:0022904  BP  71  6 1.025e-05
## GO:0042098  BP 197  9 1.037e-05
## GO:0001838  BP 151  8 1.114e-05

We see an enrichment for genes involved lipid synthesis, cell adhesion and tube formation. Given that this is a mammary gland experiment, we might guess that cluster 4 contains luminal epithelial cells responsible for milk production and secretion. Indeed, a closer examination of the marker list indicates that this cluster upregulates milk proteins Csn2 and Csn3 (Figure ??).

plotExpression(sce.mam, features=c("Csn2", "Csn3"),
x="cluster", colour_by="cluster")

Further inspection of interesting GO terms is achieved by extracting the relevant genes. This is usually desirable to confirm that the interpretation of the annotated biological process is appropriate. Many terms have overlapping gene sets, so a term may only be highly ranked because it shares genes with a more relevant term that represents the active pathway.

# Extract symbols for each GO term; done once.
tab <- select(org.Mm.eg.db, keytype="SYMBOL",
keys=rownames(sce.mam), columns="GOALL")
by.go <- split(tab[,1], tab[,2])

# Identify genes associated with an interesting term.
head(cur.markers[rownames(cur.markers) %in% adhesion,1:3], 10)
## DataFrame with 10 rows and 3 columns
##               Top              p.value                  FDR
##         <integer>            <numeric>            <numeric>
## Spint2          8 2.35996156231144e-38  1.2466830909735e-35
## Cebpb          16 1.13376598276496e-15 3.56664943656779e-13
## Epcam          17 2.30150123283305e-92 2.38657153766147e-89
## Btn1a1         20 2.38156689688023e-14  7.0935223381758e-12
## Cd24a          22 4.69304874481581e-30 2.11928997995731e-27
## Ceacam1        27 1.91068488757264e-30 8.76973040692769e-28
## Cd9            52 5.55595158077947e-13 1.54015378572934e-10
## Anxa1          57 2.41419091972851e-10 6.14477430641442e-08
## Sdc4           62 4.23799113305402e-06 0.000785796528100969
## Klf4           77    0.824685006791971                    1
## <integer>

## 8

## 16

## 17

## 20

## 22

## 27

## 52

## 57

## 62

## 77

## <numeric>

## 2.35996156231144e-38

## 1.13376598276496e-15

## 2.30150123283305e-92

## 2.38156689688023e-14

## 4.69304874481581e-30

## 1.91068488757264e-30

## 5.55595158077947e-13

## 2.41419091972851e-10

## 4.23799113305402e-06

## 0.824685006791971

## <numeric>

## 1.2466830909735e-35

## 3.56664943656779e-13

## 2.38657153766147e-89

## 7.0935223381758e-12

## 2.11928997995731e-27

## 8.76973040692769e-28

## 1.54015378572934e-10

## 6.14477430641442e-08

## 0.000785796528100969

## 1

Gene set testing of marker lists is a reliable approach for determining if pathways are up- or down-regulated between clusters. As the top marker genes are simply DEGs, we can directly apply well-established procedures for testing gene enrichment in DEG lists (see here for relevant packages). This contrasts with the AUCell approach where scores are not easily comparable across cells. The downside is that all conclusions are made relative to the other clusters, making it more difficult to determine cell identity if an “outgroup” is not present in the same study.

### Bibliography

Aran, D., A. P. Looney, L. Liu, E. Wu, V. Fong, A. Hsu, S. Chak, et al. 2019. “Reference-based analysis of lung single-cell sequencing reveals a transitional profibrotic macrophage.” Nat. Immunol. 20 (2):163–72.

Bach, K., S. Pensa, M. Grzelak, J. Hadfield, D. J. Adams, J. C. Marioni, and W. T. Khaled. 2017. “Differentiation dynamics of mammary epithelial cells revealed by single-cell RNA sequencing.” Nat Commun 8 (1):2128.

Martens, J. H., and H. G. Stunnenberg. 2013. “BLUEPRINT: mapping human blood cell epigenomes.” Haematologica 98 (10):1487–9.

Muraro, M. J., G. Dharmadhikari, D. Grun, N. Groen, T. Dielen, E. Jansen, L. van Gurp, et al. 2016. “A Single-Cell Transcriptome Atlas of the Human Pancreas.” Cell Syst 3 (4):385–94.

Segerstolpe, A., A. Palasantza, P. Eliasson, E. M. Andersson, A. C. Andreasson, X. Sun, S. Picelli, et al. 2016. “Single-Cell Transcriptome Profiling of Human Pancreatic Islets in Health and Type 2 Diabetes.” Cell Metab. 24 (4):593–607.

Tasic, B., V. Menon, T. N. Nguyen, T. K. Kim, T. Jarsky, Z. Yao, B. Levi, et al. 2016. “Adult mouse cortical cell taxonomy revealed by single cell transcriptomics.” Nat. Neurosci. 19 (2):335–46.

The ENCODE Project Consortium. 2012. “An integrated encyclopedia of DNA elements in the human genome.” Nature 489 (7414):57–74.

Zeisel, A., A. B. Munoz-Manchado, S. Codeluppi, P. Lonnerberg, G. La Manno, A. Jureus, S. Marques, et al. 2015. “Brain structure. Cell types in the mouse cortex and hippocampus revealed by single-cell RNA-seq.” Science 347 (6226):1138–42.

1. For example, it may be somewhere in your bench collaborator’s head. Try sshing into that.