# Chapter 15 Doublet detection

## 15.1 Overview

In single-cell RNA sequencing (scRNA-seq) experiments, doublets are artifactual libraries generated from two cells. They typically arise due to errors in cell sorting or capture, especially in droplet-based protocols (Zheng et al. 2017) involving thousands of cells. Doublets are obviously undesirable when the aim is to characterize populations at the single-cell level. In particular, doublets can be mistaken for intermediate populations or transitory states that do not actually exist. Thus, it is desirable to identify and remove doublet libraries so that they do not compromise interpretation of the results.

Several experimental strategies are available for doublet removal. One approach exploits natural genetic variation when pooling cells from multiple donor individuals (Kang et al. 2018). Doublets can be identified as libraries with allele combinations that do not exist in any single donor. Another approach is to mark a subset of cells (e.g., all cells from one sample) with an antibody conjugated to a different oligonucleotide (Stoeckius et al. 2017). Upon pooling, libraries that are observed to have different oligonucleotides are considered to be doublets and removed. These approaches can be highly effective but rely on experimental information that may not be available.

A more general approach is to infer doublets from the expression profiles alone (Dahlin et al. 2018). In this workflow, we will describe two purely computational approaches for detecting doublets from scRNA-seq data. The main difference between these two methods is whether or not they need cluster information beforehand. We will demonstrate the use of these methods on 10X Genomics data from a droplet-based scRNA-seq study of the mouse mammary gland (Bach et al. 2017).

### loading ###
library(BiocFileCache)
bfc <- BiocFileCache()
base.path <- "ftp://ftp.ncbi.nlm.nih.gov/geo/samples/GSM2834nnn/GSM2834500/suppl"
barcode.fname <- bfcrpath(bfc, file.path(base.path,
"GSM2834500%5FG%5F1%5Fbarcodes%2Etsv%2Egz"))
gene.fname <- bfcrpath(bfc, file.path(base.path,
"GSM2834500%5FG%5F1%5Fgenes%2Etsv%2Egz"))
counts.fname <- bfcrpath(bfc, file.path(base.path,
"GSM2834500%5FG%5F1%5Fmatrix%2Emtx%2Egz"))

library(Matrix)
library(SingleCellExperiment)
colnames(gene.info) <- c("Ensembl", "Symbol")
sce.mam <- SingleCellExperiment(
rowData=DataFrame(gene.info),
)

### gene-annotation ###
library(scater)
rownames(sce.mam) <- uniquifyFeatureNames(
rowData(sce.mam)$Ensembl, rowData(sce.mam)$Symbol)

library(AnnotationHub)
ens.mm.v97 <- AnnotationHub()[["AH73905"]]
rowData(sce.mam)$SEQNAME <- mapIds(ens.mm.v97, keys=rowData(sce.mam)$Ensembl,
keytype="GENEID", column="SEQNAME")

### quality-control ###
is.mito <- rowData(sce.mam)$SEQNAME == "MT" stats <- perCellQCMetrics(sce.mam, subsets=list(Mito=which(is.mito))) qc <- quickCellQC(stats, percent_subsets="subsets_Mito_percent", nmads=3) sce.mam <- sce.mam[,!qc$discard]

### normalization ###
library(scran)
set.seed(101000110)
clusters <- quickCluster(sce.mam)
sce.mam <- computeSumFactors(sce.mam, clusters=clusters, min.mean=0.1)
sce.mam <- logNormCounts(sce.mam)

### variance-modelling ###
set.seed(00010101)
dec.mam <- modelGeneVarByPoisson(sce.mam)

### dimensionality-reduction ###
library(BiocSingular)
set.seed(101010011)
sce.mam <- denoisePCA(sce.mam, technical=dec.mam, BSPARAM=IrlbaParam())
sce.mam <- runTSNE(sce.mam, dimred="PCA")

### clustering ###
snn.gr <- buildSNNGraph(sce.mam, use.dimred="PCA", k=25)
sce.mam$cluster <- factor(igraph::cluster_walktrap(snn.gr)$membership)
sce.mam
## class: SingleCellExperiment
## dim: 27998 2772
## assays(2): counts logcounts
## rownames(27998): Xkr4 Gm1992 ... Vmn2r122 CAAA01147332.1
## rowData names(3): Ensembl Symbol SEQNAME
## colnames: NULL
## colData names(2): Barcode cluster
## reducedDimNames(2): PCA TSNE
## spikeNames(0):
## altExpNames(0):

## 15.2 Doublet detection with clusters

### 15.2.1 Overview

The doubletCluster() function will identify clusters that have intermediate expression profiles of two other clusters (Bach et al. 2017). Specifically, it will examine every possible triplet of clusters consisting of a query cluster and its two “parents”. It will then compute a number of statistics:

• The number of genes (N) that are differentially expressed in the same direction in the query cluster compared to both of the parent clusters. Such genes would be unique markers for the query cluster and provide evidence against the null hypothesis, i.e., that the query cluster consists of doublets from the two parents. Clusters with few unique genes are more likely to be composed of doublets.
• The ratio of the median library size in each parent to the median library size in the query (lib.size fields). Doublet libraries are generated from a larger initial pool of RNA compared to libraries for single cells, and thus the former should have larger library sizes. Library size ratios much greater than unity are inconsistent with a doublet identity for the query.
• The proportion of cells in the query cluster should also be reasonable - typically less than 5% of all cells, depending on how many cells were loaded onto the 10X Genomics device.

For each query cluster, the best pair of parents is identified based solely on the lowest N. (This means that any lib.size* above unity is not definitive evidence against a doublet identity for a query cluster, as there may be a pair of parents with slightly higher N but both lib.size* values below unity.) If more detail is necessary, the all.pairs field contains statistics on all possible parent pairs for a given query cluster.

library(scran)
dbl.out <- doubletCluster(sce.mam, sce.mam$cluster) dbl.out ## DataFrame with 11 rows and 9 columns ## source1 source2 N best p.value ## <character> <character> <integer> <character> <numeric> ## 10 11 1 0 Rfc1 0.304133297031587 ## 11 5 2 30 Ptn 1.35818791124658e-14 ## 1 10 3 117 Tinagl1 4.38558716805188e-17 ## 4 6 3 132 Pigr 3.04170568092603e-18 ## 5 11 8 134 Cd63 1.0483429284541e-13 ## 2 11 10 153 Ptma 1.9263161669515e-11 ## 6 9 5 197 Epcam 4.90533186350571e-20 ## 8 9 5 270 AF251705 3.296606994399e-24 ## 7 9 5 300 Fabp4 2.70725398963721e-32 ## 9 7 6 388 Dcn 4.93706079643116e-32 ## 3 11 4 517 Car2 1.89438943741351e-21 ## lib.size1 lib.size2 prop ## <numeric> <numeric> <numeric> ## 10 0.483225348362474 0.905387414573533 0.0256132756132756 ## 11 0.9924694645973 1.18853889246028 0.0216450216450216 ## 1 1.10449955886678 1.31840015684737 0.254689754689755 ## 4 0.679205503772747 1.49234354194407 0.184704184704185 ## 5 1.00758767465532 1.16785416859443 0.00865800865800866 ## 2 0.84136918559728 1.74115283572864 0.176406926406926 ## 6 0.882698905407613 0.882780591406633 0.00901875901875902 ## 8 0.856192060850963 0.856271293875287 0.0187590187590188 ## 7 0.666050295857988 0.666111932938856 0.0119047619047619 ## 9 1.50138811771238 1.13288913566537 0.0140692640692641 ## 3 0.404825637593873 0.670086995315637 0.274531024531025 ## all.pairs ## <DataFrameList> ## 10 11:1:0:...,2:1:4:...,4:2:27:...,... ## 11 5:2:30:...,7:2:111:...,10:5:136:...,... ## 1 10:3:117:...,10:4:168:...,10:6:251:...,... ## 4 6:3:132:...,10:6:181:...,7:3:205:...,... ## 5 11:8:134:...,8:6:140:...,8:7:157:...,... ## 2 11:10:153:...,11:7:186:...,11:9:261:...,... ## 6 9:5:197:...,5:4:219:...,10:5:239:...,... ## 8 9:5:270:...,7:5:338:...,6:5:366:...,... ## 7 9:5:300:...,9:6:329:...,5:2:334:...,... ## 9 7:6:388:...,8:7:403:...,11:8:417:...,... ## 3 11:4:517:...,5:1:627:...,5:4:634:...,... ## <character> ## 11 ## 5 ## 10 ## 6 ## 11 ## 11 ## 9 ## 9 ## 9 ## 7 ## 11 ## <character> ## 1 ## 2 ## 3 ## 3 ## 8 ## 10 ## 5 ## 5 ## 5 ## 6 ## 4 ## <integer> ## 0 ## 30 ## 117 ## 132 ## 134 ## 153 ## 197 ## 270 ## 300 ## 388 ## 517 ## <character> ## Rfc1 ## Ptn ## Tinagl1 ## Pigr ## Cd63 ## Ptma ## Epcam ## AF251705 ## Fabp4 ## Dcn ## Car2 ## <numeric> ## 0.304133297031587 ## 1.35818791124658e-14 ## 4.38558716805188e-17 ## 3.04170568092603e-18 ## 1.0483429284541e-13 ## 1.9263161669515e-11 ## 4.90533186350571e-20 ## 3.296606994399e-24 ## 2.70725398963721e-32 ## 4.93706079643116e-32 ## 1.89438943741351e-21 ## <numeric> ## 0.483225348362474 ## 0.9924694645973 ## 1.10449955886678 ## 0.679205503772747 ## 1.00758767465532 ## 0.84136918559728 ## 0.882698905407613 ## 0.856192060850963 ## 0.666050295857988 ## 1.50138811771238 ## 0.404825637593873 ## <numeric> ## 0.905387414573533 ## 1.18853889246028 ## 1.31840015684737 ## 1.49234354194407 ## 1.16785416859443 ## 1.74115283572864 ## 0.882780591406633 ## 0.856271293875287 ## 0.666111932938856 ## 1.13288913566537 ## 0.670086995315637 ## <numeric> ## 0.0256132756132756 ## 0.0216450216450216 ## 0.254689754689755 ## 0.184704184704185 ## 0.00865800865800866 ## 0.176406926406926 ## 0.00901875901875902 ## 0.0187590187590188 ## 0.0119047619047619 ## 0.0140692640692641 ## 0.274531024531025 ## <DataFrameList> ## 11:1:0:...,2:1:4:...,4:2:27:...,... ## 5:2:30:...,7:2:111:...,10:5:136:...,... ## 10:3:117:...,10:4:168:...,10:6:251:...,... ## 6:3:132:...,10:6:181:...,7:3:205:...,... ## 11:8:134:...,8:6:140:...,8:7:157:...,... ## 11:10:153:...,11:7:186:...,11:9:261:...,... ## 9:5:197:...,5:4:219:...,10:5:239:...,... ## 9:5:270:...,7:5:338:...,6:5:366:...,... ## 9:5:300:...,9:6:329:...,5:2:334:...,... ## 7:6:388:...,8:7:403:...,11:8:417:...,... ## 11:4:517:...,5:1:627:...,5:4:634:...,... Clusters are then ranked by N in the output of doubletClusters(), allowing us to prioritize “high-risk” clusters that require careful investigation. If a more concrete threshold is necessary, we can identify clusters that have unusually low N using an outlier-based approach. library(scater) chosen.doublet <- rownames(dbl.out)[isOutlier(dbl.out$N,
chosen.doublet
## [1] "10"

### 15.2.2 Application to the mammary data

Examination of the output of doubletCluster() indicates that cluster 10 has the fewest unique genes and library sizes that are comparable to or greater than its parents. We see that every gene detected in this cluster is also expressed in either of the two proposed parent clusters (Figure 15.1).

markers <- findMarkers(sce.mam, sce.mam$cluster, direction="up") dbl.markers <- markers[["10"]] library(scater) chosen <- rownames(dbl.markers)[dbl.markers$Top <= 10]
plotHeatmap(sce.mam, columns=order(sce.mam$cluster), colour_columns_by="cluster", features=chosen, cluster_cols=FALSE, center=TRUE, symmetric=TRUE, zlim=c(-5, 5), show_colnames=FALSE) Closer examination of some known markers suggests that the offending cluster consists of doublets of basal cells (Acta2) and alveolar cells (Csn2) (Figure 15.2). Indeed, no cell type is known to strongly express both of these genes at the same time, which supports the hypothesis that this cluster consists solely of doublets19. plotExpression(sce.mam, features=c("Acta2", "Csn2"), x="cluster", colour_by="cluster") The strength of doubletCluster() lies in its simplicity and ease of interpretation. Suspect clusters can be quickly flagged based on the metrics returned by the function. However, it is obviously dependent on the quality of the clustering. Clusters that are too coarse will fail to separate doublets from other cells, while clusters that are too fine will complicate interpretation. The method is also somewhat biased towards clusters with fewer cells, where the reduction in power is more likely to result in a low N. (Fortunately, this is a desirable effect as doublets should be rare in a properly performed scRNA-seq experiment.) ## 15.3 Doublet detection by simulation ### 15.3.1 Overview The other doublet detection strategy involves in silico simulation of doublets from the single-cell expression profiles (Dahlin et al. 2018). This is performed using the doubletCells() function from scran, which will: 1. Simulate thousands of doublets by adding together two randomly chosen single-cell profiles. 2. For each original cell, compute the density of simulated doublets in the surrounding neighbourhood. 3. For each original cell, compute the density of other observed cells in the neighbourhood. 4. Return the ratio between the two densities as a “doublet score” for each cell. This approach assumes that the simulated doublets are good approximations for real doublets. The use of random selection accounts for the relative abundances of different subpopulations, which affect the likelihood of their involvement in doublets; and the calculation of a ratio avoids high scores for non-doublet cells in highly abundant subpopulations. ### 15.3.2 Application to the mammary data We see the function in action below. To speed up the density calculations, doubletCells() will perform a PCA on the log-expression matrix. When BSPARAM=IrlbaParam(), methods from the irlba package are used to perform a fast approximate PCA - hence the use of set.seed() to ensure that results are reproducible. library(BiocSingular) set.seed(100) # Setting up the parameters for consistency with denoisePCA(); # this can be changed depending on your feature selection scheme. dbl.dens <- doubletCells(sce.mam, BSPARAM=IrlbaParam(), subset.row=dec.mam$bio > 0,
d=ncol(reducedDim(sce.mam)))
summary(dbl.dens)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##       0       4      14     335      48   56669

The highest doublet scores are concentrated in a single cluster of cells in the centre of Figure 15.3.

sce.mam\$DoubletScore <- log10(dbl.dens+1)
plotTSNE(sce.mam, colour_by="DoubletScore")

From the clustering information, we see that the affected cells belong to the same cluster that was identified using doubletCluster() (Figure 15.4)20.

plotColData(sce.mam, x="cluster", y="DoubletScore", colour_by="cluster")

## 15.4 Strengths and weaknesses

The advantage of doubletCells() is that it does not depend on clusters, reducing the sensitivity of the results to clustering quality. The downside is that it requires some strong assumptions about how doublets form, such as the combining proportions and the sampling from pure subpopulations. In particular, doubletCells() treats the library size of each cell as an accurate proxy for its total RNA content. If this is not true, the simulation will not combine expression profiles from different cells in the correct proportions. This means that the simulated doublets will be systematically shifted away from the real doublets, resulting in doublet scores that are too low.

Simply removing cells with high doublet scores will not be sufficient to eliminate real doublets from the data set. In some cases, only a subset of the cells in the putative doublet cluster actually have high scores. Removing these would still leave enough cells in that cluster to mislead downstream analyses. In fact, even defining a threshold on the doublet score is difficult as the interpretation of the score is relative. There is no general definition for a fixed threshold above which libraries are to be considered doublets.

We recommend interpreting the doubletCells() scores in the context of cluster annotation. All cells from a cluster with a large average doublet score should be considered suspect, and close neighbours of problematic clusters should also be treated with caution. In contrast, a cluster containing a small proportion of high-scoring cells is probably safe provided that any interesting results ae not being driven by those cells (e.g., checking that DE in an interesting gene is not driven solely by cells with high doublet scores). While clustering is still required, this approach is more robust than doubletClusters() to the quality of the clustering as the scores are computed on a per-cell basis.

(As an aside, the issue of unknown combining proportions can be solved completely if spike-in information is available, e.g., in plate-based protocols. This will provide an accurate estimate of the total RNA content of each cell. To this end, spike-in-based size factors from Section 7.5 can be supplied to the doubletCells() function via the size.factors.content= argument. This will use the spike-in size factors to scale the contribution of each cell to a doublet library.)

Doublet detection procedures should only be applied to libraries generated in the same experimental batch. It is obviously impossible for doublets to form between two cells that were captured separately. Thus, some understanding of the experimental design is required prior to the use of the above functions. This avoids unnecessary concerns about the validity of batch-specific clusters that cannot possibly consist of doublets.

It is also difficult to interpret doublet predictions in data containing cellular trajectories. By definition, cells in the middle of a trajectory are always intermediate between other cells and are liable to be incorrectly detected as doublets. Some protection is provided by the non-linear nature of many real trajectories, which reduces the risk of simulated doublets coinciding with real cells in doubletCells(). One can also put more weight on the relative library sizes in doubletCluster() instead of relying solely on N, under the assumption that sudden spikes in RNA content are unlikely in a continuous biological process.

The best solution to the doublet problem is experimental - that is, to avoid generating them in the first place21. This should be a consideration when designing scRNA-seq experiments, where the desire to obtain large numbers of cells at minimum cost should be weighed against the general deterioration in data quality and reliability when doublets become more frequent. If cell labelling information is available (Kang et al. 2018; Stoeckius et al. 2018), we recommend using it to mark doublet cells for use in identification of problematic clusters or cellular states in downstream analyses. Direct removal fails to account for unlabelled intra-sample doublets that can still be present in sufficient numbers to drive misleading conclusions, while the marking approach uses “guilt by association” to identify neighboring unlabelled doublets.

## Session Info

R version 3.6.1 (2019-07-05)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 14.04.5 LTS

Matrix products: default
BLAS:   /home/ramezqui/Rbuild/danbuild/R-3.6.1/lib/libRblas.so
LAPACK: /home/ramezqui/Rbuild/danbuild/R-3.6.1/lib/libRlapack.so

locale:
[1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8        LC_COLLATE=C
[5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8       LC_NAME=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] parallel  stats4    stats     graphics  grDevices utils     datasets
[8] methods   base

other attached packages:
[1] BiocSingular_1.1.5          scater_1.13.18
[3] ggplot2_3.2.1               scran_1.13.18
[5] SingleCellExperiment_1.7.8  SummarizedExperiment_1.15.9
[7] DelayedArray_0.11.4         BiocParallel_1.19.2
[9] matrixStats_0.55.0          Biobase_2.45.1
[11] GenomicRanges_1.37.15       GenomeInfoDb_1.21.1
[13] IRanges_2.19.14             S4Vectors_0.23.21
[15] BiocGenerics_0.31.5         Cairo_1.5-10
[17] BiocStyle_2.13.2            OSCAUtils_0.0.1

loaded via a namespace (and not attached):
[1] viridis_0.5.1            edgeR_3.27.13
[3] viridisLite_0.3.0        DelayedMatrixStats_1.7.2
[5] assertthat_0.2.1         statmod_1.4.32
[7] BiocManager_1.30.4       highr_0.8
[9] dqrng_0.2.1              GenomeInfoDbData_1.2.1
[11] vipor_0.4.5              yaml_2.2.0
[13] pillar_1.4.2             lattice_0.20-38
[15] glue_1.3.1               limma_3.41.16
[17] digest_0.6.20            RColorBrewer_1.1-2
[19] XVector_0.25.0           colorspace_1.4-1
[21] cowplot_1.0.0            htmltools_0.3.6
[23] Matrix_1.2-17            pkgconfig_2.0.2
[25] pheatmap_1.0.12          bookdown_0.13
[27] zlibbioc_1.31.0          purrr_0.3.2
[29] scales_1.0.0             tibble_2.1.3
[31] withr_2.1.2              lazyeval_0.2.2
[33] magrittr_1.5             crayon_1.3.4
[35] evaluate_0.14            beeswarm_0.2.3
[37] tools_3.6.1              stringr_1.4.0
[39] munsell_0.5.0            locfit_1.5-9.1
[41] irlba_2.3.3              compiler_3.6.1
[43] rsvd_1.0.2               rlang_0.4.0
[45] grid_3.6.1               RCurl_1.95-4.12
[47] BiocNeighbors_1.3.3      igraph_1.2.4.1
[49] labeling_0.3             bitops_1.0-6
[51] rmarkdown_1.15           gtable_0.3.0
[53] R6_2.4.0                 gridExtra_2.3
[55] knitr_1.24               dplyr_0.8.3
[57] stringi_1.4.3            ggbeeswarm_0.6.0
[59] Rcpp_1.0.2               tidyselect_0.2.5
[61] xfun_0.9                

## 15.6 References

### Bibliography

Bach, K., S. Pensa, M. Grzelak, J. Hadfield, D. J. Adams, J. C. Marioni, and W. T. Khaled. 2017. “Differentiation dynamics of mammary epithelial cells revealed by single-cell RNA sequencing.” Nat Commun 8 (1):2128.

Dahlin, J. S., F. K. Hamey, B. Pijuan-Sala, M. Shepherd, W. W. Y. Lau, S. Nestorowa, C. Weinreb, et al. 2018. “A single-cell hematopoietic landscape resolves 8 lineage trajectories and defects in Kit mutant mice.” Blood 131 (21):e1–e11.

Kang, H. M., M. Subramaniam, S. Targ, M. Nguyen, L. Maliskova, E. McCarthy, E. Wan, et al. 2018. “Multiplexed droplet single-cell RNA-sequencing using natural genetic variation.” Nat. Biotechnol. 36 (1):89–94.

Stoeckius, M., S. Zheng, B. Houck-Loomis, S. Hao, B. Yeung, P. Smibert, and R. Satija. 2017. “Cell "Hashing" with Barcoded Antibodies Enables Multiplexing and Doublet Detection for Single Cell Genomics.” bioRxiv. Cold Spring Harbor Laboratory. https://doi.org/10.1101/237693.

Stoeckius, M., S. Zheng, B. Houck-Loomis, S. Hao, B. Z. Yeung, W. M. Mauck, P. Smibert, and R. Satija. 2018. “Cell Hashing with barcoded antibodies enables multiplexing and doublet detection for single cell genomics.” Genome Biol. 19 (1):224.

Zheng, G. X., J. M. Terry, P. Belgrader, P. Ryvkin, Z. W. Bent, R. Wilson, S. B. Ziraldo, et al. 2017. “Massively parallel digital transcriptional profiling of single cells.” Nat Commun 8 (January):14049.

1. While it is possible that this cluster represents an entirely novel cell type, the presence of doublets provides a more sober explanation for its expression profile. So, there’s probably no Nature paper here, move along.

2. Phew. Would have been awkward if they didn’t match up, right?

3. Doh!