Chapter 8 Quality Control

8.1 Motivation

Low-quality libraries in scRNA-seq data can arise from a variety of sources such as cell damage during dissociation or failure in library preparation (e.g., inefficient reverse transcription or PCR amplification). These usually manifest as “cells” with low total counts, few expressed genes and high mitochondrial or spike-in proportions. These low-quality libraries are problematic as they can contribute to misleading results in downstream analyses:

  • They form their own distinct cluster(s), complicating interpretation of the results. This can be most obviously driven by increased mitochondrial proportions or enrichment for nuclear RNAs after cell damage. However, very small libraries can also form their own clusters due to shifts in the mean upon transformation (Lun 2018).
  • They contain genes that appear to be strongly “upregulated” due to aggressive scaling to normalize for small library sizes. This is most problematic for contaminating transcripts (e.g., from the ambient solution) that are present in all libraries at low but constant levels. Increased scaling in low-quality libraries transforms small counts for these transcripts in large normalized expression values, resulting in apparent upregulation compared to other cells.
  • They distort the characterization of population heterogeneity during variance estimation or principal components analysis. The first few principal components will capture differences in quality rather than biology, reducing the effectiveness of dimensionality reduction. Similarly, genes with the largest variances will be driven by differences between low- and high-quality cells.

To avoid - or at least mitigate - these problems, we need to remove these cells at the start of the analysis. This step is commonly referred to as quality control (QC) on the cells. (We will use “library” and “cell” rather interchangeably here, though the distinction will become important when dealing with droplet-based data.) We will demonstrate using a small scRNA-seq dataset from A. T. L. Lun et al. (2017), which is provided with no prior QC so that we can apply our own procedures.

## class: SingleCellExperiment 
## dim: 46703 192 
## metadata(0):
## assays(1): counts
## rownames(46703): ENSMUSG00000102693 ENSMUSG00000064842 ... SIRV7
##   CBFB-MYH11-mcherry
## rowData names(1): Length
## colnames(192): SLX-9555.N701_S502.C89V9ANXX.s_1.r_1
##   SLX-9555.N701_S503.C89V9ANXX.s_1.r_1 ...
##   SLX-11312.N712_S508.H5H5YBBXX.s_8.r_1
##   SLX-11312.N712_S517.H5H5YBBXX.s_8.r_1
## colData names(9): Source Name cell line ... spike-in addition
##   block
## reducedDimNames(0):
## spikeNames(2): ERCC SIRV

8.2 Choice of QC metrics

We use several common QC metrics to identify low-quality cells based on their expression profiles. These metrics are described below in terms of reads for SMART-seq2 data, but the same definitions apply to UMI data generated by other technologies like MARS-seq and droplet-based protocols.

  • The library size is defined as the total sum of counts across all features, i.e., genes and spike-in transcripts. Cells with small library sizes are of low quality as the RNA has not been efficiently captured (i.e., converted into cDNA and amplified) during library preparation.
  • The number of expressed features in each cell is defined as the number of features with non-zero counts for that cell. Any cell with very few expressed genes is likely to be of poor quality as the diverse transcript population has not been successfully captured.
  • The proportion of reads mapped to spike-in transcripts is calculated relative to the library size for each cell. As the same amount of spike-in RNA should have been added to each cell, any enrichment in spike-in counts is symptomatic of loss of endogenous RNA. Thus, high proportions are indicative of poor-quality cells where endogenous RNA has been lost due to, e.g., partial cell lysis or RNA degradation during dissociation.
  • In the absence of spike-in transcripts, the proportion of reads mapped to genes in the mitochondrial genome can be used. High proportions are indicative of poor-quality cells (Islam et al. 2014; Ilicic et al. 2016), possibly because of loss of cytoplasmic RNA from perforated cells. The reasoning is that mitochondria are larger than individual transcript molecules and less likely to escape through tears in the cell membrane.

For each cell, we calculate these QC metrics using the calculateQCMetrics function from the scater package (McCarthy et al. 2017). These are stored in the column-wise metadata of the SingleCellExperiment for future reference. (Similar per-gene metrics are computed in the row-wise metadata, but we will not use them here.)

## [1] "Length"                  "is_feature_control"     
## [3] "is_feature_control_Mito" "is_feature_control_ERCC"
## [5] "is_feature_control_SIRV" "mean_counts"
## [1] "Source Name"              "cell line"               
## [3] "cell type"                "single cell well quality"
## [5] "genotype"                 "phenotype"

8.3 Identifying low-quality cells

8.3.1 With fixed thresholds

The simplest approach to identifying low-quality cells is to apply thresholds on the QC metrics. For example, we might discard all cells that have library sizes below 100,000 reads; express fewer than 5,000 genes; have spike-in proportions above 10%; or have mitochondrial proportions above 5%1.

## DataFrame with 1 row and 5 columns
##     LibSize    NExprs SpikeProp  MitoProp     Total
##   <integer> <integer> <integer> <integer> <integer>
## 1         3         0        19         7        27
## [1] 46703   165

While simple, this strategy generally requires considerable experience to determine appropriate thresholds for each experimental protocol and biological system. For example, thresholds for read count-based data are simply not applicable for UMI-based data, and vice versa. Indeed, even with the same protocol and system, the appropriate threshold can vary from run to run due to the vagaries of cDNA capture efficiency and sequencing depth.

8.3.2 With adaptive thresholds

To obtain an adaptive threshold, we assume that most of the dataset consists of high-quality cells. We then identify cells that are outliers for the various QC metrics, based on the median absolute deviation (MAD) from the median value of each metric across all cells. Specifically, a value is considered an outlier if it is more than 3 MADs from the median in the “problematic” direction. This is loosely motivated by the fact that such a filter will retain 99% of non-outlier values that follow a normal distribution.

For the 416B data, we identify cells with log-transformed library sizes that are more than 3 MADs below the median. A log-transformation is used to improve resolution at small values when type="lower". In particular, it guarantees that the threshold is not a negative value, which would be meaningless for quality control on these metrics. Moreover, these metrics can occasionally exhibit a heavy right tail, and the log-transformation makes the distribution seem more normal to justify the 99% rationale mentioned above.

We do the same for the log-transformed number of expressed genes.

isOutlier() will also return the exact filter thresholds for each metric. These is useful for checking whether the automatically selected thresholds are appropriate.

##  lower higher 
## 480573    Inf
##  lower higher 
##   5269    Inf

We identify outliers for the proportion-based metrics in a similar manner. Here, no transformation is required as we are identifying large outliers that should already be clearly distinguishable from zero.

A cell that is an outlier for any of these metrics is considered to be of low-quality and discarded. This is achieved simply by subsetting the SingleCellExperiment as previously shown.

## DataFrame with 1 row and 5 columns
##     LibSize    NExprs SpikeProp  MitoProp     Total
##   <integer> <integer> <integer> <integer> <integer>
## 1         4         0         1         1         4

With this strategy, the thresholds adapt to both the location and spread of the distribution of values for a given metric. This allows the QC procedure to adjust to changes in sequencing depth, cDNA capture efficiency, mitochondrial content, etc. without requiring any user intervention or prior experience. However, it does require some implicit assumptions that are discussed below in more detail.

8.3.3 Considering experimental factors

More complex studies may involve batches of cells generated at different times or in different systems. In such cases, the adaptive strategy should be applied to each batch separately. It makes little sense to compute medians and MADs from a mixture distribution containing samples from multiple batches. For example, if the sequencing coverage is lower in one batch compared to the others, it will drag down the median and inflate the MAD. This will reduce the suitability of the adaptive threshold for the other batches.

If each batch is represented by its own SingleCellExperiment, the isOutlier() function can be directly applied as shown above. However, if cells from all batches have been merged into a single SingleCellExperiment, the batch= argument should be used to ensure that outliers are identified within each batch. This allows isOutlier() to accommodate systematic differences in the QC metrics across batches.

We will again illustrate using the 416B dataset, which actually contains two experimental factors - plate of origin and oncogene induction status. (We had been ignoring this in the previous section for simplicity, but no longer.) We combine these factors together and supply this to the batch= argument. This results in the removal of more cells as the MAD is no longer inflated by (i) systematic differences in sequencing depth between batches and (ii) differences in number of genes expressed upon oncogene induction.

## DataFrame with 1 row and 5 columns
##     LibSize    NExprs SpikeProp  MitoProp     Total
##   <integer> <integer> <integer> <integer> <integer>
## 1         4         2         4         4         7

8.4 Checking diagnostic plots

TODO.

8.5 Droplet-specific procedures

8.5.1 Background

An interesting aspect of droplet-based data is that we have no prior knowledge about whether a particular library (i.e., cell barcode) corresponds to cell-containing or empty droplets. Thus, we need to call cells from empty droplets based on the observed expression profiles. This is not entirely straightforward as empty droplets can contain ambient (i.e., extracellular) RNA that can be captured and sequenced, resulting in non-zero counts for libraries that do not contain any cell.

To demonstrate, we obtain the unfiltered count matrix for the PBMC dataset from 10X Genomics. This has not been subjected to the cell filtering algorithm method in the CellRanger software suite, allowing us to apply our own algorithms for identifying the libraries that correspond to cells.

## class: SingleCellExperiment 
## dim: 33694 737280 
## metadata(0):
## assays(1): counts
## rownames(33694): ENSG00000243485 ENSG00000237613 ...
##   ENSG00000277475 ENSG00000268674
## rowData names(2): ID Symbol
## colnames(737280): AAACCTGAGAAACCAT-1 AAACCTGAGAAACCGC-1 ...
##   TTTGTCATCTTTAGTC-1 TTTGTCATCTTTCCTC-1
## colData names(2): Sample Barcode
## reducedDimNames(0):
## spikeNames(0):

The distribution of total counts exhibits a sharp transition between barcodes with large and small total counts (Figure 8.1), probably corresponding to cell-containing and empty droplets respectively.

Total UMI count for each barcode in the PBMC dataset, plotted against its rank (in decreasing order of total counts). The inferred locations of the inflection and knee points are also shown.

Figure 8.1: Total UMI count for each barcode in the PBMC dataset, plotted against its rank (in decreasing order of total counts). The inferred locations of the inflection and knee points are also shown.

8.5.2 Testing for empty droplets

We use the emptyDrops() function to test whether the expression profile for each cell barcode is significantly different from the ambient RNA pool (Lun et al. 2019). Any significant deviation indicates that the barcode corresponds to a cell-containing droplet. We call cells at a false discovery rate (FDR) of 0.1%, meaning that no more than 0.1% of our called barcodes should be empty droplets on average.

## [1] 4233

emptyDrops() computes Monte Carlo \(p\)-values based on a Dirichlet-multinomial model of sampling molecules into droplets. These \(p\)-values are stochastic so it is important to set the random seed to obtain reproducible results. The stability of the Monte Carlo \(p\)-values depends on the number of iterations used to compute them. This is controlled using the niters= argument in emptyDrops(), set to a default of 10000 for speed. Larger values improve stability with the only cost being that of time, so users should set niters= to the largest value they are willing to wait for.

emptyDrops() assumes that libraries with total UMI counts below a certain threshold (lower=100 by default) correspond to empty droplets. These are used to estimate the ambient expression profile against which the remaining libraries are tested. Under this definition, these low-count libraries cannot be cell-containing droplets and are excluded from the hypothesis testing - hence the NAs in the output of the function.

To retain only the detected cells, we subset our SingleCellExperiment object. Discerning readers will notice the use of which(), which conveniently removes the NAs prior to the subsetting.

Users wanting to use the cell calling algorithm from the CellRanger pipeline (version 2) can call defaultDrops() instead. This tends to be quite conservative as it often discards genuine cells with low RNA content (and thus low total counts). It also requires an a priori estimate of the expected number of cells in the experiment, which may or may not be accurate.

(As an aside, CellRanger version 3 automatically performs cell calling using an algorithm similar to emptyDrops(). If we had started our analysis with the filtered count matrix, we would not need to run emptyDrops() manually as shown here. Nonetheless, it may be desirable on occasions where more detailed inspection or control of the cell-calling statistics is desired.)

8.5.3 Examining cell-calling diagnostics

The number of Monte Carlo iterations determines the lower bound for the \(p\)-values (Phipson and Smyth 2010). The Limited field in the output indicates whether or not the computed \(p\)-value for a particular barcode is bounded by the number of iterations. If any non-significant barcodes are TRUE for Limited, we may need to increase the number of iterations. A larger number of iterations will result in a lower \(p\)-value for these barcodes, which may allow them to be detected after correcting for multiple testing.

##        Limited
## Sig     FALSE TRUE
##   FALSE  1056    0
##   TRUE   1661 2572

As mentioned above, emptyDrops() assumes that barcodes with low total UMI counts are empty droplets. Thus, the null hypothesis should be true for all of these barcodes. We can check whether the hypothesis testing procedure holds its size by examining the distribution of \(p\)-values for low-total barcodes with test.ambient=TRUE. Ideally, the distribution should be close to uniform.

Distribution of $p$-values for the assumed empty droplets.

Figure 8.2: Distribution of \(p\)-values for the assumed empty droplets.

Large peaks near zero indicate that barcodes with total counts below lower are not all ambient in origin. This can be resolved by decreasing lower further to ensure that barcodes corresponding to droplets with very small cells are not used to estimate the ambient profile.

While emptyDrops() will distinguish cells from empty droplets, it makes no statement about the quality of the cells. It is entirely possible for droplets to contain damaged or dying cells, which need to be removed prior to downstream analysis. This can be done using the outlier-based strategy to set data-adaptive thresholds on the mitochondrial proportion. (emptyDrops() will already remove cells with low library sizes or number of expressed genes, so further filtering on these metrics is generally not necessary.)

8.6 Assumptions of outlier detection

8.6.1 Overview

The use of isOutlier() (Section 8.3.2) is a general and effective strategy for identifying and removing low-quality cells. However, it involves a number of assumptions about the experiment and biological system, and while these assumptions are mostly reasonable, users should still keep them in mind when interpreting the results2.

An outlier-based definition for low-quality cells assumes that most cells are of acceptable quality. This is usually a reasonable expectation and can be experimentally supported in some situations by visually checking that the cells are intact, e.g., on the microwell plate. If most cells are of (unacceptably) low quality, the adaptive thresholds will fail as they cannot remove the majority of cells. Of course, what is acceptable or not is in the eye of the beholder - neurons, for example, are notoriously difficult to dissociate, and we would often retain cells in a neuronal scRNA-seq dataset with QC metrics that would be unacceptable in a more amenable system like embryonic stem cells.

Another assumption is that the QC metrics are independent of the biological state of each cell. This implies that any outlier values for these metrics are driven by technical factors rather than biological processes. Thus, removing cells based on the metrics will not misrepresent the biology in downstream analyses. This assumption is most likely to be violated in highly heterogeneous cell populations. For example, some cell types may naturally have less RNA or express fewer genes than other cell types. These cell types are more likely to be considered outliers and removed, even if they are of high quality. QC filtering based on outliers may not be appropriate in such cases.

The use of the MAD mitigates this problem by accounting for biological variability in the QC metrics. A heterogeneous population should have higher variability in the metrics among high-quality cells, increasing the MAD and reducing the chance of incorrectly removing particular cell types (at the cost of reducing power to remove low-quality cells). Systematic differences in the QC metrics can be handled to some extent using the batch= argument in the isOutlier() function. However, this is not applicable in experiments where the relevant factors (e.g., cell type) are not known in advance.

8.6.2 Checking for discarded cell types

The biggest practical concern with the assumptions is whether a cell type is lost during the QC procedure. We can diagnose this by looking for differences in gene expression between the discarded and retained cells. To demonstrate, we compute the average count across the discarded and retained pools in the 416B data set, and we compute the log-fold change between the averages.

If the discarded pool is enriched for a certain cell type, we should observe increased expression of the corresponding marker genes. No systematic upregulation of genes is apparent in the discarded pool in Figure 8.3, suggesting that the QC step did not inadvertently filter out a cell type in the 416B dataset.

Log-fold change in expression in the discarded cells compared to the retained cells in the 416B dataset. Each point represents a gene, with spike-in and mitochondrial transcripts in red and blue respectively.

Figure 8.3: Log-fold change in expression in the discarded cells compared to the retained cells in the 416B dataset. Each point represents a gene, with spike-in and mitochondrial transcripts in red and blue respectively.

For comparison, let’s pretend that we applied a fixed threshold on the library size to filter cells in the PBMC data set. Specifically, we remove all libraries with a library size below 500.

The presence of a distinct population in the discarded pool manifests in Figure 8.4 as a set of genes that are strongly upregulated in lost. This includes PF4, PPBP and CAVIN2, which (spoiler alert!) indicates that there is a platelet population that has been discarded by alt.discard.

Average counts across all discarded and retained cells in the PBMC dataset, after using a more stringent filter on the total UMI count. Each point represents a gene, with platelet-related genes highlighted in orange.

Figure 8.4: Average counts across all discarded and retained cells in the PBMC dataset, after using a more stringent filter on the total UMI count. Each point represents a gene, with platelet-related genes highlighted in orange.

If we suspect that cell types have been incorrectly discarded by our QC procedure, the most direct solution is to relax the QC filters by increasing nmads= in the isOutlier() calls. We can also avoid filtering on metrics that are associated with genuine biological differences between cell types.

As an aside, it is worth mentioning that the true technical quality of a cell may also be correlated with its type. (This differs from a correlation between the cell type and the QC metrics, as the latter are our imperfect proxies for quality.) This can arise if some cell types are not amenable to dissociation or microfluidics handling during the scRNA-seq protocol. In such cases, it is possible to “correctly” discard an entire cell type during QC if all of its cells are damaged. Indeed, concerns over the computational removal of cell types during QC are probably minor compared to losses in the experimental protocol.

8.7 Session Info

R version 3.6.0 (2019-04-26)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 14.04.6 LTS

Matrix products: default
BLAS/LAPACK: /app/easybuild/software/OpenBLAS/0.2.18-GCC-5.4.0-2.26-LAPACK-3.6.1/lib/libopenblas_prescottp-r0.2.18.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices utils     datasets 
[8] methods   base     

other attached packages:
 [1] org.Hs.eg.db_3.8.2                    
 [2] edgeR_3.27.9                          
 [3] limma_3.41.15                         
 [4] Matrix_1.2-17                         
 [5] DropletUtils_1.5.4                    
 [6] BiocFileCache_1.9.1                   
 [7] dbplyr_1.4.2                          
 [8] scater_1.13.9                         
 [9] ggplot2_3.2.0                         
[10] TxDb.Mmusculus.UCSC.mm10.ensGene_3.4.0
[11] GenomicFeatures_1.37.4                
[12] AnnotationDbi_1.47.0                  
[13] SingleCellExperiment_1.7.0            
[14] SummarizedExperiment_1.15.5           
[15] DelayedArray_0.11.4                   
[16] BiocParallel_1.19.0                   
[17] matrixStats_0.54.0                    
[18] Biobase_2.45.0                        
[19] GenomicRanges_1.37.14                 
[20] GenomeInfoDb_1.21.1                   
[21] IRanges_2.19.10                       
[22] S4Vectors_0.23.17                     
[23] BiocGenerics_0.31.5                   
[24] BiocStyle_2.13.2                      
[25] Cairo_1.5-10                          

loaded via a namespace (and not attached):
 [1] bitops_1.0-6             bit64_0.9-7             
 [3] progress_1.2.2           httr_1.4.0              
 [5] tools_3.6.0              backports_1.1.4         
 [7] R6_2.4.0                 irlba_2.3.3             
 [9] HDF5Array_1.13.4         vipor_0.4.5             
[11] DBI_1.0.0                lazyeval_0.2.2          
[13] colorspace_1.4-1         withr_2.1.2             
[15] tidyselect_0.2.5         gridExtra_2.3           
[17] prettyunits_1.0.2        bit_1.1-14              
[19] curl_4.0                 compiler_3.6.0          
[21] BiocNeighbors_1.3.3      rtracklayer_1.45.2      
[23] bookdown_0.12            scales_1.0.0            
[25] askpass_1.1              rappdirs_0.3.1          
[27] stringr_1.4.0            digest_0.6.20           
[29] Rsamtools_2.1.3          R.utils_2.9.0           
[31] rmarkdown_1.14           XVector_0.25.0          
[33] pkgconfig_2.0.2          htmltools_0.3.6         
[35] rlang_0.4.0              RSQLite_2.1.2           
[37] DelayedMatrixStats_1.7.1 R.oo_1.22.0             
[39] dplyr_0.8.3              RCurl_1.95-4.12         
[41] magrittr_1.5             BiocSingular_1.1.5      
[43] GenomeInfoDbData_1.2.1   Rhdf5lib_1.7.3          
[45] Rcpp_1.0.2               ggbeeswarm_0.6.0        
[47] munsell_0.5.0            viridis_0.5.1           
[49] R.methodsS3_1.7.1        stringi_1.4.3           
[51] yaml_2.2.0               zlibbioc_1.31.0         
[53] rhdf5_2.29.0             grid_3.6.0              
[55] blob_1.2.0               dqrng_0.2.1             
[57] crayon_1.3.4             lattice_0.20-38         
[59] Biostrings_2.53.2        hms_0.5.0               
[61] locfit_1.5-9.1           zeallot_0.1.0           
[63] knitr_1.23               pillar_1.4.2            
[65] biomaRt_2.41.7           XML_3.98-1.20           
[67] glue_1.3.1               evaluate_0.14           
[69] BiocManager_1.30.4       vctrs_0.2.0             
[71] gtable_0.3.0             openssl_1.4.1           
[73] purrr_0.3.2              assertthat_0.2.1        
[75] xfun_0.8                 rsvd_1.0.2              
[77] viridisLite_0.3.0        tibble_2.1.3            
[79] GenomicAlignments_1.21.4 beeswarm_0.2.3          
[81] memoise_1.1.0           

Bibliography

Ilicic, T., J. K. Kim, A. A. Kołodziejczyk, F. O. Bagger, D. J. McCarthy, J. C. Marioni, and S. A. Teichmann. 2016. “Classification of low quality cells from single-cell RNA-seq data.” Genome Biol. 17 (1):29.

Islam, S., A. Zeisel, S. Joost, G. La Manno, P. Zajac, M. Kasper, P. Lonnerberg, and S. Linnarsson. 2014. “Quantitative single-cell RNA-seq with unique molecular identifiers.” Nat. Methods 11 (2):163–66.

Lun, A. 2018. “Overcoming Systematic Errors Caused by Log-Transformation of Normalized Single-Cell Rna Sequencing Data.” bioRxiv.

Lun, A., S. Riesenfeld, T. Andrews, T. P. Dao, T. Gomes, participants in the 1st Human Cell Atlas Jamboree, and J. Marioni. 2019. “EmptyDrops: distinguishing cells from empty droplets in droplet-based single-cell RNA sequencing data.” Genome Biol. 20 (1):63.

Lun, A. T. L., F. J. Calero-Nieto, L. Haim-Vilmovsky, B. Gottgens, and J. C. Marioni. 2017. “Assessing the reliability of spike-in normalization for analyses of single-cell RNA sequencing data.” Genome Res. 27 (11):1795–1806.

McCarthy, D. J., K. R. Campbell, A. T. Lun, and Q. F. Wills. 2017. “Scater: pre-processing, quality control, normalization and visualization of single-cell RNA-seq data in R.” Bioinformatics 33 (8):1179–86.

Phipson, B., and G. K. Smyth. 2010. “Permutation P-values should never be zero: calculating exact P-values when permutations are randomly drawn.” Stat. Appl. Genet. Mol. Biol. 9:Article 39.


  1. These numbers were rectally derived.

  2. “Be alert, not alarmed” - for those who remember the Australian Government’s indoctrination - uh, I mean, counter-terror campaign.