# Chapter 6 Quality Control

## 6.1 Motivation

Low-quality libraries in scRNA-seq data can arise from a variety of sources such as cell damage during dissociation or failure in library preparation (e.g., inefficient reverse transcription or PCR amplification). These usually manifest as “cells” with low total counts, few expressed genes and high mitochondrial or spike-in proportions. These low-quality libraries are problematic as they can contribute to misleading results in downstream analyses:

• They form their own distinct cluster(s), complicating interpretation of the results. This is most obviously driven by increased mitochondrial proportions or enrichment for nuclear RNAs after cell damage. In the worst case, low-quality libraries generated from different cell types can cluster together based on similarities in the damage-induced expression profiles, creating artificial intermediate states or trajectories between otherwise distinct subpopulations. Additionally, very small libraries can form their own clusters due to shifts in the mean upon transformation (A. Lun 2018).
• They distort the characterization of population heterogeneity during variance estimation or principal components analysis. The first few principal components will capture differences in quality rather than biology, reducing the effectiveness of dimensionality reduction. Similarly, genes with the largest variances will be driven by differences between low- and high-quality cells. The most obvious example involves low-quality libraries with very low counts where scaling normalization inflates the apparent variance of genes that happen to have a non-zero count in those libraries.
• They contain genes that appear to be strongly “upregulated” due to aggressive scaling to normalize for small library sizes. This is most problematic for contaminating transcripts (e.g., from the ambient solution) that are present in all libraries at low but constant levels. Increased scaling in low-quality libraries transforms small counts for these transcripts in large normalized expression values, resulting in apparent upregulation compared to other cells. This can be misleading as the affected genes are often biologically sensible but are actually expressed in another subpopulation.

To avoid - or at least mitigate - these problems, we need to remove these cells at the start of the analysis. This step is commonly referred to as quality control (QC) on the cells. (We will use “library” and “cell” rather interchangeably here, though the distinction will become important when dealing with droplet-based data.) We will demonstrate using a small scRNA-seq dataset from A. T. L. Lun et al. (2017), which is provided with no prior QC so that we can apply our own procedures.

#--- setup ---#
library(OSCAUtils)
chapterPreamble(use_cache = TRUE)

library(scRNAseq)
sce.416b <- LunSpikeInData(which="416b")
sce.416b$block <- factor(sce.416b$block)
sce.416b
## class: SingleCellExperiment
## dim: 46604 192
## assays(1): counts
## rownames(46604): ENSMUSG00000102693 ENSMUSG00000064842 ... ENSMUSG00000095742
##   CBFB-MYH11-mcherry
## rowData names(1): Length
## colnames(192): SLX-9555.N701_S502.C89V9ANXX.s_1.r_1 SLX-9555.N701_S503.C89V9ANXX.s_1.r_1
##   ... SLX-11312.N712_S508.H5H5YBBXX.s_8.r_1 SLX-11312.N712_S517.H5H5YBBXX.s_8.r_1
## colData names(9): Source Name cell line ... spike-in addition block
## reducedDimNames(0):
## spikeNames(0):
## altExpNames(2): ERCC SIRV

## 6.2 Choice of QC metrics

We use several common QC metrics to identify low-quality cells based on their expression profiles. These metrics are described below in terms of reads for SMART-seq2 data, but the same definitions apply to UMI data generated by other technologies like MARS-seq and droplet-based protocols.

• The library size is defined as the total sum of counts across all relevant features for each cell. Here, we will consider the relevant features to be the endogenous genes. Cells with small library sizes are of low quality as the RNA has been lost at some point during library preparation, either due to cell lysis or inefficient cDNA capture and amplification.
• The number of expressed features in each cell is defined as the number of endogenous genes with non-zero counts for that cell. Any cell with very few expressed genes is likely to be of poor quality as the diverse transcript population has not been successfully captured.
• The proportion of reads mapped to spike-in transcripts is calculated relative to the total count across all features (including spike-ins) for each cell. As the same amount of spike-in RNA should have been added to each cell, any enrichment in spike-in counts is symptomatic of loss of endogenous RNA. Thus, high proportions are indicative of poor-quality cells where endogenous RNA has been lost due to, e.g., partial cell lysis or RNA degradation during dissociation.
• In the absence of spike-in transcripts, the proportion of reads mapped to genes in the mitochondrial genome can be used. High proportions are indicative of poor-quality cells (Islam et al. 2014; Ilicic et al. 2016), presumably because of loss of cytoplasmic RNA from perforated cells. The reasoning is that, in the presence of modest damage, the holes in the cell membrane permit efflux of individual transcript molecules but are too small to allow mitochondria to escape, leading to a relative enrichment of mitochondrial transcripts.

For each cell, we calculate these QC metrics using the perCellQCMetrics() function from the scater package (McCarthy et al. 2017). The sum column contains the total count for each cell, the detected column contains the number of detected genes, subsets_Mito_percent contains the percentage of reads mapped to mitochondrial transcripts (based on Ensembl annotation) and altexps_ERCC_percent contains the percentage of reads mapped to ERCC transcripts.

# Identifying the mitochondrial transcripts:
library(AnnotationHub)
ens.mm.v97 <- AnnotationHub()[["AH73905"]]
location <- mapIds(ens.mm.v97, keys=rownames(sce.416b),
keytype="GENEID", column="SEQNAME")
is.mito <- which(location=="MT")

# Computing the QC metrics.
library(scater)
df <- perCellQCMetrics(sce.416b, subsets=list(Mito=is.mito))
df
## DataFrame with 192 rows and 16 columns
##           sum  detected percent_top_50 percent_top_100 percent_top_200 percent_top_500
##     <integer> <integer>      <numeric>       <numeric>       <numeric>       <numeric>
## 1      865936      7618        26.7218         32.2773         39.7208         52.9038
## 2     1076277      7521        29.4043         35.0354         42.2581         55.7454
## 3     1180138      8306        27.3454         32.4770         39.3296         51.9337
## 4     1342593      8143        35.8092         40.2666         46.2460         57.1210
## 5     1668311      7154        34.1198         39.0901         45.6660         58.2004
## ...       ...       ...            ...             ...             ...             ...
## 188    776622      8174        45.9362         49.7010         54.6101         64.4249
## 189   1299950      8956        38.0829         42.8930         49.0622         60.6675
## 190   1800696      9530        30.6675         35.5839         41.8550         53.6781
## 191     46731      6649        32.2998         37.9149         44.5999         56.5235
## 192   1866692     10964        26.6632         31.2584         37.5608         48.9489
##     subsets_Mito_sum subsets_Mito_detected subsets_Mito_percent altexps_ERCC_sum
##            <integer>             <integer>            <numeric>        <integer>
## 1              78790                    20              9.09882            65278
## 2              98613                    20              9.16242            74748
## 3             100341                    19              8.50248            60878
## 4             104882                    20              7.81190            60073
## 5             129559                    22              7.76588           136810
## ...              ...                   ...                  ...              ...
## 188            48126                    20              6.19684            61575
## 189           112225                    25              8.63302            94982
## 190           135693                    23              7.53559           113707
## 191             3505                    16              7.50037             7580
## 192           150375                    29              8.05569            48664
##     altexps_ERCC_detected altexps_ERCC_percent altexps_SIRV_sum altexps_SIRV_detected
##                 <integer>            <numeric>        <integer>             <integer>
## 1                      39              6.80658            27828                     7
## 2                      40              6.28030            39173                     7
## 3                      42              4.78949            30058                     7
## 4                      42              4.18567            32542                     7
## 5                      44              7.28887            71850                     7
## ...                   ...                  ...              ...                   ...
## 188                    39              7.17620            19848                     7
## 189                    41              6.65764            31729                     7
## 190                    40              5.81467            41116                     7
## 191                    44             13.48898             1883                     7
## 192                    39              2.51930            16289                     7
##     altexps_SIRV_percent     total
##                <numeric> <integer>
## 1                2.90165    959042
## 2                3.29130   1190198
## 3                2.36477   1271074
## 4                2.26741   1435208
## 5                3.82798   1876971
## ...                  ...       ...
## 188             2.313165    858045
## 189             2.224004   1426661
## 190             2.102562   1955519
## 191             3.350892     56194
## 192             0.843271   1931645

A key assumption here is that the QC metrics are independent of the biological state of each cell. Poor values (e.g., low library sizes, high mitochondrial proportions) are presumed to be driven by technical factors rather than biological processes, meaning that the subsequent removal of cells will not misrepresent the biology in downstream analyses. Major violations of this assumption would potentially result in the loss of cell types that have, say, systematically low RNA content or high numbers of mitochondria. We can check for such violations using some diagnostics described in Sections 6.4 and ??.

## 6.3 Identifying low-quality cells

### 6.3.1 With fixed thresholds

The simplest approach to identifying low-quality cells is to apply thresholds on the QC metrics. For example, we might consider cells to be low quality if they have library sizes below 100,000 reads; express fewer than 5,000 genes; have spike-in proportions above 10%; or have mitochondrial proportions above 10%.

qc.lib <- df$sum < 1e5 qc.nexprs <- df$detected < 5e3
qc.spike <- df$altexps_ERCC_percent > 10 qc.mito <- df$subsets_Mito_percent > 10
discard <- qc.lib | qc.nexprs | qc.spike | qc.mito

# Summarize the number of cells removed for each reason.
DataFrame(LibSize=sum(qc.lib), NExprs=sum(qc.nexprs),
SpikeProp=sum(qc.spike), MitoProp=sum(qc.mito), Total=sum(discard))
## DataFrame with 1 row and 5 columns
##     LibSize    NExprs SpikeProp  MitoProp     Total
##   <integer> <integer> <integer> <integer> <integer>
## 1         3         0        19        14        33

While simple, this strategy requires considerable experience to determine appropriate thresholds for each experimental protocol and biological system. Thresholds for read count-based data are simply not applicable for UMI-based data, and vice versa. Differences in mitochondrial activity or total RNA content require constant adjustment of the mitochondrial and spike-in thresholds, respectively, for different biological systems. Indeed, even with the same protocol and system, the appropriate threshold can vary from run to run due to the vagaries of cDNA capture efficiency and sequencing depth per cell.

#### 6.3.2.1 Identifying outliers

To obtain an adaptive threshold, we assume that most of the dataset consists of high-quality cells. We then identify cells that are outliers for the various QC metrics, based on the median absolute deviation (MAD) from the median value of each metric across all cells. Specifically, a value is considered an outlier if it is more than 3 MADs from the median in the “problematic” direction. This is loosely motivated by the fact that such a filter will retain 99% of non-outlier values that follow a normal distribution.

For the 416B data, we identify cells with log-transformed library sizes that are more than 3 MADs below the median. A log-transformation is used to improve resolution at small values when type="lower". In particular, it guarantees that the threshold is not a negative value, which would be meaningless for quality control on these metrics. Moreover, these metrics can occasionally exhibit a heavy right tail, and the log-transformation makes the distribution seem more normal to justify the 99% rationale mentioned above.

qc.lib2 <- isOutlier(df$sum, log=TRUE, type="lower") We do the same for the log-transformed number of expressed genes. qc.nexprs2 <- isOutlier(df$detected, log=TRUE, type="lower")

isOutlier() will also return the exact filter thresholds for each metric in the attributes of the output vector. These is useful for checking whether the automatically selected thresholds are appropriate.

attr(qc.lib2, "thresholds")
##  lower higher
## 434083    Inf
attr(qc.nexprs2, "thresholds")
##  lower higher
##   5231    Inf

We identify outliers for the proportion-based metrics in a similar manner. Here, no transformation is required as we are identifying large outliers that should already be clearly distinguishable from zero.

qc.spike2 <- isOutlier(df$altexps_ERCC_percent, type="higher") attr(qc.spike2, "thresholds") ## lower higher ## -Inf 14.15 qc.mito2 <- isOutlier(df$subsets_Mito_percent, type="higher")
attr(qc.mito2, "thresholds")
##  lower higher
##   -Inf  11.92

A cell that is an outlier for any of these metrics is considered to be of low-quality and discarded.

discard2 <- qc.lib2 | qc.nexprs2 | qc.spike2 | qc.mito2

# Summarize the number of cells removed for each reason.
DataFrame(LibSize=sum(qc.lib2), NExprs=sum(qc.nexprs2),
SpikeProp=sum(qc.spike2), MitoProp=sum(qc.mito2), Total=sum(discard2))
## DataFrame with 1 row and 5 columns
##     LibSize    NExprs SpikeProp  MitoProp     Total
##   <integer> <integer> <integer> <integer> <integer>
## 1         4         0         1         2         6

Alternatively, this entire process can be done in a single step using the quickPerCellQC() function. This is a wrapper that simply calls isOutlier() with the settings described above.

reasons <- quickPerCellQC(df, percent_subsets=c("subsets_Mito_percent",
"altexps_ERCC_percent"))
colSums(as.matrix(reasons))
##              low_lib_size            low_n_features high_subsets_Mito_percent
##                         4                         0                         2
##                         1                         6

With this strategy, the thresholds adapt to both the location and spread of the distribution of values for a given metric. This allows the QC procedure to adjust to changes in sequencing depth, cDNA capture efficiency, mitochondrial content, etc. without requiring any user intervention or prior experience. However, it does require some implicit assumptions that are discussed below in more detail.

#### 6.3.2.2 Assumptions of outlier detection

Outlier detection assumes that most cells are of acceptable quality. This is usually reasonable and can be experimentally supported in some situations by visually checking that the cells are intact, e.g., on the microwell plate. If most cells are of (unacceptably) low quality, the adaptive thresholds will obviously fail as they cannot remove the majority of cells. Of course, what is acceptable or not is in the eye of the beholder - neurons, for example, are notoriously difficult to dissociate, and we would often retain cells in a neuronal scRNA-seq dataset with QC metrics that would be unacceptable in a more amenable system like embryonic stem cells.

Another assumption discussed earlier is that the QC metrics are independent of the biological state of each cell. This assumption is most likely to be violated in highly heterogeneous cell populations where cell types that naturally have less RNA or more mitochondria are more likely to be considered outliers and removed, even if they are of high quality. The use of the MAD mitigates this problem to some extent by accounting for biological variability in the QC metrics. A heterogeneous population should have higher variability in the metrics among high-quality cells, increasing the MAD and reducing the chance of incorrectly removing particular cell types (at the cost of reducing power to remove low-quality cells).

In general, these assumptions are either reasonable or their violations have little effect on downstream conclusions. Nonetheless, it is helpful to keep them in mind when interpreting the results.

#### 6.3.2.3 Considering experimental factors

More complex studies may involve batches of cells generated with different experimental parameters (e.g., sequencing depth). In such cases, the adaptive strategy should be applied to each batch separately. It makes little sense to compute medians and MADs from a mixture distribution containing samples from multiple batches. For example, if the sequencing coverage is lower in one batch compared to the others, it will drag down the median and inflate the MAD. This will reduce the suitability of the adaptive threshold for the other batches.

If each batch is represented by its own SingleCellExperiment, the isOutlier() function can be directly applied to each batch as shown above. However, if cells from all batches have been merged into a single SingleCellExperiment, the batch= argument should be used to ensure that outliers are identified within each batch. This allows isOutlier() to accommodate systematic differences in the QC metrics across batches.

We will again illustrate using the 416B dataset, which contains two experimental factors - plate of origin and oncogene induction status. We combine these factors together and use this in the batch= argument to isOutlier() via quickPerCellQC(). This results in the removal of more cells as the MAD is no longer inflated by (i) systematic differences in sequencing depth between batches and (ii) differences in number of genes expressed upon oncogene induction.

batch <- paste0(sce.416b$phenotype, "-", sce.416b$Plate)
batch.reasons <- quickPerCellQC(df, percent_subsets=c("subsets_Mito_percent",
"altexps_ERCC_percent"), batch=batch)
colSums(as.matrix(batch.reasons))
##              low_lib_size            low_n_features high_subsets_Mito_percent
##                         4                         2                         2
##                         4                         7

That said, the use of batch= involves the stronger assumption that most cells in each batch are of high quality. If an entire batch failed, outlier detection will not be able to act as an appropriate QC filter for that batch. For example, two batches in the Grun et al. (2016) human pancreas dataset contain a substantial proportion of putative damaged cells with higher ERCC content than the other batches (Figure 6.1). This inflates the median and MAD within those batches, resulting in a failure to remove the assumed low-quality cells. In such cases, it is better to compute a shared median and MAD from the other batches and use those estimates to obtain an appropriate filter threshold for cells in the problematic batches, as shown below.

library(scRNAseq)
sce.grun <- GrunPancreasData()

discard.ercc <- isOutlier(sce.grun$altexps_ERCC_percent, type="higher", batch=sce.grun$donor)
with.blocking <- plotColData(sce.grun, x="donor", y="altexps_ERCC_percent",

discard.ercc2 <- isOutlier(sce.grun$altexps_ERCC_percent, type="higher", batch=sce.grun$donor,
subset=sce.grun$donor %in% c("D17", "D2", "D7")) without.blocking <- plotColData(sce.grun, x="donor", y="altexps_ERCC_percent", colour_by=I(discard.ercc2)) gridExtra::grid.arrange(with.blocking, without.blocking, ncol=2) To identify problematic batches, one useful rule of thumb is to find batches with QC thresholds that are themselves outliers compared to the thresholds of other batches. The assumption here is that most batches consist of a majority of high quality cells such that the threshold value should follow some unimodal distribution across “typical” batches. If we observe a batch with an extreme threshold value, we may suspect that it contains a large number of low-quality cells that inflate the per-batch MAD. We demonstrate this process below for the Grun et al. (2016) data. ercc.thresholds <- attr(discard.ercc, "thresholds")["higher",] ercc.thresholds ## D10 D17 D2 D3 D7 ## 73.611 7.600 6.011 113.106 15.217 names(ercc.thresholds)[isOutlier(ercc.thresholds, type="higher")] ## [1] "D10" "D3" If we cannot assume that most batches contain a majority of high-quality cells, then all bets are off; we must revert to the approach of picking an arbitrary threshold value (Section 6.3.1) and hoping for the best. ### 6.3.3 Other approaches Another strategy is to identify outliers in high-dimensional space based on the QC metrics for each cell. We use methods from robustbase to quantify the “outlyingness” of each cells based on their QC metrics, and then use isOutlier() to identify low-quality cells that exhibit unusually high levels of outlyingness. stats <- cbind(log10(df$sum), log10(df$detected), df$subsets_Mito_percent, df$altexps_ERCC_percent) library(robustbase) outlying <- adjOutlyingness(stats, only.outlyingness = TRUE) multi.outlier <- isOutlier(outlying, type = "higher") summary(multi.outlier) ## Mode FALSE TRUE ## logical 183 9 This and related approaches like PCA-based outlier detection and support vector machines can provide more power to distinguish low-quality cells from high-quality counterparts (Ilicic et al. 2016) as they can exploit patterns across many QC metrics. However, this comes at some cost to interpretability, as the reason for removing a given cell may not always be obvious. For completeness, we note that outliers can also be identified from the gene expression profiles, rather than QC metrics. We consider this to be a risky strategy as it can remove high-quality cells in rare populations. ## 6.4 Checking diagnostic plots It is good practice to inspect the distributions of QC metrics (Figure 6.2) to identify possible problems. In the most ideal case, we would see normal distributions that would justify the 3 MAD threshold used in outlier detection. A large proportion of cells in another mode suggests that the QC metrics might be correlated with some biological state, potentially leading to the loss of distinct cell types during filtering. Batches with systematically poor values for any metric can also be quickly identified for further troubleshooting or outright removal. colData(sce.416b) <- cbind(colData(sce.416b), df) sce.416b$block <- factor(sce.416b$block) sce.416b$phenotype <- ifelse(grepl("induced", sce.416b$phenotype), "induced", "wild type") sce.416b$discard <- reasons$discard gridExtra::grid.arrange( plotColData(sce.416b, x="block", y="sum", colour_by="discard", other_fields="phenotype") + facet_wrap(~phenotype) + scale_y_log10() + ggtitle("Total count"), plotColData(sce.416b, x="block", y="detected", colour_by="discard", other_fields="phenotype") + facet_wrap(~phenotype) + scale_y_log10() + ggtitle("Detected features"), plotColData(sce.416b, x="block", y="subsets_Mito_percent", colour_by="discard", other_fields="phenotype") + facet_wrap(~phenotype) + ggtitle("Mito percent"), plotColData(sce.416b, x="block", y="altexps_ERCC_percent", colour_by="discard", other_fields="phenotype") + facet_wrap(~phenotype) + ggtitle("ERCC percent"), ncol=1 ) Another useful diagnostic involves plotting the proportion of mitochondrial counts against some of the other QC metrics. The aim is to confirm that there are no cells with both large total counts and large mitochondrial counts, to ensure that we are not inadvertently removing high-quality cells that happen to be highly metabolically active (e.g., hepatocytes). In this case, we do not observe any points in the top-right corner in Figure 6.3. plotColData(sce.416b, x="sum", y="subsets_Mito_percent", colour_by="discard", other_fields=c("block", "phenotype")) + facet_grid(block~phenotype) + theme(panel.border = element_rect(color = "grey")) Comparison of the ERCC and mitochondrial percentages can also be informative (Figure 6.4). Low-quality cells with small mitochondrial percentages, large spike-in percentages and small library sizes are likely to be stripped nuclei, i.e., they have been so extensively damaged that they have lost all cytoplasmic content. Conversely, cells with high mitochondrial percentages and low ERCC percentages may represent undamaged cells that are metabolically active. plotColData(sce.416b, x="altexps_ERCC_percent", y="subsets_Mito_percent", colour_by="discard", other_fields=c("block", "phenotype")) + facet_grid(block~phenotype) + theme(panel.border = element_rect(color = "grey")) ## 6.5 Cell calling for droplet data ### 6.5.1 Background An unique aspect of droplet-based data is that we have no prior knowledge about whether a particular library (i.e., cell barcode) corresponds to cell-containing or empty droplets. Thus, we need to call cells from empty droplets based on the observed expression profiles. This is not entirely straightforward as empty droplets can contain ambient (i.e., extracellular) RNA that can be captured and sequenced, resulting in non-zero counts for libraries that do not contain any cell. To demonstrate, we obtain the unfiltered count matrix for the PBMC dataset from 10X Genomics. library(BiocFileCache) bfc <- BiocFileCache("raw_data", ask = FALSE) raw.path <- bfcrpath(bfc, file.path("http://cf.10xgenomics.com/samples", "cell-exp/2.1.0/pbmc4k/pbmc4k_raw_gene_bc_matrices.tar.gz")) untar(raw.path, exdir=file.path(tempdir(), "pbmc4k")) library(DropletUtils) library(Matrix) fname <- file.path(tempdir(), "pbmc4k/raw_gene_bc_matrices/GRCh38") sce.pbmc <- read10xCounts(fname, col.names=TRUE) sce.pbmc ## class: SingleCellExperiment ## dim: 33694 737280 ## metadata(1): Samples ## assays(1): counts ## rownames(33694): ENSG00000243485 ENSG00000237613 ... ENSG00000277475 ENSG00000268674 ## rowData names(2): ID Symbol ## colnames(737280): AAACCTGAGAAACCAT-1 AAACCTGAGAAACCGC-1 ... TTTGTCATCTTTAGTC-1 ## TTTGTCATCTTTCCTC-1 ## colData names(2): Sample Barcode ## reducedDimNames(0): ## spikeNames(0): ## altExpNames(0): The distribution of total counts exhibits a sharp transition between barcodes with large and small total counts (Figure 6.5), probably corresponding to cell-containing and empty droplets respectively. A simple approach would be to apply a threshold on the total count to only retain those barcodes with large totals. However, this unnecessarily discards libraries derived from cell types with low RNA content. bcrank <- barcodeRanks(counts(sce.pbmc)) # Only showing unique points for plotting speed. uniq <- !duplicated(bcrank$rank)
plot(bcrank$rank[uniq], bcrank$total[uniq], log="xy",
xlab="Rank", ylab="Total UMI count", cex.lab=1.2)

abline(h=metadata(bcrank)$inflection, col="darkgreen", lty=2) abline(h=metadata(bcrank)$knee, col="dodgerblue", lty=2)

legend("bottomleft", legend=c("Inflection", "Knee"),
col=c("darkgreen", "dodgerblue"), lty=2, cex=1.2)

### 6.5.2 Testing for empty droplets

We use the emptyDrops() function to test whether the expression profile for each cell barcode is significantly different from the ambient RNA pool (Lun et al. 2019). Any significant deviation indicates that the barcode corresponds to a cell-containing droplet. This allows us to discriminate between well-sequenced empty droplets and droplets derived from cells with little RNA, both of which would have similar total counts in Figure 6.5. We call cells at a false discovery rate (FDR) of 0.1%, meaning that no more than 0.1% of our called barcodes should be empty droplets on average.

# emptyDrops performs Monte Carlo simulations to compute p-values,
# so we need to set the seed to obtain reproducible results.
set.seed(100)
e.out <- emptyDrops(counts(sce.pbmc))

# See ?emptyDrops for an explanation of why there are NA values.
summary(e.out$FDR <= 0.001) ## Mode FALSE TRUE NA's ## logical 1056 4233 731991 emptyDrops() uses Monte Carlo simulations to compute $$p$$-values for the multinomial sampling transcripts from the ambient pool. The number of Monte Carlo iterations determines the lower bound for the $$p$$-values (Phipson and Smyth 2010). The Limited field in the output indicates whether or not the computed $$p$$-value for a particular barcode is bounded by the number of iterations. If any non-significant barcodes are TRUE for Limited, we may need to increase the number of iterations. A larger number of iterations will result in a lower $$p$$-value for these barcodes, which may allow them to be detected after correcting for multiple testing. table(Sig=e.out$FDR <= 0.001, Limited=e.out$Limited) ## Limited ## Sig FALSE TRUE ## FALSE 1056 0 ## TRUE 1661 2572 As mentioned above, emptyDrops() assumes that barcodes with low total UMI counts are empty droplets. Thus, the null hypothesis should be true for all of these barcodes. We can check whether the hypothesis testing procedure holds its size by examining the distribution of $$p$$-values for low-total barcodes with test.ambient=TRUE. Ideally, the distribution should be close to uniform (Figure 6.6). Large peaks near zero indicate that barcodes with total counts below lower are not all ambient in origin. This can be resolved by decreasing lower further to ensure that barcodes corresponding to droplets with very small cells are not used to estimate the ambient profile. set.seed(100) limit <- 100 all.out <- emptyDrops(counts(sce.pbmc), lower=limit, test.ambient=TRUE) hist(all.out$PValue[all.out$Total <= limit & all.out$Total > 0],
xlab="P-value", main="", col="grey80") 

Once we are satisfied with the performance of emptyDrops(), we subset our SingleCellExperiment object to retain only the detected cells. Discerning readers will notice the use of which(), which conveniently removes the NAs prior to the subsetting.

sce.pbmc <- sce.pbmc[,which(e.out$FDR <= 0.001)] It is worth pointing out that, at this point, we do not attempt to remove the ambient contamination from each library. Accurate quantification of the contamination rate in each cell is difficult as it generally requires some prior biological knowledge about genes that are expected to have mutually exclusive expression profiles and are highly abundant in the ambient solution (Young and Behjati 2018). Fortunately, ambient contamination usually has little effect on the downstream conclusions for routine analyses; cell type identities are usually easy enough to determine from the affected genes, notwithstanding a (mostly harmless) low background level of expression for marker genes that should be unique to a cell type. However, more susceptible analyses may require specific remedies like those discussed in Section 14.6. ### 6.5.3 Relationship with other QC metrics While emptyDrops() will distinguish cells from empty droplets, it makes no statement about the quality of the cells. It is entirely possible for droplets to contain damaged or dying cells, which need to be removed prior to downstream analysis. This is achieved using the same outlier-based strategy described in Section 6.3.2. Filtering on the mitochondrial proportion provides the most additional benefit in this situation, provided that we check that we are not removing a subpopulation of metabolically active cells (Figure 6.7). is.mito <- grep("^MT-", rowData(sce.pbmc)$Symbol)
pbmc.qc <- perCellQCMetrics(sce.pbmc, subsets=list(MT=is.mito))
discard.mito <- isOutlier(pbmc.qc$subsets_MT_percent, type="higher") summary(discard.mito) ## Mode FALSE TRUE ## logical 3922 311 plot(pbmc.qc$sum, pbmc.qc$subsets_MT_percent, log="x", xlab="Total count", ylab='Mitochondrial %') abline(h=attr(discard.mito, "thresholds")["higher"], col="red") emptyDrops() already removes cells with very low library sizes or (by association) low numbers of expressed genes. Thus, further filtering on these metrics is not strictly necessary. It may still be desirable to filter on both of these metrics to remove non-empty droplets containing cell fragments or stripped nuclei that were not caught by the mitochondrial filter. However, this should be weighed against the risk of losing genuine cell types as discussed in Section 6.3.2.2. Note that CellRanger version 3 automatically performs cell calling using an algorithm similar to emptyDrops(). If we had started our analysis with the filtered count matrix, we could go straight to computing other QC metrics. We would not need to run emptyDrops() manually as shown here, and indeed, attempting to do so would lead to nonsensical results if not outright software errors. Nonetheless, it may still be desirable to load the unfiltered matrix and apply emptyDrops() ourselves, on occasions where more detailed inspection or control of the cell-calling statistics is desired. ## 6.6 Removing low-quality cells Once low-quality cells have been identified, we can choose to either remove them or mark them. Removal is the most straightforward option and is achieved by subsetting the SingleCellExperiment by column. In this case, we use the low-quality calls from Section 6.3.2.3 to generate a subsetted SingleCellExperiment that we would use for downstream analyses. # Keeping the columns we DON'T want to discard. filtered <- sce.416b[,!reasons$discard]

The biggest practical concern during QC is whether an entire cell type is inadvertently discarded. There is always some risk of this occurring as the QC metrics are never fully independent of biological state. We can diagnose cell type loss by looking for systematic differences in gene expression between the discarded and retained cells. To demonstrate, we compute the average count across the discarded and retained pools in the 416B data set, and we compute the log-fold change between the pool averages.

# Using the 'discard' vector for demonstration,
# as it has more cells for stable calculation of 'lost'.

library(edgeR)
logged <- cpm(cbind(lost, kept), log=TRUE, prior.count=2)
logFC <- logged[,1] - logged[,2]
abundance <- rowMeans(logged)

If the discarded pool is enriched for a certain cell type, we should observe increased expression of the corresponding marker genes. No systematic upregulation of genes is apparent in the discarded pool in Figure 6.8, suggesting that the QC step did not inadvertently filter out a cell type in the 416B dataset.

plot(abundance, logFC, xlab="Average count", ylab="Log-FC (lost/kept)", pch=16)
points(abundance[is.mito], logFC[is.mito], col="dodgerblue", pch=16)

For comparison, let’s pretend that we applied a fixed threshold on the library size to filter cells in the PBMC data set. Specifically, we remove all libraries with a library size below 500.

alt.discard <- colSums(counts(sce.pbmc)) < 500

logged <- edgeR::cpm(cbind(lost, kept), log=TRUE, prior.count=2)
logFC <- logged[,1] - logged[,2]
abundance <- rowMeans(logged)

The presence of a distinct population in the discarded pool manifests in Figure 6.9 as a set of genes that are strongly upregulated in lost. This includes PF4, PPBP and CAVIN2, which (spoiler alert!) indicates that there is a platelet population that has been discarded by alt.discard.

plot(abundance, logFC, xlab="Average count", ylab="Log-FC (lost/kept)", pch=16)
platelet <- c("PF4", "PPBP", "CAVIN2")

library(org.Hs.eg.db)
ids <- mapIds(org.Hs.eg.db, keys=platelet, column="ENSEMBL", keytype="SYMBOL")
points(abundance[ids], logFC[ids], col="orange", pch=16)

If we suspect that cell types have been incorrectly discarded by our QC procedure, the most direct solution is to relax the QC filters for metrics that are associated with genuine biological differences. For example, outlier detection can be relaxed by increasing nmads= in the isOutlier() calls. Of course, this increases the risk of retaining more low-quality cells and encountering the problems discussed in Section 6.1. The logical endpoint of this line of reasoning is to avoid filtering altogether, as discussed in Section 6.7.

As an aside, it is worth mentioning that the true technical quality of a cell may also be correlated with its type. (This differs from a correlation between the cell type and the QC metrics, as the latter are our imperfect proxies for quality.) This can arise if some cell types are not amenable to dissociation or microfluidics handling during the scRNA-seq protocol. In such cases, it is possible to “correctly” discard an entire cell type during QC if all of its cells are damaged. Indeed, concerns over the computational removal of cell types during QC are probably minor compared to losses in the experimental protocol.

## 6.7 Marking low-quality cells

The other option is to simply mark the low-quality cells as such and retain them in the downstream analysis. The aim here is to allow clusters of low-quality cells to form, and then to identify and ignore such clusters during interpretation of the results. This approach avoids discarding cell types that have poor values for the QC metrics, giving users an opportunity to decide whether a cluster of such cells represents a genuine biological state.

marked <- sce.416b
marked$discard <- batch.reasons$discard

The downside is that it shifts the burden of QC to the interpretation of the clusters, which is already the bottleneck in scRNA-seq data analysis (Chapters 10, 11 and 12). Indeed, if we do not trust the QC metrics, we would have to distinguish between genuine cell types and low-quality cells based only on marker genes, and this is not always easy due to the tendency of the latter to “express” interesting genes (Section 6.1). Retention of low-quality cells also compromises the accuracy of the variance modelling, requiring, e.g., use of more PCs to offset the fact that the early PCs are driven by differences between low-quality and other cells.

For routine analyses, we suggest performing removal by default to avoid complications from low-quality cells. This allows most of the population structure to be characterized with no - or, at least, fewer - concerns about its validity. Once the initial analysis is done, and if there are any concerns about discarded cell types (Section ??), a more thorough re-analysis can be performed where the low-quality cells are only marked. This recovers cell types with low RNA content, high mitochondrial proportions, etc. that only need to be interpreted insofar as they “fill the gaps” in the initial analysis.

## Session Info

R Under development (unstable) (2019-12-29 r77627)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 14.04.6 LTS

Matrix products: default
BLAS/LAPACK: /app/easybuild/software/OpenBLAS/0.2.18-GCC-5.4.0-2.26-LAPACK-3.6.1/lib/libopenblas_prescottp-r0.2.18.so

locale:
[1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=en_US.UTF-8
[4] LC_COLLATE=C               LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
[10] LC_TELEPHONE=C             LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] parallel  stats4    stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] org.Hs.eg.db_3.10.0         edgeR_3.29.0                limma_3.43.0
[4] Matrix_1.2-18               DropletUtils_1.7.1          robustbase_0.93-5
[7] scRNAseq_2.1.5              scater_1.15.12              ggplot2_3.2.1
[10] ensembldb_2.11.2            AnnotationFilter_1.11.0     GenomicFeatures_1.39.2
[13] AnnotationDbi_1.49.0        AnnotationHub_2.19.3        BiocFileCache_1.11.4
[16] dbplyr_1.4.2                SingleCellExperiment_1.9.1  SummarizedExperiment_1.17.1
[19] DelayedArray_0.13.2         BiocParallel_1.21.2         matrixStats_0.55.0
[22] Biobase_2.47.2              GenomicRanges_1.39.1        GenomeInfoDb_1.23.1
[25] IRanges_2.21.2              S4Vectors_0.25.8            BiocGenerics_0.33.0
[28] Cairo_1.5-10                BiocStyle_2.15.3            OSCAUtils_0.0.1

loaded via a namespace (and not attached):
[1] ggbeeswarm_0.6.0              colorspace_1.4-1              XVector_0.27.0
[4] BiocNeighbors_1.5.1           farver_2.0.1                  bit64_0.9-7
[7] interactiveDisplayBase_1.25.0 R.methodsS3_1.7.1             knitr_1.26
[10] zeallot_0.1.0                 Rsamtools_2.3.2               R.oo_1.23.0
[13] HDF5Array_1.15.2              shiny_1.4.0                   BiocManager_1.30.10
[16] compiler_4.0.0                httr_1.4.1                    dqrng_0.2.1
[19] backports_1.1.5               assertthat_0.2.1              fastmap_1.0.1
[22] lazyeval_0.2.2                later_1.0.0                   BiocSingular_1.3.1
[25] htmltools_0.4.0               prettyunits_1.0.2             tools_4.0.0
[28] rsvd_1.0.2                    gtable_0.3.0                  glue_1.3.1
[31] GenomeInfoDbData_1.2.2        reshape2_1.4.3                dplyr_0.8.3
[34] rappdirs_0.3.1                Rcpp_1.0.3                    vctrs_0.2.1
[37] Biostrings_2.55.4             ExperimentHub_1.13.5          rtracklayer_1.47.0
[40] DelayedMatrixStats_1.9.0      xfun_0.11                     stringr_1.4.0
[43] ps_1.3.0                      mime_0.8                      lifecycle_0.1.0
[46] irlba_2.3.3                   XML_3.98-1.20                 DEoptimR_1.0-8
[49] zlibbioc_1.33.0               scales_1.1.0                  hms_0.5.2
[52] promises_1.1.0                ProtGenerics_1.19.3           rhdf5_2.31.1
[55] yaml_2.2.0                    curl_4.3                      memoise_1.1.0
[58] gridExtra_2.3                 biomaRt_2.43.0                stringi_1.4.3
[61] RSQLite_2.2.0                 BiocVersion_3.11.1            highr_0.8
[64] rlang_0.4.2                   pkgconfig_2.0.3               bitops_1.0-6
[67] evaluate_0.14                 lattice_0.20-38               Rhdf5lib_1.9.0
[70] purrr_0.3.3                   GenomicAlignments_1.23.1      labeling_0.3
[73] cowplot_1.0.0                 bit_1.1-14                    processx_3.4.1
[76] tidyselect_0.2.5              plyr_1.8.5                    magrittr_1.5
[79] bookdown_0.16                 R6_2.4.1                      DBI_1.1.0
[82] pillar_1.4.3                  withr_2.1.2                   RCurl_1.95-4.12
[85] tibble_2.1.3                  crayon_1.3.4                  rmarkdown_2.0
[88] viridis_0.5.1                 progress_1.2.2                locfit_1.5-9.1
[91] grid_4.0.0                    blob_1.2.0                    callr_3.4.0
[94] digest_0.6.23                 xtable_1.8-4                  httpuv_1.5.2
[97] R.utils_2.9.2                 openssl_1.4.1                 munsell_0.5.0
[100] beeswarm_0.2.3                viridisLite_0.3.0             vipor_0.4.5
[103] askpass_1.1                  

### Bibliography

Grun, D., M. J. Muraro, J. C. Boisset, K. Wiebrands, A. Lyubimova, G. Dharmadhikari, M. van den Born, et al. 2016. “De Novo Prediction of Stem Cell Identity using Single-Cell Transcriptome Data.” Cell Stem Cell 19 (2):266–77.

Ilicic, T., J. K. Kim, A. A. Kołodziejczyk, F. O. Bagger, D. J. McCarthy, J. C. Marioni, and S. A. Teichmann. 2016. “Classification of low quality cells from single-cell RNA-seq data.” Genome Biol. 17 (1):29.

Islam, S., A. Zeisel, S. Joost, G. La Manno, P. Zajac, M. Kasper, P. Lonnerberg, and S. Linnarsson. 2014. “Quantitative single-cell RNA-seq with unique molecular identifiers.” Nat. Methods 11 (2):163–66.

Lun, A. 2018. “Overcoming Systematic Errors Caused by Log-Transformation of Normalized Single-Cell Rna Sequencing Data.” bioRxiv.

Lun, A., S. Riesenfeld, T. Andrews, T. P. Dao, T. Gomes, participants in the 1st Human Cell Atlas Jamboree, and J. Marioni. 2019. “EmptyDrops: distinguishing cells from empty droplets in droplet-based single-cell RNA sequencing data.” Genome Biol. 20 (1):63.

Lun, A. T. L., F. J. Calero-Nieto, L. Haim-Vilmovsky, B. Gottgens, and J. C. Marioni. 2017. “Assessing the reliability of spike-in normalization for analyses of single-cell RNA sequencing data.” Genome Res. 27 (11):1795–1806.

McCarthy, D. J., K. R. Campbell, A. T. Lun, and Q. F. Wills. 2017. “Scater: pre-processing, quality control, normalization and visualization of single-cell RNA-seq data in R.” Bioinformatics 33 (8):1179–86.

Phipson, B., and G. K. Smyth. 2010. “Permutation P-values should never be zero: calculating exact P-values when permutations are randomly drawn.” Stat. Appl. Genet. Mol. Biol. 9:Article 39.

Young, M. D., and S. Behjati. 2018. “SoupX Removes Ambient RNA Contamination from Droplet Based Single Cell RNA Sequencing Data.” bioRxiv.