Chapter 7 Normalization

7.1 Motivation

Systematic differences in sequencing coverage between libraries are often observed in single-cell RNA sequencing data (Stegle, Teichmann, and Marioni 2015). They typically arise from technical differences in cDNA capture or PCR amplification efficiency across cells, attributable to the difficulty of achieving consistent library preparation with minimal starting material. Normalization aims to remove these differences such that they do not interfere with comparisons of the expression profiles between cells. This ensures that any observed heterogeneity or differential expression within the cell population are driven by biology and not technical biases.

At this point, it is worth noting the difference between normalization and batch correction (Chapter 26.2.6). Normalization occurs regardless of the batch structure and only considers technical biases, while batch correction - as the name suggests - only occurs across batches and must consider both technical biases and biological differences. Technical biases tend to affect genes in a similar manner, or at least in a manner related to their biophysical properties (e.g., length, GC content), while biological differences between batches can be highly unpredictable. As such, these two tasks involve different assumptions and generally involve different computational methods (though some packages aim to perform both steps at once, e.g., zinbwave). Thus, it is important to avoid conflating “normalized” and “batch-corrected” data, as these usually refer to different things.

We will mostly focus our attention on scaling normalization, which is the simplest and most commonly used class of normalization strategies. This involves dividing all counts for each cell by a cell-specific scaling factor, often called a “size factor” (Anders and Huber 2010). The assumption here is that any cell-specific bias (e.g., in capture or amplification efficiency) affects all genes equally via scaling of the expected mean count for that cell. The size factor for each cell represents the estimate of the relative bias in that cell, so division of its counts by its size factor should remove that bias. The resulting “normalized expression values” can then be used for downstream analyses such as clustering and dimensionality reduction. To demonstrate, we will use the Zeisel et al. (2015) dataset from the scRNAseq package.

## class: SingleCellExperiment 
## dim: 19839 2816 
## metadata(0):
## assays(1): counts
## rownames(19839): 0610005C13Rik 0610007N19Rik ... Zzef1 Zzz3
## rowData names(2): featureType ENSEMBL
## colnames(2816): 1772071015_C02 1772071017_G12 ... 1772063068_D01
##   1772066098_A12
## colData names(10): tissue group # ... level1class level2class
## reducedDimNames(0):
## spikeNames(0):
## altExpNames(2): ERCC repeat

7.2 Library size normalization

Library size normalization is the simplest strategy for performing scaling normalization. We define the library size as the total sum of counts across all genes for each cell. The “library size factor” for each cell is then directly proportional to its library size, where the proportionality constant is defined such that the mean size factor across all cells is equal to 1. This ensures that the normalized expression values are on the same scale as the original counts, which is useful for interpretation - especially when dealing with transformed data (see Section 7.5.1).

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.176   0.568   0.868   1.000   1.278   4.084

In the Zeisel brain data, the library size factors differ by up to 10-fold across cells (Figure 7.1). This is typical of the variability in coverage in scRNA-seq data.

Distribution of size factors derived from the library size in the Zeisel brain dataset.

Figure 7.1: Distribution of size factors derived from the library size in the Zeisel brain dataset.

Strictly speaking, the use of library size factors assumes that there is no “imbalance” in the differentially expressed (DE) genes between any pair of cells. That is, any upregulation for a subset of genes is cancelled out by the same magnitude of downregulation in a different subset of genes. This ensures that the library size is an unbiased estimate of the relative cell-specific bias. (Otherwise, the estimate would be compromised by composition biases, as discussed in Robinson and Oshlack (2010).) This may not be true in scRNA-seq applications, which means that library size normalization may not yield accurate normalized expression values for downstream analyses.

In practice, normalization accuracy is not a major consideration for exploratory scRNA-seq data analyses. Composition biases do not usually affect the separation of clusters, only the magnitude - and to a lesser extent, direction - of the log-fold changes between clusters or cell types. As such, library size normalization is usually sufficient in many applications where the aim is to identify clusters and the top markers that define each cluster.

7.3 Normalization by deconvolution

As previously mentioned, composition biases will be present when any unbalanced differential expression exists between samples. Consider the simple example of two cells where a single gene \(X\) is upregulated in one cell \(A\) compared to the other cell \(B\). This upregulation means that either (i) more sequencing resources are devoted to \(X\) in \(A\), thus decreasing coverage of all other non-DE genes when the total library size of each cell is experimentally fixed (e.g., due to library quantification); or (ii) the library size of \(A\) increases when \(X\) is assigned more reads or UMIs, increasing the library size factor and yielding smaller normalized expression values for all non-DE genes. In both cases, the net effect is that non-DE genes in \(A\) will incorrectly appear to be downregulated compared to \(B\).

The removal of composition biases is a well-studied problem for bulk RNA sequencing data analysis. Normalization can be performed with the estimateSizeFactorsFromMatrix() function in the DESeq2 package (Anders and Huber 2010; Love, Huber, and Anders 2014) or with the calcNormFactors() function (Robinson and Oshlack 2010) in the edgeR package. These assume that most genes are not DE between cells. Any systematic difference in count size across the non-DE majority of genes between two cells is assumed to represent bias that is used to compute an appropriate size factor for its removal.

However, single-cell data can be problematic for these bulk normalization methods due to the dominance of low and zero counts. To overcome this, we pool counts from many cells to increase the size of the counts for accurate size factor estimation (Lun, Bach, and Marioni 2016). Pool-based size factors are then “deconvolved” into cell-based factors for normalization of each cell’s expression profile. This is performed using the calculateSumFactors() function from scran, as shown below.

## clust.zeisel
##   1   2   3   4   5   6   7   8   9  10  11 
## 161 194 489 451 269 153 256 226 126 270 221
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.129   0.478   0.821   1.000   1.318   4.812

We use a pre-clustering step with quickCluster() where cells in each cluster are normalized separately and the size factors are rescaled to be comparable across clusters. This avoids the assumption that most genes are non-DE across the entire population - only a non-DE majority is required between pairs of clusters, which is a weaker assumption for highly heterogeneous populations. By default, quickCluster() will use an approximate algorithm for PCA based on methods from the irlba package. The approximation relies on stochastic initialization so we need to set the random seed (via set.seed()) for reproducibility.

We see that the deconvolution size factors exhibit cell type-specific deviations from the library size factors in Figure 7.2. This is consistent with the presence of composition biases that are introduced by strong differential expression between cell types. Use of the deconvolution size factors adjusts for these biases to improve normalization accuracy for downstream applications.

Deconvolution size factor for each cell in the Zeisel brain dataset, compared to the equivalent size factor derived from the library size. The red line corresponds to identity between the two size factors.

Figure 7.2: Deconvolution size factor for each cell in the Zeisel brain dataset, compared to the equivalent size factor derived from the library size. The red line corresponds to identity between the two size factors.

Accurate normalization is most important for procedures that involve estimation and interpretation of per-gene statistics. For example, composition biases can compromise DE analyses by systematically shifting the log-fold changes in one direction or another. However, it tends to provide less benefit over simple library size normalization for cell-based analyses such as clustering. The presence of composition biases already implies strong differences in expression profiles, so changing the normalization strategy is unlikely to affect the outcome of a clustering procedure.

7.4 Normalization by spike-ins

Spike-in normalization is based on the assumption that the same amount of spike-in RNA was added to each cell (A. T. L. Lun et al. 2017). Systematic differences in the coverage of the spike-in transcripts can only be due to cell-specific biases, e.g., in capture efficiency or sequencing depth. To remove these biases, we equalize spike-in coverage across cells by scaling with “spike-in size factors”. Compared to the previous methods, spike-in normalization requires no assumption about the biology of the system (i.e., the absence of many DE genes). Instead, it assumes that the spike-in transcripts were (i) added at a constant level to each cell, and (ii) respond to biases in the same relative manner as endogenous genes.

Practically, spike-in normalization should be used if differences in the total RNA content of individual cells are of interest and must be preserved in downstream analyses. For a given cell, an increase in its overall amount of endogenous RNA will not increase its spike-in size factor. This ensures that the effects of total RNA content on expression across the population will not be removed upon scaling. By comparison, the other normalization methods described above will simply interpret any change in total RNA content as part of the bias and remove it.

We demonstrate the use of spike-in normalization on a different dataset involving T cell activation after stimulation with T cell recepter ligands of varying affinity (Richard et al. 2018).

## class: SingleCellExperiment 
## dim: 46603 528 
## metadata(0):
## assays(1): counts
## rownames(46603): ENSMUSG00000102693 ENSMUSG00000064842 ...
##   ENSMUSG00000096730 ENSMUSG00000095742
## rowData names(0):
## colnames(528): SLX-12611.N701_S502. SLX-12611.N702_S502. ...
##   SLX-12612.i712_i522. SLX-12612.i714_i522.
## colData names(13): age individual ... stimulus time
## reducedDimNames(0):
## spikeNames(0):
## altExpNames(1): ERCC

We apply the computeSpikeFactors() method to estimate spike-in size factors for all cells. This is defined by converting the total spike-in count per cell into a size factor, using the same reasoning as in librarySizeFactors(). Scaling will subsequently remove any differences in spike-in coverage across cells.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.125   0.428   0.627   1.000   1.070  23.316

We observe a positive correlation between the spike-in size factors and deconvolution size factors within each treatment condition (Figure 7.3), indicating that they are capturing similar technical biases in sequencing depth and capture efficiency. However, we also observe that increasing stimulation of the T cell receptor - in terms of increasing affinity or time - results in a decrease in the spike-in factors relative to the library size factors. This is consistent with an increase in biosynthetic activity and total RNA content during stimulation, which reduces the relative spike-in coverage in each library (thereby decreasing the spike-in size factors) but increases the coverage of endogenous genes (thus increasing the library size factors).

Size factors from spike-in normalization, plotted against the library size factors for all cells in the T cell dataset. Each plot represents a different ligand treatment and each point is a cell coloured according by time from stimulation.

Figure 7.3: Size factors from spike-in normalization, plotted against the library size factors for all cells in the T cell dataset. Each plot represents a different ligand treatment and each point is a cell coloured according by time from stimulation.

The differences between these two sets of size factors have real consequences for downstream interpretation. If the spike-in size factors were applied to the counts, the expression values in unstimulated cells would be scaled up while expression in stimulated cells would be scaled down. However, the opposite would occur if the deconvolution size factors were used. This can manifest as shifts in the magnitude and direction of DE between conditions when we switch between normalization strategies, as shown below for Malat1 (Figure 7.4).

Distribution of log-normalized expression values for _Malat1_ after normalization with the deconvolution size factors (left) or spike-in size factors (right). Cells are stratified by the ligand affinity and colored by the time after stimulation.

Figure 7.4: Distribution of log-normalized expression values for Malat1 after normalization with the deconvolution size factors (left) or spike-in size factors (right). Cells are stratified by the ligand affinity and colored by the time after stimulation.

Whether or not total RNA content is relevant – and thus, the choice of normalization strategy – depends on the biological hypothesis. In most cases, changes in total RNA content are not interesting and can be normalized out by applying the library size or deconvolution factors. However, this may not always be appropriate if differences in total RNA are associated with a biological process of interest, e.g., cell cycle activity or T cell activation. Spike-in normalization will preserve these differences such that any changes in expression between biological groups have the correct sign.

However! Regardless of whether we care about total RNA content, it is critical that the spike-in transcripts are normalized using the spike-in size factors. Size factors computed from the counts for endogenous genes should not be applied to the spike-in transcripts, precisely because the former captures differences in total RNA content that are not experienced by the latter. Attempting to normalize the spike-in counts with the gene-based size factors will lead to over-normalization and incorrect quantification. Thus, whenever spike-in data is present, we must compute a separate set of size factors for the spike-in set. This is discussed below for the Zeisel dataset.

7.5 Applying the size factors

7.5.1 Scaling and log-transforming

Once we have computed the size factors, we use the logNormCounts() function from scater to compute normalized expression values for each cell. This is done by dividing the count for each gene/spike-in transcript with the appropriate size factor for that cell. The function also log-transforms the normalized values, creating a new assay called "logcounts". (Technically, these are “log-transformed normalized expression values”, but that’s too much of a mouthful to fit into the assay name.) These log-values will be the basis of our downstream analyses in the following chapters.

## [1] "counts"    "logcounts"

The log-transformation is useful as differences in the log-values represent log-fold changes in expression. This is important in downstream procedures based on Euclidean distances, which includes many forms of clustering and dimensionality reduction. By operating on log-transformed data, we ensure that these procedures are measuring distances between cells based on log-fold changes in expression. Or in other words, which is more interesting - a gene that is expressed at an average count of 50 in cell type \(A\) and 10 in cell type \(B\), or a gene that is expressed at an average count of 1100 in \(A\) and 1000 in \(B\)? Log-transformation focuses on the former by promoting contributions from genes with strong relative differences.

When log-transforming, we need to consider the pseudo-count that is added to avoid undefined values at zero. Larger pseudo-counts will effectively shrink the log-fold changes between cells towards zero for low-abundance genes, meaning that downstream analyses will be driven more by differences in expression for high-abundance genes. Conversely, smaller pseudo-counts will increase the contribution of low-abundance genes. Common practice is to use a pseudo-count of 1, for the simple pragmatic reason that it preserves sparsity in the original matrix (i.e., zeroes in the input remain zeroes after transformation). This works well in all but the most pathological scenarios (A. Lun 2018).

(Incidentally, the addition of the pseudo-count is the motivation for the centering of the size factors at unity. This ensures that both the pseudo-count and the normalized expression values are on the same scale. A pseudo-count of 1 can then be interpreted as an extra read or UMI for each gene. In practical terms, this means that the shrinkage effect of the pseudo-count diminishes as sequencing depth improves. This is the correct behavior as it allows log-fold change estimates to become increasingly accurate with deeper coverage. In contrast, if we had simply divided each cell’s counts by its library size before log-transformation, accuracy of the log-fold changes would never improve regardless of how much additional sequencing we performed.)

7.5.2 Downsampling and log-transforming

In rare cases, direct scaling of the counts is not appropriate due to the effect described by A. Lun (2018). Briefly, this is caused by the fact that the mean of the log-normalized counts is not the same as the log-transformed mean of the normalized counts. The difference between them depends on the mean and variance of the original counts, such that there is a systematic trend in the mean of the log-counts with respect to the count size. This typically manifests as trajectories correlated strongly with library size even after library size normalization, as shown in Figure 7.5 for synthetic scRNA-seq data generated with a pool-and-split approach (Tian et al. 2019).

PCA plot of all pool-and-split libraries in the SORT-seq CellBench data, computed from the log-normalized expression values with library size-derived size factors. Each point represents a library and is colored by the mixing ratio used to construct it (left) or by the size factor (right).

Figure 7.5: PCA plot of all pool-and-split libraries in the SORT-seq CellBench data, computed from the log-normalized expression values with library size-derived size factors. Each point represents a library and is colored by the mixing ratio used to construct it (left) or by the size factor (right).

As the problem arises from differences in the sizes of the counts, the most straightforward solution is to downsample the counts of the high-coverage cells to match those of low-coverage cells. This uses the size factors to determine the amount of downsampling for each cell required to reach the 1st percentile of size factors. (The small minority of cells with smaller size factors are simply scaled up. We do not attempt to downsample to the smallest size factor, as this would result in excessive loss of information for one aberrant cell with very low size factors.) We can see that this eliminates the library size factor-associated trajectories from the first two PCs, improving resolution of the known differences based on mixing ratios (Figure 7.6). The log-transformation is still necessary but no longer introduces a shift in the means when the sizes of the counts are similar across cells.

PCA plot of pool-and-split libraries in the SORT-seq CellBench data, computed from the log-transformed counts after downsampling in proportion to the library size factors. Each point represents a library and is colored by the mixing ratio used to construct it (left) or by the size factor (right).

Figure 7.6: PCA plot of pool-and-split libraries in the SORT-seq CellBench data, computed from the log-transformed counts after downsampling in proportion to the library size factors. Each point represents a library and is colored by the mixing ratio used to construct it (left) or by the size factor (right).

While downsampling is an expedient solution, it is statistically inefficient as it needs to increase the noise of high-coverage cells in order to avoid differences with low-coverage cells. It is also slower than simple scaling. Thus, we would only recommend using this approach after an initial analysis with scaled counts reveals suspicious trajectories that are strongly correlated with the size factors. In such cases, it is a simple matter to re-normalize by downsampling to determine whether the trajectory is an artifact of the log-transformation.

7.5.3 Other options

Of course, log-transformation is not the only possible transformation. More sophisticated approaches can be used such as dedicated variance stabilizing transformations (e.g., from the DESeq2 or sctransform packages), which out-perform the log-transformation for removal of the mean-variance trend. In practice, though, the log-transformation is a good default choice due to its simplicity (a.k.a., reliability, predictability and computational efficiency) and interpretability.

Session Info

R version 3.6.1 (2019-07-05)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 14.04.5 LTS

Matrix products: default
BLAS:   /home/ramezqui/Rbuild/danbuild/R-3.6.1/lib/libRblas.so
LAPACK: /home/ramezqui/Rbuild/danbuild/R-3.6.1/lib/libRlapack.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] parallel  stats4    stats     graphics  grDevices utils     datasets 
[8] methods   base     

other attached packages:
 [1] BiocFileCache_1.9.1         dbplyr_1.4.2               
 [3] scRNAseq_1.99.6             scran_1.13.26              
 [5] scater_1.13.25              ggplot2_3.2.1              
 [7] SingleCellExperiment_1.7.11 SummarizedExperiment_1.15.9
 [9] DelayedArray_0.11.8         BiocParallel_1.19.3        
[11] matrixStats_0.55.0          Biobase_2.45.1             
[13] GenomicRanges_1.37.16       GenomeInfoDb_1.21.2        
[15] IRanges_2.19.16             S4Vectors_0.23.25          
[17] BiocGenerics_0.31.6         Cairo_1.5-10               
[19] BiocStyle_2.13.2            OSCAUtils_0.0.1            

loaded via a namespace (and not attached):
 [1] bitops_1.0-6                  bit64_0.9-7                  
 [3] httr_1.4.1                    tools_3.6.1                  
 [5] backports_1.1.5               R6_2.4.0                     
 [7] irlba_2.3.3                   HDF5Array_1.13.9             
 [9] vipor_0.4.5                   DBI_1.0.0                    
[11] lazyeval_0.2.2                colorspace_1.4-1             
[13] withr_2.1.2                   tidyselect_0.2.5             
[15] gridExtra_2.3                 curl_4.2                     
[17] bit_1.1-14                    compiler_3.6.1               
[19] BiocNeighbors_1.3.5           labeling_0.3                 
[21] bookdown_0.14                 scales_1.0.0                 
[23] rappdirs_0.3.1                stringr_1.4.0                
[25] digest_0.6.21                 R.utils_2.9.0                
[27] rmarkdown_1.16                XVector_0.25.0               
[29] pkgconfig_2.0.3               htmltools_0.4.0              
[31] fastmap_1.0.1                 limma_3.41.17                
[33] highr_0.8                     rlang_0.4.0                  
[35] RSQLite_2.1.2                 shiny_1.4.0                  
[37] DelayedMatrixStats_1.7.2      R.oo_1.22.0                  
[39] dplyr_0.8.3                   RCurl_1.95-4.12              
[41] magrittr_1.5                  BiocSingular_1.1.7           
[43] GenomeInfoDbData_1.2.1        Matrix_1.2-17                
[45] Rhdf5lib_1.7.5                Rcpp_1.0.2                   
[47] ggbeeswarm_0.6.0              munsell_0.5.0                
[49] viridis_0.5.1                 R.methodsS3_1.7.1            
[51] stringi_1.4.3                 yaml_2.2.0                   
[53] edgeR_3.27.13                 zlibbioc_1.31.0              
[55] rhdf5_2.29.3                  AnnotationHub_2.17.10        
[57] grid_3.6.1                    blob_1.2.0                   
[59] promises_1.1.0                dqrng_0.2.1                  
[61] ExperimentHub_1.11.6          crayon_1.3.4                 
[63] lattice_0.20-38               cowplot_1.0.0                
[65] locfit_1.5-9.1                zeallot_0.1.0                
[67] knitr_1.25                    pillar_1.4.2                 
[69] igraph_1.2.4.1                glue_1.3.1                   
[71] evaluate_0.14                 BiocManager_1.30.7           
[73] httpuv_1.5.2                  vctrs_0.2.0                  
[75] gtable_0.3.0                  purrr_0.3.2                  
[77] assertthat_0.2.1              xfun_0.10                    
[79] DropletUtils_1.5.10           mime_0.7                     
[81] rsvd_1.0.2                    xtable_1.8-4                 
[83] later_1.0.0                   viridisLite_0.3.0            
[85] tibble_2.1.3                  AnnotationDbi_1.47.1         
[87] beeswarm_0.2.3                memoise_1.1.0                
[89] statmod_1.4.32                interactiveDisplayBase_1.23.0

Bibliography

Anders, S., and W. Huber. 2010. “Differential expression analysis for sequence count data.” Genome Biol. 11 (10):R106.

Love, M. I., W. Huber, and S. Anders. 2014. “Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2.” Genome Biol. 15 (12):550.

Lun, A. 2018. “Overcoming Systematic Errors Caused by Log-Transformation of Normalized Single-Cell Rna Sequencing Data.” bioRxiv.

Lun, A. T., K. Bach, and J. C. Marioni. 2016. “Pooling across cells to normalize single-cell RNA sequencing data with many zero counts.” Genome Biol. 17 (April):75.

Lun, A. T. L., F. J. Calero-Nieto, L. Haim-Vilmovsky, B. Gottgens, and J. C. Marioni. 2017. “Assessing the reliability of spike-in normalization for analyses of single-cell RNA sequencing data.” Genome Res. 27 (11):1795–1806.

Richard, A. C., A. T. L. Lun, W. W. Y. Lau, B. Gottgens, J. C. Marioni, and G. M. Griffiths. 2018. “T cell cytolytic capacity is independent of initial stimulation strength.” Nat. Immunol. 19 (8):849–58.

Robinson, M. D., and A. Oshlack. 2010. “A scaling normalization method for differential expression analysis of RNA-seq data.” Genome Biol. 11 (3):R25.

Stegle, O., S. A. Teichmann, and J. C. Marioni. 2015. “Computational and analytical challenges in single-cell transcriptomics.” Nat. Rev. Genet. 16 (3):133–45.

Tian, L., X. Dong, S. Freytag, K. A. Le Cao, S. Su, A. JalalAbadi, D. Amann-Zalcenstein, et al. 2019. “Benchmarking single cell RNA-sequencing analysis pipelines using mixture control experiments.” Nat. Methods 16 (6):479–87.

Zeisel, A., A. B. Munoz-Manchado, S. Codeluppi, P. Lonnerberg, G. La Manno, A. Jureus, S. Marques, et al. 2015. “Brain structure. Cell types in the mouse cortex and hippocampus revealed by single-cell RNA-seq.” Science 347 (6226):1138–42.