Chapter 9 Dimensionality reduction

9.1 Overview

Many scRNA-seq analysis procedures involve comparing cells based on their expression values across multiple genes. For example, clustering aims to identify cells with similar transcriptomic profiles by computing Euclidean distances across genes. In these applications, each individual gene represents a dimension of the data. More intuitively, if we had a scRNA-seq data set with two genes, we could make a two-dimensional plot where each axis represents the expression of one gene and each point in the plot represents a cell. This concept can be extended to data sets with thousands of genes where each cell’s expression profile defines its location in the high-dimensional expression space.

As the name suggests, dimensionality reduction aims to reduce the number of separate dimensions in the data. This is possible because different genes are correlated if they are affected by the same biological process. Thus, we do not need to store separate information for individual genes, but can instead compress multiple features into a single dimension, e.g., an “eigengene” (Langfelder and Horvath 2007). This reduces computational work in downstream analyses, as calculations only need to be performed for a few dimensions rather than thousands of genes; reduces noise by averaging across multiple genes to obtain a more precise representation of the patterns in the data; and enables effective plotting of the data, for those of us who are not capable of visualizing more than 3 dimensions.

We will use the Zeisel et al. (2015) dataset to demonstrate the applications of various dimensionality reduction methods in this chapter.
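
The object printed below was prepared following the workflow in the earlier chapters (feature aggregation, gene annotation, quality control and normalization). As a highly abbreviated and purely illustrative sketch of that preparation, one might run:

library(scRNAseq)
library(scater)
sce.zeisel <- ZeiselBrainData() # retrieve the raw dataset
# ...feature aggregation, annotation and quality control from earlier chapters omitted...
sce.zeisel <- logNormCounts(sce.zeisel) # compute log-normalized expression values
sce.zeisel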

## class: SingleCellExperiment 
## dim: 19839 2816 
## metadata(0):
## assays(2): counts logcounts
## rownames(19839): 0610005C13Rik 0610007N19Rik ... mt-Tw mt-Ty
## rowData names(2): featureType ENSEMBL
## colnames(2816): 1772071015_C02 1772071017_G12 ... 1772063068_D01
##   1772066098_A12
## colData names(10): tissue group # ... level1class level2class
## reducedDimNames(0):
## spikeNames(0):
## altExpNames(2): ERCC repeat

9.2 Principal components analysis

Principal components analysis (PCA) discovers axes in high-dimensional space that capture the largest amount of variation. This is best understood by imagining each axis as a line. Say we draw a line anywhere, and we move all cells in our data set onto this line by the shortest path. The variance captured by this axis is defined as the variance across cells along that line. In PCA, the first axis (or “principal component”, PC) is chosen such that it captures the greatest variance across cells. The next PC is chosen such that it is orthogonal to the first and captures the greatest remaining amount of variation, and so on.

By definition, the top PCs capture the dominant factors of heterogeneity in the data set. Thus, we can perform dimensionality reduction by restricting downstream analyses to the top PCs. This strategy is simple, highly effective and widely used throughout the data sciences. It takes advantage of the well-studied theoretical properties of the PCA - namely, that a low-rank approximation formed from the top PCs is the optimal approximation of the original data for a given matrix rank. It also allows us to use a wide range of fast PCA implementations for scalable and efficient data analysis.

When applying PCA to scRNA-seq data, our assumption is that biological processes affect multiple genes in a coordinated manner. This means that the earlier PCs are likely to represent biological structure as more variation can be captured by considering the correlated behaviour of many genes. By comparison, random technical or biological noise is expected to affect each gene independently. There is unlikely to be an axis that can capture random variation across many genes, meaning that noise should mostly be concentrated in the later PCs. This motivates the use of the earlier PCs in our downstream analyses, which concentrates the biological signal to simultaneously reduce computational work and remove noise.

We perform the PCA on the log-normalized expression values using the runPCA() function from scater. By default, runPCA() will compute the first 50 PCs and store them in the reducedDims() of the output SingleCellExperiment object, as shown below. Here, we use only the top 2000 genes with the largest biological components to reduce computational work and noise. Specifically, PCA is generally robust to random noise but an excess of it may cause the earlier PCs to capture noise instead of biological structure. This effect can be avoided - or at least mitigated - by restricting the PCA to a subset of HVGs, for which we can use any of the strategies described in Chapter 8.
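
A sketch of the calls that might produce the output shown below, assuming dec.zeisel contains the variance modelling results from modelGeneVar() as described in Chapter 8:

library(scran)
top.zeisel <- getTopHVGs(dec.zeisel, n=2000) # top 2000 HVGs by biological component

library(scater)
set.seed(100) # only relevant if an approximate SVD is used
sce.zeisel <- runPCA(sce.zeisel, subset_row=top.zeisel)
reducedDimNames(sce.zeisel)
dim(reducedDim(sce.zeisel, "PCA"))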

## [1] "PCA"
## [1] 2816   50

For large data sets, greater efficiency is obtained by using approximate SVD algorithms that only compute the top PCs. By default, most PCA-related functions in scater and scran will use methods from the irlba or rsvd packages to perform the SVD. We can explicitly specify the SVD algorithm to use by passing a BiocSingularParam object (from the BiocSingular package) to the BSPARAM= argument (see Section 20.2.2 for more details). Many of these approximate algorithms are based on randomization and thus require set.seed() to obtain reproducible results.
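
For example, a sketch using the IRLBA algorithm and storing the result under a separate (arbitrarily chosen) name:

library(BiocSingular)
set.seed(1000) # IRLBA involves randomization; the seed value is arbitrary
sce.zeisel <- runPCA(sce.zeisel, subset_row=top.zeisel,
    BSPARAM=IrlbaParam(), name="IRLBA")
reducedDimNames(sce.zeisel)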

## [1] "PCA"   "IRLBA"

9.3 Choosing the number of PCs

9.3.1 Motivation

How many of the top PCs should we retain for downstream analyses? The choice of the number of PCs \(d\) is a decision that is analogous to the choice of the number of HVGs to use. Using more PCs will avoid discarding biological signal in later PCs, at the cost of retaining more noise. Most practitioners will simply set \(d\) to a “reasonable” but arbitrary value, typically ranging from 10 to 50. This is often satisfactory provided it is coupled with sufficient testing of alternative values to explore other perspectives of the data at a different bias-variance trade-off. Nonetheless, we will describe some more data-driven strategies to guide a suitable choice of \(d\).

9.3.2 Using the elbow point

A simple heuristic for choosing \(d\) involves identifying the elbow point in the percentage of variance explained by successive PCs. This refers to the “elbow” in the curve of a scree plot as shown in Figure 9.1.
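
One way to identify the elbow programmatically is with findElbowPoint() from the PCAtools package, applied to the percentage of variance explained that runPCA() stores as an attribute of the PC matrix; a sketch:

# Percentage of variance explained per PC, stored by runPCA().
percent.var <- attr(reducedDim(sce.zeisel, "PCA"), "percentVar")
chosen.elbow <- PCAtools::findElbowPoint(percent.var)
chosen.elbow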

## [1] 7

Figure 9.1: Percentage of variance explained by successive PCs in the Zeisel brain data. The identified elbow point is marked with a red line.

Our assumption is that each of the top PCs capturing biological signal should explain much more variance than the remaining PCs. Thus, there should be a sharp drop in the percentage of variance explained when we move past the last “biological” PC. This manifests as an elbow in the scree plot, the location of which serves as a natural choice for \(d\).

From a practical perspective, the use of the elbow point tends to retain fewer PCs compared to other methods. The definition of “much more variance” is relative so, in order to be retained, later PCs must explain an amount of variance that is comparable to that explained by the first few PCs. Strong biological variation in the early PCs will shift the elbow to the left, potentially excluding weaker (but still interesting) variation in the PCs immediately following the elbow.

9.3.3 Using the technical noise

Another strategy is to retain all PCs until the percentage of total variation explained reaches some threshold \(T\). For example, we might retain the top set of PCs that explains 80% of the total variation in the data. Of course, it would be pointless to swap one arbitrary parameter \(d\) for another \(T\). Instead, we derive a suitable value for \(T\) by calculating the proportion of variance in the data that is attributed to the biological component. This is done using the denoisePCA() function with the variance modelling results from modelGeneVar() or related functions, where \(T\) is defined as the ratio of the sum of the biological components to the sum of total variances.

To illustrate, we use this strategy to pick the number of PCs in the 10X PBMC dataset. Note that explicit feature selection is not strictly necessary, as denoisePCA() will automatically restrict the PCA to genes with positive biological components (Section 8.3.4) to ensure that \(T\) is always a positive value. We also use IrlbaParam() to speed up the SVD as previously described.
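
A sketch of this call, where sce.pbmc and its variance modelling results dec.pbmc are assumed to have been prepared as in earlier chapters:

library(scran)
library(BiocSingular)
set.seed(10000) # the IRLBA algorithm involves randomization
sce.pbmc <- denoisePCA(sce.pbmc, technical=dec.pbmc, BSPARAM=IrlbaParam())
ncol(reducedDim(sce.pbmc, "PCA"))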

## [1] 8

The dimensionality of the output represents the lower bound on the number of PCs required to retain all biological variation. Any fewer PCs will definitely discard some aspect of biological signal. Note that the converse is not true, i.e., there is no guarantee that the retained PCs capture all of the signal, which is only generally possible if no dimensionality reduction is performed at all. The returned value of \(d\) provides a reasonable choice of rank when we want to retain as much signal as possible while still removing some noise.

From a practical perspective, the denoisePCA() approach usually retains more PCs than the elbow point method. This is because the former does not compare PCs to each other and thus does not discard PCs corresponding to secondary factors of variation. The downside is that many minor aspects of variation may not be interesting (e.g., transcriptional bursting) and their retention would only add irrelevant noise. Thus, whether this is a “better” approach depends on the analyst’s willingness to increase noise in order to preserve weaker biological signals.

Incidentally, denoisePCA() imposes internal caps on the number of PCs that can be chosen in this manner. By default, the number is bounded within the “reasonable” limits of 5 and 50 to avoid selection of too few PCs (when technical noise is high relative to biological variation) or too many PCs (when technical noise is very low). For example, applying this function to the Zeisel brain data hits the upper limit:
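
A sketch, reusing the (assumed) dec.zeisel and top.zeisel objects from above:

set.seed(1001) # seed value is arbitrary
denoised.zeisel <- denoisePCA(sce.zeisel, technical=dec.zeisel,
    subset.row=top.zeisel) # restriction to HVGs is optional here
ncol(reducedDim(denoised.zeisel, "PCA"))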

## [1] 50

9.3.4 Based on population structure

Yet another method to choose \(d\) uses information about the number of subpopulations in the data. Consider a situation where each subpopulation differs from the others along a different axis in the high-dimensional space (e.g., because it is defined by a unique set of marker genes). This suggests that we should set \(d\) to the number of unique subpopulations minus 1, which guarantees separation of all subpopulations while retaining as few dimensions (and noise) as possible. We can use this reasoning to loosely motivate an a priori choice for \(d\) - for example, if we expect around 10 different cell types in our population, we would set \(d \approx 10\).

In practice, the number of subpopulations is usually not known in advance. Rather, we use a heuristic approach that uses the number of clusters as a proxy for the number of subpopulations. We perform clustering (graph-based by default, see Chapter 10) on the first \(d^*\) PCs and only consider the values of \(d^*\) that yield no more than \(d^*+1\) clusters. If we detect more clusters with fewer dimensions, we consider this to represent overclustering rather than distinct subpopulations, assuming that multiple subpopulations should not be distinguishable on the same axes. We test a range of \(d^*\) and set \(d\) to the value that maximizes the number of clusters while satisfying the above condition. This attempts to capture as many distinct (putative) subpopulations as possible by retaining biological signal in later PCs, up until the point that the additional noise reduces resolution.
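
This heuristic is implemented in scran's getClusteredPCs() function, which reports the chosen number of PCs in the metadata of its output; a sketch:

library(scran)
pcs <- reducedDim(sce.zeisel, "PCA")
choices <- getClusteredPCs(pcs)
metadata(choices)$chosen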

## [1] 37

Figure 9.2: Number of clusters detected in the Zeisel brain dataset as a function of the number of PCs. The red unbroken line represents the theoretical upper constraint on the number of clusters, while the grey dashed line is the number of PCs suggested by getClusteredPCs().

This strategy is the most pragmatic as it directly addresses the role of the bias-variance trade-off in downstream analyses, specifically clustering. There is no need to preserve biological signal beyond what is distinguishable in later steps. However, it involves strong assumptions about the nature of the biological differences between subpopulations - and indeed, discrete subpopulations may not even exist in studies of continuous processes like differentiation.

9.3.5 Putting it together

Once we have chosen \(d\), applying it is as simple as subsetting the PC matrix by column. We then use the reducedDim()<- command to reassign the subsetted matrix back into the SingleCellExperiment object. For example, if we were to take the top 20 PCs, we would do:
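
A sketch of this subsetting, overwriting the existing "PCA" entry with its first 20 columns:

reducedDim(sce.zeisel, "PCA") <- reducedDim(sce.zeisel, "PCA")[,1:20]
ncol(reducedDim(sce.zeisel, "PCA"))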

## [1] 20

Downstream applications that use the "PCA" dimensionality reduction results in sce.zeisel will subsequently operate on the first 20 PCs only. Alternatively, some users may prefer to keep the full set of PCs, in which case the top set can be assigned to another name:
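
For example, the top 20 PCs might be stored under a new name such as "PCA_20" (the name itself is arbitrary):

reducedDim(sce.zeisel, "PCA_20") <- reducedDim(sce.zeisel, "PCA")[,1:20]
reducedDimNames(sce.zeisel)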

## [1] "PCA"    "IRLBA"  "PCA_20"

Note that this is not necessary if the desired number of PCs is directly specified in the initial call to runPCA(). Similarly, denoisePCA() will return its chosen number of PCs without requiring further user intervention.

9.4 Dimensionality reduction for visualization

9.4.1 Motivation

Another application of dimensionality reduction is to compress the data into 2 (sometimes 3) dimensions for plotting. This serves a separate purpose to the PCA-based dimensionality reduction described above. Algorithms are more than happy to operate on 10-50 PCs, but these are still too many dimensions for human comprehension. Further dimensionality reduction strategies are required to pack the most salient features of the data into 2 or 3 dimensions, which we will discuss below.

9.4.2 Visualizing with PCA

The simplest visualization approach is to plot the top 2 PCs (Figure 9.3):
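
This can be done with scater's plotReducedDim() function; a sketch, colouring cells by the authors' level 1 annotation:

plotReducedDim(sce.zeisel, dimred="PCA", colour_by="level1class")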


Figure 9.3: PCA plot of the first two PCs in the Zeisel brain data. Each point is a cell, coloured according to the annotation provided by the original authors.

The problem is that PCA is a linear technique, i.e., only variation along a line in high-dimensional space is captured by each PC. As such, it cannot efficiently pack differences in \(d\) dimensions into the first 2 PCs. This is demonstrated in Figure 9.3 where the top two PCs fail to resolve some subpopulations identified by Zeisel et al. (2015). If the first PC is devoted to resolving the biggest difference between subpopulations, and the second PC is devoted to resolving the next biggest difference, then the remaining differences will not be visible in the plot.

One workaround is to plot several of the top PCs against each other in pairwise plots (Figure 9.4). However, it is difficult to interpret multiple plots simultaneously, and even this approach is not sufficient to separate some of the annotated subpopulations.
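
A sketch using the ncomponents= argument of plotReducedDim() to obtain pairwise plots (the choice of 4 components here is illustrative):

plotReducedDim(sce.zeisel, dimred="PCA", ncomponents=4,
    colour_by="level1class")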


Figure 9.4: Pairwise PCA plots of several of the top PCs in the Zeisel brain data. Each point is a cell, coloured according to the annotation provided by the original authors.

There are some advantages to using PCA for visualization. It is predictable and will not introduce artificial structure in the visualization. It is also deterministic and robust to small changes in the input values. However, as shown above, PCA is usually not satisfactory for visualization of complex populations.

9.4.3 \(t\)-stochastic neighbor embedding

The de facto standard for visualization of scRNA-seq data is the \(t\)-stochastic neighbour embedding (\(t\)-SNE) method (Van der Maaten and Hinton 2008). This attempts to find a low-dimensional representation of the data that preserves the distances between each point and its neighbours in the high-dimensional space. Unlike PCA, it is not restricted to linear transformations, nor is it obliged to accurately represent distances between distant populations. This means that it has much more freedom in how it arranges cells in low-dimensional space, enabling it to separate many distinct clusters in a complex population (Figure 9.5).
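
A sketch of the \(t\)-SNE calculation on the stored PCs, followed by plotting of the new coordinates:

set.seed(100) # t-SNE initialization is random; the seed value is arbitrary
sce.zeisel <- runTSNE(sce.zeisel, dimred="PCA")
plotReducedDim(sce.zeisel, dimred="TSNE", colour_by="level1class")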


Figure 9.5: \(t\)-SNE plots constructed from the top PCs in the Zeisel brain dataset. Each point represents a cell, coloured according to the published annotation.

One of the main disadvantages of \(t\)-SNE is that it is much more computationally intensive than other visualization methods. We mitigate this effect by setting dimred="PCA" in runTSNE(), which instructs the function to perform the \(t\)-SNE calculations on the top PCs to exploit the data compaction and noise removal provided by the PCA. It is possible to run \(t\)-SNE on the original expression matrix but this is less efficient.

Another issue with \(t\)-SNE is that it requires the user to be aware of additional parameters (discussed here in some depth). It involves a random initialization so we need to (i) repeat the visualization several times to ensure that the results are representative and (ii) set the seed to ensure that the chosen results are reproducible. The “perplexity” is another important parameter that determines the granularity of the visualization (Figure 9.6). Low perplexities will favour resolution of finer structure, possibly to the point that the visualization is compromised by random noise. Thus, it is advisable to test different perplexity values to ensure that the choice of perplexity does not drive the interpretation of the plot.
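
A sketch of how one perplexity setting might be tested (the values here are illustrative; the other panels of Figure 9.6 follow the same pattern):

set.seed(100)
sce.zeisel <- runTSNE(sce.zeisel, dimred="PCA", perplexity=5)
plotReducedDim(sce.zeisel, dimred="TSNE", colour_by="level1class")
# ...repeat with, e.g., perplexity=20 and perplexity=80 for comparison.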


Figure 9.6: \(t\)-SNE plots constructed from the top PCs in the Zeisel brain dataset, using a range of perplexity values. Each point represents a cell, coloured according to its annotation.

Finally, it is tempting to interpret the \(t\)-SNE results as a “map” of single-cell identities. This is generally unwise as any such interpretation is easily misled by the size and positions of the visual clusters. Specifically, \(t\)-SNE will inflate dense clusters and compress sparse ones, such that we cannot use the size as a measure of subpopulation heterogeneity. Similarly, \(t\)-SNE is not obliged to preserve the relative locations of non-neighboring clusters, such that we cannot use their positions to determine relationships between distant clusters. We provide some suggestions on how to interpret these plots in Section 9.4.5.

Despite its shortcomings, \(t\)-SNE is a proven tool for general-purpose visualization of scRNA-seq data and remains a popular choice in many analysis pipelines.

9.4.4 Uniform manifold approximation and projection

The uniform manifold approximation and projection (UMAP) method (McInnes, Healy, and Melville 2018) is an alternative to \(t\)-SNE for non-linear dimensionality reduction. It is roughly similar to \(t\)-SNE in that it also tries to find a low-dimensional representation that preserves relationships between neighbors in high-dimensional space. However, the two methods are based on different theory, represented by differences in the various graph weighting equations. This manifests as a different visualization as shown in Figure 9.7.
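
A sketch of the corresponding UMAP call:

set.seed(1000) # UMAP also involves randomization; the seed value is arbitrary
sce.zeisel <- runUMAP(sce.zeisel, dimred="PCA")
plotReducedDim(sce.zeisel, dimred="UMAP", colour_by="level1class")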


Figure 9.7: UMAP plots constructed from the top PCs in the Zeisel brain dataset. Each point represents a cell, coloured according to the published annotation.

Compared to \(t\)-SNE, the UMAP visualization tends to have more compact visual clusters with more empty space between them. It also attempts to preserve more of the global structure than \(t\)-SNE. From a practical perspective, UMAP is much faster than \(t\)-SNE, which may be an important consideration for large datasets. (Nonetheless, we have still run UMAP on the top PCs here for consistency.) UMAP also involves a series of randomization steps so setting the seed is critical.

Like \(t\)-SNE, UMAP has its own suite of hyperparameters that affect the visualization. Of these, the number of neighbours (n_neighbors) and the minimum distance between embedded points (min_dist) have the greatest effect on the granularity of the output. If these values are too low, random noise will be incorrectly treated as high-resolution structure, while values that are too high will discard fine structure altogether in favour of obtaining an accurate overview of the entire dataset. Again, it is a good idea to test a range of values for these parameters to ensure that they do not compromise any conclusions drawn from a UMAP plot.
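
Both parameters can be passed through runUMAP(); a sketch with purely illustrative values:

set.seed(1000)
sce.zeisel <- runUMAP(sce.zeisel, dimred="PCA",
    n_neighbors=15, min_dist=0.5) # illustrative values, not recommendations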

It is arguable whether the UMAP or \(t\)-SNE visualizations are more useful or aesthetically pleasing. UMAP aims to preserve more global structure but this necessarily reduces resolution within each visual cluster. However, UMAP is unarguably much faster, and for that reason alone, it is increasingly displacing \(t\)-SNE as the method of choice for visualizing large scRNA-seq data sets.

9.4.5 Interpreting the plots

Dimensionality reduction for visualization necessarily involves discarding information and distorting the distances between cells in order to fit high-dimensional data into a 2-dimensional space. One might wonder whether the results of such extreme data compression can be trusted. Some of our more quantitative colleagues consider such visualizations to be more artistic than scientific, fit for little but impressing collaborators and reviewers. Perhaps this perspective is not entirely invalid, but we suggest that there is some value to be extracted from them provided that they are accompanied by an analysis of a higher-rank representation.

To illustrate, consider the interaction between clustering and \(t\)-SNE. As a general rule, we would not perform clustering on the \(t\)-SNE coordinates. Rather, we would cluster on the first 10-50 PCs (Chapter 10) and then visualize the cluster identities on the \(t\)-SNE plot. This ensures that clustering makes use of the information that was lost during compression into two dimensions. The \(t\)-SNE plot can then be used for a diagnostic inspection of the clustering output. In particular, the plot is most useful for checking whether two clusters are actually neighboring subclusters or whether a cluster can be split into further subclusters, which are generally safe interpretations of \(t\)-SNE coordinates.

It is worth elaborating on why we should not perform downstream analyses directly on the \(t\)-SNE or UMAP coordinates. Let us put aside the fact that operating on the high-dimensional representations preserves more information; from a naive perspective, using the \(t\)-SNE coordinates is very tempting as it ensures that any results are immediately consistent with the visualizations (regardless of whether they are right). However, this can actually be considered a disservice as it masks the uncertainty of the results, leading us to place more confidence in them than is warranted. Rather than being errors, major discrepancies between the visualization and the results obtained from the higher-dimensional representation can instead be useful for motivating further investigation into the more ambiguous parts of the dataset; conversely, the lack of discrepancies increases trust in the results.

Session Info

R version 3.6.1 (2019-07-05)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 14.04.5 LTS

Matrix products: default
BLAS:   /home/ramezqui/Rbuild/danbuild/R-3.6.1/lib/libRblas.so
LAPACK: /home/ramezqui/Rbuild/danbuild/R-3.6.1/lib/libRlapack.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=C              
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] parallel  stats4    stats     graphics  grDevices utils     datasets 
[8] methods   base     

other attached packages:
 [1] BiocSingular_1.2.0          scater_1.14.0              
 [3] ggplot2_3.2.1               scran_1.14.0               
 [5] SingleCellExperiment_1.8.0  SummarizedExperiment_1.16.0
 [7] DelayedArray_0.12.0         BiocParallel_1.20.0        
 [9] matrixStats_0.55.0          Biobase_2.46.0             
[11] GenomicRanges_1.38.0        GenomeInfoDb_1.22.0        
[13] IRanges_2.20.0              S4Vectors_0.24.0           
[15] BiocGenerics_0.32.0         Cairo_1.5-10               
[17] BiocStyle_2.14.0            OSCAUtils_0.0.1            

loaded via a namespace (and not attached):
 [1] viridis_0.5.1            edgeR_3.28.0            
 [3] viridisLite_0.3.0        DelayedMatrixStats_1.8.0
 [5] RcppParallel_4.4.4       assertthat_0.2.1        
 [7] statmod_1.4.32           highr_0.8               
 [9] BiocManager_1.30.9       dqrng_0.2.1             
[11] GenomeInfoDbData_1.2.2   vipor_0.4.5             
[13] yaml_2.2.0               ggrepel_0.8.1           
[15] pillar_1.4.2             lattice_0.20-38         
[17] glue_1.3.1               limma_3.42.0            
[19] digest_0.6.22            XVector_0.26.0          
[21] colorspace_1.4-1         plyr_1.8.4              
[23] cowplot_1.0.0            htmltools_0.4.0         
[25] Matrix_1.2-17            pkgconfig_2.0.3         
[27] bookdown_0.14            zlibbioc_1.32.0         
[29] purrr_0.3.3              scales_1.0.0            
[31] RSpectra_0.15-0          Rtsne_0.15              
[33] tibble_2.1.3             withr_2.1.2             
[35] lazyeval_0.2.2           magrittr_1.5            
[37] crayon_1.3.4             evaluate_0.14           
[39] FNN_1.1.3                beeswarm_0.2.3          
[41] tools_3.6.1              stringr_1.4.0           
[43] munsell_0.5.0            locfit_1.5-9.1          
[45] irlba_2.3.3              compiler_3.6.1          
[47] rsvd_1.0.2               rlang_0.4.1             
[49] grid_3.6.1               RCurl_1.95-4.12         
[51] BiocNeighbors_1.4.0      igraph_1.2.4.1          
[53] labeling_0.3             bitops_1.0-6            
[55] rmarkdown_1.16           gtable_0.3.0            
[57] reshape2_1.4.3           PCAtools_1.2.0          
[59] R6_2.4.0                 gridExtra_2.3           
[61] knitr_1.25               dplyr_0.8.3             
[63] uwot_0.1.4               stringi_1.4.3           
[65] ggbeeswarm_0.6.0         Rcpp_1.0.2              
[67] tidyselect_0.2.5         xfun_0.10               

Bibliography

Langfelder, P., and S. Horvath. 2007. “Eigengene networks for studying the relationships between co-expression modules.” BMC Syst Biol 1 (November):54.

McInnes, Leland, John Healy, and James Melville. 2018. “UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction.” arXiv E-Prints, February, arXiv:1802.03426.

Van der Maaten, L., and G. Hinton. 2008. “Visualizing Data Using t-SNE.” J. Mach. Learn. Res. 9: 2579–2605.

Zeisel, A., A. B. Munoz-Manchado, S. Codeluppi, P. Lonnerberg, G. La Manno, A. Jureus, S. Marques, et al. 2015. “Brain structure. Cell types in the mouse cortex and hippocampus revealed by single-cell RNA-seq.” Science 347 (6226):1138–42.