# Chapter 9 Dimensionality reduction

## 9.1 Overview

Many scRNA-seq analysis procedures involve comparing cells based on their expression values across multiple genes. For example, clustering aims to identify cells with similar transcriptomic profiles by computing Euclidean distances across genes. In these applications, each individual gene represents a dimension of the data. More intuitively, if we had a scRNA-seq data set with two genes, we could make a two-dimensional plot where each axis represents the expression of one gene and each point in the plot represents a cell. This concept can be extended to data sets with thousands of genes where each cell’s expression profile defines its location in the high-dimensional expression space.
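
To make this geometric picture concrete, the following minimal sketch builds a toy expression matrix with two genes and three cells and computes the Euclidean distances between cells using base R. All values and the names expr, cell.dist and manual are invented for illustration; they are not part of the Zeisel analysis.

```r
# Toy log-expression matrix: rows are genes, columns are cells (made-up values).
expr <- matrix(c(1.0, 0.0,    # cell1
                 1.2, 0.1,    # cell2
                 5.0, 4.0),   # cell3
               nrow = 2,
               dimnames = list(c("GeneA", "GeneB"),
                               c("cell1", "cell2", "cell3")))

# dist() compares rows, so we transpose to put one cell in each row.
cell.dist <- as.matrix(dist(t(expr), method = "euclidean"))

# The same calculation by hand for one pair of cells.
manual <- sqrt(sum((expr[, "cell1"] - expr[, "cell3"])^2))

cell.dist
```

Each cell is a point whose coordinates are its expression values, so cell1 and cell2 sit close together while cell3 lies far from both. The same calculation applies unchanged when the matrix has thousands of rows.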

Dimensionality reduction aims to reduce the number of separate dimensions[1] in the data. This is possible because different genes are correlated if they are affected by the same biological process. Thus, we do not need to store separate information for individual genes, but can instead compress multiple features into a single dimension, e.g., an “eigengene” (Langfelder and Horvath 2007). This reduces computational work in downstream analyses, as calculations only need to be performed for a few dimensions rather than thousands of genes; reduces noise by averaging across multiple genes to obtain a more precise representation of the patterns in the data; and enables effective plotting of the data, for those of us who are not capable of visualizing more than 3 dimensions.

We will use the Zeisel et al. (2015) dataset to demonstrate the applications of various dimensionality reduction methods in this chapter.

### loading ###
library(scRNAseq)
sce.zeisel <- ZeiselBrainData()
sce.zeisel <- sce.zeisel[rowData(sce.zeisel)$featureType!="repeat",]

library(scater)
sce.zeisel <- aggregateAcrossFeatures(sce.zeisel,
    id=sub("_loc[0-9]+$", "", rownames(sce.zeisel)))

### gene-annotation ###
library(org.Mm.eg.db)
ensembl <- mapIds(org.Mm.eg.db, keys=rownames(sce.zeisel),
    keytype="SYMBOL", column="ENSEMBL")
rowData(sce.zeisel)$ENSEMBL <- ensembl

### quality-control ###
stats <- perCellQCMetrics(sce.zeisel)
qc <- quickCellQC(stats, percent_subsets="altexps_ERCC_percent", nmads=3)
sce.zeisel <- sce.zeisel[,!qc$discard]

### normalization ###
library(scran)
set.seed(1000)
clusters <- quickCluster(sce.zeisel)
sce.zeisel <- computeSumFactors(sce.zeisel, cluster=clusters, min.mean=0.1)
sce.zeisel <- logNormCounts(sce.zeisel)

### variance-modelling ###
dec.zeisel <- modelGeneVarWithSpikes(sce.zeisel, "ERCC")
sce.zeisel
## class: SingleCellExperiment
## dim: 19839 2937
## assays(2): counts logcounts
## rownames(19839): 0610005C13Rik 0610007N19Rik ... mt-Tw mt-Ty
## rowData names(2): featureType ENSEMBL
## colnames(2937): 1772071015_C02 1772071017_G12 ... 1772066098_A12
##   1772058148_F03
## colData names(10): tissue group # ... level1class level2class
## reducedDimNames(0):
## spikeNames(0):
## altExpNames(2): ERCC repeat

## 9.2 Principal components analysis

### 9.2.1 Background

Principal components analysis (PCA) discovers axes in high-dimensional space that capture the largest amount of variation. This is best understood by imagining each axis as a line. Say we draw a line anywhere, and we move all cells in our data set onto this line by the shortest path. The variance captured by this axis is defined as the variance across cells along that line. In PCA, the first axis (or “principal component”, PC) is chosen such that it captures the greatest variance across cells. The next PC is chosen such that it is orthogonal to the first and captures the greatest remaining amount of variation, and so on.
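
The idea of "variance captured along a line" can be checked numerically. The sketch below simulates two correlated genes, projects the cells onto lines at many different angles, and compares the largest variance found in this way to the variance of the first PC reported by prcomp() from base R. The data are simulated and the helper var.along() is a hypothetical name introduced only for this illustration.

```r
set.seed(42)
n <- 200

# Two correlated "genes" across n simulated cells.
g1 <- rnorm(n)
g2 <- 0.8 * g1 + rnorm(n, sd = 0.3)
x <- scale(cbind(g1, g2), center = TRUE, scale = FALSE)

# Variance across cells after moving each cell onto a line through the
# origin at the given angle (i.e., an orthogonal projection onto that line).
var.along <- function(angle) {
    axis <- c(cos(angle), sin(angle))
    var(as.vector(x %*% axis))
}

# Brute-force search over lines in steps of one degree.
angles <- seq(0, pi, length.out = 181)
captured <- vapply(angles, var.along, numeric(1))
best.by.search <- max(captured)

# The first PC should capture (essentially) the same amount of variance.
pc <- prcomp(x, center = FALSE)
best.by.pca <- pc$sdev[1]^2

c(search = best.by.search, pca = best.by.pca)
```

The brute-force search can only approach the variance of the first PC from below, and the two PC axes stored in pc$rotation are orthogonal to each other, as described above.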

By definition, the top PCs capture the dominant factors of heterogeneity in the data set. Thus, we can perform dimensionality reduction by restricting downstream analyses to the top PCs. This strategy is simple, highly effective and widely used throughout the data sciences. It takes advantage of the well-studied theoretical properties of the PCA - namely, that a low-rank approximation formed from the top PCs is the optimal approximation of the original data for a given matrix rank. It also allows us to use a wide range of fast PCA implementations for scalable and efficient data analysis.
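
The optimality of the low-rank approximation can be illustrated with a small simulation. In this sketch (simulated data, base R only), we form a rank-3 approximation from the top 3 components of an SVD and compare its squared error to that of another rank-3 approximation built from randomly chosen orthogonal axes. The matrix dimensions, the choice of k and all variable names are assumptions made for the example.

```r
set.seed(1)

# Simulated matrix with 50 cells in rows and 10 genes in columns, centred by gene.
x <- matrix(rnorm(50 * 10), nrow = 50)
x <- scale(x, center = TRUE, scale = FALSE)

s <- svd(x)
k <- 3

# Rank-k approximation formed from the top k components.
approx.k <- s$u[, 1:k] %*% diag(s$d[1:k]) %*% t(s$v[, 1:k])
err.pca <- sum((x - approx.k)^2)

# The squared error equals the sum of the squared discarded singular values.
err.expected <- sum(s$d[-(1:k)]^2)

# A competing rank-k approximation: project onto k random orthonormal axes.
q <- qr.Q(qr(matrix(rnorm(10 * k), nrow = 10)))
approx.random <- x %*% q %*% t(q)
err.random <- sum((x - approx.random)^2)

c(pca = err.pca, expected = err.expected, random = err.random)
```

No other choice of k axes can achieve a smaller squared error than the top k PCs, which is why truncating to the top PCs is a principled way to compress the data.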

When applying PCA to scRNA-seq data, our assumption is that biological processes affect multiple genes in a coordinated manner. This means that the earlier PCs are likely to represent biological structure, as more variation can be captured by considering the correlated behaviour of many genes. By comparison, random (technical) noise is expected to affect each gene independently. There is unlikely to be an axis that can capture random variation across many genes, suggesting that noise is mostly concentrated in the later PCs. This motivates the use of the earlier PCs in our downstream analyses, which concentrates the biological signal to simultaneously reduce computational work and remove noise.

### 9.2.2 Performing the PCA

We perform the PCA on the log-normalized expression values using the runPCA() function from scater. This is based on a mathematical technique called the singular value decomposition (SVD). By default, runPCA() will use an exact SVD based on base R’s svd() function and return the first 50 PCs for downstream analysis. These PCs are stored in the reducedDims() of the output SingleCellExperiment object, as shown below.

chosen.hvgs <- head(order(dec.zeisel$bio, decreasing=TRUE), 2000)

library(scater)
sce.zeisel <- runPCA(sce.zeisel, subset_row=chosen.hvgs)
reducedDimNames(sce.zeisel)
## [1] "PCA"
dim(reducedDim(sce.zeisel, "PCA"))
## [1] 2937   50

In the example above, the PCA is performed on the top 2000 genes with the largest biological components to reduce computational work and noise. Specifically, PCA is generally robust to noise but an excess of it may cause the earlier PCs to capture noise instead of biological structure. This effect can be avoided - or at least mitigated - by restricting the PCA to HVGs with large biological components. Any of the selection strategies described in Chapter 8 could be used for feature selection here.

For large data sets, greater efficiency is obtained by using approximate SVD algorithms that only compute the top PCs. For example, we can use methods from the irlba package:

library(BiocSingular)
set.seed(1000)
sce.zeisel <- runPCA(sce.zeisel, subset_row=chosen.hvgs,
    BSPARAM=IrlbaParam(), name="IRLBA")
reducedDimNames(sce.zeisel)
## [1] "PCA"   "IRLBA"

The BSPARAM= argument from BiocSingular provides a flexible interface for specifying different SVD algorithms. Any function that accepts a BSPARAM= argument can often be sped up by using an alternative SVD algorithm. Another strategy is to perform the SVD using methods from the rsvd package:

set.seed(1001) # Slightly different seed to add some variety!
sce.zeisel <- runPCA(sce.zeisel, subset_row=chosen.hvgs,
    BSPARAM=RandomParam(), name="RSVD")
reducedDimNames(sce.zeisel)
## [1] "PCA"   "IRLBA" "RSVD"

The majority of these approximate methods are based on randomization and thus require set.seed() to obtain reproducible results[2].

### 9.2.3 Choosing the number of PCs

#### 9.2.3.1 Motivation

How many of the top PCs should we retain for downstream analyses? The choice of the number of PCs $$d$$ is a decision that is analogous to the choice of the number of HVGs to use.
Using more PCs will avoid discarding biological signal in later PCs, at the cost of retaining more noise. Most practitioners will simply set $$d$$ to a “reasonable” but arbitrary value, typically ranging from 10 to 50. This is often satisfactory provided it is coupled with sufficient testing of alternative values to explore other perspectives of the data at a different bias-variance trade-off. Nonetheless, we will describe some more data-driven strategies to guide a suitable choice of $$d$$.

#### 9.2.3.2 Using the elbow point

A simple heuristic for choosing $$d$$ involves identifying the elbow point in the percentage of variance explained by successive PCs. This refers to the “elbow” in the curve of a scree plot as shown in Figure 9.1.

# Percentage of variance explained is tucked away in the attributes.
percent.var <- attr(reducedDim(sce.zeisel), "percentVar")
chosen.elbow <- PCAtools::findElbowPoint(percent.var)
chosen.elbow
## [1] 7
plot(percent.var, xlab="PC", ylab="Variance explained (%)")
abline(v=chosen.elbow, col="red")

Our assumption is that each of the top PCs capturing biological signal should explain much more variance than the remaining PCs. Thus, there should be a sharp drop in the percentage of variance explained when we move past the last “biological” PC. This manifests as an elbow in the scree plot, the location of which serves as a natural choice for $$d$$.

From a practical perspective, the use of the elbow point tends to retain fewer PCs compared to other methods. The definition of “much more variance” is relative so, in order to be retained, later PCs must explain an amount of variance that is comparable to that explained by the first few PCs. Strong biological variation in the early PCs will shift the elbow to the left, potentially excluding weaker (but still interesting) variation in the next PCs immediately following the elbow.

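
To give a feel for how an elbow can be located automatically, the sketch below implements one common geometric heuristic: pick the point on the scree curve that lies furthest from the straight line joining its first and last points. This is an illustrative stand-in and is not claimed to match the algorithm used by PCAtools::findElbowPoint(). The helper find_elbow() and the variance values are hypothetical.

```r
# Hypothetical percentages of variance explained by successive PCs.
toy.percent.var <- c(30, 20, 12, 3, 2.5, 2.2, 2, 1.8, 1.6, 1.5)

# Return the index of the point furthest from the chord that joins the
# first and last points of the curve.
find_elbow <- function(y) {
    x <- seq_along(y)
    n <- length(y)
    dx <- x[n] - x[1]
    dy <- y[n] - y[1]
    dist.to.chord <- abs(dy * (x - x[1]) - dx * (y - y[1])) / sqrt(dx^2 + dy^2)
    which.max(dist.to.chord)
}

toy.elbow <- find_elbow(toy.percent.var)
toy.elbow
```

For these made-up values, the heuristic selects the first PC after the sharp drop. Note that the distance to the chord depends on the relative scaling of the two axes, which is one reason why different elbow detectors can disagree on the same curve.
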
#### 9.2.3.3 Using the technical noise

Another strategy is to retain all PCs until the percentage of total variation explained reaches some threshold $$T$$. For example, one might retain the top set of PCs that explains 80% of the total variation in the data. Of course, it would be pointless to swap one arbitrary parameter $$d$$ for another $$T$$. Instead, we obtain a suitable value for $$T$$ by calculating the proportion of variance in the data that is attributed to the biological component. This is done using the denoisePCA() function with the variance modelling results from modelGeneVar() or related functions.

library(scran)
set.seed(001001001)
denoised <- denoisePCA(sce.zeisel, technical=dec.zeisel,
    subset.row=chosen.hvgs, BSPARAM=IrlbaParam())
ncol(reducedDim(denoised))
## [1] 100

The dimensionality of the output represents the lower bound on the number of PCs required to retain all biological variation. Any fewer PCs will definitely discard some aspect of biological signal. Note that the converse is not true, i.e., there is no guarantee that the retained PCs capture all of the signal, which is only generally possible if no dimensionality reduction is performed at all. The returned value of $$d$$ provides a reasonable choice of rank when we want to retain as much signal as possible while still removing some noise.

From a practical perspective, the denoisePCA() approach retains more PCs than the elbow point method. This is because the former does not compare PCs to each other and thus does not discard PCs corresponding to secondary factors of variation. The downside is that many minor aspects of variation may not be interesting (e.g., transcriptional bursting) and their retention would only add irrelevant noise. Thus, whether this is a “better” approach depends on the analyst’s willingness to increase noise in order to preserve weaker biological signals.

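
The reasoning behind the threshold can be sketched with a few lines of arithmetic. In this toy example, the per-gene variances, their technical components and the per-PC variances are all invented; the point is only to show how a biological proportion $$T$$ translates into a number of PCs. The actual denoisePCA() function may differ in its details (for example, by bounding the number of PCs that can be returned).

```r
# Hypothetical per-gene variance decomposition from a fitted variance model.
total.var <- c(4.0, 3.0, 2.0, 0.6, 0.4)
tech.var  <- c(0.5, 0.5, 0.4, 0.4, 0.2)

# Proportion of the total variance attributed to biology; this plays the role of T.
bio.prop <- sum(total.var - tech.var) / sum(total.var)

# Hypothetical variance explained by each successive PC (same total as above).
pc.var <- c(5.0, 2.5, 1.2, 0.8, 0.5)
cum.prop <- cumsum(pc.var) / sum(pc.var)

# Smallest number of PCs whose cumulative proportion reaches T.
toy.d <- which(cum.prop >= bio.prop)[1]

c(T = bio.prop, d = toy.d)
```

Here the first two PCs fall short of the biological proportion, so at least three PCs are needed. Dropping to two would necessarily discard some of the variance attributed to biology, which is the sense in which the result is a lower bound.
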
#### 9.2.3.4 Based on population structure

Yet another method to choose $$d$$ uses information about the number of subpopulations in the data. Consider a situation where each subpopulation differs from the others along a different axis in the high-dimensional space (e.g., because it is defined by a unique set of marker genes). This suggests that we should set $$d$$ to the number of unique subpopulations minus 1, which guarantees separation of all subpopulations while retaining as few dimensions (and noise) as possible. We can use this reasoning to loosely motivate an a priori choice for $$d$$ - for example, if we expect around 10 different cell types in our population, we would set $$d \approx 10$$.

In practice, the number of subpopulations is usually not known in advance. Rather, we use a heuristic approach that uses the number of clusters as a proxy for the number of subpopulations. We perform clustering (graph-based by default, see Chapter 10) on the first $$d^*$$ PCs and only consider the values of $$d^*$$ that yield no more than $$d^*+1$$ clusters. If we detect more clusters with fewer dimensions, we consider this to represent overclustering rather than distinct subpopulations, assuming that multiple subpopulations should not be distinguishable on the same axes. We test a range of $$d^*$$ and set $$d$$ to the value that maximizes the number of clusters while satisfying the above condition. This attempts to capture as many distinct (putative) subpopulations as possible by retaining biological signal in later PCs, up until the point that the additional noise reduces resolution.

pcs <- reducedDim(sce.zeisel)
choices <- getClusteredPCs(pcs)
metadata(choices)$chosen
## [1] 28
plot(choices$n.pcs, choices$n.clusters,
    xlab="Number of PCs", ylab="Number of clusters")
abline(a=1, b=1, col="red")
abline(v=metadata(choices)$chosen, col="grey80", lty=2)

This strategy is the most pragmatic as it directly addresses the role of the bias-variance trade-off in downstream analyses, specifically clustering. There is no need to preserve biological signal beyond what is distinguishable in later steps. However, it involves strong assumptions about the nature of the biological differences between subpopulations - and indeed, discrete subpopulations may not even exist in studies of continuous processes like differentiation.
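
The selection rule described in this subsection can be written out in a few lines. The sketch below applies it to a hypothetical table of cluster counts, one for each tested number of PCs; the numbers are invented and the variable names (toy.n.pcs, toy.n.clusters, toy.chosen) are placeholders rather than the output of getClusteredPCs().

```r
# Hypothetical results of clustering on the first d* PCs for a range of d*.
toy.n.pcs      <- c(5, 10, 15, 20, 25, 30)
toy.n.clusters <- c(9, 12, 14, 18, 17, 16)

# Only consider values of d* that yield no more than d* + 1 clusters.
allowed <- which(toy.n.clusters <= toy.n.pcs + 1)

# Among those, take the d* that gives the largest number of clusters.
toy.chosen <- toy.n.pcs[allowed[which.max(toy.n.clusters[allowed])]]
toy.chosen
```

In this toy table, the two smallest values of $$d^*$$ are rejected as overclustering, and the number of clusters peaks among the remaining candidates before declining again as more noise is included.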

### 9.2.4 Putting it together

Once we have chosen $$d$$, enforcing it is as simple as subsetting the PC matrix by column. We then use the reducedDim()<- command to reassign the subsetted matrix back into the SingleCellExperiment object. For example, if we were to take the top 20 PCs, we would do:

reducedDim(sce.zeisel, "PCA") <- reducedDim(sce.zeisel, "PCA")[,1:20]
ncol(reducedDim(sce.zeisel, "PCA"))
## [1] 20

Downstream applications that use the "PCA" dimensionality reduction results in sce.zeisel will subsequently operate on the first 20 PCs only. Alternatively, some users may prefer to keep the full set of PCs, in which case the top set can be assigned to another name:

reducedDim(sce.zeisel, "PCA_20") <- reducedDim(sce.zeisel, "PCA")[,1:20]
reducedDimNames(sce.zeisel)
## [1] "PCA"    "IRLBA"  "RSVD"   "PCA_20"

## 9.3 Dimensionality reduction for visualization

### 9.3.1 Motivation

Another application of dimensionality reduction is to compress the data into 2 (sometimes 3) dimensions for plotting. This serves a separate purpose to the PCA-based dimensionality reduction described above. Algorithms are more than happy to operate on 10-50 PCs, but these are still too many dimensions for human comprehension. Further dimensionality reduction strategies are required to pack the most salient features of the data into 2 or 3 dimensions, which we will discuss below.

### 9.3.2 Visualizing with PCA

The simplest visualization approach is to plot the top 2 PCs (Figure 9.3):

plotReducedDim(sce.zeisel, dimred="PCA", colour_by="level1class")

The problem is that PCA is a linear technique, i.e., only variation along a line in high-dimensional space is captured by each PC. As such, it cannot efficiently pack differences in $$d$$ dimensions into the first 2 PCs. This is demonstrated in Figure 9.3 where the top two PCs fail to resolve some subpopulations identified by Zeisel et al. (2015). If the first PC is devoted to resolving the biggest difference between subpopulations, and the second PC is devoted to resolving the next biggest difference, then the remaining differences will not be visible in the plot.

One workaround is to plot several of the top PCs against each other in pairwise plots (Figure 9.4). However, it is difficult to interpret multiple plots simultaneously, and even this approach is not sufficient to separate some of the annotated subpopulations.

plotReducedDim(sce.zeisel, dimred="PCA", ncomponents=4,
    colour_by="level1class")

There are some advantages to the PCA for visualization. It is predictable and will not introduce artificial structure in the visualization. It is also deterministic and robust to small changes in the input values. However, as shown above, PCA is usually not satisfactory for visualization of complex populations.

### 9.3.3 t-stochastic neighbor embedding

The de facto standard for visualization of scRNA-seq data is the $$t$$-stochastic neighbour embedding ($$t$$-SNE) method (Van der Maaten and Hinton 2008). This attempts to find a low-dimensional representation of the data that preserves the distances between each point and its neighbours in the high-dimensional space. Unlike PCA, it is not restricted to linear transformations, nor is it obliged to accurately represent distances between distant populations. This means that it has much more freedom in how it arranges cells in low-dimensional space, enabling it to separate many distinct clusters in a complex population (Figure 9.5).

set.seed(00101001101)
sce.zeisel <- runTSNE(sce.zeisel, dimred="PCA")
plotReducedDim(sce.zeisel, dimred="TSNE", colour_by="level1class")

One of the main disadvantages of $$t$$-SNE is that it is much more computationally intensive than other visualization methods. We mitigate this effect by performing the calculations on the top PCs with dimred="PCA" in runTSNE(). This takes advantage of the PCA to compact the data and remove noise, followed by $$t$$-SNE to create the visualization. It is possible to run $$t$$-SNE on the original expression matrix but this is less efficient.

Another issue with $$t$$-SNE is that it requires the user to be aware of additional parameters (discussed in some depth elsewhere). It involves a random initialization so we need to set the seed to ensure that the chosen results are reproducible. The “perplexity” is another important parameter that determines the granularity of the visualization (Figure 9.6). Low perplexities will favour resolution of finer structure, possibly to the point that the visualization is compromised by random noise. Thus, it is advisable to test different perplexity values to ensure that the choice of perplexity does not drive the interpretation of the plot.

set.seed(100)
sce.zeisel <- runTSNE(sce.zeisel, dimred="PCA", perplexity=5)
out5 <- plotReducedDim(sce.zeisel, dimred="TSNE",
    colour_by="level1class") + ggtitle("perplexity = 5")

set.seed(100)
sce.zeisel <- runTSNE(sce.zeisel, dimred="PCA", perplexity=20)
out20 <- plotReducedDim(sce.zeisel, dimred="TSNE",
    colour_by="level1class") + ggtitle("perplexity = 20")

set.seed(100)
sce.zeisel <- runTSNE(sce.zeisel, dimred="PCA", perplexity=80)
out80 <- plotReducedDim(sce.zeisel, dimred="TSNE",
    colour_by="level1class") + ggtitle("perplexity = 80")

multiplot(out5, out20, out80, cols=3)
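
To make the notion of perplexity more concrete, it can be thought of as the effective number of neighbours that each cell "pays attention to". In $$t$$-SNE, each cell has a probability distribution over its neighbours based on a Gaussian kernel, and the kernel bandwidth is tuned so that the perplexity of this distribution matches the user-supplied value. The sketch below computes the perplexity for a single cell at two fixed bandwidths; the distances, bandwidths and the helper functions perplexity() and neighbour_probs() are made up for illustration and are not part of any package's interface.

```r
# Perplexity of a discrete distribution: 2 to the power of its Shannon entropy in bits.
perplexity <- function(p) {
    p <- p[p > 0]
    2^(-sum(p * log2(p)))
}

# Neighbour probabilities for one cell under a Gaussian kernel, given its
# distances to the other cells and a bandwidth sigma.
neighbour_probs <- function(dists, sigma) {
    w <- exp(-dists^2 / (2 * sigma^2))
    w / sum(w)
}

# Made-up distances from one cell to eight others: three close, three
# intermediate and two distant.
dists <- c(0.5, 0.6, 0.7, 3, 3.2, 3.5, 6, 6.5)

perp.narrow <- perplexity(neighbour_probs(dists, sigma = 0.3))
perp.wide   <- perplexity(neighbour_probs(dists, sigma = 5))

# A uniform distribution over 5 neighbours has a perplexity of 5.
uniform5 <- perplexity(rep(1/5, 5))

c(narrow = perp.narrow, wide = perp.wide, uniform = uniform5)
```

With a narrow kernel, only the closest cells receive any appreciable weight and the perplexity is small; with a wide kernel, almost all cells contribute and the perplexity approaches the total number of neighbours. This is why low perplexities emphasize fine local structure in the resulting plot.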

Despite its shortcomings, $$t$$-SNE is a proven tool for general-purpose visualization of scRNA-seq data. It sees routine use in many analysis pipelines and will likely continue to do so for some time.

### 9.3.4 Uniform manifold approximation and projection

The uniform manifold approximation and projection (UMAP) method is an alternative to $$t$$-SNE for non-linear dimensionality reduction. It is roughly similar to $$t$$-SNE in that it also tries to find a low-dimensional representation that preserves relationships between neighbors in high-dimensional space. However, the two methods are based on different theory, represented by differences in the various graph weighting equations[3]. This manifests as a different visualization as shown in Figure 9.7.

set.seed(1100101001)
sce.zeisel <- runUMAP(sce.zeisel, dimred="PCA")
plotReducedDim(sce.zeisel, dimred="UMAP", colour_by="level1class")

Compared to $$t$$-SNE, the UMAP visualization tends to have more compact visual clusters with more empty space between them. It also attempts to preserve more of the global structure than $$t$$-SNE. From a practical perspective, UMAP is much faster than $$t$$-SNE, which may be an important consideration for large datasets. (Nonetheless, we have still run UMAP on the top 20 PCs here for consistency.) UMAP also involves a series of randomization steps so setting the seed is critical.

Like $$t$$-SNE, UMAP has its own suite of hyperparameters that affect the visualization. Of these, the number of neighbours (n_neighbors) and the minimum distance between embedded points (min_dist) have the greatest effect on the granularity of the output. If these values are too low, random noise will be incorrectly treated as high-resolution structure, while values that are too high will discard fine structure altogether in favour of obtaining an accurate overview of the entire dataset. Again, it is a good idea to test a range of values for these parameters to ensure that they do not compromise any conclusions drawn from a UMAP plot.

It is arguable whether the UMAP or $$t$$-SNE visualizations are more useful or aesthetically pleasing. However, it is clear that UMAP is much faster. For that reason alone, it is increasingly displacing $$t$$-SNE as the method of choice for visualizing large scRNA-seq data sets.

### 9.3.5 Interpreting the plots

Dimensionality reduction for visualization necessarily involves discarding information and distorting the distances between cells. One might wonder whether the results of such extreme data compression can be trusted. Indeed, it would probably be unwise to directly analyze the low-dimensional coordinates used for plotting. Rather, the plots should only be used to interpret or communicate the results of quantitative analyses based on a more accurate higher-rank representation of the data.

To illustrate, consider the interaction between clustering and $$t$$-SNE. As a general rule, we would not perform clustering on the $$t$$-SNE coordinates. Rather, we would cluster on the first 10-50 PCs (Chapter 10) and then visualize the cluster identities on the $$t$$-SNE plot. This ensures that clustering makes use of the information that was lost during compression into two dimensions. Given a discrepancy between the visible clusters on a $$t$$-SNE plot and those identified by our clustering, we would be inclined to favour the latter.

## Session Info

R version 3.6.1 (2019-07-05)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 14.04.5 LTS

Matrix products: default
BLAS:   /home/ramezqui/Rbuild/danbuild/R-3.6.1/lib/libRblas.so
LAPACK: /home/ramezqui/Rbuild/danbuild/R-3.6.1/lib/libRlapack.so

locale:
[1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8        LC_COLLATE=C
[5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8       LC_NAME=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] parallel  stats4    stats     graphics  grDevices utils     datasets
[8] methods   base

other attached packages:
[1] scran_1.13.18               BiocSingular_1.1.5
[3] scater_1.13.18              ggplot2_3.2.1
[5] SingleCellExperiment_1.7.8  SummarizedExperiment_1.15.9
[7] DelayedArray_0.11.4         BiocParallel_1.19.2
[9] matrixStats_0.55.0          Biobase_2.45.1
[11] GenomicRanges_1.37.15       GenomeInfoDb_1.21.1
[13] IRanges_2.19.14             S4Vectors_0.23.21
[15] BiocGenerics_0.31.5         Cairo_1.5-10
[17] BiocStyle_2.13.2            OSCAUtils_0.0.1

loaded via a namespace (and not attached):
[1] viridis_0.5.1            edgeR_3.27.13
[3] viridisLite_0.3.0        DelayedMatrixStats_1.7.2
[5] RcppParallel_4.4.3       assertthat_0.2.1
[7] statmod_1.4.32           highr_0.8
[9] BiocManager_1.30.4       dqrng_0.2.1
[11] GenomeInfoDbData_1.2.1   vipor_0.4.5
[13] yaml_2.2.0               ggrepel_0.8.1
[15] pillar_1.4.2             lattice_0.20-38
[17] glue_1.3.1               limma_3.41.16
[19] digest_0.6.20            XVector_0.25.0
[21] colorspace_1.4-1         plyr_1.8.4
[23] cowplot_1.0.0            htmltools_0.3.6
[25] Matrix_1.2-17            pkgconfig_2.0.2
[27] bookdown_0.13            zlibbioc_1.31.0
[29] purrr_0.3.2              scales_1.0.0
[31] RSpectra_0.15-0          Rtsne_0.15
[33] tibble_2.1.3             withr_2.1.2
[35] lazyeval_0.2.2           magrittr_1.5
[37] crayon_1.3.4             evaluate_0.14
[39] FNN_1.1.3                beeswarm_0.2.3
[41] tools_3.6.1              stringr_1.4.0
[43] munsell_0.5.0            locfit_1.5-9.1
[45] irlba_2.3.3              compiler_3.6.1
[47] rsvd_1.0.2               rlang_0.4.0
[49] grid_3.6.1               RCurl_1.95-4.12
[51] BiocNeighbors_1.3.3      igraph_1.2.4.1
[53] labeling_0.3             bitops_1.0-6
[55] rmarkdown_1.15           gtable_0.3.0
[57] reshape2_1.4.3           PCAtools_1.1.10
[59] R6_2.4.0                 gridExtra_2.3
[61] knitr_1.24               dplyr_0.8.3
[63] uwot_0.1.3               stringi_1.4.3
[65] ggbeeswarm_0.6.0         Rcpp_1.0.2
[67] tidyselect_0.2.5         xfun_0.9                

### Bibliography

Langfelder, P., and S. Horvath. 2007. “Eigengene networks for studying the relationships between co-expression modules.” BMC Syst Biol 1 (November):54.

Van der Maaten, L., and G. Hinton. 2008. “Visualizing Data Using T-SNE.” J. Mach. Learn. Res. 9:2579–2605.

Zeisel, A., A. B. Munoz-Manchado, S. Codeluppi, P. Lonnerberg, G. La Manno, A. Jureus, S. Marques, et al. 2015. “Brain structure. Cell types in the mouse cortex and hippocampus revealed by single-cell RNA-seq.” Science 347 (6226):1138–42.

1. Duh.

2. Some of us leave coded messages in binary via the seeds. Just so you know.

3. That we will not discuss here, thank god.