# Chapter 17 Trajectory Analysis

## 17.1 Overview

Many biological processes manifest as a continuum of dynamic changes in the cellular state. The most obvious example is that of differentiation into increasingly specialized cell subtypes, but we might also consider phenomena like the cell cycle or immune cell activation that are accompanied by gradual changes in the cell’s transcriptome. We characterize these processes from single-cell expression data by identifying a “trajectory”, i.e., a path through the high-dimensional expression space that traverses the various cellular states associated with a continuous process like differentiation. In the simplest case, a trajectory will be a simple path from one point to another, but we can also observe more complex trajectories involve branching to multiple endpoints.

A related concept is that of “pseudotime”, defined as the positioning of cells along the trajectory that quantifies the relative activity of the underlying biological process. For example, the pseudotime for a differentiation trajectory might represent the degree of differentiation from a pluripotent cell to a terminal state. This metric allows us to tackle questions related to the global population structure in a more quantitative manner. (It is worth noting that pseudotime is rather poorly named as it may or may not have much to do with actual time. For example, one can imagine a continuum of stress states with cells moving in either direction over time, but the pseudotime will only increase in one direction.)

In this section, we will demonstrate several different approaches to trajectory analysis using the haematopoietic stem cell (HSC) dataset from Nestorowa et al. (2016).

```
#--- data-loading ---#
library(scRNAseq)
sce.nest <- NestorowaHSCData()
#--- gene-annotation ---#
library(AnnotationHub)
ens.mm.v97 <- AnnotationHub()[["AH73905"]]
anno <- select(ens.mm.v97, keys=rownames(sce.nest),
keytype="GENEID", columns=c("SYMBOL", "SEQNAME"))
rowData(sce.nest) <- anno[match(rownames(sce.nest), anno$GENEID),]
#--- quality-control-grun ---#
library(scater)
stats <- perCellQCMetrics(sce.nest)
qc <- quickPerCellQC(stats, percent_subsets="altexps_ERCC_percent")
sce.nest <- sce.nest[,!qc$discard]
#--- normalization ---#
library(scran)
set.seed(101000110)
clusters <- quickCluster(sce.nest)
sce.nest <- computeSumFactors(sce.nest, clusters=clusters)
sce.nest <- logNormCounts(sce.nest)
#--- variance-modelling ---#
set.seed(00010101)
dec.nest <- modelGeneVarWithSpikes(sce.nest, "ERCC")
top.nest <- getTopHVGs(dec.nest, prop=0.1)
#--- dimensionality-reduction ---#
set.seed(101010011)
sce.nest <- denoisePCA(sce.nest, technical=dec.nest, subset.row=top.nest)
sce.nest <- runTSNE(sce.nest, dimred="PCA")
#--- clustering ---#
snn.gr <- buildSNNGraph(sce.nest, use.dimred="PCA")
colLabels(sce.nest) <- factor(igraph::cluster_walktrap(snn.gr)$membership)
```

`sce.nest`

```
## class: SingleCellExperiment
## dim: 46078 1656
## metadata(0):
## assays(2): counts logcounts
## rownames(46078): ENSMUSG00000000001 ENSMUSG00000000003 ...
## ENSMUSG00000107391 ENSMUSG00000107392
## rowData names(3): GENEID SYMBOL SEQNAME
## colnames(1656): HSPC_025 HSPC_031 ... Prog_852 Prog_810
## colData names(4): cell.type FACS sizeFactor label
## reducedDimNames(3): diffusion PCA TSNE
## altExpNames(1): ERCC
```

## 17.2 Obtaining pseudo-times

### 17.2.1 Cluster-based minimum spanning tree

The *TSCAN* package employs a simple yet effective approach to trajectory reconstruction. It clusters cells to summarize the data into a smaller set of discrete units, computes cluster centroids by averaging the cell coordinates and then forms the minimum spanning tree (MST) across centroids. The MST is simply an undirected acyclic graph that passes through each centroid exactly once and can be thought of as the most parsimonious structure that captures the transitions between clusters. We demonstrate below on the Nestorowa dataset (Figure 17.2), computing the centroids in the low-dimensional space to take advantage of data compaction and denoising (Chapter 9).

```
# TODO: get the TSCAN authors to allow me to plug in existing
# dimensionality reduction and clustering results, rather than
# forcing users to go along with their defaults.
library(scater)
by.cluster <- aggregateAcrossCells(sce.nest, ids=colLabels(sce.nest))
centroids <- reducedDim(by.cluster, "PCA")
dmat <- dist(centroids)
dmat <- as.matrix(dmat)
g <- igraph::graph.adjacency(dmat, mode = "undirected", weighted = TRUE)
mst <- igraph::minimum.spanning.tree(g)
set.seed(1000)
plot(mst)
```

For reference, we can draw the same lines between the centroids in a \(t\)-SNE plot (Figure 17.3). It is then straightforward to identify interesting clusters such as those at bifurcations or endpoints. Keep in mind that the MST is generated from distances in the PC space and is merely being visualized in the \(t\)-SNE space; its interpretation is not compromised by the distortions required to obtain a two-dimensional visualization.

```
# TODO: stuff this into a function somewhere.
pairs <- Matrix::which(mst[] > 0, arr.ind=TRUE)
coords <- reducedDim(by.cluster, "TSNE")
group <- rep(seq_len(nrow(pairs)), 2)
stuff <- data.frame(rbind(coords[pairs[,1],], coords[pairs[,2],]), group)
plotTSNE(sce.nest, colour_by="label") +
geom_line(data=stuff, mapping=aes(x=X1, y=X2, group=group))
```

We obtain a pseudotime ordering by projecting the cells onto the MST. In other words, we move each cell onto the edge of the MST to which it is closest; the pseudotime is then calculated as the distance along the MST from this new position to a “root node”. For our purposes, we will arbitrarily pick one of the endpoint nodes as the root, though a more careful choice based on the biological annotation of each node may yield more relevant orderings (e.g., picking a node corresponding to a more pluripotent state).

```
# TODO: for the love of god, we definitely need to move this into a function!
.map2edges <- function(points, center, edge.ends, previous) {
all.distances <- list()
all.pseudo <- list()
edge.len <- list()
# Computing distance of each point from each edge.
# Edges defined from 'center' to 'edge.ends'.
for (i in rownames(edge.ends)) {
edge.end <- edge.ends[i,]
delta <- center - edge.end
max.d <- sqrt(sum(delta^2))
delta <- delta/max.d
centered <- t(t(points) - center)
proj <- as.numeric(centered %*% delta)
proj <- pmax(0, pmin(proj, max.d))
mapped <- outer(proj, delta)
dist <- sqrt(rowSums((centered - mapped)^2))
all.distances[[i]] <- dist
all.pseudo[[i]] <- proj
edge.len[[i]] <- max.d
}
all.distances <- do.call(cbind, all.distances)
all.pseudo <- do.call(cbind, all.pseudo)
chosen <- colnames(all.distances)[max.col(-all.distances)]
# Flipping the distance of points to the previous node,
# in order to enforce a directional pseudo-time.
dist.previous <- 0
if (!is.na(previous)) {
on.previous <- chosen==previous
dist.previous <- edge.len[[previous]]
previous.proj <- dist.previous - all.pseudo[on.previous,previous,drop=FALSE]
if (all(on.previous)) {
return(list(dist=dist.previous, pseudo=list(previous.proj)))
}
}
# Filling out the branches, where points are NA for a branch's
# pseudo-time if they were assigned to another branch.
output <- list()
for (leftover in setdiff(rownames(edge.ends), previous)) {
empty <- rep(NA_real_, nrow(points))
if (!is.na(previous)) {
empty[on.previous] <- previous.proj
}
current <- chosen==leftover
empty[current] <- all.pseudo[current,leftover]
output[[leftover]] <- empty
}
list(dist=dist.previous, pseudo=output)
}
originals <- reducedDim(sce.nest, "PCA")
cluster <- colLabels(sce.nest)
starting.cluster <- names(igraph::V(mst)[igraph::degree(mst)==1])[1]
collated <- list()
latest <- starting.cluster
parents <- NA_character_
progress <- list(rep(NA_real_, length(cluster)))
cumulative <- 0
while (length(latest)) {
new.latest <- new.parents <- character(0)
new.progress <- list()
new.cumulative <- numeric(0)
for (i in seq_along(latest)) {
curnode <- latest[i]
all.neighbors <- names(igraph::adjacent_vertices(mst, curnode, mode="all")[[1]])
in.cluster <- cluster==curnode
mapped <- .map2edges(originals[in.cluster,,drop=FALSE], center=centroids[curnode,],
edge.ends=centroids[all.neighbors,,drop=FALSE], previous=parents[i])
edge.len <- mapped$dist
pseudo <- mapped$pseudo
collected.progress <- list()
for (j in seq_along(pseudo)) {
sofar <- progress[[i]] # yes, using 'i' here.
sofar[in.cluster] <- pseudo[[j]] + cumulative[i]
collected.progress[[j]] <- sofar
}
all.children <- setdiff(all.neighbors, parents[i])
if (length(all.children)==0) {
collated[[curnode]] <- collected.progress[[1]]
} else {
new.latest <- c(new.latest, all.children)
new.parents <- c(new.parents, rep(curnode, length(all.children)))
new.progress <- c(new.progress, collected.progress)
new.cumulative <- c(new.cumulative, rep(cumulative[i] + edge.len, length(all.children)))
}
}
latest <- new.latest
parents <- new.parents
progress <- new.progress
cumulative <- new.cumulative
}
tscan.pseudo <- do.call(cbind, collated)
plotTSNE(sce.nest, colour_by=I(rowMeans(tscan.pseudo, na.rm=TRUE)), text_by="label") +
geom_line(data=stuff, mapping=aes(x=X1, y=X2, group=group))
```

*TSCAN* gains several advantages from using clusters to form the MST. The most obvious is that of computational speed as calculations are performed over clusters rather than cells. The relative coarseness of clusters protects against the per-cell noise that would otherwise reduce the stability of the MST. The interpretation of the MST is also relatively straightforward as it uses the same clusters as the rest of the analysis, allowing us to recycle previous knowledge about the biological annotations assigned to each cluster.

However, the reliance on clustering is also a double-edged sword. If the clusters are not sufficiently granular, it is possible for *TSCAN* to overlook a trajectory if the entirety of the trajectory occurs in a single cluster. In addition, the MST does poorly at handling more complex events like cycles (e.g., the cell cycle, obviously) or bubbles (e.g., multiple differentation paths to the same terminal cell type). Whether or not this is a problem depends on the complexity of the global structure of the population of interest.

### 17.2.2 Principal curves

To identify a trajectory, one might imagine simply “fitting” a one-dimensional curve so that it passes through the cloud of cells in the high-dimensional expression space. This is the idea behind principal curves (Hastie and Stuetzle 1989), effectively a non-linear generalization of PCA where the axes of most variation are allowed to bend. We use the *slingshot* package (Street et al. 2018) to fit a principal curve to the PC coordinates, which yields a pseudotime ordering of cells based on their relative positions when projected onto the curve.

```
library(slingshot)
sce.sling <- slingshot(sce.nest, reducedDim='PCA')
head(sce.sling$slingPseudotime_1)
```

`## [1] 89.44417 76.33613 87.88466 76.93112 82.40780 72.09984`

Here, we fitted the principal curve to the PC space for the same reasons as described above. We can then visualize the literal path taken by the fitted curve in that space (Figure 17.5).

```
# Setting up the colors.
library(RColorBrewer)
colors <- colorRampPalette(brewer.pal(11,'Spectral')[-6])(100)
plotcol <- colors[cut(sce.sling$slingPseudotime_1, breaks=100)]
# Creating a PCA plot.
plot(reducedDim(sce.sling, "PCA"), col = plotcol, pch=16, asp = 1)
lines(SlingshotDataSet(sce.sling), lwd=2, col='black')
```

For other dimensionality reduction results, we color by the pseudotime ordering to identify the direction of the trajectory (Figure 17.6). This is effectively a continuous generalization of the coloring by cluster assignment observed in other chapters.

```
library(scater)
sce.sling <- runUMAP(sce.sling, dimred="PCA")
# TODO: make ggcells robust to random crap in the colData().
# Also need to add a function to auto-generate a path.
sce.sling$cell.type <- sce.sling$FACS <- NULL
library(viridis)
ggcells(sce.sling, mapping=aes(x=UMAP.1,
y=UMAP.2, col=slingPseudotime_1)) +
geom_point() + scale_color_viridis()
```

The previous `slingshot()`

call assumed that all cells in the dataset were part of a single one-dimensional trajectory, which fails to consider more complex events like bifurcations. To accommodate this, we use our previously computed cluster assignments to build a rough sketch for the global structure in the form of a MST across the cluster centroids. Each path through the MST from a designated root node is treated as a lineage; principal curves are then simultaneously fitted to all lineages, with some averaging across curves to encourage consistency in regions that are common to multiple lineages. This allows `slingshot()`

to capture branching events based on divergence in the principal curves (Figure 17.7).

```
sce.sling2 <- slingshot(sce.nest, cluster=colLabels(sce.nest), reducedDim='PCA')
plot(reducedDim(sce.sling2, "PCA"), col="grey80", pch=16, asp = 1)
lines(SlingshotDataSet(sce.sling2), lwd=2, col='black')
```

When operating in this mode, `slingshot()`

produces one pseudotime ordering for each principal curve. Cells not assigned to a particular curve will be assigned `NA`

values for that curve’s ordering. We can use `slingshotBranchID()`

to determine whether a particular cell is shared across multiple curves or is unique to a subset of curves (i.e., is located “after” branching). In this case, we can see that most cells jump directly from a global common segment (`1,2,3`

) to one of the curves (`1`

, `2`

, `3`

) without any further hierarchy, i.e., no noticeable internal branch points.

```
curve.assignments <- slingBranchID(sce.sling2)
table(curve.assignments)
```

```
## curve.assignments
## 1 1,2 1,2,3 1,3 2 2,3 3
## 435 6 892 2 222 39 60
```

For larger datasets, we can speed up the algorithm by approximating each principal curve with a fixed number of points. By default, `slingshot()`

uses one point per cell to define the curve, which is unnecessarily precise when the number of cells is large. Indeed, the approximated curves in Figure 17.8 are quite similar to those in Figure 17.7.

```
sce.sling3 <- slingshot(sce.nest, cluster=colLabels(sce.nest),
reducedDim='PCA', approx_points=100)
plot(reducedDim(sce.sling3, "PCA"), col="grey80", pch=16, asp = 1)
lines(SlingshotDataSet(sce.sling3), lwd=2, col='black')
```

## 17.3 Characterizing trajectories

### 17.3.1 Changes along a trajectory

### 17.3.2 Changes between lineages

## 17.4 Finding the root

### 17.4.1 Overview

### 17.4.2 Entropy-based methods

### 17.4.3 RNA velocity

## Session information

```
R version 4.0.0 Patched (2020-05-01 r78341)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 18.04.4 LTS
Matrix products: default
BLAS: /home/luna/Software/R/R-4-0-branch/lib/libRblas.so
LAPACK: /home/luna/Software/R/R-4-0-branch/lib/libRlapack.so
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] parallel stats4 stats graphics grDevices utils datasets
[8] methods base
other attached packages:
[1] viridis_0.5.1 viridisLite_0.3.0
[3] RColorBrewer_1.1-2 slingshot_1.6.0
[5] princurve_2.1.4 scater_1.16.0
[7] ggplot2_3.3.0 SingleCellExperiment_1.10.1
[9] SummarizedExperiment_1.18.1 DelayedArray_0.14.0
[11] matrixStats_0.56.0 Biobase_2.48.0
[13] GenomicRanges_1.40.0 GenomeInfoDb_1.24.0
[15] IRanges_2.22.1 S4Vectors_0.26.0
[17] BiocGenerics_0.34.0 OSCAUtils_0.0.2
[19] BiocStyle_2.16.0
loaded via a namespace (and not attached):
[1] BiocSingular_1.4.0 DelayedMatrixStats_1.10.0
[3] assertthat_0.2.1 BiocManager_1.30.10
[5] highr_0.8 GenomeInfoDbData_1.2.3
[7] vipor_0.4.5 yaml_2.2.1
[9] pillar_1.4.4 lattice_0.20-41
[11] glue_1.4.1 digest_0.6.25
[13] XVector_0.28.0 colorspace_1.4-1
[15] cowplot_1.0.0 htmltools_0.4.0
[17] Matrix_1.2-18 pkgconfig_2.0.3
[19] bookdown_0.19 zlibbioc_1.34.0
[21] purrr_0.3.4 scales_1.1.1
[23] processx_3.4.2 RSpectra_0.16-0
[25] BiocParallel_1.22.0 tibble_3.0.1
[27] farver_2.0.3 ellipsis_0.3.1
[29] withr_2.2.0 magrittr_1.5
[31] crayon_1.3.4 evaluate_0.14
[33] ps_1.3.3 nlme_3.1-147
[35] FNN_1.1.3 beeswarm_0.2.3
[37] tools_4.0.0 lifecycle_0.2.0
[39] stringr_1.4.0 munsell_0.5.0
[41] irlba_2.3.3 callr_3.4.3
[43] compiler_4.0.0 rsvd_1.0.3
[45] rlang_0.4.6 grid_4.0.0
[47] RCurl_1.98-1.2 BiocNeighbors_1.6.0
[49] igraph_1.2.5 bitops_1.0-6
[51] labeling_0.3 rmarkdown_2.1
[53] gtable_0.3.0 codetools_0.2-16
[55] R6_2.4.1 gridExtra_2.3
[57] knitr_1.28 dplyr_0.8.5
[59] uwot_0.1.8 ape_5.3
[61] stringi_1.4.6 ggbeeswarm_0.6.0
[63] Rcpp_1.0.4.6 vctrs_0.3.0
[65] tidyselect_1.1.0 xfun_0.13
```

### Bibliography

Hastie, T., and W. Stuetzle. 1989. “Principal Curves.” *J Am Stat Assoc* 84 (406): 502–16.

Nestorowa, S., F. K. Hamey, B. Pijuan Sala, E. Diamanti, M. Shepherd, E. Laurenti, N. K. Wilson, D. G. Kent, and B. Gottgens. 2016. “A single-cell resolution map of mouse hematopoietic stem and progenitor cell differentiation.” *Blood* 128 (8): 20–31.

Street, K., D. Risso, R. B. Fletcher, D. Das, J. Ngai, N. Yosef, E. Purdom, and S. Dudoit. 2018. “Slingshot: cell lineage and pseudotime inference for single-cell transcriptomics.” *BMC Genomics* 19 (1): 477.