# Chapter 14 Integrating Datasets

## 14.1 Motivation

Large single-cell RNA sequencing (scRNA-seq) projects usually need to generate data across multiple batches due to logistical constraints. However, the processing of different batches is often subject to uncontrollable differences, e.g., changes in operator, differences in reagent quality. This results in systematic differences in the observed expression in cells from different batches, which we refer to as “batch effects”. Batch effects are problematic as they can be major drivers of heterogeneity in the data, masking the relevant biological differences and complicating interpretation of the results.

Computational correction of these effects is critical for eliminating batch-to-batch variation, allowing data across multiple batches to be combined for common downstream analysis. However, existing methods based on linear models (Ritchie et al. 2015; Leek et al. 2012) assume that the composition of cell populations are either known or the same across batches. To overcome these limitations, bespoke methods have been developed for batch correction of single-cell data (Haghverdi et al. 2018; Butler et al. 2018; Lin et al. 2019) that do not require a priori knowledge about the composition of the population. This allows them to be used in workflows for exploratory analyses of scRNA-seq data where such knowledge is usually unavailable.

Here, we will demonstrate the application of these methods on two case studies involving PBMCs and pancreatic cells. We will focus on the method proposed by Haghverdi et al. (2018), which is based on the detection of mutual nearest neighbours (MNNs). The MNN approach does not rely on pre-defined or equal population compositions across batches, only requiring that a subset of the population be shared between batches.

## 14.2 Simple two batch example

### 14.2.1 Setting up the data

We will use two separate 10X Genomics datasets involving PBMCs obtained from different donors. Each dataset was obtained from the TENxPBMCData package and separately subjected to basic processing steps. Separate processing prior to the batch correction step is more convenient, scalable and (on occasion) more reliable. For example, outlier-based QC on the cells is more effective when performed within a batch (Chapter ???). The same can also be said for trend fitting when modelling the mean-variance relationship (Section ???).

### loading ###
library(TENxPBMCData)
pbmc3k <- TENxPBMCData('pbmc3k')

### quality-control ###
is.mito <- grep("MT", rowData(pbmc3k)$Symbol_TENx) library(scater) pbmc3k <- calculateQCMetrics(pbmc3k, feature_controls=list(Mito=is.mito)) high.mito <- isOutlier(pbmc3k$pct_counts_Mito, nmads=3, type="higher")
pbmc3k <- pbmc3k[,!high.mito]

### normalization ###
pbmc3k <- normalize(pbmc3k)

### variance-modelling ###
library(scran)
fit3k <- trendVar(pbmc3k, use.spikes = FALSE)
dec3k <- decomposeVar(pbmc3k, fit3k)

### feature-selection ###
chosen.hvgs <- which(dec3k$bio > 0) pbmc3k ## class: SingleCellExperiment ## dim: 32738 2609 ## metadata(1): log.exprs.offset ## assays(2): counts logcounts ## rownames(32738): ENSG00000243485 ENSG00000237613 ... ## ENSG00000215616 ENSG00000215611 ## rowData names(11): ENSEMBL_ID Symbol_TENx ... total_counts ## log10_total_counts ## colnames: NULL ## colData names(47): Sample Barcode ... ## pct_counts_in_top_200_features_Mito ## pct_counts_in_top_500_features_Mito ## reducedDimNames(0): ## spikeNames(0): ### loading ### library(TENxPBMCData) pbmc4k <- TENxPBMCData('pbmc4k') ### quality-control ### is.mito <- grep("MT", rowData(pbmc4k)$Symbol_TENx)

library(scater)
pbmc4k <- calculateQCMetrics(pbmc4k, feature_controls=list(Mito=is.mito))
high.mito <- isOutlier(pbmc4k$pct_counts_Mito, nmads=3, type="higher") pbmc4k <- pbmc4k[,!high.mito] ### normalization ### pbmc4k <- normalize(pbmc4k) ### variance-modelling ### library(scran) fit4k <- trendVar(pbmc4k, use.spikes = FALSE) dec4k <- decomposeVar(pbmc4k, fit4k) pbmc4k ## class: SingleCellExperiment ## dim: 33694 4182 ## metadata(1): log.exprs.offset ## assays(2): counts logcounts ## rownames(33694): ENSG00000243485 ENSG00000237613 ... ## ENSG00000277475 ENSG00000268674 ## rowData names(11): ENSEMBL_ID Symbol_TENx ... total_counts ## log10_total_counts ## colnames: NULL ## colData names(47): Sample Barcode ... ## pct_counts_in_top_200_features_Mito ## pct_counts_in_top_500_features_Mito ## reducedDimNames(0): ## spikeNames(0): To prepare for the batch correction: 1. We subset all batches to the common “universe” of features. In this case, it is straightforward as both batches use Ensembl gene annotation19. universe <- intersect(rownames(pbmc3k), rownames(pbmc4k)) length(universe) ## [1] 31232 1. We rescale each batch to adjust for differences in sequencing depth between batches. The multiBatchNorm() function recomputes log-normalized expression values after adjusting the size factors for systematic differences in coverage between SingleCellExperiment objects. (Size factors only remove biases between cells within a single batch.) This improves the quality of the correction by removing one aspect of the technical differences between batches. library(batchelor) rescaled <- multiBatchNorm(pbmc3k[universe,], pbmc4k[universe,]) pbmc3k <- rescaled[[1]] pbmc4k <- rescaled[[2]] 1. We obtain a single set of features for batch correction by compute the average biological component across all batches. Here, we take all genes with positive biological components to ensure that all interesting biology is retained. However, other feature selection strategies described in Chapter ??? are also reasonable. mean.bio <- (dec3k[universe,"bio"] + dec4k[universe,"bio"])/2 chosen <- universe[mean.bio > 0] length(chosen) ## [1] 8166 ### 14.2.2 Performing MNN correction Consider a cell $$a$$ in batch $$A$$, and identify the cells in batch $$B$$ that are nearest neighbours to $$a$$ in the expression space defined by the selected features. Repeat this for a cell $$b$$ in batch $$B$$, identifying its nearest neighbours in $$A$$. Mutual nearest neighbours are pairs of cells from different batches that belong in each other’s set of nearest neighbours. The reasoning is that MNN pairs represent cells from the same biological state prior to the application of a batch effect - see Haghverdi et al. (2018) for full theoretical details. Thus, the difference between cells in MNN pairs can be used as an estimate of the batch effect, the subtraction of which can yield batch-corrected values. The batchelor package provides an implementation of the MNN approach via the fastMNN() function. We apply it to our two PBMC batches to remove the batch effect across the highly variable genes in chosen. To reduce computational work and technical noise, all cells in all batches are projected into the low-dimensional space defined by the top d principal components. Identification of MNNs and calculation of correction vectors are then performed in this low-dimensional space. # Using randomized SVD here, as this is faster than # irlba for file-backed matrices. set.seed(1000101001) mnn.out <- fastMNN(pbmc3k, pbmc4k, d=50, k=20, BSPARAM=BiocSingular::RandomParam(deferred=TRUE)) The function returns a SingleCellExperiment object containing corrected values for downstream analyses like clustering or visualization. Each column of mnn.out corresponds to a cell in one of the batches, while each row corresponds to an input gene in chosen. mnn.out ## class: SingleCellExperiment ## dim: 31232 6791 ## metadata(2): merge.order merge.info ## assays(1): reconstructed ## rownames(31232): ENSG00000243485 ENSG00000237613 ... ## ENSG00000198695 ENSG00000198727 ## rowData names(1): rotation ## colnames: NULL ## colData names(1): batch ## reducedDimNames(1): corrected ## spikeNames(0): The batch field in the column metadata contains a vector specifying the batch of origin of each cell. head(mnn.out$batch) 
## [1] 1 1 1 1 1 1

The corrected matrix in the reducedDims slot contains the low-dimensional corrected coordinates for all cells, which we will use in place of the PCs in our downstream analyses.

dim(reducedDim(mnn.out, "corrected"))
## [1] 6791   50

The k parameter is the most important and specifies the number of nearest neighbours to consider when defining MNN pairs. This should be interpreted as the minimum frequency of any shared cell type or state in each batch.

• Larger values will improve the precision of the correction by increasing the number of MNN pairs.
• Larger k provides some robustness to violations of the assumption that the batch vector is orthogonal to the biological subspace (Haghverdi et al. 2018), by allowing the neighbour search to ignore biological variation in each batch to identify the correct MNN pairs.
• However, larger k can also reduce accuracy by allowing incorrect MNN pairs to form between cells of different types.

We suggest starting with the default k and increasing it if one is confident that the same cell types are not adequately merged across batches. This is better than starting with a large k as incorrect merging is much harder to diagnose than insufficient merging.

### 14.2.3 Correction diagnostics

#### 14.2.3.1 Clustering and visualization

We use graph-based clustering on the low-dimensional corrected coordinates to obtain a partitioning of the cells that serves as a proxy for the population structure. If the batch effect is successfully corrected, clusters corresponding to shared cell types or states should contain cells from both batches. In this particular scenario, we expect that both PBMC populations will contain the same set of cell types, so all clusters should contain contributions from both batches.

library(scran)
snn.gr <- buildSNNGraph(mnn.out, use.dimred="corrected")
clusters <- igraph::cluster_walktrap(snn.gr)$membership tab <- table(Cluster=clusters, Batch=mnn.out$batch)
tab
##        Batch
## Cluster    1    2
##      1   152  181
##      2   331  589
##      3   531 1205
##      4   277  502
##      5   283  597
##      6   617  579
##      7   203  181
##      8    13   51
##      9    17   69
##      10    7   18
##      11   11   16
##      12   18   61
##      13  131   86
##      14    3   36
##      15    4    8
##      16   11    3

We can also visualize the corrected coordinates using a $$t$$-SNE plot (Figure 14.1). The presence of visual clusters containing cells from both batches provides a comforting illusion that the correction was successful.

library(scater)
set.seed(0010101010)
mnn.out <- runTSNE(mnn.out, use_dimred="corrected")

mnn.out$batch <- factor(mnn.out$batch)
plotTSNE(mnn.out, colour_by="batch")

Needless to say, the mixing of cells from different batches is not an effective diagnostic when the batches involved actually contain unique cell subpopulations. If a cluster only contains cells from a single batch, one can always debate whether that is caused by a failure of the correction method or if there is truly a batch-specific population. For example, do batch-specific metabolic or differentiation states represent distinct subpopulations? Or should they be merged together? We will not attempt to answer this here, only noting that each batch correction algorithm will make different (and possibly inappropriate) decisions on what constitutes “shared” and “unique” populations.

#### 14.2.3.2 Percentage of variance lost

For fastMNN(), one useful diagnostic is the proportion of variance within each batch that is lost during MNN correction. Specifically, this refers to the within-batch variance that is removed during orthogonalization with respect to the average correction vector at each merge step. This is returned via the lost.var field in the metadata of mnn.out, which contains a matrix of the variance lost in each batch (column) at each merge step (row).

metadata(mnn.out)$merge.info$lost.var
##          [,1]     [,2]
## [1,] 0.004368 0.003173

Large proportions of lost variance suggest that correction is removing genuine biological heterogeneity. This would occur due to violations of the assumption of orthogonality between the batch effect and the biological subspace (Haghverdi et al. 2018). In this case, the proportion of lost variance is small, indicating that non-orthogonality is not a major concern.

## 14.3 Session Info

R version 3.6.0 (2019-04-26)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 14.04.6 LTS

Matrix products: default
BLAS/LAPACK: /app/easybuild/software/OpenBLAS/0.2.18-GCC-5.4.0-2.26-LAPACK-3.6.1/lib/libopenblas_prescottp-r0.2.18.so

locale:
[1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8       LC_NAME=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices utils     datasets
[8] methods   base

other attached packages:
[1] scater_1.13.9               ggplot2_3.2.0
[3] scran_1.13.9                HDF5Array_1.13.4
[5] rhdf5_2.29.0                batchelor_1.1.4
[7] SingleCellExperiment_1.7.0  SummarizedExperiment_1.15.5
[9] DelayedArray_0.11.4         BiocParallel_1.19.0
[11] matrixStats_0.54.0          Biobase_2.45.0
[13] GenomicRanges_1.37.14       GenomeInfoDb_1.21.1
[15] IRanges_2.19.10             S4Vectors_0.23.17
[17] BiocGenerics_0.31.5         BiocStyle_2.13.2
[19] Cairo_1.5-10

loaded via a namespace (and not attached):
[1] viridis_0.5.1            dynamicTreeCut_1.63-1
[3] edgeR_3.27.9             BiocSingular_1.1.5
[5] viridisLite_0.3.0        DelayedMatrixStats_1.7.1
[7] assertthat_0.2.1         statmod_1.4.32
[9] BiocManager_1.30.4       dqrng_0.2.1
[11] GenomeInfoDbData_1.2.1   vipor_0.4.5
[13] yaml_2.2.0               pillar_1.4.2
[15] lattice_0.20-38          glue_1.3.1
[17] limma_3.41.15            digest_0.6.20
[19] XVector_0.25.0           colorspace_1.4-1
[21] htmltools_0.3.6          Matrix_1.2-17
[23] pkgconfig_2.0.2          bookdown_0.12
[25] zlibbioc_1.31.0          purrr_0.3.2
[27] scales_1.0.0             tibble_2.1.3
[29] withr_2.1.2              lazyeval_0.2.2
[31] magrittr_1.5             crayon_1.3.4
[33] evaluate_0.14            beeswarm_0.2.3
[35] tools_3.6.0              stringr_1.4.0
[37] Rhdf5lib_1.7.3           munsell_0.5.0
[39] locfit_1.5-9.1           irlba_2.3.3
[41] compiler_3.6.0           rsvd_1.0.2
[43] rlang_0.4.0              grid_3.6.0
[45] RCurl_1.95-4.12          BiocNeighbors_1.3.3
[47] igraph_1.2.4.1           bitops_1.0-6
[49] rmarkdown_1.14           gtable_0.3.0
[51] R6_2.4.0                 gridExtra_2.3
[53] knitr_1.23               dplyr_0.8.3
[55] stringi_1.4.3            ggbeeswarm_0.6.0
[57] Rcpp_1.0.2               tidyselect_0.2.5
[59] xfun_0.8                

### Bibliography

Butler, A., P. Hoffman, P. Smibert, E. Papalexi, and R. Satija. 2018. “Integrating single-cell transcriptomic data across different conditions, technologies, and species.” Nat. Biotechnol. 36 (5):411–20.

Haghverdi, L., A. T. L. Lun, M. D. Morgan, and J. C. Marioni. 2018. “Batch effects in single-cell RNA-sequencing data are corrected by matching mutual nearest neighbors.” Nat. Biotechnol. 36 (5):421–27.

Leek, J. T., W. E. Johnson, H. S. Parker, A. E. Jaffe, and J. D. Storey. 2012. “The sva package for removing batch effects and other unwanted variation in high-throughput experiments.” Bioinformatics 28 (6):882–83.

Lin, Y., S. Ghazanfar, K. Y. X. Wang, J. A. Gagnon-Bartsch, K. K. Lo, X. Su, Z. G. Han, et al. 2019. “scMerge leverages factor analysis, stable expression, and pseudoreplication to merge multiple single-cell RNA-seq datasets.” Proc. Natl. Acad. Sci. U.S.A. 116 (20):9775–84.

Ritchie, M. E., B. Phipson, D. Wu, Y. Hu, C. W. Law, W. Shi, and G. K. Smyth. 2015. “limma powers differential expression analyses for RNA-sequencing and microarray studies.” Nucleic Acids Res. 43 (7):e47.

1. As we shall see later, this step can be much, much, much more painful. As is often said, biologists would rather share a toothbrush than nomenclature.