Chapter 21 Dealing with big data

21.1 Motivation

Advances in scRNA-seq technologies have increased the number of cells that can be assayed in routine experiments. Public databases such as GEO are continually expanding with more scRNA-seq studies, while large-scale projects such as the Human Cell Atlas are expected to generate data for billions of cells. For effective data analysis, the computational methods need to scale with the increasing size of scRNA-seq data sets. This section discusses how we can use various aspects of the Bioconductor ecosystem to tune our analysis pipelines for greater speed and efficiency.

21.2 Fast approximations

21.2.1 Nearest neighbor searching

Identification of neighbouring cells in PC or expression space is a common procedure that is used in many functions, e.g., buildSNNGraph(), doubletCells(). The default is to favour accuracy over speed by using an exact nearest neighbour (NN) search, implemented with the \(k\)-means for \(k\)-nearest neighbours algorithm (Wang 2012). However, for large data sets, it may be preferable to use a faster approximate approach. The BiocNeighbors framework makes it easy to switch between search options by simply changing the BNPARAM= argument in compatible functions. To demonstrate, we will use the 10X PBMC data:

#--- setup ---#
library(OSCAUtils)
chapterPreamble(use_cache = TRUE)

#--- loading ---#
library(BiocFileCache)
bfc <- BiocFileCache("raw_data", ask = FALSE)
raw.path <- bfcrpath(bfc, file.path("http://cf.10xgenomics.com/samples",
    "cell-exp/2.1.0/pbmc4k/pbmc4k_raw_gene_bc_matrices.tar.gz"))
untar(raw.path, exdir=file.path(tempdir(), "pbmc4k"))

library(DropletUtils)
fname <- file.path(tempdir(), "pbmc4k/raw_gene_bc_matrices/GRCh38")
sce.pbmc <- read10xCounts(fname, col.names=TRUE)

#--- gene-annotation ---#
library(scater)
rownames(sce.pbmc) <- uniquifyFeatureNames(
    rowData(sce.pbmc)$ID, rowData(sce.pbmc)$Symbol)

library(EnsDb.Hsapiens.v86)
location <- mapIds(EnsDb.Hsapiens.v86, keys=rowData(sce.pbmc)$ID, 
    column="SEQNAME", keytype="GENEID")

#--- cell-detection ---#
set.seed(100)
e.out <- emptyDrops(counts(sce.pbmc))
sce.pbmc <- sce.pbmc[,which(e.out$FDR <= 0.001)]

#--- quality-control ---#
stats <- perCellQCMetrics(sce.pbmc, subsets=list(Mito=which(location=="MT")))
high.mito <- isOutlier(stats$subsets_Mito_percent, type="higher")
sce.pbmc <- sce.pbmc[,!high.mito]

#--- normalization ---#
library(scran)
set.seed(1000)
clusters <- quickCluster(sce.pbmc)
sce.pbmc <- computeSumFactors(sce.pbmc, cluster=clusters)
sce.pbmc <- logNormCounts(sce.pbmc)

#--- variance-modelling ---#
set.seed(1001)
dec.pbmc <- modelGeneVarByPoisson(sce.pbmc)
top.pbmc <- getTopHVGs(dec.pbmc, prop=0.1)

#--- dimensionality-reduction ---#
set.seed(10000)
sce.pbmc <- denoisePCA(sce.pbmc, subset.row=top.pbmc, technical=dec.pbmc)

set.seed(100000)
sce.pbmc <- runTSNE(sce.pbmc, dimred="PCA")

set.seed(1000000)
sce.pbmc <- runUMAP(sce.pbmc, dimred="PCA")

#--- clustering ---#
g <- buildSNNGraph(sce.pbmc, k=10, use.dimred = 'PCA')
clust <- igraph::cluster_walktrap(g)$membership
sce.pbmc$cluster <- factor(clust)
## class: SingleCellExperiment 
## dim: 33694 3922 
## metadata(1): Samples
## assays(2): counts logcounts
## rownames(33694): RP11-34P13.3 FAM138A ... AC213203.1 FAM231B
## rowData names(2): ID Symbol
## colnames(3922): AAACCTGAGAAGGCCT-1 AAACCTGAGACAGACC-1 ... TTTGTCACAGGTCCAC-1
##   TTTGTCATCCCAAGAT-1
## colData names(3): Sample Barcode cluster
## reducedDimNames(3): PCA TSNE UMAP
## spikeNames(0):
## altExpNames(0):

We had previously clustered on a shared nearest neighbor graph generated with an exact neighbour search (Section 10.3). We repeat this below using an approximate search, implemented using the Annoy algorithm. This involves constructing a AnnoyParam object to specify the search algorithm and then passing it to the buildSNNGraph() function. The results from the exact and approximate searches are consistent with most clusters from the former re-appearing in the latter. This suggests that the inaccuracy from the approximation can be largely ignored.

##      Approx
## Exact   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17  18  19
##    1    0 463   0   0   0  49   0  73   0   0   0   0   0   0   0   0   0   0   0
##    2  518   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
##    3    0   0 364   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
##    4    0   0   0 451   0   0   7   0   0   0   0   0   0   0   0   0   0   0   0
##    5    0   0   0   0 170   0   0   0   0   0   0   0   0   0   0   0   0   0   0
##    6    0   0   0   0   0 790   0   0   0   0   0   0   0   1   0   0   0   0   0
##    7    0   0   0   0   0   0 295   0   0   0   0   0   0   0   0   0   0   0   0
##    8    0   0   0   0   0   0  30   0   0   0   0   0   0   0  77   0   0   0   0
##    9    0   0   0   0   0   0   0   0  45   0   0   0   0   0   0   0   0   0   0
##    10   1   0   0   0   0   0   0   0   0  45   0   0   0   0   0   0   0   0   0
##  [ reached getOption("max.print") -- omitted 8 rows ]

Note that Annoy writes the NN index to disk prior to performing the search. Thus, it may not actually be faster than the default exact algorithm for small datasets, depending on whether the overhead of disk write is offset by the computational complexity of the search. It is also not difficult to find situations where the approximation deteriorates, especially at high dimensions, though this may not have an appreciable impact on the biological conclusions.

## [1] 0.5619

21.2.2 Singular value decomposition

The singular value decomposition (SVD) underlies the PCA used throughout our analyses, e.g., in denoisePCA(), fastMNN(), doubletCells(). (Briefly, the right singular vectors are the eigenvectors of the gene-gene covariance matrix, where each eigenvector represents the axis of maximum remaining variation in the PCA.) The default base::svd() function performs an exact SVD that is not performant for large datasets. Instead, we use fast approximate methods from the irlba and rsvd packages, conveniently wrapped into the BiocSingular package for ease of use and package development. Specifically, we can change the SVD algorithm used in any of these functions by simply specifying an alternative value for the BSPARAM= argument.

##  num [1:3922, 1:20] 15.3 13.41 -8.46 -7.86 6.38 ...
##  - attr(*, "dimnames")=List of 2
##   ..$ : chr [1:3922] "AAACCTGAGAAGGCCT-1" "AAACCTGAGACAGACC-1" "AAACCTGAGGCATGGT-1" "AAACCTGCAAGGTTCT-1" ...
##   ..$ : chr [1:20] "PC1" "PC2" "PC3" "PC4" ...
##  - attr(*, "percentVar")= num [1:20] 20.26 10.02 5.36 2.19 1.41 ...
##  num [1:3922, 1:20] 15.3 13.41 -8.46 -7.86 6.38 ...
##  - attr(*, "dimnames")=List of 2
##   ..$ : chr [1:3922] "AAACCTGAGAAGGCCT-1" "AAACCTGAGACAGACC-1" "AAACCTGAGGCATGGT-1" "AAACCTGCAAGGTTCT-1" ...
##   ..$ : chr [1:20] "PC1" "PC2" "PC3" "PC4" ...
##  - attr(*, "percentVar")= num [1:20] 20.26 10.02 5.36 2.19 1.41 ...

Both IRLBA and randomized SVD (RSVD) are much faster than the exact SVD with negligible loss of accuracy. This motivates their default use in many scran and scater functions, at the cost of requiring users to set the seed to guarantee reproducibility. IRLBA can occasionally fail to converge and require more iterations (passed via maxit= in IrlbaParam()), while RSVD involves an explicit trade-off between accuracy and speed based on its oversampling parameter (p=) and number of power iterations (q=). We tend to prefer IRLBA as its default behavior is more accurate, though RSVD is much faster for file-backed matrices (Section 26.8).

21.3 Parallelization

Parallelization of calculations across genes or cells is an obvious strategy for speeding up scRNA-seq analysis workflows. The BiocParallel package provides a common interface for parallel computing throughout the Bioconductor ecosystem, manifesting as a BPPARAM= argument in compatible functions. We can pick from a diverse range of parallelization backends depending on the available hardware and operating system. For example, we might use forking across 2 cores to parallelize the variance calculations on a Unix system:

## DataFrame with 33694 rows and 6 columns
##                     mean       total        tech         bio   p.value       FDR
##                <numeric>   <numeric>   <numeric>   <numeric> <numeric> <numeric>
## RP11-34P13.3 0.000000000 0.000000000 0.000000000 0.00000e+00       NaN       NaN
## FAM138A      0.000000000 0.000000000 0.000000000 0.00000e+00       NaN       NaN
## OR4F5        0.000000000 0.000000000 0.000000000 0.00000e+00       NaN       NaN
## RP11-34P13.7 0.002196632 0.002262180 0.002255590 6.59070e-06  0.492098  0.745136
## RP11-34P13.8 0.000549026 0.000599631 0.000563762 3.58694e-05  0.333112  0.745136
## ...                  ...         ...         ...         ...       ...       ...
## AC233755.2     0.0000000   0.0000000   0.0000000  0.00000000       NaN       NaN
## AC233755.1     0.0000000   0.0000000   0.0000000  0.00000000       NaN       NaN
## AC240274.1     0.0101976   0.0120565   0.0104711  0.00158536  0.152349  0.745136
## AC213203.1     0.0000000   0.0000000   0.0000000  0.00000000       NaN       NaN
## FAM231B        0.0000000   0.0000000   0.0000000  0.00000000       NaN       NaN

Another approach would be to distribute jobs across a network of computers, which yields the same result:

## DataFrame with 33694 rows and 6 columns
##                     mean       total        tech         bio   p.value       FDR
##                <numeric>   <numeric>   <numeric>   <numeric> <numeric> <numeric>
## RP11-34P13.3 0.000000000 0.000000000 0.000000000 0.00000e+00       NaN       NaN
## FAM138A      0.000000000 0.000000000 0.000000000 0.00000e+00       NaN       NaN
## OR4F5        0.000000000 0.000000000 0.000000000 0.00000e+00       NaN       NaN
## RP11-34P13.7 0.002196632 0.002262180 0.002255590 6.59070e-06  0.492098  0.745136
## RP11-34P13.8 0.000549026 0.000599631 0.000563762 3.58694e-05  0.333112  0.745136
## ...                  ...         ...         ...         ...       ...       ...
## AC233755.2     0.0000000   0.0000000   0.0000000  0.00000000       NaN       NaN
## AC233755.1     0.0000000   0.0000000   0.0000000  0.00000000       NaN       NaN
## AC240274.1     0.0101976   0.0120565   0.0104711  0.00158536  0.152349  0.745136
## AC213203.1     0.0000000   0.0000000   0.0000000  0.00000000       NaN       NaN
## FAM231B        0.0000000   0.0000000   0.0000000  0.00000000       NaN       NaN

For high-performance computing (HPC) systems with a cluster of compute nodes, we can distribute jobs via the job scheduler using the BatchtoolsParam class. The example below assumes a SLURM cluster, though the settings can be easily configured for a particular system (see here for details).

Parallelization is best suited for CPU-intensive calculations where the division of labor results in a concomitant reduction in compute time. It is not suited for tasks that are bounded by other compute resources, e.g., memory or file I/O (though the latter is less of an issue on HPC systems with parallel read/write). In particular, R itself is inherently single-core, so many of the parallelization backends involve (i) setting up one or more separate R sessions, (ii) loading the relevant packages and (iii) transmitting the data to that session. Depending on the nature and size of the task, this overhead may outweigh any benefit from parallel computing.

21.4 Out of memory representations

The count matrix is the central structure around which our analyses are based. In most of the previous chapters, this has been held fully in memory as a dense matrix or as a sparse dgCMatrix. Howevever, in-memory representations may not be feasible for very large data sets, especially on machines with limited memory. For example, the 1.3 million brain cell data set from 10X Genomics (Zheng et al. 2017) would require over 100 GB of RAM to hold as a matrix and around 30 GB as a dgCMatrix. This makes it challenging to explore the data on anything less than a HPC system.

The obvious solution is to use a file-backed matrix representation where the data are held on disk and subsets are retrieved into memory as requested. While a number of implementations of file-backed matrices are available (e.g., bigmemory, matter), we will be using the implementation from the HDF5Array package. This uses the popular HDF5 format as the underlying data store, which provides a measure of standardization and portability across systems. We demonstrate with a subset of 20,000 cells from the 1.3 million brain cell data set, as provided by the TENxBrainData package.

## class: SingleCellExperiment 
## dim: 27998 20000 
## metadata(0):
## assays(1): counts
## rownames: NULL
## rowData names(2): Ensembl Symbol
## colnames: NULL
## colData names(4): Barcode Sequence Library Mouse
## reducedDimNames(0):
## spikeNames(0):
## altExpNames(0):

Examination of the SingleCellExperiment object indicates that the count matrix is a HDF5Matrix. From a comparison of the memory usage, it is clear that this matrix object is simply a stub that points to the much larger HDF5 file that actually contains the data. This avoids the need for large RAM availability during analyses.

## <27998 x 20000> matrix of class HDF5Matrix and type "integer":
##              [,1]     [,2]     [,3]     [,4] ... [,19997] [,19998] [,19999] [,20000]
##     [1,]        0        0        0        0   .        0        0        0        0
##     [2,]        0        0        0        0   .        0        0        0        0
##     [3,]        0        0        0        0   .        0        0        0        0
##     [4,]        0        0        0        0   .        0        0        0        0
##     [5,]        0        0        0        0   .        0        0        0        0
##      ...        .        .        .        .   .        .        .        .        .
## [27994,]        0        0        0        0   .        0        0        0        0
## [27995,]        0        0        0        1   .        0        2        0        0
## [27996,]        0        0        0        0   .        0        1        0        0
## [27997,]        0        0        0        0   .        0        0        0        0
## [27998,]        0        0        0        0   .        0        0        0        0
## 2160 bytes
## [1] 76264332

Manipulation of the count matrix will generally result in the creation of a DelayedArray object from the DelayedArray package. This remembers the operations to be applied to the counts and stores them in the object, to be executed when the modified matrix values are realized for use in calculations. The use of delayed operations avoids the need to write the modified values to a new file at every operation, which would unnecessarily require time-consuming disk I/O.

## <27998 x 20000> matrix of class DelayedMatrix and type "double":
##              [,1]     [,2]     [,3] ... [,19999] [,20000]
##     [1,]        0        0        0   .        0        0
##     [2,]        0        0        0   .        0        0
##     [3,]        0        0        0   .        0        0
##     [4,]        0        0        0   .        0        0
##     [5,]        0        0        0   .        0        0
##      ...        .        .        .   .        .        .
## [27994,]        0        0        0   .        0        0
## [27995,]        0        0        0   .        0        0
## [27996,]        0        0        0   .        0        0
## [27997,]        0        0        0   .        0        0
## [27998,]        0        0        0   .        0        0

Many functions described in the previous workflows are capable of accepting HDF5Matrix objects. This is powered by the availability of common methods for all matrix representations (e.g., subsetting, combining, methods from DelayedMatrixStats) as well as representation-agnostic C++ code using beachmat (A. T. L. Lun, Pages, and Smith 2018). For example, we compute QC metrics below with the same calculateQCMetrics() function that we used in the other workflows.

## DataFrame with 20000 rows and 10 columns
##             sum  detected percent_top_50 percent_top_100 percent_top_200 percent_top_500
##       <integer> <integer>      <numeric>       <numeric>       <numeric>       <numeric>
## 1          3060      1546        24.1830         34.5752         46.5033         65.8170
## 2          3500      1694        22.7429         33.2286         45.6571         64.8286
## 3          3092      1613        22.3480         33.8292         45.7309         64.0039
## 4          4420      2050        24.7511         33.7783         44.8190         61.4706
## 5          3771      1813        23.0443         33.1742         45.0544         63.1398
## ...         ...       ...            ...             ...             ...             ...
## 19996      4431      2050        23.0196         32.7917         44.5046         60.9795
## 19997      6988      2704        18.6606         28.5633         40.5839         58.2856
## 19998      8749      2988        23.9113         33.6267         44.4965         60.9784
## 19999      3842      1711        24.7267         36.8037         48.7507         66.6840
## 20000      1775       945        29.8592         40.9014         54.0845         74.9296
##       subsets_Mt_sum subsets_Mt_detected subsets_Mt_percent     total
##            <integer>           <integer>          <numeric> <integer>
## 1                123                  10            4.01961      3060
## 2                118                  11            3.37143      3500
## 3                 58                   9            1.87581      3092
## 4                131                  10            2.96380      4420
## 5                100                   8            2.65182      3771
## ...              ...                 ...                ...       ...
## 19996            127                   9           2.866170      4431
## 19997             60                   9           0.858615      6988
## 19998            305                  11           3.486113      8749
## 19999            129                   8           3.357626      3842
## 20000             26                   6           1.464789      1775

Needless to say, data access from file-backed representations is slower than that from in-memory representations. The time spent retrieving data from disk is an unavoidable cost of reducing memory usage. Whether this is tolerable depends on the application. One example usage pattern involves performing the heavy computing quickly with in-memory representations on HPC systems with plentiful memory, and then distributing file-backed counterparts to individual users for exploration and visualization on their personal machines.

Session Info

R Under development (unstable) (2019-12-29 r77627)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 14.04.6 LTS

Matrix products: default
BLAS/LAPACK: /app/easybuild/software/OpenBLAS/0.2.18-GCC-5.4.0-2.26-LAPACK-3.6.1/lib/libopenblas_prescottp-r0.2.18.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=en_US.UTF-8       
 [4] LC_COLLATE=C               LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                  LC_ADDRESS=C              
[10] LC_TELEPHONE=C             LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] parallel  stats4    stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] TENxBrainData_1.7.0         HDF5Array_1.15.2            rhdf5_2.31.1               
 [4] BiocSingular_1.3.1          scater_1.15.12              ggplot2_3.2.1              
 [7] fossil_0.3.7                shapefiles_0.7              foreign_0.8-74             
[10] maps_3.3.0                  sp_1.3-2                    BiocNeighbors_1.5.1        
[13] scran_1.15.14               SingleCellExperiment_1.9.1  SummarizedExperiment_1.17.1
[16] DelayedArray_0.13.2         BiocParallel_1.21.2         matrixStats_0.55.0         
[19] Biobase_2.47.2              GenomicRanges_1.39.1        GenomeInfoDb_1.23.1        
[22] IRanges_2.21.2              S4Vectors_0.25.8            BiocGenerics_0.33.0        
[25] Cairo_1.5-10                BiocStyle_2.15.3            OSCAUtils_0.0.1            

loaded via a namespace (and not attached):
 [1] ggbeeswarm_0.6.0              colorspace_1.4-1              XVector_0.27.0               
 [4] bit64_0.9-7                   AnnotationDbi_1.49.0          interactiveDisplayBase_1.25.0
 [7] knitr_1.26                    zeallot_0.1.0                 dbplyr_1.4.2                 
[10] shiny_1.4.0                   BiocManager_1.30.10           compiler_4.0.0               
[13] httr_1.4.1                    dqrng_0.2.1                   backports_1.1.5              
[16] assertthat_0.2.1              Matrix_1.2-18                 fastmap_1.0.1                
[19] lazyeval_0.2.2                limma_3.43.0                  later_1.0.0                  
[22] htmltools_0.4.0               tools_4.0.0                   rsvd_1.0.2                   
[25] igraph_1.2.4.2                gtable_0.3.0                  glue_1.3.1                   
[28] GenomeInfoDbData_1.2.2        dplyr_0.8.3                   rappdirs_0.3.1               
[31] Rcpp_1.0.3                    vctrs_0.2.1                   ExperimentHub_1.13.5         
[34] DelayedMatrixStats_1.9.0      xfun_0.11                     stringr_1.4.0                
[37] ps_1.3.0                      beachmat_2.3.1                mime_0.8                     
[40] lifecycle_0.1.0               irlba_2.3.3                   statmod_1.4.32               
[43] AnnotationHub_2.19.3          edgeR_3.29.0                  zlibbioc_1.33.0              
[46] scales_1.1.0                  promises_1.1.0                yaml_2.2.0                   
[49] curl_4.3                      memoise_1.1.0                 gridExtra_2.3                
[52] stringi_1.4.3                 RSQLite_2.2.0                 BiocVersion_3.11.1           
[55] rlang_0.4.2                   pkgconfig_2.0.3               bitops_1.0-6                 
[58] evaluate_0.14                 lattice_0.20-38               purrr_0.3.3                  
[61] Rhdf5lib_1.9.0                bit_1.1-14                    processx_3.4.1               
[64] tidyselect_0.2.5              magrittr_1.5                  bookdown_0.16                
[67] R6_2.4.1                      snow_0.4-3                    DBI_1.1.0                    
[70] pillar_1.4.3                  withr_2.1.2                   RCurl_1.95-4.12              
[73] tibble_2.1.3                  crayon_1.3.4                  BiocFileCache_1.11.4         
[76] rmarkdown_2.0                 viridis_0.5.1                 locfit_1.5-9.1               
[79] grid_4.0.0                    blob_1.2.0                    callr_3.4.0                  
[82] digest_0.6.23                 xtable_1.8-4                  httpuv_1.5.2                 
[85] munsell_0.5.0                 beeswarm_0.2.3                viridisLite_0.3.0            
[88] vipor_0.4.5                  

Bibliography

Lun, A. T. L., H. Pages, and M. L. Smith. 2018. “beachmat: A Bioconductor C++ API for accessing high-throughput biological data from a variety of R matrix types.” PLoS Comput. Biol. 14 (5):e1006135.

Wang, X. 2012. “A Fast Exact K-Nearest Neighbors Algorithm for High Dimensional Search Using K-Means Clustering and Triangle Inequality.” Proc Int Jt Conf Neural Netw 43 (6):2351–8.

Zheng, G. X., J. M. Terry, P. Belgrader, P. Ryvkin, Z. W. Bent, R. Wilson, S. B. Ziraldo, et al. 2017. “Massively parallel digital transcriptional profiling of single cells.” Nat Commun 8 (January):14049.