Chapter 17 Adaptations for Large-scale Data

Large datasets such as the Human Cell Atlas (with over 1.3 million cells) have the potential to benefit from special adaptations that enable analysis in compute-constrained environments (such as personal laptops). Here, we briefly cover topics that aim to ease working with scRNA-seq data to make it faster and more tractable.

Do note however that these adaptations do not universally result in improved computational efficiency. For example, parallelization does incur costs, and disk-backed data representation will generally be slower than purely in-memory representations. In both cases, hardware specifics may dictate any potential gains in efficiency, as can be imagined in the case of a solid state drive (SSD) being faster for disk-backed data representations relative to a hard disk drive (HDD).

Thus, with the right compute environment, data, and task, these adaptations can yield significant computational improvements.

17.1 Approximate Methods

A general approach that works across all scales of data involves changing the task itself. Some methods - such as PCA or nearest neighbor searches - have been extended to include versions that provide approximate results. Generally, these adaptations result in acceptable losses in accuracy for significant computational gains. In some cases, approximate methods may even be desirable for the results themselves, as has been shown by the FIt-SNE approach.

Some example packages that provide approximate versions of popular methods include:

  • BiocSingular via the IrlbaParam() and RandomParam() for approximate singular value decomposition (SVD)
  • BiocNeighbors via the AnnoyParam() for approximate nearest neighbor searches

These packages provide users (and developers) a common interface, enabling modular swapping of key algorithms within functions. For example, we can see the immediate benefit of using an approximate method for PCA as provided by the BSPARAM argument in the following code, which utilizes the bench package for profiling:

## # A tibble: 2 x 4
##   expression                                    min   median mem_alloc
##   <bch:expr>                               <bch:tm> <bch:tm> <bch:byt>
## 1 runPCA(tiny_sce, BSPARAM = IrlbaParam()) 532.06ms 532.06ms        0B
## 2 runPCA(tiny_sce, BSPARAM = ExactParam())    1.02s    1.02s        0B

We can see that the approximate, irlba based implementation is about 6 times faster than the exact version in this case.

We can also try this with nearest neighbor searches. Here we provide the BNPARAM argument to build a shared nearest neighbors graph using different algorithms under the hood:

## # A tibble: 2 x 4
##   expression                                            min median
##   <bch:expr>                                          <bch> <bch:>
## 1 buildSNNGraph(tiny_sce_pca, BNPARAM = AnnoyParam())  4.5s   4.5s
## 2 buildSNNGraph(tiny_sce_pca, BNPARAM = KmknnParam()) 3.53s  3.53s
## # … with 1 more variable: mem_alloc <bch:byt>

We can see from the above benchmark that in our tiny dataset, we don’t see much, if any, benefit of using an approximate method (via the AnnoyParam(), which uses the Annoy library).

However, if we increase our dataset to something larger..

## # A tibble: 2 x 4
##   expression                                          min  median mem_alloc
##   <bch:expr>                                     <bch:tm> <bch:t> <bch:byt>
## 1 buildSNNGraph(big_sce, BNPARAM = AnnoyParam())    37.8s   37.8s        0B
## 2 buildSNNGraph(big_sce, BNPARAM = KmknnParam())    51.3s   51.3s        0B

We see (more consistently) that we have realized some gains from using the approximate-based nearest neighbors search.

17.2 Parallelization

Many tasks that arise in the analysis of scRNA-seq data are able to be parallelized. In other words, the problem can be broken down into smaller pieces that can be solved independently. Parallel computing can be performed in computing environments with access to multiple cores.

Bioconductor has reduced the complexity of implementing parallelized software via the BiocParallel package. This enables a common interface across Bioconductor software packages for parallel computing.

Across many Bioconductor packages, an argument that will often be present in functions that are parallelizable is the BPPARAM argument. Here, it is possible to specify the parallel back-end that can be used for evaluation.

Again, note that parallelization does incur an overhead cost in splitting up the data, sending it off, and combining the results, and thus your mileage may vary depending on the specifics.

Below is an example usage of the BiocParallel library supplying BPPARAM argument for constructing the UMAP representation from the PCA results:

## # A tibble: 2 x 4
##   expression                                           min median mem_alloc
##   <bch:expr>                                        <bch:> <bch:> <bch:byt>
## 1 runUMAP(tiny_sce_pca, BPPARAM = SerialParam())     7.55s  7.55s        0B
## 2 runUMAP(tiny_sce_pca, BPPARAM = MulticoreParam())  6.79s  6.79s        0B

On this dataset and equipment, there’s not much, if any, benefit to parallelization, but your mileage may vary dependent on those aspects.

17.3 On-Disk Data

The matrix of data from a single-cell experiment can be on the order of tens to hundreds of gigabytes, depending on the number of features and cells measured. A standard approach to import and represent these matrices has been to load the entire data set into memory using either the matrix object in base R or sparse and dense matrix classes from the Matrix R package. This approach however may prove intractable in computing environments with limited memory.

Disk-backed representations such as HDF5 free us from having to load an entire dataset into memory, and thus make it possible to work with large-scale scRNA-seq data.

Session Info

R version 3.6.1 (2019-07-05)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 14.04.5 LTS

Matrix products: default
BLAS:   /home/ramezqui/Rbuild/danbuild/R-3.6.1/lib/
LAPACK: /home/ramezqui/Rbuild/danbuild/R-3.6.1/lib/

 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=C              
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            

attached base packages:
[1] parallel  stats4    stats     graphics  grDevices utils     datasets 
[8] methods   base     

other attached packages:
 [1] HDF5Array_1.13.8            rhdf5_2.29.3               
 [3] BiocNeighbors_1.3.3         scran_1.13.18              
 [5] BiocSingular_1.1.5          bench_1.0.4                
 [7] scater_1.13.18              ggplot2_3.2.1              
 [9] SingleCellExperiment_1.7.8  SummarizedExperiment_1.15.9
[11] DelayedArray_0.11.4         BiocParallel_1.19.2        
[13] matrixStats_0.55.0          Biobase_2.45.1             
[15] GenomicRanges_1.37.15       GenomeInfoDb_1.21.1        
[17] IRanges_2.19.14             S4Vectors_0.23.21          
[19] BiocGenerics_0.31.5         Cairo_1.5-10               
[21] BiocStyle_2.13.2            OSCAUtils_0.0.1            

loaded via a namespace (and not attached):
 [1] viridis_0.5.1            edgeR_3.27.13           
 [3] viridisLite_0.3.0        DelayedMatrixStats_1.7.2
 [5] RcppParallel_4.4.3       assertthat_0.2.1        
 [7] statmod_1.4.32           BiocManager_1.30.4      
 [9] dqrng_0.2.1              GenomeInfoDbData_1.2.1  
[11] vipor_0.4.5              yaml_2.2.0              
[13] pillar_1.4.2             lattice_0.20-38         
[15] glue_1.3.1               limma_3.41.16           
[17] digest_0.6.20            XVector_0.25.0          
[19] colorspace_1.4-1         htmltools_0.3.6         
[21] Matrix_1.2-17            pkgconfig_2.0.2         
[23] bookdown_0.13            zlibbioc_1.31.0         
[25] purrr_0.3.2              scales_1.0.0            
[27] RSpectra_0.15-0          tibble_2.1.3            
[29] withr_2.1.2              lazyeval_0.2.2          
[31] cli_1.1.0                magrittr_1.5            
[33] crayon_1.3.4             evaluate_0.14           
[35] fansi_0.4.0              FNN_1.1.3               
[37] beeswarm_0.2.3           tools_3.6.1             
[39] stringr_1.4.0            Rhdf5lib_1.7.5          
[41] munsell_0.5.0            locfit_1.5-9.1          
[43] irlba_2.3.3              compiler_3.6.1          
[45] rsvd_1.0.2               rlang_0.4.0             
[47] grid_3.6.1               RCurl_1.95-4.12         
[49] igraph_1.2.4.1           bitops_1.0-6            
[51] rmarkdown_1.15           gtable_0.3.0            
[53] R6_2.4.0                 gridExtra_2.3           
[55] knitr_1.24               dplyr_0.8.3             
[57] uwot_0.1.3               utf8_1.1.4              
[59] stringi_1.4.3            ggbeeswarm_0.6.0        
[61] Rcpp_1.0.2               tidyselect_0.2.5        
[63] xfun_0.9