Chapter 19 Analyzing repertoire sequencing data

This page is under construction. Enjoy these platelets at work instead.

Figure 19.1: This page is under construction. Enjoy these platelets at work instead.

19.1 Motivation

An organism’s immune repertoire is defined as the set of T and B cell subtypes that contain genetic diversity in the T cell receptor (TCR) components or immunoglobin chains, respectively. This diversity is important for ensuring that the adaptive immune system can respond effectively to a wide range of antigens. We can profile the immune repertoire by simply sequencing the relevant transcripts (Georgiou et al. 2014; Rosati et al. 2017), which can be combined with previously mentioned technologies (Zheng et al. 2017) to achieve single-cell resolution. This data can then be used to characterize an individual’s immune response based on the expansion of T or B cell clones, i.e., multiple cells with the same sequences for each TCR component or immunoglobulin chain.

By itself, single-cell repertoire sequencing data can be readily analyzed using tools from the ImmCantation suite. In particular, the alakazam package provides functions to perform common analyses to quantify clonal diversity, reconstruct lineages, examine amino acid properties, etc. However, in most experiments, we also obtain standard RNA sequencing data for each cell; the challenge lies in how we can integrate these two datasets together in the same analysis. This is not entirely trivial as each cell may have zero, one or multiple sequences for any given repertoire component, whereas we only obtain a single expression profile for that cell.

We would like to define a single data structure that captures both the expression profile and repertoire state for each cell. This ensures synchronization during operations like subsetting (as previously discussed for the SingleCellExperiment class) and reduces book-keeping errors throughout the course of an interactive analysis. We achieve this using the SplitDataFrame class from the S4Vectors package, which allows us to accommodate repertoire sequencing data into existing Bioconductor classes while retaining compatibility with ImmCantation functions. We demonstrate on a publicly available 10X Genomics dataset using mouse PBMCs:

#--- setup ---#
library(OSCAUtils)
chapterPreamble(use_cache = TRUE)

#--- loading ---#
library(BiocFileCache)
bfc <- BiocFileCache(ask=FALSE)
exprs.data <- bfcrpath(bfc, file.path(
    "http://cf.10xgenomics.com/samples/cell-vdj/3.1.0",
    "vdj_v1_hs_pbmc3",
    "vdj_v1_hs_pbmc3_filtered_feature_bc_matrix.tar.gz"))
untar(exprs.data, exdir=tempdir())

library(DropletUtils)
sce.pbmc <- read10xCounts(file.path(tempdir(), "filtered_feature_bc_matrix"))
sce.pbmc <- splitAltExps(sce.pbmc, rowData(sce.pbmc)$Type)

#--- quality-control ---#
library(scater)
is.mito <- grep("^MT-", rowData(sce.pbmc)$Symbol)
stats <- perCellQCMetrics(sce.pbmc, subsets=list(Mito=is.mito))

high.mito <- isOutlier(stats$subsets_Mito_percent, type="higher")
low.adt <- stats$`altexps_Antibody Capture_detected` < nrow(altExp(sce.pbmc))/2

discard <- high.mito | low.adt
sce.pbmc <- sce.pbmc[,!discard]

#--- normalization ---#
library(scran)

set.seed(1000)
clusters <- quickCluster(sce.pbmc)
sce.pbmc <- computeSumFactors(sce.pbmc, cluster=clusters)

# TODO: move into scran somewhere.
library(DelayedMatrixStats)
ambient <- rowMeans(counts(altExp(sce.pbmc)))
sf.amb <- colMedians(DelayedArray(counts(altExp(sce.pbmc)))/ambient)
sf.amb <- sf.amb/mean(sf.amb)
sizeFactors(altExp(sce.pbmc)) <- sf.amb

sce.pbmc <- logNormCounts(sce.pbmc, use_altexps=TRUE)

#--- dimensionality-reduction ---#
set.seed(100000)
altExp(sce.pbmc) <- runTSNE(altExp(sce.pbmc))

set.seed(1000000)
altExp(sce.pbmc) <- runUMAP(altExp(sce.pbmc))

#--- clustering ---#
g.adt <- buildSNNGraph(altExp(sce.pbmc), k=10, d=NA)
clust.adt <- igraph::cluster_walktrap(g.adt)$membership
altExp(sce.pbmc)$cluster <- factor(clust.adt)

set.seed(1010010)
subclusters <- quickSubCluster(sce.pbmc, clust.adt,
    prepFUN=function(x) {
        dec <- modelGeneVarByPoisson(x)
        top <- getTopHVGs(dec, prop=0.1)
        denoisePCA(x, dec, subset.row=top)
    }, 
    clusterFUN=function(x) {
        g.gene <- buildSNNGraph(x, k=10, use.dimred = 'PCA')
        igraph::cluster_walktrap(g.gene)$membership
    }
)
## class: SingleCellExperiment 
## dim: 33538 6660 
## metadata(1): Samples
## assays(2): counts logcounts
## rownames(33538): ENSG00000243485 ENSG00000237613 ... ENSG00000277475 ENSG00000268674
## rowData names(3): ID Symbol Type
## colnames: NULL
## colData names(2): Sample Barcode
## reducedDimNames(0):
## spikeNames(0):
## altExpNames(1): Antibody Capture

19.2 Analyzing the T cell receptor repertoire

19.2.1 Data processing

First, we obtain the filtered TCR contig annotations for the same set of cells. Each row of the resulting data frame contains information about a single TCR component sequence in one cell, broken down into the alleles of the V(D)J genes making up that component (v_gene, d_gene, j_gene) where possible. The number of reads and UMIs supporting the set of allele assignments for a cell is also shown, though only the UMI count should be used for quantifying expression of a particular TCR sequence. Each cell is assigned to a clonotype (raw_clonotype_id) based on the combination of the \(\alpha\)-chain (TRA) and \(\beta\)-chain (TRB) sequences in that cell.

##              barcode is_cell                   contig_id high_confidence length chain     v_gene
## 1 AAACCTGAGATCTGAA-1    True AAACCTGAGATCTGAA-1_contig_1            True    521   TRB   TRBV20-1
## 2 AAACCTGAGATCTGAA-1    True AAACCTGAGATCTGAA-1_contig_2            True    474   TRA   TRAV13-1
## 3 AAACCTGAGGAACTGC-1    True AAACCTGAGGAACTGC-1_contig_1            True    496   TRB    TRBV7-2
## 4 AAACCTGAGGAACTGC-1    True AAACCTGAGGAACTGC-1_contig_2            True    505   TRA TRAV23/DV6
## 5 AAACCTGAGGAGTCTG-1    True AAACCTGAGGAGTCTG-1_contig_1            True    495   TRA      TRAV2
## 6 AAACCTGAGGAGTCTG-1    True AAACCTGAGGAGTCTG-1_contig_2            True    526   TRB    TRBV6-2
##   d_gene  j_gene c_gene full_length productive               cdr3
## 1   None TRBJ2-7  TRBC2        True       True     CSARDKGLSYEQYF
## 2   None  TRAJ44   TRAC        True       True CAASIGPLGTGTASKLTF
## 3   None TRBJ2-1  TRBC2        True       True      CASSLGPSGEQFF
## 4   None  TRAJ34   TRAC        True       True       CAASDNTDKLIF
## 5   None  TRAJ38   TRAC        True       True   CAVEANNAGNNRKLIW
## 6   None TRBJ1-1  TRBC1        True       True      CASSRTGGTEAFF
##                                                  cdr3_nt reads umis raw_clonotype_id
## 1             TGCAGTGCTAGAGACAAGGGGCTTAGCTACGAGCAGTACTTC  9327   12     clonotype100
## 2 TGTGCAGCAAGTATCGGCCCCCTAGGAACCGGCACTGCCAGTAAACTCACCTTT  3440    3     clonotype100
## 3                TGTGCCAGCAGCTTGGGACCATCGGGTGAGCAGTTCTTC 32991   29     clonotype101
## 4                   TGTGCAGCAAGCGATAACACCGACAAGCTCATCTTT 10714    9     clonotype101
## 5       TGTGCTGTGGAGGCTAATAATGCTGGCAACAACCGTAAGCTGATTTGG  1734    3     clonotype102
## 6                TGTGCCAGCAGTCGGACAGGGGGCACTGAAGCTTTCTTT 15530   13     clonotype102
##           raw_consensus_id
## 1 clonotype100_consensus_1
## 2 clonotype100_consensus_2
## 3 clonotype101_consensus_2
## 4 clonotype101_consensus_1
## 5 clonotype102_consensus_1
## 6 clonotype102_consensus_2

The challenge in incorporating all of these data structures into a single object lies in the fact that each cell may have zero, one or many TCR/immunoglobulin sequences. This precludes direct storage of repertoire information in the colData() of the SingleCellExperiment, which expects a 1:1 mapping from each cell to each repertoire sequence. Instead, we store the repertoire data as a SplitDataFrameList object where each cell corresponds to a variable-row DataFrame containing information for zero-to-many sequences. This is easily done using the DataFrame() constructor and the split() function, as shown below:

19.2.2 Basic diagnostics

We start by generating some basic cell-level diagnostics. For each cluster, we determine the proportion of cells that have at least one sequence of a TCR component (Figure 19.2). Clusters with large counts are most likely to be T cells, though some background level of TCR expression may be observed in other clusters due to a mixture of clustering uncertainty, ambient contamination, doublet formation and expression in other cell types.

## 
##    1    2    3    4    5    6    7    8    9   10   11   12   13   14   15   16   17   18   19   20 
##  160  507  662   39  691 1415   32  650   76 1037  121   47   68   25   15  562  139   32   44  120 
##   21   22   23   24 
##   84   65   52   17
Proportion of cells in each cluster that express at least one sequence of the TCR $\alpha$ or $\beta$-chains.

Figure 19.2: Proportion of cells in each cluster that express at least one sequence of the TCR \(\alpha\) or \(\beta\)-chains.

We can refine this to only consider the productive sequences, i.e., contigs that are likely to produce a functional protein (Figure 19.3). This is achieved by subsetting the SplitDataFrameList with a LogicalList object, the latter of which specifies the rows of each per-cell DataFrame corresponding to productive sequences.

Proportion of cells in each cluster that express at least one productive sequence of the TCR $\alpha$ or $\beta$-chains.

Figure 19.3: Proportion of cells in each cluster that express at least one productive sequence of the TCR \(\alpha\) or \(\beta\)-chains.

In contrast, using a conventional logical or integer vector to subset the SplitDataFrameList will subset the cells themselves (rather than the rows for each cell, as shown above). For example, we can use this approach to count the number of cells in each cluster that have multiple sequences for a component (Figure ??).

Proportion of cells in each cluster that express two or more sequences of the TCR $\alpha$ or $\beta$-chains.

Figure 19.4: Proportion of cells in each cluster that express two or more sequences of the TCR \(\alpha\) or \(\beta\)-chains.

19.2.3 Quantifying clonal expansion

Quantification of clonal expansion is the most obvious application of repertoire sequencing data. Cells with the same T cell clonotype are assumed to target the same antigen, and any increase in the frequency of a clonotype provides evidence for T cell activation and proliferation upon stimulation by the corresponding antigen. Thus, we can gain some insights into the immune activity of each T cell cluster by counting the number of expanded clonotypes in each cluster, as shown below for the \(\alpha\)-chain. (We do not exclude clonotypes containing cells in different clusters, which can arise for various genuine biological reasons, e.g., differences in the microenvironment.)

## [1] 92
## [1] 96
Proportion of cells in each cluster that have multiple clonotypes, based on the TCR $\alpha$ or $\beta$-chains.

Figure 19.5: Proportion of cells in each cluster that have multiple clonotypes, based on the TCR \(\alpha\) or \(\beta\)-chains.

We can use this information to determine whether a particular T cell cluster is enriched for expanding clonotypes. The example below uses Fisher’s exact test to identify a significant increase in the proportion of expanded clonotypes in a cluster of putative CD8+ T cells compared to another cluster of CD4+ T cells (Figure 19.6). This provides some relative measure of the average immune activity of each cluster though, of course, further subclustering may reveal finer subpopulations with more or less activity than the average of the parent cluster.

##   Expanded Unexpanded
## 2      138        309
## 6       80       1193
## 
##  Fisher's Exact Test for Count Data
## 
## data:  mat
## p-value <2e-16
## alternative hypothesis: true odds ratio is not equal to 1
## 95 percent confidence interval:
##  4.870 9.129
## sample estimates:
## odds ratio 
##       6.65
Distribution of log-abundance values for each antibody-derived tag in the PBMC dataset, across all cells in either T cell cluster (top and bottom panels).

Figure 19.6: Distribution of log-abundance values for each antibody-derived tag in the PBMC dataset, across all cells in either T cell cluster (top and bottom panels).

19.2.4 Quantifying gene expression and properties

We may also be interested in the expression of specific TCR genes, which can provide some insight into the type of antigens being targeted (Shugay et al. 2018). We again use Fisher’s exact test to test if our two T cell clusters of interest differ in the proportion of TCR sequences containing each gene. This yields a number of significant differences that may or may not be interesting, I don’t know. As usual, composition effects apply here.

## DataFrame with 6 rows and 3 columns
##                     OR     p.value         FDR
##              <numeric>   <numeric>   <numeric>
## TRAV1-2      17.325061 9.22092e-48 4.14941e-46
## TRAV9-2       0.347538 7.02164e-05 1.57987e-03
## TRAV38-2/DV8  2.387645 1.47719e-04 2.21578e-03
## TRAV13-1      0.407479 2.72881e-04 3.06991e-03
## TRAV38-1      0.152513 1.28213e-03 1.09826e-02
## TRAV12-1      2.217839 1.49621e-03 1.09826e-02

The same philosophy can be applied to detect significant differences in biophysical properties. I have no idea what to say about this. I can only assume that gravy refers to hydrophobicity, and not the sauce. (One 300g can of gravy gives me a quarter of my daily fat, according to Google. And half my sodium as well. What a bargain!) And maybe basic TCRs will prefer to bind acidic pepties? Now we just have to find all those acidic antigens out there - how hard can it be? What kind of pathogen uses acidic residues, anyway? Prolines, now that’s a sign of class.

##                      cdr3_AA_LENGTH cdr3_AA_GRAVY cdr3_AA_BULK cdr3_AA_ALIPHATIC cdr3_AA_POLARITY
## AAACCTGAGATCTGAA-1               18        0.6889        13.42            0.8167            7.806
## AAACCTGAGGAACTGC-1               12        0.1083        14.68            0.8167            8.642
## AAACCTGAGGAGTCTG-1               16       -0.4250        14.52            0.8562            8.794
## AAACCTGAGGCTCTTA-1               14        0.4286        13.09            0.6286            7.979
## AAACCTGAGGCTCTTA-1.1             NA            NA           NA                NA               NA
## AAACCTGTCAACGGGA-1               17       -0.2118        14.15            0.8059            8.324
##                      cdr3_AA_CHARGE cdr3_AA_BASIC cdr3_AA_ACIDIC cdr3_AA_AROMATIC
## AAACCTGAGATCTGAA-1           0.9260       0.05556         0.0000          0.05556
## AAACCTGAGGAACTGC-1          -1.0734       0.08333         0.1667          0.08333
## AAACCTGAGGAGTCTG-1           0.9265       0.12500         0.0625          0.06250
## AAACCTGAGGCTCTTA-1           0.9260       0.07143         0.0000          0.07143
## AAACCTGAGGCTCTTA-1.1             NA            NA             NA               NA
## AAACCTGTCAACGGGA-1           2.0378       0.17647         0.0000          0.17647
## DataFrame with 9 rows and 2 columns
##                         Delta     p.value
##                     <numeric>   <numeric>
## cdr3_AA_LENGTH    -1.00000000 3.06864e-06
## cdr3_AA_GRAVY     -0.01590909 6.97870e-01
## cdr3_AA_BULK       0.50125000 3.89143e-11
## cdr3_AA_ALIPHATIC  0.06428571 8.22883e-05
## cdr3_AA_POLARITY  -0.08064893 1.33808e-03
## cdr3_AA_CHARGE    -0.99761076 2.41862e-14
## cdr3_AA_BASIC      0.00641026 1.00025e-03
## cdr3_AA_ACIDIC     0.07142857 3.40111e-14
## cdr3_AA_AROMATIC   0.02456140 1.84476e-07

19.3 Analyzing the immunoglobulin repertoire

Immunoglobulins are good.

19.4 Multi-sample analyses

A much more interesting analysis is possible when there are multiple samples and treatment conditions, where the aim is to determine which clusters contain T cell clonotypes that expand in response to treatment. In much the same vein as the analysis in Chapter 14, we can obtain a cluster-by-sample count matrix by simply cbind()ing together the tra.counts.clonal vector from each sample. To demonstrate, we will pretend that our cells are derived from 4 samples that have been merged together for common clustering (Chapter 13).

We then test for condition-specific differences in the proportion of clonotypes that are expanded. A natural approach would be to apply a beta-binomial model to each cluster, using the expanded and TCR-containing counts as input; for the time being, though, we will just perform a simple Wilcoxon ranked sum test. We test on proportions to accommodate biological differences in T cell abundance between samples well as technical differences in, say, sequencing depth that might affect the success rate of TCR calls on a sample-by-sample basis.

## 
##  Wilcoxon rank sum exact test
## 
## data:  prop[tclust.1, 1:2] and prop[tclust.1, 3:4]
## W = 0, p-value = 0.3
## alternative hypothesis: true location shift is not equal to 0
## 
##  Wilcoxon rank sum exact test
## 
## data:  prop[tclust.2, 1:2] and prop[tclust.2, 3:4]
## W = 1, p-value = 0.7
## alternative hypothesis: true location shift is not equal to 0

Session Info

R Under development (unstable) (2019-12-29 r77627)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 14.04.6 LTS

Matrix products: default
BLAS/LAPACK: /app/easybuild/software/OpenBLAS/0.2.18-GCC-5.4.0-2.26-LAPACK-3.6.1/lib/libopenblas_prescottp-r0.2.18.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=en_US.UTF-8       
 [4] LC_COLLATE=C               LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                  LC_ADDRESS=C              
[10] LC_TELEPHONE=C             LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] parallel  stats4    stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] alakazam_0.3.0              scater_1.15.12              ggplot2_3.2.1              
 [4] BiocFileCache_1.11.4        dbplyr_1.4.2                SingleCellExperiment_1.9.1 
 [7] SummarizedExperiment_1.17.1 DelayedArray_0.13.2         BiocParallel_1.21.2        
[10] matrixStats_0.55.0          Biobase_2.47.2              GenomicRanges_1.39.1       
[13] GenomeInfoDb_1.23.1         IRanges_2.21.2              S4Vectors_0.25.8           
[16] BiocGenerics_0.33.0         Cairo_1.5-10                BiocStyle_2.15.3           
[19] OSCAUtils_0.0.1            

loaded via a namespace (and not attached):
 [1] nlme_3.1-143             bitops_1.0-6             bit64_0.9-7             
 [4] progress_1.2.2           httr_1.4.1               tools_4.0.0             
 [7] backports_1.1.5          R6_2.4.1                 irlba_2.3.3             
[10] vipor_0.4.5              DBI_1.1.0                lazyeval_0.2.2          
[13] colorspace_1.4-1         ade4_1.7-13              withr_2.1.2             
[16] prettyunits_1.0.2        tidyselect_0.2.5         gridExtra_2.3           
[19] processx_3.4.1           bit_1.1-14               curl_4.3                
[22] compiler_4.0.0           BiocNeighbors_1.5.1      labeling_0.3            
[25] bookdown_0.16            scales_1.1.0             readr_1.3.1             
[28] callr_3.4.0              rappdirs_0.3.1           stringr_1.4.0           
[31] digest_0.6.23            rmarkdown_2.0            XVector_0.27.0          
[34] pkgconfig_2.0.3          htmltools_0.4.0          highr_0.8               
[37] rlang_0.4.2              RSQLite_2.2.0            DelayedMatrixStats_1.9.0
[40] farver_2.0.1             dplyr_0.8.3              RCurl_1.95-4.12         
[43] magrittr_1.5             BiocSingular_1.3.1       GenomeInfoDbData_1.2.2  
[46] Matrix_1.2-18            Rcpp_1.0.3               ggbeeswarm_0.6.0        
[49] munsell_0.5.0            ape_5.3                  viridis_0.5.1           
[52] lifecycle_0.1.0          stringi_1.4.3            yaml_2.2.0              
[55] MASS_7.3-51.5            zlibbioc_1.33.0          grid_4.0.0              
[58] blob_1.2.0               crayon_1.3.4             lattice_0.20-38         
[61] cowplot_1.0.0            hms_0.5.2                zeallot_0.1.0           
[64] knitr_1.26               ps_1.3.0                 pillar_1.4.3            
[67] igraph_1.2.4.2           seqinr_3.6-1             glue_1.3.1              
[70] evaluate_0.14            BiocManager_1.30.10      vctrs_0.2.1             
[73] tidyr_1.0.0              gtable_0.3.0             purrr_0.3.3             
[76] assertthat_0.2.1         xfun_0.11                rsvd_1.0.2              
[79] viridisLite_0.3.0        tibble_2.1.3             beeswarm_0.2.3          
[82] memoise_1.1.0           

Bibliography

Georgiou, G., G. C. Ippolito, J. Beausang, C. E. Busse, H. Wardemann, and S. R. Quake. 2014. “The promise and challenge of high-throughput sequencing of the antibody repertoire.” Nat. Biotechnol. 32 (2):158–68.

Rosati, E., C. M. Dowds, E. Liaskou, E. K. K. Henriksen, T. H. Karlsen, and A. Franke. 2017. “Overview of methodologies for T-cell receptor repertoire analysis.” BMC Biotechnol. 17 (1):61.

Shugay, M., D. V. Bagaev, I. V. Zvyagin, R. M. Vroomans, J. C. Crawford, G. Dolton, E. A. Komech, et al. 2018. “VDJdb: a curated database of T-cell receptor sequences with known antigen specificity.” Nucleic Acids Res. 46 (D1):D419–D427.

Zheng, G. X., J. M. Terry, P. Belgrader, P. Ryvkin, Z. W. Bent, R. Wilson, S. B. Ziraldo, et al. 2017. “Massively parallel digital transcriptional profiling of single cells.” Nat Commun 8 (January):14049.