Chapter 1 Introduction

Bioconductor is an open source, open development software project to provide tools for the analysis and comprehension of high-throughput genomic data. It is based primarily on the R programming language.

1.1 What you will learn

The goal of this book is to provide a solid foundation in the usage of Bioconductor tools for single-cell RNA-seq analysis by walking through various steps of typical workflows using example datasets. We strive to tackle key concepts covered in the manuscript, “Orchestrating Single-Cell Analysis with Bioconductor”, with each workflow covering these in varying detail, as well as essential preliminaries that are important for following along with the workflows on your own.

1.1.1 Preliminaries

For those unfamiliar with R (and those looking to learn more), we recommend reading the Learning R and More chapter, which first and foremost covers how to get started with R. We point to many great online resources for learning R, as well as related tools that are nice to know for bioinformatic analysis. For advanced users, we also point to some extra resources that go beyond the basics. While we provide an extensive list of learning resources for the interested audience in this chapter, we only ask for some familiarity with R before going to the next section.

We then briefly cover getting started with Using R and Bioconductor. Bioconductor, being its own repository, has a unique set of tools, documentation, resources, and practices that benefit from some extra explanation.

Data Infrastructure merits a separate chapter. The reason for this is that common data containers are an essential part of Bioconductor workflows because they enable interoperability across packages, allowing for “plug and play” usage of cutting-edge tools. Specifically, here we cover the SingleCellExperiment class in depth, as it has become the working standard for Bioconductor based single-cell analysis packages.

Finally, before diving into the various workflows, armed with knowledge about the SingleCellExperiment class, we briefly discuss the datasets that will be used throughout the book in About the Data.

1.1.2 Workflows

All workflows begin with data import and subsequent quality control and normalization, going from a raw (count) expression matrix to a clean one. This includes adjusting for experimental factors and possibly even latent factors. Using the clean expression matrix, feature selection strategies can be applied to select the features (genes) driving heterogeneity. Furthermore, these features can then be used to perform dimensionality reduction, which enables downstream analysis that would not otherwise be possible and visualization on 2- or 3-dimensions.

From there, the workflows largely focus on differing downstream analyses. Clustering details how to segment a scRNA-seq dataset, and differential expression provides a means to determine what drives the differences between different groups of cells. Integrating datasets walks through merging scRNA-seq datasets, an area of need as the number of scRNA-seq datasets continues to grow and comparisons between datasets must be done. Finally, we touch upon how to work with large scale data, specifically where it becomes impractical or impossible to work with data solely in-memory.

As an added bonus, we dedicate a chapter to interactive visualization, which focuses on using the iSEE package to enable active exploration of a single cell experiment’s data.

1.2 What you won’t learn

The field of bioinformatic analysis is large and filled with many potential trajectories depending on the biological system being studied and technology being deployed. Here, we only briefly survey some of the many tools available for the analysis of scRNA-seq, focusing on Bioconductor packages. It is impossible to thoroughly review the plethora of tools available through R and Bioconductor for biological analysis in one book, but we hope to provide the means for further exploration on your own.

Thus, it goes without saying that you may not learn the optimal workflow for your own data from our examples - while we strive to provide high quality templates, they should be treated as just that - a template from which to extend upon for your own analyses.

1.3 Who we wrote this for

We’ve written this book with the interested experimental biologist in mind, and do our best to make few assumptions on previous programming or statistical experience. Likewise, we also welcome more seasoned bioinformaticians who are looking for a starting point from which to dive into single-cell RNA-seq analysis. As such, we welcome any and all feedback for improving this book to help increase accessibility and refine technical details.

1.4 Why we wrote this

This book was conceived in the fall of 2018, as single-cell RNA-seq analysis continued its rise in prominence in the field of biology. With its rapid growth, and the ongoing developments within Bioconductor tailored specifically for scRNA-seq, it became apparent that an update to the Orchestrating high-throughput genomic analysis with Bioconductor paper was necessary for the age of single-cell studies.

We strive to highlight the fantastic software by people who call Bioconductor home for their tools, and in the process hope to showcase the Bioconductor community at large in continually pushing forward the field of biological analysis.

1.5 Acknowledgements

We would like to thank all Bioconductor contributors for their efforts in creating the definitive leading-edge repository of software for biological analysis. It is truly extraordinary to chart the growth of Bioconductor over the years. We are thankful for the wonderful community of scientists and developers alike that together make the Bioconductor community special.

We would first and foremost like to thank the Bioconductor core team and the emerging targets subcommittee for commissioning this work, Stephanie Hicks and Raphael Gottardo for their continuous mentorship, and all our contributors to the companion manuscript of this book.

We’d also like to thank Garret Grolemund and Hadley Wickham for their book, R for Data Science, from which we drew stylistic and teaching inspiration. We also thank Levi Waldron and Aaron Lun for advice on the code-related aspects of managing the online version of this book.