Using a cell state by cell state logistic regression

2018-10-20

Using a cell-state-by-cell-state logistic regression-based approach (similar in spirit to a one-way analysis of variance with post hoc analysis), we identified putative core elements of cell-specific transcription for 17 cell states representing nine unique purified human cell types that differ in germ layer of origin, degree of specification, and developmental age (including neural progenitor cells, fibroblasts, keratinocytes, hepatocytes, mesothelial cells, myoepithelial cells, kidney epithelial cells, pluripotent stem cells, definitive endoderm, smooth muscle cells, and endothelial cells) (Chin et al., 2009; Chin et al., 2010; Patterson et al., 2012). This collection of data represents an improvement over previously described databases (e.g., BioGPS (Wu et al., 2009)) in that we used strictly purified cells from tissue as opposed to whole tissues, and all analyses were carried out in the same lab to minimize batch effects. In addition, data from cells differentiated from human pluripotent stem cells were included along with their tissue-derived counterparts, making it possible to identify gene expression patterns that change across developmental stages. Finally, our collection also included the same cell type (endothelial) derived from different locations within the body to provide information on regional specialization. The detailed list of cell states is provided in Table 1, along with the corresponding shorthand notation used throughout the paper. We also make use of an independent, publicly available data set consisting of 84 different cell types and tissues to validate our results. We found that many common “marker” genes typically used to define various cell types of the nervous system were in fact expressed in many cells not associated with brain or spinal cord (Pankratz et al., 2007; Zhao et al., 2004). For example, NESTIN was highly expressed in 7 of 17 cell states.
We also show how identified core expression modules changed during development or as a result of spatial specification in different tissues. Using results generated from this approach, we built an interactive web-based application for dissemination and exploration of our results, yielding a valuable resource with a novel perspective on human cell fate, as well as potential leads for inducing one cell state from another. As validation that our approaches can yield factors important for particular cell fates, we provide evidence that CEMA-predicted factors can indeed drive cell fate.
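To make the idea of a cell-state-by-cell-state comparison concrete, the following is a minimal sketch and not the procedure used in this study. It assumes that, for one gene, each ordered pair of cell states is compared by regressing state membership on standardized expression, and that the gene is called enriched in a state when the fitted slope exceeds a cutoff against every other state. The function names, the ridge penalty, the cutoff of 1.0, and the expression values are all hypothetical choices made for illustration.

```python
import math
from statistics import mean, pstdev


def _sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def pairwise_slope(expr_a, expr_b, l2=0.1, lr=0.5, iters=2000):
    """Slope of a ridge-penalized logistic regression of state membership
    (1 = state A, 0 = state B) on the standardized expression of one gene.
    The penalty keeps the slope finite when the two states separate perfectly."""
    x = list(expr_a) + list(expr_b)
    y = [1.0] * len(expr_a) + [0.0] * len(expr_b)
    mu, sd = mean(x), pstdev(x)
    if sd == 0:
        return 0.0  # no variation, so no evidence either way
    z = [(v - mu) / sd for v in x]
    n = len(z)
    b0 = b1 = 0.0
    for _ in range(iters):
        # Gradient ascent on the penalized mean log-likelihood.
        g0 = g1 = 0.0
        for zi, yi in zip(z, y):
            r = yi - _sigmoid(b0 + b1 * zi)
            g0 += r
            g1 += r * zi
        b0 += lr * g0 / n
        b1 += lr * (g1 / n - l2 * b1)
    return b1


def enriched_states(expr_by_state, min_slope=1.0):
    """States in which the gene exceeds the (hypothetical) slope cutoff
    in the pairwise comparison against every other state."""
    states = list(expr_by_state)
    return [
        s for s in states
        if all(
            pairwise_slope(expr_by_state[s], expr_by_state[t]) > min_slope
            for t in states if t != s
        )
    ]


# Made-up log-scale expression values for one hypothetical gene.
example = {
    "NPC": [9.1, 9.4, 8.9],
    "Fib": [3.2, 3.0, 3.5],
    "Hep": [3.1, 3.4, 2.9],
}
print(enriched_states(example))
```

In this toy input the gene is high only in the state labeled "NPC", so that is the only state expected to pass all of its pairwise comparisons. A real analysis would also need replicate-aware significance testing and multiple-testing correction, which this sketch omits.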
Results
Discussion
We have generated a compendium of gene expression data for 17 cell states and analyzed it to identify unique gene expression patterns. This approach focused on (1) cell-state-specific data and (2) data from a single laboratory and platform. The latter is an important distinction because of the well-known inter-laboratory effects that plague meta-analyses (Guenther et al., 2010). The focus on data from a single laboratory can be viewed either as a restriction or as a benefit. In general, integration of independent microarray studies is challenging, and there is increasing acceptance that only data from the same platform can be integrated (Lukk et al., 2010). It has, however, been shown that when data are combined from different laboratories, and biological experiments are replicated across laboratories, the biological effects are stronger than the laboratory effects. Nevertheless, for such a merger to be informative, one must have the same biological condition across several labs; otherwise, lab-specific effects cannot be distinguished from biological effects because both change at the same time. Because our focus was to investigate cell specificity through developmental stages and across regional specification, it was difficult to amass such redundant public data across laboratories. Additionally, when we tried including new cell types generated in other laboratories, apparent artifacts were introduced into the analysis. As such, we focused our analysis on the rich cell-state-specific compendium of data generated in our laboratory, for which no such artifacts were apparent.
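The confounding argument above can be illustrated with a small sketch. It assumes a simple additive model with an intercept, cell-type indicators, and laboratory indicators, and checks whether the resulting design matrix has full column rank. When each cell type comes from only one laboratory, the cell-type and laboratory columns coincide and the two effects cannot be separated. The function names, sample labels, and model form are hypothetical and are not taken from the analysis in this paper.

```python
from fractions import Fraction


def rank(rows):
    """Matrix rank by exact Gauss-Jordan elimination over rationals."""
    m = [[Fraction(v) for v in r] for r in rows]
    rk = 0
    for c in range(len(m[0])):
        pivot = next((i for i in range(rk, len(m)) if m[i][c] != 0), None)
        if pivot is None:
            continue  # no pivot in this column
        m[rk], m[pivot] = m[pivot], m[rk]
        for i in range(len(m)):
            if i != rk and m[i][c] != 0:
                f = m[i][c] / m[rk][c]
                m[i] = [a - f * b for a, b in zip(m[i], m[rk])]
        rk += 1
        if rk == len(m):
            break
    return rk


def design(samples):
    """One row per (cell_type, lab) sample: intercept, cell-type indicators,
    lab indicators. The first level of each factor is the reference."""
    cells = sorted({c for c, _ in samples})
    labs = sorted({l for _, l in samples})
    return [
        [1] + [int(c == k) for k in cells[1:]] + [int(l == k) for k in labs[1:]]
        for c, l in samples
    ]


def effects_separable(samples):
    """True when cell-type and lab effects are jointly estimable."""
    x = design(samples)
    return rank(x) == len(x[0])


# Each cell type measured in a single lab: biology and lab change together.
confounded = [("endothelial", "lab1")] * 2 + [("hepatocyte", "lab2")] * 2
# Each cell type measured in both labs: the effects can be told apart.
crossed = [(c, l) for c in ("endothelial", "hepatocyte") for l in ("lab1", "lab2")]

print(effects_separable(confounded), effects_separable(crossed))
```

For the confounded layout the check returns False, and for the crossed layout it returns True, which mirrors the requirement that the same biological condition be measured in several laboratories before a merged data set is informative.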