Chieh Lin (林潔) | PhD Candidate in Machine Learning @ Carnegie Mellon University

Selected Research Projects and Publications

Single-cell time-series mapping of cell fate trajectories reveals an expanded developmental potential for human PSC-derived distal lung progenitors

CSHMM predicts the precise timing of Wnt modulation that maintains lung cell fate.

Killian Hurley, Jun Ding, ... , Chieh Lin, ... , Ziv Bar-Joseph, and Darrell N Kotton. Single-cell time-series mapping of cell fate trajectories reveals an expanded developmental potential for human PSC-derived distal lung progenitors, (under revision at Cell Stem Cell) [PDF (bioRxiv)]

In this project, we plan to use CSHMM to improve the differentiation protocol for lung cell types. We showed that CSHMM can not only identify the relevant signalling pathways which determine cell fate, but can also predict with precision the timing of pathway modulation to increase differentiation efficiency to the target cell type.

Single-cell Lineage Tracing by Integrating CRISPR-Cas9 Mutations with Transcriptomic Data

LinTIMaT is a new model that combines mutation and expression data to reconstruct more accurate cell lineage trees

Hamim Zafar*, Chieh Lin*, Ziv Bar-Joseph, Single-cell Lineage Tracing by Integrating CRISPR-Cas9 Mutations with Transcriptomic Data, (under review at Nature Communications) [github, PDF (bioRxiv), ISMB/ECCB 2019 Poster] (* equal contribution, order determined by coin flip)

We developed a novel method, LinTIMaT, which learns a probabilistic model that reconstructs cell lineages by integrating mutation and expression data. Our analysis shows that expression data helps resolve the ambiguities that arise when lineages are inferred from mutations alone, and also enables individual lineages to be integrated into a consensus lineage tree.

Inferring TF activation order in time series scRNA-Seq studies

We design a new probabilistic model that uses transcription factor-target (TF-target) relationships and can assign continuous activation times to TFs

Chieh Lin, Jun Ding, Ziv Bar-Joseph, Inferring TF activation order in time series scRNA-Seq studies, (under revision in PLOS Computational Biology) [github]

We developed the Continuous-State Hidden Markov Models TF (CSHMM-TF) method, which integrates probabilistic modeling of scRNA-Seq data with the ability to assign TFs to specific activation points in the model. TFs are assumed to influence the emission probabilities for cells assigned to later time points, allowing us to identify not just the TFs controlling each path but also their order of activation.

Continuous State HMMs for Modeling Time Series Single Cell RNA-Seq Data

Designing a probabilistic model for learning the continuous process of cell development with time-series single-cell RNA-Seq datasets

Chieh Lin, Ziv Bar-Joseph, Continuous State HMMs for Modeling Time Series Single Cell RNA-Seq Data, (Bioinformatics 2019) [Link, PDF]

We designed a new method based on continuous-state HMMs (CSHMMs) for representing and modeling time-series scRNA-Seq data. We provide a well-defined CSHMM model and efficient learning and inference algorithms, which allow the method to determine both the structure of the branching process and the assignment of cells to these branches. We show that the CSHMM method accurately infers branching topology and correctly and continuously assigns cells to paths, improving upon prior methods proposed for this task. Analysis of genes based on the continuous cell assignment identifies known and novel markers for different cell types.
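As a rough illustration of what continuous cell assignment means, the sketch below places one cell at a position t in [0, 1] along a single path between two states by maximizing a Gaussian log-likelihood. It assumes linear interpolation of mean expression between the two endpoints, a shared variance, and a simple grid search. These are simplifications chosen for brevity and not the published CSHMM emission model or learning algorithm, and all function names and values are hypothetical.

```python
import math

def path_mean(start, end, t):
    """Mean expression at position t in [0, 1]; linear interpolation is an assumption."""
    return [a + t * (b - a) for a, b in zip(start, end)]

def log_likelihood(cell, start, end, t, sigma=1.0):
    """Gaussian log-likelihood of one cell's expression at position t on the path."""
    mu = path_mean(start, end, t)
    const = -0.5 * math.log(2 * math.pi * sigma ** 2)
    return sum(const - (x - m) ** 2 / (2 * sigma ** 2) for x, m in zip(cell, mu))

def assign_cell(cell, start, end, steps=100, sigma=1.0):
    """Grid-search the position t that maximizes the log-likelihood."""
    grid = [i / steps for i in range(steps + 1)]
    return max(grid, key=lambda t: log_likelihood(cell, start, end, t, sigma))

# Toy expression vectors for three hypothetical genes.
start_state = [0.0, 5.0, 2.0]
end_state = [4.0, 1.0, 2.0]
cell = [1.0, 4.0, 2.1]

t_hat = assign_cell(cell, start_state, end_state)
print(f"assigned position along path: {t_hat:.2f}")
```

In a branching model the same search would be repeated over every path, and the cell would be assigned to the path and position with the highest likelihood.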

Cell lineage inference from SNP and scRNA-Seq data

Using sequence mutations from scRNA-Seq data to improve the model accuracy of reconstructing developmental trajectories

Jun Ding, Chieh Lin, Ziv Bar-Joseph, Cell lineage inference from SNP and scRNA-Seq data, (Nucleic Acids Research 2019), [Link, PDF].

We developed a new method to detect significant, cell-type-specific sequence mutations from scRNA-Seq data. We show that only a few mutations are enough to reconstruct good developmental trajectories. Integrating these mutations with expression data further improves the accuracy of the reconstructed models. As we show, the majority of the mutations we identify are likely RNA editing events, indicating that such information can be used to distinguish cell types.

Using neural networks for reducing the dimensions of single-cell RNA-Seq data

Using domain knowledge to design/improve Machine Learning models for learning the embedding of single cells

Chieh Lin, Siddhartha Jain, Hannah Kim, Ziv Bar-Joseph, Using neural networks for reducing the dimensions of single-cell RNA-Seq data, (Nucleic Acids Research 2017), [Link, PDF].

We designed new neural network architectures for dimensionality reduction (learning embeddings) of scRNA-Seq data by using domain knowledge of protein-protein interaction (PPI) and transcription factor-target (TF-target) relationships. We showed that this new embedding improves the performance of clustering and retrieval tasks. We also performed functional analysis of the set of highly weighted nodes for each cell type and showed that, even though neural networks are often described as a ‘black box’ learning method, many of these nodes are functionally related to the cell type they were selected for.
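One way to picture how such domain knowledge can shape an architecture is a sparsely connected hidden layer, where each hidden node stands for one TF target set or PPI module and receives input only from the genes in that group. The sketch below is a minimal, assumption-laden illustration: the gene names, group names, tanh activation, and weight initialization are all hypothetical and are not taken from the paper.

```python
import math
import random

def build_mask(genes, groups):
    """mask[j][i] is 1 if gene i belongs to group j (e.g. a TF's targets), else 0."""
    return [[1 if g in members else 0 for g in genes] for members in groups.values()]

def init_weights(mask, seed=0):
    """Small random weights, zeroed wherever the mask forbids a connection."""
    rng = random.Random(seed)
    return [[rng.uniform(-0.1, 0.1) * m for m in row] for row in mask]

def forward(expression, weights, mask):
    """One sparsely connected hidden layer with tanh activation."""
    hidden = []
    for w_row, m_row in zip(weights, mask):
        z = sum(w * m * x for w, m, x in zip(w_row, m_row, expression))
        hidden.append(math.tanh(z))
    return hidden

# Hypothetical genes and groups, for illustration only.
genes = ["g1", "g2", "g3", "g4", "g5"]
groups = {
    "tf_a_targets": {"g1", "g2", "g3"},
    "ppi_module_1": {"g4", "g5"},
}

mask = build_mask(genes, groups)
weights = init_weights(mask)
embedding = forward([2.0, 0.5, 1.0, 3.0, 0.0], weights, mask)
print(embedding)
```

Because each hidden node is tied to a named group, a node with a large weight or activation can be traced back to a specific TF or interaction module, which is what makes this kind of layer easier to interpret than a fully connected one.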

Rule-based Cytochrome P450 Inhibition Prediction Models

Learning rule-based C5.0 classifier for Cytochrome P450 Inhibition Prediction

Bo-Han Su, Yi-shu Tu, Chieh Lin, Chi-Yu Shao, Olivia A Lin, Yufeng J Tseng, Rule-based Cytochrome P450 Inhibition Prediction Models, (Journal of Chemical Information and Modeling 2015), [Link, PDF].

Chi-Yu Shao, Bo-Han Su, Yi-Shu Tu, Chieh Lin, Olivia A Lin, Yufeng J Tseng, CypRules: A rule-based P450 inhibition prediction server, (Bioinformatics 2015), [Link, PDF].

We utilized a rule-based C5.0 algorithm with different descriptors, including PaDEL, Mold2, and PubChem fingerprints, to construct rule-based inhibition prediction models for five major CYP enzymes—CYP1A2, CYP2C9, CYP2C19, CYP2D6 and CYP3A4—that account for 90% of drug oxidation or hydrolysis. We also developed a rational sampling algorithm for the selection of compounds in the training data set, to enhance the performance of these CYP prediction models. Our models significantly outperformed all of the currently available models. Also, they can be used for rapid virtual screening of a large set of compounds due to their ruleset-based nature. Moreover, such rule-based prediction models can provide rulesets for structural features related to the five major CYP enzymes.

Polymer Name to Structure Project

We developed a new framework that maps the various names of a polymer to its minimal repeating unit. The process incorporates data mining, text mining, and computer-aided material design. The system predicts simple polymer structures with > 0.9 accuracy.

Advances in polymer science have made polymers essential in our everyday life and have produced unprecedented quantities of data over the past several decades. However, organizing such “big data”, scattered and accumulated in the text of a mass of journals, patents, and web pages, has been challenging and inefficient due to the complexity and ambiguity of polymer representations. We developed PolyName2Structure, a first-of-its-kind automated framework that converts various polymer representations to the corresponding polymer structures. Machine learning models and algorithms for handling polymer structures were built to predict the polymerization pathway, identify the reacting group(s), and generate the repeating units after polymerization. PolyName2Structure achieved >90% accuracy in predicting the structures of the polymers listed in a commercial catalog. It is a first step toward resolving the complexity of polymer data, providing a practical vehicle for text mining of polymer structural information. Manuscript in preparation.

Computer-aided drug design projects

We proposed candidate molecules that could become new drugs for cancer and schizophrenia, using computational techniques including fragment docking, virtual screening, LeadOp, and LeadOp+R. The target proteins are Topoisomerase II and D-amino acid oxidase (DAO). Wet-lab experiments have confirmed that the top-ranked molecules are more effective.

We collected starting molecules that are known inhibitors of the target proteins (Topoisomerase II and D-amino acid oxidase (DAO)) and identified their important structures as scaffolds. We then performed fragment docking to generate a list of fragments that could improve each molecule. With a scaffold and its candidate fragments, we attached the fragments to the scaffold with the LeadOp or LeadOp+R method, generating millions of drug candidates. We then used virtual screening and other criteria, such as Lipinski's rule of five, to filter the drug candidates and output a final ranked list. Our collaborators verify the efficacy of the top-ranked candidates and select some to go through the clinical trial process.
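The final filtering and ranking step can be sketched as follows. The example assumes that molecular descriptors and a docking score have already been computed for each candidate. The candidate names, descriptor values, the convention that a lower score is better, and the default of allowing zero rule violations are all hypothetical choices for illustration, not the settings used in the project.

```python
from dataclasses import dataclass

@dataclass
class Candidate:
    name: str
    mol_weight: float     # molecular weight in daltons
    log_p: float          # octanol-water partition coefficient
    h_donors: int         # hydrogen bond donors
    h_acceptors: int      # hydrogen bond acceptors
    docking_score: float  # hypothetical score; lower is assumed to be better

def lipinski_violations(c):
    """Count how many of Lipinski's four criteria a candidate violates."""
    checks = [c.mol_weight > 500, c.log_p > 5, c.h_donors > 5, c.h_acceptors > 10]
    return sum(checks)

def filter_and_rank(candidates, max_violations=0):
    """Keep candidates within the violation budget, then sort by docking score."""
    kept = [c for c in candidates if lipinski_violations(c) <= max_violations]
    return sorted(kept, key=lambda c: c.docking_score)

# Made-up candidates for illustration only.
candidates = [
    Candidate("cand_A", 420.5, 3.2, 2, 6, -9.1),
    Candidate("cand_B", 612.7, 5.8, 3, 9, -10.4),
    Candidate("cand_C", 350.4, 2.1, 1, 4, -7.8),
]

ranked = filter_and_rank(candidates)
print([c.name for c in ranked])
```

Setting `max_violations=1` gives the more lenient variant of the rule that tolerates a single violation.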

For a full list of publications, have a look at my Google Scholar page.
