PhD Candidate in Machine Learning
Carnegie Mellon University
I am a fifth-year PhD student in Machine Learning at Carnegie Mellon University, advised by Prof. Ziv Bar-Joseph, who leads the Systems Biology Group. I plan to graduate before May 2020. My main interest is designing and applying machine learning models to real-world problems, including (but not limited to) understanding complex biological systems, drug discovery, and healthcare. My research with Prof. Bar-Joseph focuses on designing new machine learning models to analyze time-series single-cell RNA sequencing data and to extract biological insights from the learned models.
When I was a student at National Taiwan University, I joined the Computational Molecular Design and Biomarker Detection in Metabolomics lab, led by Prof. Yufeng Jane Tseng. My research projects with Prof. Tseng mainly involved using computational techniques for drug design and applying machine learning models in cheminformatics.
Although my past research projects have been in biology and chemistry, I am open to other real-world problems.
CSHMM predicts the precise timing of Wnt modulation that maintains lung cell fate.
Killian Hurley, Jun Ding, ... , Chieh Lin, ... , Ziv Bar-Joseph, and Darrell N Kotton. Single-cell time-series mapping of cell fate trajectories reveals an expanded developmental potential for human PSC-derived distal lung progenitors, (under revision in Cell Stem Cell) [PDF(bioRxiv)]
In this project, we plan to use CSHMM to improve the differentiation protocol for lung cell types. We showed that CSHMM can not only identify the relevant signalling pathways which determine cell fate, but can also predict with precision the timing of pathway modulation to increase differentiation efficiency to the target cell type.
LinTIMaT is a new model that combines mutation and expression data to reconstruct better cell lineage trees
Hamim Zafar*, Chieh Lin*, Ziv Bar-Joseph, Single-cell Lineage Tracing by Integrating CRISPR-Cas9 Mutations with Transcriptomic Data, (under review in Nature Communications) [github, PDF(bioRxiv), ISMB/ECCB 2019 Poster] (* equal contribution, order determined by coin flip)
We developed a novel method, LinTIMaT, which learns a probabilistic model that reconstructs cell lineages by integrating mutation and expression data. Analysis shows that expression data helps resolve the ambiguities that arise when lineages are inferred from mutations alone, while also enabling the integration of different individual lineages into a consensus lineage tree.
We design a new probabilistic model that utilizes transcription factor-target (TF-target) relationships and can assign continuous activation times to TFs
Chieh Lin, Jun Ding, Ziv Bar-Joseph, Inferring TF activation order in time series scRNA-Seq studies, (under revision in PLOS Computational Biology) [github]
We developed the Continuous-State Hidden Markov Models TF (CSHMM-TF) method, which integrates probabilistic modeling of scRNA-Seq data with the ability to assign TFs to specific activation points in the model. TFs are assumed to influence the emission probabilities for cells assigned to later time points, allowing us to identify not just the TFs controlling each path but also their order of activation.
Designing a probabilistic model for learning the continuous process of cell development with time-series single-cell RNA-Seq datasets
Chieh Lin, Ziv Bar-Joseph, Continuous State HMMs for Modeling Time Series Single Cell RNA-Seq Data, (Bioinformatics 2019) [Link, PDF]
We designed a new method based on continuous-state HMMs (CSHMMs) for representing and modeling time-series scRNA-Seq data. We provide a well-defined CSHMM model and efficient learning and inference algorithms, which allow the method to determine both the structure of the branching process and the assignment of cells to these branches. We show that the CSHMM method accurately infers the branching topology and correctly and continuously assigns cells to paths, improving upon prior methods proposed for this task. Analysis of genes based on the continuous cell assignment identifies known and novel markers for different cell types.
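To make the idea of continuous cell assignment concrete, here is a minimal sketch in Python. It is an illustration only and not the published CSHMM model: it assumes that the expected expression along a path is a linear interpolation between the mean expression at the two end states, that emissions are Gaussian with a fixed variance, and that the best position is found by a simple grid search. The function names, path names, and numbers are all hypothetical.

```python
import math

def log_likelihood(cell, start_mean, end_mean, t, sigma=1.0):
    """Gaussian log-likelihood of one cell's expression vector at position t in [0, 1]."""
    total = 0.0
    for x, a, b in zip(cell, start_mean, end_mean):
        # ASSUMPTION: linear interpolation between the two state means.
        mu = (1.0 - t) * a + t * b
        total += -0.5 * math.log(2.0 * math.pi * sigma ** 2)
        total += -((x - mu) ** 2) / (2.0 * sigma ** 2)
    return total

def assign_cell(cell, paths, grid_size=101):
    """Return (path_name, t, log_likelihood) with the highest likelihood on a grid of t."""
    best = None
    for name, (start_mean, end_mean) in paths.items():
        for i in range(grid_size):
            t = i / (grid_size - 1)
            ll = log_likelihood(cell, start_mean, end_mean, t)
            if best is None or ll > best[2]:
                best = (name, t, ll)
    return best

# Hypothetical two-gene example with one branching point.
paths = {
    "root->A": ([0.0, 0.0], [4.0, 0.0]),
    "root->B": ([0.0, 0.0], [0.0, 4.0]),
}
path, t, _ = assign_cell([3.0, 0.1], paths)
print(path, t)
```

In this toy setup the cell is placed three quarters of the way along the path toward state A, which is the sense in which an assignment is continuous rather than a hard cluster label.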
Using sequence mutations from scRNA-Seq data to improve the model accuracy of reconstructing developmental trajectories
Jun Ding, Chieh Lin, Ziv Bar-Joseph, Cell lineage inference from SNP and scRNA-Seq data, (Nucleic Acids Research 2019), [Link, PDF].
We develop a new method to detect significant, cell-type-specific sequence mutations from scRNA-Seq data. We show that only a few mutations are enough for reconstructing good developmental trajectories. Integrating these mutations with expression data further improves the accuracy of the reconstructed models. As we show, the majority of mutations we identify are likely RNA editing events, indicating that such information can be used to distinguish cell types.
Using domain knowledge to design/improve Machine Learning models for learning the embedding of single cells
Chieh Lin, Siddhartha Jain, Hannah Kim, Ziv Bar-Joseph, Using neural networks for reducing the dimensions of single-cell RNA-Seq data, (Nucleic Acids Research 2017), [Link, PDF].
We designed new neural network architectures for dimensionality reduction (learning embeddings) of scRNA-Seq data by using domain knowledge of protein-protein interaction (PPI) and transcription factor-target (TF-target) relationships. We showed that this new embedding improves the performance of clustering and retrieval tasks. We also performed functional analysis of the set of highly weighted nodes for each cell type and showed that, even though neural networks are often described as a ‘black box’ learning method, many of these nodes are functionally related to the cell type they were selected for.
Learning a rule-based C5.0 classifier for Cytochrome P450 Inhibition Prediction
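One common way to encode this kind of prior knowledge in a network is to restrict which input genes each hidden node may connect to. The sketch below shows such a masked layer in plain Python. It assumes each hidden node corresponds to one TF or PPI group and only receives input from the genes in that group; whether this matches the published architecture in detail is an assumption, and the gene names, group names, and weights are hypothetical.

```python
import math

def build_mask(genes, groups):
    """mask[j][i] is 1 if gene i belongs to group j (e.g. the targets of one TF), else 0."""
    return [[1 if g in members else 0 for g in genes] for members in groups.values()]

def masked_layer(x, weights, mask):
    """One hidden layer in which each hidden node only sees the genes in its group."""
    hidden = []
    for w_row, m_row in zip(weights, mask):
        # Connections outside the group are zeroed out by the mask.
        z = sum(w * m * xi for w, m, xi in zip(w_row, m_row, x))
        hidden.append(math.tanh(z))
    return hidden

# Hypothetical example: four genes, one TF-target group and one PPI group.
genes = ["g1", "g2", "g3", "g4"]
groups = {"TF_A": {"g1", "g2"}, "PPI_1": {"g3", "g4"}}
mask = build_mask(genes, groups)
weights = [[0.5, 0.5, 0.5, 0.5], [0.5, 0.5, 0.5, 0.5]]
x = [1.0, 1.0, 0.0, 0.0]  # expression of g1..g4 in one cell
embedding = masked_layer(x, weights, mask)
print(embedding)
```

Because each hidden node is tied to a named group, a highly weighted node can be traced back to a specific TF or protein complex, which is what makes the functional analysis of such nodes possible.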
Bo-Han Su, Yi-shu Tu, Chieh Lin, Chi-Yu Shao, Olivia A Lin, Yufeng J Tseng, Rule-based Cytochrome P450 Inhibition Prediction Models, (Journal of Chemical Information and Modeling 2015), [Link, PDF].
Chi-Yu Shao, Bo-Han Su, Yi-Shu Tu, Chieh Lin, Olivia A Lin, Yufeng J Tseng, CypRules: A rule-based P450 inhibition prediction server, (Bioinformatics 2015), [Link, PDF].
We utilized a rule-based C5.0 algorithm with different descriptors, including PaDEL, Mold2, and PubChem fingerprints, to construct rule-based inhibition prediction models for five major CYP enzymes—CYP1A2, CYP2C9, CYP2C19, CYP2D6 and CYP3A4—that account for 90% of drug oxidation or hydrolysis. We also developed a rational sampling algorithm for the selection of compounds in the training data set, to enhance the performance of these CYP prediction models. Our models significantly outperformed all of the currently available models. Also, they can be used for rapid virtual screening of a large set of compounds due to their ruleset-based nature. Moreover, such rule-based prediction models can provide rulesets for structural features related to the five major CYP enzymes.
We developed a new framework that maps the various names of a polymer to its minimal repeating unit. The process incorporates data mining, text mining, and computer-aided material design. The system predicts simple polymer structures with accuracy above 0.9.
Advances in polymer science have made polymers essential in everyday life and have produced unprecedented quantities of data over the past several decades. However, organizing this “big data”, which is scattered across journals, patents, and web pages, has been challenging and inefficient because polymer representations are complex and ambiguous. We developed PolyName2Structure, a first-of-its-kind automated framework that converts various polymer representations to the corresponding polymer structures. Machine learning models and algorithms for handling polymer structures were built to predict the polymerization pathway, identify the reacting group(s), and generate the repeating units after polymerization. PolyName2Structure achieved >90% accuracy in predicting the structures of polymers listed in a commercial catalog. It is a first step toward resolving the complexity of polymer data by providing a practical tool that enables text mining of polymer structural information. (Manuscript in preparation.)
We proposed candidate molecules that could become new drugs for cancer and schizophrenia, using computational techniques including fragment docking, virtual screening, LeadOp, and LeadOp+R. The target proteins are Topoisomerase II and D-amino acid oxidase (DAO). The top-ranked molecules were confirmed to be more effective in wet-lab experiments.
We collected starting molecules that are known inhibitors of the target proteins (Topoisomerase II and D-amino acid oxidase (DAO)) and identified their important structures as scaffolds. We then performed fragment docking to generate a list of fragments that could improve each molecule. Given the scaffolds and candidate fragments, we attached the fragments to the scaffolds with the LeadOp or LeadOp+R method, generating millions of drug candidates. We then used virtual screening and other criteria, such as Lipinski's rule of five, to filter the candidates and produce a final ranked list. Our collaborators verify the efficacy of the top-ranked drug candidates and select some to go through the clinical trial process.
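The final filtering and ranking step can be sketched roughly as follows. This is a simplified illustration, not the pipeline we actually ran: the molecular descriptors are assumed to be precomputed by a cheminformatics toolkit, the candidate names and values are made up, the docking score (lower is better) is a hypothetical stand-in for the virtual screening result, and allowing at most one Lipinski violation is one common convention rather than a setting taken from the project.

```python
from dataclasses import dataclass

@dataclass
class Candidate:
    name: str
    mol_weight: float     # molecular weight in daltons
    logp: float           # octanol-water partition coefficient
    h_donors: int         # hydrogen bond donors
    h_acceptors: int      # hydrogen bond acceptors
    docking_score: float  # hypothetical screening score, lower is better

def lipinski_violations(c):
    """Count how many of the four rule-of-five thresholds a candidate exceeds."""
    return sum([
        c.mol_weight > 500,
        c.logp > 5,
        c.h_donors > 5,
        c.h_acceptors > 10,
    ])

def filter_and_rank(candidates, max_violations=1):
    """Drop candidates with too many violations, then sort the rest by docking score."""
    kept = [c for c in candidates if lipinski_violations(c) <= max_violations]
    return sorted(kept, key=lambda c: c.docking_score)

# Made-up candidates for illustration only.
candidates = [
    Candidate("cand_1", 350.0, 2.1, 2, 5, -9.2),
    Candidate("cand_2", 720.0, 6.3, 6, 12, -11.0),
    Candidate("cand_3", 510.0, 3.0, 1, 4, -10.1),
]
ranked = filter_and_rank(candidates)
print([c.name for c in ranked])
```

Here cand_2 has the best docking score but breaks all four thresholds, so it is removed before ranking, which shows why the drug-likeness filter and the screening score have to be combined.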
For a full list of publications, have a look at my Google Scholar page.