Jordan Cahoon

Projects

Domain-Guided Informational Retrieval for Clinical Records

Winter 2025 rotation project advised by Emily Alsentzer. Details to come.

Genome-wide association studies powered by DNA language models

Fall 2024 rotation project with the Salzman Lab.

Detecting ancient introgression with genealogical trees

Project advised by Charleston Chiang, Iain Mathieson, and Sara Mathieson. Details to come.

Improving Large Language Models for Code-Switch Language

Code-switch (CS) language is a common linguistic phenomenon where multilingual speakers fluently blend multiple languages together. Despite its frequency in both spoken and written language, CS is largely understudied in natural language processing. Improving model performance on CS is critical to ensuring language models can be deployed in real-world settings. Our project aims to improve the performance of pre-trained large language models on Spanish-English CS, colloquially known as Spanglish, for language modeling and sentiment analysis. In this work, we demonstrate that multilingual pre-trained language models perform significantly worse on Spanglish than on either English or Spanish. We find that fine-tuning pre-trained models on Spanish, English, or Spanglish improves performance on CS input for both tasks. In particular, for the language modeling task, fine-tuning on Spanglish achieves the largest drop in perplexity of the fine-tuning methods we compared. We also analyzed the cross-lingual embedding space to understand how fine-tuning on CS may affect the quality of embeddings. By comparing the similarity of embeddings for parallel sentences in Spanish and English, we find that fine-tuning models on CS sentences potentially improves language-agnostic embeddings more effectively than fine-tuning on English or Spanish alone. Currently, CS research is limited by the small number and size of available datasets. As more data is gathered, we expect the benefits of fine-tuning models on CS to become more apparent compared to training on monolingual input. This project was part of independent research for CSCI 499: Large Language Models in Natural Language Processing.
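The two quantities mentioned above, perplexity and the similarity of embeddings for parallel sentences, can be illustrated with a minimal sketch. This is not the project's evaluation code: the function names are hypothetical, cosine similarity is assumed as the similarity measure, and the inputs are assumed to be per-token natural-log probabilities and plain lists of floats produced by some model.

```python
import math

def perplexity(token_log_probs):
    """Perplexity is exp of the mean negative log-likelihood per token.
    Assumes natural-log probabilities, one per token."""
    return math.exp(-sum(token_log_probs) / len(token_log_probs))

def cosine_similarity(u, v):
    """Cosine of the angle between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def mean_parallel_similarity(english_embeddings, spanish_embeddings):
    """Average cosine similarity over aligned (English, Spanish) sentence pairs.
    Higher values suggest more language-agnostic embeddings."""
    sims = [cosine_similarity(e, s)
            for e, s in zip(english_embeddings, spanish_embeddings)]
    return sum(sims) / len(sims)
```

Under these assumptions, a lower perplexity on held-out Spanglish text and a higher mean similarity on parallel sentences would both point in the direction described above.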

Report

Continuous Stress Prediction for Healthcare Workers

Healthcare workers are constantly exposed to high-stress working conditions that increase burnout rates and lower the quality of patient care. While it is vital to continuously monitor healthcare worker stress to provide necessary interventions, traditional survey methods can interfere with tasks in real-world healthcare contexts. Wearable devices offer a non-invasive way to detect worker stress continuously; however, predictions may be influenced by an individual’s activities and definitions of stress. In this project, we identified extreme gradient boosting trees as a promising method for continuous stress prediction, achieving a ROC-AUC of 0.83 using real-world health worker data. In addition to benchmarking continuous stress prediction, we assessed the dataset generalizability with domain adaptation. Interestingly, we found that supervised domain adaptation improves performance across all datasets, suggesting the capacity to aggregate future datasets. Through this project, we highlighted numerous challenges to generalizing stress detection for health workers, including differing stress definitions and sensor types among available datasets.

Short Paper Accepted at ACM-BCB 2023

Imputation Around the World

By predicting unobserved genotypes based on sequenced individuals, imputation increases marker density and enables large-scale genome-wide association studies. The state-of-the-art imputation reference panel released by the Trans-Omics for Precision Medicine (TOPMed) program contains a substantial number of admixed African and Hispanic/Latino samples. As a result, these populations are imputed with nearly the same efficacy as European cohorts. However, imputation for ancestrally diverse populations primarily residing outside of North America still falls short in performance due to persistent underrepresentation. My research aims to quantify quality weaknesses in the TOPMed reference panel and offer solutions to help close the gap in analysis quality between European and non-European populations.

Publication Interactive Map of Populations

An Analysis of Substance Abuse Referrals in the United States Foster Care System

The National Youth in Transition Database (NYTD) hosts one of the largest sources of information on youth aging out of the foster care system in the United States. It covers nearly 10 years and consists of basic demographic information, outcome data, and independent living service (ILS) utilization for thousands of individuals. This report examined substance abuse referrals to analyze indicators for high-risk behavior in youth who may lack adequate support systems. We identified that having a connection to an adult and current school enrollment increase the likelihood of substance abuse referrals. This suggests that youth without strong mentorship may be at risk of not having access to substance abuse referrals. On the other hand, educational aid, public food assistance, and other public financial assistance may deter referral, meaning youth who receive these forms of aid may have lower rates of substance abuse.

This report was conducted as independent research for CSCI 461: Artificial Intelligence in Sustainable Development under the guidance of Bistra Dilkina.

Report

Computational Agroecology

Modern agriculture relies on monocultures, where a single crop is grown in a field. This practice requires massive land clearing and destroys local ecosystems. The alternative is to grow multiple crops in the same field, known as polyculture. This method is more sustainable and leverages natural symbiotic relationships to increase crop yield. Yet testing which crops are optimal to grow within the same field is extremely time-consuming. My group developed a reinforcement learning agent that systematically predicts polyculture composition and crop placement. This agent will help farmers incorporate polyculture design in their fields.

Repo

Malaria Prediction with Weather Features

The mosquito vectors for malaria rely on shallow puddles of water to breed. Hence, weather factors, including humidity, precipitation, and temperature, are highly associated with malaria outbreaks. While outbreak surveillance records may not be retained, weather data is consistently recorded and extremely abundant. We developed a model that uses climate and outbreak data to predict future outbreaks. Additionally, this model is transferable to other regions with sparser outbreak data.

UoT Project X Submission Repo

Triple Liftover

Because the human genome reference has gone through multiple versions, the Liftover program is a critical tool for updating genetic data to the most current version of the human genome. However, Liftover is unable to correctly detect palindromic single nucleotide variants (SNVs), which destabilizes the surrounding region. Our method, Triple Liftover, predicts and corrects SNVs located in inverted regions using a heuristic-based approach. Overall, my work focused on structuring the program so that the design was flexible and user-friendly.
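To illustrate why palindromic SNVs are hard to handle, here is a minimal sketch. It is not the Triple Liftover heuristic, and the function names are hypothetical. A palindromic (strand-ambiguous) SNV is one whose two alleles are complements of each other, A/T or C/G, so reading it from the opposite strand yields the same pair of alleles and a strand flip cannot be spotted from the alleles alone.

```python
# Watson-Crick base complements.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}

def is_palindromic_snv(ref, alt):
    """True when the two alleles are complements of each other (A/T or C/G)."""
    return COMPLEMENT.get(ref.upper()) == alt.upper()

def flip_strand(ref, alt):
    """Return the alleles as read from the opposite strand."""
    return COMPLEMENT[ref.upper()], COMPLEMENT[alt.upper()]

# A/G read from the other strand becomes T/C: the flip is visible.
# A/T read from the other strand becomes T/A: the same allele set,
# so the flip is ambiguous without extra information.
for ref, alt in [("A", "G"), ("A", "T")]:
    flipped = flip_strand(ref, alt)
    ambiguous = set(flipped) == {ref, alt}
    print(ref, alt, "->", flipped, "ambiguous" if ambiguous else "detectable")
```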

Repo Paper

Helminth Immunology

Intestinal parasitic worm (helminth) infection is one of the leading causes of morbidity in developing countries, yet the immune mechanisms of worm expulsion are not well understood. Single-cell RNA sequencing (scRNA-seq) can be used to elucidate the complex pathways within type 2 inflammation, the main immune response to helminth infections. My role focused on processing and analyzing scRNA-seq data to uncover how gene regulation of CRTH2 is related to positive worm outcomes in mouse models.

Paper