BioHackrXiv CMU-DNAnexus Hackathon · 2025 Preprint

Addressing Background Genomic and Environmental Effects on Health through Accelerated Computing and Machine Learning: Results from the 2025 Hackathon at Carnegie Mellon University

S. Sabata, J. Kubica, R. Gupta, L.W. Ericson, H.C. Atanda, G. Subramaniam, ... A. Jha, et al.

Abstract

In March 2025, 34 scientists from the United States, Ireland, the United Kingdom, Switzerland, France, Germany, Spain, India, and Australia gathered in Pittsburgh, Pennsylvania and virtually for a collaborative biohackathon hosted by DNAnexus and Carnegie Mellon University Libraries. The goal of the hackathon was to explore machine learning approaches for multimodal problems in computational biology using public datasets. Teams worked on the following projects: applying machine learning techniques for clustering and similarity analysis of haplotypes; adapting the StructLMM framework to study gene-gene (GxG) interactions; creating a Nextflow workflow for generating an imputation reference panel from large-scale cohort data; optimizing the discovery of causal relationships in large electronic health record (EHR) datasets using the open-source causal analysis software Tetrad; examining the evolution of a graph neural network in a Lenski-esque experiment; and developing tools and workflows for generating pathway intersection diagrams and graph-based analyses of multiomics data. All projects were dedicated to studying the background genomic and environmental effects underlying complex genotype-phenotype relationships. Their shared objective was to lay foundations for further studies on predicting complex phenotypic traits using integrative multi-omic and environmental analyses.

Haplotype analysis plays a critical role in understanding genetic variation and evolutionary relationships. This study presents a computational pipeline on DNAnexus that integrates haplotype data processing, ancestral recombination graph (ARG) reconstruction, and machine learning techniques to explore genetic similarity and clustering among samples.
We used SHAPEIT2-phased variant call format (VCF) files for chromosomes 6, 8, 21, and 22 from the 1000 Genomes Project, converted the data into haplotype (HAP) format using PLINK 2, and applied preprocessing steps to standardize the input for ARG-Needle. We also filtered chromosome 6 haplotypes for TNF and HLA-A variants and chromosome 8 haplotypes for beta defensin variants, because TNF is one of the least variable genes in the human genome, while HLA-A and beta defensin are among the most variable. We obtained 61, 313, and 486 deduplicated biallelic SNPs for TNF, HLA-A, and beta defensin, respectively. We then computed similarity matrices and performed hierarchical clustering on these gene-specific haplotypes.
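
The gene-level steps described above (keeping deduplicated biallelic SNPs, computing a haplotype similarity matrix, and clustering hierarchically) can be illustrated with a minimal pure-Python sketch. This is not the pipeline's actual code: the record layout, the function names, the matching-allele similarity measure, and the choice of average linkage are all assumptions made for illustration, and a real analysis would more likely read PLINK 2 output and use a numerical library.

```python
from itertools import combinations

BASES = {"A", "C", "G", "T"}


def filter_biallelic_unique(records):
    """Keep single-base REF/ALT records, dropping repeated (pos, ref, alt).

    Each record is a hypothetical tuple (pos, ref, alt, alleles).
    """
    seen, kept = set(), []
    for pos, ref, alt, alleles in records:
        key = (pos, ref, alt)
        if ref in BASES and alt in BASES and key not in seen:
            seen.add(key)
            kept.append((pos, ref, alt, alleles))
    return kept


def similarity_matrix(haps):
    """Pairwise fraction of sites at which two haplotypes carry the same allele."""
    n = len(haps)
    sim = [[1.0] * n for _ in range(n)]
    for i, j in combinations(range(n), 2):
        matches = sum(a == b for a, b in zip(haps[i], haps[j]))
        sim[i][j] = sim[j][i] = matches / len(haps[i])
    return sim


def average_linkage(sim, n_clusters):
    """Agglomerative clustering: repeatedly merge the most similar pair of clusters."""
    clusters = [[i] for i in range(len(sim))]
    while len(clusters) > n_clusters:
        best = None
        for a, b in combinations(range(len(clusters)), 2):
            pairs = [(i, j) for i in clusters[a] for j in clusters[b]]
            score = sum(sim[i][j] for i, j in pairs) / len(pairs)
            if best is None or score > best[0]:
                best = (score, a, b)
        _, a, b = best  # a < b, so deleting b leaves index a valid
        clusters[a] = sorted(clusters[a] + clusters[b])
        del clusters[b]
    return clusters


if __name__ == "__main__":
    # Toy haplotypes (rows) over four SNP sites (columns); 0 = REF, 1 = ALT.
    toy_haps = [
        [0, 0, 1, 1],
        [0, 0, 1, 0],
        [1, 1, 0, 0],
        [1, 1, 0, 1],
    ]
    sim = similarity_matrix(toy_haps)
    print(average_linkage(sim, n_clusters=2))
```

The pairwise loops are quadratic or worse in the number of haplotypes, which is acceptable for a toy example but would need a vectorized or library-based implementation at 1000 Genomes scale.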

Key Findings

  • Developed a cloud-based computational pipeline on DNAnexus integrating haplotype processing and ancestral recombination graph (ARG) reconstruction.
  • Analyzed 1000 Genomes Project data across chromosomes 6, 8, 21, and 22 to explore genetic similarity and clustering.
  • Focused on highly variable genes (HLA-A, beta defensin) and conserved genes (TNF) to study evolutionary relationships.
  • Demonstrated scalable machine learning approaches for multimodal problems in computational biology using public datasets.
Published: Jun 2025 · Citations: 1 · DOI: 10.37044/osf.io/3a8cn_v1
