AI Model Decodes How DNA Sequence Shapes Gene Expression in Single Cells Across Dozens of Diseases
From AnyHelix Team · 27 May 2026
A new AI tool can predict how DNA sequence influences gene activity in specific cell types and disease states, offering a powerful way to interpret disease-linked genetic variants and design targeted regulatory elements. The model, named Decima, is described today in Nature Methods by researchers led by Gokcen Eraslan at Genentech, with first author Avantika Lal.
Decima was trained on single-cell RNA sequencing data from over 22 million cells, spanning 201 cell types, 271 tissues, and 82 diseases. It learns to forecast gene expression levels from the DNA sequence surrounding a gene, at a resolution that captures differences between, say, a liver cell and an immune cell, or between a healthy intestinal fibroblast and one from a patient with Crohn’s disease.
The team showed that Decima could accurately pick out genes whose expression defines a cell’s identity purely from genomic sequence. It highlighted known regulatory motifs—such as those for transcription factors that drive lung epithelial cell lineages or that repress neuronal genes in non-neuronal cells—validating that the model had learned biologically meaningful rules.
Crucially, the tool also predicts the impact of noncoding genetic variants in individual cell types. When tested against fine-mapped expression quantitative trait loci (eQTLs) in blood cells, Decima’s variant effect scores distinguished causal variants from background noise better than a previous state-of-the-art model. For over 800 disease-associated variants from genome-wide association studies, Decima often pinpointed the cell types where the variant’s effect was strongest—for example, linking hypertension-related variants to macrophages and monocytes, and height-associated variants to fibroblasts.
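To make the idea of a cell-type-specific variant effect score concrete, here is a minimal sketch in Python. It assumes the common approach of comparing predictions for the reference and alternate alleles; the article does not spell out Decima's exact scoring procedure. The `toy_predictor` function, its motifs, and the cell-type names are hypothetical stand-ins for illustration, not Decima's API.

```python
from typing import Callable, Dict

# Hypothetical interface: a model that maps a DNA sequence to one
# predicted expression value per cell type.
Predictor = Callable[[str], Dict[str, float]]


def toy_predictor(seq: str) -> Dict[str, float]:
    """Toy stand-in (NOT Decima): counts made-up motifs per cell type."""
    return {
        "monocyte": float(seq.count("GGAA")),
        "fibroblast": float(seq.count("TGAC")),
    }


def apply_snv(seq: str, pos: int, ref: str, alt: str) -> str:
    """Return seq with the base at 0-based position pos changed from ref to alt."""
    if seq[pos] != ref:
        raise ValueError(f"expected {ref} at position {pos}, found {seq[pos]}")
    return seq[:pos] + alt + seq[pos + 1:]


def variant_effect(predict: Predictor, seq: str, pos: int,
                   ref: str, alt: str) -> Dict[str, float]:
    """Per-cell-type effect: prediction(alt allele) minus prediction(ref allele)."""
    ref_pred = predict(seq)
    alt_pred = predict(apply_snv(seq, pos, ref, alt))
    return {cell_type: alt_pred[cell_type] - ref_pred[cell_type]
            for cell_type in ref_pred}


def top_cell_type(effects: Dict[str, float]) -> str:
    """Cell type with the largest absolute predicted effect."""
    return max(effects, key=lambda cell_type: abs(effects[cell_type]))


# Made-up reference sequence; the variant disrupts the toy "monocyte" motif.
reference = "ACGTGGAAACGTTGACAC"
effects = variant_effect(toy_predictor, reference, pos=6, ref="A", alt="C")
print(effects, top_cell_type(effects))
```

In this toy case the variant breaks the only motif the monocyte output depends on, so the monocyte score changes while the fibroblast score does not. Ranking cell types by the size of the score is the same kind of reasoning the article describes for linking a disease-associated variant to the cell types where its effect is strongest.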
Moving beyond interpretation, the researchers used Decima as a design engine. In a proof-of-concept, they computationally evolved a synthetic 200-base-pair regulatory element that drove high predicted expression specifically in fibroblasts from Crohn’s disease patients, while keeping predicted activity low in healthy gut cells and other intestinal cell types. The designed sequence acquired motifs linked to inflammation and fibroblast identity, though it has not yet been tested in cells.
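The design loop can be pictured as in-silico evolution: propose mutations, score each candidate with the model, and keep those that raise predicted activity in the target cell type relative to the others. The sketch below assumes a simple greedy search over single-base changes and a "target minus strongest off-target" objective. The article does not specify the search strategy or objective the researchers used, and `toy_predictor`, its motifs, and the cell-type labels are hypothetical.

```python
from typing import Dict

BASES = "ACGT"


def toy_predictor(seq: str) -> Dict[str, float]:
    """Toy stand-in (NOT Decima): motif counts as fake expression predictions."""
    return {
        "disease_fibroblast": float(seq.count("TGAC")),
        "healthy_fibroblast": float(seq.count("GGAA")),
        "enterocyte": float(seq.count("CCGG")),
    }


def specificity(seq: str, target: str) -> float:
    """Assumed objective: target prediction minus the highest off-target prediction."""
    preds = toy_predictor(seq)
    off_target = max(v for cell_type, v in preds.items() if cell_type != target)
    return preds[target] - off_target


def evolve(seq: str, target: str, rounds: int) -> str:
    """Greedy evolution: each round, apply the best single-base change, if any helps."""
    for _ in range(rounds):
        best_seq, best_score = seq, specificity(seq, target)
        for i in range(len(seq)):
            for base in BASES:
                if base == seq[i]:
                    continue
                candidate = seq[:i] + base + seq[i + 1:]
                score = specificity(candidate, target)
                if score > best_score:
                    best_seq, best_score = candidate, score
        if best_seq == seq:  # no single mutation improves the objective
            break
        seq = best_seq
    return seq


# Made-up 200-bp starting sequence with no complete toy motifs.
start = "TGAA" * 50
designed = evolve(start, "disease_fibroblast", rounds=10)
print(specificity(start, "disease_fibroblast"),
      specificity(designed, "disease_fibroblast"))
```

Subtracting the strongest off-target prediction is one simple way to reward specificity rather than raw activity; a real design run would score candidates with the trained model and could use a different objective or search strategy.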
The study has important caveats. The model captures only cis-regulatory mechanisms (those acting on the same DNA molecule) and was trained on discrete cell-type categories, which may miss regulation in continuous cell states. Its ability to predict disease-versus-healthy expression changes was modest, with an average correlation of 0.24 across all comparisons, though it still recovered known disease-associated transcription factors. Moreover, because the model’s output is tied to the cell types and conditions it saw during training, it cannot currently predict regulation in entirely new contexts.
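For readers unfamiliar with how a figure like "an average correlation of 0.24 across all comparisons" is typically computed, the sketch below correlates predicted and measured expression changes within each disease-versus-healthy comparison and then averages the results. It assumes Pearson correlation, which the article does not specify, and all numbers and comparison names are invented for illustration.

```python
import math
from typing import Dict, List, Tuple


def pearson(x: List[float], y: List[float]) -> float:
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def mean_correlation(
    comparisons: Dict[str, Tuple[List[float], List[float]]]
) -> float:
    """Average the per-comparison correlations (predicted vs. measured changes)."""
    values = [pearson(pred, obs) for pred, obs in comparisons.values()]
    return sum(values) / len(values)


# Invented per-gene expression changes: (predicted, measured) per comparison.
toy_comparisons = {
    "comparison_A": ([1.0, 2.0, 3.0], [2.0, 4.0, 6.0]),  # perfectly concordant
    "comparison_B": ([1.0, 2.0, 3.0], [3.0, 2.0, 1.0]),  # perfectly discordant
}
average_r = mean_correlation(toy_comparisons)
print(average_r)
```

The toy data also shows why an average can hide a wide spread: strongly concordant and strongly discordant comparisons cancel out, so a modest mean does not imply uniformly modest performance.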
Nevertheless, Decima adds a much-needed layer of regulatory insight to the rapidly growing single-cell atlases. It provides a way to generate mechanistic hypotheses for noncoding variants that have been statistically linked to disease but lack cellular context, and it opens a path toward designing gene therapy vectors with cell-type and disease-state specificity.
Reference: Lal, A. et al. Decoding sequence determinants of gene expression in diverse cellular and disease states. Nat. Methods (2026). https://doi.org/10.1038/s41592-026-03102-0