Unified Deep Learning Model Boosts Peptide Discovery by 42% in Immunopeptidomics
From AnyHelix Team · 26 May 2026 · 3 min read
A single deep learning model that handles both database-driven peptide searches and de novo sequencing has markedly improved the interpretation of mass spectrometry data, especially for hard-to-analyze samples such as those used in cancer immunology. Scientists report today in Nature Machine Intelligence that the model, called pUniFind, identified over 42% more peptides in immunopeptidomics experiments than a widely used conventional search engine, while also uncovering thousands of peptides that are not catalogued in existing protein databases but are traceable to the human genome. The work was led by corresponding author Hao Chi at the Chinese Academy of Sciences, with Jiale Zhao as first author.
Mass spectrometry-based proteomics typically relies on matching experimental spectra to known protein sequences. pUniFind was trained on more than 100 million peptide–spectrum matches obtained through open database searches, which allow for unexpected modifications and non-specific digestion. It uses cross-modality training tasks—predicting spectra from peptide sequences and generating peptide sequences from spectra—to learn deep representations of both data types. This design enables the model to score peptide–spectrum matches in an end-to-end fashion rather than relying on handcrafted features.
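For readers unfamiliar with spectrum matching, the kind of handcrafted feature that conventional engines rely on can be sketched in a few lines. The example below is not pUniFind's method: it is a minimal illustration, assuming unmodified peptides, singly charged b- and y-ions, and a fixed mass tolerance, of how a theoretical fragment list is compared with observed peaks. The function names and the tolerance value are hypothetical.

```python
# Illustrative only: a simple handcrafted match feature, not the pUniFind scorer.
# Monoisotopic residue masses (Da) for a subset of amino acids.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "L": 113.08406, "I": 113.08406,
    "N": 114.04293, "D": 115.02694, "K": 128.09496, "E": 129.04259,
    "R": 156.10111,
}
PROTON = 1.007276   # mass of a proton (Da)
WATER = 18.010565   # mass of H2O (Da)


def fragment_mz(peptide):
    """Return singly charged b- and y-ion m/z lists for an unmodified peptide."""
    masses = [RESIDUE_MASS[aa] for aa in peptide]
    b_ions, y_ions = [], []
    for i in range(1, len(masses)):
        # b-ion: N-terminal prefix of length i, plus one proton
        b_ions.append(sum(masses[:i]) + PROTON)
        # y-ion: C-terminal suffix starting at i, plus water and one proton
        y_ions.append(sum(masses[i:]) + WATER + PROTON)
    return b_ions, y_ions


def matched_fraction(peptide, observed_mz, tol=0.02):
    """Fraction of theoretical fragments with an observed peak within tol (Da)."""
    b_ions, y_ions = fragment_mz(peptide)
    theoretical = b_ions + y_ions
    hits = sum(any(abs(t - o) <= tol for o in observed_mz) for t in theoretical)
    return hits / len(theoretical)
```

A call such as `matched_fraction("GAK", [147.113, 129.066])` compares four theoretical fragments with two observed peaks. An end-to-end model replaces hand-built features of this kind with representations learned directly from the peptide and the spectrum.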
Across nine species datasets, pUniFind consistently identified more peptides than existing tools, with gains ranging from 2% to 18% at the peptide level. Validation experiments using entrapment databases, metabolic labeling, and mixed-species searches confirmed that the improvement did not come at the cost of accuracy. In immunopeptidomics—where peptides are presented on cell surfaces and are central to immunotherapy target discovery—pUniFind identified 42.6% more peptides than the open search engine Open-pFind and 17.4% more than MSFragger with MSBooster. The model also excelled in metaproteomics, identifying 6.3% more peptides than Open-pFind.
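Entrapment validation works by adding sequences that cannot be present in the sample to the search database and counting how often they are reported. The sketch below shows one commonly used "combined" estimate of the false discovery proportion. It is an assumption for illustration: the article does not state which estimator the authors used, and the function name and the example counts are hypothetical.

```python
def entrapment_fdp(n_target, n_entrapment, ratio=1.0):
    """Estimate the false discovery proportion from entrapment hits.

    n_target:     accepted identifications from the real (target) database
    n_entrapment: accepted identifications from the entrapment database
    ratio:        entrapment database size divided by target database size

    Assumes false hits land in the target and entrapment parts in proportion
    to their sizes, so each entrapment hit implies about 1/ratio hidden false
    hits among the target identifications.
    """
    if ratio <= 0:
        raise ValueError("ratio must be positive")
    total = n_target + n_entrapment
    if total == 0:
        return 0.0
    return n_entrapment * (1.0 + 1.0 / ratio) / total


# Hypothetical counts: 9,900 target hits and 100 entrapment hits
# with equally sized databases.
estimate = entrapment_fdp(9900, 100, ratio=1.0)
print(f"Estimated FDP: {estimate:.3f}")
```

If an engine reports results at a nominal 1% false discovery rate and an estimate like this one stays close to that level, the extra identifications are unlikely to be explained by looser error control.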
The study further demonstrated a unified open de novo sequencing workflow capable of handling over 1,300 post-translational modifications without prior knowledge of which ones are present. In modification-rich datasets, pUniFind yielded 60% more peptide–spectrum matches than existing de novo methods despite a 300-fold larger search space. In standard de novo sequencing of yeast data, it recovered 40.2% more peptides than Casanovo v.2. When applied to human immunopeptidomics data, pUniFind’s de novo mode recalled 38.5% more peptides than a database search, including 1,891 peptides that mapped to the genome but were absent from the reference proteome. A new quality control module based on multiple deep learning features increased the consistency of de novo results with RNA-Seq evidence from 65.4% to 85.0%.
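The figures above mix two kinds of comparison. Statements such as "40.2% more peptides" are relative gains over a baseline count, while the rise in RNA-Seq consistency from 65.4% to 85.0% is a change between two percentages. A short sketch makes the difference explicit; the helper names and the peptide counts are hypothetical and chosen only to show the arithmetic.

```python
def relative_gain(baseline, improved):
    """Relative gain in percent: how much larger improved is than baseline."""
    return (improved - baseline) / baseline * 100.0


def point_gain(baseline_pct, improved_pct):
    """Absolute change between two percentages, in percentage points."""
    return improved_pct - baseline_pct


# Consistency with RNA-Seq evidence, as reported in the article.
print(round(point_gain(65.4, 85.0), 1))      # change in percentage points
print(round(relative_gain(65.4, 85.0), 1))   # same change as a relative gain

# Hypothetical counts: 1,000 peptides from a baseline tool, 1,426 from a new one.
print(round(relative_gain(1000, 1426), 1))
```

Under this reading, the quality control module raised consistency by 19.6 percentage points, which corresponds to a relative improvement of about 30%.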
The model does not yet integrate chromatographic retention time into its database search scoring, partly because retention times vary across instruments and laboratories, and it has not been adapted for data-independent acquisition (DIA) data. The authors note that extending the framework to DIA analysis is a priority for future work.
By showing that end-to-end deep learning scoring can outperform traditional feature-based engines across diverse proteomics applications, the study sets the stage for more sensitive and reliable detection of modified and non-canonical peptides. For fields like cancer immunotherapy, where the identification of presented peptide antigens is critical, such gains could accelerate the discovery of new therapeutic targets.
Reference: Zhao, J., Mao, P., Wang, K. et al. A large-scale unified deep learning model for peptide mass spectrum interpretation trained on multimodal data. Nat Mach Intell (2026). https://doi.org/10.1038/s42256-026-01234-8