
scPRINT: Gene Network Inference from 50M Cells by Jérémie Kalfon

TL;DR

In this talk, Jérémie Kalfon presents his paper “scPRINT: pre-training on 50 million cells allows robust gene network predictions” at the Foundation Models for Biology Seminar Series by GenBio AI. He introduces scPRINT, a transformer-based foundation model trained on over 50 million single-cell RNA-seq profiles to infer gene networks. With scPRINT, scientists can predict cell-specific regulatory networks, correct batch effects, denoise data, and annotate cell types.

Introduction

Imagine trying to navigate from New York to LA without a map. That’s what it’s like studying cell biology without gene regulatory networks.

Gene regulatory networks (GRNs) offer a blueprint of how genes influence one another, and how information propagates through a cell. They are essential for understanding and controlling cellular differentiation, disease progression, and drug response. Yet most models struggle to generate accurate, cell-specific GRNs at scale.

scPRINT addresses this challenge with a large-scale transformer model trained on diverse single-cell datasets, enabling cell-specific network predictions across cell types and species.


Inside the scPRINT Architecture

scPRINT (Single-cell PRe-trained Inference of Networks with Transformers) is a bidirectional transformer designed to process scRNA-seq data. Key components include:

  • Training Scale: 50M+ cells, 80B tokens from CELLxGENE
  • Input Embeddings:
    • Gene ID (via ESM2 protein embedding)
    • Gene value (log-normalized expression)
    • Gene position (location on the genome)
  • Multiple Pretraining Objectives:
    • Denoising downsampled gene expression
    • Bottleneck learning through compressed embeddings
    • Multilabel classification of cell type, disease, tissue, sex, and more
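
To make the three input embeddings above concrete, here is a minimal sketch of how one token per gene could be assembled. It assumes the components are summed element-wise and that the expression value is projected with a single weight vector; both are illustrative assumptions, not details confirmed by the talk. The names log_normalize, build_tokens, and value_weights are hypothetical, and random vectors stand in for the ESM2 and genomic-position embeddings.

```python
import math
import random


def log_normalize(counts, target_sum=10_000):
    """Scale raw counts to a fixed library size, then apply log(1 + x)."""
    total = sum(counts)
    if total == 0:
        return [0.0] * len(counts)
    return [math.log1p(c * target_sum / total) for c in counts]


def build_tokens(id_vectors, expression, position_vectors, value_weights):
    """Return one d-dimensional token per gene.

    Assumption: the ID, value, and position components are summed.
    The value component is the scalar expression times a weight vector,
    a stand-in for whatever learned projection the real model uses.
    """
    tokens = []
    for id_vec, x, pos_vec in zip(id_vectors, expression, position_vectors):
        value_vec = [x * w for w in value_weights]
        tokens.append([a + b + c for a, b, c in zip(id_vec, value_vec, pos_vec)])
    return tokens


rng = random.Random(0)
d = 4                      # toy embedding width
counts = [0, 3, 12, 1]     # toy raw counts for four genes

# Random placeholders for the protein-derived ID and position embeddings.
id_vectors = [[rng.gauss(0, 1) for _ in range(d)] for _ in counts]
position_vectors = [[rng.gauss(0, 1) for _ in range(d)] for _ in counts]
value_weights = [rng.gauss(0, 1) for _ in range(d)]

expression = log_normalize(counts)
tokens = build_tokens(id_vectors, expression, position_vectors, value_weights)
print(len(tokens), len(tokens[0]))  # 4 4
```

A gene with zero counts contributes only its ID and position components, so in this sketch the token still identifies the gene even when it is not expressed.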

The result is a model with disentangled embeddings that capture distinct facets of cell identity and function.
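
The denoising objective listed among the pretraining tasks can be pictured as follows: an expression profile is artificially downsampled, and the model is trained to recover the original from the degraded copy. The sketch below builds one such training pair with binomial thinning, a common way to simulate lower sequencing depth. The thinning scheme, the keep_fraction parameter, and the function name downsample_counts are assumptions for illustration and may differ from what scPRINT actually does.

```python
import random


def downsample_counts(counts, keep_fraction, rng):
    """Binomial thinning: keep each observed molecule with probability keep_fraction."""
    if not 0.0 <= keep_fraction <= 1.0:
        raise ValueError("keep_fraction must be in [0, 1]")
    return [
        sum(1 for _ in range(c) if rng.random() < keep_fraction)
        for c in counts
    ]


rng = random.Random(42)
original = [0, 5, 40, 2, 13]   # toy raw counts for five genes
noisy = downsample_counts(original, keep_fraction=0.3, rng=rng)

# (noisy, original) is one training pair: degraded input, reconstruction target.
print(noisy, original)
```

Because thinning only ever removes molecules, every downsampled count is bounded by its original value, which keeps the reconstruction target well defined.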

How scPRINT Outperforms Existing Methods

scPRINT was benchmarked against leading tools like scGPT, Geneformer v2, DeepSEM, and GENIE3 for gene network inference. In both simulated and real biological datasets, it consistently outperformed competitors.

  • Network Recovery: scPRINT recovered the most accurate models of gene networks on both real and simulated data.
  • Real Datasets: Networks generated by scPRINT were significantly enriched for known biology, including transcription factors, cell-type markers, and regulatory pathways.
  • Disease Contexts: In benign prostatic hyperplasia (BPH), scPRINT uncovered links between ion exchange, senescence, and chronic inflammation in fibroblasts, providing new leads for therapeutic development.

These results demonstrate the model’s robustness, accuracy, and versatility.
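
For readers who want a feel for what "network recovery" and "enriched for known biology" mean in practice, the sketch below scores a predicted edge list against a reference network with precision and recall, then asks how surprising the overlap is under a hypergeometric null. This is a generic illustration, not the benchmark protocol used in the paper; the function names and the toy edges are hypothetical.

```python
from math import comb


def precision_recall(predicted_edges, reference_edges):
    """Precision and recall of predicted edges against a reference edge set."""
    predicted = set(predicted_edges)
    reference = set(reference_edges)
    hits = len(predicted & reference)
    precision = hits / len(predicted) if predicted else 0.0
    recall = hits / len(reference) if reference else 0.0
    return precision, recall


def hypergeom_pvalue(hits, n_predicted, n_reference, n_possible):
    """P(X >= hits) if n_predicted edges were drawn at random from
    n_possible edges, of which n_reference are in the reference network."""
    upper = min(n_predicted, n_reference)
    total = comb(n_possible, n_predicted)
    tail = sum(
        comb(n_reference, k) * comb(n_possible - n_reference, n_predicted - k)
        for k in range(hits, upper + 1)
    )
    return tail / total


# Toy example: 4 genes, so 4 * 3 = 12 possible directed edges (no self-loops).
predicted = [("A", "B"), ("A", "C"), ("B", "D")]
reference = [("A", "B"), ("B", "D"), ("C", "D"), ("D", "A")]

precision, recall = precision_recall(predicted, reference)
hits = len(set(predicted) & set(reference))
p_value = hypergeom_pvalue(hits, len(predicted), len(reference), n_possible=12)
print(round(precision, 3), recall, round(p_value, 3))  # 0.667 0.5 0.236
```

On real data the set of possible edges is far larger than the reference, so even a modest overlap can be highly significant; the toy numbers here are only meant to show the mechanics.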

Interpretable Attention for Gene Network Analysis

A standout feature of scPRINT is its ability to generate interpretable attention maps. These matrices show which genes the model considers connected, enabling researchers to:

  • Visualize gene networks
  • Select biologically relevant attention heads
  • Customize outputs based on known networks like Omnipath or ENCODE

This interpretability enhances both hypothesis generation and validation in genomics research.
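
One way to picture head selection against a known network is sketched below: each attention head is scored by how many of its strongest gene-gene weights match known edges, the best heads are averaged, and the top-weighted pairs of the averaged matrix are read out as the network. This is a simplified illustration under stated assumptions, not scPRINT's actual procedure. The function names, the toy matrices, and the choice to read a row-to-column weight as an edge from the row gene to the column gene are all hypothetical.

```python
def top_k_edges(matrix, genes, k):
    """Return the k highest-weighted (row_gene, col_gene) pairs, ignoring the diagonal."""
    n = len(genes)
    edges = [
        (matrix[i][j], genes[i], genes[j])
        for i in range(n) for j in range(n) if i != j
    ]
    edges.sort(key=lambda e: e[0], reverse=True)
    return [(src, dst) for _, src, dst in edges[:k]]


def head_score(head, known_edges, genes, k):
    """Fraction of a head's top-k edges that appear in the known network."""
    top = top_k_edges(head, genes, k)
    return sum(1 for e in top if e in known_edges) / k


def select_heads(heads, known_edges, genes, k, n_keep):
    """Indices of the n_keep heads that best match the known network."""
    ranked = sorted(
        range(len(heads)),
        key=lambda h: head_score(heads[h], known_edges, genes, k),
        reverse=True,
    )
    return ranked[:n_keep]


def average_heads(heads):
    """Element-wise mean of several attention matrices."""
    n = len(heads[0])
    return [
        [sum(h[i][j] for h in heads) / len(heads) for j in range(n)]
        for i in range(n)
    ]


genes = ["TF1", "G1", "G2"]
known = {("TF1", "G1"), ("TF1", "G2")}   # toy stand-in for a curated network
heads = [
    [[0.1, 0.6, 0.3], [0.45, 0.45, 0.1], [0.2, 0.2, 0.6]],
    [[0.1, 0.5, 0.4], [0.2, 0.7, 0.1], [0.3, 0.1, 0.6]],
    [[0.8, 0.1, 0.1], [0.1, 0.2, 0.7], [0.1, 0.6, 0.3]],
]

kept = select_heads(heads, known, genes, k=2, n_keep=2)
consensus = average_heads([heads[h] for h in kept])
print(kept, top_k_edges(consensus, genes, k=2))
# [1, 0] [('TF1', 'G1'), ('TF1', 'G2')]
```

Selecting heads against a curated network biases the output toward that network, so in practice the heads would be chosen on one set of edges and evaluated on another.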

Use Cases in Modern Genomic Research

scPRINT has broad applications in computational biology and biomedical research:

  • Genome-wide gene network inference
  • Batch effect correction and denoising
  • Cell type and phenotype prediction
  • Single-cell multi-omics integration
  • Explainable AI for cell modeling

Its modest training cost (48 hours on one A40 GPU) and open-source availability make it accessible to both academic and industry labs.

Final Thoughts

For decades, researchers have struggled to go beyond static snapshots of gene expression and uncover the intricate networks that control cell function. With scPRINT, we begin to move from isolated observations to a unified, dynamic understanding of cellular behavior.

scPRINT marks a significant advancement in our ability to model cell-specific gene networks. By combining large-scale single-cell transcriptomics, biologically informed embeddings, and transformer-based learning, it brings us closer to reconstructing cell biology not just through isolated data points, but through robust, interpretable systems.

For further details, check out the full paper on scPRINT and scPRINT’s GitHub repository.


