
At GenBio AI, we are developing an AI-Driven Digital Organism (AIDO), a system of multiscale foundation models, which learn representations for a diverse range of biological data, from molecular interactions to cellular behaviors to phenotype information. By leveraging cutting-edge AI methods, we aim to create a unified AI-driven framework for predicting, simulating and programming biology across all scales. We will use this system to address fundamental questions in drug discovery, bio-engineering, and disease prevention.
Multiscale Complexity of Biology
One of the major challenges in building foundation models for biology is that biology operates in a language vastly different from natural languages and images. Biological systems encompass multiscale complexities spanning from the molecular level (DNA, RNA, and proteins), through network levels (protein interaction networks, regulatory networks, and gene expression within cells), to intricate systems like cell-cell interactions, organs, individuals, and societies. Historically, numerous specialized machine learning and computational biology models have been developed to address specific issues within various facets of biology and life sciences. However, these models are often constructed with limited labeled data and tailored for particular applications, an approach commonly known as the “one model for one task” principle. The result is uncontextualized and often suboptimal performance, with limited transferability to other problems.
Despite recent advancements, most current foundation models in biology are designed and built for individual data modalities. They do not account for the multiscale nature of biology or the multimodal characteristics of biological data, and are therefore not quite “foundational” for biology. A truly foundational model that is intrinsically integrative, multiscale, and capable of addressing biological questions across diverse scales is still missing.

Multiscale Foundation Models for Biology
It is our view that a foundation model for biology, which can be a system of component foundation models (FMs), needs to incorporate multiple types of data and biological constraints arising from different biological scales. Furthermore, such a system is more than just an agglomeration of modality-specific FMs. It must involve system-wide harmonization through nested or hierarchical representation propagation, utilization, fine-tuning, or continual pretraining. It should also be able to connect different FM modules within the system, and provide a foundation for addressing more complex prediction, simulation, and programming tasks arising from molecules, cells, organisms, and beyond. We call such an integrated multiscale system of foundation models an AI-Driven Digital Organism (AIDO).
In constructing the AIDO, several key challenges need to be addressed. First, it is essential to determine what constitutes a valid set of component foundation models within an AIDO, and how to achieve a parsimonious yet gapless collection. This involves deciding which models are necessary to capture the complexity of biological systems and ensuring that they cover a broad spectrum of biological phenomena. Second, constructing these models requires careful consideration of data acquisition and preprocessing. The availability and quality of biological data significantly impact the performance and applicability of the foundation models. Third, selecting suitable deep learning architectures is vital for developing the effective foundation models that constitute the AIDO. It is necessary to explore which types of architectures are most appropriate for capturing the intricate patterns and relationships inherent in specific biological data modalities. Fourth, integrating biological knowledge into these models poses a significant challenge. Effective approaches must be developed to incorporate domain-specific knowledge, ensuring that the models not only learn from data but also adhere to known biological principles. This integration can enhance the models’ interpretability and reliability. Fifth, incorporating the multiscale and multimodal nature of biology is essential for connecting different pretrained component FMs. Developing methods that integrate data from various scales and modalities will enable the models to capture the interconnectedness of biological systems, from molecular interactions to cellular processes and organismal behaviors.

Taking these aspects into account, we posit that an AIDO should be built in a modular and connectable way, such that its modules can be combined and cascaded to address problems arising at different biological scales and complexities. A viable engineering approach is to develop the AIDO in three stages. The first stage is to build a necessary set of fundamental building blocks, or modules, representing the major data modalities arising in biology in a “divide-and-conquer” fashion. The second stage is to develop a set of new deep learning architectures that integrate the central dogma, regulatory rules, and the interconnected nature of biology, as well as different data modalities or modules, in a bottom-up fashion that reflects the multiscale, nested, and hierarchical organization of biological systems. These architectures can bridge existing gaps by integrating biological knowledge into the models and by operating seamlessly across biological scales and modalities; this is the stage where the “dots are connected”. In the third stage, the individual and connected modules are unified into a networked system in which representations and embeddings can be passed between nodes and levels, and in which feedback and gradient signals from the coarser, higher levels can be propagated all the way back to the bottom level to further improve the modules. This resembles the “alignment and optimization” phase of an assembly process. With a set of benchmarks and supervisory tasks from different levels and scales of biology, all system modules can be jointly adapted and aligned to achieve synergy, leading to better overall, or even emergent, system-level performance.
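To make the third stage more concrete, here is a deliberately tiny sketch of how a supervisory signal at the top of a cascade can flow back to improve a lower-level module. Everything in it is a hypothetical stand-in: each "module" is a single scalar weight rather than a real foundation model, and the names `bottom_module`, `top_module`, and `joint_finetune` are ours for illustration only. The point is the chain rule, which lets the error measured at the top update both modules jointly.

```python
# Toy illustration only: each "module" is one scalar weight, not a real model.

def bottom_module(w_bottom, x):
    """Hypothetical lower-level module: maps a raw input to a representation."""
    return w_bottom * x


def top_module(w_top, h):
    """Hypothetical higher-level module: consumes the lower-level representation."""
    return w_top * h


def loss_and_grads(w_bottom, w_top, x, y):
    """Squared error at the top level, with gradients for both modules."""
    h = bottom_module(w_bottom, x)
    pred = top_module(w_top, h)
    err = pred - y
    loss = err ** 2
    d_pred = 2.0 * err                 # d(loss)/d(pred)
    grad_top = d_pred * h              # signal used by the top module
    grad_bottom = d_pred * w_top * x   # same signal, propagated down the cascade
    return loss, grad_bottom, grad_top


def joint_finetune(w_bottom, w_top, data, lr=0.01, steps=200):
    """Jointly adapt both modules with plain gradient descent."""
    history = []
    for _ in range(steps):
        total = 0.0
        for x, y in data:
            loss, grad_bottom, grad_top = loss_and_grads(w_bottom, w_top, x, y)
            w_bottom -= lr * grad_bottom
            w_top -= lr * grad_top
            total += loss
        history.append(total)
    return w_bottom, w_top, history


# Top-level supervisory task: the cascade should reproduce y = 2 * x.
data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]
w_bottom, w_top, history = joint_finetune(0.5, 0.5, data)
print(f"first-epoch loss {history[0]:.4f}, last-epoch loss {history[-1]:.2e}")
```

In a real system the modules would be large pretrained networks and the gradients would come from an automatic differentiation framework, but the flow of information is the same: the top-level task decides how the bottom-level module changes.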
What We Have Achieved So Far
As an initial step for building stage 1 of an AIDO, we have completed the pretraining of the following set of modules:

- AIDO.DNA: AIDO.DNA is a versatile DNA foundation model with 7 billion parameters, trained on 10.6 billion nucleotides spanning 796 species. By scaling model size while maintaining a concise context length of 4K nucleotides, AIDO.DNA demonstrates significant advancements across a wide range of supervised, generative, and zero-shot tasks, supporting applications in functional genomics, synthetic biology, and drug development. Notably, AIDO.DNA outperforms prior encoder-only architectures without new data, suggesting that new scaling laws are needed to achieve compute-optimal DNA language models.
- AIDO.RNA: AIDO.RNA is a general-purpose RNA foundation model with 1.6 billion parameters, trained on 42 million non-coding RNA sequences at single-nucleotide resolution. It sets a new benchmark in performance across a wide range of tasks, including RNA secondary structure prediction, mRNA-related tasks, RNA function prediction, and RNA inverse folding. After domain adaptation, AIDO.RNA excels in modeling protein-level tasks, highlighting its potential to leverage the central dogma for enhancing biomolecular representations.
- AIDO.Protein: AIDO.Protein is the first protein foundation model to incorporate a Mixture-of-Experts (MoE) architecture, scaling to 16 billion parameters. By leveraging this innovative design, it achieves exceptional training and inference efficiency. Pretrained on 1.2 trillion amino acids from UniRef90 and ColabFoldDB, AIDO.Protein delivers state-of-the-art performance across the xTrimoPGLM benchmark, excels in the ProteinGym DMS benchmark, and sets new records in structure-conditioned protein sequence generation. These groundbreaking results position AIDO.Protein as a transformative model for protein understanding and design.
- AIDO.StructureTokenizer: AIDO.StructureTokenizer is an innovative protein structure tokenizer based on VQ-VAE, specifically designed to capture essential structural information while maintaining the sequential relationship with amino acids. It demonstrates exceptional performance in 3D structure reconstruction and homology detection, proving its efficiency in protein structure tokenization. When combined with our protein language model, AIDO.Protein, for structure prediction, it significantly outperforms ESM3 in benchmarks such as CASP14, CASP15, and CAMEO.
- AIDO.RAGPLM and AIDO.RAGFold: AIDO.RAGPLM and AIDO.RAGFold are innovative models designed for retrieval-augmented protein language modeling and structure prediction. AIDO.RAGPLM integrates pretrained protein language models with retrieved MSA, incorporating co-evolutionary information and compensating for insufficient MSA through large-scale pretraining. It surpasses single-sequence models in perplexity, contact prediction, and fitness prediction and serves as the foundation for AIDO.RAGFold. AIDO.RAGFold achieves TM-scores comparable to AlphaFold2 when sufficient MSA is available while operating up to eight times faster, and significantly outperforms AlphaFold2 in low-MSA scenarios. Additionally, a novel MSA retriever improves retrieval speed by 45–90 times and expands the MSA training set by 32%, enhancing the efficiency and accuracy of these models for protein structure prediction.
- AIDO.Cell: AIDO.Cell is a series of single-cell foundation models, ranging from 3 million to 650 million parameters, pretrained on 50 million human cells from diverse tissues. Unlike many models, AIDO.Cell processes the entire transcriptome as input without relying on truncation or sampling techniques, enabling it to learn precise and comprehensive representations of a cell’s transcriptional context. The 100M-parameter version of AIDO.Cell achieves state-of-the-art results in tasks such as zero-shot clustering, cell-type classification, and perturbation modeling.
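As a rough illustration of what structure tokenization with a VQ-VAE involves, the sketch below shows only the vector-quantization lookup that is standard in VQ-VAE models: each continuous per-residue feature vector is replaced by the index of its nearest codebook entry, so the output has exactly one token per residue. The two-dimensional features, the three-entry codebook, and the function names are made-up assumptions for illustration; the actual encoder, codebook size, and training procedure of AIDO.StructureTokenizer are not described here.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def quantize(vectors, codebook):
    """Map each vector to the index of its nearest codebook entry."""
    tokens = []
    for v in vectors:
        distances = [squared_distance(v, c) for c in codebook]
        tokens.append(distances.index(min(distances)))
    return tokens


def dequantize(tokens, codebook):
    """Recover the (lossy) continuous vectors from discrete tokens."""
    return [codebook[t] for t in tokens]


# Hypothetical codebook and per-residue features (one vector per residue).
codebook = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
residue_features = [(0.1, -0.1), (0.9, 0.2), (0.2, 0.8), (0.8, 0.1)]

tokens = quantize(residue_features, codebook)
print(tokens)  # [0, 1, 2, 1]
```

Because the token sequence has the same length and order as the residues, it can be aligned position by position with the amino acid sequence, which is what allows a protein language model to consume structure tokens alongside sequence.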
We will continue improving these modules, as well as building more modules to cover other biological modalities and scales. Once we have developed modules for different modalities and scales of biological data, we can connect and combine these modules to address more complex biological problems, and build better models by linking information arising from different scales of biology.
The GenBio AI Platform
Realizing the AIDO vision requires long-term, continuous, and sustainable development and community efforts. To foster such an effort, we make the weights of certain mature versions of the models and the adaptation software packages publicly available for reproducibility, community building, and standardization. The goal is to connect life science, medicine, pharmacy, and public health through a shared technical paradigm, with broader community involvement from academia, industry, and government, aligned on purpose-driven data generation and collection efforts, and closed-loop collaboration across the cycle of data, model, hypothesis, outcome, and further data generation.
Through continuous versioning and upgrading of base models, APIs, task suites, data banks, and bio-entity representation repositories, fueled by an ever-increasing willingness to share data, federated data mining, and the never-fading demand for synthetic biology and personalized medicine, we believe a new community of AIDO users and developers can emerge and grow from this open-source effort, as seen in the LLM field. That community can pursue the common vision of mirroring life in the physical world on a computer with AI.
The initial release of the AIDO platform is designed as an interactive AI-powered toolkit that allows researchers to simulate biological phenomena across multiple scales:
- Multiscale Biological Simulations: We provide an interface for users to build computational biology models and simulate biological phenomena spanning from molecules to cells and tissues. The platform supports simulations for a range of biological queries, helping researchers understand interactions across different biological levels.
- Adaptation to Sparse Data: The platform incorporates self-supervised learning techniques and efficient finetuning to make reliable predictions, even when limited data is available. This capability is particularly useful for applications like rare disease research and early-stage drug discovery.
- Pan-Modal Integration: By integrating diverse biological data sources and scales (e.g., genomes, proteins, and cellular interactions), our platform offers a comprehensive view of biological systems, assisting researchers in modeling complex biological processes holistically.
- User Interaction and Customization: Researchers can interact with the models through a streamlined interface, allowing for model customization, data analysis, and visualization of results. This enables iterative exploration of hypotheses across biological scales.
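For readers unfamiliar with self-supervised learning, the sketch below shows one common form of it, masked-token prediction, applied to a nucleotide sequence: some positions are hidden and the model is trained to recover them, so no human labels are required. This is an illustrative assumption rather than a description of the platform's internals; the 15% mask rate, the `[MASK]` symbol, and the function name `mask_sequence` are our own choices.

```python
import random

MASK = "[MASK]"  # hypothetical mask symbol


def mask_sequence(tokens, mask_rate=0.15, seed=0):
    """Hide a random subset of tokens; return the corrupted copy and the targets."""
    if not tokens:
        return [], {}
    rng = random.Random(seed)  # local generator, so results are reproducible
    n_mask = max(1, round(len(tokens) * mask_rate))
    positions = rng.sample(range(len(tokens)), n_mask)
    corrupted = list(tokens)
    targets = {}
    for pos in positions:
        targets[pos] = corrupted[pos]  # what the model must predict
        corrupted[pos] = MASK          # what the model actually sees
    return corrupted, targets


sequence = list("ACGTACGTACGTACGTACGT")
corrupted, targets = mask_sequence(sequence)
print("".join(t if t != MASK else "_" for t in corrupted))
print(targets)
```

A model pretrained this way on large unlabeled corpora can then be adapted to a downstream task with comparatively few labeled examples, which is the setting the sparse-data bullet above refers to.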
How GenBio AI is Different
GenBio AI’s system of multiscale foundation models integrates data across all scales of biological information, from molecules and genes to cells, tissues, and entire populations. Unlike models that focus on a single data type, our approach is an integrated system designed to capture complex interactions across diverse biological scales. This comprehensive integration enables more versatile solutions, addressing challenges across drug discovery, disease modeling, and synthetic biology with precision and adaptability.
Our models are not only multimodal but also designed to work in data-scarce environments, using self-supervised learning to make accurate predictions even with limited data. This makes GenBio AI especially valuable in tackling complex problems like rare diseases and emerging pathogens. By transitioning from reactive experimentation to a proactive simulation-driven approach, we aim to transform how biological research is conducted, paving the way for breakthroughs in personalized medicine, precision agriculture, and environmental sustainability.
Looking Ahead
At GenBio AI, we envision a future where AI not only decodes biology but also drives the discovery of new life-transforming solutions. Our system of multiscale foundation models is designed to understand biological systems at every scale, paving the way for innovations in drug development, personalized medicine, and disease prevention. We aim to make biology programmable, helping researchers move from trial-and-error experiments to predictive simulations that accelerate breakthroughs.
Our platform will soon be available to researchers, biotechnologists, and healthcare professionals, enabling them to simulate complex biological scenarios even with limited datasets. By providing early access to our models, we are fostering a collaborative environment where scientists can explore and expand the frontiers of biomedical research and bio-engineering. We encourage experts across diverse fields to engage with our models, contribute feedback, and drive the development of tools that can reshape biological understanding.
GenBio AI is also developing specialized models tailored to address critical issues in oncology, neurodegenerative diseases, and genomics. These models will offer advanced capabilities for generating insights, testing hypotheses, and identifying potential targets for therapeutic intervention. By focusing on both general applications and specific domains, we aim to empower scientists to build the next generation of biological solutions that tackle the most pressing challenges in healthcare and beyond.
As we progress, our goal is to continuously expand the capabilities of our foundation models, integrating more data sources and enhancing the models’ predictive accuracy. Our vision is to create a comprehensive AI platform allowing seamless interaction with biological data at all levels—from molecules to populations. This approach will transform not only scientific research but also the development of sustainable solutions for agriculture, public health, and the environment. We believe that AI-powered biology can catalyze a new era of scientific innovation, one that drives humanity closer to a healthier and more sustainable future.