Think of it as the world’s largest eavesdropper on the symphony of life: Evo 2, an AI model trained on DNA from more than 100,000 species, can not only spot troublemaking mutations lurking in the human genome but also compose brand-new genetic sequences from scratch—heralding a future where biology bows to code.
Developed by a team from Arc Institute and NVIDIA—with participation from Stanford University, UC Berkeley, and UC San Francisco—Evo 2 was released on February 19, 2025. Alongside it comes a user-friendly interface, Evo Designer. The underlying code is available on Arc Institute’s GitHub page and is integrated into the NVIDIA BioNeMo framework, a collaboration that aims to speed scientific discovery. Additionally, Arc Institute partnered with AI research lab Goodfire to create a mechanistic interpretability visualizer for peering into the model’s inner workings—specifically, the features and motifs it identifies in genomic data. By sharing everything from training data to model weights, the team claims Evo 2 is the largest-scale, fully open-source AI model in biology yet.
Evo 2 follows in the footsteps of Evo 1, a smaller prototype trained only on the genomes of single-celled organisms. This second iteration, however, has become the largest artificial intelligence model in biology: it was trained on 9.3 trillion nucleotides drawn from over 128,000 complete genomes—spanning bacteria, archaea, phages, humans, plants, and a range of other single-celled and multicellular eukaryotes.
“Our development of Evo 1 and Evo 2 represents a key moment in the emerging field of generative biology, as the models have enabled machines to read, write, and think in the language of nucleotides,” says Patrick Hsu, Arc Institute Co-Founder, Arc Core Investigator, an Assistant Professor of Bioengineering and Deb Faculty Fellow at the University of California, Berkeley, and a co-senior author on the Evo 2 preprint. “Evo 2 has a generalist understanding of the tree of life that’s useful for a multitude of tasks, from predicting disease-causing mutations to designing potential code for artificial life. We’re excited to see what the research community builds on top of these foundation models.”
Proponents say evolution has seeded hidden signals in DNA and RNA, exactly the sort of patterns that Evo 2 can latch onto. “Just as the world has left its imprint on the language of the Internet used to train large language models, evolution has left its imprint on biological sequences,” says the preprint’s other co-senior author Brian Hie, an Assistant Professor of Chemical Engineering at Stanford University, the Dieter Schwarz Foundation Stanford Data Science Faculty Fellow, and Arc Institute Innovation Investigator in Residence. “These patterns, refined over millions of years, contain signals about how molecules work and interact.” Brian will also be presenting his work at SynBioBeta 2025: The Global Synthetic Biology Conference this May in San Jose, California.
Under the hood, Evo 2’s training consumed several months of compute time on NVIDIA DGX Cloud AI via AWS, using more than 2,000 NVIDIA H100 GPUs in collaboration with NVIDIA’s own research division. That scale of hardware was necessary because the model processes context windows of up to 1 million nucleotides at once, letting it spot connections between genomic regions that sit far apart on a chromosome. Achieving that context length wasn’t trivial; Greg Brockman, Co-Founder and President of OpenAI, spent a portion of his sabbatical leading the charge with a new AI architecture called StripedHyena 2. The result: Evo 2 trains on 30 times more data than Evo 1 and handles eight times the nucleotide sequence length, a leap well beyond standard deep learning practice.
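To get a feel for what a million-nucleotide context means in practice, here is a minimal sketch of how genomic language models typically tokenize DNA at single-nucleotide resolution and chunk long sequences into windows. The vocabulary and function names below are illustrative assumptions, not Evo 2’s actual tokenizer (which is defined in Arc Institute’s released code):

```python
# Illustrative single-nucleotide tokenizer: each base becomes one integer
# token, so a chromosome-scale DNA string maps to a long token sequence.
# This vocabulary is hypothetical, not Evo 2's actual token table.
NUCLEOTIDE_VOCAB = {"A": 0, "C": 1, "G": 2, "T": 3, "N": 4}  # N = unknown base

def tokenize(dna: str) -> list[int]:
    """Convert a DNA string into integer token IDs, one per nucleotide."""
    return [NUCLEOTIDE_VOCAB[base] for base in dna.upper()]

def window(tokens: list[int], max_len: int = 1_000_000) -> list[list[int]]:
    """Split a token sequence into chunks no longer than the model's
    context window (Evo 2 reportedly handles up to 1M nucleotides)."""
    return [tokens[i:i + max_len] for i in range(0, len(tokens), max_len)]
```

At one token per base, a 1-million-token window covers an entire large human gene plus its distant regulatory elements in a single pass, which is why shorter-context models struggle to learn such long-range relationships.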
So what can Evo 2 actually do? Early experiments suggest it can distinguish benign from pathogenic mutations in human genes. In a test on BRCA1 gene variants, it classified which ones pose a risk for breast cancer with over 90% accuracy. Such predictive insights could help researchers cut back on laborious cell and animal studies, accelerating drug development and the hunt for the genetic culprits behind disease.
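The usual zero-shot recipe for this kind of variant scoring is to compare how likely the model finds the mutated sequence versus the reference. The sketch below illustrates that delta-log-likelihood idea using a toy k-mer frequency model as a stand-in for the real network; the function names and threshold-free scoring are assumptions for illustration, not Evo 2’s published pipeline:

```python
import math
from collections import Counter

# Zero-shot variant scoring, sketched with a toy 3-mer model standing in
# for a genomic language model: a variant that makes the sequence less
# likely under the model (negative delta) is flagged as more probably
# deleterious. All names here are illustrative, not Evo 2's API.

def train_kmer_model(reference: str, k: int = 3) -> dict[str, float]:
    """Log-frequency of each k-mer in a reference corpus (model stand-in)."""
    counts = Counter(reference[i:i + k] for i in range(len(reference) - k + 1))
    total = sum(counts.values())
    return {kmer: math.log(c / total) for kmer, c in counts.items()}

def log_likelihood(model: dict[str, float], seq: str, k: int = 3) -> float:
    """Sum of k-mer log-probabilities; unseen k-mers get a small floor."""
    floor = math.log(1e-6)
    return sum(model.get(seq[i:i + k], floor) for i in range(len(seq) - k + 1))

def score_variant(model: dict[str, float], ref_seq: str, pos: int, alt: str) -> float:
    """Delta log-likelihood of a single-base substitution vs. the reference."""
    var_seq = ref_seq[:pos] + alt + ref_seq[pos + 1:]
    return log_likelihood(model, var_seq) - log_likelihood(model, ref_seq)
```

A strongly negative delta means the substitution disrupts patterns the model has learned from evolution; with Evo 2, the same comparison is made using the trained network’s sequence probabilities rather than k-mer counts.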
Beyond merely flagging harmful variants, Evo 2 could guide the engineering of novel biological tools. For instance, “if you have a gene therapy that you want to turn on only in neurons to avoid side effects, or only in liver cells, you could design a genetic element that is only accessible in those specific cells,” says co-author and computational biologist Hani Goodarzi, an Arc Core Investigator and an Associate Professor of Biochemistry and Biophysics at the University of California, San Francisco. “This precise control could help develop more targeted treatments with fewer side effects.”
The overarching goal is to offer Evo 2 as a bedrock for specialized AI models in biology. “In a loose way, you can think of the model almost like an operating system kernel—you can have all of these different applications that are built on top of it,” says Arc’s Chief Technology Officer Dave Burke, a co-author on the preprint. “From predicting how single DNA mutations affect a protein’s function to designing genetic elements that behave differently in different cell types, as we continue to refine the model and researchers begin using it in creative ways, we expect to see beneficial uses for Evo 2 we haven’t even imagined yet.”
Given the ethical and safety implications, the developers excluded data from pathogens that infect humans and other complex organisms, and designed the system to withhold useful outputs in response to queries about these potentially risky agents. The team also enlisted Tina Hernandez-Boussard, a Stanford Professor of Medicine, to guide responsible usage and deployment.
“Evo 2 has fundamentally advanced our understanding of biological systems,” says Anthony Costa, director of digital biology at NVIDIA. “By overcoming previous limitations in the scale of biological foundation models with a unique architecture and the largest integrated dataset of its kind, Evo 2 generalizes across more known biology than any other model to date — and by releasing these capabilities broadly, the Arc Institute has given scientists around the world a new partner in solving humanity’s most pressing health and disease challenges.”
For the nitty-gritty, you can find the preprint, “Genome modeling and design across all domains of life with Evo 2,” on bioRxiv. There’s also a companion machine learning paper, “Systems and Algorithms for Convolutional Multi-Hybrid Language Models at Scale,” for anyone itching to dive into the technical details.