[Brett Sayles/Pexels]

Remember the 'Big Data' Problem? Yeah, Me Neither

The big data problem of the mid-2010s has shifted to a small data problem, and synthetic biologists, with their high-throughput testing platforms and multimodal datasets, are uniquely equipped to provide the data needed to train machine learning models.
AI & Digital Biology
by Jenna E Gallegos, PhD | July 9, 2024

Ten years ago, all anybody could talk about was big data. Suddenly, sequencing costs had plummeted, and there weren’t enough Python aficionados out there to analyze the mountains of sequencing data. This so-called “big data” problem was touted as the only thing standing between sequencing reads and curing cancer.

At the SynBioBeta conference this year, a different problem emerged: a “small data” problem. Artificial intelligence is poised to be the solution to everything, if only there were enough high-quality, diverse data to feed the hungry machine learning models.

Has the Hostile AI Takeover Already Happened?

If an artificial being gained sentience and wanted to take over the world, it would probably start by trying to gain popularity and power by coercing everyone into talking about it. At SynBioBeta, AI took center stage, both literally and figuratively. The main stage hosted nearly a dozen talks and fireside chats with artificial intelligence (or intelligent algorithms) in the title. AI infiltrated nearly a dozen more breakout panel sessions. 

[Google DeepMind via Pexels]

Synthetic biology has long had a special relationship with AI because the field is inherently interdisciplinary. Biologists devise and perform the experiments, engineers figure out how to do them faster and at scale, and computer scientists create algorithms to analyze the data and build models.

But suddenly, synthetic biologists and our protein engineering friends aren’t the only ones in the game. There were speakers at SynBioBeta with pedigrees that included Google and Facebook. Even Salesforce (yes, that Salesforce) is using AI for protein design.

Data analysis is no longer the bottleneck, and the tools are smarter than ever.

All Models Are Only as Good as Their Training Sets

Modern large language models like ChatGPT are trained on the entire internet. That’s about 200 billion words. DNA alone encodes trillions of possible words (amino acid combinations). That’s a lot of data, and it’s just DNA. Imagine what we can learn from RNA, proteins, and secondary metabolites. The problem is we’re not accessing it.
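For scale, here is a quick back-of-envelope sketch, pure arithmetic in Python using the ~200-billion-word figure above; the sequence lengths are arbitrary and only meant to show how quickly biological “vocabularies” outgrow an LLM training corpus.

```python
# Back-of-envelope only: counting possible biological "words" of length k
# against the ~200 billion words quoted for LLM training corpora.
LLM_WORDS = 200e9  # figure cited above; illustrative, not a precise corpus size

for k in range(5, 13):
    aa_kmers = 20 ** k        # possible amino-acid "words" of length k
    dna_kmers = 4 ** (3 * k)  # DNA sequences (codons) that could spell them
    print(f"length {k:2d}: {aa_kmers:.1e} peptide words, "
          f"{dna_kmers:.1e} codon-level spellings "
          f"({aa_kmers / LLM_WORDS:.1e}x the LLM corpus)")
```

By length 10, the peptide vocabulary alone passes ten trillion, roughly the scale the paragraph above is gesturing at.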

[Google DeepMind via Pexels]

“We’re not so much compute limited. We’re data limited,” said Sean McClain, founder and CEO of Absci, on a follow-up call about the general state of the field. Compared to the internet, the publicly available biological data is much more limited.

It’s also biased.

What’s amazing about artificial intelligence is that it can find patterns and correlations we didn’t know to look for. What’s unfortunate about humans is that we tend to collect data in a biased fashion based on patterns and correlations we suspect exist. 

So, of the trillions of possible words in DNA, we humans, with our limited capacity, tend to focus on only the subset of information we expect to be relevant. Take genome-wide association studies (GWAS). A relatively recent feat of data science and biology, these studies scan millions of genetic variants across individuals and identify those that correlate with disease. We’ve learned a lot from them about disease biology, but over roughly 15 years, only a handful of therapeutics have resulted.

The problem is that the datasets feeding GWAS are one-dimensional and limited. By using GWAS as a starting point to identify drug targets, we’re really limiting our search. Biology is not one-to-one, so a single compound may have multiple targets within the same biological pathway, none of which would be individually significant in GWAS. That’s why companies like Empress Therapeutics are using AI to find compounds that are conserved in healthy people and varied in disease instead of starting with GWAS targets.
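To make that point concrete, here is a minimal simulated sketch; it is not Empress Therapeutics’ (or anyone’s) actual pipeline, and the sample sizes and effect sizes are invented. Ten variants each nudge disease risk through one shared pathway, so per-variant tests typically fail a multiple-testing threshold while a simple pathway-level aggregate does not.

```python
# Toy simulation: single-variant GWAS-style tests vs. a pathway "burden" test.
# All numbers are invented for illustration.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_people, n_variants = 5_000, 1_000
pathway = np.arange(10)  # ten variants acting through one shared pathway

# Genotypes: 0/1/2 minor-allele counts at common variants.
geno = rng.binomial(2, 0.3, size=(n_people, n_variants))

# Disease risk depends on the *sum* of small effects across the pathway,
# so no single variant carries much signal on its own.
burden = geno[:, pathway].sum(axis=1)
logit = 0.12 * (burden - burden.mean()) - 1.5
disease = rng.binomial(1, 1 / (1 + np.exp(-logit)))

# Single-variant association tests (correlation p-values as a simple proxy).
single_p = np.array(
    [stats.pearsonr(geno[:, j], disease)[1] for j in range(n_variants)]
)
bonferroni = 0.05 / n_variants
print("variants passing correction:", int((single_p < bonferroni).sum()))
print("best single-variant p in pathway:", single_p[pathway].min())

# Pathway-level test: aggregate the same ten variants and test once.
print("pathway burden p-value:", stats.pearsonr(burden, disease)[1])
```

Typically the burden p-value comes out many orders of magnitude smaller than any single-variant p-value, which is the sense in which a real but distributed signal can be invisible to a one-variant-at-a-time search.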

[Google DeepMind via Pexels]

“We as humans have hypotheses on what targets we should go after, what epitopes we should be looking at,” explained McClain. “Where we want to go is we want to be able to have AI tell us what target we should be going after for this particular disease and for this particular patient population.”

We already know that immuno-oncology is complicated, and often combination therapies prove to be more effective, but the exact combination is a bit of a black box. The holy grail of therapies will likely involve multiplexing multiple CARs, switches with multiple triggers to really differentiate cancer from self, and a combination of PROTACs, bifunctional small molecules, and bispecific and trispecific antibodies. This is clearly a problem for AI, but right now, there’s not enough multimodal data for it to learn from.
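Some rough, made-up arithmetic on why that combination space is a black box: even a modest toolbox of engineered parts multiplies out to millions of candidate regimens, far more than could ever be tested directly. The part counts below are assumptions for illustration only.

```python
# Illustrative combinatorics only; the part counts are made up.
from math import comb

cars, switch_logics, other_modalities = 30, 10, 20  # hypothetical inventory
car_pairs = comb(cars, 2)                     # multiplexed CAR pairs
modality_triples = comb(other_modalities, 3)  # e.g., PROTAC + bispecific + small molecule
total = car_pairs * switch_logics * modality_triples
print(f"{total:,} candidate combinations from a small parts list")
```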

Data as a Limiting Reagent

A resounding theme of the AI talks at SynBioBeta was the need for richer, more complex datasets to feed AI models, and for the high-throughput technologies to build them.

Fortunately, this is a problem that synthetic biology is uniquely positioned to solve. Synthetic biologists have spent years developing technologies to generate and test massive CRISPR and protein libraries in a variety of organisms as well as in vitro. High-throughput engineering platforms like Inscripta’s Onyx have, in some sense, been a hammer without a nail.

[Google DeepMind via Pexels]

Generally, the field has been focused on iterating toward a target goal (an optimal strain, enzyme, etc.). If we repurpose these tools with the goal of generating information instead of a particular result, we can massively increase the amount of data generated.
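One purely illustrative way to aim for information instead of a particular result: rather than assaying only the variants a model already scores highest, pick a panel that covers sequence space. The sketch below uses a toy random peptide library and greedy max-min Hamming-distance selection; the library, sequence length, and panel size are all assumptions, not any platform’s actual method.

```python
# Toy sketch: choose an information-rich (diverse) panel to assay,
# rather than a redundant set of near-identical "best guesses".
import random

AA = "ACDEFGHIKLMNPQRSTVWY"

def hamming(a: str, b: str) -> int:
    return sum(x != y for x, y in zip(a, b))

def diverse_panel(candidates: list[str], k: int) -> list[str]:
    """Greedy farthest-point selection: each pick maximizes its distance
    to the closest sequence already in the panel."""
    panel = [candidates[0]]
    while len(panel) < k:
        nxt = max(candidates,
                  key=lambda s: min(hamming(s, p) for p in panel))
        panel.append(nxt)
    return panel

random.seed(0)
library = ["".join(random.choices(AA, k=12)) for _ in range(500)]
for seq in diverse_panel(library, k=8):
    print(seq)
```

The same greedy selection (or a model-uncertainty version of it) is one way a high-throughput platform can be pointed at filling gaps in a training set rather than only confirming what a model already believes.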

Take Insamo, a startup using synthetic biology to probe the massive drug-like chemical space. Instead of chemically synthesizing a small number of target cyclic peptides, they radically re-engineered codon tables in yeast so that cyclic peptides can be biomanufactured to the tune of 10¹³ per day.

LabGenius is applying AI not only to the “learn” phase of the design, build, test, and learn cycle but also to the design and build phases. Their platform can test the functional performance of >28,000 novel antibody designs in under twelve weeks.

Absci is also generating its own data with a scalable platform that takes them from AI-designed antibodies to wet lab-validated candidates in as little as six weeks. They’re poised to get the first antibody-based therapy designed with generative AI into the clinic in just 24 months (versus the industry-standard 5.5 years).

These are just a few examples of the many platforms out there capable of generating banquets of multimodal data for machine learning models to feed on. Data is now a commodity. And, as the market for learning from that data becomes ever more crowded with software engineers from every industry, synthetic biologists have a unique opportunity to solve the “small data problem.”
