Image source: qiye on Pixabay

Why the future of protein design rests on cloud computing and machine learning

Imagine the most beautiful string of pearls you’ve ever seen. Long and elegant, it is comprised of pearls painted with 20 different colors, each with its own unique characteristics — some metallic, some shiny, some pearlescent. There is no other string of pearls quite like this one — change the order of the colored beads, and the entire string changes. It was created specifically for the person to whom it belongs.

Now imagine that the string of pearls isn’t for wearing on your body. In fact, it can’t even be seen with the naked eye. Proteins are the pearl strings that make life possible. They are composed of unique combinations of 20 different amino acids strung together. The amino acids have unique characteristics (some water loving, some water hating, some acidic), and their specific order dictates the function of the protein in the human body, or in the bacterium, plant, or other organism from which the protein comes. The potential combinations, and therefore functions, are virtually limitless. Because of this, proteins are a critical tool in the synthetic biology toolkit.

The protein folding problem

The synthetic biologist can leverage the power of proteins in one of two ways: optimize and build upon a protein that already exists in nature, or create a protein de novo to perform a completely new function not observed in nature. Whichever approach is taken, the same obstacle stands in the way: the protein folding problem.

At the root of the protein folding problem is the characteristic that makes proteins so versatile: the countless amino acid combinations that can make up a protein. And proteins aren’t simply strings of amino acids. No, this biological string of pearls has a complex 3D structure, made up of alpha helices and beta sheets, and some proteins have several subunits. All of this is determined by the unique properties of each amino acid and by how the amino acids interact with one another in their specific sequence.

An average protein is about 300 amino acids long — it doesn’t take a mathematician to figure out how difficult it would be for a person to start with a 1D sequence of 300 amino acids and predict how that sequence will self-organize into a functional 3D structure in the cell.
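To put a rough number on the "countless" combinations, the short Python sketch below counts how many distinct sequences of 300 amino acids are possible when each position can hold any of the 20 amino acids. It counts sequences only, not the ways a given sequence might fold, and it treats the 300-residue figure quoted above as the chain length.

```python
import math

NUM_AMINO_ACIDS = 20   # size of the standard amino acid alphabet
SEQUENCE_LENGTH = 300  # the "average" protein length quoted in the text

# Python integers have arbitrary precision, so this value is exact.
num_sequences = NUM_AMINO_ACIDS ** SEQUENCE_LENGTH

# Express the size two ways: number of decimal digits and power of ten.
num_digits = len(str(num_sequences))
log10_sequences = SEQUENCE_LENGTH * math.log10(NUM_AMINO_ACIDS)

print(f"20^300 has {num_digits} digits (about 10^{log10_sequences:.1f})")
```

The result is a number with 391 digits, roughly 10^390, which is why exhaustively testing every possible sequence is not an option.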


Examples of protein structures from the Protein Data Bank (PDB)

The protein design problem

Synthetic biologists creating proteins de novo for complex, elegant new functions also face the protein folding problem in reverse: the protein design problem. Rather than starting with a string of amino acids and predicting 3D structure from that, synthetic biologists creating de novo proteins usually start with a model of the folded protein that they desire, shaped to perform the specific function they want. They then have to work backwards, identifying just the right sequence of amino acids to form a functional protein. Making the problem even more difficult, the perfect sequence probably has never existed in nature, meaning they truly are starting with a clean slate.
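The idea of working backwards from a desired shape to a sequence can be illustrated with a deliberately simplified Python sketch. Everything in it is an assumption made for illustration: the target is reduced to a hypothetical pattern of buried ("B") and exposed ("E") positions, the score simply rewards water-hating residues at buried positions and water-loving ones at exposed positions, and the search is a basic random-mutation hill climb. Real design software relies on far richer physical models, and this is not a description of any particular company's method.

```python
import random

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids, one-letter codes
HYDROPHOBIC = set("AVILMFWC")      # one common "water hating" grouping; classifications vary

def score(sequence, target_pattern):
    """Count positions whose residue type suits the target environment.

    'B' (buried) wants a hydrophobic residue, 'E' (exposed) wants a polar one.
    This is a toy stand-in for a physics-based energy model.
    """
    matches = 0
    for residue, environment in zip(sequence, target_pattern):
        is_hydrophobic = residue in HYDROPHOBIC
        if (environment == "B") == is_hydrophobic:
            matches += 1
    return matches

def design(target_pattern, steps=2000, seed=0):
    """Hill-climb from a random sequence, keeping mutations that do not lower the score."""
    rng = random.Random(seed)
    current = [rng.choice(ALPHABET) for _ in target_pattern]
    best = score(current, target_pattern)
    for _ in range(steps):
        position = rng.randrange(len(current))
        previous = current[position]
        current[position] = rng.choice(ALPHABET)
        candidate = score(current, target_pattern)
        if candidate >= best:
            best = candidate               # accept the mutation
        else:
            current[position] = previous   # revert the mutation
    return "".join(current), best

target = "BEEBBEEBEEBBEEBE"  # hypothetical buried/exposed pattern
sequence, matched = design(target)
print(sequence, f"{matched}/{len(target)} positions match")
```

Even in this toy setting many different sequences satisfy the same pattern, which hints at why choosing "just the right" one requires a much more detailed model.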

Fortunately, both the protein folding and protein design problems can be tackled with a single element: a good understanding of the physics that go into amino acid interactions to make a model that can be used to predict whether a particular sequence will have the desired function. But human beings can’t make these models with pen and paper. Instead, they must use the power of computers. Such industrialized computational protein design is exactly the approach used by Seattle-based synthetic biology company Arzeda Corporation.

Cloud computing: the major protein design reagent

Arzeda’s approach to providing new products and improving existing products for their partners and customers is simple in concept: take a desired function, model it computationally, identify sequences likely to create proteins with the desired function, build and express the gene candidates, and voilà! New or improved protein.

Because of the protein folding and protein design problems, a lot of computing is needed up front to successfully design proteins at the scale Arzeda does on a daily basis, and it requires a very important reagent: cloud computing. According to Alexandre Zanghellini, CEO and co-founder of Arzeda, the company spent upwards of US$150,000 on cloud computing in 2018, and is on track to spend half a million in 2019. This is because performing the first step of their process, computational protein modeling, in a timely fashion requires thousands of computers. It’s the equivalent of transforming an artisanal process into a highly refined industrial one.


“We use multiple cloud providers, and we have developed specific software tools to be able to use that large amount of CPU and distribute the work that is needed for protein design in a fully automated manner, so that compared to what you typically see in the academic world, we’re able to do that at such a scale, and in such an automated manner, that we can take the human out of the equation,” says Zanghellini.
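As a very rough picture of what distributing design work automatically can look like, the Python sketch below splits a list of candidate sequences into batches and hands each batch to a worker. It is an assumption-laden illustration, not Arzeda's software: a local thread pool stands in for a fleet of cloud machines, the function names are hypothetical, and the "score" (the fraction of alanine residues) is a placeholder for an expensive modeling calculation.

```python
from concurrent.futures import ThreadPoolExecutor

def chunk(items, size):
    """Split a list into consecutive batches of at most `size` items."""
    return [items[i:i + size] for i in range(0, len(items), size)]

def score_batch(batch):
    """Hypothetical per-worker job: score every candidate in one batch.

    The score is just the fraction of alanine ('A') residues, a placeholder
    for a real, computationally expensive calculation.
    """
    return [(seq, seq.count("A") / len(seq)) for seq in batch]

def score_all(candidates, batch_size=2, workers=4):
    """Score all candidates by farming batches out to a pool of workers."""
    batches = chunk(candidates, batch_size)
    # A local thread pool stands in for many remote machines.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(score_batch, batches))  # map preserves batch order
    # Flatten the per-batch results back into one list.
    return [pair for batch in results for pair in batch]

candidates = ["AAGK", "MKTA", "GGGG", "AAAA", "KLAV"]
ranked = sorted(score_all(candidates), key=lambda pair: pair[1], reverse=True)
print(ranked[0])
```

Because each batch is scored independently, the same pattern scales from a handful of local workers to thousands of machines without a person having to hand out the work.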

Making sense of complex data with machine learning

Protein design is a very complex problem with many degrees of freedom and many interacting components, which makes it one of the best examples of a problem whose complexity exceeds what the human mind can handle, says Zanghellini. According to him, machine learning has great potential to find correlations and patterns that can be used for protein design, far beyond what an engineer or a computer scientist could come up with.

Pointing to AlphaFold, a DeepMind Technologies project that demonstrated the power of applying AI and deep learning to the protein folding problem, Zanghellini says, “I’m bullish that this is going to be a great development in the field, and we [Arzeda] hope to be at the forefront of that.”



Looking even further into the future, Zanghellini sees technologies like improved graphics processing units (GPUs), field-programmable gate array (FPGA) chips, and quantum computing drastically speeding up protein design. But at the crux of it all, he says, may be a technology we’re all intimately familiar with: DNA synthesis and the availability of faster, cheaper, longer fragments. It is so important, he says, that if Arzeda had hit the market 10 years ago, the company probably wouldn’t have survived due to the cost of DNA synthesis back then. After all, all of the compute power in the world won’t ensure that your protein behaves the way the computer predicted it would. Rapid in vitro protein function tests are the final critical piece in successful protein design.

Zanghellini sums it up this way: “It all boils down to how fast you can do your measurement of whether your protein works or not. Companies … are working on these things, [which] immediately translates to an order of magnitude more samples being tested, which in turn means more machine learning and [method] improvements.”

Zanghellini recently spoke with SynBioBeta’s John Cumbers on these topics and more on The SynBioBeta Podcast, conversations with synthetic biology’s leading thinkers on building a better world with biology. The podcast will be launching in a few weeks. To hear everything Zanghellini had to say about protein design, be sure to sign up for our news digest and add this episode to your playlist.

Embriette Hyde


A trained microbiologist with over 7 years of experience in microbiome research, Dr. Embriette Hyde is passionate about bringing science to the public. She is currently working as Managing Editor at SynBioBeta and mentors K-12 students through Schmahl Science Workshops, fostering passion for science from a young age.
