This article is brought to you by our sponsor, Riffyn, which is helping scientists use their data to identify hidden cause-and-effect relationships, resulting in increased reproducibility and faster scientific discoveries.
By current estimates, data scientists spend only 20% of their time analyzing the data in front of them. The rest of their time is spent on data cleanup, sorting through the noise in an attempt to find the gold. This problem is a direct result of the sheer amount of data being produced today, which is itself a byproduct of the increased parallelization in laboratory experiments, more sophisticated instrumentation, and the digital revolution that began in the nineties.
We are drowning in a sea of information, and acquiring useful data is challenging and time-consuming. This raises the question: how many discoveries have we missed? What if we already hold the answers to some of the most interesting research questions in life sciences, but we just can’t find them under piles and piles of data?
Rethinking data: Rethink your experiment
As the rate of data production soared, we scrambled to find an immediate solution to the major problem of the time: where were we going to store all of that data? Add to that a new, equally difficult challenge: how do we understand the data that we have access to and find meaningful correlations that can be translated into solutions to real-world problems?
This is an especially difficult challenge, as the complexity of the biological data produced today far outweighs that of the data produced even two or three decades ago. All too often, data are hard to find, are poorly annotated, and lack context. Researchers are often unsure which variables matter, making it difficult to integrate their data with other data or to find relevant patterns. Data fragmentation and lack of reproducibility are serious issues, with an estimated $30B of R&D dollars in the United States lost each year on unrepeatable scientific research. To solve this problem, we need to rethink data: not just how to store and analyze it, but also how to produce it. And this rethinking process begins with rethinking the scientific experiment.
Timothy Gardner of Riffyn wrote last year that “the experiment, which of course is the fundamental unit of science, is generally misunderstood and misused.” He explained that most scientists view experiments solely as a means for finding answers to a particular scientific question. But they are much more than that. They can be viewed as a kind of instrument (just like a physical instrument) for gathering high-quality, reproducible data, regardless of their positive or negative implications for a hypothesis. Then each experiment becomes an incremental contribution to an aggregate data set that is more trustworthy, easier to analyze, and more likely to reveal meaningful discoveries. Or, as Gardner puts it, if we begin thinking about each of our experiments as a “thing” that can be seen, built upon, and improved, then we can begin building them into a “supply chain of scientific methods and experimental data whose final product is knowledge of unassailable quality.”
Rethinking the scientific experiment in this way leads to an important outcome: an experiment that is designed better from the start yields high-quality data that directly address the researchers’ question. Now, 80% of the data analyst’s time can be spent analyzing data, rather than trying to make sense of it and eliminate useless noise.
Riffyn is enabling scientists to reach this point using its cloud-based Scientific Development Environment (SDE). The platform supports researchers from experimental design to machine-learning-aided analysis. A key design paradigm of the SDE is that experimental procedures and parameters are transformed into the visible, improvable things Gardner wrote about. The SDE encourages scientists to design experiments with final analyses in mind, thus helping them to capture high-quality, usable data that are ready for analysis the moment they are collected. The SDE also helps scientists identify noise and errors in their data sets, allowing them to reject artifactual results and identify true associations and correlations. And a simple sharing interface makes collaboration easy: no more disjointed data sets that lack context and are impossible to combine with other data sets. Inaccessible methods, data fragmentation, experimental error, and missed discoveries become things of the past.
Reshaping scientific enterprise
A key differentiator in industry is a faster, cheaper path to market. Systematic data acquisition and analysis, like that facilitated by Riffyn, enable companies to adopt a more directed R&D approach and turn R&D into a more agile decision-making process. This leads to better resource utilization and cost reduction and, most importantly, more rapid market entry.
One of the main industries harvesting the fruit of structured data science is the pharmaceutical sector, with several companies accelerating drug discovery by using artificial intelligence to predict molecule behavior and drug target suitability. ATUM (formerly DNA2.0) uses machine learning to aid their protein engineering and expression platforms, which allows them to screen thousands of candidate genes and molecules faster than ever. Others are leveraging the power of automation in their processes as well. For example, Zymergen couples automation with machine learning and data analytics to identify chemical building blocks for the materials and products they develop.
As long as technology continues to grow, data production will grow along with it. Science will become increasingly digitized, as it must if we are to use science to effect change in the modern world. If we are to truly reshape the scientific enterprise, we must also start rethinking the scientific experiment. That will allow us to fully harness the power of data to enter a new realm of scientific productivity and impact.