When data science goes with the flow: insitro introduces redun

insitro
Nov 4, 2021


Today, insitro is open sourcing a new data science tool called redun, purpose-built for complex, rapidly evolving scientific workflows that span multiple data types. Redun is an expressive, efficient, and easy-to-use workflow framework built on top of the popular Python programming language. It offers the abstractions we have come to rely on in most modern high-level languages (control flow, composability, recursion, higher-order functions, etc.), along with automatic parallelization, caching, and (critically) data provenance tracking.
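To make this concrete, here is a minimal sketch of a redun workflow based on the project's public API; the task names and arguments are illustrative:

```python
from redun import task

redun_namespace = "example"

@task()
def add(a: int, b: int) -> int:
    return a + b

@task()
def main(x: int = 1, y: int = 2) -> int:
    # Calling a task returns a lazy expression rather than a value.
    # The scheduler evaluates the resulting expression graph,
    # caching results and recording provenance along the way.
    return add(add(x, y), y)
```

Saved as, say, `workflow.py`, this can be run with `redun run workflow.py main --x 3 --y 4`; re-running it reuses cached results for any calls whose code and inputs have not changed.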

Our work on redun began two years ago, when our data engineering team began assessing the many available approaches to data science workflow development and execution. Our work at insitro involves a large amount of data across a broad range of data modalities, some generated through our in-house high-throughput biology and chemistry efforts, and some obtained from external sources (public and proprietary). As a consequence, we found ourselves suffering from an explosion of tools and practices, along with a corresponding increase in development and maintenance complexity. We were using best-practice tools, each well designed for its own domain, but our complexity grew from the challenge of composing them into larger systems that could span our company (LIMS to data warehouse to bioinformatics to ML). As our work spanned these systems, we also found ourselves losing features essential for fast-moving scientific work, such as generalized data access (e.g., file-based bioinformatics vs. API- and database-intensive data engineering), incremental and heterogeneous compute (e.g., batch, Spark, multiprocessing, GPU), and unified data provenance and sharing.

Understanding the root cause of these issues led to the development of redun, which takes the somewhat contrarian view that languages designed expressly for dataflows are unnecessarily restrictive and lose critical abstractions that are key to modern high-level languages. Instead, redun expresses workflows as lazy expressions, which are then evaluated by a scheduler that performs automatic parallelization, caching, and data provenance logging. This insight, and the specific features that follow from it (inspired by concepts from multiple domains), is critical for composing large hierarchical workflows (pipelines of pipelines) that span multiple varied systems, from our lab to our data science and machine learning teams.
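Because workflows are plain Python, ordinary control flow composes naturally with lazy evaluation. The sketch below (task names are again illustrative) fans a task out over a list; each call yields an independent expression, so the scheduler is free to run them in parallel:

```python
from redun import task

redun_namespace = "example"

@task()
def align(sample: str) -> str:
    # Placeholder for real per-sample work (e.g., sequence alignment).
    return f"aligned:{sample}"

@task()
def summarize(results: list) -> str:
    return ",".join(results)

@task()
def main(samples: list = ["s1", "s2", "s3"]) -> str:
    # A plain list comprehension builds a list of lazy expressions.
    # redun resolves expressions nested inside lists, so summarize()
    # runs once all of the parallel align() calls have completed.
    return summarize([align(s) for s in samples])
```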

Figure 1. Data science workflow’s missing half. In their work, data scientists produce two main deliverables: code and data. In producing and sharing code, data scientists have tools, such as git, to record the history of their code changes (i.e., code provenance) and ultimately share it with others. In contrast, the workflow for recording and sharing data provenance is less standardized and causes significant friction in reproducing and sharing scientific results.

Most critically, we have found that redun allows us to complete what is often the "missing half" of data science tooling. While the past decade of software engineering progress has produced outstanding tools for tracking code provenance (i.e., source control), there is little comparably consistent tracking for data provenance. Yet we know from our work that data is created, updated, and processed constantly, as we run experiments, perform QC, improve our data processing pipelines, and so on. Few tools today let us trace the exact experiments and processing steps that produced a given data file. This gap severely limits the reproducibility of data-centric work, in science and more broadly.

Figure 2. Completing the data science workflow. With redun, we are exploring how to provide tooling for recording and sharing data provenance that is just as powerful as code provenance tooling. Specifically, we have found that defining a portable data structure, the call graph, can enable the same local recording and syncing capabilities. As call graphs accumulate, one can trace the computational lineage of any file, or compare the differences between executions (“It worked before, what changed?”).
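In practice, provenance is queried through redun's command-line interface. The session below is a hedged sketch of that workflow (the execution ID and file path are illustrative; see `redun log --help` for the exact options):

```bash
# List recent workflow executions recorded in redun's backing database.
redun log

# Show the recorded call graph for a particular execution.
redun log <execution_id>

# Trace the upstream tasks and inputs that produced a given file.
redun log results/summary.csv
```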

We believe we are at the start of a computational wave that will have a transformative impact on our industry. In time, more and more high-quality companies will look to embrace computational approaches on complex multimodal data for drug discovery and development, as well as for other scientific endeavors. We have found redun to be an important enabler of our work, and we hope that making it broadly available can enable others and help drive positive change. We hope to be part of the rising tide of innovation across our community, in support of high-quality, reproducible science, and for the benefit of patients in need.

Our work is available on GitHub as well as on the Python Package Index.


insitro is a data-driven drug discovery and development company using machine learning and data generation at scale to transform the way drugs are discovered.