Today, insitro is open sourcing a new data science tool called redun, purpose-built for complex, rapidly evolving scientific workflows that span multiple data types. Redun is an expressive, efficient, and easy-to-use workflow framework built on top of the popular Python programming language. It offers the abstractions we have come to rely on in modern high-level languages (control flow, composability, recursion, higher-order functions, etc.), along with automatic parallelization, caching, and (critically) data provenance tracking.
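To make this concrete, here is a minimal sketch of what a redun workflow can look like (the task names and sample data are purely illustrative):

```python
from typing import List

from redun import task

redun_namespace = "example"


@task()
def align(sample: str) -> str:
    # Ordinary Python: results are cached, keyed on the task's code and arguments.
    return f"aligned-{sample}"


@task()
def merge(alignments: List[str]) -> str:
    return ",".join(alignments)


@task()
def main(samples: List[str] = ["a", "b", "c"]) -> str:
    # Each call returns a lazy expression rather than running immediately;
    # the scheduler evaluates the expression graph, running the independent
    # align() calls in parallel and reusing cached results where possible.
    return merge([align(sample) for sample in samples])
```

Running `redun run workflow.py main` evaluates the resulting expression graph, parallelizing the independent `align` calls and serving unchanged results from the cache on subsequent runs.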
Our work on redun began two years ago, when our data engineering team set out to assess the many available approaches to developing and executing data science workflows. Our work at insitro involves large amounts of data across a broad range of data modalities, some generated through our in-house high-throughput biology and chemistry efforts and some obtained from external sources (public and proprietary). As a consequence, we found ourselves suffering from an explosion of tools and practices, along with a corresponding increase in development and maintenance complexity. We were using best-practice tools, each well designed for its own domain, but complexity grew from the challenge of composing them into larger systems that could span our company (LIMS to data warehouse to bioinformatics to ML). As our work spanned these systems, we also found ourselves losing features essential to fast-moving scientific work, such as generalized data access (e.g., file-based bioinformatics vs. API- and database-intensive data engineering), incremental and heterogeneous compute (e.g., batch, Spark, multiprocessing, GPU), and unified data provenance and sharing.
Understanding the root cause of these issues led to the development of redun, which takes the somewhat contrarian view that languages designed expressly for dataflows are unnecessarily restrictive and sacrifice abstractions central to modern high-level languages. Instead, redun expresses workflows as lazy expressions, which are then evaluated by a scheduler that performs automatic parallelization, caching, and data provenance logging. This insight, and the specific features it enables (inspired by concepts from multiple domains), has been critical for composing large hierarchical workflows (pipelines of pipelines) that span the varied systems from our lab to our data science and machine learning teams.
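As a rough sketch of that composability (the executor and task names here are hypothetical and depend on how a deployment is configured), a higher-level pipeline simply calls other pipelines as ordinary functions, while individual steps can target different compute backends:

```python
from typing import List

from redun import task

redun_namespace = "example.pipelines"


@task(executor="batch")
def process_plate(plate_id: str) -> dict:
    # A heavy lab-data processing step, routed to a configured "batch" executor.
    return {"plate": plate_id, "qc": "pass"}


@task()
def lab_pipeline(plate_ids: List[str]) -> List[dict]:
    # A complete sub-pipeline is itself just a task that composes other tasks.
    return [process_plate(plate_id) for plate_id in plate_ids]


@task()
def train_model(plates: List[dict]) -> str:
    # A downstream ML step consuming whatever the upstream pipeline produced.
    return f"model trained on {len(plates)} plates"


@task()
def main(plate_ids: List[str] = ["P1", "P2"]) -> str:
    # A pipeline of pipelines: sub-workflows compose like ordinary function calls.
    return train_model(lab_pipeline(plate_ids))
```

Because a sub-pipeline is just another task, the same hierarchical structure works whether the pieces run locally, on a cluster, or across heterogeneous backends.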
Most critically, we have found that redun lets us complete what is often the “missing half” of data science tooling. While the past decade of software engineering progress has produced outstanding tools for tracking code provenance (i.e., source control), there is little comparable tracking for data provenance. But we know from our work that data gets created, updated, and processed constantly, as we run experiments, perform QC, improve our data processing pipelines, and so on. And yet, very few tools keep track of exactly which experiments and processing steps led to the creation of a given data file. This gap hugely impacts the reproducibility of data-centric work, in science and more broadly.
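As a sketch of how this fits into a workflow (the file format and task below are purely illustrative), tasks can read and write `File` values, and redun records content hashes for them as part of its call graph; an output can then be traced back to the code, arguments, and upstream data that produced it, for example via the `redun log` command-line query.

```python
from redun import task, File


@task()
def summarize(counts: File) -> File:
    # redun records content hashes for input and output Files in its call graph,
    # so summary.txt can later be traced back to the exact task call, arguments,
    # and upstream data that produced it.
    out = File("summary.txt")
    with counts.open() as src, out.open("w") as dst:
        total = sum(int(line.split()[1]) for line in src)
        dst.write(f"total\t{total}\n")
    return out
```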
We believe that we are at the start of a computational wave that will have a transformative impact on our industry, and that, in time, more and more high-quality companies will look to embrace computational approaches on complex multimodal data for drug discovery and development, as well as for other scientific endeavors. We have found redun to be an important enabler of our work, and we hope that making it broadly available can enable others and help drive positive change. We hope to be part of the rising tide of innovation across our community, in support of high-quality, reproducible science, and for the benefit of patients in need.
Our work is available on GitHub as well as on the Python Package Index.