Data science infrastructure for agricultural sustainability and food security

Brad Chapman bio photo By Brad Chapman Comment

I’m excited to be joining Ginkgo Bioworks. Ginkgo and the synthetic biology community have an incredible amount of useful data in intricate experimental designs, measured screening outcomes and pre-existing biological knowledge. I’ll help organize, present, compute on and do science with this data.

I hope to enable downstream applications that improve agricultural sustainability and food security. I’m motivated to help with building an increasingly fair and climate friendly agricultural system due to the dire warnings about the state of our planet. There are many different ways to contribute on mitigating climate change, from personal action to politics to your daily work; see Drawdown, Sliced and Mattermost for some comprehensive lists. Ginkgo and their downstream applications provided a unique opportunity to make use of my programming, genomics and synthetic biology background to work towards healthy and sustainable food production.

The incredibly difficult part of this change is moving away from the amazing community of researchers I’ve had the privilege to work with. Our incredible team at Harvard Chan School. continues to teach me new things every day and makes supporting science a joy. Rewarding collaborations with AstraZeneca, the University of Melbourne Centre for Cancer Research and Veritas Genetics provide a continuous source of scientific challenges alongside the inspiration of seeing smart teams answer difficult questions.

Data science infrastructure in synthetic biology

At Ginkgo my initial focus will be on providing infrastructure and scientific support to enable application specific data science teams. The data challenges at Ginkgo are similar to those faced across multiple companies and large communities. They have large amount of valuable process and analysis data in custom internal formats, such that querying requires knowing multiple data models. Additionally, there is high level biological knowledge about experiments and biological systems that gets captured in ad-hoc ways and requires structure to improve interoperability. The challenge will be to build the smallest and leanest set of processes, tools and interfaces that organize this data to increase scientific productivity.

My rough technical plan is semi-automated data organization. First, use OpenRefine reconciliation services through SciGraph to map unstructured data into ontologies like SBOL. With this in place, store the data as RDF-like triples of entities with structured attributes and values. Ideally this uses the time-based approach within Datomic and models high level entity relationships with Hodur. Finally, expose these heterogeneous backends in a standardized way for external access using a GraphQL/Lacinia or custom Datalog interface. I’m open to any and all suggestions on alternative approaches to explore. Planning and predictions are always perilous in both science and computing.

This work will provide a template and documentation of approaches for data cleaning and organizing, along with practical tools to help with this process. The outcome – modeled and organized data – will not only help scientists work more effectively on Ginkgo’s synthetic biology platform, but also help coordinate with other researchers and the general public. I want to work collaboratively to connect and enable the diverse synthetic biology community through practical data science. I’d love to discuss with anyone working on similar problems.

Future of bcbio

The unfortunate downside of this change is moving away from personalized medicine and cancer variant analysis. I spent a majority of my time over the past eight years at Harvard working with the bcbio community on high throughput sequencing analysis pipelines. bcbio supports a large number of researchers and we’re committed to helping maintain and continue its development. Our team at Harvard Chan has active contributors like Rory Kirchner and Lorena Pantano who plan to continue to develop bcbio. We’re also hiring at the Harvard Chan core and would love to hear from bcbio community developers with an interest in working more directly on it. Similarly, if you depend on bcbio and want to collaborate please also get in touch. I’ll continue to work part time on bcbio over the next few months to improve documentation and stability and make transitions as painless as possible.

I’m so appreciative to the bcbio and open bioinformatics community and all the groups I’ve worked with, and welcome feedback and thoughts.

comments powered by Disqus