Cloud Dataflow for Nanopore DNA Sequencers [Case Study]

Case Study: Real-time Diagnostics from Nanopore DNA Sequencers

Update 28 Aug. 2023

In this issue:

About Nanopore DNA Sequencers
Nanostream Project Tasks & Challenges
Data Scalability
Data Processing and Analysis Algorithms
DNA Analysis Tools
Reducing Overhead
Developing a User Interface
Project’s Tech Stack & Team

Data analysis in healthcare can be a matter of life and death, and it is very time-consuming without the proper tools. Take sepsis, the dangerous condition in which the body starts to attack its own organs and tissues in an attempt to fight off bacteria or other causes: the risk of losing the patient increases by 4% with each hour.

Researchers from the University of Queensland and Google Cloud Platform developers teamed up with the APP Solutions developers to give medical doctors a tool for helping patients before they go into septic shock.

With the emergence of nanopore DNA sequencers, this task becomes manageable and much more efficient. These sequencers stream raw data and generate results within 24 hours, which is a significant advantage, especially when doctors need to identify pathogenic species and antibiotic resistance profiles.

From the technical point of view, the primary challenge is data processing: handling the incoming data and then storing it requires significant resources. The APP Solutions team developed a cloud-based solution to address this challenge.

About the Project: Nanopore DNA Sequencers

Our team worked on a cloud-based solution for nanopore DNA sequencing. We developed a Cloud Dataflow pipeline integrated with the following technologies:

FastQ Record Aligner
JAPSA Summarizer
Cloud Datastore and App Engine
App Dashboard

The pipeline itself consists of the following elements:

Chiron base caller implemented as a deep neural network
Detectors for species and antibiotic resistance genes
Databases for long-term experimental data storage and post-hoc analysis
A browser-based dynamic dashboard to visualize analysis results as they are generated

Overall, the system is designed to perform the following actions:

Resistance Gene Detection: this pipeline identifies antibiotic resistance genes present in a sample and points out actionable insights, e.g., what treatment regimen to apply to a particular patient.
Species Proportion Estimation: this pipeline estimates the proportion of pathogenic species present in a sample. Proportion estimation can be useful in a variety of applications including clinical diagnostics, biosecurity, and logistics/supply-chain auditing.
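The two outputs described above can be illustrated with a minimal sketch. It assumes that an upstream step has already assigned each read a species label and a list of matched resistance genes. The function names, the `min_reads` threshold, and the sample data are hypothetical and are not taken from the project's code.

```python
from collections import Counter


def species_proportions(read_assignments):
    """Return {species: fraction of reads} for an iterable of species labels."""
    counts = Counter(read_assignments)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {species: n / total for species, n in counts.items()}


def detect_resistance_genes(read_gene_hits, min_reads=2):
    """Return the sorted gene names supported by at least `min_reads` reads.

    `read_gene_hits` holds one list of gene names per read. The threshold is
    an illustrative assumption that filters out genes seen in a single read.
    """
    # set(hits) makes sure one read counts at most once per gene.
    support = Counter(gene for hits in read_gene_hits for gene in set(hits))
    return sorted(gene for gene, n in support.items() if n >= min_reads)


# Illustrative input only; not real sequencing results.
labels = ["Escherichia coli"] * 3 + ["Klebsiella pneumoniae"]
gene_hits = [["blaTEM"], ["blaTEM", "tetA"], []]

proportions = species_proportions(labels)
genes = detect_resistance_genes(gene_hits)
```

Counting each gene at most once per read keeps a single noisy read from inflating the support for a gene.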

The software is open source and built on the following open-source packages:

JAPSA
TensorFlow
Apache Beam
D3

We implemented the data analysis application on Google Cloud because of its scaling capacity, reliability, and cost-effectiveness. It also offers scalable access to AI accelerators such as Tensor Processing Units (TPUs).

The transformation of information follows this sequence:

Integration – files are uploaded to the Google Cloud Platform and streamed into the processing pipeline;
Base-calling stage – a machine learning model infers DNA sequences from electrical signals;
Alignment stage – the sequences are compared against a DNA database to find pathogen sequences and other anomalies;
Summarization stage – each pathogen’s percentage in the sample is calculated;
Storage and visualization – the results are saved to Google Firestore DB and subsequently visualized in real time with D3.js.
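The sequence above can be sketched as a chain of small functions that emits an updated summary after every read. This is a toy illustration in plain Python, not the project's Dataflow code: `base_call` and `align` are deliberately trivial stand-ins for the neural base caller and the aligner, and the reference sequences and signal values are made up.

```python
from collections import Counter
from typing import Dict, Iterable, Iterator, List

BASES = "ACGT"


def base_call(signal: List[int]) -> str:
    """Stand-in for a neural base caller: maps each integer level to a base."""
    return "".join(BASES[level % 4] for level in signal)


def align(read: str, references: Dict[str, str]) -> str:
    """Stand-in for an aligner: names the first reference containing the read."""
    for name, sequence in references.items():
        if read in sequence:
            return name
    return "unclassified"


def stream_summaries(
    signals: Iterable[List[int]], references: Dict[str, str]
) -> Iterator[Dict[str, float]]:
    """Yield the running percentage of each label after every processed read."""
    counts: Counter = Counter()
    for signal in signals:
        counts[align(base_call(signal), references)] += 1
        total = sum(counts.values())
        yield {name: 100.0 * n / total for name, n in counts.items()}


# Hypothetical references and signals, for illustration only.
references = {"species_a": "ACGTACGT", "species_b": "TTGGCCAA"}
signals = [[0, 1, 2, 3], [1, 2, 3, 0], [3, 3, 2, 2], [1, 1, 1, 1]]

snapshots = list(stream_summaries(signals, references))
```

Each snapshot is the kind of record a dashboard could redraw from, so the picture sharpens as more reads arrive instead of waiting for the whole run to finish.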

Nanostream Project Tasks & Challenges

Ensuring Data Scalability

Nanopore sequencer DNA analysis is a resource-demanding procedure, and it has to be fast and efficient to be genuinely useful.

Because of the high data volume and tight time constraints, the system needs to scale with the workload. We achieved this with the Google Cloud Platform and its autoscaling features, which provide smooth and reliable scaling for data processing operations.

To keep the data processing workflow uninterrupted no matter the workload, we used Apache Beam.

Refining Data Processing and Analysis Algorithms

Accuracy is the central requirement for the data processing operation in genomics, especially in the context of DNA Analysis and pathogen detection.

The project required a fine-tuned, tight-knit data processing operation with an emphasis on providing a broad scope of results in minimal time.

Our task was to connect the analytics application to the cloud platform and ensure a fast, reliable data turnaround. The system was thoroughly tested to verify both the accuracy of the results and the efficiency of the processing.

Integrating with DNA Analysis Tools

DNA analysis tools for nanopore sequencers were not initially developed for cloud platforms and distributed services. Most of them were desktop utilities, which significantly limited their capability. We needed to integrate these desktop-based tools into a unified, scalable system.

We adapted the desktop-based DNA analysis tools to work over HTTP and deployed them as web services, which made them capable of processing large quantities of data in a shorter timespan.
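One common way to expose a command-line utility over HTTP is a thin wrapper that pipes the request body to the tool and returns its output. The sketch below uses only the Python standard library and is an assumption about the general approach, not the project's actual service code. `TOOL_COMMAND` is a hypothetical stand-in that simply prints the length of its input; a real deployment would invoke the actual analysis tool there.

```python
import subprocess
import sys
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical stand-in for a desktop analysis tool: it reads text on stdin
# and prints the length of that text.
TOOL_COMMAND = [
    sys.executable,
    "-c",
    "import sys; print(len(sys.stdin.read().strip()))",
]


def run_tool(payload: bytes) -> bytes:
    """Pipe the payload to the tool's stdin and return what it wrote to stdout."""
    result = subprocess.run(
        TOOL_COMMAND, input=payload, capture_output=True, check=True, timeout=30
    )
    return result.stdout


class ToolHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        # Read exactly the number of bytes announced by the client.
        length = int(self.headers.get("Content-Length", 0))
        payload = self.rfile.read(length)
        try:
            body, status = run_tool(payload), 200
        except (subprocess.SubprocessError, OSError) as exc:
            body, status = str(exc).encode(), 500
        self.send_response(status)
        self.send_header("Content-Type", "text/plain")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the example quiet; a real service would log requests


def make_server(port: int = 0) -> HTTPServer:
    """Create a server bound to localhost; port 0 lets the OS pick a free port."""
    return HTTPServer(("127.0.0.1", port), ToolHandler)
```

Calling `make_server(8080).serve_forever()` would start the wrapper locally. Because each request is independent, several copies of such a service can run side by side behind a load balancer, which is what makes a desktop tool usable in a scalable system.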

Securing Cost-Effectiveness & Reducing Overhead

Nanopore DNA sequencers are a viable solution for swift pathogen analysis and better-informed medical treatment. However, maintaining such devices can be challenging for medical facilities because of the resources and personnel required. Also, the scope of their use is relatively limited in comparison with the required expenditures.

We moved the entire system to Google Cloud Platform to solve this issue, allowing the service to be accessed and scaled without unnecessary overhead expenses.

Developing Accessible User Interface

Machine learning and big data analysis systems can process a lot of data, but the output is useless until the insights are presented in an understandable way. In the case of the Nanopore DNA Sequencing solution, the idea was to give medical staff a tool that would help them make decisions in critical situations and save lives. Therefore, an accessible presentation was one of the essential elements of this research project.

The system needed an easy-to-follow and straightforward interface that provided all the required data in a digestible form, avoiding confusion.

To create the most convenient user interface design, we conducted extensive user testing. The resulting user interface is an interactive dashboard with multiple types of visualization and reporting at hand, and it takes minimal effort to get accustomed to and start using.

When it came to visualization, the initial format of choice was a pie chart. However, it proved insufficient in more complex scenarios.

Because of that, we concluded that we needed to expand the set of visualizations and add a couple of new options, and this is where the D3 data visualization library helped us out.

Through extensive testing, we found that sunburst diagrams do an excellent job of showing the elements of the sample in an accessible form.
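A sunburst diagram needs hierarchical input, while per-species results are usually flat. The sketch below shows one way to fold read counts keyed by taxonomic lineage into the nested `name` / `children` / `value` structure that D3's `d3.hierarchy` reads by default. The function names and the sample lineages are illustrative assumptions, not the dashboard's actual code.

```python
def build_hierarchy(lineage_counts, root_name="sample"):
    """Fold {lineage tuple: read count} into a nested name/children/value tree."""
    root = {"name": root_name, "children": []}
    for lineage, count in lineage_counts.items():
        node = root
        for level in lineage:
            children = node.setdefault("children", [])
            # Reuse the child for this taxonomic level if it already exists.
            child = next((c for c in children if c["name"] == level), None)
            if child is None:
                child = {"name": level}
                children.append(child)
            node = child
        # The last level of the lineage carries the read count.
        node["value"] = node.get("value", 0) + count
    return root


def total_value(node):
    """Sum the values of a node and all of its descendants."""
    return node.get("value", 0) + sum(
        total_value(child) for child in node.get("children", [])
    )


# Illustrative counts only; not real sequencing results.
lineage_counts = {
    ("Bacteria", "Proteobacteria", "Escherichia coli"): 30,
    ("Bacteria", "Proteobacteria", "Klebsiella pneumoniae"): 10,
    ("Bacteria", "Firmicutes", "Staphylococcus aureus"): 5,
}

tree = build_hierarchy(lineage_counts)
```

Serialized with `json.dumps(tree)`, a structure like this can be passed to `d3.hierarchy` and sized with `.sum(d => d.value)`, so each ring of the sunburst shows one taxonomic level. A pie chart can only show one of those levels at a time.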

Project’s Tech Stack & Team

There were many technologies involved, the majority of which had to do with big data analysis and cloud:

JAPSA
TensorFlow
Chiron Base Caller
Google Cloud
Google Cloud Storage
Google Cloud Pub/Sub
Google Firestore
Google Cloud Dataflow
Apache Beam
D3 Data Visualization Library
JavaScript

From the APP Solutions’ side, we had four people working on this Nanopore DNA Sequencers project:

2 Data Engineers
1 DevOps Engineer
1 Web Developer

Creating Nanopore DNA Sequencing Cloud-Based Solutions

This project was an incredible experience for our team. We had a chance to dive deep into the healthcare industry as well as machine learning, data analysis, and Google Cloud Platform capabilities.

While exploring the possibilities of data analysis in healthcare applications, we found many parallels with data analysis in other fields.

We have managed to apply our knowledge of cloud infrastructure and build a system that is capable of processing large quantities of data in a relatively short time – and help doctors save patients’ lives!

Learn more about the project and check out our contributions to GitHub:

Denys

Administrator
