Case Study: Real-time Diagnostics from Nanopore DNA Sequencers

Data analysis in healthcare can be a matter of life and death, and without the proper tools it is also a very time-consuming task. Take sepsis, the dangerous condition in which the body starts to attack its own organs and tissues while trying to fight off bacteria or other causes: the risk of losing the patient increases by 4% with each hour.

The researchers from the University of Queensland and the Google Cloud Platform developers have teamed up with the APP Solutions developers to provide medical doctors with a tool to help patients before they suffer from septic shock.

With the emergence of nanopore DNA sequencers, this task becomes manageable and much more efficient. These sequencers stream raw data and generate results within 24 hours, which is a significant advantage, especially when doctors need to identify pathogenic species and antibiotic resistance profiles.

From the technical point of view, the primary challenge is data processing: handling and then storing the incoming data requires significant resources. The APP Solutions team developed a cloud-based solution to address this challenge.

About the Project: Nanopore DNA Sequencers

Our team worked on a cloud-based solution for nanopore DNA sequencing and developed a Cloud Dataflow pipeline integrated with the following technologies:

  • FastQ Record Aligner
  • JAPSA Summarizer
  • Cloud Datastore and App Engine
  • App Dashboard

The pipeline itself consists of the following elements:

  • Chiron base caller implemented as a deep neural network
  • Detectors for species and antibiotic resistance genes
  • Databases for long-term experimental data storage and post-hoc analysis
  • A browser-based dynamic dashboard to visualize analysis results as they are generated

Overall, the system is designed to perform the following actions:

  • Resistance Gene Detection: this pipeline identifies antibiotic resistance genes present in a sample and points out actionable insights, e.g., what treatment regimen to apply to a particular patient.
  • Species Proportion Estimation: this pipeline estimates the proportion of pathogenic species present in a sample. Proportion estimation can be useful in a variety of applications including clinical diagnostics, biosecurity, and logistics/supply-chain auditing.
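
To make the two outputs concrete, here is a minimal sketch in plain Python. It assumes the alignment step yields one (read, species, gene) hit per read; that tuple layout, the function names, and the sample hits are made up for illustration and are not taken from the project's code.

```python
from collections import Counter


def species_proportions(hits):
    """Return each species' share of aligned reads as a fraction of the total."""
    counts = Counter(species for _read_id, species, _gene in hits)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {species: n / total for species, n in counts.items()}


def detected_resistance_genes(hits, known_genes):
    """Return the sorted known resistance genes seen in at least one read."""
    return sorted({gene for _read_id, _species, gene in hits if gene in known_genes})


# Hypothetical alignment hits: (read id, species, resistance gene or None).
hits = [
    ("read1", "Escherichia coli", None),
    ("read2", "Escherichia coli", "blaTEM"),
    ("read3", "Klebsiella pneumoniae", "blaKPC"),
    ("read4", "Escherichia coli", None),
]
# Illustrative set of genes to look for (assumption, not the project's database).
KNOWN_GENES = {"blaTEM", "blaKPC", "mecA"}

proportions = species_proportions(hits)
genes = detected_resistance_genes(hits, KNOWN_GENES)
print(proportions)
print(genes)
```

A real detector would also weigh alignment quality and read length; this sketch only counts reads.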

The software is open-source and built on the following open-source packages:

  • JAPSA
  • TensorFlow
  • Apache Beam
  • D3

We used Google Cloud to implement the data analysis application because of its scaling capacity, reliability, and cost-effectiveness. It also offers a wide array of scalable services, including AI accelerator chips such as Tensor Processing Units.

The transformation of information follows this sequence:

  1. Integration - files are uploaded to the Google Cloud Platform and streamed into the processing pipeline;
  2. Basecalling stage - a machine learning model infers DNA sequences from electrical signals;
  3. Alignment stage - the samples are compared against a DNA database to find pathogen sequences and other anomalies;
  4. Summarization stage - calculation of each pathogen's percentage in the particular sample;
  5. Storage and visualization - the results are saved to Google Firestore DB and subsequently visualized in real-time with D3.js.
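
The sequence above can be sketched as a chain of lazy stages, so that an updated summary is available after every read rather than only at the end. This is a toy stand-in written with plain Python generators, not the Apache Beam pipeline itself: the signal-to-base lookup replaces the neural-network base caller, substring matching replaces the aligner, and a list replaces the database. All names and data are assumptions.

```python
from collections import Counter

# Toy lookup standing in for the neural-network base caller (assumption).
SIGNAL_TO_BASE = {0: "A", 1: "C", 2: "G", 3: "T"}


def basecall(signals):
    """Stage 2: turn each raw signal (here, a list of ints) into a DNA string."""
    for signal in signals:
        yield "".join(SIGNAL_TO_BASE[level] for level in signal)


def align(reads, reference):
    """Stage 3: label each read with the first reference species containing it."""
    for read in reads:
        for species, genome in reference.items():
            if read in genome:
                yield species
                break
        else:
            yield "unclassified"


def summarize(labels):
    """Stage 4: emit an updated percentage table after every read."""
    counts = Counter()
    for label in labels:
        counts[label] += 1
        total = sum(counts.values())
        yield {name: 100 * n / total for name, n in counts.items()}


def run_pipeline(signals, reference, store):
    """Stages 1-5 chained lazily; `store` stands in for the database."""
    for snapshot in summarize(align(basecall(signals), reference)):
        store.append(snapshot)  # stage 5: persist each intermediate result
    return store[-1] if store else {}


# Made-up reference sequences and signals, for illustration only.
reference = {"species_a": "ACGTACGT", "species_b": "TTTTGGGG"}
signals = [[0, 1, 2, 3], [3, 3, 3, 3], [1, 2, 3, 0], [2, 2, 0, 0]]
store = []
final = run_pipeline(signals, reference, store)
print(final)
```

Because every stage is a generator, nothing waits for the full input: each stored snapshot is what a dashboard could draw at that moment.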

Nanostream Project Tasks & Challenges

Ensuring Data Scalability

Nanopore DNA analysis is a resource-demanding procedure that must be fast and efficient to be genuinely useful.

Due to the high volume of data and tight time constraints, the system needs to scale accordingly, which was achieved via the Google Cloud Platform and its autoscaling features. GCP secures smooth and reliable scalability for data processing operations.

To keep the data processing workflow uninterrupted no matter the workload, we used Apache Beam.

Refining Data Processing and Analysis Algorithms

Accuracy is the central requirement for the data processing operation in genomics, especially in the context of DNA Analysis and pathogen detection.

The project required a fine-tuned, tightly integrated data processing operation with an emphasis on providing a broad scope of results in minimal time.

Our task was to connect the analytics application to the cloud platform and guarantee an effective information turnaround. The system was thoroughly tested to ensure the accuracy of results and efficiency of the processing.

Integrating with DNA Analysis Tools

DNA analysis tools for nanopore sequencers were not initially developed for cloud platforms and distributed services. The majority of them were desktop utilities, which significantly limited their capability. We needed to integrate these desktop-based tools into a unified, scalable system.

We adapted the desktop-based DNA analysis tools to work over HTTP and deployed them as web services, which made them capable of processing large quantities of data in a shorter timespan.
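
One common way to expose a command-line utility over HTTP is to pass the request body to the tool's standard input and return its standard output. The sketch below shows that pattern with Python's standard library only. It is an illustration of the general idea, not the project's actual service: the wrapped command is a placeholder one-liner that upper-cases its input, and the handler name, port, and timeout are assumptions.

```python
import subprocess
import sys
from http.server import BaseHTTPRequestHandler, HTTPServer

# Placeholder for a desktop analysis tool: any command that reads stdin and
# writes stdout. Here it is a Python one-liner that upper-cases its input.
TOOL_COMMAND = [sys.executable, "-c",
                "import sys; sys.stdout.write(sys.stdin.read().upper())"]


def run_tool(payload: bytes, command=TOOL_COMMAND, timeout=60) -> bytes:
    """Feed the request body to the tool's stdin and return its stdout."""
    result = subprocess.run(command, input=payload, capture_output=True,
                            timeout=timeout, check=True)
    return result.stdout


class ToolHandler(BaseHTTPRequestHandler):
    """Answer each POST with the output of the wrapped tool."""

    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        body = self.rfile.read(length)
        try:
            output, status = run_tool(body), 200
        except (subprocess.CalledProcessError, subprocess.TimeoutExpired) as exc:
            output, status = str(exc).encode(), 500
        self.send_response(status)
        self.send_header("Content-Type", "text/plain")
        self.send_header("Content-Length", str(len(output)))
        self.end_headers()
        self.wfile.write(output)


def serve(port=8080):
    """Block and serve the wrapped tool on the given port."""
    HTTPServer(("", port), ToolHandler).serve_forever()
```

Calling `serve()` starts the service; once the tool sits behind an HTTP endpoint, several copies can run side by side and share the incoming load.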

Securing Cost-Effectiveness & Reducing Overhead

Nanopore DNA sequencers are a viable solution for swift pathogen analysis and more competent medical treatment. However, the maintenance of such devices can be a challenging task for medical facilities due to resource and personnel requirements. Also, the range of uses for such a device is relatively narrow compared with the expenditure it requires.

We moved the entire system to Google Cloud Platform to solve this issue, allowing the service to be accessed and scaled without unnecessary overhead expenses.

Developing Accessible User Interface

Machine learning and big data analysis systems can process a lot of data, but the results are useless until the insights are presented in an understandable way. In the case of the nanopore DNA sequencing solution, the idea was to give medical staff a tool that would help them make decisions in critical situations and save lives. Therefore, an accessible presentation was one of the essential elements of this research project.

The system needed an easy-to-follow and straightforward interface that provided all the required data in a digestible form, avoiding confusion.

To create the most convenient user interface design, we carried out extensive user testing. The result is an interactive dashboard with multiple types of visualization and reporting that takes minimal effort to get accustomed to.

When it came to visualization, the initial format of choice was a pie chart. However, it proved insufficient in more complex scenarios.

Because of that, we concluded that we needed to expand the set of visualizations with a couple of new options, which is where the D3 data visualization library helped us out.

Through extensive testing, we found that sunburst diagrams do an excellent job of showing the elements of the sample in an accessible form.
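
A sunburst diagram draws a hierarchy as nested rings, so the data behind it is a tree. The sketch below shows one way per-species read counts could be nested into the name/children/value shape that `d3.hierarchy` reads by default (child nodes under `children`, with leaf sizes usually summed via `.sum(d => d.value)`). The function name, the lineage tuples, and the counts are made up for illustration and are not taken from the project's code.

```python
import json


def build_sunburst_tree(lineage_counts, root_name="sample"):
    """Nest (lineage, read_count) pairs into a name/children/value tree."""
    root = {"name": root_name, "children": []}
    for lineage, count in lineage_counts:
        node = root
        for rank in lineage:
            children = node.setdefault("children", [])
            match = next((c for c in children if c["name"] == rank), None)
            if match is None:
                match = {"name": rank}
                children.append(match)
            node = match
        # The last rank in the lineage carries the read count.
        node["value"] = node.get("value", 0) + count
    return root


# Hypothetical read counts per taxonomic lineage.
lineage_counts = [
    (("Bacteria", "Proteobacteria", "Escherichia coli"), 30),
    (("Bacteria", "Proteobacteria", "Klebsiella pneumoniae"), 10),
    (("Bacteria", "Firmicutes", "Staphylococcus aureus"), 5),
]
tree = build_sunburst_tree(lineage_counts)
print(json.dumps(tree, indent=2))
```

Compared with a pie chart, which can show only one level, the nested form lets each ring represent one taxonomic rank.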

Project's Tech Stack & Team

There were many technologies involved, the majority of which had to do with big data analysis and the cloud:

  • JAPSA
  • TensorFlow
  • Chiron Base Caller
  • Google Cloud
  • Google Cloud Storage
  • Google Cloud Pub/Sub
  • Google Firestore
  • Google Cloud Dataflow
  • Apache Beam
  • D3 Data Visualization Library
  • JavaScript

From the APP Solutions' side, we had four people working on this Nanopore DNA Sequencers project: 

  • 2 Data Engineers
  • 1 DevOps Engineer
  • 1 Web Developer

Creating Nanopore DNA Sequencing Cloud-Based Solutions

This project was an incredible experience for our team. We had a chance to dive deep into the healthcare industry as well as machine learning, data analysis, and Google Cloud platform capabilities.

While exploring the possibilities of data analysis in healthcare applications, we found many parallels with data analysis in other fields.

We have managed to apply our knowledge of cloud infrastructure and build a system that is capable of processing large quantities of data in a relatively short time - and help doctors save patients' lives!

Learn more about the project and check out our contributions on GitHub.

Volodymyr Bilyk

Content Manager