COVID-19 Public Dataset Program by Google Cloud Platform

Free access to datasets and data analysis tools at cloud scale has a significant impact on the research process, especially in the global response to COVID-19.

As a Google Cloud Platform Partner, we want to share information with researchers, data scientists, and analysts about available hosted repositories of public datasets for tracking the COVID-19 outbreak.

Free datasets provide access to essential information, eliminating the need to search for and onboard large data files. From within the Google Cloud Console, you can access the datasets along with a description of the data and sample queries to advance your research. All the data GCP includes in the program is public and freely available, and the program will remain in effect until September 15, 2020.

You can also use these datasets with BigQuery ML to train your machine learning models inside BigQuery at no additional cost.
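
For instance, here is a minimal Python sketch of training a linear regression model with BigQuery ML; the dataset, table, and column names are hypothetical placeholders, not part of the actual program.

    from google.cloud import bigquery

    # A minimal BigQuery ML sketch: train a linear regression that
    # predicts daily new cases. Dataset, table, and column names
    # below are hypothetical placeholders.
    client = bigquery.Client()
    client.query("""
        CREATE OR REPLACE MODEL `my_dataset.case_forecast`
        OPTIONS (model_type = 'linear_reg', input_label_cols = ['new_cases']) AS
        SELECT
          DATE_DIFF(date, DATE '2020-01-01', DAY) AS day_index,
          new_cases
        FROM `my_dataset.covid_daily_cases`
    """).result()  # wait for the training job to finish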

Currently, the Google Cloud Platform program hosts a number of public COVID-19 databases.

With these databases and BigQuery ML, you can develop a data-driven model of the spread of this infectious disease and better understand, study, and analyze the impact of COVID-19. Together with the Google Cloud team, we believe that the COVID-19 Public Dataset Program will enable better and faster research to combat the spread of this disease.

For more information, visit the About COVID-19 Public Datasets and BigQuery Public Datasets Program pages on the official Google Cloud Platform website.

Back-end Tech Stack for Custom eLearning Solutions

Some time ago, we received a request from a client who wanted to create an online learning platform. The platform had to include both real-time webinars and on-demand video streaming. The client contacted us after receiving an estimate from other developers that exceeded their budget, so our main goal was to find a solution that offered existing modules for this project.

We did our research and found that Google Cloud Platform, which we partner with, offers dedicated components for video streaming projects, as well as a number of other handy functions, including analytics and cloud data storage.

We’ve already talked about how to create an online learning platform and the essential features of an eLearning platform. In this article, we share our technical expertise in applying GCP components to custom online learning platform development.

But before diving into tech details, let’s find out what an e-learning platform consists of. 

Online learning marketplace architecture and main components

An online education marketplace consists of four main components, closely connected to each other:

  • A web-based application that contains most of the features and simplifies the development process of the education marketplace
  • Video processing for transcoding, uploading, and streaming educational videos on-demand
  • Data processing that will receive events from other parts of the custom eLearning service and ingest them
  • Cloud storage for storing original uploaded on-demand education video files

[eLearning marketplace architecture and main components by The APP Solutions]

With this in mind, let’s move further and find out more about each architecture component and the technologies to implement them. 

Educational software from scratch: essential tech stack

A web application is the part of an online learning platform that includes most of the features, such as user authentication, user management, and others. Most of the functionality will be integrated into the web application via an API.

Architecture

When developing an e-learning web app, you can choose among the following options: 

  • Building a monolithic eLearning application makes sense if you’re developing learning platform software with a limited amount of content, users, etc. In other words, you are not going to scale it significantly in the future. 
  • Splitting the eLearning app’s back-end into microservices is an excellent choice if you are going to create an online learning platform where third-party educational organizations will post and sell their online courses. 

[Source: Medium]

While a monolithic architecture significantly simplifies initial development of an educational marketplace, the app can become too large over time and, hence, difficult to manage. A microservices architecture, by contrast, allows you to update, deploy, and scale each element independently. As for web frameworks, we suggest Symfony 4 (PHP) or Django (Python).

Video service

The video service allows both you and other users to upload video materials to the platform and sends transcoded videos to user devices. As you may know, the quality of streaming video depends on many factors, such as the video format, the memory of the viewer’s device, and internet speed. To provide users with high-quality streaming, the video service transcodes video files and creates multiple versions of the same video in different sizes, making it possible for users to watch online courses even over a slow connection while consuming little of the device’s memory.

WebRTC server

WebRTC stands for Web Real-Time Communication. This component of an e-learning platform enables users to hold video streaming sessions. WebRTC is an open-source framework with a library that is available via a web browser, and it also supports native platforms such as iOS and Android. Developers add and configure this component via an application programming interface. As alternatives, you can also use the open-source products Kurento or Janus.

Video Processing tech stack 

This component of eLearning platform architecture is responsible for transcoding both pre-recorded and live streaming video lessons. Now, let’s take a closer look at its tech stack. 

Transcoding service 

This service transcodes videos uploaded to the platform by you or other users, using the open-source FFmpeg library.

The transcoding service’s primary function is to decompress the raw files and convert them into different compressed formats. In this way, your learning platform will provide fast, efficient streaming over the Internet right into a desktop or mobile client. 

The main reason we suggest the FFmpeg library is that traditional transcoding services require many processors, a lot of memory, and a lot of storage. For example, it can take more than an hour to transcode an hour’s worth of high-definition video. The FFmpeg library, by contrast, allows transcoding such videos in real time.
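
As a rough illustration, here is a minimal Python sketch of how a transcoding service might call FFmpeg to produce several renditions of an uploaded lesson; the file names, resolutions, and bitrates are hypothetical.

    import subprocess

    # Hypothetical renditions: (resolution, video bitrate, output file).
    renditions = [
        ("1280x720", "2800k", "lesson_720p.mp4"),
        ("854x480", "1400k", "lesson_480p.mp4"),
        ("640x360", "800k", "lesson_360p.mp4"),
    ]

    for size, bitrate, out_file in renditions:
        subprocess.run(
            [
                "ffmpeg", "-y",
                "-i", "lesson_raw.mp4",  # the original upload
                "-c:v", "libx264",       # H.264 video
                "-b:v", bitrate,         # target video bitrate
                "-s", size,              # output resolution
                "-c:a", "aac",           # AAC audio
                out_file,
            ],
            check=True,  # raise if FFmpeg fails
        )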

[Source: Intel] 

Live transcoding service 

The live transcoding service receives a Real-time Transport Protocol (RTP) stream from the WebRTC (Web Real-Time Communication) server and transcodes it to HTTP Live Streaming (HLS), an HTTP-based adaptive-bitrate streaming protocol. In other words, it is what allows teachers to stream live lessons.

But why does your platform need it? Let’s explain. 

Imagine you want to run a live broadcast using a camera and an encoder. You’ve already compressed your video with an RTMP encoder and selected the H.264 video codec at 1080p. If you attempt to stream it directly, viewers without sufficient bandwidth or suitable devices won’t be able to watch it. A live transcoding service is what lets you broadcast to users with slower data speeds, tablets, mobile phones, and connected TV devices, allowing playback on almost any screen on the planet.
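
For illustration, here is a minimal sketch of the kind of FFmpeg invocation a live transcoding service could use to repackage an incoming RTMP stream into HLS; the stream URL and output path are hypothetical.

    import subprocess

    # Repackage a hypothetical incoming RTMP stream into an HLS playlist.
    subprocess.run(
        [
            "ffmpeg",
            "-i", "rtmp://media-server/live/lesson",  # incoming stream
            "-c:v", "libx264", "-c:a", "aac",
            "-f", "hls",
            "-hls_time", "4",        # 4-second segments
            "-hls_list_size", "5",   # keep a rolling playlist
            "/var/www/hls/lesson.m3u8",
        ],
        check=True,
    )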

[Source: Wowza]

Data Processing tech stack 

Streaming ETL

Streaming ETL handles the movement of real-time data from one place to another. This element is connected to both the analytics and real-time databases.

ETL stands for the three database functions: extract, transform, and load. Let’s look at these functions in more detail.

  • Extract means collecting data from a source located in other parts of the system. 
  • Transform means any processing performed on that data. 
  • Load refers to sending the processed data to a destination, such as the analytics and real-time databases. 

For the streaming data processing pipeline, we suggest using Cloud Dataflow to receive events from other parts of the system and ingest them into BigQuery, which is responsible for data analytics.
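
A minimal sketch of such a pipeline with the Apache Beam Python SDK might look as follows; the project, topic, table, and schema names are hypothetical.

    import apache_beam as beam
    from apache_beam.options.pipeline_options import PipelineOptions

    # Streaming ETL sketch: extract events from PubSub, transform them,
    # and load them into BigQuery. All resource names are hypothetical.
    options = PipelineOptions(streaming=True)

    with beam.Pipeline(options=options) as p:
        (
            p
            | "Extract" >> beam.io.ReadFromPubSub(
                topic="projects/my-project/topics/platform-events")
            | "Transform" >> beam.Map(lambda msg: {"event": msg.decode("utf-8")})
            | "Load" >> beam.io.WriteToBigQuery(
                "my-project:analytics.events",
                schema="event:STRING",
            )
        )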

Storage tech stack 

The storage layer of your learning platform includes four main elements: the main database, analytics, file storage, and a real-time database. Let’s check the technologies required for each component.

Main database

The main database will organize data items into columns and rows. This component keeps all the co-dependent elements of your platform working together.

Since all the other platform components are hosted on Google Cloud Platform, we suggest a relational database, namely Cloud SQL, as the central database for the web application. Cloud SQL makes it easy to set up, maintain, manage, and administer relational databases on the Google Cloud Platform, and it supports MySQL, PostgreSQL, and SQL Server.
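
As an example, here is a minimal sketch of querying a Cloud SQL (PostgreSQL) instance from Python using the Cloud SQL Python Connector; the instance name, credentials, and table are hypothetical.

    from google.cloud.sql.connector import Connector

    # Connect to a hypothetical Cloud SQL PostgreSQL instance.
    connector = Connector()
    conn = connector.connect(
        "my-project:us-central1:elearning-db",  # instance connection name
        "pg8000",                               # PostgreSQL driver
        user="app",
        password="secret",
        db="elearning",
    )

    cur = conn.cursor()
    cur.execute("SELECT id, title FROM courses LIMIT 5")
    print(cur.fetchall())
    conn.close()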

Analytics 

This element gathers information about everything happening on your platform and analyzes it to give you handy insights about your users. Since your platform will generate a massive amount of data, Google Analytics does not suit this task. Instead, we suggest BigQuery, which handles vast amounts of data and analyzes it in a streamlined way. BigQuery also allows you to store data about specific events in separate tables. For example, you can receive a detailed report on the most popular courses, how many users bought paid subscriptions, how many users access your platform via a web browser, and the number of mobile device users. This tool will also come in handy when you start running advertising campaigns and need to find your most effective advertising channel.
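
For instance, a report on the most popular courses can be a single query; here is a minimal sketch with hypothetical dataset, table, and column names.

    from google.cloud import bigquery

    # Query event data for the ten most viewed courses.
    # Dataset, table, and column names are hypothetical.
    client = bigquery.Client()
    query = """
        SELECT course_id, COUNT(*) AS views
        FROM `my-project.analytics.course_events`
        WHERE event_name = 'course_viewed'
        GROUP BY course_id
        ORDER BY views DESC
        LIMIT 10
    """
    for row in client.query(query).result():
        print(row.course_id, row.views)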

File storage

File storage holds all the materials used by your learning platform: uploaded video files, transcoded video files, images, and so on. Since on-premise solutions do not suit a streaming platform’s growing needs, we suggest Google Cloud Storage for all your files. Moreover, Google Cloud Storage enables low-latency content delivery across the globe, thanks to geo-redundant storage with a high level of availability and performance.
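
Uploading a transcoded lesson to Cloud Storage takes only a few lines; here is a minimal sketch with hypothetical bucket and file names.

    from google.cloud import storage

    # Upload a transcoded video to a hypothetical bucket.
    client = storage.Client()
    bucket = client.bucket("elearning-video-assets")
    blob = bucket.blob("transcoded/lesson_720p.mp4")
    blob.upload_from_filename("lesson_720p.mp4")
    print("Uploaded to", blob.public_url)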

Real-Time database

Your eLearning system needs a real-time database to keep track of the videos users are watching. For this purpose, we suggest the Firebase Realtime Database, a cloud-hosted NoSQL database that stores and synchronizes data between users in real time. Thanks to Firebase’s Android, iOS, and JavaScript SDKs, your users can access your online courses via web browsers and iOS and Android devices. This database also supports offline data access.
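
On the server side, syncing playback state might look like the following minimal sketch using the Firebase Admin SDK for Python; the database URL, credentials file, and data paths are hypothetical.

    import firebase_admin
    from firebase_admin import credentials, db

    # Initialize the Admin SDK with a hypothetical service account
    # and Realtime Database URL.
    cred = credentials.Certificate("service-account.json")
    firebase_admin.initialize_app(cred, {
        "databaseURL": "https://my-elearning-app.firebaseio.com",
    })

    # Record which video a user is watching and their position in it.
    db.reference("playback/user_42/course_7").set({
        "video_id": "lesson_03",
        "position_seconds": 185,
    })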

Now that we’ve examined the essential components of an eLearning platform’s back-end, let’s find out how much this project will cost. 

Related reading: 

Case Study: Video Streaming App Proof of Concept

How much does it cost to create your own eLearning platform?

Although most of your platform’s back-end components are open-source, you still need to pay the development team that will implement them. Besides that, developing a custom online learning marketplace also requires front-end development, design, testing, and project management services.

So, let’s check how much work a fully-fledged eLearning platform requires.

  • Back-end: 800 to 1,000 hours 
  • Front-end: 300 to 500 hours 
  • Design: 250 to 400 hours 
  • QA: around 1,000 hours 
  • Project management: 2,500 to 3,500 hours 
  • Project documentation: from 650 hours 

To receive a more detailed estimate for your eLearning marketplace, use our project cost calculator.

eLearning tools and technologies: Final word

The right back-end architecture components are essential for sophisticated projects such as an eLearning marketplace. The choice of back-end components will affect the quality and uptime of both on-demand videos and live streams.

Thanks to cloud providers like Google Cloud Platform, developers can leverage the best software for creating an eLearning project’s back-end, which significantly reduces the time and money required to deliver the project. 

Google Cloud Services for Big Data Projects

Google Cloud Platform provides various services for data analysis and Big Data applications. All of these services integrate with other Google Cloud products, and each has its pros and cons.

This article will review what services Google Cloud Platform can offer for data and Big Data applications and what those services do. We’ll also check out what benefits and limitations they have, the pricing strategy of each service, and their alternatives.

Cloud PubSub

Cloud PubSub is a message queue broker that allows applications to exchange messages reliably, quickly, and asynchronously. It is based on the publish-subscribe pattern.

[Visualization of PubSub workflow]

The diagram above describes the basic PubSub flow: publisher applications publish messages to a PubSub topic; the topic sends messages to PubSub subscriptions, which store them; and subscriber applications read messages from the subscriptions.
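
In Python, the two halves of this flow might look like the following minimal sketch; the project, topic, and subscription IDs are hypothetical.

    from concurrent.futures import TimeoutError

    from google.cloud import pubsub_v1

    # Publisher side: publish one message to a hypothetical topic.
    publisher = pubsub_v1.PublisherClient()
    topic_path = publisher.topic_path("my-project", "events")
    publisher.publish(topic_path, b"lesson_completed", user_id="42").result()

    # Subscriber side: pull messages from a hypothetical subscription.
    subscriber = pubsub_v1.SubscriberClient()
    sub_path = subscriber.subscription_path("my-project", "events-sub")

    def callback(message):
        print("Received:", message.data)
        message.ack()  # acknowledge so PubSub doesn't redeliver the message

    future = subscriber.subscribe(sub_path, callback=callback)
    try:
        future.result(timeout=10)  # listen briefly for demonstration purposes
    except TimeoutError:
        future.cancel()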

Benefits

  • A highly reliable communication layer
  • High capacity

Limitations

  • 10 MB is the maximum size of one message
  • 10 MB is the maximum size of one request, which means that if you send ten messages in a single request, the average size of each message can be at most 1 MB
  • The maximum attribute value size is 1 MB

Pricing strategy

You pay for transferred data per GB.

Analogs & alternatives

  • Apache Kafka
  • RabbitMQ
  • Amazon SQS
  • Azure Service Bus
  • Other Open Source Message Brokers

Google Cloud IoT Core

[The architecture of Cloud IoT Core]

Cloud IoT Core is an IoT device registry. This service allows devices to connect to the Google Cloud Platform, receive messages from other devices, and send messages to those devices. To receive messages from devices, IoT Core uses Google PubSub.

Benefits

  • MQTT and HTTPS transfer protocols
  • Secure device connection and management

Pricing Strategy

You pay for the data volume that you transfer across this service.

Analogs & alternatives

  • AWS IoT Core
  • Azure IoT

Cloud Dataproc

[Cloud Dataproc for Apache Spark and Apache Hadoop]

Cloud Dataproc is a faster, easier, and more cost-effective way to run Apache Spark and Apache Hadoop in Google Cloud. Cloud Dataproc is a cloud-native solution covering all operations related to deploying and managing Spark or Hadoop clusters. 

In simple terms, with Dataproc you can create a cluster of instances on Google Cloud Platform, dynamically resize the cluster, configure it, and run MapReduce jobs.
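
For instance, submitting a PySpark job to an existing cluster from Python might look like this minimal sketch; the project, region, cluster, and script names are hypothetical.

    from google.cloud import dataproc_v1

    # Submit a PySpark job to a hypothetical existing Dataproc cluster.
    client = dataproc_v1.JobControllerClient(
        client_options={"api_endpoint": "us-central1-dataproc.googleapis.com:443"}
    )
    job = {
        "placement": {"cluster_name": "my-cluster"},
        "pyspark_job": {"main_python_file_uri": "gs://my-bucket/wordcount.py"},
    }
    response = client.submit_job(
        project_id="my-project", region="us-central1", job=job
    )
    print("Submitted job:", response.reference.job_id)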

Benefits

  • Fast deployment
  • A fully managed service: you just write your code, with no operations work
  • Dynamic cluster resizing
  • Auto-scaling feature

Limitations

  • You cannot select a specific version of the underlying framework
  • You cannot pause or stop a Dataproc cluster to save money, only delete it (this can be automated via Cloud Composer)
  • You cannot choose a cluster manager; only YARN is available

Pricing strategy

You pay for each instance in the cluster, plus a Dataproc premium. Google Cloud Platform bills for each minute the cluster is running.

Analogs & alternatives

  • Set-up cluster on virtual machines
  • Amazon EMR
  • Azure HDInsight

Cloud Dataflow

[The place of Cloud Dataflow in a Big Data application on Google Cloud Platform]

Cloud Dataflow is a managed service for developing and executing a wide range of data processing patterns, including ETL, batch, streaming processing, etc. In addition, Dataflow is used for building data pipelines. This service is based on Apache Beam and supports Python and Java jobs.
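
Here is a minimal batch sketch with the Beam Python SDK; the bucket paths are hypothetical, and switching the same pipeline to Dataflow is a matter of pipeline options.

    import apache_beam as beam

    # Batch sketch: count occurrences of the first CSV column.
    # Runs locally by default; pass DataflowRunner options to run on GCP.
    with beam.Pipeline() as p:
        (
            p
            | beam.io.ReadFromText("gs://my-bucket/raw/*.csv")
            | beam.Map(lambda line: line.split(",")[0])
            | beam.combiners.Count.PerElement()
            | beam.MapTuple(lambda key, count: f"{key},{count}")
            | beam.io.WriteToText("gs://my-bucket/out/counts")
        )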

Benefits

  • Combines batch and streaming with a single API
  • Speedy deployment
  • A fully managed service, no operation work
  • Dynamic work rebalancing
  • Autoscaling

Limitations

  • Built on a single solution, Apache Beam, and therefore inherits all of Beam’s limitations
  • The maximum size of a single element value in Streaming Engine is 100 MB

Pricing strategy

Cloud Dataflow jobs are billed per second, based on the actual use of Cloud Dataflow.

Analogs & alternatives

  • Set-up cluster on virtual machines and run Apache Beam via in-built runner
  • As far as I know, other cloud providers don’t have analogs.

Google Cloud Dataprep

[The interface of Dataprep]

Dataprep is a tool for visualizing, exploring, and preparing the data you work with. You can build pipelines to ETL your data into different storage systems, all from a simple and intelligible web interface.

For example, you can use Dataprep to build an ETL pipeline that extracts raw data from GCS, cleans it up, transforms it into the needed view, and loads it into BigQuery. You can also schedule a daily or weekly job that runs this pipeline on new raw data.

Benefits

  • Simplifies the building of ETL pipelines
  • Provides a clear and helpful web interface
  • Automates a lot of manual work for data engineers
  • Built-in scheduler
  • Uses Google Dataflow under the hood to perform ETL jobs

Limitations

  • Works only with BigQuery and GCS

Pricing Strategy

You pay standard rates for data storage, and for executing ETL jobs you pay Google Dataflow rates.

Cloud Composer

Cloud Composer is a workflow orchestration service for managing data processing. It is a cloud interface for Apache Airflow, and it automates ETL jobs. One example: create a Dataproc cluster, perform transformations on extracted data (via a Dataproc PySpark job), upload the results to BigQuery, and then shut the Dataproc cluster down, as sketched below.
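
A minimal Airflow DAG for that flow might look as follows; all project, region, bucket, and cluster names are hypothetical, and the PySpark script itself is not shown.

    from datetime import datetime

    from airflow import DAG
    from airflow.providers.google.cloud.operators.dataproc import (
        DataprocCreateClusterOperator,
        DataprocDeleteClusterOperator,
        DataprocSubmitJobOperator,
    )

    with DAG(
        dag_id="daily_etl",
        start_date=datetime(2021, 1, 1),
        schedule_interval="@daily",
        catchup=False,
    ) as dag:
        create_cluster = DataprocCreateClusterOperator(
            task_id="create_cluster",
            project_id="my-project",
            region="us-central1",
            cluster_name="etl-cluster",
        )
        transform = DataprocSubmitJobOperator(
            task_id="pyspark_transform",
            project_id="my-project",
            region="us-central1",
            job={
                "placement": {"cluster_name": "etl-cluster"},
                "pyspark_job": {"main_python_file_uri": "gs://my-bucket/transform.py"},
            },
        )
        delete_cluster = DataprocDeleteClusterOperator(
            task_id="delete_cluster",
            project_id="my-project",
            region="us-central1",
            cluster_name="etl-cluster",
            trigger_rule="all_done",  # tear the cluster down even if the job fails
        )

        create_cluster >> transform >> delete_cluster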

Benefits

  • Fills the gaps of other Google Cloud Platform solutions, like Dataproc
  • Inherits all advantages of Apache Airflow

Limitations

  • Exposes the Airflow web UI on a public IP address
  • Inherits all the limitations of Apache Airflow

Pricing Strategy

You pay only for the resources Composer is deployed on. Note, however, that Composer is deployed on three instances.

Analogs & alternatives

  • Custom-deployed Apache Airflow
  • Other open-source orchestration solutions

BigQuery

[Example of integration BigQuery into a data processing solution with different front-end integrations] 

BigQuery is a data warehouse that allows you to store and query massive datasets of up to hundreds of petabytes. BigQuery is structurally very similar to relational databases: it has tables, uses SQL, supports batch and streaming writes, and integrates with Google Cloud Platform services and with tools such as Dataflow, Apache Spark, and Apache Hadoop. It is best used for interactive querying and offline analytics.
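
For example, streaming writes go through a simple API; here is a minimal Python sketch with hypothetical project, dataset, and table names (the table must already exist).

    from google.cloud import bigquery

    # Stream a row into a hypothetical, pre-existing BigQuery table.
    client = bigquery.Client()
    errors = client.insert_rows_json(
        "my-project.analytics.page_views",
        [{"user_id": "42", "page": "/courses/7"}],
    )
    print("Insert errors:", errors)  # empty list on success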

Benefits

  • Huge capacity, up to hundreds of Petabytes
  • SQL
  • Batch and streaming writing
  • Support for complex queries
  • Built-in ML
  • Serverless
  • Shared datasets — you can share datasets between different projects
  • Global locations
  • All popular data processing tools have interfaces to BigQuery

Limitations

  • It doesn’t support transactions, though transactions are rarely needed in an OLAP solution
  • The maximum row size is 10 MB

Pricing strategy

You pay separately for stored data (per GB) and for executed queries.

For executed queries, you can choose one of two payment models, depending on your preferences: pay for each processed terabyte, or pay a flat monthly cost.

Analogs & alternatives

  • Amazon Redshift
  • Azure Cosmos DB

Cloud BigTable

Google Cloud BigTable is Google’s NoSQL Big Data database service. The same database powers many core Google services, including Search, Analytics, Maps, and Gmail. Bigtable is designed to handle massive workloads at consistent low latency and high throughput, so it’s an excellent choice for operational and analytical applications, including IoT, user analytics, and financial data analysis.

Cloud Bigtable is compatible with the Apache HBase API. The database has an enormous capacity and is recommended when you store a terabyte of data or more. For example, Bigtable is a great fit for time-series and IoT data.

Benefits

  • Good performance on 1 TB of data or more
  • Cluster resizing without downtime
  • Incredible scalability
  • Supports the Apache HBase API

Limitations

  • Poor performance on less than 300 GB of data
  • Not suited to real-time use cases
  • Doesn’t support ACID operations
  • The maximum size of a single value is 100 MB
  • The maximum size of all values in a row is 256 MB
  • The maximum disk size is 8 TB per node
  • A minimum of three nodes per cluster

Pricing Strategy

BigTable is very expensive. You pay for nodes (a minimum of $0.65 per node per hour) and storage capacity (a minimum of $26 per terabyte per month).

Analogs & alternatives

  • Custom-deployed Apache HBase

Cloud Storage

GCS is blob storage for files. You can store any number of files of any size there.

Benefits

  • A good API for all popular programming languages and operating systems
  • Immutable files
  • File versioning
  • Suitable for files of any size
  • Suitable for any number of files

Pricing Strategy

GCS has a couple of pricing plans. In the standard plan, you pay per gigabyte of stored data.

Analogs & alternatives

  • Amazon S3
  • Azure Blob Storage

Other Google Cloud Services

There are a few more services that I should mention.

Google Cloud Compute Engine provides virtual machines with whatever performance capacity you need.

Google Cloud SQL is a cloud-native solution for hosting MySQL and PostgreSQL databases. It has built-in vertical and horizontal scaling, a firewall, encryption, backups, and the other benefits of a cloud solution. It has terabyte-scale capacity and supports complex queries and transactions.

Google Cloud Spanner is a fully managed, scalable, relational database service. It supports SQL queries, automatic replication, and transactions. It has a one-petabyte capacity and best suits large-scale database applications that store more than a couple of terabytes of data.

Google StackDriver monitors Google services and infrastructure, as well as your applications hosted on the Google Cloud Platform.

Cloud Datalab is a way to visualize and explore your data. This service provides a cloud-native way to host Python Jupyter notebooks.

Google Cloud AutoML and Google AI Platform allow training and hosting of high-quality custom machine learning models with minimal effort.

Conclusion

Now you are familiar with the primary data services that Google Cloud Platform provides. This knowledge can help you build a good data solution. But, of course, the cloud is not a silver bullet, and using it the wrong way can significantly inflate your monthly infrastructure bill.

So design your solution’s architecture carefully and choose only the services you need to reach your business goals. Explore the benefits and limitations of each candidate service for your particular case. Keep an eye on costs. And, of course, remember the scalability, reliability, and maintainability of your solution.
