Clear Project: Real-time Data Analytics & Content Moderation

In this issue:

Short overview
Project Background
Project Details
Challenges
Scalability – BigQuery
Building a serverless data warehouse
Content Moderation Tools
Spam & Fraud Detection Tools
Integration of Image Recognition
Superior Privacy – Data Loss Prevention
Tech Stack
Personnel
Conclusion

Project Background

The client ran an online chat platform. It had been active for quite a while and needed a major overhaul to keep going.

The primary issue with the platform was that it had become increasingly hard to manage. The moderation tools were obsolete, which resulted in a slowly growing element of toxicity in the chatrooms.

To make the chatrooms less toxic and more civil, the content moderation system needed a major overhaul. The main focus was the detection of bullying and obscene content.

The other major problem was scams, fraud, and bot activity. As a niche public communication service, the platform needed to keep such activity out of the system.

In addition to this, there were reasonable concerns about the privacy of conversations and user profiles due to the threat of cyberbullying and hacking by means of social engineering.

Finally, due to its age, the system needed an upgrade of its scaling capacity in order to deliver the best possible user experience without failures and glitches.

Project Details

Our task was to upgrade the existing online chat platform into a modern, scalable system that can handle a large workload, and to arm it with relevant content moderation and fraud detection tools.

The expected result was a highly scalable system with effective moderation, anti-fraud tools, and superior data protection compliance.

Our primary goal was to migrate the system to the cloud. These days, the cloud is the most fitting solution for any platform with a big workload: it makes scaling and further development of the system easier.

The biggest task of the entire project was the development of the content moderation system. We needed a solution that would be effective but not overly intrusive into users’ conversations.

The development of the fraud detection system was the other big challenge. Online chat platforms are fraud-prone. In addition to spam, there were more elaborate phishing schemes that needed to be taken care of.

Last but not least was maintaining a high level of privacy and data safety. Due to the nature of the platform, there are always reasonable concerns about the privacy of conversations and user profiles. Because of this, we implemented a Data Loss Prevention (DLP) process that keeps personal data out of reach of malicious individuals and erases personal data from analytics.

Challenges

Scalability – BigQuery

An online chat platform is only as good as its scalability. Chat platforms host thousands of conversations at the same time, so the system needs to stay afloat while processing a large amount of data. In addition to this, it needs to be safe from glitches and crashes that negatively affect the user experience.

To provide the best possible environment for scalability, we decided to use the Google Cloud Platform. Its autoscaling features ensure smooth and reliable data processing.

Building a serverless data warehouse

BigQuery served as the platform’s serverless data warehouse: highly scalable, highly flexible, and simple to use. It also keeps database management simple.

Content Moderation Tools

We used Google Dataflow as the foundational element. The system is built around moderation guidelines that describe the do’s and don’ts of the platform and include specific patterns of hate speech and bullying that are unwelcome on the platform.

Overall, the system:

Monitors the conversations
Performs topic modeling
Classifies the topic and context of a conversation in case of an alarm
Determines whether abusive or obscene behavior is involved
Checks the image content
Blurs the image in case of obscenity
Bans the user in case of violence, spam, or other banned content

The important point was to avoid full-on user surveillance: the system acts only when the content of a conversation crosses the line and triggers the algorithm.
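The "act only when the line is crossed" idea can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the production pipeline: the trigger patterns, the `Verdict` type, the `moderate` function, and the action names are all hypothetical, and a real deployment would run logic like this inside a Dataflow job with trained models in place of keyword lists.

```python
import re
from dataclasses import dataclass
from typing import Optional

# Hypothetical trigger patterns standing in for the platform's moderation guidelines.
TRIGGER_PATTERNS = {
    "bullying": re.compile(r"\b(loser|nobody likes you)\b", re.IGNORECASE),
    "spam": re.compile(r"\b(free money|click here)\b", re.IGNORECASE),
}

# Hypothetical category-to-action mapping (an assumption for this sketch).
ACTIONS = {"bullying": "escalate", "spam": "ban"}


@dataclass
class Verdict:
    flagged: bool
    category: Optional[str] = None
    action: str = "none"


def moderate(message: str) -> Verdict:
    """Return a verdict; messages that match no trigger are left alone."""
    for category, pattern in TRIGGER_PATTERNS.items():
        if pattern.search(message):
            return Verdict(True, category, ACTIONS[category])
    return Verdict(False)
```

With this shape, ordinary messages pass through untouched, and only a message that matches a trigger is handed to the heavier classification steps.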

Spam & Fraud Detection Tools

Fraud is one of the most pressing issues of online chat platforms. Alongside toxic behavior and bullying, fraudulent activity is the third major issue plaguing anonymous online chats. While social engineering can never be fully controlled, it is possible to cut fraudsters off before the damage is done by implementing early warning systems.

In order to do this, we implemented an automated system of fraud detection. It is built upon a database of examples of fraudulent behavior which is used as a reference point for subsequent operations.

The solution includes:

Text classification: analyzing the content of messages for typical spam wording and potentially fraudulent messages.
Image classification: checking the images sent in conversations.
Anomaly-based bot detection: flagging users whose behavior falls into a bot-like spam pattern.
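To illustrate the anomaly-based part, here is a minimal sketch that flags users whose message rate is far above what is typical for the chat. It uses a modified z-score based on the median absolute deviation. The function name, the input format, and the 3.5 cutoff are assumptions made for this example; the original system's features and thresholds are not described.

```python
from statistics import median


def flag_bot_like_users(messages_per_minute, threshold=3.5):
    """Flag users whose message rate is an outlier on the high side.

    Uses the modified z-score: 0.6745 * (x - median) / MAD.
    """
    rates = list(messages_per_minute.values())
    if not rates:
        return []
    med = median(rates)
    mad = median(abs(r - med) for r in rates)
    if mad == 0:
        # No measurable spread, so there is no basis for calling anyone an outlier.
        return []
    return sorted(
        user
        for user, rate in messages_per_minute.items()
        if 0.6745 * (rate - med) / mad > threshold
    )


# Hypothetical per-user message rates.
rates = {"alice": 2, "bob": 3, "carol": 1, "dave": 2, "bot42": 60}
print(flag_bot_like_users(rates))  # ['bot42']
```

A median-based score is used here instead of the mean and standard deviation because a single very active bot would otherwise inflate the baseline it is being compared against.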

Integration of Image Recognition

Given the nature of this particular online chat platform, it was important to keep an eye on image content, as it could be one of the major channels for cyberbullying, scams, and obscene behavior.

Because of this, we implemented an image recognition system based on Google AutoML: a convolutional neural network (CNN) that classifies images and takes action when something violates the guidelines.

There are two services at play:

Google Vision API for general image recognition
Google AutoML Vision as a platform-specific solution.

Together, these services analyze the image content that is sent in conversations.

In cases where there is any semblance of gore or otherwise obscene content, the image is blurred.
In cases where images are accompanied by outright toxic behavior, with distinct patterns of hate speech and bullying, the user is banned.
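The two rules above can be expressed as a small decision function. The sketch below is an assumption-laden illustration: it takes likelihood labels named after the Vision API SafeSearch scale (`VERY_UNLIKELY` through `VERY_LIKELY`) as plain strings rather than calling the API, and the `decide_image_action` name, the chosen categories, and the `LIKELY` cutoff are hypothetical. It also assumes a ban applies when a flagged image arrives together with toxic text, which is one possible reading of the rule.

```python
# Likelihood scale, ordered from least to most likely.
LIKELIHOOD_ORDER = [
    "UNKNOWN", "VERY_UNLIKELY", "UNLIKELY", "POSSIBLE", "LIKELY", "VERY_LIKELY",
]

# Categories treated as obscene in this sketch (an assumption).
OBSCENE_CATEGORIES = ("adult", "violence", "racy")


def decide_image_action(labels, accompanied_by_toxic_text, blur_at="LIKELY"):
    """Map image labels and conversation context to 'allow', 'blur', or 'ban'."""
    cutoff = LIKELIHOOD_ORDER.index(blur_at)
    obscene = any(
        LIKELIHOOD_ORDER.index(labels.get(category, "UNKNOWN")) >= cutoff
        for category in OBSCENE_CATEGORIES
    )
    if obscene and accompanied_by_toxic_text:
        return "ban"
    if obscene:
        return "blur"
    return "allow"
```

Keeping the decision separate from the classifier call makes it easy to tune the cutoff or swap the general-purpose model for the platform-specific one without touching the policy.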

Superior Privacy – Data Loss Prevention

Maintaining privacy is one of the fundamental elements of an online chat platform. It is the foundation of trust and growth of the service.

Because of this, the system needs to be secure from any sort of breaches and other compromises of personal data.

To follow GDPR guidelines and maintain the appropriate level of privacy, we implemented a Data Loss Prevention tool.

This tool monitors content for sensitive information and removes it, so that individuals are not identifiable in the databases.
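As a rough illustration of what such redaction does, the sketch below replaces email addresses and phone numbers with placeholders before text is stored. It is a simplified stand-in, not the DLP tool itself: the regular expressions are deliberately naive, the `redact` function is hypothetical, and the placeholder names simply follow the `EMAIL_ADDRESS` and `PHONE_NUMBER` infoType naming used by Cloud DLP.

```python
import re

# Naive patterns for illustration only; a real DLP service uses far more robust detectors.
PATTERNS = {
    "EMAIL_ADDRESS": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    "PHONE_NUMBER": re.compile(r"\+?\d[\d\s().-]{7,}\d"),
}


def redact(text):
    """Replace each detected sensitive value with a placeholder naming its type."""
    for info_type, pattern in PATTERNS.items():
        text = pattern.sub(f"[{info_type}]", text)
    return text


print(redact("Mail me at jane.doe@example.com or call +1 555-123-4567."))
# Mail me at [EMAIL_ADDRESS] or call [PHONE_NUMBER].
```

Replacing values with typed placeholders, rather than deleting them outright, keeps the message readable for analytics while removing the identifying detail.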

Tech Stack

Google Cloud Platform
BigQuery
Google Dataflow
Image Recognition
Google Vision API
Google AutoML Vision

Personnel

Project Manager
Business Analyst
Data Engineer
QA

Conclusion

This project can be considered a huge accomplishment for our team. What started out as a relatively simple overhaul of the system slowly evolved into a series of complex solutions that bring the platform to an entirely new level.

We are especially proud of the flexible content moderation system, which keeps things civil in the chatroom without being overbearing or overly noticeable, and of the fraud detection system, which can handle various types of chat-based fraud.

The project was also a big milestone for our team. Over the years we have worked on different aspects of big data operations and developed many projects that involved data processing and analytics. However, this project gave us the chance to create an entire system from the ground up, integrate it with the existing infrastructure, and bring it all to a completely new level.

During the development of this project, we used more streamlined workflows, which made the whole turnaround much faster. Because of this, we deployed a working prototype of the system ahead of the planned date and dedicated more time to its testing and refinement.

Denys
