Clear Project: Real-time Data Analytics & Content Moderation
Project Background
The client ran an online chat platform that had been active for a long time and needed a major overhaul to stay viable.
The primary issue was that the platform had become increasingly hard to manage. Its moderation tools were obsolete, which allowed toxicity in the chatrooms to grow steadily.
To make the chatrooms less toxic and more civil, the content moderation system needed a major overhaul. The main priorities were the detection of bullying and obscene content.
The other major problem was scams, fraud, and bot activity. As a niche public communication service, the platform needed to keep such activity out of the system.
In addition to this, there were reasonable concerns about the privacy of conversations and user profiles due to the threat of cyberbullying and hacking by means of social engineering.
Finally, due to its age, the system needed more scaling capacity in order to deliver a good user experience without failures or glitches.
Project Details
Our task on the project can be described as follows:
- Upgrade the existing online chat platform into a modern, scalable system that can handle a large workload, and equip it with relevant content moderation and fraud detection tools.
- Deliver a highly scalable system with effective moderation, anti-fraud tools, and strong data protection compliance.
Our primary goal was to migrate the system to a cloud platform. The cloud suits platforms with a heavy workload because it simplifies both scaling and further development of the system.
The biggest task of the entire project was the development of the content moderation system. We needed a specific solution that would be effective, but not overly intrusive into users' conversations.
The development of the fraud detection system was the other big challenge. Online chat platforms are fraud-prone. In addition to spam, there were more elaborate phishing schemes that needed to be taken care of.
Last but not least was maintaining a high level of privacy and data safety. Due to the nature of the platform, there are always reasonable concerns about the privacy of conversations and user profiles. Because of this, we implemented a Data Loss Prevention (DLP) tool that keeps personal data out of reach of malicious individuals and removes personal data from analytics.
Challenges
Scalability – BigQuery
An online chat platform is only as good as its ability to scale. Because chat platforms host thousands of conversations at the same time, the system needs to keep running while processing a large amount of data. In addition to this, the system needs to be safe from glitches and crashes that negatively affect the user experience.
To provide the best possible environment for scaling, we chose Google Cloud Platform. Its autoscaling features support smooth and reliable data processing.
Building a serverless data warehouse
We used BigQuery as the serverless data warehouse. It is highly scalable, flexible, and simple to use, which keeps data management straightforward.
Content Moderation Tools
We used Google Dataflow as the foundation of the content moderation system. The system is built around moderation guidelines that describe the dos and don'ts of the platform and include specific patterns of hate speech and bullying that are unwelcome on the platform.
Overall, the system:
- Monitors the conversations
- Performs topic modeling
- Classifies the topic and context of a conversation when an alert is raised
- Determines whether abusive or obscene behavior is involved
- Checks the image content
- Blurs the image if it is obscene
- Bans the user in case of violence, spam, or other banned content
An important requirement was to avoid full-on user surveillance: the deeper analysis is triggered only when the content of a conversation crosses the line set by the guidelines.
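The trigger-only approach can be illustrated with a minimal Python sketch. It assumes that a cheap pattern screen runs on every message and that the heavier steps (topic modeling, classification) run only on messages that match. The pattern list, the `Verdict` type, and the function names are hypothetical and chosen for illustration; this is not the production implementation.

```python
import re
from dataclasses import dataclass
from typing import List

# Hypothetical guideline patterns; real ones would come from the
# platform's moderation guidelines.
GUIDELINE_PATTERNS = [
    re.compile(r"\bnobody likes you\b", re.IGNORECASE),
    re.compile(r"\byou are worthless\b", re.IGNORECASE),
]


@dataclass
class Verdict:
    escalate: bool       # True if deeper analysis should run
    matched: List[str]   # patterns that triggered the escalation


def screen_message(text: str) -> Verdict:
    """Cheap first pass: escalate only when a guideline pattern matches."""
    matched = [p.pattern for p in GUIDELINE_PATTERNS if p.search(text)]
    return Verdict(escalate=bool(matched), matched=matched)


def messages_to_review(messages: List[str]) -> List[str]:
    """Return only the messages that need deeper analysis."""
    return [m for m in messages if screen_message(m).escalate]
```

In this sketch, a message that matches no pattern is never passed to the deeper analysis, which is one way to keep moderation from turning into blanket surveillance.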
Spam & Fraud Detection Tools
Fraud is one of the most pressing issues for online chat platforms. Alongside toxic behavior and bullying, fraudulent activity is the third major issue plaguing anonymous online chats. Although social engineering can never be fully controlled, early warning systems make it possible to cut fraudsters off before the damage is done.
To do this, we implemented an automated fraud detection system. It is built on a database of examples of fraudulent behavior, which serves as a reference point for subsequent operations.
The solution includes:
- Text classification: analyzing the content of messages for typical spam wording and potentially fraudulent messages.
- Image classification
- Anomaly-based bot detection: flagging a user whose behavior falls into a bot-like spam pattern.
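As an illustration of the anomaly-based idea, the sketch below flags users whose message rate is far above what is typical on the platform. The choice of signal (messages per minute), the median/MAD rule, and the parameter `k` are assumptions made for this example; the source does not describe which features or model the actual system uses.

```python
from statistics import median
from typing import Dict, Set


def flag_bot_like(msgs_per_minute: Dict[str, float], k: float = 5.0) -> Set[str]:
    """Flag users whose message rate is far above the typical rate.

    Uses the median and the median absolute deviation (MAD), which are
    less distorted by extreme values than the mean and standard deviation.
    """
    if not msgs_per_minute:
        return set()
    rates = list(msgs_per_minute.values())
    typical = median(rates)
    mad = median([abs(r - typical) for r in rates])
    # Floor the spread so identical rates do not collapse the threshold.
    threshold = typical + k * max(mad, 1.0)
    return {user for user, rate in msgs_per_minute.items() if rate > threshold}


# Hypothetical usage: one account posts far more often than the rest.
rates = {"ann": 2.0, "bob": 3.0, "cy": 2.5, "dee": 4.0, "x9": 120.0}
suspects = flag_bot_like(rates)
```

A rule like this only provides an early warning; in practice it would be combined with the text and image classifiers listed above.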
Integration of Image Recognition
Given the nature of this particular online chat platform, it was important to keep an eye on image content, as images can be one of the major vehicles for cyberbullying, scams, and obscene behavior.
Because of this, we implemented a CNN-based image recognition system built on Google AutoML that classifies images and takes action when something violates the guidelines.
There are two services at play:
- Google Vision API for general image recognition
- Google AutoML Vision as a platform-specific solution.
Together, these services analyze the image content that is sent in conversations.
- If an image shows any sign of gore or otherwise obscene content, the image is blurred.
- If images are accompanied by pervasive toxic behavior with distinct patterns of hate speech and bullying, the user is banned outright.
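The two rules above can be read as a small decision function. The following sketch assumes that the image classifier and the text moderation step have already produced two boolean signals; the names `Action`, `decide`, `image_flagged`, and `conversation_abusive` are hypothetical, and the exact combination logic is an interpretation of the rules, not a confirmed detail of the system.

```python
from enum import Enum


class Action(Enum):
    ALLOW = "allow"
    BLUR = "blur"
    BAN = "ban"


def decide(image_flagged: bool, conversation_abusive: bool) -> Action:
    """Map upstream classifier outputs to a moderation action.

    Assumption: both inputs are booleans produced upstream (image
    classification and text moderation); score thresholds are out of scope.
    """
    if image_flagged and conversation_abusive:
        return Action.BAN    # flagged image plus hate speech or bullying
    if image_flagged:
        return Action.BLUR   # gore or obscene content on its own
    return Action.ALLOW
```

Abusive text without a flagged image returns `ALLOW` here only because this function covers the image path; text-only violations would be handled by the content moderation tools described earlier.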
Superior Privacy – Data Loss Prevention
Maintaining privacy is one of the fundamental elements of an online chat platform. It is the foundation of trust and growth of the service.
Because of this, the system needs to be secure against breaches and any other compromise of personal data.
To comply with GDPR and maintain the appropriate level of privacy, we implemented a Data Loss Prevention tool.
The tool scans content for sensitive information and removes it, so that individuals cannot be identified from the data stored in databases.
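To show the kind of transformation involved, here is a minimal Python sketch that replaces email addresses and phone numbers with placeholders before a message is stored for analytics. It is a regex-based stand-in written for illustration, not the Data Loss Prevention tool itself; the patterns are deliberately simple, and the placeholder tokens and the `redact` function are assumptions.

```python
import re

# Deliberately simple patterns for illustration; a dedicated DLP service
# covers many more data types and formats.
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE = re.compile(r"\+?\d[\d\s().-]{7,}\d")


def redact(text: str) -> str:
    """Replace email addresses and phone numbers with placeholder tokens."""
    text = EMAIL.sub("[EMAIL]", text)
    return PHONE.sub("[PHONE]", text)


# Hypothetical usage on a chat message before it reaches analytics storage.
clean = redact("Mail me at jane.doe@example.com or call +1 (555) 123-4567")
```

Redacting before storage means the analytics tables never hold the raw identifiers, which matches the goal of keeping personal data out of analytics.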
Tech Stack
- Google Cloud Platform
- BigQuery
- Google Dataflow
- Image Recognition
- Google Vision API
- Google AutoML Vision
Personnel
- Project Manager
- Business Analyst
- Data Engineer
- QA
Conclusion
This project was a major milestone for our team. What started out as a relatively simple overhaul of the system gradually evolved into a series of complex solutions that bring the platform to an entirely new level.
We are especially proud of:
- The flexible content moderation system, which keeps things civil in the chatroom without being overbearing or overly noticeable
- The fraud detection system, which handles various types of chat-based fraud
Over the years we have worked on different aspects of big data operations and developed many projects that involved data processing and analytics. However, this project gave us the chance to create an entire system from the ground up and integrate it with the existing infrastructure.
During the development of this project, we adopted more streamlined workflows that shortened the overall turnaround. As a result, we deployed a working prototype of the system ahead of schedule and dedicated more time to its testing and refinement.