Why doesn’t Netflix crash? – Data Processing in Motion
It’s that time! Our favourite web series has just been released on a streaming platform. Millions of users are waiting for it. Everyone starts streaming the moment it is available. But the platform can’t handle the millions of requests and crashes. The same doesn’t happen with Netflix – does it? Why doesn’t Netflix crash?
In 2018, Netflix managed 1.2 trillion records in a single day, equal to roughly 7 petabytes of data. A platform like Netflix can drive such traffic with no hiccups thanks to a mature engineering team, which designed and implemented metadata streaming on Kafka that handles millions of records per second.
Kafka is horizontally scalable and can be scaled across hundreds of clusters; a well-tuned Kafka deployment can handle trillions of records, the equivalent of multiple petabytes of data.
What is Kafka, and why is it so popular?
Kafka is a distributed event store and stream-processing platform. It is an open-source system maintained under the Apache Software Foundation and provides a unified, high-throughput, low-latency platform for the real-time handling of data feeds.
Kafka can connect to external systems to import and export data via Kafka Connect. It works on a publish-subscribe mechanism built on a commit log, allowing users to publish data to, and subscribe to it from, various systems and real-time applications.
Kafka delivers messages at network-limited throughput using a cluster of machines, with latencies as low as 2 ms. It scales to a thousand brokers, trillions of messages per day, petabytes of data, and hundreds of thousands of partitions.
A few organisations using Kafka are Netflix, LinkedIn and Uber.
Background
Kafka was originally developed at LinkedIn and open-sourced in early 2011. Jay Kreps, Neha Narkhede and Jun Rao co-created Kafka, which went through the Apache Incubator program and graduated on 23 October 2012.
Jay Kreps named the software after the author Franz Kafka because it is “a system optimised for writing”, and he liked Kafka’s work.
Third Generation Kafka
In addition to Apache Kafka, a fully managed Kafka platform is provided by Confluent. This means we now have three generations of Kafka:
- The first generation is open-source Apache Kafka, launched in 2010.
- The same team founded Confluent in 2014, offering an enterprise-grade Kafka platform.
- Later, Confluent launched a cloud solution: fully managed Kafka and infrastructure services that relieve customers of mundane operational work so they can focus on delivering business value. Confluent expands the benefits of Kafka with enterprise-grade features while removing the burden of Kafka management and monitoring.
Today, over 80% of the Fortune 100 are powered by data streaming technology – and the majority of those leverage Confluent.
By integrating historical and real-time data into a single, central source of truth, Confluent makes it easy to:
- Build an entirely new category of modern, event-driven applications
- Gain a universal data pipeline
- Unlock powerful new use cases with full scalability, performance, and reliability.
Swamped with data
In today’s world, we generate tons of data. Everything produces it: a bank transaction, the steps recorded by a fitness band, a running server, a taxi booking.
Storing, processing and enriching this data is a big challenge, and companies can spend a considerable chunk of time and money developing applications to do it.
What if this could be done within one tool? What if filtering, segregation and enrichment could happen on data in motion? It would save a lot of IT cost. This is possible in Kafka using ksqlDB, which frees you from building and managing a fleet of microservices. ksqlDB is covered later in this blog.
Real-Time Data beats Slow Data
In today’s world everything is moving to digital platforms, so data should be available in real time.
Imagine an immigration agent at the airport who doesn’t have real-time data and lets the wrong person enter the country. The person is dangerous to society, but the agent only realises this when the IT system processes its next batch job – too late, as the person left the airport hours earlier.
Similarly, the app-based taxi model is built on real-time data, providing information such as cost and waiting time. This data must be processed in motion; surge charges are also calculated from it. For Uber, real-time information is the backbone of the service.
Quality beats quantity
Imagine Uber sending you the data of every car in the city as raw GPS coordinates, irrespective of your location and requirements. The data is correct, but as a user you don’t need car positions in coordinate format, and you only need data for the area you are looking at. In this scenario we need enriched data, where coordinates are converted into actual street locations.
The example below shows the data of all the trains running at a specific time. That is a lot of data – but does a single user need all of it? Wouldn’t it make more sense for the user to receive only the next two trains that will take them to their destination? That information can be generated from very little data, enriched appropriately to the user’s requirements, and is far more useful.
KSQL
ksqlDB is a database purpose-built for stream processing applications; it lets you work with real-time data streams using just a few SQL statements.
It makes data immediately actionable by continuously processing data streams generated throughout the business.
It offers a single solution for collecting data streams, enriching them, and serving queries on new derived streams and tables. Fewer moving parts in the data architecture mean less infrastructure to deploy, maintain, scale, and secure.
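For example – a minimal, hedged sketch with hypothetical names – two statements register a stream over a Kafka topic and continuously derive a filtered stream from it:

-- Hypothetical source stream; the topic and columns are illustrative assumptions.
CREATE STREAM PAGEVIEWS (USER_ID VARCHAR, PAGE_ID VARCHAR, VIEW_TIME_MS BIGINT)
  WITH (KAFKA_TOPIC='pageviews', VALUE_FORMAT='JSON');

-- A continuous query: every new event is filtered as it arrives.
CREATE STREAM IMPORTANT_VIEWS AS
  SELECT USER_ID, PAGE_ID
  FROM PAGEVIEWS
  WHERE VIEW_TIME_MS > 10000
  EMIT CHANGES;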
Problem Statement
To process and enrich data, we typically deploy microservices or applications written in a programming language such as C, Java or Python.
Writing this code increases the project’s cost, not only in development time but also because these applications must be hosted on infrastructure, which incurs a high upfront charge. After the application is deployed, there are operational costs as well.
Proposed approach
Using Confluent Kafka
Through familiar SQL syntax, we can build real-time applications with the same ease as building traditional apps on a relational database.
ksqlDB is built on top of Kafka Streams, a lightweight, powerful Java library for enriching, transforming, and processing real-time data streams. Having Kafka Streams at its core means ksqlDB is built on well-designed and easily understood layers of abstractions.
You can query the tables and streams by continuously subscribing to changing query results as new events occur (push queries) or looking up results at a point in time (pull queries), removing the need to integrate separate systems to serve each.
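For instance, the two query styles look like this (a hedged sketch; USERS_BY_ID is a hypothetical table materialised from a stream, since pull queries run against materialised state):

-- Push query: subscribes to results and emits changes as new events arrive.
SELECT * FROM TEST_ORIGINAL EMIT CHANGES;

-- Pull query: looks up the current value for one key at a point in time.
SELECT * FROM USERS_BY_ID WHERE USER_ID = 'u123';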
Here we take an example where data arrives from a server through a source connector. Once the data lands in the source Kafka topic, we filter it, keeping only the required records in streams. The filtered data can then be sent to different topics based on business logic.
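Before filtering, the incoming topic must be registered as a stream. The schema below is an illustrative assumption (the original data model is not shown here), chosen so that the ADDRESS->STATE expressions in the following statements resolve:

-- Register a stream over the topic delivered by the source connector.
-- Topic name, format and columns are assumptions for illustration.
CREATE STREAM TEST_ORIGINAL (
  USER_ID VARCHAR,
  LOG_TYPE VARCHAR,
  MESSAGE VARCHAR,
  ADDRESS STRUCT<STREET VARCHAR, STATE VARCHAR>
) WITH (KAFKA_TOPIC='source_topic', VALUE_FORMAT='JSON');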
We can create two streams, one per piece of business logic, to separate the data.
The first stream filters the data on the state name “LA”:
CREATE STREAM TEST_LA WITH (KAFKA_TOPIC='Elastic_topic', PARTITIONS=1, REPLICAS=1) AS
  SELECT *
  FROM TEST_ORIGINAL
  WHERE ADDRESS->STATE = 'LA'
  EMIT CHANGES;
The second stream filters the data on the state name “CA”:
CREATE STREAM TEST_CA WITH (KAFKA_TOPIC='Splunk_topic', PARTITIONS=1, REPLICAS=1) AS
  SELECT *
  FROM TEST_ORIGINAL
  WHERE ADDRESS->STATE = 'CA'
  EMIT CHANGES;
As we know, enriched data makes more sense. After segregation, the data can be enriched within ksqlDB through a KTable or another stream, as sketched below.
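As a hedged sketch of what that enrichment could look like – the USER_PROFILES table, its columns and the join key are hypothetical – a stream can be joined to a table to attach reference data to each event:

-- Reference data kept as a table (latest value per key).
CREATE TABLE USER_PROFILES (
  USER_ID VARCHAR PRIMARY KEY,
  NAME VARCHAR,
  CITY VARCHAR
) WITH (KAFKA_TOPIC='user_profiles', VALUE_FORMAT='JSON');

-- Enrich each filtered event with the matching profile.
CREATE STREAM TEST_LA_ENRICHED AS
  SELECT s.USER_ID, s.MESSAGE, s.ADDRESS, p.NAME, p.CITY
  FROM TEST_LA s
  JOIN USER_PROFILES p ON s.USER_ID = p.USER_ID
  EMIT CHANGES;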
Different topics can then be sent to different destinations through Kafka sink connectors. In this scenario we send user logs to Elasticsearch and machine logs to Splunk, and each downstream system can then use the data independently; a connector sketch follows.
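A hedged sketch of the Elasticsearch side, created directly from ksqlDB (this assumes ksqlDB is configured to manage connectors and the Elasticsearch sink connector plugin is installed; host and topic values are illustrative):

-- Stream the filtered topic into Elasticsearch.
CREATE SINK CONNECTOR ELASTIC_SINK WITH (
  'connector.class' = 'io.confluent.connect.elasticsearch.ElasticsearchSinkConnector',
  'topics'          = 'Elastic_topic',
  'connection.url'  = 'http://elasticsearch:9200',
  'key.ignore'      = 'true'
);

-- The Splunk side follows the same pattern, pointing 'topics' at
-- Splunk_topic and using the Splunk sink connector class instead.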
The example above can also be applied to unstructured data by parsing it with regular expressions.
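For example – a hedged sketch, with a hypothetical raw_logs topic and log format – ksqlDB’s REGEXP_EXTRACT function can pull fields out of unstructured lines:

-- Each record is a single unstructured text line.
CREATE STREAM RAW_LOGS (LINE VARCHAR)
  WITH (KAFKA_TOPIC='raw_logs', VALUE_FORMAT='KAFKA');

-- Extract structured fields with regular expressions.
CREATE STREAM PARSED_LOGS AS
  SELECT
    REGEXP_EXTRACT('^(INFO|WARN|ERROR)', LINE, 1) AS LOG_LEVEL,
    REGEXP_EXTRACT('user=([a-zA-Z0-9_]+)', LINE, 1) AS USER_ID
  FROM RAW_LOGS
  EMIT CHANGES;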
Summary
In the above example, we saved around 60 per cent in infrastructure, development and operations costs. Previously, 100 GB of log data was stored in both Elastic and Splunk daily, so the whole system used 200 GB of storage per day, and 2 x 1.6 million transactions reached Splunk and Elastic.
After the POC, we discarded 30 GB of unnecessary logs within Kafka. The remaining 70 GB were segregated into different topics for user logs and machine logs: user logs occupied approximately 15 GB in Elasticsearch, and machine logs occupied 55 GB in Splunk. Overall, daily volume dropped from 200 GB to 70 GB, a disk storage reduction of approximately 65 per cent.
About Skillfield
Skillfield is a data services and cyber security company empowering organisations to excel in the digital era.
Contact us if data platform challenges or data complexity issues prevent your organisation from making data-driven decisions. Let’s talk about how we can help.
Author: Abid Ali
References
- En.wikipedia.org. 2022. Apache Kafka – Wikipedia. [online] Available at: <https://en.wikipedia.org/wiki/Apache_Kafka> [Accessed 25 May 2022].
- Docs.confluent.io. 2022. What is Confluent Platform? | Confluent Documentation. [online] Available at: <https://docs.confluent.io/platform/current/platform.html> [Accessed 25 May 2022].
- Theage.com.au. 2022. Watch Melbourne’s public transport system move in real time. [online] Available at: <https://www.theage.com.au/national/victoria/watch-melbournes-public-transport-system-move-in-real-time-20150506-ggvt0i.html> [Accessed 25 May 2022].
- Google Maps. 2022. Google Maps. [online] Available at: <https://www.google.com/maps> [Accessed 25 May 2022].
- DiDi Australia. 2022. Rider – DiDi Australia. [online] Available at: <https://australia.didiglobal.com/rider/> [Accessed 25 May 2022].
- Kai Waehner. 2022. Comparison of Open Source Apache Kafka vs Vendors including Confluent, Cloudera, Red Hat, Amazon MSK – Kai Waehner. [online] Available at: <https://www.kai-waehner.de/blog/2021/04/20/comparison-open-source-apache-kafka-vs-confluent-cloudera-red-hat-amazon-msk-cloud/> [Accessed 25 May 2022].
- Docs.confluent.io. 2022. ksqlDB Overview | Confluent Documentation. [online] Available at: <https://docs.confluent.io/platform/current/ksqldb/index.html> [Accessed 25 May 2022].
- Skillfield. 2022. Monitoring Kubernetes and Docker Container Logs. [online] Available at: <https://skillfield.com.au/monitoring-kubernetes-and-docker-container-logs/>