Real Time Big Data Processing

Let’s first try to answer the following:


First, you need to know what Big Data means in the context of streaming services:

Big data follows the 3 V’s: velocity, variety, and volume. It is quite understandable that processing this huge amount of data is not possible with traditional software. Still, let’s see how we might process it via the traditional approach:
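To make the traditional approach concrete, here is a minimal batch-job sketch: all records are accumulated first, then processed in a single pass. The records and function names are illustrative, not from any real pipeline.

```python
# Hypothetical accumulated batch of weather readings.
records = [
    {"city": "Paris", "temp": 21},
    {"city": "Paris", "temp": 23},
    {"city": "Oslo", "temp": 11},
]

def batch_average(records):
    """Compute per-city average temperature over the whole batch at once."""
    totals = {}
    for r in records:
        city = r["city"]
        total, count = totals.get(city, (0, 0))
        totals[city] = (total + r["temp"], count + 1)
    return {city: total / count for city, (total, count) in totals.items()}

print(batch_average(records))  # {'Paris': 22.0, 'Oslo': 11.0}
```

Note that the answer only exists after the whole batch has been collected; that collection time is exactly the lag discussed next.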

Now let’s understand the cons of this method. A batch is a subset of the whole data accumulated over time, so it is a lagged representation of the real-time data, and the more lag, the more value we lose. Use cases like weather forecasting or a news index need to be handled quickly for the business output to have any meaning:

What we want to achieve by moving away from batches is stream processing:

But why are we so concerned with the lag (which is on the order of hours for batch processing)? It comes down to the business case. For something like new-friend recommendations on Facebook or a targeted offer on LinkedIn, a lag of minutes is enough: dial a random contact from your phone, sign in to Facebook with your contacts synced, and that friend’s profile will appear in your recommendations almost instantly. Or consider a special offer on LinkedIn targeted at a specific cluster of people. An example where even lower lag (seconds) matters is ticket booking: visit MakeMyTrip, and the next time you log on to the internet, the ad will already be hunting you. So from a business perspective, eliminating the DFS storage step and reducing the processing lag is a must.
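The streaming idea above can be sketched in a few lines: instead of waiting for a full batch, the state is updated as each record arrives, so the answer is always current. The stream and class names here are illustrative.

```python
class RunningAverage:
    """Incrementally maintained average: one update per arriving event."""

    def __init__(self):
        self.total = 0.0
        self.count = 0

    def update(self, value):
        self.total += value
        self.count += 1
        return self.total / self.count

# A stand-in for an unbounded event stream (e.g. temperature readings).
stream = iter([21, 23, 11])

avg = RunningAverage()
for temp in stream:
    current = avg.update(temp)  # always up to date, no batch lag
```

The key contrast with the batch sketch is that `current` is valid after every single event, not only once the whole dataset has landed.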

Now let’s move on to delivery semantics:

Delivery semantics concerns the integrity of data when it is moved from storage A to storage B. In the process, data most often gets either duplicated or lost. We can’t change this, but it is possible to guarantee compliance under weaker conditions. This is not a bug but a feature, and these guarantees are what we call semantics.

In theory, when we work with delivery semantics we are working with atomic pieces of data, which we call messages.

Different semantics

1. At most once:

Guarantees no duplication, but messages can be lost. Used where minimal data loss is acceptable but duplication is not, e.g. fraud detection.
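A minimal sketch of at-most-once delivery: the sender fires and forgets, so a failed send means the message is simply gone, but nothing is ever retried and therefore nothing can arrive twice. The lossy link here is simulated with a deterministic drop rule.

```python
received = []

def lossy_send(msg, seq):
    """Deliver msg unless the (simulated) link drops it; never retry."""
    if seq % 3 == 0:   # pretend every third transmission fails
        return         # lost for good: at-most-once does not resend
    received.append(msg)

for i in range(10):
    lossy_send(f"msg-{i}", i)

print(received)  # some messages are missing, but none is repeated
```

Four of the ten messages (seq 0, 3, 6, 9) are dropped, and the received list contains no duplicates.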

2. At least once:

No message will be lost, but messages can be duplicated. Used in cases such as counting unique site visitors.
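At-least-once can be sketched as a sender that retries until it sees an acknowledgement. When the ack (not the message) is lost, the message is delivered again, so duplicates are possible but nothing is ever lost. The lost ack is simulated deterministically here.

```python
received = []
ack_drops = {2}  # pretend the first ack for msg-2 gets lost in transit

def deliver(msg, key):
    """The message always arrives; the ack for it may not."""
    received.append(msg)
    if key in ack_drops:
        ack_drops.discard(key)
        return False   # ack lost: sender will not know delivery succeeded
    return True

def send_at_least_once(i):
    while not deliver(f"msg-{i}", i):
        pass  # no ack seen: resend, duplicating the message

for i in range(5):
    send_at_least_once(i)

print(received)  # msg-2 appears twice; every message appears at least once
```

Every message is present, and msg-2 shows up twice because its first ack was lost; this duplicate is exactly what exactly-once semantics must deal with.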

3. Exactly once:

No message is lost and no message is duplicated. Since guaranteeing this directly over an unreliable network is hard, the practical question becomes: how can we turn at-least-once delivery into exactly-once?
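One common answer is deduplication on the consumer side: keep the ids of messages already processed and ignore redeliveries, so the at-least-once duplicates have no effect. A minimal sketch, with illustrative ids and values:

```python
seen_ids = set()  # ids of messages already processed
state = 0         # some downstream aggregate, e.g. a running sum

def process_once(msg_id, value):
    """Apply value to state unless this message id was seen before."""
    global state
    if msg_id in seen_ids:
        return  # duplicate redelivery from at-least-once: skip it
    seen_ids.add(msg_id)
    state += value

# An at-least-once feed: messages 1 and 2 are redelivered.
for msg_id, value in [(1, 10), (2, 20), (1, 10), (3, 30), (2, 20)]:
    process_once(msg_id, value)

print(state)  # 60, as if each message had been processed exactly once
```

The result is exactly-once *processing* built on at-least-once *delivery*; in a real system the `seen_ids` set would itself need to be stored durably alongside the state.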

Approaches:

Event Based Approach:

Micro-batch Approach:

Lambda And Kappa Architecture:

Data Storage:

In the next post we will explore the same topics with examples and notebooks.
