Big Data Essentials (Unix commands, DFS):

An internet search for a definition of big data returns something like this:

“Extremely large data sets that may be analysed computationally to reveal patterns, trends, and associations, especially relating to human behaviour and interactions.” Example usage: “much IT investment is going towards managing and maintaining big data”.

Let’s unpack that first. Every time you search for something, the search engine returns millions of results almost instantaneously. A pile of data that large cannot be stored on, or retrieved from, a single machine. That is big data.


Put simply, big data is larger, more complex data sets, especially from new data sources. These data sets are so voluminous that traditional data processing software just can’t manage them. But these massive volumes of data can be used to address business problems you wouldn’t have been able to tackle before.

The Three Vs of Big Data


Volume: the amount of data matters. With big data, you’ll have to process high volumes of low-density, unstructured data. This can be data of unknown value, such as Twitter data feeds, clickstreams on a webpage or a mobile app, or sensor-enabled equipment. For some organizations, this might be tens of terabytes of data. For others, it may be hundreds of petabytes.


Velocity is the fast rate at which data is received and (perhaps) acted on. Normally, the highest velocity of data streams directly into memory versus being written to disk. Some internet-enabled smart products operate in real time or near real time and will require real-time evaluation and action.


Variety refers to the many types of data that are available. Traditional data types were structured and fit neatly in a relational database. With the rise of big data, data comes in new unstructured data types. Unstructured and semistructured data types, such as text, audio, and video require additional preprocessing to derive meaning and support metadata.

Benefits of Big Data and Data Analytics:

  • Big data makes it possible for you to gain more complete answers because you have more information.
  • More complete answers mean more confidence in the data—which means a completely different approach to tackling problems.

Big Data Use Cases

Big data can help you address a range of business activities, from customer experience to analytics. Here are just a few.

Product Development 
Companies like Netflix and Procter & Gamble use big data to anticipate customer demand. They build predictive models for new products and services by classifying key attributes of past and current products or services and modeling the relationship between those attributes and the commercial success of the offerings. In addition, P&G uses data and analytics from focus groups, social media, test markets, and early store rollouts to plan, produce, and launch new products.

Predictive Maintenance 
Factors that can predict mechanical failures may be deeply buried in structured data, such as the equipment year, make, and model of a machine, as well as in unstructured data that covers millions of log entries, sensor data, error messages, and engine temperature. By analyzing these indications of potential issues before the problems happen, organizations can deploy maintenance more cost effectively and maximize parts and equipment uptime.

Customer Experience 
The race for customers is on. A clearer view of customer experience is more possible now than ever before. Big data enables you to gather data from social media, web visits, call logs, and other data sources to improve the interaction experience and maximize the value delivered. Start delivering personalized offers, reduce customer churn, and handle issues proactively.

Fraud and Compliance 
When it comes to security, it’s not just a few rogue hackers; you’re up against entire expert teams. Security landscapes and compliance requirements are constantly evolving. Big data helps you identify patterns in data that indicate fraud and aggregate large volumes of information to make regulatory reporting much faster.

Machine Learning 
Machine learning is a hot topic right now. And data—specifically big data—is one of the reasons why. We are now able to teach machines instead of program them. The availability of big data to train machine-learning models makes that happen.

Operational Efficiency 
Operational efficiency may not always make the news, but it’s an area in which big data is having the most impact. With big data, you can analyze and assess production, customer feedback and returns, and other factors to reduce outages and anticipate future demands. Big data can also be used to improve decision-making in line with current market demand.

Drive Innovation 
Big data can help you innovate by studying interdependencies between humans, institutions, entities, and processes and then determining new ways to use those insights. Use data insights to improve decisions about financial and planning considerations. Examine trends and what customers want to deliver new products and services. Implement dynamic pricing. There are endless possibilities.

UNIX Command Line Interface (CLI):

what a file system follows:

File managers commonly used by the community:

command line syntax:


ls: to list the files and directories stored in your current working directory.

du: to see how much disk space is used by the files in your current directory.

df: to see how much free space is left.
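
Here is a minimal sketch of the three commands for this step (ls, du and df), run inside a throwaway directory so nothing in your real files is touched. The file and directory names (notes.txt, logs) are made up for illustration, and the numbers printed by du and df will differ from machine to machine.

```shell
# Work in a scratch directory created just for this demo.
workdir=$(mktemp -d)
cd "$workdir"

# Create two sample entries (hypothetical names).
printf 'hello big data\n' > notes.txt
mkdir logs

# ls: list the files and directories in the current directory.
listing=$(ls)
echo "$listing"

# du: disk space used by the current directory
# (-s prints one summary line, -h uses human readable units).
du -sh .

# df: free space left on the file system that holds this directory.
df -h .
```

Passing "." to du and df limits the report to the current directory and to the file system it lives on, which is usually what you want when checking a working area.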

So basically, to work with the file system we use:

The next area is file system exploration:

  • mkdir: makes a new directory. Bash syntax: $ mkdir <your_dir_name>
  • cp: copies an existing file or directory. Bash syntax: $ cp <your_original_file> <duplicate_file>. Use -r to copy a directory.
  • mv: moves a file from one place to another (within one file system this is a rename; across file systems the file is copied to the target and the original is then deleted). Bash syntax: $ mv <your_original_file> <target_dir/new_file_name>
  • rm: removes a file or an entire directory. Bash syntax: $ rm <your_original_file>. Use -r to remove a directory.
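
The four commands above can be tried safely in a scratch directory. This is only a sketch: every name in it (raw_data, sample.csv, archive and so on) is hypothetical.

```shell
# Scratch directory; all names below are made up for the demo.
sandbox=$(mktemp -d)
cd "$sandbox"

# mkdir: make a new directory, then put a small file in it.
mkdir raw_data
printf 'id,value\n1,42\n' > raw_data/sample.csv

# cp: copy a single file; cp -r: copy a whole directory.
cp raw_data/sample.csv backup.csv
cp -r raw_data raw_data_copy

# mv: move a file into another directory and rename it on the way.
mkdir archive
mv backup.csv archive/sample_backup.csv

# rm: remove a single file; rm -r: remove a directory and its contents.
rm raw_data/sample.csv
rm -r raw_data_copy

# Show what is left.
ls -R
```

After the run only raw_data (now empty) and archive/sample_backup.csv remain. Note that rm does not ask for confirmation by default, so it is worth double-checking the path before using rm -r.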

file content exploration:

commands for processes:

Here is a full list of the commands:

  • awk: “Aho, Weinberger and Kernighan”, Bell Labs, 1970s. Interpreted programming language for text processing.
  • awk -F: (see above) + set the field separator.
  • cat: Display the contents of a file at the command line; also used to copy and/or append text files into a document. Named after its function to con-cat-enate files.
  • cd: Change the current working directory. Also known as chdir (change directory).
  • cd /: Change the current directory to the root directory.
  • cd ..: Change the current directory to the parent directory.
  • cd ~: Change the current directory to your home directory.
  • cp: Make copies of files and directories.
  • cp -r: Copy directories recursively.
  • cut: Extract sections of each line of input by bytes, characters, or fields, separated by a delimiter (the tab character by default).
  • cut -d -f: -d sets the delimiter to use instead of the tab character, -f selects only the listed fields (ex.: “cut -d ',' -f1 multilined_file.txt” selects only the first field from each comma-separated line in the file).
  • du: Estimate (and display) the file space usage: space used under a particular directory or by files on a file system.
  • df: Display the amount of disk space used and available on file systems.
  • df -h: Use human readable format.
  • free: Display the total amount of free and used memory (use vm_stat instead on macOS).
  • free -m: Display the amount of memory in megabytes.
  • free -g: Display the amount of memory in gigabytes.
  • grep: Process text and print any lines which match a regular expression (“global regular expression print”).
  • head: Print the beginning of a text file or piped data. By default, outputs the first 10 lines of its input to the command line.
  • head -n: Output the first n lines of input data (ex.: “head -n 5 multilined_file.txt”).
  • kill: Send a signal to a process. The default signal for kill is TERM (which will terminate the process).
  • less: Similar to more, but with the extended capability of allowing both forward and backward navigation through the file.
  • ls: List the contents of a directory.
  • ls -l: List the contents of a directory + use a long format, displaying Unix file types, permissions, number of hard links, owner, group, size, last-modified date and filename.
  • ls -lh: List the contents of a directory + print sizes in human readable format (e.g. 1K, 234M, 2G, etc.).
  • ls -lS: Sort by file size.
  • man: Display the manual pages which provide documentation about commands, system calls, library routines and the kernel.
  • mkdir: Create a directory on a file system (“make directory”).
  • more: Display the contents of a text file one screen at a time.
  • mv: Rename files or directories or move them to a different directory.
  • nice: Run a command with a modified scheduling priority.
  • ps: Provide information about the currently running processes, including their process identification numbers (PIDs) (“process status”).
  • ps -a: Select all processes except both session leaders and processes not associated with a terminal.
  • pwd: Abbreviated from “print working directory”, pwd writes the full pathname of the current working directory.
  • rm: Remove files or directories.
  • rm -r: Remove directories and their contents recursively.
  • sort: Sort the contents of a text file.
  • sort -r: Sort the output in reverse order, that is, reverse the result of comparisons.
  • sort -k: -k or --key=POS1[,POS2] starts a key at POS1 (origin 1) and ends it at POS2 (default: end of the line) (ex.: “sort -k2,2 multilined_file.txt”).
  • sort -n: Compare according to string numerical value.
  • tail: Print the tail end of a text file or piped data. By default, outputs the last 10 lines of its input to the command line.
  • tail -n: Output the last n lines of input data (ex.: “tail -n 2 multilined_file.txt”).
  • top: Produce an ordered list of running processes selected by user-specified criteria, and update it periodically.
  • touch: Update the access date and/or modification date of a file or directory, or create an empty file.
  • tr: Replace or remove specific characters in its input data set (“translate”).
  • tr -d: Delete characters, do not translate.
  • vim: A text editor (“vi improved”). It can be used for editing any kind of text and is especially suited for editing computer programs.
  • wc: Print a count of lines, words and bytes for each input file (“word count”).
  • wc -c: Print only the number of bytes (use -m for characters).
  • wc -l: Print only the number of lines.
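
Most of these commands become useful once they are chained together with pipes. The sketch below assumes a tiny made-up access log with three comma-separated fields (user, page, bytes); the file contents and variable names are purely illustrative. It also uses grep -c (count matching lines) and sort -t (set the field separator), which are standard options not listed above.

```shell
# Hypothetical sample data: user,page,bytes
data=$(mktemp)
cat > "$data" <<'EOF'
alice,/home,512
bob,/search,2048
alice,/search,1024
carol,/home,256
bob,/home,128
EOF

# wc -l: count the lines (tr -d strips the padding some wc versions add).
total_lines=$(wc -l < "$data" | tr -d ' ')

# grep -c: count the requests for the /home page.
home_hits=$(grep -c ',/home,' "$data")

# grep + awk -F: keep alice's rows, then sum the third field.
alice_bytes=$(grep '^alice,' "$data" | awk -F ',' '{sum += $3} END {print sum}')

# sort -k -n -r + head: largest request first, keep only the top line.
largest=$(sort -t ',' -k3,3 -nr "$data" | head -n 1)

# cut -d -f + sort + tail + tr: alphabetically last user, upper-cased.
last_user=$(cut -d ',' -f1 "$data" | sort | tail -n 1 | tr 'a-z' 'A-Z')

echo "lines=$total_lines home_hits=$home_hits alice_bytes=$alice_bytes"
echo "largest=$largest last_user=$last_user"
rm "$data"
```

Each command does one small job and hands its output to the next one. The same pattern of filter, extract, sort and aggregate comes back later when the data is too big for one machine.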

Storing Large Amounts of Data:

There are two options:

  • Scale up: increase the capacity of a single machine's hardware (not very practical, because one machine can only grow so far).
  • Scale out: increase the number of nodes, that is, add more machines (practical).
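
To get a feel for scaling out, here is a back-of-the-envelope calculation. Both figures are invented for the example, and real clusters also need room for replicas and temporary data, which this sketch ignores.

```shell
# All figures are hypothetical.
dataset_tb=100       # data to store, in terabytes
node_capacity_tb=8   # usable disk per machine, in terabytes

# Ceiling division: how many nodes are needed to hold the data set.
nodes_needed=$(( (dataset_tb + node_capacity_tb - 1) / node_capacity_tb ))

echo "nodes needed: $nodes_needed"
```

With these numbers the answer is 13 machines. When the data set grows, scaling out means raising that count, while scaling up would mean finding a single machine with a bigger disk.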

Where the DFS concept comes from:

What makes GFS a success:

What is the difference between GFS and HDFS:

HDFS is the open-source counterpart of GFS. Besides that, it is written in Java, whereas GFS is written in C++.

How you can read a file from HDFS:

Typical HDFS block system:

If a replica is in the finalized (frozen) state, you can read it from any DataNode that holds it and you will get the same content. In this case the generation stamp (GS) is the same on every replica.
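
The idea of blocks and identical replicas can be simulated on a single machine with plain Unix tools. This is a toy model, not HDFS: the directories datanode1 to datanode3 merely stand in for DataNodes, the 1024-byte block size is far smaller than anything a real cluster uses, and every block is copied to every node for simplicity.

```shell
# Simulation only: directories play the role of DataNodes, small files the role of blocks.
cluster=$(mktemp -d)
mkdir "$cluster/datanode1" "$cluster/datanode2" "$cluster/datanode3"

# Build a 2500-byte input file and split it into 1024-byte blocks
# (split names them blk_aa, blk_ab, blk_ac).
head -c 2500 /dev/zero | tr '\0' 'x' > "$cluster/input.dat"
split -b 1024 "$cluster/input.dat" "$cluster/blk_"

# Store a replica of every block on each simulated DataNode.
for node in datanode1 datanode2 datanode3; do
  cp "$cluster"/blk_* "$cluster/$node/"
done

block_count=$(ls "$cluster/datanode1" | wc -l | tr -d ' ')

# Reading the blocks in order from any one node reproduces the original file.
replicas_match=yes
for node in datanode1 datanode2 datanode3; do
  cat "$cluster/$node"/blk_* | cmp -s - "$cluster/input.dat" || replicas_match=no
done

echo "blocks=$block_count replicas_match=$replicas_match"
```

Since the copies are byte-for-byte identical, it does not matter which node serves a block. This mirrors the point above about finalized replicas returning the same content.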

In the next blog post we will get hands-on with architecting the NameNode, and then we will move on to Hive.
