Become More Productive By Doing Command line Data science(part-II)

If you haven’t seen the first post see it here to get familiar with the topic and also get some hands on with linux command and pipeline . In this post we will talk about the more advance and really useful utils of command line like wise making reusable bash commands , web scrapping an many more .

we use a lot of commands and pipelines that basically fit on one line. Let us call those one-liners. Being able to perform complex tasks with just a one-liner is what makes the command line powerful. It’s a very different experience from writing traditional programs.

Some tasks you perform only once, and some you perform more often. Some tasks are very specific and others can be generalized. If you foresee or notice that you need to repeat a certain one-liner on a regular basis, it is worthwhile to turn this into a command-line tool of its own. So, both one-liners and command-line tools have their uses. Recognizing the opportunity requires practice and skill. The advantage of a command-line tool is that you do not have to remember the entire one-liner and that it improves readability if you include it into some other pipeline.

We believe that creating reusable command-line tools makes you a more efficient and productive data scientist in the long run. You gradually build up your own data science toolbox from which you can draw existing tools and apply it to problems you have encountered previously. It requires practice in order to be able to recognize the opportunity to turn a one-liner or existing code into a command-line tool.

In order to turn a one-liner into a shell script, we need to use some shell scripting. We shall only demonstrate the usefulness a small subset of concepts from shell scripting. This subset includes variables, conditionals, and loops.

you’ll learn how to:

  • Convert one-liners into shell scripts.
  • Make existing Python, R, and Java code part of the command line.

 Converting One-liners into Shell Scripts

for example see the command line one liner

this one-liner returns the top ten words of the text file i downloaded from the web with url given above the corresponding process is

  • Downloading an text file using curl.
  • Converting the entire text to lowercase using tr (Meyering 2012c).
  • Extracting all the words using grep (Meyering 2012a) and put each word on separate line.
  • Sort these words in alphabetical order using sort (Haertel and Eggert 2012).
  • Remove all the duplicates and count how often each word appears in the list using uniq(Stallman and MacKenzie 2012b).
  • Sort this list of unique words by their count in descending order using sort.
  • Keep only the top 10 lines (i.e., words) using head.

Each command-line tool used in this one-liner offers a man page. So in case you would like to know more about, say, grep, you can run man grep from the command line . like if you run man grep you will find information like this:

Now there is nothing wrong in run a one liner just once but imagine that we wanted the top 10 words of a news website on a hourly basis. In those cases, it would be best to have this one-liner as a separate building block that can be part of something bigger. Because we want to add some flexibility to this one-liner in terms of parameters, we will turn it into a shell script.

Because we use Bash as our shell, the script will be written in the programming language Bash. This allows us to take the one-liner as the starting point, and gradually improve on it. To turn this one-liner into a reusable command-line tool, we’ll walk you through the following six steps:

  1. Copy and paste the one-liner into a file.
  2. Add execute permissions.
  3. Define a so-called shebang.
  4. Remove the fixed input part.
  5. Add a parameter.
  6. Optionally extend your PATH.

1: Copy and Paste

Now create a file named top_10_words.sh by touch command then open it by nano command to edit you can also use vi editor

After copy paste the previous command to your shell script run it on terminal by bash command and you will get same result as previous

2: Add Permission to Execute

The reason we cannot execute our file directly is that we do not have the correct access permissions. In particular, you, as a user, need to have the permission to execute the file. In this section we change the access permissions of our file.

In order to work with the file lets first copy the file with cp command and change 10 to 11 to make a new file.

both the files arrived

To change the access permissions of a file, we need to use a command-line tool called chmod(MacKenzie and Meyering 2012a), which stands for change mode. It changes the file mode bits of a specific file. The following command gives the user, you, the permission to execute top-words-2.sh:

$ cd ~/directory_path/
$ chmod u+x top_11_words.sh

The command-line argument u+x consists of three characters: (1) u indicates that we want to change the permissions for the user who owns the file, which is you, because you created the file; (2) + indicates that we want to add a permission; and (3) x, which indicates the permissions to execute. Let us now have a look at the access permissions of both files:

3: Define Shebang

Although we can already execute the file on its own, we should add a so-called shebang to the file. The shebang is a special line in the script, which instructs the system which executable should be used to interpret the commands.

In our case we want to use bash to interpret our commands. Example shows what the file top-words.sh looks like with a shebang.

#!/usr/bin/env bash
curl -s http://www.gutenberg.org/files/76/76-0.txt |
tr '[:upper:]' '[:lower:]' | grep -oE '\w+' | sort |
uniq -c | sort -nr | head -n 10

The name shebang comes from the first two characters: a hash (she) and an exclamation mark (bang). It is not a good idea to leave it out, as we have done in the previous step, because then the behavior of the script is undefined. The Bash shell, which is the one that we are using, uses the executable /bin/sh by default. Other shells may have different defaults.

Sometimes you will come across scripts that have a shebang in the form of !/usr/bin/bash or !/usr/bin/python (in the case of Python, as we will see in the next section). While this generally works, if the bash or python (Python Software Foundation 2014) executables are installed in a different location than /usr/bin, then the script does not work anymore. It is better to use the form that we present here, namely !/usr/bin/env bash and !/usr/bin/env python, because the env (Mlynarik and MacKenzie 2012) executable is aware where bash and python are installed. In short, using env makes your scripts more portable.

4: Remove Fixed Input

We know have a valid command-line tool that we can execute from the command line. But we can do better than this. We can make our command-line tool more reusable. The first command in our file is curl, which downloads the text from which we wish to obtain the top 10 most-used words. So, the data and operations are combined into one.

What if we wanted to obtain the top 10 most-used words from another e-book or our adult.data file , or any other text for that matter? The input data is fixed within the tools itself. It would be better to separate the data from the command-line tool.

5: Parametrize

There is one more step that we can perform in order to make our command-line tool even more reusable: parameters. In our command-line tool there are a number of fixed command-line arguments, for example -nr for sort and -n 10 for head. It is probably best to keep the former argument fixed. However, it would be very useful to allow for different values for the head command. This would allow the end user to set the number of most-often used words to be outputted. Example shows what our file top-words-5.sh looks like if we parametrize head.

Now you can parametrize our bash command like shown below:

After the previous five steps we are finally finished building a reusable command-line tool. There is, however, one more step that can be very useful. In this optional step we are going to ensure that you can execute your command-line tools from everywhere.

Currently, when you want to execute your command-line tool, you either have to navigate to the directory it is in or include the full path name as shown in step 2. This is fine if the command-line tool is specifically built for, say, a certain project. However, if your command-line tool could be applied in multiple situations, then it is useful to be able to execute form everywhere, just like the command-line tools that come with Ubuntu.

To accomplish this, Bash needs to know where to look for your command-line tools. It does this by traversing a list of directories which are stored in an environment variable called PATH. In a fresh Data Science Toolbox, the PATH looks like this:

$ echo $PATH | fold

The directories are delimited by colons. Here is the list of directories:

$ echo $PATH | tr ':' '\n'

To change the PATH permanently, you’ll need to edit the .bashrc or .profile file located in your home directory. If you put all your custom command-line tools into one directory, say, ~/tools, then you only change the PATH once. As you can see, the Data Science Toolbox already has /home/vagrant/.bin in its PATH. Now, you no longer need to add the ./, but you can just use the filename. Moreover, you do no longer need to remember where the command-line tool is located.

Scrubbing Data

you’ll learn how to:

  • Convert data from one format to another.
  • Apply SQL queries to CSV.
  • Filter lines.
  • Extract and replace values.
  • Split, merge, and extract columns.

CSV, which is the main format we’re working with in this section is actually not the easiest format to work with. Many CSV data sets are broken or incompatible with each other because there is no standard syntax, unlike XML and JSON.

Once our data is in the format we want it to be, we can apply common scrubbing operations. These include filtering, replacing, and merging data. The command line is especially well-suited for these kind of operations, as there exist many powerful command-line tools that are optimized for handling large amounts of data. Tools that we’ll discuss in this chapter include classic ones such as: cut

Filtering Lines

Location Based:

The most straightforward way to filter lines is based on their location. This may be useful when you want to inspect, say, the top 10 lines of a file, or when you extract a specific row from the output of another command-line tool. To illustrate how to filter based on location, let’s create a dummy file that contains 10 lines:

$ seq -f "Line %g" 10 | tee data/lines
Line 1
Line 2
Line 3
Line 4
Line 5
Line 6
Line 7
Line 8
Line 9
Line 10

see what type of file you created by file file_name(lines here)

Removing the last 3 lines can be done with head:

$ < lines head -n -3
Line 1
Line 2
Line 3
Line 4
Line 5
Line 6
Line 7

Based on Pattern

Sometimes you want to extract or remove lines based on their contents. Using grep, the canonical command-line tool for filtering lines, we can print every line that matches a certain pattern or regular expression. 

grep -i Private adult.data

Based on Randomness

When you’re in the process of formulating your data pipeline and you have a lot of data, then debugging your pipeline can be cumbersome. In that case, sampling from the data might be useful. The main purpose of the command-line tool sample (Janssens 2014f) is to get a subset of the data by outputting only a certain percentage of the input on a line-by-line basis.

$ seq 1000 | sample -r 1% | jq -c '{line: .}'
{"line":53}
{"line":119}
{"line":141}
{"line":228}
{"line":464}
{"line":476}
{"line":523}
{"line":657}
{"line":675}
{"line":865}
{"line":948}

By Now you have a pretty good grasp on command line utils that is usually enough when you work on cloud and other technologies related to data field .

Advertisements

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google photo

You are commenting using your Google account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s