Become More Productive by Doing Data Science at the Command Line

Command line: On a Windows machine, press Windows+R, type cmd in the window that appears, and press Enter. The black window that opens is the command line.

A typical command-line interface on Windows

If you are on Ubuntu or macOS, your terminal is your command line:

A typical Ubuntu terminal

You may now ask a valid question. Today, data scientists can choose from an overwhelming collection of exciting technologies and programming languages. Python, R, Hadoop, Julia, Pig, Hive, and Spark are but a few examples. You may already have experience in one or more of these. If so, then why should you still care about the command line for doing data science? What does the command line have to offer that these other technologies and programming languages do not? We will discuss this in this post.

The command line has many great advantages that can really make you a more efficient and productive data scientist. Roughly grouping the advantages, the command line is: agile, augmenting, scalable, extensible, and ubiquitous. We elaborate on each advantage below.

The Command Line is Agile:

  • The command line provides a so-called read-eval-print loop (REPL).
  • The command line is very close to the filesystem.

The Command Line is Augmenting

  • The command line is presented here as an augmenting technology that amplifies the technologies you’re currently employing.
  • The command line integrates well with other technologies. You can often use the command line from within your own environment.
  • The command line can easily cooperate with various databases and file types such as Microsoft Excel.

The Command Line is Scalable

Because the command line is automatable, it becomes scalable and repeatable. It is not straightforward to automate pointing and clicking, which makes a GUI a less suitable environment for doing scalable and repeatable data science.

The Command Line is Extensible and Ubiquitous

The command line itself is language agnostic. This allows the command-line tools to be written in many different programming languages. The open source community is producing many free and high-quality command-line tools that we can use for data science.

Because the command line comes with any Unix-like operating system, including Ubuntu Linux and macOS, it can be found in many places. According to an article on Top 500 Supercomputer Sites, 95% of the top 500 supercomputers are running GNU/Linux. So, if you ever get your hands on one of those supercomputers (or if you ever find yourself in Jurassic Park with the door locks not working), you'd better know your way around the command line!

But GNU/Linux does not run only on supercomputers. It also runs on servers, laptops, and embedded systems. These days, many companies offer cloud computing, where you can easily launch new machines on the fly. If you ever log in to such a machine (or a server in general), there’s a good chance that you’ll arrive at the command line.

Getting Started With Command Line Tools:

In this section, you’ll learn:

  • How to install the Docker image.
  • Essential concepts and tools necessary to perform data science at the command line.

To install Docker, open your Ubuntu terminal and refresh the package index with the sudo apt-get update command.

Then remove any older Docker packages by typing sudo apt-get remove docker docker-engine docker.io

Next, install Docker by typing sudo apt install docker.io

Now start it by typing sudo systemctl start docker

Then enable it at startup with sudo systemctl enable docker

Check your installation by printing the version, as you would for git, with docker --version
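The version check can be sketched as a small guard that also works on a machine where Docker is not installed yet. This is only an illustrative sketch: the variable name docker_status is made up for this example, and it assumes the docker client ends up on your PATH after installation.

```shell
# Check whether the docker client is available and report its version.
# docker_status is a hypothetical variable name used only in this sketch.
if command -v docker >/dev/null 2>&1; then
  docker_status="$(docker --version 2>&1)" \
    || docker_status="docker is installed but did not report a version"
else
  docker_status="docker was not found on PATH"
fi
echo "$docker_status"
```

If the installation worked, the printed line should start with the word Docker followed by a version number.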

Now pull the docker image that we will work with by typing

docker pull datascienceworkshops/data-science-at-the-command-line

If you get a permission-denied error like I did, prefix the command with sudo as a temporary workaround.

Now run the image to get into the Linux environment created for the job by typing

docker run --rm -it datascienceworkshops/data-science-at-the-command-line 

To exit the container run exit

If you want to get data in and out of the container, you can add a volume, which means that a local directory gets mapped to a directory inside the container. We recommend that you create a new directory, navigate to this new directory, and then run the following when you’re on macOS or Linux:

docker run --rm -it -v`pwd`:/data datascienceworkshops/data-science-at-the-command-line
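To make the volume mapping more concrete, here is a minimal sketch of the preparation steps. It assumes a throwaway directory created with mktemp, and it prints the docker command instead of running it so that it also works on a machine without Docker. The names workdir, volume_arg, and note.txt are made up for illustration.

```shell
# Create a scratch directory and move into it.
workdir="$(mktemp -d)"
cd "$workdir" || exit 1

# Put a file in it so there is something to see from inside the container.
echo "hello from the host" > note.txt

# $(pwd) is the modern spelling of the backtick form used above;
# both expand to the current directory.
volume_arg="$(pwd):/data"

# Print the command rather than running it, so no Docker is required here.
echo "docker run --rm -it -v $volume_arg datascienceworkshops/data-science-at-the-command-line"
```

After running the printed command yourself, note.txt should be visible inside the container as /data/note.txt, and anything written to /data lands in the local directory.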

Environment Decoding:

Command-line tools: The first layer consists of the command-line tools we work with. There are different types of command-line tools, which we will discuss in the next section. Examples of tools are ls, cat, wget, grep, less, and awk (to be continued).

Terminal: The terminal, which is the second layer, is the application where we type our commands. Consider the following session:

As you can see, I downloaded my blog's index page, created a file, deleted what I had made, and finally cleared the screen, all with commands. Now imagine doing all of that work with a GUI.

Shell: The third layer is the shell. Once we have typed in our command and pressed <Enter>, the terminal sends that command to the shell. The shell is a program that interprets the command. The Docker image uses Bash as the shell, but there are many others available. Once you have become a bit more proficient at the command line, you may want to look into a shell called the Z shell. It offers many additional features that may increase your productivity at the command line.

OS: The fourth layer is the operating system, which is GNU/Linux in our case. Linux is the name of the kernel, which is the heart of the operating system. The kernel is in direct contact with the CPU, disks, and other hardware. The kernel also executes our command-line tools.

Combining Command-line Tools

Combining command-line tools can often automate your data science work.

Download the data from the UCI Machine Learning Repository.

Count the lines in the data file with wc -l adult.data; to count words instead, change the l to w. To see the first entries in the dataset, as you would with head() in pandas, type head -n 2 adult.data (the default is 10 lines; in pandas it is 5).
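The counting and previewing steps can be tried without downloading anything. The sketch below builds a tiny stand-in file with made-up rows that only imitate the layout of adult.data; the variable names are chosen for this example.

```shell
# Build a tiny stand-in for adult.data (made-up rows, not the real dataset).
sample="$(mktemp)"
printf '%s\n' \
  '39, State-gov, Bachelors, <=50K' \
  '50, Self-emp, Bachelors, <=50K' \
  '38, Private, HS-grad, <=50K' \
  '53, ?, 11th, >50K' > "$sample"

# Count lines (-l) and words (-w); tr strips the padding some wc versions add.
line_count="$(wc -l < "$sample" | tr -d ' ')"
word_count="$(wc -w < "$sample" | tr -d ' ')"
echo "lines: $line_count, words: $word_count"

# Show only the first two rows, similar to df.head(2) in pandas.
first_two="$(head -n 2 "$sample")"
echo "$first_two"
```

On this sample the first echo prints "lines: 4, words: 16", followed by the first two rows.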

Now suppose you have a dataset with millions of rows and you want to inspect a couple of rows from the top and the bottom. You can do this very easily from the command line.
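One common way to peek at both ends of a large file is to run head and tail one after the other. This is a sketch under that assumption, not necessarily the exact command used in the original session, and the 1,000-row file is generated here just to have something to look at.

```shell
# Generate a 1,000-row stand-in file (row numbers only, not real data).
big="$(mktemp)"
awk 'BEGIN { for (i = 1; i <= 1000; i++) print i }' > "$big"

# Run head and tail in sequence; the "..." line marks the rows skipped in between.
peek="$(head -n 2 "$big"; echo '...'; tail -n 2 "$big")"
echo "$peek"
```

This prints the rows 1 and 2, the "..." marker, and then the rows 999 and 1000. Neither tool needs to load the whole file into memory, which is what makes the approach practical for very large files.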

Now suppose you want to find the missing values in a dataset. You can do it with the grep command.
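A minimal sketch of the grep approach follows. It assumes that missing values are marked with a question mark, as they are in the UCI adult data; other datasets may use empty fields or markers such as NA. The rows are made up for illustration.

```shell
# Made-up rows in the style of adult.data; "?" marks a missing value.
sample="$(mktemp)"
printf '%s\n' \
  '39, State-gov, Bachelors, <=50K' \
  '53, ?, 11th, >50K' \
  '28, Private, ?, <=50K' \
  '37, Private, Masters, >50K' > "$sample"

# -F matches the string literally, so "?" is never read as a pattern operator.
# -n prefixes each matching row with its line number.
grep -n -F '?' "$sample"

# -c counts the rows that contain at least one "?".
missing_rows="$(grep -c -F '?' "$sample")"
echo "rows with missing values: $missing_rows"
```

Note that grep -c counts matching rows, not individual missing cells, so a row with two question marks is still counted once.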

How To Obtain Data

According to the Unix philosophy, text is a universal interface. Almost every command-line tool takes text as input, produces text as output, or both. This is the main reason why command-line tools can work so well together. However, as we’ll see, even just text can come in multiple forms.
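Because every tool reads and writes plain text, small tools can be chained with pipes. As an illustration (the fruit names are made-up sample data), the following sketch finds the most frequent line in its input using only standard tools:

```shell
# Each stage reads text on standard input and writes text on standard output.
top_fruit="$(printf '%s\n' apple banana apple cherry apple banana \
  | sort \
  | uniq -c \
  | sort -rn \
  | head -n 1 \
  | awk '{print $2}')"
echo "$top_fruit"
```

Here sort groups identical lines, uniq -c counts them, sort -rn puts the highest count first, and awk keeps only the name, so the sketch prints apple.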

Data can be obtained in several ways, for example by downloading it from a server, by querying a database, or by connecting to a Web API. Sometimes, the data comes in a compressed form or in a binary format such as Microsoft Excel. In this section, we discuss several tools that help tackle this from the command line.

You can get data into your toolbox in several ways:

  • Obtain data from the Internet
  • Query databases
  • Connect to Web APIs
  • Decompress files
  • Convert Microsoft Excel spreadsheets into usable data
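As a small illustration of the decompression item, the sketch below compresses a made-up CSV file with gzip and then reads it back without creating an uncompressed copy on disk. The file contents and variable names are assumptions for this example.

```shell
# Create a small CSV file and compress it (made-up contents).
raw="$(mktemp)"
printf 'id,value\n1,10\n2,20\n' > "$raw"
gzip -c "$raw" > "$raw.gz"

# -d decompresses and -c writes to standard output, so the result
# can be piped straight into another tool.
header="$(gzip -dc "$raw.gz" | head -n 1)"
echo "$header"
```

The last line prints id,value, the header row of the compressed file.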

Follow the commands shown in the image to reach the files:

You can pipe the data to a tool called csvlook (Groskopf 2014d), which will nicely format the data into a table. Here, we’ll display a subset of the columns using csvcut such that the table fits on the page:
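The csvcut and csvlook combination can be sketched as follows. The CSV contents are made up, and because csvkit may not be installed everywhere, the sketch falls back to plain cut when it is missing. That fallback is a swapped-in technique that selects columns by position and does not draw a table.

```shell
# A small made-up CSV file to work with.
csv="$(mktemp)"
printf 'name,city,score\nana,Lima,9\nbo,Oslo,7\n' > "$csv"

if command -v csvcut >/dev/null 2>&1 && command -v csvlook >/dev/null 2>&1; then
  # csvkit is installed: select two columns by name, then format them as a table.
  table="$(csvcut -c name,score "$csv" | csvlook)"
else
  # Fallback (not csvkit): cut selects the first and third comma-separated fields.
  table="$(cut -d, -f1,3 "$csv")"
fi
echo "$table"
```

Selecting columns before formatting keeps wide datasets readable, because csvlook only has to lay out the columns you care about.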

Most companies store their data in a relational database. Examples of relational databases are MySQL, PostgreSQL, and SQLite. These databases all have a slightly different way of interfacing with them. Some provide a command-line tool or a command-line interface, while others do not. Moreover, they are not very consistent when it comes to their usage and output.

Fortunately, there is a command-line tool called sql2csv, which is part of the Csvkit suite. Because it leverages the Python SQLAlchemy package, we only have to use one tool to execute queries on many different databases through a common interface, including MySQL, Oracle, PostgreSQL, SQLite, Microsoft SQL Server, and Sybase. The output of sql2csv is, as its name suggests, in CSV format.

We can obtain data from relational databases by executing a SELECT query on them. (sql2csv also supports INSERT, UPDATE, and DELETE queries, but that’s not the purpose of this post.) To select a specific set of data from an SQLite database named iris.db, sql2csv can be invoked with a database URL and a SELECT query.
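A hedged sketch of such an invocation is shown below. It assumes that iris.db sits in the current directory and contains a table called iris with a sepal_length column; the table name, column name, and threshold are assumptions for illustration. The --db option takes an SQLAlchemy-style URL and --query takes the SQL to run. When sql2csv or the database is not available, the sketch only prints the command.

```shell
# Database URL in SQLAlchemy form; sqlite:/// followed by a relative path.
db_url="sqlite:///iris.db"

# Hypothetical table and column names, used only for illustration.
query="SELECT * FROM iris WHERE sepal_length > 7.5"

if command -v sql2csv >/dev/null 2>&1 && [ -f iris.db ]; then
  # Run the query and print the result as CSV.
  sql2csv --db "$db_url" --query "$query"
else
  echo "sql2csv or iris.db not found; the command would be:"
  echo "sql2csv --db '$db_url' --query '$query'"
fi
```

Because the output is CSV, it can be piped into tools such as csvlook for display.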

Next, we will explore web scraping, reusable command-line tools, and building data pipelines from the command line. Until then, bye.
