Data visualization is by far the most important thing in your data science or data analytics journey. It is the visualization that attracts viewers to your work, impresses shareholders enough to invest, and persuades the authorities to review your work positively. But representing data correctly is not that simple. You need a solid foundation in visualization tools, and you also need to keep an eye on the variables you use, understand the relationships between them, and, above all, understand the graphics you use to establish your findings. In this post I am going to cover in detail the basic aspects of ggplot2, the visualization package available in R, so that you can present your understanding in a more meaningful way.
By now I hope you have a solid foundation in statistics and data visualization. As you may know, the findings of a typical data science project or EDA depend on statistics, design, data analysis, and perception, and your task is to turn your perception into reality through valid data analysis.
Data visualization comes in two versions, exploratory and explanatory. As a developer, your job is to explore the data to find meaningful relationships, but at the same time you have to explain them with labels and other annotations.
What Is the Difference Between Exploratory and Explanatory Data Visualization?
To be honest, the exploratory version is for review among peers. Suppose you have a scatter plot of two variables and your intuition says a quadratic curve would fit. You can draw that fit on another plot to test your understanding and share the intuition with your colleagues. The explanatory version requires more polish and publication-ready fonts.
How to Develop General Intuition for Plotting Data
Suppose you are working with the mtcars dataset in R and you want to visualize the relationships between its variables. What can you do? The str() function in R lists all the variables a data frame has, along with its structure.
Start by looking at the structure of the dataset.
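A minimal sketch of that inspection step. mtcars ships with base R, so nothing beyond a standard R installation is assumed here:

```r
# mtcars is a built-in data frame with 32 cars and 11 numeric columns.
str(mtcars)

# cyl is stored as a plain number, yet it takes only a handful of distinct values.
class(mtcars$cyl)
sort(unique(mtcars$cyl))
```

Comparing the last line with the many distinct values of mpg is a quick way to spot columns that are really categories.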
mpg takes many different continuous values, while cyl takes only a few repeated values. R stores cyl as a number, which is misleading for plotting: we have to treat cyl as a set of levels and then plot the graph.
A scatter plot with cyl as a number has little meaning by itself, but if we convert cyl to a factor (a category, if you are familiar with TensorFlow or pandas), we get a far more meaningful plot.
Look at how changing the type of cyl improves the same graph.
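The original figures are not reproduced here, so the following is only a sketch of the comparison, assuming cyl goes on the x axis and mpg on the y axis:

```r
library(ggplot2)

# cyl as a plain number: ggplot2 treats the x axis as continuous.
p_numeric <- ggplot(mtcars, aes(x = cyl, y = mpg)) +
  geom_point()

# cyl wrapped in factor(): the x axis shows only the three cylinder groups.
p_factor <- ggplot(mtcars, aes(x = factor(cyl), y = mpg)) +
  geom_point()

p_numeric
p_factor
```

Wrapping the column in factor() inside aes() leaves mtcars itself untouched, which is convenient while you are still exploring.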
What Are the Essential Elements of Graphics?
Here is one example of how quickly and efficiently you can build a well-rounded figure that explains the variables well. This graph is a scatter plot of the wt and mpg variables, with the color and size of each point set by the disp variable, so you can read the change in disp from both size and color.
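A short sketch of such a plot, assuming wt on the x axis and mpg on the y axis (the text does not say which variable goes on which axis):

```r
library(ggplot2)

# Map the continuous disp variable to both color and size of the points.
p_disp <- ggplot(mtcars, aes(x = wt, y = mpg, color = disp, size = disp)) +
  geom_point()

p_disp
```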
Another important aesthetic is the shape of the points, but unlike color or size, shape must be mapped to a categorical variable. If you map disp to shape, you will get an error.
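The sketch below shows both cases. Using factor(cyl) as the categorical variable is my own choice for illustration. The failing case is wrapped in tryCatch() so the script keeps running and you can read the error message:

```r
library(ggplot2)

# shape needs a discrete variable, so cyl is converted to a factor first.
p_shape <- ggplot(mtcars, aes(x = wt, y = mpg, shape = factor(cyl))) +
  geom_point()
p_shape

# Mapping continuous disp to shape creates a plot object, but building it fails.
p_bad <- ggplot(mtcars, aes(x = wt, y = mpg, shape = disp)) +
  geom_point()

shape_error <- tryCatch(
  {
    ggplot_build(p_bad)
    "no error"
  },
  error = function(e) conditionMessage(e)
)
shape_error
```

Note that the error appears only when the plot is built or printed, not when p_bad is created.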
So what are the layers of ggplot?
Take the diamonds dataset, which ships with ggplot2. You can draw a scatter plot of it using ggplot aesthetics and also add a smooth line that fits the points.
If you look at the console, R also reports the formula it used to draw the line.
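A minimal sketch, assuming carat on the x axis and price on the y axis, since the text does not name the variables. With more than 1,000 rows, geom_smooth() defaults to a GAM fit, which relies on the mgcv package that normally comes with R:

```r
library(ggplot2)

# Assumed variables: carat on x, price on y.
p_smooth <- ggplot(diamonds, aes(x = carat, y = price)) +
  geom_point() +
  geom_smooth()

# Printing the plot triggers the console message naming the method and formula.
p_smooth
```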
You can also add color to the same plot by mapping the clarity variable to it.
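As a sketch, again assuming carat and price as the two axes:

```r
library(ggplot2)

# clarity is an ordered factor, so each clarity grade gets its own color
# and geom_smooth() fits one line per grade.
p_clarity <- ggplot(diamonds, aes(x = carat, y = price, color = clarity)) +
  geom_point() +
  geom_smooth()

p_clarity
```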
You can also store the same plot as an object and build it up one layer at a time, following the layered grammar.
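One way this could look. The object names dia_plot, dia_points and dia_both are made up for this sketch:

```r
library(ggplot2)

# Step 1: data and aesthetics only; no geometry yet, so nothing is drawn.
dia_plot <- ggplot(diamonds, aes(x = carat, y = price))

# Step 2: add a point layer, coloring by clarity in this layer only.
dia_points <- dia_plot + geom_point(aes(color = clarity))

# Step 3: add a smooth layer on top; it inherits x and y but not the color mapping.
dia_both <- dia_points + geom_smooth()

dia_both
```

Because the color mapping lives inside geom_point(), the smooth layer draws a single line for the whole dataset instead of one per clarity grade.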
Now you can use further options, such as alpha and se, to improve the plot.
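A sketch of both options. The alpha value of 0.4 is an arbitrary choice for illustration:

```r
library(ggplot2)

p_tuned <- ggplot(diamonds, aes(x = carat, y = price, color = clarity)) +
  # alpha makes points semi-transparent, which helps with heavy overplotting.
  geom_point(alpha = 0.4) +
  # se = FALSE hides the confidence band around each smooth line.
  geom_smooth(se = FALSE)

p_tuned
```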
Now let's decode the basic structure of a ggplot. The first thing you need is data, which is the source of your plot. Next you define your aesthetics, such as the x and y coordinates, assuming you are drawing a 2-D plot. Finally, you specify the geometry, that is, the type of plot you want to draw, e.g. geom_point() or geom_bar(). That's it: your basic plot is ready.
Next, be aware that the plot type and the data should match each other: you need continuous values for a histogram and discrete values for a bar plot. After that, look for more granular structure in the data. Say you have plotted survival against sex. Is that enough for a Titanic EDA? If you want a clearer picture, use facets: split the plot by another variable, say passenger class. (If you have not come across the Titanic machine learning problem, visit the Kaggle page to understand it.)
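To illustrate matching the geometry to the data type, here is a small sketch on mtcars. The choice of mpg, cyl and 10 bins is mine, not the original post's:

```r
library(ggplot2)

# Continuous variable -> histogram. bins is set explicitly to avoid the default-bins message.
p_hist <- ggplot(mtcars, aes(x = mpg)) +
  geom_histogram(bins = 10)

# Discrete variable -> bar plot. geom_bar() counts the rows in each category.
p_bar <- ggplot(mtcars, aes(x = factor(cyl))) +
  geom_bar()

p_hist
p_bar
```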
Next comes labelling. When you publish a graph for your readers, every label needs to be in place, because the labels might be the only thing your reader focuses on.
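A sketch of labelling with labs(), using mtcars. The wording of the title and axis labels is only an example:

```r
library(ggplot2)

p_labeled <- ggplot(mtcars, aes(x = wt, y = mpg, color = factor(cyl))) +
  geom_point() +
  labs(
    title = "Heavier cars travel fewer miles per gallon",
    x = "Weight (1000 lbs)",
    y = "Miles per gallon",
    color = "Cylinders"
  )

p_labeled
```

Setting color = "Cylinders" renames the legend, which would otherwise show the raw expression factor(cyl).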
Now let's apply what we discussed above to the Titanic example. If we plot survival by passenger class, we can see that the death count is highest in class 3, the cheapest class. That is believable: the Titanic did not have enough lifeboats for all of its passengers, and in our society the rich tend to get more privileges than the economically disadvantaged.
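The post works with the Kaggle Titanic CSV, which is not bundled with R. As a stand-in, this sketch uses R's built-in Titanic table with the crew rows dropped, so the counts will differ from the Kaggle file:

```r
library(ggplot2)

# Stand-in data: the built-in Titanic contingency table, not the Kaggle CSV.
titanic_df <- as.data.frame(Titanic)
passengers <- droplevels(subset(titanic_df, Class != "Crew"))

# Stacked bars: one bar per class, split by survival.
p_class <- ggplot(passengers, aes(x = Class, y = Freq, fill = Survived)) +
  geom_col()
p_class

# Deaths per class, to read the bars as numbers.
deaths <- with(subset(passengers, Survived == "No"), tapply(Freq, Class, sum))
deaths
```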
Now for more intuition. As the saying goes, "ladies first", so we might wonder whether more women than men survived, and whether that holds in every class of the Titanic. A plot shows this easily.
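A faceted sketch of that question, again using R's built-in Titanic table without the crew as a stand-in for the Kaggle data:

```r
library(ggplot2)

# Stand-in data: the built-in Titanic contingency table, not the Kaggle CSV.
passengers <- droplevels(subset(as.data.frame(Titanic), Class != "Crew"))

# One panel per class; position = "fill" turns counts into shares within each bar.
p_sex <- ggplot(passengers, aes(x = Sex, y = Freq, fill = Survived)) +
  geom_col(position = "fill") +
  facet_wrap(~ Class) +
  labs(y = "Share of passengers")
p_sex

# The same comparison as a table of survival rates by class and sex.
counts <- xtabs(Freq ~ Class + Sex + Survived, data = passengers)
survival_rate <- counts[, , "Yes"] / (counts[, , "Yes"] + counts[, , "No"])
round(survival_rate, 2)
```

In this stand-in data, the female survival rate is higher than the male rate in every class.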
You can see that our intuition holds. The plot also reveals an interesting but uncomfortable fact about society: if you were male and poor, your survival rate was very low. This is an important fact to consider when developing a model.
Now suppose you have the out-of-the-box idea that there must be a relationship between the first letter of the ticket and survival. You can check that with a plot as well.
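The built-in Titanic table has no ticket column, so this check needs the Kaggle train.csv. The helper below is a hypothetical sketch (the function name and the file location are assumptions); Ticket and Survived are the column names used in the Kaggle file:

```r
library(ggplot2)

# Hypothetical helper: plots the share of survivors by the first character of the ticket.
# Expects a data frame with Ticket and Survived columns, as in the Kaggle train.csv.
plot_ticket_prefix <- function(df) {
  df$TicketPrefix <- substr(as.character(df$Ticket), 1, 1)
  ggplot(df, aes(x = TicketPrefix, fill = factor(Survived))) +
    geom_bar(position = "fill") +
    labs(
      x = "First character of ticket",
      y = "Share of passengers",
      fill = "Survived"
    )
}

# Assumed usage, with train.csv downloaded from Kaggle into the working directory:
# titanic <- read.csv("train.csv")
# plot_ticket_prefix(titanic)
```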
See, the plot does not confirm your hypothesis. There are many more types of plots to visualize with. In the next post we will explore the advanced ggplot grammar, once you have built up your basic visualization skills.
To run all the code and experiment with it, follow these steps:
- clone the repository https://github.com/MachineLearningWithHuman/R
- move into the cloned R repository
- go to the Visualization_basic folder
- play with the Titanic file and the visualization basic file.