This post is inspired by stanford.edu.
The process of statistics starts when we identify what group we want to study or learn something about. We call this group the population. Note that the word population here (and in the entire course) does not refer only to people; it is used in the broader statistical sense to refer to people, animals, objects, and so on. For example, we might be interested in:
- The opinions of the population of U.S. adults about the death penalty
- How the population of mice reacts to a certain chemical
- The average price of the population of all one-bedroom apartments in a certain city
Population, then, is the entire group that is the target of our interest.
In most cases, the population is so large that, as much as we want to, there is absolutely no way we can study all of it (imagine trying to get the opinions of all U.S. adults about the death penalty). A more practical approach would be to examine and collect data only from a subgroup of the population, which we call a sample. We call this first step, which involves choosing a sample and collecting data from it, producing data.
It should be noted that since, for practical reasons, we need to compromise and examine only a sub-group of the population rather than the whole population, we should make an effort to choose a sample in such a way that it will represent the population well. For example, if we choose a sample from the population of U.S. adults, and ask their opinions about the death penalty, we do not want our sample to consist of only Republicans or only Democrats.
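One common way to aim for a representative sample is simple random sampling, where every individual has the same chance of being chosen. Here is a minimal sketch in Python, assuming a hypothetical population of 1,000 numbered individuals (the population size, sample size, and seed are made up for illustration):

```python
import random

# Hypothetical population: 1,000 individuals identified by number.
population = list(range(1, 1001))

# Fix the seed so the sketch gives the same sample every time it runs.
rng = random.Random(42)

# Draw 50 distinct individuals; each one is equally likely to be chosen.
sample = rng.sample(population, 50)

print(len(sample), "individuals sampled out of", len(population))
```

Random selection does not guarantee a perfectly balanced sample, but it avoids systematically favoring one kind of individual.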
Once the data have been collected, what we have is a long list of answers to questions, or numbers, and in order to explore and make sense of the data, we need to summarize that list in a meaningful way. This second step, which consists of summarizing the collected data, is called exploratory data analysis.
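To make "summarizing a list" concrete, here is a small sketch using Python's standard library. The ages and answers below are invented for illustration and do not come from any real survey:

```python
import statistics
from collections import Counter

# Hypothetical raw data: ages reported by ten respondents.
ages = [23, 35, 41, 29, 35, 52, 47, 35, 61, 29]

# Hypothetical raw data: the same respondents' answers to a yes/no question.
answers = ["yes", "no", "no", "yes", "yes", "no", "yes", "yes", "no", "yes"]

# Numbers are summarized with measures such as the mean and median.
mean_age = statistics.mean(ages)
median_age = statistics.median(ages)

# Answers to questions are summarized by counting each response.
answer_counts = Counter(answers)

print("mean age:", mean_age)
print("median age:", median_age)
print("answer counts:", dict(answer_counts))
```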
Now we’ve obtained the sample results and summarized them, but we are not done. Remember that our goal is to study the population, so what we want is to be able to draw conclusions about the population based on the sample results. Before we can do so, we need to look at how the sample we’re using may differ from the population as a whole, so that we can factor that into our analysis. To examine this difference, we use probability.
In essence, probability is the “machinery” that allows us to draw conclusions about the population based on the data collected about the sample.
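A rough way to see how a sample can differ from its population is to simulate it. The sketch below assumes a made-up population of 10,000 values (think of them as monthly rents) and draws many samples of 100 from it. All numbers are hypothetical:

```python
import random
import statistics

rng = random.Random(0)

# Hypothetical population: 10,000 values from a made-up distribution.
population = [rng.gauss(1000, 200) for _ in range(10_000)]
population_mean = statistics.mean(population)

# Draw 500 samples of 100 individuals and record each sample's mean.
sample_means = [
    statistics.mean(rng.sample(population, 100))
    for _ in range(500)
]

# The sample means scatter around the population mean. Their spread
# shows how far off a single sample is likely to be.
spread = statistics.stdev(sample_means)

print("population mean:", round(population_mean, 1))
print("average of sample means:", round(statistics.mean(sample_means), 1))
print("spread of sample means:", round(spread, 1))
```

Each sample gives a slightly different answer, and probability is what lets us describe how large those differences typically are.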
Finally, we can use what we’ve discovered about our sample to draw conclusions about our population. We call this final step in the process inference.
This is the Big Picture of statistics.
To follow along with me, you will need to be able to apply three processes:
- Collecting data,
- Summarizing data, and
- Interpreting data.
In addition to being able to apply these processes, you can learn how to use statistical software packages to help manage, summarize, and interpret data. The statistics package exercises included throughout the course give you the opportunity to explore a dataset and answer questions based on the output using R, Excel, or SAS. In each hands-on activity, you can choose to view instructions for completing the activity in R, Excel, or SAS, depending on which statistics package you choose to use.
We will start by explaining what EDA is and giving you an overview of it:
Exploratory Data Analysis (EDA) Overview
Before we jump into exploratory data analysis and really appreciate its importance in the process of statistical analysis, let’s step back for a minute and ask:
What do we really mean by data?
Data are pieces of information about individuals organized into variables. By an individual, we mean a particular person or object. By a variable, we mean a particular characteristic of the individual.
A dataset is a set of data identified with particular circumstances. Datasets are typically displayed in tables, in which rows represent individuals and columns represent variables.
The following dataset shows medical records from a particular survey:
In this example, the individuals are patients, and the variables are Gender, Age, Weight, Height, Smoking, and Race. Each row, then, gives us all the information about a particular individual (in this case, patient), and each column gives us information about a particular characteristic of all the patients.
Variables can be classified into one of two types: categorical or quantitative.
- Categorical variables take category or label values and place an individual into one of several groups. Each observation can be placed in only one category, and the categories are mutually exclusive.
- Quantitative variables take numerical values and represent some kind of measurement.
Let’s explore a dataset:
In this activity, we will:
- Learn how to open and examine a dataset.
- Practice classifying variables by their type: quantitative or categorical.
- Learn how to handle categorical variables whose values are numerically coded.
Background of the data:
Clinical depression is the most common mental illness in the United States, affecting 19 million adults each year (Source: NIMH, 1999). Nearly 50% of individuals who experience a major episode will have a recurrence within 2 to 3 years. Researchers are interested in comparing therapeutic solutions that could delay or reduce the incidence of recurrence.
In a study conducted by the National Institutes of Health, 109 clinically depressed patients were separated into three groups, and each group was given one of two active drugs (imipramine or lithium) or no drug at all. For each patient, the dataset contains the treatment used, the outcome of the treatment, and several other interesting characteristics.
Here is a summary of the variables in our dataset:
- Hospt: The patient’s hospital, represented by a code for each of the 5 hospitals (1, 2, 3, 5, or 6)
- Treat: The treatment received by the patient (Lithium, Imipramine, or Placebo)
- Outcome: Whether or not a recurrence occurred during the patient’s treatment (Recurrence or No Recurrence)
- Time: Either the time in days till the first recurrence, or if a recurrence did not occur, the length in days of the patient’s participation in the study.
- AcuteT: The time in days that the patient was depressed prior to the study.
- Age: The age of the patient in years, when the patient entered the study.
- Gender: The patient’s gender (1 = Female, 2 = Male)
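As a sketch of how these variables might be classified, the mapping below records each one as categorical or quantitative. The classification is my reading of the descriptions above, not something stated in the dataset itself. Note that Hospt and Gender are categorical even though their values are numbers, because the numbers are only codes:

```python
# Variable types, based on the descriptions in the list above.
variable_types = {
    "Hospt": "categorical",    # hospital code; the numbers are labels
    "Treat": "categorical",    # which treatment the patient received
    "Outcome": "categorical",  # recurrence or no recurrence
    "Time": "quantitative",    # measured in days
    "AcuteT": "quantitative",  # measured in days
    "Age": "quantitative",     # measured in years
    "Gender": "categorical",   # coded 1 = Female, 2 = Male
}

categorical = [name for name, kind in variable_types.items() if kind == "categorical"]
quantitative = [name for name, kind in variable_types.items() if kind == "quantitative"]

print("categorical:", categorical)
print("quantitative:", quantitative)
```

A quick check for a numerically coded variable: if averaging the values would be meaningless (an "average hospital code", for instance), the variable is categorical.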
The data should look like this:
  Hospt Treat Outcome    Time AcuteT Age Gender
1     1     0       1  36.143    211  33      1
2     1     1       0 105.143    176  49      1
3     1     1       0  74.571    191  50      1
4     1     0       1  49.714    206  29      2
5     1     0       0  14.429     63  29      1
6     1     2       1   5.000     70  30      2
Now, if we want to see the Age variable in RStudio, we will get a result like this:
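The post works in R, but the same idea can be sketched in plain Python. The example below assumes only the six rows shown above (with the leading row numbers dropped), so the full dataset of 109 patients would give a longer result:

```python
# The first six rows from the post, as whitespace-separated text.
raw = """Hospt Treat Outcome Time AcuteT Age Gender
1 0 1 36.143 211 33 1
1 1 0 105.143 176 49 1
1 1 0 74.571 191 50 1
1 0 1 49.714 206 29 2
1 0 0 14.429 63 29 1
1 2 1 5.000 70 30 2
"""

# Split the text into a header row and one record per patient.
rows = [line.split() for line in raw.strip().splitlines()]
header, records = rows[0], rows[1:]

# Find the Age column by name and pull out its value for every patient.
age_index = header.index("Age")
ages = [int(record[age_index]) for record in records]

print(ages)  # [33, 49, 50, 29, 29, 30]
```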
Often it is easier to use labels for categorical variables that are as close as possible to the meanings of the categories. Now we will recode the variable Gender with the labels “Male” and “Female.”
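Here is a minimal Python sketch of that recoding, assuming the Gender codes of the six rows shown earlier and the coding from the variable summary (1 = Female, 2 = Male). The variable names are my own:

```python
# Gender codes for the six rows shown earlier in the post.
gender_codes = [1, 1, 1, 2, 1, 2]

# Labels taken from the variable summary: 1 = Female, 2 = Male.
gender_labels = {1: "Female", 2: "Male"}

# Replace each numeric code with its label.
gender = [gender_labels[code] for code in gender_codes]

print(gender)  # ['Female', 'Female', 'Female', 'Male', 'Female', 'Male']
```

Using a dictionary means an unexpected code raises a KeyError instead of silently passing through, which helps catch data entry problems early.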
In the next post, we will explore the variables and their characteristics.
GitHub link: click here