# Analysis of Public Health Data with R: Logistic Regression

When we use logistic regression in place of linear regression, there are a few points to keep in mind.

Why does linear regression not work with binary outcomes?

Binary outcomes have only two values. The example we are using throughout this course is diabetes, where individuals either have diabetes or they don’t. For our regression model, we could code this outcome so that individuals with diabetes = 1 and those without diabetes = 0. If we just ran a linear regression model with this binary outcome and one continuous predictor variable, the model would fit a straight line through these points, just as we have seen with simple linear regression in the course on Linear Regression for Public Health.

The graph on the left shows the relation between the continuous predictor variable (“cpred” on the X axis) and some continuous outcome variable (“cont_outcome”) on the Y axis. It shows that the predicted values from the linear regression model (red line) are reasonable for the continuous dependent variable, even if the model does not explain the relationship very well because lots of the points are far from the red line. However, the graph on the right clearly shows that the linear model does not fit the data well at all when the outcome is binary (“outcome” on the Y axis). The predicted values often correspond to impossible values of the outcome, i.e. values other than 0 or 1. It just doesn’t make sense.
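The problem described above can be reproduced with a minimal sketch in R. The data here are simulated purely for illustration (they are not the course data), and the variable names `cpred` and `outcome` simply mirror the axis labels mentioned in the text.

```r
# Simulated data for illustration only; names follow the axis labels above.
set.seed(42)
cpred   <- seq(-5, 5, length.out = 200)          # continuous predictor
outcome <- rbinom(length(cpred), size = 1,
                  prob = plogis(1.5 * cpred))    # binary outcome coded 0/1

# Fit an ordinary linear regression to the binary outcome
lin_fit   <- lm(outcome ~ cpred)
predicted <- fitted(lin_fit)

# The fitted straight line leaves the 0-1 range at both ends of the predictor
range(predicted)

# Plot the points and the fitted line (red), as in the right-hand graph
plot(cpred, outcome, xlab = "cpred", ylab = "outcome")
abline(lin_fit, col = "red")
```

The call to `range(predicted)` should show a minimum below 0 and a maximum above 1, which are impossible values for a 0/1 outcome.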

For a binary outcome, we are generally most interested in modelling the proportion of individuals with the outcome of interest, i.e. the proportion of individuals with diabetes in our example. This is equivalent to the probability of an individual having diabetes. Although probabilities are continuous variables, they can only take values from 0 to 1. However, as we have seen in the graph above (right), linear models will predict values below 0 and above 1. Luckily, we can transform our variable of interest into one that can be modelled in a regression equation using something called a link function.

As its name suggests, a link function describes the relationship that links our variable of interest to the variable that we use in our regression equation. It’s a mathematical trick. The link function that’s most often used for logistic regression is called the logit. Instead of directly modelling the probability, we model the logit of the probability. The logit of a probability p is equivalent to the natural logarithm (log) of the odds (equation below).
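Written out, with p the probability of the outcome, the relationship is as follows. The symbols x and y on the right are generic notation assumed here to show how a natural log is undone.

$$
\operatorname{logit}(p) = \ln\!\left(\frac{p}{1-p}\right) = \ln(\text{odds}),
\qquad
y = \ln(x) \iff x = e^{y}
$$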

As the equation above shows, to get back to a number x from its natural log (y), you raise e to the power of y. This ‘anti-log’ transformation is known as exponentiating.
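As a small worked example of these transformations, here is a sketch in R. The probability 0.2 is an arbitrary value chosen for illustration; `qlogis()` and `plogis()` are base R functions for the logit and its inverse.

```r
p        <- 0.2               # example probability (arbitrary)
odds     <- p / (1 - p)       # odds = 0.25
log_odds <- log(odds)         # natural log of the odds, i.e. logit(p)

# Exponentiating the log odds recovers the odds
exp(log_odds)

# From the odds we can get back to the probability
exp(log_odds) / (1 + exp(log_odds))

# Base R provides the same transformations directly
qlogis(p)          # logit of p
plogis(log_odds)   # inverse logit, returns the probability
```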

The reason we model the log(odds) rather than just the odds as the outcome variable is that it can take any value from minus infinity (as p approaches 0) to positive infinity (as p approaches 1). Odds, on the other hand, cannot be negative. Using the log(odds) as the outcome variable means that we can run a regression model in a similar way to normal linear regression with a continuous variable, and still ensure that the predicted values for probabilities are between 0 and 1 (graph below).
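To see this in practice, the sketch below fits a logistic regression with `glm()` to simulated data (again for illustration only, not the course data) and compares the two scales. The names `cpred` and `outcome` are assumptions carried over from the graphs described earlier.

```r
# Simulated data for illustration only
set.seed(42)
cpred   <- seq(-5, 5, length.out = 200)
outcome <- rbinom(length(cpred), size = 1, prob = plogis(1.5 * cpred))

# Logistic regression: binomial family with the logit link
logit_fit <- glm(outcome ~ cpred, family = binomial(link = "logit"))

# Predictions on the log odds scale (unbounded) ...
log_odds_hat <- predict(logit_fit, type = "link")
# ... and on the probability scale (always between 0 and 1)
prob_hat <- predict(logit_fit, type = "response")

range(log_odds_hat)
range(prob_hat)
```

The model is linear on the log odds scale, so `log_odds_hat` can go well outside the 0 to 1 range, while the corresponding probabilities in `prob_hat` cannot.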

Now let’s work with our clinical data

After loading the data (available on GitHub) into your workspace, you can inspect its structure, which looks like this:

To see the distribution of cholesterol, a simple plot is enough.

How to read the plot: most values fall between 100 and 300, with a few above 400 and below 100. These unusual values are the ones to look at more closely when you are doing research.
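A minimal sketch of this step in base R is shown below. The vector `chol` is a hypothetical name and its values are simulated for illustration, with two extreme toy values added by hand; substitute the cholesterol column from your own data.

```r
# Toy cholesterol values for illustration only (not the real data)
set.seed(1)
chol <- c(round(rnorm(400, mean = 207, sd = 45)), 78, 443)

# Numerical summary and a simple histogram
summary(chol)
hist(chol, main = "Distribution of cholesterol", xlab = "Cholesterol")

# List the values outside the 100-400 range for a closer look
unusual <- chol[chol < 100 | chol > 400]
sort(unusual)
```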

Now look at the gender distribution:

    gender
    female   male
     0.581  0.419
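Proportions like these can be obtained with `table()` and `prop.table()`. In the sketch below, `gender` is a toy vector constructed only so that it matches the proportions shown above; it is not the real data.

```r
# Toy vector built to match the proportions above; not the real data
gender <- rep(c("female", "male"), times = c(581, 419))

# Counts per category, then proportions rounded to three decimal places
gender_counts <- table(gender)
gender_props  <- round(prop.table(gender_counts), digits = 3)
gender_props
```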

Next, consider age. A logistic regression model with age as a predictor assumes that the relation between age and the log odds of having diabetes is linear (more on this in detail in the next section). Is that reasonable? The easiest way to check is just to plot one against the other.
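One way to make such a plot is sketched below. The data are simulated, and the names `age` and `dm` (1 = diabetes, 0 = no diabetes) are assumptions for illustration. Age is grouped into 10-year bands here so that every band contains people with and without diabetes; otherwise the log odds would be infinite.

```r
# Simulated data for illustration only; `age` and `dm` are assumed names
set.seed(1)
n   <- 2000
age <- sample(20:89, n, replace = TRUE)
dm  <- rbinom(n, size = 1, prob = plogis(-4 + 0.05 * age))  # 1 = diabetes

# Group age into 10-year bands: [20,30), [30,40), ..., [80,90)
age_band <- cut(age, breaks = seq(20, 90, by = 10), right = FALSE)

# Proportion with diabetes in each band, then the log odds
dm_by_age <- table(age_band, dm)
prop_dm   <- prop.table(dm_by_age, margin = 1)[, "1"]
log_odds  <- log(prop_dm / (1 - prop_dm))

# Plot the log odds against the midpoint of each age band
band_mid <- seq(25, 85, by = 10)
plot(band_mid, log_odds,
     xlab = "Age (band midpoint)", ylab = "Log odds of diabetes")
```

Because these toy data were generated with a linear relation on the log odds scale, the points fall roughly on a straight line by construction. With real data, that is exactly what you are checking.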

Now look at the distribution of age. For a plot this simple, base R plotting is enough and there is no need for ggplot.

So what’s a probability density? When the bins are of equal width, then the height of each column in the plot reflects the frequency, i.e. it counts the number of patients. When the bins aren’t of equal width, then the area of the column is proportional to the frequency. I think I prefer the frequency one, but it’s a question of personal choice. I think the plots with bins of either 5 or even 10 years are good enough to show the distribution, though of course the one with bins of 5 gives more information.

Histograms are affected by the choice of bins, so some people prefer to use fancier plots instead to describe the distribution, such as kernel density plots (also known simply as density plots):
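The frequency and density versions of a histogram, and the effect of bin width, can be compared with base R's `hist()`. The `age` vector below is simulated for illustration only; the range 19 to 92 is taken from the ages mentioned in the text.

```r
# Toy ages for illustration only (not the real data)
set.seed(1)
age <- sample(19:92, 400, replace = TRUE)

# Bins of 5 years: frequency (counts) versus probability density
breaks5 <- seq(15, 95, by = 5)
h_freq <- hist(age, breaks = breaks5, freq = TRUE,
               main = "Age: frequency, 5-year bins", xlab = "Age")
h_dens <- hist(age, breaks = breaks5, freq = FALSE,
               main = "Age: density, 5-year bins", xlab = "Age")

# Bins of 10 years give a coarser picture of the same data
hist(age, breaks = seq(10, 100, by = 10),
     main = "Age: frequency, 10-year bins", xlab = "Age")

# On the density scale, the column areas add up to 1
sum(h_dens$density * diff(h_dens$breaks))
```

With equal-width bins the two versions have the same shape; only the scale of the Y axis changes.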

Rather than a set of blocky columns, we see a curve. This curve smooths out the noise using a method called kernel smoothing (hence the name kernel density plot), which uses a weighted average of neighbouring data, i.e. of frequencies for ages just above and just below each age. The “bandwidth” mentioned in the above plot reflects the amount of data (i.e. ages above and below each age) used during the averaging. The details aren’t important for using the method.

It’s simple enough to do in R that you could argue that you would be better off using it for continuous variables than histograms. For age in our example, I don’t think it matters at all, as long as you remember that the density plot involves smoothing (a kind of modelling) and so gives values on the graph, e.g. ages under 19 and over 92, that don’t actually have values in the real data set, whereas the histogram displays only what’s actually in the data.
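A kernel density plot takes one call to base R's `density()`. The sketch below again uses a simulated `age` vector for illustration only, and shows both the bandwidth and the way the curve extends beyond the observed ages.

```r
# Toy ages for illustration only (not the real data)
set.seed(1)
age <- sample(19:92, 400, replace = TRUE)

# Kernel density estimate with the default (Gaussian) kernel and bandwidth
d <- density(age)
d$bw                       # the bandwidth chosen by default

plot(d, main = "Kernel density of age", xlab = "Age")

# The curve extends beyond the observed data because of the smoothing
range(age)                 # observed ages
range(d$x)                 # ages covered by the density curve

# A larger bandwidth averages over more neighbouring ages: smoother curve
d_wide <- density(age, bw = 2 * d$bw)
lines(d_wide, lty = 2)
```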

These are some of the things you need to do when you work with data in research. Of course, this is only the tip of the iceberg: carry out further analysis of your own, grounded in statistics, so that you can become a good model builder.