John Hawkins: October 2014

Logistic Regression

Regression is performed when you want to produce a function that will predict the value of something you don't know (the dependent variable) on the basis of a collection of things you do know (the independent variables).

The problem is that regression is typically done with a linear function and very few real world processes are linear. Hence, a great deal of statistics and machine learning research concerns methods for fitting non-linear functions, but controlling for the explosion in complexity that comes with it.

Logistic Regression is one of the methods that tries to solve this problem. In particular, Logistic Regression produces an output between 0 and 1 which can be interpreted as the probability of your target event happening.

Let's look at the form of Logistic Regression to get a better understanding:

You start with the goal of a function that approximates the probability of the target T for any input vector X :

p(T) = F(X)

In order to assure that F(X) takes the form of a valid probability (i.e. always between 0 and 1) we make us of the logistic function 1/(1+e^-K). If K is a big number the e^-K approaches 0 and hence the output of the logistic function approaches 1. If on the other hand K is a very small number the e^-K becomes very large and the output of the logistic function approaches 0.

So we are fitting the following function:

p(T) = 1 / [ 1 + e^-g(X) ]

We have added the function g(X) to afford us some flexibility in how we feed the input vector X into the logistic function. Here is where we place our usual linear regression function. We say

g(X) = B_0 + B_1 * X_1 + B_2 * X_2 + ...... + B_N * X_N

i.e. a linear function over all the dimensions of the input X.

Now, in order to perform our linear regression, we need to transform the function definition. You can do the transformation yourself if you like. What you will find is that with some re-arrangement you find that the function g(X) is equal to:

g(X) = - ln [ (1-p) / p ]

And by exploiting the properties of the logarithm you can further re-arrange to get the log odds ratio.

g(X) = ln [ p / (1-p) ]

An astute reader might notice a problem. For a target value of 1 (i.e p=1) the fraction is undefined. Luckily we can use the properties of the logarithm again and define our target as

ln [ p / (1-p) ] = ln(p) - ln(1-p)

...and this is the target value onto which you perform the linear regression.

In other words you fit the value of the parameters (the Bs) so that

B_0 + B_1*X_1 + B_2*X_2 + ...... + B_N*X_N = ln(p) - ln(1-p)

That is all well and good, how can we do that with R ? you might ask.

Well, I have gone ahead and converted some code from a bunch of different tutorials into a little R workbook that will take you through applied Logistic Regression in R. You can find the Logistic Regression Code Example in my GitHub account right here.

It all boils down to using the Generalised Linear Model function.

This R function will fit your Logistic Regression for you.

If you follow that code example to the end you will get a plot like the one below, which shows you the original data in green, the model fitted to that data in black, and some predictions for unseen parts of the input space in red.

Logistic Regression allows you a great deal of flexibility in your model. The parameterized linear model can be changed how you want, adding or removing independent variables. You can even add higher order combinations of the independent variables.

A common Machine Learning process is to experiment with different forms of this model and examine how the statistical significance of the fit changes.

Just be wary of the pernicious problem of over-fitting.

John Hawkins

Friday, October 3, 2014

Logistic Regression with R

Logistic Regression