Howdy folks! It’s been a long time since I did a coding demonstration, so I thought I’d put one up to give you a logistic regression example in Python!
Admittedly, this is a CliffsNotes version, but I hope you’ll get enough from what I have put up here to at least feel comfortable with the mechanics of doing logistic regression in Python (more specifically, using scikit-learn, pandas, etc.). This logistic regression example in Python will predict passenger survival using the Titanic dataset from Kaggle. Before launching into the code, though, let me give you a tiny bit of theory behind logistic regression.
Logistic Regression Formulas
The logistic regression formula is derived from the standard linear equation for a straight line. As you may recall from grade school, that is y = mx + b. Using the sigmoid function (shown below), the standard linear formula is transformed into the logistic regression formula (also shown below). This logistic regression function is useful for predicting the class of a binomial target feature.
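In standard notation, those two formulas are:

$$\sigma(z) = \frac{1}{1 + e^{-z}}$$

$$p(y = 1 \mid x) = \sigma(mx + b) = \frac{1}{1 + e^{-(mx + b)}}$$

The sigmoid squashes any real number into the (0, 1) range, so the output can be read as the probability that the target belongs to class 1.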
Logistic Regression Assumptions
Any logistic regression example in Python is incomplete without addressing model assumptions in the analysis. The important assumptions of the logistic regression model include:
- Target variable is binary
- Predictive features are interval (continuous) or categorical
- Features are independent of one another
- Sample size is adequate – Rule of thumb: 50 records per predictor
So, in my logistic regression example in Python, I am going to walk you through how to check these assumptions in our favorite programming language.
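As a rough preview of what those checks can look like in code (a minimal sketch, assuming the Kaggle train.csv with its usual column names), you might do something like this:

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Load the Kaggle Titanic training data (assumes train.csv is in the working directory).
train = pd.read_csv('train.csv')

# Assumption 1: the target variable is binary.
print(train['Survived'].value_counts())  # should show exactly two classes, 0 and 1

# Assumption 2: features are interval (continuous) or categorical; dtypes are a quick first check.
print(train.dtypes)

# Assumption 3: features are independent of one another.
# A correlation heatmap makes strongly correlated pairs (e.g., Fare vs. Pclass) easy to spot.
numeric_cols = ['Pclass', 'Age', 'SibSp', 'Parch', 'Fare']
sns.heatmap(train[numeric_cols].corr(), annot=True, cmap='coolwarm')
plt.show()

# Assumption 4: adequate sample size (rule of thumb: 50 records per predictor).
print(len(train))  # 891 rows supports up to ~17 predictors by that rule
```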
Uses for Logistic Regression
One last thing before I give you the logistic regression example in Python / Jupyter Notebook… what awesome results can you achieve using logistic regression?! Well, a few things you can do with logistic regression include:
- You can use logistic regression to predict whether a customer will convert (READ: buy or sign-up) to an offer. (will not convert – 0 / will convert – 1)
- You can use logistic regression to predict and preempt customer churn. (will not drop service – 0 / will drop service – 1)
- You can use logistic regression in clinical testing to predict whether a new drug will cure the average patient. (will not cure – 0 / will cure – 1)
The nice thing about logistic regression is that it doesn’t just predict an outcome; it also gives you the estimated probability of that outcome.
Now For that Logistic Regression Example in Python
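The full walkthrough lives in the course, but here is a minimal sketch of the mechanics. It assumes the Kaggle train.csv, fills in preprocessing choices of my own (median-filled Age, dummy-coded Sex and Embarked), and drops Fare and Pclass for being correlated, which leaves the six features discussed in the comments below:

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, classification_report

# Load the Kaggle Titanic training data.
train = pd.read_csv('train.csv')

# Basic cleanup: fill missing ages and dummy-code the categorical features.
train['Age'] = train['Age'].fillna(train['Age'].median())
train = pd.get_dummies(train, columns=['Sex', 'Embarked'], drop_first=True)

# Drop Fare and Pclass (correlated with each other), plus identifiers and free text.
# What remains: Age, SibSp, Parch, Sex_male, Embarked_Q, Embarked_S.
X = train.drop(columns=['Survived', 'Fare', 'Pclass', 'PassengerId', 'Name', 'Ticket', 'Cabin'])
y = train['Survived']

# Hold out a test set; fixing random_state makes the split (and results) reproducible.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

predictions = model.predict(X_test)
print(confusion_matrix(y_test, predictions))
print(classification_report(y_test, predictions))

# predict_proba exposes the probability behind each prediction.
print(model.predict_proba(X_test)[:5])
```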
That’s it! That’s what I’ve got. I wish I had more time to type up an explanation of every detail of the code, but, well… actually, that would be redundant. I cover it all right over here on Lynda.com / LinkedIn Learning.
This Post Has 12 Comments
For adequate sample size in the medical world, we use a rule of thumb of needing 10 outcomes of interest (e.g. death) for each variable rather than 50 patients for each variable. Thoughts on that?
That’s going to lead to less reliable predictions. I’d look into it with someone who has expertise in medicine.
Hi,
I ran this example through JMP and got a completely different output. I set up the data exactly as you illustrated, creating my dummy variables (character, nominal) and only using the final six variables that you illustrated. My Nominal Regression model wound up with this confusion matrix:
Actual Survived   Predicted 0   Predicted 1
0                 472           77
1                 109           233
My Python example (using Python 2.7) also differed from yours. I wonder what we did differently.
Without going back into the demo, my first guess is that there is a random function running and we didn’t set the same seed.
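For example (a toy illustration, not the demo’s code): pinning random_state wherever randomness enters, most likely the train/test split, should make two runs line up:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data just to illustrate the point: with random_state pinned,
# train_test_split returns the identical split on every run.
X = np.arange(20).reshape(10, 2)
y = np.array([0, 1] * 5)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
print(X_test)  # the same rows every time
```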
How did you know that Pclass and Fare are independent?
“Fare and Pclass are not independent of each other, so I am going to drop these.”
Hi Prasanta – It is nice to meet you! I am not sure what you’re talking about because the demo shows exactly the same… they should be dropped.
Lillian, Prasanta is quoting you.
Prasanta, you can see that Pclass and Fare are not independent in the correlation heatmap: the cell where they intersect is dark blue, indicating a fairly high negative correlation.
Not sure why the same assessment was not made for SibSp and Parch.
This is because the heatmap shows a high correlation between Fare and Pclass. This could lead to multicollinearity (a situation where independent variables are correlated), which violates the assumptions of the model and could lead to inaccurate results.
(thumbs up)
Hey, thanks for publishing this! Did you consider keeping either Fare or Pclass instead of dropping both?
One part I missed in your code was determining whether the features used in the regression were statistically significant or not (i.e., should those features have been used in the model or should they have been dropped for not having any significant impact).
I am looking for different methods, using Python code, to determine which features to leave in and which to drop from one’s logistic regression model. For example, another blog I saw used scikit-learn’s RFE (Recursive Feature Elimination) function to determine what to keep or drop, and a training course I saw used the backward elimination method with a for loop, dropping anything with a p-value above .05.
So in other words, how did you know that you should use all those features vs. eliminating the ones that should not have been in the model?
You have to test and play with it and decide for yourself 😉
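For anyone who wants a concrete starting point, here is a minimal sketch of the RFE approach mentioned above, assuming the same Kaggle train.csv and preprocessing as in the example earlier in the post:

```python
import pandas as pd
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# Rebuild the same feature matrix as in the example above.
train = pd.read_csv('train.csv')
train['Age'] = train['Age'].fillna(train['Age'].median())
train = pd.get_dummies(train, columns=['Sex', 'Embarked'], drop_first=True)
X = train.drop(columns=['Survived', 'Fare', 'Pclass', 'PassengerId', 'Name', 'Ticket', 'Cabin'])
y = train['Survived']

# Ask RFE to rank the features and keep the top three (the count here is arbitrary).
selector = RFE(LogisticRegression(max_iter=1000), n_features_to_select=3)
selector.fit(X, y)
print(dict(zip(X.columns, selector.ranking_)))  # rank 1 = kept by RFE
```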