Logistic Regression Example in Python (Source Code Included)

Lillian Pierson, P.E.

Howdy folks! It’s been a long time since I did a coding demonstration, so I thought I’d put one up to provide you a logistic regression example in Python!

Admittedly, this is a CliffsNotes version, but I hope you’ll get enough from what I have put up here to at least feel comfortable with the mechanics of doing logistic regression in Python (more specifically, using scikit-learn, pandas, etc.). This logistic regression example in Python will predict passenger survival using the Titanic dataset from Kaggle. Before launching into the code, though, let me give you a tiny bit of theory behind logistic regression.

Logistic Regression Formulas:

The logistic regression formula is derived from the standard linear equation for a straight line. As you may recall from grade school, that is y = mx + b. Using the sigmoid function (shown below), the standard linear formula is transformed to the logistic regression formula (also shown below). Passing the linear output through the sigmoid squeezes it into the range between 0 and 1, so it can be read as a probability. This logistic regression function is useful for predicting the class of a binary target feature.
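To make that transformation concrete, here is a minimal sketch in plain Python. The slope `m`, the intercept `b`, and the helper name `predict_probability` are hypothetical values and names picked only for illustration; in a real model the slope and intercept are learned from data.

```python
import math

def sigmoid(z: float) -> float:
    """Map a real number to the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_probability(x: float, m: float, b: float) -> float:
    """Apply the sigmoid to the linear output m*x + b."""
    return sigmoid(m * x + b)

# Hypothetical slope and intercept, chosen only for illustration.
m, b = 2.0, -1.0
for x in (-3.0, 0.5, 3.0):
    p = predict_probability(x, m, b)
    label = 1 if p >= 0.5 else 0  # 0.5 is the usual default cutoff
    print(f"x={x:+.1f}  p={p:.3f}  predicted class={label}")
```

The point where m*x + b equals zero is where the probability is exactly 0.5, which is why the straight line ends up acting as the boundary between the two classes.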

The Sigmoid Function

$$\sigma(z) = \frac{1}{1 + e^{-z}}$$

Logistic Regression Formula

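$$p = \frac{1}{1 + e^{-(mx + b)}}$$

Here p is the predicted probability that the target equals 1, and mx + b is the linear equation from above.

Logistic Regression Assumptions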

Any logistic regression example in Python is incomplete without addressing model assumptions in the analysis. The important assumptions of the logistic regression model include:

  • Target variable is binary
  • Predictive features are interval (continuous) or categorical
  • Features are independent of one another
  • Sample size is adequate – Rule of thumb: 50 records per predictor

So, in my logistic regression example in Python, I am going to walk you through how to check these assumptions in our favorite programming language.
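As a rough sketch of what those checks can look like in pandas (not the exact code from the demo), the snippet below tests three of the assumptions on a tiny made-up DataFrame. The column names, the data values, and the 0.8 correlation cutoff are all assumptions chosen for illustration.

```python
import pandas as pd

# Toy, made-up data purely for illustration (NOT the Titanic dataset).
df = pd.DataFrame({
    "target":    [0, 1, 0, 1, 1, 0, 1, 0],
    "feature_a": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0],
    "feature_b": [2.1, 3.9, 6.2, 8.1, 9.8, 12.2, 13.9, 16.1],  # tracks feature_a closely
    "feature_c": [5.0, 1.0, 4.0, 2.0, 5.0, 1.0, 3.0, 2.0],
})
features = ["feature_a", "feature_b", "feature_c"]

# 1. Target variable is binary: exactly two distinct values.
is_binary = df["target"].nunique() == 2

# 2. Features are independent of one another: flag highly correlated pairs.
corr = df[features].corr().abs()
threshold = 0.8  # assumed cutoff; pick one that suits your problem
correlated_pairs = [
    (a, b)
    for i, a in enumerate(features)
    for b in features[i + 1:]
    if corr.loc[a, b] > threshold
]

# 3. Sample size is adequate: rule of thumb of 50 records per predictor.
records_needed = 50 * len(features)
enough_records = len(df) >= records_needed

print("Binary target:", is_binary)
print("Highly correlated pairs:", correlated_pairs)
print(f"Records: {len(df)} (rule of thumb asks for {records_needed})")
```

Any pair that shows up in `correlated_pairs` is a candidate for dropping or combining before you fit the model.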


Uses for Logistic Regression

One last thing before I give you the logistic regression example in Python / Jupyter Notebook: what awesome results can you achieve using logistic regression? Well, a few things you can do with it include:

  • You can use logistic regression to predict whether a customer will convert (READ: buy or sign up) on an offer. (will not convert – 0 / will convert – 1)
  • You can use logistic regression to predict and preempt customer churn. (will not drop service – 0 / will drop service – 1)
  • You can use logistic regression in clinical testing to predict whether a new drug will cure the average patient. (will not cure – 0 / will cure – 1)

The nice thing about logistic regression is that it not only predicts an outcome, it also provides an estimated probability for that outcome, which tells you how confident the model is in each prediction.
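Here is a small sketch of that idea using scikit-learn's `LogisticRegression`. The data is made up for illustration (one feature, eight records), so treat the numbers as a toy and not as anything from the Titanic demo.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy, made-up data: one feature and a binary target.
X = np.array([[0.5], [1.0], [1.5], [2.0], [3.0], [3.5], [4.0], [4.5]])
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])

model = LogisticRegression()
model.fit(X, y)

new_points = np.array([[1.0], [4.0]])
classes = model.predict(new_points)              # hard 0/1 predictions
probabilities = model.predict_proba(new_points)  # columns follow model.classes_

for point, cls, probs in zip(new_points.ravel(), classes, probabilities):
    print(f"x={point}: predicted class {cls}, P(class 1)={probs[1]:.2f}")
```

`predict` gives you the class, while `predict_proba` gives you one probability per class for each record, and each row sums to 1.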

Now For that Logistic Regression Example in Python

That’s it! That’s what I’ve got. I wish I had more time to type up all the information explaining every detail of the code, but well… Actually, that would be redundant. I cover it all right over here on Lynda.com / LinkedIn Learning.

 

This Post Has 12 Comments

  1. James Chamberlain

    For adequate sample size in the medical world, we use a rule of thumb of needing 10 outcomes of interest (e.g. death) for each variable rather than 50 patients for each variable. Thoughts on that?

    1. Lillian Pierson, P.E.

      That’s going to lead to less reliable predictions. I’d look into it with someone that has expertise in medicine.

  2. W.A. Smith

    Hi,

    I ran this example through JMP and got a completely different output. I set up the data exactly as you illustrated, creating my dummy variables (character, nominal) and only using the final six variables that you illustrated. My Nominal Regression model wound up with this confusion matrix (counts):

                          Predicted 0   Predicted 1
    Actual Survived 0         472            77
    Actual Survived 1         109           233

    My Python example (using v2.7) also differed from yours. I wonder what we did that was different.

    1. Without going back into the demo, my first guess is that there is a random function running and we didn’t set the same seed.

  3. Prasanta

    How did you know that Pclass and Fare are independent?

    “Fare and Pclass are not independent of each other, so I am going to drop these.”

    1. Hi Prasanta – It is nice to meet you! I am not sure what you’re talking about because the demo shows exactly the same… they should be dropped.

      1. James

        Lillian, Prasanta is quoting you.
        Prasanta, you can see that Pclass and Fare are not independent in the correlation heatmap by the fact that the cell where they intersect is dark blue, indicating ~high negative correlation.
        Not sure why the same assessment was not made for SibSp and Parch.

    2. Anu Tuyi

      This is because the heatmap shows a high correlation between Fare and Pclass. This could lead to multicollinearity (a situation where independent variables are correlated), which violates the assumptions of the model and could lead to inaccurate results.

  4. Zachary Thomas

    Hey, thanks for publishing this! Did you consider keeping either Fare or Pclass instead of dropping both?

  5. Jacob Matthews

    One part I missed in your code was determining whether the features used in the regression were statistically significant or not (i.e., should those features have been used in the model or should they have been dropped for not having any significant impact).

    I am looking for different methods using Python code to determine which features to leave in, and which features to drop, in one’s logistic regression model. For example, another blog I saw used scikit-learn’s RFE (Recursive Feature Elimination) to determine what to keep or drop, and a training course I saw used the backward elimination method with a for loop, dropping anything with a p-value above .05.

    So in other words, how did you know that you should use all those features vs. eliminating the ones that should not have been in the model?

© Data-Mania, 2012 - 2022+, All Rights Reserved - Terms & Conditions - Privacy Policy | Designed by Kelly Creative Co. | PRODUCTS PROTECTED BY COPYSCAPE
