The Convergence Blog

The Convergence is sponsored by Data-Mania
… it’s just another way we’re giving back to the data community from which we sprung.

The Convergence - An online community space that's dedicated to empowering operators in the data industry by providing news and education about evergreen strategies, late-breaking data & AI developments, and free or low-cost upskilling resources that you need to thrive as a leader in the data & AI space.

Checklist for Multiple Linear Regression

Lillian Pierson, P.E.

Reading Time: 6 minutes

A 5-Step Checklist for Multiple Linear Regression

Multiple regression analysis is an extension of simple linear regression. It’s useful for describing and making predictions based on linear relationships between predictor variables (i.e., independent variables) and a response variable (i.e., a dependent variable). Although multiple regression analysis is simpler than many other statistical modeling methods, there are still some crucial steps you must take to ensure the validity of the results you obtain.

When working through this checklist for multiple linear regression analysis, it’s critical to check that model assumptions are not violated, to fix or minimize any violations you find, and to validate the predictive accuracy of your model. Since the internet provides so few plain-language explanations of this process, I decided to simplify things and walk you through the basics. Please keep in mind that this is a brief summary checklist of steps and considerations. An entire statistics book could probably be written for each of these steps alone. Use this as a basic roadmap, but please investigate the nuances of each step to avoid making errors. Google is your friend. Lastly, in all instances, use your common sense. If the results you see don’t make sense against what you know to be true, there is a problem that should not be ignored.

Before getting into any of the model investigations, inspect and prepare your data. Check it for errors, treat any missing values, and inspect outliers to determine their validity. After you’re comfortable that your data is correct, go ahead and proceed through the following five-step process.
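Here is a minimal Python sketch of that first pass, assuming NumPy is available and that the column you are inspecting is numeric. The `summarize_column` helper and the sample values are made up for illustration, and the 1.5 × IQR fence is just one common convention for flagging points to inspect, not a rule for deleting them.

```python
import numpy as np


def summarize_column(values):
    """Count missing values and flag candidate outliers with the 1.5 * IQR rule."""
    values = np.asarray(values, dtype=float)
    missing = int(np.isnan(values).sum())
    present = values[~np.isnan(values)]
    q1, q3 = np.percentile(present, [25, 75])
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    outliers = present[(present < low) | (present > high)]
    return missing, outliers


# Hypothetical measurements: one missing value and one suspicious reading.
missing, outliers = summarize_column([10.2, 9.8, 10.5, np.nan, 9.9, 10.1, 55.0])
print("missing values:", missing)
print("points to inspect:", outliers)
```

Whether a flagged point is an error or a genuine observation is still your call, which is why the sketch only reports it.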

STEP 1. SELECTING YOUR VARIABLES

To pick the right variables, you’ve got to have a basic understanding of your dataset, enough to know that your data is relevant, high quality, and of adequate volume. As part of your model building efforts, you’ll be working to select the best predictor variables for your model (i.e., the variables that have the most direct relationships with your chosen response variable). When selecting predictor variables, a good rule of thumb is that you want to gather a maximum amount of information from a minimum number of variables, remembering that you’re working within the confines of a linear prediction equation.

The following two methods will be helpful to you in the variable selection process.

Try out an automatic search procedure and let R decide what variables are best. Stepwise regression analysis is a quick way to do this. (Make sure to check your output and see that it makes sense.)
Use all-possible-regressions to test all possible subsets of potential predictor variables. With the all-possible-regressions method, you get to pick the numerical criteria by which you’d like to have the models ranked. Popular numerical criteria are as follows:

- R² – The set of variables with the highest R² value fits the sample best. Because R² never decreases when you add a predictor, use it only to compare subsets of the same size.
  - note: R² values are always between 0 and 1.0
- Adjusted R² – The sets of variables with larger adjusted R² values are the better fit variables for the model.
- C_p – The smaller the C_p value, the lower the total mean square error. A C_p value close to the number of model parameters (predictors plus the intercept) indicates little regression bias.
- PRESS_p – The smaller the predicted sum of squares (PRESS_p) value, the better the predictive capabilities of the model.
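To make the all-possible-regressions idea concrete, here is a small Python sketch, assuming NumPy and a synthetic dataset that I generated just for this example. The helper names (`fit_ols`, `adjusted_r2`, `press`) are my own, not from any library. It fits every subset of three candidate predictors and ranks the subsets by PRESS, with adjusted R² shown alongside.

```python
from itertools import combinations

import numpy as np


def fit_ols(X, y):
    """Ordinary least squares with an intercept column added."""
    A = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    return A, beta


def adjusted_r2(X, y):
    A, beta = fit_ols(X, y)
    resid = y - A @ beta
    n, k = A.shape  # k = predictors + intercept
    sse = resid @ resid
    sst = np.sum((y - y.mean()) ** 2)
    return 1 - (sse / (n - k)) / (sst / (n - 1))


def press(X, y):
    """Predicted residual sum of squares via the leave-one-out shortcut."""
    A, beta = fit_ols(X, y)
    resid = y - A @ beta
    leverage = np.diag(A @ np.linalg.pinv(A))  # diagonal of the hat matrix
    return np.sum((resid / (1 - leverage)) ** 2)


# Synthetic data (an assumption for this sketch): y depends on x0 and x1 only.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = 2.0 + 3.0 * X[:, 0] - 1.5 * X[:, 1] + rng.normal(scale=0.5, size=200)

results = []
for size in range(1, X.shape[1] + 1):
    for cols in combinations(range(X.shape[1]), size):
        subset = X[:, list(cols)]
        results.append((cols, adjusted_r2(subset, y), press(subset, y)))

ranked = sorted(results, key=lambda r: r[2])  # smallest PRESS first
for cols, adj, p in ranked:
    print(cols, round(adj, 3), round(p, 1))
```

With only three candidates there are seven subsets to fit. The count doubles with every extra candidate variable, which is why automatic search procedures exist in the first place.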

STEP 2. REFINING YOUR MODEL

Check the utility of the model by examining the following criteria:

- Global F test: Test the significance of your predictor variables (as a group) for predicting the response of your dependent variable.
- Adjusted R²: Check how much of the sample variation in the dependent variable is explained by the model, after adjusting for the sample size and the number of parameters. Adjusted R² values are indicative of how well your predictive equation is fit to your data. Larger adjusted R² values indicate that variables are a better fit for the model.
- Root mean square error (RMSE): RMSE provides an estimate of the standard deviation of the random error. An interval of ±2 RMSE around a predicted value approximates the accuracy in predicting the response variable based on a specific subset of predictor variables.
- Coefficient of variation (CV): CV is the RMSE expressed as a percentage of the mean of the response variable. If a model has a CV value that’s less than or equal to 10%, then the model is more likely to provide accurate predictions.
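The four utility measures above can be computed from a single least-squares fit. The following Python sketch assumes NumPy and uses synthetic data. `model_utility` is a hypothetical helper name, and the CV is taken here as 100 × RMSE divided by the mean response. Turning the F statistic into a p-value needs an F distribution (for example from SciPy), which I have left out to keep the sketch dependency-light.

```python
import numpy as np


def model_utility(X, y):
    """Global F statistic, adjusted R^2, RMSE and CV for an OLS fit with intercept."""
    n, p = X.shape  # p = number of predictors
    A = np.column_stack([np.ones(n), X])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    sse = float(resid @ resid)                 # unexplained variation
    sst = float(np.sum((y - y.mean()) ** 2))   # total variation
    mse = sse / (n - p - 1)
    rmse = mse ** 0.5
    return {
        "f_stat": ((sst - sse) / p) / mse,
        "adj_r2": 1 - mse / (sst / (n - 1)),
        "rmse": rmse,
        "cv_percent": 100 * rmse / abs(float(y.mean())),
    }


# Synthetic data (an assumption for this sketch); true error SD is 0.5.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 2))
y = 10.0 + 3.0 * X[:, 0] - 1.5 * X[:, 1] + rng.normal(scale=0.5, size=200)

stats = model_utility(X, y)
for name, value in stats.items():
    print(name, round(value, 3))
```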

STEP 3. TESTING MODEL ASSUMPTIONS

Now it’s time to check that your data meets the seven assumptions of a linear regression model. If you want a valid result from multiple regression analysis, these assumptions must be satisfied.

- You must have three or more variables that are of metric scale (interval or ratio variables) and that can be measured on a continuous scale.
- Your data cannot have any major outliers, or data points that exhibit excessive influence on the rest of the dataset.
- Variable relationships exhibit (1) linearity – your response variable has a linear relationship with each of the predictor variables, and (2) additivity – the expected value of your response variable is based on the additive effects of the different predictor variables.
- Your data shows independence of observations. In other words, there is no autocorrelation in the residuals.
- Your data demonstrates an absence of multicollinearity (your predictor variables are not highly correlated with one another).
- Your data is homoscedastic (the residuals have constant variance).
- Your residuals must be normally distributed.
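Two of these assumptions are easy to screen numerically. The Python sketch below, which assumes NumPy and synthetic data, computes variance inflation factors (VIF) for multicollinearity and the Durbin-Watson statistic for first-order autocorrelation in the residuals. The helper names are my own. The usual rules of thumb (a VIF above roughly 10 is worrying, a Durbin-Watson value near 2 is reassuring) are conventions, not hard cutoffs, and the other assumptions still need their own checks, such as residual plots.

```python
import numpy as np


def r_squared(X, y):
    A = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    return 1 - (resid @ resid) / np.sum((y - y.mean()) ** 2)


def vif(X):
    """VIF_j = 1 / (1 - R^2) from regressing predictor j on the other predictors."""
    return [
        1 / (1 - r_squared(np.delete(X, j, axis=1), X[:, j]))
        for j in range(X.shape[1])
    ]


def durbin_watson(resid):
    """Close to 2 when consecutive residuals are uncorrelated."""
    return np.sum(np.diff(resid) ** 2) / np.sum(resid ** 2)


# Synthetic data (an assumption): x2 is almost a copy of x0, so both are collinear.
rng = np.random.default_rng(2)
n = 200
x0, x1 = rng.normal(size=n), rng.normal(size=n)
x2 = x0 + rng.normal(scale=0.1, size=n)
X = np.column_stack([x0, x1, x2])
y = 1.0 + 2.0 * x0 + x1 + rng.normal(scale=0.5, size=n)

A = np.column_stack([np.ones(n), X])
beta, *_ = np.linalg.lstsq(A, y, rcond=None)
resid = y - A @ beta

vifs = vif(X)
dw = durbin_watson(resid)
print("VIF:", [round(v, 1) for v in vifs])
print("Durbin-Watson:", round(dw, 2))
```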

STEP 4. ADDRESSING POTENTIAL PROBLEMS WITH THE MODEL

Most of the time, at least one of the model assumptions will be violated. In these cases, if you’re careful, you may be able to either fix or minimize the problem(s) that are in conflict with the assumptions.

- If your data is heteroscedastic, you can try transforming your response variable (for example, with a log or square-root transformation).
- If your residuals are non-normal, you can either (1) check to see if your data could be broken into subsets that share more similar statistical distributions, and upon which you could build separate models OR (2) check to see if the problem is related to a few large outliers. If so, and if these are caused by a simple error or some sort of explainable, non-repeating event, then you may be able to remove these outliers to correct for the non-normality in residuals.
- If you are seeing correlation between your predictor variables, try taking one of them out.
- If your model is generating error due to the presence of missing values, try treating the missing values. You can also use dummy variables to cover for them.
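As an illustration of the first fix, here is a Python sketch, assuming NumPy, in which the response has multiplicative error, so its spread grows with its level. The data is synthetic and `spread_ratio` is a crude diagnostic I made up for this example: it compares the residual standard deviation in the upper half of the fitted values to the lower half. It is not a formal test, and a residuals-versus-fitted plot would tell you the same story.

```python
import numpy as np


def residuals_and_fitted(X, y):
    A = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    fitted = A @ beta
    return y - fitted, fitted


def spread_ratio(resid, fitted):
    """Residual SD in the top half of fitted values divided by the bottom half."""
    order = np.argsort(fitted)
    half = len(order) // 2
    return np.std(resid[order[half:]]) / np.std(resid[order[:half]])


# Synthetic data (an assumption): error is multiplicative on the original scale.
rng = np.random.default_rng(3)
n = 400
x = rng.uniform(0, 2, size=n)
y = np.exp(1.0 + 1.0 * x + rng.normal(scale=0.3, size=n))
X = x.reshape(-1, 1)

ratio_raw = spread_ratio(*residuals_and_fitted(X, y))
ratio_log = spread_ratio(*residuals_and_fitted(X, np.log(y)))
print("spread ratio, raw response:", round(ratio_raw, 2))
print("spread ratio, log response:", round(ratio_log, 2))
```

A ratio near 1 is what you would expect from homoscedastic residuals. Remember that predictions from the transformed model are on the log scale and need to be transformed back before you interpret them.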

STEP 5. VALIDATING YOUR MODEL

Now it’s time to find out whether the model you’ve chosen is valid. The following three methods will be helpful with that.

- Check the predicted values by collecting new data and checking it against results that are predicted by your model.
- Check the results predicted by your model against your own common sense. If they clash, you’ve got a problem.
- Cross-validate results by splitting your data into two randomly selected samples. Use one half of the data to estimate model parameters. Use the other half for checking the predictive results of your model.
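The split-half approach in the last item can be sketched in a few lines of Python, assuming NumPy and synthetic data. The variable names are illustrative. The parameters are estimated on one random half, and the prediction error is measured on the other half.

```python
import numpy as np

# Synthetic data (an assumption for this sketch); true error SD is 0.5.
rng = np.random.default_rng(4)
n = 200
X = rng.normal(size=(n, 2))
y = 2.0 + 3.0 * X[:, 0] - 1.5 * X[:, 1] + rng.normal(scale=0.5, size=n)

# Randomly split the rows into two halves.
shuffled = rng.permutation(n)
train, test = shuffled[: n // 2], shuffled[n // 2:]


def design(X):
    """Add the intercept column."""
    return np.column_stack([np.ones(len(X)), X])


# Estimate the parameters on the first half only.
beta, *_ = np.linalg.lstsq(design(X[train]), y[train], rcond=None)


def rmse(rows):
    errors = y[rows] - design(X[rows]) @ beta
    return float(np.sqrt(np.mean(errors ** 2)))


rmse_train, rmse_test = rmse(train), rmse(test)
print("RMSE on the estimation half:", round(rmse_train, 3))
print("RMSE on the holdout half:", round(rmse_test, 3))
```

If the holdout error is much larger than the error on the estimation half, the model is probably fit too closely to the sample it was built on.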

Hi, I'm Lillian Pierson, Data-Mania's founder. We welcome you to our little corner of the internet. Data-Mania offers fractional CMO and marketing consulting services to deep tech B2B businesses.

© Data-Mania, 2012 - 2024+, All Rights Reserved - Terms & Conditions - Privacy Policy | PRODUCTS PROTECTED BY COPYSCAPE
