The Convergence Blog

The Convergence is sponsored by Data-Mania
… it’s just another way we’re giving back to the data community from which we sprung.

The Convergence - An online community space that's dedicated to empowering operators in the data industry by providing news and education about evergreen strategies, late-breaking data & AI developments, and free or low-cost upskilling resources that you need to thrive as a leader in the data & AI space.

Checklist for Multiple Linear Regression

Lillian Pierson, P.E.

Reading Time: 6 minutes

A 5-Step Checklist for Multiple Linear Regression

Multiple regression analysis is an extension of simple linear regression. It’s useful for describing and making predictions based on linear relationships between predictor variables (i.e., independent variables) and a response variable (i.e., a dependent variable). Although multiple regression analysis is simpler than many other types of statistical modeling methods, there are still some crucial steps that must be taken to ensure the validity of the results you obtain.

When working through this checklist, it’s critical to check that model assumptions are not violated, to fix or minimize any violations you find, and to validate the predictive accuracy of your model. Since the internet provides so few plain-language explanations of this process, I decided to simplify things and walk you through the basics. Please keep in mind that this is a brief summary checklist of steps and considerations. An entire statistics book could probably be written for each of these steps alone. Use this as a basic roadmap, but please investigate the nuances of each step to avoid making errors. Google is your friend. Lastly, in all instances, use your common sense. If the results you see don’t make sense against what you know to be true, there is a problem that should not be ignored.

Before getting into any of the model investigations, inspect and prepare your data. Check it for errors, treat any missing values, and inspect outliers to determine their validity. After you’re comfortable that your data is correct, go ahead and proceed through the following five-step process.

 

STEP 1. SELECTING YOUR VARIABLES

To pick the right variables, you’ve got to have a basic understanding of your dataset, enough to know that your data is relevant, high quality, and of adequate volume. As part of your model building efforts, you’ll be working to select the best predictor variables for your model (i.e., the variables that have the most direct relationships with your chosen response variable). When selecting predictor variables, a good rule of thumb is that you want to gather a maximum amount of information from a minimum number of variables, remembering that you’re working within the confines of a linear prediction equation.

The two following methods will be helpful to you in the variable selection process.

  1. Try out an automatic search procedure and let R decide what variables are best. Stepwise regression analysis is a quick way to do this. (Make sure to check your output and see that it makes sense.)
  2. Use all-possible-regressions to test all possible subsets of potential predictor variables. With the all-possible-regressions method, you get to pick the numerical criteria by which you’d like to have the models ranked. Popular numerical criteria are as follows:
    • R² – Among subsets with the same number of variables, the set with the highest R² value is the best fit for the model. R² never decreases when you add a variable, so don’t rely on it alone to compare subsets of different sizes.
      • note: R² values are always between 0 and 1.0
    • Adjusted R² – The sets of variables with larger adjusted R² values are the better fit variables for the model. Unlike R², adjusted R² penalizes each extra variable.
    • Cp – Look for a small Cp value that is close to p, the number of parameters in the model (including the intercept). A small Cp indicates low total mean square error, and a Cp near p indicates little regression bias.
    • PRESSp – The smaller the predicted sum of squares (PRESSp) value, the better the predictive capabilities of the model.
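
To make the all-possible-regressions idea concrete, here is a minimal sketch in Python with NumPy (not R, and not a substitute for a proper statistics package). The data is synthetic, and the helper names `fit_ols` and `rank_subsets` are made up for this illustration. It fits every subset of predictors and reports R², adjusted R², Cp, and PRESS for each one.

```python
from itertools import combinations

import numpy as np


def fit_ols(X, y):
    """Least-squares fit with an intercept. Returns fitted values and leverages."""
    A = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    # Diagonal of the hat matrix A @ pinv(A); needed for the PRESS shortcut.
    leverage = np.einsum("ij,ji->i", A, np.linalg.pinv(A))
    return A @ beta, leverage


def rank_subsets(X, y, names):
    """Fit every non-empty subset of predictors and rank the fits by PRESS."""
    n, k = X.shape
    sst = np.sum((y - y.mean()) ** 2)
    # Cp uses the error variance estimated from the full model.
    full_fit, _ = fit_ols(X, y)
    sigma2_full = np.sum((y - full_fit) ** 2) / (n - k - 1)
    rows = []
    for size in range(1, k + 1):
        for cols in combinations(range(k), size):
            fitted, h = fit_ols(X[:, list(cols)], y)
            resid = y - fitted
            sse = np.sum(resid ** 2)
            p = size + 1  # number of parameters, counting the intercept
            rows.append({
                "vars": tuple(names[c] for c in cols),
                "r2": 1 - sse / sst,
                "adj_r2": 1 - (sse / (n - p)) / (sst / (n - 1)),
                "cp": sse / sigma2_full - (n - 2 * p),
                "press": np.sum((resid / (1 - h)) ** 2),
            })
    return sorted(rows, key=lambda r: r["press"])


# Synthetic example: y depends on x1 and x2 only; x3 is pure noise.
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 3))
y = 2 + 1.5 * X[:, 0] - 1.0 * X[:, 1] + rng.normal(scale=0.5, size=200)

ranked = rank_subsets(X, y, ["x1", "x2", "x3"])
for row in ranked:
    print(row["vars"], round(row["adj_r2"], 3), round(row["cp"], 2), round(row["press"], 1))
```

Ranking by PRESS is just one choice here; you could sort by adjusted R² or Cp instead and compare the orderings. Note that the full model always has Cp equal to p by construction, so Cp is only informative for the smaller subsets.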

 

STEP 2. REFINING YOUR MODEL

Check the utility of the model by examining the following criteria:

  • Global F test: Test the significance of your predictor variables (as a group) for predicting the response of your dependent variable.
  • Adjusted R²: Check the share of sample variation in the dependent variable that is explained by the model, after adjusting for the sample size and the number of parameters. Adjusted R² values are indicative of how well your predictive equation is fit to your data. Larger adjusted R² values indicate that variables are a better fit for the model.
  • Root mean square error (RMSE): RMSE provides an estimate of the standard deviation of the random error. An interval of ±2 RMSE around a predicted value approximates the accuracy in predicting the response variable based on a specific subset of predictor variables.
  • Coefficient of variation (CV): CV is the RMSE expressed as a percentage of the mean of the response variable. If a model has a CV value that’s less than or equal to 10%, then the model is more likely to provide accurate predictions.
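
The four utility measures above can be computed directly from a least-squares fit. The following is a rough sketch in Python with NumPy on synthetic data; the function name `model_utility` is hypothetical. It returns the global F statistic only. Turning that into a p-value needs an F distribution, for example `scipy.stats.f.sf`, which is left out to keep the example NumPy-only.

```python
import numpy as np


def model_utility(X, y):
    """Global F statistic, adjusted R², RMSE and CV for an OLS fit with intercept."""
    n, k = X.shape
    A = np.column_stack([np.ones(n), X])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    sse = np.sum(resid ** 2)             # unexplained variation
    sst = np.sum((y - y.mean()) ** 2)    # total variation
    mse = sse / (n - k - 1)              # error variance estimate
    rmse = np.sqrt(mse)
    return {
        "f_stat": ((sst - sse) / k) / mse,
        "adj_r2": 1 - mse / (sst / (n - 1)),
        "rmse": rmse,
        "cv_percent": 100 * rmse / y.mean(),
    }


# Synthetic example with a true error standard deviation of 1.0.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = 50 + 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(scale=1.0, size=100)

stats = model_utility(X, y)
for name, value in stats.items():
    print(f"{name}: {value:.3f}")
```

Because CV divides by the mean of the response, it only makes sense when that mean is well away from zero, as it is in this made-up example.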

 

STEP 3. TESTING MODEL ASSUMPTIONS

Now it’s time to check that your data meets the seven assumptions of a linear regression model. If you want a valid result from multiple regression analysis, these assumptions must be satisfied.

  1. You must have three or more variables that are of metric scale (interval or ratio variables) and that can be measured on a continuous scale.
  2. Your data cannot have any major outliers, or data points that exhibit excessive influence on the fitted model.
  3. Variable relationships exhibit (1) linearity – your response variable has a linear relationship with each of the predictor variables, and (2) additivity – the expected value of your response variable is based on the additive effects of the different predictor variables.
  4. Your data shows an independence of observations, or in other words, there is no autocorrelation in the residuals.
  5. Your data demonstrates an absence of multicollinearity (your predictor variables are not strongly correlated with one another).
  6. Your data is homoscedastic (the variance of the residuals is constant across the range of predicted values).
  7. Your residuals must be normally distributed.
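
Two of these assumptions are easy to check numerically. The sketch below, in Python with NumPy on synthetic data, computes variance inflation factors (VIF) for multicollinearity and the Durbin-Watson statistic for first-order autocorrelation in the residuals. The helper names are made up for this example. The commonly quoted rules of thumb (a VIF above about 10 is a warning sign, a Durbin-Watson value near 2 suggests no autocorrelation) are conventions, not hard limits.

```python
import numpy as np


def ols_residuals(X, y):
    """Residuals from a least-squares fit with an intercept."""
    A = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    return y - A @ beta


def vif(X):
    """VIF per column: 1 / (1 - R²) from regressing that column on the others."""
    factors = []
    for j in range(X.shape[1]):
        target = X[:, j]
        resid = ols_residuals(np.delete(X, j, axis=1), target)
        r2 = 1 - (resid @ resid) / np.sum((target - target.mean()) ** 2)
        factors.append(1 / (1 - r2))
    return np.array(factors)


def durbin_watson(resid):
    """Sum of squared successive differences divided by sum of squared residuals."""
    return np.sum(np.diff(resid) ** 2) / np.sum(resid ** 2)


# Synthetic example: x2 is almost a copy of x1, x3 is independent.
rng = np.random.default_rng(7)
x1 = rng.normal(size=300)
x2 = x1 + rng.normal(scale=0.1, size=300)
x3 = rng.normal(size=300)
X = np.column_stack([x1, x2, x3])
y = 1 + 2 * x1 + x3 + rng.normal(scale=0.5, size=300)

vifs = vif(X)
dw = durbin_watson(ols_residuals(X, y))
print("VIF:", np.round(vifs, 1))
print("Durbin-Watson:", round(dw, 2))
```

Normality and homoscedasticity are usually judged from plots (a normal Q-Q plot of the residuals, and residuals against fitted values), which this sketch does not cover.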

 

STEP 4. ADDRESSING POTENTIAL PROBLEMS WITH THE MODEL

Most of the time, at least one of the model assumptions will be violated. In these cases, if you’re careful, you may be able to either fix or minimize the problem(s) that are in conflict with the assumptions.

  • If your data is heteroscedastic, you can try transforming your response variable (for example, with a log or square-root transformation).
  • If your residuals are non-normal, you can either (1) check to see if your data could be broken into subsets that share more similar statistical distributions, and upon which you could build separate models OR (2) check to see if the problem is related to a few large outliers. If so, and if these are caused by a simple error or some sort of explainable, non-repeating event, then you may be able to remove these outliers to correct for the non-normality in residuals.
  • If you are seeing strong correlation between your predictor variables, try taking one of them out.
  • If your model is generating errors due to the presence of missing values, try treating the missing values. You can also use dummy variables to flag the records where a value is missing.
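
As one illustration of the first fix, here is a small Python sketch with NumPy that log-transforms a response variable whose noise grows with its level. The data is synthetic, and `spread_trend` is a crude, made-up diagnostic (the correlation between the size of the residuals and the fitted values), not a formal test of heteroscedasticity.

```python
import numpy as np


def spread_trend(x, y):
    """Correlation between |residual| and fitted value; near 0 suggests constant variance."""
    A = np.column_stack([np.ones(len(y)), x])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    fitted = A @ beta
    return np.corrcoef(fitted, np.abs(y - fitted))[0, 1]


# Synthetic response with multiplicative noise, so the spread grows with the mean.
rng = np.random.default_rng(1)
x = rng.uniform(0, 4, size=500)
y = np.exp(1.0 + 0.5 * x + rng.normal(scale=0.3, size=500))

raw_trend = spread_trend(x, y)          # model fitted to y
log_trend = spread_trend(x, np.log(y))  # model fitted to log(y)

print("raw response:", round(raw_trend, 2))
print("log response:", round(log_trend, 2))
```

A log transformation suits this example because the noise was built to be multiplicative. With real data, check a residual plot before and after, and remember that predictions from the transformed model are on the log scale.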

 

STEP 5. VALIDATING YOUR MODEL

Now it’s time to find out whether the model you’ve chosen is valid. The following three methods will be helpful with that.

  • Check the predicted values by collecting new data and checking it against results that are predicted by your model.
  • Check the results predicted by your model against your own common sense. If they clash, you’ve got a problem.
  • Cross-validate results by splitting your data into two randomly selected samples. Use one half of the data to estimate model parameters. Use the other half for checking the predictive results of your model.
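
The split-half approach in the last bullet might look something like this in Python with NumPy. It is a minimal sketch on synthetic data; the function name `split_half_validation` and the `seed` parameter are assumptions for this example.

```python
import numpy as np


def split_half_validation(X, y, seed=0):
    """Fit on a random half of the data and score predictions on the other half."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))
    half = len(y) // 2
    train, test = idx[:half], idx[half:]

    # Estimate the model parameters on the training half only.
    A_train = np.column_stack([np.ones(len(train)), X[train]])
    beta, *_ = np.linalg.lstsq(A_train, y[train], rcond=None)

    # Predict the held-out half with those parameters.
    A_test = np.column_stack([np.ones(len(test)), X[test]])
    rmse_train = np.sqrt(np.mean((y[train] - A_train @ beta) ** 2))
    rmse_test = np.sqrt(np.mean((y[test] - A_test @ beta) ** 2))
    return beta, rmse_train, rmse_test


# Synthetic example with a true error standard deviation of 0.5.
rng = np.random.default_rng(3)
X = rng.normal(size=(400, 2))
y = 2 + 1.5 * X[:, 0] - 0.7 * X[:, 1] + rng.normal(scale=0.5, size=400)

beta, rmse_train, rmse_test = split_half_validation(X, y)
print("coefficients:", np.round(beta, 2))
print("train RMSE:", round(rmse_train, 3), "test RMSE:", round(rmse_test, 3))
```

If the error on the held-out half is much larger than the error on the training half, the model is probably fitting noise rather than a relationship that will hold for new data.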

 


Hi, I'm Lillian Pierson, Data-Mania's founder. We welcome you to our little corner of the internet. Data-Mania offers fractional CMO and marketing consulting services to deep tech B2B businesses.


© Data-Mania, 2012 - 2024+, All Rights Reserved - Terms & Conditions - Privacy Policy | PRODUCTS PROTECTED BY COPYSCAPE
