
A Demo of Hierarchical, Moderated, Multiple Regression Analysis in R

Lillian Pierson, P.E.

Reading Time: 10 minutes

Moderator models are often used to examine when an independent variable influences a dependent variable. More specifically, moderators are used to identify factors that change the relationship between independent (X) and dependent (Y) variables. In this article, I explain how moderation in regression works, and then demonstrate how to do a hierarchical, moderated, multiple regression analysis in R.


Explaining hierarchical, moderated, multiple regression analysis in R

Hierarchical, moderated, multiple regression analysis in R can get pretty complicated, so let's start at the very beginning, with a generic linear regression model:

Y = β0 + β1X + ϵ

Y is the dependent variable and X is the independent variable; the regression model tries to explain how Y changes as a function of X. The above equation has a single independent variable.
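To make this concrete, here's a minimal sketch of fitting that one-predictor model in R, using the built-in mtcars data (my example, not the article's dataset):

> # Hypothetical one-predictor fit: mpg = β0 + β1·wt + ε
> fit <- lm(mpg ~ wt, data = mtcars)
> coef(fit)  # estimates of β0 (intercept) and β1 (slope)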

So, what is moderation analysis?

Moderator (Z) models are often used to examine when an independent variable influences a dependent variable. That is, moderated models are used to identify factors that change the relationship between independent (X) and dependent (Y) variables. A moderator variable (Z) will enhance a regression model if the relationship between the independent variable (X) and dependent variable (Y) varies as a function of Z.

How does a moderator affect a regression model?

Let’s look at it from two different perspectives. First, looking at it from an experimental research perspective:

  • The manipulation of X causes change in Y.
  • A moderator variable (Z) implies that the effect of X on Y is NOT consistent across the distribution of Z.

Second, looking at it from a correlational perspective:

  • Assume a correlation between variable X and variable Y.
  • A moderator variable (Z) implies that the correlation between X and Y is NOT consistent across the distribution of Z (see the simulated sketch below).
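To see what "NOT consistent across the distribution of Z" looks like in practice, here's a minimal simulated sketch (variable names and coefficients invented for illustration): we make the slope of y on x depend on z, and the x:z interaction in the fitted model picks it up.

> # Simulate data where the effect of x on y grows with z
> set.seed(42)
> n <- 200
> x <- rnorm(n)
> z <- rnorm(n)
> y <- 2 + 1 * x + 0.5 * z + 1.5 * x * z + rnorm(n)
> # In R's formula syntax, x * z expands to x + z + x:z;
> # a significant x:z coefficient signals moderation
> summary(lm(y ~ x * z))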

Now, before doing a hierarchical, moderated, multiple regression analysis in R, always be sure to check whether your data satisfies the model assumptions!

Checking the assumptions

There are several assumptions the data has to satisfy before the moderation analysis is done:

  • The dependent variable (Y) should be measured on a continuous scale (i.e., it should be an interval or ratio variable).
  • The data must have one independent variable (X), which is either continuous (i.e., an interval or ratio variable) or categorical (i.e., a nominal or ordinal variable), and one moderator variable (Z).
  • The residuals must not be autocorrelated. This can be checked using the Durbin-Watson test in R (see the sketch after this list).
  • It goes without saying that there needs to be a linear relationship between the dependent variable (Y) and the independent variable (X). There are a number of ways to check for linear relationships, such as creating a scatterplot.
  • The data needs to show homoscedasticity. This assumption means that the variance around the regression line is roughly the same for all combinations of independent (X) and moderator (Z) variables.
  • The data must not show multicollinearity within the independent variables (X). This usually occurs when two or more independent variables are highly correlated with each other. It can be checked visually by plotting a correlation heatmap.
  • The data ideally should not have any significant outliers, highly influential points, or many missing values. Highly influential points can be detected using studentized residuals.
  • The last assumption is that the residual errors are approximately normally distributed.
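Here's a rough sketch of what these checks might look like in R. It assumes a fitted lm object named fit (a hypothetical name; you'll see the real models built later in this post) and the lmtest and car packages:

> library(lmtest)               # provides dwtest()
> library(car)                  # provides vif()
> dwtest(fit)                   # Durbin-Watson test for autocorrelated residuals
> vif(fit)                      # variance inflation factors flag multicollinearity
> plot(fit, which = 1)          # residuals vs. fitted: linearity and homoscedasticity
> plot(fit, which = 2)          # Q-Q plot: approximate normality of residuals
> shapiro.test(residuals(fit))  # formal test of residual normality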

Demonstrating hierarchical, moderated, multiple regression analysis in R

Now that we know what moderation is, let's start with a demonstration of how to do hierarchical, moderated, multiple regression analysis in R.

> ## Read in the csv file
> dat <- read.csv(file.choose(), header = TRUE)
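Note that file.choose() opens an interactive file picker. If you're scripting rather than working interactively, you'd read from an explicit path instead (the filename below is hypothetical):

> # Non-interactive alternative (hypothetical filename)
> dat <- read.csv("stereotype_threat.csv", header = TRUE)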

Now that the data is loaded into the R environment, I'll talk about it a bit. The data is based on the idea of stereotype threat. A group of students is set up for an IQ test. When the students come up to take the test, they are given implicit or explicit threats, such as "women usually perform worse than men on this test". This, in turn, tends to affect the performance of the women candidates.

Here, the independent variable (X) is the experimental manipulation (threat) and the dependent variable (Y) is the IQ test score. The variable working memory capacity (wm) is the moderator. We will investigate how the threat affects the IQ test scores, with the idea that working memory (wm) may influence this relationship (i.e., participants with a strong working memory may not be impacted by the stereotype threat). In other words, the moderator may show that the stereotype threat works on some people and not on others.

The three threat categories are:

  1. Explicit threat
  2. Implicit threat
  3. No threat (control)

Each group consists of 50 students. Let's look at the structure of the data:

> str(dat)
'data.frame':	150 obs. of  7 variables:
 $ subject    : int  1 2 3 4 5 6 7 8 9 10 ...
 $ condition  : Factor w/ 3 levels "control","threat1",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ iq         : int  134 121 86 74 80 105 100 121 138 104 ...
 $ wm         : int  91 145 118 105 96 133 99 97 96 105 ...
 $ WM.centered: num  -8.08 45.92 18.92 5.92 -3.08 ...
 $ d1         : int  0 0 0 0 0 0 0 0 0 0 ...
 $ d2         : int  0 0 0 0 0 0 0 0 0 0 ...

Looking at the structure of the data frame, the condition variable is categorical with three levels, as already discussed. Since there are three categories, we have to create n − 1 dummy variables, where n is the number of categories. So d1 and d2 are the dummy-encoded variables: when d1 and d2 are both 0, the condition is control; when d1 is 1, the condition is threat1; and when d2 is 1, the condition is threat2.
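For reference, if the dummy codes hadn't shipped with the file, they could be rebuilt from condition like this (a sketch using the factor levels shown in str(dat) above):

> # Rebuild the dummy codes from the factor levels
> d1 <- ifelse(dat$condition == 'threat1', 1, 0)
> d2 <- ifelse(dat$condition == 'threat2', 1, 0)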

> head(dat)
  subject condition  iq  wm WM.centered d1 d2
1       1   control 134  91       -8.08  0  0
2       2   control 121 145       45.92  0  0
3       3   control  86 118       18.92  0  0
4       4   control  74 105        5.92  0  0
5       5   control  80  96       -3.08  0  0
6       6   control 105 133       33.92  0  0

Now that we know what the data looks like, I'm going to plot a boxplot of IQ by test condition.

> library(ggplot2)
> ggplot(dat, aes(condition, iq)) + geom_boxplot()

Looking at the three groups in the boxplot, it is quite noticeable that the IQ score decreases when there is a threat, and that the severity of the threat (implicit vs. explicit) also affects the IQ scores a little. So it seems the presence, and to a lesser extent the severity, of a threat lowers IQ scores by a large margin.

We can also look at a scatter plot:

> ggplot(dat, aes(wm, iq, color = condition)) + geom_point()

Looking at the scatter plot, there is a clear distinction between the control cluster and the two threat clusters. As in the box plot, the scatter plot also shows that people who took the exam in the control condition scored better on the IQ test than the other two groups.

With this established, some correlation values will help quantify the pattern. The correlations have to be computed separately for each threat group.

> library(dplyr)
> # Make the subset for the group condition 'control'
> mod_control <- dat %>% filter(condition == 'control')
> # Make the subset for the group condition 'threat1'
> mod_threat1 <- dat %>% filter(condition == 'threat1')
> # Make the subset for the group condition 'threat2'
> mod_threat2 <- dat %>% filter(condition == 'threat2')
> # Calculate the correlations
> cor(mod_control$iq, mod_control$wm, method = 'pearson')
[1] 0.1079827
> cor(mod_threat1$iq, mod_threat1$wm, method = 'pearson')
[1] 0.7231095
> cor(mod_threat2$iq, mod_threat2$wm, method = 'pearson')
[1] 0.6772917
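As a side note, the same three correlations can be computed in a single dplyr pipeline instead of subsetting by hand:

> # Per-condition Pearson correlation between iq and wm
> dat %>% group_by(condition) %>% summarise(r = cor(iq, wm))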

There is a really strong correlation between IQ and WMC in the threat conditions, but not in the control condition.
Now let's build a model without moderation and a model with moderation. When both the independent variable (X) and the moderator (Z) are continuous, the general form is:

Y = β0 + β1X + β2Z + β3(X * Z)+ϵ

With β3 we are testing for a non-additive effect: if β3 is significant, there is a moderation effect. This model is not valid when the variable X is categorical.

When the independent variable (X) is categorical and the moderator variable (Z) is continuous, the model changes a bit:

Y = β0 + β1(D1)+β2(D2)+β3Z + β4(D1 * Z)+β5(D2 * Z)+ϵ

With this specific data, the independent variable is the stereotype threat with three levels. I have already explained how dummy encoding is done, so D1 and D2 encode the three levels in the model. The products of the dummy codes and WMC are used to look for the moderation effect.
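As an aside, R's formula interface can produce the same dummy codes and interaction terms automatically from the factor. A sketch of the equivalent call (the model name is mine, not from the article):

> # Equivalent model: R dummy-codes `condition` and builds the interactions
> model_auto <- lm(iq ~ wm * condition, data = dat)

In what follows, though, we'll build the dummies and products by hand so that every term in the equation above stays visible.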

Let’s run the R code for the models.

> # Model without moderation
> model_1 <- lm(dat$iq ~ dat$wm + dat$d1 + dat$d2)
> # Get the summary of model_1
> summary(model_1)

Call:
lm(formula = dat$iq ~ dat$wm + dat$d1 + dat$d2)

Residuals:
    Min      1Q  Median      3Q     Max 
-47.339  -7.294   0.744   7.608  42.424 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)  59.78635    7.14360   8.369 4.30e-14 ***
dat$wm        0.37281    0.06688   5.575 1.16e-07 ***
dat$d1      -45.20552    2.94638 -15.343  < 2e-16 ***
dat$d2      -46.90735    2.99218 -15.677  < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 14.72 on 146 degrees of freedom
Multiple R-squared: 0.7246, Adjusted R-squared: 0.719
F-statistic: 128.1 on 3 and 146 DF, p-value: < 2.2e-16
> # Create new predictor variables for testing moderation
> # (product of the working memory and the threat condition dummies)
> wm_d1 <- dat$wm * dat$d1
> wm_d2 <- dat$wm * dat$d2
> # Model with moderation
> model_2 <- lm(dat$iq ~ dat$wm + dat$d1 + dat$d2 + wm_d1 + wm_d2)
> # Get the summary of model_2
> summary(model_2)

Call:
lm(formula = dat$iq ~ dat$wm + dat$d1 + dat$d2 + wm_d1 + wm_d2)

Residuals:
    Min      1Q  Median      3Q     Max 
-50.414  -7.181   0.420   8.196  40.864 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)  85.5851    11.3576   7.535 4.95e-12 ***
dat$wm        0.1203     0.1094   1.100  0.27303    
dat$d1      -93.0952    16.8573  -5.523 1.52e-07 ***
dat$d2      -79.8970    15.4772  -5.162 7.96e-07 ***
wm_d1         0.4716     0.1638   2.880  0.00459 ** 
wm_d2         0.3288     0.1547   2.125  0.03529 *  
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 14.38 on 144 degrees of freedom
Multiple R-squared: 0.7409, Adjusted R-squared: 0.7319
F-statistic: 82.35 on 5 and 144 DF, p-value: < 2.2e-16

In the model without moderation, every predictor has a significant effect on the IQ scores: the effect of stereotype threat is strongly negative, while the effect of working memory capacity is slightly positive. In the model with moderation, the interaction terms wm_d1 and wm_d2 are significant, so there is indeed some moderation effect in the data. Now that both models are ready, we have to compare them, and ANOVA is a good way to compare nested models.
> # Compare model_1 and model_2 with the help of the ANOVA function
> anova(model_1, model_2)
Analysis of Variance Table

Model 1: dat$iq ~ dat$wm + dat$d1 + dat$d2
Model 2: dat$iq ~ dat$wm + dat$d1 + dat$d2 + wm_d1 + wm_d2
  Res.Df   RSS Df Sum of Sq      F  Pr(>F)  
1    146 31655                              
2    144 29784  2    1871.3 4.5238 0.01243 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

The p-value (0.01243) indicates that the null hypothesis is rejected: there is a significant difference between the two models, so the effect of the moderator is significant. This tells us that:

  • People with high WMC were not affected by the stereotype threat.
  • People with low WMC, on the other hand, were affected by the stereotype threat and scored low on the IQ test (the slope calculation below makes this concrete).
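To make that interpretation concrete, the condition-specific slopes of working memory can be recovered from model_2's coefficients (the names match the lm call above; approximate values follow from the summary output):

> # Slope of wm on iq within each condition, from model_2's coefficients
> b <- coef(model_2)
> b["dat$wm"]                # slope in the control group: ~0.12 (not significant)
> b["dat$wm"] + b["wm_d1"]   # slope under threat1: ~0.59
> b["dat$wm"] + b["wm_d2"]   # slope under threat2: ~0.45

The slope of wm on iq is much steeper under threat, which is exactly the buffering pattern described above.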

Finally, let's plot the scatter plots along with regression lines. The first plot shows the first-order (primary) effect of WMC on IQ:

> # Illustration of the primary effects of WMC on IQ
> ggplot(dat, aes(wm, iq)) + geom_smooth(method = 'lm', color = 'brown') +
+   geom_point(aes(color = condition))

The second scatter plot illustrates the moderation effect of WMC on IQ:

> # Illustration of the moderation effect of WMC on IQ
> ggplot(dat, aes(wm, iq)) + geom_smooth(aes(group = condition), method = 'lm', se = T, color = 'brown') + geom_point(aes(color = condition))

We can clearly see a change in slopes across the conditions, which indicates moderation.

This was an interesting use case for hierarchical, moderated, multiple regression analysis in R, huh? In what ways might you consider applying this analytical method in your own work? Write your ideas in the comments section below!


Author Bio:

This article was contributed by Perceptive Analytics. Rohit Mattah, Chaitanya Sagar, Prudhvi Potuganti and Saneesh Veetil contributed to this article. Perceptive Analytics provides data analytics, data visualization, business intelligence and reporting services to e-commerce, retail, healthcare and pharmaceutical industries. Our client roster includes Fortune 500 and NYSE listed companies in the USA and India.
