The purpose of this activity is to introduce/review regression in R.

About the data

The data set, HolzingerSwineford1939.csv, contains 26 tests measuring students’ (n = 301) spatial, verbal, mental speed, memory, and mathematical abilities. The data come from two schools, the Pasteur School and the Grant-White School. We will use this data set throughout the semester. Like the Political Democracy data set, this is a classic data set for factor analyses.

The first thing we need to do is to read the data set into R. You can download and read in the data at the same time.

hs.data <- read.csv("https://github.com/cddesja/epsy8266/raw/master/course_materials/data/HolzingerSwineford1939.csv")
names(hs.data)
##  [1] "id"       "gender"   "grade"    "agey"     "agem"     "school"  
##  [7] "visual"   "cubes"    "paper"    "flags"    "general"  "paragrap"
## [13] "sentence" "wordc"    "wordm"    "addition" "code"     "counting"
## [19] "straight" "wordr"    "numberr"  "figurer"  "object"   "numberf" 
## [25] "figurew"  "deduct"   "numeric"  "problemr" "series"   "arithmet"
## [31] "paperrev" "flagssub"

The output from names() above shows the variables in this data set: the first six columns (id, gender, grade, agey, agem, and school) are identifiers and demographics, and the remaining columns are the test scores.

Statistical Model

Suppose we’re interested in predicting a student’s score on the visual perception test given the student’s cubes test score. Specifically, this model can be written as:

\[ Y_i = \beta_0 + \beta_1X_i + e_i \]

where \(Y_i\) is student i’s score on the visual perception test, \(X_i\) is student i’s score on the cubes test, and \(e_i\) is the residual error. Across the students, we can rewrite our model as:

\[ E(Y | X) = \beta_0 + \beta_1 X \]

\[ Var(Y | X) = \sigma^2 \]

where we call the first equation the mean function and the second the variance function. Because we are almost always interested in inference, we also make a normality assumption. Specifically,

\[ Y | X \sim N(\beta_0 + \beta_1 X, \sigma^2) \]

This is why we care about our residuals: we assume they are normally distributed with constant variance (note the lack of a subscript on \(\sigma^2\)). That is, the residuals have this distribution:

\[ e_i \sim N(0, \sigma^2) \]
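To make these assumptions concrete, here is a minimal simulation sketch. The intercept, slope, and residual standard deviation below are made-up values for illustration; they are not estimates from the Holzinger-Swineford data.

set.seed(8266)
n     <- 301
beta0 <- 19    # hypothetical intercept
beta1 <- 0.5   # hypothetical slope
sigma <- 7     # hypothetical residual standard deviation
x <- rnorm(n, mean = 24, sd = 5)              # a made-up predictor
y <- beta0 + beta1 * x + rnorm(n, 0, sigma)   # Y | X ~ N(beta0 + beta1 * X, sigma^2)
plot(x, y)  # roughly linear with constant spread around the line

Simulating data like this and fitting a regression to them is a handy way to check that you understand what the model assumes.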

Initial Data Analysis

Before we run the actual model, it’s a good idea to look at our data. Numeric summaries can be obtained using the summary() function.

summary(hs.data$visual)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    4.00   25.00   30.00   29.61   34.00   51.00
summary(hs.data$cubes)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    9.00   21.00   24.00   24.35   27.00   37.00
sd(hs.data$visual)
## [1] 7.004593
sd(hs.data$cubes)
## [1] 4.709802

The $ enables us to access a variable from a data set. So, hs.data$visual enables us to access the visual perception test that is contained within the hs.data data set.
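There are a couple of equivalent ways to pull a single variable out of a data frame; which one you use is a matter of taste.

head(hs.data$visual)       # the $ shortcut used above
head(hs.data[["visual"]])  # double-bracket indexing by name
head(hs.data[, "visual"])  # column indexing by name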

We can also plot our data. For plotting continuous variables, you could use a histogram, density curve, or a stem-and-leaf plot. Be careful with histograms as they can be a little misleading depending on the size of your bins (see the bin-width example after the plots below). It can also be helpful to plot the observations in descending (or ascending) order, which lets us see the distribution and look for outliers. It’s pretty similar to what we can see in the stem-and-leaf plot. Below, we create these plots for the visual perception test.

hist(hs.data$visual)

plot(density(hs.data$visual))

stem(hs.data$visual)
## 
##   The decimal point is at the |
## 
##    4 | 0
##    6 | 
##    8 | 
##   10 | 000
##   12 | 0
##   14 | 
##   16 | 00000000
##   18 | 0000000000000
##   20 | 00000000000000
##   22 | 000000000000000
##   24 | 000000000000000000000000000
##   26 | 00000000000000000000000
##   28 | 0000000000000000000000000000000000000
##   30 | 0000000000000000000000000000000000000
##   32 | 000000000000000000000000000000000
##   34 | 00000000000000000000000000000
##   36 | 00000000000000000000
##   38 | 00000000000000000000
##   40 | 0000000000
##   42 | 0000
##   44 | 00000
##   46 | 
##   48 | 
##   50 | 0
plot(sort(hs.data$visual, decreasing = TRUE), pch = ".")
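As a quick illustration of the bin-width caution above, here is a sketch comparing a coarse and a fine histogram of the same variable; the breaks values are arbitrary choices.

par(mfrow = c(1, 2))
hist(hs.data$visual, breaks = 5, main = "Few bins")   # coarse bins
hist(hs.data$visual, breaks = 30, main = "Many bins") # fine bins
par(mfrow = c(1, 1))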

Q1. How would you describe the distribution of scores on the visual perception test? Would you describe it as skewed or symmetric? Kurtotic? Are there outliers? Normal?

Q2. Adapt the code for the cubes test and describe this distribution. Below I have started the code.

hist(  )
plot(density(  ))
stem(  )
plot(sort(   ), pch = ".")

The next plot I would suggest making is a scatter plot.

plot(visual ~ cubes, data = hs.data)

We can modify this plot by adding better labels, changing the color, and adding a LOWESS smoother. A LOWESS smoother is a locally weighted regression line. Conceptually, it works by calculating the fitted value at a given point (\(x_i\)) from a weighted least squares regression that weights observations closer to \(x_i\) more heavily than those further away. It can be helpful for ascertaining linearity/non-linearity.

plot(visual ~ cubes, hs.data,
     xlab = "Cubes Test", 
     ylab = "Visual Perception Test",
     col = "dodgerblue")
lines(lowess(x = hs.data$cubes, y = hs.data$visual))
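The wiggliness of the LOWESS line is controlled by the smoother span, the f argument to lowess() (the default is 2/3); smaller values follow the data more closely. A quick sketch with arbitrary spans:

plot(visual ~ cubes, hs.data,
     xlab = "Cubes Test",
     ylab = "Visual Perception Test",
     col = "dodgerblue")
lines(lowess(x = hs.data$cubes, y = hs.data$visual, f = 2/3))              # default span
lines(lowess(x = hs.data$cubes, y = hs.data$visual, f = 1/4), col = "red") # smaller span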

Q3. What does this scatter plot tell us? What are we looking for in this plot? (think assumptions for bivariate scatter plots)

Fitting our model

Assuming we’re satisfied with the data, we can then fit a simple linear regression.

mod <- lm(visual ~ cubes, data = hs.data)

I would encourage you to assess your model assumptions prior to looking at your actual output because we are nearly always interested in inference. We can get some good plots to help with this using the plot() function.

# the plot function will spit out 4 plots by default
# we can view all four plots at the same time using the par 
# command
par(mfrow = c(2, 2))
plot(mod)

The plot in the upper-left is a residual plot, the upper-right is a Q-Q plot, the lower-left plots the square root of the standardized residuals against the fitted values (which can be helpful for detecting non-constant variance), and the lower-right plots the residuals against leverage. Plots of Cook’s distance can also be helpful for assessing influence (similar to leverage).
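If you only want one of these plots, plot() takes a which argument, and the same plots can be built by hand from fitted(), resid(), and rstandard(). A sketch:

plot(mod, which = 1)  # residuals vs. fitted values
plot(mod, which = 2)  # normal Q-Q plot

plot(fitted(mod), resid(mod),
     xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, lty = 2)
qqnorm(rstandard(mod))
qqline(rstandard(mod))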

par(mfrow = c(1, 1)) # return the plotting device how it was originally
plot(mod, which = 4)

For Cook’s distance, you want to look for observations that are far from the rest of the observations. Some people recommend a value of 1 or 4 / N as a cutoff; I don’t. I think seeing the points is more helpful. Here those observations are 67, 71, and 268, where the values correspond to row numbers in the data set. As a sensitivity analysis, we can drop these observations and see whether they affect the model findings (a sketch is shown below). This is a good idea, but we shouldn’t just remove them from the analysis.
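A minimal sketch of that sensitivity analysis, dropping rows 67, 71, and 268 with negative indexing and refitting:

mod.sens <- lm(visual ~ cubes, data = hs.data[-c(67, 71, 268), ])
summary(mod.sens)  # compare the coefficients to summary(mod)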

Q4. What do you think about the model? Please comment on non-constant variance, normality, the presence of outliers or influential points.

Presuming we’re okay with the model, we can look at the output.

summary(mod)
## 
## Call:
## lm(formula = visual ~ cubes, data = hs.data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -22.8055  -4.2478   0.6567   4.4256  20.6567 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 18.84553    2.03671   9.253  < 2e-16 ***
## cubes        0.44222    0.08212   5.385 1.47e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.699 on 299 degrees of freedom
## Multiple R-squared:  0.08841,    Adjusted R-squared:  0.08537 
## F-statistic:    29 on 1 and 299 DF,  p-value: 1.467e-07

This model can be written as:

\[ \hat{Y}_i = 18.845 + 0.442 X_i \]
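We can plug a cubes score into this equation by hand or with predict(); the score of 24 below is just an arbitrary example near the median.

predict(mod, newdata = data.frame(cubes = 24))
# by hand: 18.845 + 0.442 * 24, roughly 29.5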

Q5. How do you interpret the intercept? Is it meaningful here (HINT: look at summary(hs.data$cubes) if you don’t remember)?

Q6. How do you interpret the slope?

Confidence intervals can be obtained by:

confint(mod)
##                 2.5 %     97.5 %
## (Intercept) 14.837426 22.8536264
## cubes        0.280619  0.6038276
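These are the usual estimate ± t × SE intervals, and we can reproduce the interval for cubes by hand using the estimate and standard error from summary(mod):

est <- 0.44222               # slope estimate for cubes
se  <- 0.08212               # its standard error
tcrit <- qt(0.975, df = 299) # critical t value with n - 2 = 299 df
est + c(-1, 1) * tcrit * se  # should match the confint() output for cubes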

Q7. How do you interpret the p-value and confidence intervals for cubes?

If we want to get standardized regression coefficients, we need to z-score both visual and cubes. This can be done either beforehand or during modeling.

hs.data$visual.s <- scale(hs.data$visual)

# finish the code for cubes
hs.data$cubes.s <- scale(  )

Then we can refit the model. Because we’ve standardized these variables, we no longer need an intercept, as it will be 0. The intercept can be omitted by adding -1 to the model syntax.

# two things are missing: the dependent variable and the independent variable. Put in the correct values,
# i.e., replace the <HERE>
mod.std <- lm(<HERE> ~ -1 + <HERE>, data = <HERE>)
summary(mod.std)

Q8. What is this slope estimate? How do you interpret it? If we square the slope, what does this tell us? What is 1 minus the square of the slope?

Q9. Calculate the confidence intervals for mod.std.

We can overlay our statistical model onto raw data using the following code:

plot(visual ~ cubes, hs.data,
     xlab = "Cubes Test", 
     ylab = "Visual Perception Test",
     col = "dodgerblue")
abline(coef(mod), col = "red")

Looks pretty good!

SLR as an SEM

We can also do this model within an SEM framework using lavaan. First, let’s write the syntax and then view the model quickly with the semPaths() function in the semPlot package.

library(lavaan)
## This is lavaan 0.6-3
## lavaan is BETA software! Please report any bugs.
sem.mod <- '
visual ~ 1 + cubes
'
sem.fit <- sem(model = sem.mod, data = hs.data)

Notice the similarity to the lm function. We can create a path diagram of our model.

library(semPlot)
semPaths(object = sem.fit, 
         color = c(man = "orange", # sets the color for the manifest
                   lat = "green", # sets the color for the latent variables
                   int = "white"), # set the color for the intercept
         edge.color = "black", # sets the color of the path
         what = "est", # plots the unstandardized coefficients
         fade = FALSE, # don't fade the paths based on the magnitude of the estimates
         esize = 1, # force the line width to be equivalent
         rotation = 2 # use a left-right orientation instead of top-down
         )

The triangle represents the intercept.

We can view the output from lavaan, as we can with many R objects, using the summary method.

summary(sem.fit, standardized = TRUE)
## lavaan 0.6-3 ended normally after 16 iterations
## 
##   Optimization method                           NLMINB
##   Number of free parameters                          3
## 
##   Number of observations                           301
## 
##   Estimator                                         ML
##   Model Fit Test Statistic                       0.000
##   Degrees of freedom                                 0
## 
## Parameter Estimates:
## 
##   Information                                 Expected
##   Information saturated (h1) model          Structured
##   Standard Errors                             Standard
## 
## Regressions:
##                    Estimate  Std.Err  z-value  P(>|z|)   Std.lv  Std.all
##   visual ~                                                              
##     cubes             0.442    0.082    5.403    0.000    0.442    0.297
## 
## Intercepts:
##                    Estimate  Std.Err  z-value  P(>|z|)   Std.lv  Std.all
##    .visual           18.846    2.030    9.284    0.000   18.846    2.695
## 
## Variances:
##                    Estimate  Std.Err  z-value  P(>|z|)   Std.lv  Std.all
##    .visual           44.578    3.634   12.268    0.000   44.578    0.912

Q10. Are these results the same as those obtained from lm?

We can also obtain the covariance and correlation between these variables:

cov(hs.data$visual, hs.data$cubes)
## [1] 9.809502
cor(hs.data$visual, hs.data$cubes)
## [1] 0.2973455
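These quantities are directly related to the regression coefficients: the slope is the covariance divided by the variance of the predictor, and the intercept is the mean of Y minus the slope times the mean of X. A quick check:

b1 <- cov(hs.data$visual, hs.data$cubes) / var(hs.data$cubes)
b0 <- mean(hs.data$visual) - b1 * mean(hs.data$cubes)
c(b0, b1)  # should match coef(mod), roughly 18.85 and 0.44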

Q12. What happens to the intercept if we mean-center cubes? HINT: Look at the formula for the intercept.

Finally, we could use matrix algebra to get these parameters. Let Y be a vector of length 301 containing the visual test scores, and let X be a matrix with 301 rows and two columns, where the first column is all 1s (for the intercept) and the second column contains the cubes test scores. We might remember that the least squares estimator for b (the unstandardized regression coefficients) can be found as follows:

\[ \begin{align} Y &= Xb \\ X^TY &= X^TXb \\ (X^TX)^{-1}X^TY &= (X^TX)^{-1}X^TXb \\ (X^TX)^{-1}X^TY &= b \end{align} \]

Note that \((X^TX)^{-1}X^TX = I\), i.e., the identity matrix. In R,

X <- cbind(1, hs.data$cubes)
Y <- hs.data$visual
head(X)
##      [,1] [,2]
## [1,]    1   31
## [2,]    1   21
## [3,]    1   21
## [4,]    1   31
## [5,]    1   19
## [6,]    1   20
head(Y)
## [1] 20 32 27 32 29 32
solve(t(X) %*% X) %*% t(X) %*% Y
##            [,1]
## [1,] 18.8455260
## [2,]  0.4422233

The solve() function allows us to invert a matrix, the t() function transposes a matrix, and %*% is the matrix multiplication operator.
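An equivalent, and numerically friendlier, way to solve the normal equations uses crossprod(); both should reproduce the coefficients from lm().

solve(crossprod(X), crossprod(X, Y))  # same as solve(t(X) %*% X) %*% t(X) %*% Y
coef(mod)                             # should match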