The purpose of this activity is to introduce/review regression in R.
The data set, HolzingerSwineford1939.csv, contains 26 tests measuring students’ (n = 301) spatial, verbal, mental speed, memory, and mathematical abilities. The data come from two schools, the Pasteur School and the Grant-White School. We will use this data set throughout the semester. Like the Political Democracy data set, this is a classic data set for factor analysis.
The first thing we need to do is to read the data set into R. You can download and read in the data at the same time.
hs.data <- read.csv("https://github.com/cddesja/epsy8266/raw/master/course_materials/data/HolzingerSwineford1939.csv")
names(hs.data)
## [1] "id" "gender" "grade" "agey" "agem" "school"
## [7] "visual" "cubes" "paper" "flags" "general" "paragrap"
## [13] "sentence" "wordc" "wordm" "addition" "code" "counting"
## [19] "straight" "wordr" "numberr" "figurer" "object" "numberf"
## [25] "figurew" "deduct" "numeric" "problemr" "series" "arithmet"
## [31] "paperrev" "flagssub"
The following variables are in this data set:
id
: student ID

gender
: student’s gender

grade
: student’s grade in school

agey
: student’s age (years)

agem
: student’s age (months)

school
: student’s school

visual
: visual perception test

cubes
: cubes test

paper
: paper form board test

flags
: lozenges test

general
: general information test

paragrap
: paragraph comprehension test

sentence
: sentence completion test

wordc
: word classification test

wordm
: word meaning test

addition
: addition test

code
: code test

counting
: counting groups of dots test

straight
: straight and curved capitals test

wordr
: word recognition test

numberr
: number recognition test

figurer
: figure recognition test

object
: object-number test

numberf
: number-figure test

figurew
: figure-word test

deduct
: deduction test

numeric
: numerical puzzles test

problemr
: problem reasoning test

series
: series completion test

arithmet
: Woody-McCall mixed fundamentals, form I test

paperrev
: additional paper form board test

flagssub
: flags test
The spatial tests consist of visual, cubes, paper, flags, paperrev, and flagssub. The additional paper form board test (paperrev) can be used as a substitute for the paper form board test (paper). The flags test (flagssub) is a possible substitute for the lozenges test (flags).

The verbal tests consist of general, paragrap, sentence, wordc, and wordm.

The speed tests consist of addition, code, counting, and straight.

The memory tests consist of wordr, numberr, figurer, object, numberf, and figurew.

The mathematical-ability tests consist of deduct, numeric, problemr, series, and arithmet.
Suppose we’re interested in predicting a student’s score on the visual perception test given the student’s cubes test score. Specifically, this model can be written as:
\[ Y_i = \beta_0 + \beta_1X_i + e_i \]
where \(Y_i\) is student i’s score on the visual perception test, \(X_i\) is student i’s score on the cubes test, and \(e_i\) is the residual error. Across the students, we can rewrite our model as:
\[ E(Y | X) = \beta_0 + \beta_1 X \]
\[ Var(Y | X) = \sigma^2 \]
where we call the first equation the mean function and the second the variance function. We are almost always interested in inference, and because of this we make a normality assumption. Specifically,
\[ Y | X \sim N(\beta_0 + \beta_1 X, \sigma^2) \]
This is why we care about our residuals. We assume they are normally distributed and have constant variance (note the lack of a subscript for \(\sigma^2\)). Residuals have this distribution:
\[ e_i \sim N(0, \sigma^2) \]
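To make this assumption concrete, here is a small simulation sketch. The parameter values (beta0, beta1, sigma) and the distribution of the predictor are made up purely for illustration; they are not estimates from the data.

# simulate from Y | X ~ N(beta0 + beta1 * X, sigma^2) with arbitrary illustrative values
set.seed(123)
n <- 301
beta0 <- 10; beta1 <- 0.5; sigma <- 3   # made-up parameter values
x <- rnorm(n, mean = 25, sd = 5)        # made-up predictor values
y <- beta0 + beta1 * x + rnorm(n, mean = 0, sd = sigma)
plot(y ~ x)                             # the scatter follows the assumed mean function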
Before we run the actual model, it’s a good idea to look at our data. Numeric summaries can be obtained using the summary() function.
summary(hs.data$visual)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 4.00 25.00 30.00 29.61 34.00 51.00
summary(hs.data$cubes)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 9.00 21.00 24.00 24.35 27.00 37.00
sd(hs.data$visual)
## [1] 7.004593
sd(hs.data$cubes)
## [1] 4.709802
The $ enables us to access a variable from a data set. So, hs.data$visual enables us to access the visual perception test that is contained within the hs.data data set.
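In case the $ syntax is new, here are a couple of equivalent base-R ways to pull out the same column (just a sketch; all three return the visual perception scores):

head(hs.data$visual)       # $ with the column name
head(hs.data[["visual"]])  # double brackets with a character string
head(hs.data[, "visual"])  # row/column indexing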
We can also plot our data. For plotting continuous variables, you could use a histogram, a density curve, or a stem-and-leaf plot. Be careful with histograms, as they can be a little misleading depending on the size of your bins. It can also be helpful to plot the observations in descending (or ascending) order; this lets us see the distribution and look for outliers, and it is pretty similar to what we can see in the stem-and-leaf plot. Below, we create these plots for the visual perception test.
hist(hs.data$visual)
plot(density(hs.data$visual))
stem(hs.data$visual)
##
## The decimal point is at the |
##
## 4 | 0
## 6 |
## 8 |
## 10 | 000
## 12 | 0
## 14 |
## 16 | 00000000
## 18 | 0000000000000
## 20 | 00000000000000
## 22 | 000000000000000
## 24 | 000000000000000000000000000
## 26 | 00000000000000000000000
## 28 | 0000000000000000000000000000000000000
## 30 | 0000000000000000000000000000000000000
## 32 | 000000000000000000000000000000000
## 34 | 00000000000000000000000000000
## 36 | 00000000000000000000
## 38 | 00000000000000000000
## 40 | 0000000000
## 42 | 0000
## 44 | 00000
## 46 |
## 48 |
## 50 | 0
plot(sort(hs.data$visual, decreasing = TRUE), pch = ".")
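As noted above, histograms can mislead depending on bin width. A quick way to see this is to compare a few choices of breaks; the particular numbers of bins below are arbitrary.

# the same variable with three different bin choices -- the impression changes with the bins
par(mfrow = c(1, 3))
hist(hs.data$visual, breaks = 5, main = "breaks = 5")
hist(hs.data$visual, breaks = 15, main = "breaks = 15")
hist(hs.data$visual, breaks = 40, main = "breaks = 40")
par(mfrow = c(1, 1))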
# now create the same plots for the cubes test by filling in the blanks
hist( )
plot(density( ))
stem( )
plot(sort( ), pch = ".")
The next plot I would suggest making is a scatter plot.
plot(visual ~ cubes, data = hs.data)
We can modify this plot by adding better labels, changing the color, and adding a LOWESS smoother. A LOWESS smoother is a locally weighted regression line. Conceptually, it works by calculating the fitted value at a given point (\(x_i\)) from a weighted least squares regression that weights observations closer to \(x_i\) more heavily than those further away. It can be helpful for ascertaining linearity/non-linearity.
plot(visual ~ cubes, hs.data,
xlab = "Cubes Test",
ylab = "Visual Perception Test",
col = "dodgerblue")
lines(lowess(x = hs.data$cubes, y = hs.data$visual))
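The amount of smoothing is controlled by lowess()'s f argument (the smoother span). The values below are only for illustration; smaller values follow the data more closely.

# add two more smoothers with different spans to the existing scatter plot
lines(lowess(x = hs.data$cubes, y = hs.data$visual, f = 1/3), col = "red", lty = 2)
lines(lowess(x = hs.data$cubes, y = hs.data$visual, f = 0.9), col = "darkgreen", lty = 3)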
Assuming we’re satisfied with the data, we can then fit a simple linear regression.
mod <- lm(visual ~ cubes, data = hs.data)
I would encourage you to assess your model assumptions prior to looking at the actual output, because we are nearly always interested in inference. We can get some useful diagnostic plots with the plot() function.
# the plot function will produce 4 plots by default
# we can view all four plots at the same time using the par()
# command
par(mfrow = c(2, 2))
plot(mod)
The plot in the upper-left is a residual plot; the upper-right is a Q-Q plot; the lower-left plots the square root of the absolute standardized residuals, which can be helpful for detecting non-constant variance; and the lower-right plots the residuals against leverage. Plots of Cook’s distance can be helpful for assessing influence (similar to leverage).
par(mfrow = c(1, 1)) # return the plotting device how it was originally
plot(mod, which = 4)
For Cook’s distance, you want to look for observations that are far from the rest. Some people recommend a value of 1 or 4/N as a cutoff; I don’t, because I think seeing the points is more helpful. Here those observations would be 67, 71, and 268, where the labels correspond to row numbers. As a sensitivity analysis, we can drop these observations and see whether they affect the model findings. This is a good idea, but we shouldn’t simply remove them from the analysis.
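A minimal sketch of that sensitivity check: refit the model without rows 67, 71, and 268 and compare the coefficients (the object name mod.drop is just an illustrative choice).

# drop the three flagged rows and see whether the coefficients change much
mod.drop <- lm(visual ~ cubes, data = hs.data[-c(67, 71, 268), ])
coef(mod)
coef(mod.drop)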
Presuming we’re okay with the model, we can look at the output.
summary(mod)
##
## Call:
## lm(formula = visual ~ cubes, data = hs.data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -22.8055 -4.2478 0.6567 4.4256 20.6567
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 18.84553 2.03671 9.253 < 2e-16 ***
## cubes 0.44222 0.08212 5.385 1.47e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.699 on 299 degrees of freedom
## Multiple R-squared: 0.08841, Adjusted R-squared: 0.08537
## F-statistic: 29 on 1 and 299 DF, p-value: 1.467e-07
This model can be written as:
\[ \hat{Y}_i = 18.845 + 0.442 X_i \]
How would you interpret the intercept and the slope (run summary(hs.data$visual) if you don’t remember the scale of the visual perception test)? Confidence intervals can be obtained by:
confint(mod)
## 2.5 % 97.5 %
## (Intercept) 14.837426 22.8536264
## cubes 0.280619 0.6038276
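We can also use the fitted model for prediction; for example, the expected visual score at a hypothetical cubes score of 25, with a 95% confidence interval for the mean:

# predicted mean visual score at cubes = 25
predict(mod, newdata = data.frame(cubes = 25), interval = "confidence")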
If we want to get standardized regression coefficients, we need to z-score both visual and cubes. This can be done either beforehand or during modeling.
hs.data$visual.s <- scale(hs.data$visual)
# finish the code for cubes
hs.data$cubes.s <- scale( )
Then we can refit the model. Because we’ve standardized these variables, we no longer need an intercept, as it will be 0. The intercept can be omitted by adding -1 to the model syntax.
# two things are missing: the dependent variable and the independent variable.
# Put in the correct values, i.e., replace each <HERE>
mod.std <- lm(<HERE> ~ -1 + <HERE>, data = <HERE>)
summary(mod.std)
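As mentioned above, the standardization can also be done during modeling. A sketch of that approach uses scale() directly in the formula and should give the same standardized slope (mod.std2 is just an illustrative name):

# equivalent: standardize inside the formula instead of creating new columns
mod.std2 <- lm(scale(visual) ~ -1 + scale(cubes), data = hs.data)
coef(mod.std2)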
Examine the output of mod.std. We can overlay our statistical model onto the raw data using the following code:
plot(visual ~ cubes, hs.data,
xlab = "Cubes Test",
ylab = "Visual Perception Test",
col = "dodgerblue")
abline(coef(mod), col = "red")
Looks pretty good!
We can also do this model within an SEM framework using lavaan. First, let’s write the syntax and then view the model quickly with the semPaths() function in the semPlot package.
library(lavaan)
## This is lavaan 0.6-3
## lavaan is BETA software! Please report any bugs.
sem.mod <- '
visual ~ 1 + cubes
'
sem.fit <- sem(model = sem.mod, data = hs.data)
Notice the similarity to the lm() function. We can create a path diagram of our model.
library(semPlot)
semPaths(object = sem.fit,
color = c(man = "orange", # sets the color for the manifest variables
lat = "green", # sets the color for the latent variables
int = "white"), # set the color for the intercept
edge.color = "black", # sets the color of the path
what = "est", # plots the unstandardized coefficients
fade = FALSE, # don't fade the paths based on the magnitude of the estimates
esize = 1, # force the line width to be equivalent
rotation = 2 # use a left-right orientation instead of top-down
)
The triangle represents the intercept.
We can view the output from lavaan, as we can with many R objects, using the summary() method.
summary(sem.fit, standardized = TRUE)
## lavaan 0.6-3 ended normally after 16 iterations
##
## Optimization method NLMINB
## Number of free parameters 3
##
## Number of observations 301
##
## Estimator ML
## Model Fit Test Statistic 0.000
## Degrees of freedom 0
##
## Parameter Estimates:
##
## Information Expected
## Information saturated (h1) model Structured
## Standard Errors Standard
##
## Regressions:
## Estimate Std.Err z-value P(>|z|) Std.lv Std.all
## visual ~
## cubes 0.442 0.082 5.403 0.000 0.442 0.297
##
## Intercepts:
## Estimate Std.Err z-value P(>|z|) Std.lv Std.all
## .visual 18.846 2.030 9.284 0.000 18.846 2.695
##
## Variances:
## Estimate Std.Err z-value P(>|z|) Std.lv Std.all
## .visual 44.578 3.634 12.268 0.000 44.578 0.912
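If you prefer the estimates as a data frame (handy for making tables), lavaan also provides the parameterEstimates() function:

# estimates, standard errors, z-values, and confidence intervals as a data frame
parameterEstimates(sem.fit)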
How do these estimates compare to those from lm()? We can also obtain the covariance and correlation between these variables:
cov(hs.data$visual, hs.data$cubes)
## [1] 9.809502
cor(hs.data$visual, hs.data$cubes)
## [1] 0.2973455
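These quantities tie back to the regression: in simple linear regression, the unstandardized slope equals cov(Y, X) / var(X), and the standardized slope equals the correlation. A quick check:

# should reproduce the slope from mod (about 0.442)
cov(hs.data$visual, hs.data$cubes) / var(hs.data$cubes)
# should match the standardized slope (about 0.297)
cor(hs.data$visual, hs.data$cubes)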
Finally, we could use matrix algebra to get these parameters. Let Y be a vector of length 301 containing the visual test scores, and let X be a matrix with two columns and 301 rows, where the first column is a column of 1s (for the intercept) and the second contains the cubes test scores. We might remember that the least squares estimator of b (the unstandardized regression coefficients) can be found as follows:
\[ \begin{align} Y &= Xb \\ X^TY &= X^TXb \\ (X^TX)^{-1}X^TY &= (X^TX)^{-1}X^TXb \\ (X^TX)^{-1}X^TY &= b \end{align} \]
Note that \((X^TX)^{-1}X^TX = I\), i.e., the identity matrix. In R,
X <- cbind(1, hs.data$cubes)
Y <- hs.data$visual
head(X)
## [,1] [,2]
## [1,] 1 31
## [2,] 1 21
## [3,] 1 21
## [4,] 1 31
## [5,] 1 19
## [6,] 1 20
head(Y)
## [1] 20 32 27 32 29 32
solve(t(X) %*% X) %*% t(X) %*% Y
## [,1]
## [1,] 18.8455260
## [2,] 0.4422233
The solve() function allows us to invert a matrix (or solve a linear system), the t() function transposes a matrix, and %*% performs matrix multiplication.
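As a check, these values should match coef(mod). A slightly more numerically stable equivalent solves the normal equations directly with crossprod() instead of explicitly inverting the matrix:

# solve X'X b = X'Y without explicitly computing the inverse of X'X
solve(crossprod(X), crossprod(X, Y))
coef(mod)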