Exploratory Data Analysis (EDA)
using R
Christopher David Desjardins
Examine integrity of data
Check for problems
Explore and identify relationships between variables
Explore the feasibility of the research questions
Identify data needs
Identify what is important and worth pursuing
Make the client :) or :(
## Install dplyr, ggplot2, and shiny
install.packages("dplyr")
install.packages("ggplot2")
install.packages("shiny")
## Load the libraries
library("dplyr")
library("ggplot2")
library("shiny")
We're going to explore the Lahman data set
Lots of different data sets at the team and player level
We'll explore the Teams data set
No, I won't teach you baseball
If you want to follow along
install.packages("Lahman")
## wait and wait; these are large dbs
library("Lahman")
What are the best predictors of wins?
What should we do first?
How might we proceed?
Attempt to understand the structure of the data
names(Teams)
[1] "yearID" "lgID" "teamID" "franchID" "divID" "Rank" "G" "Ghome" "W"
[10] "L" "DivWin" "WCWin" "LgWin" "WSWin" "R" "AB" "H" "X2B"
[19] "X3B" "HR" "BB" "SO" "SB" "CS" "HBP" "SF" "RA"
[28] "ER" "ERA" "CG" "SHO" "SV" "IPouts" "HA" "HRA" "BBA"
[37] "SOA" "E" "DP" "FP" "name" "park" "attendance" "BPF" "PPF"
[46] "teamIDBR" "teamIDlahman45" "teamIDretro"
str(Teams)
'data.frame': 2775 obs. of 48 variables:
$ yearID : int 1871 1871 1871 1871 1871 1871 1871 1871 1871 1872 ...
$ lgID : Factor w/ 7 levels "AA","AL","FL",..: 4 4 4 4 4 4 4 4 4 4 ...
$ teamID : Factor w/ 149 levels "ALT","ANA","ARI",..: 24 31 39 56 90 97 111 136 142 8 ...
$ franchID : Factor w/ 120 levels "ALT","ANA","ARI",..: 13 36 25 56 70 85 91 109 77 9 ...
...
$ teamIDBR : chr "BOS" "CHI" "CLE" "KEK" ...
$ teamIDlahman45: chr "BS1" "CH1" "CL1" "FW1" ...
$ teamIDretro : chr "BS1" "CH1" "CL1" "FW1" ...
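Once the structure is clear, a few quick integrity checks can follow. The sketch below is one possible set of checks, not a complete audit; the names `na_counts`, `dup_rows`, and `bad_records` are made up for illustration.

```r
library(Lahman)  # provides the Teams data frame

# Count missing values per column and list the ten columns with the most
na_counts <- colSums(is.na(Teams))
head(sort(na_counts, decreasing = TRUE), 10)

# Each team-season should appear only once
dup_rows <- sum(duplicated(Teams[, c("yearID", "lgID", "teamID")]))
dup_rows

# Wins plus losses should never exceed games played (ties can make it smaller)
bad_records <- subset(Teams, W + L > G, select = c(yearID, teamID, G, W, L))
nrow(bad_records)
```

Any nonzero count here is a prompt to look at the offending rows, not proof of an error.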
Five stat summary
| | Min | Q1 | Med | Mean | Q3 | Max |
|---|---|---|---|---|---|---|
| W | 37 | 71 | 80 | 79.4035532994924 | 89 | 116 |
| R | 329 | 639 | 699.5 | 699.745558375635 | 762.25 | 1009 |
| AB | 3493 | 5410 | 5498 | 5420.32550761421 | 5566 | 5781 |
| H | 797 | 1349.75 | 1411 | 1403.91941624365 | 1476 | 1684 |
| X2B | 119 | 215 | 249 | 249.119923857868 | 281 | 376 |
| X3B | 11 | 27 | 34 | 34.6414974619289 | 41 | 79 |
| HR | 32 | 113 | 140 | 142.058375634518 | 167 | 264 |
| BB | 275 | 474 | 521 | 523.487944162437 | 572 | 775 |
| SO | 379 | 816 | 925 | 931.015228426396 | 1054 | 1535 |
| SB | 13 | 65 | 91 | 95.3362944162437 | 122 | 341 |
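A table with the same layout as the one above could be built as follows. This is a sketch run on the full Teams data; the values will differ from the table if that table was computed on a subset of seasons. `vars` and `five_stat` are illustrative names.

```r
library(Lahman)  # provides the Teams data frame

# Columns shown in the summary table
vars <- c("W", "R", "AB", "H", "X2B", "X3B", "HR", "BB", "SO", "SB")

# One row per variable: minimum, quartiles, mean, and maximum
five_stat <- t(sapply(Teams[vars], function(x) {
  q <- quantile(x, probs = c(0.25, 0.5, 0.75), na.rm = TRUE, names = FALSE)
  c(Min = min(x, na.rm = TRUE), Q1 = q[1], Med = q[2],
    Mean = mean(x, na.rm = TRUE), Q3 = q[3], Max = max(x, na.rm = TRUE))
}))
five_stat
```

Implausible minimums or maximums in this table are often the first visible sign of a data problem.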
If the data look sensible (i.e., there are no obvious data integrity issues), then what?
Explore bivariate associations
| Predictor | Correlation with Wins |
|---|---|
| RA | -0.616065865960237 |
| ERA | -0.611445802765869 |
| ER | -0.59861680447287 |
| HA | -0.512781920911115 |
| BBA | -0.444945137909955 |
| BB | 0.401595484670558 |
| SHO | 0.465124381461136 |
| Attendance | 0.4764013801237 |
| R | 0.524785625913186 |
| SV | 0.655956165173875 |
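Correlations like these can be computed for every numeric column at once. The sketch below assumes Pearson correlations on the full Teams data, so the values need not match the table exactly; `predictors` and `r_with_wins` are illustrative names.

```r
library(Lahman)  # provides the Teams data frame

# Keep numeric columns only; drop W itself and the season identifier
num_cols <- Teams[sapply(Teams, is.numeric)]
predictors <- num_cols[setdiff(names(num_cols), c("W", "yearID"))]

# Pearson correlation of each predictor with wins, using complete pairs only
r_with_wins <- sapply(predictors, function(x) {
  cor(x, Teams$W, use = "complete.obs")
})

# Order from most negative to most positive
sort(r_with_wins)
```

The strongest candidates sit at both ends of the sorted vector, since a large negative correlation is as informative as a large positive one.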
Marginal plot of Wins
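One way to draw a marginal plot of wins is a histogram. This is a sketch, not necessarily the plot from the slide; the bin width of 5 and the name `p_wins` are arbitrary choices.

```r
library(Lahman)   # provides the Teams data frame
library(ggplot2)

# Histogram of season win totals across all team-seasons
p_wins <- ggplot(Teams, aes(x = W)) +
  geom_histogram(binwidth = 5, colour = "white") +
  labs(x = "Wins", y = "Number of team-seasons")
p_wins
```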
Box and whisker plot: Wins vs. Attendance
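Attendance is continuous, so a box and whisker plot needs it grouped first. The sketch below assumes grouping into attendance quartiles, which may not be how the original plot was built; `teams_att`, `att_group`, and `p_box` are illustrative names.

```r
library(Lahman)   # provides the Teams data frame
library(dplyr)    # for ntile()
library(ggplot2)

# Drop seasons with no recorded attendance, then bin into four equal-sized groups
teams_att <- subset(Teams, !is.na(attendance))
teams_att$att_group <- factor(
  ntile(teams_att$attendance, 4),
  levels = 1:4,
  labels = c("Q1 (lowest)", "Q2", "Q3", "Q4 (highest)")
)

# One box of season wins per attendance quartile
p_box <- ggplot(teams_att, aes(x = att_group, y = W)) +
  geom_boxplot() +
  labs(x = "Attendance quartile", y = "Wins")
p_box
```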
Data reduction
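Principal components analysis is one common data reduction technique; the slide does not say which method was used, so treat this as an assumption. The choice of columns and the names `batting` and `pca` are illustrative.

```r
library(Lahman)  # provides the Teams data frame

# A handful of offensive counting stats, complete cases only
batting <- na.omit(Teams[, c("R", "H", "X2B", "X3B", "HR", "BB", "SO", "SB")])

# Centre and scale so no single stat dominates because of its units
pca <- prcomp(batting, center = TRUE, scale. = TRUE)

# Proportion of variance explained by each component
summary(pca)$importance["Proportion of Variance", ]
```

If the first two or three components carry most of the variance, they can stand in for the original columns in later plots or models.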
Lots of other techniques
k-means clustering
facet plots
added-variable plot
factor analysis
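As an example of the first technique in the list, here is a minimal k-means sketch. The four columns, the choice of three clusters, and the names `feats` and `km` are assumptions made for illustration.

```r
library(Lahman)  # provides the Teams data frame

# Features to cluster on, complete cases only
feats <- na.omit(Teams[, c("R", "RA", "HR", "SB")])

# k-means is distance based, so standardise first; nstart guards against poor starts
set.seed(1)
km <- kmeans(scale(feats), centers = 3, nstart = 20)

# Cluster sizes and the mean of each feature within each cluster
table(km$cluster)
aggregate(feats, by = list(cluster = km$cluster), FUN = mean)
```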
Up next: Inference?
Train, Test
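A train/test split can be sketched as below. The 80/20 ratio, the seed, and the simple linear model on runs scored and runs allowed are illustrative assumptions, not a recommended final model.

```r
library(Lahman)  # provides the Teams data frame

set.seed(42)
dat <- na.omit(Teams[, c("W", "R", "RA")])
n <- nrow(dat)

# Hold out 20% of team-seasons for testing
train_idx <- sample(n, size = floor(0.8 * n))
train <- dat[train_idx, ]
test <- dat[-train_idx, ]

# Fit on the training rows only, then score on the held-out rows
fit <- lm(W ~ R + RA, data = train)
pred <- predict(fit, newdata = test)
rmse <- sqrt(mean((test$W - pred)^2))
rmse
```

Evaluating on rows the model never saw gives a more honest error estimate than the fit on the training data.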