Exploratory Data Analysis (EDA)
using R

Christopher David Desjardins

Examine integrity of data
Check for problems

Explore and identify relationships between variables
Explore the feasibility of the research questions
Identify data needs
Identify what is important and worth pursuing
Make the client :) or :(

The tools (some of ...)

magrittr
dplyr
ggplot2
shiny


## Install dplyr, ggplot2, and shiny
## (magrittr comes along as a dependency of dplyr)
install.packages("dplyr")
install.packages("ggplot2")
install.packages("shiny")

## Load the libraries
library("dplyr")
library("ggplot2")
library("shiny")
  

We're going to explore the Lahman data set
Lots of different data sets at the team and player level
We'll explore the Teams data set

No, I won't teach you baseball

If you want to follow along


install.packages("Lahman")  
## wait and wait; these are large dbs

library("Lahman")
	

What are the best predictors of wins?

EDA gist for R

What should we do first?

How might we proceed?

Attempt to understand the structure of the data

names(Teams)
 [1] "yearID"  "lgID"    "teamID"  "franchID"  "divID"   "Rank"    "G"       "Ghome"   "W"      
[10] "L"       "DivWin"  "WCWin"   "LgWin"   "WSWin"   "R"       "AB"      "H"       "X2B"    
[19] "X3B"     "HR"      "BB"      "SO"      "SB"      "CS"      "HBP"     "SF"      "RA"     
[28] "ER"      "ERA"     "CG"      "SHO"     "SV"      "IPouts"  "HA"      "HRA"     "BBA"    
[37] "SOA"     "E"       "DP"      "FP"      "name"    "park"    "attendance"     "BPF"     "PPF"    
[46] "teamIDBR"       "teamIDlahman45" "teamIDretro" 
 	
str(Teams)
'data.frame':	2775 obs. of  48 variables:
 $ yearID        : int  1871 1871 1871 1871 1871 1871 1871 1871 1871 1872 ...
 $ lgID          : Factor w/ 7 levels "AA","AL","FL",..: 4 4 4 4 4 4 4 4 4 4 ...
 $ teamID        : Factor w/ 149 levels "ALT","ANA","ARI",..: 24 31 39 56 90 97 111 136 142 8 ...
 $ franchID      : Factor w/ 120 levels "ALT","ANA","ARI",..: 13 36 25 56 70 85 91 109 77 9 ...
 ...
 $ teamIDBR      : chr  "BOS" "CHI" "CLE" "KEK" ...
 $ teamIDlahman45: chr  "BS1" "CH1" "CL1" "FW1" ...
 $ teamIDretro   : chr  "BS1" "CH1" "CL1" "FW1" ...
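
Beyond names() and str(), a few quick checks can flag integrity problems early. The sketch below is one possible set, not a complete audit; the specific checks (missing values, duplicate team-seasons, wins plus losses versus games) are illustrative assumptions rather than the checks used for these slides.

```r
library("Lahman")  # provides the Teams data frame

## Missing values per column, largest first
na_counts <- sort(colSums(is.na(Teams)), decreasing = TRUE)
head(na_counts)

## Each team should appear at most once per season
n_dup <- sum(duplicated(Teams[c("yearID", "teamID")]))
n_dup

## Wins plus losses should not exceed games played
n_bad <- sum(Teams$W + Teams$L > Teams$G, na.rm = TRUE)
n_bad
```

Non-zero counts are not necessarily errors, but each one is worth a look before moving on.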

Five-number summary (plus mean)

Variable   Min       Q1     Med     Mean       Q3    Max
W           37       71      80    79.40       89    116
R          329      639   699.5   699.75   762.25   1009
AB        3493     5410    5498  5420.33     5566   5781
H          797  1349.75    1411  1403.92     1476   1684
X2B        119      215     249   249.12      281    376
X3B         11       27      34    34.64       41     79
HR          32      113     140   142.06      167    264
BB         275      474     521   523.49      572    775
SO         379      816     925   931.02     1054   1535
SB          13       65      91    95.34      122    341
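
A table of this shape can be produced with a few lines of base R. This is a minimal sketch: the helper name six_stats is made up for illustration, and the slides do not say which seasons were included, so running it on the full Teams data may give different numbers from those above.

```r
library("Lahman")  # provides the Teams data frame

## Columns shown in the summary table
vars <- c("W", "R", "AB", "H", "X2B", "X3B", "HR", "BB", "SO", "SB")

## Min, Q1, median, mean, Q3, max for one numeric vector
## (six_stats is a hypothetical helper, not part of any package)
six_stats <- function(x) {
  q <- quantile(x, probs = c(0, 0.25, 0.5, 0.75, 1),
                na.rm = TRUE, names = FALSE)
  c(Min = q[1], Q1 = q[2], Med = q[3],
    Mean = mean(x, na.rm = TRUE),
    Q3 = q[4], Max = q[5])
}

## One row per variable, one column per statistic
summary_table <- t(sapply(Teams[vars], six_stats))
round(summary_table, 2)
```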

If the data look sensible (i.e. there aren't any obvious data integrity issues)

Then what?

Explore bivariate associations

Predictor    Correlation with Wins
RA           -0.616
ERA          -0.611
ER           -0.599
HA           -0.513
BBA          -0.445
BB            0.402
SHO           0.465
Attendance    0.476
R             0.525
SV            0.656
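
Correlations like these can be computed in one pass over the numeric columns. A rough sketch follows; it assumes Pearson correlations with pairwise deletion of missing values, which the slides do not confirm, and it runs over every numeric column rather than the selection shown above.

```r
library("Lahman")  # provides the Teams data frame

## Keep only numeric columns, then correlate each one with wins (W)
num_cols <- Teams[sapply(Teams, is.numeric)]
predictors <- setdiff(names(num_cols), "W")

cors <- sapply(num_cols[predictors], function(x) {
  cor(x, num_cols$W, use = "pairwise.complete.obs")
})

## Sort from most negative to most positive, then look at both ends
sorted_cors <- sort(cors)
head(sorted_cors, 5)
tail(sorted_cors, 5)
```

Columns that are mechanically tied to wins (losses, rank, games played) will show up here too and are usually dropped before reading anything into the ranking.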

Bivariate plots

http://130.208.71.121:3838/stat_consult/bivar/
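
An interactive bivariate plot of this kind can be sketched with shiny and ggplot2. What follows is not the code behind the URL above, only a minimal stand-in; the input id xvar, the output id scatter, and the default predictor are all made up for illustration.

```r
library("shiny")
library("ggplot2")
library("Lahman")  # provides the Teams data frame

## Numeric columns that can be plotted against wins
num_vars <- names(Teams)[sapply(Teams, is.numeric)]

## Drop-down to pick a predictor, plus a slot for the plot
ui <- fluidPage(
  selectInput("xvar", "Predictor",
              choices = setdiff(num_vars, "W"), selected = "R"),
  plotOutput("scatter")
)

## Redraw the scatterplot whenever the selected predictor changes
server <- function(input, output) {
  output$scatter <- renderPlot({
    ggplot(Teams, aes(x = .data[[input$xvar]], y = W)) +
      geom_point(alpha = 0.3) +
      labs(x = input$xvar, y = "Wins (W)")
  })
}

## Build the app object; call runApp(app) to launch it
app <- shinyApp(ui, server)
```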

Marginal plot of Wins
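
One way to draw a marginal plot of wins is a histogram. This is a sketch only; the slide does not say which plot type was used, and the bin width of 5 wins is an arbitrary choice.

```r
library("ggplot2")
library("Lahman")  # provides the Teams data frame

## Histogram of team wins per season (binwidth is an arbitrary choice)
p_wins <- ggplot(Teams, aes(x = W)) +
  geom_histogram(binwidth = 5, colour = "white") +
  labs(x = "Wins (W)", y = "Number of team-seasons")
p_wins
```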


Box and whisker plot: Wins vs. Attendance
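
A box and whisker plot needs a grouping variable, so attendance has to be binned first. The sketch below assumes quartile bins, which is a guess; the slide does not say how attendance was grouped.

```r
library("ggplot2")
library("Lahman")  # provides the Teams data frame

## Drop seasons without attendance figures
teams_att <- subset(Teams, !is.na(attendance))

## Bin attendance into four groups of roughly equal size (an assumption)
teams_att$att_bin <- cut_number(teams_att$attendance, n = 4)

## One box of wins per attendance bin
p_box <- ggplot(teams_att, aes(x = att_bin, y = W)) +
  geom_boxplot() +
  labs(x = "Attendance (quartile bins)", y = "Wins (W)")
p_box
```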

Data reduction

Lots of other techniques
k-means clustering
facet plots
added-variable plot
factor analysis
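
As a taste of the first item, here is a minimal k-means sketch on runs scored (R) and runs allowed (RA). The choice of variables, the number of clusters (3), and the seed are all arbitrary assumptions made for illustration.

```r
library("Lahman")  # provides the Teams data frame

## Cluster team-seasons on runs scored (R) and runs allowed (RA)
km_vars <- na.omit(Teams[c("R", "RA")])
km_scaled <- scale(km_vars)  # put both variables on a common scale

set.seed(1)  # k-means starts from random centres
km_fit <- kmeans(km_scaled, centers = 3, nstart = 10)  # k = 3 is arbitrary

## Cluster sizes
table(km_fit$cluster)
```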

Up next: Inference?
Train, Test
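
A train/test split might look like the sketch below. The 80/20 split, the seed, and the toy model (wins on runs scored and runs allowed) are placeholders chosen for illustration, not a recommendation from these slides.

```r
library("Lahman")  # provides the Teams data frame

set.seed(42)  # make the random split reproducible
n <- nrow(Teams)
train_idx <- sample(n, size = floor(0.8 * n))  # 80/20 split is arbitrary

train <- Teams[train_idx, ]
test  <- Teams[-train_idx, ]

## Fit on the training rows only, then score the held-out rows
fit <- lm(W ~ R + RA, data = train)
pred <- predict(fit, newdata = test)

## Root mean squared error on the test set
rmse <- sqrt(mean((test$W - pred)^2, na.rm = TRUE))
rmse
```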

Next time
Gelman, Pasarica, and Dodhia (2002)
Bring a table