Exploratory Data Analysis (EDA)
using R

Christopher David Desjardins

Examine integrity of data
Check for problems
Explore and identify relationships between variables
Explore the feasibility of the research questions
Identify data needs
Identify what is important and worth pursuing
Make the client :) or :(

The tools (some of ...)

magrittr
dplyr
ggplot2
shiny


## Install dplyr, ggplot2, and shiny
## (magrittr is installed as a dplyr dependency, and dplyr re-exports %>%)
install.packages("dplyr")
install.packages("ggplot2")
install.packages("shiny")

## Load the libraries
library("dplyr")
library("ggplot2")
library("shiny")

We're going to explore the Lahman data set
It contains many different data sets at the team and player level
We'll explore the Teams data set

No, I won't teach you baseball

If you want to follow along


install.packages("Lahman")
## This can take a while; the package bundles large data sets

library("Lahman")

Attempt to understand the structure of the data

names(Teams)
 [1] "yearID"  "lgID"    "teamID"  "franchID"  "divID"   "Rank"    "G"       "Ghome"   "W"      
[10] "L"       "DivWin"  "WCWin"   "LgWin"   "WSWin"   "R"       "AB"      "H"       "X2B"    
[19] "X3B"     "HR"      "BB"      "SO"      "SB"      "CS"      "HBP"     "SF"      "RA"     
[28] "ER"      "ERA"     "CG"      "SHO"     "SV"      "IPouts"  "HA"      "HRA"     "BBA"    
[37] "SOA"     "E"       "DP"      "FP"      "name"    "park"    "attendance"     "BPF"     "PPF"    
[46] "teamIDBR"       "teamIDlahman45" "teamIDretro"

str(Teams)
'data.frame':	2775 obs. of  48 variables:
 $ yearID        : int  1871 1871 1871 1871 1871 1871 1871 1871 1871 1872 ...
 $ lgID          : Factor w/ 7 levels "AA","AL","FL",..: 4 4 4 4 4 4 4 4 4 4 ...
 $ teamID        : Factor w/ 149 levels "ALT","ANA","ARI",..: 24 31 39 56 90 97 111 136 142 8 ...
 $ franchID      : Factor w/ 120 levels "ALT","ANA","ARI",..: 13 36 25 56 70 85 91 109 77 9 ...
 ...
 $ teamIDBR      : chr  "BOS" "CHI" "CLE" "KEK" ...
 $ teamIDlahman45: chr  "BS1" "CH1" "CL1" "FW1" ...
 $ teamIDretro   : chr  "BS1" "CH1" "CL1" "FW1" ...
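Beyond names() and str(), a per-column profile helps with the "check for problems" goal. The sketch below is not from the original slides: profile_columns() is a hypothetical helper, and toy is a made-up stand-in so the code runs without the Lahman package. If you are following along, you could call profile_columns(Teams) instead.

```r
# Toy stand-in for Teams; the values are made up for illustration only.
toy <- data.frame(
  yearID     = c(1871L, 1871L, 1872L, 1872L),
  lgID       = factor(c("AL", "NL", "AL", "NL")),
  W          = c(20L, 19L, NA, 39L),
  attendance = c(NA, NA, NA, 12000)
)

# Hypothetical helper: one row per column with its class,
# number of missing values, and number of distinct values.
profile_columns <- function(df) {
  data.frame(
    column    = names(df),
    class     = unname(vapply(df, function(x) class(x)[1], character(1))),
    n_missing = unname(vapply(df, function(x) sum(is.na(x)), integer(1))),
    n_unique  = unname(vapply(df, function(x) length(unique(x)), integer(1))),
    row.names = NULL,
    stringsAsFactors = FALSE
  )
}

print(profile_columns(toy))
```

Columns with many missing values or unexpected classes are the ones worth a closer look before any modeling.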

Five-number summary (plus the mean)

Variable    Min       Q1    Med     Mean       Q3    Max
W            37       71     80    79.40       89    116
R           329      639  699.5   699.75   762.25   1009
AB         3493     5410   5498  5420.33     5566   5781
H           797  1349.75   1411  1403.92     1476   1684
X2B         119      215    249   249.12      281    376
X3B          11       27     34    34.64       41     79
HR           32      113    140   142.06      167    264
BB          275      474    521   523.49      572    775
SO          379      816    925   931.02     1054   1535
SB           13       65     91    95.34      122    341
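The slides do not show how this table was produced or which seasons it covers, so the following is only one way to get a table of this shape. six_stat_summary() is a hypothetical helper written in base R, and toy is made-up data so the sketch runs on its own. With Lahman loaded, something like six_stat_summary(Teams, c("W", "R", "AB")) would give the same layout for all rows of Teams.

```r
# Hypothetical helper: min, quartiles, median, mean, and max
# for each requested numeric column, one row per variable.
six_stat_summary <- function(df, cols) {
  t(sapply(df[cols], function(x) {
    # quantile() uses its default method (type 7); names = FALSE keeps labels clean
    q <- quantile(x, probs = c(0, 0.25, 0.5, 0.75, 1),
                  na.rm = TRUE, names = FALSE)
    c(Min = q[1], Q1 = q[2], Med = q[3],
      Mean = mean(x, na.rm = TRUE),
      Q3 = q[4], Max = q[5])
  }))
}

# Made-up values, for illustration only.
toy <- data.frame(W = c(1, 2, 3, 4, 5),
                  R = c(10, 20, 30, 40, 100))

s <- six_stat_summary(toy, c("W", "R"))
print(s)
```

Comparing Mean with Med in each row is a quick check for skew: in the toy R column the single large value pulls the mean above the median.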

Explore bivariate associations

Predictor    Correlation with W
RA           -0.616
ERA          -0.611
ER           -0.599
HA           -0.513
BBA          -0.445
BB            0.402
SHO           0.465
attendance    0.476
R             0.525
SV            0.656
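A ranked list like this can be built by correlating each predictor with wins and sorting the result. The slides do not say which correlation method or which rows were used, so this sketch assumes Pearson correlation (the default of cor()) with pairwise deletion of missing values. cor_with_target() is a hypothetical helper and toy is made-up data. With Lahman loaded, a call such as cor_with_target(Teams, "W", c("RA", "ERA", "R", "SV")) would follow the same pattern.

```r
# Hypothetical helper: correlation of each predictor with a target column,
# sorted from most negative to most positive.
cor_with_target <- function(df, target, predictors) {
  r <- vapply(predictors, function(p) {
    cor(df[[p]], df[[target]], use = "pairwise.complete.obs")
  }, numeric(1))
  out <- data.frame(Predictor = predictors, r = unname(r),
                    stringsAsFactors = FALSE)
  out <- out[order(out$r), ]
  rownames(out) <- NULL
  out
}

# Made-up values, for illustration only.
toy <- data.frame(W  = c(1, 2, 3, 4, 5),
                  RA = c(5, 4, 3, 2, 1),
                  R  = c(2, 4, 6, 8, 10),
                  SV = c(1, 3, 2, 5, 4))

res <- cor_with_target(toy, "W", c("RA", "R", "SV"))
print(res)
```

These are bivariate associations only. Predictors such as RA, ER, and ERA measure closely related things, so their correlations with wins overlap rather than add up.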