{0xc4200ac840 {{{0 0} 0 0 0 0} map[]}}

Analytical Endeavors

Mar 9, 2018 - 12 minute read - Comments - eda scraping sports-analysis

A Baseball Dashboard in Time for Opening Weekend (part one)

Baseball fan? ☑️ Math? ☑️ Computers skills? ☑️

It’s always nice when project ideas fall into your lap. Let’s build an dashboard that can be used to evaluate nominees for a franchise’s hall of fame.

JAWS

For the uninitiated, the first thing you need to understand is WAR. WAR attempts to encapsulate a player’s yearly contribution into one statistic. JAWS is calculated by taking the average of a player’s total WAR over their career and the sum of their seven highest WAR values. The JAWS statistic is meant to be a starting point in the discussion of a nominee’s creditials for the Hall of Fame. Jaffe provides a thorough explanation here. There’s also a more succinct description at Baseball-Reference.

The question is can JAWS, whose domain of applicability is the MLB Hall of Fame, be applied to a franchise’s Hall of Fame. Instead of career WAR, we’ll be using only the WAR the player accrued while he played for the Cincinnati Reds. The likely sticking point is the seven year qualification since free agency makes it less likely that players remain with a team for that length of duration. The number of inductees in a franchise hall of fame probably is considerably less than the MLB hall of fame, so the trick will be to chose a tenure long enough to qualify a high percentage of the inductees and still allow JAWS to provide an adequate measure.

Scrape Hall of Fame Inductee Names

First, we need to get the names of the members of the Reds Hall of Fame. There are a few places with this information including the Reds Hall of Fame website, Baseball-Reference, and Wikipedia. None had the information available to download or presented it in a tidy format. Wikipedia is a source that has wide-ranging utility. This is a good opportunity to become familiar with scraping the website.

Honestly, I expected this part to be more of a hassle, but rvest made the process quite painless. Use Google Chrome’s inspect feature to obtain the xpath by scrolling down to the table with the member names, right-clicking on the page, and clicking inspect. In the left panel under the Elements tab, you’ll see HTML code. Hovering over each line will shade an object on the page. Once we find the line of code that corresponds to the table we want, we right-click that line of code, choose copy, and copy xpath.

library(tidyverse)
library(rvest)

url <- "https://en.wikipedia.org/wiki/Cincinnati_Reds_Hall_of_Fame_and_Museum#Cincinnati_Reds_Hall_of_Fame_members"
members <- url %>%
      read_html() %>%
      html_nodes(xpath='//*[@id="mw-content-text"]/div/table[2]') %>%
      html_table()
members <- members[[1]]
glimpse(members)
## Observations: 86
## Variables: 5
## $ Year     <chr> "1958", "1958", "1958", "1958", "1958", "1959", "1959...
## $ No.      <chr> "30", "4", "10", "33", "31", "24", "18", "44, 47", "—...
## $ Inductee <chr> "Paul Derringer", "Ernie Lombardi", "Frank McCormick"...
## $ Position <chr> "P", "C", "1B", "P", "P/3B\nManager", "RF", "P", "P",...
## $ Tenure   <chr> "1933–1942", "1932–1941", "1934–1945", "1937–1943\n19...
  • Year is the year inducted into the Reds HOF.
  • No. is the number on the back of player’s jersey.
  • Inductee is the player’s name.
  • Position is the positions played.
  • Tenure is the number of years played as a Red and is formatted as a range

For our calculation, we’ll need Inductee, Position, and Tenure. There are formatting problems with Position and Tenure, but that information can be obtained elsewhere.

There are only six nominees in the 2018 class so we can just copy/paste their information from the announcement: third baseman Aaron Boone (1997-2003), outfielder Adam Dunn (2001-2008), pitcher John Franco (1984-1989),pitcher Danny Graves (1997-2005), third baseman Scott Rolen (2009-2012) and outfielder Reggie Sanders (1991-1998).

WAR Values

To get our WAR values, we’ll utilize two data sets from the openWARData package: idTT (player IDs) and rWAR (Baseball-Reference WAR). Our member names are used to filter the idTT data to get IDs and the IDs to filter the rWAR data to get the WAR values.

library(openWARData)

# Combining first and last names to match the member names we scraped
idTTa <- idTT %>%
      select(key_bbref, name_last, name_first) %>%
      mutate(name_whole = paste(name_first, name_last))

# Missing values come along for the ride so they need removed
indID <- map_dfr(members[,"Inductee"], function(x) {
      filter(idTTa, name_whole == x & key_bbref != "")})

Some players were excluded from indID because in members, they have accent marks in their names. Likewise excluded, a father/son duo who have the same names in idTT but have Sr/Jr suffixes in members. All of these players will be added to indID. Also, administrative personnel were removed during the filtering process, since obviously, they have no WAR values.

missNamList <- list("Dolf Luque", "Leo Cardenas", "Tony Perez", "Dave Concepcion", "Ken Griffey",
                    "Jose Rijo", "Cesar Geronimo", "Pedro Borbon")
indID <- map_dfr(missNamList, function(x) {
      filter(idTTa, name_whole == x & key_bbref != "")}) %>% 
      bind_rows(indID) %>% 
      mutate(name_whole = if_else(key_bbref == "griffke02", "Ken Griffey Jr", name_whole))

Not many people have original names, including baseball players, so we need to remove the extra Pete Rose (Jr.), Joe Morgan, Mike McCormick, Pedro Borbon, and George Wright. Sparky Anderson and Fred Hutchinson were managers so they can be dropped as well.

indID <- filter(indID, key_bbref != "rosepe02" & key_bbref != "morgajo01"
                    & key_bbref != "mccormi03" & key_bbref != "andersp01" 
                    & key_bbref != "wrighge03" & key_bbref != "hutchfr01"
                    & key_bbref != "borbope02"    
                )
glimpse(indID)
## Observations: 81
## Variables: 4
## $ key_bbref  <fct> luquedo01, cardele01, perezto01, conceda01, griffke...
## $ name_last  <fct> Luque, Cardenas, Perez, Concepcion, Griffey, Griffe...
## $ name_first <fct> Dolf, Leo, Tony, Dave, Ken, Ken, Jose, Cesar, Pedro...
## $ name_whole <chr> "Dolf Luque", "Leo Cardenas", "Tony Perez", "Dave C...

Now for the nominees…

nomNamList <- list("Aaron Boone", "Adam Dunn", "John Franco", "Danny Graves", "Scott Rolen",
                   "Reggie Sanders")
nomID <- map_dfr(nomNamList, function(x) {
      filter(idTTa, name_whole == x & key_bbref != "")})

# Snagged an extra Reggie Sanders
nomID <- filter(nomID, key_bbref != "sandere01")

We have IDs for the inductees and nominees, so now we can get those Reds WAR values.

# Inductees
indWar <- map_dfr(as.character(indID$key_bbref), function(x) {
      filter(rWAR, playerId == x)}) %>%
      select(playerId, yearId, teamId, rWAR) %>%
      mutate_if(is.factor, as.character) %>% 
      filter(teamId == "CIN")

# add Name column
indWar <- indID %>% 
      select(name_whole, key_bbref) %>% 
      rename(Name = name_whole, playerId = key_bbref) %>% 
      inner_join(indWar, by = 'playerId')
## Warning: Column `playerId` joining factor and character vector, coercing
## into character vector
# Nominees
nomWar <- map_dfr(as.character(nomID$key_bbref), function(x) {
      filter(rWAR, playerId == x)}) %>%
      select(playerId, yearId, teamId, rWAR) %>%
      mutate_if(is.factor, as.character) %>% 
      filter(teamId == "CIN")

# add Name column
nomWar <- nomID %>% 
      select(name_whole, key_bbref) %>% 
      rename(Name = name_whole, playerId = key_bbref) %>% 
      inner_join(nomWar, by = 'playerId')
## Warning: Column `playerId` joining factor and character vector, coercing
## into character vector
glimpse(indWar)
## Observations: 721
## Variables: 5
## $ Name     <chr> "Dolf Luque", "Dolf Luque", "Dolf Luque", "Dolf Luque...
## $ playerId <chr> "luquedo01", "luquedo01", "luquedo01", "luquedo01", "...
## $ yearId   <int> 1918, 1919, 1920, 1921, 1922, 1923, 1924, 1925, 1926,...
## $ teamId   <chr> "CIN", "CIN", "CIN", "CIN", "CIN", "CIN", "CIN", "CIN...
## $ rWAR     <dbl> -0.14, 1.14, 4.56, 4.96, 3.10, 10.77, 2.55, 6.56, 1.1...

EDA

Tenure

We can next turn our attention to making a decision on our central quandary: how long should our tenure requirement be? If we look at tables of the different values, we can calculate the percentage of players that would remain at each cutoff.

# Inductees
indYrs <- indWar %>%
      group_by(playerId) %>% 
      summarize(tenure = n())
table(indYrs$tenure)
## 
##  2  3  4  5  6  7  8  9 10 11 12 13 15 16 17 18 19 
##  1  2  3  5  6  9 12  8 10  8  6  2  1  1  1  1  3
# Nominees
nomYrs <- nomWar %>%
      group_by(playerId) %>% 
      summarize(tenure = n())
table(nomYrs$tenure)
## 
## 4 6 7 8 9 
## 1 1 1 2 1
Cutoff % Remaining
7 yrs 67%
6 yrs 78%
5 yrs 86%
4 yrs 92%

Looking at the nominees, it turns out Scott Rolen only played four seasons for the Reds. The goal of this project is to evaluate nominees, so four years would be the necessary cutoff in order for Rolen to be included. If Rolen wasn’t part of this class, I’d consider five years but not above five.

So the inductees that didn’t make the cut were the following: Billy Werber, Bill McKechnie, and Wayne Granger. The Wright boys, George and Harry, also aren’t in there. They played with the Reds prior to 1871 and their WAR wasn’t available. Our final inductee pool has 76 players.

indWar <- filter(indWar, playerId != "grangwa01" & playerId != "mckecbi01"
                     & playerId != "werbebi01" & playerId != "wrighge01"
                     & playerId != "wrighha01")

warDat <- indWar %>% 
      bind_rows(nomWar)

Position

We need to figure out the primary positions for players as they tend to play multiple positions during their career. When comparing JAWS scores between inductees and nominees, it’s usually done by position. So, if we were to examine Scott Rolen’s case, we’d look at his JAWS score and compare it to other Reds third basemen such as Frank Robinson or Chris Sabo.

Jaffe makes this determination by calculating the total WAR at each position and selecting the position with the greatest value. Currently, I can’t find a relatively convenient way to obtain the necessary data to make that calculation. We’ll make our determination by using the Lahman package and its Fielding data set to find the position which has the most games played as a Red.

library(Lahman)

head(Fielding, 3)
##    playerID yearID stint teamID lgID POS  G GS InnOuts PO  A  E DP PB WP
## 1 abercda01   1871     1    TRO   NA  SS  1 NA      NA  1  3  2  0 NA NA
## 2  addybo01   1871     1    RC1   NA  2B 22 NA      NA 67 72 42  5 NA NA
## 3  addybo01   1871     1    RC1   NA  SS  3 NA      NA  8 14  7  0 NA NA
##   SB CS ZR
## 1 NA NA NA
## 2 NA NA NA
## 3 NA NA NA
# Fielding dataset has different designations for 1800's Reds teams: CN1 and CN2.
posDat <- map2_dfr(warDat$playerId, warDat$yearId, function(x,y) {
      filter(Fielding, playerID == x & yearID == y)}) %>% 
      filter(teamID == "CIN" | teamID == "CN1" | teamID == "CN2")

# Getting position with most games as a Red
posDat <- posDat %>%
      select(playerID, POS, G) %>% 
      group_by(playerID, POS) %>% 
      summarize(sumG = sum(G)) %>% 
      filter(sumG == max(sumG)) %>% 
      ungroup() %>% 
      select(playerID, POS)


# Jim O'Toole's Baseball-Reference ID in the Fielding data set is incorrect but he was a pitcher his whole career. Adding him to the df.
setdiff(warDat$playerId, posDat$playerID)
## [1] "o'tooji01"
posDat <- posDat %>% 
      add_row(playerID = "o'tooji01", POS = "P") %>% 
      rename(playerId = playerID)

glimpse(posDat)
## Observations: 82
## Variables: 2
## $ playerId <chr> "becklja01", "bellgu01", "benchjo01", "billija01", "b...
## $ POS      <chr> "1B", "OF", "C", "P", "P", "3B", "P", "OF", "P", "C",...

The outfield position can be further divided into right field, center field, and left field using the Appearances data set in Lahman. It would be desirable to also split the pitching position into relief and starting but unfortunately the Lahman package doesn’t afford us this capability explicitly. I think the information could be derived from the games started statistic and some others but that task will have be left for another time.

head(Appearances, 3)
##   yearID teamID lgID  playerID G_all GS G_batting G_defense G_p G_c G_1b
## 1   1871    TRO   NA abercda01     1 NA         1         1   0   0    0
## 2   1871    RC1   NA  addybo01    25 NA        25        25   0   0    0
## 3   1871    CL1   NA allisar01    29 NA        29        29   0   0    0
##   G_2b G_3b G_ss G_lf G_cf G_rf G_of G_dh G_ph G_pr
## 1    0    0    1    0    0    0    0   NA   NA   NA
## 2   22    0    3    0    0    0    0   NA   NA   NA
## 3    2    0    0    0   29    0   29   NA   NA   NA
# Get outfielder IDs
ofDat <- posDat %>% 
      filter(POS == "OF")

# Get yearId from warDat
ofYears <- map_dfr(ofDat$playerId, function(x) {
      filter(warDat, playerId == x)
})
      
# Number of games played at each OF position for each season
ofSplit <- map2_dfr(ofYears$playerId, ofYears$yearId, function(x,y) {
      filter(Appearances, playerID == x & yearID == y)}) %>% 
      rename(LF = G_lf, CF = G_cf, RF = G_rf) %>% 
      gather('LF', 'CF', 'RF', key = "POS", value = "G")

# Primary outfield position = most games played at that position
splitSum <- ofSplit %>% 
      select(playerID, POS, G) %>%
      rename(playerId = playerID) %>%
      group_by(playerId, POS) %>% 
      summarize(sumG = sum(G)) %>% 
      filter(sumG == max(sumG)) %>% 
      ungroup() %>% 
      select(playerId, POS)

# Replacing "OF" values in posDat
ofPos <- posDat %>% 
      filter(POS == "OF") %>% 
      select(-POS) %>% 
      inner_join(splitSum, by = "playerId")

posDat <- posDat %>% 
      filter(POS != "OF") %>%
      bind_rows(ofPos)

# Add POS column
warDat <- warDat %>% 
      inner_join(posDat, by = 'playerId')

Save Objects

We’ve generated quite a few objects in part one of this series. If you’d like to keep your environment relatively clean, we’ll only need a few of these going into the calculations of part two: indWar, nomWar, posDat, and warDat.

Conclusion

We have the ingredients for the JAWS calculation and our dashboard. In this part, we scraped Wikipedia to get the Hall of Fame members’ names. Those names were used to get Baseball-Reference IDs which in turn were used to obtain WAR values. Next, by examining the data, we determined our tenure qualification, and primary player positions were determined by the greatest number of games played at a position. Next, we’ll perform the calculations in part two and visualize them in a shinydashboard in part three.

References

[1] B. Baumer and G. Matthews. openWARData: Data Associated with openWAR. R package version 0.1.1.9004. 2015. URL: https://github.com/beanumber/openWARData.

[2] M. Friendly. Lahman: Sean ‘Lahman’ Baseball Database. R package version 6.0-0. 2017. URL: https://CRAN.R-project.org/package=Lahman.

[3] H. Wickham. rvest: Easily Harvest (Scrape) Web Pages. R package version 0.3.2. 2016. URL: https://CRAN.R-project.org/package=rvest.

[4] H. Wickham. tidyverse: Easily Install and Load the ‘Tidyverse’. R package version 1.2.1. 2017. URL: https://CRAN.R-project.org/package=tidyverse.