NHL Play-by-Play Data Scraping

I know about Evolving Wild’s scraper (https://github.com/evolvingwild/evolving-hockey/blob/master/EH_scrape_functions.R), but I’m having trouble figuring out exactly how to use it. Do I change something in the code, or do I simply add my own code? Do I run everything every time?

Also, how do I figure out the number code for any given game?

If anyone’s willing to share an example of the code they used for a specific game (or anything in general), that would also be incredibly helpful.

Hi and welcome Skrimmage!

Great question. I haven’t used their code yet, so let me tag in @Evolvingwild to get an answer.

Hi! Sorry for the delay in responding.

The easiest way to use the scraper is to create a new script in R, load the packages that are commented out at the top of the scraper script, and source the EH_scrape_functions.R script from GitHub. That looks like this:

## Dependencies
library(RCurl); library(xml2); library(rvest); library(jsonlite); library(foreach)
library(lubridate)
library(tidyverse) ## specifically: stringr, readr, tidyr, and dplyr

## Source scraper functions
devtools::source_url("https://raw.githubusercontent.com/evolvingwild/evolving-hockey/master/EH_scrape_functions.R")

Once you have the required packages and scraper functions loaded into your environment, you can use the sc.scrape_pbp function to scrape any number of games you want. The NHL’s game IDs are formatted as “2019020001”, where “2019” is the first year of the season (so a game in the 2019-2020 season is 2019), “02” marks the regular season, and “0001” is the game number (so 2019020001 is the first game of the regular season).
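To answer the "number code" question a bit more concretely, here is a minimal sketch in base R that builds and splits game IDs. It assumes the ID is read as a four-digit season start year, a two-digit game type ("02" for the regular season), and a four-digit game number. The helper names build_game_id and split_game_id are made up for this example and are not part of the scraper.

```r
## Hypothetical helpers (not part of EH_scrape_functions.R).
## Assumed ID layout: SSSS (season start year) + TT (game type) + NNNN (game number)

## Build one or more game IDs as character strings
build_game_id <- function(season_start, game_number, game_type = 2) {
  sprintf("%04d%02d%04d",
          as.integer(season_start),
          as.integer(game_type),
          as.integer(game_number))
}

## Split game IDs back into their assumed parts
split_game_id <- function(game_id) {
  game_id <- as.character(game_id)
  data.frame(
    game_id      = game_id,
    season_start = as.integer(substr(game_id, 1, 4)),
    game_type    = substr(game_id, 5, 6),
    game_number  = as.integer(substr(game_id, 7, 10)),
    stringsAsFactors = FALSE
  )
}

first_game <- build_game_id(2019, 1)     ## first regular-season game of 2019-20
first_five <- build_game_id(2019, 1:5)   ## vector of the first five game IDs
parts      <- split_game_id(first_game)

## With the scraper functions sourced, a single game could then be requested as:
## pbp_one <- sc.scrape_pbp(games = first_game)
```

Passing the ID as a character string is an assumption here. If in doubt, use the game_id column returned by sc.scrape_schedule, as in the example that follows.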

Additionally, you can use the sc.scrape_schedule function to find the games from a specific date. So, to put it all together:

## ------------------ ##
##   Example Scrape   ##
## ------------------ ##

## Get schedule for first day of the '19-20 season
schedule_current <- sc.scrape_schedule(start_date = "2019-10-02", end_date = "2019-10-02")

## Identify game IDs
games_vec <- schedule_current$game_id

## Scrape games from 2019-10-02
pbp_scrape <- sc.scrape_pbp(games = games_vec)

## Pull data out of returned list
game_info_df_new <-     pbp_scrape$game_info_df               ## game information data
pbp_base_new <-         pbp_scrape$pbp_base                   ## main play-by-play data
pbp_extras_new <-       pbp_scrape$pbp_extras                 ## extra play-by-play data
player_shifts_new <-    pbp_scrape$player_shifts              ## full player shifts data
player_periods_new <-   pbp_scrape$player_periods             ## player TOI sums per period (from the shifts source)
roster_df_new <-        pbp_scrape$roster_df                  ## roster data
scratches_df_new <-     pbp_scrape$scratches_df               ## scratches data
event_summary_df_new <- pbp_scrape$events_summary_df          ## event summary data (box score stats, etc.)
scrape_report <-        pbp_scrape$report                     ## report showing number of rows and time to scrape game

Hope that helps!

@Evolvingwild Thanks for sharing your code.
In my case, I use PHP to save data from the API into a MySQL database.
When I want to display the players on the ice during an event (for example, an event at 4:35 in the first period), I sometimes have players who start their shift at 4:35 and also players who end their shift at 4:35.
In your opinion, what is the usual practice? Show the players who start at 4:35, or the ones who finish their shift at 4:35?

I haven’t delved deeply into that specific problem, but I lean towards showing the players who are coming off the ice. If a shot happens, I would prefer to give credit to the players who were involved in the lead-up. The one case where it might be relevant to (also) include the players coming onto the ice is a Too Many Players penalty. But overall, an event is usually due in some way to the players who were on the ice for the play that built up to it.
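As a rough illustration of that tie-breaking rule, here is a small base R sketch. The shifts data frame, its column names, and the on_ice_at function are all made up for this example (they are not the scraper's output format); times are seconds elapsed in the period, so 4:35 is 275.

```r
## Toy shift data (hypothetical layout): A and B end at 4:35, C and D start at 4:35
shifts <- data.frame(
  player    = c("A", "B", "C", "D"),
  period    = c(1, 1, 1, 1),
  start_sec = c(200, 230, 275, 275),
  end_sec   = c(275, 275, 320, 330),
  stringsAsFactors = FALSE
)

## Return the players on the ice at a given time.
## credit = "leaving":  a shift counts if start <  event time <= end
## credit = "entering": a shift counts if start <= event time <  end
on_ice_at <- function(shifts, period, event_sec, credit = c("leaving", "entering")) {
  credit <- match.arg(credit)
  in_period <- shifts$period == period
  if (credit == "leaving") {
    active <- shifts$start_sec < event_sec & shifts$end_sec >= event_sec
  } else {
    active <- shifts$start_sec <= event_sec & shifts$end_sec > event_sec
  }
  shifts$player[in_period & active]
}

leaving  <- on_ice_at(shifts, period = 1, event_sec = 275)                      ## "A" "B"
entering <- on_ice_at(shifts, period = 1, event_sec = 275, credit = "entering") ## "C" "D"
```

One caveat with the "leaving" rule as written: an event at 0:00 of a period matches no shifts, so opening faceoffs would need to be handled separately (for instance by switching to the "entering" rule for that case).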

Thanks for your reply. It confirms what I thought. :)