Many people want to begin scraping and analyzing MLB data, but they do not know where to start. There are existing tools, such as baseballr, which already do most if not all of the work for you. They are great, go use them if you want. I am not writing this for people who want to use those sorts of tools. I am writing this for someone who wants to do things themselves. I relate to those sorts of people, and I am going to write the sort of instructions I wish I had a few years ago.
This is not necessarily “for beginners.” I am not going to explain everything. Instead, I aim to give you just enough information to get you pointed in the right direction. Enough to let you know which questions to type into Google and which Youtube tutorials to watch. I want to give you just enough that, no matter how little you know today, you can start doing this work in the near future. Maybe not today, maybe not tomorrow, but hell, maybe next week? Is next week soon enough for you? I dunno.
Also, before I get into anything, I am not saying this is the best way to do things. I am not saying it is the only way. This is merely *a* way. If this isn’t the way for you, I’m sorry. Oh, and I will be using javascript. However, the vast majority of what I am doing is more or less basic programming logic, and you could convert this to java with one arm tied behind your back. You could convert it to python while wearing a bucket for a hat. You could convert this to many languages. Maybe not R. I mean, you could use baseballr if you want to use R.
Oh, and one more thing, you will need node. Node is server side javascript, i.e. it does not run in a browser. We are using javascript, but we’re not making a website. We’re making a scraper. It will be fun. Go install node.
BTW I am assuming you’re using Windows. If you aren’t using Windows, replace the Windows parts with whatever operating system you are using.
Okay, let’s get going.
Make a node project
First, go to the root directory of your computer; on Windows that's the C drive. Create a folder and name it something catchy like “mlbscraper”. I dunno, you can come up with your own name. In the folder, open up the command line. To open a command line in a folder, click the folder's address bar in File Explorer, type cmd, and hit enter. I made a video showing how this works, because I recently learned some programmer type people didn’t know this shortcut existed.
Okay, in that command line, type in:
npx gitignore node
npm init -y
This is the basic way to start a node project: the first command creates a .gitignore file and the second initializes the project, creating the package.json file. If you care, read here or google it or whatever.
Now type in:
npm install fast-csv node-fetch
These are the two packages we will use. The first is a convenient way to write data to csv and the second fetches data from webpages. Since node is server side code and not browser code, it doesn’t have access to browser stuff like fetching websites. This package puts that ability back into the javascript. Do you ever feel like you’re doing something contrary to someone else’s original intentions? Me neither.
EXTREMELY IMPORTANT:
Open up the package.json file and beneath where it says “main”: “index.js” add the line “type”: “module” like so:
"name": "mlbscraper",
"version": "1.0.0",
"description": "",
"main": "index.js",
"type": "module",
(…etc the rest of the file unchanged)
Let’s make some javascript!
Okay, so, open an IDE or a text editor. I use notepad++. Use whatever floats your goat. In whatever way makes you feel comfortable and at ease with your life, create and/or open and/or save a file called api_scrape.js inside of your node project folder. Now we have a blank page of javascript to work with. But first, let’s think. What is it we want to do?
We want baseball data, right? Okay, but where do we get the data? From the api, duh. But where is the api? Probably at a url? Um. Okay. So, what if I told you a game url looks something like this:
https://statsapi.mlb.com/api/v1.1/game/{gamepk}/feed/live
You can open a url like that (with a real gamepk in place of the {gamepk} part) if you want. It is a json file. We’re going to look at it in a few minutes. If you don’t know what a json file is, you should probably go google it. Anywho, there are a few key parts to the url. Everything around the {gamepk} part is the generic url, and the gamepk itself is a six digit number which is the ID of the game. Knowing this, we can create a javascript function that turns a gamepk into a game url. Let’s call the function gameurl.
const gameurl = (gamepk) => "https://statsapi.mlb.com/api/v1.1/game/"+gamepk+"/feed/live";
Alright, but where do we get the gamepk things? Great question. What if I told you there is a game schedule? You’d probably be pretty unsurprised, right? What if I said the schedule url is this:
https://statsapi.mlb.com/api/v1/schedule?sportId=1
I bet you didn’t expect the sportId=1 at the end. So that is probably surprising. This schedule url gives you all of the games played today. That’s sometimes useful. What if we want yesterday? Well, the schedule url also takes date parameters. Here is a javascript function called scheduleurl, which takes two dates and spits out the schedule url. I am sure you can parse the function without much issue.
const scheduleurl = (startDate, endDate) => "https://statsapi.mlb.com/api/v1/schedule?sportId=1&startDate="+startDate+"&endDate="+endDate;
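For instance, here are those two helpers in action. The 123456 gamepk and the dates below are placeholders I made up, not a real game:

```javascript
// The two url helpers from above, repeated so this snippet runs on its own.
const gameurl = (gamepk) => "https://statsapi.mlb.com/api/v1.1/game/" + gamepk + "/feed/live";
const scheduleurl = (startDate, endDate) => "https://statsapi.mlb.com/api/v1/schedule?sportId=1&startDate=" + startDate + "&endDate=" + endDate;

// 123456 is a placeholder gamepk, not a real game.
console.log(gameurl(123456));
// https://statsapi.mlb.com/api/v1.1/game/123456/feed/live

console.log(scheduleurl("2021-4-5", "2021-4-5"));
// https://statsapi.mlb.com/api/v1/schedule?sportId=1&startDate=2021-4-5&endDate=2021-4-5
```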
Neato. Okay, so we have a schedule that gives us a list of gamepks and then we have a function that turns those gamepks into a game url. That’s a good place to start. There is one more thing we need to do before we move on, though. Have you noticed how dates are all weird in MLB? Probably not, because I haven’t showed you how they work. And I dunno if you know this, but they are even weirder in javascript. In order to get the weirdness of javascript to match the weirdness of MLB we need a function that converts javascript dates into MLB dates. That function looks like this:
const formatDate = (date) => {
const day = date.getDate();
const month = date.getMonth() + 1;
const year = date.getFullYear();
return `${year}-${month}-${day}`;
}
You put in a javascript date, it converts it into day, month, and year components, then returns the date in the form of year-month-day. Notice how you need to add one to the month. Nice job having January as the 0th month of the year, javascript!
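For example (note the month and day come out unpadded; the MLB api seems happy with that, but if you ever need 2021-04-05 style dates, a padded variant is a two-liner):

```javascript
const formatDate = (date) => {
  const day = date.getDate();
  const month = date.getMonth() + 1;
  const year = date.getFullYear();
  return `${year}-${month}-${day}`;
}

// Months are 0-indexed, so 3 means April.
console.log(formatDate(new Date(2021, 3, 5))); // 2021-4-5

// A zero-padded variant, if you ever want 2021-04-05 instead.
const pad = (n) => String(n).padStart(2, "0");
const formatDatePadded = (date) =>
  `${date.getFullYear()}-${pad(date.getMonth() + 1)}-${pad(date.getDate())}`;
console.log(formatDatePadded(new Date(2021, 3, 5))); // 2021-04-05
```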
As an aside, I feel like this article is an uncomfortable hybrid of glossing over important details and overexplaining pretty simple functions. It is going to be like this the whole way through, I’m sorry.
What is next?
Okay, okay, so we probably need to look at the data provided by the API at some point. Open the game url from earlier in your browser. I read json with a browser extension called JSON Beautifier & Editor, which you should probably install in your browser. The plugin makes it much easier to read through and understand json. Expand the objects down to the first play of the game and take note of how it is structured.
You have a root object which contains other objects, namely liveData, which in turn holds an object called plays, which holds an array called allPlays. Index 0 of allPlays is the first play of the game. So, in other words, this location is:
object.liveData.plays.allPlays[0]
Note that the play has several objects and arrays inside of it. You have result, about, count, playEvents, etc. Each contain data, presumably. You can go through and figure out what is interesting to you. I have already made an arbitrary list of stats, here is my list:
Date and year
Gamepk
Temperature
Weather condition
Wind
Venue
Batter ID, handedness, and name
Pitcher ID, handedness, and name
Game event
Play description and outcome
Home and away team
Game type
Inning and half inning
At bat number and pitch number
These are just a few stats you might be interested in. Feel free to grab whatever else you’d like.
In order to grab this data from the json, we have to traverse the json to find the relative locations of all of these stats within the file. I’ve made a list of the locations below. BTW, I am going to call the root location “data” instead of “object”. I just find it easier, and it is fewer letters to type.
Alright, so we have locations like
data.gameData.game.pk
data.gameData.datetime.officialDate
data.gameData.weather.temp
data.gameData.weather.condition
data.gameData.weather.wind
data.gameData.venue.id
data.gameData.teams.home.abbreviation
data.gameData.teams.away.abbreviation
data.gameData.game.season
data.gameData.game.gamedayType
data.gameData.game.type
These are all data stored in the gameData object on the root. Pretty simple. Okay, to get the rest of the data we need to look at a given play. So I am going to make some shorthand to make looking at a play a bit easier:
ab = data.liveData.plays.allPlays[i];
Now we can use that shorthand to snatch up these locations:
ab.matchup.batter.id
ab.matchup.batter.fullName
ab.matchup.batSide.code
ab.matchup.pitcher.id
ab.matchup.pitcher.fullName
ab.matchup.pitchHand.code
ab.result.event
ab.result.eventType
ab.result.description
ab.about.inning
ab.about.halfInning
ab.atBatIndex
Okay, here is one part that is actually pretty important. Intentional walks do not necessarily have pitches, depending on the game situation. So in order to maintain a unique ID (you’ll see in a moment) we need to check to see if a play has pitches. If it doesn’t have any pitches, we need to assign a pitch number to it in order to maintain a unique ID. In order to check how many pitches a play has, we need to find the length of:
ab.playEvents
If there is a length, then the play has pitches and we can use the length as the number of pitches. If there is no length, we can assign a value. I use 1.
const pitchnum = (ab.playEvents && ab.playEvents.length) ? ab.playEvents.length : 1;
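Here is the idea run against a couple of fake plays. These little ab objects are made-up stand-ins, not real api data:

```javascript
// Made-up stand-ins: a play with three pitches, and a pitchless play.
const normalPlay = { playEvents: [{}, {}, {}] };
const pitchlessPlay = {};

// If playEvents exists and has a length, use it; otherwise assign 1.
const pitchnum = (ab) =>
  (ab.playEvents && ab.playEvents.length) ? ab.playEvents.length : 1;

console.log(pitchnum(normalPlay));    // 3
console.log(pitchnum(pitchlessPlay)); // 1
```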
Okay, now we can access all of the data I planned to access in the json. We can express this all in javascript using the following code
const play = {}
play.gamepk = data.gameData.game.pk;
const ab = data.liveData.plays.allPlays[i];
play.date = data.gameData.datetime.officialDate;
play.temp = data.gameData.weather.temp;
play.condition = data.gameData.weather.condition;
play.wind = data.gameData.weather.wind;
play.venue = data.gameData.venue.id;
play.batter = ab.matchup.batter.id;
play.batter_name = ab.matchup.batter.fullName;
play.stand = ab.matchup.batSide.code;
play.pitcher = ab.matchup.pitcher.id;
play.pitcher_name = ab.matchup.pitcher.fullName;
play.throws = ab.matchup.pitchHand.code;
play.events = ab.result.event;
play.description = ab.result.eventType;
play.des = ab.result.description;
play.home_team = data.gameData.teams.home.abbreviation;
play.away_team = data.gameData.teams.away.abbreviation;
play.year = data.gameData.game.season;
play.type = data.gameData.game.gamedayType;
play.game_type = data.gameData.game.type;
play.inning = ab.about.inning;
play.topbot = ab.about.halfInning;
play.abnum = ab.atBatIndex;
const pitchnum = (ab.playEvents && ab.playEvents.length) ? ab.playEvents.length : 1;
play.pitchnum = pitchnum;
This code declares a play object, then stores each piece of data within that object under an appropriate label. Pretty simple stuff. Now, about that unique play ID thing I was talking about.
play.id = String(gamepk)+"-"+String(ab.matchup.batter.id)+"-"+String(ab.matchup.pitcher.id)+"-"+String(ab.about.inning) +"-"+ String(ab.atBatIndex) +"-"+ String(pitchnum);
This combines the gamepk, batter id, pitcher id, inning, at bat index, and pitch number into a string with the form of:
123456-123456-123456-1-12-12
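If all of that String() concatenation offends you, a template literal builds the same id. The values below are placeholders standing in for real api data:

```javascript
// Made-up placeholder values standing in for real api data.
const gamepk = 123456;
const ab = {
  matchup: { batter: { id: 111111 }, pitcher: { id: 222222 } },
  about: { inning: 1 },
  atBatIndex: 12,
};
const pitchnum = 3;

// Template literals stringify the numbers for us.
const id = `${gamepk}-${ab.matchup.batter.id}-${ab.matchup.pitcher.id}-${ab.about.inning}-${ab.atBatIndex}-${pitchnum}`;
console.log(id); // 123456-111111-222222-1-12-3
```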
We can rearrange all of this into a javascript loop like this:
const gamepk = data.gameData.game.pk;
for (let i = 0; i < data.liveData.plays.allPlays.length; i++) {
const play = {}
play.gamepk = gamepk;
const ab = data.liveData.plays.allPlays[i];
play.date = data.gameData.datetime.officialDate;
play.temp = data.gameData.weather.temp;
play.condition = data.gameData.weather.condition;
play.wind = data.gameData.weather.wind;
play.venue = data.gameData.venue.id;
play.batter = ab.matchup.batter.id;
play.batter_name = ab.matchup.batter.fullName;
play.stand = ab.matchup.batSide.code;
play.pitcher = ab.matchup.pitcher.id;
play.pitcher_name = ab.matchup.pitcher.fullName;
play.throws = ab.matchup.pitchHand.code;
play.events = ab.result.event;
play.description = ab.result.eventType;
play.des = ab.result.description;
play.home_team = data.gameData.teams.home.abbreviation;
play.away_team = data.gameData.teams.away.abbreviation;
play.year = data.gameData.game.season;
play.type = data.gameData.game.gamedayType;
play.game_type = data.gameData.game.type;
play.inning = ab.about.inning;
play.topbot = ab.about.halfInning;
play.abnum = ab.atBatIndex;
const pitchnum = (ab.playEvents && ab.playEvents.length) ? ab.playEvents.length : 1;
play.pitchnum = pitchnum;
play.id = String(gamepk)+"-"+String(ab.matchup.batter.id)+"-"+String(ab.matchup.pitcher.id)+"-"+String(ab.about.inning) +"-"+ String(ab.atBatIndex) +"-"+ String(pitchnum);
}
I popped the gamepk out of the loop to just have less repetition in the loop. Should I move those other repetitive things out, too, such as venue and date? Yes. Why didn’t I? Because I’m copying this from working code, and I don’t know why past me decided to keep them in the loop. Sometimes we have to live with the arbitrary decisions of the past.
There is a logic check that we should probably make. We don’t want to accidentally scrape a partially done game, because then we would have to check the last play we downloaded and we’re way too lazy for that, right? Instead, let’s just download completed regular season games and skip ongoing games, exhibition games, and post season games. For that, we need this logic check:
const game_status = data.gameData.status.abstractGameState;
const game_type = data.gameData.game.type;
if (game_status !== "Final" || game_type !== "R") { return }
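To convince yourself the filter does what we want (keep a game only when it is both finished and a regular season game), you can poke the condition with some fake status/type pairs:

```javascript
// Keep a game only when it is finished AND a regular season game.
const keepGame = (game_status, game_type) =>
  game_status === "Final" && game_type === "R";

console.log(keepGame("Final", "R")); // true  (completed regular season game)
console.log(keepGame("Live", "R"));  // false (still in progress)
console.log(keepGame("Final", "P")); // false (postseason)
```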
Oh, and what are we doing with this data? Well, personally, I would put it into a sql database, but that is a bit beyond what I am willing to write about today so instead I will throw it into a csv file. You could toss that csv into sql if you want, that should be easy enough. I know a lot of you only use excel, so I guess this is perfect for those of you. Also, maybe you just want a csv backup of the game data so you don’t have to redownload it again in the future should something go wrong.
In order to put it into a csv, we will want to save chunks of the data and then throw those chunks into the fast-csv library to write to file. We will call the chunks “output”. Now we wrap all this code together into one function that takes something called “data” as input and sends output to a write to file function. Wammo bammo we wrote the scraping function!
function scrape(data) {
let output = [];
const game_status = data.gameData.status.abstractGameState;
const game_type = data.gameData.game.type;
if (game_status !== "Final" || game_type !== "R") { return }
const gamepk = data.gameData.game.pk;
for (let i = 0; i < data.liveData.plays.allPlays.length; i++) {
const play = {}
play.gamepk = gamepk;
const ab = data.liveData.plays.allPlays[i];
play.date = data.gameData.datetime.officialDate;
play.temp = data.gameData.weather.temp;
play.condition = data.gameData.weather.condition;
play.wind = data.gameData.weather.wind;
play.venue = data.gameData.venue.id;
play.batter = ab.matchup.batter.id;
play.batter_name = ab.matchup.batter.fullName;
play.stand = ab.matchup.batSide.code;
play.pitcher = ab.matchup.pitcher.id;
play.pitcher_name = ab.matchup.pitcher.fullName;
play.throws = ab.matchup.pitchHand.code;
play.events = ab.result.event;
play.description = ab.result.eventType;
play.des = ab.result.description;
play.home_team = data.gameData.teams.home.abbreviation;
play.away_team = data.gameData.teams.away.abbreviation;
play.year = data.gameData.game.season;
play.type = data.gameData.game.gamedayType;
play.game_type = data.gameData.game.type;
play.inning = ab.about.inning;
play.topbot = ab.about.halfInning;
play.abnum = ab.atBatIndex;
const pitchnum = (ab.playEvents && ab.playEvents.length) ? ab.playEvents.length : 1;
play.pitchnum = pitchnum;
play.id = String(gamepk)+"-"+String(ab.matchup.batter.id)+"-"+String(ab.matchup.pitcher.id)+"-"+String(ab.about.inning) +"-"+ String(ab.atBatIndex) +"-"+ String(pitchnum);
output.push( play );
}
writeToFile(output);
}
That is a big chunk of the code done. Now we gotta set up the exporting to csv stuff. It is actually pretty simple. It looks like this.
import fs from 'fs';
import csv from 'fast-csv';
const ws = fs.createWriteStream('./data.csv');
const stream = csv.format();
stream.pipe(ws);
stream.write([ 'gamepk', 'date', 'temp', 'condition', 'wind', 'venue', 'batter', 'batter_name', 'stand', 'pitcher', 'pitcher_name', 'throws', 'events', 'description', 'des', 'home_team', 'away_team', 'year', 'type', 'game_type', 'inning', 'topbot', 'abnum', 'pitchnum', 'id' ]);
I am not going to explain most of that, you can google it. Two notes, though: we use import rather than require because we set “type”: “module” in package.json earlier, and the header row is listed in the same order we fill in the play object, so the columns line up. This code will save the data to a file called “data.csv”. Neat, right?
Oh, and we need that writeToFile function. That looks like this.
function writeToFile(data) {
data.forEach( (row) => stream.write(row) );
}
Simple! That just takes each line of the data and writes it to the csv file. Easy enough.
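If you are curious what fast-csv is roughly doing for us, each row boils down to joining values with commas. The real library also handles quoting, escaping, and streaming, which is why we use it instead of this toy version:

```javascript
// A toy version of csv row formatting, just to demystify the idea.
// Real csv needs quoting and escaping, which fast-csv handles.
const toCsvLine = (row) => Object.values(row).join(",");

// Placeholder play data, not from the api.
const play = { gamepk: 123456, inning: 1, topbot: "top" };
console.log(toCsvLine(play)); // 123456,1,top
```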
Believe it or not, there are only four-ish things we need to do at this point. First, how do we load the webpage to get the data from the internet? Let’s jump over that hurdle using the fetch api, it couldn’t be easier.
fetch(url)
.then(res => res.json())
This fetches a url and parses it as json. That is the basis of everything we need, except there are two types of urls we need to deal with. First, the schedule to get the gamepks, then the game url to get the data.
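If .then chains make your eyes cross, the same fetch-and-parse step can be written with async/await. To keep this sketch runnable without hitting the network, I pass the fetch function in as a parameter; in the real scraper you would hand it node-fetch's fetch:

```javascript
// Same fetch-then-parse-json step, written with async/await.
// fetchFn is whatever fetch implementation you hand it (e.g. node-fetch).
async function fetchJson(fetchFn, url) {
  const res = await fetchFn(url);
  return res.json();
}

// A fake fetch so you can see it work without the network.
const fakeFetch = (url) =>
  Promise.resolve({ json: () => Promise.resolve({ echoed: url }) });

fetchJson(fakeFetch, "https://example.com").then((data) =>
  console.log(data.echoed) // https://example.com
);
```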
For the schedule, we want to take the data and throw it into a function to parse out the gamepk data. So let’s write that function. First, we need to look at the schedule json, again with the JSON beautifier.
Note that the schedule json has a dates array, each date holds a games array, and each game in there carries its gamePk. So, we will want to write a loop that goes through those arrays to get the gamepks. I write that function thusly.
function setGamepks(data) {
const gpks = [];
for (let d = 0; d < data.dates.length; d++) {
const total = data.dates[d].games.length;
const games = data.dates[d].games;
for(let g = 0; g < total; g++) {
gpks.push(games[g].gamePk);
}
}
const gamepks = gpks.unique();
next(gamepks);
}
This function creates a gpks array to store the gamepks, then for each date in the schedule it finds the total number of games and turns the games array into a variable. It then loops through the games array and pushes each gamepk to the gpks array. Then it uses a custom array function called unique to remove duplicate gamepks, which can happen with rainouts, resumed games, etc, and pushes that resulting array to a function called next. Why did I call it next? I dunno, because it goes to the next game in the array? Because I am bad at naming things? You decide.
The custom array function can be added to your default array using the following code:
Array.prototype.contains = function(v) {
for (var i = 0; i < this.length; i++) {
if (this[i] === v) return true;
}
return false;
};
Array.prototype.unique = function() {
var arr = [];
for (var i = 0; i < this.length; i++) {
if (!arr.contains(this[i])) {
arr.push(this[i]);
}
}
return arr;
}
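As an aside, modern javascript can do the deduplication without touching Array.prototype at all: a Set keeps only unique values. If you prefer, you could swap gpks.unique() for this (the gamepks below are placeholders):

```javascript
// Deduplicate with a Set instead of a custom prototype method.
const gpks = [716463, 716463, 716464]; // placeholder gamepks with a duplicate
const gamepks = [...new Set(gpks)];
console.log(gamepks); // [ 716463, 716464 ]
```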
Now, the next function: This function will loop through the list of gamepks, fetch the data from the internet, and give the data to the scrape function we already wrote. It is pretty simple.
function next(gamepks) {
for (let game = 0; game < gamepks.length; game++) {
const gamepk = gamepks[game];
const url = gameurl(gamepk);
fetch(url)
.then(res => res.json())
.then(json => scrape(json) )
.catch(err => console.error(err));
}
}
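One thing to know about that loop: it fires every fetch at once, which is fine for one day of games but a bit rude to the api if you ever feed it a whole season. A sequential sketch, awaiting each game before starting the next, could look like this. The fetchFn and handle parameters are injected so the snippet is self-contained; in the real scraper they would be fetch and scrape:

```javascript
// Fetch games one at a time instead of all at once.
// fetchFn and handle are parameters so this sketch runs on its own;
// in the scraper they would be fetch and scrape.
async function nextSequential(gamepks, fetchFn, handle) {
  for (const gamepk of gamepks) {
    const url = "https://statsapi.mlb.com/api/v1.1/game/" + gamepk + "/feed/live";
    const json = await fetchFn(url).then((res) => res.json());
    handle(json);
  }
}

// Demo with a fake fetch; no network involved.
const seen = [];
const fakeFetch = (url) =>
  Promise.resolve({ json: () => Promise.resolve(url) });
nextSequential([1, 2, 3], fakeFetch, (json) => seen.push(json)).then(() =>
  console.log(seen.length) // 3
);
```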
Finally, we have to actually get the schedule url. In order to do so, we need a date. For this example, I will use yesterday’s date. First, we get today’s date.
const today = new Date();
Then we use this to find yesterday’s date.
const yesterday = new Date(today.setDate(today.getDate() - 1));
Then we format it for MLB.
const date = formatDate(yesterday);
Then we throw that into the scheduleurl function.
const url = scheduleurl(date, date);
Pretty simple, right? We now have a working scraper to get data from MLB! Hooray!
Here is the final code:
import fetch from 'node-fetch';
import fs from 'fs';
import csv from 'fast-csv';
const scheduleurl = (startDate, endDate) => "https://statsapi.mlb.com/api/v1/schedule?sportId=1&startDate="+startDate+"&endDate="+endDate;
const gameurl = (gamepk) => "https://statsapi.mlb.com/api/v1.1/game/"+gamepk+"/feed/live";
const formatDate = (date) => {
const day = date.getDate();
const month = date.getMonth() + 1;
const year = date.getFullYear();
return `${year}-${month}-${day}`;
}
const ws = fs.createWriteStream('./data.csv');
const stream = csv.format();
stream.pipe(ws);
stream.write([ 'gamepk', 'date', 'temp', 'condition', 'wind', 'venue', 'batter', 'batter_name', 'stand', 'pitcher', 'pitcher_name', 'throws', 'events', 'description', 'des', 'home_team', 'away_team', 'year', 'type', 'game_type', 'inning', 'topbot', 'abnum', 'pitchnum', 'id' ]);
const today = new Date();
const yesterday = new Date(today.setDate(today.getDate() - 1));
const date = formatDate(yesterday);
const url = scheduleurl(date, date);
fetch(url)
.then(res => res.json())
.then(json => setGamepks(json) )
.catch(err => console.error(err));
function setGamepks(data) {
const gpks = [];
for (let d = 0; d < data.dates.length; d++) {
const total = data.dates[d].games.length;
const games = data.dates[d].games;
for(let g = 0; g < total; g++) {
gpks.push(games[g].gamePk);
}
}
const gamepks = gpks.unique();
next(gamepks);
}
function next(gamepks) {
for (let game = 0; game < gamepks.length; game++) {
const gamepk = gamepks[game];
const url = gameurl(gamepk);
fetch(url)
.then(res => res.json())
.then(json => scrape(json) )
.catch(err => console.error(err));
}
}
function scrape(data) {
let output = [];
const game_status = data.gameData.status.abstractGameState;
const game_type = data.gameData.game.type;
if (game_status !== "Final" || game_type !== "R") { return }
const gamepk = data.gameData.game.pk;
for (let i = 0; i < data.liveData.plays.allPlays.length; i++) {
const play = {}
play.gamepk = gamepk;
const ab = data.liveData.plays.allPlays[i];
play.date = data.gameData.datetime.officialDate;
play.temp = data.gameData.weather.temp;
play.condition = data.gameData.weather.condition;
play.wind = data.gameData.weather.wind;
play.venue = data.gameData.venue.id;
play.batter = ab.matchup.batter.id;
play.batter_name = ab.matchup.batter.fullName;
play.stand = ab.matchup.batSide.code;
play.pitcher = ab.matchup.pitcher.id;
play.pitcher_name = ab.matchup.pitcher.fullName;
play.throws = ab.matchup.pitchHand.code;
play.events = ab.result.event;
play.description = ab.result.eventType;
play.des = ab.result.description;
play.home_team = data.gameData.teams.home.abbreviation;
play.away_team = data.gameData.teams.away.abbreviation;
play.year = data.gameData.game.season;
play.type = data.gameData.game.gamedayType;
play.game_type = data.gameData.game.type;
play.inning = ab.about.inning;
play.topbot = ab.about.halfInning;
play.abnum = ab.atBatIndex;
const pitchnum = (ab.playEvents && ab.playEvents.length) ? ab.playEvents.length : 1;
play.pitchnum = pitchnum;
play.id = String(gamepk)+"-"+String(ab.matchup.batter.id)+"-"+String(ab.matchup.pitcher.id)+"-"+String(ab.about.inning) +"-"+ String(ab.atBatIndex) +"-"+ String(pitchnum);
output.push( play );
}
writeToFile(output);
}
function writeToFile(data) {
data.forEach( (row) => stream.write(row) );
}
Array.prototype.contains = function(v) {
for (var i = 0; i < this.length; i++) {
if (this[i] === v) return true;
}
return false;
};
Array.prototype.unique = function() {
var arr = [];
for (var i = 0; i < this.length; i++) {
if (!arr.contains(this[i])) {
arr.push(this[i]);
}
}
return arr;
}
Let’s scrape some data!
To scrape data, just go back to your command line and type in
node api_scrape
That’s it! It should run pretty much instantly and save the results to a csv file called “data.csv”. Open it up and wammo, bammo! We did it! MLB data! Do you want data from before yesterday? I’m sure you can go figure it out. Do you want different data? Go change the data collection loop. Do you like exit velocity? Launch angle? Pitch location? Go find them in the json and throw them in the loop! Warning, you might need to catch errors with certain stats, because not all stats appear for all pitches or all plate appearances.
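On that last warning: optional chaining is a painless way to guard against stats that only sometimes exist, because it returns undefined instead of throwing when a field is missing. The field names here are illustrative; verify the exact locations in the json yourself:

```javascript
// Optional chaining (?.) returns undefined instead of throwing when a
// field is missing. Field names are illustrative placeholders.
const pitchWithHitData = { hitData: { launchSpeed: 101.2 } };
const pitchWithout = {};

console.log(pitchWithHitData.hitData?.launchSpeed); // 101.2
console.log(pitchWithout.hitData?.launchSpeed);     // undefined
```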
You can view the code in this repo.