Catcher Pitch Framing using Machine Learning: Part One

January 31, 2018

by Andrew Engel

As sports leagues add more real-time player tracking technologies (e.g. Statcast, NBA player tracking, NFL Next Gen Stats), the amount of data available to analyze, understand, and potentially predict player performance is growing at an incredible rate. Unfortunately, this volume of data is very difficult to make sense of by hand; the only way to get value out of it is through computational methods. Most casual sports fans historically haven’t had access to these techniques. However, with the development of automated machine learning tools, these methods are being opened up to all fans. We still don’t have access to most of this data, but some baseball data (PITCHf/x) has been available for a while, allowing us to explore the impact of pitching and hitting.

One of the first places this data was used was for pitch framing — the impact a catcher has on the calling of strikes and balls. Pitch framing has been well explored with two common approaches. First, we can simply specify the strike zone and look at the balls called inside the strike zone, and strikes called outside the zone (see StatCorner for this approach). Second, we can try to determine the probability that a pitch will be called a strike. This approach can be done either by looking at portions of the strike zone and calculating the percent of pitches in that zone that were called strikes (see Sons of Sam Horn), or by building a model that uses the PITCHf/x data to predict the probability that a pitch will be a strike (see Baseball Prospectus).

In this post, we will explore both approaches using a machine learning model. Starting with the basic PITCHf/x data for 2016 — where the pitch was released, how fast it was moving, how much it broke, where it crossed home plate, and the handedness of both the pitcher and hitter — we will build a logistic regression model, and use it to explore the data for 2017’s called pitches, in order to evaluate catchers for the 2017 season.

Using DataRobot, we built our logistic regression model and got good performance (an AUC of 0.8956 with a correspondingly good ROC curve).
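For readers less familiar with AUC, it can be read as the probability that a randomly chosen called strike receives a higher predicted probability than a randomly chosen called ball. The following is a minimal sketch of that definition, not the way DataRobot computes it; the function name `auc` and the six example pitches are made up for illustration.

```python
def auc(labels, scores):
    """Area under the ROC curve via pairwise comparison (ties count half).

    labels: 1 for a called strike, 0 for a called ball.
    scores: predicted probability that the pitch is called a strike.
    """
    strikes = [s for y, s in zip(labels, scores) if y == 1]
    balls = [s for y, s in zip(labels, scores) if y == 0]
    if not strikes or not balls:
        raise ValueError("need at least one strike and one ball")
    wins = 0.0
    for p in strikes:
        for n in balls:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(strikes) * len(balls))


# Made-up example data, not taken from PITCHf/x.
example_labels = [1, 1, 1, 0, 0, 0]
example_scores = [0.9, 0.8, 0.4, 0.6, 0.3, 0.1]
example_auc = auc(example_labels, example_scores)  # 8 of 9 pairs ranked correctly
```

The pairwise loop is quadratic in the number of pitches, so for a full season of data a rank-based implementation from a statistics library would be the more practical choice.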

The output of this model is a prediction of the probability that this particular pitch will be called a strike, and we can look at plots of the distribution of predictions for both those pitches actually called strikes (green) and those called balls (purple):

This plot shows some issues with the strike predictions that we will discuss in a future blog post, but the model seems to be working well. We can use the model to make predictions for all called pitches in 2017. Then, we can use those predictions to calculate each catcher’s pitch framing ability in two ways: by classifying each pitch as a strike or a ball, or by using the predicted probability of a called strike directly.

First, we translate each pitch’s predicted strike probability into a prediction of either a strike or a ball. We do this by classifying all pitches with a predicted probability greater than 0.5 as strikes, with the rest being balls.
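As a small illustration of this cutoff, the rule can be sketched as follows. The function and constant names are hypothetical; only the 0.5 threshold comes from the text.

```python
STRIKE_THRESHOLD = 0.5  # cutoff stated in the post


def classify_pitch(p_strike, threshold=STRIKE_THRESHOLD):
    """Label a pitch from its predicted strike probability.

    Strictly greater than the threshold is a predicted strike;
    everything else, including exactly 0.5, is a predicted ball.
    """
    return "strike" if p_strike > threshold else "ball"


# Made-up probabilities for illustration.
labels = [classify_pitch(p) for p in (0.73, 0.50, 0.12)]
# labels == ["strike", "ball", "ball"]
```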

Next, we follow the approach detailed in this StatCorner page to find the best and worst performers at pitch framing. We calculate the number of extra strikes as the number of called strikes the catcher caught that had a predicted strike probability below 0.5, minus the number of called balls the catcher caught that had a predicted strike probability above 0.5. We only consider catchers who had at least 1,000 pitches called by the umpire.
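The counting described here might look like the sketch below. The record layout `(catcher, called_strike, p_strike)`, the function name, and the toy data are all assumptions made for illustration. The published figures are fractional, which suggests an additional adjustment that the post does not describe, so this sketch stops at the raw count.

```python
from collections import defaultdict


def extra_strikes_by_classification(pitches, min_pitches=1000, threshold=0.5):
    """Raw extra-strike count per catcher using the 0.5 classification.

    pitches: iterable of (catcher, called_strike, p_strike) tuples, where
    called_strike is True if the umpire called a strike and p_strike is
    the model's predicted strike probability (hypothetical layout).
    """
    gained = defaultdict(int)  # called strikes the model predicted as balls
    lost = defaultdict(int)    # called balls the model predicted as strikes
    total = defaultdict(int)   # all called pitches caught
    for catcher, called_strike, p_strike in pitches:
        total[catcher] += 1
        predicted_strike = p_strike > threshold
        if called_strike and not predicted_strike:
            gained[catcher] += 1
        elif not called_strike and predicted_strike:
            lost[catcher] += 1
    # Keep only catchers with enough called pitches.
    return {c: gained[c] - lost[c] for c, n in total.items() if n >= min_pitches}


# Toy data with made-up catchers; min_pitches is lowered so the example runs.
toy_pitches = [
    ("A", True, 0.3), ("A", True, 0.8), ("A", False, 0.2), ("A", True, 0.4),
    ("B", False, 0.7), ("B", False, 0.9), ("B", True, 0.6),
    ("C", True, 0.1),
]
toy_counts = extra_strikes_by_classification(toy_pitches, min_pitches=3)
# toy_counts == {"A": 2, "B": -2}; catcher "C" is dropped for too few pitches
```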

This gives us our top 10 catchers by number of extra strikes over an average catcher:

Catcher	Extra Strikes	Run Value from Extra Strikes
Austin Hedges	266.5187671	35.44699602
Tyler Flowers	203.9357833	27.12345918
Yasmani Grandal	168.0882305	22.35573465
Martin Maldonado	141.8951887	18.8720601
Sandy Leon	118.7151284	15.78911207
Buster Posey	117.8232089	15.67048678
Yadier Molina	111.3092983	14.80413668
Austin Barnes	78.70610241	10.46791162
Jeff Mathis	77.58826792	10.31923963
Roberto Perez	76.71821709	10.20352287

Similarly, we see the worst 10 catchers in this category:

Catcher	Extra Strikes	Run Value from Extra Strikes
Jonathan Lucroy	-184.7968473	-24.5779807
James McCann	-141.943514	-18.87848736
Tucker Barnhart	-110.731716	-14.72731823
Robinson Chirinos	-105.9659549	-14.093472
Ryan Hanigan	-88.7177596	-11.79946203
Alex Avila	-80.94357731	-10.76549578
Cameron Rupp	-78.19012273	-10.39928632
Salvador Perez	-74.4950262	-9.907838484
Elias Diaz	-72.17583945	-9.599386647
Omar Narvaez	-69.43673423	-9.235085653
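The post does not say how extra strikes are converted to runs, but in every row of the tables the run value works out to the extra strikes multiplied by about 0.133. The sketch below checks that inferred constant against two rows from the tables above; the constant and function name are assumptions drawn from the numbers, not something the source states.

```python
# Inferred from the published tables; the post does not state this value.
RUNS_PER_EXTRA_STRIKE = 0.133


def run_value(extra_strikes, runs_per_strike=RUNS_PER_EXTRA_STRIKE):
    """Convert extra strikes to runs with a constant per-strike value."""
    return extra_strikes * runs_per_strike


# Rows copied from the tables above.
hedges_runs = run_value(266.5187671)   # table lists 35.44699602
lucroy_runs = run_value(-184.7968473)  # table lists -24.5779807
```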

Second, if we use the predicted probability of a strike directly, we can calculate the number of expected strikes for each catcher by summing the predicted probabilities over all of that catcher’s called pitches. The difference between the actual number of called strikes and the number of expected strikes gives us the number of extra strikes above what we expected the catcher to receive.
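A minimal sketch of this calculation follows, assuming the same hypothetical `(catcher, called_strike, p_strike)` record layout and assuming that expected strikes are the sum of predicted probabilities. The 1,000-pitch minimum is carried over from the first method as an assumption, and the toy data is made up.

```python
from collections import defaultdict


def extra_strikes_by_probability(pitches, min_pitches=1000):
    """Actual called strikes minus expected strikes, per catcher.

    Expected strikes are taken to be the sum of predicted strike
    probabilities over the catcher's called pitches (an assumption).
    """
    actual = defaultdict(int)
    expected = defaultdict(float)
    total = defaultdict(int)
    for catcher, called_strike, p_strike in pitches:
        total[catcher] += 1
        expected[catcher] += p_strike
        if called_strike:
            actual[catcher] += 1
    return {
        c: actual[c] - expected[c]
        for c, n in total.items()
        if n >= min_pitches
    }


# Toy data with made-up catchers; min_pitches is lowered so the example runs.
toy_pitches_prob = [
    ("A", True, 0.3), ("A", True, 0.8), ("A", False, 0.2), ("A", True, 0.4),
    ("B", False, 0.7), ("B", False, 0.9), ("B", True, 0.6),
]
toy_extra = extra_strikes_by_probability(toy_pitches_prob, min_pitches=3)
# Catcher "A": 3 actual strikes against 1.7 expected, so roughly +1.3
# Catcher "B": 1 actual strike against 2.2 expected, so roughly -1.2
```

Unlike the classification method, every pitch contributes here, so a strike called on a pitch with a 0.45 probability counts for less than one called on a pitch with a 0.05 probability.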

We can then look at the top 10 catchers using this method:

Catcher	Extra Strikes	Run Value from Extra Strikes
Austin Hedges	290.0510321	38.57678727
Tyler Flowers	247.7300069	32.94809092
Yasmani Grandal	182.9294209	24.32961297
Buster Posey	149.2730956	19.85332171
Martin Maldonado	147.7739811	19.65393949
Sandy Leon	137.8406481	18.3328062
Yadier Molina	112.3472118	14.94217917
Jeff Mathis	108.2902048	14.40259724
Stephen Vogt	100.9989409	13.43285914
Caleb Joseph	93.95869017	12.49650579

And the worst 10 catchers:

Catcher	Extra Strikes	Run Value from Extra Strikes
James McCann	-183.4191805	-24.39475101
Jonathan Lucroy	-166.9488215	-22.20419326
Elias Diaz	-133.9348569	-17.81333597
Tucker Barnhart	-132.5160819	-17.62463889
Robinson Chirinos	-125.2482114	-16.65801212
Jorge Alfaro	-117.7666034	-15.66295826
Gary Sanchez	-104.8528524	-13.94542936
Brett Nicholas	-96.71005573	-12.86243741
Salvador Perez	-90.24680831	-12.00282551
Travis d’Arnaud	-88.82454016	-11.81366384

Reassuringly, we see a very similar list of catchers between these two approaches. We’ve also seen how using the PITCHf/x data allows us to explore the impact a catcher has on a strike call without having to rely on someone else’s model. This lets us understand not only the framing ability but also the underlying model. Finally, using these techniques and other data sources (fielding or hitting, for example), we could evaluate other areas of baseball performance.

In our next post, we will discuss one problem with this logistic regression approach, and how we can use modern machine learning techniques to resolve it.

About the Author:

Andrew Engel is a Customer Facing Data Scientist at DataRobot. He works with DataRobot customers in a wide variety of industries, including several Major League Baseball teams. He has been working as a data scientist and leading teams of data scientists for over 10 years in a wide variety of domains ranging from fraud prediction to marketing analytics. Andrew received his PhD in Systems and Industrial Engineering with a focus on optimization and stochastic modeling. He has worked for Towson University, SAS Institute, the US Navy, Websense (now ForcePoint), Stics, and HP before joining DataRobot in February of 2016.
