This blog provides a unique take on using machine learning to predict free agent signings in the off-season.
MLB’s Hot Stove season has begun and several big contracts have already been handed out to Zack Wheeler, Yasmani Grandal, Will Smith, and more. However, over 90% of this year’s free agent class remains unsigned, including the big three of Gerrit Cole, Stephen Strasburg, and Anthony Rendon. Players, teams, agents, and fans all want to know who will sign, for how much, and with which team - and so do we. So, we predicted how the entire free agency market would play out with DataRobot. We believe the history of player performance and free agent signings from prior years has the predictive power to tell us how this off-season will unfold, and we put that data to work through AI (artificial intelligence) and machine learning.
We wanted to predict who will sign, for how much, and which team they will go to. Using DataRobot’s automated machine learning platform and data from numerous sources, ranging from MLB payrolls to free agent signings to historical player performance, we built an array of AI models to tell us specific details about how this free agent market would play out, showing contract values, terms, and destinations for every player.
We also wanted to identify which contracts and players would create the most value for their teams. Guaranteeing money to players who dramatically underperform expectations is a systematic risk in professional sports. However, we believe we can use AI to predict these good and bad contract risks, and have done so in this analysis as well.
We compiled our predictions and analysis in the interactive graphic below, showing every player in this free agent class who had a sufficient track record of data to predict:
Predicting Contract Terms
First, we predicted contract terms for all of this offseason’s free agents: total contract value, average annual value, and years. To do this, we built a series of models that predict the key outcomes of contract negotiations. Free agent negotiations should be driven by the forces of supply and demand, so we built an extensive dataset to quantify these conditions including advanced analytics on individual player performance going back up to five seasons before each contract signing, league-wide and free agent market depth at each position, MLB payroll and luxury tax data, historical contract negotiation outcomes going back 10 years, and key player traits and characteristics (e.g. age, service time, position).
With this combined dataset, we built models in DataRobot to predict Average Annual Value (AAV) and Years for each contract, which we used to calculate Total Contract Value (TCV). We also built in the capacity to accommodate discontinuities in the reality of contract negotiations. For example, trends and patterns that work for a $4M/year player start to break down when you apply them to $20M/year players, so we divided those players and used different models to predict their contracts. Think of this as the “Scott Boras Premium”.
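As a rough illustration of the split-model idea, the sketch below routes a player to one of two sets of models and then derives TCV as AAV times Years. Everything named here is hypothetical: `TierModels`, `predict_contract`, and the `is_premium` rule are stand-ins, and the post does not say how players were assigned to a tier or where the cutoff sat.

```python
from dataclasses import dataclass
from typing import Callable, Dict

Features = Dict[str, float]


@dataclass
class TierModels:
    """A pair of stand-in models for one contract tier (hypothetical)."""
    predict_aav: Callable[[Features], float]    # dollars per year
    predict_years: Callable[[Features], float]  # contract length in years


def predict_contract(
    features: Features,
    is_premium: Callable[[Features], bool],
    standard: TierModels,
    premium: TierModels,
) -> Dict[str, float]:
    """Route a player to the tier-specific models, then derive TCV."""
    models = premium if is_premium(features) else standard
    aav = models.predict_aav(features)
    # Contracts run for whole seasons, so round and enforce a one-year minimum.
    years = max(1, round(models.predict_years(features)))
    # Total Contract Value is simply AAV multiplied by contract length.
    return {"aav": aav, "years": years, "tcv": aav * years}
```

In practice the two `TierModels` would wrap trained regression models; here they can be any callables that take a feature dictionary and return a number.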
This gave us a complete and reliable set of predictions for contract terms. For those interested in data science, most of our models registered R-squared values against our training data of between 0.7 and 0.9, which indicates very strong predictive power for the 2020 offseason, assuming no major shifts in the negotiating positions of players and teams from the last decade.
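For readers less familiar with the metric, R-squared is the share of variance in the actual values that the predictions explain. The function below is a minimal sketch of the standard formula, not DataRobot's implementation, and the name `r_squared` is our own. Scores measured on training data tend to run higher than scores on contracts the model has never seen.

```python
from typing import Sequence


def r_squared(actual: Sequence[float], predicted: Sequence[float]) -> float:
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_actual = sum(actual) / len(actual)
    # Squared error left over after the model's predictions.
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    # Squared error of always guessing the mean.
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot
```

A value of 1.0 means perfect predictions, while 0.0 means the model does no better than always guessing the average contract.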
Insights & Interpretation
We believe AI is only as good as it is explainable, so the charts below show which variables our AI relied on the most to predict AAV for both pitchers and position players.
Position Player AAV Feature Impact
- Qualifying Offer (qual_offer): One of the strongest indicators of value was whether or not a player received and accepted or rejected a ‘Qualifying Offer’ from their team. This season, that was worth a one-year, $17.8M guaranteed contract. Our AI recognized this and added value to our predictions for these players appropriately.
- wRC per Plate Appearance over the last 5 Years (prior_5_wRC_per_PA): This rate metric of productivity per plate appearance over the last five years served as the most important direct indicator of position player productivity in predicting AAV.
- Prior Year WAR (prior_1_WAR): WAR from the prior season also served as a direct and recent indicator of player value and had a strong positive influence on AAV.
Pitcher AAV Feature Impact
- Starting Innings Pitched from the Prior Season (Start_IP): Innings pitched as a starter had a huge positive impact on AAV for pitchers. This is likely partial causation and partial correlation, as starters that go deep provide direct value by eating innings, but also, only good pitchers are allowed to pitch lots of innings as starters.
- Prior 2 Season WAR (prior_2_WAR): WAR from the prior two seasons showed consistency in performance, which matters more for pitchers than for position players because consistency and resiliency are especially important pitcher traits.
- Age: In paying for future performance instead of rewarding for past performance, age matters. Older pitchers lose MPH on their fastballs and sharpness on their sliders, and they are more brittle.
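Feature impact rankings like the ones above are commonly computed with permutation importance: shuffle one feature's values, re-score the model, and see how much the error grows. The sketch below shows that general technique in plain Python. It is an assumption that the charts here were produced this way, and `permutation_impact` and `mean_abs_error` are illustrative names rather than DataRobot functions.

```python
import random
from typing import Callable, Dict, List, Sequence

Row = Dict[str, float]


def mean_abs_error(model: Callable[[Row], float],
                   rows: Sequence[Row],
                   targets: Sequence[float]) -> float:
    """Average absolute gap between predictions and actual values."""
    return sum(abs(model(r) - t) for r, t in zip(rows, targets)) / len(rows)


def permutation_impact(model: Callable[[Row], float],
                       rows: Sequence[Row],
                       targets: Sequence[float],
                       features: List[str],
                       seed: int = 0) -> Dict[str, float]:
    """Error increase when each feature's column is shuffled in turn."""
    rng = random.Random(seed)
    baseline = mean_abs_error(model, rows, targets)
    impact = {}
    for name in features:
        shuffled = [r[name] for r in rows]
        rng.shuffle(shuffled)
        # Copy each row, swapping in the shuffled value for this one feature.
        permuted = [{**r, name: v} for r, v in zip(rows, shuffled)]
        impact[name] = mean_abs_error(model, permuted, targets) - baseline
    return impact
```

A feature the model leans on heavily shows a large error increase when shuffled, while a feature the model ignores shows none.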
Contract terms are only one part of determining winners and losers from this Hot Stove season. We also wanted to know who would sign smart contracts that valued players correctly. After predicting the contracts each player would sign, we predicted which contracts would create (or destroy) the most value for the ‘winning’ teams. Every team hopes they will get their money’s worth when they sign 9-figure contracts, but who will actually be able to make that claim?
To answer this, we built our own player performance forecasting tool, which relied on an array of AI models to predict player performance between 1 and 10 years into the future. Using 1500+ variables across multiple years of historical performance, we used DataRobot to determine which variables and machine learning algorithms were most accurate for predicting future performance. We then combined the results of our year-by-year forecasts to determine how much each player would contribute, as measured by WAR, during the life of the contract. This allowed us to rank contracts in terms of TCV $ per WAR and determine which players will create or destroy the most value for their teams deep into the future.
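The value calculation described above reduces to simple arithmetic once the forecasts exist. Here is a minimal sketch, with hypothetical function names and an assumed rule for players forecast to produce no WAR:

```python
from typing import List, Sequence, Tuple


def contract_war(war_forecast: Sequence[float], years: int) -> float:
    """Sum the year-by-year WAR forecasts over the life of the contract."""
    return sum(war_forecast[:years])


def dollars_per_war(tcv: float, total_war: float) -> float:
    """Total Contract Value divided by projected WAR."""
    # Assumption: a non-positive WAR projection is treated as infinitely costly.
    if total_war <= 0:
        return float("inf")
    return tcv / total_war


def rank_by_value(
    contracts: List[Tuple[str, float, float]]
) -> List[Tuple[str, float]]:
    """Sort (name, tcv, total_war) contracts from cheapest to priciest per WAR."""
    scored = [(name, dollars_per_war(tcv, war)) for name, tcv, war in contracts]
    return sorted(scored, key=lambda item: item[1])
```

For example, a $217M contract projected to return 26.6 WAR works out to roughly $8.2M per WAR, and lower figures indicate better value for the signing team.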
Specific Team & Signing Predictions
Using historical spending tendencies of teams and player-team fits, we also predicted the probabilities for every team to sign each player. We compiled data on historical payrolls by team, free-agent signings by team, holes in depth charts by position for each team, and our projected contract terms, then built AI models that predicted the probability for each team to sign players based on these team-player fits.
Signing Team Probability - Feature Impact and Explanations of Top Features
- Ratio of AAV to Gap Between Team’s Free Agent Opening Payrolls and 5-Year Average Payroll (aav_to_fa_opening_and-5_year_avg...): This ratio compared the size of each player’s contract in terms of Average Annual Value to how much money we’d expect the club to spend in the off-season based on their average Opening Day payroll from the last five seasons. That is - if Player X is demanding $10M/year, and Bidding Club X is currently committed to spending $150M in 2020, but has averaged a total payroll of $200M since 2015 (a $50M gap), then this measure would come out to 0.2 ($10M / $50M). The lower this ratio, the more likely the team is to sign the player, because the player would consume a smaller share of the club’s free agency budget.
- AAV to Club’s Lost WAR at the Player’s Position (aav_to_club_lost_war): This ratio aligns the Player’s AAV with each team’s need to fill a gap at their position. If Clubs lose players with high WAR at a position to free agency, they are more likely to spend on the open market to plug that gap, and that is what this metric indicates. Lower values show a team is more likely to sign a player as they seek value in filling an open spot.
- New Club Remaining WAR at Position (new_club_remaining_pos_WAR): For the player’s position, how much WAR does each bidding club have remaining at that same position? Lower values mean a team is more likely to sign the player as they lack position depth.
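The two ratio features above can be sketched as follows. The function names and the handling of zero or negative denominators are assumptions made for illustration; the post does not describe how those edge cases were treated.

```python
def aav_to_payroll_gap(aav: float,
                       committed_payroll: float,
                       avg_payroll_5yr: float) -> float:
    """Player AAV as a share of the club's expected free agency budget."""
    gap = avg_payroll_5yr - committed_payroll
    # Assumption: a club already at or above its average has no room to spend.
    if gap <= 0:
        return float("inf")
    return aav / gap


def aav_to_lost_war(aav: float, lost_war: float) -> float:
    """Player AAV per unit of WAR the club lost at the player's position."""
    # Assumption: a club that lost no WAR at the position has no gap to fill.
    if lost_war <= 0:
        return float("inf")
    return aav / lost_war
```

Using the example from the first bullet, `aav_to_payroll_gap(10e6, 150e6, 200e6)` divides $10M by the $50M gap and comes out to 0.2.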
Top Free Agent Predictions Explained
Gerrit Cole - $217M ($31M per year, 7 years)
- Projected to produce 26.6 WAR at a cost of $8.2M per WAR
- We see Cole fitting well with several clubs whose free agency budgets can accommodate his contract, and he is a good value for adding WAR.
Stephen Strasburg - $176M ($29M per year, 6 years)
- Projected to produce 19.7 WAR at a cost of $8.9M per WAR
- Strasburg fits with several organizations that have money to spend (only ~$150M committed for 2020) without being pushed against the Luxury Tax Threshold, and he can help shore up a rotation with veteran leadership and production.
Anthony Rendon - $138M ($23M per year, 6 years)
- Projected to produce 22.6 WAR at a cost of $6.1M per WAR
- Rendon represents good value relative to remaining WAR several teams have at 3B.
Josh Donaldson - $117M ($23M per year, 5 years)
- Projected to produce 8.6 WAR at a cost of $13.6M per WAR
After each free agent signing, we will re-run our DataRobot models and update the dashboard in this blog. So be sure to check back often and spread the word!
About the Authors:
John Sturdivant is an AI Success Director at DataRobot and amateur baseball sabermetrician. He has led or advised CEOs in digital transformations across several industries and geographies, maintains the baseball analytics blog baseball-pop.com, and lives in Dallas, TX with his wife and dog. Prior to joining DataRobot, he was Head of Digital and Transformation at TSS, LLC and a consultant at McKinsey & Co.
Sarah Khatry is a data scientist at DataRobot. Prior to joining DataRobot, Sarah worked in longform journalism, experimental physics, and the entertainment industry. Sarah has her B.A. in English and Physics from Dartmouth College.