Scoring Goals with Data: Champions League Predictions

We at DataRobot have been enjoying building models to predict various sporting events around the world. However, it’s been pointed out to me that we have shown a bias for covering North American events so far. Several colleagues in Europe asked when we were going to predict the world’s most popular game, football. So I enlisted Chloe and Akshay to help me and see if we could predict the Champions League Knockout Stage and the eventual Champions League winner.

To start, we approached Data Sports Group to see if they would be willing to share some of their rich football data. As soon as we had that data, we faced several challenges. First, how do we extend our models beyond what we’ve done in previous sports and create useful football-specific features that can help our predictions? Second, how best do we merge the different datasets together to create the training dataset? Finally, how can we handle the nature of the knockout stage tiebreakers? Taking advantage of all that we’ve learned from prior sports predictions, and DataRobot’s recent purchase of Paxata, we’ve been able to answer these questions. 

Using Data Sports Group’s data, we calculated Elo ratings for each team, captured additional team rankings based on their goals, and scored and added a DataRobot proprietary ranking for every player. Based on this information, we predict the champion is most likely to come from Manchester City, Liverpool, Juventus, PSG or Bayern Munich, with Manchester City being the slight favorites.

 

Modeling Approach

In the Champions League Knockout Stage, every round except the final is played over two legs, with each team hosting one leg. The winner of each tie is the team with the higher aggregate score; if the aggregate is level, the team with more away goals advances. To simulate this adequately, we decided to build models that predict each team’s goal total in a given home or away matchup.
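As a rough illustration of the tiebreak rule just described, here is a minimal Python sketch. The function and argument names are hypothetical, chosen for illustration only, and extra time and penalties are not modelled, so a tie that is still level after away goals returns None.

```python
from typing import Optional


def two_leg_winner(a_home_goals: int, b_away_goals: int,
                   b_home_goals: int, a_away_goals: int) -> Optional[str]:
    """Decide a two-legged tie between team A and team B.

    Leg 1 is hosted by A (A scores a_home_goals, B scores b_away_goals).
    Leg 2 is hosted by B (B scores b_home_goals, A scores a_away_goals).
    Returns "A", "B", or None if the tie is still level after away goals.
    """
    a_total = a_home_goals + a_away_goals
    b_total = b_home_goals + b_away_goals

    # First criterion: aggregate goals over both legs.
    if a_total != b_total:
        return "A" if a_total > b_total else "B"

    # Second criterion: goals scored away from home.
    if a_away_goals != b_away_goals:
        return "A" if a_away_goals > b_away_goals else "B"

    # Still level: extra time and penalties are outside this sketch.
    return None
```

For example, if A wins 2-1 at home and then loses 1-0 away, the aggregate is 2-2 and B advances on its single away goal.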

To start, we collected the last four years of data from the Premier League, La Liga, Bundesliga, Ligue 1, Serie A, and the Champions League from Data Sports Group. This data contains information on every match played, the players, their individual stats, and team stats for those games. Using that data, we calculated Elo ratings, rankings (inspired by Pythagorean win percentages) based on the individual stats, and individual player ratings (see below) for every team after every game.
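To make the two rating ideas above more concrete, here is a minimal sketch of a standard Elo update and a goal-based Pythagorean expectation. This is not our production script: the K-factor of 20, the 400-point scale, the optional home advantage, and the exponent of 2 are conventional defaults that we assume purely for illustration, and all function names are hypothetical.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Standard Elo expected score for A against B (between 0 and 1)."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update_elo(home: float, away: float, home_goals: int, away_goals: int,
               k: float = 20.0, home_advantage: float = 0.0):
    """Return the (home, away) ratings after one match.

    k and home_advantage are assumed values, not the ones used in the post.
    """
    if home_goals > away_goals:
        actual = 1.0   # home win
    elif home_goals == away_goals:
        actual = 0.5   # draw
    else:
        actual = 0.0   # away win
    expected = expected_score(home + home_advantage, away)
    delta = k * (actual - expected)
    # Whatever the home team gains, the away team loses.
    return home + delta, away - delta


def pythagorean_expectation(goals_for: float, goals_against: float,
                            exponent: float = 2.0) -> float:
    """Pythagorean-style win expectation from goals scored and conceded."""
    gf = goals_for ** exponent
    ga = goals_against ** exponent
    if gf + ga == 0:
        return 0.5  # no goals either way: treat as even
    return gf / (gf + ga)
```

With two teams both rated 1500, a home win moves the ratings to 1510 and 1490 under these assumed settings.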

Next, we built our training dataset in DataRobot using Paxata (see below), combining each team’s current Elo with its season-to-date performance and recent form (for both the team rankings and the individual player ratings). By examining one of our models for the home team’s score, we see (in Figure 1) that the model relied most on each team’s Elo rating, but that the recent and season-to-date performance of each team’s top eleven players was also important. Beyond Elo and the player performance ratings, our additional rankings also influenced the model.

Figure 1: Feature importance of the home team goal model

Digging into the impact of Elo (in Figure 2), we see that the number of goals the model predicts increases as expected with increasing Elo:

Figure 2: Impact of the home team’s Elo on the model’s prediction of home goals

Finally, we examine the effect of our player rating models (see Figure 3 below). The better the team’s players have performed over the course of the season, the more goals the model predicts for the home team:

Figure 3: Impact of the home team’s top 11 players’ rating over the season

Using these models, we predicted the average number of goals scored by each team in each leg of every round of the Knockout Stage. With these averages, we then simulated the entire Knockout Stage. We repeated the simulation 10,000 times and tallied the results to determine the Champions League favorites, as detailed above. The percentage of simulations in which each team won the title (along with the percentage in which it reached each round) is shown in Table 1 below:

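| Team | Quarterfinals | Semifinals | Finals | Champion |
| --- | --- | --- | --- | --- |
| Manchester City | 65% | 41% | 26% | 16% |
| Liverpool | 72% | 43% | 26% | 15% |
| Juventus | 74% | 44% | 25% | 13% |
| PSG | 73% | 43% | 24% | 12% |
| Bayern Munich | 67% | 39% | 22% | 12% |
| Barcelona | 62% | 36% | 20% | 10% |
| RB Leipzig | 66% | 27% | 11% | 4% |
| Real Madrid | 35% | 17% | 8% | 3% |
| Napoli | 38% | 18% | 8% | 3% |
| Atalanta | 55% | 23% | 8% | 3% |
| Chelsea | 33% | 14% | 5% | 2% |
| Dortmund | 27% | 12% | 5% | 2% |
| Atletico Madrid | 28% | 11% | 5% | 2% |
| Valencia | 45% | 15% | 5% | 1% |
| Olympique Lyon | 26% | 8% | 2% | 1% |
| Tottenham | 34% | 9% | 2% | 1% |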

Table 1: Results of DataRobot’s simulation of the Champions League Knockout Stage
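The simulation step can be sketched roughly as follows for a single two-legged tie. We assume here, purely for illustration, that goals in each leg are drawn from independent Poisson distributions around the predicted averages, and that a tie still level after away goals is settled by a coin flip standing in for extra time and penalties. The post does not specify either choice, and the function names are hypothetical.

```python
import math
import random


def sample_poisson(mean: float, rng: random.Random) -> int:
    """Draw a Poisson-distributed goal count (Knuth's method).

    Suitable for small means such as football goal averages.
    """
    limit = math.exp(-mean)
    goals = 0
    product = rng.random()
    while product > limit:
        goals += 1
        product *= rng.random()
    return goals


def simulate_tie(a_home_mean: float, b_away_mean: float,
                 b_home_mean: float, a_away_mean: float,
                 n_sims: int = 10_000, seed: int = 0) -> float:
    """Estimate the probability that team A wins a two-legged tie.

    Leg 1 is hosted by A, leg 2 by B. The four means are the predicted
    average goals for each team in each leg.
    """
    rng = random.Random(seed)
    a_wins = 0
    for _ in range(n_sims):
        a_home = sample_poisson(a_home_mean, rng)
        b_away = sample_poisson(b_away_mean, rng)
        b_home = sample_poisson(b_home_mean, rng)
        a_away = sample_poisson(a_away_mean, rng)
        a_total, b_total = a_home + a_away, b_home + b_away
        if a_total != b_total:
            a_through = a_total > b_total      # aggregate goals
        elif a_away != b_away:
            a_through = a_away > b_away        # away goals
        else:
            a_through = rng.random() < 0.5     # assumed coin flip
        a_wins += a_through
    return a_wins / n_sims
```

Simulating the whole Knockout Stage would chain calls like this through the bracket, feeding each round’s winners into the next round, and count how often each team reaches each stage.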

 

Player Ratings

There is rich data at the player level about each player’s contribution to a team win. Can we predict whether Chelsea are going to beat Liverpool based on their line-ups? And what is the best predictor of match form: a player’s performance in the preceding match, over the preceding month, or aggregated cumulatively over the season as a whole? We built models based on each of these questions to provide a performance score per player and to ascertain the key performance indicators.

In Figure 4 below, we can see the impact of each individual’s match stats on the rating we calculate for that player:

Figure 4: Feature Importance of the player model

Once we have obtained a predicted performance score for each player (i.e. the probability of a win given their inclusion in the squad), we wanted to explore how this can be combined with their teammates’ predicted performances. Perhaps the performance of the best player is the best predictor for a team win. We calculated scores for team average, best player, and top N players on each team, normalizing the scores by position. 
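The aggregation described above might look something like the following sketch. The post does not say how scores were normalized by position, so we assume a z-score within each position group here. The function names, the tuple layout, and the default of N = 3 are all hypothetical.

```python
from collections import defaultdict
from statistics import mean, pstdev


def normalize_by_position(players):
    """Z-score each player's score within their position group.

    players: list of (name, position, score) tuples.
    Returns a dict mapping name to normalized score.
    """
    by_position = defaultdict(list)
    for _, position, score in players:
        by_position[position].append(score)

    normalized = {}
    for name, position, score in players:
        group = by_position[position]
        spread = pstdev(group)
        # A group with one player (or identical scores) has no spread.
        normalized[name] = 0.0 if spread == 0 else (score - mean(group)) / spread
    return normalized


def team_features(scores, top_n=3):
    """Summarize a team's player scores as average, best, and top-N average."""
    ranked = sorted(scores, reverse=True)
    return {
        "team_average": mean(ranked),
        "best_player": ranked[0],
        "top_n_average": mean(ranked[:top_n]),
    }
```

Each summary can then be tried as a candidate feature in the match-level models to see which one predicts team results best.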

 

Paxata Data Prep 

When we took a closer look at the datasets (separated by league), we found a lot of extraneous data. Each league had about 2,500 columns, so it was extremely important to identify the most relevant columns before running the data through DataRobot. Using the Rapid Data Profiling feature in Paxata, we were able to get a better understanding of the data. Paxata’s data prep capabilities then let us remove the unnecessary columns with a single click:

Figure 5: Rapid Data Profiling in Paxata

Figure 6: Column Management in Paxata 

Since the data was separated by leagues, it was important for us to join all these datasets together and build features/key metrics given that our goal was to make predictions for the upcoming Champions League games, in which top teams from the different leagues play against each other.

Using Paxata, we were able to consolidate all this data into one single file; once we had finished with this data, we developed a Python script to calculate Elo scores for all teams. Then, we brought this data back into Paxata to build the Home and Away form for all teams currently in the Champions League. Since form is a categorical variable, we used One Hot Encoding in Paxata to shape this data before we ran it through the model. 

Figure 7: Appending multiple files using Paxata

Figure 8: Shaping and One Hot Encoding using Paxata
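For readers unfamiliar with One Hot Encoding, the idea can be sketched in a few lines of plain Python. We did this step in Paxata, so the code below is only an illustration of the technique: the W/D/L category labels and the form_ column prefix are assumptions, and the function name is hypothetical.

```python
def one_hot(values, categories, prefix="form"):
    """Turn a list of categorical values into rows of 0/1 indicator columns.

    A value that is not in categories yields a row of all zeros.
    """
    return [
        {f"{prefix}_{category}": int(value == category) for category in categories}
        for value in values
    ]


# Example: the result of each team's most recent match (win, draw, loss).
encoded = one_hot(["W", "L"], categories=["W", "D", "L"])
```

Each original value becomes one row with exactly one indicator set to 1, which lets a model treat the categories as separate numeric inputs.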

After shaping and transforming this dataset and adding key features to it, we brought it into DataRobot to build models and get our predictions.

 

Conclusions

Predicting game outcomes, especially for football, the world’s sport, is both fun and challenging. By using DataRobot’s leading enterprise AI platform, we were quickly able to leverage Data Sports Group’s generous data to predict the Champions League Knockout Stage and the eventual Champions League winner. Will Manchester City, Liverpool, or Juventus be victorious, as our models consider most likely? Will PSG, Bayern Munich, or Barcelona pull off an upset despite their slightly lower probabilities? Or perhaps one of the ten other teams will beat its less-than-five-percent odds to take it all. Time will tell. Until then, enjoy the game.

 



About the Authors:

Andrew Engel is General Manager for Sports and Gaming at DataRobot. He works with DataRobot customers across sports and casinos, including several Major League Baseball, National Basketball Association and National Hockey League teams. He has been working as a data scientist and leading teams of data scientists for over ten years in a wide variety of domains from fraud prediction to marketing analytics. Andrew received his Ph.D. in Systems and Industrial Engineering with a focus on optimization and stochastic modeling. He has worked for Towson University, SAS Institute, the US Navy, Websense (now ForcePoint), Stics, and HP before joining DataRobot in February of 2016.

Chloe Coates is an Applied Data Science Associate at DataRobot. She holds a PhD in Materials Chemistry from the University of Oxford with experience in the analysis and modelling of synchrotron diffraction data of disordered materials. She has also represented Oxford in the Varsity Women’s Football Match against Cambridge, for which she was awarded an Oxford Blue.

Akshay Viswanathan is a Data Platform Architect at DataRobot. He works with DataRobot Paxata customers to identify and implement use cases in the Data Prep domain across various industries including Sports, Finance and Healthcare. Akshay received his Master’s in Information Systems and Science with a focus on Data Science and Analytics in 2017. He worked for Paxata before joining DataRobot as part of the Paxata acquisition in 2020.