Brøndby IF — Data-Driven Football Analytics

2nd semester data analysis exam project · Sports analytics · Statistics · R

Using detailed event data from the Danish Superliga and international leagues, this project builds Expected Goals (xG) models, clustering analyses and descriptive comparisons. The aim is to assess Brøndby IF’s shot quality and patterns of play, and to examine how men’s and women’s football differ in style and tactics.

R · Sports analytics · Expected Goals (xG) · Predictive modelling · Clustering · Visualisation

Context & challenge

Football decisions are often guided by intuition, experience and emotion. At the same time, clubs now have access to detailed event data for every pass, shot and duel. In this project, we asked:

  • How can an xG model help Brøndby IF evaluate performance beyond goals?
  • Which factors most strongly influence whether a shot becomes a goal?
  • How can large volumes of match data be turned into clear, usable insight?

The work combines several assignments into one analytical storyline: xG models for Brøndby and three synthetic leagues, clustering of passes and players, and descriptive comparisons of men’s and women’s football using StatsBomb data.

My role – Data & analysis lead

I worked as the main technical and analytical driver on the project:

  • Loaded, merged and cleaned large match event datasets in R.
  • Engineered features such as shot distance, angle, body part and formations.
  • Implemented and evaluated xG models (classification trees and random forests).
  • Built clustering models for passes, shots and player profiles.
  • Designed visualisations and tables so results were easy to interpret.

A big part of my work was turning raw, messy data into something that both we and non-technical stakeholders could actually reason with.

xG modelling for Brøndby

Feature engineering

For Brøndby’s Superliga matches we combined two shot datasets and derived variables describing:

  • Shot location (X/Y), distance and angle to goal.
  • Body part used and shot type (open play, free kick, penalty, etc.).
  • Match context such as half and game state.
  • Team and opposition formations at the moment of the shot.
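As an illustration of the spatial features listed above, here is a minimal sketch of how shot distance and angle can be derived from X/Y coordinates. The project itself was done in R; this Python version is only an example and is not the project's code. It assumes a StatsBomb-style pitch (120 x 80 units, goal centred at (120, 40), posts at y = 36 and y = 44); other data providers use different coordinate systems, and the function names are hypothetical.

```python
import math

# Assumption: StatsBomb-style coordinates, attacking towards x = 120.
GOAL_X = 120.0
POST_LOW_Y, POST_HIGH_Y = 36.0, 44.0
GOAL_CENTRE_Y = (POST_LOW_Y + POST_HIGH_Y) / 2


def shot_distance(x, y):
    """Straight-line distance from the shot location to the centre of the goal."""
    return math.hypot(GOAL_X - x, GOAL_CENTRE_Y - y)


def shot_angle(x, y):
    """Angle (radians) that the goal mouth subtends as seen from the shot location."""
    to_low_post = math.atan2(POST_LOW_Y - y, GOAL_X - x)
    to_high_post = math.atan2(POST_HIGH_Y - y, GOAL_X - x)
    return abs(to_high_post - to_low_post)


# Example: a shot from the penalty spot versus a shot from a wide position.
central = (shot_distance(108, 40), shot_angle(108, 40))
wide = (shot_distance(108, 15), shot_angle(108, 15))
```

A central shot sees a wider goal mouth than a wide shot taken from the same depth, which is why the angle carries information that distance alone does not.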

Models

We first built a classification tree to keep the logic explainable, then compared it with a random forest, which combines the predictions of many trees, each fitted to a random resample of the shots. The random forest achieved higher accuracy and highlighted which variables were most important: location and distance dominated, with some effect of formation.
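To make the single-tree versus many-trees idea concrete, here is a deliberately simplified sketch. It is not the project's R code and not a full random forest: each "tree" is a one-split stump fitted on a bootstrap sample using one randomly chosen feature, and the xG estimate is the average of the stumps' goal probabilities. The toy shots (distance, angle) and labels are invented for illustration only.

```python
import random


def gini(labels):
    """Gini impurity of a list of 0/1 labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2 * p * (1 - p)


def fit_stump(X, y, feature):
    """One-split tree on a single feature: (feature, threshold, p_left, p_right)."""
    best = None
    values = sorted(set(row[feature] for row in X))
    for lo, hi in zip(values, values[1:]):
        t = (lo + hi) / 2
        left = [label for row, label in zip(X, y) if row[feature] <= t]
        right = [label for row, label in zip(X, y) if row[feature] > t]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(y)
        if best is None or score < best[0]:
            best = (score, t, sum(left) / len(left), sum(right) / len(right))
    if best is None:  # the feature is constant in this sample, so no split exists
        p = sum(y) / len(y)
        return feature, 0.0, p, p
    return feature, best[1], best[2], best[3]


def fit_forest(X, y, n_trees=50, seed=1):
    """Bagging: every stump sees a bootstrap sample and one random feature."""
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(X)) for _ in X]
        feature = rng.randrange(len(X[0]))
        forest.append(fit_stump([X[i] for i in idx], [y[i] for i in idx], feature))
    return forest


def predict_xg(forest, row):
    """Average the goal probability predicted by each stump."""
    probs = [pl if row[f] <= t else pr for f, t, pl, pr in forest]
    return sum(probs) / len(probs)


# Invented toy shots: (distance, angle in radians) and goal (1) / no goal (0).
X = [(5, 1.0), (8, 0.8), (10, 0.6), (20, 0.3), (25, 0.2), (30, 0.15)]
y = [1, 1, 1, 0, 0, 0]
forest = fit_forest(X, y)
close_xg = predict_xg(forest, (6, 0.9))
far_xg = predict_xg(forest, (28, 0.1))
```

Averaging over resampled trees is what makes the ensemble less sensitive to individual shots than one tree, at the cost of the single tree's readable decision path.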

Beyond Brøndby – leagues & player profiles

Three-league xG model

An xG model fitted to a single league can over- or underestimate goal probability elsewhere. To reduce that risk, we combined “bottom”, “mid” and “top” synthetic leagues into one 3,000-shot dataset and built a second xG model. This allowed us to compare how shot quality and conversion change across different competitive levels.

Clustering passes & players

Using k-means and hierarchical clustering, we grouped:

  • Passes by length, height and start/end location to reveal typical build-up patterns.
  • Players by average shot locations, pass profiles and success rates – useful when clubs want to find replacements with similar characteristics.
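The k-means step can be sketched in a few lines. This is a plain-Python illustration of the standard algorithm (assign each point to its nearest centroid, then move each centroid to the mean of its points), not the R code used in the project. The pass features (length, end x-position) and their values are invented, and in practice features on different scales would normally be standardised first.

```python
import math
import random


def kmeans(points, k, n_iter=20, seed=0):
    """Lloyd's algorithm: returns (centroids, labels) for a list of numeric tuples."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # start from k randomly chosen points
    for _ in range(n_iter):
        # Assignment step: group points by nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: math.dist(p, centroids[c]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its members.
        centroids = [
            tuple(sum(vals) / len(members) for vals in zip(*members))
            if members else centroids[c]  # keep an empty cluster's centroid in place
            for c, members in enumerate(clusters)
        ]
    labels = [min(range(k), key=lambda c: math.dist(p, centroids[c])) for p in points]
    return centroids, labels


# Invented toy passes: (pass length, end x-position on a 120-unit pitch).
passes = [(5, 30), (6, 32), (7, 31), (40, 90), (42, 88), (45, 92)]
centroids, labels = kmeans(passes, k=2)
```

On this toy data the short passes in the team's own half and the long passes into the final third end up in separate clusters, which is the kind of "typical pattern" the clustering is meant to surface.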

These models made it possible to talk about “player types” and typical patterns instead of just isolated statistics.

Gender comparison & freeze-frame analysis

In a later part of the project, we compared men’s and women’s matches using a large StatsBomb dataset. We looked at differences in passing length, shot types, formations and event distributions. We also analysed “freeze-frames”, which record player positions at the moment of a shot, to see how many team-mates and opponents were between the shooter and the goal.
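One way to count players "between shooter and goal" is to test whether each player in the freeze-frame lies inside the triangle formed by the shooter and the two goalposts. The sketch below takes that approach in Python; it is an assumption about how the count could be done, not the project's R implementation. It also assumes StatsBomb-style coordinates (posts at (120, 36) and (120, 44)) and freeze-frame entries with "location" and "teammate" fields; the helper names and the example positions are hypothetical.

```python
def _cross(o, a, b):
    """Z-component of the cross product of vectors o->a and o->b."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def in_shot_triangle(player, shooter, post_low=(120.0, 36.0), post_high=(120.0, 44.0)):
    """True if the player is inside (or on the edge of) the shooter-posts triangle."""
    d1 = _cross(shooter, post_low, player)
    d2 = _cross(post_low, post_high, player)
    d3 = _cross(post_high, shooter, player)
    has_negative = d1 < 0 or d2 < 0 or d3 < 0
    has_positive = d1 > 0 or d2 > 0 or d3 > 0
    # Inside means the point is on the same side of all three edges.
    return not (has_negative and has_positive)


def count_in_triangle(shooter, freeze_frame):
    """Count team-mates and opponents standing in the shooter's view of the goal."""
    counts = {"teammates": 0, "opponents": 0}
    for player in freeze_frame:
        if in_shot_triangle(player["location"], shooter):
            counts["teammates" if player["teammate"] else "opponents"] += 1
    return counts


# Invented example: a shot from the penalty spot with three other players nearby.
shooter = (108.0, 40.0)
freeze_frame = [
    {"location": (119.0, 40.0), "teammate": False},  # goalkeeper on the line of the shot
    {"location": (110.0, 30.0), "teammate": False},  # defender well outside the triangle
    {"location": (114.0, 41.0), "teammate": True},   # team-mate in front of goal
]
counts = count_in_triangle(shooter, freeze_frame)
```

The resulting counts can then be averaged per competition to compare how crowded the shooting lane typically is.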

The analysis challenged stereotypes about women’s football being “less aggressive” and showed that many patterns are more similar than popular narratives suggest.

Outcome & learnings

  • xG gives a more stable view of performance than raw goals alone.
  • Spatial features like distance and angle are strong predictors of whether a shot becomes a goal.
  • Clustering can translate huge datasets into understandable “profiles” for passes and players.
  • Good visualisations are essential when communicating complex models to coaches and non-technical stakeholders.

Overall, the project sharpened my skills in statistical modelling, working with large sports datasets and turning technical work into concrete, visual insight.