Data Sources & Competitions

Useful public data sources: some are free, while others must be purchased from data companies. Sports analytics challenges and competitions are also collected here, some of which offer offline data for testing ideas. A few training environments for reinforcement learning are included as well. The collection focuses mainly on football, though some entries cover other sports.

Data Sources

Kaggle Data is maintained by Kaggle, the well-known data mining competition platform. It hosts many kinds of football data, along with basketball and cricket data.
Jokecamp created this repository of football (soccer) data for anyone to use. Files are added as he finds or builds them. The data is mostly in CSV and JSON formats.
Free football data from StatsBomb
StatsBomb maintains this repository to share new data and research publicly and to improve understanding of the game of football. To encourage new research and analysis at all levels, they have made StatsBomb data for certain leagues freely available for research projects and for anyone with a genuine interest in football analytics.
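As a rough illustration of working with event data of this kind, the sketch below counts events of one type per team. It assumes each event file is a JSON array whose entries carry nested `type` and `team` objects with a `name` field; the inline sample, the team names, and the `count_events` helper are hypothetical and only stand in for a real downloaded file.

```python
import json
from collections import Counter

# Tiny inline sample shaped like an event file (assumed layout: a JSON
# array of events, each with nested "type" and "team" objects).
# The teams and events here are made up for illustration.
SAMPLE_EVENTS = json.loads("""
[
  {"type": {"name": "Pass"}, "team": {"name": "Team A"}},
  {"type": {"name": "Pass"}, "team": {"name": "Team B"}},
  {"type": {"name": "Shot"}, "team": {"name": "Team A"}},
  {"type": {"name": "Pass"}, "team": {"name": "Team A"}}
]
""")


def count_events(events, event_type):
    """Count events of the given type per team (hypothetical helper)."""
    counts = Counter()
    for event in events:
        # Use .get() so events missing a field are skipped or grouped
        # under "unknown" instead of raising KeyError.
        if event.get("type", {}).get("name") == event_type:
            team = event.get("team", {}).get("name", "unknown")
            counts[team] += 1
    return dict(counts)


passes = count_events(SAMPLE_EVENTS, "Pass")
print(passes)
```

To try this on real data, replace `SAMPLE_EVENTS` with the result of `json.load` on a downloaded event file, after checking that the field names match.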

Training Environments

Google Research Football: A Novel Reinforcement Learning Environment
Google Research Football is an RL environment based on the open-source game Gameplay Football. Read the related paper here.
DeepMind MuJoCo Multi-Agent Soccer Environment
The DeepMind MuJoCo Multi-Agent Soccer Environment is a simulator for 2v2 soccer built on the MuJoCo physics engine. Read the related paper here.
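Reinforcement learning environments like these are usually driven through a reset/step loop: the agent observes, picks an action, and receives a reward until the episode ends. The sketch below shows that loop against a hypothetical toy environment, `ToyShootingEnv`, which is not part of either project. The real environments have their own APIs, observations, and action sets, so treat this only as an outline of the interaction pattern.

```python
import random


class ToyShootingEnv:
    """Hypothetical stand-in environment, not part of either project.

    Action 1 means "shoot" and scores with probability 0.2;
    any other action does nothing. An episode ends on a goal
    or after max_steps steps.
    """

    def __init__(self, max_steps=10, seed=0):
        self.max_steps = max_steps
        self.rng = random.Random(seed)
        self.steps = 0

    def reset(self):
        self.steps = 0
        return self.steps  # observation: number of steps taken so far

    def step(self, action):
        self.steps += 1
        scored = action == 1 and self.rng.random() < 0.2
        reward = 1.0 if scored else 0.0
        done = scored or self.steps >= self.max_steps
        return self.steps, reward, done, {}


def run_episode(env, policy):
    """Run one episode and return the total reward collected."""
    obs = env.reset()
    total_reward = 0.0
    done = False
    while not done:
        action = policy(obs)
        obs, reward, done, _info = env.step(action)
        total_reward += reward
    return total_reward


env = ToyShootingEnv()
idle_return = run_episode(env, lambda obs: 0)   # never shoots
shoot_return = run_episode(env, lambda obs: 1)  # always shoots
print(idle_return, shoot_return)
```

Swapping in a real environment means replacing `ToyShootingEnv` with the environment object that project provides and adapting the loop to its return values.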


PSG Sports Analytics Challenge

PSG Sports Analytics Challenge (finished) was a football analytics challenge held by Paris Saint-Germain and École Polytechnique. The challenge provided Opta event data and InStat tracking data, which only participants could download, and asked participants to predict targets such as possession and to design useful tools for football analytics.

Some participants have published their solutions on GitHub, with more details about the competition.

Football Player’s Worth Estimation
Football Player’s Worth Estimation (open) is an exercise that asks participants to predict a player’s market value from the player’s profile information and ability ratings. The competition has a public dataset that is still available for download.
March Machine Learning Mania (Google Cloud & NCAA® ML Competition)

These are Kaggle challenges, held since 2014, that ask data scientists to use machine learning methods to predict the winners and losers of the NCAA basketball tournaments. Many open-source solutions and discussions are available. The datasets can still be downloaded, and the NCAA basketball dataset, which goes back as far as 1894, continues to be updated. The competitions are listed below:

  1. March Machine Learning Mania
  2. March Machine Learning Mania 2015
  3. March Machine Learning Mania 2016
  4. March Machine Learning Mania 2017
  5. Google Cloud & NCAA® ML Competition 2018-Men’s
  6. Google Cloud & NCAA® ML Competition 2018-Women’s
  7. Google Cloud & NCAA® ML Competition 2019-Men’s
  8. Google Cloud & NCAA® ML Competition 2019-Women’s