I often see xG (expected goals) data used to make FPL (Fantasy Premier League) decisions, but the ranges of matches chosen for that data often feel arbitrary. In this article I look at the predictive power of xG data over several different ranges of preceding matches.
To source the data for this I used the excellent understat.com, scraping match data for the last seven seasons; after some data wrangling my data set looked like this.
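As a rough illustration of that wrangling step, the sketch below builds the lagged xG columns from per-match player data. The file name and column names (player_id, date, goals, xG) are assumptions about the scraped output rather than the exact schema I used.

```python
import pandas as pd

# Assumed shape of the scraped understat data: one row per player per match,
# with columns player_id, date, goals and xG (names are illustrative).
matches = pd.read_csv("understat_player_matches.csv", parse_dates=["date"])
matches = matches.sort_values(["player_id", "date"])

# For each player, add xG_1 ... xG_20: the xG recorded 1 to 20 matches earlier.
for lag in range(1, 21):
    matches[f"xG_{lag}"] = matches.groupby("player_id")["xG"].shift(lag)

# Keep only rows where a full 20-match history exists.
lag_cols = [f"xG_{lag}" for lag in range(1, 21)]
dataset = matches.dropna(subset=lag_cols)[["player_id", "date", "goals"] + lag_cols]
```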
Here goals represents the number of goals scored by a player in a particular match, and the columns xG_1 to xG_20 denote the expected goals the player accumulated in each of the preceding matches, from one match ago (xG_1) to twenty matches prior (xG_20). One way to look at this is to calculate a simple correlation between goals scored and each xG column, which, when graphed, gives the chart below.
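A minimal way to compute and plot those correlations, continuing from the hypothetical dataset built above:

```python
import matplotlib.pyplot as plt

lag_cols = [f"xG_{lag}" for lag in range(1, 21)]

# Correlation between goals scored in a match and the xG from each prior match.
correlations = dataset[lag_cols].corrwith(dataset["goals"])

# Plot correlation against how many matches ago the xG was recorded.
correlations.plot(kind="bar")
plt.xlabel("Matches prior")
plt.ylabel("Correlation with goals scored")
plt.show()
```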
Another way to look at this is as a machine learning classification problem: using the xG columns as explanatory variables and labelling each player/match combination as scored or didn't score, we can fit a simple logistic regression model.
Fitting this with the sklearn library in Python, the resulting coefficients again show that the most recent matches contain the most predictive information.
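A sketch of that model, again assuming the dataset built earlier; whether the features were scaled or other solver settings were used isn't stated, so this is only indicative.

```python
from sklearn.linear_model import LogisticRegression

lag_cols = [f"xG_{lag}" for lag in range(1, 21)]

# Binary target: did the player score in this match?
X = dataset[lag_cols]
y = (dataset["goals"] > 0).astype(int)

model = LogisticRegression(max_iter=1000)
model.fit(X, y)

# One coefficient per lagged xG column; larger values indicate
# more recent matches carrying more predictive weight.
for col, coef in zip(lag_cols, model.coef_[0]):
    print(f"{col}: {coef:.3f}")
```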
Taking a similar approach, but this time using sklearn to build random forest models, we can measure how predictive performance changes as more matches are added to the dataset. Precision and recall both improve rapidly at first as more match data is added, but after a few iterations this effect starts to taper off.
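One way to reproduce that experiment, as a hedged sketch: fit a random forest on the first n lag columns for increasing n, and record precision and recall on a held-out split. The exact train/test scheme and model settings I used aren't spelled out here, so treat the values below as placeholders.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split

lag_cols = [f"xG_{lag}" for lag in range(1, 21)]
y = (dataset["goals"] > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    dataset[lag_cols], y, test_size=0.25, random_state=42
)

for n in range(1, 21):
    cols = lag_cols[:n]  # use xG_1 .. xG_n only
    rf = RandomForestClassifier(n_estimators=200, random_state=42)
    rf.fit(X_train[cols], y_train)
    preds = rf.predict(X_test[cols])
    print(
        f"{n:2d} matches: precision={precision_score(y_test, preds):.3f}, "
        f"recall={recall_score(y_test, preds):.3f}"
    )
```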
What all this suggests to me is that a dataset as small as 10 matches is perfectly valid, and that if you do include more historical data in a more extensive model, it would probably be a good idea to perform some dimension reduction before fitting.
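For that dimension-reduction step, something like a PCA pipeline would be one option; the article doesn't commit to a specific technique, so this is purely illustrative, reusing the train/test split from the sketch above.

```python
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Compress the 20 lagged xG columns into a handful of components
# before fitting the classifier.
pipeline = make_pipeline(
    StandardScaler(),
    PCA(n_components=5),
    LogisticRegression(max_iter=1000),
)
pipeline.fit(X_train, y_train)
print(pipeline.score(X_test, y_test))
```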