What Determines A Soccer Player's Salary?

By Peter Feeney

Introduction

MLS

There can be no doubt about it -- soccer is growing incredibly rapidly in the United States. MLS player salaries go up every year, as does average attendance at MLS games. And the league is still actively expanding. The number of high school soccer players has grown by 30% since 2004. And we have a World Cup-winning women's team!

MLS salaries are pretty high by ordinary standards, with a median of $179,000, but relatively low by professional-athlete standards. Keep in mind that the median NFL salary is about $860,000.

In this tutorial, our goal is to analyze the distribution of soccer players' salaries, find what factors determine a soccer player's salary, and see if we can predict a soccer player's salary from their stats. We should be able to use the exploratory data analysis techniques and machine learning algorithms we learned in class to do so.

It is my hope that this tutorial will provide some insights into what skills a soccer player should have to be considered "valuable" by professional teams. It is well-documented that many MLS players are unhappy with their salary -- over a third make less than $100,000 a year. While I can't provide any macroeconomic solutions to these problems, hopefully I can give people some insight into how they should train if they want a pay raise.

Required Tools

I'll use a standard set of libraries here: NumPy to store arrays, pandas to store data in tabular form, seaborn for simple and visually appealing data visualization, scikit-learn for machine learning, etc.

I am also using the powerlaw library. powerlaw is a Python package that makes it simple to analyze heavy-tailed distributions. The relevance of this will become clear later on. You can find out more about powerlaw here.
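For reference, one reasonable import cell for this tutorial looks something like the following (the exact set is up to you):

```python
# Core data-science stack
import numpy as np                 # numerical arrays
import pandas as pd                # tabular data
import matplotlib.pyplot as plt    # base plotting
import seaborn as sns              # statistical visualization

# Heavy-tailed distribution analysis
import powerlaw

# scikit-learn pieces used in the machine learning section
from sklearn.feature_selection import SelectKBest, f_regression, mutual_info_regression
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.neighbors import KNeighborsRegressor
from sklearn.model_selection import train_test_split

sns.set_theme()  # pleasant plotting defaults
```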

I created a short helper function that gives some important information about a pandas dataframe -- the names of its columns and the number of rows it contains -- and then shows that dataframe's first few rows. This is completely optional, but it can be helpful when you're working with dataframes that have a fair number of columns.
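A minimal sketch of that helper (the name `peek` is my own; it just prints the shape information and returns the head):

```python
def peek(df: pd.DataFrame, n: int = 5) -> pd.DataFrame:
    """Print a dataframe's column names and row count, then return its first few rows."""
    print(f"Columns ({len(df.columns)}): {list(df.columns)}")
    print(f"Number of rows: {len(df)}")
    return df.head(n)
```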

Data Collection

Kaggle is an online community of data scientists and machine learning practitioners. They have over 50,000 public datasets to work with. Two in particular look like they will be very useful to us. This dataset gives the salaries of Major League Soccer players from 2007 to 2017. This dataset contains data about every outfield player who's played in the MLS from 1996 to 2020.

Neither dataset contains enough data on its own to draw any conclusions about the relationship between a player's performance and his salary. But by merging the two together we will have a dataset that will work well for our purposes.

We are given salary data from 2007 - 2017 in a different .csv file for each year. The .csv files are identical in structure (same column names). Here we put all of the data together into a single dataframe for ease of analysis.
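A sketch of that step, assuming the yearly salary files are named like `mls-salaries-2007.csv` and sit in the working directory (adjust the paths and column names to your copy of the data):

```python
salary_frames = []
for year in range(2007, 2018):
    # Each yearly file has the same structure (club, name, position, pay columns)
    df = pd.read_csv(f"mls-salaries-{year}.csv")
    df["year"] = year  # tag each row with the season it came from
    salary_frames.append(df)

salaries = pd.concat(salary_frames, ignore_index=True)
peek(salaries)
```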

As one can see, we are given quite a bit of useful data from the salary dataset -- a player's name, his salary, the year in which he played -- but certainly not enough to discover any meaningful insights about the causal factors behind a player's salary.

Let's see what the "player stats" dataset has to offer us:

This dataset clearly gives us a lot more features to work with. From this dataset we have information on a player's:

  1. Name
  2. Team
  3. Position
  4. Games Played
  5. Games Started
  6. Minutes Played
  7. Goals
  8. Assists
  9. Shots Attempted
  10. Shots on Goal
  11. Game Winning Goal
  12. Penalty Kick Goals / Penalty Kick Attempts (unfortunately this one has some malformed data so I'm just going to drop it)
  13. Home Goals
  14. Road Goals
  15. Goals per 90 Minutes Played
  16. "SC%" -- I don't know what this is referring to, so we'll drop this column.
  17. Game Winning Assists
  18. Home Assists
  19. Road Assists
  20. Assists per 90 Minutes
  21. Fouls Committed
  22. Fouls Sustained
  23. Offsides Penalties
  24. Yellow Cards
  25. Red Cards
  26. Shots on Goal
  27. Year Played
  28. Are these regular or postseason stats?

Tidying Player Stats

Should we drop any columns in player stats?

There are a lot of features to work with! In fact, there are probably too many. It can be hard to fit ML models to high-dimensional data, and it's also harder to make sense of a dataframe on a qualitative level when it has a lot of columns. So it's worth thinking about which columns we can drop now.

I'd guess that some of the columns in the dataframe don't have a significant impact on salary. In particular, I suspect that the number of fouls a player sustains and the number of yellow cards he picks up in a season aren't connected to his salary, simply because these things don't often change the way a game goes. It's annoying for a player to rack up yellow cards, but all in all it's not that important. Offsides penalties and red cards can cause a lot of problems for a team, so on an intuitive level it makes sense to keep those as features in whatever ML model we end up using.

"What features have a meaningful output on the output variable?" is a question that can be answered with empirical analysis/ We'll come back to this question.

The bigger problem at the moment is that some columns in this dataframe are redundant. We probably don't need to know how many goals or assists a player recorded at home versus on the road, since we already know his totals. Similarly, it doesn't make much sense to include both "goals scored" and "goals scored per 90 minutes" as columns here.

We'll go ahead and drop the following columns:

  1. "SC%" (can't figure out what it's supposed to mean)
  2. Home Goals
  3. Road Goals
  4. Home Assists
  5. Road Assists
  6. Penalty Kick Goals / Penalty Kick Attempts
  7. Goals per 90 Minutes
  8. Assists per 90 Minutes

We should also rename player_stats' "Year" column to "year" so that it matches the salary dataframe's "year" column. This will make it easier to merge the two dataframes together.
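A hedged sketch of both steps, assuming the stats dataframe is called `player_stats` and uses abbreviated column names along these lines (check the actual names in your copy and adjust):

```python
# Column names here are guesses at the dataset's abbreviations -- adjust as needed
cols_to_drop = [
    "SC%",                 # unclear meaning
    "HmG", "RdG",          # home / road goals (redundant with total goals)
    "HmA", "RdA",          # home / road assists (redundant with total assists)
    "PKG/A",               # penalty kick goals / attempts (malformed)
    "G/90min", "A/90min",  # per-90 rates (redundant with totals)
]
player_stats = player_stats.drop(columns=cols_to_drop, errors="ignore")

# Match the salary dataframe's lowercase "year" column for the merge later
player_stats = player_stats.rename(columns={"Year": "year"})
```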

Tidying the Salary Dataframe

We should combine the first/last name columns into a single "name" column, since that's how player_stats identifies players and we want to merge the two dataframes together.

A player's guaranteed compensation is a more accurate measure of how much money he makes than his base pay, so we'll go ahead and drop the base pay column.
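A minimal sketch of both tidying steps, assuming the salary columns are named `first_name`, `last_name`, and `base_salary` (rename to whatever your copy uses):

```python
# Build a single "name" column to match player_stats, then drop the pieces
salaries["name"] = (salaries["first_name"].fillna("") + " "
                    + salaries["last_name"].fillna("")).str.strip()
salaries = salaries.drop(columns=["first_name", "last_name", "base_salary"])
```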

Let's go ahead and merge the two dataframes together:
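Assuming the stats dataframe calls its player-name column `Player`, the merge might look like this; an inner merge keeps only player-seasons that appear in both datasets:

```python
# Align the player-name columns, then merge on name and season
player_stats = player_stats.rename(columns={"Player": "name"})
merged = pd.merge(player_stats, salaries, on=["name", "year"], how="inner")
peek(merged)
```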

More tidying

This dataframe isn't clean yet.

Some players play multiple positions. Their position in the dataframe is denoted with two letters: for example, a player who plays both forward and midfield has his entry in the "POS" column listed as F/M. This complicates analysis, especially since we're working with a relatively small dataframe to begin with. If we want to make meaningful connections between a player's position and his pay, then we should simplify things a bit. We'll designate a player's position as whichever one is listed first.
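One way to do that (assuming the merged dataframe is called `merged`):

```python
# "F/M" becomes "F", "M/D" becomes "M", etc. -- keep only the first listed position
merged["POS"] = merged["POS"].str.split("/").str[0].str.strip()
```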

Each player-season is listed twice, once for the regular season and once for the postseason. This is confusing. We're primarily concerned with how a player performs in the regular season, so we'll make a new dataframe out of the regular-season entries and analyze that.
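A sketch of that filter; I'm assuming the season indicator column is called `Season` with values like "reg" and "post", which may differ in your copy of the data:

```python
# Keep only regular-season rows
regular = merged[merged["Season"].str.lower().str.startswith("reg")].copy()
peek(regular)
```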

Exploratory Data Analysis & Visualization

Let's see what we're working with! We'll try to get some basic information on the players in our dataframe.

How many players of each position are represented? How many players from each team? How many players from each year? What's the distribution of goals and assists across players? What's the median salary for each position? The variance? Who are the top-paid players? Who are the top-paid players at each position? How are salaries distributed?
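Starting with positions, a quick count plot (using the `regular` dataframe and the assumed column names from above) looks like this:

```python
# How many player-seasons do we have at each position?
sns.countplot(data=regular, x="POS", order=regular["POS"].value_counts().index)
plt.title("Player-seasons by position")
plt.show()
```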

We have roughly the same number of midfielders and defenders, but fewer forwards than either. This is a natural byproduct of the fact that most soccer teams will put 4 defenders and 4 midfielders on the pitch at any given time but only 2 forwards:

[Figure: number of players by position]

Some teams are more represented than others. This could be happening for a lot of reasons: some teams may have higher turnover than others, for example. But most variation is explained by the fact that some teams are simply older than others. The MLS has expanded a lot since its beginning. Atlanta only began playing in 2017, for example, so it makes sense that they're less represented in our dataset. Similarly, New York had one franchise until 2015, when New York City FC entered the league. The New York Red Bulls (designated as NYRB) are actually quite well represented in our dataset, as both NY and the NYRB.

Since NY and NYRB are the same franchise, we'll say that any player who played for "NY" actually played for "NYRB".
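That's a one-line replacement (again assuming a `Club` column; your copy of the salary data may call it `club`):

```python
# NY and NYRB are the same franchise, so fold them together
regular["Club"] = regular["Club"].replace({"NY": "NYRB"})
```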

Looks better now.

Later years are more likely to be represented in our dataset than earlier years. The MLS has grown rapidly since 2007 (adding 9 teams, which is 70% growth over this period) so this makes sense.

The distribution of goals per season is clearly very skewed. It may be a good idea to plot it on a log scale:
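A sketch of that plot, assuming goals are stored in a column called `G`:

```python
# Goals per season, with a log-scaled count axis to make the long tail visible
sns.histplot(regular["G"], bins=30)
plt.yscale("log")
plt.xlabel("Goals in a season")
plt.show()
```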

Still very skewed. Only a small proportion of players score over 10 goals in a season.

75% of players score fewer than 2 goals a season, but the 5 best score over 20 a season.

The shape is nearly identical to the distribution of goals. The numbers are a bit bigger in absolute terms, probably because up to two players can be credited for an assist.

Why are both distributions so heavily skewed? Why do a few players score so much and everyone else scores so little? It's hard to say. We'll revisit, but for now we'll just make note of the fact that this distribution is highly asymmetrical.

If there are more than 10 such points, then it's probably not fair to treat them as outliers. In other words, we're going to keep the outliers in our dataset.
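To compare pay across positions, a boxplot works well (column names assumed as before, with `guaranteed_compensation` holding the salary figure):

```python
# Guaranteed compensation by position; the extreme upper outliers are the story here
sns.boxplot(data=regular, x="POS", y="guaranteed_compensation")
plt.ylabel("Guaranteed compensation ($)")
plt.show()
```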

You have to feel bad for the defenders here. Only a handful of them make more than $1,000,000 a year.

This plot seems to provide some evidence for the idea that goals and assists (and other offensive stats) are the primary determinants of a player's salary, since midfielders and forwards are making way more than defenders.

It's a relief to know that defenders aren't paid based on completely different criteria than forwards and midfielders. It will simplify our analysis to assume this once we get into the machine learning part of things.

Note how wildly asymmetrical these distributions are.
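To see who sits at the very top of that tail, something like the following does the job (column names assumed as before):

```python
# Five best-paid player-seasons overall
cols = ["name", "Club", "POS", "year", "guaranteed_compensation"]
print(regular.nlargest(5, "guaranteed_compensation")[cols])

# Five best-paid player-seasons at each position
top_by_pos = (regular.sort_values("guaranteed_compensation", ascending=False)
                     .groupby("POS")
                     .head(5))
print(top_by_pos[cols])
```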

Not a whole lot of surprises here -- the highest-paid players are some big American names (Clint Dempsey) and some famous players from overseas who were wooed over to play in the MLS for a little while (David Beckham, Sebastian Giovinco). It's worth noting that players may appear more than once in the top five if they have a long-term lucrative contract, as Sebastian Giovinco did.

This distribution is ridiculously asymmetrical, much more so than the distributions for goals and assists, which were highly asymmetrical themselves. Let's see how it looks on a log scale:

Still ridiculously asymmetrical. We should consider what could possibly be causing these asymmetries.

Many things in nature are fit well by a normal distribution. Heights, blood pressure, measurement error, etc. This can be attributed to the central limit theorem. When independent random variables are added, their sum tends toward a normal distribution even if the original variables themselves are not normally distributed. So any number that can be viewed as the sum/average of a lot of small random variables can be fit well by a normal distribution.

Different generative mechanisms can lead to much different distributions. For example, the product / geometric average of a lot of small random variables leads to a log-normal distribution.

Power Laws

You can find out more about power laws at this link and the paper that accompanied powerlaw's release.

Power laws are heavy-tailed distributions (ones with a heavier tail than the exponential distribution) that occur very often in the real world. In simple terms, this means there are far more large values in the data than could possibly be explained by a normal distribution. Income is distributed according to a power law, so it's fair to guess that soccer players' pay is also distributed according to a power law.

Power laws have some very interesting mathematical properties. If the exponent alpha in the equation below is less than 3, then the distribution has no finite standard deviation. If alpha is less than 2, then it has no finite mean, i.e. no meaningful central tendency.
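For reference, the density in question has the standard power-law form (for values above some minimum $x_{\min}$):

$$p(x) \propto x^{-\alpha}, \qquad x \ge x_{\min}$$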

Why do we care? If we know that compensation follows a power law, then that may shed some light on the generative mechanisms that are causing such discrepancies in pay. The first link I attached states that "power laws arise from the feedback introduced by correlated decisions across a population." For example, the popularity of websites and academic papers follows a power law, since it's driven by "decision-making cascades" -- people have the tendency to copy decisions made by people who acted before them.

If a power law fits this data really well, then does that mean that a good degree of the variation in soccer player pay is caused by a similar "decision-making cascade"? What would that "decision-making cascade" be? Maybe it's the result of teams outbidding each other for top players. Maybe it's something else. It's an interesting enough question (and the powerlaw library makes it easy enough to answer) that we'll use a few lines of code to investigate.

Using the powerlaw library is very easy and you should be able to get the gist of it just by looking at the next few lines.
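Here's roughly what that usage looks like, fitting the guaranteed-compensation values from the `regular` dataframe (the plotting calls mirror the powerlaw documentation):

```python
# Fit a power law to the compensation data; powerlaw chooses xmin automatically
comp = regular["guaranteed_compensation"].dropna().values
fit = powerlaw.Fit(comp)

print("alpha =", fit.power_law.alpha)
print("xmin  =", fit.power_law.xmin)

# Empirical PDF vs. the fitted power law
ax = fit.plot_pdf(color="b", linewidth=2)
fit.power_law.plot_pdf(color="r", linestyle="--", ax=ax)
plt.show()

# Compare candidate distributions: a positive R favors the first argument,
# and a small p-value means the comparison is reasonably decisive
print(fit.distribution_compare("power_law", "exponential"))
print(fit.distribution_compare("truncated_power_law", "exponential"))
print(fit.distribution_compare("power_law", "lognormal"))
```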

[Figures: power-law fit to the compensation data; the power-law equation]

The distance between the data and the fit is small, indicating that a power law might be a good fit. Let's compare it to some other distributions to really see.

The higher the first value in the tuple returned by the distribution_compare method, the more likely it is that the first distribution fits the data better than the second. Similarly, large negative values indicate that the second distribution is the better fit. The second value in the tuple is the p-value -- smaller values mean the package is more confident that one distribution really is better than the other. You can find more rigorous definitions in the paper that accompanied powerlaw's release.

The power law absolutely must fit better than an exponential distribution if it has any hope of being accurate, since otherwise the tail isn't even heavy enough to count as a heavy tail:

A truncated power law seems to fit the data way better than any kind of exponential fit does. That's a good sign. Now let's compare the distribution to what is usually the main competitor to a power-law distribution to describe a heavy-tailed distribution, a lognormal distribution:

A power law seems to fit the data significantly better than even a lognormal distribution. With an alpha value of ~2.09, we can't describe the distribution as having any meaningful standard deviation, and we should be cautious when talking about its central tendency, since alpha is right around 2.

This has a practical implication for us going forward as we begin the machine learning portion of this tutorial: be skeptical of regression analysis approaches, unless we transform the data as shown here.
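For instance, one common workaround (just a sketch, not something we rely on below) is to regress on the logarithm of compensation rather than the raw dollar amount:

```python
# Log-compensation is far less heavy-tailed than raw compensation
regular["log_comp"] = np.log(regular["guaranteed_compensation"])
```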

Analysis! (With machine learning)

We understand a lot about the data we're working with now. We're in a good position to answer our original questions: which features are most important in determining a soccer player's salary, and can we effectively predict a soccer player's salary?

We can use feature selection techniques to determine what features most strongly predict a soccer player's salary. Inspired by this article, we'll use the following techniques to determine which features have the strongest relationship with a player's salary:

  1. Univariate Selection
  2. Feature Importance
  3. Correlation Matrix with Heatmap

Machine learning approaches work best when the training dataset includes a full range of feature combinations. That way our model will be effective at determining how each feature affects a player's compensation.

Machine Learning for Absolute Beginners by Oliver Theobald recommends that a basic machine learning model have ten times as many data points as total features. Right now, our dataset is small in absolute size (at 4250 rows) but it does meet this criterion, since we have only 25 features. We'll drop any rows that interfere with our ability to fit machine learning models to our data, but so long as this rule of thumb holds we should be OK.

For datasets with fewer than 10,000 samples, Theobald recommends clustering and dimensionality reduction algorithms. He recommends regression analysis algorithms for larger dataframes (10,000 <= n <= 100,000). That suits us, since, as we said earlier, we need to be cautious about applying regression analysis when the dependent variable (compensation) is distributed according to a power law.

Univariate feature selection

We'll follow along with the documentation here.

Univariate feature selection selects the most predictive features through the use of univariate statistical tests.

Our task is fundamentally a regression task -- we seek to predict numeric scores. So, as the scikit-learn documentation suggests, we'll use F-test and mutual information methods. F-tests estimate the degree of linear dependency between two random variables, which worries me since the relationship between our features and compensation is unlikely to be linear. Mutual information methods can capture any kind of dependency but require a decent number of samples to be accurate, which is a concern since our dataset is pretty small.

We'll use both methods and compare to see what the most predictive features are.
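A sketch of both selectors using scikit-learn's SelectKBest; here I restrict to numeric columns, so categorical features like club and position would need one-hot encoding (e.g. with pd.get_dummies) to participate:

```python
# Numeric features only; the target is guaranteed compensation
numeric = regular.select_dtypes(include="number").dropna()
X = numeric.drop(columns=["guaranteed_compensation", "year"])
y = numeric["guaranteed_compensation"]

# F-test: scores linear dependence between each feature and the target
f_selector = SelectKBest(score_func=f_regression, k=10).fit(X, y)
print(pd.Series(f_selector.scores_, index=X.columns).sort_values(ascending=False))

# Mutual information: can capture nonlinear dependence, but wants more samples
mi_selector = SelectKBest(score_func=mutual_info_regression, k=10).fit(X, y)
print(pd.Series(mi_selector.scores_, index=X.columns).sort_values(ascending=False))
```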

Results

The f_test gave results that are intuitive and in line with our exploratory data analysis: goals are the most important determinant of a player's salary, and assists are really important too. One is tempted to believe that the importance of SHTS and SOG is down to the fact that players who shoot more are more likely to score, or at least to be seen as great shooters by their teammates. Game-winning goals and game-winning assists are also important determinants of a player's salary. Surprisingly, the number of offsides penalties a player incurs has a big effect on his salary. I wonder if the effect is positive or negative. If it's positive, there's a simple explanation: players who are offsides a lot are simply more likely to create offensive opportunities, some of which occasionally go wrong.

The results given by the mutual info fit don't look as good. Our dataset is relatively small and it probably didn't work as well as it would have with more data.

One counterintuitive result given by both models is that fouls sustained has a large effect on a player's salary. I'd bet that this is because players who sustain fouls are more likely to win penalty kicks and are therefore more likely to score goals.

Feature Importance

Decision tree based regressors can be very effective. Scikit has an "extra-trees" regressor that "fits a number of randomized decision trees on various sub-samples of the dataset and uses averaging to improve the predictive accuracy and control over-fitting". So a concerted effort is made to combat the main drawback of using a decision tree, overfitting.

It will be interesting to see if this tree-based regressor predicts salaries well, and to see what kind of hyperparameter tweaks could boost its performance. But we'll answer those questions later. For now, we simply seek to understand which features the ExtraTreesRegressor finds important. We'll play around with some values of max_depth (one of the most important hyperparameters in this model) to see how that affects things.
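A sketch of that experiment, one-hot encoding club and position so the trees can use them (column names assumed as before):

```python
# One-hot encode club and position, keep numeric columns, and rank features
features = regular.drop(columns=["name"]).dropna()
X = pd.get_dummies(features.drop(columns=["guaranteed_compensation"]),
                   columns=["Club", "POS"]).select_dtypes(include="number")
y = features["guaranteed_compensation"]

for depth in (3, 5, 10, None):
    model = ExtraTreesRegressor(n_estimators=100, max_depth=depth, random_state=0)
    model.fit(X, y)
    importances = pd.Series(model.feature_importances_, index=X.columns)
    print(f"max_depth={depth}")
    print(importances.nlargest(10), "\n")
```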

Analysis

The fundamental conclusions of the decision tree are the same as the f_test from above. Goals and shots taken are really important determinants of a player's salary. So are the number of assists he has. Interestingly, the decision tree model picks up on the importance of club. At every level of max_depth I tried, playing in a big market is a huge determinant of a player's salary: playing for Toronto, LA, or NYCFC seems to be important. I worry about the effect of outliers here: does LA pay everyone a lot, or is David Beckham just dominating the model?

In many ways, the decision tree model gives more intuitive results than the f_test did. It makes a lot of sense that the club a player plays for would have a bigger impact on his salary than the number of fouls he sustains. I'm excited to see if the decision tree predicts a player's salary well...

Results
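The heatmap itself can be produced with a couple of lines (numeric columns only):

```python
# Correlation matrix over the numeric columns, visualized as a heatmap
corr = regular.select_dtypes(include="number").corr()
plt.figure(figsize=(12, 10))
sns.heatmap(corr, annot=True, fmt=".2f", cmap="coolwarm")
plt.show()
```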

Perhaps we should've done this during our EDA; this is a truly fascinating graphic, and there are all kinds of insights that one can glean from it. Here are a few of the biggest ones:

  1. The correlation the f-test found between goals, game-winning goals, shots, shot-on-goals and salary are all very closely related, if not just variants on the very strong relationship between goals and compensation.
  2. The same thing can be said about offsides penalties, which are very strongly correlated with goals scored. This explains the seeming paradox that additional offsides penalties lead to additional goals. A similar dynamic exists for fouls sustained, although it's not nearly as intense.
  3. Funnily enough, assists are less correlated with goals than offsides penalties and fouls sustained are. You're left to conclude that risk-taking (selfish?) players get paid more than their peers. But assists are also a strong predictor of compensation. It's probably the only strong factor behind salary that's truly independent of goals.
  4. Defenders are just screwed. They're dramatically less likely to take shots, score goals, or have assists than other players. It's no wonder that they get paid (on the tail-end of things) a lot less than their peers.

Predictions

I'm excited to see how well that ExtraTreesRegressor is really able to predict salaries.

Since the book I mentioned earlier said that clustering-style algorithms tend to work well on smaller dataframes, I'm going to compare the performance of the tree ensemble to a k-nearest neighbors regression (the closest regression analogue of a clustering approach) and see how it goes. I'll play around with hyperparameters for both the ExtraTreesRegressor and the kNN regressor.
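A sketch of the evaluation setup, reusing the X and y built for the feature-importance experiment above and scoring with R^2 on a held-out test set:

```python
# Hold out a quarter of the data for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

tree_model = ExtraTreesRegressor(n_estimators=200, max_depth=10, random_state=0)
tree_model.fit(X_train, y_train)
print("ExtraTrees R^2:", tree_model.score(X_test, y_test))
```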

Any study that attempts to predict human behavior will usually have an R^2 value below 50%, so it's no big deal that we're at 42.67%. We'll play around with hyperparameters anyway.

So our model probably needs more data than it has, but it's performing pretty well regardless. On to the kNN algorithm:
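And the corresponding nearest-neighbors sketch, trying a few values of k on the same train/test split:

```python
# Distance weighting lets nearby (more similar) players count for more
for k in (3, 5, 10, 20):
    knn = KNeighborsRegressor(n_neighbors=k, weights="distance")
    knn.fit(X_train, y_train)
    print(f"k={k:>2}  R^2 = {knn.score(X_test, y_test):.3f}")
```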

Results

"Oliver Theobald" was really off the mark this time. The ensemble regressor performed much better than the clustering algorithm. Perhaps kNN doesn't perform well when the dependent variable has a heavy-tail distribution, or the statistical sophistication of the ensemble method just makes it better in general and better able to handle hard cases.

CONCLUSIONS

Soccer player pay has a very skewed distribution. Most players do alright financially, making over \$100,000 a year. But a select few are paid huge sums of money, well over \$5,000,000 in some cases. It's this kind of money that has brought big stars like David Beckham and Wayne Rooney over from overseas to play in America for a couple of years, but you can't blame the players for being upset that the median pay isn't higher.

One thing that became clear during our analysis of the factors that determine a soccer player's salary is the enormous importance of scoring goals. If a player wants a pay raise, then the best thing to do is probably to start taking more shots. Defenders are generally in a bad spot, since they naturally take fewer shots than forwards and midfielders, but they can still increase their pay considerably by improving their offensive output.

The analysis in this tutorial is good news for selfish players. Taking a lot of shots seems to have more of a positive effect on salary than assists do! This is really pretty remarkable. Players are told for their whole lives to be good team players, but it's the ones who don't take that advice who are most likely to get paid a lot. It's also good news for aggressive players. The number of offsides penalties a player incurs is actually positively correlated with his salary.

Another piece of insight for players unhappy with their salaries: maybe you should try to sustain more fouls! This seems to have a positive effect on salary.

I should throw in the obvious disclaimer that correlation isn't causation, so you can't really take any of this analysis and turn it into practical advice. But if I were a soccer player I'd be working hard to increase my offensive output.