NBA Position Clustering
Player positions in the NBA have become a rather fluid concept. Teams like the Warriors, with their “Lineup of Death,” have shaken the traditional mindset of the basketball world. We wanted to build a clustering model that used a player’s statistics to identify the player’s “true position.” By “true position,” we mean the position whose typical statistical profile the player most closely resembles. While LeBron James could be listed at just about any position on the floor, we wanted to know what his stats told us. With an unsupervised clustering model, players are grouped together with other players who have a similar statistical profile.
Our Data Mining project looked for statistical groupings in the National Basketball Association that define the different positions in the modern game of basketball. In basketball, a player’s listed position often becomes an argument for what will and won’t work on a roster, when it’s really much more complicated than that. We want to define the numbers behind what a guard, forward, wing or center is, as well as look for outliers, such as forwards performing statistically like guards. The key idea of our project was that there is more to a player than his labeled position. As the NBA has changed in the modern game, there has been a tendency to ‘play small,’ as popularized by the Golden State Warriors’ infamous ‘Lineup of Death’. Examples like these have shown that players long slotted into a single position are actually more flexible, and possibly more effective, when placed somewhere completely new. Our goal was to use clustering to look for what defines each of the four positions and to search for specific groups of players, as well as outliers, in order to view the game differently. Our graphs have been created using Power BI, an interactive data visualization tool, and we strongly recommend opening the dashboard side by side while reading this paper. It can be found at this link.
Data Wrangling and Cleaning
For our project, we used individual statistics of players from the 2010 - 2016 seasons in the NBA, which we obtained from https://probasketballapi.com/. The data wasn’t perfect and required several rounds of cleaning and refining before it was usable. To start, there were several non-basketball athletes in the data set, many players had minimal impact because they hardly played and didn’t last very long in the NBA, and some categories, like position, were labeled in an extremely haphazard manner. For example, a player could be labeled as a Guard, Point Guard, Shooting Guard, Point Guard/Shooting Guard or Shooting Guard/Point Guard with no indication of why any particular choice was made. In order to give a wide range of options, we ran our clusterings over statistics gathered from each season, as well as over statistics averaged across the entire 2010 - 2016 data set.
In cleaning our data, we eliminated all non-basketball players and kept only players who had played at least 20 games and averaged at least 10 minutes per game. To normalize the position labels and make up for a lack of distinction between some positions, we organized the athletes into four basic positions. “Guard” consists of players labeled Guard or Point Guard, or labeled as capable of playing both Point Guard and Shooting Guard. “Wing” consists of Shooting Guards (only 3 players in the whole dataset were labeled as such), hybrid Guard/Forwards and Small Forwards. “Forward” consists of hybrid Small/Power Forwards and pure Power Forwards. And “Center” is our big man group, consisting of hybrid Power Forward/Centers and Centers. We originally defined only three positions, but after looking at the distributions of the many statistical categories provided by our data source, we felt these four categories gave a better foundation for our analysis and better represented the current state of NBA tactics. One of the difficulties in identifying effective clustering measures was the skew in the number of players at each position. This is shown in the already-large and increasing number of guards, as depicted below.
This increase of smaller players seems to agree with the rest of our analysis, as the league is trending towards players who can score from long range as opposed to the traditional, “inside-out” philosophy, but it adds to the challenge of finding effective clustering measures.
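The cleaning rules and position grouping described above can be sketched roughly as follows. This is a minimal illustration, not our actual pipeline: the field names (position, games, minutes_per_game), the exact label strings and the sample rows are all hypothetical, and dropping any row with an unrecognized label is only a stand-in for however non-basketball athletes are identified.

```python
# Hypothetical raw labels mapped onto the four positions used in the analysis.
POSITION_GROUPS = {
    "Guard": "Guard",
    "Point Guard": "Guard",
    "Point Guard/Shooting Guard": "Guard",
    "Shooting Guard/Point Guard": "Guard",
    "Shooting Guard": "Wing",
    "Guard/Forward": "Wing",
    "Small Forward": "Wing",
    "Small Forward/Power Forward": "Forward",
    "Power Forward": "Forward",
    "Power Forward/Center": "Center",
    "Center": "Center",
}

MIN_GAMES = 20                # "at least 20 games"
MIN_MINUTES_PER_GAME = 10.0   # "at least 10 minutes played per game"


def clean_players(rows):
    """Keep rows with a known position label that meet both playing-time thresholds."""
    cleaned = []
    for row in rows:
        group = POSITION_GROUPS.get(row["position"])
        if group is None:
            continue  # unrecognized label (assumed stand-in for non-basketball athletes)
        if row["games"] < MIN_GAMES or row["minutes_per_game"] < MIN_MINUTES_PER_GAME:
            continue  # too little playing time
        cleaned.append({**row, "position": group})
    return cleaned


# Invented sample rows, for illustration only.
sample = [
    {"name": "A", "position": "Point Guard/Shooting Guard", "games": 70, "minutes_per_game": 31.2},
    {"name": "B", "position": "Power Forward/Center", "games": 12, "minutes_per_game": 25.0},
    {"name": "C", "position": "Small Forward", "games": 55, "minutes_per_game": 8.5},
    {"name": "D", "position": "Quarterback", "games": 16, "minutes_per_game": 0.0},
    {"name": "E", "position": "Center", "games": 20, "minutes_per_game": 10.0},
]
cleaned = clean_players(sample)
```

Here only players A and E survive: B has too few games, C too few minutes, and D carries a label outside the mapping.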
We began our analysis by doing simple comparisons among the four positions we had identified. We compared the averages of the box score statistics, the advanced statistics and some of the shot chart statistics to look for simple ways in which the positions differentiated themselves. Some graphs are shown here, but for full size and interactivity for all graphs, please see our dashboard.
We then clustered our dataset using both hierarchical clustering with single-link distance and assignment-based clustering using k-means++ and Lloyd’s algorithm. We originally used Gonzalez’s algorithm as well but found it less effective and narrowed our comparison to single-link and k-means++. For both algorithms, we clustered the data from k = 3 to k = 8, using every combination of 7 statistical feature sets for each player: box score, advanced statistics, shot zone, shot range, shot area, action type and shot type. We ran assignment-based clustering for larger values of k, but because of the time required to run single-link hierarchical clustering, we limited our comparison to k = 3 through k = 8. Even with the limited scope, this resulted in almost 10,000 different clusterings with no simple way to identify which were better suited for our purposes. In order to find which clusterings best represented our chosen positions, we began searching for “polarity” among the results. We define “polarity” for a group of clusters as the average, over the clusters, of the percentage of each cluster made up by its dominant position. Ideally, we would want a group of clusters to have a polarity of 100%, which would mean each cluster consists entirely of one position.
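As a rough sketch of the polarity measure, the function below takes one cluster id and one position label per player and averages the dominant-position share across clusters. The function name, the inputs and the toy labels are ours for illustration, and treating the average as unweighted (every cluster counts equally, regardless of size) is an assumption about the definition.

```python
from collections import Counter


def polarity(cluster_ids, positions):
    """Average, over clusters, of the share held by each cluster's most common position."""
    by_cluster = {}
    for cid, pos in zip(cluster_ids, positions):
        by_cluster.setdefault(cid, []).append(pos)
    shares = []
    for members in by_cluster.values():
        top_count = Counter(members).most_common(1)[0][1]
        shares.append(top_count / len(members))
    return sum(shares) / len(shares)


# Toy example: three clusters with dominant shares 3/4, 2/3 and 3/3.
cluster_ids = [0, 0, 0, 0, 1, 1, 1, 2, 2, 2]
positions = ["Guard", "Guard", "Guard", "Wing",
             "Center", "Center", "Forward",
             "Wing", "Wing", "Wing"]
score = polarity(cluster_ids, positions)  # (0.75 + 0.667 + 1.0) / 3, roughly 0.81
```

Because each cluster counts equally in this version, a cluster of one player always contributes a perfect 100%, which matters when cluster sizes are very uneven.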
The first thing we found is that hierarchical clustering with single-link did not perform as well as we had expected. We had anticipated that single-link would do well in linking the most statistically similar players one at a time and that this would lead to a more linear clustering of players. What it actually did, in most cases, was produce (k - 1) singleton clusters and 1 large cluster containing the rest of the players. This led to great scores on our polarity tests: when a cluster consists of only 1 or 2 players, it’s pretty easy to get a cluster that is 100% the same position. The variation in cluster size is shown in the graph “Cluster Size Standard Deviation.” While not especially valuable for clustering, it was interesting to see the algorithm identify the game’s “superstars” (Russell Westbrook, LeBron James, Kevin Durant, etc.). We then decided to programmatically prioritize cluster sets that had larger chunks of one position in each cluster and to focus solely on k-means++ assignment-based clustering for our results.
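The singleton behavior is easy to reproduce on toy data. The sketch below is a naive single-link implementation written for illustration (not the code we ran), applied to invented 2D points: a dense grid standing in for typical players plus two far-away points standing in for statistical outliers.

```python
import math
from statistics import pstdev


def single_link(points, k):
    """Naive agglomerative clustering: repeatedly merge the two clusters whose
    closest pair of members is nearest, until only k clusters remain."""
    clusters = [[i] for i in range(len(points))]
    while len(clusters) > k:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = min(math.dist(points[i], points[j])
                        for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a].extend(clusters[b])  # merge cluster b into cluster a
        del clusters[b]
    return clusters


# Invented data: a 4 x 3 grid of "typical" points and two distant outliers.
grid = [(float(x), float(y)) for x in range(4) for y in range(3)]
outliers = [(10.0, 10.0), (-8.0, 6.0)]
points = grid + outliers

clusters = single_link(points, k=3)
sizes = [len(c) for c in clusters]   # one cluster of 12 and two singletons
size_spread = pstdev(sizes)          # large spread signals lopsided clusters
```

The grid points chain together one short link at a time, so the outliers are never absorbed and end up as their own clusters. A large standard deviation in cluster size is a quick way to flag this pattern before trusting a high polarity score.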
In the end, our polarity methods determined that the best clustering result was using Lloyd’s algorithm with k-means++, clustering on box score, advanced statistics, shot range (Less than 8 ft., 8-16 ft., 16-24 ft., 24 ft.+), action type (pull-up jumper, alley-oop dunk, etc.) and shot type (2 pt. vs. 3 pt.) for the 2013 season data set where k = 7.
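For readers unfamiliar with the method, here is a compact sketch of k-means++ seeding followed by Lloyd’s algorithm. It is a generic textbook version on invented 2D points, not our project code; the function names, the fixed random seed and the iteration cap are assumptions, and real player features would be much higher-dimensional.

```python
import math
import random


def kmeans_pp_init(points, k, rng):
    """k-means++ seeding: each new center is sampled with probability
    proportional to its squared distance from the nearest chosen center."""
    centers = [rng.choice(points)]
    while len(centers) < k:
        d2 = [min(math.dist(p, c) ** 2 for c in centers) for p in points]
        centers.append(rng.choices(points, weights=d2, k=1)[0])
    return centers


def lloyd(points, centers, max_iter=100):
    """Lloyd's algorithm: alternate nearest-center assignment and centroid
    updates until the assignment stops changing."""
    labels = None
    for _ in range(max_iter):
        new_labels = [min(range(len(centers)), key=lambda c: math.dist(p, centers[c]))
                      for p in points]
        if new_labels == labels:
            break  # converged
        labels = new_labels
        for c in range(len(centers)):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old center if a cluster empties out
                centers[c] = tuple(sum(x) / len(members) for x in zip(*members))
    return labels, centers


# Invented toy points in three loose groups.
points = [(0.0, 0.0), (0.5, 0.2), (0.1, 0.6),
          (5.0, 5.0), (5.4, 4.8), (4.9, 5.5),
          (10.0, 0.0), (10.3, 0.4), (9.8, -0.3)]
rng = random.Random(0)  # fixed seed so the run is repeatable
labels, centers = lloyd(points, kmeans_pp_init(points, 3, rng))
```

The seeding step spreads the initial centers out, which tends to give Lloyd’s algorithm a better starting point than uniform random picks, though the result can still depend on the seed.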
In figure 5, you can see the basic results of our determined best clustering, organized by position and each cluster’s size. We call cluster 1 the “Attack the Rim” cluster. It consists of high volume inside shooters like Derrick Rose, Kobe Bryant, Brook Lopez and JaVale McGee. It’s interesting to see how this clustering put players in very different positions into the same grouping. Cluster 2 is our “True Point Guards” (traditional, pass-first) collection, with Rajon Rondo, Jrue Holiday, Steve Nash and Eric Bledsoe leading the way. Cluster 4 is referred to as our “Spot-up Shooters” cluster, consisting of high volume outside shooters like Jimmer Fredette, Brandon Rush, CJ McCollum and Jason Terry. These are players who are known more for their ability to shoot from the floor and are most likely subpar defenders. Cluster 7, which we call the “6th Man Cluster”, is another intriguing look. It is full of guards known for high scoring and utility in limited minutes. Matthew Dellavedova, J.J. Barea, Jeremy Lin, Patrick Beverley, Lou Williams, Jerryd Bayless, Shelvin Mack, Iman Shumpert and even Andre Miller are all sorted here.
While these little itemizations are fun, overall, we learned a great deal from this project. The first thing we learned was that more data is not inherently a good thing. The more parameters you input, the more confusing your results can become. As your results become harder to visualize, it’s difficult to tell if they actually mean anything. Clustering is also a difficult problem, and the methodology you decide on at the beginning heavily affects your results. In our quest to determine positional outliers, we also had tremendous success. One example is our “6th Man Cluster”. While the cluster is primarily guards, one outlier is the inclusion of Andre Iguodala. Though this is based on the 2013 data set, it feels reminiscent of Iguodala’s run as a key playmaker in Golden State’s aforementioned ‘Lineup of Death’. His ability to play a role far from his position title led to his being named Finals MVP in Golden State’s championship. Looking to the future, similar clusterings on the 2016 data set have us extremely interested in the futures of Sam Dekker, Kelly Oubre Jr. and Myles Turner. It becomes even more difficult to cluster over a career when taking into consideration the evolution of each player’s style of play. One of the most well-known examples of this was Michael Jordan’s shift from the high-flying dunk virtuoso that he was when he entered the league to the clutch scorer from mid-to-long range that he became near the end of his career. We don’t think the NBA has ever been as simple as slotting 5 players into positions on the floor. And it’s only going to get more interesting.
In order to narrow the scope of this project, we chose to cluster based on groups of statistics (box score, advanced, shot type, etc.) rather than individual statistics themselves. This is partly for our own sanity in trying to keep track of 123 different statistical measures for each player over the span of 6 NBA seasons, but mostly because of the combinatorial nightmare that would arise when trying to find effective combinations of statistics for high accuracy in clustering players based on position. Given the time and resources, we would like to run all possible combinations of those stats, or at least a reasonable amount of those combinations, to see if we can fine-tune our view about which statistics really are indicative of position, how those positions change and evolve over time and how the game is influenced by this position fluidity.
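To put the combinatorial problem in numbers, the sketch below counts the non-empty subsets of the 7 feature groups against the non-empty subsets of 123 individual statistics. Reading “every combination” as every non-empty subset is our assumption here, as is the runs-per-data-set figure, which supposes each subset is clustered for six values of k under two algorithms.

```python
from itertools import combinations

FEATURE_SETS = ["box score", "advanced statistics", "shot zone", "shot range",
                "shot area", "action type", "shot type"]


def nonempty_subsets(items):
    """Yield every non-empty combination of the given items."""
    for r in range(1, len(items) + 1):
        yield from combinations(items, r)


group_combos = list(nonempty_subsets(FEATURE_SETS))
n_group = len(group_combos)          # 2**7 - 1 = 127 feature-group combinations

# Assumed run count per data set: k = 3..8 (six values) times two algorithms.
runs_per_data_set = n_group * 6 * 2  # 1524

# Individual statistics are counted rather than enumerated: far too many to list.
n_individual = 2 ** 123 - 1
```

Seven groups give 127 combinations, which is manageable. The same idea applied to 123 individual statistics gives a 38-digit number of combinations, so an exhaustive search is out of reach and any future work would have to sample or prune.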