The Greatest NBA Team Of All Time — According To The Numbers

How data can (kind of) settle the debate over who’s better: The 2015–16 Golden State Warriors or the 1995–96 Chicago Bulls.

Steven Cipriano
Published in The Cauldron · 13 min read · May 4, 2016

--

The 2015–16 Golden State Warriors achieved the greatest NBA regular season of all time (73–9), beating out the 1995–96 Chicago Bulls (72–10) for the best record ever. Naturally, the Dubs’ historic achievement has led to much discussion about the greatest NBA team of all time (GNTOAT), whether league-wide talent is better now than it has been in the past, and, obviously, who would win in a head-to-head match-up between Stephen Curry’s Warriors and Michael Jordan’s Bulls.

Finding myself surprisingly motivated to solve these pressing questions, and having recently been introduced to the Elovation App, I aimed to resolve the arguments once and for all.

Numbering the Beast

At its core, comparing two teams from different eras that have consequently never played one another is tricky business. To effectively parse out the data, we need some understanding of exactly how such analysis can be performed.

As it turns out, similar comparisons have been attempted before, particularly in the world of chess.

ELO

International chess rankings have been officially calculated using the ELO rating system since 1970. ELO works by calculating the likelihood of a player winning a game based on that player’s current rating relative to his or her opponent’s. Over time, if a player wins more games than his or her rating would indicate, that player’s rating increases and those of the defeated opponents decrease.
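
For the mathematically inclined, the core of the ELO calculation fits in a few lines of Python. This is a minimal sketch with an illustrative K-factor of 32; Elovation’s actual constants and implementation may differ.

def expected_score(rating_a, rating_b):
    # Probability that A beats B under the ELO model.
    return 1 / (1 + 10 ** ((rating_b - rating_a) / 400))

def update(rating, expected, actual, k=32):
    # Move the rating toward the actual result; K controls the step size.
    return rating + k * (actual - expected)

# Example: a 1,100-rated team upsets a 1,300-rated team.
e = expected_score(1100, 1300)    # ~0.24
winner = update(1100, e, 1.0)     # gains ~24 points
loser = update(1300, 1 - e, 0.0)  # loses ~24 points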

See, so sciency!

The ELO system has some notable problems: It encourages players to target only competitors they are likely to beat, and it discourages highly ranked players from competing at all so as to preserve their rank. Neither of these applies to an NBA season, where there is a set number of games and a predetermined schedule.

That said, the most relevant problem in this case is not specific to the rating system, but rather to temporal circumstances: competitors are ranked relative to their present competition. There is no reliable way to objectively rank teams or players from different time periods when results depend on the play of their respective opponents. The most the system can determine is how dominant a participant was relative to the field. Apparently, Arpad Elo (the creator and namesake of ELO) even believed it was an insufficient system for ranking players from different time periods.

Historically, ELO and variants of it have been adapted for a variety of applications, including the (beloved) college football BCS rankings, the FIFA Women’s World Rankings, Magic: The Gathering, Pokémon, and numerous video and computer games.

Trueskill

To address the problems with ELO, Microsoft created and subsequently patented the Trueskill rating system in 2007 for use on Xbox. It works similarly to ELO in that it calculates the likelihood of players winning a match. Where Trueskill differs from ELO is that it holds additional meta-information that can affect how much a player’s rating changes after a win or loss. Specifically, a “confidence score” indicates how likely the rating is to be a reflection of the player’s actual abilities. Generally, as the player plays more games, this confidence score goes up, as there is a larger sample size on which to base the rating. If a player has unexpected results — losing to a player they were not expected to lose to, for instance — the confidence factor can be lowered.

Confidence levels play an important role in determining how much a given result affects each player’s rating. For example: If a player with a high confidence rating — meaning the algorithm is fairly certain that player’s rating is an accurate reflection of his or her skill level — defeats a higher-ranked player with a low confidence rating, the losing, higher-ranked player will likely see a significant drop. Meanwhile, the winner will not see as much of a change, since their confidence rating was quite high and their opponent’s was not.

Taken to the extreme, if a player whose confidence level is high defeats a player of much lower rank, the winning player’s confidence level and rank will hardly change at all. If that highly ranked player loses to a lower-ranked opponent, however, there will be a sizable drop in their ranking. This helps guard against very skilled players continuously targeting noobs.
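
To make that concrete, here is a quick sketch using the open-source Python trueskill package (an independent implementation of the algorithm, not Microsoft’s or Elovation’s code):

import trueskill

# A confident veteran (low sigma) versus an uncertain newcomer (high sigma).
veteran = trueskill.Rating(mu=30.0, sigma=2.0)
newcomer = trueskill.Rating(mu=25.0, sigma=8.0)

# Upset: the newcomer beats the veteran.
new_newcomer, new_veteran = trueskill.rate_1vs1(newcomer, veteran)

# The uncertain winner's rating jumps sharply, the confident loser's
# barely moves, and the winner's sigma shrinks now that the system has
# more evidence to go on.
print(new_newcomer, new_veteran)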

Trueskill also allows for calculation of an individual player’s rating even when participating in a multi-player team — something ELO does not support. (This is an awesome feature of the system, but does not affect our GNTOAT exercise. Additional calculations of individual players’ contributions could be an interesting avenue to investigate.)

In general, Trueskill is more reactive than ELO to unexpected outcomes, allowing for larger swings in ratings over fewer results. Trueskill and its derivatives have been used for ratings and matchmaking in various video and computer games, including the Halo franchise, World of Warcraft, Call of Duty, and of course Rocket League.

The Great Beyond

There are numerous adaptations of these systems, like Glicko, as well as other unrelated ratings algorithms. However, the Elovation project currently only supports ELO and Trueskill, so they are the two we will deal with.

Drawbacks

In addition to not being able to directly compare teams that have never played each other, there is also the matter that neither ELO nor Trueskill takes into account margin of victory. A team that ekes out wins is likely weaker than a team that consistently wins by double digits.

In Search of a Champion

(Note: If you thought that last section was overly technical, you’ll love this next one! Feel free to skip to The Grand Coronation below for instant gratification.)

(Fair warning: the search relies on my obviously rusty Ruby skills.)

To crown the ultimate GNTOAT, I entered the regular-season results from every game of four different NBA seasons into the Rails Elovation App: 1985–86, 1986–87, 1995–96, and 2015–16, which correspond to the (consensus) greatest Boston Celtics, Los Angeles Lakers, Bulls, and Warriors teams, respectively.

(Note: the entire codebase — a fork of the Elovation project — is located here. To run Elovation with the specific versions of Ruby and Rails it requires, Docker was incredibly helpful and prevented me from mucking about with my local environment. Seriously, want a different version of Ruby? You can literally just change a number in the Dockerfile telling it what image to load. Docker is amazing.)

Loading the data into the application was a multi-step process:

  1. Regular season results are available for download in multiple formats (I used CSV) from BasketballReference.com. They look like this.
  2. This data was then cleaned and formatted into JSON with a combination of console commands (mostly sed and awk) and a Python script (sketched just after this list). This was the result.
  3. Some reverse engineering of Elovation culminated in tweaking the Rater model to allow for passing in a custom date for the time of the game. (The app normally assumes the game happened at whatever time it was entered into the system. Shockingly, the unit tests still pass.) Some minor aesthetic tweaks to the graphs were also made to tailor them to this specific data.
  4. Some scripts that create seasons, teams, and individual matches from the JSON data were then run via
docker-compose run web rails runner [path to script]
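
For reference, the cleaning script in step 2 looked roughly like this. It is a simplified sketch: the column positions assume the layout of the Basketball-Reference export, and the filename and output shape are stand-ins for whatever the loader scripts expect.

import csv, json

games = []
with open("1995-96_schedule.csv", newline="") as f:  # hypothetical filename
    reader = csv.reader(f)
    next(reader)  # skip the header row
    for row in reader:
        # Assumed columns: date, tip-off time, visitor, visitor points, home, home points.
        date, visitor, v_pts, home, h_pts = row[0], row[2], int(row[3]), row[4], int(row[5])
        games.append({
            "date": date,
            "winner": home if h_pts > v_pts else visitor,
            "loser": visitor if h_pts > v_pts else home,
        })

with open("1995-96_games.json", "w") as f:
    json.dump(games, f, indent=2)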

At this point all the data was loaded into the app, and pointing a browser at it (usually localhost:3000) allowed for viewing the results through the Elovation UI.

Stats were generated by querying PostgreSQL directly:

SELECT min(value), max(value), avg(value), stddev_pop(value)
FROM ratings JOIN games ON ratings.game_id = games.id
WHERE games.name = '86–87 Season' AND games.rating_type = 'elo';

Facts and Figures

The application yielded some interesting outcomes. The ELO scores for each of the four teams are tabulated below.

Unsurprisingly, the ELO ratings are proportional to overall record. Let’s take a look at Trueskill:

Both the Bulls and Lakers teams have a higher Trueskill rating than teams with better records. Hmm.

Fortunately, Elovation provides graphs of how ratings change throughout each season. Perhaps these can provide some insight with respect to this season’s Warriors.

The overall 2015–16 NBA regular season data in visual form:

2015–2016 ELO

Clearly the Warriors and San Antonio Spurs dominated the rest of the league by a considerable margin, but what’s really interesting is how flat each club’s line trends after the first quarter of the season. The Warriors seem to reach an equilibrium in which the rating gains from the team’s winning streaks are almost perfectly balanced by its losses towards the end of the year.

That is largely true for most of the teams in the league, with the exception of a few at the bottom. Specifically, the 76ers just continued to drop like a stone as the season dragged on. The vast majority of teams, however, were within about 100 points of each other by the end of the year, with an overall average slightly lower than the starting point of 1,000.

What about Trueskill?

Again, in visual form:

2015–2016 Trueskill

Once again we see the Warriors and Spurs with clear separation from the rest of the league. Teams seemed to reach equilibrium here as well, although it took much longer (almost mid-season in most cases) and the trends to get there were not nearly as smooth as in ELO. Trueskill is much more reactive to unexpected results — good teams losing and bad teams winning. (Look at that drop after the Warriors lost in late December!)

What’s also interesting here is that the teams at the bottom were not punished much at all for their losses, but when they managed a win, their ratings shot up dramatically — to the point that even consecutive losses didn’t push them back down to where they had been before.

So what about the Bulls in 1995–96?

The data represented visually:

1995–1996 ELO

And via Trueskill:

1995–1996 Trueskill

Here, we see similar trends between 1995–96 and 2015–16, but the Bulls did not see the kind of separation from the rest of the league that Golden State did. Interestingly, the Bulls and the Warriors were the only teams to break the 3,000 Trueskill mark, and both then fell back below it by the end of the season. (The Warriors topped out at 3,111, while the Bulls’ highest score at any point was 3,064.)

Interestingly, each team’s ELO average was within a point of the other’s, while the Trueskill average was markedly higher for Jordan’s Bulls.

Overall, the data leaves us in a tight spot as far as picking a GNTOAT, especially with the two metrics contradicting each other. Let’s dig a bit deeper …

Into the Depths

What about the perceived strength of opposition faced by the 1995–96 Bulls and 2015–16 Warriors?

Now we are getting somewhere! The additional data seems to suggest the Bulls have an edge. In ELO terms, the Warriors beat stronger teams on average, but also lost to weaker teams. From the Trueskill perspective, the Warriors both beat weaker teams and lost to weaker teams overall.

Admittedly, neither ELO nor Trueskill incorporates margin of victory; each factors in only wins and losses. Let’s see what happens when we consider how badly the Bulls and Warriors beat their opponents:

More data suggesting the Bulls were a superior team.

Chicago holds a slight advantage in margin of victory, but when the Warriors lost, they were absolutely crushed. This could imply that Golden State punted games in which the outcome was already decided, opting to rest its star players to avoid injury. Conjecture? Maybe, but objectively speaking, point differentials indicate the Bulls were the stronger team.
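
For those curious, the point-differential comparison boils down to something like the following sketch (the field names are hypothetical stand-ins for whatever your game records contain):

from statistics import mean

def margins(games, team):
    # Split a team's point differentials into wins and losses.
    wins, losses = [], []
    for g in games:
        if team not in (g["home"], g["away"]):
            continue
        diff = g["home_pts"] - g["away_pts"]
        if g["home"] != team:
            diff = -diff
        (wins if diff > 0 else losses).append(diff)
    return mean(wins), mean(losses)

# e.g. margins(games_9596, "Chicago Bulls") returns
# (average margin in wins, average margin in losses)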

This may reveal a shortcoming in ELO: once a baseline is established, it is much harder to alter than in Trueskill. The Warriors started their season 24–0, as reflected by the astronomical rise on the Trueskill graph, followed by a huge drop after the team’s first loss, a drop from which they never recovered.

On the ELO chart, however, we see a small drop after that first loss, but then an overall upward trend such that the Warriors finish the season (1,219) above where they were at 24–0 (1,195). That final number was also only 6 points below their season peak of 1,225.

The Bulls, on the other hand, lost two of their first 12 games, setting their ELO baseline lower. It took a sustained hot streak (31–1 over the next 32 games) to reach a height of 1,216 at 41–3, just a few games before the All-Star break, and the team finished the season a full 12 points below that, at 1,204.

What is clear from the graphs is that the Warriors — and the Spurs, for what it’s worth — were more dominant over the rest of the league than the Bulls were, perhaps corroborating the widely-held belief that the NBA has seen less parity overall in recent years. Then again, that may not be the case after all.

The Grand Coronation

Many believe that there is more overall talent in the NBA now than ever before. If that is true, the fact that the Warriors — and Spurs, for what it’s worth — managed to dominate the NBA more than Chicago did in its heyday simply cannot be discounted. Factoring in margin of victory, though, muddies the waters somewhat.

Trueskill is more reactive and can produce larger swings in ratings, while ELO runs the risk of being prematurely anchored to a value that is difficult to adjust. (This likely accounts for the discrepancy between the two methods.)

So while the ratings systems may disagree, a secondary examination of point margins favors Chicago. Inconclusive results aside, based purely on the numbers, it’s difficult not to crown the 1995–96 Chicago Bulls as GNTOAT.

Verdict: 1995–1996 Bulls

Also-Rans

For your perusal, here are the graphs of the 1985–86 Celtics and 1986–87 Lakers seasons. Most interesting is the relatively close grouping of teams towards the end of both seasons, perhaps owing to greater parity in the ’80s. (Except the 86–87 Clippers, good lord.) It’s also interesting that in neither of these years was there a clearly dominant team throughout the entire season.

1985–1986 ELO
1985–1986 Trueskill
1986–1987 ELO
1986–1987 Trueskill

Feeling Stronger Every Day?

One of the most interesting aspects of this exercise has been seeing trends over time, particularly the spread of ratings within the league from year to year.

Standard deviations of ratings over time

The increase in the standard deviation of ratings seems to imply the league has seen a separation of power since the late ’80s, although with such a small sample size this is hardly conclusive. Meanwhile, the average rating seems consistent.

Average ratings over time

It’s unclear (to me, at the very least) if the consistency of average ratings is a product of the ratings systems themselves, or if it implies a constant “total” talent in the league. Considering they are ratings relative to the other teams in the league that year, it seems more likely to be the former.

Further investigation

  • Entering all seasons (and even other sports) to examine trends in standard deviations of ratings could give a more definitive view of historical parity in sports. Comparing that with viewership or revenue could demonstrate whether parity really is linked to success for sports leagues. I may examine this in a future article.
  • It would be interesting to delve deeper into the Trueskill rating system and calculate the rankings of individual players on teams throughout each season. This could give information about the most dominant and valuable players across history.
  • Removing the Spurs from the 2015–16 season — leaving the Warriors as the only dominant team — could yield interesting results. Do multiple dominant teams prop up each other’s ratings, or would they fare better in a league with no one close to them? How do the different rating systems behave?
  • And for kicks it would be funny to see if the 1986–87 Clippers really are the worst team ever. Those lines are just trending down into sadness.

--


I like exploring the interface between software, math, and data science. Applying that to sports and other topics is fascinating. www.stevencipriano.com