Let’s make a bet: Leafs win the Stanley Cup?
Inspiration
It’s been over 50 years since the Toronto Maple Leafs last lifted the Stanley Cup. In that time, the NHL has shifted from a game of grit and defending to one of puck possession and skill. Nowadays, hockey stars condition through the offseason and possess skills never before seen in the league. Despite these new talents, hockey still ranks as one of the most unpredictable sports among the professional leagues. In fact, a Vox video explained why skill alone doesn’t make a hockey team more likely to win a game. Researcher Michael Mauboussin describes hockey as the closest of the major sports to randomness, due to a combination of the small sample of games, the small number of scoring chances within a game, and the distribution of ice time across all players.
The video got me thinking: is it possible to quantify luck and skill? Currently, the NHL tracks both straightforward and advanced statistics for teams and players. So, I decided to put my Statistics major and recent Data Science internship to use and predict the outcome of NHL games with better accuracy than flipping a fair coin. After all, I didn’t come to UWaterloo to drink milk! The idea also stemmed from The Athletic and the analytics-driven articles I enjoy reading there.
Method
At a high level, machine learning, a subset of artificial intelligence, is a set of techniques that allows computers to improve their performance on specific tasks by making use of data rather than being programmed explicitly. If that was a mouthful, a helpful example to illustrate this is identifying spam messages. One solution is to manually write rules dictating how each message should be handled based on features deemed important. For instance, if a message contains “inheritance” or “prince,” it gets flagged as spam. Another solution is to leverage data to determine which features identify spam. This is accomplished with a machine learning classifier that automatically learns the features that distinguish spam from legitimate messages.
Similarly, the same set of techniques can be used to predict the outcome of NHL games. Predicting NHL games is a binary classification problem. In other words, given a dataset, we attempt to learn how to consume new instances and determine which class they belong to. For example, given a new message, we conclude whether or not it is spam; given a hockey game, we predict which team will win. Successfully determining an instance’s class requires labeled data. That is, each message is known to be spam or not, and the winner of each hockey game is known. Then, based on the instances and features in our dataset, we try to distinguish between the classes. Formally, this process is called building a classification model. Once we have the model, we test it on new instances to predict the target (i.e. attempt to classify future hockey games by their outcome). Doing this, we can evaluate how often the model predicts games correctly and incorrectly (more on this in Results).
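To make the contrast concrete, here’s a minimal sketch of both approaches; the keyword list, toy messages, and labels are invented for illustration:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Approach 1: a hand-written rule based on features a human deemed important
def rule_based_is_spam(message):
    return any(word in message.lower() for word in ("inheritance", "prince"))

# Approach 2: a classifier learns which features matter from labeled data
messages = ["claim your inheritance now", "meeting at noon",
            "a prince needs your help", "lunch tomorrow?"]
labels = [1, 0, 1, 0]  # 1 = spam, 0 = not spam (toy labels)

vectorizer = CountVectorizer()
clf = MultinomialNB().fit(vectorizer.fit_transform(messages), labels)
print(clf.predict(vectorizer.transform(["your inheritance awaits"])))  # expect spam (1)
```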
For this project, I chose a combination of 15 predictive hockey statistics for my feature set. For each respective team, the features included:
- team id,
- goals for,
- goals against,
- goal differential,
- winning streak,
- division standing,
- conference standing,
- league standing,
- power play success rate,
- penalty kill success rate,
- shot percentage,
- save percentage,
- faceoff win percentage,
- Fenwick close percentage, and
- PDO
For those interested, Fenwick close percentage is arguably one of the best approximations of puck possession. The modern-day NHL is much more of a puck possession game than it was a couple of decades ago, so it stands to reason that the team that has the puck more often will direct more shots at the opponent’s net than the other way around. PDO is the sum of a team’s shooting percentage and save percentage and is considered an adequate measure of luck in a game, since it tends to regress toward 100 over time. According to hockey analytics articles, these statistics correlate more strongly with winning hockey games than most others.
Data Processing
All data used was obtained through the NHL API and NHL.com. I pulled games from the 2005–06 season up to the beginning of the 2018–19 season and excluded pre-season and playoff games. During the pre-season, players are still building chemistry with new teammates and linemates, adapting to new team strategies, and building stamina. Playoffs, on the other hand, pose an unrepresentative population of teams, skewing the distribution of team statistics. Playoff games also carry higher stakes than regular season games (duh!).
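To give a rough idea of what the pull looks like, here’s a minimal sketch against the public NHL stats API; the endpoint and response fields reflect my reading of that API at the time and should be treated as assumptions:

```python
import requests

BASE = "https://statsapi.web.nhl.com/api/v1"

def get_season_games(season="20182019"):
    # gameType=R restricts results to regular-season games,
    # excluding pre-season and playoffs
    resp = requests.get(f"{BASE}/schedule",
                        params={"season": season, "gameType": "R"})
    resp.raise_for_status()
    games = []
    for date in resp.json()["dates"]:
        for game in date["games"]:
            games.append({
                "home": game["teams"]["home"]["team"]["name"],
                "away": game["teams"]["away"]["team"]["name"],
                "home_goals": game["teams"]["home"].get("score"),
                "away_goals": game["teams"]["away"].get("score"),
            })
    return games
```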
Machine Learning
Typically, the data is partitioned into two sets: a training set and a test set that is not used for training. Take, for example, a child learning to differentiate car models. If provided with the same pictures repeatedly, the child would yield unrealistically good results! The child could only classify the car models in those images and would fail miserably with new cars. Instead, the pictures are split into two piles. One pile is given to the child to organize; while the child sorts through that collection of images, they are training a mental classifier on how to recognize car models. Once the first pile is arranged, the child uses what they have learned to label the other, “new” collection of images. From this task, we can test how well the child classifies car models, and providing new images allows the child to build a better classifier.
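In code, the split looks something like this; the data here is a synthetic stand-in for the real feature matrix:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in data: 2000 games x 30 features (15 per team); labels are
# 1 = home team won, 0 = home team lost. In practice these come from
# the NHL data described above.
rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 30))
y = rng.integers(0, 2, size=2000)

# Hold out 20% of games as the "new" pile the model never sees during
# training; stratify preserves the ~50/50 win/loss split in both piles.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
```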
Before building the classifier, it’s good to ensure an even spread of classes. In our case, the classes are 1 for the home team winning and 0 for the home team losing, and the split is roughly 50/50. Next, I checked for missing values; it’s essential to address them since classifiers handle missing values differently. Fortunately, our hockey data has none. Another good check is for correlation between the features, which surfaces any statistics that may be redundant. For example, goal differential, goals for, and goals against are dependent. That said, I opted to keep them, considering raw data sometimes yields more information. One last check is to ensure the dataset size is sufficient. We have 30 predictive features (15 for each team), and the rule of thumb is 50 records per feature, so we need at least 1,500 records in the dataset, which holds. A sketch of these checks follows. Now we can move along and fit our models.
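Here is how those four checks might look with pandas; the DataFrame is a hypothetical stand-in for the real games table:

```python
import numpy as np
import pandas as pd

# Stand-in frame; in practice the 30 feature columns come from the NHL
# data above, plus a "home_win" label (1 = home win, 0 = home loss)
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(2000, 30)),
                  columns=[f"feat_{i}" for i in range(30)])
df["home_win"] = rng.integers(0, 2, size=2000)

# 1. Class balance: expect a roughly 50/50 split
print(df["home_win"].value_counts(normalize=True))

# 2. Missing values: some classifiers cannot handle them at all
print(df.isnull().sum().sum())

# 3. Feature correlation: flags redundant statistics such as goal
#    differential vs. goals for / goals against
corr = df.drop(columns="home_win").corr().abs()

# 4. Dataset size: 50 records per feature -> at least 1,500 rows
assert len(df) >= 50 * 30
```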
Models
I chose to test six classification algorithms.
Logistic Regression
Definition: In this algorithm, the probabilities describing the possible outcomes of a single trial are modelled using a logistic function.
Advantages: Logistic regression is designed for this purpose (classification), and is most useful for understanding the influence of several independent variables on a single outcome variable.
Disadvantages: Works only when the predicted variable is binary, assumes all predictors are independent of each other and assumes data is free of missing values (lucky for us!).
Naive Bayes
Definition: Naive Bayes algorithm is based on Bayes’ theorem with the assumption of independence between every pair of features. Naive Bayes classifiers work well in many real-world situations such as document classification and spam filtering.
Advantages: This algorithm requires a small amount of training data to estimate the necessary parameters. Naive Bayes classifiers are extremely fast compared to more sophisticated methods.
Disadvantages: Naive Bayes is known to be a bad estimator, so its probability outputs should not be taken too seriously.
K-Nearest Neighbours
Definition: Neighbours-based classification is a type of lazy learning, as it does not attempt to construct a general internal model but simply stores instances of the training data. Classification is computed from a simple majority vote of the k nearest neighbours of each point.
Advantages: This algorithm is simple to implement, robust to noisy training data, and effective when the training data is large.
Disadvantages: The value of K needs to be chosen, and the computation cost is high since classifying each new instance requires computing its distance to all the training samples.
Decision Trees
Definition: Given a set of attributes together with its classes, a decision tree produces a sequence of rules that can be used to classify the data.
Advantages: Decision Tree is simple to understand and visualize, requires little data preparation, and can handle both numerical and categorical data.
Disadvantages: Decision trees can create overly complex trees that do not generalize well, and they can be unstable because small variations in the data might result in a completely different tree being generated.
Random Forest
Definition: A random forest classifier is a meta-estimator that fits a number of decision trees on various sub-samples of the dataset and averages their predictions to improve accuracy and control over-fitting. Each sub-sample is the same size as the original input sample, but the samples are drawn with replacement.
Advantages: Reduced over-fitting, and a random forest classifier is more accurate than a single decision tree in most cases.
Disadvantages: Slow for real-time prediction, more challenging to implement, and a more sophisticated algorithm than a single decision tree.
Support Vector Machine
Definition: A support vector machine represents the training data as points in space, separated into categories by a clear gap that is as wide as possible. New examples are then mapped into that same space and predicted to belong to a class based on which side of the gap they fall.
Advantages: Effective in high dimensional spaces and uses a subset of training points in the decision function, so it is also memory efficient.
Disadvantages: The algorithm does not directly provide probability estimates, these are calculated using an expensive five-fold cross-validation.
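Fitting all six is only a few lines with scikit-learn. Here is a sketch reusing the train/test split from earlier; the hyperparameters shown are illustrative defaults, not the tuned values:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, f1_score

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Naive Bayes": GaussianNB(),
    "KNN": KNeighborsClassifier(n_neighbors=5),
    "Decision Tree": DecisionTreeClassifier(random_state=42),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=42),
    "SVM": SVC(random_state=42),
}

# Fit each classifier on the training pile, then score it on the held-out pile
for name, model in models.items():
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    print(f"{name}: accuracy={accuracy_score(y_test, pred):.2f}, "
          f"F1={f1_score(y_test, pred):.2f}")
```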
Results
There are two main ways we can measure how well our models performed: accuracy and F1-Score. To explain these concepts, I’ll use the SeeFood (“hot dog” vs “not hot dog”) classifier Jian Yang built on HBO’s Silicon Valley.
Accuracy: (True Positive + True Negative) / (True Positive + False Positive + True Negative + False Negative)
- Accuracy is the ratio of correctly predicted observations to the total observations. It’s the most intuitive performance measure.
- True negatives: the “hot dog” vs “not hot dog” image classifier correctly classified the image of a car as not being a “hot dog.”
- False negatives: the “hot dog” vs “not hot dog” image classifier incorrectly classified an image of a messed up “hot dog” as not being a “hot dog.”
- True positives: the “hot dog” vs “not hot dog” classifier correctly classifies a “hot dog” as being a “hot dog.”
- False positives: the “hot dog” vs “not hot dog” classifier incorrectly classifies a hamburger as being a “hot dog.”
F1-Score: (2 x Precision x Recall) / (Precision + Recall)
- F1-Score is the harmonic mean of Precision and Recall. Therefore, this score takes both false positives and false negatives into account. It’s usually more useful than accuracy, especially if we have an uneven class distribution.
- High recall, low precision: the classifier thinks a lot of things are “hot dogs”: legs on beaches, fries, and whatnot. It catches most of the real “hot dogs,” but many of the images it flags as “hot dogs” are actually “not hot dogs.”
- Low recall, high precision: the classifier is very picky and does not think many things are “hot dogs.” Everything it believes is a “hot dog” really is a “hot dog,” but because it is so picky, it also misses a lot of actual “hot dogs.”
- High recall, high precision: the classifier is excellent: it is very picky, but still gets almost all of the images of “hot dogs” correct. We are happy!
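Both metrics fall out of the confusion matrix. Here is a sketch using the logistic regression model from the earlier snippet:

```python
from sklearn.metrics import confusion_matrix

# Predictions from the logistic regression fit in the earlier sketch
pred = models["Logistic Regression"].predict(X_test)
tn, fp, fn, tp = confusion_matrix(y_test, pred).ravel()

accuracy = (tp + tn) / (tp + fp + tn + fn)
precision = tp / (tp + fp)  # of predicted home wins, how many were real wins
recall = tp / (tp + fn)     # of real home wins, how many did we catch
f1 = 2 * precision * recall / (precision + recall)
```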
After you’ve ingested the above, here’s a quick gander at how each classifier performed.
- Logistic Regression achieved the highest F1-Score of 67% with an accuracy of 67%
- Naive Bayes achieved an F1-Score of 65% with a matching accuracy
- KNN had a 58% F1-Score and accuracy score
- Decision Trees had an F1-Score of 62% with 58% accuracy
- Random Forest yielded a 65% F1-Score and 64% accuracy score
- SVM received a miserable F1-Score of 41% with an accuracy score of 56%.
And some more digestible statistics.
- Logistic Regression:
When home team wins, classifier predicts they win 71.62% of the time
When home team loses, classifier predicts they win 39.40% of the time
When home team loses, classifier predicts they lose 60.60% of the time
When classifier predicts home team wins, home team actually wins 70.07% of the time
When classifier predicts home team loses, home team actually loses 62.38% of the time
- Naive Bayes
When home team wins, classifier predicts they win 64.50% of the time
When home team loses, classifier predicts they win 34.49% of the time
When home team loses, classifier predicts they lose 65.51% of the time
When classifier predicts home team wins, home team actually wins 70.66% of the time
When classifier predicts home team loses, home team actually loses 58.89% of the time
- KNN
When home team wins, classifier predicts they win 61.30% of the time
When home team loses, classifier predicts they win 47.31% of the time
When home team loses, classifier predicts they lose 52.69% of the time
When classifier predicts home team wins, home team actually wins 62.53% of the time
When classifier predicts home team loses, home team actually loses 51.39% of the time
- Decision Tree
When home team wins, classifier predicts they win 64.25% of the time
When home team loses, classifier predicts they win 40.19% of the time
When home team loses, classifier predicts they lose 59.81% of the time
When classifier predicts home team wins, home team actually wins 67.31% of the time
When classifier predicts home team loses, home team actually loses 56.50% of the time
- Random Forest
When home team wins, classifier predicts they win 64.37% of the time
When home team loses, classifier predicts they win 35.60% of the time
When home team loses, classifier predicts they lose 64.40% of the time
When classifier predicts home team wins, home team actually wins 69.96% of the time
When classifier predicts home team loses, home team actually loses 58.39% of the time
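Each of those rates is just a conditional probability read off the confusion matrix. Continuing with the counts from the metrics sketch above:

```python
# The "digestible statistics" above, expressed from confusion matrix counts
p_win_given_win   = tp / (tp + fn)  # home team wins -> predicted win (recall)
p_win_given_loss  = fp / (fp + tn)  # home team loses -> predicted win
p_loss_given_loss = tn / (fp + tn)  # home team loses -> predicted loss
p_win_given_pred  = tp / (tp + fp)  # predicted win -> actual win (precision)
p_loss_given_pred = tn / (tn + fn)  # predicted loss -> actual loss
```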
For a more in-depth look at the results, check out the project’s repository on GitHub.
Future Work
When I began this project, my goal was to predict the outcome of an NHL game with a probability higher than 50%, and I achieved that goal. However, I’m sure experienced hockey analysts can predict the result of a hockey game with higher confidence. With that said, there are improvements I can make to my current classifier and additions I can build on top of it.
First, I would optimize the feature set (a quick sketch follows the list below). There are many benefits to performing feature selection before modeling your data, including:
- Reducing Overfitting: Less redundant data means less opportunity to make decisions based on noise.
- Improving Accuracy: Less misleading data means modeling accuracy improves.
- Reducing Training Time: Less data means that algorithms train faster.
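As one possible approach, recursive feature elimination repeatedly drops the weakest features; the choice of estimator and the target of 10 features here are arbitrary:

```python
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# Keep the 10 features that matter most to a logistic regression fit;
# 10 is an illustrative target, not a recommendation
selector = RFE(LogisticRegression(max_iter=1000), n_features_to_select=10)
selector.fit(X_train, y_train)
print(selector.support_)  # boolean mask over the 30 original features
```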
Other improvements I could incorporate are more advanced statistics and qualitative features. For example, I could include raw data from a team’s last ten games to evaluate how well they have played recently. Using only a team’s previous ten-game record (e.g. 7–3–0), we neglect valuable information about how the team actually performed. Take, for instance, a team that played well but lost in overtime, a team that lost against a strong opponent, or a team that beat a bad team thanks to a lucky bounce.
The last improvement, which I am currently working on, is incorporating sports betting websites’ predictions. For those interested, one such website compares its performance to Vegas sportsbooks over an entire season using the log loss metric. These websites make a great baseline to bet against, and the factors they model allow them to perform close to Vegas. Those factors include end-of-season tanking, advanced injury analysis (offensive/defensive impact), individual performance, line chemistry, and weather. Including these statistics should garner a greater F1-Score, and maybe a guest appearance on Hockey Night in Canada’s intermission panel, air time on Prime Time Sports with Bob McCown, or RA’s Gambling Corner on Spittin’ Chiclets.
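For reference, log loss heavily penalizes confident but wrong predictions, and lower is better. A toy computation (the outcomes and probabilities are made up):

```python
import numpy as np
from sklearn.metrics import log_loss

y_true = np.array([1, 0, 1, 1])          # actual outcomes (1 = home win)
y_prob = np.array([0.8, 0.3, 0.6, 0.9])  # predicted home-win probabilities
print(log_loss(y_true, y_prob))          # ~0.30
```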
Eventually, I would like to predict prospects and their success in the NHL. Predicting prospects, however, is complex, since many factors play a role in their progress, including their performance, where they played, who they played with, who coached them, and when they were born. Furthermore, we have to define what a successful prospect is. Are strong prospects those who get drafted? Outliers include Martin St. Louis, who wasn’t drafted yet led a legendary career, and players who were drafted but never played an NHL game. Do successful NHL players have to play a specific number of NHL games, score a minimum number of goals, or sign a multi-year, multi-million dollar contract? Without a doubt, predicting prospects the way scouts do is challenging, considering we don’t possess the expertise they have acquired over their tenure.
GO LEAFS GO!
Thanks to those who have read this far or to those who scrolled to the bottom of the article to see how long it is. Peace in the east!
A special shoutout to Analytics India Magazine and to @klintcho for their helpful explanations.