I’m not a sports person. I’ve tried! I like going to games and can totally get wrapped up in the excitement, but I’ve just never felt the loyalty to a team that so many fans have.
My five-year-old just started hockey (well, skating lessons), and I thought it would be fun to “Moneyball” hockey. First question: is it a strong link (one player matters) or a full chain (all the players matter)?
For each team in the 2024 season, I pull the stats of the top player and an average across the rest of the team, to see which is more strongly correlated with net goals (goals scored minus goals allowed).
Scope
I’m going to pull in a complete season’s data and compare the player who scores the most goals against the average for the rest of the team, using net goals as the success metric. I’ll then weigh the correlations and their significance against one another to see which contributes more to a team’s success: the top player or the whole team (minus the top player).
Assumptions
Net goals are driven in part by goals scored, which both the team average and the top player feed into, but this analysis ignores other factors. For example, a really strong goalie could prevent more goals, but goalies aren’t included. In other words, this is really only looking at the offensive side of the hockey… field? rink? court?
Methodology
I found a really cool website called MoneyPuck that already tracks a lot of the stats and offers simple CSV exports. I grabbed the skater and team-level datasets and went to work.
- Import Datasets from MoneyPuck
- Select Stats
- Order by Season and Teams
- Pull Top Player Stats per Team
- Pull Average Stats per Team
- Pull Success Metric
- Correlate Top Player and Average Player Against Success Metric
Import Datasets from MoneyPuck
We are keeping it pretty simple, with just two imports: pandas and SciPy.
import pandas as pd
from scipy.stats import pearsonr
I then downloaded the player and team CSVs from MoneyPuck and loaded them into two separate DataFrames.
## Load and preprocess data
skater_df = pd.read_csv('skaters.csv')
team_df = pd.read_csv('teams.csv')
There is a TON of data here that I had to learn about. Luckily, there was a handy directory. First, I had to isolate the rows where “situation” is “all”, using a filter.
## Filter to only include 'all' situation
team_df = team_df[team_df["situation"] == "all"]
skater_df = skater_df[skater_df["situation"] == "all"]
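To see why the filter matters, here is a minimal sketch on made-up rows. It assumes, as the filter above implies, that each player has one row per "situation" and that the "all" row holds the season total; the player names and numbers are invented for illustration.

```python
import pandas as pd

# Toy rows (invented numbers): one row per player per situation.
toy = pd.DataFrame({
    "name": ["Player A", "Player A", "Player B", "Player B"],
    "situation": ["all", "5on5", "all", "5on5"],
    "I_F_goals": [30, 22, 12, 10],
})

# Boolean mask: keep only the rows where situation is "all".
toy_all = toy[toy["situation"] == "all"]

print(len(toy), "rows before,", len(toy_all), "rows after")
# Summing the filtered rows avoids counting the same goals twice.
print(toy_all["I_F_goals"].sum())
```

Without the filter, any sum or max over a player's rows would mix season totals with situation-level subtotals.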
Then I kept only the stats I needed for the analysis and sorted by season and team, to help clean up the DataFrames.
## Sort dataframes
skater_df_sorted = skater_df[["season","name","team",'position','games_played','icetime','I_F_faceOffsWon',"I_F_goals"]]
skater_df_sorted = skater_df_sorted.sort_values(by=["season","team"], ascending=[True,False])
team_df_sorted = team_df[["season","team","goalsFor","goalsAgainst"]]
team_df_sorted = team_df_sorted.sort_values(by=["season","team"], ascending=[True,False])
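Now I need the success metric! Since I have “goalsFor” and “goalsAgainst”, I can create a new column for net goals. Then I’m going to normalize it into a z-score (subtract the mean, divide by the standard deviation), so each team’s result is expressed in standard deviations from the average team.
## Calculate goal net and normalized net for teams, creating the success metric
team_df_sorted["goal_net"] = team_df_sorted["goalsFor"] - team_df_sorted["goalsAgainst"]
team_df_sorted["net_normalized"] = (team_df_sorted["goal_net"] - team_df_sorted["goal_net"].mean()) / team_df_sorted["goal_net"].std()
team_df_sorted.sort_values(by=["season","goal_net"], ascending=[True,False], inplace=True)
Next, for each team and season, I pull the top scorer’s goal total and the average for everyone else.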
## Update skater dataframe to get best player and other player average per team per season
skater_team_grouped = skater_df_sorted.groupby(['season','team']).agg({
    'I_F_goals': [
        ('other player average', lambda x: x[x != x.max()].mean()),
        ('best player', 'max'),
    ]
}).reset_index()
skater_team_grouped.columns = ['season', 'team', 'other player average', 'best player',]
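One thing worth knowing about the lambda above: `x[x != x.max()]` drops every player tied for the team lead, not just one. The sketch below uses invented numbers to show that behavior next to a hypothetical alternative (`others_mean_drop_one`, my own name, not part of the original analysis) that removes only a single top scorer.

```python
import pandas as pd

# Toy data (invented): team AAA has two players tied for the lead.
toy = pd.DataFrame({
    "season": [2024] * 6,
    "team": ["AAA", "AAA", "AAA", "BBB", "BBB", "BBB"],
    "I_F_goals": [30, 30, 6, 40, 10, 4],
})

def others_mean_drop_ties(goals):
    # Same logic as the lambda above: every player tied for the max is dropped.
    return goals[goals != goals.max()].mean()

def others_mean_drop_one(goals):
    # Hypothetical alternative: drop only the first top scorer, keep any ties.
    return goals.drop(goals.idxmax()).mean()

grouped = toy.groupby(["season", "team"])["I_F_goals"]
summary = pd.DataFrame({
    "best player": grouped.max(),
    "others (ties dropped)": grouped.agg(others_mean_drop_ties),
    "others (one dropped)": grouped.agg(others_mean_drop_one),
}).reset_index()

print(summary)
```

For a team with no tie at the top, the two versions agree. Which one is "right" depends on whether a co-leader should count as part of the supporting cast.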
Now we’re going to create a new DataFrame by merging the team DataFrame (which holds the success metric) with the skater DataFrame. The season and team columns should contain the same values in both DataFrames, so we’re going to merge on those columns.
## Merge dataframes
merged_df = pd.merge(team_df_sorted, skater_team_grouped, on=['season', 'team'])
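By default `pd.merge` does an inner join, so a team that is missing from either DataFrame would quietly drop out. As a sanity check, here is a small sketch on invented frames that uses the `indicator` and `validate` options of `pd.merge` to surface unmatched or duplicated season/team pairs. The team codes and numbers are made up.

```python
import pandas as pd

# Toy frames (invented): team CCC is missing from the skater summary.
teams = pd.DataFrame({
    "season": [2024, 2024, 2024],
    "team": ["AAA", "BBB", "CCC"],
    "goal_net": [25, -10, 3],
})
skaters = pd.DataFrame({
    "season": [2024, 2024],
    "team": ["AAA", "BBB"],
    "best player": [30, 40],
})

# how="outer" keeps unmatched rows, indicator=True adds a "_merge" column,
# and validate="one_to_one" raises if a season/team pair is duplicated.
check = pd.merge(teams, skaters, on=["season", "team"],
                 how="outer", indicator=True, validate="one_to_one")

unmatched = check[check["_merge"] != "both"]
print(unmatched[["season", "team", "_merge"]])
```

If `unmatched` comes back empty on the real data, the inner merge is not losing any teams.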
Finally, I’m going to run the correlation tests. I’m going to use the pearsonr function, which returns two values: the correlation coefficient and the p-value. The coefficient ranges from -1 to 1; the closer it is to 1, the stronger the positive relationship. For the p-value, anything under .05 is conventionally treated as statistically significant.
## Calculate Pearson correlation coefficients
corr_best, p_val_best = pearsonr(merged_df['best player'], merged_df['net_normalized'])
corr_other, p_val_other = pearsonr(merged_df['other player average'], merged_df['net_normalized'])
print(f"Pearson correlation between best player goals and team success: {corr_best:.4f} (p-value: {p_val_best:.4f})")
print(f"Pearson correlation between other player average goals and team success: {corr_other:.4f} (p-value: {p_val_other:.4f})")
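A side note on the normalized metric: Pearson's correlation is unchanged by shifting or rescaling a variable, so correlating against `net_normalized` gives the same coefficient as correlating against the raw `goal_net`. The sketch below checks this on five invented teams; the numbers are for illustration only.

```python
import pandas as pd
from scipy.stats import pearsonr

# Invented numbers for five teams, for illustration only.
toy = pd.DataFrame({
    "best player": [45, 38, 30, 27, 22],
    "goal_net": [40, 5, 12, -20, -37],
})

# Same z-score normalization used for the success metric.
toy["net_normalized"] = (toy["goal_net"] - toy["goal_net"].mean()) / toy["goal_net"].std()

r_raw, p_raw = pearsonr(toy["best player"], toy["goal_net"])
r_norm, p_norm = pearsonr(toy["best player"], toy["net_normalized"])

print(f"raw:        r={r_raw:.4f}, p={p_raw:.4f}")
print(f"normalized: r={r_norm:.4f}, p={p_norm:.4f}")
```

So the normalization mainly helps with plotting and reading the metric; it does not change the correlation results.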
Results
Both are significantly positively correlated, so both matter. Overall, in the 2024 hockey season, there is a 0.7636 correlation (p-value < 0.0001) between the goal scoring of the “non-best players” and the team’s net-goal differential. Yet the strongest player undoubtedly has an impact, with a 0.4432 correlation (p-value of 0.0110).
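To get a feel for how precise those two coefficients are, here is a rough sketch of a 95% confidence interval using the Fisher z-transform. It is not part of the original analysis. The helper name `pearson_ci` is my own, and the sample size of 32 is an assumption (one row per NHL team for a single season), so adjust it to the actual number of rows in merged_df.

```python
import math

def pearson_ci(r, n, z_crit=1.96):
    """Approximate 95% confidence interval for Pearson's r (Fisher z-transform)."""
    z = math.atanh(r)              # Fisher transform of r
    se = 1.0 / math.sqrt(n - 3)    # approximate standard error of z
    return math.tanh(z - z_crit * se), math.tanh(z + z_crit * se)

N_TEAMS = 32  # assumption: one row per NHL team for a single season

lo_best, hi_best = pearson_ci(0.4432, N_TEAMS)
lo_other, hi_other = pearson_ci(0.7636, N_TEAMS)

print(f"best player:   r=0.4432, 95% CI ({lo_best:.2f}, {hi_best:.2f})")
print(f"other players: r=0.7636, 95% CI ({lo_other:.2f}, {hi_other:.2f})")
```

With only a few dozen teams the intervals are fairly wide, so I would treat the gap between the two correlations as suggestive rather than settled.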


This brings up a few more interesting questions. Is this isolated to 2024? What about the years of that Crosby guy? How does this sport compare to the NBA? MLB?
Stay Curious!
Source
MoneyPuck.com. Download Data. n.d. Retrieved January 20, 2026. https://moneypuck.com/data.htm.
Full Code Dump
import pandas as pd
from scipy.stats import pearsonr
import matplotlib.pyplot as plt
## Load and preprocess data
skater_df = pd.read_csv('skaters.csv')
team_df = pd.read_csv('teams.csv')
## Filter to only include 'all' situation
team_df = team_df[team_df["situation"] == "all"]
skater_df = skater_df[skater_df["situation"] == "all"]
## Sort dataframes
skater_df_sorted = skater_df[["season","name","team",'position','games_played','icetime','I_F_faceOffsWon',"I_F_goals"]]
skater_df_sorted = skater_df_sorted.sort_values(by=["season","team"], ascending=[True,False])
team_df_sorted = team_df[["season","team","goalsFor","goalsAgainst"]]
team_df_sorted = team_df_sorted.sort_values(by=["season","team"], ascending=[True,False])
## Calculate goal net and normalized net for teams, creating the success metric
team_df_sorted["goal_net"] = team_df_sorted["goalsFor"] - team_df_sorted["goalsAgainst"]
team_df_sorted["net_normalized"] = (team_df_sorted["goal_net"] - team_df_sorted["goal_net"].mean()) / team_df_sorted["goal_net"].std()
team_df_sorted.sort_values(by=["season","goal_net"], ascending=[True,False], inplace=True)
## Update skater dataframe to get best player and other player average per team per season
skater_team_grouped = skater_df_sorted.groupby(['season','team']).agg({
    'I_F_goals': [
        ('other player average', lambda x: x[x != x.max()].mean()),
        ('best player', 'max'),
    ]
}).reset_index()
skater_team_grouped.columns = ['season', 'team', 'other player average', 'best player',]
## Merge dataframes
merged_df = pd.merge(team_df_sorted, skater_team_grouped, on=['season', 'team'])
## Calculate Pearson correlation coefficients
corr_best, p_val_best = pearsonr(merged_df['best player'], merged_df['net_normalized'])
corr_other, p_val_other = pearsonr(merged_df['other player average'], merged_df['net_normalized'])
print(f"Pearson correlation between best player goals and team success: {corr_best:.4f} (p-value: {p_val_best:.4f})")
print(f"Pearson correlation between other player average goals and team success: {corr_other:.4f} (p-value: {p_val_other:.4f})")
fig = merged_df.plot.scatter(x='other player average', y='net_normalized', title='Other Player Average Goals vs Team Success')
fig.set_xlabel('Other Player Average Goals')
fig.set_ylabel('Team Success (Normalized Goal Net)')
fig.figure.savefig('other_player_average_vs_team_success.png')
fig2 = merged_df.plot.scatter(x='best player', y='net_normalized', title='Best Player Goals vs Team Success', color='orange')
fig2.set_xlabel('Best Player Goals')
fig2.set_ylabel('Team Success (Normalized Goal Net)')
fig2.figure.savefig('best_player_vs_team_success.png')
