NBA Salary Prediction

Predicting NBA player salary with ML/DL models trained with years of stats and salary data.

May 2025

ML, DL

GitHub Kaggle

Tools

Python, scikit-learn, PyTorch, XGBoost, Pandas, Matplotlib

Overview

Objective: Accurately predict 2025–26 NBA player salaries using models trained on players' traditional per-game statistics (points, assists, rebounds per game, etc.) and salary data.

Two data scopes: single-season (2024–25) data for the smaller Linear Regression and Random Forest models, and 15 seasons of data (2010–2025) to feed the deep learning models.

Data & Preparation

Collected player stats and salary data from Basketball Reference. Where extra data was needed, it was scraped from other sites that provide NBA-related data.

The data prep notebook (in the GitHub repository) includes code for data scraping, cleaning, joins, type fixes, and standardization, and outputs CSVs ready for modeling.
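The cleaning, type-fix, and join steps might look roughly like the sketch below. It is not the notebook's actual code: the column names, the placeholder player names, and the string formats (a `$`-prefixed salary with thousands separators) are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical raw tables standing in for scraped stats and salary pages.
stats = pd.DataFrame({
    "Player": ["Player A ", "Player B", "Player C"],  # note the stray whitespace
    "PTS": ["20.1", "8.4", "15.0"],                   # numbers scraped as strings
})
salaries = pd.DataFrame({
    "Player": ["Player A", "Player B"],
    "Salary": ["$30,000,000", "$2,500,000"],
})

# Cleaning: trim names so the join keys match.
stats["Player"] = stats["Player"].str.strip()

# Type fixes: convert stat strings to floats and salary strings to integers.
stats["PTS"] = pd.to_numeric(stats["PTS"])
salaries["Salary"] = (
    salaries["Salary"].str.replace(r"[$,]", "", regex=True).astype("int64")
)

# Join: keep only players that have both stats and a salary.
merged = stats.merge(salaries, on="Player", how="inner")
print(merged)
```

An inner join silently drops players without a salary row (Player C here), so checking the row count before and after the merge is a cheap sanity test.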

Feature Selection

Reduced multicollinearity using the variance inflation factor (VIF) and removed low-signal features. After this filtering, the Linear Regression model (with salary as the dependent variable) used a compact set of stats: PTS, AST, REB, STL, BLK, Age.

The Random Forest and Deep Learning models consumed the entire set of stats, with the variables scaled.
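As a rough illustration of VIF-based filtering, the sketch below computes each feature's VIF as 1 / (1 - R²), where R² comes from regressing that feature on the others, and repeatedly drops the worst offender. The threshold of 10 is a common rule of thumb rather than a value taken from the project, and the data is synthetic: `FGM` is deliberately constructed as a near-copy of `PTS`.

```python
import numpy as np

def vif_scores(X):
    """Return one VIF per column of X (rows = players, columns = features)."""
    X = np.asarray(X, dtype=float)
    # VIF_j = 1 / (1 - R_j^2), which equals the j-th diagonal entry of the
    # inverse of the feature correlation matrix.
    corr = np.corrcoef(X, rowvar=False)
    return np.diag(np.linalg.inv(corr))

def drop_high_vif(X, names, threshold=10.0):
    """Drop the feature with the largest VIF until all VIFs fall below threshold."""
    X = np.asarray(X, dtype=float)
    names = list(names)
    while X.shape[1] > 1:
        scores = vif_scores(X)
        worst = int(np.argmax(scores))
        if scores[worst] < threshold:
            break
        X = np.delete(X, worst, axis=1)
        names.pop(worst)
    return names

# Synthetic stats (assumption): FGM is almost a linear function of PTS.
rng = np.random.default_rng(0)
pts = rng.normal(15, 6, 300)
ast = rng.normal(4, 2, 300)
fgm = 0.4 * pts + rng.normal(0, 0.2, 300)

kept = drop_high_vif(np.column_stack([pts, ast, fgm]), ["PTS", "AST", "FGM"])
print(kept)
```

Note that the VIF itself depends only on the predictors; salary enters only when the filtered features are used to fit the regression.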

Models & Training

Linear Regression baseline trained on 2024–25 data only to minimize multicollinearity.

Random Forest was compared against GBM, XGBoost, and Extra Trees models; RF outperformed them even before heavy tuning, then improved further with hyperparameter tuning.
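A comparison of this kind can be sketched with cross-validated RMSE, as below. This is an illustrative setup rather than the project's notebook: the data is synthetic, the estimator settings are arbitrary, and XGBoost is left out only to keep the example dependent on scikit-learn alone (an `xgboost.XGBRegressor` could be added to the dictionary in the same way).

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import (
    ExtraTreesRegressor,
    GradientBoostingRegressor,
    RandomForestRegressor,
)
from sklearn.model_selection import cross_val_score

# Synthetic regression data standing in for the stats/salary table.
X, y = make_regression(n_samples=200, n_features=8, noise=10.0, random_state=0)

candidates = {
    "RandomForest": RandomForestRegressor(n_estimators=50, random_state=0),
    "GBM": GradientBoostingRegressor(random_state=0),
    "ExtraTrees": ExtraTreesRegressor(n_estimators=50, random_state=0),
}

cv_rmse = {}
for name, model in candidates.items():
    # scikit-learn reports negated RMSE so that "higher is better"; flip the sign.
    scores = cross_val_score(
        model, X, y, cv=5, scoring="neg_root_mean_squared_error"
    )
    cv_rmse[name] = -scores.mean()

# Print models from lowest to highest cross-validated RMSE.
for name, rmse in sorted(cv_rmse.items(), key=lambda kv: kv[1]):
    print(f"{name}: CV RMSE = {rmse:.2f}")
```

The ranking on synthetic data says nothing about the ranking on salary data; the point is only the shape of the comparison loop.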

Deep Learning model (PyTorch) used a fully-connected network (50 epochs, standard scaling).
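A minimal sketch of such a network follows. Only the fully-connected design, the 50 epochs, and the standard scaling come from the description above; the layer widths, the Adam optimizer, the learning rate, full-batch training, and the synthetic data are all assumptions.

```python
import torch
from torch import nn

torch.manual_seed(0)

# Synthetic stand-in for the stats matrix and salary target (assumption).
n_players, n_features = 256, 10
X = torch.randn(n_players, n_features) * 5 + 10
true_w = torch.randn(n_features, 1)
y = X @ true_w + torch.randn(n_players, 1)

# Standard scaling: zero mean and unit variance per feature (and for the target).
X_scaled = (X - X.mean(dim=0)) / X.std(dim=0)
y_scaled = (y - y.mean()) / y.std()

# Fully-connected regressor; the hidden sizes 64 and 32 are illustrative choices.
model = nn.Sequential(
    nn.Linear(n_features, 64), nn.ReLU(),
    nn.Linear(64, 32), nn.ReLU(),
    nn.Linear(32, 1),
)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-2)
loss_fn = nn.MSELoss()

losses = []
for epoch in range(50):  # 50 epochs, as in the description
    optimizer.zero_grad()
    loss = loss_fn(model(X_scaled), y_scaled)
    loss.backward()
    optimizer.step()
    losses.append(loss.item())

print(f"first epoch MSE: {losses[0]:.3f}, last epoch MSE: {losses[-1]:.3f}")
```

With a real train/test split, the scaling statistics would normally be computed on the training set only and then reused for the test set, so that no test information leaks into training.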

Evaluation & Results

Metrics: RMSE (root mean squared error, the typical size of a prediction error in dollars) and R² (the share of salary variance explained by the model).

Linear Regression model — RMSE ≈ $9.24M, R² ≈ 0.526

Random Forest model — RMSE ≈ $4.20M, R² ≈ 0.744

Deep Learning model — RMSE ≈ $4.93M, R² ≈ 0.650 (±0.02 between runs)

Conclusion: Random Forest provides the best overall evaluation scores while remaining relatively interpretable through feature importance.
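For reference, the two metrics reported above can be computed as in the sketch below. The salary values (in millions of dollars) are made-up numbers chosen to make the arithmetic easy to follow, not project results.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error, in the same units as the target."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return float(1.0 - ss_res / ss_tot)

# Illustrative values only, in $M.
actual = [10.0, 20.0, 30.0, 40.0]
predicted = [12.0, 18.0, 33.0, 37.0]

print(f"RMSE = {rmse(actual, predicted):.3f}")  # sqrt(26 / 4)
print(f"R^2  = {r2(actual, predicted):.3f}")    # 1 - 26 / 500
```

Because RMSE squares the errors before averaging, a few large misses on max-contract players weigh more heavily than many small misses on minimum-salary players.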

Sample Predictions & Reproducibility

Python 3.8+ with pandas, numpy, matplotlib, seaborn, scikit-learn, xgboost, plotly, torch.

Run instructions: download `data/` and `notebooks/` from the GitHub repository, open the notebook for your preferred model (LR/RF/DL), and execute the final Prediction Function cell.

Each notebook includes a prediction function to query by player name; example narrative outputs for Harden, LeBron, and Ty Jerome are shown in the figure.
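A prediction function that queries by player name could be structured roughly as follows. This is a hypothetical sketch, not the notebooks' implementation: the function name `predict_salary`, the output wording, the placeholder player names, and the randomly generated stats and salaries are all assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

FEATURES = ["PTS", "AST", "REB", "STL", "BLK", "Age"]

# Placeholder table with random values; not real player data.
rng = np.random.default_rng(0)
n = 20
stats = pd.DataFrame(rng.uniform(0, 30, size=(n, len(FEATURES))), columns=FEATURES)
stats.insert(0, "Player", [f"Player {i}" for i in range(n)])
salary = rng.uniform(1e6, 5e7, size=n)

model = LinearRegression().fit(stats[FEATURES], salary)

def predict_salary(name, stats=stats, model=model):
    """Look up a player by (case-insensitive) name and describe the prediction."""
    match = stats.loc[stats["Player"].str.lower() == name.lower(), FEATURES]
    if match.empty:
        raise KeyError(f"No player named {name!r}")
    # Use the first matching row if the name appears more than once.
    predicted = float(model.predict(match.iloc[[0]])[0])
    return f"{name}: predicted salary ${predicted / 1e6:.2f}M"

print(predict_salary("Player 3"))
```

Raising an error for an unknown name, rather than returning a default, makes typos in the query visible immediately.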