# The Diggs Equation — Forecasting Josh Allen Passes and Stefon Diggs Catches

In the previous article in this series, we explored how to use data, data science, and formulas to model a problem that first came to our attention through a difference of opinion and a visualization.

The goal was to improve the approach, or at least understand it, so we worked through it in the last article, “The Diggs Equation — Will Josh Allen Pass to Stefon Diggs?”. We found that both the naive and the simulated probability were larger than the actual forecasted percentage, and now we will explore how that result could be reevaluated.

As we move through this next stage, keep in mind that continuously calculating these statistical relationships for every player, for thousands of users, would make for a very expensive predictive system. New advancements in machine learning, such as foundation models, could allow for a stateless model in which the algorithm for one relationship is reused to predict the same type of relationship across all QB (Quarterback) to WR (Wide Receiver) pairings.

# Sampling the existing data

To train a machine learning model, we need to take the data extracted in the first article and either structure it as a simulation of the throws or use the percentage outcomes of the throws as parameters for a model.

We can start from one of the Scikit Learn Python examples, adjust the training data, and plot a Gaussian Process Regression (GPR), which should show the outcome of each throw and its regression toward the mean outcome.

In preparation for this model, we need to create a simulated sample of data that reflects the historical data. We can look back at the three (3) columns that make up our throw and catch probability: Stefon’s **TGTS (Receiving Targets)** and **REC (Receptions)**, and Josh’s **ATT (Passing Attempts)**. We said that **REC/TGTS**, or 387/550, gave Stefon a 70% probability of catching the ball, and that **TGTS/ATT**, or 550/1990, gave Stefon a 28% chance of receiving the ball.
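The two percentages above can be reproduced with a few lines of arithmetic. This is a minimal sketch using only the totals quoted in the article; the constant and function names are illustrative, not taken from the original code.

```python
# Career totals quoted in the article (2020-2023 seasons).
TARGETS = 550      # Stefon Diggs TGTS
RECEPTIONS = 387   # Stefon Diggs REC
ATTEMPTS = 1990    # Josh Allen ATT


def rate(part: int, whole: int) -> float:
    """Return part / whole as a fraction between 0 and 1."""
    return part / whole


catch_rate = rate(RECEPTIONS, TARGETS)    # REC / TGTS
target_share = rate(TARGETS, ATTEMPTS)    # TGTS / ATT

print(f"Catch rate:   {catch_rate:.1%}")    # 70.4%
print(f"Target share: {target_share:.1%}")  # 27.6%
```

Rounded to whole percentages, these are the 70% and 28% figures used throughout the rest of the article.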

It may be time-consuming and computationally expensive to create this data iteratively, with each of Josh’s throws as a row aligned with Stefon’s catches, especially if the data set scales beyond the 1,990 passes in this single example.

What we can do instead, as a training sample, is create a table of outcomes. We take the 28% (.28) of passes going to Stefon and divide it by the 70% (.7) of passes caught by Stefon, which leaves a 40% (4/10) catch percentage. In a ten-row table, this means four (4) rows are catches and six (6) rows are drops. This condensed sample of the historical record is the table we use to begin training the model.
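The ten-row table can also be generated rather than typed by hand. The sketch below assumes the article's working figure of .28 / .7 = .4 and a sample size of ten; `build_sample` is a hypothetical helper, not part of the original script.

```python
import pandas as pd

TARGET_SHARE = 0.28  # TGTS / ATT, rounded as in the article
CATCH_RATE = 0.70    # REC / TGTS, rounded as in the article
SAMPLE_SIZE = 10     # number of rows in the training sample


def build_sample(catch_fraction: float, size: int) -> pd.DataFrame:
    """Build a Pass/Outcome table with catches (1) first, then drops (0)."""
    catches = round(catch_fraction * size)
    outcomes = [1] * catches + [0] * (size - catches)
    return pd.DataFrame({"Pass": range(1, size + 1), "Outcome": outcomes})


# The article's working figure: 0.28 / 0.70 = 0.40.
catch_fraction = TARGET_SHARE / CATCH_RATE
sample = build_sample(catch_fraction, SAMPLE_SIZE)

# to_csv with no path returns the CSV text instead of writing a file.
print(sample.to_csv(index=False))
```

Passing a file name such as `historical-sample-data.csv` to `to_csv` would write the table to disk instead of printing it.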

# Plotting the sampled data

From our existing analysis we create the table and save it as a CSV file. Two labeled columns hold the ten rows: ‘Pass’ is a number that increments by one for each row, and ‘Outcome’ is a binary one (1) for a catch or a binary zero (0) for a drop.

**Example CSV (MS-DOS) file:**

```csv
Pass,Outcome
1,1
2,1
3,1
4,1
5,0
6,0
7,0
8,0
9,0
10,0
```

Working with this table in Python requires Pandas, a package for data analysis best known for its data frame (df), which is essentially an in-memory table. We install Pandas with `pip install pandas`, add our CSV file to the root of the project, and then create a script named `model.py` with the following lines to print the CSV.

**Data Frame Import Script:**

```python
import pandas as pd

# Import Historical Sample Data CSV
df = pd.read_csv("historical-sample-data.csv")
print(df)
```

Now run `python model.py` to test our script's ability to import that CSV file into a data frame. Once it prints, we should see a mirror of the above CSV file with the data frame's index added to each row. Now that we have confirmed the CSV file is readable, we can begin to follow along with the Scikit Learn example and adjust it to fit our use case.

We take the data frame, select the columns “Pass” and “Outcome”, and set the “Pass” column as the index. We can also check the min and max of that index to see the range of passes covered by the data set.

*Add the following code block to the existing script to continue the development of this model.*

**Pass Index Script — Addition:**

```python
# Create Pass Index
hist_data_samp = df
hist_data_samp = hist_data_samp[["Pass", "Outcome"]].set_index("Pass")
index_range = hist_data_samp.index.min(), hist_data_samp.index.max()
print(index_range)
```

We can run `python model.py` again to test our script's ability to index the data frame and display the range of indices, which is (1, 10). This means all ten entries are accounted for and the header row is not counted as data.
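Beyond the index range, it can be useful to confirm that the sample still encodes the 40% catch percentage. The sketch below inlines the CSV text so it runs without the file; the `CSV_TEXT`, `catch_fraction`, and `row_count` names are assumptions added for illustration.

```python
import io

import pandas as pd

# Inline copy of the ten-row sample so the sketch runs without the CSV file.
CSV_TEXT = "Pass,Outcome\n1,1\n2,1\n3,1\n4,1\n5,0\n6,0\n7,0\n8,0\n9,0\n10,0\n"

df = pd.read_csv(io.StringIO(CSV_TEXT))
hist_data_samp = df[["Pass", "Outcome"]].set_index("Pass")

# Outcome is binary, so its mean is the fraction of passes caught.
catch_fraction = hist_data_samp["Outcome"].mean()
row_count = len(hist_data_samp)

print(f"Rows: {row_count}, catch fraction: {catch_fraction:.2f}")
```

Because the outcomes are ones and zeros, the column mean doubles as the catch percentage, so a value other than 0.40 would point to a typo in the CSV.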

Plotting the data frame requires another package, so we install Matplotlib with `pip install matplotlib` and begin to set up the graph with axes based on our chosen index and binary scoring.

*Add the following code block to the existing script to continue the development of this model.*

**Graph Data Script — Addition:**

```python
import matplotlib.pyplot as plt

# Graph Historical Sample Data
hist_data_samp.plot()
plt.xlabel("Passes Thrown")
plt.ylabel("Passes Caught")
_ = plt.title("Fantasy Football QB/WR - Historical Data")
plt.show()
```

Running `python model.py` again, we can see that Matplotlib has generated a figure, which we can review and save as a PNG image. This confirms that we can chart the sample from the CSV through a data frame and its index. Once you are ready to move on, make sure to comment out the `plt.show()` call so it does not block the rest of the script.
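As an alternative to opening a window and saving by hand, the figure can be written straight to a PNG. This is a sketch under a few assumptions: it uses Matplotlib's non-interactive `Agg` backend, rebuilds the sample inline, and writes to a hypothetical path in the system temp directory.

```python
import os
import tempfile

import matplotlib

matplotlib.use("Agg")  # render off-screen; no window is opened
import matplotlib.pyplot as plt
import pandas as pd

# Ten-row sample rebuilt inline so the sketch runs without the CSV file.
hist_data_samp = pd.DataFrame(
    {"Pass": range(1, 11), "Outcome": [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]}
).set_index("Pass")

ax = hist_data_samp.plot()
ax.set_xlabel("Passes Thrown")
ax.set_ylabel("Passes Caught")
ax.set_title("Fantasy Football QB/WR - Historical Data")

# Hypothetical output location; any writable path works.
png_path = os.path.join(tempfile.gettempdir(), "historical-sample-data.png")
ax.figure.savefig(png_path, dpi=150)
plt.close(ax.figure)
```

Saving with `savefig` never blocks, so nothing needs to be commented out before the script continues.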

# Forecasting the sampled data

From our initial data we generated a representative sample based on the percentage of Josh Allen's passes thrown to Stefon Diggs (28%) and the percentage of those passes caught by Stefon Diggs (70%). Through this lens we will extrapolate the initial forecast to project the outcomes of future passes from the 40% historical catch percentage derived above.

We can follow the Scikit Learn example again and reuse the same kernels to forecast our data. The model in the example is built around a yearly cycle, which we may have to adapt later to the rhythm of a football season.

Forecasting the data with these kernels means we need to install Scikit Learn with `pip install scikit-learn`, in addition to Numpy with `pip install numpy`. We then import the kernels needed to run the projection across the existing indices, and reshape the data into arrays.

For reference, the projection of this fitted model is based on the total number of passes thrown, from Josh Allen’s **ATT (Passing Attempts)** column in the original data. The size of this sample, if you recall, is based on the seasons in which Josh Allen and Stefon Diggs were both playing, and hypothetically starting, for the Buffalo Bills: seasons 2020–2023. Our multiplier for the ten (10) throw sample is one hundred and ninety (190), which spreads the ten samples across pass numbers 190 through 1,900, close to the 1,990 total throws. That same 1,990 total is the upper bound of the Numpy range we predict over.
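To make the scaling concrete, the sketch below shows what the multiplier and the reshape produce. It rebuilds the ten-row sample inline; the `MULTIPLIER` constant is a name added here for illustration.

```python
import numpy as np
import pandas as pd

MULTIPLIER = 190  # the article's multiplier for the ten-row sample

# Ten-row sample rebuilt inline so the sketch runs without the CSV file.
hist_data_samp = pd.DataFrame(
    {"Pass": range(1, 11), "Outcome": [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]}
).set_index("Pass")

# reshape(-1, 1) turns the flat array into a single column, the 2D
# (n_samples, n_features) layout that Scikit Learn estimators expect.
X = (hist_data_samp.index * MULTIPLIER).to_numpy().reshape(-1, 1)

# One prediction point per pass, from pass 1 through pass 1,990.
X_test = np.linspace(start=1, stop=1990, num=1_990).reshape(-1, 1)

print(X.shape, X.ravel().tolist())
print(X_test.shape, X_test[0, 0], X_test[-1, 0])
```

The training inputs land on 190, 380, and so on up to 1,900, while the prediction grid covers every pass number up to 1,990.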

*Add the following code block to the existing script to continue the development of this model.*

**Kernel Dependency Script — Addition:**

```python
# Import Kernel Dependencies For Extrapolated Data
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import (
    RBF,
    ExpSineSquared,
    RationalQuadratic,
    WhiteKernel,
)

# Scale the ten sample passes by 190 so they span passes 190 to 1,900
X = (hist_data_samp.index * 190).to_numpy().reshape(-1, 1)
y = hist_data_samp["Outcome"].to_numpy()

long_term_trend_kernel = 50.0**2 * RBF(length_scale=50.0)
seasonal_kernel = (
    2.0**2
    * RBF(length_scale=100.0)
    * ExpSineSquared(length_scale=1.0, periodicity=1.0, periodicity_bounds="fixed")
)
irregularities_kernel = 0.5**2 * RationalQuadratic(length_scale=1.0, alpha=1.0)
noise_kernel = 0.1**2 * RBF(length_scale=0.1) + WhiteKernel(
    noise_level=0.1**2, noise_level_bounds=(1e-5, 1e5)
)
hist_data_samp_kernel = (
    long_term_trend_kernel + seasonal_kernel + irregularities_kernel + noise_kernel
)

# Center the outcomes on their mean before fitting
y_mean = y.mean()
gaussian_process = GaussianProcessRegressor(kernel=hist_data_samp_kernel, normalize_y=False)
gaussian_process.fit(X, y - y_mean)
```

Extrapolation goes hand in hand with predicting from the fitted model. For this we use the Numpy package installed earlier to build an evenly spaced array of pass numbers, from 1 to 1,990, for the model to predict over.

**Fitting Data Script — Addition:**

```python
import numpy as np

# Fit Model With Extrapolated Data
X_test = np.linspace(start=1, stop=1990, num=1_990).reshape(-1, 1)
mean_y_pred, std_y_pred = gaussian_process.predict(X_test, return_std=True)
# Add back the mean that was subtracted before fitting
mean_y_pred += y_mean
```

Once the model has been fitted and the kernels have produced predictions across the extrapolated range, we can again graph the outcome of these theoretical forecasts to show how this data may play out.
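Before graphing, it can help to look at what the fit actually learned. The sketch below is a self-contained variation on the script, with the sample rebuilt inline and the four kernels combined in one expression. It prints the optimized kernel and its log marginal likelihood; suppressing `ConvergenceWarning` is an assumption made here because ten points give the optimizer very little to work with.

```python
import warnings

import numpy as np
from sklearn.exceptions import ConvergenceWarning
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import (
    RBF,
    ExpSineSquared,
    RationalQuadratic,
    WhiteKernel,
)

# Ten-pass sample scaled by 190, rebuilt inline so the sketch is self-contained.
X = (np.arange(1, 11) * 190).reshape(-1, 1)
y = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0], dtype=float)

# Trend + seasonal + irregularities + noise, as in the script above.
kernel = (
    50.0**2 * RBF(length_scale=50.0)
    + 2.0**2
    * RBF(length_scale=100.0)
    * ExpSineSquared(length_scale=1.0, periodicity=1.0, periodicity_bounds="fixed")
    + 0.5**2 * RationalQuadratic(length_scale=1.0, alpha=1.0)
    + 0.1**2 * RBF(length_scale=0.1)
    + WhiteKernel(noise_level=0.1**2, noise_level_bounds=(1e-5, 1e5))
)

gaussian_process = GaussianProcessRegressor(kernel=kernel, normalize_y=False)
with warnings.catch_warnings():
    # Hyperparameters may settle on their bounds with so few points.
    warnings.simplefilter("ignore", ConvergenceWarning)
    gaussian_process.fit(X, y - y.mean())

# kernel_ holds the optimized hyperparameters; the kernel passed in is unchanged.
print(gaussian_process.kernel_)
print(gaussian_process.log_marginal_likelihood_value_)
```

Comparing the printed `kernel_` against the starting values shows which components the optimizer leaned on, which is a useful check before trusting the shape of the forecast.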

*Add the following code block to the existing script to continue the development of this model.*

**Extrapolated Data Script — Addition:**

```python
# Graph Extrapolated Data
plt.plot(X, y, color="black", linestyle="dashed", label="Measurements")
plt.plot(X_test, mean_y_pred, color="tab:blue", alpha=0.4, label="Gaussian process")
plt.fill_between(
    X_test.ravel(),
    mean_y_pred - std_y_pred,
    mean_y_pred + std_y_pred,
    color="tab:blue",
    alpha=0.2,
)
plt.legend()
plt.xlabel("Passes Thrown")
plt.ylabel("Passes Caught")
_ = plt.title(
    "Fantasy Football QB/WR - Forecasted Data"
)
plt.show()
```

Finally, we can run `python model.py` again to graph the forecast from our fitted model. In this run, the data passes through Numpy and the Scikit Learn kernels, which exposes the noise in the sample and how far the forecast diverges from the initial statistical probability.

Reading the measurement and forecast lines in the chart, we can assume that a distribution of 1990 passes from Josh Allen should land Stefon Diggs around 950 catches. Our initial mathematical hypothesis, the sample with which we started, was that 40% of throws would be caught by Diggs: 1990 total passes multiplied by .4 gives 796 expected catches.
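One way to turn per-pass predictions into a catch total is to sum them. The sketch below checks the 796 baseline that way, using a flat 0.4 prediction for every pass. The `expected_catches` helper is hypothetical, and this is not necessarily how the figure of around 950 was read from the chart.

```python
import numpy as np

TOTAL_PASSES = 1990   # Josh Allen ATT, 2020-2023
CATCH_FRACTION = 0.4  # the ten-row sample: 4 catches in 10 passes


def expected_catches(per_pass_prediction: np.ndarray) -> float:
    """Sum per-pass catch predictions into an expected catch total."""
    return float(np.sum(per_pass_prediction))


# Baseline: every pass carries the same 0.4 prediction.
flat_prediction = np.full(TOTAL_PASSES, CATCH_FRACTION)
baseline = expected_catches(flat_prediction)

print(round(baseline))  # 796
```

Passing the script's `mean_y_pred` array to the same helper would give the fitted model's total for comparison against this baseline.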

Our forecasted outcome is inflated by our model's extrapolation. Hypothetically, though, just as we saw with our fandom-like simulation of mutually exclusive events, the model matches our secondary estimate of a greater outcome.

## Continue Exploring This Machine Learning Model, on GitHub and Previous Blogs

The previous blog that began this data science deep dive can shed more light on ingesting data sets and prediction algorithms. Primer code for this machine learning model and forecast can be found on GitHub.

# Thanks for Reading, Keep Forecasting!

Looking for more Application Development advice? Follow along on Twitter, GitHub, and LinkedIn. Visit online for the latest updates, news, and information at heyitsjoealongi.com.