How to Train
a Model
A hands-on, interactive guide to understanding
how calculus is used to train AI models.
Jacob Buckhouse
First published at NeurIPS 2025
Some familiarity with calculus concepts.
No CS background is required.
Expected Time:
10-15 minutes
Description:
A hands-on, interactive guide to understanding how calculus is used to train AI models. The guide is broken down into easy-to-digest sections that tackle one idea at a time. Each section includes a text description and a fully interactive Desmos graph. This combination is designed so that students not only learn the ideas but also engage with the math directly to gain an intuitive sense of the concepts.
Introduction
Illustrating the Invisible
Using calculus, humanity has designed safer roads, launched rockets into space, and tracked the motion of the planets and the spread of pandemics. In the AI era, calculus also gives us a new power: to create machines that think (or at least are very good at predicting the next word in a sentence). All of this comes from the simple ability to determine the slope of a function.
But how does AI actually work?
AI, at its core, is about making predictions. It might feel like you are asking AI a question and it magically retrieves an answer. But AI is in fact just a big prediction engine. It’s making a prediction (or best guess) of what the answer might be based on all the data it has access to and how it’s been trained.
When you ask an AI model to write you a paragraph, what it is actually doing is predicting the next token (a chunk of a word, usually just a few letters) and then stringing them all together. LLMs (Large Language Models) work by just predicting what small part comes next and doing this over and over.
With enough data, this allows for predicting complex outputs that approach genuine reasoning. How far can it go? We still don’t know.
So, how are these prediction machines constructed?
For AI to happen, you need three things: a model, data, and training. The model is the prediction machine itself. Data comes in two parts: examples to give to the model and the answers that the model's predictions can be compared against.
What is Training?
Think of training a model like tuning a string on a guitar. You strum the string and listen to the sound it produces. Then you tighten it or loosen it to get it closer to the sound you expect. If it’s too high, you decrease the pitch. If it’s too low, you increase the pitch. Eventually, you find a pitch that is just right.
To train a model, what we're tuning isn't pitch, but an enormous number of tiny values called parameters. Each parameter subtly changes how the model behaves, like how each tuning peg slightly adjusts how the guitar will sound. We tune these parameters until the predictions match what we expect.
Training is a five-step process.
- Give the model an example
- See what it predicts
- Compare it to our expectations
- Tune the parameters so they match our expectation a little bit better
- Repeat!
Great! So how do we actually do that? The most challenging step is step four. On a guitar, it's relatively easy: tighter is higher and looser is lower. But with AI models, changing any one parameter affects the model in ways we can't easily predict.
So how do we do it?
We use calculus. Specifically, gradients, which describe how each parameter influences the accuracy of the model. We'll show you the specifics one step at a time.
How to use this interactive:
- Each section introduces a new concept and an interactive Desmos graph. Follow the instructions and click on the Desmos graph to gain a hands-on, intuitive sense of the math, as well as see the actual equations at work.
- Navigate by clicking and dragging the sliders, clicking on arrows, dragging around points, or dragging around the 3D models.
Section One: Linear Regression
Linear regression uses the same mathematical principles as modern AI, but with just two parameters we can actually see and understand. The premise is simple: given an x value, predict the y value. For example, you might set up a linear regression model to predict house price given square footage. We have data in the form of points we want to fit, and our task is to derive two parameters, $m$ and $b$, for $y=mx+b$ such that the line matches the x-y values best. In other words, for each data point $(x_n, y_n)$, we want $mx_n+b\approx y_n$.
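To make the house-price example concrete, here is a minimal sketch of what a trained linear model does at prediction time. The parameter values are made up for illustration; a real model would learn them from data.

```python
# Hypothetical linear-regression prediction: price from square footage.
# These parameter values are invented for illustration only.
m = 150.0     # dollars per square foot (the slope)
b = 20000.0   # base price (the intercept)

def predict_price(square_feet):
    """Predict a house price with the line y = m*x + b."""
    return m * square_feet + b

print(predict_price(1000))  # 170000.0
```

The entire "model" is just the line $y=mx+b$; training is the process of choosing $m$ and $b$ well.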
But how do we find $m$ and $b$?
Try sliding $m$ and $b$ around to match the line to the data.
Section Two: Loss Functions
The first step is to define numerically what counts as a good match between the line and the data points. In other words, a measure of success. This is called a loss function. For each data point and prediction, we want to punish guesses that were far off and reward close guesses. One formula you could imagine is $L=(y_{real}-y_{pred})^2$. Subtracting the predicted value from the real value makes sense: a larger difference is worse, and that corresponds to a higher loss. Why squared? Three reasons: (1) it makes negative differences positive, (2) it punishes big mistakes much more than small ones, and (3) squared functions have nice mathematical properties for finding minimums. So with some choice of $m$ and $b$, for any given data point you can calculate the loss $L$. You can then calculate the overall loss by summing the individual errors and averaging them. This is called MSE, for Mean Squared Error, and is a very common loss metric used by professionals in machine learning. Here's the formula:
$\text{MSE}=\frac{1}{n}\sum_{i=1}^{n}\left(y_{real,i}-y_{pred,i}\right)^{2}$
Try sliding around $m$ and $b$, with the goal of minimizing $L$.
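The MSE formula translates almost directly into code. Below is a minimal sketch using a small set of made-up data points; the function returns the average squared error of a candidate line against them.

```python
# Toy data points (invented for illustration).
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.1, 3.9, 6.2, 7.8]

def mse(m, b):
    """Mean squared error of the line y = m*x + b against the data."""
    errors = [(y - (m * x + b)) ** 2 for x, y in zip(xs, ys)]
    return sum(errors) / len(errors)

print(mse(2.0, 0.0))  # a good guess: the loss is small
print(mse(0.0, 0.0))  # a bad guess: the loss is much larger
```

Lower output means a better fit, which is exactly what the slider exercise above asks you to minimize by hand.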
Section Three: Parameterized Loss
Here's the key insight: instead of guessing $m$ and $b$ randomly, we can treat this as a math problem. We now have a function $L(m,b)$ that tells us how wrong our line is for any choice of $m$ and $b$. We want to find the $m$ and $b$ that minimize the output. Now the really cool part: since there are only two parameters, we can actually graph $m$, $b$, and $L(m,b)$ in 3D. Guessing numbers that produce a line matching the data is very unscientific, but the process of minimizing a function's output by changing its inputs has been thoroughly studied and can be approached mathematically. This idea of plotting the loss over the parameters produces what is called a loss landscape.
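With only two parameters, you could even survey the loss landscape by brute force: evaluate $L(m,b)$ on a coarse grid and pick the lowest cell. This is a sketch with made-up data, and it only works because the grid is tiny; with millions of parameters this approach is hopeless, which is why we need calculus.

```python
# Toy data points (invented for illustration).
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.1, 3.9, 6.2, 7.8]

def loss(m, b):
    """Mean squared error of the line y = m*x + b against the data."""
    return sum((y - (m * x + b)) ** 2 for x, y in zip(xs, ys)) / len(xs)

# Survey a coarse grid over the loss landscape: m, b in [-5, 5], step 0.1.
grid = [(m / 10, b / 10) for m in range(-50, 51) for b in range(-50, 51)]
best_m, best_b = min(grid, key=lambda p: loss(*p))
print(best_m, best_b)  # the grid cell with the smallest loss
```

Each grid cell is one point on the 3D surface you can explore in the interactive.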
Drag around the 3D model on the right and explore the descriptions and equations on the left.
Section Four: Minimizing Functions
These same principles work even if there are millions of parameters. But we have to be clever. When there are so many parameters, we can't just plot all the possible options and look for the smallest one. Let's start simple with just one parameter to see how calculus helps us find the minimum. We have a normal function $y=f(x)$, and we want to find the lowest point.
Try dragging the orange point along the invisible curve. Click the black text to reveal the function once you think you've found the lowest point.
Section Five: Derivatives
The derivative of a function gives us the instantaneous slope at a given point. With that we can construct the tangent line. Try dragging $a$ around and watching what happens to the tangent line (and the value of $f'(a)$) on either side of the lowest point of the function.
Try dragging the orange point along the invisible curve, paying attention to the tangent line. Click the black text to reveal the function once you think you've found the lowest point. The tangent line should be horizontal at the lowest point, because the derivative is zero.
The mathematical reason behind this is that the derivative is zero at local extrema: the function switches between going down and going up, so the derivative switches from a negative number to a positive number, passing through zero in the middle.
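You can check this sign flip numerically. The sketch below estimates the derivative of an example parabola (chosen for illustration, not one of the hidden interactive functions) with a centered finite difference, and shows it is negative left of the minimum, positive right of it, and essentially zero at it.

```python
# Example function with its minimum at x = 3 (invented for illustration).
def f(x):
    return (x - 3) ** 2 + 1

def derivative(a, h=1e-6):
    """Centered finite-difference estimate of f'(a)."""
    return (f(a + h) - f(a - h)) / (2 * h)

print(derivative(1.0))  # negative: left of the minimum, sloping down-right
print(derivative(5.0))  # positive: right of the minimum, sloping down-left
print(derivative(3.0))  # approximately zero: the tangent line is horizontal
```
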
Section Six: An Optimization Strategy
On the left of the lowest point, the derivative is a negative number. On the right, it's a positive number. In either case, it's steeper farther away. You might discover a strategy: if the tangent line points down to the right, go right. If it points down to the left, go left. If it's mostly flat, you're almost there. If the slope is very steep, take a bigger step. It's almost like tuning a guitar to find the right pitch.
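That strategy can be written as a tiny loop: repeatedly step opposite the slope, with a step size proportional to its steepness. A minimal sketch, using an example parabola whose derivative we know in closed form:

```python
# Derivative of the example function f(x) = (x - 3)^2 + 1.
def f_prime(x):
    return 2 * (x - 3)

x = 10.0   # start far from the minimum
k = 0.1    # step-size coefficient (the learning rate)
for _ in range(100):
    # Steep slope -> big step; nearly flat slope -> tiny step.
    x = x - k * f_prime(x)

print(x)  # very close to 3, the minimum of f
```

This loop is gradient descent in one dimension: the sign of the derivative tells you which way to go, and its magnitude tells you how far.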
Click the arrow symbol on the left to advance the point, following the algorithm outlined. As you progress, how does our point appear to move? How does $f'(a)$ behave?
Section Seven: Back to 3D
Now, in 3D (or in any number of dimensions), the method is almost the same. In 2D, the strategy was to find the tangent line and move opposite the direction it slopes upward. The same applies in 3D, except we take the derivative with respect to each parameter, since there are multiple dimensions.
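Here is a sketch of one descent step on the loss surface from earlier, using the same made-up data: estimate the partial derivative of the loss with respect to each of $m$ and $b$, then nudge both parameters downhill together.

```python
# Toy data points (invented for illustration).
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.1, 3.9, 6.2, 7.8]

def loss(m, b):
    """Mean squared error of the line y = m*x + b against the data."""
    return sum((y - (m * x + b)) ** 2 for x, y in zip(xs, ys)) / len(xs)

def partials(m, b, h=1e-6):
    """Numerical partial derivatives of the loss with respect to m and b."""
    dm = (loss(m + h, b) - loss(m - h, b)) / (2 * h)
    db = (loss(m, b + h) - loss(m, b - h)) / (2 * h)
    return dm, db

m, b, k = 0.0, 0.0, 0.01
dm, db = partials(m, b)
m, b = m - k * dm, b - k * db      # one step down the loss surface
print(loss(0.0, 0.0), loss(m, b))  # the loss decreases after the step
```

Repeating this step over and over is exactly what the interactive's arrow button animates.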
Click the arrow on the left of the interactive and watch the point descend down the green surface over time.
Section Eight: From Parameters to Predictions
When we find the lowest point on that 3D surface, we've found the best values for $m$ and $b$ - the ones that make our line fit the data points as closely as possible. Play around on the final graph:
Try clicking the arrow next to Update several times and see how the red line behaves. Click the arrow next to Reset to reset the interactive.
Section Nine [Advanced]: Learning Rates
A note on learning rates. You can't set the learning rate coefficient $k$ too high, because the updates would overcorrect, then overcorrect again, and so on, until $m_1$ and $b_1$ spiral out of control. But you can't set it too low either, because training would crawl and could get stuck in local minima. Picture a landscape with many small valleys: if your steps are too small, you might settle into a shallow valley and never find the deepest one.
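The overcorrection failure mode is easy to demonstrate. On the example function $f(x)=x^2$ (chosen for illustration), the gradient is $2x$, so any $k \ge 1$ makes each update overshoot the minimum by more than it approached; a sketch comparing a safe and an unsafe $k$:

```python
# One gradient-descent update on f(x) = x^2, whose gradient is 2x.
def step(x, k):
    return x - k * (2 * x)

x_good, x_bad = 1.0, 1.0
for _ in range(20):
    x_good = step(x_good, 0.1)  # gentle steps: shrinks toward 0
    x_bad = step(x_bad, 1.1)    # overshoots and grows every step

print(abs(x_good), abs(x_bad))  # tiny value vs. a number that has exploded
```

With $k=1.1$ each update multiplies $x$ by $-1.2$, so the iterates bounce from side to side with growing magnitude, which is exactly the spiraling-out-of-control behavior described above.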
Section Ten: Putting it All Together
So that is what 'training' really means - mathematical optimization of parameters for accurate predictions. Here’s an expanded sketch of what our five-step training process looks like with our newfound understanding:
- Give the model an example
- See what it predicts
- Treat the model like a function, with inputs and outputs
- Compare it to our expectations
- Use a loss function to quantify the errors made and calculate an average over all the data points
- Tune the parameters so they match our expectation a little bit better
- Get the gradient for each parameter with respect to the loss
- The key step that makes it all work: update each parameter by subtracting the gradient multiplied by the learning rate
- Repeat!
All this process is doing is searching for the lowest point of the loss landscape as efficiently as possible. And by minimizing the loss, the model learns to form good predictions. With this newly minted prediction machine, the hope is that it will generalize and correctly predict things it hasn't seen before.
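The expanded steps above can be sketched as one compact training loop for our linear model, with comments mapping each line back to the steps. The data is the same made-up toy set used throughout; the loop recovers the best-fit $m$ and $b$.

```python
# Toy data points (invented for illustration).
xs = [1.0, 2.0, 3.0, 4.0]   # examples we give the model
ys = [2.1, 3.9, 6.2, 7.8]   # answers to compare predictions against

m, b, k = 0.0, 0.0, 0.05    # parameters and learning rate

for _ in range(5000):                              # repeat!
    preds = [m * x + b for x in xs]                # see what it predicts
    errs = [p - y for p, y in zip(preds, ys)]      # compare to expectations
    # Gradient of the mean squared error for each parameter.
    dm = sum(2 * e * x for e, x in zip(errs, xs)) / len(xs)
    db = sum(2 * e for e in errs) / len(xs)
    m -= k * dm   # subtract the gradient multiplied by the learning rate
    b -= k * db

print(m, b)  # parameters of the tuned line
```

Real models follow the same loop; they just have vastly more parameters and compute the gradients with automatic differentiation instead of by hand.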
Check Your Understanding
Congratulations, you've made it this far! This section is to check what you've learned and see what you might want to review again to make sure you've got everything down. Teachers: You can assign this to your students as an assessment.
If you don't know an answer, I've linked the respective section where you can review.
All the questions are free-response.
- Explain gradient descent in your own words. [hint]
- Why is it important to get the lowest point on the loss landscape? [hint]
- What does the lowest point correspond to? [hint]
- Why do we want predictions that match the training data? In the case of a language model, how does this lead to generating better text? [hint]
- Why does subtracting the gradient multiplied by the learning rate lead us towards minima? [Look here or here for a hint]
- Why are learning rates important? What happens if the learning rate is too high? Too low? [hint]
Appendix
Gallery of graphs used: https://www.desmos.com/gallery/1e189716-7348-4121-95f9-5b705e39c323
Looking for more? Here is some suggested reading from outside sources.
- How LLMs like ChatGPT more specifically work: https://www.youtube.com/watch?v=wjZofJX0v4M&vl=en
- Additional explainer of gradient descent: https://www.3blue1brown.com/lessons/gradient-descent. It requires a bit more context surrounding neural networks, but it goes into the math in more granular detail.
- Train your own (simple) AI model in a no-code web environment: https://teachablemachine.withgoogle.com/
- Excellent in-depth Coursera course titled "Mathematics for Machine Learning: Multivariate Calculus": https://www.coursera.org/learn/multivariate-calculus-machine-learning?specialization=mathematics-machine-learning#modules
- Want to train your own AI model and actually write some code? Get started with PyTorch! https://docs.pytorch.org/tutorials/beginner/basics/quickstart_tutorial.html