Train a new neural network in your browser. Adjust hyperparameters and watch how they affect learning.
Every number you set before training is a hyperparameter. The model can't figure these out — you decide them.
1. Learning Rate
Controls how much each weight changes per update. Think of it as step size — you're walking downhill toward the lowest loss, and this decides how big each step is.
Too High (0.01–0.1 with Adam)
Takes huge steps, overshoots the minimum, loss spikes or diverges. Model never converges.
Sweet Spot (0.001)
Standard default for Adam. With SGD, you often need higher (0.01–0.1) because SGD doesn't adapt per-weight.
Too Low (0.00001)
Takes tiny steps, training is painfully slow. Might get stuck in a bad local minimum.
Experiment: Train with lr=0.001, then lr=0.1. Watch the second run's loss explode.
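The step-size intuition can be checked on a one-variable toy loss, f(w) = w². This is a plain-Python sketch, not the demo's network, and the exact divergence threshold depends on the loss surface (here it is lr = 1, not 0.1):

```python
# Gradient descent on f(w) = w^2, whose gradient is 2w.
# The update w -= lr * 2w multiplies w by (1 - 2*lr) each step:
# |1 - 2*lr| < 1 converges, |1 - 2*lr| > 1 diverges.
def descend(lr, steps=50, w=1.0):
    for _ in range(steps):
        w -= lr * 2 * w  # one gradient step
    return w

print(abs(descend(lr=0.1)))  # shrinks toward 0: converges
print(abs(descend(lr=1.1)))  # grows every step: diverges
```

Same mechanism as the demo: the safe range just sits lower for a real network because its loss surface is much steeper in places.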
2. Batch Size
How many data samples the model sees before updating weights once. The gradient is averaged over these samples.
Small (32–64)
Noisy gradient (few samples), but many updates per epoch. Can escape local minima due to noise.
Sweet Spot (256–512)
Good balance. Smooth enough gradients, enough updates per epoch. Powers of 2 align with hardware.
Large (1024+)
Very smooth gradient, fewer updates per epoch. Fast computation, but can converge to sharp minima that generalize worse.
Experiment: Train batch=32 vs batch=1024. Small batch curve will be noisier.
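The noise difference is visible without any network. A sketch assuming per-sample gradients behave like random draws from a fixed population (a simplification of real training):

```python
import random
import statistics

random.seed(0)
# Stand-in for per-sample gradients of one weight.
population = [random.gauss(0.0, 1.0) for _ in range(100_000)]

def batch_mean(batch_size):
    # A mini-batch gradient is the average over batch_size sampled gradients.
    return statistics.fmean(random.sample(population, batch_size))

# Spread of the batch-averaged gradient across many draws:
noise_32 = statistics.stdev(batch_mean(32) for _ in range(200))
noise_1024 = statistics.stdev(batch_mean(1024) for _ in range(200))
print(noise_32, noise_1024)  # the small-batch estimate is much noisier
```

Averaging over n samples shrinks the noise by roughly √n, which is why the batch=32 loss curve jitters and the batch=1024 curve is smooth.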
3. Architecture (Layers & Neurons)
Defines the model's capacity — how complex a function it can represent. Each layer learns a different level of abstraction:
Layer 1 — Learns basic features: "high velocity = more cooling", "bigger plate = more area"
Layer 2 — Combines features: "velocity × area effect", "heat vs surface area ratio"
Layer 3 — Synthesizes: "overall thermal resistance considering all factors"
Too Small (1 layer, 16 neurons)
Can't learn complex interactions. Underfits — R² plateaus low.
Too Big (5 layers, 256 neurons)
More parameters than needed. Overfitting risk — memorizes training data, fails on test data.
Experiment: Train 1×16 vs 3×64 vs 5×256. Watch R² improve then plateau.
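Capacity can be made concrete by counting parameters. A sketch assuming 3 input features and 1 output, which are hypothetical numbers for the demo's task, not confirmed by it:

```python
def param_count(layer_sizes):
    """Weights + biases of a fully connected net, e.g. [3, 64, 64, 64, 1]."""
    return sum((fan_in + 1) * fan_out  # +1 is the bias of each neuron
               for fan_in, fan_out in zip(layer_sizes, layer_sizes[1:]))

# Assuming 3 inputs and 1 output:
print(param_count([3, 16, 1]))             # 1 layer x 16 neurons
print(param_count([3, 64, 64, 64, 1]))     # 3 x 64
print(param_count([3] + [256] * 5 + [1]))  # 5 x 256
```

The 5×256 net has thousands of times more parameters than the 1×16 net. Once that count far exceeds what the data can pin down, the extra capacity goes into memorizing noise.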
4. Optimizer (Adam vs SGD)
The algorithm that decides how to update weights given the gradient. Different strategies, same goal.
SGD (Stochastic Gradient Descent)
Simplest: w = w - lr × gradient. Same learning rate for all weights. Works but slow, sensitive to lr choice.
SGD + Momentum
Adds inertia — keeps a running average of past gradients. Like a ball rolling downhill that doesn't stop at every bump.
Adam (Recommended)
Adapts the step size per weight automatically. Weights with a history of large gradients → smaller steps. Weights with small or rare gradients → bigger steps. Almost always the best choice.
Experiment: Adam lr=0.001 vs SGD lr=0.001 vs SGD lr=0.01. Adam converges much faster.
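The three update rules fit in a few lines each. A toy 1-D sketch: the formulas are the standard ones, but the loss, learning rates, and step counts are hand-picked for illustration, so this is not a benchmark:

```python
import math

def minimize(step_fn, lr, steps=200):
    # Minimize f(w) = (w - 3)^2 from w = 0; the gradient is 2*(w - 3).
    w, state = 0.0, {}
    for t in range(1, steps + 1):
        g = 2 * (w - 3)
        w = step_fn(w, g, lr, t, state)
    return w

def sgd(w, g, lr, t, state):
    return w - lr * g  # same step size for every weight

def sgd_momentum(w, g, lr, t, state, beta=0.9):
    # Running sum of past gradients: the "ball rolling downhill".
    v = state["v"] = beta * state.get("v", 0.0) + g
    return w - lr * v

def adam(w, g, lr, t, state, b1=0.9, b2=0.999, eps=1e-8):
    m = state["m"] = b1 * state.get("m", 0.0) + (1 - b1) * g       # 1st moment
    v = state["v"] = b2 * state.get("v", 0.0) + (1 - b2) * g * g   # 2nd moment
    m_hat = m / (1 - b1 ** t)  # bias correction for the early steps
    v_hat = v / (1 - b2 ** t)
    return w - lr * m_hat / (math.sqrt(v_hat) + eps)

for name, fn, lr in [("SGD", sgd, 0.1),
                     ("SGD+momentum", sgd_momentum, 0.01),
                     ("Adam", adam, 0.05)]:
    print(name, minimize(fn, lr))  # all three approach the minimum at w = 3
```

Note how Adam divides by √v_hat: a weight that has seen large gradients gets a smaller effective step, which is the per-weight adaptation described above.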
5. Epochs
How many complete passes through the training data. Each epoch, every sample is seen once. More epochs = more refinement of weights.
Too Few (10–20)
Model hasn't converged. Loss still dropping. Underfitting.
Sweet Spot (~200–300)
Loss curve has flattened. Further training gives diminishing returns.
Too Many (1000+)
Wasting time. May start overfitting — memorizing training noise.
Experiment: Use "Step ×10" to train incrementally. Watch diminishing returns after ~100–150 epochs.
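Diminishing returns appear even in the tiniest model. A sketch fitting y = 2x with a single weight (illustrative, not the demo's task):

```python
# Full-batch gradient descent on MSE, recording the loss every 100 epochs.
xs = [0.1 * i for i in range(10)]
ys = [2 * x for x in xs]

w, lr, history = 0.0, 0.1, []
for epoch in range(1, 301):
    grad = sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)
    w -= lr * grad
    if epoch % 100 == 0:
        loss = sum((w * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)
        history.append(loss)

print(history)  # each extra 100 epochs buys far less improvement
```

The first 100 epochs do almost all the work; the next 100 shave off a sliver of what remains. That flattening is exactly what "Step ×10" lets you watch in the demo.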
6. Train / Test Split
Divides data into two sets. The model trains ONLY on the training set. The test set checks if the model generalizes — can it predict data it has never seen?
50/50
Conservative. Less training data, but very robust test evaluation.
70/30 (Standard)
Good balance. Enough data to learn, enough to test reliably.
90/10
More training data, but fewer test samples → a noisier, less trustworthy R² score.
Overfitting signal: If train R² = 0.99 but test R² = 0.7, the model memorized training data instead of learning physics.
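A split like this is one shuffle and one slice. A minimal sketch; the seeded shuffle is an implementation choice for reproducibility, not necessarily what the demo does:

```python
import random

def train_test_split(data, train_frac=0.7, seed=42):
    """Shuffle once, then slice. The model must never train on the test rows."""
    idx = list(range(len(data)))
    random.Random(seed).shuffle(idx)  # seeded so the split is reproducible
    cut = int(len(data) * train_frac)
    train = [data[i] for i in idx[:cut]]
    test = [data[i] for i in idx[cut:]]
    return train, test

samples = list(range(100))
train, test = train_test_split(samples, train_frac=0.7)
print(len(train), len(test))       # 70 30
assert not set(train) & set(test)  # disjoint: no leakage into the test set
```

The disjointness check is the whole point: any overlap would leak test answers into training and inflate the test R².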