PyTorch
PyTorch is an open source machine learning and deep learning framework.
Why use PyTorch?
Machine learning researchers love using PyTorch.
And as of February 2022, PyTorch is the most used deep learning framework on Papers With Code, a website for tracking machine learning research papers and the code repositories attached to them.
PyTorch also helps take care of many things such as GPU acceleration (making your code run faster) behind the scenes.
So you can focus on manipulating data and writing algorithms, and PyTorch will make sure it runs fast.
And if companies such as Tesla and Meta (Facebook) use it to build models they deploy to power hundreds of applications, drive thousands of cars and deliver content to billions of people, it's clearly capable on the development front too.
Tensors: Tensors are the basic building block of all of machine learning and deep learning. They can represent almost any kind of data (images, words, tables of numbers).
Tensor Types by Dimensionality
| Name | Dimensions | Example |
|---|---|---|
| Scalar | 0D | torch.tensor(7) |
| Vector | 1D | torch.tensor([7, 7]) |
| Matrix | 2D | torch.tensor([[7, 8], [9, 10]]) |
| Tensor | nD | 3D+ (images, batches, etc.) |
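To make the table concrete, here is a small sketch that builds one tensor of each kind and checks its number of dimensions (`ndim`) and its `shape`. The variable names are only illustrative.

```python
import torch

scalar = torch.tensor(7)                   # 0D: a single number
vector = torch.tensor([7, 7])              # 1D: one axis of length 2
matrix = torch.tensor([[7, 8], [9, 10]])   # 2D: rows and columns
tensor_3d = torch.zeros(size=(2, 3, 4))    # 3D: e.g. a small batch of 3x4 grids

for name, t in [("scalar", scalar), ("vector", vector),
                ("matrix", matrix), ("tensor_3d", tensor_3d)]:
    print(f"{name}: ndim={t.ndim}, shape={tuple(t.shape)}")
```

A quick rule of thumb: the number of opening square brackets at the start of the printed tensor equals `ndim`.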
Creating Tensors
```python
import torch

# Manual
scalar = torch.tensor(7)

# Random (0–1)
torch.rand(size=(3, 4))

# Zeros / Ones
torch.zeros(size=(3, 4))
torch.ones(size=(3, 4))

# Range
torch.arange(start=0, end=10, step=1)

# Like an existing tensor
some_tensor = torch.rand(size=(3, 4))
torch.zeros_like(input=some_tensor)
```
3 Critical Tensor Attributes (Always Check These When Debugging)
```python
tensor.shape   # e.g. torch.Size([3, 4])
tensor.dtype   # e.g. torch.float32
tensor.device  # e.g. cpu or cuda:0
```
Common dtypes:
- `torch.float32`: default, best for most ops
- `torch.float16`: faster, less precise
- `torch.float64`: more precise, slower
- `torch.int8` / `torch.int32` / `torch.int64`

```python
tensor.type(torch.float16)  # convert dtype
```
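As a rough illustration of the dtype notes above, the sketch below shows the defaults PyTorch picks for float and integer inputs and confirms that conversion returns a new tensor instead of changing the original. The tensor values are arbitrary.

```python
import torch

float_tensor = torch.tensor([3.0, 6.0, 9.0])    # Python floats default to torch.float32
int_tensor = torch.tensor([3, 6, 9])            # Python ints default to torch.int64

half_tensor = float_tensor.type(torch.float16)  # returns a converted copy
same_result = float_tensor.to(torch.float16)    # .to() is an equivalent spelling

print(float_tensor.dtype, int_tensor.dtype, half_tensor.dtype)
print(float_tensor.dtype)  # the original keeps its dtype; conversion is not in place
```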
Tensor Operations
```python
tensor + 10  # element-wise add
tensor * 10  # element-wise multiply
tensor - 10  # element-wise subtract
tensor / 2   # element-wise divide
```
Matrix Multiplication ⚠️ (Most Common Source of Errors)
- Rule: Inner dimensions must match → `(3, 2) @ (2, 3)` ✅ | `(3, 2) @ (3, 2)` ❌
- Result shape = outer dimensions

```python
torch.matmul(A, B.T)  # or
torch.mm(A, B.T)      # or
A @ B.T
```

Neural networks are essentially stacked matrix multiplications. `torch.nn.Linear()` does `y = x·Aᵀ + b`.
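The shape rule and the `nn.Linear` formula can be checked directly. This is a minimal sketch with made-up sizes: it triggers the shape error, fixes it with a transpose, and then compares `nn.Linear` against the same computation written by hand.

```python
import torch
from torch import nn

torch.manual_seed(42)

A = torch.rand(3, 2)
B = torch.rand(3, 2)

# (3, 2) @ (3, 2) fails: inner dimensions 2 and 3 do not match
try:
    A @ B
except RuntimeError as err:
    print("shape error:", err)

# Transposing B gives (3, 2) @ (2, 3) -> (3, 3)
result = A @ B.T
print(result.shape)

# nn.Linear stores a weight of shape (out_features, in_features) and a bias
linear = nn.Linear(in_features=2, out_features=6)
manual = A @ linear.weight.T + linear.bias  # y = x·Aᵀ + b written out by hand
print(torch.allclose(linear(A), manual))
```

Note that `in_features` has to equal the last dimension of the input, which is the same inner-dimension rule in another form.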
Aggregation
```python
x = torch.arange(0, 100, 10)

x.min()                       # tensor(0)
x.max()                       # tensor(90)
x.type(torch.float32).mean()  # tensor(45.), mean() needs a float dtype
x.sum()                       # tensor(450)

x.argmin()  # index of min
x.argmax()  # index of max
```
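The float requirement for `mean()` is easy to trip over because `torch.arange` with integer arguments returns an `int64` tensor. A short sketch of the failure and the fix:

```python
import torch

x = torch.arange(0, 100, 10)  # dtype is torch.int64
print(x.dtype)

# mean() is not defined for integer tensors, so this raises RuntimeError
try:
    x.mean()
except RuntimeError as err:
    print("mean failed:", err)

# Converting to a float dtype first makes it work
mean_value = x.type(torch.float32).mean()
print(mean_value)

# argmin/argmax return positions, which can be used to index back into x
print(x[x.argmax()])
```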
Tensor Shape Manipulation
```python
x = torch.arange(1., 8.)       # shape: [7]

x_reshaped = x.reshape(1, 7)   # → [1, 7] (a view when possible, otherwise a copy)
x.view(1, 7)                   # → [1, 7] (shares memory with x!)

x_reshaped.squeeze()           # remove dims of size 1: [1, 7] → [7]
x.unsqueeze(dim=0)             # add dim: [7] → [1, 7]

torch.stack([x, x, x], dim=0)  # stack tensors: → [3, 7]

# Permute axes (common for images: HWC → CHW)
x_image = torch.rand(size=(224, 224, 3))
x_image.permute(2, 0, 1)       # [224, 224, 3] → [3, 224, 224]
```
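The "shares memory" note on `view()` is worth seeing once. In this sketch, writing through the view changes the original tensor, while a `clone()` stays independent.

```python
import torch

x = torch.arange(1., 8.)  # tensor([1., 2., 3., 4., 5., 6., 7.])
z = x.view(1, 7)          # same underlying data, different shape

z[:, 0] = 5.              # write through the view
print(x)                  # the first element of x changes too

# clone() makes an independent copy when shared memory is not wanted
independent = x.clone().reshape(1, 7)
independent[:, 1] = 100.
print(x[1])               # unchanged
```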
Indexing
```python
x = torch.arange(1, 10).reshape(1, 3, 3)

x[0]        # first element on dim 0
x[0][0][0]  # chain indexing
x[:, 0]     # all of dim 0, index 0 of dim 1
x[:, :, 1]  # all of dims 0 and 1, index 1 of dim 2
```
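With the values written out, the indexing patterns above are easier to follow. The same tensor is used here, with the result of each expression noted in a comment.

```python
import torch

x = torch.arange(1, 10).reshape(1, 3, 3)
# x is [[[1, 2, 3],
#        [4, 5, 6],
#        [7, 8, 9]]]

print(x[0])        # the whole 3x3 block
print(x[0][0][0])  # tensor(1)
print(x[:, 0])     # tensor([[1, 2, 3]])
print(x[:, :, 1])  # tensor([[2, 5, 8]])
print(x[0, 2, 2])  # tensor(9), comma form of chained indexing
```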
PyTorch ↔ NumPy
```python
# NumPy → PyTorch
tensor = torch.from_numpy(np_array)

# PyTorch → NumPy
np_array = tensor.numpy()
```

⚠️ NumPy uses `float64` by default; PyTorch uses `float32`. Convert explicitly:

```python
torch.from_numpy(array).type(torch.float32)
```
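A small check of the dtype warning, assuming NumPy is installed alongside PyTorch. The array contents are arbitrary.

```python
import numpy as np
import torch

array = np.arange(1.0, 8.0)       # NumPy float arrays default to float64
tensor = torch.from_numpy(array)  # the dtype is carried over
print(array.dtype, tensor.dtype)

tensor_32 = torch.from_numpy(array).type(torch.float32)
print(tensor_32.dtype)

# Going the other way also keeps the tensor's dtype
back = torch.ones(7).numpy()
print(back.dtype)
```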
Reproducibility
```python
RANDOM_SEED = 42

torch.manual_seed(RANDOM_SEED)
tensor_A = torch.rand(3, 4)

torch.manual_seed(RANDOM_SEED)
tensor_B = torch.rand(3, 4)

tensor_A == tensor_B  # all True ✅
```
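The seed has to be reset before each call that should repeat. The sketch below contrasts a second draw without reseeding (different values) with a draw after reseeding (identical values).

```python
import torch

RANDOM_SEED = 42

torch.manual_seed(RANDOM_SEED)
tensor_A = torch.rand(3, 4)
tensor_B = torch.rand(3, 4)  # no reseed: continues the random stream

torch.manual_seed(RANDOM_SEED)
tensor_C = torch.rand(3, 4)  # reseeded: repeats tensor_A

print(torch.equal(tensor_A, tensor_B))
print(torch.equal(tensor_A, tensor_C))
```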
GPU / Device Setup
```python
# Check availability
torch.cuda.is_available()          # NVIDIA
torch.backends.mps.is_available()  # Apple Silicon

# Device-agnostic code (best practice)
if torch.cuda.is_available():
    device = "cuda"
elif torch.backends.mps.is_available():
    device = "mps"
else:
    device = "cpu"

tensor = tensor.to(device)

# Move back to CPU for NumPy
tensor.cpu().numpy()
```
Key Takeaways
- Tensors are everything — all ML data in PyTorch lives in tensors
- Always check shape, dtype, device when debugging
- Matrix multiplication is the core op of neural networks
- `.view()` shares memory with original; `.reshape()` may not
- Use `torch.manual_seed()` for reproducible experiments
- Write device-agnostic code so it runs on CPU/GPU without changes
PyTorch Workflow — Notes
The 5-Step PyTorch Workflow
1. Prepare Data
2. Build Model
3. Train Model
4. Make Predictions (Inference)
5. Save & Load Model
Step 1 — Prepare & Split Data
- Convert raw data into tensors
- Split into train / val / test sets
| Split | Size | Purpose |
|---|---|---|
| Training | 60–80% | Model learns from this |
| Validation | 10–20% | Tune hyperparameters |
| Test | 10–20% | Final evaluation only |
⚠️ Never let the model see test data during training — it measures true generalization
```python
# Example: synthetic linear data
import torch

weight = 0.7
bias = 0.3

X = torch.arange(0, 1, 0.02).unsqueeze(dim=1)  # add feature dimension
y = weight * X + bias

# Split
train_split = int(0.8 * len(X))
X_train, y_train = X[:train_split], y[:train_split]
X_test, y_test = X[train_split:], y[train_split:]
```
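The table in this step lists three splits, while the example makes only two. One possible way to add a validation set is sketched below; the 70/15/15 fractions are an assumption chosen for illustration and sit inside the ranges from the table.

```python
import torch

X = torch.arange(0, 1, 0.02).unsqueeze(dim=1)  # same synthetic data as above
y = 0.7 * X + 0.3

# Hypothetical 70/15/15 split; the exact fractions are a choice, not a rule
train_end = int(0.70 * len(X))
val_end = int(0.85 * len(X))

X_train, y_train = X[:train_end], y[:train_end]
X_val, y_val = X[train_end:val_end], y[train_end:val_end]
X_test, y_test = X[val_end:], y[val_end:]

print(len(X_train), len(X_val), len(X_test))
```

Slicing in order works for this toy line. For real datasets the samples are usually shuffled before splitting.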
Step 2 — Build Model
- Subclass `torch.nn.Module`
- Define learnable params with `nn.Parameter()`
- Implement `forward()` method
```python
import torch
from torch import nn

class LinearRegressionModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.weights = nn.Parameter(torch.randn(1, dtype=torch.float), requires_grad=True)
        self.bias = nn.Parameter(torch.randn(1, dtype=torch.float), requires_grad=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.weights * x + self.bias
```
Key nn Components
| Component | Purpose |
|---|---|
| `nn.Module` | Base class for all models; subclasses must implement `forward()` |
| `nn.Parameter` | Learnable tensor, auto-tracked by autograd |
| `torch.optim` | Optimizers for updating parameters |
Useful Model Utilities
```python
model = LinearRegressionModel()

model.state_dict()  # view all learned params {name: tensor}
model.parameters()  # iterate over all params
```
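To see what these utilities return, here is a self-contained sketch that repeats the model definition from Step 2 and inspects it. The seed is only there so the printed values are repeatable.

```python
import torch
from torch import nn

class LinearRegressionModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.weights = nn.Parameter(torch.randn(1, dtype=torch.float))
        self.bias = nn.Parameter(torch.randn(1, dtype=torch.float))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.weights * x + self.bias

torch.manual_seed(42)
model = LinearRegressionModel()

# state_dict() maps each parameter's attribute name to its current tensor
for name, value in model.state_dict().items():
    print(name, value)

# parameters() returns a generator, so wrap it in list() to inspect it
params = list(model.parameters())
print(len(params), params[0].requires_grad)
```

The keys in `state_dict()` come from the attribute names used in `__init__`, which is why a saved state dict can only be loaded into a model with matching names.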
Step 3 — Train the Model
Pick a Loss Function & Optimizer
| Problem Type | Loss Function | Common Optimizer |
|---|---|---|
| Regression | nn.L1Loss() (MAE) | torch.optim.SGD |
| Binary Classification | nn.BCELoss() | torch.optim.Adam |
| Multi-class | nn.CrossEntropyLoss() | torch.optim.Adam |
```python
loss_fn = nn.L1Loss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
```
Learning rate controls how big each update step is. Too high = unstable. Too low = slow.
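For plain SGD without momentum, one update is `parameter = parameter - lr * gradient`. The sketch below checks that by hand on a single made-up parameter and a toy loss; none of these values come from the linear regression example.

```python
import torch

lr = 0.1
w = torch.tensor([2.0], requires_grad=True)  # a single made-up parameter
optimizer = torch.optim.SGD([w], lr=lr)

loss = (w ** 2).sum()  # toy loss; its gradient is 2 * w = 4.0
loss.backward()

expected = w.detach().clone() - lr * w.grad  # 2.0 - 0.1 * 4.0 = 1.6
optimizer.step()

print(w.detach(), expected)
```

Doubling `lr` doubles the size of the step, which is why a value that is too high can overshoot the minimum.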
Training Loop (5 Steps Every Epoch)
```python
# 1. Forward pass → get predictions
# 2. Calculate loss
# 3. Zero gradients (always before backward!)
# 4. Backpropagation → compute gradients
# 5. Update parameters

torch.manual_seed(42)
epochs = 100

for epoch in range(epochs):
    # --- TRAIN ---
    model.train()                    # activates training mode
    y_pred = model(X_train)          # 1. forward pass
    loss = loss_fn(y_pred, y_train)  # 2. loss
    optimizer.zero_grad()            # 3. zero grads
    loss.backward()                  # 4. backprop
    optimizer.step()                 # 5. update params

    # --- TEST ---
    model.eval()                     # activates eval mode
    with torch.inference_mode():
        test_pred = model(X_test)
        test_loss = loss_fn(test_pred, y_test.type(torch.float))

    if epoch % 10 == 0:
        print(f"Epoch {epoch} | Train Loss: {loss:.4f} | Test Loss: {test_loss:.4f}")
```
Why optimizer.zero_grad()?
PyTorch accumulates gradients by default. You must zero them before each backward pass or they stack up and corrupt updates.
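The accumulation is easy to observe on a single tensor. In this sketch the gradient of `3 * w` is always 3, yet a second `backward()` without zeroing reports 6.

```python
import torch

w = torch.tensor([1.0], requires_grad=True)

(3 * w).sum().backward()
first_grad = w.grad.clone()
print(first_grad)   # tensor([3.])

# A second backward pass without zeroing adds to the stored gradient
(3 * w).sum().backward()
second_grad = w.grad.clone()
print(second_grad)  # tensor([6.])

# Resetting clears it; optimizer.zero_grad() does this for every parameter
w.grad = None
(3 * w).sum().backward()
print(w.grad)       # tensor([3.])
```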
Step 4 — Inference (Making Predictions)
3 rules for inference:
```python
model.eval()                  # 1. switch to eval mode
with torch.inference_mode():  # 2. disable gradient tracking
    y_preds = model(X_test)   # 3. data & model must be on same device
```
Use `torch.inference_mode()`: it is typically a little faster than `torch.no_grad()` and is the preferred approach in recent PyTorch versions.
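One way to confirm that gradient tracking is off is to inspect the output tensor. This sketch uses a small `nn.Linear` as a stand-in for any model; the stand-in and the random input are assumptions for illustration.

```python
import torch
from torch import nn

model = nn.Linear(in_features=1, out_features=1)  # stand-in for any model
X_test = torch.rand(5, 1)

model.eval()
tracked = model(X_test)      # normal call: the output is part of a graph
with torch.inference_mode():
    y_preds = model(X_test)  # no graph is recorded here

print(tracked.requires_grad, y_preds.requires_grad)
print(y_preds.is_inference())
```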
Step 5 — Save & Load Model
Saving
```python
from pathlib import Path

# Create directory
MODEL_PATH = Path("models")
MODEL_PATH.mkdir(parents=True, exist_ok=True)

MODEL_SAVE_PATH = MODEL_PATH / "model_0.pth"

# Save only state_dict (recommended: lightweight)
torch.save(obj=model.state_dict(), f=MODEL_SAVE_PATH)
```
Loading
```python
# Must recreate model architecture first
loaded_model = LinearRegressionModel()

# Then load saved params
loaded_model.load_state_dict(torch.load(MODEL_SAVE_PATH))

loaded_model.eval()
with torch.inference_mode():
    preds = loaded_model(X_test)
```
Always save `state_dict()`, not the whole model: it's smaller and more portable.
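A quick way to check that a save and load round trip worked is to compare predictions from the original and the reloaded model. This sketch uses a temporary directory and an `nn.Linear` stand-in so it does not leave files behind; both are assumptions for illustration.

```python
import tempfile
from pathlib import Path

import torch
from torch import nn

model = nn.Linear(in_features=1, out_features=1)  # stand-in for any nn.Module

with tempfile.TemporaryDirectory() as tmp_dir:
    save_path = Path(tmp_dir) / "model_0.pth"
    torch.save(obj=model.state_dict(), f=save_path)

    # A fresh instance starts with different random parameters
    loaded_model = nn.Linear(in_features=1, out_features=1)
    loaded_model.load_state_dict(torch.load(save_path))

X = torch.rand(4, 1)
model.eval()
loaded_model.eval()
with torch.inference_mode():
    same = torch.equal(model(X), loaded_model(X))
print(same)
```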
Training vs Eval Mode — What Changes?
| Mode | Layers Affected | Use When |
|---|---|---|
| `model.train()` | Dropout ON, BatchNorm uses batch stats | Training loop |
| `model.eval()` | Dropout OFF, BatchNorm uses running stats | Testing / Inference |
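The Dropout row can be demonstrated with a single layer. In this sketch, with `p=0.5`, training mode zeroes some values and scales the survivors by `1 / (1 - p)`, while eval mode passes the input through unchanged.

```python
import torch
from torch import nn

torch.manual_seed(42)
dropout = nn.Dropout(p=0.5)
x = torch.ones(10)

dropout.train()
train_out = dropout(x)  # some values zeroed, the rest scaled to 2.0

dropout.eval()
eval_out = dropout(x)   # dropout is a no-op in eval mode

print(train_out)
print(eval_out)
```

Calling `.train()` or `.eval()` on a full model sets the same flag on every submodule, so layers like this one switch together.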
Key Takeaways
- Every training loop has the same 5 steps — forward → loss → zero grad → backward → step
- Always call `model.eval()` + `torch.inference_mode()` before making predictions
- Save `state_dict()` not the full model: it's the standard, portable way
- Loss going down = model learning; track both train and test loss to spot overfitting
- `model.train()` and `model.eval()` matter: they control Dropout and BatchNorm behavior
PyTorch linear regression model:

PyTorch model building essentials:
| PyTorch module | What does it do? |
|---|---|
| `torch.nn` | Contains all of the building blocks for computational graphs (essentially a series of computations executed in a particular way). |
| `torch.nn.Parameter` | Stores tensors that can be used with `nn.Module`. If `requires_grad=True`, gradients (used for updating model parameters via gradient descent) are calculated automatically; this is often referred to as "autograd". |
| `torch.nn.Module` | The base class for all neural network modules; all the building blocks for neural networks are subclasses. If you're building a neural network in PyTorch, your models should subclass `nn.Module`. Requires a `forward()` method to be implemented. |
| `torch.optim` | Contains various optimization algorithms (these tell the model parameters stored in `nn.Parameter` how best to change to improve gradient descent and in turn reduce the loss). |
| `def forward()` | All `nn.Module` subclasses require a `forward()` method; this defines the computation that will take place on the data passed to the particular `nn.Module` (e.g. the linear regression formula above). |

Making predictions using torch.inference_mode()
```python
# Make predictions with model
with torch.inference_mode():
    y_preds = model_0(X_test)

# Note: in older PyTorch code you might also see torch.no_grad()
# with torch.no_grad():
#     y_preds = model_0(X_test)
```
You probably noticed we used torch.inference_mode() as a context manager (that's what the with torch.inference_mode(): is) to make the predictions.
As the name suggests, torch.inference_mode() is used when using a model for inference (making predictions).
torch.inference_mode() turns off a bunch of things (like gradient tracking, which is necessary for training but not for inference) to make forward-passes (data going through the forward() method) faster.
Train Model:
The whole idea of training is for a model to move from some unknown parameters (these may be random) to some known parameters. In other words, from a poor representation of the data to a better one.
One way to measure how poor or how wrong your model's predictions are is to use a loss function. Note: a loss function may also be called a cost function or criterion in different areas. Here we'll refer to it as a loss function.
Things we need to train:
- Loss function: measures how far your model's predictions are from the ideal outputs; lower is better.
- Optimizer: takes the loss of a model into account and adjusts the model's parameters (e.g. weight and bias in our case) to reduce it.
Which loss function and optimizer you use depends on the kind of problem you're working on.
However, some common choices are known to work well, such as the SGD (stochastic gradient descent) or Adam optimizers, the MAE (mean absolute error) loss function for regression problems (predicting a number), and the binary cross entropy loss function for classification problems (predicting one thing or another).
For our problem, since we're predicting a number, let's use MAE (which is `torch.nn.L1Loss()` in PyTorch) as our loss function.
Mean absolute error (MAE, in PyTorch: `torch.nn.L1Loss`) measures the absolute difference between two points (predictions and labels) and then takes the mean across all examples.
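As a worked example of that definition, the sketch below computes MAE by hand on three made-up predictions and compares it with `nn.L1Loss`. The absolute differences are 1, 0 and 2, so the mean is 1.0.

```python
import torch
from torch import nn

y_pred = torch.tensor([1.0, 2.0, 3.0])
y_true = torch.tensor([2.0, 2.0, 5.0])

# Absolute differences are [1, 0, 2], so the mean is 3 / 3 = 1.0
manual_mae = torch.mean(torch.abs(y_pred - y_true))
builtin_mae = nn.L1Loss()(y_pred, y_true)

print(manual_mae, builtin_mae)
```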
And we'll use SGD, torch.optim.SGD(params, lr) where:
- `params` is the target model parameters you'd like to optimize (e.g. the `weights` and `bias` values we randomly set before).
- `lr` is the learning rate you'd like the optimizer to update the parameters at. Higher means the optimizer will try larger updates (these can sometimes be too large and the optimizer will fail to work); lower means the optimizer will try smaller updates (these can sometimes be too small and the optimizer will take too long to find the ideal values). The learning rate is considered a hyperparameter (because it's set by a machine learning engineer). Common starting values for the learning rate are `0.01`, `0.001`, `0.0001`; however, these can also be adjusted over time (this is called learning rate scheduling).
```python
# Create the loss function
loss_fn = nn.L1Loss()  # MAE loss is same as L1Loss

# Create the optimizer
optimizer = torch.optim.SGD(params=model_0.parameters(),  # parameters of target model to optimize
                            lr=0.01)  # learning rate (how much the optimizer should change parameters at each step, higher=more (less stable), lower=less (might take a long time))
```
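Learning rate scheduling, mentioned above, can be sketched with `torch.optim.lr_scheduler.StepLR`, one of several schedulers that ship with PyTorch. The stand-in model, `step_size=10` and `gamma=0.1` are assumptions chosen for illustration, not values from these notes.

```python
import torch
from torch import nn

model = nn.Linear(in_features=1, out_features=1)  # stand-in model
optimizer = torch.optim.SGD(params=model.parameters(), lr=0.01)

# Multiply the learning rate by 0.1 every 10 epochs (illustrative values)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=10, gamma=0.1)

for epoch in range(20):
    optimizer.step()  # a real loop would compute a loss and call backward() first
    scheduler.step()
    if epoch % 5 == 4:
        print(epoch + 1, scheduler.get_last_lr())
```

The scheduler is stepped once per epoch, after the optimizer, so the learning rate drops from 0.01 to 0.001 after epoch 10 and again after epoch 20.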
Q: Which loss function and optimizer should I use?
Ans: This will be problem specific. But with experience, you will get an idea of what works and what does not with your particular problem set. For example, for a regression problem (like ours), a loss function of nn.L1Loss() and an optimizer like torch.optim.SGD() will suffice.
But for a classification problem like classifying whether a photo is of a dog or a cat, you will likely want to use a loss function of nn.BCELoss() (binary cross entropy loss).
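One detail worth knowing about `nn.BCELoss()` is that it expects probabilities between 0 and 1, so raw model outputs need a sigmoid first. The sketch below uses made-up outputs and labels, and compares the result with `nn.BCEWithLogitsLoss()`, which applies the sigmoid internally.

```python
import torch
from torch import nn

logits = torch.tensor([2.0, -1.0, 0.5])  # raw model outputs (made-up values)
labels = torch.tensor([1.0, 0.0, 1.0])   # e.g. 1 = dog, 0 = cat, as floats

# nn.BCELoss expects probabilities in [0, 1], so apply sigmoid first
probs = torch.sigmoid(logits)
bce = nn.BCELoss()(probs, labels)

# nn.BCEWithLogitsLoss applies the sigmoid internally
bce_logits = nn.BCEWithLogitsLoss()(logits, labels)

print(bce, bce_logits)
```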
And specifically for PyTorch, we need:
- A training loop
- A testing loop
Training Loop:

Testing:
`modelname.eval()` → you're telling PyTorch that the model is no longer training, but instead will be used for testing or inference.
`with torch.inference_mode():` is a PyTorch context manager used to make inference faster and more memory-efficient.
```python
with torch.inference_mode():
    output = model(input)
```
It:
- Disables gradient calculation
- Disables autograd tracking
- Optimizes tensor operations for inference
- Prevents PyTorch from storing computation graphs.