PyTorch is an open source machine learning and deep learning framework.

Why use PyTorch?

Machine learning researchers love using PyTorch.

And as of February 2022, PyTorch is the most used deep learning framework on Papers With Code, a website for tracking machine learning research papers and the code repositories attached to them.

PyTorch also helps take care of many things such as GPU acceleration (making your code run faster) behind the scenes.

So you can focus on manipulating data and writing algorithms, and PyTorch will make sure it runs fast.

And if companies such as Tesla and Meta (Facebook) use it to build models they deploy to power hundreds of applications, drive thousands of cars and deliver content to billions of people, it's clearly capable on the development front too.

Tensors: Tensors are the basic building block of all of machine learning and deep learning. They can represent almost any kind of data (images, words, tables of numbers).

Tensor Types by Dimensionality

Name	Dimensions	Example
Scalar	0D	`torch.tensor(7)`
Vector	1D	`torch.tensor([7, 7])`
Matrix	2D	`torch.tensor([[7, 8], [9, 10]])`
Tensor	nD	3D+ (images, batches, etc.)
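As a quick check of the table above, the sketch below builds one tensor of each kind and inspects `ndim` and `shape`. The name `tensor_3d` and its size are illustrative assumptions, not part of the original notes.

```python
import torch

scalar = torch.tensor(7)
vector = torch.tensor([7, 7])
matrix = torch.tensor([[7, 8], [9, 10]])
# Hypothetical 3D example: 3 channels of a 2x2 grid
tensor_3d = torch.zeros(3, 2, 2)

# ndim counts the dimensions, shape lists the size of each one
for name, t in [("scalar", scalar), ("vector", vector),
                ("matrix", matrix), ("tensor_3d", tensor_3d)]:
    print(name, t.ndim, tuple(t.shape))
```

A handy rule of thumb: the number of opening square brackets at the start of the printed tensor equals `ndim`.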

Creating Tensors

```python
import torch

# Manual
scalar = torch.tensor(7)

# Random (0–1)
torch.rand(size=(3, 4))

# Zeros / Ones
torch.zeros(size=(3, 4))
torch.ones(size=(3, 4))

# Range
torch.arange(start=0, end=10, step=1)

# Like an existing tensor
some_tensor = torch.rand(size=(3, 4))
torch.zeros_like(input=some_tensor)
```

3 Critical Tensor Attributes (Always Check These When Debugging)

```python
tensor.shape   # e.g. torch.Size([3, 4])
tensor.dtype   # e.g. torch.float32
tensor.device  # e.g. cpu or cuda:0
```

Common dtypes:

torch.float32 — default, best for most ops
torch.float16 — faster, less precise
torch.float64 — more precise, slower
torch.int8 / int32 / int64

tensor.type(torch.float16) # convert dtype
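A small sketch of dtype conversion, assuming a made-up three-element tensor. It shows that `.type()` (and the equivalent `.to()`) return a new tensor and leave the original untouched, and that converting floats to an integer dtype drops the fractional part.

```python
import torch

x = torch.tensor([1.5, 2.5, 3.5])   # float32 by default

x_half = x.type(torch.float16)      # new tensor, lower precision
x_int = x.to(torch.int64)           # .to() also converts; fractions are truncated

print(x.dtype, x_half.dtype, x_int.dtype)
print(x_int)
```

Because the conversion is not in place, the result has to be assigned to a variable to have any effect.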

Tensor Operations

```python
tensor + 10  # element-wise add
tensor * 10  # element-wise multiply
tensor - 10  # element-wise subtract
tensor / 2   # element-wise divide
```

Matrix Multiplication ⚠️ (Most Common Source of Errors)

Rule: Inner dimensions must match → (3, 2) @ (2, 3) ✅ | (3, 2) @ (3, 2) ❌
Result shape = outer dimensions

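```python
torch.matmul(A, B.T)  # general matrix multiplication
torch.mm(A, B.T)      # same result for 2D tensors (no broadcasting)
A @ B.T               # operator shorthand for torch.matmul
```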

Neural networks are essentially stacked matrix multiplications. torch.nn.Linear() does y = x·Aᵀ + b
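The shape rule and the `nn.Linear` formula can be checked with a short sketch. The tensor sizes and the names `A`, `B`, `x`, `linear` and `manual` are assumptions chosen for illustration.

```python
import torch
from torch import nn

torch.manual_seed(42)
A = torch.rand(3, 2)
B = torch.rand(3, 2)

# A @ B would raise a RuntimeError because the inner dimensions (2 and 3) differ.
# Transposing B makes them match: (3, 2) @ (2, 3) -> (3, 3)
out = A @ B.T

# nn.Linear stores its weight with shape (out_features, in_features)
linear = nn.Linear(in_features=2, out_features=4)
x = torch.rand(5, 2)

# Reproduce the layer by hand: y = x @ W^T + b
manual = x @ linear.weight.T + linear.bias

print(out.shape, linear(x).shape)
print(torch.allclose(linear(x), manual, atol=1e-6))
```

The weight is stored transposed relative to the input, which is why the formula contains Aᵀ.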

Aggregation

```python
x = torch.arange(0, 100, 10)  # int64 tensor: 0, 10, ..., 90

x.min()   # tensor(0)
x.max()   # tensor(90)
x.type(torch.float32).mean()  # tensor(45.), mean() needs a float dtype
x.sum()   # tensor(450)

x.argmin()  # index of min: tensor(0)
x.argmax()  # index of max: tensor(9)
```

Tensor Shape Manipulation

```python
x = torch.arange(1., 8.)  # shape: [7]

x_reshaped = x.reshape(1, 7)  # → [1, 7] (a view when possible, otherwise a copy)
x.view(1, 7)                  # → [1, 7] (always shares memory with x!)

x_reshaped.squeeze()  # remove dims of size 1: [1, 7] → [7]
x.unsqueeze(dim=0)    # add dim: [7] → [1, 7]

torch.stack([x, x, x], dim=0)  # stack tensors along a new dim: → [3, 7]

# Permute axes (common for images: HWC → CHW)
image = torch.rand(size=(224, 224, 3))
image.permute(2, 0, 1)  # [224, 224, 3] → [3, 224, 224]
```

Indexing

```python
x = torch.arange(1, 10).reshape(1, 3, 3)

x[0]        # first element on dim 0
x[0][0][0]  # chain indexing
x[:, 0]     # all of dim 0, index 0 of dim 1
x[:, :, 1]  # all of dims 0–1, index 1 of dim 2
```

PyTorch ↔ NumPy

```python
# NumPy → PyTorch
tensor = torch.from_numpy(np_array)

# PyTorch → NumPy
np_array = tensor.numpy()
```

⚠️ NumPy uses float64 by default; PyTorch uses float32. Convert explicitly:

torch.from_numpy(array).type(torch.float32)
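The dtype mismatch is easy to see in a small sketch. The names `array`, `tensor64` and `tensor32` are illustrative. It also shows a related behaviour worth knowing: `torch.from_numpy()` shares memory with the source array, while the dtype conversion produces an independent copy.

```python
import numpy as np
import torch

array = np.arange(1.0, 8.0)            # NumPy defaults to float64
tensor64 = torch.from_numpy(array)     # dtype is carried over: torch.float64
tensor32 = torch.from_numpy(array).type(torch.float32)  # explicit conversion

print(array.dtype, tensor64.dtype, tensor32.dtype)

# from_numpy() shares memory with the array; the converted copy does not
array[0] = 100.0
print(tensor64[0], tensor32[0])
```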

Reproducibility

```python
RANDOM_SEED = 42

torch.manual_seed(RANDOM_SEED)
tensor_A = torch.rand(3, 4)

torch.manual_seed(RANDOM_SEED)
tensor_B = torch.rand(3, 4)

tensor_A == tensor_B  # ✅ every element is True
```

GPU / Device Setup

```python
# Check availability
torch.cuda.is_available()          # NVIDIA
torch.backends.mps.is_available()  # Apple Silicon

# Device-agnostic code (best practice)
if torch.cuda.is_available():
    device = "cuda"
elif torch.backends.mps.is_available():
    device = "mps"
else:
    device = "cpu"

tensor = tensor.to(device)

# Move back to CPU for NumPy
tensor.cpu().numpy()
```

Key Takeaways

Tensors are everything — all ML data in PyTorch lives in tensors
Always check shape, dtype, device when debugging
Matrix multiplication is the core op of neural networks
.view() shares memory with original; .reshape() may not
Use torch.manual_seed() for reproducible experiments
Write device-agnostic code so it runs on CPU/GPU without changes
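The `.view()` versus `.reshape()` takeaway can be demonstrated with a short sketch. The names `z`, `t` and `copied` are assumptions for illustration; the transposed tensor is used only because it is a simple way to get a non-contiguous layout.

```python
import torch

x = torch.arange(1., 8.)
z = x.view(1, 7)
z[:, 0] = 5.      # writing through the view...
print(x[0])       # ...changes x as well

# A transposed tensor is non-contiguous, so view() cannot reinterpret it
t = torch.arange(6).reshape(2, 3).T   # shape [3, 2]
try:
    t.view(6)
    view_failed = False
except RuntimeError:
    view_failed = True

# reshape() falls back to copying the data in this case
copied = t.reshape(6)
shares_memory = copied.data_ptr() == t.data_ptr()
print(view_failed, shares_memory)
```

If a later in-place edit must never touch the original, calling `.clone()` first is the explicit way to get an independent copy.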

PyTorch Workflow — Notes

The 5-Step PyTorch Workflow

1. Prepare Data
2. Build Model
3. Train Model
4. Make Predictions (Inference)
5. Save & Load Model

Step 1 — Prepare & Split Data

Convert raw data into tensors
Split into train / val / test sets

Split	Size	Purpose
Training	60–80%	Model learns from this
Validation	10–20%	Tune hyperparameters
Test	10–20%	Final evaluation only

⚠️ Never let the model see test data during training. The test set exists to measure how well the model generalizes to data it has not seen.

```python
# Example: synthetic linear data
import torch

weight = 0.7
bias = 0.3

X = torch.arange(0, 1, 0.02).unsqueeze(dim=1)  # add feature dimension
y = weight * X + bias

# Split
train_split = int(0.8 * len(X))
X_train, y_train = X[:train_split], y[:train_split]
X_test, y_test = X[train_split:], y[train_split:]
```
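The example above produces only a train and a test set. One possible way to get all three splits from the table is sketched below. It assumes a 70/15/15 ratio and a random shuffle via `torch.randperm`; both are illustrative choices, and the index names are hypothetical.

```python
import torch

X = torch.arange(0, 1, 0.02).unsqueeze(dim=1)
y = 0.7 * X + 0.3

torch.manual_seed(42)
shuffled = torch.randperm(len(X))      # random order of sample indices

n_train = int(0.7 * len(X))
n_val = int(0.15 * len(X))

train_idx = shuffled[:n_train]
val_idx = shuffled[n_train:n_train + n_val]
test_idx = shuffled[n_train + n_val:]  # whatever remains

X_train, y_train = X[train_idx], y[train_idx]
X_val, y_val = X[val_idx], y[val_idx]
X_test, y_test = X[test_idx], y[test_idx]

print(len(X_train), len(X_val), len(X_test))
```

Shuffling before slicing avoids a test set that covers only one end of the input range, which matters for data that is sorted.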

Step 2 — Build Model

Subclass torch.nn.Module
Define learnable params with nn.Parameter()
Implement forward() method

```python
import torch
from torch import nn

class LinearRegressionModel(nn.Module):
    def __init__(self):
        super().__init__()
        # Start from random values; training adjusts them
        self.weights = nn.Parameter(torch.randn(1, dtype=torch.float), requires_grad=True)
        self.bias = nn.Parameter(torch.randn(1, dtype=torch.float), requires_grad=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Linear regression formula: y = weights * x + bias
        return self.weights * x + self.bias
```

Key nn Components

Component	Purpose
`nn.Module`	Base class for all models — must implement `forward()`
`nn.Parameter`	Learnable tensor — auto-tracked by autograd
`torch.optim`	Optimizers for updating parameters

Useful Model Utilities

```python
model = LinearRegressionModel()

model.state_dict()  # view all learned params {name: tensor}
model.parameters()  # iterate over all params
```
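To make the two utilities concrete, here is a self-contained sketch that repeats the model definition from the notes and inspects it. The variables `param_names` and `n_params` are illustrative additions.

```python
import torch
from torch import nn

class LinearRegressionModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.weights = nn.Parameter(torch.randn(1, dtype=torch.float))
        self.bias = nn.Parameter(torch.randn(1, dtype=torch.float))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.weights * x + self.bias

torch.manual_seed(42)
model = LinearRegressionModel()

# state_dict() maps each parameter name to its current tensor
param_names = list(model.state_dict().keys())

# parameters() yields the tensors themselves, e.g. for counting or for an optimizer
n_params = sum(p.numel() for p in model.parameters())

for name, value in model.state_dict().items():
    print(name, value)
print(n_params)
```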

Step 3 — Train the Model

Pick a Loss Function & Optimizer

Problem Type	Loss Function	Common Optimizer
Regression	`nn.L1Loss()` (MAE)	`torch.optim.SGD`
Binary Classification	`nn.BCELoss()`	`torch.optim.Adam`
Multi-class	`nn.CrossEntropyLoss()`	`torch.optim.Adam`

```python
loss_fn = nn.L1Loss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
```

Learning rate controls how big each update step is. Too high = unstable. Too low = slow.
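What "how big each update step is" means can be seen in a single plain SGD step, which moves a parameter by `lr * gradient`. The sketch below uses a made-up one-element parameter `w` and a toy loss whose gradient is exactly 3; it is an illustration, not part of the training code in these notes.

```python
import torch

w = torch.tensor([1.0], requires_grad=True)
optimizer = torch.optim.SGD([w], lr=0.1)

loss = (w * 3.0).sum()   # d(loss)/dw = 3
loss.backward()
optimizer.step()         # w <- w - lr * grad = 1.0 - 0.1 * 3.0

print(w)                 # close to 0.7
```

Multiplying `lr` by ten would make the same step ten times larger, which is where the instability at high learning rates comes from.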

Training Loop (5 Steps Every Epoch)

```python
# 1. Forward pass → get predictions
# 2. Calculate loss
# 3. Zero gradients (always before backward!)
# 4. Backpropagation → compute gradients
# 5. Update parameters

torch.manual_seed(42)
epochs = 100

for epoch in range(epochs):
    # --- TRAIN ---
    model.train()                    # activates training mode
    y_pred = model(X_train)          # 1. forward pass
    loss = loss_fn(y_pred, y_train)  # 2. loss
    optimizer.zero_grad()            # 3. zero grads
    loss.backward()                  # 4. backprop
    optimizer.step()                 # 5. update params

    # --- TEST ---
    model.eval()                     # activates eval mode
    with torch.inference_mode():
        test_pred = model(X_test)
        test_loss = loss_fn(test_pred, y_test.type(torch.float))

    if epoch % 10 == 0:
        print(f"Epoch {epoch} | Train Loss: {loss:.4f} | Test Loss: {test_loss:.4f}")
```

Why `optimizer.zero_grad()`?

PyTorch accumulates gradients by default. You must zero them before each backward pass or they stack up and corrupt updates.
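The accumulation is easy to observe directly. This sketch uses a hypothetical one-element parameter `w` and a toy loss with a constant gradient of 3, then compares `w.grad` with and without zeroing.

```python
import torch

w = torch.tensor([2.0], requires_grad=True)
optimizer = torch.optim.SGD([w], lr=0.1)

(w * 3.0).sum().backward()
first = w.grad.clone()         # gradient of one backward pass

(w * 3.0).sum().backward()     # no zero_grad() in between
accumulated = w.grad.clone()   # the second gradient was added to the first

optimizer.zero_grad()          # clear the stored gradient
(w * 3.0).sum().backward()
after_zero = w.grad.clone()    # back to a single gradient

print(first, accumulated, after_zero)
```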

Step 4 — Inference (Making Predictions)

3 rules for inference:

```python
model.eval()                  # 1. switch to eval mode
with torch.inference_mode():  # 2. disable gradient tracking
    y_preds = model(X_test)   # 3. data & model must be on same device
```

Use torch.inference_mode() — it's faster than torch.no_grad() and is the modern preferred approach
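One observable effect of the context manager is that its outputs carry no gradient information. The sketch below uses a stand-in `nn.Linear` model and random inputs (both assumptions) and compares a normal forward pass with one inside `torch.inference_mode()`.

```python
import torch
from torch import nn

torch.manual_seed(42)
model = nn.Linear(in_features=1, out_features=1)  # stand-in for a trained model
X_test = torch.rand(10, 1)

tracked = model(X_test)          # normal forward pass: autograd records it

model.eval()
with torch.inference_mode():
    y_preds = model(X_test)      # no graph is built for this pass

print(tracked.requires_grad, y_preds.requires_grad)
```

Tensors created this way cannot be used later in a computation that needs gradients, so the context manager belongs in evaluation and prediction code only.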

Step 5 — Save & Load Model

Saving

```python
from pathlib import Path

# Create directory
MODEL_PATH = Path("models")
MODEL_PATH.mkdir(parents=True, exist_ok=True)

MODEL_SAVE_PATH = MODEL_PATH / "model_0.pth"

# Save only state_dict (recommended, lightweight)
torch.save(obj=model.state_dict(), f=MODEL_SAVE_PATH)
```

Loading

```python
# Must recreate model architecture first
loaded_model = LinearRegressionModel()

# Then load saved params
loaded_model.load_state_dict(torch.load(MODEL_SAVE_PATH))

loaded_model.eval()
with torch.inference_mode():
    preds = loaded_model(X_test)
```

Always save state_dict() not the whole model — it's smaller and more portable

Training vs Eval Mode — What Changes?

Mode	Layers Affected	Use When
`model.train()`	Dropout ON, BatchNorm uses batch stats	Training loop
`model.eval()`	Dropout OFF, BatchNorm uses running stats	Testing / Inference
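The Dropout row of the table can be verified with a standalone layer. This is a minimal sketch with an assumed drop probability of 0.5 and an input of ones, so the effect is easy to read off.

```python
import torch
from torch import nn

torch.manual_seed(42)
dropout = nn.Dropout(p=0.5)
x = torch.ones(1000)

dropout.train()
train_out = dropout(x)   # roughly half the values are zeroed,
                         # survivors are scaled by 1 / (1 - p) = 2

dropout.eval()
eval_out = dropout(x)    # dropout is a no-op in eval mode

print(train_out.unique(), eval_out.unique())
```

Calling `model.train()` or `model.eval()` on a whole model switches every such layer inside it at once.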

Key Takeaways

Every training loop has the same 5 steps — forward → loss → zero grad → backward → step
Always call model.eval() + torch.inference_mode() before making predictions
Save state_dict() not the full model — it's the standard, portable way
Loss going down = model learning — track both train and test loss to spot overfitting
model.train() and model.eval() matter — they control Dropout and BatchNorm behavior

PyTorch Linear Regression model:

PyTorch model building essentials:

PyTorch module	What does it do?
`torch.nn`	Contains all of the building blocks for computational graphs (essentially a series of computations executed in a particular way).
`torch.nn.Parameter`	Stores tensors that can be used with `nn.Module`. If `requires_grad=True` gradients (used for updating model parameters via gradient descent) are calculated automatically, this is often referred to as "autograd".
`torch.nn.Module`	The base class for all neural network modules, all the building blocks for neural networks are subclasses. If you're building a neural network in PyTorch, your models should subclass `nn.Module`. Requires a `forward()` method be implemented.
`torch.optim`	Contains various optimization algorithms (these tell the model parameters stored in `nn.Parameter` how to best change to improve gradient descent and in turn reduce the loss).
`def forward()`	All `nn.Module` subclasses require a `forward()` method, this defines the computation that will take place on the data passed to the particular `nn.Module` (e.g. the linear regression formula above).

Making predictions using torch.inference_mode()

```python
# Make predictions with model
with torch.inference_mode():
    y_preds = model_0(X_test)

# Note: in older PyTorch code you might also see torch.no_grad()
# with torch.no_grad():
#   y_preds = model_0(X_test)
```

You probably noticed we used torch.inference_mode() as a context manager (that's what the with torch.inference_mode(): is) to make the predictions.

As the name suggests, torch.inference_mode() is used when using a model for inference (making predictions).

torch.inference_mode() turns off a bunch of things (like gradient tracking, which is necessary for training but not for inference) to make forward-passes (data going through the forward() method) faster.

Train Model:

The whole idea of training is for a model to move from some unknown parameters (these may be random) to some known parameters. Or in other words, from a poor representation of the data to a better representation of the data.

One way to measure how poor or how wrong your model's predictions are is to use a loss function. Note: A loss function may also be called a cost function or criterion in different areas. For our case, we are going to refer to it as a loss function.

Things we need to train:

→ Loss function: A function to measure how wrong your model’s predictions are compared to the ideal outputs, lower is better.
→ Optimizer: Takes into account the loss of a model and adjusts the model’s parameters (e.g. weight & bias in our case) to improve the loss function.

Which loss function and which optimizer you use depends on the kind of problem you're working on.

However, some common choices are known to work well, such as the SGD (stochastic gradient descent) or Adam optimizer, and the MAE (mean absolute error) loss function for regression problems (predicting a number) or the binary cross entropy loss function for classification problems (predicting one thing or another).

For our problem, since we're predicting a number, let's use MAE (which is under torch.nn.L1Loss()) in PyTorch as our loss function.

Mean absolute error (MAE, in PyTorch: torch.nn.L1Loss) measures the absolute difference between two points (predictions and labels) and then takes the mean across all examples.
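To make the definition concrete, the sketch below computes MAE by hand and compares it with `nn.L1Loss()`. The prediction and label values are made up for illustration.

```python
import torch
from torch import nn

y_pred = torch.tensor([2.5, 0.0, 2.0, 8.0])
y_true = torch.tensor([3.0, -0.5, 2.0, 7.0])

# Absolute differences are 0.5, 0.5, 0.0 and 1.0, so their mean is 0.5
manual_mae = torch.mean(torch.abs(y_pred - y_true))

builtin_mae = nn.L1Loss()(y_pred, y_true)

print(manual_mae, builtin_mae)
```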

And we'll use SGD, torch.optim.SGD(params, lr) where:

params is the target model parameters you'd like to optimize (e.g. the weights and bias values we randomly set before).
lr is the learning rate you'd like the optimizer to update the parameters at, higher means the optimizer will try larger updates (these can sometimes be too large and the optimizer will fail to work), lower means the optimizer will try smaller updates (these can sometimes be too small and the optimizer will take too long to find the ideal values). The learning rate is considered a hyperparameter (because it's set by a machine learning engineer). Common starting values for the learning rate are 0.01, 0.001, 0.0001, however, these can also be adjusted over time (this is called learning rate scheduling).

```python
# Create the loss function
loss_fn = nn.L1Loss() # MAE loss is same as L1Loss

# Create the optimizer
optimizer = torch.optim.SGD(params=model_0.parameters(), # parameters of target model to optimize
                            lr=0.01) # learning rate (how much the optimizer should change parameters at each step, higher=more (less stable), lower=less (might take a long time))
```
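Learning rate scheduling, mentioned above, could look like the following sketch. It assumes `torch.optim.lr_scheduler.StepLR` with a step size of 10 epochs and a decay factor of 0.1; the stand-in model, the schedule values and the `lrs` list are illustrative choices, not something these notes prescribe.

```python
import torch
from torch import nn

model = nn.Linear(in_features=1, out_features=1)   # stand-in model
optimizer = torch.optim.SGD(params=model.parameters(), lr=0.01)

# Multiply the learning rate by 0.1 every 10 epochs
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=10, gamma=0.1)

lrs = []
for epoch in range(30):
    lrs.append(optimizer.param_groups[0]["lr"])  # learning rate used this epoch
    optimizer.step()     # placeholder for a real training step
    scheduler.step()     # advance the schedule once per epoch

print(lrs[0], lrs[10], lrs[20])
```

The scheduler is stepped after the optimizer, once per epoch, so the first ten epochs run at the starting value of 0.01.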

Q: Which loss function and optimizer should I use?

Ans: This will be problem specific. But with experience, you will get an idea of what works and what does not with your particular problem set. For example, for a regression problem (like ours), a loss function of nn.L1Loss() and an optimizer like torch.optim.SGD() will suffice.

But for a classification problem like classifying whether a photo is of a dog or a cat, you will likely want to use a loss function of nn.BCELoss() (binary cross entropy loss).
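For the classification case, a minimal sketch of `nn.BCELoss()` is shown below, with made-up logits and labels. `nn.BCELoss()` expects probabilities between 0 and 1, so raw model outputs go through `torch.sigmoid` first; `nn.BCEWithLogitsLoss()` combines both steps and is included here for comparison.

```python
import torch
from torch import nn

logits = torch.tensor([2.0, -1.0, 0.5])   # raw model outputs (hypothetical)
labels = torch.tensor([1.0, 0.0, 1.0])    # 1 = dog, 0 = cat, for example

probs = torch.sigmoid(logits)             # squash to the range (0, 1)
bce = nn.BCELoss()(probs, labels)

# Same loss computed directly from the logits
bce_from_logits = nn.BCEWithLogitsLoss()(logits, labels)

print(bce, bce_from_logits)
```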

And specifically for PyTorch, we need:

→ A training loop
→ A testing loop

Training Loop:

Testing:

modelname.eval() → you’re telling PyTorch that the model is no longer training, but instead will be used for testing or inference.

with torch.inference_mode(): → a PyTorch context manager used to make inference faster and more memory-efficient.

```python
with torch.inference_mode():
    output = model(input)
```

It:

Disables gradient calculation
Disables autograd tracking
Optimizes tensor operations for inference
Prevents PyTorch from storing computation graphs.
