Revisiting Deep Learning: AlexNet & ResNet

From depth to skip connections: what AlexNet and ResNet taught us

Jan 06, 2026

AlexNet: A deep convolutional network that kick-started modern deep learning. It introduced or popularized several key ideas: ReLU activations instead of tanh/sigmoid, multi-GPU training, dropout for regularization, and overlapping pooling.

ResNet: As networks grew deeper, a degradation problem appeared alongside the familiar vanishing/exploding gradients: accuracy saturated and then degraded, even on the training set. ResNet introduced residual blocks, built around skip connections that provide an alternate path for gradient flow. Because a residual block can easily fall back to an identity mapping, a deeper network should in principle perform no worse than its shallower counterpart.

Code is available on GitHub: Code

AlexNet Architecture - Key Features

ReLU Nonlinearity

Traditional nonlinear activation functions like tanh and sigmoid suffer from vanishing gradients when neurons saturate. Their exponentials are also relatively expensive to compute.
- This increases cost and slows training
ReLU addresses these issues with a piecewise-linear activation that does not saturate for positive inputs, is computationally cheap, and enables faster, more stable optimization:
\(\mathrm{ReLU}(x) = \max(0, x) \)
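To make the saturation argument concrete, here is a minimal sketch that compares the derivatives of sigmoid, tanh, and ReLU as the input grows. The function names are illustrative, and the ReLU derivative at exactly zero is set to 0 by convention.

```python
import math

def sigmoid_grad(x):
    # d/dx sigmoid(x) = s * (1 - s)
    s = 1.0 / (1.0 + math.exp(-x))
    return s * (1.0 - s)

def tanh_grad(x):
    # d/dx tanh(x) = 1 - tanh(x)^2
    return 1.0 - math.tanh(x) ** 2

def relu_grad(x):
    # Derivative of max(0, x): 1 for x > 0, 0 otherwise (convention at x = 0)
    return 1.0 if x > 0 else 0.0

for x in (0.5, 2.0, 5.0, 10.0):
    print(f"x={x:5.1f}  sigmoid'={sigmoid_grad(x):.2e}  "
          f"tanh'={tanh_grad(x):.2e}  relu'={relu_grad(x):.0f}")
```

The sigmoid and tanh derivatives shrink toward zero for large inputs, while the ReLU derivative stays at 1 for any positive input.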

Local Response Normalization

Biologically inspired mechanism that encourages competition among neurons at the same spatial location across neighboring channels, promoting sparse and locally normalized activations.

For an activation \(a_{x,y}^{i}\) at spatial location \((x,y)\) and channel index \(i\), the normalized output \(b_{x,y}^{i}\) is:

\( b_{x,y}^{i} = \frac{a_{x,y}^{i}} {\left( k + \alpha \sum_{j=\max\left(0,\, i-\frac{n}{2}\right)}^{\min\left(N-1,\, i+\frac{n}{2}\right)} \left(a_{x,y}^{j}\right)^2 \right)^{\beta}} \)

The normalization is performed across channels, not spatially
The summation runs over n adjacent channels at the same spatial location
N is the total number of channels (feature maps) in the layer
This introduces local competition among nearby feature maps

In AlexNet, LRN uses k = 2, α = 1e-4, β=0.75, and normalizes over n = 5 neighbouring channels.
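The formula can be sketched directly in NumPy. This is a readable, unoptimized version for a single channels-first feature map; the function name and the use of integer division for n/2 are assumptions made for illustration.

```python
import numpy as np

def local_response_norm(a, k=2.0, alpha=1e-4, beta=0.75, n=5):
    """Cross-channel LRN for one feature map `a` of shape (N, H, W)."""
    N = a.shape[0]
    sq = a.astype(float) ** 2
    b = np.empty_like(a, dtype=float)
    half = n // 2  # assumption: n/2 is taken as integer division
    for i in range(N):
        lo = max(0, i - half)
        hi = min(N - 1, i + half)
        # Sum of squares over neighbouring channels at each (x, y)
        denom = (k + alpha * sq[lo:hi + 1].sum(axis=0)) ** beta
        b[i] = a[i] / denom
    return b

features = np.ones((8, 2, 2))
normalized = local_response_norm(features)
print(normalized[4, 0, 0], normalized[0, 0, 0])
```

Channels near the edge (such as channel 0) sum over fewer neighbours, so their denominator is slightly smaller than for interior channels.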

Why was LRN used?

Although ReLU activations are non-saturating and do not strictly require normalization, LRN was used in AlexNet as a cross-channel normalization mechanism that encouraged local competition, improved generalization, and reduced overfitting in early deep CNNs.

Modern networks have largely replaced LRN with BatchNorm [3].

BatchNorm:

Normalizes activations using batch-wise mean and variance
Stabilizes training; the original paper attributes this to reduced internal covariate shift
Accelerates convergence and allows higher learning rates
Learns scale and shift parameters (γ,β)
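For comparison with LRN, here is a minimal sketch of the BatchNorm forward pass in training mode, assuming a 2D input of shape (batch, features). The function name and the `eps` value are illustrative, and the running statistics used at inference time are left out.

```python
import numpy as np

def batch_norm_train(x, gamma, beta, eps=1e-5):
    """Normalize each feature over the batch, then scale and shift."""
    mean = x.mean(axis=0)                 # per-feature batch mean
    var = x.var(axis=0)                   # per-feature batch variance
    x_hat = (x - mean) / np.sqrt(var + eps)
    return gamma * x_hat + beta           # learnable scale (gamma) and shift (beta)

rng = np.random.default_rng(0)
x = rng.normal(loc=5.0, scale=3.0, size=(32, 4))
out = batch_norm_train(x, gamma=np.ones(4), beta=np.zeros(4))
print(out.mean(axis=0), out.std(axis=0))
```

With gamma = 1 and beta = 0, each output feature has approximately zero mean and unit standard deviation over the batch.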

LRN:

Normalizes activations across neighboring channels at the same spatial location
Does not explicitly stabilize gradients
Introduces additional computation
Relies on hand-tuned hyperparameters (k,α,β,n)
Provides limited regularization compared to BatchNorm
Compute-inefficient on modern hardware:
- Requires channel-wise reductions
- Leads to poor GPU utilization
- Is memory-bandwidth heavy
Poor fit for modern accelerators, lacking efficient fused implementations

Overlapping pooling

Traditional pooling typically uses:

kernel size (k) = stride (s) → non-overlapping regions
e.g., 2×2 pooling with stride 2

AlexNet instead uses:

Max pooling
Kernel size: k=3×3
Stride: s=2

Because s < k, adjacent pooling windows overlap.
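A small 1D sketch shows how the windows overlap when s < k. The helper names and the sample row are made up for illustration; the 55-wide input in the last line is just an example size.

```python
def pool_output_size(size, kernel, stride):
    # Number of window positions without padding
    return (size - kernel) // stride + 1

def max_pool_1d(values, kernel, stride):
    # Slide a window of length `kernel` in steps of `stride`
    out = []
    for start in range(0, len(values) - kernel + 1, stride):
        out.append(max(values[start:start + kernel]))
    return out

row = [1, 5, 2, 0, 3, 4, 9]
print(max_pool_1d(row, kernel=2, stride=2))  # non-overlapping: [5, 2, 4]
print(max_pool_1d(row, kernel=3, stride=2))  # overlapping:     [5, 3, 9]
print(pool_output_size(55, kernel=3, stride=2))  # 27
```

In the overlapping case, the elements at indices 2 and 4 each fall inside two adjacent windows, which is the overlap described above.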

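The authors observed the following:

Reduces top-1 and top-5 error rates by 0.4% and 0.3%, respectively, compared to non-overlapping pooling (k = 2, s = 2)
Makes the model slightly harder to overfit, acting as a mild regularizer
Slightly increases computation, but with measurable gains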

Reducing overfitting

Data Augmentation

Random cropping
Preprocessing
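- Original ImageNet images were rescaled so the shorter side = 256
- Aspect ratio preserved, then the central 256 × 256 patch was cropped out
During training
- Random 224 × 224 crops were sampled from the 256 × 256 image
During testing
- Predictions were averaged over ten crops: the four corners and the center, plus their horizontal reflections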
Why it helps
- Translation invariance
- Forces robustness to object position
- Multiplies dataset size implicitly

Horizontal flipping

Each crop was randomly mirrored
Doubles the effective dataset size
Why it helps
- Objects are usually left–right symmetric
- Safe augmentation (label preserved)
- Very cheap computationally
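The two training-time augmentations above (random cropping and horizontal flipping) can be sketched together in NumPy. The function name, the (H, W, 3) layout, and the 50% flip probability are assumptions for illustration, not details taken from the original code.

```python
import numpy as np

def random_crop_and_flip(image, crop=224, rng=None):
    """Sample a random crop x crop patch and mirror it half of the time.

    `image` is an (H, W, 3) array with H and W at least `crop`.
    """
    rng = np.random.default_rng() if rng is None else rng
    h, w = image.shape[:2]
    top = rng.integers(0, h - crop + 1)    # upper bound is exclusive
    left = rng.integers(0, w - crop + 1)
    patch = image[top:top + crop, left:left + crop]
    if rng.random() < 0.5:
        patch = patch[:, ::-1]             # reverse the width axis (left-right mirror)
    return patch

image = np.zeros((256, 256, 3), dtype=np.uint8)
patch = random_crop_and_flip(image, rng=np.random.default_rng(0))
print(patch.shape)
```

Both operations only index into the array, so they are cheap enough to run on the fly for every training sample.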

PCA-based color augmentation (important & non-obvious)

The technique perturbs image colors in a statistically grounded way, rather than using arbitrary jitter.

This was novel at the time.

What they did

Compute PCA on RGB values (offline)
- Collect all RGB pixel values from the ImageNet training set
- Treat each pixel as a 3D vector [R,G,B]
- Perform PCA on this distribution
- This yields:
  - Eigenvectors P ∈ ℝ^{3×3} → principal directions of color variation
  - Eigenvalues Λ = [λ1,λ2,λ3] → magnitude of variation along each direction
For each training image, modify every pixel as:

\( \tilde{I}(x,y) = I(x,y) + P \left( \alpha \odot \Lambda \right), \quad \alpha_i \sim \mathcal{N}(0, 0.1^2) \)

Where:

I(x,y) is the original RGB pixel
α is a vector of three random coefficients (standard deviation 0.1), drawn once per image each time it is used for training
⊙: element-wise multiplication
The same RGB offset is added to every pixel
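The procedure can be sketched in NumPy as follows. The function names are hypothetical, and for brevity the toy usage fits the PCA on a single random image rather than on the whole training set.

```python
import numpy as np

def fit_rgb_pca(pixels):
    """pixels: (M, 3) array of RGB values. Returns eigenvectors P and eigenvalues lam."""
    cov = np.cov(pixels, rowvar=False)   # 3x3 covariance of the RGB channels
    lam, P = np.linalg.eigh(cov)         # eigenvectors are the columns of P
    return P, lam

def pca_color_augment(image, P, lam, sigma=0.1, rng=None):
    """Add one random RGB offset, P @ (alpha * lam), to every pixel of `image`."""
    rng = np.random.default_rng() if rng is None else rng
    alpha = rng.normal(0.0, sigma, size=3)   # drawn once per image
    offset = P @ (alpha * lam)               # shape (3,)
    return image + offset                    # broadcasts over height and width

rng = np.random.default_rng(0)
image = rng.random((8, 8, 3))                # toy image with values in [0, 1)
P, lam = fit_rgb_pca(image.reshape(-1, 3))
augmented = pca_color_augment(image, P, lam, rng=rng)
print(augmented[0, 0] - image[0, 0])
```

Because the offset is scaled by the eigenvalues, the perturbation is largest along the directions in which the colors of the data already vary the most.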

Why it helps

Simulates lighting changes
Preserves structure while altering color statistics
Improves color invariance

Dropout

Where dropout was applied

Only in the first two fully connected layers, with drop probability 0.5

Not used in convolutional layers

Why dropout was necessary in AlexNet

1. Massive parameterization

~60 million parameters
Majority in FC layers

Without dropout:

The network overfit substantially
Training error kept decreasing
Validation error stopped improving → severe overfitting

2. Prevents co-adaptation

Dropout forces neurons to:

Work independently
Learn redundant representations
Avoid brittle feature dependencies

This was crucial before BatchNorm existed.
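A minimal sketch of dropout in NumPy follows. It uses the "inverted" variant that is common today, where kept activations are rescaled during training; the AlexNet paper instead multiplied outputs by 0.5 at test time. The function name and signature are illustrative.

```python
import numpy as np

def dropout(x, p=0.5, training=True, rng=None):
    """Zero each activation with probability p; rescale the rest by 1 / (1 - p)."""
    if not training or p == 0.0:
        return x                          # identity at test time
    rng = np.random.default_rng() if rng is None else rng
    mask = rng.random(x.shape) >= p       # True where the unit is kept
    return x * mask / (1.0 - p)

activations = np.ones(1000)
dropped = dropout(activations, p=0.5, rng=np.random.default_rng(0))
print(np.unique(dropped), dropped.mean())
```

Each forward pass samples a different mask, so no unit can rely on a specific partner being present, which is the co-adaptation argument above.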

ResNet18 Architecture - Key Features

Residual Learning (The Big Idea)

Before ResNet, deeper networks often performed worse than their shallower counterparts. Once deeper networks start to converge, a degradation problem appears: as depth increases, accuracy saturates and then degrades rapidly. This degradation is not caused by overfitting, because adding more layers to a suitably deep model leads to higher training error, not just higher test error.

Residual learning was introduced to make such networks easier to optimize.

Let H(x) be the mapping that a stack of layers is trying to learn, where x is the input to those layers.

Instead of learning H(x) directly, the network learns a residual function F(x) = H(x) - x.

So the original function becomes H(x) = F(x) + x.

In short, the network learns change relative to the input instead of relearning the entire transformation.

In practice, the idea is implemented using skip (shortcut) connections.

A typical residual block [2] consists of a few stacked layers (e.g., Conv → BN → ReLU), a shortcut connection that bypasses these layers, and an element-wise addition of the input and the stacked layers' output.

These skip connections are:

Parameter-free
Identity mappings
Simple additions

Why is this useful?

Solves the degradation problem
As networks get deeper:
- Training error increases after a point (even without overfitting)
- Gradients become weak or unstable
Residual connections allow gradients to flow directly through the network, bypassing multiple nonlinear layers. This stabilizes optimization and makes very deep models trainable.
Identity mapping is easy to learn: if the best a block can do is pass its input through, it only needs to drive F(x) toward zero
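The identity argument can be checked with a toy residual block. This sketch uses dense layers in place of convolutions and omits BatchNorm, so it is a simplification of the real block; the names and sizes are illustrative.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def residual_block(x, W1, W2):
    """y = relu(F(x) + x), with F(x) = W2 @ relu(W1 @ x)."""
    f = W2 @ relu(W1 @ x)      # residual branch F(x)
    return relu(f + x)         # shortcut adds the input back before the final ReLU

d = 4
rng = np.random.default_rng(0)
x = relu(rng.normal(size=d))   # non-negative input, as if it came from a previous ReLU
W_zero = np.zeros((d, d))

y = residual_block(x, W_zero, W_zero)
print(np.allclose(y, x))       # True: with F(x) = 0 the block is the identity
```

Differentiating \(H(x) = F(x) + x\) gives \(\partial H / \partial x = \partial F / \partial x + I\), so the gradient always has a direct path through the identity term even when the residual branch contributes little.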

Bottleneck Architecture

For very deep networks (50+ layers), ResNet introduced the bottleneck block:

1×1 conv → 3×3 conv → 1×1 conv

Figure: basic block (left) and bottleneck block (right) [2].

Why does it matter?

1×1 convolutions reduce and restore channel dimensions
Significantly lowers:
- Parameters
- FLOPs
Makes 100+ layer networks computationally feasible
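A quick back-of-the-envelope count illustrates the savings. The sketch below compares two 3×3 convolutions at full width with a 1×1 → 3×3 → 1×1 bottleneck at the same input and output width. The widths 256 and 64 are chosen for illustration, and biases and BatchNorm parameters are ignored.

```python
def conv_params(c_in, c_out, k):
    # Weights of a k x k convolution, ignoring biases and BatchNorm
    return k * k * c_in * c_out

def two_3x3_params(channels):
    # Two stacked 3x3 convolutions at full width
    return 2 * conv_params(channels, channels, 3)

def bottleneck_params(channels, reduced):
    # 1x1 reduce -> 3x3 at reduced width -> 1x1 restore
    return (conv_params(channels, reduced, 1)
            + conv_params(reduced, reduced, 3)
            + conv_params(reduced, channels, 1))

print(two_3x3_params(256))         # 1179648
print(bottleneck_params(256, 64))  # 69632
```

At these widths the bottleneck needs roughly 17 times fewer weights, and FLOPs scale the same way for a fixed spatial resolution.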
