MNIST: MLP vs CNN
Two architectures, one dataset, and a lesson in spatial awareness.
A side-by-side comparison of a classic Multilayer Perceptron and a Convolutional Neural Network on handwritten digit classification. Same data, different assumptions about structure, and a 1.37 percentage point gap in accuracy that reveals why convolutions changed computer vision.
CNN Accuracy: 99.29%
MLP Accuracy: 97.92%
Training Images: 60K
Models: 2
The challenge
MNIST is 70,000 grayscale images of handwritten digits (0-9), each 28×28 pixels. It is the "hello world" of deep learning. Simple enough to train on a laptop, but complex enough to reveal fundamental differences between architectures.
Each image is 28×28 pixels, giving 784 grayscale values between 0 and 255. The question: can a model learn which digit each image represents?
Two approaches
MLP
Flatten, then Dense layers. Treats each pixel independently with no concept of spatial relationships.
784 → 512 → 256 → 10
CNN
Convolutions → Pooling → Dense. Preserves spatial structure and learns local features.
28×28 → Conv(32) → Pool → Conv(64) → Pool → 128 → 10
The MLP sees a list of 784 numbers. The CNN sees a 28×28 image.
Multilayer Perceptron
scikit-learn
The MLP's first step is destructive: it flattens the 28×28 image into a single vector of 784 numbers. Row 1 sits next to Row 2, but the pixel directly above is now 28 positions away. Spatial structure is gone before the model even starts learning.
Original: 28×28 image
In 2D, each interior pixel has 8 natural neighbors. The CNN exploits this structure. The MLP destroys it.
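The damage is easy to demonstrate with a few lines of NumPy (a toy illustration, not part of the original experiment): after a row-major flatten, horizontal neighbors stay adjacent, but vertical neighbors drift 28 positions apart.

```python
import numpy as np

# A toy 28x28 "image" whose value at (r, c) is its own flat index.
img = np.arange(28 * 28).reshape(28, 28)

flat = img.flatten()  # row-major: pixel (r, c) lands at index r * 28 + c

# Horizontal neighbors stay adjacent after flattening...
assert flat[1] - flat[0] == 1   # (0, 0) and (0, 1)

# ...but vertical neighbors end up 28 positions apart.
idx_a = 0 * 28 + 5   # pixel (0, 5)
idx_b = 1 * 28 + 5   # pixel (1, 5), directly below it
print(idx_b - idx_a)  # 28
```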
Network Architecture
Watch the signal pulse through the network. Each input pixel feeds into every neuron in the first hidden layer (512 neurons), which feeds into the second (256 neurons), and finally into 10 output neurons representing digits 0 through 9. Every connection carries a learned weight, and the total comes to over 535,000 learnable parameters.
Each neuron computes a weighted sum of its inputs, adds a bias, and applies ReLU activation (which zeros out negative values). The output layer uses Softmax to produce a probability distribution across the 10 digit classes. The highest probability becomes the prediction.
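The 535,000 figure follows directly from the layer sizes; a quick sanity check in Python:

```python
# Parameter count for the 784 -> 512 -> 256 -> 10 MLP described above.
layers = [784, 512, 256, 10]

total = 0
for n_in, n_out in zip(layers, layers[1:]):
    params = n_in * n_out + n_out  # weight matrix + bias vector
    print(f"{n_in:>4} -> {n_out:<4}: {params:>7,} parameters")
    total += params

print(f"Total: {total:,}")  # 535,818 — the "over 535,000" quoted above
```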
Activation functions
Without activation functions, stacking layers would just produce another linear transformation, regardless of depth. Nonlinearities are what give neural networks the capacity to learn complex decision boundaries.
ReLU's gradient is either 0 or 1: no exponentials, no divisions. That simplicity, plus the fact that the gradient never shrinks for large positive inputs the way sigmoid's does, is why networks built on ReLU train faster.
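ReLU and its gradient fit in two lines of NumPy each; the version below is a sketch for illustration, not a library implementation:

```python
import numpy as np

def relu(x):
    # max(0, x): passes positives through, zeros out negatives
    return np.maximum(0.0, x)

def relu_grad(x):
    # Gradient is 1 for positive inputs, 0 otherwise — no exponentials, no divisions
    return (x > 0).astype(float)

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(relu(x))       # zeros for the negatives, identity for the positives
print(relu_grad(x))  # 0 where the input was <= 0, 1 where it was > 0
```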
Forward pass math
Data flows through the network one layer at a time. Each layer applies a linear transformation (matrix multiply + bias) then a nonlinearity. The final layer uses Softmax to produce class probabilities, and Cross-Entropy measures how wrong those probabilities are.
Single Neuron
Each neuron computes a weighted sum of its inputs, adds a bias term, and applies an activation function. The weights w and bias b are the learnable parameters that training optimizes.
Layer Forward Pass
A full layer applies the neuron computation in parallel using matrix multiplication. W is the weight matrix, A is the activation from the previous layer, and b is the bias vector. This is the core operation repeated at every layer.
Softmax Output
Converts the raw output scores (logits) from the final layer into a probability distribution that sums to 1. The exponential amplifies differences between scores, making the network more decisive.
Cross-Entropy Loss
Measures how far the predicted probability distribution is from the true label. When the model is confident and correct, loss is near zero. When it assigns low probability to the true class, loss spikes. This is the signal that drives weight updates during training.
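The whole forward pass above can be sketched in NumPy with randomly initialized weights. Layer sizes come from the architecture described earlier; the helper names and initialization scale are our own assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(x):
    return np.maximum(0.0, x)

def softmax(z):
    # Subtract the max for numerical stability; the result sums to 1
    e = np.exp(z - z.max())
    return e / e.sum()

def cross_entropy(probs, true_class):
    # -log of the probability assigned to the correct class
    return -np.log(probs[true_class])

# Randomly initialized parameters for the 784 -> 512 -> 256 -> 10 architecture
W1, b1 = rng.normal(0, 0.01, (512, 784)), np.zeros(512)
W2, b2 = rng.normal(0, 0.01, (256, 512)), np.zeros(256)
W3, b3 = rng.normal(0, 0.01, (10, 256)), np.zeros(10)

x = rng.random(784)            # one flattened "image"
a1 = relu(W1 @ x + b1)         # layer 1: linear transform + nonlinearity
a2 = relu(W2 @ a1 + b2)        # layer 2
probs = softmax(W3 @ a2 + b3)  # output: probability distribution over 10 digits

print(probs.sum())              # 1, up to float error
print(cross_entropy(probs, 7))  # near -log(1/10) ~ 2.3: untrained output is almost uniform
```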
Training works by computing the gradient of the loss with respect to every weight (backpropagation), then nudging each weight in the direction that reduces the loss (gradient descent). The optimizer (Adam, in both models here) adapts the learning rate per-parameter for faster convergence.
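Adam's per-parameter adaptation takes more than a few lines, but the core gradient-descent loop does not. Here it is on a toy one-weight loss, (w − 3)², purely for illustration:

```python
# Toy gradient descent: minimize loss(w) = (w - 3)^2, whose gradient is 2 * (w - 3).
w = 0.0    # initial weight
lr = 0.1   # learning rate

for step in range(50):
    grad = 2 * (w - 3)  # gradient of the loss with respect to w
    w -= lr * grad      # nudge w in the direction that reduces the loss

print(round(w, 4))  # ~3.0, the loss minimum
```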
Implementation
MLP (scikit-learn)
Test Accuracy: 97.92%
hidden_layer_sizes: (512, 256)
activation: relu
max_iter: 20
solver: adam
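With these settings, constructing the scikit-learn model is a few lines. This is a sketch: it assumes the flattened, normalized X_train / y_train arrays from the code walkthrough at the end, and note that scikit-learn names the optimizer argument `solver`:

```python
from sklearn.neural_network import MLPClassifier

# Configuration matching the card above; solver='adam' is scikit-learn's
# name for the Adam optimizer.
mlp = MLPClassifier(
    hidden_layer_sizes=(512, 256),
    activation='relu',
    solver='adam',
    max_iter=20,
)

# Assuming X_train / y_train from the data-loading walkthrough:
# mlp.fit(X_train, y_train)
# print(mlp.score(X_test, y_test))  # ~0.98 in the run reported above
```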
Convolutional Neural Network
TensorFlow / Keras
Instead of flattening, the CNN keeps the 2D structure intact. Small 3×3 filters slide across the image, learning to detect edges, curves, and corners. Each filter produces a feature map, which is a new image that highlights where a specific pattern was found.
How convolution works
The 3×3 kernel slides across the input, computing element-wise products and summing them at each position. This single kernel detects one type of pattern (edges in this case). A conv layer uses 32 or 64 different kernels simultaneously.
Convolution math
The convolution operation is the CNN's key advantage. Instead of learning a separate weight for every input pixel, it learns a small kernel that slides across the entire image. This enforces weight sharing and translation equivariance.
2D Convolution
The kernel f slides across the input g, computing element-wise products and summing them at each position. This produces a feature map that highlights where the kernel's pattern was detected.
Conv Layer Parameters
F = number of filters, k = kernel size, C_in = input channels. The +1 accounts for the bias per filter. A 3×3 conv layer with 32 filters on a single-channel input has only 32 × (9 + 1) = 320 parameters.
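The formula is easy to check in code (a throwaway helper, not a library API):

```python
def conv_params(filters, kernel, in_channels):
    # F * (k*k*C_in + 1): each filter has k*k*C_in weights plus one bias
    return filters * (kernel * kernel * in_channels + 1)

print(conv_params(32, 3, 1))   # 320 — the first conv layer
print(conv_params(64, 3, 32))  # 18,496 — the second conv layer

# Compare with the MLP's first-layer weight count (784 * 512 = 401,408):
print(401408 // conv_params(32, 3, 1))  # ~1,254x more weights in the MLP
```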
Network Architecture
The data flows left to right through the network. The input image (28×28) passes through two convolution blocks. Each block applies learned filters that detect increasingly complex patterns, followed by max pooling that halves the spatial dimensions while keeping the strongest activations. Notice how the feature maps get smaller but deeper at each stage.
After the second pooling layer, the 5×5×64 feature maps are flattened into a 1,600-dimensional vector. This is where the CNN transitions from spatial feature extraction to classification. A dense layer with 128 neurons processes these features, and dropout (30%) randomly disables neurons during training to prevent overfitting. The final softmax layer produces digit probabilities.
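The 5×5×64 shape can be verified with the usual size arithmetic: a 'valid' (no-padding) 3×3 convolution shrinks each spatial dimension by k − 1 = 2, and 2×2 max pooling halves it, rounding down. Assuming that layout:

```python
# Spatial size through the CNN, assuming 'valid' 3x3 convolutions
# and 2x2 max pooling — consistent with the 5x5x64 shape described above.
size = 28
size -= 2      # Conv 3x3, valid padding: 28 -> 26
size //= 2     # MaxPool 2x2:             26 -> 13
size -= 2      # Conv 3x3:                13 -> 11
size //= 2     # MaxPool 2x2:             11 -> 5

print(size, size * size * 64)  # 5, 1600 — flattened into a 1,600-dim vector
```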
Implementation
CNN (TensorFlow / Keras)
Test Accuracy: 99.29%
conv_filters: 32 → 64
kernel_size: 3×3
dense: 128
dropout: 0.3
optimizer: adam
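Assembled in Keras, the configuration above looks roughly like this. It is a sketch following the architecture described in this article, not the authors' exact script; the fit/evaluate calls are commented out and assume the X_train_cnn / y_train arrays from the data-loading walkthrough:

```python
from tensorflow.keras import layers, models

# Two conv + pool blocks, then dense(128) with 30% dropout, then softmax —
# matching the architecture walked through above.
model = models.Sequential([
    layers.Input(shape=(28, 28, 1)),
    layers.Conv2D(32, (3, 3), activation='relu'),
    layers.MaxPooling2D((2, 2)),
    layers.Conv2D(64, (3, 3), activation='relu'),
    layers.MaxPooling2D((2, 2)),
    layers.Flatten(),                 # 5x5x64 -> 1,600-dim vector
    layers.Dense(128, activation='relu'),
    layers.Dropout(0.3),              # disables 30% of neurons during training
    layers.Dense(10, activation='softmax'),
])
model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])

# Assuming X_train_cnn / y_train from the data-loading walkthrough:
# model.fit(X_train_cnn, y_train, epochs=5, validation_split=0.1)
# model.evaluate(X_test_cnn, y_test)  # ~0.99 in the run reported above
```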
Why CNN wins
A flat list of 64 numbers. No rows, no columns, no spatial structure. Pixel 5 and pixel 13 sit eight positions apart in the list, yet in the image pixel 13 was directly below pixel 5.
A 2D grid. The 3×3 kernel slides across, detecting local patterns like edges, curves, and corners. Every pixel is processed in context of its neighbors.
Weight sharing
One 3×3 kernel is reused across all spatial positions. The MLP needs 784×512 = 401,408 unique weights just for the first layer. The CNN's first conv layer uses only 32×(3×3×1+1) = 320 parameters. That is 1,254× fewer.
Translation invariance
A "7" in the top-left corner activates the same edge filters as a "7" in the center. The CNN recognizes patterns regardless of position. The MLP must learn each position separately.
Known limitations
Both architectures have well-documented failure modes. Understanding these limitations is as important as understanding the models themselves.
Vanishing gradients
Our MLP uses ReLU, which avoids the worst of this problem, but with only 2 hidden layers and 20 training iterations, gradients in the first layer can still be weak. Deeper MLPs would face this more severely.
Dying ReLU neurons
With 512 + 256 ReLU neurons in our MLP, some inevitably receive only negative inputs and stop contributing. These dead neurons waste capacity. Our CNN's 128-neuron dense layer is smaller but more efficiently utilized thanks to the conv layers doing feature extraction first.
Our MLP's spatial blindness
This is our MLP's core limitation on MNIST. Flattening 28×28 into 784 values discards the geometry: vertically adjacent pixels end up 28 positions apart, while pixels on opposite edges of neighboring rows sit side by side in the vector. Our MLP needs 535K+ parameters to compensate for the lost spatial structure.
Our CNN's fixed receptive field
Each of our CNN's 3×3 conv layers only sees a local patch. A single layer cannot detect a full digit shape. Stacking two conv + pool blocks gives the deepest neurons an effective receptive field covering most of the 28×28 input, but larger images would need more depth.
Results comparison
Training progress
Generalization gap
MLP: 1.70% gap
CNN: 0.56% gap
Both models generalize well at 97.9%+ accuracy. The CNN's smaller gap (0.56% vs 1.70%) suggests its spatial inductive bias requires less memorization of individual pixel positions.
Confusion matrices
4 ↔ 9 : Similar upper structure; MLP confuses open/closed tops
3 ↔ 8 : Overlapping curves; differs only by closed middle loop
7 ↔ 2 : Angled strokes can appear similar when flattened
What went wrong?
Even the best models make mistakes. Examining specific failures reveals what each architecture struggles to differentiate. These are representative examples from the MLP's 208 misclassifications.
Actual: 4 → Predicted: 9
The top loop of this 4 is fully closed, making it nearly identical to a 9 when viewed as a flat pixel vector. Our MLP has no way to detect that the crossbar extends left, a structural cue. Our CNN's conv filters pick up that horizontal edge, which 9s lack.
Actual: 3 → Predicted: 8
This 3 is open on the left, but the upper and lower curves bulge close together near the middle. Our MLP flattens the image and sees the overall pixel distribution, which looks similar to an 8's two stacked loops. Our CNN's 3×3 filters trace the actual contour and detect that the left side never closes.
Actual: 7 → Predicted: 2
This 7 has a European-style crossbar and a bottom stroke that curves into a flat base, creating a silhouette nearly identical to a 2. Both our MLP and CNN see a horizontal top, a diagonal descent, and a horizontal bottom. Even with spatial awareness, the structure is genuinely ambiguous at this resolution.
The CNN corrected 137 of 208 MLP errors by leveraging spatial features the MLP could never see.
Code walkthrough
from sklearn.datasets import fetch_openml
from sklearn.model_selection import train_test_split
import numpy as np

# Load MNIST dataset
mnist = fetch_openml('mnist_784', version=1, as_frame=False)
X, y = mnist.data / 255.0, mnist.target.astype(int)

# Split into train/test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=10000, random_state=42
)

# For CNN: reshape to 28x28x1
X_train_cnn = X_train.reshape(-1, 28, 28, 1)
X_test_cnn = X_test.reshape(-1, 28, 28, 1)

print(f"Training: {X_train.shape[0]} samples")
print(f"Test: {X_test.shape[0]} samples")