MNIST: MLP vs CNN
Two architectures, one dataset, and a lesson in spatial awareness.
A side-by-side comparison of a classic Multilayer Perceptron and a Convolutional Neural Network on handwritten digit classification. Same data, different assumptions about structure, and a 1.37 percentage point gap in accuracy that reveals why convolutions changed computer vision.
CNN Accuracy: 99.29%
MLP Accuracy: 97.92%
Training Images: 60K
Models: 2
The challenge
MNIST is 70,000 grayscale images of handwritten digits (0-9), each 28×28 pixels. It is the "hello world" of deep learning. Simple enough to train on a laptop, but complex enough to reveal fundamental differences between architectures.
Each image is 28×28 pixels, giving 784 grayscale values between 0 and 255. The question: can a model learn which digit each image represents?
Two approaches
MLP
Flatten, then Dense layers. Treats each pixel independently with no concept of spatial relationships.
784 → 512 → 256 → 10
CNN
Convolutions → Pooling → Dense. Preserves spatial structure and learns local features.
28×28 → Conv(32) → Pool → Conv(64) → Pool → 128 → 10
The MLP sees a list of 784 numbers. The CNN sees a 28×28 image.
Multilayer Perceptron
scikit-learn
The MLP's first step is destructive: it flattens the 28×28 image into a single vector of 784 numbers. Row 1 sits next to Row 2, but the pixel directly above is now 28 positions away. Spatial structure is gone before the model even starts learning.
Original: 28×28 image
In 2D, each interior pixel has 8 natural neighbors. The CNN exploits this structure. The MLP destroys it.
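The damage is easy to demonstrate with a few lines of NumPy (a toy illustration, not part of the original experiment): after a row-major flatten, horizontal neighbors stay adjacent, but vertical neighbors drift 28 positions apart.

```python
import numpy as np

# A toy 28x28 "image" whose value at (r, c) is its own flat index.
img = np.arange(28 * 28).reshape(28, 28)

flat = img.flatten()  # row-major: pixel (r, c) lands at index r * 28 + c

# Horizontal neighbors stay adjacent after flattening...
assert flat[1] - flat[0] == 1   # (0, 0) and (0, 1)

# ...but vertical neighbors end up 28 positions apart.
idx_a = 0 * 28 + 5   # pixel (0, 5)
idx_b = 1 * 28 + 5   # pixel (1, 5), directly below it
print(idx_b - idx_a)  # 28
```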
Network Architecture
Watch the signal pulse through the network. Each input pixel feeds into every neuron in the first hidden layer (512 neurons), which feeds into the second (256 neurons), and finally into 10 output neurons representing digits 0 through 9. Every connection carries a learned weight, and the total comes to over 535,000 learnable parameters.
Each neuron computes a weighted sum of its inputs, adds a bias, and applies ReLU activation (which zeros out negative values). The output layer uses Softmax to produce a probability distribution across the 10 digit classes. The highest probability becomes the prediction.
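The 535,000 figure follows directly from the layer sizes; a quick sanity check in Python:

```python
# Parameter count for the 784 -> 512 -> 256 -> 10 MLP described above.
layers = [784, 512, 256, 10]

total = 0
for n_in, n_out in zip(layers, layers[1:]):
    params = n_in * n_out + n_out  # weight matrix + bias vector
    print(f"{n_in:>4} -> {n_out:<4}: {params:>7,} parameters")
    total += params

print(f"Total: {total:,}")  # 535,818 — the "over 535,000" quoted above
```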
Activation functions
Without activation functions, stacking layers would just produce another linear transformation, regardless of depth. Nonlinearities are what give neural networks the capacity to learn complex decision boundaries.
ReLU's gradient is either 0 or 1: no exponentials, no divisions. That simplicity, plus the fact that the gradient never shrinks for large positive inputs the way sigmoid's does, is why networks built on ReLU train faster.
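ReLU and its gradient fit in two lines of NumPy each; the version below is a sketch for illustration, not a library implementation:

```python
import numpy as np

def relu(x):
    # max(0, x): passes positives through, zeros out negatives
    return np.maximum(0.0, x)

def relu_grad(x):
    # Gradient is 1 for positive inputs, 0 otherwise — no exponentials, no divisions
    return (x > 0).astype(float)

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(relu(x))       # zeros for the negatives, identity for the positives
print(relu_grad(x))  # 0 where the input was <= 0, 1 where it was > 0
```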
Forward pass math
Data flows through the network one layer at a time. Each layer applies a linear transformation (matrix multiply + bias) then a nonlinearity. The final layer uses Softmax to produce class probabilities, and Cross-Entropy measures how wrong those probabilities are.
Single Neuron
Each neuron computes a weighted sum of its inputs, adds a bias term, and applies an activation function. The weights w and bias b are the learnable parameters that training optimizes.
Layer Forward Pass
A full layer applies the neuron computation in parallel using matrix multiplication. W is the weight matrix, A is the activation from the previous layer, and b is the bias vector. This is the core operation repeated at every layer.
Softmax Output
Converts the raw output scores (logits) from the final layer into a probability distribution that sums to 1. The exponential amplifies differences between scores, making the network more decisive.
Cross-Entropy Loss
Measures how far the predicted probability distribution is from the true label. When the model is confident and correct, loss is near zero. When it assigns low probability to the true class, loss spikes. This is the signal that drives weight updates during training.
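The whole forward pass above can be sketched in NumPy with randomly initialized weights. Layer sizes come from the architecture described earlier; the helper names and initialization scale are our own assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(x):
    return np.maximum(0.0, x)

def softmax(z):
    # Subtract the max for numerical stability; the result sums to 1
    e = np.exp(z - z.max())
    return e / e.sum()

def cross_entropy(probs, true_class):
    # -log of the probability assigned to the correct class
    return -np.log(probs[true_class])

# Randomly initialized parameters for the 784 -> 512 -> 256 -> 10 architecture
W1, b1 = rng.normal(0, 0.01, (512, 784)), np.zeros(512)
W2, b2 = rng.normal(0, 0.01, (256, 512)), np.zeros(256)
W3, b3 = rng.normal(0, 0.01, (10, 256)), np.zeros(10)

x = rng.random(784)            # one flattened "image"
a1 = relu(W1 @ x + b1)         # layer 1: linear transform + nonlinearity
a2 = relu(W2 @ a1 + b2)        # layer 2
probs = softmax(W3 @ a2 + b3)  # output: probability distribution over 10 digits

print(probs.sum())              # 1, up to float error
print(cross_entropy(probs, 7))  # near -log(1/10) ~ 2.3: untrained output is almost uniform
```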
Training works by computing the gradient of the loss with respect to every weight (backpropagation), then nudging each weight in the direction that reduces the loss (gradient descent). The optimizer (Adam, in both models here) adapts the learning rate per-parameter for faster convergence.
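Adam's per-parameter adaptation takes more than a few lines, but the core gradient-descent loop does not. Here it is on a toy one-weight loss, (w − 3)², purely for illustration:

```python
# Toy gradient descent: minimize loss(w) = (w - 3)^2, whose gradient is 2 * (w - 3).
w = 0.0    # initial weight
lr = 0.1   # learning rate

for step in range(50):
    grad = 2 * (w - 3)  # gradient of the loss with respect to w
    w -= lr * grad      # nudge w in the direction that reduces the loss

print(round(w, 4))  # ~3.0, the loss minimum
```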
Implementation
MLP (scikit-learn)
Test Accuracy: 97.92%
hidden_layer_sizes: (512, 256)
activation: relu
max_iter: 20
solver: adam
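With these settings, constructing the scikit-learn model is a few lines. This is a sketch: it assumes the flattened, normalized X_train / y_train arrays from the code walkthrough at the end, and note that scikit-learn names the optimizer argument `solver`:

```python
from sklearn.neural_network import MLPClassifier

# Configuration matching the card above; solver='adam' is scikit-learn's
# name for the Adam optimizer.
mlp = MLPClassifier(
    hidden_layer_sizes=(512, 256),
    activation='relu',
    solver='adam',
    max_iter=20,
)

# Assuming X_train / y_train from the data-loading walkthrough:
# mlp.fit(X_train, y_train)
# print(mlp.score(X_test, y_test))  # ~0.98 in the run reported above
```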
Convolutional Neural Network
TensorFlow / Keras
Instead of flattening, the CNN keeps the 2D structure intact. Small 3×3 filters slide across the image, learning to detect edges, curves, and corners. Each filter produces a feature map, which is a new image that highlights where a specific pattern was found.
How convolution works
The 3×3 kernel slides across the input, computing element-wise products and summing them at each position. This single kernel detects one type of pattern (edges in this case). A conv layer uses 32 or 64 different kernels simultaneously.
Convolution math
The convolution operation is the CNN's key advantage. Instead of learning a separate weight for every input pixel, it learns a small kernel that slides across the entire image. This enforces weight sharing and translation equivariance.
2D Convolution
The kernel f slides across the input g, computing element-wise products and summing them at each position. This produces a feature map that highlights where the kernel's pattern was detected.
Conv Layer Parameters
F = number of filters, k = kernel size, C_in = input channels. The +1 accounts for the bias per filter. A 3×3 conv layer with 32 filters on a single-channel input has only 32 × (9 + 1) = 320 parameters.
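The formula is easy to check in code (a throwaway helper, not a library API):

```python
def conv_params(filters, kernel, in_channels):
    # F * (k*k*C_in + 1): each filter has k*k*C_in weights plus one bias
    return filters * (kernel * kernel * in_channels + 1)

print(conv_params(32, 3, 1))   # 320 — the first conv layer
print(conv_params(64, 3, 32))  # 18,496 — the second conv layer

# Compare with the MLP's first-layer weight count (784 * 512 = 401,408):
print(401408 // conv_params(32, 3, 1))  # ~1,254x more weights in the MLP
```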
Network Architecture
The data flows left to right through the network. The input image (28×28) passes through two convolution blocks. Each block applies learned filters that detect increasingly complex patterns, followed by max pooling that halves the spatial dimensions while keeping the strongest activations. Notice how the feature maps get smaller but deeper at each stage.
After the second pooling layer, the 5×5×64 feature maps are flattened into a 1,600-dimensional vector. This is where the CNN transitions from spatial feature extraction to classification. A dense layer with 128 neurons processes these features, and dropout (30%) randomly disables neurons during training to prevent overfitting. The final softmax layer produces digit probabilities.
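The 5×5×64 shape can be verified with the usual size arithmetic: a 'valid' (no-padding) 3×3 convolution shrinks each spatial dimension by k − 1 = 2, and 2×2 max pooling halves it, rounding down. Assuming that layout:

```python
# Spatial size through the CNN, assuming 'valid' 3x3 convolutions
# and 2x2 max pooling — consistent with the 5x5x64 shape described above.
size = 28
size -= 2      # Conv 3x3, valid padding: 28 -> 26
size //= 2     # MaxPool 2x2:             26 -> 13
size -= 2      # Conv 3x3:                13 -> 11
size //= 2     # MaxPool 2x2:             11 -> 5

print(size, size * size * 64)  # 5, 1600 — flattened into a 1,600-dim vector
```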
Implementation
CNN (TensorFlow / Keras)
Test Accuracy: 99.29%
conv_filters: 32 → 64
kernel_size: 3×3
dense: 128
dropout: 0.3
optimizer: adam
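Assembled in Keras, the configuration above looks roughly like this. It is a sketch following the architecture described in this article, not the authors' exact script; the fit/evaluate calls are commented out and assume the X_train_cnn / y_train arrays from the data-loading walkthrough:

```python
from tensorflow.keras import layers, models

# Two conv + pool blocks, then dense(128) with 30% dropout, then softmax —
# matching the architecture walked through above.
model = models.Sequential([
    layers.Input(shape=(28, 28, 1)),
    layers.Conv2D(32, (3, 3), activation='relu'),
    layers.MaxPooling2D((2, 2)),
    layers.Conv2D(64, (3, 3), activation='relu'),
    layers.MaxPooling2D((2, 2)),
    layers.Flatten(),                 # 5x5x64 -> 1,600-dim vector
    layers.Dense(128, activation='relu'),
    layers.Dropout(0.3),              # disables 30% of neurons during training
    layers.Dense(10, activation='softmax'),
])
model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])

# Assuming X_train_cnn / y_train from the data-loading walkthrough:
# model.fit(X_train_cnn, y_train, epochs=5, validation_split=0.1)
# model.evaluate(X_test_cnn, y_test)  # ~0.99 in the run reported above
```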
Why CNN wins
A flat list of 64 numbers. No rows, no columns, no spatial structure. Pixel 5 and pixel 13 sit eight positions apart in the list, yet in the image pixel 13 was directly below pixel 5.
A 2D grid. The 3×3 kernel slides across, detecting local patterns like edges, curves, and corners. Every pixel is processed in context of its neighbors.
Weight sharing
One 3×3 kernel is reused across all spatial positions. The MLP needs 784×512 = 401,408 unique weights just for the first layer. The CNN's first conv layer uses only 32×(3×3×1+1) = 320 parameters. That is 1,254× fewer.
Translation invariance
A "7" in the top-left corner activates the same edge filters as a "7" in the center. The CNN recognizes patterns regardless of position. The MLP must learn each position separately.
Known limitations
Both architectures have well-documented failure modes. Understanding these limitations is as important as understanding the models themselves.
Vanishing gradients
Our MLP uses ReLU, which avoids the worst of this problem, but with only 2 hidden layers and 20 training iterations, gradients in the first layer can still be weak. Deeper MLPs would face this more severely.
Dying ReLU neurons
With 512 + 256 ReLU neurons in our MLP, some inevitably receive only negative inputs and stop contributing. These dead neurons waste capacity. Our CNN's 128-neuron dense layer is smaller but more efficiently utilized thanks to the conv layers doing feature extraction first.
Our MLP's spatial blindness
This is our MLP's core limitation on MNIST. Flattening 28×28 into 784 values discards the geometry: vertically adjacent pixels end up 28 positions apart, while pixels on opposite edges of neighboring rows sit side by side in the vector. Our MLP needs 535K+ parameters to compensate for the lost spatial structure.
Our CNN's fixed receptive field
Each of our CNN's 3×3 conv layers only sees a local patch. A single layer cannot detect a full digit shape. Stacking two conv + pool blocks gives the deepest neurons an effective receptive field covering most of the 28×28 input, but larger images would need more depth.
Results comparison
Training progress
Generalization gap
MLP: 1.70% gap
CNN: 0.56% gap
Both models generalize well at 97.9%+ accuracy. The CNN's smaller gap (0.56% vs 1.70%) suggests its spatial inductive bias requires less memorization of individual pixel positions.
Confusion matrices
4 ↔ 9 : Similar upper structure; MLP confuses open/closed tops
3 ↔ 8 : Overlapping curves; differs only by closed middle loop
7 ↔ 2 : Angled strokes can appear similar when flattened
What went wrong?
Even the best models make mistakes. Examining specific failures reveals what each architecture struggles to differentiate. These are representative examples from the MLP's 208 misclassifications.
Actual: 4 → Predicted: 9
The top loop of this 4 is fully closed, making it nearly identical to a 9 when viewed as a flat pixel vector. Our MLP has no way to detect that the crossbar extends left, a structural cue. Our CNN's conv filters pick up that horizontal edge, which 9s lack.
Actual: 3 → Predicted: 8
This 3 is open on the left, but the upper and lower curves bulge close together near the middle. Our MLP flattens the image and sees the overall pixel distribution, which looks similar to an 8's two stacked loops. Our CNN's 3×3 filters trace the actual contour and detect that the left side never closes.
Actual: 7 → Predicted: 2
This 7 has a European-style crossbar and a bottom stroke that curves into a flat base, creating a silhouette nearly identical to a 2. Both our MLP and CNN see a horizontal top, a diagonal descent, and a horizontal bottom. Even with spatial awareness, the structure is genuinely ambiguous at this resolution.
The CNN corrected 137 of 208 MLP errors by leveraging spatial features the MLP could never see.
Code walkthrough
from sklearn.datasets import fetch_openml
from sklearn.model_selection import train_test_split
import numpy as np

# Load MNIST dataset
mnist = fetch_openml('mnist_784', version=1, as_frame=False)
X, y = mnist.data / 255.0, mnist.target.astype(int)

# Split into train/test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=10000, random_state=42
)

# For CNN: reshape to 28x28x1
X_train_cnn = X_train.reshape(-1, 28, 28, 1)
X_test_cnn = X_test.reshape(-1, 28, 28, 1)

print(f"Training: {X_train.shape[0]} samples")
print(f"Test: {X_test.shape[0]} samples")