Image processing boomed after the 2012 introduction of AlexNet. AlexNet implements a Convolutional Neural Network (CNN) to increase accuracy for image processing tasks. There are two major modern frameworks for building neural network (NN) models today: PyTorch and TensorFlow. In this post, we’re going to learn how to build and train a PyTorch CNN. Before that, we will explore what a CNN is and how it works internally, so that we are equipped to implement our own PyTorch CNN. We’ll cover how to build a CNN in TensorFlow in a future article.
Find examples for building a PyTorch CNN on GitHub here. If you learn better through videos, check out this video on Building and Debugging a CNN.
In this guide to PyTorch CNN building we go over:
- An Introduction to Convolutional Neural Networks
- What is a Convolution?
- Visual Example of the Math Behind an Image Convolution
- How PyTorch nn.Conv2d Works
- What is Max Pooling?
- Visual Example of how MaxPool2D Works
- How PyTorch nn.MaxPool2d Works
- What is Average pooling?
- Average Pooling PyTorch Visualization
- Average Pooling PyTorch Implementation
- When to use MaxPool2D vs AvgPool2D
- PyTorch CNN Example on Fashion MNIST
- nn.Conv2d + ReLU + nn.MaxPool2d
- Torch Flatten for Final Fully Connected NN Layers
- Summary of PyTorch Convolutional Neural Networks
Introduction to Convolutional Neural Networks
The defining features of convolutional neural networks are a convolution layer and a pooling layer. The AlexNet paper uses max pooling in its pooling layers. It is important to note that this is not the only pooling method. There are other forms, like average pooling and min pooling, as well as other ways to tune it, such as local or global pooling. In this article we’ll cover max pooling and average pooling in PyTorch.
What is a Convolution?
In math, a convolution is an operation on two functions that produces a third function describing how the shape of one is modified by the other. For images in convolutional neural networks, a convolution modifies an image with a kernel and produces another image of (usually) different dimensions.
For most applications of convolutions over an image, we can visualize it as sliding a window of values across our image. Libraries like PyTorch offer ways to do convolutions over 1 dimension (nn.Conv1d), 2 dimensions (nn.Conv2d), or 3 dimensions (nn.Conv3d). That’s not to say you can’t do convolutions over 4, 5, or more dimensions; it’s just not a common enough task that it comes built into the library.
Visual Example of the Math Behind an Image Convolution
Let’s cover an example of a convolution to understand it. In this example, we take a 5×5 image and apply a 2D convolution (nn.Conv2d) with a 3×3 kernel (kernel_size=3). We start by aligning the kernel with the top left corner. Then we “slide” the kernel along the image until we get to the rightmost side of the image. In this example, we end up with 3 convolved pixels from that slide. Next, we move the kernel back to the leftmost side of the image, but down one pixel from the top, and repeat the slide from left to right.
We repeat the slide from left to right until our image has been completely covered by the kernel. When an n x n image is convolved using an m x m kernel, our resulting image has a dimension of (n-m+1) x (n-m+1). Generalized further, when an n x m image is convolved using an i x j kernel, our resulting image has a dimensionality of (n-i+1) x (m-j+1). Both formulas assume a stride of 1 and no padding.
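As a quick check of the output-size rule, the sketch below pushes random tensors through nn.Conv2d with its default stride, padding, and dilation. The helper name conv_output_size and the tensor sizes are illustrative assumptions, not part of the original example.

```python
import torch
from torch import nn


def conv_output_size(n, m, i, j):
    """Output (height, width) of an n x m image convolved with an i x j kernel,
    assuming a stride of 1, no padding, and no dilation."""
    return (n - i + 1, m - j + 1)


# Square case: 5 x 5 image, 3 x 3 kernel.
square = torch.rand(1, 1, 5, 5)  # (batch, channels, height, width)
conv_square = nn.Conv2d(in_channels=1, out_channels=1, kernel_size=3)
square_out = conv_square(square)

# Rectangular case: 6 x 8 image, 2 x 3 kernel.
rect = torch.rand(1, 1, 6, 8)
conv_rect = nn.Conv2d(in_channels=1, out_channels=1, kernel_size=(2, 3))
rect_out = conv_rect(rect)

print(tuple(square_out.shape[-2:]), conv_output_size(5, 5, 3, 3))
print(tuple(rect_out.shape[-2:]), conv_output_size(6, 8, 2, 3))
```

Each printed pair should agree: the shape PyTorch produces and the shape the formula predicts.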
Now that we understand dimensionality change, let’s look at how the numbers change. In each interaction between the kernel and the original image, the convolution applies position-wise multiplication and sums the results. I’ve kept the values in our conv2d example to 0s and 1s for simplicity, not because image values can only be 0 or 1. The resulting 3 x 3 image can contain values other than 0 and 1.
Looking above we can visualize the math for the top leftmost pixel in the image. When the convolution is carried out we get: 1*1 + 1*0 + 0*0 + 0*0 + 1*1 + 1*1 + 1*0 + 1*1 + 0*1. I’ve underlined the 0s and bolded the 1s in the former equation for clarity. That results in a value of 4 for the top leftmost point in our convolved image as shown. When we carry out the full 2D convolution (nn.Conv2d) on the image, we get the resulting 3×3 image below.
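The multiply-and-sum step can be written out by hand in plain Python. This is a minimal sketch: the top-left 3 x 3 patch and the kernel are read off the sum above, assuming the first factor of each product is the image pixel and the second is the kernel weight. The remaining image values are made up for illustration, and convolve2d is a hypothetical helper name.

```python
# Top-left patch and kernel follow the worked sum above (assumption:
# first factor = image pixel, second factor = kernel weight).
# The rest of the 5 x 5 image is invented for illustration.
image = [
    [1, 1, 0, 1, 0],
    [0, 1, 1, 0, 1],
    [1, 1, 0, 1, 1],
    [0, 0, 1, 1, 0],
    [1, 0, 1, 0, 1],
]
kernel = [
    [1, 0, 0],
    [0, 1, 1],
    [0, 1, 1],
]


def convolve2d(image, kernel):
    """Slide the kernel over the image; multiply position-wise and sum."""
    k_h, k_w = len(kernel), len(kernel[0])
    out_h = len(image) - k_h + 1
    out_w = len(image[0]) - k_w + 1
    output = []
    for row in range(out_h):
        output_row = []
        for col in range(out_w):
            total = 0
            for i in range(k_h):
                for j in range(k_w):
                    total += image[row + i][col + j] * kernel[i][j]
            output_row.append(total)
        output.append(output_row)
    return output


result = convolve2d(image, kernel)
print(len(result), len(result[0]))  # 3 3
print(result[0][0])                 # 4
```

Like PyTorch, this sketch does not flip the kernel before sliding it, which is the convention deep learning libraries use for "convolution".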
How PyTorch nn.Conv2d Works
nn.Conv2d is the 2D convolution layer in PyTorch. Its constructor has 11 parameters. Of these parameters, three must be specified and eight come with defaults. The three that must be provided are the number of in_channels, the number of out_channels, and the kernel_size. In the above example, we have one input channel, one output channel, and a kernel size of 3.
Image from PyTorch Documentation
We just use the defaults for the other eight parameters, but let’s take a look at what they are and what they do.
- Stride: how far the window moves each time.
- Padding: specifies how many pixels are added to the height and width of the image. Can be a single int or a tuple of (n, m), where n is the number of pixels padded on each side of the height and m is the number of pixels padded on each side of the width. Also allows the string same, which pads the input so that the output is the same size as the input (only supported with a stride of 1), and the string valid, which is the same as no padding.
- Dilation: the spacing between kernel elements. This animation gives a good visualization.
- Groups: how many groups to split the input into. For example, if there are 2 groups, it is the equivalent of having 2 convolutional layers and concatenating the outputs. The input channels would be split into 2 groups, and each group would get convolved on its own and then combined at the end.
- Bias: whether or not to add a learnable bias to the output.
- Padding_mode: allows zeros, reflect, replicate, or circular, with a default of zeros. reflect mirrors the values at the edge without repeating the edge pixel, replicate repeats the value of the edge pixel outward, and circular wraps around, padding with values from the opposite edge of the image.
- Device: used to set a device if you want to train your network on a specific device (i.e. cuda, mps, or cpu).
- Dtype: sets the data type of the layer’s weights and bias, which the input is expected to match.
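To make a few of these parameters concrete, here is a small sketch that compares output shapes for different settings and uses torch.nn.functional.pad on a single row to show what each non-zero padding mode does. The input size, channel counts, and variable names are illustrative assumptions.

```python
import torch
import torch.nn.functional as F
from torch import nn

x = torch.rand(1, 1, 28, 28)  # one single-channel 28 x 28 image

shapes = {
    "default": nn.Conv2d(1, 4, kernel_size=3)(x).shape,
    "stride=2": nn.Conv2d(1, 4, kernel_size=3, stride=2)(x).shape,
    "padding='same'": nn.Conv2d(1, 4, kernel_size=3, padding="same")(x).shape,
    "dilation=2": nn.Conv2d(1, 4, kernel_size=3, dilation=2)(x).shape,
}
for name, shape in shapes.items():
    print(f"{name:>15}: {tuple(shape)}")

# groups=2 splits 4 input channels into 2 groups of 2, so each of the
# 8 filters only sees 2 input channels.
grouped = nn.Conv2d(in_channels=4, out_channels=8, kernel_size=3, groups=2)
print(tuple(grouped.weight.shape))  # (8, 2, 3, 3)

# Padding modes shown on one row, padded by 2 on the left and right.
row = torch.tensor([[[1.0, 2.0, 3.0, 4.0]]])  # (batch, channels, width)
reflect = F.pad(row, (2, 2), mode="reflect")
replicate = F.pad(row, (2, 2), mode="replicate")
circular = F.pad(row, (2, 2), mode="circular")
print(reflect.flatten().tolist())    # [3, 2, 1, 2, 3, 4, 3, 2]
print(replicate.flatten().tolist())  # [1, 1, 1, 2, 3, 4, 4, 4]
print(circular.flatten().tolist())   # [3, 4, 1, 2, 3, 4, 1, 2]
```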
Note on Custom Kernels for Conv2d
The torch.nn.Conv2d layer doesn’t take a custom kernel as an argument; it creates and learns its own weights. To apply a custom kernel directly, we use the torch.nn.functional.conv2d function, which accepts the kernel as a weight tensor.
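A minimal sketch of applying a hand-written kernel with torch.nn.functional.conv2d follows. The image values and the kernel are assumptions chosen so the result is easy to verify: the kernel keeps only the centre pixel of each window, so the output should equal the centre 3 x 3 crop of the image.

```python
import torch
import torch.nn.functional as F

# A 5 x 5 single-channel image with values 0..24,
# shaped (batch, channels, height, width).
image = torch.arange(25, dtype=torch.float32).reshape(1, 1, 5, 5)

# Hand-written 3 x 3 kernel that keeps only the centre pixel of each window.
# F.conv2d expects weights shaped (out_channels, in_channels, kH, kW).
kernel = torch.tensor([[0.0, 0.0, 0.0],
                       [0.0, 1.0, 0.0],
                       [0.0, 0.0, 0.0]]).reshape(1, 1, 3, 3)

output = F.conv2d(image, kernel)
print(output.shape)  # torch.Size([1, 1, 3, 3])
print(output[0, 0])
```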
What is Max Pooling?
Max Pooling is a type of pooling technique. It downsizes an image to reduce computational cost and complexity. Max pooling is the specific application where we take a “pool” of pixels and replace them with their maximum value. This was the pooling technique applied on AlexNet in 2012 and is widely considered the de facto pooling technique to use in convolutional neural networks.
Visual Example of MaxPool2D
For this visualization, we’re going to take an image from Wikipedia. The image below starts with a 4x4 image. We apply a MaxPool2D technique on it with a 2x2 kernel. The default behavior for max pooling is to pool each set of pixels separately. Unlike the convolution, there is no overlap of pixels when pooling. nn.MaxPool2d in PyTorch lets us change this behavior through the stride parameter, which we cover below.
Max Pooling Image from Wikipedia
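The same pooling can be reproduced in a few lines. This sketch assumes the pixel values below match the Wikipedia figure; if they differ, the mechanics are unchanged.

```python
import torch
from torch import nn

# Values assumed to match the 4 x 4 Wikipedia max pooling figure.
image = torch.tensor([[12.0, 20.0, 30.0, 0.0],
                      [8.0, 12.0, 2.0, 0.0],
                      [34.0, 70.0, 37.0, 4.0],
                      [112.0, 100.0, 25.0, 12.0]]).reshape(1, 1, 4, 4)

max_pool = nn.MaxPool2d(kernel_size=2)  # stride defaults to kernel_size
pooled = max_pool(image)
print(pooled[0, 0])  # [[20, 30], [112, 37]]
```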
How PyTorch nn.MaxPool2d Works
Image from PyTorch Documentation
The PyTorch nn.MaxPool2d layer has six parameters. Only one of these parameters is required while five of them come with defaults. The required parameter is kernel_size. In the visualization above, we had a kernel size of 2. For an exploratory example, watch this video as we explore the kernel_size parameter in nn.MaxPool2d live to understand how it affects the input and output sizes.
For this tutorial we use the defaults for the other parameters, but let’s take a look at what they do:
- Stride: how far to move the kernel window. The default of None gives the behavior illustrated above, where the stride of the kernel is equal to the kernel size. If we had a stride of 1 above, the window would only move over 1 pixel each time and a 4x4 image would become a 3x3 image.
- Padding: how many pixels of negative infinity to pad the image with on each side (height or width). Can be an int or a tuple of ints that represents (height, width).
- Dilation: works like the nn.Conv2d dilation parameter.
- Return_indices: can be True or False. When True, the torch max pooling function also returns the indices of the max values in each pool.
- Ceil_mode: whether to use ceil or floor to calculate the output dimensions. When True, pooling windows may run past the edge of the input as long as they start inside the input or the left and top padding.
What is Average Pooling?
Like Max Pooling, Average Pooling is a version of the pooling algorithm. Unlike Max Pooling, average pooling does not take the max value within a pool and assign that as the corresponding value in the output image. Average pooling takes the average (mean) of the values within the pool, with some possible parametric changes.
The classic average pooling implementation uses a simple average. However, it is still called average pooling if the pooling technique does not use a simple mean. Changes you can make to the algorithm include counting the padded values, using a different divisor, or using a different type of average. PyTorch’s implementation includes parameters to automatically implement the first two but does not auto-implement a median or other type of averaging method.
AvgPool2d Visualization
Taking the same example that we looked at for the MaxPool2d visualization, we can see that the values are drastically different. The AvgPool2d implementation leaves us with smaller values than a MaxPool2d implementation. We discuss when to use average pooling or max pooling in the “When to use MaxPool2D vs AvgPool2D” section below.
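For comparison, here is the same 4 x 4 input run through average pooling. As before, the pixel values are assumed to match the Wikipedia figure.

```python
import torch
from torch import nn

image = torch.tensor([[12.0, 20.0, 30.0, 0.0],
                      [8.0, 12.0, 2.0, 0.0],
                      [34.0, 70.0, 37.0, 4.0],
                      [112.0, 100.0, 25.0, 12.0]]).reshape(1, 1, 4, 4)

avg_pooled = nn.AvgPool2d(kernel_size=2)(image)
max_pooled = nn.MaxPool2d(kernel_size=2)(image)
print(avg_pooled[0, 0])  # [[13, 8], [79, 19.5]]
print(max_pooled[0, 0])  # [[20, 30], [112, 37]]
```

Every averaged value is at most the corresponding maximum, which is why the average-pooled output looks dimmer.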
PyTorch AvgPool2d Implementation
The PyTorch Average Pooling layer for flat images is nn.AvgPool2d. There are six parameters for nn.AvgPool2d, only one of which is required. Much like the PyTorch MaxPool2D layer, the PyTorch Average Pooling layer requires a kernel size. Many of the other parameters are similar as well.
The nn.AvgPool2d parameters that come with a default are:
- Stride: how far to move the kernel window. Defaults to the size of the kernel, just like max pooling behavior.
- Padding: amount of 0 padding. Uses 0s instead of negative infinities like the PyTorch Max Pooling function. Can be one integer or a tuple defining the amount of padding on the height and width.
- Ceil_mode: works just like the max pooling function. If set to True, uses the ceiling instead of the floor function to determine output size.
- Count_include_pad: True by default, which counts the padded 0s in the divisor for pools that include padding. Does not include the padded pixels in the count when set to False.
- Divisor_override: set to an integer to divide by that value instead of the pool size.
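A small sketch of how count_include_pad and divisor_override change the result. The all-ones 2 x 2 input is an assumption picked so the arithmetic is easy to follow: with padding=1, every 2 x 2 window covers one real pixel and three padded zeros.

```python
import torch
from torch import nn

ones = torch.ones(1, 1, 2, 2)

# Default: padded zeros count toward the divisor, so each window is 1 / 4.
with_pad = nn.AvgPool2d(kernel_size=2, padding=1)(ones)

# count_include_pad=False divides by the number of real pixels only: 1 / 1.
without_pad = nn.AvgPool2d(kernel_size=2, padding=1,
                           count_include_pad=False)(ones)

# divisor_override=1 turns the average into a plain sum over the window.
window_sum = nn.AvgPool2d(kernel_size=2, divisor_override=1)(ones)

print(with_pad[0, 0].tolist())     # [[0.25, 0.25], [0.25, 0.25]]
print(without_pad[0, 0].tolist())  # [[1.0, 1.0], [1.0, 1.0]]
print(window_sum[0, 0].tolist())   # [[4.0]]
```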
When to use MaxPool2D vs AvgPool2D
The main difference between using nn.MaxPool2d and nn.AvgPool2d in images is that max pooling gives a sharper image while average pooling gives a smoother image. Using nn.MaxPool2d is best when we want to retain the most prominent features of the image. Using nn.AvgPool2d is best when we want to retain the essence of an object.
Examples of when to use PyTorch nn.MaxPool2d instead of nn.AvgPool2d include when you have a drastic change in background color, when you are working with dark backgrounds, or when only the outline of an object is salient. Examples of when to use PyTorch nn.AvgPool2d over nn.MaxPool2d include when you are working with images with a variety of colors, when you are working with images with lighter backgrounds, and when you want your network to learn the general shape of an object.
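As a toy illustration of the "sharper versus smoother" difference, consider a single bright pixel on a dark background. The 4 x 4 patch is an invented example, and it only shows how the two layers treat an isolated feature, not which one trains better on a real dataset.

```python
import torch
from torch import nn

# Dark 4 x 4 patch with one bright pixel.
patch = torch.zeros(1, 1, 4, 4)
patch[0, 0, 1, 1] = 1.0

max_out = nn.MaxPool2d(kernel_size=2)(patch)
avg_out = nn.AvgPool2d(kernel_size=2)(patch)

print(max_out[0, 0].tolist())  # [[1.0, 0.0], [0.0, 0.0]]
print(avg_out[0, 0].tolist())  # [[0.25, 0.0], [0.0, 0.0]]
```

Max pooling keeps the bright pixel at full strength, while average pooling dilutes it across its window.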
PyTorch CNN Example on Fashion MNIST
The Fashion MNIST dataset, published by Zalando Research, is designed as a drop-in replacement for the original MNIST (Modified National Institute of Standards and Technology) digits dataset. Much like the original MNIST digits dataset that we trained our neural network from scratch on, the Fashion MNIST dataset contains 28x28 grayscale images. It contains 60000 training images and 10000 test images with 10 unique labels. Each of the labels corresponds to a type of clothing. In this section, we’re going to learn how to build a basic PyTorch CNN to classify images in the Fashion MNIST dataset.
We cover how to build the neural network and its associated hyperparameters. The network that we build is a simple PyTorch CNN that consists of Conv2D, ReLU, and MaxPool2D for the convolutional part. It then flattens the input and uses a linear + ReLU + linear set of layers for the fully connected part and prediction.
The skeleton of the PyTorch CNN looks like the code below. It extends the nn.Module class from PyTorch. The two functions that we touch are the __init__ function and the forward function. We define the network in the __init__ function and implement how a forward pass works in the forward function. The PyTorch CNN skeleton below includes an implemented forward pass. To see how to train the neural network, check out this video on K-Fold Validation for a PyTorch CNN.
from torch import nn

class NeuralNetwork(nn.Module):
    def __init__(self):
        super(NeuralNetwork, self).__init__()
        self.cnn = nn.Sequential(...)  # layers are filled in below

    def forward(self, X):
        logits = self.cnn(X)
        return logits
nn.Conv2d + ReLU + nn.MaxPool2d
Let’s add the convolutional layers to our PyTorch CNN. In many convolutional neural networks there are multiple convolutional layers, but we build just one as an example. We define one “convolutional layer” as a Conv2D layer + a MaxPool2D layer. In this case we also add a ReLU activation in the middle.
Naturally, the question arises, why do we use a max pooling layer? It is not a mandate of every CNN to use a max pooling layer. As we discussed above, we can use an average pooling layer as well. We can also use pure convolutions with stride. There are three reasons we use a max pooling layer in this example.
First, this example is meant as an introduction to building a PyTorch CNN. Introducing a Max Pooling layer is a classical part of building ConvNets. Second, a max pooling layer (unlike an average pooling layer) introduces more nonlinearity into the network. This is important for the network to learn better abstractions. Third, you can use it to reduce complexity.
Note that a kernel size of 1 is equivalent to not having a MaxPool2D layer at all. In the code example here, we use an nn.MaxPool2d layer with a kernel size of 2. Our convolution + max pooling layer starts with a 28 x 28 image and ends with 4 13 x 13 images. The 4 out_channels we set in our nn.Conv2d layer mean that the layer learns 4 different kernels, so the network learns 4 different representations.
from torch import nn

class NeuralNetwork(nn.Module):
    def __init__(self):
        super(NeuralNetwork, self).__init__()
        self.cnn = nn.Sequential(
            nn.Conv2d(in_channels=1, out_channels=4, kernel_size=3),  # 28 x 28 --> 26 x 26 x 4
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=2),  # 26 x 26 x 4 --> 13 x 13 x 4
            ...
        )

    def forward(self, X):
        logits = self.cnn(X)
        return logits
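To confirm the shapes noted in the comments, the sketch below runs a random batch through just the convolutional part and records the shape after each layer. The batch size of 8 and the variable names are assumptions for illustration.

```python
import torch
from torch import nn

conv_block = nn.Sequential(
    nn.Conv2d(in_channels=1, out_channels=4, kernel_size=3),
    nn.ReLU(),
    nn.MaxPool2d(kernel_size=2),
)

x = torch.rand(8, 1, 28, 28)  # hypothetical batch of 8 grayscale images
shapes = [tuple(x.shape)]
for layer in conv_block:
    x = layer(x)
    shapes.append(tuple(x.shape))
    print(f"{layer.__class__.__name__:<10} -> {tuple(x.shape)}")
```

PyTorch orders the dimensions as (batch, channels, height, width), so "13 x 13 x 4" in the comments appears as (8, 4, 13, 13) here.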
Final Fully Connected NN Layers
Once we have set up our convolution and max pooling layer, we add the fully connected (also called “dense”) layers to facilitate prediction. The first thing that we have to do to our convolved image is flatten it. The PyTorch nn.Linear layer expects each sample as a flat vector of features. Once we flatten it, we can treat the rest of our PyTorch CNN like any other basic neural network.
For this example, we take our length 676 vector (13 x 13 x 4) and turn it into 64 hidden states. The next layer uses a ReLU activation function for nonlinearity. Finally, we turn the 64 hidden states into 10 output neurons, one for each class in the Fashion MNIST dataset. The output with the highest value represents the class that the image is most likely to correspond to. Find the full code with training code for our PyTorch CNN on GitHub.
from torch import nn

class NeuralNetwork(nn.Module):
    def __init__(self):
        super(NeuralNetwork, self).__init__()
        self.cnn = nn.Sequential(
            nn.Conv2d(in_channels=1, out_channels=4, kernel_size=3),  # 28 x 28 --> 26 x 26 x 4
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=2),  # 26 x 26 x 4 --> 13 x 13 x 4
            nn.Flatten(),  # 13 x 13 x 4 --> 676
            nn.Linear(13*13*4, 64),
            nn.ReLU(),
            nn.Linear(64, 10)
        )

    def forward(self, X):
        logits = self.cnn(X)
        return logits
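As a smoke test, the sketch below runs one forward pass and one training step on random data. It is not the training code from the repository: the random tensors stand in for Fashion MNIST batches, and the loss function, optimizer, learning rate, and batch size are all assumptions.

```python
import torch
from torch import nn


class NeuralNetwork(nn.Module):
    def __init__(self):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv2d(in_channels=1, out_channels=4, kernel_size=3),
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=2),
            nn.Flatten(),
            nn.Linear(13 * 13 * 4, 64),
            nn.ReLU(),
            nn.Linear(64, 10),
        )

    def forward(self, X):
        return self.cnn(X)


torch.manual_seed(0)
model = NeuralNetwork()

# Random stand-ins for a batch of Fashion MNIST images and labels.
images = torch.rand(16, 1, 28, 28)
labels = torch.randint(0, 10, (16,))

loss_fn = nn.CrossEntropyLoss()                           # assumed loss
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)  # assumed optimizer

logits = model(images)          # one score per class for each image
loss = loss_fn(logits, labels)
optimizer.zero_grad()
loss.backward()
optimizer.step()

predictions = logits.argmax(dim=1)  # index of the highest score per image
print(tuple(logits.shape), tuple(predictions.shape), loss.item())
```

nn.CrossEntropyLoss takes the raw logits directly, which is why the network ends with a plain nn.Linear layer and no softmax.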
Summary of PyTorch Convolutional Neural Networks
In this article, we learned about Convolutional Neural Networks. We took a look at the math behind convolutions, max pooling, and average pooling. Then, we built a PyTorch CNN for practice. CNNs are primarily used for image recognition, but they can be applied to other tasks as well.
The example PyTorch CNN we built assumes that we are training on 28x28 images as in the MNIST dataset. We use the nn.Conv2d and nn.MaxPool2d layers. If we want to work with different images, such as 3D brain scans, we would use the nn.Conv3d and nn.MaxPool3d layers. Alternatively, if our task was to look for the basic shape of objects instead of their outlines, we may choose to use average pooling through nn.AvgPool2d.
