# The PyTorch CNN Beginners Guide

Image processing boomed after the 2012 introduction of AlexNet, which uses a Convolutional Neural Network (CNN) to increase accuracy on image processing tasks. There are two major modern frameworks for building neural network (NN) models today: PyTorch and TensorFlow. In this post, we’re going to learn how to build and train a PyTorch CNN. Before that, we explore what a CNN is and how it works internally so we are armed to implement our own PyTorch CNN. We’ll cover how to build a CNN in TensorFlow in a future article.

Find examples for building a PyTorch CNN on GitHub here. If you learn better through videos, check out this video on Building and Debugging a CNN.

In this guide to PyTorch CNN building we go over:

• An Introduction to Convolutional Neural Networks
• What is a Convolution?
• Visual Example of the Math Behind an Image Convolution
• How PyTorch nn.Conv2d Works
• What is Max Pooling?
• Visual Example of how MaxPool2D Works
• How PyTorch nn.MaxPool2d Works
• What is Average pooling?
• Average Pooling PyTorch Visualization
• Average Pooling PyTorch Implementation
• When to use MaxPool2D vs AvgPool2D
• PyTorch CNN Example on Fashion MNIST
• nn.Conv2d + ReLU + nn.MaxPool2d
• Torch Flatten for Final Fully Connected NN Layers
• Summary of PyTorch Convolutional Neural Networks

## Introduction to Convolutional Neural Networks

Typical CNN from Wikipedia

The defining features of convolutional neural networks are a convolution layer and a pooling layer. The AlexNet paper uses max pooling in its pooling layers. It is important to note that this is not the only pooling method. There are other forms, like average pooling and min pooling, as well as other ways to tune it, such as local or global pooling. In this article we’ll cover max pooling and average pooling in PyTorch.

## What is a Convolution?

Image from Wikipedia

In math, a convolution is an operation on two functions that produces a third function describing how the shape of one is modified by the other. For images, as used in convolutional neural networks, a convolution modifies an image with a kernel and produces another image of (usually) different dimensions.

For most applications of convolutions over an image, we can visualize it as sliding a window of values across our image. Libraries like PyTorch offer ways to do convolutions over 1 dimension (`nn.Conv1d`), 2 dimensions (`nn.Conv2d`), or 3 dimensions (`nn.Conv3d`). That’s not to say you can’t do convolutions over 4, 5, or more dimensions. It’s just not a common enough task to come built into the library.
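As a rough illustration of the three built-in variants, the sketch below runs one layer of each kind on a random tensor. The tensor sizes are arbitrary assumptions chosen for the example. The point to notice is that each layer expects an input shaped `(batch, channels, ...)` with one, two, or three trailing spatial dimensions.

```python
import torch
from torch import nn

# Input sizes are arbitrary, chosen only for illustration.
signal = torch.rand(1, 1, 10)           # (batch, channels, length)
image = torch.rand(1, 1, 10, 10)        # (batch, channels, height, width)
volume = torch.rand(1, 1, 10, 10, 10)   # (batch, channels, depth, height, width)

conv1d = nn.Conv1d(in_channels=1, out_channels=1, kernel_size=3)
conv2d = nn.Conv2d(in_channels=1, out_channels=1, kernel_size=3)
conv3d = nn.Conv3d(in_channels=1, out_channels=1, kernel_size=3)

# Each spatial dimension shrinks from 10 to 8 with a kernel of size 3.
shape_1d = tuple(conv1d(signal).shape)
shape_2d = tuple(conv2d(image).shape)
shape_3d = tuple(conv3d(volume).shape)
print(shape_1d, shape_2d, shape_3d)
```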

### Visual Example of the Math Behind an Image Convolution

Let’s cover an example of a convolution to understand it. In this example, we take a 5×5 image and apply a 2D convolution (`nn.Conv2d`) with a 3×3 kernel (`kernel_size=3`). We start by aligning the kernel with the top left corner. Then we “slide” the kernel along the image until we get to the rightmost side of the image. In this example, we end up with 3 convolved pixels from that slide. Next, we move the kernel back to the leftmost side of the image, but down one pixel from the top, and repeat the slide from left to right.

We repeat the slide from left to right until our image has been completely covered by the kernel. When an `n x n` image is convolved using an `m x m` kernel, with the kernel moving one pixel at a time and no padding, our resulting image has a dimension of `(n-m+1) x (n-m+1)`. Generalized further, when an `n x m` image is convolved using an `i x j` kernel, our resulting image has a dimensionality of `(n-i+1) x (m-j+1)`.
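The dimension formula above can be sketched as a small helper. The function names here are hypothetical, and the `stride` and `padding` arguments go beyond the text above: they follow the standard output-size formula for convolutions with no dilation, and with the defaults (`stride=1`, `padding=0`) the result reduces to `n - k + 1`.

```python
def conv_output_size(n, k, stride=1, padding=0):
    """Output length along one dimension for input size n and kernel size k."""
    return (n + 2 * padding - k) // stride + 1


def conv_output_shape(image_shape, kernel_shape):
    """Apply the one-dimensional rule to the height and the width separately."""
    (n, m), (i, j) = image_shape, kernel_shape
    return (conv_output_size(n, i), conv_output_size(m, j))


print(conv_output_size(5, 3))                # 3
print(conv_output_shape((28, 28), (3, 3)))   # (26, 26)
```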

Now that we understand the dimensionality change, let’s look at how the numbers change. At each position of the kernel on the original image, the convolution multiplies the overlapping values position-wise and sums the results. I’ve kept the values in our `conv2d` example to 0s and 1s for simplicity, not because image values can only be 0 or 1. The resulting `3 x 3` image has many values other than 0 or 1.

Looking above we can visualize the math for the top leftmost pixel in the image. When the convolution is carried out we get: `1*1 + 1*0 + 0*0 + 0*0 + 1*1 + 1*1 + 1*0 + 1*1 + 0*1`. That results in a value of 4 for the top leftmost point in our convolved image as shown. When we carry out the full 2D convolution (`conv2d`) on the image, we get the resulting 3×3 image below.
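To make the slide-multiply-sum procedure concrete, here is a minimal plain-Python sketch. The `convolve2d` helper is a hypothetical name, and the 5×5 image is made up for illustration: only its top-left 3×3 patch and the kernel are read off the equation above, assuming the first factor in each product is the image pixel and the second is the kernel weight.

```python
def convolve2d(image, kernel):
    """Slide the kernel over the image (stride 1, no padding),
    multiplying position-wise and summing at each location."""
    n_rows, n_cols = len(image), len(image[0])
    k_rows, k_cols = len(kernel), len(kernel[0])
    output = []
    for top in range(n_rows - k_rows + 1):
        row = []
        for left in range(n_cols - k_cols + 1):
            total = 0
            for i in range(k_rows):
                for j in range(k_cols):
                    total += image[top + i][left + j] * kernel[i][j]
            row.append(total)
        output.append(row)
    return output


# Hypothetical image: only the top-left 3x3 patch comes from the equation above.
image = [
    [1, 1, 0, 0, 1],
    [0, 1, 1, 1, 0],
    [1, 1, 0, 1, 0],
    [0, 0, 1, 1, 1],
    [1, 0, 1, 0, 0],
]
kernel = [
    [1, 0, 0],
    [0, 1, 1],
    [0, 1, 1],
]

result = convolve2d(image, kernel)
print(result[0][0])                 # 4, the top leftmost value
print(len(result), len(result[0]))  # 3 3
```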

### How PyTorch nn.Conv2d Works

`torch.nn.Conv2d` is the 2D convolution layer in PyTorch. `nn.Conv2d` takes 11 parameters. Of these parameters, three must be specified and eight come with defaults. The three that must be provided are the number of `in_channels`, the number of `out_channels`, and `kernel_size`. In the above example, we have one input channel, one output channel, and a kernel size of 3.

Image from PyTorch Documentation

We just use the defaults for the other eight parameters, but let’s take a look at what they are and what they do.

• Stride: how far the window moves each time
• Padding: specifies how pixels should be added to the height and width of the image. Can be an `int` or a tuple of `(n, m)` where `n` is the number of pixels padded on the height and `m` is the number of pixels padded on the width. Also allows `same`, which pads the input so that the output is the same size as the input, and `valid`, which is the same as no padding.
• Dilation: the spacing between the elements of the kernel. This animation gives a good visualization.
• Groups: how many groups to split the input channels into. For example, if there are 2 groups, it is the equivalent of having 2 convolutional layers side by side and concatenating the outputs. The input channels are split into 2 groups, each group is convolved on its own, and the results are combined at the end.
• Bias: whether or not to add a learnable bias to the output
• Padding_mode: allows `zeros`, `reflect`, `replicate`, or `circular` with a default of `zeros`. `reflect` reflects the values without repeating the last pixel, `replicate` repeats the values of the last pixel across, and `circular` wraps around, padding each edge with values from the opposite edge of the image.
• Device: used to set a device if you want to train your network on a specific device (e.g. cuda, mps, or cpu)
• Dtype: sets the data type of the layer’s parameters (weights and bias), which the input is expected to match.
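The effect of several of these parameters is easiest to see in the output shapes. The sketch below is only an illustration: the `28 x 28` input and the choice of 4 output channels are assumptions borrowed from the Fashion MNIST example later in this guide, and the variable names are made up.

```python
import torch
from torch import nn

x = torch.rand(1, 1, 28, 28)  # (batch, channels, height, width)

layers = {
    "default": nn.Conv2d(1, 4, kernel_size=3),
    "strided": nn.Conv2d(1, 4, kernel_size=3, stride=2),
    "same": nn.Conv2d(1, 4, kernel_size=3, padding="same"),
    "dilated": nn.Conv2d(1, 4, kernel_size=3, dilation=2),
    "circular": nn.Conv2d(1, 4, kernel_size=3, padding=1, padding_mode="circular"),
}
shapes = {name: tuple(layer(x).shape) for name, layer in layers.items()}
for name, shape in shapes.items():
    print(f"{name:>9}: {shape}")

# With groups=2, each group of 4 output channels sees only 2 of the 4 input
# channels, so every kernel has 2 input channels instead of 4.
grouped = nn.Conv2d(in_channels=4, out_channels=8, kernel_size=3, groups=2)
grouped_weight_shape = tuple(grouped.weight.shape)
print(grouped_weight_shape)

# bias=False removes the learnable bias term entirely.
no_bias = nn.Conv2d(1, 4, kernel_size=3, bias=False)
print(no_bias.bias)
```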

### Note on Custom Kernels for Conv2d

The `torch.nn.Conv2d` module learns its kernel weights during training and does not take a custom kernel as an argument. To apply a kernel we define ourselves, we use the `torch.nn.functional.conv2d` function, which accepts the kernel weights directly.
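A minimal sketch of applying a hand-written kernel with `torch.nn.functional.conv2d` follows. The image and kernel values are made up for illustration. The function expects the input shaped `(batch, channels, height, width)` and the weight shaped `(out_channels, in_channels, kernel_height, kernel_width)`, so both are reshaped to 4 dimensions.

```python
import torch
import torch.nn.functional as F

# Hypothetical 5x5 binary image and 3x3 kernel, chosen only for illustration.
image = torch.tensor([
    [1, 1, 0, 0, 1],
    [0, 1, 1, 1, 0],
    [1, 1, 0, 1, 0],
    [0, 0, 1, 1, 1],
    [1, 0, 1, 0, 0],
], dtype=torch.float32).reshape(1, 1, 5, 5)

kernel = torch.tensor([
    [1, 0, 0],
    [0, 1, 1],
    [0, 1, 1],
], dtype=torch.float32).reshape(1, 1, 3, 3)

# Defaults: stride 1, no padding, so a 5x5 input gives a 3x3 output.
output = F.conv2d(image, kernel)
print(tuple(output.shape))
print(output[0, 0])
```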

## What is Max Pooling?

Max Pooling is a type of pooling technique. It downsizes an image to reduce computational cost and complexity. Max pooling is the specific application where we take a “pool” of pixels and replace them with their maximum value. This was the pooling technique used in AlexNet in 2012 and is widely considered the de facto pooling technique for convolutional neural networks.

### Visual Example of MaxPool2D

For this visualization, we use an image from Wikipedia. The image below starts with a `4x4` image and applies max pooling with a `2x2` kernel. By default, max pooling treats each pool of pixels separately: unlike the convolution above, the pools do not overlap. `nn.MaxPool2d` in PyTorch controls this behavior through the `stride` parameter, which we cover below.

Max Pooling Image from Wikipedia

### How PyTorch nn.MaxPool2d Works

Image from PyTorch Documentation

The PyTorch `nn.MaxPool2d` layer has six parameters. Only one of these parameters is required while five of them come with defaults. The required parameter is `kernel_size`. In the visualization above, we had a kernel size of 2. For an exploratory example, watch this video as we explore the `kernel_size` parameter in `nn.MaxPool2d` live to understand how it affects the input and output sizes.

For this tutorial we use the defaults for the other parameters, but let’s take a look at what they do:

• Stride: how far to move the kernel window. `None` by default means we get the behavior illustrated above. The stride of the kernel is equivalent to the kernel size. If we had a stride of 1 above, then it would only move over 1 pixel each time and a `4x4` image would become a `3x3` image.
• Padding: how many pixels of negative infinity to pad the image with on either side (height or width). Can be an `int` or a `tuple` of `int`s that represents `(height, width)`.
• Dilation: works like the `nn.Conv2d` dilation parameter.
• Return_indices: can be `True` or `False`. When `True`, the torch max pooling function also returns the indices of the max values in each pool.
• Ceil_mode: whether to use `ceil` or `floor` to calculate the output dimensions. When `True`, a pool may run past the bottom or right edge of the image, as long as it starts inside the image or in the padding on the left and top.
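The parameters above can be tried out on a tiny input. This is a sketch with made-up values: the `4x4` input simply counts from 0 to 15 so that the maximum of each pool is easy to read off, and it is not the image from the Wikipedia figure.

```python
import torch
from torch import nn

# Rows are [0..3], [4..7], [8..11], [12..15].
x = torch.arange(16, dtype=torch.float32).reshape(1, 1, 4, 4)

# Default stride equals the kernel size: four non-overlapping 2x2 pools.
pooled = nn.MaxPool2d(kernel_size=2)(x)
print(pooled[0, 0])  # [[5, 7], [13, 15]]

# stride=1 makes the pools overlap, so a 4x4 image becomes 3x3.
overlapped = nn.MaxPool2d(kernel_size=2, stride=1)(x)
print(tuple(overlapped.shape))

# return_indices=True also returns the flattened position of each maximum.
values, indices = nn.MaxPool2d(kernel_size=2, return_indices=True)(x)
print(indices[0, 0])

# ceil_mode changes how a size that does not divide evenly is rounded.
odd = torch.rand(1, 1, 5, 5)
floor_shape = tuple(nn.MaxPool2d(kernel_size=2)(odd).shape)
ceil_shape = tuple(nn.MaxPool2d(kernel_size=2, ceil_mode=True)(odd).shape)
print(floor_shape, ceil_shape)
```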

## What is Average Pooling?

Like Max Pooling, Average Pooling is a version of the pooling algorithm. Unlike Max Pooling, average pooling does not take the max value within a pool and assign that as the corresponding value in the output image. Average pooling takes the average (mean) of the values within the pool, with some possible parametric changes.

The classic average pooling implementation uses a simple average. However, it is still called average pooling if the pooling technique does not use a simple mean. Changes you can make to the algorithm include counting the padded values, using a different divisor, or using a different type of average. PyTorch’s implementation includes parameters for the first two but does not provide a median or other type of averaging method.

### AvgPool2d Visualization

Taking the same example that we looked at for the `MaxPool2d` visualization, we can see that the values are drastically different. The `AvgPool2d` implementation leaves us with smaller values than a `MaxPool2d` implementation. We discuss when to use average pooling or max pooling in the “When to use `MaxPool2D` vs `AvgPool2D`” section below.

### PyTorch AvgPool2d Implementation

The PyTorch average pooling layer for flat images is `nn.AvgPool2d`. There are six parameters for `nn.AvgPool2d`, only one of which is required. Much like the PyTorch `MaxPool2d` layer, the PyTorch average pooling layer requires a kernel size. Many of the other parameters are similar as well.

The `nn.AvgPool2d` parameters that come with a default are:

• Stride: how far to move the kernel window. Defaults to the size of the kernel, just like max pooling behavior.
• Padding: amount of 0 padding. Uses 0s instead of negative infinities like the PyTorch Max Pooling function. Can be one integer or a tuple defining amount of padding on the height and width.
• Ceil_mode: works just like the max pooling function, if set to `True`, uses ceiling instead of floor function to determine output size.
• Count_include_pad: `True` by default, which means the padded 0s are counted in the divisor for pools that include padding. When set to `False`, the padded pixels are left out of the count.
• Divisor_override: set to an integer to use a specific integer to divide by instead of the pool size.
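A short sketch of these parameters in action is shown below. The inputs are made up: a `4x4` grid counting from 0 to 15 for the plain average, and a `4x4` grid of ones so that the effect of padding and the divisor is easy to see.

```python
import torch
from torch import nn

x = torch.arange(16, dtype=torch.float32).reshape(1, 1, 4, 4)
averaged = nn.AvgPool2d(kernel_size=2)(x)
print(averaged[0, 0])  # mean of each non-overlapping 2x2 pool

ones = torch.ones(1, 1, 4, 4)

# With padding=1, the corner pool holds one real pixel and three padded 0s.
include_pad = nn.AvgPool2d(kernel_size=2, padding=1)(ones)
exclude_pad = nn.AvgPool2d(kernel_size=2, padding=1, count_include_pad=False)(ones)
print(include_pad[0, 0, 0, 0].item())  # 1 / 4, padded 0s count toward the divisor
print(exclude_pad[0, 0, 0, 0].item())  # 1 / 1, padded 0s are ignored

# divisor_override=1 divides each pool's sum by 1, turning the layer into sum pooling.
summed = nn.AvgPool2d(kernel_size=2, divisor_override=1)(ones)
print(summed[0, 0])
```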

## When to use MaxPool2D vs AvgPool2D

The main difference between using `MaxPool2d` and `AvgPool2d` on images is that max pooling gives a sharper image while average pooling gives a smoother image. Using `nn.MaxPool2d` is best when we want to retain the most prominent features of the image. Using `nn.AvgPool2d` is best when we want to retain the essence of an object.

Examples of when to use PyTorch `MaxPool2d` instead of `AvgPool2d` include when you have a drastic change in background color, when you are working with dark backgrounds, or when only the outline of an object is salient. Examples of when to use PyTorch `nn.AvgPool2d` over `nn.MaxPool2d` include when you are working with images with a variety of colors, when you are working with images with lighter backgrounds, and when you want your network to learn the general shape of an object.
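One way to see the sharper-versus-smoother difference is a toy image with a single bright pixel on a dark background. The image below is a made-up example, not data from this guide: max pooling keeps the bright value at full strength, while average pooling dilutes it across its pool.

```python
import torch
from torch import nn

# Dark 4x4 image with one bright pixel (hypothetical data).
image = torch.zeros(1, 1, 4, 4)
image[0, 0, 1, 1] = 1.0

max_out = nn.MaxPool2d(kernel_size=2)(image)
avg_out = nn.AvgPool2d(kernel_size=2)(image)

print(max_out[0, 0])  # the bright pixel survives as 1.0
print(avg_out[0, 0])  # the bright pixel is averaged down to 0.25
```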

## PyTorch CNN Example on Fashion MNIST

The Fashion MNIST dataset, published by Zalando Research, is a drop-in replacement for the original MNIST (Modified National Institute of Standards and Technology) digits dataset. Much like the original MNIST digits dataset that we trained our neural network from scratch on, the Fashion MNIST dataset contains `28x28` images. It contains 60000 training images and 10000 test images with 10 unique labels. Each of the labels corresponds to a type of clothing. In this section, we’re going to learn how to build a basic PyTorch CNN to classify images in the Fashion MNIST dataset.

We cover how to build the neural network and its associated hyperparameters. The network that we build is a simple PyTorch CNN that consists of Conv2D, ReLU, and MaxPool2D for the convolutional part. It then flattens the input and uses a linear + ReLU + linear set of layers for the fully connected part and prediction.

The skeleton of the PyTorch CNN looks like the code below. It extends the `nn.Module` object from PyTorch. The two functions that we touch are the `__init__` function and the `forward` function. We define the network in the `__init__` function and implement how a forward pass works in the `forward` function. The PyTorch CNN skeleton below includes an implemented forward pass. To see how to train the neural network, check out this video on K-Fold Validation for a PyTorch CNN.

```python
from torch import nn


class NeuralNetwork(nn.Module):
    def __init__(self):
        super(NeuralNetwork, self).__init__()
        self.cnn = nn.Sequential(
            # ... layers are added in the sections below
        )

    def forward(self, X):
        logits = self.cnn(X)
        return logits
```

### nn.Conv2d + ReLU + nn.MaxPool2d

Let’s add the convolutional layers to our PyTorch CNN. In many convolutional neural networks there are multiple convolutional layers, but we build just one as an example. We define one “convolutional layer” as a Conv2D layer + a MaxPool2D layer. In this case we also add a ReLU activation in the middle.

Naturally, the question arises, why do we use a max pooling layer? It is not a mandate of every CNN to use a max pooling layer. As we discussed above, we can use an average pooling layer as well. We can also use pure convolutions with stride. There are three reasons we use a max pooling layer in this example.

First, this example is meant as an introduction to building a PyTorch CNN. Introducing a Max Pooling layer is a classical part of building ConvNets. Second, a max pooling layer (not an average pooling layer) introduces more nonlinearity into the network. This is important for the network to learn better abstractions. Third, you can use it to reduce complexity.

Note that a kernel size of 1 is equivalent to not having a MaxPool2D layer at all. In the code example here, we use an `nn.MaxPool2d` layer with a kernel size of 2. Our convolution + max pooling layer starts with a `28 x 28` image and ends with 4 `13 x 13` images. The 4 `out_channels` in our `nn.Conv2d` layer mean that we get 4 different kernels from that layer, so the network learns 4 different representations.

```python
from torch import nn


class NeuralNetwork(nn.Module):
    def __init__(self):
        super(NeuralNetwork, self).__init__()
        self.cnn = nn.Sequential(
            nn.Conv2d(in_channels=1, out_channels=4, kernel_size=3),  # 28 x 28 --> 26 x 26 x 4
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=2),  # --> 13 x 13 x 4
            # ... the fully connected layers are added in the next section
        )

    def forward(self, X):
        logits = self.cnn(X)
        return logits
```
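The shape comments in the block above can be checked by pushing a dummy batch through the same three layers. This is a sketch only: the batch of 8 random `28 x 28` images is a stand-in for real Fashion MNIST data, and `conv_block` is a hypothetical name.

```python
import torch
from torch import nn

conv_block = nn.Sequential(
    nn.Conv2d(in_channels=1, out_channels=4, kernel_size=3),
    nn.ReLU(),
    nn.MaxPool2d(kernel_size=2),
)

batch = torch.rand(8, 1, 28, 28)  # (batch, channels, height, width)

after_conv = conv_block[0](batch)  # output of the convolution alone
after_pool = conv_block(batch)     # output of conv + ReLU + max pooling

print(tuple(after_conv.shape))  # 4 channels of 26 x 26
print(tuple(after_pool.shape))  # 4 channels of 13 x 13
```

PyTorch puts the channel dimension before height and width, so `26 x 26 x 4` in the comments corresponds to a tensor shaped `(batch, 4, 26, 26)`.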

### Final Fully Connected NN Layers

Once we have set up our convolution and max pooling layer, we add the fully connected (also called “dense”) layers to facilitate prediction. The first thing that we have to do to our convolved image is flatten it. The PyTorch `nn.Linear` layer operates on the last dimension of its input, so we flatten the feature maps of each image into a single vector first. Once we flatten it, we can treat the rest of our PyTorch CNN like any other basic neural network.

For this example, we take our length 676 vector (`13 * 13 * 4`) and turn it into 64 hidden states. The next layer uses a ReLU activation function for nonlinearity. Finally, we turn the 64 hidden states into 10 output neurons. We use 10 output neurons so that one represents each class in the Fashion MNIST dataset. The output with the highest value represents the class that the image is most likely to correspond to. Find the full code with training code for our PyTorch CNN on GitHub.

```python
from torch import nn


class NeuralNetwork(nn.Module):
    def __init__(self):
        super(NeuralNetwork, self).__init__()
        self.cnn = nn.Sequential(
            nn.Conv2d(in_channels=1, out_channels=4, kernel_size=3),  # 28 x 28 --> 26 x 26 x 4
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=2),  # --> 13 x 13 x 4
            nn.Flatten(),  # --> vector of length 13 * 13 * 4 = 676
            nn.Linear(13*13*4, 64),
            nn.ReLU(),
            nn.Linear(64, 10)
        )

    def forward(self, X):
        logits = self.cnn(X)
        return logits
```
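As a usage sketch, the snippet below instantiates the network and turns its outputs into class predictions. It is not the training code from the GitHub repository: the random batch is a stand-in for real Fashion MNIST images, the model is untrained, and the variable names are assumptions, so the predicted classes are meaningless. It only shows the shapes involved and how the highest logit maps to a class index.

```python
import torch
from torch import nn


class NeuralNetwork(nn.Module):
    def __init__(self):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv2d(in_channels=1, out_channels=4, kernel_size=3),
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=2),
            nn.Flatten(),
            nn.Linear(13 * 13 * 4, 64),
            nn.ReLU(),
            nn.Linear(64, 10),
        )

    def forward(self, X):
        return self.cnn(X)


model = NeuralNetwork()
batch = torch.rand(8, 1, 28, 28)  # stand-in for 8 grayscale 28 x 28 images

with torch.no_grad():  # no gradients needed for inference
    logits = model(batch)

# The index of the largest logit is the predicted class for each image.
predictions = logits.argmax(dim=1)
print(tuple(logits.shape))       # one row of 10 scores per image
print(tuple(predictions.shape))  # one class index per image
```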

## Summary of PyTorch Convolutional Neural Networks

In this article, we learned about Convolutional Neural Networks. We took a look at the math behind convolutions, max pooling, and average pooling. Then, we built a PyTorch CNN for practice. CNNs are primarily used for image recognition, but they can be applied to other tasks as well.

The example PyTorch CNN we built assumes that we are training on `28x28` images as in the Fashion MNIST dataset. We use the `nn.Conv2d` and `nn.MaxPool2d` layers. If we want to work with different images, such as 3D brain scans, we would use the `nn.Conv3d` and `nn.MaxPool3d` layers. Alternatively, if our task was to look for the basic shape of objects instead of their outlines, we may choose to use average pooling through `nn.AvgPool2d`.
