The PyTorch CNN Beginner’s Guide

Image processing boomed after the 2012 introduction of AlexNet. AlexNet uses a Convolutional Neural Network (CNN) to increase accuracy on image processing tasks. There are two major modern frameworks for building neural network (NN) models today: PyTorch and TensorFlow. In this post, we’re going to learn how to build and train a PyTorch CNN. Before that, we explore what a CNN is and how it works so that we are equipped to implement our own PyTorch CNN. We’ll cover how to build a CNN in TensorFlow in a future article.

Find examples for building a PyTorch CNN on GitHub here. If you learn better through videos, check out this video on Building and Debugging a CNN.

In this guide to PyTorch CNN building we go over:

  • An Introduction to Convolutional Neural Networks
  • What is a Convolution?
    • Visual Example of the Math Behind an Image Convolution
    • How PyTorch nn.Conv2d Works
  • What is Max Pooling?
    • Visual Example of how MaxPool2D Works
    • How PyTorch nn.MaxPool2d Works
  • What is Average pooling?
    • Average Pooling PyTorch Visualization
    • Average Pooling PyTorch Implementation
  • When to use MaxPool2D vs AvgPool2D
  • PyTorch CNN Example on Fashion MNIST
    • nn.Conv2d + ReLU + nn.MaxPool2d
    • Torch Flatten for Final Fully Connected NN Layers
  • Summary of PyTorch Convolutional Neural Networks

Introduction to Convolutional Neural Networks

Typical CNN from Wikipedia

The defining features of convolutional neural networks are the convolution layer and the pooling layer. The AlexNet paper uses max pooling in its pooling layers. It is important to note that this is not the only pooling method. There are other forms, such as average pooling and min pooling, as well as other ways to tune it, such as local or global pooling. In this article we’ll cover max pooling and average pooling in PyTorch.

What is a Convolution?

Image from Wikipedia

In math, a convolution is an operation on two functions that produces a third function describing how the shape of one is modified by the other. For images, as used in convolutional neural networks, a convolution combines an image with a kernel to produce another image of (usually) different dimensions.

For most applications of convolutions over an image, we can visualize it as sliding a window of values across our image. Libraries like PyTorch offer ways to do convolutions over 1 dimension (nn.Conv1d), 2 dimensions (nn.Conv2d), or 3 dimensions (nn.Conv3d). That’s not to say you can’t do convolutions over 4, 5, or more dimensions. It’s just not a common enough task to come built into the library.

Visual Example of the Math Behind an Image Convolution

Let’s cover an example of a convolution to understand it. In this example, we take a 5×5 image and apply a 2D convolution (nn.Conv2d) with a 3×3 kernel (kernel_size=3). We start by aligning the kernel with the top left corner. Then we “slide” the kernel along the image until we get to the rightmost side of the image. In this example, we end up with 3 convolved pixels from that slide. Next, we move the kernel back to the leftmost side of the image, but down one pixel from the top, and repeat the slide from left to right.

We repeat the slide from left to right until our image has been completely covered by the kernel. When an n x n image is convolved with an m x m kernel, moving one pixel at a time with no padding, our resulting image has a dimension of (n-m+1) x (n-m+1). Generalized further, when an n x m image is convolved with an i x j kernel, our resulting image has a dimensionality of (n-i+1) x (m-j+1).
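The size rule above can be checked with a small helper. This is a minimal sketch: the name conv_output_size is made up for illustration, and the stride, padding, and dilation arguments follow the output-shape formula given in the PyTorch Conv2d documentation rather than anything derived in this article.

```python
def conv_output_size(n, k, stride=1, padding=0, dilation=1):
    """Output length along one dimension for an input of length n and a kernel of length k."""
    return (n + 2 * padding - dilation * (k - 1) - 1) // stride + 1

# With the default stride, padding, and dilation this reduces to n - k + 1.
print(conv_output_size(5, 3))             # 5x5 image, 3x3 kernel -> 3
print(conv_output_size(28, 3))            # 28x28 image, 3x3 kernel -> 26
print(conv_output_size(5, 3, padding=1))  # one pixel of padding keeps the size at 5
```

For a non-square image or kernel, call the helper once for the height and once for the width.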

Now that we understand dimensionality change, let’s look at how the numbers change. In each interaction between the kernel and the original image, the convolution applies position-wise multiplication and sums the results. I’ve kept the values in our conv2d example to 0s and 1s for simplicity, not because image values can only be 0 or 1. The resulting 3 x 3 image should have many values other than 0 or 1.

Looking above we can visualize the math for the top leftmost pixel in the image. When the convolution is carried out we get: 1*1 + 1*0 + 0*0 + 0*0 + 1*1 + 1*1 + 1*0 + 1*1 + 0*1. That results in a value of 4 for the top leftmost point in our convolved image as shown. When we carry out the full 2D convolution (conv2d) on the image, we get the resulting 3×3 image below.
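To make the sliding-window arithmetic concrete, here is a minimal sketch in plain Python. The function name convolve2d and the image and kernel values are illustrative assumptions (they are not necessarily the values in the figure). Like deep learning libraries, it multiplies position-wise without flipping the kernel.

```python
def convolve2d(image, kernel):
    """Slide kernel over image (stride 1, no padding), multiply position-wise, and sum."""
    n_rows, n_cols = len(image), len(image[0])
    k_rows, k_cols = len(kernel), len(kernel[0])
    output = []
    for top in range(n_rows - k_rows + 1):
        row = []
        for left in range(n_cols - k_cols + 1):
            total = 0
            for i in range(k_rows):
                for j in range(k_cols):
                    total += image[top + i][left + j] * kernel[i][j]
            row.append(total)
        output.append(row)
    return output

# Illustrative 5x5 image and 3x3 kernel made of 0s and 1s.
image = [
    [1, 1, 1, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 1, 1, 1],
    [0, 0, 1, 1, 0],
    [0, 1, 1, 0, 0],
]
kernel = [
    [1, 0, 1],
    [0, 1, 0],
    [1, 0, 1],
]

result = convolve2d(image, kernel)
for row in result:
    print(row)
# [4, 3, 4]
# [2, 4, 3]
# [2, 3, 4]
```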

How PyTorch nn.Conv2d Works

nn.Conv2d is the 2D convolution layer in PyTorch. Its constructor has 11 parameters. Of these parameters, three must be specified and eight come with defaults. The three that must be provided are the number of in_channels, the number of out_channels, and kernel_size. In the above example, we have one input channel, one output channel, and a kernel size of 3.

Image from PyTorch Documentation

We just use the defaults for the other eight parameters, but let’s take a look at what they are and what they do.

  • Stride: how far the window moves each time
  • Padding: specifies how many pixels should be added to the height and width of the image. Can be an int or a tuple of (n, m), where n is the number of pixels padded on each side of the height and m is the number of pixels padded on each side of the width. Also allows the string same, which pads the input so that the output is the same size as the input, and the string valid, which is the same as no padding.
  • Dilation: the spacing between kernel elements. A dilation of 1 (the default) means the kernel covers adjacent pixels. This animation gives a good visualization.
  • Groups: how many groups to split the input into. For example, if there are 2 groups, it is the equivalent of having 2 convolutional layers and concatenating the outputs. The input channels would be split into 2 groups and each group would get convolved on its own and then combined at the end.
  • Bias: whether or not to add a learnable bias to the output
  • Padding_mode: allows zeros, reflect, replicate, or circular with a default of zeros. reflect mirrors the values at the border without repeating the last pixel, replicate repeats the value of the last pixel outward, and circular wraps around so that the padding is taken from the opposite edge of the image.
  • Device: used to place the layer’s parameters on a specific device if you want to train your network there (e.g. cuda, mps, or cpu)
  • Dtype: used to set the data type of the layer’s weights and bias. The input is expected to have a matching type.
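The effect of several of these parameters is easiest to see in the output shapes. The sketch below assumes a blank 5×5 single-channel input and uses names of my own choosing (conv_variants, output_shapes, padded). The second half uses torch.nn.functional.pad, which exposes the same reflect, replicate, and circular mode names, to show the padding modes on a single row of four values.

```python
import torch
import torch.nn.functional as F
from torch import nn

x = torch.zeros(1, 1, 5, 5)  # (batch, channels, height, width)

conv_variants = {
    "defaults": nn.Conv2d(1, 1, kernel_size=3),                        # 5 x 5 -> 3 x 3
    "stride=2": nn.Conv2d(1, 1, kernel_size=3, stride=2),              # 5 x 5 -> 2 x 2
    "padding=1": nn.Conv2d(1, 1, kernel_size=3, padding=1),            # 5 x 5 -> 5 x 5
    "padding='same'": nn.Conv2d(1, 1, kernel_size=3, padding="same"),  # 5 x 5 -> 5 x 5
    "dilation=2": nn.Conv2d(1, 1, kernel_size=3, dilation=2),          # 5 x 5 -> 1 x 1
}
output_shapes = {name: tuple(conv(x).shape) for name, conv in conv_variants.items()}
for name, shape in output_shapes.items():
    print(name, shape)

# groups=2 splits the 4 input channels into 2 groups of 2,
# so each of the 8 kernels only sees 2 input channels.
grouped = nn.Conv2d(4, 8, kernel_size=3, groups=2)
print(grouped.weight.shape)  # torch.Size([8, 2, 3, 3])

# Padding modes, shown by padding one row by 2 values on each side.
row = torch.tensor([[[1.0, 2.0, 3.0, 4.0]]])  # (batch, channels, width)
padded = {
    mode: F.pad(row, (2, 2), mode=mode).flatten().tolist()
    for mode in ("reflect", "replicate", "circular")
}
for mode, values in padded.items():
    print(mode, values)
# reflect [3.0, 2.0, 1.0, 2.0, 3.0, 4.0, 3.0, 2.0]
# replicate [1.0, 1.0, 1.0, 2.0, 3.0, 4.0, 4.0, 4.0]
# circular [3.0, 4.0, 1.0, 2.0, 3.0, 4.0, 1.0, 2.0]
```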

Note on Custom Kernels for Conv2d

The torch.nn.Conv2d module does not take a kernel as an argument. It creates its own kernel weights and learns them during training. To apply a specific custom kernel, we can use the torch.nn.functional.conv2d function, which accepts the kernel as its weight argument.
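As a minimal sketch of applying a custom kernel, the snippet below passes a hand-written 3×3 kernel to torch.nn.functional.conv2d. The image and kernel values are illustrative assumptions. The function expects the input shaped as (batch, in_channels, height, width) and the weight shaped as (out_channels, in_channels, kernel_height, kernel_width), so both are reshaped first.

```python
import torch
import torch.nn.functional as F

# Illustrative 5x5 image and 3x3 kernel made of 0s and 1s.
image = torch.tensor([
    [1, 1, 1, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 1, 1, 1],
    [0, 0, 1, 1, 0],
    [0, 1, 1, 0, 0],
], dtype=torch.float32)
kernel = torch.tensor([
    [1, 0, 1],
    [0, 1, 0],
    [1, 0, 1],
], dtype=torch.float32)

output = F.conv2d(image.reshape(1, 1, 5, 5), kernel.reshape(1, 1, 3, 3))
print(output.squeeze())
# tensor([[4., 3., 4.],
#         [2., 4., 3.],
#         [2., 3., 4.]])
```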

What is Max Pooling?

Max Pooling is a type of pooling technique. It downsizes an image to reduce computational cost and complexity. Max pooling is the specific application where we take a “pool” of pixels and replace them with their maximum value. This was the pooling technique used in AlexNet in 2012 and is widely considered the de facto pooling technique to use in convolutional neural networks.

Visual Example of MaxPool2D

For this visualization, we’re going to take an image from Wikipedia. The image below starts with a 4x4 image. We apply a MaxPool2D technique on it with a 2x2 kernel. The default behavior for max pooling is to pool each set of pixels separately. Unlike the convolution, there is no overlap of pixels when pooling by default. PyTorch’s nn.MaxPool2d controls this through the stride parameter, which we cover below.

Max Pooling Image from Wikipedia

How PyTorch nn.MaxPool2d Works

Image from PyTorch Documentation

The PyTorch nn.MaxPool2d layer has six parameters. Only one of these parameters is required while five of them come with defaults. The required parameter is kernel_size. In the visualization above, we had a kernel size of 2. For an exploratory example, watch this video as we explore the kernel_size parameter in nn.MaxPool2d live to understand how it affects the input and output sizes.

For this tutorial we use the defaults for the other parameters, but let’s take a look at what they do:

  • Stride: how far to move the kernel window. Defaults to None, which sets the stride equal to the kernel size and gives the non-overlapping behavior illustrated above. If we had a stride of 1 above, then the window would only move over 1 pixel each time and a 4x4 image would become a 3x3 image.
  • Padding: how many pixels of negative infinity to pad the image with on either side (height or width). Can be an int or a tuple of ints that represents (height, width).
  • Dilation: works like the nn.Conv2d dilation parameter.
  • Return_indices: can be True or False. When True, the torch max pooling function also returns the indices of the max values in each pool.
  • Ceil_mode: whether to use ceil or floor to calculate the output dimensions. When True, pooling windows are allowed to run past the edge of the input, as long as they start inside the input or its left and top padding, so every input pixel is covered.
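The parameters above can be tried on a small tensor. This is a minimal sketch with illustrative 4×4 values (not necessarily those in the figure). The variable names are my own.

```python
import torch
from torch import nn

x = torch.tensor([
    [12.0, 20.0, 30.0, 0.0],
    [8.0, 12.0, 2.0, 0.0],
    [34.0, 70.0, 37.0, 4.0],
    [112.0, 100.0, 25.0, 12.0],
]).reshape(1, 1, 4, 4)  # (batch, channels, height, width)

# Default stride equals the kernel size, so the 2x2 pools do not overlap.
pooled = nn.MaxPool2d(kernel_size=2)(x)
print(pooled.squeeze().tolist())  # [[20.0, 30.0], [112.0, 37.0]]

# A stride of 1 makes the pools overlap: 4x4 -> 3x3.
overlapping = nn.MaxPool2d(kernel_size=2, stride=1)(x)
print(tuple(overlapping.shape))  # (1, 1, 3, 3)

# return_indices=True also returns where each max sits in the flattened 4x4 image.
values, indices = nn.MaxPool2d(kernel_size=2, return_indices=True)(x)
print(indices.squeeze().tolist())  # [[1, 2], [12, 10]]

# ceil_mode changes the output size when the kernel does not tile the input evenly.
odd = torch.zeros(1, 1, 5, 5)
floor_shape = tuple(nn.MaxPool2d(kernel_size=2)(odd).shape)                 # (1, 1, 2, 2)
ceil_shape = tuple(nn.MaxPool2d(kernel_size=2, ceil_mode=True)(odd).shape)  # (1, 1, 3, 3)
print(floor_shape, ceil_shape)
```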

What is Average Pooling?

Like Max Pooling, Average Pooling is a version of the pooling algorithm. Unlike Max Pooling, average pooling does not take the max value within a pool and assign that as the corresponding value in the output image. Average pooling takes the average (mean) of the values within the pool, with some possible parametric changes.

The classic average pooling implementation uses a simple average. However, it is still called average pooling if the pooling technique does not use a simple mean. Changes you can make to the algorithm include counting the padded values, using a different divisor, or using a different type of average. PyTorch’s implementation includes parameters for the first two but does not provide a median or other type of averaging method.

AvgPool2d Visualization

Taking the same example that we looked at for the MaxPool2d visualization, we can see that the values are drastically different. The AvgPool2d implementation leaves us with smaller values than a MaxPool2d implementation. We discuss when to use average pooling or max pooling in the “When to use MaxPool2D vs AvgPool2D” section below.
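The difference between the two pooling methods can be sketched in plain Python by swapping the function applied to each pool. The helper names pool2d and mean and the 4×4 values are illustrative assumptions.

```python
def pool2d(image, size, reduce_fn):
    """Apply reduce_fn to each non-overlapping size x size block of image."""
    pooled = []
    for top in range(0, len(image) - size + 1, size):
        row = []
        for left in range(0, len(image[0]) - size + 1, size):
            block = [image[top + i][left + j] for i in range(size) for j in range(size)]
            row.append(reduce_fn(block))
        pooled.append(row)
    return pooled

def mean(values):
    return sum(values) / len(values)

image = [
    [12, 20, 30, 0],
    [8, 12, 2, 0],
    [34, 70, 37, 4],
    [112, 100, 25, 12],
]

max_pooled = pool2d(image, 2, max)
avg_pooled = pool2d(image, 2, mean)
print(max_pooled)  # [[20, 30], [112, 37]]
print(avg_pooled)  # [[13.0, 8.0], [79.0, 19.5]]
```

Every averaged value is at most the corresponding max value, since the mean of a pool can never exceed its largest element.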

PyTorch AvgPool2d Implementation

The PyTorch Average Pooling layer for flat images is nn.AvgPool2d. There are six parameters for nn.AvgPool2d, only one of which is required. Much like the PyTorch MaxPool2D layer, the PyTorch Average Pooling layer requires a kernel size. Many of the other parameters are similar as well.

The nn.AvgPool2d parameters that come with a default are:

  • Stride: how far to move the kernel window. Defaults to the size of the kernel, just like max pooling behavior.
  • Padding: amount of 0 padding. Uses 0s instead of negative infinities like the PyTorch Max Pooling function. Can be one integer or a tuple defining amount of padding on the height and width.
  • Ceil_mode: works just like the max pooling function, if set to True, uses ceiling instead of floor function to determine output size.
  • Count_include_pad: True by default, which means the padded 0s are counted in the divisor for pools that include padding. When set to False, the padded pixels are not included in the count.
  • Divisor_override: set to an integer to use a specific integer to divide by instead of the pool size.
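A small sketch makes count_include_pad and divisor_override easier to see. It assumes a 2×2 input of all ones, so every number in the output comes from the divisor alone. The variable names are my own.

```python
import torch
from torch import nn

x = torch.ones(1, 1, 2, 2)  # (batch, channels, height, width)

# With padding=1, each 2x2 pool holds one real pixel (1.0) and three padded zeros.
with_pad = nn.AvgPool2d(kernel_size=2, padding=1)(x)
print(with_pad.squeeze().tolist())  # [[0.25, 0.25], [0.25, 0.25]]

# Excluding the padding from the count divides by 1 instead of 4.
without_pad = nn.AvgPool2d(kernel_size=2, padding=1, count_include_pad=False)(x)
print(without_pad.squeeze().tolist())  # [[1.0, 1.0], [1.0, 1.0]]

# divisor_override=1 turns the average into a plain sum over the pool.
summed = nn.AvgPool2d(kernel_size=2, divisor_override=1)(x)
print(summed.item())  # 4.0
```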

When to use MaxPool2D vs AvgPool2D

The main difference between using nn.MaxPool2d and nn.AvgPool2d in images is that max pooling gives a sharper image while average pooling gives a smoother image. Using nn.MaxPool2d is best when we want to retain the most prominent features of the image. Using nn.AvgPool2d is best when we want to retain the essence of an object.

Examples of when to use PyTorch nn.MaxPool2d instead of nn.AvgPool2d include when you have a drastic change in background color, when you are working with dark backgrounds, or when only the outline of an object is salient. Examples of when to use PyTorch nn.AvgPool2d over nn.MaxPool2d include when you are working with images with a variety of colors, when you are working with images with lighter backgrounds, and when you want your network to learn the general shape of an object.

PyTorch CNN Example on Fashion MNIST

Fashion MNIST is a dataset of clothing images published by Zalando Research as a drop-in replacement for the original MNIST digits dataset, which is itself a modified dataset from the National Institute of Standards and Technology. Much like the original MNIST digits dataset that we trained our neural network from scratch on, the Fashion MNIST dataset contains 28x28 grayscale images. It contains 60000 training images and 10000 test images with 10 unique labels. Each of the labels corresponds to a type of clothing. In this section, we’re going to learn how to build a basic PyTorch CNN to classify images in the Fashion MNIST dataset.

We cover how to build the neural network and its associated hyperparameters. The network that we build is a simple PyTorch CNN that consists of Conv2D, ReLU, and MaxPool2D for the convolutional part. It then flattens the result and uses a linear + ReLU + linear set of layers for the fully connected part and prediction.

The skeleton of the PyTorch CNN looks like the code below. It extends the nn.Module class from PyTorch. The two methods that we touch are the __init__ method and the forward method. We define the network in __init__ and implement how a forward pass works in forward. The PyTorch CNN skeleton below includes an implemented forward pass. To see how to train the neural network, check out this video on K-Fold Validation for a PyTorch CNN.

from torch import nn

class NeuralNetwork(nn.Module):
    def __init__(self):
        super(NeuralNetwork, self).__init__()
        self.cnn = nn.Sequential(...)  # placeholder: the layers are filled in below

    def forward(self, X):
        logits = self.cnn(X)
        return logits

nn.Conv2d + ReLU + nn.MaxPool2d

Let’s add the convolutional layers to our PyTorch CNN. In many convolutional neural networks there are multiple convolutional layers, but we build just one as an example. We define one “convolutional layer” as a Conv2D layer + a MaxPool2D layer. In this case we also add a ReLU activation in the middle.

Naturally, the question arises, why do we use a max pooling layer? It is not a mandate of every CNN to use a max pooling layer. As we discussed above, we can use an average pooling layer as well. We can also use pure convolutions with stride. There are three reasons we use a max pooling layer in this example.

First, this example is meant as an introduction to building a PyTorch CNN. Introducing a Max Pooling layer is a classical part of building ConvNets. Second, a max pooling layer (not an average pooling layer) introduces more nonlinearity into the network. This is important for the network to learn better abstractions. Third, you can use it to reduce complexity.

Note that a kernel size of 1 is equivalent to not having a MaxPool2D layer at all. In the code example here, we use an nn.MaxPool2d layer with a kernel size of 2. Our convolution + max pooling layer starts with one 28 x 28 image and ends with four 13 x 13 images. The 4 out_channels in our nn.Conv2d layer mean that the layer learns 4 different kernels, so the network learns 4 different representations.

from torch import nn

class NeuralNetwork(nn.Module):
    def __init__(self):
        super(NeuralNetwork, self).__init__()
        self.cnn = nn.Sequential(
            nn.Conv2d(in_channels=1, out_channels=4, kernel_size=3), # 28 x 28 --> 26 x 26 x 4
            nn.ReLU(), # activation between the convolution and the pooling
            nn.MaxPool2d(kernel_size=2), # 26 x 26 x 4 --> 13 x 13 x 4
        )

    def forward(self, X):
        logits = self.cnn(X)
        return logits
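One way to confirm the sizes quoted in this section is to push a dummy batch through a Conv2d + ReLU + MaxPool2d block and print the shape after each layer. This is a sketch under assumptions: the batch of 8 random images and the name conv_block are made up for illustration. PyTorch orders the dimensions as (batch, channels, height, width), so the four 13 x 13 images appear as 4 x 13 x 13.

```python
import torch
from torch import nn

conv_block = nn.Sequential(
    nn.Conv2d(in_channels=1, out_channels=4, kernel_size=3),
    nn.ReLU(),
    nn.MaxPool2d(kernel_size=2),
)

batch = torch.randn(8, 1, 28, 28)  # hypothetical batch of 8 grayscale 28x28 images

x = batch
for layer in conv_block:
    x = layer(x)
    print(type(layer).__name__, tuple(x.shape))
# Conv2d (8, 4, 26, 26)
# ReLU (8, 4, 26, 26)
# MaxPool2d (8, 4, 13, 13)

# The 4 learned kernels: one 3x3 kernel per output channel.
print(tuple(conv_block[0].weight.shape))  # (4, 1, 3, 3)
```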

Final Fully Connected NN Layers

Once we have set up our convolution and max pooling layer, we add the fully connected (also called “dense”) layers to facilitate prediction. The first thing that we have to do to our convolved image is flatten it. The PyTorch nn.Linear layer operates on the last dimension of its input, so each image’s feature maps need to be flattened into a single vector first. Once we flatten it, we can treat the rest of our PyTorch CNN like any other basic neural network.

For this example, we take our length 676 (13 x 13 x 4) vector and turn it into 64 hidden states. The next layer uses a ReLU activation function for nonlinearity. Finally, we turn the 64 hidden states into 10 output neurons. We use 10 output neurons, one for each class in the Fashion MNIST dataset. The output that has the highest number on it represents the class that the image is most likely to correspond to. Find the full code with training code for our PyTorch CNN on GitHub.

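from torch import nn

class NeuralNetwork(nn.Module):
    def __init__(self):
        super(NeuralNetwork, self).__init__()
        self.cnn = nn.Sequential(
            nn.Conv2d(in_channels=1, out_channels=4, kernel_size=3), # 28 x 28 --> 26 x 26 x 4
            nn.ReLU(), # activation between the convolution and the pooling
            nn.MaxPool2d(kernel_size=2), # 26 x 26 x 4 --> 13 x 13 x 4
            nn.Flatten(), # --> vector of length 13 x 13 x 4 = 676
            nn.Linear(13*13*4, 64),
            nn.ReLU(), # nonlinearity between the two linear layers
            nn.Linear(64, 10) # one output per Fashion MNIST class
        )

    def forward(self, X):
        logits = self.cnn(X)
        return logits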
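To see how the 10 outputs become a prediction, the sketch below runs a random stand-in batch through the same stack of layers and takes the index of the largest output per image. The single training step at the end is an assumption for illustration only: the choice of CrossEntropyLoss, SGD, the learning rate, and the random labels are mine and are not taken from the article’s training code.

```python
import torch
from torch import nn

# Same layer stack as described in this section, written inline so the snippet runs on its own.
model = nn.Sequential(
    nn.Conv2d(in_channels=1, out_channels=4, kernel_size=3),
    nn.ReLU(),
    nn.MaxPool2d(kernel_size=2),
    nn.Flatten(),
    nn.Linear(13 * 13 * 4, 64),
    nn.ReLU(),
    nn.Linear(64, 10),
)

images = torch.randn(8, 1, 28, 28)    # random stand-in for a batch of Fashion MNIST images
labels = torch.randint(0, 10, (8,))   # random stand-in labels

logits = model(images)                # one row of 10 scores per image
predicted = logits.argmax(dim=1)      # index of the highest score = predicted class
print(tuple(logits.shape), tuple(predicted.shape))  # (8, 10) (8,)

# Hypothetical single training step (loss and optimizer are assumptions).
loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
loss = loss_fn(logits, labels)
optimizer.zero_grad()
loss.backward()
optimizer.step()
```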

Summary of PyTorch Convolutional Neural Networks

In this article, we learned about Convolutional Neural Networks. We took a look at the math behind convolutions, max pooling, and average pooling. Then, we built a PyTorch CNN for practice. CNNs are primarily used for image recognition. However, they can be applied to other tasks as well.

The example PyTorch CNN we built assumes that we are training on 28x28 images as in the Fashion MNIST dataset. We use the nn.Conv2d and nn.MaxPool2d layers. If we want to work with different images, such as 3D brain scans, we would use the nn.Conv3d and nn.MaxPool3d layers. Alternatively, if our task was to look for the basic shape of objects instead of their outlines, we may choose to use average pooling through nn.AvgPool2d.
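As a rough sketch of the 3D case, the 3D layers take one extra depth dimension in the input. The volume size below is made up for illustration and does not correspond to any real scan format.

```python
import torch
from torch import nn

# (batch, channels, depth, height, width) with made-up sizes
volume = torch.randn(1, 1, 16, 32, 32)

block_3d = nn.Sequential(
    nn.Conv3d(in_channels=1, out_channels=4, kernel_size=3),  # 16 x 32 x 32 -> 14 x 30 x 30
    nn.ReLU(),
    nn.MaxPool3d(kernel_size=2),                              # 14 x 30 x 30 -> 7 x 15 x 15
)

output_3d = block_3d(volume)
print(tuple(output_3d.shape))  # (1, 4, 7, 15, 15)
```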

Yujian Tang
