
The PyTorch CNN Beginners Guide

Image processing boomed after the 2012 introduction of AlexNet, which used a Convolutional Neural Network (CNN) to increase accuracy on image processing tasks. There are two major modern frameworks for building neural network (NN) models today: PyTorch and TensorFlow. In this post, we’re going to learn how to build and train a PyTorch CNN. Before that, we will also explore what a CNN is and understand its inner workings so we are prepared to implement our own PyTorch CNN. We’ll cover how to build a CNN in TensorFlow in a future article.

Find examples for building a PyTorch CNN on GitHub here. If you learn better through videos, check out this video on Building and Debugging a CNN.

In this guide to PyTorch CNN building we go over:

  • An Introduction to Convolutional Neural Networks
  • What is a Convolution?
    • Visual Example of the Math Behind an Image Convolution
    • How PyTorch nn.Conv2d Works
  • What is Max Pooling?
    • Visual Example of how MaxPool2D Works
    • How PyTorch nn.MaxPool2d Works
  • What is Average Pooling?
    • Average Pooling PyTorch Visualization
    • Average Pooling PyTorch Implementation
  • When to use MaxPool2D vs AvgPool2D
  • PyTorch CNN Example on Fashion MNIST
    • nn.Conv2d + ReLU + nn.MaxPool2d
    • Torch Flatten for Final Fully Connected NN Layers
  • Summary of PyTorch Convolutional Neural Networks

Introduction to Convolutional Neural Networks

Typical CNN from Wikipedia

The definitive features of convolutional neural networks are a convolution layer and a pooling layer. The AlexNet paper uses max pooling in its pooling layers. It is important to note that this is not the only pooling method. There are other forms, like average pooling and min pooling, as well as other ways to tune pooling, such as local or global pooling. In this article we’ll cover max pooling and average pooling in PyTorch.

What is a Convolution?

Image from Wikipedia

In math, a convolution is an operation on two functions that produces a third function describing how the shape of one is modified by the other. In convolutional neural networks, a convolution modifies an image with a kernel and produces another image of (usually) different dimensions.

For most applications of convolutions over an image, we can visualize it as sliding a window of values across our image. Libraries like PyTorch offer ways to do convolutions over 1 dimension (nn.Conv1d), 2 dimensions (nn.Conv2d), or 3 dimensions (nn.Conv3d). That’s not to say you can’t do convolutions over 4, 5, or more dimensions; it’s just not a common enough task to come built into the library.

Visual Example of the Math Behind an Image Convolution

Let’s cover an example of a convolution to understand it. In this example, we take a 5×5 image and apply a 2D Convolution (nn.conv2d) with a 3×3 kernel (kernel_size=3). We start by aligning the kernel with the top left corner. Then we “slide” the kernel along the image until we get to the rightmost side of the image. In this example, we end up with 3 convoluted pixels from that slide. Next, we move the kernel back to the leftmost side of the image, but down one pixel from the top and repeat the slide from left to right. 

We repeat the slide from left to right until our image has been completely covered by the kernel. When an n x n image is convoluted using an m x m kernel, our resulting image has a dimension of (n-m+1) x (n-m+1). Generalized further, when an n x m image is convoluted using an i x j kernel, our resulting image has a dimensionality of (n-i+1) x (m-j+1).

Now that we understand the dimensionality change, let’s look at how the numbers change. In each interaction between the kernel and the original image, the convolution applies position-wise multiplication and sums the results. I’ve kept the values in our conv2d example to 0s and 1s for simplicity, not because image values can only be 0 or 1. The resulting 3 x 3 image still contains values other than 0 and 1.

Looking above, we can visualize the math for the top leftmost pixel in the image. When the convolution is carried out we get: 1*1 + 1*0 + 0*0 + 0*0 + 1*1 + 1*1 + 1*0 + 1*1 + 0*1. That results in a value of 4 for the top leftmost point in our convoluted image as shown. When we carry out the full 2D convolution (conv2d) on the image, we get the resulting 3×3 image below.
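
To check that arithmetic in code, here is a minimal sketch of the position-wise multiply-and-sum for a single 3×3 patch. The values come straight from the equation above; which factor plays the role of the image patch and which plays the kernel is interchangeable for this illustration.

import torch

# the nine products from the equation above, arranged as a 3x3 patch and a 3x3 kernel
patch = torch.tensor([[1., 1., 0.],
                      [0., 1., 1.],
                      [1., 1., 0.]])
kernel = torch.tensor([[1., 0., 0.],
                       [0., 1., 1.],
                       [0., 1., 1.]])

# position-wise multiplication followed by a sum
value = (patch * kernel).sum()
print(value)  # tensor(4.) -- the top leftmost pixel of the convoluted image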

How PyTorch nn.Conv2d Works

torch.nn.Conv2d is the 2D convolution layer in PyTorch. The nn.Conv2d module has 11 parameters. Of these parameters, three must be specified and eight come with defaults. The three that must be provided are the number of in_channels, the number of out_channels, and kernel_size. In the above example, we have one input channel, one output channel, and a kernel size of 3.

Image from PyTorch Documentation

We just use the defaults for the other eight parameters, but let’s take a look at what they are and what they do. A short usage example follows the list.

  • Stride: how far the window moves each time
  • Padding: how many pixels to add to the height and width of the image. Can be a tuple of (n, m) where n is the number of pixels padded on the height and m is the number padded on the width. Also accepts same, which pads the input so that the output is the same size as the input, and valid, which is the same as no padding.
  • Dilation: the spacing between kernel elements, i.e. how many pixels sit between each kernel value. This animation gives a good visualization.
  • Groups: how many groups to split the input into. For example, if there are 2 groups, it is the equivalent of having 2 convolutional layers and concatenating the outputs. The input layers would be split into 2 groups and each group would get convoluted on its own and then combined at the end.
  • Bias: whether or not to add learnable bias to output
  • Padding_mode: allows zeros, reflect, replicate, or circular with a default of zeros. reflect mirrors the values without repeating the edge pixel, replicate repeats the values of the edge pixel, and circular wraps the padding around from the opposite edge of the image.
  • Device: used to set a device if you want to train your network on a specific device (i.e. cuda, mps, or cpu)
  • Dtype: used to set the expected type for the input.
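
As a minimal sketch of the required arguments in practice (the tensor values here are arbitrary, not taken from the example above), creating the layer and pushing a single-channel 5×5 image through it looks like this:

import torch
import torch.nn as nn

# one input channel, one output channel, 3x3 kernel; every other parameter left at its default
conv = nn.Conv2d(in_channels=1, out_channels=1, kernel_size=3)

# a dummy batch of one single-channel 5x5 image: (batch, channels, height, width)
image = torch.rand(1, 1, 5, 5)

output = conv(image)
print(output.shape)  # torch.Size([1, 1, 3, 3]) -- (5 - 3 + 1) in each spatial dimension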

Note on Custom Kernels for Conv2d

The torch.nn.Conv2d module doesn’t provide functionality for a custom kernel; its kernel weights are learned. In order to apply a custom kernel, we need to use the torch.nn.functional.conv2d function.
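
Here is a hedged sketch of that approach, using an illustrative binary image and kernel (not necessarily the exact values in the figure above):

import torch
import torch.nn.functional as F

# illustrative 5x5 binary image: (batch, channels, height, width)
image = torch.tensor([[[[1., 1., 1., 0., 0.],
                        [0., 1., 1., 1., 0.],
                        [0., 0., 1., 1., 1.],
                        [0., 0., 1., 1., 0.],
                        [0., 1., 1., 0., 0.]]]])

# custom 3x3 kernel: (out_channels, in_channels, height, width)
kernel = torch.tensor([[[[1., 0., 1.],
                         [0., 1., 0.],
                         [1., 0., 1.]]]])

output = F.conv2d(image, kernel)
print(output.shape)  # torch.Size([1, 1, 3, 3])
print(output)        # each value is the sum of a position-wise product of the kernel and an image patch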

What is Max Pooling?

Max Pooling is a type of pooling technique. It downsizes an image to reduce computational cost and complexity. Max pooling is the specific application where we take a “pool” of pixels and replace them with their maximum value. This was the pooling technique applied on AlexNet in 2012 and is widely considered the de facto pooling technique to use in convolutional neural networks.

Visual Example of MaxPool2D

For this visualization, we’re going to take an image from Wikipedia. The image below starts with a 4x4 image. We apply a MaxPool2D operation on it with a 2x2 kernel. The default behavior for max pooling is to pool each set of pixels separately; unlike the convolution, the pools do not overlap. nn.MaxPool2d in PyTorch controls this behavior through the stride parameter, which we cover below.

Max Pooling Image from Wikipedia

How PyTorch nn.MaxPool2d Works

Image from PyTorch Documentation

The PyTorch nn.MaxPool2d function has six parameters. Only one of these parameters is required while five of them come with defaults. The required parameter is kernel_size. In the visualization above, we had a kernel size of 2. For an exploratory example, watch this video as we explore the kernel_size parameter in nn.MaxPool2d live to understand how it affects the input and output sizes.

For this tutorial we use the defaults for the other parameters, but let’s take a look at what they do:

  • Stride: how far to move the kernel window. The default of None makes the stride equal to the kernel size, which gives the non-overlapping behavior illustrated above. If we had a stride of 1 above, the window would only move over 1 pixel each time and a 4x4 image would become a 3x3 image.
  • Padding: how many pixels of negative infinity to pad the image with on either side (height or width). Can be an int or a tuple of ints that represents (height, width).
  • Dilation: works like the nn.Conv2d dilation parameter.
  • Return_indices: can be True or False. When True, the torch max pooling function also returns the indices of the max values in each pool.
  • Ceil_mode: whether to use ceiling or floor when calculating the output dimensions. When True, pooling windows may run off the bottom and right edges as long as they start within the input or the left padding.
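
Putting it together, here is a minimal sketch of nn.MaxPool2d with a kernel size of 2 on an illustrative 4×4 input (the values are made up, not taken from the Wikipedia image):

import torch
import torch.nn as nn

pool = nn.MaxPool2d(kernel_size=2)  # stride defaults to the kernel size, so the pools do not overlap

# one single-channel 4x4 image: (batch, channels, height, width)
image = torch.tensor([[[[1., 3., 2., 4.],
                        [5., 6., 7., 8.],
                        [3., 2., 1., 0.],
                        [1., 2., 3., 4.]]]])

print(pool(image))
# tensor([[[[6., 8.],
#           [3., 4.]]]]) -- the max of each non-overlapping 2x2 pool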

What is Average Pooling?

Like Max Pooling, Average Pooling is a version of the pooling algorithm. Instead of taking the max value within a pool and assigning that as the corresponding value in the output image, average pooling takes the average (mean) of the values within the pool, with some possible parametric changes.

The classic average pooling implementation uses a simple average. However, it is still called average pooling even if the technique does not use a simple mean. Changes you can make to the algorithm include counting the padded values, using a different divisor, or using a different type of average. PyTorch’s implementation includes parameters to automatically handle the first two but does not auto-implement a median or other type of averaging method.

AvgPool2d Visualization

Taking the same example that we looked at for the MaxPool2d visualization, we can see that the values are drastically different. The AvgPool2d implementation leaves us with smaller values than a MaxPool2d implementation. We discuss when to use average pooling or max pooling in the “When to use MaxPool2D vs AvgPool2D” section below.

PyTorch AvgPool2d Implementation

The PyTorch Average Pooling function for 2D images is nn.AvgPool2d. There are six parameters for nn.AvgPool2d, only one of which is required. Much like the PyTorch MaxPool2d function, the PyTorch Average Pooling function requires a kernel size. Many of the other parameters are similar as well.

The nn.AvgPool2d parameters that come with a default are:

  • Stride: how far to move the kernel window. Defaults to the size of the kernel, just like max pooling behavior.
  • Padding: amount of 0 padding. Uses 0s instead of negative infinities like the PyTorch Max Pooling function. Can be one integer or a tuple defining amount of padding on the height and width.
  • Ceil_mode: works just like the max pooling function, if set to True, uses ceiling instead of floor function to determine output size.
  • Count_include_pad: when True (the default), padded zeros are counted in the divisor for pools that include padding. When set to False, the padded pixels are excluded from the count.
  • Divisor_override: set to an integer to use a specific integer to divide by instead of the pool size.
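
For comparison, here is a minimal sketch of nn.AvgPool2d on the same illustrative 4×4 tensor used in the max pooling sketch above, including divisor_override as an example of tweaking the divisor:

import torch
import torch.nn as nn

pool = nn.AvgPool2d(kernel_size=2)  # stride again defaults to the kernel size

image = torch.tensor([[[[1., 3., 2., 4.],
                        [5., 6., 7., 8.],
                        [3., 2., 1., 0.],
                        [1., 2., 3., 4.]]]])

print(pool(image))
# tensor([[[[3.7500, 5.2500],
#           [2.0000, 2.0000]]]]) -- the mean of each 2x2 pool

# divisor_override divides each pool's sum by 2 instead of the pool size of 4
pool_div = nn.AvgPool2d(kernel_size=2, divisor_override=2)
print(pool_div(image))
# tensor([[[[ 7.5000, 10.5000],
#           [ 4.0000,  4.0000]]]])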

When to use MaxPool2D vs AvgPool2D

The main difference between using MaxPool2d and AvgPool2d on images is that max pooling gives a sharper image while average pooling gives a smoother image. Using nn.MaxPool2d is best when we want to retain the most prominent features of the image. Using nn.AvgPool2d is best when we want to retain the essence of an object.

Examples of when to use PyTorch nn.MaxPool2d instead of nn.AvgPool2d include when you have a drastic change in background color, when you are working with dark backgrounds, or when only the outline of an object is salient. Examples of when to use PyTorch nn.AvgPool2d over nn.MaxPool2d include when you are working with images with a variety of colors, when you are working with images with lighter backgrounds, and when you want your network to learn the general shape of an object.

PyTorch CNN Example on Fashion MNIST

The Fashion MNIST dataset is a drop-in replacement for the MNIST (Modified National Institute of Standards and Technology) digits dataset. Much like the original MNIST digits dataset that we trained our neural network from scratch on, the Fashion MNIST dataset contains 28x28 grayscale images. It contains 60,000 training images and 10,000 test images with 10 unique labels. Each of the labels corresponds to a type of clothing. In this section, we’re going to learn how to build a basic PyTorch CNN to classify images in the Fashion MNIST dataset. A short sketch of loading the data follows.
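
As a minimal sketch of getting the data (assuming torchvision is installed alongside PyTorch; this loading code is separate from the network itself), the dataset can be pulled down like this:

import torch
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

# download Fashion MNIST and convert the images to tensors
train_data = datasets.FashionMNIST(root="data", train=True, download=True, transform=transforms.ToTensor())
test_data = datasets.FashionMNIST(root="data", train=False, download=True, transform=transforms.ToTensor())

train_loader = DataLoader(train_data, batch_size=64, shuffle=True)

images, labels = next(iter(train_loader))
print(images.shape)  # torch.Size([64, 1, 28, 28]) -- single-channel 28x28 images
print(labels.shape)  # torch.Size([64]) -- one clothing label per image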

We cover how to build the neural network and its associated hyperparameters. The network that we build is a simple PyTorch CNN that consists of Conv2D, ReLU, and MaxPool2D for the convolutional part. It then flattens the input and uses a linear + ReLU + linear set of layers for the fully connected part and prediction. 
The skeleton of the PyTorch CNN looks like the code below. It extends the nn.Module object from PyTorch. The two functions that we touch are the __init__ function and the forward function. We define the network in the __init__ function and implement how a forward pass works in the forward function. The PyTorch CNN skeleton below includes an implemented forward pass. To see how to train the neural network, check out this video on K-Fold Validation for a PyTorch CNN.

class NeuralNetwork(nn.Module):
   def __init__(self):
       super(NeuralNetwork, self).__init__()
       self.cnn = nn.Sequential(...)


   def forward(self, X):
       logits = self.cnn(X)
       return logits

nn.Conv2d + ReLU + nn.MaxPool2d

Let’s add the convolutional layers to our PyTorch CNN. In many convolutional neural networks there are multiple convolutional layers, but we build just one as an example. We define one “convolutional layer” as a Conv2D layer + a MaxPool2D layer. In this case we also add a ReLU activation in the middle.

Naturally, the question arises: why do we use a max pooling layer? Not every CNN has to use one. As we discussed above, we can use an average pooling layer as well. We can also use pure convolutions with stride. There are three reasons we use a max pooling layer in this example.

First, this example is meant as an introduction to building a PyTorch CNN, and introducing a max pooling layer is a classical part of building ConvNets. Second, a max pooling layer (not an average pooling layer) introduces more nonlinearity into the network, which is important for the network to learn better abstractions. Third, it reduces the spatial dimensions of the feature maps and therefore the computational cost.

Note that a kernel size of 1 is equivalent to not having a MaxPool2D layer at all. In the code example here, we use an nn.MaxPool2d layer with a kernel size of 2. Our convolution + max pooling layer starts with a 28 x 28 image and ends with four 13 x 13 feature maps. The 4 out_channels in our nn.Conv2d layer mean that the layer learns 4 different kernels, so the network learns 4 different representations.

class NeuralNetwork(nn.Module):
   def __init__(self):
       super(NeuralNetwork, self).__init__()
       self.cnn = nn.Sequential(
           nn.Conv2d(in_channels=1, out_channels=4, kernel_size=3), # 28 x 28 --> 26 x 26 x 4
           nn.ReLU(),
            nn.MaxPool2d(kernel_size=2), # 13 x 13 x 4
           ...
       )


   def forward(self, X):
       logits = self.cnn(X)
       return logits

Final Fully Connected NN Layers

Once we have set up our convolution and max pooling layer, we add the fully connected (also called “dense”) layers to facilitate prediction. The first thing that we have to do to our convoluted image is flatten it. The PyTorch nn.Linear layer is only able to take flattened vectors. Once we flatten it, we can treat the rest of our PyTorch CNN like any other basic neural network.

For this example, we take our length 676 (13 x 13 x 4) vector and turn it into 64 hidden states. The next layer uses a ReLU activation function for nonlinearity. Finally, we turn the 64 hidden states into 10 output neurons, one to represent each class in the Fashion MNIST dataset. The output neuron with the highest value represents the class that the image most likely corresponds to. Find the full code with training code for our PyTorch CNN on GitHub.

class NeuralNetwork(nn.Module):
   def __init__(self):
       super(NeuralNetwork, self).__init__()
       self.cnn = nn.Sequential(
           nn.Conv2d(in_channels=1, out_channels=4, kernel_size=3), # 28 x 28 --> 26 x 26 x 4
           nn.ReLU(),
           nn.MaxPool2d(kernel_size=2), # 13 x 13 x 4
            nn.Flatten(), # 13 x 13 x 4 --> 676
           nn.Linear(13*13*4, 64),
           nn.ReLU(),
           nn.Linear(64, 10)
       )


   def forward(self, X):
       logits = self.cnn(X)
       return logits
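
As a quick sanity check of the finished skeleton (a sketch assuming the NeuralNetwork class and imports above), we can pass a dummy batch shaped like Fashion MNIST images through the network and confirm the output shape:

import torch

# requires the NeuralNetwork class defined above (and import torch.nn as nn for that definition)
model = NeuralNetwork()
dummy = torch.rand(64, 1, 28, 28)  # a batch of 64 single-channel 28x28 images
logits = model(dummy)
print(logits.shape)  # torch.Size([64, 10]) -- one logit per clothing class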

Summary of PyTorch Convolutional Neural Networks

In this article, we learned about Convolutional Neural Networks. We took a look at the math behind convolutions, max pooling, and average pooling. Then, we built a PyTorch CNN for practice. CNNs are primarily used for image recognition; however, they can be applied to other tasks as well.

The example PyTorch CNN we built assumes that we are training on 28x28 images as in the MNIST dataset. We use the nn.Conv2d and nn.MaxPool2d layers. If we want to work with different images, such as 3D brain scans, we would use the nn.Conv3d and nn.MaxPool3d layers. Alternatively, if our task was to look for the basic shape of objects instead of their outlines, we may choose to use average pooling through nn.AvgPool2d.


What is an Encoder Decoder Model?

I recently had someone ask me what an encoder decoder model was and how an encoder decoder Long-Short Term Memory (LSTM) model is different from a regular (or stacked) LSTM. I’ve worked with many different kinds of RNNs, including LSTMs, and we’ve explored them on this blog. However, encoder decoder models are different from regular RNNs: they are set up so that one part of the model “encodes” something and the next part of the model “decodes” it.

If you are simply trying to understand how these models work, I suggest just reading the conceptual parts. Feel free to read the technical portions for your own entertainment. They are mostly directed at aspiring machine learning engineers and people looking to build their own encoder decoder models.

While I was learning about encoder decoder models, I found myself asking three main types of questions. One, what does it do conceptually? Two, how can I make one, technically? And three, what makes these models special? In this article, we answer all of these questions. We cover:

  • What is an Encoder Decoder Model?
    • What does an Encoder Decoder do, conceptually?
    • What does an Encoder Decoder do, technically?
    • How are Encoder Decoder Models different from stacked RNNs?
  • Why would you use an Encoder Decoder Model?
    • Applications of Encoder Decoder Models
  • How to make your own Encoder Decoder Model Interface
    • The Encoder Class Interface
    • The Decoder Class Interface
    • The Encoder-Decoder Class Interface
  • Summary of Encoder Decoder Models

What is an Encoder Decoder Model?

Encoder Decoder Models are a type of artificial neural network. They fall into the “deep” neural network category because they require multiple layers of neurons. They are also part of the seq2seq set of models because they take one sequence and transform it into another. Encoder Decoder Models came to mainstream machine learning attention for their use in language translation. We touch on this later in the applications section. 

There are many types of encoding decoding structures for seq2seq models. The most popular of these is “attention”. Attention models became popular through the 2017 paper, Attention is All You Need. The decoder in an attention model selectively focuses on different parts of the input at each step, which is a strong optimization for written text. Extensions of attention techniques include beam search, which uses a heuristic probabilistic search, and bucketing, which pads the sequences.

What does an Encoder Decoder do, conceptually?

Conceptually, an encoder decoder model is quite simple. It consists of an encoder that turns the input into some encoded value, referred to as a “hidden state”, and a decoder that turns that hidden state into an output. Usually the decoder takes more inputs than just a hidden state. For example, when implementing an attention mechanism, the decoder also needs a matrix of alignment scores.

Let’s cover a more concrete example. Imagine that you need to translate between English and Pig Latin. Your input to the encoder may be “machine learning”. In this case, your desired output is “achine-may earning-lay”. Your encoder would encode the words “machine learning” into a hidden state of “m + achine” for “machine” and “l + earning” for “learning”. Then, your decoder would take the input of “m achine l earning” and apply the rules of Pig Latin to get your desired output, “achine-may earning-lay”.

Note that this is a contrived example intended to help you understand how an encoder-decoder model works conceptually. In reality, the hidden state would not be separated words, but a set of vectors (read: numbers). We are just using the words to indicate an example of how an input could be “encoded” and fed to the decoder to get an output. In the next section, we take a closer look at the technical details of how an encoder decoder works.

What does an Encoder Decoder do, technically?

Now that we have a conceptual understanding of encoder decoders, let’s take a deeper dive into the technical details. As we saw above, the encoder takes an input, such as “machine learning” and encodes it into a hidden state that it passes to the decoder. The conceptual example of the hidden state that we gave above was a representation of a middle step in the translation from English to Pig Latin. In reality, this hidden state is a matrix or vector.

Without an attention mechanism, which we represent above as the rules of Pig Latin but is really also a matrix, the encoder usually passes a vector, or a 1-dimensional matrix. However, this presents a challenge. Passing only a vector imposes a bottleneck on the size of the input. The longer the input, the more difficult it is to put all that information into one vector. Today, most encoder-decoder models use attention mechanisms as a way to get around this artificial limit.

The image above shows a “traditional” or unmodified encoder decoder model. Assuming that there is no attention mechanism (read: no additional input), the encoder model passes a vector to the decoder model. The decoder model then uses this vector as its only input and produces an output.

The second image shows what an encoder decoder model that uses an attention mechanism could look like. In this case, the encoder could, and should, pass stacked vectors, which are interpreted as a matrix by the decoder. The decoder also takes an attention matrix as an input and combines the two inputs. There are many ways to combine these matrices: additively, multiplicatively, or with any other operation that mathematically combines two matrices.
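
As one hedged illustration of the additive option, the sketch below scores each encoder output against a decoder hidden state and builds a weighted context vector, in the style of additive (Bahdanau) attention. The layer names and sizes are assumptions made for the sketch, not a reference implementation of any particular paper’s variant.

import torch
import torch.nn as nn

hidden = 16
# hypothetical projection layers for an additive attention score
W_enc = nn.Linear(hidden, hidden)
W_dec = nn.Linear(hidden, hidden)
v = nn.Linear(hidden, 1)

enc_outputs = torch.randn(1, 5, hidden)  # encoder outputs: (batch, source length, hidden)
dec_hidden = torch.randn(1, hidden)      # current decoder hidden state: (batch, hidden)

# score each source position: v^T tanh(W_enc * h_enc + W_dec * h_dec)
scores = v(torch.tanh(W_enc(enc_outputs) + W_dec(dec_hidden).unsqueeze(1)))  # (1, 5, 1)
weights = torch.softmax(scores, dim=1)                                       # attention weights over source positions
context = (weights * enc_outputs).sum(dim=1)                                 # (1, hidden) context vector for the decoder
print(context.shape)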

How are Encoder Decoder Models different from stacked RNNs?

Let’s loop back to the question that inspired this article. How are encoder decoder LSTMs (which are a form of RNNs) different from stacked LSTMs? The main difference is in how the information is passed between layers. In a traditional stacked RNN model, the layers pass information to each other without taking a step to combine all the information in one layer.

In an encoder-decoder model, all the information in the encoder is combined into one matrix at the last layer before being passed to the decoder. Another big difference is that encoder decoder models can work with sequences of different lengths. A stacked LSTM or RNN model produces output sequences of the same length as input sequences. Encoder decoder models do not require that. 

Why would you use an Encoder Decoder Model?

The answer to this question is really quite simple. Encoder decoder models can provide better performance than a traditional stacked LSTM model. Another reason to use an encoder decoder model is that it can handle inputs and outputs of different lengths as mentioned above. In a general sense, encoder-decoder models are ideal for any machine learning task that requires context. Let’s look at some existing applications below.

Applications of Encoder Decoder Models

There are three main domains associated with the application of encoder decoder models. The most well known application is probably machine translation. Encoder decoder models are great for translating between languages. The input and output don’t have to be the same length, and applying attention allows the neural network to take the whole context of the sentence into account. This is especially important for cases where different parts of sentences are placed differently according to the language. 

A second common application of encoder decoders is image captioning. Attention models are particularly useful in this case as illustrated in the paper Show, Attend, and Tell. Different attention models yield different outcomes as shown in the two images below (taken from the Show, Attend, and Tell paper). The “hard” attention model uses a stochastic approach and samples the image on each word generation. The “soft” attention model uses a deterministic approach by taking the expectation of the context vector.

Figure 9 from Show, Attend, and Tell, using the “hard” attention model
Figure 10 from Show, Attend, and Tell, using the “soft” attention model

The third domain that often applies encoder decoder models is sentiment analysis. Similar to the tasks above, sentiment analysis produces a different number of output tokens from the number of input tokens. Getting the sentiment of a sentence also requires context of the whole sentence. While it is possible to produce sentiment for each word, it’s not exactly useful for an overall sentence. For example, the sentence “that was super bad”, may produce an average sentiment when considering each word, but produces a low sentiment when considering the whole sentence with context.

How to make your own Encoder Decoder Model Interface

Creating a full Encoder Decoder Model is quite a feat, and we will leave building an entire neural network to a future article. If you have a burning desire to code your own Encoder Decoder Model right now, check out the code from The Annotated Transformer. In this article, we are going to cover the interface for an Encoder Decoder Model to set you up to understand how you can create your own.

The example code that we look at below uses PyTorch and Python 3.9. To get started, you need to run `pip install torch` in your terminal. Then we need to import the Neural Network module from PyTorch with the line of code below.

import torch.nn as nn

The Encoder Class Interface

Our first order of business in creating an encoder decoder model interface is the Encoder class interface. Our encoder class extends the PyTorch neural network module. We need two functions in this interface: first, the classic `__init__` function; second, a feed forward function.

The `__init__` function takes a set of keyword arguments and calls the `torch.nn.Module` init function, via `super`, with those arguments. We don’t implement our `forward` function here. We only define that it needs to take an input matrix, `X`. This function implements the logic for what the Encoder does with an input. For now, we raise a `NotImplementedError` to let us know that the function has not been implemented yet.

class Encoder(nn.Module):
   def __init__(self, **kwargs):
       super(Encoder, self).__init__(**kwargs)
  
   # X is the input matrix to the network
   def forward(self, X):
       raise NotImplementedError

The Decoder Class Interface

Next up is the Decoder class interface. Much like the Encoder interface, the Decoder interface extends the `nn.Module` object. The init function also works in the same way by calling `super` and passing the keyword arguments. Unlike the Encoder class, the Decoder interface has two functions to implement.

First, we need a function to get the hidden state. This function requires the output of the encoder function and optionally takes a list of arguments. The second function is a feed forward function again. This time, the feed forward function takes an input to the decoder and the state returned by the `init_state` function. Both of these functions will raise `NotImplementedError`s for now.

class Decoder(nn.Module):
   def __init__(self, **kwargs):
       super(Decoder, self).__init__(**kwargs)
  
   # decoder needs encoder outputs and some args
   # returns the state
   def init_state(self, enc_outputs, *args):
       raise NotImplementedError
  
   # X is the input matrix for the decoder (eg pig latin rules)
   def forward(self, X, state):
       raise NotImplementedError

The Encoder-Decoder Class Interface

With both the Encoder and Decoder Class interfaces written, we move to writing the Encoder-Decoder Class Interface. The Encoder-Decoder Class Interface is set up much like the Encoder Class. It has two functions: `__init__` and `forward`. Its init function starts off much like the Encoder and Decoder Class init functions – by calling `super` and passing keyword arguments. After calling the `nn.Module` init function via `super`, it also stores the passed in encoder and decoder objects as attributes.

The feed forward function for the Encoder-Decoder Class Interface takes the two inputs for the encoder and decoder and an optional list of arguments. As we saw above, an Encoder Decoder Model gets a hidden state out of its encoder and passes that to the decoder. The feed forward function first gets the outputs of the encoder. Then, it translates those outputs into a state for the decoder using the decoder’s `init_state` function. Finally, it calls the decoder with the decoder input and that state and returns the result as the output.

class EncoderDecoder(nn.Module):
   def __init__(self, encoder, decoder, **kwargs):
       super(EncoderDecoder, self).__init__(**kwargs)
       self.encoder = encoder
       self.decoder = decoder
  
   def forward(self, enc_X, dec_X, *args):
       enc_outputs = self.encoder(enc_X)
       dec_state = self.decoder.init_state(enc_outputs, *args)
       return self.decoder(dec_X, dec_state)
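
To show how these interfaces could be filled in, here is a minimal, hypothetical pair of subclasses built on nn.GRU (assuming the Encoder, Decoder, and EncoderDecoder classes above). The layer type and sizes are illustrative assumptions, not part of the interface itself, and a real model would add embeddings, teacher forcing, and so on.

import torch
import torch.nn as nn

class GRUEncoder(Encoder):
    def __init__(self, input_size, hidden_size, **kwargs):
        super(GRUEncoder, self).__init__(**kwargs)
        self.rnn = nn.GRU(input_size, hidden_size, batch_first=True)

    def forward(self, X):
        # returns (all hidden states, final hidden state)
        return self.rnn(X)

class GRUDecoder(Decoder):
    def __init__(self, input_size, hidden_size, output_size, **kwargs):
        super(GRUDecoder, self).__init__(**kwargs)
        self.rnn = nn.GRU(input_size, hidden_size, batch_first=True)
        self.out = nn.Linear(hidden_size, output_size)

    def init_state(self, enc_outputs, *args):
        # use the encoder's final hidden state as the decoder's initial state
        return enc_outputs[1]

    def forward(self, X, state):
        outputs, state = self.rnn(X, state)
        return self.out(outputs), state

# wire the pieces together with the EncoderDecoder class above
model = EncoderDecoder(GRUEncoder(8, 16), GRUDecoder(8, 16, 8))
enc_X = torch.randn(2, 5, 8)  # a batch of 2 source sequences of length 5
dec_X = torch.randn(2, 7, 8)  # a batch of 2 target sequences of length 7
outputs, _ = model(enc_X, dec_X)
print(outputs.shape)          # torch.Size([2, 7, 8]) -- a different length than the source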

Summary of Encoder Decoder Models

In this article, we learned what Encoder Decoder Models are and how they work both conceptually and technically. We also explored how you can begin to implement your own version of an Encoder Decoder Model. Encoder Decoder models are neural networks that are split into two functional components: an encoder that turns a sequence into a state that includes the context of the input, and a decoder that turns that state into an output.

Encoder Decoder Models rose to prominence propelled by attention mechanisms, which allow the decoder to focus on different parts of the input sequence. Another big advantage of Encoder Decoder models is that they can handle input and output sequences of different lengths, a big improvement over a basic stacked RNN model. The main uses of Encoder Decoder models include image captioning, sentiment analysis, and machine translation.
