Keras is one of the most used frameworks for building machine learning models. Keras is a high level neural network API built on Python. That’s why we’ve covered how to use it so much on this blog! We’ve covered how to build Long Short Term Memory (LSTM) Models, Recurrent Neural Networks (RNNs), and Gated Recurrent Unit (GRU) Models in Keras.

*This is the second part of a multipart series on Keras! Check out the first part on Keras Loss Functions by my friend, the amazing **Kristen Kehrer**, a Developer Advocate at Comet ML and one of my favorite LinkedIn influencers. Sign up for a Comet ML Account to improve your ML Model Monitoring.*

In this post we’ll cover:

- Keras Topics We’ve Covered so far
- A Reference and Overview of Keras Optimizers
  - Adaptive Moment Estimation (`keras.optimizers.Adam`) Optimizer
  - Stochastic Gradient Descent (`keras.optimizers.SGD`)
    - What is the Nesterov Formulation with Respect to SGD?
  - Root Mean Squared Propagation (`keras.optimizers.RMSprop`)
  - `tf.keras.optimizers.Adagrad` Overview
  - Keras Adadelta Optimizer
  - Adamax Optimizer Keras Implementation Overview
  - Nesterov Adam Optimizer (`tf.keras.optimizers.Nadam`)
  - Follow the Regularized Leader Keras Optimizer
  - Keyword Arguments Parameter for `tf.keras.optimizers` Functions
- Common Errors Related to Keras Optimizers
- Summary of Keras Optimizers in Tensorflow and Common Errors

## A Reference and Overview of Keras Optimizers

Tensorflow Keras currently has eight optimizers: Stochastic Gradient Descent (SGD), Root Mean Squared Propagation (RMSprop), Adaptive Moment Estimation (Adam), Adadelta, Adaptive Gradient (Adagrad), Adamax, Nesterov-accelerated Adaptive Moment Estimation (Nadam), and Follow the Regularized Leader (FTRL).

The top three most searched for Keras optimizers are Adam, SGD, and RMSprop. Note that Keras and the Tensorflow implementation of Keras are still separate projects. However, because Keras now comes bundled with Tensorflow as `tf.keras`, we assume the Tensorflow implementation.

### Adaptive Moment Estimation (`keras.optimizers.Adam`) Optimizer

The Adaptive Moment Estimation Optimizer, otherwise known as “Adam”, is one of the most popular optimizers in Keras as of 2022. Adam comes from a 2014 paper titled Adam: A Method for Stochastic Optimization by Diederik P. Kingma and Jimmy Ba. It uses both momentum and per-parameter scaling, combining the effects of SGD with momentum and RMSprop (both explained below).

Keras’s Adam Optimizer takes seven parameters. Six of those are named keyword parameters, and the seventh is `**kwargs`, which allows you to pass a custom set of keyword arguments. The six named keyword parameters for the Adam optimizer are `learning_rate`, `beta_1`, `beta_2`, `epsilon`, `amsgrad`, and `name`.

`learning_rate` passes the value of the learning rate of the optimizer and defaults to 0.001. The `beta_1` and `beta_2` values are the exponential decay rates of the first and second moment estimates. They default to 0.9 and 0.999 respectively. `epsilon` is a small constant used for numerical stability. It defaults to `1e-07`. `amsgrad` is a boolean value that represents whether or not we should use the AMSGrad variant of Adam; it defaults to `False`. Finally, `name` is the prefix you give the resulting gradient operations, which defaults to “Adam”.

Here is the default declaration of the `Adam` Keras optimizer:

```
tf.keras.optimizers.Adam(
    learning_rate=0.001,
    beta_1=0.9,
    beta_2=0.999,
    epsilon=1e-07,
    amsgrad=False,
    name="Adam",
    **kwargs
)
```
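To make the parameters above concrete, here is a minimal pure-Python sketch of a single Adam update for one scalar weight, following the update rule from the Adam paper. The function name `adam_step` and its argument layout are my own for illustration. This is not the Keras source, and details such as exactly where `epsilon` enters may differ in the library.

```python
import math


def adam_step(w, g, m, v, t, learning_rate=0.001, beta_1=0.9,
              beta_2=0.999, epsilon=1e-07):
    """One Adam update for a scalar weight w with gradient g at step t (t >= 1)."""
    # Exponentially decayed first moment (a momentum-like average of gradients).
    m = beta_1 * m + (1.0 - beta_1) * g
    # Exponentially decayed second moment (an average of squared gradients).
    v = beta_2 * v + (1.0 - beta_2) * g * g
    # Bias correction: both averages start at zero, so early values are too small.
    m_hat = m / (1.0 - beta_1 ** t)
    v_hat = v / (1.0 - beta_2 ** t)
    # Scale the step by the root of the second moment; epsilon avoids division by zero.
    w = w - learning_rate * m_hat / (math.sqrt(v_hat) + epsilon)
    return w, m, v


w, m, v = adam_step(w=1.0, g=2.0, m=0.0, v=0.0, t=1)
print(w)  # close to 0.999: the first step is roughly learning_rate in size
```

Notice that the first step has a size of about `learning_rate` regardless of how large the gradient is. That is the scaling part of Adam at work.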

### Stochastic Gradient Descent (`keras.optimizers.SGD`)

Stochastic Gradient Descent (SGD) is one of the oldest neural network optimizers out there. It is a version of gradient descent that speeds up the process. Instead of computing the gradient over the whole dataset, SGD estimates it from a randomly selected subset of the data. On large, high-dimensional problems, this leads to a significant drop in computational expense and time.

The `tf.keras.optimizers` SGD implementation takes four named parameters and allows a fifth one for keyword arguments. The four named parameters for Keras SGD are `learning_rate`, `momentum`, `nesterov`, and `name`.

`learning_rate` corresponds to the learning rate for gradient descent. It can take a float (as shown below), a tensor, a schedule in the form of a `tf.keras.optimizers.schedules.LearningRateSchedule` object, or a function that takes no parameters and returns the value to use. The `momentum` parameter accelerates gradient descent and dampens oscillation. At the default value of `0.0`, we have plain gradient descent with no momentum.

`nesterov` is a boolean that defaults to `False`. When `True`, we use the Nesterov formulation for SGD. The `name` parameter works the same as in `tf.keras.optimizers.Adam`. It defines the name prefix of the resulting gradient operations.

```
tf.keras.optimizers.SGD(
    learning_rate=0.01, momentum=0.0, nesterov=False, name="SGD", **kwargs
)
```

#### What is the Nesterov Formulation with Respect to SGD?

The Nesterov Accelerated Gradient formula for SGD is a version of SGD with momentum. The formulation for Stochastic Gradient Descent with momentum is shown below. Velocity is accumulated from past gradients using momentum, `g` is the gradient, and `w` is the weight.

```
velocity = momentum * velocity - learning_rate * g
w = w + velocity
```

The Nesterov version is shown below this paragraph. As you can see, the velocity update is identical; the difference is in how we apply it to the weight. Instead of directly adding the velocity, we add the velocity scaled by the momentum and then subtract the gradient scaled by the learning rate.

```
velocity = momentum * velocity - learning_rate * g
w = w + momentum * velocity - learning_rate * g
```

So what’s the main idea behind the Nesterov version of SGD vs regular momentum descent? The standard version of momentum on SGD folds the current gradient into the velocity and then “jumps” by that velocity in a single move. The Nesterov formulation “jumps” along the momentum-scaled velocity and then corrects with a separate gradient step scaled by the learning rate.

*The main idea of Nesterov Momentum is that it is better to make a mistake and then correct it than to try to estimate the correct jump value up front.*
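The two formulations are easy to compare side by side. Below is a small pure-Python sketch that applies each update once from the same starting point. The function names `momentum_step` and `nesterov_step` are made up for this example, and the formulas are the two shown in this section rather than code lifted from Keras.

```python
def momentum_step(w, velocity, g, learning_rate=0.01, momentum=0.9):
    """Standard SGD with momentum: update the velocity, then add it to the weight."""
    velocity = momentum * velocity - learning_rate * g
    w = w + velocity
    return w, velocity


def nesterov_step(w, velocity, g, learning_rate=0.01, momentum=0.9):
    """Nesterov formulation: same velocity update, different weight update."""
    velocity = momentum * velocity - learning_rate * g
    w = w + momentum * velocity - learning_rate * g
    return w, velocity


# Same starting weight, velocity, and gradient for both variants.
w_std, v_std = momentum_step(w=1.0, velocity=0.0, g=1.0, learning_rate=0.1)
w_nes, v_nes = nesterov_step(w=1.0, velocity=0.0, g=1.0, learning_rate=0.1)
print(w_std, w_nes)  # roughly 0.9 and 0.81
```

Both variants end up with the same velocity, but the Nesterov weight has moved further because the momentum-scaled velocity and the gradient correction are both applied in the same step.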

### Root Mean Squared Propagation (`keras.optimizers.RMSprop`)

The third most popular optimizer from `tf.keras.optimizers` is root mean squared propagation or RMSprop. The basic idea behind RMSprop is to maintain a moving average of the square of the gradients (hence mean squared), and divide the gradient by the root of this average as we go (hence root).

The Keras implementation of RMSprop takes six named parameters and allows a seventh for keyword arguments. The keyword arguments serve the same purpose here as they do for SGD and `keras.optimizers.Adam`. The six named parameters are `learning_rate`, `rho`, `momentum`, `epsilon`, `centered`, and `name`.

As with the Adam optimizer and SGD optimizer, we start with `learning_rate`, which represents the learning rate of the network. `rho` is the discount factor for the historical values of the gradients. Rho is used to calculate the exponentially weighted average of the square of the gradients.

The `momentum` parameter is used for momentum. It’s basically the same as the momentum parameter for accelerated SGD. Next we have `epsilon`, which is a small numerical constant for stability. It’s used in the same way as the `epsilon` parameter in `tf.keras.optimizers.Adam`.

`centered` is a boolean parameter that determines how we normalize the gradients. When set to `False` (the default value), we normalize by the root of the uncentered second moment. If set to `True`, we normalize by the estimated variance of the gradients instead. The last named parameter is `name`, which works just like the `name` parameter in SGD and Adam.

The basic `tf.keras.optimizers.RMSprop` declaration is below, along with default values.

```
tf.keras.optimizers.RMSprop(
    learning_rate=0.001,
    rho=0.9,
    momentum=0.0,
    epsilon=1e-07,
    centered=False,
    name="RMSprop",
    **kwargs
)
```
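Here is a rough pure-Python sketch of what `rho`, `epsilon`, and `centered` do in a single RMSprop update for one scalar weight. The function `rmsprop_step` is hypothetical, momentum is left out to keep it short, and the exact placement of `epsilon` is an assumption on my part rather than a copy of the Keras implementation.

```python
import math


def rmsprop_step(w, g, avg_sq, avg_g=0.0, learning_rate=0.001, rho=0.9,
                 epsilon=1e-07, centered=False):
    """One RMSprop update for a scalar weight (momentum omitted for brevity)."""
    # Moving average of squared gradients, discounted by rho.
    avg_sq = rho * avg_sq + (1.0 - rho) * g * g
    if centered:
        # Also track the mean gradient and normalize by the estimated variance.
        avg_g = rho * avg_g + (1.0 - rho) * g
        variance = max(avg_sq - avg_g * avg_g, 0.0)  # guard against rounding below zero
        denom = math.sqrt(variance) + epsilon
    else:
        # Plain version: normalize by the root of the second moment.
        denom = math.sqrt(avg_sq) + epsilon
    w = w - learning_rate * g / denom
    return w, avg_sq, avg_g


w_plain, avg_sq, _ = rmsprop_step(w=1.0, g=2.0, avg_sq=0.0)
w_centered, _, avg_g = rmsprop_step(w=1.0, g=2.0, avg_sq=0.0, centered=True)
print(w_plain, w_centered)
```

With the same gradient, the centered version divides by a smaller number (the variance estimate instead of the raw second moment), so it takes a slightly larger step.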

### `tf.keras.optimizers.Adagrad` Overview

Now let’s look at Adagrad. Adagrad is an adaptive learning rate method that makes smaller updates to parameters that are updated more often. The main advantage of Adagrad over stochastic gradient descent and SGD with momentum is that it needs far less manual tuning of the learning rate. Adagrad’s biggest drawback is the continuous accumulation of squared gradients in the denominator, which causes the effective learning rate to keep shrinking. For details, read the Adagrad paper.

The Keras implementation of Adagrad takes four named parameters and allows for keyword arguments. The named parameters are `learning_rate`, `initial_accumulator_value`, `epsilon`, and `name`. We’ve already seen three of these: `learning_rate`, `epsilon`, and `name`. These have the same usage here as they did above. The remaining parameter, `initial_accumulator_value`, is the starting value for the per-parameter accumulators, which hold the running sum of squared gradients.

```
tf.keras.optimizers.Adagrad(
    learning_rate=0.001,
    initial_accumulator_value=0.1,
    epsilon=1e-07,
    name="Adagrad",
    **kwargs
)
```
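The shrinking learning rate is easy to see in a few lines of pure Python. The sketch below applies three Adagrad updates with a constant gradient. `adagrad_step` is a name I made up, and the update follows the textbook rule rather than the exact Keras code.

```python
import math


def adagrad_step(w, g, accumulator, learning_rate=0.001, epsilon=1e-07):
    """One Adagrad update for a scalar weight."""
    # The accumulator only ever grows: it sums every squared gradient seen so far.
    accumulator = accumulator + g * g
    step = learning_rate * g / (math.sqrt(accumulator) + epsilon)
    return w - step, accumulator, step


w, accumulator = 1.0, 0.1  # 0.1 mirrors initial_accumulator_value
steps = []
for _ in range(3):
    w, accumulator, step = adagrad_step(w, g=1.0, accumulator=accumulator)
    steps.append(step)
print(steps)  # each step is smaller than the one before, despite a constant gradient
```

Because nothing is ever removed from the accumulator, the step size can only go down over time. That is the drawback the next optimizer tries to address.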

### Keras Adadelta Optimizer

Adadelta is a variation of the Adagrad algorithm. It comes from a 2012 paper titled “Adadelta: An Adaptive Learning Rate Method”. Instead of accumulating all past gradients as Adagrad does, Adadelta uses a window of recent gradients. This keeps the effective learning rate from shrinking monotonically toward zero. In addition, the original Adadelta method doesn’t require a global learning rate to be set.

Keras’s implementation of Adadelta takes four named parameters in addition to the `**kwargs` parameter. These four named parameters are `learning_rate`, `rho`, `epsilon`, and `name`. As with all the other optimizers, `learning_rate` is the learning rate. The `rho` parameter for `tf.keras.optimizers.Adadelta` is the decay rate. `epsilon` is a small value for numerical stability. `name` is the name of the applied gradients.

```
tf.keras.optimizers.Adadelta(
    learning_rate=0.001, rho=0.95, epsilon=1e-07, name="Adadelta", **kwargs
)
```
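As a rough illustration of the “window” idea, here is a pure-Python sketch of one Adadelta update based on the rule in the paper. `adadelta_step` is a hypothetical helper. I set `learning_rate=1.0` in the sketch to match the paper, which has no global learning rate, whereas the Keras signature above defaults to 0.001.

```python
import math


def adadelta_step(w, g, avg_sq_grad, avg_sq_delta, learning_rate=1.0,
                  rho=0.95, epsilon=1e-07):
    """One Adadelta update for a scalar weight, following the 2012 paper."""
    # Decaying average of squared gradients: the "window" over recent gradients.
    avg_sq_grad = rho * avg_sq_grad + (1.0 - rho) * g * g
    # Step size is the ratio of RMS(previous updates) to RMS(gradients).
    delta = -math.sqrt(avg_sq_delta + epsilon) / math.sqrt(avg_sq_grad + epsilon) * g
    # Decaying average of squared updates, used by the next step.
    avg_sq_delta = rho * avg_sq_delta + (1.0 - rho) * delta * delta
    return w + learning_rate * delta, avg_sq_grad, avg_sq_delta


w, avg_sq_grad, avg_sq_delta = adadelta_step(w=1.0, g=1.0, avg_sq_grad=0.0, avg_sq_delta=0.0)
print(w, avg_sq_grad, avg_sq_delta)
```

Old gradients fade out of `avg_sq_grad` at a rate set by `rho`, so unlike Adagrad the denominator does not grow forever.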

### Adamax Optimizer Keras Implementation Overview

Adamax is another variation of the Adaptive Moment Estimation (Adam) algorithm. The main difference is that Adamax scales its updates with the L-Infinity norm of past gradients instead of the L-2 norm. The L-2 norm is the “Euclidean” length of a vector, the straight-line distance from the origin to the point. It is calculated as the root of the sum of the squares of the vector values. The L-Infinity norm is the largest magnitude of any value in the vector.

Adamax was introduced as an extension of Adam in the original Adam paper. It sometimes outperforms Adam, especially on models with embeddings. Other differences include no bias correction for the infinity-norm term and a simpler bound on the size of each update. For more detailed differences between Adam and Adamax, see the paper, Adam: A Method for Stochastic Optimization.

The Keras implementation of Adamax has five named parameters. We’ve seen all of them before in Adam: `learning_rate`, `beta_1`, `beta_2`, `epsilon`, and `name`. The `learning_rate`, `name`, and `epsilon` parameters serve the same purpose here. `beta_1` is a float value representing the exponential decay rate of the first moment estimates. `beta_2` is another float value; it represents the exponential decay rate of the exponentially weighted infinity norm (L-infinity).

```
tf.keras.optimizers.Adamax(
    learning_rate=0.001, beta_1=0.9, beta_2=0.999, epsilon=1e-07, name="Adamax", **kwargs
)
```
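To see the two norms and the infinity-norm update in action, here is a short pure-Python sketch. The helper names `l2_norm`, `linf_norm`, and `adamax_step` are mine, and the update follows the Adamax rule from the Adam paper for a single scalar weight. Treat it as an illustration, not as the Keras implementation.

```python
def l2_norm(values):
    """Root of the sum of squares."""
    return sum(x * x for x in values) ** 0.5


def linf_norm(values):
    """Largest absolute value in the vector."""
    return max(abs(x) for x in values)


def adamax_step(w, g, m, u, t, learning_rate=0.001, beta_1=0.9,
                beta_2=0.999, epsilon=1e-07):
    """One Adamax update for a scalar weight at step t (t >= 1)."""
    # First moment, exactly as in Adam.
    m = beta_1 * m + (1.0 - beta_1) * g
    # Exponentially weighted infinity norm replaces Adam's second moment.
    u = max(beta_2 * u, abs(g))
    # Only the first moment needs bias correction.
    w = w - (learning_rate / (1.0 - beta_1 ** t)) * m / (u + epsilon)
    return w, m, u


print(l2_norm([3.0, -4.0]), linf_norm([3.0, -4.0]))  # 5.0 4.0
w, m, u = adamax_step(w=1.0, g=2.0, m=0.0, u=0.0, t=1)
print(w)  # close to 0.999
```

The `max` in the update is where the L-Infinity norm shows up: `u` tracks the largest recent gradient magnitude rather than an average of squares.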

### Nesterov Adam Optimizer (`tf.keras.optimizers.Nadam`)

Nesterov, Nesterov, Nesterov. We already came across this name once up above in the section on stochastic gradient descent. We’re talking about Yurii Nesterov. His accelerated gradient method, which dates back to 1983, has since worked its way into multiple optimizers. The `tf.keras.optimizers.Nadam` optimizer implements a version of Adam augmented with Nesterov momentum.

The Nesterov accelerated version of Adam takes five named parameters and a keyword arguments parameter. It takes the exact same set of named parameters as Adamax, which we saw above. These named parameters serve the same purpose as they do in Adam; in particular, `beta_2` is the decay rate of the second moment, not of an infinity norm. The only difference here is that we’re applying Nesterov momentum (i.e. jumping along the existing momentum first and then correcting).

```
tf.keras.optimizers.Nadam(
    learning_rate=0.001, beta_1=0.9, beta_2=0.999, epsilon=1e-07, name="Nadam", **kwargs
)
```

### Follow the Regularized Leader Keras Optimizer (`keras.optimizers.Ftrl`)

The last available Keras optimizer at the time of writing is Follow the Regularized Leader (FTRL). FTRL comes from a 2013 paper from researchers at Google looking to estimate click-through rates. The original paper, Ad Click Prediction, proposes an algorithm that uses both L1 and L2 regularization.

The Keras implementation allows us to control our application of both regularization terms. It also allows for an additional L2 penalty on the loss function. There are eight named parameters in `tf.keras.optimizers.Ftrl` and the ability to pass keyword arguments. Of these eight, we should already be familiar with `learning_rate` and `name`. That leaves `learning_rate_power`, `initial_accumulator_value`, `l1_regularization_strength`, `l2_regularization_strength`, `l2_shrinkage_regularization_strength`, and `beta`.

`learning_rate_power` affects how the learning rate decreases over time. It must be less than or equal to zero, and we can use a value of 0 to maintain a constant rate. We’ve actually seen `initial_accumulator_value` once before, in Adagrad. This parameter sets the starting value for the accumulators. Both `l1_regularization_strength` and `l2_regularization_strength` are float values that represent the coefficients for the L1 and L2 regularizations.

The `l2_shrinkage_regularization_strength` is a coefficient for an L2 regularization of the loss function. This isn’t part of the original paper, but an addition in the Keras implementation. Finally, we have `beta`. `beta` is a hyperparameter from the paper that can be tuned to get the best performance from the optimizer. It is part of the denominator in the per-coordinate learning rate.

```
tf.keras.optimizers.Ftrl(
    learning_rate=0.001,
    learning_rate_power=-0.5,
    initial_accumulator_value=0.1,
    l1_regularization_strength=0.0,
    l2_regularization_strength=0.0,
    name="Ftrl",
    l2_shrinkage_regularization_strength=0.0,
    beta=0.0,
    **kwargs
)
```

### Keyword Arguments Parameter for `tf.keras.optimizers` Functions

In all eight optimizers we’ve looked at so far, we’ve seen this `**kwargs` parameter. This parameter allows us to pass in additional keyword arguments to our optimizers. However, we can’t just pass in any keywords willy-nilly.

The keywords that Keras optimizers recognize for this parameter are `clipvalue`, `clipnorm`, and `global_clipnorm`. `clipvalue` is a float that caps the magnitude of the gradient values. If it is passed, every element of each gradient is clipped to the range from `-clipvalue` to `clipvalue`.

`clipnorm` is a float that caps the norm of each gradient. The gradient of each weight is individually rescaled so that its norm is no higher than this value. Finally, `global_clipnorm` sets a cap on the global norm of the gradients. The gradients of all weights are rescaled together so that their combined norm is no higher than the passed-in float value.
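The difference between the three clipping modes is easiest to see on plain lists of numbers. The sketch below is pure Python with hypothetical helper names (`clip_by_value`, `clip_by_norm`, `clip_by_global_norm`). It mimics the behavior described above and is not how Keras implements clipping internally.

```python
def clip_by_value(grad, clipvalue):
    """Clamp every element of one gradient to [-clipvalue, clipvalue]."""
    return [max(-clipvalue, min(clipvalue, x)) for x in grad]


def clip_by_norm(grad, clipnorm):
    """Rescale one gradient so its own L2 norm is at most clipnorm."""
    norm = sum(x * x for x in grad) ** 0.5
    scale = min(1.0, clipnorm / norm) if norm > 0 else 1.0
    return [x * scale for x in grad]


def clip_by_global_norm(grads, global_clipnorm):
    """Rescale all gradients together so their combined L2 norm is at most global_clipnorm."""
    global_norm = sum(x * x for grad in grads for x in grad) ** 0.5
    scale = min(1.0, global_clipnorm / global_norm) if global_norm > 0 else 1.0
    return [[x * scale for x in grad] for grad in grads]


grads = [[3.0, -4.0], [0.0, 12.0]]  # one list per weight tensor
print(clip_by_value(grads[0], 1.0))     # [1.0, -1.0]
print(clip_by_global_norm(grads, 6.5))  # [[1.5, -2.0], [0.0, 6.0]]
```

Note that `clip_by_value` changes the direction of the gradient, while the two norm-based versions only shrink it. The global version also preserves the relative sizes of the gradients across weights.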

## Common Errors When Working with Keras Optimizers

The most commonly searched errors when it comes to using Keras optimizers are about RMSprop, SGD, and Adam. These are also the three most popular optimization algorithms used. In order of popularity it goes: Adam, SGD, then RMSprop. Historically, SGD came first, then RMSprop, and then Adam.

### ImportError: cannot import name adam from keras optimizers

As mentioned above, Adam comes from a 2014 paper titled Adam: A Method for Stochastic Optimization. This algorithm is highly popular because of the robustness, ease and speed of training, and accuracy. As a result, Adam is often suggested as the default optimizer in many cases.

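The reason you are seeing an `ImportError` is most likely due to a package mishap. Keras now ships with Tensorflow as `tf.keras`, so **it is no longer recommended to** `pip install keras`. Instead, you should do `pip install tensorflow` and use `from tensorflow.keras.optimizers import Adam`. Note the capital “A”: the class is named `Adam`, and as far as I can tell the Tensorflow implementation has no lowercase `adam` alias.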

### cannot import name ‘sgd’ from ‘keras.optimizers’

The paper Stochastic Estimation of the Maximum of a Regression Function was published in 1952. There are many other papers related to SGD, but I would consider this the first recognizable instance to the ML world. This algorithm is popular because it’s so foundational. Most introductory users will play with it.

Just like the Adam `ImportError`, this SGD error is based on a package error. The same solution applies. Run `pip install tensorflow` and then use `import tensorflow as tf` or `from tensorflow.keras import optimizers`, and refer to the class as `SGD` in capitals. Whatever floats your goat on these imports; you end up with the same optimizer either way.

### module ‘keras.optimizers’ has no attribute ‘rmsprop’

RMSprop was never formally published, so its origin is not widely known. It is usually credited to Geoffrey Hinton’s Coursera lectures from 2012, a couple of years before Adam. The first use of RMSprop in a paper that I’ve been able to find comes from Generating Sequences With Recurrent Neural Networks by Alex Graves from the University of Toronto. RMSprop is used for its robustness and because it often converges faster than SGD with momentum.

Once again, this is the third most searched error because `RMSprop` is the third most popular optimizer. The cause is often the same: it comes down to the library. Install `tensorflow` and then use an `import` starting with `tensorflow.keras` instead of `keras`, and spell the class `RMSprop`.

### `keras.optimizers` vs `tf.keras.optimizers`

All three problems come down to which package you import from. Why? Because Keras originated as, and technically still is, its own project. Keras is a *high level API* that can run on top of Tensorflow or Theano.

In industry and for hobbyists, Tensorflow has taken over. As of 2022, Theano is no longer widely maintained and is almost exclusively used in Academia. The difference between `keras.optimizers` and `tf.keras.optimizers` is that the `tf.keras.optimizers` library uses a Tensorflow optimized implementation of Keras.

With the current state of Tensorflow, Theano, and Keras it is no longer recommended to directly install Keras. The only reason you would directly install Keras is if you want to use it with the Theano backend. However, because Theano is no longer widely maintained, this use case almost never comes up. Hence, `tf.keras` has replaced `keras` for almost any code snippet from past tutorials you see.

## Summary of Keras Optimizers in Tensorflow and Common Errors

This post honestly came out a lot longer than I originally meant it to. In this post, we went over the eight Keras optimizers (as of 2022). For implementation, we looked at the Tensorflow backed versions, not the Theano backed ones. This is because Tensorflow is now the de facto Keras backend.

The eight optimizers we went over are: Adam, SGD, RMSprop, Adagrad, Adadelta, Adamax, Nesterov-Accelerated Adam (Nadam), and FTRL. The most popular are Adam, SGD, and RMSprop. Adam is a common default choice due to its strong all-round performance. SGD is popular because it is an introductory technique. RMSprop is popular because of its robustness and overall convergence speed.

The most common errors that come up with Keras optimizers are around `keras.optimizers.Adam`, `keras.optimizers.SGD`, and `keras.optimizers.RMSprop`. This is a result of their popularity and the absorption of Keras into Tensorflow. The most common solution to these errors is to `pip install tensorflow` and use `tf.keras` instead of just `keras` when calling functions.

