Keras Optimizers in Tensorflow and Common Errors

Keras is one of the most used frameworks for building machine learning models. Keras is a high level neural network API built on Python. That’s why we’ve covered how to use it so much on this blog! We’ve covered how to build Long Short Term Memory (LSTM) Models, Recurrent Neural Networks (RNNs), and Gated Recurrent Unit (GRU) Models in Keras.

This is the second part of a multipart series on Keras! Check out the first part on Keras Loss Functions by my friend, the amazing Kristen Kehrer, a Developer Advocate at Comet ML and one of my favorite LinkedIn influencers. Sign up for a Comet ML Account to improve your ML Model Monitoring.

In this post we’ll cover:

  • Keras Topics We’ve Covered so far
  • A Reference and Overview of Keras Optimizers
    • Adaptive Moment Estimation (keras.optimizers.Adam) Optimizer
    • Stochastic Gradient Descent (keras.optimizers.SGD)
      • What is the Nesterov Formulation with Respect to SGD?
    • Root Mean Squared Propagation (keras.optimizers.RMSprop)
    • tf.keras.optimizers.Adagrad Overview
    • Keras Adadelta Optimizer
    • Adamax Optimizer Keras Implementation Overview
    • Nesterov Adam Optimizer (tf.keras.optimizers.Nadam)
    • Follow the Regularized Leader Keras Optimizer (keras.optimizers.Ftrl)
    • Keyword Arguments Parameter for tf.keras.optimizers Functions
  • Common Errors Related to Keras Optimizers
  • Summary of Keras Optimizers in Tensorflow and Common Errors

A Reference and Overview of Keras Optimizers

Tensorflow Keras currently has eight optimizers: Stochastic Gradient Descent (SGD), Root Mean Squared Propagation (RMSprop), Adaptive Moment Estimation (Adam), Adadelta, Adaptive Gradient (Adagrad), Adamax, Nesterov-accelerated Adaptive Moment Estimation (Nadam), and Follow the Regularized Leader (FTRL).

The top three most searched for Keras optimizers are Adam, SGD, and RMSprop. Note that Keras and the Tensorflow implementation of Keras are still separate projects. However, because Keras now ships with Tensorflow 2.x as tf.keras, we assume the Tensorflow implementation.

Adaptive Moment Estimation (keras.optimizers.Adam) Optimizer

The Adaptive Moment Estimation optimizer, otherwise known as “Adam”, is one of the most popular optimizers in Keras as of 2022. Adam comes from a 2014 paper titled Adam: A Method for Stochastic Optimization by Diederik P. Kingma and Jimmy Ba. It uses both momentum and scaling, combining the effects of SGD with momentum and RMSprop (explained below).
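To make the “momentum plus scaling” idea concrete, here is a minimal sketch of a single Adam update for one scalar weight. It follows the update rule from the Adam paper rather than Keras’s exact code, and the function name adam_step and the toy loss are assumptions made up for illustration.

```python
import math


def adam_step(w, grad, m, v, t, learning_rate=0.001,
              beta_1=0.9, beta_2=0.999, epsilon=1e-07):
    """One Adam update for a single scalar weight (illustrative sketch only)."""
    # Momentum part: moving average of the gradient (first moment).
    m = beta_1 * m + (1.0 - beta_1) * grad
    # Scaling part: moving average of the squared gradient (second moment).
    v = beta_2 * v + (1.0 - beta_2) * grad * grad
    # Bias correction, needed because m and v both start at zero.
    m_hat = m / (1.0 - beta_1 ** t)
    v_hat = v / (1.0 - beta_2 ** t)
    w = w - learning_rate * m_hat / (math.sqrt(v_hat) + epsilon)
    return w, m, v


w, m, v = 1.0, 0.0, 0.0
for t in range(1, 4):      # t starts at 1 so the bias correction is defined
    grad = 2.0 * w         # gradient of the toy loss f(w) = w**2
    w, m, v = adam_step(w, grad, m, v, t)
```

Notice that on the very first step the update has a size of roughly learning_rate no matter how large the gradient is. That is the scaling at work.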

Keras’s Adam optimizer takes seven parameters. Six of those are named keyword parameters, and the seventh is **kwargs, which allows you to pass additional keyword arguments. The six named keyword parameters for the Adam optimizer are learning_rate, beta_1, beta_2, epsilon, amsgrad, and name.

learning_rate sets the learning rate of the optimizer and defaults to 0.001. The beta_1 and beta_2 values are the exponential decay rates of the first and second moment estimates. They default to 0.9 and 0.999 respectively. epsilon is a small constant used for numerical stability. It defaults to 1e-07. amsgrad is a boolean that controls whether to use the AMSGrad variant of Adam. It defaults to False. Finally, name is the prefix given to the operations the optimizer creates, which defaults to “Adam”.

Here is the default declaration of the Adam Keras optimizer:

   learning_rate=0.001, beta_1=0.9, beta_2=0.999, epsilon=1e-07, amsgrad=False, name="Adam", **kwargs
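In practice, creating the optimizer looks something like the sketch below. It assumes Tensorflow 2.x is installed and spells out the defaults described above. The one-layer model is hypothetical and only there so we have something to compile.

```python
import tensorflow as tf

# Spell out the defaults explicitly; calling Adam() with no arguments
# should give the same configuration.
optimizer = tf.keras.optimizers.Adam(
    learning_rate=0.001,
    beta_1=0.9,
    beta_2=0.999,
    epsilon=1e-07,
    amsgrad=False,
)

# A hypothetical one-layer model, only here to show where the optimizer goes.
model = tf.keras.Sequential([tf.keras.layers.Dense(1)])
model.compile(optimizer=optimizer, loss="mse")
```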


Stochastic Gradient Descent (keras.optimizers.SGD)

Stochastic Gradient Descent (SGD) is one of the oldest neural network optimizers out there. It is a version of gradient descent that speeds up each update: instead of computing the gradient over the full dataset, SGD estimates it from a randomly selected subset of the data. On large, high-dimensional problems, this leads to a significant drop in computational expense and time.

The tf.keras.optimizers.SGD implementation takes four named parameters and allows a fifth one for keyword arguments. The four named parameters for Keras SGD are learning_rate, momentum, nesterov, and name.

learning_rate sets the learning rate for gradient descent. It can take a float (as in the declaration below), a tensor, a schedule in the form of a tf.keras.optimizers.schedules.LearningRateSchedule object, or a callable that takes no arguments and returns the value to use. The momentum parameter accelerates gradient descent in the relevant direction and dampens oscillation. At the default value of 0.0, we have plain gradient descent with no momentum.

nesterov is a boolean that defaults to False. When True, we use the Nesterov formulation for SGD. The name parameter works the same as in tf.keras.optimizers.Adam. It defines the name prefix for the operations the optimizer creates.

   learning_rate=0.01, momentum=0.0, nesterov=False, name="SGD", **kwargs
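Since learning_rate accepts a schedule as well as a float, here is a small sketch of SGD driven by one. It assumes Tensorflow 2.x, and the decay numbers are arbitrary values picked for illustration, not recommendations.

```python
import tensorflow as tf

# Decay the learning rate smoothly so that it reaches 0.96x the
# initial value after 1000 steps. The numbers are arbitrary.
schedule = tf.keras.optimizers.schedules.ExponentialDecay(
    initial_learning_rate=0.01,
    decay_steps=1000,
    decay_rate=0.96,
)

# SGD with momentum and the Nesterov formulation switched on.
optimizer = tf.keras.optimizers.SGD(
    learning_rate=schedule,
    momentum=0.9,
    nesterov=True,
)
```

Calling schedule(step) returns the learning rate the optimizer would use at that step, which is handy for sanity checking a schedule before training.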

What is the Nesterov Formulation with Respect to SGD?

The Nesterov Accelerated Gradient formula for SGD is a version of SGD with momentum. The formulation for standard Stochastic Gradient Descent with momentum is shown below. velocity is accumulated from past gradients using momentum, g is the current gradient, and w is the weight.

velocity = momentum * velocity - learning_rate * g
w = w + velocity

The Nesterov version is shown below this paragraph. As you can see, the velocity is computed the same way, but it is applied differently. Instead of directly adding the velocity, we add the velocity scaled by momentum and then correct with the current gradient scaled by the learning rate.

velocity = momentum * velocity - learning_rate * g
w = w + momentum * velocity - learning_rate * g

So what’s the main idea behind the Nesterov version of SGD vs regular momentum descent? The standard version calculates the gradient, folds it into the velocity, and then “jumps” by that velocity. The Nesterov formulation “jumps” ahead along the accumulated velocity first and then corrects with the current gradient.

The main idea of Nesterov momentum is that it is better to make a mistake and then correct it than to try to estimate the correct jump value up front.
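The two update rules above translate almost line for line into Python. The sketch below runs both on a toy loss, f(w) = w**2. The function names, the starting point, and the hyperparameter values are all assumptions chosen for illustration.

```python
def sgd_momentum_step(w, velocity, grad, learning_rate=0.1, momentum=0.9):
    # Standard momentum: update the velocity, then jump by it.
    velocity = momentum * velocity - learning_rate * grad
    w = w + velocity
    return w, velocity


def sgd_nesterov_step(w, velocity, grad, learning_rate=0.1, momentum=0.9):
    # Nesterov: same velocity update, but the weight moves by the
    # momentum-scaled velocity plus a correction from the current gradient.
    velocity = momentum * velocity - learning_rate * grad
    w = w + momentum * velocity - learning_rate * grad
    return w, velocity


def minimize(step_fn, w=5.0, steps=200):
    """Run one of the step functions on the toy loss f(w) = w**2."""
    velocity = 0.0
    for _ in range(steps):
        grad = 2.0 * w      # derivative of w**2
        w, velocity = step_fn(w, velocity, grad)
    return w


w_momentum = minimize(sgd_momentum_step)
w_nesterov = minimize(sgd_nesterov_step)
```

Both versions end up close to the minimum at w = 0. Printing w at each step is a good way to watch how the two trajectories differ along the way.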

Root Mean Squared Propagation (keras.optimizers.RMSprop)

The third most popular optimizer from tf.keras.optimizers is root mean squared propagation or RMSprop. The basic idea behind RMSprop is to maintain a moving average of the square of the gradients (hence mean squared), and divide the gradient by the root of this average as we go (hence root).

The Keras implementation of RMSprop takes six named parameters and allows a seventh for keyword arguments. The keyword arguments serve the same purpose here as they do for keras.optimizers.SGD and keras.optimizers.Adam. The six named parameters are learning_rate, rho, momentum, epsilon, centered, and name.

As with the Adam optimizer and SGD optimizer, we start with learning_rate, which represents the learning rate of the network. rho is the discount factor for the historical values of the gradients. Rho is used to calculate the exponentially weighted average of the square of the gradients.

The momentum parameter is used for momentum. It’s basically the same as the momentum parameter for accelerated SGD. Next we have epsilon, which is a small numerical constant for stability. It’s used in the same way as the tf.keras.optimizers.Adam epsilon parameter.

centered is a boolean parameter that determines how we normalize the gradients. When set to False (the default value), we normalize by the value of the second moment. If set to True, we normalize the gradients based on the estimated variance. The last named parameter is name, which works just like the name parameter in SGD and Adam.

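The basic tf.keras.optimizers.RMSprop declaration is below, along with default values.

   learning_rate=0.001, rho=0.9, momentum=0.0, epsilon=1e-07, centered=False, name="RMSprop", **kwargs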
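The moving-average idea behind RMSprop can be sketched in a few lines. This is the plain, uncentered version with no momentum, written for a single scalar weight. The function name is made up, and the default values are my understanding of the Tensorflow 2.x defaults, so treat them as assumptions.

```python
import math


def rmsprop_step(w, grad, mean_square, learning_rate=0.001,
                 rho=0.9, epsilon=1e-07):
    """One uncentered RMSprop update for a scalar weight (sketch only)."""
    # Moving average of the squared gradient, discounted by rho.
    mean_square = rho * mean_square + (1.0 - rho) * grad * grad
    # Divide the gradient by the root of that average.
    w = w - learning_rate * grad / (math.sqrt(mean_square) + epsilon)
    return w, mean_square


w, mean_square = 1.0, 0.0
for _ in range(3):
    grad = 2.0 * w      # gradient of the toy loss f(w) = w**2
    w, mean_square = rmsprop_step(w, grad, mean_square)
```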


tf.keras.optimizers.Adagrad Overview

Now let’s look at Adagrad. Adagrad is an adaptive learning rate method that makes smaller updates to parameters that are updated more often. The main advantage of Adagrad over stochastic gradient descent and SGD with momentum is that it requires much less manual tuning of the learning rate. Adagrad’s biggest drawback is the continuous accumulation of squared gradients in the denominator, causing it to have a continuously decreasing learning rate. For the details, read the Adagrad paper.

The Keras implementation of Adagrad takes four named parameters and allows for keyword arguments. The named parameters are learning_rate, initial_accumulator_value, epsilon, and name. We’ve already seen three of these: learning_rate, epsilon, and name. These have the same usage here as they did above. The remaining parameter, initial_accumulator_value, is the starting value for the per-parameter accumulators of squared gradients.

   learning_rate=0.001, initial_accumulator_value=0.1, epsilon=1e-07, name="Adagrad", **kwargs
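To see the shrinking learning rate in action, here is a minimal sketch of the Adagrad update for one scalar weight. The function name and the constant gradient are assumptions for illustration. Feeding the same gradient in repeatedly makes the effect easy to spot.

```python
import math


def adagrad_step(w, grad, accumulator, learning_rate=0.001, epsilon=1e-07):
    # The accumulator only ever grows: it sums every squared gradient so far.
    accumulator = accumulator + grad * grad
    # A larger accumulator means a smaller effective step for this weight.
    w = w - learning_rate * grad / (math.sqrt(accumulator) + epsilon)
    return w, accumulator


w, accumulator = 1.0, 0.1   # 0.1 plays the role of initial_accumulator_value
step_sizes = []
for _ in range(5):
    grad = 1.0              # the same gradient every time
    new_w, accumulator = adagrad_step(w, grad, accumulator)
    step_sizes.append(w - new_w)
    w = new_w
```

Even though the gradient never changes, each entry in step_sizes is smaller than the one before it. That is the drawback described above.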


Keras Adadelta Optimizer

Adadelta is a version of the Adagrad algorithm. It comes from the 2012 paper titled “ADADELTA: An Adaptive Learning Rate Method”. Adadelta uses a moving window of past gradients, as opposed to all past gradients as used by Adagrad. This slows the rate at which the effective learning rate falls. In addition, the original method doesn’t require a global learning rate to be set.

Keras’s implementation of Adadelta takes four named parameters in addition to the **kwargs parameter. These four named parameters are learning_rate, rho, epsilon, and name. As with all the other optimizers, learning_rate is the learning rate. The rho parameter for tf.keras.optimizers.Adadelta is the decay rate. epsilon is a small value for numerical stability. name is the name prefix for the operations the optimizer creates.

   learning_rate=0.001, rho=0.95, epsilon=1e-07, name="Adadelta", **kwargs
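Here is a rough sketch of one Adadelta update for a scalar weight, following the rule in the paper. The function and variable names are made up for illustration. As far as I know the Keras version also scales the step by learning_rate, which this sketch leaves out.

```python
import math


def adadelta_step(w, grad, avg_sq_grad, avg_sq_delta, rho=0.95, epsilon=1e-07):
    # Decaying average of squared gradients: the "window" over recent history.
    avg_sq_grad = rho * avg_sq_grad + (1.0 - rho) * grad * grad
    # The step size is a ratio of two running RMS values, not a global rate.
    delta = -math.sqrt(avg_sq_delta + epsilon) / math.sqrt(avg_sq_grad + epsilon) * grad
    # Decaying average of squared updates, used by the next step.
    avg_sq_delta = rho * avg_sq_delta + (1.0 - rho) * delta * delta
    return w + delta, avg_sq_grad, avg_sq_delta


w, avg_sq_grad, avg_sq_delta = 1.0, 0.0, 0.0
for _ in range(3):
    grad = 2.0 * w      # gradient of the toy loss f(w) = w**2
    w, avg_sq_grad, avg_sq_delta = adadelta_step(w, grad, avg_sq_grad, avg_sq_delta)
```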

Adamax Optimizer Keras Implementation Overview

Adamax is another variation of the Adaptive Moment Estimation (Adam) algorithm. The main difference is that Adamax scales the gradient using the L-Infinity norm instead of the L-2 norm. The L-2 norm is the “Euclidean” length of a vector, the straight-line distance from the origin to the point. It is calculated as the root of the sum of the squares of the vector values. The L-Infinity norm is the largest magnitude of any value in the vector.

Adamax was introduced as an extension of Adam in the original Adam paper. Adamax sometimes outperforms Adam, especially on models with embeddings. Other differences include no bias correction for the infinity-norm term and a simpler bound on the size of each update. For more detailed differences between Adam and Adamax, see the paper, Adam: A Method for Stochastic Optimization.

The Keras implementation of Adamax has five named parameters. The learning_rate, name, and epsilon parameters we’ve seen before. They serve the same purpose here. The ones we haven’t seen before are beta_1 and beta_2. beta_1 is a float value representing the decay rate of the first moment estimates. beta_2 is another float value; it represents the decay rate of the infinity norm (L-infinity).

   learning_rate=0.001, beta_1=0.9, beta_2=0.999, epsilon=1e-07, name="Adamax", **kwargs
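The infinity norm shows up in the update as a running maximum. Below is a minimal sketch of one Adamax step for a scalar weight, based on the algorithm in the Adam paper with a small epsilon added for stability. The function name is an assumption for illustration.

```python
def adamax_step(w, grad, m, u, t, learning_rate=0.001,
                beta_1=0.9, beta_2=0.999, epsilon=1e-07):
    # First moment: the same moving average of the gradient that Adam uses.
    m = beta_1 * m + (1.0 - beta_1) * grad
    # Infinity norm: keep the largest (decayed) gradient magnitude seen so far.
    u = max(beta_2 * u, abs(grad))
    # Only the first moment gets a bias correction.
    w = w - (learning_rate / (1.0 - beta_1 ** t)) * m / (u + epsilon)
    return w, m, u


w, m, u = 1.0, 0.0, 0.0
for t in range(1, 4):      # t starts at 1 so the bias correction is defined
    grad = 2.0 * w         # gradient of the toy loss f(w) = w**2
    w, m, u = adamax_step(w, grad, m, u, t)
```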

Nesterov Adam Optimizer (tf.keras.optimizers.Nadam)

Nesterov, Nesterov, Nesterov. We already came across this name once up above in the section on stochastic gradient descent. We’re talking about Yurii Nesterov. He published his accelerated gradient method in 1983, and it has since worked its way into multiple optimizers. The tf.keras.optimizers.Nadam optimizer implements a version of Adam augmented with Nesterov momentum.

The Nesterov accelerated version of Adam takes five named parameters and a keyword arguments parameter. It takes the exact same set of named parameters as Adamax, which we saw above. These named parameters also serve the exact same purpose as they do in Adamax. The only difference here is that we’re applying Nesterov momentum (i.e. using a priori momentum and then correcting).

   learning_rate=0.001, beta_1=0.9, beta_2=0.999, epsilon=1e-07, name="Nadam", **kwargs

Follow the Regularized Leader Keras Optimizer (keras.optimizers.Ftrl)

The last available Keras optimizer at the time of writing is Follow the Regularized Leader (FTRL). FTRL comes from a 2013 paper from researchers at Google looking to estimate click through rates. The original paper, Ad Click Prediction, proposes an algorithm that uses both L1 and L2 regularization. 

The Keras implementation allows us to control our application of both regularization terms. It also allows for an additional L2 penalty to the loss function. There are eight named parameters in tf.keras.optimizers.Ftrl and the ability to pass keyword arguments. Of these eight, we should already be familiar with learning_rate and name. That leaves learning_rate_power, initial_accumulator_value, l1_regularization_strength, l2_regularization_strength, l2_shrinkage_regularization_strength, and beta.

learning_rate_power affects how the learning rate decreases over time. It must be less than or equal to zero, and we can use a value of 0 to maintain a constant rate. We’ve actually seen initial_accumulator_value once before, in Adagrad. This parameter sets the starting value for the accumulators. Both l1_regularization_strength and l2_regularization_strength are float values that represent the coefficients for the L1 and L2 regularizations.

The l2_shrinkage_regularization_strength is a coefficient for an additional L2 penalty on the loss function. This isn’t in the original paper; it is an addition in the Keras implementation. Finally, we have beta. beta is a hyperparameter that can be changed to get the best performance from the function. It is part of the denominator in the per-coordinate learning rate.

   learning_rate=0.001, learning_rate_power=-0.5, initial_accumulator_value=0.1, l1_regularization_strength=0.0, l2_regularization_strength=0.0, name="Ftrl", l2_shrinkage_regularization_strength=0.0, beta=0.0, **kwargs
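Putting those parameters together, creating an FTRL optimizer might look like the sketch below. It assumes Tensorflow 2.x. The two regularization strengths are arbitrary non-zero values picked for illustration, and the remaining values are my understanding of the defaults.

```python
import tensorflow as tf

optimizer = tf.keras.optimizers.Ftrl(
    learning_rate=0.001,
    learning_rate_power=-0.5,           # must be <= 0; 0 keeps the rate constant
    initial_accumulator_value=0.1,
    l1_regularization_strength=0.01,    # arbitrary non-zero value
    l2_regularization_strength=0.01,    # arbitrary non-zero value
    l2_shrinkage_regularization_strength=0.0,
    beta=0.0,
)
```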


Keyword Arguments Parameter for tf.keras.optimizers Functions

In every optimizer we’ve seen so far, we’ve seen this **kwargs parameter. This parameter allows us to pass in custom keyword arguments to our optimizers. However, we can’t just pass in any keywords willy-nilly.

The keywords that Keras optimizers recognize for this parameter are clipvalue, clipnorm, and global_clipnorm. clipvalue is a float that caps the value of the gradients. If it is passed, each gradient value is clipped so that its magnitude is no higher than the passed-in value.

clipnorm is a float that caps the norm of each gradient. The gradient of each weight is individually clipped so that its norm is no higher than this value. Finally, global_clipnorm sets a cap on the global norm of the gradients. All of the gradients are scaled together so that their combined norm is no higher than the passed-in float value.
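The difference between the three clipping options is easier to see with numbers. The sketch below is plain Python, not Keras code. The helper names and the sample gradients are made up, and it only mimics the behavior described above.

```python
import math


def clip_by_value(gradient, clipvalue):
    # Clamp every element into the range [-clipvalue, clipvalue].
    return [max(-clipvalue, min(clipvalue, g)) for g in gradient]


def clip_by_norm(gradient, clipnorm):
    # Rescale one gradient if its own L2 norm exceeds clipnorm.
    norm = math.sqrt(sum(g * g for g in gradient))
    if norm <= clipnorm:
        return list(gradient)
    return [g * clipnorm / norm for g in gradient]


def clip_by_global_norm(gradients, global_clipnorm):
    # Rescale all gradients together if their combined L2 norm exceeds the cap.
    global_norm = math.sqrt(sum(g * g for grad in gradients for g in grad))
    if global_norm <= global_clipnorm:
        return [list(grad) for grad in gradients]
    scale = global_clipnorm / global_norm
    return [[g * scale for g in grad] for grad in gradients]


gradients = [[3.0, -4.0], [0.0, 12.0]]   # two hypothetical weight gradients
by_value = [clip_by_value(g, 1.0) for g in gradients]
by_norm = [clip_by_norm(g, 1.0) for g in gradients]
by_global_norm = clip_by_global_norm(gradients, 1.0)
```

With clipnorm each gradient is rescaled on its own, so both end up with a norm of 1. With global_clipnorm the two gradients keep their relative sizes and only their combined norm is capped at 1.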

Common Errors When Working with Keras Optimizers

The most commonly searched errors when it comes to using Keras optimizers are about RMSprop, SGD, and Adam. These are also the three most popular optimization algorithms used. In order of popularity it goes: Adam, SGD, then RMSprop. Historically, SGD came first, then RMSprop, and then Adam.

ImportError: cannot import name adam from keras optimizers

As mentioned above, Adam comes from a 2014 paper titled Adam: A Method for Stochastic Optimization. This algorithm is highly popular because of the robustness, ease and speed of training, and accuracy. As a result, Adam is often suggested as the default optimizer in many cases.

The reason you are seeing an ImportError is most likely due to a package mishap. It is no longer recommended to pip install keras on its own. Keras is now bundled with Tensorflow, so you should pip install tensorflow and use from tensorflow.keras.optimizers import Adam. Note the capital A: the class is named Adam, not adam.

cannot import name ‘sgd’ from ‘keras.optimizers’

The paper Stochastic Estimation of the Maximum of a Regression Function was published in 1952. There are many other early papers on ideas similar to SGD, but I would consider this the first instance recognizable to the ML world. This algorithm is popular because it’s so foundational. Most new users will play with it.

Just like the Adam ImportError, this SGD error is based on a package error. The same solution applies. Run pip install tensorflow and then use import tensorflow as tf followed by tf.keras.optimizers.SGD, or from tensorflow.keras import optimizers. Whatever floats your goat on these imports, the time complexity is the same.

module ‘keras.optimizers’ has no attribute ‘rmsprop’

The origin of RMSprop is unusual because it was never formally published. Geoffrey Hinton proposed it in a 2012 Coursera lecture, a couple of years before Adam. The first use of RMSprop in a paper that I’ve been able to find is Generating Sequences With Recurrent Neural Networks by Alex Graves from the University of Toronto. RMSprop is used for its robustness and because it often converges faster than SGD with momentum.

Once again, this is the third most searched because RMSprop is the third most popular. The error here is often the same: it’s based on the library. Install tensorflow and then use an import starting with tensorflow.keras instead of keras, with the class name capitalized as RMSprop.

keras.optimizers vs tf.keras.optimizers

All three problems come down to an import error. Why? Because Keras originated as, and technically still is, its own project. Keras is a high level API that can run on top of Tensorflow or Theano.

In industry and for hobbyists, Tensorflow has taken over. As of 2022, Theano is no longer actively maintained and is almost exclusively used in academia. The difference between keras.optimizers and tf.keras.optimizers is that the tf.keras.optimizers library uses a Tensorflow optimized implementation of Keras.

With the current state of Tensorflow, Theano, and Keras, it is no longer recommended to directly install Keras. The only reason you would directly install Keras is if you want to use it with the Theano backend. However, because Theano is no longer actively maintained, this use case almost never comes up. Hence, tf.keras has replaced keras for almost any code snippet from past tutorials you see.
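As a quick sketch of the fix for all three errors, the imports below assume Tensorflow 2.x is installed through pip install tensorflow. The variable names are arbitrary.

```python
# Instead of failing imports such as `from keras.optimizers import adam`,
# import the capitalized classes from the Keras bundled with Tensorflow.
from tensorflow.keras.optimizers import SGD, Adam, RMSprop

adam = Adam(learning_rate=0.001)
sgd = SGD(learning_rate=0.01)
rmsprop = RMSprop(learning_rate=0.001)
```

The same classes are also reachable as tf.keras.optimizers.Adam and friends after import tensorflow as tf.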

Summary of Keras Optimizers in Tensorflow and Common Errors

This post honestly came out a lot longer than I originally meant it to. In this post, we went over the eight Keras optimizers (as of 2022). For implementation, we looked at the Tensorflow backed versions, not the Theano backed ones. This is because Tensorflow is now the de facto Keras backend.

The eight optimizers we went over are: Adam, SGD, RMSprop, Adagrad, Adadelta, Adamax, Nesterov-Accelerated Adam (Nadam), and FTRL. The most popular are Adam, SGD, and RMSprop. Adam is a common default choice due to its strong performance. SGD is popular because it is an introductory technique. RMSprop is popular because of its robustness and overall convergence speed.

The most common errors that come up with Keras optimizers are around keras.optimizers.Adam, keras.optimizers.SGD, and keras.optimizers.RMSprop. This is a result of their popularity and the absorption of Keras into Tensorflow. The most common solution to these errors is to pip install tensorflow and use tf.keras instead of just keras when calling functions.

Yujian Tang

I started my professional software career interning for IBM in high school after winning ACSL two years in a row. I got into AI/ML in college where I published a first author paper to IEEE Big Data. After college I worked on the AutoML infrastructure at Amazon before leaving to work in startups. I believe I create the highest quality software content so that’s what I’m doing now. Drop a comment to let me know!

