Python Speech Recognition with SpeechRecognition

The Python SpeechRecognition library makes it easy to run speech recognition with many different backends.

Speech recognition with SpeechRecognition? Yeah. SpeechRecognition is an automatic speech recognition (ASR) library for Python. It is a wrapper library that works with multiple backends, including CMU Sphinx, Google Cloud, and Azure. You can find the code we cover below in the official SpeechRecognition GitHub repository.

In this post, we will take a look at how to use the Python SpeechRecognition library with multiple backends. We will cover:

  • What is the Python SpeechRecognition Library?
  • Getting Started with Python Speech Recognition
  • Prerequisites for Python Speech Recognition
  • Python Speech Recognition via CMU Sphinx
  • SpeechRecognition using Google Speech Recognition
  • Google Cloud Speech to Text for Speech Recognition with Python SpeechRecognition
  • Python Speech Recognition with Wit.AI
  • Microsoft Azure Speech to Text for Python Speech Recognition
  • Microsoft Bing Voice Recognition to do Speech Recognition in Python
  • Python Speech Recognition with Houndify
  • IBM Speech to Text in Python SpeechRecognition
  • Python Speech Recognition with Other Libraries
  • Summary of Python Speech Recognition with the SpeechRecognition Library

What is the Python SpeechRecognition Library?

Python SpeechRecognition is a BSD 3-Clause licensed project created by Anthony Zhang (copyright 2014-2017). It is a wrapper that connects to multiple APIs and engines. The SpeechRecognition library is advertised to support CMU Sphinx, Google Speech Recognition, Google Cloud Speech API, Wit.ai, Microsoft Bing Voice Recognition, the Houndify API, IBM Speech to Text, and Snowboy Hotword Detection.

Note that Snowboy is no longer around. One drawback of SpeechRecognition is that it is missing some powerful backends. Notable absences include PyTorch, TensorFlow, and newer web APIs like Deepgram.

Getting Started with Python Speech Recognition

Let’s take a step back from the code and understand how Python speech recognition happens at a high level. Automatic speech recognition can be done both in real time, by streaming audio, and asynchronously, on audio files. In this post, we’re going to cover how to use SpeechRecognition to run asynchronous speech recognition on an audio file.

We start with audio data, which takes the shape of a waveform. The waveform is converted into a set of numbers known as a vector, and multiple vectors are combined into a matrix. This vector/matrix-formatted data is then fed into a trained neural network, which gives us a predicted transcription.
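To make the "audio as a vector" idea concrete, here is a minimal sketch using only the Python standard library. A generated sine tone stands in for a real recording (an assumption for illustration); we write it as a WAV file in memory, then read the raw bytes back as a vector of sample values:

```python
import array
import io
import math
import wave

rate = 16000  # samples per second

# build one second of a 440 Hz sine tone as 16-bit signed samples
samples = array.array(
    "h", (int(32767 * math.sin(2 * math.pi * 440 * t / rate)) for t in range(rate))
)

# write it as a mono 16-bit WAV file, entirely in memory
buf = io.BytesIO()
with wave.open(buf, "wb") as w:
    w.setnchannels(1)
    w.setsampwidth(2)
    w.setframerate(rate)
    w.writeframes(samples.tobytes())

# read it back: the raw frame bytes decode into a vector of integers,
# one number per audio sample
buf.seek(0)
with wave.open(buf, "rb") as w:
    frames = w.readframes(w.getnframes())
vector = array.array("h", frames)
print(f"{len(vector)} samples; first values: {vector[:3].tolist()}")
```

Every backend in this post ultimately consumes data like this vector; the library just hides the conversion for us.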

Now that we have an understanding of how speech recognition works, let’s get into the code. The Python SpeechRecognition library allows us to use many different models to do speech recognition. Each of the following sections covers a different model/engine/backend, and you may get different results from each one.

Prerequisites for Python Speech Recognition

First, we need to install the Python SpeechRecognition library. We can do that with pip install SpeechRecognition. Once the library is installed, we can start coding. All of the code below belongs in the same file.

The setup starts by importing the speech_recognition library and the path utilities from os. Then, we use them to find our audio file. In this example, there is an English WAV file, a French AIFF file, and a Chinese FLAC file. Next, we instantiate a SpeechRecognition Recognizer. From there, we open the audio file as the source and read it into the recognizer.

import speech_recognition as sr
 
# obtain path to "english.wav" in the same folder as this script
from os import path
AUDIO_FILE = path.join(path.dirname(path.realpath(__file__)), "english.wav")
# AUDIO_FILE = path.join(path.dirname(path.realpath(__file__)), "french.aiff")
# AUDIO_FILE = path.join(path.dirname(path.realpath(__file__)), "chinese.flac")
 
# use the audio file as the audio source
r = sr.Recognizer()
with sr.AudioFile(AUDIO_FILE) as source:
   audio = r.record(source)  # read the entire audio file

Python Speech Recognition via CMU Sphinx

The first backend we try for this example is CMU Sphinx, an open source automatic speech recognition engine that came out of Carnegie Mellon University. CMU Sphinx has been largely dormant over the past decade, but maintenance restarted in 2022!

All we have to do to use the CMU Sphinx backend with Python SpeechRecognition is call the recognize_sphinx() function on the audio data. We handle two different errors: UnknownValueError, raised when the audio is unintelligible, and RequestError, raised when the backend itself fails.

# recognize speech using Sphinx
try:
   print("Sphinx thinks you said " + r.recognize_sphinx(audio))
except sr.UnknownValueError:
   print("Sphinx could not understand audio")
except sr.RequestError as e:
   print("Sphinx error; {0}".format(e))

SpeechRecognition using Google Speech Recognition

Next, we’ll look at using Google Speech Recognition. This uses the speech recognition API behind Chrome. The service doesn’t require a Google Cloud developer account, but it may be turned off by Google at any time. It worked when this code was written; your mileage may vary.

Just like CMU Sphinx, implementing this is easy. We just call recognize_google on the audio data. We also handle the same two types of errors, an unknown value error and a request error.

# recognize speech using Google Speech Recognition
try:
   # for testing purposes, we're just using the default API key
   # to use another API key, use `r.recognize_google(audio, key="GOOGLE_SPEECH_RECOGNITION_API_KEY")`
   # instead of `r.recognize_google(audio)`
   print("Google Speech Recognition thinks you said " + r.recognize_google(audio))
except sr.UnknownValueError:
   print("Google Speech Recognition could not understand audio")
except sr.RequestError as e:
   print("Could not request results from Google Speech Recognition service; {0}".format(e))

Google Cloud Speech to Text for Speech Recognition with Python SpeechRecognition

This section covers using Google Cloud Speech to Text, the Google Cloud Platform tool that does automatic speech recognition. It is a plug-and-play tool. Google provides tutorials on how to use Google Cloud Speech to Text with Go, Java, Python, and Node.js.

Unlike the CMU Sphinx option above, Google Cloud Speech to Text requires credentials, which Google provides in the form of a JSON file. When calling the function for this tool, recognize_google_cloud, we pass the audio data and the credentials. Just like with CMU Sphinx, we handle the same two errors: unknown values and request errors.

# recognize speech using Google Cloud Speech
GOOGLE_CLOUD_SPEECH_CREDENTIALS = r"""INSERT THE CONTENTS OF THE GOOGLE CLOUD SPEECH JSON CREDENTIALS FILE HERE"""
try:
   print("Google Cloud Speech thinks you said " + r.recognize_google_cloud(audio, credentials_json=GOOGLE_CLOUD_SPEECH_CREDENTIALS))
except sr.UnknownValueError:
   print("Google Cloud Speech could not understand audio")
except sr.RequestError as e:
   print("Could not request results from Google Cloud Speech service; {0}".format(e))

Python Speech Recognition with Wit.AI

Wit.ai is a speech recognition tool acquired by Facebook (Meta) in 2015. They don’t have much info on their blog about who they are or what they do. Luckily, we can use it through Python SpeechRecognition. Just like Google Cloud Speech to Text, Wit.ai operates with an API key; Wit’s API key is a 32-character uppercase alphanumeric string.

Like all the other examples above, SpeechRecognition provides a built-in function for Wit.ai. All we do is call recognize_wit with the audio data and pass the Wit.ai API key into the key parameter. Just like all the options above, we handle the same two errors.

# recognize speech using Wit.ai
WIT_AI_KEY = "INSERT WIT.AI API KEY HERE"  # Wit.ai keys are 32-character uppercase alphanumeric strings
try:
   print("Wit.ai thinks you said " + r.recognize_wit(audio, key=WIT_AI_KEY))
except sr.UnknownValueError:
   print("Wit.ai could not understand audio")
except sr.RequestError as e:
   print("Could not request results from Wit.ai service; {0}".format(e))

Microsoft Azure Speech to Text for Python Speech Recognition

Microsoft Azure Speech to Text is Microsoft’s version of Google Cloud Speech to Text. The API key for Azure Speech to Text is a 32-character lowercase hexadecimal string: the same length as Wit’s key but with slightly different content, and much shorter than the JSON file that Google Cloud Speech to Text uses.

The Python Speech Recognition library makes it awfully easy to call any of these backends. In this case, we call recognize_azure and pass the audio data and the Azure Speech to Text API key. Just like CMU Sphinx and Google Cloud Speech to Text, we handle unknown values and request errors.

# recognize speech using Microsoft Azure Speech
AZURE_SPEECH_KEY = "INSERT AZURE SPEECH API KEY HERE"  # Microsoft Speech API keys are 32-character lowercase hexadecimal strings
try:
   print("Microsoft Azure Speech thinks you said " + r.recognize_azure(audio, key=AZURE_SPEECH_KEY))
except sr.UnknownValueError:
   print("Microsoft Azure Speech could not understand audio")
except sr.RequestError as e:
   print("Could not request results from Microsoft Azure Speech service; {0}".format(e))

Microsoft Bing Voice Recognition to do Speech Recognition in Python

Why does Microsoft have two different voice recognition tools? Because of corporate dysfunction. Microsoft Bing Voice Recognition is another speech recognition tool from Microsoft. It may not have been part of Azure at the time this code was written, but it sure is now.

Bing takes an API key in the same format as the Azure API key. Really makes you think: why are these two different? Anyway, we call recognize_bing on the audio data with the Bing API key to get our transcription. Just like all the other backends above, we handle the same two errors.

# recognize speech using Microsoft Bing Voice Recognition
BING_KEY = "INSERT BING API KEY HERE"  # Microsoft Bing Voice Recognition API keys are 32-character lowercase hexadecimal strings
try:
   print("Microsoft Bing Voice Recognition thinks you said " + r.recognize_bing(audio, key=BING_KEY))
except sr.UnknownValueError:
   print("Microsoft Bing Voice Recognition could not understand audio")
except sr.RequestError as e:
   print("Could not request results from Microsoft Bing Voice Recognition service; {0}".format(e))

Python Speech Recognition with Houndify

Houndify is a voice AI platform from SoundHound. It provides more than just automatic speech recognition: Houndify also offers natural language understanding and text-to-speech capabilities. Unlike the Azure and Google Cloud speech recognition tools, Houndify uses two credentials.

Houndify requires a client ID and a client key, both of which are Base64-encoded strings. Using Houndify’s speech recognition with Python SpeechRecognition is just as easy as the other engines: we call recognize_houndify and pass the audio data, the ID, and the key.

# recognize speech using Houndify
HOUNDIFY_CLIENT_ID = "INSERT HOUNDIFY CLIENT ID HERE"  # Houndify client IDs are Base64-encoded strings
HOUNDIFY_CLIENT_KEY = "INSERT HOUNDIFY CLIENT KEY HERE"  # Houndify client keys are Base64-encoded strings
try:
   print("Houndify thinks you said " + r.recognize_houndify(audio, client_id=HOUNDIFY_CLIENT_ID, client_key=HOUNDIFY_CLIENT_KEY))
except sr.UnknownValueError:
   print("Houndify could not understand audio")
except sr.RequestError as e:
   print("Could not request results from Houndify service; {0}".format(e))

IBM Speech to Text in Python SpeechRecognition

The last Python SpeechRecognition backend we’re going to cover in this post is IBM Speech to Text, IBM’s competitor to the Google Cloud and Azure speech-to-text services. It’s built on IBM’s famous Watson AI. The API interface is slightly different in that it uses a username and password; the username is a UUID-style string rather than something human-friendly, which is a bit awkward.

You already know that SpeechRecognition provides a function to call this engine. We call the recognize_ibm function. We pass it the audio data, the username, and the password. Just as we did above, we also handle the same two types of errors: unknown values and request errors.

# recognize speech using IBM Speech to Text
IBM_USERNAME = "INSERT IBM SPEECH TO TEXT USERNAME HERE"  # IBM Speech to Text usernames are strings of the form XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX
IBM_PASSWORD = "INSERT IBM SPEECH TO TEXT PASSWORD HERE"  # IBM Speech to Text passwords are mixed-case alphanumeric strings
try:
   print("IBM Speech to Text thinks you said " + r.recognize_ibm(audio, username=IBM_USERNAME, password=IBM_PASSWORD))
except sr.UnknownValueError:
   print("IBM Speech to Text could not understand audio")
except sr.RequestError as e:
   print("Could not request results from IBM Speech to Text service; {0}".format(e))

Python Speech Recognition with Other Libraries

Python SpeechRecognition was made in the mid-2010s. While it is still useful and relevant, it misses some powerful modern backends. Since the late 2010s, we’ve seen the rise of PyTorch and TensorFlow in machine learning. Now, we have libraries like TorchAudio that can manipulate audio data and help us do speech recognition.

Other speech recognition libraries in Python include DeepSpeech, Kaldi, and wav2vec. DeepSpeech came from a 2014 Baidu paper. Kaldi began in 2009 at Johns Hopkins. Facebook announced wav2vec as an ASR tool in 2019. We have also seen the rise of ASR companies like AssemblyAI, Deepgram, and Rev AI.

Summary of Python Speech Recognition with the SpeechRecognition Library

This post serves as an introduction to Python speech recognition. We cover how to use the Python SpeechRecognition library to interact with multiple backends. To do more advanced speech recognition, we can interact directly with these backends or use backends that SpeechRecognition doesn’t include.

The backends we interacted with are CMU Sphinx, Google Speech Recognition, Google Cloud Speech to Text, Microsoft Azure/Bing Speech to Text, Houndify, Wit.ai, and IBM Speech to Text. We also mentioned that we can do ASR with machine learning libraries like TensorFlow and PyTorch, as well as with web APIs.

Learn More

To learn more, feel free to reach out to me @yujian_tang on Twitter, connect with me on LinkedIn, and join our Discord. Remember to follow the blog to stay updated with cool Python projects and ways to level up your Software and Python skills! If you liked this article, please Tweet it, share it on LinkedIn, or tell your friends!

Yujian Tang

I started my professional software career interning for IBM in high school after winning ACSL two years in a row. I got into AI/ML in college where I published a first author paper to IEEE Big Data. After college I worked on the AutoML infrastructure at Amazon before leaving to work in startups. I believe I create the highest quality software content so that’s what I’m doing now. Drop a comment to let me know!
