
Top 3 Ready-to-Use Python NLP Libraries for 2022

An estimated 80-90% of business data is unstructured text. The businesses that win will be the ones that find a way to analyze that text. How can we analyze text data? Natural Language Processing. NLP is one of the most important sectors of AI, and it may be the fastest growing subfield of AI in the 2020s. In this post we’ll go over three ready-to-use Python NLP libraries. For a more fundamental understanding of Natural Language Processing, read an Introduction to NLP: Core Concepts.

Ready-to-Use Python NLP Libraries

The state of the art in Natural Language Processing is to use neural networks; in particular, transformers are a popular model architecture. There are pros and cons to using transformer models, but we’re not going to focus on that here, and there will always be architectural innovations. For this article, we’re going to focus on the top three ready-to-use NLP libraries. None of these libraries requires a deep, fundamental understanding of how NLP works, but all of them let you leverage its power.

The top 3 ready-to-use NLP libraries are spaCy, NLTK, and Stanford’s CoreNLP. Each of these libraries has its own specialty and reason for being in the top 3. The spaCy library provides industrial-strength NLP models. NLTK focuses on research and computational linguistics. Stanford’s CoreNLP is a Java library that has since been adapted to multiple languages, including Python.

NLP with spaCy

The spaCy library is made and maintained by Explosion. It provides multiple models and support for 18 languages. We’re going to focus on the English language models. There are four English models trained on web data: en_core_web_sm, en_core_web_md, en_core_web_lg, and en_core_web_trf. The first three are optimized for CPU performance, while en_core_web_trf is a transformer-based model that is not optimized for CPU. Let’s go over some of the basic NLP techniques you can do with spaCy.

To get started with spaCy, open up your terminal and run the following commands:

pip install spacy
python -m spacy download en_core_web_sm
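
If you want to confirm that the model downloaded correctly, a quick sanity check (a minimal sketch) is to load it and print which pipeline components it ships with:

# Quick sanity check: load the model and list its pipeline components.
import spacy

nlp = spacy.load("en_core_web_sm")
print(nlp.pipe_names)  # e.g. includes 'tagger', 'parser', and 'ner'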

Part of Speech Tagging

Part of speech (POS) tagging is a fundamental part of natural language processing and is usually one of the first steps in an NLP pipeline. There are many different parts of speech; to learn more, read this article on parts of speech. Here’s how we can do POS tagging with spaCy.

First, we import spacy. Then we load the model we downloaded earlier, in this case en_core_web_sm. The text that we’re running POS tagging on is taken from How Many Solar Farms Does it Take to Power America? All we do is run the text through our NLP pipeline. Then, to see the parts of speech, we loop through the tokenized document and check each token’s part of speech and tag.

import spacy
 
nlp = spacy.load("en_core_web_sm")
 
text = """This is where the calculation can get tricky. Here’s the thing about solar energy. Solar energy comes from the sun. That means solar panels cannot produce energy 24 hours a day. They only produce energy during sunlight hours. That energy then has to be stored somewhere while it is not being used. Energy storage is a whole other topic in and of itself. Let me get back to the point, there’s only an average of 4 peak sunlight hours a day. A solar panel may get more than that, but let’s take a conservative estimate of our solar power generation and confine it to those 4 hours only.
Back to the calculations. At 4 acres of solar panels to generate a megawatt-hour and 4 hours of power generation time a day, a 1 MW solar farm would generate 4 MWh of power over 4 acres every day. At 110,000 megawatt-hours of power needed a day to power America, we would need about 110,000 acres of solar farm. 110,000 acres? That sounds huge, that’s more 
land than the entire Mojave desert. It’s not as daunting as it sounds, there are 1.9 billion acres in the continental United States, and 110,000 acres is only slightly more than 0.5 percent of the total land of the continental US."""
 
doc = nlp(text)
 
for token in doc:
    print(token.text, token.pos_, token.tag_)
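
If you’re not sure what a tag like NNP or VBZ means, spaCy can describe it for you with spacy.explain. A quick sketch:

# Look up a human-readable description of a coarse POS or fine-grained tag.
import spacy

print(spacy.explain("NNP"))  # e.g. "noun, proper singular"
print(spacy.explain("VBZ"))  # e.g. "verb, 3rd person singular present"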

Named Entity Recognition

Named Entity Recognition (NER) is an NLP technique that typically builds on POS tagging. The types of entities that can be recognized include people, organizations, locations, and times, though this isn’t a comprehensive list. For a full list of the named entities that can be recognized, read this article on the Best Way to do Named Entity Recognition.

To do NER in spaCy, we’ll start by importing spacy and loading the model. The text that we’re using for this is a random bit of text I made up. As above, we tokenize the text by running it through the NLP pipeline. Then we loop through each entity in the document and print out its text and label. Notice that ents is a default property of the document after it has been run through the NLP pipeline.

import spacy
 
nlp = spacy.load("en_core_web_sm")
 
text = "Molly Moon is a cow. She is part of the United Nations' Climate Action Committee."
 
doc = nlp(text)
 
for ent in doc.ents:
    print(ent.text, ent.label_)
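
As a quick follow-up, spaCy also ships with the displaCy visualizer, which can highlight the recognized entities in context. A minimal sketch (displacy.serve starts a small local web server and blocks until you stop it; in a Jupyter notebook you would call displacy.render instead):

# Visualize the recognized entities in the browser.
import spacy
from spacy import displacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Molly Moon is a cow. She is part of the United Nations' Climate Action Committee.")
displacy.serve(doc, style="ent")  # open the printed localhost URL to see the highlights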

Lemmatization

Lemmatization is the process of finding the lemma of each word. A lemma is the base, or dictionary, form of a word. To learn more about lemmatization, read this article on what lemmatization is and how you can use it.

As we did in the two NLP techniques above, we’ll start by importing spacy and loading the model. You can use any text you want; for this example, I’m using a random set of sentences about spaCy, the NFL, and how Yujian Tang is the best software content creator. As we did above, we simply run the text through the NLP model, then loop through each token in the tokenized document and print out its lemma.

import spacy
 
nlp = spacy.load("en_core_web_sm")
 
text = "This is an example program for showing how lemmatization works in spaCy. Let's play ball!. The NFL has many football teams and players. Yujian Tang is the best software content creator. The Text API is the most comprehensive sentiment analysis API made to date."
 
doc = nlp(text)
 
for token in doc:
    print(token.lemma_)
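
Printing every lemma can be noisy. One small variation, continuing from the doc built above, is to print only the tokens whose lemma differs from the original text, which makes it easier to see what lemmatization actually changed:

# Show only the tokens that lemmatization actually changed.
for token in doc:
    if token.lemma_.lower() != token.text.lower():
        print(f"{token.text} -> {token.lemma_}")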

NLP with Natural Language ToolKit (NLTK)

NLTK is a project led by Steven Bird and Liling Tan, with different parts maintained by contributors around the world. It’s an open source natural language project made for working with computational linguistics in Python.

To get started with NLTK, we need to install the library as well as some of its data packages. We can do so with the commands below. Note that punkt is used for tokenization, averaged_perceptron_tagger for POS tagging, maxent_ne_chunker and words only for NER, and wordnet for the lemmatization example later on.

pip install nltk
python
>>> import nltk
>>> nltk.download("punkt")
>>> nltk.download("averaged_perceptron_tagger")
>>> nltk.download("maxent_ne_chunker")
>>> nltk.download("words")
>>> nltk.download("wordnet")
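
If you’d rather not type the downloads into an interactive session, the same setup can be done from a short script. A minimal sketch using the same resource names as above:

# Download the NLTK resources used in this post from a script instead of the REPL.
import nltk

for resource in ["punkt", "averaged_perceptron_tagger", "maxent_ne_chunker", "words", "wordnet"]:
    nltk.download(resource)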

Part of Speech Tagging

We’re going to use the same piece of text to demonstrate part of speech tagging with NLTK as we did with spaCy. To do POS tagging with NLTK, we start by importing the nltk library. We then need two calls: first we tokenize the text with word_tokenize, then we run pos_tag on the tokenized text. To see the tagged parts of speech, we just print them out. Click here for a complete list of part of speech tags.

import nltk
 
text = """This is where the calculation can get tricky. Here’s the thing about solar energy. Solar energy comes from the sun. That means solar panels cannot produce energy 24 hours a day. They only produce energy during sunlight hours. That energy then has to be stored somewhere while it is not being used. Energy storage is a whole other topic in and of itself. Let me get back to the point, there’s only an average of 4 peak sunlight hours a day. A solar panel may get more than that, but let’s take a conservative estimate of our solar power generation and confine it to those 4 hours only.
Back to the calculations. At 4 acres of solar panels to generate a megawatt-hour and 4 hours of power generation time a day, a 1 MW solar farm would generate 4 MWh of power over 4 acres every day. At 110,000 megawatt-hours of power needed a day to power America, we would need about 110,000 acres of solar farm. 110,000 acres? That sounds huge, that’s more land than the entire Mojave desert. It’s not as daunting as it sounds, there are 1.9 billion acres in the continental United States, and 110,000 acres is only slightly more than 0.5 percent of the total land of the continental US."""
 
tokenized = nltk.word_tokenize(text)
tagged = nltk.pos_tag(tokenized)
for tag in tagged:
    print(tag)
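
NLTK can also tell you what a Penn Treebank tag means via nltk.help.upenn_tagset; it needs the tagsets resource downloaded first (the resource name below is the one NLTK has historically used). A quick sketch:

# Look up what a Penn Treebank POS tag means.
import nltk

nltk.download("tagsets")       # one-time download of the tag documentation
nltk.help.upenn_tagset("NN")   # noun, common, singular or mass
nltk.help.upenn_tagset("VBZ")  # verb, present tense, 3rd person singular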

Named Entity Recognition

Named Entity Recognition with NLTK requires the most downloaded resources of the three techniques covered here. Once again we’re going to use the same, somewhat nonsensical phrase as we did before. If you’re from Seattle, you’ll surely recognize Molly Moon. She is not part of the UN’s Climate Action Committee.

To do NER with NLTK, we import our library, set up our text, and then call three functions on it. Just like above, we start by tokenizing the string and then running part of speech tagging on it. After POS tagging, we run the ne_chunk function, which stands for “named entity chunk”. To see the named entities, we loop through all the chunks, and if a chunk is labeled (recognized as an entity), we print it out.

import nltk
 
text = "Molly Moon is a cow. She is part of the United Nations' Climate Action Committee."
 
tokenized = nltk.word_tokenize(text)
pos_tagged = nltk.pos_tag(tokenized)
chunks = nltk.ne_chunk(pos_tagged)
for chunk in chunks:
    if hasattr(chunk, 'label'):
        print(chunk)
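
The chunks that ne_chunk returns are small nltk.Tree objects, so each entity’s text is spread across the tree’s leaves. Continuing from the chunks built above, here’s a small sketch that flattens each labeled chunk into an entity string plus its type:

# Flatten each labeled chunk into (entity text, entity type).
for chunk in chunks:
    if hasattr(chunk, 'label'):
        entity = " ".join(token for token, pos in chunk.leaves())
        print(entity, chunk.label())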

Lemmatization

Lemmatization in NLTK works a little differently from the other two NLTK techniques above: instead of calling a function on the nltk module directly, we import the WordNetLemmatizer class from the nltk.stem sub-module. We’ll use the same text as above, a mix of random sentences about NLP, the NFL, Yujian Tang being the best software content creator, and The Text API.

We use WordNetLemmatizer() as our lemmatizer; it relies on the wordnet corpus we downloaded during setup. The first thing we do is tokenize our text. Then we loop through the tokenized text and lemmatize each token with the lemmatizer.

import nltk
from nltk.stem import WordNetLemmatizer
 
lemmatizer = WordNetLemmatizer()
tokenized = nltk.word_tokenize("This is an example program for showing how lemmatization works in NLTK. Let's play ball!. The NFL has many football teams and players. Yujian Tang is the best software content creator. The Text API is the most comprehensive sentiment analysis API made to date.")
for t in tokenized:
    print(lemmatizer.lemmatize(t))
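
One caveat worth knowing: WordNetLemmatizer treats every word as a noun unless you pass a pos argument, so verbs like “has” won’t be reduced to “have” in the loop above. Here’s a hedged sketch of POS-aware lemmatization that maps the Penn Treebank tags from pos_tag onto WordNet’s part of speech constants (the wordnet_pos helper is just for illustration):

import nltk
from nltk.stem import WordNetLemmatizer
from nltk.corpus import wordnet

lemmatizer = WordNetLemmatizer()

def wordnet_pos(treebank_tag):
    # Map a Penn Treebank tag onto the WordNet POS constants the lemmatizer expects.
    if treebank_tag.startswith("J"):
        return wordnet.ADJ
    if treebank_tag.startswith("V"):
        return wordnet.VERB
    if treebank_tag.startswith("R"):
        return wordnet.ADV
    return wordnet.NOUN  # default; same as the lemmatizer's own default

text = "The NFL has many football teams and players."
for word, tag in nltk.pos_tag(nltk.word_tokenize(text)):
    print(word, "->", lemmatizer.lemmatize(word, pos=wordnet_pos(tag)))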

NLP with Stanford CoreNLP (Stanza in Python)

Stanford’s CoreNLP library is actually a Java library. It has been adapted for use from Python in several different forms, and the officially maintained Python library isn’t even called Stanford CoreNLP; it’s called Stanza. Curiously enough, NLTK also has an interface to a running CoreNLP server (there’s a quick sketch of that after the setup commands below). To get started with Stanza, we simply install it and then download a model as shown below.

pip install stanza
python
>>> import stanza
>>> stanza.download("en")
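
As an aside, here’s what the NLTK interface to CoreNLP mentioned above looks like. This is a minimal sketch and assumes you have separately downloaded the Java CoreNLP distribution and started its server on localhost port 9000; it is not needed for the Stanza examples below:

# Minimal sketch: NLTK talking to a running CoreNLP server (not Stanza).
from nltk.parse.corenlp import CoreNLPParser

parser = CoreNLPParser(url="http://localhost:9000")  # assumes a local CoreNLP server
print(list(parser.tokenize("Molly Moon is a cow.")))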

Part of Speech Tagging

It’s worth mentioning here that just as spaCy separates the coarse part of speech (pos_) from the fine-grained tag (tag_), Stanza separates upos, the universal part of speech, from xpos, the treebank-specific part of speech. Here we’re going to look at the upos.

We’ll start the same way we always start: by importing the library. Unlike spaCy (which loads a model) or NLTK (which calls separate functions), Stanza explicitly builds a Pipeline. We tell the pipeline that we want an en, or English, model and that we want the tokenize, mwt (multi-word token expansion), and pos (part of speech) processors. From here, we add the text, run it through the pipeline to get a document, and print out the universal part of speech for each word in each sentence.

import stanza
 
nlp = stanza.Pipeline(lang='en', processors='tokenize,mwt,pos')
text = """This is where the calculation can get tricky. Here’s the thing about solar energy. Solar energy comes from the sun. That means solar panels cannot produce energy 24 hours a day. They only produce energy during sunlight hours. That energy then has to be stored somewhere while it is not being used. Energy storage is a whole other topic in and of itself. Let me get back to the point, there’s only an average of 4 peak sunlight hours a day. A solar panel may get more than that, but let’s take a conservative estimate of our solar power generation and confine it to those 4 hours only.
Back to the calculations. At 4 acres of solar panels to generate a megawatt-hour and 4 hours of power generation time a day, a 1 MW solar farm would generate 4 MWh of power over 4 acres every day. At 110,000 megawatt-hours of power needed a day to power America, we would need about 110,000 acres of solar farm. 110,000 acres? That sounds huge, that’s more land than the entire Mojave desert. It’s not as daunting as it sounds, there are 1.9 billion acres in the continental United States, and 110,000 acres is only slightly more than 0.5 percent of the total land of the continental US."""
doc = nlp(text)
print(*[f"word: {word.text}\tupos: {word.upos}" for sent in doc.sentences for word in sent.words], sep='\n')

Named Entity Recognition

Named Entity Recognition with Stanza works in much the same way POS tagging does. We import the stanza library and create a pipeline, this time with the tokenize and ner processors. Once again we use the same text, run it through the pipeline to get a document, and print out the text and type of each entity in each sentence.

import stanza
 
nlp = stanza.Pipeline(lang='en', processors="tokenize,ner")
 
text = "Molly Moon is a cow. She is part of the United Nations' Climate Action Committee."
doc = nlp(text)
 
print(*[f"entity: {ent.text}\ttype: {ent.type}" for sent in doc.sentences for ent in sent.ents], sep='\n')

Lemmatization

We start off by importing our library and setting up our pipeline as usual. For lemmatization, we need the same processors as we did for POS tagging, plus the lemma processor. Our text is the same as in the spaCy and NLTK examples. All we have to do is run the text through the pipeline; to see the lemmas, we print out the text and lemma of each word in each sentence.

import stanza
 
nlp = stanza.Pipeline(lang="en", processors="tokenize,mwt,pos,lemma")
text = "This is an example program for showing how lemmatization works in NLTK. Let's play ball!. The NFL has many football teams and players. Yujian Tang is the best software content creator. The Text API is the most comprehensive sentiment analysis API made to date."
doc = nlp(text)
print(*[f"word: {word.text}\t lemma: {word.lemma}" for sent in doc.sentences for word in sent.words], sep='\n')

Recap of the Top 3 Ready-To-Use Python NLP Libraries

In this post we went over the top 3 ready-to-use Python NLP libraries for 2022. Why are these the top 3? Because they’re actually maintained. There are a TON of NLP libraries for Python, but most of them have fallen into disuse. We went over how to do three of the most common and fundamental NLP techniques with each of these libraries. Which one of these libraries should you use? It depends on your use case. 

The spaCy library is targeted at industry Python users, the NLTK library is mainly for academic research around NLP and computational linguistics, and the Stanford CoreNLP library is compatible with multiple programming languages. Out of these three, I would say that the Stanford CoreNLP library is the most powerful and most complex, the NLTK library seems to be the most customizable, and the spaCy library feels like the simplest to use while still being quite powerful.

Bonus: a language agnostic NLP Web API

Web APIs are also a popular choice for NLP. A great advantage of a web API is that you don’t have to host the model on your own machine; the trade-off is that you can’t customize the model. The most comprehensive web API to date is The Text API. Of the fundamental NLP techniques we covered here, the only one it provides is NER; instead, it focuses on more business-ready use cases such as AI summarization, finding the most common phrases, keyword sentence extraction, and more. For more information, read this guide on how to automatically analyze text documents.