
Text Sentiment Analysis and How to Do it

Sentiment analysis is an example of applied Natural Language Processing (NLP). In this context, “sentiment” is almost interchangeable with text polarity. Text polarity is a measure from -1 to 1 of the sentiment of the text. The dictionary definition of sentiment is actually “one’s view or attitude towards something”, so this could include emotions from sadness to happiness to surprise. While it is possible to predict emotion, this article is going to focus on how positive or negative a text is. We’ll cover emotion in an article on emotion detection and how to do it.

In this article we’ll cover:

  • What is Text Sentiment
  • Text Sentiment vs Text Polarity vs Sentiment Analysis
  • How to use AI to get Text Sentiment
    • AI Text Sentiment with spaCy
    • Sentiment Analysis with NLTK
    • How to get the sentiment of a text with a web API
  • Applications of Text Sentiment Analysis
    • COVID headlines
  • Summary of How to do Sentiment Analysis with AI

What is Text Sentiment?

Let’s first take a look at what text sentiment is. Text sentiment is the general sentiment of a text: the overall outlook provided by a text document. We’ll use text sentiment to measure polarity on a scale from -1 to 1. For our purposes, sentiment will measure whether a text document is generally positive or negative. A naive measure of text sentiment simply takes an average of the sentiment of each word.

We will measure the total sentiment of a text as a weighted combination of the sentiments of different words, phrases, and sentences. You are free to decide how you’d like to weight each word, phrase, or sentence. In the implementation examples below, we’ll get automatic sentiment scores from spaCy and NLTK that you can extrapolate from and adjust. The Text API uses a proprietary mix of sentiments from words, phrases, and sentences.
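
As a purely illustrative sketch of the naive and weighted approaches, the snippet below averages per-word scores from a tiny, made-up lexicon. The lexicon values and the weighting scheme are hypothetical; real libraries use much richer scoring.

# A toy illustration of word-level sentiment averaging.
# The lexicon and the weights are made up for demonstration only.
toy_lexicon = {"amazing": 0.9, "easy": 0.4, "useful": 0.5, "bad": -0.7}

def naive_sentiment(text):
    # Average the score of every word; unknown words count as neutral (0.0).
    words = text.lower().split()
    scores = [toy_lexicon.get(word, 0.0) for word in words]
    return sum(scores) / len(scores) if scores else 0.0

def weighted_sentiment(text, polar_weight=2.0):
    # Weigh words that carry sentiment more heavily than neutral filler words.
    words = text.lower().split()
    pairs = [(toy_lexicon.get(word, 0.0), polar_weight if word in toy_lexicon else 1.0)
             for word in words]
    total_weight = sum(weight for _, weight in pairs)
    return sum(score * weight for score, weight in pairs) / total_weight if total_weight else 0.0

print(naive_sentiment("the text api is amazing and easy to use"))
print(weighted_sentiment("the text api is amazing and easy to use"))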

Text Sentiment vs Text Polarity vs Sentiment Analysis

Before we get into some implementation examples, let’s get a clearer picture of sentiment. There are three phrases that are used pretty much interchangeably in the NLP space by most people. Text sentiment, text polarity, and sentiment analysis are usually only distinguished in specific use cases or when speaking with NLP experts. Let’s get the definitions.

  1. Text sentiment – the overall view of a text including positivity, outlook, and emotion
  2. Text polarity – a measure from -1 to 1 of how positive or negative a text is
  3. Sentiment analysis – the process of determining the sentiment of a text document

In this article, we are discussing how to use sentiment analysis to determine the polarity of a text.
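
As a tiny illustration of going from a polarity score to a sentiment label, the sketch below buckets a score into positive, negative, or neutral. The threshold value is arbitrary and just for demonstration.

def label_sentiment(polarity, threshold=0.05):
    # Map a polarity in [-1, 1] to a coarse sentiment label; the threshold is arbitrary.
    if polarity > threshold:
        return "positive"
    if polarity < -threshold:
        return "negative"
    return "neutral"

print(label_sentiment(0.5))    # positive
print(label_sentiment(-0.2))   # negative
print(label_sentiment(0.01))   # neutral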

How Can I Use AI to Get the Sentiment of a Text?

Natural Language Processing is a subfield of Artificial Intelligence. Polarity is a common technique for many NLP pipelines. In this post, we’ll cover how to use two of the biggest Python NLP libraries and an API to get text sentiment. First we’ll do text sentiment with spaCy, then NLTK, and finally with The Text API.

AI Text Sentiment with spaCy

To get the sentiment of a text with spaCy we’ll need to install two libraries and download a model. We can do that by using the lines below in the terminal.

pip install spacy spacytextblob
python -m spacy download en_core_web_sm

We’ll begin our program the same way we always do, by handling the imports. We’ll import the spacy library and the SpacyTextBlob class from the spacytextblob package. Next, we’ll load up the model and add the spacytextblob component to the NLP pipeline. We can use any text; for this example, we’ll just use a text description of The Text API. Then, we’ll create a document from the text using the NLP model. Finally, we’ll print out the overall polarity of the text from the model.

import spacy
from spacytextblob.spacytextblob import SpacyTextBlob
 
nlp = spacy.load("en_core_web_sm")
nlp.add_pipe('spacytextblob')
text = "The Text API is super easy to use and super useful for anyone who needs to do text processing. It's the best Text Processing web API and allows you to do amazing NLP without having to download or manage any models."
doc = nlp(text)
 
print(doc._.polarity)

Sentiment Analysis with NLTK

To follow this example using the NLTK library, we’ll have to install the NLTK library and download three of its packages. We can do this with the lines below in the terminal.

pip install nltk
python
>>> import nltk
>>> nltk.download(["averaged_perceptron_tagger", "punkt", "vader_lexicon"])

As always, we’ll start off our program with imports. We’ll need to import the SentimentIntensityAnalyzer class from the nltk.sentiment module. Then we’ll initialize an object of the SentimentIntensityAnalyzer class. We’ll use the same text here as we did for the spaCy model. Next, we’ll get the polarity_scores of the text from the SentimentIntensityAnalyzer object and print out the scores.

from nltk.sentiment import SentimentIntensityAnalyzer
 
sia = SentimentIntensityAnalyzer()
text = "The Text API is super easy to use and super useful for anyone who needs to do text processing. It's the best Text Processing web API and allows you to do amazing NLP without having to download or manage any models."
scores = sia.polarity_scores(text)
print(scores)

How to Get the Sentiment of a Text with an NLP API

For this example, we’ll need to install the requests library and get a free API key from The Text API. You can download the library with the line below in the terminal.

pip install requests

As always, we will start our program with the imports. We need to import the requests library to send requests and the json library to parse the response. I also imported the API key from my config file, but you can import it from wherever you saved it or define it directly in this file. We’ll use the exact same text as we did with spaCy and NLTK.

We need to create some headers to send with the request. The headers will tell the server that we’re sending JSON content and pass the API key. The body will simply pass the text object. We also need to know the URL of the API endpoint. All we need to do is send a POST request and parse the response into a JSON object. The polarity will be the “text polarity” key of the returned object.

import requests
import json
from config import apikey
 
text = "The Text API is super easy to use and super useful for anyone who needs to do text processing. It's the best Text Processing web API and allows you to do amazing NLP without having to download or manage any models."
headers = {
    "Content-Type": "application/json",
    "apikey": apikey
}
body = {
    "text": text
}
url = "https://app.thetextapi.com/text/text_polarity"
 
response = requests.post(url, headers=headers, json=body)
polarity = json.loads(response.text)["text polarity"]
print(polarity)

Applications of Text Sentiment Analysis

Sentiment analysis for text can be applied in many ways. We can use it to get an idea of what people are really saying in reviews, how customers feel about our product, or even how employees feel about the company. We can also use it to analyze the news and see how positive or negative it is. In this section we’ll show an example of using text sentiment analysis to analyze COVID headlines over time.

Text Sentiment Polarity of COVID Headlines

One application of text sentiment analysis is analyzing the news. Since we’re about two years into the COVID pandemic, analyzing COVID headlines could be interesting. I decided to do an analysis of the NY Times’ headlines about COVID over the last two years. What did I learn? That they were much more negative about COVID in the first year than they have been this year.

[Figure: Text polarity of COVID article headlines so far]

For a full tutorial, see Using AI to Analyze COVID Headlines.

Summary of How to do Sentiment Analysis with AI

In this article we learned about text sentiment, sentiment analysis, and text polarity. We learned that these terms are mostly interchangeable but have nuanced differences. Then we saw how we can use AI to get the sentiment of a text. We saw how to implement it in three different ways, with spaCy, NLTK, and The Text API. Finally, we saw an example of how we can apply text sentiment analysis.



NLP: Stop Words, When and Why to Use Them

There are 326 “Stop Words” by default in spaCy. What are stopwords (or stop words)? They’re common words that we don’t want to include in some of our analysis when we perform Natural Language Processing. These are words that generally don’t contribute anything to the meaning of the text. However, we can’t always remove stopwords. In this article we’re going to go over why we remove stopwords, which NLP techniques and applications should keep or remove stopwords, and lists of default stop words for spaCy and NLTK.

Why Do We Remove Stopwords?

Stopwords are words that don’t add to the overall meaning of our text. When performing NLP tasks that revolve around understanding, we don’t need these words. Since machine learning is computationally expensive, it benefits us to process as little data as possible while still being able to produce a usable result. Of course, we can’t remove stop words for every task, so let’s take a look at which tasks we should remove stopwords for and which tasks we should keep them for.

Which NLP Techniques or Applications Should Remove Stop Words?

As we talked about above, not all Natural Language Processing tasks require removing stop words. The NLP techniques or applications that should use stopword removal in the pipeline are ones that revolve around meaning. These are usually the Natural Language Understanding tasks. These include applications like sentiment analysis, semantic parsing, or spam filtering. The tasks that don’t require stop words are ones which don’t necessarily need these common words to construct their responses.

Which NLP Techniques or Applications Should Keep Stop Words?

So, if we want to remove stopwords for NLP techniques and applications that don’t require them in their responses, which ones should keep stop words? When we’re doing NLP tasks that require the whole text in its processing, we should keep stopwords. Examples of these kinds of NLP tasks include text summarization, language translation, and when doing question-answer tasks. You can see that these tasks depend on some common words such as “for”, “on”, or “in” to model the connection between words. 

List of Default English Stop Words from Different Libraries

In our introduction to the top 3 NLP libraries in Python, we went over spaCy, NLTK, and CoreNLP. Interestingly, there’s no universal list of stopwords. The spaCy library has 326 default stopwords in English, the NLTK library has 179, and CoreNLP doesn’t have its own list of default stopwords. Let’s take a look at the default stopwords from spaCy and NLTK and how to get them.

List of all 326 Default Stopwords in spaCy

[Figure: spaCy stopwords word cloud]

There are 326 default stopwords in spaCy. To get these, we install the `spacy` library and download the `en_core_web_sm` model. The default stop words come with the model. We can see the stopwords by loading the model and printing its `Defaults.stop_words`.

pip install spacy
python -m spacy download en_core_web_sm

import spacy
nlp = spacy.load("en_core_web_sm")
print(nlp.Defaults.stop_words)
'you', 'something', 'anyhow', 'would', 'not', 'first', 'now', 'without', 'which', 'may', 'regarding', '’d', 'back', 'nevertheless', 'how', 'should', 'bottom', 'by', 'twelve', 'least', 'but', '‘d', 'thence', 'i', 'hers', 'are', 'therein', 'same', 'indeed', 'others', 'whither', 'your', '’ll', 'either', 'last', 'therefore', 'do', 'whence', 'we', 'top', 'beforehand', 'though', 'across', 'everyone', 'only', 'full', 'fifteen', 'hereby', 'since', 'while', 're', 'beside', 'quite', 'her', 'is', 'their', 'meanwhile', 'neither', 'various', 'everywhere', "'d", 'made', 'nowhere', 'name', 'of', 'done', 'ever', 'onto', 'off', 'its', 'most', 'twenty', 'next', 'after', 'does', 'whether', 'say', 'please', 'at', 'sometimes', "n't", 'hereafter', 'here', 'until', 'itself', 'latterly', 'well', 'became', 'under', 'behind', 'the', 'me', 'must', 'give', 'former', 'using', 'or', 'otherwise', 'noone', '‘s', 'yours', 'everything', 'wherein', 'even', 'take', 'put', 'ourselves', 'themselves', 'him', 'beyond', 'whose', 'another', 'with', 'every', 'whom', 'somewhere', 'forty', 'via', '’ve', 'get', "'s", '‘re', 'any', 'due', 'really', '’re', 'towards', 'it', 'whereupon', 'none', 'anyway', 'very', 'among', 'before', 'sixty', 'eleven', 'seeming', 'why', 'whereby', 'whenever', 'per', 'ours', 'namely', 'they', "'m", 'along', 'somehow', 'yourself', 'many', 'empty', 'who', 'becoming', 'hence', 'them', 'n’t', 'between', 'a', 'be', 'further', 'against', 'else', 'when', 'has', 'will', 'anyone', 'was', 'several', 'there', 'three', 'formerly', 'one', 'my', 'were', 'side', 'cannot', 'becomes', "'ll", 'make', 'such', 'never', 'amount', 'enough', 'just', 'our', 'those', 'besides', '’s', 'being', 'part', 'except', 'someone', 'often', 'seems', '‘ve', 'latter', "'ve", 'afterwards', 'both', 'during', 'unless', 'together', 'n‘t', 'show', 'keep', 'too', 'each', 'into', 'been', 'an', 'us', 'whereafter', 'to', 'in', 'nor', '‘ll', 'so', "'re", 'down', 'six', 'toward', 'five', 'doing', 'out', 'herein', 'thereupon', 'whole', 'anything', 'can', 'because', 'over', 'however', 'seem', 'serious', 'go', 'am', 'then', 'myself', 'within', 'four', 'his', 'nobody', 'sometime', 'yet', 'front', 'become', 'himself', 'wherever', 'upon', 'nothing', 'few', 'hundred', 'move', '‘m', 'what', 'as', 'below', 'elsewhere', 'mostly', 'anywhere', 'up', 'that', 'amongst', 'this', 'around', 'she', 'always', 'thereafter', 'nine', 'ca', 'already', 'herself', 'some', 'much', 'if', 'two', 'these', 'had', 'ten', 'whatever', 'also', 'through', 'thus', 'yourselves', 'see', 'he', 'throughout', 'for', 'moreover', '’m', 'seemed', 'again', 'might', 'all', 'on', 'almost', 'have', 'less', 'fifty', 'eight', 'could', 'used', 'thereby', 'perhaps', 'above', 'whereas', 'and', 'about', 'although', 'still', 'mine', 'from', 'than', 'rather', 'once', 'third', 'call', 'alone', 'did', 'more', 'thru', 'whoever', 'where', 'hereupon', 'other', 'own', 'no'
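
To actually strip these stop words out of a piece of text, a minimal sketch (assuming the same `en_core_web_sm` model and an example sentence of my own) can check each token’s `is_stop` flag:

import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("This is an example sentence for removing the stop words from a text.")
# Keep only the tokens spaCy does not mark as stop words or punctuation.
filtered = [token.text for token in doc if not token.is_stop and not token.is_punct]
print(filtered)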

List of all 179 Default Stopwords in NLTK

[Figure: NLTK stopwords word cloud]

There are 179 stop words in NLTK. To get all the default stopwords from NLTK, we install the library and download the `stopwords` submodule. Once we do that, we can see all the stopwords with a simple command.

pip install nltk
python
>>> import nltk
>>> nltk.download("stopwords")
>>> from nltk.corpus import stopwords
>>> print(stopwords.words('english'))
'i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', 'too', 'very', 's', 't', 'can', 'will', 'just', 'don', "don't", 'should', "should've", 'now', 'd', 'll', 'm', 'o', 're', 've', 'y', 'ain', 'aren', "aren't", 'couldn', "couldn't", 'didn', "didn't", 'doesn', "doesn't", 'hadn', "hadn't", 'hasn', "hasn't", 'haven', "haven't", 'isn', "isn't", 'ma', 'mightn', "mightn't", 'mustn', "mustn't", 'needn', "needn't", 'shan', "shan't", 'shouldn', "shouldn't", 'wasn', "wasn't", 'weren', "weren't", 'won', "won't", 'wouldn', "wouldn't"
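
The equivalent sketch with NLTK filters tokens against the list above (this assumes the `punkt` tokenizer data has also been downloaded, since `word_tokenize` needs it):

import nltk
from nltk.corpus import stopwords

stop_words = set(stopwords.words("english"))
tokens = nltk.word_tokenize("This is an example sentence for removing the stop words from a text.")
# Keep the tokens that are not in NLTK's default English stop word list.
filtered = [token for token in tokens if token.lower() not in stop_words]
print(filtered)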

Stopwords Recap

In this post, we learned that stopwords are the most common words in a language that usually don’t provide much semantic value. Then we looked at why we remove stopwords. Some NLP tasks, such as sentiment analysis, should remove stop words. Other NLP tasks, such as AI summarization, shouldn’t remove stop words. Finally, we went over the default stopwords in spaCy and NLTK and how to get them.



Top 3 Ready-to-Use Python NLP Libraries for 2022

An estimated 80-90% of business data is unstructured text data. The businesses that win will be the ones that find a way to analyze their text data. How can we analyze text data? Natural Language Processing. NLP is one of the most important sectors of AI. It may be the fastest growing subfield of AI in the 2020s. In this post we’ll be going over three ready-to-use Python NLP libraries. For a more fundamental understanding of Natural Language Processing, read an Introduction to NLP: Core Concepts.

Ready-to-Use Python NLP Libraries

The state of the art in Natural Language Processing is to use neural networks. In particular, transformers are a popular model architecture. There are pros and cons to using transformer models, but we’re not going to focus on that now. There will always be architectural innovations. For this article, we’re going to focus on the top three ready-to-use NLP libraries. None of these libraries requires a deep, fundamental understanding of how NLP works, but each will allow you to leverage its power.

The top 3 ready-to-use NLP libraries are spaCy, NLTK, and Stanford’s CoreNLP library. Each of these libraries has its own specialty and reason for being in the top 3. The spaCy library provides industrial-strength NLP models. NLTK focuses on research-oriented NLP. Stanford’s CoreNLP library is a Java library that has since been adapted to multiple languages, including Python.

NLP with spaCy

The spaCy library is made and maintained by Explosion. It provides multiple models and support for 18 languages. We’re going to focus on the English language models. There are four English language models for web data: en_core_web_sm, en_core_web_md, en_core_web_lg, and en_core_web_trf. The first three are optimized for CPU performance while en_core_web_trf is a transformer-based model, not optimized for CPU performance. Let’s go over some of the basic NLP techniques you can do with spaCy.

To get started with spaCy, open up your terminal and run the following commands:

pip install spacy
python -m spacy download en_core_web_sm

Part of Speech Tagging

Part of speech (POS) tagging is a fundamental part of natural language processing. This is usually one of the first things in an NLP pipeline. There are many different parts of speech; to learn more, read this article on parts of speech. Here’s how we can do POS tagging with spaCy.

First, we import spacy. Then we load up the model we downloaded earlier, in this case en_core_web_sm. The text that we’re running POS tagging on is taken from How Many Solar Farms Does it Take to Power America? All we do is run the text through our NLP pipeline. Then to see the parts of speech, we loop through the tokenized document and check the part of speech and tag.

import spacy
 
nlp = spacy.load("en_core_web_sm")
 
text = """This is where the calculation can get tricky. Here’s the thing about solar energy. Solar energy comes from the sun. That means solar panels cannot produce energy 24 hours a day. They only produce energy during sunlight hours. That energy then has to be stored somewhere while it is not being used. Energy storage is a whole other topic in and of itself. Let me get back to the point, there’s only an average of 4 peak sunlight hours a day. A solar panel may get more than that, but let’s take a conservative estimate of our solar power generation and confine it to those 4 hours only.
Back to the calculations. At 4 acres of solar panels to generate a megawatt-hour and 4 hours of power generation time a day, a 1 MW solar farm would generate 4 MWh of power over 4 acres every day. At 110,000 megawatt-hours of power needed a day to power America, we would need about 110,000 acres of solar farm. 110,000 acres? That sounds huge, that’s more land than the entire Mojave desert. It’s not as daunting as it sounds, there are 1.9 billion acres in the continental United States, and 110,000 acres is only slightly more than 0.5 percent of the total land of the continental US."""
 
doc = nlp(text)
 
for token in doc:
    print(token.text, token.pos_, token.tag_)

Named Entity Recognition

Named Entity Recognition (NER) is an NLP technique that has POS as a prerequisite. The types of entities that can be named and recognized include people, organizations, locations, and time. This isn’t a comprehensive list though. For a full list of the named entities that can be recognized read this article on the Best Way to do Named Entity Recognition.

To do NER in spaCy, we’ll start by importing spacy. Then we’ll load the model. The text that we’re using for this is a random thing that I made up. The same as above, we’ll tokenize the text by running it through the NLP model. Then we’ll loop through each entity in the document and print out the text and label. Notice that ents is a default property of the document after running it through the NLP pipeline.

import spacy
 
nlp = spacy.load("en_core_web_sm")
 
text = "Molly Moon is a cow. She is part of the United Nations' Climate Action Committee."
 
doc = nlp(text)
 
for ent in doc.ents:
    print(ent.text, ent.label_)

Lemmatization

Lemmatization is the process of finding the lemmas of each word. A lemma is the root of a word. To learn more about lemmatization, read this article on what lemmatization is and how you can use it.

As we did in the above two NLP techniques with spaCy, we’ll start by importing spacy and loading the model. You can use any text you want. For this example, I’m using a random set of text about spaCy, the NFL, and about how Yujian Tang is the best software content creator. As we did above, we simply run the text through an NLP model. Then we’ll loop through each token in our tokenized document and print the lemma out.

import spacy
 
nlp = spacy.load("en_core_web_sm")
 
text = "This is an example program for showing how lemmatization works in spaCy. Let's play ball!. The NFL has many football teams and players. Yujian Tang is the best software content creator. The Text API is the most comprehensive sentiment analysis API made to date."
 
doc = nlp(text)
 
for token in doc:
    print(token.lemma_)

NLP with Natural Language ToolKit (NLTK)

NLTK is a project led by Steven Bird and Liling Tan. Different parts of NLTK are maintained by different people all around the world. It’s an open source natural language project made for playing with computational linguistics in Python.

To get started with NLTK, we need to install the library as well as some of its submodules. We can do so with the commands below. Note that we actually only need the last three submodules for NER.

pip install nltk
python
>>> import nltk
>>> nltk.download("punkt")
>>> nltk.download("averaged_perceptron_tagger")
>>> nltk.download("maxent_ne_chunker")
>>> nltk.download("words")

Part of Speech Tagging

We’re going to use the same piece of text to demonstrate Part of Speech Tagging with NLTK as we did with spaCy. To do part of speech tagging with NLTK we’ll start by importing the nltk library. We have to run two commands to do part of speech tagging. First we tokenize the text, then we use the pos_tag command on the tokenized text. To see the tagged parts of speech, we just print them out. Click here for a complete list of part of speech tags.

import nltk
 
text = """This is where the calculation can get tricky. Here’s the thing about solar energy. Solar energy comes from the sun. That means solar panels cannot produce energy 24 hours a day. They only produce energy during sunlight hours. That energy then has to be stored somewhere while it is not being used. Energy storage is a whole other topic in and of itself. Let me get back to the point, there’s only an average of 4 peak sunlight hours a day. A solar panel may get more than that, but let’s take a conservative estimate of our solar power generation and confine it to those 4 hours only.
Back to the calculations. At 4 acres of solar panels to generate a megawatt-hour and 4 hours of power generation time a day, a 1 MW solar farm would generate 4 MWh of power over 4 acres every day. At 110,000 megawatt-hours of power needed a day to power America, we would need about 110,000 acres of solar farm. 110,000 acres? That sounds huge, that’s more land than the entire Mojave desert. It’s not as daunting as it sounds, there are 1.9 billion acres in the continental United States, and 110,000 acres is only slightly more than 0.5 percent of the total land of the continental US."""
 
tokenized = nltk.word_tokenize(text)
tagged = nltk.pos_tag(tokenized)
for tag in tagged:
    print(tag)

Named Entity Recognition

Named Entity Recognition with NLTK requires the most packages of the three simple NLP techniques listed here. Once again we’re going to use the same, somewhat nonsensical phrase as we did before. If you’re from Seattle, you’ll surely recognize Molly Moon. She is not a part of the UN’s Climate Action Committee.

To do NER with NLTK, we import our library, set up our text, and then call three functions on it. Just like above, we’ll start by tokenizing the string and then running part of speech tagging on it. After part of speech tagging, we’ll run the ne_chunk command, which stands for “named entity chunk”. To see the named entities tagged, we’ll loop through all the chunks, and if a chunk is labeled (recognized), we print it out.

import nltk
 
text = "Molly Moon is a cow. She is part of the United Nations' Climate Action Committee."
 
tokenized = nltk.word_tokenize(text)
pos_tagged = nltk.pos_tag(tokenized)
chunks = nltk.ne_chunk(pos_tagged)
for chunk in chunks:
    if hasattr(chunk, 'label'):
        print(chunk)

Lemmatization

Lemmatization in NLTK works slightly differently than the other two NLP techniques we’ve looked at in this post. Let’s start by importing the NLTK library, and then also import the WordNetLemmatizer class from the nltk.stem sub-library. Note that the WordNetLemmatizer relies on the wordnet corpus, which you can download the same way as the other submodules above with nltk.download("wordnet"). We’ll use the same text as above, a mix of random sentences about NLP, the NFL, Yujian Tang being the best software content creator, and The Text API.

We use the WordNetLemmatizer() as our lemmatizer. The first thing we’ll do is tokenize our text. Then we loop through the tokenized text and lemmatize each token with the lemmatizer.

import nltk
from nltk.stem import WordNetLemmatizer
 
lemmatizer = WordNetLemmatizer()
tokenized = nltk.word_tokenize("This is an example program for showing how lemmatization works in NLTK. Let's play ball!. The NFL has many football teams and players. Yujian Tang is the best software content creator. The Text API is the most comprehensive sentiment analysis API made to date.")
for t in tokenized:
    print(lemmatizer.lemmatize(t))

NLP with Stanford CoreNLP (Stanza in Python)

Stanford’s CoreNLP library is actually a Java Library. It has been adapted to be usable in Python in many different forms. The formally maintained library actually isn’t even called Stanford CoreNLP. It’s called “Stanza”. Curiously enough, NLTK actually has a way to interface with Stanford CoreNLP. To get started with stanza we simply install it and then download a model as shown below.

pip install stanza
python
>>> import stanza
>>> stanza.download("en")

Part of Speech Tagging

It’s worth mentioning here that, just as spaCy separates “parts of speech” and “tags”, CoreNLP separates upos, or universal part of speech, from xpos, or treebank-specific part of speech. Here we’re going to be looking at the upos.

We’ll start the same way we always do, by importing the library. Stanford’s stanza package explicitly builds a Pipeline instead of loading a single model (spaCy) or calling separate functions (NLTK). We’ll tell the pipeline that we want an en or English model, and we want to add tokenize, mwt (multi-word tokenizer), and pos (part of speech) to our pipeline. From here, we add the text, documentize the text with the pipeline, and print out all the universal parts of speech for each token in each sentence.

import stanza
 
nlp = stanza.Pipeline(lang='en', processors='tokenize,mwt,pos')
text = """This is where the calculation can get tricky. Here’s the thing about solar energy. Solar energy comes from the sun. That means solar panels cannot produce energy 24 hours a day. They only produce energy during sunlight hours. That energy then has to be stored somewhere while it is not being used. Energy storage is a whole other topic in and of itself. Let me get back to the point, there’s only an average of 4 peak sunlight hours a day. A solar panel may get more than that, but let’s take a conservative estimate of our solar power generation and confine it to those 4 hours only.
Back to the calculations. At 4 acres of solar panels to generate a megawatt-hour and 4 hours of power generation time a day, a 1 MW solar farm would generate 4 MWh of power over 4 acres every day. At 110,000 megawatt-hours of power needed a day to power America, we would need about 110,000 acres of solar farm. 110,000 acres? That sounds huge, that’s more land than the entire Mojave desert. It’s not as daunting as it sounds, there are 1.9 billion acres in the continental United States, and 110,000 acres is only slightly more than 0.5 percent of the total land of the continental US."""
doc = nlp(text)
print(*[f"word: {word.text}\tupos: {word.upos}" for sent in doc.sentences for word in sent.words], sep='\n')

Named Entity Recognition

Named Entity Recognition with stanza works in much the same way POS tagging does. We import the stanza library and create a pipeline. For this case we need the tokenize and ner processors. Once again, we use the same text, documentize it, and print out the entity type for each entity in each sentence.

import stanza
 
nlp = stanza.Pipeline(lang='en', processors="tokenize,ner")
 
text = "Molly Moon is a cow. She is part of the United Nations' Climate Action Committee."
doc = nlp(text)
 
print(*[f"entity: {ent.text}\ttype: {ent.type}" for sent in doc.sentences for ent in sent.ents], sep='\n')

Lemmatization

We start off by importing our library and setting up our pipeline as usual. For lemmatization, we’ll need the same pipeline elements as we did for POS tagging and also the lemma element. Our text will be the same text as the spaCy and NLTK ones. All we have to do is documentize the text to get the lemmas. To see them, we simply print out all the texts and lemmas for each word in each sentence.

import stanza
 
nlp = stanza.Pipeline(lang="en", processors="tokenize,mwt,pos,lemma")
text = "This is an example program for showing how lemmatization works in NLTK. Let's play ball!. The NFL has many football teams and players. Yujian Tang is the best software content creator. The Text API is the most comprehensive sentiment analysis API made to date."
doc = nlp(text)
print(*[f"word: {word.text}\t lemma: {word.lemma}" for sent in doc.sentences for word in sent.words], sep='\n')

Recap of the Top 3 Ready-To-Use Python NLP Libraries

In this post we went over the top 3 ready-to-use Python NLP libraries for 2022. Why are these the top 3? Because they’re actually maintained. There are a TON of NLP libraries for Python, but most of them have fallen into disuse. We went over how to do three of the most common and fundamental NLP techniques with each of these libraries. Which one of these libraries should you use? It depends on your use case. 

The spaCy library is targeted at industry Python users, the NLTK library is mainly for academic research around NLP and computational linguistics, and the Stanford CoreNLP library is compatible with multiple programming languages. Out of these three, I would say that the Stanford CoreNLP library is the most powerful and most complex. The NLTK library seems to be the most customizable. The spaCy library feels like the most simple to use while still being quite powerful.

Bonus: a language agnostic NLP Web API

Web APIs are also a popular choice for NLP. A great advantage of a web API is that you don’t have to host the model on your own computer. However, you also don’t get to customize the model. The most comprehensive web API to date is The Text API. Of the fundamental NLP techniques we covered above, the only one it provides is NER. The Text API provides more business-ready use cases such as AI summarization, finding the most common phrases, keyword sentence extraction, and more. For more information, read this guide on how to automatically analyze text documents.


What is Lemmatization and How can I do It?

Lemmatization is an important part of Natural Language Processing. Other NLP topics we’ve covered include Text Polarity, Named Entity Recognition, and Summarization. Lemmatization is the process of turning a word into its lemma. A lemma is the “canonical form” of a word; it’s usually the dictionary version of a word, picked by convention. Let’s look at some examples to make more sense of this.

The words “playing”, “played”, and “plays” all have the same lemma of the word “play”. The words “win”, “winning”, “won”, and “wins” all have the same lemma of the word “win”. Let’s take a look at one more example before we move on to how you can do lemmatization in Python. The words “programming”, “programs”, “programmed”, and “programmatic” all have the same lemma of the word “program”. Another way to think about it is to think of the lemma as the “root” of the word.

In this post we’ll cover:

  • How Can I Do Lemmatization with Python
    • Lemmatization with spaCy
    • Lemmatization with NLTK

How Can I Do Lemmatization with Python?

Python has many well known Natural Language Processing libraries, and we’re going to make use of two of them to do lemmatization. The first one we’ll look at is spaCy and the second one we’ll use is Natural Language Toolkit (NLTK).

Lemmatization with spaCy

This is pretty cool: we’re going to lemmatize our text in under 10 lines of code. To get started with spaCy we’ll install the spacy library and download a model. We can do this in the terminal with the following commands:

pip install spacy
python -m spacy download en_core_web_sm

To start off our program, we’ll import spacy and load the language model.

import spacy
 
nlp = spacy.load("en_core_web_sm")

Once we have the model, we’ll simply make up a text, turn it into a spaCy Doc, and that’s basically it. To get the lemma of each word, we’ll just print out the lemma_ attribute. Note that printing out the lemma attribute will get you a number corresponding to the lemma’s representation.

text = "This is an example program for showing how lemmatization works in spaCy. Let's play ball!. The NFL has many football teams and players. Yujian Tang is the best software content creator. The Text API is the most comprehensive sentiment analysis API made to date."
 
doc = nlp(text)
 
for token in doc:
    print(token.lemma_)

Our output should look like the following:

[Figure: Text lemmatization output from spaCy]

Sounds like a pirate!

Lemmatization with NLTK

Cool, lemmatization with spaCy wasn’t that hard, let’s check it out with NLTK. For NLTK, we’ll need to install the library and install the wordnet submodule before we can write the program. We can do that in the terminal with the below commands.

pip install nltk
python
>>> import nltk
>>> nltk.download('wordnet')
>>> exit()

Why are we downloading wordnet in an interactive shell instead of at the start of our program? We only need to download it once to be able to use it, so we don’t want to put it in a program we’ll be running multiple times. As always, we’ll start out our program by importing the libraries we need. In this case, we’re just going to be importing nltk and the WordNetLemmatizer class from nltk.stem.

import nltk
from nltk.stem import WordNetLemmatizer

First we’ll use word_tokenize from nltk to tokenize our text. Then we’ll loop through the tokenized text and use the lemmatizer to lemmatize each token and print it out.

lemmatizer = WordNetLemmatizer()
tokenized = nltk.word_tokenize("This is an example program for showing how lemmatization works in NLTK. Let's play ball!. The NFL has many football teams and players. Yujian Tang is the best software content creator. The Text API is the most comprehensive sentiment analysis API made to date.")
for t in tokenized:
    print(lemmatizer.lemmatize(t))

We’ll end up with something like the image below. 

[Figure: Text lemmatization results from NLTK]

As you can see, using NLTK returns a different lemmatization than using spaCy. It doesn’t seem to do lemmatization as well. NLTK and spaCy are made for different purposes, so I am usually impartial. However, spaCy definitely wins for built-in lemmatization. NLTK can be customized because it’s highly used for research purposes, but that’s out of scope for this article. Be on the lookout for an in-depth dive though!



Natural Language Processing: What is Text Polarity?

Natural Language Processing (NLP) and all of its applications will be huge in the 2020s. A lot of my blogging is about text processing and all the things that go with it, such as Named Entity Recognition and Part of Speech Tagging. Text polarity is a basic text processing technique that gives us insight into how positive or negative a text is. The polarity of a text is essentially its “sentiment” rating from -1 to 1.

Overview of Text Polarity

In this post we’ll cover:

  • What is Text Polarity?
  • How to Get Text Polarity with spaCy
  • How to Get Text Polarity with NLTK
  • How to Get Text Polarity with a web API
  • Why are these Text Polarity Numbers so Different?

What is Text Polarity?

In short, text polarity is a measure of how negative or how positive a piece of text is. Polarity is the measure of the overall combination of the positive and negative emotions in a sentence. It’s notoriously hard for computers to predict this; in fact, it’s even hard for people to predict this over text. Check out the following Key and Peele video for an example of what I mean.

Most of the time, NLP models can predict simply positive or negative words and phrases quite well. For example, the words “amazing”, “superb”, and “wonderful” can easily be labeled as highly positive. The words “bad”, “sad”, and “mad” can easily be labeled as negative. However, we can’t just look at polarity from the frame of individual words; it’s important to take a larger context into account when evaluating total polarity. For example, the word “bad” may be negative, but what about the phrase “not bad”? Is that neutral? Or is that the opposite of bad? At this point we’re getting into linguistics and semantics rather than natural language processing.
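
For a concrete illustration of why single words aren’t enough, here’s a quick sketch comparing scores from NLTK’s VADER analyzer (covered later in this post) for “bad” versus “not bad”. The exact numbers depend on the lexicon version, but the negation should visibly change the score rather than simply inheriting the negativity of “bad”.

from nltk.sentiment import SentimentIntensityAnalyzer

# Requires: pip install nltk, then nltk.download("vader_lexicon")
sia = SentimentIntensityAnalyzer()
print("bad:    ", sia.polarity_scores("This is bad."))
print("not bad:", sia.polarity_scores("This is not bad."))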

Due to the nature of language and how words around each other can modify their meaning and polarity, when I personally implemented text polarity for The Text API, I used a combination of total text polarity and the polarity of individual phrases in it. The two biggest open source libraries for NLP in Python are spaCy and NLTK, and both of these libraries measure polarity on a normalized scale of -1 to 1. The Text API measures, combines, and normalizes values on both the polarity of the overall text, individual sentences, and individual phrases. This returns a better picture of the relative polarities of texts by not penalizing longer sentences that are expressing positive or negative emotion at scale but also contain neutral phrases. Let’s take a look at how we can implement text polarity with the libraries and API I mentioned above!

How to Get Text Polarity with spaCy

To get started with spaCy we’ll need to download two spaCy libraries with pip in our terminal as shown below:

pip install spacy spacytextblob

We’ll also need to download a model. As usual we’ll download the `en_core_web_sm` model to get started. Run the below command in the terminal after the pip installs are finished:

python -m spacy download en_core_web_sm

Now that we’ve downloaded our libraries and model, let’s get started with our code. We’ll need to import `spacy` and `SpacyTextBlob` from `spacytextblob.spacytextblob`. Spacy Text Blob is the pipeline component that we’ll be using to get polarity. We’ll start our program by loading the model we downloaded earlier and then adding the `spacytextblob` pipe to the `nlp` pipeline. Notice that we never actually explicitly call the `SpacyTextBlob` module, but rather pass it in as a string to `nlp`. If you’re using VSCode, you’ll see the `SpacyTextBlob` is grayed out like it’s not being used, but don’t be fooled, we require this import in order to add the pipeline component even though we don’t call it directly.

Next we’ll choose a text to process. For this example, I simply wrote two decently positive sentences on The Text API, which we’ll show an example for later. Then all we have to do is send the text to a document via our `nlp` object and check its polarity score.

import spacy
from spacytextblob.spacytextblob import SpacyTextBlob
 
nlp = spacy.load("en_core_web_sm")
nlp.add_pipe('spacytextblob')
text = "The Text API is super easy to use and super useful for anyone who needs to do text processing. It's the best Text Processing web API and allows you to do amazing NLP without having to download or manage any models."
doc = nlp(text)
 
print(doc._.polarity)

Our spaCy model predicted our text’s polarity score at 0.5. It’s hard to really judge how “accurate” the polarity of something is, so we’ll go through the other two methods and I’ll comment on this later.

[Figure: Text polarity output from spaCy]

How to Get Text Polarity with NLTK

Now that we’ve covered how to get polarity via spaCy, let’s check out how to get polarity with the Natural Language Toolkit. As always, we’ll start out by installing the library and dependencies we’ll need.

pip install nltk

Once we install NLTK, we’ll fire up an interactive Python shell in the command line to download the NLTK modules that we need with the commands below.

python
>>> import nltk
>>> nltk.download(["averaged_perceptron_tagger", "punkt", "vader_lexicon"])

Averaged Perceptron Tagger handles part of speech tagging. It’s the best tagger in the NLTK library at the time of writing, so you’ll probably use it for something else as well as polarity. Punkt is the pretrained tokenizer NLTK uses to split text into sentences and words. I know what you’re thinking:

[Image: Vader]

But no, the VADER lexicon actually stands for “Valence Aware Dictionary and sEntiment Reasoner”. It provides the sentiment analysis tool we need. Once we have all these installed, it’s pretty simple to import and call it. We create a `SentimentIntensityAnalyzer`, pass our text to it, and have it score the text on polarity.

from nltk.sentiment import SentimentIntensityAnalyzer
 
sia = SentimentIntensityAnalyzer()
text = "The Text API is super easy to use and super useful for anyone who needs to do text processing. It's the best Text Processing web API and allows you to do amazing NLP without having to download or manage any models."
scores = sia.polarity_scores(text)
print(scores)

We should get a print out like the one below.

[Figure: NLTK text polarity output]

This result tells us that none of the text is negative, 61.8% is neutral, and 38.2% of it is positive. Compound is a normalized overall sentiment score; you can see how it’s calculated in the VADER package on GitHub. It’s computed separately from the negative, neutral, and positive proportions and represents a normalized polarity score for the text as a whole. So NLTK has calculated our sentence to be very positive.

How to Get Text Polarity with The Text API

Finally, let’s take a look at how to get a text polarity score from The Text API. A major advantage of using a web API like The Text API to do text processing is that you don’t need to download any machine learning libraries or maintain any models. All you need is the requests library, which you can install with the pip command below if you don’t have it already, and a free API key from The Text API website.

pip install requests

When you land on The Text API’s homepage you should scroll all the way down and you’ll see a button that you can click to sign up for your free API key. 

Once you log in, your API key will be right at the top of the page. Now that we’re all set up, let’s take a dive into the code. All we’re going to do is set up a request with headers that tell the server we’re sending a JSON request and pass the API key, a body with the text we want to analyze, and the URL endpoint we’re going to hit (in this case “https://app.thetextapi.com/text/text_polarity”), and then send a request and parse the response.

import requests
import json
from config import apikey
 
text = "The Text API is super easy to use and super useful for anyone who needs to do text processing. It's the best Text Processing web API and allows you to do amazing NLP without having to download or manage any models."
headers = {
    "Content-Type": "application/json",
    "apikey": apikey
}
body = {
    "text": text
}
url = "https://app.thetextapi.com/text/text_polarity"
 
response = requests.post(url, headers=headers, json=body)
polarity = json.loads(response.text)["text polarity"]
print(polarity)

Once we send off our request we’ll get a response that looks like the following:

[Figure: The Text API text polarity output]

The Text API scores my praise of The Text API at roughly 0.575 polarity; mapping the -1 to 1 scale onto 0 to 1 with (x + 1) / 2, that translates to roughly 79% AMAZING (if 1 is AMAZING).

Why Are These Polarities So Different?

Earlier I mentioned that we’d discuss the different polarity scores at the end so here we are. We used three different methods to get the polarity of the same document of text, so why were our polarity scores so different? The obvious answer is that each method used a) a different model and b) a different way to calculate document polarity. However, there’s also another underlying factor at play here.

Remember that Key and Peele video earlier? It’s hard for people to even understand the polarity of comments, even with context. Remember that machines don’t have the ability to understand context yet. Also, a range of -1 to 1 is hard to interpret without examples of what a polarity of 1 or a polarity of -1 actually looks like. However, all three methods at least agree that the text is quite positive in general. Of course there are ways to improve the interpretability of these results, but that will be in a coming post!



The Best Way to do Named Entity Recognition (NER)

Named Entity Recognition (NER) is a common Natural Language Processing technique. It’s so often used that it comes in the basic pipeline for spaCy. NER can help us quickly parse out a document for all the named entities of many different types. For example, if we’re reading an article, we can use named entity recognition to immediately get an idea of the who/what/when/where of the article.

In this post we’re going to cover three different ways you can implement NER in Python. We’ll be going over:

  • What is Named Entity Recognition?
  • spaCy Named Entity Recognition (NER)
  • Named Entity Recognition with NLTK
  • A Simpler and More Accurate NER Implementation

What is Named Entity Recognition?

Named Entity Recognition, or NER for short, is the Natural Language Processing (NLP) topic about recognizing entities in a text document or speech file. Of course, this is quite a circular definition. In order to understand what NER really is, we’ll have to define what an entity is. For the purposes of NLP, an entity is essentially a noun that defines an individual, group of individuals, or a recognizable object. While there is not a TOTAL consensus on what kinds of entities there are, I’ve compiled a rather complete list of the possible types of entities that popular NLP libraries such as spaCy or Natural Language Toolkit (NLTK) can recognize. You can find the GitHub repo here.

List of Common Named Entities

Entity Type: Description of the NER object
PERSON: A person – usually recognized as a first and last name
NORP: Nationalities or Religious/Political Groups
FAC: The name of a Facility
ORG: The name of an Organization
GPE: The name of a Geopolitical Entity
LOC: A location
PRODUCT: The name of a product
EVENT: The name of an event
WORK OF ART: The name of a work of art
LAW: A law that has been published (US only as far as I know)
LANGUAGE: The name of a language
DATE: A date – it doesn’t have to be an exact date, it could be a relative date like “a day ago”
TIME: A time – like a date, it doesn’t have to be exact, it could be something like “middle of the day”
PERCENT: A percentage
MONEY: An amount of money, like “$100”
QUANTITY: Measurements of weight or distance
CARDINAL: A number, similar to quantity but not a measurement
ORDINAL: A number, but signifying a relative position such as “first” or “second”

How Can I Implement NER in Python?

Earlier, I mentioned that you can implement NER with both spaCy and NLTK. The difference between these libraries is that NLTK is built for academic/research purposes and spaCy is built for production purposes. Both are free to use open source libraries. NER is extremely easy to implement with these open source libraries. In this article I will show you how to get started implementing your own Named Entity Recognition programs.

spaCy Named Entity Recognition (NER)

We’ll start with spaCy. To get started, run the commands below in your terminal to install the library and download a starter model.

pip install spacy
python -m spacy download en_core_web_sm

We can implement NER in spaCy in just a few lines of code. All we need to do is import the spacy library, load a model, give it some text to process, and then call the processed document to get our named entities. For this example we’ll be using the “en_core_web_sm” model we downloaded earlier; this is the “small” model trained on web text. The text we’ll use is just some random sentence I made up. We should expect the NER to identify Molly Moon as a Person (NER isn’t advanced enough to detect that she is a cow), to identify the United Nations as an organization, and the Climate Action Committee as a second organization.

import spacy
 
nlp = spacy.load("en_core_web_sm")
 
text = "Molly Moon is a cow. She is part of the United Nations Climate Action Committee."
 
doc = nlp(text)
 
for ent in doc.ents:
    print(ent.text, ent.label_)

After we run this we should see a result like the one below. We see that this spaCy model is unable to separate the United Nations and its Climate Action Committee as separate orgs.

[Figure: Named entity recognition results from spaCy]

Named Entity Recognition with NLTK

Let’s take a look at how to implement NER with NLTK. As with spaCy, we’ll start by installing the NLTK library and also downloading the extensions we need.

pip install nltk

After we run our initial pip install, we’ll need to download four extensions to get our Named Entity Recognition program running. I recommend simply firing up Python in your terminal and running these commands as the libraries only need to be downloaded once to work, so including them in your NER program will only slow it down.

python
>>> import nltk
>>> nltk.download("punkt")
>>> nltk.download("averaged_perceptron_tagger")
>>> nltk.download("maxent_ne_chunker")
>>> nltk.download("words")

Punkt is a tokenizer package that recognizes punctuation. Averaged Perceptron Tagger is the default part of speech tagger for NLTK. Maxent NE Chunker is the Named Entity Chunker for NLTK. The Words library is an NLTK corpus of words. We can already see here that NLTK is far more customizable, and consequently also more complex to set up. Let’s dive into the program to see how we can extract our named entities.

Once again we simply start by importing our library and declaring our text. Then we’ll tokenize the text, tag the parts of speech, and chunk it using the named entity chunker. Finally, we’ll loop through our chunks and display the ones that are labeled.

import nltk
 
text = "Molly Moon is a cow. She is part of the United Nations' Climate Action Committee."
 
tokenized = nltk.word_tokenize(text)
pos_tagged = nltk.pos_tag(tokenized)
chunks = nltk.ne_chunk(pos_tagged)
for chunk in chunks:
    if hasattr(chunk, 'label'):
        print(chunk)

When you run this program in your terminal you should see an output like the one below.

[Figure: Named entity recognition results from NLTK]

Notice that NLTK has identified “Climate Action Committee” as a Person and Moon as a Person. That’s clearly incorrect, but this is all on pre-trained data. Also, this time I let it print out the entire chunk, and it shows the parts of speech. NLTK has tagged all of these as “NNP”, which signals a proper noun.

A Simpler and More Accurate NER Implementation

Alright, now that we’ve discussed how to implement NER with open source libraries, let’s take a look at how we can do it without ever having to download extra packages and machine learning models! We can simply ping a web API that already has a pre-trained model and pipeline for tons of text processing needs. We’ll be using the open beta of The Text API; scroll down to the bottom of the page and get your API key.

The only library we need to install is the requests library, and we only need to be able to send an API request as outlined in How to Send a Web API Request. So, let’s take a look at the code.

All we need is to construct a request to send to the endpoint, send the request, and parse the response. The API key should be passed in the headers as “apikey”, and we should also specify that the content type is JSON. The body simply needs to pass the text in. The endpoint that we’ll hit is “https://app.thetextapi.com/text/ner”. Once we get our response back, we’ll use the json library (native to Python) to parse it.

import requests
import json
from config import apikey
 
text = "Molly Moon is a cow. She is part of the United Nations' Climate Action Committee."
headers = {
    "Content-Type": "application/json",
    "apikey": apikey
}
body = {
    "text": text
}
url = "https://app.thetextapi.com/text/ner"
 
response = requests.post(url, headers=headers, json=body)
ner = json.loads(response.text)["ner"]
print(ner)

Once we send this request, we should see an output like the one below.

[Figure: Named entity recognition results from The Text API]

Woah! Our API actually recognizes all three of the named entities successfully! Not only is using The Text API simpler than downloading multiple models and libraries, but in this use case, we can see that it’s also more accurate.



Natural Language Processing: Part of Speech Tagging

Part of Speech (POS) Tagging is an integral part of Natural Language Processing (NLP). The first step in most state of the art NLP pipelines is tokenization. Tokenization is the separating of text into “tokens”. Tokens are generally regarded as individual pieces of language – words, whitespace, and punctuation.

Once we tokenize our text we can tag it with the part of speech; note that this article only covers the details of part of speech tagging for English. Part of speech tagging is done on all tokens except for whitespace. We’ll take a look at how to do POS tagging with the two most popular and easy to use NLP Python libraries – spaCy and NLTK – coincidentally also my favorite two NLP libraries to play with.
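
As a quick illustration of what tokenization produces before any tagging happens, here’s a minimal sketch with spaCy (it assumes the small English model, en_core_web_sm, has been downloaded; any short sentence works):

import spacy

nlp = spacy.load("en_core_web_sm")
# Tokenization splits the text into words and punctuation marks.
doc = nlp("Part of speech tagging isn't hard, right?")
print([token.text for token in doc])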

What is Part of Speech (POS) Tagging?

Traditionally, there are nine parts of speech taught in English literature – nouns, verbs, adjectives, determiners, adverbs, pronouns, prepositions, conjunctions, and interjections. We’ll see below that, for NLP purposes, we’ll actually be using way more than nine tags. The spaCy library tags 19 different parts of speech and over 50 “tags” (depending on how you count different punctuation marks).

In spaCy, “tags” are finer-grained parts of speech. NLTK’s part of speech tagger uses 34 tags, which are closer to spaCy’s fine-grained tags than to spaCy’s coarse parts of speech. We’ll take a look at the part of speech labels from both libraries, and then at spaCy’s fine-grained tags. You can find the GitHub repo that contains the code for POS tagging here.

In this post, we’ll go over:

  • List of spaCy automatic parts of speech (POS)
  • List of NLTK parts of speech (POS)
  • Fine-grained Part of Speech (POS) tags in spaCy
  • spaCy POS Tagging Example
  • NLTK POS Tagging Example

List of spaCy parts of speech (automatic):

  • ADJ: Adjective – big, purple, creamy
  • ADP: Adposition – in, to, during
  • ADV: Adverb – very, really, there
  • AUX: Auxiliary – is, has, will
  • CONJ: Conjunction – and, or, but
  • CCONJ: Coordinating conjunction – either…or, neither…nor, not only
  • DET: Determiner – a, an, the
  • INTJ: Interjection – psst, oops, oof
  • NOUN: Noun – cat, dog, frog
  • NUM: Numeral – 1, one, 20
  • PART: Particle – ‘s, ‘nt, ‘d
  • PRON: Pronoun – he, she, me
  • PROPN: Proper noun – Yujian Tang, Michael Jordan, Andrew Ng
  • PUNCT: Punctuation – commas, periods, semicolons
  • SCONJ: Subordinating conjunction – if, while, but
  • SYM: Symbol – $, %, ^
  • VERB: Verb – sleep, eat, run
  • X: Other – asdf, xyz, abc
  • SPACE: Space – space lol
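
If you ever need one of these descriptions programmatically, spaCy ships a small helper, spacy.explain, that returns a short description for any of its POS labels (and for the fine-grained tags we’ll see below). A quick sketch:

import spacy
 
# spacy.explain works on coarse POS labels and fine-grained tags alike
for label in ["ADJ", "PROPN", "SCONJ"]:
    print(label, "-", spacy.explain(label))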

List of NLTK parts of speech:

  • CC: Coordinating Conjunction – either…or, neither…nor, not only
  • CD: Cardinal Digit – 1, 2, twelve
  • DT: Determiner – a, an, the
  • EX: Existential There – “there” used for introducing a topic
  • FW: Foreign Word – bonjour, ciao, 你好
  • IN: Preposition/Subordinating Conjunction – in, at, on
  • JJ: Adjective – big
  • JJR: Comparative Adjective – bigger
  • JJS: Superlative Adjective – biggest
  • LS: List Marker – first, A., 1), etc.
  • MD: Modal – can, cannot, may
  • NN: Singular Noun – student, learner, enthusiast
  • NNS: Plural Noun – students, programmers, geniuses
  • NNP: Singular Proper Noun – Yujian Tang, Tom Brady, Fei Fei Li
  • NNPS: Plural Proper Noun – Americans, Democrats, Presidents
  • PDT: Predeterminer – all, both, many
  • POS: Possessive Ending – ‘s
  • PRP: Personal Pronoun – her, him, yourself
  • PRP$: Possessive Pronoun – her, his, mine
  • RB: Adverb – occasionally, technologically, magically
  • RBR: Comparative Adverb – further, higher, better
  • RBS: Superlative Adverb – best, biggest, highest
  • RP: Particle – aboard, into, upon
  • TO: Infinitive Marker – “to” when it is used as an infinitive marker or preposition
  • UH: Interjection – uh, wow, jinkies!
  • VB: Verb – ask, assemble, brush
  • VBG: Verb Gerund – stirring, showing, displaying
  • VBD: Verb Past Tense – dipped, diced, wrote
  • VBN: Verb Past Participle – condensed, refactored, unsettled
  • VBP: Verb Present Tense, not 3rd person singular – predominate, wrap, resort
  • VBZ: Verb Present Tense, 3rd person singular – bases, reconstructs, emerges
  • WDT: Wh-determiner – that, what, which
  • WP: Wh-pronoun – that, what, whatever
  • WRB: Wh-adverb – how, however, wherever
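
NLTK can print these definitions for you as well. Here’s a minimal sketch using the built-in nltk.help.upenn_tagset helper; it assumes you’ve downloaded the “tagsets” resource.

import nltk
 
# one-time download of the tag set documentation
nltk.download('tagsets')
 
# print the definition and examples for a single Penn Treebank tag
nltk.help.upenn_tagset('RBR')
nltk.help.upenn_tagset('VBZ')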

We can see that NLTK and spaCy tag parts of speech differently; there are many ways to tag parts of speech, and the way NLTK splits them up is geared toward academic use. Above, I’ve only shown spaCy’s coarse POS labels, but spaCy has fine-grained part of speech tagging as well – it calls these “tags” instead of “parts of speech”. I’ll break down how spaCy’s parts of speech map to its tags below.

List of spaCy Part of Speech Tags (Fine grained)

ADJ
  • AFX – affix: “pre-”
  • JJ – adjective: good
  • JJR – comparative adjective: better
  • JJS – superlative adjective: best
  • PDT – predeterminer: half
  • PRP$ – possessive pronoun: his, her
  • WDT – wh-determiner: which
  • WP$ – possessive wh-pronoun: whose

ADP
  • IN – subordinating conjunction or preposition: “in”

ADV
  • EX – existential there: there
  • RB – adverb: quickly
  • RBR – comparative adverb: quicker
  • RBS – superlative adverb: quickest
  • WRB – wh-adverb: when

CONJ
  • CC – coordinating conjunction: and

DET
  • DT – determiner: this, a, an

INTJ
  • UH – interjection: uh, uhm, ruh-roh!

NOUN
  • NN – noun: sentence
  • NNS – plural noun: sentences
  • WP – wh-pronoun: who

NUM
  • CD – cardinal number: three, 5, twelve

PART
  • POS – possessive ending: ‘s
  • RP – particle adverb: back (put it “back”)
  • TO – infinitive to: “to”

PRON
  • PRP – personal pronoun: I, you

PROPN
  • NNP – proper singular noun: Yujian Tang
  • NNPS – proper plural noun: Pythonistas

PUNCT
  • -LRB- – left round bracket: “(“
  • -RRB- – right round bracket: “)”
  • (actual punctuation marks): , : ; . “ ‘ (etc.)
  • HYPH – hyphen
  • LS – list item marker: a., A), iii.
  • NFP – superfluous punctuation

SYM (like punctuation, these are pretty self explanatory)
  • #
  • $
  • SYM – symbol

VERB
  • BES – auxiliary “be”
  • HVS – “have”: ‘ve
  • MD – auxiliary modal: could
  • VB – base form verb: go
  • VBD – past tense verb: was
  • VBG – gerund: going
  • VBN – past participle verb: lost
  • VBP – non 3rd person singular present verb: want
  • VBZ – 3rd person singular present verb: wants

X
  • ADD – email
  • FW – foreign word
  • GW – additional word
  • XX – unknown
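
To see this mapping on real text, you can print each token’s coarse part of speech next to its fine-grained tag, with spacy.explain translating the tag. A minimal sketch (the sentence is just a placeholder):

import spacy
 
nlp = spacy.load("en_core_web_sm")
doc = nlp("Yujian Tang quickly wrote three better sentences.")
 
# token.pos_ is the coarse part of speech, token.tag_ is the fine-grained tag
for token in doc:
    print(token.text, token.pos_, token.tag_, spacy.explain(token.tag_))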

How do I Implement POS Tagging?

Part of Speech Tagging is a cornerstone of Natural Language Processing. It is one of the most basic parts of NLP, and as a result it comes standard in any respectable NLP library. Below, I’m going to cover how you can do POS tagging in just a few lines of code with spaCy and NLTK.

spaCy POS Tagging

We’ll start by implementing part of speech tagging in spaCy. The first thing we’ll need to do is install spaCy and download a model.

pip install spacy
python -m spacy download en_core_web_sm

Once we have our required libraries downloaded, we can start. Like I said above, POS tagging is one of the cornerstones of natural language processing; it’s so important that the spaCy pipeline does it automatically upon tokenization. For this example, I’m using a large piece of text; this text about solar energy comes from How Many Solar Farms Does it Take to Power America?

First we import spaCy, then we load our NLP model, then we feed the NLP model our text to create our NLP document. After creating the document, we can simply loop through it and print out the different parts of the tokens. For this example, we’ll print out the token text, the token part of speech, and the token tag.

import spacy
 
nlp = spacy.load("en_core_web_sm")
 
text = """This is where the calculation can get tricky. Here’s the thing about solar energy. Solar energy comes from the sun. That means solar panels cannot produce energy 24 hours a day. They only produce energy during sunlight hours. That energy then has to be stored somewhere while it is not being used. Energy storage is a whole other topic in and of itself. Let me get back to the point, there’s only an average of 4 peak sunlight hours a day. A solar panel may get more than that, but let’s take a conservative estimate of our solar power generation and confine it to those 4 hours only.
Back to the calculations. At 4 acres of solar panels to generate a megawatt-hour and 4 hours of power generation time a day, a 1 MW solar farm would generate 4 MWh of power over 4 acres every day. At 110,000 megawatt-hours of power needed a day to power America, we would need about 110,000 acres of solar farm. 110,000 acres? That sounds huge, that’s more land than the entire Mojave desert. It’s not as daunting as it sounds, there are 1.9 billion acres in the continental United States, and 110,000 acres is only slightly more than 0.5 percent of the total land of the continental US."""
 
doc = nlp(text)
 
for token in doc:
    print(token.text, token.pos_, token.tag_)

Once you run this you should see an output like the one pictured below.

Part of Speech Tagging Results – spaCy

NLTK POS Tagging

Now let’s take a look at how to do POS tagging with the Natural Language Toolkit. We’ll get started the same way we got started with spaCy: by downloading the library and the models we’ll need. We’re going to need to install NLTK and download both the NLTK “punkt” tokenizer model and the “averaged_perceptron_tagger” model that the POS tagger relies on.

pip install nltk
python
>>> import nltk
>>> nltk.download('punkt')
>>> nltk.download('averaged_perceptron_tagger')

Once we have our libraries downloaded, we can fire up our favorite Python editor and get started. Like with spaCy, there are only a few steps we need to follow to start tagging parts of speech with the NLTK library. First, we tokenize our text. Then, we simply call the NLTK part of speech tagger on the tokenized text and voila! We’re done. I’ve used the exact same text as above.

import nltk
from nltk.tokenize import word_tokenize
 
text = """This is where the calculation can get tricky. Here’s the thing about solar energy. Solar energy comes from the sun. That means solar panels cannot produce energy 24 hours a day. They only produce energy during sunlight hours. That energy then has to be stored somewhere while it is not being used. Energy storage is a whole other topic in and of itself. Let me get back to the point, there’s only an average of 4 peak sunlight hours a day. A solar panel may get more than that, but let’s take a conservative estimate of our solar power generation and confine it to those 4 hours only.
Back to the calculations. At 4 acres of solar panels to generate a megawatt-hour and 4 hours of power generation time a day, a 1 MW solar farm would generate 4 MWh of power over 4 acres every day. At 110,000 megawatt-hours of power needed a day to power America, we would need about 110,000 acres of solar farm. 110,000 acres? That sounds huge, that’s more land than the entire Mojave desert. It’s not as daunting as it sounds, there are 1.9 billion acres in the continental United States, and 110,000 acres is only slightly more than 0.5 percent of the total land of the continental US."""
 
tokenized = word_tokenize(text)
tagged = nltk.pos_tag(tokenized)
for tag in tagged:
    print(tag)

Once we’re done, we simply run this in a terminal and we should see an output like the following.

Parts of Speech Tagging Results – NLTK

You can compare the two outputs and see that NLTK and spaCy produce pretty much the same tagging at the fine-grained tag level.
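
If you’d like to check that for yourself, here’s a minimal sketch that runs both taggers over the same sentence and prints the results side by side. It assumes the spaCy model and NLTK resources from the examples above are already downloaded.

import nltk
import spacy
from nltk.tokenize import word_tokenize
 
sentence = "Solar panels only produce energy during sunlight hours."
 
# spaCy: the fine-grained tag lives on token.tag_
nlp = spacy.load("en_core_web_sm")
spacy_tags = [(token.text, token.tag_) for token in nlp(sentence)]
 
# NLTK: pos_tag returns (word, tag) pairs from the same Penn Treebank tag set
nltk_tags = nltk.pos_tag(word_tokenize(sentence))
 
for spacy_pair, nltk_pair in zip(spacy_tags, nltk_tags):
    print(spacy_pair, nltk_pair)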
