
Text Sentiment Analysis and How to Do it

Sentiment analysis is an example of applied Natural Language Processing (NLP). In this context, “sentiment” is almost interchangeable with text polarity. Text polarity is a measure from -1 to 1 of the sentiment of the text. The dictionary definition of sentiment is actually “one’s view or attitude towards something”, so this could include emotions from sadness to happiness to surprise. While it is possible to predict emotion, this article is going to focus on how positive or negative a text is. We’ll cover emotion in an article on emotion detection and how to do it.

In this article we’ll cover:

  • What is Text Sentiment
  • Text Sentiment vs Text Polarity vs Sentiment Analysis
  • How to use AI to get Text Sentiment
    • AI Text Sentiment with spaCy
    • Sentiment Analysis with NLTK
    • How to get the sentiment of a text with a web API
  • Applications of Text Sentiment Analysis
    • COVID headlines
  • Summary of How to do Sentiment Analysis with AI

What is Text Sentiment?

Let’s first take a look at what text sentiment is. Text sentiment is the overall attitude conveyed by a text document. We’ll use text sentiment to measure polarity on a scale from -1 to 1. For our purposes, sentiment will measure whether a text document is generally positive or negative. A naive measure of text sentiment simply takes an average of the sentiment of each word.

We will measure the total sentiment of a text as a weighted combination of the sentiment of different words, phrases, and sentences. You are free to decide how you’d like to weight each word, phrase, or sentence. In our implementation examples, we’ll get automatic sentiment scores from spaCy and NLTK that you can extrapolate and adjust. The Text API uses a proprietary mix of sentiments from words, phrases, and sentences.
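
To make the naive word-average idea concrete, here’s a tiny sketch with a hand-made word score dictionary. The words, scores, and function name are made up purely for illustration, not taken from any library.

# toy word scores, made up purely for illustration
word_scores = {"great": 0.8, "easy": 0.5, "bad": -0.7, "terrible": -0.9}

def naive_sentiment(text):
    words = text.lower().split()
    # unknown words count as neutral (0.0); average over all words
    scores = [word_scores.get(word, 0.0) for word in words]
    return sum(scores) / len(scores) if scores else 0.0

print(naive_sentiment("The Text API is great and easy to use"))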

Text Sentiment vs Text Polarity vs Sentiment Analysis

Before we get into some implementation examples, let’s get a clearer picture of sentiment. There are three phrases that are used pretty much interchangeably in the NLP space by most people. Text sentiment, text polarity, and sentiment analysis are usually only distinguished for specific use cases or when speaking with NLP experts. Let’s get the definitions.

  1. Text sentiment – the overall view of a text including positivity, outlook, and emotion
  2. Text polarity – a measure from -1 to 1 of how negative or positive a text is
  3. Sentiment analysis – the process of determining the sentiment of a text document

In this article, we are discussing how to use sentiment analysis to determine the polarity of a text.

How Can I Use AI to Get the Sentiment of a Text?

Natural Language Processing is a subfield of Artificial Intelligence. Polarity scoring is a common step in many NLP pipelines. In this post, we’ll cover how to use two of the biggest Python NLP libraries and a web API to get text sentiment. First we’ll do text sentiment with spaCy, then NLTK, and finally The Text API.

AI Text Sentiment with spaCy

To get the sentiment of a text with spaCy we’ll need to install two libraries and download a model. We can do that by using the lines below in the terminal.

pip install spacy spacytextblob
python -m spacy download en_core_web_sm

We’ll begin our program the same way we always do, by handling the imports. We’ll import the spacy library and the SpacyTextBlob class from the spacytextblob package. Next, we’ll load up the model and add spacytextblob to the NLP pipeline. We can use any text; for this example, we’ll just use a text description of The Text API. Then, we’ll create a document from the text using the NLP model. Finally, we’ll print out the overall polarity of the text from the model.

import spacy
from spacytextblob.spacytextblob import SpacyTextBlob
 
nlp = spacy.load("en_core_web_sm")
nlp.add_pipe('spacytextblob')
text = "The Text API is super easy to use and super useful for anyone who needs to do text processing. It's the best Text Processing web API and allows you to do amazing NLP without having to download or manage any models."
doc = nlp(text)
 
print(doc._.polarity)
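
Depending on your version of spacytextblob, the same extension mechanism may also expose a subjectivity score (0 is objective, 1 is subjective), which you can print the same way:

# subjectivity ranges from 0 (objective) to 1 (subjective), if your spacytextblob version exposes it
print(doc._.subjectivity)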

Sentiment Analysis with NLTK

To follow this example using the NLTK library, we’ll have to install the NLTK library and download three of its packages. We can do this with the lines below in the terminal.

pip install nltk
python
>>> import nltk
>>> nltk.download(["averaged_perceptron_tagger", "punkt", "vader_lexicon"])

As always, we’ll start off our program with imports. We’ll need to import the SentimentIntensityAnalyzer class from the nltk.sentiment module. Then we’ll initialize an object of the SentimentIntensityAnalyzer class. We’ll use the same text here as we did for the spaCy model. Next, we’ll get the polarity_scores of the text from the SentimentIntensityAnalyzer object and print out the scores.

from nltk.sentiment import SentimentIntensityAnalyzer
 
sia = SentimentIntensityAnalyzer()
text = "The Text API is super easy to use and super useful for anyone who needs to do text processing. It's the best Text Processing web API and allows you to do amazing NLP without having to download or manage any models."
scores = sia.polarity_scores(text)
print(scores)
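
The scores dictionary that VADER returns contains neg, neu, and pos proportions plus a compound score normalized to the -1 to 1 range; the compound score is the closest analog to the polarity we got from spaCy:

# compound is VADER's overall polarity, from -1 (most negative) to 1 (most positive)
print(scores["compound"])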

How to Get the Sentiment of a Text with an NLP API

For this example, we’ll need to install the requests library and get a free API key from The Text API. You can download the library with the line below in the terminal.

pip install requests

As always, we will start our program with the imports. We need the requests library to send requests and the json library to parse the response. I also imported the API key from my config file, but you can import it from wherever you saved it or define it in this file. We’ll use the exact same text as we did with spaCy and NLTK.

We need to create some headers to send with the request. The headers will tell the server that we’re sending JSON content and pass the API key. The body will simply pass the text object. We also need to know the URL of the API endpoint. All we need to do is send a POST request and parse the response into a JSON object. The polarity will be the “text polarity” key of the returned object.

import requests
import json
from config import apikey
 
text = "The Text API is super easy to use and super useful for anyone who needs to do text processing. It's the best Text Processing web API and allows you to do amazing NLP without having to download or manage any models."
headers = {
    "Content-Type": "application/json",
    "apikey": apikey
}
body = {
    "text": text
}
url = "https://app.thetextapi.com/text/text_polarity"
 
response = requests.post(url, headers=headers, json=body)
polarity = json.loads(response.text)["text polarity"]
print(polarity)
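
In practice, it’s worth checking that the request succeeded before parsing the response. Here’s a small sketch of the same call with a status check added:

response = requests.post(url, headers=headers, json=body)
if response.ok:
    polarity = json.loads(response.text)["text polarity"]
    print(polarity)
else:
    print(f"Request failed with status code {response.status_code}")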

Applications of Text Sentiment Analysis

Sentiment analysis for text can be applied in many ways. We can use it to get an idea of what people are really saying in reviews, how customers feel about our product, or even how employees feel about the company. We can also use it to analyze the news and see how positive or negative it is. In this section we’ll show an example of using text sentiment analysis to analyze COVID headlines over time.

Text Sentiment Polarity of COVID Headlines

One application of text sentiment analysis is analyzing the news. Since we’re about two years into the COVID pandemic, analyzing COVID headlines could be interesting. I decided to do an analysis of the NY Times’ headlines about COVID over the last two years. What did I learn? That they were much more negative about COVID in the first year than they have been this year.

Text Polarity of COVID Article Headlines so Far

For a full tutorial, see Using AI to Analyze COVID Headlines.

Summary of How to do Sentiment Analysis with AI

In this article we learned about text sentiment, sentiment analysis, and text polarity. We learned that these terms are mostly interchangeable but have nuanced differences. Then we saw how we can use AI to get the sentiment of a text. We saw how to implement it in three different ways, with spaCy, NLTK, and The Text API. Finally, we saw an example of how we can apply text sentiment analysis.



What AI Keyword Extraction Is and How to Do It

Keyword extraction is an example of applied Natural Language Processing (NLP). NLP is the subfield of AI concerned with analyzing, understanding, and generating language. Keyword extraction is one of the basic techniques in NLP. The first step to keyword extraction is tokenization. After tokenizing a text, it’s a simple step to look through the tokens for a keyword.

Even though keyword extraction is a relatively simple process, it plays a big role in NLP. Keyword extraction can be applied to multiple contexts from finding headlines, as we’ll see in the examples, to AI Content Moderation to finding relevant sentences in legal documents.

In this post we’ll go over:

  • What is Keyword Extraction?
  • How Can AI Keyword Extraction be Applied?
  • Implementing Keyword Extraction
    • Keyword Extraction for One Keyword with spaCy
    • Keyword Extraction for Multiple Keywords with The Text API
  • Applied Examples of AI Keyword Extraction
    • COVID Headlines
    • Obama Headlines
  • Summary of Keyword Extraction with AI

What is Keyword Extraction?

Let’s start by answering the obvious question before we dive into the details – what is keyword extraction? Keyword extraction is the process of finding each occurrence of one or many keywords in a text. Keyword extraction can be used to extract sentences, paragraphs, or sections containing a keyword. At a more basic level, it may also be used to simply find occurrences of a keyword in the text without extracting surrounding information.
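
Before we bring in any NLP tooling, the basic idea can be sketched in plain Python. This naive version (the function name is just for illustration) splits on periods instead of doing real sentence tokenization:

def naive_keyword_sentences(text, keyword):
    # split on periods as a rough stand-in for real sentence tokenization
    sentences = [s.strip() for s in text.split(".") if s.strip()]
    # keep only the sentences that contain the keyword, case-insensitively
    return [s for s in sentences if keyword.lower() in s.lower()]

print(naive_keyword_sentences("Green energy is growing. Coal is declining.", "energy"))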

How Can AI Keyword Extraction be Applied?

As we mentioned above, keyword extraction can be applied to many contexts. In this post we’ll go over two examples of AI keyword extraction applied to headlines. Another important application of AI keyword extraction is the legal field. Legal papers such as court documents, laws, bills, and other similar documentation often need to be searched, and these documents are usually tens or hundreds of pages long. Imagine going through all of that by hand.

Although the legal field has traditionally run on paper, it has begun digitizing. Digital documents can be searched much more efficiently using keyword extraction. Other than using AI keyword extraction to search legal documents more efficiently, it can also be used for reviews. For example, if I run a restaurant and I want to know the public’s opinion about my new dish, the “pepperoni pizza”, I can gather all my reviews and use an AI keyword extractor to get all the sentences about pepperoni pizza. From there, I can either read the sentences, or even just run them through a sentiment analyzer and get their polarity value if I want to know how the public feels about it.

Implementing Keyword Extraction with AI

As we said above, there are multiple things you can do with keyword extraction, from extracting sentences to paragraphs to sections. In this post, we’ll implement AI keyword extraction at the sentence level. First, we’ll go over extracting sentences for one keyword using an NLP library, spaCy. Then, we’ll go over extracting sentences for multiple keywords using a web API, The Text API.

To follow along with these implementations, you’ll have to install some libraries. For the first AI sentence extraction implementation with spaCy, you’ll need to install spaCy and download a model. You can do that with the lines below in the terminal:

pip install spacy
python -m spacy download en_core_web_sm

For the second example using a web API to do keyword extraction on multiple keywords we’ll need an API key from The Text API. We’ll also need to download the requests library. You can do that with the line in the terminal below.

pip install requests

Sentence Extraction for One Keyword (spaCy)

For our first example, we’re going to use spaCy to do keyword extraction for one keyword. As always, we’ll start with importing the libraries we need. We’ll need the spacy library, and we’ll also import the Matcher object from spacy.matcher. The Matcher isn’t totally necessary, but it makes things look nicer. After the imports, we’ll load the NLP model we downloaded earlier.

Next, we’ll create the function that will get all the sentences we’re looking for. This function will take three parameters: the language model, the text to search, and the keyword to search for. The first thing we’ll do here is create a document from the NLP language model and the passed in text. Then we’ll create a pattern to pass to the matcher. After creating the pattern, we’ll create a Matcher object from the model’s vocabulary and then add the pattern to the matcher. Note that we wrap the pattern in a list because the matcher’s add method expects a string and a list of patterns as its two positional parameters.

Next, we’ll create an empty list to hold the returned strings and loop through the sentences in the document. For each sentence, we’ll loop through the matches from our matcher and check whether the match corresponds to our keyword. If it does, we’ll add the sentence to our return values; otherwise, we’ll move on. Finally, we’ll pass the list of return values through a set before returning it, because we may get repeated sentences if the keyword appears more than once in a sentence.

import spacy
from spacy.matcher import Matcher
 
nlp = spacy.load("en_core_web_sm")
 
def get_sentences(nlp: spacy.language.Language, text: str, keyword: str):
    doc = nlp(text)
    # match tokens whose text is exactly the keyword
    pattern = [{"TEXT": keyword}]
    matcher = Matcher(nlp.vocab)
    matcher.add(keyword, [pattern])
    retval = []
    for sent in doc.sents:
        # run the matcher over each sentence and keep sentences containing the keyword
        for match_id, start, end in matcher(nlp(sent.text)):
            if nlp.vocab.strings[match_id] == keyword:
                retval.append(sent.text)
    return list(set(retval))

Extracted Sentences from spaCy Keyword Extraction

Let’s take a piece of text from this primer on climate change and green energy and do a keyword extraction on this. The keyword we’ll use is “energy”. We’ll call the spaCy function above to get all the sentences that contain the word “energy” in the text below.

text = """Green energy will be the backbone of decarbonizing our energy systems, and by extension, human society as a whole. Using the breakdown of GHG emissions by sector in the US below, replacing our direct electricity usage emissions with electricity from green energy sources (we can call this green electricity) would already reduce emissions by 25%. Furthermore, reducing emissions from transport and industry (another 52% of emissions) would require replacing burning hydrocarbons with using green electricity in a process called electrification. For transportation, replacing internal combustion engine vehicles with electric vehicles would enable the transportation sector to use green electricity instead of gasoline. For industry, electrifying manufacturing equipment or combining heat and power processes can enable the sector to use green electricity instead of burning coal. For commercial and residential, we could electrify heating and cooling for homes. Right now, there's a lot of propane and natural gas systems, and converting these to electricity would reduce the carbon footprint of the average American home. These pathways to decarbonization suggest that we need to install a lot of green electricity capacity and ensure our energy systems (like the electric grid) are capable of meeting people's new and existing demands without relying on hydrocarbons."""
kw = "energy"
print(get_sentences(nlp, text, kw))

We should see an output like the one below.

Keyword Extraction via spaCy

Sentence Extraction for Multiple Keywords (The Text API)

For this example, we’re going to use AI to extract sentences for multiple keywords. As always, we’ll start with the libraries we need to import. First we’ll import the requests and json libraries. I also imported The Text API key from my config file. Next, we’ll create a headers dictionary which tells the server that we’re sending JSON content and passes the API key. We’ll also declare the keyword URL API endpoint.

Just for consistency, we’ll continue by using the same text in this example as we did last time. After declaring the text, we’ll establish the keywords. In this example, we’ll use two keywords, “energy” and “process”. Then we’ll create the body that we’ll send to the server; the body contains the text and the keywords.

Now we’ll send a request to the server and parse the response into a JSON object. After parsing it, we’ll print out the values for the keys “energy” and “process”. That’s all there is to using AI to extract sentences for multiple keywords with a web API.

import requests
import json
from config import apikey
 
headers = {
    "Content-Type": "application/json",
    "apikey": apikey
}
kw_url = "https://app.thetextapi.com/text/sentences_with_keywords"
text = """Green energy will be the backbone of decarbonizing our energy systems, and by extension, human society as a whole. Using the breakdown of GHG emissions by sector in the US below, replacing our direct electricity usage emissions with electricity from green energy sources (we can call this green electricity) would already reduce emissions by 25%. Furthermore, reducing emissions from transport and industry (another 52% of emissions) would require replacing burning hydrocarbons with using green electricity in a process called electrification. For transportation, replacing internal combustion engine vehicles with electric vehicles would enable the transportation sector to use green electricity instead of gasoline. For industry, electrifying manufacturing equipment or combining heat and power processes can enable the sector to use green electricity instead of burning coal. For commercial and residential, we could electrify heating and cooling for homes. Right now, there's a lot of propane and natural gas systems, and converting these to electricity would reduce the carbon footprint of the average American home. These pathways to decarbonization suggest that we need to install a lot of green electricity capacity and ensure our energy systems (like the electric grid) are capable of meeting people's new and existing demands without relying on hydrocarbons."""
kws = ["energy", "process"]
body = {
    "text": text,
    "keywords": kws
}
 
response = requests.post(kw_url, headers=headers, json=body)
_dict = json.loads(response.text)
print(_dict["energy"])
print(_dict["process"])

You should get a response like the image below.

Keyword Extraction with The Text API

Applied Examples of AI Keyword Extraction

Now that we’ve seen some examples of keyword extraction with AI, let’s see some real life applied examples. We’re going to look at how we can use keyword extraction to do data analysis. The two applied examples we’ll look at both revolve around extracting headlines. In the first example, we’ll extract COVID headlines, in the second, we’ll extract Obama headlines.

COVID Headlines

One example of what we can do with keyword extraction is extract headlines from archives. We can use AI to extract all the headlines from the NY Times that contain the word COVID. In this section, I’ll display some of the headlines we extracted as well as go over a bit of what we learned. For the full example check out Using AI to Analyze COVID Headlines Over Time.

We extracted headlines containing the word “covid” from the NY Times archive from 2020 to 2021. We found that there were no headlines about COVID for the first 3 months of 2020! Then in April, we got 6; here’s what they were:

  1. life, covid-free, after 22 days in the hospital.
  2. covid or no covid, it’s important to plan.
  3. pregnant and scared of ‘covid hospitals,’ they’re giving birth at home, women scared of hospitals are increasingly turning to midwives.
  4. 32 days on a ventilator: one covid patient’s fight to breathe again, gasping for breaths the size of a tablespoon.
  5. ‘possible covid’: why the lulls never last for weary e.m.s. crews, a call pierces the lulls for exhausted paramedics: ‘possible covid’.
  6. arthritis drug did not help seriously ill covid patients, early data shows, drug shows slim promise for critical covid cases.

This was the graph of the number of COVID headlines per month from 2020 to 2021. This plot was created using AI keyword extraction with The Text API and matplotlib.

Number of COVID headlines over time

Obama Headlines

We can also get the headlines for all of the news about Obama back during his presidency. I chose Obama because he’s one of the internet’s favorite presidents, and the most followed person on Twitter. For a full tutorial on how we got the headlines, read Using NLP to get the Obama Presidency in Headlines. There are a TON of headlines about Obama. 

Read the headlines we extracted about Obama each year in these files: 2008, 2009, 2010, 2011, 2012, 2013, 2014, 2015, 2016, 2017. We got these by using The Text API on the NY Times archive to extract headlines containing the word “Obama”. From these, we were able to characterize the media’s portrayal of Obama by finding the most common phrases and building word cloud summaries.

Summary of AI Keyword Extraction

In this article we learned about AI keyword extraction. We learned that we can use keyword extraction on text documents to get the sentences, paragraphs, or sections around keywords. Then we went over some examples of possible uses of AI keyword extraction, including analyzing text data, looking through legal documents, and analyzing reviews. Next, we saw how we could implement keyword extraction for sentences through spaCy and The Text API, first implementing sentence extraction for one keyword using spaCy, then for multiple keywords with The Text API. Finally, we took a look at two examples of AI keyword extraction: an analysis of COVID headlines and an analysis of the media’s portrayal of Obama.



NLP: Stop Words, When and Why to Use Them

There are 326 “Stop Words” by default in spaCy. What are stopwords (or stop words)? They’re common words that we don’t want to include in some of our analysis when we perform Natural Language Processing. These are words that generally don’t contribute anything to the meaning of the text. However, we can’t always remove stopwords. In this article we’re going to go over why we remove stopwords, which NLP techniques and applications should keep or remove stopwords, and lists of default stop words for spaCy and NLTK.

Why Do We Remove Stopwords?

Stopwords are words that don’t add to the overall meaning of our text. When performing NLP tasks that revolve around understanding, we don’t need these words. Since machine learning is computationally expensive, it benefits us to process as little data as possible while still being able to produce a usable result. Of course, we can’t remove stop words for every task, so let’s take a look at which tasks we should remove stopwords for and which tasks we should keep them for.

Which NLP Techniques or Applications Should Remove Stop Words?

As we talked about above, not all Natural Language Processing tasks require removing stop words. The NLP techniques or applications that should use stopword removal in the pipeline are ones that revolve around meaning. These are usually the Natural Language Understanding tasks. These include applications like sentiment analysis, semantic parsing, or spam filtering. The tasks that don’t require stop words are ones which don’t necessarily need these common words to construct their responses.
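
For example, a sentiment or spam-filtering pipeline might drop stop words before doing anything else. Here’s a minimal sketch with spaCy, assuming the en_core_web_sm model is installed:

import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("This is an example of removing stop words from a piece of text.")
# keep only the tokens spaCy does not flag as stop words
content_words = [token.text for token in doc if not token.is_stop]
print(content_words)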

Which NLP Techniques or Applications Should Keep Stop Words?

So, if we want to remove stopwords for NLP techniques and applications that don’t require them in their responses, which ones should keep stop words? When we’re doing NLP tasks that require the whole text in its processing, we should keep stopwords. Examples of these kinds of NLP tasks include text summarization, language translation, and when doing question-answer tasks. You can see that these tasks depend on some common words such as “for”, “on”, or “in” to model the connection between words. 

List of Default English Stop Words from Different Libraries

In our introduction to the top 3 NLP libraries in Python, we went over spaCy, NLTK, and CoreNLP. Interestingly, there’s no universal list of stopwords. The spaCy library has 326 default stopwords in English, the NLTK library has 179, and CoreNLP doesn’t have its own list of default stopwords. Let’s take a look at the default stopwords from spaCy and NLTK and how to get them.

List of all 326 Default Stopwords in spaCy

spacy stopwords word cloud

There are 326 default stopwords in spaCy. To get these, we install the `spacy` library and download the `en_core_web_sm` model. The default stop words come with the model. We can see the stopwords by loading the model and printing its `Defaults.stop_words`.

pip install spacy
python -m spacy download en_core_web_sm
import spacy
nlp = spacy.load("en_core_web_sm")
print(nlp.Defaults.stop_words)
'you', 'something', 'anyhow', 'would', 'not', 'first', 'now', 'without', 'which', 'may', 'regarding', '’d', 'back', 'nevertheless', 'how', 'should', 'bottom', 'by', 'twelve', 'least', 'but', '‘d', 'thence', 'i', 'hers', 'are', 'therein', 'same', 'indeed', 'others', 'whither', 'your', '’ll', 'either', 'last', 'therefore', 'do', 'whence', 'we', 'top', 'beforehand', 'though', 'across', 'everyone', 'only', 'full', 'fifteen', 'hereby', 'since', 'while', 're', 'beside', 'quite', 'her', 'is', 'their', 'meanwhile', 'neither', 'various', 'everywhere', "'d", 'made', 'nowhere', 'name', 'of', 'done', 'ever', 'onto', 'off', 'its', 'most', 'twenty', 'next', 'after', 'does', 'whether', 'say', 'please', 'at', 'sometimes', "n't", 'hereafter', 'here', 'until', 'itself', 'latterly', 'well', 'became', 'under', 'behind', 'the', 'me', 'must', 'give', 'former', 'using', 'or', 'otherwise', 'noone', '‘s', 'yours', 'everything', 'wherein', 'even', 'take', 'put', 'ourselves', 'themselves', 'him', 'beyond', 'whose', 'another', 'with', 'every', 'whom', 'somewhere', 'forty', 'via', '’ve', 'get', "'s", '‘re', 'any', 'due', 'really', '’re', 'towards', 'it', 'whereupon', 'none', 'anyway', 'very', 'among', 'before', 'sixty', 'eleven', 'seeming', 'why', 'whereby', 'whenever', 'per', 'ours', 'namely', 'they', "'m", 'along', 'somehow', 'yourself', 'many', 'empty', 'who', 'becoming', 'hence', 'them', 'n’t', 'between', 'a', 'be', 'further', 'against', 'else', 'when', 'has', 'will', 'anyone', 'was', 'several', 'there', 'three', 'formerly', 'one', 'my', 'were', 'side', 'cannot', 'becomes', "'ll", 'make', 'such', 'never', 'amount', 'enough', 'just', 'our', 'those', 'besides', '’s', 'being', 'part', 'except', 'someone', 'often', 'seems', '‘ve', 'latter', "'ve", 'afterwards', 'both', 'during', 'unless', 'together', 'n‘t', 'show', 'keep', 'too', 'each', 'into', 'been', 'an', 'us', 'whereafter', 'to', 'in', 'nor', '‘ll', 'so', "'re", 'down', 'six', 'toward', 'five', 'doing', 'out', 'herein', 'thereupon', 'whole', 'anything', 'can', 'because', 'over', 'however', 'seem', 'serious', 'go', 'am', 'then', 'myself', 'within', 'four', 'his', 'nobody', 'sometime', 'yet', 'front', 'become', 'himself', 'wherever', 'upon', 'nothing', 'few', 'hundred', 'move', '‘m', 'what', 'as', 'below', 'elsewhere', 'mostly', 'anywhere', 'up', 'that', 'amongst', 'this', 'around', 'she', 'always', 'thereafter', 'nine', 'ca', 'already', 'herself', 'some', 'much', 'if', 'two', 'these', 'had', 'ten', 'whatever', 'also', 'through', 'thus', 'yourselves', 'see', 'he', 'throughout', 'for', 'moreover', '’m', 'seemed', 'again', 'might', 'all', 'on', 'almost', 'have', 'less', 'fifty', 'eight', 'could', 'used', 'thereby', 'perhaps', 'above', 'whereas', 'and', 'about', 'although', 'still', 'mine', 'from', 'than', 'rather', 'once', 'third', 'call', 'alone', 'did', 'more', 'thru', 'whoever', 'where', 'hereupon', 'other', 'own', 'no'

List of all 179 Default Stopwords in NLTK

nltk stopwords word cloud

There are 179 stop words in NLTK. To get all the default stopwords from NLTK, we install the library and download the `stopwords` submodule. Once we do that, we can see all the stopwords with a simple command.

pip install nltk
python
>>> import nltk
>>> nltk.download("stopwords")
>>> from nltk.corpus import stopwords
>>> print(stopwords.words('english'))
'i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', 'too', 'very', 's', 't', 'can', 'will', 'just', 'don', "don't", 'should', "should've", 'now', 'd', 'll', 'm', 'o', 're', 've', 'y', 'ain', 'aren', "aren't", 'couldn', "couldn't", 'didn', "didn't", 'doesn', "doesn't", 'hadn', "hadn't", 'hasn', "hasn't", 'haven', "haven't", 'isn', "isn't", 'ma', 'mightn', "mightn't", 'mustn', "mustn't", 'needn', "needn't", 'shan', "shan't", 'shouldn', "shouldn't", 'wasn', "wasn't", 'weren', "weren't", 'won', "won't", 'wouldn', "wouldn't"
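
Since there’s no universal list, it can be interesting to compare the two sets directly. Here’s a small sketch, assuming both libraries are installed and the resources above are downloaded, that counts how many stopwords the two libraries share:

import spacy
from nltk.corpus import stopwords

nlp = spacy.load("en_core_web_sm")
spacy_stopwords = nlp.Defaults.stop_words
nltk_stopwords = set(stopwords.words("english"))
# print the size of each list and the size of their overlap
print(len(spacy_stopwords), len(nltk_stopwords), len(spacy_stopwords & nltk_stopwords))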

Stopwords Recap

In this post, we learned that stopwords are the most common words in a language and usually don’t provide much semantic value. Then we looked at why we remove stopwords. Some NLP tasks, such as sentiment analysis, should remove stop words. Other NLP tasks, such as AI summarization, shouldn’t remove stop words. Finally, we went over the default stopwords in spaCy and NLTK and how to get them.



Top 3 Ready-to-Use Python NLP Libraries for 2022

80-90% of business data is unstructured text data. The businesses that win will be the ones that find a way to analyze their text data. How can we analyze text data? Natural Language Processing. NLP is one of the most important sectors of AI. It may be the fastest growing subfield of AI in the 2020s. In this post we’ll be going over three ready-to-use Python NLP Libraries. For a more fundamental understanding of Natural Language Processing, read an Introduction to NLP: Core Concepts.

Ready-to-Use Python NLP Libraries

The state of the art in Natural Language Processing is to use neural networks. In particular, transformers are a popular model architecture. There are pros and cons to using transformer models, but we’re not going to focus on that now. There will always be architectural innovations. For this article, we’re going to focus on the top three ready-to-use NLP libraries. None of these libraries requires a deep, fundamental understanding of how NLP works, but all of them will allow you to leverage its power.

The top 3 ready-to-use NLP libraries are spaCy, NLTK, and Stanford’s CoreNLP library. Each of these libraries has its own specialty and reason for being in the top 3. The spaCy library provides industrial-strength NLP models. NLTK focuses on research and computational linguistics. Stanford’s CoreNLP library is a Java library that has been adapted to multiple languages, including Python.

NLP with spaCy

The spaCy library is made and maintained by Explosion. It provides multiple models and support for 18 languages. We’re going to focus on the English language models. There are four English language models for web data: en_core_web_sm, en_core_web_md, en_core_web_lg, and en_core_web_trf. The first three are optimized for CPU performance while en_core_web_trf is a transformer-based model, not optimized for CPU performance. Let’s go over some of the basic NLP techniques you can do with spaCy.

To get started with spaCy, open up your terminal and run the following commands:

pip install spacy
python -m spacy download en_core_web_sm

Part of Speech Tagging

Part of speech (POS) tagging is a fundamental part of natural language processing. This is usually one of the first things in an NLP pipeline. There are many different parts of speech; to learn more, read this article on parts of speech. Here’s how we can do POS tagging with spaCy.

First, we import spacy. Then we load up the model we downloaded earlier, in this case en_core_web_sm. The text that we’re running POS tagging on is taken from How Many Solar Farms Does it Take to Power America? All we do is run the text through our NLP pipeline. Then to see the parts of speech, we loop through the tokenized document and check the part of speech and tag.

import spacy
 
nlp = spacy.load("en_core_web_sm")
 
text = """This is where the calculation can get tricky. Here’s the thing about solar energy. Solar energy comes from the sun. That means solar panels cannot produce energy 24 hours a day. They only produce energy during sunlight hours. That energy then has to be stored somewhere while it is not being used. Energy storage is a whole other topic in and of itself. Let me get back to the point, there’s only an average of 4 peak sunlight hours a day. A solar panel may get more than that, but let’s take a conservative estimate of our solar power generation and confine it to those 4 hours only.
Back to the calculations. At 4 acres of solar panels to generate a megawatt-hour and 4 hours of power generation time a day, a 1 MW solar farm would generate 4 MWh of power over 4 acres every day. At 110,000 megawatt-hours of power needed a day to power America, we would need about 110,000 acres of solar farm. 110,000 acres? That sounds huge, that’s more 
land than the entire Mojave desert. It’s not as daunting as it sounds, there are 1.9 billion acres in the continental United States, and 110,000 acres is only slightly more than 0.5 percent of the total land of the continental US."""
 
doc = nlp(text)
 
for token in doc:
    print(token.text, token.pos_, token.tag_)
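
If you’re not sure what a particular tag like VERB or VBZ means, spaCy has a built-in explain helper that returns a short description for a tag or label:

import spacy

# look up human-readable descriptions for coarse and fine-grained tags
print(spacy.explain("VERB"))
print(spacy.explain("VBZ"))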

Named Entity Recognition

Named Entity Recognition (NER) is an NLP technique that has POS as a prerequisite. The types of entities that can be named and recognized include people, organizations, locations, and time. This isn’t a comprehensive list though. For a full list of the named entities that can be recognized read this article on the Best Way to do Named Entity Recognition.

To do NER in spaCy, we’ll start by importing spacy. Then we’ll load the model. The text that we’re using for this is a random thing that I made up. The same as above, we’ll tokenize the text by running it through the NLP model. Then we’ll loop through each entity in the document and print out the text and label. Notice that ents is a default property of the document after running it through the NLP pipeline.

import spacy
 
nlp = spacy.load("en_core_web_sm")
 
text = "Molly Moon is a cow. She is part of the United Nations' Climate Action Committee."
 
doc = nlp(text)
 
for ent in doc.ents:
    print(ent.text, ent.label_)

Lemmatization

Lemmatization is the process of finding the lemmas of each word. A lemma is the root of a word. To learn more about lemmatization, read this article on what lemmatization is and how you can use it.

As we did in the above two NLP techniques with spaCy, we’ll start by importing spacy and loading the model. You can use any text you want. For this example, I’m using a random set of text about spaCy, the NFL, and about how Yujian Tang is the best software content creator. As we did above, we simply run the text through an NLP model. Then we’ll loop through each token in our tokenized document and print the lemma out.

import spacy
 
nlp = spacy.load("en_core_web_sm")
 
text = "This is an example program for showing how lemmatization works in spaCy. Let's play ball!. The NFL has many football teams and players. Yujian Tang is the best software content creator. The Text API is the most comprehensive sentiment analysis API made to date."
 
doc = nlp(text)
 
for token in doc:
    print(token.lemma_)

NLP with Natural Language ToolKit (NLTK)

NLTK is a project led by Steven Bird and Liling Tan. Different parts of NLTK are maintained by different people all around the world. It’s an open source natural language project made for working with computational linguistics in Python.

To get started with NLTK, we need to install the library as well as some of its submodules. We can do so with the commands below. Note that the maxent_ne_chunker and words downloads are only needed for NER.

pip install nltk
python
>>> import nltk
>>> nltk.download("punkt")
>>> nltk.download("averaged_perceptron_tagger")
>>> nltk.download("maxent_ne_chunker")
>>> nltk.download("words")

Part of Speech Tagging

We’re going to use the same piece of text to demonstrate Part of Speech Tagging with NLTK as we did with spaCy. To do part of speech tagging with NLTK we’ll start by importing the nltk library. We have to run two commands to do part of speech tagging. First we tokenize the text, then we use the pos_tag command on the tokenized text. To see the tagged parts of speech, we just print them out. Click here for a complete list of part of speech tags.

import nltk
 
text = """This is where the calculation can get tricky. Here’s the thing about solar energy. Solar energy comes from the sun. That means solar panels cannot produce energy 24 hours a day. They only produce energy during sunlight hours. That energy then has to be stored somewhere while it is not being used. Energy storage is a whole other topic in and of itself. Let me get back to the point, there’s only an average of 4 peak sunlight hours a day. A solar panel may get more than that, but let’s take a conservative estimate of our solar power generation and confine it to those 4 hours only.
Back to the calculations. At 4 acres of solar panels to generate a megawatt-hour and 4 hours of power generation time a day, a 1 MW solar farm would generate 4 MWh of power over 4 acres every day. At 110,000 megawatt-hours of power needed a day to power America, we would need about 110,000 acres of solar farm. 110,000 acres? That sounds huge, that’s more land than the entire Mojave desert. It’s not as daunting as it sounds, there are 1.9 billion acres in the continental United States, and 110,000 acres is only slightly more than 0.5 percent of the total land of the continental US."""
 
tokenized = nltk.word_tokenize(text)
tagged = nltk.pos_tag(tokenized)
for tag in tagged:
    print(tag)
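
If you want a quick description of what a tag like NN means, NLTK also ships a help utility; note that it needs the tagsets resource downloaded first (a one-time download, similar to the ones above):

import nltk
import nltk.help

nltk.download("tagsets")      # one-time download of the tag documentation
nltk.help.upenn_tagset("NN")  # prints a description and examples for the NN tag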

Named Entity Recognition

Named Entity Recognition with NLTK requires the most downloads of the three simple NLP techniques listed here. Once again we’re going to use the same, somewhat nonsensical phrase as we did before. If you’re from Seattle, you’ll surely recognize Molly Moon. She is not a part of the UN’s Climate Action Committee.

To do NER with NLTK, we import our library, set up our text, and then call three functions on it. Just like above, we’ll start by tokenizing the string, and then running part of speech tagging on it. After part of speech tagging, we’ll run the ne_chunk command which stands for “named entity chunk”. To see the named entities tagged, we’ll look through all the chunks and if the chunk is labeled (recognized), we print it out.

import nltk
 
text = "Molly Moon is a cow. She is part of the United Nations' Climate Action Committee."
 
tokenized = nltk.word_tokenize(text)
pos_tagged = nltk.pos_tag(tokenized)
chunks = nltk.ne_chunk(pos_tagged)
for chunk in chunks:
    if hasattr(chunk, 'label'):
        print(chunk)
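
Each labeled chunk is an nltk Tree whose leaves are (word, tag) pairs, so if you just want the entity string and its label, you can join the leaves and call label() on the chunk:

for chunk in chunks:
    if hasattr(chunk, 'label'):
        # join the words in the chunk and print them alongside the entity label
        entity = " ".join(word for word, tag in chunk)
        print(entity, chunk.label())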

Lemmatization

Lemmatization in NLTK works slightly differently than the other two NLP techniques we’ve looked at in this post. Let’s start by importing the NLTK library, and then also import the WordNetLemmatizer function from the nltk.stem sub-library. We’ll use the same text as above, a mix of random sentences about NLP, the NFL, Yujian Tang being the best software content creator, and The Text API.

We use the WordNetLemmatizer() as our lemmatizer. The first thing we’ll do is tokenize our text. Then we loop through the tokenized text and lemmatize each token with the lemmatizer.

import nltk
from nltk.stem import WordNetLemmatizer
 
lemmatizer = WordNetLemmatizer()
tokenized = nltk.word_tokenize("This is an example program for showing how lemmatization works in NLTK. Let's play ball!. The NFL has many football teams and players. Yujian Tang is the best software content creator. The Text API is the most comprehensive sentiment analysis API made to date.")
for t in tokenized:
    print(lemmatizer.lemmatize(t))

NLP with Stanford CoreNLP (Stanza in Python)

Stanford’s CoreNLP library is actually a Java library. It has been adapted to be usable in Python in many different forms. The officially maintained Python library actually isn’t even called Stanford CoreNLP; it’s called “Stanza”. Curiously enough, NLTK actually has a way to interface with Stanford CoreNLP. To get started with stanza, we simply install it and then download a model as shown below.

pip install stanza
python
>>> import stanza
>>> stanza.download("en")

Part of Speech Tagging

It’s worth mentioning here that just as spaCy separates the coarse “part of speech” from the fine-grained “tag”, CoreNLP separates upos, or universal part of speech, from xpos, or treebank-specific part of speech. Here we’re going to be looking at the upos.

We’ll start the same way we always start, by importing the library. Stanza explicitly uses a Pipeline instead of loading a model (spaCy) or calling different functions (NLTK). We’ll tell the pipeline that we want an en or English model, and we want to add tokenize, mwt (multi-word tokenizer), and pos (part of speech) processors to our pipeline. From here, we add the text, documentize the text with the pipeline, and print out the universal part of speech for each token in each sentence.

import stanza
 
nlp = stanza.Pipeline(lang='en', processors='tokenize,mwt,pos')
text = """This is where the calculation can get tricky. Here’s the thing about solar energy. Solar energy comes from the sun. That means solar panels cannot produce energy 24 hours a day. They only produce energy during sunlight hours. That energy then has to be stored somewhere while it is not being used. Energy storage is a whole other topic in and of itself. Let me get back to the point, there’s only an average of 4 peak sunlight hours a day. A solar panel may get more than that, but let’s take a conservative estimate of our solar power generation and confine it to those 4 hours only.
Back to the calculations. At 4 acres of solar panels to generate a megawatt-hour and 4 hours of power generation time a day, a 1 MW solar farm would generate 4 MWh of power over 4 acres every day. At 110,000 megawatt-hours of power needed a day to power America, we would need about 110,000 acres of solar farm. 110,000 acres? That sounds huge, that’s more land than the entire Mojave desert. It’s not as daunting as it sounds, there are 1.9 billion acres in the continental United States, and 110,000 acres is only slightly more than 0.5 percent of the total land of the continental US."""
doc = nlp(text)
print(*[f"word: {word.text}\tupos: {word.upos}" for sent in doc.sentences for word in sent.words], sep='\n')
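
If you also want the treebank-specific tags mentioned above, each stanza word exposes an xpos attribute alongside upos; here’s a small variation of the print statement:

# xpos holds the treebank-specific (Penn Treebank style) tag for each word
print(*[f"word: {word.text}\tupos: {word.upos}\txpos: {word.xpos}" for sent in doc.sentences for word in sent.words], sep='\n')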

Named Entity Recognition

Named Entity Recognition with stanza works in much the same way POS does. We import the stanza library and create a pipeline. For this case we need to use the tokenize and ner pipelines. Once again, we use the same text, we documentize the text, and print out the entity type for each entity in each sentence.

import stanza
 
nlp = stanza.Pipeline(lang='en', processors="tokenize,ner")
 
text = "Molly Moon is a cow. She is part of the United Nations' Climate Action Committee."
doc = nlp(text)
 
print(*[f"entity: {ent.text}\ttype: {ent.type}" for sent in doc.sentences for ent in sent.ents], sep='\n')

Lemmatization

We start off by importing our library and setting up our pipeline as usual. For lemmatization, we’ll need the same pipeline elements as we did for POS tagging and also the lemma element. Our text will be the same text as the spaCy and NLTK ones. All we have to do is documentize the text to get the lemmas. To see them, we simply print out all the texts and lemmas for each word in each sentence.

import stanza
 
nlp = stanza.Pipeline(lang="en", processors="tokenize,mwt,pos,lemma")
text = "This is an example program for showing how lemmatization works in NLTK. Let's play ball!. The NFL has many football teams and players. Yujian Tang is the best software content creator. The Text API is the most comprehensive sentiment analysis API made to date."
doc = nlp(text)
print(*[f"word: {word.text}\t lemma: {word.lemma}" for sent in doc.sentences for word in sent.words], sep='\n')

Recap of the Top 3 Ready-To-Use Python NLP Libraries

In this post we went over the top 3 ready-to-use Python NLP libraries for 2022. Why are these the top 3? Because they’re actually maintained. There are a TON of NLP libraries for Python, but most of them have fallen into disuse. We went over how to do three of the most common and fundamental NLP techniques with each of these libraries. Which one of these libraries should you use? It depends on your use case. 

The spaCy library is targeted at industry Python users, the NLTK library is mainly for academic research around NLP and computational linguistics, and the Stanford CoreNLP library is compatible with multiple programming languages. Out of these three, I would say that the Stanford CoreNLP library is the most powerful and most complex. The NLTK library seems to be the most customizable. The spaCy library feels like the most simple to use while still being quite powerful.

Bonus: a language agnostic NLP Web API

Web APIs are also a popular choice for NLP. A great advantage of a web API is that you don’t have to host the model on your own computer. However, you also don’t have customizability over the model. The most comprehensive web API to date is The Text API. The only one of the fundamental NLP techniques we mentioned that it provides is NER. The Text API provides more business-ready use cases such as AI summarization, finding the most common phrases, keyword sentence extraction, and more. For more information, read this guide on how to automatically analyze text documents.


What is Lemmatization and How can I do It?

Lemmatization is an important part of Natural Language Processing. Other NLP topics we’ve covered include Text Polarity, Named Entity Recognition, and Summarization. Lemmatization is the process of turning a word into its lemma. A lemma is the “canonical form” of a word; it’s usually the dictionary version of a word, picked by convention. Let’s look at some examples to make more sense of this.

The words “playing”, “played”, and “plays” all have the same lemma of the word “play”. The words “win”, “winning”, “won”, and “wins” all have the same lemma of the word “win”. Let’s take a look at one more example before we move on to how you can do lemmatization in Python. The words “programming”, “programs”, “programmed”, and “programmatic” all have the same lemma of the word “program”. Another way to think about it is to think of the lemma as the “root” of the word.

In this post we’ll cover:

  • How Can I Do Lemmatization with Python
    • Lemmatization with spaCy
    • Lemmatization with NLTK

How Can I Do Lemmatization with Python?

Python has many well known Natural Language Processing libraries, and we’re going to make use of two of them to do lemmatization. The first one we’ll look at is spaCy and the second one we’ll use is Natural Language Toolkit (NLTK).

Lemmatization with spaCy

This is pretty cool: we’re going to lemmatize our text in under 10 lines of code. To get started with spaCy, we’ll install the spacy library and download a model. We can do this in the terminal with the following commands:

pip install spacy
python -m spacy download en_core_web_sm

To start off our program, we’ll import spacy and load the language model.

import spacy
 
nlp = spacy.load("en_core_web_sm")

Once we have the model, we’ll simply make up a text, turn it into a spaCy Doc, and that’s basically it. To get the lemma of each word, we’ll just print out the lemma_ attribute. Note that printing out the lemma attribute will get you a number corresponding to the lemma’s representation.

text = "This is an example program for showing how lemmatization works in spaCy. Let's play ball!. The NFL has many football teams and players. Yujian Tang is the best software content creator. The Text API is the most comprehensive sentiment analysis API made to date."
 
doc = nlp(text)
 
for token in doc:
    print(token.lemma_)

Our output should look like the following:

text lemmatization with spaCy output

Sounds like a pirate!

Lemmatization with NLTK

Cool, lemmatization with spaCy wasn’t that hard; let’s check it out with NLTK. For NLTK, we’ll need to install the library and download the wordnet corpus before we can write the program. We can do that in the terminal with the commands below.

pip install nltk
python
>>> import nltk
>>> nltk.download("wordnet")
>>> exit()

Why are we running Python in the shell instead of just downloading wordnet at the start of our program? We only need to download it once to be able to use it, so we don’t want to put the download in a program we’ll be running multiple times. As always, we’ll start our program by importing the libraries we need. In this case, we’re just going to import nltk and the WordNetLemmatizer object from nltk.stem.

import nltk
from nltk.stem import WordNetLemmatizer

First we’ll use word_tokenize from nltk to tokenize our text. Then we’ll loop through the tokenized text and use the lemmatizer to lemmatize each token and print it out.

lemmatizer = WordNetLemmatizer()
tokenized = nltk.word_tokenize("This is an example program for showing how lemmatization works in NLTK. Let's play ball!. The NFL has many football teams and players. Yujian Tang is the best software content creator. The Text API is the most comprehensive sentiment analysis API made to date.")
for t in tokenized:
    print(lemmatizer.lemmatize(t))

We’ll end up with something like the image below. 

text lemmatization with NLTK results

As you can see, using NLTK returns a different lemmatization than using spaCy. It doesn’t seem to do lemmatization as well. NLTK and spaCy are made for different purposes, so I am usually impartial. However, spaCy definitely wins for built in lemmatization. NLTK can be customized because it’s highly used for research purposes, but that’s out of the scope for this article. Be on the lookout for an in depth dive though!
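
As a small taste of that customization, the WordNetLemmatizer accepts a part of speech hint; without one it treats every word as a noun, which is part of why verbs come back unchanged:

from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
# the default POS is noun, so verbs pass through unchanged unless we pass pos="v"
print(lemmatizer.lemmatize("playing"))
print(lemmatizer.lemmatize("playing", pos="v"))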



Build Your Own AI Text Summarizer in Python

For this example, we’re going to build a naive extractive text summarizer in 25 lines of Python. An extractive summary is a summary of a document that is directly extracted from the text. For more information on AI summaries, check out this article on What is AI Text Summarization and How Can I Use It?

We will build an AI text summarizer in two ways. First with spaCy, then with The Text API. spaCy is one of the open source Python libraries for Natural Language Processing. The Text API is the best comprehensive sentiment analysis API online.

In this post on how to build an AI Text Summarizer in Python, we will cover:

  • Build an AI Text Summarizer in Under 30 Lines of Python
  • Build an AI Text Summarizer in 15 Lines of Python

Build an AI Text Summarizer in Under 30 Lines of Python

Before we can get started with the code we need to install spaCy and download a model. We can do this in the terminal with the following two commands. The en_core_web_sm model is the smallest model and the fastest to get started with. You can also download en_core_web_md, en_core_web_lg, and en_core_web_trf for other, larger English language models.

pip install spacy
python -m spacy download en_core_web_sm

Let’s get started with the code for our text summarizer! First, we’ll import spacy and load up the language model we downloaded earlier.

import spacy
 
nlp = spacy.load("en_core_web_sm")

For this tutorial, we’ll be building a simple extractive text summarizer based purely on the words in the text and how often they’re mentioned. We’re going to break down this text summarizer into a few simple steps.

First we’re going to create a word dictionary to keep track of word count. Then we’re going to score each sentence based on how often each word in that sentence appears. After that, we’re going to sort the sentences based on their score. Finally, we’ll take the top three scoring sentences and return them in the same order they originally appeared in the text.

Before we get into all that let’s load up our text and turn it into a spaCy Document. You can use whatever text you want. The text provided is just an example that talks about me and this blog.

# extractive summary by word count
text = """This is an example text. We will use seven sentences and we will return 3. This blog is written by Yujian Tang. Yujian is the best software content creator. This is a software content blog focused on Python, your software career, and Machine Learning. Yujian's favorite ML subcategory is Natural Language Processing. This is the end of our example."""
# tokenize
doc = nlp(text)

Getting All the Word Counts

Now that we have our text in Doc form, we can get all our word counts. You could actually do this earlier by splitting the string on spaces, but this way is easier and we’ll need the Doc again later anyway.

First let’s create a word dictionary. Next, we’ll loop through the text and check if each word is in the dictionary. If the word is in the dictionary, we’ll increment its counter; if not, we’ll set its counter to one. We’ll save every word in lowercase format.

# create dictionary
word_dict = {}
# loop through every sentence and give it a weight
for word in doc:
    word = word.text.lower()
    if word in word_dict:
        word_dict[word] += 1
    else:
        word_dict[word] = 1
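
One optional tweak, not part of the original summarizer, is to skip stop words and punctuation when counting so that filler words don’t dominate the scores. If you do this, the scoring loop below should also use word_dict.get(word, 0) so that skipped words don’t raise a KeyError:

# optional variant: ignore stop words and punctuation when counting
word_dict = {}
for token in doc:
    if token.is_stop or token.is_punct:
        continue
    word = token.text.lower()
    word_dict[word] = word_dict.get(word, 0) + 1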

Scoring the Sentences for Our AI Text Summarizer

Once we’ve gathered all the word counts, we can use them to score our sentences. We’ll create a list of tuples. Each tuple contains the information we need about the sentence – the sentence’s text, the sentence’s score, and the original sentence index. We’ll loop through each index and sentence in the enumerated Doc sentences.

The enumerate command returns an index and the element at that index for any iterable. For each sentence, we’ll reset the sentence score, then add each word’s count to the sentence score. At the end of looping through all the words in the sentence, we append the sentence text, the sentence score normalized by sentence length, and the original index.

# create a list of tuples (sentence text, score, index)
sents = []
# score sentences
for index, sent in enumerate(doc.sents):
    # reset the score for each sentence so scores don't accumulate across sentences
    sent_score = 0
    for word in sent:
        word = word.text.lower()
        sent_score += word_dict[word]
    sents.append((sent.text.replace("\n", " "), sent_score/len(sent), index))

Sorting the Sentences for the Text Summarizer

Now that our list of sentences is created, we’ll have to sort them so that we get the highest scored sentences in our summary. First we’ll use a lambda function to sort by the negative version of the score.

Why negative? Because Python's sort orders values from smallest to largest by default. After we've sorted by score, we take the top 3 and then re-sort those by their original index so that our summary stays in order. You can take however many sentences you'd like, and even adjust the number of sentences based on the length of the text.

# sort sentence by word occurrences
sents = sorted(sents, key=lambda x: -x[1])
# return top 3
sents = sorted(sents[:3], key=lambda x: x[2])
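
Equivalently, you could sort in descending order with `reverse=True` instead of negating the score; here's the same selection written that way as an alternative sketch:

# same selection, written with reverse=True instead of a negated sort key
top_sents = sorted(sents, key=lambda x: x[1], reverse=True)[:3]
top_sents = sorted(top_sents, key=lambda x: x[2])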

Returning the Summary

All we have to do to get our resulting summary is take the list of sorted sentences and put them together, separated by a space. Finally, we’ll print it out to take a look.

# compile them into text
summary_text = ""
for sent in sents:
    summary_text += sent[0] + " "
 
print(summary_text)

Once we run our program, we should see an output like the one below. That's all there is to building a simple text summarizer in Python with spaCy!

example text summarizer output
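
If you'd like all of the steps above in one place, here's the same logic collected into a single function. This is just a sketch based on the code we've already written, so feel free to adjust the sentence count or the weighting:

def summarize(text, num_sentences=3):
    """Extractive summary: score sentences by word frequency and return the top ones in order."""
    doc = nlp(text)
    # count every lowercased token
    word_dict = {}
    for token in doc:
        word = token.text.lower()
        word_dict[word] = word_dict.get(word, 0) + 1
    # score each sentence by the average count of its words
    sents = []
    for index, sent in enumerate(doc.sents):
        score = sum(word_dict[token.text.lower()] for token in sent) / len(sent)
        sents.append((sent.text.replace("\n", " "), score, index))
    # keep the highest scoring sentences, restored to their original order
    top = sorted(sorted(sents, key=lambda x: -x[1])[:num_sentences], key=lambda x: x[2])
    return " ".join(s[0] for s in top)

print(summarize(text))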

Build an AI Text Summarizer in 15 Lines of Python

So we’ve covered how to build an AI text summarizer in under 30 lines of code, let’s also do it in 15. For this part of the tutorial we only need to send an HTTP request. Before we get started with that we’ll have to go to The Text API and register for a free API key. Once you’ve registered for a key, you’ll need to install the requests library.

pip install requests

We’ll import the libraries we need to get started. We’ll use requests to send our HTTP request and json to parse the response.

import requests
import json
 
from config import apikey
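
The `config` import assumes you've saved your API key in a small `config.py` file next to your script; a minimal version might look like this (the value is a placeholder):

# config.py – keep this file out of version control
apikey = "YOUR_API_KEY_HERE"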

Setting Up the API Request

Let’s set up the request. The text we’ll summarize is a description of The Text API and what it can do. We’ll also need to set up some headers, the body, and the URL endpoint. The headers will tell the server that the content we’re sending is in JSON format and also pass the API key we got earlier. The body will simply pass in the text we have as the “text” attribute. The URL will be the summarize endpoint from The Text API.

text = "The Text API is easy to use and useful for anyone who needs to do text processing. It's the best Text Processing web API. The Text API allows you to do amazing NLP without having to download or manage any models. The Text API provides many NLP capabilities. These capabilities range from custom Named Entity Recognition (NER) to Summarization to extracting the Most Common Phrases. NER and Summarizations are both commonly used endpoints with business use cases. Use cases include identifying entities in articles, summarizing news articles, and more. The Text API is built on a transformer model."
 
headers = {
    "Content-Type": "application/json",
    "apikey": apikey
}
body = {
    "text": text
}
url = "https://app.thetextapi.com/text/summarize"

Parsing the AI Text Summarizer Response

After setting up the request, all we have to do is send the request and then parse our response via JSON. The request will return both a user item and a summary item. We only need the value of the summary item. 

response = requests.post(url, headers=headers, json=body)
summary = json.loads(response.text)["summary"]
print(summary)

Let’s run this and see our response. It should look something like the response below.

example text summarizer output
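
If you want to be a little more careful, you could check that the request succeeded before parsing the JSON. This is optional and just a defensive sketch:

response = requests.post(url, headers=headers, json=body)
# only parse the body if the server returned a success status code
if response.ok:
    summary = json.loads(response.text)["summary"]
    print(summary)
else:
    print(f"Request failed with status {response.status_code}: {response.text}")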

You can read more about other NLP concepts such as Named Entity Recognition (NER), Part of Speech (POS) Tagging, and more on this blog.

Summary of Building an AI Text Summarizer

In this post we looked at two ways you can build an AI text summarizer. First, we used the popular text library, spaCy. We built a simple extractive AI text summarizer on top of a spaCy model using basic logic. Second, we built a text summarizer using The Text API, a comprehensive and easy to use web API.

Unlike tasks with a clearer right answer, such as speech-to-text transcription, summarization is subjective. The true value of an AI text summarizer lies in how effective it is for the end user's requirements, so keep the end user's use case in mind when you design your AI text summarizer.


Natural Language Processing: What is Text Polarity?

Natural Language Processing (NLP) and all of its applications will be huge in the 2020s. A lot of my blogging is about text processing and all the things that go with it, such as Named Entity Recognition and Part of Speech Tagging. Text polarity is a basic text processing technique that gives us insight into how positive or negative a text is. The polarity of a text is essentially its "sentiment" rating from -1 to 1.

Overview of Text Polarity

In this post we’ll cover:

  • What is Text Polarity?
  • How to Get Text Polarity with spaCy
  • How to Get Text Polarity with NLTK
  • How to Get Text Polarity with a web API
  • Why are these Text Polarity Numbers so Different?

What is Text Polarity?

In short, text polarity is a measure of how negative or how positive a piece of text is. Polarity is the measure of the overall combination of the positive and negative emotions in a sentence. It's notoriously hard for computers to predict this; in fact, it's hard even for people to judge polarity over text. Check out the following Key and Peele video for an example of what I mean.

Most of the time, NLP models can predict simply positive or negative words and phrases quite well. For example, the words "amazing", "superb", and "wonderful" can easily be labeled as highly positive. The words "bad", "sad", and "mad" can easily be labeled as negative. However, we can't just look at polarity at the level of individual words; it's important to take the larger context into account when evaluating total polarity. For example, the word "bad" may be negative, but what about the phrase "not bad"? Is that neutral? Or is that the opposite of bad? At this point we're getting into linguistics and semantics rather than natural language processing.

Because surrounding words can modify a word's meaning and polarity, when I implemented text polarity for The Text API I used a combination of the total text polarity and the polarity of the individual phrases within it. The two biggest open source libraries for NLP in Python are spaCy and NLTK, and both of these libraries measure polarity on a normalized scale of -1 to 1. The Text API measures, combines, and normalizes the polarity of the overall text, its individual sentences, and its individual phrases. This gives a better picture of the relative polarities of texts by not penalizing longer sentences that express positive or negative emotion at scale but also contain neutral phrases. Let's take a look at how we can implement text polarity with the libraries and API mentioned above!
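
To make the idea of combining polarity at different levels concrete, here's a toy sketch that blends the document-level score with the average sentence-level score using TextBlob (`pip install textblob`). The 50/50 weights are purely illustrative and are not The Text API's actual formula:

from textblob import TextBlob

def combined_polarity(text, doc_weight=0.5, sent_weight=0.5):
    # document-level polarity
    blob = TextBlob(text)
    doc_polarity = blob.sentiment.polarity
    # average sentence-level polarity
    sentence_scores = [sentence.sentiment.polarity for sentence in blob.sentences]
    avg_sentence_polarity = sum(sentence_scores) / len(sentence_scores)
    # illustrative weighted blend of the two levels
    return doc_weight * doc_polarity + sent_weight * avg_sentence_polarity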

How to Get Text Polarity with spaCy

To get started with spaCy we’ll need to download two spaCy libraries with pip in our terminal as shown below:

pip install spacy spacytextblob

We’ll also need to download a model. As usual we’ll download the `en_core_web_sm` model to get started. Run the below command in the terminal after the pip installs are finished:

python -m spacy download en_core_web_sm

Now that we've downloaded our libraries and model, let's get started with our code. We'll need to import `spacy` and `SpacyTextBlob` from `spacytextblob.spacytextblob`. SpacyTextBlob is the pipeline component that we'll use to get polarity. We'll start our program by loading the model we downloaded earlier and then adding the `spacytextblob` pipe to the `nlp` pipeline. Notice that we never explicitly call the `SpacyTextBlob` class; we just pass its name as a string to `nlp.add_pipe`. If you're using VSCode, you'll see that `SpacyTextBlob` is grayed out as if it isn't being used, but don't be fooled – the import is required to register the pipeline component even though we never call it directly.

Next we’ll choose a text to process. For this example, I simply wrote two decently positive sentences on The Text API, which we’ll show an example for later. Then all we have to do is send the text to a document via our `nlp` object and check its polarity score.

import spacy
from spacytextblob.spacytextblob import SpacyTextBlob
 
nlp = spacy.load("en_core_web_sm")
nlp.add_pipe('spacytextblob')
text = "The Text API is super easy to use and super useful for anyone who needs to do text processing. It's the best Text Processing web API and allows you to do amazing NLP without having to download or manage any models."
doc = nlp(text)
 
print(doc._.polarity)

Our spaCy model predicted our text’s polarity score at 0.5. It’s hard to really judge how “accurate” the polarity of something is, so we’ll go through the other two methods and I’ll comment on this later.

Text Polarity from spaCy
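
As a quick sanity check, you can run an obviously negative sentence through the same pipeline and compare the scores. This just reuses the `nlp` object we already built:

# the pipeline above already has spacytextblob added
negative_doc = nlp("This service is terrible and the documentation is confusing.")
print(negative_doc._.polarity)  # should come out clearly negative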

How to Get Text Polarity with NLTK

Now that we’ve covered how to get polarity via spaCy, let’s check out how to get polarity with the Natural Language Toolkit. As always, we’ll start out by installing the library and dependencies we’ll need.

pip install nltk

Once we install NLTK, we'll fire up an interactive Python shell in the terminal to download the NLTK resources we need with the commands below.

python
>>> import nltk
>>> nltk.download(["averaged_perceptron_tagger", "punkt", "vader_lexicon"])

Averaged Perceptron Tagger handles part of speech tagging. It's the best tagger in the NLTK library at the time of writing, so you'll probably use it for more than just polarity. Punkt is the tokenizer model NLTK uses to split text into sentences. I know what you're thinking:

Vader

But no, VADER actually stands for "Valence Aware Dictionary and sEntiment Reasoner". It's the lexicon that powers the sentiment analysis tool we need. Once we have everything downloaded, it's pretty simple to import and use. We import the `SentimentIntensityAnalyzer` class, create an instance, and call it on our text to score its polarity.

from nltk.sentiment import SentimentIntensityAnalyzer
 
sia = SentimentIntensityAnalyzer()
text = "The Text API is super easy to use and super useful for anyone who needs to do text processing. It's the best Text Processing web API and allows you to do amazing NLP without having to download or manage any models."
scores = sia.polarity_scores(text)
print(scores)

We should get a print out like the one below.

NLTK Text Polarity

This result tells us that none of the text is negative, 61.8% is neutral, and 38.2% of it is positive. Compound is a normalized sentiment score; you can see how it's calculated in the VADER package on GitHub. It's computed from the summed valence of the words rather than from the negative, neutral, and positive proportions, and it represents the overall polarity of the text on a -1 to 1 scale. So NLTK has calculated our sentence to be very positive.
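
If you want sentence-level detail rather than one score for the whole text, you could tokenize into sentences first and score each one with the same analyzer. A small sketch (this relies on the punkt model we downloaded above):

from nltk.tokenize import sent_tokenize

# score each sentence separately with the same analyzer
for sentence in sent_tokenize(text):
    print(sentence, sia.polarity_scores(sentence)["compound"])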

How to Get Text Polarity with The Text API

Finally, let's take a look at how to get a text polarity score from The Text API. A major advantage of using a web API like The Text API for text processing is that you don't need to download any machine learning libraries or maintain any models. All you need is the requests library, which you can install with the pip command below if you don't already have it, and a free API key from The Text API website.

pip install requests

When you land on The Text API’s homepage you should scroll all the way down and you’ll see a button that you can click to sign up for your free API key. 

Once you log in, your API key will be right at the top of the page. Now that we're all set up, let's dive into the code. All we're going to do is set up a request with headers that tell the server we're sending JSON and pass the API key, a body with the text we want to analyze, and the URL endpoint we're going to hit (in this case "https://app.thetextapi.com/text/text_polarity"), and then send the request and parse the response.

import requests
import json
from config import apikey
 
text = "The Text API is super easy to use and super useful for anyone who needs to do text processing. It's the best Text Processing web API and allows you to do amazing NLP without having to download or manage any models."
headers = {
    "Content-Type": "application/json",
    "apikey": apikey
}
body = {
    "text": text
}
url = "https://app.thetextapi.com/text/text_polarity"
 
response = requests.post(url, headers=headers, json=body)
polarity = json.loads(response.text)["text polarity"]
print(polarity)

Once we send off our request we’ll get a response that looks like the following:

The Text API Text Polarity

The Text API thinks that my praise of The Text API is roughly 0.575 polarity; that translates to like ~79% AMAZING (if 1 is AMAZING).
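
In case that ~79% figure seems arbitrary, it just comes from rescaling the -1 to 1 polarity range onto 0 to 100%. A one-liner makes the arithmetic explicit:

polarity = 0.575
# map [-1, 1] onto [0%, 100%]: -1 -> 0%, 0 -> 50%, 1 -> 100%
percent_positive = (polarity + 1) / 2 * 100
print(f"{percent_positive:.1f}%")  # 78.8%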

Why Are These Polarities So Different?

Earlier I mentioned that we’d discuss the different polarity scores at the end so here we are. We used three different methods to get the polarity of the same document of text, so why were our polarity scores so different? The obvious answer is that each method used a) a different model and b) a different way to calculate document polarity. However, there’s also another underlying factor at play here.

Remember that Key and Peele video from earlier? It's hard for people to understand the polarity of comments even with context, and machines don't really have the ability to understand context yet. A range of -1 to 1, without examples of what a 1 or a -1 actually looks like, is also hard to interpret. However, all three methods at least agree that the text is quite positive in general. Of course there are ways to improve the interpretability of these results, but that will be in a coming post!


The Best Way to do Named Entity Recognition (NER)

Named Entity Recognition (NER) is a common Natural Language Processing technique. It’s so often used that it comes in the basic pipeline for spaCy. NER can help us quickly parse out a document for all the named entities of many different types. For example, if we’re reading an article, we can use named entity recognition to immediately get an idea of the who/what/when/where of the article.

In this post we're going to cover three different ways you can implement NER in Python: with spaCy, with NLTK, and with The Text API.

What is Named Entity Recognition?

Named Entity Recognition, or NER for short, is the Natural Language Processing (NLP) topic about recognizing entities in a text document or speech file. Of course, this is quite a circular definition. In order to understand what NER really is, we’ll have to define what an entity is. For the purposes of NLP, an entity is essentially a noun that defines an individual, group of individuals, or a recognizable object. While there is not a TOTAL consensus on what kinds of entities there are, I’ve compiled a rather complete list of the possible types of entities that popular NLP libraries such as spaCy or Natural Language Toolkit (NLTK) can recognize. You can find the GitHub repo here.

List of Common Named Entities

Entity Type – Description of the NER object

  • PERSON – A person, usually recognized as a first and last name
  • NORP – Nationalities or Religious/Political Groups
  • FAC – The name of a Facility
  • ORG – The name of an Organization
  • GPE – The name of a Geopolitical Entity
  • LOC – A location
  • PRODUCT – The name of a product
  • EVENT – The name of an event
  • WORK OF ART – The name of a work of art
  • LAW – A law that has been published (US only as far as I know)
  • LANGUAGE – The name of a language
  • DATE – A date; it doesn't have to be an exact date, it could be a relative date like "a day ago"
  • TIME – A time; like a date, it doesn't have to be exact, it could be something like "middle of the day"
  • PERCENT – A percentage
  • MONEY – An amount of money, like "$100"
  • QUANTITY – A measurement of weight or distance
  • CARDINAL – A number, similar to quantity but not a measurement
  • ORDINAL – A number signifying a relative position, such as "first" or "second"
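
If you ever forget what one of these labels means, spaCy can describe it for you via `spacy.explain`, which takes a label string and returns a short description:

import spacy

# look up spaCy's own description of an entity label
print(spacy.explain("GPE"))   # e.g. "Countries, cities, states"
print(spacy.explain("NORP"))  # e.g. "Nationalities or religious or political groups"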

How Can I Implement NER in Python?

Earlier, I mentioned that you can implement NER with both spaCy and NLTK. The difference between these libraries is that NLTK is built for academic/research purposes while spaCy is built for production use. Both are free, open source libraries, and NER is extremely easy to implement with either of them. In this article I'll show you how to get started implementing your own Named Entity Recognition programs.

spaCy Named Entity Recognition (NER)

We’ll start with spaCy, to get started run the commands below in your terminal to install the library and download a starter model.

pip install spacy
python -m spacy download en_core_web_sm

We can implement NER in spaCy in just a few lines of code. All we need to do is import the spacy library, load a model, give it some text to process, and then call the processed document to get our named entities. For this example we'll be using the "en_core_web_sm" model we downloaded earlier; this is the "small" model trained on web text. The text we'll use is just a sentence I made up. We should expect the NER to identify Molly Moon as a person (NER isn't advanced enough to detect that she is a cow), to identify the United Nations as an organization, and the Climate Action Committee as a second organization.

import spacy
 
nlp = spacy.load("en_core_web_sm")
 
text = "Molly Moon is a cow. She is part of the United Nations Climate Action Committee."
 
doc = nlp(text)
 
for ent in doc.ents:
    print(ent.text, ent.label_)

After we run this we should see a result like the one below. We see that this spaCy model is unable to separate the United Nations and its Climate Action Committee as separate orgs.

named entity recognition spacy results

Named Entity Recognition with NLTK

Let’s take a look at how to implement NER with NLTK. As with spaCy, we’ll start by installing the NLTK library and also downloading the extensions we need.

pip install nltk

After we run our initial pip install, we’ll need to download four extensions to get our Named Entity Recognition program running. I recommend simply firing up Python in your terminal and running these commands as the libraries only need to be downloaded once to work, so including them in your NER program will only slow it down.

python
>>> import nltk
>>> nltk.download("punkt")
>>> nltk.download("averaged_perceptron_tagger")
>>> nltk.download("maxent_ne_chunker")
>>> nltk.download("words")

Punkt is the tokenizer model NLTK uses to split text into sentences and words. Averaged Perceptron Tagger is the default part of speech tagger for NLTK. Maxent NE Chunker is the Named Entity Chunker for NLTK. The Words library is an NLTK corpus of words. We can already see that NLTK is far more customizable, and consequently more complex to set up. Let's dive into the program to see how we can extract our named entities.
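
As a convenience, `nltk.download` also accepts a list, so you can grab all four resources in a single call from the interactive shell:

import nltk

# one call fetches everything the NER example needs
nltk.download(["punkt", "averaged_perceptron_tagger", "maxent_ne_chunker", "words"])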

Once again we simply start by importing our library and declaring our text. Then we’ll tokenize the text, tag the parts of speech, and chunk it using the named entity chunker. Finally, we’ll loop through our chunks and display the ones that are labeled.

import nltk
 
text = "Molly Moon is a cow. She is part of the United Nations' Climate Action Committee."
 
tokenized = nltk.word_tokenize(text)
pos_tagged = nltk.pos_tag(tokenized)
chunks = nltk.ne_chunk(pos_tagged)
for chunk in chunks:
    if hasattr(chunk, 'label'):
        print(chunk)

When you run this program in your terminal you should see an output like the one below.

named entity recognition results – nltk

Notice that NLTK has identified "Climate Action Committee" as a Person and "Moon" as a Person. That's clearly incorrect, but this is all on pre-trained data. Also, this time I let it print out the entire chunk, which shows the parts of speech as well. NLTK has tagged all of these tokens as "NNP", which signals a proper noun.
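
If you want just the entity string and its label rather than the whole chunk, note that each labeled chunk is an nltk Tree whose leaves are (word, POS) pairs. A small sketch of pulling those out:

# extract the entity text and its label from each labeled chunk
for chunk in chunks:
    if hasattr(chunk, 'label'):
        entity = " ".join(token for token, pos in chunk.leaves())
        print(entity, chunk.label())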

A Simpler and More Accurate NER Implementation

Alright, now that we've discussed how to implement NER with open source libraries, let's take a look at how we can do it without ever having to download extra packages and machine learning models! We can simply ping a web API that already has a pre-trained model and pipeline for tons of text processing needs. We'll be using the open beta of The Text API; scroll down to the bottom of the page and grab your API key.

The only library we need to install is the requests library, and we only need to be able to send an API request as outlined in How to Send a Web API Request. So, let’s take a look at the code.

All we need to do is construct a request to send to the endpoint, send the request, and parse the response. The API key should be passed in the headers as "apikey", and we should also specify that the content type is JSON. The body simply needs to pass the text in. The endpoint we'll hit is "https://app.thetextapi.com/text/ner". Once we get our response back, we'll use the json library (native to Python) to parse it.

import requests
import json
from config import apikey
 
text = "Molly Moon is a cow. She is part of the United Nations' Climate Action Committee."
headers = {
    "Content-Type": "application/json",
    "apikey": apikey
}
body = {
    "text": text
}
url = "https://app.thetextapi.com/text/ner"
 
response = requests.post(url, headers=headers, json=body)
ner = json.loads(response.text)["ner"]
print(ner)

Once we send this request, we should see an output like the one below.

named entity recognition with the text api

Woah! Our API actually recognizes all three of the named entities successfully! Not only is using The Text API simpler than downloading multiple models and libraries, but in this use case, we can see that it’s also more accurate.


Natural Language Processing: Part of Speech Tagging

Part of Speech (POS) Tagging is an integral part of Natural Language Processing (NLP). The first step in most state of the art NLP pipelines is tokenization. Tokenization is the separating of text into "tokens". Tokens are generally regarded as individual pieces of language – words, whitespace, and punctuation.

Once we tokenize our text, we can tag it with the part of speech; note that this article only covers the details of part of speech tagging for English. Part of speech tagging is done on all tokens except for whitespace. We'll take a look at how to do POS tagging with the two most popular and easy to use NLP Python libraries – spaCy and NLTK – coincidentally also my two favorite NLP libraries to play with.

What is Part of Speech (POS) Tagging?

Traditionally, there are nine parts of speech taught in English grammar – nouns, pronouns, verbs, adjectives, adverbs, determiners, prepositions, conjunctions, and interjections. We'll see below that, for NLP purposes, we'll actually be using far more than nine tags. The spaCy library tags 19 different parts of speech and over 50 "tags" (depending how you count different punctuation marks).

In spaCy, "tags" are more fine-grained parts of speech. NLTK's part of speech tagger uses 34 parts of speech, which are closer in granularity to spaCy's tags than to its coarse parts of speech. We'll take a look at the parts of speech labels from both, and then at spaCy's fine-grained tags. You can find the GitHub repo that contains the code for POS tagging here.
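
If you're curious which fine-grained tags your installed model actually knows, you can ask the tagger component for its label set. A quick sketch, assuming the standard en_core_web_sm pipeline layout:

import spacy

nlp = spacy.load("en_core_web_sm")
# the tagger component holds the fine-grained (Penn Treebank style) tag set
tags = nlp.get_pipe("tagger").labels
print(len(tags), tags)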

In this post, we’ll go over:

  • List of spaCy automatic parts of speech (POS)
  • List of NLTK parts of speech (POS)
  • Fine-grained Part of Speech (POS) tags in spaCy
  • spaCy POS Tagging Example
  • NLTK POS Tagging Example

List of spaCy parts of speech (automatic):

POS – Description

  • ADJ – Adjective: big, purple, creamy
  • ADP – Adposition: in, to, during
  • ADV – Adverb: very, really, there
  • AUX – Auxiliary: is, has, will
  • CONJ – Conjunction: and, or, but
  • CCONJ – Coordinating conjunction: either…or, neither…nor, not only
  • DET – Determiner: a, an, the
  • INTJ – Interjection: psst, oops, oof
  • NOUN – Noun: cat, dog, frog
  • NUM – Numeral: 1, one, 20
  • PART – Particle: 's, 'nt, 'd
  • PRON – Pronoun: he, she, me
  • PROPN – Proper noun: Yujian Tang, Michael Jordan, Andrew Ng
  • PUNCT – Punctuation: commas, periods, semicolons
  • SCONJ – Subordinating conjunction: if, while, but
  • SYM – Symbol: $, %, ^
  • VERB – Verb: sleep, eat, run
  • X – Other: asdf, xyz, abc
  • SPACE – Space: whitespace, lol

List of NLTK parts of speech:

POS – Description

  • CC – Coordinating Conjunction: either…or, neither…nor, not only
  • CD – Cardinal Digit: 1, 2, twelve
  • DT – Determiner: a, an, the
  • EX – Existential There: "there" used for introducing a topic
  • FW – Foreign Word: bonjour, ciao, 你好
  • IN – Preposition/Subordinating Conjunction: in, at, on
  • JJ – Adjective: big
  • JJR – Comparative Adjective: bigger
  • JJS – Superlative Adjective: biggest
  • LS – List Marker: first, A., 1), etc
  • MD – Modal: can, cannot, may
  • NN – Singular Noun: student, learner, enthusiast
  • NNS – Plural Noun: students, programmers, geniuses
  • NNP – Singular Proper Noun: Yujian Tang, Tom Brady, Fei Fei Li
  • NNPS – Plural Proper Noun: Americans, Democrats, Presidents
  • PDT – Predeterminer: all, both, many
  • POS – Possessive Ending: 's
  • PRP – Personal Pronoun: her, him, yourself
  • PRP$ – Possessive Pronoun: her, his, mine
  • RB – Adverb: occasionally, technologically, magically
  • RBR – Comparative Adverb: further, higher, better
  • RBS – Superlative Adverb: best, biggest, highest
  • RP – Particle: aboard, into, upon
  • TO – Infinitive Marker: "to" when it is used as an infinitive marker or preposition
  • UH – Interjection: uh, wow, jinkies!
  • VB – Verb: ask, assemble, brush
  • VBD – Verb Past Tense: dipped, diced, wrote
  • VBG – Verb Gerund: stirring, showing, displaying
  • VBN – Verb Past Participle: condensed, refactored, unsettled
  • VBP – Verb Present Tense, not 3rd person singular: predominate, wrap, resort
  • VBZ – Verb Present Tense, 3rd person singular: bases, reconstructs, emerges
  • WDT – Wh-determiner: that, what, which
  • WP – Wh-pronoun: that, what, whatever
  • WRB – Wh-adverb: how, however, wherever

We can see that NLTK and spaCy tag parts of speech differently. That's because there are many ways to tag parts of speech, and the way NLTK splits them up is advantageous for academic purposes. Above, I've only shown spaCy's coarse, automatic POS tagging, but spaCy actually has fine-grained part of speech tagging as well; it calls these "tags" instead of "parts of speech". I'll break down how parts of speech map to tags in spaCy below.

List of spaCy Part of Speech Tags (Fine grained)

POS – Mapped Tags

ADJ
  • AFX – affix: "pre-"
  • JJ – adjective: good
  • JJR – comparative adjective: better
  • JJS – superlative adjective: best
  • PDT – predeterminer: half
  • PRP$ – possessive pronoun: his, her
  • WDT – wh-determiner: which
  • WP$ – possessive wh-pronoun: whose

ADP
  • IN – subordinating conjunction or preposition: "in"

ADV
  • EX – existential there: there
  • RB – adverb: quickly
  • RBR – comparative adverb: quicker
  • RBS – superlative adverb: quickest
  • WRB – wh-adverb: when

CONJ
  • CC – coordinating conjunction: and

DET
  • DT – determiner: this, a, an

INTJ
  • UH – interjection: uh, uhm, ruh-roh!

NOUN
  • NN – noun: sentence
  • NNS – plural noun: sentences
  • WP – wh-pronoun: who

NUM
  • CD – cardinal number: three, 5, twelve

PART
  • POS – possessive ending: 's
  • RP – particle adverb: back (put it "back")
  • TO – infinitive to: "to"

PRON
  • PRP – personal pronoun: I, you

PROPN
  • NNP – proper singular noun: Yujian Tang
  • NNPS – proper plural noun: Pythonistas

PUNCT
  • -LRB- – left round bracket: "("
  • -RRB- – right round bracket: ")"
  • (actual punctuation marks): , : ; . " ' (etc)
  • HYPH – hyphen
  • LS – list item marker: a., A), iii.
  • NFP – superfluous punctuation

SYM
  • # and $ (like punctuation, these are pretty self explanatory)
  • SYM – symbol

VERB
  • BES – auxiliary "be"
  • HVS – "have": 've
  • MD – auxiliary modal: could
  • VB – base form verb: go
  • VBD – past tense verb: was
  • VBG – gerund: going
  • VBN – past participle verb: lost
  • VBP – non 3rd person singular present verb: want
  • VBZ – 3rd person singular present verb: wants

X
  • ADD – email
  • FW – foreign word
  • GW – additional word
  • XX – unknown

How do I Implement POS Tagging?

Part of Speech Tagging is a cornerstone of Natural Language Processing. It is one of the most basic parts of NLP, and as a result it comes standard in any respectable NLP library. Below, I'm going to cover how you can do POS tagging in just a few lines of code with spaCy and NLTK.

spaCy POS Tagging

We’ll start by implementing part of speech tagging in spaCy. The first thing we’ll need to do is install spaCy and download a model.

pip install spacy
python -m spacy download en_core_web_sm

Once we have our required libraries downloaded, we can start. Like I said above, POS tagging is one of the cornerstones of natural language processing. It's so important that the spaCy pipeline does it automatically upon tokenization. For this example I'm using a larger piece of text; this text about solar energy comes from How Many Solar Farms Does it Take to Power America?

First we import spaCy, then we load our NLP model, then we feed the NLP model our text to create our NLP document. After creating the document, we can simply loop through it and print out the different parts of the tokens. For this example, we’ll print out the token text, the token part of speech, and the token tag.

import spacy
 
nlp = spacy.load("en_core_web_sm")
 
text = """This is where the calculation can get tricky. Here’s the thing about solar energy. Solar energy comes from the sun. That means solar panels cannot produce energy 24 hours a day. They only produce energy during sunlight hours. That energy then has to be stored somewhere while it is not being used. Energy storage is a whole other topic in and of itself. Let me get back to the point, there’s only an average of 4 peak sunlight hours a day. A solar panel may get more than that, but let’s take a conservative estimate of our solar power generation and confine it to those 4 hours only.
Back to the calculations. At 4 acres of solar panels to generate a megawatt-hour and 4 hours of power generation time a day, a 1 MW solar farm would generate 4 MWh of power over 4 acres every day. At 110,000 megawatt-hours of power needed a day to power America, we would need about 110,000 acres of solar farm. 110,000 acres? That sounds huge, that’s more land than the entire Mojave desert. It’s not as daunting as it sounds, there are 1.9 billion acres in the continental United States, and 110,000 acres is only slightly more than 0.5 percent of the total land of the continental US."""
 
doc = nlp(text)
 
for token in doc:
    print(token.text, token.pos_, token.tag_)

Once you run this you should see an output like the one pictured below.

Part of Speech Tagging Results – spaCy

NLTK POS Tagging

Now let's take a look at how to do POS tagging with the Natural Language Toolkit. We'll get started the same way we did with spaCy, by downloading the library and the models we'll need. We're going to install NLTK and download the "punkt" tokenizer model along with the "averaged_perceptron_tagger" model that nltk.pos_tag relies on.

pip install nltk
python
>>> import nltk
>>> nltk.download("punkt")
>>> nltk.download("averaged_perceptron_tagger")

Once we have our libraries downloaded, we can fire up our favorite Python editor and get started. Like with spaCy, there’s only a few steps we need to do to start tagging parts of speech with the NLTK library. First, we need to tokenize our text. Then, we simply call the NLTK part of speech tagger on the tokenized text and voila! We’re done. I’ve used the exact same text from above.

import nltk
from nltk.tokenize import word_tokenize
 
text = """This is where the calculation can get tricky. Here’s the thing about solar energy. Solar energy comes from the sun. That means solar panels cannot produce energy 24 hours a day. They only produce energy during sunlight hours. That energy then has to be stored somewhere while it is not being used. Energy storage is a whole other topic in and of itself. Let me get back to the point, there’s only an average of 4 peak sunlight hours a day. A solar panel may get more than that, but let’s take a conservative estimate of our solar power generation and confine it to those 4 hours only.
Back to the calculations. At 4 acres of solar panels to generate a megawatt-hour and 4 hours of power generation time a day, a 1 MW solar farm would generate 4 MWh of power over 4 acres every day. At 110,000 megawatt-hours of power needed a day to power America, we would need about 110,000 acres of solar farm. 110,000 acres? That sounds huge, that’s more land than the entire Mojave desert. It’s not as daunting as it sounds, there are 1.9 billion acres in the continental United States, and 110,000 acres is only slightly more than 0.5 percent of the total land of the continental US."""
 
tokenized = word_tokenize(text)
tagged = nltk.pos_tag(tokenized)
for tag in tagged:
    print(tag)

Once we’re done, we simply run this in a terminal and we should see an output like the following.

Parts of Speech Tagging Results – NLTK

You can compare and see that NLTK and spaCy have pretty much the same tagging at the tag level.
