
Find the Most Common Named Entities by Type

Natural Language Processing techniques have come far in the last decade, and NLP will become an even more important field in the coming decade as we accumulate more and more unstructured text data. One of the basic applications of NLP is Named Entity Recognition (NER). However, NER by itself just gives us the names of entities such as the people, locations, organizations, and times mentioned in a document, which isn’t all that useful for getting insight from a text on its own. You know what would be useful though? Finding the most commonly named entities in a text.

This post will cover how to find the most common named entities in a text. Before we go over the code and technique to do that, we’ll also briefly cover what NER is. In this post we’ll go over:

  • What is Named Entity Recognition (NER)
  • Performing NER on a Text
    • What does a returned NER usually look like?
  • Processing the returned NERs to find the most common ones
    • Splitting the Named Entities by Type
    • Sorting Each Type of Named Entities
  • Example of the Most Common Named Entities from a NER of Tweets

What is Named Entity Recognition (NER)?

Named Entity Recognition (NER) is the NLP technique used to extract certain types of entities from a text. This post on the Best Way to do Named Entity Recognition (NER) has a list of all the entity types for both spaCy and NLTK. In general, you can think of NER as the process of extracting the people, places, times, organizations, and other conceptual entities from a text.

Performing Named Entity Recognition (NER) on a Text

There are many ways to do Named Entity Recognition. In the Best Way to do Named Entity Recognition (NER) post we went over how to do NER with spaCy, NLTK, and The Text API. In this example, we’ll use The Text API, since it produced the highest quality NER of the options we compared in that post. To follow along, you’ll need to sign up for a free API key at The Text API and install the requests module with the line below.

pip install requests

All we’re going to do here is set up the request by passing the URL, the headers, and the body, and then parse the response. The headers will tell the server that we’re looking to send a JSON request and pass the API key. The body will pass the text we’re trying to do NER on. You can skip this section; we will use an example return value below.

# imports and API key (sign up at The Text API for a free key)
import requests
import json

apikey = "<your API key here>"  # placeholder - fill in your own key
ner_url = "https://app.thetextapi.com/text/ner"
headers = {
    "Content-Type": "application/json",
    "apikey": apikey
}
text = "<some text here>"  # placeholder - the text you want to run NER on
body = {
    "text": text
}
response = requests.post(ner_url, headers=headers, json=body)
ner = json.loads(response.text)["ner"]

What Does a Returned List of Named Entities Look Like?

I didn’t include a text in the above example because we’re going to be using text pulled from Twitter. To find out how to pull Tweets, follow the guide on how to Scrape the Text from All Tweets for a Search Term. The output below was generated by scraping all the text from tweets by Elon Musk and then running it through NER.

You can see that the result is a list of lists. Each inner list contains two entries: the entity type and the name of the entity. It’s important to know the data structure so we can manipulate it and extract the data we want.

[['PERSON', '@cleantechnica'], ['PERSON', 'Webb'], ['ORG', '@NASA'], ['DATE', 'a crazy tough year'], ['ORG', 'Tesla'], ['TIME', '6pm Christmas Eve'], ['TIME', 'last hour'], ['DATE', 'the last day'], ['DATE', 'two days'], ['DATE', 'Christmas'], ['ORG', 'WSJ'], ['ORG', 'Wikipedia'], ['ORG', 'Tesla'], ['DATE', 'holiday'], ['DATE', 'today'], ['ORG', 'Tesla'], ['DATE', 'today'], ['PERSON', 'Doge'], ['ORG', '@SkepticsGuide']]

Processing the NERs to Find the Most Common Ones

It’s great to run NER and get all the named entities in a text, but just getting all the named entities isn’t really that useful to us. It’s much more applicable to extract the most common entities of each type from a document. Let’s take a look at how to do that.

Splitting Named Entities by Type of Entity

The first thing we need to do to find the most common named entities recognized in a document is to process the list of lists into a dictionary keyed by entity type. We’ll create a function, build_dict, that will take one parameter, ners, which is a list. Note that even though we expect a list on the surface, we actually expect a list of lists like the one shown above.

In our function, the first thing we’ll do is create an empty dictionary. This is the dictionary that will contain the entity types and their counts that we will return later. Next, we’ll loop through the ners list. Remember how the ners list is set up. The entity type is the first entry and the entity name is the second. 

If the entity type is already in our dictionary, we’ll check for the entity name in its entry. If the entity name is in our inner dictionary, we’ll increment its count by 1. Otherwise, we’ll add the entry to our inner dictionary and set the count to 1. When the entity type is not in the dictionary, we’ll create an inner dictionary with one entry, the entity name set to a count of 1. After looping through all the ners, we return the outer dictionary.

# build dictionary of NERs
# extract most common NERs
# expects list of lists
def build_dict(ners: list):
    outer_dict = {}
    for ner in ners:
        entity_type = ner[0]
        entity_name = ner[1]
        if entity_type in outer_dict:
            if entity_name in outer_dict[entity_type]:
                outer_dict[entity_type][entity_name] += 1
            else:
                outer_dict[entity_type][entity_name] = 1
        else:
            outer_dict[entity_type] = {
                entity_name: 1
            }
    return outer_dict

Sorting Each Type of Named Entity

The next thing we want to do is sort the entities. We’ll create a function called most_common that will take one input, ners, the list of lists. This is the same input that build_dict takes. Note that we can opt for this function to take a dictionary instead and chain the call to build_dict in the orchestrator. For this example, however, we will call build_dict inside of this function.

Once we have the dictionary of NERs split by entity type, let’s create an empty dictionary that will hold the most commonly mentioned entity of each type. Next, let’s loop through each of the NER types in the built dictionary. For each of those types, we’ll create a sorted list based on the value (count) of each named entity in that type. Then, we’ll set the value of the ner_type in the dictionary of most common NER types to the first entry in the sorted list. Finally, we return the dictionary of most common NER types.

# return most common entities after building the NERS out
def most_common(ners: list):
    _dict = build_dict(ners)
    mosts = {}
    for ner_type in _dict:
        sorted_types = sorted(_dict[ner_type], key=lambda x: _dict[ner_type][x], reverse=True)
        mosts[ner_type] = sorted_types[0]
    return mosts
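
As a side note, the same counting and selection logic can be written more compactly with Python’s built-in collections.Counter. Here’s a minimal alternative sketch (the function name here is mine, not part of the code above):

# alternative sketch using collections.Counter
from collections import Counter

def most_common_with_counter(ners: list):
    # count entity names separately for each entity type
    counts = {}
    for entity_type, entity_name in ners:
        counts.setdefault(entity_type, Counter())[entity_name] += 1
    # Counter.most_common(1) returns a list like [(name, count)]
    return {entity_type: counter.most_common(1)[0][0]
            for entity_type, counter in counts.items()}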

Example of the Most Common Named Entities From a NER of Tweets

Based on the example NERs we extracted from the Tweets above, running the code gives us a dictionary with one entry for each NER type (PERSON, ORG, DATE, and TIME), mapping the type to its most commonly named entity.
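
For the example NER list shown earlier, the result would look something like this (the exact values depend on the tweets you scrape):

{'PERSON': '@cleantechnica', 'ORG': 'Tesla', 'DATE': 'today', 'TIME': '6pm Christmas Eve'}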


Create Your Own AI Content Moderator – Part 3

As content on the web increases, content moderation becomes more and more important to protect sensitive groups such as children and people who have suffered from trauma. We’re going to learn how to create your own AI content moderator using Python, Selenium, Beautiful Soup 4, and The Text API.

Our AI content moderator will be built in three parts: a webscraper that scrapes all the text from a page, a content moderation module that uses AI via The Text API, and an orchestrator that puts it all together.

In this post we’ll create the orchestrator to put the webscraper and the AI content moderation module together. 

To create this orchestrator we need to:

  • Import the Webscraper and Content Moderation Functions
  • Create Orchestrator Function
    • Get URL from Input
    • Scrape the Page for All the Text
    • Moderate the Scraped Text
  • Test Orchestration

Import the Webscraper and Content Moderation Functions

An orchestrator is simply a module that “orchestrates” the rest of the functions and modules in the software. In our case, we only have two other modules that we’re working with so our orchestrator only needs these two modules. Each module simply contains one function so we’ll just import each of those functions from each of their modules.

# imports
from webscraper import scrape_page_text
from content_moderator import moderate

Create Orchestrator Function

With our imports in place, we now need to create our orchestrator. Our orchestrator function will take a URL from the user, scrape the text, and moderate the text. After moderating the text, it will return the content moderation rating and whether or not it contains a triggering word.

Get URL from Input

We want to be able to run our AI content moderation on any URL. So, the first thing we’ll do when we create our orchestrate function is get the URL. All we have to do for this is call the Python input function. We’ll use the input to prompt the user for a URL and save it to a variable.

# function
def orchestrate():
    # ask user for website URL
    url = input("What URL would you like to moderate? ")

Scrape the Page for All the Text

After we get the URL, we’ll scrape it. First, we’ll print out a statement to tell the user that we’re scraping the page text. Then we’ll call the scrape_page_text method we imported from the webscraper and pass in the URL. We’ll save the returned text into a variable.

    # call webscraper on the URL
    print("Scraping Page Text ...")
    text = scrape_page_text(url)

Moderate the Scraped Text

Now that we have the text from the scraped URL, we have to moderate it. We’ll tell the user that we’re moderating the page text, and then moderate it. We will use the AI moderation function that we created earlier and pass it the text. Then we’ll save the output as the rating and whether or not there’s a trigger word.

    # call content moderator on the scraped data
    print("Moderating Page Text ...")
    rating, trigger = moderate(text)

Full Code for Orchestration Function

Here’s the full code for the orchestration function.

# function
def orchestrate():
    # ask user for website URL
    url = input("What URL would you like to moderate? ")
    # call webscraper on the URL
    print("Scraping Page Text ...")
    text = scrape_page_text(url)
    # call content moderator on the scraped data
    print("Moderating Page Text ...")
    rating, trigger = moderate(text)
    # return verdict
    return rating, trigger

Test Orchestration

Now let’s test our orchestration function. All we’re going to do is print out the result of calling orchestrate. We’ll test it on my article about finally seeing the results of PythonAlgos’ effect on helping people learn Python.

print(orchestrate())

We should see an output like the one below.

Testing Orchestration of the AI Content Moderator


Ask NLP: The Media on the Obama Presidency Over Time

Recently we’ve used NLP to explore the media’s portrayal of Obama in two parts: the most common phrases used in headlines about him, and an AI summary of the headlines about him. We also explored the who/what/when/where of the article headlines we got when we pulled the Obama-related headlines from the NY Times. In this post, we’ll be looking at the sentiment surrounding his presidency over time.

Click here to skip directly to graphs of the headline sentiments.

To follow along with this tutorial you’ll need to get your API keys from the NY Times and The Text API. You’ll also need to use your package manager to install the `requests` and `matplotlib` modules. You can install them with the line below in your terminal.

pip install requests matplotlib

Setting Up the API Request

We’ve been here many times before. Every time we start a program, we want to handle the imports. As with many of our prior programs, we’re going to be using the `requests` and `json` libraries for the API request and parsing. We’ll also be using `matplotlib.pyplot` to plot the sentiment over time. Once again, I’m using the `sys` library purely because I stored my API keys in a parent directory and we need access to them for this project. I also import the base URL from the config; this is the API endpoint, “https://app.thetextapi.com/text/”.

# import libraries
import requests
import json
import matplotlib.pyplot as plt
import sys
sys.path.append("../..")
from nyt.config import thetextapikey, _url
 
# set up request headers and URL
headers = {
    "Content-Type": "application/json",
    "apikey": thetextapikey
}
polarity_by_sentence_url = _url + "polarity_by_sentence"

Getting the Sentiments for Each Headline in Each Year

Everything’s set up, let’s get the actual sentiments. You’ll see that, just like in the last post about running Named Entity Recognition, we’re going to set up a function that performs a loop through all the years. Alternatively, you could set up a function that only does one year and then set up a loop to call that function on each year. We’ll do this for the next function.

In each loop, we’ll open up the `txt` file we downloaded when we got the Obama Headlines and read that into a list. Then we’ll join the list into one single string to send to the endpoint. For this request body, we don’t have any extra parameters to adjust, we’ll just send in the text. After we send in the text, we’ll parse the response.
The response will be in the form of a list of lists. To save it to a `txt` file, we’ll loop through each element in the list and write the second element, followed by a colon, followed by the first. Why the second element and then the first? The way the response is returned, as outlined in the documentation, is the polarity and then the sentence; to make our document more readable, we want to put the sentence first.

# loop through each year
def get_polarities():
    for i in list(range(2008, 2018)):
        with open(f"obama_{i}.txt", "r") as f:
            headlines = f.readlines()
        # combine list of headlines into one text
        text = "".join(headlines)
       
        # set up request bodies
        body = {
            "text": text
        }
        # parse responses
        response = requests.post(url=polarity_by_sentence_url, headers=headers, json=body)
        _dict = json.loads(response.text)
        # save to text file
        with open(f"obama/{i}_sentence_polarities.txt", "w") as f:
            for entry in _dict["polarity by sentence"]:
                f.write(f"{entry[1]} : {entry[0]}")
 
get_polarities()

Plotting the Sentiments For Each Year

Now that we’ve gotten all the polarity values, we’re ready to plot them. As I said above, we’ll be running this function with one parameter, the `year`. We will open each file, read the entries in as a list, and then iterate through them to get their polarity values. Notice that I wrap splitting the entries in a `try/except` block, just in case there were any errors or anomalies in the file we wrote earlier based on the original data.

As we loop through, we’ll add each polarity value to a list. At the end of looping through each of the titles and their sentiments, we’ll create a second list that is the length of the list of sentiment values. This one will contain values from 0 to however many headlines we processed. We run this function on each year from 2008 to 2017, and the plots are below.

# plot each datapoint
def plot_polarities(year):
    with open(f"obama/{year}_sentence_polarities.txt", "r") as f:
        entries = f.readlines()
    ys = []
    for entry in entries:
        try:
            _entry = entry.split(" : ")
            ys.append(float(_entry[1]))
        except:
            continue
    xs = list(range(len(ys)))
    plt.plot(xs, ys)
    plt.title(f"Obama Sentiments, {year}”)
    plt.xlabel("Headline Number")
    plt.ylabel("Average Polarity")
    plt.show()
   
# plot each year
for year in range(2008, 2018):
    plot_polarities(year)

Sentiment of Each Obama Headline from 2008 to 2017


Twitter Sentiment for Stocks? Starbucks 11/29/21

Updated 6:19pm PST 11/29/2021 – Our sentiment prediction was right! Next step is to predict how much it’ll go up.

Recently I’ve been playing around with sentiment analysis on Tweets a lot. I discovered the Twitter API over the Thanksgiving holidays and it’s like Christmas came early. Sort of like how Christmas comes earlier to malls every year. I have applied for upgraded access already because it’s been so fun and I’m hoping they grant it to me soon. My sentiment analysis of Black Friday on Twitter was quite popular, getting over 400 views this last weekend! Other than just holidays though, I wanted to see if analyzing Twitter sentiment could be used for a real life application. THIS IS NOT FINANCIAL ADVICE AND SHOULD NOT BE TAKEN AS SUCH.

I decided that hey, I like to play with Twitter’s API, Natural Language Processing, and stocks, why not see if I can combine all of them? Thus, the idea for using Twitter sentiment to predict stocks was born for me. I’ve always been a big fan of Starbucks, both the actual store and the stock. #SBUX has made me a lot of money over the past couple years. So this post will be about the Starbucks stock and seeing how Twitter does in predicting its performance for just one day. Click here to skip directly to the results.

This project will be built with two files. You’ll need not only access to the Twitter API linked above, but also a free API key from The Text API for the sentiment analysis part. Make sure you save the Bearer Token from Twitter and your API key from The Text API in a safe place; I stored them in a config file. You’ll also need to install the requests library for this. You can install that in the command line with the command below:

pip install requests

Using the Twitter API to Get Tweets About Starbucks

As we always do, we’ll get started by importing the libraries we need. We’ll need the requests library we installed earlier to send off HTTP requests to Twitter. We will also need the json library to parse the response. I have also imported my Twitter Bearer Token from my config here. As I said above, you may choose to store and access this token however you’d like.

import requests
import json
from config import bearertoken

Once we’ve imported the libraries and the Bearer Token, we’ll set up the endpoint and header for the request. You can find these in the Twitter API documentation. We need to use the recent search endpoint (only goes back last 7 days). The only header we need is the Authorization header passing in the Bearer token.

search_recent_endpoint = "https://api.twitter.com/2/tweets/search/recent"
headers = {
    "Authorization": f"Bearer {bearertoken}"
}

Creating Our Twitter Search Function

Everything to search the Twitter API is set up and ready to go. The reason we declared the headers and URL outside of the function is that they may be useful in a context outside of the function. Now let’s define our search function. Our search function will take one parameter, a search term in the form of a string. We will use our search term to create a set of search parameters. In this example, we will create a query looking for English Tweets that contain our term, have no links, and are not retweets. We are also going to set the maximum number of returned results to 100.

With our parameters, headers, and URL set up, we can now send our request. We use the requests module to send a request and use the json module to parse the text of the returned response. Then we open up a file and save the JSON to that file.

# automatically builds a search query from the requested term
# looks for english tweets with no links that are not retweets
# saves the latest 100 tweets into a json
def search(term: str):
    params = {
        "query": f'{term} lang:en -has:links -is:retweet',
        'max_results': 100
    }
    response = requests.get(url=search_recent_endpoint, headers=headers, params=params)
    res = json.loads(response.text)
    with open(f"{term}.json", "w") as f:
        json.dump(res, f)

Once we’ve set up the search function, we simply prompt the user for what term they’d like to search for and then call our search function on that term.

term = input("What would you like to search Twitter for? ")
search(term)

When we run our program, it will prompt us in the terminal for a search term and then save the returned Tweets. The saved JSON file should look something like this:
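
Here’s a rough sketch of the shape of that file, following the Twitter v2 recent search response format (the IDs and tweet text below are just illustrative placeholders):

{
    "data": [
        {"id": "146...", "text": "starbucks holiday cups are back ..."},
        {"id": "146...", "text": "waiting in the starbucks drive thru again ..."}
    ],
    "meta": {"newest_id": "146...", "oldest_id": "146...", "result_count": 100}
}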

Analyzing the Tweets for Average Sentiment

Now let’s get into the second part, the fun part, the part you’re all here for, the sentiment analysis. As I said earlier, we’ll be using The Text API for this. If you don’t already have a free API key, go over to the site and grab one. We won’t be needing any Python libraries we didn’t already install for using the Twitter API so we can dive right into the code. I created a second file for this part to follow the rule of modularity in code. You can opt to do this in the same file and you’ll only need the third import – The Text API key.

Setting Up and Building the Request

As always, we’ll want to get started by importing our libraries. Just like before we’ll need the requests and json library and we will do the same thing with them as above – sending the HTTP request and parsing the response. For this file we’ll import our API key from The Text API instead of the Twitter Bearer Token. 

We have a couple other differences as well. The URL endpoint we’ll be hitting is going to be the polarity_by_sentence URL. The headers we need to send will tell the server that we’re sending JSON content and also pass in the API key through an apikey keyword.

import requests
import json
from config import text_apikey
 
text_url = "https://app.thetextapi.com/text/"
polarity_by_sentence_url = text_url+"polarity_by_sentence"
headers = {
    "Content-Type": "application/json",
    "apikey": text_apikey
}

Just like we did with the Twitter API, we’ll need to build a request for The Text API. Our build_request function will take in a term in the form of a string. We’ll use this term to open the corresponding JSON file. Then we’ll combine all the text from the tweets to form a final text string that we will send to The Text API to be analyzed. Finally, we’ll create a body in the form of a JSON that we will send to the endpoint and return that JSON body.

# build request
def build_request(term: str):
    with open(f"{term}.json", "r") as f:
        entries = json.load(f)
 
    text = ""
    for entry in entries["data"]:
        text += entry["text"] + " "
 
    body = {
        "text": text
    }
    return body

Getting the Average Sentiment

Okay so here we’re actually going to get the average Text Polarity, but that’s about synonymous with sentiment. Text polarity tells us how positive or negative a piece of text was, sentiment is usually used in the same way. Let’s create a polarity_analysis function that will take in a dictionary as a parameter. The dictionary input will be the JSON that we send as the body of the request to the polarity_by_sentence endpoint. Once we get our response back and parse it to get the list of polarities and sentences, we can calculate the average polarity. For an idea of what the response looks like, check out the documentation.

Once we have the response, all we have to do is calculate the average polarity. The way the response is structured, the first element of an entry is the polarity and the second is the sentence text. For our use case, we just care about the polarity. We are also going to ignore neutral sentences because they’re entirely useless to us and don’t affect whether the overall outcome will be positive or negative. They could affect the absolute value of the outcome, and maybe we could take that into account, but as long as we approach these problems with the same method each time, it won’t matter.

# get average sentence polarity
def polarity_analysis(body: dict):
    response = requests.post(url=polarity_by_sentence_url, headers=headers, json=body)
    results = json.loads(response.text)["polarity by sentence"]
    # initialize average polarity score and count
    avg_polarity = 0.0
    count = 0
    # loop through all the results
    for res in results:
        # ignore the neutral ones
        if res[0] == 0.0:
            continue
        avg_polarity += res[0]
        count += 1
    # average em out (guard against the edge case where every sentence was neutral)
    avg_polarity = avg_polarity / count if count else 0.0
    print(avg_polarity)
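
To tie the two functions together, we build the request body from the saved JSON and hand it to the analysis function. Assuming the Tweets were saved under the search term "starbucks", the call would look like this:

# build the request body from the saved tweets and print the average polarity
body = build_request("starbucks")
polarity_analysis(body)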

Twitter Sentiment on Starbucks Over Thanksgiving Weekend 2021

When we run this program we’ll get something that looks like the following. I’ve pulled Tweets about Starbucks from Saturday, Sunday, and Monday morning (today) and renamed the files from their original names (starbucks.json).

Looks like Sunday was a little less positive than Saturday or Monday, but overall the Twitter sentiment towards Starbucks is positive. I predict the stock price will go up today. Let’s see how much.


What is Lemmatization and How can I do It?

Lemmatization is an important part of Natural Language Processing. Other NLP topics we’ve covered include Text Polarity, Named Entity Recognition, and Summarization. Lemmatization is the process of turning a word into its lemma. A lemma is the “canonical form” of a word: usually the dictionary version of the word, picked by convention. Let’s look at some examples to make more sense of this.

The words “playing”, “played”, and “plays” all have the same lemma of the word “play”. The words “win”, “winning”, “won”, and “wins” all have the same lemma of the word “win”. Let’s take a look at one more example before we move on to how you can do lemmatization in Python. The words “programming”, “programs”, “programmed”, and “programmatic” all have the same lemma of the word “program”. Another way to think about it is to think of the lemma as the “root” of the word.

In this post we’ll cover:

  • How Can I Do Lemmatization with Python
    • Lemmatization with spaCy
    • Lemmatization with NLTK

How Can I Do Lemmatization with Python?

Python has many well known Natural Language Processing libraries, and we’re going to make use of two of them to do lemmatization. The first one we’ll look at is spaCy and the second one we’ll use is Natural Language Toolkit (NLTK).

Lemmatization with spaCy

This is pretty cool: we’re going to lemmatize our text in under 10 lines of code. To get started with spaCy we’ll install the spacy library and download a model. We can do this in the terminal with the following commands:

pip install spacy
python -m spacy download en_core_web_sm

To start off our program, we’ll import spacy and load the language model.

import spacy
 
nlp = spacy.load("en_core_web_sm")

Once we have the model, we’ll simply make up a text, turn it into a spaCy Doc, and that’s basically it. To get the lemma of each word, we’ll just print out the lemma_ attribute. Note that printing out the lemma attribute (without the trailing underscore) will get you spaCy’s integer hash for the lemma rather than the string.

text = "This is an example program for showing how lemmatization works in spaCy. Let's play ball!. The NFL has many football teams and players. Yujian Tang is the best software content creator. The Text API is the most comprehensive sentiment analysis API made to date."
 
doc = nlp(text)
 
for token in doc:
    print(token.lemma_)

Our output should look like the following:

text lemmatization with spaCy output

Sounds like a pirate!

Lemmatization with NLTK

Cool, lemmatization with spaCy wasn’t that hard, so let’s check it out with NLTK. For NLTK, we’ll need to install the library and download the wordnet corpus (plus the punkt tokenizer models, which word_tokenize needs) before we can write the program. We can do that in the terminal with the commands below.

pip install nltk
python
>>> import nltk
>>> nltk.download('wordnet')
>>> nltk.download('punkt')
>>> exit()

Why are we running a Python script in shell and not just downloading wordnet at the start of our program? We only need to download it once to be able to use it, so we don’t want to put it in a program we’ll be running multiple times. As always, we’ll start out our program by importing the libraries we need. In this case, we’re just going to be importing nltk and the WordNetLemmatizer object from nltk.stem.

import nltk
from nltk.stem import WordNetLemmatizer

First we’ll use word_tokenize from nltk to tokenize our text. Then we’ll loop through the tokenized text and use the lemmatizer to lemmatize each token and print it out.

lemmatizer = WordNetLemmatizer()
tokenized = nltk.word_tokenize("This is an example program for showing how lemmatization works in NLTK. Let's play ball!. The NFL has many football teams and players. Yujian Tang is the best software content creator. The Text API is the most comprehensive sentiment analysis API made to date.")
for t in tokenized:
    print(lemmatizer.lemmatize(t))

We’ll end up with something like the image below. 

text lemmatization with NLTK results

As you can see, using NLTK returns a different lemmatization than using spaCy, and out of the box it doesn’t seem to do lemmatization as well. NLTK and spaCy are made for different purposes, so I am usually impartial. However, spaCy definitely wins for built-in lemmatization. NLTK can be customized because it’s heavily used for research purposes, but that’s out of scope for this article. Be on the lookout for an in-depth dive though!
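
Part of the difference is that NLTK’s WordNetLemmatizer treats every word as a noun unless you pass it a part of speech hint. If you want to experiment, here’s a quick sketch of that behavior:

from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
# with no hint, NLTK assumes a noun and leaves "playing" unchanged
print(lemmatizer.lemmatize("playing"))           # playing
# with a verb hint, it reduces the word to its lemma
print(lemmatizer.lemmatize("playing", pos="v"))  # play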


The Best Way to do Named Entity Recognition (NER)

Named Entity Recognition (NER) is a common Natural Language Processing technique. It’s so often used that it comes in the basic pipeline for spaCy. NER can help us quickly parse out a document for all the named entities of many different types. For example, if we’re reading an article, we can use named entity recognition to immediately get an idea of the who/what/when/where of the article.

In this post we’re going to cover three different ways you can implement NER in Python. We’ll be going over:

  • What is Named Entity Recognition?
  • spaCy Named Entity Recognition (NER)
  • Named Entity Recognition with NLTK
  • A Simpler and More Accurate NER Implementation

What is Named Entity Recognition?

Named Entity Recognition, or NER for short, is the Natural Language Processing (NLP) topic about recognizing entities in a text document or speech file. Of course, this is quite a circular definition. In order to understand what NER really is, we’ll have to define what an entity is. For the purposes of NLP, an entity is essentially a noun that defines an individual, group of individuals, or a recognizable object. While there is not a TOTAL consensus on what kinds of entities there are, I’ve compiled a rather complete list of the possible types of entities that popular NLP libraries such as spaCy or Natural Language Toolkit (NLTK) can recognize. You can find the GitHub repo here.

List of Common Named Entities

Entity Type – Description of the NER object
PERSON – A person, usually recognized as a first and last name
NORP – Nationalities or Religious/Political Groups
FAC – The name of a Facility
ORG – The name of an Organization
GPE – The name of a Geopolitical Entity
LOC – A location
PRODUCT – The name of a product
EVENT – The name of an event
WORK OF ART – The name of a work of art
LAW – A law that has been published (US only as far as I know)
LANGUAGE – The name of a language
DATE – A date; it doesn’t have to be an exact date, it could be a relative date like “a day ago”
TIME – A time; like a date, it doesn’t have to be exact, it could be something like “middle of the day”
PERCENT – A percentage
MONEY – An amount of money, like “$100”
QUANTITY – Measurements of weight or distance
CARDINAL – A number, similar to quantity but not a measurement
ORDINAL – A number signifying a relative position, such as “first” or “second”

How Can I Implement NER in Python?

Earlier, I mentioned that you can implement NER with both spaCy and NLTK. The difference between these libraries is that NLTK is built for academic/research purposes and spaCy is built for production purposes. Both are free to use open source libraries. NER is extremely easy to implement with these open source libraries. In this article I will show you how to get started implementing your own Named Entity Recognition programs.

spaCy Named Entity Recognition (NER)

We’ll start with spaCy, to get started run the commands below in your terminal to install the library and download a starter model.

pip install spacy
python -m spacy download en_core_web_sm

We can implement NER in spaCy in just a few lines of code. All we need to do is import the spacy library, load a model, give it some text to process, and then call the processed document to get our named entities. For this example we’ll be using the “en_core_web_sm” model we downloaded earlier; this is the “small” model trained on web text. The text we’ll use is just some random sentence I made up. We should expect the NER to identify Molly Moon as a person (NER isn’t advanced enough to detect that she is a cow), to identify the United Nations as an organization, and the Climate Action Committee as a second organization.

import spacy
 
nlp = spacy.load("en_core_web_sm")
 
text = "Molly Moon is a cow. She is part of the United Nations Climate Action Committee."
 
doc = nlp(text)
 
for ent in doc.ents:
    print(ent.text, ent.label_)

After we run this we should see a result like the one below. We see that this spaCy model is unable to separate the United Nations and its Climate Action Committee as separate orgs.

named entity recognition spacy results

Named Entity Recognition with NLTK

Let’s take a look at how to implement NER with NLTK. As with spaCy, we’ll start by installing the NLTK library and also downloading the extensions we need.

pip install nltk

After we run our initial pip install, we’ll need to download four extensions to get our Named Entity Recognition program running. I recommend simply firing up Python in your terminal and running these commands there, since the data only needs to be downloaded once; including the downloads in your NER program will only slow it down.

python
>>> import nltk
>>> nltk.download("punkt")
>>> nltk.download("averaged_perceptron_tagger")
>>> nltk.download("maxent_ne_chunker")
>>> nltk.download("words")

Punkt is a tokenizer package that recognizes punctuation. Averaged Perceptron Tagger is the default part of speech tagger for NLTK. Maxent NE Chunker is the Named Entity Chunker for NLTK. The Words library is an NLTK corpus of words. We can already see here that NLTK is far more customizable, and consequently also more complex to set up. Let’s dive into the program to see how we can extract our named entities.

Once again we simply start by importing our library and declaring our text. Then we’ll tokenize the text, tag the parts of speech, and chunk it using the named entity chunker. Finally, we’ll loop through our chunks and display the ones that are labeled.

import nltk
 
text = "Molly Moon is a cow. She is part of the United Nations' Climate Action Committee."
 
tokenized = nltk.word_tokenize(text)
pos_tagged = nltk.pos_tag(tokenized)
chunks = nltk.ne_chunk(pos_tagged)
for chunk in chunks:
    if hasattr(chunk, 'label'):
        print(chunk)

When you run this program in your terminal you should see an output like the one below.

named entity recognition results – nltk

Notice that NLTK has identified “Climate Action Committee” as a Person and Moon as a Person. That’s clearly incorrect, but this is all on pre-trained data. Also, this time I let it print out the entire chunk, and it shows the parts of speech. NLTK has tagged all of these as “NNP”, which signals a proper noun.

A Simpler and More Accurate NER Implementation

Alright, now that we’ve discussed how to implement NER with open source libraries, let’s take a look at how we can do it without ever having to download extra packages and machine learning models! We can simply ping a web API that already has a pre-trained model and pipeline for tons of text processing needs. We’ll be using the open beta of The Text API; scroll down to the bottom of the page and get your API key.

The only library we need to install is the requests library, and we only need to be able to send an API request as outlined in How to Send a Web API Request. So, let’s take a look at the code.

All we need is to construct a request to send to the endpoint, send the request, and parse the response. The API key should be passed in the headers as “apikey” and also we should specify that the content type is json. The body simply needs to pass the text in. The endpoint that we’ll hit is “https://app.thetextapi.com/text/ner”. Once we get our request back, we’ll use the json library (native to Python) to parse our response.

import requests
import json
from config import apikey
 
text = "Molly Moon is a cow. She is part of the United Nations' Climate Action Committee."
headers = {
    "Content-Type": "application/json",
    "apikey": apikey
}
body = {
    "text": text
}
url = "https://app.thetextapi.com/text/ner"
 
response = requests.post(url, headers=headers, json=body)
ner = json.loads(response.text)["ner"]
print(ner)

Once we send this request, we should see an output like the one below.

named entity recognition with the text api

Woah! Our API actually recognizes all three of the named entities successfully! Not only is using The Text API simpler than downloading multiple models and libraries, but in this use case, we can see that it’s also more accurate.


Natural Language Processing: Part of Speech Tagging

Part of Speech (POS) Tagging is an integral part of Natural Language Processing (NLP). The first step in most state of the art NLP pipelines is tokenization. Tokenization is the separating of text into “tokens”. Tokens are generally regarded as individual pieces of languages – words, whitespace, and punctuation.

Once we tokenize our text we can tag it with the part of speech, note that this article only covers the details of part of speech tagging for English. Part of speech tagging is done on all tokens except for whitespace. We’ll take a look at how to do POS with the two most popular and easy to use NLP Python libraries – spaCy and NLTK – coincidentally also my favorite two NLP libraries to play with.

What is Part of Speech (POS) Tagging?

Traditionally, there are nine parts of speech taught in English grammar – nouns, verbs, adjectives, determiners, adverbs, pronouns, prepositions, conjunctions, and interjections. We’ll see below that, for NLP reasons, we’ll actually be using way more than nine tags. The spaCy library tags 19 different parts of speech, and over 50 “tags” (depending how you count different punctuation marks).

In spaCy, tags are more granular parts of speech. NLTK’s part of speech tagger uses 34 tags, which are closer to spaCy’s fine-grained tags than to spaCy’s coarse parts of speech. We’ll take a look at the parts of speech labels from both, and then spaCy’s fine-grained tagging. You can find the GitHub repo that contains code for POS tagging here.

In this post, we’ll go over:

  • List of spaCy automatic parts of speech (POS)
  • List of NLTK parts of speech (POS)
  • Fine-grained Part of Speech (POS) tags in spaCy
  • spaCy POS Tagging Example
  • NLTK POS Tagging Example

List of spaCy parts of speech (automatic):

POS – Description
ADJ – Adjective: big, purple, creamy
ADP – Adposition: in, to, during
ADV – Adverb: very, really, there
AUX – Auxiliary: is, has, will
CONJ – Conjunction: and, or, but
CCONJ – Coordinating conjunction: either…or, neither…nor, not only
DET – Determiner: a, an, the
INTJ – Interjection: psst, oops, oof
NOUN – Noun: cat, dog, frog
NUM – Numeral: 1, one, 20
PART – Particle: ‘s, ‘nt, ‘d
PRON – Pronoun: he, she, me
PROPN – Proper noun: Yujian Tang, Michael Jordan, Andrew Ng
PUNCT – Punctuation: commas, periods, semicolons
SCONJ – Subordinating conjunction: if, while, but
SYM – Symbol: $, %, ^
VERB – Verb: sleep, eat, run
X – Other: asdf, xyz, abc
SPACE – Space

List of NLTK parts of speech:

POS – Description
CC – Coordinating Conjunction: either…or, neither…nor, not only
CD – Cardinal Digit: 1, 2, twelve
DT – Determiner: a, an, the
EX – Existential There: “there” used for introducing a topic
FW – Foreign Word: bonjour, ciao, 你好
IN – Preposition/Subordinating Conjunction: in, at, on
JJ – Adjective: big
JJR – Comparative Adjective: bigger
JJS – Superlative Adjective: biggest
LS – List Marker: first, A., 1), etc.
MD – Modal: can, cannot, may
NN – Singular Noun: student, learner, enthusiast
NNS – Plural Noun: students, programmers, geniuses
NNP – Singular Proper Noun: Yujian Tang, Tom Brady, Fei Fei Li
NNPS – Plural Proper Noun: Americans, Democrats, Presidents
PDT – Predeterminer: all, both, many
POS – Possessive Ending: ‘s
PRP – Personal Pronoun: her, him, yourself
PRP$ – Possessive Pronoun: her, his, mine
RB – Adverb: occasionally, technologically, magically
RBR – Comparative Adverb: further, higher, better
RBS – Superlative Adverb: best, biggest, highest
RP – Particle: aboard, into, upon
TO – Infinitive Marker: “to” when it is used as an infinitive marker or preposition
UH – Interjection: uh, wow, jinkies!
VB – Verb: ask, assemble, brush
VBD – Verb Past Tense: dipped, diced, wrote
VBG – Verb Gerund: stirring, showing, displaying
VBN – Verb Past Participle: condensed, refactored, unsettled
VBP – Verb Present Tense, not 3rd person singular: predominate, wrap, resort
VBZ – Verb Present Tense, 3rd person singular: bases, reconstructs, emerges
WDT – Wh-determiner: that, what, which
WP – Wh-pronoun: that, what, whatever
WRB – Wh-adverb: how, however, wherever

We can see that NLTK and spaCy have different part of speech tag sets. This is because there are many ways to tag parts of speech, and the way NLTK has split them up is advantageous for academic purposes. Above, I’ve only shown spaCy’s coarse POS tagging, but spaCy actually has fine-grained part of speech tagging as well; they call it “tag” instead of “part of speech”. I’ll break down how parts of speech map to tags in spaCy below.

List of spaCy Part of Speech Tags (Fine grained)

POS – Mapped Tags
ADJ – AFX (affix: “pre-”), JJ (adjective: good), JJR (comparative adjective: better), JJS (superlative adjective: best), PDT (predeterminer: half), PRP$ (possessive pronoun: his, her), WDT (wh-determiner: which), WP$ (possessive wh-pronoun: whose)
ADP – IN (subordinating conjunction or preposition: “in”)
ADV – EX (existential there: there), RB (adverb: quickly), RBR (comparative adverb: quicker), RBS (superlative adverb: quickest), WRB (wh-adverb: when)
CONJ – CC (coordinating conjunction: and)
DET – DT (determiner: this, a, an)
INTJ – UH (interjection: uh, uhm, ruh-roh!)
NOUN – NN (noun: sentence), NNS (plural noun: sentences), WP (wh-pronoun: who)
NUM – CD (cardinal number: three, 5, twelve)
PART – POS (possessive ending: ‘s), RP (particle adverb: back, as in put it “back”), TO (infinitive to: “to”)
PRON – PRP (personal pronoun: I, you)
PROPN – NNP (proper singular noun: Yujian Tang), NNPS (proper plural noun: Pythonistas)
PUNCT – -LRB- (left round bracket: “(”), -RRB- (right round bracket: “)”), the actual punctuation marks (, : ; . “ ‘ etc.), HYPH (hyphen), LS (list item marker: a., A), iii.), NFP (superfluous punctuation)
SYM – #, $, SYM (symbol); like punctuation, these are pretty self explanatory
VERB – BES (auxiliary “be”), HVS (“have”: ‘ve), MD (auxiliary modal: could), VB (base form verb: go), VBD (past tense verb: was), VBG (gerund: going), VBN (past participle verb: lost), VBP (non 3rd person singular present verb: want), VBZ (3rd person singular present verb: wants)
X – ADD (email), FW (foreign word), GW (additional word), XX (unknown)
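
If you ever need a reminder of what one of these labels means, spaCy can describe both the coarse parts of speech and the fine-grained tags for you. A quick sketch:

import spacy

# spacy.explain returns a short human readable description of a label
print(spacy.explain("PROPN"))  # proper noun
print(spacy.explain("VBZ"))    # verb, 3rd person singular present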

How do I Implement POS Tagging?

Part of Speech Tagging is at the cornerstone of Natural Language Processing. It is one of the most basic parts of NLP, and as a result it comes standard as part of any respectable NLP library. Below, I’m going to cover how you can do POS tagging in just a few lines of code with spaCy and NLTK.

spaCy POS Tagging

We’ll start by implementing part of speech tagging in spaCy. The first thing we’ll need to do is install spaCy and download a model.

pip install spacy
python -m spacy download en_core_web_sm

Once we have our required libraries downloaded we can start. Like I said above, POS tagging is one of the cornerstones of natural language processing. It’s so important that the spaCy pipeline automatically does it upon tokenization. For this example, I’m using a large piece of text; this text about solar energy comes from How Many Solar Farms Does it Take to Power America?

First we import spaCy, then we load our NLP model, then we feed the NLP model our text to create our NLP document. After creating the document, we can simply loop through it and print out the different parts of the tokens. For this example, we’ll print out the token text, the token part of speech, and the token tag.

import spacy
 
nlp = spacy.load("en_core_web_sm")
 
text = """This is where the calculation can get tricky. Here’s the thing about solar energy. Solar energy comes from the sun. That means solar panels cannot produce energy 24 hours a day. They only produce energy during sunlight hours. That energy then has to be stored somewhere while it is not being used. Energy storage is a whole other topic in and of itself. Let me get back to the point, there’s only an average of 4 peak sunlight hours a day. A solar panel may get more than that, but let’s take a conservative estimate of our solar power generation and confine it to those 4 hours only.
Back to the calculations. At 4 acres of solar panels to generate a megawatt-hour and 4 hours of power generation time a day, a 1 MW solar farm would generate 4 MWh of power over 4 acres every day. At 110,000 megawatt-hours of power needed a day to power America, we would need about 110,000 acres of solar farm. 110,000 acres? That sounds huge, that’s more land than the entire Mojave desert. It’s not as daunting as it sounds, there are 1.9 billion acres in the continental United States, and 110,000 acres is only slightly more than 0.5 percent of the total land of the continental US."""
 
doc = nlp(text)
 
for token in doc:
    print(token.text, token.pos_, token.tag_)

Once you run this you should see an output like the one pictured below.

Part of Speech Tagging Results – spaCy

NLTK POS Tagging

Now let’s take a look at how to do POS tagging with the Natural Language Toolkit. We’ll get started with this the same way we got started with spaCy, by downloading the library and the models we’ll need. We’re going to need to install NLTK and download the NLTK “punkt” tokenizer model, along with the “averaged_perceptron_tagger” model that nltk.pos_tag relies on (if you followed the NER post above, you already have it).

pip install nltk
python
>>> import nltk
>>> nltk.download('punkt')
>>> nltk.download('averaged_perceptron_tagger')

Once we have our libraries downloaded, we can fire up our favorite Python editor and get started. Like with spaCy, there’s only a few steps we need to do to start tagging parts of speech with the NLTK library. First, we need to tokenize our text. Then, we simply call the NLTK part of speech tagger on the tokenized text and voila! We’re done. I’ve used the exact same text from above.

import nltk
from nltk.tokenize import word_tokenize
 
text = """This is where the calculation can get tricky. Here’s the thing about solar energy. Solar energy comes from the sun. That means solar panels cannot produce energy 24 hours a day. They only produce energy during sunlight hours. That energy then has to be stored somewhere while it is not being used. Energy storage is a whole other topic in and of itself. Let me get back to the point, there’s only an average of 4 peak sunlight hours a day. A solar panel may get more than that, but let’s take a conservative estimate of our solar power generation and confine it to those 4 hours only.
Back to the calculations. At 4 acres of solar panels to generate a megawatt-hour and 4 hours of power generation time a day, a 1 MW solar farm would generate 4 MWh of power over 4 acres every day. At 110,000 megawatt-hours of power needed a day to power America, we would need about 110,000 acres of solar farm. 110,000 acres? That sounds huge, that’s more land than the entire Mojave desert. It’s not as daunting as it sounds, there are 1.9 billion acres in the continental United States, and 110,000 acres is only slightly more than 0.5 percent of the total land of the continental US."""
 
tokenized = word_tokenize(text)
tagged = nltk.pos_tag(tokenized)
for tag in tagged:
    print(tag)

Once we’re done, we simply run this in a terminal and we should see an output like the following.

Parts of Speech Tagging Results – NLTK

You can compare and see that NLTK and spaCy have pretty much the same tagging at the tag level.
