Find the Most Common Named Entities by Type

Natural Language Processing techniques have come far in the last decade. NLP will be an even more important field in the coming decade as we get more and more unstructured text data. One of the base applications of NLP is Named Entity Recognition (NER). However, NER by itself just gives us the names of entities such as the people, locations, organizations, and times mentioned in a document. On its own, that list isn’t really that useful for getting insight from a text. You know what would be useful though? Finding the most commonly named entities in a text.

This post will cover how to find the most common named entities in a text. Before we go over the code and technique to do that, we’ll also briefly cover what NER is. In this post we’ll go over:

  • What is Named Entity Recognition (NER)
  • Performing NER on a Text
    • What does a returned NER usually look like?
  • Processing the returned NERs to find the most common ones
    • Splitting the Named Entities by Type
    • Sorting Each Type of Named Entities
  • Example of the Most Common Named Entities from a NER of Tweets

What is Named Entity Recognition (NER)?

Named Entity Recognition (NER) is the NLP technique used to extract certain types of entities from a text. This post on the Best Way to do Named Entity Recognition (NER) has a list of all the entity types for both spaCy and NLTK. In general, you can think of NER as the process of extracting the people, places, times, organizations, and other conceptual entities from a text.

Performing Named Entity Recognition (NER) on a Text

There are many ways to do Named Entity Recognition. In the Best Way to do Named Entity Recognition (NER) post we went over how to do NER with spaCy, NLTK, and The Text API. In this example, we’ll go over using The Text API since it was the one with the highest quality NER that we found in that post. To do this example, you’ll need to sign up for a free API key at The Text API and install the requests module with the line below.

pip install requests

All we’re going to do here is set up the request by passing the URL, the headers, and the body, and then parse the response. The headers will tell the server that we’re looking to send a JSON request and pass the API key. The body will pass the text we’re trying to do NER on. You can skip this section; we will use an example return value below.

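import requests

ner_url = "https://app.thetextapi.com/text/ner"
# placeholder: replace with the API key you signed up for
apikey = "<your API key here>"
headers = {
    "Content-Type": "application/json",
    "apikey": apikey
}
# placeholder: the text to run NER on
text = "<some text here>"
body = {
    "text": text
}
response = requests.post(ner_url, headers=headers, json=body)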

What Does a Returned List of Named Entities Look Like?

I didn’t include a text in the above example because we’re going to be using text pulled from Twitter. To find out how to pull Tweets, follow the guide on how to Scrape the Text from All Tweets for a Search Term. The output below was generated by scraping all the text from tweets by Elon Musk and then running it through NER.

You can see that the result is a list of lists. Each inner list contains two entries: the entity type and the name of the entity. It’s important to know the data structure so we can manipulate it and extract the data we want.

[['PERSON', '@cleantechnica'], ['PERSON', 'Webb'], ['ORG', '@NASA'], ['DATE', 'a crazy tough year'], ['ORG', 'Tesla'], ['TIME', '6pm Christmas Eve'], ['TIME', 'last hour'], ['DATE', 'the last day'], ['DATE', 'two days'], ['DATE', 'Christmas'], ['ORG', 'WSJ'], ['ORG', 'Wikipedia'], ['ORG', 'Tesla'], ['DATE', 'holiday'], ['DATE', 'today'], ['ORG', 'Tesla'], ['DATE', 'today'], ['PERSON', 'Doge'], ['ORG', '@SkepticsGuide']]
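To make the structure concrete, here is a minimal sketch that unpacks a few of the pairs from the list above. The names sample_ners and entity_types are only for illustration and are not part of any API.

```python
# A short sample taken from the NER output shown above
sample_ners = [["PERSON", "Webb"], ["ORG", "@NASA"], ["ORG", "Tesla"], ["DATE", "today"]]

# Each inner list unpacks into (entity type, entity name)
for entity_type, entity_name in sample_ners:
    print(f"{entity_type}: {entity_name}")

# Collect the distinct entity types, keeping the order in which they first appear
entity_types = list(dict.fromkeys(entity_type for entity_type, _ in sample_ners))
print(entity_types)
# ['PERSON', 'ORG', 'DATE']
```

Because every inner list has exactly two entries, tuple unpacking in the for loop is a readable alternative to indexing with [0] and [1].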

Processing the NERs to Find the Most Common Ones

It’s great to run NER and get all the named entities in a text, but just getting all the named entities isn’t really that useful to us. It’s much more applicable to extract the most common entities of each type from a document. Let’s take a look at how to do that.

Splitting Named Entities by Type of Entity

The first thing we need to do to find the most common named entities recognized in a document is process the list of lists into a dictionary based on entity type. We’ll create a function, build_dict, that will take one parameter, ners, which is a list. Note that even though we expect a list on the surface, we actually expect a list of lists like the one shown above.

In our function, the first thing we’ll do is create an empty dictionary. This is the dictionary that will contain the entity types and their counts that we will return later. Next, we’ll loop through the ners list. Remember how the ners list is set up. The entity type is the first entry and the entity name is the second. 

If the entity type is already in our dictionary, we’ll check for the entity name in its entry. If the entity name is in our inner dictionary, we’ll increment its count by 1. Otherwise, we’ll add the entry to our inner dictionary and set the count to 1. When the entity type is not in the dictionary, we’ll create an inner dictionary with one entry, the entity name set to a count of 1. After looping through all the ners, we return the outer dictionary.

# build a dictionary of entity counts, keyed by entity type
# expects a list of [entity_type, entity_name] lists
def build_dict(ners: list):
    outer_dict = {}
    for ner in ners:
        entity_type = ner[0]
        entity_name = ner[1]
        if entity_type in outer_dict:
            if entity_name in outer_dict[entity_type]:
                outer_dict[entity_type][entity_name] += 1
            else:
                outer_dict[entity_type][entity_name] = 1
        else:
            outer_dict[entity_type] = {
                entity_name: 1
            }
    return outer_dict
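
The same counting can also be written with the standard library’s collections module. This is only an alternative sketch, not the version used in the rest of this post; build_counts is a hypothetical name chosen for illustration.

```python
from collections import Counter, defaultdict

def build_counts(ners: list) -> dict:
    """Count entity names per entity type (same result shape as build_dict)."""
    # defaultdict creates an empty Counter the first time an entity type is seen
    counts = defaultdict(Counter)
    for entity_type, entity_name in ners:
        counts[entity_type][entity_name] += 1
    # Convert back to plain dictionaries so the result matches build_dict's output
    return {entity_type: dict(names) for entity_type, names in counts.items()}

sample = [["ORG", "Tesla"], ["DATE", "today"], ["ORG", "Tesla"], ["ORG", "WSJ"]]
print(build_counts(sample))
# {'ORG': {'Tesla': 2, 'WSJ': 1}, 'DATE': {'today': 1}}
```

The nested if/else checks disappear because defaultdict and Counter handle the "not seen yet" cases for us.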

Sorting Each Type of Named Entity

The next thing we want to do is sort the entities. We’ll create a function called most_common that will take one input, ners, the list of lists. This is the same input that build_dict takes. Note that we can opt for this function to take a dictionary instead and chain the call to build_dict in the orchestrator. For this example, however, we will call build_dict inside of this function.

Once we have the dictionary of NERs split by entity type, let’s create an empty dictionary that will hold the most commonly mentioned entity of each type. Next, let’s loop through each of the NER types in the built dictionary. For each of those types, we’ll create a sorted list based on the value (count) of each named entity in that type. Then, we’ll set the value of the ner_type in the dictionary of most common NER types to the first entry in the sorted list. Finally, we return the dictionary of most common NER types.

# return most common entities after building the NERS out
def most_common(ners: list):
    _dict = build_dict(ners)
    mosts = {}
    for ner_type in _dict:
        sorted_types = sorted(_dict[ner_type], key=lambda x: _dict[ner_type][x], reverse=True)
        mosts[ner_type] = sorted_types[0]
    return mosts
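
If you want more than the single top entity for each type, one possible variation is to keep the top n names along with their counts. The sketch below assumes the same list-of-lists input; top_n_by_type and its n parameter are hypothetical names for illustration, and it uses Counter.most_common from the standard library instead of sorted.

```python
from collections import Counter

def top_n_by_type(ners: list, n: int = 2) -> dict:
    """Return up to n (entity name, count) pairs for each entity type."""
    counts = {}
    for entity_type, entity_name in ners:
        # Create a Counter for this entity type on first sight, then count the name
        counts.setdefault(entity_type, Counter())[entity_name] += 1
    # most_common(n) returns the n highest counts, largest first
    return {entity_type: names.most_common(n) for entity_type, names in counts.items()}

sample = [["ORG", "Tesla"], ["ORG", "WSJ"], ["ORG", "Tesla"], ["DATE", "today"]]
print(top_n_by_type(sample))
# {'ORG': [('Tesla', 2), ('WSJ', 1)], 'DATE': [('today', 1)]}
```

Names with equal counts come back in the order they were first seen, which matches how the sorted call in most_common breaks ties.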

Example of the Most Common Named Entities From a NER of Tweets

Based on the example NERs we extracted from the Tweets above, when we run the above code, we get a dictionary with one entry per NER type (PERSON, ORG, DATE, and TIME), each mapped to the most commonly named entity of that type. In the list shown above, for example, ORG maps to Tesla, which appears three times.
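
To check this yourself without calling the API, here is a self-contained sketch that runs on the example list from earlier in this post. It restates build_dict and most_common in a more compact form, using max in place of sorted, so treat it as an equivalent rewrite rather than the exact code above.

```python
# Example NER output, copied from the list shown earlier in this post
ners = [['PERSON', '@cleantechnica'], ['PERSON', 'Webb'], ['ORG', '@NASA'],
        ['DATE', 'a crazy tough year'], ['ORG', 'Tesla'], ['TIME', '6pm Christmas Eve'],
        ['TIME', 'last hour'], ['DATE', 'the last day'], ['DATE', 'two days'],
        ['DATE', 'Christmas'], ['ORG', 'WSJ'], ['ORG', 'Wikipedia'], ['ORG', 'Tesla'],
        ['DATE', 'holiday'], ['DATE', 'today'], ['ORG', 'Tesla'], ['DATE', 'today'],
        ['PERSON', 'Doge'], ['ORG', '@SkepticsGuide']]

def build_dict(ners: list) -> dict:
    """Count entity names, grouped by entity type."""
    outer_dict = {}
    for entity_type, entity_name in ners:
        inner = outer_dict.setdefault(entity_type, {})
        inner[entity_name] = inner.get(entity_name, 0) + 1
    return outer_dict

def most_common(ners: list) -> dict:
    """Map each entity type to its most frequently named entity."""
    counts = build_dict(ners)
    # max returns the first name with the highest count, so ties go to the earliest entity
    return {entity_type: max(names, key=names.get) for entity_type, names in counts.items()}

result = most_common(ners)
print(result)
# {'PERSON': '@cleantechnica', 'ORG': 'Tesla', 'DATE': 'today', 'TIME': '6pm Christmas Eve'}
```

Note that PERSON and TIME have no repeated names in this sample, so the entity reported for those types is simply the first one that appeared.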

Yujian Tang
