Ask NLP: Who/What/When/Where of the Obama Presidency

Recently we’ve used NLP to explore the media’s portrayal of Obama in two parts: the most common phrases used in headlines about him, and an AI summary of those headlines. Now we’re going to dig deeper by looking at the who/what/when/where of the Obama Presidency. We’ll do this by running the NLP technique of Named Entity Recognition (NER) on the article headlines we got when we pulled the Obama-related headlines from the NY Times.

Click here to skip directly to the most commonly mentioned named entities recognized for each year.

To follow along with this tutorial, you’ll need API keys from both the NY Times and The Text API. You’ll also need the requests module, which you can install by running the line below in your terminal.

pip install requests

Set Up the API Requests

As we always do, we’ll start out by setting up our program. The only two libraries we need are requests and json. The sys library is there only so I can access the config file, which I have saved in a parent folder. I use a config file to store the API keys and base URL, but you can store your API keys wherever you’d like. You can even store them in this file.
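For reference, here’s a minimal sketch of what that config file might contain. The variable names match the import in the code below; the key value is a placeholder you’d replace with your own.

# ../../nyt/config.py -- a minimal sketch; replace the placeholder with your own key
thetextapikey = "YOUR_TEXT_API_KEY"
_url = "https://app.thetextapi.com/text/"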

After our imports, we need to set up the headers. The headers tell the server that we’re sending data in JSON format and also pass along the API key. Then we finish up by completing the API endpoint URL (“https://app.thetextapi.com/text/”). For this example, we’ll be using the ner endpoint.

import requests
import json
import sys
sys.path.append("../..")
from nyt.config import thetextapikey, _url
 
# set up request headers and URL
headers = {
    "Content-Type": "application/json",
    "apikey": thetextapikey
}
ner_url = _url + "ner"

Do Named Entity Recognition On the Headlines

Now that we’ve got everything set up, let’s run Named Entity Recognition on our list of headlines. In the previous examples, I’ve just been running the loop at the top level of the script, but for this program we’re going to encapsulate the loop inside a function. There are multiple ways to set this function up, and we’ll see a second way later in this program. This function won’t take any parameters; it will simply loop through all of the years automatically.

Inside each loop, we start by opening the text document containing the Obama headlines for that year, which we downloaded when we pulled the Obama headlines from the NY Times. We read each text file in as a list of headlines and then join the list into a single string. Once we have it as a string, we build the body that we’ll send to the server. The body holds the string under the text key, and under the labels key we send “ARTICLE” to tell the API what kind of text we’re sending. This directs the way the NER is executed: the ARTICLE label tells the server to search for people, places, organizations, locations, and times.

With the body, headers, and URL set up, we simply send a POST request with the requests library and parse the response with the json.loads() function from the json library. We’ll save the parsed response to a txt file. To do this, we loop through each entry under the ner key of the response and join that entry’s first and second elements into a single string. Then we write each of those strings to the document on its own line. To run this function, we simply call it after defining it.
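For reference, the code below assumes the parsed response holds a list of (label, text) pairs under the ner key, something like this (the specific entities shown are hypothetical):

# assumed shape of the parsed response; the entities shown are hypothetical
_dict = {
    "ner": [
        ["PERSON", "Barack Obama"],
        ["ORG", "Senate"],
        ["DATE", "Tuesday"]
    ]
}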

def get_ners():
    # loop through each year of coverage (2008-2017)
    for i in range(2008, 2018):
        # read in that year's headlines
        with open(f"obama_{i}.txt", "r") as f:
            headlines = f.readlines()
        # combine the list of headlines into one text
        text = "".join(headlines)

        # set up the request body
        body = {
            "text": text,
            "labels": "ARTICLE"
        }
        # send the request and parse the response
        response = requests.post(url=ner_url, headers=headers, json=body)
        _dict = json.loads(response.text)
        # save the results to a txt file, one entity per line
        with open(f"obama/{i}_ner.txt", "w") as f:
            for entry in _dict["ner"]:
                ner = entry[0] + ": " + entry[1]
                f.write(ner)
                f.write('\n')

get_ners()
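Each line of the saved file ends up in the form “TYPE: name”, with a space after the colon. A few hypothetical lines from one of these files:

PERSON: Barack Obama
ORG: Senate
DATE: Tuesday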

Analyze the Raw Named Entities

In theory, this could be a totally separate file, but we’re going to stay in the same file because the imports are roughly the same and we’re still on the same task. Best practice would actually be to split up all of these functions into their own separate files and then orchestrate them, but this isn’t production software, and we’re just playing around with it. In a future post, we’ll cover how to properly orchestrate this.

To analyze and clean the NER results some more, we’ll create a function that takes one parameter, the year. This is the second way of setting things up that I alluded to earlier: you can either execute everything in one function, or have a function handle one year and call it multiple times. We’re going to use this function to compress the list of all the NERs into a more understandable format by separating them first by type (person, organization, location, time, etc.) and then by the actual name and how many times that name is mentioned.

Functionality

In this function, we open the NER file for the given year that we saved above and read the NERs in as a list. Then we create a dictionary that uses the type of entity as a key and holds an inner dictionary mapping each entity’s name to its count. Next, we loop through each of the named entities in the list and split them on the “:” character. For each split entity, we try to assign the first value to the type and the second value to the name. We have to do a bit of adjustment based on the way we saved this list to begin with.

We cut off the last character of the name because it’s a newline. We also check for a ’s in the name, because that’s a possessive and we don’t want duplicate names. Finally, we check for a space at the beginning of the name (this is actually expected based on the way we saved above; the sketch below shows why). We encapsulate all of this in a try/except in case any lines from the function above are malformed.
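Here’s a quick sketch of what those cleanup steps do to a saved line:

# a saved line looks like "PERSON: Barack Obama\n"
line = "PERSON: Barack Obama\n"
elements = line.split(":")    # ["PERSON", " Barack Obama\n"]
_type = elements[0]           # "PERSON"
_name = elements[1][:-1]      # " Barack Obama" (newline cut off)
_name = _name[1:]             # "Barack Obama" (leading space removed)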

Now that we have the names and types separated out, we can start saving them into the dictionary. If the type is already in the outer dictionary, we check whether the name is in that type’s inner dictionary: if it is, we increment its count by 1; if not, we create the name entry and set it to 1. If the type isn’t in the outer dictionary yet, we create an inner dictionary with the name’s count set to 1 and assign that inner dictionary as the type’s value in the outer dictionary. Finally, we save the result to a JSON file. To run this function on all our data, we loop through all the years and call it on each one.

# find number of mentions of each type of entity
# dictionary of type: dictionary of name: count
def analyze_ner(year):
    with open(f"obama/{year}_ner.txt", "r") as f:
        ners = f.readlines()
    # outer dictionary mapping entity type to an inner dictionary of name: count
    outer_dict = {}
    for ner in ners:
        elements = ner.split(":")
        try:
            _type = elements[0]
            # cut off the trailing newline
            _name = elements[1][:-1]
            # drop possessives so we don't count the same name twice
            if "'s" in _name:
                _name = _name.replace("'s", "")
            # drop the leading space left over from the "TYPE: name" format
            if _name[0] == " ":
                _name = _name[1:]
        except IndexError:
            # skip any malformed lines
            continue
        # find number of mentions of each entity within types
        if _type in outer_dict:
            if _name in outer_dict[_type]:
                outer_dict[_type][_name] += 1
            else:
                outer_dict[_type][_name] = 1
        else:
            inner_dict = {}
            inner_dict[_name] = 1
            outer_dict[_type] = inner_dict
    with open(f"obama/{year}_analyzed_ner.json", "w") as f:
        json.dump(outer_dict, f)

for i in range(2008, 2018):
    analyze_ner(i)
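As a side note, the nested if/else counting above can be written more compactly with the standard library’s collections module. A sketch of equivalent logic, assuming the same (type, name) pairs:

from collections import defaultdict, Counter

# equivalent counting logic: one Counter of names per entity type
def count_entities(pairs):
    counts = defaultdict(Counter)
    for _type, _name in pairs:
        counts[_type][_name] += 1
    return counts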

Find the Most Common Named Entities

Now that we have the named entities, we can find the most commonly mentioned ones. To do this, we’ll set up a function that takes one parameter, the year, much like our function above. We open the analyzed NER JSON document and load the entries into a dictionary. Then, for each type, we sort its entities by count in descending order and print the type along with the most commonly mentioned entity of that type. The one exception is the person type: we already know Obama will be the most mentioned person, so we skip to the second most mentioned.

# find most commonly mentioned entity of each type
def most_common(year):
    with open(f"obama/{year}_analyzed_ner.json", "r") as f:
        entries = json.load(f)
    for _type in entries:
        # sort this type's entities by mention count, highest first
        _sorted = sorted(entries[_type].items(), key=lambda item: item[1], reverse=True)
        if _type == 'PERSON':
            # skip Obama himself, who is always the most mentioned person
            print(f"Most common {_type} (other than Obama) in headlines about Obama in {year} was {_sorted[1][0]} mentioned {_sorted[1][1]} times")
        else:
            print(f"Most common {_type} in headlines about Obama in {year} was {_sorted[0][0]} mentioned {_sorted[0][1]} times")
    print('\n')

for i in range(2008, 2018):
    most_common(i)
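Each printed line follows the template in the f-strings above. With hypothetical values filled in, a line for one year might read:

Most common ORG in headlines about Obama in 2009 was Congress mentioned 12 times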

The Who/What/When/Where of the Obama Presidency

[The per-year results for 2008–2017, produced by the most_common calls above, appeared here in the original post.]

Yujian Tang
