Let’s talk about the internet’s favorite recent president: Barack Obama. I scraped every single NY Times headline from his presidency, searched them for mentions of “Obama”, and collected the matching headlines. In this post, we’re going to take a look at how I scraped those headlines and at the ones from January 2017. Originally, I had planned to take a look at ALL of them in this post, but there are literally thousands, so the rest will be published elsewhere.

Here are the results for 2008, 2009, 2010, 2011, 2012, 2013, 2014, 2015, 2016, and 2017.

I used this tutorial on how to download archived headlines to get the NY Times archive of headlines from November 2008 to January 2017. After downloading all the headlines, I used The Text API to extract all the sentences (headlines) with the keyword “Obama” in them. In future posts, we’ll also use an AI Summarizer for a more succinct version of the headlines, check the average sentiment with a Text Polarity tool, and extract the Named Entities.

We’re only going to need one non-native Python library for this, requests. We can install the requests library in the command line with the command below:

pip install requests

Setting Up the Keyword Sentence Extraction

Let’s start by importing the libraries we’re going to need. The only two you will definitely need are json and requests. You’ll need requests to send the HTTP request and json to parse the response. The other imports are specific to my setup. I’ve saved my Text API key in a config file in nyt, the parent directory of this file. To access it, I imported the sys library and appended ../.. to the path. I’ve also imported month_dict from my archive file, the file I used to download the archived headlines. The month_dict object is just a dictionary mapping the month number to the month name.

After we set up our imports, we’ll have to set up our headers and URL. The headers tell the server that we’re sending a JSON body and pass in the API key. I have saved the base URL for the text processing endpoints in a separate variable from the sentences_with_keywords endpoint that we’ll be hitting for the sentences with keywords. This is so we can extend our program to hit more endpoints when we want to do further processing on the text.

import json
import requests
import sys
 
from archive import month_dict
sys.path.append("../..")
from nyt.config import thetextapikey
 
headers = {
    "Content-Type": "application/json",
    "apikey": thetextapikey
}
text_url = "https://app.thetextapi.com/text/"
keyword_url = text_url+"sentences_with_keywords"

Parse Archived Headlines for the API Request

After everything needed for the requests is set up, we still have to load and parse our documents to get them ready for sending to the API. We’ll make two functions for this, get_doc and split_headlines. We’ll use get_doc to load the archived headlines for a specific month. It will take two parameters, a year and a month. Then it will find the JSON file for that year and month, open the file, parse it, and return the parsed entries. If the file does not exist, it will instead raise an error telling us so.
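
Since month_dict comes from the archive script and isn’t shown in this post, here is a minimal sketch of what it might look like and how it turns a year and month into a file path. The full English month names and the archive_filename helper are assumptions for illustration; your archive files may be named differently.

```python
# assumption: month numbers (1-12) map to full English month names
MONTH_NAMES = [
    "January", "February", "March", "April", "May", "June",
    "July", "August", "September", "October", "November", "December",
]
month_dict = {number: name for number, name in enumerate(MONTH_NAMES, start=1)}


def archive_filename(year, month):
    """Hypothetical helper: build the path for one month of headlines."""
    return f"{year}/{month_dict[month]}.json"


example_path = archive_filename(2017, 1)
print(example_path)
```

The path uses the same year/month-name pattern that get_doc builds, so a missing or misnamed month file is easy to spot by printing it.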

After we have the entries, we’ll need to parse them into reasonably sized documents to send to the endpoint to avoid the connection closing before we get our response back. To do this we’ll create a list of headlines and two index trackers. Our headlines list will be a list of lists where each inner list is a list of headlines. The index trackers correspond to the index of the inner and outer lists.

Then we’ll loop through each of the entries and extract the headlines. We remove the periods so that each headline counts as its own sentence. Then we check whether the current outer index already exists in the headlines list. If it doesn’t, we append a new list containing just that headline. Otherwise, we append the headline to the current inner list. Once we reach 200 headlines in the inner list (this is an arbitrary number; 100, 250, or 300 would also work, but it’s probably best to stay below 500), we increment the outer index and reset the inner counter. After looping through all the entries, we return the outer list of lists of headlines.

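def get_doc(year, month):
    # build the path to the archive file for this year and month
    filename = f"{year}/{month_dict[month]}.json"
    try:
        with open(filename, "r") as f:
            entries = json.load(f)
        return entries
    except FileNotFoundError as err:
        # only a missing file is reported here; other errors propagate as-is
        raise FileNotFoundError(f"No such file: {filename}") from err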
 
def split_headlines(entries):
    # create total list of headlines and index trackers
    # one index tracker for the inner list of headlines
    # one index tracker for the outer list of headlines
    headlines = []
    idx_tracker = 0
    headlines_idx_tracker = 0
    # loop through all the entries
    for entry in entries:
        # get the headline and modify it
        headline = entry['headline']['main']
        headline = headline.replace('.', '')
        # if the headline index tracker index exists in headlines
        # append to it, otherwise, make it
        if len(headlines) == headlines_idx_tracker:
            headlines.append([headline])
        else:
            headlines[headlines_idx_tracker].append(headline)
        # increment the index tracker
        idx_tracker += 1
        # if the index tracker is at 200, reset it and
        # increment the headlines index tracker
        if idx_tracker == 200:
            headlines_idx_tracker += 1
            idx_tracker = 0
    return headlines
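
As a quick check of the chunking behavior, the sketch below does the same grouping with list slicing and runs it on made-up entries. The chunk_headlines name and the sample entries are illustrative only and not part of the original script.

```python
def chunk_headlines(entries, chunk_size=200):
    """Group headlines into lists of at most chunk_size items."""
    # pull out each headline and strip periods, as split_headlines does
    cleaned = [entry["headline"]["main"].replace(".", "") for entry in entries]
    # slice the flat list into consecutive chunks
    return [cleaned[i:i + chunk_size] for i in range(0, len(cleaned), chunk_size)]


# hypothetical entries shaped like the archive JSON used above
sample_entries = [{"headline": {"main": f"Headline No. {n}"}} for n in range(450)]
chunks = chunk_headlines(sample_entries)
chunk_sizes = [len(chunk) for chunk in chunks]
print(chunk_sizes)  # [200, 200, 50]
```

Either version works; the slicing form just avoids tracking two indices by hand.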

Getting Sentences Containing “Obama”

Now that we’ve done all the loading and parsing, we can finally extract all the sentences that contain “Obama”. To do this, we’ll create a function called get_sentences_with_keywords. This function will take two arguments: a list of keywords, kws, and a string, text. Inside our function, all we are going to do is build the body of the request, send the request using the requests module, and parse the JSON response into a dictionary. Finally, we just return our parsed response.

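def get_sentences_with_keywords(kws: list, text: str):
    # build the JSON body: the text to search and the keywords to look for
    body = {
        "text": text,
        "keywords": kws
    }
    response = requests.post(keyword_url, headers=headers, json=body)
    # print the response object as a quick status check
    print(response)
    # parse the JSON response into a dictionary
    _dict = json.loads(response.text)
    return _dict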
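
The function above parses whatever comes back, so a failed request could surface as a confusing JSON error. Below is a minimal sketch of a guard you could put around the parsing step. The parse_api_response helper is hypothetical, it assumes any non-200 status means failure, and the sample body is a placeholder rather than the API’s actual response format.

```python
import json


def parse_api_response(status_code: int, text: str) -> dict:
    """Return the parsed JSON body, or raise if the request failed."""
    # assumption: anything other than 200 is treated as a failure
    if status_code != 200:
        raise RuntimeError(f"request failed with status {status_code}")
    try:
        return json.loads(text)
    except json.JSONDecodeError as err:
        raise RuntimeError("response body was not valid JSON") from err


# placeholder body; the real API's response shape may differ
parsed = parse_api_response(200, '{"Obama": ["Obama Takes Office"]}')
print(parsed)
```

With requests, this would be called as parse_api_response(response.status_code, response.text).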

Searching All the Headlines from November 2008 to January 2017

With all the functions set up, we can search every month from November 2008 to January 2017. Technically, Obama came into office in January 2009, but there are bound to be a lot of headlines about him starting in November 2008, when he was elected. In a later article, we’ll also explore the way he was portrayed in the news during his first run for the presidency.

# search obama from Nov 2008 to Jan 2017
search_obama(2008, 11)
search_obama(2008, 12)
for year in range(2009, 2017):
    for month in range(1, 13):
        search_obama(year, month)
search_obama(2017, 1)
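
The search_obama function itself isn’t shown in this post, so here is a rough sketch of how such a driver could tie the earlier pieces together, plus a generator for the month range. Everything here is an assumption for illustration: the extra load, split, and extract parameters (the post calls search_obama with just a year and a month), joining headlines with ". ", and the offline stand-in functions that replace the file loading and the API call.

```python
def months_in_range(start=(2008, 11), end=(2017, 1)):
    """Yield (year, month) pairs from start through end, inclusive."""
    year, month = start
    while (year, month) <= end:
        yield year, month
        month += 1
        if month > 12:
            year, month = year + 1, 1


def search_obama(year, month, load, split, extract):
    """Hypothetical driver: load one month, chunk it, query each chunk."""
    results = []
    for chunk in split(load(year, month)):
        # assumption: joining with ". " makes each headline its own sentence
        text = ". ".join(chunk) + "."
        results.append(extract(["Obama"], text))
    return results


# offline stand-ins for get_doc, split_headlines, and the API call
def fake_load(year, month):
    return [{"headline": {"main": f"Obama Headline {year}-{month}"}},
            {"headline": {"main": "Markets Rally"}}]


def fake_split(entries):
    return [[entry["headline"]["main"] for entry in entries]]


def fake_extract(kws, text):
    sentences = [s.strip() for s in text.split(".") if s.strip()]
    return [s for s in sentences if any(kw in s for kw in kws)]


months = list(months_in_range())
first_result = search_obama(*months[0], fake_load, fake_split, fake_extract)
print(len(months), first_result)
```

With the post’s own functions in place of the stand-ins, the call would presumably look like search_obama(year, month, get_doc, split_headlines, get_sentences_with_keywords).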

