How to Automatically Analyze Documents with Python

Analyzing a stack of documents is a pain. Why do it yourself when you can automate it? In this post we’ll go over how you can automatically analyze documents with Python, with the help of The Text API, a comprehensive text analysis web API.

First, let’s go over what goes into analyzing a document. When we analyze a document, we want to understand its main themes. We may also want to know which things were mentioned least; those may be topics we’ll have to research further. An extractive summary would also be useful, giving us a good idea of the document’s main ideas and outline. Finally, we’d like to know what names, places, times, and events the document mentions.

With these things in mind, let’s get into it. The first thing we’ll need to do is install the requests library to send HTTP requests. We can easily do that from the command line with the command below:

pip install requests

For this example, the document I’ll be using is the set of all news headlines from the NY Times in October 2021. For an explanation of how to get these, check out How to Download Archived News Headlines. Before we get into the document analysis itself, let’s set up our Python script. We’ll import the json and requests libraries to handle JSON documents and send HTTP requests. I’ve stored my API key from The Text API in a config file; you can get a free API key from The Text API website.

import json
import requests
 
from archive import month_dict
from config import thetextapikey

The month_dict object is just a dictionary that maps the month number to the month name. It looks like this:

month_dict = {
    1: "January",
    2: "February",
    3: "March",
    4: "April",
    5: "May",
    6: "June",
    7: "July",
    8: "August",
    9: "September",
    10: "October",
    11: "November",
    12: "December"
}

We’ll also want to set up our headers and the base URL we’ll be using. The headers tell the server that the request has JSON content and pass in the API key we got earlier. The base URL is the root for all the endpoints we’ll need for our analysis.

headers = {
    "Content-Type": "application/json",
    "apikey": thetextapikey
}
text_url = "https://app.thetextapi.com/text/"

Setup Helper Functions

Let’s also set up some helper functions. These are functions that perform operations we’ll need multiple times. In each of the four text analysis functions we’ll write, we’ll need to open the downloaded archive data that we saved as JSON, and we’ll also need to split the headlines. Splitting the headlines isn’t strictly necessary, but a single request containing the whole month’s headlines risks the connection closing before the transfer finishes, so it’s best to split the text across several smaller requests.

Our first helper function opens the downloaded JSON. It takes two parameters, a year and a month. The function constructs the filename from the year and month, opens the file, and returns the loaded data. We’ll wrap the file opening in a try/except in case the file we’re trying to open doesn’t exist.

def get_doc(year, month):
    filename = f"{year}/{month_dict[month]}.json"
    try:
        with open(filename, "r") as f:
            entries = json.load(f)
        return entries
    except FileNotFoundError:
        # re-raise with the filename so the caller knows what was missing
        raise FileNotFoundError(f"No such file: {filename}")

Our second helper function splits the headlines into smaller groups. It takes one parameter, the list of entries. Inside the function we’ll create a set of headline strings – I created four, but feel free to use however many you’d like. We’ll then loop through an enumeration of our entries, which simply means we’ll also have access to each entry’s index. We’ll use the index to determine which headlines string the current entry’s headline belongs to. There are ~3200 entries in the October 2021 set, so I separated the headlines into sets of 800. Finally, we return a list with all our headline strings.

def split_headlines(entries):
    headlines1 = ""
    headlines2 = ""
    headlines3 = ""
    headlines4 = ""
    for index, entry in enumerate(entries):
        headline = entry['headline']['main']
        headline = headline.replace('.', '')
        if index < 800:
            headlines1 += headline + ". "
        elif index < 1600:
            headlines2 += headline + ". "
        elif index < 2400:
            headlines3 += headline + ". "
        else:
            headlines4 += headline + ". "
    return [headlines1, headlines2, headlines3, headlines4]
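As a side note, the four hard-coded string variables above could be generalized to split the entries into any number of chunks. Here’s a hypothetical alternative (not from the original script, just a sketch of the same idea):

```python
def split_headlines_n(entries, num_chunks=4):
    """Split headline entries into num_chunks period-delimited strings."""
    chunks = [""] * num_chunks
    # ceiling division so every entry lands in some chunk
    chunk_size = -(-len(entries) // num_chunks) if entries else 1
    for index, entry in enumerate(entries):
        headline = entry['headline']['main'].replace('.', '')
        chunks[index // chunk_size] += headline + ". "
    return chunks
```

With num_chunks=4 and ~3200 entries, this behaves like the function above, putting 800 headlines in each chunk.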

Alright, now that we’ve made our two helper functions, let’s get into the code to get our text analysis back.

Finding the Main Themes

By definition, themes are a document’s central topics. In a literary context, themes are repeated throughout the document. For our purposes, we’ll find the document’s most commonly used noun phrases to get insight into its themes, using the most_common_phrases endpoint of The Text API.

To get our themes, we’ll first extend the text_url variable we made earlier into an mcp_url variable pointing at the most_common_phrases endpoint of The Text API. Inside the function, we’ll use our helper functions to get our entries and the split-up list of headlines. Then we’ll create a list called mcps to hold the responses we get from the API. For each set of headlines in our list, we’ll construct a request body that includes the headlines and a num_phrases key indicating the number of phrases we want. The default number of phrases returned is 3, but let’s get 5 back. Once we get our response, we’ll simply add it to our mcps list. Once we’ve processed the whole list, we’ll save it to a JSON document.

mcp_url = text_url+"most_common_phrases"
def get_mcp(year, month):
    entries = get_doc(year, month)
    headlines_list = split_headlines(entries)
    mcps = []
    for headlines in headlines_list:
        body = {
            "text": headlines,
            "num_phrases": 5
        }
        res = requests.post(mcp_url, headers=headers, json=body)
        _dict = json.loads(res.text)
        mcps.append(_dict["most common phrases"])
    with open(f"{year}/{month_dict[month]}_MCPs.json", "w") as f:
        json.dump(mcps, f)
 
get_mcp(2021, 10)

Let’s take a look at our JSON, just out of curiosity.

From this we can see that the themes of October 2021 were COVID, the COVID vaccine, and THE CLIMATE! YES! I WANT TO SEE MORE NEWS ABOUT THE CLIMATE! Additionally, the last fourth of the month had a lot of news about New York City. Not surprising, given that these are headlines from the NY Times.
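Since we saved the phrases as a list of four per-chunk lists, a small helper can merge them into one de-duplicated list for a quick overview. This is a hypothetical convenience function, assuming each element of the saved mcps list is a list of phrase strings:

```python
def flatten_phrases(phrase_lists):
    """Merge per-chunk phrase lists into one de-duplicated list, preserving order."""
    seen = set()
    flat = []
    for chunk in phrase_lists:
        for phrase in chunk:
            key = phrase.lower()  # case-insensitive de-duplication
            if key not in seen:
                seen.add(key)
                flat.append(phrase)
    return flat
```

Running this over the saved JSON would give one list of themes for the whole month instead of four overlapping ones.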

Prospecting for Gold: Finding Hidden Topics

Sometimes, the least commonly mentioned things in a document are important for us to know. Often, these can be hidden themes in the document. In the case of something like a tutorial, these can also be prerequisites that we may need to research more. For this, we’ll be using the least_common_phrases endpoint of The Text API.

Before we write our function, we’ll construct our endpoint URL by extending the base text URL with least_common_phrases. As with our most common phrases function, we’ll start by getting the entries from the JSON and splitting the headlines. We’ll create a list to hold our least common phrases, then loop through our list of headlines and send the requests to get our analyzed text back. The least common phrases endpoint also takes a num_phrases parameter, which defaults to 3. Here I’ve set it to 5 again.

lcp_url = text_url+"least_common_phrases"
def get_lcp(year, month):
    entries = get_doc(year, month)
    headlines_list = split_headlines(entries)
    lcps = []
    for headlines in headlines_list:
        body = {
            "text": headlines,
            "num_phrases": 5
        }
        res = requests.post(lcp_url, headers=headers, json=body)
        _dict = json.loads(res.text)
        lcps.append(_dict["least common phrases"])
    with open(f"{year}/{month_dict[month]}_LCPs.json", "w") as f:
        json.dump(lcps, f)
 
get_lcp(2021, 10)

Once we run our function, our JSON will look something like the image below.

Pretty interesting: the least commonly mentioned phrases in the NY Times article headlines in October 2021 were money amounts. Look how much money is out there; crazy what the government is doing. I know for sure those $1 Trillion/$1 trillion figures were about the infrastructure bill.

Extracting a Summary

We all know that most documents are like 70% useless or repetitive information. Let’s cut out all that wasted time by extracting a summary that contains only the most important information. For this, we’ll hit the summarize endpoint of The Text API. This function is similar to the other two we’ve written so far: we extract the list of entries and headlines, then loop through the headlines and send a request for each one. An important difference is that we pass a proportion parameter instead of a num_phrases parameter. The proportion defaults to 0.3, which condenses the document to 30% of its original size. In this case, I don’t want to read through 900 article headlines, so I reduced this to 2.5%, or about 80 headlines.

summarizer_url = text_url+"summarize"
def summarize_headlines(year, month):
    entries = get_doc(year, month)
    headlines_list = split_headlines(entries)
    summaries = []
    for headlines in headlines_list:
        body = {
            "text": headlines,
            "proportion": 0.025
        }
        res = requests.post(summarizer_url, headers=headers, json=body)
        _dict = json.loads(res.text)
        summaries.append(_dict["summary"])
    with open(f"{year}/{month_dict[month]}_Summary.json", "w") as f:
        json.dump(summaries, f)
 
summarize_headlines(2021, 10)

To see what the summarizer produced, check out Ask NLP: What’s Going On in the News? October 2021.

Names, Places, Times, and Events

Last but not least, we’ll extract the names, places, times, and events mentioned in a document. We’ll do this with Named Entity Recognition. We’ll be looking for the names of people, nationalities, geopolitical entities, organizations, and laws. For places, we’ll look for entities classified as locations; for times, dates and times of day; and for events, entities classified as events. As a bonus, we’ll also extract the monetary amounts mentioned, since this particular document is about the news.

We’ll start our Named Entity Recognition function the same way we started the others. We’ll set up our URL first, then use the helper functions we made earlier to load our JSON file, extract the headlines, and split them into manageable blocks. Then we’ll loop through each set of headlines and ping our endpoint for the named entities in them. In the body of the NER call, we can pass a labels parameter that tells the server what kind of document we’re analyzing. In this case, I’d say a set of news headlines is similar enough to a news article to call it an article. Notice that ARTICLE is in all caps; this is the syntax the server expects.

ner_url = text_url+"ner"
def get_ner(year, month):
    entries = get_doc(year, month)
    headlines_list = split_headlines(entries)
    ners = []
    for headlines in headlines_list:
        body = {
            "text": headlines,
            "labels": "ARTICLE"
        }
        res = requests.post(ner_url, headers=headers, json=body)
        _dict = json.loads(res.text)
        ners.append(_dict["ner"])
    with open(f"{year}/{month_dict[month]}_NER.json", "w") as f:
        json.dump(ners, f)
 
get_ner(2021, 10)

When we get our JSON document back, it should look like the following image:

We can see a bunch of persons and organizations listed at the top. For this specific analysis, there were ~2000 identified entities. You’ll have to run this yourself if you want to see all of them 🙂
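If you want a quick tally instead of scrolling through ~2000 entities, you could count them by label. The sketch below assumes the ner field comes back as a list of [label, text] pairs; that shape is an assumption based on this run, so check your own response before relying on it:

```python
from collections import Counter

def count_entity_labels(ner_chunks):
    """Tally entity labels across all chunks, assuming [label, text] pairs."""
    counts = Counter()
    for chunk in ner_chunks:
        for label, _text in chunk:
            counts[label] += 1
    return counts
```

Calling count_entity_labels on the saved NER JSON would show at a glance whether PERSON, ORG, or MONEY entities dominated the month.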

Learn More

To learn more, feel free to reach out to me @yujian_tang on Twitter, connect with me on LinkedIn, and join our Discord. Remember to follow the blog to stay updated with cool Python projects and ways to level up your Software and Python skills! If you liked this article, please Tweet it, share it on LinkedIn, or tell your friends!


Yujian Tang

I started my professional software career interning for IBM in high school after winning ACSL two years in a row. I got into AI/ML in college where I published a first author paper to IEEE Big Data. After college I worked on the AutoML infrastructure at Amazon before leaving to work in startups. I believe I create the highest quality software content so that’s what I’m doing now. Drop a comment to let me know!

