
Using NLP to Get Insights from Twitter

I’m interested in analyzing the Tweets of a bunch of famous people so I can learn from them. I’ve built a program that will do this by pulling a list of recent tweets and doing some NLP on them. In this post we’re going to go over:

  • Get all the Text for a Search Term on Twitter
  • NLP Techniques to Run on Tweets
    • Summarization
    • Most Common Phrases
    • Named Entity Recognition
    • Sentiment Analysis
  • Running all the NLP Techniques Concurrently
  • Further Text Processing
    • Finding the Most Commonly Named Entities
  • Orchestration
  • A Summary

To follow along you’ll need a free API key from The Text API and to install the requests and aiohttp libraries with the following line in your terminal:

pip install requests aiohttp

Overview of Project Structure

In this project we’re going to create multiple files and folders. We’re going to create a file for getting all the text called pull_tweets.py. We’ll create a totally separate folder for the text processing, and we’ll have three files in there. Those three files are async_pool.py for sending the text processing requests, ner_processing.py for further text processing after doing NER, and a text_orchestrator.py for putting the text analysis together.
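Putting that together, the project layout looks roughly like the sketch below. The config files holding the API keys are implied by the imports later in the post, and the folder name text_processing is just a placeholder for whatever you call the text processing package.

twitter_nlp/                  # any project root name works
    pull_tweets.py            # pulls and joins recent tweets for a search term
    twitter_config.py         # holds the Twitter bearer token
    text_processing/          # placeholder name for the text processing package
        text_config.py        # holds The Text API key
        async_pool.py         # sends the NLP requests concurrently
        ner_processing.py     # finds the most common named entities
        text_orchestrator.py  # ties the text analysis together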

Get all the Text for a Search Term on Twitter

We went over how to Scrape the Text from All Tweets for a Search Term in a recent post. For the purposes of this program, we’ll do almost the exact same thing with a twist. I’ll give a succinct description of what we’re doing in the code here. You’ll have to go read that post for a play-by-play of the code. This is the pull_tweets.py file.

First we’ll import our libraries and bearer token. Then we’ll set up the request and headers and create a function to search Twitter. Our function will check if our search term is a user or not by checking to see if the first character is the “@” symbol. Then we’ll create our search body and send off the request. When we get the request back, we’ll parse it into JSON and compile all the Tweets into one string. Finally, we’ll return that string.

import requests
import json
 
from twitter_config import bearertoken
 
search_recent_endpoint = "https://api.twitter.com/2/tweets/search/recent"
headers = {
    "Authorization": f"Bearer {bearertoken}"
}
 
# automatically builds a search query from the requested term
# looks for english tweets with no links that are not retweets
# returns the tweets
def search(term: str):
    if term[0] == '@':
        params = {
            "query": f'from:{term[1:]} lang:en -has:links -is:retweet',
            'max_results': 25
        }
    else:
        params = {
            "query": f'{term} lang:en -has:links -is:retweet',
            'max_results': 25
        }
    response = requests.get(url=search_recent_endpoint, headers=headers, params=params)
    res = json.loads(response.text)
    tweets = res["data"]
    text = ". ".join( for tweet in tweets])
    return text
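As a quick sanity check, you can call search with either a handle or a plain search term. The handle below is purely a placeholder; substitute any real account.

if __name__ == "__main__":
    # "@some_famous_person" is a hypothetical handle, not a real account
    tweets_text = search("@some_famous_person")
    # print the first few hundred characters of the joined tweets
    print(tweets_text[:280])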

NLP Techniques to Run on Tweets

There are a ton of different NLP techniques we can run: Named Entity Recognition, polarity analysis, summarization, and much more. Remember what we’re trying to do here. We’re trying to get some insight from these Tweets. With this in mind, for this project we’ll summarize the tweets, find the most common phrases, do named entity recognition, and run sentiment analysis.

We’re going to run all of these concurrently with asynchronous API requests. In the following sections we’re just going to set up the API requests. The first thing we’ll do is set up the values that are constant among all the tweets. This is creating the async_pool.py file.

Setup Constants

Before we can set up our requests, we have to set up the constants for them. We’ll also do the imports for the rest of the async_pool.py function. First, we’ll import the asyncio, aiohttp, and json libraries. We’ll use the asyncio and aiohttp libraries for the async API calls later. We’ll also import our API key that we got earlier from The Text API.

We need to set up the headers for our requests. The headers will tell the server that we’re sending JSON data and also pass the API key. Then we’ll set up the API endpoints. The API endpoints that we’re hitting are the summarize, ner, most_common_phrases, and text_polarity API endpoints.

import asyncio
import aiohttp
import json
 
from .text_config import apikey
 
# configure request constants
headers = {
    "Content-Type": "application/json",
    "apikey": apikey
}
text_url = "https://app.thetextapi.com/text/"
summarize_url = text_url+"summarize"
ner_url = text_url+"ner"
mcp_url = text_url+"most_common_phrases"
polarity_url = text_url+"text_polarity"

Summarize the Tweets

We’ll set up a function that builds and returns the request bodies so we can use them later. We only need one parameter for this function: the text that we’re going to send. The first thing we’ll do in this function is set up an empty dictionary. Next we’ll set up the body to send to the summarize endpoint. The summarize body contains the text and tells the server that we want a summary with a proportion of 0.1 of the original Tweets.

def configure_bodies(text: str):
    _dict = {}    
    _dict[summarize_url] = {
        "text": text,
        "proportion": 0.1
    }

Find Most Common Phrases

After setting up the summarization body, we will set up the most_common_phrases body. This request will send the text and set the number of phrases to 5.

    _dict[mcp_url] = {
        "text": text,
        "num_phrases": 5
    }

Named Entity Recognition

Now we’ve set up the summarization and most common phrases request bodies. After those, we’ll set up the NER request body. The NER request body will pass the text and tell the server that we’re sending an “ARTICLE” type. The “ARTICLE” type returns people, places, organizations, locations, and times.

    _dict[ner_url] = {
        "text": text,
        "labels": "ARTICLE"
    }

Sentiment Analysis

We’ve now set up the summarization, most common phrases, and named entity recognition request bodies. Next is the sentiment analysis, or text polarity, body; those terms are basically interchangeable. This request will just send the text in the body. We don’t need to specify any other optional parameters here. We’ll return the dictionary we created after setting this body.

    _dict[polarity_url] = {
        "text": text
    }
    return _dict

Full Code for Configuring Requests

Here’s the full code for configuring the request bodies.

# configure request bodies
# return a dict of url: body
def configure_bodies(text: str):
    _dict = {}
    _dict[summarize_url] = {
        "text": text,
        "proportion": 0.1
    }
    _dict[ner_url] = {
        "text": text,
        "labels": "ARTICLE"
    }
    _dict[mcp_url] = {
        "text": text,
        "num_phrases": 5
    }
    _dict[polarity_url] = {
        "text": text
    }
    return _dict

Run All NLP Techniques Concurrently

For a full play-by-play of this code check out how to send API requests asynchronously. I’ll go over an outline here. This is almost the exact same code with a few twists. This consists of three functions, gather_with_concurrency, post_async, and pool.

First, we’ll look at the gather_with_concurrency function. This function takes two parameters, the number of concurrent tasks, and the list of tasks. All we’ll do in this function is set up a semaphore to asynchronously execute these tasks. At the end of the function, we’ll return the gathered tasks.

Next we’ll create the post_async function. This function will take four parameters, the url, session, headers, and body for the request. We’ll asynchronously use the session passed in to execute a request. We’ll return the text after getting the response back.

Finally, we’ll create a pool function to execute all of the requests concurrently. This function will take one parameter, the text we want to process. We’ll create a connection and a session and then use the configure_bodies function to get the request bodies. Next, we’ll use the gather_with_concurrency and post_async functions to execute all the requests asynchronously. Finally, we’ll close the session and return the summary, most common phrases, recognized named entities, and polarity.

# configure async requests
# configure gathering of requests
async def gather_with_concurrency(n, *tasks):
    semaphore = asyncio.Semaphore(n)
    async def sem_task(task):
        async with semaphore:
            return await task
   
    return await asyncio.gather(*(sem_task(task) for task in tasks))
 
# create async post function
async def post_async(url, session, headers, body):
    async with session.post(url, headers=headers, json=body) as response:
        text = await response.text()
        return json.loads(text)
   
async def pool(text):
    conn = aiohttp.TCPConnector(limit=None, ttl_dns_cache=300)
    session = aiohttp.ClientSession(connector=conn)
    urls_bodies = configure_bodies(text)
    conc_req = 4
    summary, ner, mcp, polarity = await gather_with_concurrency(conc_req, *[post_async(url, session, headers, body) for url, body in urls_bodies.items()])
    await session.close()
    return summary["summary"], ner["ner"], mcp["most common phrases"], polarity["text polarity"]

Full Code for Asynchronously Executing all NLP techniques

Here’s the full code for async_pool.py.

import asyncio
import aiohttp
import json
 
from .text_config import apikey
 
# configure request constants
headers = {
    "Content-Type": "application/json",
    "apikey": apikey
}
text_url = "https://app.thetextapi.com/text/"
summarize_url = text_url+"summarize"
ner_url = text_url+"ner"
mcp_url = text_url+"most_common_phrases"
polarity_url = text_url+"text_polarity"
 
# configure request bodies
# return a dict of url: body
def configure_bodies(text: str):
    _dict = {}
    _dict[summarize_url] = {
        "text": text,
        "proportion": 0.1
    }
    _dict[ner_url] = {
        "text": text,
        "labels": "ARTICLE"
    }
    _dict[mcp_url] = {
        "text": text,
        "num_phrases": 5
    }
    _dict[polarity_url] = {
        "text": text
    }
    return _dict
 
# configure async requests
# configure gathering of requests
async def gather_with_concurrency(n, *tasks):
    semaphore = asyncio.Semaphore(n)
    async def sem_task(task):
        async with semaphore:
            return await task
   
    return await asyncio.gather(*(sem_task(task) for task in tasks))
 
# create async post function
async def post_async(url, session, headers, body):
    async with session.post(url, headers=headers, json=body) as response:
        text = await response.text()
        return json.loads(text)
   
async def pool(text):
    conn = aiohttp.TCPConnector(limit=None, ttl_dns_cache=300)
    session = aiohttp.ClientSession(connector=conn)
    urls_bodies = configure_bodies(text)
    conc_req = 4
    summary, ner, mcp, polarity = await gather_with_concurrency(conc_req, *[post_async(url, session, headers, body) for url, body in urls_bodies.items()])
    await session.close()
    return summary["summary"], ner["ner"], mcp["most common phrases"], polarity["text polarity"]

Further Text Processing

After doing the initial NLP we’ll still get some text back. We can continue doing some NLP on the summarization, most common phrases, and the named entities. Let’s go back to what we’re trying to do – get insights. The summary will help us get a general idea, the most common phrases will tell us what the most commonly said things are, but the NER is a little too broad still. Let’s further process the NER by finding the most commonly named entities.

Most Commonly Named Entities

For a play-by-play of this code, read the post on how to Find the Most Common Named Entities of Each Type. I’m going to give a high-level overview here. We’re going to build two functions, build_dict to split the named entities into each type, and most_common to sort that dictionary.

The build_dict function will take one parameter, ners, a list of lists. We’ll start off this function by creating an empty dictionary. Then we’ll loop through the list of ners and add those to the dictionary based on whether or not we’ve seen the type and name of the ner.

The most_common function will take one parameter as well, ners, a list of lists. The first thing we’ll do with this function is call build_dict to create the dictionary. Then, we’ll initialize an empty dictionary. Next, we’ll loop through the dictionary and sort each list of NER types. Finally, we’ll add the most common names in each type to the initialized dictionary and return that.

# build dictionary of NERs
# extract most common NERs
# expects list of lists
def build_dict(ners: list):
    outer_dict = {}
    for ner in ners:
        entity_type = ner[0]
        entity_name = ner[1]
        if entity_type in outer_dict:
            if entity_name in outer_dict[entity_type]:
                outer_dict[entity_type][entity_name] += 1
            else:
                outer_dict[entity_type][entity_name] = 1
        else:
            outer_dict[entity_type] = {
                entity_name: 1
            }
    return outer_dict
 
# return most common entities after building the NERS out
def most_common(ners: list):
    _dict = build_dict(ners)
    mosts = {}
    for ner_type in _dict:
        sorted_types = sorted(_dict[ner_type], key=lambda x: _dict[ner_type][x], reverse=True)
        mosts[ner_type] = sorted_types[0]
    return mosts
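To make the expected input format concrete, here’s a tiny example with made-up entities. Each inner list is a [type, name] pair like the ones the NER endpoint returns.

sample_ners = [
    ["PERSON", "Ada Lovelace"],
    ["ORG", "Acme Corp"],
    ["PERSON", "Ada Lovelace"],
    ["PERSON", "Grace Hopper"],
]
print(most_common(sample_ners))
# {'PERSON': 'Ada Lovelace', 'ORG': 'Acme Corp'}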

Orchestration

Finally, we’ll orchestrate our functions. We’ll start by importing the asyncio library and the two functions we need to orchestrate: pool and most_common. We’ll create one function, orchestrate_text_analysis, which will take one parameter, text.

The first thing we’ll do in our orchestrator is get the summary, NERs, most common phrases, and text polarity using asyncio to execute the four NLP techniques concurrently. Then, we’ll do more text processing on the NERs. We’ll also replace the newlines in the summary to make it more readable. Finally, we’ll return the summary, most common entities, most common phrases, and sentiment.

import asyncio
 
from .async_pool import pool
from .ner_processing import most_common
 
def orchestrate_text_analysis(text:str):
    """Step 1"""
    # task to execute all requests
    summary, ner, mcp, polarity = asyncio.get_event_loop().run_until_complete(pool(text))
   
    """Step 2"""
    # do NER analysis
    most_common_ners = most_common(ner)
    summary = summary.replace("\n", "")
    return summary, most_common_ners, mcp, polarity
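Here’s a minimal sketch of how the pieces could be wired together, assuming pull_tweets.py sits at the project root and the text processing files live in a package folder named text_processing (that name is just for illustration). The handle is a placeholder.

# main.py - a minimal sketch, not part of the files above
from pull_tweets import search
from text_processing.text_orchestrator import orchestrate_text_analysis

if __name__ == "__main__":
    text = search("@some_famous_person")  # hypothetical handle
    summary, most_common_ners, mcp, polarity = orchestrate_text_analysis(text)
    print("Summary:", summary)
    print("Most common named entities:", most_common_ners)
    print("Most common phrases:", mcp)
    print("Text polarity:", polarity)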

Summary

In this post we went over how to pull Tweets for a search term and transform that into a text. Then, we went over how to asynchronously call four APIs to run NLP on the Tweets. Next, we went over how to do some further text processing. Finally, we went over how to orchestrate the NLP on the text. I’ll be using this program to get insights from some people I want to be like on Twitter.

I run this site to help you and others like you find cool projects and practice software skills. If this is helpful for you and you enjoy your ad free site, please help fund this site by donating below! If you can’t donate right now, please think of us next time.


Ask AI: What Does US News Say About Top Colleges? Part 3

This is a follow up to the Ask NLP projects on What Does US News Say About Top Colleges. Top Colleges Part 1 was a naive analysis without data cleaning where we pulled the most positive and most negative sentences about each college and found tons of repeat data. In Top Colleges Part 2, we cleaned the data and repeated what we did in Part 1. We found that cleaning the data by removing repeated sentences gave us much more interpretable results. For this episode, part 3, we’ll assume that you’ve already cleaned your data as we did in Part 2. This episode is focused on finding the most common phrases and producing objective summaries of each school. To get the data that we’ve been analyzing, check out Web Scraping the Easy Way with Python, Selenium, and Beautiful Soup 4.

To follow this tutorial, you’ll need to get your free API key from The Text API. Simply scroll down the page and click “Get My Free API Key”.

Using Python to Summarize a Text and Get the Most Common Phrases

Alright so from this point, we’ll assume that you’ve already gotten the data, and used the code in Part 2 to clean it. What we’re about to program is very similar to what we programmed in Part 1. We’ll start our program with our imports as usual. We need requests to send our HTTP requests and json to parse the JSON response. We also need to import our Text API key. You can get your free API key from The Text API.

import requests
import json
 
from text_api_config import apikey

Setting Up the API Requests

Let’s start by setting up our API requests. We’ll need to create headers, define the URLs, and create a JSON body. The JSON body will be different for each text that we analyze, but the headers and URLs can be reused. We’ll declare the headers and URL here. The headers will tell the server that the content type we’re sending is JSON and also pass in the API key. The URLs we’ll be using are the most_common_phrases and summarize endpoints.

headers = {
    "Content-Type": "application/json",
    "apikey": apikey
}
text_url = "https://app.thetextapi.com/text/"
mcps_url = text_url + "most_common_phrases"
summarize_url = text_url + "summarize"

Now let’s declare our filenames that we’ll be saving our results to and the filenames of the files we’ll be reading our text from. We’ll also put all the filenames for the colleges in a list. Later we’ll loop through this list to read each of the text files in.

mcps_filename = "mcps_universities.json"
summaries_filename = "university_summaries.json"
 
caltech = "california-institute-of-technology-1131.txt"
columbia = "columbia-university-2707.txt"
duke = "duke-university-2920.txt"
harvard = "harvard-university-2155.txt"
mit = "massachusetts-institute-of-technology-2178.txt"
princeton = "princeton-university-2627.txt"
stanford = "stanford-university-1305.txt"
uchicago = "university-of-chicago-1774.txt"
penn = "university-of-pennsylvania-3378.txt"
yale = "yale-university-1426.txt"
 
university_files = [caltech, columbia, duke, harvard, mit, princeton, stanford, uchicago, penn, yale]

Calling the Summarize and Most Common Phrases Endpoints

We’ll start by creating two dictionaries for the summaries and the most common endpoints.

mcps = {}
summaries = {}

Then we’ll loop through that list of filenames for each of the colleges we scraped earlier. For each file, we’ll open it up and read it in as the text. Then we’ll create a body and send the text to the summarize endpoint. Before we send a request to the most_common_phrases endpoint, we’ll have to clean our text a little more. How do I know that we’ve got to clean these phrases out of the text? Because I ran it once without removing them and saw these phrases in six or seven of the ten results, which is a clear indicator that they’re repetitive phrases that shouldn’t be considered.

for university in university_files:
    with open(university, "r") as f:
        text = f.read()
    body = {
        "text": text
    }
    response = requests.post(url=summarize_url, headers=headers, json=body)
    _dict = json.loads(response.text)
    summaries[university] = _dict["summary"]
	
    text = text.replace("\n", "")
    text = text.replace("#", "")
    text = text.replace("Students", "")
    text = text.replace("student satisfaction", "")
    body = {
        "text": text
    }
 
    response = requests.post(url=mcps_url, headers=headers, json=body)
    _dict = json.loads(response.text)
    mcps[university] = _dict["most common phrases"]

Saving Our Results in a JSON

After we’ve looped through all the files, we’ll have all our data saved in the dictionaries we created earlier. All we have to do is open them up and use json.dump to save each dictionary into a file.

with open(mcps_filename, "w") as f:
    json.dump(mcps, f)
 
with open(summaries_filename, "w") as f:
    json.dump(summaries, f)

The Most Common Phrases for Each Top 10 College

Now that we’ve finished coding up our Python program to get the most common phrases and summaries of the top 10 colleges, let’s take a look at our results. We’ll only check out the most common phrases here. The summaries are already up in another article on the AI Summaries of the Top 10 Schools in America. 

Let’s take a look at what we can learn from this analysis.

  • Caltech has student waiters – apparently it’s a tradition to have student waiters serve dinners for student dining.
  • Looks like Barack Obama went to Columbia
  • Duke has a lot of schools
  • Harvard has a lot of schools
  • MIT really is an engineering school
  • US News has nothing of importance to say about Princeton
  • Stanford has an emphasis around student organizations
  • UChicago has a variety of programs mentioned from Computer Science to Public Policy to Study Abroad
  • Penn is focused on the sciences
  • (and finally) Yale has a focus on student organizations and student service

Caltech:

  • “Student houses”
  • “Caltech alumni”
  • “student waiters”

Columbia:

  • “former President Barack Obama”
  • “Business School”
  • “Columbia University admissions”

Duke:

  • “Sanford School”
  • “Nicholas School”
  • “Pratt School”

Harvard:

  • “John F. Kennedy School”
  • “Graduate Education School”
  • “Business School”

MIT:

  • “Biomedical Engineering”
  • “Engineering”
  • “Mechanical Engineering”

Princeton:

  • “students”
  • “undergraduate students”
  • “Princeton University admissions”

Stanford:

  • “Stanford University admissions”
  • “Graduate School”
  • “student organizations”

UChicago: (No problems this time around!)

  • “study abroad experiences”
  • “Public Policy Analysis”
  • “Computer Science”

Penn:

  • “Physical Sciences”
  • “Social Sciences”
  • “Sciences”

Yale:

  • “student organizations”
  • “student service”
  • “Yale University admissions”

To learn more feel free to reach out to me @yujian_tang on Twitter, connect with me on LinkedIn, and join our Discord. Remember to follow the blog to stay updated with cool Python projects and ways to level up your Python skills!



AI Summaries of the Top 10 Schools in America

These are the summaries of the Top 10 schools in America as summarized by The Text API, the most comprehensive sentiment analysis model. This is part of a series on the Top Colleges in the US. Part 1 is a naïve analysis. Part 2 is a cleaned data analysis. Finally, part 3 shows you how we got these summaries. This text was originally scraped from US News.

Caltech

It has a total undergraduate enrollment of 901 (fall 2020), its setting is suburban, and the campus size is 124 acres. It utilizes a quarter-based academic calendar. Its tuition and fees are $58,680. Caltech, which focuses on science and engineering, is located in Pasadena, California, approximately 11 miles northeast of Los Angeles. Social and academic life at Caltech centers on 11 student residences and houses, which the school describes as “self-governing living groups.” The Caltech Beavers have a number of NCAA Division III teams that compete in the Southern California Intercollegiate Athletic Conference. Caltech maintains a strong tradition of pranking with the Massachusetts Institute of Technology, another top-ranked science and technology university. Famous film director Frank Capra also graduated from Caltech. California Institute of Technology admissions is most selective with an acceptance rate of 7%.

Columbia

Columbia University is a private institution that was founded in 1754. It has a total undergraduate enrollment of 6,170 (fall 2020), its setting is urban, and the campus size is 36 acres. It utilizes a semester-based academic calendar. Its tuition and fees are $63,530. The university also has a well-regarded College of Dental Medicine and graduate Journalism School. Columbia offers a range of student activities, including 28 Greek chapters. More than 90% of students live on campus. Columbia also administers the Pulitzer Prizes. The average freshman retention rate, an indicator of student satisfaction, is 98%.

Duke

Duke University is a private institution that was founded in 1838. It has a total undergraduate enrollment of 6,717 (fall 2020), and the setting is Suburban. It utilizes a semester-based academic calendar. Its tuition and fees are $60,489. Its “Bull City” nickname comes from the Blackwell Tobacco Company’s Bull Durham Tobacco. Approximately 30 percent of the student body is affiliated with Greek life, which encompasses almost 40 fraternities and sororities. It provides about 18 students from each class with a four-year scholarship and the opportunity for unique academic and extracurricular opportunities at both universities.

Harvard

It has a total undergraduate enrollment of 5,222 (fall 2020), its setting is urban, and the campus size is 5,076 acres. It utilizes a semester-based academic calendar. Its tuition and fees are $55,587. The school was initially created to educate members of the clergy, according to the university’s archives. The first commencement ceremony at Harvard, held in 1642, had nine graduates. Eight U.S. presidents graduated from Harvard, including Franklin Delano Roosevelt and John F. Kennedy. Harvard also has the largest endowment of any school in the world.    

MIT

It has a total undergraduate enrollment of 4,361 (fall 2020), its setting is urban, and the campus size is 168 acres. It utilizes a 4-1-4-based academic calendar. Its tuition and fees are $55,878. Located outside Boston in Cambridge, Massachusetts, MIT focuses on scientific and technological research and is divided into five schools. Freshmen are required to live on campus, and about 70% of all undergraduates live on campus. Architect Steven Holl designed one dorm, commonly called “The Sponge.” The Independent Activities Program, a four-week term in January, offers special courses, lectures, competitions and projects.

Princeton

It has a total undergraduate enrollment of 4,773 (fall 2020), its setting is suburban, and the campus size is 600 acres. It utilizes a semester-based academic calendar. Its tuition and fees are $56,010. Princeton, among the oldest colleges in the U.S., is located in the quiet town of Princeton, New Jersey. Within the walls of its historic ivy-covered campus, Princeton offers a number of events, activities and organizations. The Princeton Tigers, members of the Ivy League, are well known for their consistently strong men’s and women’s lacrosse teams. The eating clubs serve as social and dining organizations for the students who join them.    

Stanford

Stanford University is a private institution that was founded in 1885. It has a total undergraduate enrollment of 6,366 (fall 2020), its setting is suburban, and the campus size is 8,180 acres. It utilizes a quarter-based academic calendar. Its tuition and fees are $56,169. The Stanford Cardinal are well known for the traditional “Big Game” against Cal, an annual football competition that awards the Stanford Axe — a sought-after trophy — to the victor. Stanford also has successful programs in tennis and golf. Greek life at Stanford represents approximately 25 percent of the student body. Four of Stanford University’s seven schools offer undergraduate and graduate coursework, and the remaining three serve as purely graduate schools. Stanford has a number of well-known theatrical and musical groups, including the Ram’s Head Theatrical Society and the Mendicants, an all-male a cappella group.

UChicago

It utilizes a quarter-based academic calendar. Its tuition and fees are $60,963. The university offers more than 450 student organizations. It utilizes a quarter-based academic calendar. UChicago is also renowned for the unparalleled resources it provides its undergraduate students. It utilizes a quarter-based academic calendar. UChicago is also renowned for the unparalleled resources it provides its undergraduate students. It utilizes a quarter-based academic calendar. UChicago is also renowned for the unparalleled resources it provides its undergraduate students. It utilizes a quarter-based academic calendar. Its tuition and fees are $59,298 (2019-20).The UChicago is also renowned for the unparalleled resources it provides its undergraduate students.

Penn

It has a total undergraduate enrollment of 9,872 (fall 2020), its setting is urban, and the campus size is 299 acres. It utilizes a semester-based academic calendar. Its tuition and fees are $61,710. The Penn Quakers have more than 25 NCAA Division I sports that compete in the Ivy League, and are noted for successful basketball and lacrosse teams. Penn works closely with the West Philadelphia area through community service and advocacy groups. Penn has 12 schools: Five offer undergraduate and graduate studies, and seven offer only graduate studies. More than 2,500 students each year participate in international study programs offered in more than 50 countries around the world.  University of Pennsylvania admissions is most selective with an acceptance rate of 9%.  

Yale

Yale University is a private institution that was founded in 1701. It has a total undergraduate enrollment of 4,703 (fall 2020), its setting is city, and the campus size is 373 acres. It utilizes a semester-based academic calendar. Its tuition and fees are $59,950. The Yale Bulldogs compete in the Ivy League and are well known for their rivalry with Harvard. Yale is made up of the College, the Graduate School of Arts and Sciences and 12 professional schools. The Yale Record is the oldest college humor magazine in the nation.  The application deadline is Jan. 2 and the application fee at Yale University is $80.



Natural Language Processing: What is Text Polarity?

Natural Language Processing (NLP) and all of its applications will be huge in the 2020s. A lot of my blogging is about text processing and all the things that go with it, such as Named Entity Recognition and Part of Speech Tagging. Text polarity is a basic text processing technique that gives us insight into how positive or negative a text is. The polarity of a text is essentially its “sentiment” rating from -1 to 1.

Overview of Text Polarity

In this post we’ll cover:

  • What is Text Polarity?
  • How to Get Text Polarity with spaCy
  • How to Get Text Polarity with NLTK
  • How to Get Text Polarity with a web API
  • Why are these Text Polarity Numbers so Different?

What is Text Polarity?

In short, text polarity is a measure of how negative or how positive a piece of text is. Polarity is the measure of the overall combination of the positive and negative emotions in a sentence. It’s notoriously hard for computers to predict this; in fact, it’s even hard for people to predict this over text. Check out the following Key and Peele video for an example of what I mean.

Most of the time, NLP models can predict simply positive or negative words and phrases quite well. For example, the words “amazing”, “superb”, and “wonderful” can easily be labeled as highly positive. The words “bad”, “sad”, and “mad” can easily be labeled as negative. However, we can’t just look at polarity from the frame of individual words; it’s important to take a larger context into account when evaluating total polarity. For example, the word “bad” may be negative, but what about the phrase “not bad”? Is that neutral? Or is that the opposite of bad? At this point we’re getting into linguistics and semantics rather than natural language processing.

Due to the nature of language and how words around each other can modify their meaning and polarity, when I personally implemented text polarity for The Text API, I used a combination of total text polarity and the polarity of individual phrases in it. The two biggest open source libraries for NLP in Python are spaCy and NLTK, and both of these libraries measure polarity on a normalized scale of -1 to 1. The Text API measures, combines, and normalizes the polarity of the overall text, individual sentences, and individual phrases. This gives a better picture of the relative polarities of texts by not penalizing longer texts that express positive or negative emotion at scale but also contain neutral phrases. Let’s take a look at how we can implement text polarity with the libraries and API I mentioned above!
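Before we do, here’s a rough sketch of what combining scores can look like, blending whole-text polarity with per-sentence polarities using TextBlob (pip install textblob). This is only an illustration of the concept, not The Text API’s actual implementation.

from textblob import TextBlob

def blended_polarity(text: str) -> float:
    """Average the whole-text polarity with the mean of the non-neutral sentence polarities."""
    blob = TextBlob(text)
    overall = blob.sentiment.polarity
    sentence_scores = [s.sentiment.polarity for s in blob.sentences]
    # ignore perfectly neutral sentences so they don't drag long texts toward 0
    non_neutral = [p for p in sentence_scores if p != 0] or [0.0]
    return (overall + sum(non_neutral) / len(non_neutral)) / 2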

How to Get Text Polarity with spaCy

To get started with spaCy we’ll need to download two spaCy libraries with pip in our terminal as shown below:

pip install spacy spacytextblob

We’ll also need to download a model. As usual we’ll download the `en_core_web_sm` model to get started. Run the below command in the terminal after the pip installs are finished:

python -m spacy download en_core_web_sm

Now that we’ve downloaded our libraries and model, let’s get started with our code. We’ll need to import `spacy` and `SpacyTextBlob` from `spacytextblob.spacytextblob`. SpacyTextBlob is the pipeline component that we’ll be using to get polarity. We’ll start our program by loading the model we downloaded earlier and then adding the `spacytextblob` pipe to the `nlp` pipeline. Notice that we never actually explicitly call the `SpacyTextBlob` module, but rather pass it in as a string to `nlp`. If you’re using VSCode, you’ll see that `SpacyTextBlob` is grayed out like it’s not being used, but don’t be fooled: we need this import in order to add the pipeline component even though we don’t call it directly.

Next we’ll choose a text to process. For this example, I simply wrote two decently positive sentences on The Text API, which we’ll show an example for later. Then all we have to do is send the text to a document via our `nlp` object and check its polarity score.

import spacy
from spacytextblob.spacytextblob import SpacyTextBlob
 
nlp = spacy.load("en_core_web_sm")
nlp.add_pipe('spacytextblob')
text = "The Text API is super easy to use and super useful for anyone who needs to do text processing. It's the best Text Processing web API and allows you to do amazing NLP without having to download or manage any models."
doc = nlp(text)
 
print(doc._.polarity)

Our spaCy model predicted our text’s polarity score at 0.5. It’s hard to really judge how “accurate” the polarity of something is, so we’ll go through the other two methods and I’ll comment on this later.

[Image: text polarity output from spaCy]

How to Get Text Polarity with NLTK

Now that we’ve covered how to get polarity via spaCy, let’s check out how to get polarity with the Natural Language Toolkit. As always, we’ll start out by installing the library and dependencies we’ll need.

pip install nltk

Once we install NLTK, we’ll fire up an interactive Python shell in the command line to install the NLTK modules that we need with the commands below.

python
>>> import nltk
>>> nltk.download(["averaged_perceptron_tagger", "punkt", "vader_lexicon"])

Averaged Perceptron Tagger handles part of speech tagging. It’s the best tagger in the NLTK library at the time of writing, so you’ll probably use it for something else as well as polarity. Punkt is for recognizing punctuation. I know what you’re thinking:

Vader

But no, the VADER lexicon actually stands for “Valence Aware Dictionary and sEntiment Reasoner”. It’s the library that provides the sentiment analysis tool we need. Once we have all of these installed, it’s pretty simple to just import the library and call it. We import the `SentimentIntensityAnalyzer` class, create an analyzer, and call its `polarity_scores` method on our text to score it on polarity.

from nltk.sentiment import SentimentIntensityAnalyzer
 
sia = SentimentIntensityAnalyzer()
text = "The Text API is super easy to use and super useful for anyone who needs to do text processing. It's the best Text Processing web API and allows you to do amazing NLP without having to download or manage any models."
scores = sia.polarity_scores(text)
print(scores)

We should get a print out like the one below.

[Image: NLTK text polarity output]

This result tells us that none of the text is negative, 61.8% is neutral, and 38.2% of it is positive. Compound is a normalized sentiment score that you can see calculated in the VADER package on GitHub. It’s calculated before the negative, neutral, and positive scores, and represents a normalized polarity score of the sentence. So NLTK has calculated our sentence to be very positive.
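If you just want a single label out of that dictionary, a common convention (suggested in the VADER documentation, though nothing in NLTK requires it) is to threshold the compound score at ±0.05:

# turn the compound score into a coarse positive/negative/neutral label
compound = scores["compound"]
if compound >= 0.05:
    label = "positive"
elif compound <= -0.05:
    label = "negative"
else:
    label = "neutral"
print(label)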

How to Get Text Polarity with The Text API

Finally, let’s take a look at how to get a text polarity score from The Text API. A major advantage of using a web API like The Text API to do text processing is that you don’t need to download any machine learning libraries or maintain any models. All you need is the requests library, which you can install with the pip command below if you don’t already have it, and a free API key from The Text API website.

pip install requests

When you land on The Text API’s homepage you should scroll all the way down and you’ll see a button that you can click to sign up for your free API key. 

Once you log in, your API key will be right at the top of the page. Now that we’re all set up, let’s take a dive into the code. All we’re going to do is set up a request: headers that tell the server we’re sending JSON and pass the API key, a body with the text we want to analyze, and the URL endpoint we’re going to hit (in this case "https://app.thetextapi.com/text/text_polarity"). Then we’ll send the request and parse the response.

import requests
import json
from config import apikey
 
text = "The Text API is super easy to use and super useful for anyone who needs to do text processing. It's the best Text Processing web API and allows you to do amazing NLP without having to download or manage any models."
headers = {
    "Content-Type": "application/json",
    "apikey": apikey
}
body = {
    "text": text
}
url = "https://app.thetextapi.com/text/text_polarity"
 
response = requests.post(url, headers=headers, json=body)
polarity = json.loads(response.text)["text polarity"]
print(polarity)

Once we send off our request we’ll get a response that looks like the following:

[Image: The Text API text polarity output]

The Text API thinks that my praise of The Text API is roughly 0.575 polarity, which translates to something like ~79% AMAZING (if 1 is AMAZING).

Why Are These Polarities So Different?

Earlier I mentioned that we’d discuss the different polarity scores at the end so here we are. We used three different methods to get the polarity of the same document of text, so why were our polarity scores so different? The obvious answer is that each method used a) a different model and b) a different way to calculate document polarity. However, there’s also another underlying factor at play here.

Remember that Key and Peele video earlier? It’s hard even for people to understand the polarity of comments, even with context, and machines don’t have the ability to understand context yet. A range of -1 to 1 is also hard to interpret without examples of what a polarity of 1 or -1 actually looks like. However, all three methods at least agree that the text is quite positive in general. Of course there are ways to improve the interpretability of these results, but that will be in a coming post!



The Best Way to do Named Entity Recognition (NER)

Named Entity Recognition (NER) is a common Natural Language Processing technique. It’s so often used that it comes in the basic pipeline for spaCy. NER can help us quickly parse out a document for all the named entities of many different types. For example, if we’re reading an article, we can use named entity recognition to immediately get an idea of the who/what/when/where of the article.

In this post we’re going to cover three different ways you can implement NER in Python. We’ll be going over:

  • What is Named Entity Recognition?
  • A list of common named entities
  • Named Entity Recognition with spaCy
  • Named Entity Recognition with NLTK
  • A simpler and more accurate NER implementation with a web API

What is Named Entity Recognition?

Named Entity Recognition, or NER for short, is the Natural Language Processing (NLP) topic about recognizing entities in a text document or speech file. Of course, this is quite a circular definition. In order to understand what NER really is, we’ll have to define what an entity is. For the purposes of NLP, an entity is essentially a noun that defines an individual, group of individuals, or a recognizable object. While there is not a TOTAL consensus on what kinds of entities there are, I’ve compiled a rather complete list of the possible types of entities that popular NLP libraries such as spaCy or Natural Language Toolkit (NLTK) can recognize. You can find the GitHub repo here.

List of Common Named Entities

  • PERSON: A person, usually recognized as a first and last name
  • NORP: Nationalities or Religious/Political Groups
  • FAC: The name of a Facility
  • ORG: The name of an Organization
  • GPE: The name of a Geopolitical Entity
  • LOC: A location
  • PRODUCT: The name of a product
  • EVENT: The name of an event
  • WORK OF ART: The name of a work of art
  • LAW: A law that has been published (US only as far as I know)
  • LANGUAGE: The name of a language
  • DATE: A date; it doesn’t have to be an exact date, it could be a relative date like “a day ago”
  • TIME: A time; like DATE, it doesn’t have to be exact, it could be something like “middle of the day”
  • PERCENT: A percentage
  • MONEY: An amount of money, like “$100”
  • QUANTITY: Measurements of weight or distance
  • CARDINAL: A number, similar to QUANTITY but not a measurement
  • ORDINAL: A number signifying a relative position, such as “first” or “second”

How Can I Implement NER in Python?

Earlier, I mentioned that you can implement NER with both spaCy and NLTK. The difference between these libraries is that NLTK is built for academic/research purposes and spaCy is built for production purposes. Both are free to use open source libraries. NER is extremely easy to implement with these open source libraries. In this article I will show you how to get started implementing your own Named Entity Recognition programs.

spaCy Named Entity Recognition (NER)

We’ll start with spaCy. To get started, run the commands below in your terminal to install the library and download a starter model.

pip install spacy
python -m spacy download en_core_web_sm

We can implement NER in spaCy in just a few lines of code. All we need to do is import the spacy library, load a model, give it some text to process, and then call the processed document to get our named entities. For this example we’ll be using the “en_core_web_sm” model we downloaded earlier; this is the “small” model trained on web text. The text we’ll use is just some random sentence I made up. We should expect the NER to identify Molly Moon as a person (NER isn’t advanced enough to detect that she is a cow), to identify the United Nations as an organization, and the Climate Action Committee as a second organization.

import spacy
 
nlp = spacy.load("en_core_web_sm")
 
text = "Molly Moon is a cow. She is part of the United Nations Climate Action Committee."
 
doc = nlp(text)
 
for ent in doc.ents:
    print(ent.text, ent.label_)

After we run this we should see a result like the one below. We see that this spaCy model is unable to separate the United Nations and its Climate Action Committee as separate orgs.

[Image: named entity recognition results from spaCy]

Named Entity Recognition with NLTK

Let’s take a look at how to implement NER with NLTK. As with spaCy, we’ll start by installing the NLTK library and also downloading the extensions we need.

pip install nltk

After we run our initial pip install, we’ll need to download four extensions to get our Named Entity Recognition program running. I recommend simply firing up Python in your terminal and running these commands there, since the extensions only need to be downloaded once; including the downloads in your NER program will only slow it down.

python
>>> import nltk
>>> nltk.download("punkt")
>>> nltk.download("averaged_perceptron_tagger")
>>> nltk.download("maxent_ne_chunker")
>>> nltk.download("words")

Punkt is a tokenizer package that recognizes punctuation. Averaged Perceptron Tagger is the default part of speech tagger for NLTK. Maxent NE Chunker is the Named Entity Chunker for NLTK. The Words library is an NLTK corpus of words. We can already see here that NLTK is far more customizable, and consequently also more complex to set up. Let’s dive into the program to see how we can extract our named entities.

Once again we simply start by importing our library and declaring our text. Then we’ll tokenize the text, tag the parts of speech, and chunk it using the named entity chunker. Finally, we’ll loop through our chunks and display the ones that are labeled.

import nltk
 
text = "Molly Moon is a cow. She is part of the United Nations' Climate Action Committee."
 
tokenized = nltk.word_tokenize(text)
pos_tagged = nltk.pos_tag(tokenized)
chunks = nltk.ne_chunk(pos_tagged)
for chunk in chunks:
    if hasattr(chunk, 'label'):
        print(chunk)

When you run this program in your terminal you should see an output like the one below.

[Image: named entity recognition results from NLTK]

Notice that NLTK has identified “Climate Action Committee” as a Person and Moon as a Person. That’s clearly incorrect, but this is all on pre-trained data. Also, this time I let it print out the entire chunk, and it shows the parts of speech. NLTK has tagged all of these as “NNP”, which signals a proper noun.
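If you’d rather work with plain (text, label) pairs than the raw Tree objects, you can flatten the labeled chunks like this; it’s just a convenience on top of the loop above.

# join each labeled chunk's tokens into one entity string and pair it with its label
entities = [
    (" ".join(token for token, pos in chunk), chunk.label())
    for chunk in chunks
    if hasattr(chunk, "label")
]
print(entities)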

A Simpler and More Accurate NER Implementation

Alright, now that we’ve discussed how to implement NER with open source libraries, let’s take a look at how we can do it without ever having to download extra packages and machine learning models! We can simply ping a web API that already has a pre-trained model and pipeline for tons of text processing needs. We’ll be using the open beta of The Text API; scroll down to the bottom of the page and get your API key.

The only library we need to install is the requests library, and we only need to be able to send an API request as outlined in How to Send a Web API Request. So, let’s take a look at the code.

All we need is to construct a request to send to the endpoint, send the request, and parse the response. The API key should be passed in the headers as “apikey” and also we should specify that the content type is json. The body simply needs to pass the text in. The endpoint that we’ll hit is “https://app.thetextapi.com/text/ner”. Once we get our request back, we’ll use the json library (native to Python) to parse our response.

import requests
import json
from config import apikey
 
text = "Molly Moon is a cow. She is part of the United Nations' Climate Action Committee."
headers = {
    "Content-Type": "application/json",
    "apikey": apikey
}
body = {
    "text": text
}
url = "https://app.thetextapi.com/text/ner"
 
response = requests.post(url, headers=headers, json=body)
ner = json.loads(response.text)["ner"]
print(ner)

Once we send this request, we should see an output like the one below.

[Image: named entity recognition results from The Text API]

Woah! Our API actually recognizes all three of the named entities successfully! Not only is using The Text API simpler than downloading multiple models and libraries, but in this use case, we can see that it’s also more accurate.
