Using NLP to Get Insights from Twitter

I’m interested in analyzing the Tweets of a bunch of famous people so I can learn from them. I’ve built a program that will do this by pulling a list of recent tweets and doing some NLP on them. In this post we’re going to go over:

  • Get all the Text for a Search Term on Twitter
  • NLP Techniques to Run on Tweets
    • Summarization
    • Most Common Phrases
    • Named Entity Recognition
    • Sentiment Analysis
  • Running all the NLP Techniques Concurrently
  • Further Text Processing
    • Finding the Most Commonly Named Entities
  • Orchestration
  • A Summary

To follow along you’ll need a free API key from The Text API and to install the requests and aiohttp libraries with the following line in your terminal:

pip install requests aiohttp

Overview of Project Structure

In this project we’re going to create multiple files and folders. We’re going to create a file for getting all the text called pull_tweets.py. We’ll create a totally separate folder for the text processing, and we’ll have three files in there. Those three files are async_pool.py for sending the text processing requests, ner_processing.py for further text processing after doing NER, and text_orchestrator.py for putting the text analysis together.
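
The code below imports a bearer token from twitter_config and an API key from text_config, but those two config files aren’t shown in the post. Here’s a minimal sketch of what they might contain, with placeholder values; keeping text_config.py inside the text processing folder is an assumption that makes the relative import in async_pool.py resolve.

# twitter_config.py -- sits next to pull_tweets.py
bearertoken = "YOUR_TWITTER_BEARER_TOKEN"

# text_config.py -- sits inside the text processing folder
# so that "from .text_config import apikey" resolves
apikey = "YOUR_TEXT_API_KEY"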

Get all the Text for a Search Term on Twitter

We went over how to Scrape the Text from All Tweets for a Search Term in a recent post. For the purposes of this program, we’ll do almost the exact same thing with a twist. I’ll give a succinct description of what we’re doing in the code here. You’ll have to go read that post for a play-by-play of the code. This is the pull_tweets.py file.

First we’ll import our libraries and bearer token. Then we’ll set up the request and headers and create a function to search Twitter. Our function will check if our search term is a user or not by checking to see if the first character is the “@” symbol. Then we’ll create our search body and send off the request. When we get the request back, we’ll parse it into JSON and compile all the Tweets into one string. Finally, we’ll return that string.

import requests
import json
 
from twitter_config import bearertoken
 
search_recent_endpoint = "https://api.twitter.com/2/tweets/search/recent"
headers = {
    "Authorization": f"Bearer {bearertoken}"
}
 
# automatically builds a search query from the requested term
# looks for english tweets with no links that are not retweets
# returns the tweets
def search(term: str):
    if term[0] == '@':
        params = {
            "query": f'from:{term[1:]} lang:en -has:links -is:retweet',
            'max_results': 25
        }
    else:
        params = {
            "query": f'{term} lang:en -has:links -is:retweet',
            'max_results': 25
        }
    response = requests.get(url=search_recent_endpoint, headers=headers, params=params)
    res = json.loads(response.text)
    tweets = res["data"]
    text = ". ".join( for tweet in tweets])
    return text
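
To sanity check pull_tweets.py on its own, you can call search directly. The handle below is just a placeholder, and you’ll need a valid bearer token in twitter_config.py.

if __name__ == "__main__":
    # placeholder handle -- swap in whichever account you want to learn from
    tweets_text = search("@some_famous_person")
    print(tweets_text[:500])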

NLP Techniques to Run on Tweets

There are a ton of different NLP techniques we can run: we can do Named Entity Recognition, analyze the text for polarity, summarize the text, and much more. Remember what we’re trying to do here. We’re trying to get some insight from these Tweets. With this in mind, for this project we’ll summarize the tweets, find the most common phrases, do named entity recognition, and run sentiment analysis.

We’re going to run all of these concurrently with asynchronous API requests. In the following sections we’re just going to set up the API requests. The first thing we’ll do is set up the values that are constant across all the requests. This is the start of the async_pool.py file.

Setup Constants

Before we can set up our requests, we have to set up the constants for them. We’ll also do the imports for the rest of the async_pool.py file. First, we’ll import the asyncio, aiohttp, and json libraries. We’ll use the asyncio and aiohttp libraries for the async API calls later. We’ll also import our API key that we got earlier from The Text API.

We need to set up the headers for our requests. The headers will tell the server that we’re sending JSON data and also pass the API key. Then we’ll set up the API endpoints. The API endpoints that we’re hitting are the summarize, ner, most_common_phrases, and text_polarity API endpoints.

import asyncio
import aiohttp
import json
 
from .text_config import apikey
 
# configure request constants
headers = {
    "Content-Type": "application/json",
    "apikey": apikey
}
text_url = "https://app.thetextapi.com/text/"
summarize_url = text_url+"summarize"
ner_url = text_url+"ner"
mcp_url = text_url+"most_common_phrases"
polarity_url = text_url+"text_polarity"

Summarize the Tweets

We’ll set up a function, configure_bodies, to build and return the request bodies so we can use them later. We only need one parameter for this function: the text that we’re going to send. The first thing we’ll do in this function is set up an empty dictionary. Next we’ll set up the body to send to the summarize endpoint. The summarize body will contain the text and tell the server that we want a summary roughly 0.1 times the length of the Tweets.

def configure_bodies(text: str):
    _dict = {}    
    _dict[summarize_url] = {
        "text": text,
        "proportion": 0.1
    }

Find Most Common Phrases

After setting up the summarization body, we will set up the most_common_phrases body. This request will send the text and set the number of phrases to 5.

    _dict[mcp_url] = {
        "text": text,
        "num_phrases": 5
    }

Named Entity Recognition

Now we’ve set up the summarization and most common phrases request bodies. After those, we’ll set up the NER request body. The NER request body will pass the text and tell the server that we’re sending an “ARTICLE” type. The “ARTICLE” type returns people, places, organizations, locations, and times.

    _dict[ner_url] = {
        "text": text,
        "labels": "ARTICLE"
    }

Sentiment Analysis

We’ve now set up the summarization, most common phrases, and named entity recognition request bodies. Next is the sentiment analysis, or text polarity, body. Those terms are basically interchangeable. This request will just send the text in the body. We don’t need to specify any other optional parameters here. We’ll return the dictionary we created after setting this body.

    _dict[polarity_url] = {
        "text": text
    }
    return _dict

Full Code for Configuring Requests

Here’s the full code for configuring the request bodies.

# configure request bodies
# return a dict of url: body
def configure_bodies(text: str):
    _dict = {}
    _dict[summarize_url] = {
        "text": text,
        "proportion": 0.1
    }
    _dict[ner_url] = {
        "text": text,
        "labels": "ARTICLE"
    }
    _dict[mcp_url] = {
        "text": text,
        "num_phrases": 5
    }
    _dict[polarity_url] = {
        "text": text
    }
    return _dict

Run All NLP Techniques Concurrently

For a full play-by-play of this code, check out how to send API requests asynchronously. I’ll go over an outline here. This is almost the exact same code with a few twists. It consists of three functions: gather_with_concurrency, post_async, and pool.

First, we’ll look at the gather_with_concurrency function. This function takes two parameters: the number of concurrent tasks and the list of tasks. All we’ll do in this function is set up a semaphore to limit how many tasks execute at once. At the end of the function, we’ll return the gathered tasks.

Next we’ll create the post_async function. This function will take four parameters: the url, session, headers, and body for the request. We’ll use the session passed in to asynchronously execute a POST request. We’ll parse the response text into JSON and return it.

Finally, we’ll create a pool function to execute all of the requests concurrently. This function will take one parameter, the text we want to process. We’ll create a connection and a session and then use the configure_bodies function to get the request bodies. Next, we’ll use the gather_with_concurrency and post_async functions to execute all the requests asynchronously. Finally, we’ll close the session and return the summary, most common phrases, recognized named entities, and polarity.

# configure async requests
# configure gathering of requests
async def gather_with_concurrency(n, *tasks):
    semaphore = asyncio.Semaphore(n)
    async def sem_task(task):
        async with semaphore:
            return await task
   
    return await asyncio.gather(*(sem_task(task) for task in tasks))
 
# create async post function
async def post_async(url, session, headers, body):
    async with session.post(url, headers=headers, json=body) as response:
        text = await response.text()
        return json.loads(text)
   
async def pool(text):
    conn = aiohttp.TCPConnector(limit=None, ttl_dns_cache=300)
    session = aiohttp.ClientSession(connector=conn)
    urls_bodies = configure_bodies(text)
    conc_req = 4
    summary, ner, mcp, polarity = await gather_with_concurrency(conc_req, *[post_async(url, session, headers, body) for url, body in urls_bodies.items()])
    await session.close()
    return summary["summary"], ner["ner"], mcp["most common phrases"], polarity["text polarity"]

Full Code for Asynchronously Executing All NLP Techniques

Here’s the full code for async_pool.py.

import asyncio
import aiohttp
import json
 
from .text_config import apikey
 
# configure request constants
headers = {
    "Content-Type": "application/json",
    "apikey": apikey
}
text_url = "https://app.thetextapi.com/text/"
summarize_url = text_url+"summarize"
ner_url = text_url+"ner"
mcp_url = text_url+"most_common_phrases"
polarity_url = text_url+"text_polarity"
 
# configure request bodies
# return a dict of url: body
def configure_bodies(text: str):
    _dict = {}
    _dict[summarize_url] = {
        "text": text,
        "proportion": 0.1
    }
    _dict[ner_url] = {
        "text": text,
        "labels": "ARTICLE"
    }
    _dict[mcp_url] = {
        "text": text,
        "num_phrases": 5
    }
    _dict[polarity_url] = {
        "text": text
    }
    return _dict
 
# configure async requests
# configure gathering of requests
async def gather_with_concurrency(n, *tasks):
    semaphore = asyncio.Semaphore(n)
    async def sem_task(task):
        async with semaphore:
            return await task
   
    return await asyncio.gather(*(sem_task(task) for task in tasks))
 
# create async post function
async def post_async(url, session, headers, body):
    async with session.post(url, headers=headers, json=body) as response:
        text = await response.text()
        return json.loads(text)
   
async def pool(text):
    conn = aiohttp.TCPConnector(limit=None, ttl_dns_cache=300)
    session = aiohttp.ClientSession(connector=conn)
    urls_bodies = configure_bodies(text)
    conc_req = 4
    summary, ner, mcp, polarity = await gather_with_concurrency(conc_req, *[post_async(url, session, headers, body) for url, body in urls_bodies.items()])
    await session.close()
    return summary["summary"], ner["ner"], mcp["most common phrases"], polarity["text polarity"]

Further Text Processing

After doing the initial NLP we’ll still get some text back. We can continue doing some NLP on the summarization, most common phrases, and the named entities. Let’s go back to what we’re trying to do: get insights. The summary will help us get a general idea, and the most common phrases will tell us what gets said most often, but the NER is still a little too broad. Let’s further process the NER results by finding the most commonly named entities.

Most Commonly Named Entities

For a play-by-play of this code, read the post on how to Find the Most Common Named Entities of Each Type. I’m going to give a high-level overview here. We’re going to build two functions, build_dict to split the named entities into each type, and most_common to sort that dictionary.

The build_dict function will take one parameter, ners, a list of lists. We’ll start off this function by creating an empty dictionary. Then we’ll loop through the list of NERs and add each one to the dictionary, incrementing its count if we’ve already seen that entity type and name.

The most_common function will take one parameter as well, ners, a list of lists. The first thing we’ll do in this function is call build_dict to create the dictionary. Then, we’ll initialize an empty dictionary. Next, we’ll loop through the dictionary and sort the entities of each NER type by their counts. Finally, we’ll add the most common name for each type to the initialized dictionary and return it.

# build dictionary of NERs
# extract most common NERs
# expects list of lists
def build_dict(ners: list):
    outer_dict = {}
    for ner in ners:
        entity_type = ner[0]
        entity_name = ner[1]
        if entity_type in outer_dict:
            if entity_name in outer_dict[entity_type]:
                outer_dict[entity_type][entity_name] += 1
            else:
                outer_dict[entity_type][entity_name] = 1
        else:
            outer_dict[entity_type] = {
                entity_name: 1
            }
    return outer_dict
 
# return most common entities after building the NERS out
def most_common(ners: list):
    _dict = build_dict(ners)
    mosts = {}
    for ner_type in _dict:
        sorted_types = sorted(_dict[ner_type], key=lambda x: _dict[ner_type][x], reverse=True)
        mosts[ner_type] = sorted_types[0]
    return mosts
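
To make the expected input and output concrete, here’s a tiny worked example with made-up entities; the entity labels are just illustrative. Each inner list is [type, name], matching what build_dict expects.

example_ners = [
    ["PERSON", "Elon Musk"],
    ["ORG", "Tesla"],
    ["PERSON", "Elon Musk"],
    ["ORG", "Tesla"],
    ["ORG", "SpaceX"]
]
# most_common picks the highest-count name for each entity type
print(most_common(example_ners))
# {'PERSON': 'Elon Musk', 'ORG': 'Tesla'}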

Orchestration

Finally, we’ll orchestrate our functions. First, we’ll start by importing the asyncio library and the two functions we need to orchestrate: pool and most_common. We’ll create one function, orchestrate_text_analysis, which will take one parameter, text.

The first thing we’ll do in our orchestrator is get the summary, NERs, most common phrases, and text polarity using asyncio to execute the four NLP techniques concurrently. Then, we’ll do more text processing on the NERs. We’ll also replace the newlines in the summary to make it more readable. Finally, we’ll return the summary, most common entities, most common phrases, and sentiment.

import asyncio
 
from .async_pool import pool
from .ner_processing import most_common
 
def orchestrate_text_analysis(text: str):
    """Step 1"""
    # task to execute all requests
    summary, ner, mcp, polarity = asyncio.get_event_loop().run_until_complete(pool(text))
   
    """Step 2"""
    # do NER analysis
    most_common_ners = most_common(ner)
    summary = summary.replace("\n", "")
    return summary, most_common_ners, mcp, polarity
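
Tying it all together, a top-level script might look like the sketch below. The folder name text_processing and the Twitter handle are assumptions; the post doesn’t name the folder, so adjust the import to match yours.

from pull_tweets import search
from text_processing.text_orchestrator import orchestrate_text_analysis

# pull recent tweets for a placeholder handle, then run all the NLP on them
text = search("@some_famous_person")
summary, most_common_ners, mcp, polarity = orchestrate_text_analysis(text)
print("Summary:", summary)
print("Most common named entities:", most_common_ners)
print("Most common phrases:", mcp)
print("Polarity:", polarity)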

Summary

In this post we went over how to pull Tweets for a search term and combine them into one text. Then, we went over how to asynchronously call four APIs to run NLP on the Tweets. Next, we went over how to do some further text processing. Finally, we went over how to orchestrate the NLP on the text. I’ll be using this program to get insights from some people I want to be like on Twitter.
