Twitter, NLP, and the 2022 World Cup

This article is published at 9:55am EST, 5 minutes before the start of the 2022 World Cup Final between Argentina and France.

This is the umpteenth installment in my exploration of whether Twitter sentiment is good at predicting anything. Last year, I used it on NFL games and Starbucks stock prices. The results? Twitter sentiment was better than most bettors for NFL games: betting on the lower-sentiment team was correct about 60% of the time. However, it was absolutely abysmal for stock prices. For this post, we're not just going to look at sentiment; we'll also get a summary, extract the most common phrases, and pull out the most common named entities through named entity recognition.

In the midst of the 2022 World Cup hype, I thought I’d revive this project and see how it predicts Argentina vs France. In this post we’ll take a look at:

  • Project Outline
  • What Are We Getting From Twitter?
  • Applying NLP Techniques
    • Asynchronous Calls to The Text API
    • Getting the Most Common Named Entities
    • Putting it Together
  • Predictions from Twitter vs My Personal Thoughts
  • Extras + Disclaimers
    • Create a Word Cloud
  • Summary

Project Outline

Before we get started, let's take a look at the outline of this project. All the .png files except cloud_shape.png are produced by the program; pay no attention to the __pycache__ folder. The important files to look at here are orchestrator, pull_tweets, and, inside the text_analysis folder, async_pool, ner_processing, and text_orchestrator. The text_config and twitter_config files store my API keys so they don't get uploaded to GitHub.
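
Here's a rough sketch of what that layout looks like, reconstructed from the files described above (the exact generated .png names depend on the search terms you run):

├── orchestrator.py
├── pull_tweets.py
├── twitter_config.py        # Twitter bearer token (not committed)
├── word_cloud.py
├── cloud_shape.png          # mask image for the word clouds
├── text_analysis/
│   ├── async_pool.py
│   ├── ner_processing.py
│   ├── text_orchestrator.py
│   └── text_config.py       # The Text API key (not committed)
└── *.png                    # word clouds produced by the program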

What Are We Getting From Twitter?

I used the Twitter API to get these Tweets. There are some complaints about the limits of this API, but it's good enough unless you're one of those people who needs to get every single Tweet on Twitter. In that case, you're out of luck; this tutorial won't help you do that. Anyway, we're also going to need the requests and json libraries to send our HTTP request and parse the response. This is the pull_tweets file.

Once we've set everything up, we need to create a header to send to the API endpoint. For the Twitter API, that's simply the bearer token, prefixed with "Bearer ". Annoying, I know. I also added some params to our Twitter search. Specifically, we're only getting English Tweets (lang:en) without links (-has:links) that are not retweets (-is:retweet). We're also only going to grab the latest 50 Tweets. This is more about how long the connection stays open than about the total number of Tweets we care to analyze.

Once we have everything set up, we simply send the request to get the Tweets. When the Twitter data comes back, we extract just the data portion to get the list of Tweets. I also join them all up with a period and a space to create one text paragraph, mainly for further processing down the line.

import requests
import json
 
from twitter_config import bearertoken
 
search_recent_endpoint = "https://api.twitter.com/2/tweets/search/recent"
headers = {
   "Authorization": f"Bearer {bearertoken}"
}
 
# automatically builds a search query from the requested term
# looks for english tweets with no links that are not retweets
# returns the tweets
def search(term: str):
   if term[0] == '@':
       params = {
           "query": f'from:{term[1:]} lang:en -has:links -is:retweet',
           'max_results': 50
       }
   else:
       params = {
           "query": f'{term} lang:en -has:links -is:retweet',
           'max_results': 50
       }
   response = requests.get(url=search_recent_endpoint, headers=headers, params=params)
   res = json.loads(response.text)
   tweets = res["data"]
   text = ". ".join( for tweet in tweets])
   return text
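
To make the last two lines concrete: the recent search endpoint returns the matching Tweets under a top-level data key, and each entry carries the Tweet's id and text. Here's a minimal sketch of the join step with invented Tweet text:

# hypothetical "data" list from the recent search response (Tweet text invented)
tweets = [
    {"id": "1", "text": "Vamos Argentina"},
    {"id": "2", "text": "Allez les Bleus"},
]
# join the Tweet texts into one paragraph, exactly like search() does
text = ". ".join([tweet["text"] for tweet in tweets])
print(text)  # Vamos Argentina. Allez les Bleus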

Applying NLP Techniques

Now we're going to get into the text_analysis folder. There are three main files to look at there. First, the async_pool file, which calls The Text API, a comprehensive NLP API for text, to get a summary of the combined Tweets, the most common phrases, the named entities, and the overall sentiment. Second, a file that processes the named entities into the most common named entities. Third, an orchestrator that puts the two together.

Asynchronous Calls to The Text API

This is the async_pool file. We’ll need the asyncio, aiohttp, and json libraries to execute this file. We call four different API endpoints asynchronously using asyncio and aiohttp. The first thing we do is set the headers and API endpoints.

Next, we create a function to configure the request bodies that we send. We simply create a dictionary and assign a different request body as the value to each API endpoint key. Learn more about the optional values (proportion, labels, and num_phrases) in the documentation.

We need two more helper functions before we can pool the tasks and call the API asynchronously. One gathers the tasks behind a semaphore that limits concurrency, and one asynchronously sends a POST request to an API endpoint. Once we have these two, we simply open a client session, pool the requests, and call the API endpoints concurrently. Learn more in this tutorial on how to call APIs asynchronously.

import asyncio
import aiohttp
import json
 
from .text_config import apikey
 
# configure request constants
headers = {
   "Content-Type": "application/json",
   "apikey": apikey
}
text_url = "https://app.thetextapi.com/text/"
summarize_url = text_url+"summarize"
ner_url = text_url+"ner"
mcp_url = text_url+"most_common_phrases"
polarity_url = text_url+"text_polarity"
 
# configure request bodies
# return a dict of url: body
def configure_bodies(text: str):
   _dict = {}
   _dict[summarize_url] = {
       "text": text,
       "proportion": 0.1
   }
   _dict[ner_url] = {
       "text": text,
       "labels": "ARTICLE"
   }
   _dict[mcp_url] = {
       "text": text,
       "num_phrases": 5
   }
   _dict[polarity_url] = {
       "text": text
   }
   return _dict
 
# configure async requests
# configure gathering of requests
async def gather_with_concurrency(n, *tasks):
   semaphore = asyncio.Semaphore(n)
   async def sem_task(task):
       async with semaphore:
           return await task
  
   return await asyncio.gather(*(sem_task(task) for task in tasks))
 
# create async post function
async def post_async(url, session, headers, body):
   async with session.post(url, headers=headers, json=body) as response:
       text = await response.text()
       return json.loads(text)
  
async def pool(text):
   conn = aiohttp.TCPConnector(limit=None, ttl_dns_cache=300)
   session = aiohttp.ClientSession(connector=conn)
   urls_bodies = configure_bodies(text)
   conc_req = 4
   summary, ner, mcp, polarity = await gather_with_concurrency(conc_req, *[post_async(url, session, headers, body) for url, body in urls_bodies.items()])
   await session.close()
   return summary["summary"], ner["ner"], mcp["most common phrases"], polarity["text polarity"]

Getting the Most Common Named Entities

This is the ner_processing file. It takes the named entities returned from the last file, async_pool, and processes them. We take the whole list, which is actually a list of lists, and loop through it. Each inner list has two entries: first, a named entity type, which could be a person, place, organization, and so on; second, the text of the entity itself. Learn more in this post about named entity recognition and its types.

We build a helper function to find the most common named entities by creating a nested dictionary. The key in the first layer is the named entity type. The inner dictionary contains key-value pairs of the named entity and how often it appears. Then we create a function that sorts each of the inner dictionaries with a lambda function and returns the top entry for each type to get the most common named entities.

# build dictionary of NERs
# extract most common NERs
# expects list of lists
def build_dict(ners: list):
   outer_dict = {}
   for ner in ners:
       entity_type = ner[0]
       entity_name = ner[1]
       if entity_type in outer_dict:
           if entity_name in outer_dict[entity_type]:
               outer_dict[entity_type][entity_name] += 1
           else:
               outer_dict[entity_type][entity_name] = 1
       else:
           outer_dict[entity_type] = {
               entity_name: 1
           }
   return outer_dict
 
# return most common entities after building the NERS out
def most_common(ners: list):
   _dict = build_dict(ners)
   mosts = {}
   for ner_type in _dict:
       sorted_types = sorted(_dict[ner_type], key=lambda x: _dict[ner_type][x], reverse=True)
       mosts[ner_type] = sorted_types[0]
   return mosts
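
To see what these two functions produce, here's a tiny hypothetical input and its output (the exact label strings come from The Text API, so treat these as illustrative):

# hypothetical list of [entity_type, entity_name] pairs, as described above
ners = [
    ["PERSON", "Messi"],
    ["PERSON", "Messi"],
    ["PERSON", "Mbappe"],
    ["GPE", "Argentina"],
]
print(most_common(ners))
# {'PERSON': 'Messi', 'GPE': 'Argentina'}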

Putting it Together

This is the text_orchestrator file. The orchestrator for the text analysis simply strings together the functionality we created above. First, we run the asynchronous API calls to get our summary, named entities, most common phrases, and the overall polarity. Then we process our named entities, strip the newlines from the summary for a pretty printout, and return all the values.

# input: text body
import asyncio
 
from .async_pool import pool
from .ner_processing import most_common
 
def orchestrate_text_analysis(text:str):
   """Step 1"""
   # task to execute all requests
   summary, ner, mcp, polarity = asyncio.get_event_loop().run_until_complete(pool(text))
  
   """Step 2"""
   # do NER analysis
   most_common_ners = most_common(ner)
   summary = summary.replace("\n", "")
   return summary, most_common_ners, mcp, polarity
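
One small note on the event loop call: asyncio.get_event_loop() outside a running loop is deprecated on newer Python versions, so if you run into warnings, the equivalent one-liner below (available since Python 3.7) does the same thing:

# equivalent to the get_event_loop().run_until_complete(...) call above
summary, ner, mcp, polarity = asyncio.run(pool(text))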

Predictions from Twitter vs My Personal Thoughts

We run the root-level orchestrator file to put it all together and see what Twitter thinks. This file calls the functions we made earlier. We create a function to group the word tasks together and an orchestration function that calls the search function to pull from Twitter and then runs the word tasks on the result. I've also added a timer to see how long things take. This is more for curiosity and benchmarking than anything else.

Finally, to kick everything off, we simply call the orchestrate function for each search term. Best practice would be not to do this at the bottom of the same file, but this is simply an example tutorial, so we will.

#imports
from pull_tweets import search
from text_analysis.text_orchestrator import orchestrate_text_analysis
from word_cloud import word_cloud
import time
 
def word_tasks(text: str, term: str):
   summary, most_common_ners, mcp, polarity = orchestrate_text_analysis(text)
   word_cloud(text, term)
   return summary, most_common_ners, mcp, polarity
 
# pull tweets for a search term, run the NLP tasks, and print the results
def orchestrate(term: str):
   # pull tweets
   starttime = time.time()
   text = search(term)
   # run the word tasks: summary, NER, most common phrases, polarity, and word cloud
   summary, most_common_ners, mcp, polarity = word_tasks(text, term)
   # print the results and the elapsed time
   print(summary)
   print(most_common_ners)
   print(mcp)
   print(polarity)
   print(time.time()-starttime)
 
orchestrate("#argentina")
orchestrate("#france")
orchestrate("#worldcup")

Here are the results we get:

The important thing to note here is this: France has a higher sentiment than Argentina. This is the main thing I wanted to explore. Can we use Twitter sentiment to predict sports matches? If we follow the NFL logic, where the lower-sentiment team won about 60% of the time, we can expect that Argentina will likely win.

My personal prediction is also that Argentina will win – go Messi!

Extras + Disclaimers

I should put some disclaimers here: this is by no means a perfect method. It's simply a sample project that I thought would be fun to create to explore Twitter, NLP, and the World Cup! These Tweets were pulled about an hour ahead of time, at 8:56am EST.

Create a Word Cloud

We also called a word cloud function in our orchestrator that I did not address earlier. The code below shows that function. Learn more in this tutorial on how to create a word cloud in Python.

from wordcloud import WordCloud, STOPWORDS
import matplotlib.pyplot as plt
import numpy as np
from PIL import Image
 
# generate a word cloud shaped by cloud_shape.png and save it as <filename>.png
def word_cloud(text, filename):
   stopwords = set(STOPWORDS)
   frame_mask=np.array(Image.open("cloud_shape.png"))
   wordcloud = WordCloud(max_words=50, mask=frame_mask, stopwords=stopwords, background_color="white").generate(text)
   plt.imshow(wordcloud, interpolation='bilinear')
   plt.axis("off")
   plt.savefig(f'{filename}.png')

Here are the images of the #Argentina and #France word clouds:

Summary

In this tutorial project, we pulled Tweets from Twitter via the Twitter API, asynchronously called four API endpoints for NLP tasks, created word clouds, and orchestrated all of it. Let's go Argentina!
