Using NLP to Analyze a YouTube Lecture Series – Part 2

In part one of Using NLP to Analyze a YouTube Lecture Series, we asynchronously sent API requests to run NLP on downloaded YouTube transcripts. Because of transcript size and internet speed limits, we had to split each transcript into parts and analyze the parts separately. In this post, we’ll combine the partial results for each NLP technique and pull out some surface-level insights. Full source code, including results, is here.

Here’s the parts of this post:

  • A Summary of Part 1
  • Creating Word Clouds from Transcripts
    • Word Cloud Function
    • Create Word Clouds for Each Transcript
  • Combining Separate Files
    • Gather the Names of All the Files
    • Collect Each NLP Analysis into a Single File
  • Further Processing on NLP Results
    • Most Commonly Named Entities
    • Average Sentiment Polarities
    • Most Common Phrases
  • Summary of Using NLP to Analyze a YouTube Lecture Series

Downloading the YouTube Transcripts and Using The Text API

In part one of using NLP to analyze YouTube transcripts, we downloaded the transcripts and used NLP to get the named entities, average sentiment values, most common phrases, and summaries of each episode. We used the youtube-transcript-api Python library to download the transcripts and The Text API to run the NLP techniques on them. Due to the size of the transcripts and internet speed limits, we broke each transcript into parts and analyzed the parts separately, so we now have multiple files per episode for each of the entities, sentiment values, phrases, and summaries. In this post, we’re going to create word clouds from the transcripts and combine the analyzed parts of each lecture into one document per technique per lecture.
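
As a quick refresher, the download step from part one looked roughly like the sketch below. This is a minimal version, assuming the classic YouTubeTranscriptApi.get_transcript interface and a made-up list of video IDs; the full script (and the actual file naming) is in part one.

from youtube_transcript_api import YouTubeTranscriptApi
import json
 
# hypothetical video IDs - part one builds this list from the actual playlist
video_ids = ["VIDEO_ID_1", "VIDEO_ID_2"]
 
for video_id in video_ids:
    # each transcript is a list of dicts like {"text": ..., "start": ..., "duration": ...}
    transcript = YouTubeTranscriptApi.get_transcript(video_id)
    # save the raw transcript as JSON into the jp folder used throughout this post
    with open(f"./jp/{video_id}.json", "w") as f:
        json.dump(transcript, f)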

Creating Word Clouds from Transcripts

Word clouds give us an easy way to visualize the vernacular of a text. They may not always paint a great picture of the topics in a text, but they do show off the vocabulary it uses. To follow along with this section you’ll need the matplotlib, numpy, and wordcloud libraries. You can install them with the following line in the terminal:

pip install matplotlib numpy wordcloud

Word Cloud Module

We’ll put our word cloud function in its own module, beginning with the imports as usual. The function takes two parameters, text and filename. It uses an image as the shape mask for the cloud, which you can find here. All we do in this function is use the WordCloud object to create a word cloud and matplotlib to save the resulting image. For a more detailed tutorial, read how to create a word cloud in 10 lines of Python.

from wordcloud import WordCloud, STOPWORDS
import matplotlib.pyplot as plt
import numpy as np
from PIL import Image
 
# create a word cloud from the text and save it as <filename>.png
def word_cloud(text, filename):
    stopwords = set(STOPWORDS)
    # use the cloud image as the shape mask for the word cloud
    frame_mask = np.array(Image.open("cloud_shape.png"))
    wordcloud = WordCloud(max_words=50, mask=frame_mask, stopwords=stopwords, background_color="white").generate(text)
    plt.imshow(wordcloud, interpolation='bilinear')
    plt.axis("off")
    plt.savefig(f'{filename}.png')

Create Word Clouds for Each Transcript

Now that we’ve created the word cloud function, we’ll write a short script that creates a word cloud for each episode. As always, we’ll begin with our imports; we’ll need the os and json libraries, plus the word_cloud function itself. Then we’ll loop through each of the files in the directory containing the transcripts.

For each file, we’ll load it with the json.load() function to turn the JSON into a list, where each item is a dictionary. Before looping through the dictionaries, we’ll create an empty string to hold the text. Then, for each entry, we’ll append its text value to the string. Finally, we’ll call the word cloud function on the combined text.

import json
import os
 
# import the word_cloud function from the module above
# (assuming it was saved as word_cloud.py)
from word_cloud import word_cloud
 
for video_transcript in os.listdir("./jp"):
    with open(f"./jp/{video_transcript}", "r") as f:
        entries = json.load(f)
    # concatenate the text of every entry in the transcript
    text = ""
    for entry in entries:
        text += entry["text"] + " "
       
    # strip the ".json" extension for the image filename
    word_cloud(text, f"{video_transcript[:-5]}")

Combining Separate Files

As we said earlier, we had to split each transcript into multiple parts to call the APIs. Now that we have the analyzed pieces of each transcript, we have to gather them back together.

Gather the Names of All the Files

Let’s create a function that gets all the filenames we need. We’ll start by importing the only library we need, os. The function doesn’t take any parameters. The first thing we’ll do is create an empty list to hold the filenames. Next, we’ll go through each of the filenames in the jp directory that holds all the transcripts and append them, minus the .json extension, to the list. Finally, we’ll return the list of filenames.

import os
 
# get the base filename (without the .json extension) of every transcript
def get_filenames():
    filenames = []
    for filename in os.listdir("./jp/"):
        # strip the ".json" extension
        filenames.append(filename[:-5])
    return filenames

Collect Each NLP Analysis into a Single File

Now let’s collect the parts of each file and combine them into one file. First, we’ll import the os library and the get_filenames function we created above, then grab the list of filenames with get_filenames. Next, we’ll loop through all the filenames from the jp directory and use them to find the partial files, which share the same prefix as the filename. After gathering the parts for one filename, we’ll sort them by length so that partial files ending in two digits, like _11, come after those ending in one digit, like _2 or _3. Note that we only need to scan one folder to collect the partial filenames because they are the same in every folder.

With the filenames and partial filenames in hand, we’ll loop through each filename and combine its partials. For each of the directories, the named entities, the most common phrases, the polarities, and the summaries, we’ll create a list to hold the partial values. Then, for each filename, we’ll read the file for each of its parts, collect the values into the corresponding list, and write everything in that list out to the full filename in that directory.

import os
from get_filenames import get_filenames
 
# collect all the separate mcps, ners, polarities, and summaries
filenames = get_filenames()
   
# collect the part filenames for each transcript
# (they're the same across the ners, mcps, polarities, and summaries folders)
part_filenames = []
for filename in filenames:
    parts = []
    for part_name in os.listdir("./ners"):
        if part_name.startswith(filename+"_"):
            parts.append(part_name)
    # sort by length so two-digit parts like _11 come after one-digit parts like _2
    parts = sorted(parts, key=len)
    part_filenames.append(parts)
 
for i, filename in enumerate(filenames):
    # ners
    ners = []
    for part_filename in part_filenames[i]:
        with open(f"./ners/{part_filename}", "r") as f:
            entries = f.read()
            ners.append(entries)
    with open(f"./ners/{filename}.txt", "w") as f:
        for entry in ners:
            f.write(entry)
   
    # mcps
    mcps = []
    for part_filename in part_filenames[i]:
        with open(f"./mcps/{part_filename}", "r") as f:
            entries = f.read()
            mcps.append(entries)
    with open(f"./mcps/{filename}.txt", "w") as f:
        for entry in mcps:
            f.write(entry)
   
    # polarities
    polarities = []
    for part_filename in part_filenames[i]:
        with open(f"./polarities/{part_filename}", "r") as f:
            entries = f.read()
            polarities.append(entries)
    with open(f"./polarities/{filename}.txt", "w") as f:
        for entry in polarities:
            f.write(entry + '\n')
           
    # summaries
    summaries = []
    for part_filename in part_filenames[i]:
        with open(f"./summaries/{part_filename}", "r") as f:
            entries = f.read()
            summaries.append(entries)
    with open(f"./summaries/{filename}.txt", "w") as f:
        for entry in summaries:
            f.write(entry + '\n')
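
One caveat on the length-based sort: for parts with the same number of digits, it relies on os.listdir returning names in alphabetical order, which is common but not guaranteed. If you want to be strict, a small sketch like the one below (assuming the parts are always named <prefix>_<number>.<extension>) sorts by the numeric suffix instead:

# hypothetical replacement for sorted(parts, key=len)
def part_number(part_name):
    # "episode_11.txt" -> 11
    return int(part_name.rsplit("_", 1)[1].split(".")[0])
 
parts = sorted(parts, key=part_number)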

Further Processing on NLP Results

Now that we’ve gathered all the pieces of each NLP analysis together, let’s do some further processing on the named entities, sentiment polarities, and most common phrases.

Getting the Most Commonly Named Entities

We can learn more from our list of named entities by finding the most commonly named entity of each type. The only import we’ll need for this file is the get_filenames function. Next, we’ll create a function to build a dictionary out of a list of named entity strings.

The first thing this function does is create an empty outer dictionary to hold the organized named entities. Next, we’ll loop through each string in the list of recognized named entities and split it at the first space into the entity type and the entity name. Then we’ll check whether the entity type already exists in the outer dictionary and whether the entity name already exists under that type, incrementing the count for that name accordingly.

Next, we’ll build a function that sorts the named entities after they’ve been compiled with the build_dict function. It returns a dictionary mapping each entity type to its most frequently occurring name. Finally, we’ll create another function that loops through each entry in the directory and gets the most common entities. It reads each file, runs it through the most_common function, and saves the results to another file. You can see the files for the most common named entities here, with the suffix most_common.

from get_filenames import get_filenames
 
# build a dictionary of named entity counts, keyed by entity type
# expects a list of "<TYPE> <name>" strings
def build_dict(ners: list):
    outer_dict = {}
    for ner in ners:
        splitup = ner.split(" ", 1)
        entity_type = splitup[0]
        entity_name = splitup[1]
        if entity_type in outer_dict:
            if entity_name in outer_dict[entity_type]:
                outer_dict[entity_type][entity_name] += 1
            else:
                outer_dict[entity_type][entity_name] = 1
        else:
            outer_dict[entity_type] = {
                entity_name: 1
            }
    return outer_dict
 
# return most common entities after building the NERS out
def most_common(ners: list):
    _dict = build_dict(ners)
    mosts = {}
    for ner_type in _dict:
        sorted_types = sorted(_dict[ner_type], key=lambda x: _dict[ner_type][x], reverse=True)
        mosts[ner_type] = sorted_types[0]
    return mosts
 
def ner_processing():
    filenames = get_filenames()
    for filename in filenames:
        with open(f"./ners/{filename}.txt", "r") as f:
            entries = f.read()
        entries = entries.split('\n')
        while '' in entries:
            entries.remove('')
        print(filename)
        mce = most_common(entries)
        with open(f"./ners/{filename}_most_common.txt", "w") as f:
            for ner in mce.items():
                f.write(ner[0] + " " + ner[1] + '\n')
               
ner_processing()
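
To see what most_common returns, here’s a quick hypothetical example; the entity strings are made up, but they follow the same "<TYPE> <name>" format the NER files use:

example_ners = [
    "PERSON Carl Jung",
    "PERSON Carl Jung",
    "PERSON Jean Piaget",
    "ORG the University of Toronto"
]
# prints {'PERSON': 'Carl Jung', 'ORG': 'the University of Toronto'}
print(most_common(example_ners))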

Average Sentiment Polarities

Now that we’ve processed the named entities, let’s also process the sentiment polarities. Each part of each episode has a different polarity value. Let’s get an average. We’ll start by importing the get_filenames function to get the filenames. Then we’ll create a function that will return the average of a list.

Next, we’ll create our polarity averaging function. It will start by getting all the filenames and then loop through each episode. For each episode, we’ll build a list of polarities from the episode file; note that we have to convert each string to a float and skip the empty string left at the end by the final newline. Then we’ll write the average polarity to its own file. You can find the averages and separate sentiment values here.

from get_filenames import get_filenames
 
def avg(_list):
    return sum(_list)/len(_list)
 
# get average polarities
def avg_polarities():
    filenames = get_filenames()
    for filename in filenames:
        with open(f"./polarities/{filename}.txt", "r") as f:
            entries = f.read()
        polarities = [float(entry) for entry in entries.split('\n') if len(entry) > 1]
        with open(f"./polarities/{filename}_avg.txt", "w") as f:
            f.write(str(avg(polarities)))
 
avg_polarities()

Most Common Phrases Combined and Reevaluated

Now let’s take a look at the most common phrases. Right now, we have combined files containing the 5 most common phrases from each part of an episode. Let’s get the 5 most common phrases for an entire episode based on those. To follow along, you’ll need a free API key from The Text API and the Python requests library. You can install the library with the following line in the terminal:

pip install requests

We’ll begin by importing the libraries we need, requests and json, along with the API key and the get_filenames function. Then we’ll create the headers, which tell the server we’re sending JSON data and pass along the API key, and declare the API URL we need to hit.

We’ll create one function to get the most common phrases from the combined most common phrases. The function gets the filenames and loops through them. For each file, we’ll read all the most common phrases and build a request body asking for the five most common phrases overall. From there, all we need to do is send the request, parse the JSON response, and save the result to another file. You can find the results here.

import requests
import json
 
from text_config import apikey
from get_filenames import get_filenames
 
headers = {
    "Content-Type": "application/json",
    "apikey": apikey
}
mcp_url = "https://app.thetextapi.com/text/most_common_phrases"
# get most common phrases
def collect_mcps():
    filenames = get_filenames()
    for filename in filenames:
        with open(f"./mcps/{filename}.txt", "r") as f:
            entries = f.read()
        body = {
            "text": entries,
            "num_phrases": 5
        }
        response = requests.post(mcp_url, headers=headers, json=body)
        _dict = json.loads(response.text)
        mcps = _dict["most common phrases"]
        with open(f"./mcps/{filename}_re.txt", "w") as f:
            for mcp in mcps:
                f.write(mcp + '\n')
           
collect_mcps()
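
One small robustness note: if the request fails, response.text won’t be the JSON we expect and json.loads will throw a confusing error. A minimal guard, using requests’ built-in raise_for_status and json helpers, could look like this inside the loop:

response = requests.post(mcp_url, headers=headers, json=body)
# fail loudly on a bad status code instead of trying to parse an error page
response.raise_for_status()
_dict = response.json()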

Summary of Using NLP to Analyze a YouTube Lecture Series

In this series, we created word clouds from transcripts of two YouTube series and further processed NLP results from those same transcripts. To do that, we first combined the partial NLP analyses for each episode into one file per technique per episode. From there, we found the most commonly named entities, computed the average sentiment polarities, and re-derived the most common phrases for each episode from the per-part most common phrases.

I run this site to help you and others like you find cool projects and practice software skills. If this is helpful for you and you enjoy this ad-free site, please help fund it by donating below! If you can’t donate right now, please think of us next time.

Yujian Tang
