Using NLP to Analyze a YouTube Lecture Series – Part 1

YouTube is one of the best places to learn online today. But what if you're just curious about the main points of a series? We can use natural language processing (NLP) to analyze a YouTube series, gain some insight into what it's about, and decide whether we want to explore it further. One of my friends recently suggested that I check out some of Jordan Peterson's lectures on YouTube.

To be honest, and to be fair to JP, I've only ever seen him meme'd on Instagram, so I know nothing about him. In this example project, I'm going to do an NLP analysis of two of his video series before I spend time watching them. The two lecture series we'll be analyzing are 2017 Maps of Meaning: The Architecture of Belief and 2017 Personality and Its Transformations. This is only part one of two. In this post we will:

  • Download YouTube Transcripts
  • Run a text analysis on those YouTube Transcripts with NLP
    • Create a function to asynchronously call many API endpoints
    • Save the returned data into parseable files
    • Parse the transcripts and call the API
  • Summarize how we used NLP to analyze YouTube transcripts

Download YouTube Transcripts

We can download YouTube transcripts with the youtube-transcript-api Python library. To follow along, you'll need to install it from your terminal with the following line:

pip install youtube-transcript-api

To begin our function we'll import the YouTubeTranscriptApi object from the youtube_transcript_api library and the json library. Then we'll list the IDs of the videos for each playlist. Finally, we'll loop through both lists and, if a transcript exists, save it to a JSON file with a filename corresponding to its episode. Some videos don't have transcripts associated with them; for those, we'll print out an error message with the episode index. For a more detailed explanation, see how to download YouTube transcripts in 3 lines of Python.

from youtube_transcript_api import YouTubeTranscriptApi
import json
 
personality_vid_ids = ["kYYJlNbV1OM", "HbAZ6cFxCeY", "wLc_MC7NQek",
                       "BQ4VSRg4e8w", "3iLiKMUiyTI", "X6pbJTqv2hw",
                       "YFWLwYyrMRE", "68tFnjkIZ1Q", "4qZ3EsrKPsc",
                       "11oBFCNeTAs", "w84uRYq0Uc8", "pCceO_D4AlY",
                       "AqkFg1pvNDw", "ewU7Vb9ToXg", "G1eHJ9DdoEA",
                       "D7Kn5p7TP_Y", "fjtBDa4aSGM", "MBWyBdUYPgk",
                       "Q7GKmznaqsQ", "J9j-bVDrGdI"]
 
mom_vid_ids = ["I8Xc2_FtpHI", "EN2lyN7rM4E", "Us979jCjHu8",
               "bV16NEWld8Q", "RudKmwzDpNY", "nsZ8XqHPjI4",
               "F3n5qtj89QE", "Nb5cBkbQpGY", "yXZSeiAl4PI",
               "7XtEZvLo-Sc", "T4fjSrVCDvA", "6V1eMvGGcXQ"]
 
for index, _id in enumerate(personality_vid_ids):
    try:
        transcript = YouTubeTranscriptApi.get_transcript(_id)
        with open(f'personality_{index}.json', 'w', encoding='utf-8') as json_file:
            json.dump(transcript, json_file)
    except Exception:
        print(f"{index} not valid")
       
for index, _id in enumerate(mom_vid_ids):
    try:
        transcript = YouTubeTranscriptApi.get_transcript(_id)
        with open(f'mom_{index}.json', 'w', encoding='utf-8') as json_file:
            json.dump(transcript, json_file)
    except Exception:
        print(f"{index} not valid")

Text Analysis on YouTube Transcripts with NLP

Some of the videos in the lists did not have transcripts. Indexes 5, 7, and 11 of Maps of Meaning failed, meaning episodes 6, 8, and 12 did not have transcripts (Python lists are 0-indexed). Episodes 13, 14, and 16 of the Personality lecture series also lacked transcripts. Now that we have (most of) our YouTube series transcripts, we can do NLP text analysis on them. To follow along, you'll need a free API key from The Text API and the aiohttp library. We can install the library with the line below:

pip install aiohttp
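
The module below imports its API key from a small text_config.py file. That file just needs to define a single apikey variable; something like this placeholder sketch works (swap in your own key from The Text API):

# text_config.py - minimal sketch of the config module imported below
apikey = "YOUR_TEXT_API_KEY_HERE"

Keeping the key in its own module makes it easy to leave out of version control.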

Create Module to Asynchronously Call API Endpoints

The first thing we’re going to do is set up a module to asynchronously call our API endpoints. The four NLP endpoints we’re going to call are the ones to get a summary, do named entity recognition, get the polarity, and get the most common phrases. We’ll start this module by importing the libraries we need and the API key.

Then we'll set up the headers and the four API endpoints we'll need. We're going to be running an event loop with asyncio and aiohttp. To avoid a possible "RuntimeError: Event loop is closed" at shutdown, we'll wrap the Proactor transport's __del__ method so that this specific error is silenced. Next, we'll create a function that builds the request body for each endpoint from a passed-in text.

Then we'll create the two functions needed to concurrently and asynchronously execute multiple API calls. The gather_with_concurrency function uses a Semaphore object to limit how many requests run at once. The post_async function asynchronously executes a POST request and waits for the response. For a more detailed description, read how to Asynchronously Call API Endpoints in Python.

import asyncio
import aiohttp
import json
 
from text_config import apikey
 
# configure request constants
headers = {
    "Content-Type": "application/json",
    "apikey": apikey
}
text_url = "https://app.thetextapi.com/text/"
summarize_url = text_url+"summarize"
ner_url = text_url+"ner"
mcp_url = text_url+"most_common_phrases"
polarity_url = text_url+"text_polarity"
 
"""fix yelling at me error"""
from functools import wraps
 
from asyncio.proactor_events import _ProactorBasePipeTransport
 
def silence_event_loop_closed(func):
    @wraps(func)
    def wrapper(self, *args, **kwargs):
        try:
            return func(self, *args, **kwargs)
        except RuntimeError as e:
            if str(e) != 'Event loop is closed':
                raise
    return wrapper
 
_ProactorBasePipeTransport.__del__ = silence_event_loop_closed(_ProactorBasePipeTransport.__del__)
"""fix yelling at me error end"""
 
# configure request bodies
# return a dict of url: body
def configure_bodies(text: str):
    _dict = {}
    _dict[summarize_url] = {
        "text": text,
        "proportion": 0.1
    }
    _dict[ner_url] = {
        "text": text,
        "labels": "ARTICLE"
    }
    _dict[mcp_url] = {
        "text": text,
        "num_phrases": 5
    }
    _dict[polarity_url] = {
        "text": text
    }
    return _dict
 
# configure async requests
# configure gathering of requests
async def gather_with_concurrency(n, *tasks):
    semaphore = asyncio.Semaphore(n)
    async def sem_task(task):
        async with semaphore:
            return await task
   
    return await asyncio.gather(*(sem_task(task) for task in tasks))
 
# create async post function
async def post_async(url, session, headers, body):
    async with session.post(url, headers=headers, json=body) as response:
        text = await response.text()
        return json.loads(text)

Saving Returned Data

Now that we’ve set up the module for asynchronous API calls, let’s finish it out by creating a function to call the API endpoints and save the returned data. In our function, we’ll create a Session object, set up the request bodies, and then call gather_with_concurrency to do all four API calls.

Once we execute the API calls, we'll save the results into separate files. Each response needs its own handling because they're all structured differently. We can save the summary response verbatim. The named entity recognition response needs a bit of parsing to turn the list of lists into a list of strings, which we then write to a file separated by newlines for readability.

The most common phrases come back as a list of strings, so we can save each one directly, again separated by newlines. Finally, we'll save the polarity in a text file as is, just like the summary. This is also covered in more detail in the post about asynchronous API calls.

async def pool(text: str, term: str):
    conn = aiohttp.TCPConnector(limit=None, ttl_dns_cache=300)
    session = aiohttp.ClientSession(connector=conn)
    urls_bodies = configure_bodies(text)
    conc_req = 4
    summary, ner, mcp, polarity = await gather_with_concurrency(conc_req, *[post_async(url, session, headers, body) for url, body in urls_bodies.items()])
    await session.close()
   
    # write docs
    with open(f"summaries/{term}.txt", "w") as f:
        f.write(summary["summary"])
   
    ners = ner["ner"]
    list_ners = []
    for ner in ners:
        list_ners.append(" ".join(ner))
   
    with open(f"ners/{term}.txt", "w") as f:
        for ner in list_ners:
            f.write(ner + '\n')
       
    mcps = mcp["most common phrases"]
    with open(f"mcps/{term}.txt", "w") as f:
        for mcp in mcps:
            f.write(mcp + '\n')
   
    with open(f"polarities/{term}.txt", "w") as f:
        f.write(str(polarity["text polarity"]))
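
One thing to keep in mind: the pool function writes into summaries, ners, mcps, and polarities folders, so those need to exist before we run it. A quick setup sketch (not part of the module above, just a convenience):

import os

# create the output folders that pool() writes to, if they don't already exist
for folder in ["summaries", "ners", "mcps", "polarities"]:
    os.makedirs(folder, exist_ok=True)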

Parse Transcripts and Call API

Now let's create an orchestrator module to parse the downloaded transcripts and run the asynchronous calls. First, we'll import the modules we need and the pool function from the asynchronous API call module we created above. Then we'll create a synchronous function that uses asyncio to run our asynchronous function. Next, we'll create a function to remove the bracketed, non-language captions (like [Music]) from our YouTube transcripts.

Now let's loop through all of the video transcripts we downloaded and use NLP to do text analysis on them. We loop through every file in the directory holding the downloaded transcripts (a folder called jp in the code below) using os.listdir, which lists all the filenames in a directory. For each file, we join the caption entries into a list of strings, each slightly over 6000 characters, plus one final, shorter string for whatever text is left over. You can adjust this soft character limit based on your internet speed. For each string in this list, we call The Text API to get a text analysis.

import json
import asyncio
import os
 
from async_pool import pool
 
# send parts of it through to orchestrate text analysis at a time
 
def call_pool(text: str, term: str):
    asyncio.new_event_loop().run_until_complete(pool(text, term))
   
def remove_brackets(text: str):
    while "[" in text:
        index1 = text.find("[")
        index2 = text.find("]")
        text = text[:index1] + text[index2+1:]
    return text
 
for video_transcript in os.listdir("./jp"):
    with open(f"./jp/{video_transcript}", "r") as f:
        entries = json.load(f)
    # join the caption entries into chunks of just over 6000 characters
    text = ""
    texts = []
    for entry in entries:
        text += remove_brackets(entry["text"]) + " "
        if len(text) > 6000:
            texts.append(text)
            text = ""
    # keep the final, shorter chunk so no transcript text is dropped
    if text:
        texts.append(text)
    # run the four NLP calls on each chunk, named after the file and chunk index
    for index, text in enumerate(texts):
        call_pool(text, f"{video_transcript[:-5]}_{index}")
        print(f"{video_transcript} {index} done")

Summary of Using NLP to Analyze a YouTube Series – Part 1

In this post we covered how to use NLP to analyze a YouTube series. We covered how to download YouTube transcripts, create a module to asynchronously call API endpoints and save the returned data, and orchestrate the parsing of the downloaded files and the API calls. This is only part one; in part two, we'll gather the results of the text analysis and glean insights from them.

When we finish the next part, we'll have links here: Insights from 2017 Maps of Meaning: The Architecture of Belief and 2017 Personality and Its Transformations.

I run this site to help you and others like you find cool projects and practice software skills. If this is helpful for you and you enjoy an ad-free site, please help fund this site by donating below! If you can't donate right now, please think of us next time.

Yujian Tang
