
What Are The Most Common Phrases on YouTube’s Front Page?


Have you ever wondered how to make the front page of YouTube? I certainly have, so I used the best sentiment analysis API out there, The Text API, in combination with Selenium and Beautiful Soup to find out what the most common phrases in the titles on YouTube’s front page are. We’re going to do this in two steps:

  1. Scrape YouTube’s Front Page
  2. Find the Most Common Phrases

You can find the source code for this project on GitHub. TL;DR – COVID, but what about when you take that away? You’ll have to scroll to the bottom to find out 🙂

Scraping YouTube’s Front Page for Titles

The first thing we’ll need to do to analyze the most common phrases on YouTube’s Front Page is to scrape the titles of the videos on the page. I actually did this in the last article on How Many Views Per Day Do Front Page YouTube Videos Get? That article discusses how to scrape YouTube’s front page for title texts, author, views, and the length of the video, then how to convert those views into views per day. We only need to scrape for the titles, so let’s get into how to do that. Just like the last article, you’ll need to download Chromedriver. Let’s get started by downloading our libraries.

pip install selenium beautifulsoup4 requests

Download Chromedriver at the link above and pick whichever version you’d like. It’ll lead you to a page that looks like the image below. Then, pick the right zip file depending on your OS, unzip it, and remember the path or move it somewhere easily accessible.

[Image: the Chromedriver download page]

Let’s import our libraries. We’ll use re for regex to pull the titles, Selenium and some sublibraries for running our chromedriver with options, sleep to make our Chrome driver wait for the page to load, randint to randomize the wait time (not strictly necessary, I just like it because it makes us look less like a bot), and bs4 for Beautiful Soup to parse the HTML document.

import re
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.chrome.options import Options
from time import sleep
from random import randint
from bs4 import BeautifulSoup

Now we’ll simply spin up our chromedriver, go to YouTube, and use BeautifulSoup to extract all the links that start with “watch”. We’ll shut down the driver once we’ve grabbed the HTML we need.

chromedriver_path = "<your chromedriver path>"
service = Service(chromedriver_path)
chrome_options = Options()
chrome_options.headless = True
driver = webdriver.Chrome(service=service, options=chrome_options)
 
home = "https://www.youtube.com/"
driver.get(home)
sleep(randint(2, 4))
# --------- run here and check for where to find videos ----------
 
soup = BeautifulSoup(driver.page_source, 'html.parser')
titles = soup.find_all("a", href=re.compile("watch.*"))
driver.quit()

It’s important to have an idea of what the text we’re extracting looks like, so let’s take a look at the “aria-label” text of a title.

[Image: the aria-label text of a YouTube video title on the front page]

Then we split this text on whitespace, drop the last two elements (the view count and the word “views”), rejoin the rest, and split again on the “by” keyword. That gives us a list whose first element is the title. Finally, we store all the titles in a list.

[Image: the aria-label text of a front-page YouTube video, split up]
# title, when uploaded, number of views
title_list = []
for title in titles:
    text = title.get('aria-label')
    if text is None:
        continue
    elements = text.split(' ')
    re_join = ' '.join(elements[:-2]).split('by')
    title_text = re_join[0]
    title_list.append(title_text)

Finding the Most Common Phrases in All the Titles

Now that we’ve gathered a list of titles, we’ll find the most common phrases in them. For this part, we’ll be using the “Common Phrases” endpoint of The Text API. Head on over to The Text API website and when you land on the page, scroll all the way down and click the “Get Your API Key” button as shown below.

[Image: the “Get Your API Key” button on The Text API homepage]

Once you log in, you should see your API key front and center at the top of the page. I have a paid account, but you can do this project with a free account; paid accounts are in closed beta at the time of writing anyway.

Just as the documentation on How to use V1.1.0 of the Text API says to do, we’ll hit the most_common_phrases endpoint with the headers and body described. First we’ll combine our text into one string so that we can detect the most common phrases in the entire corpus of titles on the front page, and then we’ll send our request.

import json
import requests

headers = {
    "Content-Type": "application/json",
    "apikey": "<your API key here>"
}
# endpoint name from The Text API docs, off the same base URL as the other endpoints
cw_url = "https://app.thetextapi.com/text/most_common_phrases"

text = ""
for t in title_list:
    text += t

body = {
    "text": text
}

res = json.loads(requests.post(cw_url, headers=headers, json=body).text)
cws = res["most common phrases"]
print(cws)

When we run this we should see an output like the one below.

[Image: most common phrases extracted from YouTube’s front page, circa October 2021]

It’s no surprise that the most common phrases among the titles are COVID related. This article is being written in November of 2021, in the middle of the COVID pandemic. So, let’s take a deeper dive and see what happens if we get rid of the COVID keyword from our text. Note that I removed ALL titles containing the word COVID, comparing lowercased versions in case some titles are cased inconsistently. The continue statement tells the program to skip to the next iteration of the for loop.

text = ""
for t in title_list:
    if "covid" in t.lower():
        continue
    text += t
 
body = {
    "text": text
}
 
res = json.loads(requests.post(cw_url, headers=headers, json=body).text)
cws = res["most common phrases"]
print(cws)

Before we go on, take a moment to guess what you think the most popular phrases will be. Alright, let’s go. When we run this, we’ll see an output like the one below:

[Image: most common phrases on YouTube’s front page, minus COVID]

What a weird output. I would never have guessed that the top keywords on YouTube’s front page after removing titles about COVID would have been related to water and study music. What about you? Tell me in the comments below!


I run this site to help you and others like you find cool projects and practice software skills. If this is helpful for you and you enjoy this ad-free site, please help fund it by donating below! If you can’t donate right now, please think of us next time.


How Many Views Per Day do Front Page YouTube Videos Get?

YouTube is the celebrity maker of our generation. I don’t know about you, but I’m pretty curious as to how many views per day a YouTube video on the front page gets. Let’s take a look at the videos on YouTube’s front page and see how many views per day each of these videos get. We’re going to write a Python program that will pull all this information down and analyze it for us. To do this, we’ll need the help of Selenium with Chromedriver, Beautiful Soup, and The Text API.

Let’s start by downloading our libraries (we’ll use the dateparser library to calculate when the video was posted, but there are other ways to do it too):

pip install selenium beautifulsoup4 dateparser requests

Note that the Beautiful Soup Python library is actually packaged under the name “beautifulsoup4”. After we install our Python libraries, we’ll want to go to the Chromedriver link provided above and install Chromedriver; you can pick whichever version you want, and it should lead you to a page with “.zip” files to download.

Download the right “.zip” file for your operating system. I’m using Windows, so I downloaded the “_win32” version. Once you download it, you’ll need to extract it and either keep track of its location or move the chromedriver.exe file into a folder that you can easily remember. Once you’re done downloading your chromedriver, head on over to The Text API website and sign up for an API key. When you land on the page, scroll all the way down and click the “Get Your API Key” button as shown below.

[Image: signing up for The Text API]

Once you log in, you should see your API key front and center at the top of the page. I have a paid account, but you can do this project with a free account; paid accounts are currently in closed beta.

Create a Web Scraper to Scrape YouTube

Alright, now we’re done with all the setup, let’s get into the code! First we’ll have to import all the libraries we need. I’ve imported the re library for regular expressions, requests to send API requests to The Text API, dateparser to parse dates, datetime to get the current date, multiple Selenium libraries and sublibraries for running the chromedriver with options, sleep from the time library to wait for the page to load, randint to allow for a randomized waiting time so that we don’t look too much like a bot, BeautifulSoup from bs4 to parse the webpage, and json to parse JSON objects. I’ve also imported ner_url and headers from my text_api_config file. These are for hitting The Text API URL endpoint for extracting the date, and the headers contain the API key.

import re
import requests
import dateparser
import datetime
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.chrome.options import Options
from time import sleep
from random import randint
from bs4 import BeautifulSoup
import json
 
from text_api_config import headers, ner_url

The URL endpoint we’ll be hitting is the “Named Entity Recognition (NER)” endpoint so that we can extract the date that the video was uploaded. Before we move into the actual logic, I’ll show you what the headers and ner_url objects should look like:

headers = {
    "Content-Type": "application/json",
    "apikey": "<your API key here>"
}
text_url = "https://app.thetextapi.com/text/"
ner_url = text_url + "ner"

The first thing we’ll want to do is actually launch the chromedriver and navigate to the front page of YouTube. We’ll use the Service class we imported from Selenium earlier to open up Chrome, and I’ll open it in “headless” mode, which simply means that the driver will control Chrome without showing the browser window. You can opt to run in headless mode or not. Once we get to YouTube, we should stop here and check to see where we can find video titles based on the HTML elements of the page.

chromedriver_path = "<wherever you saved chromedriver>"
service = Service(chromedriver_path)
chrome_options = Options()
chrome_options.headless = True
driver = webdriver.Chrome(service=service, options=chrome_options)
 
home = "https://www.youtube.com/"
driver.get(home)
sleep(randint(2, 4))
# --------- run here and check for where to find videos ----------

Alright, once we’ve found where to find videos, we’ll need to extract the videos. This is where BeautifulSoup comes in to help us out. We can simply use BeautifulSoup to extract the HTML structure of the page and pull all the “a” elements (links) that start with “watch” (thank you regular expressions) to find the locations of all the YouTube videos on the front page of YouTube. After we’re done getting the elements, we should quit the chromedriver so it doesn’t continue to run in the background. This is just best practice to reduce processing power, but you’re actually free to let it run if you’d like.

soup = BeautifulSoup(driver.page_source, 'html.parser')
titles = soup.find_all("a", href=re.compile("watch.*"))
driver.quit()

Now we get to the logic of parsing the page. The “text” of the YouTube video title is, at the time of writing, located in the “aria-label” attribute of the element. It includes the title, the author, when it was posted, how long it is, and how many views it has. We can split this text on the “by” keyword to separate the title from the author, date of posting, and length of the video.
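That split can be sketched on a hypothetical aria-label string (the title, author, and numbers here are made up for illustration; the real text may differ slightly):

```python
# Hypothetical aria-label text, following the pattern described above
text = "How to Bake Bread by Baker Bob 2 weeks ago 12 minutes 1,234,567 views"

elements = text.split(' ')
num_views = elements[-2]                        # second-to-last token is the view count
re_join = ' '.join(elements[:-2]).split('by')   # drop "N views", then split off the title
title_text = re_join[0]
when_uploaded = re_join[1]

print(title_text.strip())     # → How to Bake Bread
print(when_uploaded.strip())  # → Baker Bob 2 weeks ago 12 minutes
print(num_views)              # → 1,234,567
```

Note that this naive split breaks if the title itself contains the substring “by”, which is one reason the full script below wraps its parsing in a try/except.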

Parse and Analyze YouTube Videos and Their Views

We’ll have to parse out the elements we want and send the second half, that is, the elements without the title, to The Text API so we can extract the date. We could also get the length of the video, but we won’t need that for the scope of this project. I’ve included a comment below the request to show what the ner_url returns; it comes back as a string because we access it via the .text attribute of the response. We’ll use a regular expression to find the first quoted text that starts with a number, because we expect the date to be the first entry with a number. Of course, this will fail if the author’s name starts with a number. We’ll put our extraction of the date in a try/except block, then save our data into a dictionary keyed by title, holding the relative date returned from the NER endpoint of The Text API and the total number of views.

# title, when uploaded, number of views
title_dict = {}
for title in titles:
    text = title.get('aria-label')
    if text is None:
        continue
    print(text)
    elements = text.split(' ')
    num_views = elements[-2]
    re_join = ' '.join(elements[:-2]).split('by')
    print(re_join)
    title_text = re_join[0]
    when_uploaded = re_join[1]
    body = {
        "text": when_uploaded
    }
    response = requests.post(ner_url, headers=headers, json=body)
    # {"ner":[["DATE","2 weeks ago"],["TIME","1 minute, 12 seconds"]]}
    x = re.findall(r"\"([0-9].*?)\"", response.text, re.DOTALL)
    try:
        y = x[0]
    except IndexError:
        continue

    # print(f"{title_text}: {num_views}\nUploaded on: {when_uploaded}")
    title_dict[title_text] = (y, num_views)
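To see what that regular expression is doing, here’s the same findall run on the sample response body from the comment above:

```python
import re

# Sample NER response body, as in the comment in the loop above
response_text = '{"ner":[["DATE","2 weeks ago"],["TIME","1 minute, 12 seconds"]]}'

# Capture every quoted string that begins with a digit; the first
# match should be the relative upload date
x = re.findall(r"\"([0-9].*?)\"", response_text, re.DOTALL)
print(x)     # → ['2 weeks ago', '1 minute, 12 seconds']
print(x[0])  # → 2 weeks ago
```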

Now that we have the titles, their views, and when they were uploaded, we can calculate the average views per day that each video gets. To do this, we’ll use the dateparser library we installed earlier to parse the relative date into an absolute datetime, then use the datetime library to get the current date. We’ll take the difference between them and convert that into a number of days. To avoid dividing by zero, we’ll treat 0 days as 1 day. Then we’ll parse the number of views by stripping the commas and converting it into an int so we can perform division on it. Finally, we’ll write the results to a JSON file so we can keep track of them for later.

title_to_avg_daily_views = {}
for title in title_dict:
    print(title)
    try:
        dt_object_then = dateparser.parse(title_dict[title][0])
    except:
        continue
    print(dt_object_then)
    days_since = (datetime.datetime.now() - dt_object_then).days
    if days_since == 0:
        days_since = 1
    avg_views_per_day = int(title_dict[title][1].replace(',',''))/days_since
    print(avg_views_per_day)
    title_to_avg_daily_views[title] = avg_views_per_day
 
json_dict = json.dumps(title_to_avg_daily_views, indent=4)
with open("local_titles_and_views.json", "w") as f:
    f.write(json_dict)
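The per-day arithmetic in that loop can be checked in isolation with fixed dates (the helper name here is just for illustration):

```python
import datetime

def views_per_day(num_views_str, uploaded_dt, now):
    """Mirror of the loop's arithmetic above."""
    days_since = (now - uploaded_dt).days
    if days_since == 0:   # same-day uploads: avoid dividing by zero
        days_since = 1
    return int(num_views_str.replace(',', '')) / days_since

now = datetime.datetime(2021, 11, 1)
then = datetime.datetime(2021, 10, 18)  # uploaded 14 days earlier
print(views_per_day("1,400,000", then, now))  # → 100000.0
```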

The final JSON file maps each video title to its average daily views.

That’s it! Now we know how many views a day each of the videos on the front page of YouTube get. In our next articles, we’ll be exploring the most common phrases among these videos and whether or not more polarizing YouTube videos get more views per day!
