What Are The Most Common Phrases on YouTube’s Front Page?

Have you ever wondered how to make the front page of YouTube? I certainly have, so I used the best sentiment analysis API out there, The Text API, in combination with Selenium and Beautiful Soup to find out what the most common phrases in the titles on YouTube’s front page are. We’re going to do this in two steps:

  1. Scrape YouTube’s Front Page
  2. Find the Most Common Phrases

You can find the source code for this project on GitHub. TL;DR – COVID, but what about when you take that away? You’ll have to scroll to the bottom to find out 🙂

Scraping YouTube’s Front Page for Titles

The first step in analyzing the most common phrases on YouTube’s front page is to scrape the titles of the videos on the page. I did this in the last article, How Many Views Per Day Do Front Page YouTube Videos Get? That article covers how to scrape YouTube’s front page for the title, author, views, and length of each video, and how to convert those views into views per day. Here we only need the titles, so let’s get into how to do that. Just like in the last article, you’ll need to download Chromedriver. Let’s get started by installing our libraries.

pip install selenium beautifulsoup4 requests

Download Chromedriver at the link above and pick the version that matches the version of Chrome installed on your machine. It’ll lead you to a page that looks like the image below. Then pick the right zip file for your OS, unzip it, and remember the path or move it somewhere easily accessible.

chromedriver download page view

Let’s import our libraries. We’ll use re for regex to pull the titles, Selenium and some of its submodules for running our chromedriver with options, sleep to make our chromedriver wait for the page to load, randint to randomize the wait time (this is not strictly necessary, I just like it because it makes us look less like a bot), and bs4 for Beautiful Soup to parse the HTML doc.

import re
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.chrome.options import Options
from time import sleep
from random import randint
from bs4 import BeautifulSoup

Now we’ll simply spin up our chromedriver, go to YouTube, and use BeautifulSoup to extract all the links whose href contains “watch”. We’ll shut down the driver once we have the HTML we need.

chromedriver_path = "<your chromedriver path>"
service = Service(chromedriver_path)
chrome_options = Options()
chrome_options.add_argument("--headless")  # the .headless setter is deprecated in newer Selenium releases
driver = webdriver.Chrome(service=service, options=chrome_options)
 
home = "https://www.youtube.com/"
driver.get(home)
sleep(randint(2, 4))
# --------- run here and check for where to find videos ----------
 
soup = BeautifulSoup(driver.page_source, 'html.parser')
titles = soup.find_all("a", href=re.compile("watch.*"))
driver.quit()
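
To see which links that pattern keeps, here is a small sketch that uses only the standard library. It relies on the behavior described in the Beautiful Soup documentation, where a compiled regex passed as a filter is applied with its search() method, so the pattern can match anywhere in the href. The sample hrefs are made up for illustration, and the stricter pattern is just one possible alternative, not something the original code uses.

```python
import re

# Hypothetical href values of the kind found on a YouTube page (made up for illustration)
sample_hrefs = [
    "/watch?v=abc123",
    "/shorts/xyz789",
    "/results?search_query=watch+this",
    "/watch?v=def456&t=10s",
]

loose = re.compile("watch.*")        # the pattern used above
strict = re.compile(r"^/watch\?v=")  # a tighter alternative: only hrefs that begin with /watch?v=

# search() matches anywhere in the string, so the loose pattern also keeps the search URL
loose_matches = [h for h in sample_hrefs if loose.search(h)]
strict_matches = [h for h in sample_hrefs if strict.search(h)]

print(loose_matches)
print(strict_matches)
```

If the loose pattern pulls in links you don't want, anchoring the pattern to the start of the href is one way to narrow it.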

It’s important to have an idea of what the text we’re extracting looks like. So let’s take a look at the “aria-label” text of one of the titles.

title of a youtube video on the front page

Then we split this on whitespace, drop the last two tokens, rejoin the text, and then split again on the “by” keyword. We get a list back, and the title is its first element. Finally, we store all the titles in a list.

title of a youtube video on the front page split up
# title, when uploaded, number of views
title_list = []
for title in titles:
    text = title.get('aria-label')
    if text is None:
        continue
    elements = text.split(' ')
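    # split on the standalone word "by"; a plain split('by') would also cut inside words like "baby"
    re_join = re.split(r'\bby\b', ' '.join(elements[:-2]))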
    title_text = re_join[0]
    title_list.append(title_text)
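
The parsing step is easier to check when it lives in its own function. The sketch below is one way to do that. The helper name `extract_title` and the sample labels are hypothetical, and the label layout (title, then "by", then channel, upload age, length, and view count) is an assumption based on the description above, so verify it against the labels you actually scrape.

```python
import re

def extract_title(aria_label):
    """Return the part of an aria-label that precedes the first standalone 'by'."""
    # Drop the last two whitespace-separated tokens, as the loop above does
    tokens = aria_label.split(' ')
    trimmed = ' '.join(tokens[:-2])
    # \b keeps the 'by' inside words such as "Baby" from counting as a separator
    return re.split(r'\bby\b', trimmed)[0].strip()

# Made-up labels in the assumed layout
samples = [
    "Baby Animals Compilation by Some Channel 2 days ago 10 minutes 1,234 views",
    "Lofi Study Mix by Another Channel 1 year ago 3 hours 98,765 views",
]
parsed = [extract_title(s) for s in samples]
print(parsed)
```

A title that itself contains the standalone word "by" would still be cut short, so treat the result as a heuristic.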

Finding the Most Common Phrases in All the Titles

Now that we’ve gathered a list of titles, we’ll find the most common phrases in them. For this part, we’ll be using the “Common Phrases” endpoint of The Text API. Head on over to The Text API website and when you land on the page, scroll all the way down and click the “Get Your API Key” button as shown below.

get the text api key

Once you log in, you should see your API key front and center at the top of the page, as shown in the image below. I have a paid account, but you can do this project with a free account, and paid accounts are in closed beta at the time of writing anyway.

Just as the documentation on How to use V1.1.0 of the Text API says to do, we’ll hit the most_common_phrases endpoint with the headers and body described. First we’ll combine our text into one string so that we can detect the most common phrases in the entire corpus of titles on the front page, and then we’ll send our request.

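import json
import requests

# cw_url and headers are not defined earlier in this article; fill them in from
# The Text API documentation (endpoint URL and the headers that carry your API key)
cw_url = "<most_common_phrases endpoint URL>"
headers = {}  # add the headers described in the docs here

# join the titles with spaces so words from adjacent titles don't run together
text = " ".join(t.strip() for t in title_list)

body = {
    "text": text
}

res = json.loads(requests.post(cw_url, headers=headers, json=body).text)
cws = res["most common phrases"]
print(cws)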

When we run this we should see an output like the one below.

most common phrases extracted from youtube’s front page circa october 2021
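
If you want a rough local sanity check on the API's output, you can count repeated word sequences yourself. The sketch below is not how The Text API computes its result (this article does not describe that); it simply counts two-word phrases with collections.Counter. The function name `top_phrases` and the sample titles are hypothetical.

```python
import re
from collections import Counter

def top_phrases(titles, n=2, k=3):
    """Count n-word phrases within each title and return the k most frequent."""
    counts = Counter()
    for title in titles:
        words = re.findall(r"[a-z0-9']+", title.lower())
        # Build n-grams per title so a phrase never spans two different titles
        for i in range(len(words) - n + 1):
            counts[' '.join(words[i:i + n])] += 1
    return counts.most_common(k)

# Made-up titles for illustration
sample_titles = [
    "Relaxing Study Music for Focus",
    "Study Music Playlist",
    "Deep Focus Study Music",
]
print(top_phrases(sample_titles, n=2, k=1))  # [('study music', 3)]
```

Counting per title, rather than over one concatenated string, avoids phrases made from the end of one title and the start of the next.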

It’s no surprise that the most common phrases among the titles are COVID related. This article is being written in November of 2021, in the middle of the COVID pandemic. So, let’s take a deeper dive and see what happens if we get rid of the COVID keyword from our text. Just a note: I got rid of ALL titles containing the word COVID. I lowercase each title before checking so that the match is case-insensitive. The “continue” keyword tells the program to go to the next iteration of the for loop.

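titles_without_covid = []
for t in title_list:
    if "covid" in t.lower():
        continue
    titles_without_covid.append(t.strip())
# join with spaces so words from adjacent titles don't run together
text = " ".join(titles_without_covid)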
 
body = {
    "text": text
}
 
res = json.loads(requests.post(cw_url, headers=headers, json=body).text)
cws = res["most common phrases"]
print(cws)

Before we go on, take a guess at what the most popular phrases will be. Alright, let’s go. When we run this we’ll see an output like the one below:

most common phrases on youtube’s front page minus covid

What a weird output. I would never have guessed that the top keywords on YouTube’s front page after removing titles about COVID would have been related to water and study music. What about you? Tell me in the comments below!
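
The COVID filter above generalizes to any set of keywords. As a minimal sketch (the helper name `drop_titles_with` and the sample titles are hypothetical), you can pass in the terms to exclude and reuse the same case-insensitive check:

```python
def drop_titles_with(titles, banned_terms):
    """Return the titles that contain none of the banned terms (case-insensitive)."""
    banned = [term.lower() for term in banned_terms]
    return [t for t in titles if not any(term in t.lower() for term in banned)]

# Made-up titles for illustration
sample_titles = [
    "COVID Update: What We Know",
    "Rain Sounds for Sleeping",
    "Covid-19 Vaccine Explained",
    "Study Music for Deep Focus",
]
remaining = drop_titles_with(sample_titles, ["covid"])
text = " ".join(remaining)
print(text)
```

The resulting text can then be sent to the phrase endpoint in the same way as before.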

Learn More

To learn more, feel free to reach out to me @yujian_tang on Twitter, connect with me on LinkedIn, and join our Discord. Remember to follow the blog to stay updated with cool Python projects and ways to level up your Software and Python skills! If you liked this article, please Tweet it, share it on LinkedIn, or tell your friends!

Yujian Tang

I started my professional software career interning for IBM in high school after winning ACSL two years in a row. I got into AI/ML in college where I published a first author paper to IEEE Big Data. After college I worked on the AutoML infrastructure at Amazon before leaving to work in startups. I believe I create the highest quality software content so that’s what I’m doing now. Drop a comment to let me know!
