YouTube is the celebrity maker of our generation. I don’t know about you, but I’m pretty curious as to how many views per day a YouTube video on the front page gets. Let’s take a look at the videos on YouTube’s front page and see how many views per day each of these videos gets. We’re going to write a Python program that will pull all this information down and analyze it for us. To do this, we’ll need the help of Selenium with Chromedriver, Beautiful Soup, and The Text API.
Let’s start by downloading our libraries (we’ll use the dateparser library to calculate when the video was posted, but there are other ways to do it too):

pip install selenium beautifulsoup4 dateparser requests
Note that the Beautiful Soup Python library is actually packaged under the name “beautifulsoup4”. After we install our Python libraries, we’ll want to go to the Chromedriver link provided above and install Chromedriver. You can pick whichever version of Chromedriver you want, and it should lead you to a page with “.zip” files to download that looks something like this:
Download the right “.zip” file for your operating system. I’m using Windows, so I downloaded the “_win32” version. Once you download it, you’ll need to extract it and either keep track of its location or move the chromedriver.exe file into a folder that you can easily remember. Once you’re done downloading Chromedriver, head on over to The Text API website and sign up for an API key. When you land on the page, scroll all the way down and click the “Get Your API Key” button as shown below.
Sign up for The Text API
Once you log in, you should see your API key front and center at the top of the page like shown in the image below. I have a paid account, but you can do this project with a free account. Paid accounts are currently in closed Beta.
Create a Web Scraper to Scrape YouTube
Alright, now that we’re done with all the setup, let’s get into the code! First we’ll have to import all the libraries we need. I’ve imported the re library for regular expressions, requests to send API requests to The Text API, dateparser to parse dates, datetime to get the current date, multiple Selenium libraries and sublibraries for running Chromedriver with options, sleep from the time library to wait for the page to load, randint to allow for a randomized waiting time so that we don’t look too much like a bot, BeautifulSoup from bs4 to parse the webpage, and json to handle JSON objects. I’ve also imported ner_url and headers from my text_api_config file. These are for hitting The Text API URL endpoint for extracting the date, and the headers contain the API key.
import re
import requests
import dateparser
import datetime
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.chrome.options import Options
from time import sleep
from random import randint
from bs4 import BeautifulSoup
import json
from text_api_config import headers, ner_url
The URL endpoint we’ll be hitting is the “Named Entity Recognition (NER)” endpoint so that we can extract the date that the video was uploaded. Before we move into the actual logic, I’ll show you what the headers and ner_url objects should look like:
headers = {
"Content-Type": "application/json",
"apikey": <your API key here>
}
text_url = "https://app.thetextapi.com/text/"
ner_url = text_url + "ner"
The first thing we’ll want to do is actually launch Chromedriver and navigate to the front page of YouTube. We’ll use the Service class we imported from Selenium earlier to open up Chrome, and I’ll open it in “headless” mode, which simply means that the driver will be able to access Chrome, but we won’t see the browser. You can opt to run in headless mode or not. Once we get to YouTube, we should stop here and check to see where we can find video titles based on the HTML elements of the page.
chromedriver_path = "<wherever you saved chromedriver>"
service = Service(chromedriver_path)
chrome_options = Options()
chrome_options.headless = True
driver = webdriver.Chrome(service=service, options=chrome_options)
home = "https://www.youtube.com/"
driver.get(home)
sleep(randint(2, 4))
# --------- run here and check for where to find videos ----------
Alright, once we’ve found where the videos live in the page, we’ll need to extract them. This is where BeautifulSoup comes in to help us out. We can simply use BeautifulSoup to parse the HTML structure of the page and pull all the “a” elements (links) whose “href” starts with “watch” (thank you, regular expressions) to find the locations of all the YouTube videos on the front page of YouTube. After we’re done getting the elements, we should quit the chromedriver so it doesn’t continue to run in the background. This is just best practice to reduce processing power, but you’re actually free to let it run if you’d like.
soup = BeautifulSoup(driver.page_source, 'html.parser')
titles = soup.find_all("a", href=re.compile("watch.*"))
driver.quit()
Now we need to get to the logic of parsing the page. The “text” of the YouTube video title is, at the time of writing, located in the “aria-label” attribute of the element. It should include a title, the author, when it was posted, how long it is, and how many views it has. A hypothetical example in that format would look something like “Learn Python Fast by Some Channel 2 weeks ago 12 minutes, 34 seconds 1,234,567 views”.
We can then split the title text on the “by” keyword to separate the title from the author, date of posting, and length of the video.
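To make the split concrete, here’s a minimal sketch run against a made-up aria-label string (the title, channel name, and counts are hypothetical, not real scraped data):

```python
# A hypothetical aria-label in the format described above:
# "<title> by <channel> <relative date> <length> <views> views"
text = "Learn Python Fast by Some Channel 2 weeks ago 12 minutes, 34 seconds 1,234,567 views"

elements = text.split(' ')
num_views = elements[-2]                       # second-to-last token is the view count
re_join = ' '.join(elements[:-2]).split('by')  # drop "<views> views", split title from the rest
title_text = re_join[0]
when_uploaded = re_join[1]

print(title_text.strip())     # Learn Python Fast
print(when_uploaded.strip())  # Some Channel 2 weeks ago 12 minutes, 34 seconds
print(num_views)              # 1,234,567
```

Note that splitting on the substring “by” is brittle: it will also split on a “by” that happens to appear inside the title itself, which is why the scraping loop below needs some defensive error handling.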
Parse and Analyze YouTube Videos and Their Views
We’ll have to parse the elements we want and send the second half, that is, the elements without the title, to The Text API so we can extract the date. We could also get the length of the video, but we won’t need that for the scope of this project. I’ve included a comment below the response line so you can see what the ner_url endpoint returns as an object. It will be returned as a string because we access it via the .text attribute of the response. We’ll use a regular expression to look for the first quoted text that starts with a number, because we expect the date to be the first entry with a number. Of course, this will fail if the author’s name starts with a number. We’ll put our extraction of the date in a try/except block and then save our data into a dictionary, keyed by title, holding the relative date returned from the NER endpoint of The Text API and the total number of views.
# title, when uploaded, number of views
title_dict = {}
for title in titles:
    text = title.get('aria-label')
    if text is None:
        continue
    print(text)
    elements = text.split(' ')
    num_views = elements[-2]
    re_join = ' '.join(elements[:-2]).split('by')
    print(re_join)
    title_text = re_join[0]
    when_uploaded = re_join[1]
    body = {
        "text": when_uploaded
    }
    response = requests.post(ner_url, headers=headers, json=body)
    # {"ner":[["DATE","2 weeks ago"],["TIME","1 minute, 12 seconds"]]}
    x = re.findall(r"\"([0-9].*?)\"", response.text, re.DOTALL)
    try:
        y = x[0]
    except IndexError:
        # no entity starting with a number was found, so skip this video
        continue
    # print(f"{title_text}: {num_views}\nUploaded on: {when_uploaded}")
    title_dict[title_text] = (y, num_views)
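To see why that regular expression pulls out the date, here’s a minimal sketch run against the sample response string from the comment in the code above (a hard-coded string, not a live API call):

```python
import re

# Sample NER response body, as shown in the comment in the scraper loop
response_text = '{"ner":[["DATE","2 weeks ago"],["TIME","1 minute, 12 seconds"]]}'

# Find every double-quoted string that starts with a digit;
# the date ("2 weeks ago") is the first such match
x = re.findall(r"\"([0-9].*?)\"", response_text, re.DOTALL)
print(x)     # ['2 weeks ago', '1 minute, 12 seconds']
print(x[0])  # 2 weeks ago
```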
Now that we have the titles, their views, and when they were uploaded, we can calculate the average views per day that a video gets. To do this, we’ll use the dateparser library we installed earlier to parse the relative date into an absolute datetime, then use the datetime library to get the current date. We’ll take the difference between them and convert it to a number of days. To avoid a divide-by-zero case, we’ll convert 0 days into 1 day. Then we’ll parse the number of views by stripping the commas and converting it into an int type so we can perform division on it. Finally, we write it all to a JSON file so we can keep track of it to use later.
title_to_avg_daily_views = {}
for title in title_dict:
    print(title)
    dt_object_then = dateparser.parse(title_dict[title][0])
    # dateparser.parse returns None (rather than raising) when it can't parse the date
    if dt_object_then is None:
        continue
    print(dt_object_then)
    days_since = (datetime.datetime.now() - dt_object_then).days
    if days_since == 0:
        days_since = 1
    avg_views_per_day = int(title_dict[title][1].replace(',', '')) / days_since
    print(avg_views_per_day)
    title_to_avg_daily_views[title] = avg_views_per_day

json_dict = json.dumps(title_to_avg_daily_views, indent=4)
with open("local_titles_and_views.json", "w") as f:
    f.write(json_dict)
The final JSON file should look something like this:
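Here’s a hand-written sketch of the shape (the titles and numbers are hypothetical, not real scraped data):

```json
{
    "Learn Python Fast": 88183.35,
    "Some Other Video Title": 9876.5
}
```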
That’s it! Now we know how many views a day each of the videos on the front page of YouTube get. In our next articles, we’ll be exploring the most common phrases among these videos and whether or not more polarizing YouTube videos get more views per day!