Web Scraping the Easy Way: Python, Selenium, Beautiful Soup

Ever want to scrape a website for work, for fun, or simply to exercise your Python skills? I’ve done all three. Let’s take a look at a super easy way to scrape the web in under 50 lines of Python. Disclaimer: I’m not sure this follows US News’ terms of service, and whatever site you build a web scraper for, you may be breaking their ToS too!

Here’s the video version:

In this example we’ll be scraping the pages of the top 10 colleges in America (as ranked by US News in 2021) for their text. For this project you’ll need to download Chromedriver, and install Selenium and Beautiful Soup 4. You can use pip in the terminal to install the libraries.

pip install selenium beautifulsoup4

As always, we’ll start off by importing the libraries we need. We’ll be using re, the regex module, to pick out the links we want from Beautiful Soup. The webdriver submodule from selenium, as well as the Service submodule from selenium’s Chrome webdriver, are needed to run the webdriver. We’ll need BeautifulSoup to parse our HTML, and finally we’ll need sleep and randint to make ourselves look less like a bot.

import re
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from bs4 import BeautifulSoup
from time import sleep
from random import randint

Spin Up the Web Scraper

Next we’ll use the chromedriver executable we downloaded earlier to create a Chrome Service. Then we’ll use the Chrome webdriver to start it up and go to the URL. We’ll make the program sleep for a small random number of seconds to ensure the webpage loads and we don’t look too much like a bot, although you’ll see in the video that we still run into some problems with this.

chromedriver_path = "C:\\Users\\ytang\\Documents\\workspace\\teaching\\intermediate\\youtube\\front_page_titles\\chromedriver.exe"
service = Service(chromedriver_path)
driver = webdriver.Chrome(service=service)
 
url = "http://www.usnews.com/best-colleges/rankings/national-universities"
driver.get(url)
sleep(randint(3, 5))
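By the way, if you’d rather not hard-code a path to the chromedriver executable, the webdriver-manager package can download a matching driver for you. That’s not what I’m doing in this post, just a sketch of an alternative (requires pip install webdriver-manager):

# Alternative driver setup (not used in this post): let webdriver-manager fetch the driver
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager

service = Service(ChromeDriverManager().install())  # downloads and caches a matching chromedriver
driver = webdriver.Chrome(service=service)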

Now let’s take a look at all the links on this page. We’ll use Beautiful Soup to parse the webpage, and then we can quit the driver. I quit the driver here for two reasons: to avoid unnecessary processing, and, well, you’ll have to watch the video to understand the second one. Let’s start off by checking out all the links on the page. We call Beautiful Soup’s find_all function to look for all the link elements, then add their href value (the actual link) to a set and print out the set. I call the set top10 because in a moment I’m going to change the way we look for links on the page so that we only get the links for the top 10 schools.

soup = BeautifulSoup(driver.page_source, 'html.parser')
driver.quit()
 
links = soup.find_all("a")
top10 = set()
for link in links:
    top10.add(link.get("href"))
for link in top10:
    print(link)

We should get something that looks like this:
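The screenshot from the original post is omitted here, but the printout is a long, unordered dump of every href on the page, something like the following. Apart from the Princeton link, these lines are hypothetical stand-ins (the None comes from anchor tags with no href attribute):

None
#main-content
/best-colleges/rankings/national-universities
https://www.usnews.com/education
/best-colleges/princeton-university-2627
...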

That’s a lot of links we don’t care about. Let’s use regex to trim this down. In fact, the only link we care about in that output is /best-colleges/princeton-university-2627. Since the Princeton link looks like this, we can extrapolate that the other school links will also start with /best-colleges/. You’ll notice I also included a negative lookahead in the regex to drop the links containing the word “rankings”. We don’t need those, but they exist on the page. Other links that start with /best-colleges/ also exist on the page, but instead of writing a bunch of really complicated regex to sort those out, I simply excluded them using or conditions inside an if statement. Since these links are all relative, we’ll also need to declare the base URL they hang off of.

base_url = "http:/www.usnews.com"
soup = BeautifulSoup(driver.page_source, 'html.parser')
driver.quit()
links = soup.find_all("a", href=re.compile("^/best-colleges/(?!rankings).*"))
top10 = set()
for link in links:
    if "college-search" in link["href"] or "compare" in link["href"] or "admissions" in link["href"] or "myfit" in link["href"] or "photos" in link["href"] or "reviews" in link["href"]:
        continue
    top10.add(base_url+link["href"])
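In case the negative lookahead in that regex is unfamiliar, here’s a quick standalone sanity check. The two paths are just the Princeton link from above and one of the rankings links we want to exclude:

import re

pattern = re.compile("^/best-colleges/(?!rankings).*")
print(bool(pattern.match("/best-colleges/princeton-university-2627")))       # True: a school page
print(bool(pattern.match("/best-colleges/rankings/national-universities")))  # False: excluded by the lookahead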

Our web scraper should now give us just the 10 links to the top schools’ pages.

Now let’s loop through them and parse them. In three simple steps we’ll navigate to the page just like we did earlier, make soup just like we did earlier, and then get text from all the paragraphs, which is new. To find the element you want to pull text from, simply right-click it on the web page, click “Inspect Element”, and read the CSS on the side; to see how I landed on the “Raw-” class that the code below matches, watch the video. After getting all the paragraph elements we loop through them and append their text to a string. Finally, we save our string to a file and repeat for the next link. To name each file, we’ll split the URL string on the best-colleges/ string and take the second element (the URL slug for the school) and use that to create a .txt file.
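Here’s a quick sanity check of that split before the full loop, using the Princeton link from earlier:

# Illustrative only: deriving the filename from a link
link = base_url + "/best-colleges/princeton-university-2627"
print(link.split("best-colleges/")[1] + ".txt")  # princeton-university-2627.txt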

for link in top10:
    # navigate to the page
    driver = webdriver.Chrome(service=service)
    driver.get(link)
    sleep(randint(2, 4))
    # make soup
    soup = BeautifulSoup(driver.page_source, 'html.parser')
    driver.quit()
    # get text from all the paragraph divs (their class starts with "Raw-")
    paragraphs = soup.find_all("div", class_=re.compile("^Raw-.*"))
    text = ""
    for p in paragraphs:
        text += p.get_text() + " "
    # save the text to a file named after the school's URL slug
    filename = link.split("best-colleges/")[1] + ".txt"
    with open(filename, "w", encoding="utf-8") as f:
        f.write(text)
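One note on that driver line: the original version of this loop passed options=options to webdriver.Chrome, but options was never defined in the post (it comes from the video), so I’ve dropped it above. If you want to pass options yourself, here’s a minimal sketch; the headless flag is my assumption, not necessarily what the video used:

# Hypothetical options setup (assumed, not from the original post)
options = webdriver.ChromeOptions()
options.add_argument("--headless=new")  # run Chrome without opening a window
driver = webdriver.Chrome(service=service, options=options)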

That’s it! That’s all you need to do to make a simple web scraper. I think Beautiful Soup is such a useful library and using it has made scraping the web SO much easier than when I was just using Selenium! Check out the first part of the project I’ve done with this scraped information – Ask NLP: What Does US News Have to Say About Top Colleges?

Learn More

To learn more, feel free to reach out to me @yujian_tang on Twitter, connect with me on LinkedIn, and join our Discord. Remember to follow the blog to stay updated with cool Python projects and ways to level up your Software and Python skills! If you liked this article, please Tweet it, share it on LinkedIn, or tell your friends!

I run this site to help you and others like you find cool projects and practice software skills. If this is helpful for you and you enjoy this ad-free site, please help fund it with a donation! If you can’t donate right now, please think of us next time.

Yujian Tang

I started my professional software career interning for IBM in high school after winning ACSL two years in a row. I got into AI/ML in college where I published a first author paper to IEEE Big Data. After college I worked on the AutoML infrastructure at Amazon before leaving to work in startups. I believe I create the highest quality software content so that’s what I’m doing now. Drop a comment to let me know!

