Python Web Scraping with Selenium and Beautiful Soup 4

Ever wanted to scrape a website for work, for fun, or simply to exercise your Python skills? I've done all three. Let's take a look at a super easy way to use Selenium for web scraping in under 50 lines of Python. Disclaimer: I'm not sure this follows their terms of service, and if you build a web scraper for whatever site, you may also be breaking that site's ToS!

In this post, we will cover how to grab the top 10 college links with Selenium, how to scrape each page's text with Beautiful Soup 4, and how to fix the common "chromedriver executable needs to be in PATH" error.

Here’s the video version:

[Video: live coding a Selenium web scraper in Python with a software engineer]

In this example, we'll scrape the pages of the top 10 colleges in America in 2021, as ranked by US News, for their text. For this project you'll need to download Chromedriver and install Selenium and Beautiful Soup 4. You can install the two libraries with pip in the terminal.

pip install selenium beautifulsoup4

As always, we'll start off by importing the libraries we need. We'll be using re, the regex module, to extract our links from Beautiful Soup. The webdriver submodule from selenium, as well as the Service submodule from selenium's Chrome webdriver, are needed to run the webdriver. We'll need BeautifulSoup to parse our HTML, and finally we'll need sleep and randint to make ourselves look less like a bot.

import re
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from bs4 import BeautifulSoup
from time import sleep
from random import randint

Python Web Scraping with Selenium – Getting Links

Next we'll use the chromedriver executable we downloaded earlier to create a Chrome Service. Then we'll use the Chrome webdriver to start it up and navigate to the URL. We'll make the program sleep for a small random number of seconds to ensure the webpage loads and we don't look too much like a bot. Although, as you'll see in the video, we run into some problems with this.

chromedriver_path = "C:\\Users\\ytang\\Documents\\workspace\\teaching\\intermediate\\youtube\\front_page_titles\\chromedriver.exe"
service = Service(chromedriver_path)
driver = webdriver.Chrome(service=service)
 
url = "http://www.usnews.com/best-colleges/rankings/national-universities"
driver.get(url)
sleep(randint(3, 5))

Now let's take a look at all the links on this page. We'll use Beautiful Soup to parse the webpage, and then we can quit the driver. I quit the driver here for two reasons: to avoid unnecessary processing, and, well, you'll have to watch the video to understand the second one. Let's start off by checking out all the links on the page. We call Beautiful Soup's find_all function to look for all the link elements, then add each one's href value (the actual link) to a set and print out the set. I call the set top10 because in a moment I'm going to change the way we look for the links on the page so that we only get the links for the top 10 schools.

soup = BeautifulSoup(driver.page_source, 'html.parser')
driver.quit()
 
links = soup.find_all("a")
top10 = set()
for link in links:
    top10.add(link.get("href"))
for link in top10:
    print(link)

We should get something that looks like this:

[Image: the raw set of links scraped from the US News page]

That's a lot of links we don't care about. Let's use regex to trim this down. In fact, the only link we care about in that image above is /best-colleges/princeton-university-2627. Since the Princeton link looks like this, we can extrapolate that the other links will also start with /best-colleges/. You'll notice I also included some regex to remove the links with the word "rankings" from the list. We don't need those, but they exist on the page. Other links that start with /best-colleges/ also exist on the page, but instead of writing a bunch of really complicated regex to sort those out, I simply excluded them with an if statement. Since these links will all be coming from the same base URL, we'll also need to declare that.

base_url = "http:/www.usnews.com"
soup = BeautifulSoup(driver.page_source, 'html.parser')
driver.quit()
links = soup.find_all("a", href=re.compile("^/best-colleges/(?!rankings).*"))
top10 = set()
for link in links:
    if "college-search" in link["href"] or "compare" in link["href"] or "admissions" in link["href"] or "myfit" in link["href"] or "photos" in link["href"] or "reviews" in link["href"]:
        continue
    top10.add(base_url+link["href"])
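
To double-check what we collected, we can print the set; sorting it here is purely so the output is easier to read:

for link in sorted(top10):
    print(link)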

Our web scraper should give us our top 10 links like shown in the image below.

[Image: the 10 parsed school links from Beautiful Soup 4]

Now let's loop through the links and parse each page. We do it in three simple steps: navigate to the page just like we did earlier, make soup just like we did earlier, and then, the new part, get the text from all the paragraphs. To find a CSS selector, simply right-click the element you want on a web page, click "Inspect Element", and read the CSS on the side. To see how I got this particular selector, watch the video. After getting all the paragraph elements, we loop through them and append their text to a string. Finally, we save our string to a file and repeat for the next link. To name each file, we split the URL string on the best-colleges/ substring and take the second element (the URL name for the school) to create a .txt file.

for link in top10:
    # navigate to page
    driver = webdriver.Chrome(service=service)
    driver.get(link)
    sleep(randint(2, 4))
    # make soup
    soup = BeautifulSoup(driver.page_source, 'html.parser')
    driver.quit()
    # get text from all paragraphs
    paragraphs = soup.find_all("div", class_=re.compile("^Raw-.*"))
    text = ""
    for p in paragraphs:
        text += p.get_text() + " "
    # save the text to a file named after the school's URL name
    filename = link.split("best-colleges/")[1] + ".txt"
    with open(filename, "w", encoding="utf-8") as f:
        f.write(text)

That's it! That's all you need to do to make a simple web scraper. I think Beautiful Soup is such a useful library, and using it has made scraping the web SO much easier than when I was just using Selenium! Check out the first part of the project I've done with this scraped information – Ask NLP: What Does US News Have to Say About Top Colleges?

Chromedriver Executable Needs to be in Path Error

When working with Chromedriver on Selenium, you may come across an error like this: selenium.common.exceptions.WebDriverException: Message: 'chromedriver' executable needs to be in PATH. Please see https://chromedriver.chromium.org/home.
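
This usually happens when Selenium has to find Chromedriver on its own and can't. For example, on older Selenium versions without a built-in driver manager, a bare instantiation like the one below raises that exact exception if chromedriver isn't on your PATH:

from selenium import webdriver

# no service or executable path given, so Selenium searches the PATH for chromedriver
driver = webdriver.Chrome()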

There are a few solutions to this. You can install a package to handle your Chromedriver, you can pass the direct path of your Chromedriver installation, or you can add Chromedriver to your PATH environment variable.

Use Existing Python Packages to Manage Chromedriver Path

There are two packages that help you manage your Chromedriver installation: chromedriver-autoinstaller and webdriver_manager. Both allow you to download Chromedriver while the program is running. Here's how you use chromedriver-autoinstaller:

from selenium import webdriver
import chromedriver_autoinstaller

chromedriver_autoinstaller.install()
driver = webdriver.Chrome()

Here’s how you use webdriver manager:

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager

driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))

Set Your Chromedriver Executable Path in the Code

This is the approach we used in the scraper above: pass the absolute path of your Chromedriver executable to a Service object, then hand that Service to the webdriver.
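
Here's that snippet again, with a placeholder path standing in for wherever your chromedriver.exe actually lives:

from selenium import webdriver
from selenium.webdriver.chrome.service import Service

chromedriver_path = "C:\\path\\to\\chromedriver.exe"  # placeholder: use your own absolute path
service = Service(chromedriver_path)
driver = webdriver.Chrome(service=service)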

Adding Your Chromedriver Executable to PATH

Just like the solution right above, where we declare the absolute path to the Chromedriver executable, this solution requires knowing where Chromedriver is installed. On Windows, you can add the folder containing chromedriver.exe to your PATH through System Properties → Environment Variables. On Mac or another *nix OS, you can run export PATH=<directory containing Chromedriver>:$PATH in the terminal (no spaces around the =, and note that it's the directory holding the executable that goes on the PATH).
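
If you prefer the command line on Windows, a session-scoped equivalent looks like this, where C:\tools\chromedriver is just a stand-in for whatever folder actually holds your chromedriver.exe:

set PATH=%PATH%;C:\tools\chromedriver

This only lasts for the current Command Prompt session; use the Environment Variables dialog if you want it to stick.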

Summary of Python Web Scraping with Selenium and Beautiful Soup 4

In this tutorial we saw how we can easily scrape the web with Python and Selenium. We scraped the links to the top 10 colleges from the US News rankings page, then scraped the text of each school's page and saved it to a local file. We also covered some common errors around the Chromedriver executable.

Yujian Tang
