Ever want to scrape a website for work, for fun, or simply to exercise your Python skills? I did all three. Let’s take a look at a super easy way to use Selenium for web scraping in under 50 lines of Python. Disclaimer: I’m not sure this follows the site’s terms of service, and if you build a web scraper for any site, you may also be breaking its ToS!
In this post, we will cover:
- Python Web Scraping with Selenium – Getting Links
- Storing and Parsing Selenium Web Scraping Results
- Possible Errors: Chrome Driver Needs to be in Path
- Summary of Python Web Scraping with Selenium
In this example we’ll be scraping the pages of the top 10 colleges in America in 2021, as ranked by US News, for text. For this project you’ll need to get Chromedriver and install Selenium and Beautiful Soup 4. You can use pip in the terminal to do so.
pip install selenium beautifulsoup4
As always, we’ll start off by importing the libraries we need. We’ll be using re, the regex module, to extract our links from Beautiful Soup. The webdriver submodule from selenium and the Service class from selenium.webdriver.chrome.service are needed to run the webdriver. We’ll need BeautifulSoup to parse our HTML, and finally we’ll need sleep and randint to pause for a random interval so we look less like a bot.
import re
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from bs4 import BeautifulSoup
from time import sleep
from random import randint
Python Web Scraping with Selenium – Getting Links
Next we’ll use the chromedriver executable we downloaded earlier to create a Chrome Service. Then we’ll use the webdriver to start it up and go to the URL. We’ll make the program sleep for a small random number of seconds to give the webpage time to load and to look less like a bot, although you’ll see in the video that we run into some problems with this.
chromedriver_path = "C:\\Users\\ytang\\Documents\\workspace\\teaching\\intermediate\\youtube\\front_page_titles\\chromedriver.exe"
service = Service(chromedriver_path)
driver = webdriver.Chrome(service=service)
url = "http://www.usnews.com/best-colleges/rankings/national-universities"
driver.get(url)
sleep(randint(3, 5))
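Since we pause after every page load, the random delay can be pulled into a small helper. This is only a sketch: polite_pause and its sleep_fn parameter are hypothetical names that are not part of the original scraper, and sleep_fn exists only so the helper can be tried out without actually waiting.

```python
from random import randint
from time import sleep


def polite_pause(low=3, high=5, sleep_fn=sleep):
    """Pause for a random whole number of seconds between low and high.

    randint includes both endpoints, so polite_pause(3, 5) waits 3, 4, or 5
    seconds. The number of seconds waited is returned for logging.
    """
    seconds = randint(low, high)
    sleep_fn(seconds)  # sleep_fn is a hypothetical hook; it defaults to time.sleep
    return seconds


# In the scraper this would replace the bare sleep call, for example:
#     driver.get(url)
#     polite_pause(3, 5)
```

A fixed sleep only guesses at how long the page needs, which is part of why a pause like this can still run into trouble on slow pages.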
Now let’s take a look at all the links on this page. We’ll use Beautiful Soup to parse the webpage and then we can quit the driver. I quit the driver here for two reasons: to avoid unnecessary processing, and a second reason that you’ll have to watch the video to understand. We call the find_all function of Beautiful Soup to look for all the link elements, then add each href value (the actual link) to a set and print out the set. I call the set we make top10 because in a moment I’m going to change the way we look for the links on the page to get the links for the top 10 schools.
soup = BeautifulSoup(driver.page_source, 'html.parser')
driver.quit()
links = soup.find_all("a")
top10 = set()
for link in links:
    top10.add(link.get("href"))
for link in top10:
    print(link)
We should get something that looks like this:
Storing and Parsing Selenium Web Scraping Results
That’s a lot of links we don’t care about. Let’s use regex to trim this down. In fact, the only link we care about in that image above is /best-colleges/princeton-university-2627. Since the Princeton link looks like this, we can extrapolate that the other links will also start with /best-colleges/. You’ll notice I also included some regex to remove the links with the word “rankings” from the list. We don’t need those, but they exist on the page. Other links that start with /best-colleges/ also exist on the page, but instead of writing a bunch of really complicated regex to sort those out, I simply excluded them using an if statement with several or conditions. Since these links will all be coming from the same base URL, we’ll also need to declare that.
base_url = "http://www.usnews.com"
soup = BeautifulSoup(driver.page_source, 'html.parser')
driver.quit()
# keep links that start with /best-colleges/ but not /best-colleges/rankings
links = soup.find_all("a", href=re.compile("^/best-colleges/(?!rankings).*"))
top10 = set()
for link in links:
    # skip the other /best-colleges/ links that are not school pages
    if ("college-search" in link["href"] or "compare" in link["href"]
            or "admissions" in link["href"] or "myfit" in link["href"]
            or "photos" in link["href"] or "reviews" in link["href"]):
        continue
    top10.add(base_url + link["href"])
Our web scraper should give us our top 10 links, as shown in the image below.
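To see what the regex and the keyword check keep and drop without opening a browser, here is a small sketch that runs the same filtering on a handful of href strings. The Princeton link and the rankings link come from this post; the other sample hrefs are made up for illustration, and school_urls, SCHOOL_LINK, and EXCLUDED are hypothetical names. It also uses urljoin from the standard library instead of string concatenation to build the full URL, which is my substitution rather than what the scraper above does.

```python
import re
from urllib.parse import urljoin

BASE_URL = "http://www.usnews.com"
# Same pattern as the scraper: starts with /best-colleges/ and the next
# word is not "rankings" (that is what the negative lookahead checks).
SCHOOL_LINK = re.compile(r"^/best-colleges/(?!rankings).*")
EXCLUDED = ("college-search", "compare", "admissions", "myfit", "photos", "reviews")


def school_urls(hrefs):
    """Return the set of absolute URLs for hrefs that look like school pages."""
    urls = set()
    for href in hrefs:
        # anchors without an href give None, so skip those first
        if href is None or not SCHOOL_LINK.match(href):
            continue
        if any(word in href for word in EXCLUDED):
            continue
        urls.add(urljoin(BASE_URL, href))
    return urls


sample_hrefs = [
    "/best-colleges/princeton-university-2627",       # from the post
    "/best-colleges/rankings/national-universities",  # dropped by the lookahead
    "/best-colleges/compare",                         # made up, dropped by keyword
    "/education/online-education",                    # made up, wrong prefix
    None,                                             # an <a> tag with no href
]
print(school_urls(sample_hrefs))
# {'http://www.usnews.com/best-colleges/princeton-university-2627'}
```

Keeping the filter as a plain function over strings makes it easy to check the pattern before pointing it at a live page.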
Now let’s loop through them and parse them. In three simple steps we’ll navigate to the page just like we did earlier, make soup just like we did earlier, and then get text from all the paragraphs, which is new. To get the CSS selector, simply right-click the element you want on a web page, click “Inspect Element”, and read the CSS on the side. To see how I got this CSS selector, watch the video. After getting all the paragraph elements, we loop through them and append their text to a string. Finally we save our string to a file and repeat for the next link. In this example, we’ll split the URL string on the best-colleges/ string, take the second element (the URL name for the school), and use that to create a filename.
for link in top10:
    # navigate to page
    driver = webdriver.Chrome(service=service)
    driver.get(link)
    sleep(randint(2, 4))
    # make soup
    soup = BeautifulSoup(driver.page_source, 'html.parser')
    # get text from all paragraphs
    paragraphs = soup.find_all("div", class_=re.compile("^Raw-.*"))
    text = ""
    for p in paragraphs:
        text += p.get_text() + " "
    driver.quit()
    # split() returns a list, so take the second element (the school's URL name)
    filename = link.split("best-colleges/")[1] + ".txt"
    # write as UTF-8 so non-ASCII characters in the page text don't fail on Windows
    with open(filename, "w", encoding="utf-8") as f:
        f.write(text)
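The filename step is easy to get wrong because split returns a list, not a string, so the school name has to be picked out by index before ".txt" is added. Here is a minimal sketch of just that step; filename_for is a hypothetical helper name, not something from the scraper.

```python
def filename_for(link):
    """Build a .txt filename from the part of the URL after 'best-colleges/'."""
    # split() gives ['http://www.usnews.com/', 'princeton-university-2627'];
    # index 1 is the second element, the school's URL name
    school = link.split("best-colleges/")[1]
    return school + ".txt"


print(filename_for("http://www.usnews.com/best-colleges/princeton-university-2627"))
# princeton-university-2627.txt
```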
That’s it! That’s all you need to do to make a simple webscraper. I think BeautifulSoup is such a useful library and using it has made scraping the web SO much easier than when I was just using Selenium! Check out the first part of the project I’ve done with this scraped information – Ask NLP: What Does US News Have to Say About Top Colleges?.
Chromedriver Executable Needs to be in Path Error
When working with Chromedriver on Selenium, you may come across an error like this:
selenium.common.exceptions.WebDriverException: Message: 'chromedriver' executable needs to be in PATH. Please see https://chromedriver.chromium.org/home.
There are a few solutions to this. You can install a package to handle your Chromedriver, you can pass the direct path of your Chromedriver installation, or you can add Chromedriver to your PATH environment variable.
Use Existing Python Packages to Manage Chromedriver Path
There are two packages that help you manage your Chromedriver installation: chromedriver_autoinstaller and webdriver_manager. Both allow you to download Chromedriver while the program is running. Here’s how you use the chromedriver autoinstaller:
from selenium import webdriver
import chromedriver_autoinstaller

chromedriver_autoinstaller.install()
driver = webdriver.Chrome()
Here’s how you use webdriver manager:
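from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager

# install() downloads Chromedriver if needed and returns its path,
# which we hand to Service just like the hard-coded path earlier
driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))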
Set Your Chromedriver Executable Path in the Code
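This is the approach used in the scraper above: pass the absolute path of your Chromedriver executable to Service, then pass that Service to webdriver.Chrome.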
Adding Your Chromedriver Executable to PATH
Just like the solution right above, where we declare the absolute path to the Chromedriver executable, this solution also requires that you know where Chromedriver is installed. Here’s how to do it in Windows. On Mac or another *nix OS we can run export PATH=<path to the directory containing the Chromedriver executable>:$PATH in the terminal. Note that there are no spaces around the = sign.
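Before starting the webdriver, you can check from Python whether Chromedriver is actually visible on your PATH. This is a small sketch using shutil.which from the standard library; find_on_path is a hypothetical helper name.

```python
import shutil


def find_on_path(executable="chromedriver"):
    """Return the full path to the executable if it is on PATH, else None."""
    return shutil.which(executable)


location = find_on_path()
if location is None:
    print("chromedriver was not found on PATH")
else:
    print(f"chromedriver found at {location}")
```

If this reports that Chromedriver was not found, the PATH change has not taken effect in the environment your Python program runs in.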
Summary of Python Web Scraping with Selenium and Beautiful Soup 4
In this tutorial we saw how we can easily scrape the web with Python, Selenium, and Beautiful Soup 4. We scraped the links for the top 10 colleges from the US News rankings page. Then we visited each link and saved the text of each page to a local file. We also covered some common errors around the Chromedriver executable.