Ever want to scrape a website for work, for fun, or simply to exercise your Python skills? I did all three. Let’s take a look at a super easy way to scrape the web in under 50 lines of Python. Disclaimer: I’m not sure this follows the site’s terms of service, and if you build a web scraper for another site, you may be breaking its ToS too!
Here’s the video version:
In this example we’ll scrape the text from the pages of the top 10 colleges in America in 2021, as ranked by US News. For this project you’ll need to get Chromedriver and install Selenium and Beautiful Soup 4. You can use pip in the terminal to do so.
pip install selenium beautifulsoup4
As always, we’ll start by importing the libraries we need. We’ll use re, the regex module, to filter the links we pull out with Beautiful Soup. The webdriver module from selenium and the Service class from selenium.webdriver.chrome.service are needed to run the browser. We’ll need BeautifulSoup to parse our HTML, and finally we’ll need sleep and randint to pause for a random interval so we look less like a bot.
import re
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from bs4 import BeautifulSoup
from time import sleep
from random import randint
Spin Up the Web Scraper
Next we’ll use the chromedriver executable we downloaded earlier to create a Chrome Service. Then we’ll use the webdriver to start it up and go to the URL. We’ll make the program sleep for a small random number of seconds to give the webpage time to load and so we don’t look too much like a bot, although you’ll see in the video that we run into some problems with this.
chromedriver_path = "C:\\Users\\ytang\\Documents\\workspace\\teaching\\intermediate\\youtube\\front_page_titles\\chromedriver.exe"
service = Service(chromedriver_path)
driver = webdriver.Chrome(service=service)
url = "http://www.usnews.com/best-colleges/rankings/national-universities"
driver.get(url)
sleep(randint(3, 5))
Now let’s take a look at all the links on this page. We’ll use Beautiful Soup to parse the webpage and then we can quit the driver. I quit the driver here for two reasons: to avoid unnecessary processing, and a second one you’ll have to watch the video to understand. We call the find_all function of Beautiful Soup to look for all the link elements, add each one’s href value (the actual link) to a set, and print out the set. I call the set top10 because in a moment I’m going to change the way we look for links so that we only get the links for the top 10 schools.
soup = BeautifulSoup(driver.page_source, 'html.parser')
driver.quit()
links = soup.find_all("a")
top10 = set()
for link in links:
    top10.add(link.get("href"))
for link in top10:
    print(link)
We should get something that looks like this:
Parse Scraped Link Data
That’s a lot of links we don’t care about. Let’s use regex to trim this down. In fact, the only link we care about in the image above is /best-colleges/princeton-university-2627. Since the Princeton link looks like this, we can infer that the other school links will also start with /best-colleges/. You’ll notice I also included a negative lookahead, (?!rankings), to drop the links that have the word “rankings” right after that prefix. We don’t need those, but they exist on the page. Other links that start with /best-colleges/ also exist on the page, but instead of writing a bunch of really complicated regex to sort those out, I simply excluded them using an if statement with a chain of or conditions. Since these links are all relative to the same base URL, we’ll also need to declare that.
base_url = "http://www.usnews.com"
soup = BeautifulSoup(driver.page_source, 'html.parser')
driver.quit()
# keep links that start with /best-colleges/ but not /best-colleges/rankings
links = soup.find_all("a", href=re.compile("^/best-colleges/(?!rankings).*"))
top10 = set()
for link in links:
    # skip the other /best-colleges/ pages that are not school pages
    if ("college-search" in link["href"] or "compare" in link["href"]
            or "admissions" in link["href"] or "myfit" in link["href"]
            or "photos" in link["href"] or "reviews" in link["href"]):
        continue
    top10.add(base_url+link["href"])
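To see what the pattern and the keyword checks do without opening a browser, here is a minimal sketch that applies the same filter to a handful of plain href strings. The sample paths other than the Princeton one are made up for illustration, and filter_school_links is a hypothetical helper name, not part of the scraper above. It also uses urljoin from the standard library to attach the base URL, which is a swap for the string concatenation used above.

```python
import re
from urllib.parse import urljoin

BASE_URL = "http://www.usnews.com"
# Same pattern as above: starts with /best-colleges/ and is not followed by "rankings".
SCHOOL_LINK = re.compile(r"^/best-colleges/(?!rankings).*")
# Keywords from the if statement above that mark non-school pages.
EXCLUDED = ("college-search", "compare", "admissions", "myfit", "photos", "reviews")


def filter_school_links(hrefs):
    """Return the set of absolute URLs for hrefs that look like school pages."""
    keep = set()
    for href in hrefs:
        if not SCHOOL_LINK.match(href):
            continue
        if any(word in href for word in EXCLUDED):
            continue
        keep.add(urljoin(BASE_URL, href))
    return keep


# Hypothetical hrefs; only the Princeton path comes from the article.
sample_hrefs = [
    "/best-colleges/princeton-university-2627",
    "/best-colleges/rankings/national-universities",
    "/best-colleges/compare",
    "/news/education",
    "/best-colleges/princeton-university-2627",  # duplicate, dropped by the set
]
school_links = filter_school_links(sample_hrefs)
print(school_links)
```

Keeping the excluded keywords in one tuple means a new unwanted page type only needs one more entry rather than another or condition.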
Our web scraper should give us our top 10 links, as shown in the image below.
Now let’s loop through them and parse them. In three simple steps we’ll navigate to the page just like we did earlier, make soup just like we did earlier, and then get the text from all the paragraphs, which is new. To get the CSS selector, simply right-click the element you want on a web page, click “Inspect Element”, and read the CSS on the side. To see how I got this selector, watch the video. After getting all the paragraph elements, we loop through them and append their text to a string. Finally we save the string to a file and repeat for the next link. In this example, we’ll split the URL string on the best-colleges/ string, take the second element (the URL name for the school), and use that to create a filename ending in .txt.
for link in top10:
    # navigate to page
    driver = webdriver.Chrome(service=service)
    driver.get(link)
    sleep(randint(2, 4))
    # make soup
    soup = BeautifulSoup(driver.page_source, 'html.parser')
    # get text from all paragraphs
    paragraphs = soup.find_all("div", class_=re.compile("^Raw-.*"))
    text = ""
    for p in paragraphs:
        text += p.get_text() + " "
    driver.quit()
    # name the file after the part of the URL that follows best-colleges/
    filename = link.split("best-colleges/")[1] + ".txt"
    with open(filename, "w", encoding="utf-8") as f:
        f.write(text)
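The filename step is easy to get wrong because split returns a list rather than a string, so here is a small sketch of just that part. school_filename is a hypothetical helper name, and it assumes every link contains best-colleges/ the way the URLs built earlier do.

```python
def school_filename(link):
    """Build a .txt filename from the part of the URL after 'best-colleges/'."""
    # split returns a list; index 1 is everything after the separator.
    slug = link.split("best-colleges/")[1]
    return slug + ".txt"


example = "http://www.usnews.com/best-colleges/princeton-university-2627"
print(school_filename(example))  # princeton-university-2627.txt
```

A link without best-colleges/ in it would raise an IndexError here, which is fine for this scraper since every link in the set was matched against that prefix.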
That’s it! That’s all you need to do to make a simple web scraper. I think BeautifulSoup is such a useful library, and using it has made scraping the web SO much easier than when I was just using Selenium! Check out the first part of the project I’ve done with this scraped information: Ask NLP: What Does US News Have to Say About Top Colleges?.
To learn more, feel free to reach out to me @yujian_tang on Twitter, connect with me on LinkedIn, and join our Discord. Remember to follow the blog to stay updated with cool Python projects and ways to level up your Software and Python skills! If you liked this article, please Tweet it, share it on LinkedIn, or tell your friends!