As content on the web grows, content moderation becomes more and more important for protecting sensitive groups such as children and people who have suffered trauma. We're going to learn how to create your own AI content moderator using Python, Selenium, Beautiful Soup 4, and The Text API.
Our AI content moderator will be built in three parts: a webscraper to scrape all the text from a page, a module for the content moderation with AI using The Text API, and an orchestrator to put it all together.
Video Guide Here:
In this post, we're going to take a look at how to build a webscraper that scrapes all the text from any webpage. To follow along, download Chromedriver and install the Selenium and Beautiful Soup 4 Python libraries. You can install the Python libraries with the command below in the terminal.
pip install selenium beautifulsoup4
To create this webscraper we’ll need to:
- Handle imports from Selenium, BeautifulSoup4, and Time
- Set Up Chromedriver
- Create a function for scraping the text from a webpage
- Load a webpage with Selenium
- Scrape the text from the page with BeautifulSoup4
- Test your webdriver
Selenium and BeautifulSoup 4 Modules to Import
We'll need to import one module and two classes from Selenium, one class from Beautiful Soup 4, and one function from time. We need the webdriver module from Selenium to create our webdriver. The Service and Options classes, from the selenium.webdriver.chrome.service and selenium.webdriver.chrome.options modules respectively, create the Chrome service and let us add options to it. The BeautifulSoup class from bs4 creates the "soup" of the page. Finally, we use sleep from time to wait for the webpage we navigate to with Selenium to load.
```python
# imports
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.chrome.options import Options
from bs4 import BeautifulSoup
from time import sleep
```
Setting up Chromedriver
Now that we've handled our imports, we have to set up Chromedriver before we can use it. First, we need to know where we saved the Chromedriver executable we downloaded. Then we use that path to create a Chrome Service object. Finally, we'll create an Options object. I'm going to run this in "headless" mode, meaning that the actual Chrome window won't pop up; it will still run in the background.
```python
chromedriver_path = "C:\\Users\\ytang\\Documents\\workspace\\content-moderation\\chromedriver.exe"
service = Service(chromedriver_path)
options = Options()
options.add_argument("--headless")
```
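The hard-coded Windows path above won't exist on other machines. As a sketch of a more portable alternative, you could look for chromedriver on your PATH first and fall back to the current working directory (the find_chromedriver helper is hypothetical, not part of the original scraper):

```python
import shutil
from pathlib import Path

def find_chromedriver() -> str:
    # prefer a chromedriver already on the system PATH
    on_path = shutil.which("chromedriver")
    if on_path:
        return on_path
    # otherwise fall back to a copy in the current working directory
    return str(Path.cwd() / "chromedriver.exe")
```

You would then pass `find_chromedriver()` to `Service(...)` instead of the hard-coded string.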
Create Function for Scraping the Text from a WebPage
We have everything set up to scrape the text from a webpage now. Let's create a function to do that. Our function will open a webpage and scrape ALL of the text from it. Why do we scrape all the text, including links, and not just the text from the main "article" of the page? Because every single bit of text on a page should be subject to moderation.
Load The Page with Selenium
scrape_page_text will take one parameter, url, which we expect to be a string. We'll start the function by creating a webdriver from the service and options we created earlier. Then we'll call the driver to "get" the url and sleep for a few seconds to allow the page to fully load.
```python
# function
def scrape_page_text(url: str):
    # create driver
    driver = webdriver.Chrome(service=service, options=options)
    # launch driver
    driver.get(url)
    sleep(3)
```
Scrape the Text From the Page with BeautifulSoup 4
Once the driver is done loading, we'll fire up BeautifulSoup to create our "soup". We'll pass the BeautifulSoup object the page source from the driver and use an HTML parser. After we've created our soup, we'll quit the driver. Now that we have the soup, we'll get all of the text from the page and do a little cleaning by replacing the newline characters with spaces. Finally, we'll return the text.
```python
    # get soup from driver page
    soup = BeautifulSoup(driver.page_source, 'html.parser')
    driver.quit()
    # scrape all the text from page
    text = soup.get_text()
    text = text.replace("\n", " ")
    return text
```
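Replacing newlines with spaces can still leave runs of whitespace in the scraped text. If you want tidier output, a minimal sketch of a stricter cleanup step (the clean_text helper is hypothetical, not part of the scraper above):

```python
def clean_text(raw: str) -> str:
    # split() with no arguments breaks on any run of whitespace,
    # so joining with single spaces collapses newlines, tabs, and
    # repeated spaces all at once
    return " ".join(raw.split())
```

This is interchangeable with the replace() call, just more aggressive about collapsing whitespace.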
Full Text Scraping Function Code
Here’s the full code for the function to scrape the text on a page.
```python
# function
def scrape_page_text(url: str):
    # create driver
    driver = webdriver.Chrome(service=service, options=options)
    # launch driver
    driver.get(url)
    sleep(3)
    # get soup from driver page
    soup = BeautifulSoup(driver.page_source, 'html.parser')
    driver.quit()
    # scrape all the text from page
    text = soup.get_text()
    text = text.replace("\n", " ")
    return text
```
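If you want to see what get_text() returns without launching a browser, you can feed BeautifulSoup a static snippet directly; the HTML below is made up for illustration and stands in for driver.page_source:

```python
from bs4 import BeautifulSoup

# made-up HTML standing in for a real page source
html = "<html><body><h1>Title</h1><p>Some text with a <a href='#'>link</a>.</p></body></html>"
soup = BeautifulSoup(html, "html.parser")
text = soup.get_text()
print(text)
```

Note that the link text is included in the output, just like on a real page, which is exactly what we want for moderation.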
Test Your WebScraper
To test our webscraper, we'll simply pass it a URL and print out the result of the page scrape.
```python
url = "https://pythonalgos.com/2021/11/20/web-scraping-the-easy-way-python-selenium-beautiful-soup/"
print(scrape_page_text(url))
```
We should see an output like the one below.
I run this site to help you and others like you find cool projects and practice software skills. If this is helpful for you and you enjoy your ad free site, please help fund this site by donating below! If you can’t donate right now, please think of us next time.
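Before the scraped text goes to the moderation module in the next part, long pages will usually need to be split into smaller pieces. As a hedged sketch of that step (the chunk_text helper and the 300-word chunk size are assumptions for illustration, not The Text API's actual limit):

```python
def chunk_text(text: str, max_words: int = 300) -> list:
    # break the scraped text into chunks of at most max_words words,
    # so each chunk can be sent to a moderation endpoint separately
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]
```

Each chunk can then be passed to the content moderation module independently.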