As content on the web increases, content moderation becomes more and more important to protect sensitive groups such as children and people who have suffered from trauma. In this series, we’re going to learn how to create our own AI content moderator using Python, Selenium, Beautiful Soup 4, and The Text API.
Our AI content moderator will be built in three parts: a webscraper to scrape all the text from a page, a module that moderates the content with AI using The Text API, and an orchestrator to put it all together.
In this post, we’re going to take a look at how to build a webscraper that scrapes all the text from any webpage. To follow along, download Chromedriver and install the Selenium and Beautiful Soup 4 Python libraries. You can install the Python libraries with the command below in the terminal.
pip install selenium beautifulsoup4
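If you want to double-check that both libraries installed correctly before moving on, one optional sanity check is to import them and print their versions. This snippet is just for verification and isn’t part of the scraper itself.
# optional sanity check that the installs worked
import selenium
import bs4

print("selenium version:", selenium.__version__)
print("beautifulsoup4 version:", bs4.__version__)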
To create this webscraper we’ll need to:
- Handle imports from Selenium, BeautifulSoup4, and time
- Set Up Chromedriver
- Create a function for scraping the text from a webpage
- Load a webpage with Selenium
- Scrape the text from the page with BeautifulSoup4
- Test your webdriver
Selenium and BeautifulSoup 4 Modules to Import
We’ll need to import one module and three classes from Selenium, one class from Beautiful Soup 4, and one function from time. We need the webdriver module from Selenium to create our webdriver. The Service and Options classes from selenium.webdriver.chrome (and their respective submodules) create the Chrome service and let us add options to it. The BeautifulSoup class from bs4 will create the “soup” of the page. Finally, we use sleep to wait for the webpage that we navigate to with Selenium to load.
# imports
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.chrome.options import Options
from bs4 import BeautifulSoup
from time import sleep
Setting up Chromedriver
Now that we’ve handled our imports, we have to set up Chromedriver before we can use it. First, we need the path to the Chromedriver executable we downloaded. Then we use that path to create a Chrome Service object. Finally, we’ll create an Options object. I’m going to run this in “headless” mode, meaning the actual Chrome window won’t pop up, but Chrome will still run in the background.
chromedriver_path = "C:\\Users\\ytang\\Documents\\workspace\\content-moderation\\chromedriver.exe"
service = Service(chromedriver_path)
options = Options()
options.add_argument("--headless")
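As an aside, if you’re running Selenium 4.6 or newer, Selenium Manager can usually locate a matching Chromedriver for you, so the explicit path may not be necessary. A minimal sketch of that alternative setup, assuming a recent Selenium install:
# alternative on Selenium 4.6+: let Selenium Manager find the driver,
# so no explicit chromedriver path or Service object is needed
options = Options()
options.add_argument("--headless")
driver = webdriver.Chrome(options=options)
We’ll stick with the explicit path in this post so the setup also works on older Selenium versions.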
Create Function for Scraping the Text from a WebPage
We now have everything set up to scrape the text from a webpage, so let’s create a function to do that. Our function will open a webpage and scrape ALL of the text from it. Why do we scrape all the text, including links, and not just the text from the main “article” of the page? Because every single bit of text on a page should be subject to moderation.
Load The Page with Selenium
Our function, scrape_page_text, will take one parameter, url, which we expect to be a string. We’ll start the function by creating a webdriver from the service and options we created earlier. Then we’ll call the driver to “get” the url and sleep for a few seconds to allow the page to fully load.
# function
def scrape_page_text(url: str):
    # create driver
    driver = webdriver.Chrome(service=service, options=options)
    # launch driver
    driver.get(url)
    sleep(3)
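The fixed sleep(3) keeps things simple, but it always waits the full three seconds. If you’d rather continue as soon as the browser reports the page is loaded, one alternative (sketched here, but not used in the rest of this post) is Selenium’s WebDriverWait:
from selenium.webdriver.support.ui import WebDriverWait

# wait up to 10 seconds for the browser to report the document as fully loaded
WebDriverWait(driver, 10).until(
    lambda d: d.execute_script("return document.readyState") == "complete"
)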
Scrape the Text From the Page with BeautifulSoup 4
Once the driver is done loading, we’ll fire up BeautifulSoup to create our “soup”. We’ll pass the BeautifulSoup object the page source from the driver and use an HTML parser. After we’ve created our soup, we’ll quit the driver. Now that we have the soup, we’ll get all of the text from the page and do a little cleaning by replacing the newline characters with spaces. Finally, we’ll return the text.
    # get soup from driver page
    soup = BeautifulSoup(driver.page_source, 'html.parser')
    driver.quit()
    # scrape all the text from page
    text = soup.get_text()
    text = text.replace("\n", " ")
    return text
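One caveat: because get_text runs over the full page source, any inline JavaScript or CSS on the page will end up in the scraped text as well. If you want cleaner output, an optional tweak is to strip script and style tags from the soup before extracting the text, something like the sketch below. We’ll leave the function as written for this post.
# optional: remove script and style tags so their contents
# don't show up in the scraped text
for tag in soup(["script", "style"]):
    tag.decompose()
text = soup.get_text()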
Full Text Scraping Function Code
Here’s the full code for the function to scrape the text on a page.
# function
def scrape_page_text(url: str):
    # create driver
    driver = webdriver.Chrome(service=service, options=options)
    # launch driver
    driver.get(url)
    sleep(3)
    # get soup from driver page
    soup = BeautifulSoup(driver.page_source, 'html.parser')
    driver.quit()
    # scrape all the text from page
    text = soup.get_text()
    text = text.replace("\n", " ")
    return text
Test Your WebScraper
To test our webscraper, we’ll simply pass it a URL and print out the result of the page scrape.
url = "https://pythonalgos.com/2021/11/20/web-scraping-the-easy-way-python-selenium-beautiful-soup/"
print(scrape_page_text(url))
We should see an output like the one below.
I run this site to help you and others like you find cool projects and practice software skills. If this is helpful for you and you enjoy your ad free site, please help fund this site by donating below! If you can’t donate right now, please think of us next time.
