Create Your Own AI Content Moderator – Part 1

As the amount of content on the web grows, content moderation becomes increasingly important for protecting sensitive groups such as children and people who have experienced trauma. In this series, we’re going to learn how to create your own AI content moderator using Python, Selenium, Beautiful Soup 4, and The Text API.

Our AI content moderator will be built in three parts: a web scraper to scrape all the text from a page, a module that performs the content moderation with AI using The Text API, and an orchestrator to put it all together.
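As a rough sketch of how the three parts will fit together (the function names here are illustrative placeholders, not the actual modules we’ll build later in the series):

```python
# Illustrative sketch only -- these names are placeholders, not the real
# modules built later in this series.

def scrape_page_text(url: str) -> str:
    # Part 1: scrape all the text from the page (stubbed out here)
    return "example page text"

def moderate_text(text: str) -> dict:
    # Part 2: run the text through an AI moderation check (stubbed out here)
    return {"text": text, "flagged": False}

def moderate_page(url: str) -> dict:
    # Part 3: the orchestrator ties scraping and moderation together
    return moderate_text(scrape_page_text(url))
```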

In this post, we’re going to take a look at how to build a web scraper that scrapes all the text from any webpage. To follow along, download ChromeDriver and install the Selenium and Beautiful Soup 4 Python libraries. You can install the Python libraries with the command below in the terminal.

pip install selenium beautifulsoup4

To create this web scraper we’ll need to:

  • Handle imports from Selenium, Beautiful Soup 4, and time
  • Set up ChromeDriver
  • Create a function for scraping the text from a webpage
    • Load a webpage with Selenium
    • Scrape the text from the page with Beautiful Soup 4
  • Test your web scraper

Selenium and BeautifulSoup 4 Modules to Import

We’ll need to import one module and two classes from Selenium, one class from Beautiful Soup 4, and one function from time. We need the webdriver module from Selenium to create our webdriver. The Service class from selenium.webdriver.chrome.service and the Options class from selenium.webdriver.chrome.options will create the Chrome service and allow us to add options to it. The BeautifulSoup class from bs4 will create the “soup” of the page. Finally, we use sleep from time to wait for the webpage we navigate to with Selenium to load.

# imports
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.chrome.options import Options
from bs4 import BeautifulSoup
from time import sleep

Setting up ChromeDriver

Now that we’ve handled our imports, we need to set up ChromeDriver before we can use it. First, we need to know where we saved the ChromeDriver executable we downloaded. We use that path to create a Chrome Service object. Finally, we’ll create an Options object and run in “headless” mode, meaning the actual Chrome window won’t pop up on screen. The browser will still run in the background.

chromedriver_path = "C:\\Users\\ytang\\Documents\\workspace\\content-moderation\\chromedriver.exe"
service = Service(chromedriver_path)
options = Options()
options.add_argument("--headless")

Create a Function for Scraping the Text from a Webpage

We now have everything set up to scrape the text from a webpage, so let’s create a function to do that. Our function will open up a webpage and scrape ALL of the text from it. Why do we scrape all the text, including links, and not just the text from the main “article” of the page? Because every single bit of text on a page should be subject to moderation.
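To see what “all the text” means in practice, here’s a dependency-free illustration using the standard library’s html.parser (instead of BeautifulSoup): link, navigation, and footer text get collected right alongside the article body.

```python
from html.parser import HTMLParser

class TextCollector(HTMLParser):
    """Collects every piece of visible text on a page, including links."""
    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        # handle_data fires for text between any tags, not just the article
        if data.strip():
            self.chunks.append(data.strip())

page = '<nav><a href="/home">Home</a></nav><article>Main story</article><footer>Leave a comment</footer>'
collector = TextCollector()
collector.feed(page)
print(collector.chunks)  # ['Home', 'Main story', 'Leave a comment']
```

Every one of those strings, including the link text, is something a moderator should see.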

Load The Page with Selenium

Our function, scrape_page_text, will take one parameter, url, which we expect to be a string. We’ll start the function by creating a webdriver from the service and options we created earlier. Then we’ll call the driver to “get” the URL and sleep for a few seconds to allow the page to fully load.

# function
def scrape_page_text(url: str):
    # create driver
    driver = webdriver.Chrome(service=service, options=options)
 
    # launch driver
    driver.get(url)
    sleep(3)

Scrape the Text From the Page with BeautifulSoup 4

Once the driver is done loading, we’ll fire up BeautifulSoup to create our “soup”. We’ll pass the BeautifulSoup object the page source from the driver and use an HTML parser. After we’ve created our soup, we’ll quit the driver. Now that we have the soup, we’ll get all of the text from the page and do a little cleaning by replacing the newline characters with spaces. Finally, we’ll return the text.

    # get soup from driver page
    soup = BeautifulSoup(driver.page_source, 'html.parser')
    driver.quit()
   
    # scrape all the text from page
    text = soup.get_text()
    text = text.replace("\n", " ")
   
    return text
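A quick note on the cleaning step: replacing each newline with a space, rather than removing it outright, keeps words on adjacent lines from fusing together.

```python
raw = "First line\nSecond line"

print(raw.replace("\n", " "))  # First line Second line
print(raw.replace("\n", ""))   # First lineSecond line  (words fuse together)
```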

Full Text Scraping Function Code

Here’s the full code for the function to scrape the text on a page.

# function
def scrape_page_text(url: str):
    # create driver
    driver = webdriver.Chrome(service=service, options=options)
 
    # launch driver
    driver.get(url)
    sleep(3)
   
    # get soup from driver page
    soup = BeautifulSoup(driver.page_source, 'html.parser')
    driver.quit()
   
    # scrape all the text from page
    text = soup.get_text()
    text = text.replace("\n", " ")
   
    return text

Test Your Web Scraper

To test our web scraper, we’ll simply pass it a URL and print out the result of the page scrape.

url = "https://pythonalgos.com/2021/11/20/web-scraping-the-easy-way-python-selenium-beautiful-soup/"
 
print(scrape_page_text(url))

We should see an output like the one below.

Web Scraper Output – All the Text on a Page

Yujian Tang
