What AI Keyword Extraction Is and How to Do It

Keyword extraction is an example of applied Natural Language Processing (NLP). NLP is the subfield of AI concerned with analyzing, understanding, and generating language, and keyword extraction is one of its most basic techniques. The first step in keyword extraction is tokenization. Once a text is tokenized, searching it for a keyword is straightforward.
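As a quick illustration, tokenization just means splitting text into units (tokens). Here's a minimal sketch using only Python's built-in string methods; a real tokenizer like spaCy's also handles punctuation and contractions:

```python
# A naive whitespace tokenizer; real NLP tokenizers also split off
# punctuation and normalize text more carefully.
def tokenize(text: str) -> list[str]:
    return text.lower().split()

tokens = tokenize("Green energy will be the backbone of our energy systems")
# Once tokenized, checking for a keyword is a simple membership test.
print("energy" in tokens)
```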

Even though keyword extraction is a relatively simple process, it plays a big role in NLP. It can be applied in many contexts, from extracting headlines (as we’ll see in the examples), to AI content moderation, to finding relevant sentences in legal documents.

In this post we’ll go over:

  • What is Keyword Extraction?
  • How Can AI Keyword Extraction be Applied?
  • Implementing Keyword Extraction
    • Keyword Extraction for One Keyword with spaCy
    • Keyword Extraction for Multiple Keywords with The Text API
  • Applied Examples of AI Keyword Extraction
    • COVID Headlines
    • Obama Headlines
  • Summary of Keyword Extraction with AI

What is Keyword Extraction?

Let’s start by answering the obvious question before we dive into the details – what is keyword extraction? Keyword extraction is the process of finding each occurrence of one or many keywords in a text. Keyword extraction can be used to extract sentences, paragraphs, or sections containing a keyword. At a more basic level, it may also be used to simply find occurrences of a keyword in the text without extracting surrounding information.
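In its simplest form, sentence-level keyword extraction can be sketched without any NLP library at all. The naive split on periods below is a stand-in for real sentence segmentation, which libraries like spaCy do properly:

```python
def extract_sentences(text: str, keyword: str) -> list[str]:
    # Naive sentence split on "."; a real NLP library segments sentences
    # far more robustly (abbreviations, decimals, quotes, etc.).
    sentences = [s.strip() for s in text.split(".") if s.strip()]
    # Keep only the sentences that mention the keyword, case-insensitively.
    return [s for s in sentences if keyword.lower() in s.lower()]

text = "Green energy is growing. Coal is declining. Energy storage matters."
print(extract_sentences(text, "energy"))
```

Two of the three sentences mention "energy", so those two are returned.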

How Can AI Keyword Extraction be Applied?

As we mentioned above, keyword extraction can be applied in many contexts. In this post we’ll go over two examples of AI keyword extraction applied to headlines. Another important application is in the legal field. Legal papers such as court documents, laws, and bills often need to be searched, and these documents are frequently tens or hundreds of pages long. Searching them by hand is a lot of tedious work.

Although the legal field has traditionally run on paper, it has begun digitizing, and digital documents can be searched much more efficiently using keyword extraction. Beyond legal documents, AI keyword extraction can also be used for reviews. For example, if I run a restaurant and want to know the public’s opinion about my new dish, the “pepperoni pizza”, I can gather all my reviews and use an AI keyword extractor to get every sentence about pepperoni pizza. From there, I can either read the sentences, or run them through a sentiment analyzer and get their polarity values to see how the public feels about the dish.
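That review pipeline can be sketched as follows. The reviews here are made up for illustration, the sentence split is naive, and the sentiment step is left as a comment since it would come from a separate analyzer:

```python
def sentences_about(reviews: list[str], keyword: str) -> list[str]:
    """Collect every review sentence that mentions the keyword."""
    hits = []
    for review in reviews:
        # Naive split on "."; a real pipeline would use proper
        # sentence segmentation.
        for sent in review.split("."):
            if keyword.lower() in sent.lower():
                hits.append(sent.strip())
    return hits

reviews = [
    "The pepperoni pizza was fantastic. Service was slow.",
    "Great drinks. The pepperoni pizza could use more cheese.",
]
mentions = sentences_about(reviews, "pepperoni pizza")
print(mentions)
# Each extracted sentence could then be passed to a sentiment analyzer
# to get a polarity score.
```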

Implementing Keyword Extraction with AI

As we said above, there are multiple things you can do with keyword extraction, from extracting sentences to paragraphs to sections. In this post, we’ll implement AI keyword extraction at the sentence level. First, we’ll extract sentences for one keyword using an NLP library, spaCy. Then, we’ll extract sentences for multiple keywords using a web API, The Text API.

To follow along with these implementations, you’ll have to install some libraries. For the first AI sentence extraction implementation with spaCy, you’ll need to install spaCy and download a model. You can do that with the lines below in the terminal:

pip install spacy
python -m spacy download en_core_web_sm

For the second example, using a web API to do keyword extraction on multiple keywords, we’ll need an API key from The Text API. We’ll also need to install the requests library. You can do that with the terminal command below.

pip install requests

Sentence Extraction for One Keyword (spaCy)

For our first example, we’re going to use spaCy to do keyword extraction for one keyword. As always, we’ll start by importing the libraries we need: the spacy library and the Matcher object from spacy.matcher. Importing Matcher directly isn’t strictly necessary, but it makes the code look cleaner. After the imports, we’ll load the NLP model we downloaded earlier.

Next, we’ll create the function that will get all the sentences we’re looking for. This function takes three parameters: the language model, the text to search, and the keyword to search for. The first thing we’ll do is create a document by running the passed-in text through the NLP model. Then we’ll create a pattern to pass to the matcher. After creating the pattern, we’ll create a Matcher object from the model’s vocabulary and add the pattern to it. Note that we have to wrap the pattern in a list because matcher.add expects a string and a list of patterns as its two positional arguments.

Now, we’ll create an empty list to hold the returned strings and loop through the sentences in the document. For each sentence, we’ll loop through the matcher’s results and check whether the match corresponds to our keyword. If it does, we’ll add the sentence to our return values; otherwise we’ll move on. Finally, we’ll deduplicate the list by converting it to a set and back, since a sentence containing the keyword twice would otherwise appear twice in the results.

import spacy
from spacy.matcher import Matcher
 
nlp = spacy.load("en_core_web_sm")
 
def get_sentences(nlp: spacy.language.Language, text: str, keyword: str):
    doc = nlp(text)
    # match tokens whose exact text equals the keyword
    pattern = [{"TEXT": keyword}]
    matcher = Matcher(nlp.vocab)
    matcher.add(keyword, [pattern])
    retval = []
    for sent in doc.sents:
        for match_id, start, end in matcher(nlp(sent.text)):
            if nlp.vocab.strings[match_id] == keyword:
                retval.append(sent.text)
    # deduplicate sentences where the keyword appears more than once
    return list(set(retval))

Extracted Sentences from spaCy Keyword Extraction

Let’s take a piece of text from this primer on climate change and green energy and run keyword extraction on it. The keyword we’ll use is “energy”. We’ll call the spaCy function above to get all the sentences in the text below that contain the word “energy”.

text = """Green energy will be the backbone of decarbonizing our energy systems, and by extension, human society as a whole. Using the breakdown of GHG emissions by sector in the US below, replacing our direct electricity usage emissions with electricity from green energy sources (we can call this green electricity) would already reduce emissions by 25%. Furthermore, reducing emissions from transport and industry (another 52% of emissions) would require replacing burning hydrocarbons with using green electricity in a process called electrification. For transportation, replacing internal combustion engine vehicles with electric vehicles would enable the transportation sector to use green electricity instead of gasoline. For industry, electrifying manufacturing equipment or combining heat and power processes can enable the sector to use green electricity instead of burning coal. For commercial and residential, we could electrify heating and cooling for homes. Right now, there's a lot of propane and natural gas systems, and converting these to electricity would reduce the carbon footprint of the average American home. These pathways to decarbonization suggest that we need to install a lot of green electricity capacity and ensure our energy systems (like the electric grid) are capable of meeting people's new and existing demands without relying on hydrocarbons."""
kw = "energy"
print(get_sentences(nlp, text, kw))

We should see an output like the one below.

Keyword Extraction via spaCy

Sentence Extraction for Multiple Keywords (The Text API)

For this example, we’re going to use AI to extract sentences for multiple keywords. As always, we’ll start with the libraries we need to import: requests and json. I also imported my API key for The Text API from my config file. Next, we’ll create a headers dictionary that tells the server we’re sending JSON content and passes along the API key. We’ll also declare the keyword extraction API endpoint URL.

Just for consistency, we’ll use the same text in this example as we did last time. After declaring the text, we’ll establish the keywords; in this example, we’ll use two keywords, “energy” and “process”. Then we’ll create the body that we’ll send to the server, which contains the text and the keywords.

Now we’ll send a request to the server and parse the response into a JSON dictionary. After parsing, we’ll print out the values for the keys “energy” and “process”. That’s all there is to using AI to extract sentences for multiple keywords with a web API.

import requests
import json
from config import apikey
 
headers = {
    "Content-Type": "application/json",
    "apikey": apikey
}
kw_url = "https://app.thetextapi.com/text/sentences_with_keywords"
text = """Green energy will be the backbone of decarbonizing our energy systems, and by extension, human society as a whole. Using the breakdown of GHG emissions by sector in the US below, replacing our direct electricity usage emissions with electricity from green energy sources (we can call this green electricity) would already reduce emissions by 25%. Furthermore, reducing emissions from transport and industry (another 52% of emissions) would require replacing burning hydrocarbons with using green electricity in a process called electrification. For transportation, replacing internal combustion engine vehicles with electric vehicles would enable the transportation sector to use green electricity instead of gasoline. For industry, electrifying manufacturing equipment or combining heat and power processes can enable the sector to use green electricity instead of burning coal. For commercial and residential, we could electrify heating and cooling for homes. Right now, there's a lot of propane and natural gas systems, and converting these to electricity would reduce the carbon footprint of the average American home. These pathways to decarbonization suggest that we need to install a lot of green electricity capacity and ensure our energy systems (like the electric grid) are capable of meeting people's new and existing demands without relying on hydrocarbons."""
kws = ["energy", "process"]
body = {
    "text": text,
    "keywords": kws
}
 
response = requests.post(kw_url, headers=headers, json=body)
_dict = json.loads(response.text)
print(_dict["energy"])
print(_dict["process"])

You should get a response like the image below.

Keyword Extraction with The Text API

Applied Examples of AI Keyword Extraction

Now that we’ve seen some examples of keyword extraction with AI, let’s look at some real-life applications. We’re going to see how we can use keyword extraction for data analysis. The two applied examples both revolve around extracting headlines: in the first, we’ll extract COVID headlines; in the second, Obama headlines.

COVID Headlines

One example of what we can do with keyword extraction is extract headlines from archives. We can use AI to extract all the headlines from the NY Times that contain the word COVID. In this section, I’ll display some of the headlines we extracted as well as go over a bit of what we learned. For the full example check out Using AI to Analyze COVID Headlines Over Time.

We extracted headlines containing the word “covid” from the NY Times archive from 2020 to 2021. We found that there were no headlines about COVID for the first three months of 2020! Then in April, there were 6; here’s what they were:

  1. life, covid-free, after 22 days in the hospital.
  2. covid or no covid, it’s important to plan.
  3. pregnant and scared of ‘covid hospitals,’ they’re giving birth at home, women scared of hospitals are increasingly turning to midwives.
  4. 32 days on a ventilator: one covid patient’s fight to breathe again, gasping for breaths the size of a tablespoon.
  5. ‘possible covid’: why the lulls never last for weary e.m.s. crews, a call pierces the lulls for exhausted paramedics: ‘possible covid’.
  6. arthritis drug did not help seriously ill covid patients, early data shows, drug shows slim promise for critical covid cases.

This was the graph of the number of COVID headlines per month from 2020 to 2021. This plot was created using AI keyword extraction with The Text API and matplotlib.

Number of COVID headlines over time
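The counting step behind such a plot can be sketched with the standard library. The (headline, month) pairs below are hypothetical, standing in for the real archive data used in the analysis:

```python
from collections import Counter

# Hypothetical (headline, "YYYY-MM") pairs standing in for the real
# NY Times archive data.
headlines = [
    ("life, covid-free, after 22 days in the hospital", "2020-04"),
    ("covid or no covid, it's important to plan", "2020-04"),
    ("city reopens parks", "2020-05"),
    ("arthritis drug did not help seriously ill covid patients", "2020-05"),
]

# Count how many headlines per month contain the keyword.
counts = Counter(month for title, month in headlines if "covid" in title)
print(sorted(counts.items()))
# The month/count pairs can then be passed to matplotlib (e.g. bar())
# to produce a plot like the one above.
```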

Obama Headlines

We can also get the headlines for all of the news about Obama back during his presidency. I chose Obama because he’s one of the internet’s favorite presidents, and the most followed person on Twitter. For a full tutorial on how we got the headlines, read Using NLP to get the Obama Presidency in Headlines. There are a TON of headlines about Obama. 

Read the headlines we extracted about Obama each year in these files: 2008, 2009, 2010, 2011, 2012, 2013, 2014, 2015, 2016, 2017. We got these using The Text API and the NY Times archive, extracting headlines containing the word “Obama”. From these we were able to see the media’s portrayal of Obama by finding the most common phrases and creating word cloud summaries.

Summary of AI Keyword Extraction

In this article we learned about AI keyword extraction. We learned that we can use keyword extraction on text documents to get the sentences, paragraphs, or sections around keywords. Then we went over some possible uses of AI keyword extraction, including analyzing text data, searching legal documents, and analyzing reviews. Next, we implemented keyword extraction for sentences, first for one keyword using spaCy, then for multiple keywords with The Text API. Finally, we took a look at two applied examples: an analysis of COVID headlines and an analysis of the media’s portrayal of Obama.

Learn More

To learn more, feel free to reach out to me @yujian_tang on Twitter, connect with me on LinkedIn, and join our Discord. Remember to follow the blog to stay updated with cool Python projects and ways to level up your Software and Python skills! If you liked this article, please Tweet it, share it on LinkedIn, or tell your friends!


Yujian Tang

I started my professional software career interning for IBM in high school after winning ACSL two years in a row. I got into AI/ML in college where I published a first author paper to IEEE Big Data. After college I worked on the AutoML infrastructure at Amazon before leaving to work in startups. I believe I create the highest quality software content so that’s what I’m doing now. Drop a comment to let me know!

