
What is Lemmatization and How can I do It?

Lemmatization is an important part of Natural Language Processing. Other NLP topics we’ve covered include Text Polarity, Named Entity Recognition, and Summarization. Lemmatization is the process of turning a word into its lemma. A lemma is the “canonical form” of a word – usually the dictionary version, picked by convention. Let’s look at some examples to make more sense of this.

The words “playing”, “played”, and “plays” all share the lemma “play”. The words “win”, “winning”, “won”, and “wins” all share the lemma “win”. Let’s take a look at one more example before we move on to how you can do lemmatization in Python. The words “programming”, “programs”, “programmed”, and “programmatic” all share the lemma “program”. Another way to think about it is to think of the lemma as the “root” of the word.

In this post we’ll cover:

  • How Can I Do Lemmatization with Python?
    • Lemmatization with spaCy
    • Lemmatization with NLTK

How Can I Do Lemmatization with Python?

Python has many well-known Natural Language Processing libraries, and we’re going to make use of two of them to do lemmatization. The first one we’ll look at is spaCy and the second is the Natural Language Toolkit (NLTK).

Lemmatization with spaCy

This is pretty cool: we’re going to lemmatize our text in under 10 lines of code. To get started with spaCy, we’ll install the spacy library and download a model. We can do this in the terminal with the following commands:

pip install spacy
python -m spacy download en_core_web_sm

To start off our program, we’ll import spacy and load the language model.

import spacy
 
nlp = spacy.load("en_core_web_sm")

Once we have the model, we’ll simply make up a text, turn it into a spaCy Doc, and that’s basically it. To get the lemma of each word, we’ll just print out the lemma_ attribute. Note that printing the lemma attribute (without the underscore) will get you a number corresponding to the hash of the lemma’s string representation.

text = "This is an example program for showing how lemmatization works in spaCy. Let's play ball!. The NFL has many football teams and players. Yujian Tang is the best software content creator. The Text API is the most comprehensive sentiment analysis API made to date."
 
doc = nlp(text)
 
for token in doc:
    print(token.lemma_)

Our output should look like the following:

[Image: text lemmatization with spaCy output]

Sounds like a pirate!
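
By the way, here’s a minimal sketch of that lemma vs. lemma_ difference (assuming the same en_core_web_sm model as above):

import spacy
 
nlp = spacy.load("en_core_web_sm")
token = nlp("running")[0]
print(token.lemma)   # a large integer: the hash of the lemma string
print(token.lemma_)  # the readable lemma, which should be "run"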

Lemmatization with NLTK

Cool, lemmatization with spaCy wasn’t that hard. Let’s check it out with NLTK. For NLTK, we’ll need to install the library and download the wordnet submodule before we can write the program. We can do that in the terminal with the commands below.

pip install nltk
python
>>> import nltk
>>> nltk.download('wordnet')
>>> nltk.download('punkt')
>>> exit()

Why are we running a Python script in the shell and not just downloading wordnet at the start of our program? We only need to download it once to be able to use it, so we don’t want to re-download it every time the program runs. (We grab punkt at the same time because NLTK’s word_tokenize, which we use below, relies on it.)
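
If you’d rather keep everything in one self-contained script, a common NLTK pattern (a sketch of my own, not from the commands above) is to check for the resource and download it only when it’s missing:

import nltk
 
# only download wordnet if it isn't already on disk
try:
    nltk.data.find("corpora/wordnet")
except LookupError:
    nltk.download("wordnet")

As always, we’ll start our program by importing the libraries we need. In this case, that’s just nltk and the WordNetLemmatizer object from nltk.stem.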

import nltk
from nltk.stem import WordNetLemmatizer

First we’ll use word_tokenize from nltk to tokenize our text. Then we’ll loop through the tokens, lemmatize each one, and print out the result.

lemmatizer = WordNetLemmatizer()
tokenized = nltk.word_tokenize("This is an example program for showing how lemmatization works in NLTK. Let's play ball!. The NFL has many football teams and players. Yujian Tang is the best software content creator. The Text API is the most comprehensive sentiment analysis API made to date.")
for t in tokenized:
    print(lemmatizer.lemmatize(t))

We’ll end up with something like the image below. 

[Image: text lemmatization with NLTK results]

As you can see, using NLTK returns a different lemmatization than using spaCy, and it doesn’t seem to lemmatize as well. A big part of the reason is that WordNetLemmatizer treats every word as a noun unless you pass it a part of speech. NLTK and spaCy are made for different purposes, so I am usually impartial; however, spaCy definitely wins for built-in lemmatization. NLTK can be customized because it’s heavily used for research purposes, but that’s out of scope for this article. Be on the lookout for an in-depth dive though!
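
Here’s a minimal sketch of that default in action (the pos values come from WordNet: "n" for noun, "v" for verb, "a" for adjective, "r" for adverb):

from nltk.stem import WordNetLemmatizer
 
lemmatizer = WordNetLemmatizer()
# with no pos argument, lemmatize treats the word as a noun
print(lemmatizer.lemmatize("playing"))           # playing
print(lemmatizer.lemmatize("playing", pos="v"))  # play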


Natural Language Processing: Part of Speech Tagging

Part of Speech (POS) Tagging is an integral part of Natural Language Processing (NLP). The first step in most state-of-the-art NLP pipelines is tokenization. Tokenization is the separating of text into “tokens”. Tokens are generally regarded as individual pieces of language – words, whitespace, and punctuation.

Once we tokenize our text, we can tag it with parts of speech. Note that this article only covers the details of part of speech tagging for English. Part of speech tagging is done on all tokens except whitespace. We’ll take a look at how to do POS tagging with the two most popular and easy-to-use NLP Python libraries – spaCy and NLTK – coincidentally also my two favorite NLP libraries to play with.

What is Part of Speech (POS) Tagging?

Traditionally, there are nine parts of speech taught in English grammar – nouns, verbs, adjectives, determiners, adverbs, pronouns, prepositions, conjunctions, and interjections. We’ll see below that, for NLP purposes, we’ll actually be using far more than nine tags. The spaCy library tags 19 different parts of speech, and over 50 “tags” (depending how you count different punctuation marks).

In spaCy, tags are more fine-grained parts of speech. NLTK’s part of speech tagger uses 34 parts of speech; it’s closer to spaCy’s “tag” concept than to spaCy’s parts of speech. We’ll take a look at the part of speech labels from both, and then at spaCy’s fine-grained tags. You can find the GitHub repo that contains the code for POS tagging here.

In this post, we’ll go over:

  • List of spaCy automatic parts of speech (POS)
  • List of NLTK parts of speech (POS)
  • Fine-grained Part of Speech (POS) tags in spaCy
  • spaCy POS Tagging Example
  • NLTK POS Tagging Example

List of spaCy parts of speech (automatic):

  • ADJ – Adjective: big, purple, creamy
  • ADP – Adposition: in, to, during
  • ADV – Adverb: very, really, there
  • AUX – Auxiliary: is, has, will
  • CONJ – Conjunction: and, or, but
  • CCONJ – Coordinating conjunction: either…or, neither…nor, not only
  • DET – Determiner: a, an, the
  • INTJ – Interjection: psst, oops, oof
  • NOUN – Noun: cat, dog, frog
  • NUM – Numeral: 1, one, 20
  • PART – Particle: ‘s, n’t, ‘d
  • PRON – Pronoun: he, she, me
  • PROPN – Proper noun: Yujian Tang, Michael Jordan, Andrew Ng
  • PUNCT – Punctuation: commas, periods, semicolons
  • SCONJ – Subordinating conjunction: if, while, but
  • SYM – Symbol: $, %, ^
  • VERB – Verb: sleep, eat, run
  • X – Other: asdf, xyz, abc
  • SPACE – Space: whitespace
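
Handy trick: spaCy can describe any of these labels for you with spacy.explain. A minimal sketch:

import spacy
 
# spacy.explain returns a short description for a POS label or fine-grained tag
print(spacy.explain("ADJ"))    # adjective
print(spacy.explain("CCONJ"))  # coordinating conjunction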

List of NLTK parts of speech:

  • CC – Coordinating Conjunction: either…or, neither…nor, not only
  • CD – Cardinal Digit: 1, 2, twelve
  • DT – Determiner: a, an, the
  • EX – Existential There: “there” used for introducing a topic
  • FW – Foreign Word: bonjour, ciao, 你好
  • IN – Preposition/Subordinating Conjunction: in, at, on
  • JJ – Adjective: big
  • JJR – Comparative Adjective: bigger
  • JJS – Superlative Adjective: biggest
  • LS – List Marker: first, A., 1), etc.
  • MD – Modal: can, cannot, may
  • NN – Singular Noun: student, learner, enthusiast
  • NNS – Plural Noun: students, programmers, geniuses
  • NNP – Singular Proper Noun: Yujian Tang, Tom Brady, Fei Fei Li
  • NNPS – Plural Proper Noun: Americans, Democrats, Presidents
  • PDT – Predeterminer: all, both, many
  • POS – Possessive Ending: ‘s
  • PRP – Personal Pronoun: her, him, yourself
  • PRP$ – Possessive Pronoun: her, his, mine
  • RB – Adverb: occasionally, technologically, magically
  • RBR – Comparative Adverb: further, higher, better
  • RBS – Superlative Adverb: best, biggest, highest
  • RP – Particle: aboard, into, upon
  • TO – Infinitive Marker: “to” when it is used as an infinitive marker or preposition
  • UH – Interjection: uh, wow, jinkies!
  • VB – Verb: ask, assemble, brush
  • VBD – Verb Past Tense: dipped, diced, wrote
  • VBG – Verb Gerund: stirring, showing, displaying
  • VBN – Verb Past Participle: condensed, refactored, unsettled
  • VBP – Verb Present Tense, not 3rd person singular: predominate, wrap, resort
  • VBZ – Verb Present Tense, 3rd person singular: bases, reconstructs, emerges
  • WDT – Wh-determiner: that, what, which
  • WP – Wh-pronoun: that, what, whatever
  • WRB – Wh-adverb: how, however, wherever
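
You don’t need to memorize this table either; NLTK can describe its own tags. A small sketch (note: I’m assuming you grab the "tagsets" resource first, since the tag documentation is a separate download):

import nltk
 
# the tag documentation ships as its own resource
nltk.download('tagsets')
 
# print the definition and examples for one tag
nltk.help.upenn_tagset('RB')
 
# a regular expression works too, e.g. all the verb tags
nltk.help.upenn_tagset('VB.*')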

We can see that NLTK and spaCy tag parts of speech differently. This is because there are many ways to tag parts of speech, and the way NLTK splits them up is advantageous for academic purposes. Above, I’ve only shown spaCy’s automatic POS tagging, but spaCy also has fine-grained part of speech tagging – they call it “tag” instead of “part of speech”. I’ll break down how parts of speech map to tags in spaCy below.

List of spaCy Part of Speech Tags (Fine-grained)

  • ADJ maps to:
    • AFX – affix: “pre-”
    • JJ – adjective: good
    • JJR – comparative adjective: better
    • JJS – superlative adjective: best
    • PDT – predeterminer: half
    • PRP$ – possessive pronoun: his, her
    • WDT – wh-determiner: which
    • WP$ – possessive wh-pronoun: whose
  • ADP maps to:
    • IN – subordinating conjunction or preposition: “in”
  • ADV maps to:
    • EX – existential there: there
    • RB – adverb: quickly
    • RBR – comparative adverb: quicker
    • RBS – superlative adverb: quickest
    • WRB – wh-adverb: when
  • CONJ maps to:
    • CC – coordinating conjunction: and
  • DET maps to:
    • DT – determiner: this, a, an
  • INTJ maps to:
    • UH – interjection: uh, uhm, ruh-roh!
  • NOUN maps to:
    • NN – noun: sentence
    • NNS – plural noun: sentences
    • WP – wh-pronoun: who
  • NUM maps to:
    • CD – cardinal number: three, 5, twelve
  • PART maps to:
    • POS – possessive ending: ‘s
    • RP – particle adverb: back (put it “back”)
    • TO – infinitive to: “to”
  • PRON maps to:
    • PRP – personal pronoun: I, you
  • PROPN maps to:
    • NNP – proper singular noun: Yujian Tang
    • NNPS – proper plural noun: Pythonistas
  • PUNCT maps to:
    • -LRB- – left round bracket: “(”
    • -RRB- – right round bracket: “)”
    • actual punctuation marks: , : ; . “ ‘ (etc.)
    • HYPH – hyphen
    • LS – list item marker: a., A), iii.
    • NFP – superfluous punctuation
  • SYM maps to (like punctuation, these are pretty self-explanatory):
    • # – number sign
    • $ – dollar sign
    • SYM – symbol
  • VERB maps to:
    • BES – auxiliary “be”
    • HVS – “have”: ‘ve
    • MD – auxiliary modal: could
    • VB – base form verb: go
    • VBD – past tense verb: was
    • VBG – gerund: going
    • VBN – past participle verb: lost
    • VBP – non-3rd person singular present verb: want
    • VBZ – 3rd person singular present verb: wants
  • X maps to:
    • ADD – email
    • FW – foreign word
    • GW – additional word
    • XX – unknown
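
If you’re curious, you can recover this mapping empirically from the model itself. Here’s a rough sketch (the exact tag-to-POS mapping depends on the model and spaCy version, so your output may differ from the table above):

import spacy
from collections import defaultdict
 
nlp = spacy.load("en_core_web_sm")
doc = nlp("Whose book is better? She quickly gave him half the money to go away.")
 
# record which coarse part of speech each fine-grained tag maps to
tag_to_pos = defaultdict(set)
for token in doc:
    tag_to_pos[token.tag_].add(token.pos_)
 
for tag, pos_set in sorted(tag_to_pos.items()):
    print(tag, "->", ", ".join(sorted(pos_set)))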

How do I Implement POS Tagging?

Part of Speech Tagging is a cornerstone of Natural Language Processing. It is one of the most basic parts of NLP, and as a result it comes standard in any respectable NLP library. Below, I’m going to cover how you can do POS tagging in just a few lines of code with spaCy and NLTK.

spaCy POS Tagging

We’ll start by implementing part of speech tagging in spaCy. The first thing we’ll need to do is install spaCy and download a model.

pip install spacy
python -m spacy download en_core_web_sm

Once we have our required libraries downloaded, we can start. Like I said above, POS tagging is one of the cornerstones of natural language processing. It’s so important that the spaCy pipeline does it automatically upon tokenization. For this example, I’m using a large piece of text; this text about solar energy comes from How Many Solar Farms Does it Take to Power America?
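
You can verify that tagging is baked into the pipeline by listing its components. A quick sketch (the exact component names vary by model and spaCy version):

import spacy
 
nlp = spacy.load("en_core_web_sm")
# the tagger runs automatically whenever we call nlp on text
print(nlp.pipe_names)  # e.g. ['tok2vec', 'tagger', 'parser', 'attribute_ruler', 'lemmatizer', 'ner']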

First we import spaCy, then we load our NLP model, then we feed the NLP model our text to create our NLP document. After creating the document, we can simply loop through it and print out the different parts of the tokens. For this example, we’ll print out the token text, the token part of speech, and the token tag.

import spacy
 
nlp = spacy.load("en_core_web_sm")
 
text = """This is where the calculation can get tricky. Here’s the thing about solar energy. Solar energy comes from the sun. That means solar panels cannot produce energy 24 hours a day. They only produce energy during sunlight hours. That energy then has to be stored somewhere while it is not being used. Energy storage is a whole other topic in and of itself. Let me get back to the point, there’s only an average of 4 peak sunlight hours a day. A solar panel may get more than that, but let’s take a conservative estimate of our solar power generation and confine it to those 4 hours only.
Back to the calculations. At 4 acres of solar panels to generate a megawatt-hour and 4 hours of power generation time a day, a 1 MW solar farm would generate 4 MWh of power over 4 acres every day. At 110,000 megawatt-hours of power needed a day to power America, we would need about 110,000 acres of solar farm. 110,000 acres? That sounds huge, that’s more land than the entire Mojave desert. It’s not as daunting as it sounds, there are 1.9 billion acres in the continental United States, and 110,000 acres is only slightly more than 0.5 percent of the total land of the continental US."""
 
doc = nlp(text)
 
for token in doc:
    print(token.text, token.pos_, token.tag_)

Once you run this you should see an output like the one pictured below.

[Image: Part of Speech Tagging Results – spaCy]

NLTK POS Tagging

Now let’s take a look at how to do POS tagging with the Natural Language Toolkit. We’ll get started with this the same way we got started with spaCy, by downloading the library and the models we’ll need: the NLTK “punkt” tokenizer model and the “averaged_perceptron_tagger” model that nltk.pos_tag relies on.

pip install nltk
python
>>> import nltk
>>> nltk.download('punkt')
>>> nltk.download('averaged_perceptron_tagger')
>>> exit()

Once we have our libraries downloaded, we can fire up our favorite Python editor and get started. Like with spaCy, there are only a few steps we need to follow to start tagging parts of speech with the NLTK library. First, we tokenize our text. Then, we simply call the NLTK part of speech tagger on the tokenized text and voila! We’re done. I’ve used the exact same text as above.

import nltk
from nltk.tokenize import word_tokenize
 
text = """This is where the calculation can get tricky. Here’s the thing about solar energy. Solar energy comes from the sun. That means solar panels cannot produce energy 24 hours a day. They only produce energy during sunlight hours. That energy then has to be stored somewhere while it is not being used. Energy storage is a whole other topic in and of itself. Let me get back to the point, there’s only an average of 4 peak sunlight hours a day. A solar panel may get more than that, but let’s take a conservative estimate of our solar power generation and confine it to those 4 hours only.
Back to the calculations. At 4 acres of solar panels to generate a megawatt-hour and 4 hours of power generation time a day, a 1 MW solar farm would generate 4 MWh of power over 4 acres every day. At 110,000 megawatt-hours of power needed a day to power America, we would need about 110,000 acres of solar farm. 110,000 acres? That sounds huge, that’s more land than the entire Mojave desert. It’s not as daunting as it sounds, there are 1.9 billion acres in the continental United States, and 110,000 acres is only slightly more than 0.5 percent of the total land of the continental US."""
 
tokenized = word_tokenize(text)
tagged = nltk.pos_tag(tokenized)
for tag in tagged:
    print(tag)

Once we’re done, we simply run this in a terminal and we should see an output like the following.

[Image: Parts of Speech Tagging Results – NLTK]

You can compare and see that NLTK and spaCy produce pretty much the same tagging at the tag level.
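
If you’d rather see a summary than eyeball tokens one by one, you can tally the tags. A small sketch building on the tagged list from the NLTK example above:

from collections import Counter
 
# count how often each tag occurs in the tagged text
tag_counts = Counter(tag for word, tag in tagged)
print(tag_counts.most_common(10))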
