Part of Speech (POS) Tagging is an integral part of Natural Language Processing (NLP). The first step in most state of the art NLP pipelines is tokenization. Tokenization is the separating of text into “tokens”. Tokens are generally regarded as individual pieces of languages – words, whitespace, and punctuation.
Once we tokenize our text we can tag it with the part of speech, note that this article only covers the details of part of speech tagging for English. Part of speech tagging is done on all tokens except for whitespace. We’ll take a look at how to do POS with the two most popular and easy to use NLP Python libraries – spaCy and NLTK – coincidentally also my favorite two NLP libraries to play with.
What is Part of Speech (POS) Tagging?
Traditionally, there are nine parts of speech taught in English literature – nouns, adjectives, determiners, adverbs, pronouns, prepositions, conjunctions, and interjections. We’ll see below, that for NLP reasons, we’ll actually be using way more than nine tags. The spaCy library tags 19 different parts of speech, and over 50 “tags” (depending how you count different punctuation marks).
In spaCy tags are more granularized parts of speech. NLTK’s part of speech tagging tags 34 parts of speech. It is more like spaCy’s tagging concept than spaCy’s parts of speech. We’ll take a look at the parts of speech labels from both, and then spaCy’s fine grained tagging. You can find the Github Repo that contains code for POS tagging here.
In this post, we’ll go over:
- List of spaCy automatic parts of speech (POS)
- List of NLTK parts of speech (POS)
- Fine-grained Part of Speech (POS) tags in spaCy
- spaCy POS Tagging Example
- NLTK POS Tagging Example
List of spaCy parts of speech (automatic):
|ADJ||Adjective – big, purple, creamy||ADP||Adposition – in, to, during|
|ADV||Adverb – very, really, there||AUX||Auxiliary – is, has, will|
|CONJ||Conjunction – and, or, but||CCONJ||Coordinating conjunction – either…or, neither…nor, not only|
|DET||Determiner – a, an, the||INTJ||Interjection – psst, oops, oof|
|NOUN||Noun – cat, dog, frog||NUM||Numeral – 1, one, 20|
|PART||Particle – ‘s, ‘nt, ‘d||PRON||Pronoun – he, she, me|
|PROPN||Proper noun – Yujian Tang, Michael Jordan, Andrew Ng||PUNCT||Punctuation – commas, periods, semicolons|
|SCONJ||Subordinating conjunction – if, while, but||SYM||Symbol – $, %, ^|
|VERB||Verb – sleep, eat, run||X||Other – asdf, xyz, abc|
|SPACE||Space – space lol|
List of NLTK parts of speech:
|CC||Coordinating Conjunction – either…or, neither…nor, not only||CD||Cardinal Digit – 1, 2, twelve|
|DT||Determiner – a, an, the||EX||Existential There – “there” used for introducing a topic|
|FW||Foreign Word – bonjour, ciao, 你好||IN||Preposition/Subordinating Conjunction – in, at, on|
|JJ||Adjective – big||JJR||Comparative Adjective – bigger|
|JJS||Superlative Adjective – biggest||LS||List Marker – first, A., 1), etc|
|MD||Modal – can, cannot, may||NN||Singular Noun – student, learner, enthusiast|
|NNS||Plural Noun – students, programmers, geniuses||NNP||Singular Proper Noun – Yujian Tang, Tom Brady, Fei Fei Li|
|NNPS||Plural Proper Noun – Americans, Democrats, Presidents||PDT||Predeterminer – all, both, many|
|POS||Possessive Ending – ‘s||PRP||Personal Pronoun – her, him, yourself|
|PRP$||Possessive Pronoun – her, his, mine||RB||Adverb – occasionally, technologically, magically|
|RBR||Comparative Adjective – further, higher, better||RBS||Superlative Adjective – best, biggest, highest|
|RP||Particle – aboard, into, upon||TO||Infinitive Marker – “to” when it is used as an infinitive marker or preposition|
|UH||Interjection – uh, wow, jinkies!||VB||Verb – ask, assemble, brush|
|VBG||Verb Gerund – stirring, showing, displaying||VBD||Verb Past Tense – dipped, diced, wrote|
|VBN||Verb Past Participle – condensed, refactored, unsettled||VBP||Verb Present Tense not 3rd person singular – predominate, wrap, resort|
|VBZ||Verb Present Tense, 3rd person singular – bases, reconstructs, emerges||WDT||Wh-determiner – that, what, which|
|WP||Wh-pronoun – that, what, whatever||WRB||Wh-adverb – how, however, wherever|
We can see that NLTK and spaCy have different parts of speech tagging, this is because there are many ways to tag parts of speech and the different ways that NLTK has split it up is advantageous for academic process. Above, I’ve only shown spaCy’s automatic POS tagging, but spaCy actually has a fine grained part of speech tagging as well, they call it “tag” instead of “part of speech”. I’ll break down how parts of speech map to tagging in spaCy below.
List of spaCy Part of Speech Tags (Fine grained)
|POS||Mapped Tags||POS||Mapped Tags|
|ADJ||AFX – affix: “pre-”|
JJ – adjective: good
JJR – comparative adjective: better
JJS – superlative adjective: best
PDT – predeterminer: half
PRP$ – possessive pronoun: his, her
WDT – wh-determiner: which
WP$ – possessive wh-pronoun: whose
|ADP||IN – subordinating conjunction or preposition: “in”|
|ADV||EX – existential there: there|
RB – adverb: quickly
RBR – comparative adverb: quicker
RBS – superlative adverb: quickest
WRB – wh-adverb: when
|CONJ||CC – coordinating conjunction: and|
|DET||DT – determiner: this, a, an||INTJ||UH – interjection: uh, uhm, ruh-roh!|
|NOUN||NN – noun: sentence|
NNS – plural noun: sentences
WP – wh-pronoun: who
|NUM||CD – cardinal number: three, 5, twelve|
|PART||POS – possessive ending: ‘s|
RP – particle adverb: back (put it “back”)
TO – infinitive to: “to”
|PRON||PRP – personal pronoun: I, you|
|PROPN||NNP – proper singular noun: Yujian Tang|
NNPS – proper plural nouns: Pythonistas
|PUNCT||-LRB- left round bracket: “(“|
-RRB- right round bracket: “)”
(actual punctuation marks): , : ; . “ ‘ (etc)
HYPH – hyphen
LS – list item marker: a., A), iii.
NFP – superfluous punctuation
|SYM||(like punctuation, these are pretty self explanatory)#|
SYM – symbol
|VERB||BES – auxiliary “be”|
HVS – “have”: ‘ve
MD – auxiliary modal: could
VB – base form verb: go
VBD – past tense verb: was
VBG – gerund: going
VBN – past participle verb: lost
VBP – non 3rd person singular present verb: want
VBZ – 3rd person singular present verb: wants
|X||ADD – email|
FW – foreign word
GW – additional word
XX – unknown
How do I Implement POS Tagging?
Part of Speech Tagging is at the cornerstone of Natural Language Processing. It is one of the most basic parts of NLP, and as a result it comes standard as part of any respectable NLP library. Below, I’m going to cover how you can do POS tagging in just a few lines of code with spaCy and NLTK.
Spacy POS Tagging
We’ll start by implementing part of speech tagging in spaCy. The first thing we’ll need to do is install spaCy and download a model.
pip install spacy python -m spacy download en_core_web_sm
Once we have our required libraries downloaded we can start. Like I said above, POS tagging is one of the cornerstones of natural language processing. It’s so important that the spaCy pipeline automatically does it upon tokenization. For this example, I’m using a large piece of text, this text about solar energy comes from How Many Solar Farms Does it Take to Power America?
First we import spaCy, then we load our NLP model, then we feed the NLP model our text to create our NLP document. After creating the document, we can simply loop through it and print out the different parts of the tokens. For this example, we’ll print out the token text, the token part of speech, and the token tag.
import spacy nlp = spacy.load("en_core_web_sm") text = """This is where the calculation can get tricky. Here’s the thing about solar energy. Solar energy comes from the sun. That means solar panels cannot produce energy 24 hours a day. They only produce energy during sunlight hours. That energy then has to be stored somewhere while it is not being used. Energy storage is a whole other topic in and of itself. Let me get back to the point, there’s only an average of 4 peak sunlight hours a day. A solar panel may get more than that, but let’s take a conservative estimate of our solar power generation and confine it to those 4 hours only. Back to the calculations. At 4 acres of solar panels to generate a megawatt-hour and 4 hours of power generation time a day, a 1 MW solar farm would generate 4 MWh of power over 4 acres every day. At 110,000 megawatt-hours of power needed a day to power America, we would need about 110,000 acres of solar farm. 110,000 acres? That sounds huge, that’s more land than the entire Mojave desert. It’s not as daunting as it sounds, there are 1.9 billion acres in the continental United States, and 110,000 acres is only slightly more than 0.5 percent of the total land of the continental US.""" doc = nlp(text) for token in doc: print(token.text, token.pos_, token.tag_)
Once you run this you should see an output like the one pictured below.
NLTK POS Tagging
Now let’s take a look at how to do POS tagging with the Natural Language Toolkit. We’ll get started with this the same way we got started with spaCy, by downloading the library and the model we’ll need. We’re going to need to install NLTK and download the NLTK “punkt” tokenizer model.
pip install nltk python >>> import nltk >>> nltk.download(‘punkt’)
Once we have our libraries downloaded, we can fire up our favorite Python editor and get started. Like with spaCy, there’s only a few steps we need to do to start tagging parts of speech with the NLTK library. First, we need to tokenize our text. Then, we simply call the NLTK part of speech tagger on the tokenized text and voila! We’re done. I’ve used the exact same text from above.
import nltk from nltk.tokenize import word_tokenize text = """This is where the calculation can get tricky. Here’s the thing about solar energy. Solar energy comes from the sun. That means solar panels cannot produce energy 24 hours a day. They only produce energy during sunlight hours. That energy then has to be stored somewhere while it is not being used. Energy storage is a whole other topic in and of itself. Let me get back to the point, there’s only an average of 4 peak sunlight hours a day. A solar panel may get more than that, but let’s take a conservative estimate of our solar power generation and confine it to those 4 hours only. Back to the calculations. At 4 acres of solar panels to generate a megawatt-hour and 4 hours of power generation time a day, a 1 MW solar farm would generate 4 MWh of power over 4 acres every day. At 110,000 megawatt-hours of power needed a day to power America, we would need about 110,000 acres of solar farm. 110,000 acres? That sounds huge, that’s more land than the entire Mojave desert. It’s not as daunting as it sounds, there are 1.9 billion acres in the continental United States, and 110,000 acres is only slightly more than 0.5 percent of the total land of the continental US.""" tokenized = word_tokenize(text) tagged = nltk.pos_tag(tokenized) for tag in tagged: print(tag)
Once we’re done, we simply run this in a terminal and we should see an output like the following.
You can compare and see that NLTK and spaCy have pretty much the same tagging at the tag level.
- Kruskal’s Algorithm in Python
- Python Nested Lists (Lists in Lists)
- Python War Card Game
- How to Send and Email with an Attachment in Python
- Floyd Warshall Algorithm in Python
To learn more, feel free to reach out to me @yujian_tang on Twitter, connect with me on LinkedIn, and join our Discord. Remember to follow the blog to stay updated with cool Python projects and ways to level up your Software and Python skills! If you liked this article, please Tweet it, share it on LinkedIn, or tell your friends!
I run this site to help you and others like you find cool projects and practice software skills. If this is helpful for you and you enjoy your ad free site, please help fund this site by donating below! If you can’t donate right now, please think of us next time.
Make a one-time donation
Make a monthly donation
Make a yearly donation
Choose an amount
Or enter a custom amount
Your contribution is appreciated.
Your contribution is appreciated.
Your contribution is appreciated.DonateDonate monthlyDonate yearly