NLP: Stop Words, When and Why to Use Them

There are 326 stop words by default in spaCy. What are stopwords (or stop words)? They’re common words that we leave out of some of our analyses when we perform Natural Language Processing, because they generally contribute little to the meaning of the text. However, we can’t always remove stopwords. In this article we’re going to go over why we remove stopwords, which NLP techniques and applications should keep or remove stopwords, and the lists of default stop words for spaCy and NLTK.

Why Do We Remove Stopwords?

Stopwords are words that don’t add to the overall meaning of our text. When performing NLP tasks that revolve around understanding, we don’t need these words. Since machine learning is computationally expensive, it benefits us to process as little data as possible while still being able to produce a usable result. Of course, we can’t remove stop words for every task, so let’s take a look at which tasks we should remove stopwords for and which tasks we should keep them for.
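
To make the "process less data" point concrete, here is a minimal sketch of stopword removal in plain Python. The `STOP_WORDS` set and the `remove_stopwords` helper are hypothetical names made up for this illustration; the set is a tiny hand-picked list, not the default list of any library.

```python
# A tiny, hand-picked stop list for illustration only (not a library default).
STOP_WORDS = {"the", "a", "an", "is", "of", "on", "in", "and", "to", "it"}

def remove_stopwords(text, stop_words):
    """Lowercase the text, split on whitespace, and drop tokens in stop_words."""
    return [tok for tok in text.lower().split() if tok not in stop_words]

sentence = "The cat is on the mat and it is happy"
kept = remove_stopwords(sentence, STOP_WORDS)

print(kept)                                    # ['cat', 'mat', 'happy']
print(len(sentence.split()), "->", len(kept))  # 10 -> 3
```

In this toy sentence, ten tokens shrink to three, which is the kind of reduction that makes downstream processing cheaper.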

Which NLP Techniques or Applications Should Remove Stop Words?

As we talked about above, not all Natural Language Processing tasks call for removing stop words. The NLP techniques or applications that should use stopword removal in the pipeline are ones that revolve around meaning. These are usually the Natural Language Understanding tasks, and they include applications like sentiment analysis, semantic parsing, and spam filtering. What these tasks have in common is that their output doesn’t need the common words: a label such as spam or not spam can often be produced largely from the content words.

Which NLP Techniques or Applications Should Keep Stop Words?

So, if we remove stopwords for NLP techniques and applications that don’t need them in their responses, which ones should keep stop words? When we’re doing NLP tasks that need the whole text for processing, we should keep stopwords. Examples of these kinds of NLP tasks include text summarization, language translation, and question answering. These tasks depend on common words such as “for”, “on”, or “in” to model the connections between words.
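
A small sketch of why these tasks need the common words: once stopwords are stripped, two sentences with different meanings can collapse into the same tokens. The `STOP_WORDS` set and the `content_words` helper are hypothetical and hand-picked for this example, not a library default.

```python
# Illustrative stop list only; it includes the prepositions "in" and "on".
STOP_WORDS = {"the", "are", "is", "in", "on", "for", "a"}

def content_words(text):
    """Return the lowercase tokens that are not in STOP_WORDS."""
    return [tok for tok in text.lower().split() if tok not in STOP_WORDS]

in_box = "The keys are in the box"
on_box = "The keys are on the box"

print(content_words(in_box))  # ['keys', 'box']
print(content_words(on_box))  # ['keys', 'box']

# The two sentences are now indistinguishable, so a translation or
# question-answering system could no longer tell where the keys are.
print(content_words(in_box) == content_words(on_box))  # True
```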

List of Default English Stop Words from Different Libraries

In our introduction to the top 3 NLP libraries in Python, we went over spaCy, NLTK, and CoreNLP. Interestingly, there’s no universal list of stopwords. The spaCy library has 326 default stopwords in English, the NLTK library has 179, and CoreNLP doesn’t have its own list of default stopwords. Let’s take a look at the default stopwords from spaCy and NLTK and how to get them.

List of all 326 Default Stopwords in spaCy

Figure: word cloud of the spaCy stopwords

There are 326 default stopwords in spaCy. To get these, we install the `spacy` library and download the `en_core_web_sm` model. The default stop words are part of spaCy’s English language data and are exposed on the loaded pipeline. We can see the stopwords by loading the model and printing its `Defaults.stop_words`.

# In a terminal
pip install spacy
python -m spacy download en_core_web_sm

# In Python
import spacy
nlp = spacy.load("en_core_web_sm")
print(nlp.Defaults.stop_words)
'you', 'something', 'anyhow', 'would', 'not', 'first', 'now', 'without', 'which', 'may', 'regarding', '’d', 'back', 'nevertheless', 'how', 'should', 'bottom', 'by', 'twelve', 'least', 'but', '‘d', 'thence', 'i', 'hers', 'are', 'therein', 'same', 'indeed', 'others', 'whither', 'your', '’ll', 'either', 'last', 'therefore', 'do', 'whence', 'we', 'top', 'beforehand', 'though', 'across', 'everyone', 'only', 'full', 'fifteen', 'hereby', 'since', 'while', 're', 'beside', 'quite', 'her', 'is', 'their', 'meanwhile', 'neither', 'various', 'everywhere', "'d", 'made', 'nowhere', 'name', 'of', 'done', 'ever', 'onto', 'off', 'its', 'most', 'twenty', 'next', 'after', 'does', 'whether', 'say', 'please', 'at', 'sometimes', "n't", 'hereafter', 'here', 'until', 'itself', 'latterly', 'well', 'became', 'under', 'behind', 'the', 'me', 'must', 'give', 'former', 'using', 'or', 'otherwise', 'noone', '‘s', 'yours', 'everything', 'wherein', 'even', 'take', 'put', 'ourselves', 'themselves', 'him', 'beyond', 'whose', 'another', 'with', 'every', 'whom', 'somewhere', 'forty', 'via', '’ve', 'get', "'s", '‘re', 'any', 'due', 'really', '’re', 'towards', 'it', 'whereupon', 'none', 'anyway', 'very', 'among', 'before', 'sixty', 'eleven', 'seeming', 'why', 'whereby', 'whenever', 'per', 'ours', 'namely', 'they', "'m", 'along', 'somehow', 'yourself', 'many', 'empty', 'who', 'becoming', 'hence', 'them', 'n’t', 'between', 'a', 'be', 'further', 'against', 'else', 'when', 'has', 'will', 'anyone', 'was', 'several', 'there', 'three', 'formerly', 'one', 'my', 'were', 'side', 'cannot', 'becomes', "'ll", 'make', 'such', 'never', 'amount', 'enough', 'just', 'our', 'those', 'besides', '’s', 'being', 'part', 'except', 'someone', 'often', 'seems', '‘ve', 'latter', "'ve", 'afterwards', 'both', 'during', 'unless', 'together', 'n‘t', 'show', 'keep', 'too', 'each', 'into', 'been', 'an', 'us', 'whereafter', 'to', 'in', 'nor', '‘ll', 'so', "'re", 'down', 'six', 'toward', 'five', 'doing', 'out', 'herein', 'thereupon', 
'whole', 'anything', 'can', 'because', 'over', 'however', 'seem', 'serious', 'go', 'am', 'then', 'myself', 'within', 'four', 'his', 'nobody', 'sometime', 'yet', 'front', 'become', 'himself', 'wherever', 'upon', 'nothing', 'few', 'hundred', 'move', '‘m', 'what', 'as', 'below', 'elsewhere', 'mostly', 'anywhere', 'up', 'that', 'amongst', 'this', 'around', 'she', 'always', 'thereafter', 'nine', 'ca', 'already', 'herself', 'some', 'much', 'if', 'two', 'these', 'had', 'ten', 'whatever', 'also', 'through', 'thus', 'yourselves', 'see', 'he', 'throughout', 'for', 'moreover', '’m', 'seemed', 'again', 'might', 'all', 'on', 'almost', 'have', 'less', 'fifty', 'eight', 'could', 'used', 'thereby', 'perhaps', 'above', 'whereas', 'and', 'about', 'although', 'still', 'mine', 'from', 'than', 'rather', 'once', 'third', 'call', 'alone', 'did', 'more', 'thru', 'whoever', 'where', 'hereupon', 'other', 'own', 'no'
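
Beyond printing the list, a common use is to filter tokens with spaCy's `is_stop` attribute. The sketch below assumes `spacy` is installed and uses `spacy.blank("en")`, which provides the English tokenizer and stop word defaults without the downloaded model. The example sentence is made up for illustration.

```python
import spacy

# A blank English pipeline carries the tokenizer and the language defaults
# (including the stop word list) without needing a downloaded model.
nlp = spacy.blank("en")

doc = nlp("This is an example of filtering stop words from a sentence")

# token.is_stop is True when the lowercased token is in the default list.
kept = [token.text for token in doc if not token.is_stop]
print(kept)
```

The same loop works on a pipeline loaded with `spacy.load("en_core_web_sm")`, since the stop word list comes from the language defaults.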

List of all 179 Default Stopwords in NLTK

Figure: word cloud of the NLTK stopwords

There are 179 stop words in NLTK. To get all the default stopwords from NLTK, we install the library and download the `stopwords` corpus. Once we do that, we can see all the stopwords with a simple command.

pip install nltk
python
>>> import nltk
>>> nltk.download("stopwords")
>>> from nltk.corpus import stopwords
>>> print(stopwords.words("english"))
'i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', 'too', 'very', 's', 't', 'can', 'will', 'just', 'don', "don't", 'should', "should've", 'now', 'd', 'll', 'm', 'o', 're', 've', 'y', 'ain', 'aren', "aren't", 'couldn', "couldn't", 'didn', "didn't", 'doesn', "doesn't", 'hadn', "hadn't", 'hasn', "hasn't", 'haven', "haven't", 'isn', "isn't", 'ma', 'mightn', "mightn't", 'mustn', "mustn't", 'needn', "needn't", 'shan', "shan't", 'shouldn', "shouldn't", 'wasn', "wasn't", 'weren', "weren't", 'won', "won't", 'wouldn', "wouldn't"

Stopwords Recap

In this post, we learned that stopwords are the most common words in a language and usually don’t provide much semantic value. Then we looked at why we remove stopwords. Some NLP tasks, such as sentiment analysis, should remove stop words, while others, such as AI summarization, shouldn’t. Finally, we went over the default stopwords in spaCy and NLTK and how to get them.
