Create a Word Cloud in 10 Lines of Python

I needed to create a word cloud for a recent text analysis project. It wasn’t easy to find resources that just create a word cloud without a bunch of other random shit on it, so I decided to write one. Here’s a no-nonsense tutorial on how to create a word cloud in Python. I’ve simplified pages and pages of reading into just 10 lines of Python.

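To follow along, you’ll need an image to use as the shape of the cloud; here’s the cloud image I used. Save it as cloud_shape.png next to your script, since that’s the filename the code below loads. You’ll also need to install the matplotlib, numpy, and wordcloud libraries. You can install them by running the line below in your terminal: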

pip install matplotlib numpy wordcloud

Handling Imports

As always, the first thing we need to do is handle our imports. From the wordcloud library we’ll import the WordCloud class and the STOPWORDS set. We need the WordCloud class to create our word cloud, and STOPWORDS is a set of words that don’t make sense to include in one, such as its, an, and the.
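
To make the idea of stop words concrete, here’s a minimal sketch of what filtering them out looks like. It doesn’t use the wordcloud library at all: DEMO_STOPWORDS is a tiny hypothetical stand-in for the much longer STOPWORDS set, and count_words is a helper name I made up for illustration.

```python
import re
from collections import Counter

# Hypothetical stand-in for the library's much longer STOPWORDS set.
DEMO_STOPWORDS = frozenset({"its", "an", "the", "a", "of", "and"})


def count_words(text, stopwords=DEMO_STOPWORDS):
    """Count lowercase words in text, skipping anything in stopwords."""
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(word for word in words if word not in stopwords)


counts = count_words("The cloud and the rain: the cloud wins")
print(counts.most_common())
```

Without the filter, "the" would be the most frequent word here and would end up as the biggest word in the cloud, which is exactly what stop words are meant to prevent.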

We’ve already used matplotlib.pyplot in other tutorials, such as the one on plotting a random dataset and the one checking whether more polarizing YouTube titles get more views; it’s our plotting library. The other two libraries we’ll use here are numpy, for fast math operations and access to the array type, and PIL, to load the image. We didn’t explicitly install PIL above because it’s provided by the Pillow package, which pip installs automatically as a dependency of matplotlib and wordcloud.

from wordcloud import WordCloud, STOPWORDS
import matplotlib.pyplot as plt
import numpy as np
from PIL import Image

Creating the Word Cloud

Now let’s create our word cloud function. This function takes one parameter, the text that we’ll make the word cloud from. The first thing we’ll do in the function is make our own copy of the STOPWORDS set we imported. Next, let’s make a mask out of the image. This frame mask determines the shape of our word cloud. To make the mask, we’ll open our image and turn it into a numpy array.
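
If you don’t have an image handy, a mask can also be built directly with numpy. The sketch below isn’t part of the 10 lines; circular_mask is a hypothetical helper name, and it relies on the rule quoted in the help text further down that pure white entries are "masked out" while everything else is free to draw on.

```python
import numpy as np


def circular_mask(size=300):
    """Return a size x size uint8 array: 255 (white) outside a circle, 0 inside."""
    y, x = np.ogrid[:size, :size]
    center = (size - 1) / 2
    radius = size * 0.45
    # True for every pixel that lies outside the circle
    outside = (x - center) ** 2 + (y - center) ** 2 > radius ** 2
    # white (255) pixels are masked out, black (0) pixels can hold words
    return (255 * outside).astype(np.uint8)


mask = circular_mask(300)
print(mask.shape, mask[0, 0], mask[150, 150])
```

An array like this should be usable anywhere the tutorial passes frame_mask, giving a round cloud instead of a cloud-shaped one.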

Once we have these set up, we can create the word cloud. All we’ll do is create a WordCloud object with some parameters and call its generate method on our text. In this example, we’ve passed in the maximum number of words we want in the cloud, the mask for the shape, the stop words to ignore, and the background color. After creating the word cloud, we’ll use the imshow function from matplotlib.pyplot to draw it, turn off the axes, and show the figure. The interpolation option controls how the image is smoothed when it’s displayed; to learn more, read about interpolation in matplotlib.

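# wordcloud function
def word_cloud(text):
    # copy the built-in stop words so we can add our own without touching the original
    stopwords = set(STOPWORDS)
    # load the shape image as an array to use as the mask
    frame_mask = np.array(Image.open("cloud_shape.png"))
    wordcloud = WordCloud(max_words=50, mask=frame_mask, stopwords=stopwords, background_color="white").generate(text)
    # draw the cloud and hide the axes
    plt.imshow(wordcloud, interpolation="bilinear")
    plt.axis("off")
    plt.show()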

For examples, check out these word clouds about Obama’s Presidency in Headlines.

Word Cloud Options/Help Results

Here are the parameters and options for the WordCloud class, as listed in its help text.

Parameters
----------
font_path : string
    Font path to the font that will be used (OTF or TTF).
    Defaults to DroidSansMono path on a Linux machine. If you are on
    another OS or don't have this font, you need to adjust this path.

width : int (default=400)
    Width of the canvas.

height : int (default=200)
    Height of the canvas.

prefer_horizontal : float (default=0.90)
    The ratio of times to try horizontal fitting as opposed to vertical.
    If prefer_horizontal < 1, the algorithm will try rotating the word
    if it doesn't fit. (There is currently no built-in way to get only
    vertical words.)

mask : nd-array or None (default=None)
    If not None, gives a binary mask on where to draw words. If mask is not
    None, width and height will be ignored, and the shape of mask will be
    used instead. All white (#FF or #FFFFFF) entries will be considered
    "masked out" while other entries will be free to draw on. [This
    changed in the most recent version!]

contour_width: float (default=0)
    If mask is not None and contour_width > 0, draw the mask contour.

contour_color: color value (default="black")
    Mask contour color.

scale : float (default=1)
    Scaling between computation and drawing. For large word-cloud images,
    using scale instead of larger canvas size is significantly faster, but
    might lead to a coarser fit for the words.

min_font_size : int (default=4)
    Smallest font size to use. Will stop when there is no more room in this
    size.

font_step : int (default=1)
    Step size for the font. font_step > 1 might speed up computation but
    give a worse fit.

max_words : number (default=200)
    The maximum number of words.

stopwords : set of strings or None
    The words that will be eliminated. If None, the built-in STOPWORDS
    list will be used.

background_color : color value (default="black")
    Background color for the word cloud image.

max_font_size : int or None (default=None)
    Maximum font size for the largest word. If None, the height of the image is
    used.

mode : string (default="RGB")
    Transparent background will be generated when mode is "RGBA" and
    background_color is None.

relative_scaling : float (default=.5)
    Importance of relative word frequencies for font-size.  With
    relative_scaling=0, only word-ranks are considered.  With
    relative_scaling=1, a word that is twice as frequent will have twice
    the size.  If you want to consider the word frequencies and not only
    their rank, relative_scaling around .5 often looks good.

    .. versionchanged: 2.0
        Default is now 0.5.

color_func : callable, default=None
    Callable with parameters word, font_size, position, orientation,
    font_path, random_state that returns a PIL color for each word.
    Overwrites "colormap".
    See colormap for specifying a matplotlib colormap instead.

regexp : string or None (optional)
    Regular expression to split the input text into tokens in process_text.
    If None is specified, ``r"\w[\w']+"`` is used.

collocations : bool, default=True
    Whether to include collocations (bigrams) of two words.

    .. versionadded: 2.0

colormap : string or matplotlib colormap, default="viridis"
    Matplotlib colormap to randomly draw colors from for each word.
    Ignored if "color_func" is specified.

    .. versionadded: 2.0

normalize_plurals : bool, default=True
    Whether to remove trailing 's' from words. If True and a word
    appears with and without a trailing 's', the one with trailing 's'
    is removed and its counts are added to the version without
    trailing 's' -- unless the word ends with 'ss'.
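
The regexp and normalize_plurals descriptions above are easier to follow with a small example. This is a rough sketch of the two rules as the help text states them, written with the standard library only; it is not the library’s actual implementation, and tokenize and merge_plurals are names I made up for illustration.

```python
import re
from collections import Counter


def tokenize(text, regexp=r"\w[\w']+"):
    """Split text with the default pattern quoted above (tokens need 2+ characters)."""
    return re.findall(regexp, text)


def merge_plurals(counts):
    """Fold 'clouds' into 'cloud' when both occur; leave words ending in 'ss' alone."""
    merged = Counter(counts)
    for word in list(counts):
        singular = word[:-1]
        if word.endswith("s") and not word.endswith("ss") and singular in counts:
            # move the plural's count onto the singular form
            merged[singular] += merged.pop(word)
    return merged


tokens = tokenize("A cloud, two clouds, one glass, no boss")
counts = merge_plurals(Counter(tokens))
print(tokens)
print(counts)
```

Note that the single-letter word "A" never becomes a token, because the pattern needs at least two characters, and that "glass" and "boss" are left untouched because they end in "ss".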

Attributes
----------
``words_`` : dict of string to float
    Word tokens with associated frequency.

    .. versionchanged: 2.0
        ``words_`` is now a dictionary

``layout_`` : list of tuples (string, int, (int, int), int, color)
    Encodes the fitted word cloud. Encodes for each word the string, font
    size, position, orientation, and color.

Notes
-----
Larger canvases will make the code significantly slower. If you need a
large word cloud, try a lower canvas size, and set the scale parameter.

The algorithm might give more weight to the ranking of the words
than their actual frequencies, depending on the ``max_font_size`` and the
scaling heuristic.
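
One option from the list above that’s worth a quick example is color_func. Here’s a minimal sketch of a callable with the parameters the help text lists (word, font_size, position, orientation, font_path, random_state). The name size_to_color and the specific color formula are my own assumptions, picked just to show the shape of the function; the returned string uses the "hsl(...)" color format that PIL understands.

```python
def size_to_color(word, font_size, position, orientation,
                  font_path=None, random_state=None, **kwargs):
    """Return an HSL color string that gets darker as font_size grows."""
    # bigger words get lower lightness, but never darker than 20%
    lightness = max(20, 70 - int(font_size) // 2)
    return f"hsl(210, 60%, {lightness}%)"


# Calling it by hand to see what it returns for a small and a large word.
small = size_to_color("cloud", 10, (0, 0), None)
large = size_to_color("python", 100, (0, 0), None)
print(small, large)
```

To try it in the tutorial’s function, you would pass color_func=size_to_color alongside the other WordCloud parameters; per the help text, that overrides colormap.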