How To Download Archived News Headlines

Have you ever wanted to see news headlines from the past? The New York Times provides an API to access their archive that's perfect for this. I recently wanted to pull a batch of older headlines to analyze, and I wrote about that in this article asking Has COVID Made NY Times Articles More Negative? There are only two steps to getting archived news headlines from the NY Times.

  1. Sign Up For the NY Times API and Configure It
  2. Pull Archived Headlines From the NY Times API

Sign Up For the NY Times API and Configure It

Let's get into it. Signing up for the New York Times API is pretty easy: just go to their Get Started page and follow the provided instructions. For this tutorial specifically, when you set up your "App", make sure you enable the Archive API like I've shown below.

Make sure you copy and save your API key somewhere safe. I’ve marked my API key out in blue in the image below.

That's all there is to Step 1, signing up for the New York Times API. Let's get into the code.

Pull Archived Headlines From the NY Times API

The first thing we’re going to do is pull the archived headlines from the NY Times with their API. To do this, we’ll need to install the requests library in order to make HTTP requests.

pip install requests

The first file we'll make, archive.py, handles calling the archive endpoint. We'll start off with our imports: the requests library for HTTP requests, the json library for JSON parsing, and os for making directories if we need to. Remember the API key we copied from the NY Times API page? We'll import that as well, from a config.py file.

import requests
import json
import os
from config import public_key
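The config module isn't shown in the article; a minimal config.py would just hold your key. The value below is a placeholder, not a real key:

```python
# config.py
# Placeholder: replace with the API key you copied from the NY Times developer page.
public_key = "YOUR_NYT_API_KEY"
```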

Now let's set up a dictionary to map each month number to its name. Why are we doing this? The archive URL, which we'll see in a moment, takes the month as a number, but we'll use the month's name in the output filenames so they're easy to read. The numeric keys will also help us retrieve entire years of archived titles in one function call.

month_dict = {
    1: "January",
    2: "February",
    3: "March",
    4: "April",
    5: "May",
    6: "June",
    7: "July",
    8: "August",
    9: "September",
    10: "October",
    11: "November",
    12: "December"
}
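If you'd rather not write the mapping out by hand, the standard library's calendar module provides the same names (assuming the default English locale); a sketch:

```python
import calendar

# calendar.month_name[0] is an empty string, so indices 1-12 give the real names
month_dict = {i: calendar.month_name[i] for i in range(1, 13)}
print(month_dict[1], month_dict[12])  # January December
```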

Creating the Function to Call the Archive Endpoint

Now let's define the central function of this tutorial. We'll call it get_month and give it the parameters year and month, because we're going to use it to get the archive information for a given year and month. The first thing this function does is construct the URL from our input and send a GET request. We'll load the response text and parse it as JSON, wrapping all of this in a try-except block so that if anything fails, we simply tell the user that the requested month was not in the archive and return.

Next we'll create a folder for the year if it doesn't exist yet and declare our file name. You're free to set the filename to whatever you want; I chose the format "year/month.json" for ease of access and clarity. You can skip the next part if you want to retain all the information returned by the archive call, but I put it in to get rid of superfluous information. We'll create a list so that we save information from News articles only. While we loop through the returned documents, we'll also get rid of the multimedia information, the _id, and the uri. Finally, we save our list to a JSON file.

def get_month(year, month):
    # the Archive API takes the month as a number, e.g. /2020/3.json
    url = f"https://api.nytimes.com/svc/archive/v1/{year}/{month}.json?api-key={public_key}"
    try:
        res = requests.get(url)
        json_dict = json.loads(res.text)
        docs = json_dict["response"]["docs"]
    except (requests.RequestException, json.JSONDecodeError, KeyError):
        print("Requested month not in archive")
        return
    # create the year folder if it doesn't exist yet
    os.makedirs(str(year), exist_ok=True)
    filename = f"{year}/{month_dict[month]}.json"
    # keep type_of_material = "News" and document_type = "article" only
    new_docs = []
    for doc in docs:
        if doc.get("type_of_material") == "News" and doc.get("document_type") == "article":
            # drop fields we don't need
            for key in ("multimedia", "_id", "uri"):
                doc.pop(key, None)
            new_docs.append(doc)

    with open(filename, "w") as f:
        json.dump(new_docs, f)
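The filtering step is easy to pull out and test on its own. Here's a sketch with made-up sample documents (the field values below are invented for illustration, not real API output):

```python
def filter_docs(docs):
    """Keep only News articles and drop fields we don't need."""
    new_docs = []
    for doc in docs:
        if doc.get("type_of_material") == "News" and doc.get("document_type") == "article":
            for key in ("multimedia", "_id", "uri"):
                doc.pop(key, None)
            new_docs.append(doc)
    return new_docs

# Made-up sample documents for illustration
sample = [
    {"headline": "Example", "type_of_material": "News",
     "document_type": "article", "_id": "x", "uri": "y", "multimedia": []},
    {"headline": "Opinion piece", "type_of_material": "Op-Ed",
     "document_type": "article"},
]
result = filter_docs(sample)
print(result)
```

Only the first sample document survives, and its _id, uri, and multimedia fields are gone.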

For reference, here are the keys to each document in the response:

The Extracted Information as JSON

The other two functions we'll define are a function to get a whole year, and a function to get multiple years. Our get_year function takes in a year and calls the get_month function we created above for each month in that year. Note that range(12) produces the numbers 0 through 11, so we pass i+1 to cover months 1 through 12. Our get_years function takes a list of years and calls get_year for each of those years.

def get_year(year):
    # range(12) goes from 0 to 11
    for i in range(12):
        get_month(year, i+1)
 
def get_years(years):
    for year in years:
        get_year(year)
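To sanity-check the month numbering, the i+1 offset turns range(12) into exactly the months 1 through 12:

```python
# range(12) yields 0..11; adding 1 gives the month numbers 1..12
months = [i + 1 for i in range(12)]
print(months)  # [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]
```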

Now let's make a separate file to orchestrate these downloads. I called mine downloader.py. This file is pretty simple: all we do is import our get_years function from archive and create a function that calls it to download the archive. Our function prompts the user for a comma separated list of years, which it reads as a string (so don't put spaces between commas and years!). It then splits that string and calls get_years on the resulting list. After we define our function, we call it at the end of the file to actually use it.

Note: the downloader.py file doesn't actually contain the code that downloads the files; that lives in archive.py. This is just the file we'll use to call our functions, which keeps the code modular.

from archive import get_years

def download():
    years = input("Which years do you want? Enter a Comma Separated List.\n")
    yr_list = years.split(",")
    print(yr_list)
    get_years(yr_list)

if __name__ == "__main__":
    download()
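If you'd like the prompt to tolerate spaces after the commas, stripping each entry is a small optional change. A sketch (parse_years is a hypothetical helper, not part of the original code):

```python
def parse_years(raw):
    """Split a comma-separated string and strip surrounding whitespace."""
    return [y.strip() for y in raw.split(",") if y.strip()]

print(parse_years("2019, 2020,2021"))  # ['2019', '2020', '2021']
```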

Learn More

To learn more, feel free to reach out to me @yujian_tang on Twitter, connect with me on LinkedIn, and join our Discord. Remember to follow the blog to stay updated with cool Python projects and ways to level up your Software and Python skills! If you liked this article, please Tweet it, share it on LinkedIn, or tell your friends!

I run this site to help you and others like you find cool projects and practice software skills. If this is helpful for you and you enjoy your ad free site, please help fund this site by donating below! If you can’t donate right now, please think of us next time.

Yujian Tang

I started my professional software career interning for IBM in high school after winning ACSL two years in a row. I got into AI/ML in college where I published a first author paper to IEEE Big Data. After college I worked on the AutoML infrastructure at Amazon before leaving to work in startups. I believe I create the highest quality software content so that’s what I’m doing now. Drop a comment to let me know!
