Download a YouTube Transcript in 3 Lines of Python

Do you ever want to read the transcript of a YouTube video rather than watch or listen to it? Maybe you want to learn without using much data, or you’re in a public place. There are several ways to get a YouTube transcript, including downloading the video and transcribing the audio yourself. A much easier and quicker way to get the transcript of a YouTube video in Python is to use the YouTube-Transcript-API.

Overview of Getting a YouTube Transcript with Python

In this post we will cover:

  • What is YouTube-Transcript-API?
  • How Can I Download a YouTube Video Transcript?
    • Code Example of YouTube-Transcript-API Downloading a YouTube Transcript
  • How Can I Download Transcripts from a YouTube Playlist?
    • Use YouTube-Transcript-API for Downloading Multiple Transcripts, Example 1
    • Use YouTube-Transcript-API for Downloading Multiple Transcripts, Example 2
  • Summary of Downloading a YouTube Transcript with Python

What is YouTube-Transcript-API?

The YouTube-Transcript-API is a Python library that fetches the captions attached to a YouTube video. Instead of going through the trouble of downloading a video and transcribing it, you can simply fetch the pre-existing captions. The downside is that it cannot get a transcript when there are no captions attached to the video. Sometimes the captions were added manually, and sometimes they were generated by YouTube.

How Can I Download a YouTube Video Transcript?

The YouTube-Transcript-API provides a simple interface to download video transcripts from YouTube. You can download one YouTube video transcript at a time or several at once. In this section, we’ll cover how to download one YouTube video transcript using the YouTube-Transcript-API from a Python script. The library also comes with a command line interface (CLI) tool, but this post sticks to Python. For this example, we’ll download the transcript of a video of a software engineer creating a web scraper from scratch.

To follow along, you’ll need to install the youtube-transcript-api library with the following line in your command prompt:

pip install youtube-transcript-api

I’ve circled the ID of a YouTube video in the image below. It is the 11-character code after the “?v=” and, if the video is also in a playlist, before the “&”.

We can copy and paste the ID or use Python to parse it out. To parse the ID out with Python we’ll start by saving the link as a string. Then we’ll use the split() function to split the string into a list on the = sign, and take the second element in that list. Then we split that element using split() again on the & sign and take the first element from the resulting list. We should then have the video ID.

example_url = "https://www.youtube.com/watch?v=pN3jRihVpGk&list=PLKiU8vyKB6ti1_rUlpZJFdPaxT04sUIoV&index=1"
_id = example_url.split("=")[1].split("&")[0]
print(_id)
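The split() approach above assumes that v is the first parameter in the link. As a minimal sketch of a more forgiving alternative, the standard library's urllib.parse can read the v parameter wherever it appears in the query string. The helper name extract_video_id is made up for this illustration, and the sketch only handles links of the watch?v= form.

```python
from urllib.parse import urlparse, parse_qs


def extract_video_id(url):
    """Return the value of the `v` query parameter, or None if it is absent."""
    query = parse_qs(urlparse(url).query)
    values = query.get("v")
    return values[0] if values else None


example_url = "https://www.youtube.com/watch?v=pN3jRihVpGk&list=PLKiU8vyKB6ti1_rUlpZJFdPaxT04sUIoV&index=1"
print(extract_video_id(example_url))  # pN3jRihVpGk
```

Because the query string is parsed rather than split by position, the same call still works if the list parameter happens to come before v.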

Code Example of YouTube-Transcript-API Downloading a YouTube Transcript

We can easily download a YouTube transcript to a JSON file with the YouTube-Transcript-API. This API provides a simple class, YouTubeTranscriptApi that can be used to download YouTube transcripts. The API has two functions for downloading transcripts, get_transcript, and get_transcripts. We’ll be using get_transcript in this section to download the transcript of one YouTube video. Here’s where we download a YouTube transcript in 3 lines of Python.

It’s actually 5 lines of Python if you count the import statements, or only 1 line of Python if you’re just counting getting the transcript. Semantics aside, let’s start by importing the libraries we need. We’ll need to import the YouTubeTranscriptApi class from the youtube_transcript_api library and the json library.

Next, we’ll set an _id variable equal to the ID of the YouTube video we want to download. Then, we’ll use the YouTubeTranscriptApi class and call get_transcript on that _id to get the transcript in a form that can be written out as JSON. Finally, we’ll use open to open a JSON file and dump the transcript into it.

from youtube_transcript_api import YouTubeTranscriptApi
import json

_id = "sQuFl0PSoXo"
transcript = YouTubeTranscriptApi.get_transcript(_id)
# Save the transcript to a JSON file named after the video ID
with open(f'{_id}.json', 'w', encoding='utf-8') as json_file:
    json.dump(transcript, json_file)
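If you want readable text rather than JSON, here is a small sketch. It assumes the transcript is a list of dictionaries, each with a text field plus start and duration timings, which is the shape get_transcript returned in the version of the library used for this post. The sample data and the helper name transcript_to_text are made up for illustration.

```python
def transcript_to_text(transcript):
    """Join the `text` field of every caption segment into one string."""
    return " ".join(segment["text"].strip() for segment in transcript)


# Hypothetical sample data in the assumed shape, not a real transcript.
sample_transcript = [
    {"text": "hello and welcome", "start": 0.0, "duration": 1.5},
    {"text": "to the tutorial", "start": 1.5, "duration": 2.0},
]

print(transcript_to_text(sample_transcript))  # hello and welcome to the tutorial
```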

How Can I Download Transcripts from a YouTube Playlist?

Let’s say you don’t want to download just one video; you want to download a whole playlist like this one: ai content moderator – YouTube. How can you download the whole playlist? You’ll need the IDs of all the videos in the playlist. You can collect them manually, or programmatically by applying the link-parsing example above to each video link in the playlist.
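As a rough sketch of the programmatic route, the same split() parsing from earlier can be applied to a list of links. The first link is the example URL from this post. The other two are built from video IDs used later in the post, and their list and index parameters are placeholders for illustration.

```python
# Illustrative links only: the playlist parameters on the last two are assumed.
links = [
    "https://www.youtube.com/watch?v=pN3jRihVpGk&list=PLKiU8vyKB6ti1_rUlpZJFdPaxT04sUIoV&index=1",
    "https://www.youtube.com/watch?v=F_7xepUPH7E&list=PLKiU8vyKB6ti1_rUlpZJFdPaxT04sUIoV&index=2",
    "https://www.youtube.com/watch?v=d5ib3qjQkwk&list=PLKiU8vyKB6ti1_rUlpZJFdPaxT04sUIoV&index=3",
]

# Same parsing as the single-link example: text after the first "=" and before the first "&"
video_ids = [link.split("=")[1].split("&")[0] for link in links]
print(video_ids)  # ['pN3jRihVpGk', 'F_7xepUPH7E', 'd5ib3qjQkwk']
```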

There are actually two ways to get multiple transcripts using the YouTubeTranscriptApi class. Earlier we mentioned that there is a get_transcripts method as well as a get_transcript method. Here we’ll go over two examples, one using get_transcripts and one using get_transcript.

Use YouTube-Transcript-API for Downloading Multiple Transcripts, Example 1

The get_transcripts method works on a list of video IDs. It returns a tuple whose first element holds the transcripts that were fetched successfully, keyed by video ID, and whose second element is a list of the video IDs that could not be fetched. To use this function, we’ll start with the usual imports, YouTubeTranscriptApi and json. Then, we’ll pass our list of video IDs to the get_transcripts function. Finally, we’ll save each of the transcripts in the first entry of the tuple to its own JSON file. Note that get_transcripts will not necessarily return everything in order.

from youtube_transcript_api import YouTubeTranscriptApi
import json

transcripts = YouTubeTranscriptApi.get_transcripts(["pN3jRihVpGk", "F_7xepUPH7E", "d5ib3qjQkwk", "EfY_GG4cqHM"])
# transcripts[0] maps each video ID to its transcript
for video_id in transcripts[0]:
    with open(f'{video_id}.json', 'w', encoding='utf-8') as json_file:
        json.dump(transcripts[0][video_id], json_file)
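If you need the results in the order you asked for them, one option is to walk your original list of IDs and look each one up in the returned transcripts. The sketch below assumes the first tuple element behaves like a dictionary keyed by video ID. The helper name order_results and the stand-in data are made up so the example runs without a network call.

```python
def order_results(requested_ids, transcripts_by_id):
    """Return (ordered, missing): fetched transcripts in request order, plus IDs with no result."""
    ordered = [
        (video_id, transcripts_by_id[video_id])
        for video_id in requested_ids
        if video_id in transcripts_by_id
    ]
    missing = [video_id for video_id in requested_ids if video_id not in transcripts_by_id]
    return ordered, missing


requested = ["pN3jRihVpGk", "F_7xepUPH7E", "d5ib3qjQkwk"]
# Hypothetical stand-in for the first tuple element, deliberately out of order.
fetched = {
    "d5ib3qjQkwk": [{"text": "third video"}],
    "pN3jRihVpGk": [{"text": "first video"}],
}

ordered, missing = order_results(requested, fetched)
print([video_id for video_id, _ in ordered])  # ['pN3jRihVpGk', 'd5ib3qjQkwk']
print(missing)  # ['F_7xepUPH7E']
```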

Use YouTube-Transcript-API for Downloading Multiple Transcripts, Example 2

The second way we can download multiple transcripts is more specific to playlists and more verbose. Use this method when you want to know exactly which indexes in the playlist weren’t downloaded. As usual we’ll start with our imports of the YouTubeTranscriptApi class and the json library. Then we’ll declare a list of the desired video IDs. Next, we’ll loop through that enumerated list and try to download and save the video transcript to a JSON file. If we can’t download it, we’ll print out a statement telling us which index in the playlist failed to download.

from youtube_transcript_api import YouTubeTranscriptApi
import json

_ids = ["pN3jRihVpGk", "F_7xepUPH7E", "d5ib3qjQkwk", "EfY_GG4cqHM"]

for index, _id in enumerate(_ids):
    try:
        transcript = YouTubeTranscriptApi.get_transcript(_id)
        # Save each transcript to a JSON file named after its video ID
        with open(f'{_id}.json', 'w', encoding='utf-8') as json_file:
            json.dump(transcript, json_file)
    except Exception:
        # Report which position in the list failed and keep going
        print(f"playlist index {index} not valid")
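A variation on this loop is to collect the failures instead of only printing them, so you can retry or report them later. This is a sketch rather than part of the library: download_all and fake_fetch are names invented here, and the fetch function is passed in as a parameter so the example runs offline.

```python
def download_all(video_ids, fetch):
    """Call fetch(video_id) for each ID and return (transcripts, failures)."""
    transcripts = {}
    failures = []  # each entry is (index, video_id, error message)
    for index, video_id in enumerate(video_ids):
        try:
            transcripts[video_id] = fetch(video_id)
        except Exception as error:
            failures.append((index, video_id, str(error)))
    return transcripts, failures


def fake_fetch(video_id):
    """Hypothetical stand-in for a real fetcher so the sketch runs offline."""
    if video_id == "bad_id":
        raise ValueError("no captions available")
    return [{"text": f"caption for {video_id}"}]


transcripts, failures = download_all(["pN3jRihVpGk", "bad_id"], fake_fetch)
print(failures)  # [(1, 'bad_id', 'no captions available')]
```

In a real script you would pass YouTubeTranscriptApi.get_transcript as the fetch argument in place of fake_fetch.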

Summary of Downloading a YouTube Transcript with Python

In this post we learned how to download a transcript from a YouTube video with the YouTube-Transcript-API library for Python. We learned how to parse a YouTube link to get the ID of a video and how to download the transcript of a single video. We then learned how to download the transcripts of multiple videos using both the get_transcripts and get_transcript functions.

Yujian Tang

6 thoughts on “Download a YouTube Transcript in 3 Lines of Python”

  1. Thank you. I used below code to place the text in a nice format

    maximum_chars=80
    text=""
    for kt in range(len(transcript)):
        text=text+' '+transcript[kt]['text']

    words=text.split()
    current_line_length=0
    current_line=""
    for kw in range(len(words)):
        nc=len(words[kw])+current_line_length

        if nc < maximum_chars:
            current_line=current_line+' '+words[kw]
            current_line_length=nc
        else:
            print(current_line)
            print(' ')
            # start the next line with the word that did not fit
            current_line=words[kw]
            current_line_length=len(words[kw])
    # print whatever is left in the last line
    print(current_line)
