This is a follow-up to the Ask NLP projects on What Does US News Say About Top Colleges. Top Colleges Part 1 was a naive analysis without data cleaning, where we pulled the most positive and most negative sentences about each college and found tons of repeated data. In Top Colleges Part 2, we cleaned the data and repeated what we did in Part 1, and found that removing repeated sentences gave us much more interpretable results. For this episode, Part 3, we’ll assume that you’ve already cleaned your data as we did in Part 2. This episode is focused on finding the most common phrases and producing objective summaries of each school. To get the data that we’ve been analyzing, check out Web Scraping the Easy Way with Python, Selenium, and Beautiful Soup 4. SKIP TO THE RESULTS HERE
To follow this tutorial, you’ll need to get your free API key from The Text API. Simply scroll down the page and click “Get My Free API Key”.
Using Python to Summarize a Text and Get the Most Common Phrases
Alright, so from this point we’ll assume that you’ve already gotten the data and used the code in Part 2 to clean it. What we’re about to program is very similar to what we programmed in Part 1. We’ll start our program with our imports as usual. We need requests to send our HTTP requests and json to parse the JSON responses. We also need to import our Text API key. You can get your free API key from The Text API.
import requests
import json
from text_api_config import apikey
Setting Up the API Requests
Let’s start by setting up our API requests. We’ll need to create headers, define the URLs, and create a JSON body. The JSON body will be different for each text that we analyze, but the headers and URLs can be reused. We’ll declare the headers and URLs here. The headers tell the server that the content type we’re sending is JSON and also pass in the API key. The URLs we’ll be using are the most_common_phrases and summarize endpoints.
headers = {
"Content-Type": "application/json",
"apikey": apikey
}
text_url = "https://app.thetextapi.com/text/"
mcps_url = text_url + "most_common_phrases"
summarize_url = text_url + "summarize"
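Both endpoints return a JSON body, which the loop further down parses with json.loads. As a quick illustration of that parsing, here’s a sketch with a made-up response body (the real body comes back from the API; the “summary” key is the one the code in this post reads):

```python
import json

# A made-up stand-in for response.text -- not real API output.
response_text = '{"summary": "Caltech is a small research university in Pasadena."}'

# This is exactly how the loop below turns the raw body into a dict.
_dict = json.loads(response_text)
print(_dict["summary"])
```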
Now let’s declare our filenames that we’ll be saving our results to and the filenames of the files we’ll be reading our text from. We’ll also put all the filenames for the colleges in a list. Later we’ll loop through this list to read each of the text files in.
mcps_filename = "mcps_universities.json"
summaries_filename = "university_summaries.json"
caltech = "california-institute-of-technology-1131.txt"
columbia = "columbia-university-2707.txt"
duke = "duke-university-2920.txt"
harvard = "harvard-university-2155.txt"
mit = "massachusetts-institute-of-technology-2178.txt"
princeton = "princeton-university-2627.txt"
stanford = "stanford-university-1305.txt"
uchicago = "university-of-chicago-1774.txt"
penn = "university-of-pennsylvania-3378.txt"
yale = "yale-university-1426.txt"
university_files = [caltech, columbia, duke, harvard, mit, princeton, stanford, uchicago, penn, yale]
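As a side note, the results below are keyed by these raw filenames. If you’d rather key them by a readable school name, a small hypothetical helper like this (not part of the original code) could strip the trailing US News page id and the extension:

```python
import re

def school_name(filename):
    """Turn a scraped filename like
    'california-institute-of-technology-1131.txt'
    into a readable name. The trailing number is the page id."""
    stem = filename.rsplit(".", 1)[0]   # drop the .txt extension
    stem = re.sub(r"-\d+$", "", stem)   # drop the trailing page id
    return stem.replace("-", " ").title()

print(school_name("california-institute-of-technology-1131.txt"))
# California Institute Of Technology
```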
Calling the Summarize and Most Common Phrases Endpoints
We’ll start by creating two dictionaries for the summaries and the most common endpoints.
mcps = {}
summaries = {}
Then we’ll loop through that list of filenames for each of the colleges we scraped earlier. For each file, we’ll open it up and read it in as the text. Then we’ll create a body and send the text to the summarize endpoint. Before we send a request to the most_common_phrases endpoint, we’ll have to clean our text a little more. How do I know that we’ve got to clean these phrases out of the text? Because I ran it without removing them and saw the same phrases in six or seven of the ten results, which is a clear indicator that they’re repetitive boilerplate that shouldn’t be counted.
for university in university_files:
    with open(university, "r") as f:
        text = f.read()
    # summarize the raw text
    body = {
        "text": text
    }
    response = requests.post(url=summarize_url, headers=headers, json=body)
    _dict = json.loads(response.text)
    summaries[university] = _dict["summary"]
    # strip newlines, markdown hashes, and repetitive boilerplate phrases
    # before asking for the most common phrases
    text = text.replace("\n", "")
    text = text.replace("#", "")
    text = text.replace("Students", "")
    text = text.replace("student satisfaction", "")
    body = {
        "text": text
    }
    response = requests.post(url=mcps_url, headers=headers, json=body)
    _dict = json.loads(response.text)
    mcps[university] = _dict["most common phrases"]
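If you find yourself adding more of these filler phrases over time, the chain of replace calls can be factored into a small helper. This is just a sketch (the function name and list are mine, not the original code), using the same phrases the loop above strips:

```python
# Phrases and characters that drown out the interesting results.
FILLER = ["\n", "#", "Students", "student satisfaction"]

def clean_for_phrases(text, filler=FILLER):
    """Strip newlines, markdown hashes, and boilerplate phrases that
    would otherwise dominate the most common phrases."""
    for phrase in filler:
        text = text.replace(phrase, "")
    return text

print(clean_for_phrases("Students love it.\n# student satisfaction high"))
```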
Saving Our Results in a JSON
After we’ve looped through all the files, we’ll have all our data saved in the dictionaries we created earlier. All we have to do is open the output files and use json.dump to save each dictionary into a file.
with open(mcps_filename, "w") as f:
json.dump(mcps, f)
with open(summaries_filename, "w") as f:
json.dump(summaries, f)
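Reading those results back in later is just the reverse, with json.load. A quick round-trip sketch (the sample dictionary and filename here are made up for illustration):

```python
import json

sample = {"yale-university-1426.txt": ["student organizations", "student service"]}

# Write the dictionary out the same way the script above does...
with open("sample_mcps.json", "w") as f:
    json.dump(sample, f)

# ...and read it back with json.load when you need it again.
with open("sample_mcps.json", "r") as f:
    loaded = json.load(f)

print(loaded["yale-university-1426.txt"][0])  # student organizations
```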
The Most Common Phrases for Each Top 10 College
Now that we’ve finished coding up our Python program to get the most common phrases and summaries of the top 10 colleges, let’s take a look at our results. We’ll only check out the most common phrases here. The summaries are already up in another article on the AI Summaries of the Top 10 Schools in America.
Let’s take a look at what we can learn from this analysis.
- Caltech has student waiters – apparently it’s a tradition to have student waiters serve dinners for student dining.
- Looks like Barack Obama went to Columbia
- Duke has a lot of schools
- Harvard has a lot of schools
- MIT really is an engineering school
- US News has nothing of importance to say about Princeton
- Stanford has an emphasis on student organizations
- UChicago has a variety of programs mentioned from Computer Science to Public Policy to Study Abroad
- Penn is focused on the sciences
- (and finally) Yale has a focus on student organizations and student service
Caltech:
- “Student houses”
- “Caltech alumni”
- “student waiters”
Columbia:
- “former President Barack Obama”
- “Business School”
- “Columbia University admissions”
Duke:
- “Sanford School”
- “Nicholas School”
- “Pratt School”
Harvard:
- “John F. Kennedy School”
- “Graduate Education School”
- “Business School”
MIT:
- “Biomedical Engineering”
- “Engineering”
- “Mechanical Engineering”
Princeton:
- “students”
- “undergraduate students”
- “Princeton University admissions”
Stanford:
- “Stanford University admissions”
- “Graduate School”
- “student organizations”
UChicago: (No problems this time around!)
- “study abroad experiences”
- “Public Policy Analysis”
- “Computer Science”
Penn:
- “Physical Sciences”
- “Social Sciences”
- “Sciences”
Yale:
- “student organizations”
- “student service”
- “Yale University admissions”
To learn more feel free to reach out to me @yujian_tang on Twitter, connect with me on LinkedIn, and join our Discord. Remember to follow the blog to stay updated with cool Python projects and ways to level up your Python skills!