This is a follow-up post to Ask NLP: What Does US News Say About Top Colleges? Part 1. In the last article, we discovered that each page about the top 10 colleges ranked by US News had a LOT of repeated text. To deal with this, we’ll clean the data first. How will we clean the data? Once again, The Text API comes in handy with an endpoint to factor out the repeated sentences.
For this article, we’ll assume that you have already done the web scraping. A tutorial for building a simple web scraper can be found at Web Scraping the Easy Way: Python, Selenium, BeautifulSoup. As always, we’ll have to `pip` install the libraries that we need for the project before we can do anything. For this project, we’ll just need to install `requests` and sign up for a free API key at The Text API.
pip install requests
Cleaning the Texts
Let’s get rid of all the annoying repeated text in our documents. The Text API has an endpoint called `similarity_by_sentences` to do this. It takes a list of two texts and returns a list of the sentences they have in common as well as the cleaned version of both texts. For our use case, we can clean any two of the text files we scraped and then use the repeated sentences to clean the rest of the text files.
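To build some intuition for what an endpoint like this gives back, here’s a rough local sketch that finds the sentences two texts share. This is not how The Text API does it. `split_sentences` and `shared_sentences` are made-up helper names, and the punctuation-based splitter is a simplifying assumption.

```python
import re

def split_sentences(text):
    # Naive splitter (an assumption): break after ., ! or ? followed by whitespace.
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

def shared_sentences(text_a, text_b):
    """Return sentences of text_a that also appear in text_b, in order, without duplicates."""
    in_b = set(split_sentences(text_b))
    seen = set()
    shared = []
    for sentence in split_sentences(text_a):
        if sentence in in_b and sentence not in seen:
            shared.append(sentence)
            seen.add(sentence)
    return shared
```

Only exact matches count here, so two sentences that differ by a single character (a school name, for example) would not be treated as repeats.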
We should create a file just for text cleaning. It’s best practice to separate out your files into modular functions. Our imports for this file will be the same as the ones we did for part 1. We just need the `requests` library to send HTTP requests and the `json` library to parse JSON. Then we’ll also import our API key from our config. You can optionally hard code your API key, but it’s better to have a config file for it.
import requests
import json
from text_api_config import apikey
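If you don’t have a `text_api_config.py` yet, one possible way to set it up is to read the key from an environment variable instead of typing it into the file. The variable name `TEXT_API_KEY` and the helper `load_apikey` are assumptions for this sketch, not something The Text API requires.

```python
# text_api_config.py (sketch)
import os

def load_apikey(env_var="TEXT_API_KEY", default=""):
    """Return the API key stored in the given environment variable, or the default."""
    return os.environ.get(env_var, default)

# The name the cleaning script imports
apikey = load_apikey()
```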
Then we set up our headers to tell the server that we’re sending JSON and pass it our API key. Let’s also set up our endpoint here so we know where to send our POST request.
headers = {
    "Content-Type": "application/json",
    "apikey": apikey
}
text_url = "https://app.thetextapi.com/text/"
sim_sentences_url = text_url + "similarity_by_sentences"
Just like before, we’ll also need to set up a list of all the files. Unlike before, we won’t include the first two universities in our list because we’re going to handle those separately.
caltech = "california-institute-of-technology-1131.txt"
columbia = "columbia-university-2707.txt"
duke = "duke-university-2920.txt"
harvard = "harvard-university-2155.txt"
mit = "massachusetts-institute-of-technology-2178.txt"
princeton = "princeton-university-2627.txt"
stanford = "stanford-university-1305.txt"
uchicago = "university-of-chicago-1774.txt"
penn = "university-of-pennsylvania-3378.txt"
yale = "yale-university-1426.txt"
university_files = [duke, harvard, mit, princeton, stanford, uchicago, penn, yale]
Using the First Two Texts to Get Repeat Sentences
Now let’s set up our request. We’ll open up the two files that we excluded from the list and read them in as strings. Then we’ll set up the body to send the list of the two strings to the `similarity_by_sentences` endpoint.
# remove the most common sentences and see what they are
with open(caltech, "r") as f:
    text_caltech = f.read()
with open(columbia, "r") as f:
    text_columbia = f.read()
body = {
    "texts": [text_caltech, text_columbia]
}
After sending our request, we’ll get a response back. We’ll use the `json` module to parse the string form of the response into a dictionary. From the dictionary version of the response we’ll extract out the list of repeated sentences and cleaned texts. You’ll notice that I removed something from the list of repeated sentences. That is because I noticed that it included just a regular space as a sentence, meaning that somewhere in these text files there are just spaces being used as sentences. Thanks, US News. Let’s also save our newly cleaned text files using the same filename as before.
response = requests.post(sim_sentences_url, headers=headers, json=body)
_dict = json.loads(response.text)
repeats = _dict["repeat sentences"]
repeats.remove(' ')
new_caltech = _dict["doc1 cleaned"]
new_columbia = _dict["doc2 cleaned"]
with open(caltech, "w") as f:
    f.write(new_caltech)
with open(columbia, "w") as f:
    f.write(new_columbia)
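One thing to watch: `list.remove` raises a `ValueError` if the item isn’t in the list, so the `remove(' ')` call above only works when a lone space really does show up in the repeats. A slightly more forgiving option is to filter out anything that is only whitespace. The helper name `drop_blank_sentences` and the sample list are made up for illustration.

```python
def drop_blank_sentences(sentences):
    """Return only the sentences that contain at least one non-whitespace character."""
    return [s for s in sentences if s.strip()]

# Made-up sample data, not real API output
sample_repeats = ["Some repeated sentence.", " ", "", "\n"]
cleaned_repeats = drop_blank_sentences(sample_repeats)
```

With this approach, the same line works whether or not the blank entries are there.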
Cleaning the Rest of the Texts
After we’ve cleaned our original two files and extracted out the repeat sentences, let’s also clean the other files. All we do is read them in as strings and then replace each repeated sentence with an empty string. We’ll also split on “Campus safety data were” and only take the first element. The way that the text is structured, the University name plays a role in that next sentence and it isn’t included in repeat sentences, but it’s the same in all of the files. Then we just save them back into the same file names just like before. Notice that we didn’t do that for our original two files and we’ll have to go in and do that manually.
# remove all repeat sentences from other docs
# save each one as a cleaned text file
for university in university_files:
    with open(university, "r") as f:
        text = f.read()
    for sent in repeats:
        text = text.replace(sent, "")
    texts = text.split("Campus safety data were")
    with open(university, "w") as f:
        f.write(texts[0])
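If you’d rather not do the manual step for the first two files, the cleaning logic can be pulled into a small function and reused. This is just a sketch: `clean_text` is a name I’m making up here, and it assumes the same “Campus safety data were” marker appears in every file.

```python
def clean_text(text, repeats, cutoff="Campus safety data were"):
    """Remove each repeated sentence, then keep only the text before the cutoff marker."""
    for sent in repeats:
        if sent:  # skip empty strings, there is nothing to remove
            text = text.replace(sent, "")
    # If the marker is missing, split returns the whole text as the only element
    return text.split(cutoff)[0]
```

Calling `clean_text(new_caltech, repeats)` and `clean_text(new_columbia, repeats)` before writing those files would apply the same cutoff to them.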
Re-Analysis of What US News Says About the Top 10 Colleges
In the last article, we covered what US News says about the top 10 colleges in the form of the most positive and the most negative sentences. For this, we’re going to reuse the exact same code. You can find the code to get the following results by reading Ask NLP: What Does US News Say About Top Colleges? Part 1.
Cleaned Results – Most Positive Sentences
These are the sentences with the highest text polarity scores as determined by a state-of-the-art NLP API. We’ll see from our results this time that cleaning the text has big payoffs. Each school now actually has some text that is about that specific school. We can learn cool things like:
- Frank Capra graduated from Caltech
- More than 90% of Columbia students live on campus
- Duke has a scholarship for both UNC and Duke students
And so on. To note – UChicago only returned two sentences and both were the same, weird. We’ll look more into the UChicago case in the next article (Part 3!)
Caltech:
- “ California Institute of Technology admissions is most selective with an acceptance rate of 7%.”
- “Famous film director Frank Capra also graduated from Caltech.”
- “California Institute of Technology’s ranking in the 2022 edition of Best Colleges is National Universities, #9.”
Columbia:
- “Columbia University admissions is most selective with an acceptance rate of 6%.”
- “Columbia University’s ranking in the 2022 edition of Best Colleges is National Universities, #2.”
- “More than 90% of students live on campus.”
Duke:
- “Duke’s most esteemed undergraduate scholarship, the Robertson Scholars Leadership Program, is offered to students at both Duke University and the University of North Carolina — Chapel Hill.”
- “Duke University’s ranking in the 2022 edition of Best Colleges is National Universities, #9.”
- “Duke University is divided into 10 schools and colleges, many of which serve both undergraduate and graduate students.”
Harvard:
- “The first commencement ceremony at Harvard, held in 1642, had nine graduates.”
- “Harvard University admissions is most selective with an acceptance rate of 5%\n and an early acceptance rate of 13.9%.”
- “Harvard University’s ranking in the 2022 edition of Best Colleges is National Universities, #2.”
MIT:
- “\n At Massachusetts Institute of Technology, 60% of full-time undergraduates\n receive some kind of need-based financial aid, and the\n average need-based scholarship or grant award is $49,315.\n “
- “Massachusetts Institute of Technology’s ranking in the 2022 edition of Best Colleges is National Universities, #2.”
- “The most popular majors at Massachusetts Institute of Technology include: Computer Science; Mechanical Engineering; Mathematics, General; Physics, General; Aerospace, Aeronautical, and Astronautical/Space Engineering, General; Bioengineering and Biomedical Engineering; Econometrics and Quantitative Economics; Electrical and Electronics Engineering; Biology/Biological Sciences, General; and Chemical Engineering.”
Princeton:
- “Students live in one of six residential colleges that provide a residential community as well as dining services but have the option to join one of more than 10 eating clubs for their junior and senior years.”
- “Princeton University’s ranking in the 2022 edition of Best Colleges is National Universities, #1.”
- “The Princeton Tigers, members of the Ivy League, are well known for their consistently strong men’s and women’s lacrosse teams.”
Stanford:
- “Stanford also has successful programs in tennis and golf.”
- “Stanford University’s ranking in the 2022 edition of Best Colleges is National Universities, #6.”
- “The most popular majors at Stanford University include: Computer and Information Sciences and Support Services; Multi/Interdisciplinary Studies; and Engineering.”
UChicago: (it’s the same lol)
- “University of Chicago’s ranking in the 2020 edition of Best Colleges is National Universities, #6.”
- “University of Chicago’s ranking in the 2022 edition of Best Colleges is National Universities, #6.”
Penn:
- “Notable Penn alumni include singer John Legend, poet William Carlos Williams and President Donald Trump.”
- “The Penn Quakers have more than 25 NCAA Division I sports that compete in the Ivy League, and are noted for successful basketball and lacrosse teams.”
- “University of Pennsylvania’s ranking in the 2022 edition of Best Colleges is National Universities, #8.”
Yale:
- “Yale University’s ranking in the 2022 edition of Best Colleges is National Universities, #5.”
- “ Yale University admissions is most selective with an acceptance rate of 7%.”
- “Yale University, located in New Haven, Connecticut, is known for its excellent drama and music programs, which reach outside the classroom with student organizations such as the Yale Whiffenpoofs, a famous a cappella group, and the Yale Dramatic Association.”
Wow, this cleaned group tells us so much more about each school! Each school actually has some unique qualities in their most positive sentences now. We can tell the rank, acceptance rate, and some interesting fact about all the schools. Except UChicago, which continues to cause problems for us.
Cleaned Results – Most Negative Sentences
These are the sentences with the lowest text polarity scores as determined by a state-of-the-art NLP API. We shouldn’t expect much negativity. After all, US News gets paid to say good things about schools and especially the top 10 schools. Here we see some subjectively negative things like:
- Caltech is 11 miles from LA
- 30% of Duke is in Greek life
- Yale has an average freshman retention rate of 91%
And more. You’ll also notice something weird going on with UChicago in this one as well. Also, Princeton and Penn each have a “ “ as a sentence. I’m 100% sure that sentence scores a 0 on polarity, and if that’s the most negative sentence about a school, all the other sentences are subjectively positive!
Caltech:
- “Caltech, which focuses on science and engineering, is located in Pasadena, California, approximately 11 miles northeast of Los Angeles.”
- “Integral to student life is the Honor Code, which dictates that \”No member of the Caltech community shall take unfair advantage of any other member of the Caltech community.\”
- “Half the applicants admitted to California Institute of Technology have an\n SAT score between 1530 and 1580 or an ACT score of 35 and 36.”
Columbia:
- “The average freshman retention rate, an indicator of student satisfaction, is 98%.”
- “The Columbia University Medical Center — home to the medical, nursing, dental and public health faculties — is located in northern Manhattan in the Washington Heights neighborhood.”
- “Half the applicants admitted to Columbia University have an\n SAT score between 1470 and 1570 or an ACT score of 33 and 35.”
Duke:
- “Approximately 30 percent of the student body is affiliated with Greek life, which encompasses almost 40 fraternities and sororities.”
- “The average freshman retention rate, an indicator of student satisfaction, is 97%.”
- “Half the applicants admitted to Duke University have an\n SAT score between 1470 and 1570 or an ACT score of 34 and 35.”
Harvard:
- “Half the applicants admitted to Harvard University have an\n SAT score between 1460 and 1580 or an ACT score of 33 and 35.”
- “The average freshman retention rate, an indicator of student satisfaction, is 92%.”
- “Harvard is named after a Puritan minister — John Harvard — who, in 1638, left his 400-book library and half of his estate to the young school.”
MIT:
- “Architect Steven Holl designed one dorm, commonly called \”The Sponge.\”
- “The average freshman retention rate, an indicator of student satisfaction, is 99%.”
- “Half the applicants admitted to Massachusetts Institute of Technology have an\n SAT score between 1510 and 1580 or an ACT score of 34 and 36.”
Princeton:
- “ “
- “The average freshman retention rate, an indicator of student satisfaction, is 94%.”
- “Half the applicants admitted to Princeton University have an\n SAT score between 1450 and 1570 or an ACT score of 32 and 35.”
Stanford:
- “Half the applicants admitted to Stanford University have an\n SAT score between 1420 and 1570 or an ACT score of 31 and 35.”
- “The Stanford Cardinal are well known for the traditional \”Big Game\” against Cal, an annual football competition that awards the Stanford Axe — a sought-after trophy — to the victor.”
- “Greek life at Stanford represents approximately 25 percent of the student body.”
UChicago:
- “It has a total undergraduate enrollment of approximately 6800 students, its setting is urban, and the campus size is 217 acres.”
Penn:
- “ “
- “Half the applicants admitted to University of Pennsylvania have an\n SAT score between 1460 and 1570 or an ACT score of 33 and 35.”
- “The average freshman retention rate, an indicator of student satisfaction, is 97%.”
Yale:
- “Half the applicants admitted to Yale University have an\n SAT score between 1460 and 1580 or an ACT score of 33 and 35.”
- “Yale is well known for its secret societies, the most famous of which are the Skull and Bones Society, which boasts members such as George W. Bush and John Kerry, and the Scroll and Key Society.”
- “The average freshman retention rate, an indicator of student satisfaction, is 91%.”
UChicago. Again, with the not having three sentences. This time we also see some interesting behavior from the Princeton and Penn sentences. This is due to the structure of the web scraped data. Could we get rid of these blank sentences? Yes. Why didn’t I? Well, I did actually, but that caused other problems, such as fusing some sentences and words together.
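One way to sidestep the blank sentences without touching the text files (and without fusing words together) might be to skip them at ranking time. The sketch below assumes the polarity results are available as a list of `(sentence, score)` pairs. That shape, the function name `top_and_bottom`, and the sample scores are all assumptions for illustration, not the actual output of the Part 1 code.

```python
def top_and_bottom(scored, n=3):
    """Return (most positive, most negative) sentences, ignoring blank ones."""
    kept = [(s, p) for s, p in scored if s.strip()]
    ranked = sorted(kept, key=lambda pair: pair[1], reverse=True)
    most_positive = [s for s, _ in ranked[:n]]
    most_negative = [s for s, _ in ranked[::-1][:n]]
    return most_positive, most_negative

# Made-up scores, just to show the shape of the input
sample_scores = [("Good.", 0.9), (" ", 0.0), ("Bad.", -0.5), ("Meh.", 0.1)]
positive, negative = top_and_bottom(sample_scores, n=2)
```

Because the filtering happens on the scored list, the saved text stays exactly as it was.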
Be on the lookout for part 3 where we’ll dive into some of the idiosyncrasies of particular results and also take a look at the summaries and most common phrases mentioned for each school! To learn more feel free to reach out to me @yujian_tang on Twitter, connect with me on LinkedIn, and join our Discord. Remember to follow the blog to stay updated with cool Python projects and ways to level up your Python skills!