Ask NLP: What Does US News Say About Top Colleges? Part 1

I don’t know if any of y’all checked out the US News Top Colleges rankings when you were in high school and applying to colleges, but I did. Back then the site wasn’t trashed by ads like it is now, so it was actually readable. Now it’s not. Thankfully, we have technology: we can simply scrape the site and get all the useful text from each page. For the full tutorial on how to do that, check out Web Scraping the Easy Way with Python, Selenium, and Beautiful Soup. If you only care about the results, skip ahead to the most positive and most negative sentences further down.

Disclaimer: what we learn from this post will actually show us two things: the amount of repeated text on US News’ college rankings and the importance of cleaning data!

For this tutorial we’ll start with the text documents that we generated when we scraped the US News website. To follow this tutorial we’ll only need to install the requests library and get a free API key from The Text API. We’ll use pip to install requests in the command line and then get started.

pip install requests

Using The Text API to Extract Information from Our Documents

Alright, let’s dive in. We’ll need to import the requests and json libraries to send HTTP requests and parse the JSON responses that come back. I’ve stored my API key in another file and imported it here. You can store yours directly in your script or in a config file and import it.

import requests
import json
 
from text_api_config import apikey
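If you would rather not keep the key in a Python file at all, one option is to read it from an environment variable. This is a minimal sketch; the variable name TEXT_API_KEY and the helper load_api_key are names I made up for illustration, not something The Text API requires.

```python
import os


def load_api_key(env_var="TEXT_API_KEY"):
    """Return the API key stored in an environment variable.

    TEXT_API_KEY is a hypothetical name chosen for this sketch.
    """
    key = os.environ.get(env_var)
    if not key:
        raise RuntimeError(f"Set the {env_var} environment variable first")
    return key
```

With this in place, `apikey = load_api_key()` would stand in for the import from text_api_config.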

For this tutorial we’ll get the most positive and most negative sentences on each school’s page. The first thing we have to do is get the text files that we saved the scraped text to. We created these in the last post on how to build a web scraper. You may pick whichever text documents you want. Once we have the names of all the text documents we want to analyze, we’ll put them into a list so we can easily iterate through them.

caltech = "california-institute-of-technology-1131.txt"
columbia = "columbia-university-2707.txt"
duke = "duke-university-2920.txt"
harvard = "harvard-university-2155.txt"
mit = "massachusetts-institute-of-technology-2178.txt"
princeton = "princeton-university-2627.txt"
stanford = "stanford-university-1305.txt"
uchicago = "university-of-chicago-1774.txt"
penn = "university-of-pennsylvania-3378.txt"
yale = "yale-university-1426.txt"
university_files = [caltech, columbia, duke, harvard, mit, princeton, stanford, uchicago, penn, yale]

Calling The Text API Endpoints

Now let’s set up the requests. We’ll create a header that tells the server that we’re sending JSON content and passes in the API key. We’ll also need to set up our API endpoints. As mentioned earlier, we’ll be hitting The Text API endpoints for getting the most positive sentences and the most negative sentences. These sentences are determined based on the text polarity of the sentences. 

headers = {
    "Content-Type": "application/json",
    "apikey": apikey
}
text_url = "https://app.thetextapi.com/text/"
mps_url = text_url + "most_positive_sentences"
mns_url = text_url + "most_negative_sentences"

Once our requests are set up, let’s declare two dictionaries to keep track of the most positive and most negative sentences for each school we’re looking at. Then we loop through our files. For each one we read it in, create a JSON body to send to The Text API, and send a request to each endpoint. When a response comes back, we parse it into a dictionary and save the sentences under that file’s name in the matching dictionary: mps for the most positive sentences, mns for the most negative.

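# Map each file name to the sentences The Text API returns for it
mps = {}
mns = {}

for university in university_files:
    # Read the scraped page text saved by the web scraper
    with open(university, "r") as f:
        text = f.read()
    body = {
        "text": text
    }
    # Ask for the most positive sentences in this document
    response = requests.post(url=mps_url, headers=headers, json=body)
    _dict = json.loads(response.text)
    mps[university] = _dict["most positive sentences"]

    # Ask for the most negative sentences in this document
    response = requests.post(url=mns_url, headers=headers, json=body)
    _dict = json.loads(response.text)
    mns[university] = _dict["most negative sentences"]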
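The loop above assumes every request succeeds and every response contains the expected key. Since UChicago caused errors for me, here is a hedged sketch of a small wrapper that returns an empty list instead of crashing. The helper name fetch_sentences is my own invention, and `post` is expected to behave like requests.post; I pass it in as an argument so the function is easy to try out without hitting the network.

```python
def fetch_sentences(post, url, headers, text, key):
    """POST `text` to `url` and return the list stored under `key`.

    `post` should behave like requests.post. Returns an empty list when
    the request fails, the body is not valid JSON, or `key` is missing.
    """
    try:
        response = post(url=url, headers=headers, json={"text": text}, timeout=60)
        response.raise_for_status()  # turn HTTP error codes into exceptions
        return response.json()[key]
    except (OSError, ValueError, KeyError) as err:
        # requests' exceptions derive from OSError; bad JSON raises ValueError
        print(f"Request to {url} failed: {err!r}")
        return []
```

Inside the loop this could be called as `mps[university] = fetch_sentences(requests.post, mps_url, headers, text, "most positive sentences")`. The 60 second timeout is an arbitrary choice.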

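Once the loop finishes, we’ll take our dictionaries and save them into JSON files.

# Output file names (placeholders, name yours whatever you like)
mps_filename = "most_positive_sentences.json"
mns_filename = "most_negative_sentences.json"

with open(mps_filename, "w") as f:
    json.dump(mps, f)

with open(mns_filename, "w") as f:
    json.dump(mns, f)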
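To look at the saved results later without calling the API again, something like the following sketch should work. It assumes each saved file maps a file name to a list of sentence strings, which is the shape of the results shown in this post. The helper names load_results and print_results are mine.

```python
import json


def load_results(path):
    """Read a results file written with json.dump back into a dictionary."""
    with open(path, "r") as f:
        return json.load(f)


def print_results(results):
    """Print each file name followed by its sentences as bullets."""
    for filename, sentences in results.items():
        print(filename)
        for sentence in sentences:
            print(f"  - {sentence.strip()}")
```

Calling `print_results(load_results(path))` on either saved file prints one block per school.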

Before we get into the results I want to put this disclaimer here again. Disclaimer: what we learn from this post will actually show us two things – the amount of repeated text on US News’ college rankings and the importance of cleaning data!

What are the Most Positive Sentences for Each School?

Let’s take a look. There are a lot of statements about tax-advantaged accounts, the actual US News college rankings, and experts’ advice to do your own research on the area. This is clearly not a GREAT example of the most positive sentences. But it tells us something about the text: there’s a lot of repeated information in it. What does that mean? It means we need to do more data cleaning before we can extract more meaningful information. That will come in a future revisit of this post!

CalTech:

  • “Famous film director Frank Capra also graduated from Caltech.”
  • “\n to choose the best tax-advantaged college investment account for you.”
  • “California Institute of Technology’s ranking in the 2022 edition of Best Colleges is National Universities, #9.”

Columbia:

  • “Columbia University’s ranking in the 2022 edition of Best Colleges is National Universities, #2.”
  • “Columbia University admissions is most selective with an acceptance rate of 6%.”
  • “\n to choose the best tax-advantaged college investment account for you.”

Duke:

  • “Experts advise prospective\n students and their\n families to\n \n do their own research\n \n to evaluate the safety of a campus as well as the surrounding area.”
  • “Duke University’s ranking in the 2022 edition of Best Colleges is National Universities, #9.”
  • “\n to choose the best tax-advantaged college investment account for you.”

Harvard:

  • “Harvard University’s ranking in the 2022 edition of Best Colleges is National Universities, #2.”
  • “Harvard University admissions is most selective with an acceptance rate of 5%\n and an early acceptance rate of 13.9%.”
  • “\n to choose the best tax-advantaged college investment account for you.”

MIT:

  • “Massachusetts Institute of Technology’s ranking in the 2022 edition of Best Colleges is National Universities, #2.”
  • “Experts advise prospective\n students and their\n families to\n \n do their own research\n \n to evaluate the safety of a campus as well as the surrounding area.\n”
  • “\n \n to get advice on raising cash and reducing costs, or use the\n U.S. News 529 Finder\n to choose the best tax-advantaged college investment account for you.\n \n \n Campus safety data were\n \n reported by the institution\n \n to the U.S. Department of Education and have not been\n independently verified.\n”

Princeton:

  • “Princeton University’s ranking in the 2022 edition of Best Colleges is National Universities, #1.”
  • “Experts advise prospective\n students and their\n families to\n \n do their own research\n \n to evaluate the safety of a campus as well as the surrounding area.\n”
  • “\n to choose the best tax-advantaged college investment account for you.”

Stanford:

  • “\n to choose the best tax-advantaged college investment account for you.”
  • “Stanford also has successful programs in tennis and golf.”
  • “Stanford University’s ranking in the 2022 edition of Best Colleges is National Universities, #6.”

UChicago:

(I have no clue why I was only able to get two sentences back. UChicago caused errors on all 3 rounds!)

  • “University of Chicago’s ranking in the 2020 edition of Best Colleges is National Universities, #6.”
  • “University of Chicago’s ranking in the 2022 edition of Best Colleges is National Universities, #6.”

Penn:

  • “University of Pennsylvania’s ranking in the 2022 edition of Best Colleges is National Universities, #8.”
  • “Experts advise prospective\n students and their\n families to\n \n do their own research\n \n to evaluate the safety of a campus as well as the surrounding area.\n”
  • “\n to choose the best tax-advantaged college investment account for you.”

Yale:

  • “\n \n to get advice on raising cash and reducing costs, or use the\n U.S. News 529 Finder\n to choose the best tax-advantaged college investment account for you.”
  • “Experts advise prospective\n students and their\n families to\n \n do their own research\n \n to evaluate the safety of a campus as well as the surrounding area.\n”
  • “Yale University’s ranking in the 2022 edition of Best Colleges is National Universities, #5.”

CLEARLY there are a lot of comments about money and the cost of college. That’s because all of these schools cost exorbitant amounts of money. Many of these pages also include the advice to look into the safety of the surrounding area. Having been near Duke many times, I can say that advice definitely applies there.

There are also a lot of repeated sentences. This tells us two things: US News inserts the same template text into each of its pages, and we need to clean our data.
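To put a number on the repetition, we can count how many documents each sentence shows up in. This is a rough sketch, assuming a dictionary shaped like mps or mns above (file name mapped to a list of sentences). The function names normalize and count_repeats are mine, and collapsing whitespace before comparing is my own choice so that stray newlines don’t hide duplicates.

```python
import re
from collections import Counter


def normalize(sentence):
    """Collapse runs of whitespace (including newlines) into single spaces."""
    return re.sub(r"\s+", " ", sentence).strip()


def count_repeats(results):
    """Count how many documents each normalized sentence appears in."""
    counts = Counter()
    for sentences in results.values():
        # Use a set so a sentence repeated within one document counts once
        counts.update({normalize(s) for s in sentences})
    return counts
```

For example, `count_repeats(mps).most_common(5)` would list the five sentences shared by the most schools.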

What are the Most Negative Sentences for Each School? 

Once again we run into the same problem of having many repeated sentences. This time the sentences are about whether or not you can register late and about how paying for college doesn’t have to be difficult or devastating. As an aside – student loans are WILD nowadays but if you’re a high school student looking to major in computer science, here’s a list of scholarships for Computer Science. As with the Most Positive Sentences endpoint, we’ll be revisiting this in the future.

CalTech:

  • “Paying for college doesn’t have to be difficult or devastating.”
  • “Integral to student life is the Honor Code, which dictates that \”No member of the Caltech community shall take unfair advantage of any other member of the Caltech community.\””
  • “Caltech, which focuses on science and engineering, is located in Pasadena, California, approximately 11 miles northeast of Los Angeles.”

Columbia:

  • “You can register late, but there are some factors you should consider.”
  • “Paying for college doesn’t have to be difficult or devastating.”
  • “The numbers for criminal offenses reflect reports of alleged\n offenses to\n campus security and/or law enforcement authorities, not necessarily\n prosecutions or convictions.”

Duke:

  • “You can register late, but there are some factors you should consider.”
  • “Approximately 30 percent of the student body is affiliated with Greek life, which encompasses almost 40 fraternities and sororities.”
  • “Campus safety data were\n \n reported by the institution\n \n to the U.S. Department of Education and have not been\n independently verified.\n The numbers for criminal offenses reflect reports of alleged\n offenses to\n campus security and/or law enforcement authorities, not necessarily\n prosecutions or convictions.”

Harvard:

  • “You can register late, but there are some factors you should consider.”
  • “The numbers for criminal offenses reflect reports of alleged\n offenses to\n campus security and/or law enforcement authorities, not necessarily\n prosecutions or convictions.”
  • “Harvard University also offers campus safety and security\n services like 24-hour foot and vehicle patrols, late night transport/escort service, 24-hour emergency telephones, lighted pathways/sidewalks, student patrols, controlled dormitory access (key, security card, etc.).”

MIT:

  • “You can register late, but there are some factors you should consider.”
  • “\n Paying for college doesn’t have to be difficult or devastating.”
  • “Architect Steven Holl designed one dorm, commonly called \”The Sponge.\””

Princeton:

  • “You can register late, but there are some factors you should consider.”
  • “Paying for college doesn’t have to be difficult or devastating.”
  • “The numbers for criminal offenses reflect reports of alleged\n offenses to\n campus security and/or law enforcement authorities, not necessarily\n prosecutions or convictions.”

Stanford:

  • “You can register late, but there are some factors you should consider.”
  • “Paying for college doesn’t have to be difficult or devastating.”
  • “Greek life at Stanford represents approximately 25 percent of the student body.”

UChicago (once again causing problems):

  • “It has a total undergraduate enrollment of approximately 6800 students, its setting is urban, and the campus size is 217 acres.”

Penn:

  • “You can register late, but there are some factors you should consider.”
  • “Paying for college doesn’t have to be difficult or devastating.”
  • “The numbers for criminal offenses reflect reports of alleged\n offenses to\n campus security and/or law enforcement authorities, not necessarily\n prosecutions or convictions.”

Yale:

  • “You can register late, but there are some factors you should consider.”
  • “Paying for college doesn’t have to be difficult or devastating.”
  • “The numbers for criminal offenses reflect reports of alleged\n offenses to\n campus security and/or law enforcement authorities, not necessarily\n prosecutions or convictions.”

Pretty much all the negative sentences are about safety, registering late, paying for college, and Greek life. MIT has a random comment about one of their dorms, The Sponge. This tells us, once again, that US News uses a template, and that safety can be an issue at college.
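One quick way to see past the template is to keep only the sentences that don’t appear for any other school. This is just a sketch of the idea, not the cleaning approach for the follow-up post. It assumes the same dictionary shape as mps and mns, and the name unique_sentences and the max_docs parameter are made up for illustration.

```python
import re
from collections import Counter


def unique_sentences(results, max_docs=1):
    """Keep only sentences that appear in at most `max_docs` documents."""

    def normalize(sentence):
        # Collapse whitespace so stray newlines don't hide duplicates
        return re.sub(r"\s+", " ", sentence).strip()

    doc_counts = Counter()
    for sentences in results.values():
        doc_counts.update({normalize(s) for s in sentences})

    return {
        name: [
            normalize(s)
            for s in sentences
            if doc_counts[normalize(s)] <= max_docs
        ]
        for name, sentences in results.items()
    }
```

Run on mns, this should leave school-specific lines such as the one about The Sponge and drop the sentences about registering late.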

Conclusion

The data that you have and are processing is just as important as what you do with it. This post was meant to be an example post to demonstrate this fact and will be followed up with an example of how you can programmatically clean your data. Personally, I don’t want to go in and manually clean a bunch of text, so I will find a way to programmatically do it and showcase that next time!

Learn More

To learn more, feel free to reach out to me @yujian_tang on Twitter, connect with me on LinkedIn, and join our Discord. Remember to follow the blog to stay updated with cool Python projects and ways to level up your Software and Python skills! If you liked this article, please Tweet it, share it on LinkedIn, or tell your friends!

Yujian Tang

I started my professional software career interning for IBM in high school after winning ACSL two years in a row. I got into AI/ML in college where I published a first author paper to IEEE Big Data. After college I worked on the AutoML infrastructure at Amazon before leaving to work in startups. I believe I create the highest quality software content so that’s what I’m doing now. Drop a comment to let me know!
