
Ask NLP: What Does US News Say About Top Colleges? Part 1

I don’t know if any of y’all checked out the US News Top Colleges rankings when you were in high school and applying to colleges, but I did. Back then the site wasn’t buried in ads like it is now, so it was actually readable. Thankfully, we have technology: we can simply scrape the site and get all the useful text from each page. For the full tutorial on how to do that, check out Web Scraping the Easy Way with Python, Selenium, and Beautiful Soup. Click here to skip to the results. Disclaimer: what we learn from this post will actually show us two things: the amount of repeated text on US News’ college rankings and the importance of cleaning data!

For this tutorial we’ll start with the text documents that we generated when we scraped the US News website. To follow this tutorial we’ll only need to install the requests library and get a free API key from The Text API. We’ll use pip to install requests in the command line and then get started.

pip install requests

Using The Text API to Extract Information from Our Documents

Alright, let’s dive in. We’ll need to import the requests and json libraries to send HTTP requests and parse the JSON responses. I’ve stored my API key in a separate file and imported it here. You can store yours directly in your script or in a config file and import it.

import requests
import json
 
from text_api_config import apikey

For this tutorial we’ll get the most positive and most negative sentences in each of these documents. The first thing we’ll need is the text files that hold the scraped pages, which we created in the last post on how to create a web scraper. You may pick whichever text documents you want. Once we have the names of all the text documents we want to analyze, we’ll put them into a list so we can easily iterate through them.

caltech = "california-institute-of-technology-1131.txt"
columbia = "columbia-university-2707.txt"
duke = "duke-university-2920.txt"
harvard = "harvard-university-2155.txt"
mit = "massachusetts-institute-of-technology-2178.txt"
princeton = "princeton-university-2627.txt"
stanford = "stanford-university-1305.txt"
uchicago = "university-of-chicago-1774.txt"
penn = "university-of-pennsylvania-3378.txt"
yale = "yale-university-1426.txt"
university_files = [caltech, columbia, duke, harvard, mit, princeton, stanford, uchicago, penn, yale]
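The filenames double as dictionary keys later on, so it can help to derive a readable label from each one. Here is a minimal sketch, assuming every filename follows the name-with-hyphens plus numeric ID pattern shown above; school_label is a hypothetical helper, not part of the original script.

```python
from pathlib import Path

def school_label(filename):
    """Turn 'duke-university-2920.txt' into 'duke university'."""
    stem = Path(filename).stem           # drop the .txt extension
    parts = stem.split("-")
    if parts and parts[-1].isdigit():    # drop the trailing numeric ID
        parts = parts[:-1]
    return " ".join(parts)

university_files = [
    "california-institute-of-technology-1131.txt",
    "duke-university-2920.txt",
]
# Map each filename to its readable label
labels = {name: school_label(name) for name in university_files}
print(labels)
```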

Calling The Text API Endpoints

Now let’s set up the requests. We’ll create a header that tells the server that we’re sending JSON content and passes in the API key. We’ll also need to set up our API endpoints. As mentioned earlier, we’ll be hitting The Text API endpoints for getting the most positive sentences and the most negative sentences. These sentences are determined based on the text polarity of the sentences. 

headers = {
    "Content-Type": "application/json",
    "apikey": apikey
}
text_url = "https://app.thetextapi.com/text/"
mps_url = text_url + "most_positive_sentences"
mns_url = text_url + "most_negative_sentences"

Once our requests are set up, let’s declare two dictionaries to keep track of the most positive and most negative sentences for each school we’re looking at. Then we loop through each of our files, read it in, create a JSON body to send to The Text API endpoint, and send the requests. When a response comes back, we’ll parse it into a dictionary and store the sentences in the matching dictionary for the most positive or most negative sentences.

mps = {}
mns = {}
 
for university in university_files:
    # Read the scraped text for this school
    with open(university, "r") as f:
        text = f.read()
    body = {
        "text": text
    }
    # Ask for the most positive sentences and store them under the filename
    response = requests.post(url=mps_url, headers=headers, json=body)
    _dict = json.loads(response.text)
    mps[university] = _dict["most positive sentences"]
   
    # Same request against the most negative sentences endpoint
    response = requests.post(url=mns_url, headers=headers, json=body)
    _dict = json.loads(response.text)
    mns[university] = _dict["most negative sentences"]
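The loop above assumes every response is valid JSON and contains the expected key. If either assumption fails for one school, the whole run stops. Here is a minimal sketch of a more forgiving parse step; parse_sentences is a hypothetical helper and the sample strings are made up for illustration.

```python
import json

def parse_sentences(raw_text, key):
    """Return the list stored under `key`, or [] if the body is not usable."""
    try:
        payload = json.loads(raw_text)
    except json.JSONDecodeError:
        return []                      # body was not JSON at all
    if not isinstance(payload, dict):
        return []                      # JSON, but not the object we expected
    return payload.get(key, [])        # missing key falls back to an empty list

good = '{"most positive sentences": ["Stanford also has successful programs in tennis and golf."]}'
bad = "<html>Internal Server Error</html>"   # made-up example of a non-JSON body

print(parse_sentences(good, "most positive sentences"))
print(parse_sentences(bad, "most positive sentences"))
```

Inside the loop, a call like parse_sentences(response.text, "most positive sentences") could stand in for the json.loads and key lookup, at the cost of silently recording an empty list for a failed request.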

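After the loop finishes, we’ll take our dictionaries and save them into JSON files.

# Output filenames (example names; pick whatever suits your project)
mps_filename = "most_positive_sentences.json"
mns_filename = "most_negative_sentences.json"
 
with open(mps_filename, "w") as f:
    json.dump(mps, f)
 
with open(mns_filename, "w") as f:
    json.dump(mns, f)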

Before we get into the results I want to put this disclaimer here again. Disclaimer: what we learn from this post will actually show us two things – the amount of repeated text on US News’ college rankings and the importance of cleaning data!

What are the Most Positive Sentences for Each School?

Let’s take a look. There are a lot of statements about tax-advantaged accounts, the US News rankings themselves, and expert advice to do your own research on the area. This is clearly not a GREAT example of the most positive sentences. But this tells us something about the text: there’s a lot of repeated information in it. What does that mean? It means we need to do more data cleaning on our text before we can extract more meaningful information. This will come in a future revisit of this post!

CalTech:

  • “Famous film director Frank Capra also graduated from Caltech.”
  • “\n to choose the best tax-advantaged college investment account for you.”
  • “California Institute of Technology’s ranking in the 2022 edition of Best Colleges is National Universities, #9.”

Columbia:

  • “Columbia University’s ranking in the 2022 edition of Best Colleges is National Universities, #2.”
  • “Columbia University admissions is most selective with an acceptance rate of 6%.”
  • “\n to choose the best tax-advantaged college investment account for you.”

Duke:

  • “Experts advise prospective\n students and their\n families to\n \n do their own research\n \n to evaluate the safety of a campus as well as the surrounding area.”
  • “Duke University’s ranking in the 2022 edition of Best Colleges is National Universities, #9.”
  • “\n to choose the best tax-advantaged college investment account for you.”

Harvard:

  • “Harvard University’s ranking in the 2022 edition of Best Colleges is National Universities, #2.”
  • “Harvard University admissions is most selective with an acceptance rate of 5%\n and an early acceptance rate of 13.9%.”
  • “\n to choose the best tax-advantaged college investment account for you.”

MIT:

  • “Massachusetts Institute of Technology’s ranking in the 2022 edition of Best Colleges is National Universities, #2.”
  • “Experts advise prospective\n students and their\n families to\n \n do their own research\n \n to evaluate the safety of a campus as well as the surrounding area.\n”
  • “\n \n to get advice on raising cash and reducing costs, or use the\n U.S. News 529 Finder\n to choose the best tax-advantaged college investment account for you.\n \n \n Campus safety data were\n \n reported by the institution\n \n to the U.S. Department of Education and have not been\n independently verified.\n”

Princeton:

  • “Princeton University’s ranking in the 2022 edition of Best Colleges is National Universities, #1.”
  • “Experts advise prospective\n students and their\n families to\n \n do their own research\n \n to evaluate the safety of a campus as well as the surrounding area.\n”
  • “\n to choose the best tax-advantaged college investment account for you.”

Stanford:

  • “\n to choose the best tax-advantaged college investment account for you.”
  • “Stanford also has successful programs in tennis and golf.”
  • “Stanford University’s ranking in the 2022 edition of Best Colleges is National Universities, #6.”

UChicago:

(I have no clue why I was only able to get two sentences back. UChicago caused errors on all 3 rounds!)

  • “University of Chicago’s ranking in the 2020 edition of Best Colleges is National Universities, #6.”
  • “University of Chicago’s ranking in the 2022 edition of Best Colleges is National Universities, #6.”

Penn:

  • “University of Pennsylvania’s ranking in the 2022 edition of Best Colleges is National Universities, #8.”
  • “Experts advise prospective\n students and their\n families to\n \n do their own research\n \n to evaluate the safety of a campus as well as the surrounding area.\n”
  • “\n to choose the best tax-advantaged college investment account for you.”

Yale:

  • “\n \n to get advice on raising cash and reducing costs, or use the\n U.S. News 529 Finder\n to choose the best tax-advantaged college investment account for you.”
  • “Experts advise prospective\n students and their\n families to\n \n do their own research\n \n to evaluate the safety of a campus as well as the surrounding area.\n”
  • “Yale University’s ranking in the 2022 edition of Best Colleges is National Universities, #5.”

CLEARLY there are a lot of comments about money and the cost of college. That’s because all of these schools cost exorbitant amounts of money. Many of these colleges also have comments about examining the safety of the surrounding area. Having been near Duke many times, I can say this is definitely true for them.

There are also a lot of repeated sentences. This tells us two things: that US News uses a template that they insert into each of their articles, and that we need to clean our data.
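To put a number on the repetition, one option is to collapse the stray newlines and count how many schools each sentence shows up for. This is a minimal sketch, not the cleaning approach promised for the follow-up post; normalize and count_repeats are hypothetical helpers and the sample data is a hand-abridged stand-in for the saved results.

```python
from collections import Counter

def normalize(sentence):
    """Collapse newlines and runs of spaces into single spaces."""
    return " ".join(sentence.split())

def count_repeats(results):
    """Count how many schools each normalized sentence appears for."""
    counts = Counter()
    for sentences in results.values():
        # Use a set so a sentence repeated within one school counts once
        counts.update({normalize(s) for s in sentences})
    return counts

# Abridged sample in the same shape as the mps dictionary
sample = {
    "duke": ["\n to choose the best tax-advantaged college investment account for you."],
    "yale": ["to choose the best   tax-advantaged college investment account for you."],
    "stanford": ["Stanford also has successful programs in tennis and golf."],
}

for sentence, n in count_repeats(sample).most_common():
    print(n, sentence)
```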

What are the Most Negative Sentences for Each School? 

Once again we run into the same problem of having many repeated sentences. This time the sentences are about whether or not you can register late and about how paying for college doesn’t have to be difficult or devastating. As an aside – student loans are WILD nowadays but if you’re a high school student looking to major in computer science, here’s a list of scholarships for Computer Science. As with the Most Positive Sentences endpoint, we’ll be revisiting this in the future.

CalTech:

  • “Paying for college doesn’t have to be difficult or devastating.”
  • “Integral to student life is the Honor Code, which dictates that \”No member of the Caltech community shall take unfair advantage of any other member of the Caltech community.\””
  • “Caltech, which focuses on science and engineering, is located in Pasadena, California, approximately 11 miles northeast of Los Angeles.”

Columbia:

  • “You can register late, but there are some factors you should consider.”
  • “Paying for college doesn’t have to be difficult or devastating.”
  • “The numbers for criminal offenses reflect reports of alleged\n offenses to\n campus security and/or law enforcement authorities, not necessarily\n prosecutions or convictions.”

Duke:

  • “You can register late, but there are some factors you should consider.”
  • “Approximately 30 percent of the student body is affiliated with Greek life, which encompasses almost 40 fraternities and sororities.”
  • “Campus safety data were\n \n reported by the institution\n \n to the U.S. Department of Education and have not been\n independently verified.\n The numbers for criminal offenses reflect reports of alleged\n offenses to\n campus security and/or law enforcement authorities, not necessarily\n prosecutions or convictions.”

Harvard:

  • “You can register late, but there are some factors you should consider.”
  • “The numbers for criminal offenses reflect reports of alleged\n offenses to\n campus security and/or law enforcement authorities, not necessarily\n prosecutions or convictions.”
  • “Harvard University also offers campus safety and security\n services like 24-hour foot and vehicle patrols, late night transport/escort service, 24-hour emergency telephones, lighted pathways/sidewalks, student patrols, controlled dormitory access (key, security card, etc.).”

MIT:

  • “You can register late, but there are some factors you should consider.”
  • “\n Paying for college doesn’t have to be difficult or devastating.”
  • “Architect Steven Holl designed one dorm, commonly called \”The Sponge.\””

Princeton:

  • “You can register late, but there are some factors you should consider.”
  • “Paying for college doesn’t have to be difficult or devastating.”
  • “The numbers for criminal offenses reflect reports of alleged\n offenses to\n campus security and/or law enforcement authorities, not necessarily\n prosecutions or convictions.”

Stanford:

  • “You can register late, but there are some factors you should consider.”
  • “Paying for college doesn’t have to be difficult or devastating.”
  • “Greek life at Stanford represents approximately 25 percent of the student body.”

UChicago (once again causing problems):

  • “It has a total undergraduate enrollment of approximately 6800 students, its setting is urban, and the campus size is 217 acres.”

Penn:

  • “You can register late, but there are some factors you should consider.”
  • “Paying for college doesn’t have to be difficult or devastating.”
  • “The numbers for criminal offenses reflect reports of alleged\n offenses to\n campus security and/or law enforcement authorities, not necessarily\n prosecutions or convictions.”

Yale:

  • “You can register late, but there are some factors you should consider.”
  • “Paying for college doesn’t have to be difficult or devastating.”
  • “The numbers for criminal offenses reflect reports of alleged\n offenses to\n campus security and/or law enforcement authorities, not necessarily\n prosecutions or convictions.”

Pretty much all the negative sentences are about safety, registering late, paying for college, and Greek life. MIT has a random comment about one of their dorms, The Sponge. This tells us, once again, that US News uses a template, and that safety can be an issue at college.
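If the templated sentences are noise, a simple filter is to keep only sentences that show up for a single school. The sketch below assumes a dictionary shaped like mns; school_specific and its max_schools threshold are hypothetical names chosen for illustration, and the sample sentences are shortened.

```python
from collections import Counter

def school_specific(results, max_schools=1):
    """Keep sentences that appear for at most `max_schools` schools."""
    def clean(sentence):
        # Collapse newlines and extra spaces so near-identical copies match
        return " ".join(sentence.split())

    seen_in = Counter()
    for sentences in results.values():
        seen_in.update({clean(s) for s in sentences})

    return {
        school: [clean(s) for s in sentences if seen_in[clean(s)] <= max_schools]
        for school, sentences in results.items()
    }

# Shortened sample in the same shape as the mns dictionary
sample = {
    "mit": ["You can register late.", "Architect Steven Holl designed one dorm."],
    "yale": ["You can\n register late."],
}
print(school_specific(sample))
```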

Conclusion

The data that you have and are processing is just as important as what you do with it. This post was meant to be an example post to demonstrate this fact and will be followed up with an example of how you can programmatically clean your data. Personally, I don’t want to go in and manually clean a bunch of text, so I will find a way to programmatically do it and showcase that next time!

I run this site to help you and others like you find cool projects and practice software skills. If this is helpful for you and you enjoy your ad free site, please help fund this site by donating below! If you can’t donate right now, please think of us next time.


How to Send a Web API Request in Python

In this post we’ll cover how to send an API request to a web API in Python. Web APIs are all the rage nowadays. The advantages of using a web API over a library include not having to install and configure a library, having much of the code abstracted out for you, and, in the case of machine learning, not having to handle the models.

For a follow-up to this post, see How to Send Web API Requests Asynchronously.

Setting up an API Request

The web API that we’ll use for our example is the Text API. To get an API key, simply go to the site, create an account, and you should see your API key just like the image below. I’ve blurred out my API key for obvious reasons.

API Key location in The Text API

As always, we’ll start with the libraries that we’ll need. The simplest way to make an API request in Python is to use the “requests” library. It is a third-party library rather than part of the standard library, so install it with pip install requests if you don’t already have it. I will also import a config file that holds my API key. Keeping the key out of your main script is good practice, but you may choose to keep your API key in the file itself if you wish.

import requests
from text_api_config import apikey
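The post doesn’t show the contents of the config file, so here is one possible sketch of text_api_config.py. It reads the key from an environment variable so that the key never lands in version control; the variable name TEXT_API_KEY is an assumption for this example and is not something The Text API requires.

```python
# text_api_config.py (hypothetical contents)
import os

# TEXT_API_KEY is an assumed environment variable name chosen for this sketch
apikey = os.environ.get("TEXT_API_KEY", "")

if not apikey:
    # Fail loudly early instead of sending requests with an empty key
    print("Warning: TEXT_API_KEY is not set; requests will likely be rejected")
```

A plain assignment such as apikey = "your key here" in the same file works too, as long as the file is kept out of any shared repository.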

After we import our libraries and API key, we’ll need to define the URL, create the headers, and create the body. We’ll be sending our request in JSON format. For this use case, I’ve taken one of the paragraphs from the post about sending emails with attachments using Python. In the headers you’ll see that I’ve included a “Content-Type” key, which tells the server to expect JSON content. I’ve also included an “api-key” key, which will be used to authorize our request.

Create an API Request

url = "https://www.thetextapi.com/text/summarize"
 
headers = {
    "Content-Type": "application/json",
    "api-key": apikey
}
 
text = """Here’s where we’ll switch things up before we send out the email. We’ll use os.path.basename to find the name of the file. This command will extract the basename of the file from the passed in filename (that is the name without the directory extensions). We’ll open the file to read in as bytes, and use our MIMEApplication object to read in our file. Make sure to specify your file type using the “_subtype” parameter. Ours will be a “txt” file. Then close our file to prevent memory leaks. Once we’ve created our attachment object, we need to add a header to it to let the MIMEMultipart object know what we’re attaching. Finally, we simply call the MIMEMultipart object, “message”, and attach our attachment object before we call the server to send the mail."""
 
body = {
    "text": text
}

Call the API

Note that the “body” of our request is specific to this API, so you may have to change it if you are sending your request to a different API or endpoint. Once we’ve declared all the variables we need, we’ll send a POST request using the requests library as shown below. Every response object from requests has a “.text” attribute holding the raw body as a string, but not every API returns JSON in that body, so you may have to inspect and debug the response format before you can confidently parse or display it.

response = requests.post(url, headers=headers, json=body)
print(response.text)
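Printing response.text is fine for a first look, but it says nothing about whether the call succeeded. Here is a minimal sketch of a check to run before using the result; read_response is a hypothetical helper that works with any requests-style response object, meaning one with a status_code attribute and a json() method.

```python
def read_response(response):
    """Return parsed JSON from a requests-style response, or None on failure."""
    if response.status_code != 200:
        print(f"Request failed with status {response.status_code}")
        return None
    try:
        # requests raises a ValueError subclass when the body is not JSON
        return response.json()
    except ValueError:
        print("Response body was not valid JSON")
        return None

# Usage with the request from this post (needs network access and a valid key):
# response = requests.post(url, headers=headers, json=body)
# data = read_response(response)
```

Returning None keeps the sketch short; raising an exception instead may suit a longer script better.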

We run this program simply by calling

python <program name>

The output should look like the following:

Response from API
