library(tidyverse)
library(rvest)
library(purrr)
Web Scraping the Pomona College Home Page
# create a function that inputs a specific title for a Pomona College website and returns the facts listed on that specific URL
web_func <- function(n) {
  # build the URL slug: spaces become hyphens, the word "and" and the
  # "&" symbol collapse into a hyphen, "@" becomes "at", and any run of
  # hyphens is squeezed down to one
  title <- gsub(" ", "-", n)
  title <- gsub("\\band\\b", "-", title)  # word boundaries so words containing "and" survive
  title <- gsub("@", "at", title)
  title <- gsub("&", "-", title)
  title <- gsub("-+", "-", title)
  html <- paste0("https://www.pomona.edu/", title)
  page <- read_html(html)
  quantity <- page |>
    html_elements(".text-7xl") |>
    html_text()
  facts <- page |>
    html_elements(".fact .text-xl") |>
    html_text()
  # account for the sites that have no facts
  if (length(facts) == 0) facts <- "N/A"
  if (length(quantity) == 0) quantity <- "N/A"
  # return a tibble
  data <- tibble(
    topic = n,
    quantity = quantity,
    facts = facts
  )
  return(data)
}
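The slug construction can be traced on one of the titles used below without making any request; this is purely the string transformation the function performs before `read_html()` is called:

```r
# trace the slug for "Life @ Pomona" (string work only, no request made)
title <- gsub(" ", "-", "Life @ Pomona")  # "Life-@-Pomona"
title <- gsub("@", "at", title)           # "Life-at-Pomona"
title <- gsub("&", "-", title)            # unchanged: no "&" present
title <- gsub("-+", "-", title)           # unchanged: no repeated hyphens
paste0("https://www.pomona.edu/", title)  # "https://www.pomona.edu/Life-at-Pomona"
```

"Admissions & Aid" follows the other branch: the "&" becomes a hyphen, leaving "Admissions---Aid", and the final `gsub("-+", "-", ...)` collapses it to "Admissions-Aid".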
# map the different options available
dif_titles <- c("Admissions & Aid", "Academics", "Life @ Pomona", "News & Events", "About", "Alumni & Families")
map(dif_titles, web_func) |>
  list_rbind()
# A tibble: 22 × 3
topic quantity facts
<chr> <chr> <chr>
1 Admissions & Aid 20% "of our students are the first in their family to …
2 Admissions & Aid 8:1 "Student to faculty ratio on average"
3 Admissions & Aid 94% "live on campus all four years, creating a tight c…
4 Admissions & Aid 250+ "student-run clubs and organizations to choose fro…
5 Academics 8:1 "Student-to-faculty ratio"
6 Academics 600+ "Classes offered at Pomona each year"
7 Academics 52% "Students who conduct research with faculty"
8 Academics 48 "Majors offered at Pomona College"
9 Life @ Pomona 94% "of students live on campus all four years"
10 Life @ Pomona 6,000+ "students you can meet from the Claremont Colleges…
# ℹ 12 more rows
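Because each call returns a tibble with a `topic` column, the combined result can be sliced with ordinary dplyr verbs. A sketch, assuming the mapped result above is first saved under a name such as `pomona_facts` (that name is an assumption, not in the original code):

```r
# assumes the combined result was saved first, e.g.:
# pomona_facts <- map(dif_titles, web_func) |> list_rbind()
pomona_facts |>
  filter(topic == "Academics") |>
  select(quantity, facts)
```

This would return just the four "Academics" rows shown in the output above.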
Checking Permission
# check permission
library(robotstxt)
paths_allowed("https://www.pomona.edu")
[1] TRUE
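`paths_allowed()` also accepts a vector of paths together with a `domain` argument, so the specific sub-pages being scraped can be checked in one call. A sketch; the paths listed here are illustrative:

```r
library(robotstxt)

# check several sub-pages against robots.txt in a single call
paths_allowed(
  paths  = c("/academics", "/about"),
  domain = "www.pomona.edu"
)
```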
Analysis of Pomona Scraping
This function would be especially useful for high school seniors in their college search. By condensing each page to its headline facts, a student can quickly pull just the information they are most interested in. For example, someone who wanted to know the student-to-faculty ratio or what share of students live on campus would only need to pass "Admissions & Aid" to the function.
Web Scraping the Government Agencies Page
# create a function that takes a letter as input and returns all the U.S. government agencies that start with that letter
gov_web_func <- function(letter) {
  # the "A" agencies live on the index page itself; every other letter
  # has its own page under /agency-index/
  if (letter == "A") {
    gov_html <- paste0("https://www.usa.gov/agency-index#", letter)
  } else {
    gov_html <- paste0("https://www.usa.gov/agency-index/", letter, "#", letter)
  }
  gov_page <- read_html(gov_html)
  agencies <- gov_page |>
    html_elements("#block-views-block-federal-agencies-block-1 .usa-accordion__button") |>
    html_text()
  agen_list <- tibble(
    agencies = agencies |>
      str_remove("\n")
  )
  return(agen_list)
}
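One detail worth noting: `str_remove("\n")` drops only the leading newline, which is why the names in the output below keep a trailing space. If fully trimmed strings are wanted, stringr's `str_squish()` handles both ends and collapses internal runs of whitespace; a small string-only sketch:

```r
library(stringr)

# str_remove("\n") leaves the trailing space behind
str_remove("\nAccess Board ", "\n")  # "Access Board "

# str_squish() trims both ends and collapses internal whitespace
str_squish("\nAccess Board ")        # "Access Board"
```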
# index pages exist for letters A through W, except Q
all_letters <- setdiff(LETTERS[1:23], "Q")
map(all_letters, gov_web_func) |>
  list_rbind()
# A tibble: 602 × 1
agencies
<chr>
1 "AbilityOne Commission "
2 "Access Board "
3 "Administration for Children and Families (ACF) "
4 "Administration for Community Living (ACL) "
5 "Administration for Native Americans (ANA) "
6 "Administrative Conference of the United States (ACUS) "
7 "Administrative Office of the U.S. Courts "
8 "Advisory Council on Historic Preservation (ACHP) "
9 "Africa Command "
10 "African Development Foundation (USADF) "
# ℹ 592 more rows
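With all the letters combined, the table can also be searched by keyword rather than first letter. A sketch, assuming the combined result above is saved under a name such as `all_agencies` (an assumed name, not in the original code):

```r
# assumes: all_agencies <- map(all_letters, gov_web_func) |> list_rbind()
all_agencies |>
  filter(str_detect(agencies, "Development"))
```

This would return every agency whose name contains "Development", such as the African Development Foundation in row 10 above.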
Checking Permission
# check permission
paths_allowed("https://www.usa.gov")
[1] TRUE
Analysis of Agency Scraping
This function would be useful for anyone interested in government agencies. If someone needed to quickly find one of the "U.S. federal government agencies, departments, corporations, instrumentalities, [or] government-sponsored enterprises" listed on the page and only knew the letter its name starts with, this function would retrieve every match.