DS02 Presentation

Web Scraping!

Goals for this project:

  • Create a function that scrapes www.pomona.edu according to a specific topic the user inputs, returning all facts on that specific page

  • Create a function that scrapes www.usa.gov according to a certain letter the user inputs, returning all government agencies that start with that letter

Checking permissions

# check permission
library(robotstxt)
paths_allowed("https://www.pomona.edu")
[1] TRUE
paths_allowed("https://www.usa.gov")
[1] TRUE

First Function

# load in packages
library(tidyverse)
library(rvest)
library(purrr)
# create the function
web_func <- function(n) {
  title <- gsub(" ", "-", n)
  title <- gsub("and", "-", title)
  title <- gsub("@", "at", title)
  title <- gsub("&", "-", title)
  title <- gsub("-+", "-", title)
  html <- paste0("https://www.pomona.edu/", title)
  page <- read_html(html)
quantity <- page |>
  html_elements(".text-7xl") |>
  html_text()
facts <- page |>
  html_elements(".fact .text-xl") |>
  html_text()
# account for the sites that have no facts
if (length(facts) == 0)
  facts <- "N/A"
if (length(quantity) == 0)
  quantity <- "N/A"
# return a tibble
data <- tibble(
  topic = n,
  quantity = quantity,
  facts = facts
)
return(data)
}

Some Issues/Explanation

  • The gsub() function serves as a replacement operation; in order for “n” to properly be translated into the html code, I had to remove/replace whitespaces, symbols, and the word “and”

  • In order to account for the sites that don’t have any facts, I created an if-clause for those whose lengths equal 0 to return “N/A” (i.e., News & Events)

Mapping the Func

# map the different options available
dif_titles <- c("Admissions & Aid", "Academics", "Life @ Pomona", "News & Events", "About", "Alumni & Families")
map(dif_titles, web_func) |>
  list_rbind()
# A tibble: 22 × 3
   topic            quantity facts                                              
   <chr>            <chr>    <chr>                                              
 1 Admissions & Aid 20%      "of our students are the first in their family to …
 2 Admissions & Aid 8:1      "Student to faculty ratio on average"              
 3 Admissions & Aid 94%      "live on campus all four years, creating a tight c…
 4 Admissions & Aid 250+     "student-run clubs and organizations to choose fro…
 5 Academics        8:1      "Student-to-faculty ratio"                         
 6 Academics        600+     "Classes offered at Pomona each year"              
 7 Academics        52%      "Students who conduct research with faculty"       
 8 Academics        48       "Majors offered at Pomona College"                 
 9 Life @ Pomona    94%      "of students live on campus all four years"        
10 Life @ Pomona    6,000+   "students you can meet from the Claremont Colleges…
# ℹ 12 more rows

Second Function

# create a function that takes a letter as input and returns all the U.S. government agencies that start with that letter
gov_web_func <- function(letter) {
  if (letter == "A")
    gov_html <- paste0("https://www.usa.gov/agency-index#", letter)
    else
    gov_html <- paste0("https://www.usa.gov/agency-index/", letter, "#", letter)
  gov_page <- read_html(gov_html)
  agencies <- gov_page |>
    html_elements("#block-views-block-federal-agencies-block-1 .usa-accordion__button") |>
    html_text()
  agen_list <- tibble(
    agencies = agencies |>
      str_remove("\n")
  )
  return(agen_list)
}

Some Issues/Explanation

  • The letter “A” is the only letter that has a different HTML, so I created an if-else clause to account for that difference

  • Most of the agencies ended with “\n” so I removed that from the string

  • Only the letters A-W (except for Q) have government agencies

  • All the letters in the URL are uppercase

Mapping the Func

all_letters <- setdiff(toupper(letters[1:23]), "Q")
map(all_letters, gov_web_func) |>
  list_rbind()
# A tibble: 602 × 1
   agencies                                                            
   <chr>                                                               
 1 "AbilityOne Commission  "                                           
 2 "Access Board  "                                                    
 3 "Administration for Children and Families        (ACF)      "       
 4 "Administration for Community Living         (ACL)      "           
 5 "Administration for Native Americans        (ANA)      "            
 6 "Administrative Conference of the United States        (ACUS)      "
 7 "Administrative Office of the U.S. Courts  "                        
 8 "Advisory Council on Historic Preservation        (ACHP)      "     
 9 "Africa Command  "                                                  
10 "African Development Foundation        (USADF)      "               
# ℹ 592 more rows