Goals for this project:
Create a function that scrapes www.pomona.edu for a topic the user supplies and returns all the facts listed on that topic's page
Create a function that scrapes www.usa.gov for a letter the user supplies and returns all the government agencies whose names start with that letter
# load in packages
library(tidyverse)
library(rvest)
library(purrr)
# create the function
web_func <- function(n) {
  # build the URL path from the topic name
  title <- gsub(" ", "-", n)
  title <- gsub("\\band\\b", "-", title) # match "and" only as a whole word
  title <- gsub("@", "at", title)
  title <- gsub("&", "-", title)
  title <- gsub("-+", "-", title) # collapse repeated hyphens
  html <- paste0("https://www.pomona.edu/", title)
  page <- read_html(html)
  quantity <- page |>
    html_elements(".text-7xl") |>
    html_text()
  facts <- page |>
    html_elements(".fact .text-xl") |>
    html_text()
  # account for the sites that have no facts
  if (length(facts) == 0) {
    facts <- "N/A"
  }
  if (length(quantity) == 0) {
    quantity <- "N/A"
  }
  # return a tibble
  data <- tibble(
    topic = n,
    quantity = quantity,
    facts = facts
  )
  return(data)
}
The gsub() function performs pattern replacement. To turn “n” into a valid URL path, I replaced whitespace and “&” with hyphens, “@” with “at”, and the word “and” with a hyphen, then collapsed any run of hyphens into a single one.
To handle pages that have no facts (e.g., News & Events), an if-clause sets facts and quantity to “N/A” whenever their length is 0.
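The slug-building step can be checked on its own, without touching the network. The sketch below pulls the replacements into a standalone helper; the name make_slug is hypothetical and not part of the original function. It anchors “and” with word boundaries, an assumption made here so that a word such as “Standards” is left intact.

```r
# Hypothetical helper: mirrors the gsub() chain inside web_func()
make_slug <- function(n) {
  slug <- gsub(" ", "-", n)
  slug <- gsub("\\band\\b", "-", slug) # whole word "and" only (assumption)
  slug <- gsub("@", "at", slug)
  slug <- gsub("&", "-", slug)
  gsub("-+", "-", slug)                # collapse repeated hyphens
}

make_slug(c("Admissions & Aid", "Life @ Pomona", "Alumni and Families"))
# [1] "Admissions-Aid"  "Life-at-Pomona"  "Alumni-Families"
```

Testing the helper first makes it easier to tell a bad URL apart from a page that simply has no facts.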
# map the different options available
dif_titles <- c("Admissions & Aid", "Academics", "Life @ Pomona", "News & Events", "About", "Alumni & Families")
map(dif_titles, web_func) |>
  list_rbind()
# A tibble: 22 × 3
   topic            quantity facts
   <chr>            <chr>    <chr>
 1 Admissions & Aid 20%      "of our students are the first in their family to …
 2 Admissions & Aid 8:1      "Student to faculty ratio on average"
 3 Admissions & Aid 94%      "live on campus all four years, creating a tight c…
 4 Admissions & Aid 250+     "student-run clubs and organizations to choose fro…
 5 Academics        8:1      "Student-to-faculty ratio"
 6 Academics        600+     "Classes offered at Pomona each year"
 7 Academics        52%      "Students who conduct research with faculty"
 8 Academics        48       "Majors offered at Pomona College"
 9 Life @ Pomona    94%      "of students live on campus all four years"
10 Life @ Pomona    6,000+   "students you can meet from the Claremont Colleges…
# ℹ 12 more rows
# create a function that takes a letter as input and returns all the U.S. government agencies that start with that letter
gov_web_func <- function(letter) {
  if (letter == "A") {
    gov_html <- paste0("https://www.usa.gov/agency-index#", letter)
  } else {
    gov_html <- paste0("https://www.usa.gov/agency-index/", letter, "#", letter)
  }
  gov_page <- read_html(gov_html)
  agencies <- gov_page |>
    html_elements("#block-views-block-federal-agencies-block-1 .usa-accordion__button") |>
    html_text()
  agen_list <- tibble(
    agencies = agencies |>
      str_remove("\n")
  )
  return(agen_list)
}
The letter “A” is the only one whose page uses a different URL pattern, so an if-else clause handles that case.
Most of the agency names ended with “\n”, so I removed it from each string.
Only the letters A-W (except Q) have government agencies.
The letters in the URL are all uppercase.
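The notes above can be turned into a small input check. This is a minimal sketch, assuming the URL patterns and letter range described here still hold; valid_letters and agency_index_url are hypothetical names introduced only for illustration.

```r
# Letters that have agency pages, per the notes above (A-W, excluding Q)
valid_letters <- setdiff(LETTERS[1:23], "Q")

# Hypothetical helper: builds the agency-index URL for one letter
agency_index_url <- function(letter) {
  letter <- toupper(letter) # the URLs use uppercase letters
  if (length(letter) != 1 || !letter %in% valid_letters) {
    stop("`letter` must be a single letter from A-W, excluding Q")
  }
  if (letter == "A") {
    paste0("https://www.usa.gov/agency-index#", letter)
  } else {
    paste0("https://www.usa.gov/agency-index/", letter, "#", letter)
  }
}

agency_index_url("a")
# [1] "https://www.usa.gov/agency-index#A"
agency_index_url("B")
# [1] "https://www.usa.gov/agency-index/B#B"
```

One way to build a combined table of all agencies would be to map gov_web_func over valid_letters and bind the results with list_rbind(), as was done for the Pomona topics.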
# A tibble: 602 × 1
   agencies
   <chr>
 1 "AbilityOne Commission "
 2 "Access Board "
 3 "Administration for Children and Families (ACF) "
 4 "Administration for Community Living (ACL) "
 5 "Administration for Native Americans (ANA) "
 6 "Administrative Conference of the United States (ACUS) "
 7 "Administrative Office of the U.S. Courts "
 8 "Advisory Council on Historic Preservation (ACHP) "
 9 "Africa Command "
10 "African Development Foundation (USADF) "
# ℹ 592 more rows
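The agency names in the output above still end in whitespace, because str_remove() deletes only the first match of its pattern. A possible cleanup, sketched here on made-up strings that imitate the scraped text (the exact padding on the live page is an assumption), is to use str_squish() instead:

```r
library(stringr)

# Made-up strings that imitate the scraped text (a newline plus padding)
raw <- c("AbilityOne Commission\n      ", "Access Board\n    ")

str_remove(raw, "\n")      # drops only the first newline; the padding stays
cleaned <- str_squish(raw) # trims both ends and collapses inner whitespace
cleaned
# [1] "AbilityOne Commission" "Access Board"
```

Swapping str_squish() in for str_remove("\n") inside gov_web_func would give agency names without the trailing space.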