The Modal Institution of Higher Education
I have enjoyed NPR’s Planet Money podcast for many years. They always have an interesting perspective on matters foreign and domestic; macro and micro; trivial and critical. It’s also a space that doesn’t shy away from wonky, data-filled policy debates.
A recent episode, The Modal American, talked listeners through a full analytic pipeline including the research question, an explanation of the methodology and the results.
Planet Money’s specific research question was whether they could aggregate IPUMS data (Go Gophers!) to find the “most typical” American using the mode as their measure of central tendency.
To do this, they conscripted the estimable NYT reporter Ben Casselman, who generously shared his code repo so anyone could replicate or repurpose the analysis.
Did you enjoy @planetmoney's "Modal American" episode but think, "if only it were nerdier"? You're in luck! Full results and code are now available on my GitHub page:https://t.co/b2u0zjYMpl
— Ben Casselman (@bencasselman) September 3, 2019
Finding the modal anything is interesting, but as someone who likes to think about education policy, I was happy to find that Isabella Velásquez used the methodology to look for the Modal School District.
Building on her work, I thought it would be interesting to aggregate IPEDS data to find the “Modal Institution of Higher Education.”
Who Cares?
Averages provide a shortcut for understanding what a “typical” thing (e.g., human height, widget, economy, giraffe age, test score, price of tacos, etc.) might be. As we learned from Planet Money, our perceptions of what we think of as the “average American” might be inaccurate.
If journalists or policy makers focus on an inaccurate version of “typical,” then perceptions of how to improve systems might begin from an inaccurate baseline.
American Institutions of Higher Education
Let me illustrate why this is a problem.
Imagine a “typical” college or university. In your mind you probably have a four-year, highly-selective, medium-sized, private university on the East Coast. Ivy grows on Gothic buildings. Professors wear tweed while lecturing in front of equation-filled chalkboards while earnest students take notes. Frisbees are absolutely everywhere.
This perception is, unfortunately, what drives the conversation about American Higher Education. Think about the recent Operation Varsity Blues incident.
To be clear, there was legitimate corruption perpetrated by bad actors looking to exploit a system that is largely built on trust. But was it, as Fareed Zakaria presented, a “crisis”?
If an “average” college or university is the one described above, then maybe so. But if typical describes something else, then perhaps we’re missing much more interesting or important stories about higher education in America.
So what is a “modal” college or university in the United States?
Let’s find out.
Analysis
Data Pull
Finding the “modal institution” will depend entirely on what features are selected. Rather than using features that might be influenced by the institution, I opted for institutional characteristics that are largely determined by funding mechanisms, student populations, degrees awarded, and location. Furthermore, I opted to use standardized, industry-accepted classifications and categories to avoid subjective binning problems.
Using the Integrated Postsecondary Education Data System (IPEDS) web interface, I selected the universe of cases (n = 6,857) along with the following features:
var_name | var_desc |
---|---|
unit_id | IPEDS UNITID |
institution_name | Institution Name |
opeflag_hd2018 | Title IV Flag |
stabbr_hd2018 | State |
sector_hd2018 | Sector |
iclevel_hd2018 | Degree Levels Offered |
control_hd2018 | Institution Funding Control |
deggrant_hd2018 | Degree Granting Flag |
instcat_hd2018 | Degree Granting Category |
instsize_hd2018 | Enrollment Size Category |
c18basic_hd2018 | Carnegie Classification |
obereg_hd2018 | Region Category |
Data Import and Cleaning
# load libraries
library(tidyverse)
library(magrittr)
library(janitor)
library(dataMeta)
library(here)
library(kableExtra)
# read in raw data files from IPEDS.
#file 1 is the raw data normalized
d <- read.csv(here("content/post/the-modal-institution-of-higher-education/data/","data_table.csv"),
              stringsAsFactors = FALSE)
#file 2 is the data dictionary from NCES/IPEDS
e <- read.csv(here("content/post/the-modal-institution-of-higher-education/data/","ValueLabels_10-11-2019---735.csv"),
              stringsAsFactors = FALSE)
I always like to take a quick look at the data, just to see what I am working with. As you can see below, the values have been coded as digits rather than useful, human-readable labels.
The data dictionary is in a long format. This means that each row represents one level of one of the features. In order to join the labels to the primary data frame, the dictionary needs to be converted into a wide format, with one column per variable.
#clean column names
d <- clean_names(d,case = "snake")
glimpse(d)
## Rows: 6,857
## Columns: 12
## $ unit_id <int> 240985, 177834, 180203, 491464, 493105, 459523, 48550…
## $ institution_name <chr> "\tEducational Technical College-Recinto de Bayamon",…
## $ opeflag_hd2018 <int> 1, 1, 1, 1, 1, 1, 1, 1, 5, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ stabbr_hd2018 <chr> "PR", "MO", "MT", "CA", "CA", "TX", "CA", "MI", "OR",…
## $ sector_hd2018 <int> 6, 2, 4, 7, 99, 9, 9, 9, 9, 2, 1, 3, 9, 3, 9, 6, 7, 2…
## $ iclevel_hd2018 <int> 2, 1, 2, 3, -3, 3, 3, 3, 3, 1, 1, 1, 3, 1, 3, 2, 3, 1…
## $ control_hd2018 <int> 3, 2, 1, 1, -3, 3, 3, 3, 3, 2, 1, 3, 3, 3, 3, 3, 1, 2…
## $ deggrant_hd2018 <int> 1, 1, 1, 2, -3, 2, 2, 2, 2, 1, 1, 1, 2, 1, 2, 2, 2, 1…
## $ instcat_hd2018 <int> 4, 1, 4, 6, -2, 6, 6, 6, 6, 2, 3, 2, 6, 3, 6, 6, 6, 1…
## $ instsize_hd2018 <int> 1, 2, 1, 1, -2, 1, 1, 1, 1, 3, 2, 1, 1, 1, 1, 1, 1, 1…
## $ c18basic_hd2018 <int> -2, 25, 33, -2, -2, -2, -2, -2, -2, 18, 23, 31, -2, 2…
## $ obereg_hd2018 <int> 9, 4, 7, 8, 8, 6, 8, 3, 8, 6, 5, 8, 9, 4, 1, 7, 5, 5,…
e <- clean_names(e,case = "snake")
glimpse(e)
## Rows: 146
## Columns: 3
## $ variable_name <chr> "STABBR (HD2018)", "STABBR (HD2018)", "STABBR (HD2018)",…
## $ value <chr> "AL", "AK", "AZ", "AR", "CA", "CO", "CT", "DE", "DC", "F…
## $ value_label <chr> "Alabama", "Alaska", "Arizona", "Arkansas", "California"…
Although I could force the joins to match on different variable names, it is easier to lean on the automatic matching functionality in the Tidyverse. To take advantage of this, the variable names in both files required some standardizing. Below is that process, written in my odd dialect of TidyBase and in direct contravention of DRY regulations.
#remove the year suffix after the last underscore in the variable names
#(this also shortens unit_id to unit and institution_name to institution)
names(d) <- gsub("_[^_]+$","",names(d))
#match those names in the description file
names(e)
## [1] "variable_name" "value" "value_label"
table(e$variable_name)
##
## C18BASIC (HD2018) CONTROL (HD2018) DEGGRANT (HD2018) ICLEVEL (HD2018)
## 34 4 3 4
## INSTCAT (HD2018) INSTSIZE (HD2018) OBEREG (HD2018) OPEFLAG (HD2018)
## 8 7 10 6
## SECTOR (HD2018) STABBR (HD2018)
## 11 59
e$variable_name %<>% tolower(.)
e$variable_name <- gsub("hd2018",replacement = "", e$variable_name)
e$variable_name <- gsub("\\(",replacement = "", e$variable_name)
e$variable_name <- gsub("\\)",replacement = "", e$variable_name)
e$variable_name <- gsub(" ",replacement = "", e$variable_name)
table(e$variable_name)
##
## c18basic control deggrant iclevel instcat instsize obereg opeflag
## 34 4 3 4 8 7 10 6
## sector stabbr
## 11 59
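For readers who prefer a single pass, the clean-up above can probably be collapsed into one regular expression. This is only a sketch on a toy vector: `raw_names` and `short_names` are hypothetical names, and the pattern assumes every entry ends with a parenthesised suffix such as "(HD2018)".

```r
# Toy stand-in for the raw variable_name column of the value-label file
raw_names <- c("STABBR (HD2018)", "C18BASIC (HD2018)", "SECTOR (HD2018)")

# Lower-case first, then strip the trailing space(s) and the
# parenthesised survey-year suffix in a single substitution
short_names <- sub(" *\\(.*\\)$", "", tolower(raw_names))

print(short_names)
```

The four separate gsub() calls are arguably easier to read step by step; the single pattern trades that readability for brevity.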
Next we’re going to spread out the data dictionary from long to wide. I should probably admit that these transformations have always been conceptually confusing for me. Melt + cast, gather + spread, or pivot_longer + pivot_wider have always been exercises in trial + error.
# spread data dictionary from long to wide
e <- spread(e,key = variable_name,value = value_label)
glimpse(e)
## Rows: 97
## Columns: 11
## $ value <chr> "-1", "-2", "-3", "0", "1", "10", "11", "12", "13", "14", "15…
## $ c18basic <chr> NA, "Not applicable, not in Carnegie universe (not accredited…
## $ control <chr> NA, NA, "{Not available}", NA, "Public", NA, NA, NA, NA, NA, …
## $ deggrant <chr> NA, NA, "{Not available}", NA, "Degree-granting", NA, NA, NA,…
## $ iclevel <chr> NA, NA, "{Not available}", NA, "Four or more years", NA, NA, …
## $ instcat <chr> "Not reported", "Not applicable", NA, NA, "Degree-granting, g…
## $ instsize <chr> "Not reported", "Not applicable", NA, NA, "Under 1,000", NA, …
## $ obereg <chr> NA, NA, NA, "US Service schools", "New England CT ME MA NH RI…
## $ opeflag <chr> NA, NA, NA, NA, "Participates in Title IV federal financial a…
## $ sector <chr> NA, NA, NA, "Administrative Unit", "Public, 4-year or above",…
## $ stabbr <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
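In current versions of tidyr, spread() is superseded by pivot_wider(). The sketch below shows the same long-to-wide reshape on a toy dictionary; `dict_long` and `dict_wide` are hypothetical names, and the four rows are a small illustrative subset rather than the full IPEDS value-label file.

```r
library(tidyr)

# Toy long-format dictionary: one row per (variable, coded value) pair
dict_long <- data.frame(
  variable_name = c("control", "control", "iclevel", "iclevel"),
  value         = c("1", "2", "1", "2"),
  value_label   = c("Public", "Private not-for-profit",
                    "Four or more years", "At least 2 but less than 4 years"),
  stringsAsFactors = FALSE
)

# One column per variable; each cell holds the label for that coded value
dict_wide <- pivot_wider(dict_long,
                         names_from  = variable_name,
                         values_from = value_label)

print(dict_wide)
```

Codes that exist for one variable but not another end up as NA in the wide table, which is why the real dictionary above is mostly NA.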
# convert all numeric/integer variables to characters to facilitate joins
e <- e %>%
  mutate_if(is.integer,as.character)
d <- d %>%
  mutate_if(is.integer,as.character)
The last data cleaning step is merging the data file with the data dictionary labels. There are many different ways to achieve this. I went with a for loop that pulls each variable’s labels, renames the columns, and joins them on.
#select variable names into vector, dropping the first column ("value")
varnames <- names(e)
varnamestest <- varnames[2:length(varnames)]
#begin for loop
#For each of the variables in the data dictionary:
# - select the variable and the value
# - remove NAs
# - rename the label column to <variable>_val and the value column to <variable>
# - left join to the data frame on <variable>
# - repeat
for(i in varnamestest){
  temp_dict <- e %>%
    select(all_of(i),value) %>%
    na.omit()
  names(temp_dict) <- c(paste0(i,"_val"),i)
  d <- left_join(d,temp_dict,by = i)
}
glimpse(d)
## Rows: 6,857
## Columns: 22
## $ unit <chr> "240985", "177834", "180203", "491464", "493105", "459523…
## $ institution <chr> "\tEducational Technical College-Recinto de Bayamon", "A …
## $ opeflag <chr> "1", "1", "1", "1", "1", "1", "1", "1", "5", "1", "1", "1…
## $ stabbr <chr> "PR", "MO", "MT", "CA", "CA", "TX", "CA", "MI", "OR", "TX…
## $ sector <chr> "6", "2", "4", "7", "99", "9", "9", "9", "9", "2", "1", "…
## $ iclevel <chr> "2", "1", "2", "3", "-3", "3", "3", "3", "3", "1", "1", "…
## $ control <chr> "3", "2", "1", "1", "-3", "3", "3", "3", "3", "2", "1", "…
## $ deggrant <chr> "1", "1", "1", "2", "-3", "2", "2", "2", "2", "1", "1", "…
## $ instcat <chr> "4", "1", "4", "6", "-2", "6", "6", "6", "6", "2", "3", "…
## $ instsize <chr> "1", "2", "1", "1", "-2", "1", "1", "1", "1", "3", "2", "…
## $ c18basic <chr> "-2", "25", "33", "-2", "-2", "-2", "-2", "-2", "-2", "18…
## $ obereg <chr> "9", "4", "7", "8", "8", "6", "8", "3", "8", "6", "5", "8…
## $ c18basic_val <chr> "Not applicable, not in Carnegie universe (not accredited…
## $ control_val <chr> "Private for-profit", "Private not-for-profit", "Public",…
## $ deggrant_val <chr> "Degree-granting", "Degree-granting", "Degree-granting", …
## $ iclevel_val <chr> "At least 2 but less than 4 years", "Four or more years",…
## $ instcat_val <chr> "Degree-granting, associate's and certificates \n", "Degr…
## $ instsize_val <chr> "Under 1,000", "1,000 - 4,999", "Under 1,000", "Under 1,0…
## $ obereg_val <chr> "Outlying areas AS FM GU MH MP PR PW VI", "Plains IA KS M…
## $ opeflag_val <chr> "Participates in Title IV federal financial aid programs"…
## $ sector_val <chr> "Private for-profit, 2-year", "Private not-for-profit, 4-…
## $ stabbr_val <chr> "Puerto Rico", "Missouri", "Montana", "California", "Cali…
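One of the "many different ways" to attach labels is a named-vector lookup, which avoids a join altogether. This is a minimal sketch on toy data, not the approach used in this post; `schools` and `control_labels` are hypothetical names, and the three codes mirror the control labels visible in the output above.

```r
# Toy data frame of coded values (a stand-in for d)
schools <- data.frame(unit    = c("a", "b", "c"),
                      control = c("3", "1", "2"),
                      stringsAsFactors = FALSE)

# Named vector acting as a lookup table: names are codes, values are labels
control_labels <- c("1" = "Public",
                    "2" = "Private not-for-profit",
                    "3" = "Private for-profit")

# Index the lookup by code; unname() drops the codes from the result
schools$control_val <- unname(control_labels[schools$control])

print(schools)
```

A code with no entry in the lookup comes back as NA, which matches the behaviour of the left join in the loop.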
Aside from a few stray tab and newline characters, everything is looking good. Now on to the aggregations and counts.
Aggregations
The final step is counting the number of institutions within each combination of these bins. Initially, I was planning to copy Ben Casselman’s code but opted to use the updated version on Isabella Velásquez’s “Modal School District” instead.
#use Casselman/Velásquez code to find modal IHE
d.agg <- d %>%
  count(c18basic_val,
        control_val,
        deggrant_val,
        iclevel_val,
        instcat_val,
        instsize_val,
        obereg_val,
        opeflag_val,
        sector_val,
        stabbr_val,
        sort = TRUE) %>%
  slice(1:3)
glimpse(d.agg)
## Rows: 3
## Columns: 11
## $ c18basic_val <chr> "Not applicable, not in Carnegie universe (not accredited…
## $ control_val <chr> "Private for-profit", "Private for-profit", "Private for-…
## $ deggrant_val <chr> "Nondegree-granting, primarily postsecondary", "Nondegree…
## $ iclevel_val <chr> "Less than 2 years (below associate)", "Less than 2 years…
## $ instcat_val <chr> "Nondegree-granting, sub-baccalaureate", "Nondegree-grant…
## $ instsize_val <chr> "Under 1,000", "Under 1,000", "Under 1,000"
## $ obereg_val <chr> "Far West AK CA HI NV OR WA", "Southwest AZ NM OK TX", "S…
## $ opeflag_val <chr> "Participates in Title IV federal financial aid programs"…
## $ sector_val <chr> "Private for-profit, less-than 2-year", "Private for-prof…
## $ stabbr_val <chr> "California", "Texas", "Florida"
## $ n <int> 163, 142, 97
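Counting combinations is not the same as taking the most common value of each column separately. The toy example below (made-up data, hypothetical names `toy`, `marginal_modes`, and `joint_mode`) is a sketch of why count() across several columns is the right tool for a "modal" profile.

```r
library(dplyr)

# Toy data: 3 public/small, 2 private/small, 2 private/large
toy <- data.frame(
  control = c(rep("Public", 3), rep("Private", 4)),
  size    = c(rep("Small", 3), rep("Small", 2), rep("Large", 2)),
  stringsAsFactors = FALSE
)

# Most common value of each column on its own
marginal_modes <- c(control = names(which.max(table(toy$control))),
                    size    = names(which.max(table(toy$size))))

# Most common combination of the two columns
joint_mode <- toy %>%
  count(control, size, sort = TRUE) %>%
  slice(1)

print(marginal_modes)
print(joint_mode)
```

Here the column-by-column answer is a private, small school, yet the most common actual combination is public and small.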
Results
kable(d.agg) %>%
kable_styling(font_size = 9)
c18basic_val | control_val | deggrant_val | iclevel_val | instcat_val | instsize_val | obereg_val | opeflag_val | sector_val | stabbr_val | n |
---|---|---|---|---|---|---|---|---|---|---|
Not applicable, not in Carnegie universe (not accredited or nondegree-granting) | Private for-profit | Nondegree-granting, primarily postsecondary | Less than 2 years (below associate) | Nondegree-granting, sub-baccalaureate | Under 1,000 | Far West AK CA HI NV OR WA | Participates in Title IV federal financial aid programs | Private for-profit, less-than 2-year | California | 163 |
Not applicable, not in Carnegie universe (not accredited or nondegree-granting) | Private for-profit | Nondegree-granting, primarily postsecondary | Less than 2 years (below associate) | Nondegree-granting, sub-baccalaureate | Under 1,000 | Southwest AZ NM OK TX | Participates in Title IV federal financial aid programs | Private for-profit, less-than 2-year | Texas | 142 |
Not applicable, not in Carnegie universe (not accredited or nondegree-granting) | Private for-profit | Nondegree-granting, primarily postsecondary | Less than 2 years (below associate) | Nondegree-granting, sub-baccalaureate | Under 1,000 | Southeast AL AR FL GA KY LA MS NC SC TN VA WV | Participates in Title IV federal financial aid programs | Private for-profit, less-than 2-year | Florida | 97 |
A quick look at the table reveals that several of the top spots differ only by geography. It’s mildly interesting that so many of these institutions are in California, Texas, or Florida, but given the population sizes there, this seems overly granular. Removing both the state and regional aggregations produces this:
kable(d.agg) %>%
kable_styling(font_size = 9)
c18basic_val | control_val | deggrant_val | iclevel_val | instcat_val | instsize_val | opeflag_val | sector_val | n |
---|---|---|---|---|---|---|---|---|
Not applicable, not in Carnegie universe (not accredited or nondegree-granting) | Private for-profit | Nondegree-granting, primarily postsecondary | Less than 2 years (below associate) | Nondegree-granting, sub-baccalaureate | Under 1,000 | Participates in Title IV federal financial aid programs | Private for-profit, less-than 2-year | 1452 |
Not applicable, not in Carnegie universe (not accredited or nondegree-granting) | Private for-profit | Nondegree-granting, primarily postsecondary | At least 2 but less than 4 years | Nondegree-granting, sub-baccalaureate | Under 1,000 | Participates in Title IV federal financial aid programs | Private for-profit, 2-year | 226 |
Not applicable, not in Carnegie universe (not accredited or nondegree-granting) | Public | Nondegree-granting, primarily postsecondary | Less than 2 years (below associate) | Nondegree-granting, sub-baccalaureate | Under 1,000 | Participates in Title IV federal financial aid programs | Public, less-than 2-year | 220 |
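The re-aggregation behind this second table is not shown; presumably it is the same count() call with obereg_val and stabbr_val left out, though that is an assumption. The sketch below uses made-up data (hypothetical names `toy`, `with_state`, `without_state`) to show how dropping a geographic column merges bins and can change which bin comes out on top.

```r
library(dplyr)

# Toy data: three for-profit schools in three states, two public schools in one
toy <- data.frame(
  sector_val = c("For-profit, <2-year", "For-profit, <2-year", "For-profit, <2-year",
                 "Public, 4-year", "Public, 4-year"),
  stabbr_val = c("California", "Texas", "Florida", "California", "California"),
  stringsAsFactors = FALSE
)

# With state included, the for-profit schools are split across three bins
with_state <- count(toy, sector_val, stabbr_val, sort = TRUE)

# Without state, they collapse into a single, larger bin
without_state <- count(toy, sector_val, sort = TRUE)

print(with_state)
print(without_state)
```

In the real data the top bin stays the same type of school once geography is removed; only its count grows.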
With this higher-level aggregation, it appears as though the modal institution of higher education is:
- Private for-profit
- Nondegree-granting
- Less than 2 years (below associate’s level)
- Under 1,000 enrollment
- Receiving Title IV funding
There are 1,452 of these institutions, comprising a sizable 21.2% of the total universe.
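The share can be checked directly from the two counts reported in this post. The variable names here are hypothetical.

```r
n_modal <- 1452   # institutions in the modal bin (from the table above)
n_total <- 6857   # universe of cases pulled from IPEDS

# Percentage of all institutions that fall in the modal bin, to one decimal
share <- round(100 * n_modal / n_total, 1)

print(share)
```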
What, you might ask, is an example of one of these modal institutions? To find out, I will append the counts back to the original data set and randomly select one for a closer look.
set.seed(8675309)
modal <- left_join(d,d.agg) %>%
  filter(n == max(n,na.rm = TRUE)) %>%
  select(unit,institution,13:23) %>%
  sample_n(size = 1)
kable(modal) %>%
kable_styling(font_size = 9)
unit | institution | c18basic_val | control_val | deggrant_val | iclevel_val | instcat_val | instsize_val | obereg_val | opeflag_val | sector_val | stabbr_val | n |
---|---|---|---|---|---|---|---|---|---|---|---|---|
438805 | Branford Hall Career Institute-Springfield Campus | Not applicable, not in Carnegie universe (not accredited or nondegree-granting) | Private for-profit | Nondegree-granting, primarily postsecondary | Less than 2 years (below associate) | Nondegree-granting, sub-baccalaureate | Under 1,000 | New England CT ME MA NH RI VT | Participates in Title IV federal financial aid programs | Private for-profit, less-than 2-year | Massachusetts | 1452 |
The Modal Institution Is…
The California Barber and Beauty College.
This school, and 1,451 schools like it, are private, for-profit institutions that qualify for federal Title IV funding, which allows schools to receive federal aid dollars from students who qualify.
There is bound to be variation in the types of schools that fit into this modal bin, but it’s worth a quick look at CB&B.
California requires licensure for cosmetologists and barbers. Assuming Californians enjoy haircuts and makeup, this seems like a reasonable explanation for this institution’s “modal-ness” on the demand side.
On the supply side, schools like the CB&B keep costs low by running operations from small storefronts or malls. Without tenured faculty, physical plant, hospitals, or football teams, the barriers to entry are comparatively low.
Furthermore, certificates can be earned in 44 weeks by anyone who is 16 or older and has a diploma or GED. This is presumably a broader pool of prospective students than “Hollywood Typical” institutions that require two to four years to earn a degree.
The modal-ness might also be explained by the nature of the training itself. Barbering and cosmetology require hands-on training and frequent practice, both of which lend themselves to small, localized schools with individualized attention. For this reason, the modal student probably attends a very different kind of school than the modal institution.
Acknowledgements
I am far from the first to point out that the colleges in the movies are not the typical American institution. Since this is something like universal knowledge among higher ed wonks, I won’t be able to cite all of the places this argument is found. But a few of note are:
Special thanks to Isabella Velásquez and Ben Casselman for their code contributions and shoulders to stand on.
Of course, thanks to Planet Money for continuing to create such awesome content nearly 1,000 episodes in.
Errors and interpretations are the author’s alone. Please submit edits or errors to GitHub.