The Modal Institution of Higher Education

Photo Credit to Tim Mossholder from https://unsplash.com/photos/q49oU8NeOHQ

I have enjoyed NPR’s Planet Money podcast for many years. They always have an interesting perspective on matters foreign and domestic; macro and micro; trivial and critical. It’s also a space that doesn’t shy away from wonky, data-filled policy debates.

A recent episode, The Modal American, talked listeners through a full analytic pipeline including the research question, an explanation of the methodology and the results.

Planet Money’s specific research question was whether they could aggregate IPUMS data (Go Gophers!) to find the “most typical” American using the mode as their measure of central tendency.

To do this, they conscripted the estimable NYT reporter Ben Casselman, who generously shared his code repo so anyone could replicate or repurpose the analysis.

Finding the modal anything is interesting, but as someone who likes to think about education policy, I was happy to find that Isabella Velásquez used the methodology to look for the Modal School District.

Building on her work, I thought it would be interesting to aggregate IPEDS data to find the “Modal Institution of Higher Education.”

Who Cares?

Averages provide a shortcut for understanding what a “typical” thing (e.g., human height, widget, economy, giraffe age, test score, price of tacos, etc.) might be. As we learned from Planet Money, our perceptions of what we think of as the “average American” might be inaccurate.

If journalists or policy makers focus on an inaccurate version of “typical,” then perceptions of how to improve systems might begin from an inaccurate baseline.

American Institutions of Higher Education

Let me illustrate why this is a problem.

Imagine a “typical” college or university. In your mind you probably have a four-year, highly-selective, medium-sized, private university on the East Coast. Ivy grows on Gothic buildings. Professors wear tweed while lecturing in front of equation-filled chalkboards while earnest students take notes. Frisbees are absolutely everywhere.

This perception is, unfortunately, what drives the conversation about American Higher Education. Think about the recent Operation Varsity Blues incident.

To be clear, there was legitimate corruption perpetrated by bad actors looking to exploit a system that is largely built on trust. But was it, as Fareed Zakaria presented, a “crisis”?

If an “average” college or university is the one described above, then maybe so. But if typical describes something else, then perhaps we’re missing much more interesting or important stories about higher education in America.

So what is a “modal” college or university in the United States?

Let’s find out.

Analysis

Data Pull

Finding the “modal institution” will depend entirely on what features are selected. Rather than using features that might be influenced by the institution, I opted for institutional characteristics which are largely determined by funding mechanisms, student populations, degrees awarded, and location. Furthermore, I opted to use standardized, industry-accepted classifications and categories to avoid subjective binning problems.

Using the Integrated Postsecondary Education Data System (IPEDS) web interface, I selected the universe of cases (n = 6,857) along with the following features:

| var_name | var_desc |
|---|---|
| unit_id | IPEDS UNITID |
| institution_name | Institution Name |
| opeflag_hd2018 | Title IV Flag |
| stabbr_hd2018 | State |
| sector_hd2018 | Sector |
| iclevel_hd2018 | Degree Levels Offered |
| control_hd2018 | Institution Funding Control |
| deggrant_hd2018 | Degree Granting Flag |
| instcat_hd2018 | Degree Granting Category |
| instsize_hd2018 | Enrollment Size Category |
| c18basic_hd2018 | Carnegie Classification |
| obereg_hd2018 | Region Category |

Data Import and Cleaning

# load libraries

library(tidyverse)
## Warning: package 'ggplot2' was built under R version 4.0.2
## Warning: package 'tidyr' was built under R version 4.0.2
## Warning: package 'dplyr' was built under R version 4.0.2
library(magrittr)
## Warning: package 'magrittr' was built under R version 4.0.2
library(janitor)
library(dataMeta)
library(here)
library(kableExtra)
# read in raw data files from IPEDS. 

#file 1 is the raw data normalized
d <- read.csv(here("content/post/the-modal-institution-of-higher-education/data/","data_table.csv"),
              stringsAsFactors = FALSE)

#file 2 is the data dictionary from NCES/IPEDS
e <- read.csv(here("content/post/the-modal-institution-of-higher-education/data/","ValueLabels_10-11-2019---735.csv"),
              stringsAsFactors = FALSE)

I always like to take a quick look at the data, just to see what I am working with. As you can see below, the values are stored as numeric codes rather than useful, human-readable labels.

The data dictionary is in a long format: each row represents one level of one of the features. Before the labels can be joined on to the primary data frame, the dictionary has to be converted into a wide format.

#clean column names

d <- clean_names(d,case = "snake")
glimpse(d)
## Rows: 6,857
## Columns: 12
## $ unit_id          <int> 240985, 177834, 180203, 491464, 493105, 459523, 48550…
## $ institution_name <chr> "\tEducational Technical College-Recinto de Bayamon",…
## $ opeflag_hd2018   <int> 1, 1, 1, 1, 1, 1, 1, 1, 5, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ stabbr_hd2018    <chr> "PR", "MO", "MT", "CA", "CA", "TX", "CA", "MI", "OR",…
## $ sector_hd2018    <int> 6, 2, 4, 7, 99, 9, 9, 9, 9, 2, 1, 3, 9, 3, 9, 6, 7, 2…
## $ iclevel_hd2018   <int> 2, 1, 2, 3, -3, 3, 3, 3, 3, 1, 1, 1, 3, 1, 3, 2, 3, 1…
## $ control_hd2018   <int> 3, 2, 1, 1, -3, 3, 3, 3, 3, 2, 1, 3, 3, 3, 3, 3, 1, 2…
## $ deggrant_hd2018  <int> 1, 1, 1, 2, -3, 2, 2, 2, 2, 1, 1, 1, 2, 1, 2, 2, 2, 1…
## $ instcat_hd2018   <int> 4, 1, 4, 6, -2, 6, 6, 6, 6, 2, 3, 2, 6, 3, 6, 6, 6, 1…
## $ instsize_hd2018  <int> 1, 2, 1, 1, -2, 1, 1, 1, 1, 3, 2, 1, 1, 1, 1, 1, 1, 1…
## $ c18basic_hd2018  <int> -2, 25, 33, -2, -2, -2, -2, -2, -2, 18, 23, 31, -2, 2…
## $ obereg_hd2018    <int> 9, 4, 7, 8, 8, 6, 8, 3, 8, 6, 5, 8, 9, 4, 1, 7, 5, 5,…
e <- clean_names(e,case = "snake")
glimpse(e)
## Rows: 146
## Columns: 3
## $ variable_name <chr> "STABBR (HD2018)", "STABBR (HD2018)", "STABBR (HD2018)",…
## $ value         <chr> "AL", "AK", "AZ", "AR", "CA", "CO", "CT", "DE", "DC", "F…
## $ value_label   <chr> "Alabama", "Alaska", "Arizona", "Arkansas", "California"…

Although I could force the joins to match on different variable names, it is easier to lean on the automatic matching functionality in the Tidyverse. To take advantage of this, the variable names in both files need some standardizing. Below is that process written in my odd dialect of TidyBase and in direct contravention of DRY regulations.

#remove everything after the last underscore in the variable names (the _hd2018 year suffix)
#note: this also shortens unit_id to unit and institution_name to institution
names(d) <- gsub("_[^_]+$","",names(d))

#match those names in the description file

names(e)
## [1] "variable_name" "value"         "value_label"
table(e$variable_name)
## 
## C18BASIC (HD2018)  CONTROL (HD2018) DEGGRANT (HD2018)  ICLEVEL (HD2018) 
##                34                 4                 3                 4 
##  INSTCAT (HD2018) INSTSIZE (HD2018)   OBEREG (HD2018)  OPEFLAG (HD2018) 
##                 8                 7                10                 6 
##   SECTOR (HD2018)   STABBR (HD2018) 
##                11                59
e$variable_name %<>% tolower(.)
e$variable_name <- gsub("hd2018",replacement = "", e$variable_name)
e$variable_name <- gsub("\\(",replacement = "", e$variable_name)
e$variable_name <- gsub("\\)",replacement = "", e$variable_name)
e$variable_name <- gsub(" ",replacement = "", e$variable_name)

table(e$variable_name)
## 
## c18basic  control deggrant  iclevel  instcat instsize   obereg  opeflag 
##       34        4        3        4        8        7       10        6 
##   sector   stabbr 
##       11       59
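
The same standardizing can be collapsed into one regular expression per file. This is only a sketch on a few names copied from the output above, not the code used in the post; `dict_names`, `data_names`, `dict_clean`, and `data_clean` are made-up object names.

```r
# A few names copied from the two files shown above
dict_names <- c("STABBR (HD2018)", "C18BASIC (HD2018)", "INSTSIZE (HD2018)")
data_names <- c("stabbr_hd2018", "c18basic_hd2018", "instsize_hd2018")

# Dictionary side: drop the trailing " (HD2018)" and lowercase what is left
dict_clean <- tolower(sub("[[:space:]]*\\(.*\\)$", "", dict_names))

# Data side: drop everything from the last underscore onward, as in the post
data_clean <- sub("_[^_]+$", "", data_names)

# Both sides should now agree, so a join can match on the shared names
identical(dict_clean, data_clean)
```

Checking the two cleaned vectors against each other before joining is a cheap way to catch a name that slipped through.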

Next we’re going to spread out the data dictionary from long to wide. I should probably admit that these transformations have always been conceptually confusing for me. Melt + cast, gather + spread, or pivot_longer/pivot_wider have always been exercises in trial + error.

# spread data dictionary from long to wide

e <- spread(e,key = variable_name,value = value_label)
glimpse(e)
## Rows: 97
## Columns: 11
## $ value    <chr> "-1", "-2", "-3", "0", "1", "10", "11", "12", "13", "14", "15…
## $ c18basic <chr> NA, "Not applicable, not in Carnegie universe (not accredited…
## $ control  <chr> NA, NA, "{Not available}", NA, "Public", NA, NA, NA, NA, NA, …
## $ deggrant <chr> NA, NA, "{Not available}", NA, "Degree-granting", NA, NA, NA,…
## $ iclevel  <chr> NA, NA, "{Not available}", NA, "Four or more years", NA, NA, …
## $ instcat  <chr> "Not reported", "Not applicable", NA, NA, "Degree-granting, g…
## $ instsize <chr> "Not reported", "Not applicable", NA, NA, "Under 1,000", NA, …
## $ obereg   <chr> NA, NA, NA, "US Service schools", "New England CT ME MA NH RI…
## $ opeflag  <chr> NA, NA, NA, NA, "Participates in Title IV federal financial a…
## $ sector   <chr> NA, NA, NA, "Administrative Unit", "Public, 4-year or above",…
## $ stabbr   <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
# convert all numeric/integer variables to characters to facilitate joins

e <- e %>%
        mutate_if(is.integer,as.character) 

d <- d %>%
        mutate_if(is.integer,as.character)
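
For anyone who finds the long-to-wide step as slippery as I do, here is a toy version using `pivot_wider()`, the newer tidyr counterpart to `spread()`. It is a sketch rather than the code used above: `dict_long` and `dict_wide` are made-up names, and the five rows are a hand-picked subset of the labels that appear in the output.

```r
library(tidyr)

# Toy long-format dictionary: one row per (variable, code) pair
dict_long <- data.frame(
  variable_name = c("control", "control", "iclevel", "iclevel", "obereg"),
  value         = c("1", "2", "1", "2", "0"),
  value_label   = c("Public", "Private not-for-profit",
                    "Four or more years", "At least 2 but less than 4 years",
                    "US Service schools"),
  stringsAsFactors = FALSE
)

# Wide format: one row per code, one column of labels per variable
dict_wide <- pivot_wider(dict_long,
                         names_from  = variable_name,
                         values_from = value_label)

dict_wide
```

Code "0" is only used by `obereg` in this toy data, so the other columns are `NA` on that row. That is the same reason the wide dictionary above is mostly `NA`.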

The last data cleaning step is merging the data file with the data dictionary labels. There are many different ways to achieve this. I went with a for loop that pulls the labels for each variable, renames them, and joins them on to the data.

#select variable names into vector, dropping the first column ("value")
varnames <- names(e)
varnamestest <- varnames[2:length(varnames)]

#begin for loop

#For each of the variables in the data dictionary: 
# - select the variable's labels and the coded value
# - remove NAs
# - name the label column <variable>_val and the code column <variable>
# - join the labels on to the data frame, matching on the code column
# - repeat

for(i in varnamestest){
        
        temp_dict <- e %>%
                select(all_of(i),value) %>%
                na.omit()
        
        names(temp_dict) <- c(paste0(i,"_val"),i)
        
        d <- left_join(d,temp_dict,by = i)
        
}

glimpse(d)
## Rows: 6,857
## Columns: 22
## $ unit         <chr> "240985", "177834", "180203", "491464", "493105", "459523…
## $ institution  <chr> "\tEducational Technical College-Recinto de Bayamon", "A …
## $ opeflag      <chr> "1", "1", "1", "1", "1", "1", "1", "1", "5", "1", "1", "1…
## $ stabbr       <chr> "PR", "MO", "MT", "CA", "CA", "TX", "CA", "MI", "OR", "TX…
## $ sector       <chr> "6", "2", "4", "7", "99", "9", "9", "9", "9", "2", "1", "…
## $ iclevel      <chr> "2", "1", "2", "3", "-3", "3", "3", "3", "3", "1", "1", "…
## $ control      <chr> "3", "2", "1", "1", "-3", "3", "3", "3", "3", "2", "1", "…
## $ deggrant     <chr> "1", "1", "1", "2", "-3", "2", "2", "2", "2", "1", "1", "…
## $ instcat      <chr> "4", "1", "4", "6", "-2", "6", "6", "6", "6", "2", "3", "…
## $ instsize     <chr> "1", "2", "1", "1", "-2", "1", "1", "1", "1", "3", "2", "…
## $ c18basic     <chr> "-2", "25", "33", "-2", "-2", "-2", "-2", "-2", "-2", "18…
## $ obereg       <chr> "9", "4", "7", "8", "8", "6", "8", "3", "8", "6", "5", "8…
## $ c18basic_val <chr> "Not applicable, not in Carnegie universe (not accredited…
## $ control_val  <chr> "Private for-profit", "Private not-for-profit", "Public",…
## $ deggrant_val <chr> "Degree-granting", "Degree-granting", "Degree-granting", …
## $ iclevel_val  <chr> "At least 2 but less than 4 years", "Four or more years",…
## $ instcat_val  <chr> "Degree-granting, associate's and certificates \n", "Degr…
## $ instsize_val <chr> "Under 1,000", "1,000 - 4,999", "Under 1,000", "Under 1,0…
## $ obereg_val   <chr> "Outlying areas AS FM GU MH MP PR PW VI", "Plains IA KS M…
## $ opeflag_val  <chr> "Participates in Title IV federal financial aid programs"…
## $ sector_val   <chr> "Private for-profit, 2-year", "Private not-for-profit, 4-…
## $ stabbr_val   <chr> "Puerto Rico", "Missouri", "Montana", "California", "Cali…
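
Each pass through the loop is an ordinary lookup join. The sketch below shows a single pass on made-up data (`coded`, `lookup`, and `labelled` are hypothetical names; the labels are the `control` labels from the output above), with the join key spelled out instead of left to automatic matching.

```r
library(dplyr)

# Toy coded data and a toy lookup table for one variable
coded <- data.frame(unit    = c("a", "b", "c"),
                    control = c("3", "2", "1"),
                    stringsAsFactors = FALSE)

lookup <- data.frame(control     = c("1", "2", "3"),
                     control_val = c("Public",
                                     "Private not-for-profit",
                                     "Private for-profit"),
                     stringsAsFactors = FALSE)

# Naming the key makes the match explicit and silences the join message
labelled <- left_join(coded, lookup, by = "control")

# With unique lookup keys, a left join should neither add nor drop rows
nrow(labelled) == nrow(coded)
```

Both key columns are character here, which is why the integer columns were converted before joining.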

Aside from a few weird character returns, everything is looking good. Now on to the aggregations and counts.

Aggregations

The final step is counting the number of institutions within each combination of these bins. Initially, I was planning to copy Ben Casselman’s code but opted to use the updated version from Isabella Velásquez’s “Modal School District” instead.

#use Casselman/Velásquez code to find modal IHE

d.agg <- d %>%
        count(c18basic_val,
              control_val,
              deggrant_val,
              iclevel_val,
              instcat_val,
              instsize_val,
              obereg_val,
              opeflag_val,
              sector_val,
              stabbr_val,
              sort = TRUE) %>%
        slice(1:3)
        
glimpse(d.agg)
## Rows: 3
## Columns: 11
## $ c18basic_val <chr> "Not applicable, not in Carnegie universe (not accredited…
## $ control_val  <chr> "Private for-profit", "Private for-profit", "Private for-…
## $ deggrant_val <chr> "Nondegree-granting, primarily postsecondary", "Nondegree…
## $ iclevel_val  <chr> "Less than 2 years (below associate)", "Less than 2 years…
## $ instcat_val  <chr> "Nondegree-granting, sub-baccalaureate", "Nondegree-grant…
## $ instsize_val <chr> "Under 1,000", "Under 1,000", "Under 1,000"
## $ obereg_val   <chr> "Far West AK CA HI NV OR WA", "Southwest AZ NM OK TX", "S…
## $ opeflag_val  <chr> "Participates in Title IV federal financial aid programs"…
## $ sector_val   <chr> "Private for-profit, less-than 2-year", "Private for-prof…
## $ stabbr_val   <chr> "California", "Texas", "Florida"
## $ n            <int> 163, 142, 97
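
Stripped down to two features, the counting step looks like this. The six rows are invented for illustration and `schools`, `combos`, and `modal_combo` are hypothetical names; only the `count(..., sort = TRUE)` pattern matches the code above.

```r
library(dplyr)

# Six made-up institutions described by two categorical features
schools <- data.frame(
  control  = c("Private for-profit", "Private for-profit", "Private for-profit",
               "Public", "Public", "Private for-profit"),
  instsize = c("Under 1,000", "Under 1,000", "Under 1,000",
               "Under 1,000", "Under 1,000", "1,000 - 4,999"),
  stringsAsFactors = FALSE
)

# Count every combination of the features, largest bin first
combos <- schools %>%
  count(control, instsize, sort = TRUE)

# The modal combination is simply the first row
modal_combo <- combos %>% slice(1)

modal_combo
```

If two bins tie for first place, `slice(1)` quietly returns only one of them, so it is worth looking at the top few rows rather than just the first.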

Results

kable(d.agg) %>%
        kable_styling(font_size = 9)
| c18basic_val | control_val | deggrant_val | iclevel_val | instcat_val | instsize_val | obereg_val | opeflag_val | sector_val | stabbr_val | n |
|---|---|---|---|---|---|---|---|---|---|---|
| Not applicable, not in Carnegie universe (not accredited or nondegree-granting) | Private for-profit | Nondegree-granting, primarily postsecondary | Less than 2 years (below associate) | Nondegree-granting, sub-baccalaureate | Under 1,000 | Far West AK CA HI NV OR WA | Participates in Title IV federal financial aid programs | Private for-profit, less-than 2-year | California | 163 |
| Not applicable, not in Carnegie universe (not accredited or nondegree-granting) | Private for-profit | Nondegree-granting, primarily postsecondary | Less than 2 years (below associate) | Nondegree-granting, sub-baccalaureate | Under 1,000 | Southwest AZ NM OK TX | Participates in Title IV federal financial aid programs | Private for-profit, less-than 2-year | Texas | 142 |
| Not applicable, not in Carnegie universe (not accredited or nondegree-granting) | Private for-profit | Nondegree-granting, primarily postsecondary | Less than 2 years (below associate) | Nondegree-granting, sub-baccalaureate | Under 1,000 | Southeast AL AR FL GA KY LA MS NC SC TN VA WV | Participates in Title IV federal financial aid programs | Private for-profit, less-than 2-year | Florida | 97 |

A quick look at the table reveals that several of the top spots differ only by geography. It’s mildly interesting that so many of these institutions are in California, Texas, or Florida, but given the population sizes there, this seems overly granular. Removing both the state and regional aggregations produces this:
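
The code for this second pass is not shown here. Presumably it is the same `count()` call with `stabbr_val` and `obereg_val` left out, but that is an assumption. The toy sketch below (invented rows, not IPEDS data; `toy`, `with_state`, and `without_state` are made-up names) shows why the choice matters: dropping a geographic column can change which bin comes out on top.

```r
library(dplyr)

# Five made-up institutions: sector plus state
toy <- data.frame(
  sector_val = c("Public, 4-year or above",
                 "Public, 4-year or above",
                 "Private for-profit, less-than 2-year",
                 "Private for-profit, less-than 2-year",
                 "Private for-profit, less-than 2-year"),
  stabbr_val = c("California", "California", "California", "Texas", "Florida"),
  stringsAsFactors = FALSE
)

# Mode when state is one of the features
with_state <- toy %>%
  count(sector_val, stabbr_val, sort = TRUE) %>%
  slice(1)

# Mode when state is left out of the count
without_state <- toy %>%
  count(sector_val, sort = TRUE) %>%
  slice(1)

with_state
without_state
```

In the toy data the for-profit schools are spread across three states, so they only become the largest bin once state is removed.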

kable(d.agg) %>%
        kable_styling(font_size = 9)
| c18basic_val | control_val | deggrant_val | iclevel_val | instcat_val | instsize_val | opeflag_val | sector_val | n |
|---|---|---|---|---|---|---|---|---|
| Not applicable, not in Carnegie universe (not accredited or nondegree-granting) | Private for-profit | Nondegree-granting, primarily postsecondary | Less than 2 years (below associate) | Nondegree-granting, sub-baccalaureate | Under 1,000 | Participates in Title IV federal financial aid programs | Private for-profit, less-than 2-year | 1452 |
| Not applicable, not in Carnegie universe (not accredited or nondegree-granting) | Private for-profit | Nondegree-granting, primarily postsecondary | At least 2 but less than 4 years | Nondegree-granting, sub-baccalaureate | Under 1,000 | Participates in Title IV federal financial aid programs | Private for-profit, 2-year | 226 |
| Not applicable, not in Carnegie universe (not accredited or nondegree-granting) | Public | Nondegree-granting, primarily postsecondary | Less than 2 years (below associate) | Nondegree-granting, sub-baccalaureate | Under 1,000 | Participates in Title IV federal financial aid programs | Public, less-than 2-year | 220 |

With this higher-level aggregation, the modal institution of higher education appears to be:

  • Private for-profit
  • Nondegree-granting
  • Less than 2 years (below associate’s level)
  • Under 1,000 enrollment
  • Receiving Title IV funding

There are 1,452 of these institutions, comprising a sizable 21.2% of the total universe.
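
The share is a one-line calculation from two numbers already reported: the size of the top bin and the number of rows in the full pull. The object names below are made up for the example.

```r
n_modal <- 1452   # institutions in the largest bin, from the table above
n_total <- 6857   # institutions in the full IPEDS pull

# Share of the universe in the modal bin, as a percentage to one decimal
share <- round(100 * n_modal / n_total, 1)
share
```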

What, you might ask, is an example of one of these modal institutions? To find out, I will append the counts back to the original data set and randomly select one for a closer look.

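set.seed(8675309)

# join the counts back on (matching on the shared *_val columns),
# keep the institutions in the largest bin, and sample one at random
modal <- left_join(d,d.agg) %>%
        filter(n == max(n,na.rm = TRUE)) %>%
        select(unit,institution,13:23) %>%
        sample_n(size = 1)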
kable(modal) %>%
        kable_styling(font_size = 9)
| unit | institution | c18basic_val | control_val | deggrant_val | iclevel_val | instcat_val | instsize_val | obereg_val | opeflag_val | sector_val | stabbr_val | n |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 438805 | Branford Hall Career Institute-Springfield Campus | Not applicable, not in Carnegie universe (not accredited or nondegree-granting) | Private for-profit | Nondegree-granting, primarily postsecondary | Less than 2 years (below associate) | Nondegree-granting, sub-baccalaureate | Under 1,000 | New England CT ME MA NH RI VT | Participates in Title IV federal financial aid programs | Private for-profit, less-than 2-year | Massachusetts | 1452 |
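
The same join-back-and-sample pattern on toy data, as a sketch only: the four schools are invented, `schools`, `counts`, and `picked` are hypothetical names, and `slice_sample()` is used as the newer dplyr counterpart to `sample_n()`.

```r
library(dplyr)

set.seed(8675309)

# Four made-up institutions with a single feature
schools <- data.frame(
  institution = c("School A", "School B", "School C", "School D"),
  sector_val  = c("Private for-profit, less-than 2-year",
                  "Private for-profit, less-than 2-year",
                  "Public, less-than 2-year",
                  "Private for-profit, less-than 2-year"),
  stringsAsFactors = FALSE
)

# Size of each bin
counts <- schools %>% count(sector_val, sort = TRUE)

# Attach the bin size to every school, keep the largest bin, draw one school
picked <- schools %>%
  left_join(counts, by = "sector_val") %>%
  filter(n == max(n)) %>%
  slice_sample(n = 1)

picked
```

Which school gets drawn depends on the seed and on the R version, so a rerun on a different setup may not reproduce the same pick.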

The Modal Institution Is…

The California Barber and Beauty College.

This school and 1,451 schools like it are private, for-profit institutions that qualify for federal Title IV funding, which allows schools to receive federal aid dollars from students who qualify.

There is bound to be variation in the types of schools that fit into this modal bin, but it’s worth a quick look at CB&B.

California requires licensure for cosmetologists and barbers. Assuming Californians enjoy haircuts and makeup, this seems like a reasonable explanation for this institution’s “modal-ness” on the demand side.

On the supply side, schools like the CB&B keep costs low by running operations from small storefronts or malls. Without tenured faculty, physical plant, hospitals, or football teams, the barriers to entry are comparatively low.

Furthermore, certificates can be earned in 44 weeks by anyone who is 16 or older and has a diploma or GED. This is presumably a broader pool of prospective students than “Hollywood Typical” institutions that require two to four years to earn a degree.

The modal-ness might also be explained by the nature of the training itself. Barbering and cosmetology require hands-on training and frequent practice, both of which lend themselves to small, localized schools with individualized attention. Because these schools are so small, the modal student probably attends a very different kind of school than the modal institution.

Acknowledgements

I am far from the first to point out that the colleges in the movies are not the typical American institution. Since this is something like universal knowledge among higher ed wonks, I won’t be able to cite all of the places this argument is found. But a few of note are:

  1. Shut Up About Harvard

  2. Higher Ed Data Stories

Special thanks to Isabella Velásquez and Ben Casselman for their code contributions and shoulders to stand on.

Of course, thanks to Planet Money for continuing to create such awesome content nearly 1,000 episodes in.

Errors and interpretations are the author’s alone. Please submit edits or errors to GitHub.

Brad Weiner

My research interests include higher education policy, data science, enrollment management, and institutional advancement.
