The Modal Institution of Higher Education

Last updated on Oct 28, 2019 14 min read highered, R

Photo Credit to Tim Mossholder from https://unsplash.com/photos/q49oU8NeOHQ

I have enjoyed NPR’s Planet Money podcast for many years. They always have an interesting perspective on matters foreign and domestic; macro and micro; trivial and critical. It’s also a space that doesn’t shy away from wonky, data-filled policy debates.

A recent episode, The Modal American, talked listeners through a full analytic pipeline including the research question, an explanation of the methodology and the results.

Planet Money’s specific research question was whether they could aggregate IPUMS data (Go Gophers!) to find the “most typical” American using the mode as their measure of central tendency.

To do this, they conscripted the estimable NYT reporter Ben Casselman, who generously shared his code repo so anyone could replicate or repurpose the analysis.

Did you enjoy @planetmoney's "Modal American" episode but think, "if only it were nerdier"? You're in luck! Full results and code are now available on my GitHub page:https://t.co/b2u0zjYMpl
— Ben Casselman (@bencasselman) September 3, 2019

Finding the modal anything is interesting, but as someone who likes to think about education policy, I was happy to find that Isabella Velásquez used the methodology to look for the Modal School District.

Building on her work, I thought it would be interesting to aggregate IPEDS data to find the “Modal Institution of Higher Education.”

Who Cares?

Averages provide a shortcut for understanding what a “typical” thing (e.g. human height, widget, economy, giraffe age, test score, price of tacos, etc.) might be. As we learned from Planet Money, our perceptions of what we think of as the “average American” might be inaccurate.

If journalists or policymakers focus on an inaccurate version of “typical,” then efforts to improve systems might begin from an inaccurate baseline.

American Institutions of Higher Education

Let me illustrate why this is a problem.

Imagine a “typical” college or university. In your mind you probably have a four-year, highly-selective, medium-sized, private university on the East Coast. Ivy grows on Gothic buildings. Professors wear tweed while lecturing in front of equation-filled chalkboards while earnest students take notes. Frisbees are absolutely everywhere.

This perception is, unfortunately, what drives the conversation about American Higher Education. Think about the recent Operation Varsity Blues incident.

To be clear, there was legitimate corruption perpetrated by bad actors looking to exploit a system that is largely built on trust. But was it, as Fareed Zakaria presented, a “crisis”?

If an “average” college or university is the one described above, then maybe so. But if typical describes something else, then perhaps we’re missing much more interesting or important stories about higher education in America.

So what is a “modal” college or university in the United States?

Let’s find out.

Analysis

Data Pull

Finding the “modal institution” will depend entirely on what features are selected. Rather than using features that might be influenced by the institution, I opted for institutional characteristics which are largely determined by funding mechanisms, student populations, degrees awarded, and location. Furthermore, I opted to use standardized, industry-accepted classifications and categories to avoid subjective binning problems.

Using the Integrated Postsecondary Education Data System (IPEDS) web interface, I selected the universe of cases (n = 6,857) along with the following features:

var_name	var_desc
unit_id	IPEDS UNITID
institution_name	Institution Name
opeflag_hd2018	Title IV Flag
stabbr_hd2018	State
sector_hd2018	Sector
iclevel_hd2018	Degree Levels Offered
control_hd2018	Institution Funding Control
deggrant_hd2018	Degree Granting Flag
instcat_hd2018	Degree Granting Category
instsize_hd2018	Enrollment Size Category
c18basic_hd2018	Carnegie Classification
obereg_hd2018	Region Category

Data Import and Cleaning

# load libraries

library(tidyverse)
library(magrittr)
library(janitor)
library(dataMeta)
library(here)
library(kableExtra)

# read in raw data files from IPEDS. 

#file 1 is the raw data normalized
d <- read.csv(here("content/post/the-modal-institution-of-higher-education/data/","data_table.csv"),
              stringsAsFactors = F)

#file 2 is the data dictionary from NCES/IPEDS
e <- read.csv(here("content/post/the-modal-institution-of-higher-education/data/","ValueLabels_10-11-2019---735.csv"),
              stringsAsFactors = F)

I always like to take a quick look at the data, just to see what I am working with. As you can see below, the values are stored as numeric codes rather than useful, human-readable labels.

The data dictionary is in a long format. This means that each row in the dictionary represents one factor level for one of the features. In order to append the labels to the primary data frame, the dictionary needs to be converted into a wide format, with one column of labels per variable.

#clean column names

d <- clean_names(d,case = "snake")
glimpse(d)

## Rows: 6,857
## Columns: 12
## $ unit_id          <int> 240985, 177834, 180203, 491464, 493105, 459523, 48550…
## $ institution_name <chr> "\tEducational Technical College-Recinto de Bayamon",…
## $ opeflag_hd2018   <int> 1, 1, 1, 1, 1, 1, 1, 1, 5, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ stabbr_hd2018    <chr> "PR", "MO", "MT", "CA", "CA", "TX", "CA", "MI", "OR",…
## $ sector_hd2018    <int> 6, 2, 4, 7, 99, 9, 9, 9, 9, 2, 1, 3, 9, 3, 9, 6, 7, 2…
## $ iclevel_hd2018   <int> 2, 1, 2, 3, -3, 3, 3, 3, 3, 1, 1, 1, 3, 1, 3, 2, 3, 1…
## $ control_hd2018   <int> 3, 2, 1, 1, -3, 3, 3, 3, 3, 2, 1, 3, 3, 3, 3, 3, 1, 2…
## $ deggrant_hd2018  <int> 1, 1, 1, 2, -3, 2, 2, 2, 2, 1, 1, 1, 2, 1, 2, 2, 2, 1…
## $ instcat_hd2018   <int> 4, 1, 4, 6, -2, 6, 6, 6, 6, 2, 3, 2, 6, 3, 6, 6, 6, 1…
## $ instsize_hd2018  <int> 1, 2, 1, 1, -2, 1, 1, 1, 1, 3, 2, 1, 1, 1, 1, 1, 1, 1…
## $ c18basic_hd2018  <int> -2, 25, 33, -2, -2, -2, -2, -2, -2, 18, 23, 31, -2, 2…
## $ obereg_hd2018    <int> 9, 4, 7, 8, 8, 6, 8, 3, 8, 6, 5, 8, 9, 4, 1, 7, 5, 5,…

e <- clean_names(e,case = "snake")
glimpse(e)

## Rows: 146
## Columns: 3
## $ variable_name <chr> "STABBR (HD2018)", "STABBR (HD2018)", "STABBR (HD2018)",…
## $ value         <chr> "AL", "AK", "AZ", "AR", "CA", "CO", "CT", "DE", "DC", "F…
## $ value_label   <chr> "Alabama", "Alaska", "Arizona", "Arkansas", "California"…

Although I could force the joins to match on different variable names, it is easier to lean on the automatic matching functionality in the Tidyverse. In order to leverage this, the variable names in both files required some standardizing. Below is that process written in my odd dialect of TidyBase and in direct contravention of DRY regulations.

#remove the year information after the underscore in the variable names
names(d) <- gsub("_[^_]+$","",names(d))

#match those names in the description file

names(e)

## [1] "variable_name" "value"         "value_label"

table(e$variable_name)

## 
## C18BASIC (HD2018)  CONTROL (HD2018) DEGGRANT (HD2018)  ICLEVEL (HD2018) 
##                34                 4                 3                 4 
##  INSTCAT (HD2018) INSTSIZE (HD2018)   OBEREG (HD2018)  OPEFLAG (HD2018) 
##                 8                 7                10                 6 
##   SECTOR (HD2018)   STABBR (HD2018) 
##                11                59

e$variable_name %<>% tolower(.)
e$variable_name <- gsub("hd2018",replacement = "", e$variable_name)
e$variable_name <- gsub("\\(",replacement = "", e$variable_name)
e$variable_name <- gsub("\\)",replacement = "", e$variable_name)
e$variable_name <- gsub(" ",replacement = "", e$variable_name)

table(e$variable_name)

## 
## c18basic  control deggrant  iclevel  instcat instsize   obereg  opeflag 
##       34        4        3        4        8        7       10        6 
##   sector   stabbr 
##       11       59
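For readers who prefer a single step, the label cleanup above could be collapsed into one regular expression. This is only a minimal sketch on a few sample names; the `raw_names` and `data_names` vectors are illustrative and not the full set of variables.

```r
# Illustrative inputs copied from the output above (not the full variable list)
raw_names  <- c("STABBR (HD2018)", "C18BASIC (HD2018)")
data_names <- c("unit_id", "stabbr_hd2018", "c18basic_hd2018")

# Dictionary side: drop the trailing " (HD2018)" and lowercase in one pass
dict_clean <- tolower(gsub("\\s*\\(hd2018\\)$", "", raw_names, ignore.case = TRUE))

# Data side: the same regex used earlier strips everything after the last underscore
data_clean <- gsub("_[^_]+$", "", data_names)

dict_clean  # "stabbr" "c18basic"
data_clean  # "unit" "stabbr" "c18basic"
```

Note that the underscore rule also shortens `unit_id` to `unit` and `institution_name` to `institution`, which is why those shorter names show up later in the data frame.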

Next we’re going to spread out the data dictionary from long to wide. I should probably admit that these transformations have always been conceptually confusing for me. Melt + cast, gather + spread, or pivot_longer/pivot_wider have always been exercises in trial + error.

# spread data dictionary from long to wide

e <- spread(e,key = variable_name,value = value_label)
glimpse(e)

## Rows: 97
## Columns: 11
## $ value    <chr> "-1", "-2", "-3", "0", "1", "10", "11", "12", "13", "14", "15…
## $ c18basic <chr> NA, "Not applicable, not in Carnegie universe (not accredited…
## $ control  <chr> NA, NA, "{Not available}", NA, "Public", NA, NA, NA, NA, NA, …
## $ deggrant <chr> NA, NA, "{Not available}", NA, "Degree-granting", NA, NA, NA,…
## $ iclevel  <chr> NA, NA, "{Not available}", NA, "Four or more years", NA, NA, …
## $ instcat  <chr> "Not reported", "Not applicable", NA, NA, "Degree-granting, g…
## $ instsize <chr> "Not reported", "Not applicable", NA, NA, "Under 1,000", NA, …
## $ obereg   <chr> NA, NA, NA, "US Service schools", "New England CT ME MA NH RI…
## $ opeflag  <chr> NA, NA, NA, NA, "Participates in Title IV federal financial a…
## $ sector   <chr> NA, NA, NA, "Administrative Unit", "Public, 4-year or above",…
## $ stabbr   <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
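`spread()` still works, but tidyr now recommends `pivot_wider()` for the same reshaping. A minimal sketch of the equivalent call on a made-up slice of a dictionary (the `dict_long` rows are illustrative, not copied from the IPEDS file):

```r
library(tidyr)

# Toy long-format dictionary: one row per (variable, code) pair
dict_long <- data.frame(
  variable_name = c("control", "control", "iclevel", "iclevel", "iclevel"),
  value         = c("1", "2", "1", "2", "3"),
  value_label   = c("Public", "Private not-for-profit",
                    "Four or more years",
                    "At least 2 but less than 4 years",
                    "Less than 2 years (below associate)"),
  stringsAsFactors = FALSE
)

# One row per code, one column of labels per variable;
# codes a variable does not use are filled with NA
dict_wide <- pivot_wider(dict_long,
                         names_from  = variable_name,
                         values_from = value_label)

dict_wide
```

The NA cells are the same pattern seen in the real output above: each code only has a label for the variables that actually use it.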

# convert all numeric/integer variables to characters to facilitate joins

e <- e %>%
        mutate_if(is.integer,as.character) 

d <- d %>%
        mutate_if(is.integer,as.character)

The last data cleaning step is merging the data file with the data dictionary labels. There are many different ways to achieve this. I went with a for loop that pulls each set of labels, renames it, and appends it.

#select variable names into vector
varnames <- names(e)
varnamestest <- varnames[2:length(varnames)]

#begin for loop

#For each of the variables in the data dictionary: 
# - select the variable's label column and the coded value
# - remove NAs
# - name the label column <variable>_val and the code column <variable>
# - left join onto the data frame by the coded value
# - repeat

for(i in varnamestest){
        
        temp_dict <- e %>%
                select(all_of(i),value) %>%
                na.omit()
        
        names(temp_dict) <- c(paste0(i,"_val"),i)
        
        d <- left_join(d,temp_dict,by = i)
        
}

glimpse(d)

## Rows: 6,857
## Columns: 22
## $ unit         <chr> "240985", "177834", "180203", "491464", "493105", "459523…
## $ institution  <chr> "\tEducational Technical College-Recinto de Bayamon", "A …
## $ opeflag      <chr> "1", "1", "1", "1", "1", "1", "1", "1", "5", "1", "1", "1…
## $ stabbr       <chr> "PR", "MO", "MT", "CA", "CA", "TX", "CA", "MI", "OR", "TX…
## $ sector       <chr> "6", "2", "4", "7", "99", "9", "9", "9", "9", "2", "1", "…
## $ iclevel      <chr> "2", "1", "2", "3", "-3", "3", "3", "3", "3", "1", "1", "…
## $ control      <chr> "3", "2", "1", "1", "-3", "3", "3", "3", "3", "2", "1", "…
## $ deggrant     <chr> "1", "1", "1", "2", "-3", "2", "2", "2", "2", "1", "1", "…
## $ instcat      <chr> "4", "1", "4", "6", "-2", "6", "6", "6", "6", "2", "3", "…
## $ instsize     <chr> "1", "2", "1", "1", "-2", "1", "1", "1", "1", "3", "2", "…
## $ c18basic     <chr> "-2", "25", "33", "-2", "-2", "-2", "-2", "-2", "-2", "18…
## $ obereg       <chr> "9", "4", "7", "8", "8", "6", "8", "3", "8", "6", "5", "8…
## $ c18basic_val <chr> "Not applicable, not in Carnegie universe (not accredited…
## $ control_val  <chr> "Private for-profit", "Private not-for-profit", "Public",…
## $ deggrant_val <chr> "Degree-granting", "Degree-granting", "Degree-granting", …
## $ iclevel_val  <chr> "At least 2 but less than 4 years", "Four or more years",…
## $ instcat_val  <chr> "Degree-granting, associate's and certificates \n", "Degr…
## $ instsize_val <chr> "Under 1,000", "1,000 - 4,999", "Under 1,000", "Under 1,0…
## $ obereg_val   <chr> "Outlying areas AS FM GU MH MP PR PW VI", "Plains IA KS M…
## $ opeflag_val  <chr> "Participates in Title IV federal financial aid programs"…
## $ sector_val   <chr> "Private for-profit, 2-year", "Private not-for-profit, 4-…
## $ stabbr_val   <chr> "Puerto Rico", "Missouri", "Montana", "California", "Cali…

Aside from a few weird character returns, everything is looking good. Now on to the aggregations and counts.

Aggregations

The final step is counting the number of institutions within each combination of these bins. Initially, I was planning to copy Ben Casselman’s code but opted to use the updated version from Isabella Velásquez’s “Modal School District” instead.

#use Casselman/Velásquez code to find modal IHE

d.agg <- d %>%
        count(c18basic_val,
              control_val,
              deggrant_val,
              iclevel_val,
              instcat_val,
              instsize_val,
              obereg_val,
              opeflag_val,
              sector_val,
              stabbr_val,
              sort = TRUE) %>%
        slice(1:3)
        
glimpse(d.agg)

## Rows: 3
## Columns: 11
## $ c18basic_val <chr> "Not applicable, not in Carnegie universe (not accredited…
## $ control_val  <chr> "Private for-profit", "Private for-profit", "Private for-…
## $ deggrant_val <chr> "Nondegree-granting, primarily postsecondary", "Nondegree…
## $ iclevel_val  <chr> "Less than 2 years (below associate)", "Less than 2 years…
## $ instcat_val  <chr> "Nondegree-granting, sub-baccalaureate", "Nondegree-grant…
## $ instsize_val <chr> "Under 1,000", "Under 1,000", "Under 1,000"
## $ obereg_val   <chr> "Far West AK CA HI NV OR WA", "Southwest AZ NM OK TX", "S…
## $ opeflag_val  <chr> "Participates in Title IV federal financial aid programs"…
## $ sector_val   <chr> "Private for-profit, less-than 2-year", "Private for-prof…
## $ stabbr_val   <chr> "California", "Texas", "Florida"
## $ n            <int> 163, 142, 97

Results

kable(d.agg) %>%
        kable_styling(font_size = 9)

c18basic_val	control_val	deggrant_val	iclevel_val	instcat_val	instsize_val	obereg_val	opeflag_val	sector_val	stabbr_val	n
Not applicable, not in Carnegie universe (not accredited or nondegree-granting)	Private for-profit	Nondegree-granting, primarily postsecondary	Less than 2 years (below associate)	Nondegree-granting, sub-baccalaureate	Under 1,000	Far West AK CA HI NV OR WA	Participates in Title IV federal financial aid programs	Private for-profit, less-than 2-year	California	163
Not applicable, not in Carnegie universe (not accredited or nondegree-granting)	Private for-profit	Nondegree-granting, primarily postsecondary	Less than 2 years (below associate)	Nondegree-granting, sub-baccalaureate	Under 1,000	Southwest AZ NM OK TX	Participates in Title IV federal financial aid programs	Private for-profit, less-than 2-year	Texas	142
Not applicable, not in Carnegie universe (not accredited or nondegree-granting)	Private for-profit	Nondegree-granting, primarily postsecondary	Less than 2 years (below associate)	Nondegree-granting, sub-baccalaureate	Under 1,000	Southeast AL AR FL GA KY LA MS NC SC TN VA WV	Participates in Title IV federal financial aid programs	Private for-profit, less-than 2-year	Florida	97

A quick look at the table reveals that several of the top spots differ only by geography. It’s mildly interesting that so many of these institutions are in California, Texas, or Florida, but given the population sizes there, this seems overly granular. Removing both the state and regional aggregations produces this:

kable(d.agg) %>%
        kable_styling(font_size = 9)

c18basic_val	control_val	deggrant_val	iclevel_val	instcat_val	instsize_val	opeflag_val	sector_val	n
Not applicable, not in Carnegie universe (not accredited or nondegree-granting)	Private for-profit	Nondegree-granting, primarily postsecondary	Less than 2 years (below associate)	Nondegree-granting, sub-baccalaureate	Under 1,000	Participates in Title IV federal financial aid programs	Private for-profit, less-than 2-year	1452
Not applicable, not in Carnegie universe (not accredited or nondegree-granting)	Private for-profit	Nondegree-granting, primarily postsecondary	At least 2 but less than 4 years	Nondegree-granting, sub-baccalaureate	Under 1,000	Participates in Title IV federal financial aid programs	Private for-profit, 2-year	226
Not applicable, not in Carnegie universe (not accredited or nondegree-granting)	Public	Nondegree-granting, primarily postsecondary	Less than 2 years (below associate)	Nondegree-granting, sub-baccalaureate	Under 1,000	Participates in Title IV federal financial aid programs	Public, less-than 2-year	220
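The re-aggregation behind this table is not shown. Presumably it is the same `count()` call with `obereg_val` and `stabbr_val` left out. A minimal sketch of that idea on a made-up data frame (`toy` and its rows are illustrative only):

```r
library(dplyr)

# Toy data: three for-profit schools in different states, one public school
toy <- data.frame(
  control_val = c("Private for-profit", "Private for-profit",
                  "Private for-profit", "Public"),
  stabbr_val  = c("California", "Texas", "Florida", "California"),
  stringsAsFactors = FALSE
)

# Counting with geography splits otherwise identical institutions apart
with_geo <- toy %>% count(control_val, stabbr_val, sort = TRUE)

# Leaving the geographic column out of count() merges them back together
no_geo <- toy %>% count(control_val, sort = TRUE)

with_geo  # four rows, each with n = 1
no_geo    # "Private for-profit" first with n = 3
```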

With this higher-level aggregation, the modal institution of higher education appears to be:

Private for-profit
Nondegree-granting
Less than 2 years (below associate)
Under 1,000 enrollment
Receiving Title IV funding

There are 1,452 of these institutions comprising a sizable 21.2% of the total universe.
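The share follows directly from the two counts reported in this post. A quick check:

```r
modal_n    <- 1452  # institutions in the top combination
universe_n <- 6857  # rows in the IPEDS pull

share_pct <- round(100 * modal_n / universe_n, 1)
share_pct  # 21.2
```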

What, you might ask, is an example of a modal institution? To find out, I will append the counts back to the original data set and randomly select one for a closer look.

set.seed(8675309)

modal <- left_join(d,d.agg) %>%
        filter(n == max(n,na.rm = T)) %>%
        select(unit,institution,13:23) %>%
        sample_n(size = 1)

kable(modal) %>%
        kable_styling(font_size = 9)

unit	institution	c18basic_val	control_val	deggrant_val	iclevel_val	instcat_val	instsize_val	obereg_val	opeflag_val	sector_val	stabbr_val	n
438805	Branford Hall Career Institute-Springfield Campus	Not applicable, not in Carnegie universe (not accredited or nondegree-granting)	Private for-profit	Nondegree-granting, primarily postsecondary	Less than 2 years (below associate)	Nondegree-granting, sub-baccalaureate	Under 1,000	New England CT ME MA NH RI VT	Participates in Title IV federal financial aid programs	Private for-profit, less-than 2-year	Massachusetts	1452

Acknowledgements

I am far from the first to point out that the colleges in the movies are not the typical American institution. Since this is something like universal knowledge among higher ed wonks, I won’t be able to cite all of the places this argument is found. But a few of note are:

Special thanks to Isabella Velásquez and Ben Casselman for their code contributions and shoulders to stand on.

Of course, thanks to Planet Money for continuing to create such awesome content nearly 1,000 episodes in.

Errors and interpretations are the author’s alone. Please submit edits or errors to GitHub.

datascience highered IPEDS

Brad Weiner
