
Predicting Education Level for Southern Region of USA (using Census Data)

Abstract

To implement effective education reform across the United States, it is vital to know where education programs would be most effective. Such areas can be identified as those where higher levels of education are rarely attained. This paper uses five years of detailed census data from IPUMS USA (usa.ipums.org) to predict an individual's education level. Using big data analysis techniques such as Principal Component Analysis (PCA), cross-validation, and grid search, a Random Forest model was trained that reaches high accuracy (80.2%) and recall (76.4%), predicting education level while minimizing false negatives (individuals predicted to be uneducated who are actually educated). Logistic Regression, Support Vector Machine, and Gradient Boosting models were also trained and evaluated, but the Random Forest yielded the highest accuracy and recall, making it the best choice for identifying areas that are truly uneducated so that they can receive education assistance.

The data come from the American Community Survey (ACS) 2015-2019 5-Year Sample, available from usa.ipums.org, and contain all households and persons from the 1% ACS samples for 2015-2019, identifiable by year. The data include information such as age, income, health insurance, and other demographic variables. In total, there are 137 variables, and the file is 2.3 GB. The study focuses on the South Region. The smallest identifiable geographic unit is the Public Use Microdata Area (PUMA); each PUMA contains at least 100,000 people.

Code Book for variables in the data: https://github.com/narashil/DS-5110-Project/blob/main/usa_00001.cbk

IPUMS USA documentation for the ACS multi-year samples: https://usa.ipums.org/usa/acs_multyr.shtml

Read In Data

Preprocess Data

EDA

Full Data EDA

Education EDA

Gender EDA

Balance the Data for a Similar Number of Each EDUC FLAG Value
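
One common way to balance a binary flag is to randomly downsample the larger class to the size of the smaller one. The sketch below illustrates that idea in plain Python on toy data. It is an assumption that the notebook balanced by downsampling, and the function name `downsample_to_balance`, the column name `EDUC_FLAG`, the `AGE` values, and the seed are all hypothetical.

```python
import random
from collections import defaultdict


def downsample_to_balance(rows, label_key, seed=42):
    """Randomly downsample every class to the size of the smallest class."""
    rng = random.Random(seed)

    # Group the rows by their label value.
    by_label = defaultdict(list)
    for row in rows:
        by_label[row[label_key]].append(row)

    # The smallest class sets the target size for every class.
    target = min(len(group) for group in by_label.values())

    balanced = []
    for group in by_label.values():
        balanced.extend(rng.sample(group, target))

    # Shuffle so the classes are not stored in contiguous blocks.
    rng.shuffle(balanced)
    return balanced


# Hypothetical toy data: 8 rows flagged 0 and 3 rows flagged 1.
rows = [{"AGE": 20 + i, "EDUC_FLAG": 0} for i in range(8)]
rows += [{"AGE": 40 + i, "EDUC_FLAG": 1} for i in range(3)]

balanced = downsample_to_balance(rows, "EDUC_FLAG")
counts = {flag: sum(r["EDUC_FLAG"] == flag for r in balanced) for flag in (0, 1)}
print(counts)  # {0: 3, 1: 3}
```

Downsampling discards rows from the majority class, which is usually acceptable when the full data set is large, as it is here (2.3 GB).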

EDA On Sampled Data

Model Construction

Baseline Logistic Regression Model With Only 36 Features

Evaluate Baseline Model
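
Accuracy and recall, the two metrics reported in the abstract, can both be derived from the four confusion-matrix counts. The following is a minimal plain-Python sketch on made-up labels, assuming the positive class (1) means "educated", which matches the abstract's definition of a false negative. The function names and the toy labels are hypothetical and are not the notebook's own evaluation code.

```python
def confusion_counts(actual, predicted, positive=1):
    """Return (tp, fp, tn, fn) for a binary classification problem."""
    tp = fp = tn = fn = 0
    for a, p in zip(actual, predicted):
        if p == positive and a == positive:
            tp += 1
        elif p == positive and a != positive:
            fp += 1
        elif p != positive and a != positive:
            tn += 1
        else:
            # Predicted negative (uneducated) but actually positive (educated).
            fn += 1
    return tp, fp, tn, fn


def accuracy(tp, fp, tn, fn):
    """Share of all predictions that are correct."""
    return (tp + tn) / (tp + fp + tn + fn)


def recall(tp, fn):
    """Share of actual positives that the model found."""
    return tp / (tp + fn) if (tp + fn) else 0.0


# Hypothetical labels: 1 = educated, 0 = uneducated.
actual = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
predicted = [1, 1, 1, 0, 0, 0, 0, 0, 1, 1]

tp, fp, tn, fn = confusion_counts(actual, predicted)
acc = accuracy(tp, fp, tn, fn)
rec = recall(tp, fn)
print(tp, fp, tn, fn)  # 3 2 4 1
print(acc, rec)        # 0.7 0.75
```

Because recall has false negatives in its denominator, raising recall is equivalent to lowering the false-negative count for a fixed number of actual positives.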

Vector Assemble all features

Split Data Into Train and Test Sets

Scale Data to Prepare for PCA
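
PCA is sensitive to feature scale, so features are typically standardized to zero mean and unit variance first. The sketch below shows that step in plain Python, fitting the statistics on the training rows only and reusing them for the test rows. The function names, the toy values, and the choice of population standard deviation are assumptions made for illustration.

```python
from statistics import fmean, pstdev


def fit_scaler(train_matrix):
    """Compute per-column mean and population standard deviation."""
    columns = list(zip(*train_matrix))
    means = [fmean(col) for col in columns]
    stds = [pstdev(col) for col in columns]
    return means, stds


def transform(matrix, means, stds):
    """Standardize each column; constant columns are mapped to 0.0."""
    return [
        [(v - m) / s if s > 0 else 0.0 for v, m, s in zip(row, means, stds)]
        for row in matrix
    ]


# Hypothetical toy features: two columns on very different scales.
train = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]]
test = [[2.0, 40.0]]

means, stds = fit_scaler(train)
train_scaled = transform(train, means, stds)
test_scaled = transform(test, means, stds)  # reuses the training statistics

print(means)  # [2.0, 20.0]
```

Fitting the scaler on the training split alone keeps information from the test split out of the preprocessing step.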

PCA

Logistic Regression Pipeline and Evaluation

Models

Param Grids
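
A parameter grid for grid search is the Cartesian product of the candidate values for each hyperparameter, and cross-validation fits every combination once per fold. The plain-Python sketch below expands a small grid and counts the resulting model fits. The hyperparameter names `numTrees` and `maxDepth`, their values, and the fold count are hypothetical and are not taken from the notebook.

```python
from itertools import product


def expand_param_grid(grid):
    """Expand {name: [values]} into a list of {name: value} combinations."""
    names = sorted(grid)
    return [
        dict(zip(names, values))
        for values in product(*(grid[name] for name in names))
    ]


# Hypothetical Random Forest grid: 2 x 3 = 6 combinations.
rf_grid = {"numTrees": [50, 100], "maxDepth": [5, 10, 15]}
candidates = expand_param_grid(rf_grid)

num_folds = 5  # assumed number of cross-validation folds
total_fits = len(candidates) * num_folds

print(len(candidates))  # 6
print(candidates[0])    # {'maxDepth': 5, 'numTrees': 50}
print(total_fits)       # 30
```

The fit count grows multiplicatively with each added hyperparameter, which is why grids on large data sets are usually kept small.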

Support Vector Machine Pipeline, Model, Cross Validation and Evaluation

Gradient Boosting Pipeline, Model, Cross Validation and Evaluation

Random Forest Pipeline, Model, Cross Validation and Evaluation

Interpretation of PCA