Urban Walkability and Health Outcomes - Shakthi Manjanath and Quinn Stevens, Link to our GitHub webpage¶
Our Motivation and Introduction¶
Project Goals¶
The city that you live in determines many aspects of your life. The design of your environment impacts how you interact with healthcare, businesses, transportation, and other people. An important aspect of urban design is walkability. This is the term to describe how easy it is to walk around an area as means of transportation every day. In highly walkable neighborhoods, people are able to engage in physical activity when they commute, shop, or meet up with other people. Contrastingly, car-dependent neighborhoods require less physical activity to get from point A to point B.
This topic is important in the context of American health as millions of Americans suffer from chronic diseases like hypertension, obesity, arthritis, and depression. For this project, we will be exploring how multiple large data sets can be used together to explore how urban design affects health. We will use environmental, economic, and health data sets to see if walkability in urban environments serves as a predictor of health outcomes.
Dataset Questions¶
- How much does an American city's average Walkability Index score serve as a significant predictor for the prevalence of hypertension (High Blood Pressure) medication usage?
- Does the positive effect of walkability stay consistent with different chronic diseases (Hypertension, Arthritis, Obesity) versus social health effects like Loneliness?
- Does the relationship between urban walkability and hypertension still exist when controlling for median household income?
Project Datasets¶
To answer our questions, we are using several datasets. One is the National Walkability Index (2025) dataset that characterizes every Census 2019 block group in the U.S. based on its relative walkability. The second data set is 500 Cities: Local Data for Better Health (2025), which provides small-area estimates for 27 chronic disease measures. We also use US Census Bureau ACS 5-Year Estimates and Zillow Home Value Index (ZHVI) for data related to a city's wealth and the housing market.
The National Walkability Index is a 2025 update of the EPA’s Smart Location Database, which uses 2019 Census block group boundaries to score the built environment’s influence on pedestrian travel. These scores are derived from a weighted equation involving four variables: land-use mix, intersection density, transit proximity, and employment-to-housing ratios. The 500 Cities dataset provides model-based small-area estimates for the 500 largest U.S. cities, covering approximately 28,000 census tracts. Developed by the CDC and the Robert Wood Johnson Foundation, this dataset uses 2014-2015 BRFSS and ACS data to report on 27 chronic disease measures, including unhealthy behaviors and prevention services. By merging these sets, we can investigate how neighborhood design correlates with health outcomes. The US Census Bureau ACS 5-Year Estimates show granular data about Median Household Income at the city and census-tract level. Wealth is an important factor that affects health, and so this data set allows us to consider this when we attribute health benefits to walkability. The Zillow Home Value Index is a measure of how the home value market changes across parts of the US. This allows us to consider different housing markets as a socioeconomic factor in health.
Collaboration Plan¶
The team plans to continue to meet weekly over Zoom to collaborate on the project. We also communicate through text messages and email to send each other updates and resources. Additionally, we have started a Word document that has an ongoing to-do list. We have set up a Google Colab to work on our code together and have a shared GitHub repository together.
Data Collection¶
from google.colab import drive
drive.mount('/content/drive')
Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
%%shell
jupyter nbconvert --to html "/content/drive/My Drive/Colab Notebooks/smanjanath_qstevens.ipynb"
[NbConvertApp] Converting notebook /content/drive/My Drive/Colab Notebooks/smanjanath_qstevens.ipynb to html [NbConvertApp] WARNING | Alternative text is missing on 9 image(s). [NbConvertApp] Writing 1992562 bytes to /content/drive/My Drive/Colab Notebooks/smanjanath_qstevens.html
To load and inspect the datasets we import several Python libraries that we will use for data visualization, manipulation, and modeling. We use pandas for tabular data processing, NumPy to conduct numerical operations, and matplotlib for visualizations.
Then, we load the two base datasets:
- The EPA Smart Location Database with walkability metrics
- The CDC 500 Cities dataset that has health outcomes for cities
We use .info() to understand the structure and data types of our datasets.
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
df_health = pd.read_csv("/content/drive/My Drive/Colab Notebooks/500_Cities__Local_Data_for_Better_Health__2017_release.csv")
df_walking = pd.read_csv("/content/drive/My Drive/Colab Notebooks/EPA_SmartLocationDatabase_V3_Jan_2021_Final.csv")
# this shows initial data types for df_health
print("df_health info:")
df_health.info()
# this will display initial data types for df_walking
print("\ndf_walking info:")
df_walking.info()
df_health info: <class 'pandas.core.frame.DataFrame'> RangeIndex: 810103 entries, 0 to 810102 Data columns (total 24 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 Year 810103 non-null int64 1 StateAbbr 810103 non-null object 2 StateDesc 810103 non-null object 3 CityName 810047 non-null object 4 GeographicLevel 810103 non-null object 5 DataSource 810103 non-null object 6 Category 810103 non-null object 7 UniqueID 810103 non-null object 8 Measure 810103 non-null object 9 Data_Value_Unit 810103 non-null object 10 DataValueTypeID 810103 non-null object 11 Data_Value_Type 810103 non-null object 12 Data_Value 789600 non-null float64 13 Data_Value_Footnote_Symbol 20503 non-null object 14 Data_Value_Footnote 20503 non-null object 15 Low_Confidence_Limit 789600 non-null float64 16 High_Confidence_Limit 789600 non-null float64 17 Population2010 810103 non-null int64 18 GeoLocation 810047 non-null object 19 CategoryID 810103 non-null object 20 MeasureId 810103 non-null object 21 CityFIPS 810047 non-null float64 22 TractFIPS 782047 non-null float64 23 Short_Question_Text 810103 non-null object dtypes: float64(5), int64(2), object(17) memory usage: 148.3+ MB df_walking info: <class 'pandas.core.frame.DataFrame'> RangeIndex: 220740 entries, 0 to 220739 Columns: 117 entries, OBJECTID to Shape_Area dtypes: float64(73), int64(42), object(2) memory usage: 197.0+ MB
Data Cleaning¶
Both of the EPA and CDC datasets were made separately and don't have the same city name conventions. So, we created a standardized city_clean variable. To do this we:
- Converted the city names to lowercase
- Removed excess whitespace using .strip()
- Parsed naming conventions using .split()
Parse df_walking to make clean city names, without the other unnecessary info
After parsing, make a 'city_clean' column with the city names in it, for both data sets
df_walking['city_clean'] = df_walking['CBSA_Name'].str.split(',').str[0]
df_walking['city_clean'] = df_walking['city_clean'].str.split('-').str[0]
df_walking['city_clean'] = df_walking['city_clean'].str.strip().str.lower()
df_health['city_clean'] = df_health['CityName'].str.strip().str.lower()
# this shows data types after cleaning city names for df_walking
print("df_walking info after city clean:")
df_walking.info()
# this shows data types after cleaning city names for df_health
print("\ndf_health info after city clean:")
df_health.info()
df_walking info after city clean: <class 'pandas.core.frame.DataFrame'> RangeIndex: 220740 entries, 0 to 220739 Columns: 118 entries, OBJECTID to city_clean dtypes: float64(73), int64(42), object(3) memory usage: 198.7+ MB df_health info after city clean: <class 'pandas.core.frame.DataFrame'> RangeIndex: 810103 entries, 0 to 810102 Data columns (total 25 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 Year 810103 non-null int64 1 StateAbbr 810103 non-null object 2 StateDesc 810103 non-null object 3 CityName 810047 non-null object 4 GeographicLevel 810103 non-null object 5 DataSource 810103 non-null object 6 Category 810103 non-null object 7 UniqueID 810103 non-null object 8 Measure 810103 non-null object 9 Data_Value_Unit 810103 non-null object 10 DataValueTypeID 810103 non-null object 11 Data_Value_Type 810103 non-null object 12 Data_Value 789600 non-null float64 13 Data_Value_Footnote_Symbol 20503 non-null object 14 Data_Value_Footnote 20503 non-null object 15 Low_Confidence_Limit 789600 non-null float64 16 High_Confidence_Limit 789600 non-null float64 17 Population2010 810103 non-null int64 18 GeoLocation 810047 non-null object 19 CategoryID 810103 non-null object 20 MeasureId 810103 non-null object 21 CityFIPS 810047 non-null float64 22 TractFIPS 782047 non-null float64 23 Short_Question_Text 810103 non-null object 24 city_clean 810047 non-null object dtypes: float64(5), int64(2), object(18) memory usage: 154.5+ MB
Group the data by the clean_city column we made earlier, then merge the sets.
df_walkability_city = df_walking.groupby('city_clean').agg({'NatWalkInd': 'mean', 'Pct_AO0': 'mean', 'D3B': 'mean', 'D4A': 'mean',}).reset_index()
df = df_health.merge(df_walkability_city, on='city_clean', how='inner')
Aggregation Logic for Walkability Data¶
What?: In this part we aggregated the df_walking data by using city_clean and calculating the mean between NatWalkInd, Pct_AO0, D3B, and D4A. This df_walking dataset contains data at the Census Block Group level
Why?: The health outcome dataset has data available at the city level. By aggregating the block group-level walkability data to a city-level average using groupby(), we could explore our research questions at a city level.
Insights: Averaging the block group walkability scores into a city-level NatWalkInd helps us to compare the general walkability of different urban areas with their corresponding health metrics. Now we can:
- Create a constant geographic unit for merging purposes
- Compare urban walkability between cities
- Reduce noise of localized block-group variations
df_zillow = pd.read_csv("/content/drive/My Drive/Colab Notebooks/City_zhvi_uc_sfrcondo_tier_0.33_0.67_sm_sa_month.csv")
date_cols = [c for c in df_zillow.columns if c.startswith('20')]
latest_col = date_cols[-1]
print(f"Using column: {latest_col}")
# Filter for RegionType 'city' to avoid duplicates for the same city name
df_zillow_filtered = df_zillow[df_zillow['RegionType'] == 'city'].copy()
df_zillow_filtered['city_clean'] = df_zillow_filtered['RegionName'].str.strip().str.lower()
df_zillow_city = df_zillow_filtered[['city_clean', latest_col]].copy()
df_zillow_city.rename(columns={latest_col: 'home_value'}, inplace=True)
df_plot_zillow = df_walkability_city.merge(df_zillow_city, on='city_clean', how='inner')
print(f"Matched cities: {len(df_plot_zillow)}")
print(df_plot_zillow[['city_clean', 'NatWalkInd', 'home_value']].head(10))
Using column: 2026-03-31 Matched cities: 3036 city_clean NatWalkInd home_value 0 aberdeen 9.185567 240644.332155 1 aberdeen 9.185567 343947.667771 2 aberdeen 9.185567 273201.242877 3 aberdeen 9.185567 352796.288282 4 aberdeen 9.185567 142494.040733 5 aberdeen 9.185567 301127.778633 6 aberdeen 9.185567 198707.400195 7 abilene 7.322917 210016.002457 8 abilene 7.322917 184104.647848 9 ada 7.005208 180731.575584
# data types for the merged df
print("df info after merge:")
df.info()
df info after merge: <class 'pandas.core.frame.DataFrame'> RangeIndex: 573809 entries, 0 to 573808 Data columns (total 29 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 Year 573809 non-null int64 1 StateAbbr 573809 non-null object 2 StateDesc 573809 non-null object 3 CityName 573809 non-null object 4 GeographicLevel 573809 non-null object 5 DataSource 573809 non-null object 6 Category 573809 non-null object 7 UniqueID 573809 non-null object 8 Measure 573809 non-null object 9 Data_Value_Unit 573809 non-null object 10 DataValueTypeID 573809 non-null object 11 Data_Value_Type 573809 non-null object 12 Data_Value 559863 non-null float64 13 Data_Value_Footnote_Symbol 13946 non-null object 14 Data_Value_Footnote 13946 non-null object 15 Low_Confidence_Limit 559863 non-null float64 16 High_Confidence_Limit 559863 non-null float64 17 Population2010 573809 non-null int64 18 GeoLocation 573809 non-null object 19 CategoryID 573809 non-null object 20 MeasureId 573809 non-null object 21 CityFIPS 573809 non-null float64 22 TractFIPS 559865 non-null float64 23 Short_Question_Text 573809 non-null object 24 city_clean 573809 non-null object 25 NatWalkInd 573809 non-null float64 26 Pct_AO0 573809 non-null float64 27 D3B 573809 non-null float64 28 D4A 573809 non-null float64 dtypes: float64(9), int64(2), object(18) memory usage: 127.0+ MB
Filter to create a blood pressure data set using df_health, then make a new data frame for the mean of each city
The primary challenge in merging the EPA’s Walkability Index and the CDC’s 500 Cities data was a mismatch in geographic granularity. The walkability data is recorded at the Census Block Group level, and the health outcomes are reported by Census Tract or City Name. To resolve this, we aggregated the block group data using a groupby operation to create a consistent city-level average.
Furthermore, significant string normalization was required to execute a successful "inner join." We standardized the city columns by applying .lower() and .strip() functions, removing whitespace and case inconsistencies that would have prevented the records from matching. This ensured a tidy and unified dataset.
df_bp = df_health[df_health['Short_Question_Text'] == 'Taking BP Medication']
df_bp_city = df_bp.groupby('city_clean')['Data_Value'].mean().reset_index()
df_bp_city.rename(columns={'Data_Value': 'bp_med_rate'}, inplace=True)
# displays data types for df_bp and df_bp_city
print("df_bp info:")
df_bp.info()
print("\ndf_bp_city info:")
df_bp_city.info()
df_bp info: <class 'pandas.core.frame.DataFrame'> Index: 29006 entries, 8 to 810065 Data columns (total 25 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 Year 29006 non-null int64 1 StateAbbr 29006 non-null object 2 StateDesc 29006 non-null object 3 CityName 29004 non-null object 4 GeographicLevel 29006 non-null object 5 DataSource 29006 non-null object 6 Category 29006 non-null object 7 UniqueID 29006 non-null object 8 Measure 29006 non-null object 9 Data_Value_Unit 29006 non-null object 10 DataValueTypeID 29006 non-null object 11 Data_Value_Type 29006 non-null object 12 Data_Value 28212 non-null float64 13 Data_Value_Footnote_Symbol 794 non-null object 14 Data_Value_Footnote 794 non-null object 15 Low_Confidence_Limit 28212 non-null float64 16 High_Confidence_Limit 28212 non-null float64 17 Population2010 29006 non-null int64 18 GeoLocation 29004 non-null object 19 CategoryID 29006 non-null object 20 MeasureId 29006 non-null object 21 CityFIPS 29004 non-null float64 22 TractFIPS 28004 non-null float64 23 Short_Question_Text 29006 non-null object 24 city_clean 29004 non-null object dtypes: float64(5), int64(2), object(18) memory usage: 5.8+ MB df_bp_city info: <class 'pandas.core.frame.DataFrame'> RangeIndex: 474 entries, 0 to 473 Data columns (total 2 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 city_clean 474 non-null object 1 bp_med_rate 474 non-null float64 dtypes: float64(1), object(1) memory usage: 7.5+ KB
Exploratory Data Analysis¶
Here we use pd.cut() to separate continuous walkability scores into discrete ranges, bins. This allows us to easily visualize trends between bar charts to discover patterns.
df_plot_bp = df_walkability_city.merge(df_bp_city, on='city_clean', how='inner')
# Create bins for walkability
df_plot_bp['walk_bin'] = pd.cut(df_plot_bp['NatWalkInd'], bins=6)
df_binned = df_plot_bp.groupby('walk_bin', observed=False)['bp_med_rate'].mean().reset_index()
# Calculate midpoints for plotting
df_binned['walk_midpoint'] = df_binned['walk_bin'].apply(lambda x: round(x.mid, 1))
# Display data types for df_plot_bp and df_binned
print("df_plot_bp info:")
df_plot_bp.info()
print("\ndf_binned info:")
df_binned.info()
df_plot_bp info: <class 'pandas.core.frame.DataFrame'> RangeIndex: 228 entries, 0 to 227 Data columns (total 7 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 city_clean 228 non-null object 1 NatWalkInd 228 non-null float64 2 Pct_AO0 228 non-null float64 3 D3B 228 non-null float64 4 D4A 228 non-null float64 5 bp_med_rate 228 non-null float64 6 walk_bin 228 non-null category dtypes: category(1), float64(5), object(1) memory usage: 11.3+ KB df_binned info: <class 'pandas.core.frame.DataFrame'> RangeIndex: 6 entries, 0 to 5 Data columns (total 3 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 walk_bin 6 non-null category 1 bp_med_rate 6 non-null float64 2 walk_midpoint 6 non-null category dtypes: category(2), float64(1) memory usage: 680.0 bytes
Finally, create and show the bar chart
fig, ax = plt.subplots(figsize=(10, 6))
ax.set_ylim(68, 75)
ax.bar(df_binned['walk_midpoint'].astype(str), df_binned['bp_med_rate'], color='steelblue', edgecolor='white')
ax.set_xlabel('Walkability Score', fontsize=12)
ax.set_ylabel('Avg % of Adults Taking BP Medication', fontsize=12)
ax.set_title('Average BP Medication Rate by Walkability Group', fontsize=14)
plt.show()
# The graph shows a clear decrease in blood pressure medication rate as we go up in walkability score.
# This makes intuitive sense as we would assume that people who like walking more are probably taking less blood pressure medication.
Graph Explanation: Average BP Medication Rate by Walkability Group¶
What?: We made this bar chart to show the average percentage of adults taking blood pressure medication across different bins of the National Walkability Index score. Each bar represents a number range of walkability scores, and each bar's height indicates the mean blood pressure medication rate for the cities falling into that range.
Why?: This visualization serves to addresses the first project question by showing a preliminary relationship between walkability and hypertension medication usage.
Insights: The graph indicates a clear inverse relationship, as the walkability score increases, the average percentage of adults taking blood pressure medication tends to decrease.
# 3-5 Interesting Stats
print(f"Mean Walkability Score: {df['NatWalkInd'].mean():.2f}")
print(f"City with Highest Walkability: {df.loc[df['NatWalkInd'].idxmax(), 'city_clean']}")
print(f"Correlation between Walkability and BP Meds: {df_plot_bp['NatWalkInd'].corr(df_plot_bp['bp_med_rate']):.4f}")
Mean Walkability Score: 10.30 City with Highest Walkability: los angeles Correlation between Walkability and BP Meds: -0.2682
The average walkability score of 10.30 provides a baseline for the dataset, showing that the sampled urban areas sit near the midpoint of the EPA's index. This benchmark will help to identify outliers and serve as the "national average" against which specific health outcomes can be measured. Finding that Los Angeles has the highest walkability is particularly significant because it challenges the common stereotype of the city as a purely car-dependent area; this highlights the importance of analyzing data at the granular Census Block Group level to reveal high-density pedestrian pockets that are overlooked in city-wide generalizations.
The correlation coefficient of -0.2682 is the most vital statistic for the project’s goal, as the negative value directly supports the hypothesis that increased walkability is associated with a decrease in blood pressure medication usage.
Here we display the distribution of walkability scores across cities.
fig, ax = plt.subplots(figsize=(9, 5))
ax.hist(df_walkability_city['NatWalkInd'], bins=50)
ax.set_xlabel('score')
ax.set_ylabel('# of cities')
ax.set_title('Distribution of walkability scores across US cities')
plt.show()
Graph Explanation: Distribution of Walkability Scores Across U.S. Cities¶
What?: We created a histogram to display the frequency distribution of National Walkability Index scores across the cities included in our df_walkability_city DataFrame. The x axis is walkability score ranges and the y axis is the number of cities in each range.
Why?: Understanding the distribution of walkability scores helps to identify the common range of walkability in the sampled cities and whether the data is possibly skewed towards highly walkable or car-dependent places.
Insights: The distribution reveals that many cities fall within the mid-range of walkability scores, with less cities at the extreme ends of very low/high walkability.
Next we want to see how median household income of the 20 largest cities compares to the walkability index
df_zillow = pd.read_csv("/content/drive/My Drive/Colab Notebooks/City_zhvi_uc_sfrcondo_tier_0.33_0.67_sm_sa_month.csv")
date_cols = [c for c in df_zillow.columns if c.startswith('20')]
# to get last
latest_col = date_cols[-1]
print(f"Using column: {latest_col}")
# making a city_clean column to be able to merge data accurately
df_zillow['city_clean'] = df_zillow['RegionName'].str.strip().str.lower()
df_zillow_city = df_zillow[['city_clean', latest_col]].copy()
df_zillow_city.rename(columns={latest_col: 'home_value'}, inplace=True)
df_plot_zillow = df_walkability_city.merge(df_zillow_city, on='city_clean', how='inner')
print(f"cities: {len(df_plot_zillow)}")
print(df_plot_zillow[['city_clean', 'NatWalkInd', 'home_value']].head())
Using column: 2026-03-31 cities: 3036 city_clean NatWalkInd home_value 0 aberdeen 9.185567 240644.332155 1 aberdeen 9.185567 343947.667771 2 aberdeen 9.185567 273201.242877 3 aberdeen 9.185567 352796.288282 4 aberdeen 9.185567 142494.040733
# Display data types for Zillow DataFrames
print("df_zillow info:")
df_zillow.info()
print("\ndf_zillow_city info:")
df_zillow_city.info()
print("\ndf_plot_zillow info:")
df_plot_zillow.info()
df_zillow info: <class 'pandas.core.frame.DataFrame'> RangeIndex: 21425 entries, 0 to 21424 Columns: 324 entries, RegionID to city_clean dtypes: float64(315), int64(2), object(7) memory usage: 53.0+ MB df_zillow_city info: <class 'pandas.core.frame.DataFrame'> RangeIndex: 21425 entries, 0 to 21424 Data columns (total 2 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 city_clean 21425 non-null object 1 home_value 21424 non-null float64 dtypes: float64(1), object(1) memory usage: 334.9+ KB df_plot_zillow info: <class 'pandas.core.frame.DataFrame'> RangeIndex: 3036 entries, 0 to 3035 Data columns (total 6 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 city_clean 3036 non-null object 1 NatWalkInd 3036 non-null float64 2 Pct_AO0 3036 non-null float64 3 D3B 3036 non-null float64 4 D4A 3036 non-null float64 5 home_value 3036 non-null float64 dtypes: float64(5), object(1) memory usage: 142.4+ KB
import requests
CENSUS_API_KEY = "517f479f3e2894a5ef3dc4a58fb05b19ca64b310"
url = (
f"https://api.census.gov/data/2022/acs/acs5"
f"?get=NAME,B19013_001E"
f"&for=place:*"
f"&key={CENSUS_API_KEY}"
)
response = requests.get(url)
data = response.json()
df_income = pd.DataFrame(data[1:], columns=data[0])
df_income.rename(columns={'B19013_001E': 'median_income', 'NAME': 'place_name'}, inplace=True)
df_income['median_income'] = pd.to_numeric(df_income['median_income'], errors='coerce')
df_income = df_income[df_income['median_income'] > 0]
df_income['city_clean'] = df_income['place_name'].str.split(' city,').str[0]
df_income['city_clean'] = df_income['city_clean'].str.split(' town,').str[0]
df_income['city_clean'] = df_income['city_clean'].str.strip().str.lower()
print(f"Cities loaded: {len(df_income)}")
print(df_income[['city_clean', 'median_income']].head(10))
Cities loaded: 28530
city_clean median_income
0 abanda cdp, alabama 29263
1 abbeville 35147
2 adamsville 58631
3 addison 47188
4 akron 53929
5 alabaster 89423
6 albertville 55933
7 alexander city 42141
8 alexandria cdp, alabama 95039
9 aliceville 30152
# Display data types for df_income
print("df_income info:")
df_income.info()
df_income info: <class 'pandas.core.frame.DataFrame'> Index: 28530 entries, 0 to 32185 Data columns (total 5 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 place_name 28530 non-null object 1 median_income 28530 non-null int64 2 state 28530 non-null object 3 place 28530 non-null object 4 city_clean 28530 non-null object dtypes: int64(1), object(4) memory usage: 1.3+ MB
valid_cities = set(df_health['city_clean'].unique())
df_plot_income = df_walkability_city.merge(df_income[['city_clean', 'median_income']], on='city_clean', how='inner')
df_plot_income_filtered = df_plot_income[df_plot_income['city_clean'].isin(valid_cities)]
df_income_agg = df_plot_income_filtered.groupby('city_clean').agg(NatWalkInd=('NatWalkInd', 'mean'),median_income=('median_income', 'mean')).reset_index()
top20_cities = ['new york', 'los angeles', 'chicago', 'houston', 'phoenix','philadelphia', 'san antonio', 'san diego', 'dallas', 'jacksonville','austin', 'san jose', 'fort worth', 'columbus', 'charlotte','indianapolis', 'san francisco', 'seattle', 'denver', 'nashville']
top20 = df_income_agg[df_income_agg['city_clean'].isin(top20_cities)]
fig, ax = plt.subplots(figsize=(10, 7))
ax.scatter(top20['NatWalkInd'], top20['median_income'] / 1000,color='steelblue', s=100, alpha=0.8, zorder=3)
for _, row in top20.iterrows(): ax.annotate(row['city_clean'].title(),xy=(row['NatWalkInd'], row['median_income'] / 1000),xytext=(5, 3), textcoords='offset points')
m, b = np.polyfit(top20['NatWalkInd'], top20['median_income'] / 1000, 1)
x_line = np.linspace(top20['NatWalkInd'].min(), top20['NatWalkInd'].max(), 100)
ax.plot(x_line, m * x_line + b, color='firebrick', linewidth=2, linestyle='--')
corr = top20['NatWalkInd'].corr(top20['median_income'])
ax.set_xlabel('avg walkability score')
ax.set_ylabel('median household income')
ax.set_title(f'walkability v income on US 20 largest cities')
ax.grid(True, alpha=0.3)
plt.show()
Graph Explanation: Walkability vs Income: 20 Largest U.S. Cities¶
What?: This scatter plot explores the relationship between the average walkability index and the median household income for the 20 largest U.S. cities. Each point represents a city and its position is determined by both its walkability score and median income. A regression line and correlation coefficient r are also displayed.
Why?: This graph addresses an important factor of socioeconomic status. It explores if walkability is a proxy for income or if there is an actual independent relationship.
Insights: The positive correlation r = 0.5898 suggests that in the 20 largest U.S. cities, the higher walkability tends to be associated with higher median household incomes.
# Display data types for income aggregation DataFrames
print("df_plot_income info:")
df_plot_income.info()
print("\ndf_plot_income_filtered info:")
df_plot_income_filtered.info()
print("\ndf_income_agg info:")
df_income_agg.info()
df_plot_income info: <class 'pandas.core.frame.DataFrame'> RangeIndex: 2344 entries, 0 to 2343 Data columns (total 6 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 city_clean 2344 non-null object 1 NatWalkInd 2344 non-null float64 2 Pct_AO0 2344 non-null float64 3 D3B 2344 non-null float64 4 D4A 2344 non-null float64 5 median_income 2344 non-null int64 dtypes: float64(4), int64(1), object(1) memory usage: 110.0+ KB df_plot_income_filtered info: <class 'pandas.core.frame.DataFrame'> Index: 899 entries, 6 to 2341 Data columns (total 6 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 city_clean 899 non-null object 1 NatWalkInd 899 non-null float64 2 Pct_AO0 899 non-null float64 3 D3B 899 non-null float64 4 D4A 899 non-null float64 5 median_income 899 non-null int64 dtypes: float64(4), int64(1), object(1) memory usage: 49.2+ KB df_income_agg info: <class 'pandas.core.frame.DataFrame'> RangeIndex: 227 entries, 0 to 226 Data columns (total 3 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 city_clean 227 non-null object 1 NatWalkInd 227 non-null float64 2 median_income 227 non-null float64 dtypes: float64(2), object(1) memory usage: 5.4+ KB
Next, we look at the street density of each city v the walkability index
df_sample = df_walking[['NatWalkInd', 'D3B']].dropna()
df_sample = df_sample[(df_sample['D3B'] > 0) & (df_sample['D3B'] < 400)].sample(5000, random_state=42)
fig, ax = plt.subplots(figsize=(9, 5))
ax.scatter(df_sample['D3B'], df_sample['NatWalkInd'],alpha=0.25, s=10)
m, b = np.polyfit(df_sample['D3B'], df_sample['NatWalkInd'], 1)
x_line = np.linspace(df_sample['D3B'].min(), df_sample['D3B'].max(), 100)
ax.plot(x_line, m * x_line + b, color='firebrick', linewidth=2)
corr = df_sample['D3B'].corr(df_sample['NatWalkInd'])
ax.set_xlabel('street intersection density')
ax.set_ylabel('Walkability Index Score')
ax.set_title(f'street intersection density v walkability')
# set the x and y lim to filter out outliers
ax.set_xlim(0, 400)
ax.set_ylim(0, 20)
ax.grid(True, alpha=0.3)
plt.show()
fig, ax = plt.subplots(figsize=(9, 6))
hb = ax.hexbin(df_sample['D3B'], df_sample['NatWalkInd'], gridsize=40, cmap='Blues', mincnt=1, xscale='log')
plt.colorbar(hb, ax=ax, label='# of block groups')
ax.set_xlabel('street intersection density, (log)', fontsize=12)
ax.set_ylabel('walkability score', fontsize=12)
ax.set_title(f'street intersection density v walkability (log)', fontsize=14)
ax.set_ylim(0, 20)
#plt.tight_layout()
plt.show()
Now, mental health model
url = "https://data.cdc.gov/resource/cwsq-ngmh.json?$limit=50000&measureid=DEPRESSION&$select=locationname,locationid,data_value,measureid"
response = requests.get(url)
df_places_tract = pd.DataFrame(response.json())
# to clean places
df_places_tract['tract_id'] = df_places_tract['locationid'].astype(str).str.strip()
df_places_tract['depression_rate'] = pd.to_numeric(df_places_tract['data_value'], errors='coerce')
df_places_tract = df_places_tract.dropna(subset=['depression_rate'])
df_walking['tract_id'] = (df_walking['GEOID10'].dropna().astype(float).astype(int).astype(str).str.zfill(12).str[:11])
df_walk_tract = df_walking.dropna(subset=['tract_id']).groupby('tract_id')['NatWalkInd'].mean().reset_index()
# merge
df_tract_merged = df_walk_tract.merge(df_places_tract[['tract_id', 'depression_rate']],on='tract_id', how='inner')
print(f"matching tracts: {len(df_tract_merged)}")
print(df_tract_merged[['NatWalkInd', 'depression_rate']].corr())
matching tracts: 10788
NatWalkInd depression_rate
NatWalkInd 1.000000 -0.248726
depression_rate -0.248726 1.000000
# Display data types for mental health model DataFrames
print("df_places_tract info:")
df_places_tract.info()
print("\ndf_walk_tract info:")
df_walk_tract.info()
print("\ndf_tract_merged info:")
df_tract_merged.info()
df_places_tract info: <class 'pandas.core.frame.DataFrame'> RangeIndex: 50000 entries, 0 to 49999 Data columns (total 6 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 locationname 50000 non-null object 1 locationid 50000 non-null object 2 data_value 50000 non-null object 3 measureid 50000 non-null object 4 tract_id 50000 non-null object 5 depression_rate 50000 non-null float64 dtypes: float64(1), object(5) memory usage: 2.3+ MB df_walk_tract info: <class 'pandas.core.frame.DataFrame'> RangeIndex: 17137 entries, 0 to 17136 Data columns (total 2 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 tract_id 17137 non-null object 1 NatWalkInd 17137 non-null float64 dtypes: float64(1), object(1) memory usage: 267.9+ KB df_tract_merged info: <class 'pandas.core.frame.DataFrame'> RangeIndex: 10788 entries, 0 to 10787 Data columns (total 3 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 tract_id 10788 non-null object 1 NatWalkInd 10788 non-null float64 2 depression_rate 10788 non-null float64 dtypes: float64(2), object(1) memory usage: 253.0+ KB
# Get all unique measure IDs available in PLACES tract data
url = "https://data.cdc.gov/resource/cwsq-ngmh.json?$limit=1000&$select=measureid,measure&$group=measureid,measure"
response = requests.get(url)
df_measures = pd.DataFrame(response.json())
print(df_measures.to_string(index=False))
measureid measure
ACCESS2 Current lack of health insurance among adults aged 18-64 years
ARTHRITIS Arthritis among adults
BINGE Binge drinking among adults
BPHIGH High blood pressure among adults
BPMED Taking medicine to control high blood pressure among adults with high blood pressure
CANCER Cancer (non-skin) or melanoma among adults
CASTHMA Current asthma among adults
CHD Coronary heart disease among adults
CHECKUP Visits to doctor for routine checkup within the past year among adults
CHOLSCREEN Cholesterol screening among adults
COGNITION Cognitive disability among adults
COLON_SCREEN Colorectal cancer screening among adults aged 45–75 years
COPD Chronic obstructive pulmonary disease among adults
CSMOKING Current cigarette smoking among adults
DENTAL Visited dentist or dental clinic in the past year among adults
DEPRESSION Depression among adults
DIABETES Diagnosed diabetes among adults
DISABILITY Any disability among adults
EMOTIONSPT Lack of social and emotional support among adults
FOODINSECU Food insecurity in the past 12 months among adults
FOODSTAMP Received food stamps in the past 12 months among adults
GHLTH Fair or poor self-rated health status among adults
HEARING Hearing disability among adults
HIGHCHOL High cholesterol among adults who have ever been screened
HOUSINSECU Housing insecurity in the past 12 months among adults
INDEPLIVE Independent living disability among adults
LACKTRPT Lack of reliable transportation in the past 12 months among adults
LONELINESS Loneliness among adults
LPA No leisure-time physical activity among adults
MAMMOUSE Mammography use among women aged 50-74 years
MHLTH Frequent mental distress among adults
MOBILITY Mobility disability among adults
OBESITY Obesity among adults
PHLTH Frequent physical distress among adults
SELFCARE Self-care disability among adults
SHUTUTILITY Utility services shut-off threat in the past 12 months among adults
SLEEP Short sleep duration among adults
STROKE Stroke among adults
TEETHLOST All teeth lost among adults aged >=65 years
VISION Vision disability among adults
Next we can sort through the list and find the measures that correlate the most with the walkability scores
results = []
for mid in df_measures['measureid'].tolist():
url = f"https://data.cdc.gov/resource/cwsq-ngmh.json?$limit=50000&measureid={mid}&$select=locationid,data_value"
response = requests.get(url)
df_m = pd.DataFrame(response.json())
df_m['tract_id'] = df_m['locationid'].astype(str).str.strip()
df_m['value'] = pd.to_numeric(df_m['data_value'], errors='coerce')
df_m = df_m.dropna(subset=['value'])
df_merged = df_walk_tract.merge(df_m[['tract_id', 'value']], on='tract_id', how='inner')
if len(df_merged) > 500:
corr = df_merged['NatWalkInd'].corr(df_merged['value'])
results.append({'measure': mid, 'correlation': round(corr, 4), 'n': len(df_merged)})
df_res = pd.DataFrame(results).sort_values('correlation', key=abs, ascending=False)
print(df_res.to_string(index=False))
measure correlation n
ARTHRITIS -0.5505 10788
HEARING -0.4755 10788
BPHIGH -0.4750 10788
COPD -0.4410 10788
CHD -0.4368 10788
CHECKUP -0.4171 10788
CANCER -0.4055 10788
HIGHCHOL -0.3739 10788
LONELINESS 0.3636 9732
BPMED -0.3566 10788
STROKE -0.3332 10788
OBESITY -0.3329 10788
COLON_SCREEN -0.3051 10788
CASTHMA -0.2899 10788
EMOTIONSPT 0.2629 9732
CSMOKING -0.2601 10788
DEPRESSION -0.2487 10788
MOBILITY -0.2284 10788
DISABILITY -0.2056 10788
DIABETES -0.1974 10788
BINGE 0.1766 10788
PHLTH -0.1748 10788
TEETHLOST -0.1658 10783
CHOLSCREEN -0.1606 10788
HOUSINSECU 0.1413 9732
FOODSTAMP 0.1255 9732
ACCESS2 0.1067 10788
MAMMOUSE 0.1048 10785
INDEPLIVE -0.0988 10788
SELFCARE -0.0889 10788
FOODINSECU 0.0864 9732
LPA -0.0721 10788
LACKTRPT 0.0689 9732
SLEEP -0.0488 10788
SHUTUTILITY -0.0456 9732
DENTAL 0.0345 10788
COGNITION -0.0237 10788
VISION 0.0095 10788
GHLTH -0.0072 10788
MHLTH 0.0020 10788
Hypothesis Testing and Modeling¶
Our hypothesis is that higher walkability in urban environments reduces the prevalence of hypertension. Null Hypothesis: Walkability has no statistically significant relationship with health outcomes, like hypertension rates. Alternative Hypothesis: Higher walkability is associated with lower prevalence of hypertension. Now, to test this hypothesis we will conduct a correlation analysis, simple linear regression, and multivariate linear regression.
import scipy.stats as stats
# Load all four measures
measures = {'BPHIGH': 'High Blood Pressure Rate (%)','ARTHRITIS': 'Arthritis Rate (%)','LONELINESS': 'Loneliness Rate (%)','OBESITY': 'Obesity Rate (%)'}
df_combined = df_walk_tract.copy()
for mid in measures:
url = f"https://data.cdc.gov/resource/cwsq-ngmh.json?$limit=50000&measureid={mid}&$select=locationid,data_value"
response = requests.get(url)
df_m = pd.DataFrame(response.json())
df_m['tract_id'] = df_m['locationid'].astype(str).str.strip()
df_m[mid.lower()] = pd.to_numeric(df_m['data_value'], errors='coerce')
df_combined = df_combined.merge(df_m[['tract_id', mid.lower()]], on='tract_id', how='inner')
print(f"Tracts matched: {len(df_combined)}")
# 2x2 grid
fig, axes = plt.subplots(2, 2, figsize=(14, 10))
axes = axes.flatten()
colors = ['steelblue', 'seagreen', 'darkorange', 'mediumpurple']
for i, (mid, label) in enumerate(measures.items()):
col = mid.lower()
x = df_combined['NatWalkInd']
y = df_combined[col]
slope, intercept, r_value, p_value, std_err = stats.linregress(x, y)
df_sample = df_combined.sample(3000, random_state=42)
ax = axes[i]
ax.scatter(df_sample['NatWalkInd'], df_sample[col],alpha=0.15, color=colors[i], s=10)
x_line = np.linspace(x.min(), x.max(), 100)
y_line = slope * x_line + intercept
n = len(x)
se_line = std_err * np.sqrt(1/n + (x_line - x.mean())**2 / ((x - x.mean())**2).sum())
ax.plot(x_line, y_line, color='firebrick', linewidth=2,label=f'r = {r_value:.3f}')
ax.fill_between(x_line, y_line - 1.96*se_line, y_line + 1.96*se_line,alpha=0.15, color='firebrick')
ax.set_xlabel('Walkability Index Score')
ax.set_ylabel(label, fontsize=11)
ax.set_title(f'Walkability vs {label.split(" Rate")[0]}\n'
f'r = {r_value:.3f}, r² = {r_value**2:.3f}, p = {p_value:.2e}')
ax.legend(fontsize=10)
ax.grid(True, alpha=0.3)
plt.suptitle('walkability v health measure (census tracts)')
plt.show()
Tracts matched: 9732
Graph Explanation: Walkability vs Health Outcomes at Census Tract Level¶
What?: This 2x2 grid of scatter plots visualizes the relationship between the National Walkability Index and 4 different health outcomes: High Blood Pressure Rate, Arthritis Rate, Loneliness Rate, and Obesity Rate. Each outcome is at the census tract level. Each of the four plots includes a regression line and its corresponding correlation coefficient r.
Why?: This visualization explores various chronic health outcomes at the same time, which provided a broader understanding of walkability's potential impact beyond just blood pressure.
Insights: High Blood Pressure, Arthritis, and Obesity show generally negative correlations with walkability (higher walkability, lower rates). This can support the idea that active environments promote physical health. However, Loneliness shows a positive correlation, which suggests that while walkability improves physical health, the social aspects of highly walkable areas might not help lessen feelings of loneliness.
# Display data types for the combined health measures DataFrame
print("df_combined info:")
df_combined.info()
df_combined info: <class 'pandas.core.frame.DataFrame'> RangeIndex: 9732 entries, 0 to 9731 Data columns (total 6 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 tract_id 9732 non-null object 1 NatWalkInd 9732 non-null float64 2 bphigh 9732 non-null float64 3 arthritis 9732 non-null float64 4 loneliness 9732 non-null float64 5 obesity 9732 non-null float64 dtypes: float64(5), object(1) memory usage: 456.3+ KB
To see if our regression model generalizes to unseen data we will divide the dataset into a separate training and a separate testing subset with train_test_split(). This will prevent overfitting and realistically test how well our model can predict.
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score
df_model_bp = df_combined[['NatWalkInd', 'bphigh']].dropna().drop_duplicates().reset_index(drop=True)
df_model_bp['log_walk'] = np.log(df_model_bp['NatWalkInd'])
X = df_model_bp[['NatWalkInd']]
y = df_model_bp['bphigh']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
model_bp = LinearRegression()
model_bp.fit(X_train, y_train)
y_pred = model_bp.predict(X_test)
print(f"training #: {len(X_train):,}")
print(f"test #: {len(X_test):,}")
print(f"slope: {model_bp.coef_[0]:.4f}")
print(f"intercept: {model_bp.intercept_:.4f}")
print(f"r squared (train): {model_bp.score(X_train, y_train):.4f}")
print(f"r squared (test): {r2_score(y_test, y_pred):.4f}")
print(f"RMSE: {np.sqrt(mean_squared_error(y_test, y_pred)):.4f}")
fig, axes = plt.subplots(1, 3, figsize=(16, 5))
ax = axes[0]
ax.scatter(X_test, y_test, alpha=0.2, color='steelblue', s=10, label='actual')
x_line = np.linspace(X['NatWalkInd'].min(), X['NatWalkInd'].max(), 100)
ax.plot(x_line, model_bp.predict(pd.DataFrame({'NatWalkInd': x_line})),
color='firebrick', linewidth=2, label='predicted')
ax.set_xlabel('walkability score')
ax.set_ylabel('high bp rate')
ax.set_title('walkability - high bp rate\n(test data)')
ax.legend()
ax.grid(True, alpha=0.3)
ax2 = axes[1]
ax2.scatter(y_test, y_pred, alpha=0.2, color='seagreen', s=10)
ax2.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()],
color='firebrick', linewidth=2, linestyle='--', label='perfect pred')
ax2.set_xlabel('actual bp rate')
ax2.set_ylabel('predicted bp rate')
ax2.set_title(f'predicted v actual\n(r squared = {r2_score(y_test, y_pred):.4f})')
ax2.legend()
ax2.grid(True, alpha=0.3)
ax3 = axes[2]
residuals = y_test.values - y_pred
ax3.scatter(y_pred, residuals, alpha=0.2, color='darkorange', s=10)
ax3.axhline(y=0, color='firebrick', linewidth=2, linestyle='--')
ax3.set_xlabel('predicted bp rate')
ax3.set_ylabel('residuals')
ax3.set_title('residual plot')
ax3.grid(True, alpha=0.3)
plt.suptitle('linear regression model: walkability predicting high bp rate')
plt.show()
training #: 7,493 test #: 1,874 slope: -0.9749 intercept: 44.0224 r squared (train): 0.2529 r squared (test): 0.2487 RMSE: 6.7097
Graph Explanation: Linear Regression: Walkability as Predictor of High Blood Pressure Rate¶
What?: This set of three plots presents the results of a linear regression model where the Walkability Index is predicting High Blood Pressure Rate. The first subplot compares the actual vs. predicted values for the test set, the second plots predicted vs. actual values to assess model fit, and the third shows the residuals to check for linearity and homoscedasticity.
Why?: This section moves beyond simple correlation to a formal predictive model, directly addressing the project's goal of using walkability as a predictor for health outcomes. The metrics (R², RMSE, coefficient) evaluate the strength of this predictive relationship.
Insights: The model shows a negative coefficient (-0.9749), indicating that for every unit increase in the Walkability Index, the High Blood Pressure Rate decreases by approximately 0.97 percentage points.
# display data types for the linear regression model DataFrame
print("df_model_bp info:")
df_model_bp.info()
df_model_bp info: <class 'pandas.core.frame.DataFrame'> RangeIndex: 9367 entries, 0 to 9366 Data columns (total 3 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 NatWalkInd 9367 non-null float64 1 bphigh 9367 non-null float64 2 log_walk 9367 non-null float64 dtypes: float64(3) memory usage: 219.7 KB
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score
# train model only on train data split
df_model_bp = df_combined[['NatWalkInd', 'bphigh']].dropna().drop_duplicates().reset_index(drop=True)
X = df_model_bp[['NatWalkInd']]
y = df_model_bp['bphigh']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
model_bp = LinearRegression()
model_bp.fit(X_train, y_train)
# predict only on unseen test data
y_pred = model_bp.predict(X_test)
print(f"training data: {len(X_train):,}")
print(f"test data: {len(X_test):,}")
print(f"r squared: {r2_score(y_test, y_pred):.4f}")
# plot only test data
fig, ax = plt.subplots(figsize=(10, 6))
ax.scatter(X_test, y_test, alpha=0.2, color='steelblue', s=15, label=f'actual, n={len(X_test):,})', zorder=2)
ax.scatter(X_test, y_pred,alpha=0.8, color='firebrick', s=40, marker='x',label='predicted', zorder=5, linewidths=1)
ax.set_xlabel('walkability score')
ax.set_ylabel('high bp rate')
ax.set_title(f'walkability v high bp rate\n'
f'r squared = {r2_score(y_test, y_pred):.4f}, n = {len(X_test):,}')
ax.legend()
ax.grid(True, alpha=0.3)
ax.spines[['top', 'right']].set_visible(False)
plt.show()
training data: 7,493 test data: 1,874 r squared: 0.2487
fig, ax = plt.subplots(figsize=(9, 5))
ax.hist(df_walk_tract['NatWalkInd'], bins=50, color='steelblue')
ax.set_xlabel('National Walkability Index Score')
ax.set_ylabel('Number of Census Tracts')
ax.set_title('Distribution of Walkability Scores Across U.S. Census Tracts')
plt.show()
#This will create the multivariate dataframe for a multivariate linear regression
#This aims to find the unique tract_id to city_clean from df_walking
tract_to_city = df_walking[['tract_id', 'city_clean']].drop_duplicates().dropna(subset=['tract_id', 'city_clean'])
#Now we merge df_combined at tract level, with mapping which adds the city_clean column
df_combined_f_merge = df_combined.merge(tract_to_city, on='tract_id', how='inner')
#So the multivar_dataframe is created by merging df_combined_f_merge with df_income_agg
multivar_dataframe = df_combined_f_merge.merge(df_income_agg[['city_clean', 'median_income']], on='city_clean', how='inner')
#Our independent variable, x, and dependent variable, y
x = multivar_dataframe[['NatWalkInd', 'median_income']]
y = multivar_dataframe['bphigh']
#This splits our data into a separate training set and separate test set
xtrain, xtest, ytrain, ytest = train_test_split(x, y, test_size=0.2, random_state=42)
#Now we can run the linear regression
multilinreg = LinearRegression()
multilinreg.fit(xtrain, ytrain)
predicty = multilinreg.predict(xtest)
print(f"Multivariate Linear Regression Results:")
print(f"r squared (test):{r2_score(ytest, predicty):.4f}")
print(f"Intercept:{multilinreg.intercept_:.4f}")
print(f"RMSE:{np.sqrt(mean_squared_error(ytest, predicty)):.4f}")
for col, coef in zip(X.columns, multilinreg.coef_):
print(f"Coefficients: {col}: {coef:.6f}")
Multivariate Linear Regression Results: r squared (test):0.2106 Intercept:42.6368 RMSE:6.1712 Coefficients: NatWalkInd: -0.629949
Interpreting These Results¶
The R² is the Coefficient of Determination and it is 0.2106. This means that 21% of the variance in cities' blood pressure rates is able to be explained by the dual factors of walkability and median income. The remaining 79% of the variance is possibly due to other factors such as age or access to medical resources.
The intercept of 42.64 is our baseline for what the model would predict a blood pressure rate to be given a city with a walkability score and median income of zero.
The RMSE is the Root Mean Square Error, and ours of 6.17 means the model's predictions are, on average, off by 6.17.
Lastly, the coefficient (NatWalkInd) of -.63 means that for each 1-point increase in a city's Walkability Index score, our model will predict the High Blood Pressure Rate drops by 0.63 percentage points. This leads to a conclusion regarding our hypothesis. The negative sign means that high walkability is associated with better health outcomes, specifically lower blood pressure rates, even after having accounted for a city's wealth level.
Insights and Conclusion¶
This project reveals how urban walkability is significantly associated with several relevant public health outcomes in U.S. cities and census tracts. By using exploratory data analysis, correlation testing, and regression modeling, we found that there exists a consistent negative relationship between walkability and rates of conditions like hypertension, obesity, arthritis, and depression. We found that higher walkability in a city is associated with lower usage of blood pressure medication. We found a moderately negative correlation of r=-.27. This means that cities with higher walkability scores generally had lower usage of blood pressure medication. Our linear regression revealed that walkability could be used as a predictive value, with a ~1% decrease in high blood pressure rates for each one-point increase in walkability score.
Using our multivariate regression model, we found that walkability is still negatively associated with blood pressure rates after we consider median household income as a factor. This shows how the built urban design could independently contribute to health outcomes instead of being a proxy for higher-income areas.
This project shows how valuable it is to combine datasets from different disciplines. With economic, environmental, and public health datasets, we found that design infrastructure can play a big part in community health.
In the future, this project could be expanded using more demographics as factors like age, race, climate, and access to healthcare. Ultimately, we found that healthier cities could also be more walkable cities.
References and Data Sources¶
- EPA Smart Location Database (Walkability Index): https://www.google.com/search?q=https://www.epa.gov/smartgrowth/smart-location-mapping%23walkability
- CDC PLACES / 500 Cities Data: https://www.google.com/search?q=https://data.cdc.gov/browse%3Fcategory%3D500%2BCities%2B%2526%2BPlaces
- US Census Bureau ACS 5-Year Estimates: https://www.census.gov/data/developers/data-sets/acs-5year.html
- Zillow Home Value Index (ZHVI): https://www.zillow.com/research/data/