Urban Walkability and Health Outcomes - Shakthi Manjanath and Quinn Stevens, Link to our GitHub webpage¶

Our Motivation and Introduction¶

Project Goals¶

The city that you live in determines many aspects of your life. The design of your environment impacts how you interact with healthcare, businesses, transportation, and other people. An important aspect of urban design is walkability. This is the term to describe how easy it is to walk around an area as means of transportation every day. In highly walkable neighborhoods, people are able to engage in physical activity when they commute, shop, or meet up with other people. Contrastingly, car-dependent neighborhoods require less physical activity to get from point A to point B.

This topic is important in the context of American health as millions of Americans suffer from chronic diseases like hypertension, obesity, arthritis, and depression. For this project, we will be exploring how multiple large data sets can be used together to explore how urban design affects health. We will use environmental, economic, and health data sets to see if walkability in urban environments serves as a predictor of health outcomes.

Dataset Questions¶

  1. How much does an American city's average Walkability Index score serve as a significant predictor for the prevalence of hypertension (High Blood Pressure) medication usage?
  2. Does the positive effect of walkability stay consistent with different chronic diseases (Hypertension, Arthritis, Obesity) versus social health effects like Loneliness?
  3. Does the relationship between urban walkability and hypertension still exist when controlling for median household income?

Project Datasets¶

To answer our questions, we are using several datasets. One is the National Walkability Index (2025) dataset that characterizes every Census 2019 block group in the U.S. based on its relative walkability. The second data set is 500 Cities: Local Data for Better Health (2025), which provides small-area estimates for 27 chronic disease measures. We also use US Census Bureau ACS 5-Year Estimates and Zillow Home Value Index (ZHVI) for data related to a city's wealth and the housing market.

The National Walkability Index is a 2025 update of the EPA’s Smart Location Database, which uses 2019 Census block group boundaries to score the built environment’s influence on pedestrian travel. These scores are derived from a weighted equation involving four variables: land-use mix, intersection density, transit proximity, and employment-to-housing ratios. The 500 Cities dataset provides model-based small-area estimates for the 500 largest U.S. cities, covering approximately 28,000 census tracts. Developed by the CDC and the Robert Wood Johnson Foundation, this dataset uses 2014-2015 BRFSS and ACS data to report on 27 chronic disease measures, including unhealthy behaviors and prevention services. By merging these sets, we can investigate how neighborhood design correlates with health outcomes. The US Census Bureau ACS 5-Year Estimates show granular data about Median Household Income at the city and census-tract level. Wealth is an important factor that affects health, and so this data set allows us to consider this when we attribute health benefits to walkability. The Zillow Home Value Index is a measure of how the home value market changes across parts of the US. This allows us to consider different housing markets as a socioeconomic factor in health.

Collaboration Plan¶

The team plans to continue to meet weekly over Zoom to collaborate on the project. We also communicate through text messages and email to send each other updates and resources. Additionally, we have started a Word document that has an ongoing to-do list. We have set up a Google Colab to work on our code together and have a shared GitHub repository together.

Data Collection¶

In [232]:
from google.colab import drive
drive.mount('/content/drive')
Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
In [267]:
%%shell
jupyter nbconvert --to html "/content/drive/My Drive/Colab Notebooks/smanjanath_qstevens.ipynb"
[NbConvertApp] Converting notebook /content/drive/My Drive/Colab Notebooks/smanjanath_qstevens.ipynb to html
[NbConvertApp] WARNING | Alternative text is missing on 9 image(s).
[NbConvertApp] Writing 1992562 bytes to /content/drive/My Drive/Colab Notebooks/smanjanath_qstevens.html
Out[267]:

To load and inspect the datasets we import several Python libraries that we will use for data visualization, manipulation, and modeling. We use pandas for tabular data processing, NumPy to conduct numerical operations, and matplotlib for visualizations.

Then, we load the two base datasets:

  1. The EPA Smart Location Database with walkability metrics
  2. The CDC 500 Cities dataset that has health outcomes for cities

We use .info() to understand the structure and data types of our datasets.

In [234]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

df_health = pd.read_csv("/content/drive/My Drive/Colab Notebooks/500_Cities__Local_Data_for_Better_Health__2017_release.csv")
df_walking = pd.read_csv("/content/drive/My Drive/Colab Notebooks/EPA_SmartLocationDatabase_V3_Jan_2021_Final.csv")
In [235]:
# this shows initial data types for df_health
print("df_health info:")
df_health.info()

# this will display initial data types for df_walking
print("\ndf_walking info:")
df_walking.info()
df_health info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 810103 entries, 0 to 810102
Data columns (total 24 columns):
 #   Column                      Non-Null Count   Dtype  
---  ------                      --------------   -----  
 0   Year                        810103 non-null  int64  
 1   StateAbbr                   810103 non-null  object 
 2   StateDesc                   810103 non-null  object 
 3   CityName                    810047 non-null  object 
 4   GeographicLevel             810103 non-null  object 
 5   DataSource                  810103 non-null  object 
 6   Category                    810103 non-null  object 
 7   UniqueID                    810103 non-null  object 
 8   Measure                     810103 non-null  object 
 9   Data_Value_Unit             810103 non-null  object 
 10  DataValueTypeID             810103 non-null  object 
 11  Data_Value_Type             810103 non-null  object 
 12  Data_Value                  789600 non-null  float64
 13  Data_Value_Footnote_Symbol  20503 non-null   object 
 14  Data_Value_Footnote         20503 non-null   object 
 15  Low_Confidence_Limit        789600 non-null  float64
 16  High_Confidence_Limit       789600 non-null  float64
 17  Population2010              810103 non-null  int64  
 18  GeoLocation                 810047 non-null  object 
 19  CategoryID                  810103 non-null  object 
 20  MeasureId                   810103 non-null  object 
 21  CityFIPS                    810047 non-null  float64
 22  TractFIPS                   782047 non-null  float64
 23  Short_Question_Text         810103 non-null  object 
dtypes: float64(5), int64(2), object(17)
memory usage: 148.3+ MB

df_walking info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 220740 entries, 0 to 220739
Columns: 117 entries, OBJECTID to Shape_Area
dtypes: float64(73), int64(42), object(2)
memory usage: 197.0+ MB

Data Cleaning¶

Both of the EPA and CDC datasets were made separately and don't have the same city name conventions. So, we created a standardized city_clean variable. To do this we:

  1. Converted the city names to lowercase
  2. Removed excess whitespace using .strip()
  3. Parsed naming conventions using .split()

Parse df_walking to make clean city names, without the other unnecessary info

After parsing, make a 'city_clean' column with the city names in it, for both data sets

In [236]:
df_walking['city_clean'] = df_walking['CBSA_Name'].str.split(',').str[0]
df_walking['city_clean'] = df_walking['city_clean'].str.split('-').str[0]
df_walking['city_clean'] = df_walking['city_clean'].str.strip().str.lower()

df_health['city_clean'] = df_health['CityName'].str.strip().str.lower()
In [237]:
# this shows data types after cleaning city names for df_walking
print("df_walking info after city clean:")
df_walking.info()

# this shows data types after cleaning city names for df_health
print("\ndf_health info after city clean:")
df_health.info()
df_walking info after city clean:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 220740 entries, 0 to 220739
Columns: 118 entries, OBJECTID to city_clean
dtypes: float64(73), int64(42), object(3)
memory usage: 198.7+ MB

df_health info after city clean:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 810103 entries, 0 to 810102
Data columns (total 25 columns):
 #   Column                      Non-Null Count   Dtype  
---  ------                      --------------   -----  
 0   Year                        810103 non-null  int64  
 1   StateAbbr                   810103 non-null  object 
 2   StateDesc                   810103 non-null  object 
 3   CityName                    810047 non-null  object 
 4   GeographicLevel             810103 non-null  object 
 5   DataSource                  810103 non-null  object 
 6   Category                    810103 non-null  object 
 7   UniqueID                    810103 non-null  object 
 8   Measure                     810103 non-null  object 
 9   Data_Value_Unit             810103 non-null  object 
 10  DataValueTypeID             810103 non-null  object 
 11  Data_Value_Type             810103 non-null  object 
 12  Data_Value                  789600 non-null  float64
 13  Data_Value_Footnote_Symbol  20503 non-null   object 
 14  Data_Value_Footnote         20503 non-null   object 
 15  Low_Confidence_Limit        789600 non-null  float64
 16  High_Confidence_Limit       789600 non-null  float64
 17  Population2010              810103 non-null  int64  
 18  GeoLocation                 810047 non-null  object 
 19  CategoryID                  810103 non-null  object 
 20  MeasureId                   810103 non-null  object 
 21  CityFIPS                    810047 non-null  float64
 22  TractFIPS                   782047 non-null  float64
 23  Short_Question_Text         810103 non-null  object 
 24  city_clean                  810047 non-null  object 
dtypes: float64(5), int64(2), object(18)
memory usage: 154.5+ MB

Group the data by the clean_city column we made earlier, then merge the sets.

In [238]:
df_walkability_city = df_walking.groupby('city_clean').agg({'NatWalkInd': 'mean', 'Pct_AO0': 'mean', 'D3B': 'mean', 'D4A': 'mean',}).reset_index()

df = df_health.merge(df_walkability_city, on='city_clean', how='inner')

Aggregation Logic for Walkability Data¶

What?: In this part we aggregated the df_walking data by using city_clean and calculating the mean between NatWalkInd, Pct_AO0, D3B, and D4A. This df_walking dataset contains data at the Census Block Group level

Why?: The health outcome dataset has data available at the city level. By aggregating the block group-level walkability data to a city-level average using groupby(), we could explore our research questions at a city level.

Insights: Averaging the block group walkability scores into a city-level NatWalkInd helps us to compare the general walkability of different urban areas with their corresponding health metrics. Now we can:

  1. Create a constant geographic unit for merging purposes
  2. Compare urban walkability between cities
  3. Reduce noise of localized block-group variations
In [239]:
df_zillow = pd.read_csv("/content/drive/My Drive/Colab Notebooks/City_zhvi_uc_sfrcondo_tier_0.33_0.67_sm_sa_month.csv")

date_cols = [c for c in df_zillow.columns if c.startswith('20')]
latest_col = date_cols[-1]
print(f"Using column: {latest_col}")

# Filter for RegionType 'city' to avoid duplicates for the same city name
df_zillow_filtered = df_zillow[df_zillow['RegionType'] == 'city'].copy()

df_zillow_filtered['city_clean'] = df_zillow_filtered['RegionName'].str.strip().str.lower()

df_zillow_city = df_zillow_filtered[['city_clean', latest_col]].copy()
df_zillow_city.rename(columns={latest_col: 'home_value'}, inplace=True)

df_plot_zillow = df_walkability_city.merge(df_zillow_city, on='city_clean', how='inner')

print(f"Matched cities: {len(df_plot_zillow)}")
print(df_plot_zillow[['city_clean', 'NatWalkInd', 'home_value']].head(10))
Using column: 2026-03-31
Matched cities: 3036
  city_clean  NatWalkInd     home_value
0   aberdeen    9.185567  240644.332155
1   aberdeen    9.185567  343947.667771
2   aberdeen    9.185567  273201.242877
3   aberdeen    9.185567  352796.288282
4   aberdeen    9.185567  142494.040733
5   aberdeen    9.185567  301127.778633
6   aberdeen    9.185567  198707.400195
7    abilene    7.322917  210016.002457
8    abilene    7.322917  184104.647848
9        ada    7.005208  180731.575584
In [240]:
# data types for the merged df
print("df info after merge:")
df.info()
df info after merge:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 573809 entries, 0 to 573808
Data columns (total 29 columns):
 #   Column                      Non-Null Count   Dtype  
---  ------                      --------------   -----  
 0   Year                        573809 non-null  int64  
 1   StateAbbr                   573809 non-null  object 
 2   StateDesc                   573809 non-null  object 
 3   CityName                    573809 non-null  object 
 4   GeographicLevel             573809 non-null  object 
 5   DataSource                  573809 non-null  object 
 6   Category                    573809 non-null  object 
 7   UniqueID                    573809 non-null  object 
 8   Measure                     573809 non-null  object 
 9   Data_Value_Unit             573809 non-null  object 
 10  DataValueTypeID             573809 non-null  object 
 11  Data_Value_Type             573809 non-null  object 
 12  Data_Value                  559863 non-null  float64
 13  Data_Value_Footnote_Symbol  13946 non-null   object 
 14  Data_Value_Footnote         13946 non-null   object 
 15  Low_Confidence_Limit        559863 non-null  float64
 16  High_Confidence_Limit       559863 non-null  float64
 17  Population2010              573809 non-null  int64  
 18  GeoLocation                 573809 non-null  object 
 19  CategoryID                  573809 non-null  object 
 20  MeasureId                   573809 non-null  object 
 21  CityFIPS                    573809 non-null  float64
 22  TractFIPS                   559865 non-null  float64
 23  Short_Question_Text         573809 non-null  object 
 24  city_clean                  573809 non-null  object 
 25  NatWalkInd                  573809 non-null  float64
 26  Pct_AO0                     573809 non-null  float64
 27  D3B                         573809 non-null  float64
 28  D4A                         573809 non-null  float64
dtypes: float64(9), int64(2), object(18)
memory usage: 127.0+ MB

Filter to create a blood pressure data set using df_health, then make a new data frame for the mean of each city

The primary challenge in merging the EPA’s Walkability Index and the CDC’s 500 Cities data was a mismatch in geographic granularity. The walkability data is recorded at the Census Block Group level, and the health outcomes are reported by Census Tract or City Name. To resolve this, we aggregated the block group data using a groupby operation to create a consistent city-level average.

Furthermore, significant string normalization was required to execute a successful "inner join." We standardized the city columns by applying .lower() and .strip() functions, removing whitespace and case inconsistencies that would have prevented the records from matching. This ensured a tidy and unified dataset.

In [241]:
df_bp = df_health[df_health['Short_Question_Text'] == 'Taking BP Medication']

df_bp_city = df_bp.groupby('city_clean')['Data_Value'].mean().reset_index()
df_bp_city.rename(columns={'Data_Value': 'bp_med_rate'}, inplace=True)
In [242]:
# displays data types for df_bp and df_bp_city
print("df_bp info:")
df_bp.info()

print("\ndf_bp_city info:")
df_bp_city.info()
df_bp info:
<class 'pandas.core.frame.DataFrame'>
Index: 29006 entries, 8 to 810065
Data columns (total 25 columns):
 #   Column                      Non-Null Count  Dtype  
---  ------                      --------------  -----  
 0   Year                        29006 non-null  int64  
 1   StateAbbr                   29006 non-null  object 
 2   StateDesc                   29006 non-null  object 
 3   CityName                    29004 non-null  object 
 4   GeographicLevel             29006 non-null  object 
 5   DataSource                  29006 non-null  object 
 6   Category                    29006 non-null  object 
 7   UniqueID                    29006 non-null  object 
 8   Measure                     29006 non-null  object 
 9   Data_Value_Unit             29006 non-null  object 
 10  DataValueTypeID             29006 non-null  object 
 11  Data_Value_Type             29006 non-null  object 
 12  Data_Value                  28212 non-null  float64
 13  Data_Value_Footnote_Symbol  794 non-null    object 
 14  Data_Value_Footnote         794 non-null    object 
 15  Low_Confidence_Limit        28212 non-null  float64
 16  High_Confidence_Limit       28212 non-null  float64
 17  Population2010              29006 non-null  int64  
 18  GeoLocation                 29004 non-null  object 
 19  CategoryID                  29006 non-null  object 
 20  MeasureId                   29006 non-null  object 
 21  CityFIPS                    29004 non-null  float64
 22  TractFIPS                   28004 non-null  float64
 23  Short_Question_Text         29006 non-null  object 
 24  city_clean                  29004 non-null  object 
dtypes: float64(5), int64(2), object(18)
memory usage: 5.8+ MB

df_bp_city info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 474 entries, 0 to 473
Data columns (total 2 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   city_clean   474 non-null    object 
 1   bp_med_rate  474 non-null    float64
dtypes: float64(1), object(1)
memory usage: 7.5+ KB

Exploratory Data Analysis¶

Here we use pd.cut() to separate continuous walkability scores into discrete ranges, bins. This allows us to easily visualize trends between bar charts to discover patterns.

In [243]:
df_plot_bp = df_walkability_city.merge(df_bp_city, on='city_clean', how='inner')

# Create bins for walkability
df_plot_bp['walk_bin'] = pd.cut(df_plot_bp['NatWalkInd'], bins=6)

df_binned = df_plot_bp.groupby('walk_bin', observed=False)['bp_med_rate'].mean().reset_index()

# Calculate midpoints for plotting
df_binned['walk_midpoint'] = df_binned['walk_bin'].apply(lambda x: round(x.mid, 1))
In [244]:
# Display data types for df_plot_bp and df_binned
print("df_plot_bp info:")
df_plot_bp.info()

print("\ndf_binned info:")
df_binned.info()
df_plot_bp info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 228 entries, 0 to 227
Data columns (total 7 columns):
 #   Column       Non-Null Count  Dtype   
---  ------       --------------  -----   
 0   city_clean   228 non-null    object  
 1   NatWalkInd   228 non-null    float64 
 2   Pct_AO0      228 non-null    float64 
 3   D3B          228 non-null    float64 
 4   D4A          228 non-null    float64 
 5   bp_med_rate  228 non-null    float64 
 6   walk_bin     228 non-null    category
dtypes: category(1), float64(5), object(1)
memory usage: 11.3+ KB

df_binned info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6 entries, 0 to 5
Data columns (total 3 columns):
 #   Column         Non-Null Count  Dtype   
---  ------         --------------  -----   
 0   walk_bin       6 non-null      category
 1   bp_med_rate    6 non-null      float64 
 2   walk_midpoint  6 non-null      category
dtypes: category(2), float64(1)
memory usage: 680.0 bytes

Finally, create and show the bar chart

In [245]:
fig, ax = plt.subplots(figsize=(10, 6))
ax.set_ylim(68, 75)
ax.bar(df_binned['walk_midpoint'].astype(str), df_binned['bp_med_rate'], color='steelblue', edgecolor='white')
ax.set_xlabel('Walkability Score', fontsize=12)
ax.set_ylabel('Avg % of Adults Taking BP Medication', fontsize=12)
ax.set_title('Average BP Medication Rate by Walkability Group', fontsize=14)
plt.show()
# The graph shows a clear decrease in blood pressure medication rate as we go up in walkability score.
# This makes intuitive sense as we would assume that people who like walking more are probably taking less blood pressure medication.
No description has been provided for this image

Graph Explanation: Average BP Medication Rate by Walkability Group¶

What?: We made this bar chart to show the average percentage of adults taking blood pressure medication across different bins of the National Walkability Index score. Each bar represents a number range of walkability scores, and each bar's height indicates the mean blood pressure medication rate for the cities falling into that range.

Why?: This visualization serves to addresses the first project question by showing a preliminary relationship between walkability and hypertension medication usage.

Insights: The graph indicates a clear inverse relationship, as the walkability score increases, the average percentage of adults taking blood pressure medication tends to decrease.

In [246]:
# 3-5 Interesting Stats
print(f"Mean Walkability Score: {df['NatWalkInd'].mean():.2f}")
print(f"City with Highest Walkability: {df.loc[df['NatWalkInd'].idxmax(), 'city_clean']}")
print(f"Correlation between Walkability and BP Meds: {df_plot_bp['NatWalkInd'].corr(df_plot_bp['bp_med_rate']):.4f}")
Mean Walkability Score: 10.30
City with Highest Walkability: los angeles
Correlation between Walkability and BP Meds: -0.2682

The average walkability score of 10.30 provides a baseline for the dataset, showing that the sampled urban areas sit near the midpoint of the EPA's index. This benchmark will help to identify outliers and serve as the "national average" against which specific health outcomes can be measured. Finding that Los Angeles has the highest walkability is particularly significant because it challenges the common stereotype of the city as a purely car-dependent area; this highlights the importance of analyzing data at the granular Census Block Group level to reveal high-density pedestrian pockets that are overlooked in city-wide generalizations.

The correlation coefficient of -0.2682 is the most vital statistic for the project’s goal, as the negative value directly supports the hypothesis that increased walkability is associated with a decrease in blood pressure medication usage.

Here we display the distribution of walkability scores across cities.

In [247]:
fig, ax = plt.subplots(figsize=(9, 5))
ax.hist(df_walkability_city['NatWalkInd'], bins=50)
ax.set_xlabel('score')
ax.set_ylabel('# of cities')
ax.set_title('Distribution of walkability scores across US cities')
plt.show()
No description has been provided for this image

Graph Explanation: Distribution of Walkability Scores Across U.S. Cities¶

What?: We created a histogram to display the frequency distribution of National Walkability Index scores across the cities included in our df_walkability_city DataFrame. The x axis is walkability score ranges and the y axis is the number of cities in each range.

Why?: Understanding the distribution of walkability scores helps to identify the common range of walkability in the sampled cities and whether the data is possibly skewed towards highly walkable or car-dependent places.

Insights: The distribution reveals that many cities fall within the mid-range of walkability scores, with less cities at the extreme ends of very low/high walkability.

Next we want to see how median household income of the 20 largest cities compares to the walkability index

In [248]:
df_zillow = pd.read_csv("/content/drive/My Drive/Colab Notebooks/City_zhvi_uc_sfrcondo_tier_0.33_0.67_sm_sa_month.csv")

date_cols = [c for c in df_zillow.columns if c.startswith('20')]
# to get last
latest_col = date_cols[-1]
print(f"Using column: {latest_col}")

# making a city_clean column to be able to merge data accurately
df_zillow['city_clean'] = df_zillow['RegionName'].str.strip().str.lower()

df_zillow_city = df_zillow[['city_clean', latest_col]].copy()
df_zillow_city.rename(columns={latest_col: 'home_value'}, inplace=True)

df_plot_zillow = df_walkability_city.merge(df_zillow_city, on='city_clean', how='inner')

print(f"cities: {len(df_plot_zillow)}")
print(df_plot_zillow[['city_clean', 'NatWalkInd', 'home_value']].head())
Using column: 2026-03-31
cities: 3036
  city_clean  NatWalkInd     home_value
0   aberdeen    9.185567  240644.332155
1   aberdeen    9.185567  343947.667771
2   aberdeen    9.185567  273201.242877
3   aberdeen    9.185567  352796.288282
4   aberdeen    9.185567  142494.040733
In [249]:
# Display data types for Zillow DataFrames
print("df_zillow info:")
df_zillow.info()

print("\ndf_zillow_city info:")
df_zillow_city.info()

print("\ndf_plot_zillow info:")
df_plot_zillow.info()
df_zillow info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 21425 entries, 0 to 21424
Columns: 324 entries, RegionID to city_clean
dtypes: float64(315), int64(2), object(7)
memory usage: 53.0+ MB

df_zillow_city info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 21425 entries, 0 to 21424
Data columns (total 2 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   city_clean  21425 non-null  object 
 1   home_value  21424 non-null  float64
dtypes: float64(1), object(1)
memory usage: 334.9+ KB

df_plot_zillow info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3036 entries, 0 to 3035
Data columns (total 6 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   city_clean  3036 non-null   object 
 1   NatWalkInd  3036 non-null   float64
 2   Pct_AO0     3036 non-null   float64
 3   D3B         3036 non-null   float64
 4   D4A         3036 non-null   float64
 5   home_value  3036 non-null   float64
dtypes: float64(5), object(1)
memory usage: 142.4+ KB
In [250]:
import requests

CENSUS_API_KEY = "517f479f3e2894a5ef3dc4a58fb05b19ca64b310"

url = (
    f"https://api.census.gov/data/2022/acs/acs5"
    f"?get=NAME,B19013_001E"
    f"&for=place:*"
    f"&key={CENSUS_API_KEY}"
)

response = requests.get(url)
data = response.json()

df_income = pd.DataFrame(data[1:], columns=data[0])
df_income.rename(columns={'B19013_001E': 'median_income', 'NAME': 'place_name'}, inplace=True)

df_income['median_income'] = pd.to_numeric(df_income['median_income'], errors='coerce')
df_income = df_income[df_income['median_income'] > 0]

df_income['city_clean'] = df_income['place_name'].str.split(' city,').str[0]
df_income['city_clean'] = df_income['city_clean'].str.split(' town,').str[0]
df_income['city_clean'] = df_income['city_clean'].str.strip().str.lower()

print(f"Cities loaded: {len(df_income)}")
print(df_income[['city_clean', 'median_income']].head(10))
Cities loaded: 28530
                city_clean  median_income
0      abanda cdp, alabama          29263
1                abbeville          35147
2               adamsville          58631
3                  addison          47188
4                    akron          53929
5                alabaster          89423
6              albertville          55933
7           alexander city          42141
8  alexandria cdp, alabama          95039
9               aliceville          30152
In [251]:
# Display data types for df_income
print("df_income info:")
df_income.info()
df_income info:
<class 'pandas.core.frame.DataFrame'>
Index: 28530 entries, 0 to 32185
Data columns (total 5 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   place_name     28530 non-null  object
 1   median_income  28530 non-null  int64 
 2   state          28530 non-null  object
 3   place          28530 non-null  object
 4   city_clean     28530 non-null  object
dtypes: int64(1), object(4)
memory usage: 1.3+ MB
In [252]:
valid_cities = set(df_health['city_clean'].unique())
df_plot_income = df_walkability_city.merge(df_income[['city_clean', 'median_income']], on='city_clean', how='inner')
df_plot_income_filtered = df_plot_income[df_plot_income['city_clean'].isin(valid_cities)]

df_income_agg = df_plot_income_filtered.groupby('city_clean').agg(NatWalkInd=('NatWalkInd', 'mean'),median_income=('median_income', 'mean')).reset_index()

top20_cities = ['new york', 'los angeles', 'chicago', 'houston', 'phoenix','philadelphia', 'san antonio', 'san diego', 'dallas', 'jacksonville','austin', 'san jose', 'fort worth', 'columbus', 'charlotte','indianapolis', 'san francisco', 'seattle', 'denver', 'nashville']
top20 = df_income_agg[df_income_agg['city_clean'].isin(top20_cities)]

fig, ax = plt.subplots(figsize=(10, 7))
ax.scatter(top20['NatWalkInd'], top20['median_income'] / 1000,color='steelblue', s=100, alpha=0.8, zorder=3)

for _, row in top20.iterrows(): ax.annotate(row['city_clean'].title(),xy=(row['NatWalkInd'], row['median_income'] / 1000),xytext=(5, 3), textcoords='offset points')

m, b = np.polyfit(top20['NatWalkInd'], top20['median_income'] / 1000, 1)
x_line = np.linspace(top20['NatWalkInd'].min(), top20['NatWalkInd'].max(), 100)
ax.plot(x_line, m * x_line + b, color='firebrick', linewidth=2, linestyle='--')

corr = top20['NatWalkInd'].corr(top20['median_income'])
ax.set_xlabel('avg walkability score')
ax.set_ylabel('median household income')
ax.set_title(f'walkability v income on US 20 largest cities')
ax.grid(True, alpha=0.3)
plt.show()
No description has been provided for this image

Graph Explanation: Walkability vs Income: 20 Largest U.S. Cities¶

What?: This scatter plot explores the relationship between the average walkability index and the median household income for the 20 largest U.S. cities. Each point represents a city and its position is determined by both its walkability score and median income. A regression line and correlation coefficient r are also displayed.

Why?: This graph addresses an important factor of socioeconomic status. It explores if walkability is a proxy for income or if there is an actual independent relationship.

Insights: The positive correlation r = 0.5898 suggests that in the 20 largest U.S. cities, the higher walkability tends to be associated with higher median household incomes.

In [253]:
# Display data types for income aggregation DataFrames
print("df_plot_income info:")
df_plot_income.info()

print("\ndf_plot_income_filtered info:")
df_plot_income_filtered.info()

print("\ndf_income_agg info:")
df_income_agg.info()
df_plot_income info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2344 entries, 0 to 2343
Data columns (total 6 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   city_clean     2344 non-null   object 
 1   NatWalkInd     2344 non-null   float64
 2   Pct_AO0        2344 non-null   float64
 3   D3B            2344 non-null   float64
 4   D4A            2344 non-null   float64
 5   median_income  2344 non-null   int64  
dtypes: float64(4), int64(1), object(1)
memory usage: 110.0+ KB

df_plot_income_filtered info:
<class 'pandas.core.frame.DataFrame'>
Index: 899 entries, 6 to 2341
Data columns (total 6 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   city_clean     899 non-null    object 
 1   NatWalkInd     899 non-null    float64
 2   Pct_AO0        899 non-null    float64
 3   D3B            899 non-null    float64
 4   D4A            899 non-null    float64
 5   median_income  899 non-null    int64  
dtypes: float64(4), int64(1), object(1)
memory usage: 49.2+ KB

df_income_agg info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 227 entries, 0 to 226
Data columns (total 3 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   city_clean     227 non-null    object 
 1   NatWalkInd     227 non-null    float64
 2   median_income  227 non-null    float64
dtypes: float64(2), object(1)
memory usage: 5.4+ KB

Next, we look at the street density of each city v the walkability index

In [254]:
df_sample = df_walking[['NatWalkInd', 'D3B']].dropna()
df_sample = df_sample[(df_sample['D3B'] > 0) & (df_sample['D3B'] < 400)].sample(5000, random_state=42)

fig, ax = plt.subplots(figsize=(9, 5))
ax.scatter(df_sample['D3B'], df_sample['NatWalkInd'],alpha=0.25, s=10)

m, b = np.polyfit(df_sample['D3B'], df_sample['NatWalkInd'], 1)
x_line = np.linspace(df_sample['D3B'].min(), df_sample['D3B'].max(), 100)
ax.plot(x_line, m * x_line + b, color='firebrick', linewidth=2)

corr = df_sample['D3B'].corr(df_sample['NatWalkInd'])
ax.set_xlabel('street intersection density')
ax.set_ylabel('Walkability Index Score')
ax.set_title(f'street intersection density v walkability')
# set the x and y lim to filter out outliers
ax.set_xlim(0, 400)
ax.set_ylim(0, 20)
ax.grid(True, alpha=0.3)
plt.show()
No description has been provided for this image
In [255]:
fig, ax = plt.subplots(figsize=(9, 6))
hb = ax.hexbin(df_sample['D3B'], df_sample['NatWalkInd'], gridsize=40, cmap='Blues', mincnt=1, xscale='log')
plt.colorbar(hb, ax=ax, label='# of block groups')

ax.set_xlabel('street intersection density, (log)', fontsize=12)
ax.set_ylabel('walkability score', fontsize=12)
ax.set_title(f'street intersection density v walkability (log)', fontsize=14)
ax.set_ylim(0, 20)
#plt.tight_layout()
plt.show()
No description has been provided for this image

Now, mental health model

In [256]:
url = "https://data.cdc.gov/resource/cwsq-ngmh.json?$limit=50000&measureid=DEPRESSION&$select=locationname,locationid,data_value,measureid"
response = requests.get(url)
df_places_tract = pd.DataFrame(response.json())

# to clean places
df_places_tract['tract_id'] = df_places_tract['locationid'].astype(str).str.strip()
df_places_tract['depression_rate'] = pd.to_numeric(df_places_tract['data_value'], errors='coerce')
df_places_tract = df_places_tract.dropna(subset=['depression_rate'])

df_walking['tract_id'] = (df_walking['GEOID10'].dropna().astype(float).astype(int).astype(str).str.zfill(12).str[:11])

df_walk_tract = df_walking.dropna(subset=['tract_id']).groupby('tract_id')['NatWalkInd'].mean().reset_index()

# merge
df_tract_merged = df_walk_tract.merge(df_places_tract[['tract_id', 'depression_rate']],on='tract_id', how='inner')

print(f"matching tracts: {len(df_tract_merged)}")
print(df_tract_merged[['NatWalkInd', 'depression_rate']].corr())
matching tracts: 10788
                 NatWalkInd  depression_rate
NatWalkInd         1.000000        -0.248726
depression_rate   -0.248726         1.000000
In [257]:
# Display data types for mental health model DataFrames
print("df_places_tract info:")
df_places_tract.info()

print("\ndf_walk_tract info:")
df_walk_tract.info()

print("\ndf_tract_merged info:")
df_tract_merged.info()
df_places_tract info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Data columns (total 6 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   locationname     50000 non-null  object 
 1   locationid       50000 non-null  object 
 2   data_value       50000 non-null  object 
 3   measureid        50000 non-null  object 
 4   tract_id         50000 non-null  object 
 5   depression_rate  50000 non-null  float64
dtypes: float64(1), object(5)
memory usage: 2.3+ MB

df_walk_tract info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 17137 entries, 0 to 17136
Data columns (total 2 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   tract_id    17137 non-null  object 
 1   NatWalkInd  17137 non-null  float64
dtypes: float64(1), object(1)
memory usage: 267.9+ KB

df_tract_merged info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10788 entries, 0 to 10787
Data columns (total 3 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   tract_id         10788 non-null  object 
 1   NatWalkInd       10788 non-null  float64
 2   depression_rate  10788 non-null  float64
dtypes: float64(2), object(1)
memory usage: 253.0+ KB
In [258]:
# Get all unique measure IDs available in PLACES tract data
url = "https://data.cdc.gov/resource/cwsq-ngmh.json?$limit=1000&$select=measureid,measure&$group=measureid,measure"
response = requests.get(url)
df_measures = pd.DataFrame(response.json())
print(df_measures.to_string(index=False))
   measureid                                                                              measure
     ACCESS2                       Current lack of health insurance among adults aged 18-64 years
   ARTHRITIS                                                               Arthritis among adults
       BINGE                                                          Binge drinking among adults
      BPHIGH                                                     High blood pressure among adults
       BPMED Taking medicine to control high blood pressure among adults with high blood pressure
      CANCER                                           Cancer (non-skin) or melanoma among adults
     CASTHMA                                                          Current asthma among adults
         CHD                                                  Coronary heart disease among adults
     CHECKUP               Visits to doctor for routine checkup within the past year among adults
  CHOLSCREEN                                                   Cholesterol screening among adults
   COGNITION                                                    Cognitive disability among adults
COLON_SCREEN                            Colorectal cancer screening among adults aged 45–75 years
        COPD                                   Chronic obstructive pulmonary disease among adults
    CSMOKING                                               Current cigarette smoking among adults
      DENTAL                       Visited dentist or dental clinic in the past year among adults
  DEPRESSION                                                              Depression among adults
    DIABETES                                                      Diagnosed diabetes among adults
  DISABILITY                                                          Any disability among adults
  EMOTIONSPT                                    Lack of social and emotional support among adults
  FOODINSECU                                   Food insecurity in the past 12 months among adults
   FOODSTAMP                              Received food stamps in the past 12 months among adults
       GHLTH                                   Fair or poor self-rated health status among adults
     HEARING                                                      Hearing disability among adults
    HIGHCHOL                            High cholesterol among adults who have ever been screened
  HOUSINSECU                                Housing insecurity in the past 12 months among adults
   INDEPLIVE                                           Independent living disability among adults
    LACKTRPT                   Lack of reliable transportation in the past 12 months among adults
  LONELINESS                                                              Loneliness among adults
         LPA                                       No leisure-time physical activity among adults
    MAMMOUSE                                         Mammography use among women aged 50-74 years
       MHLTH                                                Frequent mental distress among adults
    MOBILITY                                                     Mobility disability among adults
     OBESITY                                                                 Obesity among adults
       PHLTH                                              Frequent physical distress among adults
    SELFCARE                                                    Self-care disability among adults
 SHUTUTILITY                  Utility services shut-off threat in the past 12 months among adults
       SLEEP                                                    Short sleep duration among adults
      STROKE                                                                  Stroke among adults
   TEETHLOST                                          All teeth lost among adults aged >=65 years
      VISION                                                       Vision disability among adults

Next we can sort through the list and find the measures that correlate the most with the walkability scores

In [259]:
results = []
for mid in df_measures['measureid'].tolist():
    url = f"https://data.cdc.gov/resource/cwsq-ngmh.json?$limit=50000&measureid={mid}&$select=locationid,data_value"
    response = requests.get(url)
    df_m = pd.DataFrame(response.json())
    df_m['tract_id'] = df_m['locationid'].astype(str).str.strip()
    df_m['value'] = pd.to_numeric(df_m['data_value'], errors='coerce')
    df_m = df_m.dropna(subset=['value'])
    df_merged = df_walk_tract.merge(df_m[['tract_id', 'value']], on='tract_id', how='inner')
    if len(df_merged) > 500:
        corr = df_merged['NatWalkInd'].corr(df_merged['value'])
        results.append({'measure': mid, 'correlation': round(corr, 4), 'n': len(df_merged)})

df_res = pd.DataFrame(results).sort_values('correlation', key=abs, ascending=False)
print(df_res.to_string(index=False))
     measure  correlation     n
   ARTHRITIS      -0.5505 10788
     HEARING      -0.4755 10788
      BPHIGH      -0.4750 10788
        COPD      -0.4410 10788
         CHD      -0.4368 10788
     CHECKUP      -0.4171 10788
      CANCER      -0.4055 10788
    HIGHCHOL      -0.3739 10788
  LONELINESS       0.3636  9732
       BPMED      -0.3566 10788
      STROKE      -0.3332 10788
     OBESITY      -0.3329 10788
COLON_SCREEN      -0.3051 10788
     CASTHMA      -0.2899 10788
  EMOTIONSPT       0.2629  9732
    CSMOKING      -0.2601 10788
  DEPRESSION      -0.2487 10788
    MOBILITY      -0.2284 10788
  DISABILITY      -0.2056 10788
    DIABETES      -0.1974 10788
       BINGE       0.1766 10788
       PHLTH      -0.1748 10788
   TEETHLOST      -0.1658 10783
  CHOLSCREEN      -0.1606 10788
  HOUSINSECU       0.1413  9732
   FOODSTAMP       0.1255  9732
     ACCESS2       0.1067 10788
    MAMMOUSE       0.1048 10785
   INDEPLIVE      -0.0988 10788
    SELFCARE      -0.0889 10788
  FOODINSECU       0.0864  9732
         LPA      -0.0721 10788
    LACKTRPT       0.0689  9732
       SLEEP      -0.0488 10788
 SHUTUTILITY      -0.0456  9732
      DENTAL       0.0345 10788
   COGNITION      -0.0237 10788
      VISION       0.0095 10788
       GHLTH      -0.0072 10788
       MHLTH       0.0020 10788

Hypothesis Testing and Modeling¶

Our hypothesis is that higher walkability in urban environments reduces the prevalence of hypertension. Null Hypothesis: Walkability has no statistically significant relationship with health outcomes, like hypertension rates. Alternative Hypothesis: Higher walkability is associated with lower prevalence of hypertension. Now, to test this hypothesis we will conduct a correlation analysis, simple linear regression, and multivariate linear regression.

In [260]:
import scipy.stats as stats

# Load all four measures
measures = {'BPHIGH': 'High Blood Pressure Rate (%)','ARTHRITIS': 'Arthritis Rate (%)','LONELINESS': 'Loneliness Rate (%)','OBESITY': 'Obesity Rate (%)'}

df_combined = df_walk_tract.copy()

for mid in measures:
    url = f"https://data.cdc.gov/resource/cwsq-ngmh.json?$limit=50000&measureid={mid}&$select=locationid,data_value"
    response = requests.get(url)
    df_m = pd.DataFrame(response.json())
    df_m['tract_id'] = df_m['locationid'].astype(str).str.strip()
    df_m[mid.lower()] = pd.to_numeric(df_m['data_value'], errors='coerce')
    df_combined = df_combined.merge(df_m[['tract_id', mid.lower()]], on='tract_id', how='inner')

print(f"Tracts matched: {len(df_combined)}")

# 2x2  grid
fig, axes = plt.subplots(2, 2, figsize=(14, 10))
axes = axes.flatten()
colors = ['steelblue', 'seagreen', 'darkorange', 'mediumpurple']

for i, (mid, label) in enumerate(measures.items()):
    col = mid.lower()
    x = df_combined['NatWalkInd']
    y = df_combined[col]
    slope, intercept, r_value, p_value, std_err = stats.linregress(x, y)

    df_sample = df_combined.sample(3000, random_state=42)
    ax = axes[i]
    ax.scatter(df_sample['NatWalkInd'], df_sample[col],alpha=0.15, color=colors[i], s=10)

    x_line = np.linspace(x.min(), x.max(), 100)
    y_line = slope * x_line + intercept
    n = len(x)
    se_line = std_err * np.sqrt(1/n + (x_line - x.mean())**2 / ((x - x.mean())**2).sum())

    ax.plot(x_line, y_line, color='firebrick', linewidth=2,label=f'r = {r_value:.3f}')
    ax.fill_between(x_line, y_line - 1.96*se_line, y_line + 1.96*se_line,alpha=0.15, color='firebrick')

    ax.set_xlabel('Walkability Index Score')
    ax.set_ylabel(label, fontsize=11)
    ax.set_title(f'Walkability vs {label.split(" Rate")[0]}\n'
                 f'r = {r_value:.3f}, r² = {r_value**2:.3f}, p = {p_value:.2e}')
    ax.legend(fontsize=10)
    ax.grid(True, alpha=0.3)

plt.suptitle('walkability v health measure (census tracts)')
plt.show()
Tracts matched: 9732
No description has been provided for this image

Graph Explanation: Walkability vs Health Outcomes at Census Tract Level¶

What?: This 2x2 grid of scatter plots visualizes the relationship between the National Walkability Index and 4 different health outcomes: High Blood Pressure Rate, Arthritis Rate, Loneliness Rate, and Obesity Rate. Each outcome is at the census tract level. Each of the four plots includes a regression line and its corresponding correlation coefficient r.

Why?: This visualization explores various chronic health outcomes at the same time, which provided a broader understanding of walkability's potential impact beyond just blood pressure.

Insights: High Blood Pressure, Arthritis, and Obesity show generally negative correlations with walkability (higher walkability, lower rates). This can support the idea that active environments promote physical health. However, Loneliness shows a positive correlation, which suggests that while walkability improves physical health, the social aspects of highly walkable areas might not help lessen feelings of loneliness.

In [261]:
# Display data types for the combined health measures DataFrame
print("df_combined info:")
df_combined.info()
df_combined info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9732 entries, 0 to 9731
Data columns (total 6 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   tract_id    9732 non-null   object 
 1   NatWalkInd  9732 non-null   float64
 2   bphigh      9732 non-null   float64
 3   arthritis   9732 non-null   float64
 4   loneliness  9732 non-null   float64
 5   obesity     9732 non-null   float64
dtypes: float64(5), object(1)
memory usage: 456.3+ KB

To see if our regression model generalizes to unseen data we will divide the dataset into a separate training and a separate testing subset with train_test_split(). This will prevent overfitting and realistically test how well our model can predict.

In [262]:
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score

df_model_bp = df_combined[['NatWalkInd', 'bphigh']].dropna().drop_duplicates().reset_index(drop=True)

df_model_bp['log_walk'] = np.log(df_model_bp['NatWalkInd'])

X = df_model_bp[['NatWalkInd']]
y = df_model_bp['bphigh']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model_bp = LinearRegression()
model_bp.fit(X_train, y_train)
y_pred = model_bp.predict(X_test)

print(f"training #:     {len(X_train):,}")
print(f"test #:         {len(X_test):,}")
print(f"slope:  {model_bp.coef_[0]:.4f}")
print(f"intercept:            {model_bp.intercept_:.4f}")
print(f"r squared (train):           {model_bp.score(X_train, y_train):.4f}")
print(f"r squared (test):            {r2_score(y_test, y_pred):.4f}")
print(f"RMSE:                 {np.sqrt(mean_squared_error(y_test, y_pred)):.4f}")

fig, axes = plt.subplots(1, 3, figsize=(16, 5))

ax = axes[0]
ax.scatter(X_test, y_test, alpha=0.2, color='steelblue', s=10, label='actual')
x_line = np.linspace(X['NatWalkInd'].min(), X['NatWalkInd'].max(), 100)
ax.plot(x_line, model_bp.predict(pd.DataFrame({'NatWalkInd': x_line})),
        color='firebrick', linewidth=2, label='predicted')
ax.set_xlabel('walkability score')
ax.set_ylabel('high bp rate')
ax.set_title('walkability - high bp rate\n(test data)')
ax.legend()
ax.grid(True, alpha=0.3)

ax2 = axes[1]
ax2.scatter(y_test, y_pred, alpha=0.2, color='seagreen', s=10)
ax2.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()],
         color='firebrick', linewidth=2, linestyle='--', label='perfect pred')
ax2.set_xlabel('actual bp rate')
ax2.set_ylabel('predicted bp rate')
ax2.set_title(f'predicted v actual\n(r squared = {r2_score(y_test, y_pred):.4f})')
ax2.legend()
ax2.grid(True, alpha=0.3)

ax3 = axes[2]
residuals = y_test.values - y_pred
ax3.scatter(y_pred, residuals, alpha=0.2, color='darkorange', s=10)
ax3.axhline(y=0, color='firebrick', linewidth=2, linestyle='--')
ax3.set_xlabel('predicted bp rate')
ax3.set_ylabel('residuals')
ax3.set_title('residual plot')
ax3.grid(True, alpha=0.3)

plt.suptitle('linear regression model: walkability predicting high bp rate')
plt.show()
training #:     7,493
test #:         1,874
slope:  -0.9749
intercept:            44.0224
r squared (train):           0.2529
r squared (test):            0.2487
RMSE:                 6.7097
No description has been provided for this image

Graph Explanation: Linear Regression: Walkability as Predictor of High Blood Pressure Rate¶

What?: This set of three plots presents the results of a linear regression model where the Walkability Index is predicting High Blood Pressure Rate. The first subplot compares the actual vs. predicted values for the test set, the second plots predicted vs. actual values to assess model fit, and the third shows the residuals to check for linearity and homoscedasticity.

Why?: This section moves beyond simple correlation to a formal predictive model, directly addressing the project's goal of using walkability as a predictor for health outcomes. The metrics (R², RMSE, coefficient) evaluate the strength of this predictive relationship.

Insights: The model shows a negative coefficient (-0.9749), indicating that for every unit increase in the Walkability Index, the High Blood Pressure Rate decreases by approximately 0.97 percentage points.

In [263]:
# display data types for the linear regression model DataFrame
print("df_model_bp info:")
df_model_bp.info()
df_model_bp info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9367 entries, 0 to 9366
Data columns (total 3 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   NatWalkInd  9367 non-null   float64
 1   bphigh      9367 non-null   float64
 2   log_walk    9367 non-null   float64
dtypes: float64(3)
memory usage: 219.7 KB
In [264]:
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score

# train model only on train data split
df_model_bp = df_combined[['NatWalkInd', 'bphigh']].dropna().drop_duplicates().reset_index(drop=True)

X = df_model_bp[['NatWalkInd']]
y = df_model_bp['bphigh']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model_bp = LinearRegression()
model_bp.fit(X_train, y_train)

# predict only on unseen test data
y_pred = model_bp.predict(X_test)

print(f"training data:   {len(X_train):,}")
print(f"test data:     {len(X_test):,}")
print(f"r squared:                 {r2_score(y_test, y_pred):.4f}")

# plot only test data
fig, ax = plt.subplots(figsize=(10, 6))

ax.scatter(X_test, y_test, alpha=0.2, color='steelblue', s=15, label=f'actual, n={len(X_test):,})', zorder=2)

ax.scatter(X_test, y_pred,alpha=0.8, color='firebrick', s=40, marker='x',label='predicted', zorder=5, linewidths=1)

ax.set_xlabel('walkability score')
ax.set_ylabel('high bp rate')
ax.set_title(f'walkability v high bp rate\n'
             f'r squared = {r2_score(y_test, y_pred):.4f}, n = {len(X_test):,}')
ax.legend()
ax.grid(True, alpha=0.3)
ax.spines[['top', 'right']].set_visible(False)
plt.show()
training data:   7,493
test data:     1,874
r squared:                 0.2487
No description has been provided for this image
In [265]:
fig, ax = plt.subplots(figsize=(9, 5))
ax.hist(df_walk_tract['NatWalkInd'], bins=50, color='steelblue')
ax.set_xlabel('National Walkability Index Score')
ax.set_ylabel('Number of Census Tracts')
ax.set_title('Distribution of Walkability Scores Across U.S. Census Tracts')
plt.show()
No description has been provided for this image
In [266]:
#This will create the multivariate dataframe for a multivariate linear regression

#This aims to find the unique tract_id to city_clean from df_walking
tract_to_city = df_walking[['tract_id', 'city_clean']].drop_duplicates().dropna(subset=['tract_id', 'city_clean'])

#Now we merge df_combined at tract level, with mapping which adds the city_clean column
df_combined_f_merge = df_combined.merge(tract_to_city, on='tract_id', how='inner')

#So the multivar_dataframe is created by merging df_combined_f_merge with df_income_agg
multivar_dataframe = df_combined_f_merge.merge(df_income_agg[['city_clean', 'median_income']], on='city_clean', how='inner')

#Our independent variable, x, and dependent variable, y
x = multivar_dataframe[['NatWalkInd', 'median_income']]
y = multivar_dataframe['bphigh']

#This splits our data into a separate training set and separate test set
xtrain, xtest, ytrain, ytest = train_test_split(x, y, test_size=0.2, random_state=42)

#Now we can run the linear regression
multilinreg = LinearRegression()
multilinreg.fit(xtrain, ytrain)

predicty = multilinreg.predict(xtest)

print(f"Multivariate Linear Regression Results:")
print(f"r squared (test):{r2_score(ytest, predicty):.4f}")
print(f"Intercept:{multilinreg.intercept_:.4f}")
print(f"RMSE:{np.sqrt(mean_squared_error(ytest, predicty)):.4f}")
for col, coef in zip(X.columns, multilinreg.coef_):
    print(f"Coefficients: {col}: {coef:.6f}")
Multivariate Linear Regression Results:
r squared (test):0.2106
Intercept:42.6368
RMSE:6.1712
Coefficients: NatWalkInd: -0.629949

Interpreting These Results¶

The R² is the Coefficient of Determination and it is 0.2106. This means that 21% of the variance in cities' blood pressure rates is able to be explained by the dual factors of walkability and median income. The remaining 79% of the variance is possibly due to other factors such as age or access to medical resources.

The intercept of 42.64 is our baseline for what the model would predict a blood pressure rate to be given a city with a walkability score and median income of zero.

The RMSE is the Root Mean Square Error, and ours of 6.17 means the model's predictions are, on average, off by 6.17.

Lastly, the coefficient (NatWalkInd) of -.63 means that for each 1-point increase in a city's Walkability Index score, our model will predict the High Blood Pressure Rate drops by 0.63 percentage points. This leads to a conclusion regarding our hypothesis. The negative sign means that high walkability is associated with better health outcomes, specifically lower blood pressure rates, even after having accounted for a city's wealth level.

Insights and Conclusion¶

This project reveals how urban walkability is significantly associated with several relevant public health outcomes in U.S. cities and census tracts. By using exploratory data analysis, correlation testing, and regression modeling, we found that there exists a consistent negative relationship between walkability and rates of conditions like hypertension, obesity, arthritis, and depression. We found that higher walkability in a city is associated with lower usage of blood pressure medication. We found a moderately negative correlation of r=-.27. This means that cities with higher walkability scores generally had lower usage of blood pressure medication. Our linear regression revealed that walkability could be used as a predictive value, with a ~1% decrease in high blood pressure rates for each one-point increase in walkability score.

Using our multivariate regression model, we found that walkability is still negatively associated with blood pressure rates after we consider median household income as a factor. This shows how the built urban design could independently contribute to health outcomes instead of being a proxy for higher-income areas.

This project shows how valuable it is to combine datasets from different disciplines. With economic, environmental, and public health datasets, we found that design infrastructure can play a big part in community health.

In the future, this project could be expanded using more demographics as factors like age, race, climate, and access to healthcare. Ultimately, we found that healthier cities could also be more walkable cities.

References and Data Sources¶

  • EPA Smart Location Database (Walkability Index): https://www.google.com/search?q=https://www.epa.gov/smartgrowth/smart-location-mapping%23walkability
  • CDC PLACES / 500 Cities Data: https://www.google.com/search?q=https://data.cdc.gov/browse%3Fcategory%3D500%2BCities%2B%2526%2BPlaces
  • US Census Bureau ACS 5-Year Estimates: https://www.census.gov/data/developers/data-sets/acs-5year.html
  • Zillow Home Value Index (ZHVI): https://www.zillow.com/research/data/