by Jack Weyer and Madison Neyland
Introduction
In this project we use the ‘Boston’ data set in the ‘MASS’ library, which contains several characteristics of Boston-area housing in the 1970s, to analyze a possible relationship between the proportion of African American residents in a neighborhood and the value of its homes. Such a relationship could contribute to an overall understanding of how the housing market is affected by race and/or racial discrimination.
Surely, this is not a new research topic. A report from The Brookings Institution titled “The devaluation of assets in black neighborhoods” by Andre M. Perry, Jonathan Rothwell, and David Harshbarger investigated this racial issue in the housing market on the large scale of hundreds of U.S. metropolitan areas in the 2010s. Through the collection of census data, it concluded that “In the average U.S. metropolitan area, homes in neighborhoods where the share of the population is 50 percent black are valued at roughly half the price as homes in neighborhoods with no black residents.” Further, it found that the devaluation of homes in African American neighborhoods is not fully explained by non-race-related neighborhood and house characteristics (such as proximity to work and average number of bedrooms), which suggests systematic racial discrimination within the housing market. The impact of the devaluation of houses in predominantly African American neighborhoods reaches further than home buying: it hinders upward mobility for younger generations within those neighborhoods because it decreases homeowners’ wealth and, therefore, their ability to afford education.
The report suggests an overall negative relationship across American metropolitan areas between the proportion of African American residents in a neighborhood and the value of houses within that neighborhood. In this project, we aim to focus the lens on the city of Boston to see if this trend holds in this specific area, recognizing the time difference between the Boston data from 1970 and The Brookings Institution report from 2012-2016.
To better address this issue using the Boston data set, we looked into its origin: it was originally created for a research paper titled “Hedonic Housing Prices and the Demand for Clean Air” by David Harrison Jr. and Daniel L. Rubinfeld. The paper discusses the technique of using housing market data to estimate the willingness to pay for clean air, by way of a model named the “hedonic housing price model.” The paper describes a four-step process, the first of which is estimating a hedonic model predicting the median value of owner-occupied homes using several variables, including a measure of air pollution. The next step is to isolate the air pollution variable, which yields the marginal willingness to pay for an improvement in air quality for each household. The marginal willingness-to-pay data are then converted into a function analogous to a demand curve for clean air, which is used to calculate the per-household benefits of implementing air improvement policies.
While this paper does not pertain directly to the devaluation of houses in African American neighborhoods, it offers a strong outline of how to use the Boston data set to relate a particular variable to the value of homes in Boston. In this project, instead of focusing on air quality variables, we focus on the proportion of African American residents in each Boston neighborhood.
Methods
In order to understand the data and the types of variables, str(Boston) was used. It is apparent that the predictor “chas” is categorical (it only takes on the values 0 or 1), so as.factor(Boston$chas) was used to convert it into a factor (dummy) variable. While “rad” also appears to be a possible factor variable, as it takes on a finite set of integer values, it represents an index of accessibility to Boston’s radial highways and is therefore treated as quantitative.
Next, the Boston data was split into equal-sized disjoint training and test sets using the sample function with set.seed so the results could be reproducible.
The first model was created using best subset selection on the training set with the command regsubsets(medv ~., data=Boston.train, nvmax=13). This command returns the best model of each size, from 1 to 13 predictors. To compare these candidate models, the statistics Mallow’s Cp, adjusted R², and the Bayesian information criterion (BIC) were graphed against the number of predictors. For each criterion, the best model attains the extreme value (the highest adjusted R², and the lowest Cp and BIC). However, these three statistics did not give a consensus on which model is best, so the validation error approach was used. First, a model matrix of the predictors was created from the test set with the command test.matrix=model.matrix(medv ~., data = Boston.test) to build an “X” matrix for more efficient iteration.
In choosing among the models of each possible size, we observed that the model with 12 predictors had the lowest estimated test MSE, at 26.85. We then chose the simplest model within one standard error of this minimum, to account for the noise added by additional predictors. This turned out to be the 10-variable model, with a test MSE of 27.66.
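The one-standard-error calculation itself does not appear in the code section; a minimal sketch of one way it could be carried out, assuming the objects besties, test.matrix, Boston.test, and val.errors defined in the “Computer Codes” section, is:
best.size <- which.min(val.errors)                          # the 12-predictor model
coefi <- coef(besties, id = best.size)
pred <- as.numeric(test.matrix[, names(coefi)] %*% coefi)   # test-set predictions
sq.err <- (Boston.test$medv - pred)^2
se.mse <- sd(sq.err) / sqrt(length(sq.err))                 # standard error of the estimated MSE
min(which(val.errors <= min(val.errors) + se.mse))          # smallest model within one SE of the minimum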
Next, the plot function was used to perform diagnostics on the chosen model with 10 predictors, and the diagnostic graphs indicated that the linearity assumptions were reasonably satisfied. Proceeding with this model, a hypothesis test was performed on the “black” variable, which indicated that the proportion of African American residents in a Boston neighborhood is significant in predicting the median house value. This was done by running a summary of a simple linear model predicting “medv” using only “black” as a predictor.
A second model was obtained through the Lasso method. Lasso regression was run on the training set by setting alpha=1 in the glmnet function. Then 10-fold cross-validation was used to find the optimal tuning parameter with the function cv.glmnet. The tuning parameter controls how much the standardized coefficients are shrunk, possibly some all the way to zero. The test MSE of the model with the optimal tuning parameter was found by using the predict function on the test set and comparing the predicted values of “medv” to the actual values of “medv” in the test set. Finally, we performed hypothesis tests on the coefficients included in the optimal model. To do this, we used the lasso.proj function on a model matrix of the predictors, requesting the p-values as the output.
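As a rough sketch (our own illustration, assuming the hdi package; the exact call used appears in the “Computer Codes” section), such a projection-based test of the coefficients could look like:
library(hdi)
x.all <- model.matrix(medv ~ ., data = Boston)[, -1]        # all 13 predictors, intercept dropped
proj.fit <- lasso.proj(x.all, Boston$medv, family = "gaussian",
                       standardize = TRUE, multiplecorr.method = "none")
proj.fit$pval                                               # individual p-values for each coefficient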
We created a third model using ridge regression. Ridge regression is very similar to Lasso, except that ridge regression does not shrink any coefficients all the way to zero, meaning it does not perform variable selection. Due to this similarity, the same procedure of finding a model using Lasso can be applied when using ridge regression--with a couple small alterations. First, we set alpha=0 (rather than 1) in the glmnet function. Second, when we perform hypothesis tests on the coefficients, the ridge.proj function is used (rather than the lasso.proj function).
From the models created from the best subset, Lasso, and ridge regression, we chose one model with the best prediction accuracy by comparing the test MSE of each model and considering the number of parameters of each model. The chosen model (our selection procedure is further explained in the “Results” section) was used as the model to predict the median value of houses in Boston neighborhoods, and in analyzing the influence of the proportion of African American residents. To find this influence, we looked at the coefficient of the “black” variable.
Results
The three models chosen from best subset selection, Lasso, and ridge regression are represented in the tables below. In choosing the model with the best prediction accuracy, we look at the estimated test mean squared error (MSE), because the MSE measures how close the model’s predicted response for a given observation of “medv” in the test set is to the true response for that observation. The MSEs of the models from best subset selection, Lasso, and ridge regression are 27.66006, 26.8612, and 27.19152, respectively, so the Lasso method gives the model with the best prediction accuracy. It should be noted that in the best subset selection procedure, the model containing 12 predictors had the lowest MSE, at 26.84946, which is in fact lower than the MSE of the Lasso model, so technically that model had the best prediction accuracy for our partition. However, as explained previously, after accounting for noise from additional predictors we decided that the best model from the best subset selection method contains 10 predictors. Additionally, the best subset model containing 12 predictors performed only marginally better than the Lasso model, which also contains 12 predictors. For this reason, and because Lasso performs direct variable selection by shrinking predictor coefficients all the way to zero, we chose the Lasso model as the model with the overall best prediction accuracy.
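As a quick check, the reported estimates can be collected and compared directly (a small sketch using only the values quoted above):
test.mse <- c(best.subset.10 = 27.66006, lasso = 26.8612, ridge = 27.19152)
sort(test.mse)    # the Lasso model has the lowest estimated test MSE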
Moving forward with the Lasso model, we can interpret the meaning of each of the generated coefficients, including the “black” variable’s coefficient to answer the proposed problem.
The “crim” coefficient indicates that as the per capita crime rate increases by one unit, the median value of a home in the neighborhood decreases on average by $103.35.
The “zn” coefficient indicates that as the proportion of residential land zoned for lots over 25,000 sq.ft. increases by 1%, median value of a home in the neighborhood increases on average by $43.55.
The “indus” coefficient indicates that as the proportion of non-retail business acres per town increases by 1%, the median value of a home in the neighborhood increases on average by $2.10.
The “chas” coefficient indicates that if a neighborhood borders the Charles River, the value of the homes are increased on average by $2,698.65.
The “nox” coefficient indicates that an increase of one part per 10 million in the nitrogen oxides concentration decreases the median value of homes in the neighborhood on average by $16,839.49.
The “rm” coefficient indicates that an increase of one in the average number of rooms per dwelling increases the median value of homes in the neighborhood on average by $3,835.81.
The “dis” coefficient indicates an increase of the weighted mean of distances to five Boston employment centres by one unit decreases the median value of homes in the neighborhood on average by $1,440.10.
The “rad” coefficient indicates an increase of accessibility to Boston’s radial highways by one unit increases the median value of homes in the neighborhood on average by $276.49.
The “tax” coefficient indicates an increase of the full-value property-tax rate per $10,000 by one dollar decreases the median value of homes in the neighborhood on average by $10.84.
The “ptratio” coefficient indicates an increase of the pupil-teacher ratio by town by one unit decreases the median value of homes in the neighborhood on average by $939.02.
The “lstat” coefficient indicates that a one percentage point increase in the proportion of lower-status residents decreases the median value of homes in the neighborhood on average by $522.81.
Finally, referring to the variable in question for this project, the “black” coefficient indicates that an increase in the statistic 1000(Bk − 0.63)² by one unit increases the median value of homes in the neighborhood on average by $9.16, where “Bk” is the proportion of African American residents in the town of the Boston neighborhood. The problem with this result is that it is quite ambiguous: the increase of $9.16 does not come from a direct increase in the proportion of African American residents, because the statistic is transformed by a shift and a squaring. The coding implies an upward-opening parabolic relationship, with a minimum near Bk = 0.63, between the proportion of African American residents and the median value of homes in Boston neighborhoods.
Here is a demonstrational graph depicting the parabolic nature of the relationship using the corresponding coding in the dataset scaled down by 1000.
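A minimal R sketch (our own illustration, not the code behind the report’s original figure) that draws a curve of this shape is:
Bk <- seq(0, 1, by = 0.01)                  # proportion of African American residents
plot(Bk, (Bk - 0.63)^2, type = "l",         # black = 1000*(Bk - 0.63)^2, scaled down by 1000
     xlab = "Proportion of African American residents (Bk)",
     ylab = "black / 1000")
abline(v = 0.63, lty = 2)                   # minimum of the parabola at Bk = 0.63
points(c(0.26, 1.00), (c(0.26, 1.00) - 0.63)^2, pch = 19)   # Bk = 0.26 and Bk = 1.00 give equal coded values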
The term for the proportion of African American residents is shifted by 0.63 and squared, so it is impossible to recover the original proportion from the coded value. For instance, neighborhoods with 100% and 26% African American representation yield the exact same “black” statistic (1000(1.00 − 0.63)² = 1000(0.26 − 0.63)² ≈ 136.9), showing how the coding forces a moderately low proportion and an extremely high proportion to have the same influence on median house values. The fact that this variable is significant and positive in our data indicates that Boston houses in 1970 were valued at a premium for being in a neighborhood with a low proportion of African American residents, and also, to a lesser degree, for being in a neighborhood with a very high proportion of African American residents.
Discussion
While the transformation of the “black” variable reveals the parabolic relationship between the proportion of African American residents in a 1970 Boston neighborhood and the median value of its homes, it also raises complications that call the validity of this result into question. It is unclear whether this parabolic relationship naturally occurs in the Boston data or whether it is a consequence of the transformation. It could be that the creators of the Boston data set recognized this relationship in the original census data and used the transformation to simplify and communicate it, but this is unknown. An improvement to this statistical analysis would be to use data without ambiguous transformations; transformations can always be applied by individual researchers when a specific pattern arises and calls for further investigation. If the “black” variable were the actual, untransformed proportion of African American residents in a neighborhood, we could not only see whether the parabolic relationship truly holds but also have more confidence in the interpretation of the model.
Additionally, the results and analysis of this model leave an important question unanswered: why does this parabolic relationship between the proportion of African American residents and the median value of homes in 1970 Boston neighborhoods occur? Based on the variables available in the Boston data set and the scope of this project, it is not possible to reach a definitive answer. Because of the coding of the variable and the requirements of this project, we are unable to isolate the “black” variable to create a marginal house price function with the proportion of African American residents as the independent variable, as was done for the clean air variable in the “Hedonic Housing Prices and the Demand for Clean Air” report. We can, however, make inferences about why this relationship occurs by looking at the Brookings Institution study of hundreds of U.S. metropolitan areas. It suggests that ingrained stereotypes and automatic associations of residents with particular groups lead to segregation in neighborhoods. That study, however, does not report the same parabolic relationship we found in the Boston data set: rather, it reports that, on average, areas with a 50% or higher African American population experienced lower house values.
To improve this project and attempt to understand why the parabolic relationship exists (and why it differs slightly from the results of The Brookings Institution study), we could introduce more variables when creating and choosing models. New variables regarding the racial climate of neighborhoods, such as the proportion of other minority residents or survey answers on race-related topics, would give us parameters with which to statistically pinpoint why the parabolic relationship exists. The results of the Lasso regression suggest that adding variables could lead to an improved model: Lasso performs variable selection by shrinking unnecessary coefficients to zero, keeping the model accurate yet parsimonious, and in our case only one of the thirteen variables was removed by this process, indicating that the model is still of an efficient size and could potentially benefit from more variables.
Computer Codes
(The numbers in the left margin correspond to the numbered tasks on the project assignment page)
0) library(MASS)
attach(Boston)
str(Boston)
View(Boston)
plot(rad,medv)
cor(rad,medv)
Boston$chas <- as.factor(Boston$chas)
1) set.seed(1)
train=sample(1:nrow(Boston),nrow(Boston)/2)
test=(-train)
# split into equal training and testing sets
Boston.train=Boston[train,]
Boston.test=Boston[test,]
2) library(ISLR)
library(leaps)
besties <- regsubsets(medv ~.,data=Boston.train,nvmax=13)
# Apply best subset selection on training data
besties.summary = summary(besties)
besties.summary$rsq
par(mfrow=c(2,2))
plot(besties.summary$rss,xlab="Number of Variables",ylab="RSS",type="l")
plot(besties.summary$adjr2,xlab="Number of Variables",ylab="Adjusted RSq",type="l")
which.max(besties.summary$adjr2)
[1] 11
points(11,besties.summary$adjr2[11], col="red",cex=2,pch=20)
plot(besties.summary$cp,xlab="Number of Variables",ylab="Cp",type='l')
which.min(besties.summary$cp)
[1] 11
points(11,besties.summary$cp[11],col="red",cex=2,pch=20)
which.min(besties.summary$bic)
[1] 8
plot(besties.summary$bic,xlab="Number of Variables",ylab="BIC",type='l')
points(8,besties.summary$bic[8],col="red",cex=2,pch=20)
plot(besties,scale = "r2")
plot(besties,scale = "adjr2")
plot(besties,scale = "Cp")
plot(besties,scale = "bic")
coef(besties,11)
coef(besties,8)
# choosing the best model
test.matrix=model.matrix(medv ~., data = Boston.test)
val.errors = rep(NA,13)
for (i in 1:13){
coefi=coef(besties,id=i)
pred = test.matrix[,names(coefi)]%*%coefi
val.errors[i] = mean((Boston$medv[test]-pred)^2)
}
val.errors
which.min(val.errors)
[1] 12
coef(besties,10)
model <- lm(medv ~ crim + chas + nox + rm + dis + rad + tax +
            ptratio + black + lstat, data = Boston)   # the 10-predictor model chosen above
summary(model)
par(mfrow=c(2,2))
plot(model)
test1 <- lm(medv ~ crim)
summary(test1)
test2 <- lm(medv ~ black)
summary(test2)
test3 <- lm(medv ~ tax)
summary(test3)
test4 <- lm(medv ~ indus)
summary(test4)
3) library(glmnet)
x=model.matrix(medv~.,Boston)[,-1]   # predictor matrix with the intercept column removed
y=Boston$medv
grid=10^seq(10,-2,length=100)   # grid of candidate lambda values
cowboy = glmnet(x[train,],y[train],alpha=1,lambda=grid)   # Lasso fit on the training set
par(mfrow=c(1,1))
plot(cowboy)
set.seed(1)
seavee = cv.glmnet(x[train,],y[train],alpha=1)   # 10-fold cross-validation to choose lambda
plot(seavee)
bestlamb = seavee$lambda.min
bestlamb
[1] 0.0141035
baaaa = predict(cowboy,s=bestlamb,newx=x[test,])
mean((baaaa-y[test])^2)
[1] 26.8612
oneSElamb = seavee$lambda.1se
oneSElamb
[1] 0.3038508
bigSheep = predict(cowboy,s=oneSElamb,newx=x[test,])
mean((bigSheep-y[test])^2)
[1] 29.98775
laddo = glmnet(x,y,alpha=1,lambda=grid)
lasso.coef=predict(laddo,type = "coefficients",s=bestlamb)[1:14,]
lasso.coef
lasso.coef[lasso.coef!=0]
library(hdi)
library(dplyr)
attach(Boston)
xModel = model.matrix(medv~.,data=Boston)
xModela = as.data.frame(xModel)
xsub = xModela %>% select(crim,zn,indus,chas1,nox,rm,dis,rad,tax,ptratio,black,lstat)   # predictors retained by the Lasso
outRes = ridge.proj(xsub, Boston$medv, family = "gaussian", standardize = TRUE,multiplecorr.method = "none")   # projection-based p-values for these coefficients
outRes$pval
4) cliff = glmnet(x,y,alpha=0,lambda=grid)
set.seed(1)
cliff = cv.glmnet(x[train,],y[train],alpha=0)   # 10-fold cross-validation for ridge regression
plot(cliff)
bestfratbro = cliff$lambda.min
bestfratbro
[1] 0.6546267
cliff.jump = predict(cliff,s=bestfratbro,newx=x[test,])
mean((cliff.jump-y[test])^2)
[1] 27.19152
out=glmnet(x,y,alpha=0)
predict(out,type="coefficients",s=bestfratbro)[1:14,]
attach(Boston)
xModel2 = model.matrix(medv~.,data=Boston)
xModela2 = as.data.frame(xModel2)
xsub2 = xModela2 %>% select(crim,zn,indus,chas1,nox,rm,dis,rad,tax,ptratio,black,lstat,age)   # all 13 predictors
outRes2 = ridge.proj(xsub2, Boston$medv, family = "gaussian", standardize = TRUE, multiplecorr.method = "none")   # projection-based p-values for the ridge model
outRes2$pval
Outputs
The graphs above show the RSS, Mallow’s Cp, adjusted R², and Bayesian information criterion (BIC) for the 13 best models selected by the best subset selection process. The red dot on each graph indicates the best model (by number of predictors) according to that criterion. We can also see why RSS is not a valid method for choosing among these 13 models: RSS tends to favor the models with the greatest number of predictors, which does not necessarily indicate the best predictive power.
[1] 43.34683 35.46589 31.86354 31.18871 28.78234 31.34632
[7] 29.74464 29.10294 28.44145 27.66006 26.85842 26.84946
[13] 26.86123
which.min(val.errors)
[1] 12
This output gives the validation error for each of the 13 models from the best subset method, indexed by the number of predictors, when the validation-set approach was applied to the test set. We can see that the model with 12 predictors has the smallest test MSE.
coef(besties,10)
(Intercept) crim chas1 nox
35.425950431 -0.086762583 2.454574420 -14.017221594
rm dis rad tax
4.172159418 -1.103281825 0.371105643 -0.017725440
ptratio black lstat
-1.064906474 0.006664985 -0.487953801
As explained in the “Methods” section, we chose the model with 10 predictors despite its slightly higher test MSE. This output shows the coefficients of that model.
These are the diagnostic graphs for the model with 10 predictors from best subset selection. They do not indicate serious violations of the linearity assumptions.
test2 <- lm(medv ~ black)
summary(test2)
Call:
lm(formula = medv ~ black)
Residuals:
Min 1Q Median 3Q Max
-18.884 -4.862 -1.684 2.932 27.763
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 10.551034 1.557463 6.775 3.49e-11 ***
black 0.033593 0.004231 7.941 1.32e-14 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 8.679 on 504 degrees of freedom
Multiple R-squared: 0.1112, Adjusted R-squared: 0.1094
F-statistic: 63.05 on 1 and 504 DF, p-value: 1.318e-14
This output is the coefficient table from the simple linear regression, which shows the individual p-value for the “black” predictor. With three significance stars, we can see that the “black” variable is significant in predicting “medv.”
seavee = cv.glmnet(x[train,],y[train],alpha=1)
plot(seavee)
This is a graph of the cross-validation used in the Lasso model process. Each red dot represents the mean of the 10 MSEs from the held-out folds of the training data (ten-fold cross-validation), one for each value of lambda.
bestlamb
[1] 0.0141035
mean((baaaa-y[test])^2)
[1] 26.8612
outRes$pval
crim zn indus chas1
3.160027e-03 1.888200e-03 8.612762e-01 1.953734e-03
nox rm dis rad
5.618327e-06 3.984121e-20 1.725925e-13 1.584697e-05
tax ptratio black lstat
1.397397e-03 2.480993e-12 7.404648e-04 9.975336e-26
This is a table of the p-values from a hypothesis test on each coefficient given by the model with the best lambda value. Every coefficient except that of “indus” has a small p-value, so each of those variables is significant in predicting “medv”; the p-value for “indus” (about 0.86) indicates it is not significant.
cliff = cv.glmnet(x[train,],y[train],alpha=0)
plot(cliff)
This is a graph of the cross-validation used in the ridge regression model process. Each red dot represents the mean of the 10 MSEs from the held-out folds of the training data (ten-fold cross-validation), one for each value of lambda.
bestfratbro
[1] 0.6546267
mean((cliff.jump-y[test])^2)
[1] 27.19152
outRes2$pval
crim zn indus chas1 nox
1.948892e-03 1.476397e-03 7.894330e-01 2.938677e-03 1.127306e-05
rm
2.948807e-18
dis rad tax ptratio black
2.336108e-12 1.270611e-05 1.989630e-03 5.003021e-12 1.003465e-03
lstat age
7.868405e-23 9.730553e-01
This is a table of the p-values from a hypothesis test on each coefficient given by the model with the best lambda value (coded as bestfratbro). Every coefficient except those of “indus” and “age” has a small p-value, so each of those variables is significant in predicting “medv”; the p-values for “indus” (about 0.79) and “age” (about 0.97) indicate they are not significant.
Partner Contribution
While the work was divided, each partner was involved in overseeing all work and in putting the final report together. Specifically, Jack was in charge of writing and running the codes in R and compiling the results into the final models seen in the tables. Madison was in charge of writing most of the written aspects of the report and transferring the codes and outputs onto the final report page. Both partners researched the topic and contributed to the ideas in the final results and discussion.
Works Cited
Carlisle, M. (2019, January 13). Racist data destruction? Retrieved November 25, 2020, from https://medium.com/@docintangible/racist-data-destruction-113e3eff54a8
Finfer, L. (2019, January 18). The 'good intentions' program that devastated Boston's neighborhoods - The Boston Globe. Retrieved November 26, 2020, from https://www.bostonglobe.com/opinion/2019/01/18/the-good-intentions-program-that-devastated-boston-neighborhoods/7ZWLqOYfM03SaTBJn4jRiK/story.html
Harrison, D. and Rubinfeld, D.L. (1978) Hedonic prices and the demand for clean air. J. Environ. Economics and Management 5, 81–102.
James, G., Witten, D., Hastie, T., & Tibshirani, R. (2017). Chapter 6: Linear Model Selection and Regularization. In An introduction to statistical learning with applications in R (pp. 244-255). New York, NY: Springer.
Perry, A, et al. The Brookings Institution, 2018, The Devaluation of Assets in Black Neighborhoods, www.brookings.edu/research/devaluation-of-assets-in-black-neighborhoods/.