# 6  Spatial Econometrics

This chapter is based on the following references, which are good follow-ups on the topic:

• Chapter 11 of the GDS Book, by Rey, Arribas-Bel, and Wolf (forthcoming).
• Session III of Arribas-Bel (2014). Check the “Related readings” section on the session page for more in-depth discussions.
• Anselin (2007), freely available to download [pdf].
• The second part of this tutorial assumes you have reviewed Block E of Arribas-Bel (2019).

## 6.1 Dependencies

We will rely on the following libraries in this section, all of them included in Section 1.4.1:

# Layout
library(tufte)
# For pretty table
library(knitr)
# For string parsing
library(stringr)
# Spatial Data management
library(rgdal)
# Simple features: provides st_read() and st_as_sf(), used below
library(sf)
# Pretty graphics
library(ggplot2)
# Pretty maps
library(ggmap)
# For all your interpolation needs
library(gstat)
# For data manipulation
library(dplyr)
# Spatial regression
library(spdep)

Before we start any analysis, let us set the path to the directory where we are working. We can easily do that with setwd(). In the following line, please replace the path with that of the folder where you have placed this file and where the data folder lives.

setwd('.')

## 6.2 Data

To explore ideas in spatial regression, we will use the set of Airbnb properties for San Diego (US), borrowed from the “Geographic Data Science with Python” book (see here for more info on the dataset source). This covers the point location of properties advertised on the Airbnb website in the San Diego region.

db <- st_read('data/abb_sd/regression_db.geojson')
Reading layer `regression_db' from data source
`/Users/franciscorowe/Dropbox/Francisco/uol/teaching/envs453/202223/san/data/abb_sd/regression_db.geojson'
using driver `GeoJSON'
Simple feature collection with 6110 features and 19 fields
Geometry type: POINT
Dimension:     XY
Bounding box:  xmin: -117.2812 ymin: 32.57349 xmax: -116.9553 ymax: 33.08311
Geodetic CRS:  WGS 84

The table contains the following variables:

names(db)
  "accommodates"       "bathrooms"          "bedrooms"
 "beds"               "neighborhood"       "pool"
 "d2balboa"           "coastal"            "price"
 "log_price"          "id"                 "pg_Apartment"
 "pg_Condominium"     "pg_House"           "pg_Other"
 "pg_Townhouse"       "rt_Entire_home.apt" "rt_Private_room"
 "rt_Shared_room"     "geometry"          

For most of this chapter, we will be exploring determinants and strategies for modelling the price of a property advertised in AirBnb. To get a first taste of what this means, we can create a plot of prices within the area of San Diego:

db %>%
  ggplot(aes(color = price)) +
  geom_sf() +
  scale_color_viridis_c() +
  theme_void()

## 6.3 Non-spatial regression, a refresh

Before we discuss how to explicitly include space into the linear regression framework, let us show how basic regression can be carried out in R, and how you can interpret the results. By no means is this a formal and complete introduction to regression so, if that is what you are looking for, the first part of Gelman and Hill (2006), in particular chapters 3 and 4, are excellent places to check out.

The core idea of linear regression is to explain the variation in a given (dependent) variable as a linear function of a series of other (explanatory) variables. For example, in our case, we may want to express/explain the price of a property advertised on AirBnb as a function of some of its characteristics, such as the number of people it accommodates, and how many bathrooms, bedrooms and beds it features. At the individual level, we can express this as:

$\log(P_i) = \alpha + \beta_1 Acc_i + \beta_2 Bath_i + \beta_3 Bedr_i + \beta_4 Beds_i + \epsilon_i$

where $$P_i$$ is the price of house $$i$$, and $$Acc_i$$, $$Bath_i$$, $$Bedr_i$$ and $$Beds_i$$ are, respectively, the number of people it accommodates and the number of bathrooms, bedrooms and beds that house $$i$$ has. The parameters $$\beta_{1,2, 3, 4}$$ tell us in which direction and to what extent each variable is related to the price, and $$\alpha$$, the constant term, is the average log price of a house when all the other variables are zero. The term $$\epsilon_i$$ is usually referred to as the “error” and captures elements that influence the price of a house but are not accounted for explicitly. We can also express this relation in matrix form, dropping the subindices $$i$$, as:

$\log(P) = \alpha + \beta_1 Acc + \beta_2 Bath + \beta_3 Bedr + \beta_4 Beds + \epsilon$

where each term can be interpreted in terms of vectors instead of scalars (with the exception of the parameters $$(\alpha, \beta_{1, 2, 3, 4})$$, which are scalars). Note we are using the logarithm of the price, since this allows us to interpret the coefficients as roughly the percentage change induced by a unit increase in the explanatory variable of the estimate.
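To make the matrix form a little more concrete, the sketch below builds the matrix of explanatory variables by hand and solves the normal equations $$(X'X)\hat{\beta} = X'y$$, which is the standard ordinary least squares solution (the chapter itself does not spell it out). The data are simulated, and the names `toy`, `X` and `beta_hat` are made up for this illustration; they are not part of the San Diego dataset.

```r
# Simulated stand-in for the real table (hypothetical values)
set.seed(1)
n <- 200
toy <- data.frame(
  accommodates = rpois(n, 3) + 1,
  bathrooms = rpois(n, 1) + 1
)
toy$log_price <- 4 + 0.2 * toy$accommodates + 0.1 * toy$bathrooms +
  rnorm(n, sd = 0.5)

# Design matrix: a column of ones for the constant plus the explanatory variables
X <- cbind(1, toy$accommodates, toy$bathrooms)
y <- toy$log_price

# Solve the normal equations (X'X) b = X'y
beta_hat <- solve(crossprod(X), crossprod(X, y))

# lm() arrives at the same estimates
m_toy <- lm(log_price ~ accommodates + bathrooms, data = toy)
cbind(by_hand = as.numeric(beta_hat), lm = unname(coef(m_toy)))
```

In practice there is no need to do this by hand; the point is only that the vectors and scalars in the equation map directly onto columns of a table and estimated coefficients.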

Remember a regression can be seen as a multivariate extension of bivariate correlations. Indeed, one way to interpret the $$\beta_k$$ coefficients in the equation above is as the degree of correlation between the explanatory variable $$k$$ and the dependent variable, keeping all the other explanatory variables constant. When you calculate simple bivariate correlations, the coefficient of a variable is picking up the correlation between the variables, but it is also subsuming into it variation associated with other correlated variables, also called confounding factors (see footnote 1). Regression allows you to isolate the distinct effect that a single variable has on the dependent one, once we control for those other variables.
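A small simulation can illustrate the role of confounding factors. In the sketch below, all variables are made up: `x2` plays the role of a confounder that is correlated with `x1` and also affects `y`. The true effect of `x1` is set to 0.5 by construction, so we can compare what a bivariate fit and a multiple regression recover.

```r
# Simulated data (hypothetical): x2 is a confounder correlated with x1
set.seed(42)
n <- 1000
x2 <- rnorm(n)
x1 <- 0.8 * x2 + rnorm(n, sd = 0.6)
# True effects by construction: 0.5 for x1 and 1.0 for x2
y <- 1 + 0.5 * x1 + 1.0 * x2 + rnorm(n, sd = 0.5)

# Bivariate fit: the x1 coefficient also absorbs part of the effect of x2
b_simple <- coef(lm(y ~ x1))["x1"]
# Multiple regression: x2 is controlled for explicitly
b_multi <- coef(lm(y ~ x1 + x2))["x1"]

c(simple = unname(b_simple), multiple = unname(b_multi))
```

The bivariate estimate should come out well above 0.5, while the estimate that controls for `x2` should sit close to it.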

Practically speaking, running linear regressions in R is straightforward. For example, to fit the model specified in the equation above, we only need one line of code:

m1 <- lm('log_price ~ accommodates + bathrooms + bedrooms + beds', db)

We use the command lm, for linear model, and specify the equation we want to fit using a string that relates the dependent variable (the log of the price, log_price) with a set of explanatory ones (accommodates, bathrooms, bedrooms, beds) by using a tilde ~ that is akin to the $$=$$ symbol in the mathematical equation above. Since we are using names of variables that are stored in a table, we need to pass the table object (db) as well.
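As a side note, lm accepts the string because it is converted to a formula internally; many R users write the formula directly, without quotes. The sketch below, on simulated data with a made-up `toy` table, suggests that both spellings lead to the same fit.

```r
# Simulated stand-in for the real table (hypothetical values)
set.seed(7)
toy <- data.frame(beds = rpois(100, 2) + 1)
toy$log_price <- 4 + 0.1 * toy$beds + rnorm(100, sd = 0.3)

# Formula given as a string, as in the text
m_string <- lm("log_price ~ beds", toy)
# Formula given as a formula object
m_formula <- lm(log_price ~ beds, data = toy)

# Both fits give the same estimates
all.equal(coef(m_string), coef(m_formula))

# A string can also be converted explicitly
f <- as.formula("log_price ~ beds")
class(f)
```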

In order to inspect the results of the model, the quickest way is to call summary:

summary(m1)

Call:
lm(formula = "log_price ~ accommodates + bathrooms + bedrooms + beds",
data = db)

Residuals:
Min      1Q  Median      3Q     Max
-2.8486 -0.3234 -0.0095  0.3023  3.3975

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept)   4.018133   0.013947  288.10   <2e-16 ***
accommodates  0.176851   0.005323   33.23   <2e-16 ***
bathrooms     0.150981   0.012526   12.05   <2e-16 ***
bedrooms      0.111700   0.012537    8.91   <2e-16 ***
beds         -0.076974   0.007927   -9.71   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.5366 on 6105 degrees of freedom
Multiple R-squared:  0.5583,    Adjusted R-squared:  0.558
F-statistic:  1929 on 4 and 6105 DF,  p-value: < 2.2e-16

A full detailed explanation of the output is beyond the scope of the chapter, but we will highlight the relevant bits for our main purpose. This is concentrated on the Coefficients section, which gives us the estimates for the $$\beta_k$$ coefficients in our model. These estimates are the raw equivalent of the correlation coefficient between each explanatory variable and the dependent one, once the “polluting” effect of the other variables included in the model has been accounted for (see footnote 2). Results are as expected for the most part: houses tend to be significantly more expensive if they accommodate more people (an extra person increases the price by 17.7%, approximately), have more bathrooms (15.1%), or bedrooms (11.2%). Perhaps counterintuitively, an extra bed available seems to decrease the price by about 7.7%. However, keep in mind that this is the case with everything else equal. Hence, a house with more beds per room and bathroom (i.e. a more crowded house) is a bit cheaper.
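The percentages quoted above read each coefficient directly as a percentage change, which is an approximation that works best for small coefficients. With a logged dependent variable, the exact change implied by a coefficient $$\beta$$ is $$100 \times (e^{\beta} - 1)$$. The sketch below applies both readings to the estimates copied from the summary(m1) output; the object names `b` and `pct` are made up for the example.

```r
# Estimates copied from the summary(m1) output above
b <- c(
  accommodates = 0.176851,
  bathrooms = 0.150981,
  bedrooms = 0.111700,
  beds = -0.076974
)

# Approximate reading: 100 * b; exact reading: 100 * (exp(b) - 1)
pct <- data.frame(
  approx_pct = round(100 * b, 1),
  exact_pct = round(100 * (exp(b) - 1), 1),
  row.names = names(b)
)
pct
```

The two columns stay close for the smaller coefficients and drift apart as the coefficient grows, so the approximate reading is best kept for rough interpretation.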

## 6.4 Spatial regression: a (very) first dip

Spatial regression is about explicitly introducing space or geographical context into the statistical framework of a regression. Conceptually, we want to introduce space into our model whenever we think it plays an important role in the process we are interested in, or when space can act as a reasonable proxy for other factors we cannot but should include in our model. As an example of the former, we can imagine how houses at the seafront are probably more expensive than those in the second row, given their better views. To illustrate the latter, we can think of how the character of a neighborhood is important in determining the price of a house; however, it is very hard to identify and quantify “character” per se, although it might be easier to get at its spatial variation, hence a case of space as a proxy.

Spatial regression is a large field of development in the econometrics and statistics literature. In this brief introduction, we will consider two related but very different processes that give rise to spatial effects: spatial heterogeneity and spatial dependence. For more rigorous treatments of the topics introduced here, the reader is referred to Anselin (2003), Anselin and Rey (2014), and Gibbons, Overman, and Patacchini (2014).

## 6.5 Spatial heterogeneity

Spatial heterogeneity (SH) arises when we cannot safely assume the process we are studying operates under the same “rules” throughout the geography of interest. In other words, we can observe SH when there are effects on the outcome variable that are intrinsically linked to specific locations. A good example of this is the case of seafront houses above: we are trying to model the price of a house, and the fact that some houses are located under certain conditions (i.e. by the sea) makes their price behave differently. This somewhat abstract concept of SH can be made operational in a model in several ways. We will explore the following two: spatial fixed-effects (FE), and spatial regimes, which are a generalization of FE.

### Spatial FE

Let us consider the house price example from the previous section to introduce a more general illustration that relates to the second motivation for spatial effects (“space as a proxy”). Given we are only including four explanatory variables in the model, it is likely we are missing some important factors that play a role in determining the price of a house. Some of them, however, are likely to vary systematically over space (e.g. different neighborhood characteristics). If that is the case, we can control for those unobserved factors by using traditional dummy variables but basing their creation on a spatial rule. For example, let us include a binary variable for every neighbourhood, as provided by AirBnB, indicating whether a given house is located within such area (1) or not (0). Neighbourhood membership is expressed in the neighborhood column:

db %>%
  ggplot(aes(color = neighborhood)) +
  geom_sf() +
  theme_void()

Mathematically, we are now fitting the following equation:

$\log(P_i) = \alpha_r + \beta_1 Acc_i + \beta_2 Bath_i + \beta_3 Bedr_i + \beta_4 Beds_i + \epsilon_i$

where the main difference is that we are now allowing the constant term, $$\alpha$$, to vary by neighbourhood $$r$$, $$\alpha_r$$.
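To see what “a binary variable for every neighbourhood” looks like in practice, the sketch below uses model.matrix to show the columns R builds from a factor when the constant term is dropped. The `toy` table and its three areas are hypothetical.

```r
# Hypothetical toy table with three areas
toy <- data.frame(
  neighborhood = factor(c("A", "A", "B", "C", "C", "C")),
  beds = c(1, 2, 1, 3, 2, 1)
)

# "- 1" drops the constant, so every area gets its own 0/1 column
X <- model.matrix(~ neighborhood + beds - 1, data = toy)
X
```

Each row has a single 1 among the area columns, marking the area the property belongs to; the coefficient attached to each of those columns plays the role of $$\alpha_r$$.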

Programmatically, we can fit this model with lm:

# Include -1 to eliminate the constant term and include a dummy for every area
m2 <- lm(
'log_price ~ neighborhood + accommodates + bathrooms + bedrooms + beds - 1',
db
)
summary(m2)

Call:
lm(formula = "log_price ~ neighborhood + accommodates + bathrooms + bedrooms + beds - 1",
data = db)

Residuals:
Min      1Q  Median      3Q     Max
-2.4549 -0.2920 -0.0203  0.2741  3.5323

Coefficients:
Estimate Std. Error t value Pr(>|t|)
neighborhoodBalboa Park              3.994775   0.036539  109.33   <2e-16 ***
neighborhoodBay Ho                   3.780025   0.086081   43.91   <2e-16 ***
neighborhoodBay Park                 3.941847   0.055788   70.66   <2e-16 ***
neighborhoodCarmel Valley            4.034052   0.062811   64.23   <2e-16 ***
neighborhoodCity Heights West        3.698788   0.065502   56.47   <2e-16 ***
neighborhoodClairemont Mesa          3.658339   0.051438   71.12   <2e-16 ***
neighborhoodCollege Area             3.649859   0.064979   56.17   <2e-16 ***
neighborhoodCore                     4.433447   0.058864   75.32   <2e-16 ***
neighborhoodCortez Hill              4.294790   0.057648   74.50   <2e-16 ***
neighborhoodDel Mar Heights          4.300659   0.060912   70.61   <2e-16 ***
neighborhoodEast Village             4.241146   0.032019  132.46   <2e-16 ***
neighborhoodGaslamp Quarter          4.473863   0.052493   85.23   <2e-16 ***
neighborhoodGrant Hill               4.001481   0.058825   68.02   <2e-16 ***
neighborhoodGrantville               3.664989   0.080168   45.72   <2e-16 ***
neighborhoodKensington               4.073520   0.087322   46.65   <2e-16 ***
neighborhoodLa Jolla                 4.400145   0.026772  164.36   <2e-16 ***
neighborhoodLa Jolla Village         4.066151   0.087263   46.60   <2e-16 ***
neighborhoodLinda Vista              3.817940   0.063128   60.48   <2e-16 ***
neighborhoodLittle Italy             4.390651   0.052433   83.74   <2e-16 ***
neighborhoodLoma Portal              4.034473   0.036173  111.53   <2e-16 ***
neighborhoodMarina                   4.046133   0.052178   77.55   <2e-16 ***
neighborhoodMidtown                  4.032038   0.030280  133.16   <2e-16 ***
neighborhoodMidtown District         4.356943   0.071756   60.72   <2e-16 ***
neighborhoodMira Mesa                3.570523   0.061543   58.02   <2e-16 ***
neighborhoodMission Bay              4.251309   0.023318  182.32   <2e-16 ***
neighborhoodMission Valley           4.012410   0.083766   47.90   <2e-16 ***
neighborhoodMoreno Mission           4.028288   0.063342   63.59   <2e-16 ***
neighborhoodNormal Heights           3.791895   0.054730   69.28   <2e-16 ***
neighborhoodNorth Clairemont         3.498107   0.076432   45.77   <2e-16 ***
neighborhoodNorth Hills              3.959403   0.026823  147.61   <2e-16 ***
neighborhoodNorthwest                3.810201   0.078158   48.75   <2e-16 ***
neighborhoodOcean Beach              4.152695   0.032352  128.36   <2e-16 ***
neighborhoodOld Town                 4.127737   0.046523   88.72   <2e-16 ***
neighborhoodOtay Ranch               3.722902   0.091633   40.63   <2e-16 ***
neighborhoodPacific Beach            4.116749   0.022711  181.27   <2e-16 ***
neighborhoodPark West                4.216829   0.050370   83.72   <2e-16 ***
neighborhoodRancho Bernadino         3.873962   0.080780   47.96   <2e-16 ***
neighborhoodRancho Penasquitos       3.772037   0.068808   54.82   <2e-16 ***
neighborhoodRoseville                4.070468   0.065299   62.34   <2e-16 ***
neighborhoodSan Carlos               3.935042   0.093205   42.22   <2e-16 ***
neighborhoodScripps Ranch            3.641239   0.085190   42.74   <2e-16 ***
neighborhoodSerra Mesa               3.912127   0.066630   58.71   <2e-16 ***
neighborhoodSouth Park               3.987019   0.060141   66.30   <2e-16 ***
neighborhoodUniversity City          3.772504   0.039638   95.17   <2e-16 ***
neighborhoodWest University Heights  4.043161   0.048238   83.82   <2e-16 ***
accommodates                         0.150283   0.005086   29.55   <2e-16 ***
bathrooms                            0.132287   0.011886   11.13   <2e-16 ***
bedrooms                             0.147631   0.011960   12.34   <2e-16 ***
beds                                -0.074622   0.007405  -10.08   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.4971 on 6061 degrees of freedom
Multiple R-squared:  0.9904,    Adjusted R-squared:  0.9904
F-statistic: 1.28e+04 on 49 and 6061 DF,  p-value: < 2.2e-16
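The very high R-squared reported here should be read with care. When a formula has no constant term, R's summary measures R-squared against zero rather than against the mean of the dependent variable, so the figure is not comparable with that of m1. The sketch below, on simulated data with made-up names (`toy`, `fe_nocons`, `fe_cons`), fits the same fixed-effects model with and without a constant to illustrate this.

```r
# Simulated stand-in for the real table (hypothetical values)
set.seed(3)
n <- 300
toy <- data.frame(
  neighborhood = factor(sample(c("A", "B", "C"), n, replace = TRUE)),
  beds = rpois(n, 2) + 1
)
area_effect <- c(A = 4.0, B = 4.3, C = 3.7)
toy$log_price <- unname(area_effect[as.character(toy$neighborhood)]) +
  0.1 * toy$beds + rnorm(n, sd = 0.5)

# One dummy per area, no constant (as in m2)
fe_nocons <- lm(log_price ~ neighborhood + beds - 1, data = toy)
# Constant plus dummies for all areas but the first
fe_cons <- lm(log_price ~ neighborhood + beds, data = toy)

# Same slope and same fitted values in both versions
c(coef(fe_nocons)["beds"], coef(fe_cons)["beds"])
all.equal(fitted(fe_nocons), fitted(fe_cons))

# But the reported R-squared differs
c(
  no_constant = summary(fe_nocons)$r.squared,
  with_constant = summary(fe_cons)$r.squared
)
```

Both versions describe the same model; dropping the constant is convenient because each area then gets a directly readable estimate, at the cost of an R-squared that looks inflated.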

Econometrically speaking, what the neighbourhood FE we have introduced imply is that, instead of comparing all house prices across San Diego as equal, we only derive variation from within each neighbourhood. In our particular example, estimating spatial FE also gives you an indirect measure of area desirability: since they are simple dummies in a regression explaining the price of a house, their estimates tell us how much people are willing to pay to stay in a given area. We can visualise this “geography of desirability” by plotting the estimates of each fixed effect on a map:

# Extract neighborhood names from coefficients
nei.names <- m2$coefficients %>%
  as.data.frame() %>%
  row.names() %>%
  str_replace("neighborhood", "")
# Set up as Data Frame
nei.fes <- data.frame(
  coef = m2$coefficients,
  nei = nei.names,
  row.names = nei.names
) %>%
  right_join(
    db, by = c("nei" = "neighborhood")
  )
# Plot
nei.fes %>%
st_as_sf() %>%
ggplot(aes(color = coef)) +
geom_sf() +
scale_color_viridis_c() +
theme_void()

We can see how neighborhoods on the left (west) tend to have higher prices. What we cannot see on the map, but will be apparent if you are familiar with the geography of San Diego, is that the city is bounded by the Pacific Ocean on the left, suggesting neighbourhoods by the beach tend to be more expensive.

Remember that the interpretation of a $$\beta_k$$ coefficient is the effect of variable $$k$$, given all the other explanatory variables included remain constant. By including a single variable for each area, we are effectively forcing the model to compare as equal only house prices that share the same value for each variable; in other words, only houses located within the same area. Introducing FE affords you a higher degree of isolation of the effects of the variables you introduce in your model because you can control for unobserved effects that align spatially with the distribution of the FE you introduce (by neighbourhood, in our case).
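One way to check the “only within the same area” reading is to note that the slope from a regression with one dummy per area equals the slope obtained after subtracting the area mean from both the dependent and the explanatory variable. The sketch below illustrates this on simulated data; the names `area`, `x` and `y` are made up.

```r
# Simulated data (hypothetical): four areas with different levels
set.seed(11)
n <- 400
area <- factor(sample(c("A", "B", "C", "D"), n, replace = TRUE))
# Explanatory variable whose level differs by area
x <- rnorm(n) + as.integer(area)
y <- 2 * as.integer(area) + 0.3 * x + rnorm(n, sd = 0.5)

# Fixed effects via one dummy per area
b_fe <- coef(lm(y ~ area + x - 1))["x"]

# Same slope after subtracting the area mean from y and x
y_within <- y - ave(y, area)
x_within <- x - ave(x, area)
b_within <- coef(lm(y_within ~ x_within - 1))["x_within"]

c(dummies = unname(b_fe), demeaned = unname(b_within))
```

Because the area means are removed, differences between areas cannot contribute to the estimate; only deviations from each area's own average do.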

### Spatial regimes

At the core of estimating spatial FEs is the idea that, instead of assuming the dependent variable behaves uniformly over space, there are systematic effects following a geographical pattern that affect its behaviour. In other words, spatial FEs introduce econometrically the notion of spatial heterogeneity. They do this in the simplest possible form: by allowing the constant term to vary geographically. The other elements of the regression are left untouched and hence apply uniformly across space. The idea of spatial regimes (SRs) is to generalize the spatial FE approach to allow not only the constant term to vary but also the coefficient of any other explanatory variable. This implies that the equation we will be estimating is:

$\log(P_i) = \alpha_r + \beta_{1r} Acc_i + \beta_{2r} Bath_i + \beta_{3r} Bedr_i + \beta_{4r} Beds_i + \epsilon_i$

where we are not only allowing the constant term to vary by region ($$\alpha_r$$), but also every other parameter ($$\beta_{kr}$$).

Also, given we are going to allow every coefficient to vary by regime, we will need to explicitly set a constant term that we can allow to vary:
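A minimal sketch of one way to do this with lm is shown below, using simulated data rather than the San Diego table. The column `one` is the explicit constant, and the names `toy`, `m_regimes` and `by_area` are made up for the illustration; this is not necessarily the exact code the chapter goes on to use.

```r
# Simulated stand-in for the real table (hypothetical values)
set.seed(5)
n <- 300
toy <- data.frame(
  neighborhood = factor(sample(c("A", "B", "C"), n, replace = TRUE)),
  beds = rpois(n, 2) + 1
)
toy$log_price <- 4 + 0.1 * toy$beds + rnorm(n, sd = 0.5)

# Explicit constant term that can be interacted with the regime variable
toy$one <- 1

# "0 +" drops the global constant; ":" interacts every term with the area
m_regimes <- lm(log_price ~ 0 + (one + beds):neighborhood, data = toy)
coef(m_regimes)

# Check: the estimates match separate regressions run area by area
by_area <- sapply(
  split(toy, toy$neighborhood),
  function(d) coef(lm(log_price ~ beds, data = d))
)
by_area
```

Each area ends up with its own constant and its own slope, which is the same as fitting one regression per area, although a single model keeps all the estimates in one object.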

## 6.8 Questions

We will be using again the Madrid AirBnb dataset:

mad_abb <- st_read('./data/assignment_1_madrid/madrid_abb.gpkg')
Reading layer `madrid_abb' from data source
`/Users/franciscorowe/Dropbox/Francisco/uol/teaching/envs453/202223/san/data/assignment_1_madrid/madrid_abb.gpkg'
using driver `GPKG'
Simple feature collection with 18399 features and 16 fields
Geometry type: POINT
Dimension:     XY
Bounding box:  xmin: -3.86391 ymin: 40.33243 xmax: -3.556 ymax: 40.56274
Geodetic CRS:  WGS 84
colnames(mad_abb)
  "price"           "price_usd"       "log1p_price_usd" "accommodates"
 "bathrooms_text"  "bathrooms"       "bedrooms"        "beds"
 "neighbourhood"   "room_type"       "property_type"   "WiFi"
 "Coffee"          "Gym"             "Parking"         "km_to_retiro"
 "geom"           

In addition to those we have already seen, the columns to use here are:

• neighbourhood: a column with the name of the neighbourhood in which the property is located

With this at hand, answer the following questions:

1. Fit a baseline model with only property characteristics explaining the log of price

$\log(P_i) = \alpha + \beta_1 Acc_i + \beta_2 Bath_i + \beta_3 Bedr_i + \beta_4 Beds_i + \epsilon_i$

2. Augment the model with fixed effects at the neighbourhood level

$\log(P_i) = \alpha_r + \beta_1 Acc_i + \beta_2 Bath_i + \beta_3 Bedr_i + \beta_4 Beds_i + \epsilon_i$

3. [Optional] Augment the model with spatial regimes at the neighbourhood level:

$\log(P_i) = \alpha_r + \beta_{1r} Acc_i + \beta_{2r} Bath_i + \beta_{3r} Bedr_i + \beta_{4r} Beds_i + \epsilon_i$

4. Fit a model that augments the baseline in 1. with the spatial lag of a variable you consider interesting. Motivate this choice. Note that to complete this, you will need to also generate a spatial weights matrix.

In each instance, provide a brief interpretation (no more than a few lines for each) that demonstrates your understanding of the underlying concepts behind your approach.
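As a reminder of the mechanics behind a spatial lag, the sketch below builds a row-standardised weights matrix from the k nearest neighbours of each point and uses it to average a variable over those neighbours. It relies on base R only and on simulated points; the names `coords`, `pool`, `W` and `w_pool`, as well as the choice of k = 5, are assumptions made for the illustration. The spdep package loaded in Section 6.1 offers knearneigh(), knn2nb(), nb2listw() and lag.listw() for the same purpose.

```r
# Hypothetical point coordinates and a variable to be lagged
set.seed(9)
n <- 50
coords <- cbind(x = runif(n), y = runif(n))
pool <- rbinom(n, 1, 0.3)

# Pairwise distances; a point is not its own neighbour
d <- as.matrix(dist(coords))
diag(d) <- Inf

# Binary weights for the k nearest neighbours of each point
k <- 5
W <- matrix(0, n, n)
for (i in seq_len(n)) {
  W[i, order(d[i, ])[1:k]] <- 1
}
# Row-standardise so that each row sums to one
W <- W / rowSums(W)

# Spatial lag: average of the variable among each point's neighbours
w_pool <- as.numeric(W %*% pool)
head(w_pool)
```

The resulting vector can be added to the table as a new column and included in the model formula like any other explanatory variable.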

1. Example: Assume that new houses tend to be built more often in areas with low deprivation. If that is the case, then $$NEW$$ and $$IMD$$ will be correlated with each other (as well as with the price of a house, as we are hypothesizing in this case). If we calculate a simple correlation between $$P$$ and $$IMD$$, the coefficient will represent the degree of association between both variables, but it will also include some of the association between $$IMD$$ and $$NEW$$. That is, part of the obtained correlation coefficient will be due not to the fact that higher prices tend to be found in areas with low IMD, but to the fact that new houses tend to be more expensive. This is because (in this example) new houses tend to be built in areas with low deprivation and simple bivariate correlation cannot account for that.

2. Keep in mind that regression is no magic. We are only discounting the effect of other confounding factors that we include in the model, not of all potentially confounding factors.

3. If you need to refresh your knowledge on spatial weight matrices, check Block E of Arribas-Bel (2019); Chapter 4 of Rey, Arribas-Bel, and Wolf (forthcoming); or the Spatial Weights Section of Rowe (2022).