4  Lab: Logistic Regression

Author

Zi Ye

Published

November 29, 2024

Last week, we learnt how to use qualitative variables in a multiple linear regression model to understand the relationship between independent variables X1 … Xn and the dependent variable Y. Today we will learn to use logistic regression for a binary dependent variable. A logistic regression model is a type of regression analysis used when the dependent variable is binary (e.g., “success/failure” or “0/1”). It estimates the probability of one outcome relative to the other using a logistic function. This model is commonly used in situations such as predicting disease presence (yes/no) or customer churn (stay/leave). The independent variables can be continuous, categorical, or a mix of both. The model’s estimates are usually reported as odds ratios, showing how predictors affect the likelihood of the outcome.
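As a quick numerical illustration (not using the lab data), the logistic function maps any value of the linear predictor (the log odds) onto a probability between 0 and 1; in R this inverse-logit function is plogis():

# A toy illustration: the logistic (inverse-logit) function turns log odds into probabilities
plogis(c(-2, 0, 2))   # approximately 0.12, 0.50, 0.88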

The lecture’s slides can be found here.

Learning objectives

An understanding of, and ability to:

The application context of a binomial logistic regression model is when the dependent variable under investigation is a binary variable. Usually, a value of 1 for this dependent variable means the occurrence of an event, and 0 otherwise. For example, the dependent variable for this practical is whether a person is a long-distance commuter (coded 1), and 0 otherwise.

In this week’s practical, we are going to apply logistic regression analysis in an attempt to answer the following research question:

RESEARCH QUESTION: Who is willing to commute long distances?

The practical is split into two main parts. The first focuses on implementing a binary logistic regression model with R. The second part focuses on the interpretation of the resulting estimates.

4.1 Preparing the input variables

Prepare the data for implementing a logistic regression model. The data sets used in this practical are “sar_sample_label.csv” and “sar_sample_code.csv”. The SAR is a snapshot of census microdata, i.e. individual-level data. The sample has been drawn and anonymised from the census and is known as the Samples of Anonymised Records (SARs).

You may need to download the two datasets from our GitHub website if you haven’t already. The two csv files actually contain the same dataframe: one uses labels as the values while the other uses codes. We will first read in both, because the labels are friendlier for the data overview, and then focus on using “sar_sample_code.csv” in the regression model, as the codes are easier to work with.

if(!require("tidyverse"))
  install.packages("tidyverse",dependencies = T)
Warning: package 'tidyverse' was built under R version 4.3.2
Warning: package 'ggplot2' was built under R version 4.3.2
Warning: package 'tibble' was built under R version 4.3.2
Warning: package 'tidyr' was built under R version 4.3.2
Warning: package 'readr' was built under R version 4.3.2
Warning: package 'purrr' was built under R version 4.3.2
Warning: package 'dplyr' was built under R version 4.3.2
Warning: package 'stringr' was built under R version 4.3.2
Warning: package 'forcats' was built under R version 4.3.2
Warning: package 'lubridate' was built under R version 4.3.2
if(!require("broom"))
  install.packages("broom",dependencies = T)
Warning: package 'broom' was built under R version 4.3.2
library(tidyverse)
library(broom)
sar_label <- read_csv("../data/SAR/sar_sample_label.csv")
Rows: 50000 Columns: 39
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (39): country, region, county, la_group, age_group, sex, marital_status,...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
sar_code <- read_csv("../data/SAR/sar_sample_code.csv")
Rows: 50000 Columns: 39
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
dbl (39): country, region, county, la_group, age_group, sex, marital_status,...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
View(sar_label)
View(sar_code)

glimpse(sar_label)
glimpse(sar_code)
summary(sar_code)
summary(sar_label)

The outcome variable is a person’s commuting distance captured by the variable “work_distance”.

table(sar_label$work_distance)

                                10 to <20 km 
                                        3650 
                                  2 to <5 km 
                                        4414 
                                20 to <40 km 
                                        2014 
                                 40 to <60km 
                                         572 
                                 5 to <10 km 
                                        4190 
                                60km or more 
                                         703 
                       Age<16 or not working 
                                       25975 
                                     At home 
                                        2427 
                              Less than 2 km 
                                        4028 
                              No fixed place 
                                        1943 
Work outside England and Wales but within UK 
                                          29 
                             Work outside UK 
                                          21 
  Works at offshore installation (within UK) 
                                          34 
table(sar_code$work_distance)

   -9     1     2     3     4     5     6     7     8     9    10    11    12 
25975  4028  4414  4190  3650  2014   572   703  2427  1943    29    21    34 

There are a variety of categories in this variable; however, we are only interested in commuting distance and therefore in people who reported a commuting distance. Thus, we will work with the numeric codes of the variable ranging from 1 to 8.

Codes and labels of the work_distance variable:
1 Less than 2 km
2 2 to <5 km
3 5 to <10 km
4 10 to <20 km
5 20 to <40 km
6 40 to <60 km
7 60km or more
8 At home
9 No fixed place
10 Work outside England and Wales but within UK
11 Work outside UK
12 Works at offshore installation (within UK)

As we are also interested in exploring whether people with different socio-economic statuses (or occupations) tend to be associated with varying probabilities of commuting over long distances, we further filter or select cases.

table(sar_label$nssec)

                              Child aged 0-15 
                                         9698 
                            Full-time student 
                                         3041 
              Higher professional occupations 
                                         3162 
                     Intermediate occupations 
                                         5288 
          Large employers and higher managers 
                                          909 
Lower managerial and professional occupations 
                                         8345 
  Lower supervisory and technical occupations 
                                         2924 
         Never worked or long-term unemployed 
                                         2261 
                          Routine occupations 
                                         4660 
                     Semi-routine occupations 
                                         5893 
      Small employers and own account workers 
                                         3819 
table(sar_code$nssec)

   1    2    3    4    5    6    7    8    9   10   12 
 909 3162 8345 5288 3819 2924 5893 4660 2261 3041 9698 

Using nssec, we select people who reported an occupation, and delete cases with numeric codes from 9 to 12, i.e. people who have never worked or are long-term unemployed, full-time students, those not classifiable, and children.

Codes and labels of the nssec variable:
1 Large employers and higher managers
2 Higher professional occupations
3 Lower managerial and professional occupations
4 Intermediate occupations
5 Small employers and own account workers
6 Lower supervisory and technical occupations
7 Semi-routine occupations
8 Routine occupations
9 Never worked or long-term unemployed
10 Full-time student
11 Not classifiable
12 Child aged 0-15

Now, as in last week’s practical, we use filter() to prepare our dataframe for today. You may already have realised that sar_code makes the filtering easier.

sar_df <- sar_code %>% filter(work_distance<=8 & nssec <=8 )

Q1. Summarise the frequencies of the two variables “work_distance” and “nssec” with the new data.
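If you want to check your answer, a minimal sketch (one possible way to summarise the frequencies on the filtered data) is:

# One way to summarise the two variables in the new data frame
table(sar_df$work_distance)
table(sar_df$nssec)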

Recode the “work_distance” variable into a binary dependent variable

A simple way to create a binary dependent variable representing long-distance commuting is to use the mutate() function, as discussed in last week’s practical session. Before creating the binary variable from the “work_distance” variable, we need to define what counts as a long-distance commute. Such a definition can vary. Here we define a long-distance commute as any commuting move over a distance above 60km (the category “60km or more”).

sar_df <- sar_df %>% mutate(
  New_work_distance = if_else(work_distance >6, 1,0))

Q2. Check the new sar_df dataframe with the new column named New_work_distance by using the code you have learnt.
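A minimal sketch of one way to check the recode (assuming the mutate() above has been run) is to cross-tabulate the original codes against the new binary variable:

# Cross-tabulate the original work_distance codes against the new binary variable
table(sar_df$work_distance, sar_df$New_work_distance)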

Prepare your “nssec” variable before the regression model

The nssec variable is categorical. Therefore, as we learnt last week, before adding categorical variables into the regression model we first need to make it a factor and then identify the reference category.

We are interested in whether people whose occupation is “Higher professional occupations” are associated with a lower probability of commuting over long distances when compared to people in other occupations. We therefore set “Higher professional occupations” (code 2) as the reference category:

sar_df$nssec <- relevel(as.factor(sar_df$nssec), ref = "2")

4.2 Implementing a logistic regression model

The binary dependent variable is long-distance commuting, variable name New_work_distance.

The independent variables are gender and socio-economic status.

For gender, we use male as the baseline.

sar_df$sex <- relevel(as.factor(sar_df$sex),ref="1")

For socio-economic status, we have already set code 2 (Higher professional occupations) as the baseline category above, so the model explores whether people in the other occupational groups show a higher or lower probability of commuting longer than 60km compared with higher professional occupations.

#create the model
m.glm = glm(New_work_distance~sex + nssec, 
            data = sar_df, 
            family= "binomial")
# inspect the results
summary(m.glm) 

Call:
glm(formula = New_work_distance ~ sex + nssec, family = "binomial", 
    data = sar_df)

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept) -1.67337    0.05329 -31.401  < 2e-16 ***
sex2        -0.36678    0.04196  -8.742  < 2e-16 ***
nssec1      -0.12881    0.11306  -1.139    0.255    
nssec3      -0.38761    0.06467  -5.994 2.05e-09 ***
nssec4      -1.03079    0.08439 -12.214  < 2e-16 ***
nssec5       1.22639    0.06489  18.898  < 2e-16 ***
nssec6      -1.38992    0.10919 -12.730  < 2e-16 ***
nssec7      -1.43909    0.09002 -15.986  < 2e-16 ***
nssec8      -1.48534    0.09646 -15.398  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 20441  on 33025  degrees of freedom
Residual deviance: 17968  on 33017  degrees of freedom
AIC: 17986

Number of Fisher Scoring iterations: 6
# odds ratios
exp(coef(m.glm)) 
(Intercept)        sex2      nssec1      nssec3      nssec4      nssec5 
  0.1876138   0.6929649   0.8791416   0.6786766   0.3567267   3.4088847 
     nssec6      nssec7      nssec8 
  0.2490946   0.2371432   0.2264258 
# confidence intervals
exp(confint(m.glm, level = 0.95)) 
Waiting for profiling to be done...
                2.5 %    97.5 %
(Intercept) 0.1688060 0.2080319
sex2        0.6381810 0.7522773
nssec1      0.7017990 1.0935602
nssec3      0.5981911 0.7708192
nssec4      0.3020431 0.4205270
nssec5      3.0037298 3.8739884
nssec6      0.2002766 0.3073830
nssec7      0.1984396 0.2824629
nssec8      0.1869397 0.2729172

Q3. If we want to explore whether people whose occupation is “Large employers and higher managers”, “Higher professional occupations” or “Routine occupations” are associated with a higher probability of commuting over long distances when compared to people in other occupations, how should we prepare the input independent variables and what would the specified regression model be?

Hint: use mutate() to create a new column; keep the values “Large employers and higher managers”, “Higher professional occupations” and “Routine occupations” as the original categories, while setting the rest to “Other occupations” (recall in Lab 3 what we did when assigning the regions not within “London”, “Wales”, “Scotland” and “Northern Ireland” to “Other Regions in England”). Here, by using the SAR in code format, we can make this easier:

sar_df <- sar_df %>% mutate(New_nssec = fct_other(
  nssec,
  keep = c("1", "2", "8"),
  other_level = "0"
))

Or, by using if_else() and %in% in R, we can achieve the same result. %in% is an operator used to test whether elements of one vector are present in another; it returns TRUE for elements that are found and FALSE otherwise (a quick toy example follows the code below).

# keep codes 1, 2 and 8 as they are; recode everything else to "0" (converted to a factor below)
sar_df <- sar_df %>% mutate(New_nssec = if_else(!nssec %in% c("1", "2", "8"), "0", as.character(nssec)))
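As a quick toy check of how %in% behaves (using made-up values, not the SAR data):

# Which elements of the first vector appear in the second?
c(1, 5, 8) %in% c(1, 2, 8)   # TRUE FALSE TRUE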

Use “Other occupations” (code 0) as the reference category with relevel(as.factor()) and then create the regression model: glm(New_work_distance~sex + New_nssec, data = sar_df, family= "binomial"). Can you now run the model by yourself? Find the answer at the end of the practical.

4.2.1 Model fit

We use the R library pscl to calculate measures of model fit.

if(!require("pscl"))
  install.packages("pscl")
Loading required package: pscl
Warning: package 'pscl' was built under R version 4.3.3
Classes and Methods for R originally developed in the
Political Science Computational Laboratory
Department of Political Science
Stanford University (2002-2015),
by and under the direction of Simon Jackman.
hurdle and zeroinfl functions by Achim Zeileis.
library(pscl)

Relating back to this week’s lecture notes, what is the Pseudo R2 of the fitted logistic model (from the Model Summary table below)?

# Pseudo R-squared
pR2(m.glm)
fitting null model for pseudo-r2
          llh       llhNull            G2      McFadden          r2ML 
-8.983928e+03 -1.022037e+04  2.472890e+03  1.209785e-01  7.214246e-02 
         r2CU 
 1.563288e-01 
# or in better format
pR2(m.glm) %>% round(4) %>% tidy()
fitting null model for pseudo-r2
# A tibble: 6 × 2
  names              x
  <chr>          <dbl>
1 llh       -8984.    
2 llhNull  -10220.    
3 G2         2473.    
4 McFadden      0.121 
5 r2ML          0.0721
6 r2CU          0.156 
  • llh: The log-likelihood of the fitted model.

  • llhNull: The log-likelihood of the null model (without predictors).

  • G2: The likelihood ratio statistic, showing the model’s improvement over the null model.

  • McFadden: McFadden’s pseudo R-squared (a common measure of model fit).

  • r2ML: Maximum likelihood pseudo R-squared (also known as the Cox & Snell pseudo R-squared).

  • r2CU: Cragg and Uhler’s (Nagelkerke) pseudo R-squared.

Unlike multiple linear regression, where R-squared indicates the percentage of variance in the dependent variable that is explained by the independent variables, R-squared is not directly applicable in a logistic regression model. Instead, we use pseudo R-squared measures, such as McFadden’s pseudo R-squared or the Cox & Snell pseudo R-squared, to provide an indication of model fit. For an individual-level dataset like the SAR, a pseudo R-squared of around 0.3 is generally considered to indicate a well-fitting model.
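To connect these measures back to the lecture, a minimal sketch computes McFadden’s pseudo R-squared by hand, as 1 minus the ratio of the fitted and null log-likelihoods; it should reproduce the McFadden value reported by pR2() above:

# McFadden's pseudo R-squared computed by hand from the log-likelihoods
ll_fitted <- as.numeric(logLik(m.glm))
ll_null <- as.numeric(logLik(glm(New_work_distance ~ 1, data = sar_df, family = "binomial")))
1 - ll_fitted / ll_null   # should be close to the McFadden value from pR2()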

4.2.2 Statistical significance of regression coefficients or covariate effects

Similar to statistical inference in a linear regression context, the p-values of the regression coefficients are used to assess their significance, for instance by comparing them to the conventional significance level of 0.05 (a short sketch extracting these p-values follows the two points below):

  • If the p-value of a coefficient is smaller than 0.05, the coefficient is statistically significant. In this case, you can say that the relationship between the independent variable and the outcome variable is statistically significant.

  • If the p-value of a coefficient is larger than 0.05, the coefficient is statistically insignificant. In this case, you can conclude that there is no statistically significant association or relationship between the independent variable and the outcome variable.
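A minimal sketch of this rule of thumb in code, using the broom package loaded earlier to pull out the coefficient table and flag which p-values fall below 0.05:

# Extract the coefficient table and flag coefficients with p-values below 0.05
tidy(m.glm) %>%
  mutate(significant = p.value < 0.05)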

4.2.3 Interpreting estimated regression coefficients

  • The interpretation of the coefficients (B) and odds ratios (Exp(B)) of the independent variables differs from that in a linear regression setting.

  • Interpreting the regression coefficients (a short sketch tabulating these odds ratios follows this list):

o   For the variable sex, the negative sign of the coefficient and the odds ratio estimate indicate that the odds of commuting over long distances for females are 0.693 times the odds for males (the reference group), with a confidence interval (CI), or likely range, of roughly 0.64 to 0.75, holding all other variables (the socio-economic classification variable) constant. Put differently, being female reduces the odds of long-distance commuting by 30.7% (1-0.693).

o   For the variable nssec, the signs of the coefficients and the odds ratio estimates indicate that, compared with higher professional occupations (the reference group) and holding all other variables (the sex variable) constant, the odds of long-distance commuting for those whose socio-economic classification is:

  • small employers and own account workers (nssec=5) are 3.409 times those of higher professional occupations, with a likely range (CI) of roughly 3.0 to 3.9;

  • large employers and higher managers (nssec=1) are not significantly different, since the p-value is > 0.05, so there is no statistically significant relationship between being a large employer or higher manager and long-distance commuting;

  • routine occupations (nssec=8) are 0.226 times those of higher professional occupations, with a CI of roughly 0.19 to 0.27, holding other variables constant. Put differently, being in a routine occupation decreases the odds of long-distance commuting by 77.4% (1-0.226).
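As referenced above, one way to tabulate the odds ratios, their 95% confidence intervals and the implied percentage change in the odds ((Exp(B) - 1) * 100) in one place is with the broom package loaded earlier (a minimal sketch, not part of the original practical):

# Odds ratios, 95% CIs and the implied percentage change in the odds
tidy(m.glm, exponentiate = TRUE, conf.int = TRUE) %>%
  mutate(pct_change_in_odds = (estimate - 1) * 100)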

Q4. Interpret the regression coefficients (i.e. Exp(B)) of the variables “nssec=Lower managerial and professional occupations” and “nssec=Semi-routine occupations”.

Q5. Can you identify the significant factors associated with commuting over long distances?

4.2.4 Prediction using fitted regression model

Relating to this week’s lecture, from the coefficient estimates (B) in the model summary above, the log odds of a person commuting over a long distance are equal to:

Log odds of long-distance commuting = -1.673 - 0.367*sexFemale - 0.129*nssec1 - 0.388*nssec3 - 1.031*nssec4 + 1.226*nssec5 - 1.390*nssec6 - 1.439*nssec7 - 1.485*nssec8

By using R, you can create an object containing the cases you would like to predict for. Here we create three people; see whether you can interpret their gender and socio-economic classification.

objs <- data.frame(sex=c("1","2","1"),nssec=c("7","3","5"))

Then we can predict by using our model m.glm:

predict(m.glm, objs,type = "response")
         1          2          3 
0.04259618 0.08108050 0.39007797 

So let us look at these three people. For the first, a male classified in a semi-routine occupation in the NSSEC, the predicted probability of travelling over 60km to work is only 4.26%. For the second, a female in a lower managerial and professional occupation, the predicted probability of long-distance commuting is 8.11%. You should now be able to interpret the prediction outcome for our last person.
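To connect the prediction back to the log odds equation above, a minimal check reproduces the first person’s probability by hand (male is the reference level for sex, so only the intercept and the nssec7 coefficient contribute):

# Reproduce the first prediction by hand: male, semi-routine occupation (nssec = 7)
eta <- coef(m.glm)["(Intercept)"] + coef(m.glm)["nssec7"]   # linear predictor (log odds)
plogis(eta)   # inverse-logit; should match the 4.26% from predict()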

4.3 Extension activities

The extension activities are designed to help you prepare for Assignment 2, which is in progress. For this week, see whether you can:

  • Select a regression strategy and explain why a linear or logistic model is appropriate

  • Perform one or a series of regression models, including different combinations of your chosen independent variables to explain and/or predict your dependent variable

Answer for the model in Q3

In Q3, we want to explore whether people with occupations of “Large employers and higher managers”, “Higher professional occupations” and “Routine occupations” are associated with a higher probability of commuting over long distances when compared to people in other occupations. So we create the variable New_nssec with 0 for “Other occupations”, while keeping “1”, “2” and “8” as their original categories.

We can first check our new variable New_nssec:

table(sar_df$New_nssec)

    0     1     2     8 
24615   887  3055  4469 

Then we set the reference categories: sex as 1 (male) and New_nssec as 0, which is “Other occupations”:

sar_df$sex <- relevel(as.factor(sar_df$sex),ref="1")
sar_df$New_nssec <- relevel(as.factor(sar_df$New_nssec),ref="0")

Now, we build the logistic regression model and check out the outcomes:

model_new = glm(New_work_distance~sex + New_nssec, data = sar_df, family= "binomial")

summary(model_new)

Call:
glm(formula = New_work_distance ~ sex + New_nssec, family = "binomial", 
    data = sar_df)

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept) -1.92955    0.02786 -69.253  < 2e-16 ***
sex2        -0.61757    0.03936 -15.688  < 2e-16 ***
New_nssec1   0.19183    0.10336   1.856   0.0634 .  
New_nssec2   0.32582    0.05678   5.738 9.58e-09 ***
New_nssec8  -1.14082    0.08434 -13.526  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 20441  on 33025  degrees of freedom
Residual deviance: 19868  on 33021  degrees of freedom
AIC: 19878

Number of Fisher Scoring iterations: 6

For the model interpretation, we need:

# odds ratios
exp(coef(model_new)) 
(Intercept)        sex2  New_nssec1  New_nssec2  New_nssec8 
  0.1452137   0.5392528   1.2114691   1.3851650   0.3195562 
# confidence intervals
exp(confint(model_new, level = 0.95)) 
Waiting for profiling to be done...
                2.5 %    97.5 %
(Intercept) 0.1374508 0.1533142
sex2        0.4991249 0.5824112
New_nssec1  0.9846735 1.4770819
New_nssec2  1.2380128 1.5467297
New_nssec8  0.2698515 0.3756702
# model fit
pR2(model_new) %>% round(4) %>% tidy()
fitting null model for pseudo-r2
Warning: 'tidy.numeric' is deprecated.
See help("Deprecated")
# A tibble: 6 × 2
  names              x
  <chr>          <dbl>
1 llh       -9934.    
2 llhNull  -10220.    
3 G2          573.    
4 McFadden      0.028 
5 r2ML          0.0172
6 r2CU          0.0373