Last week, we learnt how to use qualitative variables in multiple linear regression model to understand the relationship between independent variables X1 … Xn and the dependent variable Y. Today we will learn to use logistic regression for binary dependent variable. A logistic regression model is a type of regression analysis used when the dependent variable is binary (e.g., “success/failure” or “0/1”). It estimates the probability of one outcome relative to the other using a logistic function. This model is commonly used in situations like predicting disease presence (yes/no) or customer churn (stay/leave). The independent variables can be continuous, categorical, or a mix of both. The model’s output is in the form of odds ratios, showing how predictors affect the likelihood of the outcome.
Estimate and interpret a logistic regression model
Assess the model fit
The application context of a binomial logistic regression model is when the dependent variable under investigation is a binary variable. Usually, a value of 1 for this dependent variable means the occurrence of an event; and, 0 otherwise. For example, the dependent variable for this practical is whether a person is a long-distance commuter i.e. 1, and 0 otherwise.
In this week’s practical, we are going to apply logistic regression analysis in an attempt to answer the following research question:
RESEARCH QUESTION: Who is willing to commute long distances?
The practical is split into two main parts. The first focuses on implementing a binary logistic regression model with R. The second part focuses the interpretation of the resulting estimates.
4.1 Preparing the input variables
Prepare the data for implementing a logistic regression model. The data set used in this practical is the “sar_sample_label.csv” and “sar_sample_code.csv”. The SAR is a snapshot of census microdata, which are individual level data. The data sample has been drawn and anonymised from census and known as the Samples of Anonymised Records (SARs).
You may need to download the two datasets from our github website if you haven’t. The two csv are actually the same dataframe, only one uses the label as the value but the other uses the code. We will first read in both for the data overview the labels are more friendly, and then we focus on using “sar_sample_code.csv” in the regression model as it is easier for coding.
Rows: 50000 Columns: 39
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (39): country, region, county, la_group, age_group, sex, marital_status,...
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Rows: 50000 Columns: 39
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
dbl (39): country, region, county, la_group, age_group, sex, marital_status,...
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
The outcome variable is a person’s commuting distance captured by the variable “work_distance”.
table(sar_label$work_distance)
10 to <20 km
3650
2 to <5 km
4414
20 to <40 km
2014
40 to <60km
572
5 to <10 km
4190
60km or more
703
Age<16 or not working
25975
At home
2427
Less than 2 km
4028
No fixed place
1943
Work outside England and Wales but within UK
29
Work outside UK
21
Works at offshore installation (within UK)
34
There are a variety of categories in the variable, however, we are only interested in commuting distance and therefore in people reporting their commuting distance. Thus, we will explore the numeric codes of the variable ranging from 1 to 8.
Code for Work_distance
Categories
1
Less than 2 km
2
2 to <5 km
3
5 to <10 km
4
10 to <20 km
5
20 to <40 km
6
40 to <60 km
7
60km or more
8
At home
9
No fixed place
10
Work outside England and Wales but within UK
11
Work outside UK
12
Works at offshore installation (within UK)
As we are also interested in exploring whether people with different socio-economic statuses (or occupations) tend to be associated with varying probabilities of commuting over long distances, we further filter or select cases.
table(sar_label$nssec)
Child aged 0-15
9698
Full-time student
3041
Higher professional occupations
3162
Intermediate occupations
5288
Large employers and higher managers
909
Lower managerial and professional occupations
8345
Lower supervisory and technical occupations
2924
Never worked or long-term unemployed
2261
Routine occupations
4660
Semi-routine occupations
5893
Small employers and own account workers
3819
Using nssec, we select people who reported an occupation, and delete cases with numeric codes from 9 to 12, who are unemployed, full-time students, children and not classifiable.
Code for nssec
Category labels
1
Large employers and higher managers
2
Higher professional occupations
3
Lower managerial and professional occupations
4
Intermediate occupations
5
Small employers and own account workers
6
Lower supervisory and technical occupations
7
Semi-routine occupations
8
Routine occupations
9
Never worked or long-term employed
10
Full-time student
11
Not classifiable
12
Child aged 0-15
Now, similar to next week, we use the filter() to prepare our dataframe today. You may already realise that using sar_code is easier to do the filtering.
Q1. Summarise the frequencies of the two variables “work_distance” and “nssec” with the new data.
Recode the “work_distance” variable into a binary dependent variable
A simple way to create a binary dependent variable representing long-distance commuting is to use the mutate() function as discussed in last week’s practical session. Before creating the binary variables from the “work_distance” variable, we need to define what counts as a long-distance commuting move. Such definition can vary. Here we define a long-distance commuting move as any commuting move over a distance above 60km (the category of “60km or more”).
Q2. Check the new sar_df dataframe with new column named New_work_distance by using the codes you have learnt.
Prepare your “nssec” variable before the regression model
The nssec is a categorical variable. Therefore, as we’ve learnt last week, before adding the categorical variables into the regression model, we need first make it a factor and then identify the reference category.
We are interested in whether people with occupations being “Higher professional occupations” are associated with a lower probability of commuting over long distances when comparing to people in other occu
For socio-economic status, we use code 5 (Small employers and Own account workers) as the baseline category to explore whether people work as independent employers show lower probability of commuting longer than 60km compared with other occupations.
#create the modelm.glm =glm(New_work_distance~sex + nssec, data = sar_df, family="binomial")# inspect the resultssummary(m.glm)
Call:
glm(formula = New_work_distance ~ sex + nssec, family = "binomial",
data = sar_df)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.67337 0.05329 -31.401 < 2e-16 ***
sex2 -0.36678 0.04196 -8.742 < 2e-16 ***
nssec1 -0.12881 0.11306 -1.139 0.255
nssec3 -0.38761 0.06467 -5.994 2.05e-09 ***
nssec4 -1.03079 0.08439 -12.214 < 2e-16 ***
nssec5 1.22639 0.06489 18.898 < 2e-16 ***
nssec6 -1.38992 0.10919 -12.730 < 2e-16 ***
nssec7 -1.43909 0.09002 -15.986 < 2e-16 ***
nssec8 -1.48534 0.09646 -15.398 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 20441 on 33025 degrees of freedom
Residual deviance: 17968 on 33017 degrees of freedom
AIC: 17986
Number of Fisher Scoring iterations: 6
Q3. If we want to explore whether people with occupation being “Large employers and higher managers”, “Higher professional occupations” and “Routine occupations” are associated with higher probability of commuting over long distance when comparing to people in other occupation, how will we prepare the input independent variables and what will be the specified regression model?
Hint: use mutate() to create a new column, set the value of “Large employers and higher managers”, “Higher professional occupations” and “Routine occupations” as original, while the rest as “Other occupations” (recall in Lab 3 what we did for assigning the regions not within “London”, “Wales”, “Scotland” and “Northern Ireland” as “Other Regions in England”). Here by using the SAR in code format, we can make this more easier by using:
Or by using if_else and %in% in R, we can achieve the same result. %in% is an operator used to test if elements of one vector are present in another. It returns TRUE for elements found and FALSE otherwise.
Use “Other occupations” (code: 0) as the reference category by relevel(as.factor()) and then create the regression model: glm(New_work_distance~sex + New_nssec, data = sar_df, family= "binomial"). Can you now run the model by yourself? Find the answer at the end of the practical.
4.2.1Model fit
We include the R library pscl for calculate the measures of fit.
Warning: package 'pscl' was built under R version 4.3.3
Classes and Methods for R originally developed in the
Political Science Computational Laboratory
Department of Political Science
Stanford University (2002-2015),
by and under the direction of Simon Jackman.
hurdle and zeroinfl functions by Achim Zeileis.
library(pscl)
Relating back to this week’s lecture notes, what is the Pseudo R2 of the fitted logistic model (from the Model Summary table below)?
llhNull: The log-likelihood of the null model (without predictors).
G2: The likelihood ratio statistic, showing the model’s improvement over the null model.
McFadden: McFadden’s pseudo R-squared (a common measure of model fit).
r2ML: Maximum likelihood pseudo R-squared.
r2CU: Cox & Snell pseudo R-squared.
Different from the multiple linear regression, whose R-squared indicates % of the variance in the dependent variables that is explained by the independent variable. In logistic regression model, R-squared is not directly applicable. Instead, we use pseudo R-squared measures, such as McFadden’s pseudo R-squared, or Cox & Snell pseudo R-squared to provide an indication of model fit. For the individual level dataset like SAR, value around 0.3 is considered good for well-fitting.
4.2.2Statistical significance of regression coefficients or covariate effects
Similar to the statistical inference in a linear regression model context, p-values of regression coefficients are used to assess significances of coefficients; for instance, by comparing p-values to the conventional level of significance of 0.05:
· If the p-value of a coefficient is smaller than 0.05, the coefficient is statistically significant. In this case, you can say that the relationship between an independent variable and the outcome variable is statistically significant.
· If the p-value of a coefficient is larger than 0.05, the coefficient is statistically insignificant. In this case, you can say or conclude that there is no statistically significant association or relationship between an independent variable and the outcome variable.
The interpretation of coefficients (B) and odds ratios (Exp(B)) for the independent variables differs from that in a linear regression setting.
Interpreting the regression coefficients.
o For the variable sex, a negative sign and the odds ratio estimate indicate that the probability of commuting over long distances for female is 0.693 times less likely than male (the reference group), with the confidence intervals (CI) or likely range between 0.6 to 0.7, holding all other variables constant (the socio-economic classification variable). Put it differently, being females reduces the probability of long-distance commuting by 30.7% (1-0.693).
o For variable nssec, a positive significant and the odds ratio estimate indicate that the probability of long-distance commuting for those whose socio-economic classification as:
small employers and own account workers (nssec=5) are 3.409 times more likely than the higher prof occupations, holding all other variables constant (the Sex variable), with a likely range (CI) of between 3.0 to 3.8.
the p-value of Large employers and higher managers (nssec=1) is > 0.05, so thre is no statistically significant relationship between large employers and higher managers and long-distance commuting.
Routine occupations (nssec=8) are 0.226 times (or 22.6%) less likely than the higher professional occupations, with the CI between 0.18 to 0.27. when other variable constant. Or, we can see being routine occupations decreases the probability of long-distance commuting by 77.4% (1-0.226).
Q4. Interpret the regression coefficients (i.e. Exp(B)) of variables “nssec=Lower managerial and professional occupations” and “nssec=Semi-routine occupation”.
Q5. Could you identify significant factors of commuting over long distances?
4.2.4 Prediction using fitted regression model
Relating to this week’s lecture, the log odds of the person who is will to long-distance commuting is equal to:
By using R, you can create the object you would like to predict. Here we created three person, see whether you can interpret their gender and socio-economic classification?
So let us look at these three people. The first one, for a male who classified as Semi-routine occupation in NSSEC, the probability of he travel over 60km to work is only 4.26%. For the second one, a female who is in Lower managerial and professional occupation, the probability of long-distance commuting is 8.11%. Now you know the prediction outcomes for our last person.
4.3Extension activities
The extension activities are designed to get yourself prepared for the Assignment 2 in progress. For this week, try whether you can:
Select a regression strategy and explain why a linear or logistic model is appropriate
Perform one or a series of regression models, including different combinations of your chosen independent variables to explain and/or predict your dependent variable
Answer for the model in Q3
In Q3, we we want to explore whether people with occupation being “Large employers and higher managers”, “Higher professional occupations” and “Routine occupations” are associated with higher probability of commuting over long distance when comparing to people in other occupation. So we create the variable New_nssec with 0 “Other occupations”, but still keep “1”, “2” and “8” still as original categories.
So we can first have a check of our new variable New_nssec:
table(sar_df$New_nssec)
0 1 2 8
24615 887 3055 4469
Then we set the reference categories: sex as 1 (male) and New_nssec as 0, which is “Other occupations”: