5  Lab: Logistic Regression

Author

Zi Ye

Published

October 14, 2025

Last week, we learnt how to use qualitative variables in a multiple linear regression model to understand the relationship between independent variables X1 … Xn and a dependent variable Y. Today we will learn to use logistic regression for a binary dependent variable. A logistic regression model is a type of regression analysis used when the dependent variable is binary (e.g., “success/failure” or “0/1”). It estimates the probability of one outcome relative to the other using a logistic function. This model is commonly used in situations such as predicting disease presence (yes/no) or customer churn (stay/leave). The independent variables can be continuous, categorical, or a mix of both. The model estimates coefficients on the log-odds scale; exponentiating them gives odds ratios, which show how each predictor affects the odds of the outcome.
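To see what the logistic function does, here is a minimal sketch in R (the values are illustrative only, not taken from any dataset): it maps any real-valued linear predictor onto a probability between 0 and 1.

# the logistic (inverse-logit) function: p = 1 / (1 + exp(-x))
logistic <- function(x) 1 / (1 + exp(-x))
logistic(-1.5)  # ~0.182
plogis(-1.5)    # R's built-in equivalent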

The lecture’s slides can be found here.

Learning objectives

An understanding of, and ability to:

  • implement a binary logistic regression model with R;

  • interpret the resulting estimates (coefficients, odds ratios and model fit).

The application context of a binomial logistic regression model is when the dependent variable under investigation is binary. Usually, a value of 1 for this dependent variable indicates the occurrence of an event, and 0 otherwise. For example, the dependent variable in this practical is whether a person is a long-distance commuter (1) or not (0).

In this week’s practical, we are going to apply logistic regression analysis in an attempt to answer the following research question:

RESEARCH QUESTION: Who is willing to commute long distances?

The practical is split into two main parts. The first focuses on implementing a binary logistic regression model in R. The second focuses on interpreting the resulting estimates.

5.1 Knowing the dataset and descriptive analysis

Prepare the data for implementing a logistic regression model. The datasets used in this practical are “sar_sample_label.csv” and “sar_sample_code.csv”. The SAR is a snapshot of census microdata, i.e. individual-level data. The sample has been drawn and anonymised from the census and is known as the Samples of Anonymised Records (SARs).

You may need to download the two datasets and the data dictionary from Canvas if you haven’t already. If you double-click to open both .csv files, you will find they have exactly the same column names: they are in fact the same spreadsheet. ‘sar_sample_label.csv’ is more readable because the cell contents are text labels, whereas the cells in ‘sar_sample_code.csv’ hold the corresponding codes. The ‘SAR_dictionary’ is created to help you understand what the columns mean, just like the FRS dictionary for the FRS datasets last week. When we run surveys, we use text as the options of multiple-choice questions for respondents to choose from, so the collected results naturally arrive in the ‘sar_sample_label’ format. The label format is more readable and well suited to descriptive analysis, as we will show in today’s content. However, the coded format is more concise for writing code and will therefore be used for the regression analysis. Please note that coding the labels into numbers does not mean the categorical/qualitative variable has been converted into a numeric/continuous one.

Let’s first read in both for the data overview.

if(!require("tidyverse"))
  install.packages("tidyverse",dependencies = T, repos = "https://cloud.r-project.org/")
if(!require("broom"))
  install.packages("broom",dependencies = T, repos = "https://cloud.r-project.org/")
if(!require("forcats"))
  install.packages("forcats")

library(tidyverse)
library(broom)
library(forcats)
sar_label <- read.csv("../data/SAR/sar_sample_label.csv")
sar_code <- read.csv("../data/SAR/sar_sample_code.csv")

Run the following code to view both dataframes:

#View the sar_label dataset
View(sar_label)
#View the sar_code dataset
View(sar_code)
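To convince yourself that the two files describe the same respondents, you can place the two versions of a column side by side (a quick check, which assumes the rows are in the same order in both files):

# compare the label and code versions of work_distance for the first rows
head(data.frame(label = sar_label$work_distance,
                code  = sar_code$work_distance))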

To get to know the dataset better, we can run dim() to see how many individuals (rows) are in this sample and how many variables (columns) have been recorded.

dim(sar_label)
[1] 50000    39
#or
dim(sar_code)
[1] 50000    39

So there are 50,000 respondents in the sample dataset and 39 variables are included.

After browsing both data sheets and the data dictionary (sometimes called metadata), you may notice that all the columns in the SAR dataset are of categorical/qualitative type. Therefore, if we run the summary() function on these dataframes, as we did in previous weeks for the Census data or the FRS dataset, the output may not be very useful. Again, the figures in sar_sample_code.csv are not actual numbers but codes for the different categories of the qualitative variables, so summary() statistics on them are meaningless. Reflect on what we learnt last week: when variables are of categorical/qualitative type, how do we do descriptive analysis on them?

summary(sar_code)
summary(sar_label)

So the answer is that we focus on summarising the frequency and distribution of categories within the variable, and the R function for doing this is table(). For example, the variable “work_distance” captures a person’s commuting distance. We can run table() on both the label and code versions to understand the distribution, composition and diversity of the categories of “work_distance” in the data. Try to repeat what we learnt in Week 3 to complete the descriptive analysis for the categorical variables in the SAR (with ggplot2).

table(sar_label$work_distance)

                                10 to <20 km 
                                        3650 
                                  2 to <5 km 
                                        4414 
                                20 to <40 km 
                                        2014 
                                 40 to <60km 
                                         572 
                                 5 to <10 km 
                                        4190 
                                60km or more 
                                         703 
                       Age<16 or not working 
                                       25975 
                                     At home 
                                        2427 
                              Less than 2 km 
                                        4028 
                              No fixed place 
                                        1943 
Work outside England and Wales but within UK 
                                          29 
                             Work outside UK 
                                          21 
  Works at offshore installation (within UK) 
                                          34 
table(sar_code$work_distance)

   -9     1     2     3     4     5     6     7     8     9    10    11    12 
25975  4028  4414  4190  3650  2014   572   703  2427  1943    29    21    34 
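Alongside raw counts, it is often useful to report each category’s share of the sample; prop.table() converts a frequency table into proportions. For example:

# convert counts into shares of the 50,000 respondents
prop.table(table(sar_label$work_distance)) %>% round(3)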

To create a chart as part of the descriptive analysis, exactly as we did in Week 3, we use the ggplot2 library and the following code:

library(ggplot2)
ggplot(sar_label, aes(x = work_distance)) + 
  geom_bar(fill = "rosybrown") + 
  labs(title = "Bar chart of work distance in SAR sample", x = "Work Distance", y = "Count") +
  coord_flip() + # flip the axes; comment this line out (add a # in front of it) to see why flipping helps here
  theme_bw() 

We have learnt that R needs to treat categorical/qualitative variables as the factor type. Therefore, if we want to create descriptive charts from sar_code, we first need to convert the code numbers from numeric to factor.

library(ggplot2)
ggplot(sar_code, aes(x = as.factor(work_distance))) + 
  geom_bar(fill = "rosybrown") + 
  labs(title = "Bar chart of work distance in SAR sample", x = "Work Distance", y = "Count") +
  coord_flip() + # flip the axes; comment this line out (add a # in front of it) to see why flipping helps here
  theme_bw() 

You may also realise that this chart is less readable as a descriptive chart for a report, because the axis shows codes rather than labels; the first chart, created from the labels, is a better choice. Based on ‘SAR_dictionary.xlsx’, we can see how the codes map to the categories of the variable work_distance in both files.

Codes and category labels for work_distance:
-9 Age<16 or not working
1 Less than 2 km
2 2 to <5 km
3 5 to <10 km
4 10 to <20 km
5 20 to <40 km
6 40 to <60 km
7 60km or more
8 At home
9 No fixed place
10 Work outside England and Wales but within UK
11 Work outside UK
12 Works at offshore installation (within UK)
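If you prefer to keep working with sar_code, you can attach these labels yourself with factor(). A sketch following the mapping above (the column name work_distance_label is our own choice):

# recode the numeric codes into readable labels, following the dictionary
sar_code$work_distance_label <- factor(
  sar_code$work_distance,
  levels = c(-9, 1:12),
  labels = c("Age<16 or not working", "Less than 2 km", "2 to <5 km",
             "5 to <10 km", "10 to <20 km", "20 to <40 km", "40 to <60 km",
             "60km or more", "At home", "No fixed place",
             "Work outside England and Wales but within UK",
             "Work outside UK", "Works at offshore installation (within UK)"))
table(sar_code$work_distance_label)  # should mirror table(sar_label$work_distance)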

There is a variety of categories in this variable; however, we are only interested in commuting distance, and therefore in people reporting a commuting distance. Thus, we keep the numeric codes of the variable ranging from 1 to 8.

As we are also interested in exploring whether people with different socio-economic statuses (or occupations) tend to be associated with varying probabilities of commuting over long distances, we further filter or select cases.

table(sar_label$nssec)

                              Child aged 0-15 
                                         9698 
                            Full-time student 
                                         3041 
              Higher professional occupations 
                                         3162 
                     Intermediate occupations 
                                         5288 
          Large employers and higher managers 
                                          909 
Lower managerial and professional occupations 
                                         8345 
  Lower supervisory and technical occupations 
                                         2924 
         Never worked or long-term unemployed 
                                         2261 
                          Routine occupations 
                                         4660 
                     Semi-routine occupations 
                                         5893 
      Small employers and own account workers 
                                         3819 
table(sar_code$nssec)

   1    2    3    4    5    6    7    8    9   10   12 
 909 3162 8345 5288 3819 2924 5893 4660 2261 3041 9698 
ggplot(sar_label, aes(x = fct_rev(fct_infreq(nssec)))) + 
  geom_bar(fill = "tan") + 
  labs(title = "Bar chart of NSSEC in SAR sample", x = "NSSEC", y = "Count") +
  coord_flip() + # flip the axes; comment this line out (add a # in front of it) to see why flipping helps here
  theme_bw() 

From ‘SAR_dictionary.xlsx’, we can see which category each code indicates for the variable nssec. For the following regression model, we keep people who reported an occupation and delete cases with numeric codes from 9 to 12: the unemployed, full-time students, children and those not classifiable.

Codes and category labels for nssec:
1 Large employers and higher managers
2 Higher professional occupations
3 Lower managerial and professional occupations
4 Intermediate occupations
5 Small employers and own account workers
6 Lower supervisory and technical occupations
7 Semi-routine occupations
8 Routine occupations
9 Never worked or long-term unemployed
10 Full-time student
11 Not classifiable
12 Child aged 0-15

Now, just as in last week’s practical, we use filter() to prepare our dataframe. You may already have realised that using sar_code makes the filtering easier.

sar_df <- sar_code %>% filter(work_distance<=8 & nssec <=8 )
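It is good practice to sanity-check the result of a filter before moving on; for instance (the row count can be cross-checked later against the model’s degrees of freedom):

dim(sar_df)  # should report 33,026 rows and 39 columns
# only work_distance codes 1-8 and nssec codes 1-8 should remain
table(sar_df$work_distance)
table(sar_df$nssec)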

Q1. Create a descriptive analysis of the three variables “work_distance”, “nssec” and “sex” with the new data, including (1) summarising the frequencies and (2) creating bar charts for the three variables.

5.2 Preparing the input variables

For a logistic regression model, we first need to recode the “work_distance” variable into a binary variable to serve as our dependent variable.

A simple way to create a binary dependent variable representing long-distance commuting is to use the mutate() function, as discussed in last week’s practical session. Before creating the binary variable from “work_distance”, we need to define what counts as a long-distance commute. Such definitions can vary; here we define a long-distance commute as any commute over a distance of 60km (the category “60km or more”).

sar_df <- sar_df %>% mutate(
  New_work_distance = if_else(work_distance >6, 1,0))

Q2. Do a descriptive analysis of the new sar_df dataframe with the new column New_work_distance, using the code you have learnt.

The independent variables are gender and socio-economic status. The gender variable sex is categorical. Therefore, as we learnt last week, before adding categorical variables to a regression model we first need to make them factors and then identify the reference category. For gender, we use male as the baseline. Prepare your “sex” variable before running the regression model:

sar_df$sex <- relevel(as.factor(sar_df$sex),ref="1")

Then we prepare the socio-economic status variable nssec for the regression model. We are interested in whether people in “Higher professional occupations” are associated with a lower probability of commuting over long distances compared with people in other occupations. Therefore, using the “Higher professional occupations” code 2, we run the code below:

sar_df$nssec <- relevel(as.factor(sar_df$nssec), ref = "2")
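You can confirm that the releveling worked: levels() lists a factor’s categories, and glm() treats the first level as the reference category.

# the reference category is listed first
levels(sar_df$sex)    # "1" (male) should now come first
levels(sar_df$nssec)  # "2" (Higher professional occupations) should now come first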

5.3 Implementing a logistic regression model

The binary dependent variable is long-distance commuting, variable name New_work_distance.

#create the model
m.glm = glm(New_work_distance~sex + nssec, 
            data = sar_df, 
            family= "binomial")
# inspect the results
summary(m.glm) 

Call:
glm(formula = New_work_distance ~ sex + nssec, family = "binomial", 
    data = sar_df)

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept) -1.67337    0.05329 -31.401  < 2e-16 ***
sex2        -0.36678    0.04196  -8.742  < 2e-16 ***
nssec1      -0.12881    0.11306  -1.139    0.255    
nssec3      -0.38761    0.06467  -5.994 2.05e-09 ***
nssec4      -1.03079    0.08439 -12.214  < 2e-16 ***
nssec5       1.22639    0.06489  18.898  < 2e-16 ***
nssec6      -1.38992    0.10919 -12.730  < 2e-16 ***
nssec7      -1.43909    0.09002 -15.986  < 2e-16 ***
nssec8      -1.48534    0.09646 -15.398  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 20441  on 33025  degrees of freedom
Residual deviance: 17968  on 33017  degrees of freedom
AIC: 17986

Number of Fisher Scoring iterations: 6
# odds ratios
exp(coef(m.glm)) 
(Intercept)        sex2      nssec1      nssec3      nssec4      nssec5 
  0.1876138   0.6929649   0.8791416   0.6786766   0.3567267   3.4088847 
     nssec6      nssec7      nssec8 
  0.2490946   0.2371432   0.2264258 
# confidence intervals
exp(confint(m.glm, level = 0.95)) 
Waiting for profiling to be done...
                2.5 %    97.5 %
(Intercept) 0.1688060 0.2080319
sex2        0.6381810 0.7522773
nssec1      0.7017990 1.0935602
nssec3      0.5981911 0.7708192
nssec4      0.3020431 0.4205270
nssec5      3.0037298 3.8739884
nssec6      0.2002766 0.3073830
nssec7      0.1984396 0.2824629
nssec8      0.1869397 0.2729172

5.3.1 Interpreting estimated regression coefficients

Compared with statistical inference in a multiple linear regression context (as you did in Weeks 2 and 3), the interpretation of coefficients (B) and odds ratios (Exp(B)) in a logistic regression model has both similarities and differences:

It is the same in that we read the p-values of the regression coefficients to assess their significance; for instance, by comparing p-values to the conventional significance level of 0.05:

·       If the p-value of a coefficient is smaller than 0.05, the coefficient is statistically significant. In this case, you can say that the relationship between an independent variable and the outcome variable is statistically significant.

·       If the p-value of a coefficient is larger than 0.05, the coefficient is statistically insignificant. In this case, you can say or conclude that there is no statistically significant association or relationship between an independent variable and the outcome variable.

It is different in how we interpret the regression coefficients (B): we read the odds ratios (Exp(B)) from exp(coef(m.glm)), which describe multiplicative effects on the odds of the outcome. A worked conversion from coefficient to odds ratio to probability is given after the list below.

o   For the variable sex, the negative coefficient and the odds ratio estimate indicate that the odds of commuting over long distances for females are 0.693 times those of males (the reference group), with a confidence interval (CI), or likely range, of 0.64 to 0.75, holding all other variables constant (the socio-economic classification variable). Put differently, being female reduces the odds of long-distance commuting by 30.7% (1 - 0.693).

o   For the variable nssec, the signs of the coefficients and the odds ratio estimates indicate how the odds of long-distance commuting differ across socio-economic classifications:

  • the p-value for Large employers and higher managers (nssec=1) is > 0.05, so there is no statistically significant difference in long-distance commuting between this group and the reference category;

  • Lower managerial and professional occupations (nssec=3) have 0.679 times the odds of long-distance commuting compared with our reference category (nssec=2, higher professional occupations), holding all other variables constant (the sex variable), with a likely range (CI) of 0.60 to 0.77. Therefore we can also say that workers in lower managerial and professional occupations have 32.1% (1 - 0.679) lower odds than higher professional workers of travelling more than 60km for work, gender being equal.

  • Small employers and own account workers (nssec=5) have odds 3.409 times those of the reference category (nssec=2, higher professional occupations), holding the gender variable constant, with a likely range (CI) of 3.00 to 3.87.

  • Compared with higher professional occupations (nssec=2), workers in routine occupations (nssec=8) have 0.226 times the odds of travelling more than 60km to work, with a CI of 0.19 to 0.27, holding the other variable constant. In other words, being in a routine occupation decreases the odds of long-distance commuting by 77.4% (1 - 0.226).
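To make the link between a coefficient, its odds ratio and a probability concrete, here is a short worked conversion using the sex coefficient from the output above (all values are taken from the reported estimates):

# coefficient for sex2 (female) on the log-odds scale, from summary(m.glm)
b_sex <- -0.36678
exp(b_sex)  # ~0.693, the odds ratio reported by exp(coef(m.glm))

# odds ratios multiply odds, not probabilities. For a male in the reference
# occupation (nssec = 2), the exponentiated intercept gives the odds:
odds_male <- exp(-1.67337)       # ~0.188
odds_male / (1 + odds_male)      # probability of long-distance commuting, ~0.158

# the corresponding odds and probability for a female in the same occupation:
odds_female <- odds_male * exp(b_sex)
odds_female / (1 + odds_female)  # ~0.115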

Q3. Can you write down the findings you learnt from the model outcomes to other occupation groups (nssec = 4, nssec = 6 and nssec =7)?

5.3.2 Model fit

We use the R library pscl to calculate measures of fit.

if(!require("pscl"))
  install.packages("pscl", repos = "https://cloud.r-project.org/")
library(pscl)

Relating back to this week’s lecture notes, what is the Pseudo R2 of the fitted logistic model (from the Model Summary table below)?

# Pseudo R-squared
pR2(m.glm)
fitting null model for pseudo-r2
          llh       llhNull            G2      McFadden          r2ML 
-8.983928e+03 -1.022037e+04  2.472890e+03  1.209785e-01  7.214246e-02 
         r2CU 
 1.563288e-01 
# or in better format
pR2(m.glm) %>% round(4) %>% tidy()
fitting null model for pseudo-r2
# A tibble: 6 × 2
  names              x
  <chr>          <dbl>
1 llh       -8984.    
2 llhNull  -10220.    
3 G2         2473.    
4 McFadden      0.121 
5 r2ML          0.0721
6 r2CU          0.156 
  • llh: The log-likelihood of the fitted model.

  • llhNull: The log-likelihood of the null model (without predictors).

  • G2: The likelihood ratio statistic, showing the model’s improvement over the null model.

  • McFadden: McFadden’s pseudo R-squared (a common measure of model fit).

  • r2ML: maximum likelihood pseudo R-squared (Cox & Snell).

  • r2CU: Cragg and Uhler’s (Nagelkerke) pseudo R-squared.

This differs from multiple linear regression, where R-squared indicates the percentage of variance in the dependent variable explained by the independent variables. In a logistic regression model, that R-squared is not directly applicable; instead we use pseudo R-squared measures, such as McFadden’s or Cox & Snell’s, to give an indication of model fit. For an individual-level dataset like the SAR, a value around 0.3 is considered a good fit. Our model is therefore not especially robust for prediction, but it still describes the association between the variables and the outcome. To improve the model we may need more independent variables; you can identify candidates from the related literature or debates on this topic.
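You can reproduce McFadden’s pseudo R-squared by hand from the log-likelihoods reported above, which makes clear what the measure is: the proportional improvement in log-likelihood over the null (intercept-only) model. A quick check:

# McFadden's pseudo R-squared: 1 - llh(fitted model) / llh(null model)
ll_model <- as.numeric(logLik(m.glm))
ll_null  <- as.numeric(logLik(update(m.glm, . ~ 1)))  # refit with intercept only
1 - ll_model / ll_null  # ~0.121, matching the pR2() output above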

5.3.3 Recode Socio-economic status variable and explore commuting differences

This time, we want to know whether “Lower supervisory and technical occupations”, “Semi-routine occupations” and “Routine occupations” are associated with a higher probability of commuting over long distances compared with people in other occupations.

We can use mutate() to create a new column, keeping the values of “Lower supervisory and technical occupations”, “Semi-routine occupations” and “Routine occupations” as they are, while recoding the rest as “Other occupations”. Using the SAR in code format, fct_other() makes this easy:

sar_df <- sar_df %>% mutate(New_nssec = fct_other(nssec, keep = c("6", "7", "8"), other_level = "0"))

Or we can achieve the same result with if_else() and %in%. %in% is an operator used to test whether elements of one vector are present in another; it returns TRUE for elements found and FALSE otherwise (a tiny illustration follows the code below). Because nssec is now a factor, we compare against the character codes and wrap the false branch in as.character() so that both branches of if_else() return the same type.

sar_df <- sar_df %>% mutate(New_nssec = if_else(!nssec %in% c("6", "7", "8"), "0", as.character(nssec)))
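A tiny illustration of %in% with made-up values:

c("3", "6", "7") %in% c("6", "7", "8")  # returns FALSE TRUE TRUE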

Use “Other occupations” (code: 0) as the reference category via relevel(as.factor()) and then fit the regression model: glm(New_work_distance~sex + New_nssec, data = sar_df, family= "binomial"). Can you now run the model by yourself? A sketch follows below; see also the answers at the end of the practical.
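If you want to check your attempt, here is a sketch of the steps just described (the object name m.glm2 is our own choice):

# use "Other occupations" (code 0) as the reference category
sar_df$New_nssec <- relevel(as.factor(sar_df$New_nssec), ref = "0")

# fit the logistic regression with the recoded occupation variable
m.glm2 <- glm(New_work_distance ~ sex + New_nssec,
              data = sar_df,
              family = "binomial")
summary(m.glm2)
exp(coef(m.glm2))  # odds ratios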

Q4. If we want to explore whether people who are independent employers show a lower probability of commuting longer than 60km compared with other occupations, how will we prepare the input independent variables and what will the specified regression model be?

5.3.4 Prediction using fitted regression model

Relating to this week’s lecture, the log odds that a person commutes over a long distance are given by the fitted coefficients on the log-odds scale (from summary(m.glm) above):

Log odds of long-distance commuting = -1.673 - 0.367 * sexFemale - 0.129 * nssec1 - 0.388 * nssec3 - 1.031 * nssec4 + 1.226 * nssec5 - 1.390 * nssec6 - 1.439 * nssec7 - 1.485 * nssec8

where sexFemale and each nssec term are 0/1 indicators for the corresponding category. Note that the nssec1 coefficient is not statistically significant, so we should be cautious about using the model to predict for individuals in the nssec=1 occupation group.
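To turn a log-odds value into a probability, apply the inverse logit (R’s plogis()). For example, for a male (sex = 1) in a semi-routine occupation (nssec = 7):

# log odds for a male semi-routine worker: intercept + nssec7 coefficient
log_odds <- -1.67337 + (-1.43909)
# the inverse logit converts log odds into a probability
plogis(log_odds)  # ~0.0426, matching the predict() output below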

Using R, you can create the observations you would like to predict for. Here we create three people: can you interpret their gender and socio-economic classification?

objs <- data.frame(sex=c("1","2","1"),nssec=c("7","3","5"))

Then we can predict by using our model m.glm:

predict(m.glm, objs,type = "response")
         1          2          3 
0.04259618 0.08108050 0.39007797 

So let us look at these three people. The first is a male classified in a semi-routine occupation in the NSSEC; the probability that he travels over 60km to work is only 4.26%. The second is a female in a lower managerial and professional occupation; her probability of long-distance commuting is 8.11%. Now you can work out the prediction for our last person. A reminder: the model fit statistics showed the model is not very robust, so the predictions should not be treated as very solid either.

5.4 Extension activities

The extension activities are designed to prepare you for Assignment 2. For this week, see whether you can:

  • Select a regression strategy and explain why a linear or logistic model is appropriate

  • Perform one or a series of regression models, including different combinations of your chosen independent variables to explain and/or predict your dependent variable

5.5 Answers for Qs

Answer for Q1

For Q1, we already have the code for work_distance and nssec, so the only missing variable is sex:

To summarise the frequency of each category, we use:

table(sar_label$sex)

Female   Male 
 25677  24323 

To create bar charts for the distribution and composition of the gender variable in the SAR sample, we can run:

ggplot(sar_label, aes(x = sex)) +
  geom_bar(fill = "gold", width = 0.5) +
  labs(title = "Bar chart of gender in SAR sample", x = "Gender", y = "Count") + # set the text info
  theme_classic() # choose a theme; try theme_bw() or theme_minimal() to see the differences

Answer for Q2

Similarly, the descriptive analysis for the newly created variable New_work_distance in sar_df should include:

table(sar_df$New_work_distance)

    0     1 
29954  3072 

Here the categories of New_work_distance are numeric 0/1 values (because we created them with sar_df <- sar_df %>% mutate(New_work_distance = if_else(work_distance > 6, 1, 0))), so in the chart code we use as.factor() to convert the variable to factor type.

ggplot(sar_df, aes(x = as.factor(New_work_distance))) +
  geom_bar(fill = "forestgreen", width = 0.5) +
  labs(title = "Bar chart of New_work_distance in SAR sample", x = "Long-distance commuter (0 = no, 1 = yes)", y = "Count") + # set the text info
  theme_classic() # choose a theme; try theme_bw() or theme_minimal() to see the differences

Now we can use the frequency table and the bar chart to report that, by our definition of long-distance travel to work (over 60 km), there are 3,072 long-distance commuters in the SAR sample, which makes up 6.14% (3072/50000) of the full sample, or 9.3% (3072/33026) of the filtered cases in sar_df.

Answer for the model in Q4

In Q4, we want to explore whether people whose occupation is independent employer show a lower probability of commuting over long distances compared with people in other occupations. So we set nssec = 5 (Small employers and own account workers) as the baseline category.

table(sar_df$New_nssec)

    0     6     7     8 
20063  2790  5704  4469 

Then we set the reference categories: sex to 1 (male) and nssec to 5, which is ‘Small employers and own account workers’:

sar_df$sex <- relevel(as.factor(sar_df$sex),ref="1")
sar_df$nssec <- relevel(as.factor(sar_df$nssec),ref="5")

Now, we build the logistic regression model and check out the outcomes:

model_new = glm(New_work_distance~sex + nssec, data = sar_df, family= "binomial")

summary(model_new)

Call:
glm(formula = New_work_distance ~ sex + nssec, family = "binomial", 
    data = sar_df)

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept) -0.44698    0.04138 -10.801   <2e-16 ***
sex2        -0.36678    0.04196  -8.742   <2e-16 ***
nssec2      -1.22639    0.06489 -18.898   <2e-16 ***
nssec1      -1.35519    0.10782 -12.569   <2e-16 ***
nssec3      -1.61400    0.05482 -29.442   <2e-16 ***
nssec4      -2.25717    0.07696 -29.329   <2e-16 ***
nssec6      -2.61631    0.10377 -25.212   <2e-16 ***
nssec7      -2.66548    0.08317 -32.047   <2e-16 ***
nssec8      -2.71172    0.09021 -30.059   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 20441  on 33025  degrees of freedom
Residual deviance: 17968  on 33017  degrees of freedom
AIC: 17986

Number of Fisher Scoring iterations: 6

For the model interpretation, we need:

# odds ratios
exp(coef(model_new)) 
(Intercept)        sex2      nssec2      nssec1      nssec3      nssec4 
 0.63955383  0.69296493  0.29335107  0.25789714  0.19909052  0.10464615 
     nssec6      nssec7      nssec8 
 0.07307217  0.06956621  0.06642224 
# confidence intervals
exp(confint(model_new, level = 0.95)) 
Waiting for profiling to be done...
                 2.5 %     97.5 %
(Intercept) 0.58959085 0.69343772
sex2        0.63818099 0.75227729
nssec2      0.25813190 0.33291942
nssec1      0.20787318 0.31733101
nssec3      0.17876868 0.22162844
nssec4      0.08984430 0.12149367
nssec6      0.05933796 0.08915692
nssec7      0.05895858 0.08169756
nssec8      0.05547661 0.07902917

And don’t forget to measure the model fit:

# model fit
pR2(model_new) %>% round(4) %>% tidy()
fitting null model for pseudo-r2
Warning in tidy.numeric(.): 'tidy.numeric' is deprecated.
See help("Deprecated")
# A tibble: 6 × 2
  names              x
  <chr>          <dbl>
1 llh       -8984.    
2 llhNull  -10220.    
3 G2         2473.    
4 McFadden      0.121 
5 r2ML          0.0721
6 r2CU          0.156 

Q5. What conclusions can you now draw when comparing the middle NSSEC groups with the higher groups?