3  Lab: Introductory Statistics - Happiness around the world

Author

Zi Ye

Published

February 19, 2026

3.1 Overview

We have used Census data for the past two weeks. Census data aim to collect detailed information about every person and household in the UK, particularly on personal characteristics, household and housing, work and education, health, migration, and so on. The aim of such a collection is to support the planning of public services and infrastructure.

Unlike the census (which counts the entire population), in many cases we need to use surveys to obtain timely, cost-effective and often more specific, in-depth information from people. Therefore, this week’s practical session will draw upon two survey datasets:

  • UK Family Resource Survey

  • World Value Survey

We will use these survey datasets to investigate the sample more thoroughly and employ statistical approaches to determine its representativeness of the population. We will use both numerical and categorical variables to calculate sample means, Standard Errors and Confidence Intervals for different sample sizes.

You may wish to recap this week’s lecture: Lecture 03

3.2 Prepare your working environment

  • For this Week 3 session, create a sub-folder called Week3 in your ENVS162 folder on your M-Drive. This is exactly the same step we did in Weeks 1 and 2, and we will repeat it every week up to Week 5.

  • Download this week’s practical datasets from Canvas Week 3: family_resource_survey.csv, world_value_suvey.csv and world_map.geojson. Save all three datasets in the Week3 folder you just created.

  • Open RStudio

  • Open a new R Script for your Week 3 work, rename it as Week3.R and save it in your newly created Week 3 folder, under M drive -> ENVS162 folder.

  • Check whether there are any leftover dataframes from previous sessions in the Environment pane (upper right of RStudio). You can clear everything by clicking the brush icon in that pane, or by running the code below:

rm(list = ls())

This command and also the brush icon can both clear RStudio’s memory, removing any data objects that you have previously created.

3.3 Load libraries and get familiar with the survey data

Exactly as in Weeks 1 and 2, before we do anything in R we first need to load the essential libraries, as the functions/commands we use are packaged in these libraries. This week we will still rely on tidyverse, tmap and sf.

library(tidyverse)
library(sf)
library(tmap)

Recall that the data can be loaded into RStudio using the read.csv function:

# use read.csv to load a CSV file
df <- read.csv("family_resource_survey.csv",stringsAsFactors = TRUE)

The stringsAsFactors = TRUE parameter tells R to read any character or text variables as factors (categories) rather than as plain text.

Recall what we should do to familiarise ourselves with a dataset: we need to know how many rows and columns it has, what the variables are and their types, and we may want to view the dataset for a quick scan.

Therefore, we need these functions: dim() or ncol() and nrow(), names(), str(), and View().

# know how many rows and columns
dim(df)

# know the names of the variables
names(df)

# know types and examples of these variables
str(df)

# open a view window to scan the dataset
View(df)

Question 1. How many people have been included in this survey?

In Week 2, we also learnt a simple way to get a quick descriptive summary of all the variables:

# summary all variables
summary(df)

Okay, once you feel familiar with the dataset you are going to work with, let’s move on.

3.4 Sampling of numerical variables

As discussed in the lecture, the Standard Error (SE) of the sample mean indicates how much the estimated mean is expected to vary from sample to sample. Together, the sample mean and SE help us assess how accurately our sample reflects the true population mean. A smaller SE indicates our sample estimate is likely closer to the actual population value.

From the steps of familiarising yourself with the dataset, you should already know that there are 33,847 people included in this dataset. Recall from the lecture that these 33,847 people were selected by some sampling method - random, stratified or multi-stage - but in any case they are a sample drawn from a much larger population.

Here, in this practical, let’s assume that our survey data itself forms a population. Under this assumption, our population is the people who completed the survey; in other words, in this section we are pretending that the survey respondents are the population.

We can calculate the mean and standard deviation of hh_income_net for this population using the functions mean() and sd(). In Week 2, we learnt that the mean tells you the centre of a numerical variable, and the standard deviation reveals how spread out the values are around the mean.

# calculate the mean value of the variable
mean(df$hh_income_net)
[1] 34485.51
# calculate the standard deviation of the variable
sd(df$hh_income_net)
[1] 29379.1

You can examine the distribution of hh_income_net by creating a histogram using the code below (you should be comfortable with this from Week 2):

# plot a histogram for the variable income_net in df
hist(df$hh_income_net, breaks  = 100, 
    main = "Histogram of a population", 
    xlab = "Household net income", 
    col = "wheat",
    border = "grey")

3.4.1 Sample mean and Standard Error

In the example below, we will work with the hh_income_net variable and an initial sample of 10 people (observations).

You can create a sample of 10 observations (survey respondents) taken from the population at random using the slice_sample function.

# Create a sample of the population
sample_10  <- slice_sample(df, n=10)
# sample mean value
mean(sample_10$hh_income_net)
[1] 52234
# sample standard deviation
sd(sample_10$hh_income_net)
[1] 38457.84

You will get different values from the ones your friends get (and the ones shown below), because you will each have extracted a different sample from the population - this is random sampling!
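If you want your results to be reproducible when you re-run the code (or want to compare like-for-like with a friend), you can fix the random seed before sampling. This is a minimal sketch; the seed value 42 is arbitrary:

# fix the random number generator so the same sample is drawn every time
set.seed(42)
sample_10  <- slice_sample(df, n=10)
mean(sample_10$hh_income_net)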

You should calculate the Standard Error (SE) of the sample mean as SE = s / √n, where s is the sample standard deviation and n is the sample size.

In R we run the code below:

# sample SE
sd(sample_10$hh_income_net)/sqrt(length(sample_10$hh_income_net))
[1] 12161.44
# or
sd(sample_10$hh_income_net)/sqrt(nrow(sample_10))
[1] 12161.44

Question 2. Compare the population mean, the sample mean and the SE of the mean from your results. What do you notice?

Now, to see how using a larger sample improves the estimate, repeat the previous example (generating a sample and determining the sample mean and SE), but using different sample sizes: 10, 50, 100 and 200.

Complete the below table for the comparison:

Random sample size   Sample mean   Standard Error (SE) of the mean
10
50
100
200
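As a starting point, here is a minimal sketch for one row of the table (n = 50); change n to fill in the remaining rows:

# draw a sample of 50 respondents
sample_50 <- slice_sample(df, n = 50)
# sample mean
mean(sample_50$hh_income_net)
# Standard Error of the sample mean
sd(sample_50$hh_income_net)/sqrt(nrow(sample_50))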

The above should have given you the general idea that as sample size increases, the sample estimate of a population mean becomes more reliable. To show this clearly, the code below generates a trend line plot showing the impact of sample size on Standard Errors. As the sample gets larger the SE gets smaller as the sample more closely represents the true population.

# create a vector of sample sizes
X = seq(10, 200, 10)
# check the result
X
 [1]  10  20  30  40  50  60  70  80  90 100 110 120 130 140 150 160 170 180 190
[20] 200
# create a vector of sample errors from these
# an empty vector to be populated in the for loop below
SE = vector()
# now loop through each value in X
for (i in X){
  # create a sample for the ith value in X
  # set the seed - see the Info box below
  set.seed(12)
  sample.i = slice_sample(df, n=i)
  # calculate the SE
  se.i = sd(sample.i$hh_income_net)/sqrt(length(sample.i$hh_income_net))
  # add to the SE vector
  SE = append(SE, se.i)
}
# check the result
SE
 [1] 11545.970  6139.880  4638.476  3893.327  3353.118  3067.783  2687.440
 [8]  2391.198  2261.770  2103.450  2114.780  1979.338  1915.625  1882.336
[15]  1979.165  1878.847  1810.918  1725.824  1776.093  1705.949

You can then plot these:

# plot the SEs
plot(x = X, y = SE,
     pch = 19, col = "darkgreen",
     xlab = "Sample Size", ylab = "Standard Error", 
     main = "Impact of sample size on Standard Error")
lines(x = X, y = SE)

So, as sample size increases, the sample estimate of a population mean becomes more reliable. However, you should note that this was based on generating a single sample at each size and inspecting the SE. In the next section you will apply confidence intervals to the sample in order to assess its robustness.

3.4.2 Confidence Intervals

The idea of a confidence interval, CI, is a natural extension of the standard error. It allows us to attach a level of confidence to the population parameter estimate gleaned from a sample. For a 95% confidence level, the interval is: sample mean ± 1.96 × SE.

We can use the qnorm function. qnorm(p) gives the value on the normal distribution such that the cumulative probability up to that value equals p. It is used here to calculate the error margin around the sample mean, under the assumption that the population is normally distributed (hence the norm part of qnorm):
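If you want to see the two cut-off points themselves, you can run qnorm directly; the values are approximately ±1.96:

qnorm(0.975)  # upper cut-off, about 1.96
qnorm(0.025)  # lower cut-off, about -1.96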

# you have already created the sample with
# sample_10  <- slice_sample(df, n=10)
m <- mean(sample_10$hh_income_net)
m
std <- sd(sample_10$hh_income_net)
n <- length(sample_10$hh_income_net)

error <- qnorm(0.975)*std/sqrt(n)

lower.bound <- m-error
upper.bound <- m+error
# upper and lower bound
upper.bound
lower.bound

This is the 95% confidence interval: the remaining 5% is split between the two tails, 2.5% in the lower tail and 2.5% in the upper tail. This is why the code uses qnorm(0.975), as that gives the cut-off point for the upper tail. Again, your values may be slightly different because of random sampling. In my case, the mean household net income is £42,110, and we are 95% confident that the mean household net income lies between £22,842 and £61,376. This is a very imprecise estimate, because we only used 10 respondents from the survey; with such a small sample size the CI is very wide.

Let’s change the above code to a sample size of 1000:

# change sample size from 10 to 1000 this time
sample_1k  <- slice_sample(df, n=1000)
m <- mean(sample_1k$hh_income_net)
m
std <- sd(sample_1k$hh_income_net)
n <- length(sample_1k$hh_income_net)

error <- qnorm(0.975)*std/sqrt(n)

lower.bound <- m-error
upper.bound <- m+error

lower.bound
upper.bound

Now you can see that the sample mean has changed to 36,680, which indicates that the estimated mean household net income is £36,680, and we are 95% confident that the true mean lies between £33,934 and £39,427. With a sample size of 1,000, the CI is much narrower and the estimate is more reliable.

You may try out what happens if we use all the respondents as the sample. In this case we don’t need slice_sample(), and n should be n = nrow(df).
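A minimal sketch of this full-sample version is given below; it follows exactly the same steps as above, just without slice_sample():

# use every respondent in df as the sample
m <- mean(df$hh_income_net)
std <- sd(df$hh_income_net)
n <- nrow(df)

error <- qnorm(0.975)*std/sqrt(n)

# lower and upper bound of the 95% CI
m - error
m + error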

3.5 Sampling of categorical variable

Now we switch our focus to the categorical variable happiness in the dataframe. As we did last week, we first want a descriptive summary of the categories and their frequencies. We can use table() here:

# categories and frequencies
table(df$happiness)

  Fairly happy Fairly unhappy     Very happy   Very unhappy 
          8346           1552          15548           1079 

You may also remember how to create a table to display the frequency and proportion of each category:

# use df to create a table, including frequency and percentage of each category
df %>%
  count(happiness) %>%
  mutate(pct = round(n / sum(n) * 100,1))
       happiness     n  pct
1   Fairly happy  8346 31.5
2 Fairly unhappy  1552  5.9
3     Very happy 15548 58.6
4   Very unhappy  1079  4.1

The pct column in the table is the percentage of respondents in our Family Resource Survey sample who reported each level of happiness. In other words, these are the sample proportions for the variable happiness in this dataset.

Meanwhile, recall from the Week 2 lecture that there are two types of categorical variables: nominal and ordinal. The key difference between them is whether the categories have a natural ordering. In this case, the happiness variable is ordinal, because its categories follow a meaningful order: Very unhappy, Fairly unhappy, Fairly happy, and Very happy.

Therefore, as we wish to display the variable with an order, we use R function factor() to do so:

#set the ordering of categories
df$happiness <- factor(
  df$happiness,
  levels = c("Very unhappy", 
             "Fairly unhappy", 
             "Fairly happy", 
             "Very happy"),
  ordered = TRUE
) 

If we now re-run the code to generate the frequency table and proportions, you will notice that the categories appear in a more logical order, reflecting the ordinal structure of the variable.

This time, we label the proportion column as p̂ (p-hat) to emphasise that these values represent the sample proportions for each happiness category.

df %>%
  count(happiness) %>%
  mutate(p_hat = round(n / sum(n) * 100,1))
       happiness     n p_hat
1   Very unhappy  1079   4.1
2 Fairly unhappy  1552   5.9
3   Fairly happy  8346  31.5
4     Very happy 15548  58.6

We use barplot() from last week to present the happiness variable distribution in the Family Resource Survey:

#use bar plot to visualise the distribution of categorical variable
tab=table(df$happiness)
barplot(tab)

Question 3. What pattern can you identify from the barplot about happiness in this sample?

3.5.1 Sample proportion and Standard Error of the proportion

In this section we will focus on the Very happy category. From the earlier work we can see that its sample proportion p_hat is 58.6%. This can also be calculated with mean():

#calculate the sample proportion
p_hat = mean(df$happiness == "Very happy")
p_hat
[1] 0.586164

Now let’s calculate the Standard Error. Recall the equation of the Standard Error for a proportion: SE_p = √( p̂ × (1 − p̂) / n ).

So we need the sample size n first, and then use n and p_hat to calculate SE_p. If we take all the respondents in the dataset as the sample, then n is equal to nrow(df).

#calculate Standard Error
n = nrow(df)
n
[1] 26525
SE_p <- sqrt(p_hat * (1 - p_hat) / n)
SE_p
[1] 0.003024099

Therefore, the interpretation is that the “Very happy” sample proportion typically varies by about 0.003, or 0.30 percentage points, from sample to sample.

We can also vary n from 10, 100 and 1000 up to the full sample, plot SE_p, and see how increasing the sample size reduces the Standard Error of the proportion:

X = c(10, 100, 1000, 2000, 5000, 10000, n)
SE_p = vector()
# now loop though each value in X
for (i in X){
  # create a sample for the ith value in X
  # set the seed - see the Info box below
  set.seed(20)
  sample.i = slice_sample(df, n=i)
  # calculate the sample mean of proportion
  p_hat = mean(sample.i$happiness == "Very happy")
  # calculate the SE of proportion
  se.i = sqrt(p_hat * (1 - p_hat) / i)
  # add to the SE vector
  SE_p = append(SE_p, se.i)
}
# check the result
SE_p
[1] 0.158113883 0.049355851 0.015409607 0.010928626 0.006966181 0.004908920
[7] 0.003024099

We can then plot these:

# plot the SEs
plot(x = X, y = SE_p,
     pch = 19, col = "coral",
     xlab = "Sample Size", ylab = "Standard Error of proportion", 
     main = "Impact of sample size on Standard Error of the proportion Very happy")
lines(x = X, y = SE_p)

Again, we can see that as the sample size n increases, the Standard Error decreases.

3.5.2 Confidence Intervals for a proportion (95%)

We now calculate a 95% confidence interval for the proportion of respondents who reported being Very happy in our sample.

To do this, we use the full sample size n. Recall that the 95% CI for a proportion is: p̂ ± 1.96 × SE_p.

The critical value is again qnorm(0.975), as for the numerical variable; if you run qnorm(0.975) you will find it is about 1.96. The R code below calculates the CI for the sample proportion when the sample size is n.

#CI windows
p_hat = mean(df$happiness == "Very happy")
p_hat
[1] 0.586164
n <- nrow(df)
SE_p <- sqrt(p_hat * (1 - p_hat) / n)
error_p <- qnorm(0.975) * SE_p
lower.bound <- p_hat - error_p
upper.bound <- p_hat + error_p
lower.bound
[1] 0.5802369
upper.bound
[1] 0.5920911

Therefore, the confidence interval can be interpreted as follows: in the current UK Family Resources Survey, the sample proportion of respondents reporting being Very happy is 58.6%. The 95% confidence interval indicates that we are 95% confident that the true population proportion lies between 58.0% and 59.2%.

3.6 Compare to World Value Survey

We know that when the sample changes, the survey results may also differ. Now we turn to a different dataset — the World Values Survey — and repeat the same analytical process to examine how the happiness results compare.

3.6.1 Happiness in different survey sample

As usual, we first load the World Value Survey data:

# use read.csv to load a CSV file
dat <- read.csv("world_value_suvey.csv",stringsAsFactors = TRUE)

Again, we wish to treat the happiness variable with its natural ordering:

#set ordering
dat$happiness <- factor(
  dat$happiness,
  levels = c("Very unhappy", 
             "Fairly unhappy", 
             "Fairly happy", 
             "Very happy"),
  ordered = TRUE
) 

We also want to have a descriptive summary of the count in each category and compute the percentage distribution:

# descriptive summary table
dat %>%
  count(happiness) %>%
  mutate(p_hat2 = round(n / sum(n) * 100,1))
       happiness     n p_hat2
1   Very unhappy  2517    1.6
2 Fairly unhappy 18032   11.8
3   Fairly happy 87342   57.1
4     Very happy 45118   29.5

In addition to the table above, a barplot helps to present the descriptive summary visually:

# barplot visualise
tab = table(dat$happiness)
tab

  Very unhappy Fairly unhappy   Fairly happy     Very happy 
          2517          18032          87342          45118 
barplot(tab)

As we did earlier, we focus on the Very happy category, so we first calculate the sample proportion of respondents who are Very happy.

#sample proportion of Very happy
p_hat2 = mean(dat$happiness == "Very happy")
p_hat2
[1] 0.2948715

To estimate the Standard Error, we use the same code, but this time the sample size is called k and all the respondents are included:

#Standard Error for the proportion of Very happy
k = nrow(dat)
k
[1] 153009
SE_p2 <- sqrt(p_hat2 * (1 - p_hat2) / k)
SE_p2
[1] 0.001165714

The estimated proportion of respondents who are Very happy typically varies by about 0.0012 (or 0.12 percentage points) from sample to sample due to random sampling variation.

And the Confidence Intervals:

p_hat2
[1] 0.2948715
SE_p2
[1] 0.001165714
error_p2 <- qnorm(0.975) * SE_p2
lower.bound2 <- p_hat2 - error_p2
upper.bound2 <- p_hat2 + error_p2
lower.bound2
[1] 0.2925868
upper.bound2
[1] 0.2971563

The sample proportion of respondents who are Very happy is estimated to be around 29.5%. The 95% confidence interval suggests that we are 95% confident that the true population proportion lies between 29.3% and 29.7%.

Compared to our earlier results using the UK Family Resource Survey, this proportion is noticeably lower. You may infer that this is because the earlier respondents were all from the UK, whereas these respondents come from all over the world. Therefore, let’s filter only the UK respondents from the World Value Survey:

# filter UK respondents from the whole survey sample
dat_uk = dat %>% filter(english_short_name == "United Kingdom")
#calculate sample mean proportion for UK sample only
dat_uk %>%
  count(happiness) %>%
  mutate(p_hat3 = round(n / sum(n) * 100,1))
       happiness    n p_hat3
1   Very unhappy   34    0.7
2 Fairly unhappy  335    7.1
3   Fairly happy 2572   54.6
4     Very happy 1772   37.6
p_hat3 = mean(dat_uk$happiness == "Very happy")
p_hat3
[1] 0.3759813

Question 4. Compute the SE and CI for the UK-only dataframe dat_uk. How would you interpret the results?

3.6.2 Between regions and countries

You may have noticed that, even within the World Values Survey, the sample proportion of respondents reporting being Very happy in the UK is higher than the overall sample proportion.

We can now continue to explore how happiness varies across different regions and countries to see whether similar patterns emerge elsewhere.

To do this, we use the cross-tabulation method from last week, applying table() to region_name and happiness:

#cross-tabulate region vs happiness
table(dat$region_name, dat$happiness)
          
           Very unhappy Fairly unhappy Fairly happy Very happy
  Africa            421           1627         4829       2645
  Americas          287           2868        11541       9053
  Asia              866           5468        26994      14992
  Europe            927           7894        42287      17539
  Oceania            16            175         1691        889

You may think, okay, but it is very hard to draw conclusions based on the counts. You are right. Let’s convert the count table into a proportion table using the function prop.table().

To know the function, we can first ask R’s help:

?prop.table

In the Help pane on the right-hand side of RStudio, we can learn that setting the parameter margin to 1 means the division is done by row (each row sums to 1), while margin = 2 divides by column (each column sums to 1).

Let’s try both margin=1 and margin=2:

#get the cross-tabulation result and present in proportion format, margin = 1
tab_region_happy = table(dat$region_name, dat$happiness)
prop.table(tab_region_happy, margin = 1) * 100
          
           Very unhappy Fairly unhappy Fairly happy Very happy
  Africa      4.4213401     17.0867465   50.7141357 27.7777778
  Americas    1.2084719     12.0762979   48.5957303 38.1194998
  Asia        1.7922185     11.3162252   55.8650662 31.0264901
  Europe      1.3503868     11.4994100   61.6006526 25.5495506
  Oceania     0.5774089      6.3154096   61.0249008 32.0822808
round(prop.table(tab_region_happy, margin = 1) * 100, 2)  # round(..., 2) keeps only 2 decimal places
          
           Very unhappy Fairly unhappy Fairly happy Very happy
  Africa           4.42          17.09        50.71      27.78
  Americas         1.21          12.08        48.60      38.12
  Asia             1.79          11.32        55.87      31.03
  Europe           1.35          11.50        61.60      25.55
  Oceania          0.58           6.32        61.02      32.08

With a quick calculation you may notice that each row (region) sums to 100 (in the code above, we multiplied the proportions by 100 to express them as percentages). This means the table presents row percentages: within each region, the percentages show how respondents are distributed across the different levels of happiness. In other words, the table allows us to compare how the distribution of happiness varies across regions within the sample.
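You can confirm this in code: each row of the row-percentage table should sum to 100 (allowing for small rounding differences):

# each region's percentages should add up to 100
rowSums(prop.table(tab_region_happy, margin = 1) * 100)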

If we simply change margin = 1 to margin = 2, the results will be different because we are now calculating column percentages instead of row percentages.

round(prop.table(tab_region_happy, margin = 2) * 100, 2)
          
           Very unhappy Fairly unhappy Fairly happy Very happy
  Africa          16.73           9.02         5.53       5.86
  Americas        11.40          15.91        13.21      20.07
  Asia            34.41          30.32        30.91      33.23
  Europe          36.83          43.78        48.42      38.87
  Oceania          0.64           0.97         1.94       1.97

With margin = 2, each column sums to 100. This means we are looking at the distribution of regions within each happiness category. In other words, instead of asking: “Within each region, how is happiness distributed?” we are now asking: “Among respondents at each happiness level, how are they distributed across regions?” This changes the interpretation entirely.

Now it’s time to compare countries. Here we use the variable english_short_name as the country to cross-tabulate against happiness:

tab_country_happy = table(dat$english_short_name, dat$happiness)
round(prop.table(tab_country_happy, margin = 1) * 100, 1) # round(..., 1) keeps only 1 decimal place

Question 5. What conclusions can you draw from the findings? Change margin from 1 to 2: what new information can you learn from the results?

Again, in the following section we want to focus only on the proportion of Very happy in each country. We can combine the two functions group_by() %>% summarise() to do this:

#group dat by the country name, in each group, calculate the mean proportion of happiness reported as Very happy
df_ctry_very_happy <- dat %>% 
  group_by(english_short_name) %>% 
  summarise(p_hat = mean(happiness == "Very happy", na.rm = TRUE)) # na.rm = TRUE ignores NA values when calculating the mean
df_ctry_very_happy

This produces one row per country, with the sample proportion of respondents who are Very happy.

We can find the top country using the function which.max(). This function returns the row number corresponding to the maximum value of p_hat; we can then use that row number to extract the full row from the dataframe:

which.max(df_ctry_very_happy$p_hat)
[1] 44
df_ctry_very_happy[which.max(df_ctry_very_happy$p_hat),]
# A tibble: 1 × 2
  english_short_name p_hat
  <fct>              <dbl>
1 Kyrgyz Republic    0.677

To identify the top 10 happiest countries (based on the proportion of respondents who are Very happy), we can use the arrange() function to sort the df_ctry_very_happy dataframe by p_hat.

If we place a minus sign - before p_hat (i.e. arrange(-p_hat)), the dataframe will be sorted in decreasing order, from highest to lowest proportion. Without the minus sign, the sorting will be in ascending order, from lowest to highest.

You can also type ?arrange in RStudio to view the official documentation and see additional details about how the function works.
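As a side note, an equivalent way to sort in decreasing order is dplyr’s desc() helper, which some people find easier to read than the minus sign:

# same result as arrange(-p_hat)
df_ctry_very_happy %>% arrange(desc(p_hat)) %>% head(10)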

Now, top 10 Very happy:

df_ctry_very_happy %>% arrange(-p_hat) %>% head(10)
# A tibble: 10 × 2
   english_short_name p_hat
   <fct>              <dbl>
 1 Kyrgyz Republic    0.677
 2 Tajikistan         0.626
 3 Ecuador            0.618
 4 Mexico             0.588
 5 Colombia           0.574
 6 Uzbekistan         0.519
 7 Guatemala          0.511
 8 Philippines        0.511
 9 Puerto Rico        0.510
10 Kenya              0.509

Bottom 10 Very happy:

df_ctry_very_happy %>% arrange(p_hat) %>% head(10)
# A tibble: 10 × 2
   english_short_name  p_hat
   <fct>               <dbl>
 1 South Korea        0.0410
 2 Egypt              0.0644
 3 Lithuania          0.0918
 4 Hong Kong          0.111 
 5 Morocco            0.126 
 6 Latvia             0.132 
 7 Slovakia           0.133 
 8 Macao              0.145 
 9 Russia             0.149 
10 Greece             0.154 

3.6.3 Map happiness around the world

As usual, let’s finish today’s practical by making a map ourselves.

First, load the world map boundaries from your working folder. Before producing the final map, we can generate a quick check map using qtm() to ensure the spatial data has loaded correctly. We also set tmap to interactive mode using tmap_mode("view"), so the map can be explored dynamically (e.g., zooming and clicking on countries) before creating the final visualisation.

#load the boundary dataset
world_map <- st_read("world_map.geojson")
Reading layer `world_map' from data source 
  `C:\Users\ziye\Documents\GitHub\quant\labs\world_map.geojson' 
  using driver `GeoJSON'
Simple feature collection with 203 features and 14 fields
Geometry type: MULTIPOLYGON
Dimension:     XY
Bounding box:  xmin: -180 ymin: -58.49861 xmax: 180 ymax: 83.6236
Geodetic CRS:  WGS 84
#set as interactive map
tmap_mode("view")
#quick mapping
qtm(world_map)

Click on any country in the world map within the Viewer pane on the right-hand side of RStudio. You will notice that the variable english_short contains the country name.
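If you prefer to check the available columns in code rather than by clicking on the map, you can list the column names of the spatial dataframe:

# list all columns in the world map layer
names(world_map)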

We can use this variable as the key to join the spatial map data with our country-level happiness dataframe, since it corresponds to the english_short_name variable in our survey dataset.

Before joining them, let’s first inspect the structure of both dataframes:

dim(world_map)
[1] 203  15
dim(df_ctry_very_happy)
[1] 91  2

We can see that the world map has 203 countries, but df_ctry_very_happy only has 91. This means that not all countries in world_map can be plotted using p_hat from df_ctry_very_happy; some countries will have missing values (NA) for the happiness proportion and therefore will not be coloured according to p_hat in the final map.

However, this is perfectly acceptable — those countries will simply appear with the default missing-value styling, while the countries included in our survey data will be shaded according to their estimated proportion of respondents who are Very happy.

Let’s use function left_join() to join the two dataframes:

#join our newly created dataframe df_ctry_very_happy to the world map by the shared columns
world_happiness = left_join(world_map, df_ctry_very_happy, by=c("english_short"="english_short_name") )
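After the join, a quick check tells you how many countries did not find a match in the survey data (these will have NA for p_hat and will appear in the missing-value colour on the map):

# count countries with no matching survey value
sum(is.na(world_happiness$p_hat))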

A short piece of code will give you a first look at the new dataframe world_happiness. Don’t forget to click the full-screen icon in the bottom-right Viewer pane of RStudio to open the interactive map in your browser:

#View the map as an interactive map
tm_shape(world_happiness) +
  tm_polygons("p_hat")

To make it look better, let’s change the colour palette and breaks. Click the full-screen icon in the bottom-right Viewer pane of RStudio to open the interactive map, then browse and interact with it. How does the UK compare to other countries on this map? Do any other findings interest you?

#plot the map with the Yellow-Orange-Red colour palette, using jenks breaks with 6 classes

map = tm_shape(world_happiness) +
  tm_polygons("p_hat",
              fill.scale = tm_scale(values="YlOrRd",
                                style = "jenks",n=6))+
   tm_layout(main.title = "Mean proportion of Very happy in the World Value Survey",
            main.title.size=1.2,
            frame = FALSE) + 
  tm_compass(position = c("right", "bottom")) +
  tm_scalebar(position = c("right", "bottom")) 
map

Finally, we can use tmap_mode("plot") to switch from the interactive map to a static one. We can also save the map to disk:

tmap_mode("plot")
map
tmap_save(map,"Week3 map.png")

Check your folder and you will find the map you made.

3.7 Formative Tasks

Task 1 Using the income_gross variable from the UK Family Resource Survey, compare how increasing the sample size from 5,000 to 20,000 affects the sample mean, standard error and CI of income_gross. How would you interpret the results?

n1 = 5000
n2 = 20000

# sample by size
sample_5k = slice_sample(df, n=n1)
sample_20k = slice_sample(df, n=n2)

#mean
m.5k = mean(sample_5k$income_gross)
m.20k = mean(sample_20k$income_gross)

m.5k
[1] 22224.9
m.20k
[1] 22076.63
#standard deviation
std.5k = sd(sample_5k$income_gross)
std.20k = sd(sample_20k$income_gross)

#standard error
se.5k = std.5k/sqrt(n1)
se.20k = std.20k/sqrt(n2)

se.5k
[1] 434.672
se.20k
[1] 174.1159
#CI
error.5k <- qnorm(0.975)*se.5k
error.20k <- qnorm(0.975)*se.20k

lower.bound.5k <- m.5k-error.5k
upper.bound.5k <- m.5k+error.5k

lower.bound.20k <- m.20k-error.20k
upper.bound.20k <- m.20k+error.20k

lower.bound.5k
[1] 21372.96
upper.bound.5k
[1] 23076.85
lower.bound.20k
[1] 21735.37
upper.bound.20k
[1] 22417.89

Task 2 Using the health variable from the UK Family Resource Survey, what is the sample proportion of Very Good health in the whole sample? What are the standard error and CI? How would you interpret the results?

n = nrow(df)
p_hat = mean(df$health == "Very Good")
p_hat
[1] 0.2820358
SE_p <- sqrt(p_hat * (1 - p_hat) / n)
error_p <- qnorm(0.975) * SE_p
lower.bound <- p_hat - error_p
upper.bound <- p_hat + error_p
lower.bound
[1] 0.2766205
upper.bound
[1] 0.2874511

Task 3 Using the health variable from the World Value Survey, compare the sample proportion of Very good health with your findings in Task 2.

n = nrow(dat)
p_hat = mean(dat$health == "Very good")
p_hat
[1] 0.2204315
SE_p <- sqrt(p_hat * (1 - p_hat) / n)
error_p <- qnorm(0.975) * SE_p
lower.bound <- p_hat - error_p
upper.bound <- p_hat + error_p
lower.bound
[1] 0.2183544
upper.bound
[1] 0.2225086

Task 4 Using the trust_strangers variable from the World Value Survey, compare between different regions (use the variable sub_region_name). Which region is most likely to trust strangers completely?

tab = table(dat$sub_region_name, dat$trust_strangers)
round(prop.table(tab, margin = 1) * 100, 1)
                                 
                                  Do not trust at all Do not trust much
  Australia and New Zealand                       8.2              36.5
  Central Asia                                   41.5              41.5
  Eastern Asia                                   19.0              60.6
  Eastern Europe                                 24.0              48.6
  Latin America and the Caribbean                48.1              35.3
  Northern Africa                                31.2              43.1
  Northern America                               11.6              42.2
  Northern Europe                                 8.7              32.7
  South-eastern Asia                             29.3              49.2
  Southern Asia                                  38.3              32.4
  Southern Europe                                27.6              46.1
  Sub-Saharan Africa                             33.9              34.7
  Western Asia                                   33.8              45.4
  Western Europe                                 10.8              38.1
                                 
                                  Trust completely Trust somewhat
  Australia and New Zealand                    1.2           54.1
  Central Asia                                 2.4           14.6
  Eastern Asia                                 1.8           18.7
  Eastern Europe                               2.5           24.9
  Latin America and the Caribbean              1.9           14.7
  Northern Africa                              2.3           23.4
  Northern America                             2.0           44.2
  Northern Europe                              5.2           53.5
  South-eastern Asia                           3.0           18.5
  Southern Asia                                5.5           23.7
  Southern Europe                              2.1           24.2
  Sub-Saharan Africa                           4.8           26.6
  Western Asia                                 2.3           18.4
  Western Europe                               3.1           48.0
result = round(prop.table(tab, margin = 1) * 100, 1) %>% as.data.frame.matrix()
result
                                Do not trust at all Do not trust much
Australia and New Zealand                       8.2              36.5
Central Asia                                   41.5              41.5
Eastern Asia                                   19.0              60.6
Eastern Europe                                 24.0              48.6
Latin America and the Caribbean                48.1              35.3
Northern Africa                                31.2              43.1
Northern America                               11.6              42.2
Northern Europe                                 8.7              32.7
South-eastern Asia                             29.3              49.2
Southern Asia                                  38.3              32.4
Southern Europe                                27.6              46.1
Sub-Saharan Africa                             33.9              34.7
Western Asia                                   33.8              45.4
Western Europe                                 10.8              38.1
                                Trust completely Trust somewhat
Australia and New Zealand                    1.2           54.1
Central Asia                                 2.4           14.6
Eastern Asia                                 1.8           18.7
Eastern Europe                               2.5           24.9
Latin America and the Caribbean              1.9           14.7
Northern Africa                              2.3           23.4
Northern America                             2.0           44.2
Northern Europe                              5.2           53.5
South-eastern Asia                           3.0           18.5
Southern Asia                                5.5           23.7
Southern Europe                              2.1           24.2
Sub-Saharan Africa                           4.8           26.6
Western Asia                                 2.3           18.4
Western Europe                               3.1           48.0
which.max(result$`Trust completely`)
[1] 10
result[which.max(result$`Trust completely`),]
              Do not trust at all Do not trust much Trust completely
Southern Asia                38.3              32.4              5.5
              Trust somewhat
Southern Asia           23.7

Task 5 Using the immigrant_impact variable from the World Value Survey, create a new dataframe containing, for each country/place (variable needed: english_short_name), the sample proportion of respondents who hold a Very bad view of the impact of immigrants. Identify the five countries/places with the lowest proportion in the sample.

df_immigrant <- dat %>% 
  group_by(english_short_name) %>% 
  summarise(p_hat = mean(immigrant_impact == "Very bad", na.rm = TRUE)) # NA value removed

df_immigrant
# A tibble: 91 × 2
   english_short_name   p_hat
   <fct>                <dbl>
 1 Albania            0.0320 
 2 Andorra            0.00402
 3 Argentina          0.0567 
 4 Armenia            0.0397 
 5 Australia          0.145  
 6 Austria            0.103  
 7 Azerbaijan         0.0643 
 8 Bangladesh         0.0387 
 9 Belarus            0.0293 
10 Bolivia            0.0724 
# ℹ 81 more rows
df_immigrant %>% arrange(p_hat) %>% head(5)
# A tibble: 5 × 2
  english_short_name   p_hat
  <fct>                <dbl>
1 Andorra            0.00402
2 South Korea        0.00884
3 Macao              0.00888
4 China              0.0108 
5 Peru               0.0110