
World Happiness Report

For this project we will be using data from the World Happiness Report. The data originates from the Gallup World Poll; however, our source is Kaggle, where the dataset can be found at https://www.kaggle.com/ajaypalsinghlo/world-happiness-report-2021. We will be using the packages dplyr, magrittr, here, ggplot2, and rpart.
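
For completeness, loading these packages at the top of the script might look like the following (assuming they are already installed):

library(dplyr)
library(magrittr)
library(here)
library(ggplot2)
library(rpart)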

Our primary objective for this project is to implement cross-validation to compare linear regression and decision tree models against each other. We will do this by using variables roughly corresponding to GDP, Social Support, Life Expectancy, Freedom, Generosity, and Corruption to predict the Happiness (Ladder.score) of a country.

While the World Happiness Report does have multiple years of data, we will focus on 2021 for this post. Below we load our data.

happiness_2021 <- read.csv(here('csv', 'kaggle_happiness_dataset', 'world-happiness-report-2021.csv'))

To get an idea of our dataset, we use the str function. We find below that we have 149 observations of 20 variables.

str(happiness_2021)
## 'data.frame':    149 obs. of  20 variables:
##  $ ï..Country.name                           : chr  "Finland" "Denmark" "Switzerland" "Iceland" ...
##  $ Regional.indicator                        : chr  "Western Europe" "Western Europe" "Western Europe" "Western Europe" ...
##  $ Ladder.score                              : num  7.84 7.62 7.57 7.55 7.46 ...
##  $ Standard.error.of.ladder.score            : num  0.032 0.035 0.036 0.059 0.027 0.035 0.036 0.037 0.04 0.036 ...
##  $ upperwhisker                              : num  7.9 7.69 7.64 7.67 7.52 ...
##  $ lowerwhisker                              : num  7.78 7.55 7.5 7.44 7.41 ...
##  $ Logged.GDP.per.capita                     : num  10.8 10.9 11.1 10.9 10.9 ...
##  $ Social.support                            : num  0.954 0.954 0.942 0.983 0.942 0.954 0.934 0.908 0.948 0.934 ...
##  $ Healthy.life.expectancy                   : num  72 72.7 74.4 73 72.4 73.3 72.7 72.6 73.4 73.3 ...
##  $ Freedom.to.make.life.choices              : num  0.949 0.946 0.919 0.955 0.913 0.96 0.945 0.907 0.929 0.908 ...
##  $ Generosity                                : num  -0.098 0.03 0.025 0.16 0.175 0.093 0.086 -0.034 0.134 0.042 ...
##  $ Perceptions.of.corruption                 : num  0.186 0.179 0.292 0.673 0.338 0.27 0.237 0.386 0.242 0.481 ...
##  $ Ladder.score.in.Dystopia                  : num  2.43 2.43 2.43 2.43 2.43 2.43 2.43 2.43 2.43 2.43 ...
##  $ Explained.by..Log.GDP.per.capita          : num  1.45 1.5 1.57 1.48 1.5 ...
##  $ Explained.by..Social.support              : num  1.11 1.11 1.08 1.17 1.08 ...
##  $ Explained.by..Healthy.life.expectancy     : num  0.741 0.763 0.816 0.772 0.753 0.782 0.763 0.76 0.785 0.782 ...
##  $ Explained.by..Freedom.to.make.life.choices: num  0.691 0.686 0.653 0.698 0.647 0.703 0.685 0.639 0.665 0.64 ...
##  $ Explained.by..Generosity                  : num  0.124 0.208 0.204 0.293 0.302 0.249 0.244 0.166 0.276 0.215 ...
##  $ Explained.by..Perceptions.of.corruption   : num  0.481 0.485 0.413 0.17 0.384 0.427 0.448 0.353 0.445 0.292 ...
##  $ Dystopia...residual                       : num  3.25 2.87 2.84 2.97 2.8 ...

While we have many variables, we will only be using the 7 mentioned earlier for our analysis. Below we store their names in R variables for easy use later on. In particular, we store the response variable and the predictor variables separately, and then combine them into a third variable containing all 7. All three of these will be used later on.

response_variable <- "Ladder.score"
predictor_variables <- c(
    "Logged.GDP.per.capita",
    "Social.support",
    "Healthy.life.expectancy",
    "Freedom.to.make.life.choices",
    "Generosity",
    "Perceptions.of.corruption"
)
variables <- c(response_variable, predictor_variables)
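
As a quick sanity check (output not shown), we can confirm that all seven of these names match columns in the data:

# every stored variable name should correspond to a column in the dataset
all(variables %in% names(happiness_2021))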

Our next step is to split our dataset into training and testing sets. While we will be using cross-validation, we will be doing so only on our training set. This lets us choose a best model using the training set alone, while still having the testing set to see how our chosen model holds up on data it has never seen.

# sequester test set
set.seed(242)
n <- nrow(happiness_2021)
train_indices <- sample(1:n, size=n*.7)
train_set <- happiness_2021[train_indices, ]
test_set <- happiness_2021[-train_indices, ]
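
With 149 rows in total, this 70/30 split should leave 104 rows for training and 45 for testing, which we can verify with:

# confirm the sizes of the training and testing sets
nrow(train_set)
nrow(test_set)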

While we briefly looked at the structure of the whole dataset to examine its size and variable names, we want to avoid looking any closer at the testing data. For the training data, however, we can see how our Happiness, or Ladder.score, varies by region. The boxplots below show that there are distinct clusters based on region. Note that we will not use region as one of our predictor variables. This is not because we believe it would be a bad predictor, but rather because of the small size of our dataset and the concern that certain regions would not be represented in our models. If we were not using cross-validation this could be a very useful variable, but we will not explore it further here.

ggplot(train_set, aes(x=Ladder.score, y=Regional.indicator)) +
    geom_boxplot()
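
Since dplyr and magrittr are among the packages we loaded, we can also get a quick numeric view of the same pattern, for example by looking at the median Ladder.score per region:

# median happiness score by region, highest first
train_set %>%
    group_by(Regional.indicator) %>%
    summarise(median_ladder_score = median(Ladder.score)) %>%
    arrange(desc(median_ladder_score))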

For the next part of our project we will use cross-validation. More precisely, we will use k-fold cross-validation, which involves splitting our dataset into k parts. We hold out one part as a test set and combine the remaining parts into a training set. We then train our model on the training set, make predictions on the test set, and record the error. We repeat this process k times so that every part serves as the test set exactly once. Once we are done we calculate the average error, and this is the error we use to compare our models. This is a very useful technique since it reduces the chance that a single lucky or unlucky split gives us a misleadingly good or bad error estimate.

To implement cross-validation we first calculate the number of rows in our dataset and then split the row numbers into k approximately equal sets. We then return k pairs of training and testing sets containing the indices. We work with indices at this level for simplicity and to avoid saving multiple data-frame-sized copies of our original data.

# function which splits a dataframe into k sets
cross_validation_sets <- function(data, k=3) {
    n <- nrow(data)
    randomized_indices <- sample(1:n)
    list_of_indices <- list()
    for (i in 1:k) {
        l <- floor((i - 1) * n / k + 1)
        u <- floor(i * n / k)
        list_of_indices[[i]] <- randomized_indices[l:u]
    }
    train_list <- list()
    test_list <- list()
    # outer loop keeps track of fold and test set index
    for (i in 1:k) {
        first_training_element <- TRUE
        # inner loop adds indices to the train and test lists
        for (j in 1:k) {
            if (j == i) {
                test_list[[i]] <- list_of_indices[[j]]
            } else if (first_training_element) {
                train_list[[i]] <- list_of_indices[[j]]
                first_training_element <- FALSE
            } else {
                train_list[[i]] <- c(train_list[[i]], list_of_indices[[j]])
            }
        }
    }
    return(list(train=train_list, test=test_list))
}

Since our cross-validation splitting was put into a function, we simply call it on our training data below to get our sets. In this case we will use k=5.

x <- cross_validation_sets(data=train_set, k=5)
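
Before moving on, it is worth checking that the five test folds are of roughly equal size and that together they cover every row of the training set:

# each test fold should contain roughly a fifth of the training rows
sapply(x$test, length)
# the test folds together should cover every training row
length(unique(unlist(x$test))) == nrow(train_set)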

We will use root mean squared error (RMSE) for our error estimate. Below we create a function which does the job for us given a formula deciding which variables are used as the response and predictors. We also pass in the modelling function itself (lm or rpart in our case), the training data, and the testing data (split into predictors and response).

calculate_rmse <- function(formula, f_name, train_data, test_predictors, test_response) {
    # fit the model (e.g. lm or rpart) on the training data
    model <- f_name(formula, data=train_data)
    # predict on the held-out predictors and compute the RMSE
    pred <- predict(model, test_predictors)
    error <- test_response - pred
    rmse <- sqrt(mean(error^2))
    return(rmse)
}

Now that we have a function for evaluating the error, we loop over the k partitions of our original training set produced by our cross-validation function, fitting a regression with the lm function and a decision tree with the rpart function on each fold.

k <- 5

regression_rmse_list <- list()
tree_rmse_list <- list()
for (i in 1:k) {
    training_set <- train_set[x$train[[i]], variables]
    testing_set_predictors <- train_set[x$test[[i]], predictor_variables]
    testing_set_response <- train_set[x$test[[i]], response_variable]
    
    regression_rmse_list[[i]] <- calculate_rmse(Ladder.score~., lm, training_set, testing_set_predictors, testing_set_response)
    tree_rmse_list[[i]] <- calculate_rmse(Ladder.score~., rpart, training_set, testing_set_predictors, testing_set_response)
}

We now compare our cross-validated results for regression and for the decision tree method by calculating the mean RMSE for each.

# cross validated regression error
mean(unlist(regression_rmse_list))
## [1] 0.6093241
# cross validated decision tree error
mean(unlist(tree_rmse_list))
## [1] 0.7405093

Our results show that regression does a better job of predicting our response variable. For a comprehensive machine learning task we would work with more models and pay more attention to our parameters; random forests, for example, could prove to be a much better model. In our case, however, we will conclude that regression is the better model for this dataset and examine how well it performs beyond cross-validation. In particular, we will use our full training set to train the model and then calculate the RMSE on our original test set. Before we do this, let's take a quick look at the individual RMSE values for each fold of our cross-validation.

# regression errors
unlist(regression_rmse_list)
## [1] 0.3756598 0.5888793 0.6244611 0.6529883 0.8046318
# decision tree errors
unlist(tree_rmse_list)
## [1] 0.6219071 0.7439812 0.5830238 0.8527813 0.9008533

In only one pair of RMSE values does the decision tree model outperform the regression model. If we had trained and tested each model only once, without cross-validation, we could have ended up thinking the decision tree was the better model, or that both models performed similarly. Based on our cross-validated results, that conclusion would have been a poor one.

We now train a regression model on the full training set and evaluate it on the test set.

model <- lm(Ladder.score~., data=train_set[ , variables])
pred <- predict(model, test_set[ , predictor_variables])
error <- test_set[ , response_variable] - pred
rmse <- sqrt(mean(error^2))

For a summary of our model we get:

summary(model)
## 
## Call:
## lm(formula = Ladder.score ~ ., data = train_set[, variables])
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.78133 -0.29654  0.07411  0.39267  1.09759 
## 
## Coefficients:
##                              Estimate Std. Error t value Pr(>|t|)   
## (Intercept)                  -2.63389    0.82798  -3.181  0.00197 **
## Logged.GDP.per.capita         0.29579    0.12003   2.464  0.01549 * 
## Social.support                2.46046    0.88630   2.776  0.00660 **
## Healthy.life.expectancy       0.03566    0.01949   1.829  0.07042 . 
## Freedom.to.make.life.choices  1.83132    0.62748   2.919  0.00437 **
## Generosity                    0.62410    0.41576   1.501  0.13658   
## Perceptions.of.corruption    -0.51071    0.36029  -1.417  0.15954   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.5928 on 97 degrees of freedom
## Multiple R-squared:  0.7495, Adjusted R-squared:  0.734 
## F-statistic: 48.37 on 6 and 97 DF,  p-value: < 2.2e-16

This indicates that, of our variables, Social.support and Freedom.to.make.life.choices are the most significant predictors of happiness.
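
If we want a sense of the uncertainty around these estimates beyond the p-values, confidence intervals for the coefficients are available directly from the fitted model:

# 95% confidence intervals for the regression coefficients
confint(model)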

To see how our model did in its predictions, we have the RMSE below.

rmse
## [1] 0.4205455

With an RMSE of 0.4205455 we have a noticeably lower error than our cross-validation estimate. This may be a sign that we have under-fitted the data, and we could take this result as a cue to fit a more sophisticated model.
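
As a possible next step (not run here), the random forest idea mentioned above slots straight into the same cross-validation machinery. A minimal sketch, assuming the randomForest package is installed, might look like this:

# hypothetical follow-up: cross-validate a random forest on the same folds
library(randomForest)

rf_rmse_list <- list()
for (i in 1:k) {
    training_set <- train_set[x$train[[i]], variables]
    testing_set_predictors <- train_set[x$test[[i]], predictor_variables]
    testing_set_response <- train_set[x$test[[i]], response_variable]

    rf_rmse_list[[i]] <- calculate_rmse(Ladder.score~., randomForest, training_set, testing_set_predictors, testing_set_response)
}

# cross-validated random forest error
mean(unlist(rf_rmse_list))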