Saturday, June 26, 2021

Using a Machine Learning Model to Predict Happiness and Determine How Each Feature Impacts the Overall Score

The World Happiness Report 2021 includes historical data as well as new data consisting of metrics that gauge the happiness of a country's people. The data has been made available here. Using this data, I hypothesize that I can build a model whose predictions explain at least 80% of the variance in a country's happiness score (r^2 ≥ 0.80). I will also determine what is necessary to either increase or decrease the score, to give country leaders actionable metrics to help their people live a higher quality of life.


To begin, I need to decide how to use my data. Since there are two files (one with historical data through 2020, and one for 2021), I am going to use the historical file as my training set and the newest file as my testing set. This prevents any data leakage and partitions the data into a sensible balance of training versus testing data.
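The idea behind that split can be sketched like this; the tiny DataFrame below is just a stand-in for the two CSVs (in practice each file is loaded with `pd.read_csv`, and the column names here are illustrative):

```python
import pandas as pd

# Toy stand-in for the report data; the real files hold one row
# per country per year, with 'Life Ladder' as the happiness score.
df = pd.DataFrame({
    "Country name": ["A", "A", "B", "B"],
    "year": [2019, 2021, 2020, 2021],
    "Life Ladder": [5.1, 5.3, 6.2, 6.0],
})

# Historical rows (before 2021) become the training set;
# the 2021 rows are held out as the test set.
train = df[df["year"] < 2021]
test = df[df["year"] == 2021]
```

Splitting on time like this also mirrors how the model would be used in practice: trained on past reports, evaluated on the newest one.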

After importing the data, I need to prepare it by creating new features, cleaning up labels, deciding how to handle missing values, ensuring data types are correct for each column, and otherwise wrangling the data into a form my model can be trained on. Some interesting findings during this process include:

  • All columns except country and year are continuous. This means I need to use regression in building my model. I will also drop the country and year columns before training, since they identify rows rather than describe them.

  • The data set is relatively small (1949 rows and 23 columns after feature engineering). This means I will use k-fold cross-validation to prevent overfitting the model.
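A minimal sketch of that k-fold cross-validation step, using synthetic regression data in place of the happiness features:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the happiness features and target.
X, y = make_regression(n_samples=200, n_features=5, noise=5.0, random_state=0)

# 5-fold CV: every row serves in a validation fold exactly once,
# so the score is less sensitive to one lucky (or unlucky) split.
scores = cross_val_score(RandomForestRegressor(random_state=0), X, y,
                         cv=5, scoring="r2")
print(scores.mean())
```

With a small data set, averaging over folds gives a far more stable estimate than a single validation split.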


I create my model by selecting the happiness score (the 'Life Ladder' column) as my target vector, then do a train/test split with sklearn to carve a validation set out of the training data. This produces a baseline r^2 value of 0.0000. That is expected: the baseline simply predicts the mean, and by definition the mean explains none of the variance. The following graph depicts a pairplot of each column against the target variable.
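That zero baseline is easy to reproduce: score a model that always predicts the training mean.

```python
from sklearn.dummy import DummyRegressor
from sklearn.metrics import r2_score

y = [4.0, 5.0, 6.0, 7.0]  # toy target values

# Baseline: ignore the features and always predict the mean of y.
baseline = DummyRegressor(strategy="mean").fit([[0]] * len(y), y)
pred = baseline.predict([[0]] * len(y))

# r^2 compares a model's errors to the mean's errors, so predicting
# the mean itself scores exactly zero.
print(r2_score(y, pred))  # 0.0
```

Any useful model therefore has to beat 0.0; anything below it is worse than guessing the average.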

 

 

 

 

Let’s see how this looks when I create a linear regression model and fit it to the training data. I built a pipeline to fit the linear regression, which results in an r^2 value of 0.75. This is a good sign: the linear regression model explains 75% of the variance in the validation set.
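The pipeline pattern looks roughly like this; the imputer and scaler are illustrative stand-ins for my actual preprocessing steps, and the data here is synthetic:

```python
from sklearn.datasets import make_regression
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_regression(n_samples=300, n_features=5, noise=20.0, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)

# Chaining preprocessing and the estimator in one pipeline means the
# imputer/scaler statistics are learned from the training folds only,
# never from the validation data.
model = make_pipeline(SimpleImputer(), StandardScaler(), LinearRegression())
model.fit(X_train, y_train)
print(model.score(X_val, y_val))  # r^2 on the validation set
```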

 

 

Now to try different ways of improving my validation score without overfitting the model. 


Random Forest r^2: 0.8452017903377169 


Let's see if we can reduce the number of features by dropping the least important ones. To do this I visualize the model's feature importances, then use permutation importance to decide where to make the cut. I am going to drop the 12 least important features.
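Permutation importance works by shuffling one column at a time and measuring how much the score drops. A toy sketch with sklearn's `permutation_importance`, using one informative feature and two noise features:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 3))
y = 5.0 * X[:, 0] + rng.normal(scale=0.1, size=300)  # only column 0 matters

model = RandomForestRegressor(random_state=0).fit(X, y)

# Shuffle each column in turn and measure how much r^2 drops; features
# whose shuffling barely hurts the score are candidates to drop.
result = permutation_importance(model, X, y, n_repeats=5, random_state=0)
print(result.importances_mean)
```

Sorting `importances_mean` gives a natural cutoff: the low, flat tail is the set of features the model barely uses.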

 

 

 

New random forest r^2 after removing columns: 0.8575296225304241 

 

That removed over half of my columns, and my performance got a modest increase. Now to take a closer look at my data to see if anything stands out that I can improve further with more data wrangling, hyperparameter tuning, and gradient boosting.

 

 

 

Well, look at that: Healthy life expectancy (and the derived life-expectancy-per-freedom-of-choice feature built from it) has a lot of outliers and greatly distorts the scale of the rest of the data. Taking the log of these columns should bring them in line with the rest of the features.
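The log transform is a one-liner with numpy; on a toy long-tailed sample you can watch the mean and median pull back together:

```python
import numpy as np

rng = np.random.default_rng(0)
skewed = rng.lognormal(mean=0.0, sigma=1.0, size=1000)  # long right tail

# log1p compresses the tail, so outliers no longer dominate the scale.
logged = np.log1p(skewed)

# A mean well above the median signals right skew; after the log
# transform the two nearly agree.
print(np.mean(skewed) - np.median(skewed),
      np.mean(logged) - np.median(logged))
```

`log1p` (log of 1 + x) is a safe choice here since it stays defined at zero.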

 

 

 

That helped my hyperparameter-tuned model just barely edge out the plain random forest regression model.

 

Random forest with hyperparameter tuning r^2: 0.8585278103347038 

Gradient Boosting r^2: 0.8190404900521074 

Gradient boosting with hyperparameter tuning r^2: 0.8525865625080129 
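A sketch of the tuning step with sklearn's RandomizedSearchCV; the parameter ranges below are illustrative, not the ones I actually searched, and the data is synthetic:

```python
from scipy.stats import randint
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV

X, y = make_regression(n_samples=200, n_features=5, noise=5.0, random_state=0)

# Randomized search samples a handful of combinations instead of
# exhaustively trying the full grid, which keeps tuning cheap.
search = RandomizedSearchCV(
    RandomForestRegressor(random_state=0),
    param_distributions={
        "n_estimators": randint(50, 200),
        "max_depth": randint(3, 15),
    },
    n_iter=5, cv=3, scoring="r2", random_state=0,
)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```

The cross-validated `best_score_` is what I compare across models, so the tuning itself never peeks at the held-out 2021 data.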

 

As you can see, the r^2 value did not improve with gradient boosting, even with its own hyperparameter tuning. Now to look at the first row and see how each feature affects the prediction.


 


It is nice to see which features push the happiness score up and which pull it down. I will now apply the hyperparameter-tuned random forest model to my test set, which is new, unseen data.
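The contribution plot itself comes from SHAP; as a dependency-free illustration of the same idea, here is a crude ablation on synthetic data that replaces one feature at a time with its column mean and records how the first row's prediction moves (SHAP does this properly by averaging over feature coalitions):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = 2.0 * X[:, 0] - 1.0 * X[:, 1] + rng.normal(scale=0.1, size=200)

model = RandomForestRegressor(random_state=0).fit(X, y)

row = X[:1].copy()
base_pred = model.predict(row)[0]

# Replace one feature at a time with its column mean; the shift in the
# prediction is a rough stand-in for that feature's contribution.
effects = {}
for j in range(X.shape[1]):
    ablated = row.copy()
    ablated[0, j] = X[:, j].mean()
    effects[j] = base_pred - model.predict(ablated)[0]
print(effects)
```

A positive value means the feature is pushing this row's prediction above what an "average" value would give, and a negative value means it is dragging it down.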

 

My test data r^2 value is: 0.9015876748232455 


That is my best score yet! Let’s see it on a graph. 

 

 

 

As you can see, I was able to train a model that explains approximately 90% of the variance in the happiness score (r^2 ≈ 0.90). As more data becomes available, the training set can grow, which should improve that score further. Surprisingly, the SHAP visualization shows that Freedom to make life choices and Social support have a negative impact on the overall happiness score. Another analysis will be needed to determine what makes those two features drag the score down.

 

A link to my Github for this project can be found here.