The World Happiness Report is an annual publication of the United Nations Sustainable Development Solutions Network that ranks countries by how happy their citizens perceive themselves to be.
The report is based on surveys that ask people to rate their own well-being on a scale of 0 to 10. The report also includes a chapter on the state of happiness in the world and an analysis of the data from various perspectives, including by gender, age, and region.
Additionally, the World Happiness Report ranks countries by factors that contribute to happiness, such as income, social support, healthy life expectancy, freedom to make life choices, trust, and generosity. The first World Happiness Report was released in 2012, and it has been published annually since then.
This project analyzes the 2021 World Happiness Report to draw conclusions about the general well-being of 149 countries. It covers some basic data wrangling and exploratory data analysis.
Let's import the libraries.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
We will then load and read our dataset.
df = pd.read_csv('/content/world-happiness-report-2021.csv')
Here’s a slice of the table. These countries are sorted by their happiness scores, with values for each of the variables.
df.head()
I wanted to know how many observations are in the dataset, so I used the .shape attribute to get the number of rows and columns.
df.shape
(149, 20)
We need to understand our data more.
The info() method displays information about a DataFrame, including the number of rows and columns, the data type of each column, and the amount of memory used. It also shows the number of non-null values in each column.
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 149 entries, 0 to 148
Data columns (total 20 columns):
 #   Column                                      Non-Null Count  Dtype
---  ------                                      --------------  -----
 0   Country name                                149 non-null    object
 1   Regional indicator                          149 non-null    object
 2   Ladder score                                149 non-null    float64
 3   Standard error of ladder score              149 non-null    float64
 4   upperwhisker                                149 non-null    float64
 5   lowerwhisker                                149 non-null    float64
 6   Logged GDP per capita                       149 non-null    float64
 7   Social support                              149 non-null    float64
 8   Healthy life expectancy                     149 non-null    float64
 9   Freedom to make life choices                149 non-null    float64
 10  Generosity                                  149 non-null    float64
 11  Perceptions of corruption                   149 non-null    float64
 12  Ladder score in Dystopia                    149 non-null    float64
 13  Explained by: Log GDP per capita            149 non-null    float64
 14  Explained by: Social support                149 non-null    float64
 15  Explained by: Healthy life expectancy       149 non-null    float64
 16  Explained by: Freedom to make life choices  149 non-null    float64
 17  Explained by: Generosity                    149 non-null    float64
 18  Explained by: Perceptions of corruption     149 non-null    float64
 19  Dystopia + residual                         149 non-null    float64
dtypes: float64(18), object(2)
memory usage: 23.4+ KB
What are the data types in this dataset? Do they make sense?
df.dtypes
Country name                                   object
Regional indicator                             object
Ladder score                                   float64
Standard error of ladder score                 float64
upperwhisker                                   float64
lowerwhisker                                   float64
Logged GDP per capita                          float64
Social support                                 float64
Healthy life expectancy                        float64
Freedom to make life choices                   float64
Generosity                                     float64
Perceptions of corruption                      float64
Ladder score in Dystopia                       float64
Explained by: Log GDP per capita               float64
Explained by: Social support                   float64
Explained by: Healthy life expectancy          float64
Explained by: Freedom to make life choices     float64
Explained by: Generosity                       float64
Explained by: Perceptions of corruption        float64
Dystopia + residual                            float64
dtype: object
We also need descriptive statistics for our data’s columns. The describe() method returns a new DataFrame containing statistics such as the count, mean, standard deviation, minimum, and maximum of each numeric column.
df.describe()
Is our data clean?
We will drop the statistical columns and the “Explained by: ” columns. These have no direct impact on the total score reported for each country; they are just a way of explaining how much each variable contributes to that country's Ladder score.
new_df = df.drop(columns=[
    'Standard error of ladder score', 'upperwhisker', 'lowerwhisker',
    'Ladder score in Dystopia',
    'Explained by: Log GDP per capita', 'Explained by: Social support',
    'Explained by: Healthy life expectancy', 'Explained by: Freedom to make life choices',
    'Explained by: Generosity', 'Explained by: Perceptions of corruption',
    'Dystopia + residual'])
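As an alternative to typing every column name, the shared "Explained by: " prefix can be used to collect those columns. The following is only a sketch on a small made-up frame (the `sample` DataFrame and its values are hypothetical, not taken from the report); it also shows that `errors="ignore"` lets `drop` skip a label that is not present.

```python
import pandas as pd

# Hypothetical three-row frame that borrows a few of the report's column names.
sample = pd.DataFrame({
    "Country name": ["A", "B", "C"],
    "Ladder score": [7.1, 5.4, 3.2],
    "Explained by: Social support": [1.1, 0.8, 0.2],
    "Explained by: Generosity": [0.12, 0.10, 0.09],
    "Dystopia + residual": [3.0, 2.5, 1.9],
})

# Collect every column whose name starts with the shared prefix.
explained_cols = [c for c in sample.columns if c.startswith("Explained by: ")]

# "upperwhisker" is absent from this toy frame; errors="ignore" skips it
# instead of raising a KeyError.
to_drop = explained_cols + ["Dystopia + residual", "upperwhisker"]
trimmed = sample.drop(columns=to_drop, errors="ignore")

print(list(trimmed.columns))
```

The original frame is left untouched, because `drop` returns a new DataFrame unless `inplace=True` is passed.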
We can check for any null values.
new_df.isnull().sum()
Country name                    0
Regional indicator              0
Ladder score                    0
Logged GDP per capita           0
Social support                  0
Healthy life expectancy         0
Freedom to make life choices    0
Generosity                      0
Perceptions of corruption       0
dtype: int64
We can also check for duplicate rows.
new_df.duplicated().sum()
0
We can generate the frequency distribution of the unique values in the Regional indicator column. The result is a new Series containing the count of each unique value in descending order.
new_df['Regional indicator'].value_counts()
Sub-Saharan Africa                    36
Western Europe                        21
Latin America and Caribbean           20
Middle East and North Africa          17
Central and Eastern Europe            17
Commonwealth of Independent States    12
Southeast Asia                         9
South Asia                             7
East Asia                              6
North America and ANZ                  4
Name: Regional indicator, dtype: int64
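To read these counts as shares of the whole dataset, they can be divided by their total. The sketch below rebuilds the counts shown above as a plain Series (the name `region_counts` is just for illustration) so it runs without the CSV file. On the real data, `value_counts(normalize=True)` should give the same proportions directly.

```python
import pandas as pd

# Counts copied from the value_counts() output above.
region_counts = pd.Series({
    "Sub-Saharan Africa": 36,
    "Western Europe": 21,
    "Latin America and Caribbean": 20,
    "Middle East and North Africa": 17,
    "Central and Eastern Europe": 17,
    "Commonwealth of Independent States": 12,
    "Southeast Asia": 9,
    "South Asia": 7,
    "East Asia": 6,
    "North America and ANZ": 4,
})

total = region_counts.sum()  # should match the number of rows in the dataset

# Percentage of countries that fall in each region, rounded to one decimal.
region_share = (region_counts / total * 100).round(1)

print(total)
print(region_share)
```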
Data Analysis
Are the minimum and maximum happiness Ladder scores reasonable? Are there any outliers?
df['Ladder score'].max()
7.842
df['Ladder score'].min()
2.523
The happiness Ladder scores appear reasonable, since they all fall between about 2.5 and 7.8, well within the survey's 0 to 10 scale.
Now, can we get the average happiness score for all countries?
df['Ladder score'].mean()
5.532838926174497
cont_data = new_df[['Ladder score', 'Logged GDP per capita', 'Social support', 'Healthy life expectancy', 'Generosity', 'Perceptions of corruption', 'Freedom to make life choices']].describe()
cont_data
Is there any correlation between the features?
I’m going to use the happiness score (Ladder score) as the target, since the other features are meant to explain it.
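# numeric_only=True skips the two text columns, which newer pandas versions would otherwise reject
correlation = new_df.corr(method='pearson', numeric_only=True)
correlation
# take the absolute value before sorting so negative correlations are ranked by strength
correlation['Ladder score'].abs().sort_values(ascending=False)
Ladder score                    1.000000
Logged GDP per capita           0.789760
Healthy life expectancy         0.768099
Social support                  0.756888
Freedom to make life choices    0.607753
Perceptions of corruption       0.421140
Generosity                      0.017799
Name: Ladder score, dtype: float64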
Logged GDP per capita, Healthy life expectancy and Social support are highly correlated with the Ladder score. This means that if I wanted to perform a predictive analysis, I could play around with some regression models using these three features to predict the target, the happiness score.
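When ranking features by how strongly they correlate with a target, it can help to sort by absolute value while still keeping the sign, so that a strong negative relationship is not mistaken for a weak one. Here is a small sketch of that idea. The `toy` frame, its column names and its values are all made up for illustration and are not the report's data.

```python
import pandas as pd

# Hypothetical values, for illustration only.
toy = pd.DataFrame({
    "score":      [1.0, 2.0, 3.0, 4.0, 5.0],
    "gdp":        [1.1, 1.9, 3.2, 3.8, 5.1],  # rises with score
    "corruption": [5.0, 4.2, 2.9, 2.1, 1.0],  # falls as score rises
    "noise":      [2.0, 5.0, 1.0, 4.0, 3.0],  # little relation to score
})

# Signed Pearson correlation of each feature with the target.
signed = toy.corr(method="pearson")["score"].drop("score")

# Order the features by strength (absolute value) but keep the signed values.
order = signed.abs().sort_values(ascending=False).index
ranked = signed.reindex(order)

print(ranked)
```

In this toy example the negatively correlated column comes out first, with its minus sign intact.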
Univariate analysis of continuous variables.
A histogram shows the frequency distribution of a numerical variable.
How is happiness score distributed in this study?
sns.displot(df['Ladder score'], kde = True)
The scores range from a low of around 2.5 to a high of around 7.8.
A box plot (also known as a box-and-whisker plot) is a standardised way of displaying the distribution of a dataset. The box represents the interquartile range (IQR) of the data, which is the range between the first and third quartiles (25th and 75th percentiles). The whiskers extend from the box to the minimum and maximum values, excluding outliers. Outliers are observations that fall more than 1.5 times the IQR below the first quartile or above the third quartile.
sns.boxplot(x=df['Ladder score']).set_title('Box plot of Happiness score')
plt.show()
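The 1.5 × IQR rule can also be checked numerically. The sketch below applies it to a short, made-up list of scores (the `scores` values are hypothetical, chosen so that one value falls outside the fences); the same steps should work on df['Ladder score'].

```python
import pandas as pd

# Hypothetical scores, for illustration only.
scores = pd.Series([2.5, 5.0, 5.2, 5.5, 5.8, 6.0, 6.4])

q1 = scores.quantile(0.25)  # first quartile
q3 = scores.quantile(0.75)  # third quartile
iqr = q3 - q1               # interquartile range

# Fences: anything beyond 1.5 * IQR from the quartiles counts as an outlier.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

outliers = scores[(scores < lower_fence) | (scores > upper_fence)]

print(lower_fence, upper_fence)
print(outliers.tolist())
```

An empty result would mean that no value lies beyond the whiskers.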
Using loc to select rows and columns, here are the five happiest countries.
top_happy = df.loc[:4, ['Country name', 'Ladder score']]
top_happy
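Selecting the first rows with loc works here because the file is already sorted by Ladder score. If that were not guaranteed, `nlargest` and `nsmallest` would pick the extremes regardless of row order. A minimal sketch on a made-up, deliberately unsorted frame (the country letters and scores are hypothetical):

```python
import pandas as pd

# Hypothetical, deliberately unsorted data.
toy = pd.DataFrame({
    "Country name": ["C", "A", "E", "B", "D"],
    "Ladder score": [5.0, 7.8, 2.5, 6.9, 4.1],
})

# Pick rows by value rather than by position in the file.
top_two = toy.nlargest(2, "Ladder score")
bottom_two = toy.nsmallest(2, "Ladder score")

print(top_two)
print(bottom_two)
```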
df2021_happiest_unhappiest = df[(df["Ladder score"] > 7.5) | (df["Ladder score"] < 4)]
sns.barplot(x="Ladder score", y="Country name", data=df2021_happiest_unhappiest, palette="Set1")
plt.title("Happiest and Unhappiest Countries in 2021")
plt.show()
The highest happiness Ladder score belongs to Finland with 7.842, followed by Denmark and others.
The top five are all Western European countries.
df.loc[:, ['Country name', 'Regional indicator', 'Ladder score']].head()
This is what the happiest country looks like.
df.loc[0]
Country name                                  Finland
Regional indicator                            Western Europe
Ladder score                                  7.842
Standard error of ladder score                0.032
upperwhisker                                  7.904
lowerwhisker                                  7.78
Logged GDP per capita                         10.775
Social support                                0.954
Healthy life expectancy                       72.0
Freedom to make life choices                  0.949
Generosity                                    -0.098
Perceptions of corruption                     0.186
Ladder score in Dystopia                      2.43
Explained by: Log GDP per capita              1.446
Explained by: Social support                  1.106
Explained by: Healthy life expectancy         0.741
Explained by: Freedom to make life choices    0.691
Explained by: Generosity                      0.124
Explained by: Perceptions of corruption       0.481
Dystopia + residual                           3.253
Name: 0, dtype: object
Finland has a pretty high value for Social Support, Healthy Life Expectancy and Freedom to Make Life Choices when compared to the maximum, and a fairly low value for Perceptions of Corruption when compared to the minimum. While this does not determine Finland’s score, it may explain why it has the highest happiness score in the world.
Pretty good reasons why it's at the top.
This is how happiness is distributed by region.
plt.figure(figsize=(15, 8))
# keyword arguments keep this working on recent seaborn versions
sns.kdeplot(data=df, x="Ladder score", hue="Regional indicator", fill=True, linewidth=2)
plt.axvline(df["Ladder score"].mean(), c="black")
plt.title("Ladder score distribution by regional indicator")
plt.show()

plt.figure(figsize=(15, 8))
sns.kdeplot(data=df, x="Generosity", hue="Regional indicator", fill=True, linewidth=2)
plt.axvline(df["Generosity"].mean(), c="black")
plt.title("Generosity distribution by Region")
plt.show()

plt.figure(figsize=(20, 10))
sns.barplot(data=df, x='Regional indicator', y='Ladder score')
plt.xticks(rotation=45)
plt.show()
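By default, seaborn's barplot draws the mean of the y variable for each category, so the bar heights in the last chart correspond to a per-region average. The same numbers can be obtained with a groupby. This is a sketch on a tiny made-up frame; the region names "North" and "South" and the scores are hypothetical.

```python
import pandas as pd

# Hypothetical regions and scores, for illustration only.
toy = pd.DataFrame({
    "Regional indicator": ["North", "North", "South", "South", "South"],
    "Ladder score": [7.0, 6.0, 4.0, 5.0, 3.0],
})

# Mean Ladder score per region, highest first.
region_means = (
    toy.groupby("Regional indicator")["Ladder score"]
    .mean()
    .sort_values(ascending=False)
)

print(region_means)
```

Sorting the result makes it easier to see which regions sit above or below the overall mean.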
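# cont_data holds describe() summaries, so correlate the underlying columns rather than the summary rows
new_df[cont_data.columns].corr()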
The above shows the correlations between all the continuous variables in our dataset.
We notice how correlated many of the variables are with each other. GDP is highly correlated with social support. One possible reading is that countries with higher GDP can afford more social support for their populations.
The freedom to make life choices also seems to correlate with social support.
To look at the relationships between happiness scores and the other measurements, I created scatterplots.
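# x and y are passed by keyword, since recent seaborn versions no longer accept them positionally
sns.scatterplot(x='Logged GDP per capita', y='Ladder score', data=df)
plt.ylim(0,)
sns.lmplot(x='Logged GDP per capita', y='Ladder score', data=df)
Logged GDP has a positive linear association with happiness scores.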
sns.scatterplot(x='Social support', y='Ladder score', data=df)
plt.ylim(0,)
sns.lmplot(x='Social support', y='Ladder score', data=df)
Social support has a positive linear association with happiness scores.
sns.scatterplot(x='Healthy life expectancy', y='Ladder score', data=df)
plt.ylim(0,)
sns.lmplot(x='Healthy life expectancy', y='Ladder score', data=df)
Healthy life expectancy has a positive linear association with the happiness Ladder score.
sns.scatterplot(x='Generosity', y='Ladder score', data=df)
plt.ylim(0,)
sns.lmplot(x='Generosity', y='Ladder score', data=df)
sns.scatterplot(x='Perceptions of corruption', y='Ladder score', data=df)
plt.ylim(0,)
sns.lmplot(x='Perceptions of corruption', y='Ladder score', data=df)
Perceptions of corruption has a negative linear association with the happiness Ladder score.
sns.scatterplot(x='Freedom to make life choices', y='Ladder score', data=df)
plt.ylim(0,)
sns.lmplot(x='Freedom to make life choices', y='Ladder score', data=df)
Freedom to make life choices has a positive linear association with the happiness Ladder score.
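The regression lines drawn by lmplot come from a straight-line fit, and the sign of the fitted slope is what "positive" or "negative" linear association refers to. The slope can be estimated directly with NumPy's `polyfit`. The sketch below uses made-up numbers (`x` and `y` are hypothetical, not columns from the report).

```python
import numpy as np

# Hypothetical feature values and scores, for illustration only.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 2.9, 4.2, 4.8, 6.0])

# Degree-1 least-squares fit: y is approximated by slope * x + intercept.
slope, intercept = np.polyfit(x, y, 1)

direction = "positive" if slope > 0 else "negative"
print(slope, intercept, direction)
```

A slope close to zero would suggest little linear association, whichever side of zero it falls on.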
top_10 = df.loc[:, ['Country name', 'Ladder score', 'Regional indicator']].head(10)
top_10
plt.figure(figsize=(20, 6))
# plot from top_10 (the original referenced an undefined name, top10)
sns.barplot(data=top_10, x='Country name', y='Ladder score')
A look at the unhappiest countries.
lowest_10 = df.loc[:, ['Country name', 'Ladder score', 'Regional indicator']].tail(10)
lowest_10
plt.figure(figsize=(20, 6))
# plot from lowest_10 (the original referenced an undefined name, low10)
sns.barplot(data=lowest_10, x='Country name', y='Ladder score')
The unhappiest of the lot according to the study is Afghanistan.
Let's look at its row to see how the determinants of happiness play out in this country.
df.loc[148]
Country name                                  Afghanistan
Regional indicator                            South Asia
Ladder score                                  2.523
Standard error of ladder score                0.038
upperwhisker                                  2.596
lowerwhisker                                  2.449
Logged GDP per capita                         7.695
Social support                                0.463
Healthy life expectancy                       52.493
Freedom to make life choices                  0.382
Generosity                                    -0.102
Perceptions of corruption                     0.924
Ladder score in Dystopia                      2.43
Explained by: Log GDP per capita              0.37
Explained by: Social support                  0.0
Explained by: Healthy life expectancy         0.126
Explained by: Freedom to make life choices    0.0
Explained by: Generosity                      0.122
Explained by: Perceptions of corruption       0.01
Dystopia + residual                           1.895
Name: 148, dtype: object
Afghanistan ranks low in all the determinants of happiness.