World Bank Global Poverty and Education Tutorial

Nandhini Krishnan


1. Introduction

The World Bank makes a number of databases available for public use in order to help understand stages of development across countries. One of these databases collects data points that serve as indicators of poverty. Some examples of these indicators are:

Indicators such as these allow one to make concrete inferences about a country's development. From them, we can analyze how much progress countries have made over time in reducing poverty and improving the livelihood of their people. Poverty has long-standing effects on other sectors of society. For example, there are close correlations between poverty and hunger, education, and the economy. The World Bank provides data for all of these factors. For the purpose of this tutorial, we will focus on the datasets provided for poverty and education.

Education is important for countries that want to keep innovating and developing. It is also key to helping people out of poverty, as education allows people to obtain jobs that provide a sustainable income to support themselves and their families. Looking at education indicators allows us to understand the educational improvements a country is making. Examples of education indicators are:

Looking at both education and poverty can be a means of understanding a country's economic development and its areas for improvement. According to a study by UNESCO, education and poverty are closely linked. As a result, education can be the key to helping developing countries overcome the obstacles of poverty and further develop their economies.

This tutorial therefore steps through how to query and analyze the poverty and education data provided by the World Bank. The GitHub repository in which I created this tutorial is available here: Link to repository

2. Get the Data

First, we will gather the data for the poverty and education dataframes. This requires downloading the necessary CSV files from the World Bank website.

The World Bank provides a large amount of data and insight into a country's economic and societal development. We can use this data to quantify questions such as how poverty trends in a country relate to education.

We must first extract the information we need in order to draw analytical conclusions. To do this, we create one dataframe for the poverty data and a separate one for the education data. Dataframes are data structures commonly used in data analysis that allow the user to filter data, plot and visualize it, and compare different datasets.

The data comes from the World Bank CSV files, so we can use the pandas read_csv() function to read each file into a dataframe. read_csv() uses the structure of a CSV file to separate each individual piece of data into its own element, which can then be queried and filtered as needed to visualize and make sense of the data. Once the files are loaded, we can filter the dataframes to extract only the poverty and education indicators discussed above. This begins the process of cleaning the data. Cleaning is a necessary step that lets you focus on the relevant information and decide how to handle any missing or potentially inaccurate data.

We want to clean the data to ensure we are able to make meaningful inferences. Many of the earlier years contain a lot of missing information. As a result, let's focus on the six most recent years of data provided: 2015 to 2020. To do this, let's create a new filtered dataframe for both the poverty and the education data with only the years and indicators we want to analyze. After filtering to the necessary indicators and years, we can also drop any countries for which all 6 years have missing data. This way we can focus only on data that is relevant for drawing trends.

Once these steps are completed, we have cleaned and created 2 dataframes: one focusing on poverty indicators and one on education indicators. How one cleans data depends on the purpose of the visualization and analysis. However, it is generally good data science practice to address missing data in the cleaning process.

First, we create a dataframe that contains all the raw data from the CSV files obtained from the World Bank. To create these dataframes, we use the pandas library, which provides a series of functions for creating and querying dataframes.
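The loading step can be sketched as follows. This is a minimal illustration, not the notebook's original code: a small in-memory CSV stands in for the downloaded file so that the example runs on its own, and the values, the file name in the comment, and every column name other than Country_Name are assumptions made for illustration.

```python
import io

import pandas as pd

# A tiny stand-in for a downloaded World Bank file. The values and the
# column names "Indicator_Name", "2015" and "2016" are invented.
sample_csv = io.StringIO(
    "Country_Name,Indicator_Name,2015,2016\n"
    "Country A,Literacy rate,90.1,90.4\n"
    "Country A,Poverty headcount ratio,12.0,\n"
    "Country B,Literacy rate,,75.2\n"
)

# read_csv() splits each line on commas and turns empty cells into NaN.
# With a real download you would pass the file path instead,
# e.g. pd.read_csv("poverty.csv") (hypothetical file name).
poverty_raw = pd.read_csv(sample_csv)

print(poverty_raw.shape)         # (rows, columns)
print(poverty_raw.isna().sum())  # number of missing cells per column
```

Checking the shape and the per-column count of missing values right after loading gives a quick first impression of how much cleaning the file will need.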

Next, we clean the dataframe to include only the years and indicators that we want to analyze. This allows us to handle the missing data from the CSV file and focus our analysis.
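One way this cleaning step might look is sketched here, under stated assumptions: the toy values, the Indicator_Name column, the indicator labels, and the year columns stored as strings are all hypothetical. Only Country_Name is taken from the tutorial.

```python
import numpy as np
import pandas as pd

# Toy data standing in for the raw dataframe; all values are invented.
raw = pd.DataFrame({
    "Country_Name": ["Country A", "Country A", "Country B", "Country C"],
    "Indicator_Name": ["Indicator X", "Indicator Y", "Indicator X", "Indicator X"],
    "2014": [1.0, 2.0, 3.0, 4.0],
    "2015": [10.0, 5.0, np.nan, np.nan],
    "2016": [11.0, 5.5, 7.0, np.nan],
})

years = ["2015", "2016"]   # in the tutorial this would run from 2015 to 2020
wanted = ["Indicator X"]   # hypothetical indicator name(s) to keep

# Keep only the wanted indicators, the identifying columns and the chosen years.
cleaned = raw.loc[
    raw["Indicator_Name"].isin(wanted),
    ["Country_Name", "Indicator_Name"] + years,
]

# Drop rows in which every selected year is missing.
cleaned = cleaned.dropna(subset=years, how="all").reset_index(drop=True)

print(cleaned)
```

Using how="all" removes a country only when it has no data at all in the chosen years, so countries with partial data are kept for later steps.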

We can repeat the same process using the pandas library on the education csv file in order to create the education dataframe and clean the data to focus on the indicators and years we want to analyze.

3. Visualizations

Once the dataframes have been created and cleaned, we can begin the visualization process, from which we can draw analysis and conclusions about the World Bank data. For this tutorial, I chose to look at trends globally as well as within different types of countries. I also wanted to look for correlations between education and poverty indicators. As a result, we will look at each indicator separately and then analyze it jointly with a corresponding indicator from the other dataframe.

Let's first look at literacy rates on a global scale. Literacy can be a useful indicator of the level of education and development within a country. Higher literacy rates imply that more individuals are educated and thus able to obtain higher-paying jobs and contribute innovative solutions and strategies to develop a country. Literacy also speaks to how educated the country is as a whole, since it is considered a basic educational requirement. The World Bank provides an indicator that shows the percentage of a country's population aged 15 and above that is literate.

Next, we can look at the multidimensional poverty headcount ratio on an international scale. This is the percentage of a country's population deemed to be multidimensionally poor. After looking at these two indicators individually, we can plot them together to compare them and draw conclusions.

We can first use the pandas library to filter the education and poverty dataframes to include only the indicators we wish to focus on. This allows us to create subplots showing each individual country's trend for the various education and poverty indicators. When creating subplots of literacy rates, we will focus on countries that provide at least 3 years' worth of data, so that each plot shows a general trend in the literacy rate from 2015 to 2020. Next, we can calculate the global mean literacy rate for each year and plot that as well, to get an understanding of how the world as a whole is doing with regard to education and literacy. We can then repeat the same steps to visualize the global mean multidimensional poverty ratio. Providing several visualizations of the same set of indicators gives a holistic picture from which to make meaningful inferences.

To create the subplots of literacy rates, we must first create a function that marks which countries have enough data to plot. The count_plot() function goes through each row of a given dataframe and returns True if at least 3 of the years have data and False otherwise. Here we use Python's numpy library, which lets us work with nan, the numpy value used to indicate missing data. We want to keep minimizing the amount of missing data we work with in order to make meaningful inferences. After marking whether or not each country can be plotted, we can filter the dataframe once more to keep only the countries that can be plotted and create subplots of them.
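The tutorial names count_plot() but its body is not shown here, so the following is only one possible version of it. The YEARS list, the minimum parameter, the can_plot column and the toy values are assumptions introduced for illustration.

```python
import numpy as np
import pandas as pd

YEARS = ["2015", "2016", "2017", "2018", "2019", "2020"]


def count_plot(row, years=YEARS, minimum=3):
    """Return True if the row has at least `minimum` non-missing year values."""
    present = sum(not np.isnan(row[year]) for year in years)
    return present >= minimum


# Toy literacy data; the numbers are invented and NaN marks missing years.
literacy = pd.DataFrame({
    "Country_Name": ["Country A", "Country B"],
    "2015": [90.0, np.nan],
    "2016": [np.nan, np.nan],
    "2017": [91.0, 60.0],
    "2018": [np.nan, np.nan],
    "2019": [92.0, 61.0],
    "2020": [np.nan, np.nan],
})

# Mark each country, then keep only the ones with enough data to plot.
literacy["can_plot"] = literacy.apply(count_plot, axis=1)
plottable = literacy[literacy["can_plot"]]

print(plottable["Country_Name"].tolist())
```

A vectorized alternative that should give the same marking without a row-by-row function is literacy[YEARS].notna().sum(axis=1) >= 3.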

Above are 2 different implementations for displaying the individual literacy rates of the countries in the education dataframe. The first creates a subplot for each individual country, that is, one line graph per row of our education dataframe showing that country's literacy rate trend. This form of display is useful if we want to take a closer look at specific countries.

The second implementation is the interactive graph seen directly above. Interactive graphs allow the viewer to click and interact with the visualization in order to customize the view for their purposes. In our interactive graph, we are able to display all individual country literacy rates in one plot. This can be helpful for spotting patterns such as which country has the highest rates, which have the lowest, or which had the greatest change over time. If we want the advantage of the subplots and a snapshot view of only one country, the interactive plot offers this as well. We can double-click on a country in the key and the graph will update to display only that country's trend and hide the rest. We can reset the view by double-clicking that country again. In addition, single-clicking on a country will hide that one country from the plot. Thus we can experiment with different combinations of countries to draw meaningful inferences. The interactive graph also allows you to zoom, scale, and easily download the graph. This can be beneficial in the hypothesis phase of the data science life cycle, because you can explore different patterns in the data and begin forming a hypothesis for further analysis.

To create the interactive graph, we first need the Python plotly library, which provides functions for creating interactive plots. Next, we need to reshape the dataframe in order to plot the literacy rates. The interactive plot functions need a dataframe with separate columns for the year, the literacy rate, and the country name. Therefore, we must reshape our literacy rate dataframe into a dataframe with these 3 columns. Only then can we call the interactive plot functions. We can repeat the same process for the poverty headcount and create an interactive graph for all the countries, as seen below.
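The reshaping described above can be sketched with the pandas melt() function, followed by a plotly express line chart. This is a sketch under assumptions rather than the notebook's own code: the column names Year and Literacy_Rate, the helper make_interactive_plot() and the toy values are all hypothetical.

```python
import pandas as pd

# Toy wide-format literacy data (one column per year); values are invented.
wide = pd.DataFrame({
    "Country_Name": ["Country A", "Country B"],
    "2015": [90.0, 60.0],
    "2016": [91.0, 62.0],
})

# melt() turns the year columns into rows, giving one row per
# (country, year) pair with columns Country_Name, Year and Literacy_Rate.
long_df = wide.melt(id_vars="Country_Name", var_name="Year", value_name="Literacy_Rate")
long_df["Year"] = long_df["Year"].astype(int)


def make_interactive_plot(df):
    """Build an interactive line chart with one line per country."""
    # Imported here so the reshaping step can run without plotly installed.
    import plotly.express as px

    return px.line(df, x="Year", y="Literacy_Rate", color="Country_Name")


print(long_df)
# To display the chart: make_interactive_plot(long_df).show()
```

Passing color="Country_Name" is what produces one line and one legend entry per country, which is what makes the click-to-hide behaviour in the legend possible.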

A good way to analyze the subplot trends above is to visualize the mean literacy rate on a global scale. This allows one to compare countries to the average trend and gain a holistic understanding of the indicator being analyzed. To do this, we calculate the mean for each year and create a new dataframe containing the year and that year's mean. We can then plot the dataframe in order to analyze the overall trend. We use the Python library matplotlib, which provides a range of plotting functionality. For example, it provides the functions used above to create a subplot for each individual country, as well as general plots like the one we will create to look at the average literacy rate over time.
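A minimal sketch of the yearly mean calculation is given here. The toy values, the column names Year and Mean_Literacy_Rate and the helper plot_mean() are assumptions for illustration, not the tutorial's original code.

```python
import numpy as np
import pandas as pd

years = ["2015", "2016", "2017"]

# Toy literacy data; values are invented and NaN marks missing years.
literacy = pd.DataFrame({
    "Country_Name": ["Country A", "Country B", "Country C"],
    "2015": [90.0, 60.0, np.nan],
    "2016": [92.0, np.nan, 70.0],
    "2017": [94.0, 64.0, 74.0],
})

# mean() skips NaN by default, so each year's average uses only the
# countries that reported a value for that year.
yearly_mean = literacy[years].mean()

mean_df = pd.DataFrame({
    "Year": [int(y) for y in yearly_mean.index],
    "Mean_Literacy_Rate": yearly_mean.to_numpy(),
})


def plot_mean(df):
    """Draw the yearly mean as a single line plot and return the figure."""
    import matplotlib.pyplot as plt  # imported here so the data step runs alone

    fig, ax = plt.subplots()
    ax.plot(df["Year"], df["Mean_Literacy_Rate"], marker="o")
    ax.set_xlabel("Year")
    ax.set_ylabel("Mean literacy rate (%)")
    return fig


print(mean_df)
```

Because missing values are skipped, the set of countries behind each year's mean can differ from year to year, which is worth keeping in mind when reading changes in the average.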

The above figures show the global average literacy rate for each year as well as a breakdown of each country's literacy rate over time. The average literacy rate increases from 2015 to a peak of 82.65% in 2016 and then steadily decreases until a sharper decline from 2019 to 2020. This is surprising, as one would expect countries to be improving education and literacy rates over time rather than the opposite. To draw any conclusion about why this is the case, I next compared the average trend in literacy rates to that of the multidimensional poverty headcount.

Furthermore, the trend varies from country to country and does not follow a constant pattern overall. Some countries see an increase over time and some see a decrease. It is also important to note the variability of the plots. Due to limiting factors in data collection, not all countries were able to provide data for all of the years being analyzed, or for consecutive years. As a result, some plots are incomplete and do not have data for all the years, while others produce complete line graphs. We are unable to draw conclusions as to why this is the case and thus recognize that it can be a limiting factor in the analysis.

We now want to repeat the filtering process above so that we can make meaningful inferences from the multidimensional poverty data in the World Bank CSV file. As before, we first create a new dataframe with only the multidimensional poverty headcount ratio. Then we repeat the process: calculate the average multidimensional poverty headcount ratio for each year, create a new dataframe with this information, and plot it to visualize the global trend.

Now that we have visualized one indicator from each of the poverty and education dataframes, we can combine the two and see whether a correlation exists between them. pandas provides a merge() function that merges two dataframes together. This function is best used when the dataframes share a column on which the data can be merged. If no such column exists, consider using concat() to combine the dataframes instead. For our purposes, the Country_Name column appears in both the literacy and multidimensional poverty dataframes, so merge() makes sense to use. Furthermore, we want to do an inner join in our merge, which means that we keep only the countries that exist in both dataframes and not those that exist in only one.
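The inner join described above can be sketched as follows. Country_Name is the shared column named in the tutorial. The toy values and the suffixes chosen to tell the two sets of year columns apart are assumptions for illustration.

```python
import pandas as pd

# Toy frames; all values are invented.
literacy = pd.DataFrame({
    "Country_Name": ["Country A", "Country B", "Country C"],
    "2015": [90.0, 60.0, 70.0],
})
poverty = pd.DataFrame({
    "Country_Name": ["Country B", "Country C", "Country D"],
    "2015": [30.0, 20.0, 10.0],
})

# An inner join keeps only the countries present in both frames.
# The suffixes distinguish year columns that share a name.
merged = pd.merge(
    literacy,
    poverty,
    on="Country_Name",
    how="inner",
    suffixes=("_literacy", "_poverty"),
)

print(merged)
```

Here Country A and Country D each appear in only one frame, so neither survives the merge. Switching to how="outer" would keep them and fill the missing side with NaN.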

The above plots show the average multidimensional poverty headcount ratio over time and a comparison with the average literacy rate over time. The general trend for the average percentage of the population in multidimensional poverty is a decrease until 2019, followed by a sharp increase. The ratio reaches its lowest point in 2019 before rising to 25.42% in 2020.

By placing the average percentage of the population in multidimensional poverty side by side with the average literacy rate over time, we can draw comparisons and conclusions based on the two indicators. For starters, we can observe that, in general, while the multidimensional poverty percentage is declining, the literacy rate is increasing. Furthermore, the sharp increase in multidimensional poverty from 2019 to 2020 coincides with a sharp decrease in the literacy rate over the same period. As a result, we can infer that there might be a negative correlation between literacy and multidimensional poverty. This suggests that a proper education can aid in combating poverty within countries.
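The visual comparison can be complemented with a number. The sketch below computes the Pearson correlation coefficient between two yearly series using pandas Series.corr(). The values are invented purely to show the mechanics and are not World Bank figures, and the column names are assumptions.

```python
import pandas as pd

# Invented yearly means for illustration only; NOT World Bank figures.
means = pd.DataFrame({
    "Year": [2015, 2016, 2017, 2018],
    "Mean_Literacy_Rate": [80.0, 81.0, 82.0, 81.5],
    "Mean_Poverty_Ratio": [26.0, 25.0, 23.5, 24.0],
})

# Pearson correlation coefficient between the two series.
# Values near -1 indicate that one series tends to fall as the other rises.
r = means["Mean_Literacy_Rate"].corr(means["Mean_Poverty_Ratio"])

print(f"Pearson r = {r:.2f}")
```

With only a handful of yearly points, such a coefficient is a rough summary at best, and a negative value on its own does not show that one indicator causes the other.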

To further visualize the comparison between multidimensional poverty and literacy, we can take a closer look at the specific trends of a developing country and a developed country. To decide which countries to compare, we can filter through our merged dataframe, as it is the intersection of the literacy rate and multidimensional poverty dataframes. For our purposes, we will analyze El Salvador and Spain. This allows us to analyze a developing and a developed country that each have a sufficient amount of data to visualize.

To compare the two, we once again use the pandas library to create dataframes with the relevant information for each country. In this case, we want to extract only El Salvador's and Spain's multidimensional poverty headcount ratios and literacy rates. Once we have these dataframes, we can use matplotlib again to plot the country-specific indicators in a comparison format similar to the one above. This will again allow us to determine whether any meaningful inferences can be made.
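One possible way to extract a single country's indicators and plot them side by side is sketched here. The helpers country_series() and plot_country(), the suffixed column names and the toy values are hypothetical. Placeholder country names are used so that no real figures are implied.

```python
import pandas as pd

YEARS = ["2015", "2016"]

# Toy merged frame; values are invented and the column suffixes are assumed.
merged = pd.DataFrame({
    "Country_Name": ["Country A", "Country B"],
    "2015_literacy": [88.0, 98.0],
    "2016_literacy": [89.0, 98.1],
    "2015_poverty": [30.0, 5.0],
    "2016_poverty": [28.0, 4.8],
})


def country_series(df, country, suffix, years=YEARS):
    """Return one indicator for one country as a Series indexed by year."""
    row = df.loc[df["Country_Name"] == country].iloc[0]
    values = [row[f"{year}_{suffix}"] for year in years]
    return pd.Series(values, index=[int(year) for year in years], name=suffix)


def plot_country(df, country):
    """Plot poverty and literacy for one country in side-by-side panels."""
    import matplotlib.pyplot as plt  # imported here so the data step runs alone

    fig, (ax_pov, ax_lit) = plt.subplots(1, 2, figsize=(10, 4))
    country_series(df, country, "poverty").plot(
        ax=ax_pov, marker="o", title=f"{country}: poverty headcount (%)")
    country_series(df, country, "literacy").plot(
        ax=ax_lit, marker="o", title=f"{country}: literacy rate (%)")
    return fig


print(country_series(merged, "Country A", "literacy"))
```

With the real merged dataframe, the same helper could be called once per country, for example plot_country(merged, "El Salvador"), assuming the columns follow the suffix pattern used here.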

As in the comparison of average literacy rates and multidimensional poverty above, we can compare the literacy rate and multidimensional poverty of El Salvador specifically. Above are two plots for El Salvador. One shows the percentage of the population in poverty over time, while the other shows the percentage of the adult population that is literate. The trend observed in El Salvador corroborates the international average trend. From 2016 to 2019, El Salvador sees a steady decline in the percentage of individuals in multidimensional poverty. During this same time span, it also experiences a steady increase in the percentage of adults who are literate. This is consistent with a negative correlation between the two indicators and further supports the idea that increasing education helps reduce poverty.

We can employ similar strategies to compare Spain's literacy rate and multidimensional poverty. The above plot shows the general trends for Spain in multidimensional poverty and literacy rates from 2015 to 2020. It is important to note that Spain does not have a complete set of literacy data over the 6-year span being analyzed. It is missing data for 2017 and 2019. As a result, the line graph is not continuous over the 6 years and breaks off, leaving isolated data points at 2018 and 2020. However, the data we do have shows a general increase in literacy rates over time. The change is minimal, going from 98.2 in 2015 to 98.6 in 2020. Spain also sees a decrease in the percentage of the population in multidimensional poverty over the 6-year span. This once again corroborates the trend seen in El Salvador and in the average literacy and multidimensional poverty rates. El Salvador is a developing nation while Spain is considered more developed, yet both show similar trends. The difference in status may also explain why Spain saw a smaller change over time than El Salvador, as it has less room for improvement.

We can explore these ideas further by once again using the plotly functions to create interactive plots of the literacy rates and multidimensional poverty headcount ratios of Spain and El Salvador. By plotting both countries' literacy rates and poverty headcounts on one plot, we can compare the trends and progress of the two countries. As mentioned previously, Spain is more developed than El Salvador. This is corroborated by the plots, where Spain has higher literacy rates and a lower share of the population in multidimensional poverty than El Salvador. Visualizing both countries in one plot makes these conclusions easy to see. Furthermore, the interactive graph allows the user to hover over data points to see the exact literacy rate or poverty headcount for a country in a given year. This flexibility can be beneficial when attempting to draw meaningful inferences.

4. Conclusion

Data science is a useful way to visualize and quantify large amounts of data in order to draw meaningful inferences. In our case, we were able to use data science and analysis strategies to draw meaningful conclusions about education and poverty on an international scale. This is relevant because it can show leaders that improving and implementing policies to increase education could have a direct positive impact on the fight against poverty globally.

This tutorial goes through different strategies for cleaning and filtering dataframes in order to create useful visualizations for meaningful inferences. Due to the structure of the data used in the tutorial, the most meaningful type of plot for any of the data provided is the line plot. This is because the education and poverty CSV files all provide indicators over a span of time. Generally, when analyzing trends over time, line graphs are the best way to visualize the data. Thus, we walked through multiple ways to use line graphs to make different types of comparisons and draw conclusions. I encourage you to repeat the steps of the tutorial with some of the other indicators discussed, both to practice data visualization and to draw other kinds of conclusions about education and poverty. For example, one could explore the potential correlation between Poverty Gap at $3.20 a day and Education Attainment. This would mean looking at the average education level a person attains in a country and whether that correlates with the number of people in the poverty gap. This is just one example among many conclusions that can be drawn from the given data.

The general process for data analysis begins with data collection and processing. In this tutorial, we skipped the data collection step, as the World Bank completed it for us by creating the CSV files with the indicators we wanted to use. We processed the data by extracting it into dataframes that we then cleaned and tidied. Next comes the visualization and analysis step, in which we began to form plots of the data from the dataframes. From here we can begin to analyze the data and form hypotheses based on the trends we are seeing. This allows us to then design machine learning models based on our hypotheses and the data collected. These models can be used to predict future patterns, which in turn can provide insight for policy modifications and other decisions. In our case, it could result in initiatives to improve education in order to reduce poverty within countries.

5. Resources

The best way to learn data science is to begin analyzing a trend that interests you and learn as you analyze the data. The libraries used in this tutorial (matplotlib, pandas, plotly, and numpy) are some of the most common Python libraries used in data analysis. To learn more about the functionality of these libraries, their documentation pages are linked below:

Here is also a link to the GitHub repository in which I created this tutorial: Link to repository

Data science is a valuable skill set that allows one to quantify and visualize large amounts of data. For this reason, and because of new access to big data, the field has continued to boom and data scientists continue to be in higher demand. If you are looking to explore data science further, here are some useful resources to expand your knowledge of the subject.

Lastly, if you are looking to pursue data science as a future career, here are some resources and background materials that are useful when looking for job opportunities in the field.

Data science is a field that provides the opportunity to connect people who have a computer science background with those who do not, so that they can collaborate on policy changes and decisions that drive innovation. As a result, data scientists carry the tools needed to begin conversations and to back arguments for change with statistics and facts. We therefore encourage you to follow the tutorial and try comparing some of the other indicators described at the beginning. The process for quantifying and analyzing those indicators will be very similar to the one mapped out in this tutorial.