Working with World Bank Data with Pandas

Jeremy M. Mikecz

2019

498 Home Syllabus Schedule Mikecz Home Tutorials

Please also see:

Tasks

  1. Open your terminal.

  2. Check to see if pandas, numpy, and wbdata are installed:
    1. Open the terminal
    2. Type to see if pandas is installed:
    conda list
    1. If not installed, install by typing (note install numpy too as it is often used in conjunction with pandas):
    conda install pandas
    conda install numpy
    conda install wbdata
    1. If the “conda install” command does not work for any of these packages try “pip” instead:
    pip install wbdata
  3. In Spyder, create and save a new program in the editor. In the program:

    1. First, import pandas, numpy, and os
    import pandas as pd
    import numpy as np
    import os

    Run this code.

  4. Set your working directory(The path to your folder of Python files):

    os.chdir("C:/Users/mikecz/Documents/python") #replace with your folder path
  5. choose which countries for which you wish to download data. Some examples include:

    #wbdata.get_country(300) # to see list of countries
    LAMcountries=['ARG','BLZ','BOL','BRA','CHL','COL','CRI','CUB','SLV','GTM','GUY','HTI','HND',
              'JAM','MEX','NIC','PAN','PRY','PER','PRI','SUR','URY','VEN']
    NAMcountries=['USA','CAN']
    #regions=['LCN','WLD','NAC'] #WORLD = WLD; LAT AM AND CARRIBEAN = LCN; NAC=North America
    regions=['NAC','WLD','LCN']
    allcountries=LAMcountries+NAMcountries
    allregions=allcountries+regions
  6. to browse available wbdata variables you have two options:

    1. browse online
    2. explore with Python, i.e.:

      wbdata.get_topic()
      
      1       Agriculture & Rural Development
      2       Aid Effectiveness
      3       Economy & Growth
      4       Education
      5       Energy & Mining
      6       Environment
      7       Financial Sector
      8       Health
      9       Infrastructure
      10      Social Protection & Labor
      11      Poverty
      12      Private Sector
      13      Public Sector
      14      Science & Technology
      15      Social Development
      16      Urban Development
      17      Gender
      18      Millenium development goals
      19      Climate Change
      20      External Debt
      21      Trade

      To explore variables (“indicators”) available for one of these topics enter:

      wbdata.get_indicator(topic=17)
      
      WP15163_4.3             Mobile account, female (% age 15+) [w2]
      WP15163_4.2             Mobile account, male (% age 15+) [w2]
      WP_time_01.3            Account at a financial institution, female (% age 15+) [ts]
      WP_time_01.2            Account at a financial institution, male (% age 15+) [ts]
      SP.URB.TOTL.MA.ZS       Urban population, male (% of total)
      SP.URB.TOTL.FE.ZS       Urban population, female (% of total)
      ....#much more returned....

      You can also search for variables (“indicators”) by keywords:

      wbdata.search_indicators("GDP per capita")
  7. To choose which variables (“indicators”) you will import, create a dictionary linking the wbdata label for the variable with your own, more common-sense label, like this:

    indicators={"EN.ATM.PM25.MC.ZS":"air_poll",
           "SP.URB.TOTL":"urbanpop",
           "EN.URB.LCTY.UR.ZS":"popLgstCity",
           "EN.POP.EL5M.UR.ZS":"urbBelow5m",
           "EN.POP.SLUM.UR.ZS":"popSlums",
           "SP.POP.TOTL":"totalpop",
           "EN.POP.DNST":"popDens",
           "1.1_ACCESS.ELECTRICITY.TOT":"electAccess",
           "1.3_ACCESS.ELECTRICITY.URBAN":"electAccessUrb"
        }
    
  8. With your dictionary of indicators and list of countries, you can now import the date as a pandas dataframe using:

    df=wbdata.get_dataframe(indicators,country=allregions)
  9. Now that your World Bank data has been imported as a dataframe, you review summary information about it just as you would any other dataframe:

    df.shape
    df.columns.values
    df.info()
    df.describe()
    df.head()
    df.head(10)
    df.tail()
    df.tail(10)
  10. Notice when you view your data using df.head() or df.tail() the country and date are listed as indices for each row rather than as separate columns. For our purposes, we will convert the indices (country and date) to columns.

    df2=df.copy() #creates a copy of dataframe; if you instead  do df2=df all changes to df2 will impact df
    df2.reset_index(inplace=True)
  11. Also, for the purpose of labeling points on our (future) data visualizations, we should create a separate column combining the country and date into one variable.

    df2['text']=df2['country']+df2['date'].map(str)
  12. Now, we can create new columns. For example, we will want to convert absolute numbers into meaningful percentages (i.e. % of population living in urban areas or % of land that is arable)

    df2['pcturb']=df2['urbanpop']/df2['totalpop']*100
    df2['pcturb_inSlums']=df2['popSlums']/df2['urbanpop']*100
  13. To calculate the mean (average) of these new columns, we can do either:

    df2['pcturb'].mean()
    df2['pcturb'].describe()
  14. Subsetting Data. Often it is useful to create smaller subsets of your dataframe. For example, if you just want to see all countries with a current population over 1 million people. Our all instances (country / year data points) that have population data.

    1. If we only wanted to experiment with a few rows of data, we can do something like this:
    df_1st5rows=df2[0:5]

    Notice how this works similar to subsetting the first ten items from a list. Subsetting the first ten columns, however, is a little different:

    df_1st5cols=df2.iloc[:,0:10]

    You can also create a subset for just a few countries:

    df_selctrys=df2[df2['country'].isin(['ARG','BRA','COL','VEN','MEX'])]

    or you can select for a chosen few years:

    df_selyrs=df2[(df2['date']>=2014) & (df2['date']<=2016)] #produces an error

    If this produces an era it is because the values in the ‘date’ column are recorded as strings not integers. To convert the entire column to integers try:

    df2['date']=pd.to_numeric(df2['date'])
    df_selyrs=df2[(df2['date']>=2014) & (df2['date']<=2016)]
    1. We may also want to identify and then eliminate missing data for particular variables.

    First, let us identify missing data. In your console, type “df2.info()”. This reveals the number of missing data for each variable. Some data is more complete than others. For example, for the variable “totalpop” notice we have 1566 out of 1593 possible data points. That means out of 1593 possible combinations of countries/regions and years (1960 - 2018) we are missing only 27, which likely means we are missing data for only one year (for all 27 countries). Let’s find out what year is missing.

    ~~~~
    dfnas=df2[df2['totalpop'].isnull()] 
    ~~~~

    You can then view the ‘dfnas’ dataframe to see what year was missing.

    To drop all instances (rows) of missing data for a particular variable, try the following code:

    df3=df2.dropna(subset=['totalpop'])

    Now, compare the size of the two dataframes:

    df2.shape
    df3.shape
  15. Subset and sort data:

    Let us begin by retrieving the data for just one year: 2016, the most recent year with mostly complete data.

    df2016=df3[df3['date']==2016]

    Next, let us sort this data to see which country has the highest percent of people living in urban areas:

    df2016.sort_values(by='pcturb')

    If you want to sort if from most urban to less, you can modify it as:

    df2016.sort_values(by='pcturb',ascending=False)
  16. You can also subset your dataframe by a conditional. For example, if you only wanted those data points (countries and years) which had a life expectancy greater than 75, you would type:

    longlifedf=df[df['life_exp']>75]

    To see what countries and years remain in this dataset and, therefore, have / had a life expectancy greater than 75, see:

    longlifedf['country'].value_counts()
    longlifedf['date'].value_counts()