Cheat Sheet - Working with World Bank Data with Pandas

Jeremy M. Mikecz

2019

498 Home Syllabus Schedule Mikecz Home Tutorials

In this tutorial, I will provide you a series of lines of code that import, summarize, sort, filter, and analyze date from the World Bank Open Data depository. For more on working with this dataset, please also see:

  1. My “Working with World Bank Data” tutorial

You can also find more information for working with World Bank Open data with Python’s Pandas package here:

Please refer to this documentation for further explanation and when you run into any trouble.

Install wbdata (the World Bank Data package for Python)

  1. Open your terminal.

  2. Check to see if pandas, numpy, and wbdata are installed:
    1. Open the terminal
    2. Type to see if pandas is installed:
    conda list
    1. If not installed, install by typing (note install numpy too as it is often used in conjunction with pandas):
    conda install pandas
    conda install numpy
    conda install wbdata
    1. If the “conda install” command does not work for any of these packages try “pip” instead:
    pip install wbdata

Begin working with World Bank Data in Spyder / Python

  1. Open Spyder
  2. Create a new Python program in the editor and save it in your Python folder.
  3. Copy and paste the following code in your new program in Spyder. Run it one line or one group of code at a time, paying attention to what each line of code does. Also, note the comments (set aside from the code Python reads by a “#”) that mark either:
    1. some guidelines / comments for the human user to explain what the code does (you should add additional comments of your own to help you remember what each line of code does) or
    2. stored lines of code that you many not want to run now, but want to save for later use.
    #need to install wbdata, pandas, numpy, and os first through the conda command on the terminal; 
    #i.e. "conda install wbdata"
    ##if conda install doesn't work you can also try "pip install wbdata"
    import wbdata
    import pandas as pd, numpy as np
    import os
    
    ##set your folder of Python files as your working directory
    os.chdir("C:/Users/mikecz/Documents/usc/IDrive-Sync/Teaching/498_Digital/python")
    #wbdata.get_source()
    #wbdata.get_country()
    #wbdata.get_topic()
    #wbdata.search_indicators("GDP per capita")
    #wbdata.search_indicators("CO2 emissions")
    
    #LAMcountries=['ARG','BLZ','BOL','BRA','CHL','COL','CRI','CUB','SLV','GTM','GUY','HTI','HND',
    #              'JAM','MEX','NIC','PAN','PRY','PER','PRI','SUR','URY','VEN']
    #NAMcountries=['USA','CAN']
    
    #create a Python list of the countries for which you would like to download data.
    ## use the wbdata.get_country() command to see a list of country abbreviations for the wbdata dataset
    selcountries=['ARG','BRA','CHL','CUB','MEX','PER','VEN','USA']
    
    
    #urbanind={"EN.ATM.PM25.MC.ZS":"air_poll",
    #           "SP.URB.TOTL":"urbanpop",
    #           "EN.URB.LCTY.UR.ZS":"popLgstCity",
    #           "EN.POP.EL5M.UR.ZS":"urbBelow5m",
    #           "EN.POP.SLUM.UR.ZS":"popSlums",
    #           "SP.POP.TOTL":"totalpop",
    #           "EN.POP.DNST":"popDens",
    #           "1.1_ACCESS.ELECTRICITY.TOT":"electAccess",
    #           "1.3_ACCESS.ELECTRICITY.URBAN":"electAccessUrb"
    #        }
    
    #create a Python dictionary with the identifiers of the wbdata variables you would like to use, assigning to each a more common-sense name for each variable
    lifeQualInd={
                "NY.GDP.PCAP.PP.KD":"gdppc", 
                "NY.GNP.PCAP.PP.CD":'gnipc',
                "FP.CPI.TOTL.ZG":"inflation",
                "SE.PRM.CUAT.ZS":"educAtt_prim_25+",
                "SN.ITK.DEFC.ZS":"preval_undernour",
                "SM.POP.NETM":"netMig",
                "SP.DYN.LE00.IN":"life_exp",
                "SE.ADT.LITR.ZS":"literacyRate",
                "1.1_ACCESS.ELECTRICITY.TOT":"electAccess",
                "SP.POP.TOTL":"totalpop",
                "SP.URB.TOTL":"urbanpop"
            }
    
    
    #assign the above dictionary of variables to a new name
    indicators = lifeQualInd
    
    #download data from wbdata given the countries and variables listed above
    df=wbdata.get_dataframe(indicators,country=selcountries)
    
    #some useful functions for retrieving summary data
    #df.head()
    #df.tail()
    #df.info()
    ### get column names
    #df.columns
    #df.columns.values
    #df.index.tolist()
    #list(df)
    #df.columns.tolist()
    #df.keys()
    #df.describe()
    #le=df['life_exp']
    #le.max()
    #le.min()
    #le.describe()
    
    ##create new columns using values of existing columns
    df['pcturb']=df['urbanpop']/df['totalpop']*100
    #df['pcturb_inSlums']=df['popSlums']/df['urbanpop']*100
    
    ##move index values into columns; by re-assigning the country and date as columns rather than indices
    ###this will allow us to more easily filter and sort the dataset by country and date
    df2=df.copy()
    df2.reset_index(inplace=True)
    
    
    #select certain years
    df2['date']=pd.to_numeric(df2['date']) #need to 1st convert data from string to integer
    df_selyrs=df2[(df2['date']>=2014) & (df2['date']<=2016)]
    
    df2016=df2[df2['date']==2016]
    
    
    #select only certain countries
    seldf=df2[df2['country'].isin(['Mexico','United States','Brazil'])]
    
    
    #select only certain columns
    gdpOnly=df2[['country','date','gdppc']]
    
    
    ##check what data is missing
    dfnopop=df2[df2['totalpop'].isnull()]
    
    dfnogdp=df2[df2['gdppc'].isnull()]
    dfnogdp['date'].value_counts()
    dfnogdp['country'].value_counts()
    
    dfnolfexp=df2[df2['life_exp'].isnull()]
    dfnolfexp['date'].value_counts()
    
    
    ##remove all rows that have an 'na' for a particular column and check what years and countries are left
    ### https://stackoverflow.com/questions/13413590/how-to-drop-rows-of-pandas-dataframe-whose-value-in-certain-columns-is-nan
    df_haspop=df2.dropna(subset=['totalpop'])
    df_haslfexp=df2.dropna(subset=['life_exp'])
    df_hasgdp=df2.dropna(subset=['gdppc'])
    
    
    #sort dataset by the values of one variable
    df2016.sort_values(by='life_exp')
    df2016.sort_values(by='life_exp',ascending=False)
    
    #Once you have created a manageable dataframe you would like to use, you may export it to a csv for later re-use in Python or with other programs.
    ##in this case, the code below exports the dataframe subset you created above, which removes all countries/years with missing data for life expectancy.
    df_haslfexp.to_csv('LatAm_US_lifeexp_to2016.csv')
    

Again, as you run this code, keep referring to the links above for additional information or rely on Google Searches to help you find additional information (i.e. search for “Python Pandas how to subset dataframe by values of multiple columns”)

Next, continue to the next lesson to work on visualizing this data.