2019
In this tutorial, I will walk you through a series of lines of code that import, summarize, sort, filter, and analyze data from the World Bank Open Data repository. For more on working with this dataset, please also see:
You can also find more information on working with World Bank Open Data using Python’s Pandas package here:
Please refer to this documentation for further explanation and whenever you run into trouble.
Open your terminal and run the following commands to check your installed packages and install the ones you need:
conda list
conda install pandas
conda install numpy
conda install wbdata
pip install wbdata
#need to install wbdata, pandas, and numpy first through the conda command in the terminal;
#i.e. "conda install wbdata" (the os module ships with Python's standard library, so it needs no install)
##if conda install doesn't work, you can also try "pip install wbdata"
import wbdata
import pandas as pd
import numpy as np
import os
##set the folder containing your Python files as your working directory (replace the path below with your own)
os.chdir("C:/Users/mikecz/Documents/usc/IDrive-Sync/Teaching/498_Digital/python")
#wbdata.get_source()
#wbdata.get_country()
#wbdata.get_topic()
#wbdata.search_indicators("GDP per capita")
#wbdata.search_indicators("CO2 emissions")
#LAMcountries=['ARG','BLZ','BOL','BRA','CHL','COL','CRI','CUB','SLV','GTM','GUY','HTI','HND',
# 'JAM','MEX','NIC','PAN','PRY','PER','PRI','SUR','URY','VEN']
#NAMcountries=['USA','CAN']
#create a Python list of the countries for which you would like to download data.
## use the wbdata.get_country() command to see a list of country abbreviations for the wbdata dataset
selcountries=['ARG','BRA','CHL','CUB','MEX','PER','VEN','USA']
#urbanind={"EN.ATM.PM25.MC.ZS":"air_poll",
# "SP.URB.TOTL":"urbanpop",
# "EN.URB.LCTY.UR.ZS":"popLgstCity",
# "EN.POP.EL5M.UR.ZS":"urbBelow5m",
# "EN.POP.SLUM.UR.ZS":"popSlums",
# "SP.POP.TOTL":"totalpop",
# "EN.POP.DNST":"popDens",
# "1.1_ACCESS.ELECTRICITY.TOT":"electAccess",
# "1.3_ACCESS.ELECTRICITY.URBAN":"electAccessUrb"
# }
#create a Python dictionary that maps the identifiers of the wbdata variables you would like to use to more common-sense names
lifeQualInd={
"NY.GDP.PCAP.PP.KD":"gdppc",
"NY.GNP.PCAP.PP.CD":'gnipc',
"FP.CPI.TOTL.ZG":"inflation",
"SE.PRM.CUAT.ZS":"educAtt_prim_25+",
"SN.ITK.DEFC.ZS":"preval_undernour",
"SM.POP.NETM":"netMig",
"SP.DYN.LE00.IN":"life_exp",
"SE.ADT.LITR.ZS":"literacyRate",
"1.1_ACCESS.ELECTRICITY.TOT":"electAccess",
"SP.POP.TOTL":"totalpop",
"SP.URB.TOTL":"urbanpop"
}
#assign the above dictionary of variables to a new name
indicators = lifeQualInd
#download data from wbdata given the countries and variables listed above
df=wbdata.get_dataframe(indicators,country=selcountries)
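It helps to know the shape of what get_dataframe returns before you work with it: a pandas DataFrame with a (country, date) MultiIndex and one column per indicator. The sketch below builds a small synthetic frame of that shape so you can experiment offline; all of the numbers are invented for illustration only.

```python
import pandas as pd

# Synthetic stand-in for the frame wbdata.get_dataframe returns:
# a MultiIndex of (country, date) with one column per indicator.
# All values here are invented, not real World Bank figures.
index = pd.MultiIndex.from_product(
    [["Mexico", "United States"], ["2016", "2015"]],
    names=["country", "date"],
)
demo = pd.DataFrame(
    {"gdppc": [17000.0, 16800.0, 57000.0, 56000.0],
     "totalpop": [123e6, 121e6, 323e6, 321e6]},
    index=index,
)
print(list(demo.index.names))  # ['country', 'date']
print(demo.shape)              # (4, 2)
```

Because country and date live in the index rather than in columns, most of the filtering steps below start by calling reset_index().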
#some useful functions for retrieving summary data
#df.head()
#df.tail()
#df.info()
### get column names
#df.columns
#df.columns.values
#df.index.tolist()
#list(df)
#df.columns.tolist()
#df.keys()
#df.describe()
#le=df['life_exp']
#le.max()
#le.min()
#le.describe()
##create new columns using values of existing columns
df['pcturb']=df['urbanpop']/df['totalpop']*100
#df['pcturb_inSlums']=df['popSlums']/df['urbanpop']*100
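To check that the percent-urban calculation behaves as expected, here is a minimal sketch with invented population numbers:

```python
import pandas as pd

# Invented population numbers to verify the percent-urban calculation:
# 50 of 100 people urban -> 50%, 30 of 120 -> 25%
toy = pd.DataFrame({"urbanpop": [50.0, 30.0], "totalpop": [100.0, 120.0]})
toy["pcturb"] = toy["urbanpop"] / toy["totalpop"] * 100
print(toy["pcturb"].tolist())  # [50.0, 25.0]
```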
##move index values into columns by re-assigning country and date as columns rather than index levels
###this will allow us to filter and sort the dataset by country and date more easily
df2=df.copy()
df2.reset_index(inplace=True)
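Here is a minimal sketch, with invented values, of what reset_index does: the country and date index levels become ordinary columns that you can filter and sort on.

```python
import pandas as pd

# Synthetic (country, date)-indexed frame like the one wbdata returns;
# the gdppc values are invented
idx = pd.MultiIndex.from_tuples(
    [("Brazil", "2016"), ("Brazil", "2015")], names=["country", "date"]
)
small = pd.DataFrame({"gdppc": [14000.0, 14500.0]}, index=idx)

flat = small.reset_index()    # country and date become ordinary columns
print(flat.columns.tolist())  # ['country', 'date', 'gdppc']
```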
#select certain years
df2['date']=pd.to_numeric(df2['date']) #need to first convert the date from string to integer
df_selyrs=df2[(df2['date']>=2014) & (df2['date']<=2016)]
df2016=df2[df2['date']==2016]
#select only certain countries
seldf=df2[df2['country'].isin(['Mexico','United States','Brazil'])]
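The isin filter above keeps only the rows whose country appears in the given list. A small sketch with made-up values:

```python
import pandas as pd

# Made-up rows to show how .isin() keeps only the listed countries
toy = pd.DataFrame({
    "country": ["Mexico", "Cuba", "Brazil", "Chile"],
    "gdppc": [17000.0, 8000.0, 14000.0, 22000.0],
})
subset = toy[toy["country"].isin(["Mexico", "Brazil"])]
print(subset["country"].tolist())  # ['Mexico', 'Brazil']
```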
#select only certain columns
gdpOnly=df2[['country','date','gdppc']]
##check what data is missing
dfnopop=df2[df2['totalpop'].isnull()]
dfnogdp=df2[df2['gdppc'].isnull()]
print(dfnogdp['date'].value_counts())
print(dfnogdp['country'].value_counts())
dfnolfexp=df2[df2['life_exp'].isnull()]
print(dfnolfexp['date'].value_counts())
##remove all rows that have a NaN for a particular column and check what years and countries are left
### https://stackoverflow.com/questions/13413590/how-to-drop-rows-of-pandas-dataframe-whose-value-in-certain-columns-is-nan
df_haspop=df2.dropna(subset=['totalpop'])
df_haslfexp=df2.dropna(subset=['life_exp'])
df_hasgdp=df2.dropna(subset=['gdppc'])
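A quick sketch, with invented rows, showing that dropna(subset=[...]) drops only the rows missing a value in that particular column, leaving NaNs in other columns untouched:

```python
import pandas as pd
import numpy as np

# Invented rows, one with a missing gdppc, to show dropna(subset=...);
# note Chile is kept even though its life_exp is missing,
# because we only filter on the gdppc column
toy = pd.DataFrame({
    "country": ["Cuba", "Chile", "Peru"],
    "gdppc": [np.nan, 22000.0, 12000.0],
    "life_exp": [79.0, np.nan, 75.0],
})
has_gdp = toy.dropna(subset=["gdppc"])  # drops only rows where gdppc is NaN
print(has_gdp["country"].tolist())      # ['Chile', 'Peru']
```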
#sort dataset by the values of one variable
#sort_values returns a sorted copy; assign the result to a name if you want to keep it
print(df2016.sort_values(by='life_exp'))
print(df2016.sort_values(by='life_exp',ascending=False))
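A small sketch of ascending versus descending sorts, using invented life-expectancy values:

```python
import pandas as pd

# Invented life-expectancy values to illustrate the two sort directions
toy = pd.DataFrame({
    "country": ["Argentina", "Chile", "Peru"],
    "life_exp": [76.0, 80.0, 75.0],
})
asc = toy.sort_values(by="life_exp")                    # lowest first
desc = toy.sort_values(by="life_exp", ascending=False)  # highest first
print(asc["country"].tolist())   # ['Peru', 'Argentina', 'Chile']
print(desc["country"].tolist())  # ['Chile', 'Argentina', 'Peru']
```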
#Once you have created a manageable dataframe, you may export it to a CSV file for later re-use in Python or with other programs.
##in this case, the code below exports the dataframe subset you created above, which removes all countries/years with missing data for life expectancy.
df_haslfexp.to_csv('LatAm_US_lifeexp_to2016.csv')
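To confirm an export worked, you can read the file back in and compare. The sketch below round-trips a tiny invented frame through a throwaway file name (demo_roundtrip.csv is just an example, not the tutorial's output file); passing index=False keeps the row index out of the file so the frame comes back unchanged.

```python
import pandas as pd

# Round-trip check: write a tiny invented frame to a throwaway CSV
# and read it back; index=False keeps the row index out of the file
toy = pd.DataFrame({"country": ["Mexico", "Chile"], "life_exp": [77.1, 80.0]})
toy.to_csv("demo_roundtrip.csv", index=False)
back = pd.read_csv("demo_roundtrip.csv")
print(back.equals(toy))  # True
```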
Again, as you run this code, keep referring to the links above for additional information, or rely on Google searches to help you find more (e.g. search for “Python Pandas how to subset dataframe by values of multiple columns”).
Next, continue to the next lesson to work on visualizing this data.