2019
In this tutorial, I will walk you through a series of lines of code that import, summarize, sort, filter, and analyze data from the World Bank Open Data repository. For more on working with this dataset, please also see:
You can also find more information on working with World Bank Open Data using Python’s Pandas package here:
Please refer to this documentation for further explanation and whenever you run into trouble.
Open your terminal and run the following commands to check your installed packages and install the ones you need:
conda list
conda install pandas
conda install numpy
conda install wbdata
pip install wbdata
#need to install wbdata, pandas, and numpy first through the conda command in the terminal;
#i.e. "conda install wbdata" (the os module ships with Python's standard library, so it needs no install)
##if conda install doesn't work, you can also try "pip install wbdata"
import wbdata
import pandas as pd
import numpy as np
import os
##set the folder containing your Python files as your working directory (replace the path below with your own)
os.chdir("C:/Users/mikecz/Documents/usc/IDrive-Sync/Teaching/498_Digital/python")
#wbdata.get_source()
#wbdata.get_country()
#wbdata.get_topic()
#wbdata.search_indicators("GDP per capita")
#wbdata.search_indicators("CO2 emissions")
#LAMcountries=['ARG','BLZ','BOL','BRA','CHL','COL','CRI','CUB','SLV','GTM','GUY','HTI','HND',
# 'JAM','MEX','NIC','PAN','PRY','PER','PRI','SUR','URY','VEN']
#NAMcountries=['USA','CAN']
#create a Python list of the countries for which you would like to download data.
## use the wbdata.get_country() command to see a list of country abbreviations for the wbdata dataset
selcountries=['ARG','BRA','CHL','CUB','MEX','PER','VEN','USA']
#urbanind={"EN.ATM.PM25.MC.ZS":"air_poll",
# "SP.URB.TOTL":"urbanpop",
# "EN.URB.LCTY.UR.ZS":"popLgstCity",
# "EN.POP.EL5M.UR.ZS":"urbBelow5m",
# "EN.POP.SLUM.UR.ZS":"popSlums",
# "SP.POP.TOTL":"totalpop",
# "EN.POP.DNST":"popDens",
# "1.1_ACCESS.ELECTRICITY.TOT":"electAccess",
# "1.3_ACCESS.ELECTRICITY.URBAN":"electAccessUrb"
# }
#create a Python dictionary that maps the identifiers of the wbdata variables you would like to use to more common-sense names
lifeQualInd={
"NY.GDP.PCAP.PP.KD":"gdppc",
"NY.GNP.PCAP.PP.CD":'gnipc',
"FP.CPI.TOTL.ZG":"inflation",
"SE.PRM.CUAT.ZS":"educAtt_prim_25+",
"SN.ITK.DEFC.ZS":"preval_undernour",
"SM.POP.NETM":"netMig",
"SP.DYN.LE00.IN":"life_exp",
"SE.ADT.LITR.ZS":"literacyRate",
"1.1_ACCESS.ELECTRICITY.TOT":"electAccess",
"SP.POP.TOTL":"totalpop",
"SP.URB.TOTL":"urbanpop"
}
#assign the above dictionary of variables to a new name
indicators = lifeQualInd
#download data from wbdata given the countries and variables listed above
df=wbdata.get_dataframe(indicators,country=selcountries)
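It helps to know the shape of what get_dataframe returns before you work with it: a pandas DataFrame with a (country, date) MultiIndex and one column per indicator. The sketch below builds a small synthetic frame of that shape so you can experiment offline; all of the numbers are invented for illustration only.

```python
import pandas as pd

# Synthetic stand-in for the frame wbdata.get_dataframe returns:
# a MultiIndex of (country, date) with one column per indicator.
# All values here are invented, not real World Bank figures.
index = pd.MultiIndex.from_product(
    [["Mexico", "United States"], ["2016", "2015"]],
    names=["country", "date"],
)
demo = pd.DataFrame(
    {"gdppc": [17000.0, 16800.0, 57000.0, 56000.0],
     "totalpop": [123e6, 121e6, 323e6, 321e6]},
    index=index,
)
print(list(demo.index.names))  # ['country', 'date']
print(demo.shape)              # (4, 2)
```

Because country and date live in the index rather than in columns, most of the filtering steps below start by calling reset_index().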
#some useful functions for retrieving summary data
#df.head()
#df.tail()
#df.info()
### get column names
#df.columns
#df.columns.values
#df.index.tolist()
#list(df)
#df.columns.tolist()
#df.keys()
#df.describe()
#le=df['life_exp']
#le.max()
#le.min()
#le.describe()
##create new columns using values of existing columns
df['pcturb']=df['urbanpop']/df['totalpop']*100
#df['pcturb_inSlums']=df['popSlums']/df['urbanpop']*100
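To check that the percent-urban calculation behaves as expected, here is a minimal sketch with invented population numbers:

```python
import pandas as pd

# Invented population numbers to verify the percent-urban calculation:
# 50 of 100 people urban -> 50%, 30 of 120 -> 25%
toy = pd.DataFrame({"urbanpop": [50.0, 30.0], "totalpop": [100.0, 120.0]})
toy["pcturb"] = toy["urbanpop"] / toy["totalpop"] * 100
print(toy["pcturb"].tolist())  # [50.0, 25.0]
```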
##move index values into columns by re-assigning country and date as columns rather than index levels
###this will allow us to filter and sort the dataset by country and date more easily
df2=df.copy()
df2.reset_index(inplace=True)
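Here is a minimal sketch, with invented values, of what reset_index does: the country and date index levels become ordinary columns that you can filter and sort on.

```python
import pandas as pd

# Synthetic (country, date)-indexed frame like the one wbdata returns;
# the gdppc values are invented
idx = pd.MultiIndex.from_tuples(
    [("Brazil", "2016"), ("Brazil", "2015")], names=["country", "date"]
)
small = pd.DataFrame({"gdppc": [14000.0, 14500.0]}, index=idx)

flat = small.reset_index()    # country and date become ordinary columns
print(flat.columns.tolist())  # ['country', 'date', 'gdppc']
```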
#select certain years
df2['date']=pd.to_numeric(df2['date']) #need to first convert the date from string to integer
df_selyrs=df2[(df2['date']>=2014) & (df2['date']<=2016)]
df2016=df2[df2['date']==2016]
#select only certain countries
seldf=df2[df2['country'].isin(['Mexico','United States','Brazil'])]
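The isin filter above keeps only the rows whose country appears in the given list. A small sketch with made-up values:

```python
import pandas as pd

# Made-up rows to show how .isin() keeps only the listed countries
toy = pd.DataFrame({
    "country": ["Mexico", "Cuba", "Brazil", "Chile"],
    "gdppc": [17000.0, 8000.0, 14000.0, 22000.0],
})
subset = toy[toy["country"].isin(["Mexico", "Brazil"])]
print(subset["country"].tolist())  # ['Mexico', 'Brazil']
```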
#select only certain columns
gdpOnly=df2[['country','date','gdppc']]
##check what data is missing
dfnopop=df2[df2['totalpop'].isnull()]
dfnogdp=df2[df2['gdppc'].isnull()]
print(dfnogdp['date'].value_counts())
print(dfnogdp['country'].value_counts())
dfnolfexp=df2[df2['life_exp'].isnull()]
print(dfnolfexp['date'].value_counts())
##remove all rows that have a NaN for a particular column and check what years and countries are left
### https://stackoverflow.com/questions/13413590/how-to-drop-rows-of-pandas-dataframe-whose-value-in-certain-columns-is-nan
df_haspop=df2.dropna(subset=['totalpop'])
df_haslfexp=df2.dropna(subset=['life_exp'])
df_hasgdp=df2.dropna(subset=['gdppc'])
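A quick sketch, with invented rows, showing that dropna(subset=[...]) drops only the rows missing a value in that particular column, leaving NaNs in other columns untouched:

```python
import pandas as pd
import numpy as np

# Invented rows, one with a missing gdppc, to show dropna(subset=...);
# note Chile is kept even though its life_exp is missing,
# because we only filter on the gdppc column
toy = pd.DataFrame({
    "country": ["Cuba", "Chile", "Peru"],
    "gdppc": [np.nan, 22000.0, 12000.0],
    "life_exp": [79.0, np.nan, 75.0],
})
has_gdp = toy.dropna(subset=["gdppc"])  # drops only rows where gdppc is NaN
print(has_gdp["country"].tolist())      # ['Chile', 'Peru']
```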
#sort dataset by the values of one variable
#sort_values returns a sorted copy; assign the result to a name if you want to keep it
print(df2016.sort_values(by='life_exp'))
print(df2016.sort_values(by='life_exp',ascending=False))
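A small sketch of ascending versus descending sorts, using invented life-expectancy values:

```python
import pandas as pd

# Invented life-expectancy values to illustrate the two sort directions
toy = pd.DataFrame({
    "country": ["Argentina", "Chile", "Peru"],
    "life_exp": [76.0, 80.0, 75.0],
})
asc = toy.sort_values(by="life_exp")                    # lowest first
desc = toy.sort_values(by="life_exp", ascending=False)  # highest first
print(asc["country"].tolist())   # ['Peru', 'Argentina', 'Chile']
print(desc["country"].tolist())  # ['Chile', 'Argentina', 'Peru']
```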
#Once you have created a manageable dataframe, you may export it to a CSV file for later re-use in Python or with other programs.
##in this case, the code below exports the dataframe subset you created above, which removes all countries/years with missing data for life expectancy.
df_haslfexp.to_csv('LatAm_US_lifeexp_to2016.csv')
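To confirm an export worked, you can read the file back in and compare. The sketch below round-trips a tiny invented frame through a throwaway file name (demo_roundtrip.csv is just an example, not the tutorial's output file); passing index=False keeps the row index out of the file so the frame comes back unchanged.

```python
import pandas as pd

# Round-trip check: write a tiny invented frame to a throwaway CSV
# and read it back; index=False keeps the row index out of the file
toy = pd.DataFrame({"country": ["Mexico", "Chile"], "life_exp": [77.1, 80.0]})
toy.to_csv("demo_roundtrip.csv", index=False)
back = pd.read_csv("demo_roundtrip.csv")
print(back.equals(toy))  # True
```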
Again, as you run this code, keep referring to the links above for additional information, or rely on Google searches to help you find more (e.g. search for “Python Pandas how to subset dataframe by values of multiple columns”).
Next, continue to the next lesson to work on visualizing this data.