2019
Open your terminal.
conda list
conda install pandas
conda install numpy
conda install wbdata #this may fail, as wbdata is not in the default conda channels
pip install wbdata #if the conda install fails, install wbdata with pip instead
In Spyder, create and save a new program in the editor. In the program:
import pandas as pd
import numpy as np
import os
import wbdata #needed for the World Bank download functions used below
Run this code.
Set your working directory (the path to your folder of Python files):
os.chdir("C:/Users/mikecz/Documents/python") #replace with your folder path
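If you are unsure whether the change took effect, os.getcwd() reports the current working directory. The path below is a stand-in (the user's home directory); substitute your own folder path:

```python
import os

# Stand-in path for illustration only; replace with your own folder
os.chdir(os.path.expanduser("~"))
print(os.getcwd())  # prints the folder you just switched to
```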
Choose the countries for which you wish to download data. Some examples include:
#wbdata.get_country() # to see the full list of countries and their codes
LAMcountries=['ARG','BLZ','BOL','BRA','CHL','COL','CRI','CUB','SLV','GTM','GUY','HTI','HND',
'JAM','MEX','NIC','PAN','PRY','PER','PRI','SUR','URY','VEN']
NAMcountries=['USA','CAN']
#regions=['LCN','WLD','NAC'] #WLD = World; LCN = Latin America & Caribbean; NAC = North America
regions=['NAC','WLD','LCN']
allcountries=LAMcountries+NAMcountries
allregions=allcountries+regions
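Combining these lists works like ordinary Python list concatenation with +. A quick sanity check of the sizes (23 Latin American codes, 2 North American codes, 3 region codes):

```python
# Same lists as above; + concatenates them in order
LAMcountries = ['ARG', 'BLZ', 'BOL', 'BRA', 'CHL', 'COL', 'CRI', 'CUB', 'SLV',
                'GTM', 'GUY', 'HTI', 'HND', 'JAM', 'MEX', 'NIC', 'PAN', 'PRY',
                'PER', 'PRI', 'SUR', 'URY', 'VEN']
NAMcountries = ['USA', 'CAN']
regions = ['NAC', 'WLD', 'LCN']
allcountries = LAMcountries + NAMcountries
allregions = allcountries + regions
print(len(allcountries), len(allregions))  # 25 28
```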
To browse the available wbdata variables you have two options. First, explore from within Python, e.g.:
wbdata.get_topic()
1 Agriculture & Rural Development
2 Aid Effectiveness
3 Economy & Growth
4 Education
5 Energy & Mining
6 Environment
7 Financial Sector
8 Health
9 Infrastructure
10 Social Protection & Labor
11 Poverty
12 Private Sector
13 Public Sector
14 Science & Technology
15 Social Development
16 Urban Development
17 Gender
18 Millenium development goals
19 Climate Change
20 External Debt
21 Trade
To explore variables (“indicators”) available for one of these topics enter:
wbdata.get_indicator(topic=17)
WP15163_4.3 Mobile account, female (% age 15+) [w2]
WP15163_4.2 Mobile account, male (% age 15+) [w2]
WP_time_01.3 Account at a financial institution, female (% age 15+) [ts]
WP_time_01.2 Account at a financial institution, male (% age 15+) [ts]
SP.URB.TOTL.MA.ZS Urban population, male (% of total)
SP.URB.TOTL.FE.ZS Urban population, female (% of total)
....#much more returned....
You can also search for variables (“indicators”) by keywords:
wbdata.search_indicators("GDP per capita")
To choose which variables (“indicators”) you will import, create a dictionary linking the wbdata label for the variable with your own, more common-sense label, like this:
indicators={"EN.ATM.PM25.MC.ZS":"air_poll",
"SP.URB.TOTL":"urbanpop",
"EN.URB.LCTY.UR.ZS":"popLgstCity",
"EN.POP.EL5M.UR.ZS":"urbBelow5m",
"EN.POP.SLUM.UR.ZS":"popSlums",
"SP.POP.TOTL":"totalpop",
"EN.POP.DNST":"popDens",
"1.1_ACCESS.ELECTRICITY.TOT":"electAccess",
"1.3_ACCESS.ELECTRICITY.URBAN":"electAccessUrb"
}
With your dictionary of indicators and list of countries, you can now import the data as a pandas dataframe using:
df=wbdata.get_dataframe(indicators,country=allregions)
Now that your World Bank data has been imported as a dataframe, you can review summary information about it just as you would for any other dataframe:
df.shape
df.columns.values
df.info()
df.describe()
df.head()
df.head(10)
df.tail()
df.tail(10)
Notice when you view your data using df.head() or df.tail() the country and date are listed as indices for each row rather than as separate columns. For our purposes, we will convert the indices (country and date) to columns.
df2=df.copy() #creates a copy of dataframe; if you instead do df2=df all changes to df2 will impact df
df2.reset_index(inplace=True)
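The effect of reset_index can be seen on a small, made-up dataframe with the same two-level (country, date) index that wbdata returns:

```python
import pandas as pd

# Toy stand-in for the wbdata result: a two-level (country, date) index
idx = pd.MultiIndex.from_product([['Canada', 'Mexico'], ['2015', '2016']],
                                 names=['country', 'date'])
toy = pd.DataFrame({'totalpop': [35.7, 36.1, 121.9, 123.3]}, index=idx)

toy2 = toy.copy()
toy2.reset_index(inplace=True)  # 'country' and 'date' become ordinary columns
print(toy2.columns.tolist())    # ['country', 'date', 'totalpop']
```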
Also, for the purpose of labeling points on our (future) data visualizations, we should create a separate column combining the country and date into one variable.
df2['text']=df2['country']+df2['date'].map(str)
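On a toy dataframe with made-up values, the string concatenation looks like this:

```python
import pandas as pd

# Made-up values: combining two columns into one label column
toy = pd.DataFrame({'country': ['Canada', 'Mexico'], 'date': ['2016', '2016']})
toy['text'] = toy['country'] + toy['date'].map(str)
print(toy['text'].tolist())  # ['Canada2016', 'Mexico2016']
```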
Now, we can create new columns. For example, we will want to convert absolute numbers into meaningful percentages (e.g., the % of the population living in urban areas or the % of land that is arable).
df2['pcturb']=df2['urbanpop']/df2['totalpop']*100
df2['pcturb_inSlums']=df2['popSlums']/df2['urbanpop']*100
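The arithmetic above is plain element-wise column division. A toy example with made-up numbers, including the column mean computed afterwards:

```python
import pandas as pd

# Made-up numbers: deriving a percentage column from absolute counts
toy = pd.DataFrame({'urbanpop': [80.0, 50.0], 'totalpop': [100.0, 200.0]})
toy['pcturb'] = toy['urbanpop'] / toy['totalpop'] * 100
print(toy['pcturb'].tolist())  # [80.0, 25.0]
print(toy['pcturb'].mean())    # 52.5
```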
To calculate the mean (average) of these new columns, we can do either:
df2['pcturb'].mean()
df2['pcturb'].describe()
Subsetting Data. Often it is useful to create smaller subsets of your dataframe: for example, all countries with a current population over 1 million people, or all instances (country/year data points) that have population data.
df_1st5rows=df2[0:5]
Notice how this works similarly to subsetting the first five items from a list. Subsetting the first ten columns, however, is a little different:
df_1st5cols=df2.iloc[:,0:10]
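The difference between the two slices, shown on a small made-up dataframe: a plain `[0:5]` slice selects rows by position, while `.iloc[:, 0:2]` keeps all rows and selects columns by position.

```python
import pandas as pd

# Toy dataframe: row slicing vs. column slicing with .iloc
toy = pd.DataFrame({'a': range(8), 'b': range(8), 'c': range(8)})
first5rows = toy[0:5]          # positional row slice, rows 0-4
first2cols = toy.iloc[:, 0:2]  # all rows, first two columns
print(first5rows.shape)  # (5, 3)
print(first2cols.shape)  # (8, 2)
```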
You can also create a subset for just a few countries:
df_selctrys=df2[df2['country'].isin(['Argentina','Brazil','Colombia','Mexico'])] #note: the 'country' column holds full country names, not the three-letter codes
or you can select for a chosen few years:
df_selyrs=df2[(df2['date']>=2014) & (df2['date']<=2016)] #produces an error
If this produces an error, it is because the values in the 'date' column are stored as strings, not integers. To convert the entire column to integers, try:
df2['date']=pd.to_numeric(df2['date'])
df_selyrs=df2[(df2['date']>=2014) & (df2['date']<=2016)]
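A toy illustration of the same fix: a 'date' column of strings is converted with pd.to_numeric, after which numeric comparisons work as expected.

```python
import pandas as pd

# Made-up data: the 'date' column starts out as strings
toy = pd.DataFrame({'date': ['2013', '2014', '2015', '2016', '2017']})
toy['date'] = pd.to_numeric(toy['date'])  # now integers
sel = toy[(toy['date'] >= 2014) & (toy['date'] <= 2016)]
print(sel['date'].tolist())  # [2014, 2015, 2016]
```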
First, let us identify missing data. In your console, type “df2.info()”. This reveals the number of missing data for each variable. Some data is more complete than others. For example, for the variable “totalpop” notice we have 1566 out of 1593 possible data points. That means out of 1593 possible combinations of countries/regions and years (1960 - 2018) we are missing only 27, which likely means we are missing data for only one year (for all 27 countries). Let’s find out what year is missing.
dfnas=df2[df2['totalpop'].isnull()]
You can then view the ‘dfnas’ dataframe to see what year was missing.
To drop all instances (rows) of missing data for a particular variable, try the following code:
df3=df2.dropna(subset=['totalpop'])
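With `subset=`, dropna removes only rows where that particular column is missing; missing values in other columns are left alone. A toy example with made-up data:

```python
import pandas as pd
import numpy as np

# Toy example: drop only the rows where 'totalpop' is missing
toy = pd.DataFrame({'totalpop': [1.0, np.nan, 3.0],
                    'popSlums': [np.nan, np.nan, 9.0]})
toy3 = toy.dropna(subset=['totalpop'])
print(toy.shape, toy3.shape)  # (3, 2) (2, 2)
```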
Now, compare the size of the two dataframes:
df2.shape
df3.shape
Subset and sort data:
Let us begin by retrieving the data for just one year: 2016, the most recent year with mostly complete data.
df2016=df3[df3['date']==2016]
Next, let us sort this data by the percent of people living in urban areas:
df2016.sort_values(by='pcturb')
If you want to sort it from most to least urban, you can modify it as:
df2016.sort_values(by='pcturb',ascending=False)
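The ascending flag is the only difference between the two sorts. A toy example with made-up values:

```python
import pandas as pd

# Toy example: sorting by a column, descending
toy = pd.DataFrame({'country': ['A', 'B', 'C'], 'pcturb': [40.0, 90.0, 70.0]})
desc = toy.sort_values(by='pcturb', ascending=False)
print(desc['country'].tolist())  # ['B', 'C', 'A'] (most to least urban)
```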
You can also subset your dataframe by a conditional. For example, if you had also imported a life-expectancy indicator (e.g., "SP.DYN.LE00.IN", labeled 'life_exp' in your indicators dictionary) and only wanted those data points (countries and years) with a life expectancy greater than 75, you would type:
longlifedf=df2[df2['life_exp']>75]
To see what countries and years remain in this dataset and, therefore, have / had a life expectancy greater than 75, see:
longlifedf['country'].value_counts()
longlifedf['date'].value_counts()
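On a toy dataframe with made-up values, the conditional filter followed by value_counts looks like this; the counts show how many rows for each country survived the filter:

```python
import pandas as pd

# Made-up data: count how often each country passes a conditional filter
toy = pd.DataFrame({'country': ['A', 'A', 'B', 'C'],
                    'life_exp': [76, 78, 80, 60]})
longlife = toy[toy['life_exp'] > 75]
print(longlife['country'].value_counts().to_dict())  # {'A': 2, 'B': 1}
```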