Hey scientist! How is it going?
In this post series we’ll discuss scatter (or dispersion) plots, which can present several variables at once in an intuitive way. Today we’ll start by obtaining the data we need from IBGE. Let’s do this!
First we obtain the data for our plot: in this example, we want the population, GDP per capita and life expectancy of the Brazilian states in 2013. These data come from two online articles published by Sala de Imprensa, IBGE, and part of these tables is shown below:
After we get these tables, we need to organize the data so that Python can read it more easily. In this example, we copy the two tables into a LibreOffice spreadsheet and save both in one xls file (“População e PIB”, meaning “Population and GDP”, and “Expectativa de vida”, meaning “Life expectancy”).
In the first table, “População e PIB”, we are interested in the columns “População residente (1.000 hab.)(1)” and “Produto Interno Bruto per capita R$”. In the second table, “Expectativa de vida”, we want the column “Esperança de vida ao nascer – 2013 – Total”. Then we create a third table, named “Table for analysis”, and paste in the three columns of interest as “GDPperCapita”, “LifeExpec” and “PopX1000”, filled with the data from the original tables.
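The same selection could also be done in pandas instead of by hand in the spreadsheet. The sketch below is only an illustration: it assumes both original tables share a state-name column (called `Estado` here, a hypothetical name), the helper `build_analysis_table` is made up for this example, and the numbers are placeholders, not the IBGE figures.

```python
import pandas as pd

# Column headers as quoted in the text; "Estado" is a hypothetical key column.
POP_COL = "População residente (1.000 hab.)(1)"
GDP_COL = "Produto Interno Bruto per capita R$"
LIFE_COL = "Esperança de vida ao nascer – 2013 – Total"


def build_analysis_table(pop_gdp, life_expec, key="Estado"):
    """Join the two tables on `key` and keep the three columns of interest."""
    merged = pop_gdp.merge(life_expec[[key, LIFE_COL]], on=key)
    merged = merged.rename(columns={
        GDP_COL: "GDPperCapita",
        LIFE_COL: "LifeExpec",
        POP_COL: "PopX1000",
    })
    return merged[[key, "GDPperCapita", "LifeExpec", "PopX1000"]]


# Placeholder values for illustration only, NOT the IBGE data.
pop_gdp = pd.DataFrame({
    "Estado": ["A", "B"],
    POP_COL: [1000.0, 2000.0],
    GDP_COL: [15000.0, 25000.0],
})
life_expec = pd.DataFrame({
    "Estado": ["B", "A"],  # row order differs on purpose; merge aligns by key
    LIFE_COL: [75.0, 72.0],
})

analysis = build_analysis_table(pop_gdp, life_expec)
print(analysis)
```

Merging on a key column, rather than pasting columns side by side, avoids mismatched rows when the two source tables list the states in a different order.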
There you go! We already have data for a good plot. But we’ll also show each state and the region it belongs to. For that we create the columns “UF” and “Region” in “Table for analysis”. They contain the abbreviation of the state and its region, respectively. Download the file with both original tables and the analysis data (click on ‘View Raw’). Check the table containing the analysis data below:
Now we read the third table with Pandas:

import pandas as pd

data_brazil = pd.read_excel('data_ibge.xls', sheet_name=2)
data_brazil.head()

The argument sheet_name=2 tells Pandas to read the third table (older Pandas versions spelled this argument sheetname).
Remember that Python starts counting from zero: sheet_name=0 would give the first table and sheet_name=1 the second one. In fact, if you want the first table, you don’t need this argument at all.
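As a side note, `read_excel` also accepts the name of a sheet instead of its position, which keeps working if the sheets are reordered. Here is a minimal sketch, assuming the third sheet is really named “Table for analysis” as described above. The helper `load_sheet` is a hypothetical name, and current Pandas spells the argument `sheet_name`.

```python
import pandas as pd


def load_sheet(path, sheet="Table for analysis"):
    """Read one sheet of a workbook, by name (str) or by position (int)."""
    # sheet=2 and sheet="Table for analysis" select the same sheet only if
    # the third sheet carries that name (an assumption here).
    return pd.read_excel(path, sheet_name=sheet)


# Usage (needs data_ibge.xls in the working directory and an Excel reader
# such as xlrd installed):
# data_brazil = load_sheet("data_ibge.xls")
# data_brazil = load_sheet("data_ibge.xls", sheet=2)
```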
After reading the table and storing it in the variable data_brazil, we want to see its contents. Pandas shows the first rows of the table (five by default) when we call data_brazil.head(). The result follows:
Note that the columns match the data in the LibreOffice spreadsheet. To work with the contents of a single column, put its name between brackets:
data_brazil['UF']
data_brazil['Region']
data_brazil['GDPperCapita']
data_brazil['LifeExpec']
data_brazil['PopX1000']
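Each of these expressions returns a pandas Series, so the usual Series methods apply. The sketch below uses a small stand-in table with the same column names but made-up values (not the IBGE data) to show how a column can answer a question such as “which state has the highest life expectancy?”.

```python
import pandas as pd

# Stand-in for data_brazil: same column names, placeholder values only.
demo = pd.DataFrame({
    "UF": ["AA", "BB", "CC"],
    "Region": ["North", "South", "South"],
    "GDPperCapita": [10000.0, 30000.0, 20000.0],
    "LifeExpec": [70.0, 78.0, 75.0],
    "PopX1000": [500.0, 4000.0, 1500.0],
})

# idxmax() gives the row label of the largest value in the column.
top_row = demo["LifeExpec"].idxmax()
top_state = demo.loc[top_row, "UF"]

# A list of names between brackets selects several columns at once.
subset = demo[["UF", "LifeExpec"]]

# A column can also be aggregated per group, e.g. the mean by region.
mean_by_region = demo.groupby("Region")["LifeExpec"].mean()

print(top_state)
print(subset)
print(mean_by_region)
```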
Now we’ll create a preliminary scatter plot from the collected data. For that we use scatter(), a function from matplotlib:
import matplotlib.pyplot as plt

plt.scatter(x=data_brazil['LifeExpec'],
            y=data_brazil['GDPperCapita'],
            s=data_brazil['PopX1000'])
plt.show()
In this example, we use the life expectancy as X (argument x=data_brazil['LifeExpec']) and the GDP per capita as Y (y=data_brazil['GDPperCapita']). The population sets the size of each circle (s=data_brazil['PopX1000']). Check the resulting plot below:
Well… this plot isn’t very informative. What is the little circle at the top right? Which state has the highest life expectancy? Are the regions well distributed?
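One detail worth knowing here: matplotlib reads `s` as the marker area in points squared, so the population values set each circle’s area directly. A minimal sketch of one possible rescaling follows, using made-up values (not the IBGE data) and a hypothetical helper `scale_sizes`. It is just an illustration, not necessarily what the next posts will do.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so the sketch runs anywhere
import matplotlib.pyplot as plt
import pandas as pd


def scale_sizes(values, max_area=1000.0):
    """Rescale values so the largest one maps to `max_area` (points squared)."""
    return values / values.max() * max_area


# Placeholder values for illustration only, NOT the IBGE data.
demo = pd.DataFrame({
    "LifeExpec": [70.0, 78.0, 75.0],
    "GDPperCapita": [10000.0, 30000.0, 20000.0],
    "PopX1000": [500.0, 4000.0, 1500.0],
})

sizes = scale_sizes(demo["PopX1000"])

fig, ax = plt.subplots()
points = ax.scatter(x=demo["LifeExpec"], y=demo["GDPperCapita"], s=sizes)
# fig.savefig("scatter_sketch.png")  # uncomment to write the figure to disk
print(sizes.tolist())
```

Fixing the largest area keeps the relative proportions between states while letting you choose how big the biggest circle gets.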
We’ll start improving this plot next week, setting labels, title, text, legend, and colors. Stay with us!
Thanks, scientist! Gigaregards, see you next time!
May 12: a new version of dados_ibge.xls, now named data_ibge.xls, is available. “Table for analysis” and the variables are now in English. Some wrong data were also corrected, as pointed out by A C Censi in the comments.
Want to download Programando Ciência codes? Go to our GitHub!
I’m on Twitter! Follow me if you can! @alexdesiqueira