Malaria Data Visualization

Three informative visualizations about malaria are created using Python, starting with the data sets at https://github.com/rfordatascience/tidytuesday/tree/master/data/2018/2018-11-13.

Malaria Dataset

The malaria data consists of three datasets.

3 Datasets

%matplotlib inline
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

malaria_deaths.csv

df1 = pd.read_csv('malaria_deaths.csv')
df1 = df1.rename(columns={"Deaths - Malaria - Sex: Both - Age: Age-standardized (Rate) (per 100,000 people)": "Deaths per 100,000 People"})
df1.head()


malaria_deaths_age.csv

df2 = pd.read_csv('malaria_deaths_age.csv')
df2.head()


malaria_inc.csv

df3 = pd.read_csv('malaria_inc.csv')
df3 = df3.rename(columns={"Incidence of malaria (per 1,000 population at risk) (per 1,000 population at risk)": "Incidence per 1,000 population at risk"})
df3.head()

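The renames above depend on typing very long column names exactly. By default, pandas ignores a key in rename that does not match any column, so a typo would go unnoticed until a later KeyError. The sketch below uses a made-up toy frame (not the real file) to show the errors="raise" option, which is available in reasonably recent pandas versions.

```python
import pandas as pd

# Toy stand-in for malaria_inc.csv (values are made up for illustration).
long_name = ("Incidence of malaria (per 1,000 population at risk) "
             "(per 1,000 population at risk)")
toy = pd.DataFrame({"Entity": ["A", "B"], "Year": [2000, 2000],
                    long_name: [10.0, 20.0]})

short_name = "Incidence per 1,000 population at risk"

# errors="raise" makes rename fail loudly if the old name is not a column.
renamed = toy.rename(columns={long_name: short_name}, errors="raise")

# A misspelled key now raises KeyError instead of being silently ignored.
try:
    toy.rename(columns={"no such column": "x"}, errors="raise")
    typo_detected = False
except KeyError:
    typo_detected = True

print(list(renamed.columns), typo_detected)
```

The same option could be added to the rename calls for df1 and df3 if you want the notebook to stop as soon as a column name changes upstream.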

Data Visualizations with Seaborn

In total, these datasets record 228 different countries (‘Entity’). For simplicity, I decided to group them into 6 continents: ‘Africa’, ‘Asia’, ‘Europe’, ‘North America’, ‘Oceania’ and ‘South America’. To do this, I found another dataset, ‘Countries-Continents.csv’, and merged it with the dataset ‘malaria_deaths_age.csv’.

df4 = pd.read_csv('Countries-Continents.csv')
df_merge = df2.merge(df4,left_on='entity', right_on='Country')
df_merge.head()

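The merge above is an inner join by default, so any entity whose name has no exact match in ‘Countries-Continents.csv’ is dropped without warning. A minimal sketch of how one might list those entities is shown below. The two small frames are made-up stand-ins for df2 and df4, and the names deaths, continents, checked and unmatched are only for illustration.

```python
import pandas as pd

# Toy stand-ins for df2 and df4 (rows are made up for illustration).
deaths = pd.DataFrame({
    "entity": ["Nigeria", "India", "World"],
    "year": [2000, 2000, 2000],
    "deaths": [100.0, 50.0, 500.0],
})
continents = pd.DataFrame({
    "Continent": ["Africa", "Asia"],
    "Country": ["Nigeria", "India"],
})

# A left merge keeps every row of `deaths`; indicator=True adds a `_merge`
# column saying whether a matching Country was found.
checked = deaths.merge(continents, how="left", left_on="entity",
                       right_on="Country", indicator=True)

# Entities that the default inner merge would have dropped.
unmatched = sorted(checked.loc[checked["_merge"] == "left_only", "entity"].unique())
print(unmatched)
```

Running the same check on the real df2 and df4 would show which entities are missing from the continent totals that follow.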

Group by ‘Continent’ and ‘year’ and sum ‘deaths’. This gives the total number of deaths for each continent in each year.

df_total = df_merge[['deaths','Continent','year']].groupby(['Continent','year']).sum().reset_index()
df_total.head()

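With totals per continent and year in hand, the ranking of continents within each year can also be checked numerically rather than only by eye. The sketch below uses a made-up frame with the same columns as df_total. The names toy_total, wide, ranks and rank_is_stable are hypothetical.

```python
import pandas as pd

# Toy stand-in for df_total (numbers are made up for illustration).
toy_total = pd.DataFrame({
    "Continent": ["Africa", "Asia", "Europe"] * 2,
    "year": [1990] * 3 + [2015] * 3,
    "deaths": [900.0, 300.0, 1.0, 700.0, 100.0, 0.5],
})

# One row per year, one column per continent.
wide = toy_total.pivot(index="year", columns="Continent", values="deaths")

# Rank continents within each year (1 = most deaths).
ranks = wide.rank(axis=1, ascending=False).astype(int)

# True when every year has the same ranking as the first year.
rank_is_stable = bool((ranks == ranks.iloc[0]).all().all())
print(ranks)
print(rank_is_stable)
```

Applied to the real df_total, ranks.iloc[0].sort_values() would list the continents from most to fewest deaths in the first year.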

Line plot with Seaborn, using yscale = ‘log’ for clear separation between the lines and plt.xkcd() for fun!

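with plt.xkcd():
    fig, ax = plt.subplots(1, 1, figsize=(8, 5))
    # draw on the axes created above instead of relying on the current axes
    sns.lineplot(data=df_total, x="year", y="deaths", hue="Continent", ax=ax)
    ax.set_title('total deaths over years', fontsize=16)
    # place the legend just outside the right edge of the plot
    ax.legend(bbox_to_anchor=(1.01, 1), borderaxespad=0)
    ax.set_yscale('log')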

Figure: total deaths over years, one line per continent (log-scale y axis)

  • The total number of deaths due to malaria differs between continents. From largest to smallest: Africa, Asia, Europe, North America, Oceania, South America. This ranking did not change from 1990 to 2015.

x_axis = ['Under 5', '5-14', '15-49', '50-69', '70 or older']

with plt.xkcd():
    fig, ax = plt.subplots(nrows=1, ncols=1, figsize=(15, 6))
    sns.barplot(x='age_group', y='deaths', data=df2, order=x_axis, ax=ax)
    ax.set_title('age VS deaths over all countries', fontsize=16)
    plt.show()

Figure: age VS deaths over all countries (bar plot of deaths by age group)

As the figure above shows, the number of deaths is clearly largest for children under 5 (by default, sns.barplot draws the mean of ‘deaths’ over all rows in each age group, with an error bar, rather than the sum). But we cannot conclude that children under 5 are much more likely to die of malaria, since we have no information about infections in each age group. Worldwide, the number of deaths falls as age increases. However, since we don’t know the population of each age group, we cannot say that the number of deaths is inversely proportional to age.
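Since sns.barplot shows a per-row mean by default, it can help to compute the totals directly when the discussion is about total deaths. The sketch below does this with a made-up frame that mimics the ‘age_group’ and ‘deaths’ columns of df2. The names toy_age, by_age, mean_deaths, total_deaths and share are only for illustration.

```python
import pandas as pd

# Toy stand-in for df2 (numbers are made up for illustration).
toy_age = pd.DataFrame({
    "entity": ["A", "A", "B", "B"],
    "age_group": ["Under 5", "5-14", "Under 5", "5-14"],
    "deaths": [100.0, 10.0, 300.0, 30.0],
})

by_age = toy_age.groupby("age_group")["deaths"]
mean_deaths = by_age.mean()   # what sns.barplot draws by default
total_deaths = by_age.sum()   # the total over all rows

# Share of all deaths that falls in each age group.
share = total_deaths / total_deaths.sum()
print(share.round(3))
```

Alternatively, sns.barplot accepts an estimator argument (for example estimator=sum), which should make the bars show totals instead of means. Neither version addresses the missing population figures, so the caveat about per-capita risk still applies.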

grouped_1 = df1[['Year','Deaths per 100,000 People']].groupby('Year').mean().reset_index()

grouped_3 = df3[['Year','Incidence per 1,000 population at risk']].groupby('Year').mean().reset_index()

with plt.xkcd():
    fig, ax = plt.subplots(nrows=1, ncols=2, figsize=(18, 6))
    sns.lineplot(data=grouped_1, x='Year', y='Deaths per 100,000 People', ax=ax[0])
    ax[0].set_title('Deaths over Years')
    sns.lineplot(data=grouped_3, x='Year', y='Incidence per 1,000 population at risk', ax=ax[1])
    ax[1].set_title('Incidence over Years')
    plt.tight_layout()
    plt.show()


Figure: Deaths over Years (left) and Incidence over Years (right)

  • The malaria death rate per 100,000 people first increases and then decreases, peaking between 2000 and 2003.
  • Incidence per 1,000 population at risk decreases over the years.
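The peak mentioned above can also be read off the yearly means instead of the plot. Here is a minimal sketch using a made-up frame shaped like grouped_1. The numbers are invented and the names toy_rate, rate, peak_year, rising and falling are hypothetical.

```python
import pandas as pd

# Toy stand-in for grouped_1 (numbers are made up for illustration).
toy_rate = pd.DataFrame({
    "Year": [1990, 1995, 2000, 2005, 2010],
    "Deaths per 100,000 People": [14.0, 16.0, 18.0, 15.0, 11.0],
})

rate = toy_rate.set_index("Year")["Deaths per 100,000 People"]

# Year with the highest mean death rate.
peak_year = int(rate.idxmax())

# Does the rate only rise up to the peak and only fall after it?
rising = bool(rate.loc[:peak_year].is_monotonic_increasing)
falling = bool(rate.loc[peak_year:].is_monotonic_decreasing)
print(peak_year, rising, falling)
```

On the real grouped_1, the same lines would give the exact peak year and confirm whether the rise-then-fall pattern holds for every year.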

Yuhua Cai

Data Scientist from Duke

Durham, NC, USA https://yc392.github.io