Data Analysis with Pandas and Customised Visuals with Matplotlib

by s666

This blog post is the result of a request I received on the website's Facebook group page from a follower who asked me to analyse/play around with a csv data file he had provided. The request was to use Pandas to wrangle the data and perform some filtering and aggregation, with a view to plotting the resulting figures using Matplotlib. Matplotlib was explicitly asked for, rather than Seaborn or any other higher-level plotting library (even if they are built on the Matplotlib API), so I shall endeavour to use base Matplotlib where possible, rather than rely on any of the aforementioned (more user-friendly) modules.

For those of you wishing to follow along, the data file can be downloaded using the buttons below.

So as always, we first need to specify our imports, after which we will read in the csv file:

import pandas as pd
import numpy as np

import matplotlib.pyplot as plt

%matplotlib inline

df = pd.read_csv('AppleStore.csv',index_col='id')

We then print out the first 5 rows of the newly created DataFrame to verify our import has gone as expected.

df.head(5)

An often useful method to call at this stage is the “.info()” method as shown below. This shows us whether we have any missing data or “null” values, along with the datatype of each column’s data.

df.info()

The “.describe()” method is also useful at this stage – it gets us the “five-number summary” (minimum, quartiles and maximum) of each numerical data column, along with the corresponding count, mean and standard deviation.

df.describe()
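If you want the missing-value check as plain numbers rather than reading it off the “.info()” printout, a minimal sketch along these lines may help. The small DataFrame is made up for illustration and is not taken from AppleStore.csv:

```python
import numpy as np
import pandas as pd

# toy data standing in for the real file; the values are made up
toy = pd.DataFrame({
    'prime_genre': ['Games', 'Music', None, 'Games'],
    'user_rating': [4.0, np.nan, 3.5, 4.5],
})

# isna() flags the missing cells, sum() counts them column by column
missing_per_column = toy.isna().sum()
print(missing_per_column)
```

On the real data the same call would simply be “df.isna().sum()”.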

So let’s imagine that we wanted to investigate how the data differs for each “prime_genre” – that is, for each genre of app. Firstly, I would tend to get an idea of how many unique genres we are dealing with, and a list of our genre categories can be extracted in one of multiple ways – 2 are shown below.

We can either use the built-in “.unique()” method as follows:

df['prime_genre'].unique()

array(['Games', 'Productivity', 'Weather', 'Shopping', 'Reference', 'Finance', 'Music', 'Utilities', 'Travel', 'Social Networking', 'Sports', 'Business', 'Health & Fitness', 'Entertainment', 'Photo & Video', 'Navigation', 'Education', 'Lifestyle', 'Food & Drink', 'News', 'Book', 'Medical', 'Catalogs'], dtype=object)

which, as you can see, returns an array containing the unique values held in our “prime_genre” column. Another way to extract this information would be to create a “set” of the unique values, and cast that as a list as shown below:

list(set(df['prime_genre']))
['Entertainment',
 'Utilities',
 'Weather',
 'Sports',
 'Travel',
 'Games',
 'Photo & Video',
 'Catalogs',
 'Reference',
 'Navigation',
 'Music',
 'Book',
 'Health & Fitness',
 'Shopping',
 'Business',
 'Lifestyle',
 'Finance',
 'Education',
 'Food & Drink',
 'Productivity',
 'Social Networking',
 'News',
 'Medical']

By printing out the length of the list of unique genres, we can see how many categories we are dealing with. Below we use the “f-string” formatting syntax that was introduced in Python 3.6.

count = len(list(set(df['prime_genre'])))

print(f'There are {count} unique app categories')
There are 23 unique app categories
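Pandas can also count the unique values directly. The sketch below uses a made-up column rather than the real file, and shows “.nunique()” for the count plus “.value_counts()” for the number of apps in each genre:

```python
import pandas as pd

# toy genre column; the values are made up for illustration
toy = pd.DataFrame({'prime_genre': ['Games', 'Music', 'Games', 'Weather', 'Music', 'Games']})

# number of distinct genres
n_genres = toy['prime_genre'].nunique()

# how many rows (apps) fall into each genre, largest first
apps_per_genre = toy['prime_genre'].value_counts()

print(f'There are {n_genres} unique app categories')
print(apps_per_genre)
```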

I was asked explicitly in the Facebook post to show some examples of “filtering” by multiple columns, so I shall deal with that now. The request was worded as “Please use pandas library to apply filtered result on two or three column at once. Like if we want to find how many male M from city_category A have purchase more than 7000”. That comment was obviously referring to a different data set, but we can adapt it to our use.

Let’s give ourselves the challenge of identifying how many apps of the “Games” genre which scored an average user rating of exactly 4.0 were rated by more than 20,000 people (or at least rated more than 20,000 times).

The below line of code gets us the subset of the DataFrame where all 3 of our conditions are met.

df[(df['prime_genre'] == "Games") & (df['user_rating'] == 4.0) & (df['rating_count_tot'] > 20000)]

We can then just wrap it in a “len” call to find out how many rows it contains and therefore how many apps meet our criteria – in our case 49 apps.

len(df[(df['prime_genre'] == "Games") & (df['user_rating'] == 4.0) & (df['rating_count_tot'] > 20000)])
49
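The same kind of multi-column filter can be written in a couple of other ways. This is a minimal sketch on made-up rows (the column names mirror the ones used above, the values do not come from the real file): summing the boolean mask gives the count without building the subset, and “.query()” expresses the conditions as a string.

```python
import pandas as pd

# toy rows; the values are made up for illustration
toy = pd.DataFrame({
    'prime_genre': ['Games', 'Games', 'Music', 'Games'],
    'user_rating': [4.0, 4.0, 4.0, 4.5],
    'rating_count_tot': [25000, 1500, 90000, 30000],
})

# each condition yields a boolean Series; & combines them element-wise
mask = (
    (toy['prime_genre'] == 'Games')
    & (toy['user_rating'] == 4.0)
    & (toy['rating_count_tot'] > 20000)
)

# True counts as 1, so the sum is the number of matching rows
n_matches = int(mask.sum())

# the same filter written as a query string
same_rows = toy.query("prime_genre == 'Games' and user_rating == 4.0 and rating_count_tot > 20000")

print(n_matches)
```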

Now imagine we want to find out the average rating, not for each individual app, but rather for each individual genre of app. We can use a simple “groupby”, passing the name of the column by which we wish to group as our first argument, then select the column we are interested in using the standard bracket notation, and finally apply an “aggregation function” – in our case “.mean()” as we want the average value. Selecting the column before aggregating matters: recent versions of Pandas raise an error, rather than silently skipping the text columns, when asked to average the whole DataFrame.

genre_rating = df.groupby('prime_genre')['user_rating'].mean()

If we now display the contents of the “genre_rating” variable we have the following:

prime_genre
Book                 2.477679
Business             3.745614
Catalogs             2.100000
Education            3.376380
Entertainment        3.246729
Finance              2.432692
Food & Drink         3.182540
Games                3.685008
Health & Fitness     3.700000
Lifestyle            2.805556
Medical              3.369565
Music                3.978261
Navigation           2.684783
News                 2.980000
Photo & Video        3.800860
Productivity         4.005618
Reference            3.453125
Shopping             3.540984
Social Networking    2.985030
Sports               2.982456
Travel               3.376543
Utilities            3.278226
Weather              3.597222
Name: user_rating, dtype: float64
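An average on its own can hide how many apps sit behind it, so it can be handy to compute several aggregations in one go. A minimal sketch, using made-up ratings rather than the real file:

```python
import pandas as pd

# toy ratings; the values are made up for illustration
toy = pd.DataFrame({
    'prime_genre': ['Games', 'Games', 'Music', 'Music', 'Music'],
    'user_rating': [4.0, 3.0, 5.0, 4.0, 3.0],
})

# passing a list of function names returns one column per aggregation
summary = toy.groupby('prime_genre')['user_rating'].agg(['mean', 'count'])
print(summary)
```

The result is a DataFrame indexed by genre, with a “mean” and a “count” column.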

It’s really very simple to then go and plot a basic bar chart using matplotlib – in its simplest form it can be done as a one-liner…

plt.bar(x=genre_rating.index, height=genre_rating)
plt.show()

OK, so technically it does the job and we have our app genre average rating bar chart. But it’s pretty darn ugly, and the x-axis labels are barely distinguishable as they are all mashed up together for lack of space. Let’s try to clean it up and make it look a little nicer.

Firstly, we could increase the size of the overall figure, as it’s currently rather tightly packed.

# set size of overall figure
plt.figure(figsize=(20,10))
plt.bar(x=genre_rating.index, height=genre_rating)
plt.show()

That looks a bit better – but the x-axis labels are still bleeding into one another, making them hard to read. Let’s fix that. Rather than fiddle around with the sizing and placement of the x-axis labels, one simple way to do it is to convert our plot to a horizontal bar chart instead – the number of categories certainly justifies this approach.

To convert our bar chart to a horizontal bar chart, our arguments will change – instead of passing in an “x” and a “height”, we now specify a “y” and a “width”. In effect our “x” becomes our “y” and our “height” becomes our “width”. We can see the result below.

I have also sorted the data so that it appears more neatly in the plot.

genre_rating = genre_rating.sort_values()

# set size of overall figure
plt.figure(figsize=(20,10))
plt.barh(y=genre_rating.index, width=genre_rating)
plt.show()
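If you would rather keep the vertical bars, rotating the x-axis tick labels is another way to stop them overlapping. The sketch below uses made-up ratings for three long genre names and the object-oriented “fig, ax” interface; the “Agg” backend line is only there so the sketch runs without a display and can be dropped in a notebook.

```python
import matplotlib
matplotlib.use('Agg')  # headless backend for this sketch; drop this line in a notebook
import matplotlib.pyplot as plt
import pandas as pd

# made-up average ratings for three genres with long names
toy_rating = pd.Series({'Social Networking': 3.0, 'Health & Fitness': 3.7, 'Photo & Video': 3.8})

fig, ax = plt.subplots(figsize=(8, 4))
ax.bar(toy_rating.index, toy_rating.values)

# turn the x-axis tick labels on their side so they no longer collide
ax.tick_params(axis='x', labelrotation=90)

# make room for the rotated labels, then render the figure
fig.tight_layout()
fig.canvas.draw()
```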

It’s starting to look a bit better, but let’s give it a title and label the x and y axes.

# set size of overall figure
plt.figure(figsize=(20,10))

# set chart title
plt.title('Average app User Rating by Genre')

# set x-axis label
plt.xlabel('User Rating')

# set y-axis label
plt.ylabel('Genre')

plt.barh(y=genre_rating.index, width=genre_rating)

plt.show()

Now we could start to apply some of the built-in style sheets that come packaged with matplotlib – just before we do that though, let’s store the current settings in a variable so we are able to revert to the defaults as and when necessary.

plt_default = plt.rcParams.copy()

Now we can see which styles are available to use simply by using:

print(plt.style.available)

['bmh', 'classic', 'dark_background', 'fast', 'fivethirtyeight', 'ggplot', 'grayscale', 'seaborn-bright', 'seaborn-colorblind', 'seaborn-dark-palette', 'seaborn-dark', 'seaborn-darkgrid', 'seaborn-deep', 'seaborn-muted', 'seaborn-notebook', 'seaborn-paper', 'seaborn-pastel', 'seaborn-poster', 'seaborn-talk', 'seaborn-ticks', 'seaborn-white', 'seaborn-whitegrid', 'seaborn', 'Solarize_Light2', 'tableau-colorblind10', '_classic_test']

To actually use one of the styles, we can use the following syntax:

plt.style.use('ggplot')

Now if we run our code to plot our bar chart again we see the output has changed somewhat dramatically! The background colour has changed to a light grey, a white grid has been added and the fonts have changed for both the title and the axis labels. That’s not bad going for running just one line of code to change the style.

# set size of overall figure
plt.figure(figsize=(20,10))

# set chart title
plt.title('Average app User Rating by Genre')

# set x-axis label
plt.xlabel('User Rating')

# set y-axis label
plt.ylabel('Genre')

plt.barh(y=genre_rating.index, width=genre_rating)

plt.show()

If we prefer to use a different style, we have to complete 2 steps: firstly revert the setting to the default and then just run the original line of code again, this time with our preferred style name inserted instead…

# reset styles to default
plt.rcParams.update(plt_default)

# set new style (from Matplotlib 3.6 onwards this style is named 'seaborn-v0_8-deep')
plt.style.use('seaborn-deep')

# set size of overall figure
plt.figure(figsize=(20,10))

# set chart title
plt.title('Average app User Rating by Genre')

# set x-axis label
plt.xlabel('User Rating')

# set y-axis label
plt.ylabel('Genre')

plt.barh(y=genre_rating.index, width=genre_rating)

plt.show()

or to choose something quite different…

# reset styles to default
plt.rcParams.update(plt_default)

# set new style
plt.style.use('dark_background')

# set size of overall figure
plt.figure(figsize=(20,10))

# set chart title
plt.title('Average app User Rating by Genre')

# set x-axis label
plt.xlabel('User Rating')

# set y-axis label
plt.ylabel('Genre')

plt.barh(y=genre_rating.index, width=genre_rating)

plt.show()
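As an alternative to resetting “rcParams” by hand each time, matplotlib also offers “plt.style.context”, which applies a style only inside a “with” block and restores the previous settings afterwards. A minimal sketch with made-up bar values (the “Agg” line is only there so it runs without a display):

```python
import matplotlib
matplotlib.use('Agg')  # headless backend for this sketch; drop this line in a notebook
import matplotlib.pyplot as plt

# remember one setting so we can check that it is restored afterwards
before = plt.rcParams['axes.facecolor']

# the style only applies to figures created inside the with block
with plt.style.context('ggplot'):
    inside = plt.rcParams['axes.facecolor']
    fig, ax = plt.subplots(figsize=(6, 3))
    ax.barh(['Book', 'Games'], [2.5, 3.7])  # made-up values
    ax.set_title('Average app User Rating by Genre')

after = plt.rcParams['axes.facecolor']
print(before, inside, after)
plt.close(fig)
```

This keeps the styling local to one plot, at the cost of an extra level of indentation.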

We can even style the plot ourselves if we aren’t satisfied with the built-in styles. Of course it takes longer to do it this way, but matplotlib really does afford you control over every little detail that you could hope to style.

Let’s look at just the bottom 10 genres by average rating, and begin a new plot with custom styling this time – see what we can come up with.

So firstly we extract just the bottom 10 genres. We could do this with a simple slice approach due to the fact we have already sorted our data by size, however to be more explicit I shall use the “nsmallest()” method. It does what it says on the tin and selects the “n” smallest data points, with “n” being the number passed to the method as an argument. So in our case, to extract the bottom 10 we will use:

bottom_10 = genre_rating.nsmallest(10).sort_values()

Now onto our custom plot!

We will be referencing and updating some of the values held in the “rcParams” object – “rcParams” is a dictionary-type object which holds the global style settings for matplotlib. We are able to reference and change these values in order to change the default plotting behaviour: the colours, styles, shapes, fonts and so on that are used.

Note that we already have the original default settings saved in the “plt_default” variable to help us reset them if things start to go awry!

To give an example, to set a new global default figure size, we would run the following:

# reset styles to default
plt.rcParams.update(plt_default)

plt.rcParams["figure.figsize"] = (7, 5)

Now if we run a plot again, this time without specifying a figure size:

# set chart title
plt.title('Average app User Rating by Genre')

# set x-axis label
plt.xlabel('User Rating')

# set y-axis label
plt.ylabel('Genre')

plt.barh(y=bottom_10.index, width=bottom_10)

plt.show()

We can see from the above plot that the figure size has correctly conformed to the new global default of “(7, 5)” without having to specify it again in the specific plot code.
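If you only want a setting to apply to a single plot rather than globally, “plt.rc_context” accepts a dictionary of overrides and restores the old values when the “with” block ends. A minimal sketch with made-up bar values (the “Agg” line is only there so it runs without a display):

```python
import matplotlib
matplotlib.use('Agg')  # headless backend for this sketch; drop this line in a notebook
import matplotlib.pyplot as plt

size_before = list(plt.rcParams['figure.figsize'])

# the override is only active inside the with block
with plt.rc_context({'figure.figsize': (7, 5)}):
    fig, ax = plt.subplots()
    ax.barh(['Book', 'Games'], [2.5, 3.7])  # made-up values
    size_inside = [float(v) for v in fig.get_size_inches()]

size_after = list(plt.rcParams['figure.figsize'])
print(size_before, size_inside, size_after)
plt.close(fig)
```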

So let’s move on to some more settings that we can update:

# Update the default font used
plt.rcParams['font.family'] = 'sans-serif'
# assumed font name: substitute any sans-serif font installed on your system
plt.rcParams['font.sans-serif'] = 'Helvetica'

# set the style of the axes and the text color
# ('#333F4B' and 0.8 are assumed values: choose any colour and line width you like)
plt.rcParams['axes.edgecolor'] = '#333F4B'
plt.rcParams['axes.linewidth'] = 0.8
plt.rcParams['xtick.color'] = '#333F4B'
plt.rcParams['ytick.color'] = '#333F4B'
plt.rcParams['text.color'] = '#333F4B'

fig, ax = plt.subplots()

# create a horizontal line for each genre that starts at x = 0, with its length
# given by the average user_rating value for that genre
plt.hlines(y=bottom_10.index, xmin=0, xmax=bottom_10, color='#007ACC', alpha=0.2, linewidth=8)

# draw a dot for each genre at its average user_rating value
plt.plot(bottom_10, bottom_10.index, "o", markersize=8, color='#007ACC', alpha=0.6)

# set labels
ax.set_xlabel('User Rating', fontsize=20, fontweight='black', color = '#002147')
ax.set_ylabel('')

# set axis
ax.tick_params(axis='both', which='major', labelsize=16)


# add a horizontal label for the y axis
fig.text(-0.10, 0.96, 'Genre', fontsize=20, fontweight='black', color = '#002147')

# change the style of the axis spines
ax.spines['top'].set_color('none')
ax.spines['right'].set_color('none')
# limit the remaining spines to the plotted range (set_smart_bounds(True) did this
# automatically in older Matplotlib, but it was removed in version 3.4)
ax.spines['left'].set_bounds(0, len(bottom_10) - 1)
ax.spines['bottom'].set_bounds(0, bottom_10.max())

So that has ended up looking significantly different from what we would expect from using one of the base stylesheets. It’s a little long winded and I am not sure you would want to write out that much additional code for each and every figure you wanted to plot…in actual fact it’s not THAT hard to create your own style sheet and use it just as you would any of the already included base style sheets (more on that in another post one feels!).
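One lightweight way to reuse a set of custom settings without repeating them for every figure is to keep them in a dictionary and hand that to matplotlib's style machinery, which accepts a dictionary of “rcParams” as well as a style name. This is only a sketch: the name “custom_style” and the particular colours and values are assumptions for illustration, not the settings used above.

```python
import matplotlib
matplotlib.use('Agg')  # headless backend for this sketch; drop this line in a notebook
import matplotlib.pyplot as plt

# hypothetical custom style: the name and all values are illustrative assumptions
custom_style = {
    'axes.edgecolor': '#333F4B',
    'axes.linewidth': 0.8,
    'axes.spines.top': False,
    'axes.spines.right': False,
    'xtick.color': '#333F4B',
    'ytick.color': '#333F4B',
}

# apply the dictionary just like a named style, here only within the with block
with plt.style.context(custom_style):
    fig, ax = plt.subplots(figsize=(6, 3))
    ax.barh(['Book', 'Games'], [2.5, 3.7])  # made-up values
    top_visible = ax.spines['top'].get_visible()

# outside the block the global defaults are back in place
top_default = plt.rcParams['axes.spines.top']
print(top_visible, top_default)
plt.close(fig)
```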

So now say we wanted to visualise the difference between each genre’s average score and the overall average score across all genres – how would we go about doing that?

The first thing we need to do of course is calculate the data we want to plot, so it’s back to Pandas for a moment.

Luckily it’s just a super simple one-liner to subtract the mean user rating value across all genres from each individual genre rating value – we are then left with the difference.

We will cast the series into a DataFrame this time to allow us to append new columns on in a second.

genre_rating_diff = (genre_rating - genre_rating.mean()).to_frame()

We then create a new column to hold the colour name string depending on whether the value is positive (‘darkgreen’) or negative (‘red’).

genre_rating_diff['colours'] = np.where(genre_rating_diff['user_rating'] < 0, 'red', 'darkgreen')

genre_rating_diff.sort_values('user_rating',inplace=True)
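To see the difference calculation and the colour assignment in isolation, here is a minimal sketch on three made-up genre averages. It assumes NumPy's “where” is used for the positive/negative split:

```python
import numpy as np
import pandas as pd

# made-up genre averages for illustration
toy_rating = pd.Series({'Book': 2.5, 'Games': 3.5, 'Music': 4.5}, name='user_rating')

# subtract the mean across genres, then turn the Series into a DataFrame
toy_diff = (toy_rating - toy_rating.mean()).to_frame()

# 'red' where the difference is negative, 'darkgreen' otherwise
toy_diff['colours'] = np.where(toy_diff['user_rating'] < 0, 'red', 'darkgreen')
print(toy_diff)
```

Note that a difference of exactly zero falls into the “darkgreen” group with this condition.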

Below we code the actual plot – it involves zipping up and iterating through the user_rating data and the genre index to create the data point labels. The borders of the plot are then lightened and a title and x-label are added.

# Draw plot
plt.figure(figsize=(10,10), dpi= 80)
plt.scatter(genre_rating_diff['user_rating'], genre_rating_diff.index, s=350, alpha=.6, color=genre_rating_diff['colours'])
for x, y, tex in zip(genre_rating_diff['user_rating'], genre_rating_diff.index, genre_rating_diff['user_rating']):
    t = plt.text(x, y, round(tex, 1), horizontalalignment='center', 
                 verticalalignment='center', fontdict={'color':'white'})
    
# Decorations
# Lighten borders
plt.gca().spines["top"].set_alpha(.3)
plt.gca().spines["bottom"].set_alpha(.3)
plt.gca().spines["right"].set_alpha(.3)
plt.gca().spines["left"].set_alpha(.3)

plt.yticks(genre_rating_diff.index, genre_rating_diff.index)
plt.title('Diverging Dotplot of Genre User Rating', fontdict={'size':20})
plt.xlabel('User Rating')
plt.grid(linestyle='--', alpha=0.5)
plt.xlim(-1.5, 1.5)
plt.show()

Finally for this post (which I admit has turned out to be rather random in nature) I thought I would involve the use of Seaborn to illustrate how the package can be seamlessly combined with matplotlib to produce some rather pretty charts. We begin by importing the Seaborn module, followed by defining two “kdeplot”s to visualise the distribution of both “user_rating” and “user_rating_ver”, which I believe represent the overall user ratings the app has received vs the user ratings the app has received for the latest version number/release.

import seaborn as sns

# Draw Plot
plt.figure(figsize=(16,10), dpi= 80)
sns.kdeplot(df['user_rating'], color="g", label="User Rating", alpha=.7)
sns.kdeplot(df['user_rating_ver'], color="deeppink", label="User Rating Version", alpha=.7)

# Decoration
plt.title('Density Plot of User Ratings vs Latest Version Specific User Ratings', fontsize=22)
plt.xlabel('User Rating')
plt.legend()
plt.show()

I hope this post has gone some small way to illustrating how flexible matplotlib can be, even if it does take a little bit of effort to climb its (relatively steep) learning curve.

I’ll leave it here for now as it’s turning into a bit of a mish mash!!

Until next time…

3 comments

Anand prabhakar April 30, 2019 - 10:58 pm

Thank you sir.
It helped me very much for understanding
My missing skills.

s666 April 30, 2019 - 11:05 pm

Hi Anand – you’re very welcome, I hope it was at least somewhat useful!! If there are any other topics or areas you think I should cover then feel free to comment and make suggestions. 😉

simpliv May 14, 2019 - 12:19 pm

Worth reading.Informative
