Seaborn Module and Python – Distribution Plots

by s666

I thought for this post I would look into the Seaborn library – Seaborn is a statistical plotting library and is built on top of Matplotlib. It has really nice looking default plotting styles and also works really well with Pandas DataFrames – so we can leverage the work we have done with Pandas in previous blog posts and hopefully create some great plots.

Seaborn can be installed just like any other Python package by using “pip”. Go to your command line and run:

pip install seaborn

The official documentation page for Seaborn can be found here and a lovely looking gallery page showing examples of what is possible with Seaborn can be found here. You can click on any of the images on the gallery page and it will present you with example code on how to produce that particular plot. Another important page is the API page, which references the various available plot types – this can be found here.

I am going to try to break the Seaborn capabilities down into various categories, and begin with the plots that allow us to visualise the distribution of a data set.

Distribution Plots

Let’s begin with our imports and load our data. I am going to be using the same “Financial Sample.xlsx” data that I have been using in the last couple of data analysis/business Python blog posts to keep some consistency. The Excel file can be downloaded below:

import pandas as pd
import seaborn as sns
#if using Jupyter Notebooks the below line allows us to display charts in the browser
%matplotlib inline

#load our data in a Pandas DataFrame 
df = pd.read_excel('Financial Sample.xlsx') 
#print first 5 rows of data to ensure it is loaded correctly 
df.head()

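Let’s first look at the “distplot” – this allows us to look at the distribution of a univariate set of observations (univariate just means one variable). Note that from Seaborn 0.11 onwards “distplot” is deprecated in favour of “histplot” and “displot”, although the ideas are the same.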

#plot the distribution of the DataFrame "Profit" column
sns.distplot(df['Profit'])

So we have a plot now of the distribution we were interested in – but as a quick starter, the style looks somewhat bland. Let’s give it a more common “Seaborn” styling in an attempt to make it look a bit nicer…a bit more worthy of “publishing” if needed.

#set the style we wish to use for our plots
sns.set_style("darkgrid")

#plot the distribution of the DataFrame "Profit" column
sns.distplot(df['Profit'])

So notice that we have managed to plot, with just one line of code, the histogram of the DataFrame data along with the “KDE” line – that is the kernel density estimation plot. We can remove the KDE if we add “kde=False” to the plot call. We can also alter the number of “bins” in the histogram as follows; in this instance they are set to 50:

sns.distplot(df['Profit'],kde=False,bins=50)

Let’s now look at a “jointplot” – this allows us to combine two distplots and deal with bivariate data. For this we need to specify which DataFrame columns we want to plot by passing in the column names, and also the actual DataFrame from which we are pulling the columns. Let’s say I want to plot the “Profit” column vs the “Units Sold” column; this can be done as follows:

sns.jointplot(x='Profit',y='Units Sold',data=df)

We now have a plot that shows the scatter plot between the two variable columns, along with their corresponding distribution plots on either side. In the Seaborn version used here it even gives us the Pearson correlation coefficient and p-value in the top right; newer Seaborn releases no longer print this annotation.
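If your Seaborn version does not print the correlation on the chart, the same two numbers can be worked out directly with SciPy. This is a small sketch on made-up stand-in data (the column relationship below is invented for illustration), not the figures from the spreadsheet.

```python
import numpy as np
from scipy import stats

# made-up stand-ins for the "Units Sold" and "Profit" columns
rng = np.random.default_rng(2)
units_sold = rng.uniform(200, 4500, size=700)
profit = 15 * units_sold + rng.normal(0, 20000, size=700)

# Pearson correlation coefficient and its two-sided p-value
r, p_value = stats.pearsonr(profit, units_sold)

# cross-check the coefficient against NumPy's correlation matrix
r_numpy = np.corrcoef(profit, units_sold)[0, 1]
```

With the real data the call would be stats.pearsonr(df['Profit'], df['Units Sold']).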

The jointplot also allows us to set an additional argument called “kind”. This allows you to affect how the main chart is represented. Currently it is a “scatter” as that is the default, but if we change it to “hex” for example, we get the following plot which represents the points on the chart as density hexagons – that is, the hexagons which contain more data points are shown darker than those which contain fewer points.

sns.jointplot(x='Profit',y='Units Sold',data=df,kind='hex')

Another argument we can put in for “kind” is “reg”, which stands for regression. This will look a lot like a scatter plot, but this time a linear regression line will be added:

sns.jointplot(x='Profit',y='Units Sold',data=df,kind='reg')
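To make the idea of the regression line a bit more concrete, here is a minimal sketch of an ordinary least-squares straight-line fit using NumPy. The data is made up with a known slope of 2 and intercept of 1, and this is just an illustration of the concept, not a claim about how Seaborn computes its line internally.

```python
import numpy as np

# made-up data with a known linear relationship plus some noise
rng = np.random.default_rng(3)
x = rng.uniform(0, 10, size=500)
y = 2.0 * x + 1.0 + rng.normal(0, 0.5, size=500)

# fit a degree-1 polynomial (a straight line) by least squares
slope, intercept = np.polyfit(x, y, deg=1)

# the fitted line evaluated at each x value
fitted = slope * x + intercept
```

The recovered slope and intercept should land close to the values used to generate the data.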

Yet another kind we can stipulate is “kde”, which will plot a two-dimensional KDE plot. This essentially shows you where the data points are most densely concentrated.

sns.jointplot(x='Profit',y='Units Sold',data=df,kind='kde')
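As a rough sketch of what a two-dimensional density estimate means, we can evaluate one ourselves with SciPy’s “gaussian_kde”. The points below are made up (a blob centred on the origin), so the estimated density should be higher in the middle than out at the edge.

```python
import numpy as np
from scipy import stats

# made-up bivariate data: row 0 holds the x values, row 1 the y values
rng = np.random.default_rng(4)
points = rng.normal(0, 1, size=(2, 500))

# fit a 2D Gaussian kernel density estimate to the points
kde = stats.gaussian_kde(points)

# evaluate the estimated density at the centre and far from the centre
density_centre = kde([[0.0], [0.0]])[0]
density_edge = kde([[3.0], [3.0]])[0]
```

The contour lines in the “kde” jointplot are level sets of a surface like this one.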

Ok let’s move on from jointplots and look at “pairplots”. These allow us to look at pairwise relationships across entire DataFrames (for numerical data) and also support a “hue” argument for categorical data points. So the pairplot is essentially going to create a jointplot for each possible combination of the numerical columns in the DataFrame. I am going to quickly create a new DataFrame that drops the “Month Number” and “Year” columns, as these aren’t really part of our continuous numerical data such as “Profit” and “COGS” (cost of goods sold) and wouldn’t make much sense if included in our pairplot. I’ll also drop a couple of the other columns to shrink our DataFrame so our output plot isn’t overly crowded.

#drop unwanted columns
new_df = df.drop(['Month Number','Year','Manufacturing Price','Sale Price'],axis=1)

sns.pairplot(new_df)

Note we basically have a scatter plot for each pair of columns, and on the diagonal we have a histogram of the distribution, as it wouldn’t make sense to plot the data against itself. This is a great way to quickly visualise our data. We can also add a “hue” – this is where we specify a categorical variable on which to split the data. Let’s add the “Segment” column as our “hue”.

sns.pairplot(new_df,hue='Segment')

Now the data points are coloured according to the categorical data, and the colour legend is shown in the right-hand margin of the plot. We can also change the colour palette that the plot uses by setting the “palette” argument. Below is an example using the “magma” colour scheme. All available schemes can be found on the Matplotlib site here.

sns.pairplot(new_df,hue='Segment',palette='magma')

The next plot we will look at is a “rugplot” – this will help us build and explain the “kde” plot that we created earlier, both in our distplot and when we passed “kind=kde” as an argument for our jointplot.

sns.rugplot(df['Profit'])

As seen above, for a rugplot we pass in the column we want to plot as our argument. The rugplot draws a dash mark for every point in our distribution. So the difference between a rugplot and a distplot is that the distplot involves the concept of “bins”: it adds up all the data points in each bin and plots this number, whereas the rugplot just plots a mark at each data point.

So let us now convert the rugplot into a KDE plot. KDE stands for “Kernel Density Estimation” and info can be found at the following Wiki page – here. The image below is a useful image in explaining how rugplots are built up into KDE plots.
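The difference between binning and marking individual points can be sketched with plain NumPy. The 30 random values below are made up for illustration; the histogram collapses them into a handful of counts, while the rug keeps one position per observation.

```python
import numpy as np

# 30 made-up observations
rng = np.random.default_rng(6)
x = rng.normal(0, 1, size=30)

# histogram view: count how many points fall into each of 5 bins
counts, edges = np.histogram(x, bins=5)

# rug view: one tick per observation, at the observation's own value
rug_positions = np.sort(x)
```

Every observation lands in exactly one bin, so the counts add up to the number of rug marks.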

We can build our own KDE plot from a set of data and rugplot if we so choose – let’s do that and see that it matches with the KDE plot created directly using the built in “kdeplot”

import numpy as np
import matplotlib.pyplot as plt
from scipy import stats
#trapezoid is the current SciPy name for the older trapz function
from scipy.integrate import trapezoid

#set up a set of 30 data points taken from the normal distribution
x = np.random.normal(0, 1, size=30)

#set the bandwidth for the KDE using the normal reference rule of thumb
bandwidth = 1.06 * x.std() * x.size ** (-1 / 5.)

#set the grid of x values over which the kernels are evaluated
support = np.linspace(-4, 4, 200)

#iterate through the data points and create kernels for each and then plot
#the kernels
kernels = []
for x_i in x:
    kernel = stats.norm(x_i, bandwidth).pdf(support)
    kernels.append(kernel)
    plt.plot(support, kernel, color="r")

sns.rugplot(x, color=".2", linewidth=3)

#sum the kernels, then normalise so the area under the curve is 1
#(integrating with the composite trapezoidal rule) and plot the KDE
density = np.sum(kernels, axis=0)
density /= trapezoid(density, support)
plt.plot(support, density)

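Now let’s plot the KDE plot using the built in “kdeplot”

#in Seaborn versions before 0.11 use shade=True instead of fill=True
sns.kdeplot(x, fill=True)

Great – we can see that the two curves closely match (any small differences come down to the bandwidth that “kdeplot” picks by default), so we have created our KDE plot correctly.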
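If you would rather check the match numerically than by eye, here is a sketch that compares the hand-built density with SciPy’s “gaussian_kde”, forcing SciPy to use the same bandwidth. The only assumption is the conversion of our bandwidth into the scaling factor that “gaussian_kde” expects; this is a cross-check against SciPy, not a statement about Seaborn’s internals.

```python
import numpy as np
from scipy import stats

# 30 made-up data points from a standard normal distribution
rng = np.random.default_rng(7)
x = rng.normal(0, 1, size=30)

# same rule-of-thumb bandwidth and evaluation grid as in the post
bandwidth = 1.06 * x.std() * x.size ** (-1 / 5.)
support = np.linspace(-4, 4, 200)

# manual KDE: the average of one normal pdf centred on each data point
manual_density = np.mean(
    [stats.norm(x_i, bandwidth).pdf(support) for x_i in x], axis=0
)

# gaussian_kde multiplies the sample standard deviation (ddof=1) by a
# factor, so choose the factor that reproduces our bandwidth exactly
factor = bandwidth / x.std(ddof=1)
kde = stats.gaussian_kde(x, bw_method=factor)
scipy_density = kde(support)

# largest pointwise gap between the two curves
max_abs_diff = np.max(np.abs(manual_density - scipy_density))
```

With matching bandwidths the two curves agree to floating point precision, which confirms that a KDE really is just the averaged kernels.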

Ok so I’ll leave this post here as we have covered most of the distribution plot capabilities – next post I will move on to categorical plots and see what Seaborn can offer there. Until then…

2 comments

spwcd July 29, 2018 - 5:39 pm

Hi, thanks for the tutorial.
I have a question: why did you use the close value to calculate things like correlation instead of percent change?

S666 July 29, 2018 - 5:48 pm

Hi there, the data I am using isn’t stock data or time series data per se, so there is no “close price” and there is no need to calculate percentage change to create scatter plots etc.

If it had been time series stock price data, for example, then yes, it may in certain circumstances make more sense to create a scatter plot of percentage change rather than closing price.
