Hippocampus's Garden

Under the sea, in the hippocampus's garden...


    Meet Pandas: Grouping and Boxplot

    June 14, 2020  |  4 min read  |  2,324 views


    🐼Welcome to the “Meet Pandas” series (a.k.a. my memorandum of understanding Pandas)!🐼

    Last time, I discussed the differences between the Pandas indexers loc, iloc, at, and iat.

    Today, I summarize how to group data by a variable and draw a boxplot for each group using Pandas and Seaborn. Let’s begin!

    Load Example Data

    In this post, I use the “tips” dataset provided by seaborn. It records the tips that restaurant servers received, along with six factors that might influence them.

    The snippets in this post are meant to be run in a notebook environment such as Jupyter Notebook or Colaboratory.

    import pandas as pd
    import seaborn as sns
    sns.set()
    
    df = sns.load_dataset('tips')
    df

    The dataframe should look something like this:

    [Figure: the tips dataframe as rendered in the notebook]

    Group by Categorical or Discrete Variable

    First, let’s group by the categorical variable time and create a boxplot for tip. It takes just two pandas methods, groupby and boxplot.

    df.groupby("time").boxplot(column="tip");

    [Figure: pandas boxplots of tip, one panel per time group]

    * You can also group by discrete variables in the same way.
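
    To see the numbers behind such a plot, the same groupby() call can feed describe(), which reports the quartiles each box is drawn from. The sketch below is only an illustration on a made-up frame (the values are hypothetical, not taken from the tips dataset), grouped by a discrete variable named size.

```python
import pandas as pd

# Hypothetical toy data: party size (discrete) and tip amount
toy = pd.DataFrame({
    "size": [2, 2, 2, 2, 4, 4, 4, 4],
    "tip": [1.0, 2.0, 3.0, 4.0, 2.0, 4.0, 6.0, 8.0],
})

# One row of summary statistics per group; the 25%, 50% and 75% columns
# correspond to the bottom edge, median line and top edge of each box
summary = toy.groupby("size")["tip"].describe()
print(summary[["25%", "50%", "75%"]])
```

    On the real data, replacing describe() with boxplot(column="tip") draws these summaries instead of printing them.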

    It’s not bad, but maybe too simple. If you want to make it prettier, use seaborn’s boxplot().

    sns.boxplot(x="time", y="tip", data=df);

    [Figure: seaborn boxplot of tip by time]

    Alternatively, catplot() with kind="box" produces the same plot.

    sns.catplot(x="time", y="tip", kind="box", data=df);

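    [Figure: seaborn catplot (kind="box") of tip by time]

    The figure comes out a slightly different size because catplot() is a figure-level function: it creates its own figure with its own default size, whereas boxplot() draws onto the current matplotlib axes.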
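
    The size difference comes from where each function draws: boxplot() is axes-level and uses whatever matplotlib axes it is given, while catplot() is figure-level and builds its own figure from its height and aspect parameters. A minimal sketch of making the two match, assuming a small made-up frame in place of the tips data:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend for scripts; drop this line in a notebook
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Hypothetical toy data standing in for the tips dataset
toy = pd.DataFrame({
    "time": ["Lunch"] * 4 + ["Dinner"] * 4,
    "tip": [1.0, 2.0, 2.5, 3.0, 2.0, 3.0, 4.0, 5.5],
})

# Axes-level: the figure size is whatever we give the matplotlib figure
fig, ax = plt.subplots(figsize=(6, 4))
sns.boxplot(x="time", y="tip", data=toy, ax=ax)

# Figure-level: catplot() creates the figure itself; width = height * aspect
g = sns.catplot(x="time", y="tip", kind="box", data=toy, height=4, aspect=1.5)

print(fig.get_size_inches(), g.figure.get_size_inches())
```

    The g.figure attribute assumes a reasonably recent seaborn; older releases expose the same object as g.fig.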

    Other Distribution Plots

    For larger datasets, boxenplot() gives more information about the shape of the distribution.

    sns.boxenplot(x="time", y="tip", data=df);

    [Figure: seaborn boxenplot of tip by time]

    violinplot() combines a boxplot with a kernel density estimate.

    sns.violinplot(x="time", y="tip", data=df);

    [Figure: seaborn violinplot of tip by time]

    Group by Continuous Variable

    Next, let’s group by the continuous numerical variable total_bill and create a boxplot for tip. What happens if I use seaborn’s boxplot() function in the same way as above?

    sns.boxplot(x="total_bill", y="tip", data=df);

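    [Figure: seaborn boxplot of tip with one box per distinct total_bill value]

    It divides the data into too many groups, one per distinct value of total_bill, so the plot doesn’t really make sense. I should have binned the data first with the pandas cut() function.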

    df["bin"] = pd.cut(df["total_bill"], 3)
    sns.boxplot(x="bin", y="tip", data=df);

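    [Figure: seaborn boxplot of tip for three equal-width total_bill bins]

    cut() makes bins of equal width, so they can hold very different numbers of rows. Use qcut() (quantile-based cut) if you want bins with roughly the same number of rows each.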
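
    To make the binning step concrete, here is a small sketch of what cut() returns. The bills are invented for illustration (they are not the tips data), and the labels low/mid/high are my own choice.

```python
import pandas as pd

# Hypothetical bills, skewed toward small values
bills = pd.Series([5.0, 8.0, 10.0, 12.0, 15.0, 18.0, 20.0, 35.0, 50.0],
                  name="total_bill")

# Three equal-width intervals spanning the observed range (5 to 50)
binned = pd.cut(bills, 3)
counts = binned.value_counts().sort_index()
print(counts)

# labels= names the bins; retbins=True also returns the bin edges
labeled, edges = pd.cut(bills, 3, labels=["low", "mid", "high"], retbins=True)
print(edges)
```

    Because the intervals are equally wide, most of these skewed values land in the first bin.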

    df["qbin"] = pd.qcut(df["total_bill"], 3)
    sns.boxplot(x="qbin", y="tip", data=df);

    [Figure: seaborn boxplot of tip for three quantile-based total_bill bins]
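
    The difference between the two binning functions shows up in the group sizes. The following sketch compares them on an invented, skewed series (hypothetical values, not the tips data).

```python
import pandas as pd

# Hypothetical bills, skewed toward small values
bills = pd.Series([5.0, 8.0, 10.0, 12.0, 15.0, 18.0, 20.0, 35.0, 50.0],
                  name="total_bill")

# cut(): equal-width intervals, so group sizes follow the data's skew
width_counts = pd.cut(bills, 3).value_counts().sort_index()

# qcut(): edges placed at quantiles, so each group gets about the same number of rows
quantile_counts = pd.qcut(bills, 3).value_counts().sort_index()

print(width_counts.tolist(), quantile_counts.tolist())
```

    With qcut() every box in the plot is backed by a similar amount of data, at the cost of intervals with unequal widths.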

    References

    [1] pandas.core.groupby.DataFrameGroupBy.boxplot — pandas 1.0.4 documentation
    [2] seaborn.boxplot — seaborn 0.10.1 documentation
    [3] Plotting with categorical data — seaborn 0.10.1 documentation



    Written by Shion Honda. If you like this, please share!


    Hippocampus's Garden © 2021, Shion Honda. Built with Gatsby