Hippocampus's Garden

Under the sea, in the hippocampus's garden...

Meet Pandas: Query Dataframe

August 25, 2020  |  2 min read  |  21 views

  • このエントリーをはてなブックマークに追加

🐼Welcome back to the “Meet Pandas” series (a.k.a. my memorandum for learning Pandas)!🐼

Last time, I discussed grouping and several types of boxplot functions.

Today, I’m going to briefly summarize how to extract a piece of data from a dataframe by specifying some conditions.

Load Example Data

As before, I use the “tips” dataset provided by seaborn. This is a data of food servers’ tips in restaurants with six factors that might influence tips.

import pandas as pd
import seaborn as sns
sns.set()

df = sns.load_dataset('tips')
df

The dataframe should look something like this:

2020 08 25 12 28 32

Let’s say we want a subset of records where 2tip<32 \leq \mathrm{tip} < 3 and day{Sat,Sun}\mathrm{day} \in \{\mathrm{Sat},\mathrm{Sun}\}.

Boolean Indexing

You might solve this problem by Boolean indexing (if you get 28 records, your query is probably correct).

df[ (df['tip']>2) & (df['tip']<3)  & (df['day'].isin(['Sat', 'Sun'])) ]

But, have you felt that this is a little cumbersome and easy to cause mistakes? I have. The common mistakes include:

df[ (2<df['tip']<3)  & (df['day'].isin(['Sat', 'Sun'])) ]
# >> ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().

df[ (df['tip']>2) and (df['tip']<3)  and (df['day'].isin(['Sat', 'Sun'])) ]
# >> ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().

df[ (df['tip']>2) & (df['tip']<3)  & (df['day'] in ['Sat', 'Sun']) ]
# >> ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().

query Method

Pandas has query method, which is easier to read and write than Boolean indexing. This method allows you to specify querying conditions in an SQL-like manner.

df.query('2 < tip < 3 and day in ["Sat", "Sun"]')

I usually prefer query to Boolean indexing because its SQL-like syntax enhances readability and reduces potential errors. But query is not always the better choice. Please note the following:

  • If the column name includes spaces or periods, you can’t use query
  • pandas.Series objects do not have query method
  • Boolean indexing is a bit faster than query

References

[1] pandas.DataFrame.query — pandas 1.1.1 documentation
[2] pandas.DataFrameの行を条件で抽出するquery | note.nkmk.me
[3] pandasで複数条件のAND, OR, NOTから行を抽出(選択) | note.nkmk.me


  • このエントリーをはてなブックマークに追加

Written by Shion Honda. If you like this, please share!