Hippocampus's Garden

Under the sea, in the hippocampus's garden...


    Meet Pandas: Group-wise Sampling

    October 13, 2020  |  3 min read  |  682 views


    🐼Welcome back to the “Meet Pandas” series (a.k.a. my memorandum for learning Pandas)!🐼

    Last time, I discussed query, DataFrame's easy-to-read selection method.

    Today, I introduce how to sample groups, that is, how to split a dataset group-wise. This helps when you want to avoid data leakage.

    Create Example Data

    Suppose we are developing a user-to-item recommender model and have a dataset of 1,000,000 user-item interactions covering 10,000 unique users and 256 unique items, with a conversion flag for each record. Let’s synthesize this dataset with itertools:

    from itertools import product
    import numpy as np
    import pandas as pd
    
    str_numbers = "".join(map(str, range(10)))
    user_ids = list(map("".join, product(str_numbers, repeat=4)))
    # ['0000', '0001', ..., '9999']
    item_ids = list(map("".join, product("ABCD", repeat=4)))
    # ['AAAA', 'AAAB', ..., 'DDDD']
    
    num_records = 10**6
    df = pd.DataFrame({'user_id': np.random.choice(user_ids, num_records),
                       'item_id': np.random.choice(item_ids, num_records),
                       'conversion': np.random.choice([0, 1], num_records)})
    df

    The resulting dataframe has 1,000,000 rows with three columns: user_id, item_id, and conversion.
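    A quick sanity check along these lines (an illustrative sketch; the exact rows vary from run to run because the data is random) confirms the size and the cardinalities:

    print(df.shape)                 # (1000000, 3)
    print(df["user_id"].nunique())  # 10000
    print(df["item_id"].nunique())  # 256
    df.head()                       # inspect the first few rows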

    Random Sampling Leads to Data Leakage

    This dataset might be too bulky to handle, so you might feel like downsampling it before training a model, like this:

    df.sample(frac=0.01)


    But in this case, random sampling leads to data leakage because this dataset includes multiple records for each user (100 on average), so the same users end up in both the sampled data and the rest. To avoid this, we should split the dataset group-wise. But how?
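    To see the problem concretely, here is a minimal sketch that takes a plain 1% random sample and checks how many users end up on both sides (the names sample, rest, and overlap are mine):

    sample = df.sample(frac=0.01, random_state=0)
    rest = df.drop(sample.index)
    # With ~100 records per user, almost every sampled user also has records
    # in the remaining data, so a model trained on `rest` has already seen
    # the users it would be evaluated on.
    overlap = set(sample["user_id"]) & set(rest["user_id"])
    print(len(overlap))  # thousands of users appear in both sets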


    Group-wise Sampling

    Again, this dataset contains 10,000 unique users.

    df["user_id"].nunique()
    # >> 10000

    Let’s say we are sampling 100 users to obtain a set of approximately 10,000 records. This is achieved by chaining groupby and filter.

    df_sampled = df.groupby('user_id').filter(lambda _: np.random.rand() < 0.01)
    df_sampled["user_id"].nunique()
    # >> 107


    This solution isn’t bad, but it doesn’t guarantee that the sampled set contains exactly 100 users: each of the 10,000 users is kept independently with probability 0.01, so the number of sampled users only averages 100. This can be a problem when the number of users is small. If you want to sample exactly 100 users, you should explicitly sample them before filtering.

    sampled_users = np.random.choice(df["user_id"].unique(), 100, replace=False)  # replace=False guarantees 100 distinct users
    df_sampled = df.groupby('user_id').filter(lambda x: x["user_id"].values[0] in sampled_users)
    df_sampled["user_id"].nunique()
    # >> 100
    # Wall time: 1.8 s

    This solution always returns a sampled dataset of exactly 100 users! But there is still room for improvement: the following code does the same thing about 10 times faster.

    sampled_users = np.random.choice(df["user_id"].unique(), 100, replace=False)  # sample 100 distinct users without replacement
    df_sampled = df.query('user_id in @sampled_users')
    df_sampled["user_id"].nunique()
    # >> 100
    # Wall time: 117 ms
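
    The same idea extends to a group-wise train/validation split: hold out the sampled users entirely and train on the remaining users. Here is a minimal sketch along those lines (the names held_out_users, df_train, and df_valid are mine):

    held_out_users = np.random.choice(df["user_id"].unique(), 100, replace=False)
    df_valid = df.query('user_id in @held_out_users')
    df_train = df.query('user_id not in @held_out_users')
    # No user appears in both sets, so there is no user-level leakage.
    assert not (set(df_train["user_id"]) & set(df_valid["user_id"]))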

    If you are not familiar with the query method, take a look at my last post on Pandas.

    References

    [1] Group by: split-apply-combine — pandas 1.1.3 documentation
    [2] itertools — Functions creating iterators for efficient looping — Python 3.9.0 documentation



    Written by Shion Honda. If you like this, please share!


    Hippocampus's Garden © 2021, Shion Honda. Built with Gatsby