Meet Pandas: Converting DataFrame to CSR Matrix
March 08, 2022
Welcome back to the 🐼Meet Pandas🐼 series (a.k.a. my memorandum for learning Pandas)!
When working with big data, we often encounter interactions between users and items. Examples of such data include:
- User ratings of movies, restaurants, or merchandise
- Number of times a song or video is played by each user
Representing these data as a dense matrix, where each row represents a user and each column represents an item, can lead to prohibitively large memory consumption. And, since interaction data are usually sparse, there must be more efficient ways to store the data. In such cases, representing the data as a sparse matrix is a good choice.
In this post, I will briefly show how to convert a DataFrame of user-item interactions to a compressed sparse row (CSR) matrix, one of the most common formats for sparse matrices.
As an example, we use the MovieLens dataset provided here. This is a collection of movie ratings by users. Each movie is rated by only a small fraction of users, so this is a perfect use case for a sparse matrix. MovieLens can be loaded with the following code:
```python
from urllib.request import urlretrieve
import zipfile

import pandas as pd

# Download and extract the MovieLens 100K archive
urlretrieve("http://files.grouplens.org/datasets/movielens/ml-100k.zip", "movielens.zip")
with zipfile.ZipFile("movielens.zip", "r") as zip_ref:
    zip_ref.extractall()

# u.data is tab-separated and has no header row, so column names are given explicitly
df = pd.read_csv(
    "ml-100k/u.data",
    sep="\t",
    names=["user_id", "movie_id", "rating", "timestamp"],
    encoding="latin-1",
)
df
```
The dataframe should look something like this (a screenshot from Colaboratory):
To convert a DataFrame to a CSR matrix, you first need to create indices for users and movies. Then, you can perform the conversion with the sparse.csr_matrix constructor. It is a bit faster to convert via a coordinate (COO) matrix.
```python
from pandas.api.types import CategoricalDtype
from scipy import sparse

users = df["user_id"].unique()
movies = df["movie_id"].unique()
shape = (len(users), len(movies))

# Create indices for users and movies
user_cat = CategoricalDtype(categories=sorted(users), ordered=True)
movie_cat = CategoricalDtype(categories=sorted(movies), ordered=True)
user_index = df["user_id"].astype(user_cat).cat.codes
movie_index = df["movie_id"].astype(movie_cat).cat.codes

# Conversion via COO matrix
coo = sparse.coo_matrix((df["rating"], (user_index, movie_index)), shape=shape)
csr = coo.tocsr()
```
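To see what the category codes do, here is a minimal sketch on a made-up DataFrame (the IDs and ratings are hypothetical, chosen to be non-contiguous). It builds the matrix both directly with sparse.csr_matrix and via COO, checks that the two routes agree, and shows how row and column positions map back to the original IDs. It makes no claim about which route is faster.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype
from scipy import sparse

# Hypothetical toy ratings; the IDs are deliberately not contiguous
toy = pd.DataFrame({
    "user_id": [10, 10, 42, 7],
    "movie_id": [500, 3, 500, 99],
    "rating": [5, 3, 4, 1],
})

user_cat = CategoricalDtype(categories=sorted(toy["user_id"].unique()), ordered=True)
movie_cat = CategoricalDtype(categories=sorted(toy["movie_id"].unique()), ordered=True)

# Category codes are 0-based positions in the sorted ID lists
rows = toy["user_id"].astype(user_cat).cat.codes.to_numpy()
cols = toy["movie_id"].astype(movie_cat).cat.codes.to_numpy()
data = toy["rating"].to_numpy()
shape = (len(user_cat.categories), len(movie_cat.categories))

# Route 1: build the CSR matrix directly from (data, (row, col))
csr_direct = sparse.csr_matrix((data, (rows, cols)), shape=shape)
# Route 2: build a COO matrix first, then convert
csr_via_coo = sparse.coo_matrix((data, (rows, cols)), shape=shape).tocsr()

# Both routes should produce the same matrix
same = bool((csr_direct.toarray() == csr_via_coo.toarray()).all())

# Row i corresponds to row_to_user[i], column j to col_to_movie[j]
row_to_user = list(user_cat.categories)
col_to_movie = list(movie_cat.categories)
```

Keeping user_cat and movie_cat around is what lets you translate a matrix position back into a user or movie ID later.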
For reference, the figure below compares a dense matrix with its COO and CSR representations:
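The same comparison can be sketched in code on a tiny matrix. The 3x3 matrix here is a hypothetical example, not the one from the figure. COO stores one (row, column, value) triplet per nonzero entry, while CSR replaces the row array with row pointers.

```python
import numpy as np
from scipy import sparse

# Hypothetical 3x3 matrix with four nonzero entries
dense = np.array([
    [0, 1, 0],
    [3, 0, 5],
    [0, 0, 4],
])

coo = sparse.coo_matrix(dense)
csr = coo.tocsr()

# COO: parallel arrays of row indices, column indices, and values
print(coo.row.tolist())     # [0, 1, 1, 2]
print(coo.col.tolist())     # [1, 0, 2, 2]
print(coo.data.tolist())    # [1, 3, 5, 4]

# CSR: the entries of row i are indices[indptr[i]:indptr[i + 1]]
print(csr.indptr.tolist())  # [0, 1, 3, 4]
print(csr.indices.tolist()) # [1, 0, 2, 2]
print(csr.data.tolist())    # [1, 3, 5, 4]
```

The indptr array has one more element than the number of rows, so its size depends on the row count rather than on the number of nonzero entries.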
For sparse data, the CSR format consumes far less memory than the equivalent dense matrix. Also, SciPy’s CSR matrix is compatible with many other libraries such as scikit-learn and XGBoost. For a more detailed explanation of sparse matrices, I refer readers to this post.
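As a rough check of the memory claim, the sketch below compares the byte counts of a dense array and its CSR counterpart. The matrix size, the 1% density, and the random ratings are all assumptions made for illustration, not MovieLens figures; the actual saving depends on how sparse your data is.

```python
import numpy as np
from scipy import sparse

rng = np.random.default_rng(0)
n_rows, n_cols, n_nonzero = 1000, 1000, 10_000

# Hypothetical data: ratings 1-5 placed in randomly chosen distinct cells
flat_positions = rng.choice(n_rows * n_cols, size=n_nonzero, replace=False)
row_idx, col_idx = np.unravel_index(flat_positions, (n_rows, n_cols))
dense = np.zeros((n_rows, n_cols), dtype=np.float64)
dense[row_idx, col_idx] = rng.integers(1, 6, size=n_nonzero)

csr = sparse.csr_matrix(dense)

# A dense array stores every cell; CSR stores only values, column indices, and row pointers
dense_bytes = dense.nbytes
csr_bytes = csr.data.nbytes + csr.indices.nbytes + csr.indptr.nbytes
density = csr.nnz / (n_rows * n_cols)

print(f"density: {density:.2%}")
print(f"dense: {dense_bytes:,} bytes, CSR: {csr_bytes:,} bytes")
```

The dense array always costs rows x columns x 8 bytes for float64, whereas the CSR cost grows with the number of nonzero entries.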
A CSR matrix can be converted back to a COO matrix with the .tocoo() method, and it can likewise be converted to a dense matrix.
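A minimal sketch of both conversions on a small hypothetical matrix. For the dense direction, SciPy provides .toarray() (returns a NumPy ndarray) and .todense() (returns a numpy.matrix for csr_matrix); which of the two the post has in mind is an assumption here.

```python
import numpy as np
from scipy import sparse

# Hypothetical small matrix, built as CSR
csr = sparse.csr_matrix(np.array([
    [0, 1, 0],
    [3, 0, 5],
    [0, 0, 4],
]))

coo_again = csr.tocoo()        # back to COO format
dense_array = csr.toarray()    # dense NumPy ndarray
dense_matrix = csr.todense()   # dense numpy.matrix

print(coo_again.format)        # coo
print(type(dense_array).__name__)
print(type(dense_matrix).__name__)
```

Converting a large sparse matrix to dense form brings back the memory cost the sparse format was meant to avoid, so it is best reserved for small matrices or slices.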
Written by Shion Honda.