Elo vs Bradley-Terry: Which is Better for Comparing the Performance of LLMs?
March 17, 2024 | 4 min read
Chatbot Arena updated its LLM ranking method from Elo to Bradley-Terry. What changed? Let's dig into the differences.
Under the sea, in the hippocampus's garden...
Discover the power of Flask's Server-Sent Events for a better developer experience when building chatbots.
Two ways to calculate a color histogram: OpenCV-based and PyTorch-based.
An optimized NumPy implementation of a top-k function.
This post shows how to convert a DataFrame of user-item interactions to a compressed sparse row (CSR) matrix, the most common format for sparse matrices.
It's so easy for me to forget how to set up Jupyter in a newly created Poetry / Pipenv environment. So, here it is.
This post steps forward to multiple linear regression. The method of least squares is revisited, this time with linear algebra.
This post summarizes the basics of simple linear regression: the method of least squares and the coefficient of determination.
Is the sample correlation coefficient an unbiased estimator? No! This post visualizes how large its bias is and shows how to fix it.
The correlation coefficient is a familiar statistic, but there are several variations whose differences should be noted. This post recaps the definitions of these common measures.
When you sample from a finite population without replacement, beware the finite population correction. The samples are not independent of each other.
What is unbiased sample variance? Why divide by n-1? With a little programming with Python, it's easier to understand.
Let's re-implement face swapping in 10 minutes! This post shows a naive solution using a pre-trained CNN and OpenCV.
Lightweight GAN has opened the way for generating fine images with ~100 training samples and affordable computing resources. This post presents "This Sushi Does Not Exist" and how I built it with GAE.
If you want to use a custom loss function with a modern GBDT model, you'll need the first- and second-order derivatives. This post shows how to implement them, using LightGBM as an example.
This post introduces how to sample groups from a dataset, which is helpful when you want to avoid data leakage.
This post compares the behaviors of different feature importance measures in tricky situations.
This post introduces the Pandas `query` method, which allows us to query DataFrames in an SQL-like manner.
This post introduces PFRL, a new reinforcement learning library, and uses it to learn to play the Slime Volleyball game on Colaboratory.
This post summarizes how to group data by a variable and draw boxplots for each group using Pandas and Seaborn.
Double descent is one of the mysteries of modern machine learning. I reproduced the main results of the recent paper by Nakkiran et al. and posed some questions that occurred to me.
Have you ever confused the Pandas methods `loc`, `at`, and `iloc` with each other? They are no longer confusing once you have this table in mind.
How does Google's PageRank work? Its theory and algorithm are explained, followed by numerical experiments.
Want to generate realistic images with a single GPU? This post demonstrates how to downsize StyleGAN2 with only slight performance degradation.