Loss Functions for Ordinal Regression
January 04, 2025 | 9 min read
In traditional classification problems, there is often no natural “order” among classes (e.g., {dog, cat, bird}). However, in cases like restaurant ratings (0-5) or clothing sizes (XS < S < M < L < XL), there exists a natural ordinal relationship among the classes. Tasks that handle such ordered relationships are called ordinal regression.
Because ordinal regression has an order structure among classes that can be exploited during learning, there are dedicated loss functions that differ from those used in standard classification or regression tasks. This article introduces three popular approaches to ordinal regression.
Threshold-Based Approach
The simplest approach decomposes the ordinal classes {0,1,2,…,K−1} into K−1 binary classification tasks based on thresholds. For example, for 3 classes (0, 1, 2), the tasks are:
- Task 1: Is the class greater than or equal to 1?
- Task 2: Is the class greater than or equal to 2?
At inference, the final class is determined as follows:
- If Task 1 outputs 0, the predicted class is 0.
- If Task 1 outputs 1 and Task 2 outputs 0, the predicted class is 1.
- If Task 1 and Task 2 both output 1, the predicted class is 2.
This approach is relatively simple and intuitive.
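Before the full implementation later in this post, here is a minimal sketch of the encoding and decoding for an arbitrary number of classes K. The helper names are illustrative, not from the implementation below; decoding by counting positive answers coincides with the rule above whenever the binary outputs are consistent.
import torch

def encode_ordinal_targets(y, num_classes):
    # y: (N,) integer labels in {0, ..., K-1}
    # Returns (N, K-1) binary targets; column k answers "is y >= k+1?"
    thresholds = torch.arange(1, num_classes)  # [1, 2, ..., K-1]
    return (y.unsqueeze(1) >= thresholds).float()

def decode_ordinal_predictions(probs, threshold=0.5):
    # probs: (N, K-1) predicted probabilities for the "is y >= k+1?" tasks
    # The predicted class is the number of tasks answered "yes".
    return (probs >= threshold).sum(dim=1)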
Soft Label Loss
This approach assigns probability mass not only to the true class but also to neighboring classes, using a distance-based weighting. The resulting "soft label" distribution is then used as the target when minimizing the cross-entropy loss against the model's predicted distribution.
For example, for classes {0, 1, 2} with the true class $t$, the soft label for each class $i$ is defined as:

$$q_i = \frac{\exp(-d(t, i))}{\sum_{j} \exp(-d(t, j))}$$

where $d(\cdot, \cdot)$ is a distance metric (e.g., absolute error $d(t, i) = |t - i|$). The cross-entropy loss between the predicted distribution $p$ and the soft label distribution $q$ is minimized:

$$\mathcal{L} = -\sum_{i} q_i \log p_i$$
This naturally introduces the property of penalizing predictions farther from the true class more heavily.
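For instance, with the absolute-error distance and true class $t = 1$, the unnormalized weights over classes {0, 1, 2} are $(e^{-1}, e^{0}, e^{-1}) \approx (0.368, 1.000, 0.368)$, which normalize to the soft label $q \approx (0.212, 0.576, 0.212)$.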
Ordinal Logistic Loss (Cumulative Link Model)
This statistical approach, also known as the cumulative link model, incorporates ordinal relationships by modeling the cumulative probability. The network's output is treated as a one-dimensional continuous variable $z(x)$, and $K-1$ learnable thresholds $\alpha_0 \le \alpha_1 \le \dots \le \alpha_{K-2}$ are introduced. The cumulative probability is modeled as:

$$P(Y \le k \mid x) = \sigma(\alpha_k - z(x)), \quad k = 0, \dots, K-2$$

where $\sigma$ is the sigmoid function. From these cumulative probabilities, the probability of each class is derived as follows:

$$P(Y = 0 \mid x) = \sigma(\alpha_0 - z(x))$$
$$P(Y = k \mid x) = \sigma(\alpha_k - z(x)) - \sigma(\alpha_{k-1} - z(x)), \quad 1 \le k \le K-2$$
$$P(Y = K-1 \mid x) = 1 - \sigma(\alpha_{K-2} - z(x))$$

These probabilities are used to compute the likelihood, and the ordinal logistic loss is the negative log-likelihood:

$$\mathcal{L} = -\frac{1}{N} \sum_{n=1}^{N} \log P(Y = y_n \mid x_n)$$
This approach is theoretically grounded but requires careful implementation to ensure the thresholds are monotonic.
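As a quick sanity check with illustrative values $z = 0.5$ and thresholds $\alpha = (0, 1)$ for 3 classes: $P(Y=0) = \sigma(-0.5) \approx 0.378$, $P(Y=1) = \sigma(0.5) - \sigma(-0.5) \approx 0.245$, and $P(Y=2) = 1 - \sigma(0.5) \approx 0.378$, which sum to 1 as required.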
Implementation Example
Below is an implementation example using the Wine dataset from scikit-learn (its three classes are treated here as ordered labels 0 < 1 < 2 for demonstration) and a two-layer neural network. First, let's prepare some common functions for loading the data and training the model.
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
# Load and preprocess the Wine dataset
def load_wine_data():
wine = load_wine()
X = wine.data # shape=(n_samples, 13)
y = wine.target # 0,1,2 (3 classes)
# Train-test split: 8:2 ratio
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42, stratify=y
)
# Standardization
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
# Convert to Torch tensors
X_train_torch = torch.from_numpy(X_train).float()
y_train_torch = torch.from_numpy(y_train).long()
X_test_torch = torch.from_numpy(X_test).float()
y_test_torch = torch.from_numpy(y_test).long()
return X_train_torch, y_train_torch, X_test_torch, y_test_torch
# Two-layer neural network example
class TwoLayerNet(nn.Module):
def __init__(self, input_dim, hidden_dim, output_dim):
super().__init__()
self.fc1 = nn.Linear(input_dim, hidden_dim)
self.fc2 = nn.Linear(hidden_dim, output_dim)
def forward(self, x):
x = F.relu(self.fc1(x))
x = self.fc2(x) # Raw logits or output
return x
# Simplified training & evaluation function
def train_and_evaluate(
model,
criterion,
optimizer,
train_loader,
X_test,
y_test,
num_epochs=30,
evaluate_fn=None
):
"""
model: nn.Module
criterion: Loss function
optimizer: torch.optim
train_loader: DataLoader of (X, y)
X_test, y_test: Test data (Tensor)
evaluate_fn: Function to get predicted labels during evaluation (model, X_test) -> pred_label
Defaults to standard multi-class argmax for accuracy calculation.
"""
for epoch in range(num_epochs):
model.train()
running_loss = 0.0
for batch_X, batch_y in train_loader:
optimizer.zero_grad()
outputs = model(batch_X)
loss = criterion(outputs, batch_y)
loss.backward()
optimizer.step()
running_loss += loss.item()
if (epoch + 1) % 5 == 0:
print(f"Epoch {epoch+1}/{num_epochs}, Loss: {running_loss/len(train_loader):.4f}")
# --- Evaluation ---
model.eval()
with torch.no_grad():
if evaluate_fn is None:
# Standard multi-class: use argmax for predictions
logits = model(X_test)
pred_label = torch.argmax(logits, dim=1)
else:
# Use custom evaluation function (e.g., Threshold-based or Ordinal Logistic)
pred_label = evaluate_fn(model, X_test)
accuracy = (pred_label == y_test).float().mean().item()
return accuracy
Here is the code for the threshold-based approach:
def convert_to_threshold_labels(y):
"""
y: 0, 1, or 2
return: [is_ge1, is_ge2]
Example:
y=0 -> [0, 0]
y=1 -> [1, 0]
y=2 -> [1, 1]
"""
arr = []
for val in y:
arr.append([1 if val >= 1 else 0,
1 if val >= 2 else 0])
return torch.tensor(arr, dtype=torch.float32)
def train_threshold_based():
# --- Data Preparation ---
X_train, y_train, X_test, y_test = load_wine_data()
# Convert y to 2-dimensional binary labels
y_train_bin = convert_to_threshold_labels(y_train)
y_test_bin = convert_to_threshold_labels(y_test)
# DataLoader (batch_size=16)
train_dataset = torch.utils.data.TensorDataset(X_train, y_train_bin)
train_loader = torch.utils.data.DataLoader(train_dataset, batch_size=16, shuffle=True)
# --- Model: Two outputs (>=1?, >=2?) ---
model = TwoLayerNet(input_dim=13, hidden_dim=32, output_dim=2)
criterion = nn.BCEWithLogitsLoss()
optimizer = optim.Adam(model.parameters(), lr=0.01)
# --- Training ---
# Define "evaluate_fn" for threshold-based predictions
def evaluate_fn(m, X):
m.eval()
with torch.no_grad():
logits = m(X) # shape: (N, 2)
probs = torch.sigmoid(logits) # (p1, p2)
# p1 = probability of >=1, p2 = probability of >=2
pred_list = []
for p1, p2 in probs:
if p1 < 0.5:
pred_list.append(0)
else:
# p1 >= 0.5
if p2 < 0.5:
pred_list.append(1)
else:
pred_list.append(2)
return torch.tensor(pred_list, dtype=torch.long)
acc = train_and_evaluate(
model,
criterion,
optimizer,
train_loader,
X_test,
y_test,
num_epochs=30,
evaluate_fn=evaluate_fn
)
print(f"[Threshold-based] Test Accuracy = {acc*100:.2f}%")
Here is the code for the soft label loss approach:
class SoftLabelLoss(nn.Module):
"""
Computes the cross-entropy loss between the model's output and a soft label distribution,
where the soft label is based on distances: y_i = exp(-|r_t - r_i|).
"""
def __init__(self, num_classes):
super().__init__()
self.num_classes = num_classes
def forward(self, logits, target):
"""
logits: (batch_size, num_classes) - Model's raw logits output.
target: (batch_size,) - Class labels for each sample (0, 1, 2, ...).
"""
with torch.no_grad():
# Create soft labels
batch_size = target.size(0)
# Reshape target to (batch_size, 1)
target_ = target.unsqueeze(1).float()
# Create a class index tensor of shape (1, num_classes)
class_indices = torch.arange(self.num_classes, device=logits.device).float().unsqueeze(0)
# Calculate distances between target and each class
dists = torch.abs(target_ - class_indices) # (batch_size, num_classes)
# Compute unnormalized weights exp(-distance)
unnorm = torch.exp(-dists)
# Normalize to create soft label distribution
soft_labels = unnorm / unnorm.sum(dim=1, keepdim=True)
# Compute log probabilities using log-softmax
log_probs = F.log_softmax(logits, dim=1) # (batch_size, num_classes)
# Compute cross-entropy loss
loss = - (soft_labels * log_probs).sum(dim=1).mean()
return loss
def train_soft_label():
# --- Data Preparation ---
X_train, y_train, X_test, y_test = load_wine_data()
# Create DataLoader (batch_size=16)
train_dataset = torch.utils.data.TensorDataset(X_train, y_train)
train_loader = torch.utils.data.DataLoader(train_dataset, batch_size=16, shuffle=True)
# --- Model: outputs logits for 3 classes (0, 1, 2) ---
model = TwoLayerNet(input_dim=13, hidden_dim=32, output_dim=3)
criterion = SoftLabelLoss(num_classes=3)
optimizer = optim.Adam(model.parameters(), lr=0.01)
# --- Training & Evaluation ---
# For soft labels, the predicted class is determined using the usual argmax
acc = train_and_evaluate(
model,
criterion,
optimizer,
train_loader,
X_test,
y_test,
num_epochs=30,
evaluate_fn=None # None => Uses argmax to determine the predicted class
)
print(f"[Soft Label] Test Accuracy = {acc*100:.2f}%")
Here is the code for the ordinal logistic loss:
class OrdinalLogisticModel(nn.Module):
def __init__(self, input_dim, hidden_dim, num_classes):
super().__init__()
self.fc1 = nn.Linear(input_dim, hidden_dim)
self.fc2 = nn.Linear(hidden_dim, 1) # Scalar output z
# (K-1) threshold parameters
# Raw parameters raw_alpha => softplus => cumsum to ensure ascending order
self.raw_alpha = nn.Parameter(torch.zeros(num_classes - 1))
def forward(self, x):
h = F.relu(self.fc1(x))
z = self.fc2(h).squeeze(-1) # (batch_size,)
alpha_sorted = torch.cumsum(F.softplus(self.raw_alpha), dim=0)
return z, alpha_sorted
class OrdinalLogLoss(nn.Module):
def __init__(self, num_classes):
super().__init__()
self.num_classes = num_classes
def forward(self, inputs, target):
"""
inputs: A tuple (z, alpha)
z: (batch_size,) scalar output
alpha: (K-1,) thresholds
target: (batch_size,) labels 0..K-1
"""
z, alpha = inputs # Unpack
K = self.num_classes
batch_size = z.size(0)
# p(Y=0)
p_y0 = torch.sigmoid(alpha[0] - z) # (batch_size,)
# p(Y=K-1)
p_yK_1 = 1.0 - torch.sigmoid(alpha[-1] - z) # (batch_size,)
if K > 2:
sig_all = torch.sigmoid(alpha.unsqueeze(1) - z.unsqueeze(0)) # (K-1, batch_size)
# p_middle[k] = sigma(alpha_k - z) - sigma(alpha_{k-1} - z)
p_mid = sig_all[1:] - sig_all[:-1] # shape: (K-2, batch_size)
p_mid = p_mid.transpose(0, 1) # (batch_size, K-2)
else:
p_mid = None
p = torch.empty_like(z)
mask0 = (target == 0)
p[mask0] = p_y0[mask0]
maskK_1 = (target == (K-1))
p[maskK_1] = p_yK_1[maskK_1]
if K > 2:
mask_mid = ~(mask0 | maskK_1)
if mask_mid.any():
idx = target[mask_mid] - 1 # 1..K-2 => 0..(K-3)
rowwise = p_mid[mask_mid] # (N_mid, K-2)
idx_2d = idx.unsqueeze(1)
p_vals = torch.gather(rowwise, 1, idx_2d).squeeze(1)
p[mask_mid] = p_vals
# -log p
eps = 1e-12
nll = -torch.log(p + eps)
return nll.mean()
def train_ordinal_logistic():
# --- Data Preparation ---
X_train, y_train, X_test, y_test = load_wine_data()
train_dataset = torch.utils.data.TensorDataset(X_train, y_train)
train_loader = torch.utils.data.DataLoader(train_dataset, batch_size=16, shuffle=True)
# --- Model: Scalar output + 2 thresholds (for 3 classes) ---
model = OrdinalLogisticModel(input_dim=13, hidden_dim=32, num_classes=3)
criterion = OrdinalLogLoss(num_classes=3)
optimizer = optim.Adam(model.parameters(), lr=0.01)
# Training loop (train_and_evaluate is used, but inputs passed to the criterion are special)
# evaluate_fn calculates class probabilities => argmax
def custom_criterion(outputs, targets):
# outputs = (z, alpha)
return criterion(outputs, targets)
def evaluate_fn(m, X):
m.eval()
with torch.no_grad():
z_test, alpha_test = m(X)
# p(Y=0) = σ(alpha_0 - z)
p_y0 = torch.sigmoid(alpha_test[0] - z_test).unsqueeze(1)
# p(Y=2) = 1 - σ(alpha_1 - z)
p_y2 = (1.0 - torch.sigmoid(alpha_test[-1] - z_test)).unsqueeze(1)
# p(Y=1) = middle probability
if alpha_test.size(0) > 1:
sig_all = torch.sigmoid(alpha_test.unsqueeze(1) - z_test.unsqueeze(0)) # (2, N)
p_mid = (sig_all[1:] - sig_all[:-1]).transpose(0, 1) # (N, 1)
else:
p_mid = None
if p_mid is None:
# For binary classification only
p_all = torch.cat([p_y0, p_y2], dim=1)
else:
p_all = torch.cat([p_y0, p_mid, p_y2], dim=1) # (N, 3)
pred_label = torch.argmax(p_all, dim=1)
return pred_label
acc = train_and_evaluate(
model,
custom_criterion,
optimizer,
train_loader,
X_test,
y_test,
num_epochs=30,
evaluate_fn=evaluate_fn
)
print(f"[Ordinal Logistic] Test Accuracy = {acc*100:.2f}%")
Summary
In this post, we covered three approaches to ordinal regression:
- Threshold-based approach: Decomposes the problem into multiple binary classification tasks based on thresholds.
- Soft label loss: Creates soft labels by spreading probabilities around the true class and minimizes cross-entropy.
- Ordinal logistic loss: Uses a scalar output and multiple thresholds to model cumulative probabilities.
Each method leverages ordinal information in its unique way. Choosing the right approach depends on the dataset’s characteristics, the number of classes, and the required model precision.
Written by Shion Honda. If you like this, please share!