torch.compile Benchmarked

May 19, 2023

PyTorch recently introduced torch.compile, a method to JIT-compile PyTorch code into optimized kernels, in version 2.0. On the official website, the team summarizes the results as follows:

Across these 163 open-source models torch.compile works 93% of time, and the model runs 43% faster in training on an NVIDIA A100 GPU. At Float32 precision, it runs 21% faster on average and at AMP Precision it runs 51% faster on average.

This is fantastic! In this blog post, I will try out this new feature and see what it can do.

Experiments

torch.compile provides several compilation modes: default, reduce-overhead, and max-autotune. The best choice depends on where the bottleneck lies, which in turn depends on factors such as model architecture and input tensor size. PyTorch describes the modes as follows:

The default mode is a preset that tries to compile efficiently without taking too long to compile or using extra memory.

Other modes such as reduce-overhead reduce the framework overhead by a lot more, but cost a small amount of extra memory. max-autotune compiles for a long time, trying to give you the fastest code it can generate.
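As a rough sketch of how the modes quoted above are selected, the mode name is passed to torch.compile as a string. The helper build_variants and the tiny nn.Linear stand-in model are hypothetical names used only for illustration; they are not taken from the notebook.

```python
import torch
from torch import nn

# Mode names as listed in the text above.
MODES = ("default", "reduce-overhead", "max-autotune")


def build_variants(model: nn.Module) -> dict:
    """Return the eager model plus one torch.compile wrapper per mode.

    Hypothetical helper for illustration. torch.compile is lazy: the actual
    compilation happens on the first call of each wrapper, not here.
    """
    variants = {"eager": model}
    for mode in MODES:
        variants[mode] = torch.compile(model, mode=mode)
    return variants


# Stand-in model (assumption); the experiment in this post uses GPT-2 instead.
variants = build_variants(nn.Linear(8, 8))
print(list(variants))
```

Because compilation is deferred to the first call, the first training or inference step of each compiled variant is where the compile cost shows up.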

I measured the following metrics for eager mode and for each compilation mode while running GPT-2 (1.5B parameters) on a Colab A100.

Initial training step time & memory (when JIT compiling happens)
Training time
Inference time

I used AMP for a more realistic experiment. The core part of the code looks like this:

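import torch
from torch.cuda.amp import autocast

N_ITER = 10  # placeholder value (assumption); the Colab notebook defines the actual count


def time_train(model, optimizer, scaler, inputs) -> float:
    """Return the mean time per training iteration in milliseconds."""
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(N_ITER):
        optimizer.zero_grad()
        # Forward pass under mixed precision; the labels are the input IDs
        # (causal language-modeling loss).
        with autocast():
            outputs = model(**inputs, labels=inputs["input_ids"])
        # Scale the loss before backward, then step the optimizer and
        # update the scale factor.
        scaler.scale(outputs.loss).backward()
        scaler.step(optimizer)
        scaler.update()
    end.record()
    # Wait for all queued GPU work before reading the event timestamps.
    torch.cuda.synchronize()
    return start.elapsed_time(end) / N_ITER  # elapsed_time() returns milliseconds


@torch.no_grad()
def time_infer(model, inputs) -> float:
    """Return the mean time per forward pass in milliseconds."""
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(N_ITER):
        with autocast():
            _ = model(**inputs)
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end) / N_ITER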

For more details about the experiment, please refer to the Colab notebook.
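The two functions above cover the steady-state metrics. For the first metric (initial step time and memory), one possible approach is sketched below. It is an assumption about how such a measurement could be done, not the notebook's actual code: measure_first_step is a hypothetical helper, it uses wall-clock time, and it reads PyTorch's peak allocated memory, which may differ from the figure the notebook reports.

```python
import time

import torch


def measure_first_step(step_fn):
    """Run step_fn once and return (elapsed seconds, peak allocated MiB).

    Hypothetical helper for illustration. Without a CUDA device the memory
    figure is reported as 0.0.
    """
    use_cuda = torch.cuda.is_available()
    if use_cuda:
        # Start from a clean peak counter and an idle GPU.
        torch.cuda.reset_peak_memory_stats()
        torch.cuda.synchronize()
    t0 = time.perf_counter()
    step_fn()
    if use_cuda:
        # Make sure the GPU work launched by step_fn has finished.
        torch.cuda.synchronize()
    elapsed_s = time.perf_counter() - t0
    peak_mib = torch.cuda.max_memory_allocated() / 2**20 if use_cuda else 0.0
    return elapsed_s, peak_mib


# Trivial stand-in workload; in the experiment this would be one training step.
elapsed_s, peak_mib = measure_first_step(lambda: torch.ones(1024).sum())
print(f"{elapsed_s:.4f} s, {peak_mib:.1f} MiB")
```

For a compiled model, step_fn would wrap the very first training step, so the measured time includes the JIT compilation.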

Results & Discussion

The table below summarizes the measured metrics.

Compilation mode	Initial step time [s]	Initial step memory [MiB]	Training time / iter [ms]	Inference time / iter [ms]
N/A (eager)	1	3675	57	18
default	29	3277	34	30
reduce-overhead	29	5736	34	28
max-autotune	35	5736	32	30

Indeed, torch.compile reduced the training time by 40-44% (corresponding to 68-78% speedup).
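The two percentages describe the same measurements from different angles: time reduction is 1 - compiled/eager, while speedup is eager/compiled - 1. A small check using the training times from the table (the function name reduction_and_speedup is just an illustrative choice):

```python
def reduction_and_speedup(eager_ms: float, compiled_ms: float):
    """Return (fraction of time saved, fractional throughput gain)."""
    reduction = 1 - compiled_ms / eager_ms
    speedup = eager_ms / compiled_ms - 1
    return reduction, speedup


EAGER_TRAIN_MS = 57  # training time per iteration in eager mode, from the table
compiled_train_ms = {"default": 34, "reduce-overhead": 34, "max-autotune": 32}

for mode, ms in compiled_train_ms.items():
    reduction, speedup = reduction_and_speedup(EAGER_TRAIN_MS, ms)
    print(f"{mode}: {reduction:.0%} less time, {speedup:.0%} faster")
# default: 40% less time, 68% faster
# reduce-overhead: 40% less time, 68% faster
# max-autotune: 44% less time, 78% faster
```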

However, it didn’t reduce the inference time in this setting. In fact, the compiled models were slower: 28-30 ms per iteration versus 18 ms in eager mode. I haven’t yet figured out what is going on here, but other people have also reported negative effects of torch.compile.

If you have any thoughts on this, I’d appreciate your comments.