Hippocampus's Garden

Under the sea, in the hippocampus's garden...

torch.compile Benchmarked

May 19, 2023  |  2 min read

PyTorch 2.0 introduced torch.compile, a method that JIT-compiles PyTorch code into optimized kernels. On the official website, the PyTorch team summarizes the results as follows:

Across these 163 open-source models torch.compile works 93% of time, and the model runs 43% faster in training on an NVIDIA A100 GPU. At Float32 precision, it runs 21% faster on average and at AMP Precision it runs 51% faster on average.

This is fantastic! In this blog post, I will try out this new feature and see what it can do.

Experiments

torch.compile provides several compilation modes: default, reduce-overhead, and max-autotune. The best choice depends on where the bottleneck lies, which is determined by factors such as the model architecture and the input tensor size. PyTorch describes the modes as follows:

The default mode is a preset that tries to compile efficiently without taking too long to compile or using extra memory.

Other modes such as reduce-overhead reduce the framework overhead by a lot more, but cost a small amount of extra memory. max-autotune compiles for a long time, trying to give you the fastest code it can generate.
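As a minimal sketch of how these modes are selected, the snippet below wraps a small module with each mode through the mode argument of torch.compile. TinyNet, MODES, and build_variants are hypothetical names introduced for illustration; they stand in for the GPT-2 setup benchmarked in this post. Note that wrapping is lazy: the actual compilation is deferred until the wrapped module is first called.

```python
import torch
from torch import nn


class TinyNet(nn.Module):
    """Hypothetical stand-in model; the post itself benchmarks GPT-2."""

    def __init__(self) -> None:
        super().__init__()
        self.fc = nn.Linear(8, 8)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return torch.relu(self.fc(x))


# The three compilation modes named in the text.
MODES = ("default", "reduce-overhead", "max-autotune")


def build_variants(model: nn.Module) -> dict:
    """Return the eager model plus one torch.compile wrapper per mode."""
    variants = {"eager": model}
    for mode in MODES:
        # torch.compile only wraps the module here; JIT compilation is
        # triggered by the first call to the wrapped module.
        variants[mode] = torch.compile(model, mode=mode)
    return variants


variants = build_variants(TinyNet())
print(list(variants))
```

Each wrapper shares its parameters with the original module, so the eager and compiled variants can be benchmarked against the same weights.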

I measured the following metrics in eager mode and in each compilation mode while running GPT-2 (1.5B parameters) on a Colab A100:

  • Initial training step time & memory (when JIT compiling happens)
  • Training time
  • Inference time
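The first metric can be captured roughly as follows. This is a sketch under assumptions, not the notebook's actual code: measure_first_step is a hypothetical helper, and it reads peak memory from PyTorch's caching allocator via torch.cuda.max_memory_allocated, which may differ from the figure a tool such as nvidia-smi would report. On a machine without CUDA it falls back to wall-clock time only.

```python
import time
from typing import Callable, Optional, Tuple

import torch


def measure_first_step(step_fn: Callable[[], object]) -> Tuple[float, Optional[float]]:
    """Run step_fn once; return (seconds, peak MiB or None without CUDA)."""
    use_cuda = torch.cuda.is_available()
    if use_cuda:
        # Start from a clean peak-memory counter and an idle GPU.
        torch.cuda.reset_peak_memory_stats()
        torch.cuda.synchronize()
    start = time.perf_counter()
    step_fn()  # for a compiled model, the first call includes JIT compilation
    if use_cuda:
        # Wait for queued kernels to finish before stopping the clock.
        torch.cuda.synchronize()
    elapsed = time.perf_counter() - start
    # max_memory_allocated() returns bytes; convert to MiB.
    peak_mib = torch.cuda.max_memory_allocated() / 2**20 if use_cuda else None
    return elapsed, peak_mib


# Usage with a placeholder step; replace the lambda with one training step.
elapsed, peak = measure_first_step(lambda: sum(range(1000)))
print(f"first step: {elapsed:.4f} s, peak memory: {peak}")
```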

I used automatic mixed precision (AMP) to make the experiment more realistic. The core part of the code looks like this:

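# Assumed imports (not shown in the original excerpt).
import torch
from torch.cuda.amp import autocast

# N_ITER (the number of timed iterations) is defined elsewhere in the notebook.


def time_train(model, optimizer, scaler, inputs) -> float:
    """Return the mean training time per iteration in milliseconds."""
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(N_ITER):
        optimizer.zero_grad()
        with autocast():
            # Language-modeling loss: the inputs double as the labels.
            outputs = model(**inputs, labels=inputs["input_ids"])
        # scaler is expected to be a torch.cuda.amp.GradScaler.
        scaler.scale(outputs.loss).backward()
        scaler.step(optimizer)
        scaler.update()
    end.record()
    # Wait for all queued kernels to finish before reading the timer.
    torch.cuda.synchronize()
    return start.elapsed_time(end) / N_ITER  # elapsed_time() is in milliseconds


@torch.no_grad()
def time_infer(model, inputs) -> float:
    """Return the mean inference time per iteration in milliseconds."""
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(N_ITER):
        with autocast():
            _ = model(**inputs)
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end) / N_ITER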

For more details about the experiment, please refer to the Colab notebook.

Results & Discussions

The table below summarizes the measured metrics.

| Compilation mode | Initial step time [s] | Initial step memory [MiB] | Training time / iter [ms] | Inference time / iter [ms] |
|---|---|---|---|---|
| N/A (eager) | 1 | 3675 | 57 | 18 |
| default | 29 | 3277 | 34 | 30 |
| reduce-overhead | 29 | 5736 | 34 | 28 |
| max-autotune | 35 | 5736 | 32 | 30 |

Indeed, torch.compile reduced the training time by 40-44% (corresponding to 68-78% speedup).
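The two ways of expressing the gain can be checked with a few lines of arithmetic. The sketch below uses the training times from the table (57 ms in eager mode, 32-34 ms when compiled); the function names time_reduction and speedup are introduced here for illustration only.

```python
def time_reduction(eager_ms: float, compiled_ms: float) -> float:
    """Fraction of per-iteration time saved relative to eager mode."""
    return 1.0 - compiled_ms / eager_ms


def speedup(eager_ms: float, compiled_ms: float) -> float:
    """Throughput gain relative to eager mode (0.68 means 68% faster)."""
    return eager_ms / compiled_ms - 1.0


EAGER_MS = 57.0  # training time per iteration in eager mode, from the table
for compiled_ms in (34.0, 32.0):  # slowest and fastest compiled results
    print(
        f"{compiled_ms:.0f} ms: "
        f"{time_reduction(EAGER_MS, compiled_ms):.0%} less time, "
        f"{speedup(EAGER_MS, compiled_ms):.0%} speedup"
    )
# 34 ms: 40% less time, 68% speedup
# 32 ms: 44% less time, 78% speedup
```

The speedup figure is larger than the time reduction because it is measured relative to the shorter compiled time rather than the eager time.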

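However, torch.compile did not reduce the inference time in this setting. On the contrary, inference took 28-30 ms per iteration with compilation, compared with 18 ms in eager mode. I haven’t yet figured out what is going on here, but some people have also reported negative effects of torch.compile.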

If you have any thoughts on this, I’d appreciate your comments.

References

[1] torch.compile Tutorial — PyTorch Tutorials 2.0.1+cu117 documentation
[2] Is PyTorch 2.0 Faster Than PyTorch 1.13? | PyTorch 2.0 Benchmarks v2 – Weights & Biases
[3] PyTorch 2.0の新機能「torch.compile」使ってみた (Trying out torch.compile, the new feature in PyTorch 2.0) - まったり勉強ノート



Written by Shion Honda. If you like this, please share!
