
    Downsizing StyleGAN2 for Training on a Single GPU

    March 04, 2020  |  4 min read  |  3,118 views


    StyleGAN2 [1] is famous for its success in generating high-resolution human face images that we can’t tell apart from real images. For example, can you believe this image was generated by AI?

    [Figure: a human face image generated by StyleGAN2]

    * You can get face images generated by StyleGAN2 here.

    Have you ever thought of training StyleGAN2 yourself to generate other kinds of pictures, only to give it up because of limited machine resources? The authors trained it on an NVIDIA DGX-1 with 8 Tesla V100 GPUs for 9 days. That’s too expensive for individual researchers and practitioners! But don’t worry. In this post, I demonstrate how to downsize StyleGAN2 so that it can be trained from scratch on a single GPU, by modifying this PyTorch implementation.

    Before reading on, please make sure you really have to train from scratch. If you want to generate 1024x1024 anime face images, for example, you can fine-tune StyleGAN2 pre-trained on FFHQ. Pre-trained models for cars, cats, and other categories are also available in the official repository.

    Dataset

    Due to limited machine resources (I assume a single GPU with 8 GB of memory), I use the FFHQ dataset downsized to 256x256.

    First, download the original images using the download script. This takes several hours, depending on your network bandwidth, and results in about 80 GB of data.

    python download_ffhq.py --images

    Then, resize the images to 256x256 (e.g., with Pillow).
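
    Here is a minimal sketch of this resizing step with Pillow. The script name, the source directory (images1024x1024), and the output directory (ffhq256) are assumptions; adjust them to your setup.

    resize_ffhq.py (a hypothetical helper script)
    import os
    from PIL import Image

    SRC_DIR = 'images1024x1024'  # where the download script saved the originals (assumption)
    DST_DIR = 'ffhq256'          # hypothetical output directory
    SIZE = 256

    os.makedirs(DST_DIR, exist_ok=True)
    for root, _, files in os.walk(SRC_DIR):
        for name in files:
            if not name.lower().endswith('.png'):
                continue
            img = Image.open(os.path.join(root, name)).convert('RGB')
            # Lanczos resampling keeps the downscaled faces sharp
            img = img.resize((SIZE, SIZE), Image.LANCZOS)
            img.save(os.path.join(DST_DIR, name))

    The repository's README explains how to turn such an image folder into the LMDB database expected by train.py.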

    Training Tips

    Starting from the great PyTorch implementation by Kim Seonghyeon, I downsized the model so that it can be trained on a single GPU. For basic usage of the repository, please refer to the README. Here I focus on tips that are not explicitly documented there.

    Requirements

    You need LMDB installed to create a database for the collection of images. Also, as mentioned in the issue, the versions of CUDA and PyTorch (10.2 and 1.3.1, respectively) are critical.
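
    As a quick sanity check, a tiny script like the one below (my own addition, not part of the repository) prints the installed versions before you start training:

    check_env.py (hypothetical)
    import lmdb  # required for building the image database
    import torch

    # The issue above reports CUDA 10.2 + PyTorch 1.3.1 as the working combination
    print('PyTorch version:', torch.__version__)
    print('CUDA version   :', torch.version.cuda)
    print('GPU available  :', torch.cuda.is_available())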

    Reduce the Model Size

    To reduce memory consumption, I decrease 1) the number of channels in the generator and discriminator, 2) the resolution of the images, 3) the latent size, and 4) the number of samples generated at a time.

    You can change the number of channels in model.py.

    model.py
    ...
    
    class Generator(nn.Module):
        def __init__(
            self,
            size,
            style_dim,
            n_mlp,
            channel_multiplier=2,
            blur_kernel=[1, 3, 3, 1],
            lr_mlp=0.01,
        ):
            super().__init__()
    
            self.size = size
    
            self.style_dim = style_dim
    
            layers = [PixelNorm()]
    
            for i in range(n_mlp):
                layers.append(
                    EqualLinear(
                        style_dim, style_dim, lr_mul=lr_mlp, activation='fused_lrelu'
                    )
                )
    
            self.style = nn.Sequential(*layers)
    
            self.channels = {
                4: 256, # originally 512
                8: 256, # originally 512
                16: 256, # originally 512
                32: 256, # originally 512
                64: 256 * channel_multiplier,
                128: 128 * channel_multiplier,
                256: 64 * channel_multiplier,
                512: 32 * channel_multiplier,
                1024: 16 * channel_multiplier,
            }
    ...
    
    class Discriminator(nn.Module):
        def __init__(self, size, channel_multiplier=2, blur_kernel=[1, 3, 3, 1]):
            super().__init__()
    
            channels = {
                4: 256, # originally 512
                8: 256, # originally 512
                16: 256, # originally 512
                32: 256, # originally 512
                64: 256 * channel_multiplier,
                128: 128 * channel_multiplier,
                256: 64 * channel_multiplier,
                512: 32 * channel_multiplier,
                1024: 16 * channel_multiplier,
            }
    
    ...

    In train.py, you can change the resolution, the latent size, and the number of samples generated at a time. The resolution is set to 256x256 by default. args.n_sample is the batch size at inference time, i.e., the number of images generated for each sample grid.

    train.py
    if __name__ == '__main__':
        device = 'cuda'
    
        parser = argparse.ArgumentParser()
    
        parser.add_argument('path', type=str)
        parser.add_argument('--iter', type=int, default=800000)
        parser.add_argument('--batch', type=int, default=16)
        parser.add_argument('--n_sample', type=int, default=25) # originally 64
        parser.add_argument('--size', type=int, default=256)
    
    ...
    
        args.latent = 32 # originally 512
        args.n_mlp = 8
    
    ...

    Log

    Since there are as many as 800k training steps, sampling every 100 steps results in 8,000 images! I recommend also changing the sampling interval in train.py.

    train.py
    def train(args, loader, generator, discriminator, g_optim, d_optim, g_ema, device):
        
    ...
    
                if i % 1000 == 0: # originally 100
                    with torch.no_grad():
                        g_ema.eval()
                        sample, _ = g_ema([sample_z])
                        utils.save_image(
                            sample,
                            f'sample/{str(i).zfill(6)}.png',
                            nrow=int(args.n_sample ** 0.5),
                            normalize=True,
                            range=(-1, 1),
                        )

    Metrics such as the losses and the perceptual path length (PPL) can be logged to Weights & Biases!

    train.py supports Weights & Biases logging out of the box. If you want to use it, add the --wandb flag to the command.
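
    For reference, wandb logging generally follows the pattern sketched below; the project name and metric keys here are illustrative, not the repository's exact ones.

    wandb_sketch.py (illustrative)
    import wandb

    wandb.init(project='stylegan2-ffhq256')  # hypothetical project name

    for step in range(3):  # stand-in for the 800k-step training loop
        d_loss_val = 0.7 - 0.01 * step  # dummy loss values for illustration
        g_loss_val = 1.2 - 0.02 * step
        wandb.log({'d_loss': d_loss_val, 'g_loss': g_loss_val}, step=step)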

    Finally, run a command like the following:

    python train.py --batch BATCH_SIZE LMDB_PATH --wandb

    It took almost 2 weeks in my environment, but halving the number of training steps won’t really harm the quality of generated images.
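
    For example, a run with half the default number of iterations would be launched like this (the batch size and dataset path are placeholders):

    python train.py --batch BATCH_SIZE --iter 400000 LMDB_PATH --wandb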

    Results

    After training for 800k steps, StyleGAN2 generates nice images! They sometimes have strange parts, but more than half of them look great to me.

    [Figure: sample face images generated by the downsized StyleGAN2 after 800k steps]

    This is how StyleGAN2 learns to generate human face images. Note that the time scale is not linear.

    References

    [1] Karras, Tero, et al. "Analyzing and Improving the Image Quality of StyleGAN." arXiv preprint arXiv:1912.04958 (2019).



    Written by Shion Honda. If you like this, please share!
