Joint Training of Variational Auto-Encoder and Latent Energy-Based Model

1 Department of Computer Science, Stevens Institute of Technology, USA
2 University of California, Los Angeles (UCLA), USA

Abstract

This paper proposes a joint training method to learn both the variational auto-encoder (VAE) and the latent energy-based model (EBM). The joint training of the VAE and the latent EBM is based on an objective function that consists of three Kullback-Leibler divergences between three joint distributions on the latent vector and the image, and the objective function takes an elegant symmetric and anti-symmetric form of divergence triangle that seamlessly integrates variational and adversarial learning. In this joint training scheme, the latent EBM serves as a critic of the generator model, while the generator model and the inference model in the VAE serve as the approximate synthesis sampler and inference sampler of the latent EBM. Our experiments show that the joint training greatly improves the synthesis quality of the VAE. It also enables the learning of an energy function that is capable of detecting out-of-sample examples for anomaly detection.
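As a quick reference, the divergence triangle described above can be sketched in the following schematic form. This is a rendering in the spirit of the divergence-triangle formulation, not the exact equations of the paper: Q denotes the data-plus-inference joint density, P the generator joint density, and \Pi the latent-EBM joint density, and the precise parameterization of \Pi and the sign/ordering conventions are given in the paper.

    Q(x, z) = q_{\mathrm{data}}(x)\, q_\phi(z \mid x), \qquad
    P(x, z) = p(z)\, p_\theta(x \mid z), \qquad
    \Pi(x, z) \;\text{(latent EBM joint density)}

    \min_{\theta,\, \phi}\; \max_{\alpha}\;
    \mathcal{D}(Q, P, \Pi) \;=\;
    \mathrm{KL}(Q \,\|\, P) \;+\; \mathrm{KL}(P \,\|\, \Pi) \;-\; \mathrm{KL}(Q \,\|\, \Pi)

The negative third term is what makes the triangle anti-symmetric: the latent EBM (parameters \alpha) is pulled toward the data-and-inference density Q and pushed away from the generator density P, so it acts as a critic, while the generator (\theta) and inference model (\phi) are trained against it.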


Paper

The publication can be found here.
@inproceedings{han2020joint,
  title={Joint Training of Variational Auto-Encoder and Latent Energy-Based Model},
  author={Han, Tian and Nijkamp, Erik and Zhou, Linqi and Pang, Bo and Zhu, Song-Chun and Wu, Ying Nian},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  pages={7978--7987},
  year={2020}
}

Code

The code and pre-trained model can be found here.
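For orientation before reading the released code, below is a minimal PyTorch sketch of what one joint training step could look like. All module names, architectures, and hyper-parameters are hypothetical, and the two losses are a simplified stand-in for the divergence-triangle objective; the released code implements the exact formulation from the paper.

    import torch
    import torch.nn as nn

    D_Z, D_X = 64, 32 * 32 * 3   # latent size and flattened image size (illustrative)

    class Encoder(nn.Module):        # inference model q_phi(z | x)
        def __init__(self):
            super().__init__()
            self.body = nn.Sequential(nn.Linear(D_X, 256), nn.ReLU())
            self.mu = nn.Linear(256, D_Z)
            self.logvar = nn.Linear(256, D_Z)
        def forward(self, x):
            h = self.body(x)
            return self.mu(h), self.logvar(h)

    class Decoder(nn.Module):        # generator model p_theta(x | z)
        def __init__(self):
            super().__init__()
            self.body = nn.Sequential(nn.Linear(D_Z, 256), nn.ReLU(), nn.Linear(256, D_X))
        def forward(self, z):
            return self.body(z)

    class Energy(nn.Module):         # latent EBM f_alpha(x, z), acting as the critic
        def __init__(self):
            super().__init__()
            self.body = nn.Sequential(nn.Linear(D_X + D_Z, 256), nn.ReLU(), nn.Linear(256, 1))
        def forward(self, x, z):
            return self.body(torch.cat([x, z], dim=1)).squeeze(1)

    enc, dec, ebm = Encoder(), Decoder(), Energy()
    opt_vae = torch.optim.Adam(list(enc.parameters()) + list(dec.parameters()), lr=1e-4)
    opt_ebm = torch.optim.Adam(ebm.parameters(), lr=1e-4)

    def train_step(x):
        # inference sample z ~ q_phi(z | x) via the reparameterization trick
        mu, logvar = enc(x)
        z_inf = mu + torch.randn_like(mu) * (0.5 * logvar).exp()
        # synthesis sample z ~ p(z), x_gen ~ p_theta(x | z)
        z_gen = torch.randn(x.size(0), D_Z)
        x_gen = dec(z_gen)

        # (1) critic update: raise the energy score on observed pairs (x, z_inf),
        #     lower it on synthesized pairs (x_gen, z_gen)
        loss_ebm = ebm(x_gen.detach(), z_gen).mean() - ebm(x, z_inf.detach()).mean()
        opt_ebm.zero_grad(); loss_ebm.backward(); opt_ebm.step()

        # (2) VAE update: the usual ELBO terms (reconstruction + prior KL), plus a
        #     critic term asking generated samples to score well under the latent EBM
        x_rec = dec(z_inf)
        recon = ((x_rec - x) ** 2).sum(dim=1).mean()
        kl = (-0.5 * (1 + logvar - mu ** 2 - logvar.exp()).sum(dim=1)).mean()
        loss_vae = recon + kl - ebm(x_gen, z_gen).mean()
        opt_vae.zero_grad(); loss_vae.backward(); opt_vae.step()
        return loss_ebm.item(), loss_vae.item()

    # usage (random data purely for illustration): train_step(torch.rand(16, D_X))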


Experiments

Experiment 1: Object Synthesis on Large-scale Dataset

For object categories, we test our model on three commonly used datasets of natural images: CIFAR-10, CelebA (Liu et al., 2015), and the Large-scale Scene Understanding (LSUN) dataset (Yu et al., 2015). For the CelebA face dataset, we randomly select 9,000 images for training and another 1,000 images for testing in the reconstruction task. The face images and the bedroom images are resized to 64 x 64, while CIFAR-10 images remain 32 x 32, as sketched in the snippet below. The qualitative results of generated samples for objects are shown in Figure 1.
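For concreteness, the preprocessing described above roughly corresponds to the torchvision snippet below; the dataset roots, the center crop, and the download flags are assumptions for illustration, and the 9,000/1,000 CelebA split is applied on top of this in the actual pipeline.

    import torchvision.transforms as T
    from torchvision import datasets

    to_64 = T.Compose([T.Resize(64), T.CenterCrop(64), T.ToTensor()])   # CelebA and LSUN-bedroom
    to_32 = T.ToTensor()                                                 # CIFAR-10 stays 32 x 32

    cifar = datasets.CIFAR10(root="./data", train=True, download=True, transform=to_32)
    celeba = datasets.CelebA(root="./data", split="train", download=True, transform=to_64)
    lsun = datasets.LSUN(root="./data", classes=["bedroom_train"], transform=to_64)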

Figure 1: Generated samples. Left: generated samples on CIFAR-10 dataset. Middle: generated samples on CelebA dataset. Right: generated samples on LSUN-bedroom dataset.

Experiment 2: Testing Reconstruction

We evaluate the model on the hold-out testing sets of CIFAR-10, CelebA, and LSUN-bedroom. Specifically, we use the standard 10,000 testing images for CIFAR-10, and 1,000 and 40,000 hold-out testing images for CelebA and LSUN-bedroom, respectively. The qualitative reconstruction results for CIFAR-10 and CelebA are shown in Figure 2. Note that KL(+) indicates using Eqn. 17 to learn the inference model, while KL(-) indicates using Eqn. 15 instead; a minimal reconstruction sketch follows below. Please see the paper for details.
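Reconstruction on the hold-out sets is simply encode-then-decode through the learned inference and generator models; a minimal sketch, reusing the hypothetical enc/dec modules from the training sketch above, is:

    import torch

    @torch.no_grad()
    def reconstruct(enc, dec, x_test):
        mu, _ = enc(x_test)   # posterior mean as the inferred latent code
        return dec(mu)        # reconstruction to compare against x_test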

Figure 2: Testing Reconstruction for CIFAR-10 and CelebA. Left: testing images. Middle: testing reconstruction using KL(+). Right: testing reconstruction using KL(-).