Real-world applications

Link to paper (arXiv)

Summary:

  1. Traditional variational autoencoders may struggle to generate good samples when the underlying input data has many distinct modes, such as a complex mixture distribution.

  2. This paper presents a VAE formulation (CP-VAE) with a conditional prior that ideally learns to cleanly separate the different modes.

  3. The latent variable in a CP-VAE model is composed of both a discrete and a continuous piece.

Medium level:

As always, one of the issues with the traditional VAE formulation is the large number of assumptions it inherently makes. This paper addresses a commonly examined one: the isotropic Gaussian prior. By using both a continuous prior (here a Gaussian) and a discrete categorical prior, the model may have an easier time distinguishing between vastly different modes via a discontinuous latent space. A discontinuous latent space is normally highly unappealing, but the sampling procedure the authors also provide mitigates this problem.

Low-ish level:

We start with the new \(p_\theta(\mathbf{x})\):

\[p_\theta (\mathbf{x}) = \sum_c \int p_\theta (\mathbf{x \vert z, c}) \, p_\psi (\mathbf{z \vert c}) \, p(\mathbf{c}) \, d\mathbf{z}\]

It is a two-level hierarchical generative process; the latent space is composed of the traditional continuous Gaussian \(\mathbf{z}\) and the discrete component \(\mathbf{c}\), with the joint distribution \(p(\mathbf{z, c}) = p_\psi (\mathbf{z \vert c}) p(\mathbf{c})\). The authors assume a uniform categorical prior for \(\mathbf{c}\).
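As a rough sketch of this generative process, ancestral sampling might look like the following (in PyTorch; `prior_net` and `decoder` are hypothetical stand-ins for the learned conditional prior and decoder, not the paper's code):

```python
import torch
import torch.nn.functional as F

def sample_x(prior_net, decoder, num_components, n_samples=16):
    """Ancestral sampling: p(c) -> p_psi(z | c) -> p_theta(x | z, c)."""
    # p(c): uniform categorical over the discrete components
    c_idx = torch.randint(0, num_components, (n_samples,))
    c = F.one_hot(c_idx, num_components).float()

    # p_psi(z | c): a Gaussian whose parameters depend on the sampled component
    mu_p, logvar_p = prior_net(c)
    z = mu_p + torch.randn_like(mu_p) * torch.exp(0.5 * logvar_p)

    # p_theta(x | z, c): decode from the concatenated latent
    return decoder(torch.cat([z, c], dim=-1))
```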

Instead of the traditional ELBO, we optimize one whose KL term is taken over the joint \((\mathbf{z}, \mathbf{c})\):

\[\text{ELBO} := \underbrace{\mathbb{E}_{q_\phi(\mathbf{z, c \vert x_i})}\bigg[\log p_\theta(\mathbf{x_i \vert z, c})\bigg]}_{(1)\ \text{Reconstruction likelihood}} - \underbrace{D_\text{KL} \bigg[q_\phi(\mathbf{z, c \vert x_i}) \,\vert\vert\, p_\psi(\mathbf{z, c})\bigg]}_{(2)\ \text{Prior constraint}}\]

\((2)\) can be rewritten as the sum of two separate KL terms: a categorical part and a continuous part.

\[\begin{aligned}(2) &= D_\text{KL} \bigg[q_\phi(\mathbf{z, c \vert x_i}) \,\vert\vert\, p_\psi(\mathbf{z, c})\bigg] \\ &= D_\text{KL} \bigg[q_\phi(\mathbf{c \vert x_i}) \,\vert\vert\, p(\mathbf{c}) \bigg] + \mathbb{E}_{q_\phi (\mathbf{c \vert x_i})} D_\text{KL} \bigg[q_\phi(\mathbf{z \vert c, x_i}) \,\vert\vert\, p_\psi(\mathbf{z \vert c})\bigg]\end{aligned}\]

As such, minimizing the joint KL term pushes both the categorical and the continuous posterior towards their respective priors.
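Below is a sketch of how this decomposed KL might be computed in practice; the tensor names (`q_c`, `mu_q`, and so on) are assumptions rather than the paper's code, and the continuous term uses the standard closed form for the KL between diagonal Gaussians.

```python
import math
import torch

def joint_kl(q_c, mu_q, logvar_q, mu_p, logvar_p, eps=1e-8):
    """KL[q(c|x) || p(c)] + E_{q(c|x)} KL[q(z|c,x) || p_psi(z|c)].

    q_c: (B, K) posterior probabilities over the K discrete components.
    mu_q, logvar_q: (B, K, D) parameters of q(z | c, x), one Gaussian per component.
    mu_p, logvar_p: (K, D) parameters of the conditional prior p_psi(z | c).
    """
    K = q_c.size(-1)

    # Categorical KL against the uniform prior p(c) = 1/K
    kl_cat = (q_c * torch.log(q_c + eps)).sum(-1) + math.log(K)

    # Closed-form KL between diagonal Gaussians, computed for every component
    kl_gauss = 0.5 * (
        logvar_p - logvar_q
        + (logvar_q.exp() + (mu_q - mu_p) ** 2) / logvar_p.exp()
        - 1.0
    ).sum(-1)  # shape (B, K)

    # Expectation of the Gaussian KL over q(c | x)
    return kl_cat + (q_c * kl_gauss).sum(-1)
```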

\(\mathbf{c}\) itself is supposed to be a pure categorical (one-hot) vector; however, since one cannot backpropagate through the argmax operator, a Gumbel-softmax reparameterization is used instead.
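A minimal sketch of that reparameterization (PyTorch also ships it as `torch.nn.functional.gumbel_softmax`; it is written out here only to make the trick explicit):

```python
import torch
import torch.nn.functional as F

def gumbel_softmax_sample(logits, tau=1.0):
    # Sample Gumbel(0, 1) noise and add it to the categorical logits
    gumbel = -torch.log(-torch.log(torch.rand_like(logits) + 1e-20) + 1e-20)
    # A temperature-tau softmax yields a differentiable, "soft" one-hot c
    return F.softmax((logits + gumbel) / tau, dim=-1)
```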

In the implementation, the authors built an encoder with three outputs: \(\mu\), \(\log \sigma\), and a Gumbel-softmax representation of \(\mathbf{c}\). The input to the decoder was then \(\mathbf{z}\) with \(\mathbf{c}\) concatenated onto it. Given this implementation, one can think of \(\mathbf{c}\) as the output of a function that takes in a sample \(\mathbf{x}\) and ‘softly’ predicts which mode it lies in; this is learned implicitly by the model. On MNIST, using only \(\mathbf{c}\), the model achieves about 95% accuracy without labels.
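A rough PyTorch sketch of this architecture, assuming simple fully-connected layers (the paper's exact layer sizes and activations are not reproduced here):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CPVAE(nn.Module):
    def __init__(self, x_dim, z_dim, num_components, hidden=256):
        super().__init__()
        self.enc = nn.Linear(x_dim, hidden)
        # Three encoder heads: mu, log sigma, and the logits for c
        self.mu_head = nn.Linear(hidden, z_dim)
        self.log_sigma_head = nn.Linear(hidden, z_dim)
        self.c_head = nn.Linear(hidden, num_components)
        # The decoder sees z with c concatenated on
        self.dec = nn.Sequential(
            nn.Linear(z_dim + num_components, hidden), nn.ReLU(),
            nn.Linear(hidden, x_dim),
        )

    def forward(self, x, tau=1.0):
        h = F.relu(self.enc(x))
        mu, log_sigma = self.mu_head(h), self.log_sigma_head(h)
        c = F.gumbel_softmax(self.c_head(h), tau=tau)   # soft one-hot c
        z = mu + torch.randn_like(mu) * torch.exp(log_sigma)
        x_hat = self.dec(torch.cat([z, c], dim=-1))
        return x_hat, mu, log_sigma, c
```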

Theoretical improvements:

When compared to a vanilla VAE, CP-VAE seems to have the following advantages:

  1. Possibly better generated samples if the underlying data has many distinct, coarse-grained modes

  2. A possibly more flexible posterior

Excluding possible pathological edge cases, there don’t seem to be many drawbacks. However, the authors did not test CP-VAE on any dataset of reasonable dimensionality, so no strong conclusions can be drawn.