Auto-encoding generative adversarial networks (GANs) combine the standard GAN algorithm, which discriminates between real and model-generated data, with a reconstruction loss given by an auto-encoder. Such models aim to prevent mode collapse in the learned generative model by ensuring that it is grounded in all the available training data. In this paper, we develop a principle upon which auto-encoders can be combined with generative adversarial networks by exploiting the hierarchical structure of the generative model. The underlying principle shows that variational inference can be used a basic tool for learning, but with the in- tractable likelihood replaced by a synthetic likelihood, and the unknown posterior distribution replaced by an implicit distribution; both synthetic likelihoods and implicit posterior distributions can be learned using discriminators. This allows us to develop a natural fusion of variational auto-encoders and generative adversarial networks, combining the best of both these methods. We describe a unified objective for optimization, discuss the constraints needed to guide learning, connect to the wide range of existing work, and use a battery of tests to systematically and quantitatively assess the performance of our method.
Choosing the parameters of a probability distribution in order to minimize an expected loss is the central problem in many machine learning applications, including inference in latent variable models or policy search in reinforcement learning. In most cases however, the exact gradient of the expected loss is not available as a closed-form expression. This has led to the development of flexible gradient estimators, with a focus on the conflicting objectives of being both general-purpose and low-variance. If the loss is non-differentiable, as is the case in reinforcement learning, or if the distribution is discrete, as for probabilistic models with discrete latent variables, we have to resort to score-function (SF) gradient estimators. Naive SF estimators have high variance and therefore require sophisticated variance reduction techniques, such as baseline models, to render them effective in practice. Here we show that under certain symmetry and parametric assumptions on the distribution, one can derive unbiased stochastic gradient estimators based on finite differences (FD) of the loss function. These estimators do not require learning baseline models and potentially have less variance. Furthermore, we highlight connections of the FD estimators to simultaneous perturbation sensitivity analysis (SPSA), as well as weak derivative and “straight-through” gradient estimators.
We consider the problem of density estimation on Riemannian manifolds. Density estimation on manifolds has many applications in fluid-mechanics, optics and plasma physics and it appears often when dealing with angular variables (such as used in protein folding, robot limbs, gene-expression) and in general directional statistics. In spite of the multitude of algorithms available for density estimation in the Euclidean spaces Rn that scale to large n (e.g. normalizing flows, kernel methods and variational approximations), most of these methods are not immediately suitable for density estimation in more general Riemannian manifolds. We revisit techniques related to homeomorphisms from differential geometry for projecting densities to sub-manifolds and use it to generalize the idea of normalizing flows to more general Riemannian manifolds. The resulting algorithm is scalable, simple to implement and suitable for use with automatic differentiation. We demonstrate concrete examples of this method on the n-sphere Sn.
A key goal of computer vision is to recover the underlying 3D structure from 2D observations of the world. In this paper we learn strong deep generative models of 3D structures, and recover these structures from 3D and 2D images via probabilistic inference. We demonstrate high-quality samples and report log-likelihoods on several datasets, including ShapeNet , and establish the first benchmarks in the literature. We also show how these models and their inference networks can be trained end-to-end from 2D images. This demonstrates for the first time the feasibility of learning to infer 3D representations of the world in a purely unsupervised manner.
Automated discovery of early visual concepts from raw image data is a major open challenge in AI research. Addressing this problem, we propose an unsupervised approach for learning disentangled representations of the underlying factors of variation. We draw inspiration from neuroscience, and show how this can be achieved in an unsupervised generative model by applying the same learning pressures as have been suggested to act in the ventral visual stream in the brain. By enforcing redundancy reduction, encouraging statistical independence, and exposure to data with transform continuities analogous to those to which human infants are exposed, we obtain a variational autoencoder (VAE) framework capable of learning disentangled factors. Our approach makes few assumptions and works well across a wide variety of datasets. Furthermore, our solution has useful emergent properties, such as zero-shot inference and an intuitive understanding of “objectness”.
Humans have an impressive ability to reason about new concepts and experiences from just a single example. In particular, humans have an ability for one-shot generalization: an ability to encounter a new concept, understand its structure, and then be able to generate compelling alternative variations of the concept. We develop machine learning systems with this important capacity by developing new deep generative models, models that combine the representational power of deep learning with the inferential power of Bayesian reasoning. We develop a class of sequential generative models that are built on the principles of feedback and attention. These two characteristics lead to generative models that are among the state-of-the art in density estimation and image generation. We demonstrate the one-shot generalization ability of our models using three tasks: unconditional sampling, generating new exemplars of a given concept, and generating new exemplars of a family of concepts. In all cases our models are able to generate compelling and diverse samples— having seen new examples just once—providing an important class of general-purpose models for one-shot machine learning.
The mutual information is a core statistical quantity that has applications in all areas of machine learning, whether this is in training of density models over multiple data modalities, in maximising the efficiency of noisy transmission channels, or when learning behaviour policies for exploration by artificial agents. Most learning algorithms that involve optimisation of the mutual information rely on the Blahut-Arimoto algorithm --- an enumerative algorithm with exponential complexity that is not suitable for modern machine learning applications. This paper provides a new approach for scalable optimisation of the mutual information by merging techniques from variational inference and deep learning. We develop our approach by focusing on the problem of intrinsically-motivated learning, where the mutual information forms the definition of a well-known internal drive known as empowerment. Using a variational lower bound on the mutual information, combined with convolutional networks for handling visual input streams, we develop a stochastic optimisation algorithm that allows for scalable information maximisation and empowerment-based reasoning directly from pixels to actions.
The choice of approximate posterior distribution is one of the core problems in variational inference. Most applications of variational inference employ simple families of posterior approximations in order to allow for efficient inference, focusing on mean-field or other simple structured approximations. This restriction has a significant impact on the quality of inferences made using variational methods. We introduce a new approach for specifying flexible, arbitrarily complex and scalable approximate posterior distributions. Our approximations are distributions constructed through a normalizing flow, whereby a simple initial density is transformed into a more complex one by applying a sequence of invertible transformations until a desired level of complexity is attained. We use this view of normalizing flows to develop categories of finite and infinitesimal flows and provide a unified view of approaches for constructing rich posterior approximations. We demonstrate that the theoretical advantages of having posteriors that better match the true posterior, combined with the scalability of amortized variational approaches, provides a clear improvement in performance and applicability of variational inference.
Variational inference is one of the tools that now lies at the heart of the modern data analysis lifecycle. Variational inference is the term used to encompass approximation techniques for the solution of intractable integrals and complex distributions and operates by transforming the hard problem of integration into one of optimisation. As a result, using variational inference we are now able to derive algorithms that allow us to apply increasingly complex probabilistic models to ever larger data sets on ever more powerful computing resources.
This tutorial is meant as a broad introduction to modern approaches for approximate, large-scale inference and reasoning in probabilistic models. It is designed to be of interest to both new and experienced researchers in machine learning, statistics and engineering and is intended to leave everyone with an understanding of an invaluable tool for probabilistic inference and its connections to a broad range of fields, such as Bayesian analysis, deep learning, information theory, and statistical mechanics.
The tutorial will begin by motivating probabilistic data analysis and the problem of inference for statistical applications, such as density estimation, missing data imputation and model selection, and for industrial problems in search and recommendation, text mining and community discovery. We will then examine importance sampling as one widely-used Monte Carlo inference mechanism and from this begin our journey towards the variational approach for inference. The principle of variational inference and basic tools from variational calculus will be introduced, as well as the class of latent Gaussian models that will be used throughout the tutorial as a running example. Using this foundation, we shall discuss different approaches for approximating posterior distributions, the smorgasbord of techniques for optimising the variational objective function, a discussion of implementation and largescale applications, a brief look at the available theory for variational methods, and an overview of other variational problems in machine learning and statistics.
The ever-increasing size of modern data sets combined with the difficulty of obtaining label information has made semi-supervised learning one of the problems of significant practical importance in modern data analysis. We revisit the approach to semi-supervised learning with generative models and develop new models that allow for effective generalisation from small labelled data sets to large unlabelled ones. Generative approaches have thus far been either inflexible, inefficient or non-scalable. We show that deep generative models and approximate Bayesian inference exploiting recent advances in variational methods can be used to provide significant improvements, making generative approaches highly competitive for semi-supervised learning.
We marry ideas from deep neural networks and approximate Bayesian inference to derive a generalised class of deep, directed generative models, endowed with a new algorithm for scalable inference and learning. Our algorithm introduces a recognition model to represent approximate posterior distributions, and that acts as a stochastic encoder of the data. We develop stochastic back-propagation -- rules for back-propagation through stochastic variables -- and use this to develop an algorithm that allows for joint optimisation of the parameters of both the generative and recognition model. We demonstrate on several real-world data sets that the model generates realistic samples, provides accurate imputations of missing data and is a useful tool for high-dimensional data visualisation.
We show how mixture models, partial membership models, factor analysis, and their extensions to more general mixed-membership models, can be unified under a simple framework using the exponential family of distributions and variations in the prior assumptions on the latent variables that are used. We describe two models within this common latent variable framework: a Bayesian partial membership model and a Bayesian exponential family factor analysis model. Accurate inferences can be achieved within this framework that allow for prediction, missing value imputation, and data visualisation, and importantly, allow us to make a broad range of insightful probabilistic queries of our data. We emphasise the adaptability and flexibility of these models for a wide range of tasks, characteristics that will continue to see such models used at the core of modern data analysis paradigms.
In this paper we address the widely-experienced difficulty in tuning Hamiltonian-based Monte Carlo samplers. We develop an algorithm that allows for the adaptation of Hamiltonian and Riemann manifold Hamiltonian Monte Carlo samplers using Bayesian optimization that allows for infinite adaptation of the parameters of these samplers. We show that the resulting sampling algorithms are ergodic, and that the use of our adaptive algorithms makes it easy to obtain more efficient samplers, in some cases precluding the need for more complex solutions. Hamiltonian-based Monte Carlo samplers are widely known to be an excellent choice of MCMC method, and we aim with this paper to remove a key obstacle towards the more widespread use of these samplers in practice.
We present a new variational inference algorithm for Gaussian process regression with non-conjugate likelihood functions, with application to a wide array of problems including binary and multi-class classification, and ordinal regression. Our method constructs a concave lower bound that is optimized using an efficient fixed-point updating algorithm. We show that the new algorithm has highly competitive computational complexity, matching that of alternative approximate inference methods. We also prove that the use of concave variational bounds provides stable and guaranteed convergence – a property not available to other approaches. We show empirically for both binary and multi-class classification that our new algorithm converges much faster than existing variational methods, and without any degradation in performance.
Rich and complex time-series data, such as those generated from engineering systems, financial markets, videos or neural recordings, are now a common feature of modern data analysis. Explaining the phenomena underlying these diverse data sets requires flexible and accurate models. In this paper, we promote Gaussian process dynamical systems (GPDS) as a rich model class that is appropriate for such analysis. In particular, we present a message passing algorithm for approximate inference in GPDSs based on expectation propagation. By posing inference as a general message passing problem, we iterate forward-backward smoothing. Thus, we obtain more accurate posterior distributions over latent structures, resulting in improved predictive performance compared to state-of-the-art GPDS smoothers, which are special cases of our general message passing algorithm. Hence, we provide a unifying approach within which to contextualize message passing in GPDSs.
The use of L1 regularisation for sparse learning has generated immense research interest, with many successful applications in diverse areas such as signal acquisition, image coding, genomics and collaborative filtering. While existing work highlights the many advantages of L1 methods, in this paper we find that L1 regularisation often dramatically under-performs in terms of predictive performance when compared to other methods for inferring sparsity. We focus on unsupervised latent variable models, and develop L1 minimising factor models, Bayesian variants of “L1”, and Bayesian models with a stronger L0-like sparsity induced through spike-and-slab distributions. These spikeand-slab Bayesian factor models encourage sparsity while accounting for uncertainty in a principled manner, and avoid unnecessary shrinkage of non-zero values. We demonstrate on a number of data sets that in practice spike-and-slab Bayesian methods outperform L1 minimisation, even on a computational budget. We thus highlight the need to re-assess the wide use of L1 methods in sparsity-reliant applications, particularly when we care about generalising to previously unseen data, and provide an alternative that, over many varying conditions, provides improved generalisation performance.
The development of accurate models and effi- cient algorithms for the analysis of multivariate categorical data are important and longstanding problems in machine learning and computational statistics. In this paper, we focus on modeling categorical data using Latent Gaussian Models (LGMs). We propose a novel stick-breaking likelihood function for categorical LGMs that exploits accurate linear and quadratic bounds on the logistic log-partition function, leading to an effective variational inference and learning framework. We thoroughly compare our approach to existing algorithms for multinomial logit/probit likelihoods on several problems, including inference in multinomial Gaussian process classification and learning in latent factor models. Our extensive comparisons demonstrate that our stick-breaking model effectively captures correlation in discrete data and is well suited for the analysis of categorical data.
This paper studies issues relating to the parameterization of probability distributions over binary data sets. Several such parameterizations of models for binary data are known, including the Ising, generalized Ising, canonical and full parameterizations. We also discuss a parameterization that we call the “spectral parameterization”, which has received significantly less coverage in existing literature. We provide this parameterization with a spectral interpretation by casting loglinear models in terms of orthogonal WalshHadamard harmonic expansions. Using various standard and group sparse regularizers for structural learning, we provide a comprehensive theoretical and empirical comparison of these parameterizations. We show that the spectral parameterization, along with the canonical, has the best performance and sparsity levels, while the spectral does not depend on any particular reference state. The spectral interpretation also provides a new starting point for analyzing the statistics of binary data sets; we measure the magnitude of higher order interactions in the underlying distributions for several data sets
Factor analysis and related models for probabilistic matrix factorisation are of central importance to the unsupervised analysis of data, with a colourful history more than a century long. Probabilistic models for matrix factorisation allow us to explore the underlying structure in data, and have relevance in a vast number of application areas including collaborative filtering, source separation, missing data imputation, gene expression analysis, information retrieval, computational finance and computer vision, amongst others. This thesis develops generalisations of matrix factorisation models that advance our understanding and enhance the applicability of this important class of models.
The generalisation of models for matrix factorisation focuses on three concerns: widening the applicability of latent variable models to the diverse types of data that are currently available; considering alternative structural forms in the underlying representations that are inferred; and including higher order data structures into the matrix factorisation framework. These three issues reflect the reality of modern data analysis and we develop new models that allow for a principled exploration and use of data in these settings. We place emphasis on Bayesian approaches to learning and the advantages that come with the Bayesian methodology. Our port of departure is a generalisation of latent variable models to members of the exponential family of distributions. This generalisation allows for the analysis of data that may be real-valued, binary, counts, non-negative or a heterogeneous set of these data types. The model unifies various existing models and constructs for unsupervised settings, the complementary framework to the generalised linear models in regression.
Moving to structural considerations, we develop Bayesian methods for learning sparse latent representations. We define ideas of weakly and strongly sparse vectors and investigate the classes of prior distributions that give rise to these forms of sparsity, namely the scale-mixture of Gaussians and the spike-and-slab distribution. Based on these sparsity favouring priors, we develop and compare methods for sparse matrix factorisation and present the first comparison of these sparse learning approaches. As a second structural consideration, we develop models with the ability to generate correlated binary vectors. Moment-matching is used to allow binary data with specified correlation to be generated, based on dichotomisation of the Gaussian distribution. We then develop a novel and simple method for binary PCA based on Gaussian dichotomisation. The third generalisation considers the extension of matrix factorisation models to multi-dimensional arrays of data that are increasingly prevalent. We develop the first Bayesian model for non-negative tensor factorisation and explore the relationship between this model and the previously described models for matrix factorisation.
Nonparametric Bayesian models provide a framework for flexible probabilistic modelling of complex datasets. Unfortunately, the high-dimensional averages required for Bayesian methods can be slow, especially with the unbounded representations used by nonparametric models. We address the challenge of scaling Bayesian inference to the increasingly large datasets found in real-world applications. We focus on parallelisation of inference in the Indian Buffet Process (IBP), which allows data points to have an unbounded number of sparse latent features. Our novel MCMC sampler divides a large data set between multiple processors and uses message passing to compute the global likelihoods and posteriors. This algorithm, the first parallel inference scheme for IBP-based models, scales to datasets orders of magnitude larger than have previously been possible.
We present a probabilistic model for learning non-negative tensor factorizations (NTF), in which the tensor factors are latent variables associated with each data dimension. The non-negativity constraint for the latent factors is handled by choosing priors with support on the non-negative numbers. Two Bayesian inference procedures based on Markov chain Monte Carlo sampling are described: Gibbs sampling and Hamiltonian Markov chain Monte Carlo. We evaluate the model on two food science data sets, and show that the probabilistic NTF model leads to better predictions and avoids overfitting compared to existing NTF approaches.
Principal Components Analysis (PCA) has become established as one of the key tools for dimensionality reduction when dealing with real valued data. Approaches such as exponential family PCA and non-negative matrix factorisation have successfully extended PCA to non-Gaussian data types, but these techniques fail to take advantage of Bayesian inference and can suffer from problems of over- fitting and poor generalisation. This paper presents a fully probabilistic approach to PCA, which is generalised to the exponential family, based on Hybrid Monte Carlo sampling. We describe the model which is based on a factorisation of the observed data matrix, and show performance of the model on both synthetic and real data.