The complementary nature of discriminative and generative approaches to machine learning has motivated a great deal of research on ways to combine them; among these, the most widely used and promising is the construction of generative score spaces. A score function induced by a generative model of the data provides a feature vector of fixed dimension for each data sample. The samples themselves may have differing lengths (e.g., speech segments or other sequence data), but because a score function is based on the properties of the data-generation process, it produces a fixed-length vector in a highly informative space, typically referred to as a “score space”. A score space is determined once a score argument and a score operator are given: the score argument is the function of the generative model manipulated by the score operator, which in turn specifies how the score argument is manipulated; together, these two ingredients define a score mapping. The best-known example of a score space is the Fisher score space, in which the score argument is the log-likelihood of a sample and the score operator is the gradient with respect to the model parameters. Discriminative classifiers have been shown to achieve higher performance in appropriately chosen score spaces than either the corresponding generative likelihood-based classifiers or discriminative classifiers using standard feature extractors.
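As a concrete illustration of a score mapping, the sketch below computes the Fisher score for a univariate Gaussian (the parameters and the sequence are illustrative assumptions, not taken from the paper): the gradient of the log-likelihood with respect to (mu, var) maps each sample, and by summation each variable-length sequence, to a fixed-length vector.

```python
import math

# Fisher score of a univariate Gaussian N(mu, var): the gradient of the
# sample log-likelihood with respect to the model parameters.
# Parameters below are illustrative, not from the paper.
def fisher_score(x, mu=0.0, var=1.0):
    """Map a sample x to the fixed-length Fisher score vector
    [d/dmu log p(x), d/dvar log p(x)]."""
    d_mu = (x - mu) / var
    d_var = ((x - mu) ** 2 - var) / (2 * var ** 2)
    return [d_mu, d_var]

# A variable-length sequence still yields a fixed-length score:
# sum (or average) the per-sample gradients over the sequence.
seq = [0.3, -1.2, 0.8]
score = [sum(col) for col in zip(*[fisher_score(x) for x in seq])]
print(score)  # two numbers, regardless of the sequence length
```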
In this paper, we propose a novel score space called the "Free Energy Score Space" (FESS), which focuses on how well a data point fits different parts of the generative model, rather than on derivatives with respect to the model parameters. We start from the variational free energy as an upper bound on the negative log-likelihood of the data, as this affords us two advantages. First, the variational free energy can be computed for an arbitrary structure of the posterior distribution, allowing us to deal with generative models with many latent variables and complex structure without compromising tractability, as has previously been done for inference in generative models. Second, a variational approximation of the posterior typically provides an additive decomposition of the free energy, yielding many terms that can be used as features. These terms/features fall into two categories: the “entropy set” of terms, which express uncertainty in the posterior distribution, and the “cross-entropy set”, which describes how well the data fit different parts of the model according to the posterior distribution.
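In general terms, the decomposition just described can be sketched as follows (the notation here is assumed, with hidden variables h, approximate posterior q, and model parameters θ):

```latex
F(x) \;=\; \underbrace{\sum_{h} q(h)\log q(h)}_{\text{entropy terms}}
\;\underbrace{-\;\sum_{h} q(h)\log p(x, h \mid \theta)}_{\text{cross-entropy terms}}
\;\ge\; -\log p(x \mid \theta)
```

with equality when q equals the true posterior p(h | x); the two groups of addends are the sources of the entropy set and the cross-entropy set of features.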
Figure 1: Basic ideas upon which FESS is built
The ideas upon which FESS is built are illustrated in Figure 1. Suppose we have three data points, the first belonging to one class and the other two to another class, and suppose we compute their free energies under a model learned from samples of class 1; x represents the visible variable and the c's are the hidden variables. Generative classification is based on comparing free energies: a lower free energy indicates a better fit. In this case both samples 1 and 2 fit the model well and are likely to be, wrongly, assigned to the same class.
The free energy score space breaks the total free energy down into several contributions, or "pieces", which depend uniquely on the topology of the generative model; instead of summing them (which would recover the total free energy), it treats them individually in the search for discriminative information. In the example of Figure 1, the total free energy under the generative model is the sum of three contributions associated with local parts of the model. Although the total free energies of samples 1 and 2 are similar, the ways in which these values arise differ: sample 1 fits very well at the observation level (i.e., p(x|c)), while sample 2 fits well in the other subparts. Using these "local free energies" as features for a discriminative classifier suppresses spurious and noisy free energy contributions that derive from a bad model specification or local minima. In this example, the three samples can be correctly classified by considering only the first free energy term, the observation model.
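A minimal sketch of this decomposition for a toy two-component Gaussian mixture (all parameters are illustrative assumptions, not the model used in the paper): with the exact posterior the pieces sum back to the total free energy -log p(x), but FESS keeps them separate as a feature vector.

```python
import math

# Toy 2-component Gaussian mixture: p(x) = sum_c p(c) p(x|c).
# Hypothetical parameters, chosen for illustration only.
weights = [0.5, 0.5]
means = [0.0, 4.0]
var = 1.0

def gauss(x, mu, v):
    return math.exp(-(x - mu) ** 2 / (2 * v)) / math.sqrt(2 * math.pi * v)

def fess_pieces(x):
    """Decompose the free energy of x into its additive pieces.

    With the exact posterior q(c) = p(c|x), the free energy equals
    -log p(x), and it splits into one entropy term and two
    cross-entropy terms -- the 'local free energies' used as features.
    """
    lik = [w * gauss(x, m, var) for w, m in zip(weights, means)]
    px = sum(lik)
    q = [l / px for l in lik]                    # posterior over components
    entropy = sum(qc * math.log(qc) for qc in q if qc > 0)
    cross_prior = -sum(qc * math.log(w) for qc, w in zip(q, weights))
    cross_obs = -sum(qc * math.log(gauss(x, m, var))
                     for qc, m in zip(q, means))
    return entropy, cross_prior, cross_obs, px

ent, cprior, cobs, px = fess_pieces(1.0)
total = ent + cprior + cobs
# The pieces sum back to the total free energy -log p(x);
# FESS uses them individually as features instead.
print(total, -math.log(px))
```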
Free Energy Score Space, NIPS 2009 [pdf]
Free Energy Score Space, NIPS 2009 [bibtex]
Example of free energy decomposition (see the paper for details)
We want to extract the FESS scores from the generative model described by the Bayesian network illustrated in Figure 2-a). Figure 2-b) illustrates two possible posterior factorizations; let us assume we choose the mean-field form.
Figure 2: a) Bayesian network b) Two possible posterior factorizations
The generative model (joint distribution) and the posterior distribution factorize as follows:
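The equations themselves are not reproduced on this page; a reading consistent with a two-hidden-variable network and with the five terms derived below would be a chain c1 → c2 → x with a mean-field posterior (this is an assumption, see the paper for the exact form):

```latex
p(x, c_1, c_2) = p(c_1)\, p(c_2 \mid c_1)\, p(x \mid c_2),
\qquad
q(c_1, c_2) = q(c_1)\, q(c_2). \tag{1}
```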
Given these factorizations, we can write the free energy of the model as the sum of the free energies of all the samples. In formulas:
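As the equation is not shown here, a hedged reconstruction (symbols assumed): the total free energy sums the per-sample free energies, each of which upper-bounds the negative log-likelihood of its sample:

```latex
F = \sum_{t} F(x_t), \qquad
F(x_t) = \sum_{c_1, c_2} q(c_1, c_2)\,
\log \frac{q(c_1, c_2)}{p(x_t, c_1, c_2 \mid \theta)}
\;\ge\; -\log p(x_t \mid \theta). \tag{2}
```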
Applying the factorizations in (1) to the free energy formulation in (2), we obtain:
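A hedged reconstruction of the resulting expansion, assuming the chain factorization p(x, c1, c2) = p(c1) p(c2|c1) p(x|c2) and the mean-field posterior q(c1) q(c2):

```latex
F(x_t) \;=\; \sum_{c_1} q(c_1)\log q(c_1)
\;+\; \sum_{c_2} q(c_2)\log q(c_2)
\;-\; \sum_{c_1} q(c_1)\log p(c_1)
\;-\; \sum_{c_1, c_2} q(c_1)\, q(c_2)\log p(c_2 \mid c_1)
\;-\; \sum_{c_2} q(c_2)\log p(x_t \mid c_2).
```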
Five terms are visible (see Table 1). The sum of the first two terms represents the entropy of the hidden variables, while the sum of the other three represents the cross-entropy.
The free energy of a sample can be decomposed at 4 levels of detail. The score vector at each level of detail has length M. We assume the two hidden variables are discrete and can each take D values.
Table 1: The free energy pieces
“Hybrid” levels are also possible. For example, one can use level 1 for the entropy (1 addend), level 3 for the third term (D addends), and level 2 for the remaining two terms (2 addends), for a total of M = 3 + D terms. See the paper for further details.
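To make the counting explicit, a tiny sketch of the score-vector length for the hybrid choice just described (the function name and its encoding of the levels are illustrative, not from the paper):

```python
# Score-vector length M for the hybrid level choice described above:
# the entropy terms collapsed to a single addend (level 1), the third
# term expanded into D addends, one per hidden-variable value (level 3),
# and the remaining two terms kept as one addend each (level 2).
def hybrid_length(D):
    entropy_addends = 1    # level 1: all entropy terms summed together
    expanded_addends = D   # level 3: third term, one addend per value
    kept_addends = 2       # level 2: remaining two terms, one addend each
    return entropy_addends + expanded_addends + kept_addends

print(hybrid_length(5))  # M = 3 + D = 8
```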
The code assumes the generative models have already been learned; we used BNT (ver. 1.0.7) for hidden Markov models, LibSVM for Matlab (ver. 2.91-1), and various implementations for pLSA/LDA.
FESS for dHMM - demo dHMM (promoters classification) - FESS for pLSA
FESS scores for the Fei-Fei 13-class dataset: unnormalized / normalized
FESS scores for the Chicken Pieces dataset (6 classes): unnormalized / normalized + sequences