(1) Representation learning and modeling in latent space, in the form of algebraic and geometric properties of transformations, with applications in neuroscience and vision (Gao et al. NeurIPS 2021, Gao et al. ICLR 2019, Zhu et al. CVPR 2021), and latent space energy-based models for image, text, molecule, and trajectory (as inverse planning) modeling, as well as semi-supervised learning (with information bottleneck) and meta-learning (Pang et al. NeurIPS 2020, CVPR 2021, ICML 2021).

(2) Maximum likelihood learning of deep generative models, including multi-layer top-down generative models and undirected energy-based models, as well as their integrations (Pang et al. NeurIPS 2020), powered by short-run MCMC for inference (Nijkamp et al. ECCV 2020) and synthesis (Nijkamp et al. NeurIPS 2019), which can be compared to attractor dynamics in neuroscience.
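As a minimal illustration of the short-run MCMC mentioned above (a toy sketch, not code from the cited papers): a fixed, small number of Langevin steps from a fixed initial distribution is used as the sampler. Here the energy is quadratic, E(x) = ||x||²/2, so the target exp(-E) is a standard Gaussian and the samples can be checked against it.

```python
import numpy as np

def short_run_langevin(grad_energy, x0, n_steps=100, step_size=0.3, rng=None):
    """K-step short-run Langevin dynamics from a fixed initializer x0:
    x_{k+1} = x_k - (s^2 / 2) * grad E(x_k) + s * noise.
    """
    rng = np.random.default_rng() if rng is None else rng
    x = np.array(x0, dtype=float)
    for _ in range(n_steps):
        noise = rng.standard_normal(x.shape)
        x = x - 0.5 * step_size**2 * grad_energy(x) + step_size * noise
    return x

# Toy energy E(x) = ||x||^2 / 2, so exp(-E) is a standard 2D Gaussian.
grad_E = lambda x: x

rng = np.random.default_rng(0)
samples = np.stack([
    short_run_langevin(grad_E, np.zeros(2), rng=rng) for _ in range(2000)
])
print(samples.mean(axis=0), samples.std(axis=0))  # roughly 0 and 1
```

In learning, the gradient of the log-likelihood is estimated by contrasting the energy gradient on observed data against these short-run samples; the same K-step chain is reused at every iteration, which is what makes the computation practical.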

(3) Joint training and discriminative/contrastive training of various models, e.g., the energy-based model, flow-based model, generator model, and inference model, without resorting to MCMC, which is instead amortized by learned computation.


* Formulate the modern ConvNet-parametrized EBM as exponential tilting of a reference distribution, and connect it to the discriminative ConvNet classifier (Dai et al. ICLR 15, Xie et al. ICML 16).
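In formulas, the exponential tilting formulation reads (notation here is illustrative, following the standard presentation):

```latex
p_\theta(x) = \frac{1}{Z(\theta)} \exp\big(f_\theta(x)\big)\, q(x),
\qquad
Z(\theta) = \int \exp\big(f_\theta(x)\big)\, q(x)\, dx,
```

where $q(x)$ is a reference distribution (e.g., Gaussian white noise) and $f_\theta(x)$ is the scalar output of a bottom-up ConvNet. Since $\log p_\theta(x) - \log q(x) = f_\theta(x) - \log Z(\theta)$, the energy $f_\theta$ plays the role of a logit score, which is the link to the discriminative ConvNet classifier.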


* Scale up maximum likelihood learning of ConvNet-EBM to big datasets by multi-grid methods.

(learned grid cells)

(Initializer is like a policy model, whose solution is refined by a value model)

(Predicting trajectory)

(Change of pose in physical space = rotation of vector in neural space)

(Synthesized images)

(Left: latent space EBM stands on generator. Right: Short-run MCMC in latent space)

The latent space EBM stands on a top-down generation network. It is like a value network or cost function defined in latent space.
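In symbols (following the common formulation of a latent space EBM prior, as in Pang et al. NeurIPS 2020; the notation is used illustratively):

```latex
z \sim p_\alpha(z) = \frac{1}{Z(\alpha)} \exp\big(f_\alpha(z)\big)\, p_0(z),
\qquad
x = g_\beta(z) + \epsilon,
```

where $p_0(z)$ is an isotropic Gaussian, $f_\alpha(z)$ is the latent-space energy (the "value network" or cost function), and $g_\beta$ is the top-down generation network.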

The scalar-valued energy function is an objective function, a cost function, an evaluator, or a critic. It is about constraints, regularities, rules, perceptual organizations, and Gestalt laws. The energy-based model is descriptive instead of generative, which is why we used to call it the descriptive model. It only describes what it wants without bothering with how to get it. Compared to the generator model (whose output is high-dimensional instead of scalar), the energy-based model is like setting up an equation, whereas the generator model is like producing the solution directly. It is much easier to set up the equation than to give the answer; that is, it is easier to specify a scalar-valued energy function than a vector-valued generation function, the latter being like a policy network.
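The equation-versus-solution analogy can be made concrete with a toy example (a sketch of my own, not from the papers): an energy that only scores candidates against a linear constraint, versus a direct map that must output the answer itself. Minimizing the energy recovers the direct solution.

```python
import numpy as np

# "Setting up the equation": a scalar energy that only scores candidates.
A = np.array([[2.0, 1.0], [1.0, 3.0]])
b = np.array([1.0, 2.0])

def energy(x):
    return float(np.sum((A @ x - b) ** 2))  # low energy = constraint satisfied

# "Giving the answer": a direct map must output the solution itself.
direct_solution = np.linalg.solve(A, b)

# The energy only evaluates; gradient descent on it recovers the solution.
x = np.zeros(2)
for _ in range(500):
    x = x - 0.05 * 2.0 * A.T @ (A @ x - b)  # gradient of the energy

print(energy(direct_solution))                     # ~0: the answer satisfies the constraint
print(np.allclose(x, direct_solution, atol=1e-3))  # True: minimizing recovers it
```

Specifying `energy` took one line; producing `direct_solution` required knowing how to solve the system, which mirrors the ease of writing a critic versus a policy.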

The energy-based model in latent space is simple and yet expressive, capturing rules or regularities implicitly but effectively. The latent space seems the right home for the energy-based model.

Short-run MCMC in latent space for prior and posterior sampling is efficient and mixes well. One can amortize MCMC with a learned network (see our recent work on semi-supervised learning), but in this initial paper we prefer to keep it pure and simple, without mixing in tricks from VAEs and GANs.
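A toy sketch of posterior sampling by Langevin dynamics in latent space (my own linear-Gaussian example, not the paper's code): with the energy term set to zero the prior reduces to N(0, I) and the generator to a linear map, so the exact posterior mean is available in closed form to check the chain against.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy top-down model: x = W z + noise, standard-normal prior on z.
d_z, d_x, sigma = 2, 4, 0.5
W = rng.standard_normal((d_x, d_z))
x_obs = rng.standard_normal(d_x)

def grad_log_posterior(z):
    # d/dz [ log p(z) + log p(x | z) ] for this Gaussian toy model.
    return -z + W.T @ (x_obs - W @ z) / sigma**2

def posterior_langevin(n_steps=300, s=0.15):
    z = np.zeros(d_z)  # fixed initializer, short-run chain
    for _ in range(n_steps):
        z = z + 0.5 * s**2 * grad_log_posterior(z) + s * rng.standard_normal(d_z)
    return z

samples = np.stack([posterior_langevin() for _ in range(500)])

# Closed-form posterior mean for comparison.
precision = np.eye(d_z) + W.T @ W / sigma**2
post_mean = np.linalg.solve(precision, W.T @ x_obs / sigma**2)
print(np.abs(samples.mean(axis=0) - post_mean).max())  # small
```

In the actual model the gradient of the log posterior additionally carries the term from the latent-space energy $f_\alpha(z)$, but the sampling loop is the same; because $z$ is low-dimensional, the chain is cheap and mixes well.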

(Left: latent EBM captures chemical rules implicitly in latent space. Right: generated molecules)

(the symbolic one-hot y is coupled with dense vector z to form an associative memory, and z is the information bottleneck between x and y)

(VAE as alternating projection)

(mode traversing HMC chains)

(learned V1 cells)

(The model generates both displacement field and appearance)

(neural-symbolic learning)

(Three densities in joint space: pi: latent EBM, p: generator, q: inference)

(reconstruction by short-run MCMC, yes it can reconstruct observed images)

(learned grid cells)

(videos generated by the learned model)

(videos generated by the learned model)

(faces generated and interpolated by the learned model)

(pi is EBM, p is generator, q is inference)

(videos generated by the learned model)

(face rotation by the learned model)

(learning directly from occluded images. Row 1: original images, not available to the model; Row 2: training images; Row 3: learning and reconstruction.)

(left: observed; right: synthesized.)

(Langevin dynamics for sampling ConvNet-EBM)

J Xie, W Hu, SC Zhu, and YN Wu (2014) Learning sparse FRAME models for natural image patterns. International Journal of Computer Vision. pdf project page

J Dai, Y Hong, W Hu, SC Zhu, and YN Wu (2014) Unsupervised learning of dictionaries of hierarchical compositional models. Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR). pdf project page

J Dai, YN Wu, J Zhou, and SC Zhu (2013) Cosegmentation and cosketch by unsupervised learning. Proceedings of International Conference on Computer Vision (ICCV). pdf project page

Y Hong, Z Si, WZ Hu, SC Zhu, and YN Wu (2013) Unsupervised learning of compositional sparse code for natural image representation. Quarterly of Applied Mathematics. pdf project page

YN Wu, Z Si, H Gong, SC Zhu (2010) Learning active basis model for object detection and recognition. International Journal of Computer Vision, 90, 198-235. pdf project page

Z Si, H Gong, SC Zhu, YN Wu (2010) Learning active basis models by EM-type algorithms. Statistical Science, 25, 458-475. pdf project page

YN Wu, C Guo, SC Zhu (2008) From information scaling of natural images to regimes of statistical models. Quarterly of Applied Mathematics, 66, 81-122. pdf

YN Wu, Z Si, C Fleming, and SC Zhu (2007) Deformable template as active basis. Proceedings of International Conference of Computer Vision. pdf project page

M Zheng, LO Barrera, B Ren, YN Wu (2007) ChIP-chip: data, model and analysis. Biometrics, 63, 787-796. pdf

C Guo, SC Zhu, and YN Wu (2007) Primal sketch: integrating structure and texture. Computer Vision and Image Understanding, 106, 5-19. pdf project page

C Guo, SC Zhu, and YN Wu (2003) Towards a mathematical theory of primal sketch and sketchability. Proceedings of International Conference of Computer Vision. 1228-1235. pdf project page

G Doretto, A Chiuso, YN Wu, S Soatto (2003) Dynamic textures. International Journal of Computer Vision, 51, 91-109. pdf (source code given in paper) project page

C Guo, SC Zhu, and YN Wu (2003) Modeling visual patterns by integrating descriptive and generative models. International Journal of Computer Vision, 53(1), 5-29. pdf

JC Pinheiro, C Liu, YN Wu (2001) Efficient algorithms for robust estimation in linear mixed-effects models using the multivariate t distribution. Journal of Computational and Graphical Statistics 10 (2), 249-276. pdf

YN Wu, SC Zhu, X Liu (2000) Equivalence of Julesz ensembles and FRAME models. International Journal of Computer Vision, 38, 247-265. pdf project page

JS Liu, YN Wu (1999) Parameter expansion for data augmentation. Journal of the American Statistical Association, 94, 1264-1274. pdf

C Liu, DB Rubin, YN Wu (1998) Parameter expansion to accelerate EM -- the PX-EM algorithm. Biometrika, 85, 755-770. pdf

SC Zhu, YN Wu, DB Mumford (1998) Minimax entropy principle and its application to texture modeling. Neural Computation, 9, 1627-1660. pdf

SC Zhu, YN Wu, DB Mumford (1997) Filter, Random field, And Maximum Entropy (FRAME): towards a unified theory for texture modeling. International Journal of Computer Vision, 27, 107-126. pdf

YN Wu (1995) Random shuffling: a new approach to matching problem. Proceedings of American Statistical Association, 69-74. Longer version pdf