Notes

Seminars

2010--, Software Institute, Peking University

2013--2015, SIG ML/NLP, Software Institute, Peking University

2015.11--, Baidu Inc. with Rui Yan

Dec 2016/Jun 2017. Special Talks on Scientific Writing [slides]

27 Mar 2017	Neural Programming
11 May/08 Jun	Sequence Generation [1, 2]	Deep learning goes far beyond CNNs, RNNs, etc. In these two seminars, Yunchuan and I introduced several recent techniques for sequence (sentence) generation, including sampling approaches, reinforcement learning, and variational autoencoding.

23/27 Apr	Trans*	Trans* is a family of methods for learning vector representations of entities and relations in a knowledge base (or a knowledge graph---don't ask me the difference). It all started from ||h+r-t||.
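As a minimal sketch of the idea behind the first model in the family (TransE), a triple (h, r, t) is scored by the distance ||h + r - t||, so a relation acts as a translation from the head entity to the tail entity. The entity names and 2-d embeddings below are hypothetical, chosen by hand for illustration only; real embeddings are learned.

```python
import math

def transe_score(h, r, t):
    """L2 distance ||h + r - t||; a smaller value means a more plausible triple."""
    return math.sqrt(sum((hi + ri - ti) ** 2 for hi, ri, ti in zip(h, r, t)))

# Hypothetical hand-picked embeddings (not learned, illustration only).
beijing = [1.0, 0.0]
china = [1.0, 1.0]
france = [3.0, 2.0]
capital_of = [0.0, 1.0]

good = transe_score(beijing, capital_of, china)    # h + r lands exactly on t
bad = transe_score(beijing, capital_of, france)    # h + r is far from t
print(good, bad)
```

Training would push the score of observed triples below that of corrupted ones, typically with a margin-based ranking loss.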


04 Apr 2016	AlphaGo	(Courtesy of Yunchuan) A combination of convolutional neural networks (CNNs) and Monte Carlo tree search (MCTS).

27 Mar 2016	Neural Symbolics	A series of work from Noah's Ark Lab, Huawei. My understanding is that the goal is to design a (complicated) neural network to mimic human behaviors: modeling a sentence, querying a table/KB, selecting a field/column, selecting a row, copying something, etc. Challenges of end-to-end neuralized learning include differentiability, supervision, and scalability.

26 Mar 2016	Neuroscience and Alzheimer's disease	(Courtesy of Yu Wu) Ankyrin G (AnkG) plays a critical role at the axon initial segment (AIS). AnkG downregulation impairs the selective filtering machinery at the AIS. Impaired AIS filtering might underlie functional defects in APP/PS1 neurons. Disclaimer: I am not an expert in neuroscience.

08 Mar 2016	Generative Adversarial Nets	A combination of neural networks and game theory. Imagine that we have two agents, a Generator G and a Discriminator D: G generates fake samples, while D tries to tell these disguised fake samples apart from real ones. The objective is min_G max_D V(D,G).
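For reference, the value function V(D,G) in the standard GAN formulation can be written out as follows. The symbols p_data (the data distribution) and p_z (the noise prior fed to G) are notation assumed here; they do not appear in the note above.

```latex
\[
\min_G \max_D V(D, G)
  = \mathbb{E}_{x \sim p_{\text{data}}(x)}\big[\log D(x)\big]
  + \mathbb{E}_{z \sim p_z(z)}\big[\log\big(1 - D(G(z))\big)\big]
\]
```

D maximizes V by assigning high probability to real samples and low probability to generated ones, while G minimizes V by making D(G(z)) large.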

28 Oct 2015	Variational Autoencoders	(By Yunchuan) Variational autoencoders give a distribution of hidden variables, z, while traditional autoencoders compute z in a deterministic way. But why is it useful in practice?

28 Oct 2015	Domain Adaptation	Including EasyAdapt, instance weighting, and structural correspondence learning. I am, in fact, curious about adaptation in neural network-based settings. However, NNs are adaptable by nature through incremental/multi-task training. Therefore, there is little point, as far as I can currently see, in NN adaptation. Nevertheless, I have conducted a series of comparative studies to shed more light on transferring knowledge in neural networks [pdf (EMNLP-16)].

21 Oct 2015	Variational Inference (again)	Let x be visible variables, and z be invisible (hidden) ones. Estimating p(x) is usually difficult because we have to sum/integrate over z. A variational lower bound peaks when z~p(z|x), which is oftentimes intractable. The mixture of Gaussians, for example, assumes z in a parametric form, i.e., Gaussian. In VI in general, we still have to restrict the form of z, but not in a parametric way. A typical approximation is factorization, that is, p(z)=\prod_i p(z_i).
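The lower bound mentioned above can be sketched as the standard decomposition of the log evidence. The approximating distribution q(z) is notation assumed here (as in Ch 10 of Pattern Recognition and Machine Learning); the note itself does not name it.

```latex
\begin{align*}
\log p(x)
  &= \underbrace{\mathbb{E}_{q(z)}\!\left[\log \frac{p(x, z)}{q(z)}\right]}_{\text{lower bound } \mathcal{L}(q)}
   + \mathrm{KL}\big(q(z) \,\|\, p(z \mid x)\big)
   \;\ge\; \mathcal{L}(q), \\
q(z) &= \prod_i q_i(z_i) \qquad \text{(factorized, or mean-field, approximation)}.
\end{align*}
```

Since the KL term is nonnegative and vanishes only when q(z) = p(z|x), the bound is tight exactly at the true posterior; the factorized family trades that tightness for tractability.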

14 Oct 2015	Attention-based Networks	(By Hao Peng) The encoding-decoding model opens a new era of sequence generation. It is unrealistic, however, to encode a very long input sequence into a single fixed-size vector. The attention mechanism is designed to aggregate information over the input sequence by an adaptive weighted sum. Selected papers: NIPS'14, pp. 3104--3112; ICLR'15; ICML'15; EMNLP'15, pp. 319--389; EMNLP'15, pp. 1412--1421.
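The weighted sum can be sketched in a few lines. This is only an illustration under stated assumptions: dot-product scoring is assumed for simplicity (the papers listed above use various scoring functions), and the function name and toy vectors are hypothetical.

```python
import math

def attend(query, keys, values):
    """Return softmax attention weights and the resulting weighted sum of values."""
    # Assumed scoring function: dot product between the query and each key.
    scores = [sum(q * k for q, k in zip(query, key)) for key in keys]
    # Numerically stable softmax over the input positions.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Context vector: weighted sum of the value vectors.
    dim = len(values[0])
    context = [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]
    return weights, context

# Toy encoder states (used as both keys and values) and a decoder query.
states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
weights, context = attend([2.0, 0.0], states, states)
print(weights, context)
```

The weights are recomputed for every decoding step, so the decoder can look at different input positions as generation proceeds instead of relying on one fixed vector.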

14 Oct 2015	Discourse Parsing with PCFG	We wrap up discourse analysis by PCFG-based discourse parsing, which requires probabilistic context-free grammar in general.

23 Sep 2015	Discourse Analysis	We shall also explore various NLP research topics; discourse analysis, discussed in this seminar, is the first step in broadening our horizons. Notice that the slides are nothing but snapshots of papers in the proceedings, and in fact have little substance.

22 Jul 2015	Variational Inference	I am a tyro in variational inference. Please refer to Ch 10, Pattern Recognition and Machine Learning.

Bad news: Thursday evening's seminars are suspended temporarily. Good news: I am reading Statistical Decision Theory and Bayesian Analysis, by James O. Berger (1985). The following lists some hopefully useful materials.
	Ch 1: Losses, Risks and Decision Principles
	Resources:	My textual digest, highlighting some meaningful philosophical discussions in the textbook. My written note, mostly derived from the textbook with remarks. Slide, by Dr. Yu, who was the instructor of my undergraduate course Probability Theory and Statistics. I was always agitated after his lectures.

	Ch 2: Utilities and Losses [digest, note, slide by Dr. Yu]
	Ch 3: Prior Information and Subjective Probability [digest, note, slide by Dr. Yu]
	Frequentist vs Bayesian



30 Apr 2015	1. (By Yangyang Lu) A guided tour to selected papers. 2. Gaussian processes for classification. Ref: Ch. 6.4.5, 6.4.6, Pattern Recognition and Machine Learning.
		Equipped with Bayesian logistic regression and GP in general, we find GP classification is easy except for the seemingly overwhelming formulas.

29 Apr 2015	Sampling methods	(Courtesy of Yunchuan Chen) God does not play dice, but we humans do. As inference in many machine learning models is intractable, we have to resort to some approximations, among which are sampling methods. The idea of sampling is straightforward---if we want to estimate p(Head) of a coin, one approach is to go through all mathematical and physical details, which does not seem to be a good idea; an alternative is to toss the coin multiple times, giving a fairly good estimate of p(Head). However, how to design efficient sampling algorithms is a $64,000,000 question.
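The coin example can be sketched as a tiny Monte Carlo simulation. The true bias of 0.3, the number of tosses, and the function name are assumptions made for illustration only.

```python
import random

def estimate_p_head(true_p, n_tosses, seed=0):
    """Estimate p(Head) by simulating tosses of a coin with bias true_p."""
    rng = random.Random(seed)  # fixed seed so the run is reproducible
    heads = sum(rng.random() < true_p for _ in range(n_tosses))
    return heads / n_tosses

estimate = estimate_p_head(true_p=0.3, n_tosses=100_000)
print(estimate)
```

The standard error of this estimate shrinks as 1/sqrt(n), which is why plain sampling is simple but can be slow, and why more efficient samplers are worth studying.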

23 Apr 2015	Linear Classification Ref: Ch. 4, Ch. 6.4, Pattern Recognition and Machine Learning	We first wrap up our discussion of Gaussian processes by introducing hyperparameter learning in kernels. Then we introduce linear classification models, including discriminant functions, probabilistic generative/discriminative models, and Bayesian logistic regression (with special interest). Linear classification is easy---my good old friend, logistic regression, always serves as a baseline method in various applications. Through a systematic study, however, we can grasp the main idea behind a range of machine learning techniques. This seminar also precedes our future discussion on GP classification.

17 Apr 2015	Sum Product Networks	(By Weizhuo Li) On some theoretical aspects of SPNs, e.g., normalization, decomposability, etc. Weizhuo also highlighted a '11 NIPS paper on deep architectures vis-a-vis shallow ones.

16 Apr 2015	Memory Networks	(Courtesy of Yangyang Lu)

09 Apr 2015	Gaussian Processes + Bayesian linear regression	In this seminar, we introduce Gaussian process regression, which extends Bayesian linear regression with kernels. However, as far as I am concerned, the two models are not equivalent, even with finite basis functions. If I am wrong, please feel free to tell me. It was really an awesome seminar, filled with a whole bunch of food, drinks, and also fruitful discussion. [See photos 1 2 3.]

14 Jan 2015	Sum Product Networks	(Courtesy of Weizhuo Li) Sum product networks (SPNs) are a way of decomposing joint distributions. Most inference is tractable w.r.t. the size of the SPN. However, it seems that graphical models, if converted to SPNs, have exponential numbers of nodes in SPNs. The story confirms the "no free lunch theorem." As in general no perfect "I-map" exists for most real-world applications, what we have to do is to capture important aspects by ignoring unimportant ones.
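To make the tractability claim concrete, here is a toy SPN over two binary variables: a root sum node mixing two product nodes, each multiplying one Bernoulli leaf per variable. The structure, the weights, and the leaf parameters are all made up for illustration. A marginal is computed in a single bottom-up pass by setting the leaves of the marginalized variable to 1.

```python
def leaf(p_true, value):
    """Bernoulli leaf; value=None means the variable is marginalized out."""
    if value is None:
        return 1.0
    return p_true if value else 1.0 - p_true

def spn(x1, x2):
    """Root sum node over two product nodes (hypothetical weights 0.6 / 0.4)."""
    product_a = leaf(0.9, x1) * leaf(0.2, x2)
    product_b = leaf(0.1, x1) * leaf(0.7, x2)
    return 0.6 * product_a + 0.4 * product_b

# The joint distribution sums to one over all assignments.
total = sum(spn(a, b) for a in (True, False) for b in (True, False))
# Marginal p(x1=True) from one pass, without enumerating x2.
marginal_x1 = spn(True, None)
print(total, marginal_x1)
```

Each evaluation touches every node once, so the cost is linear in the size of the network; the catch noted above is that the network itself may need to be very large.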

7 Jan 2015	Deep Belief Nets	One of the core ideas in deep learning is to "do things wrongly and hope they work." G. Hinton introduced the CD-k algorithm for fast training of restricted Boltzmann machines; he also introduced layer-wise RBM pretraining for neural networks, opening an era of deep learning.

19 Dec 2014	Copulas Ref: Ch. 4.6, Statistical Pattern Recognition	Given marginal distributions, the joint distribution is not unique because of all possible kinds of dependencies among variables. A copula is defined as a joint distribution on the unit cube with uniform marginals. It can (just can) capture nontrivial dependencies and link marginals with joint distributions. Sklar's theorem says, Copula(Marginals)=Joint.
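As a rough sketch of how a copula links marginals to a joint distribution, the snippet below samples from a Gaussian copula and then pushes the uniform coordinates through the inverse CDFs of two exponential marginals. The choice of a Gaussian copula, the correlation 0.8, the exponential rates, and all function names are assumptions made for illustration.

```python
import math
import random

def normal_cdf(x):
    """CDF of the standard normal distribution."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def sample_gaussian_copula(rho, n, seed=0):
    """Draw n pairs (u, v) with uniform marginals and Gaussian dependence."""
    rng = random.Random(seed)
    pairs = []
    for _ in range(n):
        z1 = rng.gauss(0.0, 1.0)
        z2 = rho * z1 + math.sqrt(1.0 - rho ** 2) * rng.gauss(0.0, 1.0)
        pairs.append((normal_cdf(z1), normal_cdf(z2)))
    return pairs

def exponential_quantile(u, rate):
    """Inverse CDF of Exp(rate); the max() guards against log(0)."""
    return -math.log(max(1.0 - u, 1e-300)) / rate

pairs = sample_gaussian_copula(rho=0.8, n=20_000)
xs = [exponential_quantile(u, 1.0) for u, _ in pairs]  # marginal Exp(1)
ys = [exponential_quantile(v, 2.0) for _, v in pairs]  # marginal Exp(2)
mean_u = sum(u for u, _ in pairs) / len(pairs)
mean_v = sum(v for _, v in pairs) / len(pairs)
cov_uv = sum((u - mean_u) * (v - mean_v) for u, v in pairs) / len(pairs)
print(mean_u, sum(xs) / len(xs), sum(ys) / len(ys), cov_uv)
```

The marginals of xs and ys are fixed by the quantile functions alone, while the dependence between them comes entirely from the copula, which is the separation Sklar's theorem describes.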