Bayesian Inference on Neural Networks
We look at how to apply Bayesian inference to neural networks, focusing on variational approximate inference. Often when we train a model, we have a cost function constructed from, say, the difference between the network output and the desired output over the training set. The result of the optimisation is a minimised cost function at a single set of model parameters. Bayesian inference instead aims to obtain a distribution over the model parameters (generally by obtaining a posterior distribution).
The distribution of model parameters conditioned on the training data is known as the posterior. With Bayes' rule, we can write the model parameter posterior distribution in terms of the likelihood produced by the model, a prior distribution on the model parameters that captures what we think they should be, and the marginal likelihood. This is similar to the latent variable posterior in the Variational Autoencoder notebook, but very different in concept, as we train the model to get a distribution of the model parameters rather than a point estimate. When we have a distribution of parameters, we can tell how well trained the model is by looking at the variance of the parameter distribution: a high variance means there is a lot of uncertainty in the model, while a sharp mode indicates the model parameters are well determined. Note that a large model parameter variance does not always lead to a large output variance, because the parameter variance is transformed by the network when producing the output (for the same input, the output varies as we sample parameters from the posterior).
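As a toy illustration of that last point, consider a one-parameter "network" $y = wx$ with a hypothetical posterior over $w$ (all numbers below are made up): sampling parameters from the posterior and pushing them through the model gives a predictive distribution whose spread we can inspect.
# toy sketch (hypothetical posterior over a single weight w of a model y = w*x)
import numpy as np
rng = np.random.RandomState(0)
w_samples = rng.normal(loc=2.0, scale=0.5, size=10000)  # assumed posterior over w
x = 3.0                                                 # a fixed input
y_samples = w_samples * x                               # model output for each parameter sample
print('predictive mean: {:.2f}, predictive std: {:.2f}'.format(y_samples.mean(), y_samples.std()))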
Generally we can write the following relation between the posterior and the likelihood according to Bayes' formula:
$$p(\omega \vert X_{tr}, Y_{tr}) = \frac{p(Y_{tr} \vert X_{tr}, \omega)\, p(\omega)}{p(Y_{tr} \vert X_{tr})}$$
Here $\omega$ denotes the model parameters and $X_{tr}$ and $Y_{tr}$ are the respective inputs and outputs of the training examples of a task (e.g. classification). In some simple cases we can analytically calculate the posterior from the likelihood and the prior (see conjugate priors).
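As a quick illustration of such a conjugate case (a Bernoulli likelihood with a Beta prior, not used further in this notebook), the posterior is available in closed form: observing $k$ successes in $n$ trials gives
$$p(\theta \vert D) \propto \underbrace{\theta^{k}(1-\theta)^{n-k}}_{\text{likelihood}}\,\underbrace{\theta^{\alpha-1}(1-\theta)^{\beta-1}}_{\text{prior}} \;\;\Rightarrow\;\; \theta \vert D \sim \mathrm{Beta}(\alpha+k,\ \beta+n-k)$$
No such closed form exists for the neural network weights considered below.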
For neural network classification, the likelihood $p(Y_{tr} \vert X_{tr}, \omega)$ is the product (or, for the log likelihood, the sum) of the computed probabilities of the correct class over all training set examples (or over all examples within a minibatch).
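A minimal NumPy sketch of this, assuming a hypothetical array p_correct holding the probability the network assigns to the correct class of each minibatch example:
# minimal sketch: minibatch log likelihood as a sum of per-example log probabilities
import numpy as np
p_correct = np.array([0.9, 0.7, 0.95, 0.6])   # hypothetical per-example probabilities of the true class
log_likelihood = np.sum(np.log(p_correct))    # product of probabilities -> sum of logs
print(log_likelihood)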
So how do we train a Bayesian neural network?
For neural networks (e.g. classification), the likelihood function is computed from the softmax of the network outputs, and these outputs are computed by following the activation equations through a large number of network weights. Computing a closed-form expression for the posterior from this likelihood is therefore intractable in any practical case (a similar situation to variational autoencoders). The marginal likelihood is, as in the general case, very hard to compute if we try to marginalise all the weights out of the likelihood. One way to tackle this is to use Variational Inference, i.e. assume a manageable form of the posterior (called the variational distribution). This manageable form is parameterised by $\theta$, and we use optimisation to find the optimal parameter. This is best illustrated with an example. Suppose we define a variational distribution to approximate the true posterior:
$$q(\omega \vert \theta) \approx p(\omega \vert X_{tr}, Y_{tr}) = \frac{p(Y_{tr} \vert X_{tr}, \omega)\, p(\omega)}{Z}$$
We are writing the marginal likelihood as $Z=p(Y_{tr} \vert X_{tr})$ for simplicity, and $\theta$ is the parameter of the variational distribution $q(\omega \vert \theta)$ that we can tune to achieve a better approximation of the posterior. To measure how good the approximation is, we can use the KL divergence between the variational distribution and the true posterior:
$$KL\left( q(\omega \vert \theta) \vert\vert p(\omega \vert X_{tr}, Y_{tr}) \right) = KL\left( q(\omega \vert \theta) \vert\vert p(\omega) \right) - \mathbb{E}_{q(\omega \vert \theta)}\left[ \log p(Y_{tr} \vert X_{tr}, \omega) \right] + \log Z$$
The left term is the KL divergence between the variational distribution and the prior, whereas the right term is the expectation of the log likelihood and can be approximated by an MC sample mean, with samples of the network weights $\omega$ drawn from the variational distribution $q(\omega \vert \theta)$. The $\log Z$ term does not depend on $\theta$, so we can simply find the parameters $\theta$ that minimise the remaining expression (i.e. the parameters where the variational distribution is at its best approximation of the true posterior):
$$\theta^{*} = \underset{\theta}{\arg\min} \left[ KL\left( q(\omega \vert \theta) \vert\vert p(\omega) \right) - \mathbb{E}_{q(\omega \vert \theta)}\left[ \log p(Y_{tr} \vert X_{tr}, \omega) \right] \right]$$
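The expectation term is the piece we estimate by Monte Carlo. A minimal sketch, assuming a stand-in log_likelihood function and made-up variational parameters mu and Sigma:
# minimal sketch of the MC estimate of the expected log likelihood under q(w|theta)
import numpy as np
rng = np.random.RandomState(0)
def log_likelihood(w):                        # stand-in for the network log likelihood on a minibatch
    return -0.5 * np.sum(w ** 2)
mu = np.zeros(3)                              # assumed variational mean
Sigma = 0.01 * np.eye(3)                      # assumed variational covariance
w_samples = rng.multivariate_normal(mu, Sigma, size=1000)      # draws from q(w|theta)
mc_estimate = np.mean([log_likelihood(w) for w in w_samples])  # sample mean approximates the expectation
print(mc_estimate)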
Under this condition, we maximise the likelihood together with a regularisation term that originates from $KL \left( q(\omega|\theta) \vert\vert p(\omega) \right)$. For the neural network we can choose multivariate Gaussians as the variational distribution and the prior, i.e. $p(\omega) = \mathcal{N}(0, \mathcal{I})$ and $q(\omega|\theta) = \mathcal{N}(\mu, \Sigma)$, with $\theta=(\mu, \Sigma)$ being the mean and covariance matrix parameters. The KL divergence between these two multivariate Gaussians is given by:
$$KL\left( \mathcal{N}(\mu, \Sigma) \vert\vert \mathcal{N}(0, \mathcal{I}) \right) = \frac{1}{2}\left[ \mathrm{tr}(\Sigma) + \mu^{T}\mu - k - \log\det\Sigma \right]$$
where $k$ is the dimensionality of $\omega$.
This includes “L2 regularisation” terms on the mean vector and on the diagonal elements of the covariance matrix of the variational distribution. This, combined with the log likelihood, is the cost function we are going to feed into the optimisation routine. How do we compute the gradient of the cost with respect to our parameters $\theta=(\mu, \Sigma)$? After all, we generate some random weights, use these weights to compute the average log likelihood, and expect to compute the gradient of the log likelihood w.r.t. the weights. We can do this easily if the random weights are generated using a parameter-free generator and some linear algebra (known as the “re-parameterisation” trick), where each sample of the weights of layer $i$ is expressed as:
$$\omega_{i} = \mu_{i} + L_{i}\epsilon, \qquad \epsilon \sim \mathcal{N}(0, \mathcal{I}), \qquad \Sigma_{i} = L_{i}L_{i}^{T}$$
This is a standard way of generating a general multivariate Gaussian by transforming samples from an isotropic, unit-variance, zero-mean Gaussian distribution. The mean of the transformed variables is given by $\mu_i$ and the covariance matrix is given by $L_iL_i^T$. In the code we force $L_i$ to be a lower triangular matrix so that we obtain a positive definite covariance matrix. The resulting weights are thus differentiable w.r.t. the parameters $\mu_i$ and $\Sigma_i$ (through $L_i$). We have also separated out each neural network layer $i$, implicitly setting the network weights of different layers to be independent of each other. In the objective function above the regularisation term is derived with respect to a single prior $p(\omega)$ and a single variational distribution $q(\omega \vert \theta)$. If we assume the combined variational distribution of the weights of all neural network layers is also Gaussian, the overall variational distribution is characterised by a block-diagonal covariance:
$$q(\omega \vert \theta) = \mathcal{N}\left( \begin{bmatrix} \mu_{1} \\ \vdots \\ \mu_{n} \end{bmatrix}, \begin{bmatrix} \Sigma_{1} & & \\ & \ddots & \\ & & \Sigma_{n} \end{bmatrix} \right)$$
The determinant of a block-diagonal matrix is given by the product of the determinants of the individual blocks, so the log determinant required in the regularisation term above can be computed as the sum of the log determinants of the individual blocks. The overall regularisation term $KL\left( q(\omega \vert \theta) \vert\vert p(\omega) \right)$ for the entire network is thus simply the sum of the regularisation terms of each layer.
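A minimal NumPy sketch of these pieces for two hypothetical layers (layer sizes and initial values are made up; the class below folds the 1/2 factor and other constants into the regularisation weight): reparameterised weight samples, the per-layer KL term, and the block-diagonal log-determinant identity.
# minimal sketch: reparameterised sampling, per-layer KL to N(0, I), block-diagonal log det
import numpy as np
from scipy.linalg import block_diag
rng = np.random.RandomState(0)

def sample_weights(mu, L, n_mc, rng):
    # w = mu + L @ eps with eps ~ N(0, I); each column is one MC sample, cov(w) = L @ L.T
    eps = rng.randn(L.shape[0], n_mc)
    return mu[:, None] + L @ eps

def kl_to_standard_normal(mu, L):
    # KL( N(mu, L L^T) || N(0, I) ) = 0.5 * (tr(Sigma) + mu^T mu - k - log det Sigma)
    sigma = L @ L.T
    _, logdet = np.linalg.slogdet(sigma)
    return 0.5 * (np.trace(sigma) + mu @ mu - mu.shape[0] - logdet)

layer_dims = [4 * 3, 3 * 1]                                      # e.g. a 4->3 layer and a 3->1 layer, weights flattened
mus = [rng.randn(d) for d in layer_dims]
Ls = [np.tril(rng.rand(d, d)) + np.eye(d) for d in layer_dims]   # lower triangular -> positive definite covariance

print(sample_weights(mus[0], Ls[0], n_mc=5, rng=rng).shape)      # (12, 5): one column per MC sample
print('total KL =', sum(kl_to_standard_normal(m, L) for m, L in zip(mus, Ls)))

# log det of the block-diagonal covariance equals the sum of the per-layer log dets
full_sigma = block_diag(*[L @ L.T for L in Ls])
print(np.isclose(np.linalg.slogdet(full_sigma)[1],
                 sum(np.linalg.slogdet(L @ L.T)[1] for L in Ls)))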
Note: adding noise to neural network weights is connected to the variational Bayes approximation of the weight posterior of a neural network. The difference is that with a variational Bayesian approach we optimise the parameters of the weights’ distribution rather than the weights themselves, whereas when adding noise with a fixed noise parameter we are operating on a sub-optimal variational distribution. In other words, with variational Bayes the noise parameter is also included in the optimisation process.
We train a multilayer perceptron with Gaussian weights below using TensorFlow. Because each gradient update is calculated from stochastic weight samples, the network is substantially harder to train than one with deterministic weights. As with any other MC-sampling-based gradient descent optimisation, the learning rate, the regularisation-to-likelihood ratio, the initialisation, the number of MC samples and the input value range are all critical for successful network training. Some literature advocates a single sample per minibatch element; this may be data/network dependent. More reliable ways to tackle gradient noise (e.g. variance reduction) are reported in the literature. Here we use a large number of MC samples to smooth out the gradient updates.
# simple multilayer perceptron with gaussian variational distribution
# wrapping all network graphs in a class
import tensorflow as tf
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import normalize
from sklearn.metrics import precision_recall_fscore_support
import pandas as pd
pd.set_option('display.max_columns', None)
pd.set_option('display.max_colwidth', -1)
tf.reset_default_graph()
class MCMLP:
"""
multi layer perceptron with random weights - for variational bayes training
variational distribution of network weights are gaussian with isotropic gaussian priors
tuning - set eps to zero, see it converge, then solve the variance problem separately
- very sensitive to learning rate in MC mode.
- initialisation is crucial
- training algorithm must have some "averaging" and "momentum" feature
- inputs best to be scaled to a common range across features.
"""
def __init__(self, n_in, neu_list, n_mc, activation=tf.nn.elu, guard=10**-6,
learning_rate=0.01, mc=True):
"""
constructs the graph of MLP
n_in: input size
neu_list: list dictating how many neurons for each layer
n_mc: number of mc samples for each run
activation: tensorflow activation functions
e.g. tf.nn.elu, tf.nn.sigmoid
for hidden layers only. output layer uses sigmoid
mc: bool
do we have stochastic weights? disable to fine tune network parameters
with noise free gradients
"""
assert len(neu_list) > 1
self.n_in = n_in
self.neu_list = neu_list
self.in_list = [n_in] + neu_list[:-1]
self.n_mc = n_mc
self.guard = tf.constant(guard, dtype=tf.float32)
# input/output variables
self.x = tf.placeholder(dtype=tf.float32, shape=(None, n_in))
self.y = tf.placeholder(dtype=tf.float32, shape=(None, 1))
# holding variables of all layers in lists
self.mu = [] # means
self.tiled_mu = [] # tiled mean for convenience
self.sig_sqrt = [] # square root of covariance matrices
self.sig = [] # covariance matrices
self.eps = [] # random ~N(0, 1) for mc sampling
self.weights = [] # actual weights for MLP
self.weights_mb = [] # transposed weights for mb
self.hidden = [] # output of each layer
self.kl = [] # KL(variational_dist||prior) = KL( N(mu,sig) || N(0,I) ) of each layer weights
self.l1 = [] # l1 cost of weights and bias
self.bias = [] # bias of neurons. non-stochastic
for n, (l_in, l_neu) in enumerate(zip(self.in_list, self.neu_list)):
self.mu.append(
tf.Variable(initial_value=tf.random_normal(shape=(l_neu*l_in, 1)))
)
self.sig_sqrt.append(
tf.matrix_band_part(
tf.Variable(initial_value=tf.random_uniform(shape=(l_neu*l_in, l_neu*l_in),
minval=10**-6, maxval=1.0)),
-1, 0)
)
self.sig.append(
tf.matmul(self.sig_sqrt[-1], tf.transpose(self.sig_sqrt[-1]))
)
if mc is True:
self.eps.append(tf.random_normal(shape=(l_neu*l_in, n_mc)))
else:
self.eps.append(tf.zeros(shape=(l_neu*l_in, n_mc))) # set eps to zero -> check if we have variance issues in MC
self.tiled_mu.append(
tf.matmul(self.mu[-1], tf.ones(shape=(1, n_mc)))
)
self.weights.append(
self.tiled_mu[-1] + tf.matmul(self.sig_sqrt[-1], self.eps[-1])
)
self.weights_mb.append(
tf.reshape(tf.transpose(self.weights[-1], [1, 0]), [n_mc, l_in, -1])
)
self.bias.append(
tf.Variable(initial_value=tf.random_normal(shape=(1, l_neu)))
)
self.l1.append(
tf.reduce_sum(tf.abs(self.weights[-1])) + tf.reduce_sum(tf.abs(self.bias[-1]))
)
self.kl.append(
tf.trace(self.sig[-1]) + tf.matmul(tf.transpose(self.mu[-1]), self.mu[-1]) - l_neu*l_in -
tf.linalg.slogdet(self.sig[-1])[1] + tf.matmul(self.bias[-1], tf.transpose(self.bias[-1]))
)
if n == 0:
self.hidden.append(
activation(tf.matmul(tf.stack([self.x]*n_mc, axis=0), self.weights_mb[-1]) + self.bias[0])
) # input layer needs tiling self.x
elif n == len(neu_list) - 1:
self.hidden.append(
tf.sigmoid(tf.matmul(self.hidden[-1], self.weights_mb[-1]) + self.bias[-1])
) # last layer sigmoid
else:
self.hidden.append(
activation(tf.matmul(self.hidden[-1], self.weights_mb[-1]) + self.bias[-1])
) # hidden layers we use activation specified
self.output_p = tf.reduce_mean(self.hidden[-1], axis=0) # [mb_size, 1], mc averaged
self.l_likelihood = tf.reduce_sum( # across MB components
tf.log(tf.maximum(self.guard,
tf.multiply(self.y, self.output_p) +
tf.multiply((tf.ones_like(self.y) - self.y),
(tf.ones_like(self.output_p) - self.output_p)))))
self.kl_sum = tf.add_n(self.kl)
self.l1_sum = tf.add_n(self.l1)
self.cost = - self.l_likelihood + 0.001*self.kl_sum
self.optimiser = tf.train.AdamOptimizer(learning_rate=learning_rate, beta1=0.1, beta2=0.1)
self.training_op = self.optimiser.minimize(self.cost)
# evaluating output
self.outputs_pn = tf.sign(self.output_p - tf.random_uniform(tf.shape(self.output_p), minval=0, maxval=1)) # {-1, +1}
self.output = tf.minimum(tf.maximum(self.outputs_pn, tf.constant(0.0, dtype=tf.float32)), tf.constant(1.0, dtype=tf.float32)) # {0, 1}
n_in = 4 # input size
n_mb = 3 # mini batch size
n_mc = 2000 # number of mc samples per input array
LAYER = 1 # which layer to print out in this experiment
# simple test run with toy data
with tf.Session() as sess:
mc_mlp = MCMLP(n_in, [5, 3, 1], n_mc, learning_rate=0.05)
sess.run(tf.global_variables_initializer())
for n in range(200):
(mu_, tiled_mu_, bias_, sig_sqrt_, sig_,
eps_, weights_, weights_mb_, hidden_, kl_, output_p_,
l_likelihood_, kl_sum_, cost_,
training_op_,
output_) = sess.run([mc_mlp.mu[LAYER], mc_mlp.tiled_mu[LAYER], mc_mlp.bias[LAYER],
mc_mlp.sig_sqrt[LAYER], mc_mlp.sig[LAYER],
mc_mlp.eps[LAYER], mc_mlp.weights[LAYER],
mc_mlp.weights_mb[LAYER], mc_mlp.hidden[LAYER],
mc_mlp.kl[LAYER],
mc_mlp.output_p, mc_mlp.l_likelihood,
mc_mlp.kl_sum, mc_mlp.cost,
mc_mlp.training_op,
mc_mlp.output],
{mc_mlp.x: np.array([[1, 1, 1, 1],
[2, 2, 2, 2],
[3, 3, 3, 3]],dtype=np.float32),
mc_mlp.y: np.array([[1], [0], [1]], dtype=np.float32)})
# print('likelihood: {}, KL term: {}, Overall cost: {}'.format(l_likelihood_, kl_sum_, cost_))
# evaluate
x_test_arr = np.array([[1, 1, 1, 1],
[2, 2, 2, 2],
[3, 3, 3, 3]],dtype=np.float32)
y_test_arr = np.array([[1], [0], [1]], dtype=np.float32)
test_output_ = sess.run(mc_mlp.output, {mc_mlp.x: x_test_arr})
(precision, recall,
fbeta_score, split) = precision_recall_fscore_support(y_test_arr, test_output_)
print('precision: {}'.format(precision))
print('recall: {}'.format(recall))
print('fbeta_score: {}'.format(fbeta_score))
print('test data split: {}'.format(split))
precision: [1. 1.]
recall: [1. 1.]
fbeta_score: [1. 1.]
test data split: [1 2]
# let's load a dataset for classification:
data = load_breast_cancer()
df = pd.DataFrame(data.data, columns = data.feature_names)
num_features = len(data.feature_names)
# some preprocessing e.g. taking log
for each_col in data.feature_names:
df[each_col] = np.log(np.maximum(df[each_col], 10**-6))
df['target'] = data.target
print('processed data')
display(df.describe())
# balance the class and do train/test split
df = pd.concat([df[df.target == 0].iloc[:212], df[df.target == 1].iloc[:212]])
x_train, x_test, y_train, y_test = \
train_test_split(df[data.feature_names], df.target, test_size=0.3, random_state=0, stratify=df.target)
# balanced?
y_train.describe()
processed data
|  | mean radius | mean texture | mean perimeter | mean area | mean smoothness | mean compactness | mean concavity | mean concave points | mean symmetry | mean fractal dimension | radius error | texture error | perimeter error | area error | smoothness error | compactness error | concavity error | concave points error | symmetry error | fractal dimension error | worst radius | worst texture | worst perimeter | worst area | worst smoothness | worst compactness | worst concavity | worst concave points | worst symmetry | worst fractal dimension | target |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 569.000000 | 569.000000 | 569.000000 | 569.000000 | 569.000000 | 569.000000 | 569.000000 | 569.000000 | 569.000000 | 569.000000 | 569.000000 | 569.000000 | 569.000000 | 569.000000 | 569.000000 | 569.000000 | 569.000000 | 569.000000 | 569.000000 | 569.000000 | 569.000000 | 569.000000 | 569.000000 | 569.000000 | 569.000000 | 569.000000 | 569.000000 | 569.000000 | 569.000000 | 569.000000 | 569.000000 |
mean | 2.619131 | 2.935269 | 4.489174 | 6.363185 | -2.350210 | -2.380518 | -3.092911 | -3.553578 | -1.719430 | -2.773718 | -1.065554 | 0.104874 | 0.889575 | 3.379864 | -5.028600 | -3.879864 | -3.958432 | -4.741481 | -3.948656 | -5.728233 | 2.749578 | 3.217009 | 4.631289 | 6.615811 | -2.037005 | -1.550372 | -1.908423 | -2.582310 | -1.258202 | -2.497773 | 0.627417 |
std | 0.238189 | 0.220789 | 0.251084 | 0.483139 | 0.145572 | 0.494459 | 1.955770 | 1.784399 | 0.148229 | 0.106867 | 0.542183 | 0.426717 | 0.540019 | 0.728101 | 0.370586 | 0.650584 | 1.722883 | 1.468238 | 0.342024 | 0.526934 | 0.276438 | 0.240730 | 0.290892 | 0.554917 | 0.173086 | 0.617256 | 2.064994 | 1.832695 | 0.200010 | 0.195784 | 0.483918 |
min | 1.943192 | 2.273156 | 3.779405 | 4.966335 | -2.944469 | -3.943514 | -13.815511 | -13.815511 | -2.244316 | -2.996533 | -2.193731 | -1.021096 | -0.278392 | 1.917217 | -6.369509 | -6.095937 | -13.815511 | -13.815511 | -4.843174 | -7.018910 | 2.070653 | 2.486572 | 3.920190 | 5.221436 | -2.642684 | -3.601235 | -13.815511 | -13.815511 | -1.854699 | -2.899695 | 0.000000 |
25% | 2.459589 | 2.783158 | 4.319752 | 6.040969 | -2.449115 | -2.734600 | -3.521333 | -3.896642 | -1.820776 | -2.852498 | -1.459295 | -0.181642 | 0.473747 | 2.882004 | -5.265076 | -4.336671 | -4.193723 | -4.874619 | -4.189095 | -6.097714 | 2.565718 | 3.048325 | 4.432125 | 6.244749 | -2.149006 | -1.915963 | -2.167180 | -2.734446 | -1.384696 | -2.638617 | 0.000000 |
50% | 2.593013 | 2.935982 | 4.457134 | 6.311916 | -2.344762 | -2.379142 | -2.788068 | -3.396210 | -1.719253 | -2.788068 | -1.126395 | 0.102557 | 0.827241 | 3.199897 | -5.054587 | -3.889772 | -3.653898 | -4.516244 | -3.977629 | -5.748675 | 2.706048 | 3.235143 | 4.581492 | 6.531606 | -2.030270 | -1.551641 | -1.484128 | -2.303285 | -1.265139 | -2.525229 | 1.000000 |
75% | 2.758743 | 3.081910 | 4.645352 | 6.662749 | -2.250942 | -2.037149 | -2.034851 | -2.603690 | -1.631172 | -2.716284 | -0.736263 | 0.387980 | 1.211048 | 3.810876 | -4.810228 | -3.428055 | -3.168896 | -4.219228 | -3.751606 | -5.390871 | 2.933325 | 3.391820 | 4.831509 | 6.988413 | -1.924149 | -1.081460 | -0.959981 | -1.823870 | -1.146018 | -2.385098 | 1.000000 |
max | 3.336125 | 3.670715 | 5.239098 | 7.824446 | -1.811554 | -1.063052 | -0.851440 | -1.603456 | -1.190728 | -2.328518 | 1.055357 | 1.586169 | 3.090133 | 6.295635 | -3.469583 | -1.999522 | -0.926341 | -2.941434 | -2.538941 | -3.511906 | 3.584629 | 3.902780 | 5.526249 | 8.355615 | -1.502379 | 0.056380 | 0.224742 | -1.234432 | -0.409774 | -1.572624 | 1.000000 |
count 296.000000
mean 0.500000
std 0.500847
min 0.000000
25% 0.000000
50% 0.500000
75% 1.000000
max 1.000000
Name: target, dtype: float64
# let's feed the data into a small network:
tf.reset_default_graph()
n_neurons = [8, 5, 3, 1] # in each layer
n_in = len(data.feature_names) # input size
n_mb = 8 # mini batch size
n_mc = 2000 # number of mc samples per input array
n_epoch = 100
LAYER = 0 # which layer to print out in this experiment
y_train_mb = np.reshape(y_train.values, [-1, n_mb, 1])
x_train_mb = np.reshape(x_train.values, [-1, n_mb, n_in])
with tf.Session() as sess:
mc_mlp = MCMLP(n_in, n_neurons, n_mc, learning_rate=0.005, mc=True)
sess.run(tf.global_variables_initializer())
for n in range(n_epoch):
for each_x_mb, each_y_mb in zip(x_train_mb, y_train_mb):
(mu_, tiled_mu_, sig_sqrt_, sig_,
eps_, weights_, weights_mb_, hidden_, kl_, output_p_,
l_likelihood_, kl_sum_, cost_,
training_op_) = sess.run([mc_mlp.mu[LAYER], mc_mlp.tiled_mu[LAYER],
mc_mlp.sig_sqrt[LAYER], mc_mlp.sig[LAYER],
mc_mlp.eps[LAYER], mc_mlp.weights[LAYER],
mc_mlp.weights_mb[LAYER], mc_mlp.hidden[LAYER],
mc_mlp.kl[LAYER],
mc_mlp.output_p, mc_mlp.l_likelihood,
mc_mlp.kl_sum, mc_mlp.cost,
mc_mlp.training_op],
{mc_mlp.x: each_x_mb,
mc_mlp.y: each_y_mb})
# print(n, l_likelihood_, kl_sum_, cost_)
# evaluate
y_test_arr = y_test.values # [:8]
x_test_arr = x_test.values # [:8]
test_output_ = sess.run(mc_mlp.output, {mc_mlp.x: x_test_arr})
(precision, recall,
fbeta_score, split) = precision_recall_fscore_support(y_test_arr, test_output_)
print('precision: {}'.format(precision))
print('recall: {}'.format(recall))
print('fbeta_score: {}'.format(fbeta_score))
print('test data split: {}'.format(split))
precision: [0.88059701 0.91803279]
recall: [0.921875 0.875 ]
fbeta_score: [0.90076336 0.896 ]
test data split: [64 64]
References
Peter M. Lee, “Bayesian Statistics: An Introduction”, Hodder Arnold.
Yarin Gal, “Uncertainty in Deep Learning”, PhD thesis, University of Cambridge, 2016.
F. J. R. Ruiz, M. K. Titsias, D. M. Blei, “The Generalized Reparameterization Gradient”.