Bayesian Inference on Neural Networks
We look at how to apply Bayesian inference to neural networks, focusing on variational approximate inference. Often when we train a model, we have a cost function constructed from, say, the difference between the network output and the desired output over the training set. The result of the optimisation is a minimised cost function at a single set of model parameters. Bayesian inference instead aims to obtain a distribution over the model parameters (generally by obtaining a posterior distribution).
The distribution of model parameters conditioned on the training data is known as the posterior. With Bayes' rule, we can write the model parameter posterior distribution in terms of the likelihood produced by the model, a prior distribution on the model parameters that captures what we think they should be, and the marginal likelihood. This is similar to the latent variable posterior in the Variational Autoencoder notebook, but very different in concept, as we train the model to get a distribution of the model parameters rather than a point estimate. When we have a distribution of parameters, we can tell how well trained the model is by looking at the variance of the parameter distribution: a high variance means there is a lot of uncertainty in the model, while a sharp mode indicates the model parameters are well determined. Note that a large model parameter variance does not always lead to a large output variance, because the parameter variance is transformed by the network when producing the output (for the same input, the output varies as we sample parameters from the posterior).
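As a toy illustration of that last point, consider a one-parameter "network" $y = wx$ with a hypothetical posterior over $w$ (all numbers below are made up): sampling parameters from the posterior and pushing them through the model gives a predictive distribution whose spread we can inspect.
# toy sketch (hypothetical posterior over a single weight w of a model y = w*x)
import numpy as np
rng = np.random.RandomState(0)
w_samples = rng.normal(loc=2.0, scale=0.5, size=10000)  # assumed posterior over w
x = 3.0                                                 # a fixed input
y_samples = w_samples * x                               # model output for each parameter sample
print('predictive mean: {:.2f}, predictive std: {:.2f}'.format(y_samples.mean(), y_samples.std()))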
Generally we can write the following relation between the posterior and the likelihood according to Bayes' formula:
$$p(\omega \vert X_{tr}, Y_{tr}) = \frac{p(Y_{tr} \vert X_{tr}, \omega)\, p(\omega)}{p(Y_{tr} \vert X_{tr})}$$
Here $\omega$ denotes the model parameters and $X_{tr}$ and $Y_{tr}$ are the respective inputs and outputs of the training examples of a task (e.g. classification). In some simple cases we can analytically calculate the posterior from the likelihood and the prior (see conjugate priors).
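As a quick illustration of such a conjugate case (a Bernoulli likelihood with a Beta prior, not used further in this notebook), the posterior is available in closed form: observing $k$ successes in $n$ trials gives
$$p(\theta \vert D) \propto \underbrace{\theta^{k}(1-\theta)^{n-k}}_{\text{likelihood}}\,\underbrace{\theta^{\alpha-1}(1-\theta)^{\beta-1}}_{\text{prior}} \;\;\Rightarrow\;\; \theta \vert D \sim \mathrm{Beta}(\alpha+k,\ \beta+n-k)$$
No such closed form exists for the neural network weights considered below.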
For neural network classification, the likelihood $p(Y_{tr} \vert X_{tr}, \omega)$ is the product (or, for the log likelihood, the sum) of the computed probabilities of the correct class over all training set examples (or over all examples within a minibatch).
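A minimal NumPy sketch of this, assuming a hypothetical array p_correct holding the probability the network assigns to the correct class of each minibatch example:
# minimal sketch: minibatch log likelihood as a sum of per-example log probabilities
import numpy as np
p_correct = np.array([0.9, 0.7, 0.95, 0.6])   # hypothetical per-example probabilities of the true class
log_likelihood = np.sum(np.log(p_correct))    # product of probabilities -> sum of logs
print(log_likelihood)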
So how do we train a Bayesian neural network?
For neural networks (e.g. classification), the likelihood function is computed from the softmax of the network outputs, and these outputs are computed by following the activation equations through a large number of network weights. Computing a closed-form expression for the posterior from this likelihood is therefore intractable in any practical case (a similar situation to variational autoencoders). The marginal likelihood is, as in the general case, very hard to compute if we try to marginalise all the weights out of the likelihood. One way to tackle this is to use Variational Inference, i.e. assume a manageable form of the posterior (called the variational distribution). This manageable form is parameterised by $\theta$, and we use optimisation to find the optimal parameter. This is best illustrated with an example. Suppose we define a variational distribution to approximate the true posterior:
$$q(\omega \vert \theta) \approx p(\omega \vert X_{tr}, Y_{tr}) = \frac{p(Y_{tr} \vert X_{tr}, \omega)\, p(\omega)}{Z}$$
We are writing the marginal likelihood as $Z=p(Y_{tr} \vert X_{tr})$ for simplicity, and $\theta$ is the parameter of the variational distribution $q(\omega \vert \theta)$ that we can tune to achieve a better approximation of the posterior. To measure how good the approximation is, we can use the KL divergence between the variational distribution and the true posterior:
$$KL\left( q(\omega \vert \theta) \vert\vert p(\omega \vert X_{tr}, Y_{tr}) \right) = KL\left( q(\omega \vert \theta) \vert\vert p(\omega) \right) - \mathbb{E}_{q(\omega \vert \theta)}\left[ \log p(Y_{tr} \vert X_{tr}, \omega) \right] + \log Z$$
The left term is the KL divergence between the variational distribution and the prior, whereas the right term is the expectation of the log likelihood and can be approximated by an MC sample mean, with samples of the network weights $\omega$ drawn from the variational distribution $q(\omega \vert \theta)$. The $\log Z$ term does not depend on $\theta$, so we can simply find the parameters $\theta$ that minimise the remaining expression (i.e. the parameters where the variational distribution is at its best approximation of the true posterior):
$$\theta^{*} = \underset{\theta}{\arg\min} \left[ KL\left( q(\omega \vert \theta) \vert\vert p(\omega) \right) - \mathbb{E}_{q(\omega \vert \theta)}\left[ \log p(Y_{tr} \vert X_{tr}, \omega) \right] \right]$$
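The expectation term is the piece we estimate by Monte Carlo. A minimal sketch, assuming a stand-in log_likelihood function and made-up variational parameters mu and Sigma:
# minimal sketch of the MC estimate of the expected log likelihood under q(w|theta)
import numpy as np
rng = np.random.RandomState(0)
def log_likelihood(w):                        # stand-in for the network log likelihood on a minibatch
    return -0.5 * np.sum(w ** 2)
mu = np.zeros(3)                              # assumed variational mean
Sigma = 0.01 * np.eye(3)                      # assumed variational covariance
w_samples = rng.multivariate_normal(mu, Sigma, size=1000)      # draws from q(w|theta)
mc_estimate = np.mean([log_likelihood(w) for w in w_samples])  # sample mean approximates the expectation
print(mc_estimate)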
Under this condition, we maximise the likelihood together with a regularisation term that originates from $KL \left( q(\omega|\theta) \vert\vert p(\omega) \right)$. For the neural network we can choose multivariate Gaussians as the variational distribution and the prior, i.e. $p(\omega) = \mathcal{N}(0, \mathcal{I})$ and $q(\omega|\theta) = \mathcal{N}(\mu, \Sigma)$, with $\theta=(\mu, \Sigma)$ being the mean and covariance matrix parameters. The KL divergence between these two multivariate Gaussians is given by:
$$KL\left( \mathcal{N}(\mu, \Sigma) \vert\vert \mathcal{N}(0, \mathcal{I}) \right) = \frac{1}{2}\left[ \mathrm{tr}(\Sigma) + \mu^{T}\mu - k - \log\det\Sigma \right]$$
where $k$ is the dimensionality of $\omega$.
This includes “L2 regularisation” terms on the mean vector and on the diagonal elements of the covariance matrix of the variational distribution. This, combined with the log likelihood, is the cost function we are going to feed into the optimisation routine. How do we compute the gradient of the cost with respect to our parameters $\theta=(\mu, \Sigma)$? After all, we generate some random weights, use these weights to compute the average log likelihood, and expect to compute the gradient of the log likelihood w.r.t. the weights. We can do this easily if the random weights are generated using a parameter-free generator and some linear algebra (known as the “re-parameterisation” trick), where each sample of the weights of layer $i$ is expressed as:
$$\omega_{i} = \mu_{i} + L_{i}\epsilon, \qquad \epsilon \sim \mathcal{N}(0, \mathcal{I}), \qquad \Sigma_{i} = L_{i}L_{i}^{T}$$
This is a standard way of generating a general multivariate Gaussian by transforming samples from an isotropic, unit-variance, zero-mean Gaussian distribution. The mean of the transformed variables is given by $\mu_i$ and the covariance matrix is given by $L_iL_i^T$. In the code we force $L_i$ to be a lower triangular matrix so that we obtain a positive definite covariance matrix. The resulting weights are thus differentiable w.r.t. the parameters $\mu_i$ and $\Sigma_i$ (through $L_i$). We have also separated out each neural network layer $i$, implicitly setting the network weights of different layers to be independent of each other. In the objective function above the regularisation term is derived with respect to a single prior $p(\omega)$ and a single variational distribution $q(\omega \vert \theta)$. If we assume the combined variational distribution of the weights of all neural network layers is also Gaussian, the overall variational distribution is characterised by a block-diagonal covariance:
$$q(\omega \vert \theta) = \mathcal{N}\left( \begin{bmatrix} \mu_{1} \\ \vdots \\ \mu_{n} \end{bmatrix}, \begin{bmatrix} \Sigma_{1} & & \\ & \ddots & \\ & & \Sigma_{n} \end{bmatrix} \right)$$
The determinant of a block-diagonal matrix is given by the product of the determinants of the individual blocks, so the log determinant required in the regularisation term above can be computed as the sum of the log determinants of the individual blocks. The overall regularisation term $KL\left( q(\omega \vert \theta) \vert\vert p(\omega) \right)$ for the entire network is thus simply the sum of the regularisation terms of each layer.
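A minimal NumPy sketch of these pieces for two hypothetical layers (layer sizes and initial values are made up; the class below folds the 1/2 factor and other constants into the regularisation weight): reparameterised weight samples, the per-layer KL term, and the block-diagonal log-determinant identity.
# minimal sketch: reparameterised sampling, per-layer KL to N(0, I), block-diagonal log det
import numpy as np
from scipy.linalg import block_diag
rng = np.random.RandomState(0)

def sample_weights(mu, L, n_mc, rng):
    # w = mu + L @ eps with eps ~ N(0, I); each column is one MC sample, cov(w) = L @ L.T
    eps = rng.randn(L.shape[0], n_mc)
    return mu[:, None] + L @ eps

def kl_to_standard_normal(mu, L):
    # KL( N(mu, L L^T) || N(0, I) ) = 0.5 * (tr(Sigma) + mu^T mu - k - log det Sigma)
    sigma = L @ L.T
    _, logdet = np.linalg.slogdet(sigma)
    return 0.5 * (np.trace(sigma) + mu @ mu - mu.shape[0] - logdet)

layer_dims = [4 * 3, 3 * 1]                                      # e.g. a 4->3 layer and a 3->1 layer, weights flattened
mus = [rng.randn(d) for d in layer_dims]
Ls = [np.tril(rng.rand(d, d)) + np.eye(d) for d in layer_dims]   # lower triangular -> positive definite covariance

print(sample_weights(mus[0], Ls[0], n_mc=5, rng=rng).shape)      # (12, 5): one column per MC sample
print('total KL =', sum(kl_to_standard_normal(m, L) for m, L in zip(mus, Ls)))

# log det of the block-diagonal covariance equals the sum of the per-layer log dets
full_sigma = block_diag(*[L @ L.T for L in Ls])
print(np.isclose(np.linalg.slogdet(full_sigma)[1],
                 sum(np.linalg.slogdet(L @ L.T)[1] for L in Ls)))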
Note: adding noise to neural network weights is connected to the variational Bayes approximation of the weight posterior of a neural network. The difference is that with a variational Bayesian approach we optimise the parameters of the weights’ distribution rather than the weights themselves, whereas when adding noise with a fixed noise parameter we are operating on a sub-optimal variational distribution. In other words, with variational Bayes the noise parameter is also included in the optimisation process.
We train a multilayer perceptron with Gaussian weights below using TensorFlow. Because each gradient update is calculated from stochastic weight samples, the network is substantially harder to train than one with deterministic weights. As with any other MC-sampling-based gradient descent optimisation, the learning rate, the regularisation-to-likelihood ratio, the initialisation, the number of MC samples and the input value range are all critical for successful network training. Some literature advocates a single sample per minibatch element; this may be data/network dependent. More reliable ways to tackle gradient noise (e.g. variance reduction) are reported in the literature. Here we use a large number of MC samples to smooth out the gradient updates.
# simple multilayer perceptron with gaussian variational distribution
# wrapping all network graphs in a class
import tensorflow as tf
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import normalize
from sklearn.metrics import precision_recall_fscore_support
import pandas as pd
pd.set_option('display.max_columns', None)
pd.set_option('display.max_colwidth', -1)
tf.reset_default_graph()
class MCMLP:
"""
multi layer perceptron with random weights - for variational bayes training
variational distribution of network weights are gaussian with isotropic gaussian priors
tuning - set eps to zero, see it converge, then solve the variance problem separately
- very sensitive to learning rate in MC mode.
- initialisation is crucial
- training algorithm must have some "averaging" and "momentum" feature
- inputs best to be scaled to a common range across features.
"""
def __init__(self, n_in, neu_list, n_mc, activation=tf.nn.elu, guard=10**-6,
learning_rate=0.01, mc=True):
"""
constructs the graph of MLP
n_in: input size
neu_list: list dictating how many neurons for each layer
n_mc: number of mc samples for each run
activation: tensorflow activation functions
e.g. tf.nn.elu, tf.nn.sigmoid
for hidden layers only. output layer uses sigmoid
mc: bool
do we have stochastic weights? disable to fine tune network parameters
with noise free gradients
"""
assert len(neu_list) > 1
self.n_in = n_in
self.neu_list = neu_list
self.in_list = [n_in] + neu_list[:-1]
self.n_mc = n_mc
self.guard = tf.constant(guard, dtype=tf.float32)
# input/output variables
self.x = tf.placeholder(dtype=tf.float32, shape=(None, n_in))
self.y = tf.placeholder(dtype=tf.float32, shape=(None, 1))
# holding variables of all layers in lists
self.mu = [] # means
self.tiled_mu = [] # tiled mean for convenience
self.sig_sqrt = [] # square root of covariance matrices
self.sig = [] # covariance matrices
self.eps = [] # random ~N(0, 1) for mc sampling
self.weights = [] # actual weights for MLP
self.weights_mb = [] # transposed weights for mb
self.hidden = [] # output of each layer
self.kl = [] # KL(variational_dist||prior) = KL( N(mu,sig) || N(0,I) ) of each layer weights
self.l1 = [] # l1 cost of weights and bias
self.bias = [] # bias of neurons. non-stochastic
for n, (l_in, l_neu) in enumerate(zip(self.in_list, self.neu_list)):
self.mu.append(
tf.Variable(initial_value=tf.random_normal(shape=(l_neu*l_in, 1)))
)
self.sig_sqrt.append(
tf.matrix_band_part(
tf.Variable(initial_value=tf.random_uniform(shape=(l_neu*l_in, l_neu*l_in),
minval=10**-6, maxval=1.0)),
-1, 0)
)
self.sig.append(
tf.matmul(self.sig_sqrt[-1], tf.transpose(self.sig_sqrt[-1]))
)
if mc is True:
self.eps.append(tf.random_normal(shape=(l_neu*l_in, n_mc)))
else:
self.eps.append(tf.zeros(shape=(l_neu*l_in, n_mc))) # set eps to zero -> check if we have variance issues in MC
self.tiled_mu.append(
tf.matmul(self.mu[-1], tf.ones(shape=(1, n_mc)))
)
self.weights.append(
self.tiled_mu[-1] + tf.matmul(self.sig_sqrt[-1], self.eps[-1])
)
self.weights_mb.append(
tf.reshape(tf.transpose(self.weights[-1], [1, 0]), [n_mc, l_in, -1])
)
self.bias.append(
tf.Variable(initial_value=tf.random_normal(shape=(1, l_neu)))
)
self.l1.append(
tf.reduce_sum(tf.abs(self.weights[-1])) + tf.reduce_sum(tf.abs(self.bias[-1]))
)
self.kl.append(
tf.trace(self.sig[-1]) + tf.matmul(tf.transpose(self.mu[-1]), self.mu[-1]) - l_neu*l_in -
tf.linalg.slogdet(self.sig[-1])[1] + tf.matmul(self.bias[-1], tf.transpose(self.bias[-1]))
)
if n == 0:
self.hidden.append(
activation(tf.matmul(tf.stack([self.x]*n_mc, axis=0), self.weights_mb[-1]) + self.bias[0])
) # input layer needs tiling self.x
elif n == len(neu_list) - 1:
self.hidden.append(
tf.sigmoid(tf.matmul(self.hidden[-1], self.weights_mb[-1]) + self.bias[-1])
) # last layer sigmoid
else:
self.hidden.append(
activation(tf.matmul(self.hidden[-1], self.weights_mb[-1]) + self.bias[-1])
) # hidden layers we use activation specified
self.output_p = tf.reduce_mean(self.hidden[-1], axis=0) # [mb_size, 1], mc averaged
self.l_likelihood = tf.reduce_sum( # across MB components
tf.log(tf.maximum(self.guard,
tf.multiply(self.y, self.output_p) +
tf.multiply((tf.ones_like(self.y) - self.y),
(tf.ones_like(self.output_p) - self.output_p)))))
self.kl_sum = tf.add_n(self.kl)
self.l1_sum = tf.add_n(self.l1)
self.cost = - self.l_likelihood + 0.001*self.kl_sum
self.optimiser = tf.train.AdamOptimizer(learning_rate=learning_rate, beta1=0.1, beta2=0.1)
self.training_op = self.optimiser.minimize(self.cost)
# evaluating output
self.outputs_pn = tf.sign(self.output_p - tf.random_uniform(tf.shape(self.output_p), minval=0, maxval=1)) # {-1, +1}
self.output = tf.minimum(tf.maximum(self.outputs_pn, tf.constant(0.0, dtype=tf.float32)), tf.constant(1.0, dtype=tf.float32)) # {0, 1}
n_in = 4 # input size
n_mb = 3 # mini batch size
n_mc = 2000 # number of mc samples per input array
LAYER = 1 # which layer to print out in this experiment
# simple test run with toy data
with tf.Session() as sess:
mc_mlp = MCMLP(n_in, [5, 3, 1], n_mc, learning_rate=0.05)
sess.run(tf.global_variables_initializer())
for n in range(200):
(mu_, tiled_mu_, bias_, sig_sqrt_, sig_,
eps_, weights_, weights_mb_, hidden_, kl_, output_p_,
l_likelihood_, kl_sum_, cost_,
training_op_,
output_) = sess.run([mc_mlp.mu[LAYER], mc_mlp.tiled_mu[LAYER], mc_mlp.bias[LAYER],
mc_mlp.sig_sqrt[LAYER], mc_mlp.sig[LAYER],
mc_mlp.eps[LAYER], mc_mlp.weights[LAYER],
mc_mlp.weights_mb[LAYER], mc_mlp.hidden[LAYER],
mc_mlp.kl[LAYER],
mc_mlp.output_p, mc_mlp.l_likelihood,
mc_mlp.kl_sum, mc_mlp.cost,
mc_mlp.training_op,
mc_mlp.output],
{mc_mlp.x: np.array([[1, 1, 1, 1],
[2, 2, 2, 2],
[3, 3, 3, 3]],dtype=np.float32),
mc_mlp.y: np.array([[1], [0], [1]], dtype=np.float32)})
# print('likelihood: {}, KL term: {}, Overall cost: {}'.format(l_likelihood_, kl_sum_, cost_))
# evaluate
x_test_arr = np.array([[1, 1, 1, 1],
[2, 2, 2, 2],
[3, 3, 3, 3]],dtype=np.float32)
y_test_arr = np.array([[1], [0], [1]], dtype=np.float32)
test_output_ = sess.run(mc_mlp.output, {mc_mlp.x: x_test_arr})
(precision, recall,
fbeta_score, split) = precision_recall_fscore_support(y_test_arr, test_output_)
print('precision: {}'.format(precision))
print('recall: {}'.format(recall))
print('fbeta_score: {}'.format(fbeta_score))
print('test data split: {}'.format(split))
precision: [1. 1.]
recall: [1. 1.]
fbeta_score: [1. 1.]
test data split: [1 2]
# let's load a dataset for classification:
data = load_breast_cancer()
df = pd.DataFrame(data.data, columns = data.feature_names)
num_features = len(data.feature_names)
# some preprocessing e.g. taking log
for each_col in data.feature_names:
df[each_col] = np.log(np.maximum(df[each_col], 10**-6))
df['target'] = data.target
print('processed data')
display(df.describe())
# balance the class and do train/test split
df = pd.concat([df[df.target == 0].iloc[:212], df[df.target == 1].iloc[:212]])
x_train, x_test, y_train, y_test = \
train_test_split(df[data.feature_names], df.target, test_size=0.3, random_state=0, stratify=df.target)
# balanced?
y_train.describe()
processed data
|  | mean radius | mean texture | mean perimeter | mean area | mean smoothness | mean compactness | mean concavity | mean concave points | mean symmetry | mean fractal dimension | radius error | texture error | perimeter error | area error | smoothness error | compactness error | concavity error | concave points error | symmetry error | fractal dimension error | worst radius | worst texture | worst perimeter | worst area | worst smoothness | worst compactness | worst concavity | worst concave points | worst symmetry | worst fractal dimension | target |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 569.000000 | 569.000000 | 569.000000 | 569.000000 | 569.000000 | 569.000000 | 569.000000 | 569.000000 | 569.000000 | 569.000000 | 569.000000 | 569.000000 | 569.000000 | 569.000000 | 569.000000 | 569.000000 | 569.000000 | 569.000000 | 569.000000 | 569.000000 | 569.000000 | 569.000000 | 569.000000 | 569.000000 | 569.000000 | 569.000000 | 569.000000 | 569.000000 | 569.000000 | 569.000000 | 569.000000 |
mean | 2.619131 | 2.935269 | 4.489174 | 6.363185 | -2.350210 | -2.380518 | -3.092911 | -3.553578 | -1.719430 | -2.773718 | -1.065554 | 0.104874 | 0.889575 | 3.379864 | -5.028600 | -3.879864 | -3.958432 | -4.741481 | -3.948656 | -5.728233 | 2.749578 | 3.217009 | 4.631289 | 6.615811 | -2.037005 | -1.550372 | -1.908423 | -2.582310 | -1.258202 | -2.497773 | 0.627417 |
std | 0.238189 | 0.220789 | 0.251084 | 0.483139 | 0.145572 | 0.494459 | 1.955770 | 1.784399 | 0.148229 | 0.106867 | 0.542183 | 0.426717 | 0.540019 | 0.728101 | 0.370586 | 0.650584 | 1.722883 | 1.468238 | 0.342024 | 0.526934 | 0.276438 | 0.240730 | 0.290892 | 0.554917 | 0.173086 | 0.617256 | 2.064994 | 1.832695 | 0.200010 | 0.195784 | 0.483918 |
min | 1.943192 | 2.273156 | 3.779405 | 4.966335 | -2.944469 | -3.943514 | -13.815511 | -13.815511 | -2.244316 | -2.996533 | -2.193731 | -1.021096 | -0.278392 | 1.917217 | -6.369509 | -6.095937 | -13.815511 | -13.815511 | -4.843174 | -7.018910 | 2.070653 | 2.486572 | 3.920190 | 5.221436 | -2.642684 | -3.601235 | -13.815511 | -13.815511 | -1.854699 | -2.899695 | 0.000000 |
25% | 2.459589 | 2.783158 | 4.319752 | 6.040969 | -2.449115 | -2.734600 | -3.521333 | -3.896642 | -1.820776 | -2.852498 | -1.459295 | -0.181642 | 0.473747 | 2.882004 | -5.265076 | -4.336671 | -4.193723 | -4.874619 | -4.189095 | -6.097714 | 2.565718 | 3.048325 | 4.432125 | 6.244749 | -2.149006 | -1.915963 | -2.167180 | -2.734446 | -1.384696 | -2.638617 | 0.000000 |
50% | 2.593013 | 2.935982 | 4.457134 | 6.311916 | -2.344762 | -2.379142 | -2.788068 | -3.396210 | -1.719253 | -2.788068 | -1.126395 | 0.102557 | 0.827241 | 3.199897 | -5.054587 | -3.889772 | -3.653898 | -4.516244 | -3.977629 | -5.748675 | 2.706048 | 3.235143 | 4.581492 | 6.531606 | -2.030270 | -1.551641 | -1.484128 | -2.303285 | -1.265139 | -2.525229 | 1.000000 |
75% | 2.758743 | 3.081910 | 4.645352 | 6.662749 | -2.250942 | -2.037149 | -2.034851 | -2.603690 | -1.631172 | -2.716284 | -0.736263 | 0.387980 | 1.211048 | 3.810876 | -4.810228 | -3.428055 | -3.168896 | -4.219228 | -3.751606 | -5.390871 | 2.933325 | 3.391820 | 4.831509 | 6.988413 | -1.924149 | -1.081460 | -0.959981 | -1.823870 | -1.146018 | -2.385098 | 1.000000 |
max | 3.336125 | 3.670715 | 5.239098 | 7.824446 | -1.811554 | -1.063052 | -0.851440 | -1.603456 | -1.190728 | -2.328518 | 1.055357 | 1.586169 | 3.090133 | 6.295635 | -3.469583 | -1.999522 | -0.926341 | -2.941434 | -2.538941 | -3.511906 | 3.584629 | 3.902780 | 5.526249 | 8.355615 | -1.502379 | 0.056380 | 0.224742 | -1.234432 | -0.409774 | -1.572624 | 1.000000 |
count 296.000000
mean 0.500000
std 0.500847
min 0.000000
25% 0.000000
50% 0.500000
75% 1.000000
max 1.000000
Name: target, dtype: float64
# let's feed the data into a small network:
tf.reset_default_graph()
n_neurons = [8, 5, 3, 1] # in each layer
n_in = len(data.feature_names) # input size
n_mb = 8 # mini batch size
n_mc = 2000 # number of mc samples per input array
n_epoch = 100
LAYER = 0 # which layer to print out in this experiment
y_train_mb = np.reshape(y_train.values, [-1, n_mb, 1])
x_train_mb = np.reshape(x_train.values, [-1, n_mb, n_in])
with tf.Session() as sess:
mc_mlp = MCMLP(n_in, n_neurons, n_mc, learning_rate=0.005, mc=True)
sess.run(tf.global_variables_initializer())
for n in range(n_epoch):
for each_x_mb, each_y_mb in zip(x_train_mb, y_train_mb):
(mu_, tiled_mu_, sig_sqrt_, sig_,
eps_, weights_, weights_mb_, hidden_, kl_, output_p_,
l_likelihood_, kl_sum_, cost_,
training_op_) = sess.run([mc_mlp.mu[LAYER], mc_mlp.tiled_mu[LAYER],
mc_mlp.sig_sqrt[LAYER], mc_mlp.sig[LAYER],
mc_mlp.eps[LAYER], mc_mlp.weights[LAYER],
mc_mlp.weights_mb[LAYER], mc_mlp.hidden[LAYER],
mc_mlp.kl[LAYER],
mc_mlp.output_p, mc_mlp.l_likelihood,
mc_mlp.kl_sum, mc_mlp.cost,
mc_mlp.training_op],
{mc_mlp.x: each_x_mb,
mc_mlp.y: each_y_mb})
# print(n, l_likelihood_, kl_sum_, cost_)
# evaluate
y_test_arr = y_test.values # [:8]
x_test_arr = x_test.values # [:8]
test_output_ = sess.run(mc_mlp.output, {mc_mlp.x: x_test_arr})
(precision, recall,
fbeta_score, split) = precision_recall_fscore_support(y_test_arr, test_output_)
print('precision: {}'.format(precision))
print('recall: {}'.format(recall))
print('fbeta_score: {}'.format(fbeta_score))
print('test data split: {}'.format(split))
precision: [0.88059701 0.91803279]
recall: [0.921875 0.875 ]
fbeta_score: [0.90076336 0.896 ]
test data split: [64 64]
References
Peter M. Lee, “Bayesian Statistics: An Introduction”, Hodder Arnold.
Yarin Gal, “Uncertainty in Deep Learning”, PhD thesis, University of Cambridge, 2016.
F. J. R. Ruiz, M. K. Titsias, D. M. Blei, “The Generalized Reparameterization Gradient”.