Welcome to the exciting world of Probabilistic Programming! This article is a gentle introduction to the field; all you need is a basic understanding of Deep Learning and Bayesian statistics.
By the end of this article, you should have a clear picture of the field, its applications, and how it differs from more traditional deep learning methods.
If, like me, you have heard of Bayesian Deep Learning and suspect it involves Bayesian statistics, but you don’t know exactly how it is used, you are in the right place.
One of the main limitations of traditional deep learning models is that, even though they are very powerful tools, they don’t provide a measure of their own uncertainty.
ChatGPT can state false information with blatant confidence, and classifiers output probabilities that are often poorly calibrated.
Uncertainty estimation is a crucial aspect of decision-making, especially in areas such as healthcare and self-driving cars. We want a model to be able to tell us when it is very unsure about whether a subject has brain cancer, so that the case can be referred to a medical expert for further diagnosis. Similarly, we want an autonomous car to be able to slow down when it encounters an environment it has never seen before.
To illustrate how badly a neural network can estimate risk, let’s look at a very simple classifier neural network with a softmax layer at the end.
The softmax has a very self-explanatory name: it is a soft max function, meaning it is a “smoother” version of the max function. If we had picked a “hard” max that simply takes the class with the highest score, the gradient flowing to all the other classes would be zero.
With a softmax, the probability of a class can be close to 1, but never exactly 1. And because the sum of probabilities of all classes is 1, there is still some gradient flowing to the other classes.
However, the softmax function also presents an issue: the probabilities it outputs are poorly calibrated. The exponential inside the softmax amplifies differences between the raw scores (logits), so even a modest gap between them is turned into a near-certain probability for the top class.
This often results in overconfidence, with the model giving high probabilities for certain classes even in the face of uncertainty, a characteristic inherent to the ‘max’ nature of the softmax function.
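As a quick illustration (a toy sketch with made-up logits, not an example from a real model), here is how a modest gap between raw scores already produces a near-certain prediction:

```python
import numpy as np

def softmax(logits):
    # Shift by the max for numerical stability, then exponentiate and normalize
    z = logits - np.max(logits)
    exp_z = np.exp(z)
    return exp_z / exp_z.sum()

# Hypothetical logits for a 3-class problem
print(softmax(np.array([3.0, 0.5, 0.2])))  # ~[0.88, 0.07, 0.05]
print(softmax(np.array([6.0, 1.0, 0.4])))  # ~[0.99, 0.01, 0.00], already "certain"
```

Nothing in these numbers tells us how reliable that 0.99 actually is; the softmax happily produces it either way.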
Comparing a traditional Neural Network (NN) with a Bayesian Neural Network (BNN) can highlight the importance of uncertainty estimation. A BNN’s certainty is high when it encounters familiar distributions from training data, but as we move away from known distributions, the uncertainty increases, providing a more realistic estimation.
Here is what an estimation of uncertainty can look like:
You can see that when we are close to the distribution we have observed during training, the model is very certain, but as we move farther from the known distribution, the uncertainty increases.
There is one central theorem to know in Bayesian statistics: Bayes’ Theorem.
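Written with θ for the parameters and (X, Y) for the observations, to match the list below, it reads:

$$
\underbrace{P(\theta \mid X, Y)}_{\text{posterior}} \;=\; \frac{\overbrace{P(Y \mid X, \theta)}^{\text{likelihood}} \;\; \overbrace{P(\theta)}^{\text{prior}}}{\underbrace{P(Y \mid X)}_{\text{marginal likelihood}}}
$$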
- The prior is the distribution we believe theta follows before making any observation. For a coin toss, for example, we could assume that the probability p of getting heads follows a Gaussian centered around p = 0.5.
- If we want to introduce as little inductive bias as possible, we could instead say that p is uniform on [0, 1].
- The likelihood is, given a parameter theta, how likely it is that we obtained our observations X, Y.
- The marginal likelihood is the likelihood integrated over all possible values of theta. It is called “marginal” because theta has been marginalized out: the likelihood is averaged over the whole prior distribution of theta.
The key idea to understand in Bayesian statistics is that you start from a prior: your best guess of what the parameter could be, expressed as a distribution. Then, with the observations you make, you adjust that guess and obtain a posterior distribution.
Note that the prior and the posterior are not point estimates of theta but full probability distributions.
To illustrate this:
On this image, you can see that the prior is shifted to the right, but the likelihood pulls our estimate back to the left, and the posterior ends up somewhere in between.
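To make this concrete, here is a minimal sketch of the coin-toss example in code. It is my own toy illustration: I use the uniform prior on [0, 1] mentioned above (which is a Beta(1, 1) distribution) because it makes the posterior update a one-liner.

```python
import numpy as np
from scipy import stats

# Uniform prior on p (probability of heads) = Beta(1, 1)
alpha_prior, beta_prior = 1.0, 1.0

# Hypothetical observations: 10 coin tosses, 7 of them heads
heads, tails = 7, 3

# Conjugate update: the posterior is Beta(alpha + heads, beta + tails)
posterior = stats.beta(alpha_prior + heads, beta_prior + tails)

print(posterior.mean())          # ~0.67
print(posterior.interval(0.95))  # 95% credible interval for p
```

Just as in the picture, the posterior mean (~0.67) lands between the prior mean (0.5) and the observed frequency (0.7), and it is a full distribution over p, not a single number.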
Bayesian Deep Learning is an approach that marries two powerful mathematical theories: Bayesian statistics and Deep Learning.
The essential distinction from traditional Deep Learning resides in the treatment of the model’s weights:
In traditional Deep Learning, we train a model from scratch: we randomly initialize a set of weights and train the model until it converges to a final set of parameters. In the end, we learn a single set of weights.
Conversely, Bayesian Deep Learning adopts a more dynamic approach. We begin with a prior belief about the weights, often assuming they follow a normal distribution. As we expose our model to data, we adjust this belief, thus updating the posterior distribution of the weights. In essence, we learn a probability distribution over the weights, instead of a single set.
During inference, we average the predictions of all these models, weighting their contributions by the posterior: if a set of weights is highly probable, its corresponding prediction is given more weight.
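As a rough sketch (my own toy example, which assumes we already have samples from the posterior over the weights, for instance from variational inference or MCMC), this averaging is just a Monte Carlo mean of predictions:

```python
import numpy as np

rng = np.random.default_rng(0)

def predict(weights, x):
    # Toy one-layer "network": logistic regression with a sampled weight vector
    return 1.0 / (1.0 + np.exp(-(x @ weights)))

# Pretend these are samples theta_1, ..., theta_S drawn from the posterior over the weights
posterior_samples = [rng.normal(loc=1.0, scale=0.3, size=2) for _ in range(1000)]

x = np.array([0.5, -1.2])  # a new input
preds = np.array([predict(w, x) for w in posterior_samples])

print("predictive mean:", preds.mean())  # the averaged prediction
print("predictive std :", preds.std())   # a simple measure of uncertainty
```

Each sampled weight vector acts as one “model”; the spread of their predictions is what gives us the uncertainty estimate.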
Let’s formalize all of that:
Inference in Bayesian Deep Learning integrates over all potential values of theta (weights) using the posterior distribution.
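In the notation of Bayes’ Theorem above, for a new input x the predictive distribution averages the model’s prediction over the whole posterior:

$$
P(y \mid x, X, Y) \;=\; \int P(y \mid x, \theta)\, P(\theta \mid X, Y)\, d\theta
$$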
We can also see that in Bayesian statistics, integrals are everywhere. This is actually the principal limitation of the Bayesian framework: these integrals are often intractable (we rarely have a closed-form antiderivative of the posterior), so we have to rely on very computationally expensive approximations.
Advantage 1: Uncertainty estimation
- Arguably the most prominent benefit of Bayesian Deep Learning is its capacity for uncertainty estimation. In many domains including healthcare, autonomous driving, language models, computer vision, and quantitative finance, the ability to quantify uncertainty is crucial for making informed decisions and managing risk.
Advantage 2: Improved training efficiency
- Closely tied to the concept of uncertainty estimation is improved training efficiency. Since Bayesian models are aware of their own uncertainty, they can prioritize learning from data points where the uncertainty — and hence, potential for learning — is highest. This approach, known as Active Learning, leads to impressively effective and efficient training.
As demonstrated in the graph below, a Bayesian Neural Network using Active Learning achieves 98% accuracy with just 1,000 training images. In contrast, models that don’t exploit uncertainty estimation tend to learn at a slower pace.
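To give a flavour of how this works in practice (a simplified sketch of my own, not the experiment behind the graph), an active-learning loop scores the unlabeled pool by predictive uncertainty and asks for labels on the most uncertain points:

```python
import numpy as np

def predictive_entropy(prob_samples):
    # prob_samples: (num_posterior_samples, num_points, num_classes)
    mean_probs = prob_samples.mean(axis=0)  # average class probabilities over posterior samples
    return -(mean_probs * np.log(mean_probs + 1e-12)).sum(axis=1)  # entropy per point

# Hypothetical predictions from 100 posterior samples on a pool of 5,000 unlabeled images (10 classes)
rng = np.random.default_rng(0)
prob_samples = rng.dirichlet(np.ones(10), size=(100, 5000))

# Query the 64 points the model is most uncertain about and send them off for labeling
scores = predictive_entropy(prob_samples)
query_indices = np.argsort(scores)[-64:]
print(query_indices[:5])
```

The important part is that the acquisition score comes from the model’s own uncertainty, which a traditional network cannot provide in a calibrated way.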
Advantage 3: Inductive Bias
Another advantage of Bayesian Deep Learning is the effective use of inductive bias through priors. The priors allow us to encode our initial beliefs or assumptions about the model parameters, which can be particularly useful in scenarios where domain knowledge exists.
Consider generative AI, where the idea is to create new data (like medical images) that resemble the training data. For example, if you’re generating brain images, and you already know the general layout of a brain — white matter inside, grey matter outside — this knowledge can be included in your prior. This means you can assign a higher probability to the presence of white matter in the center of the image, and grey matter towards the sides.
In essence, Bayesian Deep Learning not only empowers models to learn from data but also enables them to start learning from a point of knowledge, rather than starting from scratch. This makes it a potent tool for a wide range of applications.
It seems that Bayesian Deep Learning is incredible! So why is the field so underrated? We constantly hear about Generative AI, ChatGPT, SAM, or more traditional neural networks, but we almost never hear about Bayesian Deep Learning. Why is that?
Limitation 1: Bayesian Deep Learning is slooooow
The key to understanding Bayesian Deep Learning is that we “average” the predictions of many models, and whenever there is an average, there is an integral over the set of parameters.
But this integral is often intractable: there is no closed or explicit form that makes it quick to compute. So we can’t compute it directly; we have to approximate it by sampling some points, and this makes inference very slow.
Imagine that for each data point x we have to average the predictions of 10,000 models, and that each prediction takes 1 second to run: that is almost 3 hours per data point, which simply does not scale to large amounts of data.
In most business cases we need fast and scalable inference, and this is why Bayesian Deep Learning is not so popular.
Limitation 2: Approximation Errors
In Bayesian Deep Learning, it’s often necessary to use approximate methods, such as Variational Inference, to compute the posterior distribution of weights. These approximations can lead to errors in the final model. The quality of the approximation depends on the choice of the variational family and the divergence measure, which can be challenging to choose and tune properly.
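For reference, variational inference replaces the true posterior with a simpler distribution q drawn from a chosen variational family, and fits it by maximizing the evidence lower bound (ELBO), which in the notation used earlier reads:

$$
\mathrm{ELBO}(q) \;=\; \mathbb{E}_{q(\theta)}\big[\log P(Y \mid X, \theta)\big] \;-\; \mathrm{KL}\big(q(\theta)\,\|\,P(\theta)\big)
$$

If the family is too simple or the divergence is a poor fit for the problem, the gap between q and the true posterior becomes the approximation error discussed above.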
Limitation 3: Increased Model Complexity and Interpretability
While Bayesian methods offer improved measures of uncertainty, this comes at the cost of increased model complexity. BNNs can be difficult to interpret because instead of a single set of weights, we now have a distribution over possible weights. This complexity might lead to challenges in explaining the model’s decisions, especially in fields where interpretability is key.
There is a growing interest in XAI (Explainable AI). Traditional deep neural networks are already challenging to interpret because it is difficult to make sense of their weights; Bayesian Deep Learning is even more challenging.
Whether you have feedback, ideas to share, wanna work with me, or simply want to say hello, please fill out the form below, and let’s start a conversation.
Say Hello 🌿
Don’t hesitate to leave a clap or follow me for more!