Measure Theory for ML, AI and Diffusion Models
A summary of measure theory required for machine learning and artificial intelligence.
Table of Contents
- Randomness, Mathematics, and Intuition
- Sigma Algebras
- Measures
- Random Variables
- Comparing Random Variables
- Topology
- Integration
- Expectations
- Lebesgue Measure
- Density Functions
- Change of Variables and Jacobians
- Markov Kernels
- Conditional Expectations
- Stochastic Processes
Introduction
In this post, I will cover enough measure theory for the reader to understand the theory behind Denoising Diffusion Models, and more generally Machine Learning and Artificial Intelligence. Understanding measure theory in full requires a lot of background mathematics; that is inevitable, but I will try my best to make things intuitive yet precise, while assuming very few prerequisites. Most importantly, I will always aim to show examples from Probability, Statistics, or Machine Learning, so as to keep things on theme. After this post is complete, the plan is to write another one about Stochastic Calculus and SDEs for Machine Learning and Diffusion Models.
If you find any mistake or typo, or for anything else, don’t hesitate to contact me.
Randomness, Mathematics, and Intuition
In everyday language, we sometimes use the expression “the probability of/that”. For instance,
The probability of rolling a six is $\frac{1}{6}$.
I have emphasized in three different colors the key components of that sentence. Measure theory can be used to define each of those three terms rigorously. Specifically, it can answer these questions:
- What is a probability? (orange and magenta)
- What “things” can have an associated probability? (blue)
All we need are sets and functions.
What this is all about
Until we have built enough measure theory to talk about machine learning, I will often return to a very simple scenario: a person rolling a six-sided die and observing which number comes out on top. This is a classic and straightforward scenario in probability theory, but it will allow me to talk about various notions in measure theory without getting lost in the details. I will assume:
- The die will always land with exactly one face up, meaning that it will never get stuck in any crack or height difference. Imagine the die is rolled on an infinitely long and perfectly flat surface.
- The person will always be able to observe which number comes out on top.
Basically, I am making this super easy: the person rolls the die and observes either $1$, $2$, $3$, $4$, $5$, or $6$.
Rolling a die and observing the number on top is just a specific “experiment”. The concept of an experiment is central to probability theory because whenever we talk about probabilities, we have in mind some sort of experiment, which will result in exactly one outcome, and we are interested in figuring out the probability of some of these outcomes. I say some because at times we are not interested in the probability of a single outcome, but in the probability of a collection of outcomes. For instance, we may be interested in the probability of rolling an even number. As we will see soon, we call an event the collection of all the “outcomes” that we are interested in.
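To make this concrete, here is a small Python sketch of the die-rolling experiment; the variable names are my own, illustrative choices. The sample space collects every possible outcome, and an event is a set of outcomes.

```python
# A sketch of the die-rolling experiment.
# omega is the sample space: all possible outcomes of one roll.
omega = {1, 2, 3, 4, 5, 6}

rolling_a_six = {6}                  # an event with a single outcome
rolling_an_even_number = {2, 4, 6}   # an event with several outcomes

# An outcome "realizes" an event exactly when it belongs to that set.
outcome = 4
print(outcome in rolling_an_even_number)  # True
print(outcome in rolling_a_six)           # False
```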
Intuition behind probabilities
A probability of something happening is a numeric value that tells us how likely that something is to happen (if this seems circular, that’s good, because it is). Ideally, we would like to have a maximum probability value for things that are certain and a minimum probability value for things that are impossible. We could choose any two values, but historically, mathematicians have settled for $1$ for certain things and $0$ for impossible things.
Intuition behind events
What “things” can we assign a probability to? Imagine that measure theory has not been developed yet and you find yourself wanting to write the statement above mathematically. What mathematical object can represent the words rolling a six? A first attempt could be to set it to the number six, $6$. Now consider representing the statement
Rolling an even number (i.e. either 2, 4, or 6)
using this convention. There are infinitely many natural numbers, so one could potentially encode this and many more complicated expressions with larger and larger numbers. For instance, rolling an even number could be represented by packing the outcomes $2$, $4$, and $6$ into a single, larger number.
After attempting to represent “things” with numbers, one may switch to sets. Consider again the task of rolling a six-sided die and observing which number comes out on top. We assume that the die cannot get stuck anywhere and always lands with exactly one face up, and that we can always observe the number on top. The possible outcomes of this task are that we observe $1$, $2$, $3$, $4$, $5$, or $6$.
If we can talk about the probability of “something” happening, that something is known as an event.
Something that may come as a shock is that not every set is an event. There are sets to which we cannot assign a “probability”; we will talk more about that later.
Intuition behind relationship of probabilities and events
Okay, we can represent events as sets, and it seems to be a sensible choice. Now, what does it mean that a certain event has a probability? We need to define the probability function. As we expect, this should take sets as inputs and return a value between $0$ and $1$.
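As a sketch of that idea, assuming a fair die, such a probability function can be written as a set function in Python (the names here are illustrative, not from any library):

```python
from fractions import Fraction

omega = {1, 2, 3, 4, 5, 6}  # sample space of a fair six-sided die

def prob(event):
    """Probability of an event under a fair die: |event| / |omega|."""
    return Fraction(len(event & omega), len(omega))

print(prob({6}))        # 1/6
print(prob({2, 4, 6}))  # 1/2
print(prob(omega))      # 1  (certain)
print(prob(set()))      # 0  (impossible)
```

Note that `prob` maps sets (events) to numbers in $[0, 1]$, with the certain event getting $1$ and the impossible event getting $0$, exactly as the intuition above demands.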
Measures
Measure Space: A triplet $(\Omega, \mathcal{F}, \mu)$ is called a measure space if $\mu$ is a measure on the measurable space $(\Omega, \mathcal{F})$.
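As a minimal illustration of my own (not from the post's main example), the counting measure on a small finite set satisfies the measure axioms, and on a finite space this can be checked exhaustively:

```python
# A sketch of a finite measure space: Omega is a small set, the sigma
# algebra is its full power set, and mu is the counting measure, which
# assigns to each set its number of elements.
from itertools import chain, combinations

omega = {1, 2, 3}
subsets = [set(c) for c in chain.from_iterable(
    combinations(sorted(omega), r) for r in range(len(omega) + 1))]

def mu(a):
    return len(a)  # counting measure

# Measure axioms on this finite space: mu(empty set) = 0, and
# additivity over disjoint sets.
assert mu(set()) == 0
for a in subsets:
    for b in subsets:
        if a.isdisjoint(b):
            assert mu(a | b) == mu(a) + mu(b)
```

The fair-die probability from earlier is just the counting measure divided by $|\Omega| = 6$, which is what makes it a probability measure rather than a general measure.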
Integration
A few comments are helpful to understand these spaces:
Markov Kernels
Markov kernels have various properties. The first one we explore is that they can be composed.
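On a finite state space, a Markov kernel can be represented as a row-stochastic matrix, and composing two kernels (the Chapman–Kolmogorov relation) is then just matrix multiplication. Here is a small Python sketch with made-up numbers:

```python
# Entry K[x][y] is the probability of moving from state x to state y.
# The two kernels below are illustrative two-state examples.
K1 = [[0.9, 0.1],
      [0.2, 0.8]]
K2 = [[0.5, 0.5],
      [0.3, 0.7]]

def compose(K, L):
    """(K . L)[x][y] = sum over z of K[x][z] * L[z][y]."""
    n = len(K)
    return [[sum(K[x][z] * L[z][y] for z in range(n)) for y in range(n)]
            for x in range(n)]

K12 = compose(K1, K2)

# The composite is again a Markov kernel: each row still sums to 1.
for row in K12:
    assert abs(sum(row) - 1.0) < 1e-12
```

Reading the composite: `K12[x][y]` is the probability of ending in state `y` after first transitioning according to `K1` and then according to `K2`, summed over all intermediate states.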
$\pi$-Systems and $\lambda$-Systems
Of course, since
- $\pi$-$\lambda$ theorem: Let $\Omega$ be a non-empty set, and let $\mathcal{P}$ and $\mathcal{L}$ be a $\pi$-system and a $\lambda$-system over $\Omega$ respectively, with $\mathcal{P} \subseteq \mathcal{L}$; then $\sigma(\mathcal{P}) \subseteq \mathcal{L}$.
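As an illustrative finite check (the sets and names here are my own), the defining closure property of a $\pi$-system, being closed under pairwise intersection, can be verified directly:

```python
# A candidate pi-system over a small sample space: a collection of
# subsets that should be closed under pairwise intersection.
omega = frozenset({1, 2, 3, 4})
P = {frozenset({1, 2}), frozenset({2, 3}), frozenset({2})}

# Check closure under intersection: a & b must stay inside P
# for every pair of members.
is_pi = all((a & b) in P for a in P for b in P)
print(is_pi)  # True: every pairwise intersection is {2} or the set itself
```

Dropping `{2}` from `P` would break the property, since `{1, 2} & {2, 3} = {2}` would then fall outside the collection.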