Supervised Learning
Supervised learning is a fundamental type of machine learning where we learn from examples that have been labeled with their correct answers. The term “supervised” comes from the idea that some knowledgeable entity (the supervisor) has already gone through the data and provided the right answer for each example. This supervisor could be:
- Human experts manually labeling data (e.g., doctors marking which X-ray images show tumors)
- Automated systems collecting natural labels (e.g., whether a user clicked on an ad)
- Historical records with known outcomes (e.g., past house sales with their final prices)
- Physical measurements (e.g., sensors recording the actual temperature)
The key is that for each training example, we have both the input features and the known correct output. Our task is then to learn patterns from these labeled examples that will let us make accurate predictions on new, unseen cases. Let’s explore some concrete examples that illustrate this paradigm.
Email spam detection is perhaps one of the most ubiquitous applications of supervised learning that we interact with daily. Gmail’s spam filter, for instance, analyzes various features of incoming emails - the sender’s address, email subject, content words, HTML structure, and embedded links. These features serve as input X, while the output Y is a simple binary label: spam or not spam. The system learns from millions of emails that users have previously marked as spam or legitimate. This training data helps Gmail achieve remarkably high accuracy, with false positive rates below 0.1%. The filter continues to adapt as spammers evolve their tactics, learning from new examples of spam that users report.
House price prediction has become increasingly sophisticated with services like Zillow’s “Zestimate.” The system takes as input X detailed property characteristics including square footage, number of bedrooms, location (often down to the exact GPS coordinates), age of the house, recent renovations, and even local school ratings. The output Y is the predicted market value in dollars. Zillow trains its models on millions of actual home sales recorded in public databases. Their algorithms combine multiple prediction methods and regularly retrain on new sales data, achieving a median error rate of less than 2% in many major markets.
In medical diagnosis, systems like Stanford’s CheXNet analyze chest X-ray images (X) to detect various lung conditions (Y) like pneumonia, edema, and cardiomegaly. The model trains on large public datasets of labeled chest radiographs with clinically verified diagnoses. While not replacing human radiologists, these systems serve as valuable diagnostic aids, often matching or exceeding specialist-level accuracy.
Image classification has seen dramatic advances through deep learning, exemplified by systems like Google Photos. The input X consists of raw pixel values from digital images, often millions of pixels per image. The system predicts category labels Y such as “cat,” “dog,” “beach,” or “birthday party.” Modern image classifiers train on massive datasets like ImageNet, containing millions of labeled images. Google Photos can now recognize thousands of distinct objects and scenes with remarkable accuracy, even handling subtle categories like different dog breeds or architectural styles.
Text autocomplete has become remarkably sophisticated, as demonstrated by systems like Gmail’s Smart Compose and GitHub’s Copilot. These systems take as input X the sequence of words or characters the user has already typed, along with surrounding context. The output Y is a prediction of what the user will type next - from single words to entire sentences or code blocks. Training data comes from vast corpora of text: emails and documents for Smart Compose, or public code repositories for Copilot. These systems have become so effective that their suggestions are often indistinguishable from human-written text, though they can still make amusing mistakes that reveal their statistical nature.
In all these cases, we have some output Y that depends on the input X in complex ways. For example, in spam detection, whether an email is spam (Y) depends on subtle patterns in the text and metadata (X). In house price prediction, the market value (Y) emerges from a complex interplay of location, size, condition and market factors (X). In medical diagnosis, the presence of a disease (Y) manifests through intricate patterns in imaging data (X).
The key challenge is that we do not fully know, or cannot easily express in rules or formulas, the exact nature of this dependence between X and Y. Even domain experts who understand the field well - experienced doctors, real estate agents, or email security specialists - often cannot precisely articulate how they make their judgments. Their expertise is more intuitive than algorithmic.
That is precisely what we would like our machine learning software to help us with - to discover and encode these complex patterns automatically from examples, even when we cannot explicitly describe them ourselves. The software should extract the underlying regularities from the training data and use them to make predictions about new cases. In this chapter, we develop a basic mathematical and conceptual framework for thinking about this fundamental problem of learning from examples.
In general, the output Y does not depend only on the inputs X we observe. Many other factors may affect Y. For example, in medical diagnosis, a patient’s symptoms (X) do not fully determine their condition (Y) - unobserved genetic factors, environmental exposures, or lifestyle habits could all influence the outcome. In house price prediction, factors like the seller’s motivation or recent nearby sales that aren’t captured in our feature set X could impact the final sale price Y.
Beyond unobserved factors, there may be inherent randomness or stochasticity in the process generating Y. Two patients with identical observable symptoms might have different outcomes due to chance variations in how their bodies respond to the disease. Two identical houses might sell for different prices due to random fluctuations in market conditions or buyer preferences.
A complete analysis of supervised learning must account for both these sources of uncertainty: unobserved factors and inherent randomness. This leads naturally to a probabilistic framework where we model the relationship between X and Y as a conditional probability distribution P(Y|X) rather than a deterministic function.
However, to build intuition, let us start with the simpler case where Y is a deterministic but unknown function of X. This means that if we knew all the relevant factors and there was no inherent randomness, we could perfectly predict Y from X. While this assumption is rarely strictly true in practice, it provides a useful starting point for developing the core concepts of supervised learning.
The Basic Problem of Induction
Suppose we have a function f: X → Y. The only information we have about f is a training set consisting of input-output pairs (xi,f(xi)). Our goal is to predict the values of f(x) for arbitrary inputs x that are not in our training set. This is the core challenge of supervised learning: generalizing from known examples to make predictions about new cases.
In general, this task is impossible without making additional assumptions about the nature of f. To understand why, consider a simple gambling game where two coins are tossed in sequence, and based on the outcome, the player either wins or loses one rupee. Formally:
- X = {HH, HT, TH, TT} represents all possible sequences of two coin tosses
- Y = { − 1, 1} represents losing or winning one rupee
- f : X → Y is the function that determines the reward for each sequence
Suppose our training set only tells us that f(HH) = 1 (win on HH) and f(HT) = − 1 (lose on HT). How well can we predict f(TH) or f(TT)? The crucial insight is that these two known cases tell us absolutely nothing about the remaining cases. The function f could map TH and TT to any combination of wins and losses:
- Maybe f(TH) = 1 and f(TT) = 1 (win on any sequence starting with T)
- Or f(TH) = − 1 and f(TT) = − 1 (lose on any sequence starting with T)
- Or f(TH) = 1 and f(TT) = − 1 (win only on TH)
- Or f(TH) = − 1 and f(TT) = 1 (win only on TT)
All of these possibilities are equally consistent with our training data. For any prediction we make about the unknown cases, there exists a valid function f that matches our training data perfectly but gives exactly the opposite of our predictions on the unknown cases.
This simple example illustrates a profound result known as the “No Free Lunch Theorem.” The theorem states that when averaged across all possible functions f, every learning algorithm performs exactly the same. In other words, for any learning algorithm you devise, there will always be some target functions on which it performs poorly. If we know absolutely nothing about the nature of f - if we’re completely agnostic about what patterns might exist in the data - then no learning algorithm can guarantee good performance on unseen cases. We need some additional assumptions or prior knowledge about what kinds of patterns are more likely to occur in our domain.
This is a profoundly important point that gets to the heart of scientific reasoning. Following the skeptical traditions of science, our instinct is to approach problems with no prejudices, to “let the data speak.” We aim to be objective observers who simply record and analyze what we see, without imposing our preconceptions on the data. This seems like the most rigorous scientific approach.
However, what the coin toss example above reveals is a deep limitation: the data can only speak definitively about the specific situations we have actually observed. No matter how much data we collect about past coin tosses, that data alone cannot tell us with logical certainty what will happen on the next toss. By itself, empirical data cannot bridge the gap between observed and unobserved situations.
Philosophers have thought about this problem at least since the work of David Hume in the eighteenth century. Hume analyzed what we now call “the problem of induction”: the difficulty of justifying inferences from past observations to future predictions. For example, the sun has risen in the east every day in recorded history, yet this pattern alone does not logically guarantee that the sun will rise in the east tomorrow.
Let’s set aside modern astronomical knowledge (which itself relies on inductive reasoning about physical laws remaining constant). Imagine you’re living thousands of years ago, with a long and reliable historical record showing that the sun has risen in the east every single day for the past thousand years. Do these observations alone provide any logical guarantee about tomorrow’s sunrise? From pure logic, the sun could just as easily rise in the west tomorrow.
One might argue: “Given such a consistent pattern over such a long time, it’s extremely likely the pattern will continue tomorrow.” But this argument has two problems. First, it relies on a notion of “likelihood” or “probability” that’s difficult to define precisely in this context. More fundamentally, it is circular: it uses past experience (the rarity of pattern changes) to justify relying on past experience (the sunrise direction). If we question our right to use past experience in the second case, we must also question it in the first. The argument does not solve the problem of induction at all.
Where facts fail, only faith can help. When pure logic and empirical data alone cannot bridge the gap to future predictions, we must rely on beliefs and assumptions about the nature of reality. These beliefs, which philosophers call a priori knowledge and machine learning researchers call inductive bias, shape how we interpret patterns and make predictions about the unknown. An inductive bias represents the set of assumptions a learning algorithm uses to predict outputs given inputs that it has not encountered during training. These biases can range from simple intuitions like “nature tends to be continuous rather than discontinuous” to more specific domain knowledge like “similar inputs tend to produce similar outputs.” These beliefs act as constraints that make learning possible by limiting the space of possibilities we need to consider.
Returning to our gambling problem, such inductive biases can dramatically constrain our predictions. If you believe that only the second toss matters for determining the outcome, this immediately reduces the space of possible functions from 16 to just 4. Under this assumption, when you observe f(HH) = 1, you can confidently predict f(TH) = 1 since both sequences end in H. Similarly, from f(HT) = − 1, you can predict f(TT) = − 1 since both end in T. The accuracy of these predictions ultimately depends on whether your underlying belief about the second toss being decisive is correct.
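To make this concrete, here is a small sketch in Python (the dictionary representation of candidate functions is our own illustrative choice) that enumerates every possible reward function, applies the “only the second toss matters” bias, and then filters by the two training examples:

```python
from itertools import product

inputs = ["HH", "HT", "TH", "TT"]

# Every possible reward function f: X -> {-1, +1}, represented as a dict.
all_functions = [dict(zip(inputs, outputs))
                 for outputs in product([-1, 1], repeat=len(inputs))]
print(len(all_functions))        # 16 candidate functions in total

# Inductive bias: only the second toss matters, so HH must agree with TH
# and HT must agree with TT.
second_toss_only = [f for f in all_functions
                    if f["HH"] == f["TH"] and f["HT"] == f["TT"]]
print(len(second_toss_only))     # 4 functions survive the bias

# Training data: f(HH) = +1, f(HT) = -1.
training_data = {"HH": 1, "HT": -1}
consistent = [f for f in second_toss_only
              if all(f[x] == y for x, y in training_data.items())]
print(len(consistent))           # exactly 1 function is left
print(consistent[0]["TH"], consistent[0]["TT"])   # 1 -1, as argued above
```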
Such inductive biases enter into machine learning models in several important ways. For example:
Model Architecture: The basic structure we choose for our model reflects beliefs about what patterns might exist in the data. A linear model assumes relationships are roughly straight-line. A decision tree assumes the world can be divided into meaningful categories based on simple yes/no questions. A neural network assumes patterns can be built up by combining simpler patterns.
Feature Selection: When we decide which input characteristics to include in our model, we’re expressing beliefs about what information is relevant for prediction. For house prices, including square footage but not the house number reflects a belief that size matters but street numbering is arbitrary.
Regularization: This refers to techniques that penalize overly complex models. Using regularization expresses a belief that simpler explanations are more likely to be correct - a formal version of Occam’s Razor. For example, we might prefer a gentle curve through our data points rather than a wildly oscillating one that fits perfectly (see the sketch after this list).
Data Preprocessing: How we clean and transform our data reflects beliefs about what constitutes “normal” versus “anomalous” values. When we remove outliers or normalize features to have similar scales, we’re encoding assumptions about what patterns are meaningful versus what’s likely to be noise.
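To make the regularization point concrete, here is a minimal sketch (a toy dataset of our own, implemented directly in NumPy rather than with any particular library): we fit noisy, roughly linear data with a ninth-degree polynomial, once without and once with a ridge penalty on the coefficients.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: a roughly linear relationship plus noise.
x_train = np.linspace(0, 1, 15)
y_train = 2.0 * x_train + rng.normal(scale=0.2, size=x_train.size)
x_test = np.linspace(0, 1, 200)
y_test_true = 2.0 * x_test          # noise-free target, for evaluation only

def design(x, degree):
    """Polynomial feature matrix [1, x, x^2, ..., x^degree]."""
    return np.vander(x, degree + 1, increasing=True)

def fit(x, y, degree, ridge=0.0):
    """Least-squares polynomial fit with an optional ridge penalty,
    solved as an augmented least-squares problem."""
    A = design(x, degree)
    A_aug = np.vstack([A, np.sqrt(ridge) * np.eye(A.shape[1])])
    y_aug = np.concatenate([y, np.zeros(A.shape[1])])
    return np.linalg.lstsq(A_aug, y_aug, rcond=None)[0]

for ridge in [0.0, 1e-2]:
    w = fit(x_train, y_train, degree=9, ridge=ridge)
    rmse = np.sqrt(np.mean((design(x_test, 9) @ w - y_test_true) ** 2))
    print(f"ridge penalty {ridge:g}: error against the true line = {rmse:.3f}")
# The penalized fit is typically much closer to the underlying line: the
# penalty encodes our belief that gentle curves are more plausible a priori.
```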
In short, every step of the machine learning pipeline requires us to inject inductive biases: beliefs about which mathematical patterns can exist, which information matters for prediction, how simple the true relationship is likely to be, and what counts as signal versus noise.
The key question then becomes: how do we choose which inductive biases to inject at each step? How do we decide which architectures to try, which features matter most, or how much to regularize?
One very tempting approach is to empirically test various prior beliefs using techniques like test sets and cross-validation. In these approaches, we split our available data into two parts: a training set that we use to fit our model, and a test set that we hold out to evaluate how well the model generalizes. Cross-validation takes this idea further by dividing our data into k equal parts (or “folds”). We then train our model k different times - each time using k-1 folds for training and the remaining fold for testing. This process is repeated k times so that each fold serves as the test set exactly once, giving us a more robust estimate of model performance.
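The mechanics are simple enough to sketch from scratch (here in NumPy, with a deliberately trivial “predict the training mean” model standing in for whatever model we actually care about):

```python
import numpy as np

def k_fold_cv(x, y, k, fit, predict, seed=0):
    """Estimate generalization error by k-fold cross-validation.

    fit(x_train, y_train) -> model;  predict(model, x_val) -> predictions.
    Returns the mean squared error averaged over the k validation folds.
    """
    rng = np.random.default_rng(seed)
    indices = rng.permutation(len(x))          # shuffle before splitting
    folds = np.array_split(indices, k)         # k (nearly) equal parts
    scores = []
    for i in range(k):
        val_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        model = fit(x[train_idx], y[train_idx])
        preds = predict(model, x[val_idx])
        scores.append(np.mean((preds - y[val_idx]) ** 2))
    return float(np.mean(scores))

# Example usage: a model that always predicts the training mean.  Any
# fit/predict pair could be plugged in instead.
rng = np.random.default_rng(1)
x = rng.uniform(0, 1, size=100)
y = 3.0 * x + rng.normal(scale=0.5, size=100)
cv_mse = k_fold_cv(x, y, k=5,
                   fit=lambda xs, ys: ys.mean(),
                   predict=lambda model, xs: np.full(len(xs), model))
print(f"5-fold CV estimate of mean squared error: {cv_mse:.3f}")
```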
These techniques are incredibly useful in practice, as we’ll see later - they help us detect overfitting and compare different models. The multiple rounds of training and testing in cross-validation give us a more robust estimate of how well our model generalizes than a single train-test split. However, they cannot help us escape the fundamental problem of induction. Even if our model performs well on multiple test sets or cross-validation folds, we still want our predictions to do well on completely new cases that weren’t in any of our datasets. There is no guarantee they will, unless we bring in additional untested beliefs about how the world works. Test sets and cross-validation can validate our beliefs against known data, but they cannot validate them against unknown future data.
Like ordinary beliefs, the inductive biases in machine learning must ultimately come from traditions and dreams. It is our good fortune that Nature is benevolent and has created a world where traditions and dreams have been a good guide to making predictions.
A More Formal Treatment
Let’s formalize the learning problem we discussed earlier using two key frameworks: the PAC (Probably Approximately Correct) learning framework and the Empirical Risk Minimization (ERM) principle.
PAC Learning Framework
The PAC framework provides a mathematical foundation for analyzing learning from examples. It consists of:
An input space X and an output space Y. In our coin toss problem:
- X contains all possible sequences of two tosses (HH, HT, TH, TT)
- Y contains the outcomes (+1 for win, -1 for loss)
An unknown target function f : X → Y that maps inputs to outputs. In our coin example, this is the gambling rule we aim to discover.
A training dataset S of m examples: S = {(x1,y1), …, (xm,ym)}. Each pair contains an input and its correct output: yi = f(xi). For instance: (HH, +1), (HT, -1).
A candidate set ℋ of functions we consider as possible solutions. In statistical learning theory, these candidates are called hypotheses, though this differs from the statistical hypothesis testing meaning.
A probability distribution 𝒟 over the input space X. This distribution represents how frequently different inputs occur. For example, physical bias might make HH more common than TT.
Our goal is to find a candidate function h from ℋ that generalizes well beyond our training examples. To quantify performance, we define the “true error”:
$$ \text{err}(h) = \mathbb{P}_{x \sim \mathcal{D}}\bigl[h(x) \neq f(x)\bigr] $$
This measures the probability of making incorrect predictions on new inputs drawn from distribution 𝒟. The notation x ∼ 𝒟 indicates that x is randomly sampled according to distribution 𝒟.
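When f and 𝒟 are known - which they never are in real applications, but are in simulations - this probability can be estimated by sampling. The sketch below assumes, purely for illustration, that 𝒟 is two independent fair tosses and that the true rule is “win when the second toss is heads”:

```python
import numpy as np

rng = np.random.default_rng(0)

def target_f(x):
    """The true (hidden) rule: win exactly when the second toss is heads."""
    return 1 if x[1] == "H" else -1

def hypothesis_h(x):
    """A candidate rule: win exactly when the first toss is heads."""
    return 1 if x[0] == "H" else -1

def sample_input():
    """Draw x ~ D: two independent fair coin tosses."""
    return "".join(rng.choice(["H", "T"], size=2))

# Monte Carlo estimate of err(h) = P_{x ~ D}[h(x) != f(x)].
n = 100_000
mistakes = 0
for _ in range(n):
    x = sample_input()
    mistakes += hypothesis_h(x) != target_f(x)
print(f"estimated err(h) ≈ {mistakes / n:.3f}")   # close to 0.5 for these rules
```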
Empirical Risk Minimization
Since we cannot compute the true error directly (we don’t know f or 𝒟), we estimate performance using the training error:
$$ \text{err}_S(h) = \frac{1}{m}\sum_{i=1}^m \mathbf{1}\bigl[h(x_i) \neq y_i\bigr] $$
Here, 1[⋅] is an indicator function returning 1 for incorrect predictions and 0 for correct ones. The training error represents the fraction of mistakes on our training data.
The ERM principle selects the candidate function with minimal training error:
$$ h_{\text{ERM}} = \arg\min_{h \in \mathcal{H}} \text{err}_S(h) $$
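For a finite candidate set, ERM is just an exhaustive search over ℋ. Here is a minimal sketch over a class of single-threshold rules on [0, 1] (both the tiny training set and the candidate set are our own toy choices; a similar threshold class reappears in the overfitting discussion later in this chapter):

```python
import numpy as np

# A toy training set S of (x_i, y_i) pairs with x in [0, 1] and y in {0, 1}.
S = [(0.10, 1), (0.35, 1), (0.50, 0), (0.80, 0), (0.90, 0)]

# Candidate set H: single-threshold rules h_t(x) = 1 if x <= t, else 0.
thresholds = np.linspace(0.0, 1.0, 21)

def training_error(t, data):
    """err_S(h_t): fraction of training examples misclassified by h_t."""
    return np.mean([(1 if x <= t else 0) != y for x, y in data])

# ERM: pick the candidate with the smallest training error.
errors = {t: training_error(t, S) for t in thresholds}
t_erm = min(errors, key=errors.get)
print(f"ERM threshold: {t_erm:.2f}, training error: {errors[t_erm]:.2f}")
```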
Two Sources of Error
This learning approach faces two distinct challenges:
Approximation Error (ϵapp): This represents the limitations of our candidate set ℋ:
$$ \epsilon_{\text{app}} = \min_{h \in \mathcal{H}} \text{err}(h) $$
The formula above shows that approximation error ϵapp is the minimum true error achievable by any hypothesis in our candidate set ℋ. In other words, it represents how well the best possible function in ℋ could perform, if we somehow knew which one to choose. For example, if our candidate set only contains functions examining the second toss, but the true rule depends on both tosses, we have inherent approximation error - even the best function in ℋ cannot match the true rule’s performance.
Estimation Error (ϵest): This represents the difficulty of selecting the best candidate from limited data:
$$ \epsilon_{\text{est}} = \text{err}(h_{\text{ERM}}) - \epsilon_{\text{app}} $$
Even with the true rule in our candidate set, insufficient training data might lead us to select an inferior alternative. This happens because our training examples only cover a small portion of the input space - we might see (HH,+1) and (HT,-1), but not (TH,?) or (TT,?). Without seeing these other cases, we could mistakenly choose a hypothesis that matches our limited training data but makes wrong predictions on the unseen inputs. The more complex our hypothesis space, the more likely we are to find such deceptive matches to our partial view of the domain.
The total error combines these components:
$$ \epsilon_{\text{total}} = \text{err}(h_{\text{ERM}}) = \epsilon_{\text{app}} + \epsilon_{\text{est}} $$
This formalization illuminates the core challenge in learning: bridging the gap between performance on seen (training) data and unseen (test/future) data. In general, there is a trade-off between ϵapp and ϵest. Making ℋ larger is likely to reduce approximation error but increase estimation error.
ℋ itself reflects our inductive biases about the world. These are our candidates for the true function f. ϵapp tells us how good our best candidate can be. Because we do not know the true f, we cannot measure ϵapp directly. Yet, the larger we make ℋ, the more likely it is that its best member will be really good. Indeed, if ℋ contains the true f, then ϵapp can be zero.
So why not make ℋ as large as possible—say, include all functions from X to Y? The problem is that we also need to control ϵest. Returning to our gambling example: when we allowed all possible functions for f, it was easy to find a perfect fit to the known data that performed as badly as possible on the unseen data. In our current terminology, that’s a maximal estimation error. If we restrict the candidate functions to those that depend only on the second toss, then the data we have seen can uniquely determine behavior on the unseen cases (assuming our belief is correct). We can push ϵest closer to zero—so long as reality agrees with our assumption.
Hence, we want to choose ℋ that is large enough to include good approximations to f but small enough to give good control on estimation error. This returns us to the earlier discussion about induction: we cannot escape the need for inductive biases. The choice of ℋ embeds our assumptions about which patterns can exist, and these assumptions determine our ability to learn from data versus being misled by it. The art of machine learning lies in choosing an ℋ that is flexible enough to capture real patterns but constrained enough to prevent learning spurious ones.
One clarification before concluding this section: while we talk about “small” or “large” ℋ, what really matters is the complexity of the class, not the raw number of functions in it. There are various ways of quantifying complexity (e.g., VC dimension, Rademacher complexity, etc.). We will not cover them here, but keep the intuition: it is the flexibility or expressive power of ℋ, rather than its literal size, that matters.
Bringing Back Noise
So far, we have assumed a deterministic function f. This is often unrealistic. Consider predicting whether someone will get cancer based on lifestyle factors. Even with identical inputs, different individuals may have different outcomes, due to genetics or pure chance. Hence, we need a probabilistic setting.
Instead of a deterministic target f, assume there is an unknown joint probability distribution P over X × Y. Our training examples (xi,yi) are drawn independently from this distribution.
The true error of a hypothesis h is then:
$$ \text{err}(h) = \mathbb{P}_{(x,y) \sim P}\bigl[h(x) \neq y\bigr]. $$
Now we sample both x and y from P, rather than generating y deterministically via f(x).
The empirical error remains:
$$ \text{err}_S(h) = \frac{1}{m}\sum_{i=1}^m \mathbf{1}\bigl[h(x_i) \neq y_i\bigr]. $$
The Bayes Optimal Classifier
In this probabilistic setting, we can ask: What is the best we can possibly do? For any input x, the optimal prediction is the label y that is most likely given x. This is the Bayes optimal classifier:
$$ h^*(x) = \arg\max_{y} P(y \mid x). $$
In binary classification with Y = {0, 1}, this becomes
$$ h^*(x) = \begin{cases} 1 & \text{if } P(y=1 \mid x) > 0.5,\\ 0 & \text{otherwise}. \end{cases} $$
The error of the Bayes optimal classifier is the lowest possible error one can achieve. Even with infinite data and unlimited computational power, no classifier can do better than this error rate, which represents the inherent noise or uncertainty in the data.
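As a short sketch, suppose we somehow knew P(y = 1 ∣ x) - say the two-level conditional distribution used in the overfitting example below - and that x is uniform on [0, 1] (both are assumptions of this sketch, not something the data hands us). Then the Bayes classifier and the Bayes error can be computed directly:

```python
import numpy as np

def p_y1_given_x(x):
    """Assumed conditional probability P(y = 1 | x) for a toy problem."""
    return np.where(x <= 0.5, 0.7, 0.3)

def bayes_classifier(x):
    """Predict whichever label is more probable given x."""
    return (p_y1_given_x(x) > 0.5).astype(int)

print(bayes_classifier(np.array([0.25, 0.75])))   # [1 0]

# At each x the best achievable error is min(P(y=1|x), 1 - P(y=1|x));
# averaging over x ~ Uniform[0, 1] gives the Bayes error rate.
x_grid = np.linspace(0, 1, 100_001)
eta = p_y1_given_x(x_grid)
print(f"Bayes error ≈ {np.mean(np.minimum(eta, 1 - eta)):.2f}")   # 0.30
```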
The decomposition into approximation and estimation error carries over:
Approximation Error ϵapp: how closely the best hypothesis in ℋ can approximate the Bayes optimal classifier,
$$ \epsilon_{\text{app}} = \min_{h \in \mathcal{H}} \text{err}(h) - \text{err}(h^*). $$
Estimation Error ϵest: how close our chosen hypothesis is to the best one in ℋ,
$$ \epsilon_{\text{est}} = \text{err}(h_{\text{ERM}}) - \min_{h \in \mathcal{H}} \text{err}(h). $$
The same fundamental challenge remains: choose ℋ to balance approximation vs. estimation error, but now in a setting where perfect prediction may be impossible. The Bayes error rate tells us how far even the best model can go—there is an irreducible noise floor that no amount of data or training can breach.
Overfitting
Noise introduces another danger to induction known as overfitting. When our training data is noisy, a flexible model might memorize those random fluctuations instead of learning the underlying pattern. For instance, suppose you try to predict house prices from square footage. The true relationship might be roughly linear, but your training data includes random variation from factors like location or condition. A very flexible model (e.g., a high-degree polynomial) might twist and turn to match every data point exactly, including noise. Such a model might perform poorly on new data because it has learned quirks of the noise rather than the true relationship. Consequently, it becomes even more important to control the complexity of ℋ with appropriate inductive biases.
To take another example, consider a binary classification problem where X = [0,1], Y = {0, 1}, and
$$ P(Y=1\mid x) = \begin{cases} 0.7 & \text{if } x \leq 0.5, \\ 0.3 & \text{otherwise} \end{cases} $$
In this case the Bayes optimal classifier is h*(x) = 1 if x ≤ 0.5 and h*(x) = 0 otherwise, since P(Y=1|x) is greater than 0.5 only when x ≤ 0.5.
Now suppose we draw a sample of 5 observations which turn out to be (0.2,1), (0.4,0), (0.6,1), (0.7,0), (0.9,0).
If ℋ includes only single-threshold rules (i.e., rules of the form h(x) = 1 if x ≤ t and h(x) = 0 otherwise, for some threshold t), then ERM may produce the rule h(x) = 1 if x ≤ 0.35 and h(x) = 0 otherwise. Even though the Bayes optimal classifier is itself a single-threshold rule, ERM fails to find it here because we only have a small, noisy sample.
But things can get much worse. If ℋ allows multiple thresholds, ERM might yield the more complicated rule:
$$ h(x) = \begin{cases} 1 & \text{if } x \leq 0.2 \text{ or } 0.4 < x \leq 0.6, \\ 0 & \text{otherwise} \end{cases} $$
The second rule, with its multiple thresholds, has overfit by precisely matching all the peculiarities of our small, noisy sample. While it achieves perfect classification on our 5 training points, it will perform poorly on new, unseen data because it has learned to model the random sampling noise rather than the true underlying probability distribution. The rule creates artificial decision boundaries at x=0.2 and x=0.6 based on just single observations, when we know the true distribution only has one meaningful threshold at x=0.5.
This overly complex model will, on average, produce more classification errors than the simpler single-threshold ERM rule when tested on new data. This is because the single-threshold rule, while imperfect, better captures the general structure of the true probability distribution. Its simpler form makes it more robust to sampling noise and thus more likely to generalize well to new observations drawn from the same distribution.
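We can check this claim directly by computing true errors under the distribution defined above. The sketch below averages the pointwise error probability over a fine grid of x values, assuming (as in this example) that x is uniform on [0, 1]:

```python
import numpy as np

def p_y1_given_x(x):
    """P(y = 1 | x) from the example above."""
    return np.where(x <= 0.5, 0.7, 0.3)

def true_error(h, n_grid=100_001):
    """True error of classifier h, assuming x ~ Uniform[0, 1]."""
    x = np.linspace(0, 1, n_grid)
    eta = p_y1_given_x(x)
    # Chance of a mistake at x: eta if h predicts 0, 1 - eta if it predicts 1.
    return float(np.mean(np.where(h(x) == 1, 1 - eta, eta)))

def single_threshold(x):   # the simpler ERM rule from the text (t = 0.35)
    return (x <= 0.35).astype(int)

def two_intervals(x):      # the overfit multi-threshold rule
    return ((x <= 0.2) | ((x > 0.4) & (x <= 0.6))).astype(int)

def bayes_rule(x):         # the Bayes optimal classifier (threshold at 0.5)
    return (x <= 0.5).astype(int)

for name, h in [("single threshold at 0.35", single_threshold),
                ("two intervals (overfit)", two_intervals),
                ("Bayes optimal", bayes_rule)]:
    print(f"{name}: true error ≈ {true_error(h):.2f}")
# Up to grid resolution this prints roughly 0.36, 0.42, and 0.30: the rule
# that fit the five training points perfectly generalizes worst.
```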
In this noisy setting, cross-validation becomes an especially powerful tool for model selection and evaluation. The key insight is that each fold of cross-validation provides an independent draw from the underlying data distribution P. When we evaluate our model on a validation fold, we’re testing it against noise patterns that are statistically independent from those in the training data. This independence is crucial - if a model has memorized noise in the training data, it won’t help predict the different noise patterns in the validation fold. Only the true underlying patterns that generalize across different samples will lead to good validation performance.
For example, in our house price prediction scenario, suppose certain houses in the training data were unusually expensive due to temporary market conditions. A model that memorizes these specific cases won’t help predict prices in the validation fold, where the temporary conditions might be different. Cross-validation helps us identify such overfitting by exposing the model to multiple independent sets of noise patterns. The model’s average performance across these folds gives us a more reliable estimate of its true generalization ability than a single train-test split would provide.
This reliability becomes even more important when selecting between models of different complexities. Consider choosing between a linear model and a tenth-degree polynomial for our house price predictor. The polynomial might achieve near-zero training error by fitting the noise, while the linear model maintains a modest but stable error across all folds. Cross-validation will reveal this pattern through consistently poor validation performance of the polynomial compared to more stable performance of the linear model. This stability across multiple independent noise patterns is a strong signal that we’ve found a model that captures the true underlying relationship while appropriately handling the inherent noise in the data.
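A sketch of such a comparison on synthetic data (a made-up stand-in for the house price example, not real sales figures): we cross-validate a linear fit and a tenth-degree polynomial fit on noisy, roughly linear data and look at both the average validation error and its spread across folds.

```python
import numpy as np

rng = np.random.default_rng(42)

# Synthetic stand-in for the house price data: price grows roughly
# linearly with size (size rescaled to [0, 1] for numerical stability).
size = rng.uniform(0, 1, 60)
price = 300_000 * size + 100_000 + rng.normal(scale=30_000, size=60)

def cv_errors(x, y, degree, k=5):
    """Per-fold validation mean squared errors of a polynomial fit."""
    idx = rng.permutation(len(x))
    folds = np.array_split(idx, k)
    errs = []
    for i in range(k):
        val = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        coeffs = np.polyfit(x[train], y[train], degree)
        errs.append(np.mean((np.polyval(coeffs, x[val]) - y[val]) ** 2))
    return np.array(errs)

for degree in [1, 10]:
    errs = cv_errors(size, price, degree)
    print(f"degree {degree:2d}: mean CV error {errs.mean():.3e}, "
          f"spread across folds {errs.std():.3e}")
# The tenth-degree fit can track each training fold closely, but its
# validation error is typically larger and much less stable across folds.
```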