Linear Discriminant Analysis (Fisher's)

Mathematical basics
#

1. Mean vector
#

For one-dimensional data:

$$ \bar{x} = \frac{1}{n} \sum_{i=1}^n x^{(i)} $$

But for multi-dimensional data, where each sample is a vector:

$$ \mathbf{x}^{(i)} = \begin{bmatrix} x_1^{(i)} \\ x_2^{(i)} \\ \vdots \\ x_d^{(i)} \end{bmatrix} $$

the mean vector is defined as:

$$ \mathbf{\mu} = \frac{1}{n} \sum_{i=1}^n \mathbf{x}^{(i)}, \mathbf{\mu} \in \R^d $$

in detail,

$$ \mathbf{\mu} = \begin{bmatrix} \mu_1 \\ \mu_2 \\ \vdots \\ \mu_d \end{bmatrix} = \begin{bmatrix} \frac{1}{n} \sum_{i=1}^n x_1^{(i)} \\ \frac{1}{n} \sum_{i=1}^n x_2^{(i)} \\ \vdots \\ \frac{1}{n} \sum_{i=1}^n x_d^{(i)} \\ \end{bmatrix} $$

For example, suppose we have a $\mathbf{x}$ that has three two-dimensional samples:

$$ x^{(1)} = \begin{bmatrix} 1 \\ 2 \end{bmatrix},\space x^{(2)} = \begin{bmatrix} 3 \\ 4 \end{bmatrix},\space x^{(3)} = \begin{bmatrix} 5 \\ 6 \end{bmatrix} $$

then the mean vector is:

$$ \mathbf{\mu} = \frac{1}{3} \left( \begin{bmatrix} 1 \\ 2 \end{bmatrix} + \begin{bmatrix} 3 \\ 4 \end{bmatrix} + \begin{bmatrix} 5 \\ 6 \end{bmatrix} \right) = \begin{bmatrix} 3 \\ 4 \end{bmatrix} $$

2. Covariance matrix
#

For one-dimensional data:

$$ \mathrm{Var}(x)=\frac1n\sum (x^{(i)} - \bar x)^2 $$

To measure how two variables vary together, we define covariance as:

$$ \mathrm{Cov}(x, y) = \frac{1}{n} \sum_{i=1}^{n} (x^{(i)} - \bar{x}) (y^{(i)} - \bar{y}) $$

For example, we have

$$ x^{(1)} = \begin{bmatrix} 1 \\ 2 \end{bmatrix},\space x^{(2)} = \begin{bmatrix} 3 \\ 4 \end{bmatrix},\space x^{(3)} = \begin{bmatrix} 5 \\ 6 \end{bmatrix} $$

Mark the first dimension as $x_1$, and the second dimension as $x_2$, then

$$ \bar{x}_1 = \frac{1 + 3 + 5}{3} = 3,\space \bar{x}_2 = \frac{2 + 4 + 6}{3} = 4 $$

so,

$$ \begin{align*} \mathrm{Cov}(x_1, x_2) &= \frac{1}{3}[(1 - 3)(2 - 4) + (3 - 3)(4 - 4) + (5 - 3)(6 - 4)] \\ &= \frac{1}{3}(4 + 0 + 4) = \frac{8}{3} \end{align*} $$

But for multi-dimensional data, where each sample is a vector, the covariance matrix is defined as:

$$ \Sigma = \frac{1}{n} \sum_{i=1}^{n} (\mathbf{x}^{(i)} - \mathbf{\mu})(\mathbf{x}^{(i)} - \mathbf{\mu})^T $$

where $\Sigma \in \mathbb{R}^{d \times d}$.

Each entry of $\Sigma$ represents the covariance between two dimensions:

$$ \Sigma_{jk} = \mathrm{Cov}(x_j, x_k) $$

Therefore, for a two-dimensional vector

$$ \mathbf{x} = \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} $$

the covariance matrix becomes

$$ \Sigma = \begin{bmatrix} \mathrm{Var}(x_1) & \mathrm{Cov}(x_1, x_2) \\ \mathrm{Cov}(x_2, x_1) & \mathrm{Var}(x_2) \end{bmatrix} $$

For example, using the previous dataset:

$$ \mathbf{x}^{(1)} = \begin{bmatrix} 1 \\ 2 \end{bmatrix},\space \mathbf{x}^{(2)} = \begin{bmatrix} 3 \\ 4 \end{bmatrix},\space \mathbf{x}^{(3)} = \begin{bmatrix} 5 \\ 6 \end{bmatrix} $$

We already know that

$$ \mathbf{\mu} = \begin{bmatrix} 3 \\ 4 \end{bmatrix} $$

Then:

$$ \mathbf{x}^{(1)} - \mathbf{\mu} = \begin{bmatrix} -2 \\ -2 \end{bmatrix},\space \mathbf{x}^{(2)} - \mathbf{\mu} = \begin{bmatrix} 0 \\ 0 \end{bmatrix},\space \mathbf{x}^{(3)} - \mathbf{\mu} = \begin{bmatrix} 2 \\ 2 \end{bmatrix} $$

Therefore,

$$ \begin{align*} \Sigma &= \frac{1}{3} \left[ \begin{bmatrix} -2 \\ -2 \end{bmatrix} \begin{bmatrix} -2 & -2 \end{bmatrix} + \begin{bmatrix} 0 \\ 0 \end{bmatrix} \begin{bmatrix} 0 & 0 \end{bmatrix} + \begin{bmatrix} 2 \\ 2 \end{bmatrix} \begin{bmatrix} 2 & 2 \end{bmatrix} \right] \\ &= \frac{1}{3} \left[ \begin{bmatrix} 4 & 4 \\ 4 & 4 \end{bmatrix} + \begin{bmatrix} 0 & 0 \\ 0 & 0 \end{bmatrix} + \begin{bmatrix} 4 & 4 \\ 4 & 4 \end{bmatrix} \right] \\ &= \frac{1}{3} \begin{bmatrix} 8 & 8 \\ 8 & 8 \end{bmatrix} \\ &= \begin{bmatrix} \frac{8}{3} & \frac{8}{3} \\ \frac{8}{3} & \frac{8}{3} \end{bmatrix} \end{align*} $$

Projection onto a Line
#

The core idea of Fisher’s Linear Discriminant Analysis is to project high-dimensional samples onto a line:

$$ y = \mathbf{w}^T \mathbf{x} $$

where:

$\mathbf{x} \in \mathbb{R}^d$ is the original sample,
$\mathbf{w} \in \mathbb{R}^d$ is the projection direction,
$y \in \mathbb{R}$ is the projected scalar.

Geometrically, this means projecting all samples onto the line spanned by $\mathbf{w}$.

Projected Mean
#

Suppose the original data has mean vector $\mathbf{\mu}$.

Then the mean of the projected data is:

$$ \mu_y = \mathbf{w}^T \mathbf{\mu} $$

This follows from the linearity of expectation:

$$ \begin{align*} \mu_y &= \mathbb{E}[y] \\ &= \mathbb{E}[\mathbf{w}^T\mathbf{x}] \\ &= \mathbf{w}^T\mathbb{E}[\mathbf{x}] \\ &= \mathbf{w}^T\mathbf{\mu} \end{align*} $$

Thus, after projection, the center of each class is also projected onto the line.

Projected Variance
#

Suppose the original covariance matrix is $\Sigma$.

Then the variance of the projected data is:

$$ \sigma_y^2 = \mathbf{w}^T \Sigma \mathbf{w} $$

Proof:

$$ \begin{align*} \sigma_y^2 &= \mathrm{Var}(y) \\ &= \mathrm{Var}(\mathbf{w}^T\mathbf{x}) \\ &= \mathbb{E}\left[(\mathbf{w}^T\mathbf{x} - \mathbf{w}^T\mathbf{\mu})^2\right] \\ &= \mathbb{E}\left[(\mathbf{w}^T(\mathbf{x}-\mathbf{\mu}))^2\right] \\ &= \mathbb{E}\left[\mathbf{w}^T(\mathbf{x}-\mathbf{\mu})(\mathbf{x}-\mathbf{\mu})^T\mathbf{w}\right] \\ &= \mathbf{w}^T \mathbb{E}\left[(\mathbf{x}-\mathbf{\mu})(\mathbf{x}-\mathbf{\mu})^T\right] \mathbf{w} \\ &= \mathbf{w}^T \Sigma \mathbf{w} \end{align*} $$

Therefore, the projected variance is a quadratic form of $\Sigma$.

Fisher Criterion
#

Suppose we have two classes:

Class 0 with mean $\mathbf{\mu}_0$ and covariance $\Sigma_0$
Class 1 with mean $\mathbf{\mu}_1$ and covariance $\Sigma_1$

After projection onto $\mathbf{w}$:

Their projected means become:

$$ m_0 = \mathbf{w}^T \mathbf{\mu}_0,\quad m_1 = \mathbf{w}^T \mathbf{\mu}_1 $$

Their projected variances become:

$$ s_0^2 = \mathbf{w}^T \Sigma_0 \mathbf{w},\quad s_1^2 = \mathbf{w}^T \Sigma_1 \mathbf{w} $$

After projection, the classification problem becomes one-dimensional.
Therefore, a good projection direction should satisfy:

The projected class centers are as far apart as possible;
The projected samples within each class are as concentrated as possible.

Thus, Fisher proposed the following objective:

$$ J(\mathbf{w}) = \frac{(m_0 - m_1)^2}{s_0^2 + s_1^2} $$

where:

The numerator measures the separation between the projected class means;
The denominator measures the total spread within the projected classes.

Substituting the projected mean/variance formulas gives:

$$ J(\mathbf{w}) = \frac{(\mathbf{w}^T(\mathbf{\mu}_0-\mathbf{\mu}_1))^2}{\mathbf{w}^T(\Sigma_0+\Sigma_1)\mathbf{w}} $$

Scatter Matrices
#

To simplify the notation, we introduce two scatter matrices.

The within-class scatter matrix summarizes the spread of samples inside each class:

$$ S_W = \Sigma_0 + \Sigma_1 $$

The between-class scatter matrix describes the separation between the class centers:

$$ S_B = (\mathbf{\mu}_0-\mathbf{\mu}_1)(\mathbf{\mu}_0-\mathbf{\mu}_1)^T $$

Using these definitions:

$$ \mathbf{w}^T S_B \mathbf{w} = (\mathbf{w}^T(\mathbf{\mu}_0-\mathbf{\mu}_1))^2 $$

and

$$ \begin{align*} \mathbf{w}^T S_W \mathbf{w} &= \mathbf{w}^T \Sigma_0 \mathbf{w} + \mathbf{w}^T \Sigma_1 \mathbf{w} \\ &= s_0^2+s_1^2 \end{align*} $$

Therefore, the Fisher criterion can be rewritten compactly as:

$$ J(\mathbf{w}) = \frac{\mathbf{w}^T S_B \mathbf{w}}{\mathbf{w}^T S_W \mathbf{w}} $$

Optimal Projection Direction
#

We now seek the projection direction that maximizes the Fisher criterion:

$$ \mathbf{w}^* = \arg\max_{\mathbf{w}} \frac{\mathbf{w}^T S_B \mathbf{w}}{\mathbf{w}^T S_W \mathbf{w}} $$

Since scaling $\mathbf{w}$ does not change the value of the ratio,

$$ J(c\mathbf{w}) = J(\mathbf{w}),\quad c\neq 0 $$

only the direction of $\mathbf{w}$ matters.

Thus, we may impose the constraint

$$ \mathbf{w}^T S_W \mathbf{w}=1 $$

and convert the problem into

$$ \max_{\mathbf{w}} \mathbf{w}^T S_B \mathbf{w} \quad \text{subject to } \mathbf{w}^T S_W \mathbf{w}=1 $$

Lagrangian Derivation
#

Construct the Lagrangian:

$$ L(\mathbf{w},\lambda) = \mathbf{w}^T S_B \mathbf{w} - \lambda(\mathbf{w}^T S_W \mathbf{w}-1) $$

Take derivative with respect to $\mathbf{w}$ and set it to zero:

$$ \frac{\partial L}{\partial \mathbf{w}} = 2S_B\mathbf{w} - 2\lambda S_W\mathbf{w} = 0 $$

Thus:

$$ S_B\mathbf{w} = \lambda S_W\mathbf{w} $$

This is a generalized eigenvalue problem.

Substituting the Form of $S_B$
#

Recall that

$$ S_B = (\mathbf{\mu}_0-\mathbf{\mu}_1)(\mathbf{\mu}_0-\mathbf{\mu}_1)^T $$

Substitute into the previous equation:

$$ (\mathbf{\mu}_0-\mathbf{\mu}_1)(\mathbf{\mu}_0-\mathbf{\mu}_1)^T\mathbf{w} = \lambda S_W\mathbf{w} $$

Notice that

$$ (\mathbf{\mu}_0-\mathbf{\mu}_1)^T\mathbf{w} $$

is a scalar. Let

$$ c=(\mathbf{\mu}_0-\mathbf{\mu}_1)^T\mathbf{w} $$

Then:

$$ c(\mathbf{\mu}_0-\mathbf{\mu}_1) = \lambda S_W\mathbf{w} $$

Multiply both sides by $S_W^{-1}$:

$$ \mathbf{w} = \frac{c}{\lambda} S_W^{-1} (\mathbf{\mu}_0-\mathbf{\mu}_1) $$

Since $\frac{c}{\lambda}$ is only a scalar factor and only the direction matters, we obtain:

$$ \boxed{ \mathbf{w}^* = S_W^{-1}(\mathbf{\mu}_0-\mathbf{\mu}_1) } $$

This is the Fisher Linear Discriminant direction.

Intuitively:

$(\mathbf{\mu}_0-\mathbf{\mu}_1)$ points toward the difference between the two class centers;
$S_W^{-1}$ adjusts this direction by down-weighting directions with large within-class scatter.

Classification Rule
#

After obtaining $\mathbf{w}$, project a new sample $\mathbf{x}$:

$$ y = \mathbf{w}^T \mathbf{x} $$

A simple decision threshold is the midpoint between the projected class means:

$$ \theta = \frac{m_0 + m_1}{2} $$

where

$$ m_0 = \mathbf{w}^T\mathbf{\mu}_0,\quad m_1 = \mathbf{w}^T\mathbf{\mu}_1 $$

Classification rule:

$$ \begin{cases} y > \theta,\quad \text{Class 1} \\ y \le \theta,\quad \text{Class 0} \end{cases} $$

This means that the sample is assigned to the class whose projected center is closer.

Summary of Fisher’s LDA
#

The complete pipeline of Fisher’s LDA is:

Compute the class means $\mathbf{\mu}_0,\mathbf{\mu}_1$
Compute the covariance matrices $\Sigma_0,\Sigma_1$
Construct scatter matrices:

$$ S_W=\Sigma_0+\Sigma_1,\quad S_B=(\mathbf{\mu}_0-\mathbf{\mu}_1)(\mathbf{\mu}_0-\mathbf{\mu}_1)^T $$

Solve for the optimal projection direction:

$$ \mathbf{w}^*=S_W^{-1}(\mathbf{\mu}_0-\mathbf{\mu}_1) $$

Project samples onto the line:

$$ y=\mathbf{w}^T\mathbf{x} $$

Classify by thresholding:

$$ y \mathop{\gtrless}_{\text{Class 0}}^{\text{Class 1}} \theta $$

Thus, Fisher’s LDA converts a high-dimensional binary classification problem into a one-dimensional thresholding problem.

Mathematical basics #

1. Mean vector #

2. Covariance matrix #

Projection onto a Line #

Projected Mean #

Projected Variance #

Fisher Criterion #

Scatter Matrices #

Optimal Projection Direction #

Lagrangian Derivation #

Substituting the Form of \(S_B\) #

Classification Rule #

Summary of Fisher’s LDA #