メインコンテンツへスキップ

Linear Discriminant Analysis (Fisher's)

·1350 文字·7 分

Mathematical basics
#

1. Mean vector
#

For one-dimensional data:

$$ \bar{x} = \frac{1}{n} \sum_{i=1}^n x^{(i)} $$

But for multi-dimensional data, where each sample is a vector:

$$ \mathbf{x}^{(i)} = \begin{bmatrix} x_1^{(i)} \\ x_2^{(i)} \\ \vdots \\ x_d^{(i)} \end{bmatrix} $$

the mean vector is defined as:

$$ \mathbf{\mu} = \frac{1}{n} \sum_{i=1}^n \mathbf{x}^{(i)}, \mathbf{\mu} \in \R^d $$

in detail,

$$ \mathbf{\mu} = \begin{bmatrix} \mu_1 \\ \mu_2 \\ \vdots \\ \mu_d \end{bmatrix} = \begin{bmatrix} \frac{1}{n} \sum_{i=1}^n x_1^{(i)} \\ \frac{1}{n} \sum_{i=1}^n x_2^{(i)} \\ \vdots \\ \frac{1}{n} \sum_{i=1}^n x_d^{(i)} \\ \end{bmatrix} $$

For example, suppose we have a \(\mathbf{x}\) that has three two-dimensional samples:

$$ x^{(1)} = \begin{bmatrix} 1 \\ 2 \end{bmatrix},\space x^{(2)} = \begin{bmatrix} 3 \\ 4 \end{bmatrix},\space x^{(3)} = \begin{bmatrix} 5 \\ 6 \end{bmatrix} $$

then the mean vector is:

$$ \mathbf{\mu} = \frac{1}{3} \left( \begin{bmatrix} 1 \\ 2 \end{bmatrix} + \begin{bmatrix} 3 \\ 4 \end{bmatrix} + \begin{bmatrix} 5 \\ 6 \end{bmatrix} \right) = \begin{bmatrix} 3 \\ 4 \end{bmatrix} $$

2. Covariance matrix
#

For one-dimensional data:

$$ \mathrm{Var}(x)=\frac1n\sum (x^{(i)} - \bar x)^2 $$

To measure how two variables vary together, we define covariance as:

$$ \mathrm{Cov}(x, y) = \frac{1}{n} \sum_{i=1}^{n} (x^{(i)} - \bar{x}) (y^{(i)} - \bar{y}) $$

For example, we have

$$ x^{(1)} = \begin{bmatrix} 1 \\ 2 \end{bmatrix},\space x^{(2)} = \begin{bmatrix} 3 \\ 4 \end{bmatrix},\space x^{(3)} = \begin{bmatrix} 5 \\ 6 \end{bmatrix} $$

Mark the first dimension as \(x_1\), and the second dimension as \(x_2\), then

$$ \bar{x}_1 = \frac{1 + 3 + 5}{3} = 3,\space \bar{x}_2 = \frac{2 + 4 + 6}{3} = 4 $$

so,

$$ \begin{align*} \mathrm{Cov}(x_1, x_2) &= \frac{1}{3}[(1 - 3)(2 - 4) + (3 - 3)(4 - 4) + (5 - 3)(6 - 4)] \\ &= \frac{1}{3}(4 + 0 + 4) = \frac{8}{3} \end{align*} $$

But for multi-dimensional data, where each sample is a vector, the covariance matrix is defined as:

$$ \Sigma = \frac{1}{n} \sum_{i=1}^{n} (\mathbf{x}^{(i)} - \mathbf{\mu})(\mathbf{x}^{(i)} - \mathbf{\mu})^T $$

where \(\Sigma \in \mathbb{R}^{d \times d}\).

Each entry of \(\Sigma\) represents the covariance between two dimensions:

$$ \Sigma_{jk} = \mathrm{Cov}(x_j, x_k) $$

Therefore, for a two-dimensional vector

$$ \mathbf{x} = \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} $$

the covariance matrix becomes

$$ \Sigma = \begin{bmatrix} \mathrm{Var}(x_1) & \mathrm{Cov}(x_1, x_2) \\ \mathrm{Cov}(x_2, x_1) & \mathrm{Var}(x_2) \end{bmatrix} $$

For example, using the previous dataset:

$$ \mathbf{x}^{(1)} = \begin{bmatrix} 1 \\ 2 \end{bmatrix},\space \mathbf{x}^{(2)} = \begin{bmatrix} 3 \\ 4 \end{bmatrix},\space \mathbf{x}^{(3)} = \begin{bmatrix} 5 \\ 6 \end{bmatrix} $$

We already know that

$$ \mathbf{\mu} = \begin{bmatrix} 3 \\ 4 \end{bmatrix} $$

Then:

$$ \mathbf{x}^{(1)} - \mathbf{\mu} = \begin{bmatrix} -2 \\ -2 \end{bmatrix},\space \mathbf{x}^{(2)} - \mathbf{\mu} = \begin{bmatrix} 0 \\ 0 \end{bmatrix},\space \mathbf{x}^{(3)} - \mathbf{\mu} = \begin{bmatrix} 2 \\ 2 \end{bmatrix} $$

Therefore,

$$ \begin{align*} \Sigma &= \frac{1}{3} \left[ \begin{bmatrix} -2 \\ -2 \end{bmatrix} \begin{bmatrix} -2 & -2 \end{bmatrix} + \begin{bmatrix} 0 \\ 0 \end{bmatrix} \begin{bmatrix} 0 & 0 \end{bmatrix} + \begin{bmatrix} 2 \\ 2 \end{bmatrix} \begin{bmatrix} 2 & 2 \end{bmatrix} \right] \\ &= \frac{1}{3} \left[ \begin{bmatrix} 4 & 4 \\ 4 & 4 \end{bmatrix} + \begin{bmatrix} 0 & 0 \\ 0 & 0 \end{bmatrix} + \begin{bmatrix} 4 & 4 \\ 4 & 4 \end{bmatrix} \right] \\ &= \frac{1}{3} \begin{bmatrix} 8 & 8 \\ 8 & 8 \end{bmatrix} \\ &= \begin{bmatrix} \frac{8}{3} & \frac{8}{3} \\ \frac{8}{3} & \frac{8}{3} \end{bmatrix} \end{align*} $$

Projection onto a Line
#

The core idea of Fisher’s Linear Discriminant Analysis is to project high-dimensional samples onto a line:

$$ y = \mathbf{w}^T \mathbf{x} $$

where:

  • \(\mathbf{x} \in \mathbb{R}^d\) is the original sample,
  • \(\mathbf{w} \in \mathbb{R}^d\) is the projection direction,
  • \(y \in \mathbb{R}\) is the projected scalar.

Geometrically, this means projecting all samples onto the line spanned by \(\mathbf{w}\).


Projected Mean
#

Suppose the original data has mean vector \(\mathbf{\mu}\).

Then the mean of the projected data is:

$$ \mu_y = \mathbf{w}^T \mathbf{\mu} $$

This follows from the linearity of expectation:

$$ \begin{align*} \mu_y &= \mathbb{E}[y] \\ &= \mathbb{E}[\mathbf{w}^T\mathbf{x}] \\ &= \mathbf{w}^T\mathbb{E}[\mathbf{x}] \\ &= \mathbf{w}^T\mathbf{\mu} \end{align*} $$

Thus, after projection, the center of each class is also projected onto the line.


Projected Variance
#

Suppose the original covariance matrix is \(\Sigma\).

Then the variance of the projected data is:

$$ \sigma_y^2 = \mathbf{w}^T \Sigma \mathbf{w} $$

Proof:

$$ \begin{align*} \sigma_y^2 &= \mathrm{Var}(y) \\ &= \mathrm{Var}(\mathbf{w}^T\mathbf{x}) \\ &= \mathbb{E}\left[(\mathbf{w}^T\mathbf{x} - \mathbf{w}^T\mathbf{\mu})^2\right] \\ &= \mathbb{E}\left[(\mathbf{w}^T(\mathbf{x}-\mathbf{\mu}))^2\right] \\ &= \mathbb{E}\left[\mathbf{w}^T(\mathbf{x}-\mathbf{\mu})(\mathbf{x}-\mathbf{\mu})^T\mathbf{w}\right] \\ &= \mathbf{w}^T \mathbb{E}\left[(\mathbf{x}-\mathbf{\mu})(\mathbf{x}-\mathbf{\mu})^T\right] \mathbf{w} \\ &= \mathbf{w}^T \Sigma \mathbf{w} \end{align*} $$

Therefore, the projected variance is a quadratic form of \(\Sigma\).


Fisher Criterion
#

Suppose we have two classes:

  • Class 0 with mean \(\mathbf{\mu}_0\) and covariance \(\Sigma_0\)
  • Class 1 with mean \(\mathbf{\mu}_1\) and covariance \(\Sigma_1\)

After projection onto \(\mathbf{w}\):

  • Their projected means become:

$$ m_0 = \mathbf{w}^T \mathbf{\mu}_0,\quad m_1 = \mathbf{w}^T \mathbf{\mu}_1 $$

  • Their projected variances become:

$$ s_0^2 = \mathbf{w}^T \Sigma_0 \mathbf{w},\quad s_1^2 = \mathbf{w}^T \Sigma_1 \mathbf{w} $$

After projection, the classification problem becomes one-dimensional.
Therefore, a good projection direction should satisfy:

  1. The projected class centers are as far apart as possible;
  2. The projected samples within each class are as concentrated as possible.

Thus, Fisher proposed the following objective:

$$ J(\mathbf{w}) = \frac{(m_0 - m_1)^2}{s_0^2 + s_1^2} $$

where:

  • The numerator measures the separation between the projected class means;
  • The denominator measures the total spread within the projected classes.

Substituting the projected mean/variance formulas gives:

$$ J(\mathbf{w}) = \frac{(\mathbf{w}^T(\mathbf{\mu}_0-\mathbf{\mu}_1))^2}{\mathbf{w}^T(\Sigma_0+\Sigma_1)\mathbf{w}} $$


Scatter Matrices
#

To simplify the notation, we introduce two scatter matrices.

The within-class scatter matrix summarizes the spread of samples inside each class:

$$ S_W = \Sigma_0 + \Sigma_1 $$

The between-class scatter matrix describes the separation between the class centers:

$$ S_B = (\mathbf{\mu}_0-\mathbf{\mu}_1)(\mathbf{\mu}_0-\mathbf{\mu}_1)^T $$

Using these definitions:

$$ \mathbf{w}^T S_B \mathbf{w} = (\mathbf{w}^T(\mathbf{\mu}_0-\mathbf{\mu}_1))^2 $$

and

$$ \begin{align*} \mathbf{w}^T S_W \mathbf{w} &= \mathbf{w}^T \Sigma_0 \mathbf{w} + \mathbf{w}^T \Sigma_1 \mathbf{w} \\ &= s_0^2+s_1^2 \end{align*} $$

Therefore, the Fisher criterion can be rewritten compactly as:

$$ J(\mathbf{w}) = \frac{\mathbf{w}^T S_B \mathbf{w}}{\mathbf{w}^T S_W \mathbf{w}} $$


Optimal Projection Direction
#

We now seek the projection direction that maximizes the Fisher criterion:

$$ \mathbf{w}^* = \arg\max_{\mathbf{w}} \frac{\mathbf{w}^T S_B \mathbf{w}}{\mathbf{w}^T S_W \mathbf{w}} $$

Since scaling \(\mathbf{w}\) does not change the value of the ratio,

$$ J(c\mathbf{w}) = J(\mathbf{w}),\quad c\neq 0 $$

only the direction of \(\mathbf{w}\) matters.

Thus, we may impose the constraint

$$ \mathbf{w}^T S_W \mathbf{w}=1 $$

and convert the problem into

$$ \max_{\mathbf{w}} \mathbf{w}^T S_B \mathbf{w} \quad \text{subject to } \mathbf{w}^T S_W \mathbf{w}=1 $$


Lagrangian Derivation
#

Construct the Lagrangian:

$$ L(\mathbf{w},\lambda) = \mathbf{w}^T S_B \mathbf{w} - \lambda(\mathbf{w}^T S_W \mathbf{w}-1) $$

Take derivative with respect to \(\mathbf{w}\) and set it to zero:

$$ \frac{\partial L}{\partial \mathbf{w}} = 2S_B\mathbf{w} - 2\lambda S_W\mathbf{w} = 0 $$

Thus:

$$ S_B\mathbf{w} = \lambda S_W\mathbf{w} $$

This is a generalized eigenvalue problem.


Substituting the Form of \(S_B\)
#

Recall that

$$ S_B = (\mathbf{\mu}_0-\mathbf{\mu}_1)(\mathbf{\mu}_0-\mathbf{\mu}_1)^T $$

Substitute into the previous equation:

$$ (\mathbf{\mu}_0-\mathbf{\mu}_1)(\mathbf{\mu}_0-\mathbf{\mu}_1)^T\mathbf{w} = \lambda S_W\mathbf{w} $$

Notice that

$$ (\mathbf{\mu}_0-\mathbf{\mu}_1)^T\mathbf{w} $$

is a scalar. Let

$$ c=(\mathbf{\mu}_0-\mathbf{\mu}_1)^T\mathbf{w} $$

Then:

$$ c(\mathbf{\mu}_0-\mathbf{\mu}_1) = \lambda S_W\mathbf{w} $$

Multiply both sides by \(S_W^{-1}\):

$$ \mathbf{w} = \frac{c}{\lambda} S_W^{-1} (\mathbf{\mu}_0-\mathbf{\mu}_1) $$

Since \(\frac{c}{\lambda}\) is only a scalar factor and only the direction matters, we obtain:

$$ \boxed{ \mathbf{w}^* = S_W^{-1}(\mathbf{\mu}_0-\mathbf{\mu}_1) } $$

This is the Fisher Linear Discriminant direction.

Intuitively:

  • \((\mathbf{\mu}_0-\mathbf{\mu}_1)\) points toward the difference between the two class centers;
  • \(S_W^{-1}\) adjusts this direction by down-weighting directions with large within-class scatter.

Classification Rule
#

After obtaining \(\mathbf{w}\), project a new sample \(\mathbf{x}\):

$$ y = \mathbf{w}^T \mathbf{x} $$

A simple decision threshold is the midpoint between the projected class means:

$$ \theta = \frac{m_0 + m_1}{2} $$

where

$$ m_0 = \mathbf{w}^T\mathbf{\mu}_0,\quad m_1 = \mathbf{w}^T\mathbf{\mu}_1 $$

Classification rule:

$$ \begin{cases} y > \theta,\quad \text{Class 1} \\ y \le \theta,\quad \text{Class 0} \end{cases} $$

This means that the sample is assigned to the class whose projected center is closer.


Summary of Fisher’s LDA
#

The complete pipeline of Fisher’s LDA is:

  1. Compute the class means \(\mathbf{\mu}_0,\mathbf{\mu}_1\)
  2. Compute the covariance matrices \(\Sigma_0,\Sigma_1\)
  3. Construct scatter matrices:

$$ S_W=\Sigma_0+\Sigma_1,\quad S_B=(\mathbf{\mu}_0-\mathbf{\mu}_1)(\mathbf{\mu}_0-\mathbf{\mu}_1)^T $$

  1. Solve for the optimal projection direction:

$$ \mathbf{w}^*=S_W^{-1}(\mathbf{\mu}_0-\mathbf{\mu}_1) $$

  1. Project samples onto the line:

$$ y=\mathbf{w}^T\mathbf{x} $$

  1. Classify by thresholding:

$$ y \mathop{\gtrless}_{\text{Class 0}}^{\text{Class 1}} \theta $$

Thus, Fisher’s LDA converts a high-dimensional binary classification problem into a one-dimensional thresholding problem.