Mathematical basics #
1. Mean vector #
For one-dimensional data:
$$ \bar{x} = \frac{1}{n} \sum_{i=1}^n x^{(i)} $$
But for multi-dimensional data, where each sample is a vector:
$$ \mathbf{x}^{(i)} = \begin{bmatrix} x_1^{(i)} \\ x_2^{(i)} \\ \vdots \\ x_d^{(i)} \end{bmatrix} $$
the mean vector is defined as:
$$ \mathbf{\mu} = \frac{1}{n} \sum_{i=1}^n \mathbf{x}^{(i)}, \mathbf{\mu} \in \R^d $$
in detail,
$$ \mathbf{\mu} = \begin{bmatrix} \mu_1 \\ \mu_2 \\ \vdots \\ \mu_d \end{bmatrix} = \begin{bmatrix} \frac{1}{n} \sum_{i=1}^n x_1^{(i)} \\ \frac{1}{n} \sum_{i=1}^n x_2^{(i)} \\ \vdots \\ \frac{1}{n} \sum_{i=1}^n x_d^{(i)} \\ \end{bmatrix} $$
For example, suppose we have a \(\mathbf{x}\) that has three two-dimensional samples:
$$ x^{(1)} = \begin{bmatrix} 1 \\ 2 \end{bmatrix},\space x^{(2)} = \begin{bmatrix} 3 \\ 4 \end{bmatrix},\space x^{(3)} = \begin{bmatrix} 5 \\ 6 \end{bmatrix} $$
then the mean vector is:
$$ \mathbf{\mu} = \frac{1}{3} \left( \begin{bmatrix} 1 \\ 2 \end{bmatrix} + \begin{bmatrix} 3 \\ 4 \end{bmatrix} + \begin{bmatrix} 5 \\ 6 \end{bmatrix} \right) = \begin{bmatrix} 3 \\ 4 \end{bmatrix} $$
2. Covariance matrix #
For one-dimensional data:
$$ \mathrm{Var}(x)=\frac1n\sum (x^{(i)} - \bar x)^2 $$
To measure how two variables vary together, we define covariance as:
$$ \mathrm{Cov}(x, y) = \frac{1}{n} \sum_{i=1}^{n} (x^{(i)} - \bar{x}) (y^{(i)} - \bar{y}) $$
For example, we have
$$ x^{(1)} = \begin{bmatrix} 1 \\ 2 \end{bmatrix},\space x^{(2)} = \begin{bmatrix} 3 \\ 4 \end{bmatrix},\space x^{(3)} = \begin{bmatrix} 5 \\ 6 \end{bmatrix} $$
Mark the first dimension as \(x_1\), and the second dimension as \(x_2\), then
$$ \bar{x}_1 = \frac{1 + 3 + 5}{3} = 3,\space \bar{x}_2 = \frac{2 + 4 + 6}{3} = 4 $$
so,
$$ \begin{align*} \mathrm{Cov}(x_1, x_2) &= \frac{1}{3}[(1 - 3)(2 - 4) + (3 - 3)(4 - 4) + (5 - 3)(6 - 4)] \\ &= \frac{1}{3}(4 + 0 + 4) = \frac{8}{3} \end{align*} $$
But for multi-dimensional data, where each sample is a vector, the covariance matrix is defined as:
$$ \Sigma = \frac{1}{n} \sum_{i=1}^{n} (\mathbf{x}^{(i)} - \mathbf{\mu})(\mathbf{x}^{(i)} - \mathbf{\mu})^T $$
where \(\Sigma \in \mathbb{R}^{d \times d}\).
Each entry of \(\Sigma\) represents the covariance between two dimensions:
$$ \Sigma_{jk} = \mathrm{Cov}(x_j, x_k) $$
Therefore, for a two-dimensional vector
$$ \mathbf{x} = \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} $$
the covariance matrix becomes
$$ \Sigma = \begin{bmatrix} \mathrm{Var}(x_1) & \mathrm{Cov}(x_1, x_2) \\ \mathrm{Cov}(x_2, x_1) & \mathrm{Var}(x_2) \end{bmatrix} $$
For example, using the previous dataset:
$$ \mathbf{x}^{(1)} = \begin{bmatrix} 1 \\ 2 \end{bmatrix},\space \mathbf{x}^{(2)} = \begin{bmatrix} 3 \\ 4 \end{bmatrix},\space \mathbf{x}^{(3)} = \begin{bmatrix} 5 \\ 6 \end{bmatrix} $$
We already know that
$$ \mathbf{\mu} = \begin{bmatrix} 3 \\ 4 \end{bmatrix} $$
Then:
$$ \mathbf{x}^{(1)} - \mathbf{\mu} = \begin{bmatrix} -2 \\ -2 \end{bmatrix},\space \mathbf{x}^{(2)} - \mathbf{\mu} = \begin{bmatrix} 0 \\ 0 \end{bmatrix},\space \mathbf{x}^{(3)} - \mathbf{\mu} = \begin{bmatrix} 2 \\ 2 \end{bmatrix} $$
Therefore,
$$ \begin{align*} \Sigma &= \frac{1}{3} \left[ \begin{bmatrix} -2 \\ -2 \end{bmatrix} \begin{bmatrix} -2 & -2 \end{bmatrix} + \begin{bmatrix} 0 \\ 0 \end{bmatrix} \begin{bmatrix} 0 & 0 \end{bmatrix} + \begin{bmatrix} 2 \\ 2 \end{bmatrix} \begin{bmatrix} 2 & 2 \end{bmatrix} \right] \\ &= \frac{1}{3} \left[ \begin{bmatrix} 4 & 4 \\ 4 & 4 \end{bmatrix} + \begin{bmatrix} 0 & 0 \\ 0 & 0 \end{bmatrix} + \begin{bmatrix} 4 & 4 \\ 4 & 4 \end{bmatrix} \right] \\ &= \frac{1}{3} \begin{bmatrix} 8 & 8 \\ 8 & 8 \end{bmatrix} \\ &= \begin{bmatrix} \frac{8}{3} & \frac{8}{3} \\ \frac{8}{3} & \frac{8}{3} \end{bmatrix} \end{align*} $$
Projection onto a Line #
The core idea of Fisher’s Linear Discriminant Analysis is to project high-dimensional samples onto a line:
$$ y = \mathbf{w}^T \mathbf{x} $$
where:
- \(\mathbf{x} \in \mathbb{R}^d\) is the original sample,
- \(\mathbf{w} \in \mathbb{R}^d\) is the projection direction,
- \(y \in \mathbb{R}\) is the projected scalar.
Geometrically, this means projecting all samples onto the line spanned by \(\mathbf{w}\).
Projected Mean #
Suppose the original data has mean vector \(\mathbf{\mu}\).
Then the mean of the projected data is:
$$ \mu_y = \mathbf{w}^T \mathbf{\mu} $$
This follows from the linearity of expectation:
$$ \begin{align*} \mu_y &= \mathbb{E}[y] \\ &= \mathbb{E}[\mathbf{w}^T\mathbf{x}] \\ &= \mathbf{w}^T\mathbb{E}[\mathbf{x}] \\ &= \mathbf{w}^T\mathbf{\mu} \end{align*} $$
Thus, after projection, the center of each class is also projected onto the line.
Projected Variance #
Suppose the original covariance matrix is \(\Sigma\).
Then the variance of the projected data is:
$$ \sigma_y^2 = \mathbf{w}^T \Sigma \mathbf{w} $$
Proof:
$$ \begin{align*} \sigma_y^2 &= \mathrm{Var}(y) \\ &= \mathrm{Var}(\mathbf{w}^T\mathbf{x}) \\ &= \mathbb{E}\left[(\mathbf{w}^T\mathbf{x} - \mathbf{w}^T\mathbf{\mu})^2\right] \\ &= \mathbb{E}\left[(\mathbf{w}^T(\mathbf{x}-\mathbf{\mu}))^2\right] \\ &= \mathbb{E}\left[\mathbf{w}^T(\mathbf{x}-\mathbf{\mu})(\mathbf{x}-\mathbf{\mu})^T\mathbf{w}\right] \\ &= \mathbf{w}^T \mathbb{E}\left[(\mathbf{x}-\mathbf{\mu})(\mathbf{x}-\mathbf{\mu})^T\right] \mathbf{w} \\ &= \mathbf{w}^T \Sigma \mathbf{w} \end{align*} $$
Therefore, the projected variance is a quadratic form of \(\Sigma\).
Fisher Criterion #
Suppose we have two classes:
- Class 0 with mean \(\mathbf{\mu}_0\) and covariance \(\Sigma_0\)
- Class 1 with mean \(\mathbf{\mu}_1\) and covariance \(\Sigma_1\)
After projection onto \(\mathbf{w}\):
- Their projected means become:
$$ m_0 = \mathbf{w}^T \mathbf{\mu}_0,\quad m_1 = \mathbf{w}^T \mathbf{\mu}_1 $$
- Their projected variances become:
$$ s_0^2 = \mathbf{w}^T \Sigma_0 \mathbf{w},\quad s_1^2 = \mathbf{w}^T \Sigma_1 \mathbf{w} $$
After projection, the classification problem becomes one-dimensional.
Therefore, a good projection direction should satisfy:
- The projected class centers are as far apart as possible;
- The projected samples within each class are as concentrated as possible.
Thus, Fisher proposed the following objective:
$$ J(\mathbf{w}) = \frac{(m_0 - m_1)^2}{s_0^2 + s_1^2} $$
where:
- The numerator measures the separation between the projected class means;
- The denominator measures the total spread within the projected classes.
Substituting the projected mean/variance formulas gives:
$$ J(\mathbf{w}) = \frac{(\mathbf{w}^T(\mathbf{\mu}_0-\mathbf{\mu}_1))^2}{\mathbf{w}^T(\Sigma_0+\Sigma_1)\mathbf{w}} $$
Scatter Matrices #
To simplify the notation, we introduce two scatter matrices.
The within-class scatter matrix summarizes the spread of samples inside each class:
$$ S_W = \Sigma_0 + \Sigma_1 $$
The between-class scatter matrix describes the separation between the class centers:
$$ S_B = (\mathbf{\mu}_0-\mathbf{\mu}_1)(\mathbf{\mu}_0-\mathbf{\mu}_1)^T $$
Using these definitions:
$$ \mathbf{w}^T S_B \mathbf{w} = (\mathbf{w}^T(\mathbf{\mu}_0-\mathbf{\mu}_1))^2 $$
and
$$ \begin{align*} \mathbf{w}^T S_W \mathbf{w} &= \mathbf{w}^T \Sigma_0 \mathbf{w} + \mathbf{w}^T \Sigma_1 \mathbf{w} \\ &= s_0^2+s_1^2 \end{align*} $$
Therefore, the Fisher criterion can be rewritten compactly as:
$$ J(\mathbf{w}) = \frac{\mathbf{w}^T S_B \mathbf{w}}{\mathbf{w}^T S_W \mathbf{w}} $$
Optimal Projection Direction #
We now seek the projection direction that maximizes the Fisher criterion:
$$ \mathbf{w}^* = \arg\max_{\mathbf{w}} \frac{\mathbf{w}^T S_B \mathbf{w}}{\mathbf{w}^T S_W \mathbf{w}} $$
Since scaling \(\mathbf{w}\) does not change the value of the ratio,
$$ J(c\mathbf{w}) = J(\mathbf{w}),\quad c\neq 0 $$
only the direction of \(\mathbf{w}\) matters.
Thus, we may impose the constraint
$$ \mathbf{w}^T S_W \mathbf{w}=1 $$
and convert the problem into
$$ \max_{\mathbf{w}} \mathbf{w}^T S_B \mathbf{w} \quad \text{subject to } \mathbf{w}^T S_W \mathbf{w}=1 $$
Lagrangian Derivation #
Construct the Lagrangian:
$$ L(\mathbf{w},\lambda) = \mathbf{w}^T S_B \mathbf{w} - \lambda(\mathbf{w}^T S_W \mathbf{w}-1) $$
Take derivative with respect to \(\mathbf{w}\) and set it to zero:
$$ \frac{\partial L}{\partial \mathbf{w}} = 2S_B\mathbf{w} - 2\lambda S_W\mathbf{w} = 0 $$
Thus:
$$ S_B\mathbf{w} = \lambda S_W\mathbf{w} $$
This is a generalized eigenvalue problem.
Substituting the Form of \(S_B\) #
Recall that
$$ S_B = (\mathbf{\mu}_0-\mathbf{\mu}_1)(\mathbf{\mu}_0-\mathbf{\mu}_1)^T $$
Substitute into the previous equation:
$$ (\mathbf{\mu}_0-\mathbf{\mu}_1)(\mathbf{\mu}_0-\mathbf{\mu}_1)^T\mathbf{w} = \lambda S_W\mathbf{w} $$
Notice that
$$ (\mathbf{\mu}_0-\mathbf{\mu}_1)^T\mathbf{w} $$
is a scalar. Let
$$ c=(\mathbf{\mu}_0-\mathbf{\mu}_1)^T\mathbf{w} $$
Then:
$$ c(\mathbf{\mu}_0-\mathbf{\mu}_1) = \lambda S_W\mathbf{w} $$
Multiply both sides by \(S_W^{-1}\):
$$ \mathbf{w} = \frac{c}{\lambda} S_W^{-1} (\mathbf{\mu}_0-\mathbf{\mu}_1) $$
Since \(\frac{c}{\lambda}\) is only a scalar factor and only the direction matters, we obtain:
$$ \boxed{ \mathbf{w}^* = S_W^{-1}(\mathbf{\mu}_0-\mathbf{\mu}_1) } $$
This is the Fisher Linear Discriminant direction.
Intuitively:
- \((\mathbf{\mu}_0-\mathbf{\mu}_1)\) points toward the difference between the two class centers;
- \(S_W^{-1}\) adjusts this direction by down-weighting directions with large within-class scatter.
Classification Rule #
After obtaining \(\mathbf{w}\), project a new sample \(\mathbf{x}\):
$$ y = \mathbf{w}^T \mathbf{x} $$
A simple decision threshold is the midpoint between the projected class means:
$$ \theta = \frac{m_0 + m_1}{2} $$
where
$$ m_0 = \mathbf{w}^T\mathbf{\mu}_0,\quad m_1 = \mathbf{w}^T\mathbf{\mu}_1 $$
Classification rule:
$$ \begin{cases} y > \theta,\quad \text{Class 1} \\ y \le \theta,\quad \text{Class 0} \end{cases} $$
This means that the sample is assigned to the class whose projected center is closer.
Summary of Fisher’s LDA #
The complete pipeline of Fisher’s LDA is:
- Compute the class means \(\mathbf{\mu}_0,\mathbf{\mu}_1\)
- Compute the covariance matrices \(\Sigma_0,\Sigma_1\)
- Construct scatter matrices:
$$ S_W=\Sigma_0+\Sigma_1,\quad S_B=(\mathbf{\mu}_0-\mathbf{\mu}_1)(\mathbf{\mu}_0-\mathbf{\mu}_1)^T $$
- Solve for the optimal projection direction:
$$ \mathbf{w}^*=S_W^{-1}(\mathbf{\mu}_0-\mathbf{\mu}_1) $$
- Project samples onto the line:
$$ y=\mathbf{w}^T\mathbf{x} $$
- Classify by thresholding:
$$ y \mathop{\gtrless}_{\text{Class 0}}^{\text{Class 1}} \theta $$
Thus, Fisher’s LDA converts a high-dimensional binary classification problem into a one-dimensional thresholding problem.