Linear Algebra

TODO

References

Much of the material here is from the links below:

Operations

Outer product

TODO

Properties of transpose

Properties of Jacobians

Properties of positive (semi-)definite matrices

Complexity of operations

In general, optimized implementations of even simple operations like matrix multiplication can be 10-100x faster than a naive implementation.
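As a rough illustration, here is a minimal sketch (not a rigorous benchmark; the size, timings, and speedup depend on the machine and the BLAS library NumPy links against) comparing a naive triple-loop matrix multiply against NumPy's optimized `@` operator:

import time
import numpy as np

def naive_matmul(A, B):
    # Textbook triple loop: no blocking, vectorization, or cache awareness
    n, k = A.shape
    _, m = B.shape
    C = np.zeros((n, m))
    for i in range(n):
        for j in range(m):
            for p in range(k):
                C[i, j] += A[i, p] * B[p, j]
    return C

n = 300                                  # kept small so the naive loop finishes quickly
A, B = np.random.rand(n, n), np.random.rand(n, n)

t0 = time.perf_counter(); C1 = naive_matmul(A, B); t_naive = time.perf_counter() - t0
t0 = time.perf_counter(); C2 = A @ B; t_blas = time.perf_counter() - t0
print(f"naive: {t_naive:.2f}s  optimized: {t_blas:.5f}s  speedup: {t_naive / t_blas:.0f}x")
print(np.allclose(C1, C2))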

Solutions to linear systems

Linear algebra view of linear regression

Normal equation

Recall that we want to find $\hat{x}$ that minimizes: \(\| Ax - b \|_2\)

Another way to think about this is that we are interested in where vector $b$ is closest to the subspace spanned by $A$ (called the range of $A$). This is the projection of $b$ onto $A$. Since $b - A\hat{x}$ must be perpendicular to the subspace spanned by $A$, we see that

\[A^T (b - A\hat{x}) = 0\]

(we use $A^T$ because we want to take the dot product of each column of $A$ with $b - A\hat{x}$).

Note: $b-A\hat{x}$ is the vector from $A\hat{x}$ to $b$, which is perpendicular to the range of $A$.

This leads us to the normal equations \(A^TA\hat{x} = A^T b\), whose solution is \(\hat{x} = (A^TA)^{-1}A^T b\)
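A minimal sketch of solving least squares via the normal equations with NumPy (assuming a tall, full-rank $A$; the sizes here are hypothetical, and note that forming $A^TA$ squares the condition number, which is why QR or SVD is usually preferred in practice):

import numpy as np

m, n = 100, 5                                # hypothetical sizes: tall, full-rank A
A, b = np.random.rand(m, n), np.random.rand(m)

x_hat = np.linalg.solve(A.T @ A, A.T @ b)    # solve (A^T A) x = A^T b

x_ref, *_ = np.linalg.lstsq(A, b, rcond=None)   # reference least-squares solution
print(np.allclose(x_hat, x_ref))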

Underdetermined systems

$Ax=b$ where $A \in R^{m \times n}, m < n$. i.e. $A$ is “fat”

There are infinitely many solutions; any two differ by an element of the nullspace of $A$. Usually we want the least-norm solution, i.e. the one that minimizes $\|x\|$. One way to get it uses the right inverse: $x_{LN} = A^T(AA^T)^{-1}b$ ($AA^T$ is invertible since $A$ is full rank).

Proof

  1. Let $x$ be any solution other than $x_{LN}$. $Ax=b\text{, so } (Ax-Ax_{LN})=(b-b)=0$.
  2. Take the projection of difference between this solution and the least-norm solution, onto the least-norm solution.

\((x- x_{LN})^T x_{LN} = (x-x_{LN})^T A^T (AA^T)^{-1} b = (A(x-x_{LN}))^T (AA^T)^{-1} b = 0\) \(\Rightarrow (x-x_{LN}) \bot x_{LN}\)

  3. Then, taking the norm of the solution $x$,

\(\|x\|^2 = \|x_{LN} + (x - x_{LN})\|^2\) \(= \|x_{LN}\|^2 + \|x-x_{LN}\|^2 \geq \|x_{LN}\|^2\)

i.e. $x_{LN}$ has the smallest norm of any solution, and $x_{LN} \bot N(A)$ ($x_{LN}$ is the projection of $0$ onto the solution set).

$I-A^T(AA^T)^{-1}A$ gives the projection onto $N(A)$.
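A small numeric check of the least-norm formula and the nullspace projector for a fat, full-rank $A$ (a sketch; the sizes are arbitrary):

import numpy as np

m, n = 3, 6                                  # fat system: more unknowns than equations
A, b = np.random.rand(m, n), np.random.rand(m)

x_ln = A.T @ np.linalg.solve(A @ A.T, b)     # x_LN = A^T (A A^T)^{-1} b
print(np.allclose(A @ x_ln, b))              # x_LN actually solves the system
print(np.allclose(x_ln, np.linalg.pinv(A) @ b))  # and matches the pseudo-inverse solution

P_null = np.eye(n) - A.T @ np.linalg.solve(A @ A.T, A)   # projector onto N(A)
z = P_null @ np.random.rand(n)
print(np.allclose(A @ z, 0))                 # projected vectors are mapped to ~0 by A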

Matrix decompositions

Why use decompositions? They make operations faster.

Linear regression via QR has been recommended by numerical analysts as the standard method for years. It is natural, elegant, and good for "daily use". The SVD is the most robust.

QR decomposition

\(A x = b\) \(A = Q R\) \(Q R x = b\) \(R x = Q^T b\)
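A minimal sketch of the QR-based solve with NumPy and SciPy (the reduced QR, plus a triangular back-substitution; sizes are hypothetical):

import numpy as np
from scipy.linalg import solve_triangular

A, b = np.random.rand(100, 5), np.random.rand(100)   # hypothetical tall, full-rank A

Q, R = np.linalg.qr(A)               # reduced QR: Q has orthonormal columns, R is upper-triangular
x = solve_triangular(R, Q.T @ b)     # back-substitution for R x = Q^T b

print(np.allclose(x, np.linalg.lstsq(A, b, rcond=None)[0]))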

Gram-Schmidt gets $R$ (triangular) as a result of constructing $Q$ (orthogonal) using a succession of triangular operations, while Householder gets $Q$ (orthogonal) as a result of constructing $R$ (triangular) through orthogonal operations.

Householder reflections lead to a more nearly orthogonal matrix $Q$ in the presence of rounding errors.

Gram-Schmidt can be stopped part-way, leaving a reduced QR factorization of the first $n$ columns of $A$.

Householder algorithm

Householder matrices are orthogonal matrices (they are reflections) that are convenient for introducing zeros into a matrix, in the same way that Gauss transformations (used in LU) are.

A Householder matrix is defined by a nonzero vector $u$; it reflects vectors across the hyperplane orthogonal to $u$ (flipping the component along $u$).

\[H = I - 2 \frac{u u^T}{\|u\|^2}\]

Find a sequence of Householder reflections (orthogonal matrices) that turn $A$ into an upper-triangular matrix $R$: \(Q_n \cdots Q_2 Q_1 A = R\) Since each $Q_i$ is orthogonal and symmetric ($Q_i^{-1} = Q_i^T = Q_i$), inverting the product reverses the order and gives \(A = Q_1Q_2 \cdots Q_n R\) \(Q = Q_1Q_2 \cdots Q_n\)
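A compact, unpivoted Householder QR sketch in NumPy (illustrative only; the sign choice in the reflector is the usual trick to avoid cancellation):

import numpy as np

def householder_qr(A):
    m, n = A.shape
    R = A.astype(float)
    Q = np.eye(m)
    for k in range(n):
        x = R[k:, k]
        u = x.copy()
        u[0] += np.copysign(np.linalg.norm(x), x[0])   # sign chosen to avoid cancellation
        norm_u = np.linalg.norm(u)
        if norm_u == 0:
            continue                                   # column already zero below the diagonal
        u /= norm_u
        R[k:, k:] -= 2.0 * np.outer(u, u @ R[k:, k:])  # apply H = I - 2 u u^T on the left
        Q[:, k:] -= 2.0 * np.outer(Q[:, k:] @ u, u)    # accumulate Q = Q_1 Q_2 ... Q_n
    return Q, R

A = np.random.rand(6, 4)
Q, R = householder_qr(A)
print(np.allclose(Q @ R, A), np.allclose(Q.T @ Q, np.eye(6)))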

Singular Value Decomposition

\(A x = b\) \(A = U \Sigma V^T\) \(\Sigma V^T x = U^T b\) \(\Sigma w = U^T b\) \(x = V w\)
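The same least-squares solve via the reduced SVD (a sketch; assumes the singular values are nonzero, sizes are hypothetical):

import numpy as np

A, b = np.random.rand(100, 5), np.random.rand(100)   # hypothetical tall A

U, s, Vt = np.linalg.svd(A, full_matrices=False)     # reduced SVD: A = U diag(s) V^T
w = (U.T @ b) / s                                    # solve Sigma w = U^T b (Sigma is diagonal)
x = Vt.T @ w                                         # x = V w

print(np.allclose(x, np.linalg.lstsq(A, b, rcond=None)[0]))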

We usually talk about the SVD in terms of matrices, \(A = U \Sigma V^T\), but we can also think about it in terms of vectors. The SVD gives us sets of orthonormal vectors $\{v_j\}$ and $\{u_j\}$ such that \(A v_j = \sigma_j u_j\). Note that this is very similar to the definition of eigenvalues/eigenvectors, except that $v_j$ doesn't have to be the same as $u_j$. The $\sigma_j$ are scalars, called singular values.

Relationship between SVD and Eigen Decomposition: the left-singular vectors of A are the eigenvectors of $AA^T$. The right-singular vectors of A are the eigenvectors of $A^T A$. The non-zero singular values of A are the square roots of the eigenvalues of $A^T A$ (and $A A^T$).

The SVD is a generalization of the eigendecomposition. Not all matrices have an eigendecomposition, but ALL matrices have an SVD.

Note: by the definition of eigenvectors/eigenvalues, $Av = \lambda v$, it only makes sense for square matrices to have eigendecompositions.

Vocab: a Hermitian matrix is one that is equal to its own conjugate transpose. In the case of real-valued matrices (which is all we are considering in this course), Hermitian means the same as symmetric.

The eigendecomposition can be derived from a fundamental property of eigenvectors: \(Av=\lambda v \rightarrow AQ=Q\Lambda \rightarrow A = Q\Lambda Q^{-1}\)
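A quick numeric check of these relationships (a sketch with a random matrix; `np.linalg.eigh` is used since $A^TA$ is symmetric):

import numpy as np

A = np.random.rand(5, 3)
U, s, Vt = np.linalg.svd(A, full_matrices=False)

evals, Q = np.linalg.eigh(A.T @ A)                 # eigendecomposition of the symmetric matrix A^T A
print(np.allclose(np.sort(evals)[::-1], s**2))     # eigenvalues of A^T A = squared singular values of A
print(np.allclose(Q @ np.diag(evals) @ Q.T, A.T @ A))   # A^T A = Q Lambda Q^{-1} (Q orthogonal, so Q^{-1} = Q^T)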

Relevant Theorems:

Applications:

LU factorization

Cholesky factorization

From the normal equations: $A^TA x = A^T b$. If $A$ has full rank, $A^TA$ is a square, hermitian, positive definite matrix. The standard way of solving such a system is Cholesky factorization, which finds an upper-triangular $R$ s.t. $A^TA = R^TR$.

\(A^T A x = A^T b\) \(R^T R x = A^T b\) \(R^T w = A^T b\) \(R x = w\)

(If $A$ itself is symmetric positive definite, we can skip the normal equations and apply Cholesky directly to $A$ to solve $Ax = b$.)
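A sketch of the Cholesky-based normal-equations solve using SciPy's `cho_factor`/`cho_solve` (which handle the two triangular solves; sizes are hypothetical):

import numpy as np
from scipy.linalg import cho_factor, cho_solve

A, b = np.random.rand(100, 5), np.random.rand(100)   # hypothetical tall, full-rank A

c, low = cho_factor(A.T @ A)          # Cholesky factorization of A^T A
x = cho_solve((c, low), A.T @ b)      # solves R^T w = A^T b, then R x = w

print(np.allclose(x, np.linalg.lstsq(A, b, rcond=None)[0]))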

Power method

An iterative method for finding eigenvalues and eigenvectors. It doesn't compute a decomposition, so it can be used for very large matrices.

Eigenvectors of a matrix are: vectors that, when multiplied by the matrix, don’t change direction and only get scaled: $Av=\lambda v$.

The intuition behind the power method is to repeatedly multiply an (initially random) vector by the matrix and watch how the vector changes over iterations. The component of the vector along the dominant (largest-magnitude) eigenvector gets "scaled" the most. To make sure the vector doesn't explode in magnitude, we normalize it at each step.

import numpy as np

# A is the (square) matrix of interest and n_iter the number of iterations, assumed given
y = np.random.rand(A.shape[0])     # start from a random vector
y /= np.linalg.norm(y)             # normalize

for i in range(n_iter):
	y = A @ y                      # multiply by the matrix
	y /= np.linalg.norm(y)         # normalize so the iterate doesn't blow up

eigvec = y                         # estimate of the dominant eigenvector
eigval = (A @ y) @ y / (y @ y)     # Rayleigh quotient: estimate of the dominant eigenvalue

This can also be used to compute the spectral norm of a matrix, which is equal to the largest singular value (the square root of the largest eigenvalue of $A^TA$).

To find smaller-magnitude eigenvalues, subtract the components along the known eigenvectors from the vector at every iteration (this isn't done that often, since the method converges slowly).

Starting from a vector $b$, the (unnormalized) iterates of the power method form the sequence

\[Ab,\,A^2b,\,A^3b,\,A^4b,\, \ldots\]

The matrix with those vectors as columns is called the Krylov matrix, and the space spanned by those vectors is the Krylov subspace.
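For illustration, a sketch that builds the Krylov matrix explicitly (in practice Krylov-subspace methods never form it; they only need matrix-vector products):

import numpy as np

def krylov_matrix(A, b, k):
    # Columns are Ab, A^2 b, ..., A^k b
    cols = [A @ b]
    for _ in range(k - 1):
        cols.append(A @ cols[-1])
    return np.column_stack(cols)

A, b = np.random.rand(4, 4), np.random.rand(4)
K = krylov_matrix(A, b, 4)
print(K.shape)   # (4, 4)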

Numerical conditioning

Misc