Misc math notes

Useful tidbits

Logarithms

Log rules
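
For reference, the standard identities:

$\log(xy) = \log x + \log y$

$\log(x/y) = \log x - \log y$

$\log(x^n) = n\log x$

$\log_b x = \frac{\ln x}{\ln b}$ (change of base)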

Incremental mean

$\mu_k = \frac{1}{k}\sum_{j=1}^{k} x_j$

$=\frac{1}{k}\left(x_k + \sum_{j=1}^{k-1}x_j \right)$

$=\frac{1}{k}(x_k + (k-1)\mu_{k-1})$

$=\mu_{k-1}+\frac{1}{k}(x_k - \mu_{k-1})$
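
The last line is the useful form: a running mean that needs only the previous mean and the new sample. A minimal sketch in Python (not from the original notes):

```python
def running_mean(xs):
    """Incremental mean: mu_k = mu_{k-1} + (x_k - mu_{k-1}) / k."""
    mu = 0.0
    for k, x in enumerate(xs, start=1):
        mu += (x - mu) / k  # fold in the k-th sample
    return mu

# Matches the batch mean:
assert abs(running_mean([1.0, 2.0, 3.0, 4.0]) - 2.5) < 1e-12
```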

Numerical integration

For $\frac{dy}{dt}=f(t,y)$:
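
A minimal sketch of the simplest scheme, forward Euler, assuming a fixed step size $h$: $y_{n+1} = y_n + h\, f(t_n, y_n)$.

```python
def euler(f, t0, y0, h, n_steps):
    """Forward Euler for dy/dt = f(t, y): y_{n+1} = y_n + h * f(t_n, y_n)."""
    t, y = t0, y0
    for _ in range(n_steps):
        y = y + h * f(t, y)
        t = t + h
    return t, y

# Example: dy/dt = y with y(0) = 1; y(1) should be close to e ≈ 2.718.
t1, y1 = euler(lambda t, y: y, t0=0.0, y0=1.0, h=0.001, n_steps=1000)
```

Higher-order methods (e.g. Runge–Kutta) follow the same pattern with a more accurate update per step.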

Derivatives

Multivariate chain rule

For a function $z(x(t),y(t))$, $\frac{dz}{dt} = \frac{\partial z}{\partial x}\frac{dx}{dt} + \frac{\partial z}{\partial y}\frac{dy}{dt}$
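
For example, with $z = xy$, $x = \cos t$, $y = \sin t$: $\frac{dz}{dt} = y(-\sin t) + x(\cos t) = \cos^2 t - \sin^2 t = \cos 2t$, which matches differentiating $z = \cos t \sin t = \frac{1}{2}\sin 2t$ directly.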

Geometry

Support vector machine

Goal: Find a “maximum-margin hyperplane” that separates the classes correctly while maximizing the distance to the nearest data points.

Data that’s not linearly separable

Kernel trick

Replace dot products with nonlinear kernel functions to transform the data into an implicit feature space, in which a linear separating hyperplane can be computed. The idea is that we want to do normal linear SVM on a set of transformed data points $\psi(x_i)$. We are given a kernel function that satisfies $k(x_i, x_j)= \psi(x_i) \cdot \psi(x_j)$.

Instead of explicitly calculating $\psi(x_i)$ for each data point, which may be expensive when there are many high-dimensional data points, we can calculate the kernel $k(x_i, x_j)$ directly. This is cheaper because the dot product of two vectors is a single scalar, so instead of transforming $N$ vectors of dimension $D$ we only need $N$ scalars (each one “comparing” a data point to the vector $w$).
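
As an illustrative sketch (this particular kernel and feature map are just an example): in 2-D, the quadratic kernel $k(x, y) = (x \cdot y)^2$ corresponds to the explicit map $\psi(x) = (x_1^2, \sqrt{2}\,x_1 x_2, x_2^2)$, and computing $k$ directly never forms $\psi$:

```python
import numpy as np

def poly_kernel(x, y):
    """k(x, y) = (x . y)^2, computed without an explicit feature map."""
    return np.dot(x, y) ** 2

def psi(x):
    """Explicit feature map whose dot products reproduce the kernel:
    psi(x) . psi(y) = (x . y)^2 for 2-D x, y."""
    return np.array([x[0] ** 2, np.sqrt(2) * x[0] * x[1], x[1] ** 2])

x = np.array([1.0, 2.0])
y = np.array([3.0, 0.5])
assert np.isclose(poly_kernel(x, y), np.dot(psi(x), psi(y)))
```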

Shortest/perpendicular distance from a hyperplane to a point: $d = \frac{|w \cdot x + b|}{ ||w|| }$. So with the bias trick, $w \cdot x$ encodes both the “distance” (unnormalized by $||w||$) and the side of the separator that $x$ is on. What the kernel does is define a new “distance metric” between the hyperplane and the data point; this metric is implicitly defined in a higher-dimensional space by the kernel function. See the Stanford notes for examples.
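
A quick numeric check of the distance formula, with made-up values for $w$, $b$, and $x$:

```python
import numpy as np

w = np.array([3.0, 4.0])  # hyperplane normal (example values)
b = -5.0                  # bias
x = np.array([1.0, 1.0])  # query point

# d = |w . x + b| / ||w||
d = abs(np.dot(w, x) + b) / np.linalg.norm(w)
print(d)  # |3 + 4 - 5| / 5 = 0.4
```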

References