Misc math notes

Useful tidbits

Logarithms

Log rules
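
For reference, the standard identities:

$\log(xy) = \log x + \log y$

$\log(x/y) = \log x - \log y$

$\log(x^n) = n\log x$

$\log_b x = \frac{\ln x}{\ln b}$ (change of base)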

Incremental mean

$\mu_k = \frac{1}{k}\sum_{j=1}^{k} x_j$

$=\frac{1}{k}\left(x_k + \sum_{j=1}^{k-1}x_j \right)$

$=\frac{1}{k}(x_k + (k-1)\mu_{k-1})$

$=\mu_{k-1}+\frac{1}{k}(x_k - \mu_{k-1})$
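
The last line is the useful form: a running mean that needs only the previous mean and the new sample. A minimal sketch in Python (not from the original notes):

```python
def running_mean(xs):
    """Incremental mean: mu_k = mu_{k-1} + (x_k - mu_{k-1}) / k."""
    mu = 0.0
    for k, x in enumerate(xs, start=1):
        mu += (x - mu) / k  # fold in the k-th sample
    return mu

# Matches the batch mean:
assert abs(running_mean([1.0, 2.0, 3.0, 4.0]) - 2.5) < 1e-12
```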

Numerical integration

For $\frac{dy}{dt}=f(t,y)$:
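
A minimal sketch of the simplest scheme, forward Euler, assuming a fixed step size $h$: $y_{n+1} = y_n + h\, f(t_n, y_n)$.

```python
def euler(f, t0, y0, h, n_steps):
    """Forward Euler for dy/dt = f(t, y): y_{n+1} = y_n + h * f(t_n, y_n)."""
    t, y = t0, y0
    for _ in range(n_steps):
        y = y + h * f(t, y)
        t = t + h
    return t, y

# Example: dy/dt = y with y(0) = 1; y(1) should be close to e ≈ 2.718.
t1, y1 = euler(lambda t, y: y, t0=0.0, y0=1.0, h=0.001, n_steps=1000)
```

Higher-order methods (e.g. Runge–Kutta) follow the same pattern with a more accurate update per step.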

Derivatives

Multivariate chain rule

For a function $z(x(t),y(t))$, $\frac{dz}{dt} = \frac{\partial z}{\partial x}\frac{dx}{dt} + \frac{\partial z}{\partial y}\frac{dy}{dt}$
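
For example, with $z = xy$, $x = \cos t$, $y = \sin t$: $\frac{dz}{dt} = y(-\sin t) + x(\cos t) = \cos^2 t - \sin^2 t = \cos 2t$, which matches differentiating $z = \cos t \sin t = \frac{1}{2}\sin 2t$ directly.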

Geometry

Support vector machine

Goal: Find a “maximum-margin hyperplane” that separates the classes correctly while maximizing the distance to the nearest data points.

Data that’s not linearly separable

Kernel trick

Replace dot products with nonlinear kernel functions to transform the data into an implicit feature space, in which a linear separating hyperplane can be computed. The idea is that we want to do normal linear SVM on a set of transformed data points $\psi(x_i)$. We are given a kernel function that satisfies $k(x_i, x_j)= \psi(x_i) \cdot \psi(x_j)$.

Instead of explicitly calculating $\psi(x_i)$ for each data point, which may be expensive when there are many high-dimensional data points, we can calculate the kernel $k(x_i, x_j)$ directly. This is cheaper because the dot product of two vectors is a single scalar, so instead of transforming $N$ vectors of dimension $D$ we only need $N$ scalars (each one “comparing” a data point to the vector $w$).
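
As an illustrative sketch (this particular kernel and feature map are just an example): in 2-D, the quadratic kernel $k(x, y) = (x \cdot y)^2$ corresponds to the explicit map $\psi(x) = (x_1^2, \sqrt{2}\,x_1 x_2, x_2^2)$, and computing $k$ directly never forms $\psi$:

```python
import numpy as np

def poly_kernel(x, y):
    """k(x, y) = (x . y)^2, computed without an explicit feature map."""
    return np.dot(x, y) ** 2

def psi(x):
    """Explicit feature map whose dot products reproduce the kernel:
    psi(x) . psi(y) = (x . y)^2 for 2-D x, y."""
    return np.array([x[0] ** 2, np.sqrt(2) * x[0] * x[1], x[1] ** 2])

x = np.array([1.0, 2.0])
y = np.array([3.0, 0.5])
assert np.isclose(poly_kernel(x, y), np.dot(psi(x), psi(y)))
```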

Shortest/perpendicular distance from a hyperplane to a point: $d = \frac{|w \cdot x + b|}{ ||w|| }$. So with the bias trick, $w \cdot x$ encodes both the “distance” (unnormalized by $||w||$) and the side of the separator that $x$ is on. What the kernel does is define a new “distance metric” between the hyperplane and the data point; this metric is implicitly defined in a higher-dimensional space by the kernel function. See the Stanford notes for examples.
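
A quick numeric check of the distance formula, with made-up values for $w$, $b$, and $x$:

```python
import numpy as np

w = np.array([3.0, 4.0])  # hyperplane normal (example values)
b = -5.0                  # bias
x = np.array([1.0, 1.0])  # query point

# d = |w . x + b| / ||w||
d = abs(np.dot(w, x) + b) / np.linalg.norm(w)
print(d)  # |3 + 4 - 5| / 5 = 0.4
```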

References