Forward-Mode AD via High Dimensional Algebras

Machine Epsilon and Roundoff Error

Floating point arithmetic is relatively scaled, which means that the precision that you get from calculations is relative to the size of the floating point numbers. Generally, you have 16 digits of accuracy in (64-bit) floating point operations. To measure this, we define machine epsilon as the value by which 1 + E = 1. For floating point numbers, this is:

eps(Float64)

2.220446049250313e-16


However, since it's relative, this value changes as we change our reference value:

@show eps(1.0)

eps(1.0) = 2.220446049250313e-16

@show eps(0.1)

eps(0.1) = 1.3877787807814457e-17

@show eps(0.01)

eps(0.01) = 1.734723475976807e-18
1.734723475976807e-18


Thus issues with roundoff error come when one subtracts out the higher digits. For example, $(x + \epsilon) - x$ should just be $\epsilon$ if there was no roundoff error, but if $\epsilon$ is small then this kicks in. If $x = 1$ and $\epsilon$ is of size around $10^{-10}$, then $x+ \epsilon$ is correct for 10 digits, dropping off the smallest 10 due to error in the addition to $1$. But when you subtract off $x$, you don't get those digits back, and thus you only have 6 digits of $\epsilon$ correct.

Let's see this in action:

ϵ = 1e-10rand()
@show ϵ

ϵ = 5.7848124379111953e-11

@show (1+ϵ)

1 + ϵ = 1.0000000000578482

ϵ2 = (1+ϵ) - 1
(ϵ - ϵ2)

-4.631898182815095e-17


See how $\epsilon$ is only rebuilt at accuracy around $10^{-16}$ and thus we only keep around 6 digits of accuracy when it's generated at the size of around $10^{-10}$!

Finite Differencing and Numerical Stability

To start understanding how to compute derivatives on a computer, we start with finite differencing. For finite differencing, recall that the definition of the derivative is:

$f'(x) = \lim_{\epsilon \rightarrow \infty} \frac{f(x+\epsilon)-f(x)}{\epsilon}$

Finite differencing directly follows from this definition by choosing a small $\epsilon$. However, choosing a good $\epsilon$ is very difficult. If $\epsilon$ is too large than there is error since this definition is asymtopic. However, if $\epsilon$ is too small, you receive roundoff error. To understand why you would get roundoff error, recall that floating point error is relative, and can essentially store 16 digits of accuracy. So let's say we choose $\epsilon = 10^{-6}$. Then $f(x+\epsilon) - f(x)$ is roughly the same in the first 6 digits, meaning that after the subtraction there is only 10 digits of accuracy, and then dividing by $10^{-6}$ simply brings those 10 digits back up to the correct relative size.

This means that we want to choose $\epsilon$ small enough that the $\mathcal{O}(\epsilon^2)$ error of the truncation is balanced by the $O(1/\epsilon)$ roundoff error. Under some minor assumptions, one can argue that the average best point is $\sqrt(E)$, where E is machine epsilon

@show eps(Float64)

eps(Float64) = 2.220446049250313e-16

@show sqrt(eps(Float64))

sqrt(eps(Float64)) = 1.4901161193847656e-8
1.4901161193847656e-8


This means we should not expect better than 8 digits of accuracy, even when things are good with finite differencing.

The centered difference formula is a little bit better, but this picture suggests something much better...

Differencing in a Different Dimension: Complex Step Differentiation

The problem with finite differencing is that we are mixing our really small number with the really large number, and so when we do the subtract we lose accuracy. Instead, we want to keep the small perturbation completely separate.

To see how to do this, assume that $x \in \mathbb{R}$ and assume that $f$ is complex analytic. You want to calculate a real derivative, but your function just happens to also be complex analytic when extended to the complex plane. Thus it has a Taylor series, and let's see what happens when we expand out this Taylor series purely in the complex direction:

$f(x+ih) = f(x) + f'(x)ih + \mathcal{O}(h^2)$

which we can re-arrange as:

$if'(x) = \frac{f(x+ih) - f(x)}{h} + \mathcal{O}(h)$

Since $x$ is real and $f$ is real-valued on the reals, $if'$ is purely imaginary. So let's take the imaginary parts of both sides:

$f'(x) = \frac{Im(f(x+ih))}{h} + \mathcal{O}(h)$

since $Im(f(x)) = 0$ (since it's real valued!). Thus with a sufficiently small choice of $h$, this is the complex step differentiation formula for calculating the derivative.

But to understand the computational advantage, recal that $x$ is pure real, and thus $x+ih$ is an imaginary number where the $h$ never directly interacts with $x$ since a complex number is a two dimensional number where you keep the two pieces separate. Thus there is no numerical cancellation by using a small value of $h$, and thus, due to the relative precision of floating point numbers, both the real and imaginary parts will be computed to (approximately) 16 digits of accuracy for any choice of $h$.

Derivatives as nilpotent sensitivities

The derivative measures the sensitivity of a function, i.e. how much the function output changes when the input changes by a small amount $\epsilon$:

$f(a + \epsilon) = f(a) + f'(a) \epsilon + o(\epsilon).$

In the following we will ignore higher-order terms; formally we set $\epsilon^2 = 0$. This form of analysis can be made rigorous through a form of non-standard analysis called Smooth Infinitesimal Analysis [1], though note that nilpotent infinitesimal requires constructive logic, and thus proof by contradiction is not allowed in this logic due to a lack of the law of the excluded middle.

A function $f$ will be represented by its value $f(a)$ and derivative $f'(a)$, encoded as the coefficients of a degree-1 (Taylor) polynomial in $\epsilon$:

$f \rightsquigarrow f(a) + \epsilon f'(a)$

Conversely, if we have such an expansion in $\epsilon$ for a given function $f$, then we can identify the coefficient of $\epsilon$ as the derivative of $f$.

Dual numbers

Thus, to extend the idea of complex step differentiation beyond complex analytic functions, we define a new number type, the dual number. A dual number is a multidimensional number where the sensitivity of the function is propagated along the dual portion.

Here we will now start to use $\epsilon$ as a dimensional signifier, like $i$, $j$, or $k$ for quaternion numbers. In order for this to work out, we need to derive an appropriate algebra for our numbers. To do this, we will look at Taylor series to make our algebra reconstruct differentiation.

Note that the chain rule has been explicitly encoded in the derivative part.

$f(a + \epsilon) = f(a) + \epsilon f'(a)$

to first order. If we have two functions

$f \rightsquigarrow f(a) + \epsilon f'(a)$

$g \rightsquigarrow g(a) + \epsilon g'(a)$

then we can manipulate these Taylor expansions to calculate combinations of these functions as follows. Using the nilpotent algebra, we have that:

$(f + g) = [f(a) + g(a)] + \epsilon[f'(a) + g'(a)]$

$(f \cdot g) = [f(a) \cdot g(a)] + \epsilon[f(a) \cdot g'(a) + g(a) \cdot f'(a) ]$

From these we can infer the derivatives by taking the component of $\epsilon$. These also tell us the way to implement these in the computer.

To encode this information in Julia, we introduce a new number type, called Dual, containing a value and derivative at some point $a$ (which is not usually explicitly recorded):

struct Dual
val::Float64
partial::Float64
end

val(x::Dual) = x.val
partial(x::Dual) = x.partial

partial (generic function with 1 method)


Now we define the rules of dual number arithmetic:

import Base: +, *

+(f::Dual, g::Dual) = Dual(val(f) + val(g),
partial(f) + partial(g))

*(f::Dual, g::Dual) = Dual(val(f) * val(g),
val(f)*partial(g) + val(g)*partial(f))

* (generic function with 351 methods)


To speed up our derivative function, we can directly hardcode the derivative of known functions which we call primitives. For example, for sin we can add a rule like:

$\sin(a + \epsilon b) = \sin(a) + \epsilon b \cos(a).$

Base.sin(f::Dual) = Dual(sin(val(f)), cos(val(f)) * partial(f))


For functions where we don't have a rule, we can recursively do dual number arithmetic within the function until we hit primitives where we know the derivative, and then use the chain rule to propagate the information back up.

References

• John L. Bell, An Invitation to Smooth Infinitesimal Analysis, http://publish.uwo.ca/~jbell/invitation%20to%20SIA.pdf

• Bell, John L. A Primer of Infinitesimal Analysis

• Nocedal & Wright, Numerical Optimization, Chapter 8

• Griewank & Walther, Evaluating Derivatives

Many thanks to David Sanders for helping make these lecture notes.