Mixing Differential Equations and Neural Networks for Physics-Informed Learning

Chris Rackauckas
November 19th, 2019

Given this background in both neural network and differential equation modeling, let's take a moment to survey some methods which integrate the two ideas. Neural differential equations are of course one possible method, but there are methods which utilize the composition of these ideas. Some are good, some are not.

Julia codes for these methods are being developed, optimized, and tested in the NeuralNetDiffEq.jl package. A generalized neural differential equation solver can be found at DiffEqFlux.jl. Note that as a standard and maintained Julia package, these methods are available today for solving these kinds of problems with these neural-enhanced methods.

Generalized Neural Differential Equations

While our previous lectures focused on ordinary differential equations, the larger classes of differential equations can also have neural networks, for example:

For each of these equations, one can come up with an adjoint definition in order to define a backpropogation. However, there are limitatinos to this. In many cases, such as with stochastic differential equations, the time complexity of the continuous adjoint method is actually worse than backpropogation through the solver itself. Secondly, every single one of these equations requires an adjoint method, and some of them need different adjoint methods in different contexts (state-dependent delay differential equations vs constant-time delay differential equations).

DiffEqFlux.jl is the first library to support the wide gambit of possible differential equations. It does so by using Julia's language-wide AD tooling, such as Tracker.jl, ForwardDiff.jl, and Zygote.jl, along with specializations available whenever adjoint methods are known (and the choice between the two is given to the user).

For example, a neural stochastic differential equation can be trained using the following example, and known differential equations can be mixed with unknown portions. Additionally, this blog post showcases the mixing of event handling with neural stochastic differential equations and using stiff solvers for ODEs to train neural partial differential equations.

Many of the methods below can be encapsolated as a choice of a neural differential equation and trained with higher order, adaptive, and more efficient methods with DiffEqFlux.jl. A paper detailing this is to come out in the next few weeks.

Artificial neural networks for solving ordinary and partial differential equations

Solving Ordinary Differential Equations

This is a result due to Lagaris et. al from 1998. The idea is to solve differential equations using neural networks by representing the solution by a neural network and training the resulting network to satisfy the conditions required by the differential equation.

Let's say we want to solve a system of ordinary differential equations

\[ u' = f(u,t) \]

with $t \in [0,1]$ and a known initial condition $u(0)=u_0$. To solve this, we approximate the solution by a neural network:

\[ N(t) \approx u(t) \]

If $N(t)$ was the true solution, then it would hold that $N'(t) = f(N(t),t)$ for all $t$. Thus we turn this condition into our loss function. This motivates the loss function:

\[ L(p) = \sum_i \left(\frac{dN(t_i)}{dt} - f(N(t_i),t_i) \right)^2 \]

The choice of $t_i$ could be done in many ways: it can be random, it can be a grid, etc. Anyways, when this loss function is minimized (gradients computed with standard reverse-mode automatic differentiation), then we have that $\frac{dN(t_i)}{dt} \approx f(N(t_i),t_i)$ and thus $N(t)$ approximately solves the differential equation.

Quick Question for Understanding: What is an effective way to compute dN/dt?

If you thought reverse-mode automatic differentiation, go back and think about why that is incorrect! Hint: what is the dimensionality of the input and the output?

Note that we still have to handle the initial condition. One simple way to do this is to add an initial condition term to the cost function. While that would work, it can be more efficient to encode the initial condition into the function itself so that it's trivially satisfied for any possible set of parameters. For example, instead of directly using a neural network, we can use:

\[ T(t) = u_0 + tN(t) \]

as our solution. Notice that this will always satisfy the initial condition, so if we train this to satisfy the derivative function then it will automatically be a solution to the derivative function.

Solving Partial Differential Equations

This same idea can be employed to solve partial differential equations. For example, take the Poisson equation:

\[ \Delta u = f(x,y) \]

for the 2-dimensional Poisson equation with $x \in [0,1]$ and $y \in [0,1]$. Assume that the boundary conditions are Dirichlet, i.e. $u(0,y) = u_{0x}(y)$, $u(1,y) = u_{1x}(y)$, $u(x,0) = u_{0y}(x)$, and $u(x,1)=u_{1y}(y)$ are all given. If this is the case, we can have

\[ T(x,y) = A(x,y) + x(1-x)y(1-y)N(x,y) \]

where $A(x,y)$ is a function that is made so $T(x,y)$ satisfies the boundary conditions. Then we have the loss function:

\[ L(p) = \sum_i \left(\Delta T(x_i,y_i) - f(x_i,y_i) \right)^2 \]

Julia Implementation

A Julia implementation for solving ODEs via neural networks through this method exists in NeuralNetDiffEq.jl. Take a look at the code and make sure you understand it. It should be easily understandable given this description of the method!

Analysis of the Method and Dimensionality

While the method "will work" with large enough neural networks due to the universal approximation theorem, this method is highly inefficient since no information about differential equations are embedded into the problem. There is no convergence rate guarantee, and there are no strict bounds like one would expect from numerical analysis. All implementations of the method that I have found (including my own) show that this method is very very inefficient. For example, it can be hard to train a solver on the Lotka-Volterra equation using a few GPU hours, while a Runge-Kutta method will solve it in nanoseconds...

That said, the method itself is an interesting starting point. While other ODE and PDE solvers need to approximate an equation at specific points, i.e. on a mesh, this method is mesh-free in the sense that there is no mesh and just a direct representation of the continuous solution itself. This algorithm was reinvented in 2018 as the Deep Galerkin Method (DGM) and exploited the mesh-free behavior to see this as a method for high dimensional partial differential equations. While a finite difference method would require $N^d$ many points, a neural network just needs to have its parameters in memory, and thus it can scale well to high dimensional partial differential equation problems.

Physics-Informed Deep Learning

The results on this are found across three papers, the main being this article from the Journal of Computational Physics. with this JMLR paper having a similar flavor and all stemming from this arxiv paper. The authors are M.Raissi, P.Perdikaris, and G.E.Karniadakis, two at Brown university and Perdikaris at UPenn.

The main idea behind the Physics-Informed Neural Network (PINN) is to use known physical information as a form of prior information in the structure of the neural network architecture in order to build a more data-efficient form of machine learning for scientific computing applications. This setup starts by assuming that the underlying data follows some nonlinear partial differential equation

\[ u_t + \mathcal{N}[u;\lambda] = 0 \]

for some $\lambda$ parameterized PDE operator over some space $x \in \Omega$ and $t \in [0,T]$. From here, we proceed by approximating $u(t,x)$ by a neural network $N(t,x)$. Let

\[ f(t,x) = N_t + \mathcal{N}[N;\lambda] \]

The loss function that is used is:

\[ MSE = MSE_u + MSE_f \]


\[ MSE_u = \frac{1}{N_u} \sum_i |N(t_i^u,x_i^u) - u_i |^2 \]

is the difference between the neural network and the data $u$, while

\[ MSE_f = \frac{1}{N_f} \sum_i |f(t_i^f,x_i^f)|^2 \]


is the loss on the PDE function itself. Notice that, if $f=0$ then the PDE is satisfied, and thus the second term indeed looks like a regularization to push towards the solution of the PDE.

Thus in some sense one can understand this as an extension to the previous deep learning methods for solving partial differential equations, where here the $MSE_f$ term is enforcing that the neural network solves the PDE, just like before, but it is mixed with a loss function that drives $N(t,x)$ towards the known data $u_i(t_i,x_i)$. In this sense, this is a method not for solving a partial differential equation with a neural network, but by using the structure of a neural network PDE solver as a component within a data-driven deep learning approach to relax the solution towards an underlying known physical structure.

Once this is established, the authors present two separate ways of training the physics-informed neural network $f$. One way to do this is to directly evaluate the loss function $MSE_f$ at chosen or random points in time using automatic differentiation. However, the number of points needed for this method to work is seen as a computational bottleneck (indeed, as noted before, the continuous mesh-free algorithm for partial differential equations is highly ineffecient!). To overcome this difficulty, they move towards a discrete-time model.

For this discrete time model, they discretize the partial differential equation by a Runge-Kutta method. For example, we the simplest Runge-Kutta method is the Euler method, which we can write out as:

\[ u_{n+1} = u_n - \Delta t \mathcal{N}[u_n;\lambda] \]

Now in this case, instead of making a neural network apply to $u(t,x)$ and try to approximate the full continuous function of time, the authors make a neural networks

\[ N_n(x) = u_{n}(x) \]

In this form, the previous loss function can be reinterpreted to be on this discretized set of neural networks. Specifically,

\[ SSE = SSE_n + SSE_f + SSE_b \]


\[ SSE_n = \sum_i |N_i(x_i) - u_i|^2 \]

and $SSE_b$ is a loss function of each $u_i$ against the boundary conditions. Here $SSE_f$ is simply the loss on the equation of the discretization:

\[ SSE_f = \sum_i N_{n+1}(x_i) - N_n(x_i) + \Delta t \mathcal{N}[N_n;\lambda] \]

to relax towards the solution of the PDE.

Using this style, the authors show that this method is a data-efficient form of learning because it can use the known physics as a form of heavy prior information.

Deep BSDE Methods for High Dimensional Partial Differential Equations

The key paper on deep BSDE methods is this article from PNAS by Jiequn Han, Arnulf Jentzen, and Weinan E. Follow up papers have like this one have identified a larger context in the sense of forward-backwards SDEs for a large class of partial differential equations.

Understanding the Setup for Terminal PDEs

While this setup may seem a bit contrived given the "very specific" partial differential equation form (you know the end value? You have some parabolic form?), it turns out that there is a large class of problems in economics and finance that satisfy this form. The reason is because in these problems you may know the value of something at the end, when you're going to sell it, and you want to evaluate it right now. The classic example is in options pricing. An option is a contract to be able to solve a stock at a given value. The simplest case is a contract that can only be executed at a pre-determined time in the future. Let's say we have an option to sell a stock at 100 no matter what. This means that, if the stock at the strike time (the time the option can be sold) is 70, we will make 30 from this option, and thus the option itself is worth 30. The question is, if I have this option today, the strike time is 3 months in the future, and the stock price is currently 70, how much should I value the option today?

To solve this, we need to put a model on how we think the stock price will evolve. One simple version is a linear stochastic differential equation, i.e. the stock price will evolve with a constant interest rate $r$ with some volitility (randomness) $\sigma$, in which case:

\[ dX_t = r X_t dt + \sigma X_t dW_t. \]

From this model, we can evaluate the probability that the stock is going to be at given values, which then gives us the probability that the option is worth a given value, which then gives us the expected (or average) value of the option. This is the Black-Scholes problem. However, a more direct way of calculating this result is writing down a partial differential equation for the evolution of the value of the option $V$ as a function of time $t$ and the current stock price $x$. At the final time point, if we know the stock price then we know the value of the option, and thus we have a terminal condition $V(T,x) = g(x)$ for some known value function $g(x)$. The question is, given this value at time $T$, what is the value of the option at time $t=0$ given that the stock currently has a value $x = \zeta$. Why is this interesting? This will tell you what you think the option is currently valued at, and thus if it's cheaper than that, you can gain money by buying the option right now! This means that the "solution" to the PDE is the value $V(0,\zeta)$, where we know the final points $V(T,x) = g(x)$. This is precisely the type of problem that is solved by the deep BSDE method.

The Deep BSDE Method

Consider the class of semilinear parabolic PDEs, in finite time $t\in[0, T]$ and $d$-dimensional space $x\in\mathbb R^d$, that have the form

\[ \begin{align} \frac{\partial u}{\partial t}(t,x) &+\frac{1}{2}\text{trace}\left(\sigma\sigma^{T}(t,x)\left(\text{Hess}_{x}u\right)(t,x)\right)\\ &+\nabla u(t,x)\cdot\mu(t,x) \\ &+f\left(t,x,u(t,x),\sigma^{T}(t,x)\nabla u(t,x)\right)=0,\end{align} \]

with a terminal condition $u(T,x)=g(x)$. In this equation, $\text{trace}$ is the trace of a matrix, $\sigma^T$ is the transpose of $\sigma$, $\nabla u$ is the gradient of $u$, and $\text{Hess}_x u$ is the Hessian of $u$ with respect to $x$. Furthermore, $\mu$ is a vector-valued function, $\sigma$ is a $d \times d$ matrix-valued function and $f$ is a nonlinear function. We assume that $\mu$, $\sigma$, and $f$ are known. We wish to find the solution at initial time, $t=0$, at some starting point, $x = \zeta$.

Let $W_{t}$ be a Brownian motion and take $X_t$ to be the solution to the stochastic differential equation

\[ dX_t = \mu(t,X_t) dt + \sigma (t,X_t) dW_t \]

with initial condition $X(0)=\zeta$. Previous work has shown that the solution to satisfies the following BSDE:

\[ \begin{align} u(t, &X_t) - u(0,\zeta) = \\ & -\int_0^t f(s,X_s,u(s,X_s),\sigma^T(s,X_s)\nabla u(s,X_s)) ds \\ & + \int_0^t \left[\nabla u(s,X_s) \right]^T \sigma (s,X_s) dW_s,\end{align} \]

with terminating condition $g(X_T) = u(X_T,W_T)$.

At this point, the authors approximate $\left[\nabla u(s,X_s) \right]^T \sigma (s,X_s)$ and $u(0,\zeta)$ as neural networks. Using the Euler-Maruyama discretization of the stochstic differential equation system, one arrives at a recurrent neural network:


Julia Implementation

A Julia implementation for the deep BSDE method can be found at NeuralNetDiffEq.jl. The examples considered below are part of the standard test suite.

Financial Applications of Deep BSDEs: Nonlinear Black-Scholes

Now let's look at a few applications which have PDEs that are solved by this method. One set of problems that are solved, given our setup, are Black-Scholes types of equations. Unless a lot of previous literature, this works for a wide class of nonlinear extensions to Black-Scholes with large portfolios. Here, the dimension of the PDE for $V(t,x)$ is the dimension of $x$, where the dimension is the number of stocks in the portfolio that we want to consider. If we want to track 1000 stocks, this means our PDE is 1000 dimensional! Traditional PDE solvers would need around $N^{1000}$ points evolving over time in order to arrive at the solution, which is completely impractical.

One example of a nonlinear Black-Scholes equation in this form is the Black-Scholes equation with default risk. Here we are adding to the standard model the idea that the companies that we are buying stocks for can default, and thus our valuation has to take into account this default probability as the option will thus become value-less. The PDE that is arrived at is:

\[ \frac{\partial u}{\partial t}(t,x) + \bar{\mu}\cdot \nabla u(t, x) + \frac{\bar{\sigma}^{2}}{2} \sum_{i=1}^{d} \left |x_{i} \right |^{2} \frac{\partial^2 u}{\partial {x_{i}}^2}(t,x) \\ - (1 -\delta )Q(u(t,x))u(t,x) - Ru(t,x) = 0 \]

with terminating condition $g(x) = \min_{i} x_i$ for $x = (x_{1}, . . . , x_{100}) \in R^{100}$, where $\delta \in [0, 1)$, $R$ is the interest rate of the risk-free asset, and Q is a piecewise linear function of the current value with three regions $(v^{h} < v ^{l}, \gamma^{h} > \gamma^{l})$,

\[ \begin{align} Q(y) &= \mathbb{1}_{(-\infty,\upsilon^{h})}(y)\gamma ^{h} + \mathbb{1}_{[\upsilon^{l},\infty)}(y)\gamma ^{l} \\ &+ \mathbb{1}_{[\upsilon^{h},\upsilon^{l}]}(y) \left[ \frac{(\gamma ^{h} - \gamma ^{l})}{(\upsilon ^{h}- \upsilon ^{l})} (y - \upsilon ^{h}) + \gamma ^{h} \right ]. \end{align} \]

This PDE can be cast into the form of the deep BSDE method by setting:

\[ \begin{align} \mu &= \overline{\mu} X_{t} \\ \sigma &= \overline{\sigma} \text{diag}(X_{t}) \\ f &= -(1 -\delta )Q(u(t,x))u(t,x) - R u(t,x) \end{align} \]

The Julia code for this exact problem in 100 dimensions can be found here

Stochastic Optimal Control as a Deep BSDE Application

Another type of problem that fits into this terminal PDE form is the stochastic optimal control problem. The problem is a generalized context to what motivated us before. In this case, there are a set of agents which undergo some known stochastic model. What we want to do is apply some control (push them in some direction) at every single timepoint towards some goal. For example, we have the physics for the dynamics of drone flight, but there's randomness in the wind condition, and so we want to control the engine speeds to move in a certain direction. However, there is a cost associated with controling, and thus the question is how to best balance the use of controls with the natural stochastic evolution.

It turns out this is in the same form as the Black-Scholes problem. There is a model evolving forwards, and when we get to the end we know how much everything "cost" because we know if the drone got to the right location and how much energy it took. So in the same sense as Black-Scholes, we can know the value at the end and try and propogate it backwards given the current state of the system $x$, to find out $u(0,\zeta)$, i.e. how should we control right now given the current system is in the state $x = \zeta$. It turns out that the solution of $u(t,x)$ where $u(T,x)=g(x)$ is given and we want to find $u(0,\zeta)$ is given by a partial differential equation which is known as the Hamilton-Jacobi-Bellman equation, which is one of these terminal PDEs that is representable by the deep BSDE method.

Take the classical linear-quadratic Gaussian (LQG) control problem in 100 dimensions

\[ dX_t = 2\sqrt{\lambda} c_t dt + \sqrt{2} dW_t \]

with $t\in [0,T]$, $X_0 = x$, and with a cost function

\[ C(c_t) = \mathbb{E}\left[\int_0^T \Vert c_t \Vert^2 dt + g(X_t) \right] \]

where $X_t$ is the state we wish to control, $\lambda$ is the strength of the control, and $c_t$ is the control process. To minimize the control, the Hamilton–Jacobi–Bellman equation:

\[ \frac{\partial u}{\partial t}(t,x) + \Delta u(t,x) - \lambda \Vert \nabla u(t,x) \Vert^2 = 0 \]

has a solution $u(t,x)$ which at $t=0$ represents the optimal cost of starting from $x$.

This PDE can be rewritten into the canonical form of \Cref{pdeform} by setting:

\[ \begin{align} \mu &= 0, \\ \sigma &= \overline{\sigma} I, \\ f &= -\alpha \left \| \sigma^T(s,X_s)\nabla u(s,X_s)) \right \|^{2}, \end{align} \]

where $\overline{\sigma} = \sqrt{2}$, T = 1 and $X_0 = (0,. . . , 0) \in R^{100}$.

The Julia code for solving this exact problem in 100 dimensions can be found here

Surrogate Acceleration Methods

Another approach for mixing neural networks with differential equations is as a surrogate method. These methods are more mathematically trivial than the previous ideas, but can still achieve interesting results. A full example is explained in this video.

Say we have some function $g(p)$ which depends on a solution to a differential equation $u(t;p)$ and choices of parameters $p$. Computationally how we evaluate this equation is we do the following:

  • Solve the differential equation with parameters $p$

  • Evaluate $g$ on the numerical solution for $u$

However, this process is computationally expensive since it requires the numerical solution of $u$ for every evaluation. Thus, one can look at this setup and see $g(p)$ itself is a nonlinear function. The idea is to train a neural network to be the function $g(p)$, i.e. directly put in $p$ and return the appropriate value without ever solving the differential equation.

The video highlights an important fact about this method: it can be computationally expensive to train this kind of surrogate since many data points $(p,g(p))$ are required. In fact, many more data points than you might use. However, after training, the surrogate network for $g(p)$ can be a lot faster than the original simulation-based approach. This means that this is a method for accelerating real-time solutions by doing upfront computations. The total compute time will always be more, but in some sense the cost is amortized or shifted to be done before hand, so that the model does not need to be simulated on the fly. This can allow for things like computationally expensive models of drone flight to be used in a real-time controller.

This technique goes a long way back, but some recent examples of this have been shown. For example, there's this paper which "accelerated" the solution of the 3-body problem using a neural network surrogate trained over a few days to get a 1 million times acceleration (after generating many points beforehand of course! In the paper, notice that it took 10 days to generate the training dataset). Additionally, there is this deep learning trebuchet example which showcased that inverse problems, i.e. control or finding parameters, can be completely encapsolated as a $g(p)$ and learned with sufficient data.