## Fiddling around with validated ODE integration, Sum of Squares, Taylor Models.

As I have gotten more into the concerns of formal methods, I’ve become unsure that ODEs actually exist. These are concerns that did not bother me much when I defined myself as being more in the physics game. How times change. Here’s a rough cut.

A difficulty with ODE error analysis is that it is very confusing how to get the error on something you are having difficulty approximating in the first place.

If I wanted to know the error of using a finite step size dt vs a size dt/10, great. Just compute both and compare. However, no amount of this seems to bootstrap you down to the continuum. And so I thought, you’re screwed in regards to using numerics in order to get true hard facts about the true solution. You have to go to paper and pencil considerations of equations and variables and epsilons and deltas and things. It is now clearer to me that this is not true. There is a field of verified/validated numerics.

A key piece of this seems to be interval arithmetic. https://en.wikipedia.org/wiki/Interval_arithmetic An interval can be concretely represented by its left and right point. If you use rational numbers, you can represent the interval precisely. Interval arithmetic over approximates operations on intervals in such a way as to keep things easily computable. One way it does this is by ignoring dependencies between different terms. Check out Moore et al’s book for more.

What switching over to intervals does is you think about sets as the things you’re operating on rather than points. For ODEs (and other things), this shift of perspective is to no longer consider individual functions, but instead sets of functions. And not arbitrary extremely complicated sets, only those which are concretely manipulable and storable on a computer like intervals. Taylor models are a particular choice of function sets. You are manipulating an interval tube around a finite polynomial. If during integration / multiplication you get higher powers, truncate the polynomials by dumping the excess into the interval term. This keeps the complexity under wraps and closes the loop of the descriptive system.

If we have an iterative, contractive process for getting better and better solutions of a problem (like a newton method or some iterative linear algebra method), we can get definite bounds on the solution if we can demonstrate that a set maps into itself under this operation. If this is the case and we know there is a unique solution, then it must be in this set.

It is wise if at all possible to convert an ODE into integral form. $\dot{x}= f(x,t)$ is the same as $x(t) = x_0 + \int f(x,t)dt$.

For ODEs, the common example of such an operation is known as Picard iteration. In physical terms, this is something like the impulse approximation / born approximation. One assumes that the ODE evolves according to a known trajectory $x_0(t)$ as a first approximation. Then one plugs in the trajectory to the equations of motion $f(x_0,t)$ to determine the “force” it would feel and integrate up all this force. This creates a better approximation $x_1(t)$ (probably) which you can plug back in to create an even better approximation.

If we instead do this iteration on an intervally function set / taylor model thing, and can show that the set maps into itself, we know the true solution lies in this interval. The term to search for is Taylor Models (also some links below).

I was tinkering with whether sum of squares optimization might tie in to this. I have not seen SOS used in this context, but it probably has or is worthless.

An aspect of sum of squares optimization that I thought was very cool is that it gives you a simple numerical certificate that confirms that at the infinitude of points for which you could evaluate a polynomial, it comes out positive. This is pretty cool. http://www.philipzucker.com/deriving-the-chebyshev-polynomials-using-sum-of-squares-optimization-with-sympy-and-cvxpy/

But that isn’t really what makes Sum of squares special. There are other methods by which to do this.

There are very related methods called DSOS and SDSOS https://arxiv.org/abs/1706.02586 which are approximations of the SOS method. They replace the SDP constraint at the core with a more restrictive constraint that can be expressed with LP and socp respectively. These methods lose some of the universality of the SOS method and became basis dependent on your choice of polynomials. DSOS in fact is based around the concept of a diagonally dominant matrix, which means that you should know roughly what basis your certificate should be in.

This made me realize there is an even more elementary version of DSOS that perhaps should have been obvious to me from the outset. Suppose we have a set of functions we already know are positive everywhere on a domain of interest. A useful example is the raised chebyshev polynomials. https://en.wikipedia.org/wiki/Chebyshev_polynomials The appropriate chebyshev polynomials oscillate between [-1,1] on the interval [-1,1], so if you add 1 to them they are positive over the whole interval [-1,1]. Then nonnegative linear sums of them are also positive. Bing bang boom. And that compiles down into a simple linear program (inequality constraints on the coefficients) with significantly less variables than DSOS. What we are doing is restricting ourselves to completely positive diagonal matrices again, which are of course positive semidefinite. It is less flexible, but it also has more obvious knobs to throw in domain specific knowledge. You can use a significantly over complete basis and finding this basis is where you can insert your prior knowledge.

It is not at all clear there is any benefit over interval based methods.

Here is a sketch I wrote for $x'=x$ which has solution $e^t$. I used raised chebyshev polynomials to enforce positive polynomial constraints and tossed in a little taylor model / interval arithmetic to truncate off the highest terms.

I’m using my helper functions for translating between sympy and cvxpy expressions. https://github.com/philzook58/cvxpy-helpers Sympy is great for collecting up the coefficients on terms and polynomial multiplication integration differentiation etc. I do it by basically creating sympy matrix variables corresponding to cvxpy variables which I compile to cvxpy expressions using lambdify with an explicit variable dictionary.

Seems to work, but I’ve been burned before.

man, LP solvers are so much better than SDP solvers

Random junk and links: Should I be more ashamed of dumps like this? I don’t expect you to read this.

https://github.com/JuliaIntervals/TaylorModels.jl

https://github.com/JuliaIntervals

Functional analysis by and large analyzes functions by analogy with more familiar properties of finite dimensional vector spaces. In ordinary 2d space, it is convenient to work with rectangular regions or polytopic regions.

Suppose I had a damped oscillator converging to some unknown point. If we can show that every point in a set maps within the set, we can show that the function

One model of a program is that it is some kind of kooky complicated hyper nonlinear discrete time dynamical system. And vice versa, dynamical systems are continuous time programs. The techniques for analyzing either have analogs in the other domain. Invariants of programs are essential for determining correctness properties of loops. Invariants like energy and momentum are essential for determining what physical systems can and cannot do. Lyapunov functions demonstrate that control systems are converging to the set point. Terminating metrics are showing that loops and recursion must eventually end.

If instead you use interval arithmetic for a bound on your solution rather than your best current solution, and if you can show the interval maps inside itself, then you know that the iterative process must converge inside of the interval, hence that is where the true solution lies.

A very simple bound for an integral $\int_a^b f(x)dx$ is $\int max_{x \in [a,b]}f(x) dx= max_{x \in [a,b]}f(x) \int dx = max_{x \in [a,b]}f(x) (b - a)$

The integral is a very nice operator. The result of the integral is a positive linear sum of the values of a function. This means it plays nice with inequalities.

Rigorously Bounding ODE solutions with Sum of Squares optimization – Intervals

Intervals – Moore book. Computational functional analaysis. Tucker book. Coqintervals. fixed point theorem. Hardware acceleration? Interval valued functions. Interval extensions.

• Banach fixed point – contraction mapping
• Brouwer fixed point
• Schauder
• Knaster Tarski

Picard iteration vs? Allowing flex on boundary conditions via an interval?

Interval book had an interesting integral form for the 2-D

sympy has cool stuff

google scholar search z3, sympy brings up interesting things

https://moorepants.github.io/eme171/resources.html

The pydy guy Moore has a lot of good shit. resonance https://www.moorepants.info/blog/introducing-resonance.html

Lyapunov functions. Piecewise affine lyapunov funcions. Are lyapunov functions kind of like a PDE? Value functions are pdes. If the system is piecewise affine we can define a grid on the same piecewise affine thingo. Compositional convexity. Could we use compositional convexity + Relu style piecewise affinity to get complicated lyapunov functions. Lyapunov functions don’t have to be continiuous, they just have to be decreasing. The Lie derivative wrt the flow is always negative, i.e gradeint of function points roughly in direction of flow. trangulate around equilibrium if you want to avoid quadratic lyapunov. For guarded system, can relax lyapunov constrain outside of guard if you tighten inside guard. Ax>= 0 is guard. Its S-procedure.

Best piecewise approximation with point choice?

Connection to petri nets?

KoAt, LoAT. AProve. Integer transition systems. Termination analysis. Loops?

https://lfcps.org/pub/Pegasus.pdf darboux polynomials. barreir certificates. prelle-singer method. first integrals.

method 1. arbitrary polynomial p(t). calculate p'(t). find coefficents that make p'(t) = 0 by linear algebra. Idea: near invaraints? min max|p'(t) |

Lie Algebra method

https://www.researchgate.net/publication/233653257_Solving_Differential_Equations_by_Symmetry_Groups sympy links this paper. Sympy has some lie algebra stuff in there

https://www-users.math.umn.edu/~olver/sm.html Peter Olver tutorial

https://www-sop.inria.fr/members/Evelyne.Hubert/publications/PDF/Hubert_HDR.pdf

https://www.cs.cmu.edu/~aplatzer/logic/diffinv.html andre platzer. Zach says Darboux polynomials?

Books: Birhoff and Rota, Guggenheimer, different Olver books, prwctical guide to invaraints https://www.amazon.com/Practical-Invariant-Monographs-Computational-Mathematics/dp/0521857015

Idea: Approximate invariants? At least this ought to make a good coordinate system to work in where the dynamics are slow. Like action-angle and adiabatic transformations. Could also perhaps bound the

Picard Iteration

I have a method that I’m not sure is ultimately sound. The idea is to start with

Error analysis most often uses an appeal to Taylor’s theorem and Taylor’s theorem is usually derived from them mean value theorem or intermediate value theorem. Maybe that’s fine. But the mean value theorem is some heavy stuff. There are computational doo dads that use these bounds + interval analysis to rigorously integrate ODEs. See https://github.com/JuliaIntervals/TaylorModels.jl

The beauty of sum of squares certificates is that they are very primitive proofs of positivity for a function on a domain of infinitely many values. If I give you a way to write an expression as a sum of square terms, it is then quite obvious that it has to be always positive. This is algebra rather than analysis.

$y(t) = \lambda(t) \and \lambda(t) is SOS \Rightarrow \forall t. y(t) >= 0$. Sum of squares is a kind of a quantifier elimination method. The reverse direction of the above implication is the subject of the positivstullensatz, a theorem of real algebraic geometry. At the very least, we can use the SOS constraint as a relaxation of the quantified constraint.

So, I think by using sum of squares, we can turn a differential equation into a differential inequation. If we force the highest derivative to be larger than the required differential equation, we will get an overestimate of the required function.

A function that is dominated by another in derivative, will be dominated in value also. You can integrate over inequalities (I think. You have to be careful about such things. ) $\forall t. \frac{dx}{dt} >= \frac{dx}{dt} \Rightarrow$ x(t) – x(0) >= y(t) – y(0) $The derivative of a polynomial can be thought of as a completely formal operation, with no necessarily implied calculus meaning. It seems we can play a funny kind of shell game to avoid the bulk of calculus. As an example, let’s take $\frac{dx}{dt}=y$ $y(0) = 1$ with the solution $y = e^t$. $e$ is a transcendental The S-procedure is trick by which you can relax a sum of squares inequality to only need to be enforced in a domain. If you build a polynomials function that describes the domain, it that it is positive inside the domain and negative outside the domain, you can add a positive multiple of that to your SOS inequalities. Inside the domain you care about, you’ve only made them harder to satisfy, not easier. But outside the domain you have made it easier because you can have negative slack. For the domain $t \in [0,1]$ the polynomial $(1 - t)t$ works as our domain polynomial. We parametrize our solution as an explicit polynomial $x(t) = a_0 + a_1 t + a_2 t^2 + ...$. It is important to note that what follows is always linear in the $a_i$. $\frac{dx}{dt} - x >= 0$ can be relaxed to $\frac{dx}{dt} - x(t) + \lambda(t)(1-t)t >= 0$ with $\lambda(t) is SOS$. So with that we get a reasonable formulation of finding a polynomial upper bound solution of the differential equation $\min x(1)$ $\frac{dx}{dt} - x(t) + \lambda_1(t)(1-t)t = \lambda_2(t)$ $\lambda_{1,2}(t) is SOS$. And here it is written out in python using my cvxpy-helpers which bridge the gap between sympy polynomials and cvxpy. We can go backwards to figure out sufficient conditions for a bound. We want $x_u(t_f) \gte x(t_f)$. It is sufficient that $\forall t. x_u(t) \gte x(t)$. For this it is sufficient that $\forall t. x_u'(t) >= x'(t) \and x_u(t_i) >= x(t_i)$. We follow this down in derivative until we get the lowest derivative in the differential equation. Then we can use the linear differential equation itself $x^{(n)}(t) = \sum_i a_i(t) x^{(i)}(t)$. $x_u^{(n)}(t) >= \sum max(a_i x^{(i)}_u, x^{(i)}_l)$. $a(t) * x(t) >= \max a(t) x_u(t), a(t) x_l(t)$. This accounts for the possibility of terms changing signs. Or you could separate the terms into regions of constant sign. The minimization characterization of the bound is useful. For any class of functions that contains our degree-d polynomial, we can show that the minimum of the same optimization problem is less than or equal to our value. Is the dual value useful? The lower bound on the least upper bound Doesn’t seem like the method will work for nonlinear odes. Maybe it will if you relax the nonlinearity. Or you could use perhaps a MIDSP to make piecewise linear approximations of the nonlinearity? It is interesting to investigtae linear programming models. It is simpler and more concrete to examine how well different step sizes approximate each other rather than worry about the differential case. We can explicit compute a finite difference solution in the LP, which is a power that is difficult to achieve in general for differential equations. We can instead remove the exact solution by a convservative bound. While we can differentiate through an equality, we can’t differentiate through an inequality. Differentiation involves negation, which plays havoc with inequalities. We can however integrate through inequalities. $\frac{dx}{dt} >= f \and x(0) >= a \Rightarrow$ x(t) >= \int^t_0 f(x) + a$

As a generalization we can integrate $\int p(x)$ over inequalities as long as $p(x) \gte 0$

In particular $\forall t. \frac{dx}{dt} >= \frac{dx}{dt} \Rightarrow$ x(t) – x(0) >= y(t) – y(0) \$

We can convert a differential equation into a differential inequation. It is not entirely clear to me that there is a canonical way to do this. But it works to take the biggest.

$\frac{dx}{dt} = A(t)x + f(t)$

Is there a tightest

We can integrate

Here let’s calculate e

https://tel.archives-ouvertes.fr/tel-00657843v2/document Thesis on ODE bounds in Isabelle

not so good. very small

## Stupid Z3Py Tricks Strikes Back: Verifying a Keras Neural Network

Neural networks are all the rage these days. One mundane way of looking at neural networks is that they are a particular class of parametrized functions . What makes them useful is:

1. They can be used at insane scale due to their simplicity and excellent available implementations and tooling
2. There are intuitive ways to input abstract structure and symmetry expected of a problem, for example translation symmetry, or a hierarchy of small scale pattern recognition combining into large scale structures. How this all works is very peculiar.
3. Inspirational analogies can be drawn from nature.

Neural networks made out of just relus (rectified linear units, relu(x) = max(0,x) ) and linear layers are particularly amenable to formal analysis. Regarding the weights as fixed (once a network has be trained), the complete neural network is a piecewise linear function. The regions where it is piecewise define are polyhedra (They are defined by the hyperplanes of the relu transitions as inequality constraints). Such functions are among those the most amenable to automated rigorous analysis.

Most machine learning tasks don’t have a mathematically precise specification. What is the mathematically precise definition of a picture of a horse? We could try to come up with something (this is sort of what good old fashioned AI tried to do), but in my opinion it would be rather suspect.

Tasks that do have a precise spec are questionable areas for machine learning techniques, because how can you know that the network meets the spec? Also, one would suspect such a problem would have structure that you might be better off with a more traditional algorithmic approach.

However, there a a couple areas where one does have reasonable formal questions one might want to ask about a neural network:

• Robustness around training and validation data. Finding Adversarial examples or proving there are none.
• Games like Go. Alpha Go is a marriage of more traditional algorithmic approaches and machine learning. There is a core of traditional game tree search to it.
• Control Systems – There are many control systems which we do have a reasonable formal spec one could write, such as walking robots. These systems are so high dimensional that it is difficult to derive a good controller from the spec, and hence reinforcement learning may be a reasonable option. But we would like to confirm the controller is good and not dangerous
• Neural networks as computational accelerators. There are problems which we know how to solve, but are too slow. Neural networks can be evaluated very quickly and easily thanks to modern frameworks. It may be useful to presolve a large number of examples offline using the slow algorithm and train a neural network to give good estimates. We may be able to replace the slow thing entirely if we can verify the neural network always is good enough.

We’re going to use a neural network to fit a chebyshev polynomial. Here we’re picking a Chebyshev polynomial as our base truth because Chebyshev polynomials have some pleasant waviness to them. Why not. I like ’em. Also polynomials are easily understood by z3 as a base spec.

This example of course is a complete toy. How often do you see 1-D input space neural networks? Not often I bet.
But it’s nice for a couple reasons:

1. Because we can visualize it.
2. It drives home the point about neural nets being a space of piecewise linear function approximators, and how similar training is to curve fitting.
3. It’s simple enough that z3 can crush it. There is a big question if z3 or other methods can scale to realistic neural nets. Modern top of the line neural nets are insanely huge. As we’ve done it here, I highly doubt it. There are special purpose SMT solvers being built for this task. Also the slightly different technology of mixed integer programming can be used and seems very promising. So this is an area of research. See links at the bottom for more.

Generally speaking, the combination of the capabilities of sympy and z3 give us access to some very intriguing possibilities. I’m not going to investigate this in detail in this post, but I are showing how you can convert a sympy derived polynomial into a python expression using lambdify, which can then be in turn used on z3 variables.

Here we build a very small 3 layers network using Keras. We use a least squares error and an adam optimizer. Whatever. I actually had difficulty getting nice results out for higher order chebyshev polynomials. Neural networks are so fiddly.

And here we extract the weights and reinterpret them into z3. We could also have used z3 floating point capabilities rather than reals if you’re concerned about numerical issues. It was convenient to have my layers be different sizes, so that size mismatch would throw a python error. That’s how I found out the weights are transposed by default. The code at the end extracts a found countermodel and evaluates it. If you want to feel fancier, you can also use the prove function rather than an explicit Solver() instance. Saying you proved the neural network matches the polynomial to a certain tolerance feels really good. If you look at the graphs, the edges at 1 and -1 actually have pretty bad absolute error, around 0.5.

https://github.com/sisl/NeuralVerification.jl

https://arxiv.org/abs/1711.07356 – Evaluating Robustness of Neural Networks with Mixed Integer Programming

https://github.com/vtjeng/MIPVerify.jl

https://arxiv.org/abs/1702.01135 – reluplex, SMT specifically for neural networks

## Categorical LQR Control with Linear Relations

Last time, I described and built an interesting expansion of describing systems via linear functions, that of using linear relations. We remove the requirement that mappings be functional (exactly one output vector for any given input vector), which extends our descriptive capability. This is useful for describing “open” pieces of a linear system like electric circuits. This blog post is about another application: linear dynamical systems and control.

Linear relations are an excellent tool for reachability/controllability. In a control system, it is important to know where it is possible to get the system to. With linear dynamics $x_{t+1} = Ax_{t} + B u_{t}$, an easy controllability question is “can my system reach into some subspace”. This isn’t quite the most natural question physically, but it is natural mathematically. My first impulse would be to ask “given the state starts in this little region over here, can it get to this little region over here”, but the subspace question is a bit easier to answer. It’s a little funky but it isn’t useless. It can detect if the control parameter only touches 1 of 2 independent dynamical systems for example. We can write the equations of motion as a linear relation and iterate composition on them. See Erbele for more.

There is a set of different things (and kind of more interesting things) that are under the purview of linear control theory, LQR Controllers and Kalman filters. The LQR controller and Kalman filter are roughly (exactly?) the same thing mathematically. At an abstract mathematical level, they both rely on the fact that optimization of a quadratic objective $x^T Q x + r^T x + c$ with a linear constraints $Ax=b$ is solvable in closed form via linear algebra. The cost of the LQR controller could be the squared distance to some goal position for example. When you optimize a function, you set the derivative to 0. This is a linear equation for quadratic objectives. It is a useful framework because it has such powerful computational teeth.

The Kalman filter does a similar thing, except for the problem of state estimation. There are measurements linear related to the state that you want to match with the prior knowledge of the linear dynamics of the system and expected errors of measurement and environment perturbations to the dynamics.

There are a couple different related species of these filters. We could consider discrete time or continuous time. We can also consider infinite horizon or finite horizon. I feel that discrete time and finite horizon are the most simple and fundamental so that is what we’ll stick with. The infinite horizon and continuous time are limiting cases.

We also can consider the dynamics to be fixed for all time steps, or varying with time. Varying with time is useful for the approximation of nonlinear systems, where there are different good nonlinear approximation (taylor series of the dynamics) depending on the current state.

There are a couple ways to solve a constrained quadratic optimization problem. In some sense, the most conceptually straightforward is just to solve for a span of the basis (the “V-Rep” of my previous post) and plug that in to the quadratic objective and optimize over your now unconstrained problem. However, it is commonplace and elegant in many respects to use the Lagrange multiplier method. You introduce a new parameter for every linear equality and add a term to the objective $\lambda^T (A x - b)$. Hypothetically this term is 0 for the constrained solution so we haven’t changed the objective in a sense. It’s a funny slight of hand. If you optimize over x and $\lambda$, the equality constraint comes out as your gradient conditions on $\lambda$. Via this method, we convert a linearly constrained quadratic optimization problem into an unconstrained quadratic optimization problem with more variables.

Despite feeling very abstract, the value these Lagrangian variables take on has an interpretation. They are the “cost” or “price” of the equality they correspond to. If you moved the constraint by an amount $Ax - b + \epsilon$, you would change the optimal cost by an amount $\lambda \epsilon$. (Pretty sure I have that right. Units check out. See Boyd https://www.youtube.com/watch?v=FJVmflArCXc)

The Lagrange multipliers enforcing the linear dynamics are called the co-state variables. They represent the “price” that the dynamics cost, and are derivatives of the optimal value function V(s) (The best value that can be achieved from state s) that may be familiar to you from dynamic programming or reinforcement learning. See references below for more.

Let’s get down to some brass tacks. I’ve suppressed some terms for simplicity. You can also add offsets to the dynamics and costs.

A quadratic cost function with lagrange multipliers. Q is a cost matrix associated with the x state variables, R is a cost matrix for the u control variables.

$C = \lambda_0 (x_0 - \tilde{x}_0) + \sum_t x_t^T Q x_t + u_t^T R u_t + \lambda_{t+1}^T (x_{t+1} - A x_t - B u_t )$

Equations of motion results from optimizing with respect to $\lambda$ by design.

$\nabla_{\lambda_{t+1}} C = x_{t+1} - A x_t - B u_t = 0$.

The initial conditions are enforced by the zeroth multiplier.

$\nabla_{\lambda_0} C = x_i - x_{0} = 0$

Differentiation with respect to the state x has the interpretation of backwards iteration of value derivative, somewhat related to what one finds in the Bellman equation.

$\nabla_{x_t} C = Q x_{t} + A^T \lambda_{t+1} - \lambda_{t} = 0 \Longrightarrow \lambda_{t} = A^T \lambda_{t+1} + Q x_{t}$

The final condition on value derivative is the one that makes it the most clear that the Lagrange multiplier has the interpretation of the derivative of the value function, as it sets it to that.

$\nabla_{x_N} C = Q x_N - \lambda_{N} = 0$

Finally, differentiation with respect to the control picks the best action given knowledge of the value function at that time step.

$\nabla_{u_t} C = R u_{t} - B^T \lambda_t = 0$

Ok. Let’s code it up using linear relations. Each of these conditions is a separate little conceptual box. We can build the optimal control step relation by combining these updates together

The following is a bit sketchy. I am confident it makes sense, but I’m not confident that I have all of the signs and other details correct. I also still need to add the linear terms to the objective and offset terms to the dynamics. These details are all kind of actually important. Still, I think it’s nice to set down in preliminary blog form.

### Bits and Bobbles

• The code in context https://github.com/philzook58/ConvexCat/blob/master/src/LinRel.hs
• Some of the juicier stuff is nonlinear control. Gotta walk before we can run though. I have some suspicions that a streaming library may be useful here, or a lens-related approach. Also ADMM.
• Reminiscent of Keldysh contour. Probably a meaningless coincidence.
• In some sense H-Rep is the analog of (a -> b -> Bool) and V-rep [(a,b)]
• Note that the equations of motion are relational rather than functional for a control systems. The control parameters $u$ describe undetermined variables under our control.
• Loopback (trace) the value derivative for infinite horizon control.
• Can solve the Laplace equation with a Poincare Steklov operator. That should be my next post.
• There is something a touch unsatisfying even with the reduced goals of this post. Why did I have to write down the quadratic optimization problem and then hand translate it to linear relations? It’s kind of unacceptable. The quadratic optimization problem is really the most natural statement, but I haven’t figured out to make it compositional. The last chapter of Rockafellar?

### References

https://arxiv.org/abs/1405.6881 Baez and Erbele – Categories in Control

http://www.scholarpedia.org/article/Optimal_control

http://www.argmin.net/2016/05/18/mates-of-costate/ – interesting analogy to backpropagation. Lens connection?

https://courses.cs.washington.edu/courses/cse528/09sp/optimality_chapter.pdf

## Failing to Bound Kissing Numbers

https://en.wikipedia.org/wiki/Kissing_number

Cody brought up the other day the kissing number problem.Kissing numbers are the number of equal sized spheres you can pack around another one in d dimensions. It’s fairly self evident that the number is 2 for 1-d and 6 for 2d but 3d isn’t so obvious and in fact puzzled great mathematicians for a while. He was musing that it was interesting that he kissing numbers for some dimensions are not currently known, despite the fact that the first order theory of the real numbers is decidable https://en.wikipedia.org/wiki/Decidability_of_first-order_theories_of_the_real_numbers

I suggested on knee jerk that Sum of Squares might be useful here. I see inequalities and polynomials and then it is the only game in town that I know anything about.

Apparently that knee jerk was not completely wrong

https://arxiv.org/pdf/math/0608426.pdf

Somehow SOS/SDP was used for bounds here. I had an impulse that the problem feels SOS-y but I do not understand their derivation.

One way the problem can be formulated is by finding or proving there is no solution to the following set of equations constraining the centers $x_i$ of the spheres. Set the central sphere at (0,0,0,…) . Make the radii 1. Then$\forall i. |x_i|^2 = 2^2$ and $\forall i j. |x_i - x_j|^2 \ge 2^2$

I tried a couple different things and have basically failed. I hope maybe I’ll someday have a follow up post where I do better.

So I had 1 idea on how to approach this via a convex relaxation

Make a vector $x = \begin{bmatrix} x_0 & y _0 & x_1 & y _1 & x_2 & y _2 & ... \end{bmatrix}$ Take the outer product of this vector $x^T x = X$ Then we can write the above equations as linear equalities and inequalities on X. If we forget that we need X to be the outer product of x (the relaxation step), this becomes a semidefinite program. Fingers crossed, maybe the solution comes back as a rank 1 matrix. Other fingers crossed, maybe the solution comes back and says it’s infeasible. In either case, we have solved our original problem.

Didn’t work though. Sigh. It’s conceivable we might do better if we start packing higher powers into x?

Ok Round 2. Let’s just ask z3 and see what it does. I’d trust z3 with my baby’s soft spot.

It solves for 5 and below. Z3 grinds to a halt on N=6 and above. It ran for days doin nothing on my desktop.

Ok. A different tact. Try to use a positivstellensatz proof. If you have a bunch of polynomial inequalities and equalities if you sum polynomial multiples of these constraints, with the inequalities having sum of square multiples, in such a way to = -1, it shows that there is no real solution to them. We have the distance from origin as equality constraint and distance from each other as an inequality constraint. I intuitively think of the positivstellensatz as deriving an impossibility from false assumptions. You can’t add a bunch of 0 and positive numbers are get a negative number, hence there is no real solution.

I have a small set of helper functions for combining sympy and cvxpy for sum of squares optimization. I keep it here along with some other cute little constructs https://github.com/philzook58/cvxpy-helpers

and here is the attempted positivstellensatz.

It worked in 1-d, but did not work in 2d. At order 3 polynomials N=7, I maxed out my ram.

I also tried doing it in Julia, since sympy was killing me. Julia already has a SOS package

It was faster to encode, but it’s using the same solver (SCS), so basically the same thing.

I should probably be reducing the system with respect to equality constraints since they’re already in a Groebner basis. I know that can be really important for reducing the size of your problem

I dunno.

Blah blah blah blah A bunch of unedited trash

https://github.com/peterwittek/ncpol2sdpa Peter Wittek has probably died in an avalanche? That is very sad.

These notes

https://web.stanford.edu/class/ee364b/lectures/sos_slides.pdf

Positivstullensatz.

kissing number

Review of sum of squares

minimimum sample as LP. ridiculous problem
min t
st. f(x_i) – t >= 0

dual -> one dual variable per sample point
The only dual that will be non zero is that actually selecting the minimum.

Hm. Yeah, that’s a decent analogy.

How does the dual even have a chance of knowing about poly airhtmetic?
It must be during the SOS conversion prcoess. In building the SOS constraints,
we build a finite, limittted version of polynomial multiplication
x as a matrix. x is a shift matrix.
In prpducing the characterstic polynomial, x is a shift matrix, with the last line using the polynomial
known to be zero to
eigenvectors of this matrix are zeros of the poly.

SOS does not really on polynomials persay. It relies on closure of the suqaring operaiton

maybe set one sphere just at x=0 y = 2. That breaks some symmettry

set next sphere in plane something. random plane through origin?

order y components – breaks some of permutation symmettry.

no, why not order in a random direction. That seems better for symmettry breaking

## Neural Networks with Weighty Lenses (DiOptics?)

I wrote a while back how you can make a pretty nice DSL for reverse mode differentiation based on the same type as Lens. I’d heard some interesting rumblings on the internet around these ideas and so was revisiting them.

Composition is defined identically for reverse mode just as it is for lens.

After chewing on it a while, I realized this really isn’t that exotic. How it works is that you store the reverse mode computation graph, and all necessary saved data from the forward pass in the closure of the (dy -> dx). I also have a suspicion that if you defunctionalized this construction, you’d get the Wengert tape formulation of reverse mode ad.

Second, Lens is just a nice structure for bidirectional computation, with one forward pass and one backward pass which may or may not be getting/setting. There are other examples for using it like this.

It is also pretty similar to the standard “dual number” form type FAD x dx y dy = (x,dx)->(y,dy) for forward mode AD. We can bring the two closer by a CPS/Yoneda transformation and then some rearrangement.

and meet it in the middle with

I ended the previous post somewhat unsatisfied by how ungainly writing that neural network example was, and I called for Conal Elliot’s compiling to categories plugin as a possible solution. The trouble is piping the weights all over the place. This piping is very frustrating in point-free form, especially when you know it’d be so trivial pointful. While the inputs and outputs of layers of the network compose nicely (you no longer need to know about the internal computations), the weights do not. As we get more and more layers, we get more and more weights. The weights are in some sense not as compositional as the inputs and outputs of the layers, or compose in a different way that you need to maintain access to.

I thought of a very slight conceptual twist that may help.

The idea is we keep the weights out to the side in their own little type parameter slots. Then we define composition such that it composes input/outputs while tupling the weights. Basically we throw the repetitive complexity appearing in piping the weights around into the definition of composition itself.

These operations are easily seen as 2 dimensional diagrams.

Here’s the core reverse lens ad combinators

And here are the two dimensional combinators. I tried to write them point-free in terms of the combinators above to demonstrate that there is no monkey business going on. We

I wonder if this is actually nice?

I asked around and it seems like this idea may be what davidad is talking about when he refers to dioptics

http://events.cs.bham.ac.uk/syco/strings3-syco5/slides/dalrymple.pdf

Perhaps this will initiate a convo.

Edit: He confirms that what I’m doing appears to be a dioptic. Also he gave a better link http://events.cs.bham.ac.uk/syco/strings3-syco5/papers/dalrymple.pdf

He is up to some interesting diagrams

### Bits and Bobbles

• Does this actually work or help make things any better?
• Recurrent neural nets flip my intended role of weights and inputs.
• Do conv-nets naturally require higher dimensional diagrams?
• This weighty style seems like a good fit for my gauss seidel and iterative LQR solvers. A big problem I hit there was getting all the information to the outside, which is a similar issue to getting the weights around in a neural net.

## Flappy Bird as a Mixed Integer Program

Mixed Integer Programming is a methodology where you can specify convex (usually linear) optimization problems that include integer/boolean variables.

Flappy Bird is a game about a bird avoiding pipes.

We can use mixed integer programming to make a controller for Flappy Bird. Feel free to put this as a real-world application in your grant proposals, people.

While thinking about writing a MIP for controlling a lunar lander game, I realized how amenable to mixed integer modeling flappy bird is. Ben and I put together the demo on Saturday. You can find his sister blog post here.

The bird is mostly in free fall, on parabolic trajectories. This is a linear dynamic, so it can directly be expressed as a linear constraint. It can discretely flap to give itself an upward impulse. This is a boolean force variable at every time step. Avoiding the ground and sky is a simple linear constraint. The bird has no control over its x motion, so that can be rolled out as concrete values. Because of this, we can check what pipes are relevant at time points in the future and putting the bird in the gap is also a simple linear constraint.

There are several different objectives one might want to consider and weight. Perhaps you want to save the poor birds energy and minimize the sum of all flaps cvx.sum(flap). Or perhaps you want to really be sure it doesn’t hit any pipes by maximizing the minimum distance from any pipe. Or perhaps minimize the absolute value of the y velocity, which is a reasonable heuristic for staying in control. All are expressible as linear constraints. Perhaps you might want a weighted combo of these. All things to fiddle with.

There is a pygame flappy bird clone which made this all the much more slick. It is well written and easy to understand and modify. Actually figuring out the appropriate bounding boxes for pipe avoidance was finicky. Figuring out the right combo of bird size and pipe size is hard, combined with computer graphics and their goddamn upside down coordinate system.

We run our solver in a model predictive control configuration. Model predictive control is where you roll out a trajectory as an optimization problem and resolve it at every action step. This turns an open loop trajectory solve into a closed loop control, at the expense of needing to solve a perhaps very complicated problem in real time. This is not always feasible.

My favorite mip modeling tool is cvxpy. It gives you vectorized constraints and slicing, which I love. More tools should aspire to achieve numpy-like interfaces. I’ve got lots of other blog posts using this package which you can find in my big post list the side-bar 👀.

The github repo for the entire code is here: https://github.com/philzook58/FlapPyBird-MPC

And here’s the guts of the controller:

I think it is largely self explanatory but here are some notes. The dt = t//10 + 1 thing is about decreasing our time resolution the further out from the current time step. This increases the time horizon without the extra computation cost. Intuitively modeling accuracy further out in time should matter less. The last_solution stuff is for in case the optimization solver fails for whatever reason, in which case it’ll follow open-loop the last trajectory it got.

### Bits and Bobbles

• We changed the dynamics slightly from the python original to make it easier to model. We found this did not change the feel of the game. The old dynamics were piecewise affine though, so are also modelable using mixed integer programming. http://groups.csail.mit.edu/robotics-center/public_papers/Marcucci18.pdf . Generally check out the papers coming out of the Tedrake group. They are sweet. Total fanboy over here.
• The controller as is is not perfect. It fails eventually, and it probably shouldn’t. A bug? Numerical problems? Bad modeling of the pipe collision? A run tends to get through about a hundred pipes before something gets goofy.
• Since we had access to the source code, we could mimic the dynamics very well. How robust is flappy bird to noise and bad modeling? We could add wind, or inaccurate pipe data.
• Unions of Convex Region. Giving the flappy bird some x position control would change the nature of the problem. We could easily cut up the allowable regions of the bird into rectangles, and represent the total space as a union of convex regions, which is also MIP representable.
• Verification – The finite difference scheme used is crude. It is conceivable for the bird to clip a pipe. Since basically we know the closed form of the trajectories, we could verify that the parabolas do not intersect the regions. For funzies, maybe use sum of squares optimization?
• Realtime MIP. Our solver isn’t quite realtime. Maybe half real time. One might pursue methods to make the mixed integer program faster. This might involve custom branching heuristics, or early stopping. If one can get the solver fast enough, you might run the solver in parallel and only query a new path plan every so often.
• 3d flappy bird? Let the bird rotate? What about a platformer (Mario) or lunar lander? All are pretty interesting piecewise affine systems.
• Is this the best way to do this? Yes and no. Other ways to do this might involve doing some machine learning, or hardcoding a controller that monitors the pipe locations and has some simple feedback. You can find some among the forks of FlapPyBird. I have no doubt that you could write these quickly, fiddle with them and get them to work better and faster than this MIP controller. However, for me there is a difference between could work and should work. You can come up with a thousand bizarre schemes that could work. RL algorithms fall in this camp. But I have every reason to believe the MIP controller should work, which makes it easier to troubleshoot.

## The Classical Coulomb Gas as a Mixed Integer Quadratic Program

The coulomb gas is a model of electrostatics where you take the discreteness of charge into account. That is what makes it hard compared to the continuous charge problem. Previously, I’ve used mixed integer programming to find lowest energy states of the ising model. This is even more obvious application of mixed integer programming to a physics model.

We ordinarily consider electric charge to be a continuum, but it isn’t. It comes in chunks of the electron charge. Historically, people didn’t even know that for quite a while. It is usually a reasonable approximation for most purposes to consider electric charge to be continuous

If you consider a network of capacitors cooled to the the level that there is not enough thermal energy to borrow to get an electron to jump, the charges on the capacitors will be observably discretized. With a sufficiently slow cooling to this state, the charges should arrange themselves into the lowest energy state.

The coulomb gas model also is of interest due to its connections to the XY model, which I’ve taken a stab at with mixed integer programming before. The coulomb gas models the energy of vortices in that model. I think the connection between the models actually requires a statistical or quantum mechanical context though, whereas we’ve been looking at the classical energy minimization.

We can formulate the classical coulomb gas problem very straightforwardly as a mixed integer quadratic program. We can easily include an externally applied field and a charge conservation constraint if we so desire within the framework.

We write this down in python using the cvxpy library, which has a built in free MIQP solver, albeit not a very good one. Commercial solvers are probably quite a bit better.

The results seems reasonable. It makes sense for charge to go in the direction of the electric field. Going to the corners makes sense because then like charges are far apart. So this might be working. Who knows.

### Interesting places to go with this:

Prof Vanderbei shows how you can embed an FFT to enable making statements about both the time and frequency domain while keeping the problem sparse. The low time/memory $N log(N)$ complexity of the FFT is reflected in the spasity of the resulting linear program.

https://vanderbei.princeton.edu/tex/ffOpt/ffOptMPCrev4.pdf

Here’s a sketch about what this might look like. Curiously, looking at the actual number of nonzeros in the problem matrices, there are way too many. I am not sure what is going on. Something is not performing as i expect in the following code

The equivalent dense DFT:

It would be possible to use a frequency domain solution of the interparticle energy rather than the explicit coulomb law form. Hypothetically this might increase the sparsity of the problem.

It seems very possible to me in a similar manner to embed a fast multipole method or barnes-hut approximation within a MIQP. Introducing explicit charge summary variables for blocks would create a sparse version of the interaction matrix. So that’s fun.

## A Basic Branch and Bound Solver in Python using Cvxpy

Branch and bound is a useful problem solving technique. The idea is, if you have a minimization problem you want to solve, maybe there is a way to relax the constraints to an easier problem. If so, the solution of the easier problem is a lower bound on the possible solution of the hard problem. If the solution of the easier problem just so happens to also obey the more constrained hard problem, then it must also be the solution to the hard problem. You can also use the lower bound coming from a relaxed problem to prune your search tree for the hard problem. If even the relaxed problem doesn’t beat the current best found, don’t bother going down that branch.

A standard place this paradigm occurs is in mixed integer programming. The relaxation of a binary constraint (either 0 or 1) can be all the values in between (any number between 0 and 1). If this relaxed problem can be expressed in a form amenable to a solver like a linear programming solver, you can use that to power the branch and bound search, also using returned solutions for possible heuristics.

I built a basic version of this that uses cvxpy as the relaxed problem solver. Cvxpy already has much much faster mixed integer solvers baked in (which is useful to make sure mine is returning correct results), but it was an interesting exercise. The real reason I’m toying around is I kind of want the ability to add custom branching heuristics or inspect and maintain the branch and bound search tree, which you’d need to get into the more complicated guts of the solvers bound to cvxpy to get at. Julia might be a better choice.

A somewhat similar (and better) project is https://github.com/oxfordcontrol/miosqp which doesn’t use cvxpy explicitly, but does have the branch and bound control in the python layer of the solver. There are also other projects that can use fairly arbitrary solvers like Bonmin

As a toy problem I’m using a knapsack problem where we have objects of different sizes and different values. We want to maximize the value while keeping the total size under the capacity of the bag. This can be phrased linearly like so: $\max v \cdot x$ s.t. $\sum_i s_i x_i<= capacity$, $x \in {0,1}$. The basic heuristic I’m using is to branch on variables that are either 0 or 1 in even the relaxed solution. The alternative branch hopefully gets pruned fast.

This is at least solving the problem fairly quickly. It needs better heuristics and to be sped up, which is possible in lots of ways. I was not trying to avoid all performance optimizations. It takes maybe 5 seconds, whereas the cvxpy solver is almost instantaneous.

Edit : I should investigate the Parameter functionality of cvxpy. That would make a make faster version than the one above based on deepcopy. If you made the upper and lower vectors on the binary variables parameters, you could restrict the interval to 0/1 without rebuilding the problem or copying everything.

## Mixed Integer Programming & Quantization Error

I though of another fun use case of mixed integer programming the other day. The quantization part of a digital to analog converter is difficult to analyze by the techniques taught in a standard signals course (linear analysis, spectral techniques, convolution that sort of thing). The way it is usually done is via assuming the quantization error is a kind of randomized additive noise.

Mixed Integer programming does have the ability to directly encode some questions about this quantization though. We can directly encode the integer rounding relations by putting the constraint that the quantized signal is exactly +-1/2 a quantization interval away from the original signal. Then we can run further analysis on the signals and compare them. For example, I wrote down a quick cosine transform. Then I ask for the worst case signal that leads to the most error on the quantized transform versus the transform of the unquantized signal. My measure of worst case performance was the sum of the difference of the two transforms. I chose this because it is tractable as a mixed integer linear program. Not all reasonable metrics one might want will be easily encodable in a mixed integer framework it seems. Some of them are maximizing over a convex function, which is naughty. (for example trying to maximize the L2 error $\sum|x-y|^2$ )

In a variant of this, it is also possible to directly encode the digital signal process in terms of logic gate construction and compare that to the intended analog transform, although this will be a great deal more computational expensive.

This is interesting as a relatively straightforward technique for the analysis of quantization errors.

This also might be an interesting place to use the techniques of Vanderbei https://vanderbei.princeton.edu/tex/ffOpt/ffOptMPCrev4.pdf . He does a neato trick where he partially embeds the FFT algorithm into an optimization problem by adding auxiliary variables. Despite the expense of adding these variables, it greatly increases the sparsity of the constraint matrices, which will probably be a win. I wonder if one might do something similar with a Fast Multipole Method , Barnes Hut, or Wavelet transform? Seems likely. Would be neat, although I’m not sure what for. I was thinking of simulating the coulomb gas. That seems like a natural choice. Oooh. I should do that.

## Solving the XY Model using Mixed Integer Optimization in Python

There are many problems in physics that take the form of minimizing the energy. Often this energy is taken to be quadratic in the field. The canonical example is electrostatics. The derivative of the potential $\phi$ gives the electric field E. The energy is given as $\int (|\nabla \phi|^2 + \phi \rho) d^3 x$. We can encode a finite difference version of this (with boundary conditions!) directly into a convex optimization modelling language like so.

It is noted rarely in physics, but often in the convex optimization world that the barrier between easy and hard problems is not linear vs. nonlinear, it is actually more like convex vs. nonconvex. Convex problems are those that are bowl shaped, on round domains. When your problem is convex, you can’t get caught in valleys or on corners, hence greedy local methods like gradient descent and smarter methods work to find the global minimum. When you differentiate the energy above, it results in the linear Laplace equations $\nabla^2 \phi = \rho$. However, from the perspective of solvability, there is not much difference if we replace the quadratic energy with a convex alternative.

Materials do actually have non-linear permittivity and permeability, this may be useful in modelling that. It is also possible to consider the convex relaxation of truly hard nonlinear problems and hope you get the echoes of the phenomenology that occurs there.

Another approach is to go mixed integer. Mixed Integer programming allows you to force that some variables take on integer values. There is then a natural relaxation problem where you forget the integer variables have to be integers. Mixed integer programming combines a discrete flavor with the continuous flavor of convex programming. I’ve previously shown how you can use mixed integer programming to find the lowest energy states of the Ising model but today let’s see how to use it for a problem of a more continuous flavor.

As I’ve described previously, in the context of robotics, the non-convex constraint that variables lie on the surface of a circle can be approximated using mixed integer programming. We can mix this fairly trivially with the above to make a global solver for the minimum energy state of the XY model. The XY model is a 2d field theory where the value of the field is constrained to lie on a circle. It is a model of a number of physical systems, such as superconductivity, and is the playground for a number of interesting phenomenon, like the Kosterlitz-Thouless phase transition. Our encoding is very similar to the above except we make two copies of the field $phi$ and we then force them to lie on a circle. I’m trying to factor out the circle thing into my library cvxpy-helpers, which is definitely a work in progress.

Now, this isn’t really an unmitigated success as is. I switched to an absolute value potential because GLPK_MI needs it to be linear. ECOS_BB works with a quadratic potential, but it was not doing a great job. The commercial solvers (Gurobi, CPlex, Mosek) are supposed to be a great deal better. Perhaps switching to Julia, with it’s richer ecosystem might be a good idea too. I don’t really like how the solution of the absolute value potential looks. Also, even at such a small grid size it still takes around a minute to solve. When you think about it, it is exploring a ridiculously massive space and still doing ok. There are hundreds of binary variables in this example. But there is a lot of room for tweaking and I think the approach is intriguing.

Musings:

• Can one do steepest descent style analysis for low energy statistical mechanics or quantum field theory?
• Is the trace of the mixed integer program search tree useful for perturbative analysis? It seems intuitively reasonable that it visits low lying states
• The Coulomb gas is a very obvious candidate for mixed integer programming. Let the charge variables on each grid point = integers. Then use the coulomb potential as a quadratic energy. The coulomb gas is dual to the XY model. Does this exhibit itself in the mixed integer formalism?
• Nothing special about the circle. It is not unreasonable to make piecewise linear approximations or other convex approximations of the sphere or of Lie groups (circle is U(1) ). This is already discussed in particular about SO(3) which is useful in robotic kinematics and other engineering topics.

By absolute value potential, I mean using |del phi| as compared to a more ordinary quadratic |del phi|2.

This is where I’m getting confused. As you say later, you are actually using two fields, phi_x and phi_y. So I’m guessing your potential is the “L1 norm”

|del phi| = |del phi_x| + |del phi_y|

? This is the only linear thing I can think of.

I don’t feel that the exact specifics of the XY model actually matter all the much.

You should be careful here though. A key point in the XY model is the O(2) symmetry of the potential: you can multiply the vector (phi_x,phi_y) by a 2D rotation matrix and the Hamiltonian is unchanged. You have explicitly broken this symmetry down to Z_4 if your potential is as I have written above. In this case, the results of the famous JKKN paper and this followup by Kadanoff suggest that you’ll actually get a phase transition of the so-called “Ashkin-Teller” universality class. These are actually closely related to the Kosterlitz-Thouless transitions of the XY model; the full set of Ashkin-Teller phase transitions actually continuously link the XY transition with that of two decoupled Ising models.

You should still get an interesting phase transition in any case! Just wanted to give some background, as the physics here is extremely rich. The critical exponents you see will be different from the XY model, and you will actually get an ordered Z_4 phase at low temperatures rather than the quasi-long range order seen in the low temperature phase of the XY model. (You should be in the positive h_4 region of the bottom phase diagram of Figure 1 of the linked JKKN paper.)”

These are some interesting points and references.