Categorical Combinators for Convex Optimization and Model Predictive Control using Cvxpy

We’re gonna pump this well until someone MAKES me stop.

This particular example is something that I’ve been trying to figure out for a long time, and I am pleasantly surprised at how simple it all seems to be. The key difference with my previous abortive attempts is that I’m not attempting the heavy computational lifting myself.

We can take pointful DSLs and convert them into point-free category theory inspired interface. In this case a very excellent pointful dsl for convex optimization: cvxpy. Some similar and related posts converting dsls to categorical form

A convex optimization problem optimizes a convex objective function with constraints that define a convex set like polytopes or balls. They are polynomial time tractable and shockingly useful. We can make a category out of convex optimization problems. We consider some variables to be “input” and some to be “output”. This choice is somewhat arbitrary as is the case for many “relation” feeling things that aren’t really so rigidly oriented.

Convex programming problems do have a natural notion of composition. Check out the last chapter of Rockafeller, where he talks about the convex algebra of bifunctions. Instead of summing over the inner composition variable like in Vect \sum_j A_{ij}B_{jk}, or existentializing like in Rel \{ (a,c) |\exists b. (a,b)\in A, (b,c) \in B \}, we instead minimize over the inner composition variable min_y A(x,y) + B(y,z). These are similar operations in that they all produce bound variables.

The identity morphism is just the simple constraint that the input variables equal to output variables with an objective function of 0. This is an affine constraint, hence convex.

In general, if we ignore the objective part entirely by just setting it to zero, we’re actually working in a very computationally useful subcategory of Rel, ConvexRel, the category of relations which are convex sets. Composition corresponds to an existential operation, which is quickly solvable by convex optimization techniques. In operations research terms, these are feasibility problems rather than optimization problems. Many of the combinators do nothing to the objective.

The monoidal product just stacks variables side by side and adds the objectives and combines the constraints. They really are still independent problems. Writing things in this way opens up a possibility for parallelism. More on that some other day.

We can code this all up in python with little combinators that return the input, output, objective, constraintlist. We need to hide these in inner lambdas for appropriate fresh generation of variables.

Now for a somewhat more concrete example: Model Predictive control of an unstable (linearized) pendulum.

Model predictive control is where you solve an optimization problem of the finite time rollout of a control system online. In other words, you take measurement of the current state, update the constraint in an optimization problem, ask the solver to solve it, and then apply the force or controls that the solver says is the best.

This gives the advantage over the LQR controller in that you can set hard inequality bounds on total force available, or positions where you wish to allow the thing to go. You don’t want your system crashing into some wall or falling over some cliff for example. These really are useful constraints in practice. You can also include possibly time dependent aspects of the dynamics of your system, possibly helping you model nonlinear dynamics of your system.

I have more posts on model predictive control here http://www.philipzucker.com/model-predictive-control-of-cartpole-in-openai-gym-using-osqp/ http://www.philipzucker.com/flappy-bird-as-a-mixed-integer-program/

Here we model the unstable point of a pendulum and ask the controller to find forces to balance the pendulum.

We can interpret the controller in GraphCat in order to produce a diagram of the 10 step lookahead controller. This is an advantage of the categorical style a la compiling to categories

We can also actually run it in model predictive control configuration in simulation.

And some curves. How bout that.

Bits and Bobbles

LazySets https://github.com/JuliaReach/LazySets.jl

ADMM – It’s a “lens”. I’m pretty sure I know how to do it pointfree. Let’s do it next time.

The minimization can be bubbled out to the top is we are always minimizing. If we mix in maximization, then we can’t and we’re working on a more difficult problem. This is similar to what happens in Rel when you have relational division, which is a kind of universal quantification \forall . Mixed quantifier problems in general are very tough. These kinds of problems include games, synthesis, and robustness. More on this some other day.

Convex-Concave programming minimax https://web.stanford.edu/~boyd/papers/pdf/dccp_cdc.pdf https://web.stanford.edu/class/ee364b/lectures/cvxccv.pdf

The minimization operation can be related to the summation operation by the method of steepest descent in some cases. The method of steepest descent approximates a sum or integral by evaluating it at it’s most dominant position and expanding out from there, hence converts a linear algebra thing into an optimization problem. Examples include the connection between statistical mechanics and thermodynamics and classical mechanics and quantum mechanics.

Legendre Transformation ~ Laplace Transformation via steepest descent https://en.wikipedia.org/wiki/Convex_conjugate yada yada, all kinds of good stuff.

Intersection is easy. Join/union is harder. Does MIP help?

Quantifier elimination

MIP via adjunction

Naive Synthesis of Sorting Networks using Z3Py

As a simple extension of verifying the sorting networks from before, we can synthesize optimally small sorting networks. The “program” of the sorting network is specified by a list of tuples of the elements we wish to compare and swap in order. We just generate all possible sequences of comparison operations and ask z3 to try verifying. If z3 says it verifies, we’re done.

Here are some definitions for running the thing

and here is a simple generating thing for all possible pairs.

As is, this is astoundingly slow. Truly truly abysmally slow. The combinatorics of really naively search through this space is abysmal. I doubt you’re going to get more than a network of size 6 out of this as is.

Some possible optimizations: early pruning of bad networks with testing, avoiding ever looking at obviously bad networks. Maybe a randomized search might be faster if one doesn’t care about optimality. We could also ask z3 to produce networks.

For more on program synthesis, check out Nadia Polikarpova’s sick course here.

https://github.com/nadia-polikarpova/cse291-program-synthesis

Rough Ideas on Categorical Combinators for Model Checking Petri Nets using Cvxpy

Petri nets are a framework for modeling dynamical systems that is very intuitive to some people. The vanilla version is that there are discrete tokens at nodes on a graph representing resources of some kind and tokens can be combined according to the firing of transition rules into new tokens in some other location.

This is a natural generalization of chemical reaction kinetics, where tokens are particular kinds of atoms that need to come together. It also is a useful notion for computer systems, where tokens represent some computational resource.

To me, this becomes rather similar to a flow problem or circuit problem. Tokens feel a bit like charge transitions are a bit like current (although not necessarily conservative). In a circuit, one can have such a small current that the particulate nature of electric current in terms of electrons is important. This happens for shot noise or for coulomb blockade for example.

If the number of tokens is very large, it seems intuitively sensible to me that one can well approximate the behavior by relaxing to a continuum. Circuits have discrete electrons and yet are very well approximated by ohm’s laws and the like. Populations are made of individuals, and yet in the thousands their dynamics are well described by differential equations.

It seems to me that mixed integer programming is a natural fit for this problem. Mixed integer programming has had its implementations and theory heavily refined for over 70 years so now very general purpose and performant off the shelf solvers are available. The way mixed integer programs are solved is by considering their quickly solved continuous relaxation (allowing fractional tokens and fractional transitions more akin to continuous electrical circuit flow) and using this information to systematically inform a discrete search process. This seems to me a reasonable starting approximation. Another name for petri nets is Vector Addition Systems, which has more of the matrix-y flavor.

We can encode a bounded model checking for reachability of a petri net into a mixed integer program fairly easily. We use 2-index variables, the first of which will be used for time step. We again turn to the general purpose functional way of encoding pointful dsls into pointfree ones as I have done here and here. The key point is that you need to be careful where you generate fresh variables. This is basically copy catting my posts here. http://www.philipzucker.com/categorical-combinators-for-graphviz-in-python/ http://www.philipzucker.com/a-sketch-of-categorical-relation-algebra-combinators-in-z3py/

I’m like 70% sure what I did here makes sense, but I’m pretty sure the general idea makes sense with some fiddling.

The big piece is the weighted_block function. It let’s you build a combinator with an internal state and input and output flow variables. You give matrices with entries for every possible transition. Whether transitions occurred between t and t+1 is indicated by integer variables. There is also possible accumulation of tokens at nodes, which also requires integer variables. Perhaps we’d want to expose the token state of the nodes to the outside too?

Weighted block schematically looks something like this

We can also get out a graphical representation of the net by reinterpreting our program into GraphCat. This is part of the power of the categorical interface. http://www.philipzucker.com/categorical-combinators-for-graphviz-in-python/

When I talked to Zach about this, he seemed skeptical that MIP solvers wouldn’t eat die a horrible death if you threw a moderately large petri net into them. Hard to say without trying.

Thoughts

There is an interesting analogy to be found with quantum field theory in that if you lift up to considering distributions of tokens, it looks like an occupation number representation. See Baez. http://math.ucr.edu/home/baez/stoch_stable.pdf

If you relax the integer constraint and the positivity constraints, this really is a finite difference formulation for capacitor circuits. The internal states would then be the charge of the capacitor. Would the positivity constraint be useful for diodes?

I wonder how relevant the chunky nature of petri nets might be for considering superconducting circuits, which have quantization of quantities from quantum mechanical effects.

Cvxpy let’s you describe convex regions. We can use this to implement a the convex subcategory of Rel which you can ask reachability questions. Relational division won’t work probably. I wonder if there is something fun there with respect the the integerizing operation and galois connections.

Edit: I should have googled a bit first. https://www.sciencedirect.com/science/article/pii/S0377221705009240 mathemtical programming tecchniques for petri net reachability. So it has been tried, and it sounds like the results weren’t insanely bad.

Fiddling around with validated ODE integration, Sum of Squares, Taylor Models.

As I have gotten more into the concerns of formal methods, I’ve become unsure that ODEs actually exist. These are concerns that did not bother me much when I defined myself as being more in the physics game. How times change. Here’s a rough cut.

A difficulty with ODE error analysis is that it is very confusing how to get the error on something you are having difficulty approximating in the first place.

If I wanted to know the error of using a finite step size dt vs a size dt/10, great. Just compute both and compare. However, no amount of this seems to bootstrap you down to the continuum. And so I thought, you’re screwed in regards to using numerics in order to get true hard facts about the true solution. You have to go to paper and pencil considerations of equations and variables and epsilons and deltas and things. It is now clearer to me that this is not true. There is a field of verified/validated numerics.

A key piece of this seems to be interval arithmetic. https://en.wikipedia.org/wiki/Interval_arithmetic An interval can be concretely represented by its left and right point. If you use rational numbers, you can represent the interval precisely. Interval arithmetic over approximates operations on intervals in such a way as to keep things easily computable. One way it does this is by ignoring dependencies between different terms. Check out Moore et al’s book for more.

What switching over to intervals does is you think about sets as the things you’re operating on rather than points. For ODEs (and other things), this shift of perspective is to no longer consider individual functions, but instead sets of functions. And not arbitrary extremely complicated sets, only those which are concretely manipulable and storable on a computer like intervals. Taylor models are a particular choice of function sets. You are manipulating an interval tube around a finite polynomial. If during integration / multiplication you get higher powers, truncate the polynomials by dumping the excess into the interval term. This keeps the complexity under wraps and closes the loop of the descriptive system.

If we have an iterative, contractive process for getting better and better solutions of a problem (like a newton method or some iterative linear algebra method), we can get definite bounds on the solution if we can demonstrate that a set maps into itself under this operation. If this is the case and we know there is a unique solution, then it must be in this set.

It is wise if at all possible to convert an ODE into integral form. \dot{x}= f(x,t) is the same as x(t) = x_0 + \int f(x,t)dt.

For ODEs, the common example of such an operation is known as Picard iteration. In physical terms, this is something like the impulse approximation / born approximation. One assumes that the ODE evolves according to a known trajectory x_0(t) as a first approximation. Then one plugs in the trajectory to the equations of motion f(x_0,t) to determine the “force” it would feel and integrate up all this force. This creates a better approximation x_1(t) (probably) which you can plug back in to create an even better approximation.

If we instead do this iteration on an intervally function set / taylor model thing, and can show that the set maps into itself, we know the true solution lies in this interval. The term to search for is Taylor Models (also some links below).

I was tinkering with whether sum of squares optimization might tie in to this. I have not seen SOS used in this context, but it probably has or is worthless.

An aspect of sum of squares optimization that I thought was very cool is that it gives you a simple numerical certificate that confirms that at the infinitude of points for which you could evaluate a polynomial, it comes out positive. This is pretty cool. http://www.philipzucker.com/deriving-the-chebyshev-polynomials-using-sum-of-squares-optimization-with-sympy-and-cvxpy/

But that isn’t really what makes Sum of squares special. There are other methods by which to do this.

There are very related methods called DSOS and SDSOS https://arxiv.org/abs/1706.02586 which are approximations of the SOS method. They replace the SDP constraint at the core with a more restrictive constraint that can be expressed with LP and socp respectively. These methods lose some of the universality of the SOS method and became basis dependent on your choice of polynomials. DSOS in fact is based around the concept of a diagonally dominant matrix, which means that you should know roughly what basis your certificate should be in.

This made me realize there is an even more elementary version of DSOS that perhaps should have been obvious to me from the outset. Suppose we have a set of functions we already know are positive everywhere on a domain of interest. A useful example is the raised chebyshev polynomials. https://en.wikipedia.org/wiki/Chebyshev_polynomials The appropriate chebyshev polynomials oscillate between [-1,1] on the interval [-1,1], so if you add 1 to them they are positive over the whole interval [-1,1]. Then nonnegative linear sums of them are also positive. Bing bang boom. And that compiles down into a simple linear program (inequality constraints on the coefficients) with significantly less variables than DSOS. What we are doing is restricting ourselves to completely positive diagonal matrices again, which are of course positive semidefinite. It is less flexible, but it also has more obvious knobs to throw in domain specific knowledge. You can use a significantly over complete basis and finding this basis is where you can insert your prior knowledge.

It is not at all clear there is any benefit over interval based methods.

Here is a sketch I wrote for x'=x which has solution e^t. I used raised chebyshev polynomials to enforce positive polynomial constraints and tossed in a little taylor model / interval arithmetic to truncate off the highest terms.

I’m using my helper functions for translating between sympy and cvxpy expressions. https://github.com/philzook58/cvxpy-helpers Sympy is great for collecting up the coefficients on terms and polynomial multiplication integration differentiation etc. I do it by basically creating sympy matrix variables corresponding to cvxpy variables which I compile to cvxpy expressions using lambdify with an explicit variable dictionary.

Seems to work, but I’ve been burned before.

man, LP solvers are so much better than SDP solvers


Random junk and links: Should I be more ashamed of dumps like this? I don’t expect you to read this.

https://github.com/JuliaIntervals/TaylorModels.jl

https://github.com/JuliaIntervals

Functional analysis by and large analyzes functions by analogy with more familiar properties of finite dimensional vector spaces. In ordinary 2d space, it is convenient to work with rectangular regions or polytopic regions.

Suppose I had a damped oscillator converging to some unknown point. If we can show that every point in a set maps within the set, we can show that the function

One model of a program is that it is some kind of kooky complicated hyper nonlinear discrete time dynamical system. And vice versa, dynamical systems are continuous time programs. The techniques for analyzing either have analogs in the other domain. Invariants of programs are essential for determining correctness properties of loops. Invariants like energy and momentum are essential for determining what physical systems can and cannot do. Lyapunov functions demonstrate that control systems are converging to the set point. Terminating metrics are showing that loops and recursion must eventually end.

If instead you use interval arithmetic for a bound on your solution rather than your best current solution, and if you can show the interval maps inside itself, then you know that the iterative process must converge inside of the interval, hence that is where the true solution lies.

A very simple bound for an integral \int_a^b f(x)dx is \int max_{x \in [a,b]}f(x)  dx= max_{x \in [a,b]}f(x) \int dx = max_{x \in [a,b]}f(x) (b - a)

The integral is a very nice operator. The result of the integral is a positive linear sum of the values of a function. This means it plays nice with inequalities.

Rigorously Bounding ODE solutions with Sum of Squares optimization – Intervals

Intervals – Moore book. Computational functional analaysis. Tucker book. Coqintervals. fixed point theorem. Hardware acceleration? Interval valued functions. Interval extensions.

  • Banach fixed point – contraction mapping
  • Brouwer fixed point
  • Schauder
  • Knaster Tarski

Picard iteration vs? Allowing flex on boundary conditions via an interval?

Interval book had an interesting integral form for the 2-D

sympy has cool stuff

google scholar search z3, sympy brings up interesting things

https://moorepants.github.io/eme171/resources.html

The pydy guy Moore has a lot of good shit. resonance https://www.moorepants.info/blog/introducing-resonance.html

Lyapunov functions. Piecewise affine lyapunov funcions. Are lyapunov functions kind of like a PDE? Value functions are pdes. If the system is piecewise affine we can define a grid on the same piecewise affine thingo. Compositional convexity. Could we use compositional convexity + Relu style piecewise affinity to get complicated lyapunov functions. Lyapunov functions don’t have to be continiuous, they just have to be decreasing. The Lie derivative wrt the flow is always negative, i.e gradeint of function points roughly in direction of flow. trangulate around equilibrium if you want to avoid quadratic lyapunov. For guarded system, can relax lyapunov constrain outside of guard if you tighten inside guard. Ax>= 0 is guard. Its S-procedure.

Best piecewise approximation with point choice?

http://theory.stanford.edu/~arbrad/papers/lr.ps linear ranking functions

Connection to petri nets?

https://ths.rwth-aachen.de/wp-content/uploads/sites/4/hs_lecture_notes.pdf

https://www.cs.colorado.edu/~xich8622/papers/rtss12.pdf

KoAt, LoAT. AProve. Integer transition systems. Termination analysis. Loops?

https://lfcps.org/pub/Pegasus.pdf darboux polynomials. barreir certificates. prelle-singer method. first integrals.

method 1. arbitrary polynomial p(t). calculate p'(t). find coefficents that make p'(t) = 0 by linear algebra. Idea: near invaraints? min max|p'(t) |

Lie Algebra method

https://www.researchgate.net/publication/233653257_Solving_Differential_Equations_by_Symmetry_Groups sympy links this paper. Sympy has some lie algebra stuff in there

https://www-users.math.umn.edu/~olver/sm.html Peter Olver tutorial

http://www-users.math.umn.edu/~olver/talk.html olver talks

https://www-sop.inria.fr/members/Evelyne.Hubert/publications/PDF/Hubert_HDR.pdf

https://www.cs.cmu.edu/~aplatzer/logic/diffinv.html andre platzer. Zach says Darboux polynomials?

https://sylph.io/blog/math.html

Books: Birhoff and Rota, Guggenheimer, different Olver books, prwctical guide to invaraints https://www.amazon.com/Practical-Invariant-Monographs-Computational-Mathematics/dp/0521857015

Idea: Approximate invariants? At least this ought to make a good coordinate system to work in where the dynamics are slow. Like action-angle and adiabatic transformations. Could also perhaps bound the

Picard Iteration

I have a method that I’m not sure is ultimately sound. The idea is to start with

Error analysis most often uses an appeal to Taylor’s theorem and Taylor’s theorem is usually derived from them mean value theorem or intermediate value theorem. Maybe that’s fine. But the mean value theorem is some heavy stuff. There are computational doo dads that use these bounds + interval analysis to rigorously integrate ODEs. See https://github.com/JuliaIntervals/TaylorModels.jl

The beauty of sum of squares certificates is that they are very primitive proofs of positivity for a function on a domain of infinitely many values. If I give you a way to write an expression as a sum of square terms, it is then quite obvious that it has to be always positive. This is algebra rather than analysis.

y(t) = \lambda(t) \and \lambda(t) is SOS \Rightarrow \forall t. y(t) >= 0. Sum of squares is a kind of a quantifier elimination method. The reverse direction of the above implication is the subject of the positivstullensatz, a theorem of real algebraic geometry. At the very least, we can use the SOS constraint as a relaxation of the quantified constraint.

So, I think by using sum of squares, we can turn a differential equation into a differential inequation. If we force the highest derivative to be larger than the required differential equation, we will get an overestimate of the required function.

A function that is dominated by another in derivative, will be dominated in value also. You can integrate over inequalities (I think. You have to be careful about such things. ) \forall t. \frac{dx}{dt} >= \frac{dx}{dt} \Rightarrow x(t) – x(0) >= y(t) – y(0) $

The derivative of a polynomial can be thought of as a completely formal operation, with no necessarily implied calculus meaning. It seems we can play a funny kind of shell game to avoid the bulk of calculus.

As an example, let’s take \frac{dx}{dt}=y y(0) = 1 with the solution y = e^t. e is a transcendental

The S-procedure is trick by which you can relax a sum of squares inequality to only need to be enforced in a domain. If you build a polynomials function that describes the domain, it that it is positive inside the domain and negative outside the domain, you can add a positive multiple of that to your SOS inequalities. Inside the domain you care about, you’ve only made them harder to satisfy, not easier. But outside the domain you have made it easier because you can have negative slack.

For the domain t \in [0,1] the polynomial (1 - t)t works as our domain polynomial.

We parametrize our solution as an explicit polynomial x(t) = a_0 + a_1 t + a_2 t^2 + .... It is important to note that what follows is always linear in the a_i.

\frac{dx}{dt} - x >= 0 can be relaxed to \frac{dx}{dt} - x(t) + \lambda(t)(1-t)t >= 0 with \lambda(t) is SOS.

So with that we get a reasonable formulation of finding a polynomial upper bound solution of the differential equation

\min x(1)

\frac{dx}{dt} - x(t) + \lambda_1(t)(1-t)t =  \lambda_2(t)

\lambda_{1,2}(t) is SOS.

And here it is written out in python using my cvxpy-helpers which bridge the gap between sympy polynomials and cvxpy.

We can go backwards to figure out sufficient conditions for a bound. We want x_u(t_f) \gte x(t_f). It is sufficient that \forall t. x_u(t) \gte x(t). For this it is sufficient that \forall t. x_u'(t)  >= x'(t) \and x_u(t_i) >= x(t_i) . We follow this down in derivative until we get the lowest derivative in the differential equation. Then we can use the linear differential equation itself x^{(n)}(t) = \sum_i a_i(t) x^{(i)}(t). x_u^{(n)}(t) >= \sum max(a_i x^{(i)}_u, x^{(i)}_l).

a(t) * x(t) >= \max a(t) x_u(t), a(t) x_l(t). This accounts for the possibility of terms changing signs. Or you could separate the terms into regions of constant sign.

The minimization characterization of the bound is useful. For any class of functions that contains our degree-d polynomial, we can show that the minimum of the same optimization problem is less than or equal to our value.

Is the dual value useful? The lower bound on the least upper bound

Doesn’t seem like the method will work for nonlinear odes. Maybe it will if you relax the nonlinearity. Or you could use perhaps a MIDSP to make piecewise linear approximations of the nonlinearity?

It is interesting to investigtae linear programming models. It is simpler and more concrete to examine how well different step sizes approximate each other rather than worry about the differential case.

We can explicit compute a finite difference solution in the LP, which is a power that is difficult to achieve in general for differential equations.

We can instead remove the exact solution by a convservative bound.

While we can differentiate through an equality, we can’t differentiate through an inequality. Differentiation involves negation, which plays havoc with inequalities. We can however integrate through inequalities.

\frac{dx}{dt} >= f \and x(0) >= a \Rightarrow x(t) >= \int^t_0 f(x) + a$

As a generalization we can integrate \int p(x) over inequalities as long as p(x) \gte 0

In particular \forall t. \frac{dx}{dt} >= \frac{dx}{dt} \Rightarrow x(t) – x(0) >= y(t) – y(0) $

We can convert a differential equation into a differential inequation. It is not entirely clear to me that there is a canonical way to do this. But it works to take the biggest.

\frac{dx}{dt} = A(t)x + f(t)

Is there a tightest

We can integrate

Here let’s calculate e

https://tel.archives-ouvertes.fr/tel-00657843v2/document Thesis on ODE bounds in Isabelle

not so good. very small

Stupid Z3Py Tricks Strikes Back: Verifying a Keras Neural Network

Neural networks are all the rage these days. One mundane way of looking at neural networks is that they are a particular class of parametrized functions . What makes them useful is:

  1. They can be used at insane scale due to their simplicity and excellent available implementations and tooling
  2. There are intuitive ways to input abstract structure and symmetry expected of a problem, for example translation symmetry, or a hierarchy of small scale pattern recognition combining into large scale structures. How this all works is very peculiar.
  3. Inspirational analogies can be drawn from nature.

Neural networks made out of just relus (rectified linear units, relu(x) = max(0,x) ) and linear layers are particularly amenable to formal analysis. Regarding the weights as fixed (once a network has be trained), the complete neural network is a piecewise linear function. The regions where it is piecewise define are polyhedra (They are defined by the hyperplanes of the relu transitions as inequality constraints). Such functions are among those the most amenable to automated rigorous analysis.

A relu

Most machine learning tasks don’t have a mathematically precise specification. What is the mathematically precise definition of a picture of a horse? We could try to come up with something (this is sort of what good old fashioned AI tried to do), but in my opinion it would be rather suspect.

Tasks that do have a precise spec are questionable areas for machine learning techniques, because how can you know that the network meets the spec? Also, one would suspect such a problem would have structure that you might be better off with a more traditional algorithmic approach.

However, there a a couple areas where one does have reasonable formal questions one might want to ask about a neural network:

  • Robustness around training and validation data. Finding Adversarial examples or proving there are none.
  • Games like Go. Alpha Go is a marriage of more traditional algorithmic approaches and machine learning. There is a core of traditional game tree search to it.
  • Control Systems – There are many control systems which we do have a reasonable formal spec one could write, such as walking robots. These systems are so high dimensional that it is difficult to derive a good controller from the spec, and hence reinforcement learning may be a reasonable option. But we would like to confirm the controller is good and not dangerous
  • Neural networks as computational accelerators. There are problems which we know how to solve, but are too slow. Neural networks can be evaluated very quickly and easily thanks to modern frameworks. It may be useful to presolve a large number of examples offline using the slow algorithm and train a neural network to give good estimates. We may be able to replace the slow thing entirely if we can verify the neural network always is good enough.

We’re going to use a neural network to fit a chebyshev polynomial. Here we’re picking a Chebyshev polynomial as our base truth because Chebyshev polynomials have some pleasant waviness to them. Why not. I like ’em. Also polynomials are easily understood by z3 as a base spec.

This example of course is a complete toy. How often do you see 1-D input space neural networks? Not often I bet.
But it’s nice for a couple reasons:

  1. Because we can visualize it.
  2. It drives home the point about neural nets being a space of piecewise linear function approximators, and how similar training is to curve fitting.
  3. It’s simple enough that z3 can crush it. There is a big question if z3 or other methods can scale to realistic neural nets. Modern top of the line neural nets are insanely huge. As we’ve done it here, I highly doubt it. There are special purpose SMT solvers being built for this task. Also the slightly different technology of mixed integer programming can be used and seems very promising. So this is an area of research. See links at the bottom for more.

Generally speaking, the combination of the capabilities of sympy and z3 give us access to some very intriguing possibilities. I’m not going to investigate this in detail in this post, but I are showing how you can convert a sympy derived polynomial into a python expression using lambdify, which can then be in turn used on z3 variables.

Our actual chebyshev polynomial

Here we build a very small 3 layers network using Keras. We use a least squares error and an adam optimizer. Whatever. I actually had difficulty getting nice results out for higher order chebyshev polynomials. Neural networks are so fiddly.

Our neural fit of the polynomial

And here we extract the weights and reinterpret them into z3. We could also have used z3 floating point capabilities rather than reals if you’re concerned about numerical issues. It was convenient to have my layers be different sizes, so that size mismatch would throw a python error. That’s how I found out the weights are transposed by default. The code at the end extracts a found countermodel and evaluates it. If you want to feel fancier, you can also use the prove function rather than an explicit Solver() instance. Saying you proved the neural network matches the polynomial to a certain tolerance feels really good. If you look at the graphs, the edges at 1 and -1 actually have pretty bad absolute error, around 0.5.

Links

https://github.com/sisl/NeuralVerification.jl

https://arxiv.org/abs/1711.07356 – Evaluating Robustness of Neural Networks with Mixed Integer Programming

https://github.com/vtjeng/MIPVerify.jl

https://arxiv.org/abs/1702.01135 – reluplex, SMT specifically for neural networks

Categorical LQR Control with Linear Relations

Last time, I described and built an interesting expansion of describing systems via linear functions, that of using linear relations. We remove the requirement that mappings be functional (exactly one output vector for any given input vector), which extends our descriptive capability. This is useful for describing “open” pieces of a linear system like electric circuits. This blog post is about another application: linear dynamical systems and control.

Linear relations are an excellent tool for reachability/controllability. In a control system, it is important to know where it is possible to get the system to. With linear dynamics x_{t+1} = Ax_{t} + B u_{t}, an easy controllability question is “can my system reach into some subspace”. This isn’t quite the most natural question physically, but it is natural mathematically. My first impulse would be to ask “given the state starts in this little region over here, can it get to this little region over here”, but the subspace question is a bit easier to answer. It’s a little funky but it isn’t useless. It can detect if the control parameter only touches 1 of 2 independent dynamical systems for example. We can write the equations of motion as a linear relation and iterate composition on them. See Erbele for more.

There is a set of different things (and kind of more interesting things) that are under the purview of linear control theory, LQR Controllers and Kalman filters. The LQR controller and Kalman filter are roughly (exactly?) the same thing mathematically. At an abstract mathematical level, they both rely on the fact that optimization of a quadratic objective x^T Q x + r^T x + c with a linear constraints Ax=b is solvable in closed form via linear algebra. The cost of the LQR controller could be the squared distance to some goal position for example. When you optimize a function, you set the derivative to 0. This is a linear equation for quadratic objectives. It is a useful framework because it has such powerful computational teeth.

The Kalman filter does a similar thing, except for the problem of state estimation. There are measurements linear related to the state that you want to match with the prior knowledge of the linear dynamics of the system and expected errors of measurement and environment perturbations to the dynamics.

There are a couple different related species of these filters. We could consider discrete time or continuous time. We can also consider infinite horizon or finite horizon. I feel that discrete time and finite horizon are the most simple and fundamental so that is what we’ll stick with. The infinite horizon and continuous time are limiting cases.

We also can consider the dynamics to be fixed for all time steps, or varying with time. Varying with time is useful for the approximation of nonlinear systems, where there are different good nonlinear approximation (taylor series of the dynamics) depending on the current state.

There are a couple ways to solve a constrained quadratic optimization problem. In some sense, the most conceptually straightforward is just to solve for a span of the basis (the “V-Rep” of my previous post) and plug that in to the quadratic objective and optimize over your now unconstrained problem. However, it is commonplace and elegant in many respects to use the Lagrange multiplier method. You introduce a new parameter for every linear equality and add a term to the objective \lambda^T (A x - b). Hypothetically this term is 0 for the constrained solution so we haven’t changed the objective in a sense. It’s a funny slight of hand. If you optimize over x and \lambda, the equality constraint comes out as your gradient conditions on \lambda. Via this method, we convert a linearly constrained quadratic optimization problem into an unconstrained quadratic optimization problem with more variables.

Despite feeling very abstract, the value these Lagrangian variables take on has an interpretation. They are the “cost” or “price” of the equality they correspond to. If you moved the constraint by an amount Ax - b + \epsilon, you would change the optimal cost by an amount \lambda \epsilon. (Pretty sure I have that right. Units check out. See Boyd https://www.youtube.com/watch?v=FJVmflArCXc)

The Lagrange multipliers enforcing the linear dynamics are called the co-state variables. They represent the “price” that the dynamics cost, and are derivatives of the optimal value function V(s) (The best value that can be achieved from state s) that may be familiar to you from dynamic programming or reinforcement learning. See references below for more.

Let’s get down to some brass tacks. I’ve suppressed some terms for simplicity. You can also add offsets to the dynamics and costs.

A quadratic cost function with lagrange multipliers. Q is a cost matrix associated with the x state variables, R is a cost matrix for the u control variables.

C = \lambda_0 (x_0 - \tilde{x}_0) + \sum_t x_t^T Q x_t + u_t^T R u_t + \lambda_{t+1}^T (x_{t+1} - A x_t - B u_t  )

Equations of motion results from optimizing with respect to \lambda by design.

\nabla_{\lambda_{t+1}} C = x_{t+1} - A x_t - B u_t  = 0.

The initial conditions are enforced by the zeroth multiplier.

\nabla_{\lambda_0} C = x_i - x_{0} = 0

Differentiation with respect to the state x has the interpretation of backwards iteration of value derivative, somewhat related to what one finds in the Bellman equation.

\nabla_{x_t} C = Q x_{t} + A^T \lambda_{t+1} - \lambda_{t} = 0 \Longrightarrow \lambda_{t} =  A^T \lambda_{t+1} + Q x_{t}

The final condition on value derivative is the one that makes it the most clear that the Lagrange multiplier has the interpretation of the derivative of the value function, as it sets it to that.

\nabla_{x_N} C = Q x_N - \lambda_{N} = 0

Finally, differentiation with respect to the control picks the best action given knowledge of the value function at that time step.

\nabla_{u_t} C = R u_{t} - B^T \lambda_t  = 0

Ok. Let’s code it up using linear relations. Each of these conditions is a separate little conceptual box. We can build the optimal control step relation by combining these updates together

The string diagram for a single time step of control

The following is a bit sketchy. I am confident it makes sense, but I’m not confident that I have all of the signs and other details correct. I also still need to add the linear terms to the objective and offset terms to the dynamics. These details are all kind of actually important. Still, I think it’s nice to set down in preliminary blog form.

Initial conditions and final conditions. Final conditions fix the final value derivative to Qx. Initial conditions set x to some value. Should there be a leg for lambda on the initial conditions?

Bits and Bobbles

  • The code in context https://github.com/philzook58/ConvexCat/blob/master/src/LinRel.hs
  • Some of the juicier stuff is nonlinear control. Gotta walk before we can run though. I have some suspicions that a streaming library may be useful here, or a lens-related approach. Also ADMM.
  • Reminiscent of Keldysh contour. Probably a meaningless coincidence.
  • In some sense H-Rep is the analog of (a -> b -> Bool) and V-rep [(a,b)]
  • Note that the equations of motion are relational rather than functional for a control systems. The control parameters u describe undetermined variables under our control.
  • Loopback (trace) the value derivative for infinite horizon control.
  • Can solve the Laplace equation with a Poincare Steklov operator. That should be my next post.
  • There is something a touch unsatisfying even with the reduced goals of this post. Why did I have to write down the quadratic optimization problem and then hand translate it to linear relations? It’s kind of unacceptable. The quadratic optimization problem is really the most natural statement, but I haven’t figured out to make it compositional. The last chapter of Rockafellar?

References

https://arxiv.org/abs/1405.6881 Baez and Erbele – Categories in Control

http://www.scholarpedia.org/article/Optimal_control

http://www.argmin.net/2016/05/18/mates-of-costate/ – interesting analogy to backpropagation. Lens connection?

https://courses.cs.washington.edu/courses/cse528/09sp/optimality_chapter.pdf

Failing to Bound Kissing Numbers

https://en.wikipedia.org/wiki/Kissing_number

Cody brought up the other day the kissing number problem.Kissing numbers are the number of equal sized spheres you can pack around another one in d dimensions. It’s fairly self evident that the number is 2 for 1-d and 6 for 2d but 3d isn’t so obvious and in fact puzzled great mathematicians for a while. He was musing that it was interesting that he kissing numbers for some dimensions are not currently known, despite the fact that the first order theory of the real numbers is decidable https://en.wikipedia.org/wiki/Decidability_of_first-order_theories_of_the_real_numbers

I suggested on knee jerk that Sum of Squares might be useful here. I see inequalities and polynomials and then it is the only game in town that I know anything about.

Apparently that knee jerk was not completely wrong

https://arxiv.org/pdf/math/0608426.pdf

Somehow SOS/SDP was used for bounds here. I had an impulse that the problem feels SOS-y but I do not understand their derivation.

One way the problem can be formulated is by finding or proving there is no solution to the following set of equations constraining the centers x_i of the spheres. Set the central sphere at (0,0,0,…) . Make the radii 1. Then\forall i. |x_i|^2 = 2^2 and \forall i j.  |x_i - x_j|^2 \ge 2^2

I tried a couple different things and have basically failed. I hope maybe I’ll someday have a follow up post where I do better.

So I had 1 idea on how to approach this via a convex relaxation

Make a vector x = \begin{bmatrix} x_0 & y _0 & x_1 & y _1 & x_2 & y _2 & ... \end{bmatrix} Take the outer product of this vector x^T x = X Then we can write the above equations as linear equalities and inequalities on X. If we forget that we need X to be the outer product of x (the relaxation step), this becomes a semidefinite program. Fingers crossed, maybe the solution comes back as a rank 1 matrix. Other fingers crossed, maybe the solution comes back and says it’s infeasible. In either case, we have solved our original problem.

Didn’t work though. Sigh. It’s conceivable we might do better if we start packing higher powers into x?

Ok Round 2. Let’s just ask z3 and see what it does. I’d trust z3 with my baby’s soft spot.

It solves for 5 and below. Z3 grinds to a halt on N=6 and above. It ran for days doin nothing on my desktop.

Ok. A different tact. Try to use a positivstellensatz proof. If you have a bunch of polynomial inequalities and equalities if you sum polynomial multiples of these constraints, with the inequalities having sum of square multiples, in such a way to = -1, it shows that there is no real solution to them. We have the distance from origin as equality constraint and distance from each other as an inequality constraint. I intuitively think of the positivstellensatz as deriving an impossibility from false assumptions. You can’t add a bunch of 0 and positive numbers are get a negative number, hence there is no real solution.

I have a small set of helper functions for combining sympy and cvxpy for sum of squares optimization. I keep it here along with some other cute little constructs https://github.com/philzook58/cvxpy-helpers

and here is the attempted positivstellensatz.

It worked in 1-d, but did not work in 2d. At order 3 polynomials N=7, I maxed out my ram.

I also tried doing it in Julia, since sympy was killing me. Julia already has a SOS package

It was faster to encode, but it’s using the same solver (SCS), so basically the same thing.

I should probably be reducing the system with respect to equality constraints since they’re already in a Groebner basis. I know that can be really important for reducing the size of your problem

I dunno.

Blah blah blah blah A bunch of unedited trash

https://github.com/peterwittek/ncpol2sdpa Peter Wittek has probably died in an avalanche? That is very sad.

These notes

https://web.stanford.edu/class/ee364b/lectures/sos_slides.pdf

Positivstullensatz.

kissing number

Review of sum of squares

minimimum sample as LP. ridiculous problem
min t
st. f(x_i) – t >= 0

dual -> one dual variable per sample point
The only dual that will be non zero is that actually selecting the minimum.

Hm. Yeah, that’s a decent analogy.

How does the dual even have a chance of knowing about poly airhtmetic?
It must be during the SOS conversion prcoess. In building the SOS constraints,
we build a finite, limittted version of polynomial multiplication
x as a matrix. x is a shift matrix.
In prpducing the characterstic polynomial, x is a shift matrix, with the last line using the polynomial
known to be zero to
eigenvectors of this matrix are zeros of the poly.

SOS does not really on polynomials persay. It relies on closure of the suqaring operaiton

maybe set one sphere just at x=0 y = 2. That breaks some symmettry

set next sphere in plane something. random plane through origin?

order y components – breaks some of permutation symmettry.

no, why not order in a random direction. That seems better for symmettry breaking

Neural Networks with Weighty Lenses (DiOptics?)

I wrote a while back how you can make a pretty nice DSL for reverse mode differentiation based on the same type as Lens. I’d heard some interesting rumblings on the internet around these ideas and so was revisiting them.

Composition is defined identically for reverse mode just as it is for lens.

The forward computation shares info with the backwards differential propagation, which corresponds to a transposed Jacobian

After chewing on it a while, I realized this really isn’t that exotic. How it works is that you store the reverse mode computation graph, and all necessary saved data from the forward pass in the closure of the (dy -> dx). I also have a suspicion that if you defunctionalized this construction, you’d get the Wengert tape formulation of reverse mode ad.

Second, Lens is just a nice structure for bidirectional computation, with one forward pass and one backward pass which may or may not be getting/setting. There are other examples for using it like this.

It is also pretty similar to the standard “dual number” form type FAD x dx y dy = (x,dx)->(y,dy) for forward mode AD. We can bring the two closer by a CPS/Yoneda transformation and then some rearrangement.

and meet it in the middle with

I ended the previous post somewhat unsatisfied by how ungainly writing that neural network example was, and I called for Conal Elliot’s compiling to categories plugin as a possible solution. The trouble is piping the weights all over the place. This piping is very frustrating in point-free form, especially when you know it’d be so trivial pointful. While the inputs and outputs of layers of the network compose nicely (you no longer need to know about the internal computations), the weights do not. As we get more and more layers, we get more and more weights. The weights are in some sense not as compositional as the inputs and outputs of the layers, or compose in a different way that you need to maintain access to.

I thought of a very slight conceptual twist that may help.

The idea is we keep the weights out to the side in their own little type parameter slots. Then we define composition such that it composes input/outputs while tupling the weights. Basically we throw the repetitive complexity appearing in piping the weights around into the definition of composition itself.

These operations are easily seen as 2 dimensional diagrams.

Three layers composed, exposing the weights from all layers
The 2-D arrow things can be built out of the 1-d arrows of the original basic AD lens by bending the weights up and down. Ultimately they are describing the same thing

Here’s the core reverse lens ad combinators

And here are the two dimensional combinators. I tried to write them point-free in terms of the combinators above to demonstrate that there is no monkey business going on. We

I wonder if this is actually nice?

I asked around and it seems like this idea may be what davidad is talking about when he refers to dioptics

http://events.cs.bham.ac.uk/syco/strings3-syco5/slides/dalrymple.pdf

Perhaps this will initiate a convo.

Edit: He confirms that what I’m doing appears to be a dioptic. Also he gave a better link http://events.cs.bham.ac.uk/syco/strings3-syco5/papers/dalrymple.pdf

He is up to some interesting diagrams

https://twitter.com/davidad/status/1179760373030801408?s=20

Bits and Bobbles

  • Does this actually work or help make things any better?
  • Recurrent neural nets flip my intended role of weights and inputs.
  • Do conv-nets naturally require higher dimensional diagrams?
  • This weighty style seems like a good fit for my gauss seidel and iterative LQR solvers. A big problem I hit there was getting all the information to the outside, which is a similar issue to getting the weights around in a neural net.

Flappy Bird as a Mixed Integer Program

My birds.

Mixed Integer Programming is a methodology where you can specify convex (usually linear) optimization problems that include integer/boolean variables.

Flappy Bird is a game about a bird avoiding pipes.

We can use mixed integer programming to make a controller for Flappy Bird. Feel free to put this as a real-world application in your grant proposals, people.

While thinking about writing a MIP for controlling a lunar lander game, I realized how amenable to mixed integer modeling flappy bird is. Ben and I put together the demo on Saturday. You can find his sister blog post here.

The bird is mostly in free fall, on parabolic trajectories. This is a linear dynamic, so it can directly be expressed as a linear constraint. It can discretely flap to give itself an upward impulse. This is a boolean force variable at every time step. Avoiding the ground and sky is a simple linear constraint. The bird has no control over its x motion, so that can be rolled out as concrete values. Because of this, we can check what pipes are relevant at time points in the future and putting the bird in the gap is also a simple linear constraint.

There are several different objectives one might want to consider and weight. Perhaps you want to save the poor birds energy and minimize the sum of all flaps cvx.sum(flap). Or perhaps you want to really be sure it doesn’t hit any pipes by maximizing the minimum distance from any pipe. Or perhaps minimize the absolute value of the y velocity, which is a reasonable heuristic for staying in control. All are expressible as linear constraints. Perhaps you might want a weighted combo of these. All things to fiddle with.

There is a pygame flappy bird clone which made this all the much more slick. It is well written and easy to understand and modify. Actually figuring out the appropriate bounding boxes for pipe avoidance was finicky. Figuring out the right combo of bird size and pipe size is hard, combined with computer graphics and their goddamn upside down coordinate system.

We run our solver in a model predictive control configuration. Model predictive control is where you roll out a trajectory as an optimization problem and resolve it at every action step. This turns an open loop trajectory solve into a closed loop control, at the expense of needing to solve a perhaps very complicated problem in real time. This is not always feasible.

My favorite mip modeling tool is cvxpy. It gives you vectorized constraints and slicing, which I love. More tools should aspire to achieve numpy-like interfaces. I’ve got lots of other blog posts using this package which you can find in my big post list the side-bar 👀.

The github repo for the entire code is here: https://github.com/philzook58/FlapPyBird-MPC

And here’s the guts of the controller:

I think it is largely self explanatory but here are some notes. The dt = t//10 + 1 thing is about decreasing our time resolution the further out from the current time step. This increases the time horizon without the extra computation cost. Intuitively modeling accuracy further out in time should matter less. The last_solution stuff is for in case the optimization solver fails for whatever reason, in which case it’ll follow open-loop the last trajectory it got.

Bits and Bobbles

  • We changed the dynamics slightly from the python original to make it easier to model. We found this did not change the feel of the game. The old dynamics were piecewise affine though, so are also modelable using mixed integer programming. http://groups.csail.mit.edu/robotics-center/public_papers/Marcucci18.pdf . Generally check out the papers coming out of the Tedrake group. They are sweet. Total fanboy over here.
  • The controller as is is not perfect. It fails eventually, and it probably shouldn’t. A bug? Numerical problems? Bad modeling of the pipe collision? A run tends to get through about a hundred pipes before something gets goofy.
  • Since we had access to the source code, we could mimic the dynamics very well. How robust is flappy bird to noise and bad modeling? We could add wind, or inaccurate pipe data.
  • Unions of Convex Region. Giving the flappy bird some x position control would change the nature of the problem. We could easily cut up the allowable regions of the bird into rectangles, and represent the total space as a union of convex regions, which is also MIP representable.
  • Verification – The finite difference scheme used is crude. It is conceivable for the bird to clip a pipe. Since basically we know the closed form of the trajectories, we could verify that the parabolas do not intersect the regions. For funzies, maybe use sum of squares optimization?
  • Realtime MIP. Our solver isn’t quite realtime. Maybe half real time. One might pursue methods to make the mixed integer program faster. This might involve custom branching heuristics, or early stopping. If one can get the solver fast enough, you might run the solver in parallel and only query a new path plan every so often.
  • 3d flappy bird? Let the bird rotate? What about a platformer (Mario) or lunar lander? All are pretty interesting piecewise affine systems.
  • Is this the best way to do this? Yes and no. Other ways to do this might involve doing some machine learning, or hardcoding a controller that monitors the pipe locations and has some simple feedback. You can find some among the forks of FlapPyBird. I have no doubt that you could write these quickly, fiddle with them and get them to work better and faster than this MIP controller. However, for me there is a difference between could work and should work. You can come up with a thousand bizarre schemes that could work. RL algorithms fall in this camp. But I have every reason to believe the MIP controller should work, which makes it easier to troubleshoot.

The Classical Coulomb Gas as a Mixed Integer Quadratic Program

The coulomb gas is a model of electrostatics where you take the discreteness of charge into account. That is what makes it hard compared to the continuous charge problem. Previously, I’ve used mixed integer programming to find lowest energy states of the ising model. This is even more obvious application of mixed integer programming to a physics model.

We ordinarily consider electric charge to be a continuum, but it isn’t. It comes in chunks of the electron charge. Historically, people didn’t even know that for quite a while. It is usually a reasonable approximation for most purposes to consider electric charge to be continuous

If you consider a network of capacitors cooled to the the level that there is not enough thermal energy to borrow to get an electron to jump, the charges on the capacitors will be observably discretized. With a sufficiently slow cooling to this state, the charges should arrange themselves into the lowest energy state.

The coulomb gas model also is of interest due to its connections to the XY model, which I’ve taken a stab at with mixed integer programming before. The coulomb gas models the energy of vortices in that model. I think the connection between the models actually requires a statistical or quantum mechanical context though, whereas we’ve been looking at the classical energy minimization.

We can formulate the classical coulomb gas problem very straightforwardly as a mixed integer quadratic program. We can easily include an externally applied field and a charge conservation constraint if we so desire within the framework.

We write this down in python using the cvxpy library, which has a built in free MIQP solver, albeit not a very good one. Commercial solvers are probably quite a bit better.

A plot of charge in a constant external electric field.

The results seems reasonable. It makes sense for charge to go in the direction of the electric field. Going to the corners makes sense because then like charges are far apart. So this might be working. Who knows.

Interesting places to go with this:

Prof Vanderbei shows how you can embed an FFT to enable making statements about both the time and frequency domain while keeping the problem sparse. The low time/memory N log(N) complexity of the FFT is reflected in the spasity of the resulting linear program.

https://vanderbei.princeton.edu/tex/ffOpt/ffOptMPCrev4.pdf

Here’s a sketch about what this might look like. Curiously, looking at the actual number of nonzeros in the problem matrices, there are way too many. I am not sure what is going on. Something is not performing as i expect in the following code

The equivalent dense DFT:

It would be possible to use a frequency domain solution of the interparticle energy rather than the explicit coulomb law form. Hypothetically this might increase the sparsity of the problem.

It seems very possible to me in a similar manner to embed a fast multipole method or barnes-hut approximation within a MIQP. Introducing explicit charge summary variables for blocks would create a sparse version of the interaction matrix. So that’s fun.