Categorical Combinators for Convex Optimization and Model Predictive Control using Cvxpy

We’re gonna pump this well until someone MAKES me stop.

This particular example is something that I’ve been trying to figure out for a long time, and I am pleasantly surprised at how simple it all seems to be. The key difference with my previous abortive attempts is that I’m not attempting the heavy computational lifting myself.

We can take pointful DSLs and convert them into a point-free, category theory inspired interface. In this case, a very excellent pointful DSL for convex optimization: cvxpy. I have some similar and related posts converting DSLs to categorical form.

A convex optimization problem optimizes a convex objective function with constraints that define a convex set like polytopes or balls. They are polynomial time tractable and shockingly useful. We can make a category out of convex optimization problems. We consider some variables to be “input” and some to be “output”. This choice is somewhat arbitrary as is the case for many “relation” feeling things that aren’t really so rigidly oriented.

Convex programming problems do have a natural notion of composition. Check out the last chapter of Rockafellar, where he talks about the convex algebra of bifunctions. Instead of summing over the inner composition variable like in Vect (\sum_j A_{ij}B_{jk}), or existentializing like in Rel (\{ (a,c) | \exists b. (a,b)\in A, (b,c) \in B \}), we instead minimize over the inner composition variable: \min_y A(x,y) + B(y,z). These are similar operations in that they all produce bound variables.

The identity morphism is just the simple constraint that the input variables equal the output variables, with an objective function of 0. This is an affine constraint, hence convex.

In general, if we ignore the objective part entirely by just setting it to zero, we’re actually working in a very computationally useful subcategory of Rel, ConvexRel, the category of relations which are convex sets. Composition corresponds to an existential operation, which is quickly solvable by convex optimization techniques. In operations research terms, these are feasibility problems rather than optimization problems. Many of the combinators do nothing to the objective.

The monoidal product just stacks variables side by side and adds the objectives and combines the constraints. They really are still independent problems. Writing things in this way opens up a possibility for parallelism. More on that some other day.

We can code this all up in python with little combinators that return the input variables, output variables, objective, and constraint list. We need to hide these in inner lambdas so that fresh variables are generated appropriately on each use.
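To give the flavor, here is a minimal sketch of what such combinators might look like (idd, compose, par, and run are my own stand-in names, not the post's actual code):

import cvxpy as cvx

# A morphism is a thunk returning (input, output, objective, constraints).
# The thunk wrapper is what generates fresh variables on every use.

def idd(n):
    def res():
        x, y = cvx.Variable(n), cvx.Variable(n)
        return x, y, 0, [x == y]  # equality constraint, zero objective
    return res

def compose(f, g):  # min_y f(x, y) + g(y, z)
    def res():
        x, y1, o1, c1 = f()
        y2, z, o2, c2 = g()
        return x, z, o1 + o2, c1 + c2 + [y1 == y2]  # glue the inner variable
    return res

def par(f, g):  # monoidal product: stack side by side, add objectives
    def res():
        x1, y1, o1, c1 = f()
        x2, y2, o2, c2 = g()
        return cvx.hstack([x1, x2]), cvx.hstack([y1, y2]), o1 + o2, c1 + c2
    return res

def run(f):  # finally bubble the min out to the top and solve
    x, y, o, c = f()
    prob = cvx.Problem(cvx.Minimize(o), c)
    prob.solve()
    return x.value, y.value, prob.value

Composition glues the output variables of one problem to the input variables of the next, and the minimization over the glued variable happens when the whole assembled problem is finally solved.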

Now for a somewhat more concrete example: Model Predictive control of an unstable (linearized) pendulum.

Model predictive control is where you solve an optimization problem of the finite time rollout of a control system online. In other words, you take a measurement of the current state, update the constraints in an optimization problem, ask the solver to solve it, and then apply the force or controls that the solver says are best.
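Schematically the loop looks like this (a hedged sketch: the dynamics matrices, horizon, and bounds are made up stand-ins, and the model doubles as the simulator):

import numpy as np
import cvxpy as cvx

n, m, N = 2, 1, 10  # state dim, control dim, lookahead steps
A = np.array([[1.0, 0.1],
              [0.1, 1.0]])   # an unstable linear system
B = np.array([[0.0],
              [0.1]])

x = cvx.Variable((N + 1, n))
u = cvx.Variable((N, m))
x0 = cvx.Parameter(n)  # the measured state, updated every step

constraints = [x[0] == x0, cvx.abs(u) <= 1.0]  # hard bound on available force
for t in range(N):
    constraints += [x[t + 1] == A @ x[t] + B @ u[t]]
prob = cvx.Problem(cvx.Minimize(cvx.sum_squares(x) + cvx.sum_squares(u)), constraints)

x_meas = np.array([0.3, 0.0])
for step in range(50):
    x0.value = x_meas                      # measure, update the constraint
    prob.solve()                           # re-solve the finite time rollout
    x_meas = A @ x_meas + B @ u.value[0]   # apply only the first control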

This gives an advantage over the LQR controller in that you can set hard inequality bounds on the total force available, or on the positions where you wish to allow the thing to go. You don't want your system crashing into some wall or falling over some cliff, for example. These really are useful constraints in practice. You can also include time dependent aspects of the dynamics of your system, which may help you model nonlinear dynamics.

I have more posts on model predictive control here http://www.philipzucker.com/model-predictive-control-of-cartpole-in-openai-gym-using-osqp/ http://www.philipzucker.com/flappy-bird-as-a-mixed-integer-program/

Here we model the unstable point of a pendulum and ask the controller to find forces to balance the pendulum.

We can interpret the controller in GraphCat in order to produce a diagram of the 10 step lookahead controller. This is an advantage of the categorical style a la compiling to categories

We can also actually run it in model predictive control configuration in simulation.

And some curves. How bout that.

Bits and Bobbles

LazySets https://github.com/JuliaReach/LazySets.jl

ADMM – It’s a “lens”. I’m pretty sure I know how to do it pointfree. Let’s do it next time.

The minimization can be bubbled out to the top if we are always minimizing. If we mix in maximization, then we can't, and we're working on a more difficult problem. This is similar to what happens in Rel when you have relational division, which is a kind of universal quantification \forall. Mixed quantifier problems in general are very tough. These kinds of problems include games, synthesis, and robustness. More on this some other day.

Convex-Concave programming minimax https://web.stanford.edu/~boyd/papers/pdf/dccp_cdc.pdf https://web.stanford.edu/class/ee364b/lectures/cvxccv.pdf

The minimization operation can be related to the summation operation by the method of steepest descent in some cases. The method of steepest descent approximates a sum or integral by evaluating it at its most dominant point and expanding around there, hence converts a linear algebra thing into an optimization problem. Examples include the connection between statistical mechanics and thermodynamics, and between classical mechanics and quantum mechanics.
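In a formula (my notation): \min_x f(x) = -\lim_{\beta \to \infty} \frac{1}{\beta} \log \int e^{-\beta f(x)} dx, and likewise -\frac{1}{\beta} \log \sum_j e^{-\beta A_j} \to \min_j A_j. The zero temperature limit of a log-sum-exp (a log partition function) is a min, which is the sense in which the min composition is a limit of the sum composition.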

Legendre Transformation ~ Laplace Transformation via steepest descent https://en.wikipedia.org/wiki/Convex_conjugate yada yada, all kinds of good stuff.

Intersection is easy. Join/union is harder. Does MIP help?

def meet(f, g):
   def res():
      x, y, o, c = f()
      x1, y1, o1, c1 = g()
      # run both problems over the same inputs and outputs: the intersection
      return x, y, o + o1, c + c1 + [x == x1, y == y1]
   return res

Quantifier elimination

MIP via adjunction

Rough Ideas on Categorical Combinators for Model Checking Petri Nets using Cvxpy

Petri nets are a framework for modeling dynamical systems that is very intuitive to some people. The vanilla version is that there are discrete tokens at nodes on a graph representing resources of some kind and tokens can be combined according to the firing of transition rules into new tokens in some other location.

This is a natural generalization of chemical reaction kinetics, where tokens are particular kinds of atoms that need to come together. It also is a useful notion for computer systems, where tokens represent some computational resource.

To me, this becomes rather similar to a flow problem or circuit problem. Tokens feel a bit like charge; transitions are a bit like current (although not necessarily conservative). In a circuit, one can have such a small current that the particulate nature of electric current in terms of electrons is important. This happens for shot noise or for Coulomb blockade, for example.

If the number of tokens is very large, it seems intuitively sensible to me that one can well approximate the behavior by relaxing to a continuum. Circuits have discrete electrons and yet are very well approximated by Ohm's law and the like. Populations are made of individuals, and yet in the thousands their dynamics are well described by differential equations.

It seems to me that mixed integer programming is a natural fit for this problem. Mixed integer programming has had its implementations and theory heavily refined for over 70 years so now very general purpose and performant off the shelf solvers are available. The way mixed integer programs are solved is by considering their quickly solved continuous relaxation (allowing fractional tokens and fractional transitions more akin to continuous electrical circuit flow) and using this information to systematically inform a discrete search process. This seems to me a reasonable starting approximation. Another name for petri nets is Vector Addition Systems, which has more of the matrix-y flavor.

We can encode a bounded model checking for reachability of a petri net into a mixed integer program fairly easily. We use 2-index variables, the first of which will be used for time step. We again turn to the general purpose functional way of encoding pointful dsls into pointfree ones as I have done here and here. The key point is that you need to be careful where you generate fresh variables. This is basically copy catting my posts here. http://www.philipzucker.com/categorical-combinators-for-graphviz-in-python/ http://www.philipzucker.com/a-sketch-of-categorical-relation-algebra-combinators-in-z3py/
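To give the flavor of what the encoding compiles down to, here's a hedged sketch of the plain, non-combinator version (the example net, markings, and horizon are all made up for illustration):

import numpy as np
import cvxpy as cvx

# Bounded reachability for a Petri net as a MIP.
# Pre[p, t] / Post[p, t] = tokens consumed / produced at place p by transition t.
Pre  = np.array([[1, 0],
                 [0, 1],
                 [0, 0]])
Post = np.array([[0, 0],
                 [1, 0],
                 [0, 2]])
nplaces, ntrans = Pre.shape
T = 10  # time horizon

m = cvx.Variable((T + 1, nplaces), integer=True)  # marking at each time step
f = cvx.Variable((T, ntrans), integer=True)       # firing counts per step

constraints = [m >= 0, f >= 0,
               m[0] == np.array([3, 0, 0]),   # initial marking
               m[T] == np.array([0, 1, 4])]   # target marking
for t in range(T):
    constraints += [m[t] >= f[t] @ Pre.T,                      # enough tokens to fire
                    m[t + 1] == m[t] + f[t] @ (Post - Pre).T]  # token bookkeeping
prob = cvx.Problem(cvx.Minimize(cvx.sum(f)), constraints)
prob.solve(solver=cvx.GLPK_MI)
print(prob.status)  # "optimal" means the target marking is reachable within T steps

Relaxing the integer constraints on m and f gives exactly the continuous flow relaxation mentioned above.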

I’m like 70% sure what I did here makes sense, but I’m pretty sure the general idea makes sense with some fiddling.

The big piece is the weighted_block function. It lets you build a combinator with an internal state and input and output flow variables. You give matrices with entries for every possible transition. Whether transitions occurred between t and t+1 is indicated by integer variables. There is also possible accumulation of tokens at nodes, which also requires integer variables. Perhaps we'd want to expose the token state of the nodes to the outside too?

Weighted block schematically looks something like this

We can also get out a graphical representation of the net by reinterpreting our program into GraphCat. This is part of the power of the categorical interface. http://www.philipzucker.com/categorical-combinators-for-graphviz-in-python/

When I talked to Zach about this, he seemed skeptical, suspecting that MIP solvers would die a horrible death if you threw a moderately large petri net into them. Hard to say without trying.

Thoughts

There is an interesting analogy to be found with quantum field theory in that if you lift up to considering distributions of tokens, it looks like an occupation number representation. See Baez. http://math.ucr.edu/home/baez/stoch_stable.pdf

If you relax the integer constraint and the positivity constraints, this really is a finite difference formulation for capacitor circuits. The internal states would then be the charge of the capacitor. Would the positivity constraint be useful for diodes?

I wonder how relevant the chunky nature of petri nets might be for considering superconducting circuits, which have quantization of quantities from quantum mechanical effects.

Cvxpy lets you describe convex regions. We can use this to implement the convex subcategory of Rel, of which you can ask reachability questions. Relational division probably won't work. I wonder if there is something fun there with respect to the integerizing operation and Galois connections.

Edit: I should have googled a bit first. https://www.sciencedirect.com/science/article/pii/S0377221705009240 mathematical programming techniques for petri net reachability. So it has been tried, and it sounds like the results weren't insanely bad.

Fiddling around with validated ODE integration, Sum of Squares, Taylor Models.

As I have gotten more into the concerns of formal methods, I’ve become unsure that ODEs actually exist. These are concerns that did not bother me much when I defined myself as being more in the physics game. How times change. Here’s a rough cut.

A difficulty with ODE error analysis is that it is very confusing how to get the error on something you are having difficulty approximating in the first place.

If I wanted to know the error of using a finite step size dt vs a size dt/10, great. Just compute both and compare. However, no amount of this seems to bootstrap you down to the continuum. And so I thought, you’re screwed in regards to using numerics in order to get true hard facts about the true solution. You have to go to paper and pencil considerations of equations and variables and epsilons and deltas and things. It is now clearer to me that this is not true. There is a field of verified/validated numerics.

A key piece of this seems to be interval arithmetic. https://en.wikipedia.org/wiki/Interval_arithmetic An interval can be concretely represented by its left and right point. If you use rational numbers, you can represent the interval precisely. Interval arithmetic over approximates operations on intervals in such a way as to keep things easily computable. One way it does this is by ignoring dependencies between different terms. Check out Moore et al’s book for more.
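To illustrate with a toy (my code, not a real interval library; a serious implementation must also round outward in floating point):

from dataclasses import dataclass

@dataclass
class Interval:
    lo: float
    hi: float
    def __add__(self, other):
        return Interval(self.lo + other.lo, self.hi + other.hi)
    def __sub__(self, other):
        return Interval(self.lo - other.hi, self.hi - other.lo)
    def __mul__(self, other):
        ps = [self.lo * other.lo, self.lo * other.hi,
              self.hi * other.lo, self.hi * other.hi]
        return Interval(min(ps), max(ps))

x = Interval(-1.0, 1.0)
print(x - x)  # Interval(lo=-2.0, hi=2.0), not 0: the ignored-dependency problem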

What switching over to intervals does is you think about sets as the things you’re operating on rather than points. For ODEs (and other things), this shift of perspective is to no longer consider individual functions, but instead sets of functions. And not arbitrary extremely complicated sets, only those which are concretely manipulable and storable on a computer like intervals. Taylor models are a particular choice of function sets. You are manipulating an interval tube around a finite polynomial. If during integration / multiplication you get higher powers, truncate the polynomials by dumping the excess into the interval term. This keeps the complexity under wraps and closes the loop of the descriptive system.
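A cartoon of that truncation step, on the domain t \in [0,1] where |t^k| \le 1 (my toy sketch with symmetric remainders, not a real validated library):

import numpy as np

ORDER = 5  # kept polynomial degree; everything higher spills into the remainder

def poly_bound(c):
    return float(np.sum(np.abs(c)))  # |p(t)| <= sum_k |c_k| for t in [0,1]

def tm_mul(p1, r1, p2, r2):
    """Multiply Taylor models f_i(t) in p_i(t) +- r_i, truncating to ORDER."""
    prod = np.convolve(p1, p2)
    keep, spill = prod[:ORDER + 1], prod[ORDER + 1:]
    # dump the truncated high powers and the remainder cross terms into r
    r = poly_bound(spill) + poly_bound(p1) * r2 + poly_bound(p2) * r1 + r1 * r2
    return keep, r

# (1 + t)(1 - t) = 1 - t^2 fits within ORDER, so the remainder stays 0
print(tm_mul(np.array([1., 1.]), 0.0, np.array([1., -1.]), 0.0))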

If we have an iterative, contractive process for getting better and better solutions of a problem (like a newton method or some iterative linear algebra method), we can get definite bounds on the solution if we can demonstrate that a set maps into itself under this operation. If this is the case and we know there is a unique solution, then it must be in this set.

It is wise, if at all possible, to convert an ODE into integral form: \dot{x} = f(x,t) is the same as x(t) = x(0) + \int_0^t f(x(s), s) ds.

For ODEs, the common example of such an operation is known as Picard iteration. In physical terms, this is something like the impulse approximation / born approximation. One assumes that the ODE evolves according to a known trajectory x_0(t) as a first approximation. Then one plugs in the trajectory to the equations of motion f(x_0,t) to determine the “force” it would feel and integrate up all this force. This creates a better approximation x_1(t) (probably) which you can plug back in to create an even better approximation.
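For example, plain (non-validated) Picard iteration for x' = x, x(0) = 1 is a few lines of sympy, and you can watch the Taylor series of e^t emerge:

import sympy as sp

t = sp.symbols('t')
x = sp.Integer(1)  # x_0(t) = 1, the initial guess
for _ in range(5):
    x = 1 + sp.integrate(x, (t, 0, t))  # x_{n+1}(t) = x(0) + integral of f(x_n)
print(sp.expand(x))  # 1 + t + t**2/2 + ... + t**5/120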

If we instead do this iteration on an intervally function set / taylor model thing, and can show that the set maps into itself, we know the true solution lies in this interval. The term to search for is Taylor Models (also some links below).

I was tinkering with whether sum of squares optimization might tie in to this. I have not seen SOS used in this context, but it probably has been, or it is worthless.

An aspect of sum of squares optimization that I thought was very cool is that it gives you a simple numerical certificate that confirms that at the infinitude of points for which you could evaluate a polynomial, it comes out positive. This is pretty cool. http://www.philipzucker.com/deriving-the-chebyshev-polynomials-using-sum-of-squares-optimization-with-sympy-and-cvxpy/

But that isn’t really what makes Sum of squares special. There are other methods by which to do this.

There are very related methods called DSOS and SDSOS https://arxiv.org/abs/1706.02586 which are approximations of the SOS method. They replace the SDP constraint at the core with a more restrictive constraint that can be expressed with LP and SOCP respectively. These methods lose some of the universality of the SOS method and become basis dependent on your choice of polynomials. DSOS in fact is based around the concept of a diagonally dominant matrix, which means that you should know roughly what basis your certificate should be in.

This made me realize there is an even more elementary version of DSOS that perhaps should have been obvious to me from the outset. Suppose we have a set of functions we already know are positive everywhere on a domain of interest. A useful example is the raised chebyshev polynomials. https://en.wikipedia.org/wiki/Chebyshev_polynomials The appropriate chebyshev polynomials oscillate between [-1,1] on the interval [-1,1], so if you add 1 to them they are positive over the whole interval [-1,1]. Then nonnegative linear sums of them are also positive. Bing bang boom. And that compiles down into a simple linear program (inequality constraints on the coefficients) with significantly fewer variables than DSOS. What we are doing is restricting ourselves to nonnegative diagonal matrices, which are of course positive semidefinite. It is less flexible, but it also has more obvious knobs to throw in domain specific knowledge. You can use a significantly overcomplete basis, and finding this basis is where you can insert your prior knowledge.
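As a tiny demonstration (my toy example, not from the original post): certify that t^2 + 1/10 is positive on [-1,1] by finding nonnegative coordinates for it in the raised Chebyshev basis with an LP.

import numpy as np
import sympy as sp
import cvxpy as cvx

t = sp.symbols('t')
deg = 4
p = t**2 + sp.Rational(1, 10)
basis = [sp.expand(1 + sp.chebyshevt(k, t)) for k in range(deg + 1)]  # >= 0 on [-1,1]

def coeffs(q):  # monomial coefficient vector, padded to length deg + 1
    cs = [float(v) for v in sp.Poly(q, t).all_coeffs()[::-1]]
    return np.array(cs + [0.0] * (deg + 1 - len(cs)))

B = np.stack([coeffs(b) for b in basis], axis=1)
c = cvx.Variable(deg + 1, nonneg=True)
prob = cvx.Problem(cvx.Minimize(0), [B @ c == coeffs(p)])
prob.solve()
print(prob.status, c.value)  # feasibility is the positivity certificate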

It is not at all clear there is any benefit over interval based methods.

Here is a sketch I wrote for x'=x which has solution e^t. I used raised chebyshev polynomials to enforce positive polynomial constraints and tossed in a little taylor model / interval arithmetic to truncate off the highest terms.

I’m using my helper functions for translating between sympy and cvxpy expressions. https://github.com/philzook58/cvxpy-helpers Sympy is great for collecting up the coefficients on terms and polynomial multiplication integration differentiation etc. I do it by basically creating sympy matrix variables corresponding to cvxpy variables which I compile to cvxpy expressions using lambdify with an explicit variable dictionary.

Seems to work, but I’ve been burned before.

man, LP solvers are so much better than SDP solvers


Random junk and links: Should I be more ashamed of dumps like this? I don’t expect you to read this.

https://github.com/JuliaIntervals/TaylorModels.jl

https://github.com/JuliaIntervals

Functional analysis by and large analyzes functions by analogy with more familiar properties of finite dimensional vector spaces. In ordinary 2d space, it is convenient to work with rectangular regions or polytopic regions.

Suppose I had a damped oscillator converging to some unknown point. If we can show that every point in a set maps within the set, we can conclude by a fixed point theorem that the limit must lie in that set.

One model of a program is that it is some kind of kooky complicated hyper nonlinear discrete time dynamical system. And vice versa, dynamical systems are continuous time programs. The techniques for analyzing either have analogs in the other domain. Invariants of programs are essential for determining correctness properties of loops. Invariants like energy and momentum are essential for determining what physical systems can and cannot do. Lyapunov functions demonstrate that control systems are converging to the set point. Terminating metrics are showing that loops and recursion must eventually end.

If instead you use interval arithmetic for a bound on your solution rather than your best current solution, and if you can show the interval maps inside itself, then you know that the iterative process must converge inside of the interval, hence that is where the true solution lies.

A very simple bound for an integral: \int_a^b f(x) dx \le \int_a^b \max_{x \in [a,b]} f(x) \, dx = \max_{x \in [a,b]} f(x) \int_a^b dx = \max_{x \in [a,b]} f(x) (b - a).

The integral is a very nice operator. The result of the integral is a positive linear sum of the values of a function. This means it plays nice with inequalities.

Rigorously Bounding ODE solutions with Sum of Squares optimization – Intervals

Intervals – Moore book. Computational functional analysis. Tucker book. CoqInterval. Fixed point theorems. Hardware acceleration? Interval valued functions. Interval extensions.

  • Banach fixed point – contraction mapping
  • Brouwer fixed point
  • Schauder
  • Knaster Tarski

Picard iteration vs? Allowing flex on boundary conditions via an interval?

Interval book had an interesting integral form for the 2-D

sympy has cool stuff

google scholar search z3, sympy brings up interesting things

https://moorepants.github.io/eme171/resources.html

The pydy guy Moore has a lot of good shit. resonance https://www.moorepants.info/blog/introducing-resonance.html

Lyapunov functions. Piecewise affine lyapunov functions. Are lyapunov functions kind of like a PDE? Value functions are PDEs. If the system is piecewise affine we can define a grid on the same piecewise affine thingo. Compositional convexity. Could we use compositional convexity + ReLU style piecewise affinity to get complicated lyapunov functions? Lyapunov functions don't have to be continuous, they just have to be decreasing. The Lie derivative wrt the flow is always negative, i.e. the gradient of the function points roughly against the direction of flow. Triangulate around the equilibrium if you want to avoid quadratic lyapunov functions. For a guarded system, you can relax the lyapunov constraint outside of the guard if you tighten it inside the guard. Ax >= 0 is the guard. It's the S-procedure.

Best piecewise approximation with point choice?

http://theory.stanford.edu/~arbrad/papers/lr.ps linear ranking functions

Connection to petri nets?

https://ths.rwth-aachen.de/wp-content/uploads/sites/4/hs_lecture_notes.pdf

https://www.cs.colorado.edu/~xich8622/papers/rtss12.pdf

KoAT, LoAT, AProVE. Integer transition systems. Termination analysis. Loops?

https://lfcps.org/pub/Pegasus.pdf darboux polynomials. barrier certificates. Prelle-Singer method. first integrals.

method 1: take an arbitrary polynomial p(t). calculate p'(t). find coefficients that make p'(t) = 0 by linear algebra. Idea: near invariants? min max |p'(t)|

Lie Algebra method

https://www.researchgate.net/publication/233653257_Solving_Differential_Equations_by_Symmetry_Groups sympy links this paper. Sympy has some lie algebra stuff in there

https://www-users.math.umn.edu/~olver/sm.html Peter Olver tutorial

http://www-users.math.umn.edu/~olver/talk.html olver talks

https://www-sop.inria.fr/members/Evelyne.Hubert/publications/PDF/Hubert_HDR.pdf

https://www.cs.cmu.edu/~aplatzer/logic/diffinv.html andre platzer. Zach says Darboux polynomials?

https://sylph.io/blog/math.html

Books: Birkhoff and Rota, Guggenheimer, different Olver books, practical guide to invariants https://www.amazon.com/Practical-Invariant-Monographs-Computational-Mathematics/dp/0521857015

Idea: Approximate invariants? At least this ought to make a good coordinate system to work in where the dynamics are slow. Like action-angle and adiabatic transformations. Could also perhaps bound the

Picard Iteration

I have a method that I’m not sure is ultimately sound. The idea is to start with

Error analysis most often uses an appeal to Taylor's theorem, and Taylor's theorem is usually derived from the mean value theorem or intermediate value theorem. Maybe that's fine. But the mean value theorem is some heavy stuff. There are computational doo dads that use these bounds + interval analysis to rigorously integrate ODEs. See https://github.com/JuliaIntervals/TaylorModels.jl

The beauty of sum of squares certificates is that they are very primitive proofs of positivity for a function on a domain of infinitely many values. If I give you a way to write an expression as a sum of square terms, it is then quite obvious that it has to be always positive. This is algebra rather than analysis.

y(t) = \lambda(t) \land \lambda(t) \text{ is SOS} \Rightarrow \forall t. y(t) \ge 0. Sum of squares is a kind of quantifier elimination method. The reverse direction of the above implication is the subject of the Positivstellensatz, a theorem of real algebraic geometry. At the very least, we can use the SOS constraint as a relaxation of the quantified constraint.

So, I think by using sum of squares, we can turn a differential equation into a differential inequation. If we force the highest derivative to be larger than the required differential equation, we will get an overestimate of the required function.

A function that is dominated by another in derivative will be dominated in value also. You can integrate over inequalities (I think; you have to be careful about such things): \forall t. \frac{dx}{dt} \ge \frac{dy}{dt} \Rightarrow x(t) - x(0) \ge y(t) - y(0).

The derivative of a polynomial can be thought of as a completely formal operation, with no necessarily implied calculus meaning. It seems we can play a funny kind of shell game to avoid the bulk of calculus.

As an example, let's take \frac{dy}{dt} = y, y(0) = 1, with the solution y = e^t. e is a transcendental number.

The S-procedure is a trick by which you can relax a sum of squares inequality to only need to be enforced in a domain. If you build a polynomial that describes the domain, in that it is positive inside the domain and negative outside the domain, you can add a positive multiple of it to the right hand side of your SOS inequalities. Inside the domain you care about, you've only made them harder to satisfy, not easier. But outside the domain you have made it easier because you can have negative slack.

For the domain t \in [0,1] the polynomial (1 - t)t works as our domain polynomial.

We parametrize our solution as an explicit polynomial x(t) = a_0 + a_1 t + a_2 t^2 + .... It is important to note that what follows is always linear in the a_i.

\frac{dx}{dt} - x \ge 0 on [0,1] can be relaxed to the global condition \frac{dx}{dt} - x(t) - \lambda(t)(1-t)t \ge 0 with \lambda(t) SOS. On [0,1] the subtracted term is nonnegative, so the original inequality follows there, while outside [0,1] the negative domain polynomial provides slack.

So with that we get a reasonable formulation of finding a polynomial upper bound solution of the differential equation

\min x(1)

x(0) = 1

\frac{dx}{dt} - x(t) - \lambda_1(t)(1-t)t = \lambda_2(t)

\lambda_{1,2}(t) is SOS.

And here it is written out in python using my cvxpy-helpers, which bridge the gap between sympy polynomials and cvxpy.
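That code isn't reproduced here, so here is a self-contained sketch of the same formulation, with raised Chebyshev certificates standing in for general SOS as discussed above (an LP; my reconstruction, not the original cvxpy-helpers code):

import numpy as np
import sympy as sp
import cvxpy as cvx

t = sp.symbols('t')
deg, cdeg = 6, 8
Npow = max(deg, cdeg + 2)  # highest power of t appearing anywhere

# raised Chebyshev polynomials 1 + T_k(2t - 1), nonnegative on [0,1]
basis = [sp.expand(1 + sp.chebyshevt(k, 2*t - 1)) for k in range(cdeg + 1)]

def coeffs(q):  # monomial coefficient vector, padded to length Npow + 1
    cs = [float(v) for v in sp.Poly(q, t).all_coeffs()[::-1]]
    return np.array(cs + [0.0] * (Npow + 1 - len(cs)))

a  = cvx.Variable(deg + 1)                # x(t) = sum_i a_i t^i
l1 = cvx.Variable(cdeg + 1, nonneg=True)  # lambda_1 coefficients
l2 = cvx.Variable(cdeg + 1, nonneg=True)  # lambda_2 coefficients

# linear map a -> coefficients of x' - x
D = np.zeros((Npow + 1, deg + 1))
for k in range(deg + 1):
    D[k, k] -= 1.0
    if k + 1 <= deg:
        D[k, k + 1] += k + 1

M1 = np.stack([coeffs(b * t * (1 - t)) for b in basis], axis=1)
M2 = np.stack([coeffs(b) for b in basis], axis=1)

constraints = [D @ a == M1 @ l1 + M2 @ l2,  # x' - x = l1(t) t(1-t) + l2(t)
               a[0] == 1]                   # initial condition x(0) = 1
prob = cvx.Problem(cvx.Minimize(cvx.sum(a)), constraints)  # minimize x(1)
prob.solve()
print("certified upper bound on e:", prob.value)

One feasible point is x(t) = 1 + t + t^2, since x' - x = t(1-t) is manifestly in the cone; the LP then pushes x(1) down from 3 toward e.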

We can go backwards to figure out sufficient conditions for a bound. We want x_u(t_f) \ge x(t_f). It is sufficient that \forall t. x_u(t) \ge x(t). For this it is sufficient that \forall t. x_u'(t) \ge x'(t) \land x_u(t_i) \ge x(t_i). We follow this down in derivative until we get to the lowest derivative in the differential equation. Then we can use the linear differential equation itself, x^{(n)}(t) = \sum_i a_i(t) x^{(i)}(t), and require x_u^{(n)}(t) \ge \sum_i \max(a_i(t) x_u^{(i)}(t), a_i(t) x_l^{(i)}(t)).

a(t) x(t) \le \max(a(t) x_u(t), a(t) x_l(t)) whenever x_l(t) \le x(t) \le x_u(t). This accounts for the possibility of terms changing sign. Or you could separate the terms into regions of constant sign.

The minimization characterization of the bound is useful. For any class of functions that contains our degree-d polynomial, we can show that the minimum of the same optimization problem is less than or equal to our value.

Is the dual value useful? The lower bound on the least upper bound

Doesn't seem like the method will work for nonlinear ODEs. Maybe it will if you relax the nonlinearity. Or you could perhaps use a mixed integer formulation to make piecewise linear approximations of the nonlinearity?

It is interesting to investigate linear programming models. It is simpler and more concrete to examine how well different step sizes approximate each other rather than worry about the differential case.

We can explicitly compute a finite difference solution in the LP, which is a power that is difficult to achieve in general for differential equations.

We can instead replace the exact solution with a conservative bound.

While we can differentiate through an equality, we can’t differentiate through an inequality. Differentiation involves negation, which plays havoc with inequalities. We can however integrate through inequalities.

\frac{dx}{dt} \ge f \land x(0) \ge a \Rightarrow x(t) \ge \int_0^t f \, dt + a

As a generalization, we can integrate inequalities against a weight p(x) as long as p(x) \ge 0.

In particular \forall t. \frac{dx}{dt} \ge \frac{dy}{dt} \Rightarrow x(t) - x(0) \ge y(t) - y(0).

We can convert a differential equation into a differential inequation. It is not entirely clear to me that there is a canonical way to do this. But it works to take the biggest.

\frac{dx}{dt} = A(t)x + f(t)

Is there a tightest

We can integrate

Here let’s calculate e

https://tel.archives-ouvertes.fr/tel-00657843v2/document Thesis on ODE bounds in Isabelle


not so good. very small

Flappy Bird as a Mixed Integer Program

My birds.

Mixed Integer Programming is a methodology where you can specify convex (usually linear) optimization problems that include integer/boolean variables.

Flappy Bird is a game about a bird avoiding pipes.

We can use mixed integer programming to make a controller for Flappy Bird. Feel free to put this as a real-world application in your grant proposals, people.

While thinking about writing a MIP for controlling a lunar lander game, I realized how amenable to mixed integer modeling flappy bird is. Ben and I put together the demo on Saturday. You can find his sister blog post here.

The bird is mostly in free fall, on parabolic trajectories. This is a linear dynamic, so it can directly be expressed as a linear constraint. It can discretely flap to give itself an upward impulse. This is a boolean force variable at every time step. Avoiding the ground and sky is a simple linear constraint. The bird has no control over its x motion, so that can be rolled out as concrete values. Because of this, we can check which pipes are relevant at each time point in the future, and putting the bird in the gap is then also a simple linear constraint.

There are several different objectives one might want to consider and weight. Perhaps you want to save the poor birds energy and minimize the sum of all flaps cvx.sum(flap). Or perhaps you want to really be sure it doesn’t hit any pipes by maximizing the minimum distance from any pipe. Or perhaps minimize the absolute value of the y velocity, which is a reasonable heuristic for staying in control. All are expressible as linear constraints. Perhaps you might want a weighted combo of these. All things to fiddle with.

There is a pygame flappy bird clone which made this all that much more slick. It is well written and easy to understand and modify. Actually figuring out the appropriate bounding boxes for pipe avoidance was finicky. Figuring out the right combo of bird size and pipe size is hard, combined with computer graphics and their goddamn upside down coordinate system.

We run our solver in a model predictive control configuration. Model predictive control is where you roll out a trajectory as an optimization problem and resolve it at every action step. This turns an open loop trajectory solve into a closed loop control, at the expense of needing to solve a perhaps very complicated problem in real time. This is not always feasible.

My favorite mip modeling tool is cvxpy. It gives you vectorized constraints and slicing, which I love. More tools should aspire to achieve numpy-like interfaces. I’ve got lots of other blog posts using this package which you can find in my big post list the side-bar 👀.

The github repo for the entire code is here: https://github.com/philzook58/FlapPyBird-MPC

And here’s the guts of the controller:

import cvxpy as cvx
import numpy as np
import matplotlib.pyplot as plt


N = 20 # time steps to look ahead
path = cvx.Variable((N, 2)) # y pos and vel
flap = cvx.Variable(N-1, boolean=True) # whether or not the bird should flap in each step
last_solution = [False, False, False]
last_path = [(0,0),(0,0)]

PIPEGAPSIZE  = 100
PIPEWIDTH = 52
BIRDWIDTH = 34
BIRDHEIGHT = 24
BIRDDIAMETER = np.sqrt(BIRDHEIGHT**2 + BIRDWIDTH**2)
SKY = 0
GROUND = (512*0.79)-1
PLAYERX = 57


def getPipeConstraints(x, y, lowerPipes):
    constraints = []
    for pipe in lowerPipes:
        dist_from_front = pipe['x'] - x - BIRDDIAMETER
        dist_from_back = pipe['x'] - x + PIPEWIDTH
        if (dist_from_front < 0) and (dist_from_back > 0):
            #print(pipe['y'] + BIRDDIAMETER,  pipe['y'] + PIPEGAPSIZE)
            constraints += [y <= (pipe['y'] - BIRDDIAMETER)] # y above lower pipe
            constraints += [y >= (pipe['y'] - PIPEGAPSIZE)] # y below upper pipe
    #if len(constraints) > 0:
        #print(constraints)
    return constraints

def solve(playery, playerVelY, lowerPipes):
    global last_path, last_solution

    print(last_path)
    pipeVelX = -4
    playerAccY    =   1   # player's downward acceleration
    playerFlapAcc =  -14   # player's velocity change on flapping

    # unpack variables
    y = path[:,0]

    vy = path[:,1]

    c = [] #constraints
    c += [y <= GROUND, y >= SKY]
    c += [y[0] == playery, vy[0] == playerVelY]

    x = PLAYERX
    xs = [x]
    for t in range(N-1):
        dt = t//10 + 1
        #dt = 1
        x -= dt * pipeVelX
        xs += [x]
        c += [vy[t + 1] ==  vy[t] + playerAccY * dt + playerFlapAcc * flap[t] ]
        c += [y[t + 1] ==  y[t] + vy[t + 1]*dt ]
        c += getPipeConstraints(x, y[t+1], lowerPipes)


    #objective = cvx.Minimize(cvx.sum(flap)) # minimize total fuel use
    objective = cvx.Minimize(cvx.sum(flap) + 10* cvx.sum(cvx.abs(vy))) # minimize total fuel use

    prob = cvx.Problem(objective, c)
    try:
        prob.solve(verbose = False, solver="GUROBI")
        #print(np.round(flap.value).astype(bool))
        #plt.plot(y.value)
        #plt.show()
        last_path = list(zip(xs, y.value))
        last_solution = np.round(flap.value).astype(bool)
        return last_solution[0], last_path
    except:
        last_solution = last_solution[1:]
        last_path = [((x-4), y) for (x,y) in last_path[1:]]
        return last_solution[0], last_path


I think it is largely self explanatory, but here are some notes. The dt = t//10 + 1 thing is about decreasing our time resolution the further out we get from the current time step. This increases the time horizon without the extra computation cost. Intuitively, modeling accuracy further out in time should matter less. The last_solution stuff is there in case the optimization solver fails for whatever reason, in which case the bird will follow the last trajectory it got, open loop.

Bits and Bobbles

  • We changed the dynamics slightly from the python original to make it easier to model. We found this did not change the feel of the game. The old dynamics were piecewise affine though, so are also modelable using mixed integer programming. http://groups.csail.mit.edu/robotics-center/public_papers/Marcucci18.pdf . Generally check out the papers coming out of the Tedrake group. They are sweet. Total fanboy over here.
  • The controller as is is not perfect. It fails eventually, and it probably shouldn’t. A bug? Numerical problems? Bad modeling of the pipe collision? A run tends to get through about a hundred pipes before something gets goofy.
  • Since we had access to the source code, we could mimic the dynamics very well. How robust is flappy bird to noise and bad modeling? We could add wind, or inaccurate pipe data.
  • Unions of Convex Regions. Giving the flappy bird some x position control would change the nature of the problem. We could easily cut up the allowable regions of the bird into rectangles, and represent the total space as a union of convex regions, which is also MIP representable.
  • Verification – The finite difference scheme used is crude. It is conceivable for the bird to clip a pipe. Since basically we know the closed form of the trajectories, we could verify that the parabolas do not intersect the regions. For funzies, maybe use sum of squares optimization?
  • Realtime MIP. Our solver isn’t quite realtime. Maybe half real time. One might pursue methods to make the mixed integer program faster. This might involve custom branching heuristics, or early stopping. If one can get the solver fast enough, you might run the solver in parallel and only query a new path plan every so often.
  • 3d flappy bird? Let the bird rotate? What about a platformer (Mario) or lunar lander? All are pretty interesting piecewise affine systems.
  • Is this the best way to do this? Yes and no. Other ways to do this might involve doing some machine learning, or hardcoding a controller that monitors the pipe locations and has some simple feedback. You can find some among the forks of FlapPyBird. I have no doubt that you could write these quickly, fiddle with them and get them to work better and faster than this MIP controller. However, for me there is a difference between could work and should work. You can come up with a thousand bizarre schemes that could work. RL algorithms fall in this camp. But I have every reason to believe the MIP controller should work, which makes it easier to troubleshoot.

Cvxpy and NetworkX Flow Problems

Networkx outputs scipy sparse incidence matrices

https://networkx.github.io/documentation/networkx-1.9/reference/generated/networkx.linalg.graphmatrix.incidence_matrix.html#networkx.linalg.graphmatrix.incidence_matrix

https://docs.scipy.org/doc/scipy/reference/sparse.html

Networkx also has its own flow solvers, but cvxpy gives you some interesting flexibility, like turning the problem mixed integer, adding quadratic terms, and other goodies. Plus it is very easy to get going, as you'll see.

So here’s a basic example of putting these two together. Very straightforward and cool.


import networkx as nx
import cvxpy as cvx
import matplotlib.pyplot as plt
import numpy as np
from scipy.sparse import lil_matrix

# graph is a networkx graph from somewhere. As a concrete stand-in (my example,
# not the original graph):
graph = nx.convert_node_labels_to_integers(nx.grid_2d_graph(6, 6))

#print(edgedict)
nEdges = len(graph.edges)
nNodes = len(graph.nodes)

posflow = cvx.Variable(nEdges)
negflow = cvx.Variable(nEdges)

# split flow into positive and negative parts so we can talk about absolute value.
# Perhaps I should let cvxpy do it for me
constraints = [ 0 <= posflow,  0 <= negflow ]

absflow = posflow + negflow
flow = posflow - negflow

L = nx.incidence_matrix(graph, oriented=True )

source = np.zeros(nNodes) #lil_matrix(n_nodes)
# just some random source placement.
source[7] = 1
source[25] = -1

# cvxpy needs sparse matrices wrapped.
Lcvx = cvx.Constant(L)
#sourcecvx = cvx.Constant(source)

# flow conservation
constraints.append(Lcvx @ flow == source)

# can put other funky inequality constraints on things.

objective = cvx.Minimize(cvx.sum(absflow)) 

print("building problem")
prob = cvx.Problem(objective, constraints)
print("starting solve")
prob.solve(solver=cvx.OSQP, verbose = True) #or try cvx.CBC, cvx.CVXOPT, cvx.GLPK, others
np.set_printoptions(threshold=np.inf)
print(absflow.value)


Here was a cool visual from a multi commodity flow problem (nx.draw_networkx_edges)

Nice, huh.

Trajectory Optimization of a Pendulum with Mixed Integer Linear Programming

There is a reasonable piecewise linear approximation for the pendulum, replacing the sinusoidal potential with two quadratic potentials (one around the top and one around the bottom). This translates to a triangle wave torque.

Cvxpy curiously has support for Mixed Integer Programming.

Cbc is probably better than GLPK MI. However, GLPK is way easier to get installed. Just brew install glpk and pip install cvxopt.

Getting cbc working was a bit of a journey. I had to actually BUILD Cylp (god forbid) and fix some problems.

Special Ordered Set constraints are useful for piecewise linear approximations. The SOS2 constraints take a set of variables and make it so that only two consecutive ones can be nonzero at a time. Solvers often have built in support for them, which can be much faster than just blasting them with general constraints. I did it by adding a binary variable for every consecutive pair. Then these binary variables suppress the continuous ones. Setting the sum of the binary variables to 1 makes only one able to be nonzero.

def SOS2(n):
	# SOS2 hack: z[j] = 1 selects the interval between grid points j and j+1
	z = cvx.Variable(n-1,boolean=True)
	x = cvx.Variable(n)
	constraints = []
	constraints += [0 <= x]
	constraints += [x[0] <= z[0]]             # endpoint weights need their one adjacent z
	constraints += [x[-1] <= z[-1]]
	constraints += [x[1:-1] <= z[:-1]+z[1:]]  # interior weights need an adjacent z
	constraints += [cvx.sum(z) == 1]          # exactly one interval is active
	constraints += [cvx.sum(x) == 1]          # x are interpolation weights
	return x, z, constraints

 

One downside is that every evaluation of these nonlinear functions requires a new set of integer and binary variables, which is possibly quite expensive.

For some values of total time steps and step length, the solver can go off the rails and never return.

At the moment, the solve is not fast enough for real time control with CBC (~2s). I wonder how much some kind of warm start might help, or more fiddling with heuristics, or whether having access to the built in SOS2 constraints rather than hacking them in would speed things up. Also commercial solvers are usually faster. Still, it is interesting.

import cvxpy as cvx
import numpy as np
import matplotlib.pyplot as plt

def SOS2(n):
	z = cvx.Variable(n-1,boolean=True)

	x = cvx.Variable(n)
	constraints = []
	constraints += [0 <= x]
	constraints += [x[0] <= z[0]]
	constraints += [x[-1] <= z[-1]]
	constraints += [x[1:-1] <= z[:-1]+z[1:]]
	constraints += [cvx.sum(z) == 1]
	constraints += [cvx.sum(x) == 1]
	return x, z, constraints



N = 20
thetas = cvx.Variable(N)
omegas = cvx.Variable(N)
torques = cvx.Variable(N)
tau = 0.3
c = [thetas[0] == +0.5, omegas[0] == 0, torques <= tau, torques >= -tau] # initial conditions and torque limits
dt = 0.5

D = 5
thetasN = np.linspace(-np.pi,np.pi, D, endpoint=True)
zs = []
mods = []
xs = []
print(thetasN)
reward = 0
for t in range(N-1):
	c += [thetas[t+1] == thetas[t] + omegas[t]*dt]
	
	mod = cvx.Variable(1,integer=True) # winding number: theta = grid interpolation + 2*pi*mod
	mods.append(mod)
	x, z, c1 = SOS2(D) # interpolation weights over the angle grid thetasN
	c += c1
	xs.append(x)
	c += [thetas[t+1] == thetasN@x + 2*np.pi*mod]
	sin = np.sin(thetasN)@x       # piecewise linear approximation of sin(theta)
	reward += -np.cos(thetasN)@x  # height of the pendulum bob, maximized at the top
	c += [omegas[t+1] == omegas[t] - sin*dt + torques[t]*dt ]

objective = cvx.Maximize(reward)
constraints = c

prob = cvx.Problem(objective, constraints)
res = prob.solve(solver=cvx.CBC, verbose=True)

print([mod.value for mod in mods ])
print([thetasN@x.value for x in xs ])
print([x.value for x in xs ])

plt.plot(thetas.value)
plt.plot(torques.value)
plt.show()



# Other things to do: BigM. 
# Disjunctive constraints.
# Implication constraints.

Blue is angle, orange is the applied torque. The torque is running up against the limits I placed on it.

More Reinforcement Learning with cvxpy

So I spent Thanksgiving doing this and playing Zelda. Even though that sounds like a hell of a day, it seems a little sad for Thanksgiving :(. I should probably make more of an effort to go home next year.

I tried implementing a more traditional q-learning pipeline using cvxpy (rather than the inequality trick of the last time). Couldn’t get it to work as well. And it’s still kind of slow despite a lot of rearrangement to vectorize operations (through batching basically).

I guess I’m still entranced with the idea of avoiding neural networks. In a sense, that is the old boring way of doing things. The Deep RL is the new stuff. Using ordinary function approximators is way older I think. But I feel like it takes a problem out of the equation (dealing with training neural nets). Also I like modeling languages/libraries.

I kept finding show stopping bugs throughout the day (incorrectly written maxaction functions, typos, cross episode data points, etc.), so I wouldn't be surprised if there is one still in here. It's very surprising how one can convince oneself that it is kind of working when it is actually impossible that it's working. All these environments are so simple that I suspect I could randomly sample controllers out of a sack for the time I've been fiddling with this stuff and find a good one.

import gym
import numpy as np
import cvxpy as cvx
from numpy.polynomial.chebyshev import chebval
env = gym.make('Pendulum-v0')
print(env.action_space)
 
print(env.observation_space)
print(env.observation_space.high) # [1. 1. 8.]
 
print(env.observation_space.low) # [-1. -1. -8.]
print(env.action_space.high) # [2.]
print(env.action_space.low) # [-2.]


 
chebdeg = 2
alpha = cvx.Variable(3*chebdeg**3)
alpha.value = np.zeros(3*chebdeg**3)
gamma = 1. - 1./50
def basis(s,a):
    n = np.arange(4)
    f1 = chebval(s[0]/1.,np.eye(chebdeg)).reshape(-1,1,1,1)
    f2 = chebval(s[1]/1.,np.eye(chebdeg)).reshape(1,-1,1,1)
    f3 = chebval(s[2]/8.,np.eye(chebdeg)).reshape(1,1,-1,1)
    f4 = ((a/2)**np.arange(3)).reshape(1,1,1,-1)
    return f1*f2*f3*f4
def vbasis(s,a):
    #n = np.arange(4)
    N = s.shape[0]

    #print(chebval(s[:,0]/1.,np.eye(chebdeg)).shape)
    f1 = chebval(s[:,0]/1.,np.eye(chebdeg)).T.reshape(N,-1,1,1,1)
    f2 = chebval(s[:,1]/1.,np.eye(chebdeg)).T.reshape(N,1,-1,1,1)
    f3 = chebval(s[:,2]/8.,np.eye(chebdeg)).T.reshape(N,1,1,-1,1)
    if type(a) == np.ndarray:
        f4 = (a.reshape(-1,1,1,1,1)/2)**(np.arange(3).reshape(1,1,1,1,-1))
    else:
        f4 = (a/2)**(np.arange(3).reshape(1,1,1,1,-1))
    return (f1*f2*f3*f4).reshape(N,-1)

def qmaxaction(alpha,s):
    #n = np.arange(4)
    N = s.shape[0]

    #print(chebval(s[:,0]/1.,np.eye(chebdeg)).shape)
    f1 = chebval(s[:,0]/1.,np.eye(chebdeg)).T.reshape(N,-1,1,1,1)
    f2 = chebval(s[:,1]/1.,np.eye(chebdeg)).T.reshape(N,1,-1,1,1)
    f3 = chebval(s[:,2]/8.,np.eye(chebdeg)).T.reshape(N,1,1,-1,1)
    z = f1*f2*f3
    a = (np.array([0,0,0.25]).reshape(1,1,1,1,-1)*z).reshape(N,-1)@alpha.value
    b = (np.array([0,0.5,0]).reshape(1,1,1,1,-1)*z).reshape(N,-1)@alpha.value

    acts = np.clip(-b/2/(a+1e-7), -2,2)
    q2 = vbasis(s,2)@alpha.value
    print(acts)
    qa = vbasis(s,acts)@alpha.value
    qn2 = vbasis(s,-2)@alpha.value
    return np.maximum(qa,np.maximum(qn2,q2))
 
def evalb(alpha, s,a):
    f = basis(s,a)
    return alpha*f.flatten()
 
def maxaction(alpha, obs):
    # fit a quadratic in the action through three samples and take its argmax
    f1 = np.sum(evalb(alpha.value, obs, 2))
    f2 = np.sum(evalb(alpha.value, obs, 0))
    f3 = np.sum(evalb(alpha.value, obs, -2))
    coeff = np.polyfit([2,0,-2], [f1,f2,f3], deg=2)
    if coeff[0] < 0:
        action = -coeff[1]/2/coeff[0]
        action = min(max(action,-2),2)
    elif f1 > f3:
        #print("huh")
        action = 2
    else:
        #print("how bout dat")
        action = -2
    return np.array([action])
 
constraints = []
objective = 0



 
for x in range(4):
 
    constraints = []
    epobs = []
    eprewards = []
    epactions = []
    objective = 0
    epsilon = 1.2/(x+1)
    print("epsilon: ", epsilon)
    epNum = 50
    for i_episode in range(epNum):
        observations = []
        rewards = []
        actions = []
        observation = env.reset()
        reward = -100
        for t in range(100):
 
            prev_obs = observation
            prev_reward = reward
            if np.random.rand() < epsilon:
                action = env.action_space.sample()
            else:
                action = maxaction(alpha, observation)
            observation, reward, done, info = env.step(action)
            observations.append(observation)
            rewards.append(reward)
            actions.append(action)

            #constraints.append(evalb(alpha, prev_obs, action) >= prev_reward + gamma*evalb(alpha, observation, action + np.random.randn()*0.2))#maxaction(alpha, observation)))
            #objective += evalb(alpha, observation, env.action_space.sample())
            if done:
                print("Episode finished after {} timesteps".format(t+1))
                break
        epobs.append(observations)
        epactions.append(actions)
        eprewards.append(rewards)
    for qiter in range(20):
        print("Q iteration: ", qiter)
        objective = 0
        for (observations,actions, rewards) in zip(epobs,epactions,eprewards):
            obs = np.array(observations)
            act = np.array(actions)
            bprev = vbasis(obs[:-1],act[1:]) #   np.array([basis(s,a).flatten() for (s,a) in zip(observations[:-1],actions[1:])])
            #bnext = np.array([basis(s,a+np.random.randn()*0.2).flatten() for (s,a) in zip(observations[1:],actions[1:])])
            #obs = np.array(observations)[:-1]
            #q2 = np.array([basis(s,2).flatten() for s in observations[:-1]])@alpha.value
            #q2 = vbasis(obs[1:],2)@alpha.value
            #q0 = vbasis(obs[1:],0)@alpha.value
            #qn2 = vbasis(obs[1:],-2)@alpha.value
            #q1 = vbasis(obs[1:],1)@alpha.value
            #qn1 = vbasis(obs[1:],-1)@alpha.value
            #qs = []
            #for a in np.linspace(-2,2,5):
            #    qs.append(vbasis(obs[1:],a+np.random.randn()*0.5)@alpha.value)
            #for a in np.linspace(-0.1,0.1,3):
            #    qs.append(vbasis(obs[1:],a)@alpha.value)

            #qmax = np.max(np.stack([q2,q0,qn2,q1,qn1],axis=1),axis=1)
            #qmax = np.max(np.stack(qs,axis=1),axis=1)
            qmax = qmaxaction(alpha, obs[1:])


            #q0 = np.array([basis(s,0).flatten() for s in observations[:-1]])@alpha.value
            #qn2 = np.array([basis(s,-2).flatten() for s in observations[:-1]])@alpha.value
           
            #qmax = np.maximum(np.maximum(q0,q2),qn2)#np.array([np.dot(basis(s, maxaction(alpha,s)).flatten(),alpha.value) for s in observations[1:]])
            rewards = np.array(rewards)[:-1]
            objective += cvx.sum(cvx.huber(bprev@alpha - (rewards + gamma*qmax))) #cvx.sum_squares(bprev@alpha - (rewards + gamma*qmax))

        #bcost = np.array([basis(s,env.action_space.sample()).flatten() for s in observations])
        #objective = cvx.sum(bcost@alpha) # np.sum(bnext,axis=0) * alpha
        loss = 0.001*cvx.sum_squares(alpha-alpha.value) + objective 
        prob = cvx.Problem(cvx.Minimize(loss), constraints)
        # The optimal objective value is returned by `prob.solve()`.
        print("solving problem")
        result = prob.solve(verbose=True)
        print(result)
        # The optimal value for x is stored in `x.value`.
        print(alpha.value)
    #inspection loop
    for i_episode in range(4):
        observation = env.reset()
        #env.env.state[0] = np.random.randn()*sigma 
        for t in range(200):
            env.render()
            #print(observation)
            prev_obs = observation
            action = maxaction(alpha, observation)
            #print(action)
 
            observation, reward, done, info = env.step(action)
            if done:
                print("Episode finished after {} timesteps".format(t+1))
                break

 

I also did the easy cartpole environment using the inequality trick.  Seems to work pretty well.

import gym
import numpy as np
import cvxpy as cvx
from numpy.polynomial.chebyshev import chebval
env = gym.make('CartPole-v1')
print(env.action_space)
 
print(env.observation_space)
print(env.observation_space.high)
 
print(env.observation_space.low)
#print(env.action_space.high)
#print(env.action_space.low)
 
 
chebdeg = 3
alpha = cvx.Variable(2*chebdeg**4)
alpha.value = np.zeros(2*chebdeg**4)
gamma = 1. - 1./25
def basis(s,a):
    n = np.arange(4)
    f1 = chebval(s[0]/4.,np.eye(chebdeg)).reshape(-1,1,1,1,1)
    f2 = chebval(s[1]/10.,np.eye(chebdeg)).reshape(1,-1,1,1,1)
    f3 = chebval(s[2]/0.4,np.eye(chebdeg)).reshape(1,1,-1,1,1)
    f4 = chebval(s[3]/10.,np.eye(chebdeg)).reshape(1,1,1,-1,1)
    f5 = ((a/2)**np.arange(2)).reshape(1,1,1,1,-1)
    return f1*f2*f3*f4*f5
 
def evalb(alpha, s,a):
    f = basis(s,a)
    return alpha*f.flatten()
 
def maxaction(alpha, obs):
    # discrete actions: evaluate Q at both and pick the better one
    f1 = np.sum(evalb(alpha.value, obs, 0))
    f2 = np.sum(evalb(alpha.value, obs, 1))
    if f1 > f2:
        action = 0
    else:
        action = 1
    return action
 
constraints = []
objective = 0
 
 
for x in range(4):
    constraints = []
    objective = 0
 

    epsilon = 1.2/(x+1)
    print("epsilon: ", epsilon)
    for i_episode in range(200):
        observation = env.reset()
        observations = []
        rewards = []
        actions = []
        reward = -100
        for t in range(50):
 
            prev_obs = observation
            prev_reward = reward
            if np.random.rand() < epsilon:
                action = env.action_space.sample()
            else:
                action = maxaction(alpha, observation)
            observation, reward, done, info = env.step(action)
            observations.append(observation)
            rewards.append(reward)
            actions.append(action)

            #constraints.append(evalb(alpha, prev_obs, action) >= prev_reward + gamma*evalb(alpha, observation, action + np.random.randn()*0.2))#maxaction(alpha, observation)))
            #objective += evalb(alpha, observation, env.action_space.sample())
            if done:
                print("Episode finished after {} timesteps".format(t+1))
                break
        #print(rewards)
        bprev = np.array([basis(s,a).flatten() for (s,a) in zip(observations[:-1],actions[1:])])
        bnext = np.array([basis(s,env.action_space.sample()).flatten() for (s,a) in zip(observations[1:],actions[1:])])
        rewards = np.array(rewards)[:-1]
        constraints += [(bprev@alpha) >= rewards + gamma*(bnext)@alpha]

        bcost = np.array([basis(s,env.action_space.sample()).flatten() for s in observations])
        objective += cvx.sum(bcost@alpha) # np.sum(bnext,axis=0) * alpha
    loss = 1*cvx.sum(cvx.abs(alpha-alpha.value)) + objective 
    prob = cvx.Problem(cvx.Minimize(loss), constraints)

    print("solving problem")
    result = prob.solve(verbose=True)
    print(result)
    print(alpha.value)
    #inspection loop
    for i_episode in range(4):
        observation = env.reset()
        #env.env.state[0] = np.random.randn()*sigma 
        for t in range(200):
            env.render()
            #print(observation)
            prev_obs = observation
            action = maxaction(alpha, observation)
            #print(action)
 
            observation, reward, done, info = env.step(action)
            if done:
                print("Episode finished after {} timesteps".format(t+1))
                break

 

 

I also have some Work in Progress on getting full swingup cartpole. Currently it is not really working. It seems to kind of be pumping about right though? The continuous force control easy cartpole does work.

import gym
import numpy as np
import cvxpy as cvx
from numpy.polynomial.chebyshev import chebval
env = gym.make('CartPole-v1')
print(env.action_space)
 
print(env.observation_space)
print(env.observation_space.high)
 
print(env.observation_space.low)
#print(env.action_space.high)
#print(env.action_space.low)
 
 
chebdeg = 2
varNum = 2*2*4*2*2*3
alpha = cvx.Variable(varNum)
alpha.value = np.zeros(varNum)
gamma = 1. - 1./150
def basis(s,a):
    n = np.arange(4)
    f1 = chebval(s[0]/4.,np.eye(2) ).reshape(-1,1,1,1,1,1)
    f2 = chebval(s[1]/10.,np.eye(2)).reshape(1,-1,1,1,1,1)
    f3 = chebval(s[2]/1,np.eye(4)  ).reshape(1,1,-1,1,1,1)
    f4 = chebval(s[3]/1,np.eye(2)  ).reshape(1,1,1,-1,1,1)
    f5 = chebval(s[4]/10.,np.eye(2)).reshape(1,1,1,1,-1,1)
    f6 = ((a/10)**np.arange(3)).reshape(1,1,1,1,1,-1)
    return f1*f2*f3*f4*f5*f6
 
def evalb(alpha, s,a):
    f = basis(s,a)
    return alpha*f.flatten()
'''
def maxaction(alpha, obs):
    f1 = np.sum(evalb(alpha.value, observation, 0))
    f2 = np.sum(evalb(alpha.value, observation, 1))
    if f1 > f2:
        action = 0
    else:
        action = 1
    return action
'''
def maxaction(alpha, obs):
    # fit a quadratic in the action through three samples and take its argmax
    f1 = np.sum(evalb(alpha.value, obs, 10))
    f2 = np.sum(evalb(alpha.value, obs, 0))
    f3 = np.sum(evalb(alpha.value, obs, -10))
    coeff = np.polyfit([10,0,-10], [f1,f2,f3], deg=2)
    if coeff[0] < 0:
        action = -coeff[1]/2/coeff[0]
        action = min(max(action,-10),10)
    elif f1 > f3:
        action = 10
    else:
        action = -10
    return np.array([action])

constraints = []
objective = 0
 
 
for x in range(4):
    constraints = []
    objective = 0
 

    epsilon = 1.2/(x+1)
    print("epsilon: ", epsilon)
    for i_episode in range(100):
        observation = env.reset()
        observations = []
        rewards = []
        actions = []
        reward = -100
        env.env.state[2] = np.random.rand()*2*np.pi
        for t in range(100):
 
            prev_obs = observation
            prev_reward = reward
            if np.random.rand() < epsilon or len(observation) < 5:
                action = np.random.rand()*20-10
            else:
                action = maxaction(alpha, observation)
            a = action
            env.env.force_mag = abs(a) #min(100, abs(a))
            #print(a)
            if a < 0:
                daction = 0
            else:
                daction = 1
            observation, reward, done, info = env.step(daction)
            o = observation
            observation = np.array([o[0],o[1],np.cos(o[2]),np.sin(o[2]),o[3]])

            observations.append(observation)
            bonus = 0
            if observation[2] > 0.9:
                bonus = 1

            rewards.append( (bonus + observation[2] + 1)/2)#- observation[0]**2) #reward
            #print(rewards[-1])
            actions.append(action)

            #constraints.append(evalb(alpha, prev_obs, action) >= prev_reward + gamma*evalb(alpha, observation, action + np.random.randn()*0.2))#maxaction(alpha, observation)))
            #objective += evalb(alpha, observation, env.action_space.sample())
            x = observation[0]
            if x < -4 or x > 4:
                break
            #if done:
            #    print("Episode finished after {} timesteps".format(t+1))
            #    break
        # Bellman inequality constraints: Q(s_t, a_t) >= r_t + gamma*Q(s_{t+1}, a'),
        # where a' is the taken action plus a small perturbation. observations[i]
        # is the state from which actions[i+1] was chosen, hence the offset slices.
        bprev = np.array([basis(s,a).flatten() for (s,a) in zip(observations[:-1],actions[1:])])
        bnext = np.array([basis(s,a+np.random.randn()*0.3).flatten() for (s,a) in zip(observations[1:],actions[1:])])
        rewards = np.array(rewards)[:-1]

        constraints += [(bprev@alpha) >= rewards + gamma*(bnext)@alpha]

        # push the Q function down onto the Bellman constraints by minimizing
        # it at the visited state-action pairs
        bcost = np.array([basis(s,a+np.random.randn()*0.3).flatten() for (s,a) in zip(observations,actions)])
        objective += cvx.sum(bcost@alpha)
    # L1 penalty keeps the new coefficients close to the previous iterate
    loss = 1*cvx.sum(cvx.abs(alpha-alpha.value)) + objective
    prob = cvx.Problem(cvx.Minimize(loss), constraints)

    print("solving problem")
    result = prob.solve(verbose=True)
    print(result)
    print(alpha.value)
    # inspection loop: run the greedy policy and render it
    for i_episode in range(4):
        observation = env.reset()
        env.env.state[2] = np.random.rand()*2*np.pi
        o = observation
        observation = np.array([o[0],o[1],np.cos(o[2]),np.sin(o[2]),o[3]])
        for t in range(200):
            env.render()
            action = maxaction(alpha, observation)
            a = action
            env.env.force_mag = abs(a)
            if a < 0:
                daction = 0
            else:
                daction = 1
            print(action)

            observation, reward, done, info = env.step(daction)
            o = observation
            observation = np.array([o[0],o[1],np.cos(o[2]),np.sin(o[2]),o[3]])

            if observation[0] < -4 or observation[0] > 4: # cart ran off the track
                break

 

Now I feel that a thing that matters quite a bit is your choice of action for the next time step in the Bellman constraint. Hypothetically you want a ton of samples here. I now think that using an action just slightly perturbed from the actual action works well, because the actual action tends to become roughly the optimal one, and subsequent time steps contain roughly the same data anyway.

One advantage of discrete action space is that you can really search it all.

Does that mean I should seriously investigate the sum of squares form? A semidefinite variable per data point sounds bad. I feel like I’d have to seriously limit the amount of data I’m using. Maybe I’ll be pleasantly surprised.

I haven’t even gotten to playing with different polynomials yet. The current basis is exponentially sized in the number of state variables, but in kind of a silly way. I think it would be better to use all terms of a bounded total degree, as in the sketch below.
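As a rough sketch of what I mean (this helper is hypothetical, not part of the code above), enumerating all exponent tuples with bounded total degree is easy with itertools:

import itertools
import numpy as np

def total_degree_basis(s, max_deg=3):
    # All monomials s[0]**e0 * s[1]**e1 * ... with e0 + e1 + ... <= max_deg.
    # The feature count grows like binomial(len(s) + max_deg, max_deg),
    # polynomial in the number of variables rather than exponential.
    feats = []
    for es in itertools.product(range(max_deg + 1), repeat=len(s)):
        if sum(es) <= max_deg:
            feats.append(np.prod([si**e for si, e in zip(s, es)]))
    return np.array(feats)

Plugging something like this in for basis would keep the feature count manageable.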

 

Deriving the Chebyshev Polynomials using Sum of Squares optimization with Sympy and Cvxpy

Least squares fitting \sum (f(x_i)-y_i)^2 is very commonly used and well loved. A sum-of-squares fit can be solved using just linear algebra. One of the most convincing use cases of linear programming to me is doing sum of absolute value fits \sum |f(x_i)-y_i| and maximum deviation fits \max_i |f(x_i)-y_i|. These two measures of fit quality are basically just as tractable as least squares, which is pretty cool.

The trick to turning an absolute value into an LP is to look at the region above the graph of the absolute value function.

This region is defined by y \ge x and y \ge -x. So you introduce a new variable y. Then the LP \min y subject to those constraints will minimize the absolute value. For a sum of absolute values, introduce a variable y_i for each absolute value you have and minimize \sum_i y_i. If you want to do min-max optimization, use the same y for every absolute value term:

\min y

\forall i. -y \le x_i \le y
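As a minimal sketch of the sum-of-absolute-values version in cvxpy (the data and variable names here are made up for illustration; cvx.abs would also do this reformulation for you automatically):

import cvxpy as cvx
import numpy as np

# made-up data for an L1 line fit
xs = np.linspace(0, 1, 50)
ys = 2*xs + 1 + np.random.randn(50)*0.1

m = cvx.Variable() # slope
b = cvx.Variable() # intercept
y = cvx.Variable(50) # one epigraph variable per residual

resid = m*xs + b - ys
constraints = [resid <= y, -y <= resid] # y_i >= |resid_i|
prob = cvx.Problem(cvx.Minimize(cvx.sum(y)), constraints)
prob.solve()
print(m.value, b.value)

For the min-max fit, you'd use a single scalar variable in place of the vector y, which is exactly what the code below does.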

Let’s change topic a bit. Chebyshev polynomials are awesome. They are basically the polynomials you want to use in numerics.

Chebyshev polynomials are sines and cosines in disguise, and they inherit tons of properties from them. One very important property is the equioscillation property. The Chebyshev polynomials are the polynomials that stay closest to zero on the interval x \in [-1,1] while keeping the leading coefficient fixed (the leading coefficient of T_n is 2^{n-1} by convention). They oscillate perfectly between -1 and 1 on that interval, just like a sort of stretched out sine. It turns out this equioscillation property defines the Chebyshev polynomials.

We can approximate the Chebyshev polynomials via sampling many points between [-1,1]. Then we do min of the max absolute error optimization using linear programming. What we get out does approximate the Chebyshev polynomials.

import cvxpy as cvx
import numpy as np
import matplotlib.pyplot as plt

# try finding the 3 through 5 chebyshev polynomial
for N in range(3,6):
	a = cvx.Variable(N) #polynomial coefficients
	t = cvx.Variable() 
	n = np.arange(N) #exponents
	

	xs = np.linspace(-1,1,100)
	chebcoeff = np.zeros(N)
	chebcoeff[-1] = 1
	plt.plot(xs, np.polynomial.chebyshev.chebval(xs, chebcoeff), color='r')

	constraints = [a[-1]==2**(N-2)] # fix the leading coefficient to Chebyshev's 2^(n-1)
	for k in range(100):
	   x = np.random.rand()*2-1 #pick random x in [-1,1]
	   c = cvx.sum(a * x**n) #evaluate polynomial

	   constraints.append(c <= t)
	   constraints.append(-t <= c)

	obj = cvx.Minimize(t) #minimize maximum absolute value
	prob = cvx.Problem(obj,constraints)
	prob.solve()
	plt.plot(xs, np.polynomial.polynomial.polyval(xs, a.value), color='g')
	print(a.value)

plt.show()

 

Found Coefficients:
[-9.95353054e-01  1.33115281e-03  1.99999613e+00]
[-0.01601964 -2.83172979  0.05364805  4.00000197]
[ 0.86388003 -0.33517716 -7.4286604   0.6983382   8.00000211]

 

Red is the actual Chebyshev polynomial and green is the solved-for polynomial. It does a decent job. More samples would do even better, and if we picked the Chebyshev points it would be perfect.

Can we do better? Yes we can. Let’s go on a little optimization journey.

Semidefinite programming allows you to optimize over matrix variables with the constraint that all their eigenvalues are nonnegative. In a way it lets you add an infinite number of linear constraints. Another way of stating the eigenvalue constraint is that

\forall q. q^T X q \ge 0

You could sample a finite number of random q vectors and take the conjunction of all these constraints. Once you had enough, this would probably be a pretty good approximation of the semidefinite constraint. But semidefinite programming lets you have an infinite number of these constraints, in the sense that \forall q references an infinite number of possible q, which is pretty remarkable.

Finitely sampling the qs is similar to the previously discussed sampling method for absolute value minimization.
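As a minimal sketch of that sampling idea (the problem here is made up): estimate the minimum eigenvalue of a fixed symmetric matrix A by maximizing t subject to q^T (A - tI) q \ge 0, enforced only along sampled unit directions q, which is just an LP:

import cvxpy as cvx
import numpy as np

np.random.seed(0)
A = np.random.randn(4, 4)
A = A + A.T # a made-up symmetric matrix

t = cvx.Variable()
qs = np.random.randn(500, 4)
qs = qs / np.linalg.norm(qs, axis=1, keepdims=True) # sampled unit directions

# q^T (A - t*I) q >= 0 with |q| = 1 is just q^T A q >= t
constraints = [q @ A @ q >= t for q in qs]
prob = cvx.Problem(cvx.Maximize(t), constraints)
prob.solve()
print(prob.value, np.linalg.eigvalsh(A).min())

The sampled answer upper-bounds the true minimum eigenvalue and tightens with more samples; the exact SDP replaces the 500 sampled constraints with the single constraint A - t*np.eye(4) >> 0.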

Sum of Squares optimization allows you to pick optimal polynomials with the constraint that they can be written as a sum of squares of polynomials. In this form, the polynomials are manifestly nonnegative everywhere. Sum of Squares programming is a perspective to take on semidefinite programming. They are equivalent in power. Under the hood, you solve SOS programs by transforming them into semidefinite ones.

You can write a polynomial as a vector of coefficients \tilde{a}.

\tilde{x} = \begin{bmatrix} 1 \\ x \\ x^2 \\ x^3 \\ \vdots \end{bmatrix}

\tilde{a} = \begin{bmatrix} a_0 \\ a_1 \\ a_2 \\ a_3 \\ \vdots \end{bmatrix}

p(x)=\tilde{a}^T \tilde{x}

Instead, we represent the polynomial with a matrix Q:

p(x) = \tilde{x}^T Q \tilde{x}

If the matrix is positive semidefinite, then it can be diagonalized into the sum of squares form.
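Concretely, if Q = L L^T is a Cholesky factorization of a positive semidefinite Q, then p(x) = \tilde{x}^T Q \tilde{x} = \sum_i ((L^T \tilde{x})_i)^2, and each (L^T \tilde{x})_i is itself a polynomial, so p is a sum of squares. A quick numerical check of that identity (the Gram matrix here is made up):

import numpy as np

# a made-up positive definite Gram matrix for the monomials [1, x, x^2]
Q = np.array([[2., 1., 0.],
              [1., 3., 1.],
              [0., 1., 2.]])
L = np.linalg.cholesky(Q) # Q = L @ L.T

x = 0.7
xt = np.array([1., x, x**2])
p1 = xt @ Q @ xt # x~^T Q x~
p2 = np.sum((L.T @ xt)**2) # sum of squared polynomial values
print(p1, p2) # agree up to rounding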

In all honesty, this all sounds quite esoteric, and it kind of is. I struggle to find problems to solve with this stuff. But here we are! We’ve got one! We’re gonna find the Chebyshev polynomials exactly by translating the previous method to SOS.

The formulation is a direct transcription of the above tricks.

\min t

-t \le p(x) \le t  by which I mean p(x) + t is SOS and t - p(x) is SOS.

There are a couple of packages available for Python that already do SOS:

ncpol2sdpa (https://ncpol2sdpa.readthedocs.io/en/stable/)

Irene (https://irene.readthedocs.io/en/latest/index.html)

SumOfSquares.jl for Julia and SOSTOOLS for Matlab. YALMIP too, I think. Instead of using those packages, I want to roll my own, like a doofus.

Sympy already has very useful polynomial manipulation functionality. What we’re going to do is form up the appropriate expressions by collecting powers of x, and then turn them into cvxpy expressions term by term. The transcription from sympy to cvxpy isn’t so bad, especially with a couple helper functions.

One annoying extra thing we have to do is known as the S-procedure. We don’t care about regions outside of x \in [-1,1]. We can specify this with a polynomial inequality (x+1)(x-1) \ge 0. If we multiply this polynomial by any manifestly positive polynomial (a SOS polynomial in particular will work), it will remain positive in the region we care about. We can then add this function into all of our SOS inequalities to make them easier to satisfy. This is very similar to a Lagrange multiplier procedure.
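Spelled out for the upper inequality (this is the same derivation that appears as comments in the code below): we ask for SOS polynomials \sigma(x) and \lambda(x) such that

t - p(x) = \sigma(x) - \lambda(x)(x+1)(x-1)

For x \in [-1,1] we have (x+1)(x-1) \le 0 and \lambda(x) \ge 0, so -\lambda(x)(x+1)(x-1) \ge 0 and therefore t - p(x) \ge \sigma(x) \ge 0 on the interval, which is exactly the one-sided bound we wanted.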

Now all of this seems reasonable. But it is not clear to me that we have the truly best polynomial in hand with this s-procedure business. But it seems to work out.

from sympy import *
import cvxpy as cvx
import matplotlib.pyplot as plt
import numpy as np

#build corresponding cvx variable for sympy variable
def cvxvar(expr, PSD=True):
    if expr.func == MatrixSymbol:
        i = int(expr.shape[0].evalf())
        j = int(expr.shape[1].evalf())
        return cvx.Variable((i,j), PSD=PSD)        
    elif expr.func == Symbol:
        return cvx.Variable()

def cvxify(expr, cvxdict): # replaces sympy variables with cvx variables
     return lambdify(tuple(cvxdict.keys()), expr)(*cvxdict.values()) 


xs = np.linspace(-1,1,100)

for N in range(3,6):
    #plot actual chebyshev
    chebcoeff = np.zeros(N)
    chebcoeff[-1] = 1
    plt.plot(xs, np.polynomial.chebyshev.chebval(xs, chebcoeff), color='r')

    cvxdict = {}
    # Going to need symbolic polynomials in x
    x = Symbol('x')
    xn = Matrix([x**n for n in range(N)]);

    def sospoly(varname):
        Q = MatrixSymbol(varname, N,N)
        p = (xn.T * Matrix(Q) * xn)[0]
        return p, Q

    t = Symbol('t')
    cvxdict[t] = cvxvar(t)

    #lagrange multiplier polynomial 1
    pl1, l1 = sospoly('l1')
    cvxdict[l1] = cvxvar(l1)

    #lagrange multiplier polynomial 2
    pl2, l2 = sospoly('l2')
    cvxdict[l2] = cvxvar(l2)

    #Equation SOS Polynomial 1
    peq1, eq1 = sospoly('eq1')
    cvxdict[eq1] = cvxvar(eq1)

    #Equation SOS Polynomial 2
    peq2, eq2 = sospoly('eq2')
    cvxdict[eq2] = cvxvar(eq2)

    a = MatrixSymbol("a", N,1)
    pa = (Matrix(a).T*xn)[0] # the polynomial p(x) with symbolic coefficient vector a
    cvxdict[a] = cvxvar(a, PSD=False)


    constraints = []


    # Rough derivation for the upper constraint:
    # pol <= t
    # 0 <= t - pol + lam*(x+1)(x-1)   # relax the constraint with an SOS multiplier lam
    # eq1 = t - pol + lam*(x+1)(x-1)  # eq1 is SOS
    # 0 = t - pol + lam*(x+1)(x-1) - eq1  # rearrange to equal zero
    z1 = t - pa + pl1 * (x+1)*(x-1) - peq1
    z1 = Poly(z1, x).all_coeffs()
    # match the coefficient of every power of x to zero
    constraints += [cvxify(expr, cvxdict) == 0 for expr in z1]

    # Derivation for the lower constraint:
    # -t <= pol
    # 0 <= pol + t + lam*(x+1)(x-1)   # relax the constraint with an SOS multiplier lam
    # eq2 = pol + t + lam*(x+1)(x-1)  # eq2 is SOS
    # 0 = pol + t + lam*(x+1)(x-1) - eq2  # rearrange to equal zero
    z2 = pa + t + pl2 * (x+1)*(x-1) - peq2
    z2 = Poly(z2, x).all_coeffs()
    constraints += [cvxify(expr, cvxdict) == 0 for expr in z2]

    constraints += [cvxdict[a][N-1,0] == 2**(N-2)] # fix the leading coefficient
    obj = cvx.Minimize(cvxdict[t]) #minimize maximum absolute value
    prob = cvx.Problem(obj,constraints)
    prob.solve()

    print(cvxdict[a].value.flatten())
    plt.plot(xs, np.polynomial.polynomial.polyval(xs, cvxdict[a].value.flatten()), color='g')


plt.show()

Coefficients:
[-1.00000000e+00 -1.02219773e-15  2.00000001e+00]
[-1.23103133e-13 -2.99999967e+00  1.97810058e-13  4.00001268e+00]
[ 1.00000088e+00 -1.39748880e-15 -7.99999704e+00 -3.96420452e-15
  7.99999691e+00]

 

Ooooooh yeah. Those curves are so similar you can’t even see the difference. NICE. JUICY.

There are a couple of interesting extensions to this. We could find globally under- or over-approximating polynomials. This might be nice for a verified compression of a big polynomial into smaller, simpler polynomials, for example. We could also similarly form the pointwise best approximation of an arbitrary polynomial f(x) rather than the constant 0 like we did above (replace -t \le p(x) \le t with -t \le p(x) - f(x) \le t in the above). Or perhaps we could use it to find a best polynomial fit for some differential equation according to a pointwise error.

I think we could also extend this method to minimizing the mean absolute value integral just like we did in the sampling case.

\min \int_0^1 t(x)dx

-t(x) \le p(x) \le t(x)
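Since the integral is linear in the coefficients of t(x) = \sum_k c_k x^k, the objective stays a linear function:

\int_0^1 t(x) dx = \sum_k \frac{c_k}{k+1}

so this would still be the same kind of SOS program, just with a different linear objective.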

 

More references on Sum of Squares optimization:

http://www.mit.edu/~parrilo/