Deriving the Chebyshev Polynomials using Sum of Squares optimization with Sympy and Cvxpy

Least squares fitting $\sum (f(x_i)-y_i)^2$ is very commonly used and well loved. Sum of squared fitting can be solved using just linear algebra. One of the most convincing use cases to me of linear programming is doing sum of absolute value fits $\sum |f(x_i)-y_i|$  and maximum deviation fits $\max_i |f(x_i)-y_i|$. These two quality of fits are basically just as tractable as least squares, which is pretty cool.

The trick to turning an absolute value into an LP is to look at the region above the graph of absolute value.

This region is defined by $y \ge x$ and $y \ge -x$. So you introduce a new variable y. Then the LP $\min y$ subject to those constraints will minimize the absolute value. For a sum of absolute values, introduce a variable $y_i$ for each absolute value you have. Then minimize $\sum_i y_i$. If you want to do min max optimization, use the same y value for every absolute value function.

$\min y$

$\forall i. -y \le x_i \le y$

Let’s change topic a bit. Chebyshev polynomials are awesome. They are basically the polynomials you want to use in numerics.

Chebyshev polynomials are sines and cosines in disguise. They inherit tons of properties from them. One very important property is the equioscillation property. The Chebyshev polynomials are the polynomials that stay closest to zero while keeping the x^n coefficient nonzero (2^(n-2) by convention). They oscillate perfectly between -1 and 1 on the interval $x \in [-1,1]$ just like sort of a stretched out sine. It turns out this equioscillation property defines the Chebyshev polynomials

We can approximate the Chebyshev polynomials via sampling many points between [-1,1]. Then we do min of the max absolute error optimization using linear programming. What we get out does approximate the Chebyshev polynomials.

Red is the actual Chebyshev polynomials and green is the solved for polynomials. It does a decent job. More samples will do even better, and if we picked the Chebyshev points it would be perfect.

Can we do better? Yes we can. Let’s go on a little optimization journey.

Semidefinite programming allows you to optimize matrix variables with the constraint that they have all positive eigenvalues. In a way it lets you add an infinite number of linear constraints. Another way of stating the eigenvalue constraint is that

$\forall q. q^T X q \ge 0$

You could sample a finite number of random q vectors and take the conjunction of all these constraints. Once you had enough, this is probably a pretty good approximation of the Semidefinite constraint. But semidefinite programming let’s you have an infinite number of the constraints in the sense that $\forall q$ is referencing an infinite number of possible q , which is pretty remarkable.

Finite Sampling the qs has similarity to the previously discussed sampling method for absolute value minimization.

Sum of Squares optimization allows you to pick optimal polynomials with the constraint that they can be written as a sum of squares polynomials. In this form, the polynomials are manifestly positive everywhere. Sum of Squares programming is a perspective to take on Semidefinite programming. They are equivalent in power. You solve SOS programs under the hood by transforming them into semidefinite ones.

You can write a polynomial as a vector of coefficients $\tilde{a}$.

$\tilde{x} = \begin{bmatrix} 1 \\ x \\ x^2 \\ x^3 \\ \vdots \end{bmatrix}$

$\tilde{a} = \begin{bmatrix} a_0 \\ a_1 \\ a_2 \\ a_3 \\ \vdots \end{bmatrix}$

$p(x)=\tilde{a}^T \tilde{x}$

Instead we represent the polynomial with the matrix $Q$

$p(x) = \tilde{x}^T Q \tilde{x}$

If the matrix is positive semidefinite, then it can be diagonalized into the sum of squares form.

In all honesty, this all sounds quite esoteric, and it kind of is. I struggle to find problems to solve with this stuff. But here we are! We’ve got one! We’re gonna find the Chebyshev polynomials exactly by translating the previous method to SOS.

The formulation is a direct transcription of the above tricks.

$\min t$

$-t \le p(x) \le t$  by which I mean $p(x) + t$ is SOS and $t - p(x)$ is SOS.

There are a couple packages available for python already that already do SOS, .

SumofSquares.jl for Julia and SOSTools for Matlab. YalMip too I think. Instead of using those packages, I want to roll my own, like a doofus.

Sympy already has very useful polynomial manipulation functionality. What we’re going to do is form up the appropriate expressions by collecting powers of x, and then turn them into cvxpy expressions term by term. The transcription from sympy to cvxpy isn’t so bad, especially with a couple helper functions.

One annoying extra thing we have to do is known as the S-procedure. We don’t care about regions outside of $x \in [-1,1]$. We can specify this with a polynomial inequality $(x+1)(x-1) \ge 0$. If we multiply this polynomial by any manifestly positive polynomial (a SOS polynomial in particular will work), it will remain positive in the region we care about. We can then add this function into all of our SOS inequalities to make them easier to satisfy. This is very similar to a Lagrange multiplier procedure.

Now all of this seems reasonable. But it is not clear to me that we have the truly best polynomial in hand with this s-procedure business. But it seems to works out.

Ooooooh yeah. Those curves are so similar you can’t even see the difference. NICE. JUICY.

There are a couple interesting extension to this. We could find global under or over approximating polynomials. This might be nice for a verified compression of a big polynomial to smaller, simpler polynomials for example. We could also similar form the pointwise best approximation of any arbitrary polynomial f(x) rather than the constant 0 like we did above (replace $-t \le p(x) \le t$ for $-t \le p(x) - f(x) \le t$ in the above). Or perhaps we could use it to find a best polynomial fit for some differential equation according to a pointwise error.

I think we could also extend this method to minimizing the mean absolute value integral just like we did in the sampling case.

$\min \int_0^1 t(x)dx$

$-t(x) \le p(x) \le t(x)$

More references on Sum of Squares optimization:

http://www.mit.edu/~parrilo/

Solving the Ising Model using a Mixed Integer Linear Program Solver (Gurobi)

I came across an interesting thing, that finding the minimizer of the Ising model is encodable as a mixed integer linear program.

The Ising model is a simple model of a magnet. A lattice of spins that can either be up or down. They want to align with an external magnetic field, but also with their neighbors (or anti align, depending on the sign of the interaction). At low temperatures they can spontaneously align into a permanent magnet. At high temperatures, they are randomized. It is a really great model that contains the essence of many physics topics.

Linear Programs minimize linear functions subject to linear equality and inequality constraints. It just so happens this is a very solvable problem (polynomial time).

MILP also allow you to add the constraint that variables take on integer values. This takes you into NP territory. Through fiendish tricks, you can encode very difficult problems. MILP solvers use LP solvers as subroutines, giving them clues where to search, letting them step early if the LP solver returns integer solutions, or for bounding branches of the search tree.

How this all works is very interesting (and very, very roughly explained), but barely matters practically since other people have made fiendishly impressive implementations of this that I can’t compete with. So far as I can tell, Gurobi is one of the best available implementations (Hans Mittelman has some VERY useful benchmarks here http://plato.asu.edu/bench.html), and they have a gimped trial license available (2000 variable limit. Bummer.). Shout out to CLP and CBC, the Coin-Or Open Source versions of this that still work pretty damn well.

Interesting Connection: Quantum Annealing (like the D-Wave machine) is largely based around mapping discrete optimization problems to an Ising model. We are traveling that road in the opposite direction.

So how do we encode the Ising model?

Each spin is a binary variable $s_i \in {0,1}$

We also introduce a variable for every edge. which we will constrain to actually be the product of the spins. $e_{ij} \in {0,1}$. This is the big trick.

We can compute the And/Multiplication (they coincide for 0/1 binary variables) of the spins using a couple linear constraints. I think this does work for the 4 cases of the two spins.

$e_{ij} \ge s_i +s_j - 1$

$e_{ij} \le s_j$

$e_{ij} \le s_i$

The xor is usually what we care about for the Ising model, we want aligned vs unaligned spins to have different energy. It will have value 1 if they are aligned and 0 if they are anti-aligned. This is a linear function of the spins and the And.

$s_i \oplus s_j = s_i + s_j - 2 e_{ij}$

Then the standard Hamiltonian is

$H=\sum B_i s_i + \sum J_{ij} (s_i + s_j - 2 e_{ij})$

Well, modulo some constant offset. You may prefer making spins $\pm 1$, but that leads to basically the same Hamiltonian.

The Gurobi python package actually let’s us directly ask for AND constraints, which means I don’t actually have to code much of this.

We are allowed to use spatially varying external field B and coupling parameter J. The Hamiltonian is indeed linear in the variables as promised.

After already figuring this out, I found this chapter where they basically do what I’ve done here (and more probably). There is nothing new under the sun. The spatially varying fields B and J are very natural in the field of spin glasses.

https://onlinelibrary.wiley.com/doi/10.1002/3527603794.ch4

For a while I thought this is all we could do, find the lowest energy solution, but there’s more! Gurobi is one of the few solvers that support iteration over the lowest optimal solutions, which means we can start to talk about a low energy expansion. https://www.gurobi.com/documentation/8.0/refman/poolsearchmode.html#parameter:PoolSearchMode

Here we’ve got the basic functionality. Getting 10,000 takes about a minute. This is somewhat discouraging when I can see that we haven’t even got to very interesting ones yet, just single spin and double spin excitations. But I’ve got some ideas on how to fix that. Next time baby-cakes.

(A hint: recursion with memoization leads to some brother of a cluster expansion.)

Here’s the ground antiferromagnetic state. Cute.

Extracting a Banded Hessian in PyTorch

So pytorch does have some capability towards higher derivatives, with the caveat that you have to dot the gradients to turn them back into scalars before continuing. What this means is that you can sample a single application of the  Hessian (the matrix of second derivatives) at a time.

One could sample out every column of the hessian for example. Performance-wise I don’t know how bad this might be.

For a banded hessian, which will occur in a trajectory optimization problem (the bandedness being a reflection of the finite difference scheme), you don’t need that many samples. This feels more palatable. You only need to sample the hessian roughly the bandwidth number of times, which may be quite small. Plus, then you can invert that banded hessian very quickly using special purpose banded matrix solvers, which are also quite fast. I’m hoping that once I plug this into the trajectory optimization, I can use a Newton method (or SQP?) which will perform better than straight gradient descent.

If you pulled just a single column using [1,0,0,0,0,0..] for example, that would be wasteful, since there are so many known zeros in the banded matrix. Instead something like [1,0,0,1,0,0,1,0,0..] will not have any zeros in the result. This gets us every 3rd row of the matrix. Then we can sample with shifted versions like [0,1,0,0,1,0,0,1,0,0..]. until we have all the rows somewhere. Then there is some index shuffling to put the thing into a sane ordering, especially so that we can use https://docs.scipy.org/doc/scipy/reference/generated/scipy.linalg.solveh_banded.html which requires the banded matrix to be given in a particular form.

An alternative approach might be to use an fft with some phase twiddling. Also it feels like since the Hessian is hermitian we ought to be able to use about half the samples, since half are redundant, but I haven’t figured out a clean way to do this yet. I think that perhaps sampling with random vectors and then solving for the coefficients would work, but again how to organize the code for such a thing?

Here’s a snippet simulatng extracting the band matrix from matrix products.

and here is the full pytorch implementation including a linear banded solve.

Output: