Deriving the Chebyshev Polynomials using Sum of Squares optimization with Sympy and Cvxpy

Least squares fitting \sum (f(x_i)-y_i)^2 is very commonly used and well loved. Sum of squared fitting can be solved using just linear algebra. One of the most convincing use cases to me of linear programming is doing sum of absolute value fits \sum |f(x_i)-y_i|  and maximum deviation fits \max_i |f(x_i)-y_i|. These two quality of fits are basically just as tractable as least squares, which is pretty cool.

The trick to turning an absolute value into an LP is to look at the region above the graph of absolute value.

This region is defined by y \ge x and y \ge -x. So you introduce a new variable y. Then the LP \min y subject to those constraints will minimize the absolute value. For a sum of absolute values, introduce a variable y_i for each absolute value you have. Then minimize \sum_i y_i. If you want to do min max optimization, use the same y value for every absolute value function.

\min y

\forall i. -y \le x_i \le y

Let’s change topic a bit. Chebyshev polynomials are awesome. They are basically the polynomials you want to use in numerics.

Chebyshev polynomials are sines and cosines in disguise. They inherit tons of properties from them. One very important property is the equioscillation property. The Chebyshev polynomials are the polynomials that stay closest to zero while keeping the x^n coefficient nonzero (2^(n-2) by convention). They oscillate perfectly between -1 and 1 on the interval x \in [-1,1] just like sort of a stretched out sine. It turns out this equioscillation property defines the Chebyshev polynomials

We can approximate the Chebyshev polynomials via sampling many points between [-1,1]. Then we do min of the max absolute error optimization using linear programming. What we get out does approximate the Chebyshev polynomials.

import cvxpy as cvx
import numpy as np
import matplotlib.pyplot as plt

# try finding the 3 through 5 chebyshev polynomial
for N in range(3,6):
	a = cvx.Variable(N) #polynomial coefficients
	t = cvx.Variable() 
	n = np.arange(N) #exponents

	xs = np.linspace(-1,1,100)
	chebcoeff = np.zeros(N)
	chebcoeff[-1] = 1
	plt.plot(xs, np.polynomial.chebyshev.chebval(xs, chebcoeff), color='r')

	constraints = [a[-1]==2**(N-2)] # have to have highest power
	for k in range(100):
	   x = np.random.rand()*2-1 #pick random x in [-1,1]
	   c = cvx.sum(a * x**n) #evaluate polynomial

	   constraints.append(c <= t)
	   constraints.append(-t <= c)

	obj = cvx.Minimize(t) #minimize maximum aboslute value
	prob = cvx.Problem(obj,constraints)
	plt.plot(xs, np.polynomial.polynomial.polyval(xs, a.value), color='g')


Found Coefficients:
[-9.95353054e-01  1.33115281e-03  1.99999613e+00]
[-0.01601964 -2.83172979  0.05364805  4.00000197]
[ 0.86388003 -0.33517716 -7.4286604   0.6983382   8.00000211]


Red is the actual Chebyshev polynomials and green is the solved for polynomials. It does a decent job. More samples will do even better, and if we picked the Chebyshev points it would be perfect.

Can we do better? Yes we can. Let’s go on a little optimization journey.

Semidefinite programming allows you to optimize matrix variables with the constraint that they have all positive eigenvalues. In a way it lets you add an infinite number of linear constraints. Another way of stating the eigenvalue constraint is that

\forall q. q^T X q \ge 0

You could sample a finite number of random q vectors and take the conjunction of all these constraints. Once you had enough, this is probably a pretty good approximation of the Semidefinite constraint. But semidefinite programming let’s you have an infinite number of the constraints in the sense that \forall q is referencing an infinite number of possible q , which is pretty remarkable.

Finite Sampling the qs has similarity to the previously discussed sampling method for absolute value minimization.

Sum of Squares optimization allows you to pick optimal polynomials with the constraint that they can be written as a sum of squares polynomials. In this form, the polynomials are manifestly positive everywhere. Sum of Squares programming is a perspective to take on Semidefinite programming. They are equivalent in power. You solve SOS programs under the hood by transforming them into semidefinite ones.

You can write a polynomial as a vector of coefficients \tilde{a}.

\tilde{x} = \begin{bmatrix} 1 \\ x \\ x^2 \\ x^3 \\ \vdots \end{bmatrix}

\tilde{a} = \begin{bmatrix} a_0 \\ a_1 \\ a_2 \\ a_3 \\ \vdots \end{bmatrix}

p(x)=\tilde{a}^T \tilde{x}

Instead we represent the polynomial with the matrix Q

p(x) = \tilde{x}^T Q \tilde{x}

If the matrix is positive semidefinite, then it can be diagonalized into the sum of squares form.

In all honesty, this all sounds quite esoteric, and it kind of is. I struggle to find problems to solve with this stuff. But here we are! We’ve got one! We’re gonna find the Chebyshev polynomials exactly by translating the previous method to SOS.

The formulation is a direct transcription of the above tricks.

\min t

-t \le p(x) \le t  by which I mean p(x) + t is SOS and t - p(x) is SOS.

There are a couple packages available for python already that already do SOS, .

ncpol2sdpa (

Irene (

SumofSquares.jl for Julia and SOSTools for Matlab. YalMip too I think. Instead of using those packages, I want to roll my own, like a doofus.

Sympy already has very useful polynomial manipulation functionality. What we’re going to do is form up the appropriate expressions by collecting powers of x, and then turn them into cvxpy expressions term by term. The transcription from sympy to cvxpy isn’t so bad, especially with a couple helper functions.

One annoying extra thing we have to do is known as the S-procedure. We don’t care about regions outside of x \in [-1,1]. We can specify this with a polynomial inequality (x+1)(x-1) \ge 0. If we multiply this polynomial by any manifestly positive polynomial (a SOS polynomial in particular will work), it will remain positive in the region we care about. We can then add this function into all of our SOS inequalities to make them easier to satisfy. This is very similar to a Lagrange multiplier procedure.

Now all of this seems reasonable. But it is not clear to me that we have the truly best polynomial in hand with this s-procedure business. But it seems to works out.

from sympy import *
import cvxpy as cvx
import matplotlib.pyplot as plt
import numpy as np

#build corresponding cvx variable for sympy variable
def cvxvar(expr, PSD=True):
    if expr.func == MatrixSymbol:
        i = int(expr.shape[0].evalf())
        j = int(expr.shape[1].evalf())
        return cvx.Variable((i,j), PSD=PSD)        
    elif expr.func == Symbol:
        return cvx.Variable()

def cvxify(expr, cvxdict): # replaces sympy variables with cvx variables
     return lambdify(tuple(cvxdict.keys()), expr)(*cvxdict.values()) 

xs = np.linspace(-1,1,100)

for N in range(3,6):
    #plot actual chebyshev
    chebcoeff = np.zeros(N)
    chebcoeff[-1] = 1
    plt.plot(xs, np.polynomial.chebyshev.chebval(xs, chebcoeff), color='r')

    cvxdict = {}
    # Going to need symbolic polynomials in x
    x = Symbol('x')
    xn = Matrix([x**n for n in range(N)]);

    def sospoly(varname):
        Q = MatrixSymbol(varname, N,N)
        p = (xn.T * Matrix(Q) * xn)[0]
        return p, Q

    t = Symbol('t')
    cvxdict[t] = cvxvar(t)

    #lagrange multiplier polynomial 1
    pl1, l1 = sospoly('l1')
    cvxdict[l1] = cvxvar(l1)

    #lagrange multiplier polynomial 2
    pl2, l2 = sospoly('l2')
    cvxdict[l2] = cvxvar(l2)

    #Equation SOS Polynomial 1
    peq1, eq1 = sospoly('eq1')
    cvxdict[eq1] = cvxvar(eq1)

    #Equation SOS Polynomial 2
    peq2, eq2 = sospoly('eq2')
    cvxdict[eq2] = cvxvar(eq2)

    a = MatrixSymbol("a", N,1)
    pa = Matrix(a).T*xn #sum([polcoeff[k] * x**k for k in range(n)]);
    pa = pa[0]
    cvxdict[a] = cvxvar(a, PSD=False)

    constraints = []

    # Rough Derivation for upper constraint
    # pol <= t
    # 0 <= t - pol + lam * (x+1)(x-1)  # relax constraint with lambda
    # eq1 = t - pol + lam
    # 0 = t - pol + lam - eq1
    z1 = t - pa + pl1 * (x+1)*(x-1) - peq1
    z1 = Poly(z1, x).all_coeffs()
    constraints += [cvxify(expr, cvxdict) == 0 for expr in z1]

    # Derivation for lower constraint
    # -t <= pol
    # 0 <= pol + t + lam * (x+1)(x-1) # relax constraint with lambda
    # eq2 = pol + t + lam     # eq2 is SOS
    # 0 = t - pol + lam - eq2     #Rearrange to equal zero.
    z2 = pa + t + pl2 * (x+1)*(x-1) - peq2
    z2 = Poly(z2, x).all_coeffs()
    constraints += [cvxify(expr, cvxdict) == 0 for expr in z2]

    constraints += [cvxdict[a][N-1,0] == 2**(N-2) ]
    obj = cvx.Minimize(cvxdict[t]) #minimize maximum absolute value
    prob = cvx.Problem(obj,constraints)

    plt.plot(xs, np.polynomial.polynomial.polyval(xs, cvxdict[a].value.flatten()), color='g')

[-1.00000000e+00 -1.02219773e-15  2.00000001e+00]
[-1.23103133e-13 -2.99999967e+00  1.97810058e-13  4.00001268e+00]
[ 1.00000088e+00 -1.39748880e-15 -7.99999704e+00 -3.96420452e-15


Ooooooh yeah. Those curves are so similar you can’t even see the difference. NICE. JUICY.

There are a couple interesting extension to this. We could find global under or over approximating polynomials. This might be nice for a verified compression of a big polynomial to smaller, simpler polynomials for example. We could also similar form the pointwise best approximation of any arbitrary polynomial f(x) rather than the constant 0 like we did above (replace -t \le p(x) \le t for -t \le p(x) - f(x) \le t in the above). Or perhaps we could use it to find a best polynomial fit for some differential equation according to a pointwise error.

I think we could also extend this method to minimizing the mean absolute value integral just like we did in the sampling case.

\min \int_0^1 t(x)dx

-t(x) \le p(x) \le t(x)


More references on Sum of Squares optimization:



Reverse Mode Differentiation is Kind of Like a Lens II

For those looking for more on automatic differentiation in Haskell:

Ed Kmett’s ad package

Conal Elliott is making the rounds with a new take on AD (GOOD STUFF).

Justin Le has been making excellent posts and has another library he’s working on.


And here we go:

Reverse mode automatic differentiation is kind of like a lens. Here is the type for a non-fancy lens

type Lens s t a b = s -> (a, b -> t)

When you compose two lenses, you compose the getters (s -> a) and you compose the partially applied setter (b -> t) in the reverse direction.

We can define a type for a reverse mode differentiable function

type AD x dx y dy = x -> (y, dy -> dx)

When you compose two differentiable functions you compose the functions and you flip compose the Jacobian transpose (dy -> dx). It is this flip composition which gives reverse mode it’s name. The dependence of the Jacobian on the base point x corresponds to the dependence of the setter on the original object

The implementation of composition for Lens and AD are identical.

Both of these things are described by the same box diagram (cribbed from the profunctor optics paper ).



This is a very simple way of implementing a reserve mode automatic differentiation using only non exotic features of a functional programming language. Since it is so bare bones and functional, is this a good way to achieve the vision gorgeous post by Christoper Olah?  I do not know.

Now, to be clear, these ARE NOT lenses. Please, I don’t want to cloud the water, do not call these lenses. They’re pseudolenses or something. A very important part of what makes a lens a lens is that it obeys the lens laws, in which the getter and setter behave as one would expect. Our “setter” is a functional representation of the Jacobian transpose and our getter is the function itself. These do not obey lens laws in general.

Chain Rule AND Jacobian

What is reverse mode differentiation? One’s thinking is muddled by defaulting to the Calc I perspective of one dimensional functions. Thinking is also muddled by  the general conception that the gradient is a vector. This is slightly sloppy talk and can lead to confusion. It definitely has confused me.

The right setting for intuition is R^n \rightarrow R^m functions

If one looks at a multidimensional to multidimensional function like this, you can form a matrix of partial derivatives known as the Jacobian. In the scalar to scalar case this is a 1\times 1 matrix, which we can think of as just a number. In the multi to scalar case this is a 1\times n matrix which we somewhat fuzzily can think of as a vector.

The chain rule is a beautiful thing. It is what makes differentiation so elegant and tractable.

For many-to-many functions, if you compose them you matrix multiply their Jacobians.

Just to throw in some category theory spice (who can resist), the chain rule is a functor between the category of differentiable functions and the category of vector spaces where composition is given by Jacobian multiplication. This is probably wholly unhelpful.

The cost of multiplication for an a \times b matrix A and an b \times c matrix B is O(abc) . If we have 3 matrices ABC, we can associate to the left or right. (AB)C vs A(BC) choosing which product to form first. These two associations have different cost, abc * acd for left association or abd * bcd for right association. We want to use the smallest dimension over and over. For functions that are ultimately many to scalar functions, that means we want to multiply starting at the right.

For a clearer explanation of the importance of the association, maybe this will help


Functional representations of matrices

A Matrix data type typically gives you full inspection of the elements. If you partially apply the matrix vector product function (!* :: Matrix -> Vector -> Vector) to a matrix m, you get a vector to vector function (!* m) :: Vector -> Vector. In the sense that a matrix is data representing a linear map, this type looks gorgeous. It is so evocative of purpose.

If all you want to do is multiply matrices or perform matrix vector products this is not a bad way to go. A function in Haskell is a thing that exposes only a single interface, the ability to be applied. Very often, the loss of Gaussian elimination or eigenvalue decompositions is quite painfully felt. For simple automatic differentiation, it isn’t so bad though.

You can inefficiently reconstitute a matrix from it’s functional form by applying it to a basis of vectors.

One weakness of the functional form is that the type does not constrain the function to actually act linearly on the vectors.

One big advantage of the functional form is that you can intermix different matrix types (sparse, low-rank, dense) with no friction, just so long as they all have some way of being applied to the same kind of vector. You can also use functions like (id :: a -> a) as the identity matrix, which are not built from any underlying matrix type at all.

To match the lens, we need to represent the Jacobian transpose as the function (dy -> dx) mapping differentials in the output space to differentials in the input space.

The Lens Trick

A lens is the combination of a getter (a function that grabs a piece out of a larger object) and a setter (a function that takes the object and a new piece and returns the object with that piece replaced).

The common form of lens used in Haskell doesn’t look like the above. It looks like this.

type Lens s t a b = forall f. Functor f => (a -> f b) -> (s -> f t)

This form has exactly the same content as the previous form (A non obvious fact. See the Profunctor Optics paper above. Magic neato polymorphism stuff), with the added functionality of being able to compose using the regular Haskell (.) operator.

I think a good case can be made to NOT use the lens trick (do as I say, not as I do). It obfuscates sharing and obfuscates your code to the compiler (I assume the compiler optimizations have less understanding of polymorphic functor types than it does of tuples and functions), meaning the compiler has less opportunity to help you out. But it is also pretty cool. So… I dunno. Edit:

/u/mstksg points out that compilers actually LOVE the van Laarhoven representation (the lens trick) because when f is finally specialized it is a newtype wrappers which have no runtime cost. Then the compiler can just chew the thing apart.

One thing that is extra scary about the fancy form is that it makes it less clear how much data is likely to be shared between the forward and backward pass. Another alternative to the lens that shows this is the following.

type AD x dx y dy = (x -> y, x -> dy -> dx)

This form is again the same in end result. However it cannot share computation and therefore isn’t the same performance wise. One nontrivial function that took me some head scratching is how to convert from the fancy lens directly to the regular lens without destroying sharing. I think this does it

unfancy :: Lens' a b -> (a -> (b, b -> a))
unfancy l = getCompose . l (\b -> Compose (b, id))


Some code

I have some small exploration of the concept in this git

Again, really check out Conal Elliott’s AD paper and enjoy the many, many apostrophes to follow.

Some basic definitions and transformations between fancy and non fancy lenses. Extracting the gradient is similar to the set function. Gradient assumes a many to one function and it applies it to 1.

import Data.Functor.Identity
import Data.Functor.Const
import Data.Functor.Compose

type Lens' a b = forall f. Functor f => (b -> f b) -> a -> f a

lens'' :: (a -> (b, b -> a)) -> Lens' a b
lens'' h g x = fmap j fb where
    (b, j) = h x
    fb = g b

over :: Lens' a b -> ((b -> b) -> a -> a)
over l f = runIdentity . l (Identity . f)

set :: Lens' a b -> a -> b -> a
set l = flip (\x -> (over l (const x)))

view :: Lens' a b -> a -> b
view l = getConst . l Const

unlens'' :: Lens' a b -> (a -> (b, b -> a))
unlens'' l = getCompose . l (\b -> Compose (b, id))

constlens :: Lens' (a,b) c -> b -> Lens' a c
constlens l b = lens'' $ \a -> let (c, df) = f (a,b) in
                             (c, fst . df) where 
                                           f = unlens'' l

grad :: Num b => Lens' a b -> a -> a
grad l = (flip (set l)) 1

Basic 1D functions and arrow/categorical combinators

-- add and dup are dual!

add' :: Num a => Lens' (a,a) a 
add' = lens'' $ \(x,y) -> (x + y, \ds -> (ds, ds))

dup' :: Num a => Lens' a (a,a)
dup' = lens'' $ \x -> ((x,x), \(dx,dy) -> dx + dy)

sub' :: Num a => Lens' (a,a) a 
sub' = lens'' $ \(x,y) -> (x - y, \ds -> (ds, -ds))

mul' :: Num a => Lens' (a,a) a 
mul' = lens'' $ \(x,y) -> (x * y, \dz -> (dz * y, x * dz))

recip' :: Fractional a => Lens' a a 
recip' = lens'' $ \x-> (recip x, \ds -> - ds / (x * x))

div' :: Fractional a => Lens' (a,a) a 
div' = lens'' $ (\(x,y) -> (x / y, \d -> (d/y,-x*d/(y * y))))

sin' :: Floating a => Lens' a a
sin' = lens'' $ \x -> (sin x, \dx -> dx * (cos x))

cos' :: Floating a => Lens' a a
cos' = lens'' $ \x -> (cos x, \dx -> -dx * (sin x))

pow' :: Num a => Integer -> Lens' a a
pow' n = lens'' $ \x -> (x ^ n, \dx -> (fromInteger n) * dx * x ^ (n-1)) 

--cmul :: Num a => a -> Lens' a a
--cmul c = lens (* c) (\x -> \dx -> c * dx)

exp' :: Floating a => Lens' a a
exp' = lens'' $ \x -> let ex = exp x in
                      (ex, \dx -> dx * ex)

fst' :: Num b => Lens' (a,b) a
fst' = lens'' (\(a,b) -> (a, \ds -> (ds, 0)))

snd' :: Num a => Lens' (a,b) b
snd' = lens'' (\(a,b) -> (b, \ds -> (0, ds)))

swap' :: Lens' (a,b) (b,a)
swap' = lens'' (\(a,b) -> ((b,a), \(db,da) -> (da, db)))

assoc' :: Lens' ((a,b),c) (a,(b,c))
assoc' = lens'' $ \((a,b),c) -> ((a,(b,c)), \(da,(db,dc)) -> ((da,db),dc))

par' :: Lens' a b -> Lens' c d -> Lens' (a,c) (b,d)
par' l1 l2 = lens'' f3 where
    f1 = unlens'' l1
    f2 = unlens'' l2
    f3 (a,c) = ((b,d), df1 *** df2) where
        (b,df1) = f1 a
        (d,df2) = f2 c

fan' :: Num a => Lens' a b -> Lens' a c -> Lens' a (b,c)
fan' l1 l2 = lens'' f3 where
    f1 = unlens'' l1
    f2 = unlens'' l2
    f3 a = ((b,c), \(db,dc) -> df1 db + df2 dc) where
        (b,df1) = f1 a
        (c,df2) = f2 a

first' :: Lens' a b -> Lens' (a, c) (b, c)
first' l = par' l id

second' :: Lens' a b -> Lens' (c, a) (c, b)
second' l = par' id l

relu' :: (Ord a, Num a) => Lens' a a
relu' = lens'' $ \x -> (frelu x, brelu x) where
        frelu x | x > 0 = x
                | otherwise = 0
        brelu x dy | x > 0 = dy
                   | otherwise = 0

Some List based stuff.

import Data.List (sort)
import Control.Applicative (ZipList (..))

-- replicate and sum are dual!

sum' :: Num a => Lens' [a] a
sum' = lens'' $ \xs -> (sum xs, \dy -> replicate (length xs) dy)

replicate' :: Num a => Int -> Lens' a [a]
replicate' n = lens'' $ \x -> (replicate n x, sum)

repeat' :: Num a => Lens' a [a]
repeat' = lens'' $ \x -> (repeat x, sum)

map' :: Lens' a b -> Lens' [a] [b]
map' l = lens'' $ \xs -> let (bs, fs) = unzip . map (unlens'' l) $ xs in 
                       (bs, getZipList . ((ZipList fs) <*>) . ZipList)

zip' :: Lens' ([a], [b]) [(a,b)]
zip' = lens'' $ \(as,bs) -> (zip as bs, unzip)

unzip' :: Lens' [(a,b)] ([a], [b])
unzip' = lens'' $ \xs -> (unzip xs, uncurry zip)

maximum' :: (Num a, Ord a) => Lens' [a] a
maximum' = lens'' $ \(x:xs) -> let (best, bestind, lenxs) = argmaxixum x 0 1 xs in
                               (best, \dy -> onehot bestind lenxs dy) where
    argmaxixum best bestind len [] = (best, bestind, len) 
    argmaxixum best bestind curind (x:xs) = if x > best then argmaxixum x curind (curind + 1) xs else argmaxixum best bestind (curind + 1) xs  
    onehot n m x | m == 0 = []
                 | n == m = x : (onehot n (m-1) x) 
                 | otherwise = 0 : (onehot n (m-1) x)

sort' :: Ord a => Lens' [a] [a]
sort' = lens'' $ \xs -> let (sxs, indices) = unzip . sort $ zip xs [0 ..] in
                        (sxs, desort indices) where
                          desort indices = snd . unzip . sort . zip indices

And some functionality from HMatrix

import Numeric.LinearAlgebra
import Numeric.LinearAlgebra.Devel (zipVectorWith)
import Numeric.ADLens.Lens
-- import Data.Vector as V

dot' :: (Container Vector t, Numeric t) => Lens' (Vector t, Vector t) t
dot' = lens'' $ \(v1,v2) -> (v1 <.> v2, \ds -> (scale ds v2, scale ds v1))

mdot' :: (Product t, Numeric t) => Lens' (Matrix t, Vector t) (Vector t)
mdot' = lens'' $ \(a,v) -> (a #> v, \dv -> (outer dv v, dv <# a))

add' :: Additive c => Lens' (c, c) c
add' = lens'' $ \(v1,v2) -> (add v1 v2, \dv -> (dv, dv))

-- I need konst I think?
sumElements' :: (Container Vector t, Numeric t) => Lens' (Vector t) t
sumElements' = lens'' $ \v -> (sumElements v, \ds -> scalar ds)

reshape' :: Container Vector t => Int -> Lens' (Vector t) (Matrix t)
reshape' n = lens'' $ \v -> (reshape n v,  \dm -> flatten dm)

-- conjugate transpose not trace
tr'' ::  (Transposable m mt, Transposable mt m) => Lens' m mt
tr'' = lens'' $ \x -> (tr x, \dt -> tr dt)

flatten' :: (Num t, Container Vector t) => Lens' (Matrix t) (Vector t)
flatten' = lens'' $ \m -> let s = fst $ size m in  
                          (flatten m,  \dm -> reshape s dm)

norm_2' :: (Container c R, Normed (c R), Linear R c) => Lens' (c R) R
norm_2' = lens'' $ \v -> let nv = norm_2 v in (nv, \dnv -> scale (2 * dnv / nv) v )

cmap' :: (Element b, Container Vector e) => (Lens' e b) -> Lens' (Vector e) (Vector b)
cmap' l = lens'' $ \c -> (cmap f c, \dc -> zipVectorWith f' c dc) where
        f = view l
        f' = set l
maxElement' :: Container c e => Lens' (c e) e
maxElement' = lens'' $ \v -> let i = maxIndex v in (v ! i, dv -> scalar 0)

det' :: Field t => Lens' (Matrix t) t
det' = lens'' $ \m -> let (minv, (lndet, phase)) = invlndet m in
                    let detm = phase * exp detm in
                    (detm, \ds -> (scale (ds * detm) minv))

diag' :: (Num a, Element a) => Lens' (Vector a) (Matrix a)
diag' = lens'' $ \v -> (diag v, takeDiag)

takeDiag' :: (Num a, Element a) => Lens' (Matrix a) (Vector a) 
takeDiag' = lens'' $ \m -> (takeDiag m, diag)

In practice, I don’t think this is a very ergonomic approach without something like Conal Elliott’s Compiling to Categories plugin. You have to program in a point-free arrow style (inspired very directly by Conal’s above AD paper) which is pretty nasty IMO. The neural network code here is inscrutable. It is only a three layer neural network.

import Numeric.ADLens.Lens
import Numeric.ADLens.Basic
import Numeric.ADLens.List
import Numeric.ADLens.HMatrix

import Numeric.LinearAlgebra

type L1 = Matrix Double
type L2 = Matrix Double
type L3 = Matrix Double

type Input = Vector Double
type Output = Vector Double
type Weights = (L1,(L2,(L3,())))

class TupleSum a where
	tupsum :: a -> a -> a
instance TupleSum () where
	tupsum _ _ = ()
instance (Num a, TupleSum b) => TupleSum (a,b) where
	tupsum (a,x) (b,y) = (a + b, tupsum x y)

-- A dense relu neural network example
swaplayer :: Lens' ((Matrix t, b), Vector t) (b, (Matrix t, Vector t))
swaplayer = first' swap' . assoc' 

mmultlayer :: Numeric t => Lens' (b, (Matrix t, Vector t)) (b, Vector t)
mmultlayer = second' mdot'

relulayer :: Lens' (b, Vector Double) (b, Vector Double)
relulayer = second' $ cmap' relu'

uselayer :: Lens' ((Matrix Double, b), Vector Double) (b, Vector Double)
uselayer = swaplayer . mmultlayer . relulayer

runNetwork :: Lens' (Weights, Input) ((), Output)
runNetwork =  uselayer . uselayer . uselayer

main :: IO ()
main = do
   putStrLn "Starting Tests"
   print $ grad (pow' 2) 1
   print $ grad (pow' 4) 1
   print $ grad (map' (pow' 2) . sum') $ [1 .. 5]
   print $ grad (map' (pow' 4) . sum') $ [1 .. 5]
   print $ map (\x -> 4 * x ^ 3 )  [1 .. 5]
   l1 <- randn 3 4
   l2 <- randn 2 3
   l3 <- randn 1 2
   let weights = (l1,(l2,(l3,())))
   print $ view runNetwork (weights, vector [1,2,3,4])
   putStrLn "The neural network gradients"
   print $ set runNetwork (weights, vector [1,2,3,4]) ((), vector [1])



Model Predictive Control of CartPole in OpenAI Gym using OSQP

A continuation of this post

I was having difficulty getting the model predictive control from a previous post working on an actual cartpole. There are a lot more unknown variables in that case and other issues (the thing has a tendency to destroy itself). I was kind of hoping it would just work. So I realized that I should get it working in simulation.

I did not copy the simulation code of the openai cartpole  which gives some small amount of credence that the MPC might generalize to a real system.

For the sake of honesty, I’m still at the point where my controller freaks out about 1/3 of the time. I do not really understand why.


Looks damn good here though huh.

A problem I had for a while was the Length of my pole was off by a factor of 2. It still kind of worked, especially if nearly balanced (although with a lot of oscillation, which in hindsight was a clue something wasn’t tuned in right).

There are some useful techniques for manipulating gym. You can set some parameters from the outside, like starting positions and thresholds. Also you can mangle your way into continuous force control rather than just left right commands (wouldn’t it be cool to use Integer programming for that? Silly, but cool).

There is still a bunch of trash in here from me playing around with parameters. I like to keep it real (and lazy).

import gym
from mpc import MPC
import numpy as np
env = gym.make('CartPole-v0')
env.env.theta_threshold_radians = np.pi * 2
env.env.x_threshold = 5 

start_theta = 0#-np.pi + 0.4# + 0.1#2 * np.pi #np.pi+0.4

mpc = MPC(0.5,0,start_theta,0) 
action = 0
for i_episode in range(1):
    observation = env.reset()
    env.env.state[2] = start_theta - np.pi
    for t in range(500):
        #action = env.action_space.sample()
        observation, reward, done, info = env.step(action)
        a = mpc.update(observation[0] + 0.5, observation[1], observation[2]+np.pi, observation[3])
        env.env.force_mag = abs(a) #min(100, abs(a))
        if a < 0:
        	action = 0
        	action = 1
        if done:
            #print("Episode finished after {} timesteps".format(t+1))

One problem was that originally I had the pole just want to go to pi. But if it swings the other direction or many swings, that is bad and it will freak out. So I changed it to try to go the the current nearest multiple of pi, which helps.

Fiddling with the size of the regulation does have a significant effect and the relative size of regulation for x, v, f, omega. I am doing a lot of that search dumbly. I should probably do some kind of automatic.

Loosening the constraints on v and x seems to help stability.

I think weighting the angle at the end of the episode slightly more helps. That’s why I used linspace for the weight on the angle.

I’ve had a lot of problem with the answer coming back as infeasible from OSQP. I feel like it probably shouldn’t be and that is the solver’s problem?

Two things help: sometimes the cart does go out of the allowable range. The optimization probably will try to go all the way to the boundaries since it is useful. And since there is some mismatch between the actual dynamics and my model, it will go outside. So I heavily reduce the constraints for the first couple time steps. It takes a couple. 4 seems to work ok. It should want to apply forces during those four steps to get it back in range anyhow.

Even then it still goes infeasible sometimes and I don’t know why. So in that case, I reduce the required accuracy to hopefully at least get something that makes sense. That is what the eps_rel stuff is about. Maybe it helps. Not super clear. I could try a more iterative increasing of the eps?!topic/osqp/BzEqWQR2dYY


import sparsegrad.forward as forward
import numpy as np
import osqp
import scipy.sparse as sparse

class MPC():
	def __init__(self, x0, v0, theta0, thetadot0):
		self.N = 50
		self.NVars  = 5
		T = 8.0
		self.dt = T/self.N
		dt = self.dt
		self.dtinv = 1./dt
		#Px = sparse.eye(N)
		#sparse.csc_matrix((N, N)) 
		# The three deifferent weigthing matrices for x, v, and external force
		reg = sparse.eye(self.N)*0.05
		z = sparse.bsr_matrix((self.N, self.N))
		# sparse.diags(np.arange(N)/N)
		pp = sparse.diags(np.linspace(1,7,self.N)) # sparse.eye(self.N)
		P = sparse.block_diag([reg, 10*reg ,pp, reg, 10*reg]) #1*reg,1*reg])
		self.P = P
		THETA = 2
		q = np.zeros((self.NVars, self.N))
		q[THETA,:] = np.pi
		q[0,:] = 0.5
		#q[N,0] = -2 * 0.5 * 10
		q = q.flatten()
		q= -P@q
		#u = np.arr

		self.x = np.random.randn(self.N,self.NVars).flatten()
		#x = np.zeros((N,NVars)).flatten()
		#v = np.zeros(N)
		#f = np.zeros(N)


		A, l, u = self.getAlu(self.x, x0, v0, theta0, thetadot0)
		self.m = osqp.OSQP()
		self.m.setup(P=P, q=q, A=A, l=l, u=u , time_limit=0.1, verbose=False)# , eps_rel=1e-2 #  **settings # warm_start=False, eps_prim_inf=1e-1
		self.results = self.m.solve()
		for i in range(100):
			self.update(x0, v0, theta0, thetadot0)

	def calcQ(self):
		THETA = 2
		q = np.zeros((self.NVars, self.N))
		thetas = np.reshape(self.x, (self.NVars, self.N))[THETA,:]
		thetas = thetas - thetas % (2*np.pi) + np.pi
		q[THETA,:] = thetas[0] #np.pi #thetas
		q[0,:] = 0.5
		#q[N,0] = -2 * 0.5 * 10
		q = q.flatten()
		q = -self.P@q
		return q

	def update(self, x0, v0,theta0, thetadot0):
		A, l, u = self.getAlu(self.x, x0, v0, theta0, thetadot0)
		q = self.calcQ()
		self.m.update(q=q,, l=l, u=u)
		self.results = self.m.solve()
		if self.results.x[0] is not None:
			self.x = np.copy(self.results.x)
			#self.x += np.random.randn(self.N*self.NVars)*0.1 # help get out of a rut?
			return 0
		return self.x[4*self.N+1]

	def constraint(self, var, x0, v0, th0, thd0):
		#x[0] -= 1
		g = 9.8
		L = 1.0
		gL = g * L
		m = 1.0 # doesn't matter
		I = L**2 / 3
		Iinv = 1.0/I
		dtinv = self.dtinv
		N = self.N

		x = var[:N]
		v = var[N:2*N]
		theta = var[2*N:3*N]
		thetadot = var[3*N:4*N]
		a = var[4*N:5*N]
		dynvars = (x,v,theta,thetadot)
		xavg, vavg, thetavg, thdotavg = map(lambda z: (z[0:-1]+z[1:])/2, dynvars)
		dx, dv, dthet, dthdot = map(lambda z: (z[1:]-z[0:-1])*dtinv, dynvars)
		vres = dv - a[1:]
		xres = dx - vavg
		torque = (-gL*np.sin(thetavg) + a[1:]*L*np.cos(thetavg))/2
		thetdotres = dthdot - torque*Iinv
		thetres = dthet - thdotavg

		return x[0:1]-x0, v[0:1]-v0, theta[0:1]-th0, thetadot[0:1]-thd0, xres,vres, thetdotres, thetres
		#return x[0:5] - 0.5

	def getAlu(self, x, x0, v0, th0, thd0):
		N = self.N
		gt = np.zeros((2,N))
		gt[0,:] = -0.1 # 0.15 # x is greaer than 0.15
		gt[1,:] = -3 #-1 #veclotu is gt -1m/s
		#gt[4,:] = -10
		control_n = max(3, int(0.1 / self.dt)) # I dunno. 4 seems to help

		gt[:,:control_n] = -100
		#gt[1,:2] = -100
		#gt[1,:2] = -15
		#gt[0,:3] = -10
		gt = gt.flatten()
		lt = np.zeros((2,N))
		lt[0,:] = 1 #0.75 # x less than 0.75
		lt[1,:] = 3 #1 # velocity less than 1m/s
		#lt[4,:] = 10
		lt[:,:	control_n] = 100
		#lt[1,:2] = 100
		#lt[0,:3] = 10
		#lt[1,:2] = 15
		lt = lt.flatten()

		z = sparse.bsr_matrix((N, N))
		ineqA = sparse.bmat([[sparse.eye(N),z,z,z,z],[z,sparse.eye(N),z,z,z]]) #.tocsc()
		cons = self.constraint(forward.seed_sparse_gradient(x), x0, v0, th0, thd0)
		A = sparse.vstack(map(lambda z: z.dvalue, cons)) #  y.dvalue.tocsc()
		totval = np.concatenate(tuple(map(lambda z: z.value, cons)))
		temp = A@x - totval

		A = sparse.vstack((A,ineqA)).tocsc()

		#print(tuple(map(lambda z: z.value, cons)))
		u = np.concatenate((temp, lt))
		l = np.concatenate((temp, gt))
		return A, l, u



Division of Polynomials in Haskell

I’ve been been on a kick learning about some cool math topics. In particular, I busted out my copy of Concrete Mathematics (awesome book) and was reading up on the number theory section. Bezout’s identity is that if you have ma+nb=d for some m,n and d divides a and b, then d is the greatest divisor of a and b. Bezout’s identity is a certificate theorem like the dual solution to a linear program. It doesn’t matter how you found m and n or d, Euclid’s algorithm or brute search, but once you’ve found that certificate, you know you have the biggest divisor. Pretty cool. Suppose you have some other common divisor d’. That means xd’=a and yd’=b. Subsitute this into the certificate. m(xd’)+n(yd’)=(mx+ny)d’ =d. This means d’ is a divisor of d. But the divisor relation implies the inequality relation (a divisor is always less than or equal to the thing it divides). Therefore d'<=d.

But another thing I’ve been checking out is algebraic geometry and Groebner bases. Groebner bases are a method for solving multivariate polynomial equations. The method is a relative of euclidean division and also in a sense Gaussian elimination. Groebner basis are a special recombination of a set of polynomial equations such that polynomial division works uniquely on them (in many variables polynomial division doesn’t work like you’d expect from the one variable case anymore). Somewhat surprisingly other good properties come along for the ride. More on this in posts to come

Some interesting stuff in the Haskell arena:

Gröbner bases in Haskell: Part I

I love Doug McIlroy’s powser. It is my favorite functional pearl.

One interesting aspect not much explored is that you can compose multiple layers of [] like [[[Double]]] to get multivariate power series. You can then also tuple up m of these series to make power series from R^n -> R^m. An interesting example of a category and a nonlinear relative of the matrix category going from V^n -> V^m. I’m having a hell of a time getting composition to work automatically though. I’m a little stumped.

I made a version of division that uses the Integral typeclass rather than the fractional typeclass with an eye towards these applications. It was kind of annoying but I think it is right now.

I thought that in order to make things more Groebner basis like, I should make polynomials that start with the highest power first. It also makes sense for a division algorithm, but now I’m not so sure. I should also have probably used Vector and not List.





bezout x 0 = (1, 0, x)
bezout x y = (n, m - n*z , d) where 
                                r = x `mod` y
                                z = x `div` y
                                (m, n, d) = bezout y r 
                                -- my + nr = d
    -- my + n(x - zy) = d
    -- (m-nz)y + nx = d

-- newtype Poly a = P [a]
instance Num a => Num [a] where
    (+) xs [] = xs
    (+) [] ys = ys
    (+) xss@(x : xs) yss@(y : ys) | nx > ny = x : (xs + yss)
                                  | nx < ny = y : (xss + ys) 
                                  | otherwise = x + y : (xs + ys) where 
                nx = length xs
                ny = length ys
    (*) _ [] = []
    (*) [] _ = []
    (*) (x:xs) ys = ((* x) <$> ys ++ (replicate nx 0)) + (xs * ys) where nx = length xs
        -- (pure x * ys) * xpow nx + (xs * ys) where nx = length xs
    negate = fmap negate
    abs = error "no don't do this"
    signum = error "or this"
    fromInteger 0 = []
    fromInteger x = (pure . fromInteger) x 

xpow n = 1 : replicate n 0 
class DivMod a where
    divmod :: a -> a -> (a, a)

instance Integral a => DivMod a where
    divmod = quotRem


instance Integral a => Integral [a] where
    quotRem [] _ = ([], [])
    quotRem xss@(x : xs) yss@(y : ys) | nx < ny = ([], xss)
                                      | otherwise = (d' + d'', r' + r'') where

                                       -- | nx > ny = (d' + d'',r' + r'')    
                                    --  | otherwise = ( pure d, r : xs - (pure d) * ys) where 
                                        -- (pure d , xss - (pure d) * yss)                                       
                                      -- ( pure d, r : xs - (pure d) * ys) 
                                      -- | x == 0 = quotRem xs yss
                nx = length xs
                ny = length ys
                (d , r) = x `quotRem` y 
                d' = d : (replicate (nx - ny) 0)
                r' = r : (replicate (nx - ny) 0)
                -- (d'' ,r'') =  quotRem (xss - d' * yss) yss
                (d'' , r'') = quotRem (xs - (pure d) * ys) yss 
    toInteger = error "don't do this" 

instance (Num a, Ord a) => Real [a] where
  toRational = error "screw you haskell" 

instance (Enum a) => Enum [a] where
  toEnum = error " crew you" -- pure . toEnum
  fromEnum = error "screw yo "  -- fromEnum . head
-- actually this is a different Ord instance from the main library

instance Ord a => Ord [a] where
  compare xs yy = case compare nx ny of
                       EQ -> compare (head xs) (head ys)
                       z -> z where
                            nx = length xs
                            ny = length ys

-- if we use powser with Vector, 
    -- we can just grab the tail for the remainer.
    --it makes sense to grab the end
-- we can just reuse the powser infracsture
-- with a tail. 
-- tail view in agda for that purpose also.

example :: [[Double]]
example = [1, 1]

var :: Num a => [a]
var = [1,0]
x :: Num a => [a]
x = var

y :: Num a => [[a]]
y = pure x

z :: Num a => [[[a]]]
z = pure y

x' (x, _, _) = x
y' (_, y, _) = y

f :: [[Double]]
f = 1 + 2 * x + 3 * x * y + 3 * y

--(a , (a ,( a ,())))
-- '[a , b, c, d]

f1 :: (Num a) => (a, a, a) -> (a, a)
f1 (x, y, z) = (x, y + z)

f2 :: (Num a) => (a, a) -> a
f2 (x, y) = x*x

g = f2 . f1

example2 = g (x, y, z)
-- only a single variable
-- we need something else 
use :: Num a => [a] -> a -> a
use [] z = 0 
use (x : xs) z = x * (z ^ (length xs)) + (use xs z)