## Cvxpy and NetworkX Flow Problems

Networkx outputs scipy sparse incidence matrices

https://networkx.github.io/documentation/networkx-1.9/reference/generated/networkx.linalg.graphmatrix.incidence_matrix.html#networkx.linalg.graphmatrix.incidence_matrix

https://docs.scipy.org/doc/scipy/reference/sparse.html

Networkx also has it’s own flow solvers, but cvxpy gives you some interesting flexibility, like turning the problem mixed integer, quadratic terms, and other goodies. Plus it is very easy to get going as you’ll see.

So here’s a basic example of putting these two together. Very straightforward and cool.


import networkx as nx
import cvxpy as cvx
import matplotlib.pyplot as plt
import numpy as np
from scipy.sparse import lil_matrix

#graph is an networkx graph from somewhere

#print(edgedict)
nEdges = len(graph.edges)
nNodes = len(graph.nodes)

posflow = cvx.Variable(nEdges)
negflow = cvx.Variable(nEdges)

# split flow into positive and negative parts so we can talk about absolute value.
# Perhaps I should let cvxpy do it for me
constraints = [ 0 <= posflow,  0 <= negflow ]

absflow = posflow + negflow
flow = posflow - negflow

L = nx.incidence_matrix(graph, oriented=True )

source = np.zeros(nNodes) #lil_matrix(n_nodes)
# just some random source placement.
source[7] = 1
source[25] = -1

# cvxpy needs sparse matrices wrapped.
Lcvx = cvx.Constant(L)
#sourcecvx = cvx.Constant(source)

# flow conservation
constraints.append(Lcvx*flow == source)

# can put other funky inequality constraints on things.

objective = cvx.Minimize(cvx.sum(absflow))

print("building problem")
prob = cvx.Problem(objective, constraints)
print("starting solve")
prob.solve(solver=cvx.OSQP, verbose = True) #or try cvx.CBC, cvx.CVXOPT, cvx.GLPK, others
np.set_printoptions(threshold=np.inf)
print(absflow.value)



Here was a cool visual from a multi commodity flow problem (nx.draw_networkx_edges)

Nice, huh.

## A Simple Interior Point Linear Programming Solver in Python

This solver is probably not useful for anything. For almost all purposes, let me point you to cvxpy.

If you want an open source solver CBC/CLP and GLPK and OSQP are good.

If you want proprietary, you can get a variable number constrained trial license to Gurobi for free.

Having said that, here we go.

The simplex method gets more press, and certainly has it’s advantages, but the interior point method makes much more sense to me. What follows is the basic implementation described in Stephen Boyd’s course and book http://web.stanford.edu/~boyd/cvxbook/

In the basic interior point method, you can achieve your inequality constraints $\phi(x) \ge 0$ by using a logarithmic potential to punish getting close to them $-\gamma \ln (\phi(x))$ where $\gamma$ is a parameter we’ll talk about in a bit.  From my perspective, the logarithm is a somewhat arbitrary choice. I believe some properties of the logarithmic potential is necessary for some convergence guarantees.

The basic unconstrained newton step takes a locally quadratic approximation to the function you’re trying to optimize and finds the minimum of that. This basically comes down to taking a step that is the inverse hessian applied to the gradient.

$\min_{dx} f(x_0+dx) \approx f(x_0) + \nabla f(x_0)dx + \frac{1}{2} dx^T H dx$

$(H)_{ij} = \partial_{ij}f(x_0)$

$\nabla f(x_0) +H dx = 0 \rightarrow dx =- H^{-1}\nabla f$

We can maintain a linear constraint on the variable x during this newton step. Instead of setting the gradient to zero, we set it so that it is perpendicular to the constraint plane using the Lagrange multiplier procedure.

$\nabla f(x_0) +H dx = -A^T \lambda \rightarrow Hdx + A^T \lambda = - \nabla f$

$A(x_0 + dx) = b$

This is a block linear system

$\begin{bmatrix} H & A^T \\ A & 0 \\ \end{bmatrix} \begin{bmatrix} dx \\ \lambda \end{bmatrix} = \begin{bmatrix} -\nabla f \\ b - Ax_0 \end{bmatrix}$

Despite the logarithm potential, there is no guarantee that the newton step would not take us outside the allowed region. This is why we need a line search on top of the newton step. We scale the newton dx to $\alpha dx$. Because the function we’re optimizing is convex and the region we’re in is convex, there is some step length in that newton direction that will work. So if we keep decreasing the overall step size, we’ll eventually find something acceptable.

As part of the interior point method, once it has converged we decrease the parameter $\gamma$ applied to the logarithm potential. This will allow the inequality constraints to satisfied tighter and tighter with smaller gamma.

The standard form of an LP is

$\min c^T x$

$A x = b$

$x \ge 0$

This doesn’t feel like the form you’d want. One way you can construct this is by adding slack variables and splitting regular variables into a positive and negative piece

$x = x_+ - x_-$

$Ax \ge b \rightarrow Ax +s = b, s \ge 0$

The interior point formulation of this is

$\min c^T x- \gamma \sum_i \ln(x_i)$

$Ax = b$

The Hessian and gradient are quite simple here

$\nabla f = -\frac{\gamma}{x_i}$

$(H)_{ij} = \delta_{ij} \frac{\gamma}{x_i^2}$

The optimum conditions for this are

$\nabla (c^T x - \gamma \ln(x))= c - \gamma \frac{1}{x} = 0$

$Ax=b$

Now in the above, I’m not sure I got all the signs right, but I did implement it in python. The result seems to be correct and does work. I haven’t tested extensively, YMMV. It’s a useful starting point.

#simple lp

import numpy as np
import scipy.sparse as sparse
import scipy.sparse.linalg as linalg
# min cx
# x >= 0
# Ax = b

def newtonDec(df, dx):
return np.dot(df,dx)

# assumes that x + alpha*dx can be made positive
def linesearch(x, dx):
alpha = 1.
while not np.all( x + alpha*dx > 0):
alpha *= 0.1
return alpha

# min cx

def solve_lp2(A, b, c, gamma, xstart=None):
#x = np.ones(A.shape[1])
#lam = np.zeros(b.shape)
xsize = A.shape[1]
if xstart is not None:
x = xstart
else:
#xlam = np.ones(xsize + b.size)
x = np.ones(xsize) # xlam[:xsize]
#lam = xlam[xsize:]
while True :
print("Iterate")
H = sparse.bmat( [[ sparse.diags(gamma / x**2)   ,   A.T ],
[ A  ,                         0 ]]  )

dfdx = c - gamma / x #+  A.T@lam
dfdlam = A@x - b
df = np.concatenate((dfdx, dfdlam))#np.zeros(b.size))) # dfdlam))
#np.concatenate( , A@x - b)
dxlam = linalg.spsolve(H,df)
dx = - dxlam[:xsize]
lam = dxlam[xsize:]

alpha = linesearch(x,dx)
x += alpha * dx
#lam += dlam
if newtonDec(dfdx,dx) >= -1e-10:
print("stop")
break

return x, lam

def solve_lp(A,b,c, xstart=None):
gamma = 1.0
xsize = A.shape[1]
x = np.ones(xsize)
for i in range(8):
x, lam = solve_lp2(A, b, c, gamma, xstart=x)
gamma *= 0.1
return x, lam

N = 12
A = np.ones(N).reshape(1,-1)
b = np.ones(1)*2
c = np.zeros(N)
c[0] = -1

#print(solve_lp(A,b,c, 0.000001))
print(solve_lp(A,b,c))

def BB(A, b, c, best, xhint = None):
picked = np.zeros(xsize)
picked[pickvar] = 1
Anew = sparse.hstack((A, picked))
bnew = np.concatenate((b,choice))
x, lam =
np.dot(c,x)
if lp_solve(Anew, bnew, c) < best:
best, x = BB(Anew, bnew , c, best, xhint)
return best, x

'''
#min  cx + gamma * ln(x)
# s.t. Ax = b

# cx + gamma * ln(x) + lambda (Ax - b)

delx = c + gamma * 1/x + lambda A
dellam = Ax - b
# hess
dlx = A
dxl = A.T
dxx = - gamma (1/x**2)

H @ (x l) = (delx dell)
'''


Musings:

I wanted to build this because I’ve been getting really into mixed integer programming and have been wondering how much getting deep in the guts of the solver might help. Given my domain knowledge of the problems at hand, I have probably quite good heuristics. In addition, I’ve been curious about a paper that has pointed out an interesting relatively unexploited territory, combining machine learning with mixed integer programming https://arxiv.org/pdf/1811.06128

For these purposes, I want a really simple optimization solver.

But this is silly. I should use CLP or OSQP as a black box if I really want to worry about the mixed integer aspect.

MIOSQP is interesting.

It is interesting how the different domains of discrete optimization and search seem to have relatively similar sets of methods. Maybe I’m crazy. Maybe at the loose level I’m gonna talk almost anything is like almost anything else.

Clause learning and Cutting plane addition feel rather similar.

Relaxation to LP and unit propagation are somewhat similar. Or is unit propagation like elimination?

Mixed integer programs build their own heuristics.

Fourier Motzkin and resolution are similar methods. In Fourier motzkin, you eliminate variables in linear inequalities by using algebra to bring that variable by itself on one side of the inequality and then matching up all the <= to all the uses of >= of that variable. There are packages that compute these things. See CDD or Polyhedra.jl

Resolution takes boolean formula. You can eliminate a variable q from a CNF formula by taking all the negated instances $\not q$ and combining them with all positive instances.

## Trajectory Optimization of a Pendulum with Mixed Integer Linear Programming

There is a reasonable piecewise linear approximation for the pendulum replacing the the sinusoidal potential with two quadratic potentials (one around the top and one around the bottom). This translates to a triangle wave torque.

Cvxpy curiously has support for Mixed Integer Programming.

Cbc is probably better than GLPK MI. However, GLPK is way easier to get installed. Just brew install glpk and pip install cvxopt.

Getting cbc working was a bit of a journey. I had to actually BUILD Cylp (god forbid) and fix some problems.

Special Ordered Set constraints are useful for piecewise linear approximations. The SOS2 constraints take a set of variables and make it so that only two consecutive ones can be nonzero at a time. Solvers often have built in support for them, which can be much faster than just blasting them with general constraints. I did it by adding a binary variable for every consecutive pair. Then these binary variables suppress the continuous ones. Setting the sum of the binary variables to 1 makes only one able to be nonzero.

def SOS2(n):
z = cvx.Variable(n-1,boolean=True)
x = cvx.Variable(n)
constraints = []
constraints += [0 <= x]
constraints += [x[0] <= z[0]]
constraints += [x[-1] <= z[-1]]
constraints += [x[1:-1] <= z[:-1]+z[1:]]
constraints += [cvx.sum(z) == 1]
constraints += [cvx.sum(x) == 1]
return x, z, constraints

One downside is that every evaluation of these non linear functions requires a new set of integer and binary variables, which is possibly quite expensive.

For some values of total time steps and step length, the solver can go off the rails and never return.

At the moment, the solve is not fast enough for real time control with CBC (~ 2s). I wonder how much some kind of warm start might or more fiddling with heuristics, or if I had access to the built in SOS2 constraints rather than hacking it in. Also commercial solvers are usually faster. Still it is interesting.

import cvxpy as cvx
import numpy as np
import matplotlib.pyplot as plt

def SOS2(n):
z = cvx.Variable(n-1,boolean=True)

x = cvx.Variable(n)
constraints = []
constraints += [0 <= x]
constraints += [x[0] <= z[0]]
constraints += [x[-1] <= z[-1]]
constraints += [x[1:-1] <= z[:-1]+z[1:]]
constraints += [cvx.sum(z) == 1]
constraints += [cvx.sum(x) == 1]
return x, z, constraints

N = 20
thetas = cvx.Variable(N)
omegas = cvx.Variable(N)
torques = cvx.Variable(N)
tau = 0.3
c = [thetas[0] == +0.5, omegas[1] == 0, torques <= tau, torques >= -tau]
dt = 0.5

D = 5
thetasN = np.linspace(-np.pi,np.pi, D, endpoint=True)
zs = []
mods = []
xs = []
print(thetasN)
reward = 0
for t in range(N-1):
c += [thetas[t+1] == thetas[t] + omegas[t]*dt]

mod = cvx.Variable(1,integer=True)
mods.append(mod)
#xs.append(x)
x, z, c1 = SOS2(D)
c += c1
xs.append(x)
c += [thetas[t+1] == thetasN@x + 2*np.pi*mod]
sin = np.sin(thetasN)@x
reward += -np.cos(thetasN)@x
c += [omegas[t+1] == omegas[t] - sin*dt + torques[t]*dt ]

objective = cvx.Maximize(reward)
constraints = c

prob = cvx.Problem(objective, constraints)
res = prob.solve(solver=cvx.CBC, verbose=True)

print([mod.value for mod in mods ])
print([thetasN@x.value for x in xs ])
print([x.value for x in xs ])

plt.plot(thetas.value)
plt.plot(torques.value)
plt.show()

# Other things to do: BigM.
# Disjunctive constraints.
# Implication constraints.



Blue is angle, orange is the applied torque. The torque is running up against the limits I placed on it.

## Gettin’ that Robot some Tasty Apples: Solving a simple geometrical puzzle in Z3 python

At work there is a monthly puzzler.

“Design a robot that can pick up all 9 apples arranged on a 3 by 3 rectangular grid, and spaced 1m apart. The robot parts they have are faulty. The robot can only turn three times”

I think the intent of the puzzle is that the robot can drive in forward and reverse, but only actually turn 3 times. It’s not very hard to do by hand. I decided to take a crack at this one using Z3 for funzies. Z3 is an SMT solver. It is capable of solving a interesting wide variety of problems.

I interpret this as “find 4 lines that touch all points in the grid, such that each subsequent line intersects.”

It is fairly easy to directly translate this into a Z3 model.

from z3 import *
import matplotlib.pyplot as plt
import numpy as np
s = Solver()

linenum = 4

def linepoint(l,p): #line point product. Is zero if point is on line
return l[0]*p[0] + l[1]*p[1] + l[2]

lines = [[Real(s + str(i)) for s in ('a','b','c')] for i in range(linenum)]

applepoints = [(x,y) for x in range(5) for y in range(3)]
turnpoints = [[Real(s + str(i)) for s in ('x','y')] for i in range(linenum-1)]

for pt in applepoints:  #Every point needs to be touched by one line

for l in lines: #Exclude invalid lines (all zeros)
s.add(Or([a != 0 for a in l]))

for i in range(linenum-1): #Consecutive lines intersect (aren't parallel). There are other ways of doing this

def convert(x):
if is_int_value(x):
return x.as_long()
elif is_rational_value(x):
return x.numerator_as_long()/x.denominator_as_long()
elif is_algebraic_value(x):
return convert(x.approx(7))

print(s.check())

m = s.model()
#print("x = %s" % m[x])

#print "traversing model..."
for d in m.decls():
print("%s = %s" % (d.name(), m[d]))

#nplines = np.array([[m[a].numerator_as_long()/m[a].denominator_as_long() for a in l] for l in lines])
nplines = np.array([[convert(m[a]) for a in l] for l in lines])

#xs = np.random.randn(2,3)
#xs = np.

print(nplines)
for l in nplines:
pts = []
for j in [-1,1]:
# We can use Z3 to draw a nice set of lines too
opt = Optimize()
x, y = Reals('x y')
opt.add([0 <=  x, x <= 2, 0 <= y, y <= 2])
opt.add(l[0] * x + l[1]*y + l[2] == 0)
opt.minimize(j*0.453 * x + j*0.3234 * y) # random objective direction hopefully not perpendicular to a line
print(opt.check())
m = opt.model()
pts.append([convert(m[x]), convert(m[y])])
print(l)
pts = np.array(pts).T
plt.plot(pts[0,:], pts[1,:])
plt.show()

If we ask z3 to use only 3 lines, it returns unsat. Try to prove that by hand.

However, If the robot is on the projective plane, it is possible with 3 lines. It only needs to drive to infinity and turn twice. All lines intersect exactly once on the projective plane. How convenient.

The problem only seems somewhat difficult to computerize because of the seemingly infinite nature of geometry. If we only consider the lines that touch at least two points, all possible robot paths becomes extremely enumerable. Is there a proof that we only need these lines?

Another interesting approach might be to note that the points are described by the set of equations $x*(x-1)*(x-2)=0$ and $y*(y-1)*(y-2)=0$. I think we could then possibly use methods of nonlinear algebra (Groebner bases) to find the lines. Roughly an ideal containment question? Don’t have this one fully thought out yet. I think z3 might be doing something like this behind the scenes.

## More Reinforcement Learning with cvxpy

So I spent thanksgiving doing this and playing Zelda. Even though that sounds like a hell of a day, seems a little sad for thanksgiving :(. I should probably make more of an effort to go home next year.

I tried implementing a more traditional q-learning pipeline using cvxpy (rather than the inequality trick of the last time). Couldn’t get it to work as well. And it’s still kind of slow despite a lot of rearrangement to vectorize operations (through batching basically).

I guess I’m still entranced with the idea of avoiding neural networks. In a sense, that is the old boring way of doing things. The Deep RL is the new stuff. Using ordinary function approximators is way older I think. But I feel like it takes a problem out of the equation (dealing with training neural nets). Also I like modeling languages/libraries.

I kept finding show stopping bugs throughout the day (incorrectly written maxaction functions, typos, cross episode data points, etc.), so I wouldn’t be surprised if there is one still in here. It’s very surprising how one can convince oneself that it is kind of working when it is actually impossible it’s working. All these environments are so simple, that I suspect I could randomly sample controllers out of a sack for the time I’ve been fiddling with this stuff and find a good one.

import gym
import numpy as np
import cvxpy as cvx
from numpy.polynomial.chebyshev import chebval
env = gym.make('Pendulum-v0')
print(env.action_space)

print(env.observation_space)
print(env.observation_space.high)

print(env.observation_space.low) # [1,1,8]
print(env.action_space.high) # -2
print(env.action_space.low)

chebdeg = 2
alpha = cvx.Variable(3*chebdeg**3)
alpha.value = np.zeros(3*chebdeg**3)
gamma = 1. - 1./50
def basis(s,a):
n = np.arange(4)
f1 = chebval(s[0]/1.,np.eye(chebdeg)).reshape(-1,1,1,1)
f2 = chebval(s[1]/1.,np.eye(chebdeg)).reshape(1,-1,1,1)
f3 = chebval(s[2]/8.,np.eye(chebdeg)).reshape(1,1,-1,1)
f4 = ((a/2)**np.arange(3)).reshape(1,1,1,-1)
return f1*f2*f3*f4
def vbasis(s,a):
#n = np.arange(4)
N = s.shape[0]

#print(chebval(s[:,0]/1.,np.eye(chebdeg)).shape)
f1 = chebval(s[:,0]/1.,np.eye(chebdeg)).T.reshape(N,-1,1,1,1)
f2 = chebval(s[:,1]/1.,np.eye(chebdeg)).T.reshape(N,1,-1,1,1)
f3 = chebval(s[:,2]/8.,np.eye(chebdeg)).T.reshape(N,1,1,-1,1)
if type(a) == np.ndarray:
f4 = (a.reshape(-1,1,1,1,1)/2)**(np.arange(3).reshape(1,1,1,1,-1))
else:
f4 = (a/2)**(np.arange(3).reshape(1,1,1,1,-1))
return (f1*f2*f3*f4).reshape(N,-1)

def qmaxaction(alpha,s):
#n = np.arange(4)
N = s.shape[0]

#print(chebval(s[:,0]/1.,np.eye(chebdeg)).shape)
f1 = chebval(s[:,0]/1.,np.eye(chebdeg)).T.reshape(N,-1,1,1,1)
f2 = chebval(s[:,1]/1.,np.eye(chebdeg)).T.reshape(N,1,-1,1,1)
f3 = chebval(s[:,2]/8.,np.eye(chebdeg)).T.reshape(N,1,1,-1,1)
z = f1*f2*f3
a = (np.array([0,0,0.25]).reshape(1,1,1,1,-1)*z).reshape(N,-1)@alpha.value
b = (np.array([0,0.5,0]).reshape(1,1,1,1,-1)*z).reshape(N,-1)@alpha.value

acts = np.clip(-b/2/(a+1e-7), -2,2)
q2 = vbasis(s,2)@alpha.value
print(acts)
qa = vbasis(s,acts)@alpha.value
qn2 = vbasis(s,-2)@alpha.value
return np.maximum(qa,np.maximum(qn2,q2))

def evalb(alpha, s,a):
f = basis(s,a)
return alpha*f.flatten()

def maxaction(alpha, obs):
f1 = np.sum(evalb(alpha.value, observation, 2))
f2 = np.sum(evalb(alpha.value, observation, 0))
f3 = np.sum(evalb(alpha.value, observation, -2))
coeff = np.polyfit([2,0,-2], [f1,f2,f3], deg=2)
if coeff[0] < 0:
action = -coeff[1]/2/coeff[0]
action = min(max(action,-2),2)
elif f1 > f3:
#print("huh")
action = 2
else:
#print("how bout dat")
action = -2
return np.array([action])

constraints = []
objective = 0

for x in range(4):

constraints = []
epobs = []
eprewards = []
epactions = []
objective = 0
epsilon = 1.2/(x+1)
print("epsilon: ", epsilon)
epNum = 50
for i_episode in range(epNum):
observations = []
rewards = []
actions = []
observation = env.reset()
reward = -100
for t in range(100):

prev_obs = observation
prev_reward = reward
if np.random.rand() < epsilon:
action = env.action_space.sample()
else:
action = maxaction(alpha, observation)
observation, reward, done, info = env.step(action)
observations.append(observation)
rewards.append(reward)
actions.append(action)

#constraints.append(evalb(alpha, prev_obs, action) >= prev_reward + gamma*evalb(alpha, observation, action + np.random.randn()*0.2))#maxaction(alpha, observation)))
#objective += evalb(alpha, observation, env.action_space.sample())
if done:
print("Episode finished after {} timesteps".format(t+1))
break
epobs.append(observations)
epactions.append(actions)
eprewards.append(rewards)
for qiter in range(20):
print("Q iteration: ", qiter)
objective = 0
for (observations,actions, rewards) in zip(epobs,epactions,eprewards):
obs = np.array(observations)
act = np.array(actions)
bprev = vbasis(obs[:-1],act[1:]) #   np.array([basis(s,a).flatten() for (s,a) in zip(observations[:-1],actions[1:])])
#bnext = np.array([basis(s,a+np.random.randn()*0.2).flatten() for (s,a) in zip(observations[1:],actions[1:])])
#obs = np.array(observations)[:-1]
#q2 = np.array([basis(s,2).flatten() for s in observations[:-1]])@alpha.value
#q2 = vbasis(obs[1:],2)@alpha.value
#q0 = vbasis(obs[1:],0)@alpha.value
#qn2 = vbasis(obs[1:],-2)@alpha.value
#q1 = vbasis(obs[1:],1)@alpha.value
#qn1 = vbasis(obs[1:],-1)@alpha.value
#qs = []
#for a in np.linspace(-2,2,5):
#    qs.append(vbasis(obs[1:],a+np.random.randn()*0.5)@alpha.value)
#for a in np.linspace(-0.1,0.1,3):
#    qs.append(vbasis(obs[1:],a)@alpha.value)

#qmax = np.max(np.stack([q2,q0,qn2,q1,qn1],axis=1),axis=1)
#qmax = np.max(np.stack(qs,axis=1),axis=1)
qmax = qmaxaction(alpha, obs[1:])

#q0 = np.array([basis(s,0).flatten() for s in observations[:-1]])@alpha.value
#qn2 = np.array([basis(s,-2).flatten() for s in observations[:-1]])@alpha.value

#qmax = np.maximum(np.maximum(q0,q2),qn2)#np.array([np.dot(basis(s, maxaction(alpha,s)).flatten(),alpha.value) for s in observations[1:]])
rewards = np.array(rewards)[:-1]
objective += cvx.sum(cvx.huber(bprev@alpha - (rewards + gamma*qmax))) #cvx.sum_squares(bprev@alpha - (rewards + gamma*qmax))

#bcost = np.array([basis(s,env.action_space.sample()).flatten() for s in observations])
#objective = cvx.sum(bcost@alpha) # np.sum(bnext,axis=0) * alpha
loss = 0.001*cvx.sum_squares(alpha-alpha.value) + objective
prob = cvx.Problem(cvx.Minimize(loss), constraints)
# The optimal objective value is returned by prob.solve().
print("solving problem")
result = prob.solve(verbose=True)
print(result)
# The optimal value for x is stored in x.value.
print(alpha.value)
#inspection loop
for i_episode in range(4):
observation = env.reset()
#env.env.state[0] = np.random.randn()*sigma
for t in range(200):
env.render()
#print(observation)
prev_obs = observation
action = maxaction(alpha, observation)
#print(action)

observation, reward, done, info = env.step(action)
if done:
print("Episode finished after {} timesteps".format(t+1))
break


I also did the easy cartpole environment using the inequality trick.  Seems to work pretty well.

import gym
import numpy as np
import cvxpy as cvx
from numpy.polynomial.chebyshev import chebval
env = gym.make('CartPole-v1')
print(env.action_space)

print(env.observation_space)
print(env.observation_space.high)

print(env.observation_space.low) # [1,1,8]
#print(env.action_space.high) # -2
#print(env.action_space.low)

chebdeg = 3
alpha = cvx.Variable(2*chebdeg**4)
alpha.value = np.zeros(2*chebdeg**4)
gamma = 1. - 1./25
def basis(s,a):
n = np.arange(4)
f1 = chebval(s[0]/4.,np.eye(chebdeg)).reshape(-1,1,1,1,1)
f2 = chebval(s[1]/10.,np.eye(chebdeg)).reshape(1,-1,1,1,1)
f3 = chebval(s[2]/0.4,np.eye(chebdeg)).reshape(1,1,-1,1,1)
f4 = chebval(s[3]/10.,np.eye(chebdeg)).reshape(1,1,1,-1,1)
f5 = ((a/2)**np.arange(2)).reshape(1,1,1,1,-1)
return f1*f2*f3*f4*f5

def evalb(alpha, s,a):
f = basis(s,a)
return alpha*f.flatten()

def maxaction(alpha, obs):
f1 = np.sum(evalb(alpha.value, observation, 0))
f2 = np.sum(evalb(alpha.value, observation, 1))
if f1 > f2:
action = 0
else:
action = 1
return action

constraints = []
objective = 0

for x in range(4):
constraints = []
objective = 0

epsilon = 1.2/(x+1)
print("epsilon: ", epsilon)
for i_episode in range(200):
observation = env.reset()
observations = []
rewards = []
actions = []
reward = -100
for t in range(50):

prev_obs = observation
prev_reward = reward
if np.random.rand() < epsilon:
action = env.action_space.sample()
else:
action = maxaction(alpha, observation)
observation, reward, done, info = env.step(action)
observations.append(observation)
rewards.append(reward)
actions.append(action)

#constraints.append(evalb(alpha, prev_obs, action) >= prev_reward + gamma*evalb(alpha, observation, action + np.random.randn()*0.2))#maxaction(alpha, observation)))
#objective += evalb(alpha, observation, env.action_space.sample())
if done:
print("Episode finished after {} timesteps".format(t+1))
break
#print(rewards)
bprev = np.array([basis(s,a).flatten() for (s,a) in zip(observations[:-1],actions[1:])])
bnext = np.array([basis(s,env.action_space.sample()).flatten() for (s,a) in zip(observations[1:],actions[1:])])
rewards = np.array(rewards)[:-1]
constraints += [(bprev@alpha) >= rewards + gamma*(bnext)@alpha]

bcost = np.array([basis(s,env.action_space.sample()).flatten() for s in observations])
objective += cvx.sum(bcost@alpha) # np.sum(bnext,axis=0) * alpha
loss = 1*cvx.sum(cvx.abs(alpha-alpha.value)) + objective
prob = cvx.Problem(cvx.Minimize(loss), constraints)

print("solving problem")
result = prob.solve(verbose=True)
print(result)
print(alpha.value)
#inspection loop
for i_episode in range(4):
observation = env.reset()
#env.env.state[0] = np.random.randn()*sigma
for t in range(200):
env.render()
#print(observation)
prev_obs = observation
action = maxaction(alpha, observation)
#print(action)

observation, reward, done, info = env.step(action)
if done:
print("Episode finished after {} timesteps".format(t+1))
break


I also have some Work in Progress on getting full swingup cartpole. Currently is not really working. Seems to kind of be pumping about right? The continuous force control easy cartpole does work though.

import gym
import numpy as np
import cvxpy as cvx
from numpy.polynomial.chebyshev import chebval
env = gym.make('CartPole-v1')
print(env.action_space)

print(env.observation_space)
print(env.observation_space.high)

print(env.observation_space.low) # [1,1,8]
#print(env.action_space.high) # -2
#print(env.action_space.low)

chebdeg = 2
varNum = 2*2*4*2*2*3
alpha = cvx.Variable(varNum)
alpha.value = np.zeros(varNum)
gamma = 1. - 1./150
def basis(s,a):
n = np.arange(4)
f1 = chebval(s[0]/4.,np.eye(2) ).reshape(-1,1,1,1,1,1)
f2 = chebval(s[1]/10.,np.eye(2)).reshape(1,-1,1,1,1,1)
f3 = chebval(s[2]/1,np.eye(4)  ).reshape(1,1,-1,1,1,1)
f4 = chebval(s[3]/1,np.eye(2)  ).reshape(1,1,1,-1,1,1)
f5 = chebval(s[4]/10.,np.eye(2)).reshape(1,1,1,1,-1,1)
f6 = ((a/10)**np.arange(3)).reshape(1,1,1,1,1,-1)
return f1*f2*f3*f4*f5*f6

def evalb(alpha, s,a):
f = basis(s,a)
return alpha*f.flatten()
'''
def maxaction(alpha, obs):
f1 = np.sum(evalb(alpha.value, observation, 0))
f2 = np.sum(evalb(alpha.value, observation, 1))
if f1 > f2:
action = 0
else:
action = 1
return action
'''
def maxaction(alpha, obs):
f1 = np.sum(evalb(alpha.value, observation, 10))
f2 = np.sum(evalb(alpha.value, observation, 0))
f3 = np.sum(evalb(alpha.value, observation, -10))
coeff = np.polyfit([10,0,-10], [f1,f2,f3], deg=2)
if coeff[0] < 0:
action = -coeff[1]/2/coeff[0]
action = min(max(action,-10),10)
elif f1 > f3:
action = 10
else:
action = -10
return np.array([action])

constraints = []
objective = 0

for x in range(4):
constraints = []
objective = 0

epsilon = 1.2/(x+1)
print("epsilon: ", epsilon)
for i_episode in range(100):
observation = env.reset()
observations = []
rewards = []
actions = []
reward = -100
env.env.state[2] = np.random.rand()*2*np.pi
for t in range(100):

prev_obs = observation
prev_reward = reward
if np.random.rand() < epsilon or len(observation) < 5:
action = np.random.rand()*20-10
else:
action = maxaction(alpha, observation)
a = action
env.env.force_mag = abs(a) #min(100, abs(a))
#print(a)
if a < 0:
daction = 0
else:
daction = 1
observation, reward, done, info = env.step(daction)
o = observation
observation = np.array([o[0],o[1],np.cos(o[2]),np.sin(o[2]),o[3]])

observations.append(observation)
bonus = 0
if observation[2] > 0.9:
bonus = 1

rewards.append( (bonus + observation[2] + 1)/2)#- observation[0]**2) #reward
#print(rewards[-1])
actions.append(action)

#constraints.append(evalb(alpha, prev_obs, action) >= prev_reward + gamma*evalb(alpha, observation, action + np.random.randn()*0.2))#maxaction(alpha, observation)))
#objective += evalb(alpha, observation, env.action_space.sample())
x = observation[0]
if x < -4 or x > 4:
break
#if done:
#    print("Episode finished after {} timesteps".format(t+1))
#    break
#print(rewards)
bprev = np.array([basis(s,a).flatten() for (s,a) in zip(observations[:-1],actions[1:])])
bnext = np.array([basis(s,a+np.random.randn()*0.3).flatten() for (s,a) in zip(observations[1:],actions[1:])])
rewards = np.array(rewards)[:-1]
#print(bprev.shape)
#print(bnext.shape)

constraints += [(bprev@alpha) >= rewards + gamma*(bnext)@alpha]

bcost = np.array([basis(s,a+np.random.randn()*0.3).flatten() for s in observations])
objective += cvx.sum(bcost@alpha) # np.sum(bnext,axis=0) * alpha
loss = 1*cvx.sum(cvx.abs(alpha-alpha.value)) + objective
prob = cvx.Problem(cvx.Minimize(loss), constraints)

print("solving problem")
result = prob.solve(verbose=True)
print(result)
print(alpha.value)
#inspection loop
for i_episode in range(4):
observation = env.reset()
env.env.state[2] = np.random.rand()*2*np.pi
#env.env.state[0] = np.random.randn()*sigma
o = observation
observation = np.array([o[0],o[1],np.cos(o[2]),np.sin(o[2]),o[3]])
for t in range(200):
env.render()
#print(observation)
prev_obs = observation
o = observation
observation = np.array([o[0],o[1],np.cos(o[2]),np.sin(o[2]),o[3]])
action = maxaction(alpha, observation)
a = action
env.env.force_mag = abs(a) #min(100, abs(a))
#print(a)
if a < 0:
daction = 0
else:
daction = 1
print(action)

observation, reward, done, info = env.step(daction)

if x < -4 or x > 4:
break


Now I feel that a thing that matters quite a bit is what is your choice of action for the next time step. Hypothetically you want a ton of samples here. I now think that using an action that is just slightly perturbed from the actual action works well because the actual action is tending to become roughly the optimal one. Subsequent time steps have roughly the same data in them.

One advantage of discrete action space is that you can really search it all.

Does that mean I should seriously investigate the sum of squares form? A semidefinite variable per data point sounds bad. I feel like I’d have to seriously limit the amount of data I’m using. Maybe I’ll be pleasantly surprised.

I haven’t even gotten to playing with different polynomials yet. The current implementation is exponentially sized in the number of variables. But in kind of a silly way. I think it would be better to use all terms of a bounded total degree.

## Q-Learning with Linear Programming (cvxpy, OpenAI Gym Pendulum)

http://web.mit.edu/~pucci/www/discountedLP.pdf

http://underactuated.mit.edu/underactuated.html?chapter=dp

There is a fun idea of using Linear Programming to do dynamic programming I originally saw in the underactuated robotics textbook.

In my experience reinforcement learning is finicky and depressing. It usually doesn’t work and is very hard to troubleshoot. Do you just need to run it for 10 minutes? 10 years? Is there a bug? God knows. I end up wriggling hyperparameters and praying a lot.

One part of this is the relative finickiness of neural network optimization compared to the technology of convex optimization. Convex optimization solvers are quite reliable and fast.

There is a way of phrasing Q learning as a linear programming problem

The linear programming approach relaxes the Bellman equations.

$Q(s_t,a_t)=r_t + \gamma \max_a Q(s_{t+1},a)$

to

$\forall a. Q(s_t,a_t) \ge r_t +\gamma Q(s_{t+1},a)$

We can approach this forall in a couple ways, one of which is just sampling actions somehow. To make the constraint tight in places you minimize a weighting of Q

$\min \sum w_i * Q(s_i,a_i)$

If Q is written as a linear combination of basis functions

$Q(s,a)=\sum \alpha_i f_i(s,a)$

The all of this put together is a linear program in the variables $\alpha_i$.

For ease, I used cvxpy. I don’t even store my state action pairs, which is quite lazy of me. Even here, compiling the linear program via cvxpy is kind of slow. This preprocessing step takes longer than the actual solve does. You could avoid cvxpy and directly interface a linear programming solver much faster, if that is your thing.

The whole process is still model free. I didn’t plug in pendulum dynamics anywhere. I run openAI gym and use the resulting state-action-state tuples to add inequalities to my cvxpy model. I weight where I want the inequalities to be tightest by using the actual states experienced.

Unfortunately, it still took a couple hours of hyper parameter tuning and fiddling to get the thing to work. So not a grand success on that point.

I made a lot of guesswork for what seemed reasonable

I parametrized the dependence of Q on a by a quadratic so that it is easy to maximize analytically. That is what the polyfit stuff is about. Maximum of $ax^2+bx+c$ is at $-b/2a$. I really should be checking the sign of the a coefficient. I am just assuming it is positive. Naughty boy.

m assuming that it

Chebyshev polynomials are probably good.

It seemed to help to use a slight perturbation of the actual action used on the right hand side of the Bellman inequality. My reasoning here is that the pendulum is actually a continuous system, so we should be using the differential Bellman equation really.

Should I allow for some kind of slack in the equations? Getting a bad reward or data point or one weird unusual state could ruin things for everyone. Inequalities are unforgiving.

Gamma seemed to matter a decent amount

The regularization of alpha seemed largely irrelevant.

Epsilon greediness seems to not matter much either.

Future ideas:

Might be good to replace the sampling of a with a Sum of Squares condition over the variable a.

Should I damp the update in some way? Add a cost the changing alpha from it’s previous value. A kind of damped update / using a prior.

import gym
import numpy as np
import cvxpy as cvx
from numpy.polynomial.chebyshev import chebval
env = gym.make('Pendulum-v0')
print(env.action_space)

print(env.observation_space)
print(env.observation_space.high)

print(env.observation_space.low) # [1,1,8]
print(env.action_space.high) # -2
print(env.action_space.low)

chebdeg = 4
alpha = cvx.Variable(3*chebdeg**3)
gamma = 1. - 1./100
def basis(s,a):
n = np.arange(4)
f1 = chebval(s[0]/1.,np.eye(chebdeg)).reshape(-1,1,1,1)
f2 = chebval(s[1]/1.,np.eye(chebdeg)).reshape(1,-1,1,1)
f3 = chebval(s[2]/8.,np.eye(chebdeg)).reshape(1,1,-1,1)
f4 = ((a/2)**np.arange(3)).reshape(1,1,1,-1)
return f1*f2*f3*f4

def evalb(alpha, s,a):
f = basis(s,a)
return alpha*f.flatten()

def maxaction(alpha, obs):
f1 = np.sum(evalb(alpha.value, observation, 2))
f2 = np.sum(evalb(alpha.value, observation, 0))
f3 = np.sum(evalb(alpha.value, observation, -2))
coeff = np.polyfit([2,0,-2], [f1,f2,f3], deg=2)
action = -coeff[1]/2/coeff[0]
action = min(max(action,-2),2)
return np.array([action])

constraints = []
objective = 0

for x in range(4):

constraints = []
objective = 0
epsilon = 1.2/(x+1)
print("epsilon: ", epsilon)
for i_episode in range(50):
observation = env.reset()
reward = -100
for t in range(100):

prev_obs = observation
prev_reward = reward
if np.random.rand() < epsilon:
action = env.action_space.sample()
else:
action = maxaction(alpha, observation)
observation, reward, done, info = env.step(action)
constraints.append(evalb(alpha, prev_obs, action) >= prev_reward + gamma*evalb(alpha, observation, action + np.random.randn()*0.2))#maxaction(alpha, observation)))
objective += evalb(alpha, observation, env.action_space.sample())
if done:
print("Episode finished after {} timesteps".format(t+1))
break
loss = 0.1*cvx.sum(cvx.abs(alpha)) + objective
prob = cvx.Problem(cvx.Minimize(loss), constraints)
# The optimal objective value is returned by prob.solve().
print("solving problem")
result = prob.solve()
print(result)
# The optimal value for x is stored in x.value.
print(alpha.value)
#inspection loop
for i_episode in range(4):
observation = env.reset()
#env.env.state[0] = np.random.randn()*sigma
for t in range(200):
env.render()
#print(observation)
prev_obs = observation
action = maxaction(alpha, observation)
#print(action)

observation, reward, done, info = env.step(action)
if done:
print("Episode finished after {} timesteps".format(t+1))
break


Edit:

A improved version. Fixed the bug in my maxaction function. I shouldn’t have been assuming that it was always concave down.

Also vectorized slightly. Fairly significantly improves the solve time. Not much time is spent in cvxpy, now the solve is dominated by about 3 legitimate seconds in OSQP.

You can flip stuff in and out of loops to try different versions. This method is off-policy, so I could keep data around forever. However, it mostly just slowed the solve time.

import gym
import numpy as np
import cvxpy as cvx
from numpy.polynomial.chebyshev import chebval
env = gym.make('Pendulum-v0')
print(env.action_space)

print(env.observation_space)
print(env.observation_space.high)

print(env.observation_space.low) # [1,1,8]
print(env.action_space.high) # -2
print(env.action_space.low)

chebdeg = 3
alpha = cvx.Variable(3*chebdeg**3)
alpha.value = np.zeros(3*chebdeg**3)
gamma = 1. - 1./100
def basis(s,a):
n = np.arange(4)
f1 = chebval(s[0]/1.,np.eye(chebdeg)).reshape(-1,1,1,1)
f2 = chebval(s[1]/1.,np.eye(chebdeg)).reshape(1,-1,1,1)
f3 = chebval(s[2]/8.,np.eye(chebdeg)).reshape(1,1,-1,1)
f4 = ((a/2)**np.arange(3)).reshape(1,1,1,-1)
return f1*f2*f3*f4

def evalb(alpha, s,a):
f = basis(s,a)
return alpha*f.flatten()

def maxaction(alpha, obs):
f1 = np.sum(evalb(alpha.value, observation, 2))
f2 = np.sum(evalb(alpha.value, observation, 0))
f3 = np.sum(evalb(alpha.value, observation, -2))
coeff = np.polyfit([2,0,-2], [f1,f2,f3], deg=2)
if coeff[0] < 0:
action = -coeff[1]/2/coeff[0]
action = min(max(action,-2),2)
elif f1 > f3:
action = 2
else:
action = -2
return np.array([action])

constraints = []
objective = 0

for x in range(4):
constraints = []
objective = 0

epsilon = 1.2/(x+1)
print("epsilon: ", epsilon)
for i_episode in range(50):
observation = env.reset()
observations = []
rewards = []
actions = []
reward = -100
for t in range(100):

prev_obs = observation
prev_reward = reward
if np.random.rand() < epsilon:
action = env.action_space.sample()
else:
action = maxaction(alpha, observation)
observation, reward, done, info = env.step(action)
observations.append(observation)
rewards.append(reward)
actions.append(action)

#constraints.append(evalb(alpha, prev_obs, action) >= prev_reward + gamma*evalb(alpha, observation, action + np.random.randn()*0.2))#maxaction(alpha, observation)))
#objective += evalb(alpha, observation, env.action_space.sample())
if done:
print("Episode finished after {} timesteps".format(t+1))
break
bprev = np.array([basis(s,a).flatten() for (s,a) in zip(observations[:-1],actions[1:])])
bnext = np.array([basis(s,a+np.random.randn()*0.2).flatten() for (s,a) in zip(observations[1:],actions[1:])])
rewards = np.array(rewards)[:-1]
constraints += [(bprev@alpha) >= rewards + gamma*(bnext)@alpha]

bcost = np.array([basis(s,env.action_space.sample()).flatten() for s in observations])
objective += cvx.sum(bcost@alpha) # np.sum(bnext,axis=0) * alpha
loss = 0.1*cvx.sum(cvx.abs(alpha-alpha.value)) + objective
prob = cvx.Problem(cvx.Minimize(loss), constraints)

print("solving problem")
result = prob.solve(verbose=True)
print(result)
print(alpha.value)
#inspection loop
for i_episode in range(4):
observation = env.reset()
#env.env.state[0] = np.random.randn()*sigma
for t in range(200):
env.render()
#print(observation)
prev_obs = observation
action = maxaction(alpha, observation)
#print(action)

observation, reward, done, info = env.step(action)
if done:
print("Episode finished after {} timesteps".format(t+1))
break


## Deriving the Chebyshev Polynomials using Sum of Squares optimization with Sympy and Cvxpy

Least squares fitting $\sum (f(x_i)-y_i)^2$ is very commonly used and well loved. Sum of squared fitting can be solved using just linear algebra. One of the most convincing use cases to me of linear programming is doing sum of absolute value fits $\sum |f(x_i)-y_i|$  and maximum deviation fits $\max_i |f(x_i)-y_i|$. These two quality of fits are basically just as tractable as least squares, which is pretty cool.

The trick to turning an absolute value into an LP is to look at the region above the graph of absolute value.

This region is defined by $y \ge x$ and $y \ge -x$. So you introduce a new variable y. Then the LP $\min y$ subject to those constraints will minimize the absolute value. For a sum of absolute values, introduce a variable $y_i$ for each absolute value you have. Then minimize $\sum_i y_i$. If you want to do min max optimization, use the same y value for every absolute value function.

$\min y$

$\forall i. -y \le x_i \le y$

Let’s change topic a bit. Chebyshev polynomials are awesome. They are basically the polynomials you want to use in numerics.

Chebyshev polynomials are sines and cosines in disguise. They inherit tons of properties from them. One very important property is the equioscillation property. The Chebyshev polynomials are the polynomials that stay closest to zero while keeping the x^n coefficient nonzero (2^(n-2) by convention). They oscillate perfectly between -1 and 1 on the interval $x \in [-1,1]$ just like sort of a stretched out sine. It turns out this equioscillation property defines the Chebyshev polynomials

We can approximate the Chebyshev polynomials via sampling many points between [-1,1]. Then we do min of the max absolute error optimization using linear programming. What we get out does approximate the Chebyshev polynomials.

import cvxpy as cvx
import numpy as np
import matplotlib.pyplot as plt

# try finding the 3 through 5 chebyshev polynomial
for N in range(3,6):
a = cvx.Variable(N) #polynomial coefficients
t = cvx.Variable()
n = np.arange(N) #exponents

xs = np.linspace(-1,1,100)
chebcoeff = np.zeros(N)
chebcoeff[-1] = 1
plt.plot(xs, np.polynomial.chebyshev.chebval(xs, chebcoeff), color='r')

constraints = [a[-1]==2**(N-2)] # have to have highest power
for k in range(100):
x = np.random.rand()*2-1 #pick random x in [-1,1]
c = cvx.sum(a * x**n) #evaluate polynomial

constraints.append(c <= t)
constraints.append(-t <= c)

obj = cvx.Minimize(t) #minimize maximum aboslute value
prob = cvx.Problem(obj,constraints)
prob.solve()
plt.plot(xs, np.polynomial.polynomial.polyval(xs, a.value), color='g')
print(a.value)

plt.show()



Found Coefficients:
[-9.95353054e-01  1.33115281e-03  1.99999613e+00]
[-0.01601964 -2.83172979  0.05364805  4.00000197]
[ 0.86388003 -0.33517716 -7.4286604   0.6983382   8.00000211]

Red is the actual Chebyshev polynomials and green is the solved for polynomials. It does a decent job. More samples will do even better, and if we picked the Chebyshev points it would be perfect.

Can we do better? Yes we can. Let’s go on a little optimization journey.

Semidefinite programming allows you to optimize matrix variables with the constraint that they have all positive eigenvalues. In a way it lets you add an infinite number of linear constraints. Another way of stating the eigenvalue constraint is that

$\forall q. q^T X q \ge 0$

You could sample a finite number of random q vectors and take the conjunction of all these constraints. Once you had enough, this is probably a pretty good approximation of the Semidefinite constraint. But semidefinite programming let’s you have an infinite number of the constraints in the sense that $\forall q$ is referencing an infinite number of possible q , which is pretty remarkable.

Finite Sampling the qs has similarity to the previously discussed sampling method for absolute value minimization.

Sum of Squares optimization allows you to pick optimal polynomials with the constraint that they can be written as a sum of squares polynomials. In this form, the polynomials are manifestly positive everywhere. Sum of Squares programming is a perspective to take on Semidefinite programming. They are equivalent in power. You solve SOS programs under the hood by transforming them into semidefinite ones.

You can write a polynomial as a vector of coefficients $\tilde{a}$.

$\tilde{x} = \begin{bmatrix} 1 \\ x \\ x^2 \\ x^3 \\ \vdots \end{bmatrix}$

$\tilde{a} = \begin{bmatrix} a_0 \\ a_1 \\ a_2 \\ a_3 \\ \vdots \end{bmatrix}$

$p(x)=\tilde{a}^T \tilde{x}$

Instead we represent the polynomial with the matrix $Q$

$p(x) = \tilde{x}^T Q \tilde{x}$

If the matrix is positive semidefinite, then it can be diagonalized into the sum of squares form.

In all honesty, this all sounds quite esoteric, and it kind of is. I struggle to find problems to solve with this stuff. But here we are! We’ve got one! We’re gonna find the Chebyshev polynomials exactly by translating the previous method to SOS.

The formulation is a direct transcription of the above tricks.

$\min t$

$-t \le p(x) \le t$  by which I mean $p(x) + t$ is SOS and $t - p(x)$ is SOS.

There are a couple packages available for python already that already do SOS, .

SumofSquares.jl for Julia and SOSTools for Matlab. YalMip too I think. Instead of using those packages, I want to roll my own, like a doofus.

Sympy already has very useful polynomial manipulation functionality. What we’re going to do is form up the appropriate expressions by collecting powers of x, and then turn them into cvxpy expressions term by term. The transcription from sympy to cvxpy isn’t so bad, especially with a couple helper functions.

One annoying extra thing we have to do is known as the S-procedure. We don’t care about regions outside of $x \in [-1,1]$. We can specify this with a polynomial inequality $(x+1)(x-1) \ge 0$. If we multiply this polynomial by any manifestly positive polynomial (a SOS polynomial in particular will work), it will remain positive in the region we care about. We can then add this function into all of our SOS inequalities to make them easier to satisfy. This is very similar to a Lagrange multiplier procedure.

Now all of this seems reasonable. But it is not clear to me that we have the truly best polynomial in hand with this s-procedure business. But it seems to works out.

from sympy import *
import cvxpy as cvx
import matplotlib.pyplot as plt
import numpy as np

#build corresponding cvx variable for sympy variable
def cvxvar(expr, PSD=True):
if expr.func == MatrixSymbol:
i = int(expr.shape[0].evalf())
j = int(expr.shape[1].evalf())
return cvx.Variable((i,j), PSD=PSD)
elif expr.func == Symbol:
return cvx.Variable()

def cvxify(expr, cvxdict): # replaces sympy variables with cvx variables
return lambdify(tuple(cvxdict.keys()), expr)(*cvxdict.values())

xs = np.linspace(-1,1,100)

for N in range(3,6):
#plot actual chebyshev
chebcoeff = np.zeros(N)
chebcoeff[-1] = 1
plt.plot(xs, np.polynomial.chebyshev.chebval(xs, chebcoeff), color='r')

cvxdict = {}
# Going to need symbolic polynomials in x
x = Symbol('x')
xn = Matrix([x**n for n in range(N)]);

def sospoly(varname):
Q = MatrixSymbol(varname, N,N)
p = (xn.T * Matrix(Q) * xn)[0]
return p, Q

t = Symbol('t')
cvxdict[t] = cvxvar(t)

#lagrange multiplier polynomial 1
pl1, l1 = sospoly('l1')
cvxdict[l1] = cvxvar(l1)

#lagrange multiplier polynomial 2
pl2, l2 = sospoly('l2')
cvxdict[l2] = cvxvar(l2)

#Equation SOS Polynomial 1
peq1, eq1 = sospoly('eq1')
cvxdict[eq1] = cvxvar(eq1)

#Equation SOS Polynomial 2
peq2, eq2 = sospoly('eq2')
cvxdict[eq2] = cvxvar(eq2)

a = MatrixSymbol("a", N,1)
pa = Matrix(a).T*xn #sum([polcoeff[k] * x**k for k in range(n)]);
pa = pa[0]
cvxdict[a] = cvxvar(a, PSD=False)

constraints = []

# Rough Derivation for upper constraint
# pol <= t
# 0 <= t - pol + lam * (x+1)(x-1)  # relax constraint with lambda
# eq1 = t - pol + lam
# 0 = t - pol + lam - eq1
z1 = t - pa + pl1 * (x+1)*(x-1) - peq1
z1 = Poly(z1, x).all_coeffs()
constraints += [cvxify(expr, cvxdict) == 0 for expr in z1]

# Derivation for lower constraint
# -t <= pol
# 0 <= pol + t + lam * (x+1)(x-1) # relax constraint with lambda
# eq2 = pol + t + lam     # eq2 is SOS
# 0 = t - pol + lam - eq2     #Rearrange to equal zero.
z2 = pa + t + pl2 * (x+1)*(x-1) - peq2
z2 = Poly(z2, x).all_coeffs()
constraints += [cvxify(expr, cvxdict) == 0 for expr in z2]

constraints += [cvxdict[a][N-1,0] == 2**(N-2) ]
obj = cvx.Minimize(cvxdict[t]) #minimize maximum absolute value
prob = cvx.Problem(obj,constraints)
prob.solve()

print(cvxdict[a].value.flatten())
plt.plot(xs, np.polynomial.polynomial.polyval(xs, cvxdict[a].value.flatten()), color='g')

plt.show()


Coefficients:
[-1.00000000e+00 -1.02219773e-15  2.00000001e+00]
[-1.23103133e-13 -2.99999967e+00  1.97810058e-13  4.00001268e+00]
[ 1.00000088e+00 -1.39748880e-15 -7.99999704e+00 -3.96420452e-15
7.99999691e+00]

Ooooooh yeah. Those curves are so similar you can’t even see the difference. NICE. JUICY.

There are a couple interesting extension to this. We could find global under or over approximating polynomials. This might be nice for a verified compression of a big polynomial to smaller, simpler polynomials for example. We could also similar form the pointwise best approximation of any arbitrary polynomial f(x) rather than the constant 0 like we did above (replace $-t \le p(x) \le t$ for $-t \le p(x) - f(x) \le t$ in the above). Or perhaps we could use it to find a best polynomial fit for some differential equation according to a pointwise error.

I think we could also extend this method to minimizing the mean absolute value integral just like we did in the sampling case.

$\min \int_0^1 t(x)dx$

$-t(x) \le p(x) \le t(x)$

More references on Sum of Squares optimization:

http://www.mit.edu/~parrilo/

## Model Predictive Control of CartPole in OpenAI Gym using OSQP

A continuation of this post http://www.philipzucker.com/osqp-sparsegrad-fast-model-predictive-control-python-inverted-pendulum/

I was having difficulty getting the model predictive control from a previous post working on an actual cartpole. There are a lot more unknown variables in that case and other issues (the thing has a tendency to destroy itself). I was kind of hoping it would just work. So I realized that I should get it working in simulation.

I did not copy the simulation code of the openai cartpole https://github.com/openai/gym/blob/master/gym/envs/classic_control/cartpole.py  which gives some small amount of credence that the MPC might generalize to a real system.

For the sake of honesty, I’m still at the point where my controller freaks out about 1/3 of the time. I do not really understand why.

Looks damn good here though huh.

A problem I had for a while was the Length of my pole was off by a factor of 2. It still kind of worked, especially if nearly balanced (although with a lot of oscillation, which in hindsight was a clue something wasn’t tuned in right).

There are some useful techniques for manipulating gym. You can set some parameters from the outside, like starting positions and thresholds. Also you can mangle your way into continuous force control rather than just left right commands (wouldn’t it be cool to use Integer programming for that? Silly, but cool).

There is still a bunch of trash in here from me playing around with parameters. I like to keep it real (and lazy).

import gym
from mpc import MPC
import numpy as np
env = gym.make('CartPole-v0')
env.env.x_threshold = 5

start_theta = 0#-np.pi + 0.4# + 0.1#2 * np.pi #np.pi+0.4

mpc = MPC(0.5,0,start_theta,0)
action = 0
for i_episode in range(1):
observation = env.reset()
env.env.state[2] = start_theta - np.pi
for t in range(500):
env.render()
#print(observation)
#action = env.action_space.sample()
observation, reward, done, info = env.step(action)
a = mpc.update(observation[0] + 0.5, observation[1], observation[2]+np.pi, observation[3])
env.env.force_mag = abs(a) #min(100, abs(a))
#print(a)
if a < 0:
action = 0
else:
action = 1
if done:
pass
#print("Episode finished after {} timesteps".format(t+1))
#print(observation)
#print(dir(env))
#break
print(mpc.calcQ().reshape((5,50)))
print(observation)
print(dir(env))

One problem was that originally I had the pole just want to go to pi. But if it swings the other direction or many swings, that is bad and it will freak out. So I changed it to try to go the the current nearest multiple of pi, which helps.

Fiddling with the size of the regulation does have a significant effect and the relative size of regulation for x, v, f, omega. I am doing a lot of that search dumbly. I should probably do some kind of automatic.

Loosening the constraints on v and x seems to help stability.

I think weighting the angle at the end of the episode slightly more helps. That’s why I used linspace for the weight on the angle.

I’ve had a lot of problem with the answer coming back as infeasible from OSQP. I feel like it probably shouldn’t be and that is the solver’s problem?

Two things help: sometimes the cart does go out of the allowable range. The optimization probably will try to go all the way to the boundaries since it is useful. And since there is some mismatch between the actual dynamics and my model, it will go outside. So I heavily reduce the constraints for the first couple time steps. It takes a couple. 4 seems to work ok. It should want to apply forces during those four steps to get it back in range anyhow.

Even then it still goes infeasible sometimes and I don’t know why. So in that case, I reduce the required accuracy to hopefully at least get something that makes sense. That is what the eps_rel stuff is about. Maybe it helps. Not super clear. I could try a more iterative increasing of the eps?

import sparsegrad.forward as forward
import numpy as np
import osqp
import scipy.sparse as sparse

class MPC():
def __init__(self, x0, v0, theta0, thetadot0):
self.N = 50
self.NVars  = 5
T = 8.0
self.dt = T/self.N
dt = self.dt
self.dtinv = 1./dt
#Px = sparse.eye(N)
#sparse.csc_matrix((N, N))
# The three deifferent weigthing matrices for x, v, and external force
reg = sparse.eye(self.N)*0.05
z = sparse.bsr_matrix((self.N, self.N))
# sparse.diags(np.arange(N)/N)
pp = sparse.diags(np.linspace(1,7,self.N)) # sparse.eye(self.N)
P = sparse.block_diag([reg, 10*reg ,pp, reg, 10*reg]) #1*reg,1*reg])
#P[N,N]=10
self.P = P
THETA = 2
q = np.zeros((self.NVars, self.N))
q[THETA,:] = np.pi
q[0,:] = 0.5
#q[N,0] = -2 * 0.5 * 10
q = q.flatten()
q= -P@q
#u = np.arr

self.x = np.random.randn(self.N,self.NVars).flatten()
#x = np.zeros((N,NVars)).flatten()
#v = np.zeros(N)
#f = np.zeros(N)

A, l, u = self.getAlu(self.x, x0, v0, theta0, thetadot0)
self.m = osqp.OSQP()
self.m.setup(P=P, q=q, A=A, l=l, u=u , time_limit=0.1, verbose=False)# , eps_rel=1e-2 #  **settings # warm_start=False, eps_prim_inf=1e-1
self.results = self.m.solve()
print(self.results.x)
for i in range(100):

def calcQ(self):
THETA = 2
q = np.zeros((self.NVars, self.N))
thetas = np.reshape(self.x, (self.NVars, self.N))[THETA,:]
thetas = thetas - thetas % (2*np.pi) + np.pi
q[THETA,:] = thetas[0] #np.pi #thetas
q[0,:] = 0.5
#q[N,0] = -2 * 0.5 * 10
q = q.flatten()
q = -self.P@q
return q

A, l, u = self.getAlu(self.x, x0, v0, theta0, thetadot0)
#print(A.shape)
#print(len(A))
q = self.calcQ()
self.m.update(q=q, Ax=A.data, l=l, u=u)
self.results = self.m.solve()
if self.results.x[0] is not None:
self.x = np.copy(self.results.x)
self.m.update_settings(eps_rel=1e-3)
else:
#self.x += np.random.randn(self.N*self.NVars)*0.1 # help get out of a rut?
self.m.update_settings(eps_rel=1.1)
print("failed")
return 0
return self.x[4*self.N+1]

def constraint(self, var, x0, v0, th0, thd0):
#x[0] -= 1
#print(x[0])
g = 9.8
L = 1.0
gL = g * L
m = 1.0 # doesn't matter
I = L**2 / 3
Iinv = 1.0/I
dtinv = self.dtinv
N = self.N

x = var[:N]
v = var[N:2*N]
theta = var[2*N:3*N]
a = var[4*N:5*N]
xavg, vavg, thetavg, thdotavg = map(lambda z: (z[0:-1]+z[1:])/2, dynvars)
dx, dv, dthet, dthdot = map(lambda z: (z[1:]-z[0:-1])*dtinv, dynvars)
vres = dv - a[1:]
xres = dx - vavg
torque = (-gL*np.sin(thetavg) + a[1:]*L*np.cos(thetavg))/2
thetdotres = dthdot - torque*Iinv
thetres = dthet - thdotavg

return x[0:1]-x0, v[0:1]-v0, theta[0:1]-th0, thetadot[0:1]-thd0, xres,vres, thetdotres, thetres
#return x[0:5] - 0.5

def getAlu(self, x, x0, v0, th0, thd0):
N = self.N
gt = np.zeros((2,N))
gt[0,:] = -0.1 # 0.15 # x is greaer than 0.15
gt[1,:] = -3 #-1 #veclotu is gt -1m/s
#gt[4,:] = -10
control_n = max(3, int(0.1 / self.dt)) # I dunno. 4 seems to help
#print(control_n)

gt[:,:control_n] = -100
#gt[1,:2] = -100
#gt[1,:2] = -15
#gt[0,:3] = -10
gt = gt.flatten()
lt = np.zeros((2,N))
lt[0,:] = 1 #0.75 # x less than 0.75
lt[1,:] = 3 #1 # velocity less than 1m/s
#lt[4,:] = 10
lt[:,:	control_n] = 100
#lt[1,:2] = 100
#lt[0,:3] = 10
#lt[1,:2] = 15

lt = lt.flatten()

z = sparse.bsr_matrix((N, N))
ineqA = sparse.bmat([[sparse.eye(N),z,z,z,z],[z,sparse.eye(N),z,z,z]]) #.tocsc()
#print(ineqA.shape)
#print(ineqA)
cons = self.constraint(forward.seed_sparse_gradient(x), x0, v0, th0, thd0)
A = sparse.vstack(map(lambda z: z.dvalue, cons)) #  y.dvalue.tocsc()
#print(A.shape)
totval = np.concatenate(tuple(map(lambda z: z.value, cons)))
temp = A@x - totval

A = sparse.vstack((A,ineqA)).tocsc()

#print(tuple(map(lambda z: z.value, cons)))
#print(temp.shape)
#print(lt.shape)
#print(gt.shape)
u = np.concatenate((temp, lt))
l = np.concatenate((temp, gt))
return A, l, u



## OSQP and Sparsegrad: Fast Model Predictive Control in Python for an inverted pendulum

OSQP is a quadratic programming solver that seems to be designed for control problems. It has the ability to warm start, which should make it faster.

I heard about it in this Julia talk

They have some really cool technology over there in Julia land.

The problem is setup as a sequential quadratic program. I have a quadratic cost to try to bring the pendulum to an upright position.

The equations of motion are roughly

$\ddot{\theta}I=-mgL\sin(\theta)+mfL\cos(\theta)$

$\ddot{x}=f$

$I=\frac{1}{3}mL^2$

We don’t include the back reaction of the pole on the cart. It is basically irrelevant for our systems and overly complicated for no reason. The interesting bit is the nonlinear dynamics and influence of the cart acceleration.

I write down obeying the equations of motion as a linear constraint between timesteps. I use sparsegrad to get a sparse Jacobian of the equations of motion. The upper and lower (l and u) bounds are set equal to make an equality constraint.

Our setup has a finite track about a meter long. Our motors basically are velocity control and they can go about +-1m/s. Both of these can be encapsulated as linear constraints on the position and velocity variables. $l \le Ax \le u$

$\phi(x)=0$

$\phi(x) \approx \phi(x_0) + \partial \phi(x_0) (x - x_0)$

$A= \partial \phi(x_0)$

$l=u=\partial \phi(x_0) x_0 - \phi_0(x_0)$

Similarly for finding the quadratic cost function in terms of the setpoint $x_s$. $\frac{1}{2}x^T P x + q^Tx= \frac{1}{2}(x-x_s)P(x-x_s)$ Expanding this out we get

$q = - Px_s$

I run for multiple iterations to hopefully converge to a solution (it does). The initial linearization might be quite poor, but it gets better.

The OSQP program seems to be able to run in under 1ms. Nice! Initial tests of our system seem to be running at about 100Hz.

Could probably find a way to make those A matrices faster than constructing them entirely anew every time. We’ll see if we need that.

I very inelegantly smash all the variables into x. OSQP and scipy sparse don’t support multiIndex objects well, best as I can figure. Vectors should be single dimensioned and matrices 2 dimensioned.

Still to be seen if it works on hardware. I was already having infeasibility problems. Adjusting the regularization on P seemed to help.

import sparsegrad.forward as forward
import numpy as np
import osqp
import scipy.sparse as sparse
import matplotlib.pyplot as plt
import time

#def f(x):
#	return x**2

N = 1000
NVars  = 5
T = 5.0
dt = T/N
dtinv = 1./dt
#Px = sparse.eye(N)
#sparse.csc_matrix((N, N))
# The three deifferent weigthing matrices for x, v, and external force
reg = sparse.eye(N)*0.01
# sparse.diags(np.arange(N)/N)
P = sparse.block_diag([reg,reg ,sparse.eye(N), 1*reg,1*reg])
#P[N,N]=10
THETA = 2
q = np.zeros((NVars, N))
q[THETA,:] = np.pi
#q[N,0] = -2 * 0.5 * 10
q = q.flatten()
q= -P@q
#u = np.arr

x = np.random.randn(N,NVars).flatten()
#x = np.zeros((N,NVars)).flatten()
#v = np.zeros(N)
#f = np.zeros(N)

g = 9.8
L = 0.5
gL = g * L
m = 1.0 # doesn't matter
I = L**2 / 3
Iinv = 1.0/I
print(Iinv)

def constraint(var, x0, v0, th0, thd0):
#x[0] -= 1
#print(x[0])
x = var[:N]
v = var[N:2*N]
theta = var[2*N:3*N]
a = var[4*N:5*N]
xavg, vavg, thetavg, thdotavg = map(lambda z: (z[0:-1]+z[1:])/2, dynvars)
dx, dv, dthet, dthdot = map(lambda z: (z[1:]-z[0:-1])*dtinv, dynvars)
vres = dv - a[1:]
xres = dx - vavg
torque = -gL*np.sin(thetavg) + a[1:]*L*np.cos(thetavg)
thetdotres = dthdot - torque*Iinv
thetres = dthet - thdotavg

return x[0:1]-x0, v[0:1]-v0, theta[0:1]-th0, thetadot[0:1]-thd0, xres,vres, thetdotres, thetres
#return x[0:5] - 0.5

def getAlu(x, x0, v0, th0, thd0):
gt = np.zeros((2,N))
gt[0,:] = 0.1 # x is greaer than 0
gt[1,:] = -1 #veclotu is gt -1m/s
gt = gt.flatten()
lt = np.zeros((2,N))
lt[0,:] = 0.8
lt[1,:] = 1 # velocity less than 1m/s
lt = lt.flatten()

z = sparse.bsr_matrix((N, N))
ineqA = sparse.bmat([[sparse.eye(N),z,z,z,z],[z,sparse.eye(N),z,z,z]]) #.tocsc()
print(ineqA.shape)
#print(ineqA)
cons = constraint(forward.seed_sparse_gradient(x), x0, v0, th0, thd0)
A = sparse.vstack(map(lambda z: z.dvalue, cons)) #  y.dvalue.tocsc()
print(A.shape)
totval = np.concatenate(tuple(map(lambda z: z.value, cons)))
temp = A@x - totval

A = sparse.vstack((A,ineqA)).tocsc()

#print(tuple(map(lambda z: z.value, cons)))
print(temp.shape)
print(lt.shape)
print(gt.shape)
u = np.concatenate((temp, lt))
l = np.concatenate((temp, gt))
return A, l, u

#A = y.dvalue.tocsc()
#print(y.dvalue)
#sparse.hstack( , format="csc")
A, l, u = getAlu(x, 0.5, 0, 0, 0)
m = osqp.OSQP()
m.setup(P=P, q=q, A=A, l=l, u=u) #  **settings
results = m.solve()
print(results.x)

#plt.plot(results.x[3*N:4*N])
#plt.plot(results.x[4*N:5*N])
start = time.time()
iters = 100
for i in range(iters):
A, l, u = getAlu(results.x, 0.5, 0, 0, 0)
print(A.shape)
#print(len(A))
m.update(Ax=A.data, l=l, u=u)
results = m.solve()
end = time.time()
print("Solve in :", iters / (end-start) ,"Hz")

plt.plot(results.x[:N], label="x")
plt.plot(results.x[N:2*N], label="v")
plt.plot(results.x[2*N:3*N], label="theta")
plt.plot(results.x[4*N:5*N], label="force")
plt.legend()
plt.show()
#m.update(q=q_new, l=l_new, u=u_new)


## Attaching the Jordan Wigner String in Numpy

Just a fast (fast to write, not fast to run) little jordan wigner string code

import numpy as np
from numpy import kron, identity
import numpy.linalg as la

# sigma functions

sigma_x = np.array([[0, 1],[1, 0]])
sigma_y = np.array([[0, -1j],[1j, 0]])
sigma_z = np.array([[1, 0],[0, -1]])

# standard basis

spin_up = np.array([[1],[0]])
spin_down = np.array([[0],[1]])

sigma_plus = sigma_x + 1j * sigma_y
sigma_minus = sigma_x - 1j * sigma_y

# pauli spin

N = 3
def chainify(mat, pos):
if pos == 0:
newmat = mat
else:
newmat = identity(2)
for j in range(1,N):
if j == pos:
newmat = kron(newmat,mat)
else:
newmat = kron(newmat,identity(2))
return newmat

def sx(i):
return chainify(sigma_x,i)
def sy(i):
return chainify(sigma_y,i)
def sz(i):
return chainify(sigma_z,i)
def sp(i):
return chainify(sigma_plus,i)
def sm(i):
return chainify(sigma_minus,i)

#print sz(0)
#print sz(1)
#print sz(2)

#print np.dot(sp(0),sp(0))
# sp sm = 2 + 2 sz
#print np.dot(sp(0),sm(0))- 2*identity(2**N) - 2*sz(0)

I = identity(2**N)

fdag = lambda i: sp(i)/2
f = lambda i: sm(i)/2

def stringify(mat, pos):
if pos == 0:
newmat = mat
else:
newmat = sigma_z
for j in range(1,N):
if j == pos:
newmat = kron(newmat,mat)
elif j<pos:
newmat = kron(newmat,sigma_z)
else:
newmat = kron(newmat,identity(2))
return newmat

def cdag(i):
return np.mat(stringify(sigma_plus/2, i))

def c(i):
return np.mat(stringify(sigma_minus/2, i))

#print np.dot(cdag(1),c(1)) + np.dot(c(1),cdag(1)) # This is 1
#print np.dot(cdag(1),c(2)) + np.dot(c(2),cdag(1)) # This is 0

#It does appear to work.

print cdag(1)*c(1) + c(1)*cdag(1) # This is 1
print cdag(1)*c(2) + c(2)*cdag(1) # This anticommutator is 0.


What fun!