Gettin’ that Robot some Tasty Apples: Solving a simple geometrical puzzle in Z3 python

At work there is a monthly puzzler.

“Design a robot that can pick up all 9 apples arranged on a 3 by 3 rectangular grid and spaced 1m apart. The robot parts they have are faulty. The robot can only turn three times.”

I think the intent of the puzzle is that the robot can drive forward and in reverse, but only actually turn 3 times. It’s not very hard to do by hand. I decided to take a crack at this one using Z3 for funzies. Z3 is an SMT solver. It is capable of solving an interestingly wide variety of problems.

I interpret this as “find 4 lines that touch all points in the grid, such that consecutive lines intersect.”

It is fairly easy to directly translate this into a Z3 model.

from z3 import *
import matplotlib.pyplot as plt
import numpy as np
s = Solver()

linenum = 4

def linepoint(l,p): #line point product. Is zero if point is on line
   return l[0]*p[0] + l[1]*p[1] + l[2]

lines = [[Real(s + str(i)) for s in ('a','b','c')] for i in range(linenum)]

applepoints = [(x,y) for x in range(3) for y in range(3)] # the 3x3 grid of apples
turnpoints = [[Real(s + str(i)) for s in ('x','y')] for i in range(linenum-1)]

for pt in applepoints:  #Every point needs to be touched by one line
	s.add(Or([linepoint(l,pt)==0 for l in lines]))

for l in lines: #Exclude invalid lines (all zeros)
	s.add(Or([a != 0 for a in l]))

for i in range(linenum-1): #Consecutive lines intersect (aren't parallel). There are other ways of doing this
	s.add(linepoint( lines[i], turnpoints[i]) == 0)
	s.add(linepoint( lines[i+1], turnpoints[i]) == 0)

def convert(x): # convert a Z3 numeral (int, rational, or algebraic) into a Python number
	if is_int_value(x):
		return x.as_long()
	elif is_rational_value(x):
		return x.numerator_as_long()/x.denominator_as_long()
	elif is_algebraic_value(x):
		return convert(x.approx(7))


print(s.check())

m = s.model()
#print("x = %s" % m[x])

#print "traversing model..."
for d in m.decls():
    print("%s = %s" % (d.name(), m[d]))

#nplines = np.array([[m[a].numerator_as_long()/m[a].denominator_as_long() for a in l] for l in lines])
nplines = np.array([[convert(m[a]) for a in l] for l in lines])

#xs = np.random.randn(2,3)
print(nplines)
for l in nplines:
	pts = []
	for j in [-1,1]:
		# We can use Z3 to draw a nice set of lines too
		opt = Optimize()
		x, y = Reals('x y')
		opt.add([0 <=  x, x <= 2, 0 <= y, y <= 2])
		opt.add(l[0] * x + l[1]*y + l[2] == 0)
		opt.minimize(j*0.453 * x + j*0.3234 * y) # random objective direction hopefully not perpendicular to a line
		print(opt.check())
		m = opt.model()
		pts.append([convert(m[x]), convert(m[y])])
		print(l)
	pts = np.array(pts).T
	plt.plot(pts[0,:], pts[1,:])
plt.show()

A couple comments:

If we ask z3 to use only 3 lines, it returns unsat. Try to prove that by hand.
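A quick way to sanity check that claim (a minimal sketch reusing the definitions above, just dropping the line count to 3):

# Hypothetical re-run of the same model with only 3 lines and 2 turn points.
s3 = Solver()
lines3 = [[Real(s + str(i) + "_v3") for s in ('a','b','c')] for i in range(3)]
turns3 = [[Real(s + str(i) + "_v3") for s in ('x','y')] for i in range(2)]
for pt in applepoints: # every apple on some line
    s3.add(Or([linepoint(l, pt) == 0 for l in lines3]))
for l in lines3: # no degenerate all-zero lines
    s3.add(Or([a != 0 for a in l]))
for i in range(2): # consecutive lines share a turn point
    s3.add(linepoint(lines3[i], turns3[i]) == 0)
    s3.add(linepoint(lines3[i+1], turns3[i]) == 0)
print(s3.check()) # prints unsat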

However, if the robot is on the projective plane, it is possible with 3 lines. It only needs to drive to infinity and turn twice. Any two distinct lines intersect exactly once on the projective plane. How convenient.

The problem only seems somewhat difficult to computerize because of the seemingly infinite nature of geometry. If we only consider the lines that touch at least two points, all possible robot paths become easily enumerable (see the brute-force sketch below). Is there a proof that we only need these lines?
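Here is a rough brute-force sketch of that idea (the helper names are my own; it only considers lines through at least two apples and requires consecutive lines to be non-parallel):

# Brute force over lines that pass through at least two apples.
from itertools import combinations, permutations
from math import gcd

apples = [(x, y) for x in range(3) for y in range(3)]

def line_through(p, q):
    # line a*x + b*y + c = 0 through two points, normalized so duplicate lines collapse
    (x1, y1), (x2, y2) = p, q
    a, b, c = y2 - y1, x1 - x2, x2*y1 - x1*y2
    g = gcd(gcd(abs(a), abs(b)), abs(c)) or 1
    a, b, c = a // g, b // g, c // g
    if a < 0 or (a == 0 and b < 0):
        a, b, c = -a, -b, -c
    return (a, b, c)

candidates = {line_through(p, q) for p, q in combinations(apples, 2)}

def covers(lines):
    return all(any(a*x + b*y + c == 0 for (a, b, c) in lines) for (x, y) in apples)

def connected(lines): # consecutive lines must intersect, i.e. must not be parallel
    return all(a1*b2 - a2*b1 != 0 for (a1, b1, _), (a2, b2, _) in zip(lines, lines[1:]))

paths = [ls for ls in permutations(candidates, 4) if covers(ls) and connected(ls)]
print(len(paths), "four-line paths found")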

Another interesting approach might be to note that the points are described by the set of equations x*(x-1)*(x-2)=0 and y*(y-1)*(y-2)=0. I think we could then possibly use methods of nonlinear algebra (Groebner bases) to find the lines. Roughly an ideal containment question? Don’t have this one fully thought out yet. I think z3 might be doing something like this behind the scenes.
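One concrete piece of that thought can be checked mechanically: a set of 4 lines covers all 9 apples exactly when the product of their defining linear polynomials lies in the ideal generated by x*(x-1)*(x-2) and y*(y-1)*(y-2). A sympy sketch of that membership test (my own illustration), using the classic nine-dots solution x=0, x+y=3, y=0, x=y:

# Ideal membership check with sympy.
from sympy import symbols, expand, reduced

x, y = symbols('x y')
gridideal = [x*(x - 1)*(x - 2), y*(y - 1)*(y - 2)] # vanishing ideal of the 3x3 grid
# The two generators already form a Groebner basis (their leading terms x**3 and y**3 are coprime),
# so the remainder of multivariate division is a genuine membership test.
lines = [x, x + y - 3, y, x - y] # the classic nine-dots 4-line solution
_, remainder = reduced(expand(lines[0]*lines[1]*lines[2]*lines[3]), gridideal, x, y)
print(remainder) # 0 means the product vanishes on every apple, i.e. the 4 lines cover the grid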


More Reinforcement Learning with cvxpy

So I spent thanksgiving doing this and playing Zelda. Even though that sounds like a hell of a day, seems a little sad for thanksgiving :(. I should probably make more of an effort to go home next year.

I tried implementing a more traditional q-learning pipeline using cvxpy (rather than the inequality trick of the last time). Couldn’t get it to work as well. And it’s still kind of slow despite a lot of rearrangement to vectorize operations (through batching basically).

I guess I’m still entranced with the idea of avoiding neural networks. In a sense, that is the old boring way of doing things. The Deep RL is the new stuff. Using ordinary function approximators is way older I think. But I feel like it takes a problem out of the equation (dealing with training neural nets). Also I like modeling languages/libraries.

I kept finding show-stopping bugs throughout the day (incorrectly written maxaction functions, typos, cross-episode data points, etc.), so I wouldn’t be surprised if there is one still in here. It’s very surprising how one can convince oneself that it is kind of working when it is actually impossible that it’s working. All these environments are so simple that I suspect I could randomly sample controllers out of a sack for the time I’ve been fiddling with this stuff and find a good one.

import gym
import numpy as np
import cvxpy as cvx
from numpy.polynomial.chebyshev import chebval
env = gym.make('Pendulum-v0')
print(env.action_space)
 
print(env.observation_space)
print(env.observation_space.high)
 
print(env.observation_space.low) # [1,1,8]
print(env.action_space.high) # -2
print(env.action_space.low)


 
chebdeg = 2
alpha = cvx.Variable(3*chebdeg**3)
alpha.value = np.zeros(3*chebdeg**3)
gamma = 1. - 1./50
def basis(s,a):
    n = np.arange(4)
    f1 = chebval(s[0]/1.,np.eye(chebdeg)).reshape(-1,1,1,1)
    f2 = chebval(s[1]/1.,np.eye(chebdeg)).reshape(1,-1,1,1)
    f3 = chebval(s[2]/8.,np.eye(chebdeg)).reshape(1,1,-1,1)
    f4 = ((a/2)**np.arange(3)).reshape(1,1,1,-1)
    return f1*f2*f3*f4
def vbasis(s,a):
    #n = np.arange(4)
    N = s.shape[0]

    #print(chebval(s[:,0]/1.,np.eye(chebdeg)).shape)
    f1 = chebval(s[:,0]/1.,np.eye(chebdeg)).T.reshape(N,-1,1,1,1)
    f2 = chebval(s[:,1]/1.,np.eye(chebdeg)).T.reshape(N,1,-1,1,1)
    f3 = chebval(s[:,2]/8.,np.eye(chebdeg)).T.reshape(N,1,1,-1,1)
    if type(a) == np.ndarray:
        f4 = (a.reshape(-1,1,1,1,1)/2)**(np.arange(3).reshape(1,1,1,1,-1))
    else:
        f4 = (a/2)**(np.arange(3).reshape(1,1,1,1,-1))
    return (f1*f2*f3*f4).reshape(N,-1)

def qmaxaction(alpha,s):
    #n = np.arange(4)
    N = s.shape[0]

    #print(chebval(s[:,0]/1.,np.eye(chebdeg)).shape)
    f1 = chebval(s[:,0]/1.,np.eye(chebdeg)).T.reshape(N,-1,1,1,1)
    f2 = chebval(s[:,1]/1.,np.eye(chebdeg)).T.reshape(N,1,-1,1,1)
    f3 = chebval(s[:,2]/8.,np.eye(chebdeg)).T.reshape(N,1,1,-1,1)
    z = f1*f2*f3
    a = (np.array([0,0,0.25]).reshape(1,1,1,1,-1)*z).reshape(N,-1)@alpha.value
    b = (np.array([0,0.5,0]).reshape(1,1,1,1,-1)*z).reshape(N,-1)@alpha.value

    acts = np.clip(-b/2/(a+1e-7), -2,2)
    q2 = vbasis(s,2)@alpha.value
    print(acts)
    qa = vbasis(s,acts)@alpha.value
    qn2 = vbasis(s,-2)@alpha.value
    return np.maximum(qa,np.maximum(qn2,q2))
 
def evalb(alpha, s,a):
    f = basis(s,a)
    return alpha*f.flatten()
 
def maxaction(alpha, obs):
    f1 = np.sum(evalb(alpha.value, obs, 2))
    f2 = np.sum(evalb(alpha.value, obs, 0))
    f3 = np.sum(evalb(alpha.value, obs, -2))
    coeff = np.polyfit([2,0,-2], [f1,f2,f3], deg=2)
    if coeff[0] < 0:
        action = -coeff[1]/2/coeff[0]
        action = min(max(action,-2),2)
    elif f1 > f3:
        #print("huh")
        action = 2
    else:
        #print("how bout dat")
        action = -2
    return np.array([action])
 
constraints = []
objective = 0



 
for x in range(4):
 
    constraints = []
    epobs = []
    eprewards = []
    epactions = []
    objective = 0
    epsilon = 1.2/(x+1)
    print("epsilon: ", epsilon)
    epNum = 50
    for i_episode in range(epNum):
        observations = []
        rewards = []
        actions = []
        observation = env.reset()
        reward = -100
        for t in range(100):
 
            prev_obs = observation
            prev_reward = reward
            if np.random.rand() < epsilon:
                action = env.action_space.sample()
            else:
                action = maxaction(alpha, observation)
            observation, reward, done, info = env.step(action)
            observations.append(observation)
            rewards.append(reward)
            actions.append(action)

            #constraints.append(evalb(alpha, prev_obs, action) >= prev_reward + gamma*evalb(alpha, observation, action + np.random.randn()*0.2))#maxaction(alpha, observation)))
            #objective += evalb(alpha, observation, env.action_space.sample())
            if done:
                print("Episode finished after {} timesteps".format(t+1))
                break
        epobs.append(observations)
        epactions.append(actions)
        eprewards.append(rewards)
    for qiter in range(20):
        print("Q iteration: ", qiter)
        objective = 0
        for (observations,actions, rewards) in zip(epobs,epactions,eprewards):
            obs = np.array(observations)
            act = np.array(actions)
            bprev = vbasis(obs[:-1],act[1:]) #   np.array([basis(s,a).flatten() for (s,a) in zip(observations[:-1],actions[1:])])
            #bnext = np.array([basis(s,a+np.random.randn()*0.2).flatten() for (s,a) in zip(observations[1:],actions[1:])])
            #obs = np.array(observations)[:-1]
            #q2 = np.array([basis(s,2).flatten() for s in observations[:-1]])@alpha.value
            #q2 = vbasis(obs[1:],2)@alpha.value
            #q0 = vbasis(obs[1:],0)@alpha.value
            #qn2 = vbasis(obs[1:],-2)@alpha.value
            #q1 = vbasis(obs[1:],1)@alpha.value
            #qn1 = vbasis(obs[1:],-1)@alpha.value
            #qs = []
            #for a in np.linspace(-2,2,5):
            #    qs.append(vbasis(obs[1:],a+np.random.randn()*0.5)@alpha.value)
            #for a in np.linspace(-0.1,0.1,3):
            #    qs.append(vbasis(obs[1:],a)@alpha.value)

            #qmax = np.max(np.stack([q2,q0,qn2,q1,qn1],axis=1),axis=1)
            #qmax = np.max(np.stack(qs,axis=1),axis=1)
            qmax = qmaxaction(alpha, obs[1:])


            #q0 = np.array([basis(s,0).flatten() for s in observations[:-1]])@alpha.value
            #qn2 = np.array([basis(s,-2).flatten() for s in observations[:-1]])@alpha.value
           
            #qmax = np.maximum(np.maximum(q0,q2),qn2)#np.array([np.dot(basis(s, maxaction(alpha,s)).flatten(),alpha.value) for s in observations[1:]])
            rewards = np.array(rewards)[:-1]
            objective += cvx.sum(cvx.huber(bprev@alpha - (rewards + gamma*qmax))) #cvx.sum_squares(bprev@alpha - (rewards + gamma*qmax))

        #bcost = np.array([basis(s,env.action_space.sample()).flatten() for s in observations])
        #objective = cvx.sum(bcost@alpha) # np.sum(bnext,axis=0) * alpha
        loss = 0.001*cvx.sum_squares(alpha-alpha.value) + objective 
        prob = cvx.Problem(cvx.Minimize(loss), constraints)
        # The optimal objective value is returned by `prob.solve()`.
        print("solving problem")
        result = prob.solve(verbose=True)
        print(result)
        # The optimal value for x is stored in `x.value`.
        print(alpha.value)
    #inspection loop
    for i_episode in range(4):
        observation = env.reset()
        #env.env.state[0] = np.random.randn()*sigma 
        for t in range(200):
            env.render()
            #print(observation)
            prev_obs = observation
            action = maxaction(alpha, observation)
            #print(action)
 
            observation, reward, done, info = env.step(action)
            if done:
                print("Episode finished after {} timesteps".format(t+1))
                break

 

I also did the easy cartpole environment using the inequality trick.  Seems to work pretty well.

import gym
import numpy as np
import cvxpy as cvx
from numpy.polynomial.chebyshev import chebval
env = gym.make('CartPole-v1')
print(env.action_space)
 
print(env.observation_space)
print(env.observation_space.high)
 
print(env.observation_space.low) # [1,1,8]
#print(env.action_space.high) # -2
#print(env.action_space.low)
 
 
chebdeg = 3
alpha = cvx.Variable(2*chebdeg**4)
alpha.value = np.zeros(2*chebdeg**4)
gamma = 1. - 1./25
def basis(s,a):
    n = np.arange(4)
    f1 = chebval(s[0]/4.,np.eye(chebdeg)).reshape(-1,1,1,1,1)
    f2 = chebval(s[1]/10.,np.eye(chebdeg)).reshape(1,-1,1,1,1)
    f3 = chebval(s[2]/0.4,np.eye(chebdeg)).reshape(1,1,-1,1,1)
    f4 = chebval(s[3]/10.,np.eye(chebdeg)).reshape(1,1,1,-1,1)
    f5 = ((a/2)**np.arange(2)).reshape(1,1,1,1,-1)
    return f1*f2*f3*f4*f5
 
def evalb(alpha, s,a):
    f = basis(s,a)
    return alpha*f.flatten()
 
def maxaction(alpha, obs):
    f1 = np.sum(evalb(alpha.value, obs, 0))
    f2 = np.sum(evalb(alpha.value, obs, 1))
    if f1 > f2:
        action = 0
    else:
        action = 1
    return action
 
constraints = []
objective = 0
 
 
for x in range(4):
    constraints = []
    objective = 0
 

    epsilon = 1.2/(x+1)
    print("epsilon: ", epsilon)
    for i_episode in range(200):
        observation = env.reset()
        observations = []
        rewards = []
        actions = []
        reward = -100
        for t in range(50):
 
            prev_obs = observation
            prev_reward = reward
            if np.random.rand() < epsilon:
                action = env.action_space.sample()
            else:
                action = maxaction(alpha, observation)
            observation, reward, done, info = env.step(action)
            observations.append(observation)
            rewards.append(reward)
            actions.append(action)

            #constraints.append(evalb(alpha, prev_obs, action) >= prev_reward + gamma*evalb(alpha, observation, action + np.random.randn()*0.2))#maxaction(alpha, observation)))
            #objective += evalb(alpha, observation, env.action_space.sample())
            if done:
                print("Episode finished after {} timesteps".format(t+1))
                break
        #print(rewards)
        bprev = np.array([basis(s,a).flatten() for (s,a) in zip(observations[:-1],actions[1:])])
        bnext = np.array([basis(s,env.action_space.sample()).flatten() for (s,a) in zip(observations[1:],actions[1:])])
        rewards = np.array(rewards)[:-1]
        constraints += [(bprev@alpha) >= rewards + gamma*(bnext)@alpha]

        bcost = np.array([basis(s,env.action_space.sample()).flatten() for s in observations])
        objective += cvx.sum(bcost@alpha) # np.sum(bnext,axis=0) * alpha
    loss = 1*cvx.sum(cvx.abs(alpha-alpha.value)) + objective 
    prob = cvx.Problem(cvx.Minimize(loss), constraints)

    print("solving problem")
    result = prob.solve(verbose=True)
    print(result)
    print(alpha.value)
    #inspection loop
    for i_episode in range(4):
        observation = env.reset()
        #env.env.state[0] = np.random.randn()*sigma 
        for t in range(200):
            env.render()
            #print(observation)
            prev_obs = observation
            action = maxaction(alpha, observation)
            #print(action)
 
            observation, reward, done, info = env.step(action)
            if done:
                print("Episode finished after {} timesteps".format(t+1))
                break


I also have some work in progress on getting the full swing-up cartpole. Currently it is not really working. It seems to kind of be pumping, which is about right? The continuous force control on the easy cartpole does work though.

import gym
import numpy as np
import cvxpy as cvx
from numpy.polynomial.chebyshev import chebval
env = gym.make('CartPole-v1')
print(env.action_space)
 
print(env.observation_space)
print(env.observation_space.high)
 
print(env.observation_space.low) # [1,1,8]
#print(env.action_space.high) # -2
#print(env.action_space.low)
 
 
chebdeg = 2
varNum = 2*2*4*2*2*3
alpha = cvx.Variable(varNum)
alpha.value = np.zeros(varNum)
gamma = 1. - 1./150
def basis(s,a):
    n = np.arange(4)
    f1 = chebval(s[0]/4.,np.eye(2) ).reshape(-1,1,1,1,1,1)
    f2 = chebval(s[1]/10.,np.eye(2)).reshape(1,-1,1,1,1,1)
    f3 = chebval(s[2]/1,np.eye(4)  ).reshape(1,1,-1,1,1,1)
    f4 = chebval(s[3]/1,np.eye(2)  ).reshape(1,1,1,-1,1,1)
    f5 = chebval(s[4]/10.,np.eye(2)).reshape(1,1,1,1,-1,1)
    f6 = ((a/10)**np.arange(3)).reshape(1,1,1,1,1,-1)
    return f1*f2*f3*f4*f5*f6
 
def evalb(alpha, s,a):
    f = basis(s,a)
    return alpha*f.flatten()
'''
def maxaction(alpha, obs):
    f1 = np.sum(evalb(alpha.value, observation, 0))
    f2 = np.sum(evalb(alpha.value, observation, 1))
    if f1 > f2:
        action = 0
    else:
        action = 1
    return action
'''
def maxaction(alpha, obs):
    f1 = np.sum(evalb(alpha.value, obs, 10))
    f2 = np.sum(evalb(alpha.value, obs, 0))
    f3 = np.sum(evalb(alpha.value, obs, -10))
    coeff = np.polyfit([10,0,-10], [f1,f2,f3], deg=2)
    if coeff[0] < 0:
        action = -coeff[1]/2/coeff[0]
        action = min(max(action,-10),10)
    elif f1 > f3:
        action = 10
    else:
        action = -10
    return np.array([action])

constraints = []
objective = 0
 
 
for x in range(4):
    constraints = []
    objective = 0
 

    epsilon = 1.2/(x+1)
    print("epsilon: ", epsilon)
    for i_episode in range(100):
        observation = env.reset()
        observations = []
        rewards = []
        actions = []
        reward = -100
        env.env.state[2] = np.random.rand()*2*np.pi
        for t in range(100):
 
            prev_obs = observation
            prev_reward = reward
            if np.random.rand() < epsilon or len(observation) < 5:
                action = np.random.rand()*20-10
            else:
                action = maxaction(alpha, observation)
            a = action
            env.env.force_mag = abs(a) #min(100, abs(a))
            #print(a)
            if a < 0:
                daction = 0
            else:
                daction = 1
            observation, reward, done, info = env.step(daction)
            o = observation
            observation = np.array([o[0],o[1],np.cos(o[2]),np.sin(o[2]),o[3]])

            observations.append(observation)
            bonus = 0
            if observation[2] > 0.9:
                bonus = 1

            rewards.append( (bonus + observation[2] + 1)/2)#- observation[0]**2) #reward
            #print(rewards[-1])
            actions.append(action)

            #constraints.append(evalb(alpha, prev_obs, action) >= prev_reward + gamma*evalb(alpha, observation, action + np.random.randn()*0.2))#maxaction(alpha, observation)))
            #objective += evalb(alpha, observation, env.action_space.sample())
            x = observation[0]
            if x < -4 or x > 4:
                break
            #if done:
            #    print("Episode finished after {} timesteps".format(t+1))
            #    break
        #print(rewards)
        bprev = np.array([basis(s,a).flatten() for (s,a) in zip(observations[:-1],actions[1:])])
        bnext = np.array([basis(s,a+np.random.randn()*0.3).flatten() for (s,a) in zip(observations[1:],actions[1:])])
        rewards = np.array(rewards)[:-1]
        #print(bprev.shape)
        #print(bnext.shape)

        constraints += [(bprev@alpha) >= rewards + gamma*(bnext)@alpha]

        bcost = np.array([basis(s,a+np.random.randn()*0.3).flatten() for s in observations])
        objective += cvx.sum(bcost@alpha) # np.sum(bnext,axis=0) * alpha
    loss = 1*cvx.sum(cvx.abs(alpha-alpha.value)) + objective 
    prob = cvx.Problem(cvx.Minimize(loss), constraints)

    print("solving problem")
    result = prob.solve(verbose=True)
    print(result)
    print(alpha.value)
    #inspection loop
    for i_episode in range(4):
        observation = env.reset()
        env.env.state[2] = np.random.rand()*2*np.pi
        #env.env.state[0] = np.random.randn()*sigma 
        for t in range(200):
            env.render()
            #print(observation)
            prev_obs = observation
            o = observation
            observation = np.array([o[0],o[1],np.cos(o[2]),np.sin(o[2]),o[3]])
            action = maxaction(alpha, observation)
            a = action
            env.env.force_mag = abs(a) #min(100, abs(a))
            #print(a)
            if a < 0:
                daction = 0
            else:
                daction = 1
            print(action)
 
            observation, reward, done, info = env.step(daction)

            cart_x = observation[0] # cart position from the raw observation
            if cart_x < -4 or cart_x > 4:
                break

 

Now I feel that a thing that matters quite a bit is the choice of action for the next time step. Hypothetically you want a ton of samples here. I now think that using an action that is just slightly perturbed from the actual action works well, because the actual action is tending to become roughly the optimal one. Subsequent time steps have roughly the same data in them.

One advantage of discrete action space is that you can really search it all.

Does that mean I should seriously investigate the sum of squares form? A semidefinite variable per data point sounds bad. I feel like I’d have to seriously limit the amount of data I’m using. Maybe I’ll be pleasantly surprised.

I haven’t even gotten to playing with different polynomials yet. The current implementation’s basis is exponentially sized in the number of variables, but in kind of a silly way. I think it would be better to use all terms of a bounded total degree, as in the sketch below.
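Roughly what I mean, reusing the same chebval building blocks as the pendulum basis above (the function name and degree cutoff here are made up for illustration):

# Hypothetical bounded-total-degree basis: keep only products whose degrees sum to <= maxdeg.
import numpy as np
from itertools import product
from numpy.polynomial.chebyshev import chebval

maxdeg = 3

def basis_total_degree(s, a, maxdeg=maxdeg):
    # per-variable Chebyshev (and action monomial) values, indexed by degree
    f = [chebval(s[0]/1., np.eye(maxdeg + 1)),
         chebval(s[1]/1., np.eye(maxdeg + 1)),
         chebval(s[2]/8., np.eye(maxdeg + 1)),
         (a/2.)**np.arange(maxdeg + 1)]
    feats = []
    for degs in product(range(maxdeg + 1), repeat=4):
        if sum(degs) <= maxdeg: # total degree cutoff instead of a full tensor product
            feats.append(np.prod([f[i][d] for i, d in enumerate(degs)]))
    return np.array(feats)

print(basis_total_degree(np.array([1., 0., 0.]), 0.5).shape) # 35 features instead of 4**4 = 256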

 

Q-Learning with Linear Programming (cvxpy, OpenAI Gym Pendulum)

http://web.mit.edu/~pucci/www/discountedLP.pdf

http://underactuated.mit.edu/underactuated.html?chapter=dp

There is a fun idea of using Linear Programming to do dynamic programming that I originally saw in the underactuated robotics textbook.

In my experience reinforcement learning is finicky and depressing. It usually doesn’t work and is very hard to troubleshoot. Do you just need to run it for 10 minutes? 10 years? Is there a bug? God knows. I end up wriggling hyperparameters and praying a lot.

One part of this is the relative finickiness of neural network optimization compared to the technology of convex optimization. Convex optimization solvers are quite reliable and fast.

There is a way of phrasing Q learning as a linear programming problem.

The linear programming approach relaxes the Bellman equations.

Q(s_t,a_t)=r_t + \gamma \max_a Q(s_{t+1},a)

to

\forall a. Q(s_t,a_t) \ge r_t +\gamma Q(s_{t+1},a)

We can approach this forall in a couple of ways, one of which is just sampling actions somehow. To make the constraint tight in places, you minimize a weighting of Q

\min \sum_i w_i Q(s_i,a_i)

If Q is written as a linear combination of basis functions

Q(s,a)=\sum \alpha_i f_i(s,a)

Then all of this put together is a linear program in the variables \alpha_i.
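As a tiny illustration of the resulting LP structure, here is a standalone toy with synthetic transition data (everything in it is made up; the real pendulum version is further down):

# Toy version of the Q-learning LP: Q(s,a) = alpha . f(s,a), with relaxed Bellman inequalities.
import numpy as np
import cvxpy as cvx

np.random.seed(0)
nfeat, ndata = 5, 200
f_sa = np.random.randn(ndata, nfeat)   # features of (s_t, a_t); stand-ins for real rollout data
f_next = np.random.randn(ndata, nfeat) # features of (s_{t+1}, a') for some sampled a'
gamma = 0.95
alpha_true = np.random.randn(nfeat)
r = (f_sa - gamma*f_next) @ alpha_true - np.random.rand(ndata) # rewards chosen so the LP is feasible

alpha = cvx.Variable(nfeat)
weights = np.ones(ndata)                                   # weight where you want the constraints tight
constraints = [f_sa @ alpha >= r + gamma*(f_next @ alpha), # sampled version of the forall-a relaxation
               cvx.abs(alpha) <= 10]                       # box bound just to keep the toy bounded
prob = cvx.Problem(cvx.Minimize(weights @ (f_sa @ alpha)), constraints)
prob.solve()
print(prob.status, alpha.value)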

 

For ease, I used cvxpy. I don’t even store my state action pairs, which is quite lazy of me. Even here, compiling the linear program via cvxpy is kind of slow. This preprocessing step takes longer than the actual solve does. You could avoid cvxpy and directly interface a linear programming solver much faster, if that is your thing.

The whole process is still model free. I didn’t plug in pendulum dynamics anywhere. I run openAI gym and use the resulting state-action-state tuples to add inequalities to my cvxpy model. I weight where I want the inequalities to be tightest by using the actual states experienced.

Unfortunately, it still took a couple hours of hyper parameter tuning and fiddling to get the thing to work. So not a grand success on that point.

I made a lot of guesses about what seemed reasonable:

I parametrized the dependence of Q on a by a quadratic so that it is easy to maximize analytically. That is what the polyfit stuff is about. The maximum of ax^2+bx+c is at -b/2a. I really should be checking the sign of the a coefficient; I am just assuming the parabola is concave down. Naughty boy.
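The trick in isolation (a minimal sketch with a made-up Q function; [-2, 2] is the pendulum's torque range):

# Fit a quadratic in the action through three sampled Q values, then maximize analytically.
import numpy as np

def argmax_quadratic(qfun, lo=-2.0, hi=2.0):
    acts = np.array([lo, 0.0, hi])
    a, b, c = np.polyfit(acts, [qfun(u) for u in acts], deg=2)
    if a < 0: # concave down: interior maximum at -b/(2a)
        return float(np.clip(-b/(2*a), lo, hi))
    return lo if qfun(lo) > qfun(hi) else hi # otherwise the maximum is at an endpoint

print(argmax_quadratic(lambda u: -(u - 0.7)**2)) # ~0.7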


Chebyshev polynomials are probably good.

It seemed to help to use a slight perturbation of the actual action used on the right hand side of the Bellman inequality. My reasoning here is that the pendulum is actually a continuous system, so we should be using the differential Bellman equation really.

Should I allow for some kind of slack in the equations? Getting a bad reward or data point or one weird unusual state could ruin things for everyone. Inequalities are unforgiving.

Gamma seemed to matter a decent amount.

The regularization of alpha seemed largely irrelevant.

Epsilon greediness seems to not matter much either.


Future ideas:

Might be good to replace the sampling of a with a Sum of Squares condition over the variable a.

Should I damp the update in some way? Add a cost to changing alpha from its previous value. A kind of damped update / using a prior.

 

import gym
import numpy as np
import cvxpy as cvx
from numpy.polynomial.chebyshev import chebval
env = gym.make('Pendulum-v0')
print(env.action_space)

print(env.observation_space)
print(env.observation_space.high)

print(env.observation_space.low) # [1,1,8]
print(env.action_space.high) # -2
print(env.action_space.low)


chebdeg = 4
alpha = cvx.Variable(3*chebdeg**3)
gamma = 1. - 1./100
def basis(s,a):
	n = np.arange(4)
	f1 = chebval(s[0]/1.,np.eye(chebdeg)).reshape(-1,1,1,1)
	f2 = chebval(s[1]/1.,np.eye(chebdeg)).reshape(1,-1,1,1)
	f3 = chebval(s[2]/8.,np.eye(chebdeg)).reshape(1,1,-1,1)
	f4 = ((a/2)**np.arange(3)).reshape(1,1,1,-1)
	return f1*f2*f3*f4

def evalb(alpha, s,a):
	f = basis(s,a)
	return alpha*f.flatten()

def maxaction(alpha, obs):
    f1 = np.sum(evalb(alpha.value, obs, 2))
    f2 = np.sum(evalb(alpha.value, obs, 0))
    f3 = np.sum(evalb(alpha.value, obs, -2))
    coeff = np.polyfit([2,0,-2], [f1,f2,f3], deg=2)
    action = -coeff[1]/2/coeff[0]
    action = min(max(action,-2),2)
    return np.array([action])

constraints = []
objective = 0




for x in range(4):

	constraints = []
	objective = 0
	epsilon = 1.2/(x+1)
	print("epsilon: ", epsilon)
	for i_episode in range(50):
	    observation = env.reset()
	    reward = -100
	    for t in range(100):

	        prev_obs = observation
	        prev_reward = reward
	        if np.random.rand() < epsilon:
	        	action = env.action_space.sample()
	        else:
	        	action = maxaction(alpha, observation)
	        observation, reward, done, info = env.step(action)
	        constraints.append(evalb(alpha, prev_obs, action) >= prev_reward + gamma*evalb(alpha, observation, action + np.random.randn()*0.2))#maxaction(alpha, observation)))
	        objective += evalb(alpha, observation, env.action_space.sample())
	        if done:
	            print("Episode finished after {} timesteps".format(t+1))
	            break
	loss = 0.1*cvx.sum(cvx.abs(alpha)) + objective 
	prob = cvx.Problem(cvx.Minimize(loss), constraints)
	# The optimal objective value is returned by `prob.solve()`.
	print("solving problem")
	result = prob.solve()
	print(result)
	# The optimal value for x is stored in `x.value`.
	print(alpha.value)
	#inspection loop
	for i_episode in range(4):
	    observation = env.reset()
	    #env.env.state[0] = np.random.randn()*sigma 
	    for t in range(200):
	        env.render()
	        #print(observation)
	        prev_obs = observation
	        action = maxaction(alpha, observation)
	        #print(action)

	        observation, reward, done, info = env.step(action)
	        if done:
	            print("Episode finished after {} timesteps".format(t+1))
	            break



Edit:

An improved version. I fixed the bug in my maxaction function; I shouldn’t have been assuming that the fitted quadratic was always concave down.

Also vectorized slightly, which fairly significantly improves the solve time. Not much time is spent in cvxpy now; the solve is dominated by about 3 legitimate seconds in OSQP.

You can flip stuff in and out of loops to try different versions. This method is off-policy, so I could keep data around forever. However, it mostly just slowed the solve time.

import gym
import numpy as np
import cvxpy as cvx
from numpy.polynomial.chebyshev import chebval
env = gym.make('Pendulum-v0')
print(env.action_space)
 
print(env.observation_space)
print(env.observation_space.high)
 
print(env.observation_space.low) # [1,1,8]
print(env.action_space.high) # -2
print(env.action_space.low)
 
 
chebdeg = 3
alpha = cvx.Variable(3*chebdeg**3)
alpha.value = np.zeros(3*chebdeg**3)
gamma = 1. - 1./100
def basis(s,a):
    n = np.arange(4)
    f1 = chebval(s[0]/1.,np.eye(chebdeg)).reshape(-1,1,1,1)
    f2 = chebval(s[1]/1.,np.eye(chebdeg)).reshape(1,-1,1,1)
    f3 = chebval(s[2]/8.,np.eye(chebdeg)).reshape(1,1,-1,1)
    f4 = ((a/2)**np.arange(3)).reshape(1,1,1,-1)
    return f1*f2*f3*f4
 
def evalb(alpha, s,a):
    f = basis(s,a)
    return alpha*f.flatten()
 
def maxaction(alpha, obs):
    f1 = np.sum(evalb(alpha.value, obs, 2))
    f2 = np.sum(evalb(alpha.value, obs, 0))
    f3 = np.sum(evalb(alpha.value, obs, -2))
    coeff = np.polyfit([2,0,-2], [f1,f2,f3], deg=2)
    if coeff[0] < 0:
        action = -coeff[1]/2/coeff[0]
        action = min(max(action,-2),2)
    elif f1 > f3:
        action = 2
    else:
        action = -2
    return np.array([action])
 
constraints = []
objective = 0
 
 
for x in range(4):
    constraints = []
    objective = 0
 

    epsilon = 1.2/(x+1)
    print("epsilon: ", epsilon)
    for i_episode in range(50):
        observation = env.reset()
        observations = []
        rewards = []
        actions = []
        reward = -100
        for t in range(100):
 
            prev_obs = observation
            prev_reward = reward
            if np.random.rand() < epsilon:
                action = env.action_space.sample()
            else:
                action = maxaction(alpha, observation)
            observation, reward, done, info = env.step(action)
            observations.append(observation)
            rewards.append(reward)
            actions.append(action)

            #constraints.append(evalb(alpha, prev_obs, action) >= prev_reward + gamma*evalb(alpha, observation, action + np.random.randn()*0.2))#maxaction(alpha, observation)))
            #objective += evalb(alpha, observation, env.action_space.sample())
            if done:
                print("Episode finished after {} timesteps".format(t+1))
                break
        bprev = np.array([basis(s,a).flatten() for (s,a) in zip(observations[:-1],actions[1:])])
        bnext = np.array([basis(s,a+np.random.randn()*0.2).flatten() for (s,a) in zip(observations[1:],actions[1:])])
        rewards = np.array(rewards)[:-1]
        constraints += [(bprev@alpha) >= rewards + gamma*(bnext)@alpha]

        bcost = np.array([basis(s,env.action_space.sample()).flatten() for s in observations])
        objective += cvx.sum(bcost@alpha) # np.sum(bnext,axis=0) * alpha
    loss = 0.1*cvx.sum(cvx.abs(alpha-alpha.value)) + objective 
    prob = cvx.Problem(cvx.Minimize(loss), constraints)

    print("solving problem")
    result = prob.solve(verbose=True)
    print(result)
    print(alpha.value)
    #inspection loop
    for i_episode in range(4):
        observation = env.reset()
        #env.env.state[0] = np.random.randn()*sigma 
        for t in range(200):
            env.render()
            #print(observation)
            prev_obs = observation
            action = maxaction(alpha, observation)
            #print(action)
 
            observation, reward, done, info = env.step(action)
            if done:
                print("Episode finished after {} timesteps".format(t+1))
                break

 

A Touch of Topological Quantum Computation in Haskell Pt. I

Quantum computing exploits the massive vector spaces nature uses to describe quantum phenomena.

The evolution of a quantum system is described by the application of matrices on a vector describing the quantum state of the system. The vector has one entry for every possible state of the system, so the number of entries can get very, very large. Every time you add a new degree of freedom to a system, the size of the total state space gets multiplied by the size of the new DOF, so you have a vector exponentially sized in the number of degrees of freedom.
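In symbols (my notation, just restating that multiplication of state-space sizes):

\dim \mathcal{H} = \prod_{i=1}^{n} d_i, \text{ e.g. } 2^n \text{ for } n \text{ two-level degrees of freedom}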

Now, a couple caveats. We could have described probabilistic dynamics similarly, with a probability associated with each state. The subtle difference is that quantum amplitudes are complex numbers whereas probabilities are positive real numbers. This allows for interference. Another caveat is that when you perform a measurement, you only get a single state, so you are hamstrung by the tiny amount of information you can actually extract out of this huge vector. Nevertheless, there are a handful of situations where, to everyone’s best guess, you get a genuine quantum advantage over classical or probabilistic computation.

Topological quantum computing is based around the braiding of particles called anyons. These particles have a peculiar vector space associated with them and the braiding applies a matrix to this space. In fact, the space associated with the particles can basically only be manipulated by braiding, and other states require more energy or very large scale perturbations to access. Computing using anyons has a robustness compared to traditional quantum computing systems. It can be made extremely unlikely that unwanted states are accessed or unwanted gates applied. The physical nature of the topological quantum system has an intrinsic error correcting power. This situation is schematically similar in some ways to classical error correction on a magnetic hard disk. Suppose some cosmic ray comes down and flips a spin in your hard disk. The physics of magnets makes the spin tend to realign with its neighbors, so the physics supplies an intrinsic classical error correction in this case.

I have found the typical descriptions of how the vector spaces associated with anyons work rather confusing. What we’re going to do is implement these vector spaces in the functional programming language Haskell for concreteness and play around with them a bit.

Anyons

In many systems, the splitting and joining of particles obey rules. Charge has to be conserved. In chemistry, the total number of each individual atom on each side of a reaction must be the same. Or in particle accelerators, lepton number and other junk has to be conserved.

Anyonic particles have their own system of combination rules. Particle A can combine with B to make C or D. Particle B combined with C always makes A. That kind of stuff. These rules are called fusion rules and there are many choices, although they are not arbitrary. They can be described by a table N_{ab}^{c} that holds counts of the ways to combine a and b into c. This table has to be consistent with some algebraic conditions, the hexagon and pentagon equations, which we’ll get to later.

We need to describe particle production trees following these rules in order to describe the anyon vector space.

Fibonacci anyons are one of the simplest anyon systems, and yet sufficiently complicated to support universal quantum computation. There are only two particle types in the Fibonacci system, the I particle and the \tau particle. The I particle is an identity particle (kind of like an electrically neutral particle). It combines with \tau to make a \tau. However, two \tau particles can combine in two different ways, to make another \tau particle or to make an I particle.
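Written as fusion rules, that is

I \otimes I = I, \qquad I \otimes \tau = \tau \otimes I = \tau, \qquad \tau \otimes \tau = I \oplus \tau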

So we make a datatype for the tree structure that has one constructor for each possible particle split and one constructor (TLeaf, ILeaf) for each final particle type. We can use GADTs (Generalized Algebraic Data Types) to make only good particle production history trees constructible. The type has two type parameters carried along with it, the particle at the root of the tree and the leaf-labelled tree structure, represented with nested tuples.

data Tau
data Id
data FibTree root leaves where
   TTT :: FibTree Tau l -> FibTree Tau r -> FibTree Tau (l,r)
   ITT :: FibTree Tau l -> FibTree Tau r -> FibTree Id (l,r) 
   TIT :: FibTree Id l -> FibTree Tau r -> FibTree Tau (l,r)
   TTI :: FibTree Tau l -> FibTree Id r -> FibTree Tau (l,r)
   III :: FibTree Id l -> FibTree Id r -> FibTree Id (l,r)
   TLeaf :: FibTree Tau Tau
   ILeaf :: FibTree Id Id
Free Vector Spaces

We need to describe quantum superpositions of these anyon trees. We’ll consider the particles at the leaves of the tree to be the set of particles that you have at the current moment in time. This is a classical quantity. You will not have a superposition of these leaf particles. However, there are some quantum remnants of the history of how these particles were made. The exact history can never be determined, kind of like how the exact history of a particle going through a double slit cannot be determined. However, there is still a quantum interference effect left over. When you bring particles together to combine them, depending on the quantum connections, you can have different possible resulting particles left over with different probabilities. Recombining anyons and seeing what results is a measurement of the system.

Vectors can be described in different basis sets. The bases for anyon trees are labelled by both a tree structure and what particles are at the root and leaves. Different tree associations are the analog of using some basis x vs some other rotated basis x’. The way we’ve built the type level tags in the FibTree reflects this perspective.

The labelling of inner edges of the tree with particles varies depending on which basis vector we’re talking about. A different inner particle is the analog of \hat{x} vs \hat{y}.

To work with these bases we need to break out of the mindset that a vector put on a computer is the same as an array. While for big iron purposes this is close to true, there are more flexible options. The array style forces you to use integers to index your space, but what if your basis does not very naturally map to integers?

A free vector space over some objects is the linear combination of those objects. This doesn’t have to make any sense. We can form the formal sum (3.7💋+2.3i👩‍🎨) over the emoji basis for example. Until we attach more meaning to it, all it really means is a mapping between emojis and numerical coefficients. We’re also implying by the word vector that we can add two of the combinations coefficient-wise and multiply scalars onto them.

We are going to use free vector spaces as described by the legendary Dan Piponi here: http://blog.sigfpe.com/2007/03/monads-vector-spaces-and-quantum.html

What he does is implement the Free vector space pretty directly. We represent a Vector space using a list of tuples [(a,b)]. The a are basis objects and b are the coefficients attached to them.

import qualified Data.Map as Map  -- used by collect below
import Data.Complex               -- Complex Double coefficients
import Linear.Epsilon (nearZero)  -- assuming nearZero from the 'linear' package

data W b a = W { runW :: [(a,b)] } deriving (Eq,Show,Ord)
instance Semigroup (W b a) where
  (W x) <> (W y) = W (x <> y)
instance Monoid (W b a) where
  mempty = W mempty 
mapW f (W l) = W $ map (\(a,b) -> (a,f b)) l
instance Functor (W b) where
    fmap f (W a) = W $ map (\(a,p) -> (f a,p)) a

instance Num b => Applicative (W b) where
  pure x = W [(x,1)]
  (W fs) <*> (W xs) = W [(f x, a * b) | (f, a) <- fs, (x, b) <- xs] 


instance Num b => Monad (W b) where
   return x = W [(x,1)]
   l >>= f = W $ concatMap (\(W d,p) -> map (\(x,q)->(x,p*q)) d) (runW $ fmap f l)

a .* b = mapW (a*) b

instance (Eq a,Show a,Num b) => Num (W b a) where
     W a + W b = W $ (a ++ b)
     a - b = a + (-1) .* b
     _ * _ = error "Num is annoying"
     abs _ = error "Num is annoying"
     signum _ = error "Num is annoying"
     fromInteger a = if a==0 then W [] else error "fromInteger can only take zero argument"

collect :: (Ord a,Num b) => W b a -> W b a
collect = W . Map.toList . Map.fromListWith (+) . runW

trimZero = W . filter (\(k,v) -> not $ nearZero v) . runW
simplify = trimZero . collect
-- filter (not . nearZero . snd)

type P a = W Double a

type Q a = W (Complex Double) a

 

The Vector monad factors out the linear piece of a computation. Because of this factoring, the type constrains the mapping to be linear, in a similar way that monads in other contexts might guarantee no leaking of impure computations. This is pretty handy. The function you give to bind corresponds to the columns of the matrix: it gives the image of each basis vector.

We need some way to zoom into subtrees and then apply operations there. We define the operations lmap and rmap.

lmap :: (forall a. FibTree a b -> Q (FibTree a c)) -> (FibTree e (b,d) -> Q (FibTree e (c,d)))
lmap f (ITT l r) = fmap (\l' -> ITT l' r) (f l)
lmap f (TTI l r) = fmap (\l' -> TTI l' r) (f l)
lmap f (TIT l r) = fmap (\l' -> TIT l' r) (f l)
lmap f (TTT l r) = fmap (\l' -> TTT l' r) (f l)
lmap f (III l r) = fmap (\l' -> III l' r) (f l)


rmap :: (forall a. FibTree a b -> Q (FibTree a c)) -> (FibTree e (d,b) -> Q (FibTree e (d,c)))
rmap f (ITT l r) = fmap (\r' -> ITT l r') (f r)
rmap f (TTI l r) = fmap (\r' -> TTI l r') (f r)
rmap f (TIT l r) = fmap (\r' -> TIT l r') (f r)
rmap f (TTT l r) = fmap (\r' -> TTT l r') (f r)
rmap f (III l r) = fmap (\r' -> III l r') (f r)

You reference a node by the path it takes to get there from the root. For example,  `(rmap . lmap . rmap) f` applies f at the node that is at the right-left-right position down from the root.

Braiding

For Fibonacci anyons, the only two non-trivial braidings happen when you braid two \tau particles.

braid :: FibTree a (l,r) -> Q (FibTree a (r,l))
braid (ITT l r) = W [(ITT r l,  cis $ 4 * pi / 5)] 
braid (TTT l r) = W [(TTT r l,  (cis $ - 3 * pi / 5))]
braid (TTI l r) = pure $ TIT r l
braid (TIT l r) = pure $ TTI r l
braid (III l r) = pure $ III r l

-- The inverse of braid
braid' :: FibTree a (l,r) -> Q (FibTree a (r,l))
braid' = star . braid

We have only defined how to braid two particles that were split directly from the same particle. How do we describe the braiding for the other cases? Well, we need to give the linear transformation for how to change basis into other tree structures. With that, braiding is also defined for particles that do not share an immediate parent.

F-Moves

We can transform to a new basis where the histories differ by association. We can braid two particles by re-associating the tree until they are together. An association move does not change any of the outgoing leaf positions. It can, however, change a particle in an interior position. We can apply an F-move anywhere inside the tree, not only at the final leaves.

fmove :: FibTree a (c,(d,e)) -> Q (FibTree a ((c,d),e))
fmove (ITT  a  (TIT b c)) = pure $ ITT ( TTI  a b) c
fmove (ITT  a  (TTT b c)) = pure $ ITT ( TTT  a b) c
fmove (ITT  a  (TTI b c)) = pure $ III ( ITT  a b) c

fmove (TIT  a  (TTT b c)) = pure $ TTT ( TIT  a b) c
fmove (TIT  a  (TTI b c)) = pure $ TTI ( TIT  a b) c
fmove (TIT  a  (TIT b c)) = pure $ TIT ( III  a b) c

fmove (TTI  a  (III b c)) = pure $ TTI ( TTI  a b) c
fmove (TTI  a  (ITT b c)) = W [(TIT ( ITT  a b) c, tau)         , (TTT ( TTT  a b) c, sqrt tau)]

fmove (TTT  a  (TTT b c)) = W [(TIT ( ITT  a b) c, sqrt tau)  ,   (TTT ( TTT  a b) c, - tau   )]
fmove (TTT  a  (TTI b c)) = pure $ TTI ( TTT  a b) c
fmove (TTT  a  (TIT b c)) = pure $ TTT ( TTI  a b) c 

fmove (III  a  (ITT b c)) = pure $ ITT ( TIT  a b) c
fmove (III  a  (III b c)) = pure $ III ( III  a b) c



fmove' :: FibTree a ((c,d),e) -> Q (FibTree a (c,(d,e)))
fmove' (ITT ( TTI  a b) c) = pure $ (ITT  a  (TIT b c))
fmove' (ITT ( TTT  a b) c) = pure $  (ITT  a  (TTT b c))
fmove' (ITT ( TIT  a b) c) = pure $  (III  a  (ITT b c))



fmove' (TTI ( TTT  a b) c) = pure $ (TTT  a  (TTI b c))
fmove' (TTI ( TTI  a b) c) = pure $ (TTI  a  (III b c))
fmove' (TTI ( TIT  a b) c) = pure $ TIT  a  (TTI b c)

fmove' (TTT ( TTI  a b) c ) = pure $ TTT  a  (TIT b c)
fmove' (TTT ( TIT  a b) c ) = pure $ TIT  a  (TTT b c)
fmove' (TTT ( TTT  a b) c) = W [(TTI  a  (ITT b c), sqrt tau)  , (TTT  a  (TTT b c),   - tau  )]


fmove' (TIT ( ITT  a b) c) = W [(TTI  a  (ITT b c), tau)         , (TTT  a  (TTT b c) , sqrt tau)]
fmove' (TIT ( III  a b) c ) = pure $ TIT  a  (TIT b c)

fmove' (III ( III  a b) c ) = pure $ III  a  (III b c)
fmove' (III ( ITT  a b) c) = pure $ ITT  a  (TTI b c)

 

Fusion / Dot product

Two particles that split can only fuse back into themselves. So the definition is pretty trivial. This is like \hat{e}_i \cdot \hat{e}_j = \delta_{ij}.

dot :: FibTree a (b, c) -> FibTree a' (b, c) -> Q (FibTree a' a)
dot x@(TTI _ _) y@(TTI _ _) | x == y = pure TLeaf
                             | otherwise = mempty
dot x@(TIT _ _) y@(TIT _ _) | x == y = pure TLeaf
                             | otherwise = mempty
dot x@(TTT _ _) y@(TTT _ _) | x == y = pure TLeaf
                            | otherwise = mempty
dot x@(III _ _) y@(III _ _) | x == y = pure ILeaf
                            | otherwise = mempty
dot x@(ITT _ _) y@(ITT _ _) | x == y = pure ILeaf
                            | otherwise = mempty
dot _ _ = mempty

Hexagon and Pentagon equations

The F and R matrices and the fusion rules need to obey consistency conditions called the hexagon and pentagon equations. Certain simple rearrangements have alternate ways of being achieved. The alternative paths need to agree.

pentagon1 ::  FibTree a (e,(d,(c,b))) -> Q (FibTree a (((e,d),c),b))
pentagon1 v =  do 
                 v1 <- fmove v
                 fmove v1

pentagon2 :: FibTree a (b,(c,(d,e))) -> Q (FibTree a (((b,c),d),e))
pentagon2 v = do
                v1 :: FibTree a (b,((c,d),e)) <- rmap fmove v
                v2 :: FibTree a ((b,(c,d)),e) <- fmove v1
                lmap fmove v2

ex1 = TTT TLeaf (TTT TLeaf (TTT TLeaf TLeaf))
-- returns empty
pentagon =  simplify $ ((pentagon1 ex1) - (pentagon2 ex1))

hexagon1 :: FibTree a (b,(c,d)) -> Q (FibTree a ((d,b),c))
hexagon1 v = do
             v1 :: FibTree a ((b,c),d) <- fmove v
             v2 :: FibTree a (d,(b,c)) <- braid v1
             fmove v2  

hexagon2 :: FibTree a (b,(c,d)) -> Q (FibTree a ((d,b),c))
hexagon2 v = do
             v1 :: FibTree a (b,(d,c)) <- rmap braid v
             v2 :: FibTree a ((b,d),c) <- fmove v1
             lmap braid v2


ex2 = (TTT TLeaf (TTT TLeaf TLeaf))
--returns empty
hexagon =  simplify $ ((hexagon1 ex2) - (hexagon2 ex2))

Next Time:

With this, we have the rudiments of what we need to describe the manipulation of anyon spaces. However, applying F-moves manually is rather laborious. Next time we’ll look into automating this using arcane type-level programming. You can take a peek at my trash WIP repo here.

 

References:
A big ole review on topological quantum computation: https://arxiv.org/abs/0707.1889
Ady Stern on The fractional quantum hall effect and anyons: https://www.sciencedirect.com/science/article/pii/S0003491607001674

 

Another good anyon tutorial: https://arxiv.org/abs/0902.3275

Mathematica program that I still don’t get, but is very interesting: http://www.cs.ox.ac.uk/people/jamie.vicary/twovect/guide.pdf

Kitaev huge Paper: https://arxiv.org/abs/cond-mat/0506438

Bonderson thesis: https://thesis.library.caltech.edu/2447/2/thesis.pdf

Bernevig review: https://arxiv.org/abs/1506.05805

More food for thought:

The Rosetta Stone: http://math.ucr.edu/home/baez/rosetta.pdf

http://blog.sigfpe.com/2008/08/hopf-algebra-group-monad.html

http://haskellformaths.blogspot.com/2012/03/what-is-hopf-algebra.html