## Variational Method of the Quantum Simple Harmonic Oscillator using PyTorch

A fun method (and useful!) for solving the ground state of the Schrodinger equation is to minimize the energy integral $dx \psi^\dagger H \psi$ while keeping the total probability 1. Pytorch is a big ole optimization library, so let’s give it a go.

I’ve tried two versions, using a stock neural network with relus and making it a bit easier by giving a gaussian with variable width and shift.

We can mimic the probability constraint by dividing by to total normalization $\int dx \psi^\dagger \psi$. A Lagrange multiplier or penalty method may allows us to access higher wavefunctions.

SGD seems to do a better job getting a rounder gaussian, while Adam is less finicky but makes a harsh triangular wavefunction.

The ground state solution of $-\frac{d^2\psi}{dx^2} + x^2\psi=E\psi$ is $e^{-x^2/2}$, with an energy of 1/2 (unless I botched up my head math). We may not get it, because we’re not sampling a very good total domain. Whatever, for further investigation.

Very intriguing is that pytorch has a determinant in it, I believe. That opens up the possibility of doing a Hartree-Fock style variational solution.

Here is my garbage

import torch
import matplotlib.pyplot as plt
import numpy as np
import torch.optim
from scipy import linalg
import time

import torch.nn as nn
import torch.nn.functional as F

class Psi(nn.Module):
def __init__(self):
super(Psi, self).__init__()

# an affine operation: y = Wx + b
self.lin1 = nn.Linear(1, 10) #Takes x to the 10 different hats
self.lin2 = nn.Linear(10, 1) #add up the hats.
#self.lin1 = nn.Linear(1, 1) #Takes x to the 10 different hats
#self.lin2 = nn.Linear(2, 1) #add up the hats.
def forward(self, x):
# Max pooling over a (2, 2) window
shifts = self.lin1(x)
hats =  F.relu(shifts) #hat(shifts)hat(shifts)
y = self.lin2(hats)
#y = torch.exp(- shifts ** 2 / 4)
return y

#a = torch.ones(10, requires_grad=True)
#z = torch.linspace(0, 1, steps=10)

# batch variable for monte carlo integration
x = torch.randn(10000,1, requires_grad=True) # a hundred random points between 0 and 1

psi = Psi()
y = psi(x)
import torch.optim as optim

# create your optimizer

optimizer =  optim.SGD(psi.parameters(), lr=0.0001, momentum=0.9, nesterov=True)
#y2 = torch.sin(np.pi*x)
#print(y)
#x2 = x.clone()
plt.scatter(x.detach().numpy(),y.detach().numpy(), label="original")
scalar = torch.ones(1,1)
for i in range(4000):
#y.backward(torch.ones(100,1), create_graph=True)
x = torch.randn(1000,1, requires_grad=True) # a hundred random points between 0 and 1

y = psi(x)

y.backward(torch.ones(1000,1), create_graph=True)
#print(dir(x)) #x.grad ** 2 +
E = torch.sum(x.grad ** 2 + x**2 * y**2)#+ 10*(psi(scalar*0)**2 + psi(scalar)**2)
N = torch.sum(y ** 2)
L = E/N
print(L)
L.backward()
optimizer.step()
for param in psi.parameters():
print(param)
plt.scatter(x.detach().numpy(),y.detach().numpy(), label="new")
plt.legend()
plt.show()

# may want to use the current wavefunction for gibbs style sampling
# we need to differentiate with respect to x for the kinetic energy

Edit: Hmm I didn’t compensate for the fact I was using randn sampling. That was a goof. I started using unifrom sampling, which doesn’t need compensation

## Pytorch Trajectory Optimization Part 4: Cleaner code, 50Hz

Cleaned up the code more and refactored some things.

Added backtracking. It will backtrack on the dx until the function is actually decreasing.

Prototyped the online part with shifts. Seems to work well with a fixed penalty parameter rho~100. Runs at ~50Hz with pretty good performance at 4 optimization steps per time step. Faster or slower depending on the number of newton steps per time step we allow ourselves.  Still to see if the thing will control an actual cartpole.

The majority of time is spent just backwards calculating the hessian still (~50%).

I’ve tried a couple different schemes (direct projection of the delLy terms or using y = torch.eye). None particularly seem to help.

The line search is also fairly significant (~20% of the time) but it really helps with both stability and actually decreasing the number of hessian steps, so it is an overall win. Surprisingly during the line search, projecting out the batch to 0 doesn’t matter much. How could this possibly make sense?

What I should do is pack this into a class that accepts new state observations and initializes with the warm start. Not clear if I should force the 4 newton steps on you or let you call them yourself. I think if you use too few it is pretty unstable (1 doesn’t seem to work well. 2 might be ok and gets you up to 80Hz maybe.)

The various metaparameters should be put into the __init__. The stopping cutoff  1e-7, Starting rho (~0.1), rho increase (x10) , backtrack alpha decrease factor (0.5 right now), the online rho (100). Hopefully none of these matter two much. I have noticed going too small with cutoff leading to endless loops.

Could swapping the ordering of time step vs variable number maybe help?

For inequality constraints like the track length and forces, exponential barriers seems like a more stable option compared to log barriers. Log barriers at least require me to check if they are going NaN.

I attempted the pure Lagrangian version where lambda is just another variable. It wasn’t working that great.

import torch
import matplotlib.pyplot as plt
import numpy as np
import torch.optim
from scipy import linalg
import time

N = 100
T = 10.0
dt = T/N
NVars = 4
NControls = 1
# Enum values
X = 0
V = 1
THETA = 2

#The bandwidth number for solve_banded

bandn = (NVars+NControls)*3//2
# We will use this many batches so we can get the entire hessian in one pass
batch = bandn * 2 + 1

def getNewState():
#we 're going to also pack f into here
#The forces have to come first for a good variable ordering the the hessian
x = torch.zeros(batch,N,NVars+NControls, requires_grad=True)
l = torch.zeros(1, N-1,NVars, requires_grad=False)
return x, l

#Compute the residual with respect to the dynamics
def dynamical_res(x):
f = x[:,1:,:NControls]
x = x[:,:,NControls:]

delx = (x[:,1:,:] - x[:, :-1,:]) / dt

xbar = (x[:,1:,:] + x[:, :-1,:]) / 2
#dxdt = torch.zeros(x.shape[0], N-1,NVars)
dxdt = torch.zeros_like(xbar)
dxdt[:,:,X] = xbar[:,:,V]
dxdt[:,:,V] = f[:,:,0]
dxdt[:,:,THETADOT] = -torch.sin(xbar[:,:,THETA]) + f[:,:,0]*torch.cos(xbar[:,:,THETA])

xres = delx - dxdt
return xres

def calc_loss(x, l, rho):
xres = dynamical_res(x)
# Some regularization. This encodes sort of that all variables -100 < x< 100
cost = 0.1*torch.sum(x**2)
# The forces have to come first for a good variable ordering the the hessian
f = x[:,1:,:NControls]
x = x[:,:,NControls:]

lagrange_mult = torch.sum(l * xres)
penalty = rho*torch.sum(xres**2)

#Absolute Value craps it's pants unfortunately.
#I tried to weight it so it doesn't feel bad about needing to swing up
cost +=  1.0*torch.sum((x[:,:,THETA]-np.pi)**2 * torch.arange(N) / N )
cost += 0.5*torch.sum(f**2)
xlim = 0.4
#Some options to try for inequality constraints. YMMV.
#cost += rho*torch.sum(-torch.log(xbar[:,:,X] + xlim) - torch.log(xlim - xbar[:,:,X]))
#The softer inequality constraint seems to work better.
# the log loses it's mind pretty easily
# tried adding ln rho in there to make it harsher as time goes on?
#cost += torch.sum(torch.exp((-xbar[:,:,X] - xlim)*(5+np.log(rho+0.1))) + torch.exp((xbar[:,:,X]- xlim)*(5+np.log(rho+0.1))))
#Next one doesn't work?
#cost += torch.sum(torch.exp((-xbar[:,:,X] - xlim)) + torch.exp((xbar[:,:,X]- xlim)))**(np.log(rho/10+3))
total_cost =  cost + lagrange_mult + penalty

def getGradHessBand(loss, B, x):
# get gradient. create_graph allows higher order derivatives
delL = delL0[:,1:,:].view(B,-1,B) #remove x0
#y is used to sample the appropriate rows
#y = torch.zeros(B,N-1,NVars+NControls, requires_grad=False).view(B,-1)
# There is probably a way to do it this way.
# Would this be a speed up?
y = torch.eye(B).view(B,1,B)
#print(y.shape)
#print(delL.shape)
#delL = delL.view(B,-1)
#y = torch.zeros(B,N-1,NVars+NControls, requires_grad=False).view(B,-1)

#for i in range(B):
#	y[i,i::B]=1
#delL = delL.view(B,-1)
#temp = 0
#for i in range(B):
#	temp += torch.sum(delL[i,:,i]) #Direct projection is not faster

delLy = torch.sum(delL * y)
delL = delL.view(B,-1)

delLy.backward()
#temp.backward()
#reshuffle columns to actuall be correct
for i in range(B):
nphess[:,i::B] = np.roll(nphess[:,i::B], -i+B//2, axis=0)
#returns gradient and hessian flattened
return delL.detach().numpy()[0,:].reshape(-1), nphess

def line_search(x, dx, total_cost, newton_dec):
#x1 = torch.unsqueeze(x[0],0)
xnew = torch.tensor(x) #Make a copy
alpha = 1
prev_cost = torch.tensor(total_cost) #Current total cost
done = False
# do a backtracking line search
while not done:
try:
xnew[:,1:,:] = x[:,1:,:] - alpha * dx
#print(xnew.shape)
total_cost = calc_loss(xnew, l, rho)
if alpha < 1e-8:
print("Alpha small: Uh oh")
done = True
if total_cost < prev_cost: # - alpha * 0.5 * batch * newton_dec:
done = True
else:
print("Backtrack")
alpha = alpha * 0.5
except ValueError: #Sometimes you get NaNs if you have logs in cost func
print("Out of bounds")
alpha = alpha * 0.1
x[:,1:,:] -= alpha * dx #Commit the change
return x

def opt_iteration(x, l, rho):
total_cost = calc_loss(x, l, rho)

#Try to solve the linear system. Sometimes, it fails
# in which case just defualt to gradient descent
# you're probably fucked though
try:
dx = linalg.solve_banded((bandn,bandn), hess, gradL, overwrite_ab=True)
except ValueError:
print("ValueError: Hess Solve Failed.")
except LinAlgError:
print("LinAlgError: Hess Solve Failed.")
x.grad.data.zero_() # Forgetting this causes awful bugs. I think this has to be here
newton_dec = np.dot(dx,gradL) # quadratic estimate of cost improvement
dx = torch.tensor(dx.reshape(1,N-1,NVars+NControls)) # return to original shape
x = line_search(x, dx, total_cost, newton_dec)

# If newton decrement is a small percentage of cost, quit
done = newton_dec < 1e-7*total_cost.detach().numpy()
return x, done

#Initial Solve
x, l = getNewState()
rho = 0.0
count = 0
for j in range(6):
while True:
count += 1
print("Count: ", count)
x, done = opt_iteration(x,l,rho)
if done:
break
xres = dynamical_res(x[0].unsqueeze(0))
print(xres.shape)
print(l.shape)
l += 2 * rho * xres
print("upping rho")
rho = rho * 10 + 0.1

#Online Solve
start = time.time()
NT = 10
for t in range(NT): # time steps
print("Time step")
x[:,0:-1,:] = x[:,1:,:] # shift forward one step
l[:,0:-1,:] = l[:,1:,:]
#x[:,0,:] = x[:,1,:] + torch.randn(1,NVars+NControls)*0.05 #Just move first position
rho = 100
for i in range(1): # how many penalty pumping moves
for m in range(4): # newton steps
print("Iter Step")
x, done = opt_iteration(x,l,rho)
xres = dynamical_res(x[0].unsqueeze(0))
l += 2 * rho * xres
rho = rho * 10
end = time.time()
print(NT/(end-start), "Hz" )

plt.plot(xres[0,:,0].detach().numpy(), label='Xres')
plt.plot(xres[0,:,1].detach().numpy(), label='Vres')
plt.plot(xres[0,:,2].detach().numpy(), label='THeres')

plt.legend(loc='upper right')
plt.figure()
#plt.subplot(132)
plt.plot(x[0,:,1].detach().numpy(), label='X')
plt.plot(x[0,:,2].detach().numpy(), label='V')
plt.plot(x[0,:,3].detach().numpy(), label='Theta')
plt.plot(x[0,:,0].detach().numpy(), label='F')
#plt.plot(cost[0,:].detach().numpy(), label='F')
plt.legend(loc='upper right')
#plt.figure()
#plt.subplot(133)
#plt.plot(costs)
print("hess count: ", count)

plt.show()

## Pytorch Trajectory Optimization 3: Plugging in the Hessian

I plugged in the hessian extraction code for using newton steps.

When I profiled it using the oh so convenient https://github.com/rkern/line_profiler I found almost all of my time was spent in the delLy.backwards step. For each hessian I needed to run this B (the band width) times and each time cost ~0.6ms. For the entire run to converge took about 70 iterations and 1000 runs of this backwards step, which came out to 0.6 seconds. It is insane, but actually even calculating the band of the hessian costs considerably more time than inverting it.

So to speed this up, I did a bizarre thing. I replicated the entire system B times. Then I can get the entire hessian band in a single call to backwards. remarkably, although B ~ 15 this only slowed backwards down by 3x. This is huge savings actually, while obviously inefficient. The entire program has gone down from 1.1s to 0.38s, roughly a 3x improvement. All in all, this puts us at 70/0.38 ~ 185 Hz for a newton step. Is that good enough? I could trim some more fat. The Fast MPC paper http://web.stanford.edu/~boyd/papers/fast_mpc.html says we need about ~5 iterations to tune up a solution, this would mean running at 40Hz. I think that might be ok.

Since the hessian is hermitian it is possible to use roughly half the calls (~B/2), but then actually extracting the hessian is not so simple. I haven’t figured out a way to comfortably do such a thing yet. I think I could figure out the first column and then subtract (roughly some kind of gaussian elimination procedure).

It has helped stability to regularize everything with a surprising amount of weight in the cost. I guess since I anticipate all values being in the range of -10,10, maybe this makes sense.

Now I need to try not using this augmented Lagrangian method and just switching to a straight newton step.

Edit: Ooh. Adding a simple backtracking line search really improves stability.

import torch
import matplotlib.pyplot as plt
import numpy as np
import torch.optim
from scipy import linalg

N = 100
T = 10.0
dt = T/N
NVars = 4
NControls = 1
batch = (NVars+NControls)*3
def getNewState():
#we 're going to also pack f into here
x = torch.zeros(batch,N,NVars+NControls, requires_grad=True)

#f = torch.zeros(batch, N-1, requires_grad=True)

l = torch.zeros(batch, N-1,NVars, requires_grad=False)
return x, l

def calc_loss(x, l , rho, prox=0): # l,
#depack f, it has one less time point
cost = 0.1*torch.sum(x**2)
#cost += prox * torch.sum((x - x.detach())**2)
f = x[:,1:,:NControls]
#leftoverf = x[:,0,:NControls]
x = x[:,:,NControls:]

delx = (x[:,1:,:] - x[:, :-1,:]) / dt

xbar = (x[:,1:,:] + x[:, :-1,:]) / 2
dxdt = torch.zeros(batch, N-1,NVars)
THETA = 2
X = 0
V = 1
dxdt[:,:,X] = xbar[:,:,V]
#print(dxdt.shape)
#print(f.shape)
dxdt[:,:,V] = f[:,:,0]
dxdt[:,:,THETADOT] = -torch.sin(xbar[:,:,THETA]) + f[:,:,0]*torch.cos(xbar[:,:,THETA])

xres = delx - dxdt

lagrange_mult = torch.sum(l * xres)

#cost = torch.sum((x+1)**2+(x+1)**2, dim=0).sum(0).sum(0)
#cost += torch.sum((f+1)**2, dim=0).sum(0).sum(0)
#cost += 1
penalty = rho*torch.sum( xres**2)

#cost += 1.0*torch.sum((abs(x[:,:,THETA]-np.pi)), dim=1)
#cost = 1.0*torch.sum((x[:,:,:]-np.pi)**2 )
cost += 1.0*torch.sum((x[:,:,THETA]-np.pi)**2 * torch.arange(N) / N )
cost += 0.5*torch.sum(f**2)
#cost = 1.0*torch.sum((x[:,:,:]-np.pi)**2 )
#cost = cost1 + 1.0
#cost += 0.01*torch.sum(x**2, dim=1).sum(0).sum(0)
#xlim = 3
#cost += 0.1*torch.sum(-torch.log(xbar[:,:,X] + xlim) - torch.log(xlim - xbar[:,:,X]))
#cost += 0.1*torch.sum(-torch.log(xbar[:,:,V] + 1) - torch.log(1 - xbar[:,:,V]),dim=1)
#cost += (leftoverf**2).sum()

#total_cost = cost + lagrange_mult + penalty

return cost, penalty, lagrange_mult, xres

def getFullHess(): #for experimentation
pass

def getGradHessBand(loss, B, x):
#B = bandn
delL = delL0[:,1:,:].view(B,-1) #remove x0
print("del ", delL[:,:10])
#hess = torch.zeros(B,N-1,NVars+NControls, requires_grad=False).view(B,B,-1)
y = torch.zeros(B,N-1,NVars+NControls, requires_grad=False).view(B,-1)

#y = torch.eye(B).view(B,1,B)
#print(y.shape)
for i in range(B):
#y = torch.zeros(N-1,NVars+NControls, requires_grad=False).view(-1)
y[i,i::B]=1
#print(y[:,:2*B])
print(y.shape)
print(delL.shape)
delLy = torch.sum(delL * y)
#print(delLy)

delLy.backward() #(i != B-1)
#print(hess.shape)
#hess[i,:] = x.grad[:,1:,:].view(-1) #also remove x0
#print(hess[i,:])

#print(hess)
nphess = x.grad[:,1:,:].view(B,-1).detach().numpy()# .view(-1)# hess.detach().numpy()
#print(nphess[:,:4])
#print(nphess)
for i in range(B):
nphess[:,i::B] = np.roll(nphess[:,i::B], -i+B//2, axis=0)
print(nphess[:,:4])
#hessband = removeX0(nphess[:B//2+1,:])
return delL.detach().numpy()[0,:], nphess #hessband

x, l = getNewState()
rho = 0.1
prox = 0.0
for j in range(10):
while True:
try:
cost, penalty, lagrange_mult, xres = calc_loss(x, l, rho, prox)
#print(total_cost)
print("hey now")
#print(cost)
total_cost = cost + lagrange_mult + penalty
#total_cost = cost
#print(hess)
#print(hess.shape)

#easiest thing might be to put lagrange mutlipleirs into x.
#Alternatively, use second order step in penalty method.
bandn = (NVars+NControls)*3//2
print(hess.shape)
dx = linalg.solve_banded((bandn,bandn), hess, gradL) #
#print(hess)
#print(hess[bandn:,:])
#dx = linalg.solveh_banded(hess[:bandn+1,:], gradL, overwrite_ab=True)
#df0 = dx[:NControls].reshape(-1,NControls)
dx = dx.reshape(1,N-1,NVars+NControls)

x[:,1:,:] -= torch.tensor(dx)
print(x[:,:5,:])
#print(x[:,0,NVars:].shape)
#print(df0.shape)
costval = cost.detach().numpy()
#break
if newton_dec < 1e-10*costval:
break
except np.linalg.LinAlgError:
print("LINALGERROR")
prox += 0.1
#break
#print(x)
l += 2 * rho * xres
rho = rho * 2 #+ 0.1
#print(x)
#plt.subplot(131)
plt.plot(xres[0,:,0].detach().numpy(), label='Xres')
plt.plot(xres[0,:,1].detach().numpy(), label='Vres')
plt.plot(xres[0,:,2].detach().numpy(), label='THeres')

plt.legend(loc='upper right')
plt.figure()
#plt.subplot(132)
plt.plot(x[0,:,1].detach().numpy(), label='X')
plt.plot(x[0,:,2].detach().numpy(), label='V')
plt.plot(x[0,:,3].detach().numpy(), label='Theta')
plt.plot(x[0,:,0].detach().numpy(), label='F')
#plt.plot(cost[0,:].detach().numpy(), label='F')
plt.legend(loc='upper right')
#plt.figure()
#plt.subplot(133)
#plt.plot(costs)

plt.show()

## Extracting a Banded Hessian in PyTorch

So pytorch does have some capability towards higher derivatives, with the caveat that you have to dot the gradients to turn them back into scalars before continuing. What this means is that you can sample a single application of the  Hessian (the matrix of second derivatives) at a time.

One could sample out every column of the hessian for example. Performance-wise I don’t know how bad this might be.

For a banded hessian, which will occur in a trajectory optimization problem (the bandedness being a reflection of the finite difference scheme), you don’t need that many samples. This feels more palatable. You only need to sample the hessian roughly the bandwidth number of times, which may be quite small. Plus, then you can invert that banded hessian very quickly using special purpose banded matrix solvers, which are also quite fast. I’m hoping that once I plug this into the trajectory optimization, I can use a Newton method (or SQP?) which will perform better than straight gradient descent.

If you pulled just a single column using [1,0,0,0,0,0..] for example, that would be wasteful, since there are so many known zeros in the banded matrix. Instead something like [1,0,0,1,0,0,1,0,0..] will not have any zeros in the result. This gets us every 3rd row of the matrix. Then we can sample with shifted versions like [0,1,0,0,1,0,0,1,0,0..]. until we have all the rows somewhere. Then there is some index shuffling to put the thing into a sane ordering, especially so that we can use https://docs.scipy.org/doc/scipy/reference/generated/scipy.linalg.solveh_banded.html which requires the banded matrix to be given in a particular form.

An alternative approach might be to use an fft with some phase twiddling. Also it feels like since the Hessian is hermitian we ought to be able to use about half the samples, since half are redundant, but I haven’t figured out a clean way to do this yet. I think that perhaps sampling with random vectors and then solving for the coefficients would work, but again how to organize the code for such a thing?

Here’s a snippet simulatng extracting the band matrix from matrix products.

import numpy as np

N = 6
B = 5
h = np.diag(np.random.randn(N))
h = h + h.T # symmetrize our matrix
print(h)
band = y = np.zeros((B, N))
for i in range(B):
y = np.zeros(N)
y[i::B]=1
band[i,:] = h @ y
print(band)
for i in range(B):
band[:,i::B] = np.roll(band[:,i::B], -i, axis=0) #B//2

print(band)
print(band[:B//2+1,:])

and here is the full pytorch implementation including a linear banded solve.

import torch
import matplotlib.pyplot as plt
import numpy as np
import torch.optim
from scipy import linalg
import matplotlib.pyplot as plt

N = 12

x = torch.zeros(N, requires_grad=True)

L = torch.sum((x[1:] - x[ :-1])**2)/2 + x[0]**2/2 + x[-1]**2/2 #torch.sum((x**2))

#L.backward()
B = 3

print(delL)

hess = torch.zeros(3,N, requires_grad=False)
for i in range(3):
y = torch.zeros(N, requires_grad=False)
y[i::3]=1
delLy = delL @ y

delLy.backward(retain_graph=True)
print(hess)
nphess = hess.detach().numpy()
print(nphess)
for i in range(B):
nphess[:,i::B] = np.roll(nphess[:,i::B], -i, axis=0)

print(nphess)
print(nphess[:B//2+1,:])
hessband = nphess[:B//2+1,:]
b = np.zeros(N)
b[4]=1
x = linalg.solveh_banded(hessband, b, lower=True)
print(x)

plt.plot(x)
plt.show()

Output:

tensor([ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.])
None
tensor([ 2., -1., -1.,  2., -1., -1.,  2., -1., -1.,  2., -1.,  0.])
tensor([-1.,  2., -1., -1.,  2., -1., -1.,  2., -1., -1.,  2., -1.])
tensor([ 0., -1.,  2., -1., -1.,  2., -1., -1.,  2., -1., -1.,  2.])
tensor([[ 2., -1., -1.,  2., -1., -1.,  2., -1., -1.,  2., -1.,  0.],
[-1.,  2., -1., -1.,  2., -1., -1.,  2., -1., -1.,  2., -1.],
[ 0., -1.,  2., -1., -1.,  2., -1., -1.,  2., -1., -1.,  2.]])
[[ 2. -1. -1.  2. -1. -1.  2. -1. -1.  2. -1.  0.]
[-1.  2. -1. -1.  2. -1. -1.  2. -1. -1.  2. -1.]
[ 0. -1.  2. -1. -1.  2. -1. -1.  2. -1. -1.  2.]]
[[ 2.  2.  2.  2.  2.  2.  2.  2.  2.  2.  2.  2.]
[-1. -1. -1. -1. -1. -1. -1. -1. -1. -1. -1.  0.]
[ 0. -1. -1. -1. -1. -1. -1. -1. -1. -1. -1. -1.]]
[[ 2.  2.  2.  2.  2.  2.  2.  2.  2.  2.  2.  2.]
[-1. -1. -1. -1. -1. -1. -1. -1. -1. -1. -1.  0.]]
[0.61538462 1.23076923 1.84615385 2.46153846 3.07692308 2.69230769
2.30769231 1.92307692 1.53846154 1.15384615 0.76923077 0.38461538]

## PyTorch Trajectory Optimization Part 2: Work in Progress

I actually have been plotting the trajectories, which is insane that I wasn’t already doing in part 1. There was clearly funky behavior.

Alternating the Lagrange multiplier steps and the state variable steps seems to have helped with convergence. Adding a cost to the dynamical residuals seems to have helped clean them up also.

I should attempt some kind of analysis rather than making shit up. Assuming quadratic costs (and dynamics), the problem is tractable. The training procedure is basically a dynamical system.

Changed the code a bit to use more variables. Actually trying the cart pole problem now. The results seem plausible. A noisy but balanced dynamical residual around zero. And the force appears to flip it’s direction as the pole crosses the horizontal.

Polyak’s step length

The idea is that if you know the optimal value you’re trying to achieve, that gives you a scale of gradient to work with. Not as good as a hessian maybe, but it’s somethin’. If you use a gradient step of $x + (f-f*)\frac{\nabla f}{|\nabla f|^2}$ it at least has the same units as x and not f/x. In some simple models of f, this might be exactly the step size you’d need. If you know you’re far away from optimal, you should be taking some big step sizes.

The Polyak step length has not been useful so far. Interesting idea though.

import torch
import matplotlib.pyplot as plt
import numpy as np
batch = 1
N = 50
T = 7.0
dt = T/N
NVars = 4

x = torch.zeros(batch,N,NVars, requires_grad=True)

#print(x)

#v = torch.zeros(batch, N, requires_grad=True)

f = torch.zeros(batch, N-1, requires_grad=True)

l = torch.zeros(batch, N-1,NVars, requires_grad=True)
#	x[0,0,2] = np.pi

'''
class Vars():
def __init__(self, N=10):
self.data = torch.zeros(batch, N, 2)
self.data1 = torch.zeros(batch, N-1, 3)
self.lx = self.data1[:,:,0]
self.lv = self.data1[:,:,1]
self.f = self.data1[:,:,2]
self.x = self.data[:,:,0]
self.v = self.data[:,:,1]
'''
def calc_loss(x,f, l):
delx = (x[:,1:,:] - x[:, :-1,:]) / dt

xbar = (x[:,1:,:] + x[:, :-1,:]) / 2
dxdt = torch.zeros(batch, N-1,NVars)
THETA = 2
X = 0
V = 1
dxdt[:,:,X] = xbar[:,:,V]
dxdt[:,:,V] = f
dxdt[:,:,THETADOT] = -torch.sin(xbar[:,:,THETA]) + torch.cos(f)

xres = delx - dxdt

#dyn_err = torch.sum(torch.abs(xres) + torch.abs(vres), dim=1) #torch.sum(xres**2 + vres**2, dim=1) # + Abs of same thing?

lagrange_mult = torch.sum(l * xres, dim=1).sum(1)

cost = 0.1*torch.sum((x[:,:,X]-1)**2, dim=1)  + .1*torch.sum( f**2, dim=1) + torch.sum( torch.abs(xres), dim=1).sum(1)*dt +  torch.sum((x[:,:,THETA]-np.pi)**2, dim=1)*0.01

#cost = torch.sum((x-1)**2, dim=1)

total_cost =   cost + lagrange_mult  #100 * dyn_err + reward

import torch.optim as optim
'''
# create your optimizer
optimizer = optim.SGD(net.parameters(), lr=0.01)

# in your training loop:
output = net(input)
loss = criterion(output, target)
loss.backward()
optimizer.step()    # Does the update
'''

#Could interleave an ODE solve step - stinks that I have to write dyanmics twice
#Or interleave a sepearate dynamic solving
# Or could use an adaptive line search. Backtracking
# Goal is to get dyn_err quite small

'''
learning_rate = 0.001
for i in range(40):
total_cost=calc_loss(x,v,f)
total_cost.backward()
while dyn_loss > 0.01:
dyn_loss.backward()
learning_rate = dyn_loss / (torch.norm(x.grad[:,1:]) + (torch.norm(v.grad[:,1:])
x[:,1:] -= learning_rate * x.grad[:,1:] # Do not change Starting conditions
v[:,1:] -= learning_rate * v.grad[:,1:]
reward.backward()
f -= learning_rate * f.grad
'''
learning_rate = 0.005
costs= []
for i in range(4000):
total_cost, lagrange, cost, xres = calc_loss(x,f, l)
costs.append(total_cost[0])
print(total_cost)
#print(x)

total_cost.backward()
f -= learning_rate * f.grad
#l += learning_rate * l.grad

x[:,1:,:] -= learning_rate * x.grad[:,1:,:] # Do not change Starting conditions

total_cost, lagrange, cost, xres = calc_loss(x,f, l)
costs.append(total_cost[0])
print(total_cost)
#print(x)

total_cost.backward()
#f -= learning_rate * f.grad
l += learning_rate * l.grad

#x[:,1:,:] -= learning_rate * x.grad[:,1:,:] # Do not change Starting conditions

print(x)
#print(v)
print(f)

plt.plot(xres[0,:,0].detach().numpy(), label='Xres')
plt.plot(xres[0,:,1].detach().numpy(), label='Vres')
plt.plot(xres[0,:,2].detach().numpy(), label='THeres')
plt.plot(x[0,:,0].detach().numpy(), label='X')
plt.plot(x[0,:,1].detach().numpy(), label='V')
plt.plot(x[0,:,2].detach().numpy(), label='Theta')
plt.plot(f[0,:].detach().numpy(), label='F')

#plt.plot(costs)
#plt.plot(l[0,:,0].detach().numpy(), label='Lx')

plt.legend(loc='upper left')
plt.show()

Problems:

1. The step size is ad hoc.
2. Lagrange multiplier technique does not seem to work
3. Takes a lot of steps
4. diverges
5. seems to not be getting an actual solution
6. Takes a lot of iterations

On the table: Better integration scheme. Hermite collocation?

Be more careful with scaling, what are the units?

mutiplier smoothing. Temporal derivative of lagrange multiplier in cost?

alternate more complete solving steps

huber on theta position cost. Square too harsh? Punishes swing away too much?

more bullshit optimization strats as alternatives to grad descent

weight sooner more than later. Care more about earlier times since want to do model predictive control

Just solve eq of motion don’t worry about control as simpler problem

Pole up balancing

logarithm squeezing method – nope

The lambda * x model of lagrange mulitplier. Leads to oscillation

Damping term?

This learning rate is more like a discretization time step than a decay parameter. Well the product of both actually.

Heat equation model. Kind of relaxing everything into place

______________________________

Switched to using pytorch optimizers. Adam seems to work the best. Maybe 5x as fast convergence as my gradient descent. Adagrad and Adadelta aren’t quite as good. Should still try momentum. Have to reset the initial conditions after every iteration. A better way? Maybe pass x0 in to calc_loss separately?

Switched over to using the method of multipliers http://www.cs.cmu.edu/~pradeepr/convexopt/Lecture_Slides/Augmented-lagrangian.pdf

The idea is to increase the quadratic constraint cost slowly over time, while adjusting a Lagrange mutiplier term to compensate also. Seems to be working better. The scheduling of the increase is still fairly ad hoc.

import torch
import matplotlib.pyplot as plt
import numpy as np
import torch.optim
batch = 1
N = 50
T = 10.0
dt = T/N
NVars = 4

def getNewState():

x = torch.zeros(batch,N,NVars, requires_grad=True)

f = torch.zeros(batch, N-1, requires_grad=True)

l = torch.zeros(batch, N-1,NVars, requires_grad=False)
return x,f,l

def calc_loss(x,f, l, rho):
delx = (x[:,1:,:] - x[:, :-1,:]) / dt

xbar = (x[:,1:,:] + x[:, :-1,:]) / 2
dxdt = torch.zeros(batch, N-1,NVars)
THETA = 2
X = 0
V = 1
dxdt[:,:,X] = xbar[:,:,V]
dxdt[:,:,V] = f
dxdt[:,:,THETADOT] = -torch.sin(xbar[:,:,THETA]) + f*torch.cos(xbar[:,:,THETA])

xres = delx - dxdt

#dyn_err = torch.sum(torch.abs(xres) + torch.abs(vres), dim=1) #torch.sum(xres**2 + vres**2, dim=1) # + Abs of same thing?

lagrange_mult = torch.sum(l * xres, dim=1).sum(1)

#cost = 0
cost =  1.0*torch.sum(torch.abs(x[:,:,THETA]-np.pi), dim=1) # 0.1*torch.sum((x[:,:,X]-1)**2, dim=1)  +
#cost += 2.0 * torch.sum((x[:,30:-1,THETA] - np.pi)**2,dim=1)
#cost += 7.0*torch.sum( torch.abs(xres)+ xres**2 , dim=1).sum(1)
penalty = rho*torch.sum( xres**2 , dim=1).sum(1)
#  + 1*torch.sum( torch.abs(xres)+ xres**2 , dim=1).sum(1)
# 5.0*torch.sum( torch.abs(xres)+ xres**2 , dim=1).sum(1) +
cost += 0.01*torch.sum( f**2, dim=1)
#cost += torch.sum(-torch.log(f + 1) - torch.log(1 - f),dim=1)
cost += 0.1*torch.sum(-torch.log(xbar[:,:,X] + 1) - torch.log(1 - xbar[:,:,X]),dim=1)
cost += 0.1*torch.sum(-torch.log(xbar[:,:,V] + 1) - torch.log(1 - xbar[:,:,V]),dim=1)

#cost += torch.sum(-torch.log(xres + .5) - torch.log(.5 - xres),dim=1).sum(1)

# torch.sum( torch.abs(xres), dim=1).sum(1)*dt +
#cost = torch.sum((x-1)**2, dim=1)

total_cost =   cost + lagrange_mult + penalty  #100 * dyn_err + reward

import torch.optim as optim

learning_rate = 0.001

x, f, l = getNewState()
optimizers = [torch.optim.SGD([x,f], lr= learning_rate),
optimizer = optimizers[1]
#optimizer = torch.optim.SGD([x,f], lr= learning_rate)
costs= []
path = []
mults = []
rho = 0.1
prev_cost = 0
for j in range(15):
prev_cost = None
for i in range(1,10000):

total_cost, lagrange, cost, xres = calc_loss(x,f, l, rho)

costs.append(total_cost[0])
if i % 5 == 0:
#pass
print(total_cost)

total_cost.backward()

optimizer.step()

x[0,0,2] = 0#np.pi+0.3 # Initial Conditions
x[0,0,0] = 0
x[0,0,1] = 0
x[0,0,3] = 0
#if (x.grad.norm()).detach().numpy()/N < 0.1: #Put Convergence condition here

if i > 2000:
break
if prev_cost:
if ((total_cost - prev_cost).abs()/total_cost).detach().numpy() < 0.000001:
pass #break

prev_cost = total_cost

total_cost, lagrange, cost, xres = calc_loss(x,f, l, rho)
costs.append(total_cost[0])

l += 2 * rho * xres

rho = rho + 0.5
print(rho)

plt.subplot(131)
plt.plot(xres[0,:,0].detach().numpy(), label='Xres')
plt.plot(xres[0,:,1].detach().numpy(), label='Vres')
plt.plot(xres[0,:,2].detach().numpy(), label='THeres')

plt.legend(loc='upper right')
#plt.figure()
plt.subplot(132)
plt.plot(x[0,:,0].detach().numpy(), label='X')
plt.plot(x[0,:,1].detach().numpy(), label='V')
plt.plot(x[0,:,2].detach().numpy(), label='Theta')