Pytorch Trajectory Optimization Part 4: Cleaner code, 50Hz

Cleaned up the code more and refactored some things.

Added backtracking. It will backtrack on the dx until the function is actually decreasing.

Prototyped the online part with shifts. Seems to work well with a fixed penalty parameter rho~100. Runs at ~50Hz with pretty good performance at 4 optimization steps per time step. Faster or slower depending on the number of newton steps per time step we allow ourselves.  Still to see if the thing will control an actual cartpole.

The majority of time is spent just backwards calculating the hessian still (~50%).

I’ve tried a couple different schemes (direct projection of the delLy terms or using y = torch.eye). None particularly seem to help.

The line search is also fairly significant (~20% of the time) but it really helps with both stability and actually decreasing the number of hessian steps, so it is an overall win. Surprisingly during the line search, projecting out the batch to 0 doesn’t matter much. How could this possibly make sense?

What I should do is pack this into a class that accepts new state observations and initializes with the warm start. Not clear if I should force the 4 newton steps on you or let you call them yourself. I think if you use too few it is pretty unstable (1 doesn’t seem to work well. 2 might be ok and gets you up to 80Hz maybe.)

The various metaparameters should be put into the __init__. The stopping cutoff  1e-7, Starting rho (~0.1), rho increase (x10) , backtrack alpha decrease factor (0.5 right now), the online rho (100). Hopefully none of these matter two much. I have noticed going too small with cutoff leading to endless loops.

Could swapping the ordering of time step vs variable number maybe help?

For inequality constraints like the track length and forces, exponential barriers seems like a more stable option compared to log barriers. Log barriers at least require me to check if they are going NaN.

I attempted the pure Lagrangian version where lambda is just another variable. It wasn’t working that great.



Pytorch Trajectory Optimization 3: Plugging in the Hessian

I plugged in the hessian extraction code for using newton steps.

When I profiled it using the oh so convenient I found almost all of my time was spent in the delLy.backwards step. For each hessian I needed to run this B (the band width) times and each time cost ~0.6ms. For the entire run to converge took about 70 iterations and 1000 runs of this backwards step, which came out to 0.6 seconds. It is insane, but actually even calculating the band of the hessian costs considerably more time than inverting it.

So to speed this up, I did a bizarre thing. I replicated the entire system B times. Then I can get the entire hessian band in a single call to backwards. remarkably, although B ~ 15 this only slowed backwards down by 3x. This is huge savings actually, while obviously inefficient. The entire program has gone down from 1.1s to 0.38s, roughly a 3x improvement. All in all, this puts us at 70/0.38 ~ 185 Hz for a newton step. Is that good enough? I could trim some more fat. The Fast MPC paper says we need about ~5 iterations to tune up a solution, this would mean running at 40Hz. I think that might be ok.

Since the hessian is hermitian it is possible to use roughly half the calls (~B/2), but then actually extracting the hessian is not so simple. I haven’t figured out a way to comfortably do such a thing yet. I think I could figure out the first column and then subtract (roughly some kind of gaussian elimination procedure).

It has helped stability to regularize everything with a surprising amount of weight in the cost. I guess since I anticipate all values being in the range of -10,10, maybe this makes sense.

Now I need to try not using this augmented Lagrangian method and just switching to a straight newton step.

Edit: Ooh. Adding a simple backtracking line search really improves stability.



Cartpole Camera System – OpenCV + PS EYE + IR

We tried using colored tape before. It was okay after manual tuning, but kind of sucked. Commerical motion tracking systems use IR cameras and retroreflectors.

We bought some retroreflective tape and put it on the pole.

We removed our PS EYE IR filter. The PS EYE is really cheap (~7$) and has a high framerate mode (100+ fps). People have been using it for a while for computer vision projects.

We followed the instructions, but did not add the floppy disk and sanded down the base of the lens to bring the image back into focus.

We bought an IR LED ring light which fit over the camera with the plastic cover removed and rubber banded it in place.

If you snip the photoresistor it is always on, since the photoresistor is high resistance in the dark. We used a spare 12V power supply that we soldered a connector on for.

We had also bought an IR pass filter on amazon, but it does not appear to help.

Useful utilties: qv4l2, v4l2-ctl and v4l2-utils. You can change lots of stuff.

qv4l2 -d 1 is very useful for experiementation

Useful options to  v4l2-ctl : -d selects camera, -p sets framerate -l gives a list of changeable options. You have to turn off the automatic stuff before it becomes changeable. Counterintuitively auto-exposure seems to have 1 as off.

There has been a recent update to opencv to let the v4l2 buffer size be changed. We’re hoping this will really help with our latency issues

A useful blog. We use v4l2-ctl for controlling the exposure programmatically

Oooh. The contour method + rotated rectangle is working really well for matching the retroreflective tape.

You need to reduce the video size to 320×240 if you want to go to the highest framerate of 187fps


In regards to the frame delay problem from before, it’s not clear that we’re really seeing it? We are attempting both the screen timestamp technique and also comparing to our rotary encoder. In the screen timestamp technique, it is not so clear that what we measure there is latency, and if it is, it includes the latency of the monitor itself, which is irrelevant.

img_5311 img_2511


Extracting a Banded Hessian in PyTorch

So pytorch does have some capability towards higher derivatives, with the caveat that you have to dot the gradients to turn them back into scalars before continuing. What this means is that you can sample a single application of the  Hessian (the matrix of second derivatives) at a time.

One could sample out every column of the hessian for example. Performance-wise I don’t know how bad this might be.

For a banded hessian, which will occur in a trajectory optimization problem (the bandedness being a reflection of the finite difference scheme), you don’t need that many samples. This feels more palatable. You only need to sample the hessian roughly the bandwidth number of times, which may be quite small. Plus, then you can invert that banded hessian very quickly using special purpose banded matrix solvers, which are also quite fast. I’m hoping that once I plug this into the trajectory optimization, I can use a Newton method (or SQP?) which will perform better than straight gradient descent.

If you pulled just a single column using [1,0,0,0,0,0..] for example, that would be wasteful, since there are so many known zeros in the banded matrix. Instead something like [1,0,0,1,0,0,1,0,0..] will not have any zeros in the result. This gets us every 3rd row of the matrix. Then we can sample with shifted versions like [0,1,0,0,1,0,0,1,0,0..]. until we have all the rows somewhere. Then there is some index shuffling to put the thing into a sane ordering, especially so that we can use which requires the banded matrix to be given in a particular form.

An alternative approach might be to use an fft with some phase twiddling. Also it feels like since the Hessian is hermitian we ought to be able to use about half the samples, since half are redundant, but I haven’t figured out a clean way to do this yet. I think that perhaps sampling with random vectors and then solving for the coefficients would work, but again how to organize the code for such a thing?


Here’s a snippet simulatng extracting the band matrix from matrix products.


and here is the full pytorch implementation including a linear banded solve.




PyTorch Trajectory Optimization Part 2: Work in Progress

I actually have been plotting the trajectories, which is insane that I wasn’t already doing in part 1. There was clearly funky behavior.

Alternating the Lagrange multiplier steps and the state variable steps seems to have helped with convergence. Adding a cost to the dynamical residuals seems to have helped clean them up also.

I should attempt some kind of analysis rather than making shit up. Assuming quadratic costs (and dynamics), the problem is tractable. The training procedure is basically a dynamical system.

Changed the code a bit to use more variables. Actually trying the cart pole problem now. The results seem plausible. A noisy but balanced dynamical residual around zero. And the force appears to flip it’s direction as the pole crosses the horizontal.

Polyak’s step length

The idea is that if you know the optimal value you’re trying to achieve, that gives you a scale of gradient to work with. Not as good as a hessian maybe, but it’s somethin’. If you use a gradient step of x + (f-f*)\frac{\nabla f}{|\nabla f|^2} it at least has the same units as x and not f/x. In some simple models of f, this might be exactly the step size you’d need. If you know you’re far away from optimal, you should be taking some big step sizes.

The Polyak step length has not been useful so far. Interesting idea though.



  1. The step size is ad hoc.
  2. Lagrange multiplier technique does not seem to work
  3. Takes a lot of steps
  4. diverges
  5. seems to not be getting an actual solution
  6. Takes a lot of iterations

On the table: Better integration scheme. Hermite collocation?

Be more careful with scaling, what are the units?

mutiplier smoothing. Temporal derivative of lagrange multiplier in cost?

alternate more complete solving steps

huber on theta position cost. Square too harsh? Punishes swing away too much?

more bullshit optimization strats as alternatives to grad descent

weight sooner more than later. Care more about earlier times since want to do model predictive control

Just solve eq of motion don’t worry about control as simpler problem

Pole up balancing

logarithm squeezing method – nope

The lambda * x model of lagrange mulitplier. Leads to oscillation

Damping term?

This learning rate is more like a discretization time step than a decay parameter. Well the product of both actually.

Heat equation model. Kind of relaxing everything into place



Made some big adjustments

Switched to using pytorch optimizers. Adam seems to work the best. Maybe 5x as fast convergence as my gradient descent. Adagrad and Adadelta aren’t quite as good. Should still try momentum. Have to reset the initial conditions after every iteration. A better way? Maybe pass x0 in to calc_loss separately?

Switched over to using the method of multipliers

The idea is to increase the quadratic constraint cost slowly over time, while adjusting a Lagrange mutiplier term to compensate also. Seems to be working better. The scheduling of the increase is still fairly ad hoc.



The left is residuals of obeying the equations of motion, the middle is the force and trajectories themselves and the right is cost vs iteration time. Not entirely clear that a residual of 0.04 is sufficient. Integrated over time this could be an overly optimistic error of 0.2 ish I’d guess. That is on the edge of making me uncomfortable. Increase rho more? Also that force schedule seems funky and overly complex. Still, improvement from before. Feels like we’re cookin’ with gas


Garbage Can Compiling to Categories with Inspectable Lambdas

There are a couple kinds of functions that we can turn into totally inspectable data.

Linear functions can be reconstituted into a matrix if you give a basis of vectors.

Functions from enumerable types can be turned into a lookup table

Sufficiently polymorphic functions are another example though. forall a. a-> a is commonly known to only be id. The same goes for fst = forall a b. (a,b)->a and snd and swap and all the nesting of . These functions have exactly one inhabiting value (excluding internal churning and the possibility of going into an infinite loop).

So the type directly tells us the implementation

forall a. (a,a)->a is similar. It can only be fst or snd. Types that reuse a type parameter in the input can only be permutations.

I’ve been trying to find a way to take a written lambda and convert it to data automatically and have been having trouble.

An opaque type that we have hidden the contructors to is the same (T,T)->T can only be fst or snd specialized to T since we can’t possibly destruct on T.

We can figure out which one by giving a labeled example to that function and then inspecting a single output.  This gives the permutation and duplication that was done.

Similarly for T -> Either T T

Once we have this, we can (Hopefully) reinterpret this lambda in terms of a monoidal category.



What about TH? Also the new quantified constraints extensions might be helpful?



Ok. A Different approach. This works much better to what I had in mind. you can write aribatrary (\(x,y,) -> (y,x)) tuple like lambdas and it will convert them to a category. I really had to hack around to get the thing to compile. Like that Pick typeclass, what the heck? Why can I get defaults values in type families but not in typeclasses?

It is all decidedly not typesafe. You can get totally nonsensical things to compile to something. However if you stick to lambdas, you’ll be ok. Maybe.

No on further review this does not work. I got tricked that the type seemed ok at a certain point.  A couple problems arise upon actual application. Since the idea is to drive the form based on the type variables upon actual application to something that has types of the same form it gets all screwed up. Also tons of instances are overlapping, although I think this is fixable.

Maybe what I need is existential types that can’t ever unify together accidentally.

A couple thought on typelevel programming principles:

  1. Typeclasses are hard to get default cases. You want to use type families if that is what you want
  2. Typeclasses need unique stuff to appear on the right hand side. Only 1 pattern should match. You might need to add extra parameters to match to which you can force on the left hand side of the instance
  3. ~ type equality is real useful


An alternative to using lambda is to use an explicit Proxy. The type variables are basically just as good for syntactic purposes (a touch more noisy).



Pytorch Trajectory Optimization

Trajectory optimization is cool. The idea is to take a dynamical problem as a big ole optimization problem, finding the best actions to take to achieve your goals or maximize a reward.

There are a couple of flavors of trajectory optimization (shooting methods, collocation methods)

PyTorch gives a pretty low overhead extension to Numpy that also gives autodifferentiation. It is mainly intended as a neural network library, for which it has a number of facilities.

Gradient Descent is not the preferred method for these problems (According to Boyd’s Convex optimization course). Gradient Descent has shit convergence compared to newton iteration, but is very flexible and easy to implement.

In addition, using a normal ODE solver from Scipy would be much more stable, but it would require cleverness to have the same code work for both scipy and the torch parts. So screw it.

One nicety of this approach is that we don’t even have to have our derivatives solved for. They could be all tied up in a

I thought that maybe I could just weight the dynamics cost enough to have it require the dynamics be satisfied, but that did not seem to work. Maybe with more fiddling? On further review my code had massive bugs in it. I’m not sure that the dynamics cost version wouldn’t work, but the Lagrange multiplier method seems to work well and makes sense too.

In this formulation, we can also train some kind of parametrized controller function f_w(x) by sampling some random initial starting conditions (or even dynamical parameters like mass and length etc, or noise forces). This is quite nice.

Additional bits that may be nice: Backtracking line search, logarithmic potential for inequalities, I wonder if a symplectic style interleaving of position and momentum might be nice even for this global case. Should definitely just tie up all the vars into a single x. Can we use a lagrangian or hamiltonian and then have pytorch differentiate that? It may in fact be nice to use some combinator to be able to hand the same function to ODEInt for a couple reasons (getting good initilizations  of the path for example).

For a simple system, I’m using \dot{x}=v , \dot{v}=f , where you get to control f at every time point and x is starting at 0 and wants to get to 1. I’m using a simple scheme of finite difference in time for the time derivative. x and v are defined at t and f, lx, lv are defined at the half time steps t + \frac{1}{2}. You need at least two time steps to get a derivative. I’m adding a square cost to the force, otherwise it would just get a huge force. lx and lv are Lagrange multipliers enforcing the equations of motion at each time step

Here was an initial pass (just here for for historical reasons, look at the updated one below. This one does not work as is)


goofed up a couple things (inlcuding my xres making no sense. You need to explicility zero gradients. Pretty annoying). Lagrange multiplier method makes total sense.

Could we use a Hamiltonian and use autograd to derive equations of motion? Seems plausible and convenient.

Can I make a custom pytorch layer for sparse Hessians? The data oriented viewpoint would have you pump the gradient and hessian backward. Or could you automatically build an H-matrix structure for the hessian of convnets?

Put a neural controller in there. Each batch could have randomized parameters, noise, and initial conditions.

Is rebuilding total_cost every time bad?



CartPole Maths

Our approach to the Cartpole is not black box. A Cartpole is a pretty simple system all things considered.

The first thing to do is derive the equations of motion. Originally I was using the Lagrangian for the system and deriving the equations of motion that way, which includes the back reaction of the pole back on the acceleration of the cart for example.

But the motor complicates things. I have a tough time in general modeling motors. What physical thing does a command to our motor driver correspond to? Velocity? Power? Torque? I have guesses. The easiest thing to do is just build and measure. It turns out for us that the commands are basically velocity control.

So in our case the back reaction is basically irrelevant. We have direct control over the cart velocity. In that case, one can use some easy enough hand waving to get the equations of motion.

Let’s set the angle \theta = 0 at pole down with positive angle going counter clockwise. We can assume that gravity acts at the center of mass of the pole, which is at the midpoint. This gives a torque \tau = -mg \frac{L}{2} \sin(\theta). One way to see the negative sign is that a slight positive angle should give a negative torque returning it back to the down position. The moment of inertia of the pole is mL^2/3. You can look this up as a pole around one of it’s ends or derive it from I =\int dm r^2. The \frac{1}{3} comes from the integration of the r^2. Putting these together we get mL^2/3 \ddot{\theta} = -mg  \frac{L}{2} \sin(\theta) .

Now we need to actually put in the cart stuff. The cart puts the pole in an accelerating frame, where basically you have a new component of gravity that points horizontally. This adds a torque -ma  \frac{L}{2} \cos(\theta) . As far as all of the signs go, honestly we just fiddled with them until it worked.

Now that we have all that in hand, we can talk about the Linear Quadratic Regulator (LQR) control.

The model that LQR uses is that the equations of motion are linear and the cost function that you want to minimize are quadratic in the controls u and state x. This is plausibly tractable because Quadratic and linear stuff is usually ok. I’m serious.

These letter choices for the various bits are pretty standard.

cost = \int x^TQx + u^TRu dt


If you just look at the wikipedia page, you can already just plug and chug to the solution.

u = -Kx

K= R^{-1} B^T P

A^TP + PA - PBR^{-1}B^TP + Q = 0 

Jesus that equation looks like shit.

which is QUITE CONVENIENTLY solved by the following scipy function



Now given this shit, we need to approximate our nonlinear equations of motion with linear equations in order to use the LQR framework.

mL^2/3 \ddot{\theta} = -m \frac{L}{2}( g\sin(\theta) + a \cos(\theta))

Near the top where we are linearizing

\delta\theta = \theta - \pi




mL^2/3 \ddot{\delta\theta} \approx m \frac{L}{2}(g\delta\theta+a)

We can move some of those constants to the other side to get \ddot{\delta\theta} by itself.

\ddot{\delta\theta} \approx \frac{3}{2L}( g\delta\theta + a)

Another thing you have to do is massage these second order equations into first order form. You do this by introducing a new state variable \omega

\dot{\delta\theta} = \omega

\dot{\omega} \approx \frac{3}{2L}( g\delta\theta + a)

In matrix form this is

\begin{bmatrix}\dot{\delta\theta} \\ \dot{\omega}  \end{bmatrix} =    \begin{bmatrix} 0 & 1 \\    \frac{3}{2L} g & 0    \end{bmatrix}    \begin{bmatrix}\delta\theta \\ \omega \end{bmatrix}    +    \begin{bmatrix} 0 \\ \frac{3}{2L} \end{bmatrix} \begin{bmatrix} a \end{bmatrix}  

In addition to this, it is nice to add the cart dynamics, even though they are pretty trivial. This is because we can then add some weighting terms to discourage the cart from trying to go off the track or go faster than the motors support. There are ways to make it so that the cartpole never tries to go out of bounds, but they are a bit more complicated. I’ve got some blog posts about them.

\begin{bmatrix}\dot{\delta\theta} \\ \dot{\omega} \\ \dot{x} \\ \dot{v}  \end{bmatrix} =    \begin{bmatrix} 0 & 1 & 0 & 0 \\    \frac{3}{2L} g & 0 & 0 & 0 \\    0 & 0 & 0 & 1\\    0 & 0 & 0 & 0 \\    \end{bmatrix}    \begin{bmatrix}\delta\theta \\ \omega \\ x \\ v \end{bmatrix}    +    \begin{bmatrix} 0 \\ \frac{3}{2L} \\ 0 \\ 1 \end{bmatrix} \begin{bmatrix} a \end{bmatrix}  

So we can read off our needed matrices A and B from here.

A =    \begin{bmatrix} 0 & 1 & 0 & 0 \\    \frac{3}{2L} g & 0 & 0 & 0 \\    0 & 0 & 0 & 1\\    0 & 0 & 0 & 0 \\    \end{bmatrix}

B = \begin{bmatrix} 0 \\ \frac{3}{2L} \\ 0 \\ 1 \end{bmatrix}

Now in regards to the weighting matrices Q and R, this is a bit tougher to say what we want. We sort of want all the state variables to be small but the relative important isn’t a priori clear to me. So we picked diagonal matrices and tried out some different values. One thing to note though is that the states variables could possibly have very different scales, since their units are different. The characteristic time of the system is T=\sqrt{\frac{L}{g}}. The characteristic length is the size of our track 1000mm and the rough angle scale is \pi/8 -ish.

Now that we have our matrices we can plug it all into scipy and use it!

One thing to be careful about is that the pole can have swung around multiple times leading to the angle being some multiple of 2\pi. Our hack around this is to just take the \sin of the angle.





Analytic Center in Python using Scipy and Numpy

The analytic center for a set of inequalities \phi(x)<0 is the minimizing position of \sum -\ln (-\phi(x)) . In particular it is often used with linear inequalities. It gives a reasonable and easily computable for convex constraint function center of the region. The hessian at that point can give you a reasonable ellipse that approximates the region too (both interior and exterior approximation).

I wrote a program for linear inequalities. It is not particularly robust. First I get a feasible point using the LP solver in scipy. Then I give the appropriate gradients and Hessians to a newton conjugate gradient solver in scipy. It does return a reasonable center, but I had to fiddle with some epsilons to avoid logarithms exploding and to avoid the hessian being so big it overwhelms the gradient. Possibly a couple burn in steps of gradient descent might help, or getting a feasible point that isn’t optimal since the optimal points being on the boundary is a huge problem. If the newton solver comes back with only 1 or 2 iterations, it probably failed.


Noise and The Fluctuation Dissipation Theorem

I was looking at some slides the other day and they quoted noise power in units of \frac{W}{\sqrt{Hz}}. Being the ignoramus I am, I was wondering why it was scaled

First off, when a Watt is quoted in an electrical measurement, usually you’re measuring Voltage with an instrument with a known input impedance Z. That’s how you convert your fluctutating voltage measurement to Watts.

Second, the sqrt frequency thing? Nowadays, your measurement apparatus is probably a digital sampler and it performs an FFT giving you a spectrum. The width of your FFT is the sampling frequency roughly. Does that make sense that when you increase the width of your taken spectrum the height of the noise signal changes too? It does, but only because implicitly, most sampling circuits take an average of the signal over the same period as the sampling time. These two times are not necessarily intrinsically linked. One could have a system that takes a very fast snapshot and but can only save data or send it over a link at a much slower speed. The noise power is this snapshot time, not the data saving time. The data saving time would be the bandwidth in the FFT.

These two are engineered to be the same to avoid distortion of the frequency signal via aliasing.

But there is an even simpler way to see this. Suppose you have two measurements V1 and V2 that are the averages of time T with variance \sigma. Then the average of these two, V3, is over a time 2T. However, by the standard kind of manipulations (for Gaussian variables the squared variance of a sum = the sum of the squared variances, \sigma^2_{\sum x_i}=\sum \sigma_{x_i} ), the variance of the new signal is \sigma/\sqrt{2} which means it scales with the time window. Hence multiplying you actual measured variances by the square root of your time window gives you a time window invariant quantity.


While I was thinking about that in the car I realized that the fluctuation dissipation theorem is a mean field theory kind of thing. The fluctuation dissipation theorem feels weird and spooky, but I guess it is ultimately simple (or not).

Mean field theory tries to summarize all the complicated interactions with neighbors with a simple summary. For interacting spins, it tries to summarize as an effective B field from the surrounding spins. Then you have a 1-particle model which you can solve and try to find a self-consistent value of B. Here is a sketch in equations.

H= \sum S\cdot S - B_{ext}\cdot S \rightarrow \sum - B_{eff}\cdot  S

Z=\sum_s e^{-\beta H}

M = <S> = \partial_{\beta B} \ln(Z)

B = \alpha M

You can do something similar to find an effective permeability due to your surrounding neighbors. \partial_B M = \chi

The fluctuating force due to your neighbors is like B, a constant forcing term.

The damping is like the permeability. One may want to consider a system that starts with an intrinsic damping, that is one difference between the magnetic case and the fluctuation case, in that free space has a natural permeability but not a natural damping (I suppose there is always some damping, due to radiation and what not, but we have a tendency to totally neglect such things). One could imagine ball bearings being shaken in a cup of molasses or something. You might want to fluctuation due to being hit by other ball bearings, but consider the damping from the molasses to be the dominating damping term (the the thermal fluctuations from the molasses to be ignorable).

Another difference is that I think you really are going to need to work explicitly with time. Just the thermal average isn’t going to cut it I think (at least not conceptually. There might be some dirty tricks you can play, but a typical Hamiltonian can’t have damping terms. As I write this I am doubting it’s truth).

\ddot{x} = -\nu \dot{x}+ f

calculate some averages … Then use the self-consistency

B = \alpha M \rightarrow f = f(\hat{x})

The dissipation will be related to your correlation with your neighbors. When you are moving faster, they have to tend to move in such a way to make you slow down on average.

To Be Continued