## More Reinforcement Learning with cvxpy

So I spent Thanksgiving doing this and playing Zelda. Even though that sounds like a hell of a day, it seems a little sad for Thanksgiving :(. I should probably make more of an effort to go home next year.

I tried implementing a more traditional Q-learning pipeline using cvxpy (rather than the inequality trick from last time). I couldn't get it to work as well. And it's still kind of slow, despite a lot of rearrangement to vectorize operations (basically through batching).

I guess I’m still entranced with the idea of avoiding neural networks. In a sense, that is the old, boring way of doing things; deep RL is the new stuff, and using ordinary function approximators is way older, I think. But I feel like it takes one problem (dealing with training neural nets) out of the equation. Also, I like modeling languages/libraries.

I kept finding show-stopping bugs throughout the day (incorrectly written maxaction functions, typos, cross-episode data points, etc.), so I wouldn’t be surprised if there is still one in here. It’s very surprising how one can convince oneself that it is kind of working when it is actually impossible that it’s working. All these environments are so simple that I suspect I could have randomly sampled controllers out of a sack for the time I’ve been fiddling with this stuff and found a good one.

I also did the easy cartpole environment using the inequality trick.  Seems to work pretty well.

I also have some work in progress on getting the full swing-up cartpole. Currently it is not really working, though it seems to be kind of pumping about right. The continuous force control easy cartpole does work though.

Now I feel that a thing that matters quite a bit is your choice of action for the next time step. Hypothetically you want a ton of samples here. I now think that using an action that is just slightly perturbed from the actual action works well, because the actual action tends to become roughly the optimal one, and subsequent time steps have roughly the same data in them. Something like the tiny sketch below.
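To be concrete, here is a minimal sketch (with made-up names and a guessed noise scale, not code from my actual run) of generating the candidate next actions by perturbing the action that was actually taken:

```python
import numpy as np

# Hypothetical sketch: candidate "next actions" for the Bellman right-hand side,
# taken as small Gaussian perturbations of the action actually used. The number
# of samples and the noise scale are guesses.
def candidate_actions(a_taken, n=4, scale=0.1):
    return a_taken + scale * np.random.randn(n)
```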

One advantage of a discrete action space is that you can really search all of it.

Does that mean I should seriously investigate the sum of squares form? A semidefinite variable per data point sounds bad. I feel like I’d have to seriously limit the amount of data I’m using. Maybe I’ll be pleasantly surprised.
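For a Q that is quadratic in a scalar action, the forall-a constraint is just nonnegativity of a univariate quadratic, which can be written as a 2x2 semidefinite condition. A rough sketch of what one such constraint might look like in cvxpy is below; the coefficient expressions are assumed to be affine in the alpha variables, and all the names are hypothetical. Note the explicit PSD variable per data point, which is exactly the cost I'm worried about.

```python
import cvxpy as cp

# Sketch of one relaxed Bellman inequality as a semidefinite constraint, assuming
# Q(s', a) = q2*a^2 + q1*a + q0 with q0, q1, q2 affine cvxpy expressions in alpha.
# The condition  lhs >= r + gamma * Q(s', a)  for all a  is nonnegativity of a
# univariate quadratic p(a), i.e. a 2x2 PSD condition on its coefficients.
def forall_a_constraints(lhs, r, gamma, q0, q1, q2):
    p0 = lhs - r - gamma * q0   # constant coefficient of p(a)
    p1 = -gamma * q1            # linear coefficient
    p2 = -gamma * q2            # quadratic coefficient
    # one small PSD variable per data point
    M = cp.Variable((2, 2), PSD=True)
    return [M[0, 0] == p0, M[1, 1] == p2, M[0, 1] == p1 / 2]
```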

I haven’t even gotten around to playing with different polynomials yet. The current implementation is exponentially sized in the number of variables, but in kind of a silly way. I think it would be better to use all terms of a bounded total degree.
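Something along these lines is what I have in mind; this is a sketch, not what the current code does, and `poly_features` is a made-up name:

```python
from itertools import combinations_with_replacement
import numpy as np

# Sketch: all monomials of the (state, action) vector x with total degree <= deg,
# instead of the exponentially large tensor-product basis.
def poly_features(x, deg=3):
    feats = [1.0]  # degree-0 term
    for d in range(1, deg + 1):
        for idx in combinations_with_replacement(range(len(x)), d):
            feats.append(np.prod([x[i] for i in idx]))
    return np.array(feats)
```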

## Q-Learning with Linear Programming (cvxpy, OpenAI Gym Pendulum)

http://web.mit.edu/~pucci/www/discountedLP.pdf

http://underactuated.mit.edu/underactuated.html?chapter=dp

There is a fun idea of using linear programming to do dynamic programming that I originally saw in the underactuated robotics textbook.

In my experience, reinforcement learning is finicky and depressing. It usually doesn’t work and is very hard to troubleshoot. Do you just need to run it for 10 minutes? 10 years? Is there a bug? God knows. I end up wiggling hyperparameters and praying a lot.

One part of this is the relative finickiness of neural network optimization compared to the technology of convex optimization. Convex optimization solvers are quite reliable and fast.

There is a way of phrasing Q-learning as a linear programming problem.

The linear programming approach relaxes the Bellman equation

$Q(s_t,a_t)=r_t + \gamma \max_a Q(s_{t+1},a)$

to

$\forall a. Q(s_t,a_t) \ge r_t +\gamma Q(s_{t+1},a)$

We can approach this forall in a couple of ways, one of which is to just sample actions somehow. To make the constraint tight in the places you care about, you minimize a weighting of Q:

$\min \sum_i w_i Q(s_i,a_i)$

If Q is written as a linear combination of basis functions,

$Q(s,a)=\sum \alpha_i f_i(s,a)$

then all of this put together is a linear program in the variables $\alpha_i$.
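Here is a minimal sketch of what that linear program could look like in cvxpy. It is not the code from this post; `features`, `transitions`, and `candidate_actions` are hypothetical stand-ins for whatever basis, data, and action-sampling scheme you use.

```python
import cvxpy as cp

# Sketch of the LP: one relaxed Bellman inequality per (transition, sampled action),
# and an objective that pushes Q down at the visited state-action pairs.
def fit_q(features, transitions, candidate_actions, gamma=0.99):
    n_feat = len(features(*transitions[0][:2]))
    alpha = cp.Variable(n_feat)

    constraints = []
    weighted_q = 0
    for s, a, r, s_next in transitions:
        q_sa = features(s, a) @ alpha
        # relaxed Bellman inequality, one per sampled next action
        for a_next in candidate_actions(a):
            constraints.append(q_sa >= r + gamma * features(s_next, a_next) @ alpha)
        # weight the objective at the states actually experienced (uniform weights here)
        weighted_q += q_sa

    prob = cp.Problem(cp.Minimize(weighted_q), constraints)
    prob.solve()
    return alpha.value
```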

For ease, I used cvxpy. I don’t even store my state-action pairs, which is quite lazy of me. Even here, compiling the linear program via cvxpy is kind of slow; this preprocessing step takes longer than the actual solve does. You could avoid cvxpy and directly interface with a linear programming solver, which would be much faster, if that is your thing.

The whole process is still model-free: I didn’t plug in the pendulum dynamics anywhere. I run OpenAI Gym and use the resulting state-action-state tuples to add inequalities to my cvxpy model. I weight where I want the inequalities to be tightest by using the actual states experienced.
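The data collection loop is roughly like the following sketch, written against the old Gym API that was current at the time (4-tuple `step`, `Pendulum-v0`). `greedy_action` is a placeholder for whatever argmax-over-a policy you get from the current Q (e.g. the polyfit trick described below), and the noise scale is a guess.

```python
import gym
import numpy as np

# Sketch: roll out episodes, add exploration noise to the greedy action, and
# record (s, a, r, s') tuples that become inequalities in the LP.
def collect_transitions(greedy_action, episodes=10, horizon=200, noise=0.1):
    env = gym.make("Pendulum-v0")
    transitions = []
    for _ in range(episodes):
        s = env.reset()
        for _ in range(horizon):
            a = greedy_action(s) + noise * np.random.randn()
            s_next, r, done, _ = env.step([a])
            transitions.append((s, np.array([a]), r, s_next))
            s = s_next
            if done:
                break  # don't let data points cross episode boundaries
    return transitions
```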

Unfortunately, it still took a couple of hours of hyperparameter tuning and fiddling to get the thing to work. So not a grand success on that point.

There was a lot of guesswork about what seemed reasonable:

I parametrized the dependence of Q on a by a quadratic, so that it is easy to maximize analytically. That is what the polyfit stuff is about. The maximum of $ax^2+bx+c$ is at $-b/(2a)$. I really should be checking the sign of the $a$ coefficient; I am just assuming it is negative (concave down). Naughty boy.

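The polyfit trick is roughly the following; this is a sketch with a hypothetical `q_of_a` standing in for the learned Q evaluated at a fixed state, and it makes the same naughty concavity assumption as the text:

```python
import numpy as np

# Sketch: evaluate Q(s, a) at a few sample actions, fit a quadratic in a, and
# take the analytic vertex -b/(2a). Only valid as a maximizer when the fitted
# quadratic is concave down (c2 < 0); see the edit below for the fix.
def max_action(q_of_a, a_samples):
    c2, c1, c0 = np.polyfit(a_samples, [q_of_a(a) for a in a_samples], 2)
    return -c1 / (2 * c2)
```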

Chebyshev polynomials are probably good.

It seemed to help to use a slight perturbation of the actual action on the right-hand side of the Bellman inequality. My reasoning here is that the pendulum is actually a continuous system, so we should really be using the differential Bellman equation.

Should I allow for some kind of slack in the equations? Getting a bad reward or data point or one weird unusual state could ruin things for everyone. Inequalities are unforgiving.

Gamma seemed to matter a decent amount.

The regularization of alpha seemed largely irrelevant.

Epsilon greediness seems to not matter much either.

Future ideas:

Might be good to replace the sampling of a with a Sum of Squares condition over the variable a.

Should I damp the update in some way? Add a cost to changing alpha from its previous value: a kind of damped update / using a prior. Something like the sketch below.
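A minimal sketch of what that damping term could look like (this is an untried idea, and `weighted_q`, `alpha_prev`, and `rho` are hypothetical names):

```python
import cvxpy as cp

# Sketch: penalize moving alpha away from its value at the previous solve.
# Adding the quadratic term turns the LP into a small QP.
def damped_objective(weighted_q, alpha, alpha_prev, rho=1.0):
    return cp.Minimize(weighted_q + rho * cp.sum_squares(alpha - alpha_prev))
```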

Edit:

An improved version. I fixed the bug in my maxaction function: I shouldn’t have been assuming that it was always concave down.
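A sketch of the kind of fix I mean (not my exact code): check the sign of the quadratic coefficient and compare the vertex against the action bounds. The ±2 bounds match Pendulum's torque limit; `q_of_a` is again a placeholder.

```python
import numpy as np

# Sketch: pick the best of the vertex (when the parabola opens downward) and the
# action-range endpoints, instead of always trusting -b/(2a).
def max_action_fixed(q_of_a, a_samples, a_min=-2.0, a_max=2.0):
    c2, c1, c0 = np.polyfit(a_samples, [q_of_a(a) for a in a_samples], 2)
    candidates = [a_min, a_max]
    if c2 < 0:  # vertex is a maximum only in the concave-down case
        candidates.append(float(np.clip(-c1 / (2 * c2), a_min, a_max)))
    return max(candidates, key=q_of_a)
```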

I also vectorized slightly, which fairly significantly improves the solve time. Not much time is spent in cvxpy now; the solve is dominated by about 3 legitimate seconds in OSQP.

You can flip stuff in and out of loops to try different versions. This method is off-policy, so I could keep data around forever. However, it mostly just slowed the solve time.