http://web.mit.edu/~pucci/www/discountedLP.pdf

http://underactuated.mit.edu/underactuated.html?chapter=dp

There is a fun idea of using linear programming to do dynamic programming that I originally saw in the underactuated robotics textbook.

In my experience reinforcement learning is finicky and depressing. It usually doesn’t work and is very hard to troubleshoot. Do you just need to run it for 10 minutes? 10 years? Is there a bug? God knows. I end up wiggling hyperparameters and praying a lot.

One part of this is the relative finickiness of neural network optimization compared to the technology of convex optimization. Convex optimization solvers are quite reliable and fast.

There is a way of phrasing Q-learning as a linear programming problem.

The linear programming approach relaxes the Bellman equation

$$Q(s,a) = r(s,a) + \gamma \max_{a'} Q(s',a')$$

to

$$Q(s,a) \ge r(s,a) + \gamma\, Q(s',a') \quad \forall\, s, a, a',$$

where requiring the inequality for every $a'$ stands in for the max. We can approach this for-all in a couple of ways, one of which is just sampling actions somehow. To make the constraints tight in the places you care about, you minimize a weighted sum of Q,

$$\min \sum_{s,a} w(s,a)\, Q(s,a).$$

If Q is written as a linear combination of basis functions,

$$Q(s,a) = \sum_i \alpha_i f_i(s,a),$$

then all of this put together is a linear program in the variables $\alpha_i$.
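To make the structure concrete, here is a minimal sketch of the same LP in the classical tabular value-function form described in the underactuated notes linked above. The tiny MDP (transition matrix `P`, reward `R`) is made up purely for illustration:

```python
# Minimal sketch: LP relaxation of the Bellman equation for a made-up
# tabular MDP with known dynamics (not the pendulum problem below).
import numpy as np
import cvxpy as cvx

nS, nA = 3, 2          # states, actions (arbitrary small numbers)
gamma = 0.9
rng = np.random.default_rng(0)
P = rng.dirichlet(np.ones(nS), size=(nS, nA))   # P[s, a] = distribution over next states
R = rng.standard_normal((nS, nA))               # R[s, a] = reward

V = cvx.Variable(nS)
# One linear inequality per (s, a): V(s) >= R(s,a) + gamma * E[V(s')]
constraints = [V[s] >= R[s, a] + gamma * (P[s, a] @ V)
               for s in range(nS) for a in range(nA)]
# Minimizing any positively weighted sum of V pushes it down onto the
# constraints, making them tight at the optimal value function.
prob = cvx.Problem(cvx.Minimize(cvx.sum(V)), constraints)
prob.solve()
print(V.value)
```

The Q-function version used in the rest of the post is the same idea, with sampled transitions standing in for the known dynamics and a linear basis standing in for the table.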

For ease, I used cvxpy. I don’t even store my state-action pairs, which is quite lazy of me. Even here, compiling the linear program via cvxpy is kind of slow; this preprocessing step takes longer than the actual solve does. You could avoid cvxpy and interface a linear programming solver directly, which would be much faster, if that is your thing.
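For what it’s worth, skipping cvxpy might look something like the following sketch: stack the sampled Bellman inequalities into a matrix yourself and hand them to a solver in standard form. The function and names here are hypothetical (mine, not from the post), and the regularization term is left out:

```python
# Hypothetical sketch of bypassing cvxpy: build the LP in standard form and
# call an LP solver directly. bprev/bnext hold basis vectors for the sampled
# transitions, one row per transition, as in the code further down.
import numpy as np
from scipy.optimize import linprog

def solve_bellman_lp(bprev, bnext, rewards, weights, gamma):
    # Constraint  bprev @ alpha >= rewards + gamma * bnext @ alpha
    # rearranged into linprog's  A_ub @ x <= b_ub  form.
    A_ub = gamma * bnext - bprev
    b_ub = -rewards
    c = weights                       # objective: minimize  weights @ alpha
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=(None, None), method="highs")
    return res.x
```

Without some regularization the LP can be unbounded, so in practice you would keep something like the L1 penalty on alpha, which takes a few auxiliary variables to express in pure LP form.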

The whole process is still model-free. I didn’t plug in pendulum dynamics anywhere. I run the OpenAI gym environment and use the resulting state-action-state tuples to add inequalities to my cvxpy model. I weight where I want the inequalities to be tightest by using the actual states experienced.

Unfortunately, it still took a couple hours of hyperparameter tuning and fiddling to get the thing to work. So not a grand success on that point.

There was a lot of guesswork about what seemed reasonable.

I parametrized the dependence of Q on `a` by a quadratic so that it is easy to maximize analytically. That is what the polyfit stuff is about: the maximum of $c_2 a^2 + c_1 a + c_0$ is at $a^* = -c_1/(2 c_2)$. I really should be checking the sign of the quadratic coefficient, since the vertex is only a maximum when the parabola is concave down. I am just assuming it is. Naughty boy.

Chebyshev polynomials are probably a good choice for the state-dependent part of the basis.
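As an aside on how the basis gets evaluated in the code below: passing an identity matrix as the coefficient array to `chebval` evaluates every Chebyshev polynomial up to the chosen degree in one call, and an outer product of the per-coordinate results (times powers of the action) gives the tensor-product features. A small illustration of that trick:

```python
# chebval with an identity coefficient matrix returns [T_0(x), ..., T_{d-1}(x)],
# since column k of the identity is the coefficient vector of T_k.
import numpy as np
from numpy.polynomial.chebyshev import chebval

deg = 4
print(chebval(0.3, np.eye(deg)))   # [1.0, 0.3, -0.82, -0.792]

# Tensor-product features for two coordinates via a broadcasted outer product,
# analogous to the basis() function in the listings below.
f1 = chebval(0.3, np.eye(deg)).reshape(-1, 1)
f2 = chebval(-0.7, np.eye(deg)).reshape(1, -1)
print((f1 * f2).shape)             # (4, 4), flattened into one feature vector
```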

It seemed to help to use a slight perturbation of the actual action used on the right hand side of the Bellman inequality. My reasoning here is that the pendulum is actually a continuous system, so we should be using the differential Bellman equation really.

Should I allow for some kind of slack in the inequalities? One bad reward, bad data point, or weird unusual state could ruin things for everyone. Inequalities are unforgiving.
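One softening I haven’t tried (a sketch under my own naming, not in the code below): give each Bellman inequality a nonnegative slack variable and penalize the slacks, so an outlier transition bends its own constraint instead of distorting the whole program.

```python
# Sketch: per-transition slack on the Bellman inequalities with an L1 penalty.
# bprev, bnext, rewards are stacked per-transition arrays, as built below.
import numpy as np
import cvxpy as cvx

def slack_bellman(bprev, bnext, rewards, alpha, gamma, penalty=10.0):
    m = bprev.shape[0]
    slack = cvx.Variable(m, nonneg=True)
    constraints = [bprev @ alpha + slack >= rewards + gamma * (bnext @ alpha)]
    # Add `penalty * cvx.sum(slack)` to the loss so slack is only used
    # when a constraint is too expensive to satisfy exactly.
    return constraints, penalty * cvx.sum(slack)
```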

Gamma seemed to matter a decent amount.

The regularization of alpha seemed largely irrelevant.

Epsilon greediness seems to not matter much either.

Future ideas:

Might be good to replace the sampling of a with a Sum of Squares condition over the variable a (a rough sketch of this follows after the next idea).

Should I damp the update in some way? Add a cost to changing alpha from its previous value, a kind of damped update / using a prior. (The improved version below does a bit of this with its `cvx.abs(alpha - alpha.value)` penalty.)
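On the Sum of Squares idea above: since Q is quadratic in the action, asking the Bellman inequality to hold for every a' amounts to asking a univariate quadratic in a' to be nonnegative, which is a tiny 2x2 semidefinite condition per transition (turning the LP into an SDP). A hedged sketch, with names of my own invention:

```python
# Sketch of the Sum-of-Squares / for-all-actions constraint. For a quadratic
# p(a) = p0 + p1*a + p2*a**2, p(a) >= 0 for all real a iff the 2x2 Gram
# matrix [[p0, p1/2], [p1/2, p2]] is positive semidefinite.
import cvxpy as cvx

def sos_bellman_constraint(q_sa, r, c_next, gamma):
    # q_sa: cvxpy expression for Q(s, a) at the transition's own action
    # c_next: (c0, c1, c2) cvxpy expressions with Q(s', a') = c0 + c1*a' + c2*a'**2
    p0 = q_sa - r - gamma * c_next[0]
    p1 = -gamma * c_next[1]
    p2 = -gamma * c_next[2]
    G = cvx.bmat([[p0, p1 / 2], [p1 / 2, p2]])
    return [G >> 0]
```

Strictly, the pendulum's action is bounded to [-2, 2], so requiring nonnegativity over all real a' is a bit stronger than necessary.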

```python
import gym
import numpy as np
import cvxpy as cvx
from numpy.polynomial.chebyshev import chebval

env = gym.make('Pendulum-v0')
print(env.action_space)
print(env.observation_space)
print(env.observation_space.high)
print(env.observation_space.low) # [1,1,8]
print(env.action_space.high) # -2
print(env.action_space.low)

chebdeg = 4
alpha = cvx.Variable(3*chebdeg**3)
gamma = 1. - 1./100

def basis(s, a):
    # Tensor product of Chebyshev polynomials in the (rescaled) state
    # with a quadratic polynomial in the action.
    f1 = chebval(s[0]/1., np.eye(chebdeg)).reshape(-1,1,1,1)
    f2 = chebval(s[1]/1., np.eye(chebdeg)).reshape(1,-1,1,1)
    f3 = chebval(s[2]/8., np.eye(chebdeg)).reshape(1,1,-1,1)
    f4 = ((a/2)**np.arange(3)).reshape(1,1,1,-1)
    return f1*f2*f3*f4

def evalb(alpha, s, a):
    # Q(s,a) as the coefficients alpha contracted against the basis.
    f = basis(s, a)
    return alpha*f.flatten()

def maxaction(alpha, obs):
    # Q is quadratic in the action, so fit a parabola through three action
    # values and take its vertex (assumes it is concave down; see the text).
    f1 = np.sum(evalb(alpha.value, obs, 2))
    f2 = np.sum(evalb(alpha.value, obs, 0))
    f3 = np.sum(evalb(alpha.value, obs, -2))
    coeff = np.polyfit([2,0,-2], [f1,f2,f3], deg=2)
    action = -coeff[1]/2/coeff[0]
    action = min(max(action,-2),2)
    return np.array([action])

constraints = []
objective = 0
for x in range(4):
    constraints = []
    objective = 0
    epsilon = 1.2/(x+1)
    print("epsilon: ", epsilon)
    for i_episode in range(50):
        observation = env.reset()
        reward = -100
        for t in range(100):
            prev_obs = observation
            prev_reward = reward
            # epsilon-greedy exploration
            if np.random.rand() < epsilon:
                action = env.action_space.sample()
            else:
                action = maxaction(alpha, observation)
            observation, reward, done, info = env.step(action)
            # Bellman inequality for this transition (perturbed action on the RHS).
            constraints.append(evalb(alpha, prev_obs, action) >= prev_reward + gamma*evalb(alpha, observation, action + np.random.randn()*0.2)) #maxaction(alpha, observation)))
            objective += evalb(alpha, observation, env.action_space.sample())
            if done:
                print("Episode finished after {} timesteps".format(t+1))
                break
    loss = 0.1*cvx.sum(cvx.abs(alpha)) + objective
    prob = cvx.Problem(cvx.Minimize(loss), constraints)
    # The optimal objective value is returned by `prob.solve()`.
    print("solving problem")
    result = prob.solve()
    print(result)
    # The fitted coefficients are stored in `alpha.value`.
    print(alpha.value)

# inspection loop
for i_episode in range(4):
    observation = env.reset()
    #env.env.state[0] = np.random.randn()*sigma
    for t in range(200):
        env.render()
        #print(observation)
        prev_obs = observation
        action = maxaction(alpha, observation)
        #print(action)
        observation, reward, done, info = env.step(action)
        if done:
            print("Episode finished after {} timesteps".format(t+1))
            break
```

Edit:

An improved version. I fixed the bug in my maxaction function: I shouldn’t have been assuming that the fitted parabola was always concave down.

I also vectorized the constraint construction slightly, which fairly significantly improves the solve time. Not much time is spent in cvxpy now; the solve is dominated by about 3 legitimate seconds in OSQP.

You can flip stuff in and out of the loops to try different versions. This method is off-policy, so I could keep old data around forever; however, doing so mostly just slowed the solve down.

```python
import gym
import numpy as np
import cvxpy as cvx
from numpy.polynomial.chebyshev import chebval

env = gym.make('Pendulum-v0')
print(env.action_space)
print(env.observation_space)
print(env.observation_space.high)
print(env.observation_space.low) # [1,1,8]
print(env.action_space.high) # -2
print(env.action_space.low)

chebdeg = 3
alpha = cvx.Variable(3*chebdeg**3)
alpha.value = np.zeros(3*chebdeg**3)
gamma = 1. - 1./100

def basis(s, a):
    # Tensor product of Chebyshev polynomials in the state with a
    # quadratic polynomial in the action.
    f1 = chebval(s[0]/1., np.eye(chebdeg)).reshape(-1,1,1,1)
    f2 = chebval(s[1]/1., np.eye(chebdeg)).reshape(1,-1,1,1)
    f3 = chebval(s[2]/8., np.eye(chebdeg)).reshape(1,1,-1,1)
    f4 = ((a/2)**np.arange(3)).reshape(1,1,1,-1)
    return f1*f2*f3*f4

def evalb(alpha, s, a):
    f = basis(s, a)
    return alpha*f.flatten()

def maxaction(alpha, obs):
    # Fit a parabola through Q at three actions. If it is concave down,
    # take the vertex; otherwise the max is at one of the endpoints.
    f1 = np.sum(evalb(alpha.value, obs, 2))
    f2 = np.sum(evalb(alpha.value, obs, 0))
    f3 = np.sum(evalb(alpha.value, obs, -2))
    coeff = np.polyfit([2,0,-2], [f1,f2,f3], deg=2)
    if coeff[0] < 0:
        action = -coeff[1]/2/coeff[0]
        action = min(max(action,-2),2)
    elif f1 > f3:
        action = 2
    else:
        action = -2
    return np.array([action])

constraints = []
objective = 0
for x in range(4):
    constraints = []
    objective = 0
    epsilon = 1.2/(x+1)
    print("epsilon: ", epsilon)
    for i_episode in range(50):
        observation = env.reset()
        observations = []
        rewards = []
        actions = []
        reward = -100
        for t in range(100):
            prev_obs = observation
            prev_reward = reward
            if np.random.rand() < epsilon:
                action = env.action_space.sample()
            else:
                action = maxaction(alpha, observation)
            observation, reward, done, info = env.step(action)
            observations.append(observation)
            rewards.append(reward)
            actions.append(action)
            #constraints.append(evalb(alpha, prev_obs, action) >= prev_reward + gamma*evalb(alpha, observation, action + np.random.randn()*0.2)) #maxaction(alpha, observation)))
            #objective += evalb(alpha, observation, env.action_space.sample())
            if done:
                print("Episode finished after {} timesteps".format(t+1))
                break
        # Vectorized Bellman inequalities for the whole episode.
        bprev = np.array([basis(s,a).flatten() for (s,a) in zip(observations[:-1], actions[1:])])
        bnext = np.array([basis(s, a + np.random.randn()*0.2).flatten() for (s,a) in zip(observations[1:], actions[1:])])
        rewards = np.array(rewards)[:-1]
        constraints += [(bprev@alpha) >= rewards + gamma*(bnext)@alpha]
        bcost = np.array([basis(s, env.action_space.sample()).flatten() for s in observations])
        objective += cvx.sum(bcost@alpha) # np.sum(bnext,axis=0) * alpha
    # Damped update: penalize moving alpha away from its previous value.
    loss = 0.1*cvx.sum(cvx.abs(alpha - alpha.value)) + objective
    prob = cvx.Problem(cvx.Minimize(loss), constraints)
    print("solving problem")
    result = prob.solve(verbose=True)
    print(result)
    print(alpha.value)

# inspection loop
for i_episode in range(4):
    observation = env.reset()
    #env.env.state[0] = np.random.randn()*sigma
    for t in range(200):
        env.render()
        #print(observation)
        prev_obs = observation
        action = maxaction(alpha, observation)
        #print(action)
        observation, reward, done, info = env.step(action)
        if done:
            print("Episode finished after {} timesteps".format(t+1))
            break
```

Hi Ben,

I am a Chinese student at Southeast University.

Is it possible to use reinforcement learning to solve linear programming?

Regards,

Lei

Hi Lei,

If you mean using reinforcement learning to solve linear programs, I’m not sure what that would mean exactly. I suppose it is possible. People are using machine learning methods to accelerate mixed integer programming, but linear programming algorithms are so good already that I’d be surprised if there is much to be gained by throwing machine learning techniques into the mix. Faster heuristic pivot selection, maybe?

If you mean using linear programming to solve reinforcement learning problems, that’s what I do above. There is literature to be found and it is interesting, but I don’t necessarily recommend it; it is a non-standard approach to Q-learning.