Checking out the OpenAI Baselines

These are my notes on trying to edit the OpenAI Baselines codebase to balance a cartpole from the down position. They are pretty scattered.

First I just ran the built-in examples to get a feel for things and to try out deep Q-networks.

The PPO algorithm at the bottom is still the recommended one, I think. I got the pole kind of upright and it would balance for a couple of seconds, maybe. Work in progress. The ppo2 example has some funky batching going on that you need to reshape your observations for.


https://github.com/openai/baselines

Some of these algorithms are ready to roll on anything you throw at them.

They use python3.

pip3 install tensorflow-gpu

pip3 install cloudpickle

Running a deepq cartpole example

Check out the source:

https://github.com/openai/baselines/blob/master/baselines/deepq/experiments/train_cartpole.py
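For reference, the core of that script looks roughly like this. Parameter names have shifted between baselines versions, so treat it as a sketch rather than the actual file:

import gym
from baselines import deepq

def callback(lcl, _glb):
    # stop training once the average reward over the last 100 episodes hits ~199
    return lcl['t'] > 100 and sum(lcl['episode_rewards'][-101:-1]) / 100 >= 199

env = gym.make("CartPole-v0")
model = deepq.models.mlp([64])          # a small fully connected Q-network
act = deepq.learn(
    env,
    q_func=model,
    lr=1e-3,
    max_timesteps=100000,
    buffer_size=50000,                  # replay buffer size
    exploration_fraction=0.1,           # anneal epsilon over the first 10% of steps
    exploration_final_eps=0.02,
    print_freq=10,
    callback=callback,
)
act.save("cartpole_model.pkl")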

What is all this shit?

It has quickly suppressed exploration.

The reward has a clear upward trend, but the increase dies off around episode 300 or so. Learning rate decay?

It took around 300 episodes to reach a reward of 200.

Pretty good looking

The pong trainer is now called run_atari.py

sudo apt install python3-opencv

Hmm. It defaults to playing Breakout, actually. Highly exploratory at first. Still basically totally random after a couple of minutes.

The replay buffer size is actually smaller than I’d expected. 10000?

simple.py is where the learn function is defined.

Hmmm, exploration fraction is the fraction of total training time over which the random play gets annealed away.
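In other words, if I'm reading it right, the epsilon schedule is just a linear ramp. Something like this (my own illustration of the arithmetic, not the baselines code):

def epsilon_at(t, total_timesteps, exploration_fraction=0.1, final_eps=0.02):
    # linearly anneal epsilon from 1.0 down to final_eps over the first
    # exploration_fraction * total_timesteps steps, then hold it there
    anneal_steps = exploration_fraction * total_timesteps
    frac = min(t / anneal_steps, 1.0)
    return 1.0 + frac * (final_eps - 1.0)

# e.g. with total_timesteps=100000: epsilon_at(5000, 100000) gives about 0.51,
# and anything past t=10000 stays at 0.02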

Is it getting better? Maybe. Still at 90% random; reward ~0.2.

After about 2 hours and 13,000 episodes it is still at 32% exploration; reward ~2.

How much of that is from the reduction of randomness, versus improvement of the Q-network? I suppose I could make a graph of reward vs. exploration % using the current network.
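Something like the following would do it, I think. Sketched against cartpole for simplicity; the model path is hypothetical, and the Breakout version would also need the same Atari wrappers used during training:

import random
import numpy as np
import gym
from baselines import deepq

env = gym.make("CartPole-v0")
act = deepq.load("cartpole_model.pkl")   # hypothetical path to a saved act function

def episode_reward(eps):
    # roll out one episode with a fixed exploration rate eps
    obs, done, total = env.reset(), False, 0.0
    while not done:
        if random.random() < eps:
            action = env.action_space.sample()
        else:
            action = act(np.array(obs)[None])[0]
        obs, rew, done, _ = env.step(action)
        total += rew
    return total

for eps in [0.9, 0.5, 0.3, 0.1, 0.0]:
    print(eps, np.mean([episode_reward(eps) for _ in range(10)]))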


Trying ppo2 now (this is apparently OpenAI's go-to method at the moment for a first attempt).

https://blog.openai.com/openai-baselines-ppo/

I don't know exactly what to attribute it to, but it is very quickly outperforming the Q-learning run, like in 30 seconds.

clipfrac is probably how often the clipping bound gets hit

approxkl is the approximate Kullback-Leibler divergence between the old and new policies.
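To pin those two down for myself, here is a numpy sketch of the clipped surrogate loss and the diagnostics as I understand them (variable names are mine, not necessarily what ppo2 uses internally):

import numpy as np

def ppo_diagnostics(neglogp_new, neglogp_old, advantages, cliprange=0.2):
    ratio = np.exp(neglogp_old - neglogp_new)            # pi_new(a|s) / pi_old(a|s)
    unclipped = -advantages * ratio
    clipped = -advantages * np.clip(ratio, 1.0 - cliprange, 1.0 + cliprange)
    pg_loss = np.mean(np.maximum(unclipped, clipped))    # clipped surrogate policy loss
    approxkl = 0.5 * np.mean((neglogp_new - neglogp_old) ** 2)  # crude KL estimate
    clipfrac = np.mean(np.abs(ratio - 1.0) > cliprange)         # fraction of samples where clipping binds
    return pg_loss, approxkl, clipfrac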

Looking inside the networks:

pi is the action probability distribution layer

vf is the value function

The MlpPolicy class will be useful for lower-dimensional tasks. That is the multilayer perceptron: a bunch of fully connected layers.
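Structurally it is something like this. A sketch of the shapes only, with a shared trunk for brevity; I believe the real MlpPolicy uses its own fc() helper, TF1 variable scopes, and separate hidden layers for the pi and vf heads:

import tensorflow as tf

def mlp_policy(obs, n_actions, hidden=64):
    # two tanh hidden layers feeding a policy head and a value head
    h1 = tf.layers.dense(obs, hidden, activation=tf.tanh, name="fc1")
    h2 = tf.layers.dense(h1, hidden, activation=tf.tanh, name="fc2")
    pi = tf.layers.dense(h2, n_actions, name="pi")   # action distribution parameters
    vf = tf.layers.dense(h2, 1, name="vf")           # state-value estimate
    return pi, vf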

On a non-simulated system, lower the process count to 1 (the -np option on mpirun).

Or just follow the mujoco example, which uses only 1 CPU.

Ah. This whole time it was saving into a time-stamped folder under /tmp.


Without threshold stopping, the thing went absolutely insane. You need to box in the pole.


Trying to make the reward the square of the height, so that it emphasizes the total height achieved more? Maybe only give reward for a new record height? Plus a reward for any height nearly all the way up. (A sketch combining these ideas is below.)

Beefing up the force magnitude. I think it is a little too wimpy to get the pole over vertical.

Maybe lower the episode length so it spends more time trying to get higher rather than just trying to stay in bounds.
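One hypothetical way to combine those reward ideas (nothing from the actual code, just a sketch of the shaping):

import numpy as np

def shaped_reward(theta, x, best_height, x_threshold=2.4):
    # theta = 0 is upright, so height is +1 fully up and -1 hanging down
    height = np.cos(theta)
    r = ((height + 1.0) / 2.0) ** 2          # squared so high swings dominate the signal
    if height > best_height:                 # bonus only for a new record height
        r += height - best_height
        best_height = height
    if height > 0.95:                        # extra bonus for being nearly all the way up
        r += 1.0
    if abs(x) > x_threshold:                 # penalize leaving the track, to box it in
        r -= 5.0
    return r, best_height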

Wait, should I not add those vector wrapper guys?
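For reference, the mujoco ppo2 example wraps its single env in these guys before calling learn(), roughly like this; import paths are from memory and may differ by version:

import gym
from baselines.common.vec_env.dummy_vec_env import DummyVecEnv
from baselines.common.vec_env.vec_normalize import VecNormalize

env = DummyVecEnv([lambda: gym.make("CartPole-v0")])  # the custom env would go here instead
env = VecNormalize(env)   # running normalization of observations and returns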


Hey! Sort of kind of success… It gets it up there sometimes, but then it can’t really keep it up there… That’s odd.

Wow. This is not a good integrator. Under no force the pendulum is obviously gaining energy. Should that matter much? I switched around the update ordering to get a leapfrog-style (semi-implicit) integrator. Much better.
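Concretely, the fix is just to update the velocities first and then advance the positions with the new velocities, instead of the other way around:

def integrate(x, x_dot, theta, theta_dot, xacc, thetaacc, tau):
    # semi-implicit update: velocities first, then positions from the NEW velocities.
    # The explicit-Euler ordering (positions from the old velocities) is what was
    # pumping energy into the free pendulum.
    x_dot = x_dot + tau * xacc
    theta_dot = theta_dot + tau * thetaacc
    x = x + tau * x_dot
    theta = theta + tau * theta_dot
    return x, x_dot, theta, theta_dot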

A custom cartpole instance, set up to work from the down position and edited to work with ppo2. To more closely match the mujoco example, I switched from discrete actions to a continuous force choice. I was trying to shape the reward to keep it in bounds. We're getting there, maybe. Picking the starting position happens in reset(). Most everything else is a straight copy from the gym env.
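Roughly, the edits look like this. It is a sketch of the shape of the changes, not the actual file; step() would also need to be overridden to scale the continuous action by force_mag and to use the shaped reward from above:

import numpy as np
from gym import spaces
from gym.envs.classic_control.cartpole import CartPoleEnv

class CartPoleSwingUpEnv(CartPoleEnv):
    def __init__(self):
        super().__init__()
        # continuous force in [-1, 1] instead of the discrete left/right action
        self.action_space = spaces.Box(low=-1.0, high=1.0, shape=(1,), dtype=np.float32)

    def reset(self):
        # start hanging straight down (theta = pi) with a little noise
        self.state = np.array([0.0, 0.0, np.pi, 0.0]) + self.np_random.uniform(
            low=-0.05, high=0.05, size=(4,))
        self.steps_beyond_done = None
        return np.array(self.state)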


A script to view the resulting network:
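Roughly what I have in mind. The Model constructor arguments and the checkpoint path format depend on the baselines version, so this is a sketch, not a drop-in script:

import tensorflow as tf
from baselines.ppo2 import ppo2
from baselines.ppo2.policies import MlpPolicy
from baselines.common.vec_env.dummy_vec_env import DummyVecEnv

tf.InteractiveSession()                              # ppo2's Model wants a default session
env = DummyVecEnv([lambda: CartPoleSwingUpEnv()])    # the custom env sketched above
# (if VecNormalize was used in training, its running stats would matter here too)
model = ppo2.Model(policy=MlpPolicy, ob_space=env.observation_space,
                   ac_space=env.action_space, nbatch_act=1, nbatch_train=1,
                   nsteps=1, ent_coef=0.0, vf_coef=0.5, max_grad_norm=0.5)
model.load("/tmp/<run-timestamp>/checkpoints/<checkpoint>")  # path printed by the logger

obs = env.reset()
while True:
    actions, values, states, neglogpacs = model.step(obs)
    obs, rewards, dones, infos = env.step(actions)
    env.envs[0].render()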