Re: [mlpack] Hints for A3C/PPO

2018-02-20 Thread Shangtong Zhang
p is the policy; p[a] is the probability of action a. It is multiplied by 1 to remind you that the gradient will be similar to cross-entropy (if you are familiar with the gradient of cross-entropy through a softmax operator). In my test case, the gradient is w.r.t. the state. (It’s better to
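For reference, the identity being hinted at is the textbook gradient of cross-entropy through a softmax. Sketched in generic notation (z are the logits, p = softmax(z), a the chosen action; the symbol z is mine, not from the PR):

    \frac{\partial}{\partial z_i}\bigl(-\log p_a\bigr) = p_i - \mathbf{1}[i = a]

so scaling the policy-output gradient by 1 (or, more generally, by an advantage term) just reproduces this cross-entropy-style gradient times that scalar.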

Re: [mlpack] Hints for A3C/PPO

2018-02-20 Thread Shangtong Zhang
> So that was stupid of me: forward() in policy.hpp is just computing the softmaxes for the input (first param) and storing it in output (second param) -> does that mean policy has to be the last layer of my neural net? See the comment here https://github.com/mlpack/mlpack/pull/934/files#dif
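For context, the forward pass of a softmax layer does roughly the following. This is a minimal Armadillo sketch with an illustrative signature, not the actual code from the PR:

    #include <armadillo>

    // Minimal sketch: turn raw scores (logits) into action probabilities.
    // The real layer in the PR may differ in signature and details.
    void Forward(const arma::vec& input, arma::vec& output)
    {
      // Subtract the max for numerical stability before exponentiating.
      arma::vec shifted = input - input.max();
      output = arma::exp(shifted);
      output /= arma::accu(output);
    }

The sketch only shows what the softmax itself computes; it says nothing about where the layer has to sit in the network, which is what the linked comment is about.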

Re: [mlpack] Hints for A3C/PPO

2018-02-19 Thread Shangtong Zhang
Yes. First try the vanilla implementation; if it doesn’t work, augment it with experience replay (ER). However, I would suggest not merging your vanilla implementation with ER, because it’s theoretically wrong, as I mentioned before. I would also suggest not merging your vanilla implementation wi

Re: [mlpack] Hints for A3C/PPO

2018-02-19 Thread Shangtong Zhang
For TRPO you need to read the original paper; I don’t have a better idea. Starting from a vanilla policy gradient is good; however, the main concern is that, from my experience, you need either experience replay or multi-workers to make a non-linear function approximator work (they can give you unco
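The point about decorrelation is the standard motivation for replay buffers. Here is a minimal, self-contained C++ sketch of uniform experience replay (the Transition fields and class name are hypothetical, not mlpack's actual replay code):

    #include <cstddef>
    #include <random>
    #include <vector>

    // Hypothetical transition record; field names are illustrative.
    struct Transition
    {
      std::vector<double> state;
      int action;
      double reward;
      std::vector<double> nextState;
      bool terminal;
    };

    // Minimal uniform replay buffer: storing transitions and sampling them
    // at random breaks the temporal correlation of consecutive steps, which
    // is what helps a non-linear function approximator train stably.
    class ReplayBuffer
    {
     public:
      explicit ReplayBuffer(size_t capacity) : capacity(capacity), pos(0) { }

      void Store(const Transition& t)
      {
        if (buffer.size() < capacity)
          buffer.push_back(t);
        else
          buffer[pos] = t;
        pos = (pos + 1) % capacity;
      }

      // Assumes the buffer is non-empty.
      Transition Sample(std::mt19937& rng) const
      {
        std::uniform_int_distribution<size_t> dist(0, buffer.size() - 1);
        return buffer[dist(rng)];
      }

     private:
      size_t capacity;
      size_t pos;
      std::vector<Transition> buffer;
    };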

[mlpack] Hints for A3C/PPO

2018-02-19 Thread Shangtong Zhang
Hi Chirag, I think it would be better to also cc the mailing list. I assume you are trying to implement A3C or something like this. Actually this has almost been done; see my PR https://github.com/mlpack/mlpack/pull/934. This is my work from last summer. To