> So that was stupid of me; forward() in policy.hpp just computes the
> softmaxes for the input (first param) and stores them in the output
> (second param). Does that mean the policy has to be the last layer of my
> neural net?
See the comment here
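For reference, a minimal sketch of the described behavior, i.e. a forward pass that softmaxes the input into the output (Armadillo types assumed; this is an illustration, not mlpack's actual code):

```cpp
#include <armadillo>

// Sketch: compute a column-wise softmax of `input` and store it in `output`.
// The (input, output) signature mirrors the convention described above.
void Forward(const arma::mat& input, arma::mat& output)
{
  // Subtract each column's maximum before exponentiating, for stability.
  arma::mat shifted = input.each_row() - arma::max(input, 0);
  output = arma::exp(shifted);
  // Normalize each column so its entries form a probability distribution.
  output.each_row() /= arma::sum(output, 0);
}
```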
Yes. First try the vanilla implementation; if it doesn’t work, augment it with
experience replay (ER).
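For context, the vanilla update in question is presumably plain REINFORCE, i.e. ascending the score function weighted by the sampled return:

```latex
% Vanilla policy gradient (REINFORCE): the gradient of the expected
% return, estimated from trajectories sampled from the current policy.
\nabla_\theta J(\theta)
  = \mathbb{E}_{\pi_\theta}\!\left[\nabla_\theta \log \pi_\theta(a_t \mid s_t)\, G_t\right],
\qquad
\theta \leftarrow \theta + \alpha\, \nabla_\theta J(\theta)
```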
However, I would suggest not merging your vanilla implementation with ER,
because that combination is theoretically wrong, as I mentioned before.
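For the record, the theoretical issue is that the vanilla gradient is an on-policy expectation: transitions replayed from an older policy would need an importance-sampling correction that plain REINFORCE omits (a per-step ratio is sketched below; an exact correction over a whole trajectory multiplies the ratios along it):

```latex
% Replayed samples follow pi_old, not pi_theta, so an unbiased estimate
% needs likelihood-ratio weights.
\nabla_\theta J(\theta)
  = \mathbb{E}_{\pi_{\text{old}}}\!\left[
      \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\text{old}}(a_t \mid s_t)}\,
      \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, G_t
    \right]
```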
For TRPO you need to read the original paper; I don’t have a better suggestion.
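For reference, the core of that paper (Schulman et al., 2015) is a KL-constrained surrogate objective:

```latex
% TRPO: maximize the surrogate advantage subject to a trust-region
% bound on the average KL divergence from the old policy.
\max_\theta \;
  \mathbb{E}\!\left[
    \frac{\pi_\theta(a \mid s)}{\pi_{\theta_{\text{old}}}(a \mid s)}\,
    A^{\pi_{\theta_{\text{old}}}}(s, a)
  \right]
\quad \text{s.t.} \quad
  \mathbb{E}\!\left[
    D_{\mathrm{KL}}\!\left(\pi_{\theta_{\text{old}}}(\cdot \mid s)
      \,\middle\|\, \pi_\theta(\cdot \mid s)\right)
  \right] \le \delta
```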
Starting from a vanilla policy gradient is good. However, the main concern is
that, in my experience, you need either experience replay or multiple workers
to make a non-linear function approximator work (they give you less correlated
samples).
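If ER does turn out to be necessary, a minimal buffer is just a bounded store of transitions sampled uniformly at random; a sketch with assumed types (not an existing mlpack API):

```cpp
#include <cstddef>
#include <deque>
#include <random>
#include <vector>

// Hypothetical transition record; the field types are assumptions.
struct Transition
{
  std::vector<double> state, nextState;
  size_t action;
  double reward;
  bool terminal;
};

// Minimal replay buffer: bounded FIFO store with uniform sampling.
class ReplayBuffer
{
 public:
  explicit ReplayBuffer(size_t capacity) : capacity(capacity) {}

  void Store(Transition t)
  {
    if (buffer.size() == capacity)
      buffer.pop_front(); // Evict the oldest transition.
    buffer.push_back(std::move(t));
  }

  // Draw `n` transitions uniformly at random (with replacement).
  std::vector<Transition> Sample(size_t n)
  {
    std::uniform_int_distribution<size_t> pick(0, buffer.size() - 1);
    std::vector<Transition> batch;
    for (size_t i = 0; i < n; ++i)
      batch.push_back(buffer[pick(rng)]);
    return batch;
  }

 private:
  size_t capacity;
  std::deque<Transition> buffer;
  std::mt19937 rng{std::random_device{}()};
};
```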