Re: [Computer-go] Monte-Carlo Tree Search as Regularized Policy Optimization

2020-07-19 Thread Kensuke Matsuzaki
I used stochastic sampling at internal nodes, because of this:

> During the forward simulation phase of SEARCH, the action at each node x is
> selected by sampling a ∼ π̄(·|x). As a result, the full imaginary trajectory
> is generated consistently according to policy π̄. In this section, we
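A minimal sketch of that forward-simulation step, assuming π̄ has already been
computed and stored per node; the node fields "children" and "pi_bar" are
illustrative names rather than anything in Leela Zero:

    import numpy as np

    def select_action(node, rng):
        # SEARCH descent step: sample a ~ pi_bar(.|x) instead of taking the
        # argmax of a PUCT/UCB score.
        actions = list(node.children.keys())
        probs = np.array([node.pi_bar[a] for a in actions], dtype=float)
        probs /= probs.sum()  # guard against rounding drift
        return actions[rng.choice(len(actions), p=probs)]

    def simulate(root, rng):
        # Walk down the tree by repeated sampling, so the whole imaginary
        # trajectory is generated consistently according to pi_bar.
        node, path = root, [root]
        while node.children:  # stop at an unexpanded leaf
            node = node.children[select_action(node, rng)]
            path.append(node)
        return path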

Re: [Computer-go] Monte-Carlo Tree Search as Regularized Policy Optimization

2020-07-19 Thread Daniel
@Kensuke I suppose all the proposed algorithms ACT, SEARCH, and LEARN are meant to be used during training, no? I think I understand ACT and LEARN, but I am not sure about SEARCH, for which they say this:

> During search, we propose to stochastically sample actions according to π̄
> instead of the
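For reference, my reading of the paper is that π̄ has the closed form
π̄(a) ∝ λ_N·π_θ(a)/(α - q(a)), with α found by binary search so that the
probabilities sum to one, and with λ_N shrinking as visits accumulate. A rough
sketch under that reading; the names and the bracketing bounds are my
paraphrase, not code from the paper:

    import numpy as np

    def pi_bar(q, prior, lam, iters=50):
        # Solve  max_y  <q, y> - lam * KL(prior || y)  over the simplex.
        # The maximizer is pi_bar(a) = lam * prior(a) / (alpha - q(a)) for a
        # normalizing alpha > max(q); prior must be strictly positive.
        alpha_lo = np.max(q + lam * prior)  # at this alpha, sum(pi_bar) >= 1
        alpha_hi = np.max(q) + lam          # at this alpha, sum(pi_bar) <= 1
        for _ in range(iters):
            alpha = 0.5 * (alpha_lo + alpha_hi)
            if (lam * prior / (alpha - q)).sum() > 1.0:
                alpha_lo = alpha            # too much mass: raise alpha
            else:
                alpha_hi = alpha
        p = lam * prior / (alpha - q)
        return p / p.sum()

Since λ_N decays as the visit count grows, π̄ concentrates on the highest-q
action at high playout counts, so sampling from it should stay close to
ordinary MCTS behaviour once the search is deep.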

Re: [Computer-go] Monte-Carlo Tree Search as Regularized Policy Optimization

2020-07-19 Thread David Wu
I imagine that, at low visits at least, "ACT" behaves similarly to Leela Zero's "LCB" move selection, which also sometimes selects a move that is not the max-visits move, if its value estimate has recently been found to be sufficiently higher to balance the fact that it is lower
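For comparison, a simplified sketch of the LCB idea: a lower confidence bound
on each move's value, maximized over moves that are not too under-visited. I
believe Leela Zero's actual implementation uses a Student's-t bound on the
observed eval variance, so the constant and the node fields here (visits,
value_mean, value_var) are placeholders:

    import math

    def lcb_move(root, z=1.28):
        # Pick the root move with the best lower confidence bound on its
        # value, rather than simply the most-visited move.
        best, best_lcb = None, -math.inf
        max_visits = max(c.visits for c in root.children.values())
        for move, child in root.children.items():
            if child.visits < 2 or child.visits < 0.1 * max_visits:
                continue  # skip barely-explored moves
            stderr = math.sqrt(child.value_var / child.visits)
            lcb = child.value_mean - z * stderr
            if lcb > best_lcb:
                best, best_lcb = move, lcb
        return best

A less-visited move can therefore still be chosen when its value estimate is
high and stable enough, which is the same qualitative effect described above
for ACT: the selected move need not be the most-visited one.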

Re: [Computer-go] Monte-Carlo Tree Search as Regularized Policy Optimization

2020-07-19 Thread Kensuke Matsuzaki
Hi, I couldn't improve Leela Zero's strength by implementing SEARCH and ACT.
https://github.com/zakki/leela-zero/commits/regularized_policy

On Fri, Jul 17, 2020 at 2:47, Rémi Coulom wrote:

> This looks very interesting.
>
> From a quick glance, it seems the improvement is mainly when the number of
> playouts