I used stochastic sampling at internal nodes because of this:
> During the forward simulation phase of SEARCH, the action at each node x is
> selected by sampling a ∼ π̄(·|x).
> As a result, the full imaginary trajectory is generated consistently
> according to policy π̄.
> In this section, we
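For concreteness, here is a minimal Python sketch of that step, under my reading of the paper: π̄ solves the regularized objective q·y − λ_N KL(π_θ, y), and its closed form π̄(a) = λ_N π_θ(a) / (α − q(a)) is found by bisecting on the normalizer α. The function names, c = 1.25, and the zero-visit fallback are my own choices, not code from the paper or from zakki's branch:

```python
import numpy as np

rng = np.random.default_rng()

def pi_bar(prior, q, visits, c=1.25):
    """Closed form of the regularized policy, as I read the paper:
    pi_bar(a) = lambda_N * prior(a) / (alpha - q(a)), with alpha chosen
    by bisection so the result sums to 1. Assumes prior > 0 everywhere."""
    n = visits.sum()
    if n == 0:
        return prior / prior.sum()            # no statistics yet: fall back to the prior
    lam = c * np.sqrt(n) / (n + len(prior))   # lambda_N as I read it from the paper
    lo = (q + lam * prior).max()              # at this alpha the sum is >= 1
    hi = q.max() + lam                        # at this alpha the sum is <= 1
    for _ in range(64):                       # bisection on the normalizer alpha
        alpha = 0.5 * (lo + hi)
        total = (lam * prior / (alpha - q)).sum()
        lo, hi = (alpha, hi) if total > 1.0 else (lo, alpha)
    p = lam * prior / (alpha - q)
    return p / p.sum()

def search_step(prior, q, visits):
    """SEARCH's forward simulation step: sample a ~ pi_bar(.|x)."""
    return rng.choice(len(prior), p=pi_bar(prior, q, visits))
```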
@Kensuke I suppose all the proposed algorithms ACT, SEARCH and LEARN are
meant to be used during training, no?
I think I understand ACT and LEARN, but I am not sure about SEARCH, for which
they say this:
> During search, we propose to stochastically sample actions according to
> π̄ instead of the
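As I read the elided part of that sentence, the sampling replaces a deterministic arg-max selection rule. For contrast, a minimal sketch of the usual AlphaZero-style PUCT selection that SEARCH would substitute (the name puct_step and c = 1.25 are illustrative, not Leela Zero's exact constants):

```python
import numpy as np

def puct_step(prior, q, visits, c=1.25):
    """Deterministic AlphaZero-style selection: arg-max of Q plus the
    PUCT exploration bonus. SEARCH samples from pi_bar instead."""
    n = visits.sum()
    u = q + c * prior * np.sqrt(max(n, 1)) / (1.0 + visits)
    return int(np.argmax(u))
```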
I imagine that, at low visits at least, "ACT" behaves similarly to Leela
Zero's "LCB" move selection, which also has the effect of sometimes
selecting a move that is not the max-visits move, when its value estimate
has recently been found to be sufficiently higher to outweigh its lower
visit count.
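Roughly, my understanding of the LCB idea in code (a sketch, not Leela Zero's actual implementation; z, min_visits, and the input arrays are illustrative):

```python
import numpy as np

def lcb_move(means, stds, visits, z=1.96, min_visits=2):
    """Pick the move with the highest lower confidence bound on its value,
    so a lower-visit move can win if its mean is clearly higher."""
    stderr = stds / np.sqrt(np.maximum(visits, 1))
    lcb = np.where(visits >= min_visits, means - z * stderr, -np.inf)
    return int(np.argmax(lcb))
```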
Hi,
I couldn't improve Leela Zero's strength by implementing SEARCH and ACT.
https://github.com/zakki/leela-zero/commits/regularized_policy
On Fri, Jul 17, 2020 at 2:47, Rémi Coulom wrote:
>
> This looks very interesting.
>
> From a quick glance, it seems the improvement is mainly when the number of
> playouts