Re: [Computer-go] Monte-Carlo Tree Search as Regularized Policy Optimization

2020-07-19 Thread Kensuke Matsuzaki
I used stochastic sampling at internal nodes because of this:
> During the forward simulation phase of SEARCH, the action at each node x is
> selected by sampling a ∼ π̄(·|x).
> As a result, the full imaginary trajectory is generated consistently
> according to policy π̄.

> In this section, we establish our main claim, namely that AlphaZero’s action
> selection criteria can be interpreted as approximating the solution to a
> regularized policy-optimization objective.

I think they say that UCT and PUCT are approximations of sampling directly from π̄,
but I haven't understood Section 3 well.
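
For concreteness, here is a rough sketch of what that sampling step can look
like (this is only an illustration, not the code in my branch; it assumes the
closed form π̄ = λ_N·π_θ/(α − q) from the paper, with α found by binary search
so the probabilities sum to 1, and the c_puct value and helper names here are
placeholders):

import numpy as np

def pi_bar(q, prior, visits, c_puct=1.25, tol=1e-6):
    # Regularized policy from the paper (as I read it):
    #   pi_bar(a) = lambda_N * prior(a) / (alpha - q(a)),
    # with lambda_N = c_puct * sqrt(N) / (N + |A|) and alpha chosen by
    # binary search so that pi_bar sums to 1.
    # Assumes strictly positive priors and visits.sum() > 0.
    n_total = visits.sum()
    lam = c_puct * np.sqrt(n_total) / (n_total + len(q))
    lo = np.max(q + lam * prior)      # at this alpha, sum(pi_bar) >= 1
    hi = np.max(q) + lam              # at this alpha, sum(pi_bar) <= 1
    while hi - lo > tol:
        alpha = 0.5 * (lo + hi)
        if np.sum(lam * prior / (alpha - q)) > 1.0:
            lo = alpha                # too much mass -> raise alpha
        else:
            hi = alpha
    p = lam * prior / (hi - q)
    return p / p.sum()                # absorb the leftover tolerance

def select_child_search(q, prior, visits, rng=None):
    # SEARCH-style selection: draw the child from pi_bar instead of
    # taking the deterministic pUCT argmax.
    rng = np.random.default_rng() if rng is None else rng
    return rng.choice(len(q), p=pi_bar(q, prior, visits))

The same π̄ is what ACT and LEARN use at the root and as the training target,
if I read the paper correctly.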




-- 
Kensuke Matsuzaki


Re: [Computer-go] Monte-Carlo Tree Search as Regularized Policy Optimization

2020-07-19 Thread Daniel
@Kensuke I suppose all of the proposed algorithms (ACT, SEARCH, and LEARN) are
meant to be used during training, no?
I think I understand ACT and LEARN, but I am not sure about SEARCH, for which
they say this:

> During search, we propose to stochastically sample actions according to
> π̄ instead of the deterministic action selection rule of Eq. 1.

This sounds much like the random selection done at the root with
temperature, but this time applied at internal nodes.
Does it mean the pUCT formula is not used? Why does the selection have to
be stochastic now?
During selection, you compute π̄ every time from (q, π_θ, n_visits), so I
suppose π̄ already has everything it needs to balance exploration and
exploitation.
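
For reference, the deterministic rule of Eq. 1 that I have in mind is the usual
AlphaZero-style pUCT argmax, roughly like this sketch (c_puct and the names
here are placeholders, not the paper's notation):

import numpy as np

def select_child_puct(q, prior, visits, c_puct=1.25):
    # AlphaZero-style pUCT selection (the deterministic rule of Eq. 1):
    #   argmax_a  Q(a) + c_puct * P(a) * sqrt(sum_b N(b)) / (1 + N(a))
    exploration = c_puct * prior * np.sqrt(visits.sum()) / (1.0 + visits)
    return int(np.argmax(q + exploration))

As far as I can tell, SEARCH replaces this argmax with a draw from π̄ at every
internal node, which is the part that surprises me.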




Re: [Computer-go] Monte-Carlo Tree Search as Regularized Policy Optimization

2020-07-19 Thread David Wu
I imagine that, at low visits at least, "ACT" behaves similarly to Leela
Zero's "LCB" move selection, which also has the effect of sometimes
selecting a move that is not the max-visits move, if its value estimate has
recently been found to be sufficiently larger to outweigh the fact that it
has a lower prior and fewer visits (which is typically why the move
wouldn't have been the max-visits move in the first place). It also scales
in an interesting way with the empirically observed playout-by-playout
variance of each move, but I think by far the most important part is that
it can use a sufficiently confident high value to override max-visits.

The gain from "LCB" in match play is, as I recall, on the very rough order
of 100 Elo, although it could be more or less depending on match conditions,
which neural net is used, and other things. So for LZ at least,
"ACT"-like behavior at low visits is not new.




Re: [Computer-go] Monte-Carlo Tree Search as Regularized Policy Optimization

2020-07-19 Thread Kensuke Matsuzaki
Hi,

I couldn't improve Leela Zero's strength by implementing SEARCH and ACT.
https://github.com/zakki/leela-zero/commits/regularized_policy




-- 
Kensuke Matsuzaki


Re: [Computer-go] Monte-Carlo Tree Search as Regularized Policy Optimization

2020-07-16 Thread Rémi Coulom
This looks very interesting.

From a quick glance, it seems the improvement is mainly when the number of
playouts is small. Also, they don't test on the game of Go. Has anybody
tried it?

I will take a deeper look later.



[Computer-go] Monte-Carlo Tree Search as Regularized Policy Optimization

2020-07-16 Thread Ray Tayek

https://old.reddit.com/r/MachineLearning/comments/hrzooh/r_montecarlo_tree_search_as_regularized_policy/


--
Honesty is a very expensive gift. So, don't expect it from cheap people - 
Warren Buffett
http://tayek.com/
