Re: [Computer-go] Monte-Carlo Tree Search as Regularized Policy Optimization

2020-07-19 Thread Kensuke Matsuzaki
I used stochastic sampling at internal nodes because of this:
> During the forward simulation phase of SEARCH, the action at each node x is
> selected by sampling a ∼ π̄(·|x).
> As a result, the full imaginary trajectory is generated consistently
> according to policy π̄.

> In this section, we establish our main claim namely that AlphaZero’s action 
> selection criteria can be interpreted as approximating the solution to a 
> regularized policy-optimization objective.

I think they say that UCT and PUCT are approximations of direct π̄ sampling,
but I haven't understood Section 3 well.
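For concreteness, here is a minimal NumPy sketch of how I read it: Eq. 1 is the usual deterministic PUCT argmax, while π̄ solves the regularized objective in closed form, π̄(a) ∝ λ_N π_θ(a) / (α − q(a)), with α found by bisection so the result sums to 1. The constants and the bisection budget below are my own choices, not the paper's.

    import numpy as np

    def puct_select(q, prior, visits, c_puct=1.5):
        """Deterministic AlphaZero-style selection (the Eq. 1 rule)."""
        u = c_puct * prior * np.sqrt(visits.sum()) / (1.0 + visits)
        return int(np.argmax(q + u))

    def pi_bar(q, prior, visits, c_puct=1.5, iters=60):
        """Regularized policy: argmax_{p in simplex} q.p - lambda_N * KL(prior || p).

        Closed form p(a) = lambda_N * prior(a) / (alpha - q(a)); alpha is found
        by bisection so that p sums to 1. Assumes all priors are positive.
        """
        n = visits.sum()
        lam = c_puct * np.sqrt(max(n, 1)) / (n + len(q))
        lo = (q + lam * prior).max()   # at alpha = lo the sum is >= 1
        hi = q.max() + lam             # at alpha = hi the sum is <= 1
        for _ in range(iters):
            alpha = 0.5 * (lo + hi)
            if (lam * prior / (alpha - q)).sum() > 1.0:
                lo = alpha
            else:
                hi = alpha
        p = lam * prior / (hi - q)
        return p / p.sum()

    # SEARCH-style descent step: sample the child instead of taking the argmax.
    rng = np.random.default_rng(0)
    q = np.array([0.45, 0.52, 0.50])        # value estimates of the children
    prior = np.array([0.6, 0.3, 0.1])       # pi_theta
    visits = np.array([10.0, 4.0, 2.0])
    a_det = puct_select(q, prior, visits)                    # deterministic Eq. 1 choice
    a_sto = rng.choice(len(q), p=pi_bar(q, prior, visits))   # stochastic SEARCH choice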

On Mon, Jul 20, 2020 at 2:51, Daniel wrote:
>
> @Kensuke I suppose all the proposed algorithms ACT, SEARCH and LEARN are 
> meant to be used during training, no?
> I think I understand ACT and LEARN, but I am not sure about SEARCH, for which
> they say this:
>
> > During search, we propose to stochastically sample actions according to π̄
> > instead of the deterministic action selection rule of Eq. 1.
>
> This sounds much like the random selection done at the root with temperature, 
> but this time applied at internal nodes.
> Does it mean the pUCT formula is not used? Why does the selection have to be 
> stochastic now?
> On selection, you compute π_bar every time from (q, π_theta, n_visits), so I
> suppose π_bar has everything it needs to balance exploration and exploitation.
>
>
> On Sun, Jul 19, 2020 at 8:10 AM David Wu  wrote:
>>
>> I imagine that at low visits at least, "ACT" behaves similarly to Leela
>> Zero's "LCB" move selection, which also sometimes selects a move that is not
>> the max-visits move, when its value estimate has recently been found to be
>> sufficiently higher to outweigh its lower prior and lower visit count (which
>> is typically why it wasn't the max-visits move in the first place). It also
>> scales in an interesting way with the empirically observed
>> playout-by-playout variance of moves, but I think by far the most important
>> part is that a sufficiently confident high value can override max-visits.
>>
>> The gain from "LCB" in match play is, as I recall, on the very rough order
>> of 100 Elo, although it could be less or more depending on match conditions,
>> the neural net used, and other things. So for LZ at least, "ACT"-like
>> behavior at low visits is not new.
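For readers who haven't looked at the LZ code, here is a rough sketch of the LCB idea described above. The real implementation uses a Student-t quantile on the playout value variance plus extra guards; the fixed z, the minimum-visit cutoff, and the field names here are my own simplifications.

    import math

    def lcb_select(children, z=2.58, min_visits=10):
        """Pick the child with the best lower confidence bound on its value.

        `children` is a list of dicts with hypothetical keys 'visits',
        'value_sum', and 'value_sq_sum' accumulated playout by playout.
        """
        best, best_lcb = None, -float("inf")
        for c in children:
            n = c["visits"]
            if n < min_visits:
                continue                         # too noisy to trust its bound
            mean = c["value_sum"] / n
            var = max(c["value_sq_sum"] / n - mean * mean, 0.0)
            lcb = mean - z * math.sqrt(var / n)  # confident high value can win here
            if lcb > best_lcb:
                best, best_lcb = c, lcb
        # Fall back to the plain max-visits move if nothing had enough visits.
        return best if best is not None else max(children, key=lambda c: c["visits"])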
>>
>>
>> On Sun, Jul 19, 2020 at 5:39 AM Kensuke Matsuzaki  
>> wrote:
>>>
>>> Hi,
>>>
>>> I couldn't improve Leela Zero's strength by implementing SEARCH and ACT.
>>> https://github.com/zakki/leela-zero/commits/regularized_policy
>>>
>>> On Fri, Jul 17, 2020 at 2:47, Rémi Coulom wrote:
>>> >
>>> > This looks very interesting.
>>> >
>>> > From a quick glance, it seems the improvement is mainly when the number 
>>> > of playouts is small. Also they don't test on the game of Go. Has anybody 
>>> > tried it?
>>> >
>>> > I will take a deeper look later.
>>> >
>>> > On Thu, Jul 16, 2020 at 9:49 AM Ray Tayek  wrote:
>>> >>
>>> >> https://old.reddit.com/r/MachineLearning/comments/hrzooh/r_montecarlo_tree_search_as_regularized_policy/
>>> >>
>>> >>
>>> >> --
>>> >> Honesty is a very expensive gift. So, don't expect it from cheap people 
>>> >> - Warren Buffett
>>> >> http://tayek.com/
>>> >>
>>>
>>>
>>>
>>> --
>>> Kensuke Matsuzaki



-- 
Kensuke Matsuzaki


Re: [Computer-go] Monte-Carlo Tree Search as Regularized Policy Optimization

2020-07-19 Thread Kensuke Matsuzaki
Hi,

I couldn't improve Leela Zero's strength by implementing SEARCH and ACT.
https://github.com/zakki/leela-zero/commits/regularized_policy

On Fri, Jul 17, 2020 at 2:47, Rémi Coulom wrote:
>
> This looks very interesting.
>
> From a quick glance, it seems the improvement is mainly when the number of 
> playouts is small. Also they don't test on the game of Go. Has anybody tried 
> it?
>
> I will take a deeper look later.
>
> On Thu, Jul 16, 2020 at 9:49 AM Ray Tayek  wrote:
>>
>> https://old.reddit.com/r/MachineLearning/comments/hrzooh/r_montecarlo_tree_search_as_regularized_policy/
>>
>>
>> --
>> Honesty is a very expensive gift. So, don't expect it from cheap people - 
>> Warren Buffett
>> http://tayek.com/
>>



-- 
Kensuke Matsuzaki


Re: [Computer-go] Crazy Stone is playing on CGOS 9x9

2020-05-08 Thread Kensuke Matsuzaki
Rn says moves 21 and 27 were not good, but I can't understand why.
rn.6.3.945 is running on an EC2 g4dn.12xlarge, and its network is 256
channels * 20 resnet blocks.

> And congratulations to rn for beating kata in a very beautiful game:
> http://www.yss-aya.com/cgos/viewer.cgi?9x9/SGF/2020/05/08/998312.sgf
>
> I am not strong enough to appreciate all the subtleties, but the complexity 
> looks amazing.
--
Kensuke Matsuzaki


Re: [Computer-go] AI Ryusei 2018 result

2018-12-19 Thread Kensuke Matsuzaki
Hi,

> using rollouts to compensate for Leela's network being trained with the
> "wrong" komi for this competition:

Yes, and it seems that rollouts aren't useful when the trained komi is "correct".

> Our program Natsukaze also used Leela Zero recent 70 selfplay games
> to train DNN.

What would happen if Natsukaze used filtered raw training data instead of
filtered self play games?

https://github.com/gcp/leela-zero/issues/167

On Wed, Dec 19, 2018 at 7:35, Hiroshi Yamashita wrote:

> Hi,
>
> Our program Natsukaze also used Leela Zero recent 70 selfplay games
>   to train DNN.
> Ladder escape moves (4% of total games) were removed, and moves chasing a
>   non-working ladder (0.3%) were also removed. But its DNN policy was weak,
>   around CGOS 2100.
>
> Maybe it is because current LZ selfplay uses t=1 not just for the first 30
> moves but for all moves.
> I did not know this. I think this makes the selfplay games weaker by 1000+ Elo.
>
> Switch to t=1 for all self-play moves, i.e., randomcnt=999
> https://github.com/gcp/leela-zero-server/pull/81
>
> Thanks,
> Hiroshi Yamashita
>
>
> On 2018/12/19 2:01, Gian-Carlo Pascutto wrote:
> > On 17/12/18 01:53, Hiroshi Yamashita wrote:
> >> Hi,
> >>
> >> AI Ryusei 2018 was held on 15-16 December in Nihon-kiin, Japan.
> >> 14 programs played a 7-round preliminary Swiss, and the top 6 programs
> >>   played a round-robin final. Then Golaxy won.
> >>
> >> Result
> >> https://www.igoshogi.net/ai_ryusei/01/en/result.html
> >
> > It appears the 2nd place finisher after Golaxy was a hybrid of Rn and
> > Leela Zero, using rollouts to compensate for Leela's network being
> > trained with the "wrong" komi for this competition:
> >
> > https://github.com/zakki/Ray/issues/171#issuecomment-447637052
> > https://img.igoshogi.net/ai_ryusei/01/data/11.pdf
> >
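On the t=1 point quoted above, this is roughly what temperature selection at the root looks like (my own sketch, not the LZ code): t=1 samples in proportion to visit counts for the whole game, whereas t close to 0 reduces to always playing the max-visits move.

    import numpy as np

    def sample_root_move(visit_counts, t=1.0, rng=None):
        """Sample a move index from root visit counts with temperature t."""
        rng = rng or np.random.default_rng()
        v = np.asarray(visit_counts, dtype=np.float64)
        if t < 1e-3:                 # t -> 0: deterministic max-visits play
            return int(v.argmax())
        p = v ** (1.0 / t)
        p /= p.sum()
        return int(rng.choice(len(v), p=p))

    # t=1 on every move keeps exploration noise in the training data for the
    # whole game, at the cost of weaker self-play games.
    move = sample_root_move([120, 35, 10, 3], t=1.0)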

-- 
MATSUZAKI Kensuke

Re: [Computer-go] Training the value network (a possibly more efficient approach)

2017-01-11 Thread Kensuke Matsuzaki
Hi,

> How do you get the V(s) for those datasets? You play out the endgame
> with the Monte Carlo playouts?
>

Yes, I use the result of 100 playouts from the endgame position.
Sometimes the result stored in the SGF differs from the result of the playouts.
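In code form, the labeling is just this (a sketch; `run_playout` stands in for a fast playout policy and is assumed to return the game result in [-1, 1] from Black's point of view, which is not necessarily how Ray does it):

    def playout_value(position, run_playout, n=100):
        """Estimate V(s) by averaging n Monte Carlo playout results from s."""
        # The average can disagree with the result recorded in the SGF.
        return sum(run_playout(position) for _ in range(n)) / n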

zakki

Re: [Computer-go] Training the value network (a possibly more efficient approach)

2017-01-11 Thread Kensuke Matsuzaki
Hi,

I couldn't get positive experimental results with Ray.

Rn's network structures for V and W are similar and share parameters;
only the final convolutional layer differs.
I trained Rn's network to minimize the MSE of V(s) + W(s).
It uses only the KGS and GoGoD data sets, with no self-play with an RL policy.
When training only W(s), the network overfits, but training V(s) and W(s) at
the same time prevents overfitting.
But I have no idea how to use V(s) or v(s) in MCTS.
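For illustration, this is the kind of shared-trunk, two-head arrangement I mean, with the combined loss acting as a regularizer for W(s). It is a PyTorch sketch with made-up layer sizes and input planes; Rn's real network is different.

    import torch
    import torch.nn as nn

    class TwoHeadValueNet(nn.Module):
        """Shared trunk; W head predicts winning rate, V head the ownership sum."""
        def __init__(self, channels=64, blocks=4, board=19):
            super().__init__()
            layers = [nn.Conv2d(2, channels, 3, padding=1), nn.ReLU()]
            for _ in range(blocks):
                layers += [nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU()]
            self.trunk = nn.Sequential(*layers)
            # Only the final layers differ between the two outputs.
            self.w_head = nn.Sequential(nn.Conv2d(channels, 1, 1), nn.Flatten(),
                                        nn.Linear(board * board, 1), nn.Tanh())
            self.v_head = nn.Sequential(nn.Conv2d(channels, 1, 1), nn.Flatten(),
                                        nn.Linear(board * board, 1), nn.Tanh())

        def forward(self, x):
            h = self.trunk(x)
            return self.w_head(h), self.v_head(h)

    def combined_loss(model, x, w_target, v_target):
        """Sum of the two MSE terms; training both heads together regularizes W."""
        w_pred, v_pred = model(x)
        mse = nn.functional.mse_loss
        return mse(w_pred, w_target) + mse(v_pred, v_target)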

Rn.3.0-4c plays with W(s): winning rate.
http://www.yss-aya.com/19x19/cgos/cross/Rn.3.0-4c.html
3394 Elo

Rn.3.1-4c plays with V(s): sum of ownership. A bit weaker.
(The MCTS part is tuned for W(s) now, so something may be wrong.)
http://www.yss-aya.com/cgos/19x19/cross/Rn.3.1-4c.html
3218 Elo

zakki

On Wed, Jan 11, 2017 at 19:49, Bo Peng wrote:

> Hi Remi,
>
> Thanks for sharing your experience.
>
> As I am writing this, it seems there could be a third method: the perfect
> value function has the minimax property in the obvious way, so we can train
> our value function to satisfy the minimax property as well. In fact, we can
> train it such that a shallow-level MCTS gives as close a result as possible
> to a deeper-level MCTS. This can be regarded as a kind of bootstrapping.
>
> I wonder if you have tried this. It seems like it might be a natural idea...
>
> Bo
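If I read the proposal right, the simplest version would look something like this. It is only a sketch: `search_value` is a hypothetical helper that runs MCTS with a given visit budget and returns the root value, and using the raw network output as the "shallow" side is my simplification of the shallow-versus-deep formulation.

    import torch

    def bootstrap_loss(value_net, positions, search_value, deep_visits=1024):
        """Push the net's value toward what a deeper search concludes from s."""
        preds, targets = [], []
        for pos in positions:
            preds.append(value_net(pos))                 # "shallow" estimate
            with torch.no_grad():                        # deep search is a fixed target
                targets.append(torch.tensor(search_value(value_net, pos, deep_visits)))
        return torch.nn.functional.mse_loss(torch.stack(preds).squeeze(),
                                            torch.stack(targets))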
>
> On 1/11/17, 18:35, "Computer-go on behalf of Rémi Coulom" wrote:
>
> >Hi,
> >
> >Thanks for sharing your idea.
> >
> >In my experience it is rarely efficient to train value functions from
> >very short-term data (i.e., the next move). TD(lambda), or training from the
> >final outcome of the game, is often better because it uses a longer
> >horizon. But of course, it is difficult to tell without experiments
> >whether your idea would work or not. The advantage of your idea is that
> >you can collect a lot of training data more easily.
> >
> >Rémi
> >
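To make the TD(lambda) remark above concrete, a toy sketch of the lambda-return targets for one game, ignoring the sign flip between alternating players and any discounting (my own illustration, not anyone's code):

    def td_lambda_targets(values, final_outcome, lam=0.9):
        """Blend bootstrapped next-state values with the final game outcome.

        lam=0 is pure one-step bootstrapping (shortest horizon); lam=1 trains
        on the final outcome only (longest horizon).
        """
        T = len(values)
        targets = [0.0] * T
        g = final_outcome
        for t in reversed(range(T)):
            bootstrap = values[t + 1] if t + 1 < T else final_outcome
            g = (1.0 - lam) * bootstrap + lam * g
            targets[t] = g
        return targets

    # Example: net values for a short game that the player eventually wins (+1).
    targets = td_lambda_targets([0.1, 0.3, 0.6, 0.8], final_outcome=1.0, lam=0.9)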
> >- Original Message -
> >From: "Bo Peng"
> >To: computer-go@computer-go.org
> >Sent: Tuesday, January 10, 2017 23:25:19
> >Subject: [Computer-go] Training the value network (a possibly more
> >efficient approach)
> >
> >
> >Hi everyone. It occurs to me there might be a more efficient method to
> >train the value network directly (without using the policy network).
> >
> >
> >You are welcome to check my method:
> >http://withablink.com/GoValueFunction.pdf
> >
> >
> >Let me know if there are any silly mistakes :)
> >
>
>