@Gian-Carlo, Indeed, a multi-labelled value net/head sounds like a good way to inject more signal into the network, according to that paper, and thus more reinforcement learning signal when learning from scratch. I was wondering whether it could also be beneficial for bootstrapping the policy net/head.

Since the score of randomly played games is likely to have high variance, I suppose that in most positions close to a final position the moves will have very similar action values, as estimated by the MCTS search, and hence a weak reinforcement signal (all or most of the moves lead to the same outcome, either a win or a loss). A multi-labelled value head could produce multi-labelled action values with more spread than a fixed 6.5 komi allows, permitting a better ranking of moves (by playing on the komi / averaging over the various possible komi?).

Put differently, learning to increase the final score might be a good starting point for the policy net/head, at least in the bootstrapping phase, combined with prioritized sampling biased towards a low reverse move count from the endgame, as I mentioned in an earlier post and as you propose for the first round of learning win/loss from final positions.

Patrick
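P.S. To make the ranking idea concrete, here is a minimal sketch in Python. The komi grid, function names, and constants are my own assumptions for illustration, not anything from the paper or from an actual engine; it just shows how averaging a per-komi win-probability vector recovers a move ranking where a single fixed-komi value cannot.

import numpy as np

# Sketch: rank candidate moves with a multi-labelled value head that
# outputs, for each move, a vector of win probabilities over a grid of
# komi values (cf. arXiv:1705.10701).
KOMIS = np.arange(-10.5, 11.5, 1.0)  # komi grid straddling the usual 6.5

def rank_moves(move_values):
    """move_values: dict move -> array of P(win | komi) over KOMIS.

    Near the end of a decided game, P(win | komi=6.5) is ~0 or ~1 for
    every legal move, so it cannot rank them. Averaging over komi acts
    as a soft score estimate and restores a gradient between moves.
    """
    scores = {move: values.mean() for move, values in move_values.items()}
    return sorted(scores, key=scores.get, reverse=True)

def endgame_sampling_weights(moves_from_end, tau=20.0):
    """Sketch of the prioritized-sampling idea: weight training positions
    by closeness to the final position (low reverse move count), where
    the win/loss label is least noisy. tau is an illustrative constant."""
    return np.exp(-np.asarray(moves_from_end, dtype=float) / tau)

# Toy example: both moves win at the standard 6.5 komi, but move A wins
# by a larger margin (for Black, winning at higher komi = bigger margin).
move_a = (KOMIS <= 8.5).astype(float)
move_b = (KOMIS <= 2.5).astype(float)
print(rank_moves({"A": move_a, "B": move_b}))  # ['A', 'B']

The mean over komi is a monotone proxy for the expected final score, so maximizing it is exactly the "learn to increase the final score" starting point above.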
-------- Original message --------
From: computer-go-requ...@computer-go.org
Date: 26/10/2017 16:17 (GMT+01:00)
To: computer-go@computer-go.org
Subject: Computer-go Digest, Vol 93, Issue 34

------------------------------

Message: 2
Date: Thu, 26 Oct 2017 15:17:43 +0200
From: Gian-Carlo Pascutto <g...@sjeng.org>
To: computer-go@computer-go.org
Subject: Re: [Computer-go] AlphaGo Zero

On 25-10-17 16:00, Petr Baudis wrote:
> That makes sense. I still hope that with a much more aggressive
> training schedule we could train a reasonable Go player, perhaps at
> the expense of worse scaling at very high Elos... (At least I feel
> optimistic after discovering a stupid bug in my code.)

By the way, a trivial observation: the initial network is random, so there's no point in using it to play the first batch of games. It won't do anything useful until it has run a learning pass on a bunch of win/loss-scored games and can at least tell who the likely winner is in the final position (even if at first it mostly won't be able to make territory). This suggests that bootstrapping probably wants 500k starting games with just random moves.

FWIW, it does not seem easy to get the value part of the network to converge in the dual-res architecture, even when taking the appropriate steps (1% weighting on the error, strong regularizer).

--
GCP

------------------------------

Message: 3
Date: Thu, 26 Oct 2017 15:55:23 +0200
From: Roel van Engelen <gosuba...@gmail.com>
To: computer-go@computer-go.org
Subject: Re: [Computer-go] Source code (Was: Reducing network size? (Was: AlphaGo Zero))

@Gian-Carlo Pascutto

Since training uses a ridiculous amount of computing power, I wonder whether it would be useful to make certain changes for future research, like training the value head with multiple komi values <https://arxiv.org/pdf/1705.10701.pdf>
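(Re GCP's convergence remark above: for concreteness, a minimal sketch, assuming a PyTorch-style dual-headed network, of what "1% weighting on the error, strong regularizer" could look like in the combined loss. The constants and function names are illustrative guesses, not Leela Zero's actual code.)

import torch
import torch.nn.functional as F

# Illustrative constants: a small weight on the value-head error and an
# L2 penalty, per the "1% weighting on error, strong regularizer" above.
VALUE_WEIGHT = 0.01
L2_COEFF = 1e-4

def combined_loss(policy_logits, value_pred, target_pi, target_z, model):
    """AlphaGo-Zero-style loss for a dual-headed (policy + value) net.

    target_pi: MCTS visit-count distribution over moves, shape (batch, moves)
    target_z:  game outcome in [-1, 1] from the player's perspective
    """
    # Cross-entropy of the policy head against the soft MCTS target.
    policy_loss = -(target_pi * F.log_softmax(policy_logits, dim=1)).sum(1).mean()
    # MSE of the value head against the game outcome, down-weighted so it
    # does not dominate the gradient early in training.
    value_loss = F.mse_loss(value_pred.squeeze(-1), target_z)
    # L2 regularization over all parameters.
    l2 = sum(p.pow(2).sum() for p in model.parameters())
    return policy_loss + VALUE_WEIGHT * value_loss + L2_COEFF * l2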
_______________________________________________
Computer-go mailing list
Computer-go@computer-go.org
http://computer-go.org/mailman/listinfo/computer-go