As I understand it, RL boosts the performance of the policy network not
because the winner's moves are particularly better than the loser's moves,
but because it specifically shores up the weaknesses of the SL-trained
network. In other words, a network trained purely with SL will have certain
characteristic faults: for example, it may fail to tenuki as often as it
should, because locality is one of the strongest predictors of the next
move, or it may play a losing ladder out to the end, because at each
individual step the ladder move naively looks best. By running RL on games
played by the SL-only policy network, you shore up the weaknesses of that
particular instance of the policy network.
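
For anyone who wants to see the mechanism concretely, below is a tiny,
purely illustrative sketch (Python/NumPy, emphatically not the actual
AlphaGo code or network) of the REINFORCE-style update the paper describes:
moves from games the current policy wins get their probability pushed up,
moves from lost games get pushed down. The "game" here is just a toy bandit
standing in for self-play against the SL-only opponent, so the numbers are
meaningless; the point is only that the update targets whatever the current
policy actually does, rather than imitating the moves of stronger humans.

import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for a policy network: a softmax over N candidate moves
# parameterised by a single weight vector. (The real policy is a deep
# CNN over board features; this only illustrates the update rule.)
N_MOVES = 9
theta = rng.normal(scale=0.1, size=N_MOVES)  # "SL-initialised" weights

def policy(theta):
    z = np.exp(theta - theta.max())
    return z / z.sum()

def play_toy_game(theta):
    """Sample one 'game': pick a move, observe a win or a loss.

    The outcome is a coin flip biased toward higher-indexed moves,
    standing in for the result of a self-play game against the
    frozen SL-only opponent.
    """
    p = policy(theta)
    move = rng.choice(N_MOVES, p=p)
    win = rng.random() < (move + 1) / (N_MOVES + 1)
    return move, (1.0 if win else -1.0)

alpha = 0.05
for _ in range(2000):
    move, reward = play_toy_game(theta)
    p = policy(theta)
    grad_logp = -p
    grad_logp[move] += 1.0               # gradient of log pi(move | theta)
    theta += alpha * reward * grad_logp  # REINFORCE: +1 for wins, -1 for losses

print(np.round(policy(theta), 3))  # probability mass drifts toward winning moves

The sign of the reward is the whole trick: habits that cost the SL-only
network games in self-play (chasing a dead ladder, never playing tenuki)
get suppressed, even if those same moves were common in the human data it
was originally trained to predict.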

Brian

On Mon, Dec 12, 2016 at 7:03 AM <computer-go-requ...@computer-go.org> wrote:

>
> Today's Topics:
>
>    1. Re: Some experiences with CNN trained on moves by the winning
>       player (Detlef Schmicker)
>    2. Re: Some experiences with CNN trained on moves by the winning
>       player (Erik van der Werf)
>    3. Re: Some experiences with CNN trained on moves by the winning
>       player (Rémi Coulom)
>    4. Re: Some experiences with CNN trained on moves by the winning
>       player (Álvaro Begué)
>    5. Re: Some experiences with CNN trained on moves by the winning
>       player (Brian Sheppard)
>
>
> ----------------------------------------------------------------------
>
> Message: 1
> Date: Sun, 11 Dec 2016 20:44:32 +0100
> From: Detlef Schmicker <d...@physik.de>
> To: computer-go@computer-go.org
> Subject: Re: [Computer-go] Some experiences with CNN trained on moves
>         by the winning player
>
> Hi Erik,
>
> as far as I understood it, it was 250 Elo in the policy network alone ...
>
>
> section 2, Reinforcement Learning of Policy Networks:
>
> We evaluated the performance of the RL policy network in game play,
> sampling each move (...) from its output probability distribution over
> actions.   When played head-to-head,
> the RL policy network won more than 80% of games against the SL policy
> network.
>
> > W.r.t. AG's reinforcement learning results, as far as I know,
> > reinforcement learning was only indirectly helpful. The RL policy net
> > performed worse than the SL policy net in the overall system. Only by
> > training the value net to predict expected outcomes from the
> > (over-fitted?) RL policy net did they get some improvement (or so they
> > claim). In essence this just means that RL may have been effective in
> > creating a better training set for SL. Don't get me wrong, I love RL,
> > but the reason why the RL part was hyped so much is in my opinion more
> > related to marketing, politics and personal ego.
>
>
> Detlef
>
>
> ------------------------------
>
> Message: 2
> Date: Sun, 11 Dec 2016 22:39:17 +0100
> From: Erik van der Werf <erikvanderw...@gmail.com>
> To: computer-go <computer-go@computer-go.org>
> Subject: Re: [Computer-go] Some experiences with CNN trained on moves
>         by the winning player
>
> On Sun, Dec 11, 2016 at 8:44 PM, Detlef Schmicker <d...@physik.de> wrote:
> > Hi Erik,
> >
> > as far as I understood it, it was 250 Elo in the policy network alone ...
>
> Two problems: (1) it is a self-play result, (2) the policy was tested
> as a stand-alone player.
>
> A policy trained to win games will beat a policy trained to predict
> moves, so what? That's just confirming the expected result.
>
> BTW if you read a bit further it says that the SL policy performed
> better in AG. This is consistent with earlier reported work. E.g., as
> a student David used RL to train lots of strong stand-alone policies,
> but they never worked well when combined with MCTS. As far as I can
> tell, this one was no different, except that they were able to find
> some indirect use for it in the form of generating training data for
> the value network.
>
> Erik
>
>
> ------------------------------
>
> Message: 3
> Date: Sun, 11 Dec 2016 22:50:58 +0100 (CET)
> From: Rémi Coulom <remi.cou...@free.fr>
> To: computer-go@computer-go.org
> Subject: Re: [Computer-go] Some experiences with CNN trained on moves
>         by the winning player
>
> It makes the policy stronger because it makes it more deterministic. The
> greedy policy is way stronger than sampling from the probability
> distribution.
>
> Rémi
>
> ----- Original Message -----
> From: "Detlef Schmicker" <d...@physik.de>
> To: computer-go@computer-go.org
> Sent: Sunday, December 11, 2016 11:38:08
> Subject: [Computer-go] Some experiences with CNN trained on moves by the
> winning player
>
> I want to share some experience training my policy CNN:
>
> As I wondered why reinforcement learning was so helpful, I trained
> on the GoGoD database using only the moves played by the winner of
> each game.
>
> Interestingly, the prediction rate on these moves was slightly higher
> (without retraining, just evaluating the previously trained network)
> than when taking the moves by both players into account (53% against
> 52%).
>
> Training on the winning player's moves did not help a lot; I got a
> statistically significant improvement of about 20-30 Elo.
>
> So I still don't understand why reinforcement learning should gain
> around 100-200 Elo :)
>
> Detlef
>
>
> ------------------------------
>
> Message: 4
> Date: Sun, 11 Dec 2016 16:52:31 -0500
> From: Álvaro Begué <alvaro.be...@gmail.com>
> To: computer-go <computer-go@computer-go.org>
> Subject: Re: [Computer-go] Some experiences with CNN trained on moves
>         by the winning player
>
> On Sun, Dec 11, 2016 at 4:50 PM, Rémi Coulom <remi.cou...@free.fr> wrote:
>
> > It makes the policy stronger because it makes it more deterministic. The
> > greedy policy is way stronger than sampling from the probability
> > distribution.
> >
>
> I suspected this is what it was mainly about. Did you run any experiments
> to see if that explains the whole effect?
>
>
>
> >
> > Rémi
> >
> > ----- Original Message -----
> > From: "Detlef Schmicker" <d...@physik.de>
> > To: computer-go@computer-go.org
> > Sent: Sunday, December 11, 2016 11:38:08
> > Subject: [Computer-go] Some experiences with CNN trained on moves by the
> > winning player
> >
> > I want to share some experience training my policy CNN:
> >
> > As I wondered why reinforcement learning was so helpful, I trained
> > on the GoGoD database using only the moves played by the winner of
> > each game.
> >
> > Interestingly, the prediction rate on these moves was slightly higher
> > (without retraining, just evaluating the previously trained network)
> > than when taking the moves by both players into account (53% against
> > 52%).
> >
> > Training on the winning player's moves did not help a lot; I got a
> > statistically significant improvement of about 20-30 Elo.
> >
> > So I still don't understand why reinforcement learning should gain
> > around 100-200 Elo :)
> >
> > Detlef
>
> ------------------------------
>
> Message: 5
> Date: Sun, 11 Dec 2016 17:00:52 -0500
> From: "Brian Sheppard" <sheppar...@aol.com>
> To: <computer-go@computer-go.org>
> Subject: Re: [Computer-go] Some experiences with CNN trained on moves
>         by the winning player
>
> IMO, training using only the moves of winners is obviously the practical
> choice.
>
> Worst case: you "waste" half of your data. But that is actually not a
> downside provided that you have lots of data, and as your program
> strengthens you will avoid potential data-quality problems.
>
> Asymptotically, you have to train using only self-play games (and only the
> moves of winners). Otherwise you cannot break through the limitations
> inherent in the quality of the training games.
>
> -----Original Message-----
> From: Computer-go [mailto:computer-go-boun...@computer-go.org] On Behalf
> Of Erik van der Werf
> Sent: Sunday, December 11, 2016 6:51 AM
> To: computer-go <computer-go@computer-go.org>
> Subject: Re: [Computer-go] Some experiences with CNN trained on moves by
> the winning player
>
> Detlef, I think your result makes sense. For games between near-equally
> strong players the winning player's moves will not be much better than the
> losing player's moves. The game is typically decided by subtle mistakes.
> Even if nearly all my moves are perfect, just one blunder can throw the
> game. Of course it depends on how you implement the details, but in
> principle reinforcement learning should be able to deal with such cases
> (i.e., prevent propagating irrelevant information all the way back to the
> starting position).
>
> W.r.t. AG's reinforcement learning results, as far as I know,
> reinforcement learning was only indirectly helpful. The RL policy net
> performed worse than the SL policy net in the overall system. Only by
> training the value net to predict expected outcomes from the
> (over-fitted?) RL policy net did they get some improvement (or so they claim).
> In essence this just means that RL may have been effective in creating a
> better training set for SL. Don't get me wrong, I love RL, but the reason
> why the RL part was hyped so much is in my opinion more related to
> marketing, politics and personal ego.
>
> Erik
>
>
> On Sun, Dec 11, 2016 at 11:38 AM, Detlef Schmicker <d...@physik.de> wrote:
> > I want to share some experience training my policy CNN:
> >
> > As I wondered why reinforcement learning was so helpful, I trained
> > on the GoGoD database using only the moves played by the winner of
> > each game.
> >
> > Interestingly, the prediction rate on these moves was slightly higher
> > (without retraining, just evaluating the previously trained network)
> > than when taking the moves by both players into account (53% against
> > 52%).
> >
> > Training on the winning player's moves did not help a lot; I got a
> > statistically significant improvement of about 20-30 Elo.
> >
> > So I still don't understand why reinforcement learning should gain
> > around 100-200 Elo :)
> >
> > Detlef
>
>
>
> ------------------------------
>
> End of Computer-go Digest, Vol 83, Issue 7
> ******************************************
>
_______________________________________________
Computer-go mailing list
Computer-go@computer-go.org
http://computer-go.org/mailman/listinfo/computer-go
