Re: [Computer-go] Some experiences with CNN trained on moves by the winning player

2016-12-13 Thread Rémi Coulom
I probably ran matches, but I did not write the result down.

I remember connecting the stochastic policy to KGS. It had a very unnatural 
style, playing blunders from time to time, mixed with strong moves.

If you have one good move with probability 0.3 and 70 bad moves with
probability 0.01 each, it will play a blunder with probability 0.7.
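
A quick numeric check of that example in Python (illustration only):

probs = [0.30] + [0.01] * 70           # one good move, seventy bad ones
p_blunder_sampled = sum(probs[1:])     # probability mass on the bad moves
p_blunder_greedy = 0.0                 # argmax always picks the 0.3 move
print(round(p_blunder_sampled, 2))     # 0.7
print(p_blunder_greedy)                # 0.0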

I wonder if the policy trained by policy gradient becomes stronger than the 
greedy policy. Is it reported in the AlphaGo paper?

----- Original Message -----
From: "Álvaro Begué" 
To: "computer-go" 
Sent: Sunday, December 11, 2016 22:52:31
Subject: Re: [Computer-go] Some experiences with CNN trained on moves by the
winning player

On Sun, Dec 11, 2016 at 4:50 PM, Rémi Coulom < remi.cou...@free.fr > wrote: 


It makes the policy stronger because it makes it more deterministic. The greedy 
policy is way stronger than the probability distribution. 

I suspected this is what it was mainly about. Did you run any experiments to 
see if that explains the whole effect? 

Rémi 

----- Original Message -----
From: "Detlef Schmicker" < d...@physik.de >
To: computer-go@computer-go.org
Sent: Sunday, December 11, 2016 11:38:08
Subject: [Computer-go] Some experiences with CNN trained on moves by the winning
player

I want to share some experience training my policy CNN:

As I wondered why reinforcement learning was so helpful, I trained
from the GoGoD database using only the moves played by the winner of
each game.

Interestingly, the prediction rate on these moves was slightly higher
(without further training, just evaluating the previously trained
network) than on the moves of both players (53% versus 52%).

Training on the winning player's moves did not help a lot; I got a
statistically significant improvement of about 20-30 Elo.

So I still don't understand why reinforcement learning should gain
around 100-200 Elo :)

Detlef
___
Computer-go mailing list
Computer-go@computer-go.org
http://computer-go.org/mailman/listinfo/computer-go

Re: [Computer-go] Some experiences with CNN trained on moves by the winning player

2016-12-11 Thread Brian Sheppard
IMO, training using only the moves of winners is obviously the practical choice.

Worst case: you "waste" half of your data. But that is actually not a downside 
provided that you have lots of data, and as your program strengthens you will 
avoid potential data-quality problems.

Asymptotically, you have to train using only self-play games (and only the 
moves of winners). Otherwise you cannot break through the limitations inherent 
in the quality of the training games.

-----Original Message-----
From: Computer-go [mailto:computer-go-boun...@computer-go.org] On Behalf Of 
Erik van der Werf
Sent: Sunday, December 11, 2016 6:51 AM
To: computer-go 
Subject: Re: [Computer-go] Some experiences with CNN trained on moves by the 
winning player

Detlef, I think your result makes sense. For games between near-equally strong
players the winning player's moves will not be much better than the losing
player's moves. The game is typically decided by subtle mistakes. Even if
nearly all my moves are perfect, just one blunder can throw the game. Of course
it depends on how you implement the details, but in principle reinforcement
learning should be able to deal with such cases (i.e., prevent propagating
irrelevant information all the way back to the starting position).

W.r.t. AG's reinforcement learning results, as far as I know, reinforcement
learning was only indirectly helpful. The RL policy net performed worse than
the SL policy net in the overall system. Only by training the value net to
predict expected outcomes from the (over-fitted?) RL policy net did they get
some improvement (or so they claim). In essence this just means that RL may
have been effective in creating a better training set for SL. Don't get me
wrong, I love RL, but the reason why the RL part was hyped so much is in my
opinion more related to marketing, politics, and personal ego.

Erik


On Sun, Dec 11, 2016 at 11:38 AM, Detlef Schmicker  wrote:
> I want to share some experience training my policy CNN:
>
> As I wondered why reinforcement learning was so helpful, I trained
> from the GoGoD database using only the moves played by the winner of
> each game.
>
> Interestingly, the prediction rate on these moves was slightly higher
> (without further training, just evaluating the previously trained
> network) than on the moves of both players (53% versus 52%).
>
> Training on the winning player's moves did not help a lot; I got a
> statistically significant improvement of about 20-30 Elo.
>
> So I still don't understand why reinforcement learning should gain
> around 100-200 Elo :)
>
> Detlef
___
Computer-go mailing list
Computer-go@computer-go.org
http://computer-go.org/mailman/listinfo/computer-go

Re: [Computer-go] Some experiences with CNN trained on moves by the winning player

2016-12-11 Thread Álvaro Begué
On Sun, Dec 11, 2016 at 4:50 PM, Rémi Coulom  wrote:

> It makes the policy stronger because it makes it more deterministic. The
> greedy policy is way stronger than the probability distribution.
>

I suspected this is what it was mainly about. Did you run any experiments
to see if that explains the whole effect?



>
> Rémi
>
> ----- Original Message -----
> From: "Detlef Schmicker" 
> To: computer-go@computer-go.org
> Sent: Sunday, December 11, 2016 11:38:08
> Subject: [Computer-go] Some experiences with CNN trained on moves by the
> winning player
>
> I want to share some experience training my policy CNN:
>
> As I wondered why reinforcement learning was so helpful, I trained
> from the GoGoD database using only the moves played by the winner of
> each game.
>
> Interestingly, the prediction rate on these moves was slightly higher
> (without further training, just evaluating the previously trained
> network) than on the moves of both players (53% versus 52%).
>
> Training on the winning player's moves did not help a lot; I got a
> statistically significant improvement of about 20-30 Elo.
>
> So I still don't understand why reinforcement learning should gain
> around 100-200 Elo :)
>
> Detlef
___
Computer-go mailing list
Computer-go@computer-go.org
http://computer-go.org/mailman/listinfo/computer-go

Re: [Computer-go] Some experiences with CNN trained on moves by the winning player

2016-12-11 Thread Rémi Coulom
It makes the policy stronger because it makes it more deterministic. The greedy 
policy is way stronger than the probability distribution.
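
For concreteness, a minimal sketch of the two policies being compared
(illustration only, with a toy distribution standing in for the policy
network's output):

import random

random.seed(0)
policy_output = [0.30] + [0.01] * 70   # toy move distribution from a policy net

# Stochastic policy: sample a move index from the distribution.
sampled_move = random.choices(range(len(policy_output)),
                              weights=policy_output, k=1)[0]

# Greedy policy: always play the most probable move.
greedy_move = max(range(len(policy_output)), key=policy_output.__getitem__)

print(sampled_move, greedy_move)       # the greedy policy always plays move 0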

Rémi

----- Original Message -----
From: "Detlef Schmicker" 
To: computer-go@computer-go.org
Sent: Sunday, December 11, 2016 11:38:08
Subject: [Computer-go] Some experiences with CNN trained on moves by the winning
player

I want to share some experience training my policy CNN:

As I wondered why reinforcement learning was so helpful, I trained
from the GoGoD database using only the moves played by the winner of
each game.

Interestingly, the prediction rate on these moves was slightly higher
(without further training, just evaluating the previously trained
network) than on the moves of both players (53% versus 52%).

Training on the winning player's moves did not help a lot; I got a
statistically significant improvement of about 20-30 Elo.

So I still don't understand why reinforcement learning should gain
around 100-200 Elo :)

Detlef
___
Computer-go mailing list
Computer-go@computer-go.org
http://computer-go.org/mailman/listinfo/computer-go

Re: [Computer-go] Some experiences with CNN trained on moves by the winning player

2016-12-11 Thread Erik van der Werf
On Sun, Dec 11, 2016 at 8:44 PM, Detlef Schmicker  wrote:
> Hi Erik,
>
> as far as I understood it, it was 250ELO in policy network alone ...

Two problems: (1) it is a self-play result, (2) the policy was tested
as a stand-alone player.

A policy trained to win games will beat a policy trained to predict
moves, so what? That's just confirming the expected result.

BTW if you read a bit further it says that the SL policy performed
better in AG. This is consistent with earlier reported work. E.g., as
a student David used RL to train lots of strong stand-alone policies,
but they never worked well when combined with MCTS. As far as I can
tell, this one was no different, except that they were able to find
some indirect use for it in the form of generating training data for
the value network.

Erik
___
Computer-go mailing list
Computer-go@computer-go.org
http://computer-go.org/mailman/listinfo/computer-go

Re: [Computer-go] Some experiences with CNN trained on moves by the winning player

2016-12-11 Thread Detlef Schmicker
Hi Erik,

as far as I understood it, it was 250 Elo in the policy network alone ...


From section 2, "Reinforcement Learning of Policy Networks":

We evaluated the performance of the RL policy network in game play,
sampling each move (...) from its output probability distribution over
actions. When played head-to-head, the RL policy network won more than
80% of games against the SL policy network.
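
For reference (illustrative arithmetic, not from the paper): under the
usual logistic Elo model an 80% head-to-head score corresponds to
roughly a 240 Elo gap, which matches the ~250 Elo reading, while a
20-30 Elo gain is only about a 53-54% win rate:

import math

def winrate_to_elo(p):
    """Elo difference implied by an expected score p (0 < p < 1)."""
    return 400.0 * math.log10(p / (1.0 - p))

def elo_to_winrate(d):
    """Expected score of the stronger side at an Elo difference d."""
    return 1.0 / (1.0 + 10.0 ** (-d / 400.0))

print(round(winrate_to_elo(0.80)))        # 241
print(round(elo_to_winrate(30) * 100))    # 54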

> W.r.t. AG's reinforcement learning results, as far as I know,
> reinforcement learning was only indirectly helpful. The RL policy net
> performed worse than the SL policy net in the overall system. Only by
> training the value net to predict expected outcomes from the
> (over-fitted?) RL policy net did they get some improvement (or so they
> claim). In essence this just means that RL may have been effective in
> creating a better training set for SL. Don't get me wrong, I love RL,
> but the reason why the RL part was hyped so much is in my opinion more
> related to marketing, politics, and personal ego.


Detlef
___
Computer-go mailing list
Computer-go@computer-go.org
http://computer-go.org/mailman/listinfo/computer-go

Re: [Computer-go] Some experiences with CNN trained on moves by the winning player

2016-12-11 Thread Erik van der Werf
Detlef, I think your result makes sense. For games between
near-equally strong players the winning player's moves will not be
much better than the losing player's moves. The game is typically
decided by subtle mistakes. Even if nearly all my moves are perfect,
just one blunder can throw the game. Of course it depends on how you
implement the details, but in principle reinforcement learning should
be able to deal with such cases (i.e., prevent propagating irrelevant
information all the way back to the starting position).
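
To make that concrete, here is a small sketch of one such
implementation detail (illustration only, not the AlphaGo update):
computing per-move advantages from successive value estimates instead
of handing the final result back to every move, so that a late blunder,
not the opening, absorbs the blame:

def td_advantages(values):
    """Per-move advantages from successive position evaluations.
    values[t] is the estimated result, seen from the eventual loser's
    side, before move t is played (a value net would supply these)."""
    return [values[t + 1] - values[t] for t in range(len(values) - 1)]

# Toy game: roughly even until one late blunder decides it.
values = [0.0, 0.0, 0.05, 0.0, -0.9, -0.95, -1.0]
moves = ["move 1", "move 2", "move 3", "blunder", "move 5", "move 6"]

for move, adv in zip(moves, td_advantages(values)):
    print(f"{move:8s} advantage {adv:+.2f}")
# Only the blunder gets a large negative advantage (-0.90); the earlier
# moves get ~0, so the loss is not pushed back to the opening.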

W.r.t. AG's reinforcement learning results, as far as I know,
reinforcement learning was only indirectly helpful. The RL policy net
performed worse than the SL policy net in the overall system. Only by
training the value net to predict expected outcomes from the
(over-fitted?) RL policy net did they get some improvement (or so they
claim). In essence this just means that RL may have been effective in
creating a better training set for SL. Don't get me wrong, I love RL,
but the reason why the RL part was hyped so much is in my opinion more
related to marketing, politics, and personal ego.

Erik


On Sun, Dec 11, 2016 at 11:38 AM, Detlef Schmicker  wrote:
> I want to share some experience training my policy CNN:
>
> As I wondered why reinforcement learning was so helpful, I trained
> from the GoGoD database using only the moves played by the winner of
> each game.
>
> Interestingly, the prediction rate on these moves was slightly higher
> (without further training, just evaluating the previously trained
> network) than on the moves of both players (53% versus 52%).
>
> Training on the winning player's moves did not help a lot; I got a
> statistically significant improvement of about 20-30 Elo.
>
> So I still don't understand why reinforcement learning should gain
> around 100-200 Elo :)
>
> Detlef
___
Computer-go mailing list
Computer-go@computer-go.org
http://computer-go.org/mailman/listinfo/computer-go

[Computer-go] Some experiences with CNN trained on moves by the winning player

2016-12-11 Thread Detlef Schmicker
I want to share some experience training my policy CNN:

As I wondered why reinforcement learning was so helpful, I trained
from the GoGoD database using only the moves played by the winner of
each game.
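
A minimal sketch of this filtering step (illustration only, not the
actual pipeline), assuming each game is already parsed into an
SGF-style result string and a list of (color, move) pairs:

def winner_moves(result, moves):
    """Keep only the (color, move) pairs played by the game's winner.
    result is an SGF-style string such as "B+3.5" or "W+R"; moves is a
    list of (color, move) pairs with color "B" or "W"."""
    if not result or result[0] not in ("B", "W"):
        return []                      # unknown result: skip the game
    winner = result[0]
    return [(c, m) for c, m in moves if c == winner]

game_result = "W+R"
game_moves = [("B", "pd"), ("W", "dp"), ("B", "qq"), ("W", "dd")]
print(winner_moves(game_result, game_moves))
# [('W', 'dp'), ('W', 'dd')] -> roughly half the positions remain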

Interestingly, the prediction rate on these moves was slightly higher
(without further training, just evaluating the previously trained
network) than on the moves of both players (53% versus 52%).

Training on the winning player's moves did not help a lot; I got a
statistically significant improvement of about 20-30 Elo.

So I still don't understand why reinforcement learning should gain
around 100-200 Elo :)

Detlef
___
Computer-go mailing list
Computer-go@computer-go.org
http://computer-go.org/mailman/listinfo/computer-go