Re: [Computer-go] Some experiences with CNN trained on moves by the winning player
On Sunday, December 11, 2016, Álvaro Begué wrote:
> I suspected this is what it was mainly about. Did you run any
> experiments to see if that explains the whole effect?

I probably ran matches, but I did not write the result down. I remember connecting the stochastic policy to KGS: it had a very unnatural style, playing blunders from time to time mixed with strong moves. If you have one good move with probability 0.3 and 70 bad moves with probability 0.01 each, it will play a blunder with probability 0.7.

I wonder whether the policy trained by policy gradient becomes stronger than the greedy policy. Is that reported in the AlphaGo paper?

Rémi
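[To make the arithmetic concrete, a minimal sketch of Rémi's example; the probability vector is his hypothetical, not a real network output:]

import numpy as np

# Remi's hypothetical distribution: one good move with probability 0.3,
# and 70 bad moves with probability 0.01 each (0.3 + 70 * 0.01 = 1.0).
probs = np.array([0.3] + [0.01] * 70)

blunder_prob_sampling = probs[1:].sum()  # 0.7: sampling blunders 70% of the time
greedy_move = int(np.argmax(probs))      # index 0: greedy always plays the good move

print(f"P(blunder | sampling) = {blunder_prob_sampling:.2f}")
print(f"greedy move index = {greedy_move}")

This is why the greedy policy can be far stronger than the sampled one, even though both come from the same network.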
Re: [Computer-go] Some experiences with CNN trained on moves by the winning player
On Sunday, December 11, 2016, Erik van der Werf wrote:
> For games between near-equally strong players the winning player's
> moves will not be much better than the losing player's moves. The
> game is typically decided by subtle mistakes.

IMO, training using only the moves of winners is obviously the practical choice. Worst case, you "waste" half of your data. But that is actually not a downside provided that you have lots of data, and as your program strengthens you will avoid potential data-quality problems.

Asymptotically, you have to train using only self-play games (and only the moves of winners). Otherwise you cannot break through the limitations inherent in the quality of the training games.
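[A winner-only filter is a one-pass transformation of the training set. A hypothetical sketch; winner_only_examples, game.positions_and_moves, move.player, and game.winner are illustrative names, not any particular SGF library's API:]

def winner_only_examples(games):
    """Yield (position, move) training pairs, keeping only the winner's moves.

    Worst case this discards half of the data, per the argument above.
    """
    for game in games:
        for position, move in game.positions_and_moves():
            if move.player == game.winner:  # drop the losing player's moves
                yield position, move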
Re: [Computer-go] Some experiences with CNN trained on moves by the winning player
On Sun, Dec 11, 2016 at 4:50 PM, Rémi Coulom wrote:
> It makes the policy stronger because it makes it more deterministic.
> The greedy policy is way stronger than the probability distribution.

I suspected this is what it was mainly about. Did you run any experiments to see if that explains the whole effect?

Álvaro
Re: [Computer-go] Some experiences with CNN trained on moves by the winning player
On Sunday, December 11, 2016, Detlef Schmicker wrote:
> So I still don't understand why reinforcement should do around
> 100-200 Elo :)

It makes the policy stronger because it makes it more deterministic. The greedy policy is way stronger than the probability distribution.

Rémi
Re: [Computer-go] Some experiences with CNN trained on moves by the winning player
On Sun, Dec 11, 2016 at 8:44 PM, Detlef Schmicker wrote:
> Hi Erik,
> as far as I understood it, it was 250 Elo in the policy network alone ...

Two problems: (1) it is a self-play result, and (2) the policy was tested as a stand-alone player. A policy trained to win games will beat a policy trained to predict moves, so what? That just confirms the expected result.

BTW, if you read a bit further, it says that the SL policy performed better in AG. This is consistent with earlier reported work. E.g., as a student, David used RL to train lots of strong stand-alone policies, but they never worked well when combined with MCTS. As far as I can tell, this one was no different, except that they were able to find some indirect use for it in the form of generating training data for the value network.

Erik
Re: [Computer-go] Some experiences with CNN trained on moves by the winning player
On Sunday, December 11, 2016, Erik van der Werf wrote:
> W.r.t. AG's reinforcement learning results, as far as I know,
> reinforcement learning was only indirectly helpful. The RL policy net
> performed worse than the SL policy net in the overall system.

Hi Erik,

As far as I understood it, it was 250 Elo in the policy network alone. From Section 2, "Reinforcement Learning of Policy Networks":

"We evaluated the performance of the RL policy network in game play, sampling each move (...) from its output probability distribution over actions. When played head-to-head, the RL policy network won more than 80% of games against the SL policy network."

Detlef
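[An 80% head-to-head win rate is indeed consistent with a figure around 250 Elo under the standard logistic Elo model; a quick check:]

import math

def winrate_to_elo(p):
    """Elo difference implied by win rate p under the logistic Elo model."""
    return 400 * math.log10(p / (1 - p))

print(winrate_to_elo(0.80))  # ~240.8, in line with the ~250 Elo figure above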
Re: [Computer-go] Some experiences with CNN trained on moves by the winning player
On Sun, Dec 11, 2016 at 11:38 AM, Detlef Schmicker wrote:
> Training on winning player moves did not help a lot, I got a
> statistically significant improvement of about 20-30 Elo.
> So I still don't understand why reinforcement should do around
> 100-200 Elo :)

Detlef, I think your result makes sense. For games between near-equally strong players, the winning player's moves will not be much better than the losing player's moves. The game is typically decided by subtle mistakes: even if nearly all my moves are perfect, just one blunder can throw the game. Of course it depends on how you implement the details, but in principle reinforcement learning should be able to deal with such cases (i.e., prevent propagating irrelevant information all the way back to the starting position).

W.r.t. AG's reinforcement learning results, as far as I know, reinforcement learning was only indirectly helpful. The RL policy net performed worse than the SL policy net in the overall system. Only by training the value net to predict expected outcomes from the (over-fitted?) RL policy net did they get some improvement (or so they claim). In essence this just means that RL may have been effective in creating a better training set for SL. Don't get me wrong, I love RL, but the reason the RL part was hyped so much is, in my opinion, more related to marketing, politics and personal ego.

Erik
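[The difference between imitation and the outcome-weighted update discussed here can be seen in a minimal REINFORCE sketch; PyTorch with toy numbers, not AlphaGo's actual training code:]

import torch

def reinforce_loss(log_probs, z):
    """log_probs: log pi(a_t | s_t) for the moves actually played in one game;
    z: game outcome from that player's perspective (+1 win, -1 loss)."""
    # Supervised learning minimizes -sum(log_probs) regardless of the result;
    # REINFORCE scales it by z, so moves from lost games are pushed *down*
    # instead of imitated.
    return -(z * log_probs).sum()

# Toy usage: three moves with made-up probabilities, from a lost game.
probs = torch.tensor([0.3, 0.5, 0.2], requires_grad=True)
loss = reinforce_loss(torch.log(probs), z=-1.0)
loss.backward()  # a descent step now decreases the losing player's move probabilities
print(probs.grad)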
[Computer-go] Some experiences with CNN trained on moves by the winning player
I want to share some experience training my policy CNN.

As I wondered why reinforcement learning was so helpful, I trained from the GoGoD database using only the moves played by the winner of each game.

Interestingly, the prediction rate on these moves was slightly higher (without training, just taking the previously trained network) than when taking into account the moves by both players (53% against 52%).

Training on winning-player moves did not help a lot: I got a statistically significant improvement of about 20-30 Elo.

So I still don't understand why reinforcement should do around 100-200 Elo :)

Detlef
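[For reference, the 53% vs. 52% comparison amounts to measuring top-1 prediction rate on two subsets of the same test positions. A hypothetical sketch; the arrays and the winner mask are illustrative, not Detlef's actual data:]

import numpy as np

def prediction_rate(predicted_moves, played_moves, mask=None):
    """Top-1 prediction rate, optionally restricted to a subset of positions."""
    hits = predicted_moves == played_moves
    if mask is not None:
        hits = hits[mask]  # e.g. keep only positions where the winner moved
    return hits.mean()

predicted = np.array([12, 40, 7, 99, 3])   # argmax of the policy output
played    = np.array([12, 41, 7, 99, 88])  # moves actually played
by_winner = np.array([True, False, True, False, True])

print(prediction_rate(predicted, played))             # rate on all moves
print(prediction_rate(predicted, played, by_winner))  # rate on winner moves only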