### Re: [computer-go] Monte-Carlo Simulation Balancing

> Just to clarify: I was not saying that Mogo's policy consisted *solely* of looking for patterns around the last move. Merely that it does not look for patterns around *every* point, which other playout policies (e.g., CrazyStone, if I understand Remi's papers correctly) appear to do. The RL paper seems to require that playout design. Fine!
>
> BTW, FillBoard seems to help Pebbles, too. A few percent better on 9x9 games. No testing on larger boards. YMMV, and like everything about computer go: all predictions are guaranteed to be wrong, or your money back.

For us the improvement is essential in 19x19. It says something for the generality of "fillboard" if it helps for you as well :-) The loss of diversity due to patterns is really clear in some situations, so the problem solved by fillboard is understandable, and I believe it should work for you too. But, as you say, all predictions in computer go are almost guaranteed to be wrong :-)

Olivier

___
computer-go mailing list
computer-go@computer-go.org
http://www.computer-go.org/mailman/listinfo/computer-go/

### Re: [computer-go] Monte-Carlo Simulation Balancing

> Pebbles has a "Mogo" playout design, where you check for patterns only around the last move (or two).

In MoGo, it's not only around the last move (at least with some probability, and when there are empty spaces on the board); this is the "fill board" modification. It provides a big improvement in 19x19 with large numbers of simulations (see http://www.lri.fr/~rimmel/publi/EK_explo.pdf, Fig. 3, page 8). The improvement is not only quantitative: according to players, it is also a qualitative improvement in the way MoGo plays.

Best regards,
Olivier

### Re: [computer-go] Monte-Carlo Simulation Balancing

A web search turned up a 2-page and an 8-page version. I read the short one. I agree that it's promising work that requires some follow-up research. Now that you've read it so many times, what excites you about it? Can you envision a way to scale it to larger patterns and boards on modern hardware?

Sent from my iPhone

On Aug 12, 2009, at 11:14 PM, Michael Williams wrote:

> After about the 5th reading, I'm concluding that this is an excellent paper. Is anyone (besides the authors) doing research based on this? There is a lot to do.
>
> David Silver wrote:
>> Hi everyone,
>> Please find attached my ICML paper with Gerry Tesauro on automatically learning a simulation policy for Monte-Carlo Go. Our preliminary results show a 200+ Elo improvement over previous approaches, although our experiments were restricted to simple Monte-Carlo search with no tree on small boards.
>>
>> Abstract: In this paper we introduce the first algorithms for efficiently learning a simulation policy for Monte-Carlo search. Our main idea is to optimise the balance of a simulation policy, so that an accurate spread of simulation outcomes is maintained, rather than optimising the direct strength of the simulation policy. We develop two algorithms for balancing a simulation policy by gradient descent. The first algorithm optimises the balance of complete simulations, using a policy gradient algorithm; whereas the second algorithm optimises the balance over every two steps of simulation. We compare our algorithms to reinforcement learning and supervised learning algorithms for maximising the strength of the simulation policy. We test each algorithm in the domain of 5x5 and 6x6 Computer Go, using a softmax policy that is parameterised by weights for a hundred simple patterns. When used in a simple Monte-Carlo search, the policies learnt by simulation balancing achieved significantly better performance, with half the mean squared error of a uniform random policy, and equal overall performance to a sophisticated Go engine.
>>
>> -Dave

### Re: [computer-go] Monte-Carlo Simulation Balancing

In future papers they should avoid using a strong authority like Fuego for the training, and instead force it to learn from a naive uniform random playout policy (with 100x or 1000x more playouts) and then build on that with an iterative approach (as was suggested in the paper).

I also had another thought. Since they are training the policy to maximize the balance and not the win rate, wouldn't you be able to extract more information from each trial by using the score instead of the game result? The normal pitfalls of doing so do not apply here.

Isaac Deutsch wrote:
> I admit I had trouble understanding the details of the paper. What I think is the biggest problem for applying this to bigger (up to 19x19) games is that you somehow need access to the "true" value of a move, i.e. whether it's a win or a loss. On the 5x5 board they used, this might be approximated pretty well, but there's no chance of doing so on 19x19.
>
> Am 13.08.2009 um 05:14 schrieb Michael Williams:
>> After about the 5th reading, I'm concluding that this is an excellent paper. Is anyone (besides the authors) doing research based on this? There is a lot to do.

### Re: [computer-go] Monte-Carlo Simulation Balancing

I admit I had trouble understanding the details of the paper. What I think is the biggest problem for applying this to bigger (up to 19x19) games is that you somehow need access to the "true" value of a move, i.e. whether it's a win or a loss. On the 5x5 board they used, this might be approximated pretty well, but there's no chance of doing so on 19x19.

Am 13.08.2009 um 05:14 schrieb Michael Williams:
> After about the 5th reading, I'm concluding that this is an excellent paper. Is anyone (besides the authors) doing research based on this? There is a lot to do.

### Re: [computer-go] Monte-Carlo Simulation Balancing

After about the 5th reading, I'm concluding that this is an excellent paper. Is anyone (besides the authors) doing research based on this? There is a lot to do.

David Silver wrote:
> Hi everyone,
> Please find attached my ICML paper with Gerry Tesauro on automatically learning a simulation policy for Monte-Carlo Go. Our preliminary results show a 200+ Elo improvement over previous approaches, although our experiments were restricted to simple Monte-Carlo search with no tree on small boards.
>
> Abstract: In this paper we introduce the first algorithms for efficiently learning a simulation policy for Monte-Carlo search. Our main idea is to optimise the balance of a simulation policy, so that an accurate spread of simulation outcomes is maintained, rather than optimising the direct strength of the simulation policy. We develop two algorithms for balancing a simulation policy by gradient descent. The first algorithm optimises the balance of complete simulations, using a policy gradient algorithm; whereas the second algorithm optimises the balance over every two steps of simulation. We compare our algorithms to reinforcement learning and supervised learning algorithms for maximising the strength of the simulation policy. We test each algorithm in the domain of 5x5 and 6x6 Computer Go, using a softmax policy that is parameterised by weights for a hundred simple patterns. When used in a simple Monte-Carlo search, the policies learnt by simulation balancing achieved significantly better performance, with half the mean squared error of a uniform random policy, and equal overall performance to a sophisticated Go engine.
>
> -Dave

### Re: [computer-go] Monte-Carlo Simulation Balancing

Hi,

We used alpha = 0.1. There may well be a better setting of alpha, but this appeared to work nicely in our experiments.

-Dave

On 3-May-09, at 2:01 AM, elife wrote:
> Hi Dave,
> In your experiments, what's the constant value of alpha you set? Thanks.
>
> 2009/5/1 David Silver:
>> Yes, in our experiments they were just constant numbers M=N=100.

### Re: [computer-go] Monte-Carlo Simulation Balancing

Hi Dave,

In your experiments, what's the constant value of alpha you set? Thanks.

2009/5/1 David Silver:
> Yes, in our experiments they were just constant numbers M=N=100.

### Re: [computer-go] Monte-Carlo Simulation Balancing

Hi Yamato,

> If M and N are the same, is there any reason to run M simulations and N simulations separately? What happens if you combine them and calculate V and g in a single loop?

I think it gives the wrong answer to do it in a single loop. Note that the simulation outcomes z are used to compute both V and g, so that they are quite strongly correlated. In general, if random variables X and Y are correlated, then E[X]E[Y] != E[XY].

A simple example of this problem would be sampling E[X]E[X] for some random variable X. Let's say X is actually just noise with mean zero. If you take just one sample x1, then x1*x1 is always positive, and you would estimate E[X]E[X] = E[X^2] > 0. But if you take two independent samples x1 and x2, then x1*x2 would be positive half the time and negative half the time, and you would correctly estimate E[X]E[X] = 0.

-Dave
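Dave's two-sample argument is easy to check numerically. A minimal sketch (illustrative only, not from the thread): estimate E[X]E[X] for zero-mean Gaussian noise, once by squaring a single sample per trial and once from two independent samples per trial.

```python
import random

random.seed(0)

def biased_estimate(n):
    """Estimate E[X]*E[X] by squaring a single sample per trial (wrong)."""
    return sum(x * x for x in (random.gauss(0, 1) for _ in range(n))) / n

def unbiased_estimate(n):
    """Estimate E[X]*E[X] from two independent samples per trial (right)."""
    total = 0.0
    for _ in range(n):
        x1, x2 = random.gauss(0, 1), random.gauss(0, 1)
        total += x1 * x2
    return total / n

n = 100_000
print(biased_estimate(n))    # close to 1 = E[X^2], not the true value 0
print(unbiased_estimate(n))  # close to 0 = E[X] * E[X]
```

The single-loop version systematically reports E[X^2]; splitting the samples, as the paper's M and N loops do, removes the correlation.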

### Re: [computer-go] Monte-Carlo Simulation Balancing

David Silver wrote:
> Yes, in our experiments they were just constant numbers M=N=100.

If M and N are the same, is there any reason to run M simulations and N simulations separately? What happens if you combine them and calculate V and g in a single loop?

> Okay, let's continue the example above. Let's say that in position s, using the current theta, moves a, b and c will be selected with probability 0.5, 0.3 and 0.2 respectively. Let's say that move a was actually selected. Now consider pattern 1; this is matched after a. But the probability of matching was (0.5*1 + 0.3*1 + 0.2*0) = 0.8. So psi_1(s,a) = 1 - 0.8 = 0.2. For pattern 2, psi_2(s,a) = 1 - (0.5*1 + 0.3*0 + 0.2*1) = 0.3, etc. So in this example the vector psi(s,a) = [0.2, 0.3, -0.3, -0.2].
>
> In other words, psi tells us whether each pattern was actually matched more or less than we could have expected.

I understood what psi was. I am not sure how it works, but anyway I can see your algorithm now. Thanks.

--
Yamato

### Re: [computer-go] Monte-Carlo Simulation Balancing

IMO other people's equations/code/ideas/papers always seem smarter than your own. The stuff you understand and do yourself just seems like common sense, and the stuff you don't always has a mystical air of complexity, at least until you understand it too :-)

On 30-Apr-09, at 1:59 PM, Michael Williams wrote:
> I wish I was smart :(

### Re: [computer-go] Monte-Carlo Simulation Balancing

Hi Yamato,

> Thanks for the detailed explanation. M, N and alpha are constant numbers, right? What did you set them to?

You're welcome! Yes, in our experiments they were just constant numbers M=N=100.

>> The feature vector is the set of patterns you use, with value 1 if a pattern is matched and 0 otherwise. The simulation policy selects actions in proportion to the exponentiated, weighted sum of all matching patterns. For example, let's say move a matches patterns 1 and 2, move b matches patterns 1 and 3, and move c matches patterns 2 and 4. Then move a would be selected with probability e^(theta1 + theta2) / (e^(theta1 + theta2) + e^(theta1 + theta3) + e^(theta2 + theta4)). The theta values are the weights on the patterns, which we would like to learn. They are the logs of the Elo ratings in Remi Coulom's approach.
>
> OK, I guess it is formula 5 in the paper.

Yes, exactly.

>> The only tricky part is computing the vector psi(s,a). Each component of psi(s,a) corresponds to a particular pattern, and is the difference between the observed feature (i.e. whether the pattern actually occurred after move a in position s) and the expected feature (the average value of the pattern, weighted by the probability of selecting each action).
>
> I still don't understand this. Is it formula 6? Could you please give me an example like the above?

Yes, that's right, this is equation 6. Okay, let's continue the example above. Let's say that in position s, using the current theta, moves a, b and c will be selected with probability 0.5, 0.3 and 0.2 respectively. Let's say that move a was actually selected. Now consider pattern 1; this is matched after a. But the probability of matching was (0.5*1 + 0.3*1 + 0.2*0) = 0.8. So psi_1(s,a) = 1 - 0.8 = 0.2. For pattern 2, psi_2(s,a) = 1 - (0.5*1 + 0.3*0 + 0.2*1) = 0.3, etc. So in this example the vector psi(s,a) = [0.2, 0.3, -0.3, -0.2].

In other words, psi tells us whether each pattern was actually matched more or less than we could have expected. Hope this helps.

-Dave
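Dave's worked example for psi(s,a) is small enough to reproduce directly. A short sketch (the dict layout and names are mine, not the paper's): psi is the observed pattern-match vector minus its expectation under the current policy.

```python
# Worked example from the thread: psi(s,a) = phi(s,a) - E_pi[phi(s,.)],
# i.e. observed pattern matches minus their policy-weighted expectation.
phi = {
    "a": [1, 1, 0, 0],  # move a matches patterns 1 and 2
    "b": [1, 0, 1, 0],  # move b matches patterns 1 and 3
    "c": [0, 1, 0, 1],  # move c matches patterns 2 and 4
}
pi = {"a": 0.5, "b": 0.3, "c": 0.2}  # current softmax selection probabilities

def psi(chosen):
    # Expected match rate of each pattern under pi, then subtract from observed.
    expected = [sum(pi[m] * phi[m][i] for m in phi) for i in range(4)]
    return [round(phi[chosen][i] - expected[i], 10) for i in range(4)]

print(psi("a"))  # [0.2, 0.3, -0.3, -0.2], matching Dave's example
```

Pattern 1's expectation is 0.5 + 0.3 = 0.8 (moves a and b both match it), so picking a gives psi_1 = 1 - 0.8 = 0.2, exactly as in the message above.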

### Re: [computer-go] Monte-Carlo Simulation Balancing

I wish I was smart :(

David Silver wrote:
> Hi Remi,
>
>> I understood this. What I find strange is that using -1/1 should be equivalent to using 0/1, but your algorithm behaves differently: it ignores lost games with 0/1, and uses them with -1/1. Imagine you add a big constant to z - one million, say. This does not change the problem: you get either 1,000,000 or 1,000,001 as the outcome of a playout. But then your estimate of the gradient becomes complete noise. So maybe using -1/1 is better than 0/1? Since your algorithm depends so much on the definition of the reward, there must be an optimal way to set the reward, or a better way to define an algorithm that would not depend on an offset in the reward. There is still something wrong that I don't understand. There may be a way to quantify the amount of noise in the unbiased gradient estimate, and it would depend on the average reward. Probably setting the average reward to zero is what would minimize noise in the gradient estimate. This is just an intuitive guess.
>
> Okay, now I understand your point :-) It's a good question - and I think you're right. In REINFORCE any baseline can be subtracted from the reward without affecting the expected gradient, but possibly reducing its variance. The baseline leading to the best estimate is indeed the average reward. So it should be the case that {-1,+1} would estimate the gradient g more efficiently than {0,1}, assuming that we see similar numbers of black wins as white wins across the training set. So to answer your question, we can safely modify the algorithm to use (z-b) instead of z, where b is the average reward. This would then make the {0,1} and {-1,+1} cases equivalent (with appropriate scaling of the step-size). I don't think this would have affected the results we presented (because all of the learning algorithms converged anyway, at least approximately, during training), but it could be an important modification for larger boards.
>
> -Dave

### Re: [computer-go] Monte-Carlo Simulation Balancing

Hi Remi,

> I understood this. What I find strange is that using -1/1 should be equivalent to using 0/1, but your algorithm behaves differently: it ignores lost games with 0/1, and uses them with -1/1. Imagine you add a big constant to z - one million, say. This does not change the problem: you get either 1,000,000 or 1,000,001 as the outcome of a playout. But then your estimate of the gradient becomes complete noise. So maybe using -1/1 is better than 0/1? Since your algorithm depends so much on the definition of the reward, there must be an optimal way to set the reward, or a better way to define an algorithm that would not depend on an offset in the reward. There is still something wrong that I don't understand. There may be a way to quantify the amount of noise in the unbiased gradient estimate, and it would depend on the average reward. Probably setting the average reward to zero is what would minimize noise in the gradient estimate. This is just an intuitive guess.

Okay, now I understand your point :-) It's a good question - and I think you're right. In REINFORCE any baseline can be subtracted from the reward without affecting the expected gradient, but possibly reducing its variance. The baseline leading to the best estimate is indeed the average reward. So it should be the case that {-1,+1} would estimate the gradient g more efficiently than {0,1}, assuming that we see similar numbers of black wins as white wins across the training set. So to answer your question, we can safely modify the algorithm to use (z-b) instead of z, where b is the average reward. This would then make the {0,1} and {-1,+1} cases equivalent (with appropriate scaling of the step-size). I don't think this would have affected the results we presented (because all of the learning algorithms converged anyway, at least approximately, during training), but it could be an important modification for larger boards.

-Dave
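The (z - b) modification can be sanity-checked on toy numbers. In this sketch the per-game gradient vectors are made-up placeholders; it shows that REINFORCE with {0,1} rewards and baseline b = average reward gives exactly half the gradient obtained with {-1,+1} rewards and no baseline, i.e. the same direction up to the step-size scaling mentioned above.

```python
def reinforce_gradient(games, baseline):
    """games: list of (grad_log_prob, z) pairs; returns the averaged
    REINFORCE gradient sum((z - baseline) * grad_log_prob) / len(games)."""
    g = [0.0] * len(games[0][0])
    for grad, z in games:
        for i, gi in enumerate(grad):
            g[i] += (z - baseline) * gi
    return [x / len(games) for x in g]

games01 = [([1.0, -0.5], 1), ([0.5, 0.2], 0)]         # placeholder gradients, z in {0,1}
b = sum(z for _, z in games01) / len(games01)         # average reward = 0.5
gamespm = [(grad, 2 * z - 1) for grad, z in games01]  # remap rewards to {-1,+1}

g1 = reinforce_gradient(games01, b)    # {0,1} rewards with average-reward baseline
g2 = reinforce_gradient(gamespm, 0.0)  # {-1,+1} rewards, no baseline
print(g1, [x / 2 for x in g2])         # the two agree up to the factor of 2
```

With z in {0,1} and b = 0.5, each game contributes (z - 0.5) = ±0.5, which is exactly half of the ±1 contribution in the {-1,+1} scheme, so the two estimates coincide after rescaling.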

### Re: [computer-go] Monte-Carlo Simulation Balancing

David Silver wrote:
> Sorry, I should have made it clear that this assumes that we are treating black wins as z=1 and white wins as z=0. In this special case, the gradient is the average of games in which black won. But yes, more generally you need to include games won by both sides. The algorithms in the paper still cover this case - I was just trying to simplify their description to make it easy to understand the ideas.

I understood this. What I find strange is that using -1/1 should be equivalent to using 0/1, but your algorithm behaves differently: it ignores lost games with 0/1, and uses them with -1/1. Imagine you add a big constant to z - one million, say. This does not change the problem: you get either 1,000,000 or 1,000,001 as the outcome of a playout. But then your estimate of the gradient becomes complete noise. So maybe using -1/1 is better than 0/1? Since your algorithm depends so much on the definition of the reward, there must be an optimal way to set the reward, or a better way to define an algorithm that would not depend on an offset in the reward.

> The gradient already compensates for the playout policy (equation 9), so in fact it would bias the gradient to sample uniformly at random!

Yes, you are right. There is still something wrong that I don't understand. There may be a way to quantify the amount of noise in the unbiased gradient estimate, and it would depend on the average reward. Probably setting the average reward to zero is what would minimize noise in the gradient estimate. This is just an intuitive guess.

Rémi

### Re: [computer-go] Monte-Carlo Simulation Balancing

Hi Remi,

> This is strange: you do not take lost playouts into consideration. I believe there is a problem with your estimation of the gradient. Suppose for instance that you count z = +1 for a win, and z = -1 for a loss. Then you would take lost playouts into consideration. This makes me a little suspicious.

Sorry, I should have made it clear that this assumes we are treating black wins as z=1 and white wins as z=0. In this special case, the gradient is the average of games in which black won. But yes, more generally you need to include games won by both sides. The algorithms in the paper still cover this case - I was just trying to simplify their description to make it easy to understand the ideas.

> The fundamental problem here may be that your estimate of the gradient is biased by the playout policy. You should probably sample X(s) uniformly at random to have an unbiased estimator. Maybe this can be fixed with importance sampling, and then you may get a formula that is symmetrical regarding wins and losses. I don't have time to do it now, but it may be worth taking a look.

The gradient already compensates for the playout policy (equation 9), so in fact it would bias the gradient to sample uniformly at random! In equation 9, the gradient is taken with respect to the playout policy parameters. Using the product rule (third line), the gradient is equal to the playout policy probabilities multiplied by the sum of likelihood ratios multiplied by the simulation outcomes z. This gradient can be computed by sampling playouts instead of multiplying by the playout policy probabilities. This is also why games with outcomes of z=0 can be ignored - they don't affect this gradient computation.

> More precisely: you should estimate the value of N playouts as Sum p_i z_i / Sum p_i instead of Sum z_i. Then take the gradient of Sum p_i z_i / Sum p_i. This would be better. Maybe Sum p_i z_i / Sum p_i would be better for MCTS, too?

I think a similar point applies here. We care about the expected value of the playout policy, which can be sampled directly from playouts, instead of multiplying by the policy probabilities. You would only need importance sampling if you were using a different playout policy from the one you are evaluating. But I guess I'm not seeing any good reason to do this?

-Dave

### Re: [computer-go] Monte-Carlo Simulation Balancing

Rémi Coulom wrote:
> The fundamental problem here may be that your estimate of the gradient is biased by the playout policy. You should probably sample X(s) uniformly at random to have an unbiased estimator. Maybe this can be fixed with importance sampling, and then you may get a formula that is symmetrical regarding wins and losses. I don't have time to do it now, but it may be worth taking a look.
>
> Rémi

More precisely: you should estimate the value of N playouts as Sum p_i z_i / Sum p_i instead of Sum z_i. Then take the gradient of Sum p_i z_i / Sum p_i. This would be better. Maybe Sum p_i z_i / Sum p_i would be better for MCTS, too?

Rémi

### Re: [computer-go] Monte-Carlo Simulation Balancing

David Silver wrote:
> 2. Run another N simulations, average the value of psi(s,a) over all positions and moves in games that black won (call this g)

This is strange: you do not take lost playouts into consideration. I believe there is a problem with your estimation of the gradient. Suppose for instance that you count z = +1 for a win, and z = -1 for a loss. Then you would take lost playouts into consideration. This makes me a little suspicious.

The fundamental problem here may be that your estimate of the gradient is biased by the playout policy. You should probably sample X(s) uniformly at random to have an unbiased estimator. Maybe this can be fixed with importance sampling, and then you may get a formula that is symmetrical regarding wins and losses. I don't have time to do it now, but it may be worth taking a look.

Rémi

### Re: [computer-go] Monte-Carlo Simulation Balancing

David Silver wrote:
> A: Estimate value V* of every position in a training set, using deep rollouts.
>
> B: Repeat, for each position in the training set:
>   1. Run M simulations, estimate value of position (call this V)
>   2. Run another N simulations, average the value of psi(s,a) over all positions and moves in games that black won (call this g)
>   3. Adjust parameters by alpha * (V* - V) * g

Thanks for the detailed explanation. M, N and alpha are constant numbers, right? What did you set them to?

> The feature vector is the set of patterns you use, with value 1 if a pattern is matched and 0 otherwise. The simulation policy selects actions in proportion to the exponentiated, weighted sum of all matching patterns. For example, let's say move a matches patterns 1 and 2, move b matches patterns 1 and 3, and move c matches patterns 2 and 4. Then move a would be selected with probability e^(theta1 + theta2) / (e^(theta1 + theta2) + e^(theta1 + theta3) + e^(theta2 + theta4)). The theta values are the weights on the patterns, which we would like to learn. They are the logs of the Elo ratings in Remi Coulom's approach.

OK, I guess it is formula 5 in the paper.

> The only tricky part is computing the vector psi(s,a). Each component of psi(s,a) corresponds to a particular pattern, and is the difference between the observed feature (i.e. whether the pattern actually occurred after move a in position s) and the expected feature (the average value of the pattern, weighted by the probability of selecting each action).

I still don't understand this. Is it formula 6? Could you please give me an example like the above?

--
Yamato

### [computer-go] Re: [computer-go] Monte-Carlo Simulation Balancing

Hi Remi,

> What komi did you use for 5x5 and 6x6?

I used 7.5 komi for both board sizes.

> I find it strange that you get only 70 Elo points from supervised learning over uniform random. Don't you have any feature for atari extension? This one alone should improve strength immensely (extend a string in atari because of the previous move).

Actually, no. The features are very simple, and know how to capture but not how to defend ataris. I'm sure that a better set of features could improve by more than 70 Elo, but I expect we would still see a benefit to balancing the weights correctly. For example, the Fuego policy defends ataris and follows several other common-sense rules, but the results in 5x5 and 6x6 show that it is not well balanced on small boards.

> Let us extrapolate: I got more than 200 Elo points of improvement from my patterns in 9x9 over uniform random (I never really measured it; it may be even more than 200).

I guess you got more than 200 Elo on 9x9; in MoGo (Gelly et al. 2006) the improvement over uniform random was at least 400 Elo, and I think your simulation policy is probably at least as strong. By the way, I was sad to hear you're not working on CrazyStone any more. Is it because you are busy with other projects?

> So maybe I could get 600 more Elo points with your method. And even more on 19x19. I noticed that, in general, changes in the playout policy have a much bigger impact on larger boards than on smaller boards.

As to whether we can extrapolate, I hope so :-) I share the same feeling that improving the simulation policy will have more impact on bigger boards with longer simulations. On the other hand, I've been surprised by many things in MC Go, so nothing is certain until we try it!

-Dave

### Re: [computer-go] Monte-Carlo Simulation Balancing

David Silver wrote:
> Hi Michael,
>
>> But one thing confuses me: you are using the value from Fuego's 10k simulations as an approximation of the actual value of the position. But isn't the actual value of the position either a win or a loss? On such small boards, can't you assume that Fuego is able to correctly determine who is winning, and round its evaluation to the nearest win/loss? I.e., if it evaluates the position to 0.674, that gets rounded to 1. If such an assumption about Fuego's ability to read the position on a small board is valid, then it should improve the results of your balanced simulation strategy, right? Or am I missing something?
>
> It's true that 5x5 Go is solved, so in principle we could have used the true minimax values. However, we chose to use an approach that can scale to larger boards, which means that we should treat the expert evaluations as approximate. And in fact Fuego was not always accurate on 6x6 boards, as we used only 10k simulations in our training set. Also, I think it really helps to have "soft" rather than "hard" expert evaluations. We want a simulation policy that helps differentiate e.g. a 90% winning position from an 85% winning position. Rounding all the expert evaluations to 0 or 1 would lose much of this advantage.
>
> -Dave

By this argument (your last paragraph), you need some magical number of simulations for the training data: not enough simulations and you have too much noise, while infinite simulations gives you hard 0 or 1 results. But I can't argue with your results.

### Re: [computer-go] Monte-Carlo Simulation Balancing

Hi Yamato,

> Could you give us the source code which you used? Your algorithm is too complicated, so it would be very helpful if possible.

Actually, I think the source code would be much harder to understand! It is written inside RLGO, and makes use of a substantial existing framework that would take a lot of effort to understand. (On a separate note, I am considering making RLGO open source at some point, but I'd prefer to spend some effort cleaning it up before making it public.) But I think maybe Algorithm 1 is easier than you think:

A: Estimate the value V* of every position in a training set, using deep rollouts.

B: Repeat, for each position in the training set:
  1. Run M simulations, estimate the value of the position (call this V)
  2. Run another N simulations, average the value of psi(s,a) over all positions and moves in games that black won (call this g)
  3. Adjust parameters by alpha * (V* - V) * g

The feature vector is the set of patterns you use, with value 1 if a pattern is matched and 0 otherwise. The simulation policy selects actions in proportion to the exponentiated, weighted sum of all matching patterns. For example, let's say move a matches patterns 1 and 2, move b matches patterns 1 and 3, and move c matches patterns 2 and 4. Then move a would be selected with probability e^(theta1 + theta2) / (e^(theta1 + theta2) + e^(theta1 + theta3) + e^(theta2 + theta4)). The theta values are the weights on the patterns, which we would like to learn. They are the logs of the Elo ratings in Remi Coulom's approach.

The only tricky part is computing the vector psi(s,a). Each component of psi(s,a) corresponds to a particular pattern, and is the difference between the observed feature (i.e. whether the pattern actually occurred after move a in position s) and the expected feature (the average value of the pattern, weighted by the probability of selecting each action). It's also very important to be careful about signs and the colour to play - it's easy to make a mistake and follow the gradient in the wrong direction.

Is that any clearer?

-Dave
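The recipe above is concrete enough to sketch end to end. In this toy sketch everything Go-specific is an invented stand-in (one "position" with three moves, four patterns, a fixed win rate per move, and v_star = 0.8); only the softmax policy over pattern weights, the psi vector, and the alpha * (V* - V) * g update follow the description in this message.

```python
import math
import random

random.seed(1)

# Pattern-match indicators for the three moves, as in the example above.
PHI = {"a": [1, 1, 0, 0], "b": [1, 0, 1, 0], "c": [0, 1, 0, 1]}

def softmax_policy(theta):
    # Select moves in proportion to the exponentiated sum of matching weights.
    scores = {m: math.exp(sum(t * f for t, f in zip(theta, PHI[m]))) for m in PHI}
    total = sum(scores.values())
    return {m: s / total for m, s in scores.items()}

def psi(theta, move):
    # Observed pattern matches minus their expectation under the policy.
    pi = softmax_policy(theta)
    expected = [sum(pi[m] * PHI[m][i] for m in PHI) for i in range(4)]
    return [PHI[move][i] - expected[i] for i in range(4)]

def simulate(theta):
    # Toy "playout": sample one move; move a wins 80% of the time, others 30%.
    pi = softmax_policy(theta)
    move = random.choices(list(pi), weights=list(pi.values()))[0]
    z = 1 if random.random() < (0.8 if move == "a" else 0.3) else 0
    return move, z

def balancing_step(theta, v_star, M=100, N=100, alpha=0.1):
    # 1. Run M simulations to estimate the position's value V.
    V = sum(simulate(theta)[1] for _ in range(M)) / M
    # 2. Run N more simulations; average psi over moves in won games (g).
    g = [0.0] * 4
    for _ in range(N):
        move, z = simulate(theta)
        if z == 1:
            for i, p in enumerate(psi(theta, move)):
                g[i] += p / N
    # 3. Adjust the pattern weights by alpha * (V* - V) * g.
    return [t + alpha * (v_star - V) * gi for t, gi in zip(theta, g)]

theta = [0.0] * 4
for _ in range(300):
    theta = balancing_step(theta, v_star=0.8)

print({m: round(p, 2) for m, p in softmax_policy(theta).items()})
```

With these toy dynamics the update steadily shifts probability toward the frequently winning move, driving the estimated V toward V*; in the paper the same update is averaged over a whole training set of positions rather than one.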

### Re: [computer-go] Monte-Carlo Simulation Balancing

But I'm only trying to make a point, not pin the analogy down perfectly. Naturally, the stronger the player, the more likely his moves will conform to the level of the top players. The basic principle is that the longer the contest, the more opportunities a strong player has to demonstrate his superiority over an inferior player and/or recover from mistakes or surprises.

In my golf analogy, my friend hit a hole in one. Let's say it was on the first hole of 9 against Tiger Woods. Tiger likely would have holed it in 2 strokes, and my beginner friend would be ahead by 1 stroke! But that 1 stroke lead would almost certainly disappear by the time the next hole was completed, as my friend was typically taking 7 or 8 shots on that par 3 course!

Another way to look at this is that a random legal move generator has a chance to play a perfect game, although that chance is almost infinitesimal. But even that small chance can be orders of magnitude larger if you can shorten the contest by even 1 or 2 moves.

- Don

On Tue, Apr 28, 2009 at 8:09 PM, steve uurtamo wrote:
> also, i'm not sure that a lot of most amateurs' moves are very
> good. the spectrum of bad moves is wide, it's just that it takes
> someone many stones stronger to severely punish small differences
> between good and nearly-good moves. among players of relatively
> similar strength, these differences will go unnoticed and unpunished.
>
> s.
>
> [earlier quoted messages snipped]
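Don's last point is simple arithmetic over branching factors: the chance of a uniform random generator reproducing one particular game is the product of 1/(legal moves) at each turn, so removing a move from the contest multiplies that chance by the branching factor at that turn. A toy sketch (the branching factors below are invented purely for illustration):

```python
def perfect_game_prob(branching_factors):
    """Probability that a uniform random legal-move generator reproduces
    one particular game, given the number of legal moves at each turn."""
    p = 1.0
    for b in branching_factors:
        p /= b
    return p

# Invented branching factors for a ~30-move 9x9 game.
long_game = [81 - i for i in range(30)]
short_game = long_game[:-2]  # the same contest, shortened by two moves

ratio = perfect_game_prob(short_game) / perfect_game_prob(long_game)
# ratio equals the product of the two removed branching factors (53 * 52),
# i.e. shortening the contest by two moves makes the tiny chance of a
# perfect random game over three orders of magnitude larger.
```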

### Re: [computer-go] Monte-Carlo Simulation Balancing

David Silver wrote:
>> because the previous approaches were not optimized for such small
>> boards.
>
> I'm not sure what you mean here? The supervised learning and
> reinforcement learning approaches that we compared against are both
> trained on the small boards, i.e. the pattern weights are specifically
> optimised for that size of board.

Ah, sorry. I missed it.

> I agree that the handcrafted policy from Fuego was not optimised for
> small boards, which is why it performed poorly. But perhaps this is
> also interesting, i.e. it suggests that a handcrafted policy for 9x9
> Go may be significantly suboptimal when used in 19x19 Go. So
> automatically learning a simulation policy may ultimately prove to be
> very beneficial.

It is surprising that Fuego was far worse than uniform random on 5x5 and 6x6. These results show the particularity of very small boards. I suppose 5x5 is very different from 9x9, but 19x19 is not so different from 9x9.

>> I'm looking forward to your results on larger boards.
>
> Me too :-)
> Coming soon, will let you know how it goes.
> But I hope that others will try these ideas too, it's always much
> better to compare multiple implementations of the same algorithm.

Could you give us the source code which you used? Your algorithm is too complicated, so it would be very helpful if possible.

--
Yamato

### Re: [computer-go] Monte-Carlo Simulation Balancing

A fair number of amateur moves are exactly what a pro would use; this is no great surprise, since the moves are actually copied from professional games. The weaker player can blindly emulate, but does not know why a pro plays A instead of B or C, and does not know how to punish slack moves.

I've been taking some classes from an 8d pro. Sadly, I'm at the bottom of the class. The top players in these classes are 8d amateurs; the pro can give them 3 stones and win. When going over their games, many of their moves are quite respectable, but the pro is able to find mistakes which can be exploited with devastating effect. You can't afford to be "a little bit off" against a pro, but weaker players find it hard to exploit slack moves.

When explaining his moves, the pro often uses very deep reading. We've all seen simple cases, where move A is advisable if a ladder works, otherwise B is recommended. The pro knows six moves in advance that he'll need to read that ladder, before the ladder is actually on the board. These are the simplest expositions I've seen; things get more complex from there.

Terry McIntyre
"Government is an association of men who do violence to the rest of us." - Leo Tolstoy

From: steve uurtamo
To: computer-go
Sent: Tuesday, April 28, 2009 5:09:20 PM
Subject: Re: [computer-go] Monte-Carlo Simulation Balancing

[quoted message snipped - see steve uurtamo's post below]

### Re: [computer-go] Monte-Carlo Simulation Balancing

also, i'm not sure that a lot of most amateurs' moves are very good. the spectrum of bad moves is wide, it's just that it takes someone many stones stronger to severely punish small differences between good and nearly-good moves. among players of relatively similar strength, these differences will go unnoticed and unpunished.

s.

2009/4/28 Don Dailey:
> A simplistic model that helps explain this is golf. On a single hole, even
> a casual golfer has a realistic chance of out-golfing Tiger Woods.
>
> [remainder of quoted message snipped]

### Re: [computer-go] Monte-Carlo Simulation Balancing

A simplistic model that helps explain this is golf. On a single hole, even a casual golfer has a realistic chance of out-golfing Tiger Woods. Tiger occasionally shoots 1 over par on some hole, and even weak amateurs occasionally par or even birdie a hole. It's not going to happen a lot, but it's not ridiculous either. Years ago I taught a player how to golf, and on his third time out with me, he hit a hole in one on a short par 3. If Tiger Woods had been playing with us, he would have lost that hole to this beginner.

But in a 9 hole match, the odds go down enormously - for all practical purposes there is no chance.

I kind of think of GO like that, even though it's a pretty simplistic model. Each move is like a hole of golf: it can be a good "shot" or a bad one. With GO, however, probably a LOT of your moves are just as good as the moves of a good player. But it's the ones that fall short that kill you.

Go on a big board is like 18 holes of golf compared to just 1 or 2 holes of golf. The better player is far more likely to win the 18 hole match than the 1 hole match.

- Don

On Tue, Apr 28, 2009 at 1:53 PM, Ivan Dubois wrote:
>> I noticed that, in general, changes in the playout policy have a much
>> bigger impact on larger boards than on smaller boards.
>>
>> Rémi
>
> [remainder of quoted message snipped]

### Re: [computer-go] Monte-Carlo Simulation Balancing

> I noticed that, in general, changes in the playout policy have a much
> bigger impact on larger boards than on smaller boards.
>
> Rémi

I think rating differences are amplified on larger boards. This is easy to see if you think about it this way:

Somehow a 19x19 board is like four 9x9 boards. Let us define a new game that I would call 4-Go, where instead of playing one game, you play simultaneously 4 games and determine the winner by calculating the sum of the scores of the four games. Certainly rating differences would be bigger with 4-Go than with Go (given the same two players). This explains why rating differences are bigger on 19x19 than 9x9.

Ivan
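Ivan's 4-Go argument can be checked with a quick Monte-Carlo sketch. The model below is purely illustrative: the per-game score margin is drawn from a normal distribution with a small positive mean for the stronger player, and the function name `win_prob` is mine.

```python
import random

def win_prob(margin_mean, n_boards, trials=20000, seed=1):
    """Estimate how often the stronger player wins when the result is the
    summed score margin over n_boards independent games. Each per-game
    margin is modelled (for illustration only) as a normal variable with
    mean `margin_mean` and standard deviation 10."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(trials):
        total = sum(rng.gauss(margin_mean, 10.0) for _ in range(n_boards))
        if total > 0:
            wins += 1
    return wins / trials
```

Summing over 4 boards scales the mean advantage by 4 but the noise only by 2, so the stronger player's win rate moves further from 50% - exactly the amplification Ivan describes.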

### Re: [computer-go] Monte-Carlo Simulation Balancing

David Silver wrote:
>> I don't think the 200+ Elo improvement is so impressive
>
> I agree that it would be much more impressive to report positive
> results on larger boards. But perhaps it is already interesting that
> tuning the balance of the simulation policy can make a big difference
> on small boards? Also, larger boards mean longer simulations, and so
> the importance of simulation balancing should become even more
> exaggerated.

What komi did you use for 5x5 and 6x6?

I find it strange that you get only 70 Elo points from supervised learning over uniform random. Don't you have any feature for atari extension? This one alone should improve strength immensely (extend a string in atari because of the previous move).

Let us extrapolate: I got more than 200 Elo points of improvement from my patterns in 9x9 over uniform random (I never really measured it; it may be even more than 200). So maybe I could get 600 more Elo points with your method. And even more on 19x19. I noticed that, in general, changes in the playout policy have a much bigger impact on larger boards than on smaller boards.

Rémi
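For reference, Elo differences translate into expected scores through the standard logistic Elo formula, so extrapolations like the one above can be read as win rates (a minimal sketch; the function name is mine):

```python
def elo_win_prob(diff):
    """Expected score of a player rated `diff` Elo points above the
    opponent, under the standard logistic Elo model."""
    return 1.0 / (1.0 + 10.0 ** (-diff / 400.0))

# A 200-point gap corresponds to roughly a 76% expected score,
# and a 600-point gap to roughly 97%.
```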

### Re: [computer-go] Monte-Carlo Simulation Balancing

David Silver wrote:
> Please find attached my ICML paper with Gerry Tesauro on automatically
> learning a simulation policy for Monte-Carlo Go. Our preliminary
> results show a 200+ Elo improvement over previous approaches, although
> our experiments were restricted to simple Monte-Carlo search with no
> tree on small boards.

I like your idea, but why do you use only 5x5 and 6x6 Go? I don't think the 200+ Elo improvement is so impressive, because the previous approaches were not optimized for such small boards. I'm looking forward to your results on larger boards.

--
Yamato

### Re: [computer-go] Monte-Carlo Simulation Balancing

My favorite part: "One natural idea is to use the learned simulation policy in Monte-Carlo search, and generate new deep search values, in an iterative cycle."

But one thing confuses me: you are using the value from Fuego's 10k simulations as an approximation of the actual value of the position. But isn't the actual value of the position either a win or a loss? On such small boards, can't you assume that Fuego is able to correctly determine who is winning, and round its evaluation to the nearest win/loss? i.e. if it evaluates the position to 0.674, that gets rounded to 1. If such an assumption about Fuego's ability to read the position on a small board is valid, then it should improve the results of your balanced simulation strategy, right? Or am I missing something?

David Silver wrote:
> Hi everyone,
> Please find attached my ICML paper with Gerry Tesauro on automatically
> learning a simulation policy for Monte-Carlo Go. Our preliminary
> results show a 200+ Elo improvement over previous approaches, although
> our experiments were restricted to simple Monte-Carlo search with no
> tree on small boards.
>
> Abstract
> In this paper we introduce the first algorithms for efficiently
> learning a simulation policy for Monte-Carlo search. Our main idea is
> to optimise the balance of a simulation policy, so that an accurate
> spread of simulation outcomes is maintained, rather than optimising
> the direct strength of the simulation policy. We develop two
> algorithms for balancing a simulation policy by gradient descent. The
> first algorithm optimises the balance of complete simulations, using a
> policy gradient algorithm; whereas the second algorithm optimises the
> balance over every two steps of simulation. We compare our algorithms
> to reinforcement learning and supervised learning algorithms for
> maximising the strength of the simulation policy. We test each
> algorithm in the domain of 5x5 and 6x6 Computer Go, using a softmax
> policy that is parameterised by weights for a hundred simple patterns.
> When used in a simple Monte-Carlo search, the policies learnt by
> simulation balancing achieved significantly better performance, with
> half the mean squared error of a uniform random policy, and equal
> overall performance to a sophisticated Go engine.
>
> -Dave
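Michael's suggestion amounts to changing the training target V* in the balancing update. A sketch of how the rounded target would plug into the parameter adjustment of Algorithm 1 (function names are mine; `g` stands for the averaged psi vector described elsewhere in the thread):

```python
def target_value(v_star, assume_solved=False):
    """Training target for simulation balancing: either the raw deep-search
    estimate, or (Michael's suggestion) that value rounded to a certain
    win/loss when the searcher is trusted to solve the position."""
    if assume_solved:
        return 1.0 if v_star > 0.5 else 0.0
    return v_star

def update(theta, v_star, v_sim, g, alpha=0.1, assume_solved=False):
    """One gradient step: theta <- theta + alpha * (V* - V) * g."""
    t = target_value(v_star, assume_solved)
    return {f: theta.get(f, 0.0) + alpha * (t - v_sim) * g_f
            for f, g_f in g.items()}
```

With Michael's example, an evaluation of 0.674 becomes a target of 1, which enlarges the error term (V* - V) and hence the step size whenever the simulations underestimate a won position.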

### Re: [computer-go] Monte-Carlo Simulation Balancing

David Silver wrote:
> Hi everyone,
> Please find attached my ICML paper with Gerry Tesauro on automatically
> learning a simulation policy for Monte-Carlo Go. Our preliminary
> results show a 200+ Elo improvement over previous approaches, although
> our experiments were restricted to simple Monte-Carlo search with no
> tree on small boards.

Very interesting, thanks. If I understand correctly, your method makes your program 250 Elo points stronger than my pattern-learning algorithm on 5x5 and 6x6, by just learning better weights. That is impressive. I had stopped working on my program for a very long time, but maybe I will go back to work and try your algorithm.

Rémi

### Re: [computer-go] Monte-Carlo Simulation Balancing

Finally! I guess you can add this technique to your list, Lukasz.

David Silver wrote:
> Hi everyone,
> Please find attached my ICML paper with Gerry Tesauro on automatically
> learning a simulation policy for Monte-Carlo Go. Our preliminary
> results show a 200+ Elo improvement over previous approaches, although
> our experiments were restricted to simple Monte-Carlo search with no
> tree on small boards.
>
> [abstract snipped - quoted in full elsewhere in the thread]
>
> -Dave