Re: [computer-go] Monte-Carlo Simulation Balancing

2009-08-13 Thread Olivier Teytaud
>
>
> Just to clarify: I was not saying that Mogo's policy consisted
> *solely* of looking for patterns around the last move. Merely that
> it does not look for patterns around *every* point, which other
> playout policies (e.g., CrazyStone, if I understand Remi's papers
> correctly) appear to do. The RL paper seems to require that
> playout design.


Fine!


>
> BTW, FillBoard seems to help Pebbles, too. A few percent better
> on 9x9 games. No testing on larger boards. YMMV, and like everything
> about computer go: all predictions are guaranteed to be wrong,
> or your money back.
>

For us the improvement is essential in 19x19 - it would be nice for the
generality of "fillboard" if it helps for you too :-) The loss of diversity
due to patterns is really clear in some situations, so the problem solved by
fillboard is understandable, and I believe it should work for you as well -
but, as you say, all predictions in computer go are almost guaranteed to be
wrong :-)
Olivier

Re: [computer-go] Monte-Carlo Simulation Balancing

2009-08-13 Thread Olivier Teytaud
> . Pebbles has a "Mogo" playout design, where you check
> for patterns only around the last move (or two).
>

In MoGo, it's not only around the last move (at least with some probability
and when there are empty spaces on the board); this is the "fill board"
modification.

(This provides a big improvement in 19x19 with large numbers of simulations,
see http://www.lri.fr/~rimmel/publi/EK_explo.pdf , Fig 3 page 8 - not only a
quantitative improvement, but also, according to players, a qualitative
improvement in the way MoGo plays.)
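
A rough Python sketch of how such a fill-board step could sit in front of a
MoGo-style pattern policy (only an illustration under assumptions: the
probability FILLBOARD_PROB, the board representation and the matches_pattern
helper are invented for the example, not taken from MoGo):

import random

FILLBOARD_PROB = 0.05   # assumed probability of attempting a fill-board move
BOARD_SIZE = 19

def neighbours(p):
    x, y = p
    return [(x + dx, y + dy) for dx, dy in ((1, 0), (-1, 0), (0, 1), (0, -1))
            if 0 <= x + dx < BOARD_SIZE and 0 <= y + dy < BOARD_SIZE]

def choose_playout_move(empty, last_move, matches_pattern):
    """empty: set of empty points; matches_pattern(p): assumed pattern test."""
    # Fill-board step: occasionally test a few random empty points and play one
    # whose neighbours are all empty, restoring diversity in large open areas
    # that the pattern heuristics around the last move would never reach.
    if random.random() < FILLBOARD_PROB:
        for p in random.sample(sorted(empty), min(4, len(empty))):
            if all(n in empty for n in neighbours(p)):
                return p
    # Usual step: look for pattern moves in the neighbourhood of the last move.
    candidates = [p for p in neighbours(last_move)
                  if p in empty and matches_pattern(p)]
    if candidates:
        return random.choice(candidates)
    return random.choice(sorted(empty))   # fall back to uniform random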

Best regards,
Olivier

Re: [computer-go] Monte-Carlo Simulation Balancing

2009-08-13 Thread Jason House
A web search turned up a 2-page and an 8-page version. I read the
short one. I agree that it's promising work that requires some
follow-up research.


Now that you've read it so many times, what excites you about it? Can  
you envision a way to scale it to larger patterns and boards on modern  
hardware?


Sent from my iPhone

On Aug 12, 2009, at 11:14 PM, Michael Williams wrote:


After about the 5th reading, I'm concluding that this is an  
excellent paper.  Is anyone (besides the authors) doing research  
based on this?  There is a lot to do.



David Silver wrote:

Hi everyone,
Please find attached my ICML paper with Gerry Tesauro on  
automatically learning a simulation policy for Monte-Carlo Go. Our  
preliminary results show a 200+ Elo improvement over previous  
approaches, although our experiments were restricted to simple  
Monte-Carlo search with no tree on small boards.

Abstract
In this paper we introduce the first algorithms for efficiently  
learning a simulation policy for Monte-Carlo search. Our main idea  
is to optimise the balance of a simulation policy, so that an  
accurate spread of simulation outcomes is maintained, rather than  
optimising the direct strength of the simulation policy. We develop  
two algorithms for balancing a simulation policy by gradient  
descent. The first algorithm optimises the balance of complete  
simulations, using a policy gradient algorithm; whereas the second  
algorithm optimises the balance over every two steps of simulation.  
We compare our algorithms to reinforcement learning and supervised  
learning algorithms for maximising the strength of the simulation  
policy. We test each algorithm in the domain of 5x5 and 6x6  
Computer Go, using a softmax policy that is parameterised by  
weights for a hundred simple patterns. When used in a simple Monte- 
Carlo search, the policies learnt by simulation balancing achieved  
significantly better performance, with half the mean squared error  
of a uniform random policy, and equal overall performance to a  
sophisticated Go engine.

-Dave



Re: [computer-go] Monte-Carlo Simulation Balancing

2009-08-13 Thread Michael Williams
In future papers they should avoid using a strong authority like Fuego for the training and instead force it to learn from a naive uniform random playout policy 
(with 100x or 1000x more playouts) and then build on that with an iterative approach (as was suggested in the paper).


I also had another thought.  Since they are training the policy to maximize the balance and not the winrate, wouldn't you be able to extract more information 
from each trial by using the score instead of the game result?  The normal pitfalls to doing so do not apply here.




Isaac Deutsch wrote:
I admit I had trouble understanding the details of the paper. What I 
think is the biggest problem for applying this to bigger (up to 19x19) 
games is that you somehow need access to the "true" value of a move, 
i.e. it's a win or a loss. On the 5x5 board they used, this might be 
approximated pretty well, but there's no chance on 19x19 to do so.



On 13.08.2009, at 05:14, Michael Williams wrote:

After about the 5th reading, I'm concluding that this is an excellent 
paper.  Is anyone (besides the authors) doing research based on this?  
There is a lot to do.






Re: [computer-go] Monte-Carlo Simulation Balancing

2009-08-13 Thread Isaac Deutsch
I admit I had trouble understanding the details of the paper. What I  
think is the biggest problem for applying this to bigger (up to 19x19)  
games is that you somehow need access to the "true" value of a move,  
i.e. it's a win or a loss. On the 5x5 board they used, this might be  
approximated pretty well, but there's no chance on 19x19 to do so.



On 13.08.2009, at 05:14, Michael Williams wrote:

After about the 5th reading, I'm concluding that this is an  
excellent paper.  Is anyone (besides the authors) doing research  
based on this?  There is a lot to do.






Re: [computer-go] Monte-Carlo Simulation Balancing

2009-08-12 Thread Michael Williams

After about the 5th reading, I'm concluding that this is an excellent paper.  
Is anyone (besides the authors) doing research based on this?  There is a lot 
to do.


David Silver wrote:

Hi everyone,

Please find attached my ICML paper with Gerry Tesauro on automatically 
learning a simulation policy for Monte-Carlo Go. Our preliminary results 
show a 200+ Elo improvement over previous approaches, although our 
experiments were restricted to simple Monte-Carlo search with no tree on 
small boards.





Abstract

In this paper we introduce the first algorithms for efficiently learning 
a simulation policy for Monte-Carlo search. Our main idea is to optimise 
the balance of a simulation policy, so that an accurate spread of 
simulation outcomes is maintained, rather than optimising the direct 
strength of the simulation policy. We develop two algorithms for 
balancing a simulation policy by gradient descent. The first algorithm 
optimises the balance of complete simulations, using a policy gradient 
algorithm; whereas the second algorithm optimises the balance over every 
two steps of simulation. We compare our algorithms to reinforcement 
learning and supervised learning algorithms for maximising the strength 
of the simulation policy. We test each algorithm in the domain of 5x5 
and 6x6 Computer Go, using a softmax policy that is parameterised by 
weights for a hundred simple patterns. When used in a simple Monte-Carlo 
search, the policies learnt by simulation balancing achieved 
significantly better performance, with half the mean squared error of a 
uniform random policy, and equal overall performance to a sophisticated 
Go engine.


-Dave






Re: [computer-go] Monte-Carlo Simulation Balancing

2009-05-04 Thread David Silver

Hi,

We used alpha=0.1. There may well be a better setting of alpha, but  
this appeared to work nicely in our experiments.

-Dave

On 3-May-09, at 2:01 AM, elife wrote:


Hi Dave,

 In your experiments what's the constant value alpha you set?

Thanks.

2009/5/1 David Silver:


Yes, in our experiments they were just constant numbers M=N=100.



Re: [computer-go] Monte-Carlo Simulation Balancing

2009-05-03 Thread elife
Hi Dave,

  In your experiments what's the constant value alpha you set?

Thanks.

2009/5/1 David Silver:

> Yes, in our experiments they were just constant numbers M=N=100.


Re: [computer-go] Monte-Carlo Simulation Balancing

2009-05-01 Thread David Silver

Hi Yamato,


If M and N are the same, is there any reason to run M simulations and
N simulations separately?  What happens if you combine them and calculate
V and g in a single loop?


I think it gives the wrong answer to do it in a single loop. Note that  
the simulation outcomes z are used to compute both V and g, so that  
they are quite strongly correlated. In general, if random variables X  
and Y are correlated then E[X]E[Y] != E[XY].


A simple example of this problem would be sampling E[X]E[X] for some  
random variable X. Let's say X is actually just noise with mean zero.  
If you just take one sample x1, then x1*x1 is always +ve, and you  
would estimate E[X]E[X]=E[X^2]>0. But if you take two independent  
samples x1 and x2, then x1*x2 would be +ve half the time and -ve half  
the time, and you would correctly estimate E[X]E[X]=0.
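
A quick numerical check of this point (just an illustration, not from the
paper): estimate E[X]E[X] for zero-mean Gaussian noise, once by reusing a
single sample and once with two independent samples.

import random

random.seed(0)
n = 100000
same, indep = 0.0, 0.0
for _ in range(n):
    x1 = random.gauss(0, 1)
    x2 = random.gauss(0, 1)
    same += x1 * x1    # one sample reused: biased, converges to E[X^2] = 1
    indep += x1 * x2   # independent samples: unbiased, converges to E[X]E[X] = 0
print(same / n, indep / n)   # roughly 1.0 and roughly 0.0

This is exactly why V and g get their own simulations: sharing the outcomes z
would couple the two estimates in the same way.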


-Dave


Re: [computer-go] Monte-Carlo Simulation Balancing

2009-04-30 Thread Yamato
David Silver wrote:
>Yes, in our experiments they were just constant numbers M=N=100.

If M and N are the same, is there any reason to run M simulations and
N simulations separately?  What happens if you combine them and calculate
V and g in a single loop?

>Okay, let's continue the example above. Let's say that in position s,  
>using the current theta, moves a, b and c will be selected with  
>probability 0.5, 0.3 and 0.2 respectively. Let's say that move a was  
>actually selected. Now consider pattern 1, this is matched after a.  
>But the probability of matching was (0.5*1 +0.3*1 +0.2*0) = 0.8. So  
>psi_1(s,a)=1-0.8 = 0.2. For pattern 2, psi_2(s,a)=1- 
>(0.5*1+0.3*0+0.2*1)=0.3, etc.. So in this example the vector psi(s,a)  
>= [0.2,0.3,-0.3,-0.2].
>
>In other words, psi tells us whether each pattern was actually matched  
>more or less than we could have expected.

I understood what psi was. I am not sure how it works, but anyway I can
see your algorithm now. Thanks.

--
Yamato


Re: [computer-go] Monte-Carlo Simulation Balancing

2009-04-30 Thread David Silver
IMO other people's equations/code/ideas/papers always seem smarter  
than your own. The stuff you understand and do yourself just seems  
like common sense, and the stuff you don't always has a mystical air  
of complexity, at least until you understand it too :-)


On 30-Apr-09, at 1:59 PM, Michael Williams wrote:


I wish I was smart   :(




Re: [computer-go] Monte-Carlo Simulation Balancing

2009-04-30 Thread David Silver

Hi Yamato,


Thanks for the detailed explanation.
M, N and alpha are constant numbers, right?  What did you set them to?


You're welcome!
Yes, in our experiments they were just constant numbers M=N=100.


The feature vector is the set of patterns you use, with value 1 if a
pattern is matched and 0 otherwise. The simulation policy selects
actions in proportion to the exponentiated, weighted sum of all
matching patterns. For example let's say move a matches patterns 1 and
2, move b matches patterns 1 and 3, and move c matches patterns 2 and
4. Then move a would be selected with probability e^(theta1 +
theta2) / (e^(theta1 + theta2) + e^(theta1 + theta3) + e^(theta2 +
theta4)). The theta values are the weights on the patterns which we
would like to learn. They are the log of the Elo ratings in Remi
Coulom's approach.


OK, I guess it is the formula 5 in the paper.


Yes, exactly.


The only tricky part is computing the vector psi(s,a). Each component
of psi(s,a) corresponds to a particular pattern, and is the difference
between the observed feature (i.e. whether the pattern actually
occurred after move a in position s) and the expected feature (the
average value of the pattern, weighted by the probability of selecting
each action).


I still don't understand this. Is it the formula 6?
Could you please give me an example like the above?


Yes that's right, this is equation 6.

Okay, let's continue the example above. Let's say that in position s,  
using the current theta, moves a, b and c will be selected with  
probability 0.5, 0.3 and 0.2 respectively. Let's say that move a was  
actually selected. Now consider pattern 1, this is matched after a.  
But the probability of matching was (0.5*1 +0.3*1 +0.2*0) = 0.8. So  
psi_1(s,a)=1-0.8 = 0.2. For pattern 2, psi_2(s,a)=1- 
(0.5*1+0.3*0+0.2*1)=0.3, etc.. So in this example the vector psi(s,a)  
= [0.2,0.3,-0.3,-0.2].


In other words, psi tells us whether each pattern was actually matched  
more or less than we could have expected.
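
Here is the same example as a few lines of Python, just as a worked check of
the numbers above (phi and prob are copied from the example):

# phi[m][k] = 1 if move m matches pattern k (patterns 1-4 from the example)
phi  = {"a": [1, 1, 0, 0], "b": [1, 0, 1, 0], "c": [0, 1, 0, 1]}
prob = {"a": 0.5, "b": 0.3, "c": 0.2}   # current policy probabilities

def psi(selected):
    expected = [sum(prob[m] * phi[m][k] for m in phi) for k in range(4)]
    return [phi[selected][k] - expected[k] for k in range(4)]

print(psi("a"))   # approximately [0.2, 0.3, -0.3, -0.2], as in the text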


Hope this helps.
-Dave


Re: [computer-go] Monte-Carlo Simulation Balancing

2009-04-30 Thread Michael Williams

I wish I was smart   :(


David Silver wrote:

Hi Remi,

I understood this. What I find strange is that using -1/1 should be 
equivalent to using 0/1, but your algorithm behaves differently: it 
ignores lost games with 0/1, and uses them with -1/1.


Imagine you add a big constant to z. One million, say. This does not 
change the problem. You get either 1,000,000 or 1,000,001 as the outcome of a
playout. But then, your estimate of the gradient becomes complete noise.


So maybe using -1/1 is better than 0/1? Since your algorithm depends
so much on the definition of the reward, there must be an optimal way
to set the reward. Or there must be a better way to define an algorithm
that would not depend on an offset in the reward.


There is still something wrong that I don't understand. There may be a 
way to quantify the amount of noise in the unbiased gradient estimate, 
and it would depend on the average reward. Probably setting the 
average reward to zero is what would minimize noise in the gradient 
estimate. This is just an intuitive guess.


Okay, now I understand your point :-) It's a good question - and I think 
you're right. In REINFORCE any baseline can be subtracted from the 
reward, without affecting the expected gradient, but possibly reducing 
its variance. The baseline leading to the best estimate is indeed the 
average reward.  So it should be the case that  {-1,+1} would estimate 
the gradient g more efficiently than {0,1}, assuming that we see similar 
numbers of black wins as white wins across the training set.


So to answer your question, we can safely modify the algorithm to use 
(z-b) instead of z, where b is the average reward. This would then make 
the {0,1} and {-1,+1} cases equivalent (with appropriate scaling of 
step-size). I don't think this would have affected the results we 
presented (because all of the learning algorithms converged anyway, at 
least approximately, during training) but it could be an important 
modification for larger boards.


-Dave



Re: [computer-go] Monte-Carlo Simulation Balancing

2009-04-30 Thread David Silver

Hi Remi,

I understood this. What I find strange is that using -1/1 should be  
equivalent to using 0/1, but your algorithm behaves differently: it  
ignores lost games with 0/1, and uses them with -1/1.


Imagine you add a big constant to z. One million, say. This does not  
change the problem. You get either 1,000,000 or 1,000,001 as the outcome of
a playout. But then, your estimate of the gradient becomes complete  
noise.


So maybe using -1/1 is better than 0/1? Since your algorithm
depends so much on the definition of the reward, there must be an
optimal way to set the reward. Or there must be a better way to define
an algorithm that would not depend on an offset in the reward.


There is still something wrong that I don't understand. There may be  
a way to quantify the amount of noise in the unbiased gradient  
estimate, and it would depend on the average reward. Probably  
setting the average reward to zero is what would minimize noise in  
the gradient estimate. This is just an intuitive guess.


Okay, now I understand your point :-) It's a good question - and I  
think you're right. In REINFORCE any baseline can be subtracted from  
the reward, without affecting the expected gradient, but possibly  
reducing its variance. The baseline leading to the best estimate is  
indeed the average reward.  So it should be the case that  {-1,+1}  
would estimate the gradient g more efficiently than {0,1}, assuming  
that we see similar numbers of black wins as white wins across the  
training set.


So to answer your question, we can safely modify the algorithm to use  
(z-b) instead of z, where b is the average reward. This would then  
make the {0,1} and {-1,+1} cases equivalent (with appropriate scaling  
of step-size). I don't think this would have affected the results we  
presented (because all of the learning algorithms converged anyway, at  
least approximately, during training) but it could be an important  
modification for larger boards.
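
In code the change is tiny. A minimal sketch of one update with the baseline
(the names and the per-simulation data layout are assumptions for
illustration, not RLGO's code):

def balancing_update(theta, psi_sum, z, baseline, error, alpha):
    # psi_sum: sum of psi(s, a) over the moves of one simulation
    # z: simulation outcome; baseline: the average reward b
    # error: (V* - V) for this training position; alpha: step size
    # Subtracting b leaves the expected gradient unchanged but reduces its
    # variance, which is what makes {0,1} and {-1,+1} rewards behave the same.
    g = [(z - baseline) * p for p in psi_sum]
    return [t + alpha * error * gk for t, gk in zip(theta, g)]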


-Dave



Re: [computer-go] Monte-Carlo Simulation Balancing

2009-04-30 Thread Rémi Coulom

David Silver wrote:


Sorry, I should have made it clear that this assumes that we are 
treating black wins as z=1 and white wins as z=0.
In this special case, the gradient is the average of games in which 
black won.
But yes, more generally you need to include games won by both sides. 
The algorithms in the paper still cover this case - I was just trying 
to simplify their description to make it easy to understand the ideas.


I understood this. What I find strange is that using -1/1 should be 
equivalent to using 0/1, but your algorithm behaves differently: it 
ignores lost games with 0/1, and uses them with -1/1.


Imagine you add a big constant to z. One million, say. This does not 
change the problem. You get either 1,000,000 or 1,000,001 as the outcome of a
playout. But then, your estimate of the gradient becomes complete noise.


So maybe using -1/1 is better than 0/1? Since your algorithm depends so
much on the definition of the reward, there must be an optimal way to
set the reward. Or there must be a better way to define an algorithm that
would not depend on an offset in the reward.




The gradient already compensates for the playout policy (equation 9), 
so in fact it would bias the gradient to sample uniformly at random!

Yes, you are right.

There is still something wrong that I don't understand. There may be a 
way to quantify the amount of noise in the unbiased gradient estimate, 
and it would depend on the average reward. Probably setting the average 
reward to zero is what would minimize noise in the gradient estimate. 
This is just an intuitive guess.


Rémi


Re: [computer-go] Monte-Carlo Simulation Balancing

2009-04-30 Thread David Silver

Hi Remi,


This is strange: you do not take lost playouts into consideration.

I believe there is a problem with your estimation of the gradient.  
Suppose for instance that you count z = +1 for a win, and z = -1 for  
a loss. Then you would take lost playouts into consideration. This  
makes me a little suspicious.


Sorry, I should have made it clear that this assumes that we are  
treating black wins as z=1 and white wins as z=0.
In this special case, the gradient is the average of games in which  
black won.
But yes, more generally you need to include games won by both sides.  
The algorithms in the paper still cover this case - I was just trying  
to simplify their description to make it easy to understand the ideas.


The fundamental problem here may be that your estimate of the  
gradient is biased by the playout policy. You should probably sample  
X(s) uniformly at random to have an unbiased estimator. Maybe this  
can be fixed with importance sampling, and then you may get a  
formula that is symmetrical regarding wins and losses. I don't have  
time to do it now, but it may be worth taking a look.


The gradient already compensates for the playout policy (equation 9),  
so in fact it would bias the gradient to sample uniformly at random!


In equation 9, the gradient is taken with respect to the playout  
policy parameters. Using the product rule (third line), the gradient  
is equal to the playout policy probabilities multiplied by the sum of
likelihood ratios multiplied by the simulation outcomes z. This
gradient can be computed by sampling playouts  instead of multiplying  
by the playout policy probabilities. This is also why games with  
outcomes of z=0 can be ignored - they don't affect this gradient  
computation.
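
A minimal sketch of that sampling (illustration only: the playout
representation is invented, the psi vectors are as in equation 6, and z is
taken in {0,1} as discussed above):

def estimate_gradient(playouts, num_patterns):
    # playouts: list of (psi_vectors, z) pairs, one per simulation, where
    # psi_vectors holds one psi(s, a) vector per move played and z is 1 if
    # black won, 0 otherwise.
    g = [0.0] * num_patterns
    for psi_vectors, z in playouts:
        if z == 0:
            continue                  # z = 0 games add nothing to this sum
        for psi_vec in psi_vectors:   # sum of likelihood ratios over the game
            for k in range(num_patterns):
                g[k] += z * psi_vec[k]
    return [gk / len(playouts) for gk in g]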


More precisely: you should estimate the value of N playouts as Sum  
p_i z_i / Sum p_i instead of Sum z_i. Then, take the gradient of Sum  
p_i z_i / Sum p_i. This would be better.


Maybe Sum p_i z_i / Sum p_i would be better for MCTS, too ?


I think a similar point applies here. We care about the expected value  
of the playout policy, which can be sampled directly from playouts,  
instead of multiplying by the policy probabilities. You would only  
need importance sampling if you were using a different playout policy  
to the one which you are evaluating. But I guess I'm not seeing any  
good reason to do this?


-Dave



Re: [computer-go] Monte-Carlo Simulation Balancing

2009-04-30 Thread Rémi Coulom

Rémi Coulom wrote:
The fundamental problem here may be that your estimate of the gradient 
is biased by the playout policy. You should probably sample X(s) 
uniformly at random to have an unbiased estimator. Maybe this can be 
fixed with importance sampling, and then you may get a formula that is 
symmetrical regarding wins and losses. I don't have time to do it now, 
but it may be worth taking a look.


Rémi 


More precisely: you should estimate the value of N playouts as Sum p_i 
z_i / Sum p_i instead of Sum z_i. Then, take the gradient of Sum p_i z_i 
/ Sum p_i. This would be better.


Maybe Sum p_i z_i / Sum p_i would be better for MCTS, too ?

Rémi


Re: [computer-go] Monte-Carlo Simulation Balancing

2009-04-30 Thread Rémi Coulom

David Silver wrote:


2. Run another N simulations, average the value of psi(s,a) over 
all positions and moves in games that black won (call this g) 


This is strange: you do not take lost playouts into consideration.

I believe there is a problem with your estimation of the gradient. 
Suppose for instance that you count z = +1 for a win, and z = -1 for a 
loss. Then you would take lost playouts into consideration. This makes 
me a little suspicious.


The fundamental problem here may be that your estimate of the gradient 
is biased by the playout policy. You should probably sample X(s) 
uniformly at random to have an unbiased estimator. Maybe this can be 
fixed with importance sampling, and then you may get a formula that is 
symmetrical regarding wins and losses. I don't have time to do it now, 
but it may be worth taking a look.


Rémi


Re: [computer-go] Monte-Carlo Simulation Balancing

2009-04-29 Thread Yamato
David Silver wrote:
>A: Estimate value V* of every position in a training set, using deep  
>rollouts.
>
>B: Repeat, for each position in the training set
>   1. Run M simulations, estimate value of position (call this V)
>   2. Run another N simulations, average the value of psi(s,a) over all  
>positions and moves in games that black won (call this g)
>   3. Adjust parameters by alpha * (V* - V) * g

Thanks for the detailed explanation.
M, N and alpha are constant numbers, right?  What did you set them to?

>The feature vector is the set of patterns you use, with value 1 if a  
>pattern is matched and 0 otherwise. The simulation policy selects  
>actions in proportion to the exponentiated, weighted sum of all  
>matching patterns. For example let's say move a matches patterns 1 and  
>2, move b matches patterns 1 and 3, and move c matches patterns 2 and  
>4. Then move a would be selected with probability e^(theta1 +  
>theta2) / (e^(theta1 + theta2) + e^(theta1 + theta3) + e^(theta2 +  
>theta4)). The theta values are the weights on the patterns which we  
>would like to learn. They are the log of the Elo ratings in Remi  
>Coulom's approach.

OK, I guess it is the formula 5 in the paper.

>The only tricky part is computing the vector psi(s,a). Each component  
>of psi(s,a) corresponds to a particular pattern, and is the difference  
>between the observed feature (i.e. whether the pattern actually  
>occurred after move a in position s) and the expected feature (the  
>average value of the pattern, weighted by the probability of selecting  
>each action).

I still don't understand this. Is it the formula 6?
Could you please give me an example like the above?

--
Yamato


[computer-go] Re: [computer go] Monte-Carlo Simulation Balancing

2009-04-29 Thread David Silver

Hi Remi,


What komi did you use for 5x5 and 6x6 ?

I used 7.5 komi for both board sizes.


I find it strange that you get only 70 Elo points from supervised
learning over uniform random. Don't you have any feature for atari
extension ? This one alone should improve strength immensely (extend
string in atari because of the previous move).
Actually no. The features are very simple, and know how to capture but  
not how to defend ataris. I'm sure that a better set of features could  
improve by more than 70 Elo, but I expect we would still see a benefit  
to balancing the weights correctly. For example, the Fuego policy  
defends ataris and follows several other common-sense rules, but the  
results in 5x5 and 6x6 show that it is not well balanced on small  
boards.


Let us extrapolate: I got more than 200 Elo points of improvement from
my patterns in 9x9 over uniform random (I never really measured that, it
may be even more than 200).
I guess you got more than 200 Elo on 9x9 - in MoGo (Gelly et al. 2006)
the improvement from uniform random was at least 400 Elo, and I think
your simulation policy is probably at least as strong.


By the way I was sad to hear you're not working on Crazystone any
more. Is it because you are busy with other projects?



So maybe I could get 600 more Elo points
with your method. And even more on 19x19.
I noticed that, in general, changes in the playout policy have a much
bigger impact on larger boards than on smaller boards.


As to whether we can extrapolate, I hope so :-)
I share the same feeling that improving the simulation policy will be  
more impactful on bigger boards with longer simulations.
On the other hand I've been surprised by many things in MC Go, so  
nothing is certain until we try it!


-Dave



Re: [computer-go] Monte-Carlo Simulation Balancing

2009-04-29 Thread Michael Williams

David Silver wrote:

Hi Michael,

But one thing confuses me: You are using the value from Fuego's 10k 
simulations as an approximation of the actual value of the position.  
But isn't the actual
value of the position either a win or a loss?  On such small boards, 
can't you assume that Fuego is able to correctly determine who is
winning and round its
evaluation to the nearest win/loss?  i.e. if it evaluates the position 
to 0.674, that gets rounded to 1.  If such an assumption about Fuego's 
ability to read
the position on a small board is valid, then it should improve the 
results of your balanced simulation strategy, right?  Or am I missing 
something?


It's true that 5x5 Go is solved, so in principle we could have used the 
true minimax values. However we chose to use an approach that can scale 
to larger boards, which means that we should treat the expert 
evaluations as approximate. And in fact Fuego was not always accurate on 
6x6 boards, as we used only 10k simulations in our training set.


Also, I think it really helps to have "soft" rather than "hard" expert 
evaluations. We want a simulation policy that helps differentiate e.g. a 
90% winning position from an 85% winning position. Rounding all the 
expert evaluations to 0 or 1 would lose much of this advantage.


-Dave


By this argument (your last paragraph), you need to do some magical
number of simulations for the training data.  Not enough simulations
and you have too much noise.  And infinite simulations give you hard
0 or 1 results.  But I can't argue with your results.



Re: [computer-go] Monte-Carlo Simulation Balancing

2009-04-29 Thread David Silver

Hi Yamato,


Could you give us the source code which you used?  Your algorithm is
too complicated, so it would be very helpful if possible.


Actually I think the source code would be much harder to understand!  
It is written inside RLGO, and makes use of a substantial existing  
framework that would take a lot of effort to understand. (On a  
separate note I am considering making RLGO open source at some point,  
but I'd prefer to spend some effort cleaning it up before making it  
public).


But I think maybe Algorithm 1 is much easier than you think:

A: Estimate value V* of every position in a training set, using deep  
rollouts.


B: Repeat, for each position in the training set
   1. Run M simulations, estimate value of position (call this V)
   2. Run another N simulations, average the value of psi(s,a) over all
      positions and moves in games that black won (call this g)
   3. Adjust parameters by alpha * (V* - V) * g

The feature vector is the set of patterns you use, with value 1 if a  
pattern is matched and 0 otherwise. The simulation policy selects  
actions in proportion to the exponentiated, weighted sum of all  
matching patterns. For example let's say move a matches patterns 1 and  
2, move b matches patterns 1 and 3, and move c matches patterns 2 and  
4. Then move a would be selected with probability e^(theta1 +  
theta2) / (e^(theta1 + theta2) + e^(theta1 + theta3) + e^(theta2 +  
theta4)). The theta values are the weights on the patterns which we  
would like to learn. They are the log of the Elo ratings in Remi  
Coulom's approach.


The only tricky part is computing the vector psi(s,a). Each component  
of psi(s,a) corresponds to a particular pattern, and is the difference  
between the observed feature (i.e. whether the pattern actually  
occurred after move a in position s) and the expected feature (the  
average value of the pattern, weighted by the probability of selecting  
each action).


It's also very important to be careful about signs and the colour to  
play - it's easy to make a mistake and follow the gradient in the  
wrong direction.
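
If it helps, here is a compact Python sketch of the whole of Algorithm 1 as
described above. It is only a sketch under assumptions: the simulate()
interface, the match() helper and the data layout are invented for
illustration; only the softmax selection and the three steps follow the
description (e.g. with M = N = 100 and alpha = 0.1, as mentioned elsewhere
in this thread).

import math

def softmax_probs(moves, match, theta):
    # match(m) lists the indices of the patterns matched by move m (assumed)
    scores = [math.exp(sum(theta[k] for k in match(m))) for m in moves]
    total = sum(scores)
    return [s / total for s in scores]

def policy_gradient_balancing(training_set, simulate, theta, alpha, M, N):
    # training_set: list of (position, v_star), v_star from deep rollouts
    # simulate(position, theta) -> (psi_sum, z): one playout that picks moves
    # with softmax_probs, returning the summed psi vector and z = 1 if black
    # won, else 0
    for position, v_star in training_set:
        # 1. Run M simulations to estimate the value V of the position
        V = sum(simulate(position, theta)[1] for _ in range(M)) / M
        # 2. Run another N simulations, averaging psi over games black won
        g = [0.0] * len(theta)
        for _ in range(N):
            psi_sum, z = simulate(position, theta)
            if z == 1:
                g = [gk + pk for gk, pk in zip(g, psi_sum)]
        g = [gk / N for gk in g]
        # 3. Adjust the parameters by alpha * (V* - V) * g
        theta = [t + alpha * (v_star - V) * gk for t, gk in zip(theta, g)]
    return theta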


Is that any clearer?
-Dave



Re: [computer-go] Monte-Carlo Simulation Balancing

2009-04-28 Thread Don Dailey
But I'm only trying to make a point, not pin the analogy down perfectly.
Naturally the stronger the player, the more likely his moves will conform to
the level of the top players.

The basic principle is that the longer the contest, the more opportunities a
strong player has to demonstrate his superiority over an inferior player
and/or recover from mistakes or surprises.In my golf analogy,   my
friend hits a hole in one.  Let's say it was on the first hole of 9 against
Tiger Woods.   Tiger likely would have hit it in 2 strokes,  and my beginner
friend would be ahead by 1 stroke!But that 1 stroke lead would almost
certainly disappear by the time the next hole was completed, as my friend
was typically taking 7 or 8 shots on that par 3 course!

Another way to look at this is that a random legal move generator has a
chance to play a perfect game, although that chance is almost infinitesimal.
But even that small chance can be an order of magnitude larger if you can
shorten the contest even by 1 or 2 moves.

- Don


On Tue, Apr 28, 2009 at 8:09 PM, steve uurtamo  wrote:

> also, i'm not sure that a lot of most amateurs' moves are very
> good.  the spectrum of bad moves is wide, it's just that it takes
> someone many stones stronger to severely punish small differences
> between good and nearly-good moves.  among players of relatively
> similar strength, these differences will go unnoticed and unpunished.
>
> s.
>
> 2009/4/28 Don Dailey:
> > A simplistic model that helps explain this is golf.   On a single hole,
> even
> > a casual golfer has a realistic chance of out-golfing Tiger Woods.  Tiger
> > occasionally shoots a 1 over par on some hole and even weak amateurs
> > occasionally par or even birdie a hole.It's not going to happen a
> lot,
> > but it's not ridiculous either.   Years ago I taught a player how to
> golf,
> > and on his third time out with me,  he hit a hole in one on a short par
> > 3. If Tiger Woods had been playing with us, he would have lost that
> hole
> > to this beginner.
> >
> > But in a 9 hole match,  the odds go down enormously - for all practical
> > purposes there is no chance.
> >
> > I kind of think of GO like that, even though it's a pretty simplistic
> > model.   Each move is like a hole of golf,  it can be a good "shot" or a
> bad
> > one. With GO, however, probably a LOT of your moves are just as good
> as
> > the moves of a good player.   But it's the ones that fall short, that
> kill
> > you.
> >
> > Go on a big board is like 18 holes of golf  compared to just 1 or 2 holes
> of
> > golf.   The better player is far more likely to win the 18 hole match
> than
> > the 1 hole match.
> >
> > - Don
> >
> >
> >
> >
> >
> > On Tue, Apr 28, 2009 at 1:53 PM, Ivan Dubois 
> wrote:
> >>>
> >>> I noticed that, in general, changes in the playout policy have a much
> >>> bigger impact on larger boards than on smaller boards.
> >>>
> >>> Rémi
> >>
> >> I think rating differences are amplified on larger boards. This is easy
> to
> >> see if you think about it this way :
> >>
> >> Somehow a 19x19 board is like 4 9x9 boards. Let us define a new game
> that
> >> I would call 4-Go where instead of playing one game, you play
> simultaneously
> >> 4 games and determine the winner by calculating the sum of the scores of
> the
> >> four games. Certainly rating differences would be bigger with 4-go than
> with
> >> go (given the same two players). This explains why rating differences
> are
> >> bigger on 19x19 than 9x9.
> >>
> >> Ivan
> >>

Re: [computer-go] Monte-Carlo Simulation Balancing

2009-04-28 Thread Yamato
David Silver wrote:
>> because the previous approaches were not optimized for such small
>> boards.
>
>I'm not sure what you mean here? The supervised learning and  
>reinforcement learning approaches that we compared against are both  
>trained on the small boards, i.e. the pattern weights are specifically  
>optimised for that size of board.

Ah, sorry. I missed it.

>I agree that the handcrafted policy from Fuego was not optimised for  
>small boards, which is why it performed poorly. But perhaps this is  
>also interesting, i.e. it suggests that a handcrafted policy for 9x9  
>Go may be significantly suboptimal when used in 19x19 Go. So  
>automatically learning a simulation policy may ultimately prove to be  
>very beneficial.

It is surprising Fuego was far worse than uniform random on 5x5 and 6x6.
These results show the particularity of the very small boards.
I suppose 5x5 is very different from 9x9, but 19x19 is not so different from 9x9.

>> I'm looking forward to your results on larger boards.
>
>Me too :-)
>Coming soon, will let you know how it goes.
>But I hope that others will try these ideas too, it's always much  
>better to compare multiple implementations of the same algorithm.

Could you give us the source code which you used?  Your algorithm is
too complicated, so it would be very helpful if possible.

--
Yamato


Re: [computer-go] Monte-Carlo Simulation Balancing

2009-04-28 Thread terry mcintyre
A fair number of amateur moves are exactly what a pro would use; this is no 
great surprise, since the moves are actually copied from professional games. 
The weaker player can blindly emulate, but does not know why a pro plays A 
instead of B or C; does not know how to punish slack moves.

I've been taking some classes from an 8d pro. Sadly, I'm at the bottom of the 
class. The top players in these classes are 8d amateurs; the pro can give them 
3 stones and win. When going over their games, many of their moves are quite 
respectable, but the pro is able to find mistakes which can be exploited with 
devastating effect. You can't afford to be "a little bit off" against a pro, 
but weaker players find it hard to exploit slack moves.

 
When explaining his moves, the pro often uses very deep reading. We've all seen 
simple cases, where move A is advisable if a ladder works, otherwise B is 
recommended. The pro knows six moves in advance that he'll need to read that 
ladder, before the ladder is actually on the board. These are the simplest 
expositions I've seen; things get more complex from there. 
Terry McIntyre 


"Government is an association of men who do violence to the rest of us."
- Leo Tolstoy





From: steve uurtamo 
To: computer-go 
Sent: Tuesday, April 28, 2009 5:09:20 PM
Subject: Re: [computer-go] Monte-Carlo Simulation Balancing

also, i'm not sure that a lot of most amateurs' moves are very
good.  the spectrum of bad moves is wide, it's just that it takes
someone many stones stronger to severely punish small differences
between good and nearly-good moves.  among players of relatively
similar strength, these differences will go unnoticed and unpunished.

s.

2009/4/28 Don Dailey:
> A simplistic model that helps explain this is golf.   On a single hole, even
> a casual golfer has a realistic chance of out-golfing Tiger Woods.  Tiger
> occasionally shoots a 1 over par on some hole and even weak amateurs
> occasionally par or even birdie a hole.It's not going to happen a lot,
> but it's not ridiculous either.   Years ago I taught a player how to golf,
> and on his third time out with me,  he hit a hole in one on a short par
> 3. If Tiger Woods had been playing with us, he would have lost that hole
> to this beginner.
>
> But in a 9 hole match,  the odds go down enormously - for all practical
> purposes there is no chance.
>
> I kind of think of GO like that, even though it's a pretty simplistic
> model.   Each move is like a hole of golf,  it can be a good "shot" or a bad
> one. With GO, however, probably a LOT of your moves are just as good as
> the moves of a good player.   But it's the ones that fall short, that kill
> you.
>
> Go on a big board is like 18 holes of golf  compared to just 1 or 2 holes of
> golf.   The better player is far more likely to win the 18 hole match than
> the 1 hole match.
>
> - Don
>
>
>
>
>
> On Tue, Apr 28, 2009 at 1:53 PM, Ivan Dubois  wrote:
>>>
>>> I noticed that, in general, changes in the playout policy have a much
>>> bigger impact on larger boards than on smaller boards.
>>>
>>> Rémi
>>
>> I think rating differences are amplified on larger boards. This is easy to
>> see if you think about it this way :
>>
>> Somehow a 19x19 board is like 4 9x9 boards. Let us define a new game that
>> I would call 4-Go where instead of playing one game, you play simultaneously
>> 4 games and determine the winner by calculating the sum of the scores of the
>> four games. Certainly rating differences would be bigger with 4-go than with
>> go (given the same two players). This explains why rating differences are
>> bigger on 19x19 than 9x9.
>>
>> Ivan
>>

Re: [computer-go] Monte-Carlo Simulation Balancing

2009-04-28 Thread steve uurtamo
also, i'm not sure that a lot of most amateurs' moves are very
good.  the spectrum of bad moves is wide, it's just that it takes
someone many stones stronger to severely punish small differences
between good and nearly-good moves.  among players of relatively
similar strength, these differences will go unnoticed and unpunished.

s.

2009/4/28 Don Dailey:
> A simplistic model that helps explain this is golf.   On a single hole, even
> a casual golfer has a realistic chance of out-golfing Tiger Woods.  Tiger
> occasionally shoots a 1 over par on some hole and even weak amateurs
> occasionally par or even birdie a hole.    It's not going to happen a lot,
> but it's not ridiculous either.   Years ago I taught a player how to golf,
> and on his third time out with me,  he hit a hole in one on a short par
> 3. If Tiger Woods had been playing with us, he would have lost that hole
> to this beginner.
>
> But in a 9 hole match,  the odds go down enormously - for all practical
> purposes there is no chance.
>
> I kind of think of GO like that, even though it's a pretty simplistic
> model.   Each move is like a hole of golf,  it can be a good "shot" or a bad
> one. With GO, however, probably a LOT of your moves are just as good as
> the moves of a good player.   But it's the ones that fall short, that kill
> you.
>
> Go on a big board is like 18 holes of golf  compared to just 1 or 2 holes of
> golf.   The better player is far more likely to win the 18 hole match than
> the 1 hole match.
>
> - Don
>
>
>
>
>
> On Tue, Apr 28, 2009 at 1:53 PM, Ivan Dubois  wrote:
>>>
>>> I noticed that, in general, changes in the playout policy have a much
>>> bigger impact on larger boards than on smaller boards.
>>>
>>> Rémi
>>
>> I think rating differences are amplified on larger boards. This is easy to
>> see if you think about it this way :
>>
>> Somehow a 19x19 board is like 4 9x9 boards. Let us define a new game that
>> I would call 4-Go where instead of playing one game, you play simultaneously
>> 4 games and determine the winner by calculating the sum of the scores of the
>> four games. Certainly rating differences would be bigger with 4-go than with
>> go (given the same two players). This explains why rating differences are
>> bigger on 19x19 than 9x9.
>>
>> Ivan
>>


Re: [computer-go] Monte-Carlo Simulation Balancing

2009-04-28 Thread Don Dailey
A simplistic model that helps explain this is golf.   On a single hole, even
a casual golfer has a realistic chance of out-golfing Tiger Woods.  Tiger
occasionally shoots a 1 over par on some hole and even weak amateurs
occasionally par or even birdie a hole.It's not going to happen a lot,
but it's not ridiculous either.   Years ago I taught a player how to golf,
and on his third time out with me,  he hit a hole in one on a short par
3. If Tiger Woods had been playing with us, he would have lost that hole
to this beginner.

But in a 9 hole match,  the odds go down enormously - for all practical
purposes there is no chance.

I kind of think of GO like that, even though it's a pretty simplistic
model.   Each move is like a hole of golf,  it can be a good "shot" or a bad
one. With GO, however, probably a LOT of your moves are just as good as
the moves of a good player.   But it's the ones that fall short, that kill
you.

Go on a big board is like 18 holes of golf  compared to just 1 or 2 holes of
golf.   The better player is far more likely to win the 18 hole match than
the 1 hole match.

- Don





On Tue, Apr 28, 2009 at 1:53 PM, Ivan Dubois  wrote:

> I noticed that, in general, changes in the playout policy have a much
>> bigger impact on larger boards than on smaller boards.
>>
>> Rémi
>>
>
> I think rating differences are amplified on larger boards. This is easy to
> see if you think about it this way :
>
> Somehow a 19x19 board is like 4 9x9 boards. Let us define a new game that I
> would call 4-Go where instead of playing one game, you play simultaneously 4
> games and determine the winner by calculating the sum of the scores of the
> four games. Certainly rating differences would be bigger with 4-go than with
> go (given the same two players). This explains why rating differences are
> bigger on 19x19 than 9x9.
>
> Ivan
>

Re: [computer-go] Monte-Carlo Simulation Balancing

2009-04-28 Thread Ivan Dubois
I noticed that, in general, changes in the playout policy have a much 
bigger impact on larger boards than on smaller boards.


Rémi


I think rating differences are amplified on larger boards. This is easy to
see if you think about it this way:


Somehow a 19x19 board is like 4 9x9 boards. Let us define a new game that I 
would call 4-Go where instead of playing one game, you play simultaneously 4
games and determine the winner by calculating the sum of the scores of the 
four games. Certainly rating differences would be bigger with 4-go than with 
go (given the same two players). This explains why rating differences are 
bigger on 19x19 than 9x9.
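
A quick simulation makes the amplification visible (purely illustrative; the
per-game score distribution is an arbitrary assumption, not data):

import random

random.seed(1)
trials = 100000
mean_margin, sd = 3.0, 15.0   # assumed score margin per game, stronger player

def stronger_wins(n_games):
    # winner decided by the sum of the score margins over n_games games
    return sum(random.gauss(mean_margin, sd) for _ in range(n_games)) > 0

one_game   = sum(stronger_wins(1) for _ in range(trials)) / trials
four_games = sum(stronger_wins(4) for _ in range(trials)) / trials
print(one_game, four_games)   # the 4-game total amplifies the edge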


Ivan 





Re: [computer-go] Monte-Carlo Simulation Balancing

2009-04-28 Thread Rémi Coulom

David Silver wrote:

I don't think the 200+ Elo improvement is so impressive


I agree that it would be much more impressive to report positive 
results on larger boards. But perhaps it is already interesting that 
tuning the balance of the simulation policy can make a big difference 
on small boards? Also, larger boards mean longer simulations, and so 
the importance of simulation balancing should become even more 
exaggerated. 


What komi did you use for 5x5 and 6x6 ?

I find it strange that you get only 70 Elo points from supervised 
learning over uniform random. Don't you have any feature for atari 
extension ? This one alone should improve strength immensely (extend 
string in atari because of the previous move).


Let us extrapolate: I got more than 200 Elo points of improvement from
my patterns in 9x9 over uniform random (I never really measured that, it 
may be even more than 200). So maybe I could get 600 more Elo points 
with your method. And even more on 19x19.


I noticed that, in general, changes in the playout policy have a much 
bigger impact on larger boards than on smaller boards.


Rémi


Re: [computer-go] Monte-Carlo Simulation Balancing

2009-04-27 Thread Yamato
David Silver wrote:
>Please find attached my ICML paper with Gerry Tesauro on automatically  
>learning a simulation policy for Monte-Carlo Go. Our preliminary  
>results show a 200+ Elo improvement over previous approaches, although  
>our experiments were restricted to simple Monte-Carlo search with no  
>tree on small boards.

I like your idea, but why do you use only 5x5 and 6x6 Go?  I don't think
the 200+ Elo improvement is so impressive because the previous approaches
were not optimized for such small boards. I'm looking forward to your
results on larger boards.

--
Yamato


Re: [computer-go] Monte-Carlo Simulation Balancing

2009-04-27 Thread Michael Williams

My favorite part:
"One natural idea is to use the learned simulation policy in Monte-Carlo search, and 
generate new deep search values, in an iterative cycle."

But one thing confuses me: You are using the value from Fuego's 10k simulations as an approximation of the actual value of the position.  But isn't the actual 
value of the position either a win or a loss?  On such small boards, can't you assume that Fuego is able to correctly determine who is winning and round its
evaluation to the nearest win/loss?  i.e. if it evaluates the position to 0.674, that gets rounded to 1.  If such an assumption about Fuego's ability to read 
the position on a small board is valid, then it should improve the results of your balanced simulation strategy, right?  Or am I missing something?





David Silver wrote:

Hi everyone,

Please find attached my ICML paper with Gerry Tesauro on automatically 
learning a simulation policy for Monte-Carlo Go. Our preliminary results 
show a 200+ Elo improvement over previous approaches, although our 
experiments were restricted to simple Monte-Carlo search with no tree on 
small boards.





Abstract

In this paper we introduce the first algorithms for efficiently learning 
a simulation policy for Monte-Carlo search. Our main idea is to optimise 
the balance of a simulation policy, so that an accurate spread of 
simulation outcomes is maintained, rather than optimising the direct 
strength of the simulation policy. We develop two algorithms for 
balancing a simulation policy by gradient descent. The first algorithm 
optimises the balance of complete simulations, using a policy gradient 
algorithm; whereas the second algorithm optimises the balance over every 
two steps of simulation. We compare our algorithms to reinforcement 
learning and supervised learning algorithms for maximising the strength 
of the simulation policy. We test each algorithm in the domain of 5x5 
and 6x6 Computer Go, using a softmax policy that is parameterised by 
weights for a hundred simple patterns. When used in a simple Monte-Carlo 
search, the policies learnt by simulation balancing achieved 
significantly better performance, with half the mean squared error of a 
uniform random policy, and equal overall performance to a sophisticated 
Go engine.


-Dave






Re: [computer-go] Monte-Carlo Simulation Balancing

2009-04-27 Thread Rémi Coulom

David Silver wrote:

Hi everyone,

Please find attached my ICML paper with Gerry Tesauro on automatically 
learning a simulation policy for Monte-Carlo Go. Our preliminary 
results show a 200+ Elo improvement over previous approaches, although 
our experiments were restricted to simple Monte-Carlo search with no 
tree on small boards.


Very interesting, thanks.

If I understand correctly, your method makes your program 250 Elo points 
stronger than my pattern-learning algorithm on 5x5 and 6x6, by just 
learning better weights. That is impressive. I had stopped working on my 
program for a very long time, but maybe I will go back to work and try 
your algorithm.


Rémi


Re: [computer-go] Monte-Carlo Simulation Balancing

2009-04-27 Thread Michael Williams

Finally!  I guess you can add this technique to your list, Lukasz.


David Silver wrote:

Hi everyone,

Please find attached my ICML paper with Gerry Tesauro on automatically 
learning a simulation policy for Monte-Carlo Go. Our preliminary results 
show a 200+ Elo improvement over previous approaches, although our 
experiments were restricted to simple Monte-Carlo search with no tree on 
small boards.





Abstract

In this paper we introduce the first algorithms for efficiently learning 
a simulation policy for Monte-Carlo search. Our main idea is to optimise 
the balance of a simulation policy, so that an accurate spread of 
simulation outcomes is maintained, rather than optimising the direct 
strength of the simulation policy. We develop two algorithms for 
balancing a simulation policy by gradient descent. The first algorithm 
optimises the balance of complete simulations, using a policy gradient 
algorithm; whereas the second algorithm optimises the balance over every 
two steps of simulation. We compare our algorithms to reinforcement 
learning and supervised learning algorithms for maximising the strength 
of the simulation policy. We test each algorithm in the domain of 5x5 
and 6x6 Computer Go, using a softmax policy that is parameterised by 
weights for a hundred simple patterns. When used in a simple Monte-Carlo 
search, the policies learnt by simulation balancing achieved 
significantly better performance, with half the mean squared error of a 
uniform random policy, and equal overall performance to a sophisticated 
Go engine.


-Dave



