Hi Oliver
Reinforcement learning is different from unsupervised learning. We used
reinforcement learning to train the network that plays the Atari games. We
also published a more
recent paper (www.nature.com/articles/nature14236) that applied the same
network to 50 different Atari games (achieving human level in around
Hi Martin
- Would you be willing to share some of the sgf game records played by your
network with the community? I tried to replay the game record in your
paper, but got stuck since it does not show any of the moves that got
captured.
Sorry about that, we will correct the figure and repost.
Hi Lars,
is there anyone who can repost the PDF (rave.pdf?) that the following
mails are talking about?
http://computer-go.org/pipermail/computer-go/2008-February/014095.html
I think you can still find the original attachment here:
Hi,
We used alpha=0.1. There may well be a better setting of alpha, but
this appeared to work nicely in our experiments.
-Dave
On 3-May-09, at 2:01 AM, elife wrote:
Hi Dave,
In your experiments what's the constant value alpha you set?
Thanks.
2009/5/1 David Silver sil...@cs.ualberta.ca
Hi Yamato,
If M and N are the same, is there any reason to run M simulations and
N simulations separately? What happens if you combine them and
calculate V and g in a single loop?
I think it gives the wrong answer to do it in a single loop. Note that
the simulation outcomes z are used
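To make the two-loop point concrete, here is a toy reconstruction (my own, not RLGO's code) of a simulation-balancing update with alpha = 0.1 and M = N = 100, where the value estimate V and the gradient g come from separate batches of simulations; the one-step "game" with two moves and win probabilities 0.9 / 0.2 is made up for illustration:

```python
import math
import random

# Toy reconstruction (not RLGO's code) of the two-loop structure in
# simulation balancing: estimate V from M rollouts, then the gradient g
# from N *independent* rollouts, so that E[(V* - V) g] factorises.
random.seed(0)

def softmax(theta):
    e = [math.exp(t) for t in theta]
    z = sum(e)
    return [x / z for x in e]

def simulate(theta):
    """One rollout: sample a move from the policy, return (outcome, move)."""
    p = softmax(theta)
    a = 0 if random.random() < p[0] else 1
    win_prob = (0.9, 0.2)[a]  # hypothetical per-move win probabilities
    return (1.0 if random.random() < win_prob else 0.0), a

def balancing_update(theta, v_star, M=100, N=100, alpha=0.1):
    # Loop 1: value estimate from M simulations.
    V = sum(simulate(theta)[0] for _ in range(M)) / M
    # Loop 2: gradient estimate from N fresh, independent simulations.
    p = softmax(theta)
    g = [0.0, 0.0]
    for _ in range(N):
        z, a = simulate(theta)
        for i in range(2):
            grad_log = (1.0 - p[i]) if i == a else -p[i]
            g[i] += z * grad_log / N
    # Move the policy so its rollout value tracks the target v_star.
    return [t + alpha * (v_star - V) * gi for t, gi in zip(theta, g)]

theta = [0.0, 0.0]
for _ in range(200):
    theta = balancing_update(theta, v_star=0.5)
```

If the same outcomes z were reused in a single loop, the product (v_star - V) * g would pick up a correlation term between V and g, which is presumably the bias being alluded to here.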
Hi Remi,
This is strange: you do not take lost playouts into consideration.
I believe there is a problem with your estimation of the gradient.
Suppose for instance that you count z = +1 for a win, and z = -1 for
a loss. Then you would take lost playouts into consideration. This
makes me
Hi Remi,
I understood this. What I find strange is that using -1/1 should be
equivalent to using 0/1, but your algorithm behaves differently: it
ignores lost games with 0/1, and uses them with -1/1.
Imagine you add a big constant to z. One million, say. This does not
change the problem.
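The point about shifting z can be checked numerically. A toy sketch (my own, with a made-up two-move softmax policy): the expected gradient of log pi under the policy itself is zero, so a constant added to every outcome contributes nothing in expectation:

```python
import math
import random

# Toy numerical check (not from the thread's code): for a softmax
# policy, E[grad log pi(a)] = 0 when a is drawn from the policy itself,
# so adding any constant c to every outcome z shifts the gradient
# estimate by c * E[grad log pi] = 0 in expectation.
random.seed(0)
theta = [0.3, -0.3]
e = [math.exp(t) for t in theta]
p = [x / sum(e) for x in e]

def grad_log_pi(a):
    # d/d theta_i of log pi(a) for a two-move softmax policy
    return [(1.0 - p[i]) if i == a else -p[i] for i in range(2)]

n = 200_000
acc = [0.0, 0.0]
for _ in range(n):
    a = 0 if random.random() < p[0] else 1
    for i, gi in enumerate(grad_log_pi(a)):
        acc[i] += gi / n
# acc estimates E[grad log pi]; both components come out near zero,
# so the -1/+1 and 0/1 encodings differ only in variance, not in the
# expected gradient direction.
```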
Hi Yamato,
Thanks for the detailed explanation.
M, N and alpha are constant numbers, right? What did you set them to?
You're welcome!
Yes, in our experiments they were just constant numbers M=N=100.
The feature vector is the set of patterns you use, with value 1 if a
pattern is matched and
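A minimal sketch of such a binary feature vector; the patterns, the "position", and the weights below are all made up for illustration, not RLGO's actual representation:

```python
# phi[i] = 1 if pattern i matches the current local position, else 0.
patterns = ["bw", "wb", "bb", "ww"]   # hypothetical local patterns
position = "bw"                       # hypothetical local neighbourhood

phi = [1 if pat == position else 0 for pat in patterns]

# The position's evaluation is then a dot product with learned weights.
weights = [0.5, -0.2, 0.1, 0.0]       # hypothetical learned weights
value = sum(w * f for w, f in zip(weights, phi))
```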
IMO other people's equations/code/ideas/papers always seem smarter
than your own. The stuff you understand and do yourself just seems
like common sense, and the stuff you don't always has a mystical air
of complexity, at least until you understand it too :-)
On 30-Apr-09, at 1:59 PM,
Hi Yamato,
Could you give us the source code which you used? Your algorithm is
too complicated, so it would be very helpful if possible.
Actually I think the source code would be much harder to understand!
It is written inside RLGO, and makes use of a substantial existing
framework that
Hi Remi,
What komi did you use for 5x5 and 6x6 ?
I used 7.5 komi for both board sizes.
I find it strange that you get only 70 Elo points from supervised
learning over uniform random. Don't you have any feature for atari
extension? This one alone should improve strength immensely (extend
Hi Yamato,
I like your idea, but why do you use only 5x5 and 6x6 Go?
1. Our second algorithm, two-ply simulation balancing, requires a
training set of two-ply rollouts. Rolling out every position from a
complete two-ply search is very expensive on larger board sizes, so we
would probably
Hi Remi,
If I understand correctly, your method makes your program 250 Elo
points
stronger than my pattern-learning algorithm on 5x5 and 6x6, by just
learning better weights.
Yes, although this is just in a very simple MC setting.
Also we did not compare directly to the algorithm you used
Hi Michael,
But one thing confuses me: You are using the value from Fuego's 10k
simulations as an approximation of the actual value of the
position. But isn't the actual
value of the position either a win or a loss? On such small boards,
can't you assume that Fuego is able to correctly
This document is confusing, but here is my interpretation of it. And
it works well for Valkyria. I would really want to see a pseudocode
version of it. I might post the code I use for Valkyria, but it is
probably not the same thing so I would probably just increase the
confusion if I did...
The
Hi Petr,
Thanks for the great comments, sorry to be so slow in getting back to
you (on vacation/workshop...)
Hello,
On Sun, Apr 06, 2008 at 08:55:26PM -0600, David Silver wrote:
Here is a draft of the paper, any feedback would be very welcome :-)
http://www.cs.ualberta.ca/~silver/research
Hi everyone,
Sylvain and I have had a paper accepted for the Nectar track at
the 23rd Conference on Artificial Intelligence (AAAI-08). The idea of
this track is to summarise previously published results from a
specific field for a wider audience interested in general AI.
Please bear
I am very confused about the new UCT-RAVE formula.
Equation 9 seems to mean:
variance_u = value_ur * (1 - value_ur) / n.
Is it wrong? If correct, why is it the variance?
I think that the variance of the UCT should be:
variance_u = value_u * (1 - value_u).
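The formula in question is the variance of the mean of n Bernoulli playout outcomes, whereas value_u * (1 - value_u) alone is the variance of a single outcome. A quick numerical sanity check with toy parameters of my own choosing:

```python
import random

# If each playout outcome is Bernoulli with mean p, the variance of the
# *mean* of n outcomes is p*(1-p)/n; without the /n it is the variance
# of one outcome. Checked empirically below.
random.seed(0)
p, n, trials = 0.6, 50, 20_000

means = []
for _ in range(trials):
    wins = sum(1 for _ in range(n) if random.random() < p)
    means.append(wins / n)

mu = sum(means) / trials
empirical_var = sum((m - mu) ** 2 for m in means) / trials
predicted_var = p * (1 - p) / n   # 0.6 * 0.4 / 50 = 0.0048
```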
Hi Yamato,
There are two
Hi Erik,
Thanks for the thought-provoking response!
Yes, but why add upper confidence bounds to the rave values at all? If
they really go down that fast, does it make much of a difference?
According to the recent experiments in MoGo, you are right :-)
However, I've seen slightly different
David Silver wrote:
BTW if anyone just wants the formula, and doesn't care about the
derivation - then just use equations 11-14.
Yes, I just want to use the formula.
But I don't know what the bias is...
How can I get the value of br?
Sorry for the slow reply...
The simplest answer
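For readers looking for the mixing schedule itself, here is my reconstruction of the minimum-MSE schedule from the published UCT-RAVE work; treat the exact form as my reading of the paper, not a quote from it. n is the MC visit count, n_rave the AMAF count, and b the RAVE bias parameter the question asks about:

```python
# Reconstruction (possibly inexact) of the minimum-MSE RAVE schedule:
# beta = n_rave / (n + n_rave + 4 * n * n_rave * b^2)
def rave_beta(n, n_rave, b):
    """Weight given to the RAVE value when mixing with the MC value."""
    return n_rave / (n + n_rave + 4.0 * n * n_rave * b * b)

def mixed_value(q_mc, q_rave, n, n_rave, b=0.05):
    # b = 0.05 is a placeholder; the bias must be tuned or estimated.
    beta = rave_beta(n, n_rave, b)
    return (1.0 - beta) * q_mc + beta * q_rave
```

With n = 0 this gives beta = 1 (pure RAVE); as n grows, beta decays and the MC value takes over, matching the earlier discussion of how quickly the RAVE weight should fall off.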
In other words, UCT works well when the evaluation/playouts are strong. I
believe there are still improvements possible to the UCT algorithm, as
shown by the recent papers by the MoGo and CrazyStone authors, but what
will really make a difference is the quality of the playouts.
Sylvain said
Seems like it should be up to the person in the other environment
to adapt your
successful algorithm (and notation/terminology) to their environment.
But how do the other people in other environments find out about the
algorithm? And find out that it is something they could use in their
It's because Go is not the only game in the world, and certainly not the
only reinforcement learning problem. They are using a widely accepted
terminology.
But a very inappropriate one. I have read Sutton's book, and all the
things I know (e.g. TD-Gammon) are completely obfuscated.
Really? I think
It's because Go is not the only game in the world, and certainly not the
only reinforcement learning problem. They are using a widely accepted
terminology.
But a very inappropriate one. I have read Sutton's book, and all the
things I know (e.g. TD-Gammon) are completely obfuscated. It's maybe suitable
On 5/18/07, Rémi Coulom [EMAIL PROTECTED] wrote:
My idea was very similar to what you describe. The program built a
collection of rules of the kind "if condition then move". The condition
could be anything from a tree-search rule of the kind "in this
particular position, play x", to a general rule such
Very interesting paper!
I have one question. The assumption in your paper is that increasing
the performance of the simulation player will increase the
performance of Monte-Carlo methods that use that simulation player.
However, we found in MoGo that this is not necessarily the case! Do
I also use an online learning algorithm in RLGO to adjust feature
weights during the game. I use around a million features (all
possible patterns from 1x1 up to 3x3 at all locations on the board)
and update the weights online from simulated games using temporal
difference learning. I also
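A toy TD(0) sketch in the spirit of that description, using one-hot binary features on a five-state random walk; RLGO's million Go pattern features and its specifics are not reproduced here:

```python
import random

# TD(0) with linear function approximation over binary features:
# a five-state random walk, win off the right edge, loss off the left.
random.seed(0)
n_states, alpha = 5, 0.1
w = [0.0] * n_states            # one weight per binary feature

def features(s):
    phi = [0.0] * n_states
    phi[s] = 1.0                # one-hot: only feature s is active
    return phi

def value(s):
    return sum(wi * fi for wi, fi in zip(w, features(s)))

for _ in range(5000):
    s = n_states // 2
    while True:
        s2 = s + (1 if random.random() < 0.5 else -1)
        if s2 < 0:
            target = 0.0        # stepped off the left edge: loss
        elif s2 >= n_states:
            target = 1.0        # stepped off the right edge: win
        else:
            target = value(s2)  # bootstrap from the next state's value
        # TD(0) update on the weights of the active features
        delta = target - value(s)
        for i, fi in enumerate(features(s)):
            w[i] += alpha * delta * fi
        if s2 < 0 or s2 >= n_states:
            break
        s = s2
# w approaches the true values (i + 1) / 6 for states i = 0..4.
```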
Thanks for the great paper. And thanks for sharing it before it's
published.
Now I know what directions to take my engine in next.
Time for Team MoGo to share some more secrets :)
We are publishing MoGo's secrets at ICML 2007, in just over a month.
So not long to wait now!
-Dave