Re: [Computer-go] Mastering Chess and Shogi by Self-Play with a General Reinforcement Learning Algorithm

2017-12-06 Thread Eric Boesch
I could be drawing wrong inferences from incomplete information, but as
Darren pointed out, this paper does leave the impression Alpha Zero is not
as strong as the real AlphaGo Zero, in which case it would be clearer to
say so explicitly. Of course the chess and shogi results are impressive
regardless. (In chess, winning 28 of 100 is good, but 0 losses is even
better. Entering a drawn sequence starting from an inferior position --
such as playing black -- is a desirable result even for a perfect program
without contempt, so failing to win as black says little about actual
strength.)

Comparing the Elo charts in this new paper and the Nature paper on AlphaGo
Zero, and assigning AlphaGo Lee a reference rating of 0 Elo, it appears
that the order of strength at go is Alpha Zero (~900 Elo), then AlphaGo
Master (~1400 Elo), then the full-strength AlphaGo Zero (~1500 Elo).

I would also think Alpha Zero's 8 hours of training, with the help of an
immense network of 5,000 first-generation TPUs, is more expensive, and only
faster in a strictly chronological sense, than the 20-block AlphaGo Zero's
3-day training with 4 second-generation TPUs.


On Wed, Dec 6, 2017 at 4:29 PM, Brian Sheppard via Computer-go <
computer-go@computer-go.org> wrote:

> The chess result is 64-36: a 100 rating point edge! I think the Stockfish
> open source project improved Stockfish by ~20 rating points in the last
> year. Given the number of people/computers involved, Stockfish’s annual
> effort level seems comparable to the AZ effort.
>
>
>
> Stockfish is really, really tweaked out to do exactly what it does. It is
> very hard to improve anything about Stockfish. To be clear: I am not
> disparaging the code or people or project in any way. The code is great,
> people are great, project is great. It is really easy to work on Stockfish,
> but very hard to make progress given the extraordinarily fine balance of
> resources that already exists.  I tried hard for about 6 months last year
> without any successes. I tried dozens (maybe 100?) experiments, including
> several that were motivated by automated tuning or automated searching for
> opportunities. No luck.
>
>
>
> AZ would dominate the current TCEC. Stockfish didn’t lose a game in the
> semi-final, failing to make the final because of too many draws against the
> weaker players.
>
>
>
> The Stockfish team will have some self-examination going forward for sure.
> I wonder what they will decide to do.
>
>
>
> I hope this isn’t the last we see of these DeepMind programs.
>
>
>
> *From:* Computer-go [mailto:computer-go-boun...@computer-go.org] *On
> Behalf Of *Richard Lorentz
> *Sent:* Wednesday, December 6, 2017 12:50 PM
> *To:* computer-go@computer-go.org
> *Subject:* Re: [Computer-go] Mastering Chess and Shogi by Self-Play with
> a General Reinforcement Learning Algorithm
>
>
>
> One chess result stood out for me, namely, just how much easier it was for
> AlphaZero to win with white (25 wins, 25 draws, 0 losses) rather than with
> black (3 wins, 47 draws, 0 losses).
>
> Maybe we should not give up on the idea of White to play and win in chess!
>
> On 12/06/2017 01:24 AM, Hiroshi Yamashita wrote:
>
> Hi,
>
> DeepMind makes strongest Chess and Shogi programs with AlphaGo Zero
> method.
>
> Mastering Chess and Shogi by Self-Play with a General Reinforcement
> Learning Algorithm
> https://arxiv.org/pdf/1712.01815.pdf
>
> AlphaZero(Chess) outperformed Stockfish after 4 hours,
> AlphaZero(Shogi) outperformed elmo after 2 hours.
>
> Search is MCTS.
> AlphaZero(Chess) searches 80,000 positions/sec.
> Stockfish searches 70,000,000 positions/sec.
> AlphaZero(Shogi) searches 40,000 positions/sec.
> elmo searches 35,000,000 positions/sec.
>
> Thanks,
> Hiroshi Yamashita
>
> ___
> Computer-go mailing list
> Computer-go@computer-go.org
> http://computer-go.org/mailman/listinfo/computer-go
>
>
>
> ___
> Computer-go mailing list
> Computer-go@computer-go.org
> http://computer-go.org/mailman/listinfo/computer-go
>
___
Computer-go mailing list
Computer-go@computer-go.org
http://computer-go.org/mailman/listinfo/computer-go

Re: [Computer-go] Mastering Chess and Shogi by Self-Play with a General Reinforcement Learning Algorithm

2017-12-06 Thread Darren Cook
>> One of the changes they made (bottom of p.3) was to continuously 
>> update the neural net, rather than require a new network to beat
>> it 55% of the time to be used. (That struck me as strange at the
>> time, when reading the AlphaGoZero paper - why not just >50%?)

Gian wrote:
> I read that as a simple way of establishing confidence that the 
> result was statistically significant > 0. (+35 Elo over 400 games...

Brian Sheppard also:
> Requiring a margin > 55% is a defense against a random result. A 55% 
> score in a 400-game match is 2 sigma.

Good point. That makes sense.

But in A vs. B (where A is the best network so far, and B is the newer
network), if B wins 50.1%, there is a slightly greater than 50-50 chance
that B is better than A. In the extreme case of a 54.9% win rate there is
something like a 94%-6% chance (?) that B is better, but they still
throw B away.
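
For what it's worth, here is a rough way to put numbers on that guess. It
is only a sketch: it ignores draws, assumes a flat prior on B's true win
rate, and uses a normal approximation, so the exact figures depend on those
assumptions.

#include <cmath>
#include <cstdio>

// Rough estimate of P(B is really better than A) after an n-game match,
// ignoring draws; with a flat prior the posterior over B's true win rate
// is approximately Normal(observed, observed*(1-observed)/n).
double prob_b_better(double observed, int n) {
    double sigma = std::sqrt(observed * (1.0 - observed) / n);
    double z = (observed - 0.5) / sigma;            // distance above 50% in sigmas
    return 0.5 * std::erfc(-z / std::sqrt(2.0));    // standard normal CDF at z
}

int main() {
    std::printf("50.1%% of 400 games: P(B better) ~ %.2f\n", prob_b_better(0.501, 400));
    std::printf("54.9%% of 400 games: P(B better) ~ %.2f\n", prob_b_better(0.549, 400));
    return 0;
}

Under those assumptions the 54.9% case comes out nearer 97-98% than 94%,
but the point stands either way: B is very probably better, yet gets thrown
away.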

If B just got lucky and A was really better, then the next generation is
just that much more likely to dethrone B, so long-term you won't lose much.

On the other hand, at very strong levels, this might prevent
improvement, as a jump to 55% win rate in just one generation sounds
unlikely to happen. (Did I understand that right? As B is thrown away,
and A continues to be used, there is only that one generation within
which to improve on it, each time?)

Darren
___
Computer-go mailing list
Computer-go@computer-go.org
http://computer-go.org/mailman/listinfo/computer-go

Re: [Computer-go] Mastering Chess and Shogi by Self-Play with a General Reinforcement Learning Algorithm

2017-12-06 Thread Brian Sheppard via Computer-go
Requiring a margin > 55% is a defense against a random result. A 55% score in a 
400-game match is 2 sigma.
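
(A quick sanity check of that figure, under the simplifying assumption of a
pure win/loss match with no draws:

#include <cmath>
#include <cstdio>

int main() {
    // Standard deviation of the score in an n-game match when each game is
    // an independent 50/50 win or loss (draws ignored).
    const int n = 400;
    const double sigma = std::sqrt(0.5 * 0.5 / n);         // = 0.025, i.e. 2.5%
    std::printf("sigma = %.3f; 55%% is %.1f sigma above an even score\n",
                sigma, (0.55 - 0.50) / sigma);             // prints 2.0
    return 0;
}

So 55% of 400 games is indeed right at 2 sigma.)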

But I like the AZ policy better, because it does not require arbitrary 
parameters. It also improves more fluidly by always drawing training examples 
from the current probability distribution, and when the program is close to 
perfect you would be able to capture the last 5% of skill.

I am not sure what to make of the AZ vs AGZ result. Mathematically, there 
should be a degree of training sufficient for AZ to exceed any fixed level of 
skill, such as AGZ's 40-block/40-day level. So there must be a reason why
DeepMind did not report such a result, but it is unclear what that is.

-Original Message-
From: Computer-go [mailto:computer-go-boun...@computer-go.org] On Behalf Of 
Darren Cook
Sent: Wednesday, December 6, 2017 12:58 PM
To: computer-go@computer-go.org
Subject: Re: [Computer-go] Mastering Chess and Shogi by Self-Play with a 
General Reinforcement Learning Algorithm

> Mastering Chess and Shogi by Self-Play with a General Reinforcement 
> Learning Algorithm https://arxiv.org/pdf/1712.01815.pdf

One of the changes they made (bottom of p.3) was to continuously update the 
neural net, rather than require a new network to beat it 55% of the time to be 
used. (That struck me as strange at the time, when reading the AlphaGoZero 
paper - why not just >50%?)

The AlphaZero paper shows it out-performs AlphaGoZero, but they are comparing 
to the 20-block, 3-day version. Not the 40-block, 40-day version that was even 
stronger.

As papers rarely show failures, can we take it to mean they couldn't 
out-perform their best go bot, do you think? If so, I wonder how hard they 
tried?

In other words, do you think the changes they made from AlphaGo Zero to Alpha 
Zero have made it weaker (when just viewed from the point of view of making the 
strongest possible go program).

Darren
___
Computer-go mailing list
Computer-go@computer-go.org
http://computer-go.org/mailman/listinfo/computer-go

___
Computer-go mailing list
Computer-go@computer-go.org
http://computer-go.org/mailman/listinfo/computer-go

Re: [Computer-go] Mastering Chess and Shogi by Self-Play with a General Reinforcement Learning Algorithm

2017-12-06 Thread Brian Sheppard via Computer-go
The chess result is 64-36: a 100 rating point edge! I think the Stockfish open 
source project improved Stockfish by ~20 rating points in the last year. Given 
the number of people/computers involved, Stockfish’s annual effort level seems 
comparable to the AZ effort.
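
(As a back-of-the-envelope check, treating the 64% match score as the
expected score in the standard logistic Elo model; this is just a sketch,
not how anyone officially computed the rating edge:

#include <cmath>
#include <cstdio>

// Elo difference implied by an expected score under the logistic model:
// score = 1 / (1 + 10^(-diff/400)).
double elo_from_score(double score) {
    return -400.0 * std::log10(1.0 / score - 1.0);
}

int main() {
    std::printf("64%% score -> about %+.0f Elo\n", elo_from_score(0.64));  // ~ +100
    return 0;
}

which gives roughly +100 Elo.)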

 

Stockfish is really, really tweaked out to do exactly what it does. It is very 
hard to improve anything about Stockfish. To be clear: I am not disparaging the 
code or people or project in any way. The code is great, people are great, 
project is great. It is really easy to work on Stockfish, but very hard to make 
progress given the extraordinarily fine balance of resources that already 
exists. I tried hard for about 6 months last year without any success. I
tried dozens (maybe 100?) experiments, including several that were motivated by 
automated tuning or automated searching for opportunities. No luck.

 

AZ would dominate the current TCEC. Stockfish didn’t lose a game in the 
semi-final, failing to make the final because of too many draws against the 
weaker players.

 

The Stockfish team will have some self-examination going forward for sure. I 
wonder what they will decide to do.

 

I hope this isn’t the last we see of these DeepMind programs.

 

From: Computer-go [mailto:computer-go-boun...@computer-go.org] On Behalf Of 
Richard Lorentz
Sent: Wednesday, December 6, 2017 12:50 PM
To: computer-go@computer-go.org
Subject: Re: [Computer-go] Mastering Chess and Shogi by Self-Play with a 
General Reinforcement Learning Algorithm

 

One chess result stood out for me, namely, just how much easier it was for 
AlphaZero to win with white (25 wins, 25 draws, 0 losses) rather than with 
black (3 wins, 47 draws, 0 losses).

Maybe we should not give up on the idea of White to play and win in chess!

On 12/06/2017 01:24 AM, Hiroshi Yamashita wrote:

Hi, 

DeepMind makes strongest Chess and Shogi programs with AlphaGo Zero method. 

Mastering Chess and Shogi by Self-Play with a General Reinforcement Learning 
Algorithm 
https://arxiv.org/pdf/1712.01815.pdf

AlphaZero(Chess) outperformed Stockfish after 4 hours, 
AlphaZero(Shogi) outperformed elmo after 2 hours. 

Search is MCTS. 
AlphaZero(Chess) searches 80,000 positions/sec. 
Stockfish searches 70,000,000 positions/sec. 
AlphaZero(Shogi) searches 40,000 positions/sec. 
elmo searches 35,000,000 positions/sec. 

Thanks, 
Hiroshi Yamashita 

___ 
Computer-go mailing list 
Computer-go@computer-go.org   
http://computer-go.org/mailman/listinfo/computer-go

 

___
Computer-go mailing list
Computer-go@computer-go.org
http://computer-go.org/mailman/listinfo/computer-go

Re: [Computer-go] Mastering Chess and Shogi by Self-Play with a General Reinforcement Learning Algorithm

2017-12-06 Thread Petr Baudis
On Wed, Dec 06, 2017 at 09:57:42AM -0800, Darren Cook wrote:
> > Mastering Chess and Shogi by Self-Play with a General Reinforcement
> > Learning Algorithm
> > https://arxiv.org/pdf/1712.01815.pdf
> 
> One of the changes they made (bottom of p.3) was to continuously update
> the neural net, rather than require a new network to beat it 55% of the
> time to be used. (That struck me as strange at the time, when reading
> the AlphaGoZero paper - why not just >50%?)

  Yes, that also struck me.  I think it's good news for the community to
see it reported that this works, as it makes the training process much
more straightforward.  They also use just 800 simulations, which is more
good news.  (Both were among the first tradeoffs I made in Nochi.)

  Another interesting tidbit: they use the TPUs to also generate the
selfplay games.

> The AlphaZero paper shows it out-performs AlphaGoZero, but they are
> comparing to the 20-block, 3-day version. Not the 40-block, 40-day
> version that was even stronger.
> 
> As papers rarely show failures, can we take it to mean they couldn't
> out-perform their best go bot, do you think? If so, I wonder how hard
> they tried?

  IMHO the most likely explanation is that this research has been going
on for a while, and when they started in this direction, that early
version was their state-of-the-art baseline.  This kind of chronology, with
the 40-block version being almost "a last-minute addition", is imho
apparent even in the text of the Nature paper.

  Also, the 3-day version simply had roughly the same training time
available as AlphaZero did.

-- 
Petr Baudis, Rossum
Run before you walk! Fly before you crawl! Keep moving forward!
If we fail, I'd rather fail really hugely.  -- Moist von Lipwig
___
Computer-go mailing list
Computer-go@computer-go.org
http://computer-go.org/mailman/listinfo/computer-go

Re: [Computer-go] Mastering Chess and Shogi by Self-Play with a General Reinforcement Learning Algorithm

2017-12-06 Thread Gian-Carlo Pascutto
On 6/12/2017 19:48, Xavier Combelle wrote:
> Another result is that chess is really drawish, at the opposite of shogi

We sort-of knew that, but OTOH isn't that also because the resulting
engine strength was close to Stockfish, unlike in other games?

-- 
GCP
___
Computer-go mailing list
Computer-go@computer-go.org
http://computer-go.org/mailman/listinfo/computer-go

Re: [Computer-go] Mastering Chess and Shogi by Self-Play with a General Reinforcement Learning Algorithm

2017-12-06 Thread Ingo Althöfer
> The AlphaZero paper shows it out-performs AlphaGoZero, but they are
> comparing to the 20-block, 3-day version. Not the 40-block, 40-day
> version that was even stronger.
> As papers rarely show failures, can we take it to mean they couldn't
> out-perform their best go bot, do you think? ...
> 
> In other words, do you think the changes they made from AlphaGo Zero to
> Alpha Zero have made it weaker ...

Just some speculation:

The article on AlphaGo Zero is in NATURE.
Perhaps they did the AlphaZero research simultaneously,
and when facing problems with acceptance at a journal (like NATURE)
they decided to publish a preliminary version of AlphaZero on arXiv.
So perhaps the 40-block, 40-day experiment had not yet been done when
they wrote the AlphaZero paper.

Just speculating...
Ingo.
___
Computer-go mailing list
Computer-go@computer-go.org
http://computer-go.org/mailman/listinfo/computer-go

Re: [Computer-go] Mastering Chess and Shogi by Self-Play with a General Reinforcement Learning Algorithm

2017-12-06 Thread Gian-Carlo Pascutto
On 6/12/2017 18:57, Darren Cook wrote:
>> Mastering Chess and Shogi by Self-Play with a General Reinforcement
>> Learning Algorithm
>> https://arxiv.org/pdf/1712.01815.pdf
> 
> One of the changes they made (bottom of p.3) was to continuously update
> the neural net, rather than require a new network to beat it 55% of the
> time to be used. (That struck me as strange at the time, when reading
> the AlphaGoZero paper - why not just >50%?)

I read that as a simple way of establishing confidence that the result
was statistically significantly > 0. (+35 Elo over 400 games - I don't
know by heart how large the typical error margin of 400 games is, but I
think it won't be far off!)
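
For what it's worth, a rough answer to that parenthetical, ignoring draws
and using a normal approximation of the match score:

#include <cmath>
#include <cstdio>

int main() {
    // 2-sigma error margin of a 400-game match, expressed in Elo using the
    // logistic model near an even score.
    const int n = 400;
    const double sigma = std::sqrt(0.25 / n);                                  // 2.5% of score
    const double margin = -400.0 * std::log10(1.0 / (0.5 + 2.0 * sigma) - 1.0);
    std::printf("2-sigma margin over %d games: about %.0f Elo\n", n, margin);  // ~35
    return 0;
}

That works out to roughly 35 Elo, so +35 Elo over 400 games sits right at
the 2-sigma edge.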

-- 
GCP
___
Computer-go mailing list
Computer-go@computer-go.org
http://computer-go.org/mailman/listinfo/computer-go

Re: [Computer-go] Mastering Chess and Shogi by Self-Play with a General Reinforcement Learning Algorithm

2017-12-06 Thread Xavier Combelle
Another result is that chess is really drawish, in contrast to shogi.


Le 06/12/2017 à 18:50, Richard Lorentz a écrit :
> One chess result stood out for me, namely, just how much easier it was
> for AlphaZero to win with white (25 wins, 25 draws, 0 losses) rather
> than with black (3 wins, 47 draws, 0 losses).
>
> Maybe we should not give up on the idea of White to play and win in chess!
>
> On 12/06/2017 01:24 AM, Hiroshi Yamashita wrote:
>> Hi,
>>
>> DeepMind makes strongest Chess and Shogi programs with AlphaGo Zero
>> method.
>>
>> Mastering Chess and Shogi by Self-Play with a General Reinforcement
>> Learning Algorithm
>> https://arxiv.org/pdf/1712.01815.pdf
>>
>>
>> AlphaZero(Chess) outperformed Stockfish after 4 hours,
>> AlphaZero(Shogi) outperformed elmo after 2 hours.
>>
>> Search is MCTS.
>> AlphaZero(Chess) searches 80,000 positions/sec.
>> Stockfish searches 70,000,000 positions/sec.
>> AlphaZero(Shogi) searches 40,000 positions/sec.
>> elmo searches 35,000,000 positions/sec.
>>
>> Thanks,
>> Hiroshi Yamashita
>>
>> ___
>> Computer-go mailing list
>> Computer-go@computer-go.org
>> http://computer-go.org/mailman/listinfo/computer-go
>
>
>
> ___
> Computer-go mailing list
> Computer-go@computer-go.org
> http://computer-go.org/mailman/listinfo/computer-go

___
Computer-go mailing list
Computer-go@computer-go.org
http://computer-go.org/mailman/listinfo/computer-go

Re: [Computer-go] Mastering Chess and Shogi by Self-Play with a General Reinforcement Learning Algorithm

2017-12-06 Thread Ingo Althöfer
"Joshua Shriver"  asked:
> What about arimaa?

My personal impression: Arimaa should be rather easy for the
AlphaZero approach.


My questions:
* How well does the AlphaZero approach
perform in Non-zero-sum games?
(or in games with more than two players)

* How well does the AlphaZero approach
perform in games with a robot component
(for instance in Frisbee Go)?
https://www.althofer.de/robot-play/frisbee-robot-go.jpg

* How well does AlphaZero perform in games where "we"
know the best moves by mathematical analysis (for instance
the Nim game), or where we know that the second player
has a mirror strategy to secure a draw?

Ingo.

PS. For a long time I thought that Boston Dynamics was
the best horse in Google's stable. But it seems that
DeepMind was and is better...
___
Computer-go mailing list
Computer-go@computer-go.org
http://computer-go.org/mailman/listinfo/computer-go

Re: [Computer-go] Mastering Chess and Shogi by Self-Play with a General Reinforcement Learning Algorithm

2017-12-06 Thread Darren Cook
> Mastering Chess and Shogi by Self-Play with a General Reinforcement
> Learning Algorithm
> https://arxiv.org/pdf/1712.01815.pdf

One of the changes they made (bottom of p.3) was to continuously update
the neural net, rather than require a new network to beat it 55% of the
time to be used. (That struck me as strange at the time, when reading
the AlphaGoZero paper - why not just >50%?)

The AlphaZero paper shows it out-performs AlphaGoZero, but they are
comparing to the 20-block, 3-day version. Not the 40-block, 40-day
version that was even stronger.

As papers rarely show failures, can we take it to mean they couldn't
out-perform their best go bot, do you think? If so, I wonder how hard
they tried?

In other words, do you think the changes they made from AlphaGo Zero to
Alpha Zero have made it weaker (when viewed purely from the point of view
of making the strongest possible go program)?

Darren
___
Computer-go mailing list
Computer-go@computer-go.org
http://computer-go.org/mailman/listinfo/computer-go

Re: [Computer-go] Mastering Chess and Shogi by Self-Play with a General Reinforcement Learning Algorithm

2017-12-06 Thread Richard Lorentz
One chess result stood out for me, namely, just how much easier it was 
for AlphaZero to win with white (25 wins, 25 draws, 0 losses) rather 
than with black (3 wins, 47 draws, 0 losses).


Maybe we should not give up on the idea of White to play and win in chess!

On 12/06/2017 01:24 AM, Hiroshi Yamashita wrote:

Hi,

DeepMind makes strongest Chess and Shogi programs with AlphaGo Zero 
method.


Mastering Chess and Shogi by Self-Play with a General Reinforcement 
Learning Algorithm
https://arxiv.org/pdf/1712.01815.pdf



AlphaZero(Chess) outperformed Stockfish after 4 hours,
AlphaZero(Shogi) outperformed elmo after 2 hours.

Search is MCTS.
AlphaZero(Chess) searches 80,000 positions/sec.
Stockfish searches 70,000,000 positions/sec.
AlphaZero(Shogi) searches 40,000 positions/sec.
elmo searches 35,000,000 positions/sec.

Thanks,
Hiroshi Yamashita

___
Computer-go mailing list
Computer-go@computer-go.org
http://computer-go.org/mailman/listinfo/computer-go


___
Computer-go mailing list
Computer-go@computer-go.org
http://computer-go.org/mailman/listinfo/computer-go

Re: [Computer-go] Mastering Chess and Shogi by Self-Play with a General Reinforcement Learning Algorithm

2017-12-06 Thread Joshua Shriver
What about arimaa?

On Wed, Dec 6, 2017 at 9:28 AM, "Ingo Althöfer" <3-hirn-ver...@gmx.de> wrote:
> It seems, we are living in extremely
> heavy times ...
>
> I want to go to bed now and meditate for three days.
>
>> DeepMind makes strongest Chess and Shogi programs with AlphaGo Zero method.
>> Mastering Chess and Shogi by Self-Play with a General Reinforcement Learning 
>> Algorithm
>> https://arxiv.org/pdf/1712.01815.pdf
>>
>> AlphaZero(Chess) outperformed Stockfish after 4 hours,
>> AlphaZero(Shogi) outperformed elmo after 2 hours.
>
> It may sound strange, but at the moment my only hopes for
> games too difficult for AlphaZero might be
>
> * a connection game like Hex (on 19x19 board)
>
> * a game like Clobber (based on CGT)
>
> Mastering Clobber would mean that also the concept of
> combinatorial game theory would be "easily" learnable.
>
>
> Side question: Would the classic Nim game be
> a trivial nut for AlphaZero ?
>
> Ingo (is now starting to hope for an AlphaZero type program
> that can do "general" mathematical research).
> ___
> Computer-go mailing list
> Computer-go@computer-go.org
> http://computer-go.org/mailman/listinfo/computer-go
___
Computer-go mailing list
Computer-go@computer-go.org
http://computer-go.org/mailman/listinfo/computer-go

Re: [Computer-go] Mastering Chess and Shogi by Self-Play with a General Reinforcement Learning Algorithm

2017-12-06 Thread David Wu
Hex:
https://arxiv.org/pdf/1705.08439.pdf

This is not on a 19x19 board, and it was not tested against the current
state of the art (Mohex 1.0 was the state of the art at its time, but is at
least several years old now, I think), but they do get several hundred Elo
points stronger than this old version of Mohex, have training curves that
suggest they still haven't reached the limit of improvement, and are
doing it with orders of magnitude less computation than Google would have
available.

So, I think it is likely that hex is not going to be too difficult for
AlphaZero or a similar architecture.


On Wed, Dec 6, 2017 at 9:28 AM, "Ingo Althöfer" <3-hirn-ver...@gmx.de>
wrote:

> It seems, we are living in extremely
> heavy times ...
>
> I want to go to bed now and meditate for three days.
>
> > DeepMind makes strongest Chess and Shogi programs with AlphaGo Zero
> method.
> > Mastering Chess and Shogi by Self-Play with a General Reinforcement
> Learning Algorithm
> > https://arxiv.org/pdf/1712.01815.pdf
> >
> > AlphaZero(Chess) outperformed Stockfish after 4 hours,
> > AlphaZero(Shogi) outperformed elmo after 2 hours.
>
> It may sound strange, but at the moment my only hopes for
> games too difficult for AlphaZero might be
>
> * a connection game like Hex (on 19x19 board)
>
> * a game like Clobber (based on CGT)
>
> Mastering Clobber would mean that also the concept of
> combinatorial game theory would be "easily" learnable.
>
>
> Side question: Would the classic Nim game be
> a trivial nut for AlphaZero ?
>
> Ingo (is now starting to hope for an AlphaZero type program
> that can do "general" mathematical research).
> ___
> Computer-go mailing list
> Computer-go@computer-go.org
> http://computer-go.org/mailman/listinfo/computer-go
>
___
Computer-go mailing list
Computer-go@computer-go.org
http://computer-go.org/mailman/listinfo/computer-go

Re: [Computer-go] action-value Q for unexpanded nodes

2017-12-06 Thread Aja Huang
2017-12-06 13:52 GMT+00:00 Gian-Carlo Pascutto :

> On 06-12-17 11:47, Aja Huang wrote:
> > All I can say is that first-play-urgency is not a significant
> > technical detail, and that's why we didn't specify it in the paper.
>
> I will have to disagree here. Of course, it's always possible I'm
> misunderstanding something, or I have a program bug that I'm mixing up
> with this.
>

Whether I agree with you or not, unfortunately it's not up to me to
decide whether I can answer the question, even if I am personally happy to
(in fact, this post might already be exceeding my barrier a bit). I hope
you understand, and good luck with making it work.

I'm very happy the two Go papers we published have helped the Go community.
My dream was fulfilled and I've switched to pursue other challenges. :)

Aja


> Or maybe you mean that you expect the program to improve regardless of
> this setting. In any case, I've now seen people state here twice that
> this is detail that doesn't matter. But practical results suggest
> otherwise.
>
> For a strong supervised network, FPU=0 (i.e. not exploring all successor
> nodes for a longer time, relying strongly on policy priors) is much
> stronger. I've seen this in Leela Zero after we tested it, and I've
> known it to be true from regular Leela for a long time. IIRC, the strong
> open source Go bots also use some form of progressive widening, which
> produces the same effect.
>
> For a weak RL network without much useful policy priors, FPU>1 is much
> stronger than FPU=0.
>
> Now these are relative scores of course, so one could argue they don't
> affect the learning process. But they actually do that as well!
>
> The new AZ paper uses MCTS playouts = 800, and plays proportionally
> according to MCTS output. (Previous AGZ had playouts = 1600,
> proportional for first 30 moves).
>
> Consider what this means for the search probability outputs, exactly the
> thing the policy network has to learn. With FPU=1, the move
> probabilities are much more uniform, and the moves played are
> consequentially much more likely to be bad or even blunders, because
> there are less playouts that can be spent on the best move, even if it
> was found.
>
> > The initial value of Q is not very important because Q+U is
> > dominated by the U piece when the number of visits is small.
>
> a = Q(s, a) + coeff * P(s,a) * (sqrt(parent->visits) / (1.0f +
> child->visits()));
>
> Assume parent->visits = 100, sqrt = 10
> Assume child->visits = 0
> Assume P(s, a) = 0.0027 (near uniform prior for "weak" network)
>
> The right most side of this (U term) is ~1. This clearly does not
> dominate the Q term. If Q > 1 (classic FPU) then every child node will
> get expanded. If Q = 0 (Q(s, a) = 0) then the first picked child
> (largest policy prior) will get something like 10 expansions before
> another child gets picked. That's a massive difference in search tree
> shape, *especially* with only 800 total playouts.
>
> --
> GCP
> ___
> Computer-go mailing list
> Computer-go@computer-go.org
> http://computer-go.org/mailman/listinfo/computer-go
>
___
Computer-go mailing list
Computer-go@computer-go.org
http://computer-go.org/mailman/listinfo/computer-go

Re: [Computer-go] Mastering Chess and Shogi by Self-Play with a General Reinforcement Learning Algorithm

2017-12-06 Thread Ingo Althöfer
It seems, we are living in extremely
heavy times ...

I want to go to bed now and meditate for threee days. 
 
> DeepMind makes strongest Chess and Shogi programs with AlphaGo Zero method.
> Mastering Chess and Shogi by Self-Play with a General Reinforcement Learning 
> Algorithm
> https://arxiv.org/pdf/1712.01815.pdf
> 
> AlphaZero(Chess) outperformed Stockfish after 4 hours,
> AlphaZero(Shogi) outperformed elmo after 2 hours.
 
It may sound strange, but at the moment my only hopes for
games too difficult for AlphaZero might be 

* a connection game like Hex (on 19x19 board)

* a game like Clobber (based on CGT)

Mastering Clobber would mean that also the concept of
combinatorial game theory would be "easily" learnable.


Side question: Would the classic Nim game be 
a trivial nut for AlphaZero ?

Ingo (is now starting to hope for an AlphaZero type program
that can do "general" mathematical research).
___
Computer-go mailing list
Computer-go@computer-go.org
http://computer-go.org/mailman/listinfo/computer-go

Re: [Computer-go] action-value Q for unexpanded nodes

2017-12-06 Thread Gian-Carlo Pascutto
On 06-12-17 11:47, Aja Huang wrote:
> All I can say is that first-play-urgency is not a significant 
technical detail, and that's why we didn't specify it in the paper.

I will have to disagree here. Of course, it's always possible I'm
misunderstanding something, or I have a program bug that I'm mixing up
with this.

Or maybe you mean that you expect the program to improve regardless of
this setting. In any case, I've now seen people state here twice that
this is a detail that doesn't matter. But practical results suggest otherwise.

For a strong supervised network, FPU=0 (i.e. not exploring all successor
nodes for a longer time, relying strongly on policy priors) is much
stronger. I've seen this in Leela Zero after we tested it, and I've
known it to be true from regular Leela for a long time. IIRC, the strong
open source Go bots also use some form of progressive widening, which
produces the same effect.

For a weak RL network without much useful policy priors, FPU>1 is much
stronger than FPU=0.

Now these are relative scores of course, so one could argue they don't
affect the learning process. But they actually do that as well!

The new AZ paper uses MCTS playouts = 800, and plays proportionally
according to MCTS output. (Previous AGZ had playouts = 1600,
proportional for first 30 moves).

Consider what this means for the search probability outputs, exactly the
thing the policy network has to learn. With FPU=1, the move
probabilities are much more uniform, and the moves played are
consequently much more likely to be bad or even blunders, because
fewer playouts can be spent on the best move, even if it
was found.

> The initial value of Q is not very important because Q+U is
> dominated by the U piece when the number of visits is small.

a = Q(s, a) + coeff * P(s,a) * (sqrt(parent->visits) / (1.0f +
child->visits()));

Assume parent->visits = 100, sqrt = 10
Assume child->visits = 0
Assume P(s, a) = 0.0027 (near uniform prior for "weak" network)

The rightmost part of this (the U term) is ~1. This clearly does not
dominate the Q term. If Q > 1 (classic FPU) then every child node will
get expanded. If Q = 0 (Q(s, a) = 0) then the first picked child
(largest policy prior) will get something like 10 expansions before
another child gets picked. That's a massive difference in search tree
shape, *especially* with only 800 total playouts.
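
To make the difference in tree shape concrete, here is a small
self-contained sketch of that selection rule with the numbers above plugged
in. The coefficient value is just a placeholder (the papers don't publish
it), so only the comparison between the two FPU settings matters, not the
absolute numbers:

#include <cmath>
#include <cstdio>

// PUCT-style selection score for one child:
//   Q + coeff * P * sqrt(parent_visits) / (1 + child_visits)
// "q" is the child's value estimate; for an unvisited child it falls back
// to the first-play-urgency (FPU) value being discussed here.
double select_score(double q, double prior, int parent_visits, int child_visits,
                    double coeff) {
    double u = coeff * prior * std::sqrt((double)parent_visits) / (1.0 + child_visits);
    return q + u;
}

int main() {
    const int    parent = 100;     // parent->visits, as in the example above
    const double prior  = 0.0027;  // near-uniform prior over ~361 moves
    const double coeff  = 1.0;     // assumed for illustration only

    // Unvisited child (0 visits) under the two FPU settings:
    std::printf("FPU=1: unvisited child scores %.3f\n",
                select_score(1.0, prior, parent, 0, coeff));
    std::printf("FPU=0: unvisited child scores %.3f\n",
                select_score(0.0, prior, parent, 0, coeff));

    // With FPU=1 an unvisited child beats any visited sibling whose Q+U is
    // below ~1, so the search fans out across all successors first.  With
    // FPU=0 an unvisited child competes on its U term alone, so the search
    // stays concentrated on the highest-prior moves.
    return 0;
}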

-- 
GCP
___
Computer-go mailing list
Computer-go@computer-go.org
http://computer-go.org/mailman/listinfo/computer-go

Re: [Computer-go] action-value Q for unexpanded nodes

2017-12-06 Thread Andy
Thanks for letting us know the situation, Aja. It must be hard for an
engineer to not be able to discuss the details of his work!

As for the first-play-urgency value, if we indulge in some reading between
the lines: It's possible to interpret the paper as saying
first-play-urgency is zero. After rereading it myself that's the way I read
it now. But if that is true maybe Aja would have said "guys the paper
already says it is zero." That would imply it's actually some other value.

That is probably reading far too much into Aja's reply, but it's something
to think about.


2017-12-06 4:47 GMT-06:00 Aja Huang :

>
>
> 2017-12-06 9:23 GMT+00:00 Gian-Carlo Pascutto :
>
>> On 03-12-17 17:57, Rémi Coulom wrote:
>> > They have a Q(s,a) term in their node-selection formula, but they
>> > don't tell what value they give to an action that has not yet been
>> > visited. Maybe Aja can tell us.
>>
>> FWIW I already asked Aja this exact question a bit after the paper came
>> out and he told me he cannot answer questions about unpublished details.
>>
>
> Yes, I did ask my manager if I could answer your question but he
> specifically said no. All I can say is that first-play-urgency is not a
> significant technical detail, and what's why we didn't specify it in the
> paper.
>
> Aja
>
>
>
>> This is not very promising regarding reproducibility considering the AZ
>> paper is even lighter on them.
>>
>> Another issue which is up in the air is whether the choice of the number
>> of playouts for the MCTS part represents an implicit balancing between
>> self-play and training speed. This is particularly relevant if the
>> evaluation step is removed. But it's possible even DeepMind doesn't know
>> the answer for sure. They had a setup, and they optimized it. It's not
>> clear which parts generalize.
>>
>> (Usually one wonders about such things in terms of algorithms, but here
>> one wonders about it in terms of hardware!)
>>
>> --
>> GCP
>> ___
>> Computer-go mailing list
>> Computer-go@computer-go.org
>> http://computer-go.org/mailman/listinfo/computer-go
>>
>
>
> ___
> Computer-go mailing list
> Computer-go@computer-go.org
> http://computer-go.org/mailman/listinfo/computer-go
>
___
Computer-go mailing list
Computer-go@computer-go.org
http://computer-go.org/mailman/listinfo/computer-go

Re: [Computer-go] action-value Q for unexpanded nodes

2017-12-06 Thread Aja Huang
2017-12-06 9:23 GMT+00:00 Gian-Carlo Pascutto :

> On 03-12-17 17:57, Rémi Coulom wrote:
> > They have a Q(s,a) term in their node-selection formula, but they
> > don't tell what value they give to an action that has not yet been
> > visited. Maybe Aja can tell us.
>
> FWIW I already asked Aja this exact question a bit after the paper came
> out and he told me he cannot answer questions about unpublished details.
>

Yes, I did ask my manager if I could answer your question but he
specifically said no. All I can say is that first-play-urgency is not a
significant technical detail, and that's why we didn't specify it in the
paper.

Aja



> This is not very promising regarding reproducibility considering the AZ
> paper is even lighter on them.
>
> Another issue which is up in the air is whether the choice of the number
> of playouts for the MCTS part represents an implicit balancing between
> self-play and training speed. This is particularly relevant if the
> evaluation step is removed. But it's possible even DeepMind doesn't know
> the answer for sure. They had a setup, and they optimized it. It's not
> clear which parts generalize.
>
> (Usually one wonders about such things in terms of algorithms, but here
> one wonders about it in terms of hardware!)
>
> --
> GCP
> ___
> Computer-go mailing list
> Computer-go@computer-go.org
> http://computer-go.org/mailman/listinfo/computer-go
>
___
Computer-go mailing list
Computer-go@computer-go.org
http://computer-go.org/mailman/listinfo/computer-go

[Computer-go] AlphaZero

2017-12-06 Thread cazenave

Hi,

It appears AlphaZero surpasses AlphaGo Zero at Go, Stockfish at Chess and
Elmo at Shogi in a few hours of self play...

https://arxiv.org/pdf/1712.01815.pdf

Best,

Tristan.


___
Computer-go mailing list
Computer-go@computer-go.org
http://computer-go.org/mailman/listinfo/computer-go

[Computer-go] Mastering Chess and Shogi by Self-Play with a General Reinforcement Learning Algorithm

2017-12-06 Thread Hiroshi Yamashita

Hi,

DeepMind has made the strongest Chess and Shogi programs with the AlphaGo Zero method.

Mastering Chess and Shogi by Self-Play with a General Reinforcement Learning 
Algorithm
https://arxiv.org/pdf/1712.01815.pdf

AlphaZero(Chess) outperformed Stockfish after 4 hours,
AlphaZero(Shogi) outperformed elmo after 2 hours.

Search is MCTS. 


AlphaZero(Chess) searches 80,000 positions/sec.
Stockfish searches 70,000,000 positions/sec.
AlphaZero(Shogi) searches 40,000 positions/sec.
elmo searches 35,000,000 positions/sec.

Thanks,
Hiroshi Yamashita

___
Computer-go mailing list
Computer-go@computer-go.org
http://computer-go.org/mailman/listinfo/computer-go

[Computer-go] Google did it again this time with Chess and Shogi!

2017-12-06 Thread valkyria

https://arxiv.org/abs/1712.01815


Best
Magnus
___
Computer-go mailing list
Computer-go@computer-go.org
http://computer-go.org/mailman/listinfo/computer-go