Re: [Computer-go] mini-max with Policy and Value network

2017-06-07 Thread Gian-Carlo Pascutto
On 24-05-17 05:33, "Ingo Althöfer" wrote:
>> So, 0.001% probability. Demis commented that Lee Sedol's winning move in
>> game 4 was a one in 10 000 move. This is a 1 in 100 000 move.
> 
> In Summer 2016 I checked the games of AlphaGo vs Lee Sedol
> with repeated runs of CrazyStone DL:
> In 3 of 20 runs the program selected P10. It
> turned out that a rather early "switch" in the search was
> necessary to arrive at P10. But if CS did that it
> remained with this candidate.

I guess it's possible this move is selected by a policy other than the
neural network. Or perhaps the probability can be much higher with a
differently trained policy net.

-- 
GCP
___
Computer-go mailing list
Computer-go@computer-go.org
http://computer-go.org/mailman/listinfo/computer-go

Re: [Computer-go] mini-max with Policy and Value network

2017-06-07 Thread Ingo Althöfer
Hi, just my 2 cents.

"Gian-Carlo Pascutto"  wrote:

 
> In the attached SGF, AlphaGo played P10, which was considered a very
> surprising move by all commentators...
> I can sort-of confirm this:
> 
> 0.295057654 (E13)
> ...(60 more moves follow)...
> 0.11952 (P10)
> 
> So, 0.001% probability. Demis commented that Lee Sedol's winning move in
> game 4 was a one in 10 000 move. This is a 1 in 100 000 move.

In Summer 2016 I checked the games of AlphaGo vs Lee Sedol
with repeated runs of CrazyStone DL:
In 3 of 20 runs the program selected P10. It
turned out that a rather early "switch" in the search was
necessary to arrive at P10. But if CS did that it
remained with this candidate.

Ingo.
___
Computer-go mailing list
Computer-go@computer-go.org
http://computer-go.org/mailman/listinfo/computer-go

Re: [Computer-go] mini-max with Policy and Value network

2017-06-07 Thread Hideki Kato
Alvaro Begue: 

Re: [Computer-go] mini-max with Policy and Value network

2017-05-23 Thread Gian-Carlo Pascutto
On 23-05-17 17:19, Hideki Kato wrote:
> Gian-Carlo Pascutto: <0357614a-98b8-6949-723e-e1a849c75...@sjeng.org>:
> 
>> Now, even the original AlphaGo played moves that surprised human pros
>> and were contrary to established sequences. So where did those come
>> from? Enough computation power to overcome the low probability?
>> Synthesized by inference from the (much larger than mine) policy network?
> 
> Demis Hassabis said in a talk:
> After the game with Sedol, the team used "adversarial learning" in 
> order to fill the holes in policy net (such as the Sedol's winning 
> move in the game 4).

I said, the "original AlphaGo", i.e. the one used in the match against
Lee Sedol. According to the Nature paper, the policy net was trained
with supervised learning only [1]. And yet...

In the attached SGF, AlphaGo played P10, which was considered a very
surprising move by all commentators. Presumably, this means it's not
seen in high level human play, and would not get a high rating in the
policy net. I can sort-of confirm this:

0.295057654 (E13)
...(60 more moves follow)...
0.11952 (P10)

So, 0.001% probability. Demis commented that Lee Sedol's winning move in
game 4 was a one in 10 000 move. This is a 1 in 100 000 move.
Differently trained policy nets might rate it a bit higher or lower, but
simply because it was considered a very un-human move to play, it seems
unlikely to ever be rated highly by a policy net based on supervised
learning.

So in AlphaGo's formula, you're dealing with a reduction of the UCT term
by a factor of 100 000, plus or minus an order of magnitude.
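
To put a number on that factor, here is a quick Python illustration of my
own, using the two priors quoted above and treating the 0.001% figure as
roughly 1e-5 (everything else held equal):

# The policy prior enters the exploration term as a direct factor, so two
# otherwise-identical unvisited moves get exploration bonuses in the ratio
# of their priors.

top_prior = 0.295   # E13, the policy net's top choice
p10_prior = 1e-5    # P10, roughly "1 in 100 000"

print(f"exploration term ratio: {top_prior / p10_prior:.0f}x")   # ~30000x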

  D6 -> 1359934 (W: 53.21%) (U: 49.34%) (V: 55.15%:  38918) (N:  6.3%)
        PV: D6 F6 E7 F7 C8 B8 D7 B7 E9 C9 F8 H7 H9 K7 H3 K9
...many moves...
 P10 ->     421 (W: 52.68%) (U: 50.09%) (V: 53.98%:      8) (N:  0.0%)
        PV: P10 Q10 P8 Q9

Now, of course AlphaGo had a few orders of magnitude more hardware, but
you can see from the above that it's, eh, not easy for P10 to overtake
the top moves here in playout count.

And yet, that's the move that was played.

[1] I'm assuming that what played the match corresponds to what they
published there - maybe that is my mistake. I'm not sure I remember the
relevant timeline correctly.

-- 
GCP


sedol.sgf
Description: application/go-sgf
___
Computer-go mailing list
Computer-go@computer-go.org
http://computer-go.org/mailman/listinfo/computer-go

Re: [Computer-go] mini-max with Policy and Value network

2017-05-23 Thread valkyria



(3) CNN cannot learn the exclusive-or function due to the ReLU
activation function, used instead of the traditional sigmoid (tangent
hyperbolic).  CNN is good at approximating continuous (analog)
functions but not Boolean (digital) ones.


Are you sure about that? I can imagine using two ReLU units to
construct a sigmoid-like step function, so I'd think a multi-layer net
should be fine (just like with ordinary perceptrons).


No, this is incorrect. A perceptron (a single-layer neural network) 
cannot do XOR.
The whole point of 2+ layer networks was to overcome this basic 
weakness. A two-layer network with an infinite number of neurons in the 
layers can approximate any function.


But early on it turned out that learning was unstable and/or extremely 
slow for multilayer networks, so the theoretical capacity was not 
practical.


Now with deep learning we know that with correct training, a lot of data 
and hardware (or patience) neural networks can learn almost anything.


It is probably correct that smooth functions are easier to approximate 
with a neural network than high-dimensional, non-continuous functions.


I am training my networks on a single CPU thread so I have the benefit 
of following the learning process of NNOdin slowly. I have seen a lot of 
problems with the network, but after some weeks of training they go away. 
It is interesting to see how its playing style changes. For a while it 
would rigidly play very local shapes, but now it seems to start to take 
life and death of large groups into account. Or maybe it lets the MC 
playouts have more impact on the decisions made, by searching more 
effectively. Some weeks ago it would barely win against gnugo, and it 
won by just playing standard shapes until it got lucky. In the last 
couple of days it seems to surround and cut off gnugo's groups and kill 
them outright, as a strong player would.


So what do I want to say? So far I have learned that the policy network will 
blindly play whatever shapes it finds good and ignore most alternative 
moves. So there is indeed a huge problem of "holes" in the policy 
function. But for Odin, at least, I do not know which holes will be a 
problem as the network matures with more learning. My plan is then to 
fix the holes by making the MC evaluation strong.


Best
Magnus
___
Computer-go mailing list
Computer-go@computer-go.org
http://computer-go.org/mailman/listinfo/computer-go

Re: [Computer-go] mini-max with Policy and Value network

2017-05-23 Thread Álvaro Begué
On Tue, May 23, 2017 at 4:51 AM, Hideki Kato  wrote:

> (3) CNN cannot learn the exclusive-or function due to the ReLU
> activation function, used instead of the traditional sigmoid (tangent
> hyperbolic).  CNN is good at approximating continuous (analog)
> functions but not Boolean (digital) ones.
>

Oh, not this nonsense with the XOR function again.

You can see a neural network with ReLU activation function learning XOR
right here: http://playground.tensorflow.org/#activation=relu;
batchSize=10=xor=reg-plane=0.01&
regularizationRate=0=0=4,4=0.96791&
showTestData=false=false=50=true&
y=true=false=false=false=
false=false=false=false=false&
problem=classification=false=false
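
And for a pencil-and-paper check (my own addition, not part of the playground
link): a single hidden layer of two ReLU units already represents XOR exactly
on {0,1} inputs.

def relu(x):
    return max(0.0, x)

def xor_net(x, y):
    # Hidden layer: two ReLU units.  Output: a plain linear combination.
    h1 = relu(x + y)          # fires on (0,1), (1,0), (1,1)
    h2 = relu(x + y - 1.0)    # fires only on (1,1)
    return h1 - 2.0 * h2

for x in (0, 1):
    for y in (0, 1):
        print(x, y, xor_net(x, y))   # prints 0, 1, 1, 0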

Enjoy,
Álvaro.
___
Computer-go mailing list
Computer-go@computer-go.org
http://computer-go.org/mailman/listinfo/computer-go

Re: [Computer-go] mini-max with Policy and Value network

2017-05-23 Thread Hideki Kato
Gian-Carlo Pascutto: :
>On 23-05-17 10:51, Hideki Kato wrote:
>> (2) The number of possible positions (input of the value net) in 
>> real games is at least 10^30 (10^170 in theory).  Can the value 
>> net recognize them all?  L&D depends on very small differences in 
>> the placement of stones or liberties.  Can we provide the necessary 
>> amount of training data?  Does the network have enough capacity?  
>> The answer is almost obvious from the theory of function 
>> approximation.  (An ANN is just a non-linear function 
>> approximator.)
>
>DCNN clearly have some ability to generalize from learned data and
>perform OK even with unseen examples. So I don't find this a very
>compelling argument. It's not like Monte Carlo playouts are going to
>handle all sequences correctly either.

CNN can generalize if global shapes can be built from smaller 
local shapes.  L&D of a large group is an exception because it's 
too sensitive to the details of the position (i.e., it can be very 
global).  We can't expect much from such generalization for 
L&D.

In our experiments, the value net thinks a group is alive if it has 
a large enough space.  That's all.
#Actually, it's the opposite.  The value net thinks a group is dead 
if and only if it is short of liberties.  Some nakade shapes can be 
solved if the outside liberties are almost filled.

Additionally, the value net frequently treats false eyes as real, 
especially on the first line.  (This problem can also be very 
global and very hard to solve without search.)

The value net itself cannot manage L&D correctly, but it allows so 
much deeper search that this problem is hidden (i.e., hard to detect).

>Evaluations are heuristic guidance for the search, and a help when the
>search terminates in an unresolved position. Having multiple independent
>ones improves the accuracy of the heuristic - a basic ensemble.

The value net approximates the "true" value function of Go very 
coarsely.  Rollouts (MC simulations) fill in the details.  This could 
be the best kind of ensemble.

>> (3) CNN cannot learn the exclusive-or function due to the ReLU 
>> activation function, used instead of the traditional sigmoid (tangent 
>> hyperbolic).  CNN is good at approximating continuous (analog) 
>> functions but not Boolean (digital) ones.
>
>Are you sure this is correct? Especially if we allow leaky ReLU?

Do you know that the success of "deep" CNNs comes from the use of 
ReLU?  Sigmoid easily causes vanishing gradients while ReLU does not.  
However, ReLU cannot represent sharp edges while sigmoid can.  DCNN (with 
ReLU) approximates functions in a piecewise-linear style.

Hideki
-- 
Hideki Kato 
___
Computer-go mailing list
Computer-go@computer-go.org
http://computer-go.org/mailman/listinfo/computer-go

Re: [Computer-go] mini-max with Policy and Value network

2017-05-23 Thread Hideki Kato
Gian-Carlo Pascutto: <0357614a-98b8-6949-723e-e1a849c75...@sjeng.org>:

>Now, even the original AlphaGo played moves that surprised human pros
>and were contrary to established sequences. So where did those come
>from? Enough computation power to overcome the low probability?
>Synthesized by inference from the (much larger than mine) policy network?

Demis Hassabis said in a talk:
After the game with Sedol, the team used "adversarial learning" in 
order to fill the holes in the policy net (such as Sedol's winning 
move in game 4).

Hideki

-- 
Hideki Kato 
___
Computer-go mailing list
Computer-go@computer-go.org
http://computer-go.org/mailman/listinfo/computer-go

Re: [Computer-go] mini-max with Policy and Value network

2017-05-23 Thread Hideki Kato
Erik van der Werf:
>On Tue, May 23, 2017 at 10:51 AM, Hideki Kato 
>wrote:
>
>> Agree.
>>
>> (1) To solve L&D, some search is necessary in practice.  So, the
>> value net cannot solve some of them.
>> (2) The number of possible positions (input of the value net) in
>> real games is at least 10^30 (10^170 in theory).  Can the value
>> net recognize them all?  L&D depends on very small differences in
>> the placement of stones or liberties.  Can we provide the necessary
>> amount of training data?  Does the network have enough capacity?
>> The answer is almost obvious from the theory of function
>> approximation.  (An ANN is just a non-linear function
>> approximator.)
>>
>
>A similar argument can be made for natural neural nets, but we know humans
>are able to come up with reasonable solutions. I suppose a pure neural net
>approach would require some form of recursion, but when combined with a
>search, and rolling out the decision process to some sufficiently high
>number of max steps, apparently it's not that important.. Also, I suspect
>that nearly all positions can only be reached in real games by inferior
>moves from both sides. All that may be needed is some crude means to steer
>away from chaos (and even if one would start in chaos, humans probably
>wouldn't do well either).

My argument is about "stand-alone" DCNNs.  Adding some (top-down?) 
control to DCNNs could solve this (like the human brain).  #I'm not 
sure about recurrence, but it may be necessary.

>> (3) CNN cannot learn the exclusive-or function due to the ReLU
>> activation function, used instead of the traditional sigmoid (tangent
>> hyperbolic).  CNN is good at approximating continuous (analog)
>> functions but not Boolean (digital) ones.
>>
>
>
>Are you sure about that? I can imagine using two ReLU units to construct a
>sigmoid-like step function, so I'd think a multi-layer net should be fine
>(just like with ordinary perceptrons).

Even when using many layers, it's hard to represent sharp edges by 
combining ReLUs.  (Not impossible, but the chances are small, probably 
due to so many local traps.)

Best, Hideki

>Best,
>Erik
> inline file
>___
>Computer-go mailing list
>Computer-go@computer-go.org
>http://computer-go.org/mailman/listinfo/computer-go
-- 
Hideki Kato 
___
Computer-go mailing list
Computer-go@computer-go.org
http://computer-go.org/mailman/listinfo/computer-go

Re: [Computer-go] mini-max with Policy and Value network

2017-05-23 Thread Gian-Carlo Pascutto
On 23-05-17 10:51, Hideki Kato wrote:
> (2) The number of possible positions (input of the value net) in 
> real games is at least 10^30 (10^170 in theory).  Can the value 
> net recognize them all?  L&D depends on very small differences in 
> the placement of stones or liberties.  Can we provide the necessary 
> amount of training data?  Does the network have enough capacity?  
> The answer is almost obvious from the theory of function 
> approximation.  (An ANN is just a non-linear function 
> approximator.)

DCNN clearly have some ability to generalize from learned data and
perform OK even with unseen examples. So I don't find this a very
compelling argument. It's not like Monte Carlo playouts are going to
handle all sequences correctly either.

Evaluations are heuristic guidance for the search, and a help when the
search terminates in an unresolved position. Having multiple independent
ones improves the accuracy of the heuristic - a basic ensemble.
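
A minimal sketch of what that ensemble looks like at a leaf (patterned on the
mixing parameter from the 2016 AlphaGo paper; the function name and the example
numbers are mine):

def leaf_evaluation(value_net_winrate, rollout_winrate, mix=0.5):
    # Combine two independent evaluations of the same leaf.  mix=0.5 matches
    # the equal weighting reported in the 2016 paper; in practice the weight
    # is tuned for maximum playing strength.
    return (1.0 - mix) * value_net_winrate + mix * rollout_winrate

# Example: the value net likes the position, the (noisier) rollouts do not.
print(leaf_evaluation(0.62, 0.48))   # -> 0.55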

> (3) CNN cannot learn the exclusive-or function due to the ReLU 
> activation function, used instead of the traditional sigmoid (tangent 
> hyperbolic).  CNN is good at approximating continuous (analog) 
> functions but not Boolean (digital) ones.

Are you sure this is correct? Especially if we allow leaky ReLU?

-- 
GCP
___
Computer-go mailing list
Computer-go@computer-go.org
http://computer-go.org/mailman/listinfo/computer-go

Re: [Computer-go] mini-max with Policy and Value network

2017-05-23 Thread Gian-Carlo Pascutto
On 22-05-17 21:01, Marc Landgraf wrote:
> But what you should really look at here is Leela's evaluation of the game.

Note that this is completely irrelevant for the discussion about
tactical holes and the position I posted. You could literally plug any
evaluation into it (save for a static oracle, in which case why search
at all...) and it would still have the tactical blindness being discussed.

It's an issue of limitations of the policy network, combined with the
way one uses the UCT formula. I'll use the one from the original AlphaGo
paper here, because it's public and should behave even worse:

u(s, a) = c_puct * P(s, a) * sqrt(total_visits / (1 + child_visits))

Note that P(s, a) is a direct factor here, which means that for a move
ignored by the policy network, the UCT term will almost vanish. In other
words, unless the win is immediately visible (and for tactics it won't be),
you're not going to find it. Also note that this is a deviation from
regular UCT or PUCT, which do not have such a direct term and hence only
have a disappearing prior, making the search eventually more exploratory.
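
To see the effect in code, here is a toy selection score using the formula
exactly as written above (my own sketch with made-up numbers; c_puct, the win
rates and the visit counts are placeholders, not Leela's or AlphaGo's actual
values):

import math

c_puct = 1.0  # placeholder; only the relative sizes matter here

def puct_score(q, prior, total_visits, child_visits):
    # The prior is a direct multiplicative factor in the exploration term.
    u = c_puct * prior * math.sqrt(total_visits / (1 + child_visits))
    return q + u

total = 100_000
well_rated   = puct_score(q=0.50, prior=0.30, total_visits=total, child_visits=50_000)
policy_blind = puct_score(q=0.55, prior=1e-5, total_visits=total, child_visits=10)

print(well_rated, policy_blind)
# Even with the better win rate, the near-zero prior leaves the second move's
# exploration bonus microscopic, so it keeps losing the selection race.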

Now, even the original AlphaGo played moves that surprised human pros
and were contrary to established sequences. So where did those come
from? Enough computation power to overcome the low probability?
Synthesized by inference from the (much larger than mine) policy network?

-- 
GCP
___
Computer-go mailing list
Computer-go@computer-go.org
http://computer-go.org/mailman/listinfo/computer-go

Re: [Computer-go] mini-max with Policy and Value network

2017-05-23 Thread Erik van der Werf
On Mon, May 22, 2017 at 4:54 PM, Gian-Carlo Pascutto  wrote:

> On 22-05-17 15:46, Erik van der Werf wrote:
> > Anyway, LMR seems like a good idea, but last time I tried it (in Migos)
> > it did not help. In Magog I had some good results with fractional depth
> > reductions (like in Realization Probability Search), but it's a long
> > time ago and the engines were much weaker then...
>
> What was generating your probabilities, though? A strong policy DCNN or
> something weaker?
>

Nothing deep. Back then for the move predictor I don't think I ever tried
more than two hidden layers (and it was only used near the root of the
search tree). RPS was even simpler (so it could be used with fast deep
searches). In hindsight I traded way too much accuracy for speed, but
coming from a standard AlphaBeta framework it still was a big improvement.


ERPS (LMR with fractional reductions based on move probabilities) with
> alpha-beta seems very similar to having MCTS with the policy prior being
> a factor in the UCT formula.


In the sense of the shape of the tree, possibly yes, but I have the
impression that AlphaBeta and similar search algorithms are more brittle
when working with high-variance (noisy) evaluations. In chess-like games it
may be less of an issue due to the implicit mobility feature that it adds,
but for Go mobility seems to be mostly irrelevant. The MCTS backup
(averaging evaluations) seems to reduce the variance much better than a
minimax backup.
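
A toy simulation of that point (my own illustration, nothing from any engine):
back up 20 noisy evaluations of children that all have the same true value,
once by max and once by mean, and compare the spread.

import random
import statistics

random.seed(0)
TRUE_VALUE, NOISE, CHILDREN, TRIALS = 0.5, 0.1, 20, 10_000

max_backups, mean_backups = [], []
for _ in range(TRIALS):
    leaves = [random.gauss(TRUE_VALUE, NOISE) for _ in range(CHILDREN)]
    max_backups.append(max(leaves))               # minimax-style backup
    mean_backups.append(sum(leaves) / CHILDREN)   # MCTS-style backup

print("max  backup: mean %.3f  stdev %.3f" %
      (statistics.mean(max_backups), statistics.stdev(max_backups)))
print("mean backup: mean %.3f  stdev %.3f" %
      (statistics.mean(mean_backups), statistics.stdev(mean_backups)))
# The max backup comes out biased upward (toward ~0.69) and noisier, which is
# one way to see why high-variance evaluations hurt a minimax backup more.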

Using a value net instead of raw Monte Carlo evaluation also reduces
variance (a lot), so trying out AlphaBeta with DCNN evaluations definitely
seems like an interesting experiment.



> This is what AlphaGo did according to their
> 2015 paper, so it can't be terrible, but it does mean that you are 100%
> blind to something the policy network doesn't see, which seems
> worrisome. I think I asked Aja once about what they do with first play
> urgency given that the paper doesn't address it - he politely ignored
> the question :-)
>

I don't think anyone has had good results with high FPU; it seems in Go we
simply cannot afford a very wide search (except perhaps near the root or on
the PV). I'm not sure if they still use a UCB term (which would ensure
some exploration of unlikely candidates). I think at some point David and
others argued against it, but in my own experiments it was always helpful,
and I think Aja may have found the same in Erica. Nevertheless, even
without it I think an argument can be made that the minimax result can
eventually be found.
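
For readers who haven't met the term: this is roughly where first play urgency
enters the selection step (a sketch of my own; the names, the 1.1/0.0 values
and the exploration term are placeholders, not any particular engine's
settings).

import math
from dataclasses import dataclass

@dataclass
class Child:
    prior: float
    visits: int = 0
    wins: float = 0.0

def selection_score(child, parent_visits, fpu=1.1, c_puct=1.0):
    # FPU is the value assumed for a child that has never been visited.
    # High FPU -> every child gets tried at least once (wide search);
    # low FPU -> unvisited children wait for their prior-driven turn (deep).
    q = fpu if child.visits == 0 else child.wins / child.visits
    u = c_puct * child.prior * math.sqrt(parent_visits) / (1 + child.visits)
    return q + u

# A well-visited favourite vs an unvisited long shot, under both regimes.
favourite = Child(prior=0.4, visits=1000, wins=530.0)
long_shot = Child(prior=0.001)
for fpu in (1.1, 0.0):
    print(fpu, [selection_score(ch, 1001, fpu) for ch in (favourite, long_shot)])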

I have an idea on what's causing the problems in Leela (and how you could
fix it), but I'll hold off on further commenting until I have some more time
to look at the examples.

Best,
Erik
___
Computer-go mailing list
Computer-go@computer-go.org
http://computer-go.org/mailman/listinfo/computer-go

Re: [Computer-go] mini-max with Policy and Value network

2017-05-23 Thread Gian-Carlo Pascutto
On 23-05-17 03:39, David Wu wrote:
> Leela playouts are definitely extremely bad compared to competitors like
> Crazystone. The deep-learning version of Crazystone has no value net as
> far as I know, only a policy net, which means it's going on MC playouts
> alone to produce its evaluations. Nonetheless, its playouts often have
> noticeable and usually correct opinions about early midgame game
> positions (as confirmed by the combination of own judgment as a dan
> player and Leela's value net). Which I find amazing - that it can even
> approximately get these right.

Leela's Monte Carlo playouts were designed and implemented in 2007,
before most of the current literature around them was public. Back then,
they were very "thick" and good enough to make the program one of the
strongest around. Needless to say, in the ~9 years or so when I was
absent from go programming, others made substantial progress in that
area, especially as before value nets this was clearly one of the most
important components of strength. Leela's Monte Carlo playouts for sure
are weaker than those of Crazy Stone and Zen, and even pachi. I have
done work on this in the last year, but a more complete overhaul isn't
in 0.10.0 yet.

Nevertheless (as you also observe below) they still contribute a benefit
to the strength of the engine. That's why I've been consistently saying
dropping them doesn't seem to be good, and why I like the orthogonality
they provide with the value net (and am generally wary of methods that
tune the playouts with or towards the value net).

> So clearly what's going on is that the playouts allow suicide,

I'll need to reconstruct the position you set up, but this is something
that shouldn't happen. Thank you for pointing it out, I'll try to
confirm on my side.

> Now I'm just speculating. My guess is that somehow 3% of the time, the
> game is scored without black having captured white's group. As in -
> black passes, white passes, white's dead group is still on the board, so
> white wins. The guess would be that liberties and putting it in atari
> increases the likelihood that the playouts kill the group before having
> both players pass and score. But that's just a guess, maybe there's also
> more black magic involving adjusting the "value" of a win depending on
> unknown factors beyond just having a "big win". Would need Gian-Carlo to
> actually confirm or refute this guess though.

Leela allows passes with a very low probability, so your analysis is
probably right.

> given that they're a significant weight in the evaluation
> alongside the value net, they're probably one of the major things
> holding Leela back at this point.

I assume that as well, which is why I've been doing some work on them,
but I'm also prepared to be disappointed. Note that I didn't put the
significant weighting arbitrarily: it's set to what gave the maximum
playing strength.

I suspect that when there are multiple options that seem objectively
equally good (from the value net), the playouts also help play towards
the option where it is harder to mess up. In this case, a larger amount
of stochasticity is not a bad thing.

-- 
GCP
___
Computer-go mailing list
Computer-go@computer-go.org
http://computer-go.org/mailman/listinfo/computer-go

Re: [Computer-go] mini-max with Policy and Value network

2017-05-23 Thread Hideki Kato
Agree.

(1) To solve L&D, some search is necessary in practice.  So, the 
value net cannot solve some of them. 
(2) The number of possible positions (input of the value net) in 
real games is at least 10^30 (10^170 in theory).  Can the value 
net recognize them all?  L&D depends on very small differences in 
the placement of stones or liberties.  Can we provide the necessary 
amount of training data?  Does the network have enough capacity?  
The answer is almost obvious from the theory of function 
approximation.  (An ANN is just a non-linear function 
approximator.)
(3) CNN cannot learn the exclusive-or function due to the ReLU 
activation function, used instead of the traditional sigmoid (tangent 
hyperbolic).  CNN is good at approximating continuous (analog) 
functions but not Boolean (digital) ones.

Hideki

David Wu: :
>Addendum:
>
>Some additional playing around with the same position can flip the roles of
>the playouts and value net - so now the value net is very wrong and the
>playouts are mostly right. I think this gives good insight into what the
>value net is doing and why as a general matter playouts are still useful.
>
>Here's how:
>Play black moves as detailed in the previous email in the "leela10.sgf"
>game that Marc posted to resolve all the misunderstandings of the playouts
>and get it into the "3-10% white win" phase, but otherwise leave white's
>dead group on-board with tons of liberties. Let white have an absolute
>feast on the rest of the board while black simply connects his stones
>solidly. White gets to cut through Q6 and get pretty much every point
>available.
>
>Black is still winning even though he loses almost the entire rest of the
>board, as long as the middle white group dies. But with some fiddling
>around, you can arrive at a position where the value net is reporting 90%
>white win (wrong), while the playouts are rightly reporting only 3-10%
>white win.
>
>Intuitively, the value net only fuzzily evaluates white's group as probably
>dead, but isn't sure that it's dead, so intuitively it counts some value
>for white's group "in expectation" for the small chance it lives. And the
>score is otherwise not too far off on the rest of the board - the way I
>played it out, black wins by only ~5 points if white dies. So the small
>uncertainty that the huge white group might actually be alive produces
>enough "expected value" for white to overwhelm the 5 point loss margin,
>such that the value net is 90% sure that white wins.
>
>What the value net has failed to "understand" here is that white's group
>surviving is a binary event. I.e. a 20% chance of the group being alive and
>white winning by 80 points along with an 80% chance that it's dead and white
>losing by 5 points does not average out to white being (0.2 * 80) - (0.8 *
>5) = 12 points ahead overall (although probably the value net doesn't
>exactly "think" in terms of points but rather something fuzzier). The
>playouts provide the much-needed "understanding" that win/loss is binary
>and that the expectation operator should be applied after mapping to
>win/loss outcomes, rather than before.
>
>It seems intuitive to me that a neural net would compute things in too much
>of a fuzzy and averaged way and thereby be vulnerable to this mistake. I
>wonder if it's possible to train a value net to get these things more
>correct without weakening it otherwise, with the right training. As it is,
>I suspect this is a systematic flaw in the value net's ability to produce
>good probabilities of winning in games where the game hinges on the life
>and death chances of a single large dragon, and where the expected score
>could be wildly uncorrelated with the probability of winning.
>
>
>On Mon, May 22, 2017 at 9:39 PM, David Wu  wrote:
>
>> Leela playouts are definitely extremely bad compared to competitors like
>> Crazystone. The deep-learning version of Crazystone has no value net as 
>far
>> as I know, only a policy net, which means it's going on MC playouts alone
>> to produce its evaluations. Nonetheless, its playouts often have 
>noticeable
>> and usually correct opinions about early midgame game positions (as
>> confirmed by the combination of own judgment as a dan player and Leela's
>> value net). Which I find amazing - that it can even approximately get 
>these
>> right.
>>
>> On to the game:
>>
>> Analyzing with Leela 0.10.0 in that second game, I think I can infer
>> pretty exactly what the playouts are getting wrong. Indeed the upper left
>> is being disastrously misplayed by them, but that's not all. -  I'm 
>finding
>> that multiple different things are all being played out wrong. All of the
>> following numbers are on white's turn, to give white the maximal chance to
>> distract black from resolving the tactics in the tree search and forcing
>> the playouts to do the work - the numbers are sometimes better if it's
>> black's turn.
>>
>> * At move 186, playouts as it stands 

Re: [Computer-go] mini-max with Policy and Value network

2017-05-22 Thread David Wu
Addendum:

Some additional playing around with the same position can flip the roles of
the playouts and value net - so now the value net is very wrong and the
playouts are mostly right. I think this gives good insight into what the
value net is doing and why as a general matter playouts are still useful.

Here's how:
Play black moves as detailed in the previous email in the "leela10.sgf"
game that Marc posted to resolve all the misunderstandings of the playouts
and get it into the "3-10% white win" phase, but otherwise leave white's
dead group on-board with tons of liberties. Let white have an absolute
feast on the rest of the board while black simply connects his stones
solidly. White gets to cut through Q6 and get pretty much every point
available.

Black is still winning even though he loses almost the entire rest of the
board, as long as the middle white group dies. But with some fiddling
around, you can arrive at a position where the value net is reporting 90%
white win (wrong), while the playouts are rightly reporting only 3-10%
white win.

Intuitively, the value net only fuzzily evaluates white's group as probably
dead, but isn't sure that it's dead, so intuitively it counts some value
for white's group "in expectation" for the small chance it lives. And the
score is otherwise not too far off on the rest of the board - the way I
played it out, black wins by only ~5 points if white dies. So the small
uncertainty that the huge white group might actually be alive produces
enough "expected value" for white to overwhelm the 5 point loss margin,
such that the value net is 90% sure that white wins.

What the value net has failed to "understand" here is that white's group
surviving is a binary event. I.e. a 20% chance of the group being alive and
white winning by 80 points along with an 80% chance that it's dead and white
losing by 5 points does not average out to white being (0.2 * 80) - (0.8 *
5) = 12 points ahead overall (although probably the value net doesn't
exactly "think" in terms of points but rather something fuzzier). The
playouts provide the much-needed "understanding" that win/loss is binary
and that the expectation operator should be applied after mapping to
win/loss outcomes, rather than before.
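
In code form, with the same made-up 20%/80% split as above:

p_alive = 0.2            # chance white's big group lives
score_if_alive = +80     # white wins by 80
score_if_dead  = -5      # white loses by 5

# Averaging scores first: white looks comfortably ahead.
expected_score = p_alive * score_if_alive + (1 - p_alive) * score_if_dead
print(expected_score)    # ~12 points "for white"

# Mapping each outcome to win/loss first, then averaging: white is losing.
def win(score):
    return 1.0 if score > 0 else 0.0

expected_winrate = p_alive * win(score_if_alive) + (1 - p_alive) * win(score_if_dead)
print(expected_winrate)  # 0.2 -> white actually wins only 20% of the time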

It seems intuitive to me that a neural net would compute things in too much
of a fuzzy and averaged way and thereby be vulnerable to this mistake. I
wonder if it's possible to train a value net to get these things more
correct without weakening it otherwise, with the right training. As it is,
I suspect this is a systematic flaw in the value net's ability to produce
good probabilities of winning in games where the game hinges on the life
and death chances of a single large dragon, and where the expected score
could be wildly uncorrelated with the probability of winning.


On Mon, May 22, 2017 at 9:39 PM, David Wu  wrote:

> Leela playouts are definitely extremely bad compared to competitors like
> Crazystone. The deep-learning version of Crazystone has no value net as far
> as I know, only a policy net, which means it's going on MC playouts alone
> to produce its evaluations. Nonetheless, its playouts often have noticeable
> and usually correct opinions about early midgame game positions (as
> confirmed by the combination of own judgment as a dan player and Leela's
> value net). Which I find amazing - that it can even approximately get these
> right.
>
> On to the game:
>
> Analyzing with Leela 0.10.0 in that second game, I think I can infer
> pretty exactly what the playouts are getting wrong. Indeed the upper left
> is being disastrously misplayed by them, but that's not all. -  I'm finding
> that multiple different things are all being played out wrong. All of the
> following numbers are on white's turn, to give white the maximal chance to
> distract black from resolving the tactics in the tree search and forcing
> the playouts to do the work - the numbers are sometimes better if it's
> black's turn.
>
> * At move 186, playouts as it stands show about on the order of 60% for
> white, despite black absolutely having won the game. I partly played out
> the rest of the board in a very slow and solid way just in case it was
> confusing things, but not entirely, so that the tree search would still
> have plenty of endgame moves to be distracted by. Playouts stayed at 60%
> for white.
>
> * Add bA15, putting white down to 2 liberties: the PV shows an exchange of
> wA17 and bA14, keeping white at 2 liberties, and it drops to 50%.
> * Add bA16, putting white in atari: it drops to 40%.
>
> So clearly there's some funny business with black in the playouts
> self-atariing or something, and the chance that black does this lessens as
> white has fewer liberties and therefore is more likely to die first. Going
> further:
>
> * Add bE14, it drops to 30%.
>
> So the potential damezumari is another problem - I guess like 10% of the
> playouts are letting white kill the big black lump at 

Re: [Computer-go] mini-max with Policy and Value network

2017-05-22 Thread Marc Landgraf
And talking about tactical mistakes:
Another game, where a trick joseki early in the game (top right) completely
fools Leela. Leela here plays this like it would be done in similar shapes,
but then gets completely blindsided. But to make things worse, it finds the
one way to make the loss the biggest. (Note: this is not reliable when
trying this trick joseki; Leela will often lose the 4/5 stones on the left,
but will at least take the one stone on top in sente instead of screwing up
like it did here.) Generally this "trick" is not that deep reading-wise, but
given its similarity to more common shapes I can understand how the bot
falls for it.
Anyway, Leela manages to fully stabilize the game (given our general
difference in strength, this should come as no surprise), just to throw
away the center group.

But what you should really look at here is Leela's evaluation of the game.

Even very late in the game, the MC part of Leela considers Leela well
ahead, completely misreading the L+D here. Usually, in the games Leela
loses to me, the issue is the other way around: Leela's NN strongly
believes the game is won, while the MC part notices the real trouble.
But not here. Now of course this kind of misjudgement could also serve as
an explanation of how this group could die in the first place.

But having had my own MC bot, I really wonder how it could misevaluate so
badly here. To really lose this game as Black requires either substantial
self-ataris by Black or a large unanswered self-atari by White.
Does Leela have such light playouts that those groups can really flip
status in 60%+ of the MC evaluations?

2017-05-22 20:46 GMT+02:00 Marc Landgraf :

> Leela has surprisingly large tactical holes. Right now it is throwing a
> good number of games against me in completely won endgames by fumbling away
> entirely alive groups.
>
> As an example I attached one game of myself (3d), even vs Leela10 @7d. But
> this really isn't a onetime occurence.
>
> If you look around move 150, the game is completely over by human
> standards as well as Leelas evaluation (Leela will give itself >80% here)
> But then Leela starts doing weird things.
> 186 is a minor mistake, but itself does not yet throw the game. But it is
> the start of series of bad turns.
> 236 then is a non-threat in a Ko fight, and checking Leelas evaluation,
> Leela doesn't even consider the possibility of it being ignored. This is
> btw a common topic with Leela in ko fights - it does not look at all at
> what happens if the Ko threat is ignored.
> 238 follows up the "ko threat", but this move isn't doing anything either!
> So Leela passed twice now.
> Suddenly there is some Ko appearing at the top right.
> Leela plays this Ko fight in some suboptimal way, not fully utilizing
> local ko threats, but this is a concept rather difficult to grasp for AIs
> afaik.
> I can not 100% judge whether ignoring the black threat of 253 is correct
> for Leela, I have some doubts on this one too.
> With 253 ignored, the game is now heavily swinging, but to my judgement,
> playing the hane instead of 256 would still keep it rather close and I'm
> not 100% sure who would win it now. But Leela decides to completely bury
> itself here with 256, while giving itself still 70% to win.
> As slowly realization of the real game state kicks in, the rest of the
> game is then the usual MC-throw away style we have known for years.
>
> Still... in this game you can see how a series of massive tactical
> blunders leads to throwing a completely won game. And this is just one of
> many examples. And it can not be all pinned on Ko's. I have seen a fair
> number of games where Leela does similar mistakes without Ko involved, even
> though Ko's drastically increase Leelas fumble chance.
> At the same time, Leela is completely and utterly outplaying me on a
> strategical level and whenever it manages to not make screwups like the
> ones shown I stand no chance at all. Even 3 stones is a serious challenge
> for me then. But those mistakes are common enough to keep me around even.
>
> 2017-05-22 17:47 GMT+02:00 Erik van der Werf :
>
>> On Mon, May 22, 2017 at 3:56 PM, Gian-Carlo Pascutto 
>> wrote:
>>
>>> On 22-05-17 11:27, Erik van der Werf wrote:
>>> > On Mon, May 22, 2017 at 10:08 AM, Gian-Carlo Pascutto >> > > wrote:
>>> >
>>> > ... This heavy pruning
>>> > by the policy network OTOH seems to be an issue for me. My program
>>> has
>>> > big tactical holes.
>>> >
>>> >
>>> > Do you do any hard pruning? My engines (Steenvreter,Magog) always had a
>>> > move predictor (a.k.a. policy net), but I never saw the need to do hard
>>> > pruning. Steenvreter uses the predictions to set priors, and it is very
>>> > selective, but with infinite simulations eventually all potentially
>>> > relevant moves will get sampled.
>>>
>>> With infinite simulations everything is easy :-)
>>>
>>> In 

Re: [Computer-go] mini-max with Policy and Value network

2017-05-22 Thread Marc Landgraf
Leela has surprisingly large tactical holes. Right now it is throwing a
good number of games against me in completely won endgames by fumbling away
entirely alive groups.

As an example I attached one game of myself (3d), even, vs Leela10 @7d. But
this really isn't a one-time occurrence.

If you look around move 150, the game is completely over by human standards
as well as by Leela's evaluation (Leela gives itself >80% here).
But then Leela starts doing weird things.
186 is a minor mistake, but by itself it does not yet throw the game. But it is
the start of a series of bad turns.
236 then is a non-threat in a ko fight, and checking Leela's evaluation,
Leela doesn't even consider the possibility of it being ignored. This is
btw a common topic with Leela in ko fights - it does not look at all at
what happens if the ko threat is ignored.
238 follows up the "ko threat", but this move isn't doing anything either!
So Leela has passed twice now.
Suddenly there is some Ko appearing at the top right.
Leela plays this Ko fight in some suboptimal way, not fully utilizing local
ko threats, but this is a concept rather difficult to grasp for AIs afaik.
I cannot judge with 100% certainty whether ignoring the black threat of 253 is
correct for Leela; I have some doubts about this one too.
With 253 ignored, the game is now swinging heavily, but in my judgement,
playing the hane instead of 256 would still keep it rather close and I'm
not 100% sure who would win it now. But Leela decides to completely bury
itself here with 256, while still giving itself 70% to win.
As realization of the real game state slowly kicks in, the rest of the game
is then the usual MC throw-away style we have known for years.

Still... in this game you can see how a series of massive tactical blunders
leads to throwing away a completely won game. And this is just one of many
examples. And it cannot all be pinned on kos. I have seen a fair number
of games where Leela makes similar mistakes without a ko involved, even though
kos drastically increase Leela's fumble chance.
At the same time, Leela is completely and utterly outplaying me on a
strategic level, and whenever it manages not to make screwups like the
ones shown I stand no chance at all. Even 3 stones is a serious challenge
for me then. But those mistakes are common enough to keep me around even.

2017-05-22 17:47 GMT+02:00 Erik van der Werf :

> On Mon, May 22, 2017 at 3:56 PM, Gian-Carlo Pascutto 
> wrote:
>
>> On 22-05-17 11:27, Erik van der Werf wrote:
>> > On Mon, May 22, 2017 at 10:08 AM, Gian-Carlo Pascutto > > > wrote:
>> >
>> > ... This heavy pruning
>> > by the policy network OTOH seems to be an issue for me. My program
>> has
>> > big tactical holes.
>> >
>> >
>> > Do you do any hard pruning? My engines (Steenvreter,Magog) always had a
>> > move predictor (a.k.a. policy net), but I never saw the need to do hard
>> > pruning. Steenvreter uses the predictions to set priors, and it is very
>> > selective, but with infinite simulations eventually all potentially
>> > relevant moves will get sampled.
>>
>> With infinite simulations everything is easy :-)
>>
>> In practice moves with, say, a prior below 0.1% aren't going to get
>> searched, and I still regularly see positions where they're the winning
>> move, especially with tactics on the board.
>>
>> Enforcing the search to be wider without losing playing strength appears
>> to be hard.
>>
>>
> Well, I think that's fundamental; you can't be wide and deep at the same
> time, but at least you can chose an algorithm that (eventually) explores
> all directions.
>
> BTW I'm a bit surprised that you are still able to find 'big tactical
> holes' with Leela now playing as 8d KGS
>
> Best,
> Erik
>
>
>
> ___
> Computer-go mailing list
> Computer-go@computer-go.org
> http://computer-go.org/mailman/listinfo/computer-go
>


leela10screwup.sgf
Description: application/go-sgf
___
Computer-go mailing list
Computer-go@computer-go.org
http://computer-go.org/mailman/listinfo/computer-go

Re: [Computer-go] mini-max with Policy and Value network

2017-05-22 Thread Gian-Carlo Pascutto
On 22-05-17 17:47, Erik van der Werf wrote:
> On Mon, May 22, 2017 at 3:56 PM, Gian-Carlo Pascutto  > wrote:
> 
> Well, I think that's fundamental; you can't be wide and deep at the same
> time, but at least you can chose an algorithm that (eventually) explores
> all directions.

Right. But I'm uncomfortable with the current setup, because many
options won't get explored at all in practical situations. It would seem
logical that some minimum amount of (more spread) search effort would
plug enough holes to stop bad blunders, but finding a way to do that and
preserve strength seems elusive so far.

> BTW I'm a bit surprised that you are still able to find 'big tactical
> holes' with Leela now playing as 8d KGS

I've attached an example. It's not the prettiest one (one side has a 0.5
pt advantage in the critical variation so exact komi is an issue), but
it's a recent one from my mailbox.

This is with Leela 0.10.0 so you can follow along:

Leela: loadsgf tactics2.sgf 261
=


Passes: 0                Black (X) Prisoners: 7
White (O) to move        White (O) Prisoners: 19

   a b c d e f g h j k l m n o p q r s t
19 . X O O . . . . . . . O O X X X O X . 19
18 . X X O O . . . . O O O X X X O O . O 18
17 . . X X O . . . O O X O O X . X O O . 17
16 . . X O O . . O O X X X X . . X X X . 16
15 . . X O . . O X X X X . . . . . X . X 15
14 . . . X O . O X O O O X . X X O O X(X)14
13 . . . X O O O O O X X X X O O X X X O 13
12 . . . . X X . O X X O O X O . O O X O 12
11 . . . . . . . O O X O X X O . . O O O 11
10 . . . X . X X O O O O O X O O O O . . 10
 9 . . . . X . X X O . O X O X X X O . .  9
 8 . . . . X . X O . . O X O O X O . . .  8
 7 X X X X O X X O . O O X O X X O O . .  7
 6 O O O O O O O . . . O X O . X X O . .  6
 5 X O O . X O O O O O X X X X X O . . .  5
 4 X X O O O X X O . O X X . X O O . . .  4
 3 X . X X X X O . O . O X . . X O . . .  3
 2 X O X . . O O . O . O O X X X O O . .  2
 1 . X . O O . . O O . O X X . X X O . .  1
   a b c d e f g h j k l m n o p q r s t

Hash: 106A3898CEC94132 Ko-Hash: 67E390C41BF2577

Black time: 00:30:00
White time: 00:30:00

Leela: heatmap
=

94.46% G11
 4.20% C1
 1.31% E2
 0.03% all other moves together

Note that O16 P17 N15 wins immediately. It's not that Leela is
completely blind to it, because that sequence is in some variations. But
in here, O16 won't get searched for a long time (it is actually the 4th
rated move) due to the skewed probabilities.

Leela: play w g11
=

Leela: heatmap
=

99.9% F11
 0.1% C1
   0% all other moves together

Leela: genmove b


Score looks bad. Resigning.
= resign

https://timkr.home.xs4all.nl/chess2/resigntxt.htm

If black plays O16 instead here he wins by 0.5 points.

-- 
GCP


tactics2.sgf
Description: application/go-sgf
___
Computer-go mailing list
Computer-go@computer-go.org
http://computer-go.org/mailman/listinfo/computer-go

Re: [Computer-go] mini-max with Policy and Value network

2017-05-22 Thread Erik van der Werf
On Mon, May 22, 2017 at 3:56 PM, Gian-Carlo Pascutto  wrote:

> On 22-05-17 11:27, Erik van der Werf wrote:
> > On Mon, May 22, 2017 at 10:08 AM, Gian-Carlo Pascutto  > > wrote:
> >
> > ... This heavy pruning
> > by the policy network OTOH seems to be an issue for me. My program
> has
> > big tactical holes.
> >
> >
> > Do you do any hard pruning? My engines (Steenvreter,Magog) always had a
> > move predictor (a.k.a. policy net), but I never saw the need to do hard
> > pruning. Steenvreter uses the predictions to set priors, and it is very
> > selective, but with infinite simulations eventually all potentially
> > relevant moves will get sampled.
>
> With infinite simulations everything is easy :-)
>
> In practice moves with, say, a prior below 0.1% aren't going to get
> searched, and I still regularly see positions where they're the winning
> move, especially with tactics on the board.
>
> Enforcing the search to be wider without losing playing strength appears
> to be hard.
>
>
Well, I think that's fundamental; you can't be wide and deep at the same
time, but at least you can chose an algorithm that (eventually) explores
all directions.

BTW I'm a bit surprised that you are still able to find 'big tactical
holes' with Leela now playing as 8d KGS

Best,
Erik
___
Computer-go mailing list
Computer-go@computer-go.org
http://computer-go.org/mailman/listinfo/computer-go

Re: [Computer-go] mini-max with Policy and Value network

2017-05-22 Thread Gian-Carlo Pascutto
On 22-05-17 15:46, Erik van der Werf wrote:
> Oh, haha, after reading Brian's post I guess I misunderstood :-)
> 
> Anyway, LMR seems like a good idea, but last time I tried it (in Migos)
> it did not help. In Magog I had some good results with fractional depth
> reductions (like in Realization Probability Search), but it's a long
> time ago and the engines were much weaker then...

What was generating your probabilities, though? A strong policy DCNN or
something weaker?

ERPS (LMR with fractional reductions based on move probabilities) with
alpha-beta seems very similar to having MCTS with the policy prior being
a factor in the UCT formula. This is what AlphaGo did according to their
2015 paper, so it can't be terrible, but it does mean that you are 100%
blind to something the policy network doesn't see, which seems
worrisome. I think I asked Aja once about what they do with first play
urgency given that the paper doesn't address it - he politely ignored
the question :-)

The obvious defense (when looking at it in alpha-beta formulation) would
be to cap the depth reduction, and (in MCTS/UCT formulation) to cap the
minimum probability. I had no success with this in Go so far.
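
For concreteness, a sketch of such a capped, probability-based fractional
reduction (my own illustration in the spirit of RPS/ERPS; the log base and the
cap are placeholder choices, not anyone's tuned settings):

import math

def depth_consumed(prior, cap=3.0):
    # Likely moves consume about one ply of the remaining budget; unlikely
    # moves consume more.  The cap is the "defense" mentioned above: even a
    # near-zero prior still gets searched at some reduced depth instead of
    # being effectively hard-pruned.
    return 1.0 + min(cap, -math.log10(max(prior, 1e-12)))

for p in (0.5, 0.05, 0.001, 1e-6):
    print(f"prior {p:g} -> depth consumed {depth_consumed(p):.2f}")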

-- 
GCP
___
Computer-go mailing list
Computer-go@computer-go.org
http://computer-go.org/mailman/listinfo/computer-go

Re: [Computer-go] mini-max with Policy and Value network

2017-05-22 Thread Gian-Carlo Pascutto
On 22-05-17 14:48, Brian Sheppard via Computer-go wrote:
> My reaction was "well, if you are using alpha-beta, then at least use
> LMR rather than hard pruning." Your reaction is "don't use 
> alpha-beta", and you would know better than anyone!

There's 2 aspects to my answer:

1) Unless you've made a breakthrough with value nets, there appears to
be a benefit to keeping the Monte Carlo simulations.

2) I am not sure the practical implementations of both algorithms end up
searching in a different manner.

(1) Is an argument against using alpha-beta. If we want to get rid of
the MC simulations - for whatever reason - it disappears. (2) isn't an
argument against. Stating the algorithm in a different manner may make
some heuristics or optimizations more obvious.

> Yes, LMR in Go has one big difference compared to LMR in chess: Go 
> tactics take many moves to play out, whereas chess tactics are often
>  pretty immediate.

Not sure I agree with the basic premise here.

> So LMR could hurt Go tactics much more than it hurts chess tactics.
> Compare the benefit of forcing the playout to the end of the game.

LMR doesn't prune anything, it just reduces the remaining search depth
for non-highly rated moves. So it's certainly not going to make
something tactically weaker than hard pruning? If you're talking about
not pruning or reducing at all, you get the issue of the branching
factor again.
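
A minimal alpha-beta skeleton with late-move reductions over a toy random game
tree, just to make the reduce-then-re-search (rather than prune) structure
concrete; the tree, the evaluation and the reduction rule are all placeholders
of mine, not Leela's search:

import random

BRANCHING = 4

def leaf_score(path):
    # Deterministic pseudo-random evaluation for a given move path,
    # taken from the perspective of the side to move at that leaf.
    return random.Random(hash(path)).uniform(-1.0, 1.0)

def alphabeta(path, depth, alpha, beta):
    if depth <= 0:
        return leaf_score(path)
    for i in range(BRANCHING):               # moves assumed ordered best-first
        reduction = 1 if i >= 2 and depth >= 3 else 0    # reduce late moves
        score = -alphabeta(path + (i,), depth - 1 - reduction, -beta, -alpha)
        if reduction and score > alpha:
            # The reduced search looked promising: re-search at full depth.
            score = -alphabeta(path + (i,), depth - 1, -beta, -alpha)
        if score >= beta:
            return beta                      # cutoff, but nothing was pruned outright
        alpha = max(alpha, score)
    return alpha

print(alphabeta(path=(), depth=6, alpha=-2.0, beta=2.0))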

In chess you have quiescence search to filter out the simpler tactics. I
guess Monte Carlo simulations may act similarly in that they're going to
raise/lower the score if tactical shenanigans happen in some simulations.

-- 
GCP
___
Computer-go mailing list
Computer-go@computer-go.org
http://computer-go.org/mailman/listinfo/computer-go

Re: [Computer-go] mini-max with Policy and Value network

2017-05-22 Thread Gian-Carlo Pascutto
On 22-05-17 11:27, Erik van der Werf wrote:
> On Mon, May 22, 2017 at 10:08 AM, Gian-Carlo Pascutto  > wrote:
> 
> ... This heavy pruning
> by the policy network OTOH seems to be an issue for me. My program has
> big tactical holes.
> 
> 
> Do you do any hard pruning? My engines (Steenvreter,Magog) always had a
> move predictor (a.k.a. policy net), but I never saw the need to do hard
> pruning. Steenvreter uses the predictions to set priors, and it is very
> selective, but with infinite simulations eventually all potentially
> relevant moves will get sampled.

With infinite simulations everything is easy :-)

In practice moves with, say, a prior below 0.1% aren't going to get
searched, and I still regularly see positions where they're the winning
move, especially with tactics on the board.

Enforcing the search to be wider without losing playing strength appears
to be hard.

-- 
GCP
___
Computer-go mailing list
Computer-go@computer-go.org
http://computer-go.org/mailman/listinfo/computer-go

Re: [Computer-go] mini-max with Policy and Value network

2017-05-22 Thread Erik van der Werf
On Mon, May 22, 2017 at 11:27 AM, Erik van der Werf <
erikvanderw...@gmail.com> wrote:

> On Mon, May 22, 2017 at 10:08 AM, Gian-Carlo Pascutto 
> wrote:
>>
>> ... This heavy pruning
>> by the policy network OTOH seems to be an issue for me. My program has
>> big tactical holes.
>
>
> Do you do any hard pruning? My engines (Steenvreter,Magog) always had a
> move predictor (a.k.a. policy net), but I never saw the need to do hard
> pruning. Steenvreter uses the predictions to set priors, and it is very
> selective, but with infinite simulations eventually all potentially
> relevant moves will get sampled.
>
>
Oh, haha, after reading Brian's post I guess I misunderstood :-)

Anyway, LMR seems like a good idea, but last time I tried it (in Migos) it
did not help. In Magog I had some good results with fractional depth
reductions (like in Realization Probability Search), but it's a long time
ago and the engines were much weaker then...
___
Computer-go mailing list
Computer-go@computer-go.org
http://computer-go.org/mailman/listinfo/computer-go

Re: [Computer-go] mini-max with Policy and Value network

2017-05-22 Thread Brian Sheppard via Computer-go
My reaction was "well, if you are using alpha-beta, then at least use LMR 
rather than hard pruning." Your reaction is "don't use alpha-beta", and you 
would know better than anyone!

Yes, LMR in Go has one big difference compared to LMR in chess: Go tactics 
take many moves to play out, whereas chess tactics are often pretty immediate. 
So LMR could hurt Go tactics much more than it hurts chess tactics. Compare the 
benefit of forcing the playout to the end of the game.

Best,
Brian

-Original Message-
From: Computer-go [mailto:computer-go-boun...@computer-go.org] On Behalf Of 
Gian-Carlo Pascutto
Sent: Monday, May 22, 2017 4:08 AM
To: computer-go@computer-go.org
Subject: Re: [Computer-go] mini-max with Policy and Value network

On 20/05/2017 22:26, Brian Sheppard via Computer-go wrote:
> Could use late-move reductions to eliminate the hard pruning. Given 
> the accuracy rate of the policy network, I would guess that even move
> 2 should be reduced.
> 

The question I always ask is: what's the real difference between MCTS with a 
small UCT constant and an alpha-beta search with heavy Late Move Reductions? 
Are the explored trees really so different?

In any case, in my experiments Monte Carlo still gives a strong benefit, even 
with a not so strong Monte Carlo part. IIRC it was the case for AlphaGo too, 
and they used more training data for the value network than is publicly 
available, and Zen reported the same: Monte Carlo is important.

The main problem is the "only top x moves part". Late Move Reductions are very 
nice because there is never a full pruning. This heavy pruning by the policy 
network OTOH seems to be an issue for me. My program has big tactical holes.

--
GCP
___
Computer-go mailing list
Computer-go@computer-go.org
http://computer-go.org/mailman/listinfo/computer-go

___
Computer-go mailing list
Computer-go@computer-go.org
http://computer-go.org/mailman/listinfo/computer-go

Re: [Computer-go] mini-max with Policy and Value network

2017-05-22 Thread Erik van der Werf
On Mon, May 22, 2017 at 10:08 AM, Gian-Carlo Pascutto  wrote:
>
> ... This heavy pruning
> by the policy network OTOH seems to be an issue for me. My program has
> big tactical holes.


Do you do any hard pruning? My engines (Steenvreter,Magog) always had a
move predictor (a.k.a. policy net), but I never saw the need to do hard
pruning. Steenvreter uses the predictions to set priors, and it is very
selective, but with infinite simulations eventually all potentially
relevant moves will get sampled.

Erik
___
Computer-go mailing list
Computer-go@computer-go.org
http://computer-go.org/mailman/listinfo/computer-go

Re: [Computer-go] mini-max with Policy and Value network

2017-05-22 Thread Gian-Carlo Pascutto
On 20/05/2017 22:26, Brian Sheppard via Computer-go wrote:
> Could use late-move reductions to eliminate the hard pruning. Given
> the accuracy rate of the policy network, I would guess that even move
> 2 should be reduced.
> 

The question I always ask is: what's the real difference between MCTS
with a small UCT constant and an alpha-beta search with heavy Late Move
Reductions? Are the explored trees really so different?

In any case, in my experiments Monte Carlo still gives a strong benefit,
even with a not so strong Monte Carlo part. IIRC it was the case for
AlphaGo too, and they used more training data for the value network than
is publicly available, and Zen reported the same: Monte Carlo is important.

The main problem is the "only top x moves part". Late Move Reductions
are very nice because there is never a full pruning. This heavy pruning
by the policy network OTOH seems to be an issue for me. My program has
big tactical holes.

-- 
GCP
___
Computer-go mailing list
Computer-go@computer-go.org
http://computer-go.org/mailman/listinfo/computer-go

Re: [Computer-go] mini-max with Policy and Value network

2017-05-20 Thread Brian Sheppard via Computer-go
Could use late-move reductions to eliminate the hard pruning. Given the 
accuracy rate of the policy network, I would guess that even move 2 should be 
reduced.

-Original Message-
From: Computer-go [mailto:computer-go-boun...@computer-go.org] On Behalf Of 
Hiroshi Yamashita
Sent: Saturday, May 20, 2017 3:42 PM
To: computer-go@computer-go.org
Subject: [Computer-go] mini-max with Policy and Value network

Hi,

HiraBot author reported mini-max search with Policy and Value network. 
It does not use monte-carlo.
Only top 8 moves from Policy is searched in root node. In other depth,  top 4 
moves is searched.

Game result against Policy network best move (without search)

 Win Loss winrate 
MaxDepth=1, (558-442) 0.558   +40 Elo
MaxDepth=2, (351-150) 0.701  +148 Elo
MaxDepth=3, (406-116) 0.778  +218 Elo
MaxDepth=4, (670- 78) 0.896  +374 Elo
MaxDepth=5, (490- 57) 0.896  +374 Elo
MaxDepth=6, (520- 20) 0.963  +556 Elo

Search is simple alpha-beta.
There is a modification Policy network high probability moves tend to be 
selected.
MaxDepth=6 takes one second/move on i7-4790k + GTX1060.

His nega-max code
http://kiyoshifk.dip.jp/kiyoshifk/apk/negamax.zip
CGOS result, MaxDepth=6
http://www.yss-aya.com/cgos/19x19/cross/minimax-depth6.html
His Policy network(without search) is maybe 
http://www.yss-aya.com/cgos/19x19/cross/DCNN-No336-tygem.html
His Policy and Value network(MCTS) is maybe 
http://www.yss-aya.com/cgos/19x19/cross/Hiratuka10_38B100.html

Thanks,
Hiroshi Yamashita

___
Computer-go mailing list
Computer-go@computer-go.org
http://computer-go.org/mailman/listinfo/computer-go

___
Computer-go mailing list
Computer-go@computer-go.org
http://computer-go.org/mailman/listinfo/computer-go

[Computer-go] mini-max with Policy and Value network

2017-05-20 Thread Hiroshi Yamashita

Hi,

The HiraBot author reported on mini-max search with a Policy and Value network. 
It does not use Monte Carlo.

Only the top 8 moves from the Policy net are searched at the root node. At
other depths, the top 4 moves are searched.

Game results against the Policy network's best move (without search):

            Win-Loss    winrate
MaxDepth=1, (558-442)   0.558    +40 Elo
MaxDepth=2, (351-150)   0.701   +148 Elo
MaxDepth=3, (406-116)   0.778   +218 Elo
MaxDepth=4, (670- 78)   0.896   +374 Elo
MaxDepth=5, (490- 57)   0.896   +374 Elo
MaxDepth=6, (520- 20)   0.963   +556 Elo

The search is simple alpha-beta.
There is a modification so that high-probability Policy-network moves tend to
be selected.
MaxDepth=6 takes one second per move on an i7-4790k + GTX1060.
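
A rough sketch of that scheme as I read it (plain negamax, top-8 policy moves
at the root and top-4 elsewhere, value net at the leaves); the Position class
and the two network stubs below are stand-ins of mine so the sketch runs, not
HiraBot's actual code, which is linked below:

import random

class Position:
    def __init__(self, history=()):
        self.history = history
    def legal_moves(self):
        return range(20)                   # pretend there are 20 candidates
    def play(self, move):
        return Position(self.history + (move,))
    def game_over(self):
        return False

def policy_net(pos, move):                 # stub: prior for (position, move)
    return random.Random(hash(pos.history + (move,))).random()

def value_net(pos):                        # stub: score in [-1, 1] for the side to move
    return random.Random(hash(pos.history)).uniform(-1.0, 1.0)

def negamax(pos, depth, alpha, beta, is_root=False):
    if depth == 0 or pos.game_over():
        return value_net(pos)
    k = 8 if is_root else 4                # top 8 at the root, top 4 elsewhere
    moves = sorted(pos.legal_moves(),
                   key=lambda m: policy_net(pos, m), reverse=True)[:k]
    best = -2.0
    for move in moves:
        score = -negamax(pos.play(move), depth - 1, -beta, -alpha)
        best = max(best, score)
        alpha = max(alpha, score)
        if alpha >= beta:
            break                          # beta cutoff
    return best

print(negamax(Position(), depth=4, alpha=-2.0, beta=2.0, is_root=True))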

His nega-max code
http://kiyoshifk.dip.jp/kiyoshifk/apk/negamax.zip
CGOS result, MaxDepth=6
http://www.yss-aya.com/cgos/19x19/cross/minimax-depth6.html
His Policy network(without search) is maybe
http://www.yss-aya.com/cgos/19x19/cross/DCNN-No336-tygem.html
His Policy and Value network(MCTS) is maybe
http://www.yss-aya.com/cgos/19x19/cross/Hiratuka10_38B100.html

Thanks,
Hiroshi Yamashita

___
Computer-go mailing list
Computer-go@computer-go.org
http://computer-go.org/mailman/listinfo/computer-go