Re: [computer-go] CUDA implementation of the per-intersection GPGPU approach

2009-09-11 Thread Petr Baudis
On Thu, Sep 10, 2009 at 09:18:40AM -0400, Jason House wrote:
 Somewhat... One could generate a random number (r) and combine it
 with the move mask (m) as follows:
 black moves = m & r
 white moves = m & ~r
 
 This has the drawback that the number of black and white moves may
 not be equal. It can be modified to give an equal number of moves
 such as requiring that each subset of n intersections generates the
 same number of black and white moves, or doing some kind of looping
 until the condition is met.
 
 For looping, you could keep some subset of the generated moves in
 order to guarantee convergence. For example, if there are 14 more
 white moves than black moves, keep all generated moves except for 14
 white moves. Then you only have to pick 14 moves in the next loop
 iteration.

As a quick test, mainly to measure the speedup, I have made a simple hack
that takes every other intersection (creating an X-pattern) and
assigns it a randomized value (S_NONE, S_BLACK, S_WHITE). No check
on the number of stones is made.
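
A minimal CUDA sketch of the hack described above: one thread per
intersection, every other point gets a random colour, no balance check.
The S_NONE/S_BLACK/S_WHITE names come from the e-mail; the checkerboard
test, the curand-based RNG and the board layout are illustrative
assumptions, not the actual code.

  #include <cuda_runtime.h>
  #include <curand_kernel.h>

  #define BOARD 9
  enum { S_NONE = 0, S_BLACK = 1, S_WHITE = 2 };

  /* One thread per intersection of a 9x9 board.  Points on the X-pattern
   * (every other intersection) get a random value out of S_NONE, S_BLACK,
   * S_WHITE; the rest stay empty.  No check on the stone counts is made,
   * matching the hack described above. */
  __global__ void x_pattern_fill(char *board, unsigned long long seed)
  {
      int x = threadIdx.x, y = threadIdx.y;
      int pt = y * BOARD + x;

      curandState rng;
      curand_init(seed, pt, 0, &rng);

      board[pt] = ((x + y) % 2 == 0) ? (char)(curand(&rng) % 3) : S_NONE;
  }

  /* launch:  x_pattern_fill<<<1, dim3(BOARD, BOARD)>>>(dev_board, seed); */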

Now it thinks black has a 46% win probability on an empty board with no komi,
for a speedup of 1000 playouts/s. So the speedup seems totally
disappointing, and the bottleneck is in fact in the later phases of the
game (merging groups and capturing). Hmm, something to dig into.

-- 
Petr Pasky Baudis
A lot of people have my books on their bookshelves.
That's the problem, they need to read them. -- Don Knuth
___
computer-go mailing list
computer-go@computer-go.org
http://www.computer-go.org/mailman/listinfo/computer-go/


Re: [computer-go] CUDA implementation of the per-intersection GPGPU approach

2009-09-10 Thread Christian Nentwich

Petr,

wow, I didn't expect to see so much experimentation being performed! 
It's great that you have taken the time to implement this approach, 
because this now shows people both alternatives for implementing a 
playout algorithm on the GPU.


I strongly suspect the low performance in the per-intersection case is 
down to two reasons - please let me know what you think:
 - A big proportion of moves place one stone on an empty intersection. 
In this case 80 of your 81 threads are asleep doing nothing. This means 
GPU time is wasted.
 - In quite a few other cases, we get to the old problem with Go: that 
the intersections are not independent. Thus when you process captures or 
merges, you possibly have to lock 80 of 81 threads.


You can't do much about the first. Did you find a solution for the 
second? I was thinking for a while about whether it would be possible to 
come up with a highly parallel algorithm for captures/merges that 
requires no lock. That's quite hard so I concentrated on my own 
experiment instead.


I agree with you that your approach could be very performant in pattern 
matching scenarios. In fact, your work somewhat strengthens my 
conclusion that doing just straight-forward playouts on the GPU is not 
the way to go.


Christian


Petr Baudis wrote:

  I have written a quick CUDA implementation of the per-intersection
GPGPU approach; Christian's e-mail finally made me polish it up to
a somewhat presentable form.

  In this implementation, each GPU thread maintains a single
intersection.

  The implementation uses a 9x9 board (10x11 internally); expanding to
19x19 would probably mean either moving some data to global memory,
or splitting the playout of a single board across multiple blocks, or waiting
for GPUs with larger shared memory. ;-)
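
For readers who have not written CUDA before, the thread-per-intersection
mapping described above might look roughly like the sketch below; the
11x11 padded layout, the constants and the toy kernel body are
illustrative assumptions rather than the actual code.

  #include <cuda_runtime.h>

  #define BOARD 9
  #define STRIDE (BOARD + 2)               /* padded with an off-board border */
  #define N_PTS (STRIDE * STRIDE)
  enum { S_NONE = 0, S_BLACK = 1, S_WHITE = 2, S_OFFBOARD = 3 };

  /* One block per playout, one thread per (padded) intersection.  The board
   * image lives in shared memory and every phase is separated by
   * __syncthreads() so that all threads see a consistent board.  As a toy
   * phase, the kernel just counts the empty points of its playout. */
  __global__ void per_intersection_playout(const char *boards_in, int *empty_out)
  {
      __shared__ char board[N_PTS];
      __shared__ int n_empty;

      int pt = threadIdx.x;                /* this thread's intersection */
      int game = blockIdx.x;               /* which parallel playout */

      board[pt] = boards_in[game * N_PTS + pt];
      if (pt == 0)
          n_empty = 0;
      __syncthreads();

      if (board[pt] == S_NONE)
          atomicAdd(&n_empty, 1);
      __syncthreads();

      if (pt == 0)
          empty_out[game] = n_empty;
  }

  /* e.g. 50 playouts in parallel:
   *   per_intersection_playout<<<50, N_PTS>>>(dev_boards, dev_counts);    */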

  The speed is unfortunately very disappointing; on GTX260, I see
17,300 playouts per second, with 50 playouts running in parallel.
There are several buts:

  - With some further obvious optimizations, I'm thinking it should be
possible to get it to at least 22,000 playouts per second easily.

  - Bit boards are an obvious thing to try, but I'm quite doubtful
they will be an improvement

  - The playouts are uniformly random now, but they have been
implemented in a way to make heavier playouts easy; precise liberty
count is maintained and it should be easy to add pattern matching;
the move selection can deal with arbitrary probabilities for
different intersections (I wonder, what are the standard
high-performance numbers for heavier playouts with pattern matching?)

  - Christian is playing 2650 games in parallel; I'm playing only 50
games in parallel, in the meantime the CPU can pick next games to
play from the tree

  - My code already accounts for transferring board images from CPU to
the GPU (different one for each playout) and getting scores back

  - Christian's GPU seems quite better than mine

  - Apparently I still suck at CUDA optimizations; this has been my
first real small CUDA project, it would be awesome if someone more
experienced could look at the code and suggest obvious improvements

  Still, I'm pretty unhappy with the slowness; I wonder how Christian
achieved such a high speed. One problem with my approach is that I have
to make use of a lot of atomic intrinsics (some of them could be worked
around) and __syncthreads() all the time to ensure that all threads have
a consistent board image at each stage.

  I think with pattern matching, the per-intersection approach could
shine more, since I can easily match patterns everywhere on the board
at once. Still, I wonder if it will become at least on par with good
CPU implementations.

On Wed, Sep 09, 2009 at 04:54:23PM +0100, Christian Nentwich wrote:
  

In other words: no tree search is involved, and this is the lightest
possible playout. The raw numbers are as follows:
 - CPU Search: 47,000 playouts per CPU core per second, on an Intel
6600 Core-2 Duo
 - GPU Search: 170,000 playouts per second, on an NVidia Geforce 285 card



  I still find this quite astonishing; since you consider this
experiment a failure anyway, would you mind publishing the source code?
:-)

  






--

Christian Nentwich

Director, Model Two Zero Ltd.
+44-(0)7747-061302
http://www.modeltwozero.com

___
computer-go mailing list
computer-go@computer-go.org
http://www.computer-go.org/mailman/listinfo/computer-go/


Re: [computer-go] CUDA implementation of the per-intersection GPGPU approach

2009-09-10 Thread Magnus Persson
This is very interesting. Here is a crazy idea; maybe it is the same as
Mark's, but I want to take it to its extreme.


Since AMAF values are so helpful, perhaps one can let go of the idea
of sequential play following the rules of go, and basically play moves
in parallel on all empty intersections. Compute the new state (do captures),
repeat a fixed number of times, and evaluate.
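
One possible reading of this proposal, sketched as a CUDA kernel for the
"play everywhere at once" step; capture resolution and the evaluation are
the hard part and are left out, and all names and conventions here are
guesses rather than a worked-out algorithm.

  #include <cuda_runtime.h>
  #include <curand_kernel.h>

  #define BOARD 9
  #define N_PTS (BOARD * BOARD)
  enum { S_NONE = 0, S_BLACK = 1, S_WHITE = 2 };

  /* One thread per intersection: every empty point receives a stone of a
   * random colour in the same step.  Capture resolution and the final
   * evaluation would follow as separate __syncthreads()-separated phases;
   * they are deliberately omitted here. */
  __global__ void fill_all_empty(char *board, unsigned long long seed, int iteration)
  {
      int pt = blockIdx.x * blockDim.x + threadIdx.x;
      if (pt >= N_PTS)
          return;

      curandState rng;
      curand_init(seed, pt, iteration, &rng);   /* fresh bits per iteration */

      if (board[pt] == S_NONE)
          board[pt] = (curand(&rng) & 1) ? S_BLACK : S_WHITE;
  }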


Since I do not understand GPU programming at all I cannot judge  
whether this would make things easier to implement.


Also, a lot of weird things will happen when you do things in parallel.
What happens when two adjacent blocks are captured at the same time?
Are both removed? Are there tricks to remove the one that was most
likely to be captured to begin with?


-Magnus

Quoting Christian Nentwich christ...@modeltwozero.com:


Petr,

wow, I didn't expect to see so much experimentation being performed!
It's great that you have taken the time to implement this approach,
because this now shows people both alternatives for implementing a
playout algorithm on the GPU.

I strongly suspect the low performance in the per-intersection case is
down to two reasons - please let me know what you think:
 - A big proportion of moves place one stone on an empty intersection.
In this case 80 of your 81 threads are asleep doing nothing. This means
GPU time is wasted.
 - In quite a few other cases, we get to the old problem with Go: that
the intersections are not independent. Thus when you process captures
or merges, you possibly have to lock 80 of 81 threads.

You can't do much about the first. Did you find a solution for the
second? I was thinking for a while about whether it would be possible
to come up with a highly parallel algorithm for captures/merges that
requires no lock. That's quite hard so I concentrated on my own
experiment instead.

I agree with you that your approach could be very performant in pattern
matching scenarios. In fact, your work somewhat strengthens my
conclusion that doing just straight-forward playouts on the GPU is not
the way to go.

Christian


Petr Baudis wrote:

 I have written a quick CUDA implementation of the per-intersection
GPGPU approach; Christian's e-mail finally made me polish it up to
a somewhat presentable form.

 In this implementation, each GPU thread maintains a single
intersection.

 The implementation uses a 9x9 board (10x11 internally); expanding to
19x19 would probably mean either moving some data to global memory,
or splitting the playout of a single board across multiple blocks, or waiting
for GPUs with larger shared memory. ;-)

 The speed is unfortunately very disappointing; on GTX260, I see
17,300 playouts per second, with 50 playouts running in parallel.
There are several buts:

 - With some further obvious optimizations, I'm thinking it should be
   possible to get it to at least 22,000 playouts per second easily.

 - Bit boards are an obvious thing to try, but I'm quite doubtful
   they will be an improvement

 - The playouts are uniformly random now, but they have been
   implemented in a way to make heavier playouts easy; precise liberty
   count is maintained and it should be easy to add pattern matching;
   the move selection can deal with arbitrary probabilities for
   different intersections (I wonder, what are the standard
   high-performance numbers for heavier playouts with pattern matching?)

 - Christian is playing 2650 games in parallel; I'm playing only 50
   games in parallel, in the meantime the CPU can pick next games to
   play from the tree

 - My code already accounts for transferring board images from CPU to
   the GPU (different one for each playout) and getting scores back

 - Christian's GPU seems quite better than mine

 - Apparently I still suck at CUDA optimizations; this has been my
   first real small CUDA project, it would be awesome if someone more
   experienced could look at the code and suggest obvious improvements

 Still, I'm pretty unhappy with the slowness; I wonder how Christian
achieved such a high speed. One problem with my approach is that I have
to make use of a lot of atomic intrinsics (some of them could be worked
around) and __syncthreads() all the time to ensure that all threads have
a consistent board image at each stage.

 I think with pattern matching, the per-intersection approach could
shine more, since I can easily match patterns everywhere on the board
at once. Still, I wonder if it will become at least on par with good
CPU implementations.

On Wed, Sep 09, 2009 at 04:54:23PM +0100, Christian Nentwich wrote:


In other words: no tree search is involved, and this is the lightest
possible playout. The raw numbers are as follows:
- CPU Search: 47,000 playouts per CPU core per second, on an Intel
6600 Core-2 Duo
- GPU Search: 170,000 playouts per second, on an NVidia Geforce 285 card



 I still find this quite astonishing; since you consider this
experiment a failure anyway, would you mind publishing the source code?
:-)

 

Re: [computer-go] CUDA implementation of the per-intersection GPGPU approach

2009-09-10 Thread Petr Baudis
  Hi!

On Thu, Sep 10, 2009 at 12:26:06PM +0100, Christian Nentwich wrote:
 I strongly suspect the low performance in the per-intersection case
 is down to two reasons - please let me know what you think:
  - A big proportion of moves place one stone on an empty
 intersection. In this case 80 of your 81 threads are asleep doing
 nothing. This means GPU time is wasted.

  This is mostly true, though only for part of the loop body; I think
this is the biggest reason my approach is slower than yours, and it
will be much worse on 19x19.

  It would be interesting to have N threads running independently and
getting contracted to various boards for massive tasks like capturing
and merging groups. But that would be quite a complex task; I'm not sure
it is even practical to consider with the current state of the CUDA sandbox.

  - In quite a few other cases, we get to the old problem with Go:
 that the intersections are not independent. Thus when you process
 captures or merges, you possibly have to lock 80 of 81 threads.

  I do no locking; actually, being able to do large captures and merges
very quickly was always my primary motivation for even entertaining this
kind of parallelism.

  However, I do a lot of atomicAdd()s which have quite an overhead and
in some cases basically just serialize many threads.  Mainly:

  (i) When counting the score. This is totally unoptimized and O(N) now;
it should be trivial to rewrite as a parallel sum in O(log N). This is
one of the easy optimizations I mentioned, in fact the most obvious one
(a sketch of such a reduction follows below).

  (ii) When recalculating liberties. All intersection threads adding an
extra liberty to the same group will get serialized. Avoiding this
seems difficult.
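
A minimal sketch of the parallel sum meant in (i); the per-point score
convention (+1 black, -1 white) and the power-of-two block size are
assumptions for illustration, not part of the actual code.

  #define N_PTS 128              /* 81 intersections padded up to a power of two */

  /* Tree reduction in shared memory: each thread contributes the score of
   * its own intersection (+1 black, -1 white, 0 otherwise; padding threads
   * pass 0), and the partial sums are folded together in O(log N) steps
   * instead of N serialized atomicAdd()s.  Assumes blockDim.x == N_PTS. */
  __device__ int reduce_score(int my_contribution)
  {
      __shared__ int partial[N_PTS];
      int t = threadIdx.x;

      partial[t] = my_contribution;
      __syncthreads();

      for (int stride = N_PTS / 2; stride > 0; stride /= 2) {
          if (t < stride)
              partial[t] += partial[t + stride];
          __syncthreads();
      }
      return partial[0];         /* the full score, visible to every thread */
  }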

-- 
Petr Pasky Baudis
A lot of people have my books on their bookshelves.
That's the problem, they need to read them. -- Don Knuth
___
computer-go mailing list
computer-go@computer-go.org
http://www.computer-go.org/mailman/listinfo/computer-go/


Re: [computer-go] CUDA implementation of the per-intersection GPGPU approach

2009-09-10 Thread Petr Baudis
On Thu, Sep 10, 2009 at 01:54:49PM +0200, Magnus Persson wrote:
 This is very interesting. Here is a crazy idea, maybe it the same as
 Marks but I want to take it to its extreme.
 
 Since AMAF values are so helpful, perhaps one can let go of the idea
 of sequential play following the rules of go, and basically play
 moves in parallel on all empty intersection. Compute new state (do
 captures) and repeat a fixed number of times and evaluate.
 
 Since I do not understand GPU programming at all I cannot judge
 whether this would make things easier to implement.
 
 Also a lot of weird thing will happens when you do things in
 parallel. What happens when two adjacent blocks are captured at the
 same time. Are both removed? Are there tricks to remove the one that
 was most likely to captured to begin with.

Well, after one iteration, _all_ blocks should be captured since there
will be no empty intersections, right? (Even if you don't play in
eye-like spaces, you will have this situation until some of these
actually form.)

The easiest heuristic would seem to be to capture the smallest group,
but I have a hard time imagining this working well, _and_ imagining how to
choose the smallest group efficiently (other than that, modifying my source
to do this should be fairly easy).

-- 
Petr Pasky Baudis
A lot of people have my books on their bookshelves.
That's the problem, they need to read them. -- Don Knuth
___
computer-go mailing list
computer-go@computer-go.org
http://www.computer-go.org/mailman/listinfo/computer-go/


Re: [computer-go] CUDA implementation of the per-intersection GPGPU approach

2009-09-10 Thread Magnus Persson

Quoting Petr Baudis pa...@ucw.cz:


On Thu, Sep 10, 2009 at 01:54:49PM +0200, Magnus Persson wrote:

This is very interesting. Here is a crazy idea, maybe it the same as
Marks but I want to take it to its extreme.

Since AMAF values are so helpful, perhaps one can let go of the idea
of sequential play following the rules of go, and basically play
moves in parallel on all empty intersection. Compute new state (do
captures) and repeat a fixed number of times and evaluate.

Since I do not understand GPU programming at all I cannot judge
whether this would make things easier to implement.

Also a lot of weird thing will happens when you do things in
parallel. What happens when two adjacent blocks are captured at the
same time. Are both removed? Are there tricks to remove the one that
was most likely to captured to begin with.


Well, after one iteration, _all_ blocks should be captured since there
will be no empty intersections, right? (Even if you don't play in
eye-like spaces, you will have this situation until some of these
actually form.)

The easiest heuristic would seem to be to capture the smallest group,
but I have hard time imagining this working well, _and_ how to choose
the smallest group efficiently (other than that, modifying my source to
do this should be fairly easy).

--
Petr Pasky Baudis
A lot of people have my books on their bookshelves.
That's the problem, they need to read them. -- Don Knuth





--
Magnus Persson
Berlin, Germany
___
computer-go mailing list
computer-go@computer-go.org
http://www.computer-go.org/mailman/listinfo/computer-go/


Re: [computer-go] CUDA implementation of the per-intersection GPGPU approach

2009-09-10 Thread Magnus Persson

Sorry for my empty reply to Petr, I misclicked

When a block is identified, you could also sum up the strength of the
original stones covered by the block and use that as a heuristic for
what should be captured first.


Also, to avoid filling all liberties: one can avoid that with pattern
matching when the original position is copied to the GPU. For example,
if playing a point would give black an empty triangle, black would rather
pass than play that point; thus either white plays there or it remains
empty. Basically it means we can impose heavy pruning on the original
position, and then make many copies of that position and evaluate them
massively in parallel.
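
To make the pruning idea concrete, here is a small host-side sketch of an
empty-triangle test of the kind mentioned above, which could be used to
mask out black's candidate moves before the position is copied to the GPU;
the board representation and the simplified shape definition (ignoring the
usual refinements about the empty corner) are assumptions, not part of the
proposal itself.

  #define BOARD 9
  enum { S_NONE = 0, S_BLACK = 1, S_WHITE = 2 };

  static int stone_at(const char *b, int x, int y)
  {
      if (x < 0 || y < 0 || x >= BOARD || y >= BOARD)
          return -1;                            /* off board */
      return b[y * BOARD + x];
  }

  /* Would a stone of `color` at (x, y) complete an empty triangle?  Check
   * the four 2x2 squares containing (x, y): after the move, three corners
   * of such a square hold `color` and the fourth corner is empty. */
  static int makes_empty_triangle(const char *b, int x, int y, int color)
  {
      for (int dx = -1; dx <= 0; dx++)
          for (int dy = -1; dy <= 0; dy++) {
              int own = 1, empty = 0;           /* the new stone counts as own */
              for (int i = 0; i < 2; i++)
                  for (int j = 0; j < 2; j++) {
                      int cx = x + dx + i, cy = y + dy + j;
                      if (cx == x && cy == y)
                          continue;
                      int s = stone_at(b, cx, cy);
                      if (s == color)
                          own++;
                      else if (s == S_NONE)
                          empty++;
                  }
              if (own == 3 && empty == 1)
                  return 1;                     /* prune this point for `color` */
          }
      return 0;
  }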


-Magnus


Quoting Petr Baudis pa...@ucw.cz:


On Thu, Sep 10, 2009 at 01:54:49PM +0200, Magnus Persson wrote:

This is very interesting. Here is a crazy idea, maybe it the same as
Marks but I want to take it to its extreme.

Since AMAF values are so helpful, perhaps one can let go of the idea
of sequential play following the rules of go, and basically play
moves in parallel on all empty intersection. Compute new state (do
captures) and repeat a fixed number of times and evaluate.

Since I do not understand GPU programming at all I cannot judge
whether this would make things easier to implement.

Also a lot of weird thing will happens when you do things in
parallel. What happens when two adjacent blocks are captured at the
same time. Are both removed? Are there tricks to remove the one that
was most likely to captured to begin with.


Well, after one iteration, _all_ blocks should be captured since there
will be no empty intersections, right? (Even if you don't play in
eye-like spaces, you will have this situation until some of these
actually form.)

The easiest heuristic would seem to be to capture the smallest group,
but I have hard time imagining this working well, _and_ how to choose
the smallest group efficiently (other than that, modifying my source to
do this should be fairly easy).

--
Petr Pasky Baudis
A lot of people have my books on their bookshelves.
That's the problem, they need to read them. -- Don Knuth





--
Magnus Persson
Berlin, Germany
___
computer-go mailing list
computer-go@computer-go.org
http://www.computer-go.org/mailman/listinfo/computer-go/


Re: [computer-go] CUDA implementation of the per-intersection GPGPU approach

2009-09-10 Thread Petr Baudis
On Thu, Sep 10, 2009 at 08:29:31AM -0400, Jason House wrote:
 I've thought of something similar in the past, but with a twist:
 pre-compute a subset of moves that could be safely played in
 parallel. Even if you can only play 285 moves in parallel on an
 empty 19x19, it could still be a huge speed boost.

Hmm, do you have some ideas how this pre-computation could be done in
less than O(N)?

I'm gonna think about this now during some travelling... ;-)

Petr Pasky Baudis
___
computer-go mailing list
computer-go@computer-go.org
http://www.computer-go.org/mailman/listinfo/computer-go/


Re: [computer-go] CUDA implementation of the per-intersection GPGPU approach

2009-09-10 Thread Isaac Deutsch

maybe some divide & conquer algorithm?


On 10.09.2009, at 14:43, Petr Baudis wrote:


On Thu, Sep 10, 2009 at 08:29:31AM -0400, Jason House wrote:

I've thought of something similar in the past, but with a twist:
pre-compute a subset of moves that could be safely played in
parallel. Even if you can only play 285 moves in parallel on an
empty 19x19, it could still be a huge speed boost.


Hmm, do you have some ideas how this pre-computation could be done in
less than O(N)?

I'm gonna think about this now during some travelling... ;-)

Petr Pasky Baudis


___
computer-go mailing list
computer-go@computer-go.org
http://www.computer-go.org/mailman/listinfo/computer-go/


Re: [computer-go] CUDA implementation of the per-intersection GPGPU approach

2009-09-10 Thread steve uurtamo
 Since AMAF values are so helpful, perhaps one can let go of the idea of
 sequential play following the rules of go, and basically play moves in
 parallel on all empty intersection. Compute new state (do captures) and
 repeat a fixed number of times and evaluate.

two thoughts:

i) how do you determine color to play for each thread?

ii) if i) becomes unsynchronized with the rules of go, you're
basically exploring random boards instead of boards that are related
to the game that you're interested in.  the board should rapidly
converge to something that retains no information about the initial
state of the board (assuming that the initial board has stones on it).

s.
___
computer-go mailing list
computer-go@computer-go.org
http://www.computer-go.org/mailman/listinfo/computer-go/


Re: [computer-go] CUDA implementation of the per-intersection GPGPU approach

2009-09-10 Thread Jason House
Somewhat... One could generate a random number (r) and combine it with  
the move mask (m) as follows:

black moves = m & r
white moves = m & ~r

This has the drawback that the number of black and white moves may not  
be equal. It can be modified to give an equal number of moves such as  
requiring that each subset of n intersections generates the same  
number of black and white moves, or doing some kind of looping until  
the condition is met.


For looping, you could keep some subset of the generated moves in  
order to guarantee convergence. For example, if there are 14 more  
white moves than black moves, keep all generated moves except for 14  
white moves. Then you only have to pick 14 moves in the next loop  
iteration.
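
In bitboard form the mask trick might be sketched like this; a 9x9 board
needs two 64-bit words, and returning the popcount imbalance is only the
simplest hook for the balancing loop described above (the word count and
the __popcll-based check are assumptions for illustration).

  /* Split the legal-move mask m into black and white move sets with one
   * random bit per intersection (black = m & r, white = m & ~r) and report
   * the black-minus-white move-count imbalance, so the caller can loop or
   * redraw the random bits until it is satisfied. */
  __device__ int split_moves(const unsigned long long m[2],
                             const unsigned long long r[2],
                             unsigned long long black[2],
                             unsigned long long white[2])
  {
      int imbalance = 0;
      for (int w = 0; w < 2; w++) {        /* 81 points fit in two 64-bit words */
          black[w] = m[w] & r[w];
          white[w] = m[w] & ~r[w];
          imbalance += __popcll(black[w]) - __popcll(white[w]);
      }
      return imbalance;
  }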


Sent from my iPhone

On Sep 10, 2009, at 8:43 AM, Petr Baudis pa...@ucw.cz wrote:


On Thu, Sep 10, 2009 at 08:29:31AM -0400, Jason House wrote:

I've thought of something similar in the past, but with a twist:
pre-compute a subset of moves that could be safely played in
parallel. Even if you can only play 285 moves in parallel on an
empty 19x19, it could still be a huge speed boost.


Hmm, do you have some ideas how this pre-computation could be done in
less than O(N)?

I'm gonna think about this now during some travelling... ;-)

   Petr Pasky Baudis

___
computer-go mailing list
computer-go@computer-go.org
http://www.computer-go.org/mailman/listinfo/computer-go/


Re: [computer-go] CUDA implementation of the per-intersection GPGPU approach

2009-09-10 Thread Magnus Persson

Quoting steve uurtamo uurt...@gmail.com:


Since AMAF values are so helpful, perhaps one can let go of the idea of
sequential play following the rules of go, and basically play moves in
parallel on all empty intersection. Compute new state (do captures) and
repeat a fixed number of times and evaluate.


two thoughts:

i) how do you determine color to play for each thread?


Random. Only one random bit necessary for each decision.



ii) if i) becomes unsynchronized with the rules of go, you're
basically exploring random boards instead of boards that are related
to the game that you're interested in.  the board should rapidly
converge to something that retains no information about the initial
state of the board (assuming that the initial board has stones on it).


Usual lightweight random games also create situations that never
happen in normal play. But violating the rules will probably make
it more biased than a usual light playout; hard to tell before trying.
My feeling is that it might work surprisingly well, but it might get
extremely biased in some situations, which perhaps can be fixed by
prefiltering the moves that can be played.


Magnus

--
Magnus Persson
Berlin, Germany
___
computer-go mailing list
computer-go@computer-go.org
http://www.computer-go.org/mailman/listinfo/computer-go/