Re: [computer-go] CUDA implementation of the per-intersection GPGPU approach
On Thu, Sep 10, 2009 at 09:18:40AM -0400, Jason House wrote:

Somewhat... One could generate a random number (r) and combine it with the move mask (m) as follows: black moves = m & r, white moves = m & ~r. This has the drawback that the number of black and white moves may not be equal. It can be modified to give an equal number of moves, e.g. by requiring that each subset of n intersections generates the same number of black and white moves, or by looping until the condition is met. For looping, you could keep some subset of the generated moves in order to guarantee convergence. For example, if there are 14 more white moves than black moves, keep all generated moves except for 14 white moves. Then you only have to pick 14 moves in the next loop iteration.

As a quick test, mainly to measure the speedup, I have made a simple hack that takes every other intersection (creating an X-pattern) and assigns it a randomized value (S_NONE, S_BLACK, S_WHITE). No check on the number of stones is made. Now it thinks black has a 46% win probability on an empty board with no komi, for a speedup of 1,000 playouts/s. So the speedup seems totally disappointing, and the bottleneck is in fact in the latter phases of the game (merging groups and capturing). Hmm, something to dig into.

--
Petr "Pasky" Baudis
A lot of people have my books on their bookshelves. That's the problem, they need to read them. -- Don Knuth
___
computer-go mailing list
computer-go@computer-go.org
http://www.computer-go.org/mailman/listinfo/computer-go/
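[Editor's note: the X-pattern hack described above can be sketched roughly as follows. This is hypothetical illustration code, not Petr's actual implementation; the `S_*` constants and the `fill_x_pattern` helper are assumptions based on his description.]

```c
/* Sketch of the X-pattern hack: every other intersection of the 9x9
 * board is assigned a random value from {S_NONE, S_BLACK, S_WHITE};
 * no balancing of stone counts between the colors is performed. */
#include <stdlib.h>

enum { S_NONE, S_BLACK, S_WHITE };
#define N 9

void fill_x_pattern(int board[N][N], unsigned seed)
{
    srand(seed);
    for (int y = 0; y < N; y++)
        for (int x = 0; x < N; x++)
            if ((x + y) % 2 == 0)           /* every other point: the X-pattern */
                board[y][x] = rand() % 3;   /* S_NONE, S_BLACK or S_WHITE */
            else
                board[y][x] = S_NONE;       /* the rest stays empty */
}
```

A board filled this way has at most 41 stones on 9x9, with no guarantee that black and white receive equal numbers, which matches the "no check on number of stones" caveat above.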
Re: [computer-go] CUDA implementation of the per-intersection GPGPU approach
Petr, wow, I didn't expect to see so much experimentation being performed! It's great that you have taken the time to implement this approach, because this now shows people both alternatives for implementing a playout algorithm on the GPU.

I strongly suspect the low performance in the per-intersection case is down to two reasons - please let me know what you think:
- A big proportion of moves place one stone on an empty intersection. In this case 80 of your 81 threads are asleep, doing nothing, so GPU time is wasted.
- In quite a few other cases, we get to the old problem with Go: that the intersections are not independent. Thus when you process captures or merges, you possibly have to lock 80 of 81 threads.

You can't do much about the first. Did you find a solution for the second? I was thinking for a while about whether it would be possible to come up with a highly parallel algorithm for captures/merges that requires no lock. That's quite hard, so I concentrated on my own experiment instead.

I agree with you that your approach could be very performant in pattern-matching scenarios. In fact, your work somewhat strengthens my conclusion that doing just straightforward playouts on the GPU is not the way to go.

Christian

Petr Baudis wrote:

I have written a quick CUDA implementation of the per-intersection GPGPU approach; Christian's e-mail finally made me polish it up to a somewhat presentable form. In this implementation, each GPU thread maintains a single intersection. The implementation uses a 9x9 board (10x11 internally); expanding to 19x19 would probably mean either moving some data to global memory, or splitting the playout of a single board across multiple blocks, or waiting for GPUs with larger shared memory. ;-)

The speed is unfortunately very disappointing; on a GTX260, I see 17,300 playouts per second, with 50 playouts running in parallel.
There are several buts:
- With some further obvious optimizations, I think it should be possible to get it to at least 22,000 playouts per second easily.
- Bit boards are an obvious thing to try, but I'm quite doubtful they will be an improvement.
- The playouts are uniformly random now, but they have been implemented in a way that makes heavier playouts easy; a precise liberty count is maintained, it should be easy to add pattern matching, and the move selection can deal with arbitrary probabilities for different intersections. (I wonder, what are the standard high-performance numbers for heavier playouts with pattern matching?)
- Christian is playing 2,650 games in parallel; I'm playing only 50 games in parallel, so in the meantime the CPU can pick the next games to play from the tree.
- My code already accounts for transferring board images from the CPU to the GPU (a different one for each playout) and getting scores back.
- Christian's GPU seems quite a bit better than mine.
- Apparently I still suck at CUDA optimizations; this has been my first real small CUDA project, and it would be awesome if someone more experienced could look at the code and suggest obvious improvements.

Still, I'm pretty unhappy with the slowness; I wonder how Christian achieved such a high speed. One problem with my approach is that I have to use a lot of atomic intrinsics (some of them could be worked around) and __syncthreads() all the time to ensure that all threads have a consistent board image at each stage. I think with pattern matching the per-intersection approach could shine more, since I can easily match patterns everywhere on the board at once. Still, I wonder if it will become at least on par with good CPU implementations.

On Wed, Sep 09, 2009 at 04:54:23PM +0100, Christian Nentwich wrote:

In other words: no tree search is involved, and this is the lightest possible playout.
The raw numbers are as follows:
- CPU search: 47,000 playouts per CPU core per second, on an Intel 6600 Core 2 Duo
- GPU search: 170,000 playouts per second, on an NVidia GeForce 285 card

I still find this quite astonishing; since you consider this experiment a failure anyway, would you mind publishing the source code? :-)

--
Christian Nentwich
Director, Model Two Zero Ltd.
+44-(0)7747-061302
http://www.modeltwozero.com
Re: [computer-go] CUDA implementation of the per-intersection GPGPU approach
This is very interesting. Here is a crazy idea; maybe it's the same as Mark's, but I want to take it to its extreme. Since AMAF values are so helpful, perhaps one can let go of the idea of sequential play following the rules of go, and basically play moves in parallel on all empty intersections. Compute the new state (do captures), repeat a fixed number of times, and evaluate.

Since I do not understand GPU programming at all, I cannot judge whether this would make things easier to implement. Also, a lot of weird things will happen when you do things in parallel. What happens when two adjacent blocks are captured at the same time? Are both removed? Are there tricks to remove the one that was most likely to be captured to begin with?

-Magnus

Quoting Christian Nentwich christ...@modeltwozero.com: [...]
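[Editor's note: Magnus's simultaneous-play idea can be sketched as below. This is hypothetical illustration code, not from the thread; `simultaneous_step` and the `S_*` constants are assumed names, and "capture" here naively removes every group of either color that is left without a liberty.]

```c
/* One step of the simultaneous-play scheme: fill every empty
 * intersection with a random-colored stone, then remove all groups
 * (of both colors at once) that have no liberty left. */
#include <stdlib.h>

enum { S_NONE, S_BLACK, S_WHITE };
#define N 9

static const int dx[4] = { 1, -1, 0, 0 };
static const int dy[4] = { 0, 0, 1, -1 };

/* Flood-fill the group containing (x,y), marking its stones in 'grp';
 * return 1 if the group reaches at least one empty point (a liberty). */
static int flood(int b[N][N], int grp[N][N], int x, int y, int color)
{
    if (x < 0 || x >= N || y < 0 || y >= N) return 0;
    if (b[y][x] == S_NONE) return 1;              /* found a liberty */
    if (b[y][x] != color || grp[y][x]) return 0;  /* other color / visited */
    grp[y][x] = 1;
    int libs = 0;
    for (int d = 0; d < 4; d++)
        libs |= flood(b, grp, x + dx[d], y + dy[d], color);
    return libs;
}

void simultaneous_step(int b[N][N])
{
    /* 1. Play on every empty intersection at once, color chosen randomly. */
    for (int y = 0; y < N; y++)
        for (int x = 0; x < N; x++)
            if (b[y][x] == S_NONE)
                b[y][x] = (rand() & 1) ? S_BLACK : S_WHITE;

    /* 2. Mark all liberty-less groups as dead, then remove them together. */
    int dead[N][N] = { 0 };
    for (int y = 0; y < N; y++)
        for (int x = 0; x < N; x++)
            if (b[y][x] != S_NONE) {
                int grp[N][N] = { 0 };
                if (!flood(b, grp, x, y, b[y][x]))
                    for (int gy = 0; gy < N; gy++)
                        for (int gx = 0; gx < N; gx++)
                            dead[gy][gx] |= grp[gy][gx];
            }
    for (int y = 0; y < N; y++)
        for (int x = 0; x < N; x++)
            if (dead[y][x]) b[y][x] = S_NONE;
}
```

Note that starting from an empty board, one such step fills every point and therefore leaves no liberties anywhere, so every group is captured at once; this is exactly the degenerate case Petr raises in his reply further down the thread.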
Re: [computer-go] CUDA implementation of the per-intersection GPGPU approach
Hi!

On Thu, Sep 10, 2009 at 12:26:06PM +0100, Christian Nentwich wrote:

I strongly suspect the low performance in the per-intersection case is down to two reasons - please let me know what you think: - A big proportion of moves place one stone on an empty intersection. In this case 80 of your 81 threads are asleep doing nothing. This means GPU time is wasted.

This is mostly true, though only through part of the loop body; I think this is the biggest reason my approach is slower than yours, though, and it will be much worse on 19x19. It would be interesting to have N threads running independently and getting contracted to various boards for massive tasks like capturing and merging groups. But that would be quite a complex task; I'm not sure it is even practical to consider with the current state of the CUDA sandbox.

- In quite a few other cases, we get to the old problem with Go: that the intersections are not independent. Thus when you process captures or merges, you possibly have to lock 80 of 81 threads.

I do no locking; actually, being able to do large captures and merges very quickly was always my primary motivation for even entertaining this kind of parallelism. However, I do a lot of atomicAdd()s, which have quite an overhead and in some cases basically just serialize many threads. Mainly:

(i) When counting the score. This is totally unoptimized and O(N) now; it should be trivial to rewrite as a parallel sum in O(log N). This is one of the easy optimizations I mentioned, in fact the most obvious one.

(ii) When recalculating liberties. All intersection threads adding an extra liberty to the same group will get serialized. Avoiding this seems difficult.

--
Petr "Pasky" Baudis
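[Editor's note: the O(log N) parallel sum Petr mentions in (i) has this shape. The sketch below serializes the reduction schedule in plain C; on the GPU, each iteration of the outer loop would be one step in which a thread handles each index `i`, followed by a `__syncthreads()` barrier. The function name and padding size are assumptions.]

```c
/* Tree reduction: pad the 81 per-point scores to the next power of two
 * (128) with zeros, then halve the active range log2(128) = 7 times.
 * Each outer-loop pass corresponds to one synchronized parallel step. */
#define BOARD_POINTS 128

int reduce_sum(int v[BOARD_POINTS])
{
    for (int stride = BOARD_POINTS / 2; stride > 0; stride /= 2)
        for (int i = 0; i < stride; i++)   /* on the GPU: one thread per i */
            v[i] += v[i + stride];
    return v[0];
}
```

With one value per intersection (e.g. +1 for a black point, -1 for a white point), this replaces the O(N) serialized atomicAdd() accumulation with 7 barrier-separated steps on a 9x9 board.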
Re: [computer-go] CUDA implementation of the per-intersection GPGPU approach
On Thu, Sep 10, 2009 at 01:54:49PM +0200, Magnus Persson wrote:

[...] What happens when two adjacent blocks are captured at the same time? Are both removed? Are there tricks to remove the one that was most likely to be captured to begin with?

Well, after one iteration _all_ blocks should be captured, since there will be no empty intersections, right? (Even if you don't play in eye-like spaces, you will have this situation until some of these actually form.) The easiest heuristic would seem to be to capture the smallest group, but I have a hard time imagining this working well, _and_ choosing the smallest group efficiently (other than that, modifying my source to do this should be fairly easy).

--
Petr "Pasky" Baudis
Re: [computer-go] CUDA implementation of the per-intersection GPGPU approach
Quoting Petr Baudis pa...@ucw.cz: [...]

--
Magnus Persson
Berlin, Germany
Re: [computer-go] CUDA implementation of the per-intersection GPGPU approach
Sorry for my empty reply to Petr, I misclicked.

When a block is identified, you could also sum up the original strength of the stones covered by the block and use that as a heuristic for what should be captured first.

Also, one can avoid filling all liberties with pattern matching when the original position is copied to the GPU. For example, if black has an empty triangle, black will rather pass than play that point; thus either white plays there or it remains empty. Basically it means we can impose heavy pruning on the original position, then make many copies of that position and evaluate them massively in parallel.

-Magnus

Quoting Petr Baudis pa...@ucw.cz: [...]
--
Magnus Persson
Berlin, Germany
Re: [computer-go] CUDA implementation of the per-intersection GPGPU approach
On Thu, Sep 10, 2009 at 08:29:31AM -0400, Jason House wrote:

I've thought of something similar in the past, but with a twist: pre-compute a subset of moves that could be safely played in parallel. Even if you can only play 285 moves in parallel on an empty 19x19, it could still be a huge speed boost.

Hmm, do you have any ideas for how this pre-computation could be done in less than O(N)? I'm gonna think about this now during some travelling... ;-)

--
Petr "Pasky" Baudis
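[Editor's note: the simplest version of Jason's pre-computation is a greedy pass that picks empty intersections no two of which are adjacent, so that the selected single-stone placements cannot directly touch each other. This is a hypothetical sketch: "non-adjacent" is only the weakest notion of "safely played in parallel" (real safety also depends on nearby groups), and the greedy pass is O(N), not less, so it does not answer Petr's complexity question.]

```c
/* Greedily select a set of pairwise non-adjacent empty intersections.
 * Row-major order means only the left and upper neighbors can already
 * be picked when a point is considered. Returns the number selected. */
#define N 19

int independent_moves(const int board[N][N], int picked[N][N])
{
    int count = 0;
    for (int y = 0; y < N; y++)
        for (int x = 0; x < N; x++) {
            picked[y][x] = 0;
            if (board[y][x] != 0) continue;            /* 0 = empty */
            if (x > 0 && picked[y][x - 1]) continue;   /* left neighbor taken */
            if (y > 0 && picked[y - 1][x]) continue;   /* upper neighbor taken */
            picked[y][x] = 1;
            count++;
        }
    return count;
}
```

On an empty 19x19 board this greedy pass degenerates to a checkerboard and selects 181 of the 361 points, in the same ballpark as the 285 parallel moves mentioned above.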
Re: [computer-go] CUDA implementation of the per-intersection GPGPU approach
Maybe some divide-and-conquer algorithm?

On 10.09.2009 at 14:43, Petr Baudis wrote: [...]
Re: [computer-go] CUDA implementation of the per-intersection GPGPU approach
Since AMAF values are so helpful, perhaps one can let go of the idea of sequential play following the rules of go, and basically play moves in parallel on all empty intersections. Compute new state (do captures) and repeat a fixed number of times and evaluate.

Two thoughts:

i) How do you determine the color to play for each thread?

ii) If i) becomes unsynchronized with the rules of go, you're basically exploring random boards instead of boards that are related to the game you're interested in. The board should rapidly converge to something that retains no information about the initial state of the board (assuming that the initial board has stones on it).

s.
Re: [computer-go] CUDA implementation of the per-intersection GPGPU approach
Somewhat... One could generate a random number (r) and combine it with the move mask (m) as follows: black moves = m & r, white moves = m & ~r. This has the drawback that the number of black and white moves may not be equal. It can be modified to give an equal number of moves, e.g. by requiring that each subset of n intersections generates the same number of black and white moves, or by looping until the condition is met. For looping, you could keep some subset of the generated moves in order to guarantee convergence. For example, if there are 14 more white moves than black moves, keep all generated moves except for 14 white moves. Then you only have to pick 14 moves in the next loop iteration.

Sent from my iPhone

On Sep 10, 2009, at 8:43 AM, Petr Baudis pa...@ucw.cz wrote: [...]
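[Editor's note: Jason's mask formulas look like this on a 64-bit toy bitboard. A real 19x19 board needs several words per mask; the helper names are assumptions, and the count-balancing loop he describes is omitted, so the two sets partition the candidate moves but are generally of unequal size.]

```c
/* Split a legal-move mask m into simultaneous black and white moves
 * using one random bit per intersection: black = m & r, white = m & ~r.
 * Every candidate move goes to exactly one color. */
#include <stdint.h>

static int popcount64(uint64_t x)
{
    int n = 0;
    while (x) { x &= x - 1; n++; }  /* clear lowest set bit per step */
    return n;
}

void split_moves(uint64_t m, uint64_t r,
                 uint64_t *black, uint64_t *white)
{
    *black = m & r;
    *white = m & ~r;
}
```

Comparing `popcount64(*black)` against `popcount64(*white)` gives the imbalance (the "14 more white moves" in the example above) that the balancing loop would then redistribute.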
Re: [computer-go] CUDA implementation of the per-intersection GPGPU approach
Quoting steve uurtamo uurt...@gmail.com:

i) how do you determine color to play for each thread?

Random. Only one random bit is necessary for each decision.

ii) if i) becomes unsynchronized with the rules of go, you're basically exploring random boards instead of boards that are related to the game that you're interested in. the board should rapidly converge to something that retains no information about the initial state of the board (assuming that the initial board has stones on it).

Usual lightweight random games also create situations that never happen in normal play, but violating the rules will probably make this more biased than a usual light playout. Hard to tell before trying. My feeling is that it might work surprisingly well, but it might get extremely biased in some situations, which perhaps can be fixed by pre-filtering the moves that can be played.

Magnus