[computer-go] NVidia Fermi and go?
I was reading a Linux Magazine article [1] saying that the latest NVidia GPUs [2] solve many of the problems of using them for supercomputing. There was a thread [3] here in September about running go playouts on GPUs, where the people who had tried it seemed generally pessimistic. I just wondered: does this new Fermi GPU solve the issues for go playouts, or does it not really make any difference?

Darren

[1]: http://www.linux-mag.com/id/7575
[2]: http://www.nvidia.com/object/io_1254288141829.html
[3]: Starting here: http://computer-go.org/pipermail/computer-go/2009-September/019422.html

--
Darren Cook, Software Researcher/Developer
http://dcook.org/gobet/ (Shodan Go Bet - who will win?)
http://dcook.org/mlsn/ (Multilingual open source semantic network)
http://dcook.org/work/ (About me and my work)
http://dcook.org/blogs.html (My blogs and articles)

___
computer-go mailing list
computer-go@computer-go.org
http://www.computer-go.org/mailman/listinfo/computer-go/
Re: [computer-go] NVidia Fermi and go?
Darren,

these articles are still somewhat short on detail, so it's hard to tell. A lot of the new features listed there won't have any impact on the suitability of the GPU for Go, because they do not change the method of computation (e.g. doubled floating-point precision is irrelevant). Having said that, the parallel data cache they allude to may be significant. If it enables the construction of data structures such as linked lists, or brings global memory access times down significantly, then I believe the performance of playout algorithms on this architecture will shoot up.

Christian

On 23/10/2009 09:28, Darren Cook wrote:
> I just wondered if this new Fermi GPU solves the issues for go
> playouts, or doesn't really make any difference?
Re: [computer-go] NVidia Fermi and go?
> these articles are still somewhat short on detail, so it's hard to tell.

Yes, the Linux Magazine article was a bit empty, wasn't it? But did you take a look at the 20-page whitepaper:
http://www.nvidia.com/content/PDF/fermi_white_papers/NVIDIA_Fermi_Compute_Architecture_Whitepaper.pdf

> Having said that, the parallel data cache they allude to may be
> significant. If it enables the construction of data structures such
> as linked lists, or brings global memory access times down
> significantly, then I believe the performance of playout algorithms
> on this architecture will shoot up.

I only skimmed it very lightly, but page 15 discusses memory, and page 16 shows how this gives big speedups for radix sort and fluid simulations.

Darren
[computer-go] NVidia Fermi and go?
> I just wondered if this new Fermi GPU solves the issues for go
> playouts, or doesn't really make any difference?

My first impression of Fermi is very positive. Fermi contains a lot of features that make general-purpose computing on a GPU much easier and better performing.

However, it remains the case that all kernels on a multiprocessor must execute the same instruction on each cycle. When executing if/then/else logic, this implies that if *any* core needs to execute a branch, then *all* cores must wait for those instructions to complete. Playout policies have a lot of if/then/else logic. Sequential processors handle such code quite well, because most of it isn't executed. But when you have 32 playouts executing in parallel, there is a high chance that both branches will be needed. This really cuts into the potential gain.

Amdahl's law is a factor as well. Amdahl's law says that the gain from parallelization is limited when some aspects of the solution execute sequentially. For example, the CPU has to generate positions and transfer them to the GPU for playout. Generation and transfer are sequential. Because of such overhead, massively parallel programs generally need very large speedups from parallelization.

Clock speed is also a factor. CPUs execute at over 3 GHz, and because of speculative execution they often execute more than one instruction per clock. The GPU generally has a clock rate of ~1 GHz, and most general-purpose instructions require multiple clocks. So you must have a large parallel speedup just to break even. (Unless you can exploit some of the specialized graphics instructions, such as texture mapping, that equate to dozens of sequential instructions yet execute as a single instruction on the GPU. I don't think computer Go has that possibility.)

So I am not convinced yet, but Fermi is a big step (really many small steps) in the right direction.
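The Amdahl's-law and clock-speed arguments above can be put into rough numbers. All concrete figures below (serial fraction, clock rates, cycles per instruction, core count) are illustrative assumptions, not measurements of any real hardware:

```python
# Back-of-envelope sketch of the two bottlenecks described above.
# All concrete numbers are assumptions chosen for illustration.

def amdahl_speedup(serial_fraction, n_cores):
    """Best-case speedup when serial_fraction of the work stays sequential."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / n_cores)

# Even with 512 cores, a 5% sequential portion (position generation
# plus host-to-GPU transfer) caps the overall speedup near 19x.
print(round(amdahl_speedup(0.05, 512), 1))  # 19.3

# Clock-rate handicap: a 3 GHz CPU retiring ~1 instruction per clock
# versus a ~1 GHz GPU core needing ~4 clocks per instruction.
cpu_instr_per_sec = 3.0e9 * 1.0
gpu_instr_per_sec_per_core = 1.0e9 / 4.0
print(cpu_instr_per_sec / gpu_instr_per_sec_per_core)  # 12.0 cores to break even
```

Under these assumed numbers, roughly a dozen GPU cores would be needed just to match one CPU core, before divergence or serial overhead is even counted.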
Re: [computer-go] NVidia Fermi and go?
In my own GPU experiment (light playouts), registers/memory were the bounding factors on simulation speed. I expected branching to affect it more, but as long as you have null branches (instead of branches that do something), the total execution only takes as long as the longest branch, which turns out to be not that bad. And someone here did a more recent GPU-go experiment on a bigger card and it worked even better, which I expect will be the trend for a while.

~ Chase Albert

On Fri, Oct 23, 2009 at 12:07, Brian Sheppard sheppar...@aol.com wrote:
> My first impression of Fermi is very positive. [...]
> So I am not convinced yet, but Fermi is a big step (really many small
> steps) in the right direction.
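Chase's observation about null branches can be sketched with a tiny cost model. This is a hypothetical SIMT cost model for illustration only (no real card is being measured): a warp serializes every distinct branch that any of its lanes takes, and a null branch contributes nothing.

```python
# Toy SIMT cost model (assumed, not measured): under lockstep
# execution a warp pays, in sequence, for every distinct branch any
# of its 32 lanes takes; idle lanes simply wait.

def warp_cost(lane_branches, branch_cost):
    """lane_branches: the branch taken by each lane of the warp.
    branch_cost: cycles each branch body costs (0 for a null branch)."""
    return sum(branch_cost[b] for b in set(lane_branches))

costs = {"work": 10, "other_work": 10, "null": 0}

# Both branches do work and one lane diverges: the warp pays for both.
print(warp_cost(["work"] * 31 + ["other_work"], costs))  # 20

# The else-branch is null: divergence costs nothing extra, and the
# warp only takes as long as its longest (and only) real branch.
print(warp_cost(["work"] * 31 + ["null"], costs))        # 10
```

This matches Chase's finding: divergence only hurts when both sides of the branch do real work.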
[computer-go] NVidia Fermi and go?
> In my own GPU experiment (light playouts), registers/memory were the
> bounding factors on simulation speed.

I respect your experimental finding, but I note that you carefully specified light playouts, probably because you suspect that there may be a significant difference if the playouts are heavy.

I have not done any GPU experiments, so readers should take my guesswork FWIW. I think light code is the only kind that parallelizes efficiently. Heavy playouts look for rare but important situations and handle them using specific knowledge. If a situation is rare, then a warp of 32 playouts won't have many matches, so it will stall the other cores.

I have no data regarding the probability of such stalls in heavy playouts, but I think they must be frequent. For example, if the ratio of heavy to light playouts is a four-fold increase in CPU time, then a sequential program is spending 75% of its time identifying and handling rare situations.

My opinion is that heavy playouts are necessary, so if you put all this guesswork together you come up with my opinion: Fermi is probably still not enough, but there are steps in the right direction.

BTW, it occurs to me that we can approximate the efficiency of parallelization by taking execution counts from a profiler and post-processing them. I should do that before buying a new GPU. :-)
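Brian's two estimates are easy to check with back-of-envelope arithmetic. The 1-in-300 match rate below is a made-up figure in the spirit of his example, not data from any program:

```python
# Illustrative arithmetic behind the two claims above; the rare-pattern
# match probability is an assumption, not a measurement.

# 1. If heavy playouts cost 4x light ones, the knowledge code accounts
#    for (4 - 1) / 4 = 75% of sequential runtime.
heavy_over_light = 4.0
knowledge_fraction = (heavy_over_light - 1.0) / heavy_over_light
print(knowledge_fraction)  # 0.75

# 2. A rare pattern matched by roughly one playout in 300 still stalls
#    about one warp in ten, because all 32 lanes wait on any one match.
p_match = 1.0 / 300.0
p_warp_stall = 1.0 - (1.0 - p_match) ** 32
print(round(p_warp_stall, 3))  # 0.101
```

So even a situation that an individual playout almost never sees can stall a non-trivial fraction of warps once 32 playouts run in lockstep.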
Re: [computer-go] NVidia Fermi and go?
On Fri, Oct 23, 2009 at 01:34:29PM -0600, Brian Sheppard wrote:
> I have not done any GPU experiments, so readers should take my
> guesswork FWIW. I think light code is the only kind that parallelizes
> efficiently. Heavy playouts look for rare but important situations
> and handle them using specific knowledge. If a situation is rare,
> then a warp of 32 playouts won't have many matches, so it will stall
> the other cores. I have no data regarding the probability of such
> stalls in heavy playouts, but I think they must be frequent. For
> example, if the ratio of heavy to light playouts is a four-fold
> increase in CPU time, then a sequential program is spending 75% of
> its time identifying and handling rare situations.

My experiment used light playouts, but it was ready for increasing weight by picking a move according to a probability distribution (which is easy to do without any branch stalls itself). I think adding that to the other experiment (playout-per-thread instead of intersection-per-thread) wouldn't be too difficult, and then, as long as you track liberty counts, CrazyStone-style playouts are just a single step away, I'd guess. I think the problem then is not branch stalls, but getting the bandwidth for pattern matching; I'm wondering if you could actually use some clever GPU-specific texture tricks for that, but I haven't looked into it in depth (yet)...

> BTW, it occurs to me that we can approximate the efficiency of
> parallelization by taking execution counts from a profiler and
> post-processing them. I should do that before buying a new GPU. :-)

I wonder what you mean by that.

--
Petr "Pasky" Baudis
A lot of people have my books on their bookshelves.
That's the problem, they need to read them. -- Don Knuth
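Petr's stall-free move selection can be sketched as a branch-free cumulative-sum scan. The Python below only models the arithmetic (on a GPU, each thread would run one such loop, with the comparison compiled to a predicated update rather than a divergent branch); the move weights are hypothetical:

```python
# Sketch of branch-free selection from a probability distribution.
# Every lane executes the identical straight-line loop, so a warp
# never diverges; the weights below are hypothetical.

def pick_move(weights, r):
    """Select an index with probability proportional to its weight.
    r is a uniform random number in [0, 1).  The comparison feeds an
    add instead of an early exit, so all elements are always scanned."""
    threshold = r * sum(weights)
    cumulative = 0.0
    index = 0
    for w in weights:
        cumulative += w
        index += int(cumulative <= threshold)  # predicated update, no branch
    return index

weights = [1.0, 3.0, 6.0]
print(pick_move(weights, 0.05))  # 0  (threshold 0.5 falls in the first slot)
print(pick_move(weights, 0.30))  # 1  (threshold 3.0 falls in the second)
print(pick_move(weights, 0.99))  # 2  (threshold 9.9 falls in the last)
```

The trade-off is that every lane always scans the full candidate list, which costs more work per playout but keeps the warp in lockstep.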
[computer-go] NVidia Fermi and go?
> > BTW, it occurs to me that we can approximate the efficiency of
> > parallelization by taking execution counts from a profiler and
> > post-processing them. I should do that before buying a new GPU. :-)
>
> I wonder what you mean by that.

If you run your program on a sequential machine and count statements, then you can analyze what happens around branches to calculate how your system will parallelize. Simple example:

    If (foo) {    // 3200 hits
        Bar();    // 3190 hits
    } Else {
        Baz();    // 10 hits
    }

If the 3200 executions were divided into 100 warps of 32, then the 10 hits on Baz() would be distributed into 9 or 10 different warps. So the Baz code stalls 10 out of 100 warps. The effect is as if the Baz code were executed on 10% of trials rather than 0.3%.

You can calculate the cost of this if-statement as 100% of Bar() + 10% of Baz() on a GPU. Note that the CPU cost is 99.7% of Bar() + 0.3% of Baz(). If Baz takes 1000 cycles and Bar takes 10 cycles, then the GPU costs 110 cycles and the CPU costs 13 cycles. Factor in 32x parallelism and the GPU uses fewer cycles. But in this case the fact that the CPU's cycles are so much faster will outweigh the parallelism.

Regarding the rest of your post: I think you are onto something. The GPU is different from the CPU, so maybe a different design is a better match. You need to express the playout policy as data rather than code. The downside there is that the memory hierarchy on a GPU imposes huge latencies. Maybe the caches in Fermi will make a big difference here.
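The profiler post-processing Brian describes can be reproduced directly from the counts in his Bar/Baz example; a minimal sketch, using only the hit counts and cycle costs given in the text:

```python
# Post-processing sequential profiler counts to estimate warp cost,
# using the Bar/Baz numbers from the message above.

hits_total, hits_bar, hits_baz = 3200, 3190, 10
warp_size, cycles_bar, cycles_baz = 32, 10, 1000

n_warps = hits_total // warp_size          # 100 warps of 32
stalled = min(hits_baz, n_warps)           # at most 10 warps touch Baz

# GPU: every warp runs Bar, and a stalled warp also waits out Baz.
gpu_cost = cycles_bar + (stalled / n_warps) * cycles_baz
# CPU: each iteration runs exactly one of the two branches.
cpu_cost = (hits_bar * cycles_bar + hits_baz * cycles_baz) / hits_total

print(gpu_cost)            # 110.0 cycles per warp pass
print(round(cpu_cost, 1))  # 13.1 cycles per sequential iteration
```

The same post-processing could be applied branch by branch across a whole playout policy to estimate its warp efficiency before writing any GPU code.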
Re: [computer-go] Ginsei-igo 10
Ian Osgood (c60d3feb-cce7-497d-b97e-f62a1452e...@quirkster.com) wrote:
> On Oct 20, 2009, at 1:57 AM, Hideki Kato wrote:
>> A new version of the ex-strongest Ginsei-igo, Ginsei-igo 10, is
>> announced to be shipped on December 28th.
>> http://www.silverstar.co.jp/02products/gigo10/index.html (in Japanese)
>> The new Ginsei features a hybrid Monte-Carlo engine. Its price has not
>> been announced, but I've found the retail and discount prices are
>> 13,440 and 9,899 yen (w/tax), respectively, at an Internet shop.
>> http://www.murauchi.com/MCJ-front-web/CoD/011133148/forwardKey%5B0%5D=cart/forwardKey%5B1%5D=wishList/forwardKey%5B2%5D=compareMyPage/forwardKey%5B3%5D=compareCatalog/forwardName%5B0%5D=COMMODITY_LIST/forwardName%5B1%5D=COMMODITY_LIST/forwardName%5B2%5D=COMMODITY_LIST/forwardName%5B3%5D=COMMODITY_LIST/ (in Japanese)
>> Cf. the prices of Tencho-no-igo (Zenith/Zen) are 13,440 and 9,138 yen,
>> respectively, at the same shop.
>
> Interesting! Do you know whether this is a development of the KCC Igo
> engine by the same team? Or have they given up and replaced their
> classical engine entirely? Any plans for translated versions for the
> international market?

I still have no info about that, though my bet is that it's the same team and that they have integrated MCTS into their old engine. I'll post here if I find out more.

Hideki

> A political question: if it is the same team, regardless of engine,
> would it still be blacklisted from tournaments?
>
> Ian

--
g...@nue.ci.i.u-tokyo.ac.jp (Kato)