[computer-go] NVidia Fermi and go?

2009-10-23 Thread Darren Cook
I was reading a Linux Magazine article [1] saying that the latest Nvidia
GPUs [2] solve many of the problems of using them for supercomputing.
There was a thread [3] here in September about running go playouts on
GPUs, where the people who had tried it seemed generally pessimistic. I
just wondered whether this new Fermi GPU solves the issues for go
playouts, or doesn't really make any difference?

Darren

[1]: http://www.linux-mag.com/id/7575
[2]: http://www.nvidia.com/object/io_1254288141829.html
[3]: Starting here:
http://computer-go.org/pipermail/computer-go/2009-September/019422.html

-- 
Darren Cook, Software Researcher/Developer
http://dcook.org/gobet/  (Shodan Go Bet - who will win?)
http://dcook.org/mlsn/ (Multilingual open source semantic network)
http://dcook.org/work/ (About me and my work)
http://dcook.org/blogs.html (My blogs and articles)


Re: [computer-go] NVidia Fermi and go?

2009-10-23 Thread Christian Nentwich

Darren,

these articles are still somewhat short on detail, so it's hard to tell. 
A lot of the new features listed there won't have any impact on the 
suitability of the GPU for Go, because they do not change the method of 
computation (e.g. doubling floating point precision is irrelevant).


Having said that, the parallel data cache they allude to may be 
significant. If this is going to enable the construction of data 
structures such as linked lists, or bring down global memory access time 
significantly, then I believe the performance of playout algorithms on 
the architecture will shoot up.


Christian



Re: [computer-go] NVidia Fermi and go?

2009-10-23 Thread Darren Cook
 these articles are still somewhat short on detail, so it's hard to tell.

Yes, the Linux Magazine article was a bit empty, wasn't it, but did you take
a look at the 20-page whitepaper:
http://www.nvidia.com/content/PDF/fermi_white_papers/NVIDIA_Fermi_Compute_Architecture_Whitepaper.pdf

 Having said that, the parallel data cache they allude to may be
 significant. If this is going to enable the construction of data
 structures such as linked lists, or bring down global memory access time
 significantly, then I believe the performance of playout algorithms on
 the architecture will shoot up.

I only skimmed it very lightly, but page 15 discusses memory and page 16
shows how this gives big speedups for radix sort and fluid simulations.

Darren


-- 
Darren Cook, Software Researcher/Developer
http://dcook.org/gobet/  (Shodan Go Bet - who will win?)
http://dcook.org/mlsn/ (Multilingual open source semantic network)
http://dcook.org/work/ (About me and my work)
http://dcook.org/blogs.html (My blogs and articles)


[computer-go] NVidia Fermi and go?

2009-10-23 Thread Brian Sheppard
I just wondered whether this new Fermi GPU solves the issues for go
playouts, or doesn't really make any difference?

My first impression of Fermi is very positive. Fermi contains a lot of
features that make general purpose computing on a GPU much easier and better
performing.

However, it remains the case that all threads in a warp on a multiprocessor
must execute the same instruction on each cycle. When executing if/then/else
logic, this implies that if *any* thread needs to execute a branch, then
*all* threads in the warp must wait for those instructions to complete.

Playout policies have a lot of if/then/else logic. Sequential processors
handle such code quite well, because most of it isn't executed. But when you
have 32 playouts executing in parallel, then there is a high chance that
both branches will be needed. This really cuts into the potential gain.
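
To make that concrete, here is a toy CUDA sketch of the effect. It is
illustrative only: the flag array and branch bodies are made-up stand-ins for
playout-policy code, and the numbers are arbitrary.

// divergence.cu -- toy illustration of warp divergence in playout-style code.
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

__global__ void playout_step(const int *needs_rare_handling, float *out)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    float v = 0.0f;
    if (needs_rare_handling[tid]) {
        // "Heavy" branch: stand-in for e.g. capture/escape reading.
        for (int i = 0; i < 1000; ++i) v += sinf(v + (float)i);
    } else {
        // "Light" branch: stand-in for picking a random legal move.
        v = (float)tid * 0.001f;
    }
    // Within a warp of 32 threads, if even one thread takes the heavy branch,
    // the other 31 sit idle until it finishes before the light path retires.
    out[tid] = v;
}

int main(void)
{
    const int n = 32 * 1024;
    int *h_flags = (int *)malloc(n * sizeof(int));
    for (int i = 0; i < n; ++i)
        h_flags[i] = (i % 32 == 0);          // one "rare" case per warp: worst case
    int *d_flags; float *d_out;
    cudaMalloc((void **)&d_flags, n * sizeof(int));
    cudaMalloc((void **)&d_out, n * sizeof(float));
    cudaMemcpy(d_flags, h_flags, n * sizeof(int), cudaMemcpyHostToDevice);
    playout_step<<<n / 256, 256>>>(d_flags, d_out);
    cudaDeviceSynchronize();
    float first;
    cudaMemcpy(&first, d_out, sizeof(float), cudaMemcpyDeviceToHost);
    printf("out[0] = %f\n", first);
    cudaFree(d_flags); cudaFree(d_out); free(h_flags);
    return 0;
}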

Amdahl's law is a factor, as well. Amdahl's law says that the gain from
parallelization is limited when some aspects of the solution execute
sequentially. For example, the CPU has to generate positions and transfer
them to the GPU for playout. Generation and transfer are sequential. Because
of such overhead, massively parallel programs generally need very large
speedups in the parallel part just to show an overall gain.
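
For a rough feel of the numbers (purely illustrative, nothing measured),
Amdahl's law gives an overall speedup of 1 / ((1 - p) + p/s) when a fraction
p of the work is sped up by a factor s:

#include <cstdio>

// Amdahl's law: overall speedup when a fraction p of the work is sped up
// by a factor s and the remaining (1 - p) stays sequential.
static double amdahl(double p, double s) { return 1.0 / ((1.0 - p) + p / s); }

int main(void)
{
    // Hypothetical split: playouts on the GPU get a 32x speedup, while
    // position generation and host<->device transfer stay sequential.
    printf("90%% parallel, 32x: %.1fx overall\n", amdahl(0.90, 32.0));  // ~7.8x
    printf("99%% parallel, 32x: %.1fx overall\n", amdahl(0.99, 32.0));  // ~24.4x
    return 0;
}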

Clock speed is also a factor. CPUs execute at over 3 GHz, and because of
speculative execution they often execute more than one instruction per
clock. The GPU generally has a clock rate ~ 1 GHz, and most general purpose
instructions require multiple clocks. So you must have a large parallel
speedup just to break even. (Unless you can exploit some of the specialized
graphics instructions, such as texture mapping, that equate to dozens of
sequential instructions yet execute as a single instruction on the GPU. I
don't think computer Go has that possibility.)

So I am not convinced yet, but Fermi is a big step (really many small steps)
in the right direction.



Re: [computer-go] NVidia Fermi and go?

2009-10-23 Thread ☢ ☠
In my own GPU experiment (light playouts), registers/memory were the
bounding factors on simulation speed. I expected branching to affect it more,
but as long as you have null branches (rather than branches that do
something), the total execution only takes as long as the longest branch,
which turns out to be not that bad. And someone here did a more recent
GPU-go experiment on a bigger card and it worked even better, which I expect
will be the trend for a while.

~

Chase Albert


[computer-go] NVidia Fermi and go?

2009-10-23 Thread Brian Sheppard
In my own GPU experiment (light playouts), registers/memory were the
bounding factors on simulation speed.

I respect your experimental finding, but I note that you have carefully
specified light playouts, probably because you suspect that there may be a
significant difference if playouts are heavy.

I have not done any GPU experiments, so readers should take my guesswork
FWIW. I think the code that is light is the only piece that parallelizes
efficiently. Heavy playouts look for rare but important situations and
handle them using specific knowledge. If a situation is rare then a warp
of 32 playouts won't have many matches, so it will stall the other cores.

I have no data regarding the probability of such stalls in heavy playouts,
but I think they must be frequent. For example, if the ratio of heavy to
light playouts is a four-fold increase in CPU time, then a sequential
program is spending 75% of its time identifying and handling rare
situations.
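
As a back-of-the-envelope illustration of why the stalls should be frequent
(made-up probabilities, not measurements): if a rare situation occurs on a
fraction p of moves, independently across the 32 playouts in a warp, the warp
takes the slow path with probability 1 - (1 - p)^32.

#include <cstdio>
#include <cmath>

int main(void)
{
    // Chance that at least one of 32 parallel playouts hits a "rare" case on
    // a given move, which stalls the whole warp on the slow path.
    double ps[] = { 0.001, 0.01, 0.05 };
    for (int i = 0; i < 3; ++i) {
        double warp_stall = 1.0 - pow(1.0 - ps[i], 32.0);
        printf("per-playout p = %.3f  ->  warp stalls on %.1f%% of moves\n",
               ps[i], 100.0 * warp_stall);   // ~3.2%, ~27.5%, ~80.6%
    }
    return 0;
}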

My opinion is that heavy playouts are necessary, so putting all this
guesswork together: Fermi is probably still not enough, but it takes steps
in the right direction.

BTW, it occurs to me that we can approximate the efficiency of
parallelization by taking execution counts from a profiler and
post-processing them. I should do that before buying a new GPU. :-)




Re: [computer-go] NVidia Fermi and go?

2009-10-23 Thread Petr Baudis
On Fri, Oct 23, 2009 at 01:34:29PM -0600, Brian Sheppard wrote:
 I have not done any GPU experiments, so readers should take my guesswork
 FWIW. I think the code that is light is the only piece that parallelizes
 efficiently. Heavy playouts look for rare but important situations and
 handle them using specific knowledge. If a situation is rare then a warp
 of 32 playouts won't have many matches, so it will stall the other cores.
 
 I have no data regarding the probability of such stalls in heavy playouts,
 but I think they must be frequent. For example, if the ratio of heavy to
 light playouts is a four-fold increase in CPU time, then a sequential
 program is spending 75% of its time identifying and handling rare
 situations.

My experiment used light playouts, but it is ready for increasing the weight
by picking a move according to a probability distribution (which is itself
easy to do without any branch stalls).
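
A minimal sketch of one branch-free way to do that (my guess at what is
meant; the board size, the weights and the name sample_move are made-up
placeholders): every thread scans the whole board, accumulates weights, and
selects the point where the running sum crosses a random threshold, so all
threads execute the same instruction stream regardless of the position.

#include <cstdio>

#define BOARD_POINTS 81   // 9x9 board, purely for illustration

// Branch-free weighted move selection: scan all points and remember the first
// index where the cumulative weight reaches r * total. Every thread executes
// the same instructions, so there is no warp divergence; the data-dependent
// "choice" is typically a predicated select rather than a jump.
__host__ __device__ static int sample_move(const float *weight, float r)
{
    float total = 0.0f;
    for (int p = 0; p < BOARD_POINTS; ++p) total += weight[p];

    float threshold = r * total;
    float cum = 0.0f;
    int chosen = 0;
    for (int p = 0; p < BOARD_POINTS; ++p) {
        cum += weight[p];
        int not_yet = (cum < threshold);    // 1 while still below the threshold
        chosen = not_yet ? p + 1 : chosen;  // select, not a data-dependent branch
    }
    return chosen < BOARD_POINTS ? chosen : BOARD_POINTS - 1;
}

int main(void)
{
    float w[BOARD_POINTS];
    for (int p = 0; p < BOARD_POINTS; ++p)
        w[p] = (p % 7 == 0) ? 0.0f : 1.0f;  // e.g. illegal points get weight 0
    printf("sampled move: %d\n", sample_move(w, 0.42f));
    return 0;
}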

I think adding that to the other experiment (playout-per-thread instead of
intersection-per-thread) wouldn't be too difficult, and then, as long as you
track liberty counts, CrazyStone-style playouts are only a single step away,
I'd guess. I think the problem then is not branch stalls but getting the
bandwidth for pattern matching; I'm wondering whether you could actually use
some clever GPU-specific texture tricks for that, but I haven't looked at it
in depth (yet)...

 BTW, it occurs to me that we can approximate the efficiency of
 parallelization by taking execution counts from a profiler and
 post-processing them. I should do that before buying a new GPU. :-)

I wonder what you mean by that.

-- 
Petr Pasky Baudis
A lot of people have my books on their bookshelves.
That's the problem, they need to read them. -- Don Knuth


[computer-go] NVidia Fermi and go?

2009-10-23 Thread Brian Sheppard
 BTW, it occurs to me that we can approximate the efficiency of
 parallelization by taking execution counts from a profiler and
 post-processing them. I should do that before buying a new GPU. :-)

I wonder what you mean by that.

If you run your program on a sequential machine and count statements then
you can analyze what happens around branches to calculate how your system
will parallelize.

Simple example:

if (foo) {     // 3200 hits
    Bar();     // 3190 hits
} else {
    Baz();     //   10 hits
}

If the 3200 executions were divided into 100 warps of 32, then the 10 hits
on Baz() would be distributed into 9 or 10 different warps. So the Baz code
stalls 10 out of 100 warps. The effect would be as if the Baz code were
executed on 10% of trials rather than 0.3%.

You can calculate the cost of this if-statement as 100% of Bar() + 10% of
Baz() on a GPU. Note that the CPU cost is 99.7% of Bar() + 0.3% of Baz(). If
Baz takes 1000 cycles and Bar takes 10 cycles, then the GPU costs 110 cycles
and the CPU costs 13 cycles. Factor in 32x parallelism and the GPU uses
fewer cycles. But in this case the fact that the CPU's cycles are so much
faster will outweigh parallelism.
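
A sketch of that post-processing (my interpretation of the suggestion,
reusing the made-up hit counts and cycle costs from the example above; a real
run would take the counts from a profiler):

#include <cstdio>
#include <cmath>

// Per-trial cost of a branch on a sequential CPU: hit fraction times its cost.
static double cpu_cost(long hits, long trials, double cycles)
{
    return (double)hits / (double)trials * cycles;
}

// Per-warp-step cost on a 32-wide GPU: the warp pays for the branch whenever
// at least one of its 32 trials hits it.
static double gpu_cost(long hits, long trials, double cycles)
{
    double p = (double)hits / (double)trials;
    double warp_hits = 1.0 - pow(1.0 - p, 32.0);
    return warp_hits * cycles;
}

int main(void)
{
    long trials = 3200;   // profile counts from the if/else example above
    double cpu = cpu_cost(3190, trials, 10.0) + cpu_cost(10, trials, 1000.0);
    double gpu = gpu_cost(3190, trials, 10.0) + gpu_cost(10, trials, 1000.0);
    printf("CPU cycles per trial:       %.1f\n", cpu);       // ~13.1
    printf("GPU cycles per warp step:   %.1f\n", gpu);       // ~105 (vs ~110 by hand)
    printf("GPU cycles per trial (/32): %.1f\n", gpu / 32.0);
    return 0;
}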

Regarding the rest of your post: I think you are onto something. The GPU is
different from the CPU, so maybe a different design is a better match.

You need to express the playout policy as data rather than code. The
downside there is that the memory hierarchy on a GPU imposes huge latencies.
Maybe the caches in Fermi will make a big difference here.
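
One possible reading of that idea (my sketch, with a made-up 3x3 pattern
encoding; nothing here comes from an actual program): fold the policy
knowledge into a weight table indexed by a packed neighbourhood, so the
per-move work is a uniform table lookup rather than a chain of branches, and
the question becomes how fast the memory hierarchy can serve the lookups.

#include <cstdio>

#define PATTERN_STATES 65536   // 4^8: 2 bits per point for the 8 neighbours

// "Policy as data": the knowledge lives in a weight table indexed by a packed
// 3x3 neighbourhood, so every thread does the same load and arithmetic.
__host__ __device__ static float move_weight(const float *table, unsigned pattern)
{
    return table[pattern & (PATTERN_STATES - 1)];   // single uniform load
}

int main(void)
{
    static float table[PATTERN_STATES];
    for (int i = 0; i < PATTERN_STATES; ++i)
        table[i] = 1.0f + (float)(i % 13) * 0.1f;   // stand-in for trained weights
    printf("weight for pattern 0x1234: %f\n", move_weight(table, 0x1234u));
    return 0;
}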





Re: [computer-go] Ginsei-igo 10

2009-10-23 Thread Hideki Kato
Ian Osgood: c60d3feb-cce7-497d-b97e-f62a1452e...@quirkster.com:
On Oct 20, 2009, at 1:57 AM, Hideki Kato wrote:

 A new version of the formerly strongest program Ginsei-igo, Ginsei-igo 10,
 is announced to ship on December 28th.
 http://www.silverstar.co.jp/02products/gigo10/index.html (in Japanese)

 The new Ginsei features a hybrid Monte-Carlo engine.  Its price has not
 been announced, but I've found retail and discounted prices of 13,440
 and 9,899 yen (w/tax), respectively, at an Internet shop.
 
http://www.murauchi.com/MCJ-front-web/CoD/011133148/forwardKey%5B0%5D=cart/forwardKey%5B1%5D=wishList/forwardKey%5B2%5D=compareMyPage/forwardKey%5B3%5D=compareCatalog/forwardName%5B0%5D=COMMODITY_LIST/forwardName%5B1%5D=COMMODITY_LIST/forwardName%5B2%5D=COMMODITY_LIST/forwardName%5B3%5D=COMMODITY_LIST/
 (in Japanese)
 Cf. The prices of Tencho-no-igo (Zenith/Zen) are 13,440 and 9,138 yen
 respectively at the same shop.

 Hideki
 --
 g...@nue.ci.i.u-tokyo.ac.jp (Kato)

Interesting! Do you know whether this is a development of the KCC Igo  
engine by the same team? Or have they given up and replaced their  
classical engine entirely?  Any plans for translated versions for the  
international market?

I still have no info about that, though my bet is that it's the same team
and that they have integrated MCTS into their old engine.  I'll post here
if I find out more.

Hideki

A political question: if it is the same team, regardless of engine,  
would it still be blacklisted from tournaments?

Ian

--
g...@nue.ci.i.u-tokyo.ac.jp (Kato)