Re: [ViennaCL-devel] BLAS3, range, slice, compilation time...

2013-08-13 Thread Karl Rupp
Hey, alright, we've got some issues to fight ;-) On GPUs with 16kB of shared memory (e.g. GTX 285), the generated GEMM kernels now exceed the available memory: Log: ptxas error : Entry function 'kernel_0x207f4b0_0' uses too much shared data (0x40a0 bytes + 0x10 bytes system, 0x4000 max) Thi
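
The numbers in the ptxas log above can be checked by hand: 0x40a0 + 0x10 = 0x40b0 (16560 bytes), which is 176 bytes over the 0x4000 (16384 byte) limit. A minimal sketch of that check follows; the helper name `fits_in_local_memory` is hypothetical and not part of ViennaCL.

```cpp
#include <cstddef>

// Hypothetical helper (not a ViennaCL function): does a kernel's shared/local
// memory footprint fit into what the device offers?
//   kernel_bytes: bytes requested by the kernel itself
//   system_bytes: bytes reserved by the runtime
//   device_bytes: shared/local memory available per work group
constexpr bool fits_in_local_memory(std::size_t kernel_bytes,
                                    std::size_t system_bytes,
                                    std::size_t device_bytes)
{
  return kernel_bytes + system_bytes <= device_bytes;
}

// Values taken from the log: 0x40a0 + 0x10 = 0x40b0 > 0x4000, so the kernel does not fit.
static_assert(!fits_in_local_memory(0x40a0, 0x10, 0x4000),
              "the kernel from the log exceeds a 16 kB budget");
```

In an OpenCL setting, `device_bytes` would presumably come from querying `CL_DEVICE_LOCAL_MEM_SIZE` on the device.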

[ViennaCL-devel] Fwd: BLAS3, range, slice, compilation time...

2013-08-13 Thread Philippe Tillet
Oops, again did "reply" instead of "reply to all". :) -- Forwarded message -- From: Philippe Tillet Date: 2013/8/13 Subject: Re: [ViennaCL-devel] BLAS3, range, slice, compilation time... To: Karl Rupp Hey, 2013/8/13 Karl Rupp > Hey, > > alright, we've got some issues to fig

Re: [ViennaCL-devel] Fwd: BLAS3, range, slice, compilation time...

2013-08-13 Thread Karl Rupp
Hi, > On GPUs with 16kB of shared memory (e.g. GTX 285), the generated > GEMM kernels now exceed the available memory: > > Log: ptxas error : Entry function 'kernel_0x207f4b0_0' uses too > much shared data (0x40a0 bytes + 0x10 bytes system, 0x4000 max) > > This is because of

Re: [ViennaCL-devel] Fwd: BLAS3, range, slice, compilation time...

2013-08-13 Thread Philippe Tillet
Hi hi, 2013/8/13 Karl Rupp > Hi, > > > On GPUs with 16kB of shared memory (e.g. GTX 285), the generated > > GEMM kernels now exceed the available memory: > > > > Log: ptxas error : Entry function 'kernel_0x207f4b0_0' uses too > > much shared data (0x40a0 bytes + 0x10 bytes sys

Re: [ViennaCL-devel] Fwd: BLAS3, range, slice, compilation time...

2013-08-13 Thread Karl Rupp
Hi, > We can directly query the available local device memory (which is the > reason why I added all this buffering to the device class). Am I missing > something? > > > Yes, we could. But having the combination {vendor, local memory} seems a > bit weird to me, I think {vendor, genera

Re: [ViennaCL-devel] Fwd: BLAS3, range, slice, compilation time...

2013-08-13 Thread Karl Rupp
Hi again, thanks, the compilation problem is fixed. Unfortunately, there's still the invalid work group size error showing up. Output from viennacl-info: Address Bits: 32 Available: 1 Compiler Available:1 Endian Little: 1 Error Cor

Re: [ViennaCL-devel] Fwd: BLAS3, range, slice, compilation time...

2013-08-13 Thread Philippe Tillet
Hi hi, Yes, the default NVidia profile for double precision uses a work group size of 1024... All this is checked during the autotuning procedure so that it will work for the hardware it's tuned for... Meh, seems like we need a couple additional levels of abstraction to reach safety. Best regard
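
One possible shape for such a safety layer is to validate a profile's work group size against the limit reported by the device before the kernel is launched. The sketch below is an assumption for illustration only: `work_group_profile`, `is_valid_for_device`, and `clamp_to_device` are hypothetical names, and the device limit is passed in as a plain number rather than queried.

```cpp
#include <cstddef>

// Hypothetical profile entry; only the field needed for this check.
struct work_group_profile
{
  std::size_t local_size;  // work group size the profile was tuned with
};

// True if the profile's work group size can be used on a device that
// reports the given maximum work group size.
inline bool is_valid_for_device(work_group_profile const & p,
                                std::size_t max_work_group_size)
{
  return p.local_size > 0 && p.local_size <= max_work_group_size;
}

// Conservative repair: halve the work group size until it fits.
// Halving keeps power-of-two sizes a power of two.
inline work_group_profile clamp_to_device(work_group_profile p,
                                          std::size_t max_work_group_size)
{
  while (p.local_size > 1 && p.local_size > max_work_group_size)
    p.local_size /= 2;
  return p;
}
```

With OpenCL, the limit would typically be read from `CL_DEVICE_MAX_WORK_GROUP_SIZE` (or per kernel from `CL_KERNEL_WORK_GROUP_SIZE`). Note that shrinking a tuned profile changes its performance characteristics, and other kernel parameters may depend on the work group size, so rejecting the profile and falling back may be the safer choice.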

Re: [ViennaCL-devel] Fwd: BLAS3, range, slice, compilation time...

2013-08-13 Thread Philippe Tillet
Hey, 2013/8/13 Karl Rupp > Hi, > > > > We can directly query the available local device memory (which is the > >> reason why I added all this buffering to the device class). Am I >> missing >> something? >> >> >> Yes, we could. But having the combination {vendor, local memory} seems

Re: [ViennaCL-devel] Fwd: BLAS3, range, slice, compilation time...

2013-08-13 Thread Karl Rupp
Hi, > Yes, the default NVidia profile for double precision uses a work group > size of 1024... All this is checked during the autotuning procedure so > that it will work for the hardware it's tuned for... > Meh, seems like we need a couple additional levels of abstraction to > reach safety. In

Re: [ViennaCL-devel] Fwd: BLAS3, range, slice, compilation time...

2013-08-13 Thread Karl Rupp
Hey, > {vendor, generation} is the natural format for handling the > profile internally, yes. This will presumably involve string parsing > of the device name, yes :-( > > > I'll do that :) Should I add a "generation" method in the ocl::device > class? I think it is most suited he

Re: [ViennaCL-devel] Fwd: BLAS3, range, slice, compilation time...

2013-08-13 Thread Philippe Tillet
Hey hey, I've pushed the changes. Does it solve the GTX285 case? The policy is: - One global GPU fallback (very conservative) - One global CPU fallback (very conservative) - One global Accelerator fallback (very conservative) - One fallback per architecture family if the vendor is not i
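
The visible part of this policy could be sketched as a two-level lookup: first try an entry for the {vendor, architecture family} pair, then fall back to the conservative global profile for the device type. This is only an illustration under stated assumptions: all type and member names below are hypothetical, the message is cut off before the exact condition for the per-architecture fallback is given, and the code is not the ViennaCL implementation.

```cpp
#include <cstddef>
#include <map>
#include <string>
#include <utility>

// Hypothetical device categories matching the three global fallbacks.
enum class device_type { gpu, cpu, accelerator };

// Hypothetical profile; a real one would carry more tuning parameters.
struct profile
{
  std::size_t local_size;
};

// Key for per-architecture entries: {vendor, architecture family}.
using arch_key = std::pair<std::string, std::string>;

struct profile_database
{
  std::map<arch_key, profile>    per_architecture;  // one fallback per family
  std::map<device_type, profile> global_fallback;   // very conservative

  // Prefer the architecture-family entry; otherwise use the global
  // fallback for the device type. Throws std::out_of_range if no global
  // fallback has been registered for that type.
  profile lookup(std::string const & vendor,
                 std::string const & family,
                 device_type type) const
  {
    auto it = per_architecture.find(arch_key(vendor, family));
    if (it != per_architecture.end())
      return it->second;
    return global_fallback.at(type);
  }
};
```

Keying on {vendor, family} rather than on the full device name means that devices missing from the database still get a profile, at the cost of possibly not using the fastest one.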

[ViennaCL-devel] Fwd: APPML is now available as open source as clMath

2013-08-13 Thread Karl Rupp
Hi guys, wow, AMD open-sourced their Math libraries... Best regards, Karli --- *AMD Accelerated Parallel Processing Math Libraries (APPML) is now available as open source as clMath.* I am extremely pleased to have the opportunity to announce that the APPML BLAS & FFT proje

Re: [ViennaCL-devel] Fwd: BLAS3, range, slice, compilation time...

2013-08-13 Thread Karl Rupp
Hey, > I've pushed the changes. Does it solve the GTX285 case? thanks, it does! > The policy is : > > - One global GPU fallback (very conservative) > - One global CPU fallback (very conservative) > - One global Accelerator fallback (very conservative) > -One Fallback per architecture family >

Re: [ViennaCL-devel] Fwd: BLAS3, range, slice, compilation time...

2013-08-13 Thread Philippe Tillet
Hi, 2013/8/14 Karl Rupp > Hey, > > > I've pushed the changes. Does it solve the GTX285 case? > > thanks, it does! > > Cool ! > > > The policy is : >> >> - One global GPU fallback (very conservative) >> - One global CPU fallback (very conservative) >> - One global Accelerator fallback (very co

Re: [ViennaCL-devel] Fwd: BLAS3, range, slice, compilation time...

2013-08-13 Thread Karl Rupp
Hi, > Do we want to keep the full device name in the profiles map? With > vendor and arch determined, we know pretty much everything we need > to know. If we need to match the name 1:1, there may be too many > devices which we miss even though the 'faster' profile should work? > >