Re: D and GPGPU

2015-02-20 Thread francesco.cattoglio via Digitalmars-d

On Wednesday, 18 February 2015 at 18:14:19 UTC, luminousone wrote:
HSA does work with discrete GPUs and not just the embedded stuff, and I believe that HSA can be used to accelerate OpenCL 2.0 via copyless, cache-coherent memory access.


Unless I'm mistaken, it will be more like the opposite: HSA will use OpenCL 2.0 as a backend to do that kind of copyless GPGPU acceleration.


Re: D and GPGPU

2015-02-20 Thread Jacob Carlborg via Digitalmars-d

On 2015-02-18 18:56, ponce wrote:


- the Runtime API abstracts over multi-GPU setups and is the basis for the high-level libraries NVIDIA churns out in trendy domains. (Request to Linux/Mac readers: still searching for the correct library names for Linux :) ).


For OS X:

CUDA Driver: This will install /Library/Frameworks/CUDA.framework and 
the UNIX-compatibility stub /usr/local/cuda/lib/libcuda.dylib that 
refers to it


I would recommend the framework. Make sure the correct path is added; take a look at SDL for an example [1]. You need something like ../Frameworks/CUDA.framework/CUDA to make it possible to bundle the CUDA framework in an application bundle.


[1] 
https://github.com/DerelictOrg/DerelictSDL2/blob/master/source/derelict/sdl2/sdl.d#L42
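
For the curious, a loader following that pattern might look roughly like the sketch below. It assumes DerelictUtil's SharedLibLoader API; apart from the search paths quoted above, all names (the module, the class, the lone cuInit binding) are illustrative only.

// Sketch of a Derelict-style loader for the CUDA driver library,
// modeled on DerelictSDL2's path list [1]. Only the search paths come
// from this thread; the rest is illustrative.
module cuda.loader;

import derelict.util.loader;
import derelict.util.system;

alias CUresult = int;

// One driver entry point, as a placeholder for the full API.
extern(C) @nogc nothrow
{
    alias da_cuInit = CUresult function(uint flags);
}
__gshared da_cuInit cuInit;

static if (Derelict_OS_Mac) {
    // The relative path comes first so the framework can live inside an
    // application bundle, then the system-wide framework, then the
    // UNIX-compatibility stub.
    enum libNames = "../Frameworks/CUDA.framework/CUDA, " ~
                    "/Library/Frameworks/CUDA.framework/CUDA, " ~
                    "/usr/local/cuda/lib/libcuda.dylib";
} else {
    enum libNames = "libcuda.so, libcuda.so.1";
}

class CUDADriverLoader : SharedLibLoader {
    this() { super(libNames); }

    protected override void loadSymbols() {
        bindFunc(cast(void**)&cuInit, "cuInit");
        // ...bind the remaining driver functions the same way.
    }
}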


--
/Jacob Carlborg


Re: D and GPGPU

2015-02-20 Thread luminousone via Digitalmars-d
On Friday, 20 February 2015 at 10:05:34 UTC, francesco.cattoglio wrote:
On Wednesday, 18 February 2015 at 18:14:19 UTC, luminousone wrote:
HSA does work with discrete GPUs and not just the embedded stuff, and I believe that HSA can be used to accelerate OpenCL 2.0 via copyless, cache-coherent memory access.


Unless I'm mistaken, it will be more like the opposite: HSA will use OpenCL 2.0 as a backend to do that kind of copyless GPGPU acceleration.


HSAIL does not depend on OpenCL, and it supports more than copyless GPGPU acceleration; as I said, it has access to virtual memory, including the program stack.


HSA defines changes to the MMU, the IOMMU, and the CPU cache coherency protocol, plus a new bytecode (HSAIL), a software stack built around LLVM, and its own backend in the GPU device driver.


OpenCL 2.0 generally obtains its copyless acceleration by remapping GPU memory into system memory, not from direct access to virtual memory; Intel supports a form of copyless acceleration via this remapping system.


The major difference between the two systems is that HSA can access any arbitrary location in memory, whereas OpenCL must still rely on pointers being mapped before use.


HSA, for example, has complete access to runtime type reflection and vtable pointers; you could have a linked list or a tree that is allocated arbitrarily in memory.
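
To make that last point concrete, here is plain D (not HSA code; there is no D HSA toolchain to show yet) building the kind of pointer-linked structure in question. Under HSA's shared virtual memory a kernel could chase these GC-heap pointers directly, because it shares the process's page tables; under OpenCL the list would first have to be flattened into a buffer the runtime has mapped.

// Illustrative only: an ordinary D linked list at arbitrary heap addresses.
class Node
{
    double value;
    Node next;
}

Node buildList(double[] values)
{
    Node head = null;
    foreach_reverse (v; values)
    {
        auto n = new Node;   // lands at an arbitrary GC-heap address
        n.value = v;
        n.next = head;       // an ordinary virtual-memory pointer
        head = n;
    }
    return head;
}

void main()
{
    auto head = buildList([1.0, 2.0, 3.0]);
    // An HSA kernel sharing this process's virtual memory could start
    // from head and walk .next exactly as this loop does; an OpenCL
    // kernel could not, since these addresses were never mapped.
    for (auto n = head; n !is null; n = n.next) {}
}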


Re: D and GPGPU

2015-02-19 Thread Paulo Pinto via Digitalmars-d

On Wednesday, 18 February 2015 at 18:14:19 UTC, luminousone wrote:
On Wednesday, 18 February 2015 at 15:15:21 UTC, Russel Winder wrote:
It strikes me that D really ought to be able to work with GPGPU – is there already something and I just failed to notice? This is data parallelism, but of a slightly different sort to that in std.parallelism. std.concurrency, std.parallelism, and std.gpgpu ought to be harmonious though.

The issue is to create a GPGPU kernel (usually C code with bizarre data structures and calling conventions), set it running, and then pipe data in and collect data out – currently very slow, but the next generation of Intel chips will fix this (*). And then there is the OpenCL/CUDA debate.

Personally I prefer OpenCL, for all its deficiencies, as it is vendor neutral. CUDA binds you to NVIDIA. Anyway, there is an NVIDIA back end for OpenCL. With a system like PyOpenCL, the infrastructure, data, and process handling is abstracted, but you still have to write the kernels in C. They really ought to do a Python DSL for that, but… So with D, can we write D kernels and have them compiled and loaded using a combination of CTFE, D → C translation, a C compiler call, and other magic?

Is this a GSoC 2015 type thing?

(*) It will be interesting to see how NVIDIA responds to the tack Intel are taking on GPGPU and main memory access.


https://github.com/HSAFoundation

This is really the way to go. Yes, OpenCL and CUDA exist, along with OpenGL/DirectX compute shaders, but pretty much everything out there suffers from giant limitations.

With HSA, HSAIL bytecode is embedded directly into the elf/exe file. HSAIL bytecode can fully support all the features of C++: virtual function lookups in code, access to the stack, cache coherent memory access, the same virtual memory view as the application it runs in, etc.

HSA is implemented in the LLVM backend compiler, and when it is used in an elf/exe file, there is an LLVM-based finalizer that generates GPU bytecode.

More importantly, it should be very easy to implement in any LLVM-supported language once all of the patches are moved upstream to their respective libraries/toolsets.

I believe that Linux kernel 3.19 and above have the IOMMU 2.5 patches, and I think AMD's Radeon KFD driver made it into 3.20. HSA will also be supported by ARM.

HSA is generic enough that, assuming Intel implements similar capabilities in their chips, it ought to be supportable there with or without Intel's direct blessing.

HSA does work with discrete GPUs and not just the embedded stuff, and I believe that HSA can be used to accelerate OpenCL 2.0 via copyless, cache-coherent memory access.


Java will support HSA as of Java 9 or 10, depending on the project's progress.


http://openjdk.java.net/projects/sumatra/

https://wiki.openjdk.java.net/display/Sumatra/Main

--
Paulo


Re: D and GPGPU

2015-02-18 Thread ponce via Digitalmars-d
On Wednesday, 18 February 2015 at 16:03:20 UTC, Laeeth Isharc wrote:
On Wednesday, 18 February 2015 at 15:15:21 UTC, Russel Winder wrote:
It strikes me that D really ought to be able to work with GPGPU – is there already something and I just failed to notice? This is data parallelism, but of a slightly different sort to that in std.parallelism. std.concurrency, std.parallelism, and std.gpgpu ought to be harmonious though.

The issue is to create a GPGPU kernel (usually C code with bizarre data structures and calling conventions), set it running, and then pipe data in and collect data out – currently very slow, but the next generation of Intel chips will fix this (*). And then there is the OpenCL/CUDA debate.

Personally I prefer OpenCL, for all its deficiencies, as it is vendor neutral. CUDA binds you to NVIDIA. Anyway, there is an NVIDIA back end for OpenCL. With a system like PyOpenCL, the infrastructure, data, and process handling is abstracted, but you still have to write the kernels in C. They really ought to do a Python DSL for that, but… So with D, can we write D kernels and have them compiled and loaded using a combination of CTFE, D → C translation, a C compiler call, and other magic?

Is this a GSoC 2015 type thing?

(*) It will be interesting to see how NVIDIA responds to the tack Intel are taking on GPGPU and main memory access.


I agree it would be very helpful.

I have this on my to-look-at list and don't yet know exactly what it does and doesn't do:

http://code.dlang.org/packages/derelict-cuda


What it does is provide access to the most useful part of the CUDA API, which is two-headed:


- the Driver API provides the most control over the GPU, and I would recommend this one. If you are using CUDA you probably want top efficiency and control.


- the Runtime API abstracts over multi-GPU setups and is the basis for the high-level libraries NVIDIA churns out in trendy domains. (Request to Linux/Mac readers: still searching for the correct library names for Linux :) ).


When using DerelictCUDA, you still need nvcc to compile your .cu files and then load them at run time.
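
A sketch of that workflow, assuming DerelictCUDA exposes the usual driver entry points (cuInit, cuModuleLoad, cuLaunchKernel, ...); the kernel, the file names, and the omission of error checking are all illustrative.

// saxpy.cu, compiled offline with: nvcc -ptx saxpy.cu -o saxpy.ptx
//   extern "C" __global__ void saxpy(float a, float* x, float* y, int n)
//   {
//       int i = blockIdx.x * blockDim.x + threadIdx.x;
//       if (i < n) y[i] = a * x[i] + y[i];
//   }
import derelict.cuda;

void main()
{
    DerelictCUDADriver.load();

    // Boilerplate: pick device 0 and create a context on it.
    cuInit(0);
    CUdevice dev;
    cuDeviceGet(&dev, 0);
    CUcontext ctx;
    cuCtxCreate(&ctx, 0, dev);

    // Load the PTX that nvcc produced and look the kernel up by name.
    CUmodule mod;
    CUfunction fun;
    cuModuleLoad(&mod, "saxpy.ptx");
    cuModuleGetFunction(&fun, mod, "saxpy");

    // Host data, copied into two device buffers.
    enum n = 1024;
    auto x = new float[n];
    auto y = new float[n];
    x[] = 1.0f;
    y[] = 2.0f;
    CUdeviceptr dx, dy;
    cuMemAlloc(&dx, n * float.sizeof);
    cuMemAlloc(&dy, n * float.sizeof);
    cuMemcpyHtoD(dx, x.ptr, n * float.sizeof);
    cuMemcpyHtoD(dy, y.ptr, n * float.sizeof);

    // Launch 4 blocks of 256 threads to cover the 1024 elements.
    float a = 3.0f;
    int len = n;
    void*[4] args = [&a, &dx, &dy, &len];
    cuLaunchKernel(fun, 4, 1, 1, 256, 1, 1, 0, null, args.ptr, null);

    // Copy the result back; every y[i] is now 5.0f.
    cuMemcpyDtoH(y.ptr, dy, n * float.sizeof);
}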


This is less easy than using the NVIDIA SDK, which will eventually allow combining GPU and CPU code in the same source file.
Apart from that, this is 2015 and I see little reason to start new projects in CUDA with the advent of OpenCL 2.0 drivers.


Re: D and GPGPU

2015-02-18 Thread luminousone via Digitalmars-d
On Wednesday, 18 February 2015 at 15:15:21 UTC, Russel Winder wrote:
It strikes me that D really ought to be able to work with GPGPU – is there already something and I just failed to notice? This is data parallelism, but of a slightly different sort to that in std.parallelism. std.concurrency, std.parallelism, and std.gpgpu ought to be harmonious though.

The issue is to create a GPGPU kernel (usually C code with bizarre data structures and calling conventions), set it running, and then pipe data in and collect data out – currently very slow, but the next generation of Intel chips will fix this (*). And then there is the OpenCL/CUDA debate.

Personally I prefer OpenCL, for all its deficiencies, as it is vendor neutral. CUDA binds you to NVIDIA. Anyway, there is an NVIDIA back end for OpenCL. With a system like PyOpenCL, the infrastructure, data, and process handling is abstracted, but you still have to write the kernels in C. They really ought to do a Python DSL for that, but… So with D, can we write D kernels and have them compiled and loaded using a combination of CTFE, D → C translation, a C compiler call, and other magic?

Is this a GSoC 2015 type thing?

(*) It will be interesting to see how NVIDIA responds to the tack Intel are taking on GPGPU and main memory access.


https://github.com/HSAFoundation

This is really the way to go. Yes, OpenCL and CUDA exist, along with OpenGL/DirectX compute shaders, but pretty much everything out there suffers from giant limitations.

With HSA, HSAIL bytecode is embedded directly into the elf/exe file. HSAIL bytecode can fully support all the features of C++: virtual function lookups in code, access to the stack, cache coherent memory access, the same virtual memory view as the application it runs in, etc.

HSA is implemented in the LLVM backend compiler, and when it is used in an elf/exe file, there is an LLVM-based finalizer that generates GPU bytecode.

More importantly, it should be very easy to implement in any LLVM-supported language once all of the patches are moved upstream to their respective libraries/toolsets.

I believe that Linux kernel 3.19 and above have the IOMMU 2.5 patches, and I think AMD's Radeon KFD driver made it into 3.20. HSA will also be supported by ARM.

HSA is generic enough that, assuming Intel implements similar capabilities in their chips, it ought to be supportable there with or without Intel's direct blessing.

HSA does work with discrete GPUs and not just the embedded stuff, and I believe that HSA can be used to accelerate OpenCL 2.0 via copyless, cache-coherent memory access.


Re: D and GPGPU

2015-02-18 Thread ponce via Digitalmars-d
On Wednesday, 18 February 2015 at 15:15:21 UTC, Russel Winder wrote:

The issue is to create a GPGPU kernel (usually C code with bizarre data structures and calling conventions), set it running, and then pipe data in and collect data out – currently very slow, but the next generation of Intel chips will fix this (*). And then there is the OpenCL/CUDA debate.

Personally I prefer OpenCL, for all its deficiencies, as it is vendor neutral. CUDA binds you to NVIDIA. Anyway, there is an NVIDIA back end for OpenCL. With a system like PyOpenCL, the infrastructure, data, and process handling is abstracted, but you still have to write the kernels in C. They really ought to do a Python DSL for that, but… So with D, can we write D kernels and have them compiled and loaded using a combination of CTFE, D → C translation, a C compiler call, and other magic?


I'd like to talk about the kernel languages (having done both OpenCL and CUDA).


A big speed-up factor is the multiple levels of parallelism exposed in OpenCL C and CUDA C:


- context parallelism (e.g. several GPUs)
- command parallelism (based on a future model)
- block parallelism
- warp/sub-block parallelism
- in each sub-block, N threads (typically 32 or 64)

All of that is supported by appropriate barrier semantics. Typical C-like code only has threads as parallelism and a less complex cache.


Also, most algorithms don't translate all that well to SIMD threads working in lockstep.


Example: instead of looping over that 2D image and performing a horizontal blur over 15 pixels, perform this operation on 32x16 blocks simultaneously, while caching stuff in block-local memory.
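
For illustration, roughly what that looks like at the kernel level: each 32x16 block stages its tile of the image, plus the blur apron, into fast block-local (__shared__) memory, hits a barrier, and then each thread blurs one pixel. The CUDA C source is kept in a D string so it can be written out and handed to nvcc; the tile sizes and all names are illustrative.

enum hblurCu = `
#define RADIUS 7      /* a 15-pixel horizontal blur window */
#define TILE_W 32
#define TILE_H 16

extern "C" __global__
void hblur15(const float* src, float* dst, int w, int h)
{
    /* Block-local cache: one tile plus RADIUS pixels on each side. */
    __shared__ float tile[TILE_H][TILE_W + 2 * RADIUS];

    int x = blockIdx.x * TILE_W + threadIdx.x;
    int y = blockIdx.y * TILE_H + threadIdx.y;
    int cy = min(y, h - 1);   /* clamp instead of returning early, so   */
                              /* every thread reaches the barrier below */

    /* Cooperatively stage the tile and its apron, clamped at edges. */
    for (int tx = threadIdx.x; tx < TILE_W + 2 * RADIUS; tx += TILE_W) {
        int sx = blockIdx.x * TILE_W + tx - RADIUS;
        sx = max(0, min(sx, w - 1));
        tile[threadIdx.y][tx] = src[cy * w + sx];
    }
    __syncthreads();   /* barrier: the whole tile must be in place */

    if (x < w && y < h) {
        float sum = 0.0f;
        for (int k = 0; k < 2 * RADIUS + 1; ++k)
            sum += tile[threadIdx.y][threadIdx.x + k];
        dst[y * w + x] = sum / (2 * RADIUS + 1);
    }
}
`;

void main()
{
    import std.file : write;
    write("hblur.cu", hblurCu);   // then: nvcc -ptx hblur.cu
}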


It is much like an auto-vectorization problem and 
auto-vectorization is hard.


Re: D and GPGPU

2015-02-18 Thread Laeeth Isharc via Digitalmars-d
On Wednesday, 18 February 2015 at 15:15:21 UTC, Russel Winder wrote:
It strikes me that D really ought to be able to work with GPGPU – is there already something and I just failed to notice? This is data parallelism, but of a slightly different sort to that in std.parallelism. std.concurrency, std.parallelism, and std.gpgpu ought to be harmonious though.

The issue is to create a GPGPU kernel (usually C code with bizarre data structures and calling conventions), set it running, and then pipe data in and collect data out – currently very slow, but the next generation of Intel chips will fix this (*). And then there is the OpenCL/CUDA debate.

Personally I prefer OpenCL, for all its deficiencies, as it is vendor neutral. CUDA binds you to NVIDIA. Anyway, there is an NVIDIA back end for OpenCL. With a system like PyOpenCL, the infrastructure, data, and process handling is abstracted, but you still have to write the kernels in C. They really ought to do a Python DSL for that, but… So with D, can we write D kernels and have them compiled and loaded using a combination of CTFE, D → C translation, a C compiler call, and other magic?

Is this a GSoC 2015 type thing?

(*) It will be interesting to see how NVIDIA responds to the tack Intel are taking on GPGPU and main memory access.


I agree it would be very helpful.

I have this on my to-look-at list and don't yet know exactly what it does and doesn't do:

http://code.dlang.org/packages/derelict-cuda


Re: D and GPGPU

2015-02-18 Thread Laeeth Isharc via Digitalmars-d
One interesting C++ use of CUDA in finance: Joshi is porting QuantLib, or at least part of it, to a CUDA environment, with some nice speed-ups for Bermudan pricing.


http://sourceforge.net/projects/kooderive/
http://papers.ssrn.com/sol3/papers.cfm?abstract_id=1473563


Re: D and GPGPU

2015-02-18 Thread John Colvin via Digitalmars-d
On Wednesday, 18 February 2015 at 15:15:21 UTC, Russel Winder wrote:
It strikes me that D really ought to be able to work with GPGPU – is there already something and I just failed to notice? This is data parallelism, but of a slightly different sort to that in std.parallelism. std.concurrency, std.parallelism, and std.gpgpu ought to be harmonious though.

The issue is to create a GPGPU kernel (usually C code with bizarre data structures and calling conventions), set it running, and then pipe data in and collect data out – currently very slow, but the next generation of Intel chips will fix this (*). And then there is the OpenCL/CUDA debate.

Personally I prefer OpenCL, for all its deficiencies, as it is vendor neutral. CUDA binds you to NVIDIA. Anyway, there is an NVIDIA back end for OpenCL. With a system like PyOpenCL, the infrastructure, data, and process handling is abstracted, but you still have to write the kernels in C. They really ought to do a Python DSL for that, but… So with D, can we write D kernels and have them compiled and loaded using a combination of CTFE, D → C translation, a C compiler call, and other magic?

Is this a GSoC 2015 type thing?

(*) It will be interesting to see how NVIDIA responds to the tack Intel are taking on GPGPU and main memory access.


It would be great if LDC could do this using 
https://www.khronos.org/spir
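
LDC cannot emit SPIR today, but consuming a pre-built SPIR binary from D already works through the plain OpenCL C API, which gives an idea of what an LDC backend would plug into. A minimal sketch, with the handful of declarations it needs inlined (a binding such as DerelictCL would provide the same); "kernel.spir" and the "-x spir" build flag follow the Khronos SPIR 1.2 convention, and error handling is omitted.

import std.file : read;

// Just enough of the OpenCL C API for this sketch.
alias cl_platform_id = void*;
alias cl_device_id   = void*;
alias cl_context     = void*;
alias cl_program     = void*;
alias cl_int  = int;
alias cl_uint = uint;

extern(C) @nogc nothrow
{
    cl_int clGetPlatformIDs(cl_uint, cl_platform_id*, cl_uint*);
    cl_int clGetDeviceIDs(cl_platform_id, ulong, cl_uint, cl_device_id*, cl_uint*);
    cl_context clCreateContext(const(ptrdiff_t)*, cl_uint, const(cl_device_id)*,
                               void*, void*, cl_int*);
    cl_program clCreateProgramWithBinary(cl_context, cl_uint, const(cl_device_id)*,
                                         const(size_t)*, const(ubyte*)*,
                                         cl_int*, cl_int*);
    cl_int clBuildProgram(cl_program, cl_uint, const(cl_device_id)*,
                          const(char)*, void*, void*);
}

enum ulong CL_DEVICE_TYPE_GPU = 1 << 2;

void main()
{
    // Boilerplate: first platform, first GPU, one context.
    cl_platform_id platform;
    cl_device_id device;
    clGetPlatformIDs(1, &platform, null);
    clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &device, null);
    cl_int err;
    auto ctx = clCreateContext(null, 1, &device, null, null, &err);

    // Hand the pre-built SPIR module to the driver; for SPIR the
    // "build" step is essentially finalization for the device.
    auto spir = cast(ubyte[]) read("kernel.spir");
    size_t len = spir.length;
    auto bits = spir.ptr;
    auto prog = clCreateProgramWithBinary(ctx, 1, &device, &len,
                                          &bits, null, &err);
    clBuildProgram(prog, 1, &device, "-x spir", null, null);
}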