Re: [HACKERS] CUDA Sorting

2012-02-15 Thread Marti Raudsepp
On Mon, Feb 13, 2012 at 20:48, Greg Stark st...@mit.edu wrote:
 I don't think we should be looking at either CUDA or OpenCL directly.
 We should be looking for a generic library that can target either and
 is well maintained and actively developed.

I understand your point about using some external library for the
primitives, but I don't see why it needs to support both CUDA and
OpenCL. Libraries for GPU-accelerated primitives generally target
OpenCL *or* CUDA, not both.

As far as I understand (and someone correct me if I'm wrong), the
difference between them is mostly the API and the fact that CUDA had a
head start, and thus a larger developer community around it. (All the
early adopters went to CUDA)

But OpenCL already acts as an abstraction layer. CUDA is
NVIDIA-specific, but OpenCL is supported by AMD and Intel as well as
NVIDIA. It's pretty rare for servers to have separate graphics cards,
but recent Intel and AMD CPUs already have a GPU included on die,
which is another bonus for OpenCL.

So I'd say, the way things are heading, it's only a matter of time
before OpenCL takes over and there will be little reason to look back.

Regards,
Marti

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] CUDA Sorting

2012-02-15 Thread Gaetano Mendola

On 13/02/2012 19:48, Greg Stark wrote:

I don't think we should be looking at either CUDA or OpenCL directly.
We should be looking for a generic library that can target either and
is well maintained and actively developed. Any GPU code we write
ourselves would rapidly be overtaken by changes in the hardware and
innovations in parallel algorithms. If we find a library that provides
a sorting api and adapt our code to use it then we'll get the benefits
of any new hardware feature as the library adds support for them.



I think one option is to make the sort function pluggable with a shared
library/dll. I see several benefits from this:

 - It could be in the interest of the hardware vendor to provide the 
most powerful sort implementation (I'm sure, for example, that the TBB sort 
implementation is faster than pg_sort)


 - It would let people play with it without being deeply involved 
in pg development and related stuff.


 - It would relieve the postgres core group of the choice of the right 
language/tool/implementation to use.


 - It would also let people not willing (or not able, for that matter) to upgrade
the postgres engine swap in a different sort function upon a hardware
upgrade.


Of course, if this happens the postgres engine has to make some sort of
sanity check (that the function actually sorts, for example) before
trusting the plugged sort.

The engine could even have multiple sort implementations available and
use the most suitable one (imagine some sorts performing better on
a certain value range or on a certain element size).
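
To make the idea a bit more concrete, here is a purely illustrative sketch of
what loading a plugged sort from a shared library could look like; the symbol
name plugged_sort, the library name libgpusort.so and the qsort-style signature
are all made up for illustration and are not part of any PostgreSQL API:

// Hypothetical example: load an alternative sort routine at runtime and fall
// back to the built-in qsort if the library or symbol is missing.
#include <dlfcn.h>
#include <cstddef>
#include <cstdio>
#include <cstdlib>

typedef void (*sort_fn)(void *base, size_t nelems, size_t elem_size,
                        int (*cmp)(const void *, const void *));

static sort_fn load_plugged_sort(const char *path)
{
    void *handle = dlopen(path, RTLD_NOW);
    if (handle == NULL)
        return NULL;                      /* fall back to the built-in sort */
    return (sort_fn) dlsym(handle, "plugged_sort");
}

static int cmp_int(const void *a, const void *b)
{
    int x = *(const int *) a;
    int y = *(const int *) b;
    return (x > y) - (x < y);
}

int main(void)
{
    int data[] = {3, 1, 2};
    sort_fn plugged = load_plugged_sort("./libgpusort.so");

    /* Sanity checks aside, use the plugged sort when it is available. */
    if (plugged != NULL)
        plugged(data, 3, sizeof(int), cmp_int);
    else
        qsort(data, 3, sizeof(int), cmp_int);

    printf("%d %d %d\n", data[0], data[1], data[2]);
    return 0;
}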


Regards
Gaetano Mendola


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] CUDA Sorting

2012-02-15 Thread Peter Geoghegan
On 15 February 2012 20:00, Gaetano Mendola mend...@gmail.com wrote:
 On 13/02/2012 19:48, Greg Stark wrote:

 I don't think we should be looking at either CUDA or OpenCL directly.
 We should be looking for a generic library that can target either and
 is well maintained and actively developed. Any GPU code we write
 ourselves would rapidly be overtaken by changes in the hardware and
 innovations in parallel algorithms. If we find a library that provides
 a sorting api and adapt our code to use it then we'll get the benefits
 of any new hardware feature as the library adds support for them.


 I think one option is to make the sort function pluggable with a shared
 library/dll. I see several benefits from this:

  - It could be in the interest of the hardware vendor to provide the most
 powerful sort implementation (I'm sure for example that TBB sort
 implementation is faster that pg_sort)

  - It can permit people to play with it without being deep involved in pg
 development and stuffs.

Sorry, but I find it really hard to believe that the non-availability
of pluggable sorting is what's holding people back here. Some vanguard
needs to go and prove the idea by building a rough prototype before we
can even really comment on what an API should look like. For example,
I am given to understand that GPUs generally sort using radix sort -
resolving the impedance mismatch that prevents someone from using a
non-comparison based sort sure sounds like a lot of work for an
entirely speculative reward.

Someone who cannot understand tuplesort, which is not all that
complicated, has no business trying to build GPU sorting into
Postgres.

I had a patch committed a few hours ago that almost included the
capability of assigning an alternative sorting function, but only one
with the exact same signature as my variant of qsort_arg. pg_qsort
isn't used to sort tuples at all, by the way.
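
For readers not familiar with the shape being referred to: a qsort-style
routine whose comparator receives an extra opaque argument. The declarations
below are only an illustrative sketch with made-up names, not the actual
PostgreSQL qsort_arg declarations:

/* Illustrative only: a comparator that takes an extra opaque argument, and
 * the sort entry point an alternative implementation would have to match. */
#include <cstddef>

typedef int (*arg_comparator)(const void *a, const void *b, void *arg);

typedef void (*sort_with_arg_fn)(void *base, size_t nelems, size_t elem_size,
                                 arg_comparator cmp, void *arg);

/* Trivial reference implementation (insertion sort) just to show the contract. */
static void insertion_sort_arg(void *base, size_t nelems, size_t elem_size,
                               arg_comparator cmp, void *arg)
{
    char *b = (char *) base;
    for (size_t i = 1; i < nelems; i++)
        for (size_t j = i; j > 0 &&
                 cmp(b + j * elem_size, b + (j - 1) * elem_size, arg) < 0; j--)
            for (size_t k = 0; k < elem_size; k++)
            {
                char tmp = b[j * elem_size + k];          /* byte-wise swap */
                b[j * elem_size + k] = b[(j - 1) * elem_size + k];
                b[(j - 1) * elem_size + k] = tmp;
            }
}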

Threading Building Blocks is not going to form the basis of any novel
sorting implementation, because comparators in general are not thread
safe, and it isn't available on all the platforms we support, and
because of how longjmp interacts with C++ stack unwinding and so on
and so on. Now, you could introduce some kind of parallelism into
sorting integers and floats, but that's an awful lot of work for a
marginal reward.

-- 
Peter Geoghegan       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training and Services

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] CUDA Sorting

2012-02-15 Thread Gaetano Mendola

On 15/02/2012 23:11, Peter Geoghegan wrote:

On 15 February 2012 20:00, Gaetano Mendola mend...@gmail.com wrote:

On 13/02/2012 19:48, Greg Stark wrote:


I don't think we should be looking at either CUDA or OpenCL directly.
We should be looking for a generic library that can target either and
is well maintained and actively developed. Any GPU code we write
ourselves would rapidly be overtaken by changes in the hardware and
innovations in parallel algorithms. If we find a library that provides
a sorting api and adapt our code to use it then we'll get the benefits
of any new hardware feature as the library adds support for them.



I think one option is to make the sort function pluggable with a shared
library/dll. I see several benefits from this:

  - It could be in the interest of the hardware vendor to provide the most
powerful sort implementation (I'm sure for example that TBB sort
implementation is faster that pg_sort)

  - It can permit people to play with it without being deep involved in pg
development and stuffs.


Sorry, but I find it really hard to believe that the non-availability
of pluggable sorting is what's holding people back here. Some vanguard
needs to go and prove the idea by building a rough prototype before we
can even really comment on what an API should look like. For example,
I am given to understand that GPUs generally sort using radix sort -
resolving the impedance mismatch that prevents someone from using a
non-comparison based sort sure sounds like a lot of work for an
entirely speculative reward.


AFAIK the thrust library uses radix sort if the keys you are sorting are
POD data comparable with a < operator; otherwise it does a
comparison-based sort using the operator provided.

http://docs.thrust.googlecode.com/hg/modules.html
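
As a rough illustration of the two paths (a sketch assuming a CUDA/Thrust
toolchain is installed; not taken from the Thrust docs):

// Plain POD keys with the default operator<: Thrust can take the radix-sort
// path. A user-supplied comparator forces the comparison-based path instead.
#include <thrust/device_vector.h>
#include <thrust/sort.h>

struct mod_compare
{
    __host__ __device__ bool operator()(int a, int b) const
    {
        return (a % 7) < (b % 7);      // arbitrary user-defined ordering
    }
};

void sort_both_ways(thrust::device_vector<int> &keys)
{
    thrust::sort(keys.begin(), keys.end());                  // radix-sort path
    thrust::sort(keys.begin(), keys.end(), mod_compare());   // comparison-based path
}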

I'm not saying that the non-availability of pluggable sort completely
holds people back; I'm saying that it would simplify the process now
and in the future. Of course, that's my opinion.


Someone who cannot understand tuplesort, which is not all that
complicated, has no business trying to build GPU sorting into
Postgres.


That sounds a bit harsh. I'm one of those indeed; I haven't looked into the
details, not having enough time for it. At work we do GPU computing (not
the sort type of stuff), and given that I'm a Postgres enthusiast I
asked myself: my server is able to sort around 500 million integers per
second; if postgres were able to do that as well it would be very nice.

What can I say? Sorry for my thoughts.


I had a patch committed a few hours ago that almost included the
capability of assigning an alternative sorting function, but only one
with the exact same signature as my variant of qsort_arg. pg_qsort
isn't used to sort tuples at all, by the way.


Then I did look in the wrong direction. Thank you for pointing that out.


Threading building blocks is not going to form the basis of any novel
sorting implementation, because comparators in general are not thread
safe, and it isn't available on all the platforms we support, and
because of how longjmp interacts with C++ stack unwinding and so on
and so on. Now, you could introduce some kind of parallelism into
sorting integers and floats, but that's an awful lot of work for a
marginal reward.


TBB was just an example that came to my mind.
What do you mean by "you could introduce some kind of parallelism"?
As far as I know, any algorithm using divide and conquer can be
parallelized.
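
Just to illustrate the divide-and-conquer point with plain CPU threads (a
minimal sketch of my own, not tied to any PostgreSQL code): sort the two
halves in parallel, then merge them.

// Minimal divide-and-conquer parallel sort sketch: the two halves are sorted
// concurrently with std::async, then combined with std::inplace_merge.
#include <algorithm>
#include <future>
#include <vector>

void parallel_sort(std::vector<int>::iterator begin,
                   std::vector<int>::iterator end,
                   int depth = 2)
{
    const auto n = end - begin;
    if (depth <= 0 || n < 100000)          // small input: plain sequential sort
    {
        std::sort(begin, end);
        return;
    }
    auto mid = begin + n / 2;
    auto left = std::async(std::launch::async,
                           [=] { parallel_sort(begin, mid, depth - 1); });
    parallel_sort(mid, end, depth - 1);     // sort the right half on this thread
    left.wait();
    std::inplace_merge(begin, mid, end);    // merge the two sorted halves
}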

Regards
Gaetano Mendola



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] CUDA Sorting

2012-02-15 Thread Peter Geoghegan
On 15 February 2012 22:54, Gaetano Mendola mend...@gmail.com wrote:
 That sounds a bit harsh. I'm one of those indeed, I haven't look in the
 details not having enough time for it. At work we do GPU computing (not
 the sort type stuff) and given the fact I'm a Postgres enthusiast I
 asked my self: my server is able to sort around 500 milions integer per
 seconds, if postgres was able to do that as well it would be very nice.

 What I have to say? Sorry for my thoughts.

I'm not trying to sound harsh.

The only reason that my patch *nearly* had support for this was
because the implementation that we nearly went with would have only
needed another couple of lines of code to support it. It very probably
wouldn't have turned out to have been useful for any novel sorting
idea, and was really only intended to be used to support user-defined
full sorting specialisations. That didn't end up making the cut.

My point is that whatever is holding back the development of a useful
prototype here, it definitely isn't the lack of an existing API. We
don't know what such an API should look like, and just how invasive it
needs to be. More importantly, it remains to be seen how useful this
idea is in the real world - we don't have so much as a synthetic test
case with a single client, as far as I'm aware.

I'd encourage the OP to share his work on github or something along those lines.

-- 
Peter Geoghegan       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training and Services

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] CUDA Sorting

2012-02-15 Thread Dann Corbit
-Original Message-
From: pgsql-hackers-ow...@postgresql.org 
[mailto:pgsql-hackers-ow...@postgresql.org] On Behalf Of Gaetano Mendola
Sent: Wednesday, February 15, 2012 2:54 PM
To: Peter Geoghegan; pgsql-hackers@postgresql.org
Subject: Re: [HACKERS] CUDA Sorting

On 15/02/2012 23:11, Peter Geoghegan wrote:
 On 15 February 2012 20:00, Gaetano Mendola mend...@gmail.com wrote:
 On 13/02/2012 19:48, Greg Stark wrote:

 I don't think we should be looking at either CUDA or OpenCL directly.
 We should be looking for a generic library that can target either 
 and is well maintained and actively developed. Any GPU code we write 
 ourselves would rapidly be overtaken by changes in the hardware and 
 innovations in parallel algorithms. If we find a library that 
 provides a sorting api and adapt our code to use it then we'll get 
 the benefits of any new hardware feature as the library adds support for 
 them.


 I think one option is to make the sort function pluggable with a 
 shared library/dll. I see several benefits from this:

   - It could be in the interest of the hardware vendor to provide the 
 most powerful sort implementation (I'm sure for example that TBB sort 
 implementation is faster that pg_sort)

   - It can permit people to play with it without being deep 
 involved in pg development and stuffs.

 Sorry, but I find it really hard to believe that the non-availability 
 of pluggable sorting is what's holding people back here. Some vanguard 
 needs to go and prove the idea by building a rough prototype before we 
 can even really comment on what an API should look like. For example, 
 I am given to understand that GPUs generally sort using radix sort - 
 resolving the impedance mismatch that prevents someone from using a 
 non-comparison based sort sure sounds like a lot of work for an 
 entirely speculative reward.

AFAIK thrust library uses the radix sort if the keys you are sorting are POD 
data comparable with a < operator otherwise it does the comparison based sort 
using the operator provided.

http://docs.thrust.googlecode.com/hg/modules.html

I'm not saying that the non-availability of pluggable sort completely holds 
people back, I'm saying that it will simplify the process now and int the 
future, of course that's my opinion.

 Someone who cannot understand tuplesort, which is not all that 
 complicated, has no business trying to build GPU sorting into 
 Postgres.

That sounds a bit harsh. I'm one of those indeed, I haven't look in the details 
not having enough time for it. At work we do GPU computing (not the sort type 
stuff) and given the fact I'm a Postgres enthusiast I asked my self: my server 
is able to sort around 500 milions integer per seconds, if postgres was able to 
do that as well it would be very nice.

What I have to say? Sorry for my thoughts.

 I had a patch committed a few hours ago that almost included the 
 capability of assigning an alternative sorting function, but only one 
 with the exact same signature as my variant of qsort_arg. pg_qsort 
 isn't used to sort tuples at all, by the way.

Then I did look in the wrong direction. Thank you for point that out.

 Threading building blocks is not going to form the basis of any novel 
 sorting implementation, because comparators in general are not thread 
 safe, and it isn't available on all the platforms we support, and 
 because of how longjmp interacts with C++ stack unwinding and so on 
 and so on. Now, you could introduce some kind of parallelism into 
 sorting integers and floats, but that's an awful lot of work for a 
 marginal reward.

The TBB was just example that did come in my mind.
What do you mean with you could introduce some kind of parallelism?
As far as I know any algorithm using the divide and conquer can be parallelized.

Radix sorting can be used for any data type, if you create a callback that 
provides the most significant bits in fixed-width buckets.  At any rate, I can't 
imagine why anyone would want to complain about sorting 40 times faster than 
before, considering the amount of time databases spend ordering data.
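
To sketch the callback idea (an illustration only, not code from any of the
projects linked below; for brevity it does least-significant-byte passes over
a caller-provided 64-bit key rather than MSD buckets):

// Radix sort over opaque elements: the caller supplies a key-extraction
// callback returning an unsigned 64-bit key whose numeric order matches the
// desired sort order.
#include <cstdint>
#include <vector>

typedef std::uint64_t (*key_fn)(const void *elem);

void radix_sort(std::vector<const void *> &elems, key_fn key)
{
    std::vector<const void *> tmp(elems.size());
    for (int shift = 0; shift < 64; shift += 8)
    {
        std::size_t count[257] = {0};
        for (const void *e : elems)                    // histogram this byte
            count[((key(e) >> shift) & 0xFF) + 1]++;
        for (int i = 0; i < 256; i++)                  // prefix sums = bucket starts
            count[i + 1] += count[i];
        for (const void *e : elems)                    // stable scatter into buckets
            tmp[count[(key(e) >> shift) & 0xFF]++] = e;
        elems.swap(tmp);
    }
}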

I have a Cuda card in this machine (NVIDIA GeForce GTX 460) and I would not 
mind it a bit if my database ORDER BY clause suddenly started running ten 
times faster than before when I am dealing with a huge volume of data.

There have been other experiments along these lines, such as:
- GPU-based Sorting in PostgreSQL, Naju Mancheril, School of Computer Science,
  Carnegie Mellon University
- www.cs.virginia.edu/~skadron/Papers/bakkum_sqlite_gpgpu10.pdf (this is for
  SQLite, but the grammar of SQLite is almost a pure subset of PostgreSQL,
  including things like vacuum...)
- http://wiki.postgresql.org/images/6/65/Pgopencl.pdf
- http://dl.acm.org/citation.cfm?id=1807207
- http://www.scribd.com/doc/51484335/PostgreSQL-OpenCL-Procedural-Language-pgEast-March-2011

See also
http://highscalability.com/scaling-postgresql-using-cuda




-- 
Sent via pgsql-hackers mailing list (pgsql-hackers

Re: [HACKERS] CUDA Sorting

2012-02-14 Thread Gaetano Mendola

On 13/02/2012 08:26, Greg Smith wrote:

On 02/11/2012 08:14 PM, Gaetano Mendola wrote:

The trend is to have server capable of running CUDA providing GPU via
external hardware (PCI Express interface with PCI Express switches),
look for example at PowerEdge C410x PCIe Expansion Chassis from DELL.


The C410X adds 16 PCIe slots to a server, housed inside a separate 3U
enclosure. That's a completely sensible purchase if your goal is to
build a computing cluster, where a lot of work is handed off to a set of
GPUs. I think that's even less likely to be a cost-effective option for
a database server. Adding a single dedicated GPU installed in a server
to accelerate sorting is something that might be justifiable, based on
your benchmarks. This is a much more expensive option than that though.
Details at http://www.dell.com/us/enterprise/p/poweredge-c410x/pd for
anyone who wants to see just how big this external box is.


I did some experimenst timing the sort done with CUDA and the sort
done with pg_qsort:
CUDA pg_qsort
33Milion integers: ~ 900 ms, ~ 6000 ms
1Milion integers: ~ 21 ms, ~ 162 ms
100k integers: ~ 2 ms, ~ 13 ms
CUDA time has already in the copy operations (host-device,
device-host).
As GPU I was using a C2050, and the CPU doing the pg_qsort was a
Intel(R) Xeon(R) CPU X5650 @ 2.67GHz


That's really interesting, and the X5650 is by no means a slow CPU. So
this benchmark is providing a lot of CPU power yet still seeing over a
6X speedup in sort times. It sounds like the PCI Express bus has gotten
fast enough that the time to hand data over and get it back again can
easily be justified for medium to large sized sorts.

It would be helpful to take this patch and confirm whether it scales
when using in parallel. Easiest way to do that would be to use the
pgbench -f feature, which allows running an arbitrary number of some
query at once. Seeing whether this acceleration continued to hold as the
number of clients increases is a useful data point.

Is it possible for you to break down where the time is being spent? For
example, how much of this time is consumed in the GPU itself, compared
to time spent transferring data between CPU and GPU? I'm also curious
where the bottleneck is at with this approach. If it's the speed of the
PCI-E bus for smaller data sets, adding more GPUs may never be
practical. If the bus can handle quite a few of these at once before it
saturates, it might be possible to overload a single GPU. That seems
like it would be really hard to reach for database sorting though; I
can't really defend justify my gut feel for that being true though.


There you go (times are in ms):

Size      H-D       SORT      D-H       TOTAL
64        0.209824  0.479392  0.013856  0.703072
128       0.098144  0.41744   0.01312   0.528704
256       0.096832  0.420352  0.013696  0.53088
512       0.097568  0.3952    0.014464  0.507232
1024      0.09872   0.396608  0.014624  0.509952
2048      0.101344  0.56224   0.016896  0.68048
4096      0.106176  0.562976  0.02016   0.689312
8192      0.116512  0.571264  0.02672   0.714496
16384     0.136096  0.587584  0.040192  0.763872
32768     0.179296  0.658112  0.066304  0.903712
65536     0.212352  0.84816   0.118016  1.178528
131072    0.317056  1.1465    0.22784   1.691396
262144    0.529376  1.82237   0.42512   2.776866
524288    0.724032  2.39834   0.64576   3.768132
1048576   1.11162   3.51978   1.12176   5.75316
2097152   1.95939   5.93434   2.06992   9.96365
4194304   3.76192   10.6011   4.10614   18.46916
8388608   7.16845   19.9245   7.93741   35.03036
16777216  13.8693   38.7413   15.4073   68.0179
33554432  27.3017   75.6418   30.6646   133.6081
67108864  54.2171   151.192   60.327    265.7361

pg_sort

64              0.01
128             0.01
256             0.021000
512             0.128000
1024            0.092000
2048            0.196000
4096            0.415000
8192            0.883000
16384           1.881000
32768           3.96
65536           8.432000
131072         17.951000
262144         37.14
524288         78.32
1048576       163.276000
2097152       339.118000
4194304       693.223000
8388608      1423.142000
16777216     2891.218000
33554432     5910.851000
67108864    11980.93

As you can notice, the CUDA times are lower than the timings I reported
in my previous post, because back then the server was doing something else
in the meanwhile; I have repeated these benchmarks with the server
completely unused.

And this is the boost, computed as pg_sort / CUDA:

64          0.0142232943
128         0.018914175
256         0.039556962
512         0.2070058671
1024        0.1804091365
2048        0.2880319774
4096        0.6078524674
8192        1.2372357578
16384       2.4637635625
32768       4.4106972133
65536       7.1742037525
131072     10.5090706139
262144     13.3719091955
524288     20.5834084369
1048576    28.2516043357
2097152    33.9618513296
4194304    37.5247168794
8388608    40.5135716561
16777216   42.4743633661
33554432   44.2394809896
67108864   45.1499777411



 I've never seen a PostgreSQL server capable of running CUDA, and I
 don't expect that to change.

That sounds like:

I think there is a world market for 

Re: [HACKERS] CUDA Sorting

2012-02-13 Thread Kohei KaiGai
2012/2/13 Greg Smith g...@2ndquadrant.com:
 On 02/11/2012 08:14 PM, Gaetano Mendola wrote:

 The trend is to have server capable of running CUDA providing GPU via
 external hardware (PCI Express interface with PCI Express switches), look
 for example at PowerEdge C410x PCIe Expansion Chassis from DELL.


 The C410X adds 16 PCIe slots to a server, housed inside a separate 3U
 enclosure.  That's a completely sensible purchase if your goal is to build a
 computing cluster, where a lot of work is handed off to a set of GPUs.  I
 think that's even less likely to be a cost-effective option for a database
 server.  Adding a single dedicated GPU installed in a server to accelerate
 sorting is something that might be justifiable, based on your benchmarks.
  This is a much more expensive option than that though.  Details at
 http://www.dell.com/us/enterprise/p/poweredge-c410x/pd for anyone who wants
 to see just how big this external box is.


 I did some experimenst timing the sort done with CUDA and the sort done
 with pg_qsort:
                       CUDA      pg_qsort
 33Milion integers:   ~ 900 ms,  ~ 6000 ms
 1Milion integers:    ~  21 ms,  ~  162 ms
 100k integers:       ~   2 ms,  ~   13 ms
 CUDA time has already in the copy operations (host-device, device-host).
 As GPU I was using a C2050, and the CPU doing the pg_qsort was a Intel(R)
 Xeon(R) CPU X5650  @ 2.67GHz


 That's really interesting, and the X5650 is by no means a slow CPU.  So this
 benchmark is providing a lot of CPU power yet still seeing over a 6X speedup
 in sort times.  It sounds like the PCI Express bus has gotten fast enough
 that the time to hand data over and get it back again can easily be
 justified for medium to large sized sorts.

 It would be helpful to take this patch and confirm whether it scales when
 using in parallel.  Easiest way to do that would be to use the pgbench -f
 feature, which allows running an arbitrary number of some query at once.
  Seeing whether this acceleration continued to hold as the number of clients
 increases is a useful data point.

 Is it possible for you to break down where the time is being spent?  For
 example, how much of this time is consumed in the GPU itself, compared to
 time spent transferring data between CPU and GPU?  I'm also curious where
 the bottleneck is at with this approach.  If it's the speed of the PCI-E bus
 for smaller data sets, adding more GPUs may never be practical.  If the bus
 can handle quite a few of these at once before it saturates, it might be
 possible to overload a single GPU.  That seems like it would be really hard
 to reach for database sorting though; I can't really defend justify my gut
 feel for that being true though.


  I've never seen a PostgreSQL server capable of running CUDA, and I
  don't expect that to change.

 That sounds like:

 I think there is a world market for maybe five computers.
 - IBM Chairman Thomas Watson, 1943


 Yes, and 640K will be enough for everyone, ha ha.  (Having said the 640K
 thing is flat out denied by Gates, BTW, and no one has come up with proof
 otherwise).

 I think you've made an interesting case for this sort of acceleration now
 being useful for systems doing what's typically considered a data warehouse
 task.  I regularly see servers waiting for far more than 13M integers to
 sort.  And I am seeing a clear trend toward providing more PCI-E slots in
 servers now.  Dell's R810 is the most popular single server model my
 customers have deployed in the last year, and it has 5 X8 slots in it.  It's
 rare all 5 of those are filled.  As long as a dedicated GPU works fine when
 dropped to X8 speeds, I know a fair number of systems where one of those
 could be added now.

 There's another data point in your favor I didn't notice before your last
 e-mail.  Amazon has a Cluster GPU Quadruple Extra Large node type that
 runs with NVIDIA Tesla hardware.  That means the installed base of people
 who could consider CUDA is higher than I expected.  To demonstrate how much
 that costs, to provision a GPU enabled reserved instance from Amazon for one
 year costs $2410 at Light Utilization, giving a system with 22GB of RAM
 and 1.69GB of storage.  (I find the reserved prices easier to compare with
 dedicated hardware than the hourly ones)  That's halfway between the
 High-Memory Double Extra Large Instance (34GB RAM/850GB disk) at $1100 and
 the High-Memory Quadruple Extra Large Instance (64GB RAM/1690GB disk) at
 $2200.  If someone could prove sorting was a bottleneck on their server,
 that isn't an unreasonable option to consider on a cloud-based database
 deployment.

 I still think that an approach based on OpenCL is more likely to be suitable
 for PostgreSQL, which was part of why I gave CUDA low odds here.  The points
 in favor of OpenCL are:

 -Since you last posted, OpenCL compiling has switched to using LLVM as their
 standard compiler.  Good PostgreSQL support for LLVM isn't far away.  It
 looks to me like the compiler situation 

Re: [HACKERS] CUDA Sorting

2012-02-13 Thread Gaetano Mendola
On Feb 13, 2012 11:39 a.m., Kohei KaiGai kai...@kaigai.gr.jp wrote:

 2012/2/13 Greg Smith g...@2ndquadrant.com:
  On 02/11/2012 08:14 PM, Gaetano Mendola wrote:
 
  The trend is to have server capable of running CUDA providing GPU via
  external hardware (PCI Express interface with PCI Express switches),
look
  for example at PowerEdge C410x PCIe Expansion Chassis from DELL.
 
 
  The C410X adds 16 PCIe slots to a server, housed inside a separate 3U
  enclosure.  That's a completely sensible purchase if your goal is to
build a
  computing cluster, where a lot of work is handed off to a set of GPUs.
 I
  think that's even less likely to be a cost-effective option for a
database
  server.  Adding a single dedicated GPU installed in a server to
accelerate
  sorting is something that might be justifiable, based on your
benchmarks.
   This is a much more expensive option than that though.  Details at
  http://www.dell.com/us/enterprise/p/poweredge-c410x/pd for anyone who
wants
  to see just how big this external box is.
 
 
  I did some experimenst timing the sort done with CUDA and the sort done
  with pg_qsort:
CUDA  pg_qsort
  33Milion integers:   ~ 900 ms,  ~ 6000 ms
  1Milion integers:~  21 ms,  ~  162 ms
  100k integers:   ~   2 ms,  ~   13 ms
  CUDA time has already in the copy operations (host-device,
device-host).
  As GPU I was using a C2050, and the CPU doing the pg_qsort was a
Intel(R)
  Xeon(R) CPU X5650  @ 2.67GHz
 
 
  That's really interesting, and the X5650 is by no means a slow CPU.  So
this
  benchmark is providing a lot of CPU power yet still seeing over a 6X
speedup
  in sort times.  It sounds like the PCI Express bus has gotten fast
enough
  that the time to hand data over and get it back again can easily be
  justified for medium to large sized sorts.
 
  It would be helpful to take this patch and confirm whether it scales
when
  using in parallel.  Easiest way to do that would be to use the pgbench
-f
  feature, which allows running an arbitrary number of some query at once.
   Seeing whether this acceleration continued to hold as the number of
clients
  increases is a useful data point.
 
  Is it possible for you to break down where the time is being spent?  For
  example, how much of this time is consumed in the GPU itself, compared
to
  time spent transferring data between CPU and GPU?  I'm also curious
where
  the bottleneck is at with this approach.  If it's the speed of the
PCI-E bus
  for smaller data sets, adding more GPUs may never be practical.  If the
bus
  can handle quite a few of these at once before it saturates, it might be
  possible to overload a single GPU.  That seems like it would be really
hard
  to reach for database sorting though; I can't really defend justify my
gut
  feel for that being true though.
 
 
   I've never seen a PostgreSQL server capable of running CUDA, and I
   don't expect that to change.
 
  That sounds like:
 
  I think there is a world market for maybe five computers.
  - IBM Chairman Thomas Watson, 1943
 
 
  Yes, and 640K will be enough for everyone, ha ha.  (Having said the
640K
  thing is flat out denied by Gates, BTW, and no one has come up with
proof
  otherwise).
 
  I think you've made an interesting case for this sort of acceleration
now
  being useful for systems doing what's typically considered a data
warehouse
  task.  I regularly see servers waiting for far more than 13M integers to
  sort.  And I am seeing a clear trend toward providing more PCI-E slots
in
  servers now.  Dell's R810 is the most popular single server model my
  customers have deployed in the last year, and it has 5 X8 slots in it.
 It's
  rare all 5 of those are filled.  As long as a dedicated GPU works fine
when
  dropped to X8 speeds, I know a fair number of systems where one of those
  could be added now.
 
  There's another data point in your favor I didn't notice before your
last
  e-mail.  Amazon has a Cluster GPU Quadruple Extra Large node type that
  runs with NVIDIA Tesla hardware.  That means the installed base of
people
  who could consider CUDA is higher than I expected.  To demonstrate how
much
  that costs, to provision a GPU enabled reserved instance from Amazon
for one
  year costs $2410 at Light Utilization, giving a system with 22GB of
RAM
  and 1.69GB of storage.  (I find the reserved prices easier to compare
with
  dedicated hardware than the hourly ones)  That's halfway between the
  High-Memory Double Extra Large Instance (34GB RAM/850GB disk) at $1100
and
  the High-Memory Quadruple Extra Large Instance (64GB RAM/1690GB disk) at
  $2200.  If someone could prove sorting was a bottleneck on their server,
  that isn't an unreasonable option to consider on a cloud-based database
  deployment.
 
  I still think that an approach based on OpenCL is more likely to be
suitable
  for PostgreSQL, which was part of why I gave CUDA low odds here.  The
points
  in favor of OpenCL are:
 
  -Since you last posted, 

Re: [HACKERS] CUDA Sorting

2012-02-13 Thread Greg Stark
I don't think we should be looking at either CUDA or OpenCL directly.
We should be looking for a generic library that can target either and
is well maintained and actively developed. Any GPU code we write
ourselves would rapidly be overtaken by changes in the hardware and
innovations in parallel algorithms. If we find a library that provides
a sorting api and adapt our code to use it then we'll get the benefits
of any new hardware feature as the library adds support for them.

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] CUDA Sorting

2012-02-13 Thread Gaetano Mendola
On Feb 13, 2012 7:49 p.m., Greg Stark st...@mit.edu wrote:

 I don't think we should be looking at either CUDA or OpenCL directly.
 We should be looking for a generic library that can target either and
 is well maintained and actively developed. Any GPU code we write
 ourselves would rapidly be overtaken by changes in the hardware and
 innovations in parallel algorithms. If we find a library that provides
 a sorting api and adapt our code to use it then we'll get the benefits
 of any new hardware feature as the library adds support for them.

To sort integers I used the NVIDIA Thrust library.


Re: [HACKERS] CUDA Sorting

2012-02-12 Thread Oleg Bartunov

I'm wondering if CUDA will win in geometry operations, for example,
testing point @ complex_polygon

Oleg
On Sun, 12 Feb 2012, Gaetano Mendola wrote:


On 19/09/2011 16:36, Greg Smith wrote:

On 09/19/2011 10:12 AM, Greg Stark wrote:

With the GPU I'm curious to see how well
it handles multiple processes contending for resources, it might be a
flashy feature that gets lots of attention but might not really be
very useful in practice. But it would be very interesting to see.


The main problem here is that the sort of hardware commonly used for
production database servers doesn't have any serious enough GPU to
support CUDA/OpenCL available. The very clear trend now is that all
systems other than gaming ones ship with motherboard graphics chipsets
more than powerful enough for any task but that. I just checked the 5
most popular configurations of server I see my customers deploy
PostgreSQL onto (a mix of Dell and HP units), and you don't get a
serious GPU from any of them.

Intel's next generation Ivy Bridge chipset, expected for the spring of
2012, is going to add support for OpenCL to the built-in motherboard
GPU. We may eventually see that trickle into the server hardware side of
things too.



The trend is to have server capable of running CUDA providing GPU via 
external hardware (PCI Express interface with PCI Express switches), look for 
example at PowerEdge C410x PCIe Expansion Chassis from DELL.


I did some experimenst timing the sort done with CUDA and the sort done with 
pg_qsort:

  CUDA  pg_qsort
33Milion integers:   ~ 900 ms,  ~ 6000 ms
1Milion integers:~  21 ms,  ~  162 ms
100k integers:   ~   2 ms,  ~   13 ms

CUDA time has already in the copy operations (host-device, device-host).

As GPU I was using a C2050, and the CPU doing the pg_qsort was a Intel(R) 
Xeon(R) CPU X5650  @ 2.67GHz


Copy operations and kernel runs (the sort for instance) can run in parallel, 
so while you are sorting a batch of data, you can copy the next batch in 
parallel.


As you can see the boost is not negligible.

Next Nvidia hardware (Keplero family) is PCI Express 3 ready, so expect in 
the near future the bottle neck of the device-host-device copies to have 
less impact.


I strongly believe there is space to provide modern database engine of
a way to offload sorts to GPU.


I've never seen a PostgreSQL server capable of running CUDA, and I
don't expect that to change.


That sounds like:

I think there is a world market for maybe five computers.
- IBM Chairman Thomas Watson, 1943

Regards
Gaetano Mendola





Regards,
Oleg
_
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: o...@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] CUDA Sorting

2012-02-12 Thread Gaetano Mendola

On 12/02/2012 13:13, Oleg Bartunov wrote:

I'm wondering if CUDA will win in geomentry operations, for example,
tesing point @ complex_polygon



I'm not sure if the algorithm you mentioned can be implemented in terms
of vector algebra, BLAS, etc.

There are plenty of geometry operations implemented in CUDA out there; my
field of CUDA application is not this one, so I'm not that much into it.

However, I can point you to the official NVIDIA NPP library, which provides
vector algebra algorithms and some geometry algorithms as well.

http://developer.download.nvidia.com/compute/DevZone/docs/html/CUDALibraries/doc/NPP_Library.pdf

(take a look at around page 620).

Regards
Gaetano Mendola



Oleg
On Sun, 12 Feb 2012, Gaetano Mendola wrote:


On 19/09/2011 16:36, Greg Smith wrote:

On 09/19/2011 10:12 AM, Greg Stark wrote:

With the GPU I'm curious to see how well
it handles multiple processes contending for resources, it might be a
flashy feature that gets lots of attention but might not really be
very useful in practice. But it would be very interesting to see.


The main problem here is that the sort of hardware commonly used for
production database servers doesn't have any serious enough GPU to
support CUDA/OpenCL available. The very clear trend now is that all
systems other than gaming ones ship with motherboard graphics chipsets
more than powerful enough for any task but that. I just checked the 5
most popular configurations of server I see my customers deploy
PostgreSQL onto (a mix of Dell and HP units), and you don't get a
serious GPU from any of them.

Intel's next generation Ivy Bridge chipset, expected for the spring of
2012, is going to add support for OpenCL to the built-in motherboard
GPU. We may eventually see that trickle into the server hardware side of
things too.



The trend is to have server capable of running CUDA providing GPU via
external hardware (PCI Express interface with PCI Express switches),
look for example at PowerEdge C410x PCIe Expansion Chassis from DELL.

I did some experimenst timing the sort done with CUDA and the sort
done with pg_qsort:
CUDA pg_qsort
33Milion integers: ~ 900 ms, ~ 6000 ms
1Milion integers: ~ 21 ms, ~ 162 ms
100k integers: ~ 2 ms, ~ 13 ms

CUDA time has already in the copy operations (host-device,
device-host).

As GPU I was using a C2050, and the CPU doing the pg_qsort was a
Intel(R) Xeon(R) CPU X5650 @ 2.67GHz

Copy operations and kernel runs (the sort for instance) can run in
parallel, so while you are sorting a batch of data, you can copy the
next batch in parallel.

As you can see the boost is not negligible.

Next Nvidia hardware (Keplero family) is PCI Express 3 ready, so
expect in the near future the bottle neck of the
device-host-device copies to have less impact.

I strongly believe there is space to provide modern database engine of
a way to offload sorts to GPU.


I've never seen a PostgreSQL server capable of running CUDA, and I
don't expect that to change.


That sounds like:

I think there is a world market for maybe five computers.
- IBM Chairman Thomas Watson, 1943

Regards
Gaetano Mendola





Regards,
Oleg
_
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: o...@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83




--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] CUDA Sorting

2012-02-12 Thread Greg Smith

On 02/11/2012 08:14 PM, Gaetano Mendola wrote:
The trend is to have server capable of running CUDA providing GPU via 
external hardware (PCI Express interface with PCI Express switches), 
look for example at PowerEdge C410x PCIe Expansion Chassis from DELL.


The C410X adds 16 PCIe slots to a server, housed inside a separate 3U 
enclosure.  That's a completely sensible purchase if your goal is to 
build a computing cluster, where a lot of work is handed off to a set of 
GPUs.  I think that's even less likely to be a cost-effective option for 
a database server.  Adding a single dedicated GPU installed in a server 
to accelerate sorting is something that might be justifiable, based on 
your benchmarks.  This is a much more expensive option than that 
though.  Details at 
http://www.dell.com/us/enterprise/p/poweredge-c410x/pd for anyone who 
wants to see just how big this external box is.


I did some experimenst timing the sort done with CUDA and the sort 
done with pg_qsort:

   CUDA  pg_qsort
33Milion integers:   ~ 900 ms,  ~ 6000 ms
1Milion integers:~  21 ms,  ~  162 ms
100k integers:   ~   2 ms,  ~   13 ms
CUDA time has already in the copy operations (host-device, 
device-host).
As GPU I was using a C2050, and the CPU doing the pg_qsort was a 
Intel(R) Xeon(R) CPU X5650  @ 2.67GHz


That's really interesting, and the X5650 is by no means a slow CPU.  So 
this benchmark is providing a lot of CPU power yet still seeing over a 
6X speedup in sort times.  It sounds like the PCI Express bus has gotten 
fast enough that the time to hand data over and get it back again can 
easily be justified for medium to large sized sorts.


It would be helpful to take this patch and confirm whether it scales 
when using in parallel.  Easiest way to do that would be to use the 
pgbench -f feature, which allows running an arbitrary number of some 
query at once.  Seeing whether this acceleration continued to hold as 
the number of clients increases is a useful data point.


Is it possible for you to break down where the time is being spent?  For 
example, how much of this time is consumed in the GPU itself, compared 
to time spent transferring data between CPU and GPU?  I'm also curious 
where the bottleneck is at with this approach.  If it's the speed of the 
PCI-E bus for smaller data sets, adding more GPUs may never be 
practical.  If the bus can handle quite a few of these at once before it 
saturates, it might be possible to overload a single GPU.  That seems 
like it would be really hard to reach for database sorting though; I 
can't really justify my gut feel for that being true though.



 I've never seen a PostgreSQL server capable of running CUDA, and I
 don't expect that to change.

That sounds like:

I think there is a world market for maybe five computers.
- IBM Chairman Thomas Watson, 1943


Yes, and 640K will be enough for everyone, ha ha.  (Having said the 
640K thing is flat out denied by Gates, BTW, and no one has come up with 
proof otherwise).


I think you've made an interesting case for this sort of acceleration 
now being useful for systems doing what's typically considered a data 
warehouse task.  I regularly see servers waiting for far more than 13M 
integers to sort.  And I am seeing a clear trend toward providing more 
PCI-E slots in servers now.  Dell's R810 is the most popular single 
server model my customers have deployed in the last year, and it has 5 
X8 slots in it.  It's rare all 5 of those are filled.  As long as a 
dedicated GPU works fine when dropped to X8 speeds, I know a fair number 
of systems where one of those could be added now.


There's another data point in your favor I didn't notice before your 
last e-mail.  Amazon has a Cluster GPU Quadruple Extra Large node type 
that runs with NVIDIA Tesla hardware.  That means the installed base of 
people who could consider CUDA is higher than I expected.  To 
demonstrate how much that costs, to provision a GPU enabled reserved 
instance from Amazon for one year costs $2410 at Light Utilization, 
giving a system with 22GB of RAM and 1.69GB of storage.  (I find the 
reserved prices easier to compare with dedicated hardware than the 
hourly ones)  That's halfway between the High-Memory Double Extra Large 
Instance (34GB RAM/850GB disk) at $1100 and the High-Memory Quadruple 
Extra Large Instance (64GB RAM/1690GB disk) at $2200.  If someone could 
prove sorting was a bottleneck on their server, that isn't an 
unreasonable option to consider on a cloud-based database deployment.


I still think that an approach based on OpenCL is more likely to be 
suitable for PostgreSQL, which was part of why I gave CUDA low odds 
here.  The points in favor of OpenCL are:


-Since you last posted, OpenCL compiling has switched to using LLVM as 
their standard compiler.  Good PostgreSQL support for LLVM isn't far 
away.  It looks to me like the compiler situation for CUDA requires 
their PathScale based compiler.  I 

Re: [HACKERS] CUDA Sorting

2012-02-11 Thread Gaetano Mendola

On 19/09/2011 21:41, PostgreSQL - Hans-Jürgen Schönig wrote:


On Sep 19, 2011, at 5:16 PM, Tom Lane wrote:


Greg Stark st...@mit.edu writes:

That said, to help in the case I described you would have to implement
the tapesort algorithm on the GPU as well.


I think the real problem would be that we are seldom sorting just the
key values.  If you have to push the tuples through the GPU too, your
savings are going to go up in smoke pretty quickly …




i would argument along a similar line.
to make GPU code fast it has to be pretty much tailored to do exactly one thing 
- otherwise you have no chance to get anywhere close to card-bandwith.
if you look at two similar GPU codes which seem to do the same thing you 
might easily see that one is 10 times faster than the other - for bloody reason such as 
memory alignment, memory transaction size or whatever.
this opens a bit of a problem: PostgreSQL sorting is so generic and so flexible 
that i would be really surprised if somebody could come up with a solution 
which really comes close to what the GPU can do.
it would definitely be interesting to see a prototype, however.


The NVIDIA Thrust library provides the same sorting flexibility as postgres 
does.


// generate 32M random numbers on the host
thrust::host_vector<int> h_vec(32 << 20);
thrust::generate(h_vec.begin(), h_vec.end(), rand);

// transfer data to the device
thrust::device_vector<int> d_vec = h_vec;

// sort data on the device (846M keys per second on GeForce GTX 480)
thrust::sort(d_vec.begin(), d_vec.end());

// transfer data back to host
thrust::copy(d_vec.begin(), d_vec.end(), h_vec.begin());


As you can see, the type to be ordered is a template parameter, and
thrust::sort also has a version that takes the comparator to
use.

So compared with pg_qsort, thrust::sort gives you the same flexibility.

http://docs.thrust.googlecode.com/hg/group__sorting.html

Regards
Gaetano Mendola


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] CUDA Sorting

2012-02-11 Thread Gaetano Mendola

On 19/09/2011 16:36, Greg Smith wrote:

On 09/19/2011 10:12 AM, Greg Stark wrote:

With the GPU I'm curious to see how well
it handles multiple processes contending for resources, it might be a
flashy feature that gets lots of attention but might not really be
very useful in practice. But it would be very interesting to see.


The main problem here is that the sort of hardware commonly used for
production database servers doesn't have any serious enough GPU to
support CUDA/OpenCL available. The very clear trend now is that all
systems other than gaming ones ship with motherboard graphics chipsets
more than powerful enough for any task but that. I just checked the 5
most popular configurations of server I see my customers deploy
PostgreSQL onto (a mix of Dell and HP units), and you don't get a
serious GPU from any of them.

Intel's next generation Ivy Bridge chipset, expected for the spring of
2012, is going to add support for OpenCL to the built-in motherboard
GPU. We may eventually see that trickle into the server hardware side of
things too.



The trend is to have servers capable of running CUDA by providing the GPU via
external hardware (a PCI Express interface with PCI Express switches);
look for example at the PowerEdge C410x PCIe Expansion Chassis from DELL.


I did some experiments timing the sort done with CUDA and the sort done 
with pg_qsort:

                       CUDA       pg_qsort
33 million integers:   ~ 900 ms,  ~ 6000 ms
1 million integers:    ~  21 ms,  ~  162 ms
100k integers:         ~   2 ms,  ~   13 ms

The CUDA time already includes the copy operations (host-device, device-host).

As the GPU I was using a C2050, and the CPU doing the pg_qsort was an
Intel(R) Xeon(R) CPU X5650 @ 2.67GHz.


Copy operations and kernel runs (the sort for instance) can run in 
parallel, so while you are sorting a batch of data, you can copy the 
next batch in parallel.
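
For what it's worth, here is a rough sketch of that double-buffering idea (my
own illustration, assuming pinned host buffers and a Thrust version that
accepts a per-stream execution policy, which is newer than what I used for
the numbers above):

// Overlap the host->device copy of one batch with the on-device sort of the
// previous one by alternating between two CUDA streams and two device buffers.
#include <thrust/sort.h>
#include <thrust/system/cuda/execution_policy.h>
#include <cuda_runtime.h>

void sort_batches(int **host_batches, int *device_buf[2], size_t n, int nbatches)
{
    cudaStream_t stream[2];
    cudaStreamCreate(&stream[0]);
    cudaStreamCreate(&stream[1]);

    for (int i = 0; i < nbatches; i++)
    {
        int cur = i % 2;
        // Stage the current batch on its own stream (host buffers must be pinned).
        cudaMemcpyAsync(device_buf[cur], host_batches[i], n * sizeof(int),
                        cudaMemcpyHostToDevice, stream[cur]);
        // Sort on the same stream; the next iteration issues its copy on the
        // other stream, so copy and sort can overlap.
        thrust::sort(thrust::cuda::par.on(stream[cur]),
                     device_buf[cur], device_buf[cur] + n);
        // Return the sorted batch, still on the same stream.
        cudaMemcpyAsync(host_batches[i], device_buf[cur], n * sizeof(int),
                        cudaMemcpyDeviceToHost, stream[cur]);
    }
    cudaStreamSynchronize(stream[0]);
    cudaStreamSynchronize(stream[1]);
    cudaStreamDestroy(stream[0]);
    cudaStreamDestroy(stream[1]);
}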


As you can see the boost is not negligible.

The next NVIDIA hardware (Kepler family) is PCI Express 3 ready, so expect
the bottleneck of the device-host-device copies to have less impact
in the near future.


I strongly believe there is room to provide a modern database engine with
a way to offload sorts to the GPU.

 I've never seen a PostgreSQL server capable of running CUDA, and I
 don't expect that to change.

That sounds like:

I think there is a world market for maybe five computers.
- IBM Chairman Thomas Watson, 1943

Regards
Gaetano Mendola


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] CUDA Sorting

2011-09-27 Thread Vitor Reus
Hey hackers,

I'm still having problems reading the values of the columns in tuplesort.c,
in order to understand how to port this to CUDA.

Should I use the heap_getattr macro to read them?

2011/9/24 Hannu Krosing ha...@krosing.net

 On Mon, 2011-09-19 at 10:36 -0400, Greg Smith wrote:
  On 09/19/2011 10:12 AM, Greg Stark wrote:
   With the GPU I'm curious to see how well
   it handles multiple processes contending for resources, it might be a
   flashy feature that gets lots of attention but might not really be
   very useful in practice. But it would be very interesting to see.
  
 
  The main problem here is that the sort of hardware commonly used for
  production database servers doesn't have any serious enough GPU to
  support CUDA/OpenCL available.  The very clear trend now is that all
  systems other than gaming ones ship with motherboard graphics chipsets
  more than powerful enough for any task but that.  I just checked the 5
  most popular configurations of server I see my customers deploy
  PostgreSQL onto (a mix of Dell and HP units), and you don't get a
  serious GPU from any of them.
 
  Intel's next generation Ivy Bridge chipset, expected for the spring of
  2012, is going to add support for OpenCL to the built-in motherboard
  GPU.  We may eventually see that trickle into the server hardware side
  of things too.
 
  I've never seen a PostgreSQL server capable of running CUDA, and I don't
  expect that to change.

 CUDA sorting could be beneficial on general server hardware if it can
 run well on multiple cpus in parallel. GPU-s being in essence parallel
 processors on fast shared memory, it may be that even on ordinary RAM
 and lots of CPUs some CUDA algorithms are a significant win.

 and then there is non-graphics GPU availabe on EC2

  Cluster GPU Quadruple Extra Large Instance

  22 GB of memory
  33.5 EC2 Compute Units (2 x Intel Xeon X5570, quad-core “Nehalem”
   architecture)
  2 x NVIDIA Tesla “Fermi” M2050 GPUs
  1690 GB of instance storage
  64-bit platform
  I/O Performance: Very High (10 Gigabit Ethernet)
  API name: cg1.4xlarge

 It costs $2.10 per hour, probably a lot less if you use the Spot
 Instances.

  --
  Greg Smith   2ndQuadrant US   g...@2ndquadrant.com   Baltimore, MD
  PostgreSQL Training, Services, and 24x7 Support  www.2ndQuadrant.us
 
 



 --
 Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
 To make changes to your subscription:
 http://www.postgresql.org/mailpref/pgsql-hackers



Re: [HACKERS] CUDA Sorting

2011-09-24 Thread Hannu Krosing
On Mon, 2011-09-19 at 15:12 +0100, Greg Stark wrote:
 On Mon, Sep 19, 2011 at 1:11 PM, Vitor Reus vitor.r...@gmail.com wrote:
  Since I'm new to pgsql development, I replaced the code of pgsql
  qsort_arg to get used with the way postgres does the sort. The problem
  is that I can't use the qsort_arg_comparator comparator function on
  GPU, I need to implement my own. I didn't find out how to access the
  sorting key value data of the tuples on the Tuplesortstate or
  SortTuple structures. This part looks complicated because it seems the
  state holds the pointer for the scanner(?), but I didn't managed to
  access the values directly. Can anyone tell me how this works?



 With the GPU I'm curious to see how well
 it handles multiple processes contending for resources, it might be a
 flashy feature that gets lots of attention but might not really be
 very useful in practice. But it would be very interesting to see.

There are cases where concurrency may not be that important, like some
specialized OLAP loads where you have to sort; finding a median in large
data sets, for example.


-- 
---
Hannu Krosing
PostgreSQL Unlimited Scalability and Performance Consultant
2ndQuadrant Nordic
PG Admin Book: http://www.2ndQuadrant.com/books/


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] CUDA Sorting

2011-09-24 Thread Hannu Krosing
On Mon, 2011-09-19 at 10:36 -0400, Greg Smith wrote:
 On 09/19/2011 10:12 AM, Greg Stark wrote:
  With the GPU I'm curious to see how well
  it handles multiple processes contending for resources, it might be a
  flashy feature that gets lots of attention but might not really be
  very useful in practice. But it would be very interesting to see.
 
 
 The main problem here is that the sort of hardware commonly used for 
 production database servers doesn't have any serious enough GPU to 
 support CUDA/OpenCL available.  The very clear trend now is that all 
 systems other than gaming ones ship with motherboard graphics chipsets 
 more than powerful enough for any task but that.  I just checked the 5 
 most popular configurations of server I see my customers deploy 
 PostgreSQL onto (a mix of Dell and HP units), and you don't get a 
 serious GPU from any of them.
 
 Intel's next generation Ivy Bridge chipset, expected for the spring of 
 2012, is going to add support for OpenCL to the built-in motherboard 
 GPU.  We may eventually see that trickle into the server hardware side 
 of things too.
 
 I've never seen a PostgreSQL server capable of running CUDA, and I don't 
 expect that to change.

CUDA sorting could be beneficial on general server hardware if it can
run well on multiple CPUs in parallel. GPUs being in essence parallel
processors on fast shared memory, it may be that even on ordinary RAM
and lots of CPUs some CUDA algorithms are a significant win.

And then there is a non-graphics GPU available on EC2:

  Cluster GPU Quadruple Extra Large Instance

  22 GB of memory
  33.5 EC2 Compute Units (2 x Intel Xeon X5570, quad-core “Nehalem”
   architecture)
  2 x NVIDIA Tesla “Fermi” M2050 GPUs
  1690 GB of instance storage
  64-bit platform
  I/O Performance: Very High (10 Gigabit Ethernet)
  API name: cg1.4xlarge

It costs $2.10 per hour, probably a lot less if you use the Spot
Instances.

 -- 
 Greg Smith   2ndQuadrant USg...@2ndquadrant.com   Baltimore, MD
 PostgreSQL Training, Services, and 24x7 Support  www.2ndQuadrant.us
 
 



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] CUDA Sorting

2011-09-20 Thread Florian Pflug
On Sep19, 2011, at 19:46 , Stephen Frost wrote:
 I agree that it'd be interesting to do, but I share Lord Stark's
 feelings about the challenges and lack of potential gain- it's a very
 small set of queries that would benefit from this.  You need to be
 working with enough data to make the cost of tranferring it all over to
 the GPU worthwhile, just for starters..

I wonder if anyone has ever tried to employ a GPU for more low-level
tasks. Things like sorting or hashing are hard to move to the
GPU in postgres because, in the general case, they involve essentially
arbitrary user-defined functions. But couldn't for example the WAL CRC
computation be moved to a GPU? Or, to get really crazy, even the search
for the optimal join order (only for a large number of joins though,
i.e. where we currently switch to a genetic algorithm)?

best regards,
Florian Pflug


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] CUDA Sorting

2011-09-20 Thread Nulik Nol

 I already did some benchmarks with GPU sorting (not in pgsql), and
 measured total sort times, copy bandwidth and energy usage, and got
 some exciting results:
Was that qsort implementation CPU-cache friendly and optimized for SSE?
To make a fair comparison you have to take the best CPU implementation
and compare it to the best GPU implementation. Because if not, you are
comparing a full-throttle GPU vs a lazy CPU.
Check this paper on how hash join was optimized 17x when SSE
instructions were used.
www.vldb.org/pvldb/2/vldb09-257.pdf

Regards


-- 
==
The power of zero is infinite

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] CUDA Sorting

2011-09-19 Thread Vitor Reus
Hello everyone,

I'm implementing CUDA-based sorting in PostgreSQL, and I believe it
can improve ORDER BY performance by 4 to 10 times. I already have a
generic CUDA sort that performs around 10 times faster
than the standard qsort. I also managed to load CUDA into pgsql.

Since I'm new to pgsql development, I replaced the code of pgsql's
qsort_arg to get acquainted with the way Postgres does the sort. The
problem is that I can't use the qsort_arg_comparator comparator function
on the GPU; I need to implement my own. I didn't find out how to access
the sorting key value data of the tuples in the Tuplesortstate or
SortTuple structures. This part looks complicated because it seems the
state holds the pointer for the scanner(?), but I didn't manage to
access the values directly. Can anyone tell me how this works?

Cheers,
Vítor

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] CUDA Sorting

2011-09-19 Thread Thom Brown
On 19 September 2011 13:11, Vitor Reus vitor.r...@gmail.com wrote:
 Hello everyone,

 I'm implementing a CUDA based sorting on PostgreSQL, and I believe it
 can improve the ORDER BY statement performance in 4 to 10 times. I
 already have a generic CUDA sort that performs around 10 times faster
 than std qsort. I also managed to load CUDA into pgsql.

 Since I'm new to pgsql development, I replaced the code of pgsql
 qsort_arg to get used with the way postgres does the sort. The problem
 is that I can't use the qsort_arg_comparator comparator function on
 GPU, I need to implement my own. I didn't find out how to access the
 sorting key value data of the tuples on the Tuplesortstate or
 SortTuple structures. This part looks complicated because it seems the
 state holds the pointer for the scanner(?), but I didn't managed to
 access the values directly. Can anyone tell me how this works?

I can't help with explaining the inner workings of sorting code, but
just a note that CUDA is a proprietary framework from nVidia and
confines its use to nVidia GPUs only.  You'd probably be better off
investing in the OpenCL standard which is processor-agnostic.  Work
has already been done in this area by Tim Child with pgOpenCL,
although it doesn't appear to be available yet.  It might be worth
engaging with him to see if there are commonalities to what you're
both trying to achieve.

-- 
Thom Brown
Twitter: @darkixion
IRC (freenode): dark_ixion
Registered Linux user: #516935

EnterpriseDB UK: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] CUDA Sorting

2011-09-19 Thread Thom Brown
On 19 September 2011 14:32, Vitor Reus vitor.r...@gmail.com wrote:
 2011/9/19 Thom Brown t...@linux.com:
 On 19 September 2011 13:11, Vitor Reus vitor.r...@gmail.com wrote:
 Hello everyone,

 I'm implementing a CUDA based sorting on PostgreSQL, and I believe it
 can improve the ORDER BY statement performance in 4 to 10 times. I
 already have a generic CUDA sort that performs around 10 times faster
 than std qsort. I also managed to load CUDA into pgsql.

 Since I'm new to pgsql development, I replaced the code of pgsql
 qsort_arg to get used with the way postgres does the sort. The problem
 is that I can't use the qsort_arg_comparator comparator function on
 GPU, I need to implement my own. I didn't find out how to access the
 sorting key value data of the tuples on the Tuplesortstate or
 SortTuple structures. This part looks complicated because it seems the
 state holds the pointer for the scanner(?), but I didn't managed to
 access the values directly. Can anyone tell me how this works?

 I can't help with explaining the inner workings of sorting code, but
 just a note that CUDA is a proprietary framework from nVidia and
 confines its use to nVidia GPUs only.  You'd probably be better off
 investing in the OpenCL standard which is processor-agnostic.  Work
 has already been done in this area by Tim Child with pgOpenCL,
 although doesn't appear to be available yet.  It might be worth
 engaging with him to see if there are commonalities to what you're
 both trying to achieve.

 --
 Thom Brown
 Twitter: @darkixion
 IRC (freenode): dark_ixion
 Registered Linux user: #516935

 EnterpriseDB UK: http://www.enterprisedb.com
 The Enterprise PostgreSQL Company


 Hi Thom Brown,

 thank you very much for your reply.

 I am aware that CUDA is a proprietary framework, but since the high
 level CUDA API is easier than OpenCL, it will be faster to implement
 and test. Also, CUDA can be translated to OpenCL in a straightforward
 way, since the low level CUDA API generated code is really similar to
 OpenCL.

 I'll try engaging with Tim Child, but it seems that his work is to
 create GPU support for specific SQL, like procedural SQL statements
 with CUDA extensions, did I understand it right? And my focus is to
 unlock the GPU power without the user being aware of this.

Please use Reply To All in your responses so the mailing list is included.

Is your aim to have this committed into core PostgreSQL, or just for
your own version?  If it's the former, I don't anticipate any
enthusiasm from the hacker community.

But you're right, Tim Child's work is aimed at procedural acceleration
rather than speeding up core functionality (from what I gather
anyway).

-- 
Thom Brown
Twitter: @darkixion
IRC (freenode): dark_ixion
Registered Linux user: #516935

EnterpriseDB UK: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] CUDA Sorting

2011-09-19 Thread Greg Stark
On Mon, Sep 19, 2011 at 1:11 PM, Vitor Reus vitor.r...@gmail.com wrote:
 Since I'm new to pgsql development, I replaced the code of pgsql
 qsort_arg to get used with the way postgres does the sort. The problem
 is that I can't use the qsort_arg_comparator comparator function on
 GPU, I need to implement my own. I didn't find out how to access the
 sorting key value data of the tuples on the Tuplesortstate or
 SortTuple structures. This part looks complicated because it seems the
 state holds the pointer for the scanner(?), but I didn't managed to
 access the values directly. Can anyone tell me how this works?



This is something I've been curious about for a while. The biggest
difficulty is that Postgres has a user-extensible type system and
calls user provided functions to do things like comparisons. Postgres
only supports comparison sorts and does so by calling the user
function for the data type being sorted.

These user-defined functions are looked up earlier, in the query parsing
and analysis phase, and stored in Tuplesortstate->scanKeys, which is an
array of structures that hold information about the ordering required:
a pointer to the function, a set of flags (such as NULLS FIRST/LAST),
and the collation needed.

I assume you're going to have to have tuplesort.c recognize if all the
comparators are one of a small set of standard comparators that you
can implement on the GPU such as integer and floating point
comparison. In which case you could call a specialized qsort which
implements that comparator inlined instead of calling the standard
function. That might actually be a useful optimization to do anyways
since it may well be much faster even without the GPU.  So that would
probably be a good place to start.
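
To illustrate the inlining point with a small standalone sketch
(hypothetical code, not tuplesort.c):

#include <algorithm>
#include <stdlib.h>
#include <stddef.h>

/* Generic path: the comparison goes through a function pointer, so the
 * compiler cannot inline it; this mimics the shape of qsort_arg. */
static int int_cmp(const void *a, const void *b)
{
    int ia = *(const int *) a;
    int ib = *(const int *) b;
    return (ia > ib) - (ia < ib);
}

void sort_generic(int *vals, size_t n)
{
    qsort(vals, n, sizeof(int), int_cmp);
}

/* Specialized path: the caller has already checked that every sort key
 * uses the standard integer comparator, so the comparison is visible to
 * the compiler and gets inlined into the sort loop. */
void sort_specialized_int4(int *vals, size_t n)
{
    std::sort(vals, vals + n);
}

The tuplesort-level dispatch would then just check whether every sort key
uses one of these known comparators and fall back to the generic path
otherwise.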

But the barrier to get over here might be relatively high. In order to
tolerate that amount of duplicated code and special cases there would
have to be benchmarks showing it's significantly faster and helps
real-world user queries. It would also have to be pretty cleanly
implemented so that it doesn't impose a lot of extra overhead every
time this code needs to be changed -- for example when adding
collations it would have been unfortunate to have to add it to half a
dozen specializations of tuplesort (though frankly I don't think that
would have made that much of a dent in the happiness of the people who
worked on collations).

All that said my personal opinion is that this can be done cleanly and
would be more than worth the benefit even without the GPU -- sorting
integers and floating point numbers is a very common case and Peter
Geoghegan recently showed our qsort could be about twice as fast if it
could inline the comparisons. With the GPU I'm curious to see how well
it handles multiple processes contending for resources, it might be a
flashy feature that gets lots of attention but might not really be
very useful in practice. But it would be very interesting to see.

-- 
greg

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] CUDA Sorting

2011-09-19 Thread Greg Smith

On 09/19/2011 10:12 AM, Greg Stark wrote:

With the GPU I'm curious to see how well
it handles multiple processes contending for resources, it might be a
flashy feature that gets lots of attention but might not really be
very useful in practice. But it would be very interesting to see.
   


The main problem here is that the sort of hardware commonly used for 
production database servers doesn't have any serious enough GPU to 
support CUDA/OpenCL available.  The very clear trend now is that all 
systems other than gaming ones ship with motherboard graphics chipsets 
more than powerful enough for any task but that.  I just checked the 5 
most popular configurations of server I see my customers deploy 
PostgreSQL onto (a mix of Dell and HP units), and you don't get a 
serious GPU from any of them.


Intel's next generation Ivy Bridge chipset, expected for the spring of 
2012, is going to add support for OpenCL to the built-in motherboard 
GPU.  We may eventually see that trickle into the server hardware side 
of things too.


I've never seen a PostgreSQL server capable of running CUDA, and I don't 
expect that to change.


--
Greg Smith   2ndQuadrant US   g...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support  www.2ndQuadrant.us


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] CUDA Sorting

2011-09-19 Thread Thom Brown
On 19 September 2011 15:36, Greg Smith g...@2ndquadrant.com wrote:
 On 09/19/2011 10:12 AM, Greg Stark wrote:

 With the GPU I'm curious to see how well
 it handles multiple processes contending for resources, it might be a
 flashy feature that gets lots of attention but might not really be
 very useful in practice. But it would be very interesting to see.


 The main problem here is that the sort of hardware commonly used for
 production database servers doesn't have any serious enough GPU to support
 CUDA/OpenCL available.  The very clear trend now is that all systems other
 than gaming ones ship with motherboard graphics chipsets more than powerful
 enough for any task but that.  I just checked the 5 most popular
 configurations of server I see my customers deploy PostgreSQL onto (a mix of
 Dell and HP units), and you don't get a serious GPU from any of them.

 Intel's next generation Ivy Bridge chipset, expected for the spring of 2012,
 is going to add support for OpenCL to the built-in motherboard GPU.  We may
 eventually see that trickle into the server hardware side of things too.

 I've never seen a PostgreSQL server capable of running CUDA, and I don't
 expect that to change.

But couldn't that also be seen as a chicken/egg situation?  No one
buys GPUs for database servers because the database won't make use of
them, but databases don't implement GPU functionality since database
servers don't tend to have GPUs.  It's more likely the latter of those
two would have to be the first to budge.

But nVidia does produce a non-graphics-oriented GPGPU line called
Tesla dedicated to such processing.

-- 
Thom Brown
Twitter: @darkixion
IRC (freenode): dark_ixion
Registered Linux user: #516935

EnterpriseDB UK: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] CUDA Sorting

2011-09-19 Thread Greg Stark
On Mon, Sep 19, 2011 at 3:36 PM, Greg Smith g...@2ndquadrant.com wrote:
 The main problem here is that the sort of hardware commonly used for
 production database servers doesn't have any serious enough GPU to support
 CUDA/OpenCL available

Of course that could change if adding a GPU would help Postgres... I
would expect it to help mostly for data warehouse batch query type
systems, especially ones with very large i/o subsystems that can
saturate the memory bus with sequential i/o. Run your large batch
queries twice as fast by adding a $400 part to your $40,000 server
might be a pretty compelling sales pitch :)

That said, to help in the case I described you would have to implement
the tapesort algorithm on the GPU as well. I expect someone has
implemented heaps for CUDA/OpenCL already though.

-- 
greg

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] CUDA Sorting

2011-09-19 Thread Thom Brown
On 19 September 2011 15:54, Greg Stark st...@mit.edu wrote:
 On Mon, Sep 19, 2011 at 3:36 PM, Greg Smith g...@2ndquadrant.com wrote:
 The main problem here is that the sort of hardware commonly used for
 production database servers doesn't have any serious enough GPU to support
 CUDA/OpenCL available

 Of course that could change if adding a GPU would help Postgres... I
 would expect it to help mostly for data warehouse batch query type
 systems, especially ones with very large i/o subsystems that can
 saturate the memory bus with sequential i/o. Run your large batch
 queries twice as fast by adding a $400 part to your $40,000 server
 might be a pretty compelling sales pitch :)

 That said, to help in the case I described you would have to implement
 the tapesort algorithm on the GPU as well. I expect someone has
 implemented heaps for CUDA/OpenCL already though.

I seem to recall a paper on such a thing by Carnegie Mellon
University.  Can't remember where I saw it though.

-- 
Thom Brown
Twitter: @darkixion
IRC (freenode): dark_ixion
Registered Linux user: #516935

EnterpriseDB UK: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] CUDA Sorting

2011-09-19 Thread Thom Brown
On 19 September 2011 16:10, Thom Brown t...@linux.com wrote:
 On 19 September 2011 15:54, Greg Stark st...@mit.edu wrote:
 On Mon, Sep 19, 2011 at 3:36 PM, Greg Smith g...@2ndquadrant.com wrote:
 The main problem here is that the sort of hardware commonly used for
 production database servers doesn't have any serious enough GPU to support
 CUDA/OpenCL available

 Of course that could change if adding a GPU would help Postgres... I
 would expect it to help mostly for data warehouse batch query type
 systems, especially ones with very large i/o subsystems that can
 saturate the memory bus with sequential i/o. Run your large batch
 queries twice as fast by adding a $400 part to your $40,000 server
 might be a pretty compelling sales pitch :)

 That said, to help in the case I described you would have to implement
 the tapesort algorithm on the GPU as well. I expect someone has
 implemented heaps for CUDA/OpenCL already though.

 I seem to recall a paper on such a thing by Carnegie Mellon
 University.  Can't remember where I saw it though.

Found it! 
http://www.cs.cmu.edu/afs/cs.cmu.edu/Web/People/ngm/15-823/project/Final.pdf

-- 
Thom Brown
Twitter: @darkixion
IRC (freenode): dark_ixion
Registered Linux user: #516935

EnterpriseDB UK: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] CUDA Sorting

2011-09-19 Thread Tom Lane
Greg Stark st...@mit.edu writes:
 That said, to help in the case I described you would have to implement
 the tapesort algorithm on the GPU as well.

I think the real problem would be that we are seldom sorting just the
key values.  If you have to push the tuples through the GPU too, your
savings are going to go up in smoke pretty quickly ...

FWIW, I tend to believe a variant of what Greg Stark said upthread:
there would surely be some win from reducing the impedance mismatch for
comparison functions.  In concrete terms, there would be no reason to
have tuplesort.c's myFunctionCall2Coll, and maybe not
inlineApplySortFunction either, if the datatype-specific comparison
functions had APIs that were closer to what sorting wants rather than
following the general SQL-callable-function API.  And those functions
cost a *lot* more than a one-instruction comparison does.  But it's very
much more of a stretch to believe that inlining per se is going to do
much for us, and even more of a stretch to believe that getting a
separate processor involved is going to be a win.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] CUDA Sorting

2011-09-19 Thread Vitor Reus
2011/9/19 Thom Brown t...@linux.com
 Is your aim to have this committed into core PostgreSQL, or just for
 your own version?  If it's the former, I don't anticipate any
 enthusiasm from the hacker community.

This is a research thesis and I'm not confident enough to commit it to
core just by myself. I will, however, release the source, and I believe
it will open the way for future work to be committed to core
PostgreSQL.


2011/9/19 Greg Stark st...@mit.edu
 Of course that could change if adding a GPU would help Postgres... I
 would expect it to help mostly for data warehouse batch query type
 systems, especially ones with very large i/o subsystems that can
 saturate the memory bus with sequential i/o. Run your large batch
 queries twice as fast by adding a $400 part to your $40,000 server
 might be a pretty compelling sales pitch :)

My focus is also energy proportionality. If you add a GPU, you will
increase the power consumption by about 2 times, but perhaps could
increase the efficiency much more.


 That said, to help in the case I described you would have to implement
 the tapesort algorithm on the GPU as well. I expect someone has
 implemented heaps for CUDA/OpenCL already though.

For now, I'm planning to implement just the in-memory sort, for
simplicity and to see if it would give a real performance gain.


2011/9/19 Greg Stark st...@mit.edu:
 In which case you could call a specialized qsort which
 implements that comparator inlined instead of calling the standard
 function.

Actually I'm now trying to make a custom comparator for integers, but
I haven't made much progress yet. If this works, I'll port it to the GPU
and start working on the next comparators, such as float, then strings,
in an incremental way.
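
For the GPU part, a rough Thrust sketch of what such a ported comparator
could look like (the SortItem struct and its NULL handling are made up
for illustration and are not the real SortTuple layout):

#include <thrust/device_vector.h>
#include <thrust/sort.h>

/* Illustrative key representation: an integer key plus a NULL flag. */
struct SortItem {
    int  key;
    bool isnull;
};

/* The comparator runs on the device, so it is marked __host__ __device__
 * and must not call back into anything host-side. */
struct ItemLess {
    __host__ __device__
    bool operator()(const SortItem &a, const SortItem &b) const
    {
        if (a.isnull != b.isnull)
            return !a.isnull;      /* non-NULL sorts before NULL */
        return a.key < b.key;      /* ascending on the key */
    }
};

void gpu_sort_items(thrust::device_vector<SortItem> &items)
{
    thrust::sort(items.begin(), items.end(), ItemLess());
}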


2011/9/19 Thom Brown t...@linux.com:
 Found it! 
 http://www.cs.cmu.edu/afs/cs.cmu.edu/Web/People/ngm/15-823/project/Final.pdf
This is really great work, and I'm basing mine on it. But it's
implemented using OpenGL (yes, not OpenCL), and therefore has a lot of
limitations. I also tried to contact naju but didn't get any answer.

Vítor Uwe Reus

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] CUDA Sorting

2011-09-19 Thread Nulik Nol
On Mon, Sep 19, 2011 at 7:11 AM, Vitor Reus vitor.r...@gmail.com wrote:
 Hello everyone,

 I'm implementing a CUDA based sorting on PostgreSQL, and I believe it
 can improve the ORDER BY statement performance in 4 to 10 times. I
 already have a generic CUDA sort that performs around 10 times faster
 than std qsort. I also managed to load CUDA into pgsql.
NVIDIA cards are not as good as ATI cards. ATI cards are much faster
with integer operations, and should be ideal for sorting transaction
ids or similar sorts of numbers (unless you are going to sort prices
stored as floats, where ATI still beats NVIDIA, but not by as much).
Another problem you have to deal with is PCI Express speed. Transfer
is very slow compared to RAM. You will have to add more GPUs to match
the performance, and this will increase the cost of the solution. There
was a sorting algorithm for 4 CPU cores that was beating sort on a
GTX 285 (I don't have the link, sorry), so CPUs are not as bad at
sorting as you might think.
AMD is already working on embedding GPUs into the motherboard; if I
am not mistaken, some of them are already on the market and available
for purchase.
Anyone who uses a tiny embedded ATI GPU for integer sorting problems
will outperform your NVIDIA-based, PCI Express-connected GPU with CUDA,
because your algorithm will basically waste a lot of time transferring
data to the GPU and getting it back.
But if you use an embedded ATI GPU, you can also use the SSE registers
on each CPU core to add more performance to your algorithm. It is not
going to be a very hardware-compatible solution, but if you want good
speed for the cost, this should be the best option.
I recommend doing some bandwidth benchmark tests before you start coding.

Regards
Nulik
 --
 Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
 To make changes to your subscription:
 http://www.postgresql.org/mailpref/pgsql-hackers




-- 
==
The power of zero is infinite

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] CUDA Sorting

2011-09-19 Thread Christopher Browne
On Mon, Sep 19, 2011 at 10:36 AM, Greg Smith g...@2ndquadrant.com wrote:
 Intel's next generation Ivy Bridge chipset, expected for the spring of 2012,
 is going to add support for OpenCL to the built-in motherboard GPU.  We may
 eventually see that trickle into the server hardware side of things too.

Note that Amazon's EC2 offerings include a configuration with a pair of GPUs.

Whether or not this continues has a certain chicken and egg aspect to it...

- I'm glad that Amazon is selling such a configuration, as it does
give folks the option of trying it out.

- Presumably, it will only continue on their product list if customers
do more than merely try it out.

I think I'd be shocked if PostgreSQL offered much support for such a
configuration in the next year; despite there being some work ongoing,
drawing the functionality into core would require Core decisions that
I'd be surprised to see so quickly.

Unfortunately, that may be slow enough progress that PostgreSQL won't
be contributing to the would-be success of the technology.

If this kind of GPU usage fails to attract much interest, then it's
probably a good thing that we're not committed to it.  But if other
uses lead to it taking off, then we'll doubtless get a lot of noise on
lists about a year from now to the effect of "Why don't you have this in
core yet?  Not 3773t enough!?!?"

Having a bit of progress taking place now would probably be good
timing, in case it *does* take off...
-- 
When confronted by a difficult problem, solve it by reducing it to the
question, How would the Lone Ranger handle this?

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] CUDA Sorting

2011-09-19 Thread Stephen Frost
* Thom Brown (t...@linux.com) wrote:
 But nVidia does produce a non-graphics-oriented GPGPU line called
 Tesla dedicated to such processing.

Just as a side note, I've got a couple of Teslas that aren't doing
terribly much at the moment, and they're in a Linux 'server'-type box
from Penguin Computing.  I could certainly install PG on it and run some
tests, if someone's written the code and provides the tests.

I agree that it'd be interesting to do, but I share Lord Stark's
feelings about the challenges and lack of potential gain- it's a very
small set of queries that would benefit from this.  You need to be
working with enough data to make the cost of transferring it all over to
the GPU worthwhile, just for starters...

Thanks,

Stephen




Re: [HACKERS] CUDA Sorting

2011-09-19 Thread Greg Smith

On 09/19/2011 10:53 AM, Thom Brown wrote:

But couldn't that also be seen as a chicken/egg situation?



The chicken/egg problem here is a bit deeper than just no one offers 
GPUs because no one wants them on server systems.  One of the reasons 
there aren't more GPUs in typical database server configurations is that 
you're already filling up some number of the full size slots, and 
correspondingly the bandwidth available to cards, with disk 
controllers.  It doesn't help that many server class motherboards don't 
even have a x16 PCI-e slot on them, which is what most GPUs as delivered 
on regular consumer video cards are optimized for.



But nVidia does produce a non-graphics-oriented GPGPU line called
Tesla dedicated to such processing.
   


Tesla units start at around $1500 USD, which is a nice budget to spend 
on either more RAM (to allow higher work_mem), faster storage to store 
temporary files onto, or a faster CPU to chew through all sorts of tasks 
more quickly.  The Tesla units are easy to justify if you have a serious 
GPU-oriented application.  The good bang-for-the-buck point with GPU
sorting for PostgreSQL is probably going to be a $50-$100 video card
instead.  For example, the card Vitor is seeing good results on costs 
around $60.  (That's also a system with fairly slow RAM, though; it will 
be interesting to see if the gain holds up on newer systems.)


--
Greg Smith   2ndQuadrant US   g...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support  www.2ndQuadrant.us


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] CUDA Sorting

2011-09-19 Thread PostgreSQL - Hans-Jürgen Schönig

On Sep 19, 2011, at 5:16 PM, Tom Lane wrote:

 Greg Stark st...@mit.edu writes:
 That said, to help in the case I described you would have to implement
 the tapesort algorithm on the GPU as well.
 
 I think the real problem would be that we are seldom sorting just the
 key values.  If you have to push the tuples through the GPU too, your
 savings are going to go up in smoke pretty quickly …
 


I would argue along a similar line.
To make GPU code fast it has to be pretty much tailored to do exactly one
thing, otherwise you have no chance of getting anywhere close to card
bandwidth. If you look at two similar GPU codes which seem to do the same
thing, you might easily see that one is 10 times faster than the other,
for bloody reasons such as memory alignment, memory transaction size or
whatever. This opens a bit of a problem: PostgreSQL sorting is so generic
and so flexible that I would be really surprised if somebody could come
up with a solution which really comes close to what the GPU can do.
It would definitely be interesting to see a prototype, however.
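
To give a feeling for the kind of detail involved, a toy sketch
(illustrative only, launch configuration and timings omitted): both
kernels below do the same per-element work, but the first reads memory
coalesced while the second scatters its reads, and on most cards the
second one is several times slower.

/* Two kernels doing the same element-wise work; only the memory access
 * pattern differs. */
__global__ void touch_coalesced(const int *in, int *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[i] + 1;            /* neighbouring threads read
                                          neighbouring words */
}

__global__ void touch_strided(const int *in, int *out, int n, int stride)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        int j = (i * stride) % n;      /* scattered, non-coalesced reads */
        out[i] = in[j] + 1;
    }
}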

BTW, there are a handful of interesting talks/lectures about GPU
programming provided by the University of Chicago (I just cannot find
the link at the moment).

regards,

hans

--
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt, Austria
Web: http://www.postgresql-support.de


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] CUDA Sorting

2011-09-19 Thread Cédric Villemain
2011/9/19 Greg Smith g...@2ndquadrant.com:
 On 09/19/2011 10:53 AM, Thom Brown wrote:

 But couldn't that also be seen as a chicken/egg situation?


 The chicken/egg problem here is a bit deeper than just no one offers GPUs
 because no one wants them on server systems.  One of the reasons there
 aren't more GPUs in typical database server configurations is that you're
 already filling up some number of the full size slots, and correspondingly
 the bandwidth available to cards, with disk controllers.  It doesn't help
 that many server class motherboards don't even have a x16 PCI-e slot on
 them, which is what most GPUs as delivered on regular consumer video cards
 are optimized for.


The Sandy Bridge and Ivy Bridge Intel series are CPU/GPU. I don't know
how using the GPU affects the CPU part, but it might be interesting to
explore...


-- 
Cédric Villemain +33 (0)6 20 30 22 52
http://2ndQuadrant.fr/
PostgreSQL: Support 24x7 - Développement, Expertise et Formation

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers