Re: [HACKERS] CUDA Sorting

2012-02-12 Thread Oleg Bartunov

I'm wondering if CUDA will win in geometry operations, for example,
testing point @ complex_polygon

Oleg
On Sun, 12 Feb 2012, Gaetano Mendola wrote:


On 19/09/2011 16:36, Greg Smith wrote:

On 09/19/2011 10:12 AM, Greg Stark wrote:

With the GPU I'm curious to see how well
it handles multiple processes contending for resources, it might be a
flashy feature that gets lots of attention but might not really be
very useful in practice. But it would be very interesting to see.


The main problem here is that the sort of hardware commonly used for
production database servers doesn't have any serious enough GPU to
support CUDA/OpenCL available. The very clear trend now is that all
systems other than gaming ones ship with motherboard graphics chipsets
more than powerful enough for any task but that. I just checked the 5
most popular configurations of server I see my customers deploy
PostgreSQL onto (a mix of Dell and HP units), and you don't get a
serious GPU from any of them.

Intel's next generation Ivy Bridge chipset, expected for the spring of
2012, is going to add support for OpenCL to the built-in motherboard
GPU. We may eventually see that trickle into the server hardware side of
things too.



The trend is toward servers that gain CUDA-capable GPUs via external 
hardware (a PCI Express interface with PCI Express switches); look for 
example at the PowerEdge C410x PCIe Expansion Chassis from Dell.


I did some experiments timing the sort done with CUDA and the sort done with 
pg_qsort:

                     CUDA       pg_qsort
33 million integers: ~ 900 ms   ~ 6000 ms
 1 million integers: ~  21 ms   ~  162 ms
      100k integers: ~   2 ms   ~   13 ms

The CUDA times already include the copy operations (host-device, device-host).

As GPU I was using a C2050, and the CPU doing the pg_qsort was an Intel(R) 
Xeon(R) X5650 @ 2.67GHz.
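For anyone who wants to reproduce the CPU column, here is a minimal standalone sketch of that side of the benchmark (plain libc qsort standing in for pg_qsort; the data distribution is an assumption, not Gaetano's actual harness):

```c
#include <assert.h>
#include <stdlib.h>
#include <time.h>

/* Comparator giving a plain ascending int ordering, as pg_qsort would. */
static int cmp_int(const void *a, const void *b)
{
    int x = *(const int *)a, y = *(const int *)b;
    return (x > y) - (x < y);
}

/* Fill buf with n pseudo-random integers, sort them, and return the
 * elapsed sort time in milliseconds. */
double time_qsort_ms(int *buf, size_t n)
{
    clock_t start;
    size_t i;

    for (i = 0; i < n; i++)
        buf[i] = rand();
    start = clock();
    qsort(buf, n, sizeof(int), cmp_int);
    return (double) (clock() - start) * 1000.0 / CLOCKS_PER_SEC;
}
```

Calling time_qsort_ms on a 33-million-element buffer, compiled with -O2, should land in the same ballpark as the pg_qsort column above.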


Copy operations and kernel runs (the sort, for instance) can run in parallel, 
so while one batch of data is being sorted, the next batch can be copied over 
concurrently.
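In CUDA that overlap is done with streams and pinned host memory; the scheduling idea itself can be sketched host-side in plain C, with a worker thread standing in for the GPU kernel (the names, batch size, and thread-per-batch structure are all illustrative, not how a real CUDA pipeline is coded):

```c
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>
#include <string.h>

#define BATCH 4096

static int cmp_int(const void *a, const void *b)
{
    int x = *(const int *)a, y = *(const int *)b;
    return (x > y) - (x < y);
}

/* Stand-in for the GPU kernel: sorts one batch in place. */
static void *sort_batch(void *buf)
{
    qsort(buf, BATCH, sizeof(int), cmp_int);
    return NULL;
}

/* Sort nbatches batches from src into dst, overlapping the "copy" of
 * batch i with the "kernel" (sort) of batch i-1. */
void pipelined_sort(const int *src, int *dst, int nbatches)
{
    pthread_t kernel;
    int running = 0;
    int b;

    for (b = 0; b < nbatches; b++) {
        /* "Host-to-device copy" of batch b, while batch b-1 sorts. */
        memcpy(dst + (size_t) b * BATCH, src + (size_t) b * BATCH,
               BATCH * sizeof(int));
        if (running)
            pthread_join(kernel, NULL);
        pthread_create(&kernel, NULL, sort_batch, dst + (size_t) b * BATCH);
        running = 1;
    }
    if (running)
        pthread_join(kernel, NULL);
}
```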


As you can see the boost is not negligible.

NVIDIA's next hardware generation (the Kepler family) is PCI Express 3 ready, 
so expect the bottleneck of the device-host-device copies to have less impact 
in the near future.


I strongly believe there is room to give modern database engines
a way to offload sorts to the GPU.


I've never seen a PostgreSQL server capable of running CUDA, and I
don't expect that to change.


That sounds like:

I think there is a world market for maybe five computers.
- IBM Chairman Thomas Watson, 1943

Regards
Gaetano Mendola





Regards,
Oleg
_
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: o...@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Checkpoint sync pause

2012-02-12 Thread Jeff Janes
On Tue, Feb 7, 2012 at 1:22 PM, Greg Smith gsm...@gregsmith.com wrote:
 On 02/03/2012 11:41 PM, Jeff Janes wrote:

 -The steady stream of backend writes that happen between checkpoints have
 filled up most of the OS write cache.  A look at /proc/meminfo shows
 around
 2.5GB Dirty:

 backend writes includes bgwriter writes, right?


 Right.


 Has using a newer kernel with dirty_background_bytes been tried, so it
 could be set to a lower level?  If so, how did it do?  Or does it just
 refuse to obey below the 5% level, as well?


 Trying to dip below 5% using dirty_background_bytes slows VACUUM down faster
 than it improves checkpoint latency.

Does it cause VACUUM to create latency for other processes (like the
checkpoint syncs do, by gumming up the IO for everyone) or does VACUUM
just slow down without affecting other tasks?

It seems to me that just lowering dirty_background_bytes (while not
also lowering dirty_bytes) should not cause the latter to happen, but
it seems like these kernel tunables never do exactly what they
advertise.
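For reference, the Dirty figure being discussed is the one reported in /proc/meminfo; a minimal sketch of pulling it out programmatically (my own helper, not code from any patch in this thread):

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Parse the "Dirty:" value (in kB) out of /proc/meminfo-style text.
 * Returns -1 if the field is not found. */
long dirty_kb_from_meminfo(const char *text)
{
    const char *p = strstr(text, "Dirty:");
    long kb;

    if (p == NULL || sscanf(p, "Dirty: %ld kB", &kb) != 1)
        return -1;
    return kb;
}
```

Feeding it the contents of /proc/meminfo gives the figure to watch while tuning dirty_background_bytes; around 2,500,000 kB corresponds to the 2.5GB cited above.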

This may not be relevant to the current situation, but I wonder if we
don't need a vacuum_cost_page_dirty_seq so that if the pages we are
dirtying are consecutive (or at least closely spaced) they cost less,
in anticipation that the eventual writes will be combined and thus
consume less IO resources.  I would think it would be common for some
regions of table to be intensely dirtied, and some to be lightly
dirtied (but still aggregating up to a considerable amount of random
IO).   But the vacuum process might also need to be made more
bursty, as even if it generates sequential dirty pages the IO system
might write them randomly anyway if there are too many delays
interspersed.
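The proposed discount could be charged along these lines (entirely hypothetical: the name vacuum_cost_page_dirty_seq is Jeff's suggestion, but the cost values and the "closely spaced" threshold are invented for the sketch):

```c
#include <assert.h>

/* Illustrative cost constants, echoing vacuum_cost_page_dirty. */
#define VACUUM_COST_PAGE_DIRTY      20
#define VACUUM_COST_PAGE_DIRTY_SEQ   5   /* hypothetical discount */
#define SEQ_WINDOW                  16   /* "closely spaced" threshold */

/* Charge less for dirtying a page close after the previously dirtied
 * one, anticipating that the eventual writes will be combined. */
int cost_page_dirty(long blkno, long *prev_blkno)
{
    int cost;

    if (*prev_blkno >= 0 && blkno > *prev_blkno &&
        blkno - *prev_blkno <= SEQ_WINDOW)
        cost = VACUUM_COST_PAGE_DIRTY_SEQ;
    else
        cost = VACUUM_COST_PAGE_DIRTY;
    *prev_blkno = blkno;
    return cost;
}
```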


 Since the sort of servers that have
 checkpoint issues are quite often ones that have VACUUM issues, too, that
 whole path doesn't seem very productive.  The one test I haven't tried yet
 is whether increasing the size of the VACUUM ring buffer might improve how
 well the server responds to a lower write cache.

I wouldn't expect this to help.  It seems like it would hurt, as it
just leaves the data for even longer (however long it takes to
circumnavigate the ring buffer) before there is any possibility of it
getting written.  I guess it does increase the chances that the dirty
pages will accidentally get written by the bgwriter rather than the
vacuum itself, but I doubt that that would be significant.

...

 Was the sorted checkpoint with an fsync after every file (real file,
 not VFD) one of the changes you tried?


...

 I haven't had very good luck with sorting checkpoints at the PostgreSQL
 relation level on server-size systems.  There is a lot of sorting already
 happening at both the OS (~3GB) and BBWC (>=512MB) levels on this server.
  My own tests on my smaller test server--with a scaled down OS (~750MB) and
 BBWC (256MB) cache--haven't ever validated sorting as a useful technique on
 top of that.  It's never bubbled up to being considered a likely win on the
 production one as a result.

Without sorted checkpoints (or some other fancier method) you have to
write out the entire pool before you can do any fsyncs.  Or you have
to do multiple fsyncs of the same file, with at least one occurring
after the entire pool was written.  With a sorted checkpoint, you can
start issuing once-only fsyncs very early in the checkpoint process.
I think that on large servers, that would be the main benefit, not the
actually more efficient IO.  (On small servers I've seen sorted
checkpoints be much faster on shutdown checkpoints, but not on natural
checkpoints, and presumably this improvement *is* due to better
ordering).

On your servers, you need big delays between fsyncs and not between
writes (as they are buffered until the fsync).  But in other
situations, people need the delays between the writes.  By using
sorted checkpoints with fsyncs between each file, the delays between
writes are naturally delays between fsyncs as well.  So I think the
benefit of using sorted checkpoints is that code to improve your
situations is less likely to degrade someone else's situation, without
having to introduce an extra layer of tunables.
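The bookkeeping of a sorted, fsync-per-file checkpoint can be sketched like this (a simulation with invented types; real checkpoints work in terms of buffer tags and VFDs):

```c
#include <assert.h>
#include <stdlib.h>

typedef struct
{
    int file_id;   /* which relation file the dirty page lives in */
    int blkno;     /* block number within that file */
} DirtyPage;

static int cmp_page(const void *a, const void *b)
{
    const DirtyPage *x = a, *y = b;

    if (x->file_id != y->file_id)
        return (x->file_id > y->file_id) - (x->file_id < y->file_id);
    return (x->blkno > y->blkno) - (x->blkno < y->blkno);
}

/* Sort the dirty pages, write them in order, and issue one fsync per
 * file as soon as its last page has been written.  Returns the number
 * of fsyncs issued (== number of distinct files), showing how fsyncs
 * can start early rather than waiting for the whole pool. */
int sorted_checkpoint(DirtyPage *pages, int n)
{
    int fsyncs = 0;
    int i;

    qsort(pages, n, sizeof(DirtyPage), cmp_page);
    for (i = 0; i < n; i++) {
        /* write(pages[i]) would go here */
        if (i == n - 1 || pages[i + 1].file_id != pages[i].file_id)
            fsyncs++;   /* fsync(pages[i].file_id): file is complete */
    }
    return fsyncs;
}
```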



 What I/O are they trying to do?  It seems like all your data is in RAM
 (if not, I'm surprised you can get queries to run fast enough to
 create this much dirty data).  So they probably aren't blocking on
 reads which are being interfered with by all the attempted writes.


 Reads on infrequently read data.  Long tail again; even though caching is
 close to 100%, the occasional outlier client who wants some rarely accessed
 page with their personal data on it shows up.  Pollute the write caches
 badly enough, and what happens to reads mixed into there gets very fuzzy.
  Depends on the exact mechanics of the I/O scheduler used in the kernel
 version deployed.

OK, but I would still think it is a minority of transactions which
need at least one of those infrequently read data and most 

Re: [HACKERS] CUDA Sorting

2012-02-12 Thread Gaetano Mendola

On 12/02/2012 13:13, Oleg Bartunov wrote:

I'm wondering if CUDA will win in geometry operations, for example,
testing point @ complex_polygon



I'm not sure whether the algorithm you mention can be implemented in terms
of vector algebra, BLAS, etc.

There are plenty of geometry operations implemented in CUDA out there; my
own field of CUDA application is not this one, so I'm not that deep into it.

However, I can point you to the official NVIDIA NPP library, which provides
vector algebra algorithms, and some geometry algorithms as well.

http://developer.download.nvidia.com/compute/DevZone/docs/html/CUDALibraries/doc/NPP_Library.pdf

(take a look at around page 620).

Regards
Gaetano Mendola






Re: [HACKERS] Should I implement DROP INDEX CONCURRENTLY?

2012-02-12 Thread Jeff Janes
On Sun, Jan 8, 2012 at 8:19 AM, Simon Riggs si...@2ndquadrant.com wrote:
 On Wed, Jan 4, 2012 at 11:14 AM, Simon Riggs si...@2ndquadrant.com wrote:

 Not having a freelist at all is probably a simpler way of avoiding the
 lock contention, so I'll happily back that suggestion instead. Patch
 attached, previous patch revoked.

 v2 attached with cleanup of some random stuff that crept onto patch.

Hi Simon,

Based on the way this patch leaves the old code behind (using #ifdef),
this looks more like a WIP patch which you want people to do
performance testing with, rather than a patch proposed for committing.
If that is the case, could you outline the type of performance testing
where you think it would make a difference (and whether it should be
done on top of the main patch from this thread, the concurrent index
drop one)?

Also, it would be much easier to do the performance testing if this
behavior was controlled by a temporarily added GUC, rather than an
#ifdef.  Do you think it is feasible to do that, or would the overhead
of a single if (some_guc) per StrategyGetBuffer and
StrategyFreeBuffer call distort things too much?
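The GUC-guarded variant being suggested would amount to one extra branch per call, along these lines (a sketch with invented stand-ins, not the actual freelist code):

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical GUC: when false, skip the freelist entirely. */
int enable_buffer_freelist = 1;

typedef struct Buffer
{
    struct Buffer *next_free;
} Buffer;

Buffer *freelist_head;

/* Stand-in for the real clock-sweep scan over the buffer pool. */
static Buffer *clock_sweep(void)
{
    static Buffer victim;

    return &victim;
}

/* One branch on the GUC per call -- the overhead being asked about. */
Buffer *strategy_get_buffer(void)
{
    if (enable_buffer_freelist && freelist_head != NULL) {
        Buffer *buf = freelist_head;

        freelist_head = buf->next_free;
        return buf;
    }
    return clock_sweep();
}
```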

Cheers,

Jeff



[HACKERS] 3rd Cluster Hackers Summit, May 15th in Ottawa

2012-02-12 Thread Joshua Berkus
Hackers, 

NTT Open Source has requested that I convene the 3rd Cluster Hackers summit at 
pgCon this year.  As last year, it will be held on Tuesday (May 15th) during 
tutorials (and not conflicting with the Developer Summit).

If you are a contributor to any of PostgreSQL's various replication, 
clustering, or virtualization and multiserver management tools, you are invited 
to attend.  Please RSVP (see below).

Draft agenda follows.  Please let me know of any contributions/changes to the 
agenda you have:

= Project Reports: 5 minutes from each project
   * Hot Standby/Binary Replication
   * pgPoolII
   * PostgresXC
   * Your Project Here

= Technical Issues of common interest
   * SSI in cluster/replication
   * Parser export
   * Managing consistent views of data
   * Fault detection and handling
   * Node addition/removal
   * Configuration and operation
   * Cursor in replication/multi master
   * Your Issue Here

The Cluster Summit will be from 10am to 5pm, with a break for lunch, which will 
be provided, sponsored by NTT.

If you will be able to attend, please respond (offlist) to this email with the 
following:

Your Name
Project(s) you work on
If you will be giving a Project Report
If you have additions to the agenda
Special dietary needs for lunch, if any
If you need travel assistance

Note that the availability of travel funding is not guaranteed; I can just 
agree to request it.

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com
San Francisco



Re: Scaling XLog insertion (was Re: [HACKERS] Moving more work outside WALInsertLock)

2012-02-12 Thread Jeff Janes
On Thu, Feb 9, 2012 at 3:02 AM, Fujii Masao masao.fu...@gmail.com wrote:
 On Thu, Feb 9, 2012 at 7:25 PM, Fujii Masao masao.fu...@gmail.com wrote:
 On Thu, Feb 9, 2012 at 3:32 AM, Jeff Janes jeff.ja...@gmail.com wrote:

 After applying this patch and then forcing crashes, upon recovery the
 database is not correct.

 If I make a table with 10,000 rows and then after that intensively
 update it using a unique key:

 update foo set count=count+1 where foobar=?

 Then after the crash there are less than 10,000 visible rows:

 select count(*) from foo

 This is not a subtle thing; it happens every time.  I get counts of
 between 1973 and 8827.  Without this patch I always get exactly
 10,000.

 I don't really know where to start on tracking this down.

 Similar problem happened on my test. When I executed CREATE TABLE and
 shut down the server with immediate mode, after recovery I could not see the
 created table. Here are the server log of recovery with wal_debug = on:

 LOG:  database system was interrupted; last known up at 2012-02-09 19:18:50 
 JST
 LOG:  database system was not properly shut down; automatic recovery in 
 progress
 LOG:  redo starts at 0/179CC90
 LOG:  REDO @ 0/179CC90; LSN 0/179CCB8: prev 0/179CC30; xid 0; len 4:
 XLOG - nextOid: 24576
 LOG:  REDO @ 0/179CCB8; LSN 0/179CCE8: prev 0/179CC90; xid 0; len 16:
 Storage - file create: base/12277/16384
 LOG:  REDO @ 0/179CCE8; LSN 0/179DDE0: prev 0/179CCB8; xid 998; len
 21; bkpb1: Heap - insert: rel 1663/12277/12014; tid 7/22
 LOG:  there is no contrecord flag in log file 0, segment 1, offset 7987200
 LOG:  redo done at 0/179CCE8

 According to the log there is no contrecord flag; ISTM the patch treats the
 contrecord of the backup block incorrectly, which causes the problem.

 Yep, as far as I read the patch, it seems to have forgotten to set
 XLP_FIRST_IS_CONTRECORD flag.


Attached is my quick and dirty attempt to set XLP_FIRST_IS_CONTRECORD.
 I have no idea if I did it correctly, in particular if calling
GetXLogBuffer(CurrPos) twice is OK or if GetXLogBuffer has side
effects that make that a bad thing to do.  I'm not proposing it as the
real fix, I just wanted to get around this problem in order to do more
testing.

It does get rid of the "there is no contrecord flag" errors, but
recovery still does not work.

Now the count of tuples in the table is always correct (I never
provoke a crash during the initial table load), but sometimes updates
to those tuples that were reported to have been committed are lost.

This is more subtle, it does not happen on every crash.

It seems that when recovery ends with "record with zero length at ...",
the recovery is correct.

But when it ends with "invalid magic number ... in log file ...", the
recovery is screwed up.

Cheers,

Jeff


xloginsert_fix.patch
Description: Binary data



Re: [HACKERS] random_page_cost vs seq_page_cost

2012-02-12 Thread Greg Smith

On 02/11/2012 07:53 PM, Jeff Janes wrote:

Has it ever been well-characterized what the problem is with >8GB?
I've used shared buffers above that size for testing purposes and
could never provoke a problem with it.


If anyone ever manages to characterize it well, we might actually make 
progress on isolating and fixing it.  All we have so far are a couple of 
application level test results suggesting a higher value caused 
performance to drop.  The first public one I remember was from Jignesh; 
http://archives.postgresql.org/pgsql-performance/2008-02/msg00184.php 
has him quoting where he found the Solaris roll-off to be.  What 
we really need to stomp this one down is someone to find the same thing, 
then show profiler output in each case.  Note that Jignesh's report 
included a significant amount of filesystem-level tuning, using things 
like more direct I/O, and that might be a necessary requirement to run 
into the exact variant of this limitation he mentioned.


I haven't spent a lot of time looking for this problem myself.  What 
I've heard second-hand from more than one person now is a) larger 
settings than 8GB can be an improvement for some people still, and b) 
simple benchmarks don't always have this problem.  I have noted that the 
few public and private reports I've gotten all suggest problems show up 
on benchmarks of more complicated workloads.  I think Jignesh mentioned 
this being obvious in the more complicated TPC-derived benchmarks, not 
in simple things like pgbench.  I may be misquoting him though.  And 
given that one of the possible causes for this was an excess of some 
lock contention, it's quite possible this one is already gone from 9.2, 
given the large number of lock related issues that have been squashed so 
far in this release.


All of those disclaimers are why I think no one has pushed to put a note 
about this in the official docs.  Right now the only suggested limit is 
this one:


The useful range for shared_buffers on Windows systems is generally 
from 64MB to 512MB.


The most common practical limit I've run into with large shared_buffers 
settings hits earlier than 8GB:  running into checkpoint spike issues.  
I have installs that started with shared_buffers in the 4 to 8GB range, 
where we saw badly spiking I/O at checkpoint sync time.  Lowering the 
database's cache can result in smarter writing decisions within the OS, 
improving latency--even though total writes are actually higher if you 
measure what flows from the database to OS.  That side of the latency 
vs. throughput trade-off existing is one of the main reasons I haven't 
gone chasing after problems with really large shared_buffers settings.


--
Greg Smith   2ndQuadrant US   g...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com




[HACKERS] Re: [COMMITTERS] pgsql: Correctly initialise shared recoveryLastRecPtr in recovery.

2012-02-12 Thread Fujii Masao
On Mon, Jan 16, 2012 at 10:44 AM, Fujii Masao masao.fu...@gmail.com wrote:
 On Fri, Jan 13, 2012 at 10:05 PM, Simon Riggs si...@2ndquadrant.com wrote:
 Correctly initialise shared recoveryLastRecPtr in recovery.
 Previously we used ReadRecPtr rather than EndRecPtr, which was
 not a serious error but caused pg_stat_replication to report
 incorrect replay_location until at least one WAL record is replayed.

 Thanks for the commit!

 This needs to be backported to 9.1 and 9.0 because they also have
 the same problem.

http://archives.postgresql.org/message-id/14373.1328890248%40sss.pgh.pa.us

Minor version updates are scheduled to be released at the end of this month.
Before that, can you backport this commit to 9.1 and 9.0?

Regards,

-- 
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center



Re: [HACKERS] Review of pg_archivecleanup -x option patch

2012-02-12 Thread Greg Smith

On 02/01/2012 07:53 AM, Alex Shulgin wrote:

One problem I've found while trying the example workflow is this:

~/local/postgresql/HEAD$ ./bin/pg_archivecleanup -d -x .gz data/archive/ 
00010002.0020.backup.gz
pg_archivecleanup: invalid filename input
Try pg_archivecleanup --help for more information.

I think we could accept the suffixed WAL filename and strip the suffix;
another idea is to actually detect the suffix (if any) from the last
command line argument, thus rendering the -x option unnecessary.


Going full-on automatic with detecting the suffix seems like it might do 
the wrong thing when presented with a strange extension.  I know a lot 
of the time I use pg_archivecleanup is inside scripts that tend to get 
set up and then ignored.  If one of those has a bug, or unexpected data 
appears in the archive, I'd rather see this sort of major, obvious 
failure--not an attempt to sort things out that might make a bad decision.


The smaller step of automatically stripping the specified suffix from 
the input file name, when it matches the one we've told the program to 
expect, seems like a usability improvement unlikely to lead to bad 
things though.  I've certainly made the mistake you've shown when using 
the patched version of the program myself; I just didn't think about 
catching and correcting it before.  We can rev this to add that 
feature and resubmit easily enough, will turn that around soon.
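The suffix-stripping behavior described amounts to something like this (a sketch; pg_archivecleanup's actual code and identifiers differ):

```c
#include <assert.h>
#include <string.h>

/* If fname ends with suffix, truncate it in place and return 1;
 * otherwise leave fname untouched and return 0.  Only the suffix the
 * user passed with -x is ever stripped -- a strange extension still
 * fails loudly later, rather than being guessed at. */
int strip_suffix(char *fname, const char *suffix)
{
    size_t flen = strlen(fname);
    size_t slen = strlen(suffix);

    if (slen == 0 || flen <= slen)
        return 0;
    if (strcmp(fname + flen - slen, suffix) != 0)
        return 0;
    fname[flen - slen] = '\0';
    return 1;
}
```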


--
Greg Smith   2ndQuadrant US   g...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com




Re: [HACKERS] [COMMITTERS] pgsql: Add new keywords SNAPSHOT and TYPES to the keyword list in gram.

2012-02-12 Thread Heikki Linnakangas

On 11.02.2012 11:57, Alex Hunsaker wrote:

On Thu, Feb 9, 2012 at 11:30, Tom Lane t...@sss.pgh.pa.us wrote:

Alvaro Herrera alvhe...@commandprompt.com writes:

Excerpts from Tom Lane's message of Thu Feb 09 12:17:59 -0300 2012:



FWIW that script is throwing a warning here:
Use of assignment to $[ is deprecated at 
/pgsql/source/HEAD/src/tools/check_keywords.pl line 19.


Any Perl hackers want to improve that script?


The attached 2 liner does it for me.


Thanks, applied!

--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com



Re: [HACKERS] CUDA Sorting

2012-02-12 Thread Greg Smith

On 02/11/2012 08:14 PM, Gaetano Mendola wrote:
The trend is toward servers that gain CUDA-capable GPUs via external 
hardware (a PCI Express interface with PCI Express switches); look for 
example at the PowerEdge C410x PCIe Expansion Chassis from Dell.


The C410X adds 16 PCIe slots to a server, housed inside a separate 3U 
enclosure.  That's a completely sensible purchase if your goal is to 
build a computing cluster, where a lot of work is handed off to a set of 
GPUs.  I think that's even less likely to be a cost-effective option for 
a database server.  Adding a single dedicated GPU installed in a server 
to accelerate sorting is something that might be justifiable, based on 
your benchmarks.  This is a much more expensive option than that 
though.  Details at 
http://www.dell.com/us/enterprise/p/poweredge-c410x/pd for anyone who 
wants to see just how big this external box is.


I did some experiments timing the sort done with CUDA and the sort 
done with pg_qsort:

                     CUDA       pg_qsort
33 million integers: ~ 900 ms   ~ 6000 ms
 1 million integers: ~  21 ms   ~  162 ms
      100k integers: ~   2 ms   ~   13 ms

The CUDA times already include the copy operations (host-device, 
device-host).
As GPU I was using a C2050, and the CPU doing the pg_qsort was an 
Intel(R) Xeon(R) X5650 @ 2.67GHz


That's really interesting, and the X5650 is by no means a slow CPU.  So 
this benchmark is providing a lot of CPU power yet still seeing over a 
6X speedup in sort times.  It sounds like the PCI Express bus has gotten 
fast enough that the time to hand data over and get it back again can 
easily be justified for medium to large sized sorts.
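A rough back-of-the-envelope check supports that.  Assuming ~6 GB/s of effective PCIe 2.0 x16 bandwidth (an assumed figure; the C2050 is a PCIe 2.0 card), copying 33 million 4-byte integers costs about 22 ms each way:

```c
#include <assert.h>

/* Estimated one-way transfer time in milliseconds for n 4-byte integers
 * at the given effective bus bandwidth (bytes per second). */
double transfer_ms(double n_ints, double bytes_per_sec)
{
    return n_ints * 4.0 / bytes_per_sec * 1000.0;
}
```

With those assumptions, transfer_ms(33e6, 6e9) comes out to 22 ms, so the two copies account for well under 50 ms of the ~900 ms total; most of the time is the on-GPU sort itself.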


It would be helpful to take this patch and confirm whether it scales 
when using in parallel.  Easiest way to do that would be to use the 
pgbench -f feature, which allows running an arbitrary number of some 
query at once.  Seeing whether this acceleration continued to hold as 
the number of clients increases is a useful data point.


Is it possible for you to break down where the time is being spent?  For 
example, how much of this time is consumed in the GPU itself, compared 
to time spent transferring data between CPU and GPU?  I'm also curious 
where the bottleneck is at with this approach.  If it's the speed of the 
PCI-E bus for smaller data sets, adding more GPUs may never be 
practical.  If the bus can handle quite a few of these at once before it 
saturates, it might be possible to overload a single GPU.  That seems 
like it would be really hard to reach for database sorting, though I 
can't really justify my gut feeling that that's true.



 I've never seen a PostgreSQL server capable of running CUDA, and I
 don't expect that to change.

That sounds like:

I think there is a world market for maybe five computers.
- IBM Chairman Thomas Watson, 1943


Yes, and "640K will be enough for everyone", ha ha.  (Having said the 
640K thing is flat-out denied by Gates, BTW, and no one has come up with 
proof otherwise.)


I think you've made an interesting case for this sort of acceleration 
now being useful for systems doing what's typically considered a data 
warehouse task.  I regularly see servers waiting for far more than 13M 
integers to sort.  And I am seeing a clear trend toward providing more 
PCI-E slots in servers now.  Dell's R810 is the most popular single 
server model my customers have deployed in the last year, and it has 5 
X8 slots in it.  It's rare all 5 of those are filled.  As long as a 
dedicated GPU works fine when dropped to X8 speeds, I know a fair number 
of systems where one of those could be added now.


There's another data point in your favor I didn't notice before your 
last e-mail.  Amazon has a Cluster GPU Quadruple Extra Large node type 
that runs with NVIDIA Tesla hardware.  That means the installed base of 
people who could consider CUDA is higher than I expected.  To 
demonstrate how much that costs, to provision a GPU enabled reserved 
instance from Amazon for one year costs $2410 at Light Utilization, 
giving a system with 22GB of RAM and 1690GB of storage.  (I find the 
reserved prices easier to compare with dedicated hardware than the 
hourly ones)  That's halfway between the High-Memory Double Extra Large 
Instance (34GB RAM/850GB disk) at $1100 and the High-Memory Quadruple 
Extra Large Instance (64GB RAM/1690GB disk) at $2200.  If someone could 
prove sorting was a bottleneck on their server, that isn't an 
unreasonable option to consider on a cloud-based database deployment.


I still think that an approach based on OpenCL is more likely to be 
suitable for PostgreSQL, which was part of why I gave CUDA low odds 
here.  The points in favor of OpenCL are:


-Since you last posted, OpenCL compiling has switched to using LLVM as 
their standard compiler.  Good PostgreSQL support for LLVM isn't far 
away.  It looks to me like the compiler situation for CUDA requires 
their PathScale based compiler.  I