Re: [HACKERS] [PERFORM] A Better External Sort?

2005-10-08 Thread Jim C. Nasby
On Thu, Sep 29, 2005 at 03:28:27PM +0200, Zeugswetter Andreas DAZ SD wrote:
 
  In my original example, a sequential scan of the 1TB of 2KB 
  or 4KB records, = 250M or 500M records of data, being sorted 
  on a binary value key will take ~1000x more time than reading 
  in the ~1GB Btree I described that used a Key+RID (plus node 
  pointers) representation of the data.
 
 Imho you seem to ignore the final step your algorithm needs: collecting the
 data rows. After you have sorted the keys, the collect step will effectively
 access the tuples in random order (given a sufficiently large key range).
 
 This random access is bad. It effectively allows a competing algorithm
 to read the whole data at least 40 times sequentially, or write the set
 20 times sequentially. (Those are the random/sequential ratios of modern discs.)

True, but there is a compromise... not shuffling full tuples around when
sorting in memory. Do your sorting with pointers, then write the full
tuples out to 'tape' if needed.
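
A minimal sketch of that compromise, assuming a hypothetical tuple layout
(this is not the actual tuplesort.c code): sort an array of pointers and
only move the wide tuples when writing them out.

#include <stdlib.h>

/* Hypothetical in-memory tuple: only the key matters while sorting. */
typedef struct ExampleTuple
{
    int  key;
    char payload[100];          /* wide payload we'd rather not shuffle */
} ExampleTuple;

static int
compare_tuple_ptrs(const void *a, const void *b)
{
    const ExampleTuple *ta = *(ExampleTuple *const *) a;
    const ExampleTuple *tb = *(ExampleTuple *const *) b;

    return (ta->key > tb->key) - (ta->key < tb->key);
}

/*
 * Sort pointers, not tuples: qsort moves pointer-sized entries instead of
 * 100+ byte tuples; the payloads are only touched when the sorted result
 * is written out to "tape".
 */
void
sort_by_pointer(ExampleTuple *tuples, size_t n, ExampleTuple **out)
{
    size_t i;

    for (i = 0; i < n; i++)
        out[i] = &tuples[i];
    qsort(out, n, sizeof(ExampleTuple *), compare_tuple_ptrs);
}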

Of course the other issue here is that as correlation improves it
becomes better and better to do full pointer-based sorting.
-- 
Jim C. Nasby, Sr. Engineering Consultant  [EMAIL PROTECTED]
Pervasive Software      http://pervasive.com    work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf   cell: 512-569-9461



Re: [HACKERS] [PERFORM] A Better External Sort?

2005-10-06 Thread Hannu Krosing
On K, 2005-10-05 at 13:21 -0400, Ron Peacetree wrote:
 First I wanted to verify that pg's IO rates were inferior to The Competition.
 Now there's at least an indication that someone else has solved similar
 problems.  Existence proofs make some things easier ;-)
 
 Is there any detailed programmer level architectural doc set for pg?  I know
 the best doc is the code,

For postgres, the best docs are often in the code, in the form of
comments.

-- 
Hannu Krosing [EMAIL PROTECTED]




Re: [HACKERS] [PERFORM] A Better External Sort?

2005-10-06 Thread Hannu Krosing
On K, 2005-10-05 at 19:54 -0400, Ron Peacetree wrote:

 +I made the from left field suggestion that perhaps a pg native fs
 format would be worth consideration.  This is a major project, so
 the suggestion was to at least some extent tongue-in-cheek.

This idea is discussed about once a year on hackers. If you are more
interested in this, search the archives :)

-- 
Hannu Krosing [EMAIL PROTECTED]




Re: [HACKERS] [PERFORM] A Better External Sort?

2005-10-06 Thread Zeugswetter Andreas DAZ SD

 Now I've asked for the quickest path to detailed 
 understanding of the pg IO subsystem.  The goal being to get 
 more up to speed on its coding details.  Certainly not to 
 annoy you or anyone else.

Basically pg does random 8k (compile-time blocksize) reads/writes only.
Bitmap and sequential scans read 8k blocks in order.
Only WAL does n x 8k writes with one system call.

pg relies on the OS readahead (== larger block IO) to do efficient IO.
Basically the pg scan performance should match a
dd if=file of=/dev/null bs=8k, unless CPU bound.
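
A rough sketch of what that comparison measures (not PostgreSQL code; the
file name and sizes are illustrative): read a file sequentially in 8k
chunks, the same request pattern a sequential scan presents to the OS.

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

#define BLCKSZ 8192             /* PostgreSQL's default compile-time block size */

/* Roughly what "dd if=file of=/dev/null bs=8k" does. */
int
main(int argc, char **argv)
{
    char      buf[BLCKSZ];
    ssize_t   nread;
    long long total = 0;
    int       fd;

    if (argc < 2)
        return 1;
    fd = open(argv[1], O_RDONLY);
    if (fd < 0)
    {
        perror("open");
        return 1;
    }
    while ((nread = read(fd, buf, sizeof(buf))) > 0)
        total += nread;         /* OS readahead turns this into larger physical IO */
    printf("read %lld bytes in %d-byte blocks\n", total, BLCKSZ);
    close(fd);
    return 0;
}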

Andreas



Re: [HACKERS] [PERFORM] A Better External Sort?

2005-10-06 Thread Michael Stone

On Wed, Oct 05, 2005 at 04:55:51PM -0700, Luke Lonergan wrote:

You've proven my point completely.  This process is bottlenecked in the CPU.
The only way to improve it would be to optimize the system (libc) functions
like fread where it is spending most of its time.


Or to optimize its IO handling to be more efficient. (E.g., use larger
blocks to reduce the number of syscalls.) 


Mike Stone



Re: [HACKERS] [PERFORM] A Better External Sort?

2005-10-06 Thread Josh Berkus

Andreas,


pg relies on the OS readahead (== larger block IO) to do efficient IO.
Basically the pg scan performance should match a
dd if=file of=/dev/null bs=8k, unless CPU bound.


FWIW, we could improve performance by creating larger write blocks when 
appropriate, particularly on Unixes like Solaris.  But that's a bad 
effort/result tradeoff for most OSes, so it's not the route I'd be 
suggesting for general scans.


However, the external sort code could possibly be improved by more 
appropriate block sizing, which I think someone has already suggested.


--Josh




Re: [HACKERS] [PERFORM] A Better External Sort?

2005-10-06 Thread Martijn van Oosterhout
On Wed, Oct 05, 2005 at 07:54:15PM -0400, Ron Peacetree wrote:
 I asked some questions about physical layout and format translation
 overhead being possibly suboptimal that seemed to be agreed to, but
 specifics as to where we are taking the hit don't seem to have been
 made explicit yet.

This hit is easy to see and terribly hard to do anything about at the
same time. Any single row in a table stores its values, but the offsets
aren't constant. If a field is NULL, it is skipped. If a field is
variable length, you have to look at the length before you can jump
over to the next value.

If you have no NULLs and no variable length fields, then you can
optimise access. This is already done and it's hard to see how you
could improve it further. To cut costs, many places use
heap_deform_tuple and similar routines so that the costs are reduced,
but they're still there.
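
A simplified illustration of why the offsets aren't constant (this is not
the real heap tuple format or heap_deform_tuple; alignment and header
details are omitted): every preceding NULL or variable-length value shifts
the position of the attribute you want.

#include <stdint.h>
#include <string.h>

/* Toy tuple: a null bitmap followed by packed attribute data, with
 * variable-length attributes prefixed by a 4-byte length word. */
typedef struct FakeTuple
{
    uint8_t nullbits[8];        /* bit i set => attribute i is NULL */
    char    data[];             /* packed attribute values */
} FakeTuple;

/* Finding attribute attno is O(attno), not O(1): we must walk past all
 * earlier attributes because their widths are only known at run time. */
const char *
fake_getattr(const FakeTuple *tup, int attno, const int *attlen, int natts)
{
    size_t off = 0;
    int    i;

    for (i = 0; i < natts && i <= attno; i++)
    {
        if (tup->nullbits[i / 8] & (1 << (i % 8)))
            continue;                   /* NULL: stored in zero bytes, just skip */
        if (i == attno)
            return tup->data + off;
        if (attlen[i] > 0)
            off += attlen[i];           /* fixed-width attribute */
        else
        {
            int32_t len;

            memcpy(&len, tup->data + off, sizeof(len));
            off += sizeof(len) + len;   /* variable-length attribute */
        }
    }
    return NULL;                        /* attribute is NULL or out of range */
}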

Upping the data transfer rate from disk is a worthy goal, just some
people believe it is of lower priority than improving CPU usage.

 We were also told that evidently we are CPU bound far before one
 would naively expect to be based on the performance specifications
 of the components involved.

As someone pointed out, calls to the C library are not counted
separately, making it harder to see if we're overcalling some of them.
Pinpointing the performance bottleneck is hard work.

 Now I've asked for the quickest path to detailed understanding of the
 pg IO subsystem.  The goal being to get more up to speed on its
 coding details.  Certainly not to annoy you or anyone else.

Well, the work is all in storage/smgr and storage/file. It's not
terribly complicated, it just sometimes takes a while to understand
*why* it is done this way.

Indeed, one of the things on my list is to remove all the lseeks in
favour of pread. Halving the number of kernel calls has got to be worth
something, right? Portability is an issue of course...
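
A hedged sketch of that substitution, assuming a platform whose libc has a
native pread(2): the lseek()+read() pair becomes one kernel call, and the
shared file position is left untouched.

#include <sys/types.h>
#include <unistd.h>

/* Current pattern: two kernel calls per block read. */
ssize_t
read_block_lseek(int fd, void *buf, size_t len, off_t offset)
{
    if (lseek(fd, offset, SEEK_SET) < 0)
        return -1;
    return read(fd, buf, len);
}

/* With pread: a single call that neither uses nor changes the file
 * position, so it is also friendlier if the descriptor is shared. */
ssize_t
read_block_pread(int fd, void *buf, size_t len, off_t offset)
{
    return pread(fd, buf, len, offset);
}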

But it's been a productive thread, absolutely. Progress has been made...

Have a nice day,
-- 
Martijn van Oosterhout   kleptog@svana.org   http://svana.org/kleptog/
 Patent. n. Genius is 5% inspiration and 95% perspiration. A patent is a
 tool for doing 5% of the work and then sitting around waiting for someone
 else to do the other 95% so you can sue them.




Re: [HACKERS] [PERFORM] A Better External Sort?

2005-10-06 Thread Luke Lonergan
Andreas,

On 10/6/05 3:56 AM, Zeugswetter Andreas DAZ SD [EMAIL PROTECTED]
wrote:

 pg relies on the OS readahead (== larger block IO) to do efficient IO.
 Basically the pg scan performance should match a
 dd if=file of=/dev/null bs=8k, unless CPU bound.

Which it is.  Postgres will currently do a maximum of 120MB/s scan rate on
modern CPUs, no matter how fast the underlying I/O subsystem is.  Several
people on this list have confirmed that by posting 100MB/s numbers on faster
subsystems.  Others have posted that Oracle/DB2/etc run at near disk rates.
IMO, most of the problem is that there are too many small (tuple, single
block) operations in the I/O path and they need to be optimized out.

The first step is recognition that there is a problem worth solving, and it
seems that there are now many who have recognized the issue of poor I/O path
optimization in Postgres.  I think a prioritized short list might look like
this:

- Sort performance
  - The implementation of a possibly fine algorithm yields very poor use of
the underlying hardware.  It needs to be optimized to remove redundant I/O
operations and possibly (with different algorithm?) to make better use of
available RAM.  This would be practice for the next item on the list:

- Scan performance
  - Someone needs to trace the life of a tuple through the executor and
implement better buffering at each stage of the process.  There was a start
at this, but it only buffered in one node.  Yes, I know we have a shared
memory buffer, but it acts as a final destination, this is about the path
through the executor.
  - I suspect that after profiling (with system routines included), we will
find much more tuple-scale work interfering with the optimal flow of
tuples through the executor.  The problem is unlikely to be isolated to
the COUNT() aggregate; it is present in many other nodes.  Redundant
allocation/free of memory on a per-tuple basis should be replaced with
more persistent buffers on a page or larger basis (a rough sketch follows
below this list).
  - Ultimately, after working through the above two issues, we will reach a
point of diminishing returns where async I/O is needed to be able to remove
producer/consumer problems in a single threaded executor. This can be
implemented using sliced processes or threading, or by using aio.  I believe
that even without async I/O, we should be seeing scan rates of 200-400 MB/s
on modern CPUs for the simple COUNT() aggregation pattern though.
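
A rough sketch of the per-tuple allocation point above (hypothetical names,
not executor code): allocate one page-sized scratch buffer per scan and
reuse it for every tuple, instead of allocating and freeing per tuple.

#include <stdlib.h>
#include <string.h>

#define WORKBUF_SIZE 8192       /* one page-sized scratch buffer */

typedef struct ScanState
{
    char *workbuf;              /* allocated once per scan, not per tuple */
} ScanState;

void
scan_init(ScanState *state)
{
    state->workbuf = malloc(WORKBUF_SIZE);
}

/* Per-tuple path: no malloc/free, just reuse the persistent buffer. */
const char *
copy_tuple_into_workbuf(ScanState *state, const char *tuple, size_t len)
{
    if (len > WORKBUF_SIZE)
        return NULL;            /* oversized tuples would need a separate path */
    memcpy(state->workbuf, tuple, len);
    return state->workbuf;
}

void
scan_done(ScanState *state)
{
    free(state->workbuf);
}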

FWIW,

- Luke 





Re: [HACKERS] [PERFORM] A Better External Sort?

2005-10-06 Thread Tom Lane
Martijn van Oosterhout kleptog@svana.org writes:
 Indeed, one of the things on my list is to remove all the lseeks in
 favour of pread. Halving the number of kernel calls has got to be worth
 something, right? Portability is an issue of course...

Being sure that it's not a pessimization is another issue.  I note that
glibc will emulate these functions if the kernel doesn't have them;
which means you could be replacing one kernel call with three.

And I don't think autoconf has any way to determine whether a libc
function represents a native kernel call or not ...

regards, tom lane



Re: [HACKERS] [PERFORM] A Better External Sort?

2005-10-06 Thread Tom Lane
Martijn van Oosterhout kleptog@svana.org writes:
 Are we awfully worried about people still using 2.0 kernels? And it
 would replace two calls with three in the worst case, we currently
 lseek before every read.

That's utterly false.

regards, tom lane



Re: [HACKERS] [PERFORM] A Better External Sort?

2005-10-06 Thread Alvaro Herrera
On Thu, Oct 06, 2005 at 03:57:38PM -0400, Tom Lane wrote:
 Martijn van Oosterhout kleptog@svana.org writes:
  Indeed, one of the things on my list is to remove all the lseeks in
  favour of pread. Halving the number of kernel calls has got to be worth
  something, right? Portability is an issue of course...
 
 Being sure that it's not a pessimization is another issue.  I note that
 glibc will emulate these functions if the kernel doesn't have them;
 which means you could be replacing one kernel call with three.
 
 And I don't think autoconf has any way to determine whether a libc
 function represents a native kernel call or not ...

The problem kernels would be Linux 2.0, which I very much doubt is going
to be present in to-be-deployed database servers.

Unless someone runs glibc on top of some other kernel, I guess.  Is this
a common scenario?  I've never seen it.

-- 
Alvaro Herrera  http://www.amazon.com/gp/registry/DXLWNGRJD34
Oh, oh, las chicas galacianas, lo harán por las perlas,
¡Y las de Arrakis por el agua! Pero si buscas damas
Que se consuman como llamas, ¡Prueba una hija de Caladan! (Gurney Halleck)



Re: [HACKERS] [PERFORM] A Better External Sort?

2005-10-06 Thread Martijn van Oosterhout
On Thu, Oct 06, 2005 at 03:57:38PM -0400, Tom Lane wrote:
 Martijn van Oosterhout kleptog@svana.org writes:
  Indeed, one of the things on my list is to remove all the lseeks in
  favour of pread. Halving the number of kernel calls has got to be worth
  something, right? Portability is an issue of course...
 
 Being sure that it's not a pessimization is another issue.  I note that
 glibc will emulate these functions if the kernel doesn't have them;
 which means you could be replacing one kernel call with three.

From the linux pread manpage:

HISTORY
   The pread and pwrite system calls were added to Linux in version
   2.1.60; the entries in the i386 system call table were added in
   2.1.69.  The libc support (including emulation on older kernels
   without the system calls) was added in glibc 2.1.

Are we awfully worried about people still using 2.0 kernels? And it
would replace two calls with three in the worst case, we currently
lseek before every read.

I don't know about other OSes.
-- 
Martijn van Oosterhout   kleptog@svana.org   http://svana.org/kleptog/
 Patent. n. Genius is 5% inspiration and 95% perspiration. A patent is a
 tool for doing 5% of the work and then sitting around waiting for someone
 else to do the other 95% so you can sue them.




Re: [HACKERS] [PERFORM] A Better External Sort?

2005-10-06 Thread Martijn van Oosterhout
On Thu, Oct 06, 2005 at 04:25:11PM -0400, Tom Lane wrote:
 Martijn van Oosterhout kleptog@svana.org writes:
  Are we awfully worried about people still using 2.0 kernels? And it
  would replace two calls with three in the worst case, we currently
  lseek before every read.
 
 That's utterly false.

Oops, you're right. I usually strace during a vacuum or a large query
and my screen fills up with:

lseek()
read()
lseek()
read()
...

So I didn't check whether the straight sequential read path was already
optimised. Still, I think pread() would be a worthwhile improvement, at
least for Linux.

-- 
Martijn van Oosterhout   kleptog@svana.org   http://svana.org/kleptog/
 Patent. n. Genius is 5% inspiration and 95% perspiration. A patent is a
 tool for doing 5% of the work and then sitting around waiting for someone
 else to do the other 95% so you can sue them.




Re: [HACKERS] [PERFORM] A Better External Sort?

2005-10-05 Thread Michael Stone

On Mon, Oct 03, 2005 at 01:34:01PM -0700, Josh Berkus wrote:

Realistically, you can't do better than about 25MB/s on a
 single-threaded I/O on current Linux machines,

What on earth gives you that idea? Did you drop a zero?


Nope, LOTS of testing, at OSDL, GreenPlum and Sun.   For comparison, A 
Big-Name Proprietary Database doesn't get much more than that either.


You seem to be talking about database IO, which isn't what you said.




Re: [HACKERS] [PERFORM] A Better External Sort?

2005-10-05 Thread Martijn van Oosterhout
On Wed, Oct 05, 2005 at 05:41:25AM -0400, Michael Stone wrote:
 On Sat, Oct 01, 2005 at 06:19:41PM +0200, Martijn van Oosterhout wrote:
 COPY TO /dev/null WITH binary
 13MB/s    55% user 45% system  (ergo, CPU bound)
 [snip]
 the most expensive. But it does point out that the whole process is
 probably CPU bound more than anything else.
 
 Note that 45% of that cpu usage is system--which is where IO overhead
 would end up being counted. Until you profile where you system time is
 going it's premature to say it isn't an IO problem.

It's a dual CPU system, so 50% is the limit for a single process. Since
system usage < user, PostgreSQL is the limiter. Sure, the system is
taking a lot of time, but PostgreSQL is still the limiting factor.

Anyway, the later measurements using gprof exclude system time
altogether and it still shows CPU being the limiting factor. Fact is,
extracting tuples from pages is expensive.
-- 
Martijn van Oosterhout   kleptog@svana.org   http://svana.org/kleptog/
 Patent. n. Genius is 5% inspiration and 95% perspiration. A patent is a
 tool for doing 5% of the work and then sitting around waiting for someone
 else to do the other 95% so you can sue them.




Re: [HACKERS] [PERFORM] A Better External Sort?

2005-10-05 Thread Michael Stone

On Sat, Oct 01, 2005 at 06:19:41PM +0200, Martijn van Oosterhout wrote:

COPY TO /dev/null WITH binary
13MB/s    55% user 45% system  (ergo, CPU bound)

[snip]

the most expensive. But it does point out that the whole process is
probably CPU bound more than anything else.


Note that 45% of that cpu usage is system--which is where IO overhead
would end up being counted. Until you profile where you system time is
going it's premature to say it isn't an IO problem.

Mike Stone




Re: [HACKERS] [PERFORM] A Better External Sort?

2005-10-05 Thread Michael Stone

On Tue, Oct 04, 2005 at 12:43:10AM +0300, Hannu Krosing wrote:

Just FYI, I ran a count(*) on a 15.6GB table on a lightly loaded db and
it ran in 163 sec. (Dual opteron 2.6GHz, 6GB RAM, 6 x 74GB 15k disks in
RAID10, reiserfs). A little less than 100MB/sec.


And none of that 15G table is in the 6G RAM?

Mike Stone



Re: [HACKERS] [PERFORM] A Better External Sort?

2005-10-05 Thread Luke Lonergan
Nope - it would be disk wait.

COPY is CPU bound on I/O subsystems faster than 50 MB/s on COPY (in) and about
15 MB/s (out).

- Luke

 -Original Message-
From:   Michael Stone [mailto:[EMAIL PROTECTED]
Sent:   Wed Oct 05 09:58:41 2005
To: Martijn van Oosterhout
Cc: pgsql-hackers@postgresql.org; pgsql-performance@postgresql.org
Subject: Re: [HACKERS] [PERFORM] A Better External Sort?

On Sat, Oct 01, 2005 at 06:19:41PM +0200, Martijn van Oosterhout wrote:
COPY TO /dev/null WITH binary
13MB/s    55% user 45% system  (ergo, CPU bound)
[snip]
the most expensive. But it does point out that the whole process is
probably CPU bound more than anything else.

Note that 45% of that cpu usage is system--which is where IO overhead
would end up being counted. Until you profile where you system time is
going it's premature to say it isn't an IO problem.

Mike Stone







Re: [HACKERS] [PERFORM] A Better External Sort?

2005-10-05 Thread Ron Peacetree
I've now gotten verification from multiple working DBA's that DB2, Oracle, and
SQL Server can achieve ~250MBps ASTR (with as much as ~500MBps ASTR in
setups akin to Oracle RAC) when attached to a decent (not outrageous, but
decent) HD subsystem...

I've not yet had any RW DBA verify Jeff Baker's supposition that ~1GBps ASTR is
attainable.  Cache based bursts that high, yes.  ASTR, no.

The DBAs in question run RW installations that include Solaris, M$, and Linux
OSs, for companies that just about everyone on these lists is likely to recognize.

Also, the implication of these pg IO limits is that money spent on even
moderately priced 300MBps SATA II based RAID HW is wasted $'s.

In total, this situation is a recipe for driving potential pg users to other
DBMS.

25MBps in and 15MBps out is =BAD=.

Have we instrumented the code in enough detail that we can tell _exactly_ where
the performance drainage is?

We have to fix this.
Ron  


-Original Message-
From: Luke Lonergan [EMAIL PROTECTED]
Sent: Oct 5, 2005 11:24 AM
To: Michael Stone [EMAIL PROTECTED], Martijn van Oosterhout 
kleptog@svana.org
Cc: pgsql-hackers@postgresql.org, pgsql-performance@postgresql.org
Subject: Re: [HACKERS] [PERFORM] A Better External Sort?

Nope - it would be disk wait.

COPY is CPU bound on I/O subsystems faster than 50 MB/s on COPY (in) and about
15 MB/s (out).

- Luke

 -Original Message-
From:   Michael Stone [mailto:[EMAIL PROTECTED]
Sent:   Wed Oct 05 09:58:41 2005
To: Martijn van Oosterhout
Cc: pgsql-hackers@postgresql.org; pgsql-performance@postgresql.org
Subject: Re: [HACKERS] [PERFORM] A Better External Sort?

On Sat, Oct 01, 2005 at 06:19:41PM +0200, Martijn van Oosterhout wrote:
COPY TO /dev/null WITH binary
13MB/s    55% user 45% system  (ergo, CPU bound)
[snip]
the most expensive. But it does point out that the whole process is
probably CPU bound more than anything else.

Note that 45% of that cpu usage is system--which is where IO overhead
would end up being counted. Until you profile where you system time is
going it's premature to say it isn't an IO problem.

Mike Stone









Re: [HACKERS] [PERFORM] A Better External Sort?

2005-10-05 Thread Joshua D. Drake

 We have to fix this.
 Ron  
 


The source is freely available for your perusal. Please feel free to
point us in specific directions in the code where you may see some
benefit. I am positive all of us that can, would put resources into
fixing the issue had we a specific direction to attack.

Sincerely,

Joshua D. Drake


-- 
Your PostgreSQL solutions company - Command Prompt, Inc. 1.800.492.2240
PostgreSQL Replication, Consulting, Custom Programming, 24x7 support
Managed Services, Shared and Dedicated Hosting
Co-Authors: plPHP, plPerlNG - http://www.commandprompt.com/





Re: [HACKERS] [PERFORM] A Better External Sort?

2005-10-05 Thread Ron Peacetree
First I wanted to verify that pg's IO rates were inferior to The Competition.
Now there's at least an indication that someone else has solved similar
problems.  Existence proofs make some things easier ;-)

Is there any detailed programmer level architectural doc set for pg?  I know
the best doc is the code, but the code in isolation is often the Slow Path to
understanding with systems as complex as a DBMS IO layer.

Ron
 

-Original Message-
From: Joshua D. Drake [EMAIL PROTECTED]
Sent: Oct 5, 2005 1:18 PM
Subject: Re: [HACKERS] [PERFORM] A Better External Sort?


The source is freely available for your perusal. Please feel free to
point us in specific directions in the code where you may see some
benefit. I am positive all of us that can, would put resources into
fixing the issue had we a specific direction to attack.

Sincerely,

Joshua D. Drake



Re: [HACKERS] [PERFORM] A Better External Sort?

2005-10-05 Thread Michael Stone

On Wed, Oct 05, 2005 at 11:24:07AM -0400, Luke Lonergan wrote:

Nope - it would be disk wait.


I said I/O overhead; i.e., it could be the overhead of calling the
kernel for I/O's. E.g., the following process is having I/O problems:

time dd if=/dev/sdc of=/dev/null bs=1 count=10000000
10000000+0 records in
10000000+0 records out
10000000 bytes transferred in 8.887845 seconds (1125132 bytes/sec)

real    0m8.889s
user    0m0.877s
sys     0m8.010s


it's not in disk wait state (in fact the whole read was cached) but it's
only getting 1MB/s. 


Mike Stone



Re: [HACKERS] [PERFORM] A Better External Sort?

2005-10-05 Thread Jeffrey W. Baker
On Wed, 2005-10-05 at 12:14 -0400, Ron Peacetree wrote:
 I've now gotten verification from multiple working DBA's that DB2, Oracle, and
 SQL Server can achieve ~250MBps ASTR (with as much as ~500MBps ASTR in
 setups akin to Oracle RAC) when attached to a decent (not outrageous, but
 decent) HD subsystem...
 
 I've not yet had any RW DBA verify Jeff Baker's supposition that ~1GBps ASTR 
 is
 attainable.  Cache based bursts that high, yes.  ASTR, no.

I find your tone annoying.  That you do not have access to this level of
hardware proves nothing, other than pointing out that your repeated
emails on this list are based on supposition.

If you want 1GB/sec STR you need:

1) 1 or more Itanium CPUs
2) 24 or more disks
3) 2 or more SATA controllers
4) Linux

Have fun.

-jwb



Re: [HACKERS] [PERFORM] A Better External Sort?

2005-10-05 Thread Andrej Ricnik-Bay
On 10/6/05, Michael Stone [EMAIL PROTECTED] wrote:
 On Wed, Oct 05, 2005 at 11:24:07AM -0400, Luke Lonergan wrote:
 Nope - it would be disk wait.

 I said I/O overhead; i.e., it could be the overhead of calling the
 kernel for I/O's. E.g., the following process is having I/O problems:

 time dd if=/dev/sdc of=/dev/null bs=1 count=10000000
 10000000+0 records in
 10000000+0 records out
 10000000 bytes transferred in 8.887845 seconds (1125132 bytes/sec)

 real    0m8.889s
 user    0m0.877s
 sys     0m8.010s

 it's not in disk wait state (in fact the whole read was cached) but it's
 only getting 1MB/s.

 Mike Stone


I think you only proved that dd isn't the smartest tool out there... or
that using it with a blocksize of 1 byte doesn't make too much sense.


[EMAIL PROTECTED]:~]$ time dd if=/dev/sr0 of=/dev/null bs=2048 count=4883
4883+0 records in
4883+0 records out

real    0m6.824s
user    0m0.010s
sys     0m0.060s
[EMAIL PROTECTED]:~]$ time dd if=/dev/sr0 of=/dev/null bs=1 count=10000000
10000000+0 records in
10000000+0 records out

real    0m18.523s
user    0m7.410s
sys     0m10.310s
[EMAIL PROTECTED]:~]$ time dd if=/dev/sr0 of=/dev/null bs=8192 count=1220
1220+0 records in
1220+0 records out

real    0m6.796s
user    0m0.000s
sys     0m0.070s

That's with caching, and all.  Or did I miss the point of your post
completely?  Interestingly, the CPU usage with bs=1 goes up
to 97%, while it stays at a mellow 3% with 8192 and 2048.


Cheers,
Andrej



Re: [HACKERS] [PERFORM] A Better External Sort?

2005-10-05 Thread Jonah H. Harris
Ron,

This thread is getting on my nerves.  Your tone in some of the other
posts (as well as this one) is getting very annoying.  Yes,
PostgreSQL's storage manager (like all other open source databases)
lacks many of the characteristics and enhancements of the commercial
databases.  Unlike Oracle, Microsoft, etc., the PostgreSQL Global
Development Group doesn't have the tens of millions of dollars
required to pay hundreds of developers around the world for
round-the-clock development and R&D.  Making sure that every little
tweak, on every system, is taken advantage of is expensive (in terms
of time) for an open source project where little ROI is gained.
Before you make a statement like "I wanted to verify that pg's IO
rates were inferior to The Competition", think about how you'd write
your own RDBMS from scratch (in reality, not in theory).

As for your question regarding developer docs for the storage manager
and related components, read the READMEs and the code... just like
everyone else.

Rather than posting more assumptions and theory, please read through
the code and come back with actual suggestions.

-Jonah

2005/10/5, Ron Peacetree [EMAIL PROTECTED]:
 First I wanted to verify that pg's IO rates were inferior to The Competition.
 Now there's at least an indication that someone else has solved similar
 problems.  Existence proofs make some things easier ;-)

 Is there any detailed programmer level architectural doc set for pg?  I know
 the best doc is the code, but the code in isolation is often the Slow Path to
 understanding with systems as complex as a DBMS IO layer.

 Ron


 -Original Message-
 From: Joshua D. Drake [EMAIL PROTECTED]
 Sent: Oct 5, 2005 1:18 PM
 Subject: Re: [HACKERS] [PERFORM] A Better External Sort?


 The source is freely available for your perusal. Please feel free to
 point us in specific directions in the code where you may see some
 benefit. I am positive all of us that can, would put resources into
 fixing the issue had we a specific direction to attack.

 Sincerely,

 Joshua D. Drake




--
Respectfully,

Jonah H. Harris, Database Internals Architect
EnterpriseDB Corporation
http://www.enterprisedb.com/



Re: [HACKERS] [PERFORM] A Better External Sort?

2005-10-05 Thread Ron Peacetree
I'm putting in as much time as I can afford thinking about pg related
performance issues.  I'm doing it because of a sincere desire to help
understand and solve them, not to annoy people.

If I didn't believe in pg, I would't be posting thoughts about how to
make it better.  

It's probably worth some review (suggestions marked with a +):

+I came to the table with a possibly better way to deal with external
sorts (that now has branched into 2 efforts: short term improvements
to the existing code, and the original from-the-ground-up idea).  That
suggestion was based on a great deal of prior thought and research,
despite what some others might think.

Then we were told that our IO limit was lower than I thought.

+I suggested that as a Quick Fix we try making sure we do IO
transfers in large enough chunks, based on the average access time
of the physical device in question, so as to achieve the device's
ASTR (i.e. at least 600KB per access for a 50MBps ASTR device with
a 12ms average access time) whenever circumstances allow us to.
As far as I know, this experiment hasn't been tried yet.

I asked some questions about physical layout and format translation
overhead being possibly suboptimal that seemed to be agreed to, but
specifics as to where we are taking the hit don't seem to have been
made explicit yet.

+I made the from left field suggestion that perhaps a pg native fs
format would be worth consideration.  This is a major project, so
the suggestion was to at least some extent tongue-in-cheek.

+I then made some suggestions about better code instrumentation
so that we can more accurately characterize where the bottlenecks are.

We were also told that evidently we are CPU bound far before one
would naively expect to be based on the performance specifications
of the components involved.

Double checking among the pg developer community led to some
differing opinions as to what the actual figures were and under what
circumstances they were achieved.  Further discussion seems to have
converged on both accurate values and a better understanding as to
the HW and SW  needed; _and_ we've gotten some RW confirmation
as to what current reasonable expectations are within this problem
domain from outside the pg community.

+Others have made some good suggestions in this thread as well.
Since I seem to need to defend my tone here, I'm not detailing them
here.  That should not be construed as a lack of appreciation of them.

Now I've asked for the quickest path to detailed understanding of the
pg IO subsystem.  The goal being to get more up to speed on its
coding details.  Certainly not to annoy you or anyone else.

At least from my perspective, this for the most part seems to have
been a useful and reasonable engineering discussion that has
exposed a number of important things.
  
Regards,
Ron



Re: [HACKERS] [PERFORM] A Better External Sort?

2005-10-05 Thread Luke Lonergan
Michael,

On 10/5/05 8:33 AM, Michael Stone [EMAIL PROTECTED] wrote:

 real    0m8.889s
 user    0m0.877s
 sys     0m8.010s
 
 it's not in disk wait state (in fact the whole read was cached) but it's
 only getting 1MB/s.

You've proven my point completely.  This process is bottlenecked in the CPU.
The only way to improve it would be to optimize the system (libc) functions
like fread where it is spending most of its time.

In COPY, we found lots of libc functions like strlen() being called
ridiculous numbers of times; in one case it was called on every
timestamp/date attribute to get the length of TZ, which is constant.  That
one function call was in the system category, and was responsible for
several percent of the time.
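
A hedged sketch of that kind of fix (function names are illustrative, not
the actual COPY code): compute the constant length once per COPY run
instead of once per attribute.

#include <stdio.h>
#include <string.h>

/* Before: strlen() of an unchanging timezone string for every attribute. */
int
format_timestamp_slow(char *dst, const char *value, const char *tzname)
{
    return sprintf(dst, "%s %.*s", value, (int) strlen(tzname), tzname);
}

/* After: the caller hoists the strlen() out of the per-tuple loop and
 * passes the precomputed length in. */
int
format_timestamp_fast(char *dst, const char *value,
                      const char *tzname, int tznamelen)
{
    return sprintf(dst, "%s %.*s", value, tznamelen, tzname);
}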

By the way, system routines like fgetc/getc/strlen/atoi etc. don't appear in
gprof profiles of dynamically linked objects, nor by default in oprofile
results.

If the bottleneck is in I/O, you will see the time spent in disk wait, not
in system.

- Luke





Re: [HACKERS] [PERFORM] A Better External Sort?

2005-10-04 Thread Martijn van Oosterhout
On Mon, Oct 03, 2005 at 10:51:32PM +0100, Simon Riggs wrote:
  Basically, I recommend adding -Winline -finline-limit-1500 to the
  default build while we discuss other options.
 
 I add -Winline but get no warnings. Why would I use -finline-limit-1500?
 
 I'm interested, but uncertain as to what difference this makes. Surely
 using -O3 works fine?

Different versions of gcc have different ideas of when a function can
be inlined. From my reading of the documentation, this decision is
independent of optimisation level. Maybe your gcc version has a limit
higher than 1500 by default.
-- 
Martijn van Oosterhout   kleptog@svana.org   http://svana.org/kleptog/
 Patent. n. Genius is 5% inspiration and 95% perspiration. A patent is a
 tool for doing 5% of the work and then sitting around waiting for someone
 else to do the other 95% so you can sue them.




Re: [HACKERS] [PERFORM] A Better External Sort?

2005-10-04 Thread Simon Riggs
On Tue, 2005-10-04 at 12:04 +0200, Martijn van Oosterhout wrote:
 On Mon, Oct 03, 2005 at 10:51:32PM +0100, Simon Riggs wrote:
   Basically, I recommend adding -Winline -finline-limit-1500 to the
   default build while we discuss other options.
  
  I add -Winline but get no warnings. Why would I use -finline-limit-1500?
  
  I'm interested, but uncertain as to what difference this makes. Surely
  using -O3 works fine?

How did you determine the 1500 figure? Can you give some more info to
surround that recommendation to allow everybody to evaluate it?

Best Regards, Simon Riggs




Re: [HACKERS] [PERFORM] A Better External Sort?

2005-10-04 Thread Martijn van Oosterhout
On Tue, Oct 04, 2005 at 12:24:54PM +0100, Simon Riggs wrote:
 How did you determine the 1500 figure? Can you give some more info to
 surround that recommendation to allow everybody to evaluate it?

[EMAIL PROTECTED]:~/dl/cvs/pgsql-local/src/backend/utils/sort$ gcc 
-finline-limit-1000 -Winline -O2 -Wall -Wmissing-prototypes -Wpointer-arith 
-Wendif-labels -fno-strict-aliasing -g -I../../../../src/include -D_GNU_SOURCE  
 -c -o tuplesort.o tuplesort.c
tuplesort.c: In function 'applySortFunction':
tuplesort.c:1833: warning: inlining failed in call to 'inlineApplySortFunction'
tuplesort.c:1906: warning: called from here
tuplesort.c: In function 'comparetup_heap':
tuplesort.c:1833: warning: inlining failed in call to 'inlineApplySortFunction'
tuplesort.c:1937: warning: called from here
tuplesort.c: In function 'comparetup_index':
tuplesort.c:1833: warning: inlining failed in call to 'inlineApplySortFunction'
tuplesort.c:2048: warning: called from here
tuplesort.c: In function 'comparetup_datum':
tuplesort.c:1833: warning: inlining failed in call to 'inlineApplySortFunction'
tuplesort.c:2167: warning: called from here
[EMAIL PROTECTED]:~/dl/cvs/pgsql-local/src/backend/utils/sort$ gcc 
-finline-limit-1500 -Winline -O2 -Wall -Wmissing-prototypes -Wpointer-arith 
-Wendif-labels -fno-strict-aliasing -g -I../../../../src/include -D_GNU_SOURCE  
 -c -o tuplesort.o tuplesort.c
no warnings

A quick binary search puts the cutoff between 1200 and 1300. Given
version variation I picked a nice round number, 1500.

Ugh, that's for -O2, for -O3 and above it needs to be 4100 to work.
Maybe we should go for 5000 or so.

I'm using: gcc (GCC) 3.3.5 (Debian 1:3.3.5-13)

Have a nice day,
-- 
Martijn van Oosterhout   kleptog@svana.org   http://svana.org/kleptog/
 Patent. n. Genius is 5% inspiration and 95% perspiration. A patent is a
 tool for doing 5% of the work and then sitting around waiting for someone
 else to do the other 95% so you can sue them.




Re: [HACKERS] [PERFORM] A Better External Sort?

2005-10-04 Thread Ron Peacetree
The constants related to inlining involve pcode, not actual assembly 
instructions, and are compiler version dependent as well as subject to change 
without notice by the GNU folks...

from:
http://gcc.gnu.org/onlinedocs/gcc-3.3.5/gcc/Optimize-Options.html#Optimize-Options

-finline-limit=n
By default, gcc limits the size of functions that can be inlined. This flag 
allows the control of this limit for functions that are explicitly marked as 
inline (i.e., marked with the inline keyword or defined within the class 
definition in c++). n is the size of functions that can be inlined in number of 
pseudo instructions (not counting parameter handling). The default value of n 
is 600. Increasing this value can result in more inlined code at the cost of 
compilation time and memory consumption. Decreasing usually makes the 
compilation faster and less code will be inlined (which presumably means slower 
programs). This option is particularly useful for programs that use inlining 
heavily such as those based on recursive templates with C++.

Inlining is actually controlled by a number of parameters, which may be 
specified individually by using --param name=value. The -finline-limit=n option 
sets some of these parameters as follows:

max-inline-insns
is set to n.
max-inline-insns-single
is set to n/2.
max-inline-insns-auto
is set to n/2.
min-inline-insns
is set to 130 or n/4, whichever is smaller.
max-inline-insns-rtl
is set to n. 

Using -finline-limit=600 thus results in the default settings for these 
parameters. See below for a documentation of the individual parameters 
controlling inlining.

Note: pseudo instruction represents, in this particular context, an abstract 
measurement of function's size. In no way, it represents a count of assembly 
instructions and as such its exact meaning might change from one release to an 
another. 

Further Down It Says...

--param name=value
In some places, GCC uses various constants to control the amount of 
optimization that is done. For example, GCC will not inline functions that 
contain more that a certain number of instructions. You can control some of 
these constants on the command-line using the --param option.

The names of specific parameters, and the meaning of the values, are tied to 
the internals of the compiler, and are subject to change without notice in 
future releases.

In each case, the value is an integer. The allowable choices for name are given 
in the following table:

snip

max-inline-insns-single
Several parameters control the tree inliner used in gcc. This number sets the 
maximum number of instructions (counted in gcc's internal representation) in a 
single function that the tree inliner will consider for inlining. This only 
affects functions declared inline and methods implemented in a class 
declaration (C++). The default value is 300.

max-inline-insns-auto
When you use -finline-functions (included in -O3), a lot of functions that 
would otherwise not be considered for inlining by the compiler will be 
investigated. To those functions, a different (more restrictive) limit compared 
to functions declared inline can be applied. The default value is 300.

max-inline-insns
The tree inliner does decrease the allowable size for single functions to be 
inlined after we already inlined the number of instructions given here by 
repeated inlining. This number should be a factor of two or more larger than 
the single function limit. Higher numbers result in better runtime performance, 
but incur higher compile-time resource (CPU time, memory) requirements and 
result in larger binaries. Very high values are not advisable, as too large 
binaries may adversely affect runtime performance. The default value is 600.

max-inline-slope
After exceeding the maximum number of inlined instructions by repeated 
inlining, a linear function is used to decrease the allowable size for single 
functions. The slope of that function is the negative reciprocal of the number 
specified here. The default value is 32.

min-inline-insns
The repeated inlining is throttled more and more by the linear function after 
exceeding the limit. To avoid too much throttling, a minimum for this function 
is specified here to allow repeated inlining for very small functions even when 
a lot of repeated inlining already has been done. The default value is 130.

max-inline-insns-rtl
For languages that use the RTL inliner (this happens at a later stage than tree 
inlining), you can set the maximum allowable size (counted in RTL instructions) 
for the RTL inliner with this parameter. The default value is 600.


-Original Message-
From: Martijn van Oosterhout kleptog@svana.org
Sent: Oct 4, 2005 8:24 AM
To: Simon Riggs [EMAIL PROTECTED]
Cc: Tom Lane [EMAIL PROTECTED], Ron Peacetree [EMAIL PROTECTED], 
pgsql-hackers@postgresql.org
Subject: Re: [HACKERS] [PERFORM] A Better External Sort

Re: [HACKERS] [PERFORM] A Better External Sort?

2005-10-04 Thread Tom Lane
Martijn van Oosterhout kleptog@svana.org writes:
 A quick binary search puts the cutoff between 1200 and 1300. Given
 version variation I picked a nice round number, 1500.

 Ugh, that's for -O2, for -O3 and above it needs to be 4100 to work.
 Maybe we should go for 5000 or so.

 I'm using: gcc (GCC) 3.3.5 (Debian 1:3.3.5-13)

I don't know what the units of this number are, but it's apparently far
too gcc-version-dependent to consider putting into our build scripts.
Using gcc version 4.0.1 20050727 (current Fedora Core 4 compiler) on
i386, and compiling tuplesort.c as you did, I find:
-O2: warning goes away between 800 and 900
-O3: warning is always there (tried values up to 1000)
(the latter behavior may indicate a bug, not sure).

What's even more interesting is that the warning does not appear in
either case if I omit -finline-limit --- so the default value is plenty.

At least on this particular compiler, the proposed switch would be
counterproductive.

regards, tom lane



Re: [HACKERS] [PERFORM] A Better External Sort?

2005-10-04 Thread Simon Riggs
On Tue, 2005-10-04 at 16:30 +0200, Martijn van Oosterhout wrote:
 On Tue, Oct 04, 2005 at 10:06:24AM -0400, Tom Lane wrote:
  Martijn van Oosterhout kleptog@svana.org writes:
   I'm using: gcc (GCC) 3.3.5 (Debian 1:3.3.5-13)
  
  I don't know what the units of this number are, but it's apparently far
  too gcc-version-dependent to consider putting into our build scripts.
  Using gcc version 4.0.1 20050727 (current Fedora Core 4 compiler) on
  i386, and compiling tuplesort.c as you did, I find:
  -O2: warning goes away between 800 and 900
  -O3: warning is always there (tried values up to 1000)
  (the latter behavior may indicate a bug, not sure).

 Fascinating. The fact that the warning goes away if you don't specify
 -finline-limit seems to indicate they've gotten smarter. Or a bug.
 We'd have to check the asm code to see if it's actually inlined or
 not.

I've been using gcc 3.4 and saw no warning when using either -Winline
or -O3 -Winline.

Martijn, at the moment it sounds like this is a feature that we no
longer need to support - even if we should have done for previous
releases.

Best Regards, Simon Riggs




Re: [HACKERS] [PERFORM] A Better External Sort?

2005-10-04 Thread Tom Lane
Martijn van Oosterhout kleptog@svana.org writes:
 1. Add -Winline so we can at least be aware of when it's (not) happening.

Yeah, I agree with that part, just not with adding a fixed -finline-limit
value.

While on the subject of gcc warnings ... if I touch that code, I want to
remove -Wold-style-definition from the default flags, too.  It's causing
much more clutter than it's worth, because all the flex files generate
several such warnings.

regards, tom lane



Re: [HACKERS] [PERFORM] A Better External Sort?

2005-10-04 Thread Martijn van Oosterhout
On Tue, Oct 04, 2005 at 10:06:24AM -0400, Tom Lane wrote:
 Martijn van Oosterhout kleptog@svana.org writes:
  I'm using: gcc (GCC) 3.3.5 (Debian 1:3.3.5-13)
 
 I don't know what the units of this number are, but it's apparently far
 too gcc-version-dependent to consider putting into our build scripts.
 Using gcc version 4.0.1 20050727 (current Fedora Core 4 compiler) on
 i386, and compiling tuplesort.c as you did, I find:
   -O2: warning goes away between 800 and 900
   -O3: warning is always there (tried values up to 1000)
 (the latter behavior may indicate a bug, not sure).

Fascinating. The fact that the warning goes away if you don't specify
-finline-limit seems to indicate they've gotten smarter. Or a bug.
We'd have to check the asm code to see if it's actually inlined or
not.

Two options:
1. Add -Winline so we can at least be aware of when it's (not) happening.
2. If we can't get gcc to reliably inline, maybe we need to consider
other options?

In particular, move the isNull test statements out since they are ones
the optimiser can use to best effect.

And if we put in -Winline, it would be visible to users while
compiling, so they can tweak their own build options (if they care).
-- 
Martijn van Oosterhout   kleptog@svana.org   http://svana.org/kleptog/
 Patent. n. Genius is 5% inspiration and 95% perspiration. A patent is a
 tool for doing 5% of the work and then sitting around waiting for someone
 else to do the other 95% so you can sue them.




Re: [HACKERS] [PERFORM] A Better External Sort?

2005-10-04 Thread Martijn van Oosterhout
On Tue, Oct 04, 2005 at 03:56:53PM +0100, Simon Riggs wrote:
 I've been using gcc 3.4 and saw no warning when using either -Winline
 or -O3 -Winline.

Ok, I've just installed 3.4 and verified that. I examined the asm code
and gcc is inlining it. I concede, at this point just throw in -Winline
and monitor the situation.

As an aside, the *_getattr calls end up a bit suboptimal though. It's
producing code like:

  cmp attlen, 4
  je $elsewhere1
  cmp attlen, 2
  je $elsewhere2
  ld byte
here:

--- much later ---
elsewhere1:
  ld integer
  jmp $here
elsewhere2:
  ld short
  jmp $here

No idea whether we want to go down the path of hinting to gcc which
size will be the most common.

Have a nice day,
-- 
Martijn van Oosterhout   kleptog@svana.org   http://svana.org/kleptog/
 Patent. n. Genius is 5% inspiration and 95% perspiration. A patent is a
 tool for doing 5% of the work and then sitting around waiting for someone
 else to do the other 95% so you can sue them.




Re: [HACKERS] [PERFORM] A Better External Sort?

2005-10-04 Thread mark
On Tue, Oct 04, 2005 at 05:23:41PM +0200, Martijn van Oosterhout wrote:
 On Tue, Oct 04, 2005 at 03:56:53PM +0100, Simon Riggs wrote:
  I've been using gcc 3.4 and saw no warning when using either -Winline
  or -O3 -Winline.
 Ok, I've just installed 3.4 and verified that. I examined the asm code
 and gcc is inlining it. I concede, at this point just throw in -Winline
 and monitor the situation.
 As an aside, the *_getattr calls end up a bit suboptimal though. It's
 producing code like:
   cmp attlen, 4
   je $elsewhere1
   cmp attlen, 2
   je $elsewhere2
   ld byte
 here:
 --- much later ---
 elsewhere1:
   ld integer
   jmp $here
 elsewhere2:
   ld short
   jmp $here
 No idea whether we want to go down the path of hinting to gcc which
 size will be the most common.

If it will very frequently be one value, and not the other values, I
don't see why we wouldn't want to hint. #ifdef it to expand to just
the expression if not using GCC. It's important that we know that the
value would almost always be a certain value, however, as GCC will try
to make the path for the expected value as fast as possible, at the
cost of an unexpected value being slower.

  __builtin_expect (long EXP, long C)

 You may use `__builtin_expect' to provide the compiler with branch
 prediction information.  In general, you should prefer to use
 actual profile feedback for this (`-fprofile-arcs'), as
 programmers are notoriously bad at predicting how their programs
 actually perform.  However, there are applications in which this
 data is hard to collect.

 The return value is the value of EXP, which should be an integral
 expression.  The value of C must be a compile-time constant.  The
 semantics of the built-in are that it is expected that EXP == C.
 For example:

  if (__builtin_expect (x, 0))
foo ();

 would indicate that we do not expect to call `foo', since we
 expect `x' to be zero.  Since you are limited to integral
 expressions for EXP, you should use constructions such as

  if (__builtin_expect (ptr != NULL, 1))
error ();

 when testing pointer or floating-point values.
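
A minimal sketch of the #ifdef wrapper suggested above (the macro and
function names are illustrative only, not existing PostgreSQL symbols):

/* Expand to __builtin_expect on GCC, and to the bare expression elsewhere. */
#if defined(__GNUC__)
#define pg_likely(x)    __builtin_expect((x) != 0, 1)
#define pg_unlikely(x)  __builtin_expect((x) != 0, 0)
#else
#define pg_likely(x)    (x)
#define pg_unlikely(x)  (x)
#endif

/* Example use, hinting that the 4-byte attribute case dominates:
 *
 *     if (pg_likely(attlen == 4))
 *         value = fetch_int32(ptr);
 *     else
 *         value = fetch_other(ptr, attlen);
 */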

Cheers,
mark

-- 
[EMAIL PROTECTED] / [EMAIL PROTECTED] / [EMAIL PROTECTED] 
__
.  .  _  ._  . .   .__.  . ._. .__ .   . . .__  | Neighbourhood Coder
|\/| |_| |_| |/|_ |\/|  |  |_  |   |/  |_   | 
|  | | | | \ | \   |__ .  |  | .|. |__ |__ | \ |__  | Ottawa, Ontario, Canada

  One ring to rule them all, one ring to find them, one ring to bring them all
   and in the darkness bind them...

   http://mark.mielke.cc/




Re: [HACKERS] [PERFORM] A Better External Sort?

2005-10-03 Thread Josh Berkus
Michael,

 Realistically, you can't do better than about 25MB/s on a
  single-threaded I/O on current Linux machines,

 What on earth gives you that idea? Did you drop a zero?

Nope, LOTS of testing, at OSDL, GreenPlum and Sun.   For comparison, A 
Big-Name Proprietary Database doesn't get much more than that either.

-- 
--Josh

Josh Berkus
Aglio Database Solutions
San Francisco



Re: [HACKERS] [PERFORM] A Better External Sort?

2005-10-03 Thread Josh Berkus
Tom,

 Raising work_mem to a gig should result in about five runs, needing only
 one pass, which is really going to be as good as it gets.  If you could
 not see any difference then I see little hope for the idea that reducing
 the number of merge passes will help.

Right.  It *should have*, but didn't seem to.  Example of a simple sort
test of 100 million random-number records (maintenance_work_mem vs. time):

    1M     3294 seconds
   16M     1107 seconds
  256M     1209 seconds
  512M     1174 seconds
  512M with 'not null' for column that is indexed   1168 seconds

 Umm ... you were raising maintenance_work_mem, I trust, not work_mem?

Yes.


 We really need to get some hard data about what's going on here.  The
 sort code doesn't report any internal statistics at the moment, but it
 would not be hard to whack together a patch that reports useful info
 in the form of NOTICE messages or some such.

Yeah, I'll do this as soon as the patch is finished.   Always useful to 
gear up the old TPC-H.

-- 
--Josh

Josh Berkus
Aglio Database Solutions
San Francisco



Re: [HACKERS] [PERFORM] A Better External Sort?

2005-10-03 Thread Jeffrey W. Baker
On Mon, 2005-10-03 at 13:34 -0700, Josh Berkus wrote:
 Michael,
 
  Realistically, you can't do better than about 25MB/s on a
   single-threaded I/O on current Linux machines,
 
  What on earth gives you that idea? Did you drop a zero?
 
 Nope, LOTS of testing, at OSDL, GreenPlum and Sun.   For comparison, A 
 Big-Name Proprietary Database doesn't get much more than that either.

I find this claim very suspicious.  I get single-threaded reads in
excess of 1GB/sec with XFS and > 250MB/sec with ext3.

-jwb



Re: [HACKERS] [PERFORM] A Better External Sort?

2005-10-03 Thread Josh Berkus
Jeff,

  Nope, LOTS of testing, at OSDL, GreenPlum and Sun.   For comparison, A
  Big-Name Proprietary Database doesn't get much more than that either.

 I find this claim very suspicious.  I get single-threaded reads in
 excess of 1GB/sec with XFS and > 250MB/sec with ext3.

Database reads?  Or raw FS reads?  It's not the same thing.

Also, we're talking *write speed* here, not read speed.

I also find *your* claim suspicious, since there's no way XFS is 300% faster 
than ext3 for the *general* case.

-- 
Josh Berkus
Aglio Database Solutions
San Francisco



Re: [HACKERS] [PERFORM] A Better External Sort?

2005-10-03 Thread Ron Peacetree
Jeff, are those _burst_ rates from HD buffer or _sustained_ rates from
actual HD media?  Rates from IO subsystem buffer or cache are
usually considerably higher than Average Sustained Transfer Rate.

Also, are you measuring _raw_ HD IO (bits straight off the platters, no
FS or other overhead) or _cooked_ HD IO (actual FS or pg IO)?

BTW, it would seem Useful to measure all of raw HD IO, FS HD IO,
and pg HD IO as this would give us an idea of just how much overhead
each layer is imposing on the process.

We may be able to get better IO than we currently are for things like
sorts by the simple expedient of making sure we read enough data per
seek.

For instance, a HD with a 12ms average access time and an ASTR of
50MBps should always read _at least_ 600KB/access or it is impossible
for it to achieve its rated ASTR.

This number will vary according to the average access time and the
ASTR of your physical IO subsystem, but the concept is valid for _any_
physical IO subsystem.
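
A small worked example of that arithmetic (the 50MBps and 12ms figures are
just the ones assumed above):

#include <stdio.h>

int
main(void)
{
    double astr_bytes_per_sec = 50.0 * 1000 * 1000; /* 50 MBps sustained transfer rate */
    double avg_access_sec = 0.012;                  /* 12 ms average access time */

    /* The figure quoted above: rated ASTR times average access time. */
    double min_bytes_per_access = astr_bytes_per_sec * avg_access_sec;

    printf("minimum transfer per access: %.0f KB\n", min_bytes_per_access / 1000.0);
    return 0;
}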
 

-Original Message-
From: Jeffrey W. Baker [EMAIL PROTECTED]
Sent: Oct 3, 2005 4:42 PM
To: josh@agliodbs.com
Cc: 
Subject: Re: [HACKERS] [PERFORM] A Better External Sort?

On Mon, 2005-10-03 at 13:34 -0700, Josh Berkus wrote:
 Michael,
 
  Realistically, you can't do better than about 25MB/s on a
   single-threaded I/O on current Linux machines,
 
  What on earth gives you that idea? Did you drop a zero?
 
 Nope, LOTS of testing, at OSDL, GreenPlum and Sun.   For comparison, A 
 Big-Name Proprietary Database doesn't get much more than that either.

I find this claim very suspicious.  I get single-threaded reads in
excess of 1GB/sec with XFS and > 250MB/sec with ext3.

-jwb





Re: [HACKERS] [PERFORM] A Better External Sort?

2005-10-03 Thread Luke Lonergan
Jeff, Josh,

On 10/3/05 2:16 PM, Josh Berkus josh@agliodbs.com wrote:

 Jeff,
 
 Nope, LOTS of testing, at OSDL, GreenPlum and Sun.   For comparison, A
 Big-Name Proprietary Database doesn't get much more than that either.
 
 I find this claim very suspicious.  I get single-threaded reads in
 excess of 1GB/sec with XFS and > 250MB/sec with ext3.
 
 Database reads?  Or raw FS reads?  It's not the same thing.
 
 Also, we're talking *write speed* here, not read speed.

I think you are both talking past each other here.  I'll state what I
*think* each of you are saying:

Josh: single threaded DB writes are limited to 25MB/s

My opinion: Not if they're done better than they are now in PostgreSQL.
PostgreSQL COPY is still CPU limited at 12MB/s on a super fast Opteron.  The
combination of WAL and head writes while this is the case is about 50MB/s,
which is far from the limit of the filesystems we test on that routinely
perform at 250MB/s on ext2 writing in sequential 8k blocks.

There is no reason that we couldn't do triple the current COPY speed by
reducing the CPU overhead in parsing and attribute conversion.  We've talked
this to death, and implemented much of the code to fix it, but there's much
more to do.

Jeff: Plenty of FS bandwidth to be had on Linux, observed 250MB/s on ext3
and 1,000MB/s on XFS.

Wow - can you provide a link or the results from the XFS test?  Is this 8k
blocksize sequential I/O?  How many spindles and what controller are you
using?  Inquiring minds want to know...

- Luke 



---(end of broadcast)---
TIP 3: Have you checked our extensive FAQ?

   http://www.postgresql.org/docs/faq


Re: [HACKERS] [PERFORM] A Better External Sort?

2005-10-03 Thread Jeffrey W. Baker
On Mon, 2005-10-03 at 14:16 -0700, Josh Berkus wrote:
 Jeff,
 
   Nope, LOTS of testing, at OSDL, GreenPlum and Sun.   For comparison, A
   Big-Name Proprietary Database doesn't get much more than that either.
 
  I find this claim very suspicious.  I get single-threaded reads in
  excess of 1GB/sec with XFS and > 250MB/sec with ext3.
 
 Database reads?  Or raw FS reads?  It's not the same thing.

Just reading files off the filesystem.  These are input rates I get with
a specialized sort implementation.  1GB/sec is not even especially
wonderful; I can get that on two controllers with a 24-disk stripe set.

I guess database reads are different, but I remain unconvinced that they
are *fundamentally* different.  After all, a tab-delimited file (my sort
workload) is a kind of database.

 Also, we're talking *write speed* here, not read speed.

Ok, I did not realize.  Still you should see 250-300MB/sec
single-threaded sequential output on ext3, assuming the storage can
provide that rate.

 I also find *your* claim suspicious, since there's no way XFS is 300% faster 
 than ext3 for the *general* case.

On a single disk you wouldn't notice, but XFS scales much better when
you throw disks at it.  I get a 50MB/sec boost from the 24th disk,
whereas ext3 stops scaling after 16 disks.  For writes both XFS and ext3
top out around 8 disks, but in this case XFS tops out at 500MB/sec while
ext3 can't break 350MB/sec.

I'm hopeful that in the future the work being done at ClusterFS will
make ext3 on-par with XFS.

-jwb

---(end of broadcast)---
TIP 5: don't forget to increase your free space map settings


Re: [HACKERS] [PERFORM] A Better External Sort?

2005-10-03 Thread Hannu Krosing
On E, 2005-10-03 at 14:16 -0700, Josh Berkus wrote:
 Jeff,
 
   Nope, LOTS of testing, at OSDL, GreenPlum and Sun.   For comparison, A
   Big-Name Proprietary Database doesn't get much more than that either.
 
  I find this claim very suspicious.  I get single-threaded reads in
  excess of 1GB/sec with XFS and > 250MB/sec with ext3.
 
 Database reads?  Or raw FS reads?  It's not the same thing.

Just FYI, I run a count(*) on a 15.6GB table on a lightly loaded db and
it run in 163 sec. (Dual opteron 2.6GHz, 6GB RAM, 6 x 74GB 15k  disks in
RAID10, reiserfs). A little less than 100MB sec.

After this I ran count(*) over a 2.4GB file from another tablespace on
another device (4x142GB 10k disks in RAID10) and it run 22.5 sec on
first run and 12.5 on second.

db=# show shared_buffers ;
 shared_buffers
----------------
 196608
(1 row)

db=# select version();
                                version
-----------------------------------------------------------------------------
 PostgreSQL 8.0.3 on x86_64-pc-linux-gnu, compiled by GCC cc (GCC) 3.3.6
 (Debian 1:3.3.6-7)
(1 row)


-- 
Hannu Krosing [EMAIL PROTECTED]


---(end of broadcast)---
TIP 6: explain analyze is your friend


Re: [HACKERS] [PERFORM] A Better External Sort?

2005-10-03 Thread Simon Riggs
On Sun, 2005-10-02 at 21:38 +0200, Martijn van Oosterhout wrote:
 Ok, I tried two optimisations:
 

 2. By specifying: -Winline -finline-limit-1500 (only on tuplesort.c).
 This causes inlineApplySortFunction() to be inlined, like the code
 obviously expects it to be.
 
 default build (baseline)                    235 seconds
 -finline only                               217 seconds (7% better)
 comparetup_index_fastbyval4 only            221 seconds (6% better)
 comparetup_index_fastbyval4 and -finline    203 seconds (13.5% better)
 
 This is indexing the integer sequence column on a 2.7 million row
 table. The times are as given by gprof and so exclude system call time.
 
 Basically, I recommend adding -Winline -finline-limit-1500 to the
 default build while we discuss other options.

I add -Winline but get no warnings. Why would I use -finline-limit-1500?

I'm interested, but uncertain as to what difference this makes. Surely
using -O3 works fine?

Best Regards, Simon Riggs


---(end of broadcast)---
TIP 9: In versions below 8.0, the planner will ignore your desire to
   choose an index scan if your joining column's datatypes do not
   match


Re: [HACKERS] [PERFORM] A Better External Sort?

2005-10-03 Thread Luke Lonergan
Hannu,

On 10/3/05 2:43 PM, Hannu Krosing [EMAIL PROTECTED] wrote:

 Just FYI, I run a count(*) on a 15.6GB table on a lightly loaded db and
 it run in 163 sec. (Dual opteron 2.6GHz, 6GB RAM, 6 x 74GB 15k  disks in
 RAID10, reiserfs). A little less than 100MB sec.

This confirms our findings - sequential scan is CPU limited at about 120MB/s
per single threaded executor.  This is too slow for fast file systems like
we're discussing here.

Bizgres MPP gets 250MB/s by running multiple scanners, but we still chew up
unnecessary amounts of CPU.
  
 After this I ran count(*) over a 2.4GB file from another tablespace on
 another device (4x142GB 10k disks in RAID10) and it run 22.5 sec on
 first run and 12.5 on second.

You're getting caching effects here.

- Luke



---(end of broadcast)---
TIP 2: Don't 'kill -9' the postmaster


Re: [HACKERS] [PERFORM] A Better External Sort?

2005-10-03 Thread Josh Berkus
Michael,

 Nope, LOTS of testing, at OSDL, GreenPlum and Sun.   For comparison, A
 Big-Name Proprietary Database doesn't get much more than that either.

 You seem to be talking about database IO, which isn't what you said.

Right, well, it was what I meant.   I failed to specify, that's all.

-- 
--Josh

Josh Berkus
Aglio Database Solutions
San Francisco

---(end of broadcast)---
TIP 2: Don't 'kill -9' the postmaster


Re: [HACKERS] [PERFORM] A Better External Sort?

2005-10-03 Thread Josh Berkus
Jeffrey,

 I guess database reads are different, but I remain unconvinced that they
 are *fundamentally* different.  After all, a tab-delimited file (my sort
 workload) is a kind of database.

Unfortunately, they are ... because of CPU overheads.   I'm basing what's 
reasonable for data writes on the rates which other high-end DBs can 
make.   From that, 25mb/s or even 40mb/s for sorts should be achievable 
but doing 120mb/s would require some kind of breakthrough.

 On a single disk you wouldn't notice, but XFS scales much better when
 you throw disks at it.  I get a 50MB/sec boost from the 24th disk,
 whereas ext3 stops scaling after 16 disks.  For writes both XFS and ext3
 top out around 8 disks, but in this case XFS tops out at 500MB/sec while
 ext3 can't break 350MB/sec.

That would explain it.  I seldom get more than 6 disks (and 2 channels) to 
test with.

-- 
--Josh

Josh Berkus
Aglio Database Solutions
San Francisco

---(end of broadcast)---
TIP 6: explain analyze is your friend


Re: [HACKERS] [PERFORM] A Better External Sort?

2005-10-03 Thread Ron Peacetree
Let's pretend we get a 24HD HW RAID solution like that J Baker
says he has access to and set it up as a RAID 10.  Assuming
it uses two 64b 133MHz PCI-X busses and has the fastest HDs
available on it,  Jeff says he can hit ~1GBps of XFS FS IO rate
with that set up (12*83.3MBps= 1GBps).

Josh says that pg can't do more than 25MBps of DB level IO
regardless of how fast the physical IO subsystem is because at
25MBps, pg is CPU bound.  

Just how bad is this CPU bound condition?  How powerful a CPU is
needed to attain a DB IO rate of 25MBps?
 
If we replace said CPU with one 2x, 10x, etc faster than that, do we
see any performance increase?

If a modest CPU can drive a DB IO rate of 25MBps, but that rate
does not go up regardless of how much extra CPU we throw at
it...

Ron 

-Original Message-
From: Josh Berkus josh@agliodbs.com
Sent: Oct 3, 2005 6:03 PM
To: Jeffrey W. Baker [EMAIL PROTECTED]
Cc: 
Subject: Re: [HACKERS] [PERFORM] A Better External Sort?

Jeffrey,

 I guess database reads are different, but I remain unconvinced that they
 are *fundamentally* different.  After all, a tab-delimited file (my sort
 workload) is a kind of database.

Unfortunately, they are ... because of CPU overheads.   I'm basing what's 
reasonable for data writes on the rates which other high-end DBs can 
make.   From that, 25mb/s or even 40mb/s for sorts should be achievable 
but doing 120mb/s would require some kind of breakthrough.

 On a single disk you wouldn't notice, but XFS scales much better when
 you throw disks at it.  I get a 50MB/sec boost from the 24th disk,
 whereas ext3 stops scaling after 16 disks.  For writes both XFS and ext3
 top out around 8 disks, but in this case XFS tops out at 500MB/sec while
 ext3 can't break 350MB/sec.

That would explain it.  I seldom get more than 6 disks (and 2 channels) to 
test with.

-- 
--Josh

Josh Berkus
Aglio Database Solutions
San Francisco

---(end of broadcast)---
TIP 4: Have you searched our list archives?

   http://archives.postgresql.org


---(end of broadcast)---
TIP 6: explain analyze is your friend


Re: [HACKERS] [PERFORM] A Better External Sort?

2005-10-03 Thread Gregory Maxwell
On 10/3/05, Ron Peacetree [EMAIL PROTECTED] wrote:
[snip]
 Just how bad is this CPU bound condition?  How powerful a CPU is
 needed to attain a DB IO rate of 25MBps?

 If we replace said CPU with one 2x, 10x, etc faster than that, do we
 see any performance increase?

 If a modest CPU can drive a DB IO rate of 25MBps, but that rate
 does not go up regardless of how much extra CPU we throw at
 it...

Single threaded was mentioned.
Plus, even if it's purely CPU bound, it's seldom as trivial as throwing
CPU at it; consider the locking in the application, in the
filesystem, and elsewhere in the kernel.

---(end of broadcast)---
TIP 9: In versions below 8.0, the planner will ignore your desire to
   choose an index scan if your joining column's datatypes do not
   match


Re: [HACKERS] [PERFORM] A Better External Sort?

2005-10-03 Thread Ron Peacetree
OK, change "performance" to "single thread performance" and we
still have a valid starting point for a discussion.

Ron


-Original Message-
From: Gregory Maxwell [EMAIL PROTECTED]
Sent: Oct 3, 2005 8:19 PM
To: Ron Peacetree [EMAIL PROTECTED]
Subject: Re: [HACKERS] [PERFORM] A Better External Sort?

On 10/3/05, Ron Peacetree [EMAIL PROTECTED] wrote:
[snip]
 Just how bad is this CPU bound condition?  How powerful a CPU is
 needed to attain a DB IO rate of 25MBps?

 If we replace said CPU with one 2x, 10x, etc faster than that, do we
 see any performance increase?

 If a modest CPU can drive a DB IO rate of 25MBps, but that rate
 does not go up regardless of how much extra CPU we throw at
 it...

Single threaded was mentioned.
Plus, even if it's purely CPU bound, it's seldom as trivial as throwing
CPU at it; consider the locking in the application, in the
filesystem, and elsewhere in the kernel.


---(end of broadcast)---
TIP 9: In versions below 8.0, the planner will ignore your desire to
   choose an index scan if your joining column's datatypes do not
   match


Re: [HACKERS] [PERFORM] A Better External Sort?

2005-10-02 Thread Martijn van Oosterhout
On Sat, Oct 01, 2005 at 11:26:07PM -0400, Tom Lane wrote:
 Martijn van Oosterhout kleptog@svana.org writes:
  Anyway, to bring some real info I just profiled PostgreSQL 8.1beta
  doing an index create on a 2960296 row table (3 columns, table size
  317MB).
 
 3 columns in the index you mean?  What were the column datatypes?
 Any null values?

Nope, three columns in the table, one column in the index, no nulls.
The indexed column was integer. I did it once with around 6500 values
repeated over and over, lots of duplicate keys. And once on a serial
column but it made no discernible difference either way. Although the
comparison function was called less (only 76 million times), presumably
because it was mostly sorted already.

  The number 1 bottleneck with 41% of user time is comparetup_index.
  ...
  The thing is, I can't see anything in comparetup_index() that could
  take much time.
 
 The index_getattr and heap_getattr macros can be annoyingly expensive.

And yet they are optimised for the common case. nocache_index_getattr
was only called 7 times, which is about what you expect. I'm getting
annotated output now, to determine which line takes the time...
Actually, my previous profile overstated stuff a bit. Profiling turned
off optimisation so I put it back and you get better results but the
order doesn't change much. By-line results are below.

The top two are the index_getattr calls in comparetup_index. Third and
fourth are the HEAPCOMPARES in tuplesort_heap_siftup. Then comes the
inlineApplySortFunction call (which isn't being inlined, despite
suggesting it should be, -Winline warns about this).

Looks to me that there are no real gains to be made in this function.
What is needed is an algorithmic change to call this function less
often...

Have a nice weekend,

  %   cumulative   self              self     total
 time   seconds   seconds    calls  ms/call  ms/call  name
 9.40     22.56    22.56                             comparetup_index (tuplesort.c:2042 @ 8251060)
 5.07     34.73    12.17                             comparetup_index (tuplesort.c:2043 @ 82510c0)
 4.73     46.09    11.36                             tuplesort_heap_siftup (tuplesort.c:1648 @ 825074d)
 3.48     54.45     8.36                             tuplesort_heap_siftup (tuplesort.c:1661 @ 82507a9)
 2.80     61.18     6.73                             comparetup_index (tuplesort.c:2102 @ 8251201)
 2.68     67.62     6.44                             comparetup_index (tuplesort.c:2048 @ 8251120)
 2.16     72.82     5.20                             tuplesort_heap_siftup (tuplesort.c:1652 @ 825076d)
 1.88     77.34     4.52 76025782     0.00     0.00  comparetup_index (tuplesort.c:2016 @ 8251010)
 1.82     81.70     4.36 76025782     0.00     0.00  inlineApplySortFunction (tuplesort.c:1833 @ 8251800)
 1.73     85.85     4.15                             readtup_heap (tuplesort.c:2000 @ 8250fd8)
 1.67     89.86     4.01                             AllocSetAlloc (aset.c:568 @ 824bec0)
 1.61     93.72     3.86                             comparetup_index (tuplesort.c:2025 @ 825102f)
 1.47     97.25     3.53 76025785     0.00     0.00  btint4cmp (nbtcompare.c:74 @ 80924a0)
 1.11     99.92     2.67                             readtup_datum (tuplesort.c:2224 @ 82517c4)
 1.10    102.55     2.64                             comparetup_index (tuplesort.c:2103 @ 82511e7)

  %   cumulative   self              self    total
 time   seconds   seconds    calls   s/call   s/call  name
 28.34     68.01    68.01 76025782     0.00     0.00  comparetup_index
 13.56    100.54    32.53  7148934     0.00     0.00  tuplesort_heap_siftup
  8.66    121.33    20.79 76025782     0.00     0.00  inlineApplySortFunction
  4.43    131.96    10.63 13084567     0.00     0.00  AllocSetAlloc
  3.73    140.90     8.94 76025785     0.00     0.00  btint4cmp
  2.15    146.07     5.17  6095625     0.00     0.00  LWLockAcquire
  2.02    150.92     4.85  2960305     0.00     0.00  heapgettup
  1.98    155.66     4.74  7148934     0.00     0.00  tuplesort_heap_insert
  1.78    159.94     4.28  2960312     0.00     0.00  slot_deform_tuple
  1.73    164.09     4.15                             readtup_heap
  1.67    168.09     4.00  6095642     0.00     0.00  LWLockRelease
  1.53    171.76     3.68  2960308     0.00     0.00  index_form_tuple
  1.44    175.21     3.45 13083442     0.00     0.00  AllocSetFree
  1.28    178.28     3.07  8377285     0.00     0.00  LogicalTapeWrite
  1.25    181.29     3.01  8377285     0.00     0.00  LogicalTapeRead
  1.11    183.96     2.67                             readtup_datum
  1.06    186.51     2.55        1     2.55   123.54  IndexBuildHeapScan

-- 
Martijn van Oosterhout   kleptog@svana.org   http://svana.org/kleptog/
 Patent. n. Genius is 5% inspiration and 95% perspiration. A patent is a
 tool for doing 5% of the work and then sitting around waiting for someone
 else to do the other 95% so you can sue them.

Re: [HACKERS] [PERFORM] A Better External Sort?

2005-10-02 Thread Martijn van Oosterhout
Ok, I tried two optimisations:

1. By creating a special version of comparetup_index for single key
integer indexes. Create an index_get_attr with byval and len args. By
using fetch_att and specifying the values at compile time, gcc
optimises the whole call to about 12 instructions of assembly rather
than the usual mess.

2. By specifying: -Winline -finline-limit-1500 (only on tuplesort.c).
This causes inlineApplySortFunction() to be inlined, like the code
obviously expects it to be.

default build (baseline)                    235 seconds
-finline only                               217 seconds (7% better)
comparetup_index_fastbyval4 only            221 seconds (6% better)
comparetup_index_fastbyval4 and -finline    203 seconds (13.5% better)

This is indexing the integer sequence column on a 2.7 million row
table. The times are as given by gprof and so exclude system call time.

Basically, I recommend adding -Winline -finline-limit-1500 to the
default build while we discuss other options.

comparetup_index_fastbyval4 patch attached per example.
-- 
Martijn van Oosterhout   kleptog@svana.org   http://svana.org/kleptog/
 Patent. n. Genius is 5% inspiration and 95% perspiration. A patent is a
 tool for doing 5% of the work and then sitting around waiting for someone
 else to do the other 95% so you can sue them.
--- pgsql-clean/src/include/access/itup.h	2005-10-02 21:30:20.327464320 +0200
+++ pgsql-sort/src/include/access/itup.h	2005-10-02 16:04:00.0 +0200
@@ -126,6 +126,34 @@
 	) \
 )
 
+#define index_get_attr(tup, attnum, tupleDesc, attbyval, attlen, isnull) \
+( \
+	AssertMacro(PointerIsValid(isnull) && (attnum) > 0), \
+	*(isnull) = false, \
+	!IndexTupleHasNulls(tup) ? \
+	( \
+		(tupleDesc)->attrs[(attnum)-1]->attcacheoff >= 0 ? \
+		( \
+			fetch_att((char *) (tup) + IndexInfoFindDataOffset((tup)->t_info) \
+			+ (tupleDesc)->attrs[(attnum)-1]->attcacheoff, attbyval, attlen) \
+		) \
+		: \
+			nocache_index_getattr((tup), (attnum), (tupleDesc), (isnull)) \
+	) \
+	: \
+	( \
+		(att_isnull((attnum)-1, (char *)(tup) + sizeof(IndexTupleData))) ? \
+		( \
+			*(isnull) = true, \
+			(Datum)NULL \
+		) \
+		: \
+		( \
+			nocache_index_getattr((tup), (attnum), (tupleDesc), (isnull)) \
+		) \
+	) \
+)
+
 
 /* routines in indextuple.c */
 extern IndexTuple index_form_tuple(TupleDesc tupleDescriptor,
--- pgsql-clean/src/backend/utils/sort/tuplesort.c	2005-09-24 23:23:39.0 +0200
+++ pgsql-sort/src/backend/utils/sort/tuplesort.c	2005-10-02 21:29:39.349086302 +0200
@@ -375,6 +375,8 @@
 						 unsigned int len);
 static int comparetup_index(Tuplesortstate *state,
 				 const void *a, const void *b);
+static int comparetup_index_fastbyval4(Tuplesortstate *state,
+				 const void *a, const void *b);
 static void *copytup_index(Tuplesortstate *state, void *tup);
 static void writetup_index(Tuplesortstate *state, int tapenum, void *tup);
 static void *readtup_index(Tuplesortstate *state, int tapenum,
@@ -498,8 +500,12 @@
 					int workMem, bool randomAccess)
 {
 	Tuplesortstate *state = tuplesort_begin_common(workMem, randomAccess);
+	TupleDesc	tupDes = RelationGetDescr(indexRel);
 
-	state->comparetup = comparetup_index;
+	if( tupDes->natts == 1 && tupDes->attrs[0]->attbyval == 1 && tupDes->attrs[0]->attlen == 4 )
+		state->comparetup = comparetup_index_fastbyval4;
+	else
+		state->comparetup = comparetup_index;
 	state->copytup = copytup_index;
 	state->writetup = writetup_index;
 	state->readtup = readtup_index;
@@ -2102,6 +2108,92 @@
 	return 0;
 }
 
+static int
+comparetup_index_fastbyval4(Tuplesortstate *state, const void *a, const void *b)
+{
+	/*
+	 * This is almost the same as _bt_tuplecompare(), but we need to keep
+	 * track of whether any null fields are present.  Also see the special
+	 * treatment for equal keys at the end.
+	 */
+	IndexTuple	tuple1 = (IndexTuple) a;
+	IndexTuple	tuple2 = (IndexTuple) b;
+	Relation	rel = state->indexRel;
+	ScanKey		scankey = state->indexScanKey;
+	TupleDesc	tupDes;
+	bool		equal_hasnull = false;
+
+	tupDes = RelationGetDescr(rel);
+
+	ScanKey		entry = &scankey[0];
+	Datum		datum1,
+				datum2;
+	bool		isnull1,
+				isnull2;
+	int32		compare;
+
+	datum1 = index_get_attr(tuple1, 1, tupDes, 1, 4, &isnull1);
+

Re: [HACKERS] [PERFORM] A Better External Sort?

2005-10-01 Thread Tom Lane
Jeffrey W. Baker [EMAIL PROTECTED] writes:
 I think the largest speedup will be to dump the multiphase merge and
 merge all tapes in one pass, no matter how large M.  Currently M is
 capped at 6, so a sort of 60GB with 1GB sort memory needs 13 passes over
 the tape.  It could be done in a single pass heap merge with N*log(M)
 comparisons, and, more importantly, far less input and output.

I had more or less despaired of this thread yielding any usable ideas
:-( but I think you have one here.  The reason the current code uses a
six-way merge is that Knuth's figure 70 (p. 273 of volume 3 first
edition) shows that there's not much incremental gain from using more
tapes ... if you are in the regime where number of runs is much greater
than number of tape drives.  But if you can stay in the regime where
only one merge pass is needed, that is obviously a win.

I don't believe we can simply legislate that there be only one merge
pass.  That would mean that, if we end up with N runs after the initial
run-forming phase, we need to fit N tuples in memory --- no matter how
large N is, or how small work_mem is.  But it seems like a good idea to
try to use an N-way merge where N is as large as work_mem will allow.
We'd not have to decide on the value of N until after we've completed
the run-forming phase, at which time we've already seen every tuple
once, and so we can compute a safe value for N as work_mem divided by
largest_tuple_size.  (Tape I/O buffers would have to be counted too
of course.)
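
As a rough illustration of that sizing rule (this is not the actual
tuplesort.c code; the tape-buffer accounting and the numbers in main()
are assumptions for the example), the merge order N could be computed
after the run-forming phase roughly like this:

#include <stdio.h>
#include <stddef.h>

/* Pick an N-way merge order bounded by what fits in work_mem. */
static int
choose_merge_order(size_t work_mem,       /* bytes available to the sort */
                   size_t largest_tuple,  /* largest tuple seen while forming runs */
                   size_t tape_buffer,    /* per-tape I/O buffer size */
                   int    nruns)          /* runs produced by the first phase */
{
    /* each input tape needs room for one tuple plus its I/O buffer */
    int n = (int) (work_mem / (largest_tuple + tape_buffer));

    if (n > nruns)
        n = nruns;      /* no point in more input tapes than runs */
    if (n < 2)
        n = 2;          /* a merge needs at least two inputs */
    return n;
}

int main(void)
{
    /* e.g. 1GB work_mem, 1kB largest tuple, 256kB tape buffers, 40 runs */
    printf("merge order: %d\n",
           choose_merge_order((size_t) 1024 * 1024 * 1024, 1024, 256 * 1024, 40));
    return 0;
}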

It's been a good while since I looked at the sort code, and so I don't
recall if there are any fundamental reasons for having a compile-time-
constant value of the merge order rather than choosing it at runtime.
My guess is that any inefficiencies added by making it variable would
be well repaid by the potential savings in I/O.

regards, tom lane

---(end of broadcast)---
TIP 6: explain analyze is your friend


Re: [HACKERS] [PERFORM] A Better External Sort?

2005-10-01 Thread Dann Corbit
 -Original Message-
 From: [EMAIL PROTECTED] [mailto:pgsql-hackers-
 [EMAIL PROTECTED] On Behalf Of Tom Lane
 Sent: Friday, September 30, 2005 11:02 PM
 To: Jeffrey W. Baker
 Cc: Luke Lonergan; Josh Berkus; Ron Peacetree; pgsql-
 [EMAIL PROTECTED]; pgsql-performance@postgresql.org
 Subject: Re: [HACKERS] [PERFORM] A Better External Sort?
 
 Jeffrey W. Baker [EMAIL PROTECTED] writes:
  I think the largest speedup will be to dump the multiphase merge and
  merge all tapes in one pass, no matter how large M.  Currently M is
  capped at 6, so a sort of 60GB with 1GB sort memory needs 13 passes
over
  the tape.  It could be done in a single pass heap merge with
N*log(M)
  comparisons, and, more importantly, far less input and output.
 
 I had more or less despaired of this thread yielding any usable ideas
 :-( but I think you have one here.  

I believe I made the exact same suggestion several days ago.

The reason the current code uses a
 six-way merge is that Knuth's figure 70 (p. 273 of volume 3 first
 edition) shows that there's not much incremental gain from using more
 tapes ... if you are in the regime where number of runs is much
greater
 than number of tape drives.  But if you can stay in the regime where
 only one merge pass is needed, that is obviously a win.
 
 I don't believe we can simply legislate that there be only one merge
 pass.  That would mean that, if we end up with N runs after the
initial
 run-forming phase, we need to fit N tuples in memory --- no matter how
 large N is, or how small work_mem is.  But it seems like a good idea
to
 try to use an N-way merge where N is as large as work_mem will allow.
 We'd not have to decide on the value of N until after we've completed
 the run-forming phase, at which time we've already seen every tuple
 once, and so we can compute a safe value for N as work_mem divided by
 largest_tuple_size.  (Tape I/O buffers would have to be counted too
 of course.)

You only need to hold the sort column(s) in memory, except for the queue
you are exhausting at the time.  [And of those columns, only the values
for the smallest one in a sub-list.]  Of course, the more data from each
list that you can hold at once, the fewer the disk reads and seeks.
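
One way to picture keeping only the sort columns in memory is the
following minimal C sketch, which sorts small (key, row-pointer) entries
and leaves the full tuples on disk until they are actually needed.  The
names and layout here are hypothetical illustrations, not PostgreSQL
internals:

#include <stdio.h>
#include <stdint.h>
#include <stdlib.h>

typedef struct
{
    int32_t  key;    /* the sort column value */
    uint64_t rowid;  /* where the full tuple lives (e.g. block/offset packed) */
} SortEntry;

static int
cmp_entry(const void *a, const void *b)
{
    const SortEntry *ea = (const SortEntry *) a;
    const SortEntry *eb = (const SortEntry *) b;

    if (ea->key != eb->key)
        return (ea->key < eb->key) ? -1 : 1;
    return 0;
}

int main(void)
{
    /* an in-memory batch of keys plus pointers back to the real rows */
    SortEntry run[] = { {42, 1001}, {7, 1002}, {19, 1003}, {7, 1004} };
    size_t    n     = sizeof(run) / sizeof(run[0]);

    qsort(run, n, sizeof(SortEntry), cmp_entry);

    for (size_t i = 0; i < n; i++)
        printf("key %d -> row %llu\n", (int) run[i].key,
               (unsigned long long) run[i].rowid);
    return 0;
}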

Another idea (not sure if it is pertinent):
Instead of having a fixed size for the sort buffers, size it to the
query.  Given a total pool of size M, give a percentage according to the
difficulty of the work to perform.  So a query with 3 small columns and
a cardinality of 1000 gets a small percentage and a query with 10 GB of
data gets a big percentage of available sort mem.
 
 It's been a good while since I looked at the sort code, and so I don't
 recall if there are any fundamental reasons for having a compile-time-
 constant value of the merge order rather than choosing it at runtime.
 My guess is that any inefficiencies added by making it variable would
 be well repaid by the potential savings in I/O.
 
   regards, tom lane
 
 ---(end of
broadcast)---
 TIP 6: explain analyze is your friend

---(end of broadcast)---
TIP 5: don't forget to increase your free space map settings


Re: [HACKERS] [PERFORM] A Better External Sort?

2005-10-01 Thread Simon Riggs
On Fri, 2005-09-30 at 13:41 -0700, Josh Berkus wrote:
 Yeah, that's what I thought too.   But try sorting an 10GB table, and 
 you'll see: disk I/O is practically idle, while CPU averages 90%+.   We're 
 CPU-bound, because sort is being really inefficient about something. I 
 just don't know what yet.
 
 If we move that CPU-binding to a higher level of performance, then we can 
 start looking at things like async I/O, O_Direct, pre-allocation etc. that 
 will give us incremental improvements.   But what we need now is a 5-10x 
 improvement and that's somewhere in the algorithms or the code.

I'm trying to keep an open mind about what the causes are, and I think
we need to get a much better characterisation of what happens during a
sort before we start trying to write code. It is always too easy to jump
in and tune the wrong thing, which is not a good use of time.

The actual sort algorithms looks damn fine to me and the code as it
stands is well optimised. That indicates to me that we've come to the
end of the current line of thinking and we need a new approach, possibly
in a number of areas.

For myself, I don't wish to be drawn further on solutions at this stage
but I am collecting performance data, so any test results are most
welcome.

Best Regards, Simon Riggs




---(end of broadcast)---
TIP 4: Have you searched our list archives?

   http://archives.postgresql.org


Re: [HACKERS] [PERFORM] A Better External Sort?

2005-10-01 Thread Simon Riggs
On Sat, 2005-10-01 at 02:01 -0400, Tom Lane wrote:
 Jeffrey W. Baker [EMAIL PROTECTED] writes:
  I think the largest speedup will be to dump the multiphase merge and
  merge all tapes in one pass, no matter how large M.  Currently M is
  capped at 6, so a sort of 60GB with 1GB sort memory needs 13 passes over
  the tape.  It could be done in a single pass heap merge with N*log(M)
  comparisons, and, more importantly, far less input and output.
 
 I had more or less despaired of this thread yielding any usable ideas
 :-( but I think you have one here.  The reason the current code uses a
 six-way merge is that Knuth's figure 70 (p. 273 of volume 3 first
 edition) shows that there's not much incremental gain from using more
 tapes ... if you are in the regime where number of runs is much greater
 than number of tape drives.  But if you can stay in the regime where
 only one merge pass is needed, that is obviously a win.
 
 I don't believe we can simply legislate that there be only one merge
 pass.  That would mean that, if we end up with N runs after the initial
 run-forming phase, we need to fit N tuples in memory --- no matter how
 large N is, or how small work_mem is.  But it seems like a good idea to
 try to use an N-way merge where N is as large as work_mem will allow.
 We'd not have to decide on the value of N until after we've completed
 the run-forming phase, at which time we've already seen every tuple
 once, and so we can compute a safe value for N as work_mem divided by
 largest_tuple_size.  (Tape I/O buffers would have to be counted too
 of course.)
 
 It's been a good while since I looked at the sort code, and so I don't
 recall if there are any fundamental reasons for having a compile-time-
 constant value of the merge order rather than choosing it at runtime.
 My guess is that any inefficiencies added by making it variable would
 be well repaid by the potential savings in I/O.

Well, perhaps Knuth is not untouchable!

So we merge R runs with N variable rather than N=6.

Pick N so that N >= 6 and N <= R, with N limited by memory, sufficient
to allow long sequential reads from the temp file.

Looking at the code, in selectnewtape() we decide on the connection
between run number and tape number. This gets executed during the
writing of initial runs, which was OK when the run-tape mapping was
known ahead of time because of fixed N.

To do this it sounds like we'd be better to write each run out to its
own personal runtape, taking the assumption that N is very large. Then
when all runs are built, re-assign the run numbers to tapes for the
merge. That is likely to be a trivial mapping unless N isn't large
enough to fit in memory. That idea should be easily possible because the
tape numbers were just abstract anyway.

Right now, I can't see any inefficiencies from doing this. It uses
memory better and Knuth shows that using more tapes is better anyhow.
Keeping track of more tapes isn't too bad, even for hundreds or even
thousands of runs/tapes.

Tom, its your idea, so you have first dibs. I'm happy to code this up if
you choose not to, once I've done my other immediate chores.

That just leaves these issues for a later time:
- CPU and I/O interleaving
- CPU cost of abstract data type comparison operator invocation

Best Regards, Simon Riggs



---(end of broadcast)---
TIP 4: Have you searched our list archives?

   http://archives.postgresql.org


Re: [HACKERS] [PERFORM] A Better External Sort?

2005-10-01 Thread Michael Stone

On Fri, Sep 30, 2005 at 01:41:22PM -0700, Josh Berkus wrote:
Realistically, you can't do better than about 25MB/s on a single-threaded 
I/O on current Linux machines,


What on earth gives you that idea? Did you drop a zero?

Mike Stone

---(end of broadcast)---
TIP 1: if posting/reading through Usenet, please send an appropriate
  subscribe-nomail command to [EMAIL PROTECTED] so that your
  message can get through to the mailing list cleanly


Re: [HACKERS] [PERFORM] A Better External Sort?

2005-10-01 Thread Hannu Krosing
On R, 2005-09-30 at 13:38 -0700, Luke Lonergan wrote:

 
 Bulk loading speed is irrelevant here - that is dominated by parsing, which
 we have covered copiously (har har) previously and have sped up by 500%,
 which still makes Postgres < 1/2 the loading speed of MySQL.

Is this < 1/2 of MySQL with WAL on different spindle and/or WAL
disabled ?

-- 
Hannu Krosing [EMAIL PROTECTED]


---(end of broadcast)---
TIP 6: explain analyze is your friend


Re: [HACKERS] [PERFORM] A Better External Sort?

2005-10-01 Thread Ron Peacetree
*blink* Tapes?!  I thought that was a typo...
If our sort code is based on sorting tapes, we've made a mistake.  HDs
are not tapes, and Polyphase Merge Sort and its brethren are not the
best choices for HD based sorts.

Useful references to this point:
Knuth, Vol 3 section 5.4.9, (starts p356 of 2ed)   
Tharp, ISBN 0-471-60521-2, starting p352
Folk, Zoellick, and Riccardi, ISBN 0-201-87401-6, chapter 8 (starts p289)

The winners of the Daytona version of Jim Gray's sorting contest, for
general purpose external sorting algorithms that are of high enough quality
to be offered commercially, also demonstrate a number of better ways to
attack external sorting using HDs.

The big take aways from all this are:
1= As in Polyphase Merge Sort, optimum External HD Merge Sort
performance is obtained by using Replacement Selection and creating
runs of different lengths for later merging.  The optimal values are
different.  (A minimal sketch of Replacement Selection follows this list.)

2= Using multiple HDs split into different functions, IOW _not_ simply
as RAIDs, is a big win.
A big enough win that we should probably consider having a config option
to pg that allows the use of HD(s) or RAID set(s) dedicated as temporary
work area(s).
 
3= If the Key is small compared to the record size, Radix or Distribution
Counting based algorithms are worth considering.
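
As referenced in point 1, here is a minimal, self-contained C sketch of
Replacement Selection (illustrative only, not the PostgreSQL
implementation): keep a heap of tuples, emit the smallest key that can
still extend the current run, and start a new run only when nothing in
memory qualifies.  On randomly ordered input the runs average about
twice the memory size.

#include <stdio.h>

#define HEAP_SIZE 4      /* stands in for work_mem's worth of tuples */

typedef struct { int key; int run; } Item;  /* run = output run this item belongs to */

static Item heap[HEAP_SIZE];
static int  nheap;

/* order the heap by (run, key) so the current run drains before the next one */
static void sift_down(int i)
{
    for (;;)
    {
        int l = 2 * i + 1, r = l + 1, m = i;
        if (l < nheap && (heap[l].run < heap[m].run ||
            (heap[l].run == heap[m].run && heap[l].key < heap[m].key))) m = l;
        if (r < nheap && (heap[r].run < heap[m].run ||
            (heap[r].run == heap[m].run && heap[r].key < heap[m].key))) m = r;
        if (m == i) return;
        Item t = heap[i]; heap[i] = heap[m]; heap[m] = t;
        i = m;
    }
}

int main(void)
{
    int input[] = {5, 1, 9, 2, 8, 3, 7, 4, 6, 0};
    int n = sizeof(input) / sizeof(input[0]);
    int next = 0, cur_run = -1;

    /* prime the heap with the first HEAP_SIZE inputs, all in run 0 */
    for (nheap = 0; nheap < HEAP_SIZE && next < n; next++, nheap++)
        heap[nheap] = (Item){ input[next], 0 };
    for (int i = nheap / 2 - 1; i >= 0; i--) sift_down(i);

    while (nheap > 0)
    {
        Item out = heap[0];
        if (out.run != cur_run)
        {
            cur_run = out.run;
            printf("%srun %d:", cur_run ? "\n" : "", cur_run);
        }
        printf(" %d", out.key);            /* "write" to the current run */

        if (next < n)
        {
            /* replace the root: it extends this run only if key >= last output */
            int k = input[next++];
            heap[0] = (Item){ k, (k >= out.key) ? out.run : out.run + 1 };
        }
        else
            heap[0] = heap[--nheap];       /* input exhausted: shrink the heap */
        sift_down(0);
    }
    printf("\n");
    return 0;
}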

The good news is all this means it's easy to demonstrate that we can
improve the performance of our sorting functionality.

Assuming we get the abysmal physical IO performance fixed...
(because until we do, _nothing_ is going to help us as much)

Ron 


-Original Message-
From: Tom Lane [EMAIL PROTECTED]
Sent: Oct 1, 2005 2:01 AM
Subject: Re: [HACKERS] [PERFORM] A Better External Sort? 

Jeffrey W. Baker [EMAIL PROTECTED] writes:
 I think the largest speedup will be to dump the multiphase merge and
 merge all tapes in one pass, no matter how large M.  Currently M is
 capped at 6, so a sort of 60GB with 1GB sort memory needs 13 passes over
 the tape.  It could be done in a single pass heap merge with N*log(M)
 comparisons, and, more importantly, far less input and output.

I had more or less despaired of this thread yielding any usable ideas
:-( but I think you have one here.  The reason the current code uses a
six-way merge is that Knuth's figure 70 (p. 273 of volume 3 first
edition) shows that there's not much incremental gain from using more
tapes ... if you are in the regime where number of runs is much greater
than number of tape drives.  But if you can stay in the regime where
only one merge pass is needed, that is obviously a win.

I don't believe we can simply legislate that there be only one merge
pass.  That would mean that, if we end up with N runs after the initial
run-forming phase, we need to fit N tuples in memory --- no matter how
large N is, or how small work_mem is.  But it seems like a good idea to
try to use an N-way merge where N is as large as work_mem will allow.
We'd not have to decide on the value of N until after we've completed
the run-forming phase, at which time we've already seen every tuple
once, and so we can compute a safe value for N as work_mem divided by
largest_tuple_size.  (Tape I/O buffers would have to be counted too
of course.)

It's been a good while since I looked at the sort code, and so I don't
recall if there are any fundamental reasons for having a compile-time-
constant value of the merge order rather than choosing it at runtime.
My guess is that any inefficiencies added by making it variable would
be well repaid by the potential savings in I/O.

---(end of broadcast)---
TIP 5: don't forget to increase your free space map settings


Re: [HACKERS] [PERFORM] A Better External Sort?

2005-10-01 Thread Andrew Dunstan



Ron Peacetree wrote:


The good news is all this means it's easy to demonstrate that we can
improve the performance of our sorting functionality.

Assuming we get the abysmal physical IO performance fixed...
(because until we do, _nothing_ is going to help us as much)

 



I for one would be paying more attention if such a demonstration were 
forthcoming, in the form of a viable patch and some benchmark results.


cheers

andrew

---(end of broadcast)---
TIP 6: explain analyze is your friend


Re: [HACKERS] [PERFORM] A Better External Sort?

2005-10-01 Thread Tom Lane
Josh Berkus josh@agliodbs.com writes:
 The biggest single area where I see PostgreSQL external sort sucking is 
   on index creation on large tables.   For example, for free version of 
 TPCH, it takes only 1.5 hours to load a 60GB Lineitem table on OSDL's 
 hardware, but over 3 hours to create each index on that table.  This 
 means that over all our load into TPCH takes 4 times as long to create 
 the indexes as it did to bulk load the data.
 ...
 Following an index creation, we see that 95% of the time required is the 
 external sort, which averages 2mb/s.  This is with seperate drives for 
 the WAL, the pg_tmp, the table and the index.  I've confirmed that 
 increasing work_mem beyond a small minimum (around 128mb) had no benefit 
 on the overall index creation speed.

These numbers don't seem to add up.  You have not provided any details
about the index key datatypes or sizes, but I'll take a guess that the
raw data for each index is somewhere around 10GB.  The theory says that
the runs created during the first pass should on average be about twice
work_mem, so at 128mb work_mem there should be around 40 runs to be
merged, which would take probably three passes with six-way merging.
Raising work_mem to a gig should result in about five runs, needing only
one pass, which is really going to be as good as it gets.  If you could
not see any difference then I see little hope for the idea that reducing
the number of merge passes will help.
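
For illustration, the arithmetic in that paragraph as a small C program
(the 10GB raw-data figure is the guess above, not a measurement):

#include <math.h>
#include <stdio.h>

int main(void)
{
    double data_gb   = 10.0;    /* guessed raw size of one index's sort data */
    double work_mem  = 0.128;   /* GB; initial runs average about twice this */
    int    merge_way = 6;       /* current six-way merge */

    double runs   = data_gb / (2.0 * work_mem);
    int    passes = (int) ceil(log(runs) / log((double) merge_way));

    /* prints roughly: ~39 runs, 3 merge pass(es) at 6-way */
    printf("~%.0f runs, %d merge pass(es) at %d-way\n",
           runs, passes, merge_way);
    return 0;
}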

Umm ... you were raising maintenance_work_mem, I trust, not work_mem?

We really need to get some hard data about what's going on here.  The
sort code doesn't report any internal statistics at the moment, but it
would not be hard to whack together a patch that reports useful info
in the form of NOTICE messages or some such.

regards, tom lane

---(end of broadcast)---
TIP 4: Have you searched our list archives?

   http://archives.postgresql.org


Re: [HACKERS] [PERFORM] A Better External Sort?

2005-10-01 Thread Greg Stark

Tom Lane [EMAIL PROTECTED] writes:

 Jeffrey W. Baker [EMAIL PROTECTED] writes:
  I think the largest speedup will be to dump the multiphase merge and
  merge all tapes in one pass, no matter how large M.  Currently M is
  capped at 6, so a sort of 60GB with 1GB sort memory needs 13 passes over
  the tape.  It could be done in a single pass heap merge with N*log(M)
  comparisons, and, more importantly, far less input and output.
 
 I had more or less despaired of this thread yielding any usable ideas
 :-( but I think you have one here.  The reason the current code uses a
 six-way merge is that Knuth's figure 70 (p. 273 of volume 3 first
 edition) shows that there's not much incremental gain from using more
 tapes ... if you are in the regime where number of runs is much greater
 than number of tape drives.  But if you can stay in the regime where
 only one merge pass is needed, that is obviously a win.

Is that still true when the multiple tapes are being multiplexed onto a single
actual file on disk?

That brings up one of my pet features though. The ability to declare multiple
temporary areas on different spindles and then have them be used on a rotating
basis. So a sort could store each tape on a separate spindle and merge them
together at full sequential i/o speed.

This would make the tradeoff between multiway merges and many passes even
harder to find though. The broader the multiway merges the more sort areas
would be used which would increase the likelihood of another sort using the
same sort area and hurting i/o performance.

-- 
greg


---(end of broadcast)---
TIP 5: don't forget to increase your free space map settings


Re: [HACKERS] [PERFORM] A Better External Sort?

2005-10-01 Thread Martijn van Oosterhout
On Sat, Oct 01, 2005 at 10:22:40AM -0400, Ron Peacetree wrote:
 Assuming we get the abysmal physical IO performance fixed...
 (because until we do, _nothing_ is going to help us as much)

I'm still not convinced this is the major problem. For example, in my
totally unscientific tests on an oldish machine I have here:

Direct filesystem copy to /dev/null
21MB/s    10% user 50% system  (dual cpu, so the system is using a whole CPU)

COPY TO /dev/null WITH binary
13MB/s    55% user 45% system  (ergo, CPU bound)

COPY TO /dev/null
4.4MB/s   60% user 40% system

\copy to /dev/null in psql
6.5MB/s   60% user 40% system

This machine is a bit of a strange setup; not sure why fs copy is so slow.
As to why \copy is faster than COPY, I have no idea, but it is
repeatable. And actually turning the tuples into a printable format is
the most expensive. But it does point out that the whole process is
probably CPU bound more than anything else.

So, I don't think physical I/O is the problem. It's something further
up the call tree. I wouldn't be surprised at all if it had to do with
the creation and destruction of tuples. The cost of comparing tuples
should not be underestimated.
-- 
Martijn van Oosterhout   kleptog@svana.org   http://svana.org/kleptog/
 Patent. n. Genius is 5% inspiration and 95% perspiration. A patent is a
 tool for doing 5% of the work and then sitting around waiting for someone
 else to do the other 95% so you can sue them.


pgpEek2beyKVc.pgp
Description: PGP signature


Re: [HACKERS] [PERFORM] A Better External Sort?

2005-10-01 Thread Ron Peacetree
As I posted earlier, I'm looking for code to base a prototype on now.
I'll test it outside pg to make sure it is bug free and performs as
promised before I hand it off to the core pg developers.

Someone else is going to have to merge it into the pg code base
since I don't know the code intimately enough to make changes this
deep in the core functionality, nor is there enough time for me to
do so if we are going to be timely enough get this into 8.2
(and no, I can't devote 24x7 to doing pg development unless
someone is going to replace my current ways of paying my bills so
that I can.)

Ron
 

-Original Message-
From: Andrew Dunstan [EMAIL PROTECTED]
Sent: Oct 1, 2005 11:19 AM
To: Ron Peacetree [EMAIL PROTECTED]
Subject: Re: [HACKERS] [PERFORM] A Better External Sort?



Ron Peacetree wrote:

The good news is all this means it's easy to demonstrate that we can
improve the performance of our sorting functionality.

Assuming we get the abysmal physical IO performance fixed...
(because until we do, _nothing_ is going to help us as much)

  


I for one would be paying more attention if such a demonstration were 
forthcoming, in the form of a viable patch and some benchmark results.

cheers

andrew


---(end of broadcast)---
TIP 1: if posting/reading through Usenet, please send an appropriate
   subscribe-nomail command to [EMAIL PROTECTED] so that your
   message can get through to the mailing list cleanly


Re: [HACKERS] [PERFORM] A Better External Sort?

2005-10-01 Thread Ron Peacetree
You have not said anything about what HW, OS version, and pg version
used here, but even at that can't you see that something Smells Wrong?

The most common CPUs currently shipping have clock rates of ~2-3GHz
and have 8B-16B internal pathways.  SPARCs and other like CPUs are
clocked slower but have 16B-32B internal pathways.  In short, these
CPUs have an internal bandwidth of 16+ GBps.

The most common currently shipping mainboards have 6.4GBps RAM
subsystems.  ITRW, their peak is ~80% of that, or ~5.1GBps.

In contrast, the absolute peak bandwidth of a 133MHz 8B PCI-X bus is
1GBps, and ITRW it peaks at ~800-850MBps.  Should anyone ever build
a RAID system that can saturate a PCI-Ex16 bus, that system will be
maxing ITRW at ~3.2GBps.

CPUs should NEVER be 100% utilized during copy IO.  They should be
idling impatiently waiting for the next piece of data to finish being
processed even when the RAM IO subsystem is pegged; and they
definitely should be IO starved rather than CPU bound when doing
HD IO.

Those IO rates are also alarming in all but possibly the first case.  A
single ~50MBps HD doing 21MBps isn't bad, but for even a single
~80MBps HD it starts to be of concern.  If any of these IO rates came
from any reasonable 300+MBps RAID array, then they are BAD.

What your simple experiment really does is prove We Have A
Problem (tm) with our IO code at either or both of the OS or the pg
level(s).

Ron

 
-Original Message-
From: Martijn van Oosterhout kleptog@svana.org
Sent: Oct 1, 2005 12:19 PM
Subject: Re: [HACKERS] [PERFORM] A Better External Sort?

On Sat, Oct 01, 2005 at 10:22:40AM -0400, Ron Peacetree wrote:
 Assuming we get the abysmal physical IO performance fixed...
 (because until we do, _nothing_ is going to help us as much)

I'm still not convinced this is the major problem. For example, in my
totally unscientific tests on an oldish machine I have here:

Direct filesystem copy to /dev/null
21MB/s    10% user 50% system  (dual cpu, so the system is using a whole CPU)

COPY TO /dev/null WITH binary
13MB/s    55% user 45% system  (ergo, CPU bound)

COPY TO /dev/null
4.4MB/s   60% user 40% system

\copy to /dev/null in psql
6.5MB/s   60% user 40% system

This machine is a bit of a strange setup; not sure why fs copy is so slow.
As to why \copy is faster than COPY, I have no idea, but it is
repeatable. And actually turning the tuples into a printable format is
the most expensive. But it does point out that the whole process is
probably CPU bound more than anything else.

So, I don't think physical I/O is the problem. It's something further
up the call tree. I wouldn't be surprised at all if it had to do with
the creation and destruction of tuples. The cost of comparing tuples
should not be underestimated.

---(end of broadcast)---
TIP 6: explain analyze is your friend


Re: [HACKERS] [PERFORM] A Better External Sort?

2005-10-01 Thread Martijn van Oosterhout
[removed -performance, not subscribed]

On Sat, Oct 01, 2005 at 01:42:32PM -0400, Ron Peacetree wrote:
 You have not said anything about what HW, OS version, and pg version
 used here, but even at that can't you see that something Smells Wrong?

Somewhat old machine running 7.3 on Linux 2.4. Not exactly speed
daemons but it's still true that the whole process would be CPU bound
*even* if the O/S could idle while it's waiting. PostgreSQL used a
*whole CPU* which is its limit. My point is that trying to reduce I/O
by increasing CPU usage is not going to be beneficial; we need CPU usage
down also.

Anyway, to bring some real info I just profiled PostgreSQL 8.1beta
doing an index create on a 2960296 row table (3 columns, table size
317MB).

The number 1 bottleneck with 41% of user time is comparetup_index. It
was called 95,369,361 times (about 2*ln(N)*N). It used 3 tapes. Another
15% of time went to tuplesort_heap_siftup.

The thing is, I can't see anything in comparetup_index() that could
take much time. The actual comparisons are accounted elsewhere
(inlineApplySortFunction) which amounted to 10% of total time. Since
nocache_index_getattr doesn't feature I can't imagine index_getattr
being a big bottleneck. Any ideas what's going on here?

Other interesting features:
- ~4 memory allocations per tuple, nearly all of which were explicitly
freed
- Things I though would be expensive, like: heapgettup and
myFunctionCall2 didn't really count for much.

Have a nice weekend,

  %   cumulative   self              self    total
 time   seconds   seconds    calls   s/call   s/call  name
 43.63    277.81   277.81 95370055     0.00     0.00  comparetup_index
 16.24    381.24   103.43  5920592     0.00     0.00  tuplesort_heap_siftup
  3.76    405.17    23.93 95370055     0.00     0.00  inlineApplySortFunction
  3.18    425.42    20.26 95370056     0.00     0.00  btint4cmp
  2.82    443.37    17.95 11856219     0.00     0.00  AllocSetAlloc
  2.52    459.44    16.07 95370055     0.00     0.00  myFunctionCall2
  1.71    470.35    10.91  2960305     0.00     0.00  heapgettup
  1.26    478.38     8.03 11841204     0.00     0.00  GetMemoryChunkSpace
  1.14    485.67     7.29  5920592     0.00     0.00  tuplesort_heap_insert
  1.11    492.71     7.04  2960310     0.00     0.00  index_form_tuple
  1.09    499.67     6.96 11855105     0.00     0.00  AllocSetFree
  0.97    505.83     6.17 23711355     0.00     0.00  AllocSetFreeIndex
  0.84    511.19     5.36  5920596     0.00     0.00  LogicalTapeWrite
  0.84    516.51     5.33  2960314     0.00     0.00  slot_deform_tuple
-- 
Martijn van Oosterhout   kleptog@svana.org   http://svana.org/kleptog/
 Patent. n. Genius is 5% inspiration and 95% perspiration. A patent is a
 tool for doing 5% of the work and then sitting around waiting for someone
 else to do the other 95% so you can sue them.


pgpfhYaBi5xFJ.pgp
Description: PGP signature


Re: [HACKERS] [PERFORM] A Better External Sort?

2005-10-01 Thread Tom Lane
Martijn van Oosterhout kleptog@svana.org writes:
 Anyway, to bring some real info I just profiled PostgreSQL 8.1beta
 doing an index create on a 2960296 row table (3 columns, table size
 317MB).

3 columns in the index you mean?  What were the column datatypes?
Any null values?

 The number 1 bottleneck with 41% of user time is comparetup_index.
 ...
 The thing is, I can't see anything in comparetup_index() that could
 take much time.

The index_getattr and heap_getattr macros can be annoyingly expensive.

regards, tom lane

---(end of broadcast)---
TIP 4: Have you searched our list archives?

   http://archives.postgresql.org


Re: [HACKERS] [PERFORM] A Better External Sort?

2005-09-30 Thread Ron Peacetree
From: Pailloncy Jean-Gerard [EMAIL PROTECTED]
Sent: Sep 29, 2005 7:11 AM
Subject: Re: [HACKERS] [PERFORM] A Better External Sort?
Jeff Baker:
Your main example seems to focus on a large table where a key  
column has constrained values.  This case is interesting in
proportion to the number of possible values.  If I have billions
of rows, each having one of only two values, I can think of a
trivial and very fast method of returning the table sorted by
that key: make two sequential passes, returning the first value
on the first pass and the second value on the second pass.
 This will be faster than the method you propose.

Ron Peacetree:
1= No that was not my main example.  It was the simplest example  
used to frame the later more complicated examples.  Please don't
get hung up on it.

2= You are incorrect.  Since IO is the most expensive operation we  
can do, any method that makes two passes through the data at top
scanning speed will take at least 2x as long as any method that only
takes one such pass.

You do not get the point.
Once you have the sorted references to the tuples, you need to
fetch the tuples themselves, check their visibility, etc. and return
them to the client.

As PFC correctly points out elsewhere in this thread, =maybe= you
have to do all that.  The vast majority of the time people are not
going to want to look at a detailed record by record output of that
much data.

The most common usage is to calculate or summarize some quality
or quantity of the data and display that instead or to use the tuples
or some quality of the tuples found as an intermediate step in a
longer query process such as a join.

Sometimes there's a need to see _some_ of the detailed records; a
random sample or a region in a random part of the table or etc.
It's rare that there is a RW need to actually list every record in a
table of significant size.

On the rare occasions where one does have to return or display all
records in such large table, network IO and/or display IO speeds
are the primary performance bottleneck.  Not HD IO.

Nonetheless, if there _is_ such a need, there's nothing stopping us
from rearranging the records in RAM into sorted order in one pass
through RAM (using at most space for one extra record) after
constructing the cache conscious Btree index.  Then the sorted
records can be written to HD in RAM buffer sized chunks very
efficiently.  
Repeating this process until we have stepped through the entire
data set will take no more HD IO than one HD scan of the data
and leave us with a permanent result that can be reused for
multiple purposes.  If the sorted records are written in large
enough chunks, rereading them at any later time can be done
at maximum HD throughput.

In a total of two HD scans (one to read the original data, one
to write out the sorted data) we can make a permanent
rearrangement of the data.  We've essentially created a 
cluster index version of the data.


So, if there are only 2 values in the column of a big table that is larger
than available RAM, two seq scans of the table without any sorting
is the fastest solution.

If you only need to do this once, yes this wins.  OTOH, if you have
to do this sort even twice, my method is better.

regards,
Ron  

---(end of broadcast)---
TIP 4: Have you searched our list archives?

   http://archives.postgresql.org


Re: [HACKERS] [PERFORM] A Better External Sort?

2005-09-30 Thread Ron Peacetree
From: Zeugswetter Andreas DAZ SD [EMAIL PROTECTED]
Sent: Sep 29, 2005 9:28 AM
Subject: RE: [HACKERS] [PERFORM] A Better External Sort?

In my original example, a sequential scan of the 1TB of 2KB 
or 4KB records, = 250M or 500M records of data, being sorted 
on a binary value key will take ~1000x more time than reading 
in the ~1GB Btree I described that used a Key+RID (plus node 
pointers) representation of the data.

Imho you seem to ignore the final step your algorithm needs of
collecting the data rows. After you sorted the keys the collect
step will effectively access the tuples in random order (given a 
sufficiently large key range).

Collecting the data rows can be done for each RAM buffer full of
of data in one pass through RAM after we've built the Btree.  Then
if desired those data rows can be read out to HD in sorted order
in essentially one streaming burst.  This combination of index build
+ RAM buffer rearrangement + write results to HD can be repeated
as often as needed until we end up with an overall Btree index and
a set of sorted sublists on HD.  Overall HD IO for the process is only
two effectively sequential passes through the data.

Subsequent retrieval of the sorted information from HD can be
done at full HD streaming speed and whatever we've decided to
save to HD can be reused later if we desire.

Hope this helps,
Ron

---(end of broadcast)---
TIP 1: if posting/reading through Usenet, please send an appropriate
   subscribe-nomail command to [EMAIL PROTECTED] so that your
   message can get through to the mailing list cleanly


Re: [HACKERS] [PERFORM] A Better External Sort?

2005-09-30 Thread Ron Peacetree
From: Josh Berkus josh@agliodbs.com
Sent: Sep 29, 2005 12:54 PM
Subject: Re: [HACKERS] [PERFORM] A Better External Sort?

The biggest single area where I see PostgreSQL external
sort sucking is on index creation on large tables.   For
example, for free version of TPCH, it takes only 1.5 hours to
load a 60GB Lineitem table on OSDL's hardware, but over 3
hours to create each index on that table.  This means that
over all our load into TPCH takes 4 times as long to create 
the indexes as it did to bulk load the data.

Hmmm.
60GB/5400secs= 11MBps.  That's ssllooww.  So the first
problem is evidently our physical layout and/or HD IO layer
sucks.

Creating the table and then creating the indexes on the table
is going to require more physical IO than if we created the
table and the indexes concurrently in chunks and then
combined the indexes on the chunks into the overall indexes
for the whole table, so there's a potential speed-up.

The method I've been talking about is basically a recipe for
creating indexes as fast as possible with as few IO operations,
HD or RAM, as possible and nearly no random ones, so it
could help as well.

OTOH, HD IO rate is the fundamental performance metric.
As long as our HD IO rate is pessimal, so will the performance
of everything else be.   Why can't we load a table at closer to
the peak IO rate of the HDs?   


Anyone restoring a large database from pg_dump is in the
same situation.  Even worse, if you have to create a new
index on a large table on a production database in use,
because the I/O from the index creation swamps everything.

Fix for this in the works ;-)


Following an index creation, we see that 95% of the time
required is the external sort, which averages 2mb/s.

Assuming decent HD HW, this is HORRIBLE.

What kind of instrumenting and profiling has been done of
the code involved?


This is with seperate drives for the WAL, the pg_tmp, the table
and the index.  I've confirmed that increasing work_mem 
beyond a small minimum (around 128mb) had no benefit on
the overall index creation speed.

No surprise.  The process is severely limited by the abysmally
slow HD IO.

Ron

---(end of broadcast)---
TIP 1: if posting/reading through Usenet, please send an appropriate
   subscribe-nomail command to [EMAIL PROTECTED] so that your
   message can get through to the mailing list cleanly


Re: [HACKERS] [PERFORM] A Better External Sort?

2005-09-30 Thread PFC


Just to add a little anarchy in your nice debate...

Who really needs all the results of a sort on your terabyte table ?

	I guess not many people do a SELECT from such a table and want all the  
results. So, this leaves :

- Really wanting all the results, to fetch using a cursor,
- CLUSTER type things, where you really want everything in order,
	- Aggregates (Sort-GroupAggregate), which might really need to sort the  
whole table.
	- Complex queries where the whole dataset needs to be examined, in order  
to return a few values

- Joins (again, the whole table is probably not going to be selected)
- And the ones I forgot.

However,

Most likely you only want to SELECT N rows, in some ordering :
- the first N (ORDER BY x LIMIT N)
- last N (ORDER BY x DESC LIMIT N)
- WHERE x > value ORDER BY x LIMIT N
- WHERE x < value ORDER BY x DESC LIMIT N
- and other variants

	Or, you are doing a Merge JOIN against some other table ; in that case,  
yes, you might need the whole sorted terabyte table, but most likely there  
are WHERE clauses in the query that restrict the set, and thus, maybe we  
can get some conditions or limit values on the column to sort.


	Also the new, optimized hash join, which is more memory efficient, might  
cover this case.


	Point is, sometimes, you only need part of the results of your sort. And  
the bigger the sort, the more likely it becomes that you only want part of  
the results. So, while we're in the fun hand-waving, new algorithm trying  
mode, why not consider this right from the start ? (I know I'm totally in  
hand-waving mode right now, so slap me if needed).


I'd say your new, fancy sort algorithm needs a few more input values :

- Range of values that must appear in the final result of the sort :
		none, minimum, maximum, both, or even a set of values from the other  
side of the join, hashed, or sorted.

- LIMIT information (first N, last N, none)
	- Enhanced Limit information (first/last N values of the second column to  
sort, for each value of the first column) (the infamous top10 by  
category query)

- etc.

	With this, the amount of data that needs to be kept in memory is
dramatically reduced, from the whole table (even using your compressed
keys, that's big) to something more manageable, closer to the size of
the final result set that will be returned to the client, avoiding a
lot of effort.
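
	As a hedged sketch of the LIMIT-aware idea (ORDER BY x LIMIT n), in
C: keep only the n best keys seen so far, so memory stays proportional
to the answer rather than to the table.  A real implementation would
use a heap; a sorted insert keeps the illustration short, and every
name below is invented.

#include <stddef.h>

/* best[] holds the *count smallest keys seen so far, in ascending
 * order; *count never exceeds n. */
void top_n_insert(double *best, size_t *count, size_t n, double key)
{
    size_t pos;

    if (*count == n && key >= best[n - 1])
        return;                              /* worse than everything kept */
    if (*count < n)
        (*count)++;
    pos = *count - 1;
    while (pos > 0 && best[pos - 1] > key) { /* shift larger keys right */
        best[pos] = best[pos - 1];
        pos--;
    }
    best[pos] = key;
}

	Feeding every scanned key through top_n_insert(best, &count, N, key)
leaves the ORDER BY x LIMIT N answer in best[0..count-1] without ever
materializing the full sort.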


	So, this would not be useful in all cases, but when it applies, it would  
be really useful.


Regards !


















---(end of broadcast)---
TIP 2: Don't 'kill -9' the postmaster


Re: [HACKERS] [PERFORM] A Better External Sort?

2005-09-30 Thread Josh Berkus

Ron,


Hmmm.
60GB/5400secs= 11MBps.  That's ssllooww.  So the first
problem is evidently our physical layout and/or HD IO layer
sucks.


Actually, it's much worse than that, because the sort is only dealing 
with one column.  As I said, monitoring the iostat our top speed was 
2.2mb/s.


--Josh

---(end of broadcast)---
TIP 1: if posting/reading through Usenet, please send an appropriate
  subscribe-nomail command to [EMAIL PROTECTED] so that your
  message can get through to the mailing list cleanly


Re: [HACKERS] [PERFORM] A Better External Sort?

2005-09-30 Thread Dann Corbit
 -Original Message-
 From: [EMAIL PROTECTED] [mailto:pgsql-hackers-
 [EMAIL PROTECTED] On Behalf Of PFC
 Sent: Thursday, September 29, 2005 9:10 AM
 To: [EMAIL PROTECTED]
 Cc: Pg Hackers; pgsql-performance@postgresql.org
 Subject: Re: [HACKERS] [PERFORM] A Better External Sort?
 
 
   Just to add a little anarchy in your nice debate...
 
   Who really needs all the results of a sort on your terabyte
table ?

Reports with ORDER BY/GROUP BY, and many other possibilities.  40% of
mainframe CPU cycles are spent sorting.  That is because huge volumes of
data require lots of energy to be meaningfully categorized.  Let's
suppose that instead of a terabyte of data (or a petabyte or whatever)
we have 10% of it.  That's still a lot of data.

   I guess not many people do a SELECT from such a table and want
all
 the
 results. 

What happens when they do?  The cases where it is already fast are not
very important.  The cases where things go into the crapper are the ones
that need attention.

So, this leaves :
   - Really wanting all the results, to fetch using a cursor,
   - CLUSTER type things, where you really want everything in
order,
   - Aggregates (Sort-GroupAggregate), which might really need to
sort
 the
 whole table.
   - Complex queries where the whole dataset needs to be examined,
in
 order
 to return a few values
   - Joins (again, the whole table is probably not going to be
 selected)
   - And the ones I forgot.
 
   However,
 
   Most likely you only want to SELECT N rows, in some ordering :
   - the first N (ORDER BY x LIMIT N)
   - last N (ORDER BY x DESC LIMIT N)

For these, the QuickSelect algorithm is what is wanted.  For example:
#include <stdlib.h>
typedef double  Etype;

extern Etype    RandomSelect(Etype * A, size_t p, size_t r, size_t i);
extern size_t   RandRange(size_t a, size_t b);
extern size_t   RandomPartition(Etype * A, size_t p, size_t r);
extern size_t   Partition(Etype * A, size_t p, size_t r);

/*
**
** In the following code, every reference to CLR means:
**
**Introduction to Algorithms
**By Thomas H. Cormen, Charles E. Leiserson, Ronald L. Rivest
**ISBN 0-07-013143-0
*/


/*
** CLR, page 187
*/
Etype   RandomSelect(Etype A[], size_t p, size_t r, size_t i)
{
    size_t  q,
            k;
    if (p == r)
        return A[p];
    q = RandomPartition(A, p, r);
    k = q - p + 1;

    if (i <= k)
        return RandomSelect(A, p, q, i);
    else
        return RandomSelect(A, q + 1, r, i - k);
}

size_t  RandRange(size_t a, size_t b)
{
    size_t  c = (size_t) ((double) rand() / ((double) RAND_MAX + 1) * (b - a));
    return c + a;
}

/*
** CLR, page 162
*/
size_t  RandomPartition(Etype A[], size_t p, size_t r)
{
    size_t  i = RandRange(p, r);
    Etype   Temp;
    Temp = A[p];
    A[p] = A[i];
    A[i] = Temp;
    return Partition(A, p, r);
}

/*
** CLR, page 154
*/
size_t  Partition(Etype A[], size_t p, size_t r)
{
    Etype   x,
            temp;
    size_t  i,
            j;

    x = A[p];
    i = p - 1;
    j = r + 1;

    for (;;) {
        do {
            j--;
        } while (!(A[j] <= x));
        do {
            i++;
        } while (!(A[i] >= x));
        if (i < j) {
            temp = A[i];
            A[i] = A[j];
            A[j] = temp;
        } else
            return j;
    }
}
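
A hypothetical usage of the routines above (the array contents are
made up): pick the N-th smallest key as a cutoff with an expected
linear number of comparisons, after which only the first N values need
to be returned in order.  Note that i is 1-based, as in CLR.

#include <stdio.h>

int main(void)
{
    Etype  a[] = {9.0, 2.0, 7.0, 4.0, 8.0, 1.0, 5.0};
    size_t n = sizeof a / sizeof a[0];
    Etype  cutoff = RandomSelect(a, 0, n - 1, 3);   /* 3rd smallest */

    printf("cutoff for LIMIT 3: %g\n", cutoff);     /* prints 4 */
    return 0;
}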

   - WHERE x > value ORDER BY x LIMIT N
   - WHERE x < value ORDER BY x DESC LIMIT N
   - and other variants
 
   Or, you are doing a Merge JOIN against some other table ; in
that
 case,
 yes, you might need the whole sorted terabyte table, but most likely
there
 are WHERE clauses in the query that restrict the set, and thus, maybe
we
 can get some conditions or limit values on the column to sort.

Where clause filters are to be applied AFTER the join operations,
according to the SQL standard.

   Also the new, optimized hash join, which is more memory
efficient,
 might
 cover this case.

For == joins.  Not every order by is applied to joins.  And not every
join is an equal join.

   Point is, sometimes, you only need part of the results of your
sort.
 And
  the bigger the sort, the more likely it becomes that you only want
part of
 the results.

That is an assumption that will sometimes be true, and sometimes not.
It is not possible to predict usage patterns for a general purpose
database system.

 So, while we're in the fun hand-waving, new algorithm trying
 mode, why not consider this right from the start ? (I know I'm totally
in
 hand-waving mode right now, so slap me if needed).
 
   I'd say your new, fancy sort algorithm needs a few more input
values
 :
 
   - Range of values that must appear in the final result of the
sort :
   none, minimum, maximum, both, or even a set of values
from the
 other
 side of the join, hashed, or sorted.

That will already happen (or it certainly

Re: [HACKERS] [PERFORM] A Better External Sort?

2005-09-30 Thread Josh Berkus
Ron,

 That 11MBps was your =bulk load= speed.  If just loading a table
 is this slow, then there are issues with basic physical IO, not just
 IO during sort operations.

Oh, yeah.  Well, that's separate from sort.  See multiple posts on this 
list from the GreenPlum team, the COPY patch for 8.1, etc.  We've been 
concerned about I/O for a while.  

Realistically, you can't do better than about 25MB/s on a single-threaded 
I/O on current Linux machines, because your bottleneck isn't the actual 
disk I/O.   It's CPU.   Databases which go faster than this are all, to 
my knowledge, using multi-threaded disk I/O.

(and I'd be thrilled to get a consistent 25mb/s on PostgreSQL, but that's 
another thread ... )

 As I said, the obvious candidates are inefficient physical layout
 and/or flawed IO code.

Yeah, that's what I thought too.   But try sorting a 10GB table, and 
you'll see: disk I/O is practically idle, while CPU averages 90%+.   We're 
CPU-bound, because sort is being really inefficient about something. I 
just don't know what yet.

If we move that CPU-binding to a higher level of performance, then we can 
start looking at things like async I/O, O_Direct, pre-allocation etc. that 
will give us incremental improvements.   But what we need now is a 5-10x 
improvement and that's somewhere in the algorithms or the code.

-- 
--Josh

Josh Berkus
Aglio Database Solutions
San Francisco

---(end of broadcast)---
TIP 6: explain analyze is your friend


Re: [HACKERS] [PERFORM] A Better External Sort?

2005-09-30 Thread Luke Lonergan
Ron,

On 9/30/05 1:20 PM, Ron Peacetree [EMAIL PROTECTED] wrote:

 That 11MBps was your =bulk load= speed.  If just loading a table
 is this slow, then there are issues with basic physical IO, not just
 IO during sort operations.

Bulk loading speed is irrelevant here - that is dominated by parsing, which
we have covered copiously (har har) previously and have sped up by 500%,
which still makes Postgres < 1/2 the loading speed of MySQL.
 
 As I said, the obvious candidates are inefficient physical layout
 and/or flawed IO code.

Yes.
 
 Until the basic IO issues are addressed, we could replace the
 present sorting code with infinitely fast sorting code and we'd
 still be scrod performance wise.

Postgres' I/O path has many problems that must be micro-optimized away.  Too
small an operand size compared to disk caches, memory, etc. is the
common problem.  Another is lack of micro-parallelism (loops) with long
enough runs to let modern processors pipeline and superscale.
  
The net problem here is that a simple select blah from blee order
by(blah.a); runs at 1/100 of the sequential scan rate.

- Luke



---(end of broadcast)---
TIP 2: Don't 'kill -9' the postmaster


Re: [HACKERS] [PERFORM] A Better External Sort?

2005-09-30 Thread Dann Corbit
I see the following routines that seem to be related to sorting.

If I were to examine these routines to consider ways to improve it, what
routines should I key in on?  I am guessing that tuplesort.c is the hub
of activity for database sorting.

Directory of U:\postgresql-snapshot\src\backend\access\nbtree

08/11/2005  06:22 AM24,968 nbtsort.c
   1 File(s) 24,968 bytes

 Directory of U:\postgresql-snapshot\src\backend\executor

03/16/2005  01:38 PM 7,418 nodeSort.c
   1 File(s)  7,418 bytes

 Directory of U:\postgresql-snapshot\src\backend\utils\sort

09/23/2005  08:36 AM67,585 tuplesort.c
   1 File(s) 67,585 bytes

 Directory of U:\postgresql-snapshot\src\bin\pg_dump

06/29/2005  08:03 PM31,620 pg_dump_sort.c
   1 File(s) 31,620 bytes

 Directory of U:\postgresql-snapshot\src\port

07/27/2005  09:03 PM 5,077 qsort.c
   1 File(s)  5,077 bytes

---(end of broadcast)---
TIP 2: Don't 'kill -9' the postmaster


Re: [HACKERS] [PERFORM] A Better External Sort?

2005-09-30 Thread PFC



Bulk loading speed is irrelevant here - that is dominated by parsing,  
which

we have covered copiously (har har) previously and have sped up by 500%,
which still makes Postgres < 1/2 the loading speed of MySQL.


Let's ask MySQL 4.0


LOAD DATA INFILE blah

0 errors, 666 warnings

SHOW WARNINGS;

not implemented. upgrade to 4.1

duh

---(end of broadcast)---
TIP 6: explain analyze is your friend


Re: [HACKERS] [PERFORM] A Better External Sort?

2005-09-30 Thread Ron Peacetree
That 11MBps was your =bulk load= speed.  If just loading a table
is this slow, then there are issues with basic physical IO, not just
IO during sort operations.

As I said, the obvious candidates are inefficient physical layout
and/or flawed IO code.

Until the basic IO issues are addressed, we could replace the
present sorting code with infinitely fast sorting code and we'd
still be scrod performance wise.

So why does basic IO suck so badly?

Ron  


-Original Message-
From: Josh Berkus josh@agliodbs.com
Sent: Sep 30, 2005 1:23 PM
To: Ron Peacetree [EMAIL PROTECTED]
Cc: pgsql-hackers@postgresql.org, pgsql-performance@postgresql.org
Subject: Re: [HACKERS] [PERFORM] A Better External Sort?

Ron,

 Hmmm.
 60GB/5400secs= 11MBps.  That's ssllooww.  So the first
 problem is evidently our physical layout and/or HD IO layer
 sucks.

Actually, it's much worse than that, because the sort is only dealing 
with one column.  As I said, monitoring the iostat our top speed was 
2.2mb/s.

--Josh


---(end of broadcast)---
TIP 6: explain analyze is your friend


Re: [HACKERS] [PERFORM] A Better External Sort?

2005-09-30 Thread Jignesh K. Shah

I have seen similar performance as Josh and my reasoning is as follows:

* WAL is the biggest bottleneck with its default size of 16MB. Many 
people hate to recompile the code to change its default, and 
increasing checkpoint segments helps, but there is still a lot of overhead 
in the rotation of WAL files (even putting WAL on tmpfs shows that it is 
still slow). Having an option for a bigger size helps to a small 
extent percentagewise (and frees up CPU a bit in doing file rotation).


* Growing files: Even though this is OS dependent, a lot of time is spent 
doing small 8K block increases to grow files. If we could signal 
bigger chunks to grow, or pre-grow data files to their expected size, 
that would help a lot in such cases (a rough sketch of pre-growing a 
file follows below).


* The COPY command had restrictions, but those have been fixed to a large 
extent. (Great job)
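
A rough illustration of the pre-grow idea (not a patch against the
PostgreSQL storage manager; the path and size below are made up):
reserve the expected space in one call instead of thousands of 8K
extensions.

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    off_t expected_size = (off_t) 1 << 30;   /* e.g. a 1GB data segment */
    int   fd = open("/tmp/pg_prealloc_demo", O_CREAT | O_RDWR, 0600);
    int   err;

    if (fd < 0) {
        perror("open");
        return 1;
    }
    err = posix_fallocate(fd, 0, expected_size);  /* grow in one step */
    if (err != 0)
        fprintf(stderr, "posix_fallocate failed: %d\n", err);
    close(fd);
    return err != 0;
}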


But of course I have lost touch with programming and can't begin to 
understand PostgreSQL code to change it myself.


Regards,
Jignesh




Ron Peacetree wrote:


That 11MBps was your =bulk load= speed.  If just loading a table
is this slow, then there are issues with basic physical IO, not just
IO during sort operations.

As I said, the obvious candidates are inefficient physical layout
and/or flawed IO code.

Until the basic IO issues are addressed, we could replace the
present sorting code with infinitely fast sorting code and we'd
still be scrod performance wise.

So why does basic IO suck so badly?

Ron  



-Original Message-
From: Josh Berkus josh@agliodbs.com
Sent: Sep 30, 2005 1:23 PM
To: Ron Peacetree [EMAIL PROTECTED]
Cc: pgsql-hackers@postgresql.org, pgsql-performance@postgresql.org
Subject: Re: [HACKERS] [PERFORM] A Better External Sort?

Ron,

 


Hmmm.
60GB/5400secs= 11MBps.  That's ssllooww.  So the first
problem is evidently our physical layout and/or HD IO layer
sucks.
   



Actually, it's much worse than that, because the sort is only dealing 
with one column.  As I said, monitoring the iostat our top speed was 
2.2mb/s.


--Josh


---(end of broadcast)---
TIP 1: if posting/reading through Usenet, please send an appropriate
  subscribe-nomail command to [EMAIL PROTECTED] so that your
  message can get through to the mailing list cleanly
 



---(end of broadcast)---
TIP 9: In versions below 8.0, the planner will ignore your desire to
  choose an index scan if your joining column's datatypes do not
  match


Re: [HACKERS] [PERFORM] A Better External Sort?

2005-09-30 Thread Ron Peacetree
25MBps should not be a CPU bound limit for IO, nor should it be
an OS limit.  It should be something ~100x (Single channel RAM)
to ~200x (dual channel RAM) that.

For an IO rate of 25MBps to be pegging the CPU at 100%, the CPU
is suffering some combination of
A= lots of cache misses (cache thrash), 
B= lots of random rather than sequential IO (like pointer chasing)
C= lots of wasteful copying
D= lots of wasteful calculations

In fact, this is crappy enough performance that the whole IO layer
should be rethought and perhaps reimplemented from scratch.
Optimization of the present code is unlikely to yield a 100-200x
improvement.

On the HD side, the first thing that comes to mind is that DBs are
-NOT- like ordinary filesystems in a few ways:
1= the minimum HD IO is a record that is likely to be larger than
a HD sector.  Therefore, the FS we use should be laid out with
physical segments of max(HD sector size, record size)

2= DB files (tables) are usually considerably larger than any other
kind of files stored.  Therefore the FS we use should be laid
out using LARGE physical pages.  64KB-256KB at a _minimum_.

3= The whole 2GB striping of files idea needs to be rethought.
Our tables are significantly different in internal structure from the
usual FS entity.

4= I'm sure we are paying all sorts of nasty overhead for essentially
emulating the pg filesystem inside another filesystem.  That means
~2x as much overhead to access a particular piece of data.   

The simplest solution is for us to implement a new VFS compatible
filesystem tuned to exactly our needs: pgfs.

We may be able to avoid that by some amount of hacking or
modifying of the current FSs we use, but I suspect it would be more
work for less ROI.

Ron 


-Original Message-
From: Josh Berkus josh@agliodbs.com
Sent: Sep 30, 2005 4:41 PM
To: Ron Peacetree [EMAIL PROTECTED]
Cc: pgsql-hackers@postgresql.org, pgsql-performance@postgresql.org
Subject: Re: [HACKERS] [PERFORM] A Better External Sort?

Ron,

 That 11MBps was your =bulk load= speed.  If just loading a table
 is this slow, then there are issues with basic physical IO, not just
 IO during sort operations.

Oh, yeah.  Well, that's separate from sort.  See multiple posts on this 
list from the GreenPlum team, the COPY patch for 8.1, etc.  We've been 
concerned about I/O for a while.  

Realistically, you can't do better than about 25MB/s on a single-threaded 
I/O on current Linux machines, because your bottleneck isn't the actual 
disk I/O.   It's CPU.   Databases which go faster than this are all, to 
my knowledge, using multi-threaded disk I/O.

(and I'd be thrilled to get a consistent 25mb/s on PostgreSQL, but that's 
another thread ... )

 As I said, the obvious candidates are inefficient physical layout
 and/or flawed IO code.

Yeah, that's what I thought too.   But try sorting a 10GB table, and 
you'll see: disk I/O is practically idle, while CPU averages 90%+.   We're 
CPU-bound, because sort is being really inefficient about something. I 
just don't know what yet.

If we move that CPU-binding to a higher level of performance, then we can 
start looking at things like async I/O, O_Direct, pre-allocation etc. that 
will give us incremental improvements.   But what we need now is a 5-10x 
improvement and that's somewhere in the algorithms or the code.

-- 
--Josh

Josh Berkus
Aglio Database Solutions
San Francisco


---(end of broadcast)---
TIP 4: Have you searched our list archives?

   http://archives.postgresql.org


Re: [HACKERS] [PERFORM] A Better External Sort?

2005-09-30 Thread Dann Corbit
;
do {
{
(parent) = (i);
(temp) = (array)[(parent)];
(child) = (parent) * 2;
while ((nmemb) > (child)) {
if (*((array) + (child) + 1) > *((array) + (child))) {
++(child);
}
if (*((array) + (child)) > (temp)) {
(array)[(parent)] = (array)[(child)];
(parent) = (child);
(child) *= 2;
} else {
--(child);
break;
}
}
if ((nmemb) == (child) && *((array) + (child)) > (temp)) {
(array)[(parent)] = (array)[(child)];
(parent) = (child);
}
(array)[(parent)] = (temp);
}
} while (i--);
((void) ((temp) = *(array), *(array) = *(array + nmemb), *(array
+ nmemb) = (temp)));
for (--nmemb; nmemb; --nmemb) {
{
(parent) = (0);
(temp) = (array)[(parent)];
(child) = (parent) * 2;
while ((nmemb) > (child)) {
if (*((array) + (child) + 1) > *((array) + (child))) {
++(child);
}
if (*((array) + (child)) > (temp)) {
(array)[(parent)] = (array)[(child)];
(parent) = (child);
(child) *= 2;
} else {
--(child);
break;
}
}
if ((nmemb) == (child) && *((array) + (child)) > (temp)) {
(array)[(parent)] = (array)[(child)];
(parent) = (child);
}
(array)[(parent)] = (temp);
}
((void) ((temp) = *(array), *(array) = *(array + nmemb),
*(array + nmemb) = (temp)));
}
}
}

// 
// We use this to check to see if a partition is already sorted.
// 

template <class e_type>
int sorted(e_type * array, size_t nmemb)
{
for (--nmemb; nmemb; --nmemb) {
if (*(array) > *(array + 1)) {
return 0;
}
++array;
}
return 1;
}

// 
// We use this to check to see if a partition is already reverse-sorted.
// 

template <class e_type>
int rev_sorted(e_type * array, size_t nmemb)
{
for (--nmemb; nmemb; --nmemb) {
if (*(array + 1) > *(array)) {
return 0;
}
++array;
}
return 1;
}

// 
// We use this to reverse a reverse-sorted partition.
// 

template <class e_type>
void rev_array(e_type * array, size_t nmemb)
{
e_type  temp,
*end;
for (end = array + nmemb - 1; end > array; ++array) {
((void) ((temp) = *(array), *(array) = *(end), *(end) =
(temp)));
--end;
}
}

// 
// Introspective quick sort algorithm user entry point.
// You do not need to directly call any other sorting template.
// This sort will perform very well under all circumstances.
// 

template <class e_type>
void iqsort(e_type * array, size_t nmemb)
{
size_t d,
   n;
if (nmemb > 1 && !sorted(array, nmemb)) {
if (!rev_sorted(array, nmemb)) {
n = nmemb / 4;
d = 2;
while (n) {
++d;
n /= 2;
}
qloop(array, nmemb, 2 * d);
} else {
rev_array(array, nmemb);
}
}
}

 -Original Message-
 From: [EMAIL PROTECTED] [mailto:pgsql-hackers-
 [EMAIL PROTECTED] On Behalf Of Jignesh K. Shah
 Sent: Friday, September 30, 2005 1:38 PM
 To: Ron Peacetree
 Cc: Josh Berkus; pgsql-hackers@postgresql.org; pgsql-
 [EMAIL PROTECTED]
 Subject: Re: [HACKERS] [PERFORM] A Better External Sort?
 
 I have seen similar performance as Josh and my reasoning is as
follows:
 
 * WAL is the biggest bottleneck with its default size of 16MB. Many
 people hate to recompile the code   to change its default, and
 increasing checkpoint segments help but still there is lot of overhead
 in the rotation of WAL files (Even putting WAL on tmpfs shows that it
is
 still slow). Having an option for bigger size is helpful to a small
 extent percentagewise (and frees up CPU a bit in doing file rotation)
 
 * Growing files: Even though this is OS dependent but it does spend
lot
 of time doing small 8K block increases to grow files. If we can signal
 bigger chunks to grow or pre-grow  to expected size of  data files
 that will help a lot in such cases.
 
 * COPY command had restriction but that has been fixed to a large
 extent.(Great job)
 
 But ofcourse I have lost touch with programming and can't begin to
 understand PostgreSQL code to change it myself.
 
 Regards,
 Jignesh
 
 
 
 
 Ron Peacetree wrote:
 
 That 11MBps was your =bulk load= speed

Re: [HACKERS] [PERFORM] A Better External Sort?

2005-09-30 Thread Gregory Maxwell
On 9/30/05, Ron Peacetree [EMAIL PROTECTED] wrote:
 4= I'm sure we are paying all sorts of nasty overhead for essentially
 emulating the pg filesystem inside another filesystem.  That means
 ~2x as much overhead to access a particular piece of data.

 The simplest solution is for us to implement a new VFS compatible
 filesystem tuned to exactly our needs: pgfs.

 We may be able to avoid that by some amount of hacking or
 modifying of the current FSs we use, but I suspect it would be more
 work for less ROI.

On this point, Reiser4 fs already implements a number of things which
would be desirable for PostgreSQL. For example: write()s to reiser4
filesystems are atomic, so there is no risk of torn pages (this is
enabled because reiser4 uses WAFL like logging where data is not
overwritten but rather relocated). The filesystem is modular and
extensible so it should be easy to add whatever additional semantics
are needed.  I would imagine that all that would be needed is some
more atomicity operations (single writes are already atomic, but I'm
sure it would be useful to batch many writes into a transaction),some
layout and packing controls, and some flush controls.  A step further
would perhaps integrate multiversioning directly into the FS (the
wandering logging system provides the write side of multiversioning, a
little read side work would be required.). More importantly: the file
system was intended to be extensible for this sort of application.

It might make a good 'summer of code' project for someone next year,
... presumably reiser4 will have made it into the mainline
kernel by then. :)

---(end of broadcast)---
TIP 6: explain analyze is your friend


Re: [HACKERS] [PERFORM] A Better External Sort?

2005-09-30 Thread Gregory Maxwell
On 9/28/05, Ron Peacetree [EMAIL PROTECTED] wrote:
 2= We use my method to sort two different tables.  We now have these
 very efficient representations of a specific ordering on these tables.  A
 join operation can now be done using these Btrees rather than the
 original data tables that involves less overhead than many current
 methods.

If we want to make joins very fast we should implement them using RD
trees. For the example cases where a join against a very large table
will produce a much smaller output, a RD tree will provide pretty much
the optimal behavior at a very low memory cost.

On the subject of high speed tree code for in-core applications, you
should check out http://judy.sourceforge.net/ . The performance
(insert, remove, lookup, AND storage) is really quite impressive.
Producing cache friendly code is harder than one might expect, and it
appears the judy library has already done a lot of the hard work. 
Though it is *L*GPLed, so perhaps that might scare some here away from
it. :) and good luck directly doing joins with a LC-TRIE. ;)

---(end of broadcast)---
TIP 6: explain analyze is your friend


Re: [HACKERS] [PERFORM] A Better External Sort?

2005-09-30 Thread Dann Corbit
Judy definitely rates a WOW!!

 -Original Message-
 From: [EMAIL PROTECTED] [mailto:pgsql-hackers-
 [EMAIL PROTECTED] On Behalf Of Gregory Maxwell
 Sent: Friday, September 30, 2005 7:07 PM
 To: Ron Peacetree
 Cc: Jeffrey W. Baker; pgsql-hackers@postgresql.org; pgsql-
 [EMAIL PROTECTED]
 Subject: Re: [HACKERS] [PERFORM] A Better External Sort?
 
 On 9/28/05, Ron Peacetree [EMAIL PROTECTED] wrote:
  2= We use my method to sort two different tables.  We now have these
  very efficient representations of a specific ordering on these
tables.
 A
  join operation can now be done using these Btrees rather than the
  original data tables that involves less overhead than many current
  methods.
 
 If we want to make joins very fast we should implement them using RD
 trees. For the example cases where a join against a very large table
 will produce a much smaller output, a RD tree will provide pretty much
 the optimal behavior at a very low memory cost.
 
 On the subject of high speed tree code for in-core applications, you
 should check out http://judy.sourceforge.net/ . The performance
 (insert, remove, lookup, AND storage) is really quite impressive.
 Producing cache friendly code is harder than one might expect, and it
 appears the judy library has already done a lot of the hard work.
 Though it is *L*GPLed, so perhaps that might scare some here away from
 it. :) and good luck directly doing joins with a LC-TRIE. ;)
 
 ---(end of
broadcast)---
 TIP 6: explain analyze is your friend

---(end of broadcast)---
TIP 5: don't forget to increase your free space map settings


Re: [HACKERS] [PERFORM] A Better External Sort?

2005-09-29 Thread Pailloncy Jean-Gerard
Your main example seems to focus on a large table where a key  
column has

constrained values.  This case is interesting in proportion to the
number of possible values.  If I have billions of rows, each  
having one

of only two values, I can think of a trivial and very fast method of
returning the table sorted by that key: make two sequential passes,
returning the first value on the first pass and the second value  
on the

second pass.  This will be faster than the method you propose.


1= No that was not my main example.  It was the simplest example  
used to
frame the later more complicated examples.  Please don't get hung  
up on it.


2= You are incorrect.  Since IO is the most expensive operation we  
can do,
any method that makes two passes through the data at top scanning  
speed
will take at least 2x as long as any method that only takes one  
such pass.

You do not get the point.
At the time you get the sorted references to the tuples, you still need to  
fetch the tuples themselves, check their visibility, etc. and return  
them to the client.


So,
if there are only 2 values in the column of a big table that is larger  
than available RAM,

two seq scans of the table without any sorting
is the fastest solution.
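
As a hedged sketch of that two-pass idea (the row type and the emit
step are invented for illustration):

#include <stdio.h>

typedef struct { int key; /* ... other columns ... */ } Row;

static void emit(const Row *r)          /* stand-in for returning a row */
{
    printf("key=%d\n", r->key);
}

/* Two sequential scans return the rows in key order without any sort,
 * when the key column has exactly two distinct values. */
void two_pass_ordered_scan(const Row *table, size_t nrows,
                           int low_value, int high_value)
{
    size_t i;

    for (i = 0; i < nrows; i++)          /* sequential pass 1 */
        if (table[i].key == low_value)
            emit(&table[i]);
    for (i = 0; i < nrows; i++)          /* sequential pass 2 */
        if (table[i].key == high_value)
            emit(&table[i]);
}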

Cordialement,
Jean-Gérard Pailloncy


---(end of broadcast)---
TIP 9: In versions below 8.0, the planner will ignore your desire to
  choose an index scan if your joining column's datatypes do not
  match


Re: [HACKERS] [PERFORM] A Better External Sort?

2005-09-29 Thread Ron Peacetree
If I've done this correctly, there should not be anywhere near
the number of context switches we currently see while sorting.

Each unscheduled context switch represents something unexpected
occurring or things not being where they are needed when they are
needed.  Reducing such circumstances to the absolute minimum 
was one of the design goals.

Reducing the total amount of IO to the absolute minimum should
help as well. 

Ron


-Original Message-
From: Kevin Grittner [EMAIL PROTECTED]
Sent: Sep 27, 2005 11:21 AM
Subject: Re: [HACKERS] [PERFORM] A Better External Sort?

I can't help wondering how a couple thousand context switches per
second would affect the attempt to load disk info into the L1 and
L2 caches.  That's pretty much the low end of what I see when the
server is under any significant load.




---(end of broadcast)---
TIP 2: Don't 'kill -9' the postmaster


Re: [HACKERS] [PERFORM] A Better External Sort?

2005-09-29 Thread Ron Peacetree
From: Jeffrey W. Baker [EMAIL PROTECTED]
Sent: Sep 29, 2005 12:27 AM
To: Ron Peacetree [EMAIL PROTECTED]
Cc: pgsql-hackers@postgresql.org, pgsql-performance@postgresql.org
Subject: Re: [HACKERS] [PERFORM] A Better External Sort?

You are engaging in a lengthy and verbose exercise in mental
masturbation, because you have not yet given a concrete example of a
query where this stuff would come in handy.  A common, general-purpose
case would be the best.
 
??? I posted =3= specific classes of common, general-purpose query
operations where OES and the OES Btrees look like they should be
superior to current methods:
1= when splitting sorting or other operations across multiple CPUs
2= when doing joins of different tables by doing the join on these Btrees
rather than the original tables.
3= when the opportunity arises to reuse OES Btree results of previous
sorts for different keys in the same table.  Now we can combine the
existing Btrees to obtain the new order based on the composite key
without ever manipulating the original, much larger, table.  

In what way are these examples not concrete?


We can all see that the method you describe might be a good way to sort
a very large dataset with some known properties, which would be fine if
you are trying to break the terasort benchmark.  But that's not what
we're doing here.  We are designing and operating relational databases.
So please explain the application.

This is a GENERAL method.  It's based on CPU cache efficient Btrees that
use variable length prefix keys and RIDs.
It assumes NOTHING about the data or the system in order to work.
I gave some concrete examples for the sake of easing explanation, NOT
as an indication of assumptions or limitations of the method.  I've even
gone out of my way to prove that no such assumptions or limitations exist.
Where in the world are you getting such impressions?
  

Your main example seems to focus on a large table where a key column has
constrained values.  This case is interesting in proportion to the
number of possible values.  If I have billions of rows, each having one
of only two values, I can think of a trivial and very fast method of
returning the table sorted by that key: make two sequential passes,
returning the first value on the first pass and the second value on the
second pass.  This will be faster than the method you propose.

1= No that was not my main example.  It was the simplest example used to
frame the later more complicated examples.  Please don't get hung up on it.

2= You are incorrect.  Since IO is the most expensive operation we can do,
any method that makes two passes through the data at top scanning speed
will take at least 2x as long as any method that only takes one such pass.


I think an important aspect you have failed to address is how much of
the heap you must visit after the sort is complete.  If you are
returning every tuple in the heap then the optimal plan will be very
different from the case when you needn't.  

Hmmm.  Not sure which heap you are referring to, but the OES Btree
index is provably the lowest (in terms of tree height) and smallest
possible CPU cache efficient data structure that one can make and still
have all of the traditional benefits associated with a Btree representation
of a data set.

Nonetheless, returning a RID, or all RIDs with(out) the same Key, or all
RIDs (not) within a range of Keys, or simply all RIDs in sorted order is
efficient.  Just as should be for a Btree (actually it's a B+ tree variant to
use Knuth's nomenclature).  I'm sure someone posting from acm.org
recognizes how each of these Btree operations maps to various SQL
features...  

I haven't been talking about query plans because they are orthogonal to
the issue under discussion.  If we use a layered model for PostgreSQL's
architecture, this functionality is more primal than that of a query
planner.  ALL query plans that currently involve sorts will benefit from a
more efficient way to do, or avoid, sorts.


PS: Whatever mailer you use doesn't understand or respect threading nor
attribution.  Out of respect for the list's readers, please try a mailer
that supports these 30-year-old fundamentals of electronic mail.

That is an issue of infrastructure on the receiving side, not on the sending
(my) side since even my web mailer seems appropriately RFC conformant.
Everything seems to be going in the correct places and being properly 
organized on archival.postgres.org ...

Ron

---(end of broadcast)---
TIP 9: In versions below 8.0, the planner will ignore your desire to
   choose an index scan if your joining column's datatypes do not
   match


Re: [HACKERS] [PERFORM] A Better External Sort?

2005-09-29 Thread Ron Peacetree
From: Jeffrey W. Baker [EMAIL PROTECTED]
Sent: Sep 27, 2005 1:26 PM
To: Ron Peacetree [EMAIL PROTECTED]
Subject: Re: [HACKERS] [PERFORM] A Better External Sort?

On Tue, 2005-09-27 at 13:15 -0400, Ron Peacetree wrote:

That Btree can be used to generate a physical reordering of the data
in one pass, but that's the weakest use for it.  The more powerful
uses involve allowing the Btree to persist and using it for more
efficient re-searches or combining it with other such Btrees (either as
a step in task distribution across multiple CPUs or as a more efficient
way to do things like joins by manipulating these Btrees rather than
the actual records.)

Maybe you could describe some concrete use cases.  I can see what
you are getting at, and I can imagine some advantageous uses, but
I'd like to know what you are thinking.

1= In a 4P box, we split the data in RAM into 4 regions and create
a CPU cache friendly Btree using the method I described for each CPU.
The 4 Btrees can be merged in a more time and space efficient manner
than the original records to form a Btree that represents the sorted order
of the entire data set.  Any of these Btrees can be allowed to persist to
lower the cost of doing similar operations in the future (Updating the
Btrees during inserts and deletes is cheaper than updating the original
data files and then redoing the same sort from scratch in the future.)
Both the original sort and future such sorts are made more efficient
than current methods.

2= We use my method to sort two different tables.  We now have these
very efficient representations of a specific ordering on these tables.  A
join operation can now be done using these Btrees rather than the
original data tables that involves less overhead than many current
methods.

3=  We have multiple such Btrees for the same data set representing
sorts done using different fields (and therefore different Keys).
Calculating a sorted order for the data based on a composition of
those Keys is now cheaper than doing the sort based on the composite
Key from scratch.  When some of the Btrees exist and some of them
do not, there is a tradeoff calculation to be made.  Sometimes it will be
cheaper to do the sort from scratch using the composite Key.   


Specifically I'd like to see some cases where this would beat sequential
scan.  I'm thinking that in your example of a terabyte table with a
column having only two values, all the queries I can think of would be
better served with a sequential scan.

In my original example, a sequential scan of the 1TB of 2KB or 4KB
records, = 250M or 500M records of data, being sorted on a binary
value key will take ~1000x more time than reading in the ~1GB Btree
I described that used a Key+RID (plus node pointers) representation
of the data.
 
Just to clarify the point further,
1TB of 1B records = 2^40 records of at most 256 distinct values.
1TB of 2B records = 2^39 records of at most 2^16 distinct values.
1TB of 4B records = 2^38 records of at most 2^32 distinct values.
1TB of 5B records = 200B records of at most 200B distinct values.
From here on, the number of possible distinct values is limited by the
number of records.
100B records are used in the Indy version of Jim Gray's sorting 
contests, so 1TB = 10B records.
2KB-4KB is the most common record size I've seen in enterprise
class DBMS (so I used this value to make my initial example more
realistic).

Therefore the vast majority of the time representing a data set by Key
will use less space that the original record.  Less space used means
less IO to scan the data set, which means faster scan times.

This is why index files work in the first place, right?


Perhaps I believe this because you can now buy as much sequential I/O
as you want.  Random I/O is the only real savings.

1= No, you can not buy as much sequential IO as you want.  Even
with an infinite budget, there are physical and engineering limits.  Long
before you reach those limits, you will pay exponentially increasing costs
for linearly increasing performance gains.  So even if you _can_ buy a
certain level of sequential IO, it may not be the most efficient way to
spend money.

2= Most RW IT professionals have far from an infinite budget.  Just traffic
on these lists shows how severe the typical cost constraints usually are.
OTOH, if you have an infinite IT budget, care to help a few less fortunate
than yourself?  After all, even a large constant subtracted from infinity
is still infinity... ;-)

3= No matter how fast you can do IO, IO remains the most expensive
part of the performance equation.  The fastest and cheapest IO you can
do is _no_ IO.  As long as we trade cheaper RAM and even cheaper CPU
operations for IO correctly, more space efficient data representations will
always be a Win because of this.

---(end of broadcast)---
TIP 9: In versions below 8.0, the planner will ignore your desire to
   choose an index scan if your joining

Re: [HACKERS] [PERFORM] A Better External Sort?

2005-09-29 Thread Ron Peacetree
In the interest of efficiency and not reinventing the wheel, does anyone know
where I can find C or C++ source code for a Btree variant with the following
properties:

A= Data elements (RIDs) are only stored in the leaves, Keys (actually
KeyPrefixes; see D below) and Node pointers are only stored in the internal
nodes of the Btree.

B= Element redistribution is done as an alternative to node splitting in 
overflow
conditions during Inserts whenever possible.

C= Variable length Keys are supported.

D= Node buffering with a reasonable replacement policy is supported.

E= Since we will know beforehand exactly how many RID's will be stored, we
will know apriori how much space will be needed for leaves, and will know the
worst case for how much space will be required for the Btree internal nodes
as well.  This implies that we may be able to use an array, rather than linked
list, implementation of the Btree.  Less pointer chasing at the expense of more
CPU calculations, but that's a trade-off in the correct direction. 
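
To make the shape concrete, here is a hedged sketch of the node layout
that A-E above seem to imply (every field name and size is invented):
keys only in internal nodes, RIDs only in leaves, array indexes instead
of node pointers, and arrays preallocated from the known RID count.

#include <stdint.h>

#define MAX_FANOUT 64                       /* invented, for the sketch */

typedef struct {
    uint16_t nkeys;
    uint16_t prefix_len[MAX_FANOUT];        /* variable-length prefixes (C) */
    uint32_t prefix_off[MAX_FANOUT];        /* offsets into the key heap    */
    uint32_t child[MAX_FANOUT + 1];         /* array indexes, not pointers  */
} InternalNode;

typedef struct {
    uint16_t nrids;
    uint64_t rid[MAX_FANOUT];               /* RIDs only in the leaves (A)  */
} LeafNode;

typedef struct {
    InternalNode *internal;     /* preallocated arrays: worst-case sizes    */
    LeafNode     *leaves;       /* are known in advance from the RID count  */
    char         *key_heap;     /* packed variable-length key prefix bytes  */
    uint32_t      root;
} PrefixBtree;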

Such source would be a big help in getting a prototype together.

Thanks in advance for any pointers or source,
Ron

---(end of broadcast)---
TIP 6: explain analyze is your friend


Re: [HACKERS] [PERFORM] A Better External Sort?

2005-09-29 Thread Zeugswetter Andreas DAZ SD

 In my original example, a sequential scan of the 1TB of 2KB 
 or 4KB records, = 250M or 500M records of data, being sorted 
 on a binary value key will take ~1000x more time than reading 
 in the ~1GB Btree I described that used a Key+RID (plus node 
 pointers) representation of the data.

Imho you seem to ignore the final step your algorithm needs of
collecting the
data rows. After you sorted the keys the collect step will effectively
access the 
tuples in random order (given a sufficiently large key range).

This random access is bad. It effectively allows a competing algorithm
to read the
whole data at least 40 times sequentially, or write the set 20 times
sequentially. 
(Those are the random/sequential ratios of modern discs)

Andreas

---(end of broadcast)---
TIP 9: In versions below 8.0, the planner will ignore your desire to
   choose an index scan if your joining column's datatypes do not
   match


Re: [HACKERS] [PERFORM] A Better External Sort?

2005-09-29 Thread Josh Berkus

Jeff, Ron,

First off, Jeff, please take it easy.  We're discussing 8.2 features at 
this point and there's no reason to get stressed out at Ron.  You can 
get plenty stressed out when 8.2 is near feature freeze.  ;-)



Regarding use cases for better sorts:

The biggest single area where I see PostgreSQL external sort sucking is 
 on index creation on large tables.   For example, for the free version of 
TPCH, it takes only 1.5 hours to load a 60GB Lineitem table on OSDL's 
hardware, but over 3 hours to create each index on that table.  This 
means that, overall, our load into TPCH takes 4 times as long to create 
the indexes as it did to bulk load the data.


Anyone restoring a large database from pg_dump is in the same situation. 
 Even worse, if you have to create a new index on a large table on a 
production database in use, because the I/O from the index creation 
swamps everything.


Following an index creation, we see that 95% of the time required is the 
external sort, which averages 2mb/s.  This is with separate drives for 
the WAL, the pg_tmp, the table and the index.  I've confirmed that 
increasing work_mem beyond a small minimum (around 128mb) had no benefit 
on the overall index creation speed.



--Josh Berkus



---(end of broadcast)---
TIP 9: In versions below 8.0, the planner will ignore your desire to
  choose an index scan if your joining column's datatypes do not
  match


Re: [HACKERS] [PERFORM] A Better External Sort?

2005-09-29 Thread Luke Lonergan
Josh,

On 9/29/05 9:54 AM, Josh Berkus josh@agliodbs.com wrote:

 Following an index creation, we see that 95% of the time required is the
 external sort, which averages 2mb/s.  This is with separate drives for
 the WAL, the pg_tmp, the table and the index.  I've confirmed that
 increasing work_mem beyond a small minimum (around 128mb) had no benefit
 on the overall index creation speed.

Yep!  That about sums it up - regardless of taking 1 or 2 passes through
the heap being sorted, 1.5 - 2 MB/s is the wrong number.  This is not
necessarily an algorithmic problem, but an optimization problem with
Postgres that must be fixed before it can be competitive.

We read/write to/from disk at 240MB/s and so 2 passes would run at a net
rate of 120MB/s through the sort set if it were that efficient.

Anyone interested in tackling the real performance issue? (flame bait, but
for a worthy cause :-)

- Luke



---(end of broadcast)---
TIP 1: if posting/reading through Usenet, please send an appropriate
   subscribe-nomail command to [EMAIL PROTECTED] so that your
   message can get through to the mailing list cleanly


Re: [HACKERS] [PERFORM] A Better External Sort?

2005-09-29 Thread David Fetter
On Thu, Sep 29, 2005 at 10:06:52AM -0700, Luke Lonergan wrote:
 Josh,
 
 On 9/29/05 9:54 AM, Josh Berkus josh@agliodbs.com wrote:
 
  Following an index creation, we see that 95% of the time required
  is the external sort, which averages 2mb/s.  This is with separate
  drives for the WAL, the pg_tmp, the table and the index.  I've
  confirmed that increasing work_mem beyond a small minimum (around
  128mb) had no benefit on the overall index creation speed.
 
 Yep!  That about sums it up - regardless of taking 1 or 2 passes
 through the heap being sorted, 1.5 - 2 MB/s is the wrong number.
 This is not necessarily an algorithmic problem, but an
 optimization problem with Postgres that must be fixed before it can
 be competitive.
 
 We read/write to/from disk at 240MB/s and so 2 passes would run at a
 net rate of 120MB/s through the sort set if it were that efficient.
 
 Anyone interested in tackling the real performance issue? (flame
 bait, but for a worthy cause :-)

I'm not sure that it's flamebait, but what do I know?  Apart from the
nasty number (1.5-2 MB/s), what other observations do you have to
hand?  Any ideas about what things are not performing here?  Parts of
the code that could bear extra scrutiny?  Ideas on how to fix same in
a cross-platform way?

Cheers,
D
-- 
David Fetter [EMAIL PROTECTED] http://fetter.org/
phone: +1 510 893 6100   mobile: +1 415 235 3778

Remember to vote!

---(end of broadcast)---
TIP 9: In versions below 8.0, the planner will ignore your desire to
   choose an index scan if your joining column's datatypes do not
   match


Re: [HACKERS] [PERFORM] A Better External Sort?

2005-09-29 Thread Jeffrey W. Baker
On Thu, 2005-09-29 at 10:06 -0700, Luke Lonergan wrote:
 Josh,
 
 On 9/29/05 9:54 AM, Josh Berkus josh@agliodbs.com wrote:
 
  Following an index creation, we see that 95% of the time required is the
  external sort, which averages 2mb/s.  This is with separate drives for
  the WAL, the pg_tmp, the table and the index.  I've confirmed that
  increasing work_mem beyond a small minimum (around 128mb) had no benefit
  on the overall index creation speed.
 
  Yep!  That about sums it up - regardless of taking 1 or 2 passes through
 the heap being sorted, 1.5 - 2 MB/s is the wrong number.

Yeah this is really bad ... approximately the speed of GNU sort.

Josh, do you happen to know how many passes are needed in the multiphase
merge on your 60GB table?

Looking through tuplesort.c, I have a couple of initial ideas.  Are we
allowed to fork here?  That would open up the possibility of using the
CPU and the I/O in parallel.  I see that tuplesort.c also suffers from
the kind of postgresql-wide disease of calling all the way up and down a
big stack of software for each tuple individually.  Perhaps it could be
changed to work on vectors.

I think the largest speedup will be to dump the multiphase merge and
merge all tapes in one pass, no matter how large M.  Currently M is
capped at 6, so a sort of 60GB with 1GB sort memory needs 13 passes over
the tape.  It could be done in a single pass heap merge with N*log(M)
comparisons, and, more importantly, far less input and output.
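
As a hedged sketch of that single-pass heap merge (C99; Run, keys[]
and the 64-way cap are stand-ins for real tape access, not tuplesort.c
code): each output tuple is moved once and costs about log(M)
comparisons.

#include <stddef.h>

typedef struct {
    const long *keys;           /* this run's keys, already sorted */
    size_t      pos, len;
} Run;

typedef struct { long key; size_t run; } HeapItem;

static void sift_down(HeapItem *h, size_t n, size_t i)
{
    for (;;) {
        size_t smallest = i, l = 2 * i + 1, r = 2 * i + 2;
        if (l < n && h[l].key < h[smallest].key) smallest = l;
        if (r < n && h[r].key < h[smallest].key) smallest = r;
        if (smallest == i) return;
        HeapItem t = h[i]; h[i] = h[smallest]; h[smallest] = t;
        i = smallest;
    }
}

/* Merge m sorted runs into out[], which must hold their total length. */
void merge_runs(Run *runs, size_t m, long *out)
{
    HeapItem heap[64];          /* sketch assumes m <= 64 */
    size_t n = 0, o = 0;

    for (size_t i = 0; i < m; i++)          /* prime the heap */
        if (runs[i].len > 0)
            heap[n++] = (HeapItem){ runs[i].keys[0], i };
    for (size_t i = n; i-- > 0; )
        sift_down(heap, n, i);

    while (n > 0) {                         /* pop min, refill from its run */
        Run *r = &runs[heap[0].run];
        out[o++] = heap[0].key;
        if (++r->pos < r->len)
            heap[0].key = r->keys[r->pos];
        else
            heap[0] = heap[--n];
        sift_down(heap, n, 0);
    }
}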

I would also recommend using an external process to asynchronously
feed the tuples into the heap during the merge.

What's the timeframe for 8.2?

-jwb




---(end of broadcast)---
TIP 6: explain analyze is your friend


Re: [HACKERS] [PERFORM] A Better External Sort?

2005-09-29 Thread Josh Berkus
Jeff,

 Josh, do you happen to know how many passes are needed in the multiphase
 merge on your 60GB table?

No, any idea how to test that?

 I think the largest speedup will be to dump the multiphase merge and
 merge all tapes in one pass, no matter how large M.  Currently M is
 capped at 6, so a sort of 60GB with 1GB sort memory needs 13 passes over
 the tape.  It could be done in a single pass heap merge with N*log(M)
 comparisons, and, more importantly, far less input and output.

Yes, but the evidence suggests that we're actually not using the whole 1GB 
of RAM ... maybe using only 32MB of it which would mean over 200 passes 
(I'm not sure of the exact math).  Just fixing our algorithm so that it 
used all of the work_mem permitted might improve things tremendously.

 I would also recommend using an external processes to asynchronously
 feed the tuples into the heap during the merge.

 What's the timeframe for 8.2?

Too far out to tell yet.  Probably 9mo to 1 year, that's been our history.

-- 
--Josh

Josh Berkus
Aglio Database Solutions
San Francisco

---(end of broadcast)---
TIP 1: if posting/reading through Usenet, please send an appropriate
   subscribe-nomail command to [EMAIL PROTECTED] so that your
   message can get through to the mailing list cleanly


Re: [HACKERS] [PERFORM] A Better External Sort?

2005-09-29 Thread Jeffrey W. Baker
On Thu, 2005-09-29 at 11:03 -0700, Josh Berkus wrote:
 Jeff,
 
  Josh, do you happen to know how many passes are needed in the multiphase
  merge on your 60GB table?
 
 No, any idea how to test that?

I would just run it under the profiler and see how many times
beginmerge() is called.

-jwb


---(end of broadcast)---
TIP 4: Have you searched our list archives?

   http://archives.postgresql.org


Re: [HACKERS] [PERFORM] A Better External Sort?

2005-09-29 Thread Josh Berkus
Jeff,

 I would just run it under the profiler and see how many times
 beginmerge() is called.

Hmm, I'm not seeing it at all in the oprofile results on a 100million-row 
sort.

-- 
--Josh

Josh Berkus
Aglio Database Solutions
San Francisco

---(end of broadcast)---
TIP 4: Have you searched our list archives?

   http://archives.postgresql.org


Re: [HACKERS] [PERFORM] A Better External Sort?

2005-09-29 Thread Dann Corbit
If I were to be nosy and poke around in this, what patches of code would
I be interested in?

 -Original Message-
 From: [EMAIL PROTECTED] [mailto:pgsql-hackers-
 [EMAIL PROTECTED] On Behalf Of Josh Berkus
 Sent: Thursday, September 29, 2005 11:28 AM
 To: pgsql-hackers@postgresql.org
 Cc: Jeffrey W. Baker
 Subject: Re: [HACKERS] [PERFORM] A Better External Sort?
 
 Jeff,
 
  I would just run it under the profiler and see how many times
  beginmerge() is called.
 
 Hmm, I'm not seeing it at all in the oprofile results on a
100million-row
 sort.
 
 --
 --Josh
 
 Josh Berkus
 Aglio Database Solutions
 San Francisco
 
 ---(end of
broadcast)---
 TIP 4: Have you searched our list archives?
 
http://archives.postgresql.org

---(end of broadcast)---
TIP 2: Don't 'kill -9' the postmaster

