[HACKERS] qsort again

On Thu, Feb 16, 2006 at 01:10:48PM +0100, Florian Weimer wrote:
 * Neil Conway:
 
  On Wed, 2006-02-15 at 18:28 -0500, Tom Lane wrote:
  It seems clear that our qsort.c is doing a pretty awful job of picking
  qsort pivots, while glibc is mostly managing not to make that mistake.
  I haven't looked at the glibc code yet to see what they are doing
  differently.
 
  glibc qsort is actually merge sort, so I'm not surprised it avoids this
  problem.
 
 qsort also performs twice as many key comparisons as the theoretical
 minimum.  If key comparison is not very cheap, other schemes (like
 heapsort, for example) are more attractive.

Last time around there were a number of different algorithms tested.
Did anyone run those tests while getting it to count the number of
actual comparisons (which could easily swamp the time taken to do the
actual sort in some cases)?
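
One way to get such counts is to wrap the comparator in a counter and run each candidate algorithm over the same input. The following is a minimal sketch in Python, not the harness used in the earlier tests; `count_comparisons`, `builtin_sort` and `simple_quicksort` are hypothetical names, and the quicksort shown is a textbook variant rather than PostgreSQL's qsort.c.

```python
import functools
import random

def count_comparisons(sort_fn, data):
    """Run sort_fn(items, cmp) and return (sorted_list, number_of_cmp_calls)."""
    calls = 0

    def cmp(a, b):
        nonlocal calls
        calls += 1
        return (a > b) - (a < b)

    result = sort_fn(list(data), cmp)
    return result, calls

def builtin_sort(items, cmp):
    # Python's built-in sort, driven through an explicit comparator.
    return sorted(items, key=functools.cmp_to_key(cmp))

def simple_quicksort(items, cmp):
    # Textbook three-way quicksort with a middle-element pivot.
    if len(items) <= 1:
        return items
    pivot = items[len(items) // 2]
    less, equal, greater = [], [], []
    for x in items:
        c = cmp(x, pivot)
        if c < 0:
            less.append(x)
        elif c > 0:
            greater.append(x)
        else:
            equal.append(x)
    return simple_quicksort(less, cmp) + equal + simple_quicksort(greater, cmp)

if __name__ == "__main__":
    random.seed(0)
    data = [random.randrange(1000) for _ in range(2000)]
    for name, fn in [("builtin", builtin_sort), ("quicksort", simple_quicksort)]:
        _, n = count_comparisons(fn, data)
        print(name, "comparisons:", n)
```

Reporting the comparison count next to wall-clock time shows whether an algorithm wins by doing fewer comparisons or by doing cheaper bookkeeping around them.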

Have a nice day,
-- 
Martijn van Oosterhout   kleptog@svana.org   http://svana.org/kleptog/
 Patent. n. Genius is 5% inspiration and 95% perspiration. A patent is a
 tool for doing 5% of the work and then sitting around waiting for someone
 else to do the other 95% so you can sue them.



Re: [PERFORM] [HACKERS] qsort again

2006-02-16 Thread Sven Geisler


Martijn van Oosterhout schrieb:


Last time around there were a number of different algorithms tested.
Did anyone run those tests while getting it to count the number of
actual comparisons (which could easily swamp the time taken to do the
actual sort in some cases)?



The last time I ran such tests was almost 10 years ago. I used 
MetroWerks CodeWarrior C/C++, whose C library implemented qsort as Quicksort.
Anyhow, I tested a few algorithms, including merge sort and heapsort. I 
ended up with heapsort because it was the fastest algorithm for our problem. 
We joined two arrays, each of which was already sorted, and ran qsort to sort 
the new array.


Sven.

---(end of broadcast)---
TIP 1: if posting/reading through Usenet, please send an appropriate
  subscribe-nomail command to [EMAIL PROTECTED] so that your
  message can get through to the mailing list cleanly

Re: [HACKERS] qsort again (was Re: [PERFORM] Strange Create Index


At 06:35 AM 2/16/2006, Steinar H. Gunderson wrote:

On Wed, Feb 15, 2006 at 11:30:54PM -0500, Ron wrote:
 Even better (and more easily scaled as the number of GPR's in the CPU
 changes) is to use
 the set {L; L+1; L+2; t1; R-2; R-1; R}
 This means that instead of 7 random memory accesses, we have 3; two
 of which result in a burst access for three elements each.

Isn't that improvement going to disappear completely if you choose a bad
pivot?

Only if you _consistently_ (read: the vast majority of the time; 
quicksort is actually darn robust) choose a _pessimal_, not just 
bad, pivot will quicksort degenerate to the O(N^2) behavior 
everyone worries about.  See Cormen & Rivest for a proof of this.


Even then, doing things as above has benefits:
1= The worst case is less bad since the guaranteed O(lg(s!)) pivot 
choosing algorithm puts s elements into final position.

Worst case becomes better than O(N^2/(s-1)).

2=  The overhead of pivot choosing can overshadow the benefits using 
more traditional methods for even moderate values of s.  See 
discussions on the quicksort variant known as samplesort and 
Sedgewick's PhD thesis for details.  Using a pivot choosing algorithm 
that actually does some of the partitioning (and does it more 
efficiently than the usual partitioning algorithm does) plus using 
partition-in-place (rather than Lomuto's method) reduces overhead 
very effectively (at the cost of more complicated / delicate to get 
right partitioning code).  The above reduces the number of moves used 
in a quicksort pass considerably regardless of the number of compares used.


3= Especially in modern systems where the gap between internal CPU 
bandwidth and memory bandwidth is so great, the overhead of memory 
accesses for comparisons and moves is the majority of the overhead 
for both the pivot choosing and the partitioning algorithms within 
quicksort.  Particularly random memory accesses.  The reason (#GPRs - 
1) is a magic constant is that it's the most you can compare and move 
using only register-to-register operations.


In addition, replacing as many of the memory accesses you must do 
with sequential rather than random memory accesses is a big deal: 
sequential memory access is measured in 10's of CPU cycles while 
random memory access is measured in hundreds of CPU cycles.  It's no 
accident that the advances in Gray's sorting contest have involved 
algorithms that are both register and cache friendly, minimizing 
overall memory access and using sequential memory access as much as 
possible when said access can not be avoided.  As caches grow larger 
and memory accesses more expensive, it's often worth it to use a 
BucketSort+QuickSort hybrid rather than just QuickSort.


...and of course if you know enough about the data to be sorted so as 
to constrain it appropriately, one should use a non comparison based 
O(N) sorting algorithm rather than any of the general comparison 
based O(NlgN) methods.




 SIDE NOTE: IIRC glibc's qsort is actually merge sort.  Merge sort
 performance is insensitive to all inputs, and there are ways to
 optimize it as well.

glibc-2.3.5/stdlib/qsort.c:

  /* Order size using quicksort.  This implementation incorporates
 four optimizations discussed in Sedgewick:

I can't see any references to merge sort in there at all.

Well, then I'm not the only person on the lists whose memory is faulty ;-)

The up side of MergeSort is that its performance is always O(NlgN).
The down sides are that it is far more memory hungry than QuickSort and slower.


Ron





Re: [PERFORM] [HACKERS] qsort again


At 07:10 AM 2/16/2006, Florian Weimer wrote:

* Neil Conway:

 On Wed, 2006-02-15 at 18:28 -0500, Tom Lane wrote:
 It seems clear that our qsort.c is doing a pretty awful job of picking
 qsort pivots, while glibc is mostly managing not to make that mistake.
 I haven't looked at the glibc code yet to see what they are doing
 differently.

 glibc qsort is actually merge sort, so I'm not surprised it avoids this
 problem.

qsort also performs twice as many key comparisons as the theoretical
minimum.


The theoretical minimum number of comparisons for a general purpose 
comparison based sort is O(lg(N!)), which is ~NlgN.
QuickSort uses 2NlnN ~= 1.38NlgN, or ~1.38x the optimum, without tuning 
(see Knuth, Sedgewick, Cormen, ... etc)
OTOH, QuickSort uses ~2x as many =moves= as the theoretical minimum 
unless tuned, and moves are more expensive than compares in modern systems.
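
The 1.38 factor is just a change of logarithm base, which a few lines of Python can check. This is a sketch of the arithmetic only; `lower_bound` and `quicksort_average` are hypothetical helper names, and `quicksort_average` keeps only the leading term of the average-case formula.

```python
import math

# Average comparisons for untuned quicksort: about 2 * N * ln(N).
# Since ln(N) = ln(2) * lg(N), this equals (2 * ln 2) * N * lg(N).
ratio = 2 * math.log(2)

def lower_bound(n):
    """Information-theoretic minimum number of comparisons: ceil(lg(n!))."""
    return math.ceil(math.log2(math.factorial(n)))

def quicksort_average(n):
    """Leading term of quicksort's average comparison count: 2 n ln n."""
    return 2 * n * math.log(n)

if __name__ == "__main__":
    print("2 ln 2 =", round(ratio, 3))
    for n in (10, 1000, 100000):
        print(n, lower_bound(n), round(quicksort_average(n)))
```

For finite N the measured ratio sits above 2 ln 2, because lg(N!) is smaller than NlgN by a term linear in N.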


See my other posts for QuickSort tuning methods that attempt to 
directly address both issues.



Ron 





Re: [HACKERS] qsort again (was Re: [PERFORM] Strange Create Index

2006-02-16 Thread Markus Schaber

Hi, Ron,

Ron wrote:

 ...and of course if you know enough about the data to be sorted so as to
 constrain it appropriately, one should use a non comparison based O(N)
 sorting algorithm rather than any of the general comparison based
 O(NlgN) methods.

Sounds interesting, could you give us some pointers (names, URLs,
papers) to such algorithms?

Thanks a lot,
Markus



-- 
Markus Schaber | Logical Tracking&Tracing International AG
Dipl. Inf. | Software Development GIS

Fight against software patents in EU! www.ffii.org www.nosoftwarepatents.org


Re: [HACKERS] qsort again (was Re: [PERFORM] Strange Create Index

2006-02-16 Thread Jonah H. Harris

Last night I implemented a non-recursive introsort in C... let me test it a bit more and then I'll post it here for everyone else to try out.

On 2/16/06, Markus Schaber [EMAIL PROTECTED] wrote:
 Hi, Ron,

 Ron wrote:
  ...and of course if you know enough about the data to be sorted so as to
  constrain it appropriately, one should use a non comparison based O(N)
  sorting algorithm rather than any of the general comparison based
  O(NlgN) methods.

 Sounds interesting, could you give us some pointers (names, URLs,
 papers) to such algorithms?

-- 
Jonah H. Harris, Database Internals Architect
EnterpriseDB Corporation
732.331.1324

Re: [HACKERS] qsort again (was Re: [PERFORM] Strange Create Index

On Thu, Feb 16, 2006 at 08:22:55AM -0500, Ron wrote:
 3= Especially in modern systems where the gap between internal CPU 
 bandwidth and memory bandwidth is so great, the overhead of memory 
 accesses for comparisons and moves is the majority of the overhead 
 for both the pivot choosing and the partitioning algorithms within 
 quicksort.  Particularly random memory accesses.  The reason (#GPRs - 
 1) is a magic constant is that it's the most you can compare and move 
 using only register-to-register operations.

But how much of this applies to us? We're not sorting arrays of
integers, we're sorting pointers to tuples. So while moves cost very
little, a comparison costs hundreds, maybe thousands of cycles. A tuple
can easily be two or three cachelines and you're probably going to
access all of it, not to mention the Fmgr structures and the Datums
themselves.

None of this is cache friendly. The actual tuples themselves could be
spread all over memory (I don't think any particular effort is expended
trying to minimize fragmentation).

Do these algorithms discuss the case where a comparison is more than
1000 times the cost of a move?

Where this does become interesting is where we can convert a datum to
an integer such that if f(A) < f(B) then A < B. Then we can sort on
f(X) first with just integer comparisons and then do a full tuple
comparison only if f(A) = f(B). This would be much more cache-coherent
and make these algorithms much more applicable in my mind.
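
A minimal sketch of that idea in Python, assuming a key function f for which f(A) < f(B) implies A < B: the integer key is computed once and stored next to each row, and the expensive comparison runs only when two keys tie. `sort_with_keys`, `first_char_key` and `full_cmp` are hypothetical names, not PostgreSQL functions.

```python
import functools

def sort_with_keys(rows, key_fn, full_cmp):
    """Sort rows by a cheap integer key, calling full_cmp only on key ties.

    Returns (sorted_rows, number_of_full_comparisons).
    """
    full_calls = 0
    tagged = [(key_fn(r), r) for r in rows]   # key kept beside the "pointer"

    def cmp(x, y):
        nonlocal full_calls
        if x[0] != y[0]:
            return -1 if x[0] < y[0] else 1   # integer comparison only
        full_calls += 1                        # expensive path
        return full_cmp(x[1], y[1])

    tagged.sort(key=functools.cmp_to_key(cmp))
    return [r for _, r in tagged], full_calls

def first_char_key(s):
    # Toy order-preserving key: code point of the first character.
    return ord(s[0]) if s else -1

def full_cmp(a, b):
    # Stands in for a full tuple comparison.
    return (a > b) - (a < b)

if __name__ == "__main__":
    rows = ["pear", "apple", "plum", "fig", "apricot"]
    ordered, full = sort_with_keys(rows, first_char_key, full_cmp)
    print(ordered, "full comparisons:", full)
```

The better the key separates the values, the fewer comparisons fall through to the expensive path.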

Have a nice day,
-- 
Martijn van Oosterhout   kleptog@svana.org   http://svana.org/kleptog/
 Patent. n. Genius is 5% inspiration and 95% perspiration. A patent is a
 tool for doing 5% of the work and then sitting around waiting for someone
 else to do the other 95% so you can sue them.



Re: [HACKERS] qsort again (was Re: [PERFORM] Strange Create


At 09:48 AM 2/16/2006, Martijn van Oosterhout wrote:

On Thu, Feb 16, 2006 at 08:22:55AM -0500, Ron wrote:
 3= Especially in modern systems where the gap between internal CPU
 bandwidth and memory bandwidth is so great, the overhead of memory
 accesses for comparisons and moves is the majority of the overhead
 for both the pivot choosing and the partitioning algorithms within
 quicksort.  Particularly random memory accesses.  The reason (#GPRs -
 1) is a magic constant is that it's the most you can compare and move
 using only register-to-register operations.

But how much of this applies to us? We're not sorting arrays of
integers, we're sorting pointers to tuples. So while moves cost very
little, a comparison costs hundreds, maybe thousands of cycles. A tuple
can easily be two or three cachelines and you're probably going to
access all of it, not to mention the Fmgr structures and the Datums
themselves.

Pointers are simply fixed size 32b or 64b quantities.  They are 
essentially integers.  Comparing and moving pointers or fixed size 
keys to those pointers is exactly the same problem as comparing and 
moving integers.


Comparing =or= moving the actual data structures is a much more 
expensive and variable cost proposition.  I'm sure that pg's sort 
functionality is written intelligently enough that the only real data 
moves are done in a final pass after the exact desired order has been 
found using pointer compares and (re)assignments during the sorting 
process.  That's a standard technique for sorting data whose key or 
pointer is much smaller than a datum.


Your cost comment basically agrees with mine regarding the cost of 
random memory accesses.  The good news is that the number of datums 
to be examined during the pivot choosing process is small enough that 
the datums can fit into CPU cache while the pointers to them can be 
assigned to registers: making pivot choosing +very+ fast when done correctly.


As you've noted, actual partitioning is going to be more expensive 
since it involves accessing enough actual datums that they can't all 
fit into CPU cache.  The good news is that QuickSort has a very 
sequential access pattern within its inner loop.  So while we must go 
to memory for compares, we are at least keeping the cost for it down 
to a minimum.  In addition, said access is nice enough to be very 
prefetch and CPU cache hierarchy friendly.




None of this is cache friendly. The actual tuples themselves could be
spread all over memory (I don't think any particular effort is expended
trying to minimize fragmentation).

It probably would be worth it to spend some effort on memory layout 
just as we do for HD layout.




Do these algorithms discuss the case where a comparison is more than
1000 times the cost of a move?

A move is always more expensive than a compare when the datum is 
larger than its pointer or key.  A move is always more expensive than 
a compare when it involves memory to memory movement rather than CPU 
location to CPU location movement.  A move is especially more 
expensive than a compare when it involves both factors.  Most moves 
do involve both.


What I suspect you meant is that a key comparison that involves 
accessing the data in memory is more expensive than reassigning the 
pointers associated with those keys.   That is certainly true.


Yes.  The problem has been extensively studied. ;-)



Where this does become interesting is where we can convert a datum to
an integer such that if f(A) < f(B) then A < B. Then we can sort on
f(X) first with just integer comparisons and then do a full tuple
comparison only if f(A) = f(B). This would be much more cache-coherent
and make these algorithms much more applicable in my mind.

In fact we can do better.
Using hash codes or what-not to map datums to keys and then sorting 
just the keys and the pointers to those datums followed by an 
optional final pass where we do the actual data movement is also a 
standard technique for handling large data structures.



Regardless of what tweaks beyond the basic algorithms we use, the 
algorithms themselves have been well studied and their performance 
well established.  QuickSort is the best performing of the O(nlgn) 
comparison based sorts and it uses less resources than HeapSort or MergeSort.


Ron




Re: [HACKERS] qsort again (was Re: [PERFORM] Strange Create

2006-02-16 Thread Tom Lane

Ron [EMAIL PROTECTED] writes:
 Your cost comment basically agrees with mine regarding the cost of 
 random memory accesses.  The good news is that the number of datums 
 to be examined during the pivot choosing process is small enough that 
 the datums can fit into CPU cache while the pointers to them can be 
 assigned to registers: making pivot choosing +very+ fast when done correctly.

This is more or less irrelevant given that comparing the pointers is not
the operation we need to do.

regards, tom lane


Re: [HACKERS] qsort again (was Re: [PERFORM] Strange Create Index

2006-02-16 Thread Craig A. James

Markus Schaber wrote:

Ron wrote:

...and of course if you know enough about the data to be sorted so as to
constrain it appropriately, one should use a non comparison based O(N)
sorting algorithm rather than any of the general comparison based
O(NlgN) methods.

Sounds interesting, could you give us some pointers (names, URLs,
papers) to such algorithms?

Most of these techniques boil down to good ol' bucket sort. A simple example: suppose
you have a column of integer percentages, range zero to 100. You know there are only 101 distinct
values. So create 101 buckets (e.g. linked lists), make a single pass through your
data and drop each row's ID into the right bucket, then make a second pass through the buckets, and
write the row ID's out in bucket order. This is an O(N) sort technique.

Any time you have a restricted data range, you can do this. Say you have 100
million rows of scientific results known to be good to only three digits -- it
can have at most 1,000 distinct values (regardless of the magnitude of the
values), so you can do this with 1,000 buckets and just two passes through the
data.
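
The two-pass procedure for the percentage column can be sketched as follows. This is an illustrative Python version that uses plain lists as buckets; `bucket_sort_row_ids` is a hypothetical name, and the row IDs are simply positions in the input list.

```python
def bucket_sort_row_ids(values, lo=0, hi=100):
    """Return row IDs ordered by value; values are integers in [lo, hi]."""
    buckets = [[] for _ in range(hi - lo + 1)]   # one bucket per distinct value
    for row_id, v in enumerate(values):          # pass 1: drop IDs into buckets
        buckets[v - lo].append(row_id)
    ordered = []
    for bucket in buckets:                       # pass 2: read buckets in order
        ordered.extend(bucket)
    return ordered

if __name__ == "__main__":
    percentages = [50, 0, 100, 50, 7]
    print(bucket_sort_row_ids(percentages))
```

No two values are ever compared with each other, which is why the O(NlgN) lower bound for comparison sorts does not apply.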

You can also use this trick when the optimizer is asked for fastest first result. Say you have a cursor on a column of numbers with good distribution. If you do a bucket sort on the first two or three digits only, you know the first page of results will be in the first bucket. So you only need to apply qsort to that first bucket (which is very fast, since it's small), and you can deliver the first page of data to the application. This can be particularly effective in interactive situations, where the user typically looks at a few pages of data and then abandons the search.
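
A sketch of the fast-first-page trick, assuming non-negative integers below a known `max_value`: values are bucketed by their leading digits, and only as many buckets as are needed to fill the page get sorted. `first_page` and its parameters are hypothetical, and Python's `list.sort` stands in for qsort on one small bucket.

```python
def first_page(values, page_size, num_buckets=100, max_value=1000):
    """Return the page_size smallest values in order, sorting only the
    buckets needed. Values must be integers in [0, max_value)."""
    width = -(-max_value // num_buckets)       # ceiling division
    buckets = [[] for _ in range(num_buckets)]
    for v in values:                           # single pass over the data
        buckets[v // width].append(v)
    page = []
    for bucket in buckets:
        if len(page) >= page_size:
            break                              # later buckets stay unsorted
        bucket.sort()                          # small sort, one bucket only
        page.extend(bucket)
    return page[:page_size]

if __name__ == "__main__":
    data = [999, 5, 123, 7, 500, 3, 42]
    print(first_page(data, 3))
```

With a good distribution the first bucket alone usually fills the page, so the cost of the full sort is deferred until the user actually asks for more rows.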

I doubt this is very relevant to Postgres. A relational database has to be general
purpose, and it's hard to give it hints that would tell it when to use this
particular optimization.

Craig


Re: [HACKERS] qsort again (was Re: [PERFORM] Strange Create


At 10:52 AM 2/16/2006, Ron wrote:

At 09:48 AM 2/16/2006, Martijn van Oosterhout wrote:


Where this does become interesting is where we can convert a datum to
an integer such that if f(A) < f(B) then A < B. Then we can sort on
f(X) first with just integer comparisons and then do a full tuple
comparison only if f(A) = f(B). This would be much more cache-coherent
and make these algorithms much more applicable in my mind.

In fact we can do better.
Using hash codes or what-not to map datums to keys and then sorting 
just the keys and the pointers to those datums followed by an 
optional final pass where we do the actual data movement is also a 
standard technique for handling large data structures.

I thought some follow up might be in order here.

Let's pretend that we have the typical DB table where rows are ~2-4KB 
apiece.  1TB of storage will let us have 256M-512M rows in such a table.


A 32b hash code can be assigned to each row value such that only 
exactly equal rows will have the same hash code.

A 32b pointer can locate any of the 256M-512M rows.

Now instead of sorting 1TB of data we can sort 2^28 to 2^29 entries of 
32b+32b = 64b each, i.e. 64b*(2^28 to 2^29) = 2-4GB of pointers+keys, 
followed by an optional pass to rearrange the actual rows if we so wish.
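
The row counts and key-array sizes follow directly from the stated assumptions (1TB table, 2-4KB rows, a 32b key plus a 32b pointer per row). A short Python check of that arithmetic, with `key_array_bytes` as a hypothetical helper and binary units assumed throughout:

```python
KB = 2**10
GB = 2**30
TB = 2**40

def key_array_bytes(table_bytes, row_bytes, key_bits=32, pointer_bits=32):
    """Return (row_count, bytes needed for one key+pointer pair per row)."""
    rows = table_bytes // row_bytes
    return rows, rows * (key_bits + pointer_bits) // 8

if __name__ == "__main__":
    for row_size in (4 * KB, 2 * KB):
        rows, size = key_array_bytes(TB, row_size)
        print(f"{row_size // KB}KB rows: {rows} rows, {size / GB:.0f}GB of keys+pointers")
```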


We get the same result while only examining and manipulating 1/512 to 
1/256 as much data during the sort.


If we want to spend more CPU time in order to save more space, we can 
compress the key+pointer representation.  That usually reduces the 
amount of data to be manipulated to ~1/4 the original key+pointer 
representation, reducing things to ~512MB-1GB worth of compressed 
pointers+keys.  Or ~1/2000 - ~1/1000 the original amount of data we 
were discussing.


Either representation is small enough to fit within RAM rather than 
requiring HD IO, so we solve the HD IO bottleneck in the best 
possible way: we avoid ever doing it.


Ron   





Re: [HACKERS] qsort again (was Re: [PERFORM] Strange Create

On Thu, Feb 16, 2006 at 11:32:55AM -0500, Ron wrote:
 At 10:52 AM 2/16/2006, Ron wrote:
 In fact we can do better.
 Using hash codes or what-not to map datums to keys and then sorting 
 just the keys and the pointers to those datums followed by an 
 optional final pass where we do the actual data movement is also a 
 standard technique for handling large data structures.

Or in fact required if the Datums are not all the same size, which is
the case in PostgreSQL.

 I thought some follow up might be in order here.
 
 Let's pretend that we have the typical DB table where rows are ~2-4KB 
 apiece.  1TB of storage will let us have 256M-512M rows in such a table.
 
 A 32b hash code can be assigned to each row value such that only 
 exactly equal rows will have the same hash code.
 A 32b pointer can locate any of the 256M-512M rows.

That hash code is impossible the way you state it, since the set of
strings is not mappable to a 32bit integer. You probably meant that a
hash code can be assigned such that equal rows have equal hashes (drop
the only).

 Now instead of sorting 1TB of data we can sort 2^28 to 2^29 32b+32b= 
 64b*(2^28 to 2^29)=  2-4GB of pointers+keys followed by an optional 
 pass to rearrange the actual rows if we so wish.
 
 We get the same result while only examining and manipulating 1/512 to 
 1/256 as much data during the sort.

But this is what we do now. The tuples are loaded, we sort an array of
pointers, then we write the output. Except we don't have the hash, so
we require access to the 1TB of data to do the actual comparisons. Even
if we did have the hash, we'd *still* need access to the data to handle
tie-breaks.

That's why your comment about moves always being more expensive than
compares makes no sense. A move can be achieved simply by swapping two
pointers in the array. A compare actually needs to call all sorts of
functions. If and only if we have functions for every data type to
produce an ordered hash, we can optimise sorts based on single
integers.

For reference, look at comparetup_heap(). It's just 20 lines, but each
function call there expands to maybe a dozen lines of code. And it has
a loop. I don't think we're anywhere near the stage where locality of
reference makes much difference.

We very rarely need the tuples actualised in memory in the required
order, just the pointers are enough.

Have a nice day,
-- 
Martijn van Oosterhout   kleptog@svana.org   http://svana.org/kleptog/
 Patent. n. Genius is 5% inspiration and 95% perspiration. A patent is a
 tool for doing 5% of the work and then sitting around waiting for someone
 else to do the other 95% so you can sue them.



Re: [HACKERS] qsort again (was Re: [PERFORM] Strange Create Index

2006-02-16 Thread Tom Lane

Craig A. James [EMAIL PROTECTED] writes:
 You can also use this trick when the optimizer is asked for fastest first 
 result.  Say you have a cursor on a column of numbers with good 
 distribution.  If you do a bucket sort on the first two or three digits only, 
 you know the first page of results will be in the first bucket.  So you 
 only need to apply qsort to that first bucket (which is very fast, since it's 
 small), and you can deliver the first page of data to the application.  This 
 can be particularly effective in interactive situations, where the user 
 typically looks at a few pages of data and then abandons the search.  

 I doubt this is very relevant to Postgres.  A relational database has to be 
 general purpose, and it's hard to give it hints that would tell it when to 
 use this particular optimization.

Actually, LIMIT does nicely for that hint; the PG planner has definitely
got a concept of preferring fast-start plans for limited queries.  The
real problem in applying bucket-sort ideas is the lack of any
datatype-independent way of setting up the buckets.

Once or twice we've kicked around the idea of having some
datatype-specific sorting code paths alongside the general-purpose one,
but I can't honestly see this as being workable from a code maintenance
standpoint.

regards, tom lane


Re: [HACKERS] qsort again (was Re: [PERFORM] Strange Create

2006-02-16 Thread Scott Lamb


On Feb 16, 2006, at 8:32 AM, Ron wrote:
Let's pretend that we have the typical DB table where rows are  
~2-4KB apiece.  1TB of storage will let us have 256M-512M rows in  
such a table.


A 32b hash code can be assigned to each row value such that only  
exactly equal rows will have the same hash code.

A 32b pointer can locate any of the 256M-512M rows.

Now instead of sorting 1TB of data we can sort 2^28 to 2^29 32b 
+32b= 64b*(2^28 to 2^29)=  2-4GB of pointers+keys followed by an  
optional pass to rearrange the actual rows if we so wish.


I don't understand this.

This is a true statement: (H(x) != H(y)) => (x != y)
This is not: (H(x) < H(y)) => (x < y)

Hash keys can tell you there's an inequality, but they can't tell you  
how the values compare. If you want 32-bit keys that compare in the  
same order as the original values, here's how you have to get them:


(1) sort the values into an array
(2) use each value's array index as its key

It reduces to the problem you're trying to use it to solve.
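
A small Python sketch of the distinction. It uses `zlib.crc32` purely as an example of an ordinary 32-bit hash (an assumption; the thread names no particular hash function) and builds order-preserving keys by the sort-then-index route described in steps (1) and (2). `order_violations` and `rank_keys` are hypothetical helpers.

```python
import zlib

def order_violations(values, key_fn):
    """Count pairs (a, b) with a < b whose keys compare the other way round."""
    ordered = sorted(values)
    return sum(
        1
        for i, a in enumerate(ordered)
        for b in ordered[i + 1:]
        if a < b and key_fn(a) > key_fn(b)
    )

def rank_keys(values):
    """Order-preserving keys, built by sorting first."""
    ordered = sorted(set(values))                    # step (1): sort the values
    return {v: i for i, v in enumerate(ordered)}     # step (2): index is the key

if __name__ == "__main__":
    words = [b"apple", b"banana", b"cherry", b"date", b"elderberry", b"fig"]
    ranks = rank_keys(words)
    print("crc32 order violations:", order_violations(words, zlib.crc32))
    print("rank order violations: ", order_violations(words, ranks.__getitem__))
```

The rank keys never violate the ordering, but computing them already required the sort that the keys were supposed to speed up.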


--
Scott Lamb http://www.slamb.org/




Re: [HACKERS] qsort again (was Re: [PERFORM] Strange Create Index

2006-02-16 Thread Dann Corbit

 -Original Message-
 From: [EMAIL PROTECTED] [mailto:pgsql-hackers-[EMAIL PROTECTED] On Behalf Of Markus Schaber
 Sent: Thursday, February 16, 2006 5:45 AM
 To: pgsql-performance@postgresql.org; pgsql-hackers@postgresql.org
 Subject: Re: [HACKERS] qsort again (was Re: [PERFORM] Strange Create Index

 Hi, Ron,

 Ron wrote:

  ...and of course if you know enough about the data to be sorted so as to
  constrain it appropriately, one should use a non comparison based O(N)
  sorting algorithm rather than any of the general comparison based
  O(NlgN) methods.

 Sounds interesting, could you give us some pointers (names, URLs,
 papers) to such algorithms?

He refers to counting sort and radix sort (which comes in most
significant digit and least significant digit format).  These are also
called distribution (as opposed to comparison) sorts.

These sorts are O(n) as a function of the data size, but really they are
O(M*n) where M is the average key length and n is the data set size.
(In the case of MSD radix sort M is the average length to completely
differentiate strings)

So if you have an 80 character key, then 80*n will only be smaller
than n*log(n) when log2(n) exceeds 80, i.e. when you have more than
2^80 elements -- in other words -- never.

If you have short keys, on the other hand, distribution sorts will be
dramatically faster.  On a 32-bit unsigned integer, for instance, it
requires 4 passes with 8 bit buckets, so 16 elements (where log2(n) = 4)
is the crossover point beyond which radix sort is faster than an
O(n*log(n)) sort.  Of course, there is a fixed constant of
proportionality, so the real crossover will be higher than that, but for
large data sets distribution sorting is the best thing that there is
for small keys.
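
The four-pass scheme for unsigned integers is a least-significant-digit radix sort. A minimal Python sketch, assuming 32-bit unsigned values and using lists as the 256 buckets; `radix_sort_u32` is a hypothetical name.

```python
def radix_sort_u32(values):
    """LSD radix sort for unsigned 32-bit integers: 4 passes, 256 buckets each."""
    for shift in (0, 8, 16, 24):                  # least significant byte first
        buckets = [[] for _ in range(256)]
        for v in values:
            buckets[(v >> shift) & 0xFF].append(v)
        # Concatenating the buckets in order keeps each pass stable,
        # which is what makes the next pass build on the previous one.
        values = [v for bucket in buckets for v in bucket]
    return values

if __name__ == "__main__":
    print(radix_sort_u32([70000, 3, 4294967295, 256, 0, 255, 3]))
```

Each element is touched exactly four times regardless of n, which is where the 4n versus n*log(n) comparison comes from.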

You could easily have an introspective MSD radix sort.  The nice thing
about MSD radix sort is that you can retain the ordering that has
occurred so far and switch to another algorithm.

An introspective MSD radix sort could call an introspective quick sort
algorithm once it processed a crossover point of buckets of key data.

In order to have distribution sorts that will work with a database
system, for each and every data type you will need a function that
returns the kth bucket of significant bits of a key.  For a
character string, you could return key[bucket].  For an unsigned integer,
which byte of the integer to return will be a function of the
endianness of the CPU.  And for each other distinct data type a bucket
function would be needed, or a sort could not be generated for that type
using the distribution method.
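
A sketch of what such per-type bucket functions might look like, together with an MSD radix sort that hands a bucket to a comparison sort once it is small or deep enough. All names here (`bucket_of_text`, `bucket_of_u32`, `msd_radix_sort`, `cutoff`, `max_depth`) are hypothetical. The integer version extracts bytes with shifts, which sidesteps CPU endianness, and Python's `sorted` stands in for the quicksort fallback.

```python
def bucket_of_text(key: bytes, k: int) -> int:
    """k-th byte of a byte string, or -1 past the end so shorter keys sort first."""
    return key[k] if k < len(key) else -1

def bucket_of_u32(value: int, k: int) -> int:
    """k-th most significant byte (k = 0..3) of an unsigned 32-bit integer."""
    return (value >> (8 * (3 - k))) & 0xFF

def msd_radix_sort(items, bucket_of, depth=0, max_depth=4, cutoff=8):
    """MSD radix sort that switches to a comparison sort for small or deep buckets."""
    if len(items) <= cutoff or depth >= max_depth:
        return sorted(items)                  # fallback comparison sort
    buckets = {}
    for item in items:
        buckets.setdefault(bucket_of(item, depth), []).append(item)
    out = []
    for b in sorted(buckets):                 # visit buckets in key order
        out.extend(msd_radix_sort(buckets[b], bucket_of, depth + 1, max_depth, cutoff))
    return out

if __name__ == "__main__":
    words = [b"pear", b"apple", b"plum", b"fig", b"apricot", b"pea"]
    print(msd_radix_sort(words, bucket_of_text, cutoff=1))
    print(msd_radix_sort([70000, 3, 256, 0, 255], bucket_of_u32, cutoff=1))
```

Because everything in one bucket already shares the prefix processed so far, the fallback sort only has to order elements within that bucket.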


Re: [HACKERS] qsort again (was Re: [PERFORM] Strange Create


At 12:19 PM 2/16/2006, Scott Lamb wrote:

On Feb 16, 2006, at 8:32 AM, Ron wrote:

Let's pretend that we have the typical DB table where rows are
~2-4KB apiece.  1TB of storage will let us have 256M-512M rows in
such a table.

A 32b hash code can be assigned to each row value such that only
exactly equal rows will have the same hash code.
A 32b pointer can locate any of the 256M-512M rows.

Now instead of sorting 1TB of data we can sort 2^28 to 2^29 32b 
+32b= 64b*(2^28 to 2^29)=  2-4GB of pointers+keys followed by an

optional pass to rearrange the actual rows if we so wish.


I don't understand this.

This is a true statement: (H(x) != H(y)) => (x != y)
This is not: (H(x) < H(y)) => (x < y)

Hash keys can tell you there's an inequality, but they can't tell you
how the values compare. If you want 32-bit keys that compare in the
same order as the original values, here's how you have to get them:

For most hash codes, you are correct.  There is a class of hash or 
hash-like codes that maintains the mapping to support that second statement.


More later when I can get more time.
Ron 





Re: [HACKERS] qsort again (was Re: [PERFORM] Strange Create Index

2006-02-16 Thread Neil Conway

On Thu, 2006-02-16 at 12:35 +0100, Steinar H. Gunderson wrote:
 glibc-2.3.5/stdlib/qsort.c:
 
   /* Order size using quicksort.  This implementation incorporates
  four optimizations discussed in Sedgewick:
 
 I can't see any references to merge sort in there at all.

stdlib/qsort.c defines _quicksort(), not qsort(), which is defined by
msort.c. On looking closer, it seems glibc actually tries to determine
the physical memory in the machine -- if it is sorting a single array
that exceeds 1/4 of the machine's physical memory, it uses quick sort,
otherwise it uses merge sort.

-Neil




Re: [HACKERS] qsort again (was Re: [PERFORM] Strange Create Index

2006-02-16 Thread Mark Lewis

On Thu, 2006-02-16 at 12:15 -0500, Tom Lane wrote:
 Once or twice we've kicked around the idea of having some
 datatype-specific sorting code paths alongside the general-purpose one,
 but I can't honestly see this as being workable from a code maintenance
 standpoint.
 
   regards, tom lane


It seems that instead of maintaining a different sorting code path for
each data type, you could get away with one generic path and one
(hopefully faster) path if you allowed data types to optionally support
a 'sortKey' interface by providing a function f which maps inputs to 32-
bit int outputs, such that the following two properties hold:

f(a)<=f(b) iff a<=b
if a==b then f(a)==f(b)

So if a data type supports the sortKey interface you could perform the
sort on f(value) and only refer back to the actual element comparison
functions when two sortKeys have the same value.

Data types which could probably provide a useful function for f would be
int2, int4, oid, and possibly int8 and text (at least for SQL_ASCII).

Depending on the overhead, you might not even need to maintain 2
independent search code paths, since you could always use f(x)=0 as the
default sortKey function which would degenerate to the exact same sort
behavior in use today.
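
A minimal sketch of such an interface in Python, assuming ASCII text compared byte by byte. `text_sort_key` packs the first four bytes of a string into a 32-bit integer, which satisfies only the weaker property that a < b implies f(a) <= f(b); `default_sort_key` is the f(x)=0 fallback. All three function names are hypothetical.

```python
def text_sort_key(s: str) -> int:
    """Pack the first 4 bytes of an ASCII string into a 32-bit int, zero padded."""
    prefix = s.encode("ascii")[:4].ljust(4, b"\0")
    return int.from_bytes(prefix, "big")

def default_sort_key(_value) -> int:
    """Fallback for types without a sortKey: every key ties."""
    return 0

def sort_with_sort_key(values, sort_key=default_sort_key):
    # Tuples compare on the integer key first and fall back to comparing
    # the full value only when two keys are equal.
    return [v for _, v in sorted((sort_key(v), v) for v in values)]

if __name__ == "__main__":
    names = ["abcdY", "b", "abcdX", "ab", "abc"]
    print(sort_with_sort_key(names, text_sort_key))
    print(sort_with_sort_key(names))          # default key: plain full comparisons
```

With the default key every comparison falls through to the full value, so the result matches an ordinary sort.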

-- Mark Lewis


Re: [HACKERS] qsort again (was Re: [PERFORM] Strange Create Index

2006-02-16 Thread Markus Schaber

Hi, Mark,

Mark Lewis schrieb:

 It seems that instead of maintaining a different sorting code path for
 each data type, you could get away with one generic path and one
 (hopefully faster) path if you allowed data types to optionally support
 a 'sortKey' interface by providing a function f which maps inputs to 32-
 bit int outputs, such that the following two properties hold:
 
 f(a)<=f(b) iff a<=b
 if a==b then f(a)==f(b)

Hmm, to remove redundancy, I'd change the <= to a < and define:

if a==b then f(a)==f(b)
if a<b  then f(a)<=f(b)

 Data types which could probably provide a useful function for f would be
 int2, int4, oid, and possibly int8 and text (at least for SQL_ASCII).

With int2 or some restricted ranges of oid and int4, we could even
implement a bucket sort.
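
A minimal sketch of such a bucket (counting) sort for bare int2 values, assuming one bucket per possible 16-bit value. The function name is hypothetical, and sorting whole tuples rather than bare keys would need an extra distribution pass that is not shown:

```c
#include <stdint.h>
#include <stdlib.h>

/*
 * Sort n int16_t values in place with one counter per possible value.
 * Returns 0 on success, -1 if the counter array cannot be allocated.
 */
int counting_sort_int16(int16_t *a, size_t n)
{
    size_t *counts = calloc(65536, sizeof(size_t));
    size_t  out = 0;

    if (counts == NULL)
        return -1;

    /* Pass 1: count occurrences, shifting the range [-32768, 32767] to [0, 65535]. */
    for (size_t i = 0; i < n; i++)
        counts[(int32_t) a[i] + 32768]++;

    /* Pass 2: write the values back in ascending order. */
    for (int32_t v = 0; v < 65536; v++)
        for (size_t c = counts[v]; c > 0; c--)
            a[out++] = (int16_t) (v - 32768);

    free(counts);
    return 0;
}
```

No element comparisons are made at all; the cost is two linear passes plus a walk over the 65536 counters.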

Markus


Re: [HACKERS] qsort again (was Re: [PERFORM] Strange Create Index

On Thu, Feb 16, 2006 at 02:17:36PM -0800, Mark Lewis wrote:
 It seems that instead of maintaining a different sorting code path for
 each data type, you could get away with one generic path and one
 (hopefully faster) path if you allowed data types to optionally support
 a 'sortKey' interface by providing a function f which maps inputs to 32-
 bit int outputs, such that the following two properties hold:
 
 f(a)<=f(b) iff a<=b
 if a==b then f(a)==f(b)

Note this is a property of the collation, not the type. For example
strings can be sorted in many ways and the sortKey must reflect that.
So in postgres terms it's a property of the btree operator class.

It's something I'd like to do if I get A Round Tuit. :)

Have a nice day,
-- 
Martijn van Oosterhout   kleptog@svana.org   http://svana.org/kleptog/
 Patent. n. Genius is 5% inspiration and 95% perspiration. A patent is a
 tool for doing 5% of the work and then sitting around waiting for someone
 else to do the other 95% so you can sue them.



Re: [HACKERS] qsort again (was Re: [PERFORM] Strange Create Index

2006-02-16 Thread Greg Stark

Markus Schaber [EMAIL PROTECTED] writes:

 Hmm, to remove redundancy, I'd change the <= to a < and define:
 
 if a==b then f(a)==f(b)
 if a<b  then f(a)<=f(b)
 
  Data types which could probably provide a useful function for f would be
  int2, int4, oid, and possibly int8 and text (at least for SQL_ASCII).

How exactly do you imagine doing this for text?

I could see doing it for char(n)/varchar(n) where n<=4 in SQL_ASCII though.

-- 
greg



Re: [HACKERS] qsort again (was Re: [PERFORM] Strange Create Index

2006-02-16 Thread Mark Lewis

On Thu, 2006-02-16 at 17:51 -0500, Greg Stark wrote:
   Data types which could probably provide a useful function for f would be
   int2, int4, oid, and possibly int8 and text (at least for SQL_ASCII).
 
 How exactly do you imagine doing this for text?
 
 I could see doing it for char(n)/varchar(n) where n<=4 in SQL_ASCII though.


In SQL_ASCII, just take the first 4 characters (or 8, if using a 64-bit
sortKey as elsewhere suggested).  The sorting key doesn't need to be a
one-to-one mapping.
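
A minimal sketch of such a prefix key, assuming plain bytewise ordering (as with SQL_ASCII) and zero padding for strings shorter than four bytes. The function name is hypothetical:

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Pack the first four bytes of s into a 32-bit key, most significant byte
 * first, so that comparing keys as unsigned integers agrees with comparing
 * the prefixes byte by byte.  Shorter strings are padded with zero bytes.
 */
uint32_t text_prefix_sortkey(const char *s, size_t len)
{
    uint32_t key = 0;

    for (size_t i = 0; i < 4; i++) {
        key <<= 8;
        if (i < len)
            key |= (unsigned char) s[i];
    }
    return key;
}
```

Strings that share their first four bytes get equal keys, so ties still have to be resolved by the full comparison function.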

-- Mark Lewis


Re: [HACKERS] qsort again (was Re: [PERFORM] Strange Create Index

2006-02-16 Thread David Lang


On Thu, 16 Feb 2006, Mark Lewis wrote:


On Thu, 2006-02-16 at 17:51 -0500, Greg Stark wrote:

Data types which could probably provide a useful function for f would be
int2, int4, oid, and possibly int8 and text (at least for SQL_ASCII).


How exactly do you imagine doing this for text?

I could see doing it for char(n)/varchar(n) where n<=4 in SQL_ASCII though.



In SQL_ASCII, just take the first 4 characters (or 8, if using a 64-bit
sortKey as elsewhere suggested).  The sorting key doesn't need to be a
one-to-one mapping.


That would violate your second constraint ( f(a)==f(b) iff (a==b) ).

If you could drop that constraint (the cost of which would be extra 'real' 
compares within a bucket), then a helper function per datatype could work 
as you describe.


David Lang


Re: [HACKERS] qsort again (was Re: [PERFORM] Strange Create


At 01:47 PM 2/16/2006, Ron wrote:

At 12:19 PM 2/16/2006, Scott Lamb wrote:

On Feb 16, 2006, at 8:32 AM, Ron wrote:

Let's pretend that we have the typical DB table where rows are
~2-4KB apiece.  1TB of storage will let us have 256M-512M rows in
such a table.

A 32b hash code can be assigned to each row value such that only
exactly equal rows will have the same hash code.
A 32b pointer can locate any of the 256M-512M rows.

Now instead of sorting 1TB of data we can sort 2^28 to 2^29 32b+32b = 64b
entries, i.e. 64b*(2^28 to 2^29) = 2-4GB of pointers+keys, followed by an
optional pass to rearrange the actual rows if we so wish.


I don't understand this.

This is a true statement: (H(x) != H(y)) => (x != y)
This is not: (H(x) < H(y)) => (x < y)

Hash keys can tell you there's an inequality, but they can't tell you
how the values compare. If you want 32-bit keys that compare in the
same order as the original values, here's how you have to get them:

For most hash codes, you are correct.  There is a class of hash or 
hash-like codes that maintains the mapping to support that second statement.


More later when I can get more time.
Ron


OK, so here's _a_ way (there are others) to obtain a mapping such that
 if a < b then f(a) < f(b) and
 if a == b then f(a) == f(b)

Pretend each row is an integer of row size (so a 2KB row becomes a 
16Kb integer; a 4KB row becomes a 32Kb integer; etc)
Since even a 1TB table made of such rows can only have 256M - 512M 
possible values even if each row is unique, a 28b or 29b key is large 
enough to represent each row's value and relative rank compared to 
all of the others even if all row values are unique.


By scanning the table once, we can map say 001h (Hex used to ease 
typing) to the row with the minimum value and 111h to the row 
with the maximum value as well as mapping everything in between to 
their appropriate keys.  That same scan can be used to assign a 
pointer to each record's location.


We can now sort the key+pointer pairs instead of the actual data and 
use an optional final pass to rearrange the actual rows if we wish.
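
A minimal sketch of sorting key+pointer pairs instead of rows. `KeyPtr` and the function names are hypothetical, the 32-bit row index stands in for the record pointer, and how the order-preserving keys are produced is outside this sketch:

```c
#include <stdint.h>
#include <stdlib.h>

/* One 64-bit entry per row: a 32-bit order-preserving key plus a 32-bit row locator. */
typedef struct {
    uint32_t key;  /* smaller key means the row sorts earlier */
    uint32_t row;  /* position of the row in the table */
} KeyPtr;

/* Order by key; break ties on the row locator so the result is deterministic. */
int keyptr_compare(const void *pa, const void *pb)
{
    const KeyPtr *a = pa;
    const KeyPtr *b = pb;

    if (a->key != b->key)
        return (a->key < b->key) ? -1 : 1;
    return (a->row > b->row) - (a->row < b->row);
}

/* Sort the small pairs; the rows themselves are never moved here. */
void sort_keyptrs(KeyPtr *pairs, size_t n)
{
    qsort(pairs, n, sizeof(KeyPtr), keyptr_compare);
}
```

Each entry is 8 bytes on common platforms, so 2^28 entries occupy 2GB, which matches the estimate quoted earlier in this message.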


That initial scan to set up the keys is expensive, but if we wish 
that cost can be amortized over the life of the table so we don't 
have to pay it all at once.  In addition, once we have created those 
keys, they can be saved for later searches and sorts.


Further space savings can be obtained whenever there are duplicate 
keys and/or when compression methods are used on the Key+pointer pairs.


Ron








[HACKERS] qsort again (was Re: [PERFORM] Strange Create Index behaviour)

Gary Doades [EMAIL PROTECTED] writes:
 If I run the script again, it is not always the first case that is slow, 
 it varies from run to run, which is why I repeated it quite a few times 
 for the test.

For some reason I hadn't immediately twigged to the fact that your test
script is just N repetitions of the exact same structure with random data.
So it's not so surprising that you get random variations in behavior
with different test data sets.

I did some experimentation comparing the qsort from Fedora Core 4
(glibc-2.3.5-10.3) with our src/port/qsort.c.  For those who weren't
following the pgsql-performance thread, the test case is just this
repeated a lot of times:

create table atest(i int4, r int4);
insert into atest (i,r) select generate_series(1,10), 0;
insert into atest (i,r) select generate_series(1,10), random()*10;
\timing
create index idx on atest(r);
\timing
drop table atest;

I did this 100 times and sorted the reported runtimes.  (Investigation
with trace_sort = on confirms that the runtime is almost entirely spent
in qsort() called from our performsort --- the Postgres overhead is
about 100msec on this machine.)  Results are below.

It seems clear that our qsort.c is doing a pretty awful job of picking
qsort pivots, while glibc is mostly managing not to make that mistake.
I haven't looked at the glibc code yet to see what they are doing
differently.

I'd say this puts a considerable damper on my enthusiasm for using our
qsort all the time, as was recently debated in this thread:
http://archives.postgresql.org/pgsql-hackers/2005-12/msg00610.php
We need to fix our qsort.c before pushing ahead with that idea.

regards, tom lane


100 runtimes for glibc qsort, sorted ascending:

Time: 459.860 ms
Time: 460.209 ms
Time: 460.704 ms
Time: 461.317 ms
Time: 461.538 ms
Time: 461.652 ms
Time: 461.988 ms
Time: 462.573 ms
Time: 462.638 ms
Time: 462.716 ms
Time: 462.917 ms
Time: 463.219 ms
Time: 463.455 ms
Time: 463.650 ms
Time: 463.723 ms
Time: 463.737 ms
Time: 463.750 ms
Time: 463.852 ms
Time: 463.964 ms
Time: 463.988 ms
Time: 464.003 ms
Time: 464.135 ms
Time: 464.372 ms
Time: 464.458 ms
Time: 464.496 ms
Time: 464.551 ms
Time: 464.599 ms
Time: 464.655 ms
Time: 464.656 ms
Time: 464.722 ms
Time: 464.814 ms
Time: 464.827 ms
Time: 464.878 ms
Time: 464.899 ms
Time: 464.905 ms
Time: 464.987 ms
Time: 465.055 ms
Time: 465.138 ms
Time: 465.159 ms
Time: 465.194 ms
Time: 465.310 ms
Time: 465.316 ms
Time: 465.375 ms
Time: 465.450 ms
Time: 465.535 ms
Time: 465.595 ms
Time: 465.680 ms
Time: 465.769 ms
Time: 465.865 ms
Time: 465.892 ms
Time: 465.903 ms
Time: 466.003 ms
Time: 466.154 ms
Time: 466.164 ms
Time: 466.203 ms
Time: 466.305 ms
Time: 466.344 ms
Time: 466.364 ms
Time: 466.388 ms
Time: 466.502 ms
Time: 466.593 ms
Time: 466.725 ms
Time: 466.794 ms
Time: 466.798 ms
Time: 466.904 ms
Time: 466.971 ms
Time: 466.997 ms
Time: 467.122 ms
Time: 467.146 ms
Time: 467.221 ms
Time: 467.224 ms
Time: 467.244 ms
Time: 467.277 ms
Time: 467.587 ms
Time: 468.142 ms
Time: 468.207 ms
Time: 468.237 ms
Time: 468.471 ms
Time: 468.663 ms
Time: 468.700 ms
Time: 469.235 ms
Time: 469.840 ms
Time: 470.472 ms
Time: 471.140 ms
Time: 472.811 ms
Time: 472.959 ms
Time: 474.858 ms
Time: 477.210 ms
Time: 479.571 ms
Time: 479.671 ms
Time: 482.797 ms
Time: 488.852 ms
Time: 514.639 ms
Time: 529.287 ms
Time: 612.185 ms
Time: 660.748 ms
Time: 742.227 ms
Time: 866.814 ms
Time: 1234.848 ms
Time: 1267.398 ms


100 runtimes for port/qsort.c, sorted ascending:

Time: 418.905 ms
Time: 420.611 ms
Time: 420.764 ms
Time: 420.904 ms
Time: 421.706 ms
Time: 422.466 ms
Time: 422.627 ms
Time: 423.189 ms
Time: 423.302 ms
Time: 425.096 ms
Time: 425.731 ms
Time: 425.851 ms
Time: 427.253 ms
Time: 430.113 ms
Time: 432.756 ms
Time: 432.963 ms
Time: 440.502 ms
Time: 440.640 ms
Time: 450.452 ms
Time: 458.143 ms
Time: 459.212 ms
Time: 467.706 ms
Time: 468.006 ms
Time: 468.574 ms
Time: 470.003 ms
Time: 472.313 ms
Time: 483.622 ms
Time: 492.395 ms
Time: 509.564 ms
Time: 531.037 ms
Time: 533.366 ms
Time: 535.610 ms
Time: 575.523 ms
Time: 582.688 ms
Time: 593.545 ms
Time: 647.364 ms
Time: 660.612 ms
Time: 677.312 ms
Time: 680.288 ms
Time: 697.626 ms
Time: 833.066 ms
Time: 834.511 ms
Time: 851.819 ms
Time: 920.443 ms
Time: 926.731 ms
Time: 954.289 ms
Time: 1045.214 ms
Time: 1059.200 ms
Time: 1062.328 ms
Time: 1136.018 ms
Time: 1260.091 ms
Time: 1276.883 ms
Time: 1319.351 ms
Time: 1438.854 ms
Time: 1475.457 ms
Time: 1538.211 ms
Time: 1549.004 ms
Time: 1744.642 ms
Time: 1771.258 ms
Time: 1959.530 ms
Time: 2300.140 ms
Time: 2589.641 ms
Time: 2612.780 ms
Time: 3100.024 ms
Time: 3284.125 ms
Time: 3379.792 ms
Time: 3750.278 ms
Time: 4302.278 ms
Time: 4780.624 ms
Time: 5000.056 ms
Time: 5092.604 ms
Time: 5168.722 ms
Time: 5292.941 ms
Time: 5895.964 ms
Time: 7003.164 ms
Time: 7099.449 ms
Time: 7115.083 ms
Time: 7384.940 ms
Time: 8214.010 ms
Time: 8700.771 ms
Time: 9331.225 ms
Time: 10503.360 ms
Time: 12496.026 ms
Time: 12982.474 ms
Time: 15192.390 ms

Re: [HACKERS] qsort again (was Re: [PERFORM] Strange Create Index behaviour)

2006-02-15 Thread Gary Doades


Tom Lane wrote:

For some reason I hadn't immediately twigged to the fact that your test
script is just N repetitions of the exact same structure with random data.
So it's not so surprising that you get random variations in behavior
with different test data sets.


It seems clear that our qsort.c is doing a pretty awful job of picking
qsort pivots, while glibc is mostly managing not to make that mistake.
I haven't looked at the glibc code yet to see what they are doing
differently.

I'd say this puts a considerable damper on my enthusiasm for using our
qsort all the time, as was recently debated in this thread:
http://archives.postgresql.org/pgsql-hackers/2005-12/msg00610.php
We need to fix our qsort.c before pushing ahead with that idea.


[snip]


Time: 28314.182 ms
Time: 29400.278 ms
Time: 34142.534 ms


Ouch! That confirms my problem. I generated the random test case because 
it was easier than including the dump of my tables, but you can 
appreciate that tables 20 times the size are basically crippled when it 
comes to creating an index on them.


Examining the dump and the associated times during restore it looks like 
I have 7 tables with this approximate distribution, thus the 
ridiculously long restore time. Better not re-index soon!


Is this likely to hit me in a random fashion during normal operation, 
joins, sorts, order by for example?


So the options are:
1) Fix the included qsort.c code and use that
2) Get FreeBSD to fix their qsort code
3) Both

I guess that 1 is the real solution in case anyone else's qsort is 
broken in the same way. Then at least you *could* use it all the time :)


Regards,
Gary.





Re: [HACKERS] qsort again (was Re: [PERFORM] Strange Create Index behaviour)

Gary Doades [EMAIL PROTECTED] writes:
 Is this likely to hit me in a random fashion during normal operation, 
 joins, sorts, order by for example?

Yup, anytime you're passing data with that kind of distribution
through a sort.

 So the options are:
 1) Fix the included qsort.c code and use that
 2) Get FreeBSD to fix their qsort code
 3) Both

 I guess that 1 is the real solution in case anyone else's qsort is 
 broken in the same way. Then at least you *could* use it all the time :)

It's reasonable to assume that most of the *BSDen have basically the
same qsort code.  Ours claims to have come from NetBSD sources, but
I don't doubt that they all trace back to a common ancestor.

regards, tom lane


Re: [HACKERS] qsort again (was Re: [PERFORM] Strange Create Index behaviour)

Gary Doades [EMAIL PROTECTED] writes:
 Ouch! That confirms my problem. I generated the random test case because 
 it was easier than including the dump of my tables, but you can 
 appreciate that tables 20 times the size are basically crippled when it 
 comes to creating an index on them.

Actually... we only use qsort when we have a sorting problem that fits
within the allowed sort memory.  The external-sort logic doesn't go
through that code at all.  So all the analysis we just did on your test
case doesn't necessarily apply to sort problems that are too large for
the sort_mem setting.

The test case would be sorting 20 index entries, which'd probably
occupy at least 24 bytes apiece of sort memory, so probably about 5 meg.
A problem 20 times that size would definitely not fit in the default
16MB maintenance_work_mem.  Were you using a large value of
maintenance_work_mem for your restore?

regards, tom lane


Re: [HACKERS] qsort again (was Re: [PERFORM] Strange Create Index

2006-02-15 Thread Ron

This behavior is consistent with the pivot choosing algorithm 
assuming certain distribution(s) for the data.  For instance, 
median-of-three partitioning is known to be pessimal when the data is 
geometrically or hyper-geometrically distributed.  Also, care must be 
taken, and sometimes is not, when there are many equal values in the 
data.  Even pseudo random number generator based pivot choosing 
algorithms are not immune if the PRNG is flawed in some way.


How are we choosing our pivots?


At 06:28 PM 2/15/2006, Tom Lane wrote:


I did some experimentation comparing the qsort from Fedora Core 4
(glibc-2.3.5-10.3) with our src/port/qsort.c.  For those who weren't
following the pgsql-performance thread, the test case is just this
repeated a lot of times:

create table atest(i int4, r int4);
insert into atest (i,r) select generate_series(1,10), 0;
insert into atest (i,r) select generate_series(1,10), random()*10;
\timing
create index idx on atest(r);
\timing
drop table atest;

I did this 100 times and sorted the reported runtimes.  (Investigation
with trace_sort = on confirms that the runtime is almost entirely spent
in qsort() called from our performsort --- the Postgres overhead is
about 100msec on this machine.)  Results are below.

It seems clear that our qsort.c is doing a pretty awful job of picking
qsort pivots, while glibc is mostly managing not to make that mistake.
I haven't looked at the glibc code yet to see what they are doing
differently.

I'd say this puts a considerable damper on my enthusiasm for using our
qsort all the time, as was recently debated in this thread:
http://archives.postgresql.org/pgsql-hackers/2005-12/msg00610.php
We need to fix our qsort.c before pushing ahead with that idea.

regards, tom lane


100 runtimes for glibc qsort, sorted ascending:

Time: 459.860 ms
<snip>
Time: 488.852 ms
Time: 514.639 ms
Time: 529.287 ms
Time: 612.185 ms
Time: 660.748 ms
Time: 742.227 ms
Time: 866.814 ms
Time: 1234.848 ms
Time: 1267.398 ms


100 runtimes for port/qsort.c, sorted ascending:

Time: 418.905 ms
<snip>
Time: 20865.979 ms
Time: 21000.907 ms
Time: 21297.585 ms
Time: 21714.518 ms
Time: 25423.235 ms
Time: 27543.052 ms
Time: 28314.182 ms
Time: 29400.278 ms
Time: 34142.534 ms





Re: [HACKERS] qsort again (was Re: [PERFORM] Strange Create Index behaviour)

I wrote:
 Gary Doades [EMAIL PROTECTED] writes:
 Ouch! That confirms my problem. I generated the random test case because 
 it was easier than including the dump of my tables, but you can 
 appreciate that tables 20 times the size are basically crippled when it 
 comes to creating an index on them.

 Actually... we only use qsort when we have a sorting problem that fits
 within the allowed sort memory.  The external-sort logic doesn't go
 through that code at all.  So all the analysis we just did on your test
 case doesn't necessarily apply to sort problems that are too large for
 the sort_mem setting.

I increased the size of the test case by 10x (basically s/10/100/)
which is enough to push it into the external-sort regime.  I get
amazingly stable runtimes now --- I didn't have the patience to run 100
trials, but in 30 trials I have slowest 11538 msec and fastest 11144 msec.
So this code path is definitely not very sensitive to this data
distribution.

While these numbers aren't glittering in comparison to the best-case
qsort times (~450 msec to sort 10% as much data), they are sure a lot
better than the worst-case times.  So maybe a workaround for you is
to decrease maintenance_work_mem, counterintuitive though that be.
(Now, if you *weren't* using maintenance_work_mem of 100MB or more
for your problem restore, then I'm not sure I know what's going on...)

We still ought to try to fix qsort of course.

regards, tom lane


Re: [HACKERS] qsort again (was Re: [PERFORM] Strange Create Index behaviour)

Ron [EMAIL PROTECTED] writes:
 How are we choosing our pivots?

See qsort.c: it looks like median of nine equally spaced inputs (ie,
the 1/8th points of the initial input array, plus the end points),
implemented as two rounds of median-of-three choices.  With half of the
data inputs zero, it's not too improbable for two out of the three
samples to be zeroes in which case I think the med3 result will be zero
--- so choosing a pivot of zero is much more probable than one would
like, and doing so in many levels of recursion causes the problem.

I think.  I'm not too sure if the code isn't just being sloppy about the
case where many data values are equal to the pivot --- there's a special
case there to switch to insertion sort, and maybe that's getting invoked
too soon.  It'd be useful to get a line-level profile of the behavior of
this code in the slow cases...
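
A sketch of the median-of-nine pivot selection described above, written for a plain int array rather than the generic element type qsort.c handles. The names `med3` and `ninther` and the exact sample positions are assumptions for illustration:

```c
#include <stddef.h>

/* Return the index (i, j or k) that holds the median of a[i], a[j], a[k]. */
size_t med3(const int *a, size_t i, size_t j, size_t k)
{
    if (a[i] < a[j])
        return a[j] < a[k] ? j : (a[i] < a[k] ? k : i);
    else
        return a[j] > a[k] ? j : (a[i] < a[k] ? i : k);
}

/*
 * Median of nine roughly equally spaced samples, taken as the median of
 * three medians-of-three.  Requires n >= 9.  Returns the pivot's index.
 */
size_t ninther(const int *a, size_t n)
{
    size_t d = n / 8;
    size_t lo = 0;
    size_t mid = n / 2;
    size_t hi = n - 1;

    lo = med3(a, lo, lo + d, lo + 2 * d);
    mid = med3(a, mid - d, mid, mid + d);
    hi = med3(a, hi - 2 * d, hi - d, hi);
    return med3(a, lo, mid, hi);
}
```

If two of the three first-round medians are zero, the final median is zero as well, which is the scenario suspected above for input that is half zeroes.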

regards, tom lane


Re: [HACKERS] qsort again (was Re: [PERFORM] Strange Create Index behaviour)

2006-02-15 Thread Dann Corbit

 -----Original Message-----
 From: [EMAIL PROTECTED] [mailto:pgsql-hackers-[EMAIL PROTECTED] On Behalf Of Tom Lane
 Sent: Wednesday, February 15, 2006 5:22 PM
 To: Ron
 Cc: pgsql-performance@postgresql.org; pgsql-hackers@postgresql.org
 Subject: Re: [HACKERS] qsort again (was Re: [PERFORM] Strange Create Index behaviour)

 Ron [EMAIL PROTECTED] writes:
  How are we choosing our pivots?

 See qsort.c: it looks like median of nine equally spaced inputs (ie,
 the 1/8th points of the initial input array, plus the end points),
 implemented as two rounds of median-of-three choices.  With half of the
 data inputs zero, it's not too improbable for two out of the three
 samples to be zeroes in which case I think the med3 result will be zero
 --- so choosing a pivot of zero is much more probable than one would
 like, and doing so in many levels of recursion causes the problem.

Adding some randomness to the selection of the pivot is a known
technique to fix the oddball partitions problem.  However, Bentley and
Sedgewick proved that every quick sort algorithm has some input set that
makes it go quadratic (hence the recent popularity of introspective
sort, which switches to heapsort if quadratic behavior is detected.  The
C++ template I submitted was an example of introspective sort, but
PostgreSQL does not use C++ so it was not helpful).
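
For illustration, a minimal C sketch of the introspective-sort idea on int arrays. It is not the C++ template mentioned above; the function names, the depth limit of 2*floor(log2(n)) and the middle-element pivot are assumptions, and the usual insertion-sort cutoff for small partitions is left out:

```c
#include <stddef.h>

static void swap_int(int *x, int *y) { int t = *x; *x = *y; *y = t; }

/* Restore the max-heap property below `root` within a[0..n). */
static void sift_down(int *a, size_t root, size_t n)
{
    for (;;) {
        size_t child = 2 * root + 1;

        if (child >= n)
            return;
        if (child + 1 < n && a[child] < a[child + 1])
            child++;
        if (a[root] >= a[child])
            return;
        swap_int(&a[root], &a[child]);
        root = child;
    }
}

/* Plain heapsort: the O(n log n) fallback. */
void heap_sort(int *a, size_t n)
{
    if (n < 2)
        return;
    for (size_t i = n / 2; i-- > 0; )
        sift_down(a, i, n);
    for (size_t end = n - 1; end > 0; end--) {
        swap_int(&a[0], &a[end]);
        sift_down(a, 0, end);
    }
}

/* Quicksort that gives up and calls heap_sort once the depth budget is spent. */
static void intro_rec(int *a, size_t n, int depth)
{
    while (n > 1) {
        if (depth-- == 0) {
            heap_sort(a, n);
            return;
        }

        /* Hoare partition around the value of the middle element. */
        int       pivot = a[(n - 1) / 2];
        ptrdiff_t i = -1;
        ptrdiff_t j = (ptrdiff_t) n;

        for (;;) {
            do i++; while (a[i] < pivot);
            do j--; while (a[j] > pivot);
            if (i >= j)
                break;
            swap_int(&a[i], &a[j]);
        }

        /* a[0..j] <= pivot <= a[j+1..n-1]; recurse left, loop on the right. */
        size_t left = (size_t) j + 1;
        intro_rec(a, left, depth);
        a += left;
        n -= left;
    }
}

void intro_sort(int *a, size_t n)
{
    int log2n = 0;

    for (size_t m = n; m > 1; m >>= 1)
        log2n++;
    intro_rec(a, n, 2 * log2n);
}
```

The depth budget bounds how long a run of bad partitions can continue before the heapsort fallback takes over, so the worst case stays O(n log n).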

 I think.  I'm not too sure if the code isn't just being sloppy about the
 case where many data values are equal to the pivot --- there's a special
 case there to switch to insertion sort, and maybe that's getting invoked
 too soon.

Here are some cases known to make qsort go quadratic:
1. Data already sorted
2. Data reverse sorted
3. Data organ-pipe sorted or ramp
4. Almost all data of the same value

There are probably other cases.  Randomizing the pivot helps some, as
does checking for in-order or reverse-order partitions.
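
A minimal sketch of such an order check on an int partition, assuming one linear scan per direction before any partitioning is attempted. The function name and return convention are hypothetical:

```c
#include <stddef.h>

/*
 * Classify a partition before sorting it:
 *   1  -> already in non-decreasing order (nothing to do)
 *  -1  -> in non-increasing order (reversing it finishes the job)
 *   0  -> neither, so fall through to the normal partitioning step
 */
int run_order(const int *a, size_t n)
{
    size_t i;

    if (n < 2)
        return 1;

    for (i = 1; i < n && a[i - 1] <= a[i]; i++)
        ;
    if (i == n)
        return 1;

    for (i = 1; i < n && a[i - 1] >= a[i]; i++)
        ;
    return (i == n) ? -1 : 0;
}
```

A partition in which every value is equal is reported as in order, so that part of case 4 above is handled without any partitioning at all.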

Imagine if 1/3 of the partitions fall into a category that causes
quadratic behavior (have one of the above formats and have more than
CUTOFF elements in them).

It is doubtful that the switch to insertion sort is causing any sort of
problems.  It is only going to be invoked on tiny sets, for which it has
a fixed cost that is probably less than qsort() function calls on sets
of the same size.

 It'd be useful to get a line-level profile of the behavior of
 this code in the slow cases...

I guess that my in-order or presorted tests [which often arise when
there are very few distinct values] may solve the bad partition
problems.  Don't forget that the algorithm is called recursively.

   regards, tom lane


Re: [HACKERS] qsort again (was Re: [PERFORM] Strange Create Index behaviour)

2006-02-15 Thread Christopher Kings-Lynne

Ouch! That confirms my problem. I generated the random test case because 
it was easier than including the dump of my tables, but you can 
appreciate that tables 20 times the size are basically crippled when it 
comes to creating an index on them.



I have to say that I restored a few gigabyte dump on freebsd the other 
day, and most of the restore time was in index creation - I didn't think 
too much of it though at the time.  FreeBSD 4.x.


Chris



Re: [HACKERS] qsort again (was Re: [PERFORM] Strange Create Index

2006-02-15 Thread Simon Riggs

On Wed, 2006-02-15 at 19:59 -0500, Tom Lane wrote:

  I get
 amazingly stable runtimes now --- I didn't have the patience to run 100
 trials, but in 30 trials I have slowest 11538 msec and fastest 11144 msec.
 So this code path is definitely not very sensitive to this data
 distribution.

"The worst-case behavior of replacement-selection is very close to its
average behavior, while the worst-case behavior of QuickSort is terrible
(N^2) – a strong argument in favor of replacement-selection. Despite this
risk, QuickSort is widely used because, in practice, it has superior
performance." p.8, "AlphaSort: A Cache-Sensitive Parallel External
Sort", Nyberg et al, VLDB Journal 4(4): 603-627 (1995)

I think your other comment about flipping to insertion sort too early
(and not returning...) is a plausible cause for the poor pg qsort
behaviour, but the overall spread of values seems as expected.

Some test results I've seen seem consistent with the view that
increasing memory also increases run-time for larger settings of
work_mem/maintenance_work_mem. Certainly, as I observed a while back,
having large memory settings doesn't help you at all when you are
doing final run merging on the external sort. Whatever we do, we should
look at the value high memory settings bring to each phase of a sort
separately from the other phases.

There is work underway on improving external sorts, so I hear (not me).
Plus my WIP on randomAccess requirements.

Best Regards, Simon Riggs





Re: [HACKERS] qsort again (was Re: [PERFORM] Strange Create Index

2006-02-15 Thread Neil Conway

On Wed, 2006-02-15 at 18:28 -0500, Tom Lane wrote:
 It seems clear that our qsort.c is doing a pretty awful job of picking
 qsort pivots, while glibc is mostly managing not to make that mistake.
 I haven't looked at the glibc code yet to see what they are doing
 differently.

glibc qsort is actually merge sort, so I'm not surprised it avoids this
problem.

-Neil




Re: [HACKERS] qsort again (was Re: [PERFORM] Strange Create Index behaviour)

2006-02-15 Thread Qingqing Zhou


Tom Lane [EMAIL PROTECTED] wrote

 I did this 100 times and sorted the reported runtimes.

 I'd say this puts a considerable damper on my enthusiasm for using our
 qsort all the time, as was recently debated in this thread:
 http://archives.postgresql.org/pgsql-hackers/2005-12/msg00610.php

 100 runtimes for glibc qsort, sorted ascending:

 Time: 866.814 ms
 Time: 1234.848 ms
 Time: 1267.398 ms

 100 runtimes for port/qsort.c, sorted ascending:

 Time: 28314.182 ms
 Time: 29400.278 ms
 Time: 34142.534 ms


By "did this 100 times" do you mean generate a sequence of at most
20*100 numbers, and for every 20 numbers, the first half are all
zeros and the other half are uniform random numbers? I tried to confirm it
by patching the program mentioned in the link, but BSD qsort still seems to
be slightly ahead.

Regards,
Qingqing

---
Result

sort#./sort
[3] [glibc qsort]: nelem(2000), range(4294901760) distr(halfhalf) ccost(2) : 18887.285000 ms
[3] [BSD qsort]: nelem(2000), range(4294901760) distr(halfhalf) ccost(2) : 18801.018000 ms
[3] [qsortG]: nelem(2000), range(4294901760) distr(halfhalf) ccost(2) : 22997.004000 ms

---
Patch to sort.c

sort#diff -c sort.c sort1.c
*** sort.c  Thu Dec 15 12:18:59 2005
--- sort1.c Wed Feb 15 22:21:15 2006
***************
*** 35,43 ****
        {"BSD qsort", qsortB},
        {"qsortG", qsortG}
  };
! static const size_t d_nelem[] = {1000, 1, 10, 100, 500};
! static const size_t d_range[] = {2, 32, 1024, 0xL};
! static const char *d_distr[] = {"uniform", "gaussian", "95sorted", "95reversed"};
  static const size_t d_ccost[] = {2};

  /* factor index */
--- 35,43 ----
        {"BSD qsort", qsortB},
        {"qsortG", qsortG}
  };
! static const size_t d_nelem[] = {500, 1000, 2000};
! static const size_t d_range[] = {0xL};
! static const char *d_distr[] = {"halfhalf"};
  static const size_t d_ccost[] = {2};

  /* factor index */
***************
*** 180,185 ****
--- 180,192 ----
                        swap(karray[i], karray[nelem-i-1]);
                }
        }
+       else if (!strcmp(distr, "halfhalf"))
+       {
+               int j;
+               for (i = 0; i < nelem/20; i++)
+                       for (j = 0; j < 10; j++)
+                               karray[i*20 + j] = 0;
+       }

        return array;
  }





Re: [HACKERS] qsort again (was Re: [PERFORM] Strange Create Index behaviour)

Qingqing Zhou [EMAIL PROTECTED] writes:
 By "did this 100 times" do you mean generate a sequence of at most
 20*100 numbers, and for every 20 numbers, the first half are all
 zeros and the other half are uniform random numbers?

No, I mean I ran the bit of SQL script I gave 100 separate times.

regards, tom lane


Re: [HACKERS] qsort again (was Re: [PERFORM] Strange Create Index behaviour)

2006-02-15 Thread Qingqing Zhou


Tom Lane [EMAIL PROTECTED] wrote
 Qingqing Zhou [EMAIL PROTECTED] writes:
  By "did this 100 times" do you mean generate a sequence of at most
  20*100 numbers, and for every 20 numbers, the first half are all
  zeros and the other half are uniform random numbers?

 No, I mean I ran the bit of SQL script I gave 100 separate times.


I must be misunderstanding something here -- I can't figure out why the cost
of the same procedure keeps climbing.

Regards,
Qingqing




Re: [HACKERS] qsort again (was Re: [PERFORM] Strange Create Index behaviour)

2006-02-15 Thread Qingqing Zhou


Qingqing Zhou [EMAIL PROTECTED] wrote

 I must be misunderstanding something here -- I can't figure out why the cost
 of the same procedure keeps climbing.


Oops, I misinterpreted the sentence -- you sorted the results ...

Regards,
Qingqing




Re: [HACKERS] qsort again (was Re: [PERFORM] Strange Create Index behaviour)