Re: [HACKERS] Re: Which qsort is used

2005-12-22 Thread Manfred Koizar
On Thu, 22 Dec 2005 08:01:00 +0100, Martijn van Oosterhout
kleptog@svana.org wrote:
But where are you including the cost to check how many cells are
already sorted? That would be O(H), right?

Yes.  I didn't mention it, because H  N.

 This is where we come back
to the issue that comparisons in PostgreSQL are expensive.

So we agree that we should try to reduce the number of comparisons.
How many comparisons does it take to sort 10 items?  1.5 million?

Hmm, what are the chances you have 10 unordered items to sort and
that the first 8% will already be in order. ISTM that that probability
will be close enough to zero to not matter...

If the items are totally unordered, the check is so cheap you won't
even notice.  OTOH in Tom's example ...

|What I think is much more probable in the Postgres environment
|is almost-but-not-quite-ordered inputs --- eg, a table that was
|perfectly ordered by key when filled, but some of the tuples have since
|been moved by UPDATEs.

... I'd not be surprised if H is 90% of N.
Servus
 Manfred

---(end of broadcast)---
TIP 2: Don't 'kill -9' the postmaster


Re: [HACKERS] Re: Which qsort is used

2005-12-22 Thread Dann Corbit
An interesting article on sorting and comparison count:
http://www.acm.org/jea/ARTICLES/Vol7Nbr5.pdf

Here is the article, the code, and an implementation that I have been
toying with:
http://cap.connx.com/chess-engines/new-approach/algos.zip

Algorithm quickheap is especially interesting because it does not
require much additional space (just an array of integers up to size
log(element_count) and in addition, it has very few data movements.

 -Original Message-
 From: Manfred Koizar [mailto:[EMAIL PROTECTED]
 Sent: Thursday, December 22, 2005 1:59 PM
 To: Martijn van Oosterhout
 Cc: Tom Lane; Dann Corbit; Qingqing Zhou; Bruce Momjian; Luke
Lonergan;
 Neil Conway; pgsql-hackers@postgresql.org
 Subject: Re: [HACKERS] Re: Which qsort is used
 
 On Thu, 22 Dec 2005 08:01:00 +0100, Martijn van Oosterhout
 kleptog@svana.org wrote:
 But where are you including the cost to check how many cells are
 already sorted? That would be O(H), right?
 
 Yes.  I didn't mention it, because H  N.
 
  This is where we come back
 to the issue that comparisons in PostgreSQL are expensive.
 
 So we agree that we should try to reduce the number of comparisons.
 How many comparisons does it take to sort 10 items?  1.5 million?
 
 Hmm, what are the chances you have 10 unordered items to sort and
 that the first 8% will already be in order. ISTM that that
probability
 will be close enough to zero to not matter...
 
 If the items are totally unordered, the check is so cheap you won't
 even notice.  OTOH in Tom's example ...
 
 |What I think is much more probable in the Postgres environment
 |is almost-but-not-quite-ordered inputs --- eg, a table that was
 |perfectly ordered by key when filled, but some of the tuples have
since
 |been moved by UPDATEs.
 
 ... I'd not be surprised if H is 90% of N.
 Servus
  Manfred

---(end of broadcast)---
TIP 2: Don't 'kill -9' the postmaster


Re: [HACKERS] Re: Which qsort is used

2005-12-21 Thread Manfred Koizar
On Sat, 17 Dec 2005 00:03:25 -0500, Tom Lane [EMAIL PROTECTED]
wrote:
I've still got a problem with these checks; I think they are a net
waste of cycles on average.  [...]
 and when they fail, those cycles are entirely wasted;
you have not advanced the state of the sort at all.

How can we make the initial check adavance the state of the sort?
One answer might be to exclude the sorted sequence at the start of the
array from the qsort, and merge the two sorted lists as the final
stage of the sort.

Qsorting N elements costs O(N*lnN), so excluding H elements from the
sort reduces the cost by at least O(H*lnN).  The merge step costs O(N)
plus some (=50%) more memory, unless someone knows a fast in-place
merge.  So depending on the constant factors involved there might be a
usable solution.

I've been playing with some numbers and assuming the constant factors
to be equal for all the O()'s this method starts to pay off at
  H  for N
  20   100
 130  1000
800010
Servus
 Manfred

---(end of broadcast)---
TIP 9: In versions below 8.0, the planner will ignore your desire to
   choose an index scan if your joining column's datatypes do not
   match


Re: [HACKERS] Re: Which qsort is used

2005-12-21 Thread Martijn van Oosterhout
On Thu, Dec 22, 2005 at 01:43:34AM +0100, Manfred Koizar wrote:
 Qsorting N elements costs O(N*lnN), so excluding H elements from the
 sort reduces the cost by at least O(H*lnN).  The merge step costs O(N)
 plus some (=50%) more memory, unless someone knows a fast in-place
 merge.  So depending on the constant factors involved there might be a
 usable solution.

But where are you including the cost to check how many cells are
already sorted? That would be O(H), right? This is where we come back
to the issue that comparisons in PostgreSQL are expensive. The cpu_cost
in the tests I saw so far is unrealistically low.

 I've been playing with some numbers and assuming the constant factors
 to be equal for all the O()'s this method starts to pay off at
 H  for N
 20   100  20%
130  1000  13%
   800010   8%

Hmm, what are the chances you have 10 unordered items to sort and
that the first 8% will already be in order. ISTM that that probability
will be close enough to zero to not matter...

Have a nice day,
-- 
Martijn van Oosterhout   kleptog@svana.org   http://svana.org/kleptog/
 Patent. n. Genius is 5% inspiration and 95% perspiration. A patent is a
 tool for doing 5% of the work and then sitting around waiting for someone
 else to do the other 95% so you can sue them.


pgpg6RoCjt5SA.pgp
Description: PGP signature


Re: [HACKERS] Re: Which qsort is used

2005-12-19 Thread Martijn van Oosterhout
On Fri, Dec 16, 2005 at 10:43:58PM -0800, Dann Corbit wrote:
 I am actually quite impressed with the excellence of Bentley's sort out
 of the box.  It's definitely the best library implementation of a sort I
 have seen.

I'm not sure whether we have a conclusion here, but I do have one
question: is there a significant difference in the number of times the
comparison routines are called? Comparisons in PostgreSQL are fairly
expensive given the fmgr overhead and when comparing tuples it's even
worse.

We don't want to accedently pick a routine that saves data shuffling by
adding extra comparisons. The stats at [1] don't say. They try to
factor in CPU cost but they seem to use unrealistically small values. I
would think a number around 50 (or higher) would be more
representative.

[1] http://www.cs.toronto.edu/~zhouqq/postgresql/sort/sort.html

Have a nice day,
-- 
Martijn van Oosterhout   kleptog@svana.org   http://svana.org/kleptog/
 Patent. n. Genius is 5% inspiration and 95% perspiration. A patent is a
 tool for doing 5% of the work and then sitting around waiting for someone
 else to do the other 95% so you can sue them.


pgpWGyYf8ywud.pgp
Description: PGP signature


Re: [HACKERS] Re: Which qsort is used

2005-12-19 Thread Luke Lonergan
Martin,

On 12/19/05 3:37 AM, Martijn van Oosterhout kleptog@svana.org wrote:

 I'm not sure whether we have a conclusion here, but I do have one
 question: is there a significant difference in the number of times the
 comparison routines are called? Comparisons in PostgreSQL are fairly
 expensive given the fmgr overhead and when comparing tuples it's even
 worse.

It would be interesting to note the comparison count of the different
routines.

Something that really grabbed me about the results though is that the
relative performance of the routines dramatically shifted when the indirect
references in the comparators went in.  The first test I did sorted an array
of int4 - these tests that Qingqing did sorted arrays using an indirect
pointer list, at which point the same distributions performed very
differently.

I suspect that it is the number of comparisons that caused this, and further
that the indirection has disabled the compiler optimizations for memory
prefetch and other things that it could normally recognize.  Given the usage
pattern in Postgres, where sorted things are a mix of strings and intrinsic
types, I'm not sure those optimizations could be done by one routine.

I haven't verified this, but it certainly seems that the NetBSD routine is
the overall winner for the type of use that Postgres has (sorting the using
a pointer list).

- Luke



---(end of broadcast)---
TIP 4: Have you searched our list archives?

   http://archives.postgresql.org


Re: [HACKERS] Re: Which qsort is used

2005-12-16 Thread Bruce Momjian
Luke Lonergan wrote:
 Tom,
 
 On 12/12/05 2:47 PM, Tom Lane [EMAIL PROTECTED] wrote:
 
  As those results suggest, there can be huge differences in sort
  performance depending on whether the input is random, nearly sorted,
  nearly reverse sorted, possesses many equal keys, etc.  It would be very
  dangerous to say implementation A is better than implementation B
  without having tested all those scenarios.
 
 Yes.  The Linux glibc qsort is proven terrible in the general case by these
 examples though.
 
 Bruce's point on that thread was that we shouldn't be replacing the OS
 routine in the general case.  On the other hand, there is the precedent of
 replacing Solaris' routine with the NetBSD version.

At this point, I think we have done enough testing on enough platforms
to just use port/qsort on all platforms in 8.2.  It seems whenever
someone tries to improve the BSD qsort, they make it worse.  

Were the BSD programmers geniuses and we are all idiots now, or what?

-- 
  Bruce Momjian|  http://candle.pha.pa.us
  pgman@candle.pha.pa.us   |  (610) 359-1001
  +  If your life is a hard drive, |  13 Roberts Road
  +  Christ can be your backup.|  Newtown Square, Pennsylvania 19073

---(end of broadcast)---
TIP 1: if posting/reading through Usenet, please send an appropriate
   subscribe-nomail command to [EMAIL PROTECTED] so that your
   message can get through to the mailing list cleanly


Re: [HACKERS] Re: Which qsort is used

2005-12-16 Thread Qingqing Zhou


On Fri, 16 Dec 2005, Bruce Momjian wrote:


 At this point, I think we have done enough testing on enough platforms
 to just use port/qsort on all platforms in 8.2.  It seems whenever
 someone tries to improve the BSD qsort, they make it worse.


Not necessariliy true. Dann Corbit sent me an implementation a while ago
(you can see it on the same site). BSD qsort is improved, though not that
much, by two methods. Since Dann write the program from scratch, so I am
not sure if we are afford to take the efforts for the improvement.

Regards,
Qingqing




---(end of broadcast)---
TIP 3: Have you checked our extensive FAQ?

   http://www.postgresql.org/docs/faq


Re: [HACKERS] Re: Which qsort is used

2005-12-16 Thread Dann Corbit
 -Original Message-
 From: [EMAIL PROTECTED] [mailto:pgsql-hackers-
 [EMAIL PROTECTED] On Behalf Of Qingqing Zhou
 Sent: Friday, December 16, 2005 5:14 PM
 To: Bruce Momjian
 Cc: Luke Lonergan; Tom Lane; Neil Conway; pgsql-hackers@postgresql.org
 Subject: Re: [HACKERS] Re: Which qsort is used
 
 On Fri, 16 Dec 2005, Bruce Momjian wrote:
 
 
  At this point, I think we have done enough testing on enough
platforms
  to just use port/qsort on all platforms in 8.2.  It seems whenever
  someone tries to improve the BSD qsort, they make it worse.
 
 
 Not necessariliy true. Dann Corbit sent me an implementation a while
ago
 (you can see it on the same site). BSD qsort is improved, though not
that
 much, by two methods. Since Dann write the program from scratch, so I
am
 not sure if we are afford to take the efforts for the improvement.

Both of them are modified versions of Bentley's sort algorithm.

I just added the in-order check to both of them, and the reversed
partition check for the second method (in the case of the subfiles
because insertion sort is allergic to reversed partitions).

At any rate, neither is much of an improvement on Bentley's version.
For the rare cases of completely ordered or completely reversed, it will
be a monumental improvement.  But that is a pretty rare case.

If I could use C++, I could do much better.  It is very difficult for me
to write an ADT in C instead of in C++.

---(end of broadcast)---
TIP 3: Have you checked our extensive FAQ?

   http://www.postgresql.org/docs/faq


Re: [HACKERS] Re: Which qsort is used

2005-12-16 Thread Tom Lane
Dann Corbit [EMAIL PROTECTED] writes:
 Both of them are modified versions of Bentley's sort algorithm.

Jon Bentley of Bell Labs?  Small world ... he was my thesis adviser
for awhile when he was at CMU.  He's a good algorithms man, for sure.

 I just added the in-order check to both of them, and the reversed
 partition check for the second method (in the case of the subfiles
 because insertion sort is allergic to reversed partitions).

I've still got a problem with these checks; I think they are a net
waste of cycles on average.  They would only be a win if you expected a
nontrivial percentage of perfectly-in-order or perfectly-reverse-order
inputs.  What I think is much more probable in the Postgres environment
is almost-but-not-quite-ordered inputs --- eg, a table that was
perfectly ordered by key when filled, but some of the tuples have since
been moved by UPDATEs.  This is the worst possible case for the in-order
checks, because they can then grovel over large percentages of the file
before failing ... and when they fail, those cycles are entirely wasted;
you have not advanced the state of the sort at all.

For the usual case of randomly ordered input, of course it doesn't
matter much at all because the in-order checks will fail after examining
not too many items.  But to argue that the checks are worth making, you
have to assume that perfectly-ordered inputs are more common than
almost-ordered inputs, and I think that is exactly the wrong assumption
for the Postgres environment.  I sure haven't seen any evidence that
it's a good assumption.

regards, tom lane

---(end of broadcast)---
TIP 6: explain analyze is your friend


Re: [HACKERS] Re: Which qsort is used

2005-12-16 Thread Dann Corbit
 -Original Message-
 From: Tom Lane [mailto:[EMAIL PROTECTED]
 Sent: Friday, December 16, 2005 9:03 PM
 To: Dann Corbit
 Cc: Qingqing Zhou; Bruce Momjian; Luke Lonergan; Neil Conway; pgsql-
 [EMAIL PROTECTED]
 Subject: Re: [HACKERS] Re: Which qsort is used
 
 Dann Corbit [EMAIL PROTECTED] writes:
  Both of them are modified versions of Bentley's sort algorithm.
 
 Jon Bentley of Bell Labs?  Small world ... he was my thesis adviser
 for awhile when he was at CMU.  He's a good algorithms man, for sure.
 
  I just added the in-order check to both of them, and the reversed
  partition check for the second method (in the case of the subfiles
  because insertion sort is allergic to reversed partitions).
 
 I've still got a problem with these checks; I think they are a net
 waste of cycles on average.  They would only be a win if you expected
a
 nontrivial percentage of perfectly-in-order or perfectly-reverse-order
 inputs.  What I think is much more probable in the Postgres
environment
 is almost-but-not-quite-ordered inputs --- eg, a table that was
 perfectly ordered by key when filled, but some of the tuples have
since
 been moved by UPDATEs.  This is the worst possible case for the
in-order
 checks, because they can then grovel over large percentages of the
file
 before failing ... and when they fail, those cycles are entirely
wasted;
 you have not advanced the state of the sort at all.
 
 For the usual case of randomly ordered input, of course it doesn't
 matter much at all because the in-order checks will fail after
examining
 not too many items.  But to argue that the checks are worth making,
you
 have to assume that perfectly-ordered inputs are more common than
 almost-ordered inputs, and I think that is exactly the wrong
assumption
 for the Postgres environment.  I sure haven't seen any evidence that
 it's a good assumption.

The benchmarks say that they (order checks) are a good idea on average
for ordered data, random data, and partly ordered data.

It does not require perfectly ordered data for the checks to be useful.
On mostly ordered data, it is likely that some partitions are perfectly
ordered.

If you trace the algorithms in a debugger you will be surprised at how
often the partitions are ordered, even with random sets as input.

---(end of broadcast)---
TIP 1: if posting/reading through Usenet, please send an appropriate
   subscribe-nomail command to [EMAIL PROTECTED] so that your
   message can get through to the mailing list cleanly


Re: [HACKERS] Re: Which qsort is used

2005-12-16 Thread Qingqing Zhou


On Sat, 17 Dec 2005, Dann Corbit wrote:


 The benchmarks say that they (order checks) are a good idea on average
 for ordered data, random data, and partly ordered data.


I interpret that in linux, 500 seems a divide for qsortpdq. Before
that number, it wins, after that, bsd wins more. On SunOS, qsortpdq takes
the lead till the last second -- I suspect this is due to the rand()
function:

Linux - #define   RAND_MAX2147483647
SunOS - #define   RAND_MAX32767

So in SunOS, the data actually not that scattered - so more favourate for
sorted() or reversed() check?

Regards,
Qingqing

---(end of broadcast)---
TIP 4: Have you searched our list archives?

   http://archives.postgresql.org


Re: [HACKERS] Re: Which qsort is used

2005-12-16 Thread Dann Corbit
 -Original Message-
 From: Qingqing Zhou [mailto:[EMAIL PROTECTED]
 Sent: Friday, December 16, 2005 10:13 PM
 To: Dann Corbit
 Cc: Tom Lane; Bruce Momjian; Luke Lonergan; Neil Conway; pgsql-
 [EMAIL PROTECTED]
 Subject: RE: [HACKERS] Re: Which qsort is used
 
 
 
 On Sat, 17 Dec 2005, Dann Corbit wrote:
 
 
  The benchmarks say that they (order checks) are a good idea on
average
  for ordered data, random data, and partly ordered data.
 
 
 I interpret that in linux, 500 seems a divide for qsortpdq. Before
 that number, it wins, after that, bsd wins more. On SunOS, qsortpdq
takes
 the lead till the last second -- I suspect this is due to the rand()
 function:
 
   Linux - #define   RAND_MAX2147483647
   SunOS - #define   RAND_MAX32767
 
 So in SunOS, the data actually not that scattered - so more favourate
for
 sorted() or reversed() check?

There is a lot of variability from system to system even for the same
tests.  I see different results depending on whether I use GCC or Intel
or MS compilers.

---(end of broadcast)---
TIP 1: if posting/reading through Usenet, please send an appropriate
   subscribe-nomail command to [EMAIL PROTECTED] so that your
   message can get through to the mailing list cleanly


Re: [HACKERS] Re: Which qsort is used

2005-12-16 Thread Tom Lane
Dann Corbit [EMAIL PROTECTED] writes:
 I've still got a problem with these checks; I think they are a net
 waste of cycles on average.

 The benchmarks say that they (order checks) are a good idea on average
 for ordered data, random data, and partly ordered data.

There are lies, damn lies, and benchmarks ;-)

The problem with citing a benchmark for this discussion is that a
benchmark can't tell you anything about real-world probabilities;
it only tells you about the probabilities occuring in the benchmark
case.  You need to make the case that the benchmark reflects the
real world, which you didn't.

 If you trace the algorithms in a debugger you will be surprised at how
 often the partitions are ordered, even with random sets as input.

Well, I do agree that checking for orderedness on small partitions would
succeed more often than on larger partitions or the whole file --- but
the code-as-given checks all the way down.  Moreover, the argument given
for spending these cycles is that insertion sort sucks on reverse-order
input ... where sucks means that it spends O(N^2) time.  But it spends
O(N^2) in the average case, too.

regards, tom lane

---(end of broadcast)---
TIP 4: Have you searched our list archives?

   http://archives.postgresql.org


Re: [HACKERS] Re: Which qsort is used

2005-12-16 Thread Dann Corbit
 -Original Message-
 From: Tom Lane [mailto:[EMAIL PROTECTED]
 Sent: Friday, December 16, 2005 10:41 PM
 To: Dann Corbit
 Cc: Qingqing Zhou; Bruce Momjian; Luke Lonergan; Neil Conway; pgsql-
 [EMAIL PROTECTED]
 Subject: Re: [HACKERS] Re: Which qsort is used
 
 Dann Corbit [EMAIL PROTECTED] writes:
  I've still got a problem with these checks; I think they are a net
  waste of cycles on average.
 
  The benchmarks say that they (order checks) are a good idea on
average
  for ordered data, random data, and partly ordered data.
 
 There are lies, damn lies, and benchmarks ;-)
 
 The problem with citing a benchmark for this discussion is that a
 benchmark can't tell you anything about real-world probabilities;
 it only tells you about the probabilities occuring in the benchmark
 case.  You need to make the case that the benchmark reflects the
 real world, which you didn't.
 
  If you trace the algorithms in a debugger you will be surprised at
how
  often the partitions are ordered, even with random sets as input.
 
 Well, I do agree that checking for orderedness on small partitions
would
 succeed more often than on larger partitions or the whole file --- but
 the code-as-given checks all the way down.  Moreover, the argument
given
 for spending these cycles is that insertion sort sucks on
reverse-order
 input ... where sucks means that it spends O(N^2) time.  But it
spends
 O(N^2) in the average case, too.

I agree that in general these checks are not important and they
complicate what is a simple and elegant algorithm.

The cases where the checks are important (highly ordered data sets) are
rare and so the value added is minimal.

I am actually quite impressed with the excellence of Bentley's sort out
of the box.  It's definitely the best library implementation of a sort I
have seen.

---(end of broadcast)---
TIP 1: if posting/reading through Usenet, please send an appropriate
   subscribe-nomail command to [EMAIL PROTECTED] so that your
   message can get through to the mailing list cleanly