Re: [HACKERS] qsort again (was Re: [PERFORM] Strange Create Index behaviour)

2006-03-02 Thread Bruce Momjian

Added to TODO:

* Improve port/qsort() to handle sorts with 50% unique and 50% duplicate
  values [qsort]

  This involves choosing better pivot points for the quicksort.


---

Dann Corbit wrote:
> 
> 
> > -----Original Message-----
> > From: [EMAIL PROTECTED] [mailto:pgsql-hackers-
> > [EMAIL PROTECTED] On Behalf Of Tom Lane
> > Sent: Wednesday, February 15, 2006 5:22 PM
> > To: Ron
> > Cc: pgsql-performance@postgresql.org; pgsql-hackers@postgresql.org
> > Subject: Re: [HACKERS] qsort again (was Re: [PERFORM] Strange Create
> > Index behaviour)
> > 
> > Ron <[EMAIL PROTECTED]> writes:
> > > How are we choosing our pivots?
> > 
> > See qsort.c: it looks like median of nine equally spaced inputs (ie,
> > the 1/8th points of the initial input array, plus the end points),
> > implemented as two rounds of median-of-three choices.  With half of the
> > data inputs zero, it's not too improbable for two out of the three
> > samples to be zeroes, in which case I think the med3 result will be zero
> > --- so choosing a pivot of zero is much more probable than one would
> > like, and doing so in many levels of recursion causes the problem.
> 
> Adding some randomness to the selection of the pivot is a known
> technique to fix the oddball partitions problem.  However, Bentley and
> Sedgewick proved that every quick sort algorithm has some input set that
> makes it go quadratic (hence the recent popularity of introspective
> sort, which switches to heapsort if quadratic behavior is detected.  The
> C++ template I submitted was an example of introspective sort, but
> PostgreSQL does not use C++ so it was not helpful).
> 
> > I think.  I'm not too sure if the code isn't just being sloppy about the
> > case where many data values are equal to the pivot --- there's a special
> > case there to switch to insertion sort, and maybe that's getting invoked
> > too soon.
> 
> Here are some cases known to make qsort go quadratic:
> 1. Data already sorted
> 2. Data reverse sorted
> 3. Data organ-pipe sorted or ramp
> 4. Almost all data of the same value
> 
> There are probably other cases.  Randomizing the pivot helps some, as
> does checking for in-order or reverse-order partitions.
> 
> Imagine if 1/3 of the partitions fall into a category that causes
> quadratic behavior (have one of the above formats and have more than
> CUTOFF elements in them).
> 
> It is doubtful that the switch to insertion sort is causing any sort of
> problems.  It is only going to be invoked on tiny sets, for which it has
> a fixed cost that is probably less than qsort() function calls on sets
> of the same size.
> 
> > It'd be useful to get a line-level profile of the behavior of
> > this code in the slow cases...
> 
> I guess that my in-order or presorted tests [which often arise when
> there are very few distinct values] may solve the bad partition
> problems.  Don't forget that the algorithm is called recursively.
> 
> > regards, tom lane

-- 
  Bruce Momjian   http://candle.pha.pa.us
  SRA OSS, Inc.   http://www.sraoss.com

  + If your life is a hard drive, Christ can be your backup. +



Re: [HACKERS] qsort again (was Re: [PERFORM] Strange Create

2006-02-18 Thread Ron

At 08:37 PM 2/15/2006, Dann Corbit wrote:

> Adding some randomness to the selection of the pivot is a known
> technique to fix the oddball partitions problem.


True, but it makes QuickSort slower than, say, MergeSort because of the 
expense of the PRNG being called ~O(lgN) times during a sort.
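
That per-call cost can be kept very small, though.  A sketch using
Marsaglia's xorshift32 -- illustrative only, not anyone's posted patch,
and ignoring modulo bias; it assumes the range fits in 32 bits:

#include <stdint.h>
#include <stddef.h>

static uint32_t rng_state = 2463534242u;    /* any nonzero seed */

static uint32_t xorshift32(void)            /* three shifts, three XORs */
{
    uint32_t x = rng_state;
    x ^= x << 13;
    x ^= x >> 17;
    x ^= x << 5;
    return rng_state = x;
}

/* One random pivot index in [lo, hi] per partitioning pass. */
static size_t random_pivot(size_t lo, size_t hi)
{
    return lo + xorshift32() % (uint32_t) (hi - lo + 1);
}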



> However, Bentley and Sedgewick proved that every quick sort
> algorithm has some input set that makes it go quadratic


Yep.  OTOH, that input set can be so specific and so unusual as to 
require astronomically unlikely bad luck or hostile hacking in order 
for it to actually occur.



> (hence the recent popularity of introspective sort, which switches
> to heapsort if quadratic behavior is detected.  The C++ template I
> submitted was an example of introspective sort, but PostgreSQL does
> not use C++ so it was not helpful).

...and there are other QuickSort+Other hybrids that address the issue 
as well.  MergeSort, RadixExchangeSort, and BucketSort all come to 
mind.  See Gonnet and Baeza-Yates, etc.




> Here are some cases known to make qsort go quadratic:
> 1. Data already sorted


Only if one element is used to choose the pivot; _and_ only if the 
pivot is the first or last element of each pass.
Even just always using the middle element as the pivot avoids this 
problem.  See Sedgewick or Knuth.




> 2. Data reverse sorted


Ditto above.



> 3. Data organ-pipe sorted or ramp


Not sure what this means?  Regardless, median of n partitioning that 
includes samples from each of the 1st 1/3, 2nd 1/3, and final 3rd of 
the data is usually enough to guarantee O(NlgN) behavior unless the 
_specific_ distribution known to be pessimal to that sampling 
algorithm is encountered.  The only times I've ever seen it ITRW was 
as a result of hostile activity: purposely arranging the data in such 
a manner is essentially a DoS attack.




> 4. Almost all data of the same value


Well-known fixes to the inner loop are available to avoid this problem.


> There are probably other cases.  Randomizing the pivot helps some,
> as does checking for in-order or reverse-order partitions.

Randomizing the choice of pivot essentially guarantees O(NlgN) 
behavior no matter what the distribution of the data, at the price of 
increasing the cost of each pass by a constant factor (the generation 
of a random number or numbers).



In sum, QuickSort gets all sorts of bad press that is far more FUD 
than fact ITRW.
Ron.  






Re: [HACKERS] qsort again (was Re: [PERFORM] Strange Create

2006-02-17 Thread Gregory Maxwell
On 2/17/06, Ragnar <[EMAIL PROTECTED]> wrote:
> Say again ?
> Let us say you have 1 billion rows, where the
> column in question contains strings like
> baaaaaa
> baaaaab
> baaaaac
> ...
> not necessarily in this order on disc of course
>
> The minimum value would be keyed as 0001h,
> the next one as 0002h and so on.
>
> Now insert new value 'a'
>
> Not only will you have to update 1 billion records,
> but also all the values in your map.
>
> please explain

No comment on the usefulness of the idea overall... but the solution
would be to insert the new value with the same key as the closest
existing value less than it.

That will falsely claim them equal, which you then fix with a second,
local sort; that should be fast because you only need to sort the
duplicates/false dupes.  If you insert too much then this obviously
becomes completely useless.



Re: [HACKERS] qsort again (was Re: [PERFORM] Strange Create Index

2006-02-17 Thread Dann Corbit
> -----Original Message-----
> From: [EMAIL PROTECTED] [mailto:pgsql-hackers-
> [EMAIL PROTECTED] On Behalf Of Markus Schaber
> Sent: Thursday, February 16, 2006 5:45 AM
> To: pgsql-performance@postgresql.org; pgsql-hackers@postgresql.org
> Subject: Re: [HACKERS] qsort again (was Re: [PERFORM] Strange Create
> Index
> 
> Hi, Ron,
> 
> Ron wrote:
> 
> > ...and of course if you know enough about the data to be sorted so as to
> > constrain it appropriately, one should use a non comparison based O(N)
> > sorting algorithm rather than any of the general comparison based
> > O(NlgN) methods.
> 
> Sounds interesting, could you give us some pointers (names, URLs,
> papers) to such algorithms?

He refers to counting sort and radix sort (which comes in most
significant digit and least significant digit format).  These are also
called distribution (as opposed to comparison) sorts.

These sorts are O(n) as a function of the data size, but really they are
O(M*n) where M is the average key length and n is the data set size.
(In the case of MSD radix sort M is the average length to completely
differentiate strings)

So if you have an 80-character key, then 80*n will only be faster than
n*log(n) when log(n) > 80 -- in other words, when you have 2^80 elements
-- never.

If you have short keys, on the other hand, distribution sorts will be
dramatically faster.  On an unsigned integer, for instance, it requires
4 passes with 8-bit buckets, so n = 16 is the crossover point above
which radix is faster than an O(n*log(n)) sort.  Of course, there is a
fixed constant of proportionality and so it will really be higher than
that, but for large data sets distribution sorting is the best thing
that there is for small keys.
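
A sketch of that 4-pass, 8-bit-bucket case on bare uint32 keys
(illustrative only, not database-ready):

#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* LSD radix sort: one stable counting pass per 8-bit digit, 4 passes
 * total, O(4*n) whatever the key distribution. */
static void radix_sort_u32(uint32_t *a, size_t n)
{
    uint32_t *tmp = malloc(n * sizeof(uint32_t));

    if (tmp == NULL)
        return;                                 /* demo: give up quietly */
    for (int shift = 0; shift < 32; shift += 8)
    {
        size_t count[256] = {0};

        for (size_t i = 0; i < n; i++)          /* histogram this digit */
            count[(a[i] >> shift) & 0xFF]++;
        for (size_t b = 1; b < 256; b++)        /* prefix sums: end offsets */
            count[b] += count[b - 1];
        for (size_t i = n; i-- > 0; )           /* stable scatter, back to front */
            tmp[--count[(a[i] >> shift) & 0xFF]] = a[i];
        memcpy(a, tmp, n * sizeof(uint32_t));
    }
    free(tmp);
}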

You could easily have an introspective MSD radix sort.  The nice thing
about MSD radix sort is that you can retain the ordering that has
occurred so far and switch to another algorithm.

An introspective MSD radix sort could call an introspective quick sort
algorithm once it processed a crossover point of buckets of key data.

In order to have distribution sorts that will work with a database
system, each and every data type will need a function that returns the
bits of significance for the kth bucket position.  For a character
string, you could return key[bucket].  For an unsigned integer, which
byte to return is a function of the endianness of the CPU.  And for
each other distinct data type a bucket function would be needed, or a
sort could not be generated for that type using the distribution
method.



Re: [HACKERS] qsort again (was Re: [PERFORM] Strange Create Index

2006-02-17 Thread Jonah H. Harris
Last night I implemented a non-recursive introsort in C... let me test
it a bit more and then I'll post it here for everyone else to try out.

On 2/16/06, Markus Schaber <[EMAIL PROTECTED]> wrote:
> Hi, Ron,
>
> Ron wrote:
> > ...and of course if you know enough about the data to be sorted so as to
> > constrain it appropriately, one should use a non comparison based O(N)
> > sorting algorithm rather than any of the general comparison based
> > O(NlgN) methods.
>
> Sounds interesting, could you give us some pointers (names, URLs,
> papers) to such algorithms?
>
> Thanks a lot,
> Markus
>
> --
> Markus Schaber | Logical Tracking&Tracing International AG
> Dipl. Inf. | Software Development GIS
>
> Fight against software patents in EU! www.ffii.org www.nosoftwarepatents.org

--
Jonah H. Harris, Database Internals Architect
EnterpriseDB Corporation
732.331.1324


Re: [HACKERS] qsort again (was Re: [PERFORM] Strange Create Index behaviour)

2006-02-17 Thread Dann Corbit


> -----Original Message-----
> From: [EMAIL PROTECTED] [mailto:pgsql-hackers-
> [EMAIL PROTECTED] On Behalf Of Tom Lane
> Sent: Wednesday, February 15, 2006 5:22 PM
> To: Ron
> Cc: pgsql-performance@postgresql.org; pgsql-hackers@postgresql.org
> Subject: Re: [HACKERS] qsort again (was Re: [PERFORM] Strange Create
> Index behaviour)
> 
> Ron <[EMAIL PROTECTED]> writes:
> > How are we choosing our pivots?
> 
> See qsort.c: it looks like median of nine equally spaced inputs (ie,
> the 1/8th points of the initial input array, plus the end points),
> implemented as two rounds of median-of-three choices.  With half of the
> data inputs zero, it's not too improbable for two out of the three
> samples to be zeroes, in which case I think the med3 result will be zero
> --- so choosing a pivot of zero is much more probable than one would
> like, and doing so in many levels of recursion causes the problem.

Adding some randomness to the selection of the pivot is a known
technique to fix the oddball partitions problem.  However, Bentley and
Sedgewick proved that every quick sort algorithm has some input set that
makes it go quadratic (hence the recent popularity of introspective
sort, which switches to heapsort if quadratic behavior is detected.  The
C++ template I submitted was an example of introspective sort, but
PostgreSQL does not use C++ so it was not helpful).
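
The depth-limit idea is easy to sketch.  What follows is a self-contained
illustration on ints, with assumed cutoffs -- not the C++ template
mentioned above and not PostgreSQL's qsort.c: quicksort runs normally,
but once partitioning exceeds ~2*log2(n) levels it hands the subarray to
heapsort, bounding the worst case at O(n log n).

#include <stddef.h>

static void swap_int(int *a, int *b) { int t = *a; *a = *b; *b = t; }

static void sift_down(int *a, size_t i, size_t n)   /* max-heap helper */
{
    for (size_t c; (c = 2 * i + 1) < n; i = c)
    {
        if (c + 1 < n && a[c + 1] > a[c])
            c++;
        if (a[i] >= a[c])
            break;
        swap_int(&a[i], &a[c]);
    }
}

static void heap_sort(int *a, size_t n)             /* O(n log n) fallback */
{
    for (size_t i = n / 2; i-- > 0; )
        sift_down(a, i, n);
    for (size_t i = n; i-- > 1; )
    {
        swap_int(&a[0], &a[i]);
        sift_down(a, 0, i);
    }
}

static void intro_rec(int *a, size_t n, int depth)
{
    while (n > 16)                      /* leave small runs for the end */
    {
        if (depth-- == 0)               /* partitioning went too deep */
        {
            heap_sort(a, n);
            return;
        }
        size_t mid = n / 2;             /* median-of-three pivot */
        if (a[mid] < a[0])      swap_int(&a[mid], &a[0]);
        if (a[n - 1] < a[0])    swap_int(&a[n - 1], &a[0]);
        if (a[n - 1] < a[mid])  swap_int(&a[n - 1], &a[mid]);
        int pivot = a[mid];

        size_t i = 0, j = n - 1;        /* Hoare partition */
        for (;;)
        {
            while (a[i] < pivot) i++;
            while (a[j] > pivot) j--;
            if (i >= j)
                break;
            swap_int(&a[i], &a[j]);
            i++; j--;
        }
        intro_rec(a, j + 1, depth);     /* recurse left, iterate right */
        a += j + 1;
        n -= j + 1;
    }
    for (size_t i = 1; i < n; i++)      /* insertion sort the small run */
    {
        int v = a[i];
        size_t j = i;
        while (j > 0 && a[j - 1] > v) { a[j] = a[j - 1]; j--; }
        a[j] = v;
    }
}

void introsort(int *a, size_t n)
{
    int depth = 0;
    for (size_t m = n; m > 1; m >>= 1)
        depth += 2;                     /* budget ~ 2 * log2(n) */
    intro_rec(a, n, depth);
}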

> I think.  I'm not too sure if the code isn't just being sloppy about the
> case where many data values are equal to the pivot --- there's a special
> case there to switch to insertion sort, and maybe that's getting invoked
> too soon.

Here are some cases known to make qsort go quadratic:
1. Data already sorted
2. Data reverse sorted
3. Data organ-pipe sorted or ramp
4. Almost all data of the same value
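
For testing, such inputs are easy to synthesize.  A sketch (the shapes
and constants are assumptions for illustration; "organ-pipe" data ramps
up and then back down, and a "ramp" is the monotone half):

#include <stddef.h>

static void fill_organ_pipe(int *a, size_t n)   /* case 3: 0,1,...,peak,...,1,0 */
{
    for (size_t i = 0; i < n; i++)
        a[i] = (int) (i < n / 2 ? i : n - 1 - i);
}

static void fill_mostly_equal(int *a, size_t n) /* case 4: ~99% duplicates */
{
    for (size_t i = 0; i < n; i++)
        a[i] = (i % 100 == 0) ? (int) i : 42;
}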

There are probably other cases.  Randomizing the pivot helps some, as
does checking for in-order or reverse-order partitions.

Imagine if 1/3 of the partitions fall into a category that causes
quadratic behavior (have one of the above formats and have more than
CUTOFF elements in them).

It is doubtful that the switch to insertion sort is causing any sort of
problems.  It is only going to be invoked on tiny sets, for which it has
a fixed cost that is probably less than qsort() function calls on sets
of the same size.

> It'd be useful to get a line-level profile of the behavior of
> this code in the slow cases...

I guess that my in-order or presorted tests [which often arise when
there are very few distinct values] may solve the bad partition
problems.  Don't forget that the algorithm is called recursively.

>   regards, tom lane


Re: [HACKERS] qsort again (was Re: [PERFORM] Strange Create Index

2006-02-17 Thread Tom Lane
Mark Lewis <[EMAIL PROTECTED]> writes:
> I think we're actually on the same page here; you're right that the
> constraint above ( f(a)==f(b) iff a==b ) can't be extended to data types
> with more than 32 bits of value space.  But the constraint I listed was
> actually:

> if a==b then f(a)==f(b)

I believe Martijn had it right: the important constraint is

f(a) > f(b) implies a > b

which implies by symmetry

f(a) < f(b) implies a < b

and these two together imply

a == b implies f(a) == f(b)

Now, you can't do any sorting if you only have the equality rule; you
need the inequality rule.
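
Spelling out the "together imply" step, the third rule is the
contrapositive of the first two combined; in LaTeX:

\begin{align*}
f(a) > f(b) &\implies a > b && \text{(given)}\\
f(a) < f(b) &\implies a < b && \text{(same rule with $a$, $b$ swapped)}\\
f(a) \neq f(b) &\implies a \neq b && \text{(one of the two cases applies)}\\
a = b &\implies f(a) = f(b) && \text{(contrapositive of the line above)}
\end{align*}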

regards, tom lane



Re: [HACKERS] qsort again (was Re: [PERFORM] Strange Create

2006-02-17 Thread Ragnar
On fös, 2006-02-17 at 08:01 -0500, Ron wrote:
> At 04:24 AM 2/17/2006, Ragnar wrote:
> >On fös, 2006-02-17 at 01:20 -0500, Ron wrote:
> > >
> > > OK, so here's _a_ way (there are others) to obtain a mapping such that
> > >   if a < b then f(a) < f(b) and
> > >   if a == b then f(a) == f(b)
> >
> > > By scanning the table once, we can map say 001h (Hex used to ease
> > > typing) to the row with the minimum value and 111h to the row
> > > with the maximum value as well as mapping everything in between to
> > > their appropriate keys.  That same scan can be used to assign a
> > > pointer to each record's location.
> >
> >This step is just as expensive as the original 
> >sort you want to replace/improve.
> 
> Why do you think that?  External sorts involve 
> the equivalent of multiple scans of the table to 
> be sorted, sometimes more than lgN (where N is 
> the number of items in the table to be 
> sorted).  Since this is physical IO we are 
> talking about, each scan is very expensive, and 
> therefore 1 scan is going to take considerably 
> less time than >= lgN scans will be.

Call me dim, but please explain exactly how you are going
to build this mapping in one scan. Are you assuming
the map will fit in memory? 

> 
> 
> >If you want to keep this mapping saved as a sort 
> >of an index, or as part of each row's data, this 
> >will make the cost of inserts and updates enormous.
> 
> Not sure you've got this right either.  Looks to 
> me like we are adding a <= 32b quantity to each 
> row.  Once we know the mapping, incrementally 
> updating it upon insert or update would seem to 
> be a simple matter of a fast search for the correct 
> ranking [Interpolation search, which we have all 
> the needed data for, is O(lglgN).  Hash based 
> search is O(1)]; plus an increment/decrement of 
> the key values greater/less than the key value of 
> the row being inserted / updated.  Given that we 
> are updating all the keys in a specific range 
> within a tree structure, that update can be done 
> in O(lgm) (where m is the number of records affected).

Say again ?
Let us say you have 1 billion rows, where the
column in question contains strings like 
baaaaaa
baaaaab
baaaaac
...
not necessarily in this order on disc of course

The minimum value would be keyed as 0001h,
the next one as 0002h and so on.

Now insert new value 'a'

Not only will you have to update 1 billion records,
but also all the values in your map.

please explain

gnari





Re: [HACKERS] qsort again (was Re: [PERFORM] Strange Create Index

2006-02-17 Thread Mark Lewis
On Thu, 2006-02-16 at 21:33 -0800, David Lang wrote:
> > In SQL_ASCII, just take the first 4 characters (or 8, if using a 64-bit
> > sortKey as elsewhere suggested).  The sorting key doesn't need to be a
> > one-to-one mapping.
> 
> that would violate your second constraint ( f(a)==f(b) iff (a==b) )
> 
> if you could drop that constraint (the cost of which would be extra 'real' 
> compares within a bucket) then a helper function per datatype could work 
> as you are talking.

I think we're actually on the same page here; you're right that the
constraint above ( f(a)==f(b) iff a==b ) can't be extended to data types
with more than 32 bits of value space.  But the constraint I listed was
actually:

if a==b then f(a)==f(b)

Which doesn't imply 'if and only if'.  It's a similar constraint to
hashcodes; the same value will always have the same hash, but you're not
guaranteed that the hashcodes for two distinct values will be unique.

-- Mark



Re: [HACKERS] qsort again (was Re: [PERFORM] Strange Create

2006-02-17 Thread Ron

At 10:53 AM 2/17/2006, Martijn van Oosterhout wrote:

> On Fri, Feb 17, 2006 at 08:23:40AM -0500, Ron wrote:
> > > For this mapping, you need a full table sort.
> > One physical IO pass should be all that's needed.  However, let's
> > pretend you are correct and that we do need to sort the table to get
> > the key mapping.  Even so, we would only need to do it =once= and
> > then we would be able to use and incrementally update the results
> > forever afterward.  Even under this assumption, one external sort to
> > save all subsequent such sorts seems well worth it.
> >
> > IOW, even if I'm wrong about the initial cost to do this; it is still
> > worth doing ;-)
>
> I think you're talking about something different here. You're thinking
> of having the whole table sorted and when you add a new value you add
> it in such a way to keep it sorted. The problem is, what do you sort it
> by? If you've sorted the table by col1, then when the user does ORDER
> BY col2 it's useless.

No, I'm thinking about how to represent DB row data in such a way that
a= we use a compact enough representation that we can sort internally 
rather than externally.
b= we do the sort once and avoid most of the overhead upon subsequent 
similar requests.


I used the example of sorting on the entire row to show that the 
approach works even when the original record being sorted by is very large.
All my previous comments on this topic hold for the case where we are 
sorting on only part of a row as well.


If all you are doing is sorting on a column or a few columns, what 
I'm discussing is even easier, since treating the columns actually 
being used as sort criteria as integers, rather than the whole row as 
an atomic unit, eats fewer resources during the key creation and 
mapping process.  If the row is 2-4KB in size, but we only care about 
some combination of columns that only takes on <= 2^8 or <= 2^16 
different values, then what I've described will be even better than 
the original example I gave.


Basically, the range of a key is going to be restricted by how
a= big the field is that represents the key (columns and such are 
usually kept narrow for performance reasons) or
b= big each row is (the more space each row takes, the fewer rows fit 
into any given amount of storage)
c= many rows there are in the table
Between these conditions, the range of a key tends to be severely 
restricted and therefore uses much less space than sorting the actual 
DB records would.  ...and that gives us something we can take advantage of.




> Indeed, this is what btrees do, you store the order of the table
> separate from the data. And you can store multiple orders. But even
> then, when someone does ORDER BY lower(col1), it's still useless.
>
> And you're right, we still need to do the single mass sort in the
> beginning, which is precisely what we're trying to optimise here.

Sigh.  My points were:
1= we have information available to us that allows us to map the rows 
in such a way as to turn most external sorts into internal sorts, 
thereby avoiding the entire external sorting problem in those 
cases.  This is a huge performance improvement.


2= that an external sort is =NOT= required for initial key 
assignment, but even if it was it would be worth it.


3= that this is a variation of a well known technique so I'm not 
suggesting heresy here.



Ron 






Re: [HACKERS] qsort again (was Re: [PERFORM] Strange Create Index

2006-02-17 Thread Martijn van Oosterhout
On Fri, Feb 17, 2006 at 08:18:41AM -0800, Scott Lamb wrote:
> On Feb 16, 2006, at 2:17 PM, Mark Lewis wrote:
> >Data types which could probably provide a useful function for f  
> >would be
> >int2, int4, oid, and possibly int8 and text (at least for SQL_ASCII).
> 
> ...and with some work, floats (I think just the exponent would work,  
> if nothing else). bytea. Probably just about anything.
> 
> Interesting. If you abandon the idea that collisions should be  
> impossible (they're not indexes) or extremely rare (they're not  
> hashes), it's pretty easy to come up with a decent hint to avoid a  
> lot of dereferences.

Yep, pretty much for any datatype you create a mapping function to map
it to a signed int32. All you have to guarantee is that f(a) > f(b)
implies that a > b. Only if f(a) == f(b) do you need to compare a and
b.

You then change the sorting code to have an array of (Datum,int32)
(ouch, double the storage) where the int32 is the f(Datum). And in the
comparison routines you first check the int32. If they give an order
you're done. On match you do the full comparison.

For integer types (int2,int4,int8,oid) the conversion is
straightforward. For float you'd use the exponent and the first few
bits of the mantissa. For strings you'd have to bail, or use a strxfrm
equivalent. NULL would be INT_MAX or INT_MIN depending on where you
want it. Thing is, even if you don't have such a function and always
return zero, the results will still be right.
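
A sketch of that scheme (the names are illustrative, not PostgreSQL's;
the datum is shrunk to a bare int64 so the example stays self-contained):

#include <stdint.h>

typedef struct
{
    int64_t datum;              /* the real value -- say, an int8 column */
    int32_t key;                /* f(datum); a constant 0 is also legal */
} SortItem;

static int32_t int64_sort_key(int64_t v)
{
    return (int32_t) (v >> 32); /* high half: order-preserving but lossy */
}

/* qsort()-style comparator: cheap inline test first, full compare on ties. */
static int sortitem_cmp(const void *pa, const void *pb)
{
    const SortItem *a = pa, *b = pb;

    if (a->key < b->key) return -1;
    if (a->key > b->key) return 1;
    /* Keys tie: only now pay for the real comparison. */
    return (a->datum > b->datum) - (a->datum < b->datum);
}

Computing key = int64_sort_key(datum) once per tuple keeps most
comparisons on the inline integer path, and a key function that returns
zero for everything degenerates to exactly the behavior described above.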

Not a new idea, but it would be very nice to implement. It would
produce nice speedups for types where comparisons are expensive. And
more importantly, the bulk of the comparisons can be moved inline and
make the whole cache-friendliness discussed here much more meaningful.

Have a nice day,
-- 
Martijn van Oosterhout  http://svana.org/kleptog/
> Patent. n. Genius is 5% inspiration and 95% perspiration. A patent is a
> tool for doing 5% of the work and then sitting around waiting for someone
> else to do the other 95% so you can sue them.




Re: [HACKERS] qsort again (was Re: [PERFORM] Strange Create Index

2006-02-17 Thread Scott Lamb
On Feb 16, 2006, at 2:17 PM, Mark Lewis wrote:
> Data types which could probably provide a useful function for f would be
> int2, int4, oid, and possibly int8 and text (at least for SQL_ASCII).

...and with some work, floats (I think just the exponent would work, if
nothing else). bytea. Probably just about anything.

Interesting. If you abandon the idea that collisions should be
impossible (they're not indexes) or extremely rare (they're not
hashes), it's pretty easy to come up with a decent hint to avoid a lot
of dereferences.

--
Scott Lamb

Re: [HACKERS] qsort again (was Re: [PERFORM] Strange Create

2006-02-17 Thread Martijn van Oosterhout
On Fri, Feb 17, 2006 at 08:23:40AM -0500, Ron wrote:
> >For this mapping, you need a full table sort.
> One physical IO pass should be all that's needed.  However, let's 
> pretend you are correct and that we do need to sort the table to get 
> the key mapping.  Even so, we would only need to do it =once= and 
> then we would be able to use and incrementally update the results 
> forever afterward.  Even under this assumption, one external sort to 
> save all subsequent such sorts seems well worth it.
> 
> IOW, even if I'm wrong about the initial cost to do this; it is still 
> worth doing ;-)

I think you're talking about something different here. You're thinking
of having the whole table sorted and when you add a new value you add
it in such a way to keep it sorted. The problem is, what do you sort it
by? If you've sorted the table by col1, then when the user does ORDER
BY col2 it's useless.

Indeed, this is what btrees do, you store the order of the table
separate from the data. And you can store multiple orders. But even
then, when someone does ORDER BY lower(col1), it's still useless.

And you're right, we still need to do the single mass sort in the
beginning, which is precisely what we're trying to optimise here.

> ??? You do not need to resort already ordered data to insert a new 
> element into it such that the data stays ordered!  Once we have done 
> the key ordering operation once, we should not ever need to do it 
> again on the original data.  Else most sorting algorithms wouldn't work ;-)

We already do this with btree indexes. I'm not sure what you are
proposing that improves on that.

Have a nice day,
-- 
Martijn van Oosterhout  http://svana.org/kleptog/
> Patent. n. Genius is 5% inspiration and 95% perspiration. A patent is a
> tool for doing 5% of the work and then sitting around waiting for someone
> else to do the other 95% so you can sue them.




Re: [HACKERS] qsort again (was Re: [PERFORM] Strange Create

2006-02-17 Thread Ron

At 05:19 AM 2/17/2006, Markus Schaber wrote:

> Hi, Ron,
>
> Ron schrieb:
>
> > OK, so here's _a_ way (there are others) to obtain a mapping such that
> >  if a < b then f(a) < f(b) and
> >  if a == b then f(a) == f(b)
> >
> > Pretend each row is an integer of row size (so a 2KB row becomes a 16Kb
> > integer; a 4KB row becomes a 32Kb integer; etc)
> > Since even a 1TB table made of such rows can only have 256M - 512M
> > possible values even if each row is unique, a 28b or 29b key is large
> > enough to represent each row's value and relative rank compared to all
> > of the others even if all row values are unique.
> >
> > By scanning the table once, we can map say 001h (Hex used to ease
> > typing) to the row with the minimum value and 111h to the row with
> > the maximum value as well as mapping everything in between to their
> > appropriate keys.  That same scan can be used to assign a pointer to
> > each record's location.
>
> But with a single linear scan, this cannot be accomplished, as the table
> contents are neither sorted nor distributed linearly between the minimum
> and the maximum.

So what?  We are talking about key assignment here, not anything that 
requires physically manipulating the actual DB rows.

One physical IO pass should be all that's needed.



> For this mapping, you need a full table sort.

One physical IO pass should be all that's needed.  However, let's 
pretend you are correct and that we do need to sort the table to get 
the key mapping.  Even so, we would only need to do it =once= and 
then we would be able to use and incrementally update the results 
forever afterward.  Even under this assumption, one external sort to 
save all subsequent such sorts seems well worth it.


IOW, even if I'm wrong about the initial cost to do this; it is still 
worth doing ;-)




> > That initial scan to set up the keys is expensive, but if we wish that
> > cost can be amortized over the life of the table so we don't have to pay
> > it all at once.  In addition, once we have created those keys, they can
> > be saved for later searches and sorts.
>
> But for every update or insert, you have to resort the keys, which is
> _very_ expensive as it basically needs to update a huge part of the table.

??? You do not need to resort already ordered data to insert a new 
element into it such that the data stays ordered!  Once we have done 
the key ordering operation once, we should not ever need to do it 
again on the original data.  Else most sorting algorithms wouldn't work ;-)



Ron 






Re: [HACKERS] qsort again (was Re: [PERFORM] Strange Create

2006-02-17 Thread Ron

At 04:24 AM 2/17/2006, Ragnar wrote:

> On fös, 2006-02-17 at 01:20 -0500, Ron wrote:
> >
> > OK, so here's _a_ way (there are others) to obtain a mapping such that
> >   if a < b then f(a) < f(b) and
> >   if a == b then f(a) == f(b)
> >
> > By scanning the table once, we can map say 001h (Hex used to ease
> > typing) to the row with the minimum value and 111h to the row
> > with the maximum value as well as mapping everything in between to
> > their appropriate keys.  That same scan can be used to assign a
> > pointer to each record's location.
>
> This step is just as expensive as the original
> sort you want to replace/improve.


Why do you think that?  External sorts involve 
the equivalent of multiple scans of the table to 
be sorted, sometimes more than lgN (where N is 
the number of items in the table to be 
sorted).  Since this is physical IO we are 
talking about, each scan is very expensive, and 
therefore 1 scan is going to take considerably 
less time than >= lgN scans will be.



> If you want to keep this mapping saved as a sort
> of an index, or as part of each row's data, this
> will make the cost of inserts and updates enormous.

Not sure you've got this right either.  Looks to 
me like we are adding a <= 32b quantity to each 
row.  Once we know the mapping, incrementally 
updating it upon insert or update would seem to 
be a simple matter of a fast search for the correct 
ranking [Interpolation search, which we have all 
the needed data for, is O(lglgN).  Hash based 
search is O(1)]; plus an increment/decrement of 
the key values greater/less than the key value of 
the row being inserted / updated.  Given that we 
are updating all the keys in a specific range 
within a tree structure, that update can be done 
in O(lgm) (where m is the number of records affected).



> > We can now sort the key+pointer pairs instead of the actual data and
> > use an optional final pass to rearrange the actual rows if we wish.
>
> How are you suggesting this mapping be accessed?
> If the mapping is kept separate from the tuple
> data, as in an index, then how will you look up the key?

???  We've effectively created a data set where 
each record is a pointer to a DB row plus its 
key.  We can now sort the data set by key and 
then do an optional final pass to rearrange the 
actual DB rows if we so wish.  Since that final 
pass is very expensive, it is good that not all 
use scenarios will need that final pass.


The amount of storage required to sort this 
representation of the table rather than the 
actual table is so much less that it turns an 
external sorting problem into an internal sorting 
problem with an optional final pass that is =1= 
scan (albeit one scan with a lot of seeks and 
data movement).  This is a big win.  It is a 
variation of a well known technique.  See Sedgewick, Knuth, etc.




> > That initial scan to set up the keys is expensive, but if we wish
> > that cost can be amortized over the life of the table so we don't
> > have to pay it all at once.  In addition, once we have created those
> > keys, they can be saved for later searches and sorts.
>
> What is the use case where this would work better than a
> regular btree index ?

Again, ???  btree indexes address different 
issues.  They do not in any way help create a 
compact data representation of the original data 
that saves enough space so as to turn an external 
ranking or sorting problem into an internal one.



Ron 






Re: [HACKERS] qsort again (was Re: [PERFORM] Strange Create

2006-02-17 Thread PFC


	Has anybody got some profiler data on the amount of time spent in  
comparisons during a sort ? Say, the proposals here would give the most  
gains on simple types like INTEGER ; so it would be interesting to know  
how much time is now spent in comparisons for sorting a column of ints. If  
it's like, 10% of the total time, well...


More hand-waving :

What are the usage case for sorts ?

	- potentially huge data sets : create index, big joins, reporting queries  
etc.
	- small data sets : typically, a query with an ORDER BY which will return  
a small amount of rows (website stuff), or joins not small enough to use a  
HashAggregate, but not big enough to create an index just for them.


	The cost of a comparison vs. moving stuff around and fetching stuff is  
probably very different in these two cases. If it all neatly fits in  
sort_mem, you can do fancy stuff (like sorting on SortKeys) which will  
need to access the data in almost random order when time comes to hand the  
sorted data back. So, I guess the SortKey idea would rather apply to the  
latter case only, which is CPU limited.


	Anyway, I was wondering about queries with multipart keys, like ORDER BY  
zipcode, client_name, date and the like. Using just an int64 as the key  
isn't going to help a lot here. Why not use a binary string of limited  
length ? I'd tend to think it would not be that slower than comparing  
ints, and it would be faster than calling each comparison function for  
each column. Each key part would get assigned to a byte range in the  
string.
	It would waste some memory, but for instance, using 2x the memory for  
half the time would be a good tradeoff if the amount of memory involved is  
in the order of megabytes.
	Also, it would solve the problem of locales. Comparisons involving  
locales are slow, but strings can be converted to a canonical form  
(removing accents and stuff), and then binary sorted.
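
	The C library already exposes this canonicalize-once idea: strxfrm()
transforms a string so that a plain strcmp() on the transformed copies
agrees with strcoll() on the originals.  A small sketch (assuming the
buffers are big enough and a locale taken from the environment):

#include <locale.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
    char xa[64], xb[64];                /* assumed ample for the demo */

    setlocale(LC_COLLATE, "");          /* collation rules from environment */

    /* One locale-aware transform per string... */
    strxfrm(xa, "cote", sizeof xa);
    strxfrm(xb, "côte", sizeof xb);

    /* ...then cheap byte compares: same sign as strcoll("cote", "côte"). */
    printf("%d\n", strcmp(xa, xb) < 0);
    return 0;
}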


	Also I'll insert a plug for the idea that the Sort needs to know if there  
will be a LIMIT afterwards ; this way it could reduce its working set by  
simply discarding the rows which would have been discarded anyway by the  
LIMIT. Say you want the 100 first rows out of a million ordered rows. If  
the sort knows this, it can be performed in the amount of memory for a 100  
rows sort.
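
	A sketch of that bounded-memory idea on ints (the names and fixed k
are assumptions for illustration): keep the k best rows seen so far in a
max-heap, so a losing row costs one comparison and a winning row O(log k).

#include <stddef.h>

static void sift_down_max(int *h, size_t i, size_t n)
{
    for (;;)
    {
        size_t c = 2 * i + 1;

        if (c >= n)
            break;
        if (c + 1 < n && h[c + 1] > h[c])
            c++;                        /* pick the larger child */
        if (h[i] >= h[c])
            break;
        int t = h[i]; h[i] = h[c]; h[c] = t;
        i = c;
    }
}

/* Returns how many values were kept; heap[] ends up holding the m
 * smallest of rows[0..n-1], unsorted (sort those m afterwards). */
static size_t top_k_smallest(const int *rows, size_t n, int *heap, size_t k)
{
    size_t m = (n < k) ? n : k;

    for (size_t i = 0; i < m; i++)      /* seed with the first k rows */
        heap[i] = rows[i];
    for (size_t i = m / 2; i-- > 0; )   /* heapify: root = worst survivor */
        sift_down_max(heap, i, m);
    for (size_t i = m; i < n; i++)
        if (rows[i] < heap[0])          /* beats the current worst kept row */
        {
            heap[0] = rows[i];
            sift_down_max(heap, 0, m);
        }
    return m;
}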





Re: [HACKERS] qsort again (was Re: [PERFORM] Strange Create

2006-02-17 Thread Markus Schaber
Hi, Ron,

Ron schrieb:

> OK, so here's _a_ way (there are others) to obtain a mapping such that
>  if a < b then f(a) < f(b) and
>  if a == b then f(a) == f(b)
> 
> Pretend each row is an integer of row size (so a 2KB row becomes a 16Kb
> integer; a 4KB row becomes a 32Kb integer; etc)
> Since even a 1TB table made of such rows can only have 256M - 512M
> possible values even if each row is unique, a 28b or 29b key is large
> enough to represent each row's value and relative rank compared to all
> of the others even if all row values are unique.
> 
> By scanning the table once, we can map say 001h (Hex used to ease
> typing) to the row with the minimum value and 111h to the row with
> the maximum value as well as mapping everything in between to their
> appropriate keys.  That same scan can be used to assign a pointer to
> each record's location.

But with a single linear scan, this cannot be accomplished, as the table
contents are neither sorted nor distributed linearly between the minimum
and the maximum.

For this mapping, you need a full table sort.

> That initial scan to set up the keys is expensive, but if we wish that
> cost can be amortized over the life of the table so we don't have to pay
> it all at once.  In addition, once we have created those keys, they can
> be saved for later searches and sorts.

But for every update or insert, you have to resort the keys, which is
_very_ expensive as it basically needs to update a huge part of the table.

Markus



Re: [HACKERS] qsort again (was Re: [PERFORM] Strange Create Index

2006-02-17 Thread Markus Schaber
Hi, David,

David Lang schrieb:

>> In SQL_ASCII, just take the first 4 characters (or 8, if using a 64-bit
>> sortKey as elsewhere suggested).  The sorting key doesn't need to be a
>> one-to-one mapping.

> that would violate your second constraint ( f(a)==f(b) iff (a==b) )

no, it doesn't.

When both strings are equal, then the first characters are equal, too.

If they are not equal, the constraint condition does not match.

The first characters of the strings may be equal, just as f(a) may be
equal to f(b) under the other constraint.

Markus



Re: [HACKERS] qsort again (was Re: [PERFORM] Strange Create

2006-02-17 Thread Ragnar
On fös, 2006-02-17 at 01:20 -0500, Ron wrote:
> At 01:47 PM 2/16/2006, Ron wrote:
> >At 12:19 PM 2/16/2006, Scott Lamb wrote:
> >>On Feb 16, 2006, at 8:32 AM, Ron wrote:
> >>>Let's pretend that we have the typical DB table where rows are
> >>>~2-4KB apiece.  1TB of storage will let us have 256M-512M rows in
> >>>such a table.
> >>>
> >>>A 32b hash code can be assigned to each row value such that only
> >>>exactly equal rows will have the same hash code.
> >>>A 32b pointer can locate any of the 256M-512M rows.
> >>>
> >>>Now instead of sorting 1TB of data we can sort 2^28 to 2^29 32b 
> >>>+32b= 64b*(2^28 to 2^29)=  2-4GB of pointers+keys followed by an
> >>>optional pass to rearrange the actual rows if we so wish.
> >>
> >>I don't understand this.
> >>
> >>This is a true statement: (H(x) != H(y)) => (x != y)
> >>This is not: (H(x) < H(y)) => (x < y)
> >>
> >>Hash keys can tell you there's an inequality, but they can't tell you
> >>how the values compare. If you want 32-bit keys that compare in the
> >>same order as the original values, here's how you have to get them:
> >For most hash codes, you are correct.  There is a class of hash or 
> >hash-like codes that maintains the mapping to support that second statement.
> >
> >More later when I can get more time.
> >Ron
> 
> OK, so here's _a_ way (there are others) to obtain a mapping such that
>   if a < b then f(a) < f(b) and
>   if a == b then f(a) == f(b)

> By scanning the table once, we can map say 001h (Hex used to ease 
> typing) to the row with the minimum value and 111h to the row 
> with the maximum value as well as mapping everything in between to 
> their appropriate keys.  That same scan can be used to assign a 
> pointer to each record's location.

This step is just as expensive as the original sort you
want to replace/improve. If you want to keep this mapping
saved as a sort of an index, or as part of each row's data, this will make
the cost of inserts and updates enormous.

> 
> We can now sort the key+pointer pairs instead of the actual data and 
> use an optional final pass to rearrange the actual rows if we wish.

How are you suggesting this mapping be accessed? If the 
mapping is kept separate from the tuple data, as in an index, then how
will you look up the key?

> That initial scan to set up the keys is expensive, but if we wish 
> that cost can be amortized over the life of the table so we don't 
> have to pay it all at once.  In addition, once we have created those 
> keys, they can be saved for later searches and sorts.

What is the use case where this would work better than a 
regular btree index ?

gnari





Re: [HACKERS] qsort again (was Re: [PERFORM] Strange Create

2006-02-16 Thread Ron

At 01:47 PM 2/16/2006, Ron wrote:

> At 12:19 PM 2/16/2006, Scott Lamb wrote:
> > On Feb 16, 2006, at 8:32 AM, Ron wrote:
> > > Let's pretend that we have the typical DB table where rows are
> > > ~2-4KB apiece.  1TB of storage will let us have 256M-512M rows in
> > > such a table.
> > >
> > > A 32b hash code can be assigned to each row value such that only
> > > exactly equal rows will have the same hash code.
> > > A 32b pointer can locate any of the 256M-512M rows.
> > >
> > > Now instead of sorting 1TB of data we can sort 2^28 to 2^29 32b
> > > +32b= 64b*(2^28 to 2^29)=  2-4GB of pointers+keys followed by an
> > > optional pass to rearrange the actual rows if we so wish.
> >
> > I don't understand this.
> >
> > This is a true statement: (H(x) != H(y)) => (x != y)
> > This is not: (H(x) < H(y)) => (x < y)
> >
> > Hash keys can tell you there's an inequality, but they can't tell you
> > how the values compare. If you want 32-bit keys that compare in the
> > same order as the original values, here's how you have to get them:
>
> For most hash codes, you are correct.  There is a class of hash or
> hash-like codes that maintains the mapping to support that second
> statement.
>
> More later when I can get more time.
> Ron


OK, so here's _a_ way (there are others) to obtain a mapping such that
 if a < b then f(a) < f(b) and
 if a == b then f(a) == f(b)

Pretend each row is an integer of row size (so a 2KB row becomes a 
16Kb integer; a 4KB row becomes a 32Kb integer; etc)
Since even a 1TB table made of such rows can only have 256M - 512M 
possible values even if each row is unique, a 28b or 29b key is large 
enough to represent each row's value and relative rank compared to 
all of the others even if all row values are unique.


By scanning the table once, we can map say 001h (Hex used to ease 
typing) to the row with the minimum value and 111h to the row 
with the maximum value as well as mapping everything in between to 
their appropriate keys.  That same scan can be used to assign a 
pointer to each record's location.


We can now sort the key+pointer pairs instead of the actual data and 
use an optional final pass to rearrange the actual rows if we wish.


That initial scan to set up the keys is expensive, but if we wish 
that cost can be amortized over the life of the table so we don't 
have to pay it all at once.  In addition, once we have created those 
keys, they can be saved for later searches and sorts.


Further space savings can be obtained whenever there are duplicate 
keys and/or when compression methods are used on the Key+pointer pairs.


Ron









Re: [HACKERS] qsort again (was Re: [PERFORM] Strange Create Index

2006-02-16 Thread David Lang

On Thu, 16 Feb 2006, Mark Lewis wrote:


> On Thu, 2006-02-16 at 17:51 -0500, Greg Stark wrote:
> > > Data types which could probably provide a useful function for f would be
> > > int2, int4, oid, and possibly int8 and text (at least for SQL_ASCII).
> >
> > How exactly do you imagine doing this for text?
> >
> > I could see doing it for char(n)/varchar(n) where n<=4 in SQL_ASCII though.
>
> In SQL_ASCII, just take the first 4 characters (or 8, if using a 64-bit
> sortKey as elsewhere suggested).  The sorting key doesn't need to be a
> one-to-one mapping.

that would violate your second constraint ( f(a)==f(b) iff (a==b) )

if you could drop that constraint (the cost of which would be extra 'real' 
compares within a bucket) then a helper function per datatype could work 
as you are talking.


David Lang



Re: [HACKERS] qsort again (was Re: [PERFORM] Strange Create Index

2006-02-16 Thread Steinar H. Gunderson
On Fri, Feb 17, 2006 at 12:05:23AM +0100, PFC wrote:
>   I would have said a 64 bit int, but it's the same idea. However it 
>   won't  work for floats, which is a pity, because floats fit in 64 bits. 

Actually, you can compare IEEE floats directly as ints, as long as they're
positive. (If they can be both positive and negative, you need to take
special care of the sign bit first, but it's still doable.)
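
A sketch of that bit trick for IEEE-754 single precision (NaNs excluded;
note -0.0 lands just below +0.0):

#include <stdint.h>
#include <string.h>

/* Map a float to a uint32 whose unsigned order matches the float order. */
static uint32_t float_order_key(float f)
{
    uint32_t u;

    memcpy(&u, &f, sizeof u);           /* well-defined type pun */
    if (u & 0x80000000u)
        return ~u;                      /* negative: flip all bits */
    return u | 0x80000000u;             /* positive: set the sign bit */
}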

/* Steinar */
-- 
Homepage: http://www.sesse.net/



Re: [HACKERS] qsort again (was Re: [PERFORM] Strange Create Index

2006-02-16 Thread Markus Schaber
Hi, PFC,

PFC schrieb:

> By the way, I'd like to declare my zipcode columns as SQL_ASCII
> while the  rest of my database is in UNICODE, so they are faster to
> index and sort.  Come on, MySQL does it...

Another use case for parametric column definitions - charset definitions
- and the first one that cannot be emulated via constraints.

Other use cases I remember were range definitions for numbers or PostGIS
dimension, subtype and SRID, but those can all be emulated via checks /
constraints.

Markus



Re: [HACKERS] qsort again (was Re: [PERFORM] Strange Create Index

2006-02-16 Thread Mark Lewis
On Thu, 2006-02-16 at 17:51 -0500, Greg Stark wrote:
> > > Data types which could probably provide a useful function for f would be
> > > int2, int4, oid, and possibly int8 and text (at least for SQL_ASCII).
> 
> How exactly do you imagine doing this for text?
> 
> I could see doing it for char(n)/varchar(n) where n<=4 in SQL_ASCII though.


In SQL_ASCII, just take the first 4 characters (or 8, if using a 64-bit
sortKey as elsewhere suggested).  The sorting key doesn't need to be a
one-to-one mapping.
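
A sketch of that first-4-characters key for SQL_ASCII bytes (illustrative;
equal keys just mean "fall back to the full string comparison"):

#include <stdint.h>

/* Pack the first 4 bytes big-endian so the key never contradicts strcmp:
 * if key(a) < key(b) then strcmp(a, b) < 0; short strings pad with 0x00. */
static int32_t text_sort_key(const char *s)
{
    uint32_t k = 0;

    for (int i = 0; i < 4; i++)
    {
        k <<= 8;
        if (*s)
            k |= (unsigned char) *s++;
    }
    return (int32_t) (k ^ 0x80000000u); /* recenter: unsigned -> signed order */
}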

-- Mark Lewis



Re: [HACKERS] qsort again (was Re: [PERFORM] Strange Create Index

2006-02-16 Thread PFC




> It seems that instead of maintaining a different sorting code path for
> each data type, you could get away with one generic path and one
> (hopefully faster) path if you allowed data types to optionally support
> a 'sortKey' interface by providing a function f which maps inputs to 32-
> bit int outputs, such that the following two properties hold:


	Looks like the decorate-sort-undecorate pattern, which works quite well.  
Good idea.
	I would have said a 64 bit int, but it's the same idea. However it won't  
work for floats, which is a pity, because floats fit in 64 bits. Unless  
more types creep into the code path (which would not necessarily make it  
that much slower).
	As for text, the worst case is when all strings start with the same 8  
letters, but a good case pops up when a few-letter code is used as a key  
in a table. Think about a zipcode, for instance. If a merge join needs to  
sort on zipcodes, it might as well sort on 64-bits integers...
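
	A sketch of that packing with made-up field widths -- a 5-digit
zipcode plus a day-number date, laid out big-endian so a single memcmp()
reproduces ORDER BY zipcode, date:

#include <stdint.h>
#include <string.h>

typedef struct { char zip[5]; uint32_t date; } Row;    /* hypothetical row */

/* Key layout: bytes 0-4 zipcode (ASCII digits), bytes 5-8 date, most
 * significant byte first so byte order equals numeric order. */
static void make_key(unsigned char key[9], const Row *r)
{
    memcpy(key, r->zip, 5);
    key[5] = (unsigned char) (r->date >> 24);
    key[6] = (unsigned char) (r->date >> 16);
    key[7] = (unsigned char) (r->date >> 8);
    key[8] = (unsigned char) r->date;
}

static int key_compare(const unsigned char a[9], const unsigned char b[9])
{
    return memcmp(a, b, 9);     /* one call instead of a comparison
                                 * function call per column */
}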


	By the way, I'd like to declare my zipcode columns as SQL_ASCII while the  
rest of my database is in UNICODE, so they are faster to index and sort.  
Come on, MySQL does it...


Keep up !



Re: [HACKERS] qsort again (was Re: [PERFORM] Strange Create Index

2006-02-16 Thread Greg Stark
Markus Schaber <[EMAIL PROTECTED]> writes:

> Hmm, to remove redundancy, I'd change the <= to a < and define:
> 
> if a==b then f(a)==f(b)
> if a<b then f(a)<=f(b)
> 
> > Data types which could probably provide a useful function for f would be
> > int2, int4, oid, and possibly int8 and text (at least for SQL_ASCII).

How exactly do you imagine doing this for text?

I could see doing it for char(n)/varchar(n) where n<=4 in SQL_ASCII though.

-- 
greg




Re: [HACKERS] qsort again (was Re: [PERFORM] Strange Create Index

2006-02-16 Thread Martijn van Oosterhout
On Thu, Feb 16, 2006 at 02:17:36PM -0800, Mark Lewis wrote:
> It seems that instead of maintaining a different sorting code path for
> each data type, you could get away with one generic path and one
> (hopefully faster) path if you allowed data types to optionally support
> a 'sortKey' interface by providing a function f which maps inputs to 32-
> bit int outputs, such that the following two properties hold:
> 
> f(a)>=f(b) iff a>=b
> if a==b then f(a)==f(b)

Note this is a property of the collation, not the type. For example
strings can be sorted in many ways and the sortKey must reflect that.
So in postgres terms it's a property of the btree operator class.

It's something I'd like to do if I get A Round Tuit. :)

Have a nice day,
-- 
Martijn van Oosterhout  http://svana.org/kleptog/
> Patent. n. Genius is 5% inspiration and 95% perspiration. A patent is a
> tool for doing 5% of the work and then sitting around waiting for someone
> else to do the other 95% so you can sue them.




Re: [HACKERS] qsort again (was Re: [PERFORM] Strange Create Index

2006-02-16 Thread Markus Schaber
Hi, Mark,

Mark Lewis schrieb:

> It seems that instead of maintaining a different sorting code path for
> each data type, you could get away with one generic path and one
> (hopefully faster) path if you allowed data types to optionally support
> a 'sortKey' interface by providing a function f which maps inputs to 32-
> bit int outputs, such that the following two properties hold:
> 
> f(a)>=f(b) iff a>=b
> if a==b then f(a)==f(b)

Hmm, to remove redundancy, I'd change the <= to a < and define:

if a==b then f(a)==f(b)
if a<b then f(a)<=f(b)

> Data types which could probably provide a useful function for f would be
> int2, int4, oid, and possibly int8 and text (at least for SQL_ASCII).

With int2 or some restricted ranges of oid and int4, we could even
implement a bucket sort.

Markus



Re: [HACKERS] qsort again (was Re: [PERFORM] Strange Create Index

2006-02-16 Thread Mark Lewis
On Thu, 2006-02-16 at 12:15 -0500, Tom Lane wrote:
> Once or twice we've kicked around the idea of having some
> datatype-specific sorting code paths alongside the general-purpose one,
> but I can't honestly see this as being workable from a code maintenance
> standpoint.
> 
>   regards, tom lane


It seems that instead of maintaining a different sorting code path for
each data type, you could get away with one generic path and one
(hopefully faster) path if you allowed data types to optionally support
a 'sortKey' interface by providing a function f which maps inputs to 32-
bit int outputs, such that the following two properties hold:

f(a)>=f(b) iff a>=b
if a==b then f(a)==f(b)

So if a data type supports the sortKey interface you could perform the
sort on f(value) and only refer back to the actual element comparison
functions when two sortKeys have the same value.

Data types which could probably provide a useful function for f would be
int2, int4, oid, and possibly int8 and text (at least for SQL_ASCII).

Depending on the overhead, you might not even need to maintain 2
independent sort code paths, since you could always use f(x)=0 as the
default sortKey function, which would degenerate to the exact same sort
behavior in use today.

-- Mark Lewis



Re: [HACKERS] qsort again (was Re: [PERFORM] Strange Create

2006-02-16 Thread Ron

At 12:19 PM 2/16/2006, Scott Lamb wrote:

> On Feb 16, 2006, at 8:32 AM, Ron wrote:
> > Let's pretend that we have the typical DB table where rows are
> > ~2-4KB apiece.  1TB of storage will let us have 256M-512M rows in
> > such a table.
> >
> > A 32b hash code can be assigned to each row value such that only
> > exactly equal rows will have the same hash code.
> > A 32b pointer can locate any of the 256M-512M rows.
> >
> > Now instead of sorting 1TB of data we can sort 2^28 to 2^29 32b
> > +32b= 64b*(2^28 to 2^29)=  2-4GB of pointers+keys followed by an
> > optional pass to rearrange the actual rows if we so wish.
>
> I don't understand this.
>
> This is a true statement: (H(x) != H(y)) => (x != y)
> This is not: (H(x) < H(y)) => (x < y)
>
> Hash keys can tell you there's an inequality, but they can't tell you
> how the values compare. If you want 32-bit keys that compare in the
> same order as the original values, here's how you have to get them:

For most hash codes, you are correct.  There is a class of hash or 
hash-like codes that maintains the mapping to support that second statement.


More later when I can get more time.
Ron 






Re: [HACKERS] qsort again (was Re: [PERFORM] Strange Create

2006-02-16 Thread Scott Lamb

On Feb 16, 2006, at 8:32 AM, Ron wrote:
> Let's pretend that we have the typical DB table where rows are
> ~2-4KB apiece.  1TB of storage will let us have 256M-512M rows in
> such a table.
>
> A 32b hash code can be assigned to each row value such that only
> exactly equal rows will have the same hash code.
>
> A 32b pointer can locate any of the 256M-512M rows.
>
> Now instead of sorting 1TB of data we can sort 2^28 to 2^29 32b
> +32b= 64b*(2^28 to 2^29)=  2-4GB of pointers+keys followed by an
> optional pass to rearrange the actual rows if we so wish.

I don't understand this.

This is a true statement: (H(x) != H(y)) => (x != y)
This is not: (H(x) < H(y)) => (x < y)

Hash keys can tell you there's an inequality, but they can't tell you
how the values compare. If you want 32-bit keys that compare in the
same order as the original values, here's how you have to get them:

(1) sort the values into an array
(2) use each value's array index as its key

It reduces to the problem you're trying to use it to solve.

--
Scott Lamb





Re: [HACKERS] qsort again (was Re: [PERFORM] Strange Create Index behaviour)

2006-02-16 Thread Tom Lane
"Gary Doades" <[EMAIL PROTECTED]> writes:
> I think the reason I wasn't seeing performance issues with normal sort
> operations is because they use work_mem not maintenance_work_mem which was
> only set to 2048 anyway. Does that sound right?

Very probable.  Do you want to test the theory by jacking that up?  ;-)

regards, tom lane



Re: [HACKERS] qsort again (was Re: [PERFORM] Strange Create Index

2006-02-16 Thread Tom Lane
"Craig A. James" <[EMAIL PROTECTED]> writes:
> You can also use this trick when the optimizer is asked for "fastest first 
> result."  Say you have a cursor on a column of numbers with good 
> distribution.  If you do a bucket sort on the first two or three digits only, 
> you know the first "page" of results will be in the first bucket.  So you 
> only need to apply qsort to that first bucket (which is very fast, since it's 
> small), and you can deliver the first page of data to the application.  This 
> can be particularly effective in interactive situations, where the user 
> typically looks at a few pages of data and then abandons the search.  

> I doubt this is very relevant to Postgres.  A relational database has to be 
> general purpose, and it's hard to give it "hints" that would tell it when to 
> use this particular optimization.

Actually, LIMIT does nicely for that hint; the PG planner has definitely
got a concept of preferring fast-start plans for limited queries.  The
real problem in applying bucket-sort ideas is the lack of any
datatype-independent way of setting up the buckets.

Once or twice we've kicked around the idea of having some
datatype-specific sorting code paths alongside the general-purpose one,
but I can't honestly see this as being workable from a code maintenance
standpoint.
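
For what it's worth, the mechanics Craig describes are simple when the
keys are plain fixed-width integers (an illustrative sketch with made-up
names; as noted above, there is no datatype-independent way to build
these buckets in PostgreSQL):

#include <stdlib.h>
#include <stdint.h>

static int
cmp_u32(const void *a, const void *b)
{
    uint32_t    x = *(const uint32_t *) a;
    uint32_t    y = *(const uint32_t *) b;

    return (x > y) - (x < y);
}

/* Bucket on the high byte, then fully sort only the first non-empty
 * bucket: its contents are exactly the globally smallest values, so
 * the first "page" of results can be delivered without sorting the
 * rest.  Returns the number of values placed in out[]. */
static size_t
first_bucket_sorted(const uint32_t *in, size_t n, uint32_t *out)
{
    size_t      counts[256] = {0};
    size_t      i, m = 0;
    unsigned    b = 0;

    for (i = 0; i < n; i++)
        counts[in[i] >> 24]++;
    while (b < 256 && counts[b] == 0)
        b++;
    if (b == 256)
        return 0;               /* empty input */
    for (i = 0; i < n; i++)
        if ((in[i] >> 24) == b)
            out[m++] = in[i];
    qsort(out, m, sizeof(uint32_t), cmp_u32);
    return m;
}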

regards, tom lane

---(end of broadcast)---
TIP 2: Don't 'kill -9' the postmaster


Re: [HACKERS] qsort again (was Re: [PERFORM] Strange Create

2006-02-16 Thread Martijn van Oosterhout
On Thu, Feb 16, 2006 at 11:32:55AM -0500, Ron wrote:
> At 10:52 AM 2/16/2006, Ron wrote:
> >In fact we can do better.
> >Using hash codes or what-not to map datums to keys and then sorting 
> >just the keys and the pointers to those datums followed by an 
> >optional final pass where we do the actual data movement is also a 
> >standard technique for handling large data structures.

Or in fact required if the Datums are not all the same size, which is
the case in PostgreSQL.

> I thought some follow up might be in order here.
> 
> Let's pretend that we have the typical DB table where rows are ~2-4KB 
> apiece.  1TB of storage will let us have 256M-512M rows in such a table.
> 
> A 32b hash code can be assigned to each row value such that only 
> exactly equal rows will have the same hash code.
> A 32b pointer can locate any of the 256M-512M rows.

That hash code is impossible the way you state it, since the set of
strings is far larger than 2^32 and so cannot be mapped injectively to
a 32bit integer: by the pigeonhole principle, some unequal rows must
share a hash. You probably meant that a hash code can be assigned such
that equal rows have equal hashes (drop the "only").

> Now instead of sorting 1TB of data we can sort 2^28 to 2^29 entries
> of 32b+32b = 64b apiece, i.e. 2-4GB of pointers+keys, followed by an
> optional pass to rearrange the actual rows if we so wish.
> 
> We get the same result while only examining and manipulating 1/50 to 
> 1/25 as much data during the sort.

But this is what we do now. The tuples are loaded, we sort an array of
pointers, then we write the output. Except we don't have the hash, so
we require access to the 1TB of data to do the actual comparisons. Even
if we did have the hash, we'd *still* need access to the data to handle
tie-breaks.

That's why your comment about moves always being more expensive than
compares makes no sense. A move can be achieved simply by swapping two
pointers in the array. A compare actually needs to call all sorts of
functions. If and only if we have functions for every data type to
produce an ordered hash, we can optimise sorts based on single
integers.

For reference, look at comparetup_heap(). It's just 20 lines, but each
function call there expands to maybe a dozen lines of code. And it has
a loop. I don't think we're anywhere near the stage where locality of
reference makes much difference.

We very rarely need the tuples actualised in memory in the required
order; just the pointers are enough.
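
As a sketch of what such a two-stage comparator could look like
(illustrative C, assuming a hypothetical 32-bit order-preserving code
per tuple; comparetup_heap() does nothing of the sort today):

#include <stdint.h>
#include <string.h>

typedef struct SortTuple
{
    uint32_t    code;   /* hypothetical order-preserving 32-bit code */
    const char *tuple;  /* the actual datum stays where it is */
} SortTuple;

static int
cmp_sorttuple(const void *a, const void *b)
{
    const SortTuple *x = a;
    const SortTuple *y = b;

    /* cheap register-only comparison settles most pairs... */
    if (x->code != y->code)
        return (x->code > y->code) ? 1 : -1;

    /* ...and only ties on the code pay for the expensive full
     * comparison (strcmp here; fmgr calls in the real thing) */
    return strcmp(x->tuple, y->tuple);
}

A qsort(array, n, sizeof(SortTuple), cmp_sorttuple) would then touch
the tuples themselves only when two codes collide.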

Have a nice day,
-- 
Martijn van Oosterhout  http://svana.org/kleptog/
> Patent. n. Genius is 5% inspiration and 95% perspiration. A patent is a
> tool for doing 5% of the work and then sitting around waiting for someone
> else to do the other 95% so you can sue them.




Re: [HACKERS] qsort again (was Re: [PERFORM] Strange Create

2006-02-16 Thread Ron

At 10:52 AM 2/16/2006, Ron wrote:

> At 09:48 AM 2/16/2006, Martijn van Oosterhout wrote:
> > Where this does become interesting is where we can convert a datum to
> > an integer such that if f(A) > f(B) then A > B. Then we can sort on
> > f(X) first with just integer comparisons and then do a full tuple
> > comparison only if f(A) = f(B). This would be much more cache-coherent
> > and make these algorithms much more applicable in my mind.
>
> In fact we can do better.
> Using hash codes or what-not to map datums to keys and then sorting
> just the keys and the pointers to those datums followed by an
> optional final pass where we do the actual data movement is also a
> standard technique for handling large data structures.

I thought some follow up might be in order here.

Let's pretend that we have the typical DB table where rows are ~2-4KB 
apiece.  1TB of storage will let us have 256M-512M rows in such a table.


A 32b hash code can be assigned to each row value such that only 
exactly equal rows will have the same hash code.

A 32b pointer can locate any of the 256M-512M rows.

Now instead of sorting 1TB of data we can sort 2^28 to 2^29 entries
of 32b+32b = 64b apiece, i.e. 2-4GB of pointers+keys, followed by an
optional pass to rearrange the actual rows if we so wish.


We get the same result while only examining and manipulating 1/50 to 
1/25 as much data during the sort.
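
One way to picture the arithmetic (an illustrative sketch; pack() and
friends are hypothetical helpers, not pg code): put the 32b key in the
high half of a 64-bit word and the 32b row index in the low half, so a
plain integer sort orders by key and equal keys land adjacent for a
later tie-break or rearrangement pass.

#include <stdint.h>

/* Key in the high 32 bits, row index in the low 32 bits: integer
 * order on the packed word is key order, with the row index as an
 * arbitrary tie-break.  2^28 to 2^29 such 8-byte words is 2-4GB. */
static inline uint64_t
pack(uint32_t key, uint32_t row)
{
    return ((uint64_t) key << 32) | row;
}

/* Recover the row index after sorting, for the optional
 * rearrangement pass. */
static inline uint32_t
packed_row(uint64_t p)
{
    return (uint32_t) p;
}

static int
cmp_u64(const void *a, const void *b)
{
    uint64_t    x = *(const uint64_t *) a;
    uint64_t    y = *(const uint64_t *) b;

    return (x > y) - (x < y);
}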


If we want to spend more CPU time in order to save more space, we can 
compress the key+pointer representation.  That usually reduces the 
amount of data to be manipulated to ~1/4 the original key+pointer 
representation, reducing things to ~512M-1GB worth of compressed 
pointers+keys, or ~1/200 to ~1/100 of the original amount of data we 
were discussing.


Either representation is small enough to fit within RAM rather than 
requiring HD IO, so we solve the HD IO bottleneck in the best 
possible way: we avoid ever doing it.


Ron   




---(end of broadcast)---
TIP 9: In versions below 8.0, the planner will ignore your desire to
  choose an index scan if your joining column's datatypes do not
  match


Re: [HACKERS] qsort again (was Re: [PERFORM] Strange Create

2006-02-16 Thread Tom Lane
Ron <[EMAIL PROTECTED]> writes:
> Your cost comment basically agrees with mine regarding the cost of 
> random memory accesses.  The good news is that the number of datums 
> to be examined during the pivot choosing process is small enough that 
> the datums can fit into CPU cache while the pointers to them can be 
> assigned to registers: making pivot choosing +very+ fast when done correctly.

This is more or less irrelevant given that comparing the pointers is not
the operation we need to do.

regards, tom lane

---(end of broadcast)---
TIP 3: Have you checked our extensive FAQ?

   http://www.postgresql.org/docs/faq


Re: [HACKERS] qsort again (was Re: [PERFORM] Strange Create

2006-02-16 Thread Ron

At 09:48 AM 2/16/2006, Martijn van Oosterhout wrote:

> On Thu, Feb 16, 2006 at 08:22:55AM -0500, Ron wrote:
> > 3= Especially in modern systems where the gap between internal CPU
> > bandwidth and memory bandwidth is so great, the overhead of memory
> > accesses for comparisons and moves is the majority of the overhead
> > for both the pivot choosing and the partitioning algorithms within
> > quicksort.  Particularly random memory accesses.  The reason (#GPRs -
> > 1) is a magic constant is that it's the most you can compare and move
> > using only register-to-register operations.
>
> But how much of this applies to us? We're not sorting arrays of
> integers, we're sorting pointers to tuples. So while moves cost very
> little, a comparison costs hundreds, maybe thousands of cycles. A tuple
> can easily be two or three cachelines and you're probably going to
> access all of it, not to mention the Fmgr structures and the Datums
> themselves.

Pointers are simply fixed size 32b or 64b quantities.  They are 
essentially integers.  Comparing and moving pointers or fixed size 
keys to those pointers is exactly the same problem as comparing and 
moving integers.


Comparing =or= moving the actual data structures is a much more 
expensive and variable cost proposition.  I'm sure that pg's sort 
functionality is written intelligently enough that the only real data 
moves are done in a final pass after the exact desired order has been 
found using pointer compares and (re)assignments during the sorting 
process.  That's a standard technique for sorting data whose "key" or 
pointer is much smaller than a datum.


Your cost comment basically agrees with mine regarding the cost of 
random memory accesses.  The good news is that the number of datums 
to be examined during the pivot choosing process is small enough that 
the datums can fit into CPU cache while the pointers to them can be 
assigned to registers: making pivot choosing +very+ fast when done correctly.


As you've noted, actual partitioning is going to be more expensive 
since it involves accessing enough actual datums that they can't all 
fit into CPU cache.  The good news is that QuickSort has a very 
sequential access pattern within its inner loop.  So while we must go 
to memory for compares, we are at least keeping the cost down to a 
minimum.  In addition, said access is nice enough to be very 
prefetch and CPU cache hierarchy friendly.




> None of this is cache friendly. The actual tuples themselves could be
> spread all over memory (I don't think any particular effort is expended
> trying to minimize fragmentation).

It probably would be worth it to spend some effort on memory layout 
just as we do for HD layout.




> Do these algorithms discuss the case where a comparison is more than
> 1000 times the cost of a move?

A move is always more expensive than a compare when the datum is 
larger than its pointer or key.  A move is always more expensive than 
a compare when it involves memory to memory movement rather than CPU 
location to CPU location movement.  A move is especially more 
expensive than a compare when it involves both factors.  Most moves 
do involve both.


What I suspect you meant is that a key comparison that involves 
accessing the data in memory is more expensive than reassigning the 
pointers associated with those keys.   That is certainly true.


Yes.  The problem has been extensively studied. ;-)



> Where this does become interesting is where we can convert a datum to
> an integer such that if f(A) > f(B) then A > B. Then we can sort on
> f(X) first with just integer comparisons and then do a full tuple
> comparison only if f(A) = f(B). This would be much more cache-coherent
> and make these algorithms much more applicable in my mind.

In fact we can do better.
Using hash codes or what-not to map datums to keys and then sorting 
just the keys and the pointers to those datums followed by an 
optional final pass where we do the actual data movement is also a 
standard technique for handling large data structures.



Regardless of what tweaks beyond the basic algorithms we use, the 
algorithms themselves have been well studied and their performance 
well established.  QuickSort is the best performing of the O(n lg n) 
comparison-based sorts, and it uses fewer resources than HeapSort or 
MergeSort.


Ron



---(end of broadcast)---
TIP 2: Don't 'kill -9' the postmaster


Re: [HACKERS] qsort again (was Re: [PERFORM] Strange Create Index

2006-02-16 Thread Gary Doades
> "Gary Doades" <[EMAIL PROTECTED]> writes:
>> I think the reason I wasn't seeing performance issues with normal sort
>> operations is because they use work_mem not maintenance_work_mem which
>> was
>> only set to 2048 anyway. Does that sound right?
>
> Very probable.  Do you want to test the theory by jacking that up?  ;-)

Hmm, played around a bit. I have managed to get it to do a sort on one of
the "bad" columns using a select of two whole tables that results in a
sequential scan, sort and merge join. I also tried a simple select column
order by column for a bad column.

I tried varying maintenance_work_mem and work_mem up and down between 2048
and 65536 but I always get similar results. The sort phase always takes 4
to 5 seconds which seems about right for 900,000 rows.

This was on a column that took 12 minutes to create an index on.

I've no idea why it should behave this way, but probably explains why I
(and others) may not have noticed it before.

Regards,
Gary.



---(end of broadcast)---
TIP 6: explain analyze is your friend


Re: [HACKERS] qsort again (was Re: [PERFORM] Strange Create Index

2006-02-16 Thread Martijn van Oosterhout
On Thu, Feb 16, 2006 at 08:22:55AM -0500, Ron wrote:
> 3= Especially in modern systems where the gap between internal CPU 
> bandwidth and memory bandwidth is so great, the overhead of memory 
> accesses for comparisons and moves is the majority of the overhead 
> for both the pivot choosing and the partitioning algorithms within 
> quicksort.  Particularly random memory accesses.  The reason (#GPRs - 
> 1) is a magic constant is that it's the most you can compare and move 
> using only register-to-register operations.

But how much of this applies to us? We're not sorting arrays of
integers, we're sorting pointers to tuples. So while moves cost very
little, a comparison costs hundreds, maybe thousands of cycles. A tuple
can easily be two or three cachelines and you're probably going to
access all of it, not to mention the Fmgr structures and the Datums
themselves.

None of this is cache friendly. The actual tuples themselves could be
spread all over memory (I don't think any particular effort is expended
trying to minimize fragmentation).

Do these algorithms discuss the case where a comparison is more than
1000 times the cost of a move?

Where this does become interesting is where we can convert a datum to
an integer such that if f(A) > f(B) then A > B. Then we can sort on
f(X) first with just integer comparisons and then do a full tuple
comparison only if f(A) = f(B). This would be much more cache-coherent
and make these algorithms much more applicable in my mind.

Have a nice day,
-- 
Martijn van Oosterhout  http://svana.org/kleptog/
> Patent. n. Genius is 5% inspiration and 95% perspiration. A patent is a
> tool for doing 5% of the work and then sitting around waiting for someone
> else to do the other 95% so you can sue them.




Re: [HACKERS] qsort again (was Re: [PERFORM] Strange Create Index

2006-02-16 Thread Gary Doades
Tom Lane wrote:
> I increased the size of the test case by 10x (basically s/10/100/)
> which is enough to push it into the external-sort regime.  I get
> amazingly stable runtimes now --- I didn't have the patience to run 100
> trials, but in 30 trials I have slowest 11538 msec and fastest 11144 msec.
> So this code path is definitely not very sensitive to this data
> distribution.
>
> While these numbers aren't glittering in comparison to the best-case
> qsort times (~450 msec to sort 10% as much data), they are sure a lot
> better than the worst-case times.  So maybe a workaround for you is
> to decrease maintenance_work_mem, counterintuitive though that be.
> (Now, if you *weren't* using maintenance_work_mem of 100MB or more
> for your problem restore, then I'm not sure I know what's going on...)
>

Good call. I basically reversed your test by keeping the number of rows
the same (200,000), but reducing maintenance_work_mem. Reducing to 8192
made no real difference. Reducing to 4096 flattened out all the times
nicely. Slower overall, but at least predictable. Hopefully only a
temporary solution until qsort is fixed.

My restore now takes 22 minutes :)

I think the reason I wasn't seeing performance issues with normal sort
operations is because they use work_mem not maintenance_work_mem which was
only set to 2048 anyway. Does that sound right?

Regards,
Gary.



---(end of broadcast)---
TIP 5: don't forget to increase your free space map settings


Re: [HACKERS] qsort again (was Re: [PERFORM] Strange Create Index

2006-02-15 Thread Neil Conway
On Wed, 2006-02-15 at 18:28 -0500, Tom Lane wrote:
> It seems clear that our qsort.c is doing a pretty awful job of picking
> qsort pivots, while glibc is mostly managing not to make that mistake.
> I haven't looked at the glibc code yet to see what they are doing
> differently.

glibc qsort is actually merge sort, so I'm not surprised it avoids this
problem.

-Neil



---(end of broadcast)---
TIP 9: In versions below 8.0, the planner will ignore your desire to
   choose an index scan if your joining column's datatypes do not
   match


Re: [HACKERS] qsort again (was Re: [PERFORM] Strange Create Index

2006-02-15 Thread Simon Riggs
On Wed, 2006-02-15 at 19:59 -0500, Tom Lane wrote:

>  I get
> amazingly stable runtimes now --- I didn't have the patience to run 100
> trials, but in 30 trials I have slowest 11538 msec and fastest 11144 msec.
> So this code path is definitely not very sensitive to this data
> distribution.

"The worst-case behavior of replacement-selection is very close to its
average behavior, while the worst-case behavior of QuickSort is terrible
(N²) – a strong argument in favor of replacement-selection. Despite this
risk, QuickSort is widely used because, in practice, it has superior
performance." p.8, "AlphaSort: A Cache-Sensitive Parallel External
Sort", Nyberg et al, VLDB Journal 4(4): 603-627 (1995)

I think your other comment about flipping to insertion sort too early
(and not returning...) is a plausible cause for the poor pg qsort
behaviour, but the overall spread of values seems as expected.

Some test results I've seen seem consistent with the view that
increasing memory also increases run-time for larger settings of
work_mem/maintenance_work_mem. Certainly, as I observed a while back,
having a large memory setting doesn't help you at all when you are
doing final run merging on the external sort. Whatever we do, we should
look at the value high memory settings bring to each phase of a sort
separately from the other phases.

There is work underway on improving external sorts, so I hear (not me).
Plus my WIP on randomAccess requirements.

Best Regards, Simon Riggs




---(end of broadcast)---
TIP 5: don't forget to increase your free space map settings


Re: [HACKERS] qsort again (was Re: [PERFORM] Strange Create Index behaviour)

2006-02-15 Thread Tom Lane
I wrote:
> Gary Doades <[EMAIL PROTECTED]> writes:
>> Ouch! That confirms my problem. I generated the random test case because 
>> it was easier than including the dump of my tables, but you can 
>> appreciate that tables 20 times the size are basically crippled when it 
>> comes to creating an index on them.

> Actually... we only use qsort when we have a sorting problem that fits
> within the allowed sort memory.  The external-sort logic doesn't go
> through that code at all.  So all the analysis we just did on your test
> case doesn't necessarily apply to sort problems that are too large for
> the sort_mem setting.

I increased the size of the test case by 10x (basically s/10/100/)
which is enough to push it into the external-sort regime.  I get
amazingly stable runtimes now --- I didn't have the patience to run 100
trials, but in 30 trials I have slowest 11538 msec and fastest 11144 msec.
So this code path is definitely not very sensitive to this data
distribution.

While these numbers aren't glittering in comparison to the best-case
qsort times (~450 msec to sort 10% as much data), they are sure a lot
better than the worst-case times.  So maybe a workaround for you is
to decrease maintenance_work_mem, counterintuitive though that be.
(Now, if you *weren't* using maintenance_work_mem of 100MB or more
for your problem restore, then I'm not sure I know what's going on...)

We still ought to try to fix qsort of course.

regards, tom lane

---(end of broadcast)---
TIP 2: Don't 'kill -9' the postmaster


Re: [HACKERS] qsort again (was Re: [PERFORM] Strange Create Index behaviour)

2006-02-15 Thread Tom Lane
Gary Doades <[EMAIL PROTECTED]> writes:
> Ouch! That confirms my problem. I generated the random test case because 
> it was easier than including the dump of my tables, but you can 
> appreciate that tables 20 times the size are basically crippled when it 
> comes to creating an index on them.

Actually... we only use qsort when we have a sorting problem that fits
within the allowed sort memory.  The external-sort logic doesn't go
through that code at all.  So all the analysis we just did on your test
case doesn't necessarily apply to sort problems that are too large for
the sort_mem setting.

The test case would be sorting 200,000 index entries, which'd probably
occupy at least 24 bytes apiece of sort memory, so probably about 5 meg.
A problem 20 times that size would definitely not fit in the default
16MB maintenance_work_mem.  Were you using a large value of
maintenance_work_mem for your restore?

regards, tom lane

---(end of broadcast)---
TIP 5: don't forget to increase your free space map settings