Re: [HACKERS] Inlining comparators as a performance optimisation

2012-01-28 Thread Bruce Momjian
On Fri, Jan 13, 2012 at 10:48:56AM +0100, Pierre C wrote:
 Maybe an extra column in pg_proc would do (but then, the proargtypes
 and friends would describe only the sql-callable version) ?
 Or an extra table ? pg_cproc ?
 Or an in-memory hash : hashtable[ fmgr-callable function ] = C version
 - What happens if a C function has no SQL-callable equivalent ?
 Or (ugly) introduce an extra per-type function
 type_get_function_ptr( function_kind ) which returns the requested
 function ptr
 
 If one of those happens, I'll dust off my old copy-optimization patch ;)

I agree that COPY is ripe for optimization, and I am excited you have
some ideas on this.

-- 
  Bruce Momjian  br...@momjian.us  http://momjian.us
  EnterpriseDB http://enterprisedb.com

  + It's impossible for everything to be true. +

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Inlining comparators as a performance optimisation

2012-01-13 Thread Pierre C

On Wed, 21 Sep 2011 18:13:07 +0200, Tom Lane t...@sss.pgh.pa.us wrote:


Heikki Linnakangas heikki.linnakan...@enterprisedb.com writes:

On 21.09.2011 18:46, Tom Lane wrote:

The idea that I was toying with was to allow the regular SQL-callable
comparison function to somehow return a function pointer to the
alternate comparison function,



You could have a new function with a pg_proc entry, that just returns a
function pointer to the qsort-callback.


Yeah, possibly.  That would be a much more invasive change, but cleaner
in some sense.  I'm not really prepared to do all the legwork involved
in that just to get to a performance-testable patch though.


A few years ago I looked for a way to speed up COPY operations, and it
turned out that COPY TO had a good optimization opportunity. At that time,
for each datum, COPY TO would:


- test for nullness
- call an outfunc through fmgr
- outfunc pallocs() a bytea or text, fills it with data, and returns it
(sometimes it uses an extensible string buffer which may be repalloc()'d
several times)
- COPY memcpy()s the returned data to a buffer and eventually flushes the
buffer to the client socket.


I introduced a special write buffer with an on-flush callback (i.e., a close
relative of the existing string buffer); in this case the callback was
"flush to client socket". I also added several outfuncs (one per type) which
took that buffer as an argument, besides the datum to output, and simply put
the datum inside the buffer, with appropriate transformations (like
converting to bytea or text), flushing if needed.


Then COPY TO BINARY of a constant-size datum would turn into:
- one test for nullness
- one C function call
- one test to ensure appropriate space is available in the buffer (flush if
needed)
- one htonl() and one memcpy of constant size, which the compiler turns
into a couple of simple instructions
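(A minimal sketch of the scheme described above; all names are invented, since the original patch was never posted in final form. It shows a write buffer with an on-flush callback plus a C-callable binary outfunc small enough for the compiler to inline; put_net32() stands in for the htonl()+memcpy step.)

```c
#include <stddef.h>
#include <stdint.h>

typedef struct OutBuffer OutBuffer;
typedef void (*FlushFunc)(OutBuffer *buf);

struct OutBuffer
{
    char       *data;
    size_t      len;        /* bytes currently used */
    size_t      capacity;
    FlushFunc   flush;      /* e.g. send to client socket, or repalloc */
};

/* Ensure space is available; invoke the callback when full. */
static inline void
outbuf_reserve(OutBuffer *buf, size_t need)
{
    if (buf->len + need > buf->capacity)
        buf->flush(buf);
}

/* Endian-independent stand-in for htonl() + memcpy(). */
static inline void
put_net32(char *dst, uint32_t v)
{
    dst[0] = (char) (v >> 24);
    dst[1] = (char) (v >> 16);
    dst[2] = (char) (v >> 8);
    dst[3] = (char) v;
}

/* Inlined into callers: for a known-size type these stores compile
 * down to a handful of simple instructions, with no fmgr call and
 * no palloc. Writes a COPY BINARY style field: length word, then data. */
static inline void
int4_out_binary(OutBuffer *buf, int32_t value)
{
    outbuf_reserve(buf, 8);
    put_net32(buf->data + buf->len, 4);                 /* field length */
    put_net32(buf->data + buf->len + 4, (uint32_t) value);
    buf->len += 8;
}

/* Example flush callback: discard contents (stands in for a socket send). */
static void
discard_flush(OutBuffer *buf)
{
    buf->len = 0;
}
```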


I recall measuring speedups of 2x - 8x on COPY BINARY, less for text, but  
still large gains.


Although eliminating fmgr call and palloc overhead was an important part
of it, another large part was getting rid of memcpy()s, which the compiler
turned into simple movs for known-size types, a transformation that can be
done only if the buffer write functions are inlined inside the outfuncs.
Compilers love constants...


Additionally, code size growth was minimal, since I moved the old outfunc
code into the new outfuncs and replaced each old fmgr-callable outfunc
with: create a buffer whose on-full callback is extend_and_repalloc(),
pass it to the new outfunc(buffer, datum), and return the buffer. That is
basically equivalent to the previous palloc()-based code, maybe with a few
extra instructions.


When I submitted the patch for review, Tom rightly pointed out that my
way of obtaining the C function pointer sucked very badly (I don't
remember how I did it, only that it was butt-ugly), but the idea was to get
a quick measurement of what could be gained, and the result was positive.
Unfortunately I had no time available to finish it and turn it into a real
patch; I'm sorry about that.


So why do I post in this sorting topic? It seems that, by bypassing fmgr
for functions which are small, simple, and called lots of times, there is a
large gain to be made, not only because of fmgr overhead but also because
of the opportunity for new compiler optimizations, palloc removal, etc.
However, in my experiment the arguments and return types of the new
functions were DIFFERENT from the old functions: the new ones do the same
thing, but in a different manner. One manner is suited to SQL-callable
functions (i.e., palloc and return a bytea) and the other to writing large
amounts of data (direct buffer writes). Since the two have very different
requirements, being fast at both is impossible for the same function.


Anyway, all that rant boils down to this: some functions could benefit
from having two versions (while sharing almost all the code between them):

- a user-callable (fmgr) version (the current one)
- a C-callable version, usually with different parameters and a different
return type

And it would be cool to have a way to grab a bare function pointer to the
second one.


Maybe an extra column in pg_proc would do (but then proargtypes and
friends would describe only the SQL-callable version)?

Or an extra table? pg_cproc?
Or an in-memory hash: hashtable[fmgr-callable function] = C version
- but what happens if a C function has no SQL-callable equivalent?
Or (ugly) introduce an extra per-type function type_get_function_ptr(
function_kind ) which returns the requested function pointer.
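(The last, per-type option might be sketched like this; every name below is invented for illustration, since the patch never settled on a mechanism. Each type exposes one entry point handing back bare C pointers to its raw variants, and a NULL return answers the no-SQL-equivalent question: callers fall back to the fmgr path.)

```c
#include <stddef.h>

/* Kinds of raw (non-fmgr) per-type functions a caller might request. */
typedef enum RawFunctionKind
{
    RAWFUNC_OUTPUT_BINARY,
    RAWFUNC_COMPARE
} RawFunctionKind;

/* Generic function pointer; callers cast back to the concrete type. */
typedef void (*RawFunctionPtr)(void);

/* A bare C comparator for int4, bypassing fmgr entirely. */
static int
int4_raw_compare(const void *a, const void *b)
{
    int av = *(const int *) a;
    int bv = *(const int *) b;

    return (av > bv) - (av < bv);
}

/* The per-type dispatcher: returns NULL when no raw variant exists,
 * so callers must be prepared to fall back to the fmgr path. */
static RawFunctionPtr
int4_get_function_ptr(RawFunctionKind kind)
{
    switch (kind)
    {
        case RAWFUNC_COMPARE:
            return (RawFunctionPtr) int4_raw_compare;
        default:
            return NULL;    /* no raw variant registered for this kind */
    }
}
```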


If one of those happens, I'll dust off my old copy-optimization patch ;)

Hmm... just my 2c

Regards
Pierre



Re: [HACKERS] Inlining comparators as a performance optimisation

2011-12-16 Thread Greg Smith
I think we can count a new sorting infrastructure popping out, and what 
looks to be over 90 messages on this topic, as successful progress on 
this front.  Peter's going to rev a new patch, but with more performance 
results to review and follow-up discussion still coming, I can't see this 
one wrapping up within the current CommitFest.  Marking it returned, and 
we'll come back to this topic during or before the next CF.  With the 
majority of the comments here coming from committers, I think continuing 
progress in that direction won't be a problem.


--
Greg Smith   2ndQuadrant US   g...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support  www.2ndQuadrant.us




Re: [HACKERS] Inlining comparators as a performance optimisation

2011-12-07 Thread Peter Geoghegan
On 7 December 2011 03:45, Robert Haas robertmh...@gmail.com wrote:
 In this regard, I think Heikki's remarks upthread are worth some
 thought.  If inlining is a win just because it avoids saving and
 restoring registers or allows better instruction scheduling, then
 inlining is the (probably?) the only way to get the benefit.  But if
 most of the benefit is in having a separate path for the
 single-sort-key case, we can do that without duplicating the qsort()
 code and get the benefit for every data type without much code bloat.
 I'd like to see us dig into that a little, so that we get the broadest
 possible benefit out of this work.  It doesn't bother me that not
 every optimization will apply to every case, and I don't object to
 optimizations that are intrinsically narrow (within some reasonable
 limits).  But I'd rather not take what could be a fairly broad-based
 optimization and apply it only narrowly, all things being equal.

I think we're in agreement then.

It's important to realise, though I'm sure you do, that once you've
done what I did with sorting single-scanKey integers, that's it:
there isn't really any way that I can think of to speed up quick
sorting further, other than parallelism, which has major challenges
and isn't particularly appropriate at this granularity. The real
significance of inlining is, well, that that's all there is:
after inlining, I predict that the well will run dry, or practically
dry. Inlining actually is a pretty great win when you look at sorting
in isolation; it's just that there have been a number of other
significant wins in my patch, and of course there is overhead from a
number of other areas, so perhaps it's easy to lose sight of that.

-- 
Peter Geoghegan       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training and Services



Re: [HACKERS] Inlining comparators as a performance optimisation

2011-12-07 Thread Robert Haas
On Tue, Dec 6, 2011 at 8:46 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 1. Adding sortsupport infrastructure for more datatypes.
 2. Revising nbtree and related code to use this infrastructure.
 3. Integrating Peter's work into this framework.

 I'll try to take care of #1 for at least a few key datatypes before
 I commit, but I think #2 is best done as a separate patch, so I'll
 postpone that till later.

I see you've committed a chunk of this now.  Does it make sense to do
#1 for every data type we support, or should we be more selective than
that?  My gut feeling would be to add it across the board and
introduce an opr_sanity check for it.  The general utility of adding
it to deprecated types like abstime is perhaps questionable, but it
strikes me that the value of making everything consistent probably
outweighs the cost of a few extra lines of code.

Are you planning to do anything about #2 or #3?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Inlining comparators as a performance optimisation

2011-12-07 Thread Tom Lane
Robert Haas robertmh...@gmail.com writes:
 On Tue, Dec 6, 2011 at 8:46 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 1. Adding sortsupport infrastructure for more datatypes.
 2. Revising nbtree and related code to use this infrastructure.
 3. Integrating Peter's work into this framework.
 
 I'll try to take care of #1 for at least a few key datatypes before
 I commit, but I think #2 is best done as a separate patch, so I'll
 postpone that till later.

 I see you've committed a chunk of this now.  Does it make sense to do
 #1 for every data type we support, or should we be more selective than
 that?

Basically, I tried to do #1 for every datatype for which the comparator
was cheap enough that reducing the call overhead seemed likely to make a
useful difference.  I'm not in favor of adding sortsupport functions
where this is not true, as I think it'll be useless code and catalog
bloat.  I don't want to add 'em for cruft like abstime either.

There's some stuff that's debatable according to this criterion --- in
particular, I wondered whether it'd be worth having a fast path for
bttextcmp, especially if we pre-tested the collate_is_c condition and
had a separate version that just hardwired the memcmp code path.  (The
idea of doing that was one reason I insisted on collation being known at
the setup step.)  But it would still have to be prepared for detoasting,
so in the end I was unenthused.  Anyone who feels like testing could try
to prove me wrong about it though.
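(For anyone picking up Tom's challenge: the hardwired code path in question is essentially a memcmp over the shorter length with a length tie-break. A sketch, ignoring detoasting, which the real bttextcmp would still have to be prepared for:)

```c
#include <string.h>

/* Fast path for text comparison under the C collation: byte-wise
 * compare the common prefix, then break ties on length. This is what
 * the hardwired memcmp path would boil down to, with detoasting and
 * varlena header handling omitted for brevity. */
static int
text_cmp_c_collation(const char *a, size_t alen,
                     const char *b, size_t blen)
{
    size_t  minlen = (alen < blen) ? alen : blen;
    int     cmp = memcmp(a, b, minlen);

    if (cmp != 0)
        return cmp;
    /* equal prefixes: the shorter string sorts first */
    return (alen > blen) - (alen < blen);
}
```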

 Are you planning to do anything about #2 or #3?

I am willing to do #2, but not right now; I feel what I need to do next
is go review SPGist.  I don't believe that #2 blocks progress on #3
anyway.  I think #3 is in Peter's court, or yours if you want to do it.

(BTW, I agree with your comments yesterday about trying to break down
the different aspects of what Peter did, and put as many of them as we
can into the non-inlined code paths.)

regards, tom lane



Re: [HACKERS] Inlining comparators as a performance optimisation

2011-12-07 Thread Robert Haas
On Wed, Dec 7, 2011 at 10:09 AM, Tom Lane t...@sss.pgh.pa.us wrote:
 There's some stuff that's debatable according to this criterion --- in
 particular, I wondered whether it'd be worth having a fast path for
 bttextcmp, especially if we pre-tested the collate_is_c condition and
 had a separate version that just hardwired the memcmp code path.  (The
 idea of doing that was one reason I insisted on collation being known at
 the setup step.)  But it would still have to be prepared for detoasting,
 so in the end I was unenthused.  Anyone who feels like testing could try
 to prove me wrong about it though.

I think that'd definitely be worth investigating (although I'm not
sure I have the time to do it myself any time real soon).

 Are you planning to do anything about #2 or #3?

 I am willing to do #2, but not right now; I feel what I need to do next
 is go review SPGist.

Yeah, makes sense.  That one seems likely to be a challenge to absorb.

 I don't believe that #2 blocks progress on #3
 anyway.  I think #3 is in Peter's court, or yours if you want to do it.

 (BTW, I agree with your comments yesterday about trying to break down
 the different aspects of what Peter did, and put as many of them as we
 can into the non-inlined code paths.)

Cool.  Peter, can you rebase your patch and integrate it into the
sortsupport framework that's now committed?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Inlining comparators as a performance optimisation

2011-12-07 Thread Peter Geoghegan
On 7 December 2011 15:15, Robert Haas robertmh...@gmail.com wrote:
 On Wed, Dec 7, 2011 at 10:09 AM, Tom Lane t...@sss.pgh.pa.us wrote:
 But it would still have to be prepared for detoasting,
 so in the end I was unenthused.  Anyone who feels like testing could try
 to prove me wrong about it though.

 I think that'd definitely be worth investigating (although I'm not
 sure I have the time to do it myself any time real soon).

I'll at least take a look at it. Sorting text is a fairly common case.
I'm not hugely enthused about spending too much time on work that will
only be useful if collate_is_c.

 I don't believe that #2 blocks progress on #3
 anyway.  I think #3 is in Peter's court, or yours if you want to do it.

 (BTW, I agree with your comments yesterday about trying to break down
 the different aspects of what Peter did, and put as many of them as we
 can into the non-inlined code paths.)

I'm confident that we should have everything for the simple case of
ordering by a single int4 or int8 column, and I think you'd probably
agree with that - they're extremely common cases. Anything beyond that
will need to be justified, probably in part by running additional
benchmarks.
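(As a reference point for such benchmarks, the single-int4-key specialisation under discussion amounts to a sort loop whose comparisons the compiler can inline, with no fmgr or function-pointer call per comparison. A toy sketch, not Peter's actual patch; bounds are inclusive:)

```c
#include <stdint.h>

/* Quicksort specialized for a single int32 key, using Hoare
 * partitioning over the inclusive range [lo, hi]. The comparisons
 * (a[i] < pivot, a[j] > pivot) are plain integer compares the
 * compiler inlines into the loop, which is the whole point of the
 * specialization. */
static void
sort_int32_inline(int32_t *a, long lo, long hi)
{
    int32_t pivot;
    long    i, j;

    if (lo >= hi)
        return;

    pivot = a[lo + (hi - lo) / 2];
    i = lo - 1;
    j = hi + 1;
    for (;;)
    {
        do { i++; } while (a[i] < pivot);   /* inlined comparison */
        do { j--; } while (a[j] > pivot);   /* inlined comparison */
        if (i >= j)
            break;
        {
            int32_t tmp = a[i];

            a[i] = a[j];
            a[j] = tmp;
        }
    }
    sort_int32_inline(a, lo, j);
    sort_int32_inline(a, j + 1, hi);
}
```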

 Cool.  Peter, can you rebase your patch and integrate it into the
 sortsupport framework that's now committed?

Yes, I'd be happy to, though I don't think I'm going to be getting
around to it this side of Friday. Since it isn't a blocker, I assume
that's okay.

The rebased revision will come complete with a well-thought-out
rationale for my use of inlining specialisations, one that takes account
of the trade-off against binary bloat that Tom highlighted. I wasn't
ignoring that issue; I just failed to articulate my thoughts there,
mostly because I felt the need to do some additional research to
justify my position.

-- 
Peter Geoghegan       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training and Services



Re: [HACKERS] Inlining comparators as a performance optimisation

2011-12-07 Thread Robert Haas
On Wed, Dec 7, 2011 at 10:58 AM, Peter Geoghegan pe...@2ndquadrant.com wrote:
 On 7 December 2011 15:15, Robert Haas robertmh...@gmail.com wrote:
 On Wed, Dec 7, 2011 at 10:09 AM, Tom Lane t...@sss.pgh.pa.us wrote:
 But it would still have to be prepared for detoasting,
 so in the end I was unenthused.  Anyone who feels like testing could try
 to prove me wrong about it though.

 I think that'd definitely be worth investigating (although I'm not
 sure I have the time to do it myself any time real soon).

 I'll at least take a look at it. Sorting text is a fairly common case.
 I'm not hugely enthused about spending too much time on work that will
 only be useful if collate_is_c.

Well, text_pattern_ops is not exactly an uncommon case, but sure.  And
there might be some benefit for other collations, too.

 Cool.  Peter, can you rebase your patch and integrate it into the
 sortsupport framework that's now committed?

 Yes, I'd be happy to, though I don't think I'm going to be getting
 around to it this side of Friday. Since it isn't a blocker, I assume
 that's okay.

Sounds fine to me.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Inlining comparators as a performance optimisation

2011-12-06 Thread Nicolas Barbier
2011/12/5 Tom Lane t...@sss.pgh.pa.us:

 What is bothering me is that this approach is going to cause substantial
 bloat of the executable code, and such bloat has got distributed costs,
 which we don't have any really good way to measure but for sure
 micro-benchmarks addressing only sort speed aren't going to reveal them.
 Cache lines filled with sort code take cycles to flush and replace with
 something else.

 I think it's possibly reasonable to have inlined copies of qsort for a
 small number of datatypes, but it seems much less reasonable to have
 multiple copies per datatype in order to obtain progressively tinier
 micro-benchmark speedups.  We need to figure out what the tradeoff
 against executable size really is, but so far it seems like you've been
 ignoring the fact that there is any such tradeoff at all.

[ Randomly spouting ideas here: ]

Might it not be a good idea to decide whether to use the inlined
copies vs. the normal version based on how much data there is to sort?
Surely for a very large sort, the cache blow-out doesn't matter that much
relative to the amount of time that can be saved by sorting faster?

Assuming that all types had an inlined sort function, although
that would indeed result in a larger binary, most of that binary would
never touch the cache, because the corresponding large sorts are never
performed. And when they do sporadically occur (assuming the points
at which inline sorting starts to be used are chosen wisely), any
resulting cache blow-out would be a net win.
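(The size-based dispatch Nicolas suggests could be sketched as below; the threshold constant and all names are placeholders, and the "specialized" function here merely stands in for a real inlined-comparator qsort.)

```c
#include <stdint.h>
#include <stdlib.h>

/* Placeholder: the element count above which the instruction-cache
 * cost of the larger specialized code is assumed to be amortized. */
#define INLINE_SORT_THRESHOLD 10000

/* Generic comparator called through a function pointer, as with
 * the stock qsort() path. */
static int
int32_cmp_indirect(const void *a, const void *b)
{
    int32_t av = *(const int32_t *) a;
    int32_t bv = *(const int32_t *) b;

    return (av > bv) - (av < bv);
}

/* Stand-in for the specialized, inlined-comparator quicksort. */
static void
sort_int32_specialized(int32_t *a, size_t n)
{
    qsort(a, n, sizeof(int32_t), int32_cmp_indirect);
}

/* Dispatcher: small sorts use the shared, cache-resident generic
 * path; only large sorts pull in the specialized code. */
static void
sort_int32(int32_t *a, size_t n)
{
    if (n >= INLINE_SORT_THRESHOLD)
        sort_int32_specialized(a, n);   /* cache cost amortized */
    else
        qsort(a, n, sizeof(int32_t), int32_cmp_indirect);
}
```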

I am also assuming here that instruction cache lines are small enough
for cache-line aliasing not to become a problem; putting all the sort
functions next to each other in the binary (so that they don't alias
with the rest of the backend code, which might be used more often) might
alleviate that.

Nicolas

-- 
A. Because it breaks the logical sequence of discussion.
Q. Why is top posting bad?



Re: [HACKERS] Inlining comparators as a performance optimisation

2011-12-06 Thread Robert Haas
On Sun, Dec 4, 2011 at 2:17 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 * I invented a SortKey struct that replaces ScanKey for tuplesort's
 purposes.  Right now that's local in tuplesort.c, but we're more than
 likely going to need it elsewhere as well.  Should we just define it
 in sortsupport.h?  Or perhaps we should just add the few additional
 fields to SortSupportInfoData, and not bother with two struct types?
 Before you answer, consider the next point.

+1 for not bothering with two struct types.  We might want to consider
calling the resulting structure SortKey rather than
SortSupportInfoData, however.

 * I wonder whether it would be worthwhile to elide inlineApplyComparator
 altogether, pushing what it does down to the level of the
 datatype-specific functions.  That would require changing the
 comparator API to include isnull flags, and exposing the
 reverse/nulls_first sort flags to the comparators (presumably by
 including them in SortSupportInfoData).  The main way in which that
 could be a win would be if the setup function could choose one of four
 comparator functions that are pre-specialized for each flags
 combination; but that seems like it would add a lot of code bulk, and
 the bigger problem is that we need to be able to change the flags after
 sort initialization (cf. the reversedirection code in tuplesort.c), so
 we'd also need some kind of re-select the comparator call API.  On the
 whole this doesn't seem promising, but maybe somebody else has a
 different idea.

I thought about this, too, but it didn't seem promising to me, either.
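(For context, the four pre-specialized comparators Tom floats above could look like the following sketch. All names are invented; the shared body is generated from one macro, which limits source duplication even though four copies still land in the binary, which is the code-bulk concern.)

```c
#include <stdbool.h>
#include <stdint.h>

/* A NULL-aware comparator signature: the setup step selects one of
 * four pre-specialized variants, so the inner sort loop never tests
 * the reverse/nulls_first flags. */
typedef int (*NullAwareCmp)(int32_t a, bool anull, int32_t b, bool bnull);

/* One macro generates all four variants; rev and nfirst are
 * compile-time constants, so each expansion is branch-free on them. */
#define DEFINE_CMP(name, rev, nfirst)                       \
static int                                                  \
name(int32_t a, bool anull, int32_t b, bool bnull)          \
{                                                           \
    if (anull || bnull)                                     \
    {                                                       \
        if (anull == bnull)                                 \
            return 0;                                       \
        return (anull == (nfirst)) ? -1 : 1;                \
    }                                                       \
    {                                                       \
        int cmp = (a > b) - (a < b);                        \
        return (rev) ? -cmp : cmp;                          \
    }                                                       \
}

DEFINE_CMP(cmp_asc_nulls_last,   false, false)
DEFINE_CMP(cmp_asc_nulls_first,  false, true)
DEFINE_CMP(cmp_desc_nulls_last,  true,  false)
DEFINE_CMP(cmp_desc_nulls_first, true,  true)

/* The setup step picks the variant once, up front. */
static NullAwareCmp
choose_comparator(bool reverse, bool nulls_first)
{
    if (reverse)
        return nulls_first ? cmp_desc_nulls_first : cmp_desc_nulls_last;
    return nulls_first ? cmp_asc_nulls_first : cmp_asc_nulls_last;
}
```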

 * We're going to want to expose PrepareSortSupportComparisonShim
 for use outside tuplesort.c too, and possibly refactor
 tuplesort_begin_heap so that the SortKey setup logic inside it
 can be extracted for use elsewhere.  Shall we just add those to
 tuplesort's API, or would it be better to create a sortsupport.c
 with these sorts of functions?

Why are we going to want to do that?  If it's because there are other
places in the code that can make use of a fast comparator that don't
go through tuplesort.c, then we should probably break it off into a
separate file (sortkey.c?).  But if it's because we think that clients
of the tuplesort code are going to need it for some reason, then we
may as well keep it in tuplesort.c.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Inlining comparators as a performance optimisation

2011-12-06 Thread Tom Lane
Robert Haas robertmh...@gmail.com writes:
 On Sun, Dec 4, 2011 at 2:17 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 * We're going to want to expose PrepareSortSupportComparisonShim
 for use outside tuplesort.c too, and possibly refactor
 tuplesort_begin_heap so that the SortKey setup logic inside it
 can be extracted for use elsewhere.  Shall we just add those to
 tuplesort's API, or would it be better to create a sortsupport.c
 with these sorts of functions?

 Why are we going to want to do that?  If it's because there are other
 places in the code that can make use of a fast comparator that don't
 go through tuplesort.c, then we should probably break it off into a
 separate file (sortkey.c?).  But if it's because we think that clients
 of the tuplesort code are going to need it for some reason, then we
 may as well keep it in tuplesort.c.

My expectation is that nbtree, as well as mergejoin and mergeappend,
would get converted over to use the fast comparator API.  I looked at
that a little bit but didn't push it far enough to be very sure about
whether they'd be able to share the initialization code from
tuplesort_begin_heap.  But they're definitely going to need the shim
function for backwards compatibility, and
PrepareSortSupportComparisonShim was my first cut at a wrapper that
would be generally useful.

regards, tom lane



Re: [HACKERS] Inlining comparators as a performance optimisation

2011-12-06 Thread Robert Haas
On Tue, Dec 6, 2011 at 12:06 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 Robert Haas robertmh...@gmail.com writes:
 On Sun, Dec 4, 2011 at 2:17 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 * We're going to want to expose PrepareSortSupportComparisonShim
 for use outside tuplesort.c too, and possibly refactor
 tuplesort_begin_heap so that the SortKey setup logic inside it
 can be extracted for use elsewhere.  Shall we just add those to
 tuplesort's API, or would it be better to create a sortsupport.c
 with these sorts of functions?

 Why are we going to want to do that?  If it's because there are other
 places in the code that can make use of a fast comparator that don't
 go through tuplesort.c, then we should probably break it off into a
 separate file (sortkey.c?).  But if it's because we think that clients
 of the tuplesort code are going to need it for some reason, then we
 may as well keep it in tuplesort.c.

 My expectation is that nbtree, as well as mergejoin and mergeappend,
 would get converted over to use the fast comparator API.  I looked at
 that a little bit but didn't push it far enough to be very sure about
 whether they'd be able to share the initialization code from
 tuplesort_begin_heap.  But they're definitely going to need the shim
 function for backwards compatibility, and
 PrepareSortSupportComparisonShim was my first cut at a wrapper that
 would be generally useful.

OK.  Well, then pushing it out to a separate file probably makes
sense.  Do you want to do that or shall I have a crack at it?  If the
latter, what do you think about using the name SortKey for everything
rather than SortSupport?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Inlining comparators as a performance optimisation

2011-12-06 Thread Tom Lane
Robert Haas robertmh...@gmail.com writes:
 OK.  Well, then pushing it out to a separate file probably makes
 sense.  Do you want to do that or shall I have a crack at it?  If the
 latter, what do you think about using the name SortKey for everything
 rather than SortSupport?

I'll take another crack at it.  I'm not entirely sold yet on merging
the two structs; I think first we'd better look and see what the needs
are in the other potential callers I mentioned.  If we'd end up
cluttering the struct with half a dozen weird fields, it'd be better to
stick to a minimal interface struct with various wrapper structs, IMO.

OTOH it did seem that the names were getting a bit long.  If we do
keep the two-struct-levels approach, what do you think of
s/SortSupportInfo/SortSupport/g ?

regards, tom lane



Re: [HACKERS] Inlining comparators as a performance optimisation

2011-12-06 Thread Robert Haas
On Tue, Dec 6, 2011 at 1:07 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 Robert Haas robertmh...@gmail.com writes:
 OK.  Well, then pushing it out to a separate file probably makes
 sense.  Do you want to do that or shall I have a crack at it?  If the
 latter, what do you think about using the name SortKey for everything
 rather than SortSupport?

 I'll take another crack at it.  I'm not entirely sold yet on merging
 the two structs; I think first we'd better look and see what the needs
 are in the other potential callers I mentioned.  If we'd end up
 cluttering the struct with half a dozen weird fields, it'd be better to
 stick to a minimal interface struct with various wrapper structs, IMO.

OK.  I'll defer to whatever you come up with after looking at it.

 OTOH it did seem that the names were getting a bit long.  If we do
 keep the two-struct-levels approach, what do you think of
 s/SortSupportInfo/SortSupport/g ?

+1.  I had that thought when you originally suggested that name, but
it didn't seem worth arguing about.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Inlining comparators as a performance optimisation

2011-12-06 Thread Tom Lane
Robert Haas robertmh...@gmail.com writes:
 On Tue, Dec 6, 2011 at 1:07 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 I'll take another crack at it.  I'm not entirely sold yet on merging
 the two structs; I think first we'd better look and see what the needs
 are in the other potential callers I mentioned.  If we'd end up
 cluttering the struct with half a dozen weird fields, it'd be better to
 stick to a minimal interface struct with various wrapper structs, IMO.

 OK.  I'll defer to whatever you come up with after looking at it.

OK, it looks like nodeMergeAppend.c could use something exactly like the
draft SortKey struct, while nodeMergejoin.c could embed such a struct in
MergeJoinClauseData.  The btree stuff needs something more nearly
equivalent to a ScanKey, including a datum-to-compare-to and a flags
field.  I'm inclined to think the latter would be too specialized to put
in the generic struct.  On the other hand, including the reverse and
nulls_first flags in the generic struct is clearly a win since it allows
ApplyComparator() to be defined as a generic function.  So the only
thing that's really debatable is the attno field, and I'm not anal
enough to insist on a separate level of struct just for that.

I am however inclined to stick with the shortened struct name SortSupport
rather than using SortKey.  The presence of the function pointer fields
(especially the inlined-qsort pointers, assuming we adopt some form of
Peter's patch) changes the struct's nature in my view; it's not really
describing just a sort key (ie an ORDER BY column specification).
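(Concretely, the generic struct with the reverse and nulls_first flags folded in might be sketched as follows; field and function names here are invented, and the committed version lives in sortsupport.h. Keeping the flags in the struct is what lets a single generic apply function wrap any datatype's bare comparator.)

```c
#include <stdbool.h>
#include <stdint.h>

/* Sketch of the generic sort-support struct under discussion. */
typedef struct SortSupportSketch
{
    int     attno;          /* which column this key sorts on */
    bool    reverse;        /* descending order? */
    bool    nulls_first;    /* NULLs before non-NULLs? */
    int   (*comparator) (int32_t a, int32_t b);  /* bare C comparator */
} SortSupportSketch;

/* Datatype-specific raw comparator, free of NULL/direction concerns. */
static int
int32_raw_cmp(int32_t a, int32_t b)
{
    return (a > b) - (a < b);
}

/* The generic wrapper: NULL handling and direction flags are applied
 * here, once, instead of inside every datatype's comparator. */
static int
apply_comparator(const SortSupportSketch *ssup,
                 int32_t a, bool anull, int32_t b, bool bnull)
{
    int cmp;

    if (anull || bnull)
    {
        if (anull && bnull)
            return 0;
        return (anull == ssup->nulls_first) ? -1 : 1;
    }
    cmp = ssup->comparator(a, b);
    return ssup->reverse ? -cmp : cmp;
}
```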

regards, tom lane



Re: [HACKERS] Inlining comparators as a performance optimisation

2011-12-06 Thread Robert Haas
On Tue, Dec 6, 2011 at 4:23 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 Robert Haas robertmh...@gmail.com writes:
 On Tue, Dec 6, 2011 at 1:07 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 I'll take another crack at it.  I'm not entirely sold yet on merging
 the two structs; I think first we'd better look and see what the needs
 are in the other potential callers I mentioned.  If we'd end up
 cluttering the struct with half a dozen weird fields, it'd be better to
 stick to a minimal interface struct with various wrapper structs, IMO.

 OK.  I'll defer to whatever you come up with after looking at it.

 OK, it looks like nodeMergeAppend.c could use something exactly like the
 draft SortKey struct, while nodeMergejoin.c could embed such a struct in
 MergeJoinClauseData.  The btree stuff needs something more nearly
 equivalent to a ScanKey, including a datum-to-compare-to and a flags
 field.  I'm inclined to think the latter would be too specialized to put
 in the generic struct.  On the other hand, including the reverse and
 nulls_first flags in the generic struct is clearly a win since it allows
 ApplyComparator() to be defined as a generic function.  So the only
 thing that's really debatable is the attno field, and I'm not anal
 enough to insist on a separate level of struct just for that.

 I am however inclined to stick with the shortened struct name SortSupport
 rather than using SortKey.  The presence of the function pointer fields
 (especially the inlined-qsort pointers, assuming we adopt some form of
 Peter's patch) changes the struct's nature in my view; it's not really
 describing just a sort key (ie an ORDER BY column specification).

Works for me.  I think we should go ahead and get this part committed
first, and then we can look at the inlining stuff as a further
optimization for certain cases...

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Inlining comparators as a performance optimisation

2011-12-06 Thread Peter Geoghegan
On 7 December 2011 00:18, Robert Haas robertmh...@gmail.com wrote:
 Works for me.  I think we should go ahead and get this part committed
 first, and then we can look at the inlining stuff as a further
 optimization for certain cases...

Do you mean just inlining, or inlining and the numerous other
optimisations that my patch had?

-- 
Peter Geoghegan       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training and Services



Re: [HACKERS] Inlining comparators as a performance optimisation

2011-12-06 Thread Robert Haas
On Tue, Dec 6, 2011 at 8:13 PM, Peter Geoghegan pe...@2ndquadrant.com wrote:
 On 7 December 2011 00:18, Robert Haas robertmh...@gmail.com wrote:
 Works for me.  I think we should go ahead and get this part committed
 first, and then we can look at the inlining stuff as a further
 optimization for certain cases...

 Do you mean just inlining, or inlining and the numerous other
 optimisations that my patch had?

Whichever you like.  But I think part of the point here is to
disentangle those optimizations from each other and decide how broadly
it makes sense to apply each one.  Avoiding the FunctionCallInfo stuff
is one, and it seems like we can apply that to a wide variety of data
types (maybe all of them) for both in-memory and on-disk sorting, plus
btree index ops, merge joins, and merge append.  The gains may be
modest, but they will benefit many use cases.  Your original patch
targets a much narrower use case (in-memory sorting of POD types) but
the benefits are larger.  We don't have to pick between a general but
small optimization and a narrower but larger one; we can do both.

In this regard, I think Heikki's remarks upthread are worth some
thought.  If inlining is a win just because it avoids saving and
restoring registers or allows better instruction scheduling, then
inlining is the (probably?) the only way to get the benefit.  But if
most of the benefit is in having a separate path for the
single-sort-key case, we can do that without duplicating the qsort()
code and get the benefit for every data type without much code bloat.
I'd like to see us dig into that a little, so that we get the broadest
possible benefit out of this work.  It doesn't bother me that not
every optimization will apply to every case, and I don't object to
optimizations that are intrinsically narrow (within some reasonable
limits).  But I'd rather not take what could be a fairly broad-based
optimization and apply it only narrowly, all things being equal.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Inlining comparators as a performance optimisation

2011-12-05 Thread Heikki Linnakangas

On 05.12.2011 02:14, Peter Geoghegan wrote:

On 4 December 2011 19:17, Tom Lane t...@sss.pgh.pa.us wrote:

I have not done any performance testing on this patch, but it might be
interesting to check it with the same test cases Peter's been using.


I've attached a revision of exactly the same benchmark run to get the
results in results_server.ods .

You'll see very similar figures to results_server.ods for HEAD and for
my patch, as you'd expect. I think the results speak for themselves. I
maintain that we should use specialisations - that's where most of the
benefit is to be found.


I'm still not convinced the big gain is in inlining the comparison 
functions. Your patch contains a few orthogonal tricks, too:


* A fastpath for the case of just one scankey. No reason we can't do 
that without inlining.


* inlining the swap-functions within qsort. That's different from 
inlining the comparison functions, and could be done regardless of the 
data type. The qsort element size in tuplesort.c is always sizeof(SortTuple).


And there's some additional specializations we can do, not implemented 
in your patch, that don't depend on inlining the comparison functions:


* A fastpath for the case of no NULLs
* separate comparetup functions for ascending and descending sorts (this 
allows tail call elimination of the call to type-specific comparison 
function in the ascending version)

* call CHECK_FOR_INTERRUPTS() less frequently.

To see how much difference those things make, I hacked together the 
attached patch. I didn't base this on Robert's/Tom's patch, but instead 
just added a quick & dirty FunctionCallInvoke-overhead-skipping version 
of btint4cmp. I believe that aspect of this patch is similar in 
performance characteristics, though.


In my quick tests, it gets quite close in performance to your inlining 
patch, when sorting int4s and the input contains no NULLs. But please 
give it a try in your test environment, to get numbers comparable with 
your other tests.


Perhaps you can get even more gain by adding the no-NULLs, and asc/desc 
special cases to your inlining patch, too, but let's squeeze all the fat 
out of the non-inlined version first. One nice aspect of many of these 
non-inlining optimizations is that they help with external sorts, too.


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com
diff --git a/src/backend/utils/sort/tuplesort.c b/src/backend/utils/sort/tuplesort.c
index 3505236..85bec20 100644
--- a/src/backend/utils/sort/tuplesort.c
+++ b/src/backend/utils/sort/tuplesort.c
@@ -114,6 +114,8 @@
 #include "utils/rel.h"
 #include "utils/tuplesort.h"
 
+#include "utils/builtins.h"
+
 
 /* sort-type codes for sort__start probes */
 #define HEAP_SORT		0
@@ -273,6 +275,8 @@ struct Tuplesortstate
 	int			memtupcount;	/* number of tuples currently present */
 	int			memtupsize;		/* allocated length of memtuples array */
 
+	bool		hasnulls_initial;	/* any NULLs in the first key col? */
+
 	/*
 	 * While building initial runs, this is the current output run number
 	 * (starting at 0).  Afterwards, it is the number of initial runs we made.
@@ -341,6 +345,8 @@ struct Tuplesortstate
 	TupleDesc	tupDesc;
 	ScanKey		scanKeys;		/* array of length nKeys */
 
+	int (*comparator) (Datum, Datum);
+
 	/*
 	 * These variables are specific to the CLUSTER case; they are set by
 	 * tuplesort_begin_cluster.  Note CLUSTER also uses tupDesc and
@@ -459,6 +465,11 @@ static unsigned int getlen(Tuplesortstate *state, int tapenum, bool eofOK);
 static void markrunend(Tuplesortstate *state, int tapenum);
 static int comparetup_heap(const SortTuple *a, const SortTuple *b,
 Tuplesortstate *state);
+static int comparetup_heap_1key_asc_nonulls(const SortTuple *a, const SortTuple *b,
+Tuplesortstate *state);
+static int comparetup_heap_1key_asc_nonulls_btint4cmp(const SortTuple *a, const SortTuple *b,
+Tuplesortstate *state);
+static int compare_int4(Datum first, Datum second);
 static void copytup_heap(Tuplesortstate *state, SortTuple *stup, void *tup);
 static void writetup_heap(Tuplesortstate *state, int tapenum,
 			  SortTuple *stup);
@@ -551,6 +562,7 @@ tuplesort_begin_common(int workMem, bool randomAccess)
 	state->availMem = state->allowedMem;
 	state->sortcontext = sortcontext;
 	state->tapeset = NULL;
+	state->hasnulls_initial = false;
 
 	state->memtupcount = 0;
 	state->memtupsize = 1024;	/* initial guess */
@@ -1247,11 +1259,41 @@ tuplesort_performsort(Tuplesortstate *state)
 			 * amount of memory.  Just qsort 'em and we're done.
 			 */
			if (state->memtupcount > 1)
+			{
+int	(*comparetup) (const SortTuple *a, const SortTuple *b,
+   Tuplesortstate *state);
+
+/* If there is only one key col, the input contains no NULLs,
+ * and it's ascending, we can use a fastpath version of
+ * comparetup_heap that skips over those things.
+ */
+if (state->comparetup == comparetup_heap &&
+	state->nKeys == 1 &&
+	!(state->scanKeys[0].sk_flags & SK_BT_DESC) &&
+	

Re: [HACKERS] Inlining comparators as a performance optimisation

2011-12-05 Thread Peter Geoghegan
On 5 December 2011 13:23, Heikki Linnakangas
heikki.linnakan...@enterprisedb.com wrote:
 I'm still not convinced the big gain is in inlining the comparison
 functions. Your patch contains a few orthogonal tricks, too:

 * A fastpath for the case of just one scankey. No reason we can't do that
 without inlining.

True, though Tom did seem to not like that idea very much.

 * inlining the swap-functions within qsort. That's different from inlining
 the comparison functions, and could be done regardless of the data type. The
 qsort element size in tuplesort.c is always sizeof(SortTuple)

Sure. I think that Tom mostly objects to hard-coded intelligence about
which datatypes are used, rather than specialisations per se.

 And there's some additional specializations we can do, not implemented in
 your patch, that don't depend on inlining the comparison functions:

 * A fastpath for the case of no NULLs
 * separate comparetup functions for ascending and descending sorts (this
 allows tail call elimination of the call to type-specific comparison
 function in the ascending version)
 * call CHECK_FOR_INTERRUPTS() less frequently.

All of those had occurred to me, except the last - if you look at the
definition of that macro, it looks like more trouble than it's worth
to do less frequently. I didn't revise my patch with them though,
because the difference that they made, while clearly measurable,
seemed fairly small, or at least relatively so. I wasn't about to add
additional kludge for marginal benefits given the existing
controversy, though I have not dismissed the idea entirely - it could
be important on some other machine.

 Perhaps you can get even more gain by adding the no-NULLs, and asc/desc
 special cases to your inlining patch, too, but let's squeeze all the fat out
 of the non-inlined version first.

As I said, those things are simple enough to test, and were not found
to make a significant difference, at least to my exact test case +
hardware.

 One nice aspect of many of these
 non-inlining optimizations is that they help with external sorts, too.

I'd expect the benefits to be quite a lot smaller there, but fair point.

Results from running the same test on your patch are attached. I think
that while you're right to suggest that the inlining of the comparator
proper, rather than any other function or other optimisation isn't
playing a dominant role in these optimisations, I think that you're
underestimating the role of locality of reference. To illustrate this,
I've also included results for when I simply move the comparator to
another translation unit, logtape.c. There's clearly a regression
from doing so (I ran the numbers twice, in a clinical environment).
The point is, there is a basically unknown overhead paid for not
inlining - maybe inlining is not worth it, but that's a hard call to
make.

I didn't try to make the numbers look worse by putting the comparator
proper in some other translation unit, but it may be that I'd have
shown considerably worse numbers by doing so.

Why the strong aversion to what I've done? I accept that it's ugly,
but is it really worth worrying about that sort of regression in
maintainability for something that was basically untouched since 2006,
and will probably remain untouched after this work concludes, whatever
happens?

-- 
Peter Geoghegan       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training and Services


results_server_w_heikki.ods
Description: application/vnd.oasis.opendocument.spreadsheet



Re: [HACKERS] Inlining comparators as a performance optimisation

2011-12-05 Thread Tom Lane
Peter Geoghegan pe...@2ndquadrant.com writes:
 Why the strong aversion to what I've done? I accept that it's ugly,
 but is it really worth worrying about that sort of regression in
 maintainability for something that was basically untouched since 2006,
 and will probably remain untouched after this work concludes, whatever
 happens?

Maintainability is only part of the issue --- though it's definitely one
part, since this code has hardly gone untouched since 2006, cf
http://git.postgresql.org/gitweb/?p=postgresql.git;a=history;f=src/backend/utils/sort/tuplesort.c;hb=HEAD

What is bothering me is that this approach is going to cause substantial
bloat of the executable code, and such bloat has got distributed costs,
which we don't have any really good way to measure but for sure
micro-benchmarks addressing only sort speed aren't going to reveal them.
Cache lines filled with sort code take cycles to flush and replace with
something else.

I think it's possibly reasonable to have inlined copies of qsort for a
small number of datatypes, but it seems much less reasonable to have
multiple copies per datatype in order to obtain progressively tinier
micro-benchmark speedups.  We need to figure out what the tradeoff
against executable size really is, but so far it seems like you've been
ignoring the fact that there is any such tradeoff at all.

regards, tom lane



Re: [HACKERS] Inlining comparators as a performance optimisation

2011-12-04 Thread Tom Lane
Robert Haas robertmh...@gmail.com writes:
 OK, so I tried to code this up.  Adding the new amproc wasn't too
 difficult (see attached).  It wasn't obvious to me how to tie it into
 the tuplesort infrastructure, though, so instead of wasting time
 guessing what a sensible approach might be I'm going to use one of my
 lifelines and poll the audience (or is that ask an expert?).

Here's another cut at this.  I only went as far as converting the
heap-sort code path in tuplesort.c, so there's lots more to do, but
this does sort integers successfully.  Before expanding the patch
to do more, I think we need to have consensus on some API details,
in particular:

* I invented a SortKey struct that replaces ScanKey for tuplesort's
purposes.  Right now that's local in tuplesort.c, but we're more than
likely going to need it elsewhere as well.  Should we just define it
in sortsupport.h?  Or perhaps we should just add the few additional
fields to SortSupportInfoData, and not bother with two struct types?
Before you answer, consider the next point.

* I wonder whether it would be worthwhile to elide inlineApplyComparator
altogether, pushing what it does down to the level of the
datatype-specific functions.  That would require changing the
comparator API to include isnull flags, and exposing the
reverse/nulls_first sort flags to the comparators (presumably by
including them in SortSupportInfoData).  The main way in which that
could be a win would be if the setup function could choose one of four
comparator functions that are pre-specialized for each flags
combination; but that seems like it would add a lot of code bulk, and
the bigger problem is that we need to be able to change the flags after
sort initialization (cf. the reversedirection code in tuplesort.c), so
we'd also need some kind of re-select the comparator call API.  On the
whole this doesn't seem promising, but maybe somebody else has a
different idea.

* We're going to want to expose PrepareSortSupportComparisonShim
for use outside tuplesort.c too, and possibly refactor
tuplesort_begin_heap so that the SortKey setup logic inside it
can be extracted for use elsewhere.  Shall we just add those to
tuplesort's API, or would it be better to create a sortsupport.c
with these sorts of functions?

I have not done any performance testing on this patch, but it might be
interesting to check it with the same test cases Peter's been using.

regards, tom lane


diff --git a/doc/src/sgml/xindex.sgml b/doc/src/sgml/xindex.sgml
index 28324361a950f31737c0cbb6909726d02ce1af5f..51f847c9f907756a823e1f76ef0275bceed7 100644
*** a/doc/src/sgml/xindex.sgml
--- b/doc/src/sgml/xindex.sgml
*** DEFAULT FOR TYPE int4 USING btree FAMILY
*** 758,764 
OPERATOR 3 = ,
  OPERATOR 4 >= ,
  OPERATOR 5 > ,
!   FUNCTION 1 btint4cmp(int4, int4) ;
  
  CREATE OPERATOR CLASS int2_ops
  DEFAULT FOR TYPE int2 USING btree FAMILY integer_ops AS
--- 758,765 
OPERATOR 3 = ,
  OPERATOR 4 >= ,
  OPERATOR 5 > ,
!   FUNCTION 1 btint4cmp(int4, int4) ,
!   FUNCTION 2 btint4sortsupport(internal) ;
  
  CREATE OPERATOR CLASS int2_ops
  DEFAULT FOR TYPE int2 USING btree FAMILY integer_ops AS
diff --git a/src/backend/access/nbtree/nbtcompare.c b/src/backend/access/nbtree/nbtcompare.c
index 23f2b61fe902cff7eebcf8526cf8116d90be75e8..62d03f96de07cbbe14dedefb1596220af398caea 100644
*** a/src/backend/access/nbtree/nbtcompare.c
--- b/src/backend/access/nbtree/nbtcompare.c
***
*** 49,54 
--- 49,55 
  #include postgres.h
  
  #include "utils/builtins.h"
+ #include "utils/sortsupport.h"
  
  
  Datum
*** btint4cmp(PG_FUNCTION_ARGS)
*** 83,88 
--- 84,112 
  		PG_RETURN_INT32(-1);
  }
  
+ static int
+ btint4fastcmp(Datum x, Datum y, SortSupportInfo sinfo)
+ {
+ 	int32		a = DatumGetInt32(x);
+ 	int32		b = DatumGetInt32(y);
+ 
+ 	if (a > b)
+ 		return 1;
+ 	else if (a == b)
+ 		return 0;
+ 	else
+ 		return -1;
+ }
+ 
+ Datum
+ btint4sortsupport(PG_FUNCTION_ARGS)
+ {
+ 	SortSupportInfo	sinfo = (SortSupportInfo) PG_GETARG_POINTER(0);
+ 
+ 	sinfo->comparator = btint4fastcmp;
+ 	PG_RETURN_VOID();
+ }
+ 
  Datum
  btint8cmp(PG_FUNCTION_ARGS)
  {
diff --git a/src/backend/utils/cache/lsyscache.c b/src/backend/utils/cache/lsyscache.c
index cb341b8db679382c3ab8de31ff7a0bb57bc8dc50..cb6a38bcfd29fb17fc62040eef79d34aee9fe802 100644
*** a/src/backend/utils/cache/lsyscache.c
--- b/src/backend/utils/cache/lsyscache.c
*** get_compare_function_for_ordering_op(Oid
*** 287,292 
--- 287,348 
  }
  
  /*
+  * get_sort_function_for_ordering_op
+  *		Get the OID of the datatype-specific btree sort support function,
+  *		or if there is none, the btree comparison function,
+  *		associated with an ordering operator (a  or  operator).
+  *
+  * *sortfunc receives the support or comparison function OID.
+  * *issupport is set TRUE if it's a support func, FALSE if a comparison func.
+  * *reverse is set FALSE if the operator is <, TRUE if it's >

Re: [HACKERS] Inlining comparators as a performance optimisation

2011-12-04 Thread Peter Geoghegan
On 4 December 2011 19:17, Tom Lane t...@sss.pgh.pa.us wrote:
 I have not done any performance testing on this patch, but it might be
 interesting to check it with the same test cases Peter's been using.

I've attached a revision of exactly the same benchmark run to get the
results in results_server.ods .

You'll see very similar figures to results_server.ods for HEAD and for
my patch, as you'd expect. I think the results speak for themselves. I
maintain that we should use specialisations - that's where most of the
benefit is to be found.

-- 
Peter Geoghegan       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training and Services


results_server_new.ods
Description: application/vnd.oasis.opendocument.spreadsheet



Re: [HACKERS] Inlining comparators as a performance optimisation

2011-12-03 Thread Tom Lane
Robert Haas robertmh...@gmail.com writes:
 OK, so I tried to code this up.  Adding the new amproc wasn't too
 difficult (see attached).  It wasn't obvious to me how to tie it into
 the tuplesort infrastructure, though, so instead of wasting time
 guessing what a sensible approach might be I'm going to use one of my
 lifelines and poll the audience (or is that ask an expert?).

OK, I'll take a whack at improving this over the weekend.

regards, tom lane



Re: [HACKERS] Inlining comparators as a performance optimisation

2011-12-02 Thread Peter Geoghegan
On 2 December 2011 01:13, Tom Lane t...@sss.pgh.pa.us wrote:
 Regardless of that, I'm still of the opinion that this patch is
 fundamentally misdesigned.  The way it's set up, it is only possible to
 have a fast path for types that are hard-wired into tuplesort.c, and
 that flies in the face of twenty years' worth of effort to make Postgres
 an extensible system.

What the user doesn't know can't hurt them. I'm not of the opinion
that an internal optimisation flies in the face of anything - is it
really so hard for you to hold your nose for a fairly isolated patch
like this?

 I really don't care about performance
 measurements: this is not an acceptable design, period.

If ever there was a justification for doing something so distasteful,
I would think that a 60% decrease in the time taken to perform a raw
sort for POD types (plus common covariants) would be it.

 What I want to see is some mechanism that pushes this out to the
 individual datatypes, so that both core and plug-in datatypes have
 access to the benefits.  There are a number of ways that could be
 attacked, but the most obvious one is to invent additional, optional
 support function(s) for btree opclasses.

 I also still think that we should pursue the idea of a lower-overhead
 API for comparison functions.  It may be that it's worth the code bulk
 to have an inlined copy of qsort for a small number of high-usage
 datatypes, but I think there are going to be a lot for which we aren't
 going to want to pay that price, and yet we could get some benefit from
 cutting the call overhead.

I'm not opposed to that. It just seems fairly tangential to what I've
done here, as well as considerably less important to users,
particularly when you look at the numbers I've produced. It would be
nice to get a lesser gain for text and things like that too, but other
than that I don't see a huge demand.

 A lower-overhead API would also save cycles
 in usage of the comparison functions from btree itself (remember that?).

Yes, I remember that. Why not do something similar there, and get
similarly large benefits?

 I think in particular that the proposed approach to multiple sort keys
 is bogus and should be replaced by just using lower-overhead
 comparators.  The gain per added code byte in doing it as proposed
 has got to be too low to be reasonable.

Reasonable by what standard? We may well be better off with
lower-overhead comparators for the multiple scanKey case (though I
wouldn't like to bet on it) but we would most certainly not be better
off without discriminating between single scanKey and multiple scanKey
cases as I have.

Look at the spreadsheet results_server.ods. Compare the "not inlined"
and "optimized" sheets. It's clear from those numbers that, through a
combination of inlining and of simply not having to consider a case
that will never come up (multiple scanKeys), the inlining
specialisation far outperforms the non-inlining, multiple scanKey
specialization for the same data (I simply commented out inlining
specializations so that non-inlining specialisations would always be
called for int4 and friends, even if there was only a single scanKey).
That's where the really dramatic improvements are seen. It's well
worth having this additional benefit, because it's such a common case
and the benefit is so big, and while I will be quite happy to pursue a
plan to bring some of these benefits to all types such as the one you
describe, I do not want to jettison the additional benefits described
to do so. Isn't that reasonable?

We're talking about taking 6778.302ms off a 20468.0ms query, rather
than 1860.332ms. Not exactly a subtle difference. Assuming that those
figures are representative of the gains to be had by a generic
mechanism that does not inline/specialize across number of scanKeys,
are you sure that that's worthwhile?

-- 
Peter Geoghegan       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training and Services



Re: [HACKERS] Inlining comparators as a performance optimisation

2011-12-02 Thread Robert Haas
On Fri, Dec 2, 2011 at 12:16 AM, Tom Lane t...@sss.pgh.pa.us wrote:
 It should be the same or better.  Right now, we are going through
 FunctionCall2Coll to reach the FCI-style comparator.  The shim function
 would be more or less equivalent to that, and since it's quite special
 purpose I would hope we could shave a cycle or two.  For instance, we
 could probably afford to set up a dedicated FunctionCallInfo struct
 associated with the SortSupportInfo struct, and not have to reinitialize
 one each time.

OK, but I think it's also going to cost you an extra syscache lookup,
no?  You're going to have to check for this new support function
first, and then if you don't find it, you'll have to check for the
original one.  I don't think there's any higher-level caching around
opfamilies to save our bacon here, is there?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Inlining comparators as a performance optimisation

2011-12-02 Thread Tom Lane
Robert Haas robertmh...@gmail.com writes:
 OK, but I think it's also going to cost you an extra syscache lookup,
 no?  You're going to have to check for this new support function
 first, and then if you don't find it, you'll have to check for the
 original one.  I don't think there's any higher-level caching around
 opfamilies to save our bacon here, is there?

[ shrug... ] If you are bothered by that, get off your duff and provide
the function for your datatype.  But it's certainly going to be in the
noise for btree index usage, and I submit that query parsing/setup
involves quite a lot of syscache lookups already.  I think that as a
performance objection, the above is nonsensical.  (And I would also
comment that your proposal with a handle is going to involve a table
search that's at least as expensive as a syscache lookup.)

regards, tom lane



Re: [HACKERS] Inlining comparators as a performance optimisation

2011-12-02 Thread Bruce Momjian
Tom Lane wrote:
 Robert Haas robertmh...@gmail.com writes:
  OK, but I think it's also going to cost you an extra syscache lookup,
  no?  You're going to have to check for this new support function
  first, and then if you don't find it, you'll have to check for the
  original one.  I don't think there's any higher-level caching around
  opfamilies to save our bacon here, is there?
 
 [ shrug... ] If you are bothered by that, get off your duff and provide
 the function for your datatype.  But it's certainly going to be in the
 noise for btree index usage, and I submit that query parsing/setup
 involves quite a lot of syscache lookups already.  I think that as a
 performance objection, the above is nonsensical.  (And I would also
 comment that your proposal with a handle is going to involve a table
 search that's at least as expensive as a syscache lookup.)

Agreed.  Doing something once and doing something in the sort loop are
two different overheads.

I am excited by this major speedup Peter Geoghegan has found.  Postgres
doesn't have parallel query, which is often used for sorting, so we are
already behind some of the databases we are compared against.  Getting this
speedup is definitely going to help us.  And I do like the generic
nature of where we are heading!

-- 
  Bruce Momjian  br...@momjian.ushttp://momjian.us
  EnterpriseDB http://enterprisedb.com

  + It's impossible for everything to be true. +



Re: [HACKERS] Inlining comparators as a performance optimisation

2011-12-02 Thread Pavel Stehule

 [ shrug... ] If you are bothered by that, get off your duff and provide
 the function for your datatype.  But it's certainly going to be in the
 noise for btree index usage, and I submit that query parsing/setup
 involves quite a lot of syscache lookups already.  I think that as a
 performance objection, the above is nonsensical.  (And I would also
 comment that your proposal with a handle is going to involve a table
 search that's at least as expensive as a syscache lookup.)

 Agreed.  Doing something once and doing something in the sort loop are
 two different overheads.

 I am excited by this major speedup Peter Geoghegan has found.  Postgres
 doesn't have parallel query, which is often used for sorting, so we are
 already behind some of the databases are compared against.  Getting this
 speedup is definitely going to help us.  And I do like the generic
 nature of where we are heading!


Oracle does not (or did not) have parallel sort either, and I have
reports that Oracle sorts faster than PostgreSQL (but without any
numbers). So any solution is welcome.

Regards

Pavel



Re: [HACKERS] Inlining comparators as a performance optimisation

2011-12-02 Thread Robert Haas
On Fri, Dec 2, 2011 at 2:35 PM, Bruce Momjian br...@momjian.us wrote:
 Agreed.  Doing something once and doing something in the sort loop are
 two different overheads.

OK, so I tried to code this up.  Adding the new amproc wasn't too
difficult (see attached).  It wasn't obvious to me how to tie it into
the tuplesort infrastructure, though, so instead of wasting time
guessing what a sensible approach might be I'm going to use one of my
lifelines and poll the audience (or is that ask an expert?).
Currently the Tuplesortstate[1] has a pointer to an array of
ScanKeyData objects, one per column being sorted.  But now instead of
FmgrInfo sk_func, the tuplesort code is going to want each scankey
to contain SortSupportInfo(Data?) sk_sortsupport[2], because that's
where we get the comparison function from.   Should I just go ahead
and add one more member to that struct, or is there some more
appropriate way to handle this?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

[1] Consistent capitalization is for wimps.
[2] Hey, we could abbreviate that SSI!  Oh, wait...


sort-support.patch
Description: Binary data



Re: [HACKERS] Inlining comparators as a performance optimisation

2011-12-02 Thread Bruce Momjian
Robert Haas wrote:
 On Fri, Dec 2, 2011 at 2:35 PM, Bruce Momjian br...@momjian.us wrote:
 Agreed.  Doing something once and doing something in the sort loop are
  two different overheads.
 
 OK, so I tried to code this up.  Adding the new amproc wasn't too
 difficult (see attached).  It wasn't obvious to me how to tie it into
 the tuplesort infrastructure, though, so instead of wasting time
 guessing what a sensible approach might be I'm going to use one of my
 lifelines and poll the audience (or is that ask an expert?).
 Currently the Tuplesortstate[1] has a pointer to an array of
 ScanKeyData objects, one per column being sorted.  But now instead of
 FmgrInfo sk_func, the tuplesort code is going to want each scankey
 to contain SortSupportInfo(Data?) sk_sortsupport[2], because that's
 where we get the comparison function from.   Should I just go ahead
 and add one more member to that struct, or is there some more
 appropriate way to handle this?

Is this code immediately usable anywhere else in our codebase, and if
so, is it generic enough?

-- 
  Bruce Momjian  br...@momjian.us        http://momjian.us
  EnterpriseDB http://enterprisedb.com

  + It's impossible for everything to be true. +



Re: [HACKERS] Inlining comparators as a performance optimisation

2011-12-01 Thread Peter Geoghegan
Attached is revision of my patch with some clean-ups. In particular,
I'm now using switch statements for greater readability, plus
supporting fast path sorting of the time datatype. I've also updated
the documentation on Date/Time Types to note the additional
disadvantage of using the deprecated "store timestamp + friends as
double precision floating-point numbers" compile-time option.

There is one aspect to this optimisation that I haven't touched on,
which is the effect on memory consumption. I think that much of the
value that this patch will deliver will come from being able to
release sort memory earlier. Consider that the substantial
improvements in raw sorting speed (far more substantial than the
improvements in query runtime) will sometimes result in a concomitant
reduction in the time that the executor holds onto memory allocated
for sorting. Maybe the effect will only be really noticeable for plans
with a sort node as their root node, but that isn't exactly a rare
occurrence, particularly among large, expensive sorts.

-- 
Peter Geoghegan       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training and Services
From cffad91578edd698e89ef2bf7384d82aee26b57e Mon Sep 17 00:00:00 2001
From: peter2ndQuadrant pe...@2ndquadrant.com
Date: Sun, 20 Nov 2011 20:36:25 +
Subject: [PATCH] Initial commit of optimization

Stop directly using oid

Added int8 quicksort fast path specialisation, which can also be used in
place of F_TIMESTAMP_CMP for HAVE_INT64_TIMESTAMP builds.

Rebased, revised patch for -hackers, with timestamp and int8 fast path
sorting using the same int8 specialization.

Remove unneeded line

Rebase for -hackers

Discriminate against cases where nKeys > 1 when selecting sort
specialisation.

We now support fast path sorting with multiple scanKeys.

We now inline for one scanKey, but do not inline our comparetup_heap
replacement when multiple scanKeys are used (although this is at the
compiler's discretion). However, when multiple scanKeys are used, we
still use a specialisation that will elide the SQL-function-call
machinery for the first scanKey, which is almost as good for most cases.

Further formatting clean-ups

Move specialization selection logic to tuplesort_begin_heap(), per
suggestion of Robert Haas

Add simple switch for testing optimization.

Improvements to template qsort_arg comments. Support fast path sorting
of dates.

Added float fast path sorting

Patch minimization.

Additional clean-up.

Indentation issue.

Specialization clean-ups

Use switch statement for sort specialization selection

Further comment clean-up.

Additional clean-up

Make comparators consistent

Added date fast path support, documentation updates

Fix template_qsort_arg.h assertion

Assert nScanKeys == 1 for inlining case
---
 doc/src/sgml/datatype.sgml |6 +
 src/backend/utils/sort/tuplesort.c |  223 -
 src/include/utils/template_qsort_arg.h |  288 
 src/port/qsort.c   |3 +-
 4 files changed, 512 insertions(+), 8 deletions(-)
 create mode 100644 src/include/utils/template_qsort_arg.h

diff --git a/doc/src/sgml/datatype.sgml b/doc/src/sgml/datatype.sgml
index fe59a1c..f5dfbbb 100644
--- a/doc/src/sgml/datatype.sgml
+++ b/doc/src/sgml/datatype.sgml
@@ -1619,6 +1619,12 @@ SELECT E'\\xDEADBEEF';
     floating-point case, large <type>interval</type> values degrade in
     precision as the size of the interval increases.
    </para>
+   <para>
+    Note that if the server is built to represent these types as
+    floating point numbers, that will prevent it from using fast path
+    sort specializations for the types, and sorting columns of those
+    types will be noticeably slower.
+   </para>
   </note>
 
    <para>
diff --git a/src/backend/utils/sort/tuplesort.c b/src/backend/utils/sort/tuplesort.c
index 3505236..0a14a90 100644
--- a/src/backend/utils/sort/tuplesort.c
+++ b/src/backend/utils/sort/tuplesort.c
@@ -107,11 +107,13 @@
 #include "miscadmin.h"
 #include "pg_trace.h"
 #include "utils/datum.h"
+#include "utils/fmgroids.h"
 #include "utils/logtape.h"
 #include "utils/lsyscache.h"
 #include "utils/memutils.h"
 #include "utils/pg_rusage.h"
 #include "utils/rel.h"
+#include "utils/template_qsort_arg.h"
 #include "utils/tuplesort.h"
 
 
@@ -226,6 +228,13 @@ struct Tuplesortstate
 		   Tuplesortstate *state);
 
 	/*
+	 * This specialization function pointer is sometimes used as an alternative
+	 * to the standard qsort_arg, when it has been determined that we can 
+	 * benefit from various per-type performance optimizations.
+	 */
+	void (*qsort_arg_spec)(void *a, size_t n, size_t es, void *arg);
+
+	/*
 	 * Function to copy a supplied input tuple into palloc'd space and set up
 	 * its SortTuple representation (ie, set tuple/datum1/isnull1).  Also,
 	 * state-availMem must be decreased by the amount of space used for the
@@ -457,6 +466,11 @@ static void tuplesort_heap_insert(Tuplesortstate *state, SortTuple 

Re: [HACKERS] Inlining comparators as a performance optimisation

2011-12-01 Thread Robert Haas
On Thu, Dec 1, 2011 at 11:44 AM, Peter Geoghegan pe...@2ndquadrant.com wrote:
 Attached is revision of my patch with some clean-ups. In particular,
 I'm now using switch statements for greater readability, plus
 supporting fast path sorting of the time datatype. I've also updated
 the documentation on Date/Time Types to note the additional
 disadvantage of using the deprecated "store timestamp + friends as
 double precision floating-point numbers" compile-time option.

One thing I'm starting to get a bit concerned about is the effect of
this patch on the size of the resulting binary.  The performance
results you've posted are getting steadily more impressive as you get
into this, which is cool, but it seems like the number of copies of
the qsort logic that you're proposing to include in the resulting
binary is also steadily climbing.  On my system, a stripped postgres
binary built with my usual compile options (except --enable-cassert,
which I took out) is 49336 bytes bigger with this patch applied, an
increase of close to 1%.  We might need to be a little bit choosy
about this, because I don't think that we want to end up with a
situation where some noticeable percentage of the final binary
consists of differently-inlined versions of the quicksort algorithm -
especially because there may be other places where we want to do
similar kinds of inlining.

Thoughts?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Inlining comparators as a performance optimisation

2011-12-01 Thread Tom Lane
Robert Haas robertmh...@gmail.com writes:
 On Thu, Dec 1, 2011 at 11:44 AM, Peter Geoghegan pe...@2ndquadrant.com 
 wrote:
 Attached is revision of my patch with some clean-ups.

 One thing I'm starting to get a bit concerned about is the effect of
 this patch on the size of the resulting binary.

Regardless of that, I'm still of the opinion that this patch is
fundamentally misdesigned.  The way it's set up, it is only possible to
have a fast path for types that are hard-wired into tuplesort.c, and
that flies in the face of twenty years' worth of effort to make Postgres
an extensible system.  I really don't care about performance
measurements: this is not an acceptable design, period.

What I want to see is some mechanism that pushes this out to the
individual datatypes, so that both core and plug-in datatypes have
access to the benefits.  There are a number of ways that could be
attacked, but the most obvious one is to invent additional, optional
support function(s) for btree opclasses.

I also still think that we should pursue the idea of a lower-overhead
API for comparison functions.  It may be that it's worth the code bulk
to have an inlined copy of qsort for a small number of high-usage
datatypes, but I think there are going to be a lot for which we aren't
going to want to pay that price, and yet we could get some benefit from
cutting the call overhead.  A lower-overhead API would also save cycles
in usage of the comparison functions from btree itself (remember that?).

I think in particular that the proposed approach to multiple sort keys
is bogus and should be replaced by just using lower-overhead
comparators.  The gain per added code byte in doing it as proposed
has got to be too low to be reasonable.

regards, tom lane



Re: [HACKERS] Inlining comparators as a performance optimisation

2011-12-01 Thread Peter Geoghegan
On 1 December 2011 17:15, Robert Haas robertmh...@gmail.com wrote:
 One thing I'm starting to get a bit concerned about is the effect of
 this patch on the size of the resulting binary.  The performance
 results you've posted are getting steadily more impressive as you get
 into this, which is cool, but it seems like the number of copies of
 the qsort logic that you're proposing to include in the resulting
 binary is also steadily climbing.  On my system, a stripped postgres
 binary built with my usual compile options (except --enable-cassert,
 which I took out) is 49336 bytes bigger with this patch applied, an
 increase of close to 1%.

Do your usual compile options include debug symbols? I've been using
standard compile options for development of this patch, for obvious
reasons. I get 36690 bytes (just under 36 KiB, or a 0.644% increase).

Binary bloat is a legitimate concern. I thought that I was
conservative in my choice of specialisations though.

 We might need to be a little bit choosy
 about this, because I don't think that we want to end up with a
 situation where some noticeable percentage of the final binary
 consists of differently-inlined versions of the quicksort algorithm -
 especially because there may be other places where we want to do
 similar kinds of inlining.

 Thoughts?

A simple caveat in a comment along the lines of "this mechanism
instantiates 2 copies of qsort_arg per type, please use judiciously"
sounds like the right balance. It could also be possible to be penny
wise and pound foolish here though.

-- 
Peter Geoghegan       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training and Services



Re: [HACKERS] Inlining comparators as a performance optimisation

2011-12-01 Thread Robert Haas
On Thu, Dec 1, 2011 at 8:29 PM, Peter Geoghegan pe...@2ndquadrant.com wrote:
 Do your usual compile options include debug symbols? I've been using
 standard compile options for development of this patch, for obvious
 reasons. I get 36690 bytes (just under 36 KiB, or a 0.644% increase).

They do, but I ran strip on the resulting binary before taking those
measurements, so I thought that would undo it...

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Inlining comparators as a performance optimisation

2011-12-01 Thread Robert Haas
On Thu, Dec 1, 2011 at 8:13 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 What I want to see is some mechanism that pushes this out to the
 individual datatypes, so that both core and plug-in datatypes have
 access to the benefits.  There are a number of ways that could be
 attacked, but the most obvious one is to invent additional, optional
 support function(s) for btree opclasses.

I feel like what's missing here is a way for the catalogs to refer to
a function that doesn't take FunctionCallInfo as an argument.  But it
strikes me that it wouldn't be very hard to design such a mechanism.
It seems like all we really need to do is decide on a nomenclature.
The simplest possible approach would probably be to just refer to
functions by name.  In other words, we add a text or name column to
pg_amproc, and then inside the backend there's a function whose job it
is to map the string to a function address.  However, that would have
the annoying effect of causing catalog bloat, so I'm inclined to
instead propose an 8-byte integer.  (A 4-byte integer is enough, but
would increase the risk of collisions between values that might be
chosen by third party extensions.)

So, the API to this would look something like this:

void register_arbitrary_function_pointer(int64 handle, void *funcptr);
void *get_arbitrary_function_pointer(int64 handle);

So the idea is that if you need to refer to an arbitrary function in a
system catalog (such as pg_amproc), you generate a random non-zero
64-bit integer, stuff it into the catalog, and then register the same
64-bit integer with the appropriate function pointer.  To avoid
increasing backend startup time, we could have a Perl script generate
a prefab hash table for the built-in functions; loadable modules would
add their entries at PG_init() time using
register_arbitrary_function_pointer.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Inlining comparators as a performance optimisation

2011-12-01 Thread Tom Lane
Robert Haas robertmh...@gmail.com writes:
 On Thu, Dec 1, 2011 at 8:13 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 What I want to see is some mechanism that pushes this out to the
 individual datatypes, so that both core and plug-in datatypes have
 access to the benefits.  There are a number of ways that could be
 attacked, but the most obvious one is to invent additional, optional
 support function(s) for btree opclasses.

 I feel like what's missing here is a way for the catalogs to refer to
 a function that doesn't take FunctionCallInfo as an argument.  But it
 strikes me that it wouldn't be very hard to design such a mechanism.

IMO, entries in pg_proc ought to refer to functions that are callable
by the standard interface: breaking that basic contract is not going to
lead anywhere pleasant.  Nor do I really want yet more columns in
pg_proc.  Nor does the register a pointer scheme you suggest make
any sense to me: you still need to connect these things to catalog
entries, pg_opclass entries in particular, and the intermediate handle
adds nothing to the situation except for creating a risk of collisions.

The scheme that was rolling around in my mind was about like this:

* Define btree opclasses to have an optional support function,
amprocnum 2, that has a SQL signature of func(internal) returns void.

* The actual argument represented by the internal parameter would
be a pointer to a struct which is to be filled in by the support
function.  We'd call the support function once, during tuplesort
setup or btree index relcache entry setup, and save the struct
somewhere.

* The struct contents would be pointers to functions that have a
non-FunctionCallInfo interface.  We know we need at least two:
a full qsort replacement, and a non-FCI comparator.  We might want
more in future, if someone convinces us that additional specializations
of sorting are worth the trouble.  So I imagine a struct like this:

typedef struct SortSupportFunctions {
void(*inline_qsort) (Datum *elements, int nelements);
int (*comparator) (Datum a, Datum b);
} SortSupportFunctions;

with the agreement that the caller must zero out the struct before call,
and then the btree support function sets the function pointers for any
specializations it's going to offer.  If we later add a third or fourth
function pointer, datatypes that know about that could fill in those
pointers, but datatypes that haven't been updated aren't broken.

One thing I'm not too certain about is whether to define the APIs just
as above, or to support a passthrough argument of some sort (and if so,
what does it reference)?  Possibly any datatype that we'd actually care
about this for is going to be simple enough to not need any state data.
Or possibly not.  And what about collations?

regards, tom lane



Re: [HACKERS] Inlining comparators as a performance optimisation

2011-12-01 Thread Robert Haas
On Thu, Dec 1, 2011 at 9:46 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 IMO, entries in pg_proc ought to refer to functions that are callable
 by the standard interface: breaking that basic contract is not going to
 lead anywhere pleasant.  Nor do I really want yet more columns in
 pg_proc.

I wasn't proposing to create pg_proc entries for this.

 Nor does the register a pointer scheme you suggest make
 any sense to me: you still need to connect these things to catalog
 entries, pg_opclass entries in particular, and the intermediate handle
 adds nothing to the situation except for creating a risk of collisions.

I think you might be misinterpreting what I had in mind.  Right now,
pg_amproc entries have an amproc column that points to a pg_proc
entry that in turn points to a function that takes
FunctionCallInfoData as an argument.  What I'm proposing to do is add
an additional column to that catalog that points more or less directly
to a non-SQL-callable function, but it can't actually just be the
address of the function because that's not stable.  So what I'm
proposing is that we interpose the thinnest possible shim layer
between the catalog and a function pointer, and an int64 -> function
pointer mapping seemed to me like something that would fit the bill.

 The scheme that was rolling around in my mind was about like this:

 * Define btree opclasses to have an optional support function,
 amprocnum 2, that has a SQL signature of func(internal) returns void.

 * The actual argument represented by the internal parameter would
 be a pointer to a struct which is to be filled in by the support
 function.  We'd call the support function once, during tuplesort
 setup or btree index relcache entry setup, and save the struct
 somewhere.

 * The struct contents would be pointers to functions that have a
 non-FunctionCallInfo interface.  We know we need at least two:
 a full qsort replacement, and a non-FCI comparator.  We might want
 more in future, if someone convinces us that additional specializations
 of sorting are worth the trouble.  So I imagine a struct like this:

        typedef struct SortSupportFunctions {
                void    (*inline_qsort) (Datum *elements, int nelements);
                int     (*comparator) (Datum a, Datum b);
        } SortSupportFunctions;

 with the agreement that the caller must zero out the struct before call,
 and then the btree support function sets the function pointers for any
 specializations it's going to offer.  If we later add a third or fourth
 function pointer, datatypes that know about that could fill in those
 pointers, but datatypes that haven't been updated aren't broken.

 One thing I'm not too certain about is whether to define the APIs just
 as above, or to support a passthrough argument of some sort (and if so,
 what does it reference)?  Possibly any datatype that we'd actually care
 about this for is going to be simple enough to not need any state data.
 Or possibly not.  And what about collations?

Maybe there should be a comparator_setup function that gets the
collation OID and returns void *, and then that void * value gets
passed as a third argument to each call to the comparator function.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Inlining comparators as a performance optimisation

2011-12-01 Thread Tom Lane
Robert Haas robertmh...@gmail.com writes:
 On Thu, Dec 1, 2011 at 9:46 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 Nor does the register a pointer scheme you suggest make
 any sense to me: you still need to connect these things to catalog
 entries, pg_opclass entries in particular, and the intermediate handle
 adds nothing to the situation except for creating a risk of collisions.

 I think you might be misinterpreting what I had in mind.  Right now,
 pg_amproc entries have an amproc column that points to a pg_proc
 entry that in turn points to a function that takes
 FunctionCallInfoData as an argument.  What I'm proposing to do is add
 an additional column to that catalog that points more or less directly
 to a non-SQL-callable function, but it can't actually just be the
 address of the function because that's not stable.  So what I'm
 proposing is that we interpose the thinnest possible shim layer
 between the catalog and a function pointer, and an int64 -> function
 pointer mapping seemed to me like something that would fit the bill.

But you still need a lot of mechanism to do that mapping, including an
initialization function that has noplace to be executed (unless every
.so that defines a btree opclass has to be listed in preload_libraries),
so it doesn't seem like the thinnest possible shim to me.

 One thing I'm not too certain about is whether to define the APIs just
 as above, or to support a passthrough argument of some sort (and if so,
 what does it reference)?  Possibly any datatype that we'd actually care
 about this for is going to be simple enough to not need any state data.
 Or possibly not.  And what about collations?

 Maybe there should be a comparator_setup function that gets the
 collation OID and returns void *, and then that void * value gets
 passed as a third argument to each call to the comparator function.

Maybe.  Or perhaps we could merge that work into the
function-pointer-setup function --- that is, give it the collation and
some extra workspace to fool with.  We would always know the
collation at the time we're doing that setup.

At this point the struct filled in by the setup function is starting
to feel a lot like FmgrInfo, and history teaches us a lot about what
needs to be in there.  So maybe

typedef struct SortSupportInfoData *SortSupportInfo;

typedef struct SortSupportInfoData
{
MemoryContext   info_cxt;   /* where to allocate subsidiary data */
void*extra; /* any data needed by functions */
Oid collation;  /* provided by caller */

void(*inline_qsort) (Datum *elements, int nelements,
 SortSupportInfo info);
int (*comparator) (Datum a, Datum b,
   SortSupportInfo info);
/* possibly other function pointers in future */
} SortSupportInfoData;

I am thinking that the btree code, at least, would want to just
unconditionally do

colsortinfo->comparator(datum1, datum2, colsortinfo)

so for an opclass that fails to supply the low-overhead comparator,
it would insert into the comparator pointer a shim function that
calls the opclass' old-style FCI-using comparator.  (Anybody who
complains about the added overhead would be told to get busy and
supply a low-overhead comparator for their datatype...)  But to do
that, we have to have enough infrastructure here to cover all cases,
so omitting collation or not having a place to stash an FmgrInfo
won't do.

regards, tom lane



Re: [HACKERS] Inlining comparators as a performance optimisation

2011-12-01 Thread Robert Haas
On Thu, Dec 1, 2011 at 10:48 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 But you still need a lot of mechanism to do that mapping, including an
 initialization function that has noplace to be executed (unless every
 .so that defines a btree opclass has to be listed in preload_libraries),
 so it doesn't seem like the thinnest possible shim to me.

PG_init?

 One thing I'm not too certain about is whether to define the APIs just
 as above, or to support a passthrough argument of some sort (and if so,
 what does it reference)?  Possibly any datatype that we'd actually care
 about this for is going to be simple enough to not need any state data.
 Or possibly not.  And what about collations?

 Maybe there should be a comparator_setup function that gets the
 collation OID and returns void *, and then that void * value gets
 passed as a third argument to each call to the comparator function.

 Maybe.  Or perhaps we could merge that work into the
 function-pointer-setup function --- that is, give it the collation and
 some extra workspace to fool with.  We would always know the
 collation at the time we're doing that setup.

Ah!  That seems quite nice.

 At this point the struct filled in by the setup function is starting
 to feel a lot like FmgrInfo, and history teaches us a lot about what
 needs to be in there.  So maybe

 typedef struct SortSupportInfoData *SortSupportInfo;

 typedef struct SortSupportInfoData
 {
        MemoryContext   info_cxt;       /* where to allocate subsidiary data */
        void            *extra;         /* any data needed by functions */
        Oid             collation;      /* provided by caller */

        void    (*inline_qsort) (Datum *elements, int nelements,
                                 SortSupportInfo info);
        int     (*comparator) (Datum a, Datum b,
                               SortSupportInfo info);
        /* possibly other function pointers in future */
 } SortSupportInfoData;

 I am thinking that the btree code, at least, would want to just
 unconditionally do

        colsortinfo->comparator(datum1, datum2, colsortinfo)

 so for an opclass that fails to supply the low-overhead comparator,
 it would insert into the comparator pointer a shim function that
 calls the opclass' old-style FCI-using comparator.  (Anybody who
 complains about the added overhead would be told to get busy and
 supply a low-overhead comparator for their datatype...)  But to do
 that, we have to have enough infrastructure here to cover all cases,
 so omitting collation or not having a place to stash an FmgrInfo
 won't do.

I'm slightly worried about whether that'll be adding too much overhead
to the case where there is no non-FCI comparator.  But it may be no
worse than what we're doing now.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Inlining comparators as a performance optimisation

2011-12-01 Thread Tom Lane
Robert Haas robertmh...@gmail.com writes:
 On Thu, Dec 1, 2011 at 10:48 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 I am thinking that the btree code, at least, would want to just
 unconditionally do
 
 colsortinfo->comparator(datum1, datum2, colsortinfo)
 
 so for an opclass that fails to supply the low-overhead comparator,
 it would insert into the comparator pointer a shim function that
 calls the opclass' old-style FCI-using comparator.  (Anybody who
 complains about the added overhead would be told to get busy and
 supply a low-overhead comparator for their datatype...)  But to do
 that, we have to have enough infrastructure here to cover all cases,
 so omitting collation or not having a place to stash an FmgrInfo
 won't do.

 I'm slightly worried about whether that'll be adding too much overhead
 to the case where there is no non-FCI comparator.  But it may be no
 worse than what we're doing now.

It should be the same or better.  Right now, we are going through
FunctionCall2Coll to reach the FCI-style comparator.  The shim function
would be more or less equivalent to that, and since it's quite special
purpose I would hope we could shave a cycle or two.  For instance, we
could probably afford to set up a dedicated FunctionCallInfo struct
associated with the SortSupportInfo struct, and not have to reinitialize
one each time.

regards, tom lane



Re: [HACKERS] Inlining comparators as a performance optimisation

2011-11-29 Thread Bruce Momjian
Peter Geoghegan wrote:
 Attached are the results from performing a similar process to the
 prior benchmark, but on Greg Smith's high-end server, and with an
 orderlines table that has been doubled-up until it is 1538 MB,
 making the same old query perform a quicksort that's over 3GB. Short
 version:  HEAD is 20468.0ms, with my patch it's 13689.698ms, so these
 gains hold-up very well for very large in-memory sorts, possibly even
 perfectly.
 
 Note that the doubling up had an interesting and desirable effect
 for the purposes of this patch review: It's approaching a worst case
 due to the low cardinality of the product + quantity columns, which
 crops up with the non-inlined functions only (int4regqsort_arg, etc).
 All too frequently, evaluating the first scankey results in equality,
 and so we cannot elide the SQL function call machinery. This is going
 to dampen the improvement for the multiple scanKey case considerably
 (and it looks like a smaller relative improvement than when the
 orderlines table was quite a lot smaller). As I said from the outset,
 there is no worst case for the single scanKey case that occurs to me.
 Multiple scanKeys are likely to be a problem for any effort to offer
 user-defined, per-type sort functions. I could probably make
 int4regqsort_arg and friends perform a bit better than we see here by
 increasing the cardinality of those two columns for the second query,
 so the first scanKey comparison would frequently suffice.

These are exciting advances you are producing and I am hopeful we can
get this included in Postgres 9.2.  I have mentioned already that I
think parallelism is the next big Postgres challenge, and of course, one
of the first areas for parallelism is sorting.  If you can speed up
sorting as you have reported, this allows us to provide faster sorting
now, and get even larger benefits if we implement sorting parallelism.

In fact, because Postgres 9.2 has so many performance improvements, we
might find that we have a more limited need for parallelism.

-- 
  Bruce Momjian  br...@momjian.us        http://momjian.us
  EnterpriseDB http://enterprisedb.com

  + It's impossible for everything to be true. +

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Inlining comparators as a performance optimisation

2011-11-29 Thread Peter Geoghegan
On 29 November 2011 15:31, Bruce Momjian br...@momjian.us wrote:
 These are exciting advances you are producing and I am hopeful we can
 get this included in Postgres 9.2.

Thanks Bruce.

I have mentioned already that I
 think parallelism is the next big Postgres challenge, and of course, one
 of the first areas for parallelism is sorting.

I'm not sure that sorting has that much to recommend it as an initial
target of some new backend parallelism other than being easy to
implement. I've observed the qsort_arg specialisations in this patch
out-perform stock qsort_arg by as much as almost 3 times. However, the
largest decrease in a query's time that I've observed was 45%, and
that was for a contrived worst-case for quicksort, but about 25% is
much more typical of queries similar to the ones I've shown, for more
normative data distributions. While that's a respectable gain, it
isn't a paradigm shifting one, and it makes parallelising qsort itself
for further improvements quite a lot less attractive - there's too
many other sources of overhead.

-- 
Peter Geoghegan       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training and Services



Re: [HACKERS] Inlining comparators as a performance optimisation

2011-11-29 Thread Bruce Momjian
Peter Geoghegan wrote:
 On 29 November 2011 15:31, Bruce Momjian br...@momjian.us wrote:
  These are exciting advances you are producing and I am hopeful we can
  get this included in Postgres 9.2.
 
 Thanks Bruce.
 
 I have mentioned already that I
  think parallelism is the next big Postgres challenge, and of course, one
  of the first areas for parallelism is sorting.
 
 I'm not sure that sorting has that much to recommend it as an initial
 target of some new backend parallelism other than being easy to
 implement. I've observed the qsort_arg specialisations in this patch
 out-perform stock qsort_arg by as much as almost 3 times. However, the
 largest decrease in a query's time that I've observed was 45%, and
 that was for a contrived worst-case for quicksort, but about 25% is
 much more typical of queries similar to the ones I've shown, for more
 normative data distributions. While that's a respectable gain, it
 isn't a paradigm shifting one, and it makes parallelising qsort itself
 for further improvements quite a lot less attractive - there's too
 many other sources of overhead.

Agreed.  I think your improvements make it likely we will not address
sort parallelism first.

With all the improvements coming in Postgres 9.2, we might need to look
at I/O parallelism first.

-- 
  Bruce Momjian  br...@momjian.us  http://momjian.us
  EnterpriseDB http://enterprisedb.com

  + It's impossible for everything to be true. +



Re: [HACKERS] Inlining comparators as a performance optimisation

2011-11-29 Thread Andres Freund
On Tuesday, November 29, 2011 07:48:37 PM Peter Geoghegan wrote:
 On 29 November 2011 15:31, Bruce Momjian br...@momjian.us wrote:
  These are exciting advances you are producing and I am hopeful we can
  get this included in Postgres 9.2.
 
 Thanks Bruce.
 
 I have mentioned already that I
 
  think parallelism is the next big Postgres challenge, and of course, one
  of the first areas for parallelism is sorting.
 
 I'm not sure that sorting has that much to recommend it as an initial
 target of some new backend parallelism other than being easy to
 implement. I've observed the qsort_arg specialisations in this patch
 out-perform stock qsort_arg by as much as almost 3 times. However, the
 largest decrease in a query's time that I've observed was 45%, and
 that was for a contrived worst-case for quicksort, but about 25% is
 much more typical of queries similar to the ones I've shown, for more
 normative data distributions. While that's a respectable gain, it
 isn't a paradigm shifting one, and it makes parallelising qsort itself
 for further improvements quite a lot less attractive - there's too
 many other sources of overhead.
I think that logic is faulty.

For one, I doubt that anybody is honestly suggesting parallelism inside qsort 
itself. It seems more likely/sensible to implement that at the level of 
merge sorting.
Index builds, for example, could hugely benefit from improvements at that level. 
With index builds you often get pretty non-optimal data distributions, btw...

I also seriously doubt that you will find an area inside pg's executor where 
parallelizing things will provide near-linear scaling without much, much more 
work.

Also, I wouldn't consider sorting the easiest target - especially at the qsort 
level - for parallelization, as you constantly need to execute user-defined 
operators with multiple input tuples, which has the usual problems.
COPY parsing + inserting or such seems to be a way easier target, for example. 
Even doing hashing + aggregation in different threads seems likely to be 
easier.

Andres




Re: [HACKERS] Inlining comparators as a performance optimisation

2011-11-29 Thread Greg Jaskiewicz

On 28 Nov 2011, at 02:15, Peter Geoghegan wrote:

 Attached are the results from performing a similar process to the
 prior benchmark, but on Greg Smith's high-end server, and with an
 orderlines table that has been doubled-up until it is 1538 MB,
 making the same old query perform a quicksort that's over 3GB. Short
 version:  HEAD is 20468.0ms, with my patch it's 13689.698ms, so these
 gains hold-up very well for very large in-memory sorts, possibly even
 perfectly.
 

This is some really good stuff. 
It has to be said that there must be quite a few other places where inlining 
like that could have quite a positive benefit. 
But I also have to say that both template functions and template classes in 
C++ give you pretty much the same speed improvement, with much better clarity 
and readability of the code. (I've tested it with the example Peter presented, 
if someone is interested in the code.) 
One more reason why the PostgreSQL project should look into the future and get 
some of the bits simplified and optimised by switching to C++. 






Re: [HACKERS] Inlining comparators as a performance optimisation

2011-11-27 Thread Peter Geoghegan
Attached are the results from performing a similar process to the
prior benchmark, but on Greg Smith's high-end server, and with an
orderlines table that has been doubled-up until it is 1538 MB,
making the same old query perform a quicksort that's over 3GB. Short
version:  HEAD is 20468.0ms, with my patch it's 13689.698ms, so these
gains hold-up very well for very large in-memory sorts, possibly even
perfectly.

Note that the doubling up had an interesting and desirable effect
for the purposes of this patch review: It's approaching a worst case
due to the low cardinality of the product + quantity columns, which
crops up with the non-inlined functions only (int4regqsort_arg, etc).
All too frequently, evaluating the first scankey results in equality,
and so we cannot elide the SQL function call machinery. This is going
to dampen the improvement for the multiple scanKey case considerably
(and it looks like a smaller relative improvement than when the
orderlines table was quite a lot smaller). As I said from the outset,
there is no worst case for the single scanKey case that occurs to me.
Multiple scanKeys are likely to be a problem for any effort to offer
user-defined, per-type sort functions. I could probably make
int4regqsort_arg and friends perform a bit better than we see here by
increasing the cardinality of those two columns for the second query,
so the first scanKey comparison would frequently suffice.
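The two-tier shape being described here can be sketched as follows. This is an illustrative reconstruction, not the actual tuplesort code: the names (compare_two_keys_fastpath, generic_compare) are invented, and a plain function pointer stands in for the SQL function call machinery that the real code falls back to on first-key equality.

```c
#include <assert.h>
#include <stdint.h>

/* Stand-in for the FunctionCall2Coll()-based generic comparison path. */
typedef int (*generic_cmp_fn) (int64_t a, int64_t b);

static int
generic_compare(int64_t a, int64_t b)
{
	return (a < b) ? -1 : (a > b) ? 1 : 0;
}

/*
 * Fast-path comparator sketch for the multiple-scankey case: the first
 * scankey is compared inline, and only on equality do we pay for the
 * generic machinery to resolve the remaining keys.
 */
static int
compare_two_keys_fastpath(int64_t a_key1, int64_t b_key1,
						  int64_t a_key2, int64_t b_key2,
						  generic_cmp_fn fallback)
{
	/* Inlined first-scankey comparison: no call overhead. */
	if (a_key1 < b_key1)
		return -1;
	if (a_key1 > b_key1)
		return 1;

	/*
	 * First keys are equal - with low-cardinality leading columns this
	 * happens all too frequently, and the indirect call must be paid
	 * for after all.
	 */
	return fallback(a_key2, b_key2);
}
```

With high-cardinality leading columns the fallback is rarely reached, which is why increasing the cardinality of those two columns would make the non-inlined variants look better.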

Incidentally, I'm pretty sceptical of the idea of any effort being
made to provide user-defined per-type sort functions or anything like
that. No one has come forward with a novel sorting implementation that
looks useful, although there has been some talk of some. It's
relatively easy to hack on tuplesort to build a prototype, so I don't
think the lack of this functionality is the blocker here. Besides, we
will probably end up doing a terrible job of anticipating how invasive
such a facility needs to be to facilitate these novel sorting
implementations.

While I'm very encouraged by these results, I must admit that there
are a few things that I cannot currently explain. I don't know why the
second query on the Optimized sheet outperforms the same query on
the Not inlined page. When you consider how small the standard
deviation is for each set of results, it's fairly obvious that it
cannot be explained by noise. I thought it was weird, so I re-ran both
benchmarks, with much the same outcome. It may be that the compiler
gets lucky for the optimized case, resulting in a bit of an extra
gain, and that effect is ridiculously magnified. In all cases, the
regression tests pass.

The single key/inlining optimisation seems to account for a big part
of the gains, and the "Not inlined" figures for the first query would
seem to suggest that the inlining can account for more than half of
gains seen, whereas that was previously assumed to be less. What I
think is happening here is that we're benefiting both from inlining as
well as not having to even consider multiple scanKeys (we have to at
least consider them even if we know in advance from the nScankeys
variable that there'll never be multiple scanKeys). Again, if the
cardinality was higher, we'd probably see "Not inlined" do better
here.

-- 
Peter Geoghegan       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training and Services


results_server.ods
Description: application/vnd.oasis.opendocument.spreadsheet



Re: [HACKERS] Inlining comparators as a performance optimisation

2011-11-26 Thread Peter Geoghegan
3 items are attached:

1. A spreadsheet, results.ods, which has results for automated
multiple runs of the doubled up dellstore orderlines queries (where
orderlines is 385 MB). Results are sorted, with the median (or the
lower-middle value; I didn't get the mean of the two middle runs) result
for each run highlighted as representative. There are 3 builds of
Postgres (HEAD, not inlined, optimized), each on its own sheet in the
spreadsheet. The cache was warmed for each query, and subsequently the
query was run 20 times.

2. A python script I hacked together that should run on anything
around Python 2.5+, with a single dependency, psycopg2. Look at --help
to see how it works. This is the script that actually generated the
figures that went into the spreadsheet, by being run once for each
type of build. You can fairly easily play along at home with this, for
these and other queries. It will spit out CSV files. It is designed to
warm the cache (by default 3 times before each query) to eliminate
caching effects. You can add your own query to the python list to have
it run by the script, to generate the same set of figures for that
query. I'm rather curious as to how much of an impact this
optimisation will have on queries with unique nodes, joins and
grouping when they rely on a sort node for their input, in the real
world. Certainly, a query need not have an ORDER BY clause to see
benefits, perhaps substantial benefits. The logic to parse sort node
details is rather unsophisticated right now though, due to my focus on
the obvious ORDER BY case.

3. The revision of the patch that was actually tested, now with
inlining specialisations for the single scanKey case, and a
non-inlined specialisation for multiple scanKeys where we can still
benefit from eliding the SQL function call machinery for the first
scanKey, which is often almost as useful as being able to do so for
all scanKeys. It also selects a sorting specialisation when it can in
tuplesort_begin_heap, per Robert's suggestion.

Reviewers will want to comment out line 731 of tuplesort.c, #define
optimize, to quickly get unoptimized behaviour for comparative
purposes. Furthermore, if you'd like to see what this would look like
without inlining, you can simply comment out assignments of inline
variants of specialisations (i.e. int4inlqsort_arg and
int8inlqsort_arg) in tuplesort_begin_heap.

It has been suggested that I'm chasing diminishing returns by
inlining, as I go further for a smaller benefit. Of course, that's
true. However, I believe that I'm not chasing them past the point
where that ceases to make sense, and these figures support that
contention - chasing diminishing returns is in the nature of this kind of
work. Here, the additional benefit of inlining accounts for over an
entire second shaved off a query that was originally 7056ms, so that's
not to be sniffed at.

I'll reiterate that qsort_arg has only been modified 4 times after its
initial commit in 2006, and these were all trivial changes. While the
way that I ape generic programming with the preprocessor is on the
ugly side, what I've done can be much more easily understood with
reference to qsort_arg itself. Robert said that duplicating the sort
function was iffy. However, that already happened long ago, as we're
already maintaining both qsort_arg.c and qsort.c, and comments already
urge maintainers to keep the two consistent. That's not been a problem
though, because, as I've said, there has never been any call to make
substantive changes to either in all those years.

We could probably further reduce the code footprint of all of this by
having template_qsort_arg.h generate comparators directly. That might
be inflexible in a way that turns out to matter though, like if we
wanted to use this for non-HAVE_INT64_TIMESTAMP timestamps and had to
add NaN crud.
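For readers without the patch to hand, the preprocessor "template" approach works roughly like the following cut-down sketch. This is not the real template_qsort_arg.h (which generates a full quicksort with the qsort_arg interface); a short insertion sort stands in here so the macro stays readable, and the names are invented for the example.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Expanding this macro stamps out a type-specific sort routine whose
 * comparator calls the compiler can inline, instead of reaching the
 * comparator through a function pointer on every comparison.
 */
#define TEMPLATE_SORT(TYPE, CMP)                          \
static void                                               \
TYPE##_template_sort(TYPE *a, size_t n)                   \
{                                                         \
	for (size_t i = 1; i < n; i++)                        \
	{                                                     \
		TYPE key = a[i];                                  \
		size_t j = i;                                     \
		while (j > 0 && CMP(a[j - 1], key) > 0)           \
		{                                                 \
			a[j] = a[j - 1];                              \
			j--;                                          \
		}                                                 \
		a[j] = key;                                       \
	}                                                     \
}

static inline int
compare_int(int a, int b)
{
	return (a > b) - (a < b);
}

/* Expands to int_template_sort(int *, size_t). */
TEMPLATE_SORT(int, compare_int)
```

Having the macro take the comparator as a parameter, rather than generating the comparator too, is what keeps the flexibility discussed above (e.g. for a hypothetical NaN-aware timestamp comparator).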

My next step is to see how this goes with hundreds of sorts in the
hundreds of megabytes on a high-end server. I don't have immediate
access to one, but I'm working that out.

-- 
Peter Geoghegan       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training and Services
From 92185762297ba9819eb7f1b208db03663d9fdfa8 Mon Sep 17 00:00:00 2001
From: peter2ndQuadrant pe...@2ndquadrant.com
Date: Sun, 20 Nov 2011 20:36:25 +
Subject: [PATCH] Initial commit of optimization

Stop directly using oid

Added int8 quicksort fast path specialisation, which can also be used in
place of F_TIMESTAMP_CMP for HAVE_INT64_TIMESTAMP builds.

Rebased, revised patch for -hackers, with timestamp and int8 fast path
sorting using the same int8 specialization.

Remove unneeded line

Rebase for -hackers

Discriminate against cases where nKeys > 1 when selecting sort
specialisation.

We now support fast path sorting with multiple scanKeys.

We now inline for one scanKey, but do not inline our comparetup_heap
replacement when multiple scanKeys are used (although this is at the
compiler's discretion). However, when multiple scanKeys are used, we
still use 

Re: [HACKERS] Inlining comparators as a performance optimisation

2011-11-24 Thread Peter Geoghegan
On 23 November 2011 19:24, Robert Haas robertmh...@gmail.com wrote:
 Well, right now the decision as to which mechanism should be used here
 gets made in tuplesort_performsort(), which has no good way of
 communicating with EXPLAIN anyway.

You could pretty easily add something to Tuplesortstate to accomplish
this. That isn't an endorsement of doing so, but I'm not sure that it
isn't appropriate.

 Actually, I think that's a
 modularity violation; using the address of comparetup_heap as a flag
 value seems quite ugly. How about moving that logic up to
 tuplesort_begin_heap()

I'll post a patch soon that does just that in the next day or two.
Tuplesortstate has a pointer to a sort specialisation.

 and having it set some state inside the
 Tuplesort, maybe based on a flag in the opclass (or would it have to
 attach to the individual operator)?

I'm not sure that there's much point in such a flag.

 At least on my machine, your latest patch reliably crashes the
 regression tests in multiple places.

 TRAP: FailedAssertion("!(state->nKeys == 1)", File: "tuplesort.c", Line: 1261);

Yes, sorry about that. Should have been discriminating against nKeys > 1 cases.

As of this evening, for sorts with multiple scankeys, I'm using
optimisations for the first scankey but not subsequent scankeys, which
is frequently almost as good as having optimisations for all scanKeys.

 The formatting of src/include/utils/template_qsort_arg.h is hard to
 read.  At ts=8, the backslashes line up, but the code doesn't fit in
 80 columns.  If you set ts=4, then it fits in 80 columns, but the
 backslashes don't line up any more, and the variable declarations
 don't either.  I believe ts=4 is project standard.

Fair enough. My working copy and .vimrc have been updated.

 I still think it would be a good idea to provide a mechanism to
 override heap_comparetup() with a type-specific function.  I don't
 think that would take much extra code, and then any data type could
 get at least that much benefit out of this.

 It seems like it could be a good idea to do some
 per-assembler-instruction profiling of this code, and perhaps also of
 the original code.  I'm curious where the time is being spent.

How would you go about doing that? The instrumentation that profilers
use actually caused a big drop in performance here when I attempted it
a few weeks ago. There's a kind of Heisenberg effect.

This optimisation *more than doubles* raw sort performance for these
cases. There is nothing contrived or cherry-picked about the query
that I selected to represent this optimisation - it was literally the
first one that I selected.

Sometimes, I see even a markedly better gain than a doubling of raw
sort performance - I think my earlier experiments that indicated a
much smaller improvement past a certain point may have been
methodologically flawed. Sorry about that.

If I double-up the data in the orderlines table a few times, until it
reaches 385 MB (duplicating every tuple with an INSERT INTO ... SELECT),
then warm the cache, I get very interesting results. Here, we see a
few runs of the same old query unoptimised (note that I've excluded
some cold-cache runs before these runs):

Before optimisation
==
 Total runtime: 7785.473 ms - 3.517310 secs just sorting
 Total runtime: 8203.533 ms - 3.577193 secs just sorting
 Total runtime: 8559.743 ms - 3.892719 secs just sorting

 Total runtime: 9032.564 ms - 3.844746 secs just sorting

 Total runtime: 9637.179 ms - 4.434431 secs just sorting
 Total runtime: 9647.215 ms - 4.440560 secs just sorting
 Total runtime: 9669.701 ms - 4.448572 secs just sorting

After optimisation
==
 Total runtime: 5462.419 ms - 1.169963 secs just sorting
 Total runtime: 5510.660 ms - 1.234393 secs just sorting
 Total runtime: 5511.703 ms - 1.208377 secs just sorting

 Total runtime: 5588.604 ms - 1.175536 secs just sorting

 Total runtime: 5899.496 ms - 1.250403 secs just sorting
 Total runtime: 6023.132 ms - 1.338760 secs just sorting
 Total runtime: 6717.177 ms - 1.486602 secs just sorting

This is an 800109 kB sort.

So, taking the median value as representative here, that looks to be
just shy of a 40% improvement, or 3.4 seconds. My /proc/cpuinfo is
attached on the off chance that someone is interested in that. More
work is needed here, but this seems promising.

It will be interesting to see how far all of this can be taken with
comparetup_index_btree. Certainly, I'm sure there's some gain to be
had there by applying lessons learned from comparetup_heap.

--
Peter Geoghegan       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training and Services
processor   : 0
vendor_id   : GenuineIntel
cpu family  : 6
model   : 15
model name  : Intel(R) Core(TM)2 Duo CPU E6750  @ 2.66GHz
stepping: 11
cpu MHz : 2000.000
cache size  : 4096 KB
physical id : 0
siblings: 2
core id : 0
cpu cores   : 2
apicid  : 0
initial apicid  : 0

Re: [HACKERS] Inlining comparators as a performance optimisation

2011-11-23 Thread Simon Riggs
On Fri, Nov 18, 2011 at 2:11 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 Simon Riggs si...@2ndquadrant.com writes:
 On Fri, Nov 18, 2011 at 5:20 AM, Robert Haas robertmh...@gmail.com wrote:
 I think that we should really consider doing with this patch what Tom
 suggested upthread; namely, looking for a mechanism to allow
 individual datatypes to offer up a comparator function that doesn't
 require bouncing through FunctionCall2Coll().

 I don't think its credible to implement that kind of generic
 improvement at this stage of the release cycle.

 Er, *what*?  We're in mid development cycle, we are nowhere near
 release.  When exactly would you have us make major changes?

 In any case, what I understood Robert to be proposing was an add-on
 feature that could be implemented in one datatype at a time.  Not
 a global flag day.  We couldn't really do the latter anyway without
 making life very unpleasant for authors of extension datatypes.

Tom, whenever you think I've said something you really disagree with,
just assume there's a misunderstanding. Like here.

Of course it is OK to make such changes at this time.

Given we have 2 months to the last CF of this release, inventing a
generic infrastructure is unlikely to be finished and complete in this
dev cycle, so requesting that isn't a practical suggestion, IMHO.

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training  Services



Re: [HACKERS] Inlining comparators as a performance optimisation

2011-11-23 Thread Robert Haas
On Tue, Nov 22, 2011 at 8:09 PM, Peter Geoghegan pe...@2ndquadrant.com wrote:
 I wonder, is it worth qualifying that the "Sort Method" was a
 "quicksort (fast path)" sort within EXPLAIN ANALYZE output? This is a
 rather large improvement, so it might actually be something that
 people look out for, as it might be tricky to remember the exact
 circumstances under which the optimisation kicks in by the time we're
 done here.

Well, right now the decision as to which mechanism should be used here
gets made in tuplesort_performsort(), which has no good way of
communicating with EXPLAIN anyway.  Actually, I think that's a
modularity violation; using the address of comparetup_heap as a flag
value seems quite ugly.  How about moving that logic up to
tuplesort_begin_heap() and having it set some state inside the
Tuplesort, maybe based on a flag in the opclass (or would it have to
attach to the individual operator)?

At least on my machine, your latest patch reliably crashes the
regression tests in multiple places.

The following test case also crashes them for me (perhaps for the same
reason the regression tests crash):

create table i4 (a int, b int);
insert into i4 values (4, 1), (2, 1), (0, 1), (null, 1), (-2, 1), (-7,
1), (4, 2), (4, 3), (4, 4);
select * from i4 order by 1, 2;
TRAP: FailedAssertion("!(state->nKeys == 1)", File: "tuplesort.c", Line: 1261);

The formatting of src/include/utils/template_qsort_arg.h is hard to
read.  At ts=8, the backslashes line up, but the code doesn't fit in
80 columns.  If you set ts=4, then it fits in 80 columns, but the
backslashes don't line up any more, and the variable declarations
don't either.  I believe ts=4 is project standard.

I still think it would be a good idea to provide a mechanism to
override heap_comparetup() with a type-specific function.  I don't
think that would take much extra code, and then any data type could
get at least that much benefit out of this.

It seems like it could be a good idea to do some
per-assembler-instruction profiling of this code, and perhaps also of
the original code.  I'm curious where the time is being spent.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Inlining comparators as a performance optimisation

2011-11-22 Thread Peter Geoghegan
On 21 November 2011 22:55, Robert Haas robertmh...@gmail.com wrote:
 Results on my machine, for what they're worth:

I was curious as to how much of an improvement I'd made to sorting in
isolation. I've adding simple sort timing instrumentation to the code
in a revised patch, attached, for the convenience of reviewers.

This patch adds an int8 specialisation, but it also supports fast path
sorting of timestamps by using that same int8 specialisation (for
HAVE_INT64_TIMESTAMP builds at least - didn't know if it was worth
handling the NaN crud for the other case). ISTM that various different
types with identical internal representations can share
specialisations like this, which I hope will alleviate some earlier
concerns about both the limited applicability of this optimisation and
the possible proliferation of specialisations.

Here's some raw qsort figures in seconds for the query select * from
orderlines order by test (where test is a timestamptz column I added
and updated with some data, executing repeatedly on the same machine
as before):

Before:

DEBUG:  elapsed: 0.031738
DEBUG:  elapsed: 0.031595
DEBUG:  elapsed: 0.031502
DEBUG:  elapsed: 0.031378
DEBUG:  elapsed: 0.031359
DEBUG:  elapsed: 0.031413
DEBUG:  elapsed: 0.031499
DEBUG:  elapsed: 0.031394
DEBUG:  elapsed: 0.031368

After:

DEBUG:  elapsed: 0.013341
DEBUG:  elapsed: 0.014608
DEBUG:  elapsed: 0.013809
DEBUG:  elapsed: 0.013244
DEBUG:  elapsed: 0.014307
DEBUG:  elapsed: 0.013207
DEBUG:  elapsed: 0.013227
DEBUG:  elapsed: 0.013264
DEBUG:  elapsed: 0.013143
DEBUG:  elapsed: 0.013455
DEBUG:  elapsed: 0.013447

I wonder, is it worth qualifying that the "Sort Method" was a
"quicksort (fast path)" sort within EXPLAIN ANALYZE output? This is a
rather large improvement, so it might actually be something that
people look out for, as it might be tricky to remember the exact
circumstances under which the optimisation kicks in by the time we're
done here.

I haven't had as much time as I'd like to polish this patch, or to get
clearer answers. I expect to spend more time on it over the weekend.

-- 
Peter Geoghegan       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training and Services
From 687779e799eb0d0064f86192cc74ea473ff9a607 Mon Sep 17 00:00:00 2001
From: peter2ndQuadrant pe...@2ndquadrant.com
Date: Sun, 20 Nov 2011 20:36:25 +
Subject: [PATCH] Initial commit of optimization

Stop directly using oid

Added int8 quicksort fast path specialisation, which can also be used in
place of F_TIMESTAMP_CMP for HAVE_INT64_TIMESTAMP builds.

Rebased, revised patch for -hackers, with timestamp and int8 fast path
sorting using the same int8 specialization.

Remove unneeded line

Rebase for -hackers
---
 src/backend/utils/sort/tuplesort.c |  105 +++-
 src/include/utils/template_qsort_arg.h |  217 
 2 files changed, 316 insertions(+), 6 deletions(-)
 create mode 100644 src/include/utils/template_qsort_arg.h

diff --git a/src/backend/utils/sort/tuplesort.c b/src/backend/utils/sort/tuplesort.c
index 3505236..4962973 100644
--- a/src/backend/utils/sort/tuplesort.c
+++ b/src/backend/utils/sort/tuplesort.c
@@ -107,11 +107,13 @@
 #include "miscadmin.h"
 #include "pg_trace.h"
 #include "utils/datum.h"
+#include "utils/fmgroids.h"
 #include "utils/logtape.h"
 #include "utils/lsyscache.h"
 #include "utils/memutils.h"
 #include "utils/pg_rusage.h"
 #include "utils/rel.h"
+#include "utils/template_qsort_arg.h"
 #include "utils/tuplesort.h"
 
 
@@ -1225,12 +1227,60 @@ puttuple_common(Tuplesortstate *state, SortTuple *tuple)
 }
 
 /*
- * All tuples have been provided; finish the sort.
+ * Manufacture type specific sorting specialisations
+ * with inline comparators
+ */
+static inline int
+compare_int4(Datum first, Datum second)
+{
+	int32		first_int = DatumGetInt32(first);
+	int32		second_int = DatumGetInt32(second);
+
+	if (first_int > second_int)
+		return 1;
+	else if (first_int == second_int)
+		return 0;
+	else
+		return -1;
+}
+
+static inline int
+compare_int8(Datum first, Datum second)
+{
+	int64		first_int = DatumGetInt64(first);
+	int64		second_int = DatumGetInt64(second);
+
+	if (first_int > second_int)
+		return 1;
+	else if (first_int == second_int)
+		return 0;
+	else
+		return -1;
+}
+
+TEMPLATE_QSORT_ARG(int4, compare_int4);
+TEMPLATE_QSORT_ARG(int8, compare_int8);
+
+double timeval_subtract(struct timeval *x, struct timeval *y);
+double timeval_subtract(struct timeval *x, struct timeval *y)
+{
+	struct timeval result;
+	/* Compute the time remaining to wait. tv_usec is certainly positive. */
+	result.tv_sec = x->tv_sec - y->tv_sec;
+	result.tv_usec = x->tv_usec - y->tv_usec;
+
+	/* return difference in seconds */
+	return result.tv_sec + ((double) result.tv_usec / 1000000);
+}
+
+/*
+ * All tuples have been provided; perform the sort.
  */
 void
 tuplesort_performsort(Tuplesortstate *state)
 {
 	MemoryContext oldcontext = MemoryContextSwitchTo(state->sortcontext);
+	struct timeval begin, end;
 
 #ifdef TRACE_SORT
 	if 

Re: [HACKERS] Inlining comparators as a performance optimisation

2011-11-21 Thread Robert Haas
On Tue, Sep 20, 2011 at 7:53 PM, Peter Geoghegan pe...@2ndquadrant.com wrote:
 I don't think that the fact that that happens is at all significant at
 this early stage, and it never even occurred to me that you'd think
 that it might be. I was simply disclosing a quirk of this POC patch.
 The workaround is probably to use a macro instead. For the benefit of
 those that didn't follow the other threads, the macro-based qsort
 implementation, which I found to perform significantly better than
 regular qsort(), runs like this on my laptop when I built at -O2 with
 GCC 4.6 just now:

 C stdlib quick-sort time elapsed: 2.092451 seconds
 Inline quick-sort time elapsed: 1.587651 seconds

Results on my machine, for what they're worth:

[rhaas inline_compar_test]$ gcc -O0 qsort-inline-benchmark.c
[rhaas inline_compar_test]$ ./a.out
C stdlib quick-sort time elapsed: 2.366762 seconds
Inline quick-sort time elapsed: 1.807951 seconds
[rhaas inline_compar_test]$ gcc -O1 qsort-inline-benchmark.c
[rhaas inline_compar_test]$ ./a.out
C stdlib quick-sort time elapsed: 1.970473 seconds
Inline quick-sort time elapsed: 1.002765 seconds
[rhaas inline_compar_test]$ gcc -O2 qsort-inline-benchmark.c
[rhaas inline_compar_test]$ ./a.out
C stdlib quick-sort time elapsed: 1.966408 seconds
Inline quick-sort time elapsed: 0.958999 seconds
[rhaas inline_compar_test]$ gcc -O3 qsort-inline-benchmark.c
[rhaas inline_compar_test]$ ./a.out
C stdlib quick-sort time elapsed: 1.988693 seconds
Inline quick-sort time elapsed: 0.975090 seconds
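The comparison being benchmarked above is essentially of the following shape. This is a hedged reconstruction, not the actual qsort-inline-benchmark.c: stdlib qsort() reaching its comparator through a function pointer, against a quicksort whose comparator is visible to the compiler for inlining.

```c
#include <assert.h>
#include <stdlib.h>

/* Comparator reached through a function pointer: qsort() cannot inline it. */
static int
cmp_indirect(const void *a, const void *b)
{
	int x = *(const int *) a;
	int y = *(const int *) b;

	return (x > y) - (x < y);
}

/* A comparator the compiler can see (and inline) at its call sites. */
static inline int
cmp_inline(int x, int y)
{
	return (x > y) - (x < y);
}

/* Minimal quicksort with the comparator inlined into the hot loop. */
static void
inline_quicksort(int *a, int lo, int hi)
{
	int i, j, pivot;

	if (lo >= hi)
		return;
	i = lo;
	j = hi;
	pivot = a[lo + (hi - lo) / 2];
	while (i <= j)
	{
		while (cmp_inline(a[i], pivot) < 0)
			i++;
		while (cmp_inline(a[j], pivot) > 0)
			j--;
		if (i <= j)
		{
			int t = a[i]; a[i] = a[j]; a[j] = t;
			i++;
			j--;
		}
	}
	inline_quicksort(a, lo, j);
	inline_quicksort(a, i, hi);
}
```

Timing each routine with clock() over a large random array produces numbers of the kind shown above; as Robert's figures illustrate, the gap only opens up once the optimizer is allowed to inline (-O1 and higher).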

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Inlining comparators as a performance optimisation

2011-11-20 Thread Peter Geoghegan
Part of my problem with not producing specialisations, one that I
neglected to complain about until now, is the inconsistency between
types, and the need to deal with that.

We benefit from assuming in our specialisation that we're only dealing
with POD types, simply not considering things that we would otherwise
have had to for the benefit of types that have a Datum representation
that is pass-by-reference, or have collations, or have multiple
scankeys. Leaving aside the question of straight compiler-optimisation
benefits for the moment, the remaining benefit of what I've done comes
not so much from avoiding the usual function call machinery per se, as
from doing so *as well as* cutting down on what currently happens in
comparetup_heap to handle every single compound and scalar type.
Compare the function comparetup_heap with my meta-function
TYPE##AppSort to see what I mean. The function comparetup_heap is the
comparator directly used by qsort_arg when sorting heap tuples, and
qsort_arg outsources to comparetup_heap some things that you might not
expect it to (very little has changed about qsort_arg since we lifted
it from NetBSD back in 2006). So while you might imagine that that
loop in comparetup_heap and things like its use of the heap_getattr
macros won't expand to that many instructions, you really don't want
them in the inner loop of a long operation like qsorting, paid for up
to O(n ^ 2) times. There's also the obvious implications for compiler
optimisations, particularly relating to effective usage of CPU cache.
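To make the contrast concrete, here is a deliberately simplified sketch (invented names and a fake Datum, not the real tuplesort code) of the generic per-key dispatch that comparetup_heap performs versus what a single-key int4 specialisation collapses to:

```c
#include <stdint.h>

typedef intptr_t Datum;                     /* stand-in for the real Datum */

typedef int (*AttrComparator)(Datum a, Datum b);

/* Hypothetical per-sort state: one comparator per sort key. */
typedef struct
{
    int            nkeys;
    AttrComparator cmp[8];
    int            descending[8];
} SortSpec;

/* Generic path: loop over keys, indirect call per attribute, per
 * comparison -- paid up to O(n log n) times in a sort. */
static int
comparetup_generic(const Datum *a, const Datum *b, const SortSpec *spec)
{
    for (int k = 0; k < spec->nkeys; k++)
    {
        int c = spec->cmp[k](a[k], b[k]);

        if (c != 0)
            return spec->descending[k] ? -c : c;
    }
    return 0;
}

/* Specialised path for a single int4 key: no loop, no indirect call,
 * essentially a couple of instructions. */
static int
comparetup_int4(Datum a, Datum b)
{
    int32_t x = (int32_t) a, y = (int32_t) b;

    return (x > y) - (x < y);
}
```

Both return the same answers; the point is how much per-comparison machinery surrounds the single essential instruction.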

I'm trying to get you what you asked for: A straight choice between
what Tom suggested and what I suggested, with perhaps some compromises
between the two positions. That's sort of tricky though, especially
considering the issues raised above.

-- 
Peter Geoghegan       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training and Services



Re: [HACKERS] Inlining comparators as a performance optimisation

2011-11-19 Thread Peter Geoghegan
On 19 November 2011 02:55, Robert Haas robertmh...@gmail.com wrote:
 Maybe we should look at trying to isolate that a bit better.

Indeed. Fortunately, GCC has options to disable each optimisation.
Here are the potentially relevant flags that we're already implicitly
using at -O2:

-finline-small-functions -- this is the one that inlines functions
without an inline keyword
-findirect-inlining
-finline-functions-called-once
-fearly-inlining
-fipa-sra (Perform interprocedural scalar replacement of aggregates,
removal of unused parameters and replacement of parameters passed by
reference by parameters passed by value).

In an effort to better isolate the effects of inlining, I tried this:

./configure CFLAGS="-fno-inline -fno-inline-small-functions" (I could
have disabled more -O2 optimisations, but this proved sufficient to
make my point)

Unsurprisingly, this makes things slower than regular -O2.

With the patch, the same query (once again, using explain analyze)
with the same fast path quicksort stabilises around 92ms +/- 5ms
(recall that the original figure was ~82ms). Our gains evaporate, and
then some. Take away the additional CFLAGS, and we're predictably back
to big gains, with the query taking ~52ms just as before.

What happens to this query when we build an unmodified postgres with
these same CFLAGS? Well, we see the query take ~116ms after a few
runs. It seems that the impedance mismatch matters, but inlining and
other optimisations look to be at least as important. This isn't
surprising to me, given what I was able to do with the isolated test.
Maybe I should have tried it with additional disabling of
optimisations named above, but that would have perhaps made things
less clear.

I'd probably have been better off directly measuring qsort speed and
only passing those flags when compiling tuplesort.c (maybe impedance
mismatch issues would have proven to have been even less relevant),
but I wanted to do something that could be easily recreated, plus it's
late.

 It strikes me that we could probably create an API that would support
 doing either of these things depending on the wishes of the underlying
 datatype.  For example, imagine that we're sorting with (int4, int4).
 We associate a pg_proc-callable function with that operator that
 returns "internal", really a pointer to a struct.  The first element
 of the struct is a pointer to a comparison function that qsort() (or a
 tape sort) can invoke without a trampoline; the second is a wholesale
 replacement for qsort(); either or both can be NULL.  Given that, it
 seems to me that we could experiment with this pretty easily, and if
 it turns out that only one of them is worth doing, it's easy to drop
 one element out of the structure.

 Or do you have another plan for how to do this?

I haven't given it much thought. Let me get back to you on that next week.
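For readers following along, the struct-based API quoted above might be sketched like this (all names -- SortSupport, sort_datums, int4_direct_cmp -- are invented for illustration; this is not a committed design):

```c
#include <stddef.h>
#include <stdint.h>

typedef intptr_t Datum;                     /* stand-in for the real Datum */

/* Hypothetical struct a type's support function could return: either
 * member may be NULL if the type doesn't provide it. */
typedef struct SortSupport
{
    int  (*comparator)(Datum a, Datum b);   /* direct, no fmgr trampoline */
    void (*sort)(Datum *items, size_t n);   /* wholesale qsort() replacement */
} SortSupport;

/* Example direct comparator an int4-like type might offer. */
static int
int4_direct_cmp(Datum a, Datum b)
{
    int32_t x = (int32_t) a, y = (int32_t) b;

    return (x > y) - (x < y);
}

/* Sketch of how a sort driver could consume the struct: prefer the
 * type's own sort routine, else drive a generic sort through the direct
 * comparator (a simple insertion sort here, for brevity). */
static void
sort_datums(Datum *items, size_t n, const SortSupport *ssup)
{
    if (ssup->sort != NULL)
    {
        ssup->sort(items, n);
        return;
    }
    for (size_t i = 1; i < n; i++)
        for (size_t j = i; j > 0 && ssup->comparator(items[j - 1], items[j]) > 0; j--)
        {
            Datum t = items[j - 1];

            items[j - 1] = items[j];
            items[j] = t;
        }
}
```

Dropping one member later really is as easy as deleting a field, as Robert notes above.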

 Have you done any benchmarks where this saves seconds or minutes,
 rather than milliseconds?  That would certainly make it more exciting,
 at least to me.

Right. Well, I thought I'd use pgbench to generate a large table in a
re-creatable way. That is:

pgbench -i -s 60

This puts pgbench_accounts at 769MB. Then, having increased work_mem
to 1GB (enough to qsort) and maintenance_work_mem to 756MB, I decided
to test this query with the patch:

explain analyze select * from pgbench_accounts order BY abalance;

This stabilised at ~3450ms, through repeatedly being executed.

How does this compare to unpatched postgres? Well, it stabilised at
about ~3780ms for the same query.

This patch is obviously less of a win as the number of tuples to sort
goes up. That's probably partly explained by the cost of everything
else going up at a greater rate than the number of comparisons. I
suspect that if we measure qsort in isolation, we'll see better
results, so we may still see a good win on index creation time as a
result of this work.

-- 
Peter Geoghegan       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training and Services



Re: [HACKERS] Inlining comparators as a performance optimisation

2011-11-19 Thread Peter Geoghegan
On 20 November 2011 03:29, Peter Geoghegan pe...@2ndquadrant.com wrote:
 ./configure CFLAGS="-fno-inline -fno-inline-small-functions" (I could
 have disabled more -O2 optimisations, but this proved sufficient to
 make my point)

I'll isolate this further to tuplesort.c soon, which ought to be a lot
more useful.

-- 
Peter Geoghegan       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training and Services



Re: [HACKERS] Inlining comparators as a performance optimisation

2011-11-18 Thread Simon Riggs
On Fri, Nov 18, 2011 at 5:20 AM, Robert Haas robertmh...@gmail.com wrote:

 I think that we should really consider doing with this patch what Tom
 suggested upthread; namely, looking for a mechanism to allow
 individual datatypes to offer up a comparator function that doesn't
 require bouncing through FunctionCall2Coll().  It seems to me that
 duplicating the entire qsort() algorithm is a little iffy.  Sure, in
 this case it works out to a win.  But it's only a small win -
 three-quarters of it is in the uncontroversial activity of reducing
 the impedance mismatch - and how do we know it will always be a win?
 Adding more copies of the same code can be an anti-optimization if it
 means that a smaller percentage of it fits in the instruction cache,
 and sometimes small changes in runtime are caused by random shifts in
 the layout of memory that align things more or less favorably across
 cache lines rather than by real effects.  Now it may well be that this
 is a real effect, but will it still look as good when we do this for
 10 data types?  For 100 data types?

 In contrast, it seems to me that reducing the impedance mismatch is
 something that we could go and do across our entire code base, and
 every single data type would benefit from it.  It would also be
 potentially usable by other sorting algorithms, not just quick sort.

I don't think it's credible to implement that kind of generic
improvement at this stage of the release cycle. That has a much bigger
impact since it potentially affects all internal datatypes and
external ones also. Definitely a longer term way forward.

Having individual datatypes offer up a comparator function is easily
going to result in more code than is being suggested here. So the
argument about flooding the CPU cache works against your alternate
proposal, not in favour of it.

We have no proof that we need to do this for 10 or 100 data types. We
only currently have proof that there is gain for the most common
types. Of course, it sounds like it might be useful to allow any data
type to gain an advantage, but we shouldn't be blind to the point that
almost nobody will use such a facility, and if they do the code won't
be written for a long time yet. If this came as a request from custom
datatype authors complaining of slow sorts it would be different, but
it didn't, so we don't even know if anybody would ever write
user-defined comparator routines. Rejecting a patch because of a guessed
user requirement is not good.

Peter's suggested change adds very few lines of code and those compile
to some very terse code, a few hundred instructions at very most.
Requesting an extra few cachelines to improve qsort by so much is
still an easy overall win.

The OP change improves qsort dramatically, and is a small, isolated
patch. There is no significant downside. We also have it now, so lets
commit this, chalk up another very good performance improvement and
use our time on something else this commitfest, such as the flexlocks
idea.

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: [HACKERS] Inlining comparators as a performance optimisation

2011-11-18 Thread Robert Haas
On Fri, Nov 18, 2011 at 3:53 AM, Simon Riggs si...@2ndquadrant.com wrote:
 We have no proof that we need to do this for 10 or 100 data types. We
 only currently have proof that there is gain for the most common
 types.

Well, that's kind of my point.  I think this needs more work before we
decide what the best approach is.  So far, the ONLY test result we
have that compares inlining to not inlining shows a speedup from 60 ms
to 52 ms.  I think that an 8 ms speedup on one test with one datatype
on one platform/compiler combination isn't sufficient evidence to
conclude that this is the best possible approach.

I think the way to look at this is that this patch contains two
possibly good ideas: one of which is to make the qsort() argument
match the qsort() calling convention, and the other of which is to
have multiple copies of the qsort() logic that inline the comparators
for their respective datatypes.  Tom hypothesized that most of the
benefit was in the first idea, and the numbers that Peter posted seem
to support that conclusion.  The first idea is also less invasive and
more broadly applicable, so to me that seems like the first thing to
pursue.

Now, that doesn't mean that we shouldn't consider the second idea as
well, but I don't believe that the evidence presented thus far is
sufficient to prove that we should go that route.  It seems entirely
possible that inlining any non-trivial comparator function could work
out to a loss, or that the exact choice of compiler flags could affect
whether inlining works out to a plus or a minus (what happens with -O3
vs. -O2?  what about -O1?  what about -O2 -fno-omit-frame-pointer?
what if they're using HP's aCC or Intel's icc rather than gcc?).
There's no point in installing an optimization that could easily be a
pessimization on some other workload, and that hasn't been tested, or
at least no results have been posted here.  On the other hand,
matching the calling convention of the comparator function to what
qsort() wants and eliminating the trampoline seems absolutely certain
to be a win in every case, and based on the work Peter has done it
seems like it might be a quite large win.

In fact, you have to ask yourself just exactly how much our
function-calling convention is costing us in general.  We use that
mechanism an awful lot and whatever loss of cycles is involved would
be spread all over the code base where oprofile or gprof won't easily
be able to pick it out.  Even the cost at any particular call site
will be split between the caller and the callee.  There might be more
stuff we could optimize here than just sorting...
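A minimal sketch of the two calling conventions being compared (hypothetical stand-ins -- FakeFunctionCallInfo is not the real fmgr machinery): today every comparison bounces through a trampoline that repacks its arguments in fmgr style, whereas a comparator with qsort()'s own signature can be handed to qsort() directly, with nothing to unpack.

```c
#include <stdint.h>
#include <stdlib.h>

typedef intptr_t Datum;                     /* stand-in for the real Datum */

/* --- Today's shape, roughly: arguments get packed into call-info
 *     state and the comparison proper runs behind an extra call. --- */
typedef struct { Datum arg[2]; } FakeFunctionCallInfo;

static Datum
btint4cmp_like(FakeFunctionCallInfo *fcinfo)
{
    int32_t a = (int32_t) fcinfo->arg[0];
    int32_t b = (int32_t) fcinfo->arg[1];

    return (Datum) ((a > b) - (a < b));
}

static int
trampoline_cmp(const void *a, const void *b)
{
    FakeFunctionCallInfo fcinfo;

    fcinfo.arg[0] = *(const Datum *) a;     /* pack */
    fcinfo.arg[1] = *(const Datum *) b;
    return (int) btint4cmp_like(&fcinfo);   /* bounce */
}

/* --- Proposed shape: the type offers a comparator that already has
 *     qsort's calling convention; no packing, no bounce. --- */
static int
direct_int4_cmp(const void *a, const void *b)
{
    int32_t x = (int32_t) *(const Datum *) a;
    int32_t y = (int32_t) *(const Datum *) b;

    return (x > y) - (x < y);
}
```

Both comparators sort identically; the trampoline version simply does strictly more work per call.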

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Inlining comparators as a performance optimisation

2011-11-18 Thread Tom Lane
Simon Riggs si...@2ndquadrant.com writes:
 On Fri, Nov 18, 2011 at 5:20 AM, Robert Haas robertmh...@gmail.com wrote:
 I think that we should really consider doing with this patch what Tom
 suggested upthread; namely, looking for a mechanism to allow
 individual datatypes to offer up a comparator function that doesn't
 require bouncing through FunctionCall2Coll().

 I don't think it's credible to implement that kind of generic
 improvement at this stage of the release cycle.

Er, *what*?  We're in mid development cycle, we are nowhere near
release.  When exactly would you have us make major changes?

In any case, what I understood Robert to be proposing was an add-on
feature that could be implemented in one datatype at a time.  Not
a global flag day.  We couldn't really do the latter anyway without
making life very unpleasant for authors of extension datatypes.

regards, tom lane



Re: [HACKERS] Inlining comparators as a performance optimisation

2011-11-18 Thread Peter Geoghegan
On 18 November 2011 05:20, Robert Haas robertmh...@gmail.com wrote:
 I think that we should really consider doing with this patch what Tom
 suggested upthread; namely, looking for a mechanism to allow
 individual datatypes to offer up a comparator function that doesn't
 require bouncing through FunctionCall2Coll().  It seems to me that
 duplicating the entire qsort() algorithm is a little iffy.

I understand that we highly value extensibility and genericity (yes,
that's a real word). We may not always be well served by that
tendency.

 Sure, in this case it works out to a win.  But it's only a small win -
 three-quarters of it is in the uncontroversial activity of reducing
 the impedance mismatch - and how do we know it will always be a win?

Firstly, 1/4 of a quite large gain is still a pretty good gain.
Secondly, I probably didn't actually isolate the effects of inlining,
nor the overall benefit of the compiler knowing the comparator at
compile-time. I just removed the inline keyword. Those are two
different things. The inline keyword serves as a request to the
compiler to inline. The compiler can and often will ignore that
request. Most people know that. What isn't so widely known is that
modern compilers may equally well inline even when they haven't been
asked to (but only when they can). When you also consider, as I've
already pointed out several times, that individual call sites are
inlined, it becomes apparent that there may well still be a certain
amount of inlining and/or other optimisations like procedural
integration going on at some call sites even without the encouragement
of the inline keyword, that would not have been performed without the
benefit of compile-time comparator knowledge. The addition of the
inline keyword may just, in this particular case, have the compiler
inline even more call sites.  Posting the ~8ms difference was
motivated by a desire to prove that inlining had *some* role to play,
without actually going to the trouble of implementing Tom's idea as a
basis of comparison, because Tom was very sceptical of inlining.

The long and the short of it is that I'm going to have to get my hands
dirty with a dissembler before we really know exactly what's
happening. That, or I could use an optimisation fence of some type.

 Adding more copies of the same code can be an anti-optimization if it
 means that a smaller percentage of it fits in the instruction cache,
 and sometimes small changes in runtime are caused by random shifts in
 the layout of memory that align things more or less favorably across
 cache lines rather than by real effects.  Now it may well be that this
 is a real effect, but will it still look as good when we do this for
 10 data types?  For 100 data types?

I'd favour limiting it to just the common integer and float types.

 In contrast, it seems to me that reducing the impedance mismatch is
 something that we could go and do across our entire code base, and
 every single data type would benefit from it.  It would also be
 potentially usable by other sorting algorithms, not just quick sort.

Suppose that we went ahead and added that infrastructure. What you
must acknowledge is that one reason that this speed-up is so dramatic
is that the essential expense of a comparison is already so low - a
single instruction - and therefore the overall per-comparison cost
goes way down, particularly if the qsort inner loop fits in fewer
cache lines. For that reason, any absolute improvement
that you'll see in complex datatypes will be smaller, maybe much
smaller, because for each comparison we'll execute many more
instructions that are essential to the comparison. In my estimation,
all of this work does not point to there being an undue overhead in
the function calling convention as you suggested. Still, I'm not
opposed to investigating generalising this in some way, reservations
notwithstanding, unless we have to block-wait on it. I don't want to
chase diminishing returns too far.

 Well, that's kind of my point.  I think this needs more work before we
 decide what the best approach is.

Agreed.

 So far, the ONLY test result we
 have that compares inlining to not inlining shows a speedup from 60 ms
 to 52 ms.  I think that an 8 ms speedup on one test with one datatype
 on one platform/compiler combination isn't sufficient evidence to
 conclude that this is the best possible approach.

Fair enough, but it's not the only test I did - I posted other numbers
for the same query when the table was 48mb, and we saw a proportional
improvement, consistent with a per-comparison win. I'm supposed to be
on leave for a few days at the moment, so I won't be very active this
weekend, but I'm rather curious as to where you or others would like
to see me go with benchmarks.

I should point out that we currently don't have much idea how big of a
win applying these principles could be for index creation times...it
could possibly be very significant. My profiling of index creation
makes this look promising.

Re: [HACKERS] Inlining comparators as a performance optimisation

2011-11-18 Thread Martijn van Oosterhout
On Fri, Nov 18, 2011 at 12:20:26AM -0500, Robert Haas wrote:
 I think that we should really consider doing with this patch what Tom
 suggested upthread; namely, looking for a mechanism to allow
 individual datatypes to offer up a comparator function that doesn't
 require bouncing through FunctionCall2Coll().  It seems to me that
 duplicating the entire qsort() algorithm is a little iffy.  Sure, in
 this case it works out to a win.  But it's only a small win -
 three-quarters of it is in the uncontroversial activity of reducing
 the impedance mismatch - and how do we know it will always be a win?

There's always the old idea of a data type providing a function mapping
f : (type -> int64) in such a way that it preserves order. That is, in the
sense that:

f(x) < f(y)  =>  x < y

When sorting, you add f(x) as hidden column and change ORDER BY x to
ORDER BY f(x), x.  Then you only need to special case the int64
version.  This would mean that in most cases you may be able to skip
the call because you're comparing integers.  The downside is you need
to call f on each input.  It depends on the datatype if that's cheaper
or not, but for all numerics types I think it's an easy win.

I don't think anyone has written a proof of concept for this. It does
have the advantage of scaling better than coding a qsort for each
individual type.
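One plausible proof of concept for such an order-preserving mapping is the well-known IEEE-754 bit trick for float8 (a sketch under stated assumptions: it ignores NaNs, and it maps -0.0 and 0.0 to different keys, so equal inputs are not guaranteed equal keys):

```c
#include <stdint.h>
#include <string.h>

/* Order-preserving map from a double to an int64 sort key:
 * for ordinary (non-NaN) values, f(x) < f(y) implies x < y.
 * Negative doubles sort in reverse bit order, so flip all their bits;
 * for positives, setting the sign bit shifts them above the negatives. */
static int64_t
float8_map(double x)
{
    uint64_t u;

    memcpy(&u, &x, sizeof u);           /* grab the raw IEEE-754 bits */
    if (u >> 63)
        u = ~u;                         /* negative: flip everything */
    else
        u |= UINT64_C(1) << 63;         /* positive: set the sign bit */
    /* XOR the top bit to turn unsigned ordering into signed ordering. */
    return (int64_t) (u ^ (UINT64_C(1) << 63));
}
```

Comparing the resulting int64 keys is then the single-instruction case the sort specialisation already handles.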

Have a nice day,
-- 
Martijn van Oosterhout   klep...@svana.org   http://svana.org/kleptog/
 He who writes carelessly confesses thereby at the very outset that he does
 not attach much importance to his own thoughts.
   -- Arthur Schopenhauer




Re: [HACKERS] Inlining comparators as a performance optimisation

2011-11-18 Thread Robert Haas
On Fri, Nov 18, 2011 at 4:38 PM, Peter Geoghegan pe...@2ndquadrant.com wrote:
 I understand that we highly value extensibility and genericity (yes,
 that's a real word). We may not always be well served by that
 tendency.

True (except that genericity is not a synonym for generality AFAICT).
A good fraction of the optimizations that we've done over the last,
well, pick any arbitrary period of time consists in adding special
cases that occur frequently enough to be worth optimizing - e.g. HOT,
or fast relation locks - often to extremely good effect.

 Firstly, 1/4 of a quite large gain is still a pretty good gain.
 Secondly, I probably didn't actually isolate the effects of inlining,
 nor the overall benefit of the compiler knowing the comparator at
 compile-time. I just removed the inline keyword.

Maybe we should look at trying to isolate that a bit better.

It strikes me that we could probably create an API that would support
doing either of these things depending on the wishes of the underlying
datatype.  For example, imagine that we're sorting with (int4, int4).
 We associate a pg_proc-callable function with that operator that
returns "internal", really a pointer to a struct.  The first element
of the struct is a pointer to a comparison function that qsort() (or a
tape sort) can invoke without a trampoline; the second is a wholesale
replacement for qsort(); either or both can be NULL.  Given that, it
seems to me that we could experiment with this pretty easily, and if
it turns out that only one of them is worth doing, it's easy to drop
one element out of the structure.

Or do you have another plan for how to do this?

 Fair enough, but it's not the only test I did - I posted other numbers
 for the same query when the table was 48mb, and we saw a proportional
 improvement, consistent with a per-comparison win. I'm supposed to be
 on leave for a few days at the moment, so I won't be very active this
 weekend, but I'm rather curious as to where you or others would like
 to see me go with benchmarks.

 I should point out that we currently don't have much idea how big of a
 win applying these principles could be for index creation times...it
 could possibly be very significant. My profiling of index creation
 makes this look promising.

Have you done any benchmarks where this saves seconds or minutes,
rather than milliseconds?  That would certainly make it more exciting,
at least to me.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Inlining comparators as a performance optimisation

2011-11-17 Thread Robert Haas
On Sun, Sep 25, 2011 at 10:12 PM, Peter Geoghegan pe...@2ndquadrant.com wrote:
 I've produced something much neater than my first patch, attached,
 although I still consider this to be at the POC stage, not least since
 I'm not exactly sure how I should be selecting the right
 specialisation in tuplesort_performsort() (a hack is used right now
 that does a direct pg_proc OID comparison), and also because I haven't
 implemented anything other than qsort for heap tuples yet (a minor
 detail, I suppose). I'm pleased to say that a much clearer picture of
 what's really going on here has emerged.

 Statistics: Total runtime according to explain analyze for query
 select * from orderlines order by prod_id (dellstore2 sample db), at
 GCC 4.5's -02 optimisation level, after warming the cache, on my
 desktop:

 Without the patch:

 ~82ms

 With the patch, but with the inline keyword commented out for all
 new functions/meta-functions:

 ~60ms

 with the patch, unmodified:

 ~52ms

Reviewing away when I should be sleeping, wahoo...!

I think that we should really consider doing with this patch what Tom
suggested upthread; namely, looking for a mechanism to allow
individual datatypes to offer up a comparator function that doesn't
require bouncing through FunctionCall2Coll().  It seems to me that
duplicating the entire qsort() algorithm is a little iffy.  Sure, in
this case it works out to a win.  But it's only a small win -
three-quarters of it is in the uncontroversial activity of reducing
the impedance mismatch - and how do we know it will always be a win?
Adding more copies of the same code can be an anti-optimization if it
means that a smaller percentage of it fits in the instruction cache,
and sometimes small changes in runtime are caused by random shifts in
the layout of memory that align things more or less favorably across
cache lines rather than by real effects.  Now it may well be that this
is a real effect, but will it still look as good when we do this for
10 data types?  For 100 data types?

In contrast, it seems to me that reducing the impedance mismatch is
something that we could go and do across our entire code base, and
every single data type would benefit from it.  It would also be
potentially usable by other sorting algorithms, not just quick sort.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Inlining comparators as a performance optimisation

2011-10-05 Thread Robert Haas
On Tue, Oct 4, 2011 at 10:55 PM, Greg Stark st...@mit.edu wrote:
 I agree that if we wanted to farm out entire plan nodes we would
 probably end up generating new partial plans which would be handed to
 other backend processes. That's not unlike what Oracle did with
 Parallel Query. But i'm a bit skeptical that we'll get much of that
 done in 2-3 years. The main use case for Parallel Query in Oracle is
 for partitioned tables -- and we haven't really polished that after
 how many years?

Partitioning hasn't been completely neglected; 9.1 adds support for
something called Merge Append.  You may have heard of (or, err,
authored) that particular bit of functionality.  :-)

Of course, it would be nice to have a better syntax, but I don't think
the lack of it should discourage us from working on parallel query,
which is a multi-part problem.  You need to:

- have planner support to decide when to parallelize
- have a mechanism for firing up worker processes and synchronizing
the relevant bits of state between the master and the workers
- have an IPC mechanism for streaming data between the master process
and the workers
- figure out how to structure the executor so that the workers neither
stall nor run too far ahead of the master (which might have LIMIT 10
or something)

Markus Wanner took a crack at generalizing the autovacuum machinery
that we have now into something that could be used to fire up
general-purpose worker processes, but it fell down mostly because I
(and, I think, others) weren't convinced that imessages were something
we wanted to suck into core, and Markus reasonably enough wasn't
interested in rewriting it to do something that wouldn't really help
his work with Postgres-R.  I'm not sure where Bruce is getting his
timeline from, but I think the limiting factor is not so much that we
don't have people who can write the code as that those people are
busy, and this is a big project.  But you can bet that if it gets to
the top of Tom's priority list (just for example) we'll see some
motion...!

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Inlining comparators as a performance optimisation

2011-10-05 Thread Bruce Momjian
Robert Haas wrote:
 Markus Wanner took a crack at generalizing the autovacuum machinery
 that we have now into something that could be used to fire up
 general-purpose worker processes, but it fell down mostly because I
 (and, I think, others) weren't convinced that imessages were something
 we wanted to suck into core, and Markus reasonably enough wasn't
 interested in rewriting it to do something that wouldn't really help
 his work with Postgres-R.  I'm not sure where Bruce is getting his
 timeline from, but I think the limiting factor is not so much that we
 don't have people who can write the code as that those people are
 busy, and this is a big project.  But you can bet that if it gets to
 the top of Tom's priority list (just for example) we'll see some
 motion...!

I was thinking of setting up a team to map out some strategies and get
community buy-in, and then we could attack each issue.  I got the 2-3
years from the Win32 timeline.

-- 
  Bruce Momjian  br...@momjian.ushttp://momjian.us
  EnterpriseDB http://enterprisedb.com

  + It's impossible for everything to be true. +



Re: [HACKERS] Inlining comparators as a performance optimisation

2011-10-04 Thread Bruce Momjian
Peter Geoghegan wrote:
 On the subject of highly ambitious optimisations to sorting, one
 possibility I consider much more practicable than GPU-accelerated
 sorting is simple threading; quicksort can be parallelised very
 effectively, due to its divide-and-conquer nature. If we could agree
 on a threading abstraction more sophisticated than forking, it's
 something I'd be interested in looking at. To do so would obviously
 entail lots of discussion about how that relates to whatever way we
 eventually decide on implementing parallel query, and that's obviously
 a difficult discussion.

I agree that the next big challenge for Postgres is parallel operations.
With the number of cores increasing, and with increased memory and SSD,
parallel operation is even more important.  Rather than parallelizing
the entire backend, I imagine adding threading or helper processes for
things like sorts, index scans, executor nodes, and stored procedure
languages.  I expect final code to be 2-3 years in the future.
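As a sketch of the divide-and-conquer parallelism being discussed (plain POSIX threads sorting a plain int array; all names invented, and emphatically not backend code, since the backend is not thread-safe):

```c
#include <pthread.h>

typedef struct { int *a; int lo; int hi; int depth; } SortTask;

/* Quicksort one subrange; while depth > 0, hand one half of each
 * partition to a freshly spawned worker thread. */
static void *
psort(void *arg)
{
    SortTask *t = arg;
    int *a = t->a;
    int lo = t->lo, hi = t->hi;

    if (lo >= hi)
        return NULL;

    int pivot = a[(lo + hi) / 2];
    int i = lo, j = hi;

    while (i <= j)
    {
        while (a[i] < pivot) i++;
        while (a[j] > pivot) j--;
        if (i <= j)
        {
            int tmp = a[i]; a[i] = a[j]; a[j] = tmp;
            i++; j--;
        }
    }

    SortTask left  = { a, lo, j, t->depth - 1 };
    SortTask right = { a, i, hi, t->depth - 1 };

    if (t->depth > 0)
    {
        pthread_t th;

        pthread_create(&th, NULL, psort, &left);  /* one half in a worker */
        psort(&right);                            /* other half here */
        pthread_join(th, NULL);
    }
    else
    {
        psort(&left);
        psort(&right);
    }
    return NULL;
}

/* depth bounds the number of spawned threads at 2^depth - 1. */
static void
parallel_sort(int *a, int n, int depth)
{
    SortTask root = { a, 0, n - 1, depth };

    psort(&root);
}
```

The halves are disjoint, so no locking is needed on the array itself; the hard part, as the rest of this thread notes, is everything a real comparator might call.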

-- 
  Bruce Momjian  br...@momjian.ushttp://momjian.us
  EnterpriseDB http://enterprisedb.com

  + It's impossible for everything to be true. +



Re: [HACKERS] Inlining comparators as a performance optimisation

2011-10-04 Thread Greg Stark
On Wed, Oct 5, 2011 at 2:49 AM, Bruce Momjian br...@momjian.us wrote:

 Rather than parallelizing
 the entire backend, I imagine adding threading or helper processes for
 things like sorts, index scans, executor nodes, and stored procedure
 languages.  I expect final code to be 2-3 years in the future.

I don't actually see it being a big problem for quicksort to start up a
bunch of threads which do some of the work and go away when the
tuplesort is done. As long as the code they're executing is well
defined and limited to a small set of code that can be guaranteed to
be thread-safe it should be reasonably simple.
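The shape Greg describes can be sketched in C with pthreads. Everything below is invented for illustration (it is not patch code): partition once, hand one half to a helper thread, sort the other half in the calling thread, and only use a comparator that is trivially thread-safe, which is exactly the narrow case under discussion.

```c
#include <pthread.h>
#include <stddef.h>
#include <stdlib.h>

#define PARALLEL_THRESHOLD 4096	/* below this, threading isn't worth it */

/* Plain thread-safe integer comparator: no fmgr, no palloc, no ereport */
static int
cmp_int(const void *a, const void *b)
{
	int		ia = *(const int *) a;
	int		ib = *(const int *) b;

	return (ia > ib) - (ia < ib);
}

typedef struct SortTask
{
	int	   *base;
	size_t	n;
} SortTask;

static void *
sort_worker(void *arg)
{
	SortTask   *t = (SortTask *) arg;

	qsort(t->base, t->n, sizeof(int), cmp_int);
	return NULL;
}

static void
parallel_sort(int *a, size_t n)
{
	size_t		i,
				store = 0;
	int			pivot,
				tmp;
	pthread_t	helper;
	SortTask	left;

	if (n < PARALLEL_THRESHOLD)
	{
		qsort(a, n, sizeof(int), cmp_int);
		return;
	}

	/* Lomuto partition around the last element */
	pivot = a[n - 1];
	for (i = 0; i + 1 < n; i++)
	{
		if (a[i] < pivot)
		{
			tmp = a[i];
			a[i] = a[store];
			a[store] = tmp;
			store++;
		}
	}
	tmp = a[store];
	a[store] = a[n - 1];
	a[n - 1] = tmp;

	/* Left half in a helper thread, right half here, then join */
	left.base = a;
	left.n = store;
	pthread_create(&helper, NULL, sort_worker, &left);
	qsort(a + store + 1, n - store - 1, sizeof(int), cmp_int);
	pthread_join(helper, NULL);
}
```

The helper threads are created and joined entirely within the sort, so no backend state leaks across threads; that containment is what makes this case tractable where general backend threading is not.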

The problem is that in most of the Postgres core there aren't many
places where much code fits that description. Even in tuplesort it can
call out to arbitrary user-defined functions so we would need a way to
mark which functions are thread-safe. Beyond integer and floating
point comparisons it may not be much -- certainly not Numeric or
strings due to detoasting... And then there are things like handling
the various signals (including SI invalidations?) and so on.

I agree that if we wanted to farm out entire plan nodes we would
probably end up generating new partial plans which would be handed to
other backend processes. That's not unlike what Oracle did with
Parallel Query. But i'm a bit skeptical that we'll get much of that
done in 2-3 years. The main use case for Parallel Query in Oracle is
for partitioned tables -- and we haven't really polished that after
how many years?

-- 
greg



Re: [HACKERS] Inlining comparators as a performance optimisation

2011-09-26 Thread Peter Geoghegan
On 26 September 2011 04:46, Robert Haas robertmh...@gmail.com wrote:
 I don't want to take time to review this in detail right now, because
 I don't think it would be fair to put this ahead of things that were
 submitted for the current CommitFest, but I'm impressed.

Thank you.

Now that I think about it, the if statement that determines if the
int4 specialisation will be used may be okay - it sort of documents
the conditions under which the int4 specialisation should be used with
reference to how the else generic case actually works, which is
perhaps no bad thing. I now realise that I should be using constants
from fmgroids.h though.

-- 
Peter Geoghegan       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training and Services



Re: [HACKERS] Inlining comparators as a performance optimisation

2011-09-25 Thread Robert Haas
On Sun, Sep 25, 2011 at 10:12 PM, Peter Geoghegan pe...@2ndquadrant.com wrote:
 [ new results ]

Nice results.  I think these are far more convincing than the last
set, because (1) the gains are bigger and (2) they survive -O2 and (3)
you tested an actual query, not just qsort() itself.

I don't want to take time to review this in detail right now, because
I don't think it would be fair to put this ahead of things that were
submitted for the current CommitFest, but I'm impressed.

 On the subject of highly ambitious optimisations to sorting, one
 possibility I consider much more practicable than GPU-accelerated
 sorting is simple threading; quicksort can be parallelised very
 effectively, due to its divide-and-conquer nature. If we could agree
 on a threading abstraction more sophisticated than forking, it's
 something I'd be interested in looking at. To do so would obviously
 entail lots of discussion about how that relates to whatever way we
 eventually decide on implementing parallel query, and that's obviously
 a difficult discussion.

I have the same feeling about this that I do about almost every
executor optimization that anyone proposes: the whole project would be
entirely simple and relatively painless if it weren't for the need to
make planner changes.  I mean, deciding on a threading interface is
likely to be a somewhat contentious discussion, with differences of
opinion on whether we should do it and what the API should look like.
But at the end of the day it's not rocket science, and I expect that
we would end up with something reasonable.  What seems much harder is
figuring out how to decide when to perform quicksort in parallel vs.
single-threaded, and how much parallelism would be appropriate.  I
haven't seen anyone propose even a shadow of an idea about how to make
such decisions intelligently, either in general or in specific cases.

The other issue is that, while threading would be possibly suitable
for this particular case, at least for built-in datatypes with
comparison operations that basically reduce to single machine-language
comparison instructions, it's hard to see how we could take it much
further.  It would be unsafe for these multiple threads of execution
to do anything that could possibly throw an error or anything that
touches a lightweight lock or, really, just about anything at all.
Trying to make the entire backend thread-safe - or even any
significant portion of it - seems like a colossal effort that will
most likely fail, but maybe not without eating an enormous amount of
developer time first.  And without doing that, I don't think we could
even extend this as far as, say, numeric, whose functions do things
like palloc() and ereport() internally.  So I feel like this whole
approach might be a dead-end - there's a narrow range of cases where
it could be made to work, I think, but after that I think you hit a
titanium wall.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Inlining comparators as a performance optimisation

2011-09-21 Thread Heikki Linnakangas

On 21.09.2011 02:53, Peter Geoghegan wrote:

C stdlib quick-sort time elapsed: 2.092451 seconds
Inline quick-sort time elapsed: 1.587651 seconds

Does *that* look attractive to you?


Not really, to be honest. That's a 25% speedup in pure qsorting speed. 
How much of a gain in a real query do you expect to get from that, in 
the best case? There's so many other sources of overhead that I'm afraid 
this will be lost in the noise. If you find a query that spends, say, 
50% of its time in qsort(), you will only get a 12.5% speedup on that 
query. And even 50% is really pushing it - I challenge you to find a 
query that spends any significant amount of time qsorting integers.
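Heikki's 12.5% figure is simple proportional arithmetic: if a fraction p of a query's runtime is spent in qsort, and qsort alone gets fractionally s faster, the whole query only gets p*s faster. A one-line sketch of that arithmetic:

```c
/* Fraction of total runtime saved when a region taking fraction p of
 * the total gets fractionally s faster: the saving is simply p * s. */
static double
overall_fraction_saved(double p, double s)
{
	return p * s;
}
```

With p = 0.5 (half the query is qsort) and s = 0.25 (the 25% qsort speedup), the overall saving is 0.125, i.e. the 12.5% quoted above.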


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com



Re: [HACKERS] Inlining comparators as a performance optimisation

2011-09-21 Thread Simon Riggs
On Tue, Sep 20, 2011 at 3:51 AM, Tom Lane t...@sss.pgh.pa.us wrote:

 This performance patch differs from most in that it's difficult in
 principle to imagine a performance regression occurring.

 Really?  N copies of the same code could lead to performance loss just
 due to code bloat (ie, less of a query's inner loops fitting in CPU
 cache).  Not to mention the clear regression in maintainability.  So
 I'm disinclined to consider this sort of change without a significantly
 bigger win than you're suggesting above (no, I don't even consider the
 -O0 number attractive, let alone what you're finding at -O2).

More copies of the code are somewhat annoying, but its only 100 lines
of code in one module and we can easily have specific tests for each.
The extra code size is minor in comparison to the reams of code we add
elsewhere.

It's a surprisingly good win for such a common use case. Well done, Peter.

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: [HACKERS] Inlining comparators as a performance optimisation

2011-09-21 Thread Simon Riggs
On Wed, Sep 21, 2011 at 7:51 AM, Heikki Linnakangas
heikki.linnakan...@enterprisedb.com wrote:
 On 21.09.2011 02:53, Peter Geoghegan wrote:

 C stdlib quick-sort time elapsed: 2.092451 seconds
 Inline quick-sort time elapsed: 1.587651 seconds

 Does *that* look attractive to you?

 Not really, to be honest. That's a 25% speedup in pure qsorting speed. How
 much of a gain in a real query do you expect to get from that, in the best
 case? There's so many other sources of overhead that I'm afraid this will be
 lost in the noise. If you find a query that spends, say, 50% of its time in
 qsort(), you will only get a 12.5% speedup on that query. And even 50% is
 really pushing it - I challenge you to find a query that spends any
 significant amount of time qsorting integers.

How about almost every primary index creation?

Don't really see a reason for the negativity here. If you use that
argument, no performance gain is worth it because all workloads are
mixed.

This is a marvellous win, a huge gain from a small, isolated and
easily tested change. By far the smallest amount of additional code to
sorting we will have added and yet one of the best gains.

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: [HACKERS] Inlining comparators as a performance optimisation

2011-09-21 Thread Heikki Linnakangas

On 21.09.2011 10:01, Simon Riggs wrote:

On Wed, Sep 21, 2011 at 7:51 AM, Heikki Linnakangas
heikki.linnakan...@enterprisedb.com  wrote:

On 21.09.2011 02:53, Peter Geoghegan wrote:


C stdlib quick-sort time elapsed: 2.092451 seconds
Inline quick-sort time elapsed: 1.587651 seconds

Does *that* look attractive to you?


Not really, to be honest. That's a 25% speedup in pure qsorting speed. How
much of a gain in a real query do you expect to get from that, in the best
case? There's so many other sources of overhead that I'm afraid this will be
lost in the noise. If you find a query that spends, say, 50% of its time in
qsort(), you will only get a 12.5% speedup on that query. And even 50% is
really pushing it - I challenge you to find a query that spends any
significant amount of time qsorting integers.


How about almost every primary index creation?


Nope. Swamped by everything else.

Also note that as soon as the sort grows big enough to not fit in 
maintenance_work_mem, you switch to the external sort algorithm, which 
doesn't use qsort. Perhaps you could do similar inlining in the heap 
sort  merge passes done in the external sort, but it's unlikely to be 
as big a win there.


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com



Re: [HACKERS] Inlining comparators as a performance optimisation

2011-09-21 Thread Peter Geoghegan
On 21 September 2011 01:48,  karave...@mail.bg wrote:
 All -O2 version show 42% speedup with inlined qsort.
 -O0 showed 25% speedup.

Thanks. Looks like the figures I posted last night were fairly
conservative. Does anyone else care to report results?

-- 
Peter Geoghegan       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training and Services



Re: [HACKERS] Inlining comparators as a performance optimisation

2011-09-21 Thread Amit Kapila
 Recent discussions on the threads "Double sorting split patch" and
 "CUDA sorting" raised the possibility that there could be significant
 performance optimisation low-hanging fruit picked by having the
 executor treat integers and floats as a special case during sorting,
 avoiding going to the trouble of calling a comparator using the
 built-in SQL function machinery

Why only for integers and floats, and why not for char/varchar?

But I believe this can make the code less maintainable, as similar things
can be done in other places to avoid the SQL function machinery.

 

 Once the cache has been warmed, explain analyze very consistently
 reports a runtime of 123ms for this query on master/HEAD, which varies
 +/- 1 ms, with a few outliers of maybe +/- 2ms. However, when I apply
 this patch, that goes down to 107ms +/- 1ms at -O0.

The 123ms time, which is without your change, was measured at which
optimization level, -O2 or -O0?

 

 



-Original Message-
From: pgsql-hackers-ow...@postgresql.org
[mailto:pgsql-hackers-ow...@postgresql.org] On Behalf Of Peter Geoghegan
Sent: Tuesday, September 20, 2011 7:26 AM
To: PG Hackers
Subject: [HACKERS] Inlining comparators as a performance optimisation

 

Recent discussions on the threads "Double sorting split patch" and
"CUDA sorting" raised the possibility that there could be significant
performance optimisation low-hanging fruit picked by having the
executor treat integers and floats as a special case during sorting,
avoiding going to the trouble of calling a comparator using the
built-in SQL function machinery, and taking advantage of inlining of
the comparator, which has been shown to have a considerable
performance advantage (at least compared to a general purpose c stdlib
qsort(), that takes a function pointer as its comparator, much like
tuplesort).

I've hacked together a sloppy POC implementation in a hurry
(basically, some code is shifted around), which is attached - I felt
that it would be useful to determine if the community feels that this
is a worth-while undertaking in advance of a business trip that I'm
leaving on tomorrow lasting until Friday, during which I will be
mostly unavailable. The patch breaks the Postgres sorting executor
node (at least when it quicksorts) for any type other than int4. I
apologise for how rough the patch is, but the code itself isn't
important right now - the idea is. I anticipate that the value
state->datumType or something similar will be set in a near future
revision, so that tuplesort_performsort will know which type-specific
optimisation it can use for the type, while falling back on the
existing generic qsort_arg + qsort_arg_comparator, and sorting won't
actually be broken.

I've been doing some preliminary testing using the dell store 2 sample
database. I increase work_mem to '50MB', to ensure that a quicksort
will be performed for sorting (otherwise, I'm using the
postgresql.conf that initdb gave me). The query is:

explain analyze select * from orderlines order by prod_id;

Once the cache has been warmed, explain analyze very consistently
reports a runtime of 123ms for this query on master/HEAD, which varies
+/- 1 ms, with a few outliers of maybe +/- 2ms. However, when I apply
this patch, that goes down to 107ms +/- 1ms at -O0. I think that
that's a pretty good start. Funnily enough, the difference/advantage
vanishes at -O2 (I'm guessing that the higher optimisation level of
GCC 4.5 hyper-corrects away the inlining, but I don't have time to
check that right now).

I imagine the version that I actually submit for patch review will
have a macro-based infrastructure for inlining the sorting of various
built-in types, initially integers and floats. It will most likely
have some other optimisations - I haven't even used a profiler yet.

This performance patch differs from most in that it's difficult in
principle to imagine a performance regression occurring.

Thoughts?

-- 
Peter Geoghegan   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training and Services



Re: [HACKERS] Inlining comparators as a performance optimisation

2011-09-21 Thread Greg Stark
On Wed, Sep 21, 2011 at 8:08 AM, Heikki Linnakangas
heikki.linnakan...@enterprisedb.com wrote:
 How about almost every primary index creation?

 Nope. Swamped by everything else.

Really? I think it's pretty common for shops to be able to dedicate
large amounts of RAM to building initial indexes on data loads or
reindex operations. Enough that they can cache the entire table for
the short time they're doing the index builds even if they're quite
large. Witness the recent pleas to allow maintenance_work_mem on the
order of tens of gigabytes. And it's also pretty common that shops can
dedicate very large I/O bandwidth, in many cases enough to saturate
the memory bandwidth, for doing these kinds of batch operations when
they get large enough to need to do an external sort.

There's still overhead of reading the pages, the tuples, finding the
sort keys in the tuple, etc. But I think the actual qsort or heap
operations in tapesort are pretty big portions of the work.

This is pretty easy to measure. Just run oprofile or gprof and see
what percentage of time for a big index build is spent in qsort.

-- 
greg



Re: [HACKERS] Inlining comparators as a performance optimisation

2011-09-21 Thread Robert Haas
On Wed, Sep 21, 2011 at 8:47 AM, Greg Stark st...@mit.edu wrote:
 On Wed, Sep 21, 2011 at 8:08 AM, Heikki Linnakangas
 heikki.linnakan...@enterprisedb.com wrote:
 How about almost every primary index creation?

 Nope. Swamped by everything else.

 Really? I think it's pretty common for shops to be able to dedicate
 large amounts of RAM to building initial indexes on data loads or
 reindex operations. Enough that they can cache the entire table for
 the short time they're doing the index builds even if they're quite
 large. Witness the recent pleas to allow maintenance_work_mem on the
 order of tens of gigabytes. And it's also pretty common that shops can
 dedicate very large I/O bandwidth, in many cases enough to saturate
 the memory bandwidth, for doing these kinds of batch operations when
 they get large enough to need to do an external sort.

 There's still overhead of reading the pages, the tuples, finding the
 sort keys in the tuple, etc. But I think the actual qsort or heap
 operations in tapesort are pretty big portions of the work.

 This is pretty easy to measure. Just run oprofile or gprof and see
 what percentage of time for a big index build is spent in qsort.

+1 for some actual measurements.

I don't think anyone on this thread is saying that if we can get big
performance gains from doing this we still shouldn't do it.  But at
this point it's unclear that we can get a consistent speedup that
isn't heavily dependent on the choice of compiler flags (to say
nothing of compiler and OS), and even if we can, it's not clear that
it will still be noticeable when you measure the run time of an entire
query rather than just the speed of qsort().  Like Tom and Heikki, I'm
a bit skeptical: it wouldn't surprise me to find out that qsort() is
5% of the runtime in any realistic test case and the average qsort()
speedup based on tests on a couple different platforms is 10% and so
on average we're looking at a 0.5% improvement, in which case it might
be more trouble than it's worth, especially if it turns out that there
are OS/platform combinations where the inlined version is (for some
crazy reason) slower.  I've seen performance differences of up to 3%
from minor code rearrangements that don't seem like they should matter
at all, just because code and data shifts around across cache-line
boundaries and the new arrangement is slightly better or worse than
the old one.  So if the performance improvement turns out to be very
small, then validating that it actually IS an improvement in general
is likely to be kind of a pain in the ass.

On the other hand, the performance improvement might turn out to be
large.  Maybe there's a test case where, as Heikki suggests, 50% of
the time is spent in qsort().  If we can reliably make that 25%
faster, I wouldn't dismiss that out of hand; I think that would be
pretty good, assuming it didn't require massive amounts of spaghetti
code to make it work.  I don't see that that would be any more
marginal than the sorts of things we've optimized in, say, commit
4fc115b2e981f8c63165ca86a23215380a3fda66, or commit
f4d242ef94730c447d87b9840a40b0ec3371fe0f.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Inlining comparators as a performance optimisation

2011-09-21 Thread Peter Geoghegan
On 21 September 2011 07:51, Heikki Linnakangas
heikki.linnakan...@enterprisedb.com wrote:
 On 21.09.2011 02:53, Peter Geoghegan wrote:

 C stdlib quick-sort time elapsed: 2.092451 seconds
 Inline quick-sort time elapsed: 1.587651 seconds

 Does *that* look attractive to you?

 Not really, to be honest. That's a 25% speedup in pure qsorting speed. How
 much of a gain in a real query do you expect to get from that, in the best
 case? There's so many other sources of overhead that I'm afraid this will be
 lost in the noise.

I'm surprised that you're dismissive of this. After all, we have in
the past indulged in micro-optimisation of qsort, or so it would seem
from this comment:

 * We have modified their original by adding a check for already-sorted input,
 * which seems to be a win per discussions on pgsql-hackers around 2006-03-21.

"Makes affected queries radically faster" (in the best case, a speedup
somewhat greater than 12.5%) is an unreasonably high standard for a
performance optimisation of the executor in general (such a high
standard might be sensible if it was due to a particular
maintainability downside, but you didn't mention one). Even still, I
think that the 12.5% figure is pretty pessimistic - I've already sped
up the dell store query by almost that much, and that's with a patch
that was, due to circumstances, cobbled together.

Not only are we benefiting from the effects of inlining, we're also
benefiting from the removal of unnecessary indirection. As Tom said,
"In concrete terms, there would be no reason to have tuplesort.c's
myFunctionCall2Coll, and maybe not inlineApplySortFunction either, if
the datatype-specific comparison functions had APIs that were closer
to what sorting wants rather than following the general
SQL-callable-function API." He was just referring to the benefits of
removing indirection here, so ISTM that this is really two performance
optimisations rolled into one - it's conceivable that the total
performance improvement will even exceed the isolated inlining
comparator benchmark.

As I've said, I believe this patch can be committed without
compromising the maintainability of the tuplesort code to an extent
that is not clearly worth it, through the use of a clean, macro-based
abstraction. Concerns about bloated binaries are probably not well
founded, because what I'm proposing is to a certain extent emulating
C++ templates, while using a very common pattern used with C++
templates. In the C++ world, algorithms are often generalised as
templates, so that they can be used equally well with any datatype
(that supports the interface of the template), while availing of
compiler optimisations per template instantiation (instance of using a
given type with a given template). I actually got the idea for this
patch in part from a book that I read years ago that described the
fact that counter-intuitively, std::sort() consistently outperforms
qsort(), because the comparator is often inlined, and the compiler can
generally avail of optimisations from knowing the comparator at
compile-time.

On 21 September 2011 13:47, Greg Stark st...@mit.edu wrote:
 This is pretty easy to measure. Just run oprofile or gprof and see
 what percentage of time for a big index build is spent in qsort.

I'll do so soon. I intend to get to this on Friday evening, and
perhaps have a proper patch to show next week.

-- 
Peter Geoghegan       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training and Services



Re: [HACKERS] Inlining comparators as a performance optimisation

2011-09-21 Thread Heikki Linnakangas

On 21.09.2011 17:20, Peter Geoghegan wrote:

Even still, I
think that the 12.5% figure is pretty pessimistic - I've already sped
up the dell store query by almost that much, and that's with a patch
that was, due to circumstances, cobbled together.


I'm not against making things faster, it's just that I haven't seen 
solid evidence yet that this will help. Just provide a best-case test 
case for this that shows a huge improvement, and I'll shut up. If the 
improvement is only modest, then let's discuss how big it is and whether 
it's worth the code ugliness this causes.


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com



Re: [HACKERS] Inlining comparators as a performance optimisation

2011-09-21 Thread Tom Lane
Heikki Linnakangas heikki.linnakan...@enterprisedb.com writes:
 On 21.09.2011 17:20, Peter Geoghegan wrote:
 Even still, I
 think that the 12.5% figure is pretty pessimistic - I've already sped
 up the dell store query by almost that much, and that's with a patch
 that was, due to circumstances, cobbled together.

 I'm not against making things faster, it's just that I haven't seen 
 solid evidence yet that this will help. Just provide a best-case test 
 case for this that shows a huge improvement, and I'll shut up. If the 
 improvement is only modest, then let's discuss how big it is and whether 
 it's worth the code ugliness this causes.

The other question that I'm going to be asking is whether it's not
possible to get most of the same improvement with a much smaller code
footprint.  I continue to suspect that getting rid of the SQL function
impedance-match layer (myFunctionCall2Coll etc) would provide most of
whatever gain is to be had here, without nearly as large a cost in code
size and maintainability, and with the extra benefit that the speedup
would also be available to non-core datatypes.

regards, tom lane



Re: [HACKERS] Inlining comparators as a performance optimisation

2011-09-21 Thread Andrew Dunstan



On 09/21/2011 10:50 AM, Tom Lane wrote:

The other question that I'm going to be asking is whether it's not
possible to get most of the same improvement with a much smaller code
footprint.  I continue to suspect that getting rid of the SQL function
impedance-match layer (myFunctionCall2Coll etc) would provide most of
whatever gain is to be had here, without nearly as large a cost in code
size and maintainability, and with the extra benefit that the speedup
would also be available to non-core datatypes.




Can we get a patch so we can do benchmarks on this?

cheers

andrew



Re: [HACKERS] Inlining comparators as a performance optimisation

2011-09-21 Thread Tom Lane
Simon Riggs si...@2ndquadrant.com writes:
 This is a marvellous win, a huge gain from a small, isolated and
 easily tested change. By far the smallest amount of additional code to
 sorting we will have added and yet one of the best gains.

I think you forgot your cheerleader uniform.  A patch along these lines
is not going to be small, isolated, easily maintained, nor beneficial
for any but a small number of predetermined datatypes.

regards, tom lane



Re: [HACKERS] Inlining comparators as a performance optimisation

2011-09-21 Thread Peter Geoghegan
On 21 September 2011 15:50, Tom Lane t...@sss.pgh.pa.us wrote:
 Heikki Linnakangas heikki.linnakan...@enterprisedb.com writes:
 I'm not against making things faster, it's just that I haven't seen
 solid evidence yet that this will help. Just provide a best-case test
 case for this that shows a huge improvement, and I'll shut up. If the
 improvement is only modest, then let's discuss how big it is and whether
 it's worth the code ugliness this causes.

Fair enough.

 The other question that I'm going to be asking is whether it's not
 possible to get most of the same improvement with a much smaller code
 footprint.

That's a reasonable question, and I hope to be able to come up with a
good answer.

 I continue to suspect that getting rid of the SQL function
 impedance-match layer (myFunctionCall2Coll etc) would provide most of
 whatever gain is to be had here, without nearly as large a cost in code
 size and maintainability, and with the extra benefit that the speedup
 would also be available to non-core datatypes.

I'm fairly surprised that your view on that is mostly or entirely
unchanged, even after I've demonstrated a considerable performance
advantage from a macro-based qsort implementation over my OS vendor's
c std lib qsort(), using an isolated test-case, that does not have
anything to do with that impedance mismatch. I'm not sure why you
doubt that the same thing is happening within tuplesort.

-- 
Peter Geoghegan       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training and Services



Re: [HACKERS] Inlining comparators as a performance optimisation

2011-09-21 Thread Tom Lane
Andrew Dunstan and...@dunslane.net writes:
 On 09/21/2011 10:50 AM, Tom Lane wrote:
 The other question that I'm going to be asking is whether it's not
 possible to get most of the same improvement with a much smaller code
 footprint.  I continue to suspect that getting rid of the SQL function
 impedance-match layer (myFunctionCall2Coll etc) would provide most of
 whatever gain is to be had here, without nearly as large a cost in code
 size and maintainability, and with the extra benefit that the speedup
 would also be available to non-core datatypes.

 Can we get a patch so we can do benchmarks on this?

Well, we'd have to negotiate what the API ought to be.  What I'm
envisioning is that datatypes could provide alternate comparison
functions that are designed to be qsort-callable rather than
SQL-callable.  As such, they could not have entries in pg_proc, so
it seems like there's no ready way to represent them in the catalogs.

The idea that I was toying with was to allow the regular SQL-callable
comparison function to somehow return a function pointer to the
alternate comparison function, so that the first comparison in a given
sort run would be done the traditional way but then we'd notice the
provided function pointer and start using that.  It would not be too
hard to pass back the pointer using FunctionCallInfoData.context, say.
The downside is adding cycles to unoptimized cases to uselessly check
for a returned function pointer that's not there.  Perhaps it could be
hacked so that we only add cycles to the very first call, but I've not
looked closely at the code to see what would be involved.

Has anyone got a better idea for getting hold of the alternate function?

regards, tom lane



Re: [HACKERS] Inlining comparators as a performance optimisation

2011-09-21 Thread Heikki Linnakangas

On 21.09.2011 18:46, Tom Lane wrote:

Andrew Dunstan and...@dunslane.net writes:

On 09/21/2011 10:50 AM, Tom Lane wrote:

The other question that I'm going to be asking is whether it's not
possible to get most of the same improvement with a much smaller code
footprint.  I continue to suspect that getting rid of the SQL function
impedance-match layer (myFunctionCall2Coll etc) would provide most of
whatever gain is to be had here, without nearly as large a cost in code
size and maintainability, and with the extra benefit that the speedup
would also be available to non-core datatypes.



Can we get a patch so we can do benchmarks on this?


Well, we'd have to negotiate what the API ought to be.  What I'm
envisioning is that datatypes could provide alternate comparison
functions that are designed to be qsort-callable rather than
SQL-callable.  As such, they could not have entries in pg_proc, so
it seems like there's no ready way to represent them in the catalogs.

The idea that I was toying with was to allow the regular SQL-callable
comparison function to somehow return a function pointer to the
alternate comparison function, so that the first comparison in a given
sort run would be done the traditional way but then we'd notice the
provided function pointer and start using that.  It would not be too
hard to pass back the pointer using FunctionCallInfoData.context, say.
The downside is adding cycles to unoptimized cases to uselessly check
for a returned function pointer that's not there.  Perhaps it could be
hacked so that we only add cycles to the very first call, but I've not
looked closely at the code to see what would be involved.


You could have a new function with a pg_proc entry, that just returns a 
function pointer to the qsort-callback.


Or maybe the interface should be an even more radical replacement of 
qsort, not just the comparison function. Instead of calling qsort, 
tuplesort.c would call the new datatype-specific sort-function (which 
would be in pg_proc). The implementation could use an inlined version of 
qsort, like Peter is suggesting, or it could do something completely 
different, like a radix sort or a GPU-assisted sort or whatever.


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com



Re: [HACKERS] Inlining comparators as a performance optimisation

2011-09-21 Thread Greg Stark
On Wed, Sep 21, 2011 at 4:46 PM, Tom Lane t...@sss.pgh.pa.us wrote:
  As such, they could not have entries in pg_proc, so
 it seems like there's no ready way to represent them in the catalogs.

Why couldn't they be in pg_proc with a bunch of opaque arguments like
the GIST opclass support functions?

I'm a bit puzzled what the arguments would look like. They would still
need to know the collation, nulls first/last flags, etc.
And calling it would still not be inlinable.  So they would have to
check those flags on each invocation instead of having a piece of
straightline code that hard codes the behaviour with the right
behaviour inline.  ISTM the hope for a speedup from the inlining
mostly came from the idea that the compiler might be able to hoist
this logic outside the loop (and I suppose implement n specialized
loops depending on the behaviour needed).

-- 
greg



Re: [HACKERS] Inlining comparators as a performance optimisation

2011-09-21 Thread Tom Lane
Heikki Linnakangas heikki.linnakan...@enterprisedb.com writes:
 On 21.09.2011 18:46, Tom Lane wrote:
 The idea that I was toying with was to allow the regular SQL-callable
 comparison function to somehow return a function pointer to the
 alternate comparison function,

 You could have a new function with a pg_proc entry, that just returns a 
 function pointer to the qsort-callback.

Yeah, possibly.  That would be a much more invasive change, but cleaner
in some sense.  I'm not really prepared to do all the legwork involved
in that just to get to a performance-testable patch though.

 Or maybe the interface should be an even more radical replacement of 
 qsort, not just the comparison function. Instead of calling qsort, 
 tuplesort.c would call the new datatype-specific sort-function (which 
 would be in pg_proc). The implementation could use an inlined version of 
 qsort, like Peter is suggesting, or it could do something completely 
 different, like a radix sort or a GPU-assisted sort or whatever.

No.  In the first place, that only helps for in-memory sorts.  In the
second, it would absolutely destroy our ability to change the behavior
of sorting ever again.  Considering that we've added ASC/DESC, NULLS
FIRST/LAST, and collation support over the years, are you really
prepared to bet that the sort code will never need any more feature
upgrades?  (This concern is in fact the source of my beef with the whole
inlining proposal to begin with, but allowing the inlining to occur into
third-party code that we don't control at all would be a hundred times
worse.)

regards, tom lane



Re: [HACKERS] Inlining comparators as a performance optimisation

2011-09-21 Thread Heikki Linnakangas

On 21.09.2011 18:46, Tom Lane wrote:

Well, we'd have to negotiate what the API ought to be.  What I'm
envisioning is that datatypes could provide alternate comparison
functions that are designed to be qsort-callable rather than
SQL-callable.  As such, they could not have entries in pg_proc, so
it seems like there's no ready way to represent them in the catalogs.


Quite aside from this qsort-thing, it would be nice to have versions of 
all simple functions that could be called without the FunctionCall 
overhead. So instead of:


FunctionCall2(flinfo_for_int4pl, 1, 2)

you could do simply

int4pl_fastpath(1,2)

I'm not sure how big an effect this would have, but it seems like it 
could shave some cycles across the system.


We could have an extended version of the PG_FUNCTION_INFO_V1 macro that 
would let you register the fastpath function:


PG_FUNCTION_INFO_V1(int4pl, int4pl_fastpath);

--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com



Re: [HACKERS] Inlining comparators as a performance optimisation

2011-09-21 Thread Tom Lane
Greg Stark st...@mit.edu writes:
 On Wed, Sep 21, 2011 at 4:46 PM, Tom Lane t...@sss.pgh.pa.us wrote:
  As such, they could not have entries in pg_proc, so
 it seems like there's no ready way to represent them in the catalogs.

 Why couldn't they be in pg_proc with a bunch of opaque arguments like
 the GIST opclass support functions?

That does not mean the same thing at all.  Everything in pg_proc is
meant to be called through the V0 or V1 function call info protocols.

 I'm a bit puzzled what the arguments would look like. They would still
 need to know the collation, nulls first/last flags, etc.

No, I wasn't thinking that we should do that.  The datatype comparison
functions should have the exact same semantics they do now, just a
lower-overhead call mechanism.  If you try to push stuff like NULLS
FIRST/LAST into the per-datatype code, then you are up against a problem
when you want to add a new flag: you have to touch lots of code not all
of which you even control.

 And calling it would still not be inlinable.  So they would have to
 check those flags on each invocation instead of having a piece of
 straightline code that hard codes the behaviour with the right
 behaviour inline.  ISTM the hope for a speedup from the inlining
 mostly came from the idea that the compiler might be able to hoist
 this logic outside the loop (and I suppose implement n specialized
 loops depending on the behaviour needed).

None of that stuff is inlinable or constant-foldable today, nor would it
be with the patch that Peter was proposing AFAICS, because none of the
flags will ever be compile time constant values.

regards, tom lane



Re: [HACKERS] Inlining comparators as a performance optimisation

2011-09-21 Thread Tom Lane
Heikki Linnakangas heikki.linnakan...@enterprisedb.com writes:
 On 21.09.2011 18:46, Tom Lane wrote:
 Well, we'd have to negotiate what the API ought to be.  What I'm
 envisioning is that datatypes could provide alternate comparison
 functions that are designed to be qsort-callable rather than
 SQL-callable.  As such, they could not have entries in pg_proc, so
 it seems like there's no ready way to represent them in the catalogs.

 Quite aside from this qsort-thing, it would be nice to have versions of 
 all simple functions that could be called without the FunctionCall 
 overhead.

Hmm, that's an interesting idea.  I think probably the important aspects
are (1) known number of arguments and (2) no null argument or result
values are allowed.  Not sure what we'd do with collations though.

 We could have an extended version of the PG_FUNCTION_INFO_V1 macro that 
 would let you register the fastpath function:
 PG_FUNCTION_INFO_V1(int4pl, int4pl_fastpath);

We don't use PG_FUNCTION_INFO_V1 for built-in functions ...

regards, tom lane



Re: [HACKERS] Inlining comparators as a performance optimisation

2011-09-21 Thread Greg Stark
On Wed, Sep 21, 2011 at 5:23 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 None of that stuff is inlinable or constant-foldable today, nor would it
 be with the patch that Peter was proposing AFAICS, because none of the
 flags will ever be compile time constant values.

I was referring to transformations like this which I believe compilers
are already capable of doing:

v = ...;
while (...)
  if (v) {
    if (a < b) ... else ...;
  } else {
    if (a < b) ... else ...;
  }

turning it into code that looks like:

if (v) {
  while (...)
    if (a < b) ... else ...;
} else {
  while (...)
    if (a < b) ... else ...;
}

which may not look like much -- especially with branch prediction --
but then it's much more likely to be able to unroll the loop and do
clever instruction scheduling and so on than if there's an extra
branch in the middle of the loop. But if there's a function call to an
external function called through a function pointer in the middle of
the loop then the whole endeavour would be for naught.



-- 
greg



Re: [HACKERS] Inlining comparators as a performance optimisation

2011-09-21 Thread Simon Riggs
On Wed, Sep 21, 2011 at 4:22 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 Simon Riggs si...@2ndquadrant.com writes:
 This is a marvellous win, a huge gain from a small, isolated and
 easily tested change. By far the smallest amount of additional code to
 sorting we will have added and yet one of the best gains.

 I think you forgot your cheerleader uniform.

LOL. I'm happy whoever and whenever we get large wins like that.

Go Postgres!

 A patch along these lines
 is not going to be small, isolated, easily maintained, nor beneficial
 for any but a small number of predetermined datatypes.

That was the starting premise.

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: [HACKERS] Inlining comparators as a performance optimisation

2011-09-20 Thread Peter Geoghegan
On 20 September 2011 03:51, Tom Lane t...@sss.pgh.pa.us wrote:
 Considering that -O2 is our standard optimization level, that
 observation seems to translate to this patch will be useless in
 practice.  I think you had better investigate that aspect in some
 detail before spending more effort.

I don't think that the fact that that happens is at all significant at
this early stage, and it never even occurred to me that you'd think
that it might be. I was simply disclosing a quirk of this POC patch.
The workaround is probably to use a macro instead. For the benefit of
those that didn't follow the other threads, the macro-based qsort
implementation, which I found to perform significantly better than
regular qsort(), runs like this on my laptop when I built at -O2 with
GCC 4.6 just now:

C stdlib quick-sort time elapsed: 2.092451 seconds
Inline quick-sort time elapsed: 1.587651 seconds

Does *that* look attractive to you? I've attached source code of the
program that produced these figures, which has been ported to C from
C++.

When I #define LARGE_SIZE 1, here's what I see:

[peter@peter inline_compar_test]$ ./a.out
C stdlib quick-sort time elapsed: 23.659411 seconds
Inline quick-sort time elapsed: 18.470611 seconds

Here, sorting with the function pointer/stdlib version takes about
1.28 times as long. In the prior test (with the smaller LARGE_SIZE),
it took about 1.32 times as long. Fairly predictable, linear, and not
to be sniffed at.

The variance I'm seeing across runs is low - a couple of hundredths of
a second at most. This is a Fedora 15, Intel(R) Core(TM) i5-2540M CPU
@ 2.60GHz machine. I'm not sure right now why the inline quick-sort
is less of a win than on my old Fedora 14 desktop (where it was 3.24
vs 2.01), but it's still a significant win. Perhaps others can build
this simple program and tell me what they come up with.

 This performance patch differs from most in that it's difficult in
 principle to imagine a performance regression occurring.

 Really?  N copies of the same code could lead to performance loss just
 due to code bloat (ie, less of a query's inner loops fitting in CPU
 cache).

I did consider that. Of course inlining has an overhead, and I'll be
testing that each instance of inlining has a net benefit. I just meant
that many other performance patches have an obvious worst case, and I
think that it is policy to focus on that case, but I can't think of
one here. Does anyone else have any ideas?

 Not to mention the clear regression in maintainability.  So
 I'm disinclined to consider this sort of change without a significantly
 bigger win than you're suggesting above

Sure, there'll be some sort of regression in maintainability - I think
that HOT had a clear regression in maintainability too. The important
questions are obviously how big is the loss of maintainability?, and
is it worth it?. We'll know more when this work is actually shaped
into a proper patch. Perhaps I should have waited until I had
something along those lines before making an announcement, but I
wanted community input as early as possible. I think that there's
plenty of tweaking that can be done to get additional performance
improvements - all I've done so far is demonstrate that those
improvements are real and worth thinking about, in the fastest
possible way, partly because you expressed skepticism of the benefits
of inlining comparators to Greg Stark in an earlier thread.

Performance and maintainability are often somewhat in tension, but we
cannot ignore performance. If this work can bring us an improvement in
performance approaching the isolated macro Vs qsort() function pointer
benchmark, that's a *big* win. Sorting integers and floats is very
common and important.

-- 
Peter Geoghegan       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training and Services


inline_compar_test.tar.gz
Description: GNU Zip compressed data



Re: [HACKERS] Inlining comparators as a performance optimisation

2011-09-20 Thread karavelov
- Quote from Peter Geoghegan (pe...@2ndquadrant.com), on 21.09.2011 at 02:53 -

 On 20 September 2011 03:51, Tom Lane t...@sss.pgh.pa.us wrote:
 Considering that -O2 is our standard optimization level, that
 observation seems to translate to this patch will be useless in
 practice.  I think you had better investigate that aspect in some
 detail before spending more effort.
 
 I don't think that the fact that that happens is at all significant at
 this early stage, and it never even occurred to me that you'd think
 that it might be. I was simply disclosing a quirk of this POC patch.
 The workaround is probably to use a macro instead. For the benefit of
 those that didn't follow the other threads, the macro-based qsort
 implementation, which I found to perform significantly better than
 regular qsort(), runs like this on my laptop when I built at -O2 with
 GCC 4.6 just now:
 
 C stdlib quick-sort time elapsed: 2.092451 seconds
 Inline quick-sort time elapsed: 1.587651 seconds
 
 Does *that* look attractive to you? I've attached source code of the
 program that produced these figures, which has been ported to C from
 C++.
 
 When I #define LARGE_SIZE 1, here's what I see:
 
 [peter@peter inline_compar_test]$ ./a.out
 C stdlib quick-sort time elapsed: 23.659411 seconds
 Inline quick-sort time elapsed: 18.470611 seconds
 
 Here, sorting with the function pointer/stdlib version takes about
 1.28 times as long. In the prior test (with the smaller LARGE_SIZE),
 it took about 1.32 times as long. Fairly predictable, linear, and not
 to be sniffed at.
 
 The variance I'm seeing across runs is low - a couple of hundredths of
 a second at most. This is a Fedora 15, Intel(R) Core(TM) i5-2540M CPU
 @ 2.60GHz machine. I'm not sure right now why the inline quick-sort
 is less of a win than on my old Fedora 14 desktop (where it was 3.24
 vs 2.01), but it's still a significant win. Perhaps others can build
 this simple program and tell me what they come up with.
 
Run it here.

Intel(R) Core(TM)2 Duo CPU E8200  @ 2.66GHz
gcc version 4.6.1 (Debian 4.6.1-10)

g++ -O2 qsort-inline-benchmark.c
./a.out 
C stdlib quick-sort time elapsed: 1.942686 seconds
Inline quick-sort time elapsed: 1.126508 seconds

With #define LARGE_SIZE 1

C stdlib quick-sort time elapsed: 22.158207 seconds
Inline quick-sort time elapsed: 12.861018 seconds

with g++ -O0
C stdlib quick-sort time elapsed: 2.736360 seconds
Inline quick-sort time elapsed: 2.045619 seconds

On server hardware:
Intel(R) Xeon(R) CPU   E5405  @ 2.00GHz
gcc version 4.4.5 (Debian 4.4.5-8)

./a.out 
C stdlib quick-sort time elapsed: 2.610150 seconds
Inline quick-sort time elapsed: 1.494198 seconds

All -O2 versions show a 42% speedup with the inlined qsort.
-O0 showed a 25% speedup.

Best regards

--
Luben Karavelov

Re: [HACKERS] Inlining comparators as a performance optimisation

2011-09-19 Thread Tom Lane
Peter Geoghegan pe...@2ndquadrant.com writes:
 Once the cache has been warmed, explain analyze very consistently
 reports a runtime of 123ms for this query on master/HEAD, which varies
 +/- 1 ms, with a few outliers of maybe +/- 2ms. However, when I apply
 this patch, that goes down to 107ms +/- 1ms at -O0. I think that
 that's a pretty good start. Funnily enough, the difference/advantage
 vanishes at -O2 (I'm guessing that the higher optimisation level of
 GCC 4.5 hyper-corrects away the inlining, but I don't have time to
 check that right now).

Considering that -O2 is our standard optimization level, that
observation seems to translate to this patch will be useless in
practice.  I think you had better investigate that aspect in some
detail before spending more effort.

 This performance patch differs from most in that it's difficult in
 principle to imagine a performance regression occurring.

Really?  N copies of the same code could lead to performance loss just
due to code bloat (ie, less of a query's inner loops fitting in CPU
cache).  Not to mention the clear regression in maintainability.  So
I'm disinclined to consider this sort of change without a significantly
bigger win than you're suggesting above (no, I don't even consider the
-O0 number attractive, let alone what you're finding at -O2).

regards, tom lane
