Re: [HACKERS] Inlining comparators as a performance optimisation
On Fri, Jan 13, 2012 at 10:48:56AM +0100, Pierre C wrote: Maybe an extra column in pg_proc would do (but then, the proargtypes and friends would describe only the sql-callable version) ? Or an extra table ? pg_cproc ? Or an in-memory hash : hashtable[ fmgr-callable function ] = C version - What happens if a C function has no SQL-callable equivalent ? Or (ugly) introduce an extra per-type function type_get_function_ptr( function_kind ) which returns the requested function ptr If one of those happens, I'll dust off my old copy-optimization patch ;) I agree that COPY is ripe for optimization, and I am excited you have some ideas on this. -- Bruce Momjian br...@momjian.ushttp://momjian.us EnterpriseDB http://enterprisedb.com + It's impossible for everything to be true. + -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Inlining comparators as a performance optimisation
On Wed, 21 Sep 2011 18:13:07 +0200, Tom Lane t...@sss.pgh.pa.us wrote: Heikki Linnakangas heikki.linnakan...@enterprisedb.com writes: On 21.09.2011 18:46, Tom Lane wrote: The idea that I was toying with was to allow the regular SQL-callable comparison function to somehow return a function pointer to the alternate comparison function, You could have a new function with a pg_proc entry, that just returns a function pointer to the qsort-callback. Yeah, possibly. That would be a much more invasive change, but cleaner in some sense. I'm not really prepared to do all the legwork involved in that just to get to a performance-testable patch though. A few years ago I had looked for a way to speed up COPY operations, and it turned out that COPY TO has a good optimization opportunity. At that time, for each datum, COPY TO would : - test for nullness - call an outfunc through fmgr - outfunc pallocs() a bytea or text, fills it with data, and returns it (sometimes it uses an extensible string buffer which may be repalloc()d several times) - COPY memcpy()s returned data to a buffer and eventually flushes the buffer to client socket. I introduced a special write buffer with an on-flush callback (ie, a close relative of the existing string-buffer), in this case the callback was flush to client socket, and several outfuncs (one per type) which took that buffer as argument, besides the datum to output, and simply put the datum inside the buffer, with appropriate transformations (like converting to bytea or text), and flushed if needed. Then the COPY TO BINARY of a constant-size datum would turn to : - one test for nullness - one C function call - one test to ensure appropriate space available in buffer (flush if needed) - one htonl() and memcpy of constant size, which the compiler turns out into a couple of simple instructions I recall measuring speedups of 2x - 8x on COPY BINARY, less for text, but still large gains. Although eliminating fmgr call and palloc overhead was an important part of it, another large part was getting rid of memcpy()'s which the compiler turned into simple movs for known-size types, a transformation that can be done only if the buffer write functions are inlined inside the outfuncs. Compilers love constants... Additionnally, code size growth was minimal since I moved the old outfuncs code into the new outfuncs, and replaced the old fmgr-callable outfuncs with create buffer with on-full callback=extend_and_repalloc() - pass to new outfunc(buffer,datum) - return buffer. Which is basically equivalent to the previous palloc()-based code, maybe with a few extra instructions. When I submitted the patch for review, Tom rightfully pointed out that my way of obtaining the C function pointer sucked very badly (I don't remember how I did it, only that it was butt-ugly) but the idea was to get a quick measurement of what could be gained, and the result was positive. Unfortunately I had no time available to finish it and make it into a real patch, I'm sorry about that. So why do I post in this sorting topic ? It seems, by bypassing fmgr for functions which are small, simple, and called lots of times, there is a large gain to be made, not only because of fmgr overhead but also because of the opportunity for new compiler optimizations, palloc removal, etc. However, in my experiment the arguments and return types of the new functions were DIFFERENT from the old functions : the new ones do the same thing, but in a different manner. One manner was suited to sql-callable functions (ie, palloc and return a bytea) and another one to writing large amounts of data (direct buffer write). Since both have very different requirements, being fast at both is impossible for the same function. Anyway, all that rant boils down to : Some functions could benefit having two versions (while sharing almost all the code between them) : - User-callable (fmgr) version (current one) - C-callable version, usually with different parameters and return type And it would be cool to have a way to grab a bare function pointer on the second one. Maybe an extra column in pg_proc would do (but then, the proargtypes and friends would describe only the sql-callable version) ? Or an extra table ? pg_cproc ? Or an in-memory hash : hashtable[ fmgr-callable function ] = C version - What happens if a C function has no SQL-callable equivalent ? Or (ugly) introduce an extra per-type function type_get_function_ptr( function_kind ) which returns the requested function ptr If one of those happens, I'll dust off my old copy-optimization patch ;) Hmm... just my 2c Regards Pierre -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Inlining comparators as a performance optimisation
I think we can call a new sorting infrastructure popping out and what looks to be over 90 messages on this topic as successful progress on this front. Peter's going to rev a new patch, but with more performance results to review and followup discussion I can't see this one as wrapping for the current CommitFest. Marking it returned and we'll return to this topic during or before the next CF. With the majority of the comments here coming from committers, I think continuing progress in that direction won't be a problem. -- Greg Smith 2ndQuadrant USg...@2ndquadrant.com Baltimore, MD PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.us -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Inlining comparators as a performance optimisation
On 7 December 2011 03:45, Robert Haas robertmh...@gmail.com wrote: In this regard, I think Heikki's remarks upthread are worth some thought. If inlining is a win just because it avoids saving and restoring registers or allows better instruction scheduling, then inlining is the (probably?) the only way to get the benefit. But if most of the benefit is in having a separate path for the single-sort-key case, we can do that without duplicating the qsort() code and get the benefit for every data type without much code bloat. I'd like to see us dig into that a little, so that we get the broadest possible benefit out of this work. It doesn't bother me that not every optimization will apply to every case, and I don't object to optimizations that are intrinsically narrow (within some reasonable limits). But I'd rather not take what could be a fairly broad-based optimization and apply it only narrowly, all things being equal. I think we're in agreement then. It's important to realise, though I'm sure you do, that once you've done what I did with sorting single scanKey integers, that's it - there isn't really any way that I can think of to speed up quick sorting further, other than parallelism which has major challenges, and isn't particularly appropriate at this granularity. The real significance of inlining is, well, that's it, that's all there is - after inlining, I predict that the well will run dry, or practically dry. Inlining actually is a pretty great win when you look at sorting in isolation, but it's just that there's been a number of other significant wins in my patch, and of course there is overhead from a number of other areas, so perhaps it's easy to lose sight of that. -- Peter Geoghegan http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training and Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Inlining comparators as a performance optimisation
On Tue, Dec 6, 2011 at 8:46 PM, Tom Lane t...@sss.pgh.pa.us wrote: 1. Adding sortsupport infrastructure for more datatypes. 2. Revising nbtree and related code to use this infrastructure. 3. Integrating Peter's work into this framework. I'll try to take care of #1 for at least a few key datatypes before I commit, but I think #2 is best done as a separate patch, so I'll postpone that till later. I see you've committed a chunk of this now. Does it make sense to do #1 for every data type we support, or should we be more selective than that? My gut feeling would be to add it across the board and introduce an opr_sanity check for it. The general utility of adding it to deprecated types like abstime is perhaps questionable, but it strikes me that the value of making everything consistent probably outweighs the cost of a few extra lines of code. Are you planning to do anything about #2 or #3? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Inlining comparators as a performance optimisation
Robert Haas robertmh...@gmail.com writes: On Tue, Dec 6, 2011 at 8:46 PM, Tom Lane t...@sss.pgh.pa.us wrote: 1. Adding sortsupport infrastructure for more datatypes. 2. Revising nbtree and related code to use this infrastructure. 3. Integrating Peter's work into this framework. I'll try to take care of #1 for at least a few key datatypes before I commit, but I think #2 is best done as a separate patch, so I'll postpone that till later. I see you've committed a chunk of this now. Does it make sense to do #1 for every data type we support, or should we be more selective than that? Basically, I tried to do #1 for every datatype for which the comparator was cheap enough that reducing the call overhead seemed likely to make a useful difference. I'm not in favor of adding sortsupport functions where this is not true, as I think it'll be useless code and catalog bloat. I don't want to add 'em for cruft like abstime either. There's some stuff that's debatable according to this criterion --- in particular, I wondered whether it'd be worth having a fast path for bttextcmp, especially if we pre-tested the collate_is_c condition and had a separate version that just hardwired the memcmp code path. (The idea of doing that was one reason I insisted on collation being known at the setup step.) But it would still have to be prepared for detoasting, so in the end I was unenthused. Anyone who feels like testing could try to prove me wrong about it though. Are you planning to do anything about #2 or #3? I am willing to do #2, but not right now; I feel what I need to do next is go review SPGist. I don't believe that #2 blocks progress on #3 anyway. I think #3 is in Peter's court, or yours if you want to do it. (BTW, I agree with your comments yesterday about trying to break down the different aspects of what Peter did, and put as many of them as we can into the non-inlined code paths.) regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Inlining comparators as a performance optimisation
On Wed, Dec 7, 2011 at 10:09 AM, Tom Lane t...@sss.pgh.pa.us wrote: There's some stuff that's debatable according to this criterion --- in particular, I wondered whether it'd be worth having a fast path for bttextcmp, especially if we pre-tested the collate_is_c condition and had a separate version that just hardwired the memcmp code path. (The idea of doing that was one reason I insisted on collation being known at the setup step.) But it would still have to be prepared for detoasting, so in the end I was unenthused. Anyone who feels like testing could try to prove me wrong about it though. I think that'd definitely be worth investigating (although I'm not sure I have the time to do it myself any time real soon). Are you planning to do anything about #2 or #3? I am willing to do #2, but not right now; I feel what I need to do next is go review SPGist. Yeah, makes sense. That one seems likely to be a challenge to absorb. I don't believe that #2 blocks progress on #3 anyway. I think #3 is in Peter's court, or yours if you want to do it. (BTW, I agree with your comments yesterday about trying to break down the different aspects of what Peter did, and put as many of them as we can into the non-inlined code paths.) Cool. Peter, can you rebase your patch and integrate it into the sortsupport framework that's now committed? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Inlining comparators as a performance optimisation
On 7 December 2011 15:15, Robert Haas robertmh...@gmail.com wrote: On Wed, Dec 7, 2011 at 10:09 AM, Tom Lane t...@sss.pgh.pa.us wrote: But it would still have to be prepared for detoasting, so in the end I was unenthused. Anyone who feels like testing could try to prove me wrong about it though. I think that'd definitely be worth investigating (although I'm not sure I have the time to do it myself any time real soon). I'll at least take a look at it. Sorting text is a fairly common case. I'm not hugely enthused about spending too much time on work that will only be useful if collate_is_c. I don't believe that #2 blocks progress on #3 anyway. I think #3 is in Peter's court, or yours if you want to do it. (BTW, I agree with your comments yesterday about trying to break down the different aspects of what Peter did, and put as many of them as we can into the non-inlined code paths.) I'm confident that we should have everything for the simple case of ordering by a single int4 and int8 column, and I think you'd probably agree with that - they're extremely common cases. Anything beyond that will need to be justified, probably in part by running additional benchmarks. Cool. Peter, can you rebase your patch and integrate it into the sortsupport framework that's now committed? Yes, I'd be happy to, though I don't think I'm going to be getting around to it this side of Friday. Since it isn't a blocker, I assume that's okay. The rebased revision will come complete with a well thought-out rationale for my use of inlining specialisations, that takes account of the trade-off against binary bloat that Tom highlighted. I wasn't ignoring that issue, but I did fail to articulate my thoughts there, mostly because I felt the need to do some additional research to justify my position. -- Peter Geoghegan http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training and Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Inlining comparators as a performance optimisation
On Wed, Dec 7, 2011 at 10:58 AM, Peter Geoghegan pe...@2ndquadrant.com wrote: On 7 December 2011 15:15, Robert Haas robertmh...@gmail.com wrote: On Wed, Dec 7, 2011 at 10:09 AM, Tom Lane t...@sss.pgh.pa.us wrote: But it would still have to be prepared for detoasting, so in the end I was unenthused. Anyone who feels like testing could try to prove me wrong about it though. I think that'd definitely be worth investigating (although I'm not sure I have the time to do it myself any time real soon). I'll at least take a look at it. Sorting text is a fairly common case. I'm not hugely enthused about spending too much time on work that will only be useful if collate_is_c. Well, text_pattern_ops is not exactly an uncommon case, but sure. And there might be some benefit for other collations, too. Cool. Peter, can you rebase your patch and integrate it into the sortsupport framework that's now committed? Yes, I'd be happy to, though I don't think I'm going to be getting around to it this side of Friday. Since it isn't a blocker, I assume that's okay. Sounds fine to me. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Inlining comparators as a performance optimisation
2011/12/5 Tom Lane t...@sss.pgh.pa.us: What is bothering me is that this approach is going to cause substantial bloat of the executable code, and such bloat has got distributed costs, which we don't have any really good way to measure but for sure micro-benchmarks addressing only sort speed aren't going to reveal them. Cache lines filled with sort code take cycles to flush and replace with something else. I think it's possibly reasonable to have inlined copies of qsort for a small number of datatypes, but it seems much less reasonable to have multiple copies per datatype in order to obtain progressively tinier micro-benchmark speedups. We need to figure out what the tradeoff against executable size really is, but so far it seems like you've been ignoring the fact that there is any such tradeoff at all. [ Randomly spouting ideas here: ] Might it not be a good idea to decide whether to use the inlined copies vs. the normal version, based on how much data to sort? Surely for a very large sort, the cache blow-out doesn't matter that much relative to the amount of time that can be saved sorting? Assuming that all types would have an inlined sort function, although that would indeed result in a larger binary, most of that binary would never touch the cache, because corresponding large sorts are never performed. If they would sporadically occur (and assuming the points at which inline sorting starts to get used are chosen wisely), the possibly resulting cache blow-out would be a net win. I am also assuming here that instruction cache lines are small enough for case line aliasing not to become a problem; putting all sort functions next to each other in the binary (so that they don't alias with the rest of the backend code that might be used more often) might alleviate that. Nicolas -- A. Because it breaks the logical sequence of discussion. Q. Why is top posting bad? -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Inlining comparators as a performance optimisation
On Sun, Dec 4, 2011 at 2:17 PM, Tom Lane t...@sss.pgh.pa.us wrote: * I invented a SortKey struct that replaces ScanKey for tuplesort's purposes. Right now that's local in tuplesort.c, but we're more than likely going to need it elsewhere as well. Should we just define it in sortsupport.h? Or perhaps we should just add the few additional fields to SortSupportInfoData, and not bother with two struct types? Before you answer, consider the next point. +1 for not bothering with two struct types. We might want to consider calling the resulting structure SortKey rather than SortSupportInfoData, however. * I wonder whether it would be worthwhile to elide inlineApplyComparator altogether, pushing what it does down to the level of the datatype-specific functions. That would require changing the comparator API to include isnull flags, and exposing the reverse/nulls_first sort flags to the comparators (presumably by including them in SortSupportInfoData). The main way in which that could be a win would be if the setup function could choose one of four comparator functions that are pre-specialized for each flags combination; but that seems like it would add a lot of code bulk, and the bigger problem is that we need to be able to change the flags after sort initialization (cf. the reversedirection code in tuplesort.c), so we'd also need some kind of re-select the comparator call API. On the whole this doesn't seem promising, but maybe somebody else has a different idea. I thought about this, too, but it didn't seem promising to me, either. * We're going to want to expose PrepareSortSupportComparisonShim for use outside tuplesort.c too, and possibly refactor tuplesort_begin_heap so that the SortKey setup logic inside it can be extracted for use elsewhere. Shall we just add those to tuplesort's API, or would it be better to create a sortsupport.c with these sorts of functions? Why are we going to want to do that? If it's because there are other places in the code that can make use of a fast comparator that don't go through tuplesort.c, then we should probably break it off into a separate file (sortkey.c?). But if it's because we think that clients of the tuplesort code are going to need it for some reason, then we may as well keep it in tuplesort.c. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Inlining comparators as a performance optimisation
Robert Haas robertmh...@gmail.com writes: On Sun, Dec 4, 2011 at 2:17 PM, Tom Lane t...@sss.pgh.pa.us wrote: * We're going to want to expose PrepareSortSupportComparisonShim for use outside tuplesort.c too, and possibly refactor tuplesort_begin_heap so that the SortKey setup logic inside it can be extracted for use elsewhere. Shall we just add those to tuplesort's API, or would it be better to create a sortsupport.c with these sorts of functions? Why are we going to want to do that? If it's because there are other places in the code that can make use of a fast comparator that don't go through tuplesort.c, then we should probably break it off into a separate file (sortkey.c?). But if it's because we think that clients of the tuplesort code are going to need it for some reason, then we may as well keep it in tuplesort.c. My expectation is that nbtree, as well as mergejoin and mergeappend, would get converted over to use the fast comparator API. I looked at that a little bit but didn't push it far enough to be very sure about whether they'd be able to share the initialization code from tuplesort_begin_heap. But they're definitely going to need the shim function for backwards compatibility, and PrepareSortSupportComparisonShim was my first cut at a wrapper that would be generally useful. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Inlining comparators as a performance optimisation
On Tue, Dec 6, 2011 at 12:06 PM, Tom Lane t...@sss.pgh.pa.us wrote: Robert Haas robertmh...@gmail.com writes: On Sun, Dec 4, 2011 at 2:17 PM, Tom Lane t...@sss.pgh.pa.us wrote: * We're going to want to expose PrepareSortSupportComparisonShim for use outside tuplesort.c too, and possibly refactor tuplesort_begin_heap so that the SortKey setup logic inside it can be extracted for use elsewhere. Shall we just add those to tuplesort's API, or would it be better to create a sortsupport.c with these sorts of functions? Why are we going to want to do that? If it's because there are other places in the code that can make use of a fast comparator that don't go through tuplesort.c, then we should probably break it off into a separate file (sortkey.c?). But if it's because we think that clients of the tuplesort code are going to need it for some reason, then we may as well keep it in tuplesort.c. My expectation is that nbtree, as well as mergejoin and mergeappend, would get converted over to use the fast comparator API. I looked at that a little bit but didn't push it far enough to be very sure about whether they'd be able to share the initialization code from tuplesort_begin_heap. But they're definitely going to need the shim function for backwards compatibility, and PrepareSortSupportComparisonShim was my first cut at a wrapper that would be generally useful. OK. Well, then pushing it out to a separate file probably makes sense. Do you want to do that or shall I have a crack at it? If the latter, what do you think about using the name SortKey for everything rather than SortSupport? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Inlining comparators as a performance optimisation
Robert Haas robertmh...@gmail.com writes: OK. Well, then pushing it out to a separate file probably makes sense. Do you want to do that or shall I have a crack at it? If the latter, what do you think about using the name SortKey for everything rather than SortSupport? I'll take another crack at it. I'm not entirely sold yet on merging the two structs; I think first we'd better look and see what the needs are in the other potential callers I mentioned. If we'd end up cluttering the struct with half a dozen weird fields, it'd be better to stick to a minimal interface struct with various wrapper structs, IMO. OTOH it did seem that the names were getting a bit long. If we do keep the two-struct-levels approach, what do you think of s/SortSupportInfo/SortSupport/g ? regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Inlining comparators as a performance optimisation
On Tue, Dec 6, 2011 at 1:07 PM, Tom Lane t...@sss.pgh.pa.us wrote: Robert Haas robertmh...@gmail.com writes: OK. Well, then pushing it out to a separate file probably makes sense. Do you want to do that or shall I have a crack at it? If the latter, what do you think about using the name SortKey for everything rather than SortSupport? I'll take another crack at it. I'm not entirely sold yet on merging the two structs; I think first we'd better look and see what the needs are in the other potential callers I mentioned. If we'd end up cluttering the struct with half a dozen weird fields, it'd be better to stick to a minimal interface struct with various wrapper structs, IMO. OK. I'll defer to whatever you come up with after looking at it. OTOH it did seem that the names were getting a bit long. If we do keep the two-struct-levels approach, what do you think of s/SortSupportInfo/SortSupport/g ? +1. I had that thought when you originally suggested that name, but it didn't seem worth arguing about. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Inlining comparators as a performance optimisation
Robert Haas robertmh...@gmail.com writes: On Tue, Dec 6, 2011 at 1:07 PM, Tom Lane t...@sss.pgh.pa.us wrote: I'll take another crack at it. I'm not entirely sold yet on merging the two structs; I think first we'd better look and see what the needs are in the other potential callers I mentioned. If we'd end up cluttering the struct with half a dozen weird fields, it'd be better to stick to a minimal interface struct with various wrapper structs, IMO. OK. I'll defer to whatever you come up with after looking at it. OK, it looks like nodeMergeAppend.c could use something exactly like the draft SortKey struct, while nodeMergejoin.c could embed such a struct in MergeJoinClauseData. The btree stuff needs something more nearly equivalent to a ScanKey, including a datum-to-compare-to and a flags field. I'm inclined to think the latter would be too specialized to put in the generic struct. On the other hand, including the reverse and nulls_first flags in the generic struct is clearly a win since it allows ApplyComparator() to be defined as a generic function. So the only thing that's really debatable is the attno field, and I'm not anal enough to insist on a separate level of struct just for that. I am however inclined to stick with the shortened struct name SortSupport rather than using SortKey. The presence of the function pointer fields (especially the inlined-qsort pointers, assuming we adopt some form of Peter's patch) changes the struct's nature in my view; it's not really describing just a sort key (ie an ORDER BY column specification). regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Inlining comparators as a performance optimisation
On Tue, Dec 6, 2011 at 4:23 PM, Tom Lane t...@sss.pgh.pa.us wrote: Robert Haas robertmh...@gmail.com writes: On Tue, Dec 6, 2011 at 1:07 PM, Tom Lane t...@sss.pgh.pa.us wrote: I'll take another crack at it. I'm not entirely sold yet on merging the two structs; I think first we'd better look and see what the needs are in the other potential callers I mentioned. If we'd end up cluttering the struct with half a dozen weird fields, it'd be better to stick to a minimal interface struct with various wrapper structs, IMO. OK. I'll defer to whatever you come up with after looking at it. OK, it looks like nodeMergeAppend.c could use something exactly like the draft SortKey struct, while nodeMergejoin.c could embed such a struct in MergeJoinClauseData. The btree stuff needs something more nearly equivalent to a ScanKey, including a datum-to-compare-to and a flags field. I'm inclined to think the latter would be too specialized to put in the generic struct. On the other hand, including the reverse and nulls_first flags in the generic struct is clearly a win since it allows ApplyComparator() to be defined as a generic function. So the only thing that's really debatable is the attno field, and I'm not anal enough to insist on a separate level of struct just for that. I am however inclined to stick with the shortened struct name SortSupport rather than using SortKey. The presence of the function pointer fields (especially the inlined-qsort pointers, assuming we adopt some form of Peter's patch) changes the struct's nature in my view; it's not really describing just a sort key (ie an ORDER BY column specification). Works for me. I think we should go ahead and get this part committed first, and then we can look at the inlining stuff as a further optimization for certain cases... -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Inlining comparators as a performance optimisation
On 7 December 2011 00:18, Robert Haas robertmh...@gmail.com wrote: Works for me. I think we should go ahead and get this part committed first, and then we can look at the inlining stuff as a further optimization for certain cases... Do you mean just inlining, or inlining and the numerous other optimisations that my patch had? -- Peter Geoghegan http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training and Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Inlining comparators as a performance optimisation
On Tue, Dec 6, 2011 at 8:13 PM, Peter Geoghegan pe...@2ndquadrant.com wrote: On 7 December 2011 00:18, Robert Haas robertmh...@gmail.com wrote: Works for me. I think we should go ahead and get this part committed first, and then we can look at the inlining stuff as a further optimization for certain cases... Do you mean just inlining, or inlining and the numerous other optimisations that my patch had? Whichever you like. But I think part of the point here is to disentangle those optimizations from each other and decide how broadly it makes sense to apply each one. Avoiding the FunctionCallInfo stuff is one, and it seems like we can apply that to a wide variety of data types (maybe all of them) for both in-memory and on-disk sorting, plus btree index ops, merge joins, and merge append. The gains may be modest, but they will benefit many use cases. Your original patch targets a much narrower use case (in-memory sorting of POD types) but the benefits are larger. We don't have to pick between a general but small optimization and a narrower but larger one; we can do both. In this regard, I think Heikki's remarks upthread are worth some thought. If inlining is a win just because it avoids saving and restoring registers or allows better instruction scheduling, then inlining is the (probably?) the only way to get the benefit. But if most of the benefit is in having a separate path for the single-sort-key case, we can do that without duplicating the qsort() code and get the benefit for every data type without much code bloat. I'd like to see us dig into that a little, so that we get the broadest possible benefit out of this work. It doesn't bother me that not every optimization will apply to every case, and I don't object to optimizations that are intrinsically narrow (within some reasonable limits). But I'd rather not take what could be a fairly broad-based optimization and apply it only narrowly, all things being equal. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Inlining comparators as a performance optimisation
On 05.12.2011 02:14, Peter Geoghegan wrote: On 4 December 2011 19:17, Tom Lanet...@sss.pgh.pa.us wrote: I have not done any performance testing on this patch, but it might be interesting to check it with the same test cases Peter's been using. I've attached a revision of exactly the same benchmark run to get the results in results_server.ods . You'll see very similar figures to results_server.ods for HEAD and for my patch, as you'd expect. I think the results speak for themselves. I maintain that we should use specialisations - that's where most of the benefit is to be found. I'm still not convinced the big gain is in inlining the comparison functions. Your patch contains a few orthogonal tricks, too: * A fastpath for the case of just one scankey. No reason we can't do that without inlining. * inlining the swap-functions within qsort. Thaẗ́'s different from inlining the comparison functions, and could be done regardless of the data type. The qsort element size in tuplesort.c is always sizeof(SortTuple) And there's some additional specializations we can do, not implemented in your patch, that don't depend on inlining the comparison functions: * A fastpath for the case of no NULLs * separate comparetup functions for ascending and descending sorts (this allows tail call elimination of the call to type-specific comparison function in the ascending version) * call CHECK_FOR_INTERRUPTS() less frequently. To see how much difference those things make, I hacked together the attached patch. I didn't base this on Robert's/Tom's patch, but instead just added a quick dirty FunctionCallInvoke-overhead-skipping version of btint4cmp. I believe that aspect of this patch it's similar in performance characteristics, though. In my quick tests, it gets quite close in performance to your inlining patch, when sorting int4s and the input contains no NULLs. But please give it a try in your test environment, to get numbers comparable with your other tests. Perhaps you can get even more gain by adding the no-NULLs, and asc/desc special cases to your inlining patch, too, but let's squeeze all the fat out of the non-inlined version first. One nice aspect of many of these non-inlining optimizations is that they help with external sorts, too. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com diff --git a/src/backend/utils/sort/tuplesort.c b/src/backend/utils/sort/tuplesort.c index 3505236..85bec20 100644 --- a/src/backend/utils/sort/tuplesort.c +++ b/src/backend/utils/sort/tuplesort.c @@ -114,6 +114,8 @@ #include utils/rel.h #include utils/tuplesort.h +#include utils/builtins.h + /* sort-type codes for sort__start probes */ #define HEAP_SORT 0 @@ -273,6 +275,8 @@ struct Tuplesortstate int memtupcount; /* number of tuples currently present */ int memtupsize; /* allocated length of memtuples array */ + bool hasnulls_initial; /* any NULLs in the first key col? */ + /* * While building initial runs, this is the current output run number * (starting at 0). Afterwards, it is the number of initial runs we made. @@ -341,6 +345,8 @@ struct Tuplesortstate TupleDesc tupDesc; ScanKey scanKeys; /* array of length nKeys */ + int (*comparator) (Datum, Datum); + /* * These variables are specific to the CLUSTER case; they are set by * tuplesort_begin_cluster. Note CLUSTER also uses tupDesc and @@ -459,6 +465,11 @@ static unsigned int getlen(Tuplesortstate *state, int tapenum, bool eofOK); static void markrunend(Tuplesortstate *state, int tapenum); static int comparetup_heap(const SortTuple *a, const SortTuple *b, Tuplesortstate *state); +static int comparetup_heap_1key_asc_nonulls(const SortTuple *a, const SortTuple *b, +Tuplesortstate *state); +static int comparetup_heap_1key_asc_nonulls_btint4cmp(const SortTuple *a, const SortTuple *b, +Tuplesortstate *state); +static int compare_int4(Datum first, Datum second); static void copytup_heap(Tuplesortstate *state, SortTuple *stup, void *tup); static void writetup_heap(Tuplesortstate *state, int tapenum, SortTuple *stup); @@ -551,6 +562,7 @@ tuplesort_begin_common(int workMem, bool randomAccess) state-availMem = state-allowedMem; state-sortcontext = sortcontext; state-tapeset = NULL; + state-hasnulls_initial = false; state-memtupcount = 0; state-memtupsize = 1024; /* initial guess */ @@ -1247,11 +1259,41 @@ tuplesort_performsort(Tuplesortstate *state) * amount of memory. Just qsort 'em and we're done. */ if (state-memtupcount 1) + { +int (*comparetup) (const SortTuple *a, const SortTuple *b, + Tuplesortstate *state); + +/* If there is only one key col, the input contains no NULLs, + * and it's ascending, we can use a fastpath version of + * comparetup_heap that skips the over those things. + */ +if (state-comparetup == comparetup_heap + state-nKeys == 1 + !(state-scanKeys[0].sk_flags SK_BT_DESC) +
Re: [HACKERS] Inlining comparators as a performance optimisation
On 5 December 2011 13:23, Heikki Linnakangas heikki.linnakan...@enterprisedb.com wrote: I'm still not convinced the big gain is in inlining the comparison functions. Your patch contains a few orthogonal tricks, too: * A fastpath for the case of just one scankey. No reason we can't do that without inlining. True, though Tom did seem to not like that idea very much. * inlining the swap-functions within qsort. Thaẗ́'s different from inlining the comparison functions, and could be done regardless of the data type. The qsort element size in tuplesort.c is always sizeof(SortTuple) Sure. I think that Tom mostly objects to hard-coded intelligence about which datatypes are used, rather than specialisations per se. And there's some additional specializations we can do, not implemented in your patch, that don't depend on inlining the comparison functions: * A fastpath for the case of no NULLs * separate comparetup functions for ascending and descending sorts (this allows tail call elimination of the call to type-specific comparison function in the ascending version) * call CHECK_FOR_INTERRUPTS() less frequently. All of those had occurred to me, except the last - if you look at the definition of that macro, it looks like more trouble than it's worth to do less frequently. I didn't revise my patch with them though, because the difference that they made, while clearly measurable, seemed fairly small, or at least relatively so. I wasn't about to add additional kludge for marginal benefits given the existing controversy, though I have not dismissed the idea entirely - it could be important on some other machine. Perhaps you can get even more gain by adding the no-NULLs, and asc/desc special cases to your inlining patch, too, but let's squeeze all the fat out of the non-inlined version first. As I said, those things are simple enough to test, and were not found to make a significant difference, at least to my exact test case + hardware. One nice aspect of many of these non-inlining optimizations is that they help with external sorts, too. I'd expect the benefits to be quite a lot smaller there, but fair point. Results from running the same test on your patch are attached. I think that while you're right to suggest that the inlining of the comparator proper, rather than any other function or other optimisation isn't playing a dominant role in these optimisations, I think that you're underestimating the role of locality of reference. To illustrate this, I've also included results for when I simply move the comparator to another translation unit, logtape.c . There's clearly a regression from doing so (I ran the numbers twice, in a clinical environment). The point is, there is a basically unknown overhead paid for not inlining - maybe inlining is not worth it, but that's a hard call to make. I didn't try to make the numbers look worse by putting the comparator proper in some other translation unit, but it may be that I'd have shown considerably worse numbers by doing so. Why the strong aversion to what I've done? I accept that it's ugly, but is it really worth worrying about that sort of regression in maintainability for something that was basically untouched since 2006, and will probably remain untouched after this work concludes, whatever happens? -- Peter Geoghegan http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training and Services results_server_w_heikki.ods Description: application/vnd.oasis.opendocument.spreadsheet -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Inlining comparators as a performance optimisation
Peter Geoghegan pe...@2ndquadrant.com writes: Why the strong aversion to what I've done? I accept that it's ugly, but is it really worth worrying about that sort of regression in maintainability for something that was basically untouched since 2006, and will probably remain untouched after this work concludes, whatever happens? Maintainability is only part of the issue --- though it's definitely one part, since this code has hardly gone untouched since 2006, cf http://git.postgresql.org/gitweb/?p=postgresql.git;a=history;f=src/backend/utils/sort/tuplesort.c;hb=HEAD What is bothering me is that this approach is going to cause substantial bloat of the executable code, and such bloat has got distributed costs, which we don't have any really good way to measure but for sure micro-benchmarks addressing only sort speed aren't going to reveal them. Cache lines filled with sort code take cycles to flush and replace with something else. I think it's possibly reasonable to have inlined copies of qsort for a small number of datatypes, but it seems much less reasonable to have multiple copies per datatype in order to obtain progressively tinier micro-benchmark speedups. We need to figure out what the tradeoff against executable size really is, but so far it seems like you've been ignoring the fact that there is any such tradeoff at all. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Inlining comparators as a performance optimisation
Robert Haas robertmh...@gmail.com writes: OK, so I tried to code this up. Adding the new amproc wasn't too difficult (see attached). It wasn't obvious to me how to tie it into the tuplesort infrastructure, though, so instead of wasting time guessing what a sensible approach might be I'm going to use one of my lifelines and poll the audience (or is that ask an expert?). Here's another cut at this. I only went as far as converting the heap-sort code path in tuplesort.c, so there's lots more to do, but this does sort integers successfully. Before expanding the patch to do more, I think we need to have consensus on some API details, in particular: * I invented a SortKey struct that replaces ScanKey for tuplesort's purposes. Right now that's local in tuplesort.c, but we're more than likely going to need it elsewhere as well. Should we just define it in sortsupport.h? Or perhaps we should just add the few additional fields to SortSupportInfoData, and not bother with two struct types? Before you answer, consider the next point. * I wonder whether it would be worthwhile to elide inlineApplyComparator altogether, pushing what it does down to the level of the datatype-specific functions. That would require changing the comparator API to include isnull flags, and exposing the reverse/nulls_first sort flags to the comparators (presumably by including them in SortSupportInfoData). The main way in which that could be a win would be if the setup function could choose one of four comparator functions that are pre-specialized for each flags combination; but that seems like it would add a lot of code bulk, and the bigger problem is that we need to be able to change the flags after sort initialization (cf. the reversedirection code in tuplesort.c), so we'd also need some kind of re-select the comparator call API. On the whole this doesn't seem promising, but maybe somebody else has a different idea. * We're going to want to expose PrepareSortSupportComparisonShim for use outside tuplesort.c too, and possibly refactor tuplesort_begin_heap so that the SortKey setup logic inside it can be extracted for use elsewhere. Shall we just add those to tuplesort's API, or would it be better to create a sortsupport.c with these sorts of functions? I have not done any performance testing on this patch, but it might be interesting to check it with the same test cases Peter's been using. regards, tom lane diff --git a/doc/src/sgml/xindex.sgml b/doc/src/sgml/xindex.sgml index 28324361a950f31737c0cbb6909726d02ce1af5f..51f847c9f907756a823e1f76ef0275bceed7 100644 *** a/doc/src/sgml/xindex.sgml --- b/doc/src/sgml/xindex.sgml *** DEFAULT FOR TYPE int4 USING btree FAMILY *** 758,764 OPERATOR 3 = , OPERATOR 4 = , OPERATOR 5 , ! FUNCTION 1 btint4cmp(int4, int4) ; CREATE OPERATOR CLASS int2_ops DEFAULT FOR TYPE int2 USING btree FAMILY integer_ops AS --- 758,765 OPERATOR 3 = , OPERATOR 4 = , OPERATOR 5 , ! FUNCTION 1 btint4cmp(int4, int4) , ! FUNCTION 2 btint4sortsupport(internal) ; CREATE OPERATOR CLASS int2_ops DEFAULT FOR TYPE int2 USING btree FAMILY integer_ops AS diff --git a/src/backend/access/nbtree/nbtcompare.c b/src/backend/access/nbtree/nbtcompare.c index 23f2b61fe902cff7eebcf8526cf8116d90be75e8..62d03f96de07cbbe14dedefb1596220af398caea 100644 *** a/src/backend/access/nbtree/nbtcompare.c --- b/src/backend/access/nbtree/nbtcompare.c *** *** 49,54 --- 49,55 #include postgres.h #include utils/builtins.h + #include utils/sortsupport.h Datum *** btint4cmp(PG_FUNCTION_ARGS) *** 83,88 --- 84,112 PG_RETURN_INT32(-1); } + static int + btint4fastcmp(Datum x, Datum y, SortSupportInfo sinfo) + { + int32 a = DatumGetInt32(x); + int32 b = DatumGetInt32(y); + + if (a b) + return 1; + else if (a == b) + return 0; + else + return -1; + } + + Datum + btint4sortsupport(PG_FUNCTION_ARGS) + { + SortSupportInfo sinfo = (SortSupportInfo) PG_GETARG_POINTER(0); + + sinfo-comparator = btint4fastcmp; + PG_RETURN_VOID(); + } + Datum btint8cmp(PG_FUNCTION_ARGS) { diff --git a/src/backend/utils/cache/lsyscache.c b/src/backend/utils/cache/lsyscache.c index cb341b8db679382c3ab8de31ff7a0bb57bc8dc50..cb6a38bcfd29fb17fc62040eef79d34aee9fe802 100644 *** a/src/backend/utils/cache/lsyscache.c --- b/src/backend/utils/cache/lsyscache.c *** get_compare_function_for_ordering_op(Oid *** 287,292 --- 287,348 } /* + * get_sort_function_for_ordering_op + * Get the OID of the datatype-specific btree sort support function, + * or if there is none, the btree comparison function, + * associated with an ordering operator (a or operator). + * + * *sortfunc receives the support or comparison function OID. + * *issupport is set TRUE if it's a support func, FALSE if a comparison func. + * *reverse is set FALSE if the operator is , TRUE if it's
Re: [HACKERS] Inlining comparators as a performance optimisation
On 4 December 2011 19:17, Tom Lane t...@sss.pgh.pa.us wrote: I have not done any performance testing on this patch, but it might be interesting to check it with the same test cases Peter's been using. I've attached a revision of exactly the same benchmark run to get the results in results_server.ods . You'll see very similar figures to results_server.ods for HEAD and for my patch, as you'd expect. I think the results speak for themselves. I maintain that we should use specialisations - that's where most of the benefit is to be found. -- Peter Geoghegan http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training and Services results_server_new.ods Description: application/vnd.oasis.opendocument.spreadsheet -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Inlining comparators as a performance optimisation
Robert Haas robertmh...@gmail.com writes: OK, so I tried to code this up. Adding the new amproc wasn't too difficult (see attached). It wasn't obvious to me how to tie it into the tuplesort infrastructure, though, so instead of wasting time guessing what a sensible approach might be I'm going to use one of my lifelines and poll the audience (or is that ask an expert?). OK, I'll take a whack at improving this over the weekend. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Inlining comparators as a performance optimisation
On 2 December 2011 01:13, Tom Lane t...@sss.pgh.pa.us wrote: Regardless of that, I'm still of the opinion that this patch is fundamentally misdesigned. The way it's set up, it is only possible to have a fast path for types that are hard-wired into tuplesort.c, and that flies in the face of twenty years' worth of effort to make Postgres an extensible system. What the user doesn't know can't hurt them. I'm not of the opinion that an internal optimisations flies in the face of anything - is it really so hard for you to hold your nose for a fairly isolated patch like this? I really don't care about performance measurements: this is not an acceptable design, period. If ever there was a justification for doing something so distasteful, I would think that a 60% decrease in the time taken to perform a raw sort for POD types (plus common covariants) would be it. What I want to see is some mechanism that pushes this out to the individual datatypes, so that both core and plug-in datatypes have access to the benefits. There are a number of ways that could be attacked, but the most obvious one is to invent additional, optional support function(s) for btree opclasses. I also still think that we should pursue the idea of a lower-overhead API for comparison functions. It may be that it's worth the code bulk to have an inlined copy of qsort for a small number of high-usage datatypes, but I think there are going to be a lot for which we aren't going to want to pay that price, and yet we could get some benefit from cutting the call overhead. I'm not opposed to that. It just seems fairly tangential to what I've done here, as well as considerably less important to users, particularly when you look at the numbers I've produced. It would be nice to get a lesser gain for text and things like that too, but other than that I don't see a huge demand. A lower-overhead API would also save cycles in usage of the comparison functions from btree itself (remember that?). Yes, I remember that. Why not do something similar there, and get similarly large benefits? I think in particular that the proposed approach to multiple sort keys is bogus and should be replaced by just using lower-overhead comparators. The gain per added code byte in doing it as proposed has got to be too low to be reasonable. Reasonable by what standard? We may well be better off with lower-overhead comparators for the multiple scanKey case (though I wouldn't like to bet on it) but we would most certainly not be better off without discriminating against single scanKey and multiple scanKey cases as I have. Look at the spreadsheet results_server.ods . Compare the not inlined and optimized sheets. It's clear from those numbers that a combination of inlining and of simply not having to consider a case that will never come up (multiple scanKeys), the inlining specialisation far outperforms the non-inlining, multiple scanKey specialization for the same data (I simply commented out inlining specializations so that non-inlining specialisations would always be called for int4 and friends, even if there was only a single scanKey). That's where the really dramatic improvements are seen. It's well worth having this additional benefit, because it's such a common case and the benefit is so big, and while I will be quite happy to pursue a plan to bring some of these benefits to all types such as the one you describe, I do not want to jettison the additional benefits described to do so. Isn't that reasonable? We're talking about taking 6778.302ms off a 20468.0ms query, rather than 1860.332ms. Not exactly a subtle difference. Assuming that those figures are representative of the gains to be had by a generic mechanism that does not inline/specialize across number of scanKeys, are you sure that that's worthwhile? -- Peter Geoghegan http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training and Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Inlining comparators as a performance optimisation
On Fri, Dec 2, 2011 at 12:16 AM, Tom Lane t...@sss.pgh.pa.us wrote: It should be the same or better. Right now, we are going through FunctionCall2Coll to reach the FCI-style comparator. The shim function would be more or less equivalent to that, and since it's quite special purpose I would hope we could shave a cycle or two. For instance, we could probably afford to set up a dedicated FunctionCallInfo struct associated with the SortSupportInfo struct, and not have to reinitialize one each time. OK, but I think it's also going to cost you an extra syscache lookup, no? You're going to have to check for this new support function first, and then if you don't find it, you'll have to check for the original one. I don't think there's any higher-level caching around opfamilies to save our bacon here, is there? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Inlining comparators as a performance optimisation
Robert Haas robertmh...@gmail.com writes: OK, but I think it's also going to cost you an extra syscache lookup, no? You're going to have to check for this new support function first, and then if you don't find it, you'll have to check for the original one. I don't think there's any higher-level caching around opfamilies to save our bacon here, is there? [ shrug... ] If you are bothered by that, get off your duff and provide the function for your datatype. But it's certainly going to be in the noise for btree index usage, and I submit that query parsing/setup involves quite a lot of syscache lookups already. I think that as a performance objection, the above is nonsensical. (And I would also comment that your proposal with a handle is going to involve a table search that's at least as expensive as a syscache lookup.) regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Inlining comparators as a performance optimisation
Tom Lane wrote: Robert Haas robertmh...@gmail.com writes: OK, but I think it's also going to cost you an extra syscache lookup, no? You're going to have to check for this new support function first, and then if you don't find it, you'll have to check for the original one. I don't think there's any higher-level caching around opfamilies to save our bacon here, is there? [ shrug... ] If you are bothered by that, get off your duff and provide the function for your datatype. But it's certainly going to be in the noise for btree index usage, and I submit that query parsing/setup involves quite a lot of syscache lookups already. I think that as a performance objection, the above is nonsensical. (And I would also comment that your proposal with a handle is going to involve a table search that's at least as expensive as a syscache lookup.) Agreed. Doing something once and doing something in the sort loop are two different overheads. I am excited by this major speedup Peter Geoghegan has found. Postgres doesn't have parallel query, which is often used for sorting, so we are already behind some of the databases are compared against. Getting this speedup is definitely going to help us. And I do like the generic nature of where we are heading! -- Bruce Momjian br...@momjian.ushttp://momjian.us EnterpriseDB http://enterprisedb.com + It's impossible for everything to be true. + -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Inlining comparators as a performance optimisation
[ shrug... ] If you are bothered by that, get off your duff and provide the function for your datatype. But it's certainly going to be in the noise for btree index usage, and I submit that query parsing/setup involves quite a lot of syscache lookups already. I think that as a performance objection, the above is nonsensical. (And I would also comment that your proposal with a handle is going to involve a table search that's at least as expensive as a syscache lookup.) Agreed. Doing something once and doing something in the sort loop are two different overheads. I am excited by this major speedup Peter Geoghegan has found. Postgres doesn't have parallel query, which is often used for sorting, so we are already behind some of the databases are compared against. Getting this speedup is definitely going to help us. And I do like the generic nature of where we are heading! Oracle has not or had not parallel sort too, and I have a reports so Oracle does sort faster then PostgreSQL (but without any numbers). So any solution is welcome Regards Pavel -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Inlining comparators as a performance optimisation
On Fri, Dec 2, 2011 at 2:35 PM, Bruce Momjian br...@momjian.us wrote: Agreed. Doing something once and doing something in the sort loop are two different overheads. OK, so I tried to code this up. Adding the new amproc wasn't too difficult (see attached). It wasn't obvious to me how to tie it into the tuplesort infrastructure, though, so instead of wasting time guessing what a sensible approach might be I'm going to use one of my lifelines and poll the audience (or is that ask an expert?). Currently the Tuplesortstate[1] has a pointer to an array of ScanKeyData objects, one per column being sorted. But now instead of FmgrInfo sk_func, the tuplesort code is going to want each scankey to contain SortSupportInfo(Data?) sk_sortsupport[2], because that's where we get the comparison function from. Should I just go ahead and add one more member to that struct, or is there some more appropriate way to handle this? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company [1] Consistent capitalization is for wimps. [2] Hey, we could abbreviate that SSI! Oh, wait... sort-support.patch Description: Binary data -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Inlining comparators as a performance optimisation
Robert Haas wrote: On Fri, Dec 2, 2011 at 2:35 PM, Bruce Momjian br...@momjian.us wrote: Agreed. ?Doing something once and doing something in the sort loop are two different overheads. OK, so I tried to code this up. Adding the new amproc wasn't too difficult (see attached). It wasn't obvious to me how to tie it into the tuplesort infrastructure, though, so instead of wasting time guessing what a sensible approach might be I'm going to use one of my lifelines and poll the audience (or is that ask an expert?). Currently the Tuplesortstate[1] has a pointer to an array of ScanKeyData objects, one per column being sorted. But now instead of FmgrInfo sk_func, the tuplesort code is going to want each scankey to contain SortSupportInfo(Data?) sk_sortsupport[2], because that's where we get the comparison function from. Should I just go ahead and add one more member to that struct, or is there some more appropriate way to handle this? Is this code immediately usable anywhere else in our codebasde, and if so, is it generic enough? -- Bruce Momjian br...@momjian.ushttp://momjian.us EnterpriseDB http://enterprisedb.com + It's impossible for everything to be true. + -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Inlining comparators as a performance optimisation
Attached is revision of my patch with some clean-ups. In particular, I'm now using switch statements for greater readability, plus supporting fast path sorting of the time datatype. I've also updated the documentation on Date/Time Types to note the additional disadvantage of using the deprecated store timestamp + friends as double precision floating-point numbers compile time option. There is one aspect to this optimisation that I haven't touched on, which is the effect on memory consumption. I think that much of the value that this patch will deliver will come from being able to release sort memory earlier. Consider that the substantial improvements in raw sorting speed (far more substantial than the improvements in query runtime) will sometimes result in a concomitant reduction in the time that the executor holds onto memory allocated for sorting. Maybe the effect will only be really noticeable for plans with a sort node as their root node, but that isn't exactly a rare occurrence, particularly among large, expensive sorts. -- Peter Geoghegan http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training and Services From cffad91578edd698e89ef2bf7384d82aee26b57e Mon Sep 17 00:00:00 2001 From: peter2ndQuadrant pe...@2ndquadrant.com Date: Sun, 20 Nov 2011 20:36:25 + Subject: [PATCH] Initial commit of optimization Stop directly using oid Added int8 quicksort fast path specialisation, which can also be used in place of F_TIMESTAMP_CMP for HAVE_INT64_TIMESTAMP builds. Rebased, revised patch for -hackers, with timestamp and int8 fast path sorting using the same int8 specialization. Remove unneeded line Rebase for -hackers Descriminate against cases where nKeys 1 when selecting sort specialisation. We now support fast path sorting with multiple scanKeys. We now inline for one scanKey, but do not inline our comparetup_heap replacement when multiple scanKeys are used (although this is at the compiler's discretion). However, when multiple scanKeys are used, we still use a specialisation that will elide the SQL-function-call machinery for the first scanKey, which is almost as good for most cases. Further formatting clean-ups Move specialization selection logic to tuplesort_begin_heap(), per suggestion of Robert Haas Add simple switch for testing optimization. Improvements to template qsort_arg comments. Support fast path sorting of dates. Added float fast path sorting Patch mimimization. Additional clean-up. Indentation issue. Specialization clean-ups Use switch statement for sort specialization selection Further comment clean-up. Additional clean-up Make comparators consistent Added date fast path support, documentation updates Fix template_qsort_arg.h assertion Assert nScanKeys == 1 for inlining case --- doc/src/sgml/datatype.sgml |6 + src/backend/utils/sort/tuplesort.c | 223 - src/include/utils/template_qsort_arg.h | 288 src/port/qsort.c |3 +- 4 files changed, 512 insertions(+), 8 deletions(-) create mode 100644 src/include/utils/template_qsort_arg.h diff --git a/doc/src/sgml/datatype.sgml b/doc/src/sgml/datatype.sgml index fe59a1c..f5dfbbb 100644 --- a/doc/src/sgml/datatype.sgml +++ b/doc/src/sgml/datatype.sgml @@ -1619,6 +1619,12 @@ SELECT E'\\xDEADBEEF'; floating-point case, large typeinterval/type values degrade in precision as the size of the interval increases. /para + para +Note that if the server is built to represent these types as +floating point numbers, that will prevent it from using fast path +sort specializations for the types, and sorting columns of those +types will be noticeably slower. + /para /note para diff --git a/src/backend/utils/sort/tuplesort.c b/src/backend/utils/sort/tuplesort.c index 3505236..0a14a90 100644 --- a/src/backend/utils/sort/tuplesort.c +++ b/src/backend/utils/sort/tuplesort.c @@ -107,11 +107,13 @@ #include miscadmin.h #include pg_trace.h #include utils/datum.h +#include utils/fmgroids.h #include utils/logtape.h #include utils/lsyscache.h #include utils/memutils.h #include utils/pg_rusage.h #include utils/rel.h +#include utils/template_qsort_arg.h #include utils/tuplesort.h @@ -226,6 +228,13 @@ struct Tuplesortstate Tuplesortstate *state); /* + * This specialization function pointer is sometimes used as an alternative + * to the standard qsort_arg, when it has been determined that we can + * benefit from various per-type performance optimizations. + */ + void (*qsort_arg_spec)(void *a, size_t n, size_t es, void *arg); + + /* * Function to copy a supplied input tuple into palloc'd space and set up * its SortTuple representation (ie, set tuple/datum1/isnull1). Also, * state-availMem must be decreased by the amount of space used for the @@ -457,6 +466,11 @@ static void tuplesort_heap_insert(Tuplesortstate *state, SortTuple
Re: [HACKERS] Inlining comparators as a performance optimisation
On Thu, Dec 1, 2011 at 11:44 AM, Peter Geoghegan pe...@2ndquadrant.com wrote: Attached is revision of my patch with some clean-ups. In particular, I'm now using switch statements for greater readability, plus supporting fast path sorting of the time datatype. I've also updated the documentation on Date/Time Types to note the additional disadvantage of using the deprecated store timestamp + friends as double precision floating-point numbers compile time option. One thing I'm starting to get a bit concerned about is the effect of this patch on the size of the resulting binary. The performance results you've posted are getting steadily more impressive as you get into this, which is cool, but it seems like the number of copies of the qsort logic that you're proposing to include in the resulting binary is also steadily climbing. On my system, a stripped postgres binary built with my usual compile options (except --enable-cassert, which I took out) is 49336 bytes bigger with this patch applied, an increase of close to 1%. We might need to be a little bit choosy about this, because I don't think that we want to end up with a situation where some noticeable percentage of the final binary consists of differently-inlined versions of the quicksort algorithm - especially because there may be other places where we want to do similar kinds of inlining. Thoughts? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Inlining comparators as a performance optimisation
Robert Haas robertmh...@gmail.com writes: On Thu, Dec 1, 2011 at 11:44 AM, Peter Geoghegan pe...@2ndquadrant.com wrote: Attached is revision of my patch with some clean-ups. One thing I'm starting to get a bit concerned about is the effect of this patch on the size of the resulting binary. Regardless of that, I'm still of the opinion that this patch is fundamentally misdesigned. The way it's set up, it is only possible to have a fast path for types that are hard-wired into tuplesort.c, and that flies in the face of twenty years' worth of effort to make Postgres an extensible system. I really don't care about performance measurements: this is not an acceptable design, period. What I want to see is some mechanism that pushes this out to the individual datatypes, so that both core and plug-in datatypes have access to the benefits. There are a number of ways that could be attacked, but the most obvious one is to invent additional, optional support function(s) for btree opclasses. I also still think that we should pursue the idea of a lower-overhead API for comparison functions. It may be that it's worth the code bulk to have an inlined copy of qsort for a small number of high-usage datatypes, but I think there are going to be a lot for which we aren't going to want to pay that price, and yet we could get some benefit from cutting the call overhead. A lower-overhead API would also save cycles in usage of the comparison functions from btree itself (remember that?). I think in particular that the proposed approach to multiple sort keys is bogus and should be replaced by just using lower-overhead comparators. The gain per added code byte in doing it as proposed has got to be too low to be reasonable. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Inlining comparators as a performance optimisation
On 1 December 2011 17:15, Robert Haas robertmh...@gmail.com wrote: One thing I'm starting to get a bit concerned about is the effect of this patch on the size of the resulting binary. The performance results you've posted are getting steadily more impressive as you get into this, which is cool, but it seems like the number of copies of the qsort logic that you're proposing to include in the resulting binary is also steadily climbing. On my system, a stripped postgres binary built with my usual compile options (except --enable-cassert, which I took out) is 49336 bytes bigger with this patch applied, an increase of close to 1%. Do your usual compile options include debug symbols? I've been using standard compile options for development of this patch, for obvious reasons. I get 36690 bytes (just under 36 KiB, or a 0.644% increase). Binary bloat is a legitimate concern. I thought that I was conservative in my choice of specialisations though. We might need to be a little bit choosy about this, because I don't think that we want to end up with a situation where some noticeable percentage of the final binary consists of differently-inlined versions of the quicksort algorithm - especially because there may be other places where we want to do similar kinds of inlining. Thoughts? A simple caveat in a comment along the lines of this mechanism instantiates 2 copies of qsort_arg per type, please use judiciously sounds like the right balance. It could also be possible to be penny wise and pound foolish here though. -- Peter Geoghegan http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training and Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Inlining comparators as a performance optimisation
On Thu, Dec 1, 2011 at 8:29 PM, Peter Geoghegan pe...@2ndquadrant.com wrote: Do your usual compile options include debug symbols? I've been using standard compile options for development of this patch, for obvious reasons. I get 36690 bytes (just under 36 KiB, or a 0.644% increase). They do, but I ran strip on the resulting binary before taking those measurements, so I thought that would undo it... -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Inlining comparators as a performance optimisation
On Thu, Dec 1, 2011 at 8:13 PM, Tom Lane t...@sss.pgh.pa.us wrote: What I want to see is some mechanism that pushes this out to the individual datatypes, so that both core and plug-in datatypes have access to the benefits. There are a number of ways that could be attacked, but the most obvious one is to invent additional, optional support function(s) for btree opclasses. I feel like what's missing here is a way for the catalogs to refer to a function that doesn't take FunctionCallInfo as an argument. But it strikes me that it wouldn't be very hard to design such a mechanism. It seems like all we really need to do is decide on a nomenclature. The simplest possible approach would probably be to just refer to functions by name. In other words, we add a text or name column to pg_amproc, and then inside the backend there's a function whose job it is to map the string to a function address. However, that would have the annoying effect of causing catalog bloat, so I'm inclined to instead propose an 8-byte integer. (A 4-byte integer is enough, but would increase the risk of collisions between values that might be chosen by third party extensions.) So, the API to this would look something like this: void register_arbitrary_function_pointer(int64 handle, void *funcptr); void *get_arbitrary_function_pointer(int64 handle); So the idea is that if you need to refer to an arbitrary function in a system catalog (such as pg_amproc), you generate a random non-zero 64-bit integer, stuff it into the catalog, and then register the same 64-bit integer with the appropriate function pointer. To avoid increasing backend startup time, we could have a Perl script generate a prefab hash table for the built-in functions; loadable modules would add their entries at PG_init() time using register_arbitrary_function_pointer. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Inlining comparators as a performance optimisation
Robert Haas robertmh...@gmail.com writes: On Thu, Dec 1, 2011 at 8:13 PM, Tom Lane t...@sss.pgh.pa.us wrote: What I want to see is some mechanism that pushes this out to the individual datatypes, so that both core and plug-in datatypes have access to the benefits. There are a number of ways that could be attacked, but the most obvious one is to invent additional, optional support function(s) for btree opclasses. I feel like what's missing here is a way for the catalogs to refer to a function that doesn't take FunctionCallInfo as an argument. But it strikes me that it wouldn't be very hard to design such a mechanism. IMO, entries in pg_proc ought to refer to functions that are callable by the standard interface: breaking that basic contract is not going to lead anywhere pleasant. Nor do I really want yet more columns in pg_proc. Nor does the register a pointer scheme you suggest make any sense to me: you still need to connect these things to catalog entries, pg_opclass entries in particular, and the intermediate handle adds nothing to the situation except for creating a risk of collisions. The scheme that was rolling around in my mind was about like this: * Define btree opclasses to have an optional support function, amprocnum 2, that has a SQL signature of func(internal) returns void. * The actual argument represented by the internal parameter would be a pointer to a struct which is to be filled in by the support function. We'd call the support function once, during tuplesort setup or btree index relcache entry setup, and save the struct somewhere. * The struct contents would be pointers to functions that have a non-FunctionCallInfo interface. We know we need at least two: a full qsort replacement, and a non-FCI comparator. We might want more in future, if someone convinces us that additional specializations of sorting are worth the trouble. So I imagine a struct like this: typedef struct SortSupportFunctions { void(*inline_qsort) (Datum *elements, int nelements); int (*comparator) (Datum a, Datum b); } SortSupportFunctions; with the agreement that the caller must zero out the struct before call, and then the btree support function sets the function pointers for any specializations it's going to offer. If we later add a third or fourth function pointer, datatypes that know about that could fill in those pointers, but datatypes that haven't been updated aren't broken. One thing I'm not too certain about is whether to define the APIs just as above, or to support a passthrough argument of some sort (and if so, what does it reference)? Possibly any datatype that we'd actually care about this for is going to be simple enough to not need any state data. Or possibly not. And what about collations? regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Inlining comparators as a performance optimisation
On Thu, Dec 1, 2011 at 9:46 PM, Tom Lane t...@sss.pgh.pa.us wrote: IMO, entries in pg_proc ought to refer to functions that are callable by the standard interface: breaking that basic contract is not going to lead anywhere pleasant. Nor do I really want yet more columns in pg_proc. I wasn't proposing to create pg_proc entries for this. Nor does the register a pointer scheme you suggest make any sense to me: you still need to connect these things to catalog entries, pg_opclass entries in particular, and the intermediate handle adds nothing to the situation except for creating a risk of collisions. I think you might be misinterpreting what I had in mind. Right now, pg_amproc entries have an amproc column that points to a pg_proc entry that in turn points to a function that takes FunctionCallInfoData as an argument. What I'm proposing to do is add an additional column to that catalog that points more or less directly to a non-SQL-callable function, but it can't actually just be the address of the function because that's not stable. So what I'm proposing is that we interpose the thinnest possible shim layer between the catalog and a function pointer, and an int64 - function pointer mapping seemed to me like something that would fit the bill. The scheme that was rolling around in my mind was about like this: * Define btree opclasses to have an optional support function, amprocnum 2, that has a SQL signature of func(internal) returns void. * The actual argument represented by the internal parameter would be a pointer to a struct which is to be filled in by the support function. We'd call the support function once, during tuplesort setup or btree index relcache entry setup, and save the struct somewhere. * The struct contents would be pointers to functions that have a non-FunctionCallInfo interface. We know we need at least two: a full qsort replacement, and a non-FCI comparator. We might want more in future, if someone convinces us that additional specializations of sorting are worth the trouble. So I imagine a struct like this: typedef struct SortSupportFunctions { void (*inline_qsort) (Datum *elements, int nelements); int (*comparator) (Datum a, Datum b); } SortSupportFunctions; with the agreement that the caller must zero out the struct before call, and then the btree support function sets the function pointers for any specializations it's going to offer. If we later add a third or fourth function pointer, datatypes that know about that could fill in those pointers, but datatypes that haven't been updated aren't broken. One thing I'm not too certain about is whether to define the APIs just as above, or to support a passthrough argument of some sort (and if so, what does it reference)? Possibly any datatype that we'd actually care about this for is going to be simple enough to not need any state data. Or possibly not. And what about collations? Maybe there should be a comparator_setup function that gets the collation OID and returns void *, and then that void * value gets passed as a third argument to each call to the comparator function. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Inlining comparators as a performance optimisation
Robert Haas robertmh...@gmail.com writes: On Thu, Dec 1, 2011 at 9:46 PM, Tom Lane t...@sss.pgh.pa.us wrote: Nor does the register a pointer scheme you suggest make any sense to me: you still need to connect these things to catalog entries, pg_opclass entries in particular, and the intermediate handle adds nothing to the situation except for creating a risk of collisions. I think you might be misinterpreting what I had in mind. Right now, pg_amproc entries have an amproc column that points to a pg_proc entry that in turn points to a function that takes FunctionCallInfoData as an argument. What I'm proposing to do is add an additional column to that catalog that points more or less directly to a non-SQL-callable function, but it can't actually just be the address of the function because that's not stable. So what I'm proposing is that we interpose the thinnest possible shim layer between the catalog and a function pointer, and an int64 - function pointer mapping seemed to me like something that would fit the bill. But you still need a lot of mechanism to do that mapping, including an initialization function that has noplace to be executed (unless every .so that defines a btree opclass has to be listed in preload_libraries), so it doesn't seem like the thinnest possible shim to me. One thing I'm not too certain about is whether to define the APIs just as above, or to support a passthrough argument of some sort (and if so, what does it reference)? Possibly any datatype that we'd actually care about this for is going to be simple enough to not need any state data. Or possibly not. And what about collations? Maybe there should be a comparator_setup function that gets the collation OID and returns void *, and then that void * value gets passed as a third argument to each call to the comparator function. Maybe. Or perhaps we could merge that work into the function-pointer-setup function --- that is, give it the collation and some extra workspace to fool with. We would always know the collation at the time we're doing that setup. At this point the struct filled in by the setup function is starting to feel a lot like FmgrInfo, and history teaches us a lot about what needs to be in there. So maybe typedef struct SortSupportInfoData *SortSupportInfo; typedef struct SortSupportInfoData { MemoryContext info_cxt; /* where to allocate subsidiary data */ void*extra; /* any data needed by functions */ Oid collation; /* provided by caller */ void(*inline_qsort) (Datum *elements, int nelements, SortSupportInfo info); int (*comparator) (Datum a, Datum b, SortSupportInfo info); /* possibly other function pointers in future */ } SortSupportInfoData; I am thinking that the btree code, at least, would want to just unconditionally do colsortinfo-comparator(datum1, datum2, colsortinfo) so for an opclass that fails to supply the low-overhead comparator, it would insert into the comparator pointer a shim function that calls the opclass' old-style FCI-using comparator. (Anybody who complains about the added overhead would be told to get busy and supply a low-overhead comparator for their datatype...) But to do that, we have to have enough infrastructure here to cover all cases, so omitting collation or not having a place to stash an FmgrInfo won't do. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Inlining comparators as a performance optimisation
On Thu, Dec 1, 2011 at 10:48 PM, Tom Lane t...@sss.pgh.pa.us wrote: But you still need a lot of mechanism to do that mapping, including an initialization function that has noplace to be executed (unless every .so that defines a btree opclass has to be listed in preload_libraries), so it doesn't seem like the thinnest possible shim to me. PG_init? One thing I'm not too certain about is whether to define the APIs just as above, or to support a passthrough argument of some sort (and if so, what does it reference)? Possibly any datatype that we'd actually care about this for is going to be simple enough to not need any state data. Or possibly not. And what about collations? Maybe there should be a comparator_setup function that gets the collation OID and returns void *, and then that void * value gets passed as a third argument to each call to the comparator function. Maybe. Or perhaps we could merge that work into the function-pointer-setup function --- that is, give it the collation and some extra workspace to fool with. We would always know the collation at the time we're doing that setup. Ah! That seems quite nice. At this point the struct filled in by the setup function is starting to feel a lot like FmgrInfo, and history teaches us a lot about what needs to be in there. So maybe typedef struct SortSupportInfoData *SortSupportInfo; typedef struct SortSupportInfoData { MemoryContext info_cxt; /* where to allocate subsidiary data */ void *extra; /* any data needed by functions */ Oid collation; /* provided by caller */ void (*inline_qsort) (Datum *elements, int nelements, SortSupportInfo info); int (*comparator) (Datum a, Datum b, SortSupportInfo info); /* possibly other function pointers in future */ } SortSupportInfoData; I am thinking that the btree code, at least, would want to just unconditionally do colsortinfo-comparator(datum1, datum2, colsortinfo) so for an opclass that fails to supply the low-overhead comparator, it would insert into the comparator pointer a shim function that calls the opclass' old-style FCI-using comparator. (Anybody who complains about the added overhead would be told to get busy and supply a low-overhead comparator for their datatype...) But to do that, we have to have enough infrastructure here to cover all cases, so omitting collation or not having a place to stash an FmgrInfo won't do. I'm slightly worried about whether that'll be adding too much overhead to the case where there is no non-FCI comparator. But it may be no worse than what we're doing now. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Inlining comparators as a performance optimisation
Robert Haas robertmh...@gmail.com writes: On Thu, Dec 1, 2011 at 10:48 PM, Tom Lane t...@sss.pgh.pa.us wrote: I am thinking that the btree code, at least, would want to just unconditionally do colsortinfo-comparator(datum1, datum2, colsortinfo) so for an opclass that fails to supply the low-overhead comparator, it would insert into the comparator pointer a shim function that calls the opclass' old-style FCI-using comparator. (Anybody who complains about the added overhead would be told to get busy and supply a low-overhead comparator for their datatype...) But to do that, we have to have enough infrastructure here to cover all cases, so omitting collation or not having a place to stash an FmgrInfo won't do. I'm slightly worried about whether that'll be adding too much overhead to the case where there is no non-FCI comparator. But it may be no worse than what we're doing now. It should be the same or better. Right now, we are going through FunctionCall2Coll to reach the FCI-style comparator. The shim function would be more or less equivalent to that, and since it's quite special purpose I would hope we could shave a cycle or two. For instance, we could probably afford to set up a dedicated FunctionCallInfo struct associated with the SortSupportInfo struct, and not have to reinitialize one each time. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Inlining comparators as a performance optimisation
Peter Geoghegan wrote: Attached are the results from performing a similar process to the prior benchmark, but on Greg Smith's high-end server, and with an orderlines table that has been doubled-up until it is 1538 MB, making the same old query perform a quicksort that's over 3GB. Short version: HEAD is 20468.0ms, with my patch it's 13689.698ms, so these gains hold-up very well for very large in-memory sorts, possibly even perfectly. Note that the doubling up had an interesting and desirable effect for the purposes of this patch review: It's approaching a worst case due to the low cardinality of the product + quantity columns, which crops up with the non-inlined functions only (int4regqsort_arg, etc). All too frequently, evaluating the first scankey results in equality, and so we cannot elide the SQL function call machinery. This is going to dampen the improvement for the multiple scanKey case considerably (and it looks like a smaller relative improvement than when the orderlines table was quite a lot smaller). As I said from the outset, there is no worst case for the single scanKey case that occurs to me. Multiple scanKeys are likely to be a problem for any effort to offer user-defined, per-type sort functions. I could probably make int4regqsort_arg and friends perform a bit better than we see here by increasing the cardinality of those two columns for the second query, so the first scanKey comparison would frequently suffice. These are exciting advanced you are producing and I am hopeful we can get this included in Postgres 9.2. I have mentioned already that I think parallelism is the next big Postgres challenge, and of course, one of the first areas for parallelism is sorting. If you can speed up sorting as you have reported, this allows us to provide faster sorting now, and get even larger benefits if we implement sorting parallelism. In fact, because Postgres 9.2 has so many performance improvements, we might find that we have a more limited need for parallelism. -- Bruce Momjian br...@momjian.ushttp://momjian.us EnterpriseDB http://enterprisedb.com + It's impossible for everything to be true. + -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Inlining comparators as a performance optimisation
On 29 November 2011 15:31, Bruce Momjian br...@momjian.us wrote: These are exciting advanced you are producing and I am hopeful we can get this included in Postgres 9.2. Thanks Bruce. I have mentioned already that I think parallelism is the next big Postgres challenge, and of course, one of the first areas for parallelism is sorting. I'm not sure that sorting has that much to recommend it as an initial target of some new backend parallelism other than being easy to implement. I've observed the qsort_arg specialisations in this patch out-perform stock qsort_arg by as much as almost 3 times. However, the largest decrease in a query's time that I've observed was 45%, and that was for a contrived worst-case for quicksort, but about 25% is much more typical of queries similar to the ones I've shown, for more normative data distributions. While that's a respectable gain, it isn't a paradigm shifting one, and it makes parallelising qsort itself for further improvements quite a lot less attractive - there's too many other sources of overhead. -- Peter Geoghegan http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training and Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Inlining comparators as a performance optimisation
Peter Geoghegan wrote: On 29 November 2011 15:31, Bruce Momjian br...@momjian.us wrote: These are exciting advanced you are producing and I am hopeful we can get this included in Postgres 9.2. Thanks Bruce. I have mentioned already that I think parallelism is the next big Postgres challenge, and of course, one of the first areas for parallelism is sorting. I'm not sure that sorting has that much to recommend it as an initial target of some new backend parallelism other than being easy to implement. I've observed the qsort_arg specialisations in this patch out-perform stock qsort_arg by as much as almost 3 times. However, the largest decrease in a query's time that I've observed was 45%, and that was for a contrived worst-case for quicksort, but about 25% is much more typical of queries similar to the ones I've shown, for more normative data distributions. While that's a respectable gain, it isn't a paradigm shifting one, and it makes parallelising qsort itself for further improvements quite a lot less attractive - there's too many other sources of overhead. Agreed. I think your improvements make it likely we will address not address sort parallelism first. With all the improvements coming in Postgres 9.2, we might need to look at I/O parallelism first. -- Bruce Momjian br...@momjian.ushttp://momjian.us EnterpriseDB http://enterprisedb.com + It's impossible for everything to be true. + -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Inlining comparators as a performance optimisation
On Tuesday, November 29, 2011 07:48:37 PM Peter Geoghegan wrote: On 29 November 2011 15:31, Bruce Momjian br...@momjian.us wrote: These are exciting advanced you are producing and I am hopeful we can get this included in Postgres 9.2. Thanks Bruce. I have mentioned already that I think parallelism is the next big Postgres challenge, and of course, one of the first areas for parallelism is sorting. I'm not sure that sorting has that much to recommend it as an initial target of some new backend parallelism other than being easy to implement. I've observed the qsort_arg specialisations in this patch out-perform stock qsort_arg by as much as almost 3 times. However, the largest decrease in a query's time that I've observed was 45%, and that was for a contrived worst-case for quicksort, but about 25% is much more typical of queries similar to the ones I've shown, for more normative data distributions. While that's a respectable gain, it isn't a paradigm shifting one, and it makes parallelising qsort itself for further improvements quite a lot less attractive - there's too many other sources of overhead. I think that logic is faulty. For one I doubt that anybody is honestly suggesting paralellism inside qsort itself. It seems more likely/sensible to implement that on the level of mergesorting. Index builds for example could hugely benefit from improvements on that level. With index build you often get pretty non-optimal data distributions btw... I also seriously doubt that you will find an area inside pg's executor where you find that paralellizing them will provide a near linear scale without much, much more work. Also I wouldn't consider sorting the easiest target - especially on a qsort level - for parallelization as you constantly need to execute user defined operators with multiple input tuples which has the usual problems. COPY parsing + inserting or such seems to be way easier target for example. Even doing hashing + aggregation in different threads seems likely to be easier. Andres -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Inlining comparators as a performance optimisation
On 28 Nov 2011, at 02:15, Peter Geoghegan wrote: Attached are the results from performing a similar process to the prior benchmark, but on Greg Smith's high-end server, and with an orderlines table that has been doubled-up until it is 1538 MB, making the same old query perform a quicksort that's over 3GB. Short version: HEAD is 20468.0ms, with my patch it's 13689.698ms, so these gains hold-up very well for very large in-memory sorts, possibly even perfectly. This is some really good stuff. Has to be said, that there must be quite few other places where inlining like that could have quite a positive benefit. But - I also have to say that both template functions and template classes in C++ give you pretty much the same speed improvement, with much better clarity and readability of the code. (I've tested it with the example Peter presented, if someone is interested in the code). One more reason why PostgreSQL project should look into the future and get some of the bits simplified and optimised by switching to C++. -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Inlining comparators as a performance optimisation
Attached are the results from performing a similar process to the prior benchmark, but on Greg Smith's high-end server, and with an orderlines table that has been doubled-up until it is 1538 MB, making the same old query perform a quicksort that's over 3GB. Short version: HEAD is 20468.0ms, with my patch it's 13689.698ms, so these gains hold-up very well for very large in-memory sorts, possibly even perfectly. Note that the doubling up had an interesting and desirable effect for the purposes of this patch review: It's approaching a worst case due to the low cardinality of the product + quantity columns, which crops up with the non-inlined functions only (int4regqsort_arg, etc). All too frequently, evaluating the first scankey results in equality, and so we cannot elide the SQL function call machinery. This is going to dampen the improvement for the multiple scanKey case considerably (and it looks like a smaller relative improvement than when the orderlines table was quite a lot smaller). As I said from the outset, there is no worst case for the single scanKey case that occurs to me. Multiple scanKeys are likely to be a problem for any effort to offer user-defined, per-type sort functions. I could probably make int4regqsort_arg and friends perform a bit better than we see here by increasing the cardinality of those two columns for the second query, so the first scanKey comparison would frequently suffice. Incidentally, I'm pretty sceptical of the idea of any effort being made to provide user-defined per-type sort functions or anything like that. No one has come forward with a novel sorting implementation that looks useful, although there has been some talk of some. It's relatively easy to hack on tuplesort to build a prototype, so I don't think the lack of this functionality is the blocker here. Besides, we will probably end up doing a terrible job of anticipating how invasive such a facility needs to be to facilitate these novel sorting implementations. While I'm very encouraged by these results, I must admit that there are a few things that I cannot currently explain. I don't know why the second query on the Optimized sheet outperforms the same query on the Not inlined page. When you consider how small the standard deviation is for each set of results, it's fairly obvious that it cannot be explained by noise. I thought it was weird, so I re-ran both benchmarks, with much the same outcome. It may be that the compiler gets lucky for the optimized case, resulting in a bit of an extra gain, and that effect is ridiculously magnified. In all cases, the regression tests pass. The single key/inlining optimisation seems to account for a big part of the gains, and the Not inlined figures for the first query would seem to suggest that the inlining can account for more than half of gains seen, whereas that was previously assumed to be less. What I think is happening here is that we're benefiting both from inlining as well as not having to even consider multiple scanKeys (we have to at least consider them even if we know in advance from the nScankeys variable that there'll never be multiple scanKeys). Again, if the cardinality was higher, we'd probably see Not inlined do better here. -- Peter Geoghegan http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training and Services results_server.ods Description: application/vnd.oasis.opendocument.spreadsheet -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Inlining comparators as a performance optimisation
3 items are attached: 1. A spreadsheet, results.ods, which has results for automated multiple runs of the doubled up dellstore orderlines queries (where orderlines is 385 MB). Results are sorted, with the median (or the lower middle value, didn't get the mean of the two middle runs) result for each run highlighted as representative. There are 3 builds of Postgres (HEAD, not inlined, optimized), each on its own sheet in the spreadsheet. The cache was warmed for each query, and subsequently the query was run 20 times. 2. A python script I hacked together that should run on anything around Python 2.5+, with a single dependency, psycopg2. Look at --help to see how it works. This is the script that actually generated the figures that went into the spreadsheet, by being run once for each type of build. You can fairly easily play along at home with this, for these and other queries. It will spit out CSV files. It is designed to warm the cache (by default 3 times before each query) to eliminate caching effects. You can add your own query to the python list to have it run by the script, to generate the same set of figures for that query. I'm rather curious as to how much of an impact this optimisation will have on queries with unique nodes, joins and grouping when they rely on a sort node for their input, in the real world. Certainly, a query need not have an ORDER BY clause to see benefits, perhaps substantial benefits. The logic to parse sort node details is rather unsophisticated right now though, due to my focus on the obvious ORDER BY case. 3. The revision of the patch that was actually tested, now with inlining specialisations for the single scanKey case, and a non-inlined specialisation for multiple scanKeys where we can still benefit from eliding the SQL function call machinery for the first scanKey, which is often almost as useful as being able to do so for all scanKeys. It also selects a sorting specialisation when it can in tuplesort_begin_heap, per Robert's suggestion. Reviewers will want to comment out line 731 of tuplesort.c, #define optimize, to quickly get unoptimized behaviour for comparative purposes. Furthermore, if you'd like to see what this would look like without inlining, you can simply comment out assignments of inline variants of specialisations (i.e. int4inlqsort_arg and int8inlqsort_arg) in tuplesort_begin_heap. It has been suggested that I'm chasing diminishing returns by inlining, as I go further for a smaller benefit. Of course, that's true. However, I believe that I'm not chasing them past the point where that ceases to make sense, and these figures support that contention - chasing diminishing returns in the nature of this kind of work. Here, the additional benefit of inlining accounts for over an entire second shaved off a query that was originally 7056ms, so that's not to be sniffed at. I'll reiterate that qsort_arg has only been modified 4 times after its initial commit in 2006, and these were all trivial changes. While the way that I ape generic programming with the preprocessor is on the ugly side, what I've done can be much more easily understood with reference to qsort_arg itself. Robert said that duplicating the sort function was iffy. However, that already happened long ago, as we're already maintaining both qsort_arg.c and qsort.c, and comments already urge maintainers to keep the two consistent. That's not been a problem though, because, as I've said, there has never been any call to make substantive changes to either in all those years. We could probably further reduce the code footprint of all of this by having template_qsort_arg.h generate comparators directly. That might be inflexible in a way that turns out to matter though, like if we wanted to use this for non-HAVE_INT64_TIMESTAMP timestamps and had to add NaN crud. My next step is to see how this goes with hundreds of sorts in the hundreds of megabytes on a high-end server. I don't have immediate access to one, but I'm working that out. -- Peter Geoghegan http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training and Services From 92185762297ba9819eb7f1b208db03663d9fdfa8 Mon Sep 17 00:00:00 2001 From: peter2ndQuadrant pe...@2ndquadrant.com Date: Sun, 20 Nov 2011 20:36:25 + Subject: [PATCH] Initial commit of optimization Stop directly using oid Added int8 quicksort fast path specialisation, which can also be used in place of F_TIMESTAMP_CMP for HAVE_INT64_TIMESTAMP builds. Rebased, revised patch for -hackers, with timestamp and int8 fast path sorting using the same int8 specialization. Remove unneeded line Rebase for -hackers Descriminate against cases where nKeys 1 when selecting sort specialisation. We now support fast path sorting with multiple scanKeys. We now inline for one scanKey, but do not inline our comparetup_heap replacement when multiple scanKeys are used (although this is at the compiler's discretion). However, when multiple scanKeys are used, we still use
Re: [HACKERS] Inlining comparators as a performance optimisation
On 23 November 2011 19:24, Robert Haas robertmh...@gmail.com wrote: Well, right now the decision as to which mechanism should be used here gets made in tuplesort_performsort(), which has no good way of communicating with EXPLAIN anyway. You could pretty easily add something to Tuplesortstate to accomplish this. That isn't an endorsement of doing so, but I'm not sure that it isn't appropriate. Actually, I think that's a modularity violation; using the address of comparetup_heap as a flag value seems quite ugly. How about moving that logic up to tuplesort_begin_heap() I'll post a patch soon that does just that in the next day or two. Tuplesortstate has a pointer to a sort specialisation. and having it set some state inside the Tuplesort, maybe based on a flag in the opclass (or would it have to attach to the individual operator)? I'm not sure that there's much point in such a flag. At least on my machine, your latest patch reliably crashes the regression tests in multiple places. TRAP: FailedAssertion(!(state-nKeys == 1), File: tuplesort.c, Line: 1261); Yes, sorry about that. Should have been discriminating against nKeys 1 cases. As of this evening, for sorts with multiple scankeys, I'm using optimisations for the first scankey but not subsequent scankeys, which is frequently almost as good as having optimisations for all scanKeys. The formatting of src/include/utils/template_qsort_arg.h is hard to read. At ts=8, the backslashes line up, but the code doesn't fit in 80 columns. If you set ts=4, then it fits in 80 columns, but the backslashes don't line up any more, and the variable declarations don't either. I believe ts=4 is project standard. Fair enough. My working copy and .vimrc have been updated. I still think it would be a good idea to provide a mechanism to override heap_comparetup() with a type-specific function. I don't think that would take much extra code, and then any data type could get at least that much benefit out of this. It seems like it could be a good idea to do some per-assembler-instruction profiling of this code, and perhaps also of the original code. I'm curious where the time is being spent. How would you go about doing that? The instrumentation that profilers use actually caused a big drop in performance here when I attempted it a few weeks ago. There's a kind of Heisenberg effect. This optimisation *more than doubles* raw sort performance for the cases. There is nothing contrived or cherry picked about the query that I selected to represent this optimisation - it was literally the first one that I selected. Sometimes, I see even a markedly better gain than a doubling of raw sort performance - I think my earlier experiments that indicated a much smaller improvement past a certain point may have been methodologically flawed. Sorry about that. If I double-up the data in the orderlines table a few times, until it reaches 385 MB (duplicate ever tuple with an insert into ...select ), then warm the cache, I get very interesting results. Here, we see a few runs of the same old query unoptimised (note that I've excluded some cold-cache runs before these runs): Before optimisation == Total runtime: 7785.473 ms - 3.517310 secs just sorting Total runtime: 8203.533 ms - 3.577193 secs just sorting Total runtime: 8559.743 ms - 3.892719 secs just sorting Total runtime: 9032.564 ms - 3.844746 secs just sorting Total runtime: 9637.179 ms - 4.434431 secs just sorting Total runtime: 9647.215 ms - 4.440560 secs just sorting Total runtime: 9669.701 ms - 4.448572 secs just sorting After optimisation == Total runtime: 5462.419 ms - 1.169963 secs just sorting Total runtime: 5510.660 ms - 1.234393 secs just sorting Total runtime: 5511.703 ms - 1.208377 secs just sorting Total runtime: 5588.604 ms - 1.175536 secs just sorting Total runtime: 5899.496 ms - 1.250403 secs just sorting Total runtime: 6023.132 ms - 1.338760 secs just sorting Total runtime: 6717.177 ms - 1.486602 secs just sorting This is a 800109kB sort. So, taking the median value as representative here, that looks to be just shy of a 40% improvement, or 3.4 seconds. My /proc/cpuinfo is attached on the off chance that someone is interested in that. More work is needed here, but this seems promising. It will be interesting to see how far all of this can be taken with comparetup_index_btree. Certainly, I'm sure there's some gain to be had there by applying lessons learned from comparetup_heap. -- Peter Geoghegan http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training and Services processor : 0 vendor_id : GenuineIntel cpu family : 6 model : 15 model name : Intel(R) Core(TM)2 Duo CPU E6750 @ 2.66GHz stepping: 11 cpu MHz : 2000.000 cache size : 4096 KB physical id : 0 siblings: 2 core id : 0 cpu cores : 2 apicid : 0 initial apicid : 0
Re: [HACKERS] Inlining comparators as a performance optimisation
On Fri, Nov 18, 2011 at 2:11 PM, Tom Lane t...@sss.pgh.pa.us wrote: Simon Riggs si...@2ndquadrant.com writes: On Fri, Nov 18, 2011 at 5:20 AM, Robert Haas robertmh...@gmail.com wrote: I think that we should really consider doing with this patch what Tom suggested upthread; namely, looking for a mechanism to allow individual datatypes to offer up a comparator function that doesn't require bouncing through FunctionCall2Coll(). I don't think its credible to implement that kind of generic improvement at this stage of the release cycle. Er, *what*? We're in mid development cycle, we are nowhere near release. When exactly would you have us make major changes? In any case, what I understood Robert to be proposing was an add-on feature that could be implemented in one datatype at a time. Not a global flag day. We couldn't really do the latter anyway without making life very unpleasant for authors of extension datatypes. Tom, whenever you think I've said something you really disagree with, just assume there's a misunderstanding. Like here. Of course it is OK to make such changes at this time. Given we have 2 months to the last CF of this release, inventing a generic infrastructure is unlikely to be finished and complete in this dev cycle, so requesting that isn't a practical suggestion, IMHO. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Inlining comparators as a performance optimisation
On Tue, Nov 22, 2011 at 8:09 PM, Peter Geoghegan pe...@2ndquadrant.com wrote: I wonder, is it worth qualifying that the Sort Method was a quicksort (fast path) sort within explain analyze output? This is a rather large improvement, so It might actually be something that people look out for, as it might be tricky to remember the exact circumstances under which the optimisation kicks in by the time we're done here. Well, right now the decision as to which mechanism should be used here gets made in tuplesort_performsort(), which has no good way of communicating with EXPLAIN anyway. Actually, I think that's a modularity violation; using the address of comparetup_heap as a flag value seems quite ugly. How about moving that logic up to tuplesort_begin_heap() and having it set some state inside the Tuplesort, maybe based on a flag in the opclass (or would it have to attach to the individual operator)? At least on my machine, your latest patch reliably crashes the regression tests in multiple places. The following test case also crashes them for me (perhaps for the same reason the regression tests crash): create table i4 (a int, b int); insert into i4 values (4, 1), (2, 1), (0, 1), (null, 1), (-2, 1), (-7, 1), (4, 2), (4, 3), (4, 4); select * from i4 order by 1, 2; TRAP: FailedAssertion(!(state-nKeys == 1), File: tuplesort.c, Line: 1261); The formatting of src/include/utils/template_qsort_arg.h is hard to read. At ts=8, the backslashes line up, but the code doesn't fit in 80 columns. If you set ts=4, then it fits in 80 columns, but the backslashes don't line up any more, and the variable declarations don't either. I believe ts=4 is project standard. I still think it would be a good idea to provide a mechanism to override heap_comparetup() with a type-specific function. I don't think that would take much extra code, and then any data type could get at least that much benefit out of this. It seems like it could be a good idea to do some per-assembler-instruction profiling of this code, and perhaps also of the original code. I'm curious where the time is being spent. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Inlining comparators as a performance optimisation
On 21 November 2011 22:55, Robert Haas robertmh...@gmail.com wrote: Results on my machine, for what they're worth: I was curious as to how much of an improvement I'd made to sorting in isolation. I've adding simple sort timing instrumentation to the code in a revised patch, attached, for the convenience of reviewers. This patch adds an int8 specialisation, but it also supports fast path sorting of timestamps by using that same int8 specialisation (for HAVE_INT64_TIMESTAMP builds at least - didn't know if it was worth handling the NaN crud for the other case). ISTM that various different types with identical internal representations can share specialisations like this, which I hope will alleviate some earlier concerns about both the limited applicability of this optimisation and the possible proliferation of specialisations. Here's some raw qsort figures in seconds for the query select * from orderlines order by test (where test is a timestamptz column I added and updated with some data, executing repeatedly on the same machine as before): Before: DEBUG: elapsed: 0.031738 DEBUG: elapsed: 0.031595 DEBUG: elapsed: 0.031502 DEBUG: elapsed: 0.031378 DEBUG: elapsed: 0.031359 DEBUG: elapsed: 0.031413 DEBUG: elapsed: 0.031499 DEBUG: elapsed: 0.031394 DEBUG: elapsed: 0.031368 After: DEBUG: elapsed: 0.013341 DEBUG: elapsed: 0.014608 DEBUG: elapsed: 0.013809 DEBUG: elapsed: 0.013244 DEBUG: elapsed: 0.014307 DEBUG: elapsed: 0.013207 DEBUG: elapsed: 0.013227 DEBUG: elapsed: 0.013264 DEBUG: elapsed: 0.013143 DEBUG: elapsed: 0.013455 DEBUG: elapsed: 0.013447 I wonder, is it worth qualifying that the Sort Method was a quicksort (fast path) sort within explain analyze output? This is a rather large improvement, so It might actually be something that people look out for, as it might be tricky to remember the exact circumstances under which the optimisation kicks in by the time we're done here. I haven't had as much time as I'd like to polish this patch, or to get clearer answers. I expect to spend more time on it over the weekend. -- Peter Geoghegan http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training and Services From 687779e799eb0d0064f86192cc74ea473ff9a607 Mon Sep 17 00:00:00 2001 From: peter2ndQuadrant pe...@2ndquadrant.com Date: Sun, 20 Nov 2011 20:36:25 + Subject: [PATCH] Initial commit of optimization Stop directly using oid Added int8 quicksort fast path specialisation, which can also be used in place of F_TIMESTAMP_CMP for HAVE_INT64_TIMESTAMP builds. Rebased, revised patch for -hackers, with timestamp and int8 fast path sorting using the same int8 specialization. Remove unneeded line Rebase for -hackers --- src/backend/utils/sort/tuplesort.c | 105 +++- src/include/utils/template_qsort_arg.h | 217 2 files changed, 316 insertions(+), 6 deletions(-) create mode 100644 src/include/utils/template_qsort_arg.h diff --git a/src/backend/utils/sort/tuplesort.c b/src/backend/utils/sort/tuplesort.c index 3505236..4962973 100644 --- a/src/backend/utils/sort/tuplesort.c +++ b/src/backend/utils/sort/tuplesort.c @@ -107,11 +107,13 @@ #include miscadmin.h #include pg_trace.h #include utils/datum.h +#include utils/fmgroids.h #include utils/logtape.h #include utils/lsyscache.h #include utils/memutils.h #include utils/pg_rusage.h #include utils/rel.h +#include utils/template_qsort_arg.h #include utils/tuplesort.h @@ -1225,12 +1227,60 @@ puttuple_common(Tuplesortstate *state, SortTuple *tuple) } /* - * All tuples have been provided; finish the sort. + * Manufacture type specific sorting specialisations + * with inline comparators + */ +static inline int +compare_int4(Datum first, Datum second) +{ + int32 first_int = DatumGetInt32(first); + int32 second_int = DatumGetInt32(second); + + if (first_int second_int) + return 1; + else if (first_int == second_int) + return 0; + else + return -1; +} + +static inline int +compare_int8(Datum first, Datum second) +{ + int64 first_int = DatumGetInt64(first); + int64 second_int = DatumGetInt64(second); + + if (first_int second_int) + return 1; + else if (first_int == second_int) + return 0; + else + return -1; +} + +TEMPLATE_QSORT_ARG(int4, compare_int4); +TEMPLATE_QSORT_ARG(int8, compare_int8); + +double timeval_subtract(struct timeval *x, struct timeval *y); +double timeval_subtract(struct timeval *x, struct timeval *y) +{ + struct timeval result; + /* Compute the time remaining to wait. tv_usec is certainly positive. */ + result.tv_sec = x-tv_sec - y-tv_sec; + result.tv_usec = x-tv_usec - y-tv_usec; + + /* return difference in seconds */ + return result.tv_sec + ((double) result.tv_usec / 100); +} + +/* + * All tuples have been provided; perform the sort. */ void tuplesort_performsort(Tuplesortstate *state) { MemoryContext oldcontext = MemoryContextSwitchTo(state-sortcontext); + struct timeval begin, end; #ifdef TRACE_SORT if
Re: [HACKERS] Inlining comparators as a performance optimisation
On Tue, Sep 20, 2011 at 7:53 PM, Peter Geoghegan pe...@2ndquadrant.com wrote: I don't think that the fact that that happens is at all significant at this early stage, and it never even occurred to me that you'd think that it might be. I was simply disclosing a quirk of this POC patch. The workaround is probably to use a macro instead. For the benefit of those that didn't follow the other threads, the macro-based qsort implementation, which I found to perform significantly better than regular qsort(), runs like this on my laptop when I built at 02 with GCC 4.6 just now: C stdlib quick-sort time elapsed: 2.092451 seconds Inline quick-sort time elapsed: 1.587651 seconds Results on my machine, for what they're worth: [rhaas inline_compar_test]$ gcc -O0 qsort-inline-benchmark.c [rhaas inline_compar_test]$ ./a.out C stdlib quick-sort time elapsed: 2.366762 seconds Inline quick-sort time elapsed: 1.807951 seconds [rhaas inline_compar_test]$ gcc -O1 qsort-inline-benchmark.c [rhaas inline_compar_test]$ ./a.out C stdlib quick-sort time elapsed: 1.970473 seconds Inline quick-sort time elapsed: 1.002765 seconds [rhaas inline_compar_test]$ gcc -O2 qsort-inline-benchmark.c [rhaas inline_compar_test]$ ./a.out C stdlib quick-sort time elapsed: 1.966408 seconds Inline quick-sort time elapsed: 0.958999 seconds [rhaas inline_compar_test]$ gcc -O3 qsort-inline-benchmark.c [rhaas inline_compar_test]$ ./a.out C stdlib quick-sort time elapsed: 1.988693 seconds Inline quick-sort time elapsed: 0.975090 seconds -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Inlining comparators as a performance optimisation
Part of my problem with not producing specialisations that I really neglected to complain about until now is the inconsistency between types, and the need to deal with that. We benefit from assuming in our specialisation that we're only dealing with POD types, simply not considering things that we would otherwise have had to for the benefit of types that have a Datum representation that is pass-by-reference, or have collations, or have multiple scankeys. Leaving aside the question of straight compiler-optimisation benefits for the moment, the remaining benefit of what I've done comes not so much from avoiding the usual function call machinery per se, as from doing so *as well as* cutting down on what currently happens in comparetup_heap to handle every single compound and scalar type. Compare the function comparetup_heap with my meta-function TYPE##AppSort to see what I mean. The function comparetup_heap is the comparator directly used by qsort_arg when sorting heap tuples, and qsort_arg outsources to comparetup_heap some things that you might not expect it to (very little has changed about qsort_arg since we lifted it from NetBSD back in 2006). So while you might imagine that that loop in comparetup_heap and things like its use of the heap_getattr macros won't expand to that many instructions, you really don't want them in the inner loop of a long operation like qsorting, paid for up to O(n ^ 2) times. There's also the obvious implications for compiler optimisations, particularly relating to effective usage of CPU cache. I'm trying to get you what you asked for: A straight choice between what Tom suggested and what I suggested, with perhaps some compromises between the two positions. That's sort of tricky though, especially considering the issues raised above. -- Peter Geoghegan http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training and Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Inlining comparators as a performance optimisation
On 19 November 2011 02:55, Robert Haas robertmh...@gmail.com wrote: Maybe we should look at trying to isolate that a bit better. Indeed. Fortunately, GCC has options to disable each optimisation. Here's potentially relevant flags that we're already implicitly using at -02: -finline-small-functions -- this is the one that inlined functions without an inline keyword -findirect-inlining -finline-functions-called-once -fearly-inlining -fipa-sra (Perform interprocedural scalar replacement of aggregates, removal of unused parameters and replacement of parameters passed by reference by parameters passed by value). In an effort to better isolate the effects of inlining, I tried this: ./configure CFLAGS=-fno-inline -fno-inline-small-functions (I could have disabled more -02 optimisations, but this proved sufficient to make my point) Unsurprisingly, this makes things slower than regular -02. With the patch, the same query (once again, using explain analyze) with the same fast path quicksort stabilises around 92ms +/- 5ms (recall that the original figure was ~82ms). Our gains evaporate, and then some. Take away the additional CLAGS, and we're predictably back to big gains, with the query taking ~52ms just as before. What happens to this query when we build an unmodified postgres with these same CFLAGS? Well, we see the query take ~116ms after a few runs. It seems that the impedance mismatch matters, but inlining and other optimisations look to be at least as important. This isn't surprising to me, given what I was able to do with the isolated test. Maybe I should have tried it with additional disabling of optimisations named above, but that would have perhaps made things less clear. I'd probably have been better off directly measuring qsort speed and only passing those flags when compiling tuplesort.c (maybe impedance mismatch issues would have proven to have been even less relevant), but I wanted to do something that could be easily recreated, plus it's late. It strikes me that we could probably create an API that would support doing either of these things depending on the wishes of the underlying datatype. For example, imagine that we're sorting with (int4, int4). We associate a PGPROC-callable function with that operator that returns internal, really a pointer to a struct. The first element of the struct is a pointer to a comparison function that qsort() (or a tape sort) can invoke without a trampoline; the second is a wholesale replacement for qsort(); either or both can be NULL. Given that, it seems to me that we could experiment with this pretty easily, and if it turns out that only one of them is worth doing, it's easy to drop one element out of the structure. Or do you have another plan for how to do this? I haven't given it much thought. Let me get back to you on that next week. Have you done any benchmarks where this saves seconds or minutes, rather than milliseconds? That would certainly make it more exciting, at least to me. Right. Well, I thought I'd use pgbench to generate a large table in a re-creatable way. That is: pgbench -i -s 60 This puts pgbench_accounts at 769MB. Then, having increased work_mem to 1GB (enough to qsort) and maintenance_work_mem to 756mb, I decided to test this query with the patch: explain analyze select * from pgbench_accounts order BY abalance; This stabilised at ~3450ms, through repeatedly being executed. How does this compare to unpatched postgres? Well, it stabilised at about ~3780ms for the same query. This patch is obviously less of a win as the number of tuples to sort goes up. That's probably partly explained by the cost of everything else going up at a greater rate than the number of comparisons. I suspect that if we measure qsort in isolation, we'll see better results, so we may still see a good win on index creation time as a result of this work. -- Peter Geoghegan http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training and Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Inlining comparators as a performance optimisation
On 20 November 2011 03:29, Peter Geoghegan pe...@2ndquadrant.com wrote: ./configure CFLAGS=-fno-inline -fno-inline-small-functions (I could have disabled more -02 optimisations, but this proved sufficient to make my point) I'll isolate this further to tuplesort.c soon, which ought to be a lot more useful. -- Peter Geoghegan http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training and Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Inlining comparators as a performance optimisation
On Fri, Nov 18, 2011 at 5:20 AM, Robert Haas robertmh...@gmail.com wrote: I think that we should really consider doing with this patch what Tom suggested upthread; namely, looking for a mechanism to allow individual datatypes to offer up a comparator function that doesn't require bouncing through FunctionCall2Coll(). It seems to me that duplicating the entire qsort() algorithm is a little iffy. Sure, in this case it works out to a win. But it's only a small win - three-quarters of it is in the uncontroversial activity of reducing the impedance mismatch - and how do we know it will always be a win? Adding more copies of the same code can be an anti-optimization if it means that a smaller percentage of it fits in the instruction cache, and sometimes small changes in runtime are caused by random shifts in the layout of memory that align things more or less favorably across cache lines rather than by real effects. Now it may well be that this is a real effect, but will it still look as good when we do this for 10 data types? For 100 data types? In contrast, it seems to me that reducing the impedance mismatch is something that we could go and do across our entire code base, and every single data type would benefit from it. It would also be potentially usable by other sorting algorithms, not just quick sort. I don't think its credible to implement that kind of generic improvement at this stage of the release cycle. That has a much bigger impact since it potentially effects all internal datatypes and external ones also. Definitely a longer term way forward. If individual datatypes offer up a comparator function that is easily going to result in more code than is being suggested here. So the argument about flooding the CPU cache works against your alternate proposal, not in favour of it. We have no proof that we need to do this for 10 or 100 data types. We only currently have proof that there is gain for the most common types. Of course, it sounds like it might be useful to allow any data type to gain an advantage, but we shouldn't be blind to the point that almost nobody will use such a facility, and if they do the code won't be written for a long time yet. If this came as a request from custom datatype authors complaining of slow sorts it would be different, but it didn't so we don't even know if anybody would ever write user defined comparator routines. Rejecting a patch because of a guessed user requirement is not good. Peter's suggested change adds very few lines of code and those compile to some very terse code, a few hundred instructions at very most. Requesting an extra few cachelines to improve qsort by so much is still an easy overall win. The OP change improves qsort dramatically, and is a small, isolated patch. There is no significant downside. We also have it now, so lets commit this, chalk up another very good performance improvement and use our time on something else this commitfest, such as the flexlocks idea. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Inlining comparators as a performance optimisation
On Fri, Nov 18, 2011 at 3:53 AM, Simon Riggs si...@2ndquadrant.com wrote: We have no proof that we need to do this for 10 or 100 data types. We only currently have proof that there is gain for the most common types. Well, that's kind of my point. I think this needs more work before we decide what the best approach is. So far, the ONLY test result we have that compares inlining to not inlining shows a speedup from 60 ms to 52 ms. I think that an 8 ms speedup on one test with one datatype on one platform/compiler combination isn't sufficient evidence to conclude that this is the best possible approach. I think the way to look at this is that this patch contains two possibly good ideas: one of which is to make the qsort() argument match the qsort() calling convention, and the other of which is to have multiple copies of the qsort() logic that inline the comparators for their respective datatypes. Tom hypothesized that most of the benefit was in the first idea, and the numbers that Peter posted seem to support that conclusion. The first idea is also less invasive and more broadly applicable, so to me that seems like the first thing to pursue. Now, that doesn't mean that we shouldn't consider the second idea as well, but I don't believe that the evidence presented thus far is sufficient to prove that we should go that route. It seems entirely possible that inlining any non-trivial comparator function could work out to a loss, or that the exact choice of compiler flags could affect whether inlining works out to a plus or a minus (what happens with -O3 vs. -O2? what about -O1? what about -O2 -fno-omit-frame-pointer? what if they're using HP's aCC or Intel's icc rather than gcc?). There's no point in installing an optimization that could easily be a pessimization on some other workload, and that hasn't been tested, or at least no results have been posted here. On the other hand, matching the calling convention of the comparator function to what qsort() wants and eliminating the trampoline seems absolutely certain to be a win in every case, and based on the work Peter has done it seems like it might be a quite large win. In fact, you have to ask yourself just exactly how much our function-calling convention is costing us in general. We use that mechanism an awful lot and whatever loss of cycles is involved would be spread all over the code base where oprofile or gprof won't easily be able to pick it out. Even the cost at any particular call site will be split between the caller and the callee. There might be more stuff we could optimize here than just sorting... -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Inlining comparators as a performance optimisation
Simon Riggs si...@2ndquadrant.com writes: On Fri, Nov 18, 2011 at 5:20 AM, Robert Haas robertmh...@gmail.com wrote: I think that we should really consider doing with this patch what Tom suggested upthread; namely, looking for a mechanism to allow individual datatypes to offer up a comparator function that doesn't require bouncing through FunctionCall2Coll(). I don't think its credible to implement that kind of generic improvement at this stage of the release cycle. Er, *what*? We're in mid development cycle, we are nowhere near release. When exactly would you have us make major changes? In any case, what I understood Robert to be proposing was an add-on feature that could be implemented in one datatype at a time. Not a global flag day. We couldn't really do the latter anyway without making life very unpleasant for authors of extension datatypes. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Inlining comparators as a performance optimisation
On 18 November 2011 05:20, Robert Haas robertmh...@gmail.com wrote: I think that we should really consider doing with this patch what Tom suggested upthread; namely, looking for a mechanism to allow individual datatypes to offer up a comparator function that doesn't require bouncing through FunctionCall2Coll(). It seems to me that duplicating the entire qsort() algorithm is a little iffy. I understand that we highly value extensibility and genericity (yes, that's a real word). We may not always be well served by that tendency. Sure, in this case it works out to a win. But it's only a small win - three-quarters of it is in the uncontroversial activity of reducing the impedance mismatch - and how do we know it will always be a win? Firstly, 1/4 of a quite large gain is still a pretty good gain. Secondly, I probably didn't actually isolate the effects of inlining, nor the overall benefit of the compiler knowing the comparator at compile-time. I just removed the inline keyword. Those are two different things. The inline keyword serves as a request to the compiler to inline. The compiler can and often will ignore that request. Most people know that. What isn't so widely known is that modern compilers may equally well inline even when they haven't been asked to (but only when they can). When you also consider, as I've already pointed out several times, that individual call sites are inlined, it becomes apparent that there may well still be a certain amount of inlining and/or other optimisations like procedural integration going on at some call sites even without the encouragement of the inline keyword, that would not have been performed without the benefit of compile-time comparator knowledge. The addition of the inline keyword may just, in this particular case, have the compiler inline even more call sites. Posting the ~8ms difference was motivated by a desire to prove that inlining had *some* role to play, without actually going to the trouble of implementing Tom's idea as a basis of comparison, because Tom was very sceptical of inlining. The long and the short of it is that I'm going to have to get my hands dirty with a dissembler before we really know exactly what's happening. That, or I could use an optimisation fence of some type. Adding more copies of the same code can be an anti-optimization if it means that a smaller percentage of it fits in the instruction cache, and sometimes small changes in runtime are caused by random shifts in the layout of memory that align things more or less favorably across cache lines rather than by real effects. Now it may well be that this is a real effect, but will it still look as good when we do this for 10 data types? For 100 data types? I'd favour limiting it to just the common integer and float types. In contrast, it seems to me that reducing the impedance mismatch is something that we could go and do across our entire code base, and every single data type would benefit from it. It would also be potentially usable by other sorting algorithms, not just quick sort. Suppose that we went ahead and added that infrastructure. What you must acknowledge is that one reason that this speed-up is so dramatic is that the essential expense of a comparison is already so low - a single instruction - and therefore the overall per-comparison cost goes way down, particularly if the qsort inner loop can store the code across fewer cache lines. For that reason, any absolute improvement that you'll see in complex datatypes will be smaller, maybe much smaller, because for each comparison we'll execute many more instructions that are essential to the comparison. In my estimation, all of this work does not point to there being an undue overhead in the function calling convention as you suggested. Still, I'm not opposed to investigating generalising this in some way, reservations notwithstanding, unless we have to block-wait on it. I don't want to chase diminishing returns too far. Well, that's kind of my point. I think this needs more work before we decide what the best approach is. Agreed. So far, the ONLY test result we have that compares inlining to not inlining shows a speedup from 60 ms to 52 ms. I think that an 8 ms speedup on one test with one datatype on one platform/compiler combination isn't sufficient evidence to conclude that this is the best possible approach. Fair enough, but it's not the only test I did - I posted other numbers for the same query when the table was 48mb, and we saw a proportional improvement, consistent with a per-comparison win. I'm supposed to be on leave for a few days at the moment, so I won't be very active this weekend, but I'm rather curious as to where you or others would like to see me go with benchmarks. I should point out that we currently don't have much idea how big of a win applying these principles could be for index creation times...it could possibly be very significant. My profiling of index
Re: [HACKERS] Inlining comparators as a performance optimisation
On Fri, Nov 18, 2011 at 12:20:26AM -0500, Robert Haas wrote: I think that we should really consider doing with this patch what Tom suggested upthread; namely, looking for a mechanism to allow individual datatypes to offer up a comparator function that doesn't require bouncing through FunctionCall2Coll(). It seems to me that duplicating the entire qsort() algorithm is a little iffy. Sure, in this case it works out to a win. But it's only a small win - three-quarters of it is in the uncontroversial activity of reducing the impedance mismatch - and how do we know it will always be a win? There's always the old idea of a data type providing a function mapping f:(type - int64) in such a way that it preserves order. That is, in the sense that: f(x) f(y) = x y When sorting, you add f(x) as hidden column and change ORDER BY x to ORDER BY f(x), x. Then you only need to special case the int64 version. This would mean that in most cases you may be able to skip the call because you're comparing integers. The downside is you need to call f on each input. It depends on the datatype if that's cheaper or not, but for all numerics types I think it's an easy win. I don't think anyone has written a proof of concept for this. It does have the advantage of scaling better than coding a qsort for each individual type. Have a nice day, -- Martijn van Oosterhout klep...@svana.org http://svana.org/kleptog/ He who writes carelessly confesses thereby at the very outset that he does not attach much importance to his own thoughts. -- Arthur Schopenhauer signature.asc Description: Digital signature
Re: [HACKERS] Inlining comparators as a performance optimisation
On Fri, Nov 18, 2011 at 4:38 PM, Peter Geoghegan pe...@2ndquadrant.com wrote: I understand that we highly value extensibility and genericity (yes, that's a real word). We may not always be well served by that tendency. True (except that genericity is not a synonym for generality AFAICT). A good fraction of the optimizations that we've done over the last, well, pick any arbitrary period of time consists in adding special cases that occur frequently enough to be worth optimizing - e.g. HOT, or fast relation locks - often to extremely good effect. Firstly, 1/4 of a quite large gain is still a pretty good gain. Secondly, I probably didn't actually isolate the effects of inlining, nor the overall benefit of the compiler knowing the comparator at compile-time. I just removed the inline keyword. Maybe we should look at trying to isolate that a bit better. It strikes me that we could probably create an API that would support doing either of these things depending on the wishes of the underlying datatype. For example, imagine that we're sorting with (int4, int4). We associate a PGPROC-callable function with that operator that returns internal, really a pointer to a struct. The first element of the struct is a pointer to a comparison function that qsort() (or a tape sort) can invoke without a trampoline; the second is a wholesale replacement for qsort(); either or both can be NULL. Given that, it seems to me that we could experiment with this pretty easily, and if it turns out that only one of them is worth doing, it's easy to drop one element out of the structure. Or do you have another plan for how to do this? Fair enough, but it's not the only test I did - I posted other numbers for the same query when the table was 48mb, and we saw a proportional improvement, consistent with a per-comparison win. I'm supposed to be on leave for a few days at the moment, so I won't be very active this weekend, but I'm rather curious as to where you or others would like to see me go with benchmarks. I should point out that we currently don't have much idea how big of a win applying these principles could be for index creation times...it could possibly be very significant. My profiling of index creation makes this looks promising. Have you done any benchmarks where this saves seconds or minutes, rather than milliseconds? That would certainly make it more exciting, at least to me. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Inlining comparators as a performance optimisation
On Sun, Sep 25, 2011 at 10:12 PM, Peter Geoghegan pe...@2ndquadrant.com wrote: I've produced something much neater than my first patch, attached, although I still consider this to be at the POC stage, not least since I'm not exactly sure how I should be selecting the right specialisation in tuplesort_performsort() (a hack is used right now that does a direct pg_proc OID comparison), and also because I haven't implemented anything other than qsort for heap tuples yet (a minor detail, I suppose). I'm pleased to say that a much clearer picture of what's really going on here has emerged. Statistics: Total runtime according to explain analyze for query select * from orderlines order by prod_id (dellstore2 sample db), at GCC 4.5's -02 optimisation level, after warming the cache, on my desktop: Without the patch: ~82ms With the patch, but with the inline keyword commented out for all new functions/meta-functions: ~60ms with the patch, unmodified: ~52ms Reviewing away when I should be sleeping, wahoo...! I think that we should really consider doing with this patch what Tom suggested upthread; namely, looking for a mechanism to allow individual datatypes to offer up a comparator function that doesn't require bouncing through FunctionCall2Coll(). It seems to me that duplicating the entire qsort() algorithm is a little iffy. Sure, in this case it works out to a win. But it's only a small win - three-quarters of it is in the uncontroversial activity of reducing the impedance mismatch - and how do we know it will always be a win? Adding more copies of the same code can be an anti-optimization if it means that a smaller percentage of it fits in the instruction cache, and sometimes small changes in runtime are caused by random shifts in the layout of memory that align things more or less favorably across cache lines rather than by real effects. Now it may well be that this is a real effect, but will it still look as good when we do this for 10 data types? For 100 data types? In contrast, it seems to me that reducing the impedance mismatch is something that we could go and do across our entire code base, and every single data type would benefit from it. It would also be potentially usable by other sorting algorithms, not just quick sort. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Inlining comparators as a performance optimisation
On Tue, Oct 4, 2011 at 10:55 PM, Greg Stark st...@mit.edu wrote: I agree that if we wanted to farm out entire plan nodes we would probably end up generating new partial plans which would be handed to other backend processes. That's not unlike what Oracle did with Parallel Query. But i'm a bit skeptical that we'll get much of that done in 2-3 years. The main use case for Parallel Query in Oracle is for partitioned tables -- and we haven't really polished that after how many years? Partitioning hasn't been completely neglected; 9.1 adds support for something called Merge Append. You may have heard of (or, err, authored) that particular bit of functionality. :-) Of course, it would be nice to have a better syntax, but I don't think the lack of it should discourage us from working on parallel query, which is a muti-part problem. You need to: - have planner support to decide when to parallelize - have a mechanism for firing up worker processes and synchronizing the relevant bits of state between the master and the workers - have an IPC mechanism for streaming data between the master process and the workers - figure out how to structure the executor so that the workers neither stall nor run too far ahead of the master (which might have LIMIT 10 or something) Markus Wanner took a crack at generalizing the autovacuum machinery that we have now into something that could be used to fire up general-purpose worker processes, but it fell down mostly because I (and, I think, others) weren't convinced that imessages were something we wanted to suck into core, and Markus reasonably enough wasn't interested in rewriting it to do something that wouldn't really help his work with Postgres-R. I'm not sure where Bruce is getting his timeline from, but I think the limiting factor is not so much that we don't have people who can write the code as that those people are busy, and this is a big project. But you can bet that if it gets to the top of Tom's priority list (just for example) we'll see some motion...! -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Inlining comparators as a performance optimisation
Robert Haas wrote: Markus Wanner took a crack at generalizing the autovacuum machinery that we have now into something that could be used to fire up general-purpose worker processes, but it fell down mostly because I (and, I think, others) weren't convinced that imessages were something we wanted to suck into core, and Markus reasonably enough wasn't interested in rewriting it to do something that wouldn't really help his work with Postgres-R. I'm not sure where Bruce is getting his timeline from, but I think the limiting factor is not so much that we don't have people who can write the code as that those people are busy, and this is a big project. But you can bet that if it gets to the top of Tom's priority list (just for example) we'll see some motion...! I was thinking of setting up a team to map out some strategies and get community buy-in, and then we could attack each issue. I got the 2-3 years from the Win32 timeline. -- Bruce Momjian br...@momjian.ushttp://momjian.us EnterpriseDB http://enterprisedb.com + It's impossible for everything to be true. + -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Inlining comparators as a performance optimisation
Peter Geoghegan wrote: On the subject of highly ambitious optimisations to sorting, one possibility I consider much more practicable than GPU-accelerated sorting is simple threading; quicksort can be parallelised very effectively, due to its divide-and-conquer nature. If we could agree on a threading abstraction more sophisticated than forking, it's something I'd be interested in looking at. To do so would obviously entail lots of discussion about how that relates to whatever way we eventually decide on implementing parallel query, and that's obviously a difficult discussion. I agree that the next big challenge for Postgres is parallel operations. With the number of cores increasing, and with increased memory and SSD, parallel operation is even more important. Rather than parallelizing the entire backend, I imagine adding threading or helper processes for things like sorts, index scans, executor nodes, and stored procedure languages. I expect final code to be 2-3 years in the future. -- Bruce Momjian br...@momjian.ushttp://momjian.us EnterpriseDB http://enterprisedb.com + It's impossible for everything to be true. + -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Inlining comparators as a performance optimisation
On Wed, Oct 5, 2011 at 2:49 AM, Bruce Momjian br...@momjian.us wrote: Rather than parallelizing the entire backend, I imagine adding threading or helper processes for things like sorts, index scans, executor nodes, and stored procedure languages. I expect final code to be 2-3 years in the future. I don't actually see it would be a big problem quicksort to start up a bunch of threads which do some of the work and go away when the tuplesort is done. As long as the code they're executing is well defined and limited to a small set of code that can be guaranteed to be thread-safe it should be reasonably simple. The problem is that in most of the Postgres core there aren't many places where much code fits that description. Even in tuplesort it can call out to arbitrary user-defined functions so we would need a way to mark which functions are thread-safe. Beyond integer and floating point comparisons it may not be much -- certainly not Numeric or strings due to detoasting... And then there are things like handling the various signals (including SI invalidations?) and so on. I agree that if we wanted to farm out entire plan nodes we would probably end up generating new partial plans which would be handed to other backend processes. That's not unlike what Oracle did with Parallel Query. But i'm a bit skeptical that we'll get much of that done in 2-3 years. The main use case for Parallel Query in Oracle is for partitioned tables -- and we haven't really polished that after how many years? -- greg -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Inlining comparators as a performance optimisation
On 26 September 2011 04:46, Robert Haas robertmh...@gmail.com wrote: I don't want to take time to review this in detail right now, because I don't think it would be fair to put this ahead of things that were submitted for the current CommitFest, but I'm impressed. Thank you. Now that I think about it, the if statement that determines if the int4 specialisation will be used may be okay - it sort of documents the conditions under which the int4 specialisation should be used with reference to how the else generic case actually works, which is perhaps no bad thing. I now realise that I should be using constants from fmgroids.h though. -- Peter Geoghegan http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training and Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Inlining comparators as a performance optimisation
On Sun, Sep 25, 2011 at 10:12 PM, Peter Geoghegan pe...@2ndquadrant.com wrote: [ new results ] Nice results. I think these are far more convincing than the last set, because (1) the gains are bigger and (2) they survive -O2 and (3) you tested an actual query, not just qsort() itself. I don't want to take time to review this in detail right now, because I don't think it would be fair to put this ahead of things that were submitted for the current CommitFest, but I'm impressed. On the subject of highly ambitious optimisations to sorting, one possibility I consider much more practicable than GPU-accelerated sorting is simple threading; quicksort can be parallelised very effectively, due to its divide-and-conquer nature. If we could agree on a threading abstraction more sophisticated than forking, it's something I'd be interested in looking at. To do so would obviously entail lots of discussion about how that relates to whatever way we eventually decide on implementing parallel query, and that's obviously a difficult discussion. I have the same feeling about about this that I do about almost every executor optimization that anyone proposes: the whole project would be entirely simple and relatively painless if it weren't for the need to make planner changes. I mean, deciding on a threading interface is likely to be a somewhat contentious discussion, with differences of opinion on whether we should do it and what the API should look like. But at the end of the day it's not rocket science, and I expect that we would end up with something reasonable. What seems much harder is figuring out how to decide when to perform quicksort in parallel vs. single-threaded, and how much parallelism would be appropriate. I haven't seen anyone propose even a shadow of an idea about how to make such decisions intelligently, either in general or in specific cases. The other issue is that, while threading would be possibly suitable for this particular case, at least for built-in datatypes with comparison operations that basically reduce to single machine-language comparison instructions, it's hard to see how we could take it much further. It would be unsafe for these multiple threads of execution to do anything that could possibly throw an error or anything that touches a lightweight lock or, really, just about anything at all. Trying to make the entire backend thread-safe - or even any significant portion of it - seems like a colossal effort that will most likely fail, but maybe not without eating an enormous amount of developer time first. And without doing that, I don't think we could even extend this as far as, say, numeric, whose functions do things like palloc() and ereport() internally. So I feel like this whole approach might be a dead-end - there's a narrow range of cases where it could be made to work, I think, but after that I think you hit a titanium wall. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Inlining comparators as a performance optimisation
On 21.09.2011 02:53, Peter Geoghegan wrote: C stdlib quick-sort time elapsed: 2.092451 seconds Inline quick-sort time elapsed: 1.587651 seconds Does *that* look attractive to you? Not really, to be honest. That's a 25% speedup in pure qsorting speed. How much of a gain in a real query do you expect to get from that, in the best case? There's so many other sources of overhead that I'm afraid this will be lost in the noise. If you find a query that spends, say, 50% of its time in qsort(), you will only get a 12.5% speedup on that query. And even 50% is really pushing it - I challenge you to find a query that spends any significant amount of time qsorting integers. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Inlining comparators as a performance optimisation
On Tue, Sep 20, 2011 at 3:51 AM, Tom Lane t...@sss.pgh.pa.us wrote: This performance patch differs from most in that it's difficult in principle to imagine a performance regression occurring. Really? N copies of the same code could lead to performance loss just due to code bloat (ie, less of a query's inner loops fitting in CPU cache). Not to mention the clear regression in maintainability. So I'm disinclined to consider this sort of change without a significantly bigger win than you're suggesting above (no, I don't even consider the -O0 number attractive, let alone what you're finding at -O2). More copies of the code are somewhat annoying, but its only 100 lines of code in one module and we can easily have specific tests for each. The extra code size is minor in comparison to the reams of code we add elsewhere. It's a surprisingly good win for such a common use case. Well done, Peter. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Inlining comparators as a performance optimisation
On Wed, Sep 21, 2011 at 7:51 AM, Heikki Linnakangas heikki.linnakan...@enterprisedb.com wrote: On 21.09.2011 02:53, Peter Geoghegan wrote: C stdlib quick-sort time elapsed: 2.092451 seconds Inline quick-sort time elapsed: 1.587651 seconds Does *that* look attractive to you? Not really, to be honest. That's a 25% speedup in pure qsorting speed. How much of a gain in a real query do you expect to get from that, in the best case? There's so many other sources of overhead that I'm afraid this will be lost in the noise. If you find a query that spends, say, 50% of its time in qsort(), you will only get a 12.5% speedup on that query. And even 50% is really pushing it - I challenge you to find a query that spends any significant amount of time qsorting integers. How about almost every primary index creation? Don't really see a reason for the negativity here. If you use that argument no performance gain is worth it because all workloads are mixed. This is a marvellous win, a huge gain from a small, isolated and easily tested change. By far the smallest amount of additional code to sorting we will have added and yet one of the best gains. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Inlining comparators as a performance optimisation
On 21.09.2011 10:01, Simon Riggs wrote: On Wed, Sep 21, 2011 at 7:51 AM, Heikki Linnakangas heikki.linnakan...@enterprisedb.com wrote: On 21.09.2011 02:53, Peter Geoghegan wrote: C stdlib quick-sort time elapsed: 2.092451 seconds Inline quick-sort time elapsed: 1.587651 seconds Does *that* look attractive to you? Not really, to be honest. That's a 25% speedup in pure qsorting speed. How much of a gain in a real query do you expect to get from that, in the best case? There's so many other sources of overhead that I'm afraid this will be lost in the noise. If you find a query that spends, say, 50% of its time in qsort(), you will only get a 12.5% speedup on that query. And even 50% is really pushing it - I challenge you to find a query that spends any significant amount of time qsorting integers. How about almost every primary index creation? Nope. Swamped by everything else. Also note that as soon as the sort grows big enough to not fit in maintenance_work_mem, you switch to the external sort algorithm, which doesn't use qsort. Perhaps you could do similar inlining in the heap sort merge passes done in the external sort, but it's unlikely to be as big a win there. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Inlining comparators as a performance optimisation
On 21 September 2011 01:48, karave...@mail.bg wrote: All -O2 version show 42% speedup with inlined qsort. -O0 showed 25% speedup. Thanks. Looks like the figures I posted last night were fairly conservative. Does anyone else care to report results? -- Peter Geoghegan http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training and Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Inlining comparators as a performance optimisation
Recent discussions on the threads Double sorting split patch and CUDA sorting raised the possibility that there could be significant performance optimisation low-hanging fruit picked by having the executor treat integers and floats as a special case during sorting, avoiding going to the trouble of calling a comparator using the built-in SQL function machinery Why only for integers and floats why not for char/varchar? But I believe this can make code less maintainable as similar things can be done at other places to avoid SQL function machinery. Once the cache has been warmed, explain analyze very consistently reports a runtime of 123ms for this query on master/HEAD, which varies +/- 1 ms, with a few outliers of maybe +/- 2ms. However, when I apply this patch, that goes down to 107ms +/- 1ms at -O0. Time 123ms which is without your change is with which optimization -O2 or O0? *** This e-mail and attachments contain confidential information from HUAWEI, which is intended only for the person or entity whose address is listed above. Any use of the information contained herein in any way (including, but not limited to, total or partial disclosure, reproduction, or dissemination) by persons other than the intended recipient's) is prohibited. If you receive this e-mail in error, please notify the sender by phone or email immediately and delete it! -Original Message- From: pgsql-hackers-ow...@postgresql.org [mailto:pgsql-hackers-ow...@postgresql.org] On Behalf Of Peter Geoghegan Sent: Tuesday, September 20, 2011 7:26 AM To: PG Hackers Subject: [HACKERS] Inlining comparators as a performance optimisation Recent discussions on the threads Double sorting split patch and CUDA sorting raised the possibility that there could be significant performance optimisation low-hanging fruit picked by having the executor treat integers and floats as a special case during sorting, avoiding going to the trouble of calling a comparator using the built-in SQL function machinery, and taking advantage of inlining of the comparator, which has been shown to have a considerable performance advantage (at least compared to a general purpose c stdlib qsort(), that takes a function pointer as its comparator, much like tuplesort). I've hacked together a sloppy POC implementation in a hurry (basically, some code is shifted around) , which is attached - I felt that it would be useful to determine if the community feels that this is a worth-while undertaking in advance of a business trip that I'm leaving on tomorrow lasting until Friday, during which I will be mostly unavailable. The patch breaks the Postgres sorting executor node (at least when it quicksorts) for any type other than int4. I apologise for how rough the patch is, but the code itself isn't important right now - the idea is. I anticipate that the value state-datumType or something similar will be set in a near future revision, so that tuplesort_performsort will know which type-specific optimisation it can use for the type, while falling back on the existing generic qsort_arg + qsort_arg_comparator, and sorting won't actually be broken. I've been doing some preliminary testing using the dell store 2 sample database. I increase work_mem to '50MB', to ensure that a quicksort will be performed for sorting (otherwise, I'm using the postgresql.conf that initdb gave me). The query is: explain analyze select * from orderlines order by prod_id; Once the cache has been warmed, explain analyze very consistently reports a runtime of 123ms for this query on master/HEAD, which varies +/- 1 ms, with a few outliers of maybe +/- 2ms. However, when I apply this patch, that goes down to 107ms +/- 1ms at -O0. I think that that's a pretty good start. Funnily enough, the difference/advantage vanishes at -O2 (I'm guessing that the higher optimisation level of GCC 4.5 hyper-corrects away the inlining, but I don't have time to check that right now). I imagine the version that I actually submit for patch review will have a macro-based infrastructure for inlining the sorting of various built-in types, initially integers and floats. It will most likely have some other optimisations - I haven't even used a profiler yet. This performance patch differs from most in that it's difficult in principle to imagine a performance regression occurring. Thoughts? -- Peter Geoghegan http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training and Services
Re: [HACKERS] Inlining comparators as a performance optimisation
On Wed, Sep 21, 2011 at 8:08 AM, Heikki Linnakangas heikki.linnakan...@enterprisedb.com wrote: How about almost every primary index creation? Nope. Swamped by everything else. Really? I think it's pretty common for shops to be able to dedicate large amounts of RAM to building initial indexes on data loads or reindex operations. Enough that they can cache the entire table for the short time they're doing the index builds even if they're quite large. Witness the recent pleas to allow maintenance_work_mem on the order of tens of gigabytes. And it's also pretty common that shops can dedicate very large I/O bandwidth, in many cases enough to saturate the memory bandwidth, for doing these kinds of batch operations when they get large enough to need to do an external sort. There's still overhead of reading the pages, the tuples, finding the sort keys in the tuple, etc. But I think the actual qsort or heap operations in tapesort are pretty big portions of the work. This is pretty easy to measure. Just run oprofile or gprof and see what percentage of time for a big index build is spent in qsort. -- greg -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Inlining comparators as a performance optimisation
On Wed, Sep 21, 2011 at 8:47 AM, Greg Stark st...@mit.edu wrote: On Wed, Sep 21, 2011 at 8:08 AM, Heikki Linnakangas heikki.linnakan...@enterprisedb.com wrote: How about almost every primary index creation? Nope. Swamped by everything else. Really? I think it's pretty common for shops to be able to dedicate large amounts of RAM to building initial indexes on data loads or reindex operations. Enough that they can cache the entire table for the short time they're doing the index builds even if they're quite large. Witness the recent pleas to allow maintenance_work_mem on the order of tens of gigabytes. And it's also pretty common that shops can dedicate very large I/O bandwidth, in many cases enough to saturate the memory bandwidth, for doing these kinds of batch operations when they get large enough to need to do an external sort. There's still overhead of reading the pages, the tuples, finding the sort keys in the tuple, etc. But I think the actual qsort or heap operations in tapesort are pretty big portions of the work. This is pretty easy to measure. Just run oprofile or gprof and see what percentage of time for a big index build is spent in qsort. +1 for some actual measurements. I don't think anyone on this thread is saying that if we can get big performance gains from doing this we still shouldn't do it. But at this point it's unclear that we can get a consistent speedup that isn't heavily dependent on the choice of compiler flags (to say nothing of compiler and OS), and even if we can, it's not clear that it will still be noticeable when you measure the run time of an entire query rather than just the speed of qsort(). Like Tom and Heikki, I'm a bit skeptical: it wouldn't surprise me to find out that qsort() is 5% of the runtime any realistic test case and the average qsort() speedup based on tests on a couple different platforms is 10% and so on average we're looking at a 0.5% improvement, in which case it might be more trouble than it's worth, especially if it turns out that there are OS/platform combinations where the inlined version is (for some crazy reason) slower. I've seen performance differences of up to 3% from minor code rearrangements that don't seem like they should matter at all, just because code and data shifts around across cache-line boundaries and the new arrangement is slightly better or worse than the old one. So if the performance improvement turns out to be very small, then validating that it actually IS an improvement in general is likely to be kind of a pain in the ass. On the other hand, the performance improvement might turn out to be large. Maybe there's a test case where, as Heikki suggests, 50% of the time is spent in qsort(). If we can reliably make that 25% faster, I wouldn't dismiss that out of hand; I think that would be pretty good, assuming it didn't require massive amounts of spaghetti code to make it work. I don't see that that would be any more marginal than the sorts of things we've optimized in, say, commit 4fc115b2e981f8c63165ca86a23215380a3fda66, or commit f4d242ef94730c447d87b9840a40b0ec3371fe0f. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Inlining comparators as a performance optimisation
On 21 September 2011 07:51, Heikki Linnakangas heikki.linnakan...@enterprisedb.com wrote: On 21.09.2011 02:53, Peter Geoghegan wrote: C stdlib quick-sort time elapsed: 2.092451 seconds Inline quick-sort time elapsed: 1.587651 seconds Does *that* look attractive to you? Not really, to be honest. That's a 25% speedup in pure qsorting speed. How much of a gain in a real query do you expect to get from that, in the best case? There's so many other sources of overhead that I'm afraid this will be lost in the noise. I'm surprised that you're dismissive of this. After all, we have in the past indulged in micro-optimisation of qsort, or so it would seem from this comment: * We have modified their original by adding a check for already-sorted input, * which seems to be a win per discussions on pgsql-hackers around 2006-03-21. Makes affected queries radically faster (In the best case, a speedup somewhat greater than 12.5%) is an unreasonably high standard for a performance optimisation of the executor in general (such a high standard might be sensible if it was due to a particular maintainability downside, but you didn't mention one). Even still, I think that the 12.5% figure is pretty pessimistic - I've already sped up the dell store query by almost that much, and that's with a patch that was, due to circumstances, cobbled together. Not only are we benefiting from the effects of inlining, we're also benefiting from the removal of unnecessary indirection. As Tom said, In concrete terms, there would be no reason to have tuplesort.c's myFunctionCall2Coll, and maybe not inlineApplySortFunction either, if the datatype-specific comparison functions had APIs that were closer to what sorting wants rather than following the general SQL-callable-function API. He was just referring to the benefits of removing indirection here, so ISTM that this is really two performance optimisations rolled into one - it's conceivable that the total performance improvement will even exceed the isolated inlining comparator benchmark. As I've said, I believe this patch can be committed without compromising the maintainability of the tuplesort code to an extent that is not clearly worth it, through the use of a clean, macro-based abstraction. Concerns about bloated binaries are probably not well founded, because what I'm proposing is to a certain extent emulating C++ templates, while using a very common pattern used with C++ templates. In the C++ world, algorithms are often generalised as templates, so that they can be used equally well with any datatype (that supports the interface of the template), while availing of compiler optimisations per template instantiation (instance of using a given type with a given template). I actually got the idea for this patch in part from a book that I read years ago that described the fact that counter-intuitively, std::sort() consistently outperforms qsort(), because the comparator is often inlined, and the compiler can generally avail of optimisations from knowing the comparator at compile-time. On 21 September 2011 13:47, Greg Stark st...@mit.edu wrote: This is pretty easy to measure. Just run oprofile or gprof and see what percentage of time for a big index build is spent in qsort. I'll do so soon. I intend to get to this on Friday evening, and perhaps have a proper patch to show next week. -- Peter Geoghegan http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training and Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Inlining comparators as a performance optimisation
On 21.09.2011 17:20, Peter Geoghegan wrote: Even still, I think that the 12.5% figure is pretty pessimistic - I've already sped up the dell store query by almost that much, and that's with a patch that was, due to circumstances, cobbled together. I'm not against making things faster, it's just that I haven't seen solid evidence yet that this will help. Just provide a best-case test case for this that shows a huge improvement, and I'll shut up. If the improvement is only modest, then let's discuss how big it is and whether it's worth the code ugliness this causes. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Inlining comparators as a performance optimisation
Heikki Linnakangas heikki.linnakan...@enterprisedb.com writes: On 21.09.2011 17:20, Peter Geoghegan wrote: Even still, I think that the 12.5% figure is pretty pessimistic - I've already sped up the dell store query by almost that much, and that's with a patch that was, due to circumstances, cobbled together. I'm not against making things faster, it's just that I haven't seen solid evidence yet that this will help. Just provide a best-case test case for this that shows a huge improvement, and I'll shut up. If the improvement is only modest, then let's discuss how big it is and whether it's worth the code ugliness this causes. The other question that I'm going to be asking is whether it's not possible to get most of the same improvement with a much smaller code footprint. I continue to suspect that getting rid of the SQL function impedance-match layer (myFunctionCall2Coll etc) would provide most of whatever gain is to be had here, without nearly as large a cost in code size and maintainability, and with the extra benefit that the speedup would also be available to non-core datatypes. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Inlining comparators as a performance optimisation
On 09/21/2011 10:50 AM, Tom Lane wrote: The other question that I'm going to be asking is whether it's not possible to get most of the same improvement with a much smaller code footprint. I continue to suspect that getting rid of the SQL function impedance-match layer (myFunctionCall2Coll etc) would provide most of whatever gain is to be had here, without nearly as large a cost in code size and maintainability, and with the extra benefit that the speedup would also be available to non-core datatypes. Can we get a patch so we can do benchmarks on this? cheers andrew -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Inlining comparators as a performance optimisation
Simon Riggs si...@2ndquadrant.com writes: This is a marvellous win, a huge gain from a small, isolated and easily tested change. By far the smallest amount of additional code to sorting we will have added and yet one of the best gains. I think you forgot your cheerleader uniform. A patch along these lines is not going to be small, isolated, easily maintained, nor beneficial for any but a small number of predetermined datatypes. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Inlining comparators as a performance optimisation
On 21 September 2011 15:50, Tom Lane t...@sss.pgh.pa.us wrote: Heikki Linnakangas heikki.linnakan...@enterprisedb.com writes: I'm not against making things faster, it's just that I haven't seen solid evidence yet that this will help. Just provide a best-case test case for this that shows a huge improvement, and I'll shut up. If the improvement is only modest, then let's discuss how big it is and whether it's worth the code ugliness this causes. Fair enough. The other question that I'm going to be asking is whether it's not possible to get most of the same improvement with a much smaller code footprint. That's a reasonable question, and I hope to be able to come up with a good answer. I continue to suspect that getting rid of the SQL function impedance-match layer (myFunctionCall2Coll etc) would provide most of whatever gain is to be had here, without nearly as large a cost in code size and maintainability, and with the extra benefit that the speedup would also be available to non-core datatypes. I'm fairly surprised that your view on that is mostly or entirely unchanged, even after I've demonstrated a considerable performance advantage from a macro-based qsort implementation over my OS vendor's c std lib qsort(), using an isolated test-case, that does not have anything to do with that impedance mismatch. I'm not sure why you doubt that the same thing is happening within tuplesort. -- Peter Geoghegan http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training and Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Inlining comparators as a performance optimisation
Andrew Dunstan and...@dunslane.net writes: On 09/21/2011 10:50 AM, Tom Lane wrote: The other question that I'm going to be asking is whether it's not possible to get most of the same improvement with a much smaller code footprint. I continue to suspect that getting rid of the SQL function impedance-match layer (myFunctionCall2Coll etc) would provide most of whatever gain is to be had here, without nearly as large a cost in code size and maintainability, and with the extra benefit that the speedup would also be available to non-core datatypes. Can we get a patch so we can do benchmarks on this? Well, we'd have to negotiate what the API ought to be. What I'm envisioning is that datatypes could provide alternate comparison functions that are designed to be qsort-callable rather than SQL-callable. As such, they could not have entries in pg_proc, so it seems like there's no ready way to represent them in the catalogs. The idea that I was toying with was to allow the regular SQL-callable comparison function to somehow return a function pointer to the alternate comparison function, so that the first comparison in a given sort run would be done the traditional way but then we'd notice the provided function pointer and start using that. It would not be too hard to pass back the pointer using FunctionCallInfoData.context, say. The downside is adding cycles to unoptimized cases to uselessly check for a returned function pointer that's not there. Perhaps it could be hacked so that we only add cycles to the very first call, but I've not looked closely at the code to see what would be involved. Has anyone got a better idea for getting hold of the alternate function? regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Inlining comparators as a performance optimisation
On 21.09.2011 18:46, Tom Lane wrote: Andrew Dunstanand...@dunslane.net writes: On 09/21/2011 10:50 AM, Tom Lane wrote: The other question that I'm going to be asking is whether it's not possible to get most of the same improvement with a much smaller code footprint. I continue to suspect that getting rid of the SQL function impedance-match layer (myFunctionCall2Coll etc) would provide most of whatever gain is to be had here, without nearly as large a cost in code size and maintainability, and with the extra benefit that the speedup would also be available to non-core datatypes. Can we get a patch so we can do benchmarks on this? Well, we'd have to negotiate what the API ought to be. What I'm envisioning is that datatypes could provide alternate comparison functions that are designed to be qsort-callable rather than SQL-callable. As such, they could not have entries in pg_proc, so it seems like there's no ready way to represent them in the catalogs. The idea that I was toying with was to allow the regular SQL-callable comparison function to somehow return a function pointer to the alternate comparison function, so that the first comparison in a given sort run would be done the traditional way but then we'd notice the provided function pointer and start using that. It would not be too hard to pass back the pointer using FunctionCallInfoData.context, say. The downside is adding cycles to unoptimized cases to uselessly check for a returned function pointer that's not there. Perhaps it could be hacked so that we only add cycles to the very first call, but I've not looked closely at the code to see what would be involved. You could have a new function with a pg_proc entry, that just returns a function pointer to the qsort-callback. Or maybe the interface should be an even more radical replacement of qsort, not just the comparison function. Instead of calling qsort, tuplesort.c would call the new datatype-specific sort-function (which would be in pg_proc). The implementation could use an inlined version of qsort, like Peter is suggesting, or it could do something completely different, like a radix sort or a GPU-assisted sort or whatever. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Inlining comparators as a performance optimisation
On Wed, Sep 21, 2011 at 4:46 PM, Tom Lane t...@sss.pgh.pa.us wrote: As such, they could not have entries in pg_proc, so it seems like there's no ready way to represent them in the catalogs. Why couldn't they be in pg_proc with a bunch of opaque arguments like the GIST opclass support functions? I'm a bit puzzled what the arguments would look like. They would still need to know the collation, nulls first/last flags, etc. And calling it would still not be inlinable. So they would have to check those flags on each invocation instead of having a piece of straightline code that hard codes the behaviour with the right behaviour inline. ISTM the hope for a speedup from the inlining mostly came from the idea that the compiler might be able to hoist this logic outside the loop (and I suppose implement n specialized loops depending on the behaviour needed). -- greg -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Inlining comparators as a performance optimisation
Heikki Linnakangas heikki.linnakan...@enterprisedb.com writes: On 21.09.2011 18:46, Tom Lane wrote: The idea that I was toying with was to allow the regular SQL-callable comparison function to somehow return a function pointer to the alternate comparison function, You could have a new function with a pg_proc entry, that just returns a function pointer to the qsort-callback. Yeah, possibly. That would be a much more invasive change, but cleaner in some sense. I'm not really prepared to do all the legwork involved in that just to get to a performance-testable patch though. Or maybe the interface should be an even more radical replacement of qsort, not just the comparison function. Instead of calling qsort, tuplesort.c would call the new datatype-specific sort-function (which would be in pg_proc). The implementation could use an inlined version of qsort, like Peter is suggesting, or it could do something completely different, like a radix sort or a GPU-assisted sort or whatever. No. In the first place, that only helps for in-memory sorts. In the second, it would absolutely destroy our ability to change the behavior of sorting ever again. Considering that we've added ASC/DESC, NULLS FIRST/LAST, and collation support over the years, are you really prepared to bet that the sort code will never need any more feature upgrades? (This concern is in fact the source of my beef with the whole inlining proposal to begin with, but allowing the inlining to occur into third-party code that we don't control at all would be a hundred times worse.) regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Inlining comparators as a performance optimisation
On 21.09.2011 18:46, Tom Lane wrote: Well, we'd have to negotiate what the API ought to be. What I'm envisioning is that datatypes could provide alternate comparison functions that are designed to be qsort-callable rather than SQL-callable. As such, they could not have entries in pg_proc, so it seems like there's no ready way to represent them in the catalogs. Quite aside from this qsort-thing, it would be nice to have versions of all simple functions that could be called without the FunctionCall overhead. So instead of: FunctionCall2(flinfo_for_int4pl, 1, 2) you could do simply int4pl_fastpath(1,2) I'm not sure how big an effect this would have, but it seems like it could shave some cycles across the system. We could have an extended version of the PG_FUNCTION_INFO_V1 macro that would let you register the fastpath function: PG_FUNCTION_INFO_V1(int4pl, int4pl_fastpath); -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Inlining comparators as a performance optimisation
Greg Stark st...@mit.edu writes: On Wed, Sep 21, 2011 at 4:46 PM, Tom Lane t...@sss.pgh.pa.us wrote: As such, they could not have entries in pg_proc, so it seems like there's no ready way to represent them in the catalogs. Why couldn't they be in pg_proc with a bunch of opaque arguments like the GIST opclass support functions? That does not mean the same thing at all. Everything in pg_proc is meant to be called through the V0 or V1 function call info protocols. I'm a bit puzzled what the arguments would look like. They would still need to know the collation, nulls first/last flags, etc. No, I wasn't thinking that we should do that. The datatype comparison functions should have the exact same semantics they do now, just a lower-overhead call mechanism. If you try to push stuff like NULLS FIRST/LAST into the per-datatype code, then you are up against a problem when you want to add a new flag: you have to touch lots of code not all of which you even control. And calling it would still not be inlinable. So they would have to check those flags on each invocation instead of having a piece of straightline code that hard codes the behaviour with the right behaviour inline. ISTM the hope for a speedup from the inlining mostly came from the idea that the compiler might be able to hoist this logic outside the loop (and I suppose implement n specialized loops depending on the behaviour needed). None of that stuff is inlinable or constant-foldable today, nor would it be with the patch that Peter was proposing AFAICS, because none of the flags will ever be compile time constant values. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Inlining comparators as a performance optimisation
Heikki Linnakangas heikki.linnakan...@enterprisedb.com writes: On 21.09.2011 18:46, Tom Lane wrote: Well, we'd have to negotiate what the API ought to be. What I'm envisioning is that datatypes could provide alternate comparison functions that are designed to be qsort-callable rather than SQL-callable. As such, they could not have entries in pg_proc, so it seems like there's no ready way to represent them in the catalogs. Quite aside from this qsort-thing, it would be nice to have versions of all simple functions that could be called without the FunctionCall overhead. Hmm, that's an interesting idea. I think probably the important aspects are (1) known number of arguments and (2) no null argument or result values are allowed. Not sure what we'd do with collations though. We could have an extended version of the PG_FUNCTION_INFO_V1 macro that would let you register the fastpath function: PG_FUNCTION_INFO_V1(int4pl, int4pl_fastpath); We don't use PG_FUNCTION_INFO_V1 for built-in functions ... regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Inlining comparators as a performance optimisation
On Wed, Sep 21, 2011 at 5:23 PM, Tom Lane t...@sss.pgh.pa.us wrote: None of that stuff is inlinable or constant-foldable today, nor would it be with the patch that Peter was proposing AFAICS, because none of the flags will ever be compile time constant values. I was referring to transformations like this which I believe compilers are already capable of doing: v = ...; while (...) if (v) { if (a b) ... else ; } else { if (a b) ... else ...; } turning it into code that looks like: if (v) { while () if (ab) ... else ...; } else { while () if (ab) ... else ...; } which may not look like much -- especially with branch prediction -- but then it's much more likely to be able to unroll the loop and do clever instruction scheduling and so on than if there's an extra branch in the middle of the loop. But if there's a function call to an external function called through a function pointer in the middle of the loop then the whole endeavour would be for naught. -- greg -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Inlining comparators as a performance optimisation
On Wed, Sep 21, 2011 at 4:22 PM, Tom Lane t...@sss.pgh.pa.us wrote: Simon Riggs si...@2ndquadrant.com writes: This is a marvellous win, a huge gain from a small, isolated and easily tested change. By far the smallest amount of additional code to sorting we will have added and yet one of the best gains. I think you forgot your cheerleader uniform. LOL. I'm happy whoever and whenever we get large wins like that. Go Postgres! A patch along these lines is not going to be small, isolated, easily maintained, nor beneficial for any but a small number of predetermined datatypes. That was the starting premise. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Inlining comparators as a performance optimisation
On 20 September 2011 03:51, Tom Lane t...@sss.pgh.pa.us wrote: Considering that -O2 is our standard optimization level, that observation seems to translate to this patch will be useless in practice. I think you had better investigate that aspect in some detail before spending more effort. I don't think that the fact that that happens is at all significant at this early stage, and it never even occurred to me that you'd think that it might be. I was simply disclosing a quirk of this POC patch. The workaround is probably to use a macro instead. For the benefit of those that didn't follow the other threads, the macro-based qsort implementation, which I found to perform significantly better than regular qsort(), runs like this on my laptop when I built at 02 with GCC 4.6 just now: C stdlib quick-sort time elapsed: 2.092451 seconds Inline quick-sort time elapsed: 1.587651 seconds Does *that* look attractive to you? I've attached source code of the program that produced these figures, which has been ported to C from C++. When I #define LARGE_SIZE 1, here's what I see: [peter@peter inline_compar_test]$ ./a.out C stdlib quick-sort time elapsed: 23.659411 seconds Inline quick-sort time elapsed: 18.470611 seconds Here, sorting with the function pointer/stdlib version takes about 1.28 times as long. In the prior test (with the smaller LARGE_SIZE), it took about 1.32 times as long. Fairly predictable, linear, and not to be sniffed at. The variance I'm seeing across runs is low - a couple of hundredths of a second at most. This is a Fedora 15 Intel(R) Core(TM) i5-2540M CPU @ 2.60GHz machine. I'm not sure right now why the inline quick-sort is less of a win than on my old Fedora 14 desktop (where it was 3.24 Vs 2.01), but it's still a significant win. Perhaps others can build this simple program and tell me what they come up with. This performance patch differs from most in that it's difficult in principle to imagine a performance regression occurring. Really? N copies of the same code could lead to performance loss just due to code bloat (ie, less of a query's inner loops fitting in CPU cache). I did consider that. Of course inlining has an overhead, and I'll be testing that each instance of inlining has a net benefit. I just meant that many other performance patches have an obvious worst case, and I think that it is policy to focus on that case, but I can't think of one here. Does anyone else have any ideas? Not to mention the clear regression in maintainability. So I'm disinclined to consider this sort of change without a significantly bigger win than you're suggesting above Sure, there'll be some sort of regression in maintainability - I think that HOT had a clear regression in maintainability too. The important questions are obviously how big is the loss of maintainability?, and is it worth it?. We'll know more when this work is actually shaped into a proper patch. Perhaps I should have waited until I had something along those lines before making an announcement, but I wanted community input as early as possible. I think that there's plenty of tweaking that can be done to get additional performance improvements - all I've done so far is demonstrate that those improvements are real and worth thinking about, in the fastest possible way, partly because you expressed skepticism of the benefits of inlining comparators to Greg Stark in an earlier thread. Performance and maintainability are often somewhat in tension, but we cannot ignore performance. If this work can bring us an improvement in performance approaching the isolated macro Vs qsort() function pointer benchmark, that's a *big* win. Sorting integers and floats is very common and important. -- Peter Geoghegan http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training and Services inline_compar_test.tar.gz Description: GNU Zip compressed data -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Inlining comparators as a performance optimisation
- Цитат от Peter Geoghegan (pe...@2ndquadrant.com), на 21.09.2011 в 02:53 - On 20 September 2011 03:51, Tom Lane t...@sss.pgh.pa.us wrote: Considering that -O2 is our standard optimization level, that observation seems to translate to this patch will be useless in practice. I think you had better investigate that aspect in some detail before spending more effort. I don't think that the fact that that happens is at all significant at this early stage, and it never even occurred to me that you'd think that it might be. I was simply disclosing a quirk of this POC patch. The workaround is probably to use a macro instead. For the benefit of those that didn't follow the other threads, the macro-based qsort implementation, which I found to perform significantly better than regular qsort(), runs like this on my laptop when I built at 02 with GCC 4.6 just now: C stdlib quick-sort time elapsed: 2.092451 seconds Inline quick-sort time elapsed: 1.587651 seconds Does *that* look attractive to you? I've attached source code of the program that produced these figures, which has been ported to C from C++. When I #define LARGE_SIZE 1, here's what I see: [peter@peter inline_compar_test]$ ./a.out C stdlib quick-sort time elapsed: 23.659411 seconds Inline quick-sort time elapsed: 18.470611 seconds Here, sorting with the function pointer/stdlib version takes about 1.28 times as long. In the prior test (with the smaller LARGE_SIZE), it took about 1.32 times as long. Fairly predictable, linear, and not to be sniffed at. The variance I'm seeing across runs is low - a couple of hundredths of a second at most. This is a Fedora 15 Intel(R) Core(TM) i5-2540M CPU @ 2.60GHz machine. I'm not sure right now why the inline quick-sort is less of a win than on my old Fedora 14 desktop (where it was 3.24 Vs 2.01), but it's still a significant win. Perhaps others can build this simple program and tell me what they come up with. Run it here. Intel(R) Core(TM)2 Duo CPU E8200 @ 2.66GHz gcc version 4.6.1 (Debian 4.6.1-10) g++ -O2 qsort-inline-benchmark.c ./a.out C stdlib quick-sort time elapsed: 1.942686 seconds Inline quick-sort time elapsed: 1.126508 seconds With #define LARGE_SIZE 1 C stdlib quick-sort time elapsed: 22.158207 seconds Inline quick-sort time elapsed: 12.861018 seconds with g++ -O0 C stdlib quick-sort time elapsed: 2.736360 seconds Inline quick-sort time elapsed: 2.045619 seconds On server hardware: Intel(R) Xeon(R) CPU E5405 @ 2.00GHz gcc version 4.4.5 (Debian 4.4.5-8) /a.out C stdlib quick-sort time elapsed: 2.610150 seconds Inline quick-sort time elapsed: 1.494198 seconds All -O2 version show 42% speedup with inlined qsort. -O0 showed 25% speedup. Best regards -- Luben Karavelov
[HACKERS] Inlining comparators as a performance optimisation
Recent discussions on the threads Double sorting split patch and CUDA sorting raised the possibility that there could be significant performance optimisation low-hanging fruit picked by having the executor treat integers and floats as a special case during sorting, avoiding going to the trouble of calling a comparator using the built-in SQL function machinery, and taking advantage of inlining of the comparator, which has been shown to have a considerable performance advantage (at least compared to a general purpose c stdlib qsort(), that takes a function pointer as its comparator, much like tuplesort). I've hacked together a sloppy POC implementation in a hurry (basically, some code is shifted around) , which is attached - I felt that it would be useful to determine if the community feels that this is a worth-while undertaking in advance of a business trip that I'm leaving on tomorrow lasting until Friday, during which I will be mostly unavailable. The patch breaks the Postgres sorting executor node (at least when it quicksorts) for any type other than int4. I apologise for how rough the patch is, but the code itself isn't important right now - the idea is. I anticipate that the value state-datumType or something similar will be set in a near future revision, so that tuplesort_performsort will know which type-specific optimisation it can use for the type, while falling back on the existing generic qsort_arg + qsort_arg_comparator, and sorting won't actually be broken. I've been doing some preliminary testing using the dell store 2 sample database. I increase work_mem to '50MB', to ensure that a quicksort will be performed for sorting (otherwise, I'm using the postgresql.conf that initdb gave me). The query is: explain analyze select * from orderlines order by prod_id; Once the cache has been warmed, explain analyze very consistently reports a runtime of 123ms for this query on master/HEAD, which varies +/- 1 ms, with a few outliers of maybe +/- 2ms. However, when I apply this patch, that goes down to 107ms +/- 1ms at -O0. I think that that's a pretty good start. Funnily enough, the difference/advantage vanishes at -O2 (I'm guessing that the higher optimisation level of GCC 4.5 hyper-corrects away the inlining, but I don't have time to check that right now). I imagine the version that I actually submit for patch review will have a macro-based infrastructure for inlining the sorting of various built-in types, initially integers and floats. It will most likely have some other optimisations - I haven't even used a profiler yet. This performance patch differs from most in that it's difficult in principle to imagine a performance regression occurring. Thoughts? -- Peter Geoghegan http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training and Services diff --git a/src/backend/utils/sort/tuplesort.c b/src/backend/utils/sort/tuplesort.c index 3505236..c5ac708 100644 --- a/src/backend/utils/sort/tuplesort.c +++ b/src/backend/utils/sort/tuplesort.c @@ -1224,6 +1224,285 @@ puttuple_common(Tuplesortstate *state, SortTuple *tuple) } } +inline int +compare_int4( Datum first, Datum second) +{ + int32 a_is = DatumGetInt32(first); + int32 b_is = DatumGetInt32(second); + + if (a_is b_is) + return 1; + else if (a_is == b_is) + return 0; + else + return -1; +} + + +/* + * Inline-able copy of FunctionCall2Coll() to save some cycles in sorting. + */ +static inline Datum +myFunctionCall2Coll(FmgrInfo *flinfo, Oid collation, Datum arg1, Datum arg2) +{ + FunctionCallInfoData fcinfo; + Datum result; + + InitFunctionCallInfoData(fcinfo, flinfo, 2, collation, NULL, NULL); + + fcinfo.arg[0] = arg1; + fcinfo.arg[1] = arg2; + fcinfo.argnull[0] = false; + fcinfo.argnull[1] = false; + + result = FunctionCallInvoke(fcinfo); + + /* Check for null result, since caller is clearly not expecting one */ + if (fcinfo.isnull) + elog(ERROR, function %u returned NULL, fcinfo.flinfo-fn_oid); + + return result; +} +/* + * Apply a sort function (by now converted to fmgr lookup form) + * and return a 3-way comparison result. This takes care of handling + * reverse-sort and NULLs-ordering properly. We assume that DESC and + * NULLS_FIRST options are encoded in sk_flags the same way btree does it. + */ +static inline int32 +inlineApplySortFunction(FmgrInfo *sortFunction, int sk_flags, Oid collation, + Datum datum1, bool isNull1, + Datum datum2, bool isNull2) +{ + int32 compare; + + if (isNull1) + { + if (isNull2) + compare = 0; /* NULL = NULL */ + else if (sk_flags SK_BT_NULLS_FIRST) + compare = -1; /* NULL NOT_NULL */ + else + compare = 1; /* NULL NOT_NULL */ + } + else if (isNull2) + { + if (sk_flags SK_BT_NULLS_FIRST) + compare = 1; /* NOT_NULL NULL */ + else + compare = -1; /* NOT_NULL NULL */ + } + else + { + compare = compare_int4(datum1, datum2); + + if (sk_flags SK_BT_DESC) + compare = -compare; + } + + return compare; +} + + + +inline int
Re: [HACKERS] Inlining comparators as a performance optimisation
Peter Geoghegan pe...@2ndquadrant.com writes: Once the cache has been warmed, explain analyze very consistently reports a runtime of 123ms for this query on master/HEAD, which varies +/- 1 ms, with a few outliers of maybe +/- 2ms. However, when I apply this patch, that goes down to 107ms +/- 1ms at -O0. I think that that's a pretty good start. Funnily enough, the difference/advantage vanishes at -O2 (I'm guessing that the higher optimisation level of GCC 4.5 hyper-corrects away the inlining, but I don't have time to check that right now). Considering that -O2 is our standard optimization level, that observation seems to translate to this patch will be useless in practice. I think you had better investigate that aspect in some detail before spending more effort. This performance patch differs from most in that it's difficult in principle to imagine a performance regression occurring. Really? N copies of the same code could lead to performance loss just due to code bloat (ie, less of a query's inner loops fitting in CPU cache). Not to mention the clear regression in maintainability. So I'm disinclined to consider this sort of change without a significantly bigger win than you're suggesting above (no, I don't even consider the -O0 number attractive, let alone what you're finding at -O2). regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers