Re: [HACKERS] B-Tree support function number 3 (strxfrm() optimization)
On Wed, Jan 21, 2015 at 2:22 AM, Peter Geoghegan p...@heroku.com wrote: You'll probably prefer the attached. This patch works by disabling abbreviation, but only after writing out runs, with the final merge left to go. That way, it doesn't matter when abbreviated keys are not read back from disk (or regenerated). Yes, this seems like the way to go for now. Thanks, committed. And thanks to Andrew for spotting this so quickly. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] B-Tree support function number 3 (strxfrm() optimization)
On Fri, Jan 23, 2015 at 2:18 AM, David Rowley dgrowle...@gmail.com wrote: On 20 January 2015 at 17:10, Peter Geoghegan p...@heroku.com wrote: On Mon, Jan 19, 2015 at 7:47 PM, Michael Paquier michael.paqu...@gmail.com wrote: With your patch applied, the failure with MSVC disappeared, but there is still a warning showing up: (ClCompile target) - src\backend\lib\hyperloglog.c(73): warning C4334: '' : result of 32-bit shift implicitly converted to 64 bits (was 64-bit shift intended? That seems harmless. I suggest an explicit cast to Size here. This caught my eye too. I agree about casting to Size. Patch attached. Thanks, committed. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] B-Tree support function number 3 (strxfrm() optimization)
On 20 January 2015 at 17:10, Peter Geoghegan p...@heroku.com wrote: On Mon, Jan 19, 2015 at 7:47 PM, Michael Paquier michael.paqu...@gmail.com wrote: With your patch applied, the failure with MSVC disappeared, but there is still a warning showing up: (ClCompile target) - src\backend\lib\hyperloglog.c(73): warning C4334: '' : result of 32-bit shift implicitly converted to 64 bits (was 64-bit shift intended? That seems harmless. I suggest an explicit cast to Size here. This caught my eye too. I agree about casting to Size. Patch attached. Regards David Rowley hyperloglog_bitshift_fix.patch Description: Binary data -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] B-Tree support function number 3 (strxfrm() optimization)
Peter == Peter Geoghegan p...@heroku.com writes: Peter You'll probably prefer the attached. This patch works by Peter disabling abbreviation, but only after writing out runs, with Peter the final merge left to go. That way, it doesn't matter when Peter abbreviated keys are not read back from disk (or regenerated). This seems tolerable to me for a quick fix. The merits of storing the abbreviation vs. re-abbreviating on input can be studied later. Peter I believe this bug was missed because it only occurs when there Peter are multiple runs, and not in the common case where there is one Peter big initial run that is found already sorted when we reach Peter mergeruns(). Ah, yes, there is an optimization for the one-run case which bypasses all further comparisons, hiding the problem. -- Andrew (irc:RhodiumToad) -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] B-Tree support function number 3 (strxfrm() optimization)
Peter == Peter Geoghegan p...@heroku.com writes: Peter Basically, the intersection of the datum sort case with Peter abbreviated keys seems complicated. Not to me. To me it seems completely trivial. Now, I follow this general principle that someone who is not doing the work should never say X is easy to someone who _is_ doing it, unless they're prepared to at least outline the solution on request or otherwise contribute. So see the attached patch (which I will concede could probably do with more comments, it's a quick hack intended for illustration) and tell me what you think is missing that would make it a complicated problem. Peter I tended to think that the solution was to force a heaptuple Peter sort instead (where abbreviation naturally can be used), This seems completely wrong - why should the caller have to worry about this implementation detail? The caller shouldn't have to know about what types or what circumstances might or might not benefit from abbreviation. -- Andrew (irc:RhodiumToad) diff --git a/src/backend/utils/sort/tuplesort.c b/src/backend/utils/sort/tuplesort.c index b6e302f..0dbb277 100644 --- a/src/backend/utils/sort/tuplesort.c +++ b/src/backend/utils/sort/tuplesort.c @@ -901,30 +901,34 @@ tuplesort_begin_datum(Oid datumType, Oid sortOperator, Oid sortCollation, state-copytup = copytup_datum; state-writetup = writetup_datum; state-readtup = readtup_datum; + state-abbrevNext = 10; state-datumType = datumType; - /* Prepare SortSupport data */ - state-onlyKey = (SortSupport) palloc0(sizeof(SortSupportData)); - - state-onlyKey-ssup_cxt = CurrentMemoryContext; - state-onlyKey-ssup_collation = sortCollation; - state-onlyKey-ssup_nulls_first = nullsFirstFlag; - /* - * Conversion to abbreviated representation infeasible in the Datum case. - * It must be possible to subsequently fetch original datum values within - * tuplesort_getdatum(), which would require special-case preservation of - * original values. - */ - state-onlyKey-abbreviate = false; - - PrepareSortSupportFromOrderingOp(sortOperator, state-onlyKey); - /* lookup necessary attributes of the datum type */ get_typlenbyval(datumType, typlen, typbyval); state-datumTypeLen = typlen; state-datumTypeByVal = typbyval; + /* Prepare SortSupport data */ + state-sortKeys = (SortSupport) palloc0(sizeof(SortSupportData)); + + state-sortKeys-ssup_cxt = CurrentMemoryContext; + state-sortKeys-ssup_collation = sortCollation; + state-sortKeys-ssup_nulls_first = nullsFirstFlag; + state-sortKeys-abbreviate = !typbyval; + + PrepareSortSupportFromOrderingOp(sortOperator, state-sortKeys); + + /* + * The onlyKey optimization cannot be used with abbreviated keys, since + * tie-breaker comparisons may be required. Typically, the optimization is + * only of value to pass-by-value types anyway, whereas abbreviated keys + * are typically only of value to pass-by-reference types. + */ + if (!state-sortKeys-abbrev_converter) + state-onlyKey = state-sortKeys; + MemoryContextSwitchTo(oldcontext); return state; @@ -1318,10 +1322,43 @@ tuplesort_putdatum(Tuplesortstate *state, Datum val, bool isNull) } else { - stup.datum1 = datumCopy(val, false, state-datumTypeLen); + Datum original = datumCopy(val, false, state-datumTypeLen); stup.isnull1 = false; - stup.tuple = DatumGetPointer(stup.datum1); + stup.tuple = DatumGetPointer(original); USEMEM(state, GetMemoryChunkSpace(stup.tuple)); + + if (!state-sortKeys-abbrev_converter) + { + stup.datum1 = original; + } + else if (!consider_abort_common(state)) + { + /* Store abbreviated key representation */ + stup.datum1 = state-sortKeys-abbrev_converter(original, + state-sortKeys); + } + else + { + /* Abort abbreviation */ + int i; + + stup.datum1 = original; + + /* + * Set state to be consistent with never trying abbreviation. + * + * Alter datum1 representation in already-copied tuples, so as to + * ensure a consistent representation (current tuple was just handled). + * Note that we rely on all tuples copied so far actually being + * contained within memtuples array. + */ + for (i = 0; i state-memtupcount; i++) + { +SortTuple *mtup = state-memtuples[i]; + +mtup-datum1 = PointerGetDatum(mtup-tuple); + } + } } puttuple_common(state, stup); @@ -1887,9 +1924,9 @@ tuplesort_getdatum(Tuplesortstate *state, bool forward, else { if (should_free) - *val = stup.datum1; + *val = PointerGetDatum(stup.tuple); else - *val = datumCopy(stup.datum1, false, state-datumTypeLen); + *val = datumCopy(PointerGetDatum(stup.tuple), false, state-datumTypeLen); *isNull = false; } @@ -3712,9 +3749,22 @@ readtup_index(Tuplesortstate *state, SortTuple *stup, static int comparetup_datum(const SortTuple *a, const SortTuple *b, Tuplesortstate *state) { - return ApplySortComparator(a-datum1, a-isnull1, - b-datum1, b-isnull1, - state-onlyKey); + int compare; + +
Re: [HACKERS] B-Tree support function number 3 (strxfrm() optimization)
Robert == Robert Haas robertmh...@gmail.com writes: Robert All right, it seems Tom is with you on that point, so after Robert some study, I've committed this with very minor modifications. This caught my eye (thanks to conflict with GS patch): * In the future, we should consider forcing the * tuplesort_begin_heap() case when the abbreviated key * optimization can thereby be used, even when numInputs is 1. The comment in tuplesort_begin_datum that abbreviation can't be used seems wrong to me; why is the copy of the original value pointed to by stup-tuple (in the case of by-reference types, and abbreviation is obviously not needed for by-value types) not sufficient? Or what am I missing? -- Andrew (irc:RhodiumToad) -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] B-Tree support function number 3 (strxfrm() optimization)
On Mon, Jan 19, 2015 at 9:29 PM, Peter Geoghegan p...@heroku.com wrote: I think that the attached patch should at least fix that much. Maybe the problem on the other animal is also explained by the lack of this, since there could also be a MinGW-ish strxfrm_l(), I suppose. Committed that, rather blindly, since it looks like a reasonable fix. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] B-Tree support function number 3 (strxfrm() optimization)
On Tue, Jan 20, 2015 at 3:34 PM, Peter Geoghegan p...@heroku.com wrote: On Tue, Jan 20, 2015 at 3:34 PM, Robert Haas robertmh...@gmail.com wrote: Dear me. Peter, can you fix this RSN? Investigating. It's certainly possible to fix Andrew's test case with the attached. I'm not sure that that's the appropriate fix, though: there is probably a case to be made for not bothering with abbreviation once we've read tuples in for the final merge run. More likely, the strongest case is for storing the abbreviated keys on disk too, and reading those back. -- Peter Geoghegan diff --git a/src/backend/utils/sort/tuplesort.c b/src/backend/utils/sort/tuplesort.c index 6d3aa88..adf4c4d 100644 --- a/src/backend/utils/sort/tuplesort.c +++ b/src/backend/utils/sort/tuplesort.c @@ -3148,6 +3148,7 @@ readtup_heap(Tuplesortstate *state, SortTuple *stup, MinimalTuple tuple = (MinimalTuple) palloc(tuplen); char *tupbody = (char *) tuple + MINIMAL_TUPLE_DATA_OFFSET; HeapTupleData htup; + Datum original; USEMEM(state, GetMemoryChunkSpace(tuple)); /* read in the tuple proper */ @@ -3161,10 +3162,29 @@ readtup_heap(Tuplesortstate *state, SortTuple *stup, /* set up first-column key value */ htup.t_len = tuple-t_len + MINIMAL_TUPLE_OFFSET; htup.t_data = (HeapTupleHeader) ((char *) tuple - MINIMAL_TUPLE_OFFSET); - stup-datum1 = heap_getattr(htup, -state-sortKeys[0].ssup_attno, -state-tupDesc, -stup-isnull1); + /* Once again, store abbreviated key representation */ + original = heap_getattr(htup, + state-sortKeys[0].ssup_attno, + state-tupDesc, + stup-isnull1); + + if (!state-sortKeys-abbrev_converter || stup-isnull1) + { + /* + * Store ordinary Datum representation, or NULL value. If there is a + * converter it won't expect NULL values, and cost model is not + * required to account for NULL, so in that case we avoid calling + * converter and just set datum1 to void representation (to be + * consistent). + */ + stup-datum1 = original; + } + else + { + /* Store abbreviated key representation */ + stup-datum1 = state-sortKeys-abbrev_converter(original, + state-sortKeys); + } } /* -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] B-Tree support function number 3 (strxfrm() optimization)
On Tue, Jan 20, 2015 at 6:27 PM, Andrew Gierth and...@tao11.riddles.org.uk wrote: Robert == Robert Haas robertmh...@gmail.com writes: Robert All right, it seems Tom is with you on that point, so after Robert some study, I've committed this with very minor modifications. While hacking up a patch to demonstrate the simplicity of extending this to the Datum sorter, I seem to have run into a fairly major issue with this: there seems to be no attempt whatsoever to handle spilling to disk correctly. The data spilled to disk only has the un-abbreviated values, but nothing tries to re-abbreviate it (or disable abbreviations) when it is read back in, and chaos ensues: Dear me. Peter, can you fix this RSN? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] B-Tree support function number 3 (strxfrm() optimization)
On Tue, Jan 20, 2015 at 3:46 AM, Andrew Gierth and...@tao11.riddles.org.uk wrote: The comment in tuplesort_begin_datum that abbreviation can't be used seems wrong to me; why is the copy of the original value pointed to by stup-tuple (in the case of by-reference types, and abbreviation is obviously not needed for by-value types) not sufficient? We haven't formalized the idea that pass-by-value types are not targets for abbreviation (it's just that the practical application of abbreviated keys is likely to be limited to pass-by-reference types, generating a compact pass-by-value abbreviated representation). That could be a useful restriction to formalize, and certainly seems likely to be a harmless one, but for now that's the way it is. It might be sufficient for some tuplesort_begin_datum() callers. Datum tuple sorts require the original values. Aside from the formalization of abbreviation only applying to pass-by-value types, you'd have to teach tuplesort_getdatum() to reconstruct the non-abbreviated representation transparently from each SortTuple's tuple proper. However, the actual tuplesort_getdatum() calls could be the dominant cost, not the sort (I'm not sure of that offhand - that's total speculation). Basically, the intersection of the datum sort case with abbreviated keys seems complicated. I tended to think that the solution was to force a heaptuple sort instead (where abbreviation naturally can be used), since clearly that could work in some important cases like nodeAgg.c, iff the gumption to do it that way was readily available. Rightly or wrongly, I preferred that idea to the idea of teaching the Datum case to handle abbreviation across the board. Maybe that's the wrong way of fixing that, but for now I don't think it's acceptable that abbreviation isn't always used in certain cases where it could make sense (e.g. not for simple GroupAggregates with a single attribute -- only multiple attribute GroupAggregates). After all, most sort cases (e.g. B-Tree builds) didn't use SortSupport for several years, simply because no one got around to it until I finally did a few months back. Note that most tuplesort non-users of abbreviation don't use abbreviation for sensible reasons. For example, abbreviation simply doesn't make sense for Top-N heap sorts, or MJExamineQuals(). The non-single-attribute GroupAggregate/nodeAgg.c case seems bad, but I don't have a good sense of how bad things are with orderedsetaggs.c non-use is...it might matter less than the other case. -- Peter Geoghegan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] B-Tree support function number 3 (strxfrm() optimization)
Robert == Robert Haas robertmh...@gmail.com writes: Robert All right, it seems Tom is with you on that point, so after Robert some study, I've committed this with very minor modifications. While hacking up a patch to demonstrate the simplicity of extending this to the Datum sorter, I seem to have run into a fairly major issue with this: there seems to be no attempt whatsoever to handle spilling to disk correctly. The data spilled to disk only has the un-abbreviated values, but nothing tries to re-abbreviate it (or disable abbreviations) when it is read back in, and chaos ensues: set work_mem = 64; select v, v lag(v) over (order by v) from (select 'B'||i as v from generate_series(1,1) i union all select 'a'||i from generate_series(1,1) i offset 0) s order by v limit 20; v| ?column? +-- a1 | B1 | f a1000 | t a1001 | t a1002 | t a1003 | t B1000 | f B1001 | t B1002 | t B1003 | t B1004 | t B1005 | t a1004 | t a1005 | t a1006 | t a1007 | t a1008 | t B1 | f B10| t B100 | t (20 rows) -- Andrew (irc:RhodiumToad) -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] B-Tree support function number 3 (strxfrm() optimization)
On Tue, Jan 20, 2015 at 2:00 PM, Peter Geoghegan p...@heroku.com wrote: Maybe that's the wrong way of fixing that, but for now I don't think it's acceptable that abbreviation isn't always used in certain cases where it could make sense (e.g. not for simple GroupAggregates with a single attribute -- only multiple attribute GroupAggregates). After all, most sort cases (e.g. B-Tree builds) didn't use SortSupport for several years, simply because no one got around to it until I finally did a few months back. Exuse me. I mean that this *is* an acceptable restriction for the time being. -- Peter Geoghegan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] B-Tree support function number 3 (strxfrm() optimization)
On Tue, Jan 20, 2015 at 3:34 PM, Robert Haas robertmh...@gmail.com wrote: Dear me. Peter, can you fix this RSN? Investigating. -- Peter Geoghegan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] B-Tree support function number 3 (strxfrm() optimization)
On Tue, Jan 20, 2015 at 3:33 PM, Robert Haas robertmh...@gmail.com wrote: Peter, this made bowerbird (Windows 8/Visual Studio) build, but it's failing make check. Ditto hamerkop (Windows 2k8/VC++) and currawong (Windows XP Pro/MSVC++). jacana (Windows 8/gcc) and brolga (Windows XP Pro/cygwin) are unhappy too, although the failures are showing up in different build stages rather than in 'make check'. narwhal (Windows 2k3/mingw) and frogmouth (Windows XP Pro/gcc) are happy, though, so it's not affecting ALL of the Windows critters. Still, I'm leaning toward the view that we should disable this optimization across-the-board on Windows until somebody has time to do the legwork to figure out what it takes to make it work, and what makes it work on some of these critters and fail on others. We can't leave the buildfarm red for long periods of time. Fair enough. -- Peter Geoghegan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] B-Tree support function number 3 (strxfrm() optimization)
On Tue, Jan 20, 2015 at 3:57 PM, Peter Geoghegan p...@heroku.com wrote: It's certainly possible to fix Andrew's test case with the attached. I'm not sure that that's the appropriate fix, though: there is probably a case to be made for not bothering with abbreviation once we've read tuples in for the final merge run. More likely, the strongest case is for storing the abbreviated keys on disk too, and reading those back. Maybe not, though: An extra 8 bytes per tuple on disk is not free. OTOH, if we're I/O bound on the final merge, as we ought to be, then recomputing the abbreviated keys could make sense, since there may well be an idle CPU core anyway. -- Peter Geoghegan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] B-Tree support function number 3 (strxfrm() optimization)
On Tue, Jan 20, 2015 at 10:54 AM, Robert Haas robertmh...@gmail.com wrote: On Mon, Jan 19, 2015 at 9:29 PM, Peter Geoghegan p...@heroku.com wrote: I think that the attached patch should at least fix that much. Maybe the problem on the other animal is also explained by the lack of this, since there could also be a MinGW-ish strxfrm_l(), I suppose. Committed that, rather blindly, since it looks like a reasonable fix. Peter, this made bowerbird (Windows 8/Visual Studio) build, but it's failing make check. Ditto hamerkop (Windows 2k8/VC++) and currawong (Windows XP Pro/MSVC++). jacana (Windows 8/gcc) and brolga (Windows XP Pro/cygwin) are unhappy too, although the failures are showing up in different build stages rather than in 'make check'. narwhal (Windows 2k3/mingw) and frogmouth (Windows XP Pro/gcc) are happy, though, so it's not affecting ALL of the Windows critters. Still, I'm leaning toward the view that we should disable this optimization across-the-board on Windows until somebody has time to do the legwork to figure out what it takes to make it work, and what makes it work on some of these critters and fail on others. We can't leave the buildfarm red for long periods of time. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] B-Tree support function number 3 (strxfrm() optimization)
On Tue, Jan 20, 2015 at 5:32 PM, Robert Haas robertmh...@gmail.com wrote: I was assuming we were going to fix this by undoing the abbreviation (as in the abort case) when we spill to disk, and not bothering with it thereafter. The spill-to-disk case is at least as compelling at the internal sort case. The overhead of comparisons is much higher for tapesort. Attached patch serializes keys. On reflection, I'm inclined to go with this approach. Even if the CPU overhead of reconstructing strxfrm() blobs is acceptable for text, it might be much more expensive for other types. I'm loathe to throw away those abbreviated keys unnecessarily. We don't have to worry about having aborted abbreviation, since once we spill to disk we've effectively committed to abbreviation. This patch formalizes the idea that there is strictly a pass-by-value representation required for such cases (but not that the original Datums must be of a pass-by-reference, which is another thing entirely). I've tested it some, obviously with Andrew's testcase and the regression tests, but also with my B-Tree verification tool. Please review it. Sorry about this. -- Peter Geoghegan diff --git a/src/backend/utils/sort/tuplesort.c b/src/backend/utils/sort/tuplesort.c index 6d3aa88..a442617 100644 --- a/src/backend/utils/sort/tuplesort.c +++ b/src/backend/utils/sort/tuplesort.c @@ -413,10 +413,13 @@ struct Tuplesortstate * * If state-randomAccess is true, then the stored representation of the * tuple must be followed by another unsigned int that is a copy of the - * length --- so the total tape space used is actually sizeof(unsigned int) - * more than the stored length value. This allows read-backwards. When - * randomAccess is not true, the write/read routines may omit the extra - * length word. + * length (less any abbreviated key that is stored, which is also possible) --- + * so the total tape space used is actually sizeof(unsigned int) more than the + * stored length value, as well as another sizeof(Datum) overhead for storing + * abbreviated keys. This allows read-backwards, and avoids the need to + * re-compute abbreviated keys. When randomAccess is not true, the write/read + * routines may omit the extra length word. Also, when abbreviation is not in + * play, abbreviated keys are not stored. * * writetup is expected to write both length words as well as the tuple * data. When readtup is called, the tape is positioned just after the @@ -3124,11 +3127,15 @@ writetup_heap(Tuplesortstate *state, int tapenum, SortTuple *stup) char *tupbody = (char *) tuple + MINIMAL_TUPLE_DATA_OFFSET; unsigned int tupbodylen = tuple-t_len - MINIMAL_TUPLE_DATA_OFFSET; - /* total on-disk footprint: */ + /* total on-disk footprint for tuple: */ unsigned int tuplen = tupbodylen + sizeof(int); LogicalTapeWrite(state-tapeset, tapenum, (void *) tuplen, sizeof(tuplen)); + /* Store abbreviated key, if any (not accounted for by tuplen) */ + if (state-sortKeys-abbrev_converter) + LogicalTapeWrite(state-tapeset, tapenum, + (void *) stup-datum1, sizeof(Datum)); LogicalTapeWrite(state-tapeset, tapenum, (void *) tupbody, tupbodylen); if (state-randomAccess) /* need trailing length word? */ @@ -3150,6 +3157,10 @@ readtup_heap(Tuplesortstate *state, SortTuple *stup, HeapTupleData htup; USEMEM(state, GetMemoryChunkSpace(tuple)); + /* read in the tuple abbreviated key, if any */ + if (state-sortKeys-abbrev_converter) + LogicalTapeReadExact(state-tapeset, tapenum, (void *) stup-datum1, + sizeof(Datum)); /* read in the tuple proper */ tuple-t_len = tuplen; LogicalTapeReadExact(state-tapeset, tapenum, @@ -3161,10 +3172,12 @@ readtup_heap(Tuplesortstate *state, SortTuple *stup, /* set up first-column key value */ htup.t_len = tuple-t_len + MINIMAL_TUPLE_OFFSET; htup.t_data = (HeapTupleHeader) ((char *) tuple - MINIMAL_TUPLE_OFFSET); - stup-datum1 = heap_getattr(htup, -state-sortKeys[0].ssup_attno, -state-tupDesc, -stup-isnull1); + /* set up first-column key value (for non-abbreviated case) */ + if (!state-sortKeys-abbrev_converter) + stup-datum1 = heap_getattr(htup, + state-sortKeys[0].ssup_attno, + state-tupDesc, + stup-isnull1); } /* @@ -3355,12 +3368,20 @@ writetup_cluster(Tuplesortstate *state, int tapenum, SortTuple *stup) { HeapTuple tuple = (HeapTuple) stup-tuple; unsigned int tuplen = tuple-t_len + sizeof(ItemPointerData) + sizeof(int); + AttrNumber leading = state-indexInfo-ii_KeyAttrNumbers[0]; - /* We need to store t_self, but not other fields of HeapTupleData */ + /* + * We need to store t_self, but not other fields of HeapTupleData (although + * possibly abbreviated key value) + */ LogicalTapeWrite(state-tapeset, tapenum, tuplen, sizeof(tuplen)); LogicalTapeWrite(state-tapeset, tapenum, tuple-t_self, sizeof(ItemPointerData)); + /* Store abbreviated key, if any (not accounted for by tuplen) */ + if
Re: [HACKERS] B-Tree support function number 3 (strxfrm() optimization)
On Tue, Jan 20, 2015 at 9:33 PM, Peter Geoghegan p...@heroku.com wrote: On Tue, Jan 20, 2015 at 6:30 PM, Robert Haas robertmh...@gmail.com wrote: I don't want to change the on-disk format for tapes without a lot more discussion. Can you come up with a fix that avoids that for now? A more conservative approach would be to perform conversion on-the-fly once more. That wouldn't be patently unacceptable from a performance perspective, and would also not change the on-disk format for tapes. Thoughts? That might be OK. Probably needs a bit of performance testing to see how it looks. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] B-Tree support function number 3 (strxfrm() optimization)
On Tue, Jan 20, 2015 at 8:39 PM, Peter Geoghegan p...@heroku.com wrote: On Tue, Jan 20, 2015 at 5:32 PM, Robert Haas robertmh...@gmail.com wrote: I was assuming we were going to fix this by undoing the abbreviation (as in the abort case) when we spill to disk, and not bothering with it thereafter. The spill-to-disk case is at least as compelling at the internal sort case. The overhead of comparisons is much higher for tapesort. First, we need to unbreak this. Then, we can look at optimizing it. The latter task will require performance testing. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] B-Tree support function number 3 (strxfrm() optimization)
On Tue, Jan 20, 2015 at 6:30 PM, Robert Haas robertmh...@gmail.com wrote: I don't want to change the on-disk format for tapes without a lot more discussion. Can you come up with a fix that avoids that for now? A more conservative approach would be to perform conversion on-the-fly once more. That wouldn't be patently unacceptable from a performance perspective, and would also not change the on-disk format for tapes. Thoughts? -- Peter Geoghegan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] B-Tree support function number 3 (strxfrm() optimization)
On Tue, Jan 20, 2015 at 7:07 PM, Peter Geoghegan p...@heroku.com wrote: On Tue, Jan 20, 2015 at 3:57 PM, Peter Geoghegan p...@heroku.com wrote: It's certainly possible to fix Andrew's test case with the attached. I'm not sure that that's the appropriate fix, though: there is probably a case to be made for not bothering with abbreviation once we've read tuples in for the final merge run. More likely, the strongest case is for storing the abbreviated keys on disk too, and reading those back. Maybe not, though: An extra 8 bytes per tuple on disk is not free. OTOH, if we're I/O bound on the final merge, as we ought to be, then recomputing the abbreviated keys could make sense, since there may well be an idle CPU core anyway. I was assuming we were going to fix this by undoing the abbreviation (as in the abort case) when we spill to disk, and not bothering with it thereafter. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] B-Tree support function number 3 (strxfrm() optimization)
On Tue, Jan 20, 2015 at 6:39 PM, Peter Geoghegan p...@heroku.com wrote: On Tue, Jan 20, 2015 at 6:34 PM, Robert Haas robertmh...@gmail.com wrote: That might be OK. Probably needs a bit of performance testing to see how it looks. Well, we're still only doing it when we do our final merge. So that's only doubling the number of conversions required, which if we're blocked on I/O might not matter at all. You'll probably prefer the attached. This patch works by disabling abbreviation, but only after writing out runs, with the final merge left to go. That way, it doesn't matter when abbreviated keys are not read back from disk (or regenerated). I believe this bug was missed because it only occurs when there are multiple runs, and not in the common case where there is one big initial run that is found already sorted when we reach mergeruns(). -- Peter Geoghegan diff --git a/src/backend/utils/sort/tuplesort.c b/src/backend/utils/sort/tuplesort.c index 6d3aa88..b6e302f 100644 --- a/src/backend/utils/sort/tuplesort.c +++ b/src/backend/utils/sort/tuplesort.c @@ -2172,6 +2172,22 @@ mergeruns(Tuplesortstate *state) return; } + if (state-sortKeys-abbrev_converter) + { + /* + * If there are multiple runs to be merged, when we go to read back + * tuples from disk, abbreviated keys will not have been stored, and we + * don't care to regenerate them. Disable abbreviation from this point + * on. + */ + state-sortKeys-abbrev_converter = NULL; + state-sortKeys-comparator = state-sortKeys-abbrev_full_comparator; + + /* Not strictly necessary, but be tidy */ + state-sortKeys-abbrev_abort = NULL; + state-sortKeys-abbrev_full_comparator = NULL; + } + /* End of step D2: rewind all output tapes to prepare for merging */ for (tapenum = 0; tapenum state-tapeRange; tapenum++) LogicalTapeRewind(state-tapeset, tapenum, false); -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] B-Tree support function number 3 (strxfrm() optimization)
On Tue, Jan 20, 2015 at 5:42 PM, Robert Haas robertmh...@gmail.com wrote: On Tue, Jan 20, 2015 at 8:39 PM, Peter Geoghegan p...@heroku.com wrote: On Tue, Jan 20, 2015 at 5:32 PM, Robert Haas robertmh...@gmail.com wrote: I was assuming we were going to fix this by undoing the abbreviation (as in the abort case) when we spill to disk, and not bothering with it thereafter. The spill-to-disk case is at least as compelling at the internal sort case. The overhead of comparisons is much higher for tapesort. First, we need to unbreak this. Then, we can look at optimizing it. The latter task will require performance testing. I don't see that any alternative isn't a performance trade-off. My patch accomplishes unbreaking this. I agree that it needs still needs review from that perspective, but it doesn't seem any worse than any other alternative. Would you prefer it if the spill-to-disk case aborted in the style of low entropy keys? That doesn't seem significantly safer than this, and it certainly not acceptable from a performance perspective. -- Peter Geoghegan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] B-Tree support function number 3 (strxfrm() optimization)
On Tue, Jan 20, 2015 at 5:46 PM, Peter Geoghegan p...@heroku.com wrote: Would you prefer it if the spill-to-disk case aborted in the style of low entropy keys? That doesn't seem significantly safer than this, and it certainly not acceptable from a performance perspective. BTW, I can write that patch if that's your preference. Should I? I just don't favor that even as a short term correctness fix, because it seems unacceptable to throw away all the strxfrm() work where that's a very predictable and even likely outcome. I suggest reviewing and committing my fix as a short term fix, that may well turn out to be generally acceptable upon further consideration. Yes, we'll need to make a point of reviewing an already committed patch, but there is a precedent for that. -- Peter Geoghegan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] B-Tree support function number 3 (strxfrm() optimization)
On Tue, Jan 20, 2015 at 8:39 PM, Peter Geoghegan p...@heroku.com wrote: On Tue, Jan 20, 2015 at 5:32 PM, Robert Haas robertmh...@gmail.com wrote: I was assuming we were going to fix this by undoing the abbreviation (as in the abort case) when we spill to disk, and not bothering with it thereafter. The spill-to-disk case is at least as compelling at the internal sort case. The overhead of comparisons is much higher for tapesort. Attached patch serializes keys. On reflection, I'm inclined to go with this approach. Even if the CPU overhead of reconstructing strxfrm() blobs is acceptable for text, it might be much more expensive for other types. I'm loathe to throw away those abbreviated keys unnecessarily. We don't have to worry about having aborted abbreviation, since once we spill to disk we've effectively committed to abbreviation. This patch formalizes the idea that there is strictly a pass-by-value representation required for such cases (but not that the original Datums must be of a pass-by-reference, which is another thing entirely). I've tested it some, obviously with Andrew's testcase and the regression tests, but also with my B-Tree verification tool. Please review it. Sorry about this. I don't want to change the on-disk format for tapes without a lot more discussion. Can you come up with a fix that avoids that for now? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] B-Tree support function number 3 (strxfrm() optimization)
On Tue, Jan 20, 2015 at 6:34 PM, Robert Haas robertmh...@gmail.com wrote: That might be OK. Probably needs a bit of performance testing to see how it looks. Well, we're still only doing it when we do our final merge. So that's only doubling the number of conversions required, which if we're blocked on I/O might not matter at all. -- Peter Geoghegan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] B-Tree support function number 3 (strxfrm() optimization)
On Tue, Jan 20, 2015 at 11:29 AM, Peter Geoghegan p...@heroku.com wrote: On Mon, Jan 19, 2015 at 5:59 PM, Peter Geoghegan p...@heroku.com wrote: On Mon, Jan 19, 2015 at 5:33 PM, Alvaro Herrera alvhe...@2ndquadrant.com wrote: You did notice that bowerbird isn't building, right? http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=bowerbirddt=2015-01-19%2023%3A54%3A46 Yeah. Looks like strxfrm_l() isn't available on the animal, for whatever reason. I think that the attached patch should at least fix that much. Maybe the problem on the other animal is also explained by the lack of this, since there could also be a MinGW-ish strxfrm_l(), I suppose. On MinGW-32, not that I know of: $ find . -name *.h | xgrep strxfrm_l ./lib/gcc/mingw32/4.8.1/include/c++/mingw32/bits/c++config.h:/* Define if strxfr m_l is available in string.h. */ ./mingw32/lib/gcc/mingw32/4.8.1/include/c++/mingw32/bits/c++config.h:/* Define i f strxfrm_l is available in string.h. */ strxfrm is defined in string.h though. With your patch applied, the failure with MSVC disappeared, but there is still a warning showing up: (ClCompile target) - src\backend\lib\hyperloglog.c(73): warning C4334: '' : result of 32-bit shift implicitly converted to 64 bits (was 64-bit shift intended? -- Michael -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] B-Tree support function number 3 (strxfrm() optimization)
Peter Geoghegan wrote: It appears that the buildfarm animal brolga isn't happy about this patch. I'm not sure why, since I thought we already figured out bugs or other inconsistencies in various strxfrm() implementations. You did notice that bowerbird isn't building, right? http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=bowerbirddt=2015-01-19%2023%3A54%3A46 -- Álvaro Herrerahttp://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] B-Tree support function number 3 (strxfrm() optimization)
On Mon, Jan 19, 2015 at 5:43 PM, Peter Geoghegan p...@heroku.com wrote: It appears that the buildfarm animal brolga isn't happy about this patch. I'm not sure why, since I thought we already figured out bugs or other inconsistencies in various strxfrm() implementations. Well, the first thing that comes to mind is that strxfrm() is returning strings that, when sorted, do not give the same order we would have obtained via strcoll(). It's true that there are existing callers of strxfrm(), but it looks like that is mostly used for statistics-gathering, so it's possible that differences vs. strcoll() would not have shown up before now. Is there any legitimate way that strxfrm() and strcoll() can return inconsistent answers - e.g. they are somehow allowed to derive their notion of the relevant locale differently - or is this just a case of Cygwin being busted? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] B-Tree support function number 3 (strxfrm() optimization)
On Mon, Jan 19, 2015 at 7:47 PM, Michael Paquier michael.paqu...@gmail.com wrote: On MinGW-32, not that I know of: $ find . -name *.h | xgrep strxfrm_l ./lib/gcc/mingw32/4.8.1/include/c++/mingw32/bits/c++config.h:/* Define if strxfr m_l is available in string.h. */ ./mingw32/lib/gcc/mingw32/4.8.1/include/c++/mingw32/bits/c++config.h:/* Define i f strxfrm_l is available in string.h. */ strxfrm is defined in string.h though. I'm not quite following. Doesn't that imply that strxfrm_l() at least *could* be available? I guess it doesn't matter, though, because the animal with the successful build that fails the locale regression test (brolga) does not have locale_t support. Therefore, there is no new strxfrm_l() caller. My next guess is that the real problem is an assumption I've made. That is, my assumption that strxfrm() always behaves equivalently to strcpy() when the C locale happens to be in use may not be portable (due to external factors). I guess we're inconsistent about making sure that LC_COLLATE is set correctly in WIN32 and/or EXEC_BACKEND builds, or something along those lines. The implementation in the past got away with strcoll()/strxfrm() not having the C locale set, since strcoll() was never called when the C locale was in use -- we just called strcmp() instead. Assuming that's correct, it might be easier just to entirely disable the optimization on Windows, even with the C locale. It may not be worth it to even bother just for C locale support of abbreviated keys. I'm curious about what will happen there when the _strxfrm_l() fix patch is applied. With your patch applied, the failure with MSVC disappeared, but there is still a warning showing up: (ClCompile target) - src\backend\lib\hyperloglog.c(73): warning C4334: '' : result of 32-bit shift implicitly converted to 64 bits (was 64-bit shift intended? That seems harmless. I suggest an explicit cast to Size here. -- Peter Geoghegan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] B-Tree support function number 3 (strxfrm() optimization)
On Mon, Jan 19, 2015 at 5:59 PM, Peter Geoghegan p...@heroku.com wrote: On Mon, Jan 19, 2015 at 5:33 PM, Alvaro Herrera alvhe...@2ndquadrant.com wrote: You did notice that bowerbird isn't building, right? http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=bowerbirddt=2015-01-19%2023%3A54%3A46 Yeah. Looks like strxfrm_l() isn't available on the animal, for whatever reason. I think that the attached patch should at least fix that much. Maybe the problem on the other animal is also explained by the lack of this, since there could also be a MinGW-ish strxfrm_l(), I suppose. -- Peter Geoghegan diff --git a/src/include/port/win32.h b/src/include/port/win32.h index 550c3ec..4cb51ec 100644 --- a/src/include/port/win32.h +++ b/src/include/port/win32.h @@ -341,6 +341,7 @@ typedef int pid_t; #define isspace_l _isspace_l #define iswspace_l _iswspace_l #define strcoll_l _strcoll_l +#define strxfrm_l _strxfrm_l #define wcscoll_l _wcscoll_l #define wcstombs_l _wcstombs_l #define mbstowcs_l _mbstowcs_l -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] B-Tree support function number 3 (strxfrm() optimization)
On Mon, Jan 19, 2015 at 5:33 PM, Alvaro Herrera alvhe...@2ndquadrant.com wrote: You did notice that bowerbird isn't building, right? http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=bowerbirddt=2015-01-19%2023%3A54%3A46 Yeah. Looks like strxfrm_l() isn't available on the animal, for whatever reason. -- Peter Geoghegan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] B-Tree support function number 3 (strxfrm() optimization)
On Tue, Dec 2, 2014 at 8:28 PM, Peter Geoghegan p...@heroku.com wrote: On Tue, Dec 2, 2014 at 2:16 PM, Peter Geoghegan p...@heroku.com wrote: On Tue, Dec 2, 2014 at 2:07 PM, Robert Haas robertmh...@gmail.com wrote: Well, maybe you should make the updates we've agreed on and I can take another look at it. Agreed. Attached, revised patchset makes these updates. I continue to use the sortsupport struct to convey that a given attribute of a given sort is abbreviation-applicable (although the field is now a bool, not an enum). All right, it seems Tom is with you on that point, so after some study, I've committed this with very minor modifications. Sorry for the long delay. I have not committed the 0002 patch, though, because I haven't studied that enough yet to know whether I think it's a good idea. Perhaps that could get its own CommitFest entry and thread, though, to separate it from this exceedingly long discussion and make it clear exactly what we're hoping to gain by that patch specifically. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] B-Tree support function number 3 (strxfrm() optimization)
On Mon, Jan 19, 2015 at 3:33 PM, Robert Haas robertmh...@gmail.com wrote: All right, it seems Tom is with you on that point, so after some study, I've committed this with very minor modifications. Sorry for the long delay. I have not committed the 0002 patch, though, because I haven't studied that enough yet to know whether I think it's a good idea. Perhaps that could get its own CommitFest entry and thread, though, to separate it from this exceedingly long discussion and make it clear exactly what we're hoping to gain by that patch specifically. By the way, for those following along at home, here's an example of how this patch can help: rhaas=# create table stuff as select random()::text as a, 'filler filler filler'::text as b, g as c from generate_series(1, 100) g; SELECT 100 rhaas=# create index on stuff (a); CREATE INDEX On the PPC64 machine I normally use for performance testing, it takes about 6.3 seconds to build the index with the commit just before this one. With this commit, it drops to 1.9 seconds. That's more than a 3x speedup! Now, if I change the query that creates the table to this. rhaas=# create table stuff as select '' || random()::text as a, 'filler filler filler'::text as b, g as c from generate_series(1, 100) g; ...then it takes 10.8 seconds with or without this patch. In general, any case where the first few characters of every string are exactly identical (or only quite rarely different) will not benefit, but many practical cases will benefit significantly. Also, Peter's gone to a fair amount of work to make sure that even when the patch does not help, it doesn't hurt, either. So that's pretty cool. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] B-Tree support function number 3 (strxfrm() optimization)
* Robert Haas (robertmh...@gmail.com) wrote: On the PPC64 machine I normally use for performance testing, it takes about 6.3 seconds to build the index with the commit just before this one. With this commit, it drops to 1.9 seconds. That's more than a 3x speedup! Now, if I change the query that creates the table to this. rhaas=# create table stuff as select '' || random()::text as a, 'filler filler filler'::text as b, g as c from generate_series(1, 100) g; ...then it takes 10.8 seconds with or without this patch. In general, any case where the first few characters of every string are exactly identical (or only quite rarely different) will not benefit, but many practical cases will benefit significantly. Also, Peter's gone to a fair amount of work to make sure that even when the patch does not help, it doesn't hurt, either. So that's pretty cool. Wow, nice! Good work Peter! Thanks, Stephen signature.asc Description: Digital signature
Re: [HACKERS] B-Tree support function number 3 (strxfrm() optimization)
There is an interesting thread about strcoll() overhead over on -general: http://www.postgresql.org/message-id/cab25xexnondrmc1_cy3jvmb0tmydm38ef9q2d7xla0rbncj...@mail.gmail.com My guess was that this person experienced a rather unexpected downside of spilling to disk when sorting on a text attribute: System throughput becomes very CPU bound, because tapesort tends to result in more comparisons [1]. With abbreviated keys, tapesort can actually compete with quicksort in certain cases [2]. Tapesorts of text attributes are especially bad on released versions of Postgres, and will perform very little actual I/O. In all seriousness, I wonder if we should add a release note item stating that when using Postgres 9.5, due to the abbreviated key optimization, external sorts can be much more I/O bound than in previous releases... [1] http://www.postgresql.org/message-id/20140806035512.ga91...@tornado.leadboat.com [2] http://www.postgresql.org/message-id/CAM3SWZQiGvGhMB4TMbEWoNjO17=ysb5b5y5mgqjsanq4uwt...@mail.gmail.com -- Peter Geoghegan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] B-Tree support function number 3 (strxfrm() optimization)
On Tue, Dec 2, 2014 at 5:44 PM, Tom Lane t...@sss.pgh.pa.us wrote: Peter Geoghegan p...@heroku.com writes: On Tue, Dec 2, 2014 at 2:21 PM, Robert Haas robertmh...@gmail.com wrote: Right, and what I'm saying is that maybe the applicability flag shouldn't be stored in the SortSupport object, but passed down as an argument. But then how does that information get to any given sortsupport routine? That's the place that really needs to know if abbreviation is useful. In general, they're only passed a SortSupport object. Are you suggesting revising the signature required of SortSupport routines to add that extra flag as an additional argument? I think that is what he's suggesting, and I too am wondering why it's a good idea. I find it somewhat confusing that we've got one flag which is only used from the time the SortSupport object is created until the time that it's fully initialized, and then a different way of indicating whether we paid attention to that flag. I'm not totally sure what the right solution to that problem is, but the current situation feels like something of a wart. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] B-Tree support function number 3 (strxfrm() optimization)
On Tue, Dec 2, 2014 at 1:21 PM, Peter Geoghegan p...@heroku.com wrote: Incidentally, I think that an under-appreciated possible source of regressions here is that attributes abbreviated have a strong physical/logical correlation. I could see a small regression for one such case even though my cost model indicated that it should be very profitable. This was the column in question: postgres=# select * from pg_stats where tablename = 'ohio_voters' and attname = 'mailing_address1' ; -[ RECORD 1 ]- schemaname | public tablename | ohio_voters attname| mailing_address1 inherited | f null_frac | 0 avg_width | 5 n_distinct | 789 most_common_vals | {} most_common_freqs | {0.969267} histogram_bounds | SNIP *** correlation| 0.944785 SNIP *** This n_distinct is wrong, though. In fact, the number of distinct columns is 25,946, while the number of distinct abbreviated keys is 13,691. So correlation was not the dominant factor here (although it was probably still a factor) - rather, the dominant factor was that the vast majority of comparisons would get away with an opportunistic memcmp() == 0 anyway (although not with Postgres 9.4), and so my baseline is very fast for this case. This would not have come up had the value been represented as NULL (as it clearly should have been), since that would not undergo strxfrm() transformation/abbreviation in the first place. Even still, highly skewed attributes exist in the wild, and deserve our consideration - we do not model the distribution of values within the set. I believe that these cases are rare enough, and (thanks to the already committed parts of this work) fast enough to probably not be worried about; maybe a more complex cost model could do better, but I'm inclined to think that it's not worth it. We'd end up slightly improving this case at bigger cost to other, much more common cases. Besides, equality-resolved comparisons are not necessarily much cheaper for datatypes other than Postgres 9.5 text (in a world where there is a variety of datatypes accelerated by abbreviation), which discourages a general solution. A custom format dump of this data (publicly available Ohio State voter records) is available from: http://postgres-benchmarks.s3-website-us-east-1.amazonaws.com/ohio_voters.custom.dump -- Peter Geoghegan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] B-Tree support function number 3 (strxfrm() optimization)
On Tue, Nov 25, 2014 at 1:38 PM, Peter Geoghegan p...@heroku.com wrote: On Tue, Nov 25, 2014 at 4:01 AM, Robert Haas robertmh...@gmail.com wrote: - This appears to needlessly reindent the comments for PG_CACHE_LINE_SIZE. Actually, the word only is removed (because PG_CACHE_LINE_SIZE has a new client now). So it isn't quite the same paragraph as before. Oy, I missed that. - I really don't think we need a #define in pg_config_manual.h for this. Please omit that. You'd prefer to not offer a way to disable abbreviation? Okay. I guess that makes sense - it should work well as a general optimization. I'd prefer not to have a #define in pg_config_manual.h. Only stuff that we expect a reasonably decent number of users to need to change should be in that file, and this is too marginal for that. If anybody other than the developers of the feature is disabling this, the whole thing is going to be ripped back out. I'm not sure about that. I'd prefer to have tuplesort (and one or two other sites) set the abbreviation is possible in principle flag. Otherwise, sortsupport needs to assume that the leading attribute is going to be the abbreviation-applicable one, which might not always be true. Still, it's not as if I feel strongly about it. When wouldn't that be true? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] B-Tree support function number 3 (strxfrm() optimization)
On Tue, Dec 2, 2014 at 1:00 PM, Robert Haas robertmh...@gmail.com wrote: I'd prefer not to have a #define in pg_config_manual.h. Only stuff that we expect a reasonably decent number of users to need to change should be in that file, and this is too marginal for that. If anybody other than the developers of the feature is disabling this, the whole thing is going to be ripped back out. I agree. The patch either works well in general and it goes in, or it doesn't and should be rejected. That doesn't mean that the standard applied is that regressions are absolutely unacceptable, but the standard shouldn't be too far off that. I feel pretty confident that we'll be able to meet that standard, though, because database luminary Jim Gray recommended this technique in about 1995. Even if the details of what I have here could stand to be tweaked, there is no getting around the fact that a good sort routine needs to strongly consider locality. That was apparent even in 1995, but now it's a very major trend. Incidentally, I think that an under-appreciated possible source of regressions here is that attributes abbreviated have a strong physical/logical correlation. I could see a small regression for one such case even though my cost model indicated that it should be very profitable. On the other hand, on other occasions my cost model (i.e. considering how good a proxy for full key cardinality abbreviated key cardinality is) was quite pessimistic. Although, at least it was only a small regression, even though the correlation was something like 0.95. And at least the sort will be very fast in any case. You'll recall that Heikki's test case involved correlation like that, even though it was mostly intended to make a point about the entropy in abbreviated keys. Correlation was actually the most important factor there. I think it might be generally true that it's the most important factor, in practice more important even than capturing sufficient entropy in the abbreviated key representation. I'm not sure about that. I'd prefer to have tuplesort (and one or two other sites) set the abbreviation is possible in principle flag. Otherwise, sortsupport needs to assume that the leading attribute is going to be the abbreviation-applicable one, which might not always be true. Still, it's not as if I feel strongly about it. When wouldn't that be true? It just feels a bit wrong to me. There might be a future in which we want to use the datum1 field for a non-leading attribute. For example, when it is known ahead of time that there are low cardinality integers in the leading key/attribute. Granted, that's pretty speculative, but then it's not as if I'm insisting that it must be done that way. I defer to you. -- Peter Geoghegan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] B-Tree support function number 3 (strxfrm() optimization)
On Tue, Dec 2, 2014 at 4:21 PM, Peter Geoghegan p...@heroku.com wrote: I'm not sure about that. I'd prefer to have tuplesort (and one or two other sites) set the abbreviation is possible in principle flag. Otherwise, sortsupport needs to assume that the leading attribute is going to be the abbreviation-applicable one, which might not always be true. Still, it's not as if I feel strongly about it. When wouldn't that be true? It just feels a bit wrong to me. There might be a future in which we want to use the datum1 field for a non-leading attribute. For example, when it is known ahead of time that there are low cardinality integers in the leading key/attribute. Granted, that's pretty speculative, but then it's not as if I'm insisting that it must be done that way. I defer to you. Well, maybe you should make the updates we've agreed on and I can take another look at it. But I didn't think that I was proposing to change anything about the level at which the decision about whether to abbreviate or not was made; rather, I thought I was suggesting that we pass that flag down to the code that initializes the sortsupport object as an argument rather than through the sortsupport structure itself. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] B-Tree support function number 3 (strxfrm() optimization)
On Tue, Dec 2, 2014 at 2:07 PM, Robert Haas robertmh...@gmail.com wrote: Well, maybe you should make the updates we've agreed on and I can take another look at it. Agreed. But I didn't think that I was proposing to change anything about the level at which the decision about whether to abbreviate or not was made; rather, I thought I was suggesting that we pass that flag down to the code that initializes the sortsupport object as an argument rather than through the sortsupport structure itself. The flag I'm talking about concerns the *applicability* of abbreviation, and not whether or not it will actually be used (maybe the opclass lacks support, or decides not to for some platform specific reason). Tuplesort has a contract with abbreviation + sortsupport that considers whether or not the function pointer used to abbreviate is set, which relates to whether or not abbreviation will *actually* be used. Note that for non-abbreviation-applicable attributes, btsortsupport_worker() never sets the function pointer (nor, incidentally, does it set the other abbreviation related function pointers in the struct). -- Peter Geoghegan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] B-Tree support function number 3 (strxfrm() optimization)
On Tue, Dec 2, 2014 at 5:16 PM, Peter Geoghegan p...@heroku.com wrote: On Tue, Dec 2, 2014 at 2:07 PM, Robert Haas robertmh...@gmail.com wrote: Well, maybe you should make the updates we've agreed on and I can take another look at it. Agreed. But I didn't think that I was proposing to change anything about the level at which the decision about whether to abbreviate or not was made; rather, I thought I was suggesting that we pass that flag down to the code that initializes the sortsupport object as an argument rather than through the sortsupport structure itself. The flag I'm talking about concerns the *applicability* of abbreviation, and not whether or not it will actually be used (maybe the opclass lacks support, or decides not to for some platform specific reason). Tuplesort has a contract with abbreviation + sortsupport that considers whether or not the function pointer used to abbreviate is set, which relates to whether or not abbreviation will *actually* be used. Note that for non-abbreviation-applicable attributes, btsortsupport_worker() never sets the function pointer (nor, incidentally, does it set the other abbreviation related function pointers in the struct). Right, and what I'm saying is that maybe the applicability flag shouldn't be stored in the SortSupport object, but passed down as an argument. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] B-Tree support function number 3 (strxfrm() optimization)
On Tue, Dec 2, 2014 at 2:21 PM, Robert Haas robertmh...@gmail.com wrote: Right, and what I'm saying is that maybe the applicability flag shouldn't be stored in the SortSupport object, but passed down as an argument. But then how does that information get to any given sortsupport routine? That's the place that really needs to know if abbreviation is useful. In general, they're only passed a SortSupport object. Are you suggesting revising the signature required of SortSupport routines to add that extra flag as an additional argument? -- Peter Geoghegan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] B-Tree support function number 3 (strxfrm() optimization)
Peter Geoghegan p...@heroku.com writes: On Tue, Dec 2, 2014 at 2:21 PM, Robert Haas robertmh...@gmail.com wrote: Right, and what I'm saying is that maybe the applicability flag shouldn't be stored in the SortSupport object, but passed down as an argument. But then how does that information get to any given sortsupport routine? That's the place that really needs to know if abbreviation is useful. In general, they're only passed a SortSupport object. Are you suggesting revising the signature required of SortSupport routines to add that extra flag as an additional argument? I think that is what he's suggesting, and I too am wondering why it's a good idea. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] B-Tree support function number 3 (strxfrm() optimization)
On Tue, Dec 2, 2014 at 2:16 PM, Peter Geoghegan p...@heroku.com wrote: On Tue, Dec 2, 2014 at 2:07 PM, Robert Haas robertmh...@gmail.com wrote: Well, maybe you should make the updates we've agreed on and I can take another look at it. Agreed. Attached, revised patchset makes these updates. I continue to use the sortsupport struct to convey that a given attribute of a given sort is abbreviation-applicable (although the field is now a bool, not an enum). -- Peter Geoghegan From 865a1abeb6bfb601b1ec605afb1e339c0e444e10 Mon Sep 17 00:00:00 2001 From: Peter Geoghegan p...@heroku.com Date: Sun, 9 Nov 2014 14:38:44 -0800 Subject: [PATCH 2/2] Estimate total number of rows to be sorted Sortsupport opclasses now accept a row hint, indicating the estimated number of rows to be sorted. This gives opclasses a sense of proportion about how far along the copying of tuples is when considering aborting abbreviation. Estimates come from various sources. The text opclass now always avoids aborting abbreviation if the total number of rows to be sorted is high enough, without considering cardinality at all. --- src/backend/access/nbtree/nbtree.c | 5 ++- src/backend/access/nbtree/nbtsort.c| 14 +- src/backend/commands/cluster.c | 4 +- src/backend/executor/nodeAgg.c | 5 ++- src/backend/executor/nodeSort.c| 1 + src/backend/utils/adt/orderedsetaggs.c | 2 +- src/backend/utils/adt/varlena.c| 80 -- src/backend/utils/sort/tuplesort.c | 14 -- src/include/access/nbtree.h| 2 +- src/include/utils/sortsupport.h| 7 ++- src/include/utils/tuplesort.h | 6 +-- 11 files changed, 121 insertions(+), 19 deletions(-) diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c index d881525..d26c60b 100644 --- a/src/backend/access/nbtree/nbtree.c +++ b/src/backend/access/nbtree/nbtree.c @@ -109,14 +109,15 @@ btbuild(PG_FUNCTION_ARGS) elog(ERROR, index \%s\ already contains data, RelationGetRelationName(index)); - buildstate.spool = _bt_spoolinit(heap, index, indexInfo-ii_Unique, false); + buildstate.spool = _bt_spoolinit(heap, index, indexInfo-ii_Unique, + indexInfo-ii_Predicate != NIL, false); /* * If building a unique index, put dead tuples in a second spool to keep * them out of the uniqueness check. */ if (indexInfo-ii_Unique) - buildstate.spool2 = _bt_spoolinit(heap, index, false, true); + buildstate.spool2 = _bt_spoolinit(heap, index, false, true, true); /* do the heap scan */ reltuples = IndexBuildHeapScan(heap, index, indexInfo, true, diff --git a/src/backend/access/nbtree/nbtsort.c b/src/backend/access/nbtree/nbtsort.c index 593571b..473ac54 100644 --- a/src/backend/access/nbtree/nbtsort.c +++ b/src/backend/access/nbtree/nbtsort.c @@ -73,6 +73,7 @@ #include storage/smgr.h #include tcop/tcopprot.h #include utils/rel.h +#include utils/selfuncs.h #include utils/sortsupport.h #include utils/tuplesort.h @@ -149,10 +150,13 @@ static void _bt_load(BTWriteState *wstate, * create and initialize a spool structure */ BTSpool * -_bt_spoolinit(Relation heap, Relation index, bool isunique, bool isdead) +_bt_spoolinit(Relation heap, Relation index, bool isunique, bool ispartial, + bool isdead) { BTSpool*btspool = (BTSpool *) palloc0(sizeof(BTSpool)); int btKbytes; + double estRows; + float4 relTuples; btspool-heap = heap; btspool-index = index; @@ -165,10 +169,16 @@ _bt_spoolinit(Relation heap, Relation index, bool isunique, bool isdead) * unique index actually requires two BTSpool objects. We expect that the * second one (for dead tuples) won't get very full, so we give it only * work_mem. + * + * Certain cases will always have a relTuples of 0, such as reindexing as + * part of a CLUSTER operation, or when reindexing toast tables. This is + * interpreted as no estimate available. */ btKbytes = isdead ? work_mem : maintenance_work_mem; + relTuples = RelationGetForm(heap)-reltuples; + estRows = relTuples * (isdead || ispartial ? DEFAULT_INEQ_SEL : 1); btspool-sortstate = tuplesort_begin_index_btree(heap, index, isunique, - btKbytes, false); + btKbytes, estRows, false); return btspool; } diff --git a/src/backend/commands/cluster.c b/src/backend/commands/cluster.c index bc5f33f..8e5f536 100644 --- a/src/backend/commands/cluster.c +++ b/src/backend/commands/cluster.c @@ -890,7 +890,9 @@ copy_heap_data(Oid OIDNewHeap, Oid OIDOldHeap, Oid OIDOldIndex, bool verbose, /* Set up sorting if wanted */ if (use_sort) tuplesort = tuplesort_begin_cluster(oldTupDesc, OldIndex, - maintenance_work_mem, false); + maintenance_work_mem, + RelationGetForm(OldHeap)-reltuples, + false); else tuplesort = NULL; diff --git a/src/backend/executor/nodeAgg.c b/src/backend/executor/nodeAgg.c index 89de755..95143c3 100644 ---
Re: [HACKERS] B-Tree support function number 3 (strxfrm() optimization)
On Tue, Dec 2, 2014 at 5:28 PM, Peter Geoghegan p...@heroku.com wrote: Attached, revised patchset makes these updates. Whoops. Missed some obsolete comments. Here is a third commit that makes a further small modification to one comment. -- Peter Geoghegan From 8d1aba80f95e05742047cba5bd83d8f17aa5ef37 Mon Sep 17 00:00:00 2001 From: Peter Geoghegan p...@heroku.com Date: Tue, 2 Dec 2014 17:42:21 -0800 Subject: [PATCH 3/3] Alter comments to reflect current naming --- src/include/utils/sortsupport.h | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/src/include/utils/sortsupport.h b/src/include/utils/sortsupport.h index 659233b..f7c73b3 100644 --- a/src/include/utils/sortsupport.h +++ b/src/include/utils/sortsupport.h @@ -127,9 +127,9 @@ typedef struct SortSupportData * Returning zero from the alternative comparator does not indicate * equality, as with a conventional support routine 1, though -- it * indicates that it wasn't possible to determine how the two abbreviated - * values compared. A proper comparison, using auth_comparator/ - * ApplySortComparatorFull() is therefore required. In many cases this - * results in most or all comparisons only using the cheap alternative + * values compared. A proper comparison, using abbrev_full_comparator/ + * ApplySortAbbrevFullComparator() is therefore required. In many cases + * this results in most or all comparisons only using the cheap alternative * comparison func, which is typically implemented as code that compiles to * just a few CPU instructions. CPU cache miss penalties are expensive; to * get good overall performance, sort infrastructure must heavily weigh -- 1.9.1 -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] B-Tree support function number 3 (strxfrm() optimization)
On Sun, Nov 9, 2014 at 10:02 PM, Peter Geoghegan p...@heroku.com wrote: On Sat, Oct 11, 2014 at 6:34 PM, Peter Geoghegan p...@heroku.com wrote: Attached patch, when applied, accelerates all tuplesort cases using abbreviated keys, building on previous work here, as well as the patch posted to that other thread. I attach an updated patch set, rebased on top of the master branch's tip. All relevant tuplesort cases (B-Tree, MinimalTuple, CLUSTER) are now directly covered by the patch set, since there is now general sortsupport support for those cases in the master branch -- no need to apply some other patch from some other thread. For the convenience of reviewers, this new revision includes a new approach to making my improvements cumulative: A second commit adds tuple count estimation. This hint, passed along to the text opclass's convert routine, is taken from the optimizer's own estimate, or the relcache's reltuples, depending on the tuplesort case being accelerated. As in previous revisions, the idea is to give the opclass a sense of proportion about how far along it is, to be weighed in deciding whether or not to abort abbreviation. One potentially controversial aspect of that is how the text opclass abbreviation cost model/abort early stuff weighs simply having many tuples - past a certain point, it *always* proceeds with abbreviation, not matter what the cardinality of abbreviated keys so far is. For that reason it particular, it seemed to make sense to split these parts out into a second commit. I hope that we can finish up all 9.5 work on accelerated sorting soon. There's a lot of stuff in this patch I'm still trying to digest, but here are a few thoughts on patch 0001: - This appears to needlessly reindent the comments for PG_CACHE_LINE_SIZE. - I really don't think we need a #define in pg_config_manual.h for this. Please omit that. - I'm much happier with the way the changes to sortsupport.h look in this version. However, I think that auth_comparator is a confusing name, because auth is often used as an abbreviation for authentication. We can spell it out (authoritative_comparator) or come up with a different name (backup_comparator? abbrev_full_comparator?). Whatever we do, ApplySortComparatorAuth() should be renamed to match. - Also, I don't think making abbrev_state an enumerated value with two values is really doing anything for us; we could just use a Boolean. I'm wondering if we should actually go a bit further and remove this from the SortSupport object and instead add an additional Boolean flag to PrepareSortSupportFrom(OrderingOp|IndexRel) that gets passed all the way down to the opclass's sortsupport function. It seems like that might be more clear. Once the opclass function has done its thing, the other three new nembers are enough to know whether we're using the optimization or not (and can be fiddled if we want to make a later decision to call the whole thing off). -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] B-Tree support function number 3 (strxfrm() optimization)
On Tue, Nov 25, 2014 at 4:01 AM, Robert Haas robertmh...@gmail.com wrote: - This appears to needlessly reindent the comments for PG_CACHE_LINE_SIZE. Actually, the word only is removed (because PG_CACHE_LINE_SIZE has a new client now). So it isn't quite the same paragraph as before. - I really don't think we need a #define in pg_config_manual.h for this. Please omit that. You'd prefer to not offer a way to disable abbreviation? Okay. I guess that makes sense - it should work well as a general optimization. - I'm much happier with the way the changes to sortsupport.h look in this version. However, I think that auth_comparator is a confusing name, because auth is often used as an abbreviation for authentication. We can spell it out (authoritative_comparator) or come up with a different name (backup_comparator? abbrev_full_comparator?). Whatever we do, ApplySortComparatorAuth() should be renamed to match. Okay. - Also, I don't think making abbrev_state an enumerated value with two values is really doing anything for us; we could just use a Boolean. I'm wondering if we should actually go a bit further and remove this from the SortSupport object and instead add an additional Boolean flag to PrepareSortSupportFrom(OrderingOp|IndexRel) that gets passed all the way down to the opclass's sortsupport function. It seems like that might be more clear. Once the opclass function has done its thing, the other three new nembers are enough to know whether we're using the optimization or not (and can be fiddled if we want to make a later decision to call the whole thing off). I'm not sure about that. I'd prefer to have tuplesort (and one or two other sites) set the abbreviation is possible in principle flag. Otherwise, sortsupport needs to assume that the leading attribute is going to be the abbreviation-applicable one, which might not always be true. Still, it's not as if I feel strongly about it. -- Peter Geoghegan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] B-Tree support function number 3 (strxfrm() optimization)
On Tue, Nov 25, 2014 at 10:38 AM, Peter Geoghegan p...@heroku.com wrote: - Also, I don't think making abbrev_state an enumerated value with two values is really doing anything for us; we could just use a Boolean. I'm wondering if we should actually go a bit further and remove this from the SortSupport object and instead add an additional Boolean flag to PrepareSortSupportFrom(OrderingOp|IndexRel) that gets passed all the way down to the opclass's sortsupport function. It seems like that might be more clear. Once the opclass function has done its thing, the other three new nembers are enough to know whether we're using the optimization or not (and can be fiddled if we want to make a later decision to call the whole thing off). I'm not sure about that. I'd prefer to have tuplesort (and one or two other sites) set the abbreviation is possible in principle flag. As for the related question of whether or not there should just be a bool in place of an abbreviation state enum: I thought that we might add some more flags to that enum (you'll recall that there actually was another flag in earlier revisions, relating to optimistic tie-breaks with memcmp() that the master branch now always does anyway). But come to think of it, I think it's very unlikely that that flag will ever be extended to represent some now unforeseen state regarding abbreviation. It's either going to be abbreviation applicable for this sort and this attribute or not applicable. So, yes, let's make it a boolean instead. As I think I mentioned at one point, I imagine that if and when we do abbreviation with the numeric opclass, it won't ever abort - its encoding strategy will adapt. That ought to be possible for certain other datatypes, of which numeric is the best example. Although, I think it's only going to be a handful of the most important datatypes that get abbreviation support. -- Peter Geoghegan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] B-Tree support function number 3 (strxfrm() optimization)
On Sat, Oct 11, 2014 at 6:34 PM, Peter Geoghegan p...@heroku.com wrote: Attached patch, when applied, accelerates all tuplesort cases using abbreviated keys, building on previous work here, as well as the patch posted to that other thread. I attach an updated patch set, rebased on top of the master branch's tip. All relevant tuplesort cases (B-Tree, MinimalTuple, CLUSTER) are now directly covered by the patch set, since there is now general sortsupport support for those cases in the master branch -- no need to apply some other patch from some other thread. For the convenience of reviewers, this new revision includes a new approach to making my improvements cumulative: A second commit adds tuple count estimation. This hint, passed along to the text opclass's convert routine, is taken from the optimizer's own estimate, or the relcache's reltuples, depending on the tuplesort case being accelerated. As in previous revisions, the idea is to give the opclass a sense of proportion about how far along it is, to be weighed in deciding whether or not to abort abbreviation. One potentially controversial aspect of that is how the text opclass abbreviation cost model/abort early stuff weighs simply having many tuples - past a certain point, it *always* proceeds with abbreviation, not matter what the cardinality of abbreviated keys so far is. For that reason it particular, it seemed to make sense to split these parts out into a second commit. I hope that we can finish up all 9.5 work on accelerated sorting soon. -- Peter Geoghegan From 78d163220b774218170123208c238e0fe2c07eb6 Mon Sep 17 00:00:00 2001 From: Peter Geoghegan p...@heroku.com Date: Sun, 9 Nov 2014 14:38:44 -0800 Subject: [PATCH 2/2] Estimate total number of rows to be sorted Sortsupport opclasses now accept a row hint, indicating the estimated number of rows to be sorted. This gives opclasses a sense of proportion about how far along the copying of tuples is when considering aborting abbreviation. Estimates come from various sources. The text opclass now always avoids aborting abbreviated if the total number of rows to be sorted is high enough, regardless of anything else. --- src/backend/access/nbtree/nbtree.c | 5 ++- src/backend/access/nbtree/nbtsort.c| 14 +- src/backend/commands/cluster.c | 4 +- src/backend/executor/nodeAgg.c | 5 ++- src/backend/executor/nodeSort.c| 1 + src/backend/utils/adt/orderedsetaggs.c | 2 +- src/backend/utils/adt/varlena.c| 80 -- src/backend/utils/sort/tuplesort.c | 14 -- src/include/access/nbtree.h| 2 +- src/include/utils/sortsupport.h| 7 ++- src/include/utils/tuplesort.h | 6 +-- 11 files changed, 121 insertions(+), 19 deletions(-) diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c index d881525..d26c60b 100644 --- a/src/backend/access/nbtree/nbtree.c +++ b/src/backend/access/nbtree/nbtree.c @@ -109,14 +109,15 @@ btbuild(PG_FUNCTION_ARGS) elog(ERROR, index \%s\ already contains data, RelationGetRelationName(index)); - buildstate.spool = _bt_spoolinit(heap, index, indexInfo-ii_Unique, false); + buildstate.spool = _bt_spoolinit(heap, index, indexInfo-ii_Unique, + indexInfo-ii_Predicate != NIL, false); /* * If building a unique index, put dead tuples in a second spool to keep * them out of the uniqueness check. */ if (indexInfo-ii_Unique) - buildstate.spool2 = _bt_spoolinit(heap, index, false, true); + buildstate.spool2 = _bt_spoolinit(heap, index, false, true, true); /* do the heap scan */ reltuples = IndexBuildHeapScan(heap, index, indexInfo, true, diff --git a/src/backend/access/nbtree/nbtsort.c b/src/backend/access/nbtree/nbtsort.c index c60240f..dd57935 100644 --- a/src/backend/access/nbtree/nbtsort.c +++ b/src/backend/access/nbtree/nbtsort.c @@ -73,6 +73,7 @@ #include storage/smgr.h #include tcop/tcopprot.h #include utils/rel.h +#include utils/selfuncs.h #include utils/tuplesort.h @@ -148,10 +149,13 @@ static void _bt_load(BTWriteState *wstate, * create and initialize a spool structure */ BTSpool * -_bt_spoolinit(Relation heap, Relation index, bool isunique, bool isdead) +_bt_spoolinit(Relation heap, Relation index, bool isunique, bool ispartial, + bool isdead) { BTSpool*btspool = (BTSpool *) palloc0(sizeof(BTSpool)); int btKbytes; + double estRows; + float4 relTuples; btspool-heap = heap; btspool-index = index; @@ -164,10 +168,16 @@ _bt_spoolinit(Relation heap, Relation index, bool isunique, bool isdead) * unique index actually requires two BTSpool objects. We expect that the * second one (for dead tuples) won't get very full, so we give it only * work_mem. + * + * Certain cases will always have a relTuples of 0, such as reindexing as + * part of a CLUSTER operation, or when reindexing toast tables. This is + * interpreted as no estimate available. */
Re: [HACKERS] B-Tree support function number 3 (strxfrm() optimization)
On Mon, Sep 29, 2014 at 10:34 PM, Peter Geoghegan p...@heroku.com wrote: single sortsupport state patch. You probably noticed that I posted an independently useful patch to make all tuplesort cases use sortsupport [1] - currently, both the B-Tree and CLUSTER cases do not use the sortsupport infrastructure more or less for no good reason. That patch was primarily written to make abbreviated keys work for all cases, though. I think that making heap tuple sorts based on a text attribute much faster is a very nice thing, but in practice most OLTP or web applications are not all that sensitive to the amount of time taken to sort heap tuples. However, the length of time it takes to build indexes is something that most busy production applications are very sensitive to, since of course apart from the large consumption of system resources often required, however long the sort takes is virtually the same amount of time as however long we hold a very heavy, disruptive relation-level ShareLock. Obviously the same applies to CLUSTER, but more so, since it must acquire an AccessExclusiveLock on the relation to be reorganized. I think almost everyone will agree that making B-Tree builds much faster is the really compelling case here, because that's where slow sorts cause by far the most pain for users in the real world. Attached patch, when applied, accelerates all tuplesort cases using abbreviated keys, building on previous work here, as well as the patch posted to that other thread. Exact instructions are in the commit message of 0004-* (i.e. where to find the pieces I haven't posted here). I also attach a minor bitrot fix commit/patch. Performance is improved for B-Tree index builds by a great deal, too. The improvements are only slightly less than those seen for comparable heap tuple sorts (that is, my earlier test cases that had client overhead removed). With larger sorts, that difference tends to get lost in the noise easily. I'm very encouraged by this. I think that being able to build B-Tree indexes on text attributes very significantly faster than previous versions of PostgreSQL is likely to be a significant feature for PostgreSQL 9.5. After all, external sorts are where improvements are most noticeable [2] - they're so much faster with this patch that they're actually sometimes faster than internal sorts *with* abbreviated keys. This would something that I found quite surprising. [1] http://www.postgresql.org/message-id/CAM3SWZTfKZHTUiWDdHg+6tcYuMsdHoC=bmuaivgmp9athj4...@mail.gmail.com [2] http://www.postgresql.org/message-id/cam3swzqvjcgme6ube-ydipu0n5bo7rmz31zrhmskdduynej...@mail.gmail.com -- Peter Geoghegan From 72070126e707cf356ddcb3de60cbdb14658ed077 Mon Sep 17 00:00:00 2001 From: Peter Geoghegan p...@heroku.com Date: Fri, 10 Oct 2014 22:26:21 -0700 Subject: [PATCH 4/4] Make B-Tree/CLUSTER sortsupport use abbreviation This commit is intended to be applied on top of several others. First, apply the abbreviated key support patch -- the revision posted here: cam3swzt+v3llrbscyuj9jovumdpbq6bupf9wofbfuywgwuh...@mail.gmail.com. Then, apply the bitrot-fixing commit that was attached alongside this patch. In addition, the B-Tree/CLUSTER sortsupport patch must then be applied. Get it from: CAM3SWZTfKZHTUiWDdHg+6tcYuMsdHoC=bmuaivgmp9athj4...@mail.gmail.com Finally, apply this patch. B-Tree sortsupport with abbreviated keys (for text) shows benefits (and costs) that are similar to the heavily tested/benchmarked MinimalTuple case. There is additional overhead when building a B-Tree index as compared to heap tuplesorts (with no client overhead), but even still the vast majority of system time is spent actually sorting index tuples. Therefore, it is not expected that the addition of B-Tree/CLUSTER support will influence the review of abbreviated keys much either way. --- src/backend/access/nbtree/nbtsort.c | 3 + src/backend/utils/sort/tuplesort.c | 385 +++- 2 files changed, 298 insertions(+), 90 deletions(-) diff --git a/src/backend/access/nbtree/nbtsort.c b/src/backend/access/nbtree/nbtsort.c index a6c44a7..1ccc9fb 100644 --- a/src/backend/access/nbtree/nbtsort.c +++ b/src/backend/access/nbtree/nbtsort.c @@ -720,6 +720,9 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2) strategy = (scanKey-sk_flags SK_BT_DESC) != 0 ? BTGreaterStrategyNumber : BTLessStrategyNumber; + /* Abbreviation is not supported here */ + sortKey-abbrev_state = ABBREVIATED_KEYS_NO; + PrepareSortSupportFromIndexRel(wstate-index, strategy, sortKey); } diff --git a/src/backend/utils/sort/tuplesort.c b/src/backend/utils/sort/tuplesort.c index 1d57de7..c6a18dc 100644 --- a/src/backend/utils/sort/tuplesort.c +++ b/src/backend/utils/sort/tuplesort.c @@ -453,6 +453,7 @@ struct Tuplesortstate static Tuplesortstate *tuplesort_begin_common(int workMem, bool randomAccess); static void puttuple_common(Tuplesortstate *state, SortTuple *tuple); +static
Re: [HACKERS] B-Tree support function number 3 (strxfrm() optimization)
On Wed, Sep 24, 2014 at 7:04 PM, Peter Geoghegan p...@heroku.com wrote: On Fri, Sep 19, 2014 at 2:54 PM, Peter Geoghegan p...@heroku.com wrote: Probably not - it appears to make very little difference to unoptimized pass-by-reference types whether or not datum1 can be used (see my simulation of Kevin's worst case, for example [1]). Streaming through a not inconsiderable proportion of memtuples again is probably a lot worse. The datum1 optimization (which is not all that old) made a lot of sense when initially introduced, because it avoided chasing through a pointer for pass-by-value types. I think that's its sole justification, though. Just to be clear -- I am blocked on this. Do you really prefer to restart copying heap tuples from scratch in the event of aborting, just to make sure that the datum1 representation is consistently either a pointer to text, or an abbreviated key? I don't think that the costs involved make that worth it, as I've said, but I'm not sure how to resolve that controversy. I suggest that we focus on making sure the abort logic itself is sound. There probably hasn't been enough discussion of that. Once that is resolved, we can revisit the question of whether or not copying should restart to keep the datum1 representation consistent. I suspect that leaving that until later will be easier all around. The top issue on my agenda is figuring out a way to get rid of the extra SortSupport object. I'm not going to commit any version of this patch that uses a second SortSupport for the tiebreak. I doubt anyone else will like that either, but you can try. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] B-Tree support function number 3 (strxfrm() optimization)
On Thu, Sep 25, 2014 at 9:21 AM, Robert Haas robertmh...@gmail.com wrote: The top issue on my agenda is figuring out a way to get rid of the extra SortSupport object. Really? I'm surprised. Clearly the need to restart heap tuple copying from scratch, in order to make the datum1 representation consistent, rather than abandoning datum1 for storing abbreviated keys or pointers entirely is a very important aspect of whether or not we should change that. In turn, that's something that's going to (probably significantly) affect the worst case. Do you have an opinion on that? If you want me to start from scratch, and then have a consistent datum1 representation, and then be able to change the structure of comparetup_heap() as you outline (so as to get rid of the extra SortSupport object), I can do that. My concern is the regression. The datum1 pointer optimization appears to matter very little for pass by value types (traditionally, before abbreviated keys), and so I have a hard time imagining this working out. -- Peter Geoghegan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] B-Tree support function number 3 (strxfrm() optimization)
On Thu, Sep 25, 2014 at 2:05 PM, Peter Geoghegan p...@heroku.com wrote: On Thu, Sep 25, 2014 at 9:21 AM, Robert Haas robertmh...@gmail.com wrote: The top issue on my agenda is figuring out a way to get rid of the extra SortSupport object. Really? I'm surprised. Clearly the need to restart heap tuple copying from scratch, in order to make the datum1 representation consistent, rather than abandoning datum1 for storing abbreviated keys or pointers entirely is a very important aspect of whether or not we should change that. In turn, that's something that's going to (probably significantly) affect the worst case. Do you have an opinion on that? I haven't looked at that part of the patch in detail yet, so... not really. But I don't see why you'd ever need to restart heap tuple copying. At most you'd need to re-extract datum1 from the tuples you have already copied. To find out how much that optimization buys, you should use tuples with many variable-length columns (say, 50) preceding the text column you're sorting on. I won't be surprised if that turns out to be expensive enough to be worth worrying about, but I have not benchmarked it. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] B-Tree support function number 3 (strxfrm() optimization)
On Thu, Sep 25, 2014 at 11:53 AM, Robert Haas robertmh...@gmail.com wrote: I haven't looked at that part of the patch in detail yet, so... not really. But I don't see why you'd ever need to restart heap tuple copying. At most you'd need to re-extract datum1 from the tuples you have already copied. Well, okay, technically it's not restarting heap tuple copying, but it's about the same thing. The point is that you have to stream a significant chunk of the memtuples array through memory again. Not to mention, having infrastructure to do that, and pick up where we left off, which is significantly more code, all to make comparetup_heap() a bit clearer (i.e. making it so that it won't have to think about an extra sortsupport state). To find out how much that optimization buys, you should use tuples with many variable-length columns (say, 50) preceding the text column you're sorting on. I won't be surprised if that turns out to be expensive enough to be worth worrying about, but I have not benchmarked it. Sorry, but I don't follow. I don't think the pertinent question is if it's a noticeable cost. I think the pertinent question is if it's worth it. Doing something about it necessitates a lot of extra memory access. Not doing something about it hardly affects the amount of memory access required, perhaps not at all. -- Peter Geoghegan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] B-Tree support function number 3 (strxfrm() optimization)
On Thu, Sep 25, 2014 at 3:17 PM, Peter Geoghegan p...@heroku.com wrote: To find out how much that optimization buys, you should use tuples with many variable-length columns (say, 50) preceding the text column you're sorting on. I won't be surprised if that turns out to be expensive enough to be worth worrying about, but I have not benchmarked it. Sorry, but I don't follow. I don't think the pertinent question is if it's a noticeable cost. I think the pertinent question is if it's worth it. Doing something about it necessitates a lot of extra memory access. Not doing something about it hardly affects the amount of memory access required, perhaps not at all. I think you're mincing words. If you go back and fix datum1, you'll spend a bunch of effort doing that.If you don't, you'll pay a cost on every comparison to re-find the relevant column inside each tuple. You can compare those costs in a variety of cases, including the one I mentioned, where the latter cost will be relatively high. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] B-Tree support function number 3 (strxfrm() optimization)
On Fri, Sep 19, 2014 at 2:54 PM, Peter Geoghegan p...@heroku.com wrote: Probably not - it appears to make very little difference to unoptimized pass-by-reference types whether or not datum1 can be used (see my simulation of Kevin's worst case, for example [1]). Streaming through a not inconsiderable proportion of memtuples again is probably a lot worse. The datum1 optimization (which is not all that old) made a lot of sense when initially introduced, because it avoided chasing through a pointer for pass-by-value types. I think that's its sole justification, though. Just to be clear -- I am blocked on this. Do you really prefer to restart copying heap tuples from scratch in the event of aborting, just to make sure that the datum1 representation is consistently either a pointer to text, or an abbreviated key? I don't think that the costs involved make that worth it, as I've said, but I'm not sure how to resolve that controversy. I suggest that we focus on making sure the abort logic itself is sound. There probably hasn't been enough discussion of that. Once that is resolved, we can revisit the question of whether or not copying should restart to keep the datum1 representation consistent. I suspect that leaving that until later will be easier all around. -- Peter Geoghegan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] B-Tree support function number 3 (strxfrm() optimization)
On Tue, Sep 16, 2014 at 4:55 PM, Peter Geoghegan p...@heroku.com wrote: On Tue, Sep 16, 2014 at 1:45 PM, Robert Haas robertmh...@gmail.com wrote: Even though our testing seems to indicate that the memcmp() is basically free, I think it would be good to make the effort to avoid doing memcmp() and then strcoll() and then strncmp(). Seems like it shouldn't be too hard. Really? The tie-breaker for the benefit of locales like hu_HU uses strcmp(), not memcmp(). It operates on the now-terminated copies of strings. There is no reason to think that the strings must be the same size for that strcmp(). I'd rather only do the new opportunistic memcmp() == 0 thing when len1 == len2. And I wouldn't like to have to also figure out that it's safe to use the earlier result, because as it happens len1 == len2, or any other such trickery. OK, good point. So committed as-is, then, except that I rewrote the comments, which I felt were excessively long for the amount of code. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] B-Tree support function number 3 (strxfrm() optimization)
On Fri, Sep 19, 2014 at 9:59 AM, Robert Haas robertmh...@gmail.com wrote: OK, good point. So committed as-is, then, except that I rewrote the comments, which I felt were excessively long for the amount of code. Thanks! I look forward to hearing your thoughts on the open issues with the patch as a whole. -- Peter Geoghegan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] B-Tree support function number 3 (strxfrm() optimization)
On Thu, Sep 11, 2014 at 8:34 PM, Peter Geoghegan p...@heroku.com wrote: On Tue, Sep 9, 2014 at 2:25 PM, Robert Haas robertmh...@gmail.com wrote: I like that I don't have to care about every combination, and can treat abbreviation abortion as the special case with the extra step, in line with how I think of the optimization conceptually. Does that make sense? No. comparetup_heap() is hip-deep in this optimization as it stands, and what I proposed - if done correctly - isn't going to make that significantly worse. In fact, it really ought to make things better: you should be able to set things up so that ssup-comparator is always the test that should be applied first, regardless of whether we're aborted or not-aborted or not doing this in the first place; and then ssup-tiebreak_comparator, if not NULL, can be applied after that. I'm not following here. Isn't that at least equally true of what I've proposed? Sure, I'm checking if (!state-abbrevAborted) first, but that's irrelevant to the non-abbreviated case. It has nothing to abort. Also, AFAICT we cannot abort and still call ssup-comparator() indifferently, since sorttuple.datum1 fields are perhaps abbreviated keys in half of all cases (i.e. pre-abort tuples), and uninitialized garbage the other half of the time (i.e. post-abort tuples). You can if you engineer ssup-comparator() to contain the right pointer at the right time. Also, shouldn't you go back and fix up those abbreviated keys to point to datum1 again if you abort? Where is the heap_getattr() stuff supposed to happen for the first attribute to get an authoritative comparison in the event of aborting (if we're calling ssup-comparator() on datum1 indifferently)? We decide that we're going to use abbreviated keys within datum1 fields up-front. When we abort, we cannot use datum1 fields at all (which doesn't appear to matter for text -- the datum1 optimization has historically only benefited pass-by-value types). You always pass datum1 to a function. The code doesn't need to care about whether that function is expecting abbreviated or non-abbreviated datums unless that function returns equality. Then it needs to think about calling a backup comparator if there is one. Is doing all this worth the small saving in memory? Personally, I don't think that it is, but I defer to you. I don't care about the memory; I care about the code complexity and the ease of understanding that code. I am confident that it can be done better than the patch does it today. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] B-Tree support function number 3 (strxfrm() optimization)
On Fri, Sep 19, 2014 at 2:35 PM, Robert Haas robertmh...@gmail.com wrote: Also, shouldn't you go back and fix up those abbreviated keys to point to datum1 again if you abort? Probably not - it appears to make very little difference to unoptimized pass-by-reference types whether or not datum1 can be used (see my simulation of Kevin's worst case, for example [1]). Streaming through a not inconsiderable proportion of memtuples again is probably a lot worse. The datum1 optimization (which is not all that old) made a lot of sense when initially introduced, because it avoided chasing through a pointer for pass-by-value types. I think that's its sole justification, though. BTW, I think that if we ever get around to doing this for numeric, it won't ever abort. The abbreviation strategy can be adaptive, to maximize the number of comparisons successfully resolved with abbreviated keys. This would probably use a streaming algorithm like HyperLogLog too. [1] http://www.postgresql.org/message-id/cam3swzqhjxiyrsqbs5w3u-vtj_jt2hp8o02big5wyb4m9lp...@mail.gmail.com -- Peter Geoghegan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] B-Tree support function number 3 (strxfrm() optimization)
On Mon, Sep 15, 2014 at 7:21 PM, Peter Geoghegan p...@heroku.com wrote: On Mon, Sep 15, 2014 at 11:25 AM, Peter Geoghegan p...@heroku.com wrote: OK, I'll draft a patch for that today, including similar alterations to varstr_cmp() for the benefit of Windows and so on. I attach a much simpler patch, that only adds an opportunistic memcmp() == 0 before a possible strcoll(). Both bttextfastcmp_locale() and varstr_cmp() have the optimization added, since there is no point in leaving anyone out for this part. Even though our testing seems to indicate that the memcmp() is basically free, I think it would be good to make the effort to avoid doing memcmp() and then strcoll() and then strncmp(). Seems like it shouldn't be too hard. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] B-Tree support function number 3 (strxfrm() optimization)
On Tue, Sep 16, 2014 at 1:45 PM, Robert Haas robertmh...@gmail.com wrote: Even though our testing seems to indicate that the memcmp() is basically free, I think it would be good to make the effort to avoid doing memcmp() and then strcoll() and then strncmp(). Seems like it shouldn't be too hard. Really? The tie-breaker for the benefit of locales like hu_HU uses strcmp(), not memcmp(). It operates on the now-terminated copies of strings. There is no reason to think that the strings must be the same size for that strcmp(). I'd rather only do the new opportunistic memcmp() == 0 thing when len1 == len2. And I wouldn't like to have to also figure out that it's safe to use the earlier result, because as it happens len1 == len2, or any other such trickery. The bug fix that added the strcmp() tie-breaker was committed in 2005. PostgreSQL had locale support for something like 8 years prior, and it took that long for us to notice the problem. I would suggest that makes the case for doing anything else pretty marginal. In the bug report at the time, len1 != len2 anyway. -- Peter Geoghegan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] B-Tree support function number 3 (strxfrm() optimization)
On 09/14/2014 11:34 PM, Peter Geoghegan wrote: On Sun, Sep 14, 2014 at 7:37 AM, Heikki Linnakangas hlinnakan...@vmware.com wrote: Both values vary in range 5.9 - 6.1 s, so it's fair to say that the useless memcmp() is free with these parameters. Is this the worst case scenario? Other than pushing the differences much much later in the strings (which you surely thought of already), yes. Please test the absolute worst case scenario you can think of. As I said earlier, if you can demonstrate that the slowdown of that is acceptable, we don't even need to discuss how likely or realistic the case is. - Heikki -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] B-Tree support function number 3 (strxfrm() optimization)
On Sun, Sep 14, 2014 at 10:37 AM, Heikki Linnakangas hlinnakan...@vmware.com wrote: On 09/13/2014 11:28 PM, Peter Geoghegan wrote: Anyway, attached rough test program implements what you outline. This is for 30,000 32 byte strings (where just the final two bytes differ). On my laptop, output looks like this (edited to only show median duration in each case): Got to be careful to not let the compiler optimize away microbenchmarks like this. At least with my version of gcc, the strcoll calls get optimized away, as do the memcmp calls, if you don't use the result for anything. Clang was even more aggressive; it ran both comparisons in 0.0 seconds. Apparently it optimizes away the loops altogether. Also, there should be a setlocale(LC_ALL, ) call somewhere. Otherwise it runs in C locale, and we don't use strcoll() at all for C locale. After fixing those, it runs much slower, so I had to reduce the number of strings. Here's a fixed program. I'm now getting numbers like this: (baseline) duration of comparisons without useless memcmp()s: 6.007368 seconds duration of comparisons with useless memcmp()s: 6.079826 seconds Both values vary in range 5.9 - 6.1 s, so it's fair to say that the useless memcmp() is free with these parameters. Is this the worst case scenario? I can't see a worse one, and I replicated your results here on my MacBook Pro. I also tried with 1MB strings and, surprisingly, it was basically free there, too. It strikes me that perhaps we should make this change (rearranging things so that the memcmp tiebreak is run before strcoll) first, before dealing with the rest of the abbreviated keys infrastructure. It appears to be a separate improvement which is worthwhile independently of what we do about that patch. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] B-Tree support function number 3 (strxfrm() optimization)
On Mon, Sep 15, 2014 at 10:17 AM, Robert Haas robertmh...@gmail.com wrote: It strikes me that perhaps we should make this change (rearranging things so that the memcmp tiebreak is run before strcoll) first, before dealing with the rest of the abbreviated keys infrastructure. It appears to be a separate improvement which is worthwhile independently of what we do about that patch. I guess we could do that, but AFAICT the only open item blocking the commit of a basic version of abbreviated keys (the informally agreed to basic version lacking support for single-attribute aggregates) is what to do about the current need to create a separate sortsupport state. I've talked about my thoughts on that question in detail now [1]. BTW, you probably realize this, but we still need a second memcmp() after strcoll() too. hu_HU will care about that [2]. [1] http://www.postgresql.org/message-id/cam3swzqcdcnfwd3qzoo4qmy4g8okhuqyrd26bbla7fl2x-n...@mail.gmail.com [2] http://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=656beff59033ccc5261a615802e1a85da68e8fad -- Peter Geoghegan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] B-Tree support function number 3 (strxfrm() optimization)
On Mon, Sep 15, 2014 at 1:34 PM, Peter Geoghegan p...@heroku.com wrote: On Mon, Sep 15, 2014 at 10:17 AM, Robert Haas robertmh...@gmail.com wrote: It strikes me that perhaps we should make this change (rearranging things so that the memcmp tiebreak is run before strcoll) first, before dealing with the rest of the abbreviated keys infrastructure. It appears to be a separate improvement which is worthwhile independently of what we do about that patch. I guess we could do that, but AFAICT the only open item blocking the commit of a basic version of abbreviated keys (the informally agreed to basic version lacking support for single-attribute aggregates) is what to do about the current need to create a separate sortsupport state. I've talked about my thoughts on that question in detail now [1]. I think there's probably more than that to work out, but in any case there's no harm in getting a simple optimization done first before moving on to a complicated one. BTW, you probably realize this, but we still need a second memcmp() after strcoll() too. hu_HU will care about that [2]. [1] http://www.postgresql.org/message-id/cam3swzqcdcnfwd3qzoo4qmy4g8okhuqyrd26bbla7fl2x-n...@mail.gmail.com [2] http://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=656beff59033ccc5261a615802e1a85da68e8fad I rather assume we could reuse the results of the first memcmp() instead of doing it again. x = memcmp(); if (x == 0) return x; y = strcoll(); if (y == 0) return x; return y; -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] B-Tree support function number 3 (strxfrm() optimization)
On Mon, Sep 15, 2014 at 10:53 AM, Robert Haas robertmh...@gmail.com wrote: I think there's probably more than that to work out, but in any case there's no harm in getting a simple optimization done first before moving on to a complicated one. I guess we never talked about the abort logic in all that much detail. I suppose there's that, too. I rather assume we could reuse the results of the first memcmp() instead of doing it again. x = memcmp(); if (x == 0) return x; y = strcoll(); if (y == 0) return x; return y; Of course, but you know what I mean. (I'm sure the compiler will realize this if the programmer doesn't) -- Peter Geoghegan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] B-Tree support function number 3 (strxfrm() optimization)
On Mon, Sep 15, 2014 at 1:55 PM, Peter Geoghegan p...@heroku.com wrote: On Mon, Sep 15, 2014 at 10:53 AM, Robert Haas robertmh...@gmail.com wrote: I think there's probably more than that to work out, but in any case there's no harm in getting a simple optimization done first before moving on to a complicated one. I guess we never talked about the abort logic in all that much detail. I suppose there's that, too. Well, the real point is that from where I'm sitting, this... x = memcmp(); if (x == 0) return x; y = strcoll(); if (y == 0) return x; return y; ...looks like about a 10-line patch. We have the data to show that the loss is trivial even in the worst case, and we have or should be able to get data showing that the best-case win is significant even without the abbreviated key stuff. If you'd care to draft a patch for just that, I assume we could get it committed in a day or two, whereas I'm quite sure that considerably more work than that remains for the main patch. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] B-Tree support function number 3 (strxfrm() optimization)
On Mon, Sep 15, 2014 at 11:20 AM, Robert Haas robertmh...@gmail.com wrote: ...looks like about a 10-line patch. We have the data to show that the loss is trivial even in the worst case, and we have or should be able to get data showing that the best-case win is significant even without the abbreviated key stuff. If you'd care to draft a patch for just that, I assume we could get it committed in a day or two, whereas I'm quite sure that considerably more work than that remains for the main patch. OK, I'll draft a patch for that today, including similar alterations to varstr_cmp() for the benefit of Windows and so on. -- Peter Geoghegan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] B-Tree support function number 3 (strxfrm() optimization)
On Mon, Sep 15, 2014 at 11:25 AM, Peter Geoghegan p...@heroku.com wrote: OK, I'll draft a patch for that today, including similar alterations to varstr_cmp() for the benefit of Windows and so on. I attach a much simpler patch, that only adds an opportunistic memcmp() == 0 before a possible strcoll(). Both bttextfastcmp_locale() and varstr_cmp() have the optimization added, since there is no point in leaving anyone out for this part. When this is committed, and I hear back from you on the question of what to do about having an independent sortsupport state for abbreviated tie-breakers (and possibly other issues of concern), I'll produce a rebased patch with a single commit. -- Peter Geoghegan diff --git a/src/backend/utils/adt/varlena.c b/src/backend/utils/adt/varlena.c new file mode 100644 index 7549542..53be8ec *** a/src/backend/utils/adt/varlena.c --- b/src/backend/utils/adt/varlena.c *** varstr_cmp(char *arg1, int len1, char *a *** 1405,1410 --- 1405,1438 #endif } + /* + * Fast path: Attempt an entirely opportunistic memcmp() == 0. + * + * In general there is no reason to be optimistic about being able to + * resolve the string comparison in this way. There are many usage + * patterns in which this fast path will never be taken, and some in + * which it will frequently be taken. When it is frequently taken, + * clearly the optimization is worthwhile, since strcoll() is known to + * be very expensive. When it is not taken, the overhead is completely + * negligible, and perhaps even immeasurably small, even in the worst + * case. We have nothing to lose and everything to gain. + * + * These counter-intuitive performance characteristics are likely + * explained by the dominant memory latency cost when performing a + * strcoll(); in practice, this may give the compiler and CPU + * sufficient leeway to order things in a way that hides the overhead + * of useless memcmp() calls. Sorting is very probably bottlenecked on + * memory bandwidth, and so it's worth it to use a few arithmetic + * instructions on the off chance that it results in a saving in memory + * bandwidth. In effect, this makes use of a capacity that would + * otherwise go unused. The same memory needed to be read into CPU + * cache at this juncture anyway. Modern CPUs can execute hundreds of + * instructions in the time taken to fetch a single cache line from + * main memory. + */ + if (len1 == len2 memcmp(arg1, arg2, len1) == 0) + return 0; + #ifdef WIN32 /* Win32 does not have UTF-8, so we need to map to UTF-16 */ if (GetDatabaseEncoding() == PG_UTF8) *** bttextfastcmp_locale(Datum x, Datum y, S *** 1842,1847 --- 1870,1885 len1 = VARSIZE_ANY_EXHDR(arg1); len2 = VARSIZE_ANY_EXHDR(arg2); + /* + * Fast path: Attempt an entirely opportunistic memcmp() == 0, as + * explained within varstr_cmp() + */ + if (len1 == len2 memcmp(a1p, a2p, len1) == 0) + { + result = 0; + goto done; + } + if (len1 = tss-buflen1) { pfree(tss-buf1); *** bttextfastcmp_locale(Datum x, Datum y, S *** 1875,1880 --- 1913,1919 if (result == 0) result = strcmp(tss-buf1, tss-buf2); + done: /* We can't afford to leak memory here. */ if (PointerGetDatum(arg1) != x) pfree(arg1); -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] B-Tree support function number 3 (strxfrm() optimization)
On Mon, Sep 15, 2014 at 4:21 PM, Peter Geoghegan p...@heroku.com wrote: I attach a much simpler patch, that only adds an opportunistic memcmp() == 0 before a possible strcoll(). Both bttextfastcmp_locale() and varstr_cmp() have the optimization added, since there is no point in leaving anyone out for this part. FWIW, it occurs to me that this could be a big win for cases like ExecMergeJoin(). Obviously, abbreviated keys will usually make your merge join on text attributes a lot faster in the common case where a sort is involved (if we consider that the sort is integral to the cost of the join). However, when making sure that inner and outer tuples match, the MJCompare() call will *very* frequently get away with a cheap memcmp(). That could make a very significant additional difference, I think. -- Peter Geoghegan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] B-Tree support function number 3 (strxfrm() optimization)
On 09/13/2014 11:28 PM, Peter Geoghegan wrote: Anyway, attached rough test program implements what you outline. This is for 30,000 32 byte strings (where just the final two bytes differ). On my laptop, output looks like this (edited to only show median duration in each case): Got to be careful to not let the compiler optimize away microbenchmarks like this. At least with my version of gcc, the strcoll calls get optimized away, as do the memcmp calls, if you don't use the result for anything. Clang was even more aggressive; it ran both comparisons in 0.0 seconds. Apparently it optimizes away the loops altogether. Also, there should be a setlocale(LC_ALL, ) call somewhere. Otherwise it runs in C locale, and we don't use strcoll() at all for C locale. After fixing those, it runs much slower, so I had to reduce the number of strings. Here's a fixed program. I'm now getting numbers like this: (baseline) duration of comparisons without useless memcmp()s: 6.007368 seconds duration of comparisons with useless memcmp()s: 6.079826 seconds Both values vary in range 5.9 - 6.1 s, so it's fair to say that the useless memcmp() is free with these parameters. Is this the worst case scenario? - Heikki #include ctype.h #include stdio.h #include stdlib.h #include string.h #include sys/time.h #include locale.h /* STRING_SIZE does not include NUL byte */ #define STRING_SIZE 32 #define N_STRINGS 2000 #define LAST_N_DIFFER 2 #define INSTR_TIME_SUBTRACT(x,y) \ do { \ (x).tv_sec -= (y).tv_sec; \ (x).tv_usec -= (y).tv_usec; \ /* Normalize */ \ while ((x).tv_usec 0) \ { \ (x).tv_usec += 100; \ (x).tv_sec--; \ } \ } while (0) #define INSTR_TIME_GET_DOUBLE(t) \ (((double) (t).tv_sec) + ((double) (t).tv_usec) / 100.0) #define Min(x, y) ((x) (y) ? (x) : (y)) /* * Generate single random alphabetical ASCII char. Uses an unseeded rand() * call, to ensure inter-run determinism. */ static inline char generate_prandom_printable_char(void) { for (;;) { char crand = rand() 0x7F; if ((crand = 'A' crand = 'Z') || (crand = 'a' crand = 'z')) return crand; } } /* * Generate a random payload string. The returned string is pre-calcuated * once, to form the bulk of each distinct string. */ static inline char * generate_prandom_payload() { char *payload = malloc(STRING_SIZE + 1); int i; /* * The final LAST_N_DIFFER bytes will ultimately be clobbered with distinct * bytes, but that occurs separately, per string. */ for (i = 0; i STRING_SIZE; i++) payload[i] = generate_prandom_printable_char(); /* * Add NUL here, even though complete strings are not NULL terminated (or, * rather, are copied into a buffer on the stack in order to add a NUL once * per comparison) */ payload[STRING_SIZE] = '\0'; return payload; } int main(int argc, const char *argv[]) { char **strings = malloc(N_STRINGS * sizeof(char *)); char *payload = generate_prandom_payload(); int i, j; int nmatches = 0; setlocale(LC_ALL, ); /* Initialize strings */ for (i = 0; i N_STRINGS; i++) { int j; strings[i] = malloc(STRING_SIZE); memcpy(strings[i], payload, STRING_SIZE); /* Last LAST_N_DIFFER characters randomly vary */ for (j = LAST_N_DIFFER; j 0; j--) { char n_last = generate_prandom_printable_char(); strings[i][STRING_SIZE - j] = n_last; } /* Don't terminate -- no room in buffer */ } printf(Strings generated - beginning tests\n); for (;;) { struct timeval before, after; /* ** * Baseline -- no wasted memcmp(). * * Compare each string to each other string using only strcoll() ** */ gettimeofday(before, NULL); for (i = 0; i N_STRINGS; i++) { char *str = strings[i]; /* Quadratic string comparison */ for (j = 0; j N_STRINGS; j++) { int cmp; char buf1[STRING_SIZE + 1]; char buf2[STRING_SIZE + 1]; char *other; other = strings[j]; memcpy(buf1, str, STRING_SIZE); memcpy(buf2, other, STRING_SIZE); buf1[STRING_SIZE] = '\0'; buf2[STRING_SIZE] = '\0'; cmp = strcoll(buf1, buf2); if (cmp == 0) nmatches++; /* * memcmp() tie-breaker (equivalent to the one that currently * appears within varstr_cmp()) isn't considered. It's only * relevant to a small number of locales, like hu_HU, where * strcoll() might (rarely) indicate equality in respect of a * pair of non-identical strings. If strcoll() did return 0, * then for most locales it's certain that the first memcmp() * would have worked out. */ } } /* Report no-wasted-memcmp case duration */ gettimeofday(after, NULL); INSTR_TIME_SUBTRACT(after, before); printf((baseline) duration of comparisons without useless memcmp()s: %f seconds\n\n, INSTR_TIME_GET_DOUBLE(after)); /*
Re: [HACKERS] B-Tree support function number 3 (strxfrm() optimization)
On Sun, Sep 14, 2014 at 7:37 AM, Heikki Linnakangas hlinnakan...@vmware.com wrote: Got to be careful to not let the compiler optimize away microbenchmarks like this. At least with my version of gcc, the strcoll calls get optimized away, as do the memcmp calls, if you don't use the result for anything. Clang was even more aggressive; it ran both comparisons in 0.0 seconds. Apparently it optimizes away the loops altogether. I suppose the fact that I saw results that fit my pre-conceived notion of what was happening made me lose my initial concern about that. Also, there should be a setlocale(LC_ALL, ) call somewhere. Otherwise it runs in C locale, and we don't use strcoll() at all for C locale. Oops. This might be a useful mistake, though -- if the strcoll() using the C locale is enough to make the memcmp() not free, then that suggests that strcoll() is the hiding place for the useless memcmp(), where instructions relating to the memcmp() can execute in parallel to instructions relating to strcoll() that add latency from memory accesses (for non-C locales). With the C locale, strcoll() is equivalent to strcmp()/memcmp(). Commenting out the setlocale(LC_ALL, ) in your revised versions shows something like my original numbers (so I guess my compiler wasn't smart enough to optimize away the strcoll() + memcmp() cases). Whereas, there is no noticeable regression/difference between each case when I run the revised program unmodified. That seems to prove that strcoll() is a good enough hiding place. Both values vary in range 5.9 - 6.1 s, so it's fair to say that the useless memcmp() is free with these parameters. Is this the worst case scenario? Other than pushing the differences much much later in the strings (which you surely thought of already), yes. I think it's worse than the worst, because we've boiled this down to just the comparison part, leaving only the strcoll() as a hiding place, which is evidently good enough. I thought that it was important that there be an unpredictable access pattern (characteristic of quicksort), so that memory latency is added here and there. I'm happy to learn that I was wrong about that, and that a strcoll() alone hides the would-be memcmp() latency. Large strings matter much less anyway, I think. If you have a pair of strings both longer than CACHE_LINE_SIZE bytes, and the first CACHE_LINE_SIZE bytes are identical, and the lengths are known to match, it seems like a very sensible bet to anticipate that they're fully equal. So in a world where that affects the outcome of this test program, I think it still changes nothing (if, indeed, it matters at all, which it appears not to anyway, at least with 256 byte strings). We should probably do the a fully opportunistic memcmp() == 0 within varstr_cmp() itself, so that Windows has the benefit of this too, as well as callers like compareJsonbScalarValue(). Actually, looking at it closely, I think that there might still be a microscopic regression, as there might have also been with my variant of your SQL test case [1] - certainly in the noise, but perhaps measurable with enough runs. If there is, that seems like an acceptable price to pay. When I test this stuff, I'm now very careful about power management settings on my laptop...there are many ways to be left with egg on your face with this kind of benchmark. [1] http://www.postgresql.org/message-id/cam3swzqy95sow00b+zjycrgmr-uf1mz8ryv4_ou2encvstn...@mail.gmail.com -- Peter Geoghegan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] B-Tree support function number 3 (strxfrm() optimization)
On Fri, Sep 12, 2014 at 11:38 AM, Robert Haas robertmh...@gmail.com wrote: Based on discussion thus far it seems that there's a possibility that the trade-off may be different for short strings vs. long strings. If the string is small enough to fit in the L1 CPU cache, then it may be that memcmp() followed by strcoll() is not much more expensive than strcoll(). That should be easy to figure out: write a standalone C program that creates a bunch of arbitrary, fairly-short strings, say 32 bytes, in a big array. Make the strings different near the end, but not at the beginning. Then have the program either do strcoll() on every pair (O(n^2)) or, with a #define, memcmp() followed by strcoll() on every pair. It should be easy to see whether the memcmp() adds 1% or 25% or 100%. I get where you're coming from now (I think). It seems like you're interested in getting a sense of what it would cost to do an opportunistic memcmp() in a universe where memory latency isn't by far the dominant cost (which it clearly is, as evidenced by my most recent benchmark where a variant of Heikki's highly unsympathetic SQL test case showed a ~1% regression). You've described a case with totally predictable access patterns, perfectly suited to prefetching, and with no other memory access bottlenecks. As I've said, this seems reasonable (at least with those caveats in mind). The answer to your high level question appears to be: the implementation (the totally opportunistic memcmp() == 0 thing) benefits from the fact that we're always bottlenecked on memory, and to a fairly appreciable degree. I highly doubt that this is something that can fail to work out with real SQL queries, but anything that furthers our understanding of the optimization is a good thing. Of course, within tuplesort we're not really getting the totally opportunistic memcmp()s for free - rather, we're using a resource that we would not otherwise be able to use at all. This graph illustrates the historic trends of CPU and memory performance: http://www.cs.virginia.edu/stream/stream_logo.gif I find this imbalance quite remarkable - no wonder researchers are finding ways to make the best of the situation. To make matters worse, the per-core trends for memory bandwidth are now actually *negative growth*. We may actually be going backwards, if we assume that that's the bottleneck, and that we cannot parallelize things. Anyway, attached rough test program implements what you outline. This is for 30,000 32 byte strings (where just the final two bytes differ). On my laptop, output looks like this (edited to only show median duration in each case): Strings generated - beginning tests (baseline) duration of comparisons without useless memcmp()s: 13.445506 seconds duration of comparisons with useless memcmp()s: 17.720501 seconds It looks like about a 30% increase in run time when we always have useless memcmps(). You can change the constants around easily - let's consider 64 KiB strings now (by changing STRING_SIZE to 65536). In order to make the program not take too long, I also reduce the number of strings (N_STRINGS) from 30,000 to 1,000. If I do so, this is what I see as output: Strings generated - beginning tests (baseline) duration of comparisons without useless memcmp()s: 11.205683 seconds duration of comparisons with useless memcmp()s: 14.542997 seconds It looks like the overhead here is surprisingly consistent with the first run - again, about a 30% increase in run time. As for 1MiB strings (this time, with an N_STRINGS of 300): Strings generated - beginning tests (baseline) duration of comparisons without useless memcmp()s: 23.624183 seconds duration of comparisons with useless memcmp()s: 35.332728 seconds So, at this scale, the overhead gets quite a bit worse, but the case also becomes quite a bit less representative (if that's even possible). I suspect that the call stack's stupidly large size may be a problem here, but I didn't try and fix that. Does this answer your question? Are you intent on extrapolating across different platforms based on this test program, rather than looking at real SQL queries? While a certain amount of that makes sense, I think we should focus on queries that have some chance of being seen in real production PostgreSQL instances. Failing that, actual SQL queries. -- Peter Geoghegan #include ctype.h #include stdio.h #include stdlib.h #include string.h #include sys/time.h /* STRING_SIZE does not include NUL byte */ #define STRING_SIZE 32 #define N_STRINGS 3 #define LAST_N_DIFFER 2 #define INSTR_TIME_SUBTRACT(x,y) \ do { \ (x).tv_sec -= (y).tv_sec; \ (x).tv_usec -= (y).tv_usec; \ /* Normalize */ \ while ((x).tv_usec 0) \ { \ (x).tv_usec += 100; \ (x).tv_sec--; \ } \ } while (0) #define INSTR_TIME_GET_DOUBLE(t) \ (((double) (t).tv_sec) + ((double) (t).tv_usec) / 100.0) #define Min(x, y) ((x) (y) ? (x) : (y)) /* * Generate single random alphabetical ASCII
Re: [HACKERS] B-Tree support function number 3 (strxfrm() optimization)
On 09/12/2014 12:46 AM, Peter Geoghegan wrote: On Thu, Sep 11, 2014 at 1:50 PM, Robert Haas robertmh...@gmail.com wrote: I think I said pretty clearly that it was. I agree that you did, but it wasn't clear exactly what factors you were asking me to simulate. All factors. Do you want me to compare the same string a million times in a loop, both with a strcoll() and with a memcmp()? Yes. Should I copy it into a buffer to add a NUL byte? Yes. Or should it be a new string each time, with a cache miss expected some proportion of the time? Yes. I'm being facetious - it's easy to ask for tests when you're not the one running them. But seriously, please do run the all the tests that you think make sense. I'm particularly interested in the worst case. What is the worst case for the proposed memcmp() check? Test that. If the worst case regresses significantly, then we need to have a discussion of how likely that worst case is to happen in real life, what the performance is like in more realistic almost-worst-case scenarios, does it need to be tunable, is the trade-off worth it, etc. But if the worst case regresses less than, say, 1%, and there are some other cases where you get a 300% speed up, then I think it's safe to say that the optimization is worth it, without any more testing or discussion. - Heikki -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] B-Tree support function number 3 (strxfrm() optimization)
On Fri, Sep 12, 2014 at 5:28 AM, Heikki Linnakangas hlinnakan...@vmware.com wrote: On 09/12/2014 12:46 AM, Peter Geoghegan wrote: On Thu, Sep 11, 2014 at 1:50 PM, Robert Haas robertmh...@gmail.com wrote: I think I said pretty clearly that it was. I agree that you did, but it wasn't clear exactly what factors you were asking me to simulate. All factors. Do you want me to compare the same string a million times in a loop, both with a strcoll() and with a memcmp()? Yes. Should I copy it into a buffer to add a NUL byte? Yes. Or should it be a new string each time, with a cache miss expected some proportion of the time? Yes. I'm being facetious - it's easy to ask for tests when you're not the one running them. But seriously, please do run the all the tests that you think make sense. I'm particularly interested in the worst case. What is the worst case for the proposed memcmp() check? Test that. If the worst case regresses significantly, then we need to have a discussion of how likely that worst case is to happen in real life, what the performance is like in more realistic almost-worst-case scenarios, does it need to be tunable, is the trade-off worth it, etc. But if the worst case regresses less than, say, 1%, and there are some other cases where you get a 300% speed up, then I think it's safe to say that the optimization is worth it, without any more testing or discussion. +1 to all that, including the facetious parts. Based on discussion thus far it seems that there's a possibility that the trade-off may be different for short strings vs. long strings. If the string is small enough to fit in the L1 CPU cache, then it may be that memcmp() followed by strcoll() is not much more expensive than strcoll(). That should be easy to figure out: write a standalone C program that creates a bunch of arbitrary, fairly-short strings, say 32 bytes, in a big array. Make the strings different near the end, but not at the beginning. Then have the program either do strcoll() on every pair (O(n^2)) or, with a #define, memcmp() followed by strcoll() on every pair. It should be easy to see whether the memcmp() adds 1% or 25% or 100%. Then, repeat the same thing with strings that are big enough to blow out the L1 cache, say 1MB in length. Some intermediate sizes (64kB?) might be worth testing, too. Again, it should be easy to see what the overhead is. Once we know that, we can make intelligent decisions about whether this is a good idea or not, and when. If you attach the test programs, other people (e.g. me) can also try them on other systems (e.g. MacOS X) to see whether the characteristics there are different than what you saw on your system. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] B-Tree support function number 3 (strxfrm() optimization)
On Fri, Sep 12, 2014 at 11:38 AM, Robert Haas robertmh...@gmail.com wrote: Based on discussion thus far it seems that there's a possibility that the trade-off may be different for short strings vs. long strings. If the string is small enough to fit in the L1 CPU cache, then it may be that memcmp() followed by strcoll() is not much more expensive than strcoll(). That should be easy to figure out: write a standalone C program that creates a bunch of arbitrary, fairly-short strings, say 32 bytes, in a big array. While I think that's fair, the reason I didn't bother playing tricks with only doing a (purely) opportunistic memcmp() when the string size is under (say) CACHE_LINE_SIZE bytes is that in order for it to matter you'd have to have a use case where the first CACHE_LINE_SIZE of bytes matched, and the string just happened to be identical in length, but also ultimately differed at least a good fraction of the time. That seems like the kind of thing that it's okay to care less about. That might have been regressed worse than what you've seen already. It's narrow in a whole new dimension, though. The intersection of that issue, and the issues exercised by Heikki's existing test case must be exceedingly rare. I'm still confused about whether or not we're talking at cross purposes here, Robert. Are you happy to consider this as a separate and additional question to the question of what to do in an abbreviated comparison tie-break? The correlated multiple sort key attributes case strikes me as very common - it's a nice to have, and will sometimes offer a nice additional boost. On the other hand, doing this for abbreviated comparison tie-breakers is more or less fundamental to the patch. In my cost model, a memcmp() abbreviated key tie-breaker than works out is equivalent to an abbreviated comparison. This is a bit of a fudge, but close enough. BTW, I do appreciate your work on this. I realize that if you didn't give this patch a fair go, there is a chance that no one else would. -- Peter Geoghegan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] B-Tree support function number 3 (strxfrm() optimization)
On Fri, Sep 12, 2014 at 2:58 PM, Peter Geoghegan p...@heroku.com wrote: On Fri, Sep 12, 2014 at 11:38 AM, Robert Haas robertmh...@gmail.com wrote: Based on discussion thus far it seems that there's a possibility that the trade-off may be different for short strings vs. long strings. If the string is small enough to fit in the L1 CPU cache, then it may be that memcmp() followed by strcoll() is not much more expensive than strcoll(). That should be easy to figure out: write a standalone C program that creates a bunch of arbitrary, fairly-short strings, say 32 bytes, in a big array. While I think that's fair, the reason I didn't bother playing tricks with only doing a (purely) opportunistic memcmp() when the string size is under (say) CACHE_LINE_SIZE bytes is that in order for it to matter you'd have to have a use case where the first CACHE_LINE_SIZE of bytes matched, and the string just happened to be identical in length, but also ultimately differed at least a good fraction of the time. That seems like the kind of thing that it's okay to care less about. That might have been regressed worse than what you've seen already. It's narrow in a whole new dimension, though. The intersection of that issue, and the issues exercised by Heikki's existing test case must be exceedingly rare. I'm still confused about whether or not we're talking at cross purposes here, Robert. Are you happy to consider this as a separate and additional question to the question of what to do in an abbreviated comparison tie-break? I think I've said a few times now that I really want to get this additional data before forming an opinion. As a certain Mr. Doyle writes, It is a capital mistake to theorize before one has data. Insensibly one begins to twist facts to suit theories, instead of theories to suit facts. I can't say it any better than that. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] B-Tree support function number 3 (strxfrm() optimization)
On Fri, Sep 12, 2014 at 12:02 PM, Robert Haas robertmh...@gmail.com wrote: I think I've said a few times now that I really want to get this additional data before forming an opinion. As a certain Mr. Doyle writes, It is a capital mistake to theorize before one has data. Insensibly one begins to twist facts to suit theories, instead of theories to suit facts. I can't say it any better than that. Well, in the abbreviated key case we might know that with probability 0.9 that the memcmp() == 0 thing will work out. In the non-abbreviated tie-breaker case, we'll definitely know nothing. That seems like a pretty fundamental distinction, so I don't think it's premature to ask you to consider those two questions individually. Still, maybe it's easier to justify both cases in the same way, if we can. -- Peter Geoghegan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] B-Tree support function number 3 (strxfrm() optimization)
On Wed, Sep 10, 2014 at 11:36 AM, Robert Haas robertmh...@gmail.com wrote: No, not really. All you have to do is right a little test program to gather the information. I don't think a little test program is useful - IMV it's too much of a simplification to suppose that a strcoll() has a fixed cost, and a memcmp() has a fixed cost, and that we can determine algebraically that we should (say) proceed or not proceed with the additional opportunistic memcmp() == 0 optimization based solely on that. I'm not sure if that's what you meant, but it might have been. Temporal locality is surely a huge factor here, for example. Are we talking about a memcmp() that always immediately precedes a similar strcoll() call on the same memory? Are we factoring in the cost of NUL-termination in order to make each strcoll() possible? And that's just for starters. However, I think it's perfectly fair to consider a case where the opportunistic memcmp() == 0 thing never works out (as opposed to mostly not helping, which is what I considered earlier [1]), as long as we're sorting real tuples. You mentioned Heikki's test case; it seems fair to consider that, but for the non-abbreviated case where the additional, *totally* opportunistic memcmp == 0 optimization only applies (so no abbreviated keys), while still having the additional optimization be 100% useless. Clearly that test case is also just about perfectly pessimal for this case too. (Recall that Heikki's test case shows performance on per-sorted input, so there are far fewer comparisons than would typically be required anyway - n comparisons, or a bubble sort best case. If I wanted to cheat, I could reduce work_mem so that an external tape sort is used, since as it happens tapesort doesn't opportunistically check for pre-sorted input, but I won't do that. Heikki's case both emphasizes the amortized cost of a strxfrm() where we abbreviate, and in this instance de-emphasizes the importance of memory latency by having access be sequential/predictable.) The only variation I'm adding here to Heikki's original test case is to have a leading int4 attribute that always has a value of 1 -- that conveniently removes abbreviation (including strxfrm() overhead) as a factor that can influence the outcome, since right now that isn't under consideration. So: create table sorttest (dummy int4, t text); insert into sorttest select 1, 'foobarfo' || (g) || repeat('a', 75) from generate_series(1, 3) g; Benchmark: pg@hamster:~/tests$ cat heikki-sort.sql select * from (select * from sorttest order by dummy, t offset 100) f; pgbench -f heikki-sort.sql -T 100 -n With optimization enabled tps = 77.861353 (including connections establishing) tps = 77.862260 (excluding connections establishing) tps = 78.211053 (including connections establishing) tps = 78.212016 (excluding connections establishing) tps = 77.996117 (including connections establishing) tps = 77.997069 (excluding connections establishing) With optimization disabled (len1 == len2 thing is commented out) = tps = 78.719389 (including connections establishing) tps = 78.720366 (excluding connections establishing) tps = 78.764819 (including connections establishing) tps = 78.765712 (excluding connections establishing) tps = 78.472902 (including connections establishing) tps = 78.473844 (excluding connections establishing) So, yes, it looks like I might have just about regressed this case - it's hard to be completely sure. However, this is still a very unrealistic case, since invariably len1 == len2 without the optimization ever working out, whereas the case that benefits [2] is quite representative. As I'm sure you were expecting, I still favor pursuing this additional optimization. If you think I've been unfair or not thorough, I'm happy to look at other cases. Also, I'm not sure that you accept that I'm justified in considering this a separate question to the more important question of what to do in the tie-breaker abbreviation case (where we may be almost certain that equality will be indicated by a memcmp()). If you don't accept that I'm right about that more important case, I guess that means that you don't have confidence in my ad-hoc cost model (the HyperLogLog/cardinality stuff). [1] http://www.postgresql.org/message-id/CAM3SWZR9dtGO+zX4VEn7GTW2=+umSNq=c57sjgxg8oqhjar...@mail.gmail.com [2] http://www.postgresql.org/message-id/cam3swzqtyv3kp+cakzjzv3rwb1ojjahwpcz9coyjxpkhbtc...@mail.gmail.com -- Peter Geoghegan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] B-Tree support function number 3 (strxfrm() optimization)
On Thu, Sep 11, 2014 at 4:13 PM, Peter Geoghegan p...@heroku.com wrote: On Wed, Sep 10, 2014 at 11:36 AM, Robert Haas robertmh...@gmail.com wrote: No, not really. All you have to do is right a little test program to gather the information. I don't think a little test program is useful - IMV it's too much of a simplification to suppose that a strcoll() has a fixed cost, and a memcmp() has a fixed cost, and that we can determine algebraically that we should (say) proceed or not proceed with the additional opportunistic memcmp() == 0 optimization based solely on that. I'm not sure if that's what you meant, but it might have been. I think I said pretty clearly that it was. However, I think it's perfectly fair to consider a case where the opportunistic memcmp() == 0 thing never works out (as opposed to mostly not helping, which is what I considered earlier [1]), as long as we're sorting real tuples. You mentioned Heikki's test case; it seems fair to consider that, but for the non-abbreviated case where the additional, *totally* opportunistic memcmp == 0 optimization only applies (so no abbreviated keys), while still having the additional optimization be 100% useless. Clearly that test case is also just about perfectly pessimal for this case too. (Recall that Heikki's test case shows performance on per-sorted input, so there are far fewer comparisons than would typically be required anyway - n comparisons, or a bubble sort best case. If I wanted to cheat, I could reduce work_mem so that an external tape sort is used, since as it happens tapesort doesn't opportunistically check for pre-sorted input, but I won't do that. Heikki's case both emphasizes the amortized cost of a strxfrm() where we abbreviate, and in this instance de-emphasizes the importance of memory latency by having access be sequential/predictable.) The only variation I'm adding here to Heikki's original test case is to have a leading int4 attribute that always has a value of 1 -- that conveniently removes abbreviation (including strxfrm() overhead) as a factor that can influence the outcome, since right now that isn't under consideration. So: create table sorttest (dummy int4, t text); insert into sorttest select 1, 'foobarfo' || (g) || repeat('a', 75) from generate_series(1, 3) g; Benchmark: pg@hamster:~/tests$ cat heikki-sort.sql select * from (select * from sorttest order by dummy, t offset 100) f; pgbench -f heikki-sort.sql -T 100 -n With optimization enabled tps = 77.861353 (including connections establishing) tps = 77.862260 (excluding connections establishing) tps = 78.211053 (including connections establishing) tps = 78.212016 (excluding connections establishing) tps = 77.996117 (including connections establishing) tps = 77.997069 (excluding connections establishing) With optimization disabled (len1 == len2 thing is commented out) = tps = 78.719389 (including connections establishing) tps = 78.720366 (excluding connections establishing) tps = 78.764819 (including connections establishing) tps = 78.765712 (excluding connections establishing) tps = 78.472902 (including connections establishing) tps = 78.473844 (excluding connections establishing) So, yes, it looks like I might have just about regressed this case - it's hard to be completely sure. However, this is still a very unrealistic case, since invariably len1 == len2 without the optimization ever working out, whereas the case that benefits [2] is quite representative. As I'm sure you were expecting, I still favor pursuing this additional optimization. Well, I have to agree that doesn't look too bad, but your reluctance to actually do the microbenchmark worries me. Granted, macrobenchmarks are what actually matters, but they can be noisy and there can be other confounding factors. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] B-Tree support function number 3 (strxfrm() optimization)
On Thu, Sep 11, 2014 at 1:50 PM, Robert Haas robertmh...@gmail.com wrote: I think I said pretty clearly that it was. I agree that you did, but it wasn't clear exactly what factors you were asking me to simulate. It still isn't. Do you want me to compare the same string a million times in a loop, both with a strcoll() and with a memcmp()? Should I copy it into a buffer to add a NUL byte? Or should it be a new string each time, with a cache miss expected some proportion of the time? These considerations might significantly influence the outcome here, and one variation might be significantly less fair than another. Tell me what to do in a little more detail, and I'll do it (plus let you know what I think of it). I honestly don't know what you expect. So, yes, it looks like I might have just about regressed this case - it's hard to be completely sure. However, this is still a very unrealistic case, since invariably len1 == len2 without the optimization ever working out, whereas the case that benefits [2] is quite representative. As I'm sure you were expecting, I still favor pursuing this additional optimization. Well, I have to agree that doesn't look too bad, but your reluctance to actually do the microbenchmark worries me. Granted, macrobenchmarks are what actually matters, but they can be noisy and there can be other confounding factors. Well, I've been quite open about the fact that I think we can and should hide things in memory latency. I don't think my benchmark was in any way noisy, since you saw 3 100 second runs per test set/case, with a very stable outcome throughout - plus the test case is extremely unsympathetic/unrealistic to begin with. Hiding behind memory latency is an increasingly influential trick that I've seen crop up a few times in various papers. I think it's perfectly legitimate to rely on that. But, honestly, I have little idea how much I actually am relying on it. I think it's only fair to look at representative cases (i.e. actually SQL queries). Anything else can only be used as a guide. But, in this instance, a guide to what, exactly? This is not a rhetorical question, and I'm not trying to be difficult. If I thought there was something bad hiding here, I'd tell you about it. -- Peter Geoghegan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] B-Tree support function number 3 (strxfrm() optimization)
On Tue, Sep 9, 2014 at 2:25 PM, Robert Haas robertmh...@gmail.com wrote: I like that I don't have to care about every combination, and can treat abbreviation abortion as the special case with the extra step, in line with how I think of the optimization conceptually. Does that make sense? No. comparetup_heap() is hip-deep in this optimization as it stands, and what I proposed - if done correctly - isn't going to make that significantly worse. In fact, it really ought to make things better: you should be able to set things up so that ssup-comparator is always the test that should be applied first, regardless of whether we're aborted or not-aborted or not doing this in the first place; and then ssup-tiebreak_comparator, if not NULL, can be applied after that. I'm not following here. Isn't that at least equally true of what I've proposed? Sure, I'm checking if (!state-abbrevAborted) first, but that's irrelevant to the non-abbreviated case. It has nothing to abort. Also, AFAICT we cannot abort and still call ssup-comparator() indifferently, since sorttuple.datum1 fields are perhaps abbreviated keys in half of all cases (i.e. pre-abort tuples), and uninitialized garbage the other half of the time (i.e. post-abort tuples). Where is the heap_getattr() stuff supposed to happen for the first attribute to get an authoritative comparison in the event of aborting (if we're calling ssup-comparator() on datum1 indifferently)? We decide that we're going to use abbreviated keys within datum1 fields up-front. When we abort, we cannot use datum1 fields at all (which doesn't appear to matter for text -- the datum1 optimization has historically only benefited pass-by-value types). I'd mentioned that I'd hacked together a patch that doesn't necessitate a separate state (if only to save a very small amount of memory), but it is definitely messier within comparetup_heap(). I'm still tweaking it. FYI, it does this within comparetup_heap(): + if (!sortKey-abbrev_comparator) + { + /* +* There are no abbreviated keys to begin with (i.e. no opclass support +* exists). Compare the leading sort key, assuming an authoritative +* representation. +*/ + compare = ApplySortComparator(a-datum1, a-isnull1, + b-datum1, b-isnull1, + sortKey); + if (compare != 0) + return compare; + + sortKey++; + nkey = 1; + } + else if (!state-abbrevAborted sortKey-abbrev_comparator) + { + /* +* Leading attribute has abbreviated key representation, and +* abbreviation was not aborted when copying. Compare the leading sort +* key using abbreviated representation. +*/ + compare = ApplySortAbbrevComparator(a-datum1, a-isnull1, + b-datum1, b-isnull1, + sortKey); + if (compare != 0) + return compare; + + /* +* Since abbreviated comparison returned 0, call tie-breaker comparator +* using original, authoritative representation, which may break tie +* when differences were not captured within abbreviated representation +*/ + nkey = 0; + } + else + { + /* +* Started with abbreviated keys, but aborted during conversion/tuple +* copying -- check each attribute from scratch. It's not safe to make +* any assumption about the state of individual datum1 fields. +*/ + nkey = 0; + } Is doing all this worth the small saving in memory? Personally, I don't think that it is, but I defer to you. -- Peter Geoghegan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] B-Tree support function number 3 (strxfrm() optimization)
On Tue, Sep 9, 2014 at 2:00 PM, Robert Haas robertmh...@gmail.com wrote: Boiled down, what you're saying is that you might have a set that contains lots of duplicates in general, but not very many where the abbreviated-keys also match. Sure, that's true. Abbreviated keys are not used in the case where we do a (fully) opportunistic memcmp(), without really having any idea of whether or not it'll work out. Abbreviated keys aren't really relevant to that case, except perhaps in that we know we'll have statistics available for leading attributes, which will make the case less frequent in practice. But you might also not have that case, so I don't see where that gets us; the same worst-case test case Heikki developed the last time we relitigated this point is still relevant here. Well, I think that there should definitely be a distinction made between abbreviated and non-abbreviated cases; you could frequently have almost 100% certainty that each of those optimistic memcmp()s will work out with abbreviated keys. Low cardinality sets are very common. I'm not sure what your position on that is. My proposal to treat both of those cases (abbreviated with a cost model/cardinality statistics; non-abbreviated without) the same is based on different arguments for each case. In order to know how much we're giving up in that case, we need the exact number I asked you to provide in my previous email: the ratio of the cost of strcoll() to the cost of memcmp(). I see that you haven't chosen to provide that information in any of your four responses. Well, it's kind of difficult to give that number in a vacuum. I showed a case that had a large majority of opportunistic memcmp()s go to waste, while a small number were useful, which still put us ahead. I can look at Heikki's case with this again if you think that'll help. Heikki said that his case was all about wasted strxfrm()s, which are surely much more expensive than wasted memcmp()s, particularly when you consider temporal locality (we needed that memory to be stored in a cacheline for the immediately subsequent operation anyway, should the memcmp() thing not work out - the simple ratio that you're interested in may be elusive). In case I haven't been clear enough on this point, I re-emphasize that I do accept that for something like the non-abbreviated case, the opportunistic memcmp() thing must virtually be free if we're to proceed, since it is purely opportunistic. If it can be demonstrated that that isn't the case (and if that cannot be fixed by limiting it to CACHE_LINE_SIZE), clearly we should not proceed with opportunistic (non-abbreviated) memcmp()s. In fact, I think I'm holding it to a higher standard than you are - I believe that it had better be virtually free. -- Peter Geoghegan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] B-Tree support function number 3 (strxfrm() optimization)
On Wed, Sep 10, 2014 at 1:36 PM, Peter Geoghegan p...@heroku.com wrote: In order to know how much we're giving up in that case, we need the exact number I asked you to provide in my previous email: the ratio of the cost of strcoll() to the cost of memcmp(). I see that you haven't chosen to provide that information in any of your four responses. Well, it's kind of difficult to give that number in a vacuum. No, not really. All you have to do is right a little test program to gather the information. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] B-Tree support function number 3 (strxfrm() optimization)
On Thu, Sep 4, 2014 at 5:46 PM, Peter Geoghegan p...@heroku.com wrote: On Thu, Sep 4, 2014 at 2:18 PM, Robert Haas robertmh...@gmail.com wrote: Eh, maybe? I'm not sure why the case where we're using abbreviated keys should be different than the case we're not. In either case this is a straightforward trade-off: if we do a memcmp() before strcoll(), we win if it returns 0 and lose if returns non-zero and strcoll also returns non-zero. (If memcmp() returns non-zero but strcoll() returns 0, it's a tie.) I'm not immediately sure why it should affect the calculus one way or the other whether abbreviated keys are in use; the question of how much faster memcmp() is than strcoll() seems like the relevant consideration either way. Not quite. Consider my earlier example of sorting ~300,000 cities by country only. That's a pretty low cardinality attribute. We win big, and we are almost certain that the abbreviated key cardinality is a perfect proxy for the full key cardinality so we stick with abbreviated keys while copying over tuples. Sure, most comparisons will actually be resolved with a memcmp() == 0 rather than an abbreviated comparison, but under my ad-hoc cost model there is no distinction, since they're both very much cheaper than a strcoll() (particularly when we factor in the NUL termination copying that a memcmp() == 0 also avoids). To a lesser extent we're also justified in that optimism because we've already established that roughly the first 8 bytes of the string are bitwise equal. So the difference is that in the abbreviated key case, we are at least somewhat justified in our optimism. Whereas, where we're just eliding fmgr overhead, say on the 2nd or subsequent attribute of a multi-key sort, it's totally opportunistic to chance a memcmp() == 0. The latter optimization can only be justified by the fact that the memcmp() is somewhere between dirt cheap and free. That seems like soemthing that should significantly impact the calculus. Boiled down, what you're saying is that you might have a set that contains lots of duplicates in general, but not very many where the abbreviated-keys also match. Sure, that's true. But you might also not have that case, so I don't see where that gets us; the same worst-case test case Heikki developed the last time we relitigated this point is still relevant here. In order to know how much we're giving up in that case, we need the exact number I asked you to provide in my previous email: the ratio of the cost of strcoll() to the cost of memcmp(). I see that you haven't chosen to provide that information in any of your four responses. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] B-Tree support function number 3 (strxfrm() optimization)
On Fri, Sep 5, 2014 at 10:45 PM, Peter Geoghegan p...@heroku.com wrote: While I gave serious consideration to your idea of having a dedicated abbreviation comparator, and not duplicating sortsupport state when abbreviated keys are used (going so far as to almost fully implement the idea), I ultimately decided that my vote says we don't do that. It seemed to me that there were negligible benefits for increased complexity. In particular, I didn't want to burden tuplesort with having to worry about whether or not abbreviation was aborted during tuple copying, or was not used by the opclass in the first place - implementing your scheme makes that distinction relevant. It's very convenient to have comparetup_heap() compare the leading sort key (that specifically looks at SortTuple.datum1 pairs) indifferently, using the same comparator for abbreviated and not abbreviated cases indifferently. comparetup_heap() does not seem like a great place to burden with caring about each combination any more than strictly necessary. I like that I don't have to care about every combination, and can treat abbreviation abortion as the special case with the extra step, in line with how I think of the optimization conceptually. Does that make sense? No. comparetup_heap() is hip-deep in this optimization as it stands, and what I proposed - if done correctly - isn't going to make that significantly worse. In fact, it really ought to make things better: you should be able to set things up so that ssup-comparator is always the test that should be applied first, regardless of whether we're aborted or not-aborted or not doing this in the first place; and then ssup-tiebreak_comparator, if not NULL, can be applied after that. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] B-Tree support function number 3 (strxfrm() optimization)
On Fri, Sep 5, 2014 at 7:45 PM, Peter Geoghegan p...@heroku.com wrote: Attached additional patches are intended to be applied on top off most of the patches posted on September 2nd [1]. I attach another amendment/delta patch, intended to be applied on top of what was posted yesterday. I neglected to remove some abort logic that was previously only justified by the lack of opportunistic memcmp() == 0 comparisons in all instances, rather than just with abbreviated keys. -- Peter Geoghegan From c60b07dda3e81e1fc9a2e95c386faf1fccfe8f04 Mon Sep 17 00:00:00 2001 From: Peter Geoghegan p...@heroku.com Date: Sat, 6 Sep 2014 14:56:37 -0700 Subject: [PATCH 8/8] Remove obsolete abort logic Remove some abort logic that was previously only justified by the lack of opportunistic memcmp() == 0 comparisons in all instances, rather than just with abbreviated keys. That justification no longer stands. --- src/backend/utils/adt/varlena.c | 9 - 1 file changed, 9 deletions(-) diff --git a/src/backend/utils/adt/varlena.c b/src/backend/utils/adt/varlena.c index 6aa3eaa..db4eae1 100644 --- a/src/backend/utils/adt/varlena.c +++ b/src/backend/utils/adt/varlena.c @@ -2295,15 +2295,6 @@ bttext_abbrev_abort(int memtupcount, double rowhint, SortSupport ssup) norm_key_card = key_distinct / (double) memtupcount; /* - * If this is a very low cardinality set generally, that is reason enough - * to favor our strategy, since many comparisons can be resolved with - * inexpensive memcmp() tie-breakers, even though abbreviated keys are of - * marginal utility. - */ - if (norm_key_card 0.05) - return false; - - /* * Abort abbreviation strategy. * * The worst case, where all abbreviated keys are identical while all -- 1.9.1 -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] B-Tree support function number 3 (strxfrm() optimization)
On Sat, Sep 6, 2014 at 3:01 PM, Peter Geoghegan p...@heroku.com wrote: I attach another amendment/delta patch Attached is another amendment to the patch set. With the recent addition of abbreviation support on 32-bit platforms, we should just hash the Datum representation as a uint32 on SIZEOF_DATUM != 8 platforms. -- Peter Geoghegan From 180ad839d8f049d83a89b42a1d85c7c99a7929f0 Mon Sep 17 00:00:00 2001 From: Peter Geoghegan p...@heroku.com Date: Sat, 6 Sep 2014 21:39:16 -0700 Subject: [PATCH 9/9] On SIZEOF_DATUM != 4 platforms, call hash_uint32() directly In passing, remove some dead code within bttext_abbrev_abort(). --- src/backend/utils/adt/varlena.c | 28 ++-- 1 file changed, 14 insertions(+), 14 deletions(-) diff --git a/src/backend/utils/adt/varlena.c b/src/backend/utils/adt/varlena.c index db4eae1..23944f2 100644 --- a/src/backend/utils/adt/varlena.c +++ b/src/backend/utils/adt/varlena.c @@ -2068,9 +2068,7 @@ bttext_abbrev_convert(Datum original, SortSupport ssup) char *pres; int len; Sizebsize; - uint32lohalf, - hihalf, - hash; + uint32hash; /* * Abbreviated key representation is a pass-by-value Datum that is treated @@ -2143,9 +2141,18 @@ retry: memcpy(pres, tss-buf2, Min(sizeof(Datum), bsize)); /* Hash abbreviated key */ - lohalf = (uint32) res; - hihalf = (uint32) (res 32); - hash = hash_uint32(lohalf ^ hihalf); +#if SIZEOF_DATUM == 8 + { + uint32lohalf, + hihalf; + + lohalf = (uint32) res; + hihalf = (uint32) (res 32); + hash = hash_uint32(lohalf ^ hihalf); + } +#else /* SIZEOF_DATUM != 8 */ + hash = hash_uint32((uint32) res); +#endif addHyperLogLog(tss-abbr_card, hash); @@ -2168,8 +2175,7 @@ bttext_abbrev_abort(int memtupcount, double rowhint, SortSupport ssup) { TextSortSupport *tss = (TextSortSupport *) ssup-ssup_extra; doubleabbrev_distinct, - key_distinct, - norm_key_card; + key_distinct; Assert(ssup-abbrev_state == ABBREVIATED_KEYS_YES); @@ -2289,12 +2295,6 @@ bttext_abbrev_abort(int memtupcount, double rowhint, SortSupport ssup) return false; /* - * Normalized cardinality is proportion of distinct original, authoritative - * keys - */ - norm_key_card = key_distinct / (double) memtupcount; - - /* * Abort abbreviation strategy. * * The worst case, where all abbreviated keys are identical while all -- 1.9.1 -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] B-Tree support function number 3 (strxfrm() optimization)
On Wed, Sep 3, 2014 at 2:44 PM, Peter Geoghegan p...@heroku.com wrote: I guess it should still be a configure option, then. Or maybe there should just be a USE_ABBREV_KEYS macro within pg_config_manual.h. Attached additional patches are intended to be applied on top off most of the patches posted on September 2nd [1]. Note that you should not apply patch 0001-* from that set to master, since it has already been committed to master [2]. However, while rebasing I revised patch/commit 0005-* to abbreviation used on all platforms, including 32-bit platforms (the prior 0005-* patch just re-enabled the optimization on Darwin/Apple), so you should discard the earlier 0005-* patch. In a later commit I also properly formalize the idea that we always do opportunistic memcmp() == 0 checks, no matter what context a sortsupport-accelerated text comparison occurs in. That seems like a good idea, but it's broken out in a separate commit in case you are not in agreement. While I gave serious consideration to your idea of having a dedicated abbreviation comparator, and not duplicating sortsupport state when abbreviated keys are used (going so far as to almost fully implement the idea), I ultimately decided that my vote says we don't do that. It seemed to me that there were negligible benefits for increased complexity. In particular, I didn't want to burden tuplesort with having to worry about whether or not abbreviation was aborted during tuple copying, or was not used by the opclass in the first place - implementing your scheme makes that distinction relevant. It's very convenient to have comparetup_heap() compare the leading sort key (that specifically looks at SortTuple.datum1 pairs) indifferently, using the same comparator for abbreviated and not abbreviated cases indifferently. comparetup_heap() does not seem like a great place to burden with caring about each combination any more than strictly necessary. I like that I don't have to care about every combination, and can treat abbreviation abortion as the special case with the extra step, in line with how I think of the optimization conceptually. Does that make sense? Otherwise, there'd have to be a ApplySortComparator() *and* ApplySortComparatorAbbreviated() call with SortTuple.datum1 pairs passed, as appropriate for each opclass (and abortion state), as well as a heap_getattr() tie-breaker call for the latter case alone (when we got an inconclusive answer, OR when abbreviation was aborted). Finally, just as things are now, there'd have to be a loop where the second or subsequent attributes are dealt with by ApplySortComparator()'ing. So AFAICT under your scheme there are 4 ApplySortComparator* call sites required, rather than 3 as under mine. Along similar lines, I thought about starting from nkey = 0 within comparetup_heap() when abortion occurs (so that there'd only be 2 ApplySortComparator() call sites - no increase from master) , but that turns out to be messy, plus I like those special tie-breaker assertions. I will be away for much of next week, and will have limited access to e-mail. I will be around tomorrow, though. I hope that what I've posted is suitable to commit without further input from me. [1] http://www.postgresql.org/message-id/cam3swztetqckc24lhwkdlasjf-b-ccnn4q0oyjhgbx+ncpn...@mail.gmail.com [2] http://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=d8d4965dc29263462932be03d4206aa694e2cd7e -- Peter Geoghegan From 889d615218d5cac9097c11bec33e2d8e70112922 Mon Sep 17 00:00:00 2001 From: Peter Geoghegan p...@heroku.com Date: Fri, 5 Sep 2014 16:43:56 -0700 Subject: [PATCH 7/7] Soften wording about nodeAgg.c single column support Support for forcing nodeAgg.c to use a heap tuplesort rather than a Datum tuplesort, even when sorting on a single attribute may be useful. Datum tuplesorts do not and cannot support abbreviated keys, since Datum tuplesorting necessitates preserving the original representation for later re-use by certain datum tuplesort clients. When using a heap tuplesort will make the difference between using and not using abbreviated keys, it is likely significantly faster to prefer a heap tuplesort. On the other hand, nodeAgg.c's current preference for a datum tuplesort is likely to result in superior performance for non-abbreviated cases. Since it is now apparent that support for forcing a heap tuplesort within nodeAgg.c will follow in a later, separately reviewed patch, soften the wording in comments within nodeAgg.c to reflect this. This doesn't seem unreasonable, since sortsupport optimization is already used inconsistently. --- src/backend/executor/nodeAgg.c | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/src/backend/executor/nodeAgg.c b/src/backend/executor/nodeAgg.c index 3b335e7..95143c3 100644 --- a/src/backend/executor/nodeAgg.c +++ b/src/backend/executor/nodeAgg.c @@ -365,9 +365,9 @@ initialize_aggregates(AggState *aggstate, * otherwise sort the full tuple. (See comments
Re: [HACKERS] B-Tree support function number 3 (strxfrm() optimization)
On Tue, Sep 2, 2014 at 10:27 PM, Peter Geoghegan p...@heroku.com wrote: * Still doesn't address the open question of whether or not we should optimistically always try memcmp() == 0 on tiebreak. I still lean towards yes. Let m be the cost of a memcmp() that fails near the end of the strings; and let s be the cost of a strcoll that does likewise. Clearly s m. But approximately what is s/m on platforms where you can test? Say, with 100 byte string, in a few different locales. If for example s/m 100 then it's a no-brainer, because in the worst case we're adding 1% overhead, and in the best case we're saving 99%. OTOH, if s/m 2 then I almost certainly wouldn't do it, because in the worst case we're adding 50% overhead, and in the best case we're saving 50%. That seems like it's doubling down on the abbreviated key stuff to work mostly all the time, and I'm not prepared to make that bet. There is of course a lot of daylight between a 2-to-1 ratio and a 100-to-1 ratio and I expect the real value is somewhere in the middle (probably closer to 2); I haven't at this time made up my mind what value would make this worthwhile, but I'd like to know what the real numbers are. * Leaves open the question of what to do when we can't use the abbreviated keys optimization just because a datum tuplesort is preferred when sorting single-attribute tuples (recall that datum case tuplesorts cannot use abbreviated keys). We want to avail of tuplesort datum sorting where we can, except when abbreviated keys are available, which presumably tips the balance in favor of heap tuple sorting even when sorting on only one attribute, simply because then we can then use abbreviated keys. I'm thinking in particular of nodeAgg.c, which is an important case. I favor leaving this issue to a future patch. The last thing this patch needs is more changes that someone might potentially dislike. Let's focus on getting the core thing working, and then you can enhance it once we all agree that it is. On the substance of this issue, I suspect that for pass-by-value data types it can hardly be wrong to use the datum tuplesort approach; but it's possible we will want to disable it for pass-by-reference data types when the abbreviated-key infrastructure is available. That will lose if it turns out that the abbreviated keys aren't capturing enough of the entropy, but maybe we'll decide that's OK. Or maybe not. But I don't think it's imperative that this patch make a change in that area, and indeed, in the interest of keeping separate changes isolated, I think it's better if it doesn't. There are still FIXME/TODO comments for each of these two points. Further, this revised/rebased patch set: * Incorporates your feedback on stylistic issues, with changes confined to their own commit (on top of earlier commits that are almost, but not quite, the same as the prior revision that your remarks apply to). * No longer does anything special within reversedirection_heap(), since that is unnecessary, as it's only used by bounded sorts, which aren't a useful target for abbreviated keys. This is noted. There is no convenient point to add a defensive assertion against this, so I haven't. * Updates comments in master in a broken-out way, reflecting opclass contract with sortsupport as established by 1d41739e5a04b0e93304d24d864b6bfa3efc45f2, that is convenient to apply to and commit in the master branch immediately. Thanks, committed that one. The remaining patches can be squashed into a single one, as none of them can be applied without the others. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] B-Tree support function number 3 (strxfrm() optimization)
On Wed, Sep 3, 2014 at 5:44 PM, Peter Geoghegan p...@heroku.com wrote: On Wed, Sep 3, 2014 at 2:18 PM, Robert Haas robertmh...@gmail.com wrote: My suggestion is to remove the special cases for Darwin and 32-bit systems and see how it goes. I guess it should still be a configure option, then. Or maybe there should just be a USE_ABBREV_KEYS macro within pg_config_manual.h. Are you suggesting that the patch be committed with the optimization enabled on all platforms by default, with the option to revisit disabling it if and when there is user push-back? I don't think that's unreasonable, given the precautions now taken, but I'm just not sure that's what you mean. That's what I mean. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] B-Tree support function number 3 (strxfrm() optimization)
On Thu, Sep 4, 2014 at 9:19 AM, Robert Haas robertmh...@gmail.com wrote: On Tue, Sep 2, 2014 at 10:27 PM, Peter Geoghegan p...@heroku.com wrote: * Still doesn't address the open question of whether or not we should optimistically always try memcmp() == 0 on tiebreak. I still lean towards yes. Let m be the cost of a memcmp() that fails near the end of the strings; and let s be the cost of a strcoll that does likewise. Clearly s m. But approximately what is s/m on platforms where you can test? Say, with 100 byte string, in a few different locales. Just to be clear: I imagine you're more or less sold on the idea of testing equality in the event of a tie-break, where the leading 8 primary weight bytes are already known to be equal (and the full text string lengths also match); the theory of operation behind testing how good a proxy for full key cardinality abbreviated key cardinality is is very much predicated on that. We can still win big with very low cardinality sets this way, which are an important case. What I consider an open question is whether or not we should do that on the first call when there is no abbreviated comparison, such as on the second or subsequent attribute in a multi-column sort, in the hope that equality will just happen to be indicated. If for example s/m 100 then it's a no-brainer, because in the worst case we're adding 1% overhead, and in the best case we're saving 99%. OTOH, if s/m 2 then I almost certainly wouldn't do it, because in the worst case we're adding 50% overhead, and in the best case we're saving 50%. That seems like it's doubling down on the abbreviated key stuff to work mostly all the time, and I'm not prepared to make that bet. There is of course a lot of daylight between a 2-to-1 ratio and a 100-to-1 ratio and I expect the real value is somewhere in the middle (probably closer to 2); I haven't at this time made up my mind what value would make this worthwhile, but I'd like to know what the real numbers are. Well, we can only lose when the strings happen to be the same size. So that's something. But I'm willing to consider the possibility that the memcmp() is virtually free. I would only proceed with this extra optimization if that is actually the case. Modern CPUs are odd things. Branch prediction/instruction pipelining, and the fact that we're frequently stalled on cache misses might combine to make it effectively the case that the opportunistic memcmp() is free. I could be wrong about that, and I'm certainly wrong if you test large enough strings with differences only towards the very end, but it seems reasonable to speculate that it would work well with appropriate precautions (in particular, don't do it when the strings are huge). Let me try and come up with some numbers for a really unsympathetic case, since you've already seen sympathetic numbers. I think the sympathetic country/province/city sort test case [1] is actually fairly representative; sort keys *are* frequently correlated like that, implying that there are lots of savings to be had by being memcmp() == 0 optimistic when resolving comparisons using the second or subsequent attribute. * Leaves open the question of what to do when we can't use the abbreviated keys optimization just because a datum tuplesort is preferred when sorting single-attribute tuples (recall that datum case tuplesorts cannot use abbreviated keys). We want to avail of tuplesort datum sorting where we can, except when abbreviated keys are available, which presumably tips the balance in favor of heap tuple sorting even when sorting on only one attribute, simply because then we can then use abbreviated keys. I'm thinking in particular of nodeAgg.c, which is an important case. I favor leaving this issue to a future patch. The last thing this patch needs is more changes that someone might potentially dislike. Let's focus on getting the core thing working, and then you can enhance it once we all agree that it is. Makes sense. I think we should make a separate pass to enable sort support for B-Tree sorting - that's probably the most compelling case, after all. That's certainly the thing that I've heard complaints about. There could be as many as 2-3 follow-up commits. On the substance of this issue, I suspect that for pass-by-value data types it can hardly be wrong to use the datum tuplesort approach; but it's possible we will want to disable it for pass-by-reference data types when the abbreviated-key infrastructure is available. That will lose if it turns out that the abbreviated keys aren't capturing enough of the entropy, but maybe we'll decide that's OK. Or maybe not. But I don't think it's imperative that this patch make a change in that area, and indeed, in the interest of keeping separate changes isolated, I think it's better if it doesn't. Right. I had presumed that we'd want to figure that out each time. I wasn't sure how best to go about doing that, which is why it's
Re: [HACKERS] B-Tree support function number 3 (strxfrm() optimization)
On Thu, Sep 4, 2014 at 2:12 PM, Peter Geoghegan p...@heroku.com wrote: On Thu, Sep 4, 2014 at 9:19 AM, Robert Haas robertmh...@gmail.com wrote: On Tue, Sep 2, 2014 at 10:27 PM, Peter Geoghegan p...@heroku.com wrote: * Still doesn't address the open question of whether or not we should optimistically always try memcmp() == 0 on tiebreak. I still lean towards yes. Let m be the cost of a memcmp() that fails near the end of the strings; and let s be the cost of a strcoll that does likewise. Clearly s m. But approximately what is s/m on platforms where you can test? Say, with 100 byte string, in a few different locales. Just to be clear: I imagine you're more or less sold on the idea of testing equality in the event of a tie-break, where the leading 8 primary weight bytes are already known to be equal (and the full text string lengths also match); the theory of operation behind testing how good a proxy for full key cardinality abbreviated key cardinality is is very much predicated on that. We can still win big with very low cardinality sets this way, which are an important case. What I consider an open question is whether or not we should do that on the first call when there is no abbreviated comparison, such as on the second or subsequent attribute in a multi-column sort, in the hope that equality will just happen to be indicated. Eh, maybe? I'm not sure why the case where we're using abbreviated keys should be different than the case we're not. In either case this is a straightforward trade-off: if we do a memcmp() before strcoll(), we win if it returns 0 and lose if returns non-zero and strcoll also returns non-zero. (If memcmp() returns non-zero but strcoll() returns 0, it's a tie.) I'm not immediately sure why it should affect the calculus one way or the other whether abbreviated keys are in use; the question of how much faster memcmp() is than strcoll() seems like the relevant consideration either way. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] B-Tree support function number 3 (strxfrm() optimization)
On Thu, Sep 4, 2014 at 2:18 PM, Robert Haas robertmh...@gmail.com wrote: Eh, maybe? I'm not sure why the case where we're using abbreviated keys should be different than the case we're not. In either case this is a straightforward trade-off: if we do a memcmp() before strcoll(), we win if it returns 0 and lose if returns non-zero and strcoll also returns non-zero. (If memcmp() returns non-zero but strcoll() returns 0, it's a tie.) I'm not immediately sure why it should affect the calculus one way or the other whether abbreviated keys are in use; the question of how much faster memcmp() is than strcoll() seems like the relevant consideration either way. Not quite. Consider my earlier example of sorting ~300,000 cities by country only. That's a pretty low cardinality attribute. We win big, and we are almost certain that the abbreviated key cardinality is a perfect proxy for the full key cardinality so we stick with abbreviated keys while copying over tuples. Sure, most comparisons will actually be resolved with a memcmp() == 0 rather than an abbreviated comparison, but under my ad-hoc cost model there is no distinction, since they're both very much cheaper than a strcoll() (particularly when we factor in the NUL termination copying that a memcmp() == 0 also avoids). To a lesser extent we're also justified in that optimism because we've already established that roughly the first 8 bytes of the string are bitwise equal. So the difference is that in the abbreviated key case, we are at least somewhat justified in our optimism. Whereas, where we're just eliding fmgr overhead, say on the 2nd or subsequent attribute of a multi-key sort, it's totally opportunistic to chance a memcmp() == 0. The latter optimization can only be justified by the fact that the memcmp() is somewhere between dirt cheap and free. That seems like soemthing that should significantly impact the calculus. -- Peter Geoghegan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] B-Tree support function number 3 (strxfrm() optimization)
On Thu, Sep 4, 2014 at 11:12 AM, Peter Geoghegan p...@heroku.com wrote: What I consider an open question is whether or not we should do that on the first call when there is no abbreviated comparison, such as on the second or subsequent attribute in a multi-column sort, in the hope that equality will just happen to be indicated. Let me try and come up with some numbers for a really unsympathetic case, since you've already seen sympathetic numbers. So I came up with what I imagined to be an unsympathetic case: postgres=# create table opt_eq_test as select 1 as dummy, country || ', ' || city as country from cities order by city; SELECT 317102 [local]/postgres=# select * from opt_eq_test limit 5; dummy | country ---+-- 1 | India, 108 Kalthur 1 | Mexico, 10 de Abril 1 | India, 113 Thalluru 1 | Argentina, 11 de Octubre 1 | India, 11 Dlp (5 rows) I added the dummy column to prevent abbreviated keys from being used - this is all about the question of trying to get away with a memcmp() == 0 in all circumstances, without abbreviated keys/statistics on attribute cardinality. This is a question that has nothing to do with abbreviated keys in particular. With the most recent revision of the patch, performance of a representative query against this data stabilizes as follows on my laptop: LOG: duration: 2252.500 ms statement: select * from (select * from opt_eq_test order by dummy, country offset 100) d; LOG: duration: 2261.505 ms statement: select * from (select * from opt_eq_test order by dummy, country offset 100) d; LOG: duration: 2315.903 ms statement: select * from (select * from opt_eq_test order by dummy, country offset 100) d; LOG: duration: 2260.132 ms statement: select * from (select * from opt_eq_test order by dummy, country offset 100) d; LOG: duration: 2247.340 ms statement: select * from (select * from opt_eq_test order by dummy, country offset 100) d; LOG: duration: 2246.723 ms statement: select * from (select * from opt_eq_test order by dummy, country offset 100) d; LOG: duration: 2276.297 ms statement: select * from (select * from opt_eq_test order by dummy, country offset 100) d; LOG: duration: 2241.911 ms statement: select * from (select * from opt_eq_test order by dummy, country offset 100) d; LOG: duration: 2259.540 ms statement: select * from (select * from opt_eq_test order by dummy, country offset 100) d; LOG: duration: 2248.740 ms statement: select * from (select * from opt_eq_test order by dummy, country offset 100) d; LOG: duration: 2245.913 ms statement: select * from (select * from opt_eq_test order by dummy, country offset 100) d; LOG: duration: 2230.583 ms statement: select * from (select * from opt_eq_test order by dummy, country offset 100) d; If I apply the additional optimization that we're on the fence about: --- a/src/backend/utils/adt/varlena.c +++ b/src/backend/utils/adt/varlena.c @@ -1967,7 +1967,7 @@ bttextfastcmp_locale(Datum x, Datum y, SortSupport ssup) len1 = VARSIZE_ANY_EXHDR(arg1); len2 = VARSIZE_ANY_EXHDR(arg2); - if (ssup-abbrev_state == ABBREVIATED_KEYS_TIE len1 == len2) + if (len1 == len2) { /* * Being called as authoritative tie-breaker for an abbreviated key Then the equivalent numbers look like this: LOG: duration: 2178.220 ms statement: select * from (select * from opt_eq_test order by dummy, country offset 100) d; LOG: duration: 2175.005 ms statement: select * from (select * from opt_eq_test order by dummy, country offset 100) d; LOG: duration: 2219.648 ms statement: select * from (select * from opt_eq_test order by dummy, country offset 100) d; LOG: duration: 2174.865 ms statement: select * from (select * from opt_eq_test order by dummy, country offset 100) d; LOG: duration: 2246.387 ms statement: select * from (select * from opt_eq_test order by dummy, country offset 100) d; LOG: duration: 2234.023 ms statement: select * from (select * from opt_eq_test order by dummy, country offset 100) d; LOG: duration: 2186.957 ms statement: select * from (select * from opt_eq_test order by dummy, country offset 100) d; LOG: duration: 2177.778 ms statement: select * from (select * from opt_eq_test order by dummy, country offset 100) d; LOG: duration: 2186.709 ms statement: select * from (select * from opt_eq_test order by dummy, country offset 100) d; LOG: duration: 2171.557 ms statement: select * from (select * from opt_eq_test order by dummy, country offset 100) d; LOG: duration: 2211.822 ms statement: select * from (select * from opt_eq_test order by dummy, country offset 100) d; LOG: duration: 2224.198 ms statement: select * from (select * from opt_eq_test order by dummy, country offset 100) d; LOG: duration: 2192.506 ms statement: select * from (select * from opt_eq_test order
Re: [HACKERS] B-Tree support function number 3 (strxfrm() optimization)
On Thu, Sep 4, 2014 at 5:07 PM, Peter Geoghegan p...@heroku.com wrote: So I came up with what I imagined to be an unsympathetic case: BTW, this cities data is still available from: http://postgres-benchmarks.s3-website-us-east-1.amazonaws.com/data/cities.dump -- Peter Geoghegan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] B-Tree support function number 3 (strxfrm() optimization)
On Tue, Sep 2, 2014 at 4:41 PM, Peter Geoghegan p...@heroku.com wrote: HyperLogLog isn't sample-based - it's useful for streaming a set and accurately tracking its cardinality with fixed overhead. OK. Is it the right decision to suppress the abbreviated-key optimization unconditionally on 32-bit systems and on Darwin? There's certainly more danger, on those platforms, that the optimization could fail to pay off. But it could also win big, if in fact the first character or two of the string is enough to distinguish most rows, or if Darwin improves their implementation in the future. If the other defenses against pathological cases in the patch are adequate, I would think it'd be OK to remove the hard-coded checks here and let those cases use the optimization or not according to its merits in particular cases. We'd want to look at what the impact of that is, of course, but if it's bad, maybe those other defenses aren't adequate anyway. I'm not sure. Perhaps the Darwin thing is a bad idea because no one is using Macs to run real database servers. Apple haven't had a server product in years, and typically people only use Postgres on their Macs for development. We might as well have coverage of the new code for the benefit of Postgres hackers that favor Apple machines. Or, to look at it another way, the optimization is so beneficially that it's probably worth the risk, even for more marginal cases. 8 primary weights (the leading 8 bytes, frequently isomorphic to the first 8 Latin characters, regardless of whether or not they have accents/diacritics, or punctuation/whitespace) is twice as many as 4. But every time you add a byte of space to the abbreviated representation that can resolve a comparison, the number of unresolvable-without-tiebreak comparisons (in general) is, I imagine, reduced considerably. Basically, 8 bytes is way better than twice as good as 4 bytes in terms of its effect on the proportion of comparisons that are resolved only with abbreviated keys. Even still, I suspect it's still worth it to apply the optimization with only 4. You've seen plenty of suggestions on assessing the applicability of the optimization from me. Perhaps you have a few of your own. My suggestion is to remove the special cases for Darwin and 32-bit systems and see how it goes. That wouldn't be harmless - it would probably result in incorrect answers in practice, and would certainly be unspecified. However, I'm not reading uninitialized bytes. I call memset() so that in the event of the final strxfrm() blob being less than 8 bytes (which can happen even on glibc with en_US.UTF-8). It cannot be harmful to memcmp() every Datum byte if the remaining bytes are always initialized to NUL. OK. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers