subject:"Re\: \[HACKERS\] B\-Tree support function number 3 \(strxfrm\(\) optimization\)"

Re: [HACKERS] B-Tree support function number 3 (strxfrm() optimization)

2015-01-23 Thread Robert Haas

On Wed, Jan 21, 2015 at 2:22 AM, Peter Geoghegan p...@heroku.com wrote:
 You'll probably prefer the attached. This patch works by disabling
 abbreviation, but only after writing out runs, with the final merge
 left to go. That way, it doesn't matter when abbreviated keys are not
 read back from disk (or regenerated).

Yes, this seems like the way to go for now.  Thanks, committed.  And
thanks to Andrew for spotting this so quickly.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] B-Tree support function number 3 (strxfrm() optimization)

2015-01-23 Thread Robert Haas

On Fri, Jan 23, 2015 at 2:18 AM, David Rowley dgrowle...@gmail.com wrote:
 On 20 January 2015 at 17:10, Peter Geoghegan p...@heroku.com wrote:

 On Mon, Jan 19, 2015 at 7:47 PM, Michael Paquier
 michael.paqu...@gmail.com wrote:

  With your patch applied, the failure with MSVC disappeared, but there
  is still a warning showing up:
  (ClCompile target) -
src\backend\lib\hyperloglog.c(73): warning C4334: '' : result of
  32-bit shift implicitly converted to 64 bits (was 64-bit shift
  intended?

 That seems harmless. I suggest an explicit cast to Size here.

 This caught my eye too.

 I agree about casting to Size.

 Patch attached.

Thanks, committed.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] B-Tree support function number 3 (strxfrm() optimization)

2015-01-22 Thread David Rowley

On 20 January 2015 at 17:10, Peter Geoghegan p...@heroku.com wrote:

 On Mon, Jan 19, 2015 at 7:47 PM, Michael Paquier
 michael.paqu...@gmail.com wrote:

  With your patch applied, the failure with MSVC disappeared, but there
  is still a warning showing up:
  (ClCompile target) -
src\backend\lib\hyperloglog.c(73): warning C4334: '' : result of
  32-bit shift implicitly converted to 64 bits (was 64-bit shift
  intended?

 That seems harmless. I suggest an explicit cast to Size here.


This caught my eye too.

I agree about casting to Size.

Patch attached.

Regards

David Rowley


hyperloglog_bitshift_fix.patch
Description: Binary data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] B-Tree support function number 3 (strxfrm() optimization)

2015-01-21 Thread Andrew Gierth

 Peter == Peter Geoghegan p...@heroku.com writes:

 Peter You'll probably prefer the attached. This patch works by
 Peter disabling abbreviation, but only after writing out runs, with
 Peter the final merge left to go. That way, it doesn't matter when
 Peter abbreviated keys are not read back from disk (or regenerated).

This seems tolerable to me for a quick fix. The merits of storing the
abbreviation vs. re-abbreviating on input can be studied later.

 Peter I believe this bug was missed because it only occurs when there
 Peter are multiple runs, and not in the common case where there is one
 Peter big initial run that is found already sorted when we reach
 Peter mergeruns().

Ah, yes, there is an optimization for the one-run case which bypasses
all further comparisons, hiding the problem.

-- 
Andrew (irc:RhodiumToad)


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] B-Tree support function number 3 (strxfrm() optimization)

2015-01-21 Thread Andrew Gierth

 Peter == Peter Geoghegan p...@heroku.com writes:

 Peter Basically, the intersection of the datum sort case with
 Peter abbreviated keys seems complicated.

Not to me. To me it seems completely trivial.

Now, I follow this general principle that someone who is not doing the
work should never say X is easy to someone who _is_ doing it, unless
they're prepared to at least outline the solution on request or
otherwise contribute.  So see the attached patch (which I will concede
could probably do with more comments, it's a quick hack intended for
illustration) and tell me what you think is missing that would make it a
complicated problem.

 Peter I tended to think that the solution was to force a heaptuple
 Peter sort instead (where abbreviation naturally can be used),

This seems completely wrong - why should the caller have to worry about
this implementation detail? The caller shouldn't have to know about what
types or what circumstances might or might not benefit from
abbreviation.

-- 
Andrew (irc:RhodiumToad)


diff --git a/src/backend/utils/sort/tuplesort.c b/src/backend/utils/sort/tuplesort.c
index b6e302f..0dbb277 100644
--- a/src/backend/utils/sort/tuplesort.c
+++ b/src/backend/utils/sort/tuplesort.c
@@ -901,30 +901,34 @@ tuplesort_begin_datum(Oid datumType, Oid sortOperator, Oid sortCollation,
 	state-copytup = copytup_datum;
 	state-writetup = writetup_datum;
 	state-readtup = readtup_datum;
+	state-abbrevNext = 10;
 
 	state-datumType = datumType;
 
-	/* Prepare SortSupport data */
-	state-onlyKey = (SortSupport) palloc0(sizeof(SortSupportData));
-
-	state-onlyKey-ssup_cxt = CurrentMemoryContext;
-	state-onlyKey-ssup_collation = sortCollation;
-	state-onlyKey-ssup_nulls_first = nullsFirstFlag;
-	/*
-	 * Conversion to abbreviated representation infeasible in the Datum case.
-	 * It must be possible to subsequently fetch original datum values within
-	 * tuplesort_getdatum(), which would require special-case preservation of
-	 * original values.
-	 */
-	state-onlyKey-abbreviate = false;
-
-	PrepareSortSupportFromOrderingOp(sortOperator, state-onlyKey);
-
 	/* lookup necessary attributes of the datum type */
 	get_typlenbyval(datumType, typlen, typbyval);
 	state-datumTypeLen = typlen;
 	state-datumTypeByVal = typbyval;
 
+	/* Prepare SortSupport data */
+	state-sortKeys = (SortSupport) palloc0(sizeof(SortSupportData));
+
+	state-sortKeys-ssup_cxt = CurrentMemoryContext;
+	state-sortKeys-ssup_collation = sortCollation;
+	state-sortKeys-ssup_nulls_first = nullsFirstFlag;
+	state-sortKeys-abbreviate = !typbyval;
+
+	PrepareSortSupportFromOrderingOp(sortOperator, state-sortKeys);
+
+	/*
+	 * The onlyKey optimization cannot be used with abbreviated keys, since
+	 * tie-breaker comparisons may be required.  Typically, the optimization is
+	 * only of value to pass-by-value types anyway, whereas abbreviated keys
+	 * are typically only of value to pass-by-reference types.
+	 */
+	if (!state-sortKeys-abbrev_converter)
+		state-onlyKey = state-sortKeys;
+
 	MemoryContextSwitchTo(oldcontext);
 
 	return state;
@@ -1318,10 +1322,43 @@ tuplesort_putdatum(Tuplesortstate *state, Datum val, bool isNull)
 	}
 	else
 	{
-		stup.datum1 = datumCopy(val, false, state-datumTypeLen);
+		Datum	original = datumCopy(val, false, state-datumTypeLen);
 		stup.isnull1 = false;
-		stup.tuple = DatumGetPointer(stup.datum1);
+		stup.tuple = DatumGetPointer(original);
 		USEMEM(state, GetMemoryChunkSpace(stup.tuple));
+
+		if (!state-sortKeys-abbrev_converter)
+		{
+			stup.datum1 = original;
+		}
+		else if (!consider_abort_common(state))
+		{
+			/* Store abbreviated key representation */
+			stup.datum1 = state-sortKeys-abbrev_converter(original,
+			state-sortKeys);
+		}
+		else
+		{
+			/* Abort abbreviation */
+			int		i;
+
+			stup.datum1 = original;
+
+			/*
+			 * Set state to be consistent with never trying abbreviation.
+			 *
+			 * Alter datum1 representation in already-copied tuples, so as to
+			 * ensure a consistent representation (current tuple was just handled).
+			 * Note that we rely on all tuples copied so far actually being
+			 * contained within memtuples array.
+			 */
+			for (i = 0; i  state-memtupcount; i++)
+			{
+SortTuple *mtup = state-memtuples[i];
+
+mtup-datum1 = PointerGetDatum(mtup-tuple);
+			}
+		}
 	}
 
 	puttuple_common(state, stup);
@@ -1887,9 +1924,9 @@ tuplesort_getdatum(Tuplesortstate *state, bool forward,
 	else
 	{
 		if (should_free)
-			*val = stup.datum1;
+			*val = PointerGetDatum(stup.tuple);
 		else
-			*val = datumCopy(stup.datum1, false, state-datumTypeLen);
+			*val = datumCopy(PointerGetDatum(stup.tuple), false, state-datumTypeLen);
 		*isNull = false;
 	}
 
@@ -3712,9 +3749,22 @@ readtup_index(Tuplesortstate *state, SortTuple *stup,
 static int
 comparetup_datum(const SortTuple *a, const SortTuple *b, Tuplesortstate *state)
 {
-	return ApplySortComparator(a-datum1, a-isnull1,
-			   b-datum1, b-isnull1,
-			   state-onlyKey);
+	int		compare;
+
+

Re: [HACKERS] B-Tree support function number 3 (strxfrm() optimization)

2015-01-20 Thread Andrew Gierth

 Robert == Robert Haas robertmh...@gmail.com writes:

 Robert All right, it seems Tom is with you on that point, so after
 Robert some study, I've committed this with very minor modifications.

This caught my eye (thanks to conflict with GS patch):

 * In the future, we should consider forcing the
 * tuplesort_begin_heap() case when the abbreviated key
 * optimization can thereby be used, even when numInputs is 1.

The comment in tuplesort_begin_datum that abbreviation can't be used
seems wrong to me; why is the copy of the original value pointed to by
stup-tuple (in the case of by-reference types, and abbreviation is
obviously not needed for by-value types) not sufficient?

Or what am I missing?

-- 
Andrew (irc:RhodiumToad)


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] B-Tree support function number 3 (strxfrm() optimization)

2015-01-20 Thread Robert Haas

On Mon, Jan 19, 2015 at 9:29 PM, Peter Geoghegan p...@heroku.com wrote:
 I think that the attached patch should at least fix that much. Maybe
 the problem on the other animal is also explained by the lack of this,
 since there could also be a MinGW-ish strxfrm_l(), I suppose.

Committed that, rather blindly, since it looks like a reasonable fix.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] B-Tree support function number 3 (strxfrm() optimization)

2015-01-20 Thread Peter Geoghegan

On Tue, Jan 20, 2015 at 3:34 PM, Peter Geoghegan p...@heroku.com wrote:
 On Tue, Jan 20, 2015 at 3:34 PM, Robert Haas robertmh...@gmail.com wrote:
 Dear me.  Peter, can you fix this RSN?

 Investigating.

It's certainly possible to fix Andrew's test case with the attached.
I'm not sure that that's the appropriate fix, though: there is
probably a case to be made for not bothering with abbreviation once
we've read tuples in for the final merge run. More likely, the
strongest case is for storing the abbreviated keys on disk too, and
reading those back.

-- 
Peter Geoghegan
diff --git a/src/backend/utils/sort/tuplesort.c b/src/backend/utils/sort/tuplesort.c
index 6d3aa88..adf4c4d 100644
--- a/src/backend/utils/sort/tuplesort.c
+++ b/src/backend/utils/sort/tuplesort.c
@@ -3148,6 +3148,7 @@ readtup_heap(Tuplesortstate *state, SortTuple *stup,
 	MinimalTuple tuple = (MinimalTuple) palloc(tuplen);
 	char	   *tupbody = (char *) tuple + MINIMAL_TUPLE_DATA_OFFSET;
 	HeapTupleData htup;
+	Datum			original;
 
 	USEMEM(state, GetMemoryChunkSpace(tuple));
 	/* read in the tuple proper */
@@ -3161,10 +3162,29 @@ readtup_heap(Tuplesortstate *state, SortTuple *stup,
 	/* set up first-column key value */
 	htup.t_len = tuple-t_len + MINIMAL_TUPLE_OFFSET;
 	htup.t_data = (HeapTupleHeader) ((char *) tuple - MINIMAL_TUPLE_OFFSET);
-	stup-datum1 = heap_getattr(htup,
-state-sortKeys[0].ssup_attno,
-state-tupDesc,
-stup-isnull1);
+	/* Once again, store abbreviated key representation */
+	original = heap_getattr(htup,
+			state-sortKeys[0].ssup_attno,
+			state-tupDesc,
+			stup-isnull1);
+
+	if (!state-sortKeys-abbrev_converter || stup-isnull1)
+	{
+		/*
+		 * Store ordinary Datum representation, or NULL value.  If there is a
+		 * converter it won't expect NULL values, and cost model is not
+		 * required to account for NULL, so in that case we avoid calling
+		 * converter and just set datum1 to void representation (to be
+		 * consistent).
+		 */
+		stup-datum1 = original;
+	}
+	else
+	{
+		/* Store abbreviated key representation */
+		stup-datum1 = state-sortKeys-abbrev_converter(original,
+		 state-sortKeys);
+	}
 }
 
 /*

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] B-Tree support function number 3 (strxfrm() optimization)

2015-01-20 Thread Robert Haas

On Tue, Jan 20, 2015 at 6:27 PM, Andrew Gierth
and...@tao11.riddles.org.uk wrote:
 Robert == Robert Haas robertmh...@gmail.com writes:
  Robert All right, it seems Tom is with you on that point, so after
  Robert some study, I've committed this with very minor modifications.

 While hacking up a patch to demonstrate the simplicity of extending this
 to the Datum sorter, I seem to have run into a fairly major issue with
 this: there seems to be no attempt whatsoever to handle spilling to disk
 correctly. The data spilled to disk only has the un-abbreviated values,
 but nothing tries to re-abbreviate it (or disable abbreviations) when it
 is read back in, and chaos ensues:

Dear me.  Peter, can you fix this RSN?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] B-Tree support function number 3 (strxfrm() optimization)

2015-01-20 Thread Peter Geoghegan

On Tue, Jan 20, 2015 at 3:46 AM, Andrew Gierth
and...@tao11.riddles.org.uk wrote:
 The comment in tuplesort_begin_datum that abbreviation can't be used
 seems wrong to me; why is the copy of the original value pointed to by
 stup-tuple (in the case of by-reference types, and abbreviation is
 obviously not needed for by-value types) not sufficient?

We haven't formalized the idea that pass-by-value types are not
targets for abbreviation (it's just that the practical application of
abbreviated keys is likely to be limited to pass-by-reference types,
generating a compact pass-by-value abbreviated representation). That
could be a useful restriction to formalize, and certainly seems likely
to be a harmless one, but for now that's the way it is.

It might be sufficient for some tuplesort_begin_datum() callers. Datum
tuple sorts require the original values. Aside from the formalization
of abbreviation only applying to pass-by-value types, you'd have to
teach tuplesort_getdatum() to reconstruct the non-abbreviated
representation transparently from each SortTuple's tuple proper.
However, the actual tuplesort_getdatum() calls could be the dominant
cost, not the sort  (I'm not sure of that offhand - that's total
speculation).

Basically, the intersection of the datum sort case with abbreviated
keys seems complicated. I tended to think that the solution was to
force a heaptuple sort instead (where abbreviation naturally can be
used), since clearly that could work in some important cases like
nodeAgg.c, iff the gumption to do it that way was readily available.
Rightly or wrongly, I preferred that idea to the idea of teaching the
Datum case to handle abbreviation across the board. Maybe that's the
wrong way of fixing that, but for now I don't think it's acceptable
that abbreviation isn't always used in certain cases where it could
make sense (e.g. not for simple GroupAggregates with a single
attribute -- only multiple attribute GroupAggregates). After all, most
sort cases (e.g. B-Tree builds) didn't use SortSupport for several
years, simply because no one got around to it until I finally did a
few months back.

Note that most tuplesort non-users of abbreviation don't use
abbreviation for sensible reasons. For example, abbreviation simply
doesn't make sense for Top-N heap sorts, or MJExamineQuals(). The
non-single-attribute GroupAggregate/nodeAgg.c case seems bad, but I
don't have a good sense of how bad things are with orderedsetaggs.c
non-use is...it might matter less than the other case.

-- 
Peter Geoghegan


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] B-Tree support function number 3 (strxfrm() optimization)

2015-01-20 Thread Andrew Gierth

 Robert == Robert Haas robertmh...@gmail.com writes:

 Robert All right, it seems Tom is with you on that point, so after
 Robert some study, I've committed this with very minor modifications.

While hacking up a patch to demonstrate the simplicity of extending this
to the Datum sorter, I seem to have run into a fairly major issue with
this: there seems to be no attempt whatsoever to handle spilling to disk
correctly. The data spilled to disk only has the un-abbreviated values,
but nothing tries to re-abbreviate it (or disable abbreviations) when it
is read back in, and chaos ensues:

set work_mem = 64;
select v, v  lag(v) over (order by v)
  from (select 'B'||i as v from generate_series(1,1) i
union all select 'a'||i from generate_series(1,1) i offset
  0) s
  order by v limit 20;

   v| ?column? 
+--
 a1 | 
 B1 | f
 a1000  | t
 a1001  | t
 a1002  | t
 a1003  | t
 B1000  | f
 B1001  | t
 B1002  | t
 B1003  | t
 B1004  | t
 B1005  | t
 a1004  | t
 a1005  | t
 a1006  | t
 a1007  | t
 a1008  | t
 B1 | f
 B10| t
 B100   | t
(20 rows)

-- 
Andrew (irc:RhodiumToad)


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] B-Tree support function number 3 (strxfrm() optimization)

2015-01-20 Thread Peter Geoghegan

On Tue, Jan 20, 2015 at 2:00 PM, Peter Geoghegan p...@heroku.com wrote:
 Maybe that's the
 wrong way of fixing that, but for now I don't think it's acceptable
 that abbreviation isn't always used in certain cases where it could
 make sense (e.g. not for simple GroupAggregates with a single
 attribute -- only multiple attribute GroupAggregates). After all, most
 sort cases (e.g. B-Tree builds) didn't use SortSupport for several
 years, simply because no one got around to it until I finally did a
 few months back.


Exuse me. I mean that this *is* an acceptable restriction for the time being.

-- 
Peter Geoghegan


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] B-Tree support function number 3 (strxfrm() optimization)

2015-01-20 Thread Peter Geoghegan

On Tue, Jan 20, 2015 at 3:34 PM, Robert Haas robertmh...@gmail.com wrote:
 Dear me.  Peter, can you fix this RSN?

Investigating.

-- 
Peter Geoghegan


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] B-Tree support function number 3 (strxfrm() optimization)

2015-01-20 Thread Peter Geoghegan

On Tue, Jan 20, 2015 at 3:33 PM, Robert Haas robertmh...@gmail.com wrote:
 Peter, this made bowerbird (Windows 8/Visual Studio) build, but it's
 failing make check.  Ditto hamerkop (Windows 2k8/VC++) and currawong
 (Windows XP Pro/MSVC++).  jacana (Windows 8/gcc) and brolga (Windows
 XP Pro/cygwin) are unhappy too, although the failures are showing up
 in different build stages rather than in 'make check'.  narwhal
 (Windows 2k3/mingw) and frogmouth (Windows XP Pro/gcc) are happy,
 though, so it's not affecting ALL of the Windows critters.  Still, I'm
 leaning toward the view that we should disable this optimization
 across-the-board on Windows until somebody has time to do the legwork
 to figure out what it takes to make it work, and what makes it work on
 some of these critters and fail on others.  We can't leave the
 buildfarm red for long periods of time.

Fair enough.

-- 
Peter Geoghegan


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] B-Tree support function number 3 (strxfrm() optimization)

2015-01-20 Thread Peter Geoghegan

On Tue, Jan 20, 2015 at 3:57 PM, Peter Geoghegan p...@heroku.com wrote:
 It's certainly possible to fix Andrew's test case with the attached.
 I'm not sure that that's the appropriate fix, though: there is
 probably a case to be made for not bothering with abbreviation once
 we've read tuples in for the final merge run. More likely, the
 strongest case is for storing the abbreviated keys on disk too, and
 reading those back.

Maybe not, though: An extra 8 bytes per tuple on disk is not free.
OTOH, if we're I/O bound on the final merge, as we ought to be, then
recomputing the abbreviated keys could make sense, since there may
well be an idle CPU core anyway.

-- 
Peter Geoghegan


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] B-Tree support function number 3 (strxfrm() optimization)

2015-01-20 Thread Robert Haas

On Tue, Jan 20, 2015 at 10:54 AM, Robert Haas robertmh...@gmail.com wrote:
 On Mon, Jan 19, 2015 at 9:29 PM, Peter Geoghegan p...@heroku.com wrote:
 I think that the attached patch should at least fix that much. Maybe
 the problem on the other animal is also explained by the lack of this,
 since there could also be a MinGW-ish strxfrm_l(), I suppose.

 Committed that, rather blindly, since it looks like a reasonable fix.

Peter, this made bowerbird (Windows 8/Visual Studio) build, but it's
failing make check.  Ditto hamerkop (Windows 2k8/VC++) and currawong
(Windows XP Pro/MSVC++).  jacana (Windows 8/gcc) and brolga (Windows
XP Pro/cygwin) are unhappy too, although the failures are showing up
in different build stages rather than in 'make check'.  narwhal
(Windows 2k3/mingw) and frogmouth (Windows XP Pro/gcc) are happy,
though, so it's not affecting ALL of the Windows critters.  Still, I'm
leaning toward the view that we should disable this optimization
across-the-board on Windows until somebody has time to do the legwork
to figure out what it takes to make it work, and what makes it work on
some of these critters and fail on others.  We can't leave the
buildfarm red for long periods of time.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] B-Tree support function number 3 (strxfrm() optimization)

2015-01-20 Thread Peter Geoghegan

On Tue, Jan 20, 2015 at 5:32 PM, Robert Haas robertmh...@gmail.com wrote:
 I was assuming we were going to fix this by undoing the abbreviation
 (as in the abort case) when we spill to disk, and not bothering with
 it thereafter.

The spill-to-disk case is at least as compelling at the internal sort
case. The overhead of comparisons is much higher for tapesort.

Attached patch serializes keys. On reflection, I'm inclined to go with
this approach. Even if the CPU overhead of reconstructing strxfrm()
blobs is acceptable for text, it might be much more expensive for
other types. I'm loathe to throw away those abbreviated keys
unnecessarily.

We don't have to worry about having aborted abbreviation, since once
we spill to disk we've effectively committed to abbreviation. This
patch formalizes the idea that there is strictly a pass-by-value
representation required for such cases (but not that the original
Datums must be of a pass-by-reference, which is another thing
entirely). I've tested it some, obviously with Andrew's testcase and
the regression tests, but also with my B-Tree verification tool.
Please review it.

Sorry about this.
-- 
Peter Geoghegan
diff --git a/src/backend/utils/sort/tuplesort.c b/src/backend/utils/sort/tuplesort.c
index 6d3aa88..a442617 100644
--- a/src/backend/utils/sort/tuplesort.c
+++ b/src/backend/utils/sort/tuplesort.c
@@ -413,10 +413,13 @@ struct Tuplesortstate
  *
  * If state-randomAccess is true, then the stored representation of the
  * tuple must be followed by another unsigned int that is a copy of the
- * length --- so the total tape space used is actually sizeof(unsigned int)
- * more than the stored length value.  This allows read-backwards.  When
- * randomAccess is not true, the write/read routines may omit the extra
- * length word.
+ * length (less any abbreviated key that is stored, which is also possible) ---
+ * so the total tape space used is actually sizeof(unsigned int) more than the
+ * stored length value, as well as another sizeof(Datum) overhead for storing
+ * abbreviated keys.  This allows read-backwards, and avoids the need to
+ * re-compute abbreviated keys.  When randomAccess is not true, the write/read
+ * routines may omit the extra length word.  Also, when abbreviation is not in
+ * play, abbreviated keys are not stored.
  *
  * writetup is expected to write both length words as well as the tuple
  * data.  When readtup is called, the tape is positioned just after the
@@ -3124,11 +3127,15 @@ writetup_heap(Tuplesortstate *state, int tapenum, SortTuple *stup)
 	char	   *tupbody = (char *) tuple + MINIMAL_TUPLE_DATA_OFFSET;
 	unsigned int tupbodylen = tuple-t_len - MINIMAL_TUPLE_DATA_OFFSET;
 
-	/* total on-disk footprint: */
+	/* total on-disk footprint for tuple: */
 	unsigned int tuplen = tupbodylen + sizeof(int);
 
 	LogicalTapeWrite(state-tapeset, tapenum,
 	 (void *) tuplen, sizeof(tuplen));
+	/* Store abbreviated key, if any (not accounted for by tuplen) */
+	if (state-sortKeys-abbrev_converter)
+		LogicalTapeWrite(state-tapeset, tapenum,
+		 (void *) stup-datum1, sizeof(Datum));
 	LogicalTapeWrite(state-tapeset, tapenum,
 	 (void *) tupbody, tupbodylen);
 	if (state-randomAccess)	/* need trailing length word? */
@@ -3150,6 +3157,10 @@ readtup_heap(Tuplesortstate *state, SortTuple *stup,
 	HeapTupleData htup;
 
 	USEMEM(state, GetMemoryChunkSpace(tuple));
+	/* read in the tuple abbreviated key, if any */
+	if (state-sortKeys-abbrev_converter)
+		LogicalTapeReadExact(state-tapeset, tapenum, (void *) stup-datum1,
+			 sizeof(Datum));
 	/* read in the tuple proper */
 	tuple-t_len = tuplen;
 	LogicalTapeReadExact(state-tapeset, tapenum,
@@ -3161,10 +3172,12 @@ readtup_heap(Tuplesortstate *state, SortTuple *stup,
 	/* set up first-column key value */
 	htup.t_len = tuple-t_len + MINIMAL_TUPLE_OFFSET;
 	htup.t_data = (HeapTupleHeader) ((char *) tuple - MINIMAL_TUPLE_OFFSET);
-	stup-datum1 = heap_getattr(htup,
-state-sortKeys[0].ssup_attno,
-state-tupDesc,
-stup-isnull1);
+	/* set up first-column key value (for non-abbreviated case) */
+	if (!state-sortKeys-abbrev_converter)
+		stup-datum1 = heap_getattr(htup,
+	state-sortKeys[0].ssup_attno,
+	state-tupDesc,
+	stup-isnull1);
 }
 
 /*
@@ -3355,12 +3368,20 @@ writetup_cluster(Tuplesortstate *state, int tapenum, SortTuple *stup)
 {
 	HeapTuple	tuple = (HeapTuple) stup-tuple;
 	unsigned int tuplen = tuple-t_len + sizeof(ItemPointerData) + sizeof(int);
+	AttrNumber	leading = state-indexInfo-ii_KeyAttrNumbers[0];
 
-	/* We need to store t_self, but not other fields of HeapTupleData */
+	/*
+	 * We need to store t_self, but not other fields of HeapTupleData (although
+	 * possibly abbreviated key value)
+	 */
 	LogicalTapeWrite(state-tapeset, tapenum,
 	 tuplen, sizeof(tuplen));
 	LogicalTapeWrite(state-tapeset, tapenum,
 	 tuple-t_self, sizeof(ItemPointerData));
+	/* Store abbreviated key, if any (not accounted for by tuplen) */
+	if

Re: [HACKERS] B-Tree support function number 3 (strxfrm() optimization)

2015-01-20 Thread Robert Haas

On Tue, Jan 20, 2015 at 9:33 PM, Peter Geoghegan p...@heroku.com wrote:
 On Tue, Jan 20, 2015 at 6:30 PM, Robert Haas robertmh...@gmail.com wrote:
 I don't want to change the on-disk format for tapes without a lot more
 discussion.  Can you come up with a fix that avoids that for now?

 A more conservative approach would be to perform conversion on-the-fly
 once more. That wouldn't be patently unacceptable from a performance
 perspective, and would also not change the on-disk format for tapes.

 Thoughts?

That might be OK.  Probably needs a bit of performance testing to see
how it looks.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] B-Tree support function number 3 (strxfrm() optimization)

2015-01-20 Thread Robert Haas

On Tue, Jan 20, 2015 at 8:39 PM, Peter Geoghegan p...@heroku.com wrote:
 On Tue, Jan 20, 2015 at 5:32 PM, Robert Haas robertmh...@gmail.com wrote:
 I was assuming we were going to fix this by undoing the abbreviation
 (as in the abort case) when we spill to disk, and not bothering with
 it thereafter.

 The spill-to-disk case is at least as compelling at the internal sort
 case. The overhead of comparisons is much higher for tapesort.

First, we need to unbreak this.  Then, we can look at optimizing it.
The latter task will require performance testing.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] B-Tree support function number 3 (strxfrm() optimization)

2015-01-20 Thread Peter Geoghegan

On Tue, Jan 20, 2015 at 6:30 PM, Robert Haas robertmh...@gmail.com wrote:
 I don't want to change the on-disk format for tapes without a lot more
 discussion.  Can you come up with a fix that avoids that for now?

A more conservative approach would be to perform conversion on-the-fly
once more. That wouldn't be patently unacceptable from a performance
perspective, and would also not change the on-disk format for tapes.

Thoughts?

-- 
Peter Geoghegan


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] B-Tree support function number 3 (strxfrm() optimization)

2015-01-20 Thread Robert Haas

On Tue, Jan 20, 2015 at 7:07 PM, Peter Geoghegan p...@heroku.com wrote:
 On Tue, Jan 20, 2015 at 3:57 PM, Peter Geoghegan p...@heroku.com wrote:
 It's certainly possible to fix Andrew's test case with the attached.
 I'm not sure that that's the appropriate fix, though: there is
 probably a case to be made for not bothering with abbreviation once
 we've read tuples in for the final merge run. More likely, the
 strongest case is for storing the abbreviated keys on disk too, and
 reading those back.

 Maybe not, though: An extra 8 bytes per tuple on disk is not free.
 OTOH, if we're I/O bound on the final merge, as we ought to be, then
 recomputing the abbreviated keys could make sense, since there may
 well be an idle CPU core anyway.

I was assuming we were going to fix this by undoing the abbreviation
(as in the abort case) when we spill to disk, and not bothering with
it thereafter.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] B-Tree support function number 3 (strxfrm() optimization)

2015-01-20 Thread Peter Geoghegan

On Tue, Jan 20, 2015 at 6:39 PM, Peter Geoghegan p...@heroku.com wrote:
 On Tue, Jan 20, 2015 at 6:34 PM, Robert Haas robertmh...@gmail.com wrote:
 That might be OK.  Probably needs a bit of performance testing to see
 how it looks.

 Well, we're still only doing it when we do our final merge. So that's
 only doubling the number of conversions required, which if we're
 blocked on I/O might not matter at all.

You'll probably prefer the attached. This patch works by disabling
abbreviation, but only after writing out runs, with the final merge
left to go. That way, it doesn't matter when abbreviated keys are not
read back from disk (or regenerated).

I believe this bug was missed because it only occurs when there are
multiple runs, and not in the common case where there is one big
initial run that is found already sorted when we reach mergeruns().
-- 
Peter Geoghegan
diff --git a/src/backend/utils/sort/tuplesort.c b/src/backend/utils/sort/tuplesort.c
index 6d3aa88..b6e302f 100644
--- a/src/backend/utils/sort/tuplesort.c
+++ b/src/backend/utils/sort/tuplesort.c
@@ -2172,6 +2172,22 @@ mergeruns(Tuplesortstate *state)
 		return;
 	}
 
+	if (state-sortKeys-abbrev_converter)
+	{
+		/*
+		 * If there are multiple runs to be merged, when we go to read back
+		 * tuples from disk, abbreviated keys will not have been stored, and we
+		 * don't care to regenerate them.  Disable abbreviation from this point
+		 * on.
+		 */
+		state-sortKeys-abbrev_converter = NULL;
+		state-sortKeys-comparator = state-sortKeys-abbrev_full_comparator;
+
+		/* Not strictly necessary, but be tidy */
+		state-sortKeys-abbrev_abort = NULL;
+		state-sortKeys-abbrev_full_comparator = NULL;
+	}
+
 	/* End of step D2: rewind all output tapes to prepare for merging */
 	for (tapenum = 0; tapenum  state-tapeRange; tapenum++)
 		LogicalTapeRewind(state-tapeset, tapenum, false);

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] B-Tree support function number 3 (strxfrm() optimization)

2015-01-20 Thread Peter Geoghegan

On Tue, Jan 20, 2015 at 5:42 PM, Robert Haas robertmh...@gmail.com wrote:
 On Tue, Jan 20, 2015 at 8:39 PM, Peter Geoghegan p...@heroku.com wrote:
 On Tue, Jan 20, 2015 at 5:32 PM, Robert Haas robertmh...@gmail.com wrote:
 I was assuming we were going to fix this by undoing the abbreviation
 (as in the abort case) when we spill to disk, and not bothering with
 it thereafter.

 The spill-to-disk case is at least as compelling at the internal sort
 case. The overhead of comparisons is much higher for tapesort.

 First, we need to unbreak this.  Then, we can look at optimizing it.
 The latter task will require performance testing.

I don't see that any alternative isn't a performance trade-off. My
patch accomplishes unbreaking this. I agree that it needs still needs
review from that perspective, but it doesn't seem any worse than any
other alternative. Would you prefer it if the spill-to-disk case
aborted in the style of low entropy keys? That doesn't seem
significantly safer than this, and it certainly not acceptable from a
performance perspective.
-- 
Peter Geoghegan


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] B-Tree support function number 3 (strxfrm() optimization)

2015-01-20 Thread Peter Geoghegan

On Tue, Jan 20, 2015 at 5:46 PM, Peter Geoghegan p...@heroku.com wrote:
 Would you prefer it if the spill-to-disk case
 aborted in the style of low entropy keys? That doesn't seem
 significantly safer than this, and it certainly not acceptable from a
 performance perspective.

BTW, I can write that patch if that's your preference. Should I?

I just don't favor that even as a short term correctness fix, because
it seems unacceptable to throw away all the strxfrm() work where
that's a very predictable and even likely outcome. I suggest reviewing
and committing my fix as a short term fix, that may well turn out to
be generally acceptable upon further consideration. Yes, we'll need to
make a point of reviewing an already committed patch, but there is a
precedent for that.

-- 
Peter Geoghegan


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] B-Tree support function number 3 (strxfrm() optimization)

2015-01-20 Thread Robert Haas

On Tue, Jan 20, 2015 at 8:39 PM, Peter Geoghegan p...@heroku.com wrote:
 On Tue, Jan 20, 2015 at 5:32 PM, Robert Haas robertmh...@gmail.com wrote:
 I was assuming we were going to fix this by undoing the abbreviation
 (as in the abort case) when we spill to disk, and not bothering with
 it thereafter.

 The spill-to-disk case is at least as compelling at the internal sort
 case. The overhead of comparisons is much higher for tapesort.

 Attached patch serializes keys. On reflection, I'm inclined to go with
 this approach. Even if the CPU overhead of reconstructing strxfrm()
 blobs is acceptable for text, it might be much more expensive for
 other types. I'm loathe to throw away those abbreviated keys
 unnecessarily.

 We don't have to worry about having aborted abbreviation, since once
 we spill to disk we've effectively committed to abbreviation. This
 patch formalizes the idea that there is strictly a pass-by-value
 representation required for such cases (but not that the original
 Datums must be of a pass-by-reference, which is another thing
 entirely). I've tested it some, obviously with Andrew's testcase and
 the regression tests, but also with my B-Tree verification tool.
 Please review it.

 Sorry about this.

I don't want to change the on-disk format for tapes without a lot more
discussion.  Can you come up with a fix that avoids that for now?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] B-Tree support function number 3 (strxfrm() optimization)

2015-01-20 Thread Peter Geoghegan

On Tue, Jan 20, 2015 at 6:34 PM, Robert Haas robertmh...@gmail.com wrote:
 That might be OK.  Probably needs a bit of performance testing to see
 how it looks.

Well, we're still only doing it when we do our final merge. So that's
only doubling the number of conversions required, which if we're
blocked on I/O might not matter at all.

-- 
Peter Geoghegan


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] B-Tree support function number 3 (strxfrm() optimization)

2015-01-19 Thread Michael Paquier

On Tue, Jan 20, 2015 at 11:29 AM, Peter Geoghegan p...@heroku.com wrote:
 On Mon, Jan 19, 2015 at 5:59 PM, Peter Geoghegan p...@heroku.com wrote:
 On Mon, Jan 19, 2015 at 5:33 PM, Alvaro Herrera
 alvhe...@2ndquadrant.com wrote:
 You did notice that bowerbird isn't building, right?
 http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=bowerbirddt=2015-01-19%2023%3A54%3A46

 Yeah. Looks like strxfrm_l() isn't available on the animal, for whatever 
 reason.

 I think that the attached patch should at least fix that much. Maybe
 the problem on the other animal is also explained by the lack of this,
 since there could also be a MinGW-ish strxfrm_l(), I suppose.
On MinGW-32, not that I know of:
$ find . -name *.h | xgrep strxfrm_l
./lib/gcc/mingw32/4.8.1/include/c++/mingw32/bits/c++config.h:/* Define if strxfr
m_l is available in string.h. */
./mingw32/lib/gcc/mingw32/4.8.1/include/c++/mingw32/bits/c++config.h:/* Define i
f strxfrm_l is available in string.h. */
strxfrm is defined in string.h though.

With your patch applied, the failure with MSVC disappeared, but there
is still a warning showing up:
(ClCompile target) -
  src\backend\lib\hyperloglog.c(73): warning C4334: '' : result of
32-bit shift implicitly converted to 64 bits (was 64-bit shift
intended?
-- 
Michael


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] B-Tree support function number 3 (strxfrm() optimization)

2015-01-19 Thread Alvaro Herrera

Peter Geoghegan wrote:

 It appears that the buildfarm animal brolga isn't happy about this
 patch. I'm not sure why, since I thought we already figured out bugs
 or other inconsistencies in various strxfrm() implementations.

You did notice that bowerbird isn't building, right?
http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=bowerbirddt=2015-01-19%2023%3A54%3A46

-- 
Álvaro Herrerahttp://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training  Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] B-Tree support function number 3 (strxfrm() optimization)

2015-01-19 Thread Robert Haas

On Mon, Jan 19, 2015 at 5:43 PM, Peter Geoghegan p...@heroku.com wrote:
 It appears that the buildfarm animal brolga isn't happy about this
 patch. I'm not sure why, since I thought we already figured out bugs
 or other inconsistencies in various strxfrm() implementations.

Well, the first thing that comes to mind is that strxfrm() is
returning strings that, when sorted, do not give the same order we
would have obtained via strcoll().  It's true that there are existing
callers of strxfrm(), but it looks like that is mostly used for
statistics-gathering, so it's possible that differences vs. strcoll()
would not have shown up before now.  Is there any legitimate way that
strxfrm() and strcoll() can return inconsistent answers - e.g. they
are somehow allowed to derive their notion of the relevant locale
differently - or is this just a case of Cygwin being busted?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] B-Tree support function number 3 (strxfrm() optimization)

2015-01-19 Thread Peter Geoghegan

On Mon, Jan 19, 2015 at 7:47 PM, Michael Paquier
michael.paqu...@gmail.com wrote:
 On MinGW-32, not that I know of:
 $ find . -name *.h | xgrep strxfrm_l
 ./lib/gcc/mingw32/4.8.1/include/c++/mingw32/bits/c++config.h:/* Define if 
 strxfr
 m_l is available in string.h. */
 ./mingw32/lib/gcc/mingw32/4.8.1/include/c++/mingw32/bits/c++config.h:/* 
 Define i
 f strxfrm_l is available in string.h. */
 strxfrm is defined in string.h though.

I'm not quite following. Doesn't that imply that strxfrm_l() at least
*could* be available? I guess it doesn't matter, though, because the
animal with the successful build that fails the locale regression test
(brolga) does not have locale_t support. Therefore, there is no new
strxfrm_l() caller.

My next guess is that the real problem is an assumption I've made.
That is, my assumption that strxfrm() always behaves equivalently to
strcpy() when the C locale happens to be in use may not be portable
(due to external factors). I guess we're inconsistent about making
sure that LC_COLLATE is set correctly in WIN32 and/or EXEC_BACKEND
builds, or something along those lines. The implementation in the past
got away with strcoll()/strxfrm() not having the C locale set, since
strcoll() was never called when the C locale was in use -- we just
called strcmp() instead.

Assuming that's correct, it might be easier just to entirely disable
the optimization on Windows, even with the C locale. It may not be
worth it to even bother just for C locale support of abbreviated keys.
I'm curious about what will happen there when the _strxfrm_l() fix
patch is applied.

 With your patch applied, the failure with MSVC disappeared, but there
 is still a warning showing up:
 (ClCompile target) -
   src\backend\lib\hyperloglog.c(73): warning C4334: '' : result of
 32-bit shift implicitly converted to 64 bits (was 64-bit shift
 intended?

That seems harmless. I suggest an explicit cast to Size here.

-- 
Peter Geoghegan


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] B-Tree support function number 3 (strxfrm() optimization)

2015-01-19 Thread Peter Geoghegan

On Mon, Jan 19, 2015 at 5:59 PM, Peter Geoghegan p...@heroku.com wrote:
 On Mon, Jan 19, 2015 at 5:33 PM, Alvaro Herrera
 alvhe...@2ndquadrant.com wrote:
 You did notice that bowerbird isn't building, right?
 http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=bowerbirddt=2015-01-19%2023%3A54%3A46

 Yeah. Looks like strxfrm_l() isn't available on the animal, for whatever 
 reason.

I think that the attached patch should at least fix that much. Maybe
the problem on the other animal is also explained by the lack of this,
since there could also be a MinGW-ish strxfrm_l(), I suppose.

-- 
Peter Geoghegan
diff --git a/src/include/port/win32.h b/src/include/port/win32.h
index 550c3ec..4cb51ec 100644
--- a/src/include/port/win32.h
+++ b/src/include/port/win32.h
@@ -341,6 +341,7 @@ typedef int pid_t;
 #define isspace_l _isspace_l
 #define iswspace_l _iswspace_l
 #define strcoll_l _strcoll_l
+#define strxfrm_l _strxfrm_l
 #define wcscoll_l _wcscoll_l
 #define wcstombs_l _wcstombs_l
 #define mbstowcs_l _mbstowcs_l

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] B-Tree support function number 3 (strxfrm() optimization)

2015-01-19 Thread Peter Geoghegan

On Mon, Jan 19, 2015 at 5:33 PM, Alvaro Herrera
alvhe...@2ndquadrant.com wrote:
 You did notice that bowerbird isn't building, right?
 http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=bowerbirddt=2015-01-19%2023%3A54%3A46

Yeah. Looks like strxfrm_l() isn't available on the animal, for whatever reason.

-- 
Peter Geoghegan


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] B-Tree support function number 3 (strxfrm() optimization)

2015-01-19 Thread Robert Haas

On Tue, Dec 2, 2014 at 8:28 PM, Peter Geoghegan p...@heroku.com wrote:
 On Tue, Dec 2, 2014 at 2:16 PM, Peter Geoghegan p...@heroku.com wrote:
 On Tue, Dec 2, 2014 at 2:07 PM, Robert Haas robertmh...@gmail.com wrote:
 Well, maybe you should make the updates we've agreed on and I can take
 another look at it.

 Agreed.

 Attached, revised patchset makes these updates. I continue to use the
 sortsupport struct to convey that a given attribute of a given sort is
 abbreviation-applicable (although the field is now a bool, not an
 enum).

All right, it seems Tom is with you on that point, so after some
study, I've committed this with very minor modifications.  Sorry for
the long delay.  I have not committed the 0002 patch, though, because
I haven't studied that enough yet to know whether I think it's a good
idea.  Perhaps that could get its own CommitFest entry and thread,
though, to separate it from this exceedingly long discussion and make
it clear exactly what we're hoping to gain by that patch specifically.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] B-Tree support function number 3 (strxfrm() optimization)

2015-01-19 Thread Robert Haas

On Mon, Jan 19, 2015 at 3:33 PM, Robert Haas robertmh...@gmail.com wrote:
 All right, it seems Tom is with you on that point, so after some
 study, I've committed this with very minor modifications.  Sorry for
 the long delay.  I have not committed the 0002 patch, though, because
 I haven't studied that enough yet to know whether I think it's a good
 idea.  Perhaps that could get its own CommitFest entry and thread,
 though, to separate it from this exceedingly long discussion and make
 it clear exactly what we're hoping to gain by that patch specifically.

By the way, for those following along at home, here's an example of
how this patch can help:

rhaas=# create table stuff as select random()::text as a, 'filler
filler filler'::text as b, g as c from generate_series(1, 100) g;
SELECT 100
rhaas=# create index on stuff (a);
CREATE INDEX

On the PPC64 machine I normally use for performance testing, it takes
about 6.3 seconds to build the index with the commit just before this
one.  With this commit, it drops to 1.9 seconds.  That's more than a
3x speedup!

Now, if I change the query that creates the table to this.

rhaas=# create table stuff as select '' || random()::text as
a, 'filler filler filler'::text as b, g as c from generate_series(1,
100) g;

...then it takes 10.8 seconds with or without this patch.  In general,
any case where the first few characters of every string are exactly
identical (or only quite rarely different) will not benefit, but many
practical cases will benefit significantly.  Also, Peter's gone to a
fair amount of work to make sure that even when the patch does not
help, it doesn't hurt, either.

So that's pretty cool.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] B-Tree support function number 3 (strxfrm() optimization)

2015-01-19 Thread Stephen Frost

* Robert Haas (robertmh...@gmail.com) wrote:
 On the PPC64 machine I normally use for performance testing, it takes
 about 6.3 seconds to build the index with the commit just before this
 one.  With this commit, it drops to 1.9 seconds.  That's more than a
 3x speedup!
 
 Now, if I change the query that creates the table to this.
 
 rhaas=# create table stuff as select '' || random()::text as
 a, 'filler filler filler'::text as b, g as c from generate_series(1,
 100) g;
 
 ...then it takes 10.8 seconds with or without this patch.  In general,
 any case where the first few characters of every string are exactly
 identical (or only quite rarely different) will not benefit, but many
 practical cases will benefit significantly.  Also, Peter's gone to a
 fair amount of work to make sure that even when the patch does not
 help, it doesn't hurt, either.
 
 So that's pretty cool.

Wow, nice!

Good work Peter!

Thanks,

Stephen


signature.asc
Description: Digital signature

Re: [HACKERS] B-Tree support function number 3 (strxfrm() optimization)

2014-12-09 Thread Peter Geoghegan

There is an interesting thread about strcoll() overhead over on -general:

http://www.postgresql.org/message-id/cab25xexnondrmc1_cy3jvmb0tmydm38ef9q2d7xla0rbncj...@mail.gmail.com

My guess was that this person experienced a rather unexpected downside
of spilling to disk when sorting on a text attribute: System
throughput becomes very CPU bound, because tapesort tends to result in
more comparisons [1].  With abbreviated keys, tapesort can actually
compete with quicksort in certain cases [2]. Tapesorts of text
attributes are especially bad on released versions of Postgres, and
will perform very little actual I/O.

In all seriousness, I wonder if we should add a release note item
stating that when using Postgres 9.5, due to the abbreviated key
optimization, external sorts can be much more I/O bound than in
previous releases...

[1] 
http://www.postgresql.org/message-id/20140806035512.ga91...@tornado.leadboat.com
[2] 
http://www.postgresql.org/message-id/CAM3SWZQiGvGhMB4TMbEWoNjO17=ysb5b5y5mgqjsanq4uwt...@mail.gmail.com
-- 
Peter Geoghegan


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] B-Tree support function number 3 (strxfrm() optimization)

2014-12-03 Thread Robert Haas

On Tue, Dec 2, 2014 at 5:44 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 Peter Geoghegan p...@heroku.com writes:
 On Tue, Dec 2, 2014 at 2:21 PM, Robert Haas robertmh...@gmail.com wrote:
 Right, and what I'm saying is that maybe the applicability flag
 shouldn't be stored in the SortSupport object, but passed down as an
 argument.

 But then how does that information get to any given sortsupport
 routine? That's the place that really needs to know if abbreviation is
 useful. In general, they're only passed a SortSupport object. Are you
 suggesting revising the signature required of SortSupport routines to
 add that extra flag as an additional argument?

 I think that is what he's suggesting, and I too am wondering why it's
 a good idea.

I find it somewhat confusing that we've got one flag which is only
used from the time the SortSupport object is created until the time
that it's fully initialized, and then a different way of indicating
whether we paid attention to that flag.  I'm not totally sure what the
right solution to that problem is, but the current situation feels
like something of a wart.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] B-Tree support function number 3 (strxfrm() optimization)

2014-12-03 Thread Peter Geoghegan

On Tue, Dec 2, 2014 at 1:21 PM, Peter Geoghegan p...@heroku.com wrote:
 Incidentally, I think that an under-appreciated possible source of
 regressions here is that attributes abbreviated have a strong
 physical/logical correlation. I could see a small regression for one
 such case even though my cost model indicated that it should be very
 profitable.

This was the column in question:

postgres=# select * from pg_stats where tablename = 'ohio_voters' and
attname = 'mailing_address1' ;
-[ RECORD 1 ]-
schemaname | public
tablename  | ohio_voters
attname| mailing_address1
inherited  | f
null_frac  | 0
avg_width  | 5
n_distinct | 789
most_common_vals   | {}
most_common_freqs  | {0.969267}
histogram_bounds   |  SNIP ***
correlation| 0.944785
 SNIP ***

This n_distinct is wrong, though. In fact, the number of distinct
columns is 25,946, while the number of distinct abbreviated keys is
13,691. So correlation was not the dominant factor here (although it
was probably still a factor) - rather, the dominant factor was that
the vast majority of comparisons would get away with an opportunistic
memcmp() == 0 anyway (although not with Postgres 9.4), and so my
baseline is very fast for this case.

This would not have come up had the value been represented as
NULL (as it clearly should have been), since that would not undergo
strxfrm() transformation/abbreviation in the first place. Even still,
highly skewed attributes exist in the wild, and deserve our
consideration - we do not model the distribution of values within the
set.

I believe that these cases are rare enough, and (thanks to the already
committed parts of this work) fast enough to probably not be worried
about; maybe a more complex cost model could do better, but I'm
inclined to think that it's not worth it. We'd end up slightly
improving this case at bigger cost to other, much more common cases.
Besides, equality-resolved comparisons are not necessarily much
cheaper for datatypes other than Postgres 9.5 text (in a world where
there is a variety of datatypes accelerated by abbreviation), which
discourages a general solution.

A custom format dump of this data (publicly available Ohio State voter
records) is available from:

http://postgres-benchmarks.s3-website-us-east-1.amazonaws.com/ohio_voters.custom.dump

-- 
Peter Geoghegan


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] B-Tree support function number 3 (strxfrm() optimization)

2014-12-02 Thread Robert Haas

On Tue, Nov 25, 2014 at 1:38 PM, Peter Geoghegan p...@heroku.com wrote:
 On Tue, Nov 25, 2014 at 4:01 AM, Robert Haas robertmh...@gmail.com wrote:
 - This appears to needlessly reindent the comments for PG_CACHE_LINE_SIZE.

 Actually, the word only is removed (because PG_CACHE_LINE_SIZE has a
 new client now). So it isn't quite the same paragraph as before.

Oy, I missed that.

 - I really don't think we need a #define in pg_config_manual.h for
 this.  Please omit that.

 You'd prefer to not offer a way to disable abbreviation? Okay. I guess
 that makes sense - it should work well as a general optimization.

I'd prefer not to have a #define in pg_config_manual.h.  Only stuff
that we expect a reasonably decent number of users to need to change
should be in that file, and this is too marginal for that.  If anybody
other than the developers of the feature is disabling this, the whole
thing is going to be ripped back out.

 I'm not sure about that. I'd prefer to have tuplesort (and one or two
 other sites) set the abbreviation is possible in principle flag.
 Otherwise, sortsupport needs to assume that the leading attribute is
 going to be the abbreviation-applicable one, which might not always be
 true. Still, it's not as if I feel strongly about it.

When wouldn't that be true?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] B-Tree support function number 3 (strxfrm() optimization)

2014-12-02 Thread Peter Geoghegan

On Tue, Dec 2, 2014 at 1:00 PM, Robert Haas robertmh...@gmail.com wrote:
 I'd prefer not to have a #define in pg_config_manual.h.  Only stuff
 that we expect a reasonably decent number of users to need to change
 should be in that file, and this is too marginal for that.  If anybody
 other than the developers of the feature is disabling this, the whole
 thing is going to be ripped back out.

I agree. The patch either works well in general and it goes in, or it
doesn't and should be rejected. That doesn't mean that the standard
applied is that regressions are absolutely unacceptable, but the
standard shouldn't be too far off that.  I feel pretty confident that
we'll be able to meet that standard, though, because database luminary
Jim Gray recommended this technique in about 1995. Even if the details
of what I have here could stand to be tweaked, there is no getting
around the fact that a good sort routine needs to strongly consider
locality. That was apparent even in 1995, but now it's a very major
trend.

Incidentally, I think that an under-appreciated possible source of
regressions here is that attributes abbreviated have a strong
physical/logical correlation. I could see a small regression for one
such case even though my cost model indicated that it should be very
profitable. On the other hand, on other occasions my cost model (i.e.
considering how good a proxy for full key cardinality abbreviated key
cardinality is) was quite pessimistic. Although, at least it was only
a small regression, even though the correlation was something like
0.95. And at least the sort will be very fast in any case.

You'll recall that Heikki's test case involved correlation like that,
even though it was mostly intended to make a point about the entropy
in abbreviated keys. Correlation was actually the most important
factor there. I think it might be generally true that it's the most
important factor, in practice more important even than capturing
sufficient entropy in the abbreviated key representation.

 I'm not sure about that. I'd prefer to have tuplesort (and one or two
 other sites) set the abbreviation is possible in principle flag.
 Otherwise, sortsupport needs to assume that the leading attribute is
 going to be the abbreviation-applicable one, which might not always be
 true. Still, it's not as if I feel strongly about it.

 When wouldn't that be true?

It just feels a bit wrong to me. There might be a future in which we
want to use the datum1 field for a non-leading attribute. For example,
when it is known ahead of time that there are low cardinality integers
in the leading key/attribute. Granted, that's pretty speculative, but
then it's not as if I'm insisting that it must be done that way. I
defer to you.

-- 
Peter Geoghegan


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] B-Tree support function number 3 (strxfrm() optimization)

2014-12-02 Thread Robert Haas

On Tue, Dec 2, 2014 at 4:21 PM, Peter Geoghegan p...@heroku.com wrote:
 I'm not sure about that. I'd prefer to have tuplesort (and one or two
 other sites) set the abbreviation is possible in principle flag.
 Otherwise, sortsupport needs to assume that the leading attribute is
 going to be the abbreviation-applicable one, which might not always be
 true. Still, it's not as if I feel strongly about it.

 When wouldn't that be true?

 It just feels a bit wrong to me. There might be a future in which we
 want to use the datum1 field for a non-leading attribute. For example,
 when it is known ahead of time that there are low cardinality integers
 in the leading key/attribute. Granted, that's pretty speculative, but
 then it's not as if I'm insisting that it must be done that way. I
 defer to you.

Well, maybe you should make the updates we've agreed on and I can take
another look at it.  But I didn't think that I was proposing to change
anything about the level at which the decision about whether to
abbreviate or not was made; rather, I thought I was suggesting that we
pass that flag down to the code that initializes the sortsupport
object as an argument rather than through the sortsupport structure
itself.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] B-Tree support function number 3 (strxfrm() optimization)

2014-12-02 Thread Peter Geoghegan

On Tue, Dec 2, 2014 at 2:07 PM, Robert Haas robertmh...@gmail.com wrote:
 Well, maybe you should make the updates we've agreed on and I can take
 another look at it.

Agreed.

 But I didn't think that I was proposing to change
 anything about the level at which the decision about whether to
 abbreviate or not was made; rather, I thought I was suggesting that we
 pass that flag down to the code that initializes the sortsupport
 object as an argument rather than through the sortsupport structure
 itself.

The flag I'm talking about concerns the *applicability* of
abbreviation, and not whether or not it will actually be used (maybe
the opclass lacks support, or decides not to for some platform
specific reason). Tuplesort has a contract with abbreviation +
sortsupport that considers whether or not the function pointer used to
abbreviate is set, which relates to whether or not abbreviation will
*actually* be used. Note that for non-abbreviation-applicable
attributes, btsortsupport_worker() never sets the function pointer
(nor, incidentally, does it set the other abbreviation related
function pointers in the struct).

-- 
Peter Geoghegan


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] B-Tree support function number 3 (strxfrm() optimization)

2014-12-02 Thread Robert Haas

On Tue, Dec 2, 2014 at 5:16 PM, Peter Geoghegan p...@heroku.com wrote:
 On Tue, Dec 2, 2014 at 2:07 PM, Robert Haas robertmh...@gmail.com wrote:
 Well, maybe you should make the updates we've agreed on and I can take
 another look at it.

 Agreed.

 But I didn't think that I was proposing to change
 anything about the level at which the decision about whether to
 abbreviate or not was made; rather, I thought I was suggesting that we
 pass that flag down to the code that initializes the sortsupport
 object as an argument rather than through the sortsupport structure
 itself.

 The flag I'm talking about concerns the *applicability* of
 abbreviation, and not whether or not it will actually be used (maybe
 the opclass lacks support, or decides not to for some platform
 specific reason). Tuplesort has a contract with abbreviation +
 sortsupport that considers whether or not the function pointer used to
 abbreviate is set, which relates to whether or not abbreviation will
 *actually* be used. Note that for non-abbreviation-applicable
 attributes, btsortsupport_worker() never sets the function pointer
 (nor, incidentally, does it set the other abbreviation related
 function pointers in the struct).

Right, and what I'm saying is that maybe the applicability flag
shouldn't be stored in the SortSupport object, but passed down as an
argument.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] B-Tree support function number 3 (strxfrm() optimization)

2014-12-02 Thread Peter Geoghegan

On Tue, Dec 2, 2014 at 2:21 PM, Robert Haas robertmh...@gmail.com wrote:
 Right, and what I'm saying is that maybe the applicability flag
 shouldn't be stored in the SortSupport object, but passed down as an
 argument.

But then how does that information get to any given sortsupport
routine? That's the place that really needs to know if abbreviation is
useful. In general, they're only passed a SortSupport object. Are you
suggesting revising the signature required of SortSupport routines to
add that extra flag as an additional argument?

-- 
Peter Geoghegan


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] B-Tree support function number 3 (strxfrm() optimization)

2014-12-02 Thread Tom Lane

Peter Geoghegan p...@heroku.com writes:
 On Tue, Dec 2, 2014 at 2:21 PM, Robert Haas robertmh...@gmail.com wrote:
 Right, and what I'm saying is that maybe the applicability flag
 shouldn't be stored in the SortSupport object, but passed down as an
 argument.

 But then how does that information get to any given sortsupport
 routine? That's the place that really needs to know if abbreviation is
 useful. In general, they're only passed a SortSupport object. Are you
 suggesting revising the signature required of SortSupport routines to
 add that extra flag as an additional argument?

I think that is what he's suggesting, and I too am wondering why it's
a good idea.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] B-Tree support function number 3 (strxfrm() optimization)

2014-12-02 Thread Peter Geoghegan

On Tue, Dec 2, 2014 at 2:16 PM, Peter Geoghegan p...@heroku.com wrote:
 On Tue, Dec 2, 2014 at 2:07 PM, Robert Haas robertmh...@gmail.com wrote:
 Well, maybe you should make the updates we've agreed on and I can take
 another look at it.

 Agreed.

Attached, revised patchset makes these updates. I continue to use the
sortsupport struct to convey that a given attribute of a given sort is
abbreviation-applicable (although the field is now a bool, not an
enum).

-- 
Peter Geoghegan
From 865a1abeb6bfb601b1ec605afb1e339c0e444e10 Mon Sep 17 00:00:00 2001
From: Peter Geoghegan p...@heroku.com
Date: Sun, 9 Nov 2014 14:38:44 -0800
Subject: [PATCH 2/2] Estimate total number of rows to be sorted

Sortsupport opclasses now accept a row hint, indicating the estimated
number of rows to be sorted.  This gives opclasses a sense of proportion
about how far along the copying of tuples is when considering aborting
abbreviation.

Estimates come from various sources.  The text opclass now always avoids
aborting abbreviation if the total number of rows to be sorted is high
enough, without considering cardinality at all.
---
 src/backend/access/nbtree/nbtree.c |  5 ++-
 src/backend/access/nbtree/nbtsort.c| 14 +-
 src/backend/commands/cluster.c |  4 +-
 src/backend/executor/nodeAgg.c |  5 ++-
 src/backend/executor/nodeSort.c|  1 +
 src/backend/utils/adt/orderedsetaggs.c |  2 +-
 src/backend/utils/adt/varlena.c| 80 --
 src/backend/utils/sort/tuplesort.c | 14 --
 src/include/access/nbtree.h|  2 +-
 src/include/utils/sortsupport.h|  7 ++-
 src/include/utils/tuplesort.h  |  6 +--
 11 files changed, 121 insertions(+), 19 deletions(-)

diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index d881525..d26c60b 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -109,14 +109,15 @@ btbuild(PG_FUNCTION_ARGS)
 		elog(ERROR, index \%s\ already contains data,
 			 RelationGetRelationName(index));
 
-	buildstate.spool = _bt_spoolinit(heap, index, indexInfo-ii_Unique, false);
+	buildstate.spool = _bt_spoolinit(heap, index, indexInfo-ii_Unique,
+	 indexInfo-ii_Predicate != NIL, false);
 
 	/*
 	 * If building a unique index, put dead tuples in a second spool to keep
 	 * them out of the uniqueness check.
 	 */
 	if (indexInfo-ii_Unique)
-		buildstate.spool2 = _bt_spoolinit(heap, index, false, true);
+		buildstate.spool2 = _bt_spoolinit(heap, index, false, true, true);
 
 	/* do the heap scan */
 	reltuples = IndexBuildHeapScan(heap, index, indexInfo, true,
diff --git a/src/backend/access/nbtree/nbtsort.c b/src/backend/access/nbtree/nbtsort.c
index 593571b..473ac54 100644
--- a/src/backend/access/nbtree/nbtsort.c
+++ b/src/backend/access/nbtree/nbtsort.c
@@ -73,6 +73,7 @@
 #include storage/smgr.h
 #include tcop/tcopprot.h
 #include utils/rel.h
+#include utils/selfuncs.h
 #include utils/sortsupport.h
 #include utils/tuplesort.h
 
@@ -149,10 +150,13 @@ static void _bt_load(BTWriteState *wstate,
  * create and initialize a spool structure
  */
 BTSpool *
-_bt_spoolinit(Relation heap, Relation index, bool isunique, bool isdead)
+_bt_spoolinit(Relation heap, Relation index, bool isunique, bool ispartial,
+			  bool isdead)
 {
 	BTSpool*btspool = (BTSpool *) palloc0(sizeof(BTSpool));
 	int			btKbytes;
+	double		estRows;
+	float4		relTuples;
 
 	btspool-heap = heap;
 	btspool-index = index;
@@ -165,10 +169,16 @@ _bt_spoolinit(Relation heap, Relation index, bool isunique, bool isdead)
 	 * unique index actually requires two BTSpool objects.  We expect that the
 	 * second one (for dead tuples) won't get very full, so we give it only
 	 * work_mem.
+	 *
+	 * Certain cases will always have a relTuples of 0, such as reindexing as
+	 * part of a CLUSTER operation, or when reindexing toast tables.  This is
+	 * interpreted as no estimate available.
 	 */
 	btKbytes = isdead ? work_mem : maintenance_work_mem;
+	relTuples = RelationGetForm(heap)-reltuples;
+	estRows =  relTuples * (isdead || ispartial ?  DEFAULT_INEQ_SEL : 1);
 	btspool-sortstate = tuplesort_begin_index_btree(heap, index, isunique,
-	 btKbytes, false);
+	 btKbytes, estRows, false);
 
 	return btspool;
 }
diff --git a/src/backend/commands/cluster.c b/src/backend/commands/cluster.c
index bc5f33f..8e5f536 100644
--- a/src/backend/commands/cluster.c
+++ b/src/backend/commands/cluster.c
@@ -890,7 +890,9 @@ copy_heap_data(Oid OIDNewHeap, Oid OIDOldHeap, Oid OIDOldIndex, bool verbose,
 	/* Set up sorting if wanted */
 	if (use_sort)
 		tuplesort = tuplesort_begin_cluster(oldTupDesc, OldIndex,
-			maintenance_work_mem, false);
+			maintenance_work_mem,
+			RelationGetForm(OldHeap)-reltuples,
+			false);
 	else
 		tuplesort = NULL;
 
diff --git a/src/backend/executor/nodeAgg.c b/src/backend/executor/nodeAgg.c
index 89de755..95143c3 100644
---

Re: [HACKERS] B-Tree support function number 3 (strxfrm() optimization)

2014-12-02 Thread Peter Geoghegan

On Tue, Dec 2, 2014 at 5:28 PM, Peter Geoghegan p...@heroku.com wrote:
 Attached, revised patchset makes these updates.

Whoops. Missed some obsolete comments. Here is a third commit that
makes a further small modification to one comment.

-- 
Peter Geoghegan
From 8d1aba80f95e05742047cba5bd83d8f17aa5ef37 Mon Sep 17 00:00:00 2001
From: Peter Geoghegan p...@heroku.com
Date: Tue, 2 Dec 2014 17:42:21 -0800
Subject: [PATCH 3/3] Alter comments to reflect current naming

---
 src/include/utils/sortsupport.h | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/src/include/utils/sortsupport.h b/src/include/utils/sortsupport.h
index 659233b..f7c73b3 100644
--- a/src/include/utils/sortsupport.h
+++ b/src/include/utils/sortsupport.h
@@ -127,9 +127,9 @@ typedef struct SortSupportData
 	 * Returning zero from the alternative comparator does not indicate
 	 * equality, as with a conventional support routine 1, though -- it
 	 * indicates that it wasn't possible to determine how the two abbreviated
-	 * values compared.  A proper comparison, using auth_comparator/
-	 * ApplySortComparatorFull() is therefore required.  In many cases this
-	 * results in most or all comparisons only using the cheap alternative
+	 * values compared.  A proper comparison, using abbrev_full_comparator/
+	 * ApplySortAbbrevFullComparator() is therefore required.  In many cases
+	 * this results in most or all comparisons only using the cheap alternative
 	 * comparison func, which is typically implemented as code that compiles to
 	 * just a few CPU instructions.  CPU cache miss penalties are expensive; to
 	 * get good overall performance, sort infrastructure must heavily weigh
-- 
1.9.1


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] B-Tree support function number 3 (strxfrm() optimization)

2014-11-25 Thread Robert Haas

On Sun, Nov 9, 2014 at 10:02 PM, Peter Geoghegan p...@heroku.com wrote:
 On Sat, Oct 11, 2014 at 6:34 PM, Peter Geoghegan p...@heroku.com wrote:
 Attached patch, when applied, accelerates all tuplesort cases using
 abbreviated keys, building on previous work here, as well as the patch
 posted to that other thread.

 I attach an updated patch set, rebased on top of the master branch's
 tip. All relevant tuplesort cases (B-Tree, MinimalTuple, CLUSTER) are
 now directly covered by the patch set, since there is now general
 sortsupport support for those cases in the master branch -- no need to
 apply some other patch from some other thread.

 For the convenience of reviewers, this new revision includes a new
 approach to making my improvements cumulative: A second commit adds
 tuple count estimation. This hint, passed along to the text opclass's
 convert routine, is taken from the optimizer's own estimate, or the
 relcache's reltuples, depending on the tuplesort case being
 accelerated. As in previous revisions, the idea is to give the opclass
 a sense of proportion about how far along it is, to be weighed in
 deciding whether or not to abort abbreviation. One potentially
 controversial aspect of that is how the text opclass abbreviation cost
 model/abort early stuff weighs simply having many tuples - past a
 certain point, it *always* proceeds with abbreviation, not matter what
 the cardinality of abbreviated keys so far is. For that reason it
 particular, it seemed to make sense to split these parts out into a
 second commit.

 I hope that we can finish up all 9.5 work on accelerated sorting soon.

There's a lot of stuff in this patch I'm still trying to digest, but
here are a few thoughts on patch 0001:

- This appears to needlessly reindent the comments for PG_CACHE_LINE_SIZE.

- I really don't think we need a #define in pg_config_manual.h for
this.  Please omit that.

- I'm much happier with the way the changes to sortsupport.h look in
this version.  However, I think that auth_comparator is a confusing
name, because auth is often used as an abbreviation for
authentication.   We can spell it out (authoritative_comparator) or
come up with a different name (backup_comparator?
abbrev_full_comparator?).  Whatever we do, ApplySortComparatorAuth()
should be renamed to match.

- Also, I don't think making abbrev_state an enumerated value with two
values is really doing anything for us; we could just use a Boolean.
I'm wondering if we should actually go a bit further and remove this
from the SortSupport object and instead add an additional Boolean flag
to PrepareSortSupportFrom(OrderingOp|IndexRel) that gets passed all
the way down to the opclass's sortsupport function.  It seems like
that might be more clear.  Once the opclass function has done its
thing, the other three new nembers are enough to know whether we're
using the optimization or not (and can be fiddled if we want to make a
later decision to call the whole thing off).

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] B-Tree support function number 3 (strxfrm() optimization)

2014-11-25 Thread Peter Geoghegan

On Tue, Nov 25, 2014 at 4:01 AM, Robert Haas robertmh...@gmail.com wrote:
 - This appears to needlessly reindent the comments for PG_CACHE_LINE_SIZE.

Actually, the word only is removed (because PG_CACHE_LINE_SIZE has a
new client now). So it isn't quite the same paragraph as before.

 - I really don't think we need a #define in pg_config_manual.h for
 this.  Please omit that.

You'd prefer to not offer a way to disable abbreviation? Okay. I guess
that makes sense - it should work well as a general optimization.

 - I'm much happier with the way the changes to sortsupport.h look in
 this version.  However, I think that auth_comparator is a confusing
 name, because auth is often used as an abbreviation for
 authentication.   We can spell it out (authoritative_comparator) or
 come up with a different name (backup_comparator?
 abbrev_full_comparator?).  Whatever we do, ApplySortComparatorAuth()
 should be renamed to match.

Okay.

 - Also, I don't think making abbrev_state an enumerated value with two
 values is really doing anything for us; we could just use a Boolean.
 I'm wondering if we should actually go a bit further and remove this
 from the SortSupport object and instead add an additional Boolean flag
 to PrepareSortSupportFrom(OrderingOp|IndexRel) that gets passed all
 the way down to the opclass's sortsupport function.  It seems like
 that might be more clear.  Once the opclass function has done its
 thing, the other three new nembers are enough to know whether we're
 using the optimization or not (and can be fiddled if we want to make a
 later decision to call the whole thing off).

I'm not sure about that. I'd prefer to have tuplesort (and one or two
other sites) set the abbreviation is possible in principle flag.
Otherwise, sortsupport needs to assume that the leading attribute is
going to be the abbreviation-applicable one, which might not always be
true. Still, it's not as if I feel strongly about it.


-- 
Peter Geoghegan


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] B-Tree support function number 3 (strxfrm() optimization)

2014-11-25 Thread Peter Geoghegan

On Tue, Nov 25, 2014 at 10:38 AM, Peter Geoghegan p...@heroku.com wrote:
 - Also, I don't think making abbrev_state an enumerated value with two
 values is really doing anything for us; we could just use a Boolean.
 I'm wondering if we should actually go a bit further and remove this
 from the SortSupport object and instead add an additional Boolean flag
 to PrepareSortSupportFrom(OrderingOp|IndexRel) that gets passed all
 the way down to the opclass's sortsupport function.  It seems like
 that might be more clear.  Once the opclass function has done its
 thing, the other three new nembers are enough to know whether we're
 using the optimization or not (and can be fiddled if we want to make a
 later decision to call the whole thing off).

 I'm not sure about that. I'd prefer to have tuplesort (and one or two
 other sites) set the abbreviation is possible in principle flag.

As for the related question of whether or not there should just be a
bool in place of an abbreviation state enum: I thought that we might
add some more flags to that enum (you'll recall that there actually
was another flag in earlier revisions, relating to optimistic
tie-breaks with memcmp() that the master branch now always does
anyway). But come to think of it, I think it's very unlikely that that
flag will ever be extended to represent some now unforeseen state
regarding abbreviation. It's either going to be abbreviation
applicable for this sort and this attribute or not applicable. So,
yes, let's make it a boolean instead.

As I think I mentioned at one point, I imagine that if and when we do
abbreviation with the numeric opclass, it won't ever abort - its
encoding strategy will adapt. That ought to be possible for certain
other datatypes, of which numeric is the best example. Although, I
think it's only going to be a handful of the most important datatypes
that get abbreviation support.
-- 
Peter Geoghegan


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] B-Tree support function number 3 (strxfrm() optimization)

2014-11-09 Thread Peter Geoghegan

On Sat, Oct 11, 2014 at 6:34 PM, Peter Geoghegan p...@heroku.com wrote:
 Attached patch, when applied, accelerates all tuplesort cases using
 abbreviated keys, building on previous work here, as well as the patch
 posted to that other thread.

I attach an updated patch set, rebased on top of the master branch's
tip. All relevant tuplesort cases (B-Tree, MinimalTuple, CLUSTER) are
now directly covered by the patch set, since there is now general
sortsupport support for those cases in the master branch -- no need to
apply some other patch from some other thread.

For the convenience of reviewers, this new revision includes a new
approach to making my improvements cumulative: A second commit adds
tuple count estimation. This hint, passed along to the text opclass's
convert routine, is taken from the optimizer's own estimate, or the
relcache's reltuples, depending on the tuplesort case being
accelerated. As in previous revisions, the idea is to give the opclass
a sense of proportion about how far along it is, to be weighed in
deciding whether or not to abort abbreviation. One potentially
controversial aspect of that is how the text opclass abbreviation cost
model/abort early stuff weighs simply having many tuples - past a
certain point, it *always* proceeds with abbreviation, not matter what
the cardinality of abbreviated keys so far is. For that reason it
particular, it seemed to make sense to split these parts out into a
second commit.

I hope that we can finish up all 9.5 work on accelerated sorting soon.
-- 
Peter Geoghegan
From 78d163220b774218170123208c238e0fe2c07eb6 Mon Sep 17 00:00:00 2001
From: Peter Geoghegan p...@heroku.com
Date: Sun, 9 Nov 2014 14:38:44 -0800
Subject: [PATCH 2/2] Estimate total number of rows to be sorted

Sortsupport opclasses now accept a row hint, indicating the estimated
number of rows to be sorted.  This gives opclasses a sense of proportion
about how far along the copying of tuples is when considering aborting
abbreviation.

Estimates come from various sources.  The text opclass now always avoids
aborting abbreviated if the total number of rows to be sorted is high
enough, regardless of anything else.
---
 src/backend/access/nbtree/nbtree.c |  5 ++-
 src/backend/access/nbtree/nbtsort.c| 14 +-
 src/backend/commands/cluster.c |  4 +-
 src/backend/executor/nodeAgg.c |  5 ++-
 src/backend/executor/nodeSort.c|  1 +
 src/backend/utils/adt/orderedsetaggs.c |  2 +-
 src/backend/utils/adt/varlena.c| 80 --
 src/backend/utils/sort/tuplesort.c | 14 --
 src/include/access/nbtree.h|  2 +-
 src/include/utils/sortsupport.h|  7 ++-
 src/include/utils/tuplesort.h  |  6 +--
 11 files changed, 121 insertions(+), 19 deletions(-)

diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index d881525..d26c60b 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -109,14 +109,15 @@ btbuild(PG_FUNCTION_ARGS)
 		elog(ERROR, index \%s\ already contains data,
 			 RelationGetRelationName(index));
 
-	buildstate.spool = _bt_spoolinit(heap, index, indexInfo-ii_Unique, false);
+	buildstate.spool = _bt_spoolinit(heap, index, indexInfo-ii_Unique,
+	 indexInfo-ii_Predicate != NIL, false);
 
 	/*
 	 * If building a unique index, put dead tuples in a second spool to keep
 	 * them out of the uniqueness check.
 	 */
 	if (indexInfo-ii_Unique)
-		buildstate.spool2 = _bt_spoolinit(heap, index, false, true);
+		buildstate.spool2 = _bt_spoolinit(heap, index, false, true, true);
 
 	/* do the heap scan */
 	reltuples = IndexBuildHeapScan(heap, index, indexInfo, true,
diff --git a/src/backend/access/nbtree/nbtsort.c b/src/backend/access/nbtree/nbtsort.c
index c60240f..dd57935 100644
--- a/src/backend/access/nbtree/nbtsort.c
+++ b/src/backend/access/nbtree/nbtsort.c
@@ -73,6 +73,7 @@
 #include storage/smgr.h
 #include tcop/tcopprot.h
 #include utils/rel.h
+#include utils/selfuncs.h
 #include utils/tuplesort.h
 
 
@@ -148,10 +149,13 @@ static void _bt_load(BTWriteState *wstate,
  * create and initialize a spool structure
  */
 BTSpool *
-_bt_spoolinit(Relation heap, Relation index, bool isunique, bool isdead)
+_bt_spoolinit(Relation heap, Relation index, bool isunique, bool ispartial,
+			  bool isdead)
 {
 	BTSpool*btspool = (BTSpool *) palloc0(sizeof(BTSpool));
 	int			btKbytes;
+	double		estRows;
+	float4		relTuples;
 
 	btspool-heap = heap;
 	btspool-index = index;
@@ -164,10 +168,16 @@ _bt_spoolinit(Relation heap, Relation index, bool isunique, bool isdead)
 	 * unique index actually requires two BTSpool objects.  We expect that the
 	 * second one (for dead tuples) won't get very full, so we give it only
 	 * work_mem.
+	 *
+	 * Certain cases will always have a relTuples of 0, such as reindexing as
+	 * part of a CLUSTER operation, or when reindexing toast tables.  This is
+	 * interpreted as no estimate available.
 	 */

Re: [HACKERS] B-Tree support function number 3 (strxfrm() optimization)

2014-10-11 Thread Peter Geoghegan

On Mon, Sep 29, 2014 at 10:34 PM, Peter Geoghegan p...@heroku.com wrote:
 single sortsupport state patch.

You probably noticed that I posted an independently useful patch to
make all tuplesort cases use sortsupport [1] - currently, both the
B-Tree and CLUSTER cases do not use the sortsupport infrastructure
more or less for no good reason. That patch was primarily written to
make abbreviated keys work for all cases, though. I think that making
heap tuple sorts based on a text attribute much faster is a very nice
thing, but in practice most OLTP or web applications are not all that
sensitive to the amount of time taken to sort heap tuples. However,
the length of time it takes to build indexes is something that most
busy production applications are very sensitive to, since of course
apart from the large consumption of system resources often required,
however long the sort takes is virtually the same amount of time as
however long we hold a very heavy, disruptive relation-level
ShareLock. Obviously the same applies to CLUSTER, but more so, since
it must acquire an AccessExclusiveLock on the relation to be
reorganized. I think almost everyone will agree that making B-Tree
builds much faster is the really compelling case here, because that's
where slow sorts cause by far the most pain for users in the real
world.

Attached patch, when applied, accelerates all tuplesort cases using
abbreviated keys, building on previous work here, as well as the patch
posted to that other thread. Exact instructions are in the commit
message of 0004-* (i.e. where to find the pieces I haven't posted
here). I also attach a minor bitrot fix commit/patch.

Performance is improved for B-Tree index builds by a great deal, too.
The improvements are only slightly less than those seen for comparable
heap tuple sorts (that is, my earlier test cases that had client
overhead removed). With larger sorts, that difference tends to get
lost in the noise easily.

I'm very encouraged by this. I think that being able to build B-Tree
indexes on text attributes very significantly faster than previous
versions of PostgreSQL is likely to be a significant feature for
PostgreSQL 9.5. After all, external sorts are where improvements are
most noticeable [2] - they're so much faster with this patch that
they're actually sometimes faster than internal sorts *with*
abbreviated keys. This would something that I found quite surprising.

[1] 
http://www.postgresql.org/message-id/CAM3SWZTfKZHTUiWDdHg+6tcYuMsdHoC=bmuaivgmp9athj4...@mail.gmail.com
[2] 
http://www.postgresql.org/message-id/cam3swzqvjcgme6ube-ydipu0n5bo7rmz31zrhmskdduynej...@mail.gmail.com
-- 
Peter Geoghegan
From 72070126e707cf356ddcb3de60cbdb14658ed077 Mon Sep 17 00:00:00 2001
From: Peter Geoghegan p...@heroku.com
Date: Fri, 10 Oct 2014 22:26:21 -0700
Subject: [PATCH 4/4] Make B-Tree/CLUSTER sortsupport use abbreviation

This commit is intended to be applied on top of several others.  First,
apply the abbreviated key support patch -- the revision posted here:

cam3swzt+v3llrbscyuj9jovumdpbq6bupf9wofbfuywgwuh...@mail.gmail.com.

Then, apply the bitrot-fixing commit that was attached alongside this
patch.

In addition, the B-Tree/CLUSTER sortsupport patch must then be applied.
Get it from:

CAM3SWZTfKZHTUiWDdHg+6tcYuMsdHoC=bmuaivgmp9athj4...@mail.gmail.com

Finally, apply this patch.

B-Tree sortsupport with abbreviated keys (for text) shows benefits (and
costs) that are similar to the heavily tested/benchmarked MinimalTuple
case.  There is additional overhead when building a B-Tree index as
compared to heap tuplesorts (with no client overhead), but even still
the vast majority of system time is spent actually sorting index tuples.
Therefore, it is not expected that the addition of B-Tree/CLUSTER
support will influence the review of abbreviated keys much either way.
---
 src/backend/access/nbtree/nbtsort.c |   3 +
 src/backend/utils/sort/tuplesort.c  | 385 +++-
 2 files changed, 298 insertions(+), 90 deletions(-)

diff --git a/src/backend/access/nbtree/nbtsort.c b/src/backend/access/nbtree/nbtsort.c
index a6c44a7..1ccc9fb 100644
--- a/src/backend/access/nbtree/nbtsort.c
+++ b/src/backend/access/nbtree/nbtsort.c
@@ -720,6 +720,9 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
 			strategy  = (scanKey-sk_flags  SK_BT_DESC) != 0 ?
 BTGreaterStrategyNumber : BTLessStrategyNumber;
 
+			/* Abbreviation is not supported here */
+			sortKey-abbrev_state = ABBREVIATED_KEYS_NO;
+
 			PrepareSortSupportFromIndexRel(wstate-index, strategy, sortKey);
 		}
 
diff --git a/src/backend/utils/sort/tuplesort.c b/src/backend/utils/sort/tuplesort.c
index 1d57de7..c6a18dc 100644
--- a/src/backend/utils/sort/tuplesort.c
+++ b/src/backend/utils/sort/tuplesort.c
@@ -453,6 +453,7 @@ struct Tuplesortstate
 
 static Tuplesortstate *tuplesort_begin_common(int workMem, bool randomAccess);
 static void puttuple_common(Tuplesortstate *state, SortTuple *tuple);
+static

Re: [HACKERS] B-Tree support function number 3 (strxfrm() optimization)

2014-09-25 Thread Robert Haas

On Wed, Sep 24, 2014 at 7:04 PM, Peter Geoghegan p...@heroku.com wrote:
 On Fri, Sep 19, 2014 at 2:54 PM, Peter Geoghegan p...@heroku.com wrote:
 Probably not - it appears to make very little difference to
 unoptimized pass-by-reference types whether or not datum1 can be used
 (see my simulation of Kevin's worst case, for example [1]). Streaming
 through a not inconsiderable proportion of memtuples again is probably
 a lot worse. The datum1 optimization (which is not all that old) made
 a lot of sense when initially introduced, because it avoided chasing
 through a pointer for pass-by-value types. I think that's its sole
 justification, though.

 Just to be clear -- I am blocked on this. Do you really prefer to
 restart copying heap tuples from scratch in the event of aborting,
 just to make sure that the datum1 representation is consistently
 either a pointer to text, or an abbreviated key? I don't think that
 the costs involved make that worth it, as I've said, but I'm not sure
 how to resolve that controversy.

 I suggest that we focus on making sure the abort logic itself is
 sound. There probably hasn't been enough discussion of that. Once that
 is resolved, we can revisit the question of whether or not copying
 should restart to keep the datum1 representation consistent. I suspect
 that leaving that until later will be easier all around.

The top issue on my agenda is figuring out a way to get rid of the
extra SortSupport object.  I'm not going to commit any version of this
patch that uses a second SortSupport for the tiebreak.  I doubt anyone
else will like that either, but you can try.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] B-Tree support function number 3 (strxfrm() optimization)

2014-09-25 Thread Peter Geoghegan

On Thu, Sep 25, 2014 at 9:21 AM, Robert Haas robertmh...@gmail.com wrote:
 The top issue on my agenda is figuring out a way to get rid of the
 extra SortSupport object.

Really? I'm surprised. Clearly the need to restart heap tuple copying
from scratch, in order to make the datum1 representation consistent,
rather than abandoning datum1 for storing abbreviated keys or pointers
entirely is a very important aspect of whether or not we should change
that. In turn, that's something that's going to (probably
significantly) affect the worst case.

Do you have an opinion on that? If you want me to start from scratch,
and then have a consistent datum1 representation, and then be able to
change the structure of comparetup_heap() as you outline (so as to get
rid of the extra SortSupport object), I can do that. My concern is the
regression. The datum1 pointer optimization appears to matter very
little for pass by value types (traditionally, before abbreviated
keys), and so I have a hard time imagining this working out.

-- 
Peter Geoghegan


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] B-Tree support function number 3 (strxfrm() optimization)

2014-09-25 Thread Robert Haas

On Thu, Sep 25, 2014 at 2:05 PM, Peter Geoghegan p...@heroku.com wrote:
 On Thu, Sep 25, 2014 at 9:21 AM, Robert Haas robertmh...@gmail.com wrote:
 The top issue on my agenda is figuring out a way to get rid of the
 extra SortSupport object.

 Really? I'm surprised. Clearly the need to restart heap tuple copying
 from scratch, in order to make the datum1 representation consistent,
 rather than abandoning datum1 for storing abbreviated keys or pointers
 entirely is a very important aspect of whether or not we should change
 that. In turn, that's something that's going to (probably
 significantly) affect the worst case.

 Do you have an opinion on that?

I haven't looked at that part of the patch in detail yet, so... not
really.  But I don't see why you'd ever need to restart heap tuple
copying.  At most you'd need to re-extract datum1 from the tuples you
have already copied.  To find out how much that optimization buys, you
should use tuples with many variable-length columns (say, 50)
preceding the text column you're sorting on. I won't be surprised if
that turns out to be expensive enough to be worth worrying about, but
I have not benchmarked it.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] B-Tree support function number 3 (strxfrm() optimization)

2014-09-25 Thread Peter Geoghegan

On Thu, Sep 25, 2014 at 11:53 AM, Robert Haas robertmh...@gmail.com wrote:
 I haven't looked at that part of the patch in detail yet, so... not
 really.  But I don't see why you'd ever need to restart heap tuple
 copying.  At most you'd need to re-extract datum1 from the tuples you
 have already copied.

Well, okay, technically it's not restarting heap tuple copying, but
it's about the same thing. The point is that you have to stream a
significant chunk of the memtuples array through memory again. Not to
mention, having infrastructure to do that, and pick up where we left
off, which is significantly more code, all to make comparetup_heap() a
bit clearer (i.e. making it so that it won't have to think about an
extra sortsupport state).

 To find out how much that optimization buys, you
 should use tuples with many variable-length columns (say, 50)
 preceding the text column you're sorting on. I won't be surprised if
 that turns out to be expensive enough to be worth worrying about, but
 I have not benchmarked it.

Sorry, but I don't follow. I don't think the pertinent question is if
it's a noticeable cost. I think the pertinent question is if it's
worth it. Doing something about it necessitates a lot of extra memory
access. Not doing something about it hardly affects the amount of
memory access required, perhaps not at all.

-- 
Peter Geoghegan


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] B-Tree support function number 3 (strxfrm() optimization)

2014-09-25 Thread Robert Haas

On Thu, Sep 25, 2014 at 3:17 PM, Peter Geoghegan p...@heroku.com wrote:
 To find out how much that optimization buys, you
 should use tuples with many variable-length columns (say, 50)
 preceding the text column you're sorting on. I won't be surprised if
 that turns out to be expensive enough to be worth worrying about, but
 I have not benchmarked it.

 Sorry, but I don't follow. I don't think the pertinent question is if
 it's a noticeable cost. I think the pertinent question is if it's
 worth it. Doing something about it necessitates a lot of extra memory
 access. Not doing something about it hardly affects the amount of
 memory access required, perhaps not at all.

I think you're mincing words.  If you go back and fix datum1, you'll
spend a bunch of effort doing that.If you don't, you'll pay a cost
on every comparison to re-find the relevant column inside each tuple.
You can compare those costs in a variety of cases, including the one I
mentioned, where the latter cost will be relatively high.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] B-Tree support function number 3 (strxfrm() optimization)

2014-09-24 Thread Peter Geoghegan

On Fri, Sep 19, 2014 at 2:54 PM, Peter Geoghegan p...@heroku.com wrote:
 Probably not - it appears to make very little difference to
 unoptimized pass-by-reference types whether or not datum1 can be used
 (see my simulation of Kevin's worst case, for example [1]). Streaming
 through a not inconsiderable proportion of memtuples again is probably
 a lot worse. The datum1 optimization (which is not all that old) made
 a lot of sense when initially introduced, because it avoided chasing
 through a pointer for pass-by-value types. I think that's its sole
 justification, though.


Just to be clear -- I am blocked on this. Do you really prefer to
restart copying heap tuples from scratch in the event of aborting,
just to make sure that the datum1 representation is consistently
either a pointer to text, or an abbreviated key? I don't think that
the costs involved make that worth it, as I've said, but I'm not sure
how to resolve that controversy.

I suggest that we focus on making sure the abort logic itself is
sound. There probably hasn't been enough discussion of that. Once that
is resolved, we can revisit the question of whether or not copying
should restart to keep the datum1 representation consistent. I suspect
that leaving that until later will be easier all around.

-- 
Peter Geoghegan


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] B-Tree support function number 3 (strxfrm() optimization)

2014-09-19 Thread Robert Haas

On Tue, Sep 16, 2014 at 4:55 PM, Peter Geoghegan p...@heroku.com wrote:
 On Tue, Sep 16, 2014 at 1:45 PM, Robert Haas robertmh...@gmail.com wrote:
 Even though our testing seems to indicate that the memcmp() is
 basically free, I think it would be good to make the effort to avoid
 doing memcmp() and then strcoll() and then strncmp().  Seems like it
 shouldn't be too hard.

 Really? The tie-breaker for the benefit of locales like hu_HU uses
 strcmp(), not memcmp(). It operates on the now-terminated copies of
 strings. There is no reason to think that the strings must be the same
 size for that strcmp(). I'd rather only do the new opportunistic
 memcmp() == 0 thing when len1 == len2. And I wouldn't like to have
 to also figure out that it's safe to use the earlier result, because
 as it happens len1 == len2, or any other such trickery.

OK, good point.  So committed as-is, then, except that I rewrote the
comments, which I felt were excessively long for the amount of code.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] B-Tree support function number 3 (strxfrm() optimization)

2014-09-19 Thread Peter Geoghegan

On Fri, Sep 19, 2014 at 9:59 AM, Robert Haas robertmh...@gmail.com wrote:
 OK, good point.  So committed as-is, then, except that I rewrote the
 comments, which I felt were excessively long for the amount of code.

Thanks!

I look forward to hearing your thoughts on the open issues with the
patch as a whole.
-- 
Peter Geoghegan


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] B-Tree support function number 3 (strxfrm() optimization)

2014-09-19 Thread Robert Haas

On Thu, Sep 11, 2014 at 8:34 PM, Peter Geoghegan p...@heroku.com wrote:
 On Tue, Sep 9, 2014 at 2:25 PM, Robert Haas robertmh...@gmail.com wrote:
 I like that I don't have to care about every combination, and can
 treat abbreviation abortion as the special case with the extra step,
 in line with how I think of the optimization conceptually. Does that
 make sense?

 No.  comparetup_heap() is hip-deep in this optimization as it stands,
 and what I proposed - if done correctly - isn't going to make that
 significantly worse.  In fact, it really ought to make things better:
 you should be able to set things up so that ssup-comparator is always
 the test that should be applied first, regardless of whether we're
 aborted or not-aborted or not doing this in the first place; and then
 ssup-tiebreak_comparator, if not NULL, can be applied after that.

 I'm not following here. Isn't that at least equally true of what I've
 proposed? Sure, I'm checking if (!state-abbrevAborted) first, but
 that's irrelevant to the non-abbreviated case. It has nothing to
 abort. Also, AFAICT we cannot abort and still call ssup-comparator()
 indifferently, since sorttuple.datum1 fields are perhaps abbreviated
 keys in half of all cases (i.e. pre-abort tuples), and uninitialized
 garbage the other half of the time (i.e. post-abort tuples).

You can if you engineer ssup-comparator() to contain the right
pointer at the right time.  Also, shouldn't you go back and fix up
those abbreviated keys to point to datum1 again if you abort?

 Where is the heap_getattr() stuff supposed to happen for the first
 attribute to get an authoritative comparison in the event of aborting
 (if we're calling ssup-comparator() on datum1 indifferently)? We
 decide that we're going to use abbreviated keys within datum1 fields
 up-front. When we abort, we cannot use datum1 fields at all (which
 doesn't appear to matter for text -- the datum1 optimization has
 historically only benefited pass-by-value types).

You always pass datum1 to a function.  The code doesn't need to care
about whether that function is expecting abbreviated or
non-abbreviated datums unless that function returns equality.  Then it
needs to think about calling a backup comparator if there is one.

 Is doing all this worth the small saving in memory? Personally, I
 don't think that it is, but I defer to you.

I don't care about the memory; I care about the code complexity and
the ease of understanding that code.  I am confident that it can be
done better than the patch does it today.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] B-Tree support function number 3 (strxfrm() optimization)

2014-09-19 Thread Peter Geoghegan

On Fri, Sep 19, 2014 at 2:35 PM, Robert Haas robertmh...@gmail.com wrote:
 Also, shouldn't you go back and fix up
 those abbreviated keys to point to datum1 again if you abort?

Probably not - it appears to make very little difference to
unoptimized pass-by-reference types whether or not datum1 can be used
(see my simulation of Kevin's worst case, for example [1]). Streaming
through a not inconsiderable proportion of memtuples again is probably
a lot worse. The datum1 optimization (which is not all that old) made
a lot of sense when initially introduced, because it avoided chasing
through a pointer for pass-by-value types. I think that's its sole
justification, though.

BTW, I think that if we ever get around to doing this for numeric, it
won't ever abort. The abbreviation strategy can be adaptive, to
maximize the number of comparisons successfully resolved with
abbreviated keys. This would probably use a streaming algorithm like
HyperLogLog too.

[1] 
http://www.postgresql.org/message-id/cam3swzqhjxiyrsqbs5w3u-vtj_jt2hp8o02big5wyb4m9lp...@mail.gmail.com
-- 
Peter Geoghegan


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] B-Tree support function number 3 (strxfrm() optimization)

2014-09-16 Thread Robert Haas

On Mon, Sep 15, 2014 at 7:21 PM, Peter Geoghegan p...@heroku.com wrote:
 On Mon, Sep 15, 2014 at 11:25 AM, Peter Geoghegan p...@heroku.com wrote:
 OK, I'll draft a patch for that today, including similar alterations
 to varstr_cmp() for the benefit of Windows and so on.

 I attach a much simpler patch, that only adds an opportunistic
 memcmp() == 0 before a possible strcoll().  Both
 bttextfastcmp_locale() and varstr_cmp() have the optimization added,
 since there is no point in leaving anyone out for this part.

Even though our testing seems to indicate that the memcmp() is
basically free, I think it would be good to make the effort to avoid
doing memcmp() and then strcoll() and then strncmp().  Seems like it
shouldn't be too hard.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] B-Tree support function number 3 (strxfrm() optimization)

2014-09-16 Thread Peter Geoghegan

On Tue, Sep 16, 2014 at 1:45 PM, Robert Haas robertmh...@gmail.com wrote:
 Even though our testing seems to indicate that the memcmp() is
 basically free, I think it would be good to make the effort to avoid
 doing memcmp() and then strcoll() and then strncmp().  Seems like it
 shouldn't be too hard.

Really? The tie-breaker for the benefit of locales like hu_HU uses
strcmp(), not memcmp(). It operates on the now-terminated copies of
strings. There is no reason to think that the strings must be the same
size for that strcmp(). I'd rather only do the new opportunistic
memcmp() == 0 thing when len1 == len2. And I wouldn't like to have
to also figure out that it's safe to use the earlier result, because
as it happens len1 == len2, or any other such trickery.

The bug fix that added the strcmp() tie-breaker was committed in 2005.
PostgreSQL had locale support for something like 8 years prior, and it
took that long for us to notice the problem. I would suggest that
makes the case for doing anything else pretty marginal. In the bug
report at the time, len1 != len2 anyway.

-- 
Peter Geoghegan


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] B-Tree support function number 3 (strxfrm() optimization)

2014-09-15 Thread Heikki Linnakangas


On 09/14/2014 11:34 PM, Peter Geoghegan wrote:

On Sun, Sep 14, 2014 at 7:37 AM, Heikki Linnakangas
hlinnakan...@vmware.com wrote:

Both values vary in range 5.9 - 6.1 s, so it's fair to say that the useless
memcmp() is free with these parameters.

Is this the worst case scenario?


Other than pushing the differences much much later in the strings
(which you surely thought of already), yes.


Please test the absolute worst case scenario you can think of. As I said 
earlier, if you can demonstrate that the slowdown of that is acceptable, 
we don't even need to discuss how likely or realistic the case is.


- Heikki



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] B-Tree support function number 3 (strxfrm() optimization)

2014-09-15 Thread Robert Haas

On Sun, Sep 14, 2014 at 10:37 AM, Heikki Linnakangas
hlinnakan...@vmware.com wrote:
 On 09/13/2014 11:28 PM, Peter Geoghegan wrote:
 Anyway, attached rough test program implements what you outline. This
 is for 30,000 32 byte strings (where just the final two bytes differ).
 On my laptop, output looks like this (edited to only show median
 duration in each case):

 Got to be careful to not let the compiler optimize away microbenchmarks like
 this. At least with my version of gcc, the strcoll calls get optimized away,
 as do the memcmp calls, if you don't use the result for anything. Clang was
 even more aggressive; it ran both comparisons in 0.0 seconds. Apparently it
 optimizes away the loops altogether.

 Also, there should be a setlocale(LC_ALL, ) call somewhere. Otherwise it
 runs in C locale, and we don't use strcoll() at all for C locale.

 After fixing those, it runs much slower, so I had to reduce the number of
 strings. Here's a fixed program. I'm now getting numbers like this:

 (baseline) duration of comparisons without useless memcmp()s: 6.007368
 seconds

 duration of comparisons with useless memcmp()s: 6.079826 seconds

 Both values vary in range 5.9 - 6.1 s, so it's fair to say that the useless
 memcmp() is free with these parameters.

 Is this the worst case scenario?

I can't see a worse one, and I replicated your results here on my
MacBook Pro.  I also tried with 1MB strings and, surprisingly, it was
basically free there, too.

It strikes me that perhaps we should make this change (rearranging
things so that the memcmp tiebreak is run before strcoll) first,
before dealing with the rest of the abbreviated keys infrastructure.
It appears to be a separate improvement which is worthwhile
independently of what we do about that patch.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] B-Tree support function number 3 (strxfrm() optimization)

2014-09-15 Thread Peter Geoghegan

On Mon, Sep 15, 2014 at 10:17 AM, Robert Haas robertmh...@gmail.com wrote:
 It strikes me that perhaps we should make this change (rearranging
 things so that the memcmp tiebreak is run before strcoll) first,
 before dealing with the rest of the abbreviated keys infrastructure.
 It appears to be a separate improvement which is worthwhile
 independently of what we do about that patch.

I guess we could do that, but AFAICT the only open item blocking the
commit of a basic version of abbreviated keys (the informally agreed
to basic version lacking support for single-attribute aggregates) is
what to do about the current need to create a separate sortsupport
state. I've talked about my thoughts on that question in detail now
[1].

BTW, you probably realize this, but we still need a second memcmp()
after strcoll() too. hu_HU will care about that [2].

[1] 
http://www.postgresql.org/message-id/cam3swzqcdcnfwd3qzoo4qmy4g8okhuqyrd26bbla7fl2x-n...@mail.gmail.com
[2] 
http://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=656beff59033ccc5261a615802e1a85da68e8fad
-- 
Peter Geoghegan


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] B-Tree support function number 3 (strxfrm() optimization)

2014-09-15 Thread Robert Haas

On Mon, Sep 15, 2014 at 1:34 PM, Peter Geoghegan p...@heroku.com wrote:
 On Mon, Sep 15, 2014 at 10:17 AM, Robert Haas robertmh...@gmail.com wrote:
 It strikes me that perhaps we should make this change (rearranging
 things so that the memcmp tiebreak is run before strcoll) first,
 before dealing with the rest of the abbreviated keys infrastructure.
 It appears to be a separate improvement which is worthwhile
 independently of what we do about that patch.

 I guess we could do that, but AFAICT the only open item blocking the
 commit of a basic version of abbreviated keys (the informally agreed
 to basic version lacking support for single-attribute aggregates) is
 what to do about the current need to create a separate sortsupport
 state. I've talked about my thoughts on that question in detail now
 [1].

I think there's probably more than that to work out, but in any case
there's no harm in getting a simple optimization done first before
moving on to a complicated one.

 BTW, you probably realize this, but we still need a second memcmp()
 after strcoll() too. hu_HU will care about that [2].

 [1] 
 http://www.postgresql.org/message-id/cam3swzqcdcnfwd3qzoo4qmy4g8okhuqyrd26bbla7fl2x-n...@mail.gmail.com
 [2] 
 http://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=656beff59033ccc5261a615802e1a85da68e8fad

I rather assume we could reuse the results of the first memcmp()
instead of doing it again.

x = memcmp();
if (x == 0)
   return x;
y = strcoll();
if (y == 0)
   return x;
return y;

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] B-Tree support function number 3 (strxfrm() optimization)

2014-09-15 Thread Peter Geoghegan

On Mon, Sep 15, 2014 at 10:53 AM, Robert Haas robertmh...@gmail.com wrote:
 I think there's probably more than that to work out, but in any case
 there's no harm in getting a simple optimization done first before
 moving on to a complicated one.

I guess we never talked about the abort logic in all that much detail.
I suppose there's that, too.

 I rather assume we could reuse the results of the first memcmp()
 instead of doing it again.

 x = memcmp();
 if (x == 0)
return x;
 y = strcoll();
 if (y == 0)
return x;
 return y;

Of course, but you know what I mean. (I'm sure the compiler will
realize this if the programmer doesn't)

-- 
Peter Geoghegan


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] B-Tree support function number 3 (strxfrm() optimization)

2014-09-15 Thread Robert Haas

On Mon, Sep 15, 2014 at 1:55 PM, Peter Geoghegan p...@heroku.com wrote:
 On Mon, Sep 15, 2014 at 10:53 AM, Robert Haas robertmh...@gmail.com wrote:
 I think there's probably more than that to work out, but in any case
 there's no harm in getting a simple optimization done first before
 moving on to a complicated one.

 I guess we never talked about the abort logic in all that much detail.
 I suppose there's that, too.

Well, the real point is that from where I'm sitting, this...

 x = memcmp();
 if (x == 0)
return x;
 y = strcoll();
 if (y == 0)
return x;
 return y;

...looks like about a 10-line patch.  We have the data to show that
the loss is trivial even in the worst case, and we have or should be
able to get data showing that the best-case win is significant even
without the abbreviated key stuff.  If you'd care to draft a patch for
just that, I assume we could get it committed in a day or two, whereas
I'm quite sure that considerably more work than that remains for the
main patch.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] B-Tree support function number 3 (strxfrm() optimization)

2014-09-15 Thread Peter Geoghegan

On Mon, Sep 15, 2014 at 11:20 AM, Robert Haas robertmh...@gmail.com wrote:
 ...looks like about a 10-line patch.  We have the data to show that
 the loss is trivial even in the worst case, and we have or should be
 able to get data showing that the best-case win is significant even
 without the abbreviated key stuff.  If you'd care to draft a patch for
 just that, I assume we could get it committed in a day or two, whereas
 I'm quite sure that considerably more work than that remains for the
 main patch.

OK, I'll draft a patch for that today, including similar alterations
to varstr_cmp() for the benefit of Windows and so on.

-- 
Peter Geoghegan


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] B-Tree support function number 3 (strxfrm() optimization)

2014-09-15 Thread Peter Geoghegan

On Mon, Sep 15, 2014 at 11:25 AM, Peter Geoghegan p...@heroku.com wrote:
 OK, I'll draft a patch for that today, including similar alterations
 to varstr_cmp() for the benefit of Windows and so on.

I attach a much simpler patch, that only adds an opportunistic
memcmp() == 0 before a possible strcoll().  Both
bttextfastcmp_locale() and varstr_cmp() have the optimization added,
since there is no point in leaving anyone out for this part.

When this is committed, and I hear back from you on the question of
what to do about having an independent sortsupport state for
abbreviated tie-breakers (and possibly other issues of concern), I'll
produce a rebased patch with a single commit.

-- 
Peter Geoghegan
diff --git a/src/backend/utils/adt/varlena.c b/src/backend/utils/adt/varlena.c
new file mode 100644
index 7549542..53be8ec
*** a/src/backend/utils/adt/varlena.c
--- b/src/backend/utils/adt/varlena.c
*** varstr_cmp(char *arg1, int len1, char *a
*** 1405,1410 
--- 1405,1438 
  #endif
  		}
  
+ 		/*
+ 		 * Fast path:  Attempt an entirely opportunistic memcmp() == 0.
+ 		 *
+ 		 * In general there is no reason to be optimistic about being able to
+ 		 * resolve the string comparison in this way.  There are many usage
+ 		 * patterns in which this fast path will never be taken, and some in
+ 		 * which it will frequently be taken.  When it is frequently taken,
+ 		 * clearly the optimization is worthwhile, since strcoll() is known to
+ 		 * be very expensive.  When it is not taken, the overhead is completely
+ 		 * negligible, and perhaps even immeasurably small, even in the worst
+ 		 * case.  We have nothing to lose and everything to gain.
+ 		 *
+ 		 * These counter-intuitive performance characteristics are likely
+ 		 * explained by the dominant memory latency cost when performing a
+ 		 * strcoll();  in practice, this may give the compiler and CPU
+ 		 * sufficient leeway to order things in a way that hides the overhead
+ 		 * of useless memcmp() calls.  Sorting is very probably bottlenecked on
+ 		 * memory bandwidth, and so it's worth it to use a few arithmetic
+ 		 * instructions on the off chance that it results in a saving in memory
+ 		 * bandwidth.  In effect, this makes use of a capacity that would
+ 		 * otherwise go unused.  The same memory needed to be read into CPU
+ 		 * cache at this juncture anyway.  Modern CPUs can execute hundreds of
+ 		 * instructions in the time taken to fetch a single cache line from
+ 		 * main memory.
+ 		 */
+ 		if (len1 == len2  memcmp(arg1, arg2, len1) == 0)
+ 			return 0;
+ 
  #ifdef WIN32
  		/* Win32 does not have UTF-8, so we need to map to UTF-16 */
  		if (GetDatabaseEncoding() == PG_UTF8)
*** bttextfastcmp_locale(Datum x, Datum y, S
*** 1842,1847 
--- 1870,1885 
  	len1 = VARSIZE_ANY_EXHDR(arg1);
  	len2 = VARSIZE_ANY_EXHDR(arg2);
  
+ 	/*
+ 	 * Fast path:  Attempt an entirely opportunistic memcmp() == 0, as
+ 	 * explained within varstr_cmp()
+ 	 */
+ 	if (len1 == len2  memcmp(a1p, a2p, len1) == 0)
+ 	{
+ 		result = 0;
+ 		goto done;
+ 	}
+ 
  	if (len1 = tss-buflen1)
  	{
  		pfree(tss-buf1);
*** bttextfastcmp_locale(Datum x, Datum y, S
*** 1875,1880 
--- 1913,1919 
  	if (result == 0)
  		result = strcmp(tss-buf1, tss-buf2);
  
+ done:
  	/* We can't afford to leak memory here. */
  	if (PointerGetDatum(arg1) != x)
  		pfree(arg1);

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] B-Tree support function number 3 (strxfrm() optimization)

2014-09-15 Thread Peter Geoghegan

On Mon, Sep 15, 2014 at 4:21 PM, Peter Geoghegan p...@heroku.com wrote:
 I attach a much simpler patch, that only adds an opportunistic
 memcmp() == 0 before a possible strcoll().  Both
 bttextfastcmp_locale() and varstr_cmp() have the optimization added,
 since there is no point in leaving anyone out for this part.

FWIW, it occurs to me that this could be a big win for cases like
ExecMergeJoin(). Obviously, abbreviated keys will usually make your
merge join on text attributes a lot faster in the common case where a
sort is involved (if we consider that the sort is integral to the cost
of the join). However, when making sure that inner and outer tuples
match, the MJCompare() call will *very* frequently get away with a
cheap memcmp().  That could make a very significant additional
difference, I think.

-- 
Peter Geoghegan


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] B-Tree support function number 3 (strxfrm() optimization)

2014-09-14 Thread Heikki Linnakangas


On 09/13/2014 11:28 PM, Peter Geoghegan wrote:

Anyway, attached rough test program implements what you outline. This
is for 30,000 32 byte strings (where just the final two bytes differ).
On my laptop, output looks like this (edited to only show median
duration in each case):


Got to be careful to not let the compiler optimize away microbenchmarks 
like this. At least with my version of gcc, the strcoll calls get 
optimized away, as do the memcmp calls, if you don't use the result for 
anything. Clang was even more aggressive; it ran both comparisons in 0.0 
seconds. Apparently it optimizes away the loops altogether.


Also, there should be a setlocale(LC_ALL, ) call somewhere. Otherwise 
it runs in C locale, and we don't use strcoll() at all for C locale.


After fixing those, it runs much slower, so I had to reduce the number 
of strings. Here's a fixed program. I'm now getting numbers like this:


(baseline) duration of comparisons without useless memcmp()s: 6.007368 
seconds


duration of comparisons with useless memcmp()s: 6.079826 seconds

Both values vary in range 5.9 - 6.1 s, so it's fair to say that the 
useless memcmp() is free with these parameters.


Is this the worst case scenario?

- Heikki

#include ctype.h
#include stdio.h
#include stdlib.h
#include string.h
#include sys/time.h
#include locale.h

/* STRING_SIZE does not include NUL byte */
#define STRING_SIZE		32
#define N_STRINGS		2000
#define LAST_N_DIFFER	2

#define INSTR_TIME_SUBTRACT(x,y) \
	do { \
		(x).tv_sec -= (y).tv_sec; \
		(x).tv_usec -= (y).tv_usec; \
		/* Normalize */ \
		while ((x).tv_usec  0) \
		{ \
			(x).tv_usec += 100; \
			(x).tv_sec--; \
		} \
	} while (0)

#define INSTR_TIME_GET_DOUBLE(t) \
	(((double) (t).tv_sec) + ((double) (t).tv_usec) / 100.0)

#define Min(x, y)		((x)  (y) ? (x) : (y))

/*
 * Generate single random alphabetical ASCII char.  Uses an unseeded rand()
 * call, to ensure inter-run determinism.
 */
static inline char
generate_prandom_printable_char(void)
{
	for (;;)
	{
		char crand = rand()  0x7F;

		if ((crand = 'A'  crand = 'Z') || (crand = 'a'  crand = 'z'))
			return crand;
	}
}

/*
 * Generate a random payload string.  The returned string is pre-calcuated
 * once, to form the bulk of each distinct string.
 */
static inline char *
generate_prandom_payload()
{
	char   *payload = malloc(STRING_SIZE + 1);
	int		i;

	/*
	 * The final LAST_N_DIFFER bytes will ultimately be clobbered with distinct
	 * bytes, but that occurs separately, per string.
	 */
	for (i = 0; i  STRING_SIZE; i++)
		payload[i] = generate_prandom_printable_char();

	/*
	 * Add NUL here, even though complete strings are not NULL terminated (or,
	 * rather, are copied into a buffer on the stack in order to add a NUL once
	 * per comparison)
	 */
	payload[STRING_SIZE] = '\0';

	return payload;
}

int
main(int argc, const char *argv[])
{
	char  **strings = malloc(N_STRINGS * sizeof(char *));
	char   *payload = generate_prandom_payload();
	int		i, j;
	int		nmatches = 0;

	setlocale(LC_ALL, );

	/* Initialize strings */
	for (i = 0; i  N_STRINGS; i++)
	{
		int		j;

		strings[i] = malloc(STRING_SIZE);

		memcpy(strings[i], payload, STRING_SIZE);

		/* Last LAST_N_DIFFER characters randomly vary */
		for (j = LAST_N_DIFFER; j  0; j--)
		{
			char 	n_last = generate_prandom_printable_char();

			strings[i][STRING_SIZE - j] = n_last;
		}

		/* Don't terminate -- no room in buffer */
	}

	printf(Strings generated - beginning tests\n);

	for (;;)
	{
		struct timeval before, after;

		/*
		 **
		 * Baseline -- no wasted memcmp().
		 *
		 * Compare each string to each other string using only strcoll()
		 **
		 */
		gettimeofday(before, NULL);
		for (i = 0; i  N_STRINGS; i++)
		{
			char   *str = strings[i];

			/* Quadratic string comparison */
			for (j = 0; j  N_STRINGS; j++)
			{
int		cmp;
char	buf1[STRING_SIZE + 1];
char 	buf2[STRING_SIZE + 1];
char   *other;

other = strings[j];

memcpy(buf1, str, STRING_SIZE);
memcpy(buf2, other, STRING_SIZE);

buf1[STRING_SIZE] = '\0';
buf2[STRING_SIZE] = '\0';

cmp = strcoll(buf1, buf2);

if (cmp == 0)
  nmatches++;

/*
 * memcmp() tie-breaker (equivalent to the one that currently
 * appears within varstr_cmp()) isn't considered.  It's only
 * relevant to a small number of locales, like hu_HU, where
 * strcoll() might (rarely) indicate equality in respect of a
 * pair of non-identical strings.  If strcoll() did return 0,
 * then for most locales it's certain that the first memcmp()
 * would have worked out.
 */
			}
		}
		/* Report no-wasted-memcmp case duration */
		gettimeofday(after, NULL);
		INSTR_TIME_SUBTRACT(after, before);
		printf((baseline) duration of comparisons without useless memcmp()s: %f seconds\n\n,
			   INSTR_TIME_GET_DOUBLE(after));

		/*

Re: [HACKERS] B-Tree support function number 3 (strxfrm() optimization)

2014-09-14 Thread Peter Geoghegan

On Sun, Sep 14, 2014 at 7:37 AM, Heikki Linnakangas
hlinnakan...@vmware.com wrote:
 Got to be careful to not let the compiler optimize away microbenchmarks like
 this. At least with my version of gcc, the strcoll calls get optimized away,
 as do the memcmp calls, if you don't use the result for anything. Clang was
 even more aggressive; it ran both comparisons in 0.0 seconds. Apparently it
 optimizes away the loops altogether.

I suppose the fact that I saw results that fit my pre-conceived notion
of what was happening made me lose my initial concern about that.

 Also, there should be a setlocale(LC_ALL, ) call somewhere. Otherwise it
 runs in C locale, and we don't use strcoll() at all for C locale.

Oops. This might be a useful mistake, though -- if the strcoll() using
the C locale is enough to make the memcmp() not free, then that
suggests that strcoll() is the hiding place for the useless
memcmp(), where instructions relating to the memcmp() can execute in
parallel to instructions relating to strcoll() that add latency from
memory accesses (for non-C locales). With the C locale, strcoll() is
equivalent to strcmp()/memcmp().

Commenting out the setlocale(LC_ALL, ) in your revised versions
shows something like my original numbers (so I guess my compiler
wasn't smart enough to optimize away the strcoll() + memcmp() cases).
Whereas, there is no noticeable regression/difference between each
case when I run the revised program unmodified. That seems to prove
that strcoll() is a good enough hiding place.

 Both values vary in range 5.9 - 6.1 s, so it's fair to say that the useless
 memcmp() is free with these parameters.

 Is this the worst case scenario?

Other than pushing the differences much much later in the strings
(which you surely thought of already), yes. I think it's worse than
the worst, because we've boiled this down to just the comparison part,
leaving only the strcoll() as a hiding place, which is evidently
good enough. I thought that it was important that there be an
unpredictable access pattern (characteristic of quicksort), so that
memory latency is added here and there. I'm happy to learn that I was
wrong about that, and that a strcoll() alone hides the would-be
memcmp() latency.

Large strings matter much less anyway, I think. If you have a pair of
strings both longer than CACHE_LINE_SIZE bytes, and the first
CACHE_LINE_SIZE bytes are identical, and the lengths are known to
match, it seems like a very sensible bet to anticipate that they're
fully equal. So in a world where that affects the outcome of this test
program, I think it still changes nothing (if, indeed, it matters at
all, which it appears not to anyway, at least with 256 byte strings).

We should probably do the a fully opportunistic memcmp() == 0 within
varstr_cmp() itself, so that Windows has the benefit of this too, as
well as callers like compareJsonbScalarValue().

Actually, looking at it closely, I think that there might still be a
microscopic regression, as there might have also been with my variant
of your SQL test case [1] - certainly in the noise, but perhaps
measurable with enough runs. If there is, that seems like an
acceptable price to pay.

When I test this stuff, I'm now very careful about power management
settings on my laptop...there are many ways to be left with egg on
your face with this kind of benchmark.

[1] 
http://www.postgresql.org/message-id/cam3swzqy95sow00b+zjycrgmr-uf1mz8ryv4_ou2encvstn...@mail.gmail.com
-- 
Peter Geoghegan


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] B-Tree support function number 3 (strxfrm() optimization)

2014-09-13 Thread Peter Geoghegan

On Fri, Sep 12, 2014 at 11:38 AM, Robert Haas robertmh...@gmail.com wrote:
 Based on discussion thus far it seems that there's a possibility that
 the trade-off may be different for short strings vs. long strings.  If
 the string is small enough to fit in the L1 CPU cache, then it may be
 that memcmp() followed by strcoll() is not much more expensive than
 strcoll().  That should be easy to figure out: write a standalone C
 program that creates a bunch of arbitrary, fairly-short strings, say
 32 bytes, in a big array.  Make the strings different near the end,
 but not at the beginning.  Then have the program either do strcoll()
 on every pair (O(n^2)) or, with a #define, memcmp() followed by
 strcoll() on every pair.  It should be easy to see whether the
 memcmp() adds 1% or 25% or 100%.

I get where you're coming from now (I think). It seems like you're
interested in getting a sense of what it would cost to do an
opportunistic memcmp() in a universe where memory latency isn't by far
the dominant cost (which it clearly is, as evidenced by my most recent
benchmark where a variant of Heikki's highly unsympathetic SQL test
case showed a ~1% regression). You've described a case with totally
predictable access patterns, perfectly suited to prefetching, and with
no other memory access bottlenecks. As I've said, this seems
reasonable (at least with those caveats in mind).

The answer to your high level question appears to be: the
implementation (the totally opportunistic memcmp() == 0 thing)
benefits from the fact that we're always bottlenecked on memory, and
to a fairly appreciable degree. I highly doubt that this is something
that can fail to work out with real SQL queries, but anything that
furthers our understanding of the optimization is a good thing. Of
course, within tuplesort we're not really getting the totally
opportunistic memcmp()s for free - rather, we're using a resource
that we would not otherwise be able to use at all.

This graph illustrates the historic trends of CPU and memory performance:

http://www.cs.virginia.edu/stream/stream_logo.gif

I find this imbalance quite remarkable - no wonder researchers are
finding ways to make the best of the situation. To make matters worse,
the per-core trends for memory bandwidth are now actually *negative
growth*. We may actually be going backwards, if we assume that that's
the bottleneck, and that we cannot parallelize things.

Anyway, attached rough test program implements what you outline. This
is for 30,000 32 byte strings (where just the final two bytes differ).
On my laptop, output looks like this (edited to only show median
duration in each case):


Strings generated - beginning tests
(baseline) duration of comparisons without useless memcmp()s: 13.445506 seconds

duration of comparisons with useless memcmp()s: 17.720501 seconds


It looks like about a 30% increase in run time when we always have
useless memcmps().

You can change the constants around easily - let's consider 64 KiB
strings now (by changing STRING_SIZE to 65536). In order to make the
program not take too long, I also reduce the number of strings
(N_STRINGS) from 30,000 to 1,000. If I do so, this is what I see as
output:


Strings generated - beginning tests
(baseline) duration of comparisons without useless memcmp()s: 11.205683 seconds

duration of comparisons with useless memcmp()s: 14.542997 seconds


It looks like the overhead here is surprisingly consistent with the
first run - again, about a 30% increase in run time.

As for 1MiB strings (this time, with an N_STRINGS of 300):


Strings generated - beginning tests
(baseline) duration of comparisons without useless memcmp()s: 23.624183 seconds

duration of comparisons with useless memcmp()s: 35.332728 seconds


So, at this scale, the overhead gets quite a bit worse, but the case
also becomes quite a bit less representative (if that's even
possible). I suspect that the call stack's stupidly large size may be
a problem here, but I didn't try and fix that.

Does this answer your question? Are you intent on extrapolating across
different platforms based on this test program, rather than looking at
real SQL queries? While a certain amount of that makes sense, I think
we should focus on queries that have some chance of being seen in real
production PostgreSQL instances. Failing that, actual SQL queries.

-- 
Peter Geoghegan
#include ctype.h
#include stdio.h
#include stdlib.h
#include string.h
#include sys/time.h

/* STRING_SIZE does not include NUL byte */
#define STRING_SIZE		32
#define N_STRINGS		3
#define LAST_N_DIFFER	2

#define INSTR_TIME_SUBTRACT(x,y) \
	do { \
		(x).tv_sec -= (y).tv_sec; \
		(x).tv_usec -= (y).tv_usec; \
		/* Normalize */ \
		while ((x).tv_usec  0) \
		{ \
			(x).tv_usec += 100; \
			(x).tv_sec--; \
		} \
	} while (0)

#define INSTR_TIME_GET_DOUBLE(t) \
	(((double) (t).tv_sec) + ((double) (t).tv_usec) / 100.0)

#define Min(x, y)		((x)  (y) ? (x) : (y))

/*
 * Generate single random alphabetical ASCII

Re: [HACKERS] B-Tree support function number 3 (strxfrm() optimization)

2014-09-12 Thread Heikki Linnakangas


On 09/12/2014 12:46 AM, Peter Geoghegan wrote:

On Thu, Sep 11, 2014 at 1:50 PM, Robert Haas robertmh...@gmail.com wrote:

I think I said pretty clearly that it was.


I agree that you did, but it wasn't clear exactly what factors you
were asking me to simulate.


All factors.


Do you want me to compare the same string a million times in a loop,
both with a strcoll() and with a memcmp()?


Yes.


Should I copy it into a buffer to add a NUL byte?


Yes.


Or should it be a new string each time, with a cache miss expected
some proportion of the time?


Yes.

I'm being facetious - it's easy to ask for tests when you're not the one 
running them. But seriously, please do run the all the tests that you 
think make sense.


I'm particularly interested in the worst case. What is the worst case 
for the proposed memcmp() check? Test that. If the worst case regresses 
significantly, then we need to have a discussion of how likely that 
worst case is to happen in real life, what the performance is like in 
more realistic almost-worst-case scenarios, does it need to be tunable, 
is the trade-off worth it, etc. But if the worst case regresses less 
than, say, 1%, and there are some other cases where you get a 300% speed 
up, then I think it's safe to say that the optimization is worth it, 
without any more testing or discussion.

- Heikki



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] B-Tree support function number 3 (strxfrm() optimization)

2014-09-12 Thread Robert Haas

On Fri, Sep 12, 2014 at 5:28 AM, Heikki Linnakangas
hlinnakan...@vmware.com wrote:
 On 09/12/2014 12:46 AM, Peter Geoghegan wrote:

 On Thu, Sep 11, 2014 at 1:50 PM, Robert Haas robertmh...@gmail.com
 wrote:

 I think I said pretty clearly that it was.


 I agree that you did, but it wasn't clear exactly what factors you
 were asking me to simulate.


 All factors.

 Do you want me to compare the same string a million times in a loop,
 both with a strcoll() and with a memcmp()?


 Yes.

 Should I copy it into a buffer to add a NUL byte?


 Yes.

 Or should it be a new string each time, with a cache miss expected
 some proportion of the time?


 Yes.

 I'm being facetious - it's easy to ask for tests when you're not the one
 running them. But seriously, please do run the all the tests that you think
 make sense.

 I'm particularly interested in the worst case. What is the worst case for
 the proposed memcmp() check? Test that. If the worst case regresses
 significantly, then we need to have a discussion of how likely that worst
 case is to happen in real life, what the performance is like in more
 realistic almost-worst-case scenarios, does it need to be tunable, is the
 trade-off worth it, etc. But if the worst case regresses less than, say, 1%,
 and there are some other cases where you get a 300% speed up, then I think
 it's safe to say that the optimization is worth it, without any more testing
 or discussion.

+1 to all that, including the facetious parts.

Based on discussion thus far it seems that there's a possibility that
the trade-off may be different for short strings vs. long strings.  If
the string is small enough to fit in the L1 CPU cache, then it may be
that memcmp() followed by strcoll() is not much more expensive than
strcoll().  That should be easy to figure out: write a standalone C
program that creates a bunch of arbitrary, fairly-short strings, say
32 bytes, in a big array.  Make the strings different near the end,
but not at the beginning.  Then have the program either do strcoll()
on every pair (O(n^2)) or, with a #define, memcmp() followed by
strcoll() on every pair.  It should be easy to see whether the
memcmp() adds 1% or 25% or 100%.

Then, repeat the same thing with strings that are big enough to blow
out the L1 cache, say 1MB in length.  Some intermediate sizes (64kB?)
might be worth testing, too.  Again, it should be easy to see what the
overhead is.  Once we know that, we can make intelligent decisions
about whether this is a good idea or not, and when.  If you attach the
test programs, other people (e.g. me) can also try them on other
systems (e.g. MacOS X) to see whether the characteristics there are
different than what you saw on your system.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] B-Tree support function number 3 (strxfrm() optimization)

2014-09-12 Thread Peter Geoghegan

On Fri, Sep 12, 2014 at 11:38 AM, Robert Haas robertmh...@gmail.com wrote:
 Based on discussion thus far it seems that there's a possibility that
 the trade-off may be different for short strings vs. long strings.  If
 the string is small enough to fit in the L1 CPU cache, then it may be
 that memcmp() followed by strcoll() is not much more expensive than
 strcoll().  That should be easy to figure out: write a standalone C
 program that creates a bunch of arbitrary, fairly-short strings, say
 32 bytes, in a big array.

While I think that's fair, the reason I didn't bother playing tricks
with only doing a (purely) opportunistic memcmp() when the string size
is under (say) CACHE_LINE_SIZE bytes is that in order for it to matter
you'd have to have a use case where the first CACHE_LINE_SIZE of bytes
matched, and the string just happened to be identical in length, but
also ultimately differed at least a good fraction of the time. That
seems like the kind of thing that it's okay to care less about. That
might have been regressed worse than what you've seen already. It's
narrow in a whole new dimension, though. The intersection of that
issue, and the issues exercised by Heikki's existing test case must be
exceedingly rare.

I'm still confused about whether or not we're talking at cross
purposes here, Robert. Are you happy to consider this as a separate
and additional question to the question of what to do in an
abbreviated comparison tie-break? The correlated multiple sort key
attributes case strikes me as very common - it's a nice to have, and
will sometimes offer a nice additional boost. On the other hand, doing
this for abbreviated comparison tie-breakers is more or less
fundamental to the patch. In my cost model, a memcmp() abbreviated key
tie-breaker than works out is equivalent to an abbreviated comparison.
This is a bit of a fudge, but close enough.

BTW, I do appreciate your work on this. I realize that if you didn't
give this patch a fair go, there is a chance that no one else would.

-- 
Peter Geoghegan


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] B-Tree support function number 3 (strxfrm() optimization)

2014-09-12 Thread Robert Haas

On Fri, Sep 12, 2014 at 2:58 PM, Peter Geoghegan p...@heroku.com wrote:
 On Fri, Sep 12, 2014 at 11:38 AM, Robert Haas robertmh...@gmail.com wrote:
 Based on discussion thus far it seems that there's a possibility that
 the trade-off may be different for short strings vs. long strings.  If
 the string is small enough to fit in the L1 CPU cache, then it may be
 that memcmp() followed by strcoll() is not much more expensive than
 strcoll().  That should be easy to figure out: write a standalone C
 program that creates a bunch of arbitrary, fairly-short strings, say
 32 bytes, in a big array.

 While I think that's fair, the reason I didn't bother playing tricks
 with only doing a (purely) opportunistic memcmp() when the string size
 is under (say) CACHE_LINE_SIZE bytes is that in order for it to matter
 you'd have to have a use case where the first CACHE_LINE_SIZE of bytes
 matched, and the string just happened to be identical in length, but
 also ultimately differed at least a good fraction of the time. That
 seems like the kind of thing that it's okay to care less about. That
 might have been regressed worse than what you've seen already. It's
 narrow in a whole new dimension, though. The intersection of that
 issue, and the issues exercised by Heikki's existing test case must be
 exceedingly rare.

 I'm still confused about whether or not we're talking at cross
 purposes here, Robert. Are you happy to consider this as a separate
 and additional question to the question of what to do in an
 abbreviated comparison tie-break?

I think I've said a few times now that I really want to get this
additional data before forming an opinion.  As a certain Mr. Doyle
writes, It is a capital mistake to theorize before one has data.
Insensibly one begins to twist facts to suit theories, instead of
theories to suit facts.  I can't say it any better than that.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] B-Tree support function number 3 (strxfrm() optimization)

2014-09-12 Thread Peter Geoghegan

On Fri, Sep 12, 2014 at 12:02 PM, Robert Haas robertmh...@gmail.com wrote:
 I think I've said a few times now that I really want to get this
 additional data before forming an opinion.  As a certain Mr. Doyle
 writes, It is a capital mistake to theorize before one has data.
 Insensibly one begins to twist facts to suit theories, instead of
 theories to suit facts.  I can't say it any better than that.

Well, in the abbreviated key case we might know that with probability
0.9 that the memcmp() == 0 thing will work out. In the
non-abbreviated tie-breaker case, we'll definitely know nothing. That
seems like a pretty fundamental distinction, so I don't think it's
premature to ask you to consider those two questions individually.
Still, maybe it's easier to justify both cases in the same way, if we
can.

-- 
Peter Geoghegan


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] B-Tree support function number 3 (strxfrm() optimization)

2014-09-11 Thread Peter Geoghegan

On Wed, Sep 10, 2014 at 11:36 AM, Robert Haas robertmh...@gmail.com wrote:
 No, not really.  All you have to do is right a little test program to
 gather the information.

I don't think a little test program is useful - IMV it's too much of a
simplification to suppose that a strcoll() has a fixed cost, and a
memcmp() has a fixed cost, and that we can determine algebraically
that we should (say) proceed or not proceed with the additional
opportunistic memcmp() == 0 optimization based solely on that. I'm
not sure if that's what you meant, but it might have been. Temporal
locality is surely a huge factor here, for example. Are we talking
about a memcmp() that always immediately precedes a similar strcoll()
call on the same memory? Are we factoring in the cost of
NUL-termination in order to make each strcoll() possible? And that's
just for starters.

However, I think it's perfectly fair to consider a case where the
opportunistic memcmp() == 0 thing never works out (as opposed to
mostly not helping, which is what I considered earlier [1]), as long
as we're sorting real tuples. You mentioned Heikki's test case; it
seems fair to consider that, but for the non-abbreviated case where
the additional, *totally* opportunistic memcmp == 0 optimization
only applies (so no abbreviated keys), while still having the
additional optimization be 100% useless. Clearly that test case is
also just about perfectly pessimal for this case too. (Recall that
Heikki's test case shows performance on per-sorted input, so there are
far fewer comparisons than would typically be required anyway - n
comparisons, or a bubble sort best case. If I wanted to cheat, I
could reduce work_mem so that an external tape sort is used, since as
it happens tapesort doesn't opportunistically check for pre-sorted
input, but I won't do that. Heikki's case both emphasizes the
amortized cost of a strxfrm() where we abbreviate, and in this
instance de-emphasizes the importance of memory latency by having
access be sequential/predictable.)

The only variation I'm adding here to Heikki's original test case is
to have a leading int4 attribute that always has a value of 1 -- that
conveniently removes abbreviation (including strxfrm() overhead) as a
factor that can influence the outcome, since right now that isn't
under consideration. So:

create table sorttest (dummy int4, t text);
insert into sorttest select 1, 'foobarfo' || (g) || repeat('a', 75)
from generate_series(1, 3) g;

Benchmark:

pg@hamster:~/tests$ cat heikki-sort.sql
select * from (select * from sorttest order by dummy, t offset 100) f;

pgbench -f heikki-sort.sql -T 100 -n

With optimization enabled

tps = 77.861353 (including connections establishing)
tps = 77.862260 (excluding connections establishing)

tps = 78.211053 (including connections establishing)
tps = 78.212016 (excluding connections establishing)

tps = 77.996117 (including connections establishing)
tps = 77.997069 (excluding connections establishing)

With optimization disabled (len1 == len2 thing is commented out)
=

tps = 78.719389 (including connections establishing)
tps = 78.720366 (excluding connections establishing)

tps = 78.764819 (including connections establishing)
tps = 78.765712 (excluding connections establishing)

tps = 78.472902 (including connections establishing)
tps = 78.473844 (excluding connections establishing)

So, yes, it looks like I might have just about regressed this case -
it's hard to be completely sure. However, this is still a very
unrealistic case, since invariably len1 == len2 without the
optimization ever working out, whereas the case that benefits [2] is
quite representative. As I'm sure you were expecting, I still favor
pursuing this additional optimization.

If you think I've been unfair or not thorough, I'm happy to look at
other cases. Also, I'm not sure that you accept that I'm justified in
considering this a separate question to the more important question of
what to do in the tie-breaker abbreviation case (where we may be
almost certain that equality will be indicated by a memcmp()). If you
don't accept that I'm right about that more important case, I guess
that means that you don't have confidence in my ad-hoc cost model (the
HyperLogLog/cardinality stuff).

[1] 
http://www.postgresql.org/message-id/CAM3SWZR9dtGO+zX4VEn7GTW2=+umSNq=c57sjgxg8oqhjar...@mail.gmail.com
[2] 
http://www.postgresql.org/message-id/cam3swzqtyv3kp+cakzjzv3rwb1ojjahwpcz9coyjxpkhbtc...@mail.gmail.com
-- 
Peter Geoghegan


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] B-Tree support function number 3 (strxfrm() optimization)

2014-09-11 Thread Robert Haas

On Thu, Sep 11, 2014 at 4:13 PM, Peter Geoghegan p...@heroku.com wrote:
 On Wed, Sep 10, 2014 at 11:36 AM, Robert Haas robertmh...@gmail.com wrote:
 No, not really.  All you have to do is right a little test program to
 gather the information.

 I don't think a little test program is useful - IMV it's too much of a
 simplification to suppose that a strcoll() has a fixed cost, and a
 memcmp() has a fixed cost, and that we can determine algebraically
 that we should (say) proceed or not proceed with the additional
 opportunistic memcmp() == 0 optimization based solely on that. I'm
 not sure if that's what you meant, but it might have been.

I think I said pretty clearly that it was.

 However, I think it's perfectly fair to consider a case where the
 opportunistic memcmp() == 0 thing never works out (as opposed to
 mostly not helping, which is what I considered earlier [1]), as long
 as we're sorting real tuples. You mentioned Heikki's test case; it
 seems fair to consider that, but for the non-abbreviated case where
 the additional, *totally* opportunistic memcmp == 0 optimization
 only applies (so no abbreviated keys), while still having the
 additional optimization be 100% useless. Clearly that test case is
 also just about perfectly pessimal for this case too. (Recall that
 Heikki's test case shows performance on per-sorted input, so there are
 far fewer comparisons than would typically be required anyway - n
 comparisons, or a bubble sort best case. If I wanted to cheat, I
 could reduce work_mem so that an external tape sort is used, since as
 it happens tapesort doesn't opportunistically check for pre-sorted
 input, but I won't do that. Heikki's case both emphasizes the
 amortized cost of a strxfrm() where we abbreviate, and in this
 instance de-emphasizes the importance of memory latency by having
 access be sequential/predictable.)

 The only variation I'm adding here to Heikki's original test case is
 to have a leading int4 attribute that always has a value of 1 -- that
 conveniently removes abbreviation (including strxfrm() overhead) as a
 factor that can influence the outcome, since right now that isn't
 under consideration. So:

 create table sorttest (dummy int4, t text);
 insert into sorttest select 1, 'foobarfo' || (g) || repeat('a', 75)
 from generate_series(1, 3) g;

 Benchmark:

 pg@hamster:~/tests$ cat heikki-sort.sql
 select * from (select * from sorttest order by dummy, t offset 100) f;

 pgbench -f heikki-sort.sql -T 100 -n

 With optimization enabled
 
 tps = 77.861353 (including connections establishing)
 tps = 77.862260 (excluding connections establishing)

 tps = 78.211053 (including connections establishing)
 tps = 78.212016 (excluding connections establishing)

 tps = 77.996117 (including connections establishing)
 tps = 77.997069 (excluding connections establishing)

 With optimization disabled (len1 == len2 thing is commented out)
 =

 tps = 78.719389 (including connections establishing)
 tps = 78.720366 (excluding connections establishing)

 tps = 78.764819 (including connections establishing)
 tps = 78.765712 (excluding connections establishing)

 tps = 78.472902 (including connections establishing)
 tps = 78.473844 (excluding connections establishing)

 So, yes, it looks like I might have just about regressed this case -
 it's hard to be completely sure. However, this is still a very
 unrealistic case, since invariably len1 == len2 without the
 optimization ever working out, whereas the case that benefits [2] is
 quite representative. As I'm sure you were expecting, I still favor
 pursuing this additional optimization.

Well, I have to agree that doesn't look too bad, but your reluctance
to actually do the microbenchmark worries me.  Granted,
macrobenchmarks are what actually matters, but they can be noisy and
there can be other confounding factors.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] B-Tree support function number 3 (strxfrm() optimization)

2014-09-11 Thread Peter Geoghegan

On Thu, Sep 11, 2014 at 1:50 PM, Robert Haas robertmh...@gmail.com wrote:
 I think I said pretty clearly that it was.

I agree that you did, but it wasn't clear exactly what factors you
were asking me to simulate. It still isn't. Do you want me to compare
the same string a million times in a loop, both with a strcoll() and
with a memcmp()? Should I copy it into a buffer to add a NUL byte? Or
should it be a new string each time, with a cache miss expected some
proportion of the time? These considerations might significantly
influence the outcome here, and one variation might be significantly
less fair than another. Tell me what to do in a little more detail,
and I'll do it (plus let you know what I think of it). I honestly
don't know what you expect.

 So, yes, it looks like I might have just about regressed this case -
 it's hard to be completely sure. However, this is still a very
 unrealistic case, since invariably len1 == len2 without the
 optimization ever working out, whereas the case that benefits [2] is
 quite representative. As I'm sure you were expecting, I still favor
 pursuing this additional optimization.

 Well, I have to agree that doesn't look too bad, but your reluctance
 to actually do the microbenchmark worries me.  Granted,
 macrobenchmarks are what actually matters, but they can be noisy and
 there can be other confounding factors.

Well, I've been quite open about the fact that I think we can and
should hide things in memory latency. I don't think my benchmark was
in any way noisy, since you saw 3 100 second runs per test set/case,
with a very stable outcome throughout - plus the test case is
extremely unsympathetic/unrealistic to begin with. Hiding behind
memory latency is an increasingly influential trick that I've seen
crop up a few times in various papers. I think it's perfectly
legitimate to rely on that. But, honestly, I have little idea how much
I actually am relying on it. I think it's only fair to look at
representative cases (i.e. actually SQL queries). Anything else can
only be used as a guide. But, in this instance, a guide to what,
exactly? This is not a rhetorical question, and I'm not trying to be
difficult. If I thought there was something bad hiding here, I'd tell
you about it.

-- 
Peter Geoghegan


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] B-Tree support function number 3 (strxfrm() optimization)

2014-09-11 Thread Peter Geoghegan

On Tue, Sep 9, 2014 at 2:25 PM, Robert Haas robertmh...@gmail.com wrote:
 I like that I don't have to care about every combination, and can
 treat abbreviation abortion as the special case with the extra step,
 in line with how I think of the optimization conceptually. Does that
 make sense?

 No.  comparetup_heap() is hip-deep in this optimization as it stands,
 and what I proposed - if done correctly - isn't going to make that
 significantly worse.  In fact, it really ought to make things better:
 you should be able to set things up so that ssup-comparator is always
 the test that should be applied first, regardless of whether we're
 aborted or not-aborted or not doing this in the first place; and then
 ssup-tiebreak_comparator, if not NULL, can be applied after that.

I'm not following here. Isn't that at least equally true of what I've
proposed? Sure, I'm checking if (!state-abbrevAborted) first, but
that's irrelevant to the non-abbreviated case. It has nothing to
abort. Also, AFAICT we cannot abort and still call ssup-comparator()
indifferently, since sorttuple.datum1 fields are perhaps abbreviated
keys in half of all cases (i.e. pre-abort tuples), and uninitialized
garbage the other half of the time (i.e. post-abort tuples).

Where is the heap_getattr() stuff supposed to happen for the first
attribute to get an authoritative comparison in the event of aborting
(if we're calling ssup-comparator() on datum1 indifferently)? We
decide that we're going to use abbreviated keys within datum1 fields
up-front. When we abort, we cannot use datum1 fields at all (which
doesn't appear to matter for text -- the datum1 optimization has
historically only benefited pass-by-value types).

I'd mentioned that I'd hacked together a patch that doesn't
necessitate a separate state (if only to save a very small amount of
memory), but it is definitely messier within comparetup_heap(). I'm
still tweaking it. FYI, it does this within comparetup_heap():

+   if (!sortKey-abbrev_comparator)
+   {
+   /*
+* There are no abbreviated keys to begin with (i.e.
no opclass support
+* exists).  Compare the leading sort key, assuming an
authoritative
+* representation.
+*/
+   compare = ApplySortComparator(a-datum1, a-isnull1,
+
   b-datum1, b-isnull1,
+
   sortKey);
+   if (compare != 0)
+   return compare;
+
+   sortKey++;
+   nkey = 1;
+   }
+   else if (!state-abbrevAborted  sortKey-abbrev_comparator)
+   {
+   /*
+* Leading attribute has abbreviated key representation, and
+* abbreviation was not aborted when copying.  Compare
the leading sort
+* key using abbreviated representation.
+*/
+   compare = ApplySortAbbrevComparator(a-datum1, a-isnull1,
+
 b-datum1, b-isnull1,
+
 sortKey);
+   if (compare != 0)
+   return compare;
+
+   /*
+* Since abbreviated comparison returned 0, call
tie-breaker comparator
+* using original, authoritative representation, which
may break tie
+* when differences were not captured within
abbreviated representation
+*/
+   nkey = 0;
+   }
+   else
+   {
+   /*
+* Started with abbreviated keys, but aborted during
conversion/tuple
+* copying -- check each attribute from scratch.  It's
not safe to make
+* any assumption about the state of individual datum1 fields.
+*/
+   nkey = 0;
+   }

Is doing all this worth the small saving in memory? Personally, I
don't think that it is, but I defer to you.
-- 
Peter Geoghegan


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] B-Tree support function number 3 (strxfrm() optimization)

2014-09-10 Thread Peter Geoghegan

On Tue, Sep 9, 2014 at 2:00 PM, Robert Haas robertmh...@gmail.com wrote:
 Boiled down, what you're saying is that you might have a set that
 contains lots of duplicates in general, but not very many where the
 abbreviated-keys also match.  Sure, that's true.

Abbreviated keys are not used in the case where we do a (fully)
opportunistic memcmp(), without really having any idea of whether or
not it'll work out. Abbreviated keys aren't really relevant to that
case, except perhaps in that we know we'll have statistics available
for leading attributes, which will make the case less frequent in
practice.

 But you might also
 not have that case, so I don't see where that gets us; the same
 worst-case test case Heikki developed the last time we relitigated
 this point is still relevant here.

Well, I think that there should definitely be a distinction made
between abbreviated and non-abbreviated cases; you could frequently
have almost 100% certainty that each of those optimistic memcmp()s
will work out with abbreviated keys. Low cardinality sets are very
common. I'm not sure what your position on that is. My proposal to
treat both of those cases (abbreviated with a cost model/cardinality
statistics; non-abbreviated without) the same is based on different
arguments for each case.

 In order to know how much we're
 giving up in that case, we need the exact number I asked you to
 provide in my previous email: the ratio of the cost of strcoll() to
 the cost of memcmp().

 I see that you haven't chosen to provide that information in any of
 your four responses.

Well, it's kind of difficult to give that number in a vacuum. I showed
a case that had a large majority of opportunistic memcmp()s go to
waste, while a small number were useful, which still put us ahead. I
can look at Heikki's case with this again if you think that'll help.
Heikki said that his case was all about wasted strxfrm()s, which are
surely much more expensive than wasted memcmp()s, particularly when
you consider temporal locality (we needed that memory to be stored in
a cacheline for the immediately subsequent operation anyway, should
the memcmp() thing not work out - the simple ratio that you're
interested in may be elusive).

In case I haven't been clear enough on this point, I re-emphasize that
I do accept that for something like the non-abbreviated case, the
opportunistic memcmp() thing must virtually be free if we're to
proceed, since it is purely opportunistic. If it can be demonstrated
that that isn't the case (and if that cannot be fixed by limiting it
to  CACHE_LINE_SIZE), clearly we should not proceed with
opportunistic (non-abbreviated) memcmp()s. In fact, I think I'm
holding it to a higher standard than you are - I believe that it had
better be virtually free.

-- 
Peter Geoghegan


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] B-Tree support function number 3 (strxfrm() optimization)

2014-09-10 Thread Robert Haas

On Wed, Sep 10, 2014 at 1:36 PM, Peter Geoghegan p...@heroku.com wrote:
 In order to know how much we're
 giving up in that case, we need the exact number I asked you to
 provide in my previous email: the ratio of the cost of strcoll() to
 the cost of memcmp().

 I see that you haven't chosen to provide that information in any of
 your four responses.

 Well, it's kind of difficult to give that number in a vacuum.

No, not really.  All you have to do is right a little test program to
gather the information.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] B-Tree support function number 3 (strxfrm() optimization)

2014-09-09 Thread Robert Haas

On Thu, Sep 4, 2014 at 5:46 PM, Peter Geoghegan p...@heroku.com wrote:
 On Thu, Sep 4, 2014 at 2:18 PM, Robert Haas robertmh...@gmail.com wrote:
 Eh, maybe?  I'm not sure why the case where we're using abbreviated
 keys should be different than the case we're not.  In either case this
 is a straightforward trade-off: if we do a memcmp() before strcoll(),
 we win if it returns 0 and lose if returns non-zero and strcoll also
 returns non-zero.  (If memcmp() returns non-zero but strcoll() returns
 0, it's a tie.)  I'm not immediately sure why it should affect the
 calculus one way or the other whether abbreviated keys are in use; the
 question of how much faster memcmp() is than strcoll() seems like the
 relevant consideration either way.

 Not quite. Consider my earlier example of sorting ~300,000 cities by
 country only. That's a pretty low cardinality attribute. We win big,
 and we are almost certain that the abbreviated key cardinality is a
 perfect proxy for the full key cardinality so we stick with
 abbreviated keys while copying over tuples. Sure, most comparisons
 will actually be resolved with a memcmp() == 0 rather than an
 abbreviated comparison, but under my ad-hoc cost model there is no
 distinction, since they're both very much cheaper than a strcoll()
 (particularly when we factor in the NUL termination copying that a
 memcmp() == 0 also avoids). To a lesser extent we're also justified
 in that optimism because we've already established that roughly the
 first 8 bytes of the string are bitwise equal.

 So the difference is that in the abbreviated key case, we are at least
 somewhat justified in our optimism. Whereas, where we're just eliding
 fmgr overhead, say on the 2nd or subsequent attribute of a multi-key
 sort, it's totally opportunistic to chance a memcmp() == 0. The
 latter optimization can only be justified by the fact that the
 memcmp() is somewhere between dirt cheap and free. That seems like
 soemthing that should significantly impact the calculus.

Boiled down, what you're saying is that you might have a set that
contains lots of duplicates in general, but not very many where the
abbreviated-keys also match.  Sure, that's true.  But you might also
not have that case, so I don't see where that gets us; the same
worst-case test case Heikki developed the last time we relitigated
this point is still relevant here.  In order to know how much we're
giving up in that case, we need the exact number I asked you to
provide in my previous email: the ratio of the cost of strcoll() to
the cost of memcmp().

I see that you haven't chosen to provide that information in any of
your four responses.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] B-Tree support function number 3 (strxfrm() optimization)

2014-09-09 Thread Robert Haas

On Fri, Sep 5, 2014 at 10:45 PM, Peter Geoghegan p...@heroku.com wrote:
 While I gave serious consideration to your idea of having a dedicated
 abbreviation comparator, and not duplicating sortsupport state when
 abbreviated keys are used (going so far as to almost fully implement
 the idea), I ultimately decided that my vote says we don't do that. It
 seemed to me that there were negligible benefits for increased
 complexity. In particular, I didn't want to burden tuplesort with
 having to worry about whether or not abbreviation was aborted during
 tuple copying, or was not used by the opclass in the first place -
 implementing your scheme makes that distinction relevant. It's very
 convenient to have comparetup_heap() compare the leading sort key
 (that specifically looks at SortTuple.datum1 pairs) indifferently,
 using the same comparator for abbreviated and not abbreviated
 cases indifferently. comparetup_heap() does not seem like a great
 place to burden with caring about each combination any more than
 strictly necessary.

 I like that I don't have to care about every combination, and can
 treat abbreviation abortion as the special case with the extra step,
 in line with how I think of the optimization conceptually. Does that
 make sense?

No.  comparetup_heap() is hip-deep in this optimization as it stands,
and what I proposed - if done correctly - isn't going to make that
significantly worse.  In fact, it really ought to make things better:
you should be able to set things up so that ssup-comparator is always
the test that should be applied first, regardless of whether we're
aborted or not-aborted or not doing this in the first place; and then
ssup-tiebreak_comparator, if not NULL, can be applied after that.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] B-Tree support function number 3 (strxfrm() optimization)

2014-09-06 Thread Peter Geoghegan

On Fri, Sep 5, 2014 at 7:45 PM, Peter Geoghegan p...@heroku.com wrote:
 Attached additional patches are intended to be applied on top off most
 of the patches posted on September 2nd [1].


I attach another amendment/delta patch, intended to be applied on top
of what was posted yesterday. I neglected to remove some abort logic
that was previously only justified by the lack of opportunistic
memcmp() == 0 comparisons in all instances, rather than just with
abbreviated keys.

-- 
Peter Geoghegan
From c60b07dda3e81e1fc9a2e95c386faf1fccfe8f04 Mon Sep 17 00:00:00 2001
From: Peter Geoghegan p...@heroku.com
Date: Sat, 6 Sep 2014 14:56:37 -0700
Subject: [PATCH 8/8] Remove obsolete abort logic

Remove some abort logic that was previously only justified by the lack
of opportunistic memcmp() == 0 comparisons in all instances, rather
than just with abbreviated keys.  That justification no longer stands.
---
 src/backend/utils/adt/varlena.c | 9 -
 1 file changed, 9 deletions(-)

diff --git a/src/backend/utils/adt/varlena.c b/src/backend/utils/adt/varlena.c
index 6aa3eaa..db4eae1 100644
--- a/src/backend/utils/adt/varlena.c
+++ b/src/backend/utils/adt/varlena.c
@@ -2295,15 +2295,6 @@ bttext_abbrev_abort(int memtupcount, double rowhint, SortSupport ssup)
 	norm_key_card = key_distinct / (double) memtupcount;
 
 	/*
-	 * If this is a very low cardinality set generally, that is reason enough
-	 * to favor our strategy, since many comparisons can be resolved with
-	 * inexpensive memcmp() tie-breakers, even though abbreviated keys are of
-	 * marginal utility.
-	 */
-	if (norm_key_card  0.05)
-		return false;
-
-	/*
 	 * Abort abbreviation strategy.
 	 *
 	 * The worst case, where all abbreviated keys are identical while all
-- 
1.9.1


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] B-Tree support function number 3 (strxfrm() optimization)

2014-09-06 Thread Peter Geoghegan

On Sat, Sep 6, 2014 at 3:01 PM, Peter Geoghegan p...@heroku.com wrote:
 I attach another amendment/delta patch

Attached is another amendment to the patch set. With the recent
addition of abbreviation support on 32-bit platforms, we should just
hash the Datum representation as a uint32 on SIZEOF_DATUM != 8
platforms.

-- 
Peter Geoghegan
From 180ad839d8f049d83a89b42a1d85c7c99a7929f0 Mon Sep 17 00:00:00 2001
From: Peter Geoghegan p...@heroku.com
Date: Sat, 6 Sep 2014 21:39:16 -0700
Subject: [PATCH 9/9] On SIZEOF_DATUM != 4 platforms, call hash_uint32()
 directly

In passing, remove some dead code within bttext_abbrev_abort().
---
 src/backend/utils/adt/varlena.c | 28 ++--
 1 file changed, 14 insertions(+), 14 deletions(-)

diff --git a/src/backend/utils/adt/varlena.c b/src/backend/utils/adt/varlena.c
index db4eae1..23944f2 100644
--- a/src/backend/utils/adt/varlena.c
+++ b/src/backend/utils/adt/varlena.c
@@ -2068,9 +2068,7 @@ bttext_abbrev_convert(Datum original, SortSupport ssup)
 	char			   *pres;
 	int	len;
 	Sizebsize;
-	uint32lohalf,
-		hihalf,
-		hash;
+	uint32hash;
 
 	/*
 	 * Abbreviated key representation is a pass-by-value Datum that is treated
@@ -2143,9 +2141,18 @@ retry:
 	memcpy(pres, tss-buf2, Min(sizeof(Datum), bsize));
 
 	/* Hash abbreviated key */
-	lohalf = (uint32) res;
-	hihalf = (uint32) (res  32);
-	hash = hash_uint32(lohalf ^ hihalf);
+#if SIZEOF_DATUM == 8
+	{
+		uint32lohalf,
+			hihalf;
+
+		lohalf = (uint32) res;
+		hihalf = (uint32) (res  32);
+		hash = hash_uint32(lohalf ^ hihalf);
+	}
+#else			/* SIZEOF_DATUM != 8 */
+	hash = hash_uint32((uint32) res);
+#endif
 
 	addHyperLogLog(tss-abbr_card, hash);
 
@@ -2168,8 +2175,7 @@ bttext_abbrev_abort(int memtupcount, double rowhint, SortSupport ssup)
 {
 	TextSortSupport	   *tss = (TextSortSupport *) ssup-ssup_extra;
 	doubleabbrev_distinct,
-		key_distinct,
-		norm_key_card;
+		key_distinct;
 
 	Assert(ssup-abbrev_state == ABBREVIATED_KEYS_YES);
 
@@ -2289,12 +2295,6 @@ bttext_abbrev_abort(int memtupcount, double rowhint, SortSupport ssup)
 		return false;
 
 	/*
-	 * Normalized cardinality is proportion of distinct original, authoritative
-	 * keys
-	 */
-	norm_key_card = key_distinct / (double) memtupcount;
-
-	/*
 	 * Abort abbreviation strategy.
 	 *
 	 * The worst case, where all abbreviated keys are identical while all
-- 
1.9.1


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] B-Tree support function number 3 (strxfrm() optimization)

2014-09-05 Thread Peter Geoghegan

On Wed, Sep 3, 2014 at 2:44 PM, Peter Geoghegan p...@heroku.com wrote:
 I guess it should still be a configure option, then. Or maybe there
 should just be a USE_ABBREV_KEYS macro within pg_config_manual.h.

Attached additional patches are intended to be applied on top off most
of the patches posted on September 2nd [1]. Note that you should not
apply patch 0001-* from that set to master, since it has already been
committed to master [2]. However, while rebasing I revised
patch/commit 0005-* to abbreviation used on all platforms, including
32-bit platforms (the prior 0005-* patch just re-enabled the
optimization on Darwin/Apple), so you should discard the earlier
0005-* patch. In a later commit I also properly formalize the idea
that we always do opportunistic memcmp() == 0 checks, no matter what
context a sortsupport-accelerated text comparison occurs in. That
seems like a good idea, but it's broken out in a separate commit in
case you are not in agreement.

While I gave serious consideration to your idea of having a dedicated
abbreviation comparator, and not duplicating sortsupport state when
abbreviated keys are used (going so far as to almost fully implement
the idea), I ultimately decided that my vote says we don't do that. It
seemed to me that there were negligible benefits for increased
complexity. In particular, I didn't want to burden tuplesort with
having to worry about whether or not abbreviation was aborted during
tuple copying, or was not used by the opclass in the first place -
implementing your scheme makes that distinction relevant. It's very
convenient to have comparetup_heap() compare the leading sort key
(that specifically looks at SortTuple.datum1 pairs) indifferently,
using the same comparator for abbreviated and not abbreviated
cases indifferently. comparetup_heap() does not seem like a great
place to burden with caring about each combination any more than
strictly necessary.

I like that I don't have to care about every combination, and can
treat abbreviation abortion as the special case with the extra step,
in line with how I think of the optimization conceptually. Does that
make sense? Otherwise, there'd have to be a ApplySortComparator()
*and* ApplySortComparatorAbbreviated() call with SortTuple.datum1
pairs passed, as appropriate for each opclass (and abortion state), as
well as a heap_getattr() tie-breaker call for the latter case alone
(when we got an inconclusive answer, OR when abbreviation was
aborted). Finally, just as things are now, there'd have to be a loop
where the second or subsequent attributes are dealt with by
ApplySortComparator()'ing. So AFAICT under your scheme there are 4
ApplySortComparator* call sites required, rather than 3 as under mine.

Along similar lines, I thought about starting from nkey = 0 within
comparetup_heap() when abortion occurs (so that there'd only be 2
ApplySortComparator() call sites - no increase from master) , but that
turns out to be messy, plus I like those special tie-breaker
assertions.

I will be away for much of next week, and will have limited access to
e-mail. I will be around tomorrow, though. I hope that what I've
posted is suitable to commit without further input from me.

[1] 
http://www.postgresql.org/message-id/cam3swztetqckc24lhwkdlasjf-b-ccnn4q0oyjhgbx+ncpn...@mail.gmail.com
[2] 
http://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=d8d4965dc29263462932be03d4206aa694e2cd7e
-- 
Peter Geoghegan
From 889d615218d5cac9097c11bec33e2d8e70112922 Mon Sep 17 00:00:00 2001
From: Peter Geoghegan p...@heroku.com
Date: Fri, 5 Sep 2014 16:43:56 -0700
Subject: [PATCH 7/7] Soften wording about nodeAgg.c single column support

Support for forcing nodeAgg.c to use a heap tuplesort rather than a
Datum tuplesort, even when sorting on a single attribute may be useful.
Datum tuplesorts do not and cannot support abbreviated keys, since Datum
tuplesorting necessitates preserving the original representation for
later re-use by certain datum tuplesort clients.  When using a heap
tuplesort will make the difference between using and not using
abbreviated keys, it is likely significantly faster to prefer a heap
tuplesort.  On the other hand, nodeAgg.c's current preference for a
datum tuplesort is likely to result in superior performance for
non-abbreviated cases.

Since it is now apparent that support for forcing a heap tuplesort
within nodeAgg.c will follow in a later, separately reviewed patch,
soften the wording in comments within nodeAgg.c to reflect this.  This
doesn't seem unreasonable, since sortsupport optimization is already
used inconsistently.
---
 src/backend/executor/nodeAgg.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/src/backend/executor/nodeAgg.c b/src/backend/executor/nodeAgg.c
index 3b335e7..95143c3 100644
--- a/src/backend/executor/nodeAgg.c
+++ b/src/backend/executor/nodeAgg.c
@@ -365,9 +365,9 @@ initialize_aggregates(AggState *aggstate,
 			 * otherwise sort the full tuple.  (See comments

Re: [HACKERS] B-Tree support function number 3 (strxfrm() optimization)

2014-09-04 Thread Robert Haas

On Tue, Sep 2, 2014 at 10:27 PM, Peter Geoghegan p...@heroku.com wrote:
 * Still doesn't address the open question of whether or not we should
 optimistically always try memcmp() == 0 on tiebreak. I still lean
 towards yes.

Let m be the cost of a memcmp() that fails near the end of the
strings; and let s be the cost of a strcoll that does likewise.
Clearly s  m.  But approximately what is s/m on platforms where you
can test?  Say, with 100 byte string, in a few different locales.

If for example s/m  100 then it's a no-brainer, because in the worst
case we're adding 1% overhead, and in the best case we're saving 99%.
OTOH, if s/m  2 then I almost certainly wouldn't do it, because in
the worst case we're adding 50% overhead, and in the best case we're
saving 50%.  That seems like it's doubling down on the abbreviated
key stuff to work mostly all the time, and I'm not prepared to make
that bet.  There is of course a lot of daylight between a 2-to-1 ratio
and a 100-to-1 ratio and I expect the real value is somewhere in the
middle (probably closer to 2); I haven't at this time made up my mind
what value would make this worthwhile, but I'd like to know what the
real numbers are.

 * Leaves open the question of what to do when we can't use the
 abbreviated keys optimization just because a datum tuplesort is
 preferred when sorting single-attribute tuples (recall that datum case
 tuplesorts cannot use abbreviated keys). We want to avail of tuplesort
 datum sorting where we can, except when abbreviated keys are
 available, which presumably tips the balance in favor of heap tuple
 sorting even when sorting on only one attribute, simply because then
 we can then use abbreviated keys. I'm thinking in particular of
 nodeAgg.c, which is an important case.

I favor leaving this issue to a future patch.  The last thing this
patch needs is more changes that someone might potentially dislike.
Let's focus on getting the core thing working, and then you can
enhance it once we all agree that it is.

On the substance of this issue, I suspect that for pass-by-value data
types it can hardly be wrong to use the datum tuplesort approach; but
it's possible we will want to disable it for pass-by-reference data
types when the abbreviated-key infrastructure is available.  That will
lose if it turns out that the abbreviated keys aren't capturing enough
of the entropy, but maybe we'll decide that's OK.  Or maybe not.  But
I don't think it's imperative that this patch make a change in that
area, and indeed, in the interest of keeping separate changes
isolated, I think it's better if it doesn't.

 There are still FIXME/TODO comments for each of these two points.
 Further, this revised/rebased patch set:

 * Incorporates your feedback on stylistic issues, with changes
 confined to their own commit (on top of earlier commits that are
 almost, but not quite, the same as the prior revision that your
 remarks apply to).

 * No longer does anything special within reversedirection_heap(),
 since that is unnecessary, as it's only used by bounded sorts, which
 aren't a useful target for abbreviated keys. This is noted. There is
 no convenient point to add a defensive assertion against this, so I
 haven't.

 * Updates comments in master in a broken-out way, reflecting opclass
 contract with sortsupport as established by
 1d41739e5a04b0e93304d24d864b6bfa3efc45f2, that is convenient to apply
 to and commit in the master branch immediately.

Thanks, committed that one.  The remaining patches can be squashed
into a single one, as none of them can be applied without the others.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] B-Tree support function number 3 (strxfrm() optimization)

2014-09-04 Thread Robert Haas

On Wed, Sep 3, 2014 at 5:44 PM, Peter Geoghegan p...@heroku.com wrote:
 On Wed, Sep 3, 2014 at 2:18 PM, Robert Haas robertmh...@gmail.com wrote:
 My suggestion is to remove the special cases for Darwin and 32-bit
 systems and see how it goes.

 I guess it should still be a configure option, then. Or maybe there
 should just be a USE_ABBREV_KEYS macro within pg_config_manual.h.

 Are you suggesting that the patch be committed with the optimization
 enabled on all platforms by default, with the option to revisit
 disabling it if and when there is user push-back? I don't think that's
 unreasonable, given the precautions now taken, but I'm just not sure
 that's what you mean.

That's what I mean.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] B-Tree support function number 3 (strxfrm() optimization)

2014-09-04 Thread Peter Geoghegan

On Thu, Sep 4, 2014 at 9:19 AM, Robert Haas robertmh...@gmail.com wrote:
 On Tue, Sep 2, 2014 at 10:27 PM, Peter Geoghegan p...@heroku.com wrote:
 * Still doesn't address the open question of whether or not we should
 optimistically always try memcmp() == 0 on tiebreak. I still lean
 towards yes.

 Let m be the cost of a memcmp() that fails near the end of the
 strings; and let s be the cost of a strcoll that does likewise.
 Clearly s  m.  But approximately what is s/m on platforms where you
 can test?  Say, with 100 byte string, in a few different locales.

Just to be clear: I imagine you're more or less sold on the idea of
testing equality in the event of a tie-break, where the leading 8
primary weight bytes are already known to be equal (and the full text
string lengths also match); the theory of operation behind testing how
good a proxy for full key cardinality abbreviated key cardinality is
is very much predicated on that. We can still win big with very low
cardinality sets this way, which are an important case. What I
consider an open question is whether or not we should do that on the
first call when there is no abbreviated comparison, such as on the
second or subsequent attribute in a multi-column sort, in the hope
that equality will just happen to be indicated.

 If for example s/m  100 then it's a no-brainer, because in the worst
 case we're adding 1% overhead, and in the best case we're saving 99%.
 OTOH, if s/m  2 then I almost certainly wouldn't do it, because in
 the worst case we're adding 50% overhead, and in the best case we're
 saving 50%.  That seems like it's doubling down on the abbreviated
 key stuff to work mostly all the time, and I'm not prepared to make
 that bet.  There is of course a lot of daylight between a 2-to-1 ratio
 and a 100-to-1 ratio and I expect the real value is somewhere in the
 middle (probably closer to 2); I haven't at this time made up my mind
 what value would make this worthwhile, but I'd like to know what the
 real numbers are.

Well, we can only lose when the strings happen to be the same size. So
that's something. But I'm willing to consider the possibility that the
memcmp() is virtually free. I would only proceed with this extra
optimization if that is actually the case. Modern CPUs are odd things.
Branch prediction/instruction pipelining, and the fact that we're
frequently stalled on cache misses might combine to make it
effectively the case that the opportunistic memcmp() is free. I could
be wrong about that, and I'm certainly wrong if you test large enough
strings with differences only towards the very end, but it seems
reasonable to speculate that it would work well with appropriate
precautions (in particular, don't do it when the strings are huge).
Let me try and come up with some numbers for a really unsympathetic
case, since you've already seen sympathetic numbers. I think the
sympathetic country/province/city sort test case [1] is actually
fairly representative; sort keys *are* frequently correlated like
that, implying that there are lots of savings to be had by being
memcmp() == 0 optimistic when resolving comparisons using the second
or subsequent attribute.

 * Leaves open the question of what to do when we can't use the
 abbreviated keys optimization just because a datum tuplesort is
 preferred when sorting single-attribute tuples (recall that datum case
 tuplesorts cannot use abbreviated keys). We want to avail of tuplesort
 datum sorting where we can, except when abbreviated keys are
 available, which presumably tips the balance in favor of heap tuple
 sorting even when sorting on only one attribute, simply because then
 we can then use abbreviated keys. I'm thinking in particular of
 nodeAgg.c, which is an important case.

 I favor leaving this issue to a future patch.  The last thing this
 patch needs is more changes that someone might potentially dislike.
 Let's focus on getting the core thing working, and then you can
 enhance it once we all agree that it is.

Makes sense. I think we should make a separate pass to enable sort
support for B-Tree sorting - that's probably the most compelling case,
after all. That's certainly the thing that I've heard complaints
about. There could be as many as 2-3 follow-up commits.

 On the substance of this issue, I suspect that for pass-by-value data
 types it can hardly be wrong to use the datum tuplesort approach; but
 it's possible we will want to disable it for pass-by-reference data
 types when the abbreviated-key infrastructure is available.  That will
 lose if it turns out that the abbreviated keys aren't capturing enough
 of the entropy, but maybe we'll decide that's OK.  Or maybe not.  But
 I don't think it's imperative that this patch make a change in that
 area, and indeed, in the interest of keeping separate changes
 isolated, I think it's better if it doesn't.

Right. I had presumed that we'd want to figure that out each time. I
wasn't sure how best to go about doing that, which is why it's

Re: [HACKERS] B-Tree support function number 3 (strxfrm() optimization)

2014-09-04 Thread Robert Haas

On Thu, Sep 4, 2014 at 2:12 PM, Peter Geoghegan p...@heroku.com wrote:
 On Thu, Sep 4, 2014 at 9:19 AM, Robert Haas robertmh...@gmail.com wrote:
 On Tue, Sep 2, 2014 at 10:27 PM, Peter Geoghegan p...@heroku.com wrote:
 * Still doesn't address the open question of whether or not we should
 optimistically always try memcmp() == 0 on tiebreak. I still lean
 towards yes.

 Let m be the cost of a memcmp() that fails near the end of the
 strings; and let s be the cost of a strcoll that does likewise.
 Clearly s  m.  But approximately what is s/m on platforms where you
 can test?  Say, with 100 byte string, in a few different locales.

 Just to be clear: I imagine you're more or less sold on the idea of
 testing equality in the event of a tie-break, where the leading 8
 primary weight bytes are already known to be equal (and the full text
 string lengths also match); the theory of operation behind testing how
 good a proxy for full key cardinality abbreviated key cardinality is
 is very much predicated on that. We can still win big with very low
 cardinality sets this way, which are an important case. What I
 consider an open question is whether or not we should do that on the
 first call when there is no abbreviated comparison, such as on the
 second or subsequent attribute in a multi-column sort, in the hope
 that equality will just happen to be indicated.

Eh, maybe?  I'm not sure why the case where we're using abbreviated
keys should be different than the case we're not.  In either case this
is a straightforward trade-off: if we do a memcmp() before strcoll(),
we win if it returns 0 and lose if returns non-zero and strcoll also
returns non-zero.  (If memcmp() returns non-zero but strcoll() returns
0, it's a tie.)  I'm not immediately sure why it should affect the
calculus one way or the other whether abbreviated keys are in use; the
question of how much faster memcmp() is than strcoll() seems like the
relevant consideration either way.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] B-Tree support function number 3 (strxfrm() optimization)

2014-09-04 Thread Peter Geoghegan

On Thu, Sep 4, 2014 at 2:18 PM, Robert Haas robertmh...@gmail.com wrote:
 Eh, maybe?  I'm not sure why the case where we're using abbreviated
 keys should be different than the case we're not.  In either case this
 is a straightforward trade-off: if we do a memcmp() before strcoll(),
 we win if it returns 0 and lose if returns non-zero and strcoll also
 returns non-zero.  (If memcmp() returns non-zero but strcoll() returns
 0, it's a tie.)  I'm not immediately sure why it should affect the
 calculus one way or the other whether abbreviated keys are in use; the
 question of how much faster memcmp() is than strcoll() seems like the
 relevant consideration either way.


Not quite. Consider my earlier example of sorting ~300,000 cities by
country only. That's a pretty low cardinality attribute. We win big,
and we are almost certain that the abbreviated key cardinality is a
perfect proxy for the full key cardinality so we stick with
abbreviated keys while copying over tuples. Sure, most comparisons
will actually be resolved with a memcmp() == 0 rather than an
abbreviated comparison, but under my ad-hoc cost model there is no
distinction, since they're both very much cheaper than a strcoll()
(particularly when we factor in the NUL termination copying that a
memcmp() == 0 also avoids). To a lesser extent we're also justified
in that optimism because we've already established that roughly the
first 8 bytes of the string are bitwise equal.

So the difference is that in the abbreviated key case, we are at least
somewhat justified in our optimism. Whereas, where we're just eliding
fmgr overhead, say on the 2nd or subsequent attribute of a multi-key
sort, it's totally opportunistic to chance a memcmp() == 0. The
latter optimization can only be justified by the fact that the
memcmp() is somewhere between dirt cheap and free. That seems like
soemthing that should significantly impact the calculus.

-- 
Peter Geoghegan


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] B-Tree support function number 3 (strxfrm() optimization)

2014-09-04 Thread Peter Geoghegan

On Thu, Sep 4, 2014 at 11:12 AM, Peter Geoghegan p...@heroku.com wrote:
 What I
 consider an open question is whether or not we should do that on the
 first call when there is no abbreviated comparison, such as on the
 second or subsequent attribute in a multi-column sort, in the hope
 that equality will just happen to be indicated.

 Let me try and come up with some numbers for a really unsympathetic
 case, since you've already seen sympathetic numbers.

So I came up with what I imagined to be an unsympathetic case:

postgres=# create table opt_eq_test as select 1 as dummy, country ||
', ' || city as country  from cities order by city;
SELECT 317102
[local]/postgres=# select * from opt_eq_test limit 5;
 dummy | country
---+--
 1 | India, 108 Kalthur
 1 | Mexico, 10 de Abril
 1 | India, 113 Thalluru
 1 | Argentina, 11 de Octubre
 1 | India, 11 Dlp
(5 rows)

I added the dummy column to prevent abbreviated keys from being used -
this is all about the question of trying to get away with a memcmp()
== 0 in all circumstances, without abbreviated keys/statistics on
attribute cardinality. This is a question that has nothing to do with
abbreviated keys in particular.

With the most recent revision of the patch, performance of a
representative query against this data stabilizes as follows on my
laptop:

LOG:  duration: 2252.500 ms  statement: select * from (select * from
opt_eq_test order by dummy, country offset 100) d;
LOG:  duration: 2261.505 ms  statement: select * from (select * from
opt_eq_test order by dummy, country offset 100) d;
LOG:  duration: 2315.903 ms  statement: select * from (select * from
opt_eq_test order by dummy, country offset 100) d;
LOG:  duration: 2260.132 ms  statement: select * from (select * from
opt_eq_test order by dummy, country offset 100) d;
LOG:  duration: 2247.340 ms  statement: select * from (select * from
opt_eq_test order by dummy, country offset 100) d;
LOG:  duration: 2246.723 ms  statement: select * from (select * from
opt_eq_test order by dummy, country offset 100) d;
LOG:  duration: 2276.297 ms  statement: select * from (select * from
opt_eq_test order by dummy, country offset 100) d;
LOG:  duration: 2241.911 ms  statement: select * from (select * from
opt_eq_test order by dummy, country offset 100) d;
LOG:  duration: 2259.540 ms  statement: select * from (select * from
opt_eq_test order by dummy, country offset 100) d;
LOG:  duration: 2248.740 ms  statement: select * from (select * from
opt_eq_test order by dummy, country offset 100) d;
LOG:  duration: 2245.913 ms  statement: select * from (select * from
opt_eq_test order by dummy, country offset 100) d;
LOG:  duration: 2230.583 ms  statement: select * from (select * from
opt_eq_test order by dummy, country offset 100) d;

If I apply the additional optimization that we're on the fence about:

--- a/src/backend/utils/adt/varlena.c
+++ b/src/backend/utils/adt/varlena.c
@@ -1967,7 +1967,7 @@ bttextfastcmp_locale(Datum x, Datum y, SortSupport ssup)
len1 = VARSIZE_ANY_EXHDR(arg1);
len2 = VARSIZE_ANY_EXHDR(arg2);

-   if (ssup-abbrev_state == ABBREVIATED_KEYS_TIE  len1 == len2)
+   if (len1 == len2)
{
/*
 * Being called as authoritative tie-breaker for an
abbreviated key

Then the equivalent numbers look like this:

LOG:  duration: 2178.220 ms  statement: select * from (select * from
opt_eq_test order by dummy, country offset 100) d;
LOG:  duration: 2175.005 ms  statement: select * from (select * from
opt_eq_test order by dummy, country offset 100) d;
LOG:  duration: 2219.648 ms  statement: select * from (select * from
opt_eq_test order by dummy, country offset 100) d;
LOG:  duration: 2174.865 ms  statement: select * from (select * from
opt_eq_test order by dummy, country offset 100) d;
LOG:  duration: 2246.387 ms  statement: select * from (select * from
opt_eq_test order by dummy, country offset 100) d;
LOG:  duration: 2234.023 ms  statement: select * from (select * from
opt_eq_test order by dummy, country offset 100) d;
LOG:  duration: 2186.957 ms  statement: select * from (select * from
opt_eq_test order by dummy, country offset 100) d;
LOG:  duration: 2177.778 ms  statement: select * from (select * from
opt_eq_test order by dummy, country offset 100) d;
LOG:  duration: 2186.709 ms  statement: select * from (select * from
opt_eq_test order by dummy, country offset 100) d;
LOG:  duration: 2171.557 ms  statement: select * from (select * from
opt_eq_test order by dummy, country offset 100) d;
LOG:  duration: 2211.822 ms  statement: select * from (select * from
opt_eq_test order by dummy, country offset 100) d;
LOG:  duration: 2224.198 ms  statement: select * from (select * from
opt_eq_test order by dummy, country offset 100) d;
LOG:  duration: 2192.506 ms  statement: select * from (select * from
opt_eq_test order

Re: [HACKERS] B-Tree support function number 3 (strxfrm() optimization)

2014-09-04 Thread Peter Geoghegan

On Thu, Sep 4, 2014 at 5:07 PM, Peter Geoghegan p...@heroku.com wrote:
 So I came up with what I imagined to be an unsympathetic case:

BTW, this cities data is still available from:

http://postgres-benchmarks.s3-website-us-east-1.amazonaws.com/data/cities.dump

-- 
Peter Geoghegan


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] B-Tree support function number 3 (strxfrm() optimization)

2014-09-03 Thread Robert Haas

On Tue, Sep 2, 2014 at 4:41 PM, Peter Geoghegan p...@heroku.com wrote:
 HyperLogLog isn't sample-based - it's useful for streaming a set and
 accurately tracking its cardinality with fixed overhead.

OK.

 Is it the right decision to suppress the abbreviated-key optimization
 unconditionally on 32-bit systems and on Darwin?  There's certainly
 more danger, on those platforms, that the optimization could fail to
 pay off.  But it could also win big, if in fact the first character or
 two of the string is enough to distinguish most rows, or if Darwin
 improves their implementation in the future.  If the other defenses
 against pathological cases in the patch are adequate, I would think
 it'd be OK to remove the hard-coded checks here and let those cases
 use the optimization or not according to its merits in particular
 cases.  We'd want to look at what the impact of that is, of course,
 but if it's bad, maybe those other defenses aren't adequate anyway.

 I'm not sure. Perhaps the Darwin thing is a bad idea because no one is
 using Macs to run real database servers. Apple haven't had a server
 product in years, and typically people only use Postgres on their Macs
 for development. We might as well have coverage of the new code for
 the benefit of Postgres hackers that favor Apple machines. Or, to look
 at it another way, the optimization is so beneficially that it's
 probably worth the risk, even for more marginal cases.

 8 primary weights (the leading 8 bytes, frequently isomorphic to the
 first 8 Latin characters, regardless of whether or not they have
 accents/diacritics, or punctuation/whitespace) is twice as many as 4.
 But every time you add a byte of space to the abbreviated
 representation that can resolve a comparison, the number of
 unresolvable-without-tiebreak comparisons (in general) is, I imagine,
 reduced considerably. Basically, 8 bytes is way better than twice as
 good as 4 bytes in terms of its effect on the proportion of
 comparisons that are resolved only with abbreviated keys. Even still,
 I suspect it's still worth it to apply the optimization with only 4.

 You've seen plenty of suggestions on assessing the applicability of
 the optimization from me. Perhaps you have a few of your own.

My suggestion is to remove the special cases for Darwin and 32-bit
systems and see how it goes.

 That wouldn't be harmless - it would probably result in incorrect
 answers in practice, and would certainly be unspecified. However, I'm
 not reading uninitialized bytes. I call memset() so that in the event
 of the final strxfrm() blob being less than 8 bytes (which can happen
 even on glibc with en_US.UTF-8). It cannot be harmful to memcmp()
 every Datum byte if the remaining bytes are always initialized to NUL.

OK.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

1 2 3 >

1 - 100 of 234 matches

Mail list logo