Re: [HACKERS] Generalizing SortSupport for text to work with char(n), bytea, and alternative opclasses

2016-02-05 Thread Robert Haas
On Fri, Feb 5, 2016 at 6:14 AM, Peter Geoghegan  wrote:
> On Wed, Feb 3, 2016 at 11:31 AM, Robert Haas  wrote:
>> Thanks for the review.  I fixed the OID conflict, tweaked a few
>> comments, and committed this.
>
> Thanks. I noticed a tiny, preexisting typo in the string abbreviated
> key code. The attached patch fixes it.

Gosh, I must have looked at that line 10 times without seeing that
mistake.  Thanks, committed.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Generalizing SortSupport for text to work with char(n), bytea, and alternative opclasses

2016-02-05 Thread Peter Geoghegan
On Wed, Feb 3, 2016 at 11:31 AM, Robert Haas  wrote:
> Thanks for the review.  I fixed the OID conflict, tweaked a few
> comments, and committed this.

Thanks. I noticed a tiny, preexisting typo in the string abbreviated
key code. The attached patch fixes it.

-- 
Peter Geoghegan
diff --git a/src/backend/utils/adt/varlena.c b/src/backend/utils/adt/varlena.c
index 1a74e5e..5e7536a 100644
--- a/src/backend/utils/adt/varlena.c
+++ b/src/backend/utils/adt/varlena.c
@@ -2154,7 +2154,7 @@ varstr_abbrev_convert(Datum original, SortSupport ssup)
 		len = bpchartruelen(authoritative_data, len);
 
 	/*
-	 * If we're using the C collation, use memcmp(), rather than strxfrm(), to
+	 * If we're using the C collation, use memcpy(), rather than strxfrm(), to
 	 * abbreviate keys.  The full comparator for the C locale is always
 	 * memcmp().  It would be incorrect to allow bytea callers (callers that
 	 * always force the C collation -- bytea isn't a collatable type, but this

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Generalizing SortSupport for text to work with char(n), bytea, and alternative opclasses

2016-02-04 Thread Peter Geoghegan
On Sun, Jan 31, 2016 at 7:59 PM, Andreas Karlsson  wrote:
> Nice work, I like your sorting patches.

Thanks. I like your reviews of my sorting patches. :-)

-- 
Peter Geoghegan


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Generalizing SortSupport for text to work with char(n), bytea, and alternative opclasses

2016-02-03 Thread Robert Haas
On Sun, Jan 31, 2016 at 10:59 PM, Andreas Karlsson  wrote:
> I have reviewed this now and I think this is a useful addition even though
> these indexes are less common. Consistent behavior is worth a lot in my mind
> and this patch is reasonably small.
>
> The patch no longer applies due to 1) oid collisions and 2) a trivial
> conflict. When I fixed those two the test suite passed.
>
> I tested this patch by building indexes with all the typess and got nice
> measurable speedups.
>
> Logically I think the patch makes sense and the code seems to be correct,
> but I have some comments on it.
>
> - You use two names a lot "string" vs "varstr". What is the difference
> between those? Is there any reason for not using varstr consistently?
>
> - You have a lot of renaming as has been mentioned previously in this
> thread. I do not care strongly for it either way, but it did make it harder
> to spot what the patch changed. If it was my own project I would have
> considered splitting the patch into two parts, one renaming everything and
> another adding the new feature, but the PostgreSQL seem to often prefer
> having one commit per feature. Do as you wish here.
>
> - I think the comment about NUL bytes in varstr_abbrev_convert makes it seem
> like the handling is much more complicated than it actually is. I did not
> understand it after a couple of readings and had to read the code understand
> what it was talking about.
>
> Nice work, I like your sorting patches.

Thanks for the review.  I fixed the OID conflict, tweaked a few
comments, and committed this.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Generalizing SortSupport for text to work with char(n), bytea, and alternative opclasses

2016-01-31 Thread Andreas Karlsson

Hi,

I have reviewed this now and I think this is a useful addition even 
though these indexes are less common. Consistent behavior is worth a lot 
in my mind and this patch is reasonably small.


The patch no longer applies due to 1) oid collisions and 2) a trivial 
conflict. When I fixed those two the test suite passed.


I tested this patch by building indexes with all the typess and got nice 
measurable speedups.


Logically I think the patch makes sense and the code seems to be 
correct, but I have some comments on it.


- You use two names a lot "string" vs "varstr". What is the difference 
between those? Is there any reason for not using varstr consistently?


- You have a lot of renaming as has been mentioned previously in this 
thread. I do not care strongly for it either way, but it did make it 
harder to spot what the patch changed. If it was my own project I would 
have considered splitting the patch into two parts, one renaming 
everything and another adding the new feature, but the PostgreSQL seem 
to often prefer having one commit per feature. Do as you wish here.


- I think the comment about NUL bytes in varstr_abbrev_convert makes it 
seem like the handling is much more complicated than it actually is. I 
did not understand it after a couple of readings and had to read the 
code understand what it was talking about.


Nice work, I like your sorting patches.

Andreas


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Generalizing SortSupport for text to work with char(n), bytea, and alternative opclasses

2016-01-07 Thread Peter Geoghegan
On Thu, Jan 7, 2016 at 7:41 AM, Alvaro Herrera  wrote:
> You're stealthily introducing a new abstraction called "string",
> including a typedef and DatumGetString support macros.  Is that really
> necessary?  Shouldn't it be discussed specifically?  I don't necessarily
> oppose it as is, mainly because it's limited to within varlena.c for
> now, but I'm not sure it'd okay to make it more public.

Note that a similar abstraction for the "unknown" type also exists
only within varlena.c.  So, DatumGetStringP() and so on appear right
alongside DatumGetUnknownP() and so on.

The idea of the "string" abstraction is that is advertises that
certain functions could equally well apply to a variety of "varlena
header + some bytes" types. I thought about just using the varlena
type instead, but preferred the "string" abstraction.

> Also, there's a lot of churn in this patch to just remove tss to sss.
> Can't we just keep the original variable name?

I think that minimizing lines changed in a mechanical way by a commit
is overrated as a goal for Postgres patches, but I don't feel too
strongly about holding on to the "churn" in this patch. I attach a new
revision, which has the changes I outlined to code comments. I haven't
minimized the differences in the way you suggest just yet.

-- 
Peter Geoghegan
From b18d45437e99657fed990edf718c21ad6d212970 Mon Sep 17 00:00:00 2001
From: Peter Geoghegan 
Date: Sun, 8 Nov 2015 15:34:32 -0800
Subject: [PATCH] Generalize SortSupport for text

Expose interface that allows char(n) and bytea types to piggyback on a
now-generalized SortSupport for text.  This pushes a little more
knowledge of the bpchar/char(n) type into varlena.c than might be
preferred, but that seems like the approach that creates least friction.

Also, accelerate index builds that use the text_pattern_ops opfamily
(text_pattern_ops and varchar_pattern_ops opclasses), and the
bpchar_pattern_ops opfamily (which has one opclass, also named
bpchar_pattern_ops) by introducing SortSupport.  These work just like
the bytea caller of the generalized SortSupport for text interface --
the "C" locale is forced.
---
 doc/src/sgml/datatype.sgml  |   3 +-
 src/backend/utils/adt/varchar.c |  55 +-
 src/backend/utils/adt/varlena.c | 391 +++-
 src/include/catalog/pg_amproc.h |   4 +
 src/include/catalog/pg_proc.h   |   8 +
 src/include/utils/builtins.h|   6 +
 src/include/utils/bytea.h   |   1 +
 7 files changed, 334 insertions(+), 134 deletions(-)

diff --git a/doc/src/sgml/datatype.sgml b/doc/src/sgml/datatype.sgml
index d1db0d2..6513b21 100644
--- a/doc/src/sgml/datatype.sgml
+++ b/doc/src/sgml/datatype.sgml
@@ -1140,8 +1140,7 @@ SELECT '52093.89'::money::numeric::float8;
  advantages in some other database systems, there is no such advantage in
  PostgreSQL; in fact
  character(n) is usually the slowest of
- the three because of its additional storage costs and slower
- sorting.  In most situations
+ the three because of its additional storage costs.  In most situations
  text or character varying should be used
  instead.
 
diff --git a/src/backend/utils/adt/varchar.c b/src/backend/utils/adt/varchar.c
index 0498fef..94d6da5 100644
--- a/src/backend/utils/adt/varchar.c
+++ b/src/backend/utils/adt/varchar.c
@@ -17,6 +17,7 @@
 
 #include "access/hash.h"
 #include "access/tuptoaster.h"
+#include "catalog/pg_collation.h"
 #include "libpq/pqformat.h"
 #include "nodes/nodeFuncs.h"
 #include "utils/array.h"
@@ -649,14 +650,21 @@ varchartypmodout(PG_FUNCTION_ARGS)
  */
 
 /* "True" length (not counting trailing blanks) of a BpChar */
-static int
+static inline int
 bcTruelen(BpChar *arg)
 {
-	char	   *s = VARDATA_ANY(arg);
+	return bpchartruelen(VARDATA_ANY(arg), VARSIZE_ANY_EXHDR(arg));
+}
+
+int
+bpchartruelen(char *s, int len)
+{
 	int			i;
-	int			len;
 
-	len = VARSIZE_ANY_EXHDR(arg);
+	/*
+	 * Note that we rely on the assumption that ' ' is a singleton unit on
+	 * every supported multibyte server encoding.
+	 */
 	for (i = len - 1; i >= 0; i--)
 	{
 		if (s[i] != ' ')
@@ -859,6 +867,23 @@ bpcharcmp(PG_FUNCTION_ARGS)
 }
 
 Datum
+bpchar_sortsupport(PG_FUNCTION_ARGS)
+{
+	SortSupport ssup = (SortSupport) PG_GETARG_POINTER(0);
+	Oid			collid = ssup->ssup_collation;
+	MemoryContext oldcontext;
+
+	oldcontext = MemoryContextSwitchTo(ssup->ssup_cxt);
+
+	/* Use generic string SortSupport */
+	varstr_sortsupport(ssup, collid, true);
+
+	MemoryContextSwitchTo(oldcontext);
+
+	PG_RETURN_VOID();
+}
+
+Datum
 bpchar_larger(PG_FUNCTION_ARGS)
 {
 	BpChar	   *arg1 = PG_GETARG_BPCHAR_PP(0);
@@ -926,8 +951,9 @@ hashbpchar(PG_FUNCTION_ARGS)
 /*
  * The following operators support character-by-character comparison
  * of bpchar datums, to allow building indexes suitable for LIKE clauses.
- * Note that the regular bpchareq/bpcharne comparison operators are assumed
- * to be compatible wi

Re: [HACKERS] Generalizing SortSupport for text to work with char(n), bytea, and alternative opclasses

2016-01-07 Thread Alvaro Herrera
Peter Geoghegan wrote:
> We lack SortSupport for many character-like-type cases. In full, the
> cases within the core system are:

You're stealthily introducing a new abstraction called "string",
including a typedef and DatumGetString support macros.  Is that really
necessary?  Shouldn't it be discussed specifically?  I don't necessarily
oppose it as is, mainly because it's limited to within varlena.c for
now, but I'm not sure it'd okay to make it more public.

Also, there's a lot of churn in this patch to just remove tss to sss.
Can't we just keep the original variable name?

-- 
Álvaro Herrerahttp://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] Generalizing SortSupport for text to work with char(n), bytea, and alternative opclasses

2015-11-12 Thread Peter Geoghegan
We lack SortSupport for many character-like-type cases. In full, the
cases within the core system are:

* char(n) opfamily (bpchar_ops).

* text_pattern_ops opfamily (includes text and varchar "pattern"
opclasses, which are generally recommended for accelerating LIKE
operator queries).

* bpchar_pattern_ops -- the operator family/class for char(n), used
where "pattern" style indexing is required for the char(n) type.

* bytea default opclass. This is a type that, like the others, shares
its representation with text (a varlena header and some data bytes --
a string, essentially). Its comparator already behaves identically to
that of the text comparator when the "C" collation is used. I've
actually seen a specific request for this [1].

These cases do matter. For one thing, even if they're less important
than the default text/varchar opclasses, having such large
inconsistencies in character type sort performance is a fairly major
POLA violation; opclasses like text_pattern_ops are *supposed* to be
faster though less capable alternatives to the defaults. For another,
char(n) is in the SQL standard, which may be why all TPC benchmarks
use char(n) for columns that are sorted on or used for grouping.
char(n) sorting can be made about 8x faster with
SortSupport/abbreviation, making it the best candidate for
abbreviation optimization that I've ever seen. It would be regrettable
if we accidentally lost a benchmark due to not having char(n)
SortSupport.

Attached patch adds SortSupport for all of the cases listed above.

I did not bother doing anything with contrib/citext, because I think
it's not worth it. I think that we should definitely invest in case
insensitive collations, and retire contrib/citext. The fact that the
*default* collation (and not the input collation) is passed by
citextcmp()'s call to str_tolower() suggests to me that the only way
to make citext do the right thing is to basically introduce case
insensitive collations to Postgres. Of course, doing so would make
contrib/citext obsolete.

I also didn't bother extending the name type's SortSupport (something
that has existed since 9.2) to perform abbreviation, although that
wouldn't be very hard. I saw no point.

I think I might also get around to adding abbreviated key support for
the network address types during the 9.6 cycle. That seems like
something a less experienced contributor could easily pick up, though
-- if anyone wants to take it off my hands, please do so.

[1] https://lwn.net/Articles/653721/
-- 
Peter Geoghegan
From 4b6cfc423ab2d586af987a410b2706cdfc02d9fb Mon Sep 17 00:00:00 2001
From: Peter Geoghegan 
Date: Sun, 8 Nov 2015 15:34:32 -0800
Subject: [PATCH] Generalize SortSupport for text

Expose interface that allows char(n) and bytea types to piggyback on a
now-generalized SortSupport for text.  This pushes a little more
knowledge of the bpchar/char(n) type into varlena.c than might be
preferred, but that seems like the approach that creates least friction.

Also, accelerate index builds that use the text_pattern_ops opfamily
(text_pattern_ops and varchar_pattern_ops opclasses), and the
bpchar_pattern_ops opfamily (which has one opclass, also named
bpchar_pattern_ops) by introducing SortSupport.  These work just like
the bytea caller of the generalized SortSupport for text interface --
the "C" locale is forced.
---
 doc/src/sgml/datatype.sgml  |   3 +-
 src/backend/utils/adt/varchar.c |  56 ++-
 src/backend/utils/adt/varlena.c | 357 ++--
 src/include/catalog/pg_amproc.h |   4 +
 src/include/catalog/pg_proc.h   |   8 +
 src/include/utils/builtins.h|   6 +
 src/include/utils/bytea.h   |   1 +
 7 files changed, 303 insertions(+), 132 deletions(-)

diff --git a/doc/src/sgml/datatype.sgml b/doc/src/sgml/datatype.sgml
index b5191f4..0ee358b 100644
--- a/doc/src/sgml/datatype.sgml
+++ b/doc/src/sgml/datatype.sgml
@@ -1140,8 +1140,7 @@ SELECT '52093.89'::money::numeric::float8;
  advantages in some other database systems, there is no such advantage in
  PostgreSQL; in fact
  character(n) is usually the slowest of
- the three because of its additional storage costs and slower
- sorting.  In most situations
+ the three because of its additional storage costs.  In most situations
  text or character varying should be used
  instead.
 
diff --git a/src/backend/utils/adt/varchar.c b/src/backend/utils/adt/varchar.c
index df9a2d7..b3a1cdf 100644
--- a/src/backend/utils/adt/varchar.c
+++ b/src/backend/utils/adt/varchar.c
@@ -17,10 +17,12 @@
 
 #include "access/hash.h"
 #include "access/tuptoaster.h"
+#include "catalog/pg_collation.h"
 #include "libpq/pqformat.h"
 #include "nodes/nodeFuncs.h"
 #include "utils/array.h"
 #include "utils/builtins.h"
+#include "utils/sortsupport.h"
 #include "mb/pg_wchar.h"
 
 
@@ -649,14 +651,21 @@ varchartypmodout(PG_FUNCTION_ARGS)
  */
 
 /* "True" length