Re: [HACKERS] Generalizing SortSupport for text to work with char(n), bytea, and alternative opclasses
On Fri, Feb 5, 2016 at 6:14 AM, Peter Geoghegan wrote: > On Wed, Feb 3, 2016 at 11:31 AM, Robert Haas wrote: >> Thanks for the review. I fixed the OID conflict, tweaked a few >> comments, and committed this. > > Thanks. I noticed a tiny, preexisting typo in the string abbreviated > key code. The attached patch fixes it. Gosh, I must have looked at that line 10 times without seeing that mistake. Thanks, committed. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Generalizing SortSupport for text to work with char(n), bytea, and alternative opclasses
On Wed, Feb 3, 2016 at 11:31 AM, Robert Haas wrote: > Thanks for the review. I fixed the OID conflict, tweaked a few > comments, and committed this. Thanks. I noticed a tiny, preexisting typo in the string abbreviated key code. The attached patch fixes it. -- Peter Geoghegan diff --git a/src/backend/utils/adt/varlena.c b/src/backend/utils/adt/varlena.c index 1a74e5e..5e7536a 100644 --- a/src/backend/utils/adt/varlena.c +++ b/src/backend/utils/adt/varlena.c @@ -2154,7 +2154,7 @@ varstr_abbrev_convert(Datum original, SortSupport ssup) len = bpchartruelen(authoritative_data, len); /* - * If we're using the C collation, use memcmp(), rather than strxfrm(), to + * If we're using the C collation, use memcpy(), rather than strxfrm(), to * abbreviate keys. The full comparator for the C locale is always * memcmp(). It would be incorrect to allow bytea callers (callers that * always force the C collation -- bytea isn't a collatable type, but this -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Generalizing SortSupport for text to work with char(n), bytea, and alternative opclasses
On Sun, Jan 31, 2016 at 7:59 PM, Andreas Karlsson wrote: > Nice work, I like your sorting patches. Thanks. I like your reviews of my sorting patches. :-) -- Peter Geoghegan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Generalizing SortSupport for text to work with char(n), bytea, and alternative opclasses
On Sun, Jan 31, 2016 at 10:59 PM, Andreas Karlsson wrote: > I have reviewed this now and I think this is a useful addition even though > these indexes are less common. Consistent behavior is worth a lot in my mind > and this patch is reasonably small. > > The patch no longer applies due to 1) oid collisions and 2) a trivial > conflict. When I fixed those two the test suite passed. > > I tested this patch by building indexes with all the typess and got nice > measurable speedups. > > Logically I think the patch makes sense and the code seems to be correct, > but I have some comments on it. > > - You use two names a lot "string" vs "varstr". What is the difference > between those? Is there any reason for not using varstr consistently? > > - You have a lot of renaming as has been mentioned previously in this > thread. I do not care strongly for it either way, but it did make it harder > to spot what the patch changed. If it was my own project I would have > considered splitting the patch into two parts, one renaming everything and > another adding the new feature, but the PostgreSQL seem to often prefer > having one commit per feature. Do as you wish here. > > - I think the comment about NUL bytes in varstr_abbrev_convert makes it seem > like the handling is much more complicated than it actually is. I did not > understand it after a couple of readings and had to read the code understand > what it was talking about. > > Nice work, I like your sorting patches. Thanks for the review. I fixed the OID conflict, tweaked a few comments, and committed this. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Generalizing SortSupport for text to work with char(n), bytea, and alternative opclasses
Hi, I have reviewed this now and I think this is a useful addition even though these indexes are less common. Consistent behavior is worth a lot in my mind and this patch is reasonably small. The patch no longer applies due to 1) oid collisions and 2) a trivial conflict. When I fixed those two the test suite passed. I tested this patch by building indexes with all the typess and got nice measurable speedups. Logically I think the patch makes sense and the code seems to be correct, but I have some comments on it. - You use two names a lot "string" vs "varstr". What is the difference between those? Is there any reason for not using varstr consistently? - You have a lot of renaming as has been mentioned previously in this thread. I do not care strongly for it either way, but it did make it harder to spot what the patch changed. If it was my own project I would have considered splitting the patch into two parts, one renaming everything and another adding the new feature, but the PostgreSQL seem to often prefer having one commit per feature. Do as you wish here. - I think the comment about NUL bytes in varstr_abbrev_convert makes it seem like the handling is much more complicated than it actually is. I did not understand it after a couple of readings and had to read the code understand what it was talking about. Nice work, I like your sorting patches. Andreas -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Generalizing SortSupport for text to work with char(n), bytea, and alternative opclasses
On Thu, Jan 7, 2016 at 7:41 AM, Alvaro Herrera wrote: > You're stealthily introducing a new abstraction called "string", > including a typedef and DatumGetString support macros. Is that really > necessary? Shouldn't it be discussed specifically? I don't necessarily > oppose it as is, mainly because it's limited to within varlena.c for > now, but I'm not sure it'd okay to make it more public. Note that a similar abstraction for the "unknown" type also exists only within varlena.c. So, DatumGetStringP() and so on appear right alongside DatumGetUnknownP() and so on. The idea of the "string" abstraction is that is advertises that certain functions could equally well apply to a variety of "varlena header + some bytes" types. I thought about just using the varlena type instead, but preferred the "string" abstraction. > Also, there's a lot of churn in this patch to just remove tss to sss. > Can't we just keep the original variable name? I think that minimizing lines changed in a mechanical way by a commit is overrated as a goal for Postgres patches, but I don't feel too strongly about holding on to the "churn" in this patch. I attach a new revision, which has the changes I outlined to code comments. I haven't minimized the differences in the way you suggest just yet. -- Peter Geoghegan From b18d45437e99657fed990edf718c21ad6d212970 Mon Sep 17 00:00:00 2001 From: Peter Geoghegan Date: Sun, 8 Nov 2015 15:34:32 -0800 Subject: [PATCH] Generalize SortSupport for text Expose interface that allows char(n) and bytea types to piggyback on a now-generalized SortSupport for text. This pushes a little more knowledge of the bpchar/char(n) type into varlena.c than might be preferred, but that seems like the approach that creates least friction. Also, accelerate index builds that use the text_pattern_ops opfamily (text_pattern_ops and varchar_pattern_ops opclasses), and the bpchar_pattern_ops opfamily (which has one opclass, also named bpchar_pattern_ops) by introducing SortSupport. These work just like the bytea caller of the generalized SortSupport for text interface -- the "C" locale is forced. --- doc/src/sgml/datatype.sgml | 3 +- src/backend/utils/adt/varchar.c | 55 +- src/backend/utils/adt/varlena.c | 391 +++- src/include/catalog/pg_amproc.h | 4 + src/include/catalog/pg_proc.h | 8 + src/include/utils/builtins.h| 6 + src/include/utils/bytea.h | 1 + 7 files changed, 334 insertions(+), 134 deletions(-) diff --git a/doc/src/sgml/datatype.sgml b/doc/src/sgml/datatype.sgml index d1db0d2..6513b21 100644 --- a/doc/src/sgml/datatype.sgml +++ b/doc/src/sgml/datatype.sgml @@ -1140,8 +1140,7 @@ SELECT '52093.89'::money::numeric::float8; advantages in some other database systems, there is no such advantage in PostgreSQL; in fact character(n) is usually the slowest of - the three because of its additional storage costs and slower - sorting. In most situations + the three because of its additional storage costs. In most situations text or character varying should be used instead. diff --git a/src/backend/utils/adt/varchar.c b/src/backend/utils/adt/varchar.c index 0498fef..94d6da5 100644 --- a/src/backend/utils/adt/varchar.c +++ b/src/backend/utils/adt/varchar.c @@ -17,6 +17,7 @@ #include "access/hash.h" #include "access/tuptoaster.h" +#include "catalog/pg_collation.h" #include "libpq/pqformat.h" #include "nodes/nodeFuncs.h" #include "utils/array.h" @@ -649,14 +650,21 @@ varchartypmodout(PG_FUNCTION_ARGS) */ /* "True" length (not counting trailing blanks) of a BpChar */ -static int +static inline int bcTruelen(BpChar *arg) { - char *s = VARDATA_ANY(arg); + return bpchartruelen(VARDATA_ANY(arg), VARSIZE_ANY_EXHDR(arg)); +} + +int +bpchartruelen(char *s, int len) +{ int i; - int len; - len = VARSIZE_ANY_EXHDR(arg); + /* + * Note that we rely on the assumption that ' ' is a singleton unit on + * every supported multibyte server encoding. + */ for (i = len - 1; i >= 0; i--) { if (s[i] != ' ') @@ -859,6 +867,23 @@ bpcharcmp(PG_FUNCTION_ARGS) } Datum +bpchar_sortsupport(PG_FUNCTION_ARGS) +{ + SortSupport ssup = (SortSupport) PG_GETARG_POINTER(0); + Oid collid = ssup->ssup_collation; + MemoryContext oldcontext; + + oldcontext = MemoryContextSwitchTo(ssup->ssup_cxt); + + /* Use generic string SortSupport */ + varstr_sortsupport(ssup, collid, true); + + MemoryContextSwitchTo(oldcontext); + + PG_RETURN_VOID(); +} + +Datum bpchar_larger(PG_FUNCTION_ARGS) { BpChar *arg1 = PG_GETARG_BPCHAR_PP(0); @@ -926,8 +951,9 @@ hashbpchar(PG_FUNCTION_ARGS) /* * The following operators support character-by-character comparison * of bpchar datums, to allow building indexes suitable for LIKE clauses. - * Note that the regular bpchareq/bpcharne comparison operators are assumed - * to be compatible wi
Re: [HACKERS] Generalizing SortSupport for text to work with char(n), bytea, and alternative opclasses
Peter Geoghegan wrote: > We lack SortSupport for many character-like-type cases. In full, the > cases within the core system are: You're stealthily introducing a new abstraction called "string", including a typedef and DatumGetString support macros. Is that really necessary? Shouldn't it be discussed specifically? I don't necessarily oppose it as is, mainly because it's limited to within varlena.c for now, but I'm not sure it'd okay to make it more public. Also, there's a lot of churn in this patch to just remove tss to sss. Can't we just keep the original variable name? -- Álvaro Herrerahttp://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] Generalizing SortSupport for text to work with char(n), bytea, and alternative opclasses
We lack SortSupport for many character-like-type cases. In full, the cases within the core system are: * char(n) opfamily (bpchar_ops). * text_pattern_ops opfamily (includes text and varchar "pattern" opclasses, which are generally recommended for accelerating LIKE operator queries). * bpchar_pattern_ops -- the operator family/class for char(n), used where "pattern" style indexing is required for the char(n) type. * bytea default opclass. This is a type that, like the others, shares its representation with text (a varlena header and some data bytes -- a string, essentially). Its comparator already behaves identically to that of the text comparator when the "C" collation is used. I've actually seen a specific request for this [1]. These cases do matter. For one thing, even if they're less important than the default text/varchar opclasses, having such large inconsistencies in character type sort performance is a fairly major POLA violation; opclasses like text_pattern_ops are *supposed* to be faster though less capable alternatives to the defaults. For another, char(n) is in the SQL standard, which may be why all TPC benchmarks use char(n) for columns that are sorted on or used for grouping. char(n) sorting can be made about 8x faster with SortSupport/abbreviation, making it the best candidate for abbreviation optimization that I've ever seen. It would be regrettable if we accidentally lost a benchmark due to not having char(n) SortSupport. Attached patch adds SortSupport for all of the cases listed above. I did not bother doing anything with contrib/citext, because I think it's not worth it. I think that we should definitely invest in case insensitive collations, and retire contrib/citext. The fact that the *default* collation (and not the input collation) is passed by citextcmp()'s call to str_tolower() suggests to me that the only way to make citext do the right thing is to basically introduce case insensitive collations to Postgres. Of course, doing so would make contrib/citext obsolete. I also didn't bother extending the name type's SortSupport (something that has existed since 9.2) to perform abbreviation, although that wouldn't be very hard. I saw no point. I think I might also get around to adding abbreviated key support for the network address types during the 9.6 cycle. That seems like something a less experienced contributor could easily pick up, though -- if anyone wants to take it off my hands, please do so. [1] https://lwn.net/Articles/653721/ -- Peter Geoghegan From 4b6cfc423ab2d586af987a410b2706cdfc02d9fb Mon Sep 17 00:00:00 2001 From: Peter Geoghegan Date: Sun, 8 Nov 2015 15:34:32 -0800 Subject: [PATCH] Generalize SortSupport for text Expose interface that allows char(n) and bytea types to piggyback on a now-generalized SortSupport for text. This pushes a little more knowledge of the bpchar/char(n) type into varlena.c than might be preferred, but that seems like the approach that creates least friction. Also, accelerate index builds that use the text_pattern_ops opfamily (text_pattern_ops and varchar_pattern_ops opclasses), and the bpchar_pattern_ops opfamily (which has one opclass, also named bpchar_pattern_ops) by introducing SortSupport. These work just like the bytea caller of the generalized SortSupport for text interface -- the "C" locale is forced. --- doc/src/sgml/datatype.sgml | 3 +- src/backend/utils/adt/varchar.c | 56 ++- src/backend/utils/adt/varlena.c | 357 ++-- src/include/catalog/pg_amproc.h | 4 + src/include/catalog/pg_proc.h | 8 + src/include/utils/builtins.h| 6 + src/include/utils/bytea.h | 1 + 7 files changed, 303 insertions(+), 132 deletions(-) diff --git a/doc/src/sgml/datatype.sgml b/doc/src/sgml/datatype.sgml index b5191f4..0ee358b 100644 --- a/doc/src/sgml/datatype.sgml +++ b/doc/src/sgml/datatype.sgml @@ -1140,8 +1140,7 @@ SELECT '52093.89'::money::numeric::float8; advantages in some other database systems, there is no such advantage in PostgreSQL; in fact character(n) is usually the slowest of - the three because of its additional storage costs and slower - sorting. In most situations + the three because of its additional storage costs. In most situations text or character varying should be used instead. diff --git a/src/backend/utils/adt/varchar.c b/src/backend/utils/adt/varchar.c index df9a2d7..b3a1cdf 100644 --- a/src/backend/utils/adt/varchar.c +++ b/src/backend/utils/adt/varchar.c @@ -17,10 +17,12 @@ #include "access/hash.h" #include "access/tuptoaster.h" +#include "catalog/pg_collation.h" #include "libpq/pqformat.h" #include "nodes/nodeFuncs.h" #include "utils/array.h" #include "utils/builtins.h" +#include "utils/sortsupport.h" #include "mb/pg_wchar.h" @@ -649,14 +651,21 @@ varchartypmodout(PG_FUNCTION_ARGS) */ /* "True" length