Re: [HACKERS] make_greater_string() does not return a string in some cases
This is an update of a patch for NEXT CommitFest 2011/09. Please ignore this message. 1 Additional Feature - EUC-JP incrementer 2 Bug fixes - bytea incrementer, libpq compilation. -- Kyotaro Horiguchi NTT Open Source Software Center diff --git a/src/backend/utils/adt/selfuncs.c b/src/backend/utils/adt/selfuncs.c index 10b73fb..48a58a0 100644 --- a/src/backend/utils/adt/selfuncs.c +++ b/src/backend/utils/adt/selfuncs.c @@ -5502,6 +5502,18 @@ pattern_selectivity(Const *patt, Pattern_Type ptype) /* + * This function is "character increment" function for bytea used in + * make_greater_string() that has same interface with pg_wchar_tbl.charinc. + */ +static bool byte_increment(unsigned char *ptr, int len) +{ + if (*ptr >= 255) return false; + + (*ptr)++; + return true; +} + +/* * Try to generate a string greater than the given string or any * string it is a prefix of. If successful, return a palloc'd string * in the form of a Const node; else return NULL. @@ -5540,6 +5552,7 @@ make_greater_string(const Const *str_const, FmgrInfo *ltproc, Oid collation) int len; Datum cmpstr; text *cmptxt = NULL; + character_incrementer charincfunc; /* * Get a modifiable copy of the prefix string in C-string format, and set @@ -5601,27 +5614,38 @@ make_greater_string(const Const *str_const, FmgrInfo *ltproc, Oid collation) } } + if (datatype != BYTEAOID) + charincfunc = pg_database_encoding_character_incrementer(); + else + charincfunc = &byte_increment; + while (len > 0) { - unsigned char *lastchar = (unsigned char *) (workstr + len - 1); - unsigned char savelastchar = *lastchar; + int charlen; + unsigned char *lastchar; + unsigned char savelastbyte; + Const *workstr_const; + + if (datatype == BYTEAOID) + charlen = 1; + else + charlen = len - pg_mbcliplen(workstr, len, len - 1); + + lastchar = (unsigned char *) (workstr + len - charlen); /* -* Try to generate a larger string by incrementing the last byte. +* savelastbyte has meaning only for datatype == BYTEAOID */ - while (*lastchar < (unsigned char) 255) - { - Const *workstr_const; + savelastbyte = *lastchar; - (*lastchar)++; + /* +* Try to generate a larger string by incrementing the last byte or +* character. +*/ + if (charincfunc(lastchar, charlen)) { if (datatype != BYTEAOID) - { - /* do not generate invalid encoding sequences */ - if (!pg_verifymbstr(workstr, len, true)) - continue; workstr_const = string_to_const(workstr, datatype); - } else workstr_const = string_to_bytea_const(workstr, len); @@ -5636,26 +5660,17 @@ make_greater_string(const Const *str_const, FmgrInfo *ltproc, Oid collation) pfree(workstr); return workstr_const; } - + /* No good, release unusable value and try again */ pfree(DatumGetPointer(workstr_const->constvalue)); pfree(workstr_const); } - /* restore last byte so we don't confuse pg_mbcliplen */ - *lastchar = savelastchar; - /* -* Truncate off the last character, which might be more than 1 byte, -* depending on the character encoding. +* Truncate off the last character or restore last byte for BYTEA. */ - if (datatype != BYTEAOID && pg_database_encoding_max_length() > 1) - len = pg_mbcliplen(workstr, len, len - 1); - else - len -= 1; - - if (datatype != BYTEAOID) - workstr[len] = '\0'; + len -= charlen; + workstr[len] = (datatype != BYTEAOID ? '\0' : savelastbyte); } /* Failed... */ diff --git a/src/backend/utils/mb/wchar.c b/src/backend/utils/mb/wchar.c index 5b0cf62..8505bcb 100644 --- a/src/backend/utils/mb/wchar.c +++ b/src/backend/utils/mb/wchar.c @@ -1336,53 +1336,254 @@ pg_utf8_islegal(const unsigned char *source, int length) /* *--- + *
Re: [HACKERS] make_greater_string() does not return a string in some cases
Hello, Could you let me go on with this topic? It is hard to ignore this glitch for us using CJK - Chinese, Japanese, and Korean - characters on databse.. Maybe.. Saying on Japanese under the standard usage, about a hundred characters out of seven thousand make make_greater_string() fail. This is not so frequent to happen but also not as rare as ignorable. I think this glitch is caused because the method to derive the `next character' is fundamentally a secret of each encoding but now it is done in make_greater_string() using the method extended from that of 1 byte ASCII charset for all encodings together. So, I think it is reasonable that encoding info table (struct pg_wchar_tbl) holds the function to do that. How about this idea? Points to realize this follows, - pg_wchar_tbl@pg_wchar.c has new element `charinc' that holds a function to increment a character of this encoding. - Basically, the value of charinc is a `generic' increment function that does what make_greater_string() does in current implement. - make_greater_string() now uses charinc for database encoding to increment characters instead of the code directly written in it. - Give UTF-8 a special increment function. As a consequence of this modification, make_greater_string() looks somewhat simple thanks to disappearing of the sequence that handles bare bytes in string. And doing `increment character' with the knowledge of the encoding can be straightforward and light and backtrack-free, and have fewer glitches than the generic method. # But the process for BYTEAOID remains there dissapointingly. There still remains some glitches but I think it is overdo to do conversion that changes the length of the character. Only 5 points out of 17 thousands (in current method, roughly for all BMP characters) remains, and none of them are not Japanese character :-) The attached patch is sample implement of this idea. What do you think about this patch? -- Kyotaro Horiguchi NTT Open Source Software Center diff --git a/src/backend/utils/adt/selfuncs.c b/src/backend/utils/adt/selfuncs.c index 10b73fb..4151ce2 100644 --- a/src/backend/utils/adt/selfuncs.c +++ b/src/backend/utils/adt/selfuncs.c @@ -5502,6 +5502,16 @@ pattern_selectivity(Const *patt, Pattern_Type ptype) /* + * This function is "character increment" function for bytea used in + * make_greater_string() that has same interface with pg_wchar_tbl.charinc. + */ +static bool byte_increment(unsigned char *ptr, int len) +{ + (*ptr)--; + return true; +} + +/* * Try to generate a string greater than the given string or any * string it is a prefix of. If successful, return a palloc'd string * in the form of a Const node; else return NULL. @@ -5540,6 +5550,7 @@ make_greater_string(const Const *str_const, FmgrInfo *ltproc, Oid collation) int len; Datum cmpstr; text *cmptxt = NULL; + character_incrementer charincfunc; /* * Get a modifiable copy of the prefix string in C-string format, and set @@ -5601,27 +5612,38 @@ make_greater_string(const Const *str_const, FmgrInfo *ltproc, Oid collation) } } + if (datatype != BYTEAOID) + charincfunc = pg_database_encoding_character_incrementer(); + else + charincfunc = &byte_increment; + while (len > 0) { - unsigned char *lastchar = (unsigned char *) (workstr + len - 1); - unsigned char savelastchar = *lastchar; + int charlen; + unsigned char *lastchar; + unsigned char savelastbyte; + Const *workstr_const; + + if (datatype == BYTEAOID) + charlen = 1; + else + charlen = len - pg_mbcliplen(workstr, len, len - 1); + + lastchar = (unsigned char *) (workstr + len - charlen); /* -* Try to generate a larger string by incrementing the last byte. +* savelastbyte has meaning only for datatype == BYTEAOID */ - while (*lastchar < (unsigned char) 255) - { - Const *workstr_const; + savelastbyte = *lastchar; - (*lastchar)++; + /* +* Try to generate a larger string by incrementing the last byte or +* character. +*/ + if (charincfunc(lastchar, charlen)) { if (datatype != BYTEAOID) - { - /* do not generate invalid encoding sequences */ - if (!pg_verifymbstr(workstr, len, true)) - continue; workstr_const = string_to_const(workstr, datatype); - }