Re: [HACKERS] make_greater_string() does not return a string in some cases

2011-07-12 Thread Kyotaro HORIGUCHI
This is an update of a patch for NEXT CommitFest 2011/09.

Please ignore this message.

1 Additional Feature - EUC-JP incrementer
2 Bug fixes - bytea incrementer, libpq compilation.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
diff --git a/src/backend/utils/adt/selfuncs.c b/src/backend/utils/adt/selfuncs.c
index 10b73fb..48a58a0 100644
--- a/src/backend/utils/adt/selfuncs.c
+++ b/src/backend/utils/adt/selfuncs.c
@@ -5502,6 +5502,18 @@ pattern_selectivity(Const *patt, Pattern_Type ptype)
 
 
 /*
+ * This function is "character increment" function for bytea used in
+ * make_greater_string() that has same interface with pg_wchar_tbl.charinc.
+ */
+static bool byte_increment(unsigned char *ptr, int len)
+{
+   if (*ptr >= 255) return false;
+
+   (*ptr)++;
+   return true;
+}
+
+/*
  * Try to generate a string greater than the given string or any
  * string it is a prefix of.  If successful, return a palloc'd string
  * in the form of a Const node; else return NULL.
@@ -5540,6 +5552,7 @@ make_greater_string(const Const *str_const, FmgrInfo 
*ltproc, Oid collation)
int len;
Datum   cmpstr;
text   *cmptxt = NULL;
+   character_incrementer charincfunc;
 
/*
 * Get a modifiable copy of the prefix string in C-string format, and 
set
@@ -5601,27 +5614,38 @@ make_greater_string(const Const *str_const, FmgrInfo 
*ltproc, Oid collation)
}
}
 
+   if (datatype != BYTEAOID)
+   charincfunc = pg_database_encoding_character_incrementer();
+   else
+   charincfunc = &byte_increment;
+
while (len > 0)
{
-   unsigned char *lastchar = (unsigned char *) (workstr + len - 1);
-   unsigned char savelastchar = *lastchar;
+   int charlen;
+   unsigned char *lastchar;
+   unsigned char savelastbyte;
+   Const  *workstr_const;
+   
+   if (datatype == BYTEAOID)
+   charlen = 1;
+   else
+   charlen = len - pg_mbcliplen(workstr, len, len - 1);
+
+   lastchar = (unsigned char *) (workstr + len - charlen);
 
/*
-* Try to generate a larger string by incrementing the last 
byte.
+* savelastbyte has meaning only for datatype == BYTEAOID
 */
-   while (*lastchar < (unsigned char) 255)
-   {
-   Const  *workstr_const;
+   savelastbyte = *lastchar;
 
-   (*lastchar)++;
+   /*
+* Try to generate a larger string by incrementing the last 
byte or
+* character.
+*/
 
+   if (charincfunc(lastchar, charlen)) {
if (datatype != BYTEAOID)
-   {
-   /* do not generate invalid encoding sequences */
-   if (!pg_verifymbstr(workstr, len, true))
-   continue;
workstr_const = string_to_const(workstr, 
datatype);
-   }
else
workstr_const = string_to_bytea_const(workstr, 
len);
 
@@ -5636,26 +5660,17 @@ make_greater_string(const Const *str_const, FmgrInfo 
*ltproc, Oid collation)
pfree(workstr);
return workstr_const;
}
-
+   
/* No good, release unusable value and try again */
pfree(DatumGetPointer(workstr_const->constvalue));
pfree(workstr_const);
}
 
-   /* restore last byte so we don't confuse pg_mbcliplen */
-   *lastchar = savelastchar;
-
/*
-* Truncate off the last character, which might be more than 1 
byte,
-* depending on the character encoding.
+* Truncate off the last character or restore last byte for 
BYTEA.
 */
-   if (datatype != BYTEAOID && pg_database_encoding_max_length() > 
1)
-   len = pg_mbcliplen(workstr, len, len - 1);
-   else
-   len -= 1;
-
-   if (datatype != BYTEAOID)
-   workstr[len] = '\0';
+   len -= charlen;
+   workstr[len] = (datatype != BYTEAOID ? '\0' : savelastbyte);
}
 
/* Failed... */
diff --git a/src/backend/utils/mb/wchar.c b/src/backend/utils/mb/wchar.c
index 5b0cf62..8505bcb 100644
--- a/src/backend/utils/mb/wchar.c
+++ b/src/backend/utils/mb/wchar.c
@@ -1336,53 +1336,254 @@ pg_utf8_islegal(const unsigned char *source, int 
length)
 
 /*
  *---
+ * 

Re: [HACKERS] make_greater_string() does not return a string in some cases

2011-07-08 Thread Kyotaro HORIGUCHI
Hello, Could you let me go on with this topic?

It is hard to ignore this glitch for us using CJK - Chinese,
Japanese, and Korean - characters on databse.. Maybe..

Saying on Japanese under the standard usage, about a hundred
characters out of seven thousand make make_greater_string() fail.

This is not so frequent to happen but also not as rare as
ignorable.

I think this glitch is caused because the method to derive the
`next character' is fundamentally a secret of each encoding but
now it is done in make_greater_string() using the method extended
from that of 1 byte ASCII charset for all encodings together.

 So, I think it is reasonable that encoding info table (struct
pg_wchar_tbl) holds the function to do that.

How about this idea?


Points to realize this follows,

- pg_wchar_tbl@pg_wchar.c has new element `charinc' that holds a
  function to increment a character of this encoding.

- Basically, the value of charinc is a `generic' increment
  function that does what make_greater_string() does in current
  implement.

- make_greater_string() now uses charinc for database encoding to
  increment characters instead of the code directly written in
  it.

- Give UTF-8 a special increment function.


As a consequence of this modification, make_greater_string()
looks somewhat simple thanks to disappearing of the sequence that
handles bare bytes in string.  And doing `increment character'
with the knowledge of the encoding can be straightforward and
light and backtrack-free, and have fewer glitches than the
generic method.

# But the process for BYTEAOID remains there dissapointingly.

There still remains some glitches but I think it is overdo to do
conversion that changes the length of the character. Only 5
points out of 17 thousands (in current method, roughly for all
BMP characters) remains, and none of them are not Japanese
character :-)

The attached patch is sample implement of this idea.

What do you think about this patch?

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
diff --git a/src/backend/utils/adt/selfuncs.c b/src/backend/utils/adt/selfuncs.c
index 10b73fb..4151ce2 100644
--- a/src/backend/utils/adt/selfuncs.c
+++ b/src/backend/utils/adt/selfuncs.c
@@ -5502,6 +5502,16 @@ pattern_selectivity(Const *patt, Pattern_Type ptype)
 
 
 /*
+ * This function is "character increment" function for bytea used in
+ * make_greater_string() that has same interface with pg_wchar_tbl.charinc.
+ */
+static bool byte_increment(unsigned char *ptr, int len)
+{
+   (*ptr)--;
+   return true;
+}
+
+/*
  * Try to generate a string greater than the given string or any
  * string it is a prefix of.  If successful, return a palloc'd string
  * in the form of a Const node; else return NULL.
@@ -5540,6 +5550,7 @@ make_greater_string(const Const *str_const, FmgrInfo 
*ltproc, Oid collation)
int len;
Datum   cmpstr;
text   *cmptxt = NULL;
+   character_incrementer charincfunc;
 
/*
 * Get a modifiable copy of the prefix string in C-string format, and 
set
@@ -5601,27 +5612,38 @@ make_greater_string(const Const *str_const, FmgrInfo 
*ltproc, Oid collation)
}
}
 
+   if (datatype != BYTEAOID)
+   charincfunc = pg_database_encoding_character_incrementer();
+   else
+   charincfunc = &byte_increment;
+
while (len > 0)
{
-   unsigned char *lastchar = (unsigned char *) (workstr + len - 1);
-   unsigned char savelastchar = *lastchar;
+   int charlen;
+   unsigned char *lastchar;
+   unsigned char savelastbyte;
+   Const  *workstr_const;
+   
+   if (datatype == BYTEAOID)
+   charlen = 1;
+   else
+   charlen = len - pg_mbcliplen(workstr, len, len - 1);
+
+   lastchar = (unsigned char *) (workstr + len - charlen);
 
/*
-* Try to generate a larger string by incrementing the last 
byte.
+* savelastbyte has meaning only for datatype == BYTEAOID
 */
-   while (*lastchar < (unsigned char) 255)
-   {
-   Const  *workstr_const;
+   savelastbyte = *lastchar;
 
-   (*lastchar)++;
+   /*
+* Try to generate a larger string by incrementing the last 
byte or
+* character.
+*/
 
+   if (charincfunc(lastchar, charlen)) {
if (datatype != BYTEAOID)
-   {
-   /* do not generate invalid encoding sequences */
-   if (!pg_verifymbstr(workstr, len, true))
-   continue;
workstr_const = string_to_const(workstr, 
datatype);
-   }