Re: [HACKERS] tolower() identifier downcasing versus multibyte encodings
Did we ever address this?

---

Tom Lane wrote:
> I've been able to reproduce the behavior described here:
> http://archives.postgresql.org/pgsql-general/2011-03/msg00538.php
> It's specific to UTF8 locales on Mac OS X. I'm not sure if the
> problem can manifest anywhere else; considering that OS X's UTF8
> locales have a general reputation of being broken, it may only
> happen on that platform.
>
> What is happening is that downcase_truncate_identifier() tries to
> downcase identifiers like this:
>
>     unsigned char ch = (unsigned char) ident[i];
>
>     if (ch >= 'A' && ch <= 'Z')
>         ch += 'a' - 'A';
>     else if (IS_HIGHBIT_SET(ch) && isupper(ch))
>         ch = tolower(ch);
>     result[i] = (char) ch;
>
> This is of course incapable of successfully downcasing any multibyte
> characters, but there's an assumption that isupper() won't return
> TRUE for a character fragment in a multibyte locale. However, on
> OS X it seems that that's not the case :-(. For the particular
> example cited by Francisco Figueiredo, I see the byte sequence
> \303\251 converted to \343\251, because isupper() returns TRUE for
> \303 and then tolower() returns \343. The byte \251 is not changed,
> but the damage is already done: we now have an invalidly-encoded
> string.
>
> It looks like the blame for the subsequent disappearance of the
> bogus data lies with fprintf back on the client side; that surprises
> me a bit because I'd only heard of glibc being so cavalier with data
> it thought was invalidly encoded. But anyway, the origin of the
> problem is in the downcasing transformation.
>
> We could possibly fix this by not attempting the downcasing
> transformation on high-bit-set characters unless the encoding is
> single-byte. However, we have the exact same downcasing logic
> embedded in the functions in src/port/pgstrcasecmp.c, and those
> don't have any convenient way of knowing what the prevailing
> encoding is --- when compiled for frontend use, they can't use
> pg_database_encoding_max_length.
>
> Or we could bite the bullet and start using str_tolower(), but the
> performance implications of that are unpleasant; not to mention that
> we really don't want to re-introduce the Turkish problem with
> unexpected handling of i/I in identifiers.
>
> Or we could go the other way and stop downcasing non-ASCII letters
> altogether.
>
> None of these options seem terribly attractive. Thoughts?
>
> regards, tom lane

--
Bruce Momjian  br...@momjian.us  http://momjian.us
EnterpriseDB  http://enterprisedb.com

+ It's impossible for everything to be true. +

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] tolower() identifier downcasing versus multibyte encodings
I just received feedback on our bug report about this problem, and it
seems the problem also occurred on a Windows machine.

http://pgfoundry.org/tracker/index.php?func=detail&aid=1010988&group_id=1000140&atid=590

On Sat, Mar 19, 2011 at 14:13, Marko Kreen mark...@gmail.com wrote:
> On Sat, Mar 19, 2011 at 5:05 PM, Tom Lane t...@sss.pgh.pa.us wrote:
>> Marko Kreen mark...@gmail.com writes:
>>> On Sat, Mar 19, 2011 at 6:10 AM, Tom Lane t...@sss.pgh.pa.us wrote:
>>>> Or we could bite the bullet and start using str_tolower(), but
>>>> the performance implications of that are unpleasant; not to
>>>> mention that we really don't want to re-introduce the Turkish
>>>> problem with unexpected handling of i/I in identifiers.
>>>
>>> How about first pass with 'a' - 'A' and if highbit is found
>>> then str_tolower()?
>>
>> Hm, maybe. There's still the problem of what to do in
>> src/port/pgstrcasecmp.c, which won't have the infrastructure
>> needed to do that.
>
> You mean client-side? Could we have a str_tolower without xxx_l
> branch that always does wide-char conversion if high-bit is set?
>
> Custom locale there won't make sense there anyway?
>
> --
> marko

--
Regards,

Francisco Figueiredo Jr.
Npgsql Lead Developer
http://www.npgsql.org
http://fxjr.blogspot.com
http://twitter.com/franciscojunior
Re: [HACKERS] tolower() identifier downcasing versus multibyte encodings
On Sat, Mar 19, 2011 at 6:10 AM, Tom Lane t...@sss.pgh.pa.us wrote:
> Or we could bite the bullet and start using str_tolower(), but the
> performance implications of that are unpleasant; not to mention that
> we really don't want to re-introduce the Turkish problem with
> unexpected handling of i/I in identifiers.

How about first pass with 'a' - 'A' and if highbit is found
then str_tolower()? You will still confuse Turks,
but at least nothing should break.

--
marko
Re: [HACKERS] tolower() identifier downcasing versus multibyte encodings
Marko Kreen mark...@gmail.com writes:
> On Sat, Mar 19, 2011 at 6:10 AM, Tom Lane t...@sss.pgh.pa.us wrote:
>> Or we could bite the bullet and start using str_tolower(), but the
>> performance implications of that are unpleasant; not to mention that
>> we really don't want to re-introduce the Turkish problem with
>> unexpected handling of i/I in identifiers.
>
> How about first pass with 'a' - 'A' and if highbit is found
> then str_tolower()?

Hm, maybe.

There's still the problem of what to do in src/port/pgstrcasecmp.c,
which won't have the infrastructure needed to do that.

regards, tom lane
Re: [HACKERS] tolower() identifier downcasing versus multibyte encodings
On Sat, Mar 19, 2011 at 5:05 PM, Tom Lane t...@sss.pgh.pa.us wrote:
> Marko Kreen mark...@gmail.com writes:
>> On Sat, Mar 19, 2011 at 6:10 AM, Tom Lane t...@sss.pgh.pa.us wrote:
>>> Or we could bite the bullet and start using str_tolower(), but the
>>> performance implications of that are unpleasant; not to mention
>>> that we really don't want to re-introduce the Turkish problem with
>>> unexpected handling of i/I in identifiers.
>>
>> How about first pass with 'a' - 'A' and if highbit is found
>> then str_tolower()?
>
> Hm, maybe.
>
> There's still the problem of what to do in src/port/pgstrcasecmp.c,
> which won't have the infrastructure needed to do that.

You mean client-side? Could we have a str_tolower without xxx_l
branch that always does wide-char conversion if high-bit is set?

Custom locale there won't make sense there anyway?

--
marko
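
For readers following the thread, the two-pass idea above might look roughly like the sketch below. This is an illustration only, not PostgreSQL code: downcase_two_pass_sketch() and its signature are hypothetical, and it uses the standard C mbstowcs()/towlower()/wcstombs() routines under the process's current locale rather than the backend's str_tolower(). ASCII letters are folded arithmetically in the first pass, so the locale never sees 'I' and the Turkish i/I problem should not arise for them. If the wide-character round trip fails for any reason, the ASCII-only result is kept.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>
#include <wchar.h>
#include <wctype.h>

/*
 * Hypothetical helper (not part of PostgreSQL): downcase src into dst.
 * Returns false only if dst is too small to hold the input.
 */
bool
downcase_two_pass_sketch(const char *src, char *dst, size_t dstsize)
{
    size_t      len;
    bool        saw_highbit = false;
    wchar_t    *wide;
    char       *mb;

    assert(src != NULL && dst != NULL);
    len = strlen(src);
    if (len >= dstsize)
        return false;

    /* Pass 1: locale-independent ASCII folding; note any high-bit byte. */
    for (size_t i = 0; i < len; i++)
    {
        unsigned char ch = (unsigned char) src[i];

        if (ch >= 'A' && ch <= 'Z')
            ch += 'a' - 'A';
        else if (ch & 0x80)
            saw_highbit = true;
        dst[i] = (char) ch;
    }
    dst[len] = '\0';
    if (!saw_highbit)
        return true;            /* fast path: pure ASCII identifier */

    /* Pass 2: fold whole characters, never individual bytes. */
    wide = malloc((len + 1) * sizeof(wchar_t));
    mb = malloc(len * MB_CUR_MAX + 1);
    if (wide != NULL && mb != NULL)
    {
        size_t      nwide = mbstowcs(wide, dst, len + 1);

        if (nwide != (size_t) -1)
        {
            size_t      nbytes;

            /* ASCII was already handled in pass 1, so skip it here. */
            for (size_t i = 0; i < nwide; i++)
                if (wide[i] > 127)
                    wide[i] = (wchar_t) towlower((wint_t) wide[i]);

            nbytes = wcstombs(mb, wide, len * MB_CUR_MAX + 1);
            /* Replace the pass-1 result only if the new string fits. */
            if (nbytes != (size_t) -1 && nbytes < dstsize)
            {
                memcpy(dst, mb, nbytes);
                dst[nbytes] = '\0';
            }
        }
    }
    free(wide);
    free(mb);
    return true;
}
```

A caller would need to have selected a suitable locale with setlocale() beforehand; which locale frontend code should use is exactly the open question in this thread.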
[HACKERS] tolower() identifier downcasing versus multibyte encodings
I've been able to reproduce the behavior described here:
http://archives.postgresql.org/pgsql-general/2011-03/msg00538.php
It's specific to UTF8 locales on Mac OS X. I'm not sure if the
problem can manifest anywhere else; considering that OS X's UTF8
locales have a general reputation of being broken, it may only happen
on that platform.

What is happening is that downcase_truncate_identifier() tries to
downcase identifiers like this:

    unsigned char ch = (unsigned char) ident[i];

    if (ch >= 'A' && ch <= 'Z')
        ch += 'a' - 'A';
    else if (IS_HIGHBIT_SET(ch) && isupper(ch))
        ch = tolower(ch);
    result[i] = (char) ch;

This is of course incapable of successfully downcasing any multibyte
characters, but there's an assumption that isupper() won't return TRUE
for a character fragment in a multibyte locale. However, on OS X it
seems that that's not the case :-(. For the particular example cited
by Francisco Figueiredo, I see the byte sequence \303\251 converted to
\343\251, because isupper() returns TRUE for \303 and then tolower()
returns \343. The byte \251 is not changed, but the damage is already
done: we now have an invalidly-encoded string.

It looks like the blame for the subsequent disappearance of the bogus
data lies with fprintf back on the client side; that surprises me a
bit because I'd only heard of glibc being so cavalier with data it
thought was invalidly encoded. But anyway, the origin of the problem
is in the downcasing transformation.

We could possibly fix this by not attempting the downcasing
transformation on high-bit-set characters unless the encoding is
single-byte. However, we have the exact same downcasing logic
embedded in the functions in src/port/pgstrcasecmp.c, and those don't
have any convenient way of knowing what the prevailing encoding is ---
when compiled for frontend use, they can't use
pg_database_encoding_max_length.

Or we could bite the bullet and start using str_tolower(), but the
performance implications of that are unpleasant; not to mention that
we really don't want to re-introduce the Turkish problem with
unexpected handling of i/I in identifiers.

Or we could go the other way and stop downcasing non-ASCII letters
altogether.

None of these options seem terribly attractive. Thoughts?

regards, tom lane
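
As an illustration of the first option in the message above (leave high-bit-set bytes alone unless the encoding is single-byte), here is a minimal sketch. It is not the committed fix. The function name downcase_identifier_sketch() and the encoding_max_length parameter are hypothetical; the parameter stands in for whatever pg_database_encoding_max_length() would report in the backend. IS_HIGHBIT_SET is redefined locally so the example compiles on its own.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Local stand-in for the PostgreSQL macro of the same name. */
#define IS_HIGHBIT_SET(ch) ((unsigned char) (ch) & 0x80)

/*
 * Hypothetical sketch: copy len bytes of ident into result, downcased.
 * result must have room for len + 1 bytes.  encoding_max_length is an
 * assumed parameter: 1 means a single-byte encoding, >1 multibyte.
 */
void
downcase_identifier_sketch(const char *ident, char *result, size_t len,
                           int encoding_max_length)
{
    assert(ident != NULL && result != NULL);

    for (size_t i = 0; i < len; i++)
    {
        unsigned char ch = (unsigned char) ident[i];

        if (ch >= 'A' && ch <= 'Z')
            ch += 'a' - 'A';    /* ASCII: locale-independent */
        else if (encoding_max_length == 1 &&
                 IS_HIGHBIT_SET(ch) && isupper(ch))
            ch = (unsigned char) tolower(ch);
        /* otherwise: a multibyte fragment, pass it through untouched */
        result[i] = (char) ch;
    }
    result[len] = '\0';
}
```

With encoding_max_length greater than 1, the bytes \303\251 come out unchanged whatever isupper() claims about \303, so the string stays validly encoded. The cost is that non-ASCII letters are no longer downcased at all in multibyte encodings.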