Re: [HACKERS] Unicode problems on IRC
On 2005-04-10, Tom Lane [EMAIL PROTECTED] wrote: The impression I get is that most of the 'Unicode characters above 0x1' reports we've seen did not come from people who actually needed more-than-16-bit Unicode codepoints, but from people who had screwed up their encoding settings and were trying to tell the backend that Latin1 was Unicode or some such. So I'm a bit worried that extending the backend support to full 32-bit Unicode will do more to mask encoding mistakes than it will do to create needed functionality. I think you will find that this impression is actually false. Or that at the very least, _correct_ verification of UTF-8 sequences will still catch essentially all cases of non-utf-8 input mislabelled as utf-8 while allowing the full range of Unicode codepoints. (The current check will report the characters above 0x1 error even on input which is blatantly not utf-8 at all.) One of UTF-8's nicest properties is that other encodings are almost never also valid utf-8. I did some tests on this myself some years ago, feeding hundreds of thousands of short non-utf-8 strings (taken from Usenet subject lines in non-english-speaking hierarchies) into a utf-8 decoder. The false accept rate was on the order of 0.01%, and going back and re-checking my old data, _none_ of the incorrectly detected sequences would have been interpreted as characters over 0x. -- Andrew, Supernews http://www.supernews.com - individual and corporate NNTP services ---(end of broadcast)--- TIP 1: subscribe and unsubscribe commands go to [EMAIL PROTECTED]
Re: [HACKERS] Unicode problems on IRC
Andrew - Supernews [EMAIL PROTECTED] writes: On 2005-04-10, Tom Lane [EMAIL PROTECTED] wrote: The impression I get is that most of the 'Unicode characters above 0x1' reports we've seen did not come from people who actually needed more-than-16-bit Unicode codepoints, but from people who had screwed up their encoding settings and were trying to tell the backend that Latin1 was Unicode or some such. I think you will find that this impression is actually false. Or that at the very least, _correct_ verification of UTF-8 sequences will still catch essentially all cases of non-utf-8 input mislabelled as utf-8 while allowing the full range of Unicode codepoints. Yeah? Cool. Does John's proposed patch do it correctly? http://candle.pha.pa.us/mhonarc/patches2/msg00076.html regards, tom lane ---(end of broadcast)--- TIP 2: you can get off all lists at once with the unregister command (send unregister YourEmailAddressHere to [EMAIL PROTECTED])
Re: [HACKERS] Unicode problems on IRC
On 2005-04-10, Tom Lane [EMAIL PROTECTED] wrote: Andrew - Supernews [EMAIL PROTECTED] writes: I think you will find that this impression is actually false. Or that at the very least, _correct_ verification of UTF-8 sequences will still catch essentially all cases of non-utf-8 input mislabelled as utf-8 while allowing the full range of Unicode codepoints. Yeah? Cool. Does John's proposed patch do it correctly? http://candle.pha.pa.us/mhonarc/patches2/msg00076.html It looks correct to me. The only thing I think that code will let through incorrectly are encoded surrogates; those could be fixed by adding one line: switch (*source) { /* no fall-through in this inner switch */ case 0xE0: if (a 0xA0) return false; break; + case 0xED: if (a 0x9F) return false; break; case 0xF0: if (a 0x90) return false; break; case 0xF4: if (a 0x8F) return false; break; (Accepting encoded surrogates in utf-8 was always forbidden by most specifications that used utf-8, though the Unicode specs originally were not absolute about it (but forbade generating them). Current Unicode specifications define those sequences as malformed. Surrogates are the code points from 0xD800 - 0xDFFF, which are used in UTF-16 to encode characters 0x1 - 0x10 as two 16-bit values; UTF-8 requires that such characters are encoded directly rather than via surrogate pairs.) -- Andrew, Supernews http://www.supernews.com - individual and corporate NNTP services ---(end of broadcast)--- TIP 7: don't forget to increase your free space map settings
Re: [HACKERS] Unicode problems on IRC
On 2005-04-10, Tom Lane tgl ( at ) sss ( dot ) pgh ( dot ) pa ( dot ) us wrote: Andrew - Supernews andrew+nonews ( at ) supernews ( dot ) com writes: I think you will find that this impression is actually false. Or that at the very least, _correct_ verification of UTF-8 sequences will still catch essentially all cases of non-utf-8 input mislabelled as utf-8 while allowing the full range of Unicode codepoints. Yeah? Cool. Does John's proposed patch do it correctly? http://candle.pha.pa.us/mhonarc/patches2/msg00076.html It looks correct to me. The only thing I think that code will let through incorrectly are encoded surrogates; those could be fixed by adding one line: switch (*source) { /* no fall-through in this inner switch */ case 0xE0: if (a 0xA0) return false; break; + case 0xED: if (a 0x9F) return false; break; case 0xF0: if (a 0x90) return false; break; case 0xF4: if (a 0x8F) return false; break; That's right, dono how I missed that one, but looks correct to me, and is in line with the code in ConvertUTF.c from unicode.org, on which I based the patch, extended to support 6 byte utf8 characters. (Accepting encoded surrogates in utf-8 was always forbidden by most specifications that used utf-8, though the Unicode specs originally were not absolute about it (but forbade generating them). Current Unicode specifications define those sequences as malformed. Surrogates are the code points from 0xD800 - 0xDFFF, which are used in UTF-16 to encode characters 0x1 - 0x10 as two 16-bit values; UTF-8 requires that such characters are encoded directly rather than via surrogate pairs.) -- Andrew, Supernews http://www.supernews.com - individual and corporate NNTP services ... John ---(end of broadcast)--- TIP 4: Don't 'kill -9' the postmaster
Re: [HACKERS] Unicode problems on IRC
Tom Lane wrote: Yeah? Cool. Does John's proposed patch do it correctly? http://candle.pha.pa.us/mhonarc/patches2/msg00076.html Some comments on that patch: Doesn't pg_utf2wchar_with_len need changes for the longer sequences? UtfToLocal also appears to need changes. If we support sequences 4 bytes (U+10), then UtfToLocal/LocalToUtf and the associated translation tables need a redesign as they currently assume the sequence fits in an unsigned int. (IIRC, Unicode doesn't use U+10, but UTF-8 can encode it?) -O ---(end of broadcast)--- TIP 4: Don't 'kill -9' the postmaster
Re: [HACKERS] Unicode problems on IRC
On 2005-04-10, John Hansen [EMAIL PROTECTED] wrote: That's right, dono how I missed that one, but looks correct to me, and is in line with the code in ConvertUTF.c from unicode.org, on which I based the patch, extended to support 6 byte utf8 characters. Frankly, you should probably de-extend it back down to 4 bytes. That's enough to encode the Unicode range of 0x00 - 0x10, and enough other stuff would break if anyone allocated a character outside that range that I don't think it it worth worrying about. (Even the ISO people have agreed to conform to that limitation.) Even if insanity struck simultaneously at both standards bodies, 4 bytes is enough to go to 0x1F so there is still substantial slack. (A number of other specifications based on utf-8 have removed the 5 and 6 byte sequences too, so there is substantial precedent for this.) -- Andrew, Supernews http://www.supernews.com - individual and corporate NNTP services ---(end of broadcast)--- TIP 9: the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match
Re: [HACKERS] Unicode problems on IRC
Christopher Kings-Lynne wrote: Hey guys, The 'Unicode characters above 0x1' issue keeps rearing its ugly head in the IRC channel. I propose that it be fixed, even backported... This is John Hansen's most recent patch to fix it: http://archives.postgresql.org/pgsql-patches/2004-11/msg00259.php And from what I can tell it was committed, then reverted because it wasn't a bug. It was going to go in for 8.1. We on the channel are starting to think that it is in fact a bug. There are are people with legitimately utf-8 encoded XML documents that they cannot store in PostgreSQL. Apparently in the distant past, Unicode was limited to 0x1, but then was extended. Perhaps we can reopen this case... Uh, I thought we fixed this another way, buy not using Unicode-aware functions for upper/lower/initcap when the locale is C or POSIX. That is backpatched to 8.0.X. Does that not fix the problem reported? -- Bruce Momjian| http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 359-1001 + If your life is a hard drive, | 13 Roberts Road + Christ can be your backup.| Newtown Square, Pennsylvania 19073 ---(end of broadcast)--- TIP 9: the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match
Re: [HACKERS] Unicode problems on IRC
On 2005-04-09, Bruce Momjian pgman@candle.pha.pa.us wrote: Uh, I thought we fixed this another way, buy not using Unicode-aware functions for upper/lower/initcap when the locale is C or POSIX. That is backpatched to 8.0.X. Does that not fix the problem reported? Unicode values over 0x are simply not accepted on input, so no, it doesn't fix the problem. What do upper/lower/initcap have to do with it? textin() unconditionally calls pg_verifymbstr, which in turn explicitly checks for such values (if the encoding is UTF8) and throws ERROR if it finds them. -- Andrew, Supernews http://www.supernews.com - individual and corporate NNTP services ---(end of broadcast)--- TIP 2: you can get off all lists at once with the unregister command (send unregister YourEmailAddressHere to [EMAIL PROTECTED])
Re: [HACKERS] Unicode problems on IRC
-Original Message- From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Bruce Momjian Sent: Sunday, April 10, 2005 8:18 AM To: Christopher Kings-Lynne Cc: pgsql-hackers@postgresql.org Subject: Re: [HACKERS] Unicode problems on IRC Christopher Kings-Lynne wrote: Hey guys, The 'Unicode characters above 0x1' issue keeps rearing its ugly head in the IRC channel. I propose that it be fixed, even backported... This is John Hansen's most recent patch to fix it: http://archives.postgresql.org/pgsql-patches/2004-11/msg00259.php And from what I can tell it was committed, then reverted because it wasn't a bug. It was going to go in for 8.1. We on the channel are starting to think that it is in fact a bug. There are are people with legitimately utf-8 encoded XML documents that they cannot store in PostgreSQL. Apparently in the distant past, Unicode was limited to 0x1, but then was extended. Perhaps we can reopen this case... Uh, I thought we fixed this another way, buy not using Unicode-aware functions for upper/lower/initcap when the locale is C or POSIX. That is backpatched to 8.0.X. Does that not fix the problem reported? No, as andrew said, what this patch does, is allow values 0x and at the same time validates the input to make sure it's valid utf8. ... John -- Bruce Momjian| http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 359-1001 + If your life is a hard drive, | 13 Roberts Road + Christ can be your backup.| Newtown Square, Pennsylvania 19073 ---(end of broadcast)--- TIP 9: the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match ---(end of broadcast)--- TIP 3: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly
Re: [HACKERS] Unicode problems on IRC
John Hansen [EMAIL PROTECTED] writes: That is backpatched to 8.0.X. Does that not fix the problem reported? No, as andrew said, what this patch does, is allow values 0x and at the same time validates the input to make sure it's valid utf8. The impression I get is that most of the 'Unicode characters above 0x1' reports we've seen did not come from people who actually needed more-than-16-bit Unicode codepoints, but from people who had screwed up their encoding settings and were trying to tell the backend that Latin1 was Unicode or some such. So I'm a bit worried that extending the backend support to full 32-bit Unicode will do more to mask encoding mistakes than it will do to create needed functionality. Not that I'm against adding the functionality. I'm just doubtful that the reports we've seen really indicate that we need it, or that adding it will cut down on the incidence of complaints :-( regards, tom lane ---(end of broadcast)--- TIP 2: you can get off all lists at once with the unregister command (send unregister YourEmailAddressHere to [EMAIL PROTECTED])
Re: [HACKERS] Unicode problems on IRC
Tom Lane wrote: John Hansen [EMAIL PROTECTED] writes: That is backpatched to 8.0.X. Does that not fix the problem reported? No, as andrew said, what this patch does, is allow values 0x and at the same time validates the input to make sure it's valid utf8. The impression I get is that most of the 'Unicode characters above 0x1' reports we've seen did not come from people who actually needed more-than-16-bit Unicode codepoints, but from people who had screwed up their encoding settings and were trying to tell the backend that Latin1 was Unicode or some such. So I'm a bit worried that extending the backend support to full 32-bit Unicode will do more to mask encoding mistakes than it will do to create needed functionality. Yes, that was my impression too. The upper/lower/initcap issue was that some operating systems were testing unicode values even if the local was set to C. That is fixed in 8.0.2, but I now see this is a different problem. Not that I'm against adding the functionality. I'm just doubtful that the reports we've seen really indicate that we need it, or that adding it will cut down on the incidence of complaints :-( Yea, that was my question too. I figured Japan or Chinese would be using these longer values, and if they are fine, why are others having problems. It would be great to find a test case that actually shows the failure. -- Bruce Momjian| http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 359-1001 + If your life is a hard drive, | 13 Roberts Road + Christ can be your backup.| Newtown Square, Pennsylvania 19073 ---(end of broadcast)--- TIP 7: don't forget to increase your free space map settings
[HACKERS] Unicode problems on IRC
Hey guys, The 'Unicode characters above 0x1' issue keeps rearing its ugly head in the IRC channel. I propose that it be fixed, even backported... This is John Hansen's most recent patch to fix it: http://archives.postgresql.org/pgsql-patches/2004-11/msg00259.php And from what I can tell it was committed, then reverted because it wasn't a bug. It was going to go in for 8.1. We on the channel are starting to think that it is in fact a bug. There are are people with legitimately utf-8 encoded XML documents that they cannot store in PostgreSQL. Apparently in the distant past, Unicode was limited to 0x1, but then was extended. Perhaps we can reopen this case... Chris ---(end of broadcast)--- TIP 7: don't forget to increase your free space map settings