Re: [HACKERS] Unicode problems on IRC

2005-04-10 Thread Andrew - Supernews
On 2005-04-10, Tom Lane [EMAIL PROTECTED] wrote:
 The impression I get is that most of the 'Unicode characters above
 0x10000' reports we've seen did not come from people who actually needed
 more-than-16-bit Unicode codepoints, but from people who had screwed up
 their encoding settings and were trying to tell the backend that Latin1
 was Unicode or some such.  So I'm a bit worried that extending the
 backend support to full 32-bit Unicode will do more to mask encoding
 mistakes than it will do to create needed functionality.

I think you will find that this impression is actually false. Or that at
the very least, _correct_ verification of UTF-8 sequences will still
catch essentially all cases of non-utf-8 input mislabelled as utf-8
while allowing the full range of Unicode codepoints. (The current check
will report the characters above 0x10000 error even on input which is
blatantly not utf-8 at all.)

One of UTF-8's nicest properties is that other encodings are almost never
also valid utf-8. I did some tests on this myself some years ago, feeding
hundreds of thousands of short non-utf-8 strings (taken from Usenet
subject lines in non-english-speaking hierarchies) into a utf-8 decoder.
The false accept rate was on the order of 0.01%, and going back and
re-checking my old data, _none_ of the incorrectly detected sequences
would have been interpreted as characters over 0xFFFF.
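
To make the point concrete, here is a minimal sketch of strict UTF-8 verification. It is not the patch under discussion and not the backend's pg_verifymbstr; the names utf8_is_valid and utf8_demo are made up for illustration, and the sketch assumes sequences are limited to 4 bytes. It shows that a typical Latin1 string fails verification even though code points above 0xFFFF are accepted.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper (not PostgreSQL code): returns true only if
 * buf[0..len) is well-formed UTF-8 restricted to U+0000..U+10FFFF. */
static bool utf8_is_valid(const unsigned char *buf, size_t len)
{
    size_t i = 0;

    while (i < len)
    {
        unsigned char c = buf[i];
        size_t n;                           /* continuation bytes expected */
        unsigned char lo = 0x80, hi = 0xBF; /* allowed range of 2nd byte */

        if (c < 0x80) { i++; continue; }        /* plain ASCII */
        else if (c >= 0xC2 && c <= 0xDF) n = 1; /* C0 and C1 are overlong */
        else if (c >= 0xE0 && c <= 0xEF) n = 2;
        else if (c >= 0xF0 && c <= 0xF4) n = 3; /* F5..FF are never valid */
        else return false;                      /* stray continuation byte etc. */

        if (c == 0xE0) lo = 0xA0;   /* reject overlong 3-byte forms */
        if (c == 0xED) hi = 0x9F;   /* reject encoded surrogates */
        if (c == 0xF0) lo = 0x90;   /* reject overlong 4-byte forms */
        if (c == 0xF4) hi = 0x8F;   /* reject values above 0x10FFFF */

        if (len - i <= n) return false;         /* truncated sequence */
        if (buf[i + 1] < lo || buf[i + 1] > hi) return false;
        for (size_t k = 2; k <= n; k++)
            if ((buf[i + k] & 0xC0) != 0x80) return false;
        i += n + 1;
    }
    return true;
}

static void utf8_demo(void)
{
    /* "cafe" with e-acute: first as Latin1, then as UTF-8 */
    static const unsigned char latin1[] = { 'c', 'a', 'f', 0xE9 };
    static const unsigned char utf8[] = { 'c', 'a', 'f', 0xC3, 0xA9 };
    static const unsigned char astral[] = { 0xF0, 0x90, 0x80, 0x80 }; /* U+10000 */

    assert(!utf8_is_valid(latin1, sizeof latin1)); /* mislabelled input caught */
    assert(utf8_is_valid(utf8, sizeof utf8));
    assert(utf8_is_valid(astral, sizeof astral));  /* above 0xFFFF, accepted */
}
```

The Latin1 byte 0xE9 looks like the start of a 3-byte sequence, but nothing valid follows it, so the string is rejected without any range check on the decoded value.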

-- 
Andrew, Supernews
http://www.supernews.com - individual and corporate NNTP services

---(end of broadcast)---
TIP 1: subscribe and unsubscribe commands go to [EMAIL PROTECTED]


Re: [HACKERS] Unicode problems on IRC

2005-04-10 Thread Tom Lane
Andrew - Supernews [EMAIL PROTECTED] writes:
 On 2005-04-10, Tom Lane [EMAIL PROTECTED] wrote:
 The impression I get is that most of the 'Unicode characters above
 0x10000' reports we've seen did not come from people who actually needed
 more-than-16-bit Unicode codepoints, but from people who had screwed up
 their encoding settings and were trying to tell the backend that Latin1
 was Unicode or some such.

 I think you will find that this impression is actually false. Or that at
 the very least, _correct_ verification of UTF-8 sequences will still
 catch essentially all cases of non-utf-8 input mislabelled as utf-8
 while allowing the full range of Unicode codepoints.

Yeah?  Cool.  Does John's proposed patch do it correctly?

http://candle.pha.pa.us/mhonarc/patches2/msg00076.html

regards, tom lane



Re: [HACKERS] Unicode problems on IRC

2005-04-10 Thread Andrew - Supernews
On 2005-04-10, Tom Lane [EMAIL PROTECTED] wrote:
 Andrew - Supernews [EMAIL PROTECTED] writes:
 I think you will find that this impression is actually false. Or that at
 the very least, _correct_ verification of UTF-8 sequences will still
 catch essentially all cases of non-utf-8 input mislabelled as utf-8
 while allowing the full range of Unicode codepoints.

 Yeah?  Cool.  Does John's proposed patch do it correctly?

 http://candle.pha.pa.us/mhonarc/patches2/msg00076.html

It looks correct to me. The only thing I think that code will let through
incorrectly are encoded surrogates; those could be fixed by adding one line:

  switch (*source) {
  /* no fall-through in this inner switch */
  case 0xE0: if (a < 0xA0) return false; break;
+ case 0xED: if (a > 0x9F) return false; break;
  case 0xF0: if (a < 0x90) return false; break;
  case 0xF4: if (a > 0x8F) return false; break;

(Accepting encoded surrogates in utf-8 was always forbidden by most
specifications that used utf-8, though the Unicode specs originally were
not absolute about it (but forbade generating them). Current Unicode
specifications define those sequences as malformed. Surrogates are the
code points from 0xD800 - 0xDFFF, which are used in UTF-16 to encode
characters 0x10000 - 0x10FFFF as two 16-bit values; UTF-8 requires that
such characters are encoded directly rather than via surrogate pairs.)
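
For illustration only, the arithmetic behind that parenthetical can be sketched as follows. The function names are hypothetical and nothing here comes from the patch; the sketch just shows how UTF-16 splits a code point into a surrogate pair, and how UTF-8 encodes the same code point directly in four bytes.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: split a code point in 0x10000..0x10FFFF into a
 * UTF-16 surrogate pair. */
static void utf16_surrogates(uint32_t cp, uint16_t *high, uint16_t *low)
{
    uint32_t v = cp - 0x10000;                  /* 20 bits remain */

    *high = (uint16_t) (0xD800 + (v >> 10));    /* top 10 bits */
    *low = (uint16_t) (0xDC00 + (v & 0x3FF));   /* bottom 10 bits */
}

/* Hypothetical helper: encode the same code point directly as four
 * UTF-8 bytes, with no surrogates involved. */
static void utf8_encode4(uint32_t cp, unsigned char out[4])
{
    out[0] = (unsigned char) (0xF0 | (cp >> 18));
    out[1] = (unsigned char) (0x80 | ((cp >> 12) & 0x3F));
    out[2] = (unsigned char) (0x80 | ((cp >> 6) & 0x3F));
    out[3] = (unsigned char) (0x80 | (cp & 0x3F));
}

static void surrogate_demo(void)
{
    uint16_t hi, lo;
    unsigned char b[4];

    utf16_surrogates(0x10000, &hi, &lo);
    assert(hi == 0xD800 && lo == 0xDC00);
    utf16_surrogates(0x10FFFF, &hi, &lo);
    assert(hi == 0xDBFF && lo == 0xDFFF);

    utf8_encode4(0x10000, b);
    assert(b[0] == 0xF0 && b[1] == 0x90 && b[2] == 0x80 && b[3] == 0x80);
}
```

A surrogate such as 0xD800 encoded on its own would come out as the three bytes ED A0 80, which is why limiting the second byte after 0xED to at most 0x9F is enough to reject all of 0xD800 - 0xDFFF.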

-- 
Andrew, Supernews
http://www.supernews.com - individual and corporate NNTP services



Re: [HACKERS] Unicode problems on IRC

2005-04-10 Thread John Hansen
On 2005-04-10, Tom Lane tgl ( at ) sss ( dot ) pgh ( dot ) pa ( dot ) us wrote:
 Andrew - Supernews andrew+nonews ( at ) supernews ( dot ) com writes:
 I think you will find that this impression is actually false. Or that at
 the very least, _correct_ verification of UTF-8 sequences will still
 catch essentially all cases of non-utf-8 input mislabelled as utf-8
 while allowing the full range of Unicode codepoints.

 Yeah?  Cool.  Does John's proposed patch do it correctly?

 http://candle.pha.pa.us/mhonarc/patches2/msg00076.html

It looks correct to me. The only thing I think that code will let through
incorrectly are encoded surrogates; those could be fixed by adding one line:

  switch (*source) {
  /* no fall-through in this inner switch */
  case 0xE0: if (a < 0xA0) return false; break;
+ case 0xED: if (a > 0x9F) return false; break;
  case 0xF0: if (a < 0x90) return false; break;
  case 0xF4: if (a > 0x8F) return false; break;


That's right, dunno how I missed that one, but it looks correct to me, and
is in line with the code in ConvertUTF.c from unicode.org, on which I
based the patch, extended to support 6 byte utf8 characters.

(Accepting encoded surrogates in utf-8 was always forbidden by most
specifications that used utf-8, though the Unicode specs originally were
not absolute about it (but forbade generating them). Current Unicode
specifications define those sequences as malformed. Surrogates are the
code points from 0xD800 - 0xDFFF, which are used in UTF-16 to encode
characters 0x10000 - 0x10FFFF as two 16-bit values; UTF-8 requires that
such characters are encoded directly rather than via surrogate pairs.)

-- 
Andrew, Supernews
http://www.supernews.com - individual and corporate NNTP services

... John



Re: [HACKERS] Unicode problems on IRC

2005-04-10 Thread Oliver Jowett
Tom Lane wrote:

 Yeah?  Cool.  Does John's proposed patch do it correctly?
 
 http://candle.pha.pa.us/mhonarc/patches2/msg00076.html

Some comments on that patch:

Doesn't pg_utf2wchar_with_len need changes for the longer sequences?

UtfToLocal also appears to need changes.

If we support sequences > 4 bytes (> U+10FFFF), then UtfToLocal/LocalToUtf
and the associated translation tables need a redesign as they currently
assume the sequence fits in an unsigned int. (IIRC, Unicode doesn't use
> U+10FFFF, but UTF-8 can encode it?)

-O



Re: [HACKERS] Unicode problems on IRC

2005-04-10 Thread Andrew - Supernews
On 2005-04-10, John Hansen [EMAIL PROTECTED] wrote:
 That's right, dunno how I missed that one, but it looks correct to me, and
 is in line with the code in ConvertUTF.c from unicode.org, on which I
 based the patch, extended to support 6 byte utf8 characters.

Frankly, you should probably de-extend it back down to 4 bytes. That's
enough to encode the Unicode range of 0x000000 - 0x10FFFF, and enough
other stuff would break if anyone allocated a character outside that
range that I don't think it is worth worrying about. (Even the ISO
people have agreed to conform to that limitation.) Even if insanity
struck simultaneously at both standards bodies, 4 bytes is enough to
go to 0x1FFFFF so there is still substantial slack. (A number of other
specifications based on utf-8 have removed the 5 and 6 byte sequences
too, so there is substantial precedent for this.)
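
The 0x1FFFFF figure follows from the bit layout. As a small sketch (utf8_max_for_len is a hypothetical name, and the 5 and 6 byte cases follow the original, now obsolete, UTF-8 definition), the largest value an n-byte sequence can carry can be computed like this:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: largest value an n-byte sequence can carry under
 * the original 1..6 byte UTF-8 bit layout. Returns 0 for unsupported n. */
static uint32_t utf8_max_for_len(int n)
{
    int bits;

    if (n == 1)
        return 0x7F;                    /* 7 payload bits */
    if (n < 2 || n > 6)
        return 0;

    /* the lead byte carries (7 - n) bits, each continuation byte 6 */
    bits = (7 - n) + 6 * (n - 1);
    return (uint32_t) ((UINT64_C(1) << bits) - 1);
}

static void range_demo(void)
{
    assert(utf8_max_for_len(3) == 0xFFFF);
    assert(utf8_max_for_len(4) == 0x1FFFFF);
    /* 4 bytes already cover the whole Unicode range */
    assert(utf8_max_for_len(4) >= 0x10FFFF);
}
```

So 3 bytes stop at 0xFFFF, which is where the existing limit sits, and 4 bytes give 21 bits, roughly twice what Unicode actually uses.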

-- 
Andrew, Supernews
http://www.supernews.com - individual and corporate NNTP services



Re: [HACKERS] Unicode problems on IRC

2005-04-09 Thread Bruce Momjian
Christopher Kings-Lynne wrote:
 Hey guys,
 
 The 'Unicode characters above 0x10000' issue keeps rearing its ugly head 
 in the IRC channel.  I propose that it be fixed, even backported...
 
 This is John Hansen's most recent patch to fix it:
 
 http://archives.postgresql.org/pgsql-patches/2004-11/msg00259.php
 
 And from what I can tell it was committed, then reverted because it 
 wasn't a bug.  It was going to go in for 8.1.
 
 We on the channel are starting to think that it is in fact a bug.  There 
 are people with legitimately utf-8 encoded XML documents that they 
 cannot store in PostgreSQL.  Apparently in the distant past, Unicode was 
 limited to 0x10000, but then was extended.
 
 Perhaps we can reopen this case...

Uh, I thought we fixed this another way, by not using Unicode-aware
functions for upper/lower/initcap when the locale is C or POSIX. 
That is backpatched to 8.0.X.  Does that not fix the problem reported?

-- 
  Bruce Momjian|  http://candle.pha.pa.us
  pgman@candle.pha.pa.us   |  (610) 359-1001
  +  If your life is a hard drive, |  13 Roberts Road
  +  Christ can be your backup.|  Newtown Square, Pennsylvania 19073



Re: [HACKERS] Unicode problems on IRC

2005-04-09 Thread Andrew - Supernews
On 2005-04-09, Bruce Momjian pgman@candle.pha.pa.us wrote:
 Uh, I thought we fixed this another way, by not using Unicode-aware
 functions for upper/lower/initcap when the locale is C or POSIX. 
 That is backpatched to 8.0.X.  Does that not fix the problem reported?

Unicode values over 0xFFFF are simply not accepted on input, so no, it
doesn't fix the problem. What do upper/lower/initcap have to do with it?

textin() unconditionally calls pg_verifymbstr, which in turn explicitly
checks for such values (if the encoding is UTF8) and throws ERROR if it
finds them.

-- 
Andrew, Supernews
http://www.supernews.com - individual and corporate NNTP services



Re: [HACKERS] Unicode problems on IRC

2005-04-09 Thread John Hansen
 

 -Original Message-
 From: [EMAIL PROTECTED] 
 [mailto:[EMAIL PROTECTED] On Behalf Of Bruce Momjian
 Sent: Sunday, April 10, 2005 8:18 AM
 To: Christopher Kings-Lynne
 Cc: pgsql-hackers@postgresql.org
 Subject: Re: [HACKERS] Unicode problems on IRC
 
 Christopher Kings-Lynne wrote:
  Hey guys,
  
  The 'Unicode characters above 0x10000' issue keeps rearing its ugly 
  head in the IRC channel.  I propose that it be fixed, even 
  backported...
  
  This is John Hansen's most recent patch to fix it:
  
  http://archives.postgresql.org/pgsql-patches/2004-11/msg00259.php
  
  And from what I can tell it was committed, then reverted because it 
  wasn't a bug.  It was going to go in for 8.1.
  
  We on the channel are starting to think that it is in fact a bug.  
  There are people with legitimately utf-8 encoded XML documents 
  that they cannot store in PostgreSQL.  Apparently in the distant past, 
  Unicode was limited to 0x10000, but then was extended.
  
  Perhaps we can reopen this case...
 
 Uh, I thought we fixed this another way, by not using 
 Unicode-aware functions for upper/lower/initcap when the 
 locale is C or POSIX. 
 That is backpatched to 8.0.X.  Does that not fix the problem reported?

No. As Andrew said, what this patch does is allow values > 0xFFFF while
at the same time validating the input to make sure it's valid utf8.


... John
 



Re: [HACKERS] Unicode problems on IRC

2005-04-09 Thread Tom Lane
John Hansen [EMAIL PROTECTED] writes:
 That is backpatched to 8.0.X.  Does that not fix the problem reported?

 No, as andrew said, what this patch does, is allow values > 0xFFFF and
 at the same time validates the input to make sure it's valid utf8.

The impression I get is that most of the 'Unicode characters above
0x10000' reports we've seen did not come from people who actually needed
more-than-16-bit Unicode codepoints, but from people who had screwed up
their encoding settings and were trying to tell the backend that Latin1
was Unicode or some such.  So I'm a bit worried that extending the
backend support to full 32-bit Unicode will do more to mask encoding
mistakes than it will do to create needed functionality.

Not that I'm against adding the functionality.  I'm just doubtful that
the reports we've seen really indicate that we need it, or that adding
it will cut down on the incidence of complaints :-(

regards, tom lane



Re: [HACKERS] Unicode problems on IRC

2005-04-09 Thread Bruce Momjian
Tom Lane wrote:
 John Hansen [EMAIL PROTECTED] writes:
  That is backpatched to 8.0.X.  Does that not fix the problem reported?
 
  No, as andrew said, what this patch does, is allow values > 0xFFFF and
  at the same time validates the input to make sure it's valid utf8.
 
 The impression I get is that most of the 'Unicode characters above
 0x10000' reports we've seen did not come from people who actually needed
 more-than-16-bit Unicode codepoints, but from people who had screwed up
 their encoding settings and were trying to tell the backend that Latin1
 was Unicode or some such.  So I'm a bit worried that extending the
 backend support to full 32-bit Unicode will do more to mask encoding
 mistakes than it will do to create needed functionality.

Yes, that was my impression too.

The upper/lower/initcap issue was that some operating systems were
testing Unicode values even if the locale was set to C.  That is fixed in
8.0.2, but I now see this is a different problem.

 Not that I'm against adding the functionality.  I'm just doubtful that
 the reports we've seen really indicate that we need it, or that adding
 it will cut down on the incidence of complaints :-(

Yeah, that was my question too.  I figured Japanese or Chinese users would
be using these longer values, and if they are fine, why are others having
problems?  It would be great to find a test case that actually shows the
failure.

-- 
  Bruce Momjian|  http://candle.pha.pa.us
  pgman@candle.pha.pa.us   |  (610) 359-1001
  +  If your life is a hard drive, |  13 Roberts Road
  +  Christ can be your backup.|  Newtown Square, Pennsylvania 19073



[HACKERS] Unicode problems on IRC

2005-04-04 Thread Christopher Kings-Lynne
Hey guys,
The 'Unicode characters above 0x10000' issue keeps rearing its ugly head 
in the IRC channel.  I propose that it be fixed, even backported...

This is John Hansen's most recent patch to fix it:
http://archives.postgresql.org/pgsql-patches/2004-11/msg00259.php
And from what I can tell it was committed, then reverted because it 
wasn't a bug.  It was going to go in for 8.1.

We on the channel are starting to think that it is in fact a bug.  There 
are people with legitimately utf-8 encoded XML documents that they 
cannot store in PostgreSQL.  Apparently in the distant past, Unicode was 
limited to 0x10000, but then was extended.

Perhaps we can reopen this case...
Chris