Re: [HACKERS] Unicode characters above 0x10000 #2

2004-11-21 Thread John Hansen
Updated patch; disregard the old one, it broke UCS-2.


... John


cvs.diff
Description: cvs.diff



Re: [HACKERS] Unicode characters above 0x10000 #2

2004-11-21 Thread John Hansen
Third time lucky?

The last one broke utf8 <G>

This one works. Too tired, sorry for the inconvenience..

... John


cvs.diff
Description: cvs.diff



[HACKERS] Unicode characters above 0x10000 #2

2004-11-16 Thread John Hansen
I sent this to -patches, but it has not shown up, so I'm resending it to -hackers.
Comments on the matter are welcome so we can get this issue resolved.

Kind Regards,

John Hansen

--
Hello,

Seeing that the limit is still in place, attached is a patch against CVS.


Kind Regards,

John Hansen
Index: src/backend/utils/mb/wchar.c
===================================================================
RCS file: /projects/cvsroot/pgsql/src/backend/utils/mb/wchar.c,v
retrieving revision 1.38
diff -c -r1.38 wchar.c
*** src/backend/utils/mb/wchar.c	17 Sep 2004 21:59:57 -0000	1.38
--- src/backend/utils/mb/wchar.c	16 Nov 2004 04:06:01 -0000
***************
*** 343,348 ****
--- 343,373 ----
  	return (pg_euc_dsplen(s));
  }
  
+ bool isLegalUTF8(const UTF8 *source, int len) {
+     if (pg_utf_mblen(source) > len) return false;
+     UTF8 a;
+     const UTF8 *srcptr = source + pg_utf_mblen(source);
+     switch (pg_utf_mblen(source)) {
+     default: return false;
+     /* Everything else falls through when "true"... */
+     case 6: if ((a = (*--srcptr)) < 0x80 || a > 0xBF) return false;
+     case 5: if ((a = (*--srcptr)) < 0x80 || a > 0xBF) return false;
+     case 4: if ((a = (*--srcptr)) < 0x80 || a > 0xBF) return false;
+     case 3: if ((a = (*--srcptr)) < 0x80 || a > 0xBF) return false;
+     case 2: if ((a = (*--srcptr)) > 0xBF) return false;
+         switch (*source) {
+         /* no fall-through in this inner switch */
+         case 0xE0: if (a < 0xA0) return false; break;
+         case 0xF0: if (a < 0x90) return false; break;
+         case 0xF4: if (a > 0x8F) return false; break;
+         default:   if (a < 0x80) return false;
+         }
+     case 1: if (*source >= 0x80 && *source < 0xC2) return false;
+             if (*source > 0xFD) return false;
+     }
+     return true;
+ }
+ 
  /*
   * convert UTF-8 string to pg_wchar (UCS-2)
   * caller should allocate enough space for to
***************
*** 350,404 ****
   * from not necessarily null terminated.
   */
  static int
! pg_utf2wchar_with_len(const unsigned char *from, pg_wchar *to, int len)
  {
! 	unsigned char c1,
! 				c2,
! 				c3;
! 	int			cnt = 0;
! 
! 	while (len > 0 && *from)
! 	{
! 		if ((*from & 0x80) == 0)
! 		{
! 			*to = *from++;
! 			len--;
! 		}
! 		else if ((*from & 0xe0) == 0xc0 && len >= 2)
! 		{
! 			c1 = *from++ & 0x1f;
! 			c2 = *from++ & 0x3f;
! 			*to = c1 << 6;
! 			*to |= c2;
! 			len -= 2;
! 		}
! 		else if ((*from & 0xe0) == 0xe0 && len >= 3)
! 		{
! 			c1 = *from++ & 0x0f;
! 			c2 = *from++ & 0x3f;
! 			c3 = *from++ & 0x3f;
! 			*to = c1 << 12;
! 			*to |= c2 << 6;
! 			*to |= c3;
! 			len -= 3;
! 		}
! 		else
! 		{
! 			*to = *from++;
! 			len--;
! 		}
! 		to++;
! 		cnt++;
! 	}
! 	*to = 0;
! 	return (cnt);
  }
  
  /*
   * returns the byte length of a UTF-8 word pointed to by s
   */
  int
! pg_utf_mblen(const unsigned char *s)
  {
  	int			len = 1;
  
--- 375,437 ----
   * from not necessarily null terminated.
   */
  static int
! pg_utf2wchar_with_len(const UTF8 *from, pg_wchar *to, int len)
  {
! 	const UTF8 *fromEnd = from + len;
! 	const UTF32 offsetsFromUTF8[6] = { 0x00000000UL, 0x00003080UL, 0x000E2080UL, 0x03C82080UL, 0xFA082080UL, 0x82082080UL };
! 	unsigned int cnt = 0;
! 	while (from < fromEnd) {
! 		UTF32 ch = 0;
! 		unsigned int extraBytesToRead = pg_utf_mblen(from) - 1;
! 		if (from + extraBytesToRead >= fromEnd) {
! 			cnt = 0; break;
! 		}
! 		/* Do this check whether lenient or strict */
! 		if (! isLegalUTF8(from, extraBytesToRead + 1)) {
! 			cnt = 0;
! 			break;
! 		}
! 		/*
! 		 * The cases all fall through. See Note A below.
! 		 */
! 		switch (extraBytesToRead) {
! 			case 5: ch += *from++; ch <<= 6;
! 			case 4: ch += *from++; ch <<= 6;
! 			case 3: ch += *from++; ch <<= 6;
! 			case 2: ch += *from++; ch <<= 6;
! 			case 1: ch += *from++; ch <<= 6;
! 			case 0: ch += *from++;
! 		}
! 		ch -= offsetsFromUTF8[extraBytesToRead];
! 
! 		if (ch <= UNI_MAX_BMP) { /* character is <= 0xFFFF */
! 			if (ch >= UNI_SUR_HIGH_START && ch <= UNI_SUR_LOW_END) {
! 				from -= (extraBytesToRead+1); /* return to the illegal value itself */
! 				cnt = 0;
! 				break;
! 			} else {
! 				*to++ = ch; /* normal case */
! 			}
! 		} else if (ch > UNI_MAX_UTF16) {
! 

Re: [HACKERS] UNICODE characters above 0x10000

2004-08-07 Thread John Hansen
Possibly, since I got it wrong once more.
About to give up, but attached is an updated patch.


Regards,

John Hansen

-Original Message-
From: Oliver Elphick [mailto:[EMAIL PROTECTED] 
Sent: Saturday, August 07, 2004 3:56 PM
To: Tom Lane
Cc: John Hansen; Hackers; Patches
Subject: Re: [HACKERS] UNICODE characters above 0x10000

On Sat, 2004-08-07 at 06:06, Tom Lane wrote:
 Now it's entirely possible that the underlying support is a few bricks

 shy of a load --- for instance I see that pg_utf_mblen thinks there 
 are no UTF8 codes longer than 3 bytes whereas your code goes to 4.  
 I'm not an expert on this stuff, so I don't know what the UTF8 spec 
 actually says.  But I do think you are fixing the code at the wrong
level.

UTF-8 characters can be up to 6 bytes long:
http://www.cl.cam.ac.uk/~mgk25/unicode.html

glibc provides various routines (mb...) for handling Unicode.  How many
of our supported platforms don't have these?  If there are still some
that don't, wouldn't it be better to use the standard routines where
they do exist?

-- 
Oliver Elphick  [EMAIL PROTECTED]
Isle of Wight  http://www.lfix.co.uk/oliver
GPG: 1024D/A54310EA  92C8 39E7 280E 3631 3F0E  1EC0 5664 7A2F A543 10EA
 
 Be still before the LORD and wait patiently for him;
  do not fret when men succeed in their ways, when they
  carry out their wicked schemes. 
Psalms 37:7 





wchar.c.patch
Description: wchar.c.patch



Re: [HACKERS] UNICODE characters above 0x10000

2004-08-07 Thread Dennis Bjorklund
On Sat, 7 Aug 2004, Tom Lane wrote:

 shy of a load --- for instance I see that pg_utf_mblen thinks there are
 no UTF8 codes longer than 3 bytes whereas your code goes to 4.  I'm not
 an expert on this stuff, so I don't know what the UTF8 spec actually
 says.  But I do think you are fixing the code at the wrong level.

I can give some general info about utf-8. This is how it is encoded:

character                encoding
---------                --------
0000     - 007F:         0xxxxxxx
0080     - 07FF:         110xxxxx 10xxxxxx
0800     - FFFF:         1110xxxx 10xxxxxx 10xxxxxx
00010000 - 001FFFFF:     11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
00200000 - 03FFFFFF:     111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
04000000 - 7FFFFFFF:     1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx

If the first byte starts with a 1 then the number of ones gives the 
length of the utf-8 sequence. And the rest of the bytes in the sequence 
always start with 10 (this makes it possible to look anywhere in the 
string and quickly find the start of a character).

This also means that the start byte can never start with 7 or 8 ones; that 
is illegal and should be tested for and rejected. So the longest utf-8 
sequence is 6 bytes (and the longest character needs 4 bytes (or 31 
bits)).
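
For illustration, a minimal sketch of that lead-byte rule in generic C (not
from any patch in this thread; the function name is made up):

static int utf8_seq_len(unsigned char first)
{
    /* Length of a UTF-8 sequence implied by its first byte, per the table
     * above.  Returns 0 for bytes that can never start a sequence
     * (continuation bytes 10xxxxxx, and FE/FF). */
    if ((first & 0x80) == 0x00) return 1;   /* 0xxxxxxx */
    if ((first & 0xe0) == 0xc0) return 2;   /* 110xxxxx */
    if ((first & 0xf0) == 0xe0) return 3;   /* 1110xxxx */
    if ((first & 0xf8) == 0xf0) return 4;   /* 11110xxx */
    if ((first & 0xfc) == 0xf8) return 5;   /* 111110xx */
    if ((first & 0xfe) == 0xfc) return 6;   /* 1111110x */
    return 0;                               /* continuation byte or FE/FF */
}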

-- 
/Dennis Björklund




Re: [HACKERS] UNICODE characters above 0x10000

2004-08-07 Thread John Hansen
Ahh, but that's not the case. You cannot just delete the check, since
not all combinations of bytes are valid UTF8. UTF bytes FE & FF never
appear in a byte sequence, for instance.
UTF8 is more than two bytes btw; up to 6 bytes are used to represent a
UTF8 character.
The 5- and 6-byte characters are currently not in use tho.

I didn't actually notice the difference in UTF8 width between my
original patch and my last, so attached is an updated patch.

Regards,

John Hansen

-Original Message-
From: Tom Lane [mailto:[EMAIL PROTECTED] 
Sent: Saturday, August 07, 2004 3:07 PM
To: John Hansen
Cc: Hackers; Patches
Subject: Re: [HACKERS] UNICODE characters above 0x10000 

John Hansen [EMAIL PROTECTED] writes:
 My apologies for not reading the code properly.

 Attached patch using pg_utf_mblen() instead of an indexed table.
 It now also do bounds checks.

I think you missed my point.  If we don't need this limitation, the
correct patch is simply to delete the whole check (ie, delete lines
827-836 of wchar.c, and for that matter we'd then not need the encoding
local variable).  What's really at stake here is whether anything else
breaks if we do that.  What else, if anything, assumes that UTF
characters are not more than 2 bytes?

Now it's entirely possible that the underlying support is a few bricks
shy of a load --- for instance I see that pg_utf_mblen thinks there are
no UTF8 codes longer than 3 bytes whereas your code goes to 4.  I'm not
an expert on this stuff, so I don't know what the UTF8 spec actually
says.  But I do think you are fixing the code at the wrong level.

regards, tom lane




wchar.c.patch
Description: wchar.c.patch



Re: [HACKERS] UNICODE characters above 0x10000

2004-08-07 Thread Dennis Bjorklund
On Sat, 7 Aug 2004, Tom Lane wrote:

 question at hand is whether we can support 32-bit characters or not ---
 and if not, what's the next bug to fix?

True, and that's hard to just give an answer to. One could do some simple 
testing, make sure regexps work and then treat anything else that might 
not work, as bugs to be fixed later on when found.

The alternative is to inspect all code paths that involve strings, not fun 
at all :-)

My previous mail talked about utf-8 translation. Not all characters
possible to form using utf-8 are assigned by the unicode org. However,
the part that interprets the unicode strings is in the OS, so different
OSes can give different results. So I think pg should just accept even
6-byte utf-8 sequences even if some characters are not currently assigned.

-- 
/Dennis Björklund




Re: [HACKERS] UNICODE characters above 0x10000

2004-08-07 Thread John Hansen
This should do it.

Regards,

John Hansen 

-Original Message-
From: Dennis Bjorklund [mailto:[EMAIL PROTECTED] 
Sent: Saturday, August 07, 2004 5:02 PM
To: Tom Lane
Cc: John Hansen; Hackers; Patches
Subject: Re: [HACKERS] UNICODE characters above 0x10000 

On Sat, 7 Aug 2004, Tom Lane wrote:

 question at hand is whether we can support 32-bit characters or not 
 --- and if not, what's the next bug to fix?

True, and that's hard to just give an answer to. One could do some simple testing, 
make sure regexps work and then treat anything else that might not work, as bugs to be 
fixed later on when found.

The alternative is to inspect all code paths that involve strings, not fun at all :-)

My previous mail talked about utf-8 translation. Not all characters possible to form 
using utf-8 are assigned by the unicode org. However, the part that interprets the 
unicode strings are in the os so different os'es can give different results. So I 
think pg should just accept even 6 byte utf-8 sequences even if some characters are 
not currently assigned.

--
/Dennis Björklund





wchar.c.patch
Description: wchar.c.patch



Re: [PATCHES] [HACKERS] UNICODE characters above 0x10000

2004-08-07 Thread Tatsuo Ishii
 Dennis Bjorklund [EMAIL PROTECTED] writes:
  ... This also means that the start byte can never start with 7 or 8
  ones, that is illegal and should be tested for and rejected. So the
  longest utf-8 sequence is 6 bytes (and the longest character needs 4
  bytes (or 31 bits)).
 
 Tatsuo would know more about this than me, but it looks from here like
 our coding was originally designed to support only 16-bit-wide internal
 characters (ie, 16-bit pg_wchar datatype width).  I believe that the
 regex library limitation here is gone, and that as far as that library
 is concerned we could assume a 32-bit internal character width.  The
 question at hand is whether we can support 32-bit characters or not ---
 and if not, what's the next bug to fix?

pg_wchar is already a 32-bit datatype.  However, I doubt there's
actually a need for 32-bit-wide character sets. Even Unicode only
uses up to 0x0010FFFF, so 24 bits should be enough...
--
Tatsuo Ishii



Re: [PATCHES] [HACKERS] UNICODE characters above 0x10000

2004-08-07 Thread John Hansen
Yes, but the specification allows for 6-byte sequences, or 32-bit
characters.
As Dennis pointed out, just because they're not used doesn't mean we
should not allow them to be stored, since there might be someone using
the high ranges for a private character set, which could very well be
included in the specification some day.

Regards,

John Hansen

-Original Message-
From: Tatsuo Ishii [mailto:[EMAIL PROTECTED] 
Sent: Saturday, August 07, 2004 8:09 PM
To: [EMAIL PROTECTED]
Cc: [EMAIL PROTECTED]; John Hansen; [EMAIL PROTECTED];
[EMAIL PROTECTED]
Subject: Re: [PATCHES] [HACKERS] UNICODE characters above 0x10000 

 Dennis Bjorklund [EMAIL PROTECTED] writes:
  ... This also means that the start byte can never start with 7 or 8 
  ones, that is illegal and should be tested for and rejected. So the 
  longest utf-8 sequence is 6 bytes (and the longest character needs 4

  bytes (or 31 bits)).
 
 Tatsuo would know more about this than me, but it looks from here like

 our coding was originally designed to support only 16-bit-wide 
 internal characters (ie, 16-bit pg_wchar datatype width).  I believe 
 that the regex library limitation here is gone, and that as far as 
 that library is concerned we could assume a 32-bit internal character 
 width.  The question at hand is whether we can support 32-bit 
 characters or not --- and if not, what's the next bug to fix?

pg_wchar is already a 32-bit datatype.  However I doubt there's
actually a need for 32-bit width character sets. Even Unicode only uses
up to 0x0010FFFF, so 24-bit should be enough...
--
Tatsuo Ishii





Re: [PATCHES] [HACKERS] UNICODE characters above 0x10000

2004-08-07 Thread Tatsuo Ishii
 Yes, but the specification allows for 6byte sequences, or 32bit
 characters.

UTF-8 is just an encoding specification, not a character set
specification. Unicode only has 17 256x256 planes in its
specification.

 As dennis pointed out, just because they're not used, doesn't mean we
 should not allow them to be stored, since there might me someone using
 the high ranges for a private character set, which could very well be
 included in the specification some day.

We should expand it to 64-bit since some day the specification might
be changed then :-)

More seriously, Unicode is filled with tons of confusion and
inconsistency IMO. Remember that once Unicode advocates said that the
merit of Unicode was that it only requires 16-bit width. Now they say they
need surrogate pairs and 32-bit width chars...

Anyway, my point is: if the current specification of Unicode only allows
a 24-bit range, why do we need to allow usage against the specification?
--
Tatsuo Ishii



Re: [HACKERS] UNICODE characters above 0x10000

2004-08-07 Thread Christopher Kings-Lynne
Now it's entirely possible that the underlying support is a few bricks
shy of a load --- for instance I see that pg_utf_mblen thinks there are
no UTF8 codes longer than 3 bytes whereas your code goes to 4.  I'm not
an expert on this stuff, so I don't know what the UTF8 spec actually
says.  But I do think you are fixing the code at the wrong level.
Surely there are UTF-8 codes that are at least 3 bytes.  I have a 
_vague_ recollection that you have to keep escaping and escaping to get 
up to like 4 bytes for some asian code points?

Chris


Re: [HACKERS] UNICODE characters above 0x10000

2004-08-07 Thread John Hansen
4 actually,
10FFFF needs four bytes:

11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
10FFFF = 1 0000 1111 1111 1111 1111

Fill in the blanks, starting from the bottom, you get:
11110100 10001111 10111111 10111111
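
A quick sketch to double-check that arithmetic (generic C, not part of any
patch in this thread; encode_utf8() is a made-up name, and surrogates are
not excluded here):

#include <stdio.h>

/* Encode a code point (up to 0x10FFFF) as UTF-8; returns the byte count. */
static int encode_utf8(unsigned int cp, unsigned char *out)
{
    if (cp < 0x80)    { out[0] = cp; return 1; }
    if (cp < 0x800)   { out[0] = 0xc0 | (cp >> 6);
                        out[1] = 0x80 | (cp & 0x3f); return 2; }
    if (cp < 0x10000) { out[0] = 0xe0 | (cp >> 12);
                        out[1] = 0x80 | ((cp >> 6) & 0x3f);
                        out[2] = 0x80 | (cp & 0x3f); return 3; }
    out[0] = 0xf0 | (cp >> 18);
    out[1] = 0x80 | ((cp >> 12) & 0x3f);
    out[2] = 0x80 | ((cp >> 6) & 0x3f);
    out[3] = 0x80 | (cp & 0x3f);
    return 4;
}

int main(void)
{
    unsigned char buf[4];
    int n = encode_utf8(0x10FFFF, buf), i;

    for (i = 0; i < n; i++)
        printf("%02x ", buf[i]);    /* prints: f4 8f bf bf */
    printf("\n");
    return 0;
}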

Regards,

John Hansen 

-Original Message-
From: Christopher Kings-Lynne [mailto:[EMAIL PROTECTED] 
Sent: Saturday, August 07, 2004 8:47 PM
To: Tom Lane
Cc: John Hansen; Hackers; Patches
Subject: Re: [HACKERS] UNICODE characters above 0x10000

 Now it's entirely possible that the underlying support is a few bricks

 shy of a load --- for instance I see that pg_utf_mblen thinks there 
 are no UTF8 codes longer than 3 bytes whereas your code goes to 4.  
 I'm not an expert on this stuff, so I don't know what the UTF8 spec 
 actually says.  But I do think you are fixing the code at the wrong
level.

Surely there are UTF-8 codes that are at least 3 bytes.  I have a
_vague_ recollection that you have to keep escaping and escaping to get
up to like 4 bytes for some asian code points?

Chris






Re: [PATCHES] [HACKERS] UNICODE characters above 0x10000

2004-08-07 Thread Dennis Bjorklund
On Sat, 7 Aug 2004, John Hansen wrote:

 should not allow them to be stored, since there might me someone using
 the high ranges for a private character set, which could very well be
 included in the specification some day.

There are areas reserved for private character sets.

-- 
/Dennis Björklund




Re: [PATCHES] [HACKERS] UNICODE characters above 0x10000

2004-08-07 Thread John Hansen
Well, maybe we'd be better off compiling a list of (in?)valid ranges
from the full unicode database
(http://www.unicode.org/Public/UNIDATA/UnicodeData.txt and
http://www.unicode.org/Public/UNIDATA/Unihan.txt)
and, with every release of pg, updating the detection logic so only valid
characters are allowed?

Regards,

John Hansen

-Original Message-
From: Tatsuo Ishii [mailto:[EMAIL PROTECTED] 
Sent: Saturday, August 07, 2004 8:46 PM
To: John Hansen
Cc: [EMAIL PROTECTED]; [EMAIL PROTECTED]; [EMAIL PROTECTED];
[EMAIL PROTECTED]
Subject: Re: [PATCHES] [HACKERS] UNICODE characters above 0x10000 

 Yes, but the specification allows for 6byte sequences, or 32bit 
 characters.

UTF-8 is just an encoding specification, not character set
specification. Unicode only has 17 256x256 planes in its specification.

 As dennis pointed out, just because they're not used, doesn't mean we 
 should not allow them to be stored, since there might me someone using

 the high ranges for a private character set, which could very well be 
 included in the specification some day.

We should expand it to 64-bit since some day the specification might be
changed then:-)

More seriously, Unicode is filled with tons of confusion and
inconsistency IMO. Remember that once Unicode adovocates said that the
merit of Unicode was it only requires 16-bit width. Now they say they
need surrogate pairs and 32-bit width chars...

Anyway my point is if current specification of Unicode only allows
24-bit range, why we need to allow usage against the specification?
--
Tatsuo Ishii





Re: [PATCHES] [HACKERS] UNICODE characters above 0x10000

2004-08-07 Thread Dennis Bjorklund
On Sat, 7 Aug 2004, Tatsuo Ishii wrote:

 More seriously, Unicode is filled with tons of confusion and
 inconsistency IMO. Remember that once Unicode adovocates said that the
 merit of Unicode was it only requires 16-bit width. Now they say they
 need surrogate pairs and 32-bit width chars...
 
 Anyway my point is if current specification of Unicode only allows
 24-bit range, why we need to allow usage against the specification?

Whatever problems they have had in the past, ISO 10646 formally defines
a 31-bit character set. Are you saying that applications should
reject strings that contain characters they do not recognize?

Is there a specific reason you want to restrict it to 24 bits? In practice 
it does not matter much since it's not used today, I just don't know why 
you want it.

-- 
/Dennis Björklund




Re: [PATCHES] [HACKERS] UNICODE characters above 0x10000

2004-08-07 Thread John Hansen
Yeah, I know.

10 - 10 : 2 separate planes iirc

... John 

-Original Message-
From: Dennis Bjorklund [mailto:[EMAIL PROTECTED] 
Sent: Saturday, August 07, 2004 9:06 PM
To: John Hansen
Cc: Tatsuo Ishii; [EMAIL PROTECTED]; [EMAIL PROTECTED]; [EMAIL PROTECTED]
Subject: RE: [PATCHES] [HACKERS] UNICODE characters above 0x10000 

On Sat, 7 Aug 2004, John Hansen wrote:

 should not allow them to be stored, since there might me someone using 
 the high ranges for a private character set, which could very well be 
 included in the specification some day.

There are areas reserved for private character sets.

--
/Dennis Björklund






Re: [PATCHES] [HACKERS] UNICODE characters above 0x10000

2004-08-07 Thread Dennis Bjorklund
On Sat, 7 Aug 2004, Takehiko Abe wrote:

It looked like you sent the last mail only to me and not the list. I 
assume it was a mistake, so I send the reply to both.

  Is there a specific reason you want to restrict it to 24 bits?
 
 ISO 10646 is said to have removed its private use codepoints outside of
 the Unicode 0 - 10FFFF range to ensure the compatibility with Unicode.
 
 see Section C.2 and C.3 of Unicode 4.0 Appendix C Relationship to ISO
 10646: http://www.unicode.org/versions/Unicode4.0.0/appC.pdf.

The one and only reason for allowing 31 bit is that it's defined by iso
10646. In practice there is probably no one that uses the upper part of
10646 so not supporting it will most likely not hurt anyone.

I'm happy either way so I will put my voice on letting PG use unicode (not
ISO 10646) and restrict it to 24 bits. By the time someone wants (if ever)
iso 10646 we probably have support for different charsets and can easily
handle both at the same time.

-- 
/Dennis Björklund




Re: [PATCHES] [HACKERS] UNICODE characters above 0x10000

2004-08-07 Thread John Hansen
 -Original Message-
 From: [EMAIL PROTECTED] 
 [mailto:[EMAIL PROTECTED] On Behalf Of 
 Dennis Bjorklund
 Sent: Saturday, August 07, 2004 10:48 PM
 To: Takehiko Abe
 Cc: [EMAIL PROTECTED]
 Subject: Re: [PATCHES] [HACKERS] UNICODE characters above 0x10000
 
 On Sat, 7 Aug 2004, Takehiko Abe wrote:
 
 It looked like you sent the last mail only to me and not the 
 list. I assume it was a misstake and I send the reply to both.
 
   Is there a specific reason you want to restrict it to 24 bits?
  
  ISO 10646 is said to have removed its private use codepoints outside 
  of the Unicode 0 - 10FFFF range to ensure the compatibility with Unicode.
  
  see Section C.2 and C.3 of Unicode 4.0 Appendix C 
 Relationship to ISO
  10646: http://www.unicode.org/versions/Unicode4.0.0/appC.pdf.
 
 The one and only reason for allowing 31 bit is that it's 
 defined by iso 10646. In practice there is probably no one 
 that uses the upper part of
 10646 so not supporting it will most likely not hurt anyone.
   
   
 I'm happy either way so I will put my voice on letting PG use 
 unicode (not ISO 10646) and restrict it to 24 bits. By the 
 time someone wants (if ever) iso 10646 we probably have 
 support for different charsets and can easily handle both at 
 the same time.
 

Point taken, 
since we're supporting UTF8 and not ISO 10646.

Now, is it really 24 bits tho? 
Afaict, it's really 21 (0 - 10FFFF, or 0 - xxx1 0000 1111 1111 1111 1111)

This would require that we support 4-byte sequences
(1111 0100 1000 1111 1011 1111 1011 1111 = 10FFFF)

 --
 /Dennis Björklund
 
 
 
 


Regards,

John Hansen



Re: [PATCHES] [HACKERS] UNICODE characters above 0x10000

2004-08-07 Thread Dennis Bjorklund
On Sat, 7 Aug 2004, John Hansen wrote:

  Now, is it really 24 bits tho? 
  Afaict, it's really 21 (0 - 10FFFF, or 0 - xxx1 0000 1111 1111 1111 1111)
 
 Yes, up to 0x10FFFF should be enough.

The 24 is not really important, this is all about what utf-8 strings to 
accept as input. The strings are stored as utf-8 strings and when 
processed inside pg it uses wchar_t that is 32 bit (on some systems at 
least). By restricting the utf-8 input to unicode we can in the future 
store each character as 3 bytes if we want.

--
/Dennis Björklund




Re: [PATCHES] [HACKERS] UNICODE characters above 0x10000

2004-08-07 Thread John Hansen
 -Original Message-
 From: Dennis Bjorklund [mailto:[EMAIL PROTECTED] 
 Sent: Saturday, August 07, 2004 11:23 PM
 To: John Hansen
 Cc: Takehiko Abe; [EMAIL PROTECTED]
 Subject: RE: [PATCHES] [HACKERS] UNICODE characters above 0x10000
 
 On Sat, 7 Aug 2004, John Hansen wrote:
 
  Now, is it really 24 bits tho? 
  Afaict, it's really 21 (0 - 10 or 0 - xxx1  
 )
 
 Yes, up to 0x10FFFF should be enough.
 
 The 24 is not really important, this is all about what utf-8 
 strings to accept as input. The strings are stored as utf-8 
 strings and when processed inside pg it uses wchar_t that is 
 32 bit (on some systems at least). By restricting the utf-8 
 input to unicode we can in the future store each character as 
 3 bytes if we want.

Which brings us back to something like the attached...

 
 --
 /Dennis Björklund
 
 
 

Regards,

John Hansen


wchar.c.patch
Description: wchar.c.patch



Re: [PATCHES] [HACKERS] UNICODE characters above 0x10000

2004-08-07 Thread Tom Lane
Dennis Bjorklund [EMAIL PROTECTED] writes:
 On Sat, 7 Aug 2004, Tatsuo Ishii wrote:
 Anyway my point is if current specification of Unicode only allows
 24-bit range, why we need to allow usage against the specification?

 Is there a specific reason you want to restrict it to 24 bits?

I see several places that have to allocate space on the basis of the
maximum encoded character length possible in the current encoding
(look for uses of pg_database_encoding_max_length).  Probably the only
one that's really significant for performance is text_substr(), but
that's enough to be an argument against setting maxmblen higher than
we have to.

It looks to me like supporting 4-byte UTF-8 characters would be enough
to handle the existing range of Unicode codepoints, and that is probably
as much as we want to do.

If I understood what I was reading, this would take several things:
* Remove the special UTF-8 check in pg_verifymbstr;
* Extend pg_utf2wchar_with_len and pg_utf_mblen to handle the 4-byte case;
* Set maxmblen to 4 in the pg_wchar_table[] entry for UTF-8.

Are there any other places that would have to change?  Would this break
anything?  The testing aspect is what's bothering me at the moment.
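
Purely as an illustration of the second item above, a 4-byte decode step
could look roughly like this (a standalone sketch with a stand-in type name,
not a tested change to wchar.c):

typedef unsigned int pg_wchar_sketch;   /* stand-in for pg_wchar */

/* Decode one already-validated 4-byte sequence
 * (11110xxx 10xxxxxx 10xxxxxx 10xxxxxx) into a code point. */
static pg_wchar_sketch utf8_decode_4byte(const unsigned char *from)
{
    return ((pg_wchar_sketch) (from[0] & 0x07) << 18) |
           ((pg_wchar_sketch) (from[1] & 0x3f) << 12) |
           ((pg_wchar_sketch) (from[2] & 0x3f) << 6)  |
           ((pg_wchar_sketch) (from[3] & 0x3f));
}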

regards, tom lane



Re: [PATCHES] [HACKERS] UNICODE characters above 0x10000

2004-08-07 Thread John Hansen
 -Original Message-
 From: Tom Lane [mailto:[EMAIL PROTECTED] 
 Sent: Sunday, August 08, 2004 2:43 AM
 To: Dennis Bjorklund
 Cc: Tatsuo Ishii; John Hansen; [EMAIL PROTECTED]; 
 [EMAIL PROTECTED]
 Subject: Re: [PATCHES] [HACKERS] UNICODE characters above 0x10000 
 
 Dennis Bjorklund [EMAIL PROTECTED] writes:
  On Sat, 7 Aug 2004, Tatsuo Ishii wrote:
  Anyway my point is if current specification of Unicode only allows 
  24-bit range, why we need to allow usage against the specification?
 
  Is there a specific reason you want to restrict it to 24 bits?
 
 I see several places that have to allocate space on the basis 
 of the maximum encoded character length possible in the 
 current encoding (look for uses of 
 pg_database_encoding_max_length).  Probably the only one 
 that's really significant for performance is text_substr(), 
 but that's enough to be an argument against setting maxmblen 
 higher than we have to.
 
 It looks to me like supporting 4-byte UTF-8 characters would 
 be enough to handle the existing range of Unicode codepoints, 
 and that is probably as much as we want to do.
 
 If I understood what I was reading, this would take several things:
 * Remove the special UTF-8 check in pg_verifymbstr;

I strongly disagree; this would mean one could store any sequence of
characters in the db, as long as the bytes are above 0x80. This would
not be valid utf8, and would leave the data in an inconsistent state.
Setting the client encoding to unicode implies that this is what we're
going to feed the database, and should guarantee that what comes out of
a select is valid utf8. We can make sure of that by doing the check
before it's inserted.

 * Extend pg_utf2wchar_with_len and pg_utf_mblen to handle the 4-byte
case;

pg_utf_mblen should handle any case according to the specification.
Currently, it will return 3 even for 4-, 5-, and 6-byte sequences. In those
places where pg_utf_mblen is called, we should check that
the length is between 1 and 4 inclusive, and that the sequence is valid.
This is what I made the patch for.
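
A rough sketch of that caller-side check in generic C (illustration only;
the sequence length is passed in rather than taken from pg_utf_mblen):

#include <stdbool.h>

/* Accept only 1..4 byte sequences that fit in the remaining input and
 * whose trailing bytes all have the 10xxxxxx continuation pattern. */
static bool utf8_seq_ok(const unsigned char *s, int avail, int seq_len)
{
    int i;

    if (seq_len < 1 || seq_len > 4 || seq_len > avail)
        return false;
    for (i = 1; i < seq_len; i++)
        if ((s[i] & 0xc0) != 0x80)
            return false;
    return true;
}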

 * Set maxmblen to 4 in the pg_wchar_table[] entry for UTF-8.

That I have no problem with.

 Are there any other places that would have to change?  Would 
 this break anything?  The testing aspect is what's bothering 
 me at the moment.
 
   regards, tom lane
 
 

Just my $0.02 worth,

Kind Regards,

John Hansen



Re: [HACKERS] UNICODE characters above 0x10000

2004-08-07 Thread John Hansen
 -Original Message-
 From: Oliver Elphick [mailto:[EMAIL PROTECTED] 
 Sent: Sunday, August 08, 2004 7:43 AM
 To: Tom Lane
 Cc: John Hansen; Hackers; Patches
 Subject: Re: [HACKERS] UNICODE characters above 0x10000
 
 On Sat, 2004-08-07 at 07:10, Tom Lane wrote:
  Oliver Elphick [EMAIL PROTECTED] writes:
   glibc provides various routines (mb...) for handling Unicode.  How

   many of our supported platforms don't have these?
  
  Every one that doesn't use glibc.  Don't bother proposing a
glibc-only 
  solution (and that's from someone who works for a glibc-only
company; 
  you don't even want to think about the push-back you'll get from
other 
  quarters).
 
 No. that's not what I was proposing.  My suggestion was to 
 use these routines if they are sufficiently widely 
 implemented, and our own routines where standard ones are not 
 available.
 
 The man page for mblen says
 CONFORMING TO
ISO/ANSI C, UNIX98
 
 Is glibc really the only C library to conform?
 
 If using the mb... routines isn't feasible, IBM's ICU library
 (http://oss.software.ibm.com/icu/) is available under the X 
 licence, which is compatible with BSD as far as I can see.  
 Besides character conversion, ICU can also do collation in 
 various locales and encodings. 
 My point is, we shouldn't be writing a new set of routines to 
 do half a job if there are already libraries available to do 
 all of it.
 

This sounds like a brilliant move, if anything.

 -- 
 Oliver Elphick  
 [EMAIL PROTECTED]
 Isle of Wight  
 http://www.lfix.co.uk/oliver
 GPG: 1024D/A54310EA  92C8 39E7 280E 3631 3F0E  1EC0 5664 7A2F 
 A543 10EA
  
  Be still before the LORD and wait patiently for him;
   do not fret when men succeed in their ways, when they
   carry out their wicked schemes. 
 Psalms 37:7 
 
 
 

Kind Regards,

John Hansen




Re: [PATCHES] [HACKERS] UNICODE characters above 0x10000

2004-08-07 Thread Oliver Jowett
Tom Lane wrote:
If I understood what I was reading, this would take several things:
* Remove the special UTF-8 check in pg_verifymbstr;
* Extend pg_utf2wchar_with_len and pg_utf_mblen to handle the 4-byte case;
* Set maxmblen to 4 in the pg_wchar_table[] entry for UTF-8.
Are there any other places that would have to change?  Would this break
anything?  The testing aspect is what's bothering me at the moment.
Does this change what client_encoding = UNICODE might produce? The JDBC 
driver will need some tweaking to handle this -- Java uses UTF-16 
internally and I think some supplementary character (?) scheme for 
values above 0xFFFF as of JDK 1.5.

-O


Re: [PATCHES] [HACKERS] UNICODE characters above 0x10000

2004-08-07 Thread Tom Lane
Oliver Jowett [EMAIL PROTECTED] writes:
 Does this change what client_encoding = UNICODE might produce? The JDBC 
 driver will need some tweaking to handle this -- Java uses UTF-16 
 internally and I think some supplementary character (?) scheme for 
 values above 0xFFFF as of JDK 1.5.

You're not likely to get out anything you didn't put in, so I'm not sure
it matters.

regards, tom lane



Re: [PATCHES] [HACKERS] UNICODE characters above 0x10000

2004-08-07 Thread Tatsuo Ishii
 Tom Lane wrote:
 
  If I understood what I was reading, this would take several things:
  * Remove the special UTF-8 check in pg_verifymbstr;
  * Extend pg_utf2wchar_with_len and pg_utf_mblen to handle the 4-byte case;
  * Set maxmblen to 4 in the pg_wchar_table[] entry for UTF-8.
  
  Are there any other places that would have to change?  Would this break
  anything?  The testing aspect is what's bothering me at the moment.
 
 Does this change what client_encoding = UNICODE might produce? The JDBC 
 driver will need some tweaking to handle this -- Java uses UTF-16 
 internally and I think some supplementary character (?) scheme for 
 values above 0xFFFF as of JDK 1.5.

Java doesn't handle UCS above 0xFFFF? I didn't know that. As long as
you put data in/out via JDBC, it shouldn't be a problem. However if other
APIs put in such data, you will get into trouble...
--
Tatsuo Ishii



Re: [HACKERS] UNICODE characters above 0x10000

2004-08-07 Thread Tom Lane
John Hansen [EMAIL PROTECTED] writes:
 Ahh, but that's not the case. You cannot just delete the check, since
 not all combinations of bytes are valid UTF8. UTF bytes FE & FF never
 appear in a byte sequence for instance.

Well, this is still working at the wrong level.  The code that's in
pg_verifymbstr is mainly intended to enforce the *system wide*
assumption that multibyte characters must have the high bit set in
every byte.  (We do not support encodings without this property in
the backend, because it breaks code that looks for ASCII characters
... such as the main parser/lexer ...)  It's not really intended to
check that the multibyte character is actually legal in its encoding.

The special UTF-8 check was never more than a very quick-n-dirty hack
that was in the wrong place to start with.  We ought to be getting rid
of it not institutionalizing it.  If you want an exact encoding-specific
check on the legitimacy of a multibyte sequence, I think the right way
to do it is to add another function pointer to pg_wchar_table entries to
let each encoding have its own check routine.  Perhaps this could be
defined so as to avoid a separate call to pg_mblen inside the loop, and
thereby not add any new overhead.  I'm thinking about an API something
like

int validate_mbchar(const unsigned char *str, int len)

with result +N if a valid character N bytes long is present at
*str, and -N if an invalid character is present at *str and
it would be appropriate to display N bytes in the complaint.
(N must be <= len in either case.)  This would reduce the main
loop of pg_verifymbstr to a call of this function and an
error-case-handling block.
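
To make that concrete, one rough way the UTF-8 instance of such a routine
could look (a sketch of the proposed API only; it checks lead and
continuation bytes, not overlong forms or code-point ranges):

/* Returns +N for a valid N-byte character at *str, or -N where N bytes
 * are worth showing in the error message; N never exceeds len (len >= 1). */
static int utf8_validate_mbchar(const unsigned char *str, int len)
{
    int seq, i;

    if ((*str & 0x80) == 0)         seq = 1;
    else if ((*str & 0xe0) == 0xc0) seq = 2;
    else if ((*str & 0xf0) == 0xe0) seq = 3;
    else if ((*str & 0xf8) == 0xf0) seq = 4;
    else
        return -1;                  /* bad lead byte */

    if (seq > len)
        return -len;                /* truncated sequence at end of input */

    for (i = 1; i < seq; i++)
        if ((str[i] & 0xc0) != 0x80)
            return -(i + 1);        /* show the bytes up to the bad one */

    return seq;
}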

regards, tom lane



Re: [PATCHES] [HACKERS] UNICODE characters above 0x10000

2004-08-07 Thread Oliver Jowett
Tatsuo Ishii wrote:
Tom Lane wrote:

If I understood what I was reading, this would take several things:
* Remove the special UTF-8 check in pg_verifymbstr;
* Extend pg_utf2wchar_with_len and pg_utf_mblen to handle the 4-byte case;
* Set maxmblen to 4 in the pg_wchar_table[] entry for UTF-8.
Are there any other places that would have to change?  Would this break
anything?  The testing aspect is what's bothering me at the moment.
Does this change what client_encoding = UNICODE might produce? The JDBC 
driver will need some tweaking to handle this -- Java uses UTF-16 
internally and I think some supplementary character (?) scheme for 
values above 0xFFFF as of JDK 1.5.

Java doesn't handle UCS above 0x? I didn't know that. As long as
you put in/out JDBC, it shouldn't be a problem. However if other APIs
put in such a data, you will get into trouble...
Internally, Java strings are arrays of UTF-16 values. Before JDK 1.5, 
all the string-manipulation library routines assumed that one code point 
== one UTF-16 value, so you can't represent values above 0xFFFF. The 1.5 
libraries understand using supplementary characters to use multiple 
UTF-16 values per code point. See 
http://java.sun.com/developer/technicalArticles/Intl/Supplementary/

However, the JDBC driver needs to be taught about how to translate 
between UTF-8 representations of code points above 0xFFFF and pairs of 
UTF-16 values. Previously it didn't need to do anything since the server 
didn't use those high values. It's a minor thing..
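
For what it's worth, the arithmetic involved is small; a generic C sketch of
the UTF-16 surrogate split (not JDBC driver code):

/* Split a supplementary code point (0x10000..0x10FFFF) into a
 * UTF-16 surrogate pair. */
static void to_surrogate_pair(unsigned int cp,
                              unsigned short *hi, unsigned short *lo)
{
    cp -= 0x10000;                  /* 20 bits remain */
    *hi = 0xD800 | (cp >> 10);      /* high (lead) surrogate */
    *lo = 0xDC00 | (cp & 0x3FF);    /* low (trail) surrogate */
}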

-O


Re: [HACKERS] UNICODE characters above 0x10000

2004-08-07 Thread John Hansen
 Well, this is still working at the wrong level.  The code 
 that's in pg_verifymbstr is mainly intended to enforce the 
 *system wide* assumption that multibyte characters must have 
 the high bit set in every byte.  (We do not support encodings 
 without this property in the backend, because it breaks code 
 that looks for ASCII characters ... such as the main 
 parser/lexer ...)  It's not really intended to check that the 
 multibyte character is actually legal in its encoding.
 

Ok, point taken.

 The special UTF-8 check was never more than a very 
 quick-n-dirty hack that was in the wrong place to start with. 
  We ought to be getting rid of it not institutionalizing it.  
 If you want an exact encoding-specific check on the 
 legitimacy of a multibyte sequence, I think the right way to 
 do it is to add another function pointer to pg_wchar_table 
 entries to let each encoding have its own check routine.  
 Perhaps this could be defined so as to avoid a separate call 
 to pg_mblen inside the loop, and thereby not add any new 
 overhead.  I'm thinking about an API something like
 
   int validate_mbchar(const unsigned char *str, int len)
 
 with result +N if a valid character N bytes long is present 
 at *str, and -N if an invalid character is present at *str 
 and it would be appropriate to display N bytes in the complaint.
 (N must be <= len in either case.)  This would reduce the 
 main loop of pg_verifymbstr to a call of this function and an 
 error-case-handling block.
 

Sounds like a plan...

   regards, tom lane
 
 

Regards,

John Hansen



Re: [HACKERS] UNICODE characters above 0x10000

2004-08-06 Thread John Hansen
Attached, as promised, is a small patch removing the limitation and adding
correct utf8 validation.

Regards,

John

-Original Message-
From: [EMAIL PROTECTED]
[mailto:[EMAIL PROTECTED] On Behalf Of John Hansen
Sent: Friday, August 06, 2004 2:20 PM
To: 'Hackers'
Subject: [HACKERS] UNICODE characters above 0x10000

I've started work on a patch for this problem.

Doing regression tests at present.

I'll get back when done.


Regards,

John






wchar.c.patch
Description: wchar.c.patch



Re: [HACKERS] UNICODE characters above 0x10000

2004-08-06 Thread Tom Lane
John Hansen [EMAIL PROTECTED] writes:
 Attached, as promised, small patch removing the limitation, adding
 correct utf8 validation.

Surely this is badly broken --- it will happily access data outside the
bounds of the given string.  Also, doesn't pg_mblen already know the
length rules for UTF8?  Why are you duplicating that knowledge?

regards, tom lane



Re: [HACKERS] UNICODE characters above 0x10000

2004-08-06 Thread John Hansen
My apologies for not reading the code properly.

Attached patch using pg_utf_mblen() instead of an indexed table.
It now also does bounds checks.

Regards,

John Hansen

-Original Message-
From: Tom Lane [mailto:[EMAIL PROTECTED] 
Sent: Saturday, August 07, 2004 4:37 AM
To: John Hansen
Cc: Hackers; Patches
Subject: Re: [HACKERS] UNICODE characters above 0x10000 

John Hansen [EMAIL PROTECTED] writes:
 Attached, as promised, small patch removing the limitation, adding 
 correct utf8 validation.

Surely this is badly broken --- it will happily access data outside the
bounds of the given string.  Also, doesn't pg_mblen already know the
length rules for UTF8?  Why are you duplicating that knowledge?

regards, tom lane




wchar.c.patch
Description: wchar.c.patch



Re: [HACKERS] UNICODE characters above 0x10000

2004-08-06 Thread Tom Lane
John Hansen [EMAIL PROTECTED] writes:
 My apologies for not reading the code properly.

 Attached patch using pg_utf_mblen() instead of an indexed table.
 It now also do bounds checks.

I think you missed my point.  If we don't need this limitation, the
correct patch is simply to delete the whole check (ie, delete lines
827-836 of wchar.c, and for that matter we'd then not need the encoding
local variable).  What's really at stake here is whether anything else
breaks if we do that.  What else, if anything, assumes that UTF
characters are not more than 2 bytes?

Now it's entirely possible that the underlying support is a few bricks
shy of a load --- for instance I see that pg_utf_mblen thinks there are
no UTF8 codes longer than 3 bytes whereas your code goes to 4.  I'm not
an expert on this stuff, so I don't know what the UTF8 spec actually
says.  But I do think you are fixing the code at the wrong level.

regards, tom lane



[HACKERS] UNICODE characters above 0x10000

2004-08-05 Thread John Hansen
I've started work on a patch for this problem.

Doing regression tests at present.

I'll get back when done.


Regards,

John

