Re: [PATCHES] [HACKERS] UNICODE characters above 0x10000

2004-08-07 Thread John Hansen
> Well, this is still working at the wrong level.  The code 
> that's in pg_verifymbstr is mainly intended to enforce the 
> *system wide* assumption that multibyte characters must have 
> the high bit set in every byte.  (We do not support encodings 
> without this property in the backend, because it breaks code 
> that looks for ASCII characters ... such as the main 
> parser/lexer ...)  It's not really intended to check that the 
> multibyte character is actually legal in its encoding.
> 

Ok, point taken.

> The "special UTF-8 check" was never more than a very 
> quick-n-dirty hack that was in the wrong place to start with. 
>  We ought to be getting rid of it not institutionalizing it.  
> If you want an exact encoding-specific check on the 
> legitimacy of a multibyte sequence, I think the right way to 
> do it is to add another function pointer to pg_wchar_table 
> entries to let each encoding have its own check routine.  
> Perhaps this could be defined so as to avoid a separate call 
> to pg_mblen inside the loop, and thereby not add any new 
> overhead.  I'm thinking about an API something like
> 
>   int validate_mbchar(const unsigned char *str, int len)
> 
> with result +N if a valid character N bytes long is present 
> at *str, and -N if an invalid character is present at *str 
> and it would be appropriate to display N bytes in the complaint.
> (N must be <= len in either case.)  This would reduce the 
> main loop of pg_verifymbstr to a call of this function and an 
> error-case-handling block.
> 

Sounds like a plan...

>   regards, tom lane
> 
> 

Regards,

John Hansen



Re: [PATCHES] [HACKERS] UNICODE characters above 0x10000

2004-08-07 Thread Oliver Jowett
Tatsuo Ishii wrote:
Tom Lane wrote:

If I understood what I was reading, this would take several things:
* Remove the "special UTF-8 check" in pg_verifymbstr;
* Extend pg_utf2wchar_with_len and pg_utf_mblen to handle the 4-byte case;
* Set maxmblen to 4 in the pg_wchar_table[] entry for UTF-8.
Are there any other places that would have to change?  Would this break
anything?  The testing aspect is what's bothering me at the moment.
Does this change what client_encoding = UNICODE might produce? The JDBC 
driver will need some tweaking to handle this -- Java uses UTF-16 
internally and I think some supplementary character (?) scheme for 
values above 0xFFFF as of JDK 1.5.

Java doesn't handle UCS above 0xFFFF? I didn't know that. As long as
you put it in and out via JDBC, it shouldn't be a problem. However, if
other APIs put in such data, you will get into trouble...
Internally, Java strings are arrays of UTF-16 values. Before JDK 1.5, 
all the string-manipulation library routines assumed that one code point 
== one UTF-16 value, so you can't represent values above 0xFFFF. The 1.5 
libraries understand supplementary characters, using multiple 
UTF-16 values per code point. See 
http://java.sun.com/developer/technicalArticles/Intl/Supplementary/

However, the JDBC driver needs to be taught how to translate 
between UTF-8 representations of code points above 0xFFFF and pairs of 
UTF-16 values. Previously it didn't need to do anything, since the server 
didn't use those high values. It's a minor thing.
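
For reference, the UTF-16 side of that translation is mechanical. A minimal
sketch in C of the standard surrogate-pair arithmetic (the function name is
made up here; the JDBC driver would of course do the equivalent in Java):

    /* Split a code point above 0xFFFF into a UTF-16 surrogate pair,
     * which is the translation described above.  Sketch only; assumes
     * 0x10000 <= cp <= 0x10FFFF. */
    static void
    to_surrogate_pair(unsigned int cp, unsigned short *hi, unsigned short *lo)
    {
        cp -= 0x10000;                                  /* 20 bits remain */
        *hi = (unsigned short) (0xD800 | (cp >> 10));   /* lead surrogate */
        *lo = (unsigned short) (0xDC00 | (cp & 0x3FF)); /* trail surrogate */
    }

For example, U+10000 becomes the pair D800/DC00, and U+10FFFF becomes
DBFF/DFFF.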

-O


Re: [PATCHES] [HACKERS] UNICODE characters above 0x10000

2004-08-07 Thread Tom Lane
"John Hansen" <[EMAIL PROTECTED]> writes:
> Ahh, but that's not the case. You cannot just delete the check, since
> not all combinations of bytes are valid UTF8. UTF bytes FE & FF never
> appear in a byte sequence for instance.

Well, this is still working at the wrong level.  The code that's in
pg_verifymbstr is mainly intended to enforce the *system wide*
assumption that multibyte characters must have the high bit set in
every byte.  (We do not support encodings without this property in
the backend, because it breaks code that looks for ASCII characters
... such as the main parser/lexer ...)  It's not really intended to
check that the multibyte character is actually legal in its encoding.

The "special UTF-8 check" was never more than a very quick-n-dirty hack
that was in the wrong place to start with.  We ought to be getting rid
of it not institutionalizing it.  If you want an exact encoding-specific
check on the legitimacy of a multibyte sequence, I think the right way
to do it is to add another function pointer to pg_wchar_table entries to
let each encoding have its own check routine.  Perhaps this could be
defined so as to avoid a separate call to pg_mblen inside the loop, and
thereby not add any new overhead.  I'm thinking about an API something
like

int validate_mbchar(const unsigned char *str, int len)

with result +N if a valid character N bytes long is present at
*str, and -N if an invalid character is present at *str and
it would be appropriate to display N bytes in the complaint.
(N must be <= len in either case.)  This would reduce the main
loop of pg_verifymbstr to a call of this function and an
error-case-handling block.

regards, tom lane
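
As a minimal sketch, here is what such a check routine might look like for
UTF-8, following the +N/-N convention just described. The function name is
illustrative, and overlong forms and surrogate codepoints are deliberately
not rejected, so this shows the API shape rather than a complete validator:

    /* Returns +N if a valid N-byte UTF-8 character starts at *str,
     * or -N if an invalid sequence starts there and N bytes are
     * appropriate to show in the complaint.  Assumes len >= 1;
     * N <= len in either case. */
    static int
    utf8_validate_mbchar(const unsigned char *str, int len)
    {
        int     mblen,
                i;

        if (str[0] < 0x80)
            return 1;               /* plain ASCII */
        else if ((str[0] & 0xE0) == 0xC0)
            mblen = 2;
        else if ((str[0] & 0xF0) == 0xE0)
            mblen = 3;
        else if ((str[0] & 0xF8) == 0xF0)
            mblen = 4;              /* covers up to U+10FFFF */
        else
            return -1;              /* continuation byte, 5-/6-byte lead,
                                     * or FE/FF: all rejected here */

        if (mblen > len)
            return -len;            /* sequence truncated at end of string */

        for (i = 1; i < mblen; i++)
            if ((str[i] & 0xC0) != 0x80)
                return -(i + 1);    /* include the offending byte */

        return mblen;
    }

The caller's loop then reduces to calling this, advancing by the result when
it is positive, and reporting |result| bytes otherwise.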



Re: [PATCHES] [HACKERS] UNICODE characters above 0x10000

2004-08-07 Thread Tatsuo Ishii
> Tom Lane wrote:
> 
> > If I understood what I was reading, this would take several things:
> > * Remove the "special UTF-8 check" in pg_verifymbstr;
> > * Extend pg_utf2wchar_with_len and pg_utf_mblen to handle the 4-byte case;
> > * Set maxmblen to 4 in the pg_wchar_table[] entry for UTF-8.
> > 
> > Are there any other places that would have to change?  Would this break
> > anything?  The testing aspect is what's bothering me at the moment.
> 
> Does this change what client_encoding = UNICODE might produce? The JDBC 
> driver will need some tweaking to handle this -- Java uses UTF-16 
> internally and I think some supplementary character (?) scheme for 
> values above 0xFFFF as of JDK 1.5.

Java doesn't handle UCS above 0xFFFF? I didn't know that. As long as
you put it in and out via JDBC, it shouldn't be a problem. However, if
other APIs put in such data, you will get into trouble...
--
Tatsuo Ishii



Re: [PATCHES] [HACKERS] UNICODE characters above 0x10000

2004-08-07 Thread Tom Lane
Oliver Jowett <[EMAIL PROTECTED]> writes:
> Does this change what client_encoding = UNICODE might produce? The JDBC 
> driver will need some tweaking to handle this -- Java uses UTF-16 
> internally and I think some supplementary character (?) scheme for 
> values above 0xFFFF as of JDK 1.5.

You're not likely to get out anything you didn't put in, so I'm not sure
it matters.

regards, tom lane



Re: [PATCHES] [HACKERS] UNICODE characters above 0x10000

2004-08-07 Thread Oliver Jowett
Tom Lane wrote:
If I understood what I was reading, this would take several things:
* Remove the "special UTF-8 check" in pg_verifymbstr;
* Extend pg_utf2wchar_with_len and pg_utf_mblen to handle the 4-byte case;
* Set maxmblen to 4 in the pg_wchar_table[] entry for UTF-8.
Are there any other places that would have to change?  Would this break
anything?  The testing aspect is what's bothering me at the moment.
Does this change what client_encoding = UNICODE might produce? The JDBC 
driver will need some tweaking to handle this -- Java uses UTF-16 
internally and I think some supplementary character (?) scheme for 
values above 0x as of JDK 1.5.

-O


Re: [PATCHES] [HACKERS] UNICODE characters above 0x10000

2004-08-07 Thread John Hansen
> -Original Message-
> From: Oliver Elphick [mailto:[EMAIL PROTECTED] 
> Sent: Sunday, August 08, 2004 7:43 AM
> To: Tom Lane
> Cc: John Hansen; Hackers; Patches
> Subject: Re: [HACKERS] UNICODE characters above 0x10000
> 
> On Sat, 2004-08-07 at 07:10, Tom Lane wrote:
> > Oliver Elphick <[EMAIL PROTECTED]> writes:
> > > glibc provides various routines (mb...) for handling Unicode.  How
> > > many of our supported platforms don't have these?
> > 
> > Every one that doesn't use glibc.  Don't bother proposing a glibc-only
> > solution (and that's from someone who works for a glibc-only company;
> > you don't even want to think about the push-back you'll get from other
> > quarters).
> 
> No, that's not what I was proposing.  My suggestion was to 
> use these routines if they are sufficiently widely 
> implemented, and our own routines where standard ones are not 
> available.
> 
> The man page for mblen says
> "CONFORMING TO
>ISO/ANSI C, UNIX98"
> 
> Is glibc really the only C library to conform?
> 
> If using the mb... routines isn't feasible, IBM's ICU library
> (http://oss.software.ibm.com/icu/) is available under the X 
> licence, which is compatible with BSD as far as I can see.  
> Besides character conversion, ICU can also do collation in 
> various locales and encodings. 
> My point is, we shouldn't be writing a new set of routines to 
> do half a job if there are already libraries available to do 
> all of it.
> 

This sounds like a brilliant move, if anything.

> -- 
> Oliver Elphick  [EMAIL PROTECTED]
> Isle of Wight  http://www.lfix.co.uk/oliver
> GPG: 1024D/A54310EA  92C8 39E7 280E 3631 3F0E  1EC0 5664 7A2F A543 10EA
> 
>  "Be still before the LORD and wait patiently for him;
>   do not fret when men succeed in their ways, when they
>   carry out their wicked schemes." 
> Psalms 37:7 
> 
> 
> 

Kind Regards,

John Hansen




Re: [PATCHES] [HACKERS] UNICODE characters above 0x10000

2004-08-07 Thread John Hansen
> -Original Message-
> From: Tom Lane [mailto:[EMAIL PROTECTED] 
> Sent: Sunday, August 08, 2004 2:43 AM
> To: Dennis Bjorklund
> Cc: Tatsuo Ishii; John Hansen; [EMAIL PROTECTED]; 
> [EMAIL PROTECTED]
> Subject: Re: [PATCHES] [HACKERS] UNICODE characters above 0x10000 
> 
> Dennis Bjorklund <[EMAIL PROTECTED]> writes:
> > On Sat, 7 Aug 2004, Tatsuo Ishii wrote:
> >> Anyway my point is: if the current specification of Unicode only allows 
> >> a 24-bit range, why do we need to allow usage against the specification?
> 
> > Is there a specific reason you want to restrict it to 24 bits?
> 
> I see several places that have to allocate space on the basis 
> of the maximum encoded character length possible in the 
> current encoding (look for uses of 
> pg_database_encoding_max_length).  Probably the only one 
> that's really significant for performance is text_substr(), 
> but that's enough to be an argument against setting maxmblen 
> higher than we have to.
> 
> It looks to me like supporting 4-byte UTF-8 characters would 
> be enough to handle the existing range of Unicode codepoints, 
> and that is probably as much as we want to do.
> 
> If I understood what I was reading, this would take several things:
> * Remove the "special UTF-8 check" in pg_verifymbstr;

I strongly disagree; this would mean one could store any sequence of
bytes in the db, as long as the bytes are above 0x80. That would not
be valid utf8 and would leave the data in an inconsistent state.
Setting the client encoding to unicode implies that this is what we're
going to feed the database, and should guarantee that what comes out of
a select is valid utf8. We can make sure of that by doing the check
before it's inserted.

> * Extend pg_utf2wchar_with_len and pg_utf_mblen to handle the 4-byte
> case;

pg_utf_mblen should handle any case according to the specification.
Currently, it will return 3 even for 4-, 5-, and 6-byte sequences. In
those places where pg_utf_mblen is called, we should check that the
length is between 1 and 4 inclusive and that the sequence is valid.
This is what I made the patch for.

> * Set maxmblen to 4 in the pg_wchar_table[] entry for UTF-8.

That I have no problem with.

> Are there any other places that would have to change?  Would 
> this break anything?  The testing aspect is what's bothering 
> me at the moment.
> 
>   regards, tom lane
> 
> 

Just my $0.02 worth,

Kind Regards,

John Hansen



Re: [PATCHES] [HACKERS] UNICODE characters above 0x10000

2004-08-07 Thread Oliver Elphick
On Sat, 2004-08-07 at 07:10, Tom Lane wrote:
> Oliver Elphick <[EMAIL PROTECTED]> writes:
> > glibc provides various routines (mb...) for handling Unicode.  How many
> > of our supported platforms don't have these?
> 
> Every one that doesn't use glibc.  Don't bother proposing a glibc-only
> solution (and that's from someone who works for a glibc-only company;
> you don't even want to think about the push-back you'll get from other
> quarters).

No, that's not what I was proposing.  My suggestion was to use these
routines if they are sufficiently widely implemented, and our own
routines where standard ones are not available.

The man page for mblen says
"CONFORMING TO
   ISO/ANSI C, UNIX98"

Is glibc really the only C library to conform?

If using the mb... routines isn't feasible, IBM's ICU library
(http://oss.software.ibm.com/icu/) is available under the X licence,
which is compatible with BSD as far as I can see.  Besides character
conversion, ICU can also do collation in various locales and encodings. 
My point is, we shouldn't be writing a new set of routines to do half a
job if there are already libraries available to do all of it.
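
As a concrete illustration of that standard-C route, a small self-contained
program that walks a string with mblen(); the locale name is an assumption,
and whether this behaves usefully is exactly the portability question raised
above:

    #include <locale.h>
    #include <stdio.h>
    #include <stdlib.h>

    int
    main(void)
    {
        const char *s = "abc\xC3\xA9";  /* "abc" plus U+00E9 in UTF-8 */
        int         n;

        /* The locale name is assumed to exist on the system. */
        if (setlocale(LC_CTYPE, "en_US.UTF-8") == NULL)
            return 1;

        mblen(NULL, 0);                 /* reset any shift state */
        while (*s)
        {
            n = mblen(s, MB_CUR_MAX);   /* bytes in the next character */
            if (n <= 0)
            {
                printf("invalid multibyte sequence\n");
                return 1;
            }
            printf("character of %d byte(s)\n", n);
            s += n;
        }
        return 0;
    }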

-- 
Oliver Elphick  [EMAIL PROTECTED]
Isle of Wight  http://www.lfix.co.uk/oliver
GPG: 1024D/A54310EA  92C8 39E7 280E 3631 3F0E  1EC0 5664 7A2F A543 10EA
 
 "Be still before the LORD and wait patiently for him;
  do not fret when men succeed in their ways, when they
  carry out their wicked schemes." 
Psalms 37:7 


---(end of broadcast)---
TIP 3: if posting/reading through Usenet, please send an appropriate
  subscribe-nomail command to [EMAIL PROTECTED] so that your
  message can get through to the mailing list cleanly


Re: [PATCHES] [HACKERS] UNICODE characters above 0x10000

2004-08-07 Thread Tom Lane
Dennis Bjorklund <[EMAIL PROTECTED]> writes:
> On Sat, 7 Aug 2004, Tatsuo Ishii wrote:
>> Anyway my point is: if the current specification of Unicode only allows
>> a 24-bit range, why do we need to allow usage against the specification?

> Is there a specific reason you want to restrict it to 24 bits?

I see several places that have to allocate space on the basis of the
maximum encoded character length possible in the current encoding
(look for uses of pg_database_encoding_max_length).  Probably the only
one that's really significant for performance is text_substr(), but
that's enough to be an argument against setting maxmblen higher than
we have to.

It looks to me like supporting 4-byte UTF-8 characters would be enough
to handle the existing range of Unicode codepoints, and that is probably
as much as we want to do.

If I understood what I was reading, this would take several things:
* Remove the "special UTF-8 check" in pg_verifymbstr;
* Extend pg_utf2wchar_with_len and pg_utf_mblen to handle the 4-byte case;
* Set maxmblen to 4 in the pg_wchar_table[] entry for UTF-8.

Are there any other places that would have to change?  Would this break
anything?  The testing aspect is what's bothering me at the moment.

regards, tom lane
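
As a sketch of the second item, here is roughly what the 4-byte case could
look like in a pg_utf2wchar-style decode step. The names are stand-ins, not
the actual wchar.c code, and overlong forms are not detected:

    typedef unsigned int sketch_wchar; /* stand-in for the 32-bit pg_wchar */

    /* Decode one UTF-8 sequence of mblen bytes into a 32-bit value,
     * including the 4-byte case.  Assumes the sequence has already
     * been length- and continuation-checked. */
    static sketch_wchar
    utf8_decode_char(const unsigned char *s, int mblen)
    {
        switch (mblen)
        {
            case 2:
                return ((s[0] & 0x1F) << 6) | (s[1] & 0x3F);
            case 3:
                return ((s[0] & 0x0F) << 12) | ((s[1] & 0x3F) << 6)
                    | (s[2] & 0x3F);
            case 4:                 /* the new case: up to U+10FFFF */
                return ((s[0] & 0x07) << 18) | ((s[1] & 0x3F) << 12)
                    | ((s[2] & 0x3F) << 6) | (s[3] & 0x3F);
            default:
                return s[0];        /* single-byte ASCII */
        }
    }

Decoding F4 8F BF BF with this gives 0x10FFFF, the top of the range under
discussion.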



Re: [PATCHES] [HACKERS] UNICODE characters above 0x10000

2004-08-07 Thread John Hansen
> -Original Message-
> From: Dennis Bjorklund [mailto:[EMAIL PROTECTED] 
> Sent: Saturday, August 07, 2004 11:23 PM
> To: John Hansen
> Cc: Takehiko Abe; [EMAIL PROTECTED]
> Subject: RE: [PATCHES] [HACKERS] UNICODE characters above 0x10000
> 
> On Sat, 7 Aug 2004, John Hansen wrote:
> 
> > Now, is it really 24 bits tho? 
> > Afaict, it's really 21 (0 - 10FFFF, or 0 - 1 0000 1111 1111 1111 1111
> > in binary)
> 
> Yes, up to 0x10FFFF should be enough.
> 
> The 24 is not really important, this is all about what utf-8 
> strings to accept as input. The strings are stored as utf-8 
> strings and when processed inside pg it uses wchar_t that is 
> 32 bit (on some systems at least). By restricting the utf-8 
> input to unicode we can in the future store each character as 
> 3 bytes if we want.

Which brings us back to something like the attached...

> 
> --
> /Dennis Björklund
> 
> 
> 

Regards,

John Hansen


wchar.c.patch
Description: wchar.c.patch



Re: [PATCHES] [HACKERS] UNICODE characters above 0x10000

2004-08-07 Thread John Hansen
Yeah, I know.

F0000 - FFFFF, 100000 - 10FFFF : 2 separate planes iirc

... John 

-Original Message-
From: Dennis Bjorklund [mailto:[EMAIL PROTECTED] 
Sent: Saturday, August 07, 2004 9:06 PM
To: John Hansen
Cc: Tatsuo Ishii; [EMAIL PROTECTED]; [EMAIL PROTECTED]; [EMAIL PROTECTED]
Subject: RE: [PATCHES] [HACKERS] UNICODE characters above 0x10000 

On Sat, 7 Aug 2004, John Hansen wrote:

> should not allow them to be stored, since there might be someone using 
> the high ranges for a private character set, which could very well be 
> included in the specification some day.

There are areas reserved for private character sets.

--
/Dennis Björklund






Re: [PATCHES] [HACKERS] UNICODE characters above 0x10000

2004-08-07 Thread Dennis Bjorklund
On Sat, 7 Aug 2004, Tatsuo Ishii wrote:

> More seriously, Unicode is filled with tons of confusion and
> inconsistency IMO. Remember that once Unicode advocates said that the
> merit of Unicode was that it only requires 16-bit width. Now they say
> they need surrogate pairs and 32-bit width chars...
> 
> Anyway my point is: if the current specification of Unicode only allows
> a 24-bit range, why do we need to allow usage against the specification?

Whatever problems they have had in the past, ISO 10646 formally defines
a 31-bit character set. Are you saying that applications should
reject strings that contain characters that they do not recognize?

Is there a specific reason you want to restrict it to 24 bits? In practice 
it does not matter much since it's not used today; I just don't know why 
you want it.

-- 
/Dennis Björklund




Re: [PATCHES] [HACKERS] UNICODE characters above 0x10000

2004-08-07 Thread John Hansen
Well, maybe we'd be better off compiling a list of (in?)valid ranges
from the full unicode database 
(http://www.unicode.org/Public/UNIDATA/UnicodeData.txt and
http://www.unicode.org/Public/UNIDATA/Unihan.txt)
and, with every release of pg, updating the detection logic so only valid
characters are allowed?
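
A sketch of how such a generated list might be consulted: a sorted table of
codepoint ranges plus a binary search. The two ranges below are placeholders
for illustration, not an extract from the actual Unicode data files:

    typedef struct
    {
        unsigned int first;
        unsigned int last;
    } cp_range;

    /* Would be generated from UnicodeData.txt/Unihan.txt at release time. */
    static const cp_range valid_ranges[] = {
        {0x000000, 0x00D7FF},       /* placeholder range */
        {0x00E000, 0x10FFFF},       /* placeholder range skipping surrogates */
    };

    static int
    codepoint_is_valid(unsigned int cp)
    {
        int lo = 0;
        int hi = (int) (sizeof(valid_ranges) / sizeof(valid_ranges[0])) - 1;

        while (lo <= hi)
        {
            int mid = (lo + hi) / 2;

            if (cp < valid_ranges[mid].first)
                hi = mid - 1;
            else if (cp > valid_ranges[mid].last)
                lo = mid + 1;
            else
                return 1;           /* inside a valid range */
        }
        return 0;
    }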

Regards,

John Hansen

-Original Message-
From: Tatsuo Ishii [mailto:[EMAIL PROTECTED] 
Sent: Saturday, August 07, 2004 8:46 PM
To: John Hansen
Cc: [EMAIL PROTECTED]; [EMAIL PROTECTED]; [EMAIL PROTECTED];
[EMAIL PROTECTED]
Subject: Re: [PATCHES] [HACKERS] UNICODE characters above 0x10000 

> Yes, but the specification allows for 6-byte sequences, or 32-bit 
> characters.

UTF-8 is just an encoding specification, not a character set
specification. Unicode only has 17 256x256 planes in its specification.

> As dennis pointed out, just because they're not used, doesn't mean we 
> should not allow them to be stored, since there might be someone using 
> the high ranges for a private character set, which could very well be 
> included in the specification some day.

We should expand it to 64-bit since some day the specification might be
changed then :-)

More seriously, Unicode is filled with tons of confusion and
inconsistency IMO. Remember that once Unicode advocates said that the
merit of Unicode was that it only requires 16-bit width. Now they say
they need surrogate pairs and 32-bit width chars...

Anyway my point is: if the current specification of Unicode only allows
a 24-bit range, why do we need to allow usage against the specification?
--
Tatsuo Ishii





Re: [PATCHES] [HACKERS] UNICODE characters above 0x10000

2004-08-07 Thread Dennis Bjorklund
On Sat, 7 Aug 2004, John Hansen wrote:

> should not allow them to be stored, since there might be someone using
> the high ranges for a private character set, which could very well be
> included in the specification some day.

There are areas reserved for private character sets.

-- 
/Dennis Björklund




Re: [PATCHES] [HACKERS] UNICODE characters above 0x10000

2004-08-07 Thread John Hansen
4 actually,
10FFFF needs four bytes:

11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
10FFFF = 1 0000 1111 1111 1111 1111

Fill in the blanks, starting from the bottom, you get:
11110100 10001111 10111111 10111111
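
The same arithmetic in the other direction, as a small self-contained C
program (illustrative only) that packs the code point into the four bytes
and prints F4 8F BF BF, matching the bit pattern filled in above:

    #include <stdio.h>

    int
    main(void)
    {
        unsigned int  cp = 0x10FFFF;  /* any cp in the 4-byte range
                                       * 0x10000..0x10FFFF works here */
        unsigned char buf[4];

        buf[0] = (unsigned char) (0xF0 | (cp >> 18));          /* 11110xxx */
        buf[1] = (unsigned char) (0x80 | ((cp >> 12) & 0x3F)); /* 10xxxxxx */
        buf[2] = (unsigned char) (0x80 | ((cp >> 6) & 0x3F));  /* 10xxxxxx */
        buf[3] = (unsigned char) (0x80 | (cp & 0x3F));         /* 10xxxxxx */

        printf("%02X %02X %02X %02X\n", buf[0], buf[1], buf[2], buf[3]);
        return 0;
    }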

Regards,

John Hansen 

-Original Message-
From: Christopher Kings-Lynne [mailto:[EMAIL PROTECTED] 
Sent: Saturday, August 07, 2004 8:47 PM
To: Tom Lane
Cc: John Hansen; Hackers; Patches
Subject: Re: [HACKERS] UNICODE characters above 0x10000

> Now it's entirely possible that the underlying support is a few bricks 
> shy of a load --- for instance I see that pg_utf_mblen thinks there 
> are no UTF8 codes longer than 3 bytes whereas your code goes to 4.  
> I'm not an expert on this stuff, so I don't know what the UTF8 spec 
> actually says.  But I do think you are fixing the code at the wrong level.

Surely there are UTF-8 codes that are at least 3 bytes.  I have a
_vague_ recollection that you have to keep escaping and escaping to get
up to like 4 bytes for some asian code points?

Chris






Re: [PATCHES] [HACKERS] UNICODE characters above 0x10000

2004-08-07 Thread Christopher Kings-Lynne
Now it's entirely possible that the underlying support is a few bricks
shy of a load --- for instance I see that pg_utf_mblen thinks there are
no UTF8 codes longer than 3 bytes whereas your code goes to 4.  I'm not
an expert on this stuff, so I don't know what the UTF8 spec actually
says.  But I do think you are fixing the code at the wrong level.
Surely there are UTF-8 codes that are at least 3 bytes.  I have a 
_vague_ recollection that you have to keep escaping and escaping to get 
up to like 4 bytes for some asian code points?

Chris


Re: [PATCHES] [HACKERS] UNICODE characters above 0x10000

2004-08-07 Thread Tatsuo Ishii
> Yes, but the specification allows for 6-byte sequences, or 32-bit
> characters.

UTF-8 is just an encoding specification, not a character set
specification. Unicode only has 17 256x256 planes in its
specification.

> As dennis pointed out, just because they're not used, doesn't mean we
> should not allow them to be stored, since there might be someone using
> the high ranges for a private character set, which could very well be
> included in the specification some day.

We should expand it to 64-bit since some day the specification might
be changed then :-)

More seriously, Unicode is filled with tons of confusion and
inconsistency IMO. Remember that once Unicode advocates said that the
merit of Unicode was that it only requires 16-bit width. Now they say
they need surrogate pairs and 32-bit width chars...

Anyway my point is: if the current specification of Unicode only allows
a 24-bit range, why do we need to allow usage against the specification?
--
Tatsuo Ishii



Re: [PATCHES] [HACKERS] UNICODE characters above 0x10000

2004-08-07 Thread John Hansen
Yes, but the specification allows for 6-byte sequences, or 32-bit
characters.
As dennis pointed out, just because they're not used, doesn't mean we
should not allow them to be stored, since there might be someone using
the high ranges for a private character set, which could very well be
included in the specification some day.

Regards,

John Hansen

-Original Message-
From: Tatsuo Ishii [mailto:[EMAIL PROTECTED] 
Sent: Saturday, August 07, 2004 8:09 PM
To: [EMAIL PROTECTED]
Cc: [EMAIL PROTECTED]; John Hansen; [EMAIL PROTECTED];
[EMAIL PROTECTED]
Subject: Re: [PATCHES] [HACKERS] UNICODE characters above 0x10000 

> Dennis Bjorklund <[EMAIL PROTECTED]> writes:
> > ... This also means that the start byte can never start with 7 or 8 
> > ones, that is illegal and should be tested for and rejected. So the 
> > longest utf-8 sequence is 6 bytes (and the longest character needs 4 
> > bytes (or 31 bits)).
> 
> Tatsuo would know more about this than me, but it looks from here like 
> our coding was originally designed to support only 16-bit-wide 
> internal characters (ie, 16-bit pg_wchar datatype width).  I believe 
> that the regex library limitation here is gone, and that as far as 
> that library is concerned we could assume a 32-bit internal character 
> width.  The question at hand is whether we can support 32-bit 
> characters or not --- and if not, what's the next bug to fix?

pg_wchar is already a 32-bit datatype.  However, I doubt there's
actually a need for 32-bit width character sets. Even Unicode only uses
up to 0x0010FFFF, so 24-bit should be enough...
--
Tatsuo Ishii





Re: [PATCHES] [HACKERS] UNICODE characters above 0x10000

2004-08-07 Thread Tatsuo Ishii
> Dennis Bjorklund <[EMAIL PROTECTED]> writes:
> > ... This also means that the start byte can never start with 7 or 8
> > ones, that is illegal and should be tested for and rejected. So the
> > longest utf-8 sequence is 6 bytes (and the longest character needs 4
> > bytes (or 31 bits)).
> 
> Tatsuo would know more about this than me, but it looks from here like
> our coding was originally designed to support only 16-bit-wide internal
> characters (ie, 16-bit pg_wchar datatype width).  I believe that the
> regex library limitation here is gone, and that as far as that library
> is concerned we could assume a 32-bit internal character width.  The
> question at hand is whether we can support 32-bit characters or not ---
> and if not, what's the next bug to fix?

pg_wchar is already a 32-bit datatype.  However, I doubt there's
actually a need for 32-bit width character sets. Even Unicode only
uses up to 0x0010FFFF, so 24-bit should be enough...
--
Tatsuo Ishii



Re: [PATCHES] [HACKERS] UNICODE characters above 0x10000

2004-08-07 Thread John Hansen
This should do it.

Regards,

John Hansen 

-Original Message-
From: Dennis Bjorklund [mailto:[EMAIL PROTECTED] 
Sent: Saturday, August 07, 2004 5:02 PM
To: Tom Lane
Cc: John Hansen; Hackers; Patches
Subject: Re: [HACKERS] UNICODE characters above 0x10000 

On Sat, 7 Aug 2004, Tom Lane wrote:

> question at hand is whether we can support 32-bit characters or not 
> --- and if not, what's the next bug to fix?

True, and that's hard to just give an answer to. One could do some simple
testing, make sure regexps work, and then treat anything else that might
not work as bugs to be fixed later on when found.

The alternative is to inspect all code paths that involve strings, not fun at all :-)

My previous mail talked about utf-8 translation. Not all characters possible
to form using utf-8 are assigned by the unicode org. However, the part that
interprets the unicode strings is in the os, so different os'es can give
different results. So I think pg should just accept even 6-byte utf-8
sequences even if some characters are not currently assigned.

--
/Dennis Björklund





wchar.c.patch
Description: wchar.c.patch



Re: [PATCHES] [HACKERS] UNICODE characters above 0x10000

2004-08-07 Thread Dennis Bjorklund
On Sat, 7 Aug 2004, Tom Lane wrote:

> question at hand is whether we can support 32-bit characters or not ---
> and if not, what's the next bug to fix?

True, and that's hard to just give an answer to. One could do some simple 
testing, make sure regexps work, and then treat anything else that might 
not work as bugs to be fixed later on when found.

The alternative is to inspect all code paths that involve strings, not fun 
at all :-)

My previous mail talked about utf-8 translation. Not all characters
possible to form using utf-8 are assigned by the unicode org. However,
the part that interprets the unicode strings is in the os, so different
os'es can give different results. So I think pg should just accept even
6-byte utf-8 sequences even if some characters are not currently assigned.

-- 
/Dennis Björklund




Re: [PATCHES] [HACKERS] UNICODE characters above 0x10000

2004-08-06 Thread Tom Lane
Dennis Bjorklund <[EMAIL PROTECTED]> writes:
> ... This also means that the start byte can never start with 7 or 8
> ones, that is illegal and should be tested for and rejected. So the
> longest utf-8 sequence is 6 bytes (and the longest character needs 4
> bytes (or 31 bits)).

Tatsuo would know more about this than me, but it looks from here like
our coding was originally designed to support only 16-bit-wide internal
characters (ie, 16-bit pg_wchar datatype width).  I believe that the
regex library limitation here is gone, and that as far as that library
is concerned we could assume a 32-bit internal character width.  The
question at hand is whether we can support 32-bit characters or not ---
and if not, what's the next bug to fix?

regards, tom lane



Re: [PATCHES] [HACKERS] UNICODE characters above 0x10000

2004-08-06 Thread John Hansen
Ahh, but that's not the case. You cannot just delete the check, since
not all combinations of bytes are valid UTF8. UTF bytes FE & FF never
appear in a byte sequence, for instance.
UTF8 is more than two bytes btw; up to 6 bytes are used to represent a
UTF8 character.
The 5- and 6-byte characters are currently not in use tho.

I didn't actually notice the difference in UTF8 width between my
original patch and my last, so attached is an updated patch.

Regards,

John Hansen

-Original Message-
From: Tom Lane [mailto:[EMAIL PROTECTED] 
Sent: Saturday, August 07, 2004 3:07 PM
To: John Hansen
Cc: Hackers; Patches
Subject: Re: [HACKERS] UNICODE characters above 0x10000 

"John Hansen" <[EMAIL PROTECTED]> writes:
> My apologies for not reading the code properly.

> Attached patch using pg_utf_mblen() instead of an indexed table.
> It now also do bounds checks.

I think you missed my point.  If we don't need this limitation, the
correct patch is simply to delete the whole check (ie, delete lines
827-836 of wchar.c, and for that matter we'd then not need the encoding
local variable).  What's really at stake here is whether anything else
breaks if we do that.  What else, if anything, assumes that UTF
characters are not more than 2 bytes?

Now it's entirely possible that the underlying support is a few bricks
shy of a load --- for instance I see that pg_utf_mblen thinks there are
no UTF8 codes longer than 3 bytes whereas your code goes to 4.  I'm not
an expert on this stuff, so I don't know what the UTF8 spec actually
says.  But I do think you are fixing the code at the wrong level.

regards, tom lane




wchar.c.patch
Description: wchar.c.patch



Re: [PATCHES] [HACKERS] UNICODE characters above 0x10000

2004-08-06 Thread Oliver Elphick
On Sat, 2004-08-07 at 06:06, Tom Lane wrote:
> Now it's entirely possible that the underlying support is a few bricks
> shy of a load --- for instance I see that pg_utf_mblen thinks there are
> no UTF8 codes longer than 3 bytes whereas your code goes to 4.  I'm not
> an expert on this stuff, so I don't know what the UTF8 spec actually
> says.  But I do think you are fixing the code at the wrong level.

UTF-8 characters can be up to 6 bytes long:
http://www.cl.cam.ac.uk/~mgk25/unicode.html

glibc provides various routines (mb...) for handling Unicode.  How many
of our supported platforms don't have these?  If there are still some
that don't, wouldn't it be better to use the standard routines where
they do exist?

-- 
Oliver Elphick  [EMAIL PROTECTED]
Isle of Wight  http://www.lfix.co.uk/oliver
GPG: 1024D/A54310EA  92C8 39E7 280E 3631 3F0E  1EC0 5664 7A2F A543 10EA
 
 "Be still before the LORD and wait patiently for him;
  do not fret when men succeed in their ways, when they
  carry out their wicked schemes." 
Psalms 37:7 




Re: [PATCHES] [HACKERS] UNICODE characters above 0x10000

2004-08-06 Thread Tom Lane
Oliver Elphick <[EMAIL PROTECTED]> writes:
> glibc provides various routines (mb...) for handling Unicode.  How many
> of our supported platforms don't have these?

Every one that doesn't use glibc.  Don't bother proposing a glibc-only
solution (and that's from someone who works for a glibc-only company;
you don't even want to think about the push-back you'll get from other
quarters).

regards, tom lane



Re: [PATCHES] [HACKERS] UNICODE characters above 0x10000

2004-08-06 Thread Dennis Bjorklund
On Sat, 7 Aug 2004, Tom Lane wrote:

> shy of a load --- for instance I see that pg_utf_mblen thinks there are
> no UTF8 codes longer than 3 bytes whereas your code goes to 4.  I'm not
> an expert on this stuff, so I don't know what the UTF8 spec actually
> says.  But I do think you are fixing the code at the wrong level.

I can give some general info about utf-8. This is how it is encoded:

character            encoding
-------------------  --------------------------------------------------------
00000000 - 0000007F: 0xxxxxxx
00000080 - 000007FF: 110xxxxx 10xxxxxx
00000800 - 0000FFFF: 1110xxxx 10xxxxxx 10xxxxxx
00010000 - 001FFFFF: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
00200000 - 03FFFFFF: 111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
04000000 - 7FFFFFFF: 1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx

If the first byte starts with a 1 then the number of ones gives the 
length of the utf-8 sequence. And the rest of the bytes in the sequence 
always start with 10 (this makes it possible to look anywhere in the 
string and quickly find the start of a character).

This also means that the start byte can never start with 7 or 8 ones, that 
is illegal and should be tested for and rejected. So the longest utf-8 
sequence is 6 bytes (and the longest character needs 4 bytes (or 31 
bits)).
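
That first-byte rule translates directly into code. A sketch (the real
analogue in the backend is pg_utf_mblen in wchar.c; the name below is mine):

    /* Length in bytes of a UTF-8 sequence, from its first byte,
     * per the table above. */
    static int
    utf8_seq_len(unsigned char first)
    {
        if ((first & 0x80) == 0x00) return 1;   /* 0xxxxxxx */
        if ((first & 0xE0) == 0xC0) return 2;   /* 110xxxxx */
        if ((first & 0xF0) == 0xE0) return 3;   /* 1110xxxx */
        if ((first & 0xF8) == 0xF0) return 4;   /* 11110xxx */
        if ((first & 0xFC) == 0xF8) return 5;   /* 111110xx */
        if ((first & 0xFE) == 0xFC) return 6;   /* 1111110x */
        return -1;  /* 10xxxxxx continuation, or the illegal FE/FF */
    }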

-- 
/Dennis Björklund




Re: [PATCHES] [HACKERS] UNICODE characters above 0x10000

2004-08-06 Thread John Hansen
Possibly, since I got it wrong once more.
About to give up, but attached is an updated patch.


Regards,

John Hansen

-Original Message-
From: Oliver Elphick [mailto:[EMAIL PROTECTED] 
Sent: Saturday, August 07, 2004 3:56 PM
To: Tom Lane
Cc: John Hansen; Hackers; Patches
Subject: Re: [HACKERS] UNICODE characters above 0x10000

On Sat, 2004-08-07 at 06:06, Tom Lane wrote:
> Now it's entirely possible that the underlying support is a few bricks 
> shy of a load --- for instance I see that pg_utf_mblen thinks there 
> are no UTF8 codes longer than 3 bytes whereas your code goes to 4.  
> I'm not an expert on this stuff, so I don't know what the UTF8 spec 
> actually says.  But I do think you are fixing the code at the wrong level.

UTF-8 characters can be up to 6 bytes long:
http://www.cl.cam.ac.uk/~mgk25/unicode.html

glibc provides various routines (mb...) for handling Unicode.  How many
of our supported platforms don't have these?  If there are still some
that don't, wouldn't it be better to use the standard routines where
they do exist?

-- 
Oliver Elphick  [EMAIL PROTECTED]
Isle of Wight  http://www.lfix.co.uk/oliver
GPG: 1024D/A54310EA  92C8 39E7 280E 3631 3F0E  1EC0 5664 7A2F A543 10EA
 
 "Be still before the LORD and wait patiently for him;
  do not fret when men succeed in their ways, when they
  carry out their wicked schemes." 
Psalms 37:7 





wchar.c.patch
Description: wchar.c.patch



Re: [PATCHES] [HACKERS] UNICODE characters above 0x10000

2004-08-06 Thread Tom Lane
"John Hansen" <[EMAIL PROTECTED]> writes:
> My apologies for not reading the code properly.

> Attached patch using pg_utf_mblen() instead of an indexed table.
> It now also does bounds checks.

I think you missed my point.  If we don't need this limitation, the
correct patch is simply to delete the whole check (ie, delete lines
827-836 of wchar.c, and for that matter we'd then not need the encoding
local variable).  What's really at stake here is whether anything else
breaks if we do that.  What else, if anything, assumes that UTF
characters are not more than 2 bytes?

Now it's entirely possible that the underlying support is a few bricks
shy of a load --- for instance I see that pg_utf_mblen thinks there are
no UTF8 codes longer than 3 bytes whereas your code goes to 4.  I'm not
an expert on this stuff, so I don't know what the UTF8 spec actually
says.  But I do think you are fixing the code at the wrong level.

regards, tom lane



Re: [PATCHES] [HACKERS] UNICODE characters above 0x10000

2004-08-06 Thread John Hansen
My apologies for not reading the code properly.

Attached patch using pg_utf_mblen() instead of an indexed table.
It now also does bounds checks.

Regards,

John Hansen

-Original Message-
From: Tom Lane [mailto:[EMAIL PROTECTED] 
Sent: Saturday, August 07, 2004 4:37 AM
To: John Hansen
Cc: Hackers; Patches
Subject: Re: [HACKERS] UNICODE characters above 0x10000 

"John Hansen" <[EMAIL PROTECTED]> writes:
> Attached, as promised, small patch removing the limitation, adding 
> correct utf8 validation.

Surely this is badly broken --- it will happily access data outside the
bounds of the given string.  Also, doesn't pg_mblen already know the
length rules for UTF8?  Why are you duplicating that knowledge?

regards, tom lane




wchar.c.patch
Description: wchar.c.patch



Re: [PATCHES] [HACKERS] UNICODE characters above 0x10000

2004-08-06 Thread Tom Lane
"John Hansen" <[EMAIL PROTECTED]> writes:
> Attached, as promised, a small patch removing the limitation, adding
> correct utf8 validation.

Surely this is badly broken --- it will happily access data outside the
bounds of the given string.  Also, doesn't pg_mblen already know the
length rules for UTF8?  Why are you duplicating that knowledge?

regards, tom lane



Re: [PATCHES] [HACKERS] UNICODE characters above 0x10000

2004-08-06 Thread John Hansen
Attached, as promised, a small patch removing the limitation, adding
correct utf8 validation.

Regards,

John

-Original Message-
From: [EMAIL PROTECTED]
[mailto:[EMAIL PROTECTED] On Behalf Of John Hansen
Sent: Friday, August 06, 2004 2:20 PM
To: 'Hackers'
Subject: [HACKERS] UNICODE characters above 0x10000

I've started work on a patch for this problem.

Doing regression tests at present.

I'll get back when done.


Regards,

John






wchar.c.patch
Description: wchar.c.patch
