Re: [HACKERS] Unicode support

2009-04-20 Thread Peter Eisentraut
On Sunday 19 April 2009 18:54:45 Tom Lane wrote: Peter Eisentraut pete...@gmx.net writes: On Monday 13 April 2009 20:18:31 - - wrote: 1) Functions like char_length() or length() do NOT return the number of characters (the manual says they do), instead they return the number of code

Re: [HACKERS] Unicode support

2009-04-19 Thread Peter Eisentraut
On Monday 13 April 2009 20:18:31 - - wrote: 1) Functions like char_length() or length() do NOT return the number of characters (the manual says they do), instead they return the number of code points. I have added a Todo item about possibly fixing this.
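The distinction raised above, between counting code points and counting characters, can be illustrated outside the database. The following is only a rough sketch using Python's standard unicodedata module, not PostgreSQL code. The helper names are made up for illustration, and skipping combining marks is merely an approximation of user-perceived characters (it is not full grapheme cluster segmentation).

```python
import unicodedata

# "é" written two canonically equivalent ways.
precomposed = "\u00e9"   # LATIN SMALL LETTER E WITH ACUTE (one code point)
decomposed = "e\u0301"   # "e" followed by COMBINING ACUTE ACCENT (two code points)


def count_code_points(text: str) -> int:
    """Count code points; the thread says char_length() reports this."""
    return len(text)


def count_base_characters(text: str) -> int:
    """Hypothetical helper: count code points that are not combining marks.

    This only approximates user-perceived characters.
    """
    return sum(1 for ch in text if unicodedata.combining(ch) == 0)


print(count_code_points(precomposed), count_code_points(decomposed))
print(count_base_characters(precomposed), count_base_characters(decomposed))
```

Both spellings render as the same character, yet only the second count treats them alike.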

Re: [HACKERS] Unicode support

2009-04-19 Thread Tom Lane
Peter Eisentraut pete...@gmx.net writes: On Monday 13 April 2009 20:18:31 - - wrote: 1) Functions like char_length() or length() do NOT return the number of characters (the manual says they do), instead they return the number of code points. I have added a Todo item about possibly fixing

Re: [HACKERS] Unicode support

2009-04-15 Thread Martijn van Oosterhout
On Tue, Apr 14, 2009 at 11:32:57AM -0700, David E. Wheeler wrote: I've no idea what it would require, but the mapping table must be pretty substantial. Still, I'd love to have this functionality in the database. The Unicode tables in ICU outweigh the size of the code by a factor of 5 or so.

Re: [HACKERS] Unicode support

2009-04-14 Thread Peter Eisentraut
On Monday 13 April 2009 22:39:58 Andrew Dunstan wrote: Umm, but isn't that because your encoding is using one code point? See the OP's explanation w.r.t. canonical equivalence. This isn't about the number of bytes, but about whether or not we should count characters encoded as two or more

Re: [HACKERS] Unicode support

2009-04-14 Thread Peter Eisentraut
On Tuesday 14 April 2009 07:07:27 Andrew Gierth wrote: FWIW, the SQL spec puts the onus of normalization squarely on the application; the database is allowed to assume that Unicode strings are already normalized, is allowed to behave in implementation-defined ways when presented with strings
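If normalization is left to the application, as the message above describes, a client could normalize strings before sending them to the database. The following is a minimal sketch under that assumption. The function name normalize_for_storage is hypothetical, and NFC is chosen here only as an example form.

```python
import unicodedata


def normalize_for_storage(text: str) -> str:
    """Hypothetical client-side helper: bring text into NFC before it is stored."""
    return unicodedata.normalize("NFC", text)


# Two canonically equivalent spellings of the same word.
composed = "caf\u00e9"      # uses precomposed U+00E9
combining = "cafe\u0301"    # uses "e" + U+0301

# A plain code point comparison sees different strings ...
print(composed == combining)
# ... while the normalized forms compare equal.
print(normalize_for_storage(composed) == normalize_for_storage(combining))
```

Normalizing at the boundary keeps comparisons and unique constraints consistent without requiring the database itself to understand canonical equivalence.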

Re: [HACKERS] Unicode support

2009-04-14 Thread Greg Stark
On Tue, Apr 14, 2009 at 1:32 PM, Peter Eisentraut pete...@gmx.net wrote: On Monday 13 April 2009 22:39:58 Andrew Dunstan wrote: Umm, but isn't that because your encoding is using one code point? See the OP's explanation w.r.t. canonical equivalence. This isn't about the number of bytes, but

Re: [HACKERS] Unicode support

2009-04-14 Thread Tom Lane
Greg Stark st...@enterprisedb.com writes: What's really at issue is "what is a string?". That is, is it a sequence of characters or a sequence of code points? If it's the former then we would also have to prohibit certain strings such as U&'\0301' entirely. And we have to make substr() pick out the

Re: [HACKERS] Unicode support

2009-04-14 Thread David E. Wheeler
On Apr 14, 2009, at 9:26 AM, Tom Lane wrote: Another question is what is the purpose of a database? To me it would be quite the wrong thing for the DB to not store what is presented, as long as it's considered legal. Normalization of legal variant forms seems pretty questionable. So I'm

Re: [HACKERS] Unicode support

2009-04-14 Thread Andrew Dunstan
David E. Wheeler wrote: On Apr 14, 2009, at 9:26 AM, Tom Lane wrote: Another question is what is the purpose of a database? To me it would be quite the wrong thing for the DB to not store what is presented, as long as it's considered legal. Normalization of legal variant forms seems pretty

Re: [HACKERS] Unicode support

2009-04-14 Thread Andrew Dunstan
Kevin Grittner wrote: I'm curious -- can every multi-code-point character be normalized to a single-code-point character? I don't believe so. Those combinations used in the most common orthographic languages have their own code points, but I understand you can use the combining
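The answer above can be checked with a small experiment. This is a sketch using Python's unicodedata module rather than anything in PostgreSQL, and the sample letters are chosen purely for illustration: "e" plus a combining acute accent has a precomposed equivalent, whereas "q" plus a combining tilde does not, so NFC leaves the latter as two code points.

```python
import unicodedata

has_precomposed = "e\u0301"   # e + COMBINING ACUTE ACCENT
no_precomposed = "q\u0303"    # q + COMBINING TILDE (no single-code-point form)

nfc_e = unicodedata.normalize("NFC", has_precomposed)
nfc_q = unicodedata.normalize("NFC", no_precomposed)

# Number of code points after composition.
print(len(nfc_e))
print(len(nfc_q))
```

Normalization therefore reduces, but does not eliminate, the gap between code point counts and character counts.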

Re: [HACKERS] Unicode support

2009-04-14 Thread Tom Lane
Andrew Dunstan and...@dunslane.net writes: I think there's a good case for some functions implementing the various Unicode normalization functions, though. I have no objection to that so long as the code footprint is in line with the utility gain (i.e. not all that much). If we have to bring

Re: [HACKERS] Unicode support

2009-04-14 Thread Kevin Grittner
Greg Stark st...@enterprisedb.com wrote: Peter Eisentraut pete...@gmx.net wrote:
SELECT U&'\00E9', char_length(U&'\00E9');
 ?column? | char_length
----------+-------------
 é        |           1
(1 row)
SELECT U&'\0065\0301', char_length(U&'\0065\0301');
 ?column? | char_length

Re: [HACKERS] Unicode support

2009-04-14 Thread Peter Eisentraut
On Tuesday 14 April 2009 18:49:45 Greg Stark wrote: What's really at issue is "what is a string?". That is, is it a sequence of characters or a sequence of code points? I think a sequence of codepoints would be about as silly a definition as the antiquated notion of a string as a sequence of bytes.

Re: [HACKERS] Unicode support

2009-04-14 Thread Peter Eisentraut
On Monday 13 April 2009 20:18:31 - - wrote: 2) PG has no support for the Unicode collation algorithm. Collation is offloaded to the OS, which makes this quite inflexible. This argument is unclear. Do you want the Unicode collation algorithm or do you want flexibility? Some OS do implement

Re: [HACKERS] Unicode support

2009-04-14 Thread Peter Eisentraut
On Tuesday 14 April 2009 19:26:41 Tom Lane wrote: Another question is what is the purpose of a database? To me it would be quite the wrong thing for the DB to not store what is presented, as long as it's considered legal. Normalization of legal variant forms seems pretty questionable. So

Re: [HACKERS] Unicode support

2009-04-14 Thread - -
I don't believe that the standard forbids the use of combining chars at all. RFC 3629 says: ... This issue is amenable to solutions based on Unicode Normalization Forms, see [UAX15]. This is the relevant part. Tom was claiming that the UTF8 encoding required normalizing the string of

Re: [HACKERS] Unicode support

2009-04-14 Thread David E. Wheeler
On Apr 14, 2009, at 11:10 AM, Tom Lane wrote: Andrew Dunstan and...@dunslane.net writes: I think there's a good case for some functions implementing the various Unicode normalization functions, though. I have no objection to that so long as the code footprint is in line with the utility

Re: [HACKERS] Unicode support

2009-04-14 Thread Andrew Gierth
Peter == Peter Eisentraut pete...@gmx.net writes: On Tuesday 14 April 2009 07:07:27 Andrew Gierth wrote: FWIW, the SQL spec puts the onus of normalization squarely on the application; the database is allowed to assume that Unicode strings are already normalized, is allowed to behave in

[HACKERS] Unicode support

2009-04-13 Thread - -
Hi. While PostgreSQL is a great database, it lacks some fundamental Unicode support. I want to present some points that have--to my knowledge--not been addressed so far. In the following text, it is assumed that the database and client encoding is UTF-8. 1) Functions like char_length() or

Re: [HACKERS] Unicode support

2009-04-13 Thread Kevin Grittner
Alvaro Herrera alvhe...@commandprompt.com wrote: 1) Functions like char_length() or length() do NOT return the number of characters (the manual says they do), instead they return the number of code points. I think you have client_encoding misconfigured. alvherre=# select

Re: [HACKERS] Unicode support

2009-04-13 Thread Andrew Dunstan
Alvaro Herrera wrote: - - wrote: 1) Functions like char_length() or length() do NOT return the number of characters (the manual says they do), instead they return the number of code points. I think you have client_encoding misconfigured. alvherre=# select length('á'::text);

Re: [HACKERS] Unicode support

2009-04-13 Thread Alvaro Herrera
- - wrote: 1) Functions like char_length() or length() do NOT return the number of characters (the manual says they do), instead they return the number of code points. I think you have client_encoding misconfigured.
alvherre=# select length('á'::text);
 length
--------
      1
(1 fila)

Re: [HACKERS] Unicode support

2009-04-13 Thread Tom Lane
Andrew Dunstan and...@dunslane.net writes: This isn't about the number of bytes, but about whether or not we should count characters encoded as two or more combined code points as a single char or not. It's really about whether we should support non-canonical encodings. AFAIK that's a hack

Re: [HACKERS] Unicode support

2009-04-13 Thread Greg Stark
On Mon, Apr 13, 2009 at 9:15 PM, Tom Lane t...@sss.pgh.pa.us wrote: Andrew Dunstan and...@dunslane.net writes: This isn't about the number of bytes, but about whether or not we should count characters encoded as two or more combined code points as a single char or not. It's really about

Re: [HACKERS] Unicode support

2009-04-13 Thread Tom Lane
Greg Stark st...@enterprisedb.com writes: Is it really true that canonical encodings never contain any composed characters in them? I thought there were some glyphs which could only be represented by composed characters. AFAIK that's not true. However, in my original comment I was thinking

Re: [HACKERS] Unicode support

2009-04-13 Thread Andrew Dunstan
Tom Lane wrote: Andrew Dunstan and...@dunslane.net writes: This isn't about the number of bytes, but about whether or not we should count characters encoded as two or more combined code points as a single char or not. It's really about whether we should support non-canonical

Re: [HACKERS] Unicode support

2009-04-13 Thread - -
Tom Lane t...@sss.pgh.pa.us wrote: Greg Stark st...@enterprisedb.com writes: Is it really true that canonical encodings never contain any composed characters in them? I thought there were some glyphs which could only be represented by composed characters. AFAIK that's not true. However, in

Re: [HACKERS] Unicode support

2009-04-13 Thread Gregory Stark
- - crossroads0...@googlemail.com writes: The original post seemed to be a contrived attempt to say you should use ICU. Indeed. The OP should go read all the previous arguments about ICU in our archives. Not at all. I just was making a suggestion. You may use any other library or

Re: [HACKERS] Unicode support

2009-04-13 Thread Andrew Gierth
Gregory == Gregory Stark st...@enterprisedb.com writes: I don't believe that the standard forbids the use of combining chars at all. RFC 3629 says: ... This issue is amenable to solutions based on Unicode Normalization Forms, see [UAX15]. Gregory This is the relevant part. Tom was

[HACKERS] Unicode support in postgresql code

2009-01-06 Thread Kalyankumar Ramaseshan
Hi, Could anyone shed some light on how Unicode support is enabled in the PostgreSQL code? I know that there is a step during installation to select the default locale of the PostgreSQL system; the other place is during the creation of a database, where there is an option to select the