On Sunday 19 April 2009 18:54:45 Tom Lane wrote:
Peter Eisentraut pete...@gmx.net writes:
On Monday 13 April 2009 20:18:31 - - wrote:
1) Functions like char_length() or length() do NOT return the number
of characters (the manual says they do), instead they return the
number of code
On Monday 13 April 2009 20:18:31 - - wrote:
1) Functions like char_length() or length() do NOT return the number
of characters (the manual says they do), instead they return the
number of code points.
I have added a Todo item about possibly fixing this.
Peter Eisentraut pete...@gmx.net writes:
On Monday 13 April 2009 20:18:31 - - wrote:
1) Functions like char_length() or length() do NOT return the number
of characters (the manual says they do), instead they return the
number of code points.
I have added a Todo item about possibly fixing
On Tue, Apr 14, 2009 at 11:32:57AM -0700, David E. Wheeler wrote:
I've no idea what it would require, but the mapping table must be
pretty substantial. Still, I'd love to have this functionality in the
database.
The Unicode tables in ICU outweigh the size of the code by a factor 5
or so.
On Monday 13 April 2009 22:39:58 Andrew Dunstan wrote:
Umm, but isn't that because your encoding is using one code point?
See the OP's explanation w.r.t. canonical equivalence.
This isn't about the number of bytes, but about whether or not we should
count characters encoded as two or more
On Tuesday 14 April 2009 07:07:27 Andrew Gierth wrote:
FWIW, the SQL spec puts the onus of normalization squarely on the
application; the database is allowed to assume that Unicode strings
are already normalized, is allowed to behave in implementation-defined
ways when presented with strings
On Tue, Apr 14, 2009 at 1:32 PM, Peter Eisentraut pete...@gmx.net wrote:
On Monday 13 April 2009 22:39:58 Andrew Dunstan wrote:
Umm, but isn't that because your encoding is using one code point?
See the OP's explanation w.r.t. canonical equivalence.
This isn't about the number of bytes, but
Greg Stark st...@enterprisedb.com writes:
What's really at issue is "what is a string?" That is, is it a sequence
of characters or a sequence of code points? If it's the former then we
would also have to prohibit certain strings such as U'\0301'
entirely. And we have to make substr() pick out the
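To make the "sequence of characters" reading concrete, here is a rough Python sketch that groups each base code point with the combining marks that follow it. The helper name split_clusters is hypothetical, and the grouping rule is a simplification that looks only at the canonical combining class; it is not the full grapheme cluster segmentation defined in UAX #29, and it says nothing about how PostgreSQL's substr() works.

```python
import unicodedata

def split_clusters(s):
    """Group each base code point with the combining marks that follow it.

    Assumption: a code point with a non-zero canonical combining class
    attaches to the preceding cluster. Real grapheme segmentation (UAX #29)
    has more rules than this.
    """
    clusters = []
    for ch in s:
        if clusters and unicodedata.combining(ch):
            clusters[-1] += ch      # attach the mark to the previous base
        else:
            clusters.append(ch)     # start a new cluster
    return clusters

text = "e\u0301a"                   # e + combining acute, then a
print(len(text))                    # 3 code points
print(len(split_clusters(text)))    # 2 clusters under this rule

# A string consisting only of a combining mark has no base to attach to,
# so this sketch leaves it as a cluster of its own.
print(split_clusters("\u0301") == ["\u0301"])
```

Under this rule a substring operation would cut between clusters rather than between code points, and a lone U+0301 is the awkward case: it is a valid code point sequence but has no base character.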
On Apr 14, 2009, at 9:26 AM, Tom Lane wrote:
Another question is what is the purpose of a database? To me it would
be quite the wrong thing for the DB to not store what is presented, as
long as it's considered legal. Normalization of legal variant forms
seems pretty questionable. So I'm
David E. Wheeler wrote:
On Apr 14, 2009, at 9:26 AM, Tom Lane wrote:
Another question is what is the purpose of a database? To me it would
be quite the wrong thing for the DB to not store what is presented, as
long as it's considered legal. Normalization of legal variant forms
seems pretty
Kevin Grittner wrote:
I'm curious -- can every multi-code-point character be normalized to a
single-code-point character?
I don't believe so. Those combinations used in the most common
orthographic languages have their own code points, but I understand you
can use the combining
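The question can be checked outside the database with a short Python sketch using the standard unicodedata module. The helper nfc_length is a hypothetical name, and the letter q with a combining acute accent is chosen here only as an example of a combination that has no precomposed code point.

```python
import unicodedata

def nfc_length(s):
    """Number of code points left after NFC (canonical composition)."""
    return len(unicodedata.normalize("NFC", s))

# e + combining acute composes to the single code point U+00E9.
print(nfc_length("e\u0301"))   # 1

# q + combining acute has no precomposed form, so NFC keeps both code points.
print(nfc_length("q\u0301"))   # 2
```

So normalization to NFC shortens some sequences to one code point but cannot do so for every base and mark combination.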
Andrew Dunstan and...@dunslane.net writes:
I think there's a good case for some functions implementing the various
Unicode normalization functions, though.
I have no objection to that so long as the code footprint is in line
with the utility gain (i.e. not all that much). If we have to bring
Greg Stark st...@enterprisedb.com wrote:
Peter Eisentraut pete...@gmx.net wrote:
SELECT U'\00E9', char_length(U'\00E9');
 ?column? | char_length
----------+-------------
 é        |           1
(1 row)
SELECT U'\0065\0301', char_length(U'\0065\0301');
?column? | char_length
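For readers without a server at hand, the same two spellings of é can be compared in Python, whose len() also counts code points. This is only a sketch of the canonical-equivalence point being made; the variable names are illustrative and it does not show PostgreSQL's own behaviour.

```python
import unicodedata

precomposed = "\u00e9"         # é as a single code point
decomposed = "\u0065\u0301"    # e followed by a combining acute accent

# Both render as the same character, but the code point counts differ.
print(len(precomposed))        # 1
print(len(decomposed))         # 2
print(precomposed == decomposed)   # False: a plain comparison sees different sequences

# The two forms are canonically equivalent: normalizing either way
# turns one spelling into the other.
nfc = unicodedata.normalize("NFC", decomposed)
nfd = unicodedata.normalize("NFD", precomposed)
print(nfc == precomposed)      # True
print(nfd == decomposed)       # True
```

A length function that counts code points will therefore report 1 or 2 for the same visible character, depending on which form the application sent.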
On Tuesday 14 April 2009 18:49:45 Greg Stark wrote:
What's really at issue is "what is a string?" That is, is it a sequence
of characters or a sequence of code points?
I think a sequence of codepoints would be about as silly a definition as the
antiquated notion of a string as a sequence of bytes.
On Monday 13 April 2009 20:18:31 - - wrote:
2) PG has no support for the Unicode collation algorithm. Collation is
offloaded to the OS, which makes this quite inflexible.
This argument is unclear. Do you want the Unicode collation algorithm or do
you want flexibility? Some OS do implement
On Tuesday 14 April 2009 19:26:41 Tom Lane wrote:
Another question is what is the purpose of a database? To me it would
be quite the wrong thing for the DB to not store what is presented, as
long as it's considered legal. Normalization of legal variant forms
seems pretty questionable. So
I don't believe that the standard forbids the use of combining chars at all.
RFC 3629 says:
... This issue is amenable to solutions based on Unicode Normalization
Forms, see [UAX15].
This is the relevant part. Tom was claiming that the UTF8 encoding required
normalizing the string of
On Apr 14, 2009, at 11:10 AM, Tom Lane wrote:
Andrew Dunstan and...@dunslane.net writes:
I think there's a good case for some functions implementing the various
Unicode normalization functions, though.
I have no objection to that so long as the code footprint is in line
with the utility
Peter == Peter Eisentraut pete...@gmx.net writes:
On Tuesday 14 April 2009 07:07:27 Andrew Gierth wrote:
FWIW, the SQL spec puts the onus of normalization squarely on the
application; the database is allowed to assume that Unicode
strings are already normalized, is allowed to behave in
Hi.
While PostgreSQL is a great database, it lacks some fundamental
Unicode support. I want to present some points that have--to my
knowledge--not been addressed so far. In the following text, it is
assumed that the database and client encoding is UTF-8.
1) Functions like char_length() or
Alvaro Herrera alvhe...@commandprompt.com wrote:
1) Functions like char_length() or length() do NOT return the number
of characters (the manual says they do), instead they return the
number of code points.
I think you have client_encoding misconfigured.
alvherre=# select
Alvaro Herrera wrote:
- - wrote:
1) Functions like char_length() or length() do NOT return the number
of characters (the manual says they do), instead they return the
number of code points.
I think you have client_encoding misconfigured.
alvherre=# select length('á'::text);
- - wrote:
1) Functions like char_length() or length() do NOT return the number
of characters (the manual says they do), instead they return the
number of code points.
I think you have client_encoding misconfigured.
alvherre=# select length('á'::text);
 length
--------
      1
(1 fila)
Andrew Dunstan and...@dunslane.net writes:
This isn't about the number of bytes, but about whether or not we should
count characters encoded as two or more combined code points as a single
char or not.
It's really about whether we should support non-canonical encodings.
AFAIK that's a hack
On Mon, Apr 13, 2009 at 9:15 PM, Tom Lane t...@sss.pgh.pa.us wrote:
Andrew Dunstan and...@dunslane.net writes:
This isn't about the number of bytes, but about whether or not we should
count characters encoded as two or more combined code points as a single
char or not.
It's really about
Greg Stark st...@enterprisedb.com writes:
Is it really true that canonical encodings never contain any composed
characters in them? I thought there were some glyphs which could only
be represented by composed characters.
AFAIK that's not true. However, in my original comment I was thinking
Tom Lane wrote:
Andrew Dunstan and...@dunslane.net writes:
This isn't about the number of bytes, but about whether or not we should
count characters encoded as two or more combined code points as a single
char or not.
It's really about whether we should support non-canonical
Tom Lane t...@sss.pgh.pa.us wrote:
Greg Stark st...@enterprisedb.com writes:
Is it really true that canonical encodings never contain any composed
characters in them? I thought there were some glyphs which could only
be represented by composed characters.
AFAIK that's not true. However, in
- - crossroads0...@googlemail.com writes:
The original post seemed to be a contrived attempt to say you should
use ICU.
Indeed. The OP should go read all the previous arguments about ICU
in our archives.
Not at all. I just was making a suggestion. You may use any other
library or
Gregory == Gregory Stark st...@enterprisedb.com writes:
I don't believe that the standard forbids the use of combining
chars at all. RFC 3629 says:
... This issue is amenable to solutions based on Unicode
Normalization Forms, see [UAX15].
Gregory This is the relevant part. Tom was
Hi,
Could anyone throw some light on how Unicode support is enabled in
the PostgreSQL code? I know that there is a step during installation
to select the default locale of the PostgreSQL system; the other place is
during the creation of a database, where there is an option to select the