Re: [HACKERS] [WIP] collation support revisited (phase 1)

2008-07-24 Thread Martijn van Oosterhout
On Tue, Jul 22, 2008 at 04:32:39PM +0200, Zdenek Kotala wrote:
 Oh, so you're thinking of a charset as a sort of check constraint. If
 your locale is Turkish and you have a column marked charset ASCII then
 storing lower('HI') results in an error.
 
 Yeah, if you use the strcoll function it fails when an illegal character is found.
 See
 http://www.opengroup.org/onlinepubs/009695399/functions/strcoll.html

Weird, at least in glibc and ICU it can't happen, but perhaps there are
other implementations where it can...

 Collation cannot be defined on arbitrary characters. There is no relation
 between Latin and Chinese characters. Collation makes sense when you are
 able to specify the <, =, > operators.

There is no standardised relation. However, if your collation library
decides to define all Chinese characters after all Latin characters,
it will have defined a collation that works for all strings containing
any characters... which is basically the approach glibc/ICU takes.

I think the standard is kind of pathetic to say that strcoll can set
errno but has no reserved return value to indicate an error. I wonder how
many platforms actually use that feature.

Have a nice day,
-- 
Martijn van Oosterhout   [EMAIL PROTECTED]   http://svana.org/kleptog/
 Please line up in a tree and maintain the heap invariant while 
 boarding. Thank you for flying nlogn airlines.




Re: [HACKERS] [WIP] collation support revisited (phase 1)

2008-07-22 Thread Martijn van Oosterhout
On Mon, Jul 21, 2008 at 03:15:56AM +0200, Radek Strnad wrote:
 I was trying to sort out the problem of not creating a new catalog for
 character sets and I came up with the following ideas. Correct me if my
 ideas are wrong.
 
 Since collation has to have a defined character set.

Not really. AIUI at least glibc and ICU define a collation over all
possible characters (i.e. Unicode). When you create a locale you take a
subset and use that. Think about it: if you want to sort strings and
one of them happens to contain a Chinese character, it can't *fail*.
Note strcoll() has no error return for unknown characters.

 I'm suggesting to use
 the already written encoding infrastructure and the list of encodings in
 chklocale.c. Currently databases are created not with a specified character
 set but with a specified encoding. I think that instead of pointing a record
 in the collation catalog at another record in a character set catalog we
 might use only the name (string) of the encoding.

That's reasonable. From an abstract point of view collations and
encodings are orthogonal; it's only when you're using POSIX locales
that there are limitations on how you combine them. I think you can
assume a collation can handle any character that can be produced by the
encoding.

Have a nice day,
-- 
Martijn van Oosterhout   [EMAIL PROTECTED]   http://svana.org/kleptog/
 Please line up in a tree and maintain the heap invariant while 
 boarding. Thank you for flying nlogn airlines.




Re: [HACKERS] [WIP] collation support revisited (phase 1)

2008-07-22 Thread Zdenek Kotala

Martijn van Oosterhout wrote:

On Sat, Jul 12, 2008 at 10:02:24AM +0200, Zdenek Kotala wrote:

Background:
We specify encoding in the initdb phase. ANSI specifies repertoire, charset,
encoding and collation. If I understand it correctly, charset is a subset of
the repertoire and specifies the list of allowed characters for a
language-collation. Encoding is a mapping of a character set to a binary
format. For example, for the Czech alphabet (charset) we have six different
8-bit ASCII-based encodings, but on the other side UTF8 covers multiple
charsets.


Oh, so you're thinking of a charset as a sort of check constraint. If
your locale is Turkish and you have a column marked charset ASCII then
storing lower('HI') results in an error.


Yeah, if you use the strcoll function it fails when an illegal character is found.
See
http://www.opengroup.org/onlinepubs/009695399/functions/strcoll.html


A collation must be defined over all possible characters; it can't
depend on the character set. That doesn't mean sorting in en_US must do
something meaningful with Japanese characters, it does mean it can't
throw an error (the usual procedure is to sort on Unicode code point).


Collation cannot be defined on arbitrary characters. There is no relation between
Latin and Chinese characters. Collation makes sense when you are able to specify
the <, =, > operators.


If you need to compare Japanese and Latin characters, ANSI specifies a default
collation for each repertoire. I think it is usually bitwise comparison.



Zdenek

--
Zdenek Kotala  Sun Microsystems
Prague, Czech Republic http://sun.com/postgresql


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [WIP] collation support revisited (phase 1)

2008-07-22 Thread Zdenek Kotala

Tom Lane wrote:

Zdenek Kotala [EMAIL PROTECTED] writes:
I think if we support the UTF8 encoding, then it makes sense to create our own
charsets, because system locales could have a collation defined for them.


Say what?  I cannot imagine a scenario in which a user-defined encoding
would be useful. 


I did not mean a user-defined encoding but a user-defined charset. For example the
[EMAIL PROTECTED] locale uses the UTF8 encoding and its collation is defined on the
Czech charset, which specifies the list of allowed characters. If somebody has
e.g. a Turkish locale installed then he will also want to have a Turkish charset
in postgres. I guess a charset also defines how upper/lower case works (see
i/I in Turkish).


Please, correct me if I am wrong.

thanks Zdenek

--
Zdenek Kotala  Sun Microsystems
Prague, Czech Republic http://sun.com/postgresql




Re: [HACKERS] [WIP] collation support revisited (phase 1)

2008-07-22 Thread Zdenek Kotala

Martijn van Oosterhout wrote:

On Mon, Jul 21, 2008 at 03:15:56AM +0200, Radek Strnad wrote:

I was trying to sort out the problem of not creating a new catalog for
character sets and I came up with the following ideas. Correct me if my
ideas are wrong.

Since collation has to have a defined character set.


Not really. AIUI at least glibc and ICU define a collation over all
possible characters (i.e. Unicode). When you create a locale you take a
subset and use that. Think about it: if you want to sort strings and
one of them happens to contain a Chinese character, it can't *fail*.
Note strcoll() has no error return for unknown characters.


It has.
See http://www.opengroup.org/onlinepubs/009695399/functions/strcoll.html

The strcoll() function may fail if:

[EINVAL]
    The s1 or s2 arguments contain characters outside the domain of
    the collating sequence.




I'm suggesting to use
the already written encoding infrastructure and the list of encodings in
chklocale.c. Currently databases are created not with a specified character
set but with a specified encoding. I think that instead of pointing a record
in the collation catalog at another record in a character set catalog we
might use only the name (string) of the encoding.


That's reasonable. From an abstract point of view collations and
encodings are orthogonal; it's only when you're using POSIX locales
that there are limitations on how you combine them. I think you can
assume a collation can handle any character that can be produced by the
encoding.


I think you are not correct. You cannot use one collation over all of Unicode. See
http://www.unicode.org/reports/tr10/#Common_Misperceptions. The same characters can
be ordered differently in different languages.


Zdenek



--
Zdenek Kotala  Sun Microsystems
Prague, Czech Republic http://sun.com/postgresql




Re: [HACKERS] [WIP] collation support revisited (phase 1)

2008-07-20 Thread Radek Strnad
I was trying to sort out the problem of not creating a new catalog for
character sets and I came up with the following ideas. Correct me if my
ideas are wrong.

Since a collation has to have a defined character set, I'm suggesting to use
the already written encoding infrastructure and the list of encodings in
chklocale.c. Currently databases are created not with a specified character
set but with a specified encoding. I think that instead of pointing a record
in the collation catalog at another record in a character set catalog we
might use only the name (string) of the encoding.

So each collation will be defined over the encodings listed in chklocale.c. Each
database will be able to use only collations that are created over the same
(compatible) encodings regarding encoding_match_list. Each standard
collation (SQL standard) will be defined over all possible encodings
(hard-coded).

Comments?

Regards

 Radek Strnad

On Sat, Jul 12, 2008 at 5:17 PM, Tom Lane [EMAIL PROTECTED] wrote:

 Zdenek Kotala [EMAIL PROTECTED] writes:
  I think if we support the UTF8 encoding, then it makes sense to create our
  own charsets, because system locales could have a collation defined for them.

 Say what?  I cannot imagine a scenario in which a user-defined encoding
 would be useful. The amount of infrastructure you need for a new
 encoding is so large that providing management commands is just silly
 --- anyone who can create the infrastructure can do the last little bit
 for themselves.  The analogy to index access methods is on point, again.

regards, tom lane



Re: [HACKERS] [WIP] collation support revisited (phase 1)

2008-07-12 Thread Zdenek Kotala

Alvaro Herrera wrote:

Zdenek Kotala wrote:

The example is when you have translation data (a vocabulary) in the database.
But the reason is that ANSI specifies (chapter 4.2) the charset as a part of the
string descriptor. See below:


— The length or maximum length in characters of the character string type.
— The catalog name, schema name, and character set name of the character 
set of the character string type.
— The catalog name, schema name, and collation name of the collation of 
the character string type.


We already support multiple charsets, and are able to do conversions
between them.  The set of charsets is hardcoded and it's hard to make a
case that a user needs to create new ones.  I concur with Martijn's
suggestion -- there's no need for this to appear in a system catalog.

Perhaps it could be argued that we need to be able to specify the
charset a given string is in -- currently all strings are in the server
encoding (charset) which is fixed at initdb time.  Making the system
support multiple server encodings would be a major undertaking in itself
and I'm not sure that there's a point.



Background:
We specify encoding in the initdb phase. ANSI specifies repertoire, charset,
encoding and collation. If I understand it correctly, charset is a subset of
the repertoire and specifies the list of allowed characters for a
language-collation. Encoding is a mapping of a character set to a binary
format. For example, for the Czech alphabet (charset) we have six different
8-bit ASCII-based encodings, but on the other side UTF8 covers multiple
charsets.



I think if we support the UTF8 encoding, then it makes sense to create our own
charsets, because system locales could have a collation defined for them. We need
conversion only in the case when the client encoding is not compatible with the
charset and a conversion is not defined.


Any comments?

Zdenek

--
Zdenek Kotala  Sun Microsystems
Prague, Czech Republic http://sun.com/postgresql




Re: [HACKERS] [WIP] collation support revisited (phase 1)

2008-07-12 Thread Martijn van Oosterhout
On Sat, Jul 12, 2008 at 10:02:24AM +0200, Zdenek Kotala wrote:
 Background:
 We specify encoding in the initdb phase. ANSI specifies repertoire, charset,
 encoding and collation. If I understand it correctly, charset is a subset of
 the repertoire and specifies the list of allowed characters for a
 language-collation. Encoding is a mapping of a character set to a binary
 format. For example, for the Czech alphabet (charset) we have six different
 8-bit ASCII-based encodings, but on the other side UTF8 covers multiple charsets.

Oh, so you're thinking of a charset as a sort of check constraint. If
your locale is Turkish and you have a column marked charset ASCII then
storing lower('HI') results in an error.

A collation must be defined over all possible characters; it can't
depend on the character set. That doesn't mean sorting in en_US must do
something meaningful with Japanese characters, it does mean it can't
throw an error (the usual procedure is to sort on Unicode code point).

 I think if we support the UTF8 encoding, then it makes sense to create our own
 charsets, because system locales could have a collation defined for them. We
 need conversion only in the case when the client encoding is not compatible
 with the charset and a conversion is not defined.

The problem is that locales in POSIX are defined on an encoding, not a
charset. The locale en_US.UTF-8 doesn't actually sort any differently
than en_US.latin1; it's just that Japanese characters are not
representable in the latter.

locale-gen can create a locale for any pair of (locale code, encoding);
whether the result is meaningful is another question.

Have a nice day,
-- 
Martijn van Oosterhout   [EMAIL PROTECTED]   http://svana.org/kleptog/
 Please line up in a tree and maintain the heap invariant while 
 boarding. Thank you for flying nlogn airlines.




Re: [HACKERS] [WIP] collation support revisited (phase 1)

2008-07-12 Thread Tom Lane
Zdenek Kotala [EMAIL PROTECTED] writes:
 I think if we support the UTF8 encoding, then it makes sense to create our own
 charsets,
 because system locales could have a collation defined for them.

Say what?  I cannot imagine a scenario in which a user-defined encoding
would be useful. The amount of infrastructure you need for a new
encoding is so large that providing management commands is just silly
--- anyone who can create the infrastructure can do the last little bit
for themselves.  The analogy to index access methods is on point, again.

regards, tom lane



Re: [HACKERS] [WIP] collation support revisited (phase 1)

2008-07-11 Thread Martijn van Oosterhout
On Thu, Jul 10, 2008 at 11:24:29PM +0200, Radek Strnad wrote:
 Hi,
 
 after long discussion with Mr. Kotala, we've decided to redesign our
 collation support proposal.
 For those of you who aren't familiar with my WIP patch and comments from
 other hackers here's the original mail:
 http://archives.postgresql.org/pgsql-hackers/2008-07/msg00019.php

snip
 phase 1
 Implement a sort of framework so that PostgreSQL will have the basic guts
 (pg_collation & pg_charset catalogs, CREATE COLLATION, collation support added
 for each type needed) and will support collation at the database level. This
 phase has been accepted as a Google Summer of Code project.

Why bother with pg_charset? I don't think I've seen anyone even asking
for multiple charsets in a single DB, and certainly no actual use case.

Have a nice day,
-- 
Martijn van Oosterhout   [EMAIL PROTECTED]   http://svana.org/kleptog/
 Please line up in a tree and maintain the heap invariant while 
 boarding. Thank you for flying nlogn airlines.




Re: [HACKERS] [WIP] collation support revisited (phase 1)

2008-07-11 Thread Zdenek Kotala

Martijn van Oosterhout wrote:

On Thu, Jul 10, 2008 at 11:24:29PM +0200, Radek Strnad wrote:

Hi,

after long discussion with Mr. Kotala, we've decided to redesign our
collation support proposal.
For those of you who aren't familiar with my WIP patch and comments from
other hackers here's the original mail:
http://archives.postgresql.org/pgsql-hackers/2008-07/msg00019.php


snip

phase 1
Implement a sort of framework so that PostgreSQL will have the basic guts
(pg_collation & pg_charset catalogs, CREATE COLLATION, collation support added
for each type needed) and will support collation at the database level. This
phase has been accepted as a Google Summer of Code project.


Why bother with pg_charset? I don't think I've seen anyone even asking
for multiple charsets in a single DB, and certainly no actual use case.


The example is when you have translation data (a vocabulary) in the database.
But the reason is that ANSI specifies (chapter 4.2) the charset as a part of the
string descriptor. See below:


— The length or maximum length in characters of the character string type.
— The catalog name, schema name, and character set name of the character set of 
the character string type.
— The catalog name, schema name, and collation name of the collation of the 
character string type.


If I think about it, the question is whether we need both pieces of information
in pg_type and so on, because a collation automatically defines its charset.


any idea?

Zdenek



--
Zdenek Kotala  Sun Microsystems
Prague, Czech Republic http://sun.com/postgresql




Re: [HACKERS] [WIP] collation support revisited (phase 1)

2008-07-11 Thread Alvaro Herrera
Zdenek Kotala wrote:

 The example is when you have translation data (a vocabulary) in the database.
 But the reason is that ANSI specifies (chapter 4.2) the charset as a part of
 the string descriptor. See below:

 — The length or maximum length in characters of the character string type.
 — The catalog name, schema name, and character set name of the character 
 set of the character string type.
 — The catalog name, schema name, and collation name of the collation of 
 the character string type.

We already support multiple charsets, and are able to do conversions
between them.  The set of charsets is hardcoded and it's hard to make a
case that a user needs to create new ones.  I concur with Martijn's
suggestion -- there's no need for this to appear in a system catalog.

Perhaps it could be argued that we need to be able to specify the
charset a given string is in -- currently all strings are in the server
encoding (charset) which is fixed at initdb time.  Making the system
support multiple server encodings would be a major undertaking in itself
and I'm not sure that there's a point.

-- 
Alvaro Herrerahttp://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support



[HACKERS] [WIP] collation support revisited (phase 1)

2008-07-10 Thread Radek Strnad
Hi,

after a long discussion with Mr. Kotala, we've decided to redesign our
collation support proposal.
For those of you who aren't familiar with my WIP patch and comments from
other hackers here's the original mail:
http://archives.postgresql.org/pgsql-hackers/2008-07/msg00019.php

In a few sentences: I'm writing collation support for PostgreSQL that is
almost independent of the collating function used. I will implement POSIX
locales, but a switch to ICU will be quite easy. Collations and character sets
defined by the SQL standard will be hard-coded, so we avoid their non-existence
in some implementations.

The whole project will be divided into two phases:

phase 1
Implement a sort of framework so that PostgreSQL will have the basic guts
(pg_collation & pg_charset catalogs, CREATE COLLATION, collation support added
for each type needed) and will support collation at the database level. This
phase has been accepted as a Google Summer of Code project.

phase 2
Implement the rest: full collation at the column level. I will continue working
on this after finishing phase one, and it will be my master's degree thesis.

How will the first part work?

Catalogs
- new catalogs pg_collation and pg_charset will be defined
- pg_collation and pg_charset will contain the SQL standard collations + an
optional default collation (when set other than the SQL standard one)
- pg_type, pg_attribute, pg_namespace will be extended with references to the
default records in pg_collation and pg_charset

initdb
- pg_collation & pg_charset will contain all the pre-defined records regarding
the SQL standard and optionally one non-standard record set when running initdb
(the one using system locales)
- these two records will be referenced by pg_type, pg_attribute,
pg_namespace in the concerned columns and will be considered the default
collation that will be inherited

CREATE DATABASE ... COLLATE ...
- after copying, the new database's collation will be the default (same as the
cluster collation) or changed by the COLLATE statement. Then we update the
pg_type, pg_attribute and pg_namespace catalogs
- reindex database

When changing databases, the database collation will be retrieved from the type
text in pg_type. This part should be the only one that will be deleted when
proceeding with phase 2. But that will take a while :-)

Thanks for all your comments

Regards

Radek Strnad