Re: [HACKERS] WIP patch: Collation support

2008-09-23 Thread Heikki Linnakangas

Committed.

Tom Lane wrote:

* You should try to get rid of LOCALE_NAME_BUFLEN altogether.  Definitely
the comment about it in pg_control.h is now obsolete.


Yep. I removed LOCALE_NAME_BUFLEN. The real max length of a locale name 
is now NAMEDATALEN, because it's stored in a name field in pg_database. 
NAMEDATALEN is only 64 bytes, whereas LOCALE_NAME_BUFLEN was 128. 64 
bytes should be enough for en_GB.UTF8 style locale names, but I wonder 
if it's enough for the longer names used on Windows? Could someone 
confirm that, please?



An important restriction, however, is that each database's character set
must be compatible with the database's envarLC_CTYPE/ setting.

Also I wonder whether we shouldn't say that it must be compatible with
LC_CTYPE *and* LC_COLLATE.


I think we should, but that's in fact not what is tested. Before the 
patch as well, we only tested that the encoding matches LC_CTYPE, but 
you could set LC_COLLATE to anything. I'll work on a subsequent patch to 
tighten that.



* This makes sense, but then shouldn't we make the identical restriction
for encoding?

+The literalCOLLATE/ and literalCTYPE/ settings must match
+those of the template database, except when template0 is used as
+template. This is because literalCOLLATE/ and literalCTYPE/


It wouldn't be as bullet-proof for encoding, because we'd still have the 
problem that the encoding in the shared system tables would be 
ill-defined. That's a pre-existing problem, though. We could simply 
remove support for per-database encodings altogether and fix it at 
initdb time, as Martijn suggest earlier, but now that we have 
per-database locales, per-database encodings is a lot more useful as well.


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] WIP patch: Collation support

2008-09-23 Thread Martijn van Oosterhout
On Mon, Sep 22, 2008 at 10:22:35AM +0300, Heikki Linnakangas wrote:
 BTW, the original patch didn't have any provision for creating rows in 
 pg_collation reflecting the locales available in the OS, but I think 
 we'd need that. Otherwise the DBA would need to manually run CREATE 
 COLLATION for every collation they want users to be able to use. 
 Assuming we do that, the situation that we can't find a row with given 
 LC_COLLATE and LC_CTYPE should not arise in practice.

You're assuming collations are denumerable. They're not. There is no way
to find the list of available collations/locales. You may be able to
guess a few but certainly not all of them.

Have a nice day,
-- 
Martijn van Oosterhout   [EMAIL PROTECTED]   http://svana.org/kleptog/
 Please line up in a tree and maintain the heap invariant while 
 boarding. Thank you for flying nlogn airlines.


signature.asc
Description: Digital signature


Re: [HACKERS] WIP patch: Collation support

2008-09-23 Thread Dave Page
On Tue, Sep 23, 2008 at 10:20 AM, Heikki Linnakangas
[EMAIL PROTECTED] wrote:
 Committed.

*adds yet another item to his pgAdmin todo list* :-(

 Tom Lane wrote:

 * You should try to get rid of LOCALE_NAME_BUFLEN altogether.  Definitely
 the comment about it in pg_control.h is now obsolete.

 Yep. I removed LOCALE_NAME_BUFLEN. The real max length of a locale name is
 now NAMEDATALEN, because it's stored in a name field in pg_database.
 NAMEDATALEN is only 64 bytes, whereas LOCALE_NAME_BUFLEN was 128. 64 bytes
 should be enough for en_GB.UTF8 style locale names, but I wonder if it's
 enough for the longer names used on Windows? Could someone confirm that,
 please?

The longest I can find is:

Serbian (Cyrillic)_Bosnia and Herzegovina

at 42 characters.

-- 
Dave Page
EnterpriseDB UK: http://www.enterprisedb.com

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] WIP patch: Collation support

2008-09-23 Thread Martijn van Oosterhout
On Mon, Sep 22, 2008 at 06:11:04PM +0300, Heikki Linnakangas wrote:
 This patch should allow to use both system catalog and ICU. 
 
 Not without another patch that actually introduces ICU support. What 
 that would look like, how that would be stored in the catalogs, and 
 whether we want that is whole another topic. Without that, the STRCOLFN 
 part of the original patch is pointless, and I would've ripped that out 
 anyway even if we decided to add the pg_collation catalog in this release.

Eh? How you store collations names is easy. Every collation has a
textual name which is what we store in the catalog. I'm not sure why
you'd think it'd be any more complicated than that. And it has
precisely nothing to do with ICU and everything to do with being able
to support multiple source of collation information. We already have
two: builtin (strcmp/memcmp) and C library (strcoll). Letting peope add
more is the goal.

I'm sure once the catalog exists the existing ICU-for-Postgres patch
will be adjusted to use it.

Have a nice day,
-- 
Martijn van Oosterhout   [EMAIL PROTECTED]   http://svana.org/kleptog/
 Please line up in a tree and maintain the heap invariant while 
 boarding. Thank you for flying nlogn airlines.


signature.asc
Description: Digital signature


Re: [HACKERS] WIP patch: Collation support

2008-09-23 Thread Heikki Linnakangas

Martijn van Oosterhout wrote:

On Mon, Sep 22, 2008 at 10:22:35AM +0300, Heikki Linnakangas wrote:
BTW, the original patch didn't have any provision for creating rows in 
pg_collation reflecting the locales available in the OS, but I think 
we'd need that. Otherwise the DBA would need to manually run CREATE 
COLLATION for every collation they want users to be able to use. 
Assuming we do that, the situation that we can't find a row with given 
LC_COLLATE and LC_CTYPE should not arise in practice.


You're assuming collations are denumerable. They're not. There is no way
to find the list of available collations/locales. You may be able to
guess a few but certainly not all of them.


locale -a manages to do it somehow...

--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] WIP patch: Collation support

2008-09-23 Thread Martijn van Oosterhout
On Tue, Sep 23, 2008 at 01:32:38PM +0300, Heikki Linnakangas wrote:
 Martijn van Oosterhout wrote:
 You're assuming collations are denumerable. They're not. There is no way
 to find the list of available collations/locales. You may be able to
 guess a few but certainly not all of them.
 
 locale -a manages to do it somehow...

Sure, by (on glibc) opening the binary archive and parsing it and then
trying to reverse lookup the alias list. We could ofcourse program
something for each platform to determine a list but who is going to
maintain that? How do you handle the list changing?

Have a nice day,
-- 
Martijn van Oosterhout   [EMAIL PROTECTED]   http://svana.org/kleptog/
 Please line up in a tree and maintain the heap invariant while 
 boarding. Thank you for flying nlogn airlines.


signature.asc
Description: Digital signature


Re: [HACKERS] WIP patch: Collation support

2008-09-23 Thread Zdenek Kotala

Martijn van Oosterhout napsal(a):

On Mon, Sep 22, 2008 at 06:11:04PM +0300, Heikki Linnakangas wrote:
This patch should allow to use both system catalog and ICU. 
Not without another patch that actually introduces ICU support. What 
that would look like, how that would be stored in the catalogs, and 
whether we want that is whole another topic. Without that, the STRCOLFN 
part of the original patch is pointless, and I would've ripped that out 
anyway even if we decided to add the pg_collation catalog in this release.


Eh? How you store collations names is easy. Every collation has a
textual name which is what we store in the catalog. I'm not sure why
you'd think it'd be any more complicated than that. And it has
precisely nothing to do with ICU and everything to do with being able
to support multiple source of collation information. We already have
two: builtin (strcmp/memcmp) and C library (strcoll). Letting peope add
more is the goal.


pg_collation catalog is also important for pg_dump, because system 
collation names are not compatible over OS and pg_dump output should be 
portable. pg_collation adds abstract layer which solve this problem.



I'm sure once the catalog exists the existing ICU-for-Postgres patch
will be adjusted to use it.


Yes, I agree with Martijn.

Zdenek

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] WIP patch: Collation support

2008-09-23 Thread Heikki Linnakangas

Zdenek Kotala wrote:
pg_collation catalog is also important for pg_dump, because system 
collation names are not compatible over OS and pg_dump output should be 
portable. pg_collation adds abstract layer which solve this problem.


That's a valid point. We'll still need a way to map OS locale to 
whatever internal names we invent for them, though, so I'm not sure if 
the pg_collation catalog helps much, but just moves the problem 
elsewhere. The pg_dump output of the CREATE COLLATION statements still 
wouldn't be portable from one OS to another.


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] WIP patch: Collation support

2008-09-23 Thread Tom Lane
Martijn van Oosterhout [EMAIL PROTECTED] writes:
 On Tue, Sep 23, 2008 at 01:32:38PM +0300, Heikki Linnakangas wrote:
 locale -a manages to do it somehow...

 Sure, by (on glibc) opening the binary archive and parsing it and then
 trying to reverse lookup the alias list. We could ofcourse program
 something for each platform to determine a list but who is going to
 maintain that? How do you handle the list changing?

exec(locale -a) ...

I suppose we'd need something else for Windows, but I'm sure there's
a way.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] WIP patch: Collation support

2008-09-23 Thread Magnus Hagander
Tom Lane wrote:
 Martijn van Oosterhout [EMAIL PROTECTED] writes:
 On Tue, Sep 23, 2008 at 01:32:38PM +0300, Heikki Linnakangas wrote:
 locale -a manages to do it somehow...
 
 Sure, by (on glibc) opening the binary archive and parsing it and then
 trying to reverse lookup the alias list. We could ofcourse program
 something for each platform to determine a list but who is going to
 maintain that? How do you handle the list changing?
 
 exec(locale -a) ...
 
 I suppose we'd need something else for Windows, but I'm sure there's
 a way.

IIRC, the data is in the registry. Should be enumerable somehow - we'll
have to do it platform specific of course, but it's not the first time
we'd do that for windows...

//Magnus

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] WIP patch: Collation support

2008-09-23 Thread Zdenek Kotala

Heikki Linnakangas napsal(a):

Zdenek Kotala wrote:
pg_collation catalog is also important for pg_dump, because system 
collation names are not compatible over OS and pg_dump output should 
be portable. pg_collation adds abstract layer which solve this problem.


That's a valid point. We'll still need a way to map OS locale to 
whatever internal names we invent for them, though, so I'm not sure if 
the pg_collation catalog helps much, but just moves the problem 
elsewhere. 


It is true. For names we can for example use RFC479 0IANA register) 
http://tools.ietf.org/html/rfc4790#section-7

or use UNICODE terminology CLDR.

The pg_dump output of the CREATE COLLATION statements still 
wouldn't be portable from one OS to another.




I don't think so that pg_collation catalog should be dumped (maybe only 
with extra switch).




Zdenek

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] WIP patch: Collation support

2008-09-23 Thread Petr Jelinek

Magnus Hagander wrote:

exec(locale -a) ...

I suppose we'd need something else for Windows, but I'm sure there's
a way.


IIRC, the data is in the registry. Should be enumerable somehow - we'll
have to do it platform specific of course, but it's not the first time
we'd do that for windows...



There is EnumSystemLocales API function in Windows.

--
Regards
Petr Jelinek (PJMODOS)

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] WIP patch: Collation support

2008-09-23 Thread Magnus Hagander
Petr Jelinek wrote:
 Magnus Hagander wrote:
 exec(locale -a) ...

 I suppose we'd need something else for Windows, but I'm sure there's
 a way.

 IIRC, the data is in the registry. Should be enumerable somehow - we'll
 have to do it platform specific of course, but it's not the first time
 we'd do that for windows...

 
 There is EnumSystemLocales API function in Windows.

Ha, right. We even use it in the installer :-)
Bottom line remains: we can easily do this in a Windows-specific way if
we need to.

//Magnus

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] WIP patch: Collation support

2008-09-22 Thread Heikki Linnakangas

Martijn van Oosterhout wrote:

On Fri, Sep 19, 2008 at 10:13:43AM +0300, Heikki Linnakangas wrote:
It's not like the patch is going to disappear from planet Earth if it 
doesn't get committed for 8.4. It's still valuable and available when 
the new catalogs are needed.


I just prefer it as it was because it takes care of a useful subset of
the features people want in a way that is compatable for the future.
Whereas the stripped down version, I'm not sure it gets us anywhere.


It gives the capability to have different collations in different 
databases within the same cluster. IOW, the same feature as the original 
patch. Finer-grained collation would be even better, of course, but 
database-level collations is a valuable feature on its own.


The critical question is how much compatibility trouble we're going to 
get by having to support the extension to CREATE DATABASE in the 
stripped-down patch, when the pg_collation catalog is introduced in a 
later version in one form or another. So let's investigate that a bit 
further:


In the stripped down version, the CREATE DATABASE syntax is:

CREATE DATABASE name WITH COLLATE=locale name CTYPE=locale name

In the original patch, the CREATE DATABASE syntax is:

CREATE DATABASE name WITH COLLATE=collation name

The first thing that we see is that the COLLATE keyword means different 
things, so it's probably best to change that into:


CREATE DATABASE name WITH LC_COLLATE=locale name LC_CTYPE=locale name

in the stripped-down version. Then we need a way to map the 
stripped-down syntax into the one in the original patch. That's just a 
matter of looking up the collation in the pg_collation catalog with the 
right LC_COLLATE and LC_CTYPE.


Things get slightly more complicated if there is no such collation in 
the pg_collation catalog. One option is to simply create it at that point.


BTW, the original patch didn't have any provision for creating rows in 
pg_collation reflecting the locales available in the OS, but I think 
we'd need that. Otherwise the DBA would need to manually run CREATE 
COLLATION for every collation they want users to be able to use. 
Assuming we do that, the situation that we can't find a row with given 
LC_COLLATE and LC_CTYPE should not arise in practice.


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] WIP patch: Collation support

2008-09-22 Thread Tom Lane
Heikki Linnakangas [EMAIL PROTECTED] writes:
 The first thing that we see is that the COLLATE keyword means different 
 things, so it's probably best to change that into:

 CREATE DATABASE name WITH LC_COLLATE=locale name LC_CTYPE=locale name

 in the stripped-down version. Then we need a way to map the 
 stripped-down syntax into the one in the original patch. That's just a 
 matter of looking up the collation in the pg_collation catalog with the 
 right LC_COLLATE and LC_CTYPE.

It seems to me that in an installation using libc-based collation
support, the collation names are likely to be the same as allowed values
of LC_COLLATE anyway.  So inventing different keywords doesn't really
seem necessary.

What might be sensible to ask is whether it is ever actually reasonable
for LC_COLLATE and LC_CTYPE to have different settings.  If we were
willing to enforce that they be the same, we could reduce this to just
the standard syntax COLLATE=something and be done with it.  Not being
much of a user of anything except C locale, I might be the wrong person
to opine on this; but it seems to me that having them different is far
more likely to be a mistake than desirable.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] WIP patch: Collation support

2008-09-22 Thread Heikki Linnakangas

Tom Lane wrote:

What might be sensible to ask is whether it is ever actually reasonable
for LC_COLLATE and LC_CTYPE to have different settings.  If we were
willing to enforce that they be the same, we could reduce this to just
the standard syntax COLLATE=something and be done with it.  Not being
much of a user of anything except C locale, I might be the wrong person
to opine on this; but it seems to me that having them different is far
more likely to be a mistake than desirable.


Agreed, it doesn't make much sense. I find it hard to imagine anyone 
doing that on purpose, but we have supported it at initdb time for ages.


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] WIP patch: Collation support

2008-09-22 Thread Zdenek Kotala

Heikki Linnakangas napsal(a):


For anyone counting, Firebird added support for ICU more than three
years ago.


ICU is orthogonal to this patch. This patch didn't provide ICU support, 
and we could start using ICU without the catalog changes.


This patch should allow to use both system catalog and ICU. pg_collate catalog 
contains comparing function which is called for string comparing and if somebody 
creates function which will use ICU then ICU will be supported. It is advantage 
of pg_catalog. Without them you can have system or ICU but probably not both.


Zdenek




--
Zdenek Kotala  Sun Microsystems
Prague, Czech Republic http://sun.com/postgresql


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] WIP patch: Collation support

2008-09-22 Thread Heikki Linnakangas

Zdenek Kotala wrote:

Heikki Linnakangas napsal(a):


For anyone counting, Firebird added support for ICU more than three
years ago.


ICU is orthogonal to this patch. This patch didn't provide ICU 
support, and we could start using ICU without the catalog changes.


This patch should allow to use both system catalog and ICU. 


Not without another patch that actually introduces ICU support. What 
that would look like, how that would be stored in the catalogs, and 
whether we want that is whole another topic. Without that, the STRCOLFN 
part of the original patch is pointless, and I would've ripped that out 
anyway even if we decided to add the pg_collation catalog in this release.


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] WIP patch: Collation support

2008-09-19 Thread Martijn van Oosterhout
On Thu, Sep 18, 2008 at 05:41:16PM +0300, Heikki Linnakangas wrote:
 Attached is an updated version of the stripped-down patch. I've cleaned 
 it up a bit, and added more sanity checks. Documentation is still 
 missing and I haven't test it much.

FWIW, I disagree with the stripping down and think we should aim for
the whole patch as submitted. Historically, the hardest part of getting
collation support into postgres has been the catalog changes and easier
the support for something other than OS locales. I supported the patch as
it was precisely because it finally did that, and the stripping down
takes us back to square one.

Implementing the COLLATE keyword is much easier once you have the underlying
support and woldn't involve any backward incompatabilities. The
stripped down version gets us to a state which will only be supported
(hopefully) for one release.

For anyone counting, Firebird added support for ICU more than three
years ago.

Have a nice day,
-- 
Martijn van Oosterhout   [EMAIL PROTECTED]   http://svana.org/kleptog/
 Please line up in a tree and maintain the heap invariant while 
 boarding. Thank you for flying nlogn airlines.


signature.asc
Description: Digital signature


Re: [HACKERS] WIP patch: Collation support

2008-09-19 Thread Heikki Linnakangas

Martijn van Oosterhout wrote:

FWIW, I disagree with the stripping down and think we should aim for
the whole patch as submitted. Historically, the hardest part of getting
collation support into postgres has been the catalog changes and easier
the support for something other than OS locales. I supported the patch as
it was precisely because it finally did that, and the stripping down
takes us back to square one.


The catalog changes are one issue, but hardly the biggest one. 
Database-level collation works just as well without them, and for 
finer-grain control, the proposed patch was nowhere near enough. You'll 
need further catalog changes to track collation for each index column, 
for example, probably something in the operator families or classes etc.


In addition to catalog changes, for finer grained collation you need 
changes in the planner and executor as well. In planner, to provide 
support for the concept of a sort order using a specific collation, and 
track that whenever sort order is handled. In executor, the capability 
to compare and sort using different locales at different times. Those 
changes seem a lot more difficult to me than adding a couple of system 
catalogs, which is pretty straightforward.


It's not like the patch is going to disappear from planet Earth if it 
doesn't get committed for 8.4. It's still valuable and available when 
the new catalogs are needed.



Implementing the COLLATE keyword is much easier once you have the underlying
support and woldn't involve any backward incompatabilities. The
stripped down version gets us to a state which will only be supported
(hopefully) for one release.


Yes, I hope so too. Implementation in phases is a good thing in my mind, 
but adding a lot of complexity that isn't need until in a future release 
is not. We will need to support the 8.4 CREATE DATABASE syntax for 
backwards-compatibility, but I don't see a problem with that.



For anyone counting, Firebird added support for ICU more than three
years ago.


ICU is orthogonal to this patch. This patch didn't provide ICU support, 
and we could start using ICU without the catalog changes.


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] WIP patch: Collation support

2008-09-19 Thread Heikki Linnakangas
Here's an updated version of the stripped-down patch, now with 
documentation changes, plus a couple of minor bug fixes.


I think this is looking pretty good now, and I would appreciate review 
from others before I go ahead committing this. Is there any more places 
in the documentation that needs updating, or any grammar mistakes? 
Further opinions on whether to allow using a database with different 
collation order as template, requiring a manual reindex and facing weird 
results if that's neglected, or to error out with a hint suggesting to 
use template0 as template (as is in the patch now)?


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com
diff --git a/doc/src/sgml/charset.sgml b/doc/src/sgml/charset.sgml
index f484db8..1e1786a 100644
--- a/doc/src/sgml/charset.sgml
+++ b/doc/src/sgml/charset.sgml
@@ -130,23 +130,24 @@ initdb --locale=sv_SE
 
para
 The nature of some locale categories is that their value has to be
-fixed for the lifetime of a database cluster.  That is, once
-commandinitdb/command has run, you cannot change them anymore.
-literalLC_COLLATE/literal and literalLC_CTYPE/literal are
-those categories.  They affect the sort order of indexes, so they
-must be kept fixed, or indexes on text columns will become corrupt.
-productnamePostgreSQL/productname enforces this by recording
-the values of envarLC_COLLATE/ and envarLC_CTYPE/ that are
-seen by commandinitdb/.  The server automatically adopts
-those two values when it is started.
+fixed when the database is created.  You can use different settings
+for different databases, but once a database is created, you cannot
+change them for that database anymore. literalLC_COLLATE/literal
+and literalLC_CTYPE/literal are those categories.  They affect
+the sort order of indexes, so they must be kept fixed, or indexes on
+text columns will become corrupt.  The default values for these
+categories are defined when commandinitdb/command is run, and
+those values are used when new databases are created, unless
+explicitly specified otherwise in the commandCREATE
+DATABASE/command command.
/para
 
para
 The other locale categories can be changed as desired whenever the
 server is running by setting the run-time configuration variables
 that have the same name as the locale categories (see xref
-linkend=runtime-config-client-format for details).  The defaults that are
-chosen by commandinitdb/command are actually only written into
+linkend=runtime-config-client-format for details).  The defaults
+that are chosen by commandinitdb/command are actually only written into
 the configuration file filenamepostgresql.conf/filename to
 serve as defaults when the server is started.  If you delete these
 assignments from filenamepostgresql.conf/filename then the
@@ -261,7 +262,7 @@ initdb --locale=sv_SE
 
para
 Check that productnamePostgreSQL/ is actually using the locale
-that you think it is.  envarLC_COLLATE/ and envarLC_CTYPE/
+that you think it is.  The default envarLC_COLLATE/ and envarLC_CTYPE/
 settings are determined at commandinitdb/ time and cannot be
 changed without repeating commandinitdb/.  Other locale
 settings including envarLC_MESSAGES/ and envarLC_MONETARY/
@@ -320,16 +321,10 @@ initdb --locale=sv_SE
 
   para
An important restriction, however, is that each database character set
-   must be compatible with the server's envarLC_CTYPE/ setting.
+   must be compatible with the database's envarLC_CTYPE/ setting.
When envarLC_CTYPE/ is literalC/ or literalPOSIX/, any
character set is allowed, but for other settings of envarLC_CTYPE/
there is only one character set that will work correctly.
-   Since the envarLC_CTYPE/ setting is frozen by commandinitdb/, the
-   apparent flexibility to use different encodings in different databases
-   of a cluster is more theoretical than real, except when you select
-   literalC/ or literalPOSIX/ locale (thus disabling any real locale
-   awareness).  It is likely that these mechanisms will be revisited in future
-   versions of productnamePostgreSQL/productname.
   /para
 
sect2 id=multibyte-charset-supported
@@ -734,19 +729,19 @@ initdb -E EUC_JP
 /para
 
 para
- If you have selected literalC/ or literalPOSIX/ locale,
- you can create a database with a different character set:
+ You can specify a non-default encoding at database creation time,
+ provided that the encoding is compatible with the selected locale:
 
 screen
-createdb -E EUC_KR korean
+createdb -E EUC_KR -T template0 --lc-collate=ko_KR.euckr --lc-ctype=ko_KR.euckr korean
 /screen
 
  This will create a database named literalkorean/literal that
- uses the character set literalEUC_KR/literal.  Another way to
- accomplish this is to use this SQL command:
+ uses the character set literalEUC_KR/literal, and locale 

Re: [HACKERS] WIP patch: Collation support

2008-09-19 Thread Tom Lane
Heikki Linnakangas [EMAIL PROTECTED] writes:
 Here's an updated version of the stripped-down patch, now with 
 documentation changes, plus a couple of minor bug fixes.

 I think this is looking pretty good now, and I would appreciate review 
 from others before I go ahead committing this.

I guess the $64 question is whether this moves us towards where we want to
be?  If the ultimate goal is to get to per-column collations, I think this
patch is likely to be counterproductive, because it will establish syntax
and semantics that we'll have to try to remain compatible with.  If we're
going to be satisfied with per-database collations, then great, let's do
it.


As far as reviewing the patch itself goes:

* It's missing pg_dump (or I guess really pg_dumpall) support.

* Please call the new columns datcollate and datctype.  Just because
someone ignored the naming convention years ago for pg_database.encoding
doesn't mean we should ignore it here.  Also, if you're going to name one
column based on an LC_foo variable's name, do the other one the same way.
I note also that this would match the keywords used in CREATE DATABASE.

* Don't even think of adding system catalog columns without updating
catalogs.sgml.

* You should try to get rid of LOCALE_NAME_BUFLEN altogether.  Definitely
the comment about it in pg_control.h is now obsolete.

* You *must* bump PG_CONTROL_VERSION when you change its layout.
(Don't forget catversion too, since you're also changing pg_database)

* This doc sentence reads a bit awkwardly:

An important restriction, however, is that each database character set
must be compatible with the database's envarLC_CTYPE/ setting.

Maybe

An important restriction, however, is that each database's character set
must be compatible with the database's envarLC_CTYPE/ setting.

Also I wonder whether we shouldn't say that it must be compatible with
LC_CTYPE *and* LC_COLLATE.

* This bit in the CREATE DATABASE ref page is also awkward:

Any character set encoding specified for the new database must be
compatible with the chosen COLLATE and CTYPE settings.

Maybe

The character set encoding used for the new database must be
compatible with the chosen COLLATE and CTYPE settings.

* This bit doesn't look ready to commit:

***
*** 754,759 
--- 749,756 
   option-l/option option or the command\l/command command
   of commandpsql/command.
  
+ XXX It would be nice to show the korean database here. And we should
+ modify psql to show locale as well
  screen
  $ userinputpsql -l/userinput
  List of databases

* This makes sense, but then shouldn't we make the identical restriction
for encoding?

+The literalCOLLATE/ and literalCTYPE/ settings must match
+those of the template database, except when template0 is used as
+template. This is because literalCOLLATE/ and literalCTYPE/

* I can't tell whether the writer of this bit thought that a blank line
would come out as a paragraph break or not.  Either add para's or
remove the blank lines to make it look like one para in the source:

--- 76,105 
  
para
 commandinitdb/command initializes the database cluster's default
!locale and character set encoding. 
! 
!The character set encoding, collation order (literalLC_COLLATE/)
!and character set classes (literalLC_CTYPE/, e.g. upper, lower,
!digit) can be set separately for a database when it is created.
!commandinitdb/command determines those settings for the
!literaltemplate1/literal database, which will serve as the
!default for all other databases.
! 
!To alter the default collation order or character set classes, use the
!option--lc-collate/option and option--lc-ctype/option options.
!Collation orders other than literalC/ or literalPOSIX/ also have
!a performance penalty.  For these reasons it is important to choose the
!right locale when running commandinitdb/command. 
! 
!The remaining locale categories can be changed later when the server
!is started.  You can also use option--locale/option to set the
!default for all locale categories, including collation order and
!character set classes. All server locale values (literallc_*/) can
!be displayed via commandSHOW ALL/.
 More details can be found in xref linkend=locale.
  
!To alter the default encoding, use the option--encoding/option.
!More details can be found in xref linkend=multibyte.
/para

* This bit for resetxlog is just wrong now:

--- 62,70 
 by specifying the literal-f/ (force) switch.  In this case plausible
 values will be substituted for the missing data.  Most of the fields can be
 expected to match, but manual assistance might be needed for the next OID,
!next transaction ID and epoch, next multitransaction ID and offset, and
!WAL starting address fields.
!The first five of these can be set using the switches discussed below.
 If you are 

Re: [HACKERS] WIP patch: Collation support

2008-09-19 Thread Heikki Linnakangas

Heikki Linnakangas wrote:
Here's an updated version of the stripped-down patch, now with 
documentation changes, plus a couple of minor bug fixes.


Another update, marching towards committing. Now with pg_dump/pg_dumpall 
support, and collation/ctype is also shown in psql \l output.


I wonder if we should provide a shorthand LOCALE=localename option, 
in addition to separate COLLATE and CTYPE options? Most people will 
always set them together. It could also set the rest of the locale 
categories, lc_numeric, lc_monetary, and so on, as database-specific 
options. Like ALTER DATABASE dbname SET lc_messages=localename ... 
would.


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com
diff --git a/doc/src/sgml/charset.sgml b/doc/src/sgml/charset.sgml
index f484db8..5581b10 100644
--- a/doc/src/sgml/charset.sgml
+++ b/doc/src/sgml/charset.sgml
@@ -130,23 +130,24 @@ initdb --locale=sv_SE
 
para
 The nature of some locale categories is that their value has to be
-fixed for the lifetime of a database cluster.  That is, once
-commandinitdb/command has run, you cannot change them anymore.
-literalLC_COLLATE/literal and literalLC_CTYPE/literal are
-those categories.  They affect the sort order of indexes, so they
-must be kept fixed, or indexes on text columns will become corrupt.
-productnamePostgreSQL/productname enforces this by recording
-the values of envarLC_COLLATE/ and envarLC_CTYPE/ that are
-seen by commandinitdb/.  The server automatically adopts
-those two values when it is started.
+fixed when the database is created.  You can use different settings
+for different databases, but once a database is created, you cannot
+change them for that database anymore. literalLC_COLLATE/literal
+and literalLC_CTYPE/literal are those categories.  They affect
+the sort order of indexes, so they must be kept fixed, or indexes on
+text columns will become corrupt.  The default values for these
+categories are defined when commandinitdb/command is run, and
+those values are used when new databases are created, unless
+explicitly specified otherwise in the commandCREATE
+DATABASE/command command.
/para
 
para
 The other locale categories can be changed as desired whenever the
 server is running by setting the run-time configuration variables
 that have the same name as the locale categories (see xref
-linkend=runtime-config-client-format for details).  The defaults that are
-chosen by commandinitdb/command are actually only written into
+linkend=runtime-config-client-format for details).  The defaults
+that are chosen by commandinitdb/command are actually only written into
 the configuration file filenamepostgresql.conf/filename to
 serve as defaults when the server is started.  If you delete these
 assignments from filenamepostgresql.conf/filename then the
@@ -261,7 +262,7 @@ initdb --locale=sv_SE
 
para
 Check that productnamePostgreSQL/ is actually using the locale
-that you think it is.  envarLC_COLLATE/ and envarLC_CTYPE/
+that you think it is.  The default envarLC_COLLATE/ and envarLC_CTYPE/
 settings are determined at commandinitdb/ time and cannot be
 changed without repeating commandinitdb/.  Other locale
 settings including envarLC_MESSAGES/ and envarLC_MONETARY/
@@ -320,16 +321,10 @@ initdb --locale=sv_SE
 
   para
An important restriction, however, is that each database character set
-   must be compatible with the server's envarLC_CTYPE/ setting.
+   must be compatible with the database's envarLC_CTYPE/ setting.
When envarLC_CTYPE/ is literalC/ or literalPOSIX/, any
character set is allowed, but for other settings of envarLC_CTYPE/
there is only one character set that will work correctly.
-   Since the envarLC_CTYPE/ setting is frozen by commandinitdb/, the
-   apparent flexibility to use different encodings in different databases
-   of a cluster is more theoretical than real, except when you select
-   literalC/ or literalPOSIX/ locale (thus disabling any real locale
-   awareness).  It is likely that these mechanisms will be revisited in future
-   versions of productnamePostgreSQL/productname.
   /para
 
sect2 id=multibyte-charset-supported
@@ -734,19 +729,19 @@ initdb -E EUC_JP
 /para
 
 para
- If you have selected literalC/ or literalPOSIX/ locale,
- you can create a database with a different character set:
+ You can specify a non-default encoding at database creation time,
+ provided that the encoding is compatible with the selected locale:
 
 screen
-createdb -E EUC_KR korean
+createdb -E EUC_KR -T template0 --lc-collate=ko_KR.euckr --lc-ctype=ko_KR.euckr korean
 /screen
 
  This will create a database named literalkorean/literal that
- uses the character set literalEUC_KR/literal.  Another way to
- accomplish this is to use this SQL command:
+ uses the character 

Re: [HACKERS] WIP patch: Collation support

2008-09-19 Thread Martijn van Oosterhout
On Fri, Sep 19, 2008 at 10:13:43AM +0300, Heikki Linnakangas wrote:
 In addition to catalog changes, for finer grained collation you need 
 changes in the planner and executor as well. In planner, to provide 
 support for the concept of a sort order using a specific collation, and 
 track that whenever sort order is handled. In executor, the capability 
 to compare and sort using different locales at different times. Those 
 changes seem a lot more difficult to me than adding a couple of system 
 catalogs, which is pretty straightforward.

To be honest, I think that's the easy part. When I submitted the
collate patch a few years back, getting the sort order working was the
easy part, precisely because COLLATE clauses can affect the generated
plan in precisely one way: can you use index FOO or not. Once you're
past that point you can throw almost all the collation information
away.

As you point out though, the shear volume of catalog changes required
obscure that fact somewhat.

 It's not like the patch is going to disappear from planet Earth if it 
 doesn't get committed for 8.4. It's still valuable and available when 
 the new catalogs are needed.

I just prefer it as it was because it takes care of a useful subset of
the features people want in a way that is compatable for the future.
Whereas the stripped down version, I'm not sure it gets us anywhere.

Have a nice day,
-- 
Martijn van Oosterhout   [EMAIL PROTECTED]   http://svana.org/kleptog/
 Please line up in a tree and maintain the heap invariant while 
 boarding. Thank you for flying nlogn airlines.


signature.asc
Description: Digital signature


Re: [HACKERS] WIP patch: Collation support

2008-09-18 Thread Heikki Linnakangas

Gregory Stark wrote:

Heikki Linnakangas [EMAIL PROTECTED] writes:


Well, I proposed disallowing using a different collation than the source
database, except for using template0 as the source. That's pretty limited, but
is trivial to implement and still let's you have databases with different
collations in the same cluster.


+   if (strcmp(dbtemplate, template0) != 0 
+   (strcmp(lc_collate, src_collation) || strcmp(lc_ctype, 
src_ctype)))
+   ereport(NOTICE,
+   (errmsg(database \%s\ needs to be reindexed 
manually (REINDEX DATABASE),
+   dbname)));  
+ 


This isn't what you described but I think I prefer it this way as just a
warning not an error.


Well, I'd prefer to make it an error, but I'm willing to listen if 
others feel otherwise. I don't think the inconvenience of having to use 
template0 is that big, compared to the potential of strange behavior 
people would run into if they ignore the advice to reindex.


One weakness with a straight strcmp comparison is that it won't 
recognize aliases of the same locale. For example, fi_FI.UTF8 and 
fi_FI.UTF-8.



AFAIK we can't easily connect to the new database and do some fiddling with
it, can we? If we could we could check if there are any non-empty indexes
which depend on the collation and only print the warning if we find any (and
even mark them invalid).


I don't see that happening, unfortunately..

Attached is an updated version of the stripped-down patch. I've cleaned 
it up a bit, and added more sanity checks. Documentation is still 
missing and I haven't test it much.


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com
*** doc/src/sgml/ref/create_database.sgml
--- doc/src/sgml/ref/create_database.sgml
***
*** 24,29  CREATE DATABASE replaceable class=PARAMETERname/replaceable
--- 24,31 
  [ [ WITH ] [ OWNER [=] replaceable class=parameterdbowner/replaceable ]
 [ TEMPLATE [=] replaceable class=parametertemplate/replaceable ]
 [ ENCODING [=] replaceable class=parameterencoding/replaceable ]
+[ COLLATE [=] replaceable class=parametercollation/replaceable ]
+[ CTYPE [=] replaceable class=parameterctype/replaceable ]
 [ TABLESPACE [=] replaceable class=parametertablespace/replaceable ]
 [ CONNECTION LIMIT [=] replaceable class=parameterconnlimit/replaceable ] ]
  /synopsis
***
*** 113,118  CREATE DATABASE replaceable class=PARAMETERname/replaceable
--- 115,136 
/listitem
   /varlistentry
   varlistentry
+   termreplaceable class=parametercollation/replaceable/term
+   listitem
+para
+ LC_COLLATE setting to use in the new database.  XXX
+/para
+   /listitem
+  /varlistentry
+  varlistentry
+   termreplaceable class=parameterctype/replaceable/term
+   listitem
+para
+ LC_CTYPE setting to use in the new database.  XXX
+/para
+   /listitem
+  /varlistentry
+  varlistentry
termreplaceable class=parametertablespace/replaceable/term
listitem
 para
*** src/backend/access/transam/xlog.c
--- src/backend/access/transam/xlog.c
***
*** 3847,3853  WriteControlFile(void)
  {
  	int			fd;
  	char		buffer[PG_CONTROL_SIZE];		/* need not be aligned */
- 	char	   *localeptr;
  
  	/*
  	 * Initialize version and compatibility-check fields
--- 3847,3852 
***
*** 3876,3893  WriteControlFile(void)
  	ControlFile-float4ByVal = FLOAT4PASSBYVAL;
  	ControlFile-float8ByVal = FLOAT8PASSBYVAL;
  
- 	ControlFile-localeBuflen = LOCALE_NAME_BUFLEN;
- 	localeptr = setlocale(LC_COLLATE, NULL);
- 	if (!localeptr)
- 		ereport(PANIC,
- (errmsg(invalid LC_COLLATE setting)));
- 	StrNCpy(ControlFile-lc_collate, localeptr, LOCALE_NAME_BUFLEN);
- 	localeptr = setlocale(LC_CTYPE, NULL);
- 	if (!localeptr)
- 		ereport(PANIC,
- (errmsg(invalid LC_CTYPE setting)));
- 	StrNCpy(ControlFile-lc_ctype, localeptr, LOCALE_NAME_BUFLEN);
- 
  	/* Contents are protected with a CRC */
  	INIT_CRC32(ControlFile-crc);
  	COMP_CRC32(ControlFile-crc,
--- 3875,3880 
***
*** 4126,4159  ReadControlFile(void)
  		but the server was compiled without USE_FLOAT8_BYVAL.),
   errhint(It looks like you need to recompile or initdb.)));
  #endif
- 
- 	if (ControlFile-localeBuflen != LOCALE_NAME_BUFLEN)
- 		ereport(FATAL,
- (errmsg(database files are incompatible with server),
-  errdetail(The database cluster was initialized with LOCALE_NAME_BUFLEN %d,
-    but the server was compiled with LOCALE_NAME_BUFLEN %d.,
- 		   ControlFile-localeBuflen, LOCALE_NAME_BUFLEN),
-  errhint(It looks like you need to recompile or initdb.)));
- 	if (pg_perm_setlocale(LC_COLLATE, ControlFile-lc_collate) == NULL)
- 		ereport(FATAL,
- 			(errmsg(database files are 

Re: [HACKERS] WIP patch: Collation support

2008-09-16 Thread Heikki Linnakangas

Martijn van Oosterhout wrote:

On Wed, Sep 10, 2008 at 12:51:02PM +0300, Heikki Linnakangas wrote:

Since the set of collations isn't exactly denumerable, we need some way
to allow the user to specify the collation they want. The only
collation PostgreSQL knows about is the C collation. Anything else is
user-defined.
Let's just use the name of the OS locale, like we do now. Having a 
pg_collation catalog just moves the problem elsewhere: we'd still need 
something in pg_collation to tie the collation to the OS locale.


There's not a one-to-one mapping between collation and locale name. A
locale name includes information about the charset and a collation may
have paramters like case-sensetivity and pad-attribute which are not
present in the locale name. You need a mapping anyway, which is what
this table is for.


Ideally, we would delegate the case-sensitivity and padding to the 
collation implementation (ie. OS setlocale() or ICU). That said, I don't 
think operating systems normally ship case-insensitive variants of 
locales by default, so I agree it would be nice if we could implement 
that ourselves. Still, we could identify case-sensitive locale names for 
example by a suffix, like en_GB.UTF8.case-insensitive.


I agree we will eventually need a way to give shorthand names for 
collations, and a pg_collation catalog will then come handy. But that 
can wait until we have the basic infrastructure ready to support column 
and query-level collation.



But that put us back where we started: every database having the same
collation. We're trying to move away from that. Just reindex everything
and be done with it.

That's easier said than done, unfortunately.


I don't see an alternative.


Well, I proposed disallowing using a different collation than the source 
database, except for using template0 as the source. That's pretty 
limited, but is trivial to implement and still let's you have databases 
with different collations in the same cluster.


I worked a bit on Radek's patch, stripping out all the pg_collate and 
pg_charset catalog changes and commands, leaving just the core 
functionality of database-level collations. It needs some cleanup and 
documentation, but something like this I'd like to commit in this commit 
fest. The new catalogs can wait until we have a real need for them.


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com
*** src/backend/access/transam/xlog.c
--- src/backend/access/transam/xlog.c
***
*** 3847,3853  WriteControlFile(void)
  {
  	int			fd;
  	char		buffer[PG_CONTROL_SIZE];		/* need not be aligned */
- 	char	   *localeptr;
  
  	/*
  	 * Initialize version and compatibility-check fields
--- 3847,3852 
***
*** 3875,3893  WriteControlFile(void)
  #endif
  	ControlFile-float4ByVal = FLOAT4PASSBYVAL;
  	ControlFile-float8ByVal = FLOAT8PASSBYVAL;
! 
! 	ControlFile-localeBuflen = LOCALE_NAME_BUFLEN;
! 	localeptr = setlocale(LC_COLLATE, NULL);
! 	if (!localeptr)
! 		ereport(PANIC,
! (errmsg(invalid LC_COLLATE setting)));
! 	StrNCpy(ControlFile-lc_collate, localeptr, LOCALE_NAME_BUFLEN);
! 	localeptr = setlocale(LC_CTYPE, NULL);
! 	if (!localeptr)
! 		ereport(PANIC,
! (errmsg(invalid LC_CTYPE setting)));
! 	StrNCpy(ControlFile-lc_ctype, localeptr, LOCALE_NAME_BUFLEN);
! 
  	/* Contents are protected with a CRC */
  	INIT_CRC32(ControlFile-crc);
  	COMP_CRC32(ControlFile-crc,
--- 3874,3880 
  #endif
  	ControlFile-float4ByVal = FLOAT4PASSBYVAL;
  	ControlFile-float8ByVal = FLOAT8PASSBYVAL;
! 	
  	/* Contents are protected with a CRC */
  	INIT_CRC32(ControlFile-crc);
  	COMP_CRC32(ControlFile-crc,
***
*** 4126,4159  ReadControlFile(void)
  		but the server was compiled without USE_FLOAT8_BYVAL.),
   errhint(It looks like you need to recompile or initdb.)));
  #endif
- 
- 	if (ControlFile-localeBuflen != LOCALE_NAME_BUFLEN)
- 		ereport(FATAL,
- (errmsg(database files are incompatible with server),
-  errdetail(The database cluster was initialized with LOCALE_NAME_BUFLEN %d,
-    but the server was compiled with LOCALE_NAME_BUFLEN %d.,
- 		   ControlFile-localeBuflen, LOCALE_NAME_BUFLEN),
-  errhint(It looks like you need to recompile or initdb.)));
- 	if (pg_perm_setlocale(LC_COLLATE, ControlFile-lc_collate) == NULL)
- 		ereport(FATAL,
- 			(errmsg(database files are incompatible with operating system),
- 			 errdetail(The database cluster was initialized with LC_COLLATE \%s\,
- 	which is not recognized by setlocale().,
- 	   ControlFile-lc_collate),
- 			 errhint(It looks like you need to initdb or install locale support.)));
- 	if (pg_perm_setlocale(LC_CTYPE, ControlFile-lc_ctype) == NULL)
- 		ereport(FATAL,
- 			(errmsg(database files are incompatible with operating system),
- 		errdetail(The database cluster was initialized with LC_CTYPE \%s\,
-    which is not recognized by setlocale().,
-   ControlFile-lc_ctype),
- 			 errhint(It looks like 

Re: [HACKERS] WIP patch: Collation support

2008-09-16 Thread Gregory Stark
Heikki Linnakangas [EMAIL PROTECTED] writes:

 Well, I proposed disallowing using a different collation than the source
 database, except for using template0 as the source. That's pretty limited, but
 is trivial to implement and still let's you have databases with different
 collations in the same cluster.

+   if (strcmp(dbtemplate, template0) != 0 
+   (strcmp(lc_collate, src_collation) || strcmp(lc_ctype, 
src_ctype)))
+   ereport(NOTICE,
+   (errmsg(database \%s\ needs to be reindexed 
manually (REINDEX DATABASE),
+   dbname)));  
+ 

This isn't what you described but I think I prefer it this way as just a
warning not an error. I can easily imagine cases where the admin knows there
are either no indexes or no data in their database or where they're perfectly
happy to reindex. A flat prohibition seems annoying.

That said it's worth noting that this would tie our hands with proposals like
retail vacuum which risk database corruption if they fail to refind an index
pointer for a tuple...

Incidentally it seems like the warning should actually explain *why* it needs
to be reindexed manually and perhaps what the consequences are until it is.

AFAIK we can't easily connect to the new database and do some fiddling with
it, can we? If we could we could check if there are any non-empty indexes
which depend on the collation and only print the warning if we find any (and
even mark them invalid).

-- 
  Gregory Stark
  EnterpriseDB  http://www.enterprisedb.com
  Ask me about EnterpriseDB's Slony Replication support!

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] WIP patch: Collation support

2008-09-16 Thread Zdenek Kotala

Gregory Stark napsal(a):

Heikki Linnakangas [EMAIL PROTECTED] writes:



snip


AFAIK we can't easily connect to the new database and do some fiddling with
it, can we? If we could we could check if there are any non-empty indexes
which depend on the collation and only print the warning if we find any (and
even mark them invalid).



Autovacum uses InitPostgres function to swith to another database. I'm 
not sure how much safe it is in create database command and when we are 
already switched we can reindex affected indexes.


Zdenek

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] WIP patch: Collation support

2008-09-16 Thread Zdenek Kotala

Heikki Linnakangas napsal(a):

Martijn van Oosterhout wrote:

On Wed, Sep 10, 2008 at 12:51:02PM +0300, Heikki Linnakangas wrote:

Since the set of collations isn't exactly denumerable, we need some way
to allow the user to specify the collation they want. The only
collation PostgreSQL knows about is the C collation. Anything else is
user-defined.
Let's just use the name of the OS locale, like we do now. Having a 
pg_collation catalog just moves the problem elsewhere: we'd still 
need something in pg_collation to tie the collation to the OS locale.


There's not a one-to-one mapping between collation and locale name. A
locale name includes information about the charset and a collation may
have paramters like case-sensetivity and pad-attribute which are not
present in the locale name. You need a mapping anyway, which is what
this table is for.


Ideally, we would delegate the case-sensitivity and padding to the 
collation implementation (ie. OS setlocale() or ICU). That said, I don't 
think operating systems normally ship case-insensitive variants of 
locales by default, so I agree it would be nice if we could implement 
that ourselves. Still, we could identify case-sensitive locale names for 
example by a suffix, like en_GB.UTF8.case-insensitive.


The idea was to call to_upper (or to_lower) before case-sensitive 
collation processing. It is difficult to determine from suffix if it is 
sensitive or not.


Zdenek

PS: We can discuss it in Prato

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] WIP patch: Collation support

2008-09-16 Thread Alvaro Herrera
Zdenek Kotala escribió:
 Gregory Stark napsal(a):

 AFAIK we can't easily connect to the new database and do some fiddling with
 it, can we? If we could we could check if there are any non-empty indexes
 which depend on the collation and only print the warning if we find any (and
 even mark them invalid).

 Autovacum uses InitPostgres function to swith to another database. I'm  
 not sure how much safe it is in create database command and when we are  
 already switched we can reindex affected indexes.

It's only supposed to work if you're not previously connected to any
database.  (Autovacuum never switches to another database; any
particular worker only connects to a single database).

-- 
Alvaro Herrerahttp://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] WIP patch: Collation support

2008-09-10 Thread Heikki Linnakangas

Radek Strnad wrote:

Progress so far:
- created catalogs pg_collation a pg_charset which are filled with three
standard collations
- initdb changes rows called DEFAULT in both catalogs during the bki
bootstrap phase with current system LC_COLLATE and LC_CTYPE or those set by
command line.
- new collations can be defined with command CREATE COLLATION collation
name FOR character set specification  FROM existing collation name
[STRCOLFN fn name]
[ pad characteristic ] [ case sensitive ] [ LCCOLLATE lc_collate ] [
LCCTYPE lc_ctype ]
- because of pg_collation and pg_charset are catalogs individual for each
database, if you want to create a database with collation other than
specified, create it in template1 and then create database


I have to wonder, is all that really necessary? The feature you're 
trying to implement is to support database-level collation at first, and 
perhaps column-level collation later. We don't need support for 
user-defined collations and charsets for that.


If leave all that out of the patch for now, we'll have a much slimmer, 
and just as useful patch, implementing database-level collation. We can 
add those catalogs later if we need them, but I don't think there's much 
point in adding all that infrastructure if they just reflect the locales 
installed in the operating system.



- when connecting to database, it retrieves locales from pg_database and
sets them


This is the real gist of this patch.


Design  functionality changes left:
- move retrieveing collation from pg_database to pg_type


I don't understand this item. What will you move?


- get case sensitivity and pad characteristic working


I feel we should leave this to the collation implementation.


- when creating database with different collation than database cluster, the
database has to be reindexed. Any idea how to do it? Function
ReindexDatabase works only when database is opened.


That's a tricky one. One idea is to prohibit choosing a different 
collation than the one in the template database, unless we know it's 
safe to do so without reindexing. The problem is that we don't know 
whether it's safe. A simple but limiting solution would be to require 
that the template database has the same collation as the database that's 
being created, except that template0 can always be used as template. 
template0 is safe, because there's no indexes on text columns there.


Note that we already have the same problem with encodings. If you create 
a database with LATIN1 encoding, load it with data, and then use that as 
a template for a database with UTF-8 encoding, the text data will be 
incorrectly encoded. We should probably fix that too.


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] WIP patch: Collation support

2008-09-10 Thread Martijn van Oosterhout
On Wed, Sep 10, 2008 at 11:29:14AM +0300, Heikki Linnakangas wrote:
 Radek Strnad wrote:
 - because of pg_collation and pg_charset are catalogs individual for each
 database, if you want to create a database with collation other than
 specified, create it in template1 and then create database
 
 I have to wonder, is all that really necessary? The feature you're 
 trying to implement is to support database-level collation at first, and 
 perhaps column-level collation later. We don't need support for 
 user-defined collations and charsets for that.

Since the set of collations isn't exactly denumerable, we need some way
to allow the user to specify the collation they want. The only
collation PostgreSQL knows about is the C collation. Anything else is
user-defined.

 Design  functionality changes left:
 - move retrieveing collation from pg_database to pg_type
 
 I don't understand this item. What will you move?

Long term, the collation is a property of the type, but I agree, I'm not
sure why this patch needs it.

 That's a tricky one. One idea is to prohibit choosing a different 
 collation than the one in the template database, unless we know it's 
 safe to do so without reindexing.

But that put us back where we started: every database having the same
collation. We're trying to move away from that. Just reindex everything
and be done with it.

 Note that we already have the same problem with encodings. If you create 
 a database with LATIN1 encoding, load it with data, and then use that as 
 a template for a database with UTF-8 encoding, the text data will be 
 incorrectly encoded. We should probably fix that too.

I'd say forbid more than one encoding in a cluster, but that's just my
opinion :)

Have a nice day,
-- 
Martijn van Oosterhout   [EMAIL PROTECTED]   http://svana.org/kleptog/
 Please line up in a tree and maintain the heap invariant while 
 boarding. Thank you for flying nlogn airlines.


signature.asc
Description: Digital signature


Re: [HACKERS] WIP patch: Collation support

2008-09-10 Thread Heikki Linnakangas

Martijn van Oosterhout wrote:

On Wed, Sep 10, 2008 at 11:29:14AM +0300, Heikki Linnakangas wrote:

Radek Strnad wrote:

- because of pg_collation and pg_charset are catalogs individual for each
database, if you want to create a database with collation other than
specified, create it in template1 and then create database
I have to wonder, is all that really necessary? The feature you're 
trying to implement is to support database-level collation at first, and 
perhaps column-level collation later. We don't need support for 
user-defined collations and charsets for that.


Since the set of collations isn't exactly denumerable, we need some way
to allow the user to specify the collation they want. The only
collation PostgreSQL knows about is the C collation. Anything else is
user-defined.


Let's just use the name of the OS locale, like we do now. Having a 
pg_collation catalog just moves the problem elsewhere: we'd still need 
something in pg_collation to tie the collation to the OS locale.



Design  functionality changes left:
- move retrieveing collation from pg_database to pg_type

I don't understand this item. What will you move?


Long term, the collation is a property of the type, ...


You might want to provide a default collation for a type as well, but 
the very finest grade is that you can specify collation for every (text) 
comparison operator in your query. Of course you don't want to do that 
for every query, which is why we should provide defaults at different 
levels: columns, tables, database. And perhaps types as well, but that's 
not the most interesting case.


I'm not sure what the SQL spec says about that, but I believe it 
provides syntax and rules for all that.


That's a tricky one. One idea is to prohibit choosing a different 
collation than the one in the template database, unless we know it's 
safe to do so without reindexing.


But that put us back where we started: every database having the same
collation. We're trying to move away from that. Just reindex everything
and be done with it.


That's easier said than done, unfortunately.

Note that we already have the same problem with encodings. If you create 
a database with LATIN1 encoding, load it with data, and then use that as 
a template for a database with UTF-8 encoding, the text data will be 
incorrectly encoded. We should probably fix that too.


I'd say forbid more than one encoding in a cluster, but that's just my
opinion :)


Yeah, that's pretty useless, at least without support for different 
locales on different databases. But might as well keep it unless there's 
a pressing reason to drop it.


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] WIP patch: Collation support

2008-09-10 Thread Zdenek Kotala

Heikki Linnakangas napsal(a):


Design  functionality changes left:
- move retrieveing collation from pg_database to pg_type




The problem there is that pg_collation is local catalog, but pg_database 
is global catalog. IIRC, It was discussed during last commitfest. I 
think it is bad idea to make this kind of dependency. It seems to me 
better to implement something what we will use later to avoid useless 
and confusing dependency.


Zdenek


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] WIP patch: Collation support

2008-09-10 Thread Martijn van Oosterhout
On Wed, Sep 10, 2008 at 12:51:02PM +0300, Heikki Linnakangas wrote:
 Since the set of collations isn't exactly denumerable, we need some way
 to allow the user to specify the collation they want. The only
 collation PostgreSQL knows about is the C collation. Anything else is
 user-defined.
 
 Let's just use the name of the OS locale, like we do now. Having a 
 pg_collation catalog just moves the problem elsewhere: we'd still need 
 something in pg_collation to tie the collation to the OS locale.

There's not a one-to-one mapping between collation and locale name. A
locale name includes information about the charset and a collation may
have paramters like case-sensetivity and pad-attribute which are not
present in the locale name. You need a mapping anyway, which is what
this table is for.

The difference in collation between CHAR() and VARCHAR() is the usual
example here.

 Long term, the collation is a property of the type, ...
 
 I'm not sure what the SQL spec says about that, but I believe it 
 provides syntax and rules for all that.

The spec is quite detailed about and I posted code to do it years ago.
The point is that we don't need to go that far with this patch.

 But that put us back where we started: every database having the same
 collation. We're trying to move away from that. Just reindex everything
 and be done with it.
 
 That's easier said than done, unfortunately.

I don't see an alternative.

Have a nice day,
-- 
Martijn van Oosterhout   [EMAIL PROTECTED]   http://svana.org/kleptog/
 Please line up in a tree and maintain the heap invariant while 
 boarding. Thank you for flying nlogn airlines.


signature.asc
Description: Digital signature


Re: [HACKERS] WIP patch: Collation support

2008-09-02 Thread Peter Eisentraut

Radek Strnad wrote:
- new collations can be defined with command CREATE COLLATION collation 
name FOR character set specification  FROM existing collation name 
[STRCOLFN fn name]
[ pad characteristic ] [ case sensitive ] [ LCCOLLATE lc_collate ] 
[ LCCTYPE lc_ctype ]


How do you plan to make a collation case sensitive or accent sensitive? 
 I have previously commented that this is not a realistic view on how 
collations work.  Since you are apparently planning to use the system 
locales, I don't see how you can make this work.



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] WIP patch: Collation support

2008-09-02 Thread Martijn van Oosterhout
On Tue, Sep 02, 2008 at 02:50:47PM +0300, Peter Eisentraut wrote:
 Radek Strnad wrote:
 - new collations can be defined with command CREATE COLLATION collation 
 name FOR character set specification  FROM existing collation name 
 [STRCOLFN fn name]
 [ pad characteristic ] [ case sensitive ] [ LCCOLLATE lc_collate ] 
 [ LCCTYPE lc_ctype ]
 
 How do you plan to make a collation case sensitive or accent sensitive? 
  I have previously commented that this is not a realistic view on how 
 collations work.  Since you are apparently planning to use the system 
 locales, I don't see how you can make this work.

While it's true POSIX locales don't handle this, other collation
libraries do and we should support them if the user wants.

Have a nice day,
-- 
Martijn van Oosterhout   [EMAIL PROTECTED]   http://svana.org/kleptog/
 Please line up in a tree and maintain the heap invariant while 
 boarding. Thank you for flying nlogn airlines.


signature.asc
Description: Digital signature


Re: [HACKERS] WIP patch: Collation support

2008-09-02 Thread Radek Strnad
I think at least case sensitivity can be done by comparing two strings
converted to upper case with toupper() function.

Regards

Radek Strnad


 On Tue, Sep 2, 2008 at 2:00 PM, Martijn van Oosterhout [EMAIL 
 PROTECTED]wrote:

 On Tue, Sep 02, 2008 at 02:50:47PM +0300, Peter Eisentraut wrote:
  Radek Strnad wrote:
  - new collations can be defined with command CREATE COLLATION
 collation
  name FOR character set specification  FROM existing collation name
  [STRCOLFN fn name]
  [ pad characteristic ] [ case sensitive ] [ LCCOLLATE lc_collate
 ]
  [ LCCTYPE lc_ctype ]
 
  How do you plan to make a collation case sensitive or accent sensitive?
   I have previously commented that this is not a realistic view on how
  collations work.  Since you are apparently planning to use the system
  locales, I don't see how you can make this work.

 While it's true POSIX locales don't handle this, other collation
 libraries do and we should support them if the user wants.

 Have a nice day,
 --
 Martijn van Oosterhout   [EMAIL PROTECTED]   http://svana.org/kleptog/
  Please line up in a tree and maintain the heap invariant while
  boarding. Thank you for flying nlogn airlines.

 -BEGIN PGP SIGNATURE-
 Version: GnuPG v1.4.6 (GNU/Linux)

 iD8DBQFIvSrIIB7bNG8LQkwRAnkWAJ9FaiR9cOHFN2vkVmQaK5y7N9OJoQCbB+Ks
 e0E4722hY/Q+Cz8tpzA0CGs=
 =2Svh
 -END PGP SIGNATURE-





Re: [HACKERS] WIP patch: Collation support

2008-09-02 Thread Peter Eisentraut

Martijn van Oosterhout wrote:

On Tue, Sep 02, 2008 at 02:50:47PM +0300, Peter Eisentraut wrote:

Radek Strnad wrote:
- new collations can be defined with command CREATE COLLATION collation 
name FOR character set specification  FROM existing collation name 
[STRCOLFN fn name]
[ pad characteristic ] [ case sensitive ] [ LCCOLLATE lc_collate ] 
[ LCCTYPE lc_ctype ]
How do you plan to make a collation case sensitive or accent sensitive? 
 I have previously commented that this is not a realistic view on how 
collations work.  Since you are apparently planning to use the system 
locales, I don't see how you can make this work.


While it's true POSIX locales don't handle this, other collation
libraries do and we should support them if the user wants.


Do they handle exactly those two attributes specifically?  Can you point 
out references?  Or do you mean, other collation libraries allow their 
collations to be configured/customized?  I think linguistically it is a 
very narrow view of the world to hardcode those two attributes.


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] WIP patch: Collation support

2008-09-02 Thread Martijn van Oosterhout
On Tue, Sep 02, 2008 at 04:46:16PM +0300, Peter Eisentraut wrote:
 While it's true POSIX locales don't handle this, other collation
 libraries do and we should support them if the user wants.
 
 Do they handle exactly those two attributes specifically?  Can you point 
 out references?  Or do you mean, other collation libraries allow their 
 collations to be configured/customized?  I think linguistically it is a 
 very narrow view of the world to hardcode those two attributes.

Well, yes. Accents and case are attributes of a character. (I'm using
the unicode model here). So, to do a case insensetive match you take
the characters, strip the attributes and then do the comparison. There
are specialised routines which handle the denormalisation of the string
for you so in theory you could even get specific about which accents
you ignore. In practice I don't think people do that.

Have a nice day,
-- 
Martijn van Oosterhout   [EMAIL PROTECTED]   http://svana.org/kleptog/
 Please line up in a tree and maintain the heap invariant while 
 boarding. Thank you for flying nlogn airlines.


signature.asc
Description: Digital signature


Re: [HACKERS] WIP patch: Collation support

2008-09-02 Thread Gregory Stark

Martijn van Oosterhout [EMAIL PROTECTED] writes:

 On Tue, Sep 02, 2008 at 04:46:16PM +0300, Peter Eisentraut wrote:
 While it's true POSIX locales don't handle this, other collation
 libraries do and we should support them if the user wants.

I think that's backwards. We have to go with the lowest common denominator
functionality of those libraries if we're going to be portable. As long as
it's a superset of the SQL standard functionality. If we support features of
some of them that can't be emulated with others then users end up with SQL
code that will only work on some builds and not others. That might be worth it
for some features but I'm not sure this is one.

 Well, yes. Accents and case are attributes of a character. (I'm using
 the unicode model here). So, to do a case insensetive match you take
 the characters, strip the attributes and then do the comparison. There
 are specialised routines which handle the denormalisation of the string
 for you so in theory you could even get specific about which accents
 you ignore. In practice I don't think people do that.

I don't think composable unicode characters are really about collations. I
think it had more to do with representing glyphs in UTF32 before they gave up
on that. Does anyone still use composable characters?

Note that we don't currently support composable characters at all. I'm not
sure if that's a nobody really cares issue or a bug we should aim to fix
with real collation support.

-- 
  Gregory Stark
  EnterpriseDB  http://www.enterprisedb.com
  Ask me about EnterpriseDB's Slony Replication support!

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] WIP patch: Collation support

2008-09-02 Thread Martijn van Oosterhout
On Tue, Sep 02, 2008 at 05:42:13PM +0100, Gregory Stark wrote:
 Martijn van Oosterhout [EMAIL PROTECTED] writes:
  On Tue, Sep 02, 2008 at 04:46:16PM +0300, Peter Eisentraut wrote:
  While it's true POSIX locales don't handle this, other collation
  libraries do and we should support them if the user wants.
 
 I think that's backwards. We have to go with the lowest common denominator
 functionality of those libraries if we're going to be portable. 

And I think that's backwards. Why can we only use a feature once every
OS out there implements it? We still run on systems that don't have SSL
support. LC_TYPE settings are not portable between systems, yet that
doesn't bother anyone. Why should we have a problem with collate
settings not being portable?

 I don't think composable unicode characters are really about collations. I
 think it had more to do with representing glyphs in UTF32 before they gave up
 on that. Does anyone still use composable characters?

Lookup the various normalisations forms:
http://en.wikipedia.org/wiki/Unicode_normalization
In particular Normal Form D.

Sure, composable characters have nothing to do with collation, but they
provide a uniform way of doing accent insensetive collation.

 Note that we don't currently support composable characters at all. 

Any character which is an accent on a latin character is a decomposable
character. And last I checked we supported those.

Have a niceday,
-- 
Martijn van Oosterhout   [EMAIL PROTECTED]   http://svana.org/kleptog/
 Please line up in a tree and maintain the heap invariant while 
 boarding. Thank you for flying nlogn airlines.


signature.asc
Description: Digital signature


Re: [HACKERS] WIP patch: Collation support

2008-09-02 Thread Alvaro Herrera
Radek Strnad escribió:
 Ok, so do you suggest to leave it with a notice reindex database or start
 to solve it somehow?

I don't know.  If there are two tasks that need the same treatment, it
seems a safe conclusion that they need a common solution.

-- 
Alvaro Herrerahttp://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] WIP patch: Collation support

2008-09-02 Thread Tom Lane
Martijn van Oosterhout [EMAIL PROTECTED] writes:
 And I think that's backwards. Why can we only use a feature once every
 OS out there implements it? We still run on systems that don't have SSL
 support. LC_TYPE settings are not portable between systems, yet that
 doesn't bother anyone. Why should we have a problem with collate
 settings not being portable?

If we're going to approach it that way, we need a syntax for CREATE
COLLATION that doesn't hard-wire what the possible collation modifiers
are.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] WIP patch: Collation support

2008-09-01 Thread Radek Strnad
Ok, so do you suggest to leave it with a notice reindex database or start
to solve it somehow?

Regards

Radek Strnad

On Mon, Sep 1, 2008 at 12:08 AM, Alvaro Herrera
[EMAIL PROTECTED]wrote:

 Radek Strnad escribió:

  - when creating database with different collation than database cluster,
 the
  database has to be reindexed. Any idea how to do it? Function
  ReindexDatabase works only when database is opened.

 We have this Todo item:

  Set proper permissions on non-system schemas during db creation
Currently all schemas are owned by the super-user because they are
copied from the template1 database. However, since all objects are
inherited from the template database, it is not clear that setting
schemas to the db owner is correct.

 When this was discussed years ago, one proposed idea was that on the
 first connection the backend should, as a first task, ensure that
 certain administrative chores be done.  The first of those was changing
 the ownership of schemas.  It sounds like this reindexing you propose
 falls into the same category.

 --
 Alvaro Herrera
 http://www.CommandPrompt.com/
 PostgreSQL Replication, Consulting, Custom Development, 24x7 support



Re: [HACKERS] WIP patch: Collation support

2008-08-31 Thread Alvaro Herrera
Radek Strnad escribió:

 - when creating database with different collation than database cluster, the
 database has to be reindexed. Any idea how to do it? Function
 ReindexDatabase works only when database is opened.

We have this Todo item:

  Set proper permissions on non-system schemas during db creation
Currently all schemas are owned by the super-user because they are
copied from the template1 database. However, since all objects are
inherited from the template database, it is not clear that setting
schemas to the db owner is correct. 

When this was discussed years ago, one proposed idea was that on the
first connection the backend should, as a first task, ensure that
certain administrative chores be done.  The first of those was changing
the ownership of schemas.  It sounds like this reindexing you propose
falls into the same category.

-- 
Alvaro Herrerahttp://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers