Re: [HACKERS] WIP patch: Collation support
Committed. Tom Lane wrote: * You should try to get rid of LOCALE_NAME_BUFLEN altogether. Definitely the comment about it in pg_control.h is now obsolete. Yep. I removed LOCALE_NAME_BUFLEN. The real max length of a locale name is now NAMEDATALEN, because it's stored in a name field in pg_database. NAMEDATALEN is only 64 bytes, whereas LOCALE_NAME_BUFLEN was 128. 64 bytes should be enough for en_GB.UTF8 style locale names, but I wonder if it's enough for the longer names used on Windows? Could someone confirm that, please? An important restriction, however, is that each database's character set must be compatible with the database's envarLC_CTYPE/ setting. Also I wonder whether we shouldn't say that it must be compatible with LC_CTYPE *and* LC_COLLATE. I think we should, but that's in fact not what is tested. Before the patch as well, we only tested that the encoding matches LC_CTYPE, but you could set LC_COLLATE to anything. I'll work on a subsequent patch to tighten that. * This makes sense, but then shouldn't we make the identical restriction for encoding? +The literalCOLLATE/ and literalCTYPE/ settings must match +those of the template database, except when template0 is used as +template. This is because literalCOLLATE/ and literalCTYPE/ It wouldn't be as bullet-proof for encoding, because we'd still have the problem that the encoding in the shared system tables would be ill-defined. That's a pre-existing problem, though. We could simply remove support for per-database encodings altogether and fix it at initdb time, as Martijn suggest earlier, but now that we have per-database locales, per-database encodings is a lot more useful as well. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] WIP patch: Collation support
On Mon, Sep 22, 2008 at 10:22:35AM +0300, Heikki Linnakangas wrote: BTW, the original patch didn't have any provision for creating rows in pg_collation reflecting the locales available in the OS, but I think we'd need that. Otherwise the DBA would need to manually run CREATE COLLATION for every collation they want users to be able to use. Assuming we do that, the situation that we can't find a row with given LC_COLLATE and LC_CTYPE should not arise in practice. You're assuming collations are denumerable. They're not. There is no way to find the list of available collations/locales. You may be able to guess a few but certainly not all of them. Have a nice day, -- Martijn van Oosterhout [EMAIL PROTECTED] http://svana.org/kleptog/ Please line up in a tree and maintain the heap invariant while boarding. Thank you for flying nlogn airlines. signature.asc Description: Digital signature
Re: [HACKERS] WIP patch: Collation support
On Tue, Sep 23, 2008 at 10:20 AM, Heikki Linnakangas [EMAIL PROTECTED] wrote: Committed. *adds yet another item to his pgAdmin todo list* :-( Tom Lane wrote: * You should try to get rid of LOCALE_NAME_BUFLEN altogether. Definitely the comment about it in pg_control.h is now obsolete. Yep. I removed LOCALE_NAME_BUFLEN. The real max length of a locale name is now NAMEDATALEN, because it's stored in a name field in pg_database. NAMEDATALEN is only 64 bytes, whereas LOCALE_NAME_BUFLEN was 128. 64 bytes should be enough for en_GB.UTF8 style locale names, but I wonder if it's enough for the longer names used on Windows? Could someone confirm that, please? The longest I can find is: Serbian (Cyrillic)_Bosnia and Herzegovina at 42 characters. -- Dave Page EnterpriseDB UK: http://www.enterprisedb.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] WIP patch: Collation support
On Mon, Sep 22, 2008 at 06:11:04PM +0300, Heikki Linnakangas wrote: This patch should allow to use both system catalog and ICU. Not without another patch that actually introduces ICU support. What that would look like, how that would be stored in the catalogs, and whether we want that is whole another topic. Without that, the STRCOLFN part of the original patch is pointless, and I would've ripped that out anyway even if we decided to add the pg_collation catalog in this release. Eh? How you store collations names is easy. Every collation has a textual name which is what we store in the catalog. I'm not sure why you'd think it'd be any more complicated than that. And it has precisely nothing to do with ICU and everything to do with being able to support multiple source of collation information. We already have two: builtin (strcmp/memcmp) and C library (strcoll). Letting peope add more is the goal. I'm sure once the catalog exists the existing ICU-for-Postgres patch will be adjusted to use it. Have a nice day, -- Martijn van Oosterhout [EMAIL PROTECTED] http://svana.org/kleptog/ Please line up in a tree and maintain the heap invariant while boarding. Thank you for flying nlogn airlines. signature.asc Description: Digital signature
Re: [HACKERS] WIP patch: Collation support
Martijn van Oosterhout wrote: On Mon, Sep 22, 2008 at 10:22:35AM +0300, Heikki Linnakangas wrote: BTW, the original patch didn't have any provision for creating rows in pg_collation reflecting the locales available in the OS, but I think we'd need that. Otherwise the DBA would need to manually run CREATE COLLATION for every collation they want users to be able to use. Assuming we do that, the situation that we can't find a row with given LC_COLLATE and LC_CTYPE should not arise in practice. You're assuming collations are denumerable. They're not. There is no way to find the list of available collations/locales. You may be able to guess a few but certainly not all of them. locale -a manages to do it somehow... -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] WIP patch: Collation support
On Tue, Sep 23, 2008 at 01:32:38PM +0300, Heikki Linnakangas wrote: Martijn van Oosterhout wrote: You're assuming collations are denumerable. They're not. There is no way to find the list of available collations/locales. You may be able to guess a few but certainly not all of them. locale -a manages to do it somehow... Sure, by (on glibc) opening the binary archive and parsing it and then trying to reverse lookup the alias list. We could ofcourse program something for each platform to determine a list but who is going to maintain that? How do you handle the list changing? Have a nice day, -- Martijn van Oosterhout [EMAIL PROTECTED] http://svana.org/kleptog/ Please line up in a tree and maintain the heap invariant while boarding. Thank you for flying nlogn airlines. signature.asc Description: Digital signature
Re: [HACKERS] WIP patch: Collation support
Martijn van Oosterhout napsal(a): On Mon, Sep 22, 2008 at 06:11:04PM +0300, Heikki Linnakangas wrote: This patch should allow to use both system catalog and ICU. Not without another patch that actually introduces ICU support. What that would look like, how that would be stored in the catalogs, and whether we want that is whole another topic. Without that, the STRCOLFN part of the original patch is pointless, and I would've ripped that out anyway even if we decided to add the pg_collation catalog in this release. Eh? How you store collations names is easy. Every collation has a textual name which is what we store in the catalog. I'm not sure why you'd think it'd be any more complicated than that. And it has precisely nothing to do with ICU and everything to do with being able to support multiple source of collation information. We already have two: builtin (strcmp/memcmp) and C library (strcoll). Letting peope add more is the goal. pg_collation catalog is also important for pg_dump, because system collation names are not compatible over OS and pg_dump output should be portable. pg_collation adds abstract layer which solve this problem. I'm sure once the catalog exists the existing ICU-for-Postgres patch will be adjusted to use it. Yes, I agree with Martijn. Zdenek -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] WIP patch: Collation support
Zdenek Kotala wrote: pg_collation catalog is also important for pg_dump, because system collation names are not compatible over OS and pg_dump output should be portable. pg_collation adds abstract layer which solve this problem. That's a valid point. We'll still need a way to map OS locale to whatever internal names we invent for them, though, so I'm not sure if the pg_collation catalog helps much, but just moves the problem elsewhere. The pg_dump output of the CREATE COLLATION statements still wouldn't be portable from one OS to another. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] WIP patch: Collation support
Martijn van Oosterhout [EMAIL PROTECTED] writes: On Tue, Sep 23, 2008 at 01:32:38PM +0300, Heikki Linnakangas wrote: locale -a manages to do it somehow... Sure, by (on glibc) opening the binary archive and parsing it and then trying to reverse lookup the alias list. We could ofcourse program something for each platform to determine a list but who is going to maintain that? How do you handle the list changing? exec(locale -a) ... I suppose we'd need something else for Windows, but I'm sure there's a way. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] WIP patch: Collation support
Tom Lane wrote: Martijn van Oosterhout [EMAIL PROTECTED] writes: On Tue, Sep 23, 2008 at 01:32:38PM +0300, Heikki Linnakangas wrote: locale -a manages to do it somehow... Sure, by (on glibc) opening the binary archive and parsing it and then trying to reverse lookup the alias list. We could ofcourse program something for each platform to determine a list but who is going to maintain that? How do you handle the list changing? exec(locale -a) ... I suppose we'd need something else for Windows, but I'm sure there's a way. IIRC, the data is in the registry. Should be enumerable somehow - we'll have to do it platform specific of course, but it's not the first time we'd do that for windows... //Magnus -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] WIP patch: Collation support
Heikki Linnakangas napsal(a): Zdenek Kotala wrote: pg_collation catalog is also important for pg_dump, because system collation names are not compatible over OS and pg_dump output should be portable. pg_collation adds abstract layer which solve this problem. That's a valid point. We'll still need a way to map OS locale to whatever internal names we invent for them, though, so I'm not sure if the pg_collation catalog helps much, but just moves the problem elsewhere. It is true. For names we can for example use RFC479 0IANA register) http://tools.ietf.org/html/rfc4790#section-7 or use UNICODE terminology CLDR. The pg_dump output of the CREATE COLLATION statements still wouldn't be portable from one OS to another. I don't think so that pg_collation catalog should be dumped (maybe only with extra switch). Zdenek -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] WIP patch: Collation support
Magnus Hagander wrote: exec(locale -a) ... I suppose we'd need something else for Windows, but I'm sure there's a way. IIRC, the data is in the registry. Should be enumerable somehow - we'll have to do it platform specific of course, but it's not the first time we'd do that for windows... There is EnumSystemLocales API function in Windows. -- Regards Petr Jelinek (PJMODOS) -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] WIP patch: Collation support
Petr Jelinek wrote: Magnus Hagander wrote: exec(locale -a) ... I suppose we'd need something else for Windows, but I'm sure there's a way. IIRC, the data is in the registry. Should be enumerable somehow - we'll have to do it platform specific of course, but it's not the first time we'd do that for windows... There is EnumSystemLocales API function in Windows. Ha, right. We even use it in the installer :-) Bottom line remains: we can easily do this in a Windows-specific way if we need to. //Magnus -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] WIP patch: Collation support
Martijn van Oosterhout wrote: On Fri, Sep 19, 2008 at 10:13:43AM +0300, Heikki Linnakangas wrote: It's not like the patch is going to disappear from planet Earth if it doesn't get committed for 8.4. It's still valuable and available when the new catalogs are needed. I just prefer it as it was because it takes care of a useful subset of the features people want in a way that is compatable for the future. Whereas the stripped down version, I'm not sure it gets us anywhere. It gives the capability to have different collations in different databases within the same cluster. IOW, the same feature as the original patch. Finer-grained collation would be even better, of course, but database-level collations is a valuable feature on its own. The critical question is how much compatibility trouble we're going to get by having to support the extension to CREATE DATABASE in the stripped-down patch, when the pg_collation catalog is introduced in a later version in one form or another. So let's investigate that a bit further: In the stripped down version, the CREATE DATABASE syntax is: CREATE DATABASE name WITH COLLATE=locale name CTYPE=locale name In the original patch, the CREATE DATABASE syntax is: CREATE DATABASE name WITH COLLATE=collation name The first thing that we see is that the COLLATE keyword means different things, so it's probably best to change that into: CREATE DATABASE name WITH LC_COLLATE=locale name LC_CTYPE=locale name in the stripped-down version. Then we need a way to map the stripped-down syntax into the one in the original patch. That's just a matter of looking up the collation in the pg_collation catalog with the right LC_COLLATE and LC_CTYPE. Things get slightly more complicated if there is no such collation in the pg_collation catalog. One option is to simply create it at that point. BTW, the original patch didn't have any provision for creating rows in pg_collation reflecting the locales available in the OS, but I think we'd need that. Otherwise the DBA would need to manually run CREATE COLLATION for every collation they want users to be able to use. Assuming we do that, the situation that we can't find a row with given LC_COLLATE and LC_CTYPE should not arise in practice. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] WIP patch: Collation support
Heikki Linnakangas [EMAIL PROTECTED] writes: The first thing that we see is that the COLLATE keyword means different things, so it's probably best to change that into: CREATE DATABASE name WITH LC_COLLATE=locale name LC_CTYPE=locale name in the stripped-down version. Then we need a way to map the stripped-down syntax into the one in the original patch. That's just a matter of looking up the collation in the pg_collation catalog with the right LC_COLLATE and LC_CTYPE. It seems to me that in an installation using libc-based collation support, the collation names are likely to be the same as allowed values of LC_COLLATE anyway. So inventing different keywords doesn't really seem necessary. What might be sensible to ask is whether it is ever actually reasonable for LC_COLLATE and LC_CTYPE to have different settings. If we were willing to enforce that they be the same, we could reduce this to just the standard syntax COLLATE=something and be done with it. Not being much of a user of anything except C locale, I might be the wrong person to opine on this; but it seems to me that having them different is far more likely to be a mistake than desirable. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] WIP patch: Collation support
Tom Lane wrote: What might be sensible to ask is whether it is ever actually reasonable for LC_COLLATE and LC_CTYPE to have different settings. If we were willing to enforce that they be the same, we could reduce this to just the standard syntax COLLATE=something and be done with it. Not being much of a user of anything except C locale, I might be the wrong person to opine on this; but it seems to me that having them different is far more likely to be a mistake than desirable. Agreed, it doesn't make much sense. I find it hard to imagine anyone doing that on purpose, but we have supported it at initdb time for ages. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] WIP patch: Collation support
Heikki Linnakangas napsal(a): For anyone counting, Firebird added support for ICU more than three years ago. ICU is orthogonal to this patch. This patch didn't provide ICU support, and we could start using ICU without the catalog changes. This patch should allow to use both system catalog and ICU. pg_collate catalog contains comparing function which is called for string comparing and if somebody creates function which will use ICU then ICU will be supported. It is advantage of pg_catalog. Without them you can have system or ICU but probably not both. Zdenek -- Zdenek Kotala Sun Microsystems Prague, Czech Republic http://sun.com/postgresql -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] WIP patch: Collation support
Zdenek Kotala wrote: Heikki Linnakangas napsal(a): For anyone counting, Firebird added support for ICU more than three years ago. ICU is orthogonal to this patch. This patch didn't provide ICU support, and we could start using ICU without the catalog changes. This patch should allow to use both system catalog and ICU. Not without another patch that actually introduces ICU support. What that would look like, how that would be stored in the catalogs, and whether we want that is whole another topic. Without that, the STRCOLFN part of the original patch is pointless, and I would've ripped that out anyway even if we decided to add the pg_collation catalog in this release. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] WIP patch: Collation support
On Thu, Sep 18, 2008 at 05:41:16PM +0300, Heikki Linnakangas wrote: Attached is an updated version of the stripped-down patch. I've cleaned it up a bit, and added more sanity checks. Documentation is still missing and I haven't test it much. FWIW, I disagree with the stripping down and think we should aim for the whole patch as submitted. Historically, the hardest part of getting collation support into postgres has been the catalog changes and easier the support for something other than OS locales. I supported the patch as it was precisely because it finally did that, and the stripping down takes us back to square one. Implementing the COLLATE keyword is much easier once you have the underlying support and woldn't involve any backward incompatabilities. The stripped down version gets us to a state which will only be supported (hopefully) for one release. For anyone counting, Firebird added support for ICU more than three years ago. Have a nice day, -- Martijn van Oosterhout [EMAIL PROTECTED] http://svana.org/kleptog/ Please line up in a tree and maintain the heap invariant while boarding. Thank you for flying nlogn airlines. signature.asc Description: Digital signature
Re: [HACKERS] WIP patch: Collation support
Martijn van Oosterhout wrote: FWIW, I disagree with the stripping down and think we should aim for the whole patch as submitted. Historically, the hardest part of getting collation support into postgres has been the catalog changes and easier the support for something other than OS locales. I supported the patch as it was precisely because it finally did that, and the stripping down takes us back to square one. The catalog changes are one issue, but hardly the biggest one. Database-level collation works just as well without them, and for finer-grain control, the proposed patch was nowhere near enough. You'll need further catalog changes to track collation for each index column, for example, probably something in the operator families or classes etc. In addition to catalog changes, for finer grained collation you need changes in the planner and executor as well. In planner, to provide support for the concept of a sort order using a specific collation, and track that whenever sort order is handled. In executor, the capability to compare and sort using different locales at different times. Those changes seem a lot more difficult to me than adding a couple of system catalogs, which is pretty straightforward. It's not like the patch is going to disappear from planet Earth if it doesn't get committed for 8.4. It's still valuable and available when the new catalogs are needed. Implementing the COLLATE keyword is much easier once you have the underlying support and woldn't involve any backward incompatabilities. The stripped down version gets us to a state which will only be supported (hopefully) for one release. Yes, I hope so too. Implementation in phases is a good thing in my mind, but adding a lot of complexity that isn't need until in a future release is not. We will need to support the 8.4 CREATE DATABASE syntax for backwards-compatibility, but I don't see a problem with that. For anyone counting, Firebird added support for ICU more than three years ago. ICU is orthogonal to this patch. This patch didn't provide ICU support, and we could start using ICU without the catalog changes. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] WIP patch: Collation support
Here's an updated version of the stripped-down patch, now with documentation changes, plus a couple of minor bug fixes. I think this is looking pretty good now, and I would appreciate review from others before I go ahead committing this. Is there any more places in the documentation that needs updating, or any grammar mistakes? Further opinions on whether to allow using a database with different collation order as template, requiring a manual reindex and facing weird results if that's neglected, or to error out with a hint suggesting to use template0 as template (as is in the patch now)? -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com diff --git a/doc/src/sgml/charset.sgml b/doc/src/sgml/charset.sgml index f484db8..1e1786a 100644 --- a/doc/src/sgml/charset.sgml +++ b/doc/src/sgml/charset.sgml @@ -130,23 +130,24 @@ initdb --locale=sv_SE para The nature of some locale categories is that their value has to be -fixed for the lifetime of a database cluster. That is, once -commandinitdb/command has run, you cannot change them anymore. -literalLC_COLLATE/literal and literalLC_CTYPE/literal are -those categories. They affect the sort order of indexes, so they -must be kept fixed, or indexes on text columns will become corrupt. -productnamePostgreSQL/productname enforces this by recording -the values of envarLC_COLLATE/ and envarLC_CTYPE/ that are -seen by commandinitdb/. The server automatically adopts -those two values when it is started. +fixed when the database is created. You can use different settings +for different databases, but once a database is created, you cannot +change them for that database anymore. literalLC_COLLATE/literal +and literalLC_CTYPE/literal are those categories. They affect +the sort order of indexes, so they must be kept fixed, or indexes on +text columns will become corrupt. The default values for these +categories are defined when commandinitdb/command is run, and +those values are used when new databases are created, unless +explicitly specified otherwise in the commandCREATE +DATABASE/command command. /para para The other locale categories can be changed as desired whenever the server is running by setting the run-time configuration variables that have the same name as the locale categories (see xref -linkend=runtime-config-client-format for details). The defaults that are -chosen by commandinitdb/command are actually only written into +linkend=runtime-config-client-format for details). The defaults +that are chosen by commandinitdb/command are actually only written into the configuration file filenamepostgresql.conf/filename to serve as defaults when the server is started. If you delete these assignments from filenamepostgresql.conf/filename then the @@ -261,7 +262,7 @@ initdb --locale=sv_SE para Check that productnamePostgreSQL/ is actually using the locale -that you think it is. envarLC_COLLATE/ and envarLC_CTYPE/ +that you think it is. The default envarLC_COLLATE/ and envarLC_CTYPE/ settings are determined at commandinitdb/ time and cannot be changed without repeating commandinitdb/. Other locale settings including envarLC_MESSAGES/ and envarLC_MONETARY/ @@ -320,16 +321,10 @@ initdb --locale=sv_SE para An important restriction, however, is that each database character set - must be compatible with the server's envarLC_CTYPE/ setting. + must be compatible with the database's envarLC_CTYPE/ setting. When envarLC_CTYPE/ is literalC/ or literalPOSIX/, any character set is allowed, but for other settings of envarLC_CTYPE/ there is only one character set that will work correctly. - Since the envarLC_CTYPE/ setting is frozen by commandinitdb/, the - apparent flexibility to use different encodings in different databases - of a cluster is more theoretical than real, except when you select - literalC/ or literalPOSIX/ locale (thus disabling any real locale - awareness). It is likely that these mechanisms will be revisited in future - versions of productnamePostgreSQL/productname. /para sect2 id=multibyte-charset-supported @@ -734,19 +729,19 @@ initdb -E EUC_JP /para para - If you have selected literalC/ or literalPOSIX/ locale, - you can create a database with a different character set: + You can specify a non-default encoding at database creation time, + provided that the encoding is compatible with the selected locale: screen -createdb -E EUC_KR korean +createdb -E EUC_KR -T template0 --lc-collate=ko_KR.euckr --lc-ctype=ko_KR.euckr korean /screen This will create a database named literalkorean/literal that - uses the character set literalEUC_KR/literal. Another way to - accomplish this is to use this SQL command: + uses the character set literalEUC_KR/literal, and locale
Re: [HACKERS] WIP patch: Collation support
Heikki Linnakangas [EMAIL PROTECTED] writes: Here's an updated version of the stripped-down patch, now with documentation changes, plus a couple of minor bug fixes. I think this is looking pretty good now, and I would appreciate review from others before I go ahead committing this. I guess the $64 question is whether this moves us towards where we want to be? If the ultimate goal is to get to per-column collations, I think this patch is likely to be counterproductive, because it will establish syntax and semantics that we'll have to try to remain compatible with. If we're going to be satisfied with per-database collations, then great, let's do it. As far as reviewing the patch itself goes: * It's missing pg_dump (or I guess really pg_dumpall) support. * Please call the new columns datcollate and datctype. Just because someone ignored the naming convention years ago for pg_database.encoding doesn't mean we should ignore it here. Also, if you're going to name one column based on an LC_foo variable's name, do the other one the same way. I note also that this would match the keywords used in CREATE DATABASE. * Don't even think of adding system catalog columns without updating catalogs.sgml. * You should try to get rid of LOCALE_NAME_BUFLEN altogether. Definitely the comment about it in pg_control.h is now obsolete. * You *must* bump PG_CONTROL_VERSION when you change its layout. (Don't forget catversion too, since you're also changing pg_database) * This doc sentence reads a bit awkwardly: An important restriction, however, is that each database character set must be compatible with the database's envarLC_CTYPE/ setting. Maybe An important restriction, however, is that each database's character set must be compatible with the database's envarLC_CTYPE/ setting. Also I wonder whether we shouldn't say that it must be compatible with LC_CTYPE *and* LC_COLLATE. * This bit in the CREATE DATABASE ref page is also awkward: Any character set encoding specified for the new database must be compatible with the chosen COLLATE and CTYPE settings. Maybe The character set encoding used for the new database must be compatible with the chosen COLLATE and CTYPE settings. * This bit doesn't look ready to commit: *** *** 754,759 --- 749,756 option-l/option option or the command\l/command command of commandpsql/command. + XXX It would be nice to show the korean database here. And we should + modify psql to show locale as well screen $ userinputpsql -l/userinput List of databases * This makes sense, but then shouldn't we make the identical restriction for encoding? +The literalCOLLATE/ and literalCTYPE/ settings must match +those of the template database, except when template0 is used as +template. This is because literalCOLLATE/ and literalCTYPE/ * I can't tell whether the writer of this bit thought that a blank line would come out as a paragraph break or not. Either add para's or remove the blank lines to make it look like one para in the source: --- 76,105 para commandinitdb/command initializes the database cluster's default !locale and character set encoding. ! !The character set encoding, collation order (literalLC_COLLATE/) !and character set classes (literalLC_CTYPE/, e.g. upper, lower, !digit) can be set separately for a database when it is created. !commandinitdb/command determines those settings for the !literaltemplate1/literal database, which will serve as the !default for all other databases. ! !To alter the default collation order or character set classes, use the !option--lc-collate/option and option--lc-ctype/option options. !Collation orders other than literalC/ or literalPOSIX/ also have !a performance penalty. For these reasons it is important to choose the !right locale when running commandinitdb/command. ! !The remaining locale categories can be changed later when the server !is started. You can also use option--locale/option to set the !default for all locale categories, including collation order and !character set classes. All server locale values (literallc_*/) can !be displayed via commandSHOW ALL/. More details can be found in xref linkend=locale. !To alter the default encoding, use the option--encoding/option. !More details can be found in xref linkend=multibyte. /para * This bit for resetxlog is just wrong now: --- 62,70 by specifying the literal-f/ (force) switch. In this case plausible values will be substituted for the missing data. Most of the fields can be expected to match, but manual assistance might be needed for the next OID, !next transaction ID and epoch, next multitransaction ID and offset, and !WAL starting address fields. !The first five of these can be set using the switches discussed below. If you are
Re: [HACKERS] WIP patch: Collation support
Heikki Linnakangas wrote: Here's an updated version of the stripped-down patch, now with documentation changes, plus a couple of minor bug fixes. Another update, marching towards committing. Now with pg_dump/pg_dumpall support, and collation/ctype is also shown in psql \l output. I wonder if we should provide a shorthand LOCALE=localename option, in addition to separate COLLATE and CTYPE options? Most people will always set them together. It could also set the rest of the locale categories, lc_numeric, lc_monetary, and so on, as database-specific options. Like ALTER DATABASE dbname SET lc_messages=localename ... would. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com diff --git a/doc/src/sgml/charset.sgml b/doc/src/sgml/charset.sgml index f484db8..5581b10 100644 --- a/doc/src/sgml/charset.sgml +++ b/doc/src/sgml/charset.sgml @@ -130,23 +130,24 @@ initdb --locale=sv_SE para The nature of some locale categories is that their value has to be -fixed for the lifetime of a database cluster. That is, once -commandinitdb/command has run, you cannot change them anymore. -literalLC_COLLATE/literal and literalLC_CTYPE/literal are -those categories. They affect the sort order of indexes, so they -must be kept fixed, or indexes on text columns will become corrupt. -productnamePostgreSQL/productname enforces this by recording -the values of envarLC_COLLATE/ and envarLC_CTYPE/ that are -seen by commandinitdb/. The server automatically adopts -those two values when it is started. +fixed when the database is created. You can use different settings +for different databases, but once a database is created, you cannot +change them for that database anymore. literalLC_COLLATE/literal +and literalLC_CTYPE/literal are those categories. They affect +the sort order of indexes, so they must be kept fixed, or indexes on +text columns will become corrupt. The default values for these +categories are defined when commandinitdb/command is run, and +those values are used when new databases are created, unless +explicitly specified otherwise in the commandCREATE +DATABASE/command command. /para para The other locale categories can be changed as desired whenever the server is running by setting the run-time configuration variables that have the same name as the locale categories (see xref -linkend=runtime-config-client-format for details). The defaults that are -chosen by commandinitdb/command are actually only written into +linkend=runtime-config-client-format for details). The defaults +that are chosen by commandinitdb/command are actually only written into the configuration file filenamepostgresql.conf/filename to serve as defaults when the server is started. If you delete these assignments from filenamepostgresql.conf/filename then the @@ -261,7 +262,7 @@ initdb --locale=sv_SE para Check that productnamePostgreSQL/ is actually using the locale -that you think it is. envarLC_COLLATE/ and envarLC_CTYPE/ +that you think it is. The default envarLC_COLLATE/ and envarLC_CTYPE/ settings are determined at commandinitdb/ time and cannot be changed without repeating commandinitdb/. Other locale settings including envarLC_MESSAGES/ and envarLC_MONETARY/ @@ -320,16 +321,10 @@ initdb --locale=sv_SE para An important restriction, however, is that each database character set - must be compatible with the server's envarLC_CTYPE/ setting. + must be compatible with the database's envarLC_CTYPE/ setting. When envarLC_CTYPE/ is literalC/ or literalPOSIX/, any character set is allowed, but for other settings of envarLC_CTYPE/ there is only one character set that will work correctly. - Since the envarLC_CTYPE/ setting is frozen by commandinitdb/, the - apparent flexibility to use different encodings in different databases - of a cluster is more theoretical than real, except when you select - literalC/ or literalPOSIX/ locale (thus disabling any real locale - awareness). It is likely that these mechanisms will be revisited in future - versions of productnamePostgreSQL/productname. /para sect2 id=multibyte-charset-supported @@ -734,19 +729,19 @@ initdb -E EUC_JP /para para - If you have selected literalC/ or literalPOSIX/ locale, - you can create a database with a different character set: + You can specify a non-default encoding at database creation time, + provided that the encoding is compatible with the selected locale: screen -createdb -E EUC_KR korean +createdb -E EUC_KR -T template0 --lc-collate=ko_KR.euckr --lc-ctype=ko_KR.euckr korean /screen This will create a database named literalkorean/literal that - uses the character set literalEUC_KR/literal. Another way to - accomplish this is to use this SQL command: + uses the character
Re: [HACKERS] WIP patch: Collation support
On Fri, Sep 19, 2008 at 10:13:43AM +0300, Heikki Linnakangas wrote: In addition to catalog changes, for finer grained collation you need changes in the planner and executor as well. In planner, to provide support for the concept of a sort order using a specific collation, and track that whenever sort order is handled. In executor, the capability to compare and sort using different locales at different times. Those changes seem a lot more difficult to me than adding a couple of system catalogs, which is pretty straightforward. To be honest, I think that's the easy part. When I submitted the collate patch a few years back, getting the sort order working was the easy part, precisely because COLLATE clauses can affect the generated plan in precisely one way: can you use index FOO or not. Once you're past that point you can throw almost all the collation information away. As you point out though, the shear volume of catalog changes required obscure that fact somewhat. It's not like the patch is going to disappear from planet Earth if it doesn't get committed for 8.4. It's still valuable and available when the new catalogs are needed. I just prefer it as it was because it takes care of a useful subset of the features people want in a way that is compatable for the future. Whereas the stripped down version, I'm not sure it gets us anywhere. Have a nice day, -- Martijn van Oosterhout [EMAIL PROTECTED] http://svana.org/kleptog/ Please line up in a tree and maintain the heap invariant while boarding. Thank you for flying nlogn airlines. signature.asc Description: Digital signature
Re: [HACKERS] WIP patch: Collation support
Gregory Stark wrote: Heikki Linnakangas [EMAIL PROTECTED] writes: Well, I proposed disallowing using a different collation than the source database, except for using template0 as the source. That's pretty limited, but is trivial to implement and still let's you have databases with different collations in the same cluster. + if (strcmp(dbtemplate, template0) != 0 + (strcmp(lc_collate, src_collation) || strcmp(lc_ctype, src_ctype))) + ereport(NOTICE, + (errmsg(database \%s\ needs to be reindexed manually (REINDEX DATABASE), + dbname))); + This isn't what you described but I think I prefer it this way as just a warning not an error. Well, I'd prefer to make it an error, but I'm willing to listen if others feel otherwise. I don't think the inconvenience of having to use template0 is that big, compared to the potential of strange behavior people would run into if they ignore the advice to reindex. One weakness with a straight strcmp comparison is that it won't recognize aliases of the same locale. For example, fi_FI.UTF8 and fi_FI.UTF-8. AFAIK we can't easily connect to the new database and do some fiddling with it, can we? If we could we could check if there are any non-empty indexes which depend on the collation and only print the warning if we find any (and even mark them invalid). I don't see that happening, unfortunately.. Attached is an updated version of the stripped-down patch. I've cleaned it up a bit, and added more sanity checks. Documentation is still missing and I haven't test it much. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com *** doc/src/sgml/ref/create_database.sgml --- doc/src/sgml/ref/create_database.sgml *** *** 24,29 CREATE DATABASE replaceable class=PARAMETERname/replaceable --- 24,31 [ [ WITH ] [ OWNER [=] replaceable class=parameterdbowner/replaceable ] [ TEMPLATE [=] replaceable class=parametertemplate/replaceable ] [ ENCODING [=] replaceable class=parameterencoding/replaceable ] +[ COLLATE [=] replaceable class=parametercollation/replaceable ] +[ CTYPE [=] replaceable class=parameterctype/replaceable ] [ TABLESPACE [=] replaceable class=parametertablespace/replaceable ] [ CONNECTION LIMIT [=] replaceable class=parameterconnlimit/replaceable ] ] /synopsis *** *** 113,118 CREATE DATABASE replaceable class=PARAMETERname/replaceable --- 115,136 /listitem /varlistentry varlistentry + termreplaceable class=parametercollation/replaceable/term + listitem +para + LC_COLLATE setting to use in the new database. XXX +/para + /listitem + /varlistentry + varlistentry + termreplaceable class=parameterctype/replaceable/term + listitem +para + LC_CTYPE setting to use in the new database. XXX +/para + /listitem + /varlistentry + varlistentry termreplaceable class=parametertablespace/replaceable/term listitem para *** src/backend/access/transam/xlog.c --- src/backend/access/transam/xlog.c *** *** 3847,3853 WriteControlFile(void) { int fd; char buffer[PG_CONTROL_SIZE]; /* need not be aligned */ - char *localeptr; /* * Initialize version and compatibility-check fields --- 3847,3852 *** *** 3876,3893 WriteControlFile(void) ControlFile-float4ByVal = FLOAT4PASSBYVAL; ControlFile-float8ByVal = FLOAT8PASSBYVAL; - ControlFile-localeBuflen = LOCALE_NAME_BUFLEN; - localeptr = setlocale(LC_COLLATE, NULL); - if (!localeptr) - ereport(PANIC, - (errmsg(invalid LC_COLLATE setting))); - StrNCpy(ControlFile-lc_collate, localeptr, LOCALE_NAME_BUFLEN); - localeptr = setlocale(LC_CTYPE, NULL); - if (!localeptr) - ereport(PANIC, - (errmsg(invalid LC_CTYPE setting))); - StrNCpy(ControlFile-lc_ctype, localeptr, LOCALE_NAME_BUFLEN); - /* Contents are protected with a CRC */ INIT_CRC32(ControlFile-crc); COMP_CRC32(ControlFile-crc, --- 3875,3880 *** *** 4126,4159 ReadControlFile(void) but the server was compiled without USE_FLOAT8_BYVAL.), errhint(It looks like you need to recompile or initdb.))); #endif - - if (ControlFile-localeBuflen != LOCALE_NAME_BUFLEN) - ereport(FATAL, - (errmsg(database files are incompatible with server), - errdetail(The database cluster was initialized with LOCALE_NAME_BUFLEN %d, - but the server was compiled with LOCALE_NAME_BUFLEN %d., - ControlFile-localeBuflen, LOCALE_NAME_BUFLEN), - errhint(It looks like you need to recompile or initdb.))); - if (pg_perm_setlocale(LC_COLLATE, ControlFile-lc_collate) == NULL) - ereport(FATAL, - (errmsg(database files are
Re: [HACKERS] WIP patch: Collation support
Martijn van Oosterhout wrote: On Wed, Sep 10, 2008 at 12:51:02PM +0300, Heikki Linnakangas wrote: Since the set of collations isn't exactly denumerable, we need some way to allow the user to specify the collation they want. The only collation PostgreSQL knows about is the C collation. Anything else is user-defined. Let's just use the name of the OS locale, like we do now. Having a pg_collation catalog just moves the problem elsewhere: we'd still need something in pg_collation to tie the collation to the OS locale. There's not a one-to-one mapping between collation and locale name. A locale name includes information about the charset and a collation may have paramters like case-sensetivity and pad-attribute which are not present in the locale name. You need a mapping anyway, which is what this table is for. Ideally, we would delegate the case-sensitivity and padding to the collation implementation (ie. OS setlocale() or ICU). That said, I don't think operating systems normally ship case-insensitive variants of locales by default, so I agree it would be nice if we could implement that ourselves. Still, we could identify case-sensitive locale names for example by a suffix, like en_GB.UTF8.case-insensitive. I agree we will eventually need a way to give shorthand names for collations, and a pg_collation catalog will then come handy. But that can wait until we have the basic infrastructure ready to support column and query-level collation. But that put us back where we started: every database having the same collation. We're trying to move away from that. Just reindex everything and be done with it. That's easier said than done, unfortunately. I don't see an alternative. Well, I proposed disallowing using a different collation than the source database, except for using template0 as the source. That's pretty limited, but is trivial to implement and still let's you have databases with different collations in the same cluster. I worked a bit on Radek's patch, stripping out all the pg_collate and pg_charset catalog changes and commands, leaving just the core functionality of database-level collations. It needs some cleanup and documentation, but something like this I'd like to commit in this commit fest. The new catalogs can wait until we have a real need for them. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com *** src/backend/access/transam/xlog.c --- src/backend/access/transam/xlog.c *** *** 3847,3853 WriteControlFile(void) { int fd; char buffer[PG_CONTROL_SIZE]; /* need not be aligned */ - char *localeptr; /* * Initialize version and compatibility-check fields --- 3847,3852 *** *** 3875,3893 WriteControlFile(void) #endif ControlFile-float4ByVal = FLOAT4PASSBYVAL; ControlFile-float8ByVal = FLOAT8PASSBYVAL; ! ! ControlFile-localeBuflen = LOCALE_NAME_BUFLEN; ! localeptr = setlocale(LC_COLLATE, NULL); ! if (!localeptr) ! ereport(PANIC, ! (errmsg(invalid LC_COLLATE setting))); ! StrNCpy(ControlFile-lc_collate, localeptr, LOCALE_NAME_BUFLEN); ! localeptr = setlocale(LC_CTYPE, NULL); ! if (!localeptr) ! ereport(PANIC, ! (errmsg(invalid LC_CTYPE setting))); ! StrNCpy(ControlFile-lc_ctype, localeptr, LOCALE_NAME_BUFLEN); ! /* Contents are protected with a CRC */ INIT_CRC32(ControlFile-crc); COMP_CRC32(ControlFile-crc, --- 3874,3880 #endif ControlFile-float4ByVal = FLOAT4PASSBYVAL; ControlFile-float8ByVal = FLOAT8PASSBYVAL; ! /* Contents are protected with a CRC */ INIT_CRC32(ControlFile-crc); COMP_CRC32(ControlFile-crc, *** *** 4126,4159 ReadControlFile(void) but the server was compiled without USE_FLOAT8_BYVAL.), errhint(It looks like you need to recompile or initdb.))); #endif - - if (ControlFile-localeBuflen != LOCALE_NAME_BUFLEN) - ereport(FATAL, - (errmsg(database files are incompatible with server), - errdetail(The database cluster was initialized with LOCALE_NAME_BUFLEN %d, - but the server was compiled with LOCALE_NAME_BUFLEN %d., - ControlFile-localeBuflen, LOCALE_NAME_BUFLEN), - errhint(It looks like you need to recompile or initdb.))); - if (pg_perm_setlocale(LC_COLLATE, ControlFile-lc_collate) == NULL) - ereport(FATAL, - (errmsg(database files are incompatible with operating system), - errdetail(The database cluster was initialized with LC_COLLATE \%s\, - which is not recognized by setlocale()., - ControlFile-lc_collate), - errhint(It looks like you need to initdb or install locale support.))); - if (pg_perm_setlocale(LC_CTYPE, ControlFile-lc_ctype) == NULL) - ereport(FATAL, - (errmsg(database files are incompatible with operating system), - errdetail(The database cluster was initialized with LC_CTYPE \%s\, - which is not recognized by setlocale()., - ControlFile-lc_ctype), - errhint(It looks like
Re: [HACKERS] WIP patch: Collation support
Heikki Linnakangas [EMAIL PROTECTED] writes: Well, I proposed disallowing using a different collation than the source database, except for using template0 as the source. That's pretty limited, but is trivial to implement and still let's you have databases with different collations in the same cluster. + if (strcmp(dbtemplate, template0) != 0 + (strcmp(lc_collate, src_collation) || strcmp(lc_ctype, src_ctype))) + ereport(NOTICE, + (errmsg(database \%s\ needs to be reindexed manually (REINDEX DATABASE), + dbname))); + This isn't what you described but I think I prefer it this way as just a warning not an error. I can easily imagine cases where the admin knows there are either no indexes or no data in their database or where they're perfectly happy to reindex. A flat prohibition seems annoying. That said it's worth noting that this would tie our hands with proposals like retail vacuum which risk database corruption if they fail to refind an index pointer for a tuple... Incidentally it seems like the warning should actually explain *why* it needs to be reindexed manually and perhaps what the consequences are until it is. AFAIK we can't easily connect to the new database and do some fiddling with it, can we? If we could we could check if there are any non-empty indexes which depend on the collation and only print the warning if we find any (and even mark them invalid). -- Gregory Stark EnterpriseDB http://www.enterprisedb.com Ask me about EnterpriseDB's Slony Replication support! -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] WIP patch: Collation support
Gregory Stark napsal(a): Heikki Linnakangas [EMAIL PROTECTED] writes: snip AFAIK we can't easily connect to the new database and do some fiddling with it, can we? If we could we could check if there are any non-empty indexes which depend on the collation and only print the warning if we find any (and even mark them invalid). Autovacum uses InitPostgres function to swith to another database. I'm not sure how much safe it is in create database command and when we are already switched we can reindex affected indexes. Zdenek -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] WIP patch: Collation support
Heikki Linnakangas napsal(a): Martijn van Oosterhout wrote: On Wed, Sep 10, 2008 at 12:51:02PM +0300, Heikki Linnakangas wrote: Since the set of collations isn't exactly denumerable, we need some way to allow the user to specify the collation they want. The only collation PostgreSQL knows about is the C collation. Anything else is user-defined. Let's just use the name of the OS locale, like we do now. Having a pg_collation catalog just moves the problem elsewhere: we'd still need something in pg_collation to tie the collation to the OS locale. There's not a one-to-one mapping between collation and locale name. A locale name includes information about the charset and a collation may have paramters like case-sensetivity and pad-attribute which are not present in the locale name. You need a mapping anyway, which is what this table is for. Ideally, we would delegate the case-sensitivity and padding to the collation implementation (ie. OS setlocale() or ICU). That said, I don't think operating systems normally ship case-insensitive variants of locales by default, so I agree it would be nice if we could implement that ourselves. Still, we could identify case-sensitive locale names for example by a suffix, like en_GB.UTF8.case-insensitive. The idea was to call to_upper (or to_lower) before case-sensitive collation processing. It is difficult to determine from suffix if it is sensitive or not. Zdenek PS: We can discuss it in Prato -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] WIP patch: Collation support
Zdenek Kotala escribió: Gregory Stark napsal(a): AFAIK we can't easily connect to the new database and do some fiddling with it, can we? If we could we could check if there are any non-empty indexes which depend on the collation and only print the warning if we find any (and even mark them invalid). Autovacum uses InitPostgres function to swith to another database. I'm not sure how much safe it is in create database command and when we are already switched we can reindex affected indexes. It's only supposed to work if you're not previously connected to any database. (Autovacuum never switches to another database; any particular worker only connects to a single database). -- Alvaro Herrerahttp://www.CommandPrompt.com/ PostgreSQL Replication, Consulting, Custom Development, 24x7 support -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] WIP patch: Collation support
Radek Strnad wrote: Progress so far: - created catalogs pg_collation a pg_charset which are filled with three standard collations - initdb changes rows called DEFAULT in both catalogs during the bki bootstrap phase with current system LC_COLLATE and LC_CTYPE or those set by command line. - new collations can be defined with command CREATE COLLATION collation name FOR character set specification FROM existing collation name [STRCOLFN fn name] [ pad characteristic ] [ case sensitive ] [ LCCOLLATE lc_collate ] [ LCCTYPE lc_ctype ] - because of pg_collation and pg_charset are catalogs individual for each database, if you want to create a database with collation other than specified, create it in template1 and then create database I have to wonder, is all that really necessary? The feature you're trying to implement is to support database-level collation at first, and perhaps column-level collation later. We don't need support for user-defined collations and charsets for that. If leave all that out of the patch for now, we'll have a much slimmer, and just as useful patch, implementing database-level collation. We can add those catalogs later if we need them, but I don't think there's much point in adding all that infrastructure if they just reflect the locales installed in the operating system. - when connecting to database, it retrieves locales from pg_database and sets them This is the real gist of this patch. Design functionality changes left: - move retrieveing collation from pg_database to pg_type I don't understand this item. What will you move? - get case sensitivity and pad characteristic working I feel we should leave this to the collation implementation. - when creating database with different collation than database cluster, the database has to be reindexed. Any idea how to do it? Function ReindexDatabase works only when database is opened. That's a tricky one. One idea is to prohibit choosing a different collation than the one in the template database, unless we know it's safe to do so without reindexing. The problem is that we don't know whether it's safe. A simple but limiting solution would be to require that the template database has the same collation as the database that's being created, except that template0 can always be used as template. template0 is safe, because there's no indexes on text columns there. Note that we already have the same problem with encodings. If you create a database with LATIN1 encoding, load it with data, and then use that as a template for a database with UTF-8 encoding, the text data will be incorrectly encoded. We should probably fix that too. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] WIP patch: Collation support
On Wed, Sep 10, 2008 at 11:29:14AM +0300, Heikki Linnakangas wrote: Radek Strnad wrote: - because of pg_collation and pg_charset are catalogs individual for each database, if you want to create a database with collation other than specified, create it in template1 and then create database I have to wonder, is all that really necessary? The feature you're trying to implement is to support database-level collation at first, and perhaps column-level collation later. We don't need support for user-defined collations and charsets for that. Since the set of collations isn't exactly denumerable, we need some way to allow the user to specify the collation they want. The only collation PostgreSQL knows about is the C collation. Anything else is user-defined. Design functionality changes left: - move retrieveing collation from pg_database to pg_type I don't understand this item. What will you move? Long term, the collation is a property of the type, but I agree, I'm not sure why this patch needs it. That's a tricky one. One idea is to prohibit choosing a different collation than the one in the template database, unless we know it's safe to do so without reindexing. But that put us back where we started: every database having the same collation. We're trying to move away from that. Just reindex everything and be done with it. Note that we already have the same problem with encodings. If you create a database with LATIN1 encoding, load it with data, and then use that as a template for a database with UTF-8 encoding, the text data will be incorrectly encoded. We should probably fix that too. I'd say forbid more than one encoding in a cluster, but that's just my opinion :) Have a nice day, -- Martijn van Oosterhout [EMAIL PROTECTED] http://svana.org/kleptog/ Please line up in a tree and maintain the heap invariant while boarding. Thank you for flying nlogn airlines. signature.asc Description: Digital signature
Re: [HACKERS] WIP patch: Collation support
Martijn van Oosterhout wrote: On Wed, Sep 10, 2008 at 11:29:14AM +0300, Heikki Linnakangas wrote: Radek Strnad wrote: - because of pg_collation and pg_charset are catalogs individual for each database, if you want to create a database with collation other than specified, create it in template1 and then create database I have to wonder, is all that really necessary? The feature you're trying to implement is to support database-level collation at first, and perhaps column-level collation later. We don't need support for user-defined collations and charsets for that. Since the set of collations isn't exactly denumerable, we need some way to allow the user to specify the collation they want. The only collation PostgreSQL knows about is the C collation. Anything else is user-defined. Let's just use the name of the OS locale, like we do now. Having a pg_collation catalog just moves the problem elsewhere: we'd still need something in pg_collation to tie the collation to the OS locale. Design functionality changes left: - move retrieveing collation from pg_database to pg_type I don't understand this item. What will you move? Long term, the collation is a property of the type, ... You might want to provide a default collation for a type as well, but the very finest grade is that you can specify collation for every (text) comparison operator in your query. Of course you don't want to do that for every query, which is why we should provide defaults at different levels: columns, tables, database. And perhaps types as well, but that's not the most interesting case. I'm not sure what the SQL spec says about that, but I believe it provides syntax and rules for all that. That's a tricky one. One idea is to prohibit choosing a different collation than the one in the template database, unless we know it's safe to do so without reindexing. But that put us back where we started: every database having the same collation. We're trying to move away from that. Just reindex everything and be done with it. That's easier said than done, unfortunately. Note that we already have the same problem with encodings. If you create a database with LATIN1 encoding, load it with data, and then use that as a template for a database with UTF-8 encoding, the text data will be incorrectly encoded. We should probably fix that too. I'd say forbid more than one encoding in a cluster, but that's just my opinion :) Yeah, that's pretty useless, at least without support for different locales on different databases. But might as well keep it unless there's a pressing reason to drop it. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] WIP patch: Collation support
Heikki Linnakangas napsal(a): Design functionality changes left: - move retrieveing collation from pg_database to pg_type The problem there is that pg_collation is local catalog, but pg_database is global catalog. IIRC, It was discussed during last commitfest. I think it is bad idea to make this kind of dependency. It seems to me better to implement something what we will use later to avoid useless and confusing dependency. Zdenek -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] WIP patch: Collation support
On Wed, Sep 10, 2008 at 12:51:02PM +0300, Heikki Linnakangas wrote: Since the set of collations isn't exactly denumerable, we need some way to allow the user to specify the collation they want. The only collation PostgreSQL knows about is the C collation. Anything else is user-defined. Let's just use the name of the OS locale, like we do now. Having a pg_collation catalog just moves the problem elsewhere: we'd still need something in pg_collation to tie the collation to the OS locale. There's not a one-to-one mapping between collation and locale name. A locale name includes information about the charset and a collation may have paramters like case-sensetivity and pad-attribute which are not present in the locale name. You need a mapping anyway, which is what this table is for. The difference in collation between CHAR() and VARCHAR() is the usual example here. Long term, the collation is a property of the type, ... I'm not sure what the SQL spec says about that, but I believe it provides syntax and rules for all that. The spec is quite detailed about and I posted code to do it years ago. The point is that we don't need to go that far with this patch. But that put us back where we started: every database having the same collation. We're trying to move away from that. Just reindex everything and be done with it. That's easier said than done, unfortunately. I don't see an alternative. Have a nice day, -- Martijn van Oosterhout [EMAIL PROTECTED] http://svana.org/kleptog/ Please line up in a tree and maintain the heap invariant while boarding. Thank you for flying nlogn airlines. signature.asc Description: Digital signature
Re: [HACKERS] WIP patch: Collation support
Radek Strnad wrote: - new collations can be defined with command CREATE COLLATION collation name FOR character set specification FROM existing collation name [STRCOLFN fn name] [ pad characteristic ] [ case sensitive ] [ LCCOLLATE lc_collate ] [ LCCTYPE lc_ctype ] How do you plan to make a collation case sensitive or accent sensitive? I have previously commented that this is not a realistic view on how collations work. Since you are apparently planning to use the system locales, I don't see how you can make this work. -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] WIP patch: Collation support
On Tue, Sep 02, 2008 at 02:50:47PM +0300, Peter Eisentraut wrote: Radek Strnad wrote: - new collations can be defined with command CREATE COLLATION collation name FOR character set specification FROM existing collation name [STRCOLFN fn name] [ pad characteristic ] [ case sensitive ] [ LCCOLLATE lc_collate ] [ LCCTYPE lc_ctype ] How do you plan to make a collation case sensitive or accent sensitive? I have previously commented that this is not a realistic view on how collations work. Since you are apparently planning to use the system locales, I don't see how you can make this work. While it's true POSIX locales don't handle this, other collation libraries do and we should support them if the user wants. Have a nice day, -- Martijn van Oosterhout [EMAIL PROTECTED] http://svana.org/kleptog/ Please line up in a tree and maintain the heap invariant while boarding. Thank you for flying nlogn airlines. signature.asc Description: Digital signature
Re: [HACKERS] WIP patch: Collation support
I think at least case sensitivity can be done by comparing two strings converted to upper case with toupper() function. Regards Radek Strnad On Tue, Sep 2, 2008 at 2:00 PM, Martijn van Oosterhout [EMAIL PROTECTED]wrote: On Tue, Sep 02, 2008 at 02:50:47PM +0300, Peter Eisentraut wrote: Radek Strnad wrote: - new collations can be defined with command CREATE COLLATION collation name FOR character set specification FROM existing collation name [STRCOLFN fn name] [ pad characteristic ] [ case sensitive ] [ LCCOLLATE lc_collate ] [ LCCTYPE lc_ctype ] How do you plan to make a collation case sensitive or accent sensitive? I have previously commented that this is not a realistic view on how collations work. Since you are apparently planning to use the system locales, I don't see how you can make this work. While it's true POSIX locales don't handle this, other collation libraries do and we should support them if the user wants. Have a nice day, -- Martijn van Oosterhout [EMAIL PROTECTED] http://svana.org/kleptog/ Please line up in a tree and maintain the heap invariant while boarding. Thank you for flying nlogn airlines. -BEGIN PGP SIGNATURE- Version: GnuPG v1.4.6 (GNU/Linux) iD8DBQFIvSrIIB7bNG8LQkwRAnkWAJ9FaiR9cOHFN2vkVmQaK5y7N9OJoQCbB+Ks e0E4722hY/Q+Cz8tpzA0CGs= =2Svh -END PGP SIGNATURE-
Re: [HACKERS] WIP patch: Collation support
Martijn van Oosterhout wrote: On Tue, Sep 02, 2008 at 02:50:47PM +0300, Peter Eisentraut wrote: Radek Strnad wrote: - new collations can be defined with command CREATE COLLATION collation name FOR character set specification FROM existing collation name [STRCOLFN fn name] [ pad characteristic ] [ case sensitive ] [ LCCOLLATE lc_collate ] [ LCCTYPE lc_ctype ] How do you plan to make a collation case sensitive or accent sensitive? I have previously commented that this is not a realistic view on how collations work. Since you are apparently planning to use the system locales, I don't see how you can make this work. While it's true POSIX locales don't handle this, other collation libraries do and we should support them if the user wants. Do they handle exactly those two attributes specifically? Can you point out references? Or do you mean, other collation libraries allow their collations to be configured/customized? I think linguistically it is a very narrow view of the world to hardcode those two attributes. -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] WIP patch: Collation support
On Tue, Sep 02, 2008 at 04:46:16PM +0300, Peter Eisentraut wrote: While it's true POSIX locales don't handle this, other collation libraries do and we should support them if the user wants. Do they handle exactly those two attributes specifically? Can you point out references? Or do you mean, other collation libraries allow their collations to be configured/customized? I think linguistically it is a very narrow view of the world to hardcode those two attributes. Well, yes. Accents and case are attributes of a character. (I'm using the unicode model here). So, to do a case insensetive match you take the characters, strip the attributes and then do the comparison. There are specialised routines which handle the denormalisation of the string for you so in theory you could even get specific about which accents you ignore. In practice I don't think people do that. Have a nice day, -- Martijn van Oosterhout [EMAIL PROTECTED] http://svana.org/kleptog/ Please line up in a tree and maintain the heap invariant while boarding. Thank you for flying nlogn airlines. signature.asc Description: Digital signature
Re: [HACKERS] WIP patch: Collation support
Martijn van Oosterhout [EMAIL PROTECTED] writes: On Tue, Sep 02, 2008 at 04:46:16PM +0300, Peter Eisentraut wrote: While it's true POSIX locales don't handle this, other collation libraries do and we should support them if the user wants. I think that's backwards. We have to go with the lowest common denominator functionality of those libraries if we're going to be portable. As long as it's a superset of the SQL standard functionality. If we support features of some of them that can't be emulated with others then users end up with SQL code that will only work on some builds and not others. That might be worth it for some features but I'm not sure this is one. Well, yes. Accents and case are attributes of a character. (I'm using the unicode model here). So, to do a case insensetive match you take the characters, strip the attributes and then do the comparison. There are specialised routines which handle the denormalisation of the string for you so in theory you could even get specific about which accents you ignore. In practice I don't think people do that. I don't think composable unicode characters are really about collations. I think it had more to do with representing glyphs in UTF32 before they gave up on that. Does anyone still use composable characters? Note that we don't currently support composable characters at all. I'm not sure if that's a nobody really cares issue or a bug we should aim to fix with real collation support. -- Gregory Stark EnterpriseDB http://www.enterprisedb.com Ask me about EnterpriseDB's Slony Replication support! -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] WIP patch: Collation support
On Tue, Sep 02, 2008 at 05:42:13PM +0100, Gregory Stark wrote: Martijn van Oosterhout [EMAIL PROTECTED] writes: On Tue, Sep 02, 2008 at 04:46:16PM +0300, Peter Eisentraut wrote: While it's true POSIX locales don't handle this, other collation libraries do and we should support them if the user wants. I think that's backwards. We have to go with the lowest common denominator functionality of those libraries if we're going to be portable. And I think that's backwards. Why can we only use a feature once every OS out there implements it? We still run on systems that don't have SSL support. LC_TYPE settings are not portable between systems, yet that doesn't bother anyone. Why should we have a problem with collate settings not being portable? I don't think composable unicode characters are really about collations. I think it had more to do with representing glyphs in UTF32 before they gave up on that. Does anyone still use composable characters? Lookup the various normalisations forms: http://en.wikipedia.org/wiki/Unicode_normalization In particular Normal Form D. Sure, composable characters have nothing to do with collation, but they provide a uniform way of doing accent insensetive collation. Note that we don't currently support composable characters at all. Any character which is an accent on a latin character is a decomposable character. And last I checked we supported those. Have a niceday, -- Martijn van Oosterhout [EMAIL PROTECTED] http://svana.org/kleptog/ Please line up in a tree and maintain the heap invariant while boarding. Thank you for flying nlogn airlines. signature.asc Description: Digital signature
Re: [HACKERS] WIP patch: Collation support
Radek Strnad escribió: Ok, so do you suggest to leave it with a notice reindex database or start to solve it somehow? I don't know. If there are two tasks that need the same treatment, it seems a safe conclusion that they need a common solution. -- Alvaro Herrerahttp://www.CommandPrompt.com/ PostgreSQL Replication, Consulting, Custom Development, 24x7 support -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] WIP patch: Collation support
Martijn van Oosterhout [EMAIL PROTECTED] writes: And I think that's backwards. Why can we only use a feature once every OS out there implements it? We still run on systems that don't have SSL support. LC_TYPE settings are not portable between systems, yet that doesn't bother anyone. Why should we have a problem with collate settings not being portable? If we're going to approach it that way, we need a syntax for CREATE COLLATION that doesn't hard-wire what the possible collation modifiers are. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] WIP patch: Collation support
Ok, so do you suggest to leave it with a notice reindex database or start to solve it somehow? Regards Radek Strnad On Mon, Sep 1, 2008 at 12:08 AM, Alvaro Herrera [EMAIL PROTECTED]wrote: Radek Strnad escribió: - when creating database with different collation than database cluster, the database has to be reindexed. Any idea how to do it? Function ReindexDatabase works only when database is opened. We have this Todo item: Set proper permissions on non-system schemas during db creation Currently all schemas are owned by the super-user because they are copied from the template1 database. However, since all objects are inherited from the template database, it is not clear that setting schemas to the db owner is correct. When this was discussed years ago, one proposed idea was that on the first connection the backend should, as a first task, ensure that certain administrative chores be done. The first of those was changing the ownership of schemas. It sounds like this reindexing you propose falls into the same category. -- Alvaro Herrera http://www.CommandPrompt.com/ PostgreSQL Replication, Consulting, Custom Development, 24x7 support
Re: [HACKERS] WIP patch: Collation support
Radek Strnad escribió: - when creating database with different collation than database cluster, the database has to be reindexed. Any idea how to do it? Function ReindexDatabase works only when database is opened. We have this Todo item: Set proper permissions on non-system schemas during db creation Currently all schemas are owned by the super-user because they are copied from the template1 database. However, since all objects are inherited from the template database, it is not clear that setting schemas to the db owner is correct. When this was discussed years ago, one proposed idea was that on the first connection the backend should, as a first task, ensure that certain administrative chores be done. The first of those was changing the ownership of schemas. It sounds like this reindexing you propose falls into the same category. -- Alvaro Herrerahttp://www.CommandPrompt.com/ PostgreSQL Replication, Consulting, Custom Development, 24x7 support -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers