Re: [HACKERS] Per-column collation, the finale
On tis, 2011-02-08 at 22:17 +, Thom Brown wrote:
> postgres=# create table meow (id serial, stuff text collate de_XX);
> NOTICE:  CREATE TABLE will create implicit sequence "meow_id_seq" for serial column "meow.id"
> ERROR:  collation "de_XX" for current database encoding "UTF8" does not exist
> LINE 1: create table meow (id serial, stuff text collate de_XX);
>
> I wouldn't expect to see that first notice.  Shouldn't that step come a bit later?

This isn't much different from writing

    create table meow (id serial, stuff nonsense);

You'll still get the notice before it errors out on type-not-found.

> A bit of weirdness: I'm allowed to specify more than one collation on a single column ordering...  Is this the same principle as casting, where they can be chained?  Which one wins in this case?

Yeah, last one wins.

> Also, if a locale is installed after initdb, is it then impossible to get pg_collate to pick up that new locale?

Currently, you can insert it yourself into pg_collation.  I have a CREATE COLLATION patch in the works.

> If a locale is somehow removed from the system, what happens on the database side when attempting to use a collated column?

Then you're hosed, but that has always been the case, with per-cluster and per-database locales.

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Per-column collation, the finale
On tor, 2011-02-03 at 18:36 -0500, Noah Misch wrote:
> Looks good and tests well.  I've attached the same benchmark script with updated timings, and I've marked the patch Ready for Committer.

Committed.  Thanks everyone.
Re: [HACKERS] Per-column collation, the finale
On 8 February 2011 21:08, Peter Eisentraut <pete...@gmx.net> wrote:
> On tor, 2011-02-03 at 18:36 -0500, Noah Misch wrote:
>> Looks good and tests well.  I've attached the same benchmark script with updated timings, and I've marked the patch Ready for Committer.
>
> Committed.  Thanks everyone.

Awesome work, Peter.  A few questions:

postgres=# create table meow (id serial, stuff text collate de_XX);
NOTICE:  CREATE TABLE will create implicit sequence "meow_id_seq" for serial column "meow.id"
ERROR:  collation "de_XX" for current database encoding "UTF8" does not exist
LINE 1: create table meow (id serial, stuff text collate de_XX);

I wouldn't expect to see that first notice.  Shouldn't that step come a bit later?

A bit of weirdness: I'm allowed to specify more than one collation on a single column ordering...

postgres=# select * from meow order by stuff collate en_GB collate de_DE desc;
 id | stuff
----+-------
  2 | meow2
  1 | meow
(2 rows)

Is this the same principle as casting, where they can be chained?  Which one wins in this case?  Although if anyone is actually doing this, then it's just silly anyway.  (says Thom, having just done it)

Also, if a locale is installed after initdb, is it then impossible to get pg_collate to pick up that new locale?  If a locale is somehow removed from the system, what happens on the database side when attempting to use a collated column?  (I don't wish to TIAS on my own system)

--
Thom Brown
Twitter: @darkixion
IRC (freenode): dark_ixion
Registered Linux user: #516935
Re: [HACKERS] Per-column collation, the finale
Noah Misch wrote:
> On Thu, Feb 03, 2011 at 05:53:28PM +0200, Peter Eisentraut wrote:
>> Here is a new patch.  The main change is in varstr_cmp(), avoiding the calls to pg_newlocale_from_collation() when the default locale is used.  This accounts for the performance regression in my tests.  It also addresses some of your refactoring ideas.
>
> Looks good and tests well.  I've attached the same benchmark script with updated timings, and I've marked the patch Ready for Committer.

Nice to see that the performance hit is removed now!

--
Bruce Momjian  <br...@momjian.us>  http://momjian.us
EnterpriseDB  http://enterprisedb.com

+ It's impossible for everything to be true. +
Re: [HACKERS] Per-column collation, the finale
On Thu, Feb 03, 2011 at 05:53:28PM +0200, Peter Eisentraut wrote:
> On tor, 2011-02-03 at 00:10 -0500, Noah Misch wrote:
>>> This is good stuff.  I'll send you a new patch in a day or three for perhaps another round of performance tests.  Some of the other issues above can perhaps be postponed for follow-up patches.
>> I agree -- if the performance-when-unused gets solid, none of my other comments ought to hold things up.
>
> Here is a new patch.  The main change is in varstr_cmp(), avoiding the calls to pg_newlocale_from_collation() when the default locale is used.  This accounts for the performance regression in my tests.  It also addresses some of your refactoring ideas.

Looks good and tests well.  I've attached the same benchmark script with updated timings, and I've marked the patch Ready for Committer.

nm

-- Setup --
-- SELECT setseed(0);
-- CREATE TABLE t (c) AS
-- SELECT chr(1 + (random() * 65534)::int) FROM generate_series(1,1000) gen(n);

-- NOTE: due to concurrent load, timings typically varied 1-2s between runs.
-- en_US.UTF8, unpatched: 68.75s
-- en_US.UTF8, patched:   67.95s
-- id_ID.UTF8, unpatched: 67.48s
-- id_ID.UTF8, patched:   69.34s
SELECT min(c)
-- en_US.UTF8, patched: 81.59s
-- id_ID.UTF8, patched: 84.59s
--SELECT min(c COLLATE "id_ID")
FROM (
	SELECT c FROM t UNION ALL
	SELECT c FROM t UNION ALL
	SELECT c FROM t UNION ALL
	SELECT c FROM t UNION ALL
	SELECT c FROM t UNION ALL
	SELECT c FROM t UNION ALL
	SELECT c FROM t UNION ALL
	SELECT c FROM t
) t_all;
Re: [HACKERS] Per-column collation, the finale
On lör, 2011-01-29 at 02:52 -0500, Noah Misch wrote:
> I'm reviewing this patch as part of CommitFest 2011-01.

Thank you very much for this thorough review.

> This no longer applies cleanly against git master, so I've done my testing against 52713d0.

I will send an updated patch soon, when I have figured out the issues discussed below.

> If you go about supporting non-GNU libc systems, you may want to look at the gnulib locale module[2] for a starting point.

I don't have a Mac system, but I'll see about figuring this out.

> The new documentation is helpful.  It would be useful to document the implications of marking your user-defined type COLLATABLE.  As best I can tell, you should only set COLLATABLE if functions that would be expected to respect collation, particularly members of a B-tree operator family, do check the fmgr collation.  Otherwise, the user might specify collations for data of your type and be surprised to get no differentiated behavior.  Are there other considerations?  Also, should implementers of COLLATABLE types observe any restrictions on when to respect the collation?  For example, is equality permitted to be sensitive to collation?

I improved the documentation in the CREATE TYPE man page about this.  Note, however, that the questions you raise are not new.  There already is a set of types that observe locale information in a set of associated functions.  The answers are already hidden somewhere; this feature just exposes them in a different way.  Over time, we may want to document more of this, but at the moment it's not clear how much to actually expose.

> On that note, the following is accepted; should it be?
>
>   CREATE TABLE parent (key text COLLATE "sv_SE");
>   CREATE UNIQUE INDEX ON parent(key COLLATE "id_ID");
>   CREATE TABLE child (key text COLLATE "tr_TR" REFERENCES parent(key));
>
> Does the system expect any invariants to hold on the platform implementation of collations?  For example, that they be consistent with equality in some way?  If so, would it be practical and/or helpful to provide a tool for testing a given platform locale for conformance?

We discussed this previously in this thread.  Currently, collation support does not affect equality.  Values are only equal if they are bytewise equal.  That's why the above is currently OK, but it should probably be restricted in the future.  I'll make a note about it.

> pg_attribute entries for an index currently retain the attcollation of the table columns.  They should have the collation of the index columns.  At that point, pg_index.indcollation would exist for optimization only.

Fixed.

> Though there's no reason it ought to happen with the first commit, it would be really excellent for "make check" to detect when the platform has sufficient support to run the collation tests (HAVE_LOCALE_T, UTF8, sv_SE and tr_TR locales).  Manual execution is fine for something like numeric_big, but this is so platform-sensitive that testing it widely is important.

I solicited ideas about this a while ago, but no one came up with a solution.  If we do get Mac or Windows support, it would be a good time to revisit this and devise a platform-independent approach, if possible.

> It would likewise be nice to have a way to display the collation of an arbitrary expression.  A merger of pg_typeof and format_type would cover that.

Agreed.  I have a follow-up patch pending for that.

> Four call sites gain code like this (example is from varstr_cmp):
>
> +#ifdef HAVE_LOCALE_T
> +	locale_t	mylocale = pg_newlocale_from_collation(collid);
> +#else
> +	check_default_collation(collid);
> +#endif
> ...
> +#ifdef HAVE_LOCALE_T
> +	result = strcoll_l(a1p, a2p, mylocale);
> +#else
> 	result = strcoll(a1p, a2p);
> +#endif
>
> Under !HAVE_LOCALE_T, we could define a locale_t, a pg_newlocale_from_collation that merely calls check_default_collation, #define strcoll_l(a, b, c) strcoll(a, b), etc.  I can see six TODO sites with similar needs, and the abstraction would make the mainline code significantly cleaner.  Even so, it might not pay off.  Thoughts?

I had considered that, but I'm mainly hesitant about introducing a fake locale_t type into the public namespace.  Maybe it should be wrapped into a pg_locale_t; then we can do whatever we want.

> Couldn't check_default_collation be a no-op except on assert-only builds?  It's currently only called in !HAVE_LOCALE_T branches, in which case initdb will not have added non-default collations.  CREATE COLLATION will presumably be made to fail unconditionally on such builds, too.

Unless you register the HAVE_LOCALE_T state in pg_control, you need to be prepared to deal with schemas that contain nondefault collations even though the server binary doesn't support it.

> wchar2char and char2wchar take a collation argument, but they only use it to Assert(!lc_ctype_is_c(collation)).  Maybe that's best, but it seems weird.

Yeah, it's
Re: [HACKERS] Per-column collation, the finale
On Wed, Feb 02, 2011 at 11:20:44PM +0200, Peter Eisentraut wrote:
> On lör, 2011-01-29 at 02:52 -0500, Noah Misch wrote:
>> The new documentation is helpful.  It would be useful to document the implications of marking your user-defined type COLLATABLE. [...]  Also, should implementers of COLLATABLE types observe any restrictions on when to respect the collation?  For example, is equality permitted to be sensitive to collation?
>
> I improved the documentation in the CREATE TYPE man page about this.  Note, however, that the questions you raise are not new.  There already is a set of types that observe locale information in a set of associated functions.  The answers are already hidden somewhere; this feature just exposes them in a different way.  Over time, we may want to document more of this, but at the moment it's not clear how much to actually expose.

That's a helpful way to think about it.

>> On that note, the following is accepted; should it be? [...]  Does the system expect any invariants to hold on the platform implementation of collations?  For example, that they be consistent with equality in some way?  If so, would it be practical and/or helpful to provide a tool for testing a given platform locale for conformance?
>
> We discussed this previously in this thread.  Currently, collation support does not affect equality.  Values are only equal if they are bytewise equal.  That's why the above is currently OK, but it should probably be restricted in the future.  I'll make a note about it.

I see that this is true for texteq, bpchareq and citext_eq.  However, nothing stops a user from creating a COLLATABLE type and making the = operator for its B-tree operator class sensitive to collation, right?  Could that break any system assumptions?  That is, do we need to document that user-defined hash and B-tree operator classes must ignore locale in their = operators?

>> Though there's no reason it ought to happen with the first commit, it would be really excellent for "make check" to detect when the platform has sufficient support to run the collation tests (HAVE_LOCALE_T, UTF8, sv_SE and tr_TR locales). [...]
>
> I solicited ideas about this a while ago, but no one came up with a solution.  If we do get Mac or Windows support, it would be a good time to revisit this and devise a platform-independent approach, if possible.

As a rough idea, we could introduce the notion of a "skip condition" SQL command for a pg_regress test file.  When the command throws an error, pg_regress skips that file.  For these tests, the skip condition would resemble:

  SELECT 'foo' COLLATE "sv_SE", 'bar' COLLATE "tr_TR"

>> Under !HAVE_LOCALE_T, we could define a locale_t, a pg_newlocale_from_collation that merely calls check_default_collation, #define strcoll_l(a, b, c) strcoll(a, b), etc.  I can see six TODO sites with similar needs, and the abstraction would make the mainline code significantly cleaner.  Even so, it might not pay off.  Thoughts?
>
> I had considered that, but I'm mainly hesitant about introducing a fake locale_t type into the public namespace.  Maybe it should be wrapped into a pg_locale_t; then we can do whatever we want.

True.  There are advantages and disadvantages to both.  No strong opinion here.

>> Couldn't check_default_collation be a no-op except on assert-only builds?  It's currently only called in !HAVE_LOCALE_T branches, in which case initdb will not have added non-default collations.  CREATE COLLATION will presumably be made to fail unconditionally on such builds, too.
>
> Unless you register the HAVE_LOCALE_T state in pg_control, you need to be prepared to deal with schemas that contain nondefault collations even though the server binary doesn't support it.

Makes sense.

[local] test=# SELECT prosrc COLLATE "id_ID", count(*) FROM pg_proc GROUP BY prosrc;  -- worked
...
[local] test=# SELECT prosrc
Re: [HACKERS] Per-column collation, the finale
On Sat, Jan 29, 2011 at 2:52 AM, Noah Misch <n...@leadboat.com> wrote:
> The 13% slowdown with the feature unused seems worrisome

Very worrisome.  This is a frequently-requested feature, but this seems like a potential show-stopper.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: [HACKERS] Per-column collation, the finale
Hi Peter,

I'm reviewing this patch as part of CommitFest 2011-01.

On Fri, Jan 14, 2011 at 11:41:46PM +0200, Peter Eisentraut wrote:
> I've been going over this patch with a fine-tooth comb for the last two weeks, fixed some small bugs, added comments, made initdb a little friendlier, but no substantial changes.  I'm going to start working on SQL-level CREATE/DROP/ALTER COLLATION support and associated things now.

This no longer applies cleanly against git master, so I've done my testing against 52713d0.  I tested on Ubuntu 8.04 (GNU libc 2.7) and used the configure.in change I sent earlier[1] to get locale support detected.  Since the time to seriously verify every patch hunk would be measured in weeks, I have not attempted that level of review detail.  I've just commented on what stood out while reading and using the patch.

If you go about supporting non-GNU libc systems, you may want to look at the gnulib locale module[2] for a starting point.

The new documentation is helpful.  It would be useful to document the implications of marking your user-defined type COLLATABLE.  As best I can tell, you should only set COLLATABLE if functions that would be expected to respect collation, particularly members of a B-tree operator family, do check the fmgr collation.  Otherwise, the user might specify collations for data of your type and be surprised to get no differentiated behavior.  Are there other considerations?  Also, should implementers of COLLATABLE types observe any restrictions on when to respect the collation?  For example, is equality permitted to be sensitive to collation?

On that note, the following is accepted; should it be?

  CREATE TABLE parent (key text COLLATE "sv_SE");
  CREATE UNIQUE INDEX ON parent(key COLLATE "id_ID");
  CREATE TABLE child (key text COLLATE "tr_TR" REFERENCES parent(key));

Does the system expect any invariants to hold on the platform implementation of collations?  For example, that they be consistent with equality in some way?  If so, would it be practical and/or helpful to provide a tool for testing a given platform locale for conformance?

pg_attribute entries for an index currently retain the attcollation of the table columns.  They should have the collation of the index columns.  At that point, pg_index.indcollation would exist for optimization only.

Though there's no reason it ought to happen with the first commit, it would be really excellent for "make check" to detect when the platform has sufficient support to run the collation tests (HAVE_LOCALE_T, UTF8, sv_SE and tr_TR locales).  Manual execution is fine for something like numeric_big, but this is so platform-sensitive that testing it widely is important.

It would likewise be nice to have a way to display the collation of an arbitrary expression.  A merger of pg_typeof and format_type would cover that.

Four call sites gain code like this (example is from varstr_cmp):

+#ifdef HAVE_LOCALE_T
+	locale_t	mylocale = pg_newlocale_from_collation(collid);
+#else
+	check_default_collation(collid);
+#endif
...
+#ifdef HAVE_LOCALE_T
+	result = strcoll_l(a1p, a2p, mylocale);
+#else
 	result = strcoll(a1p, a2p);
+#endif

Under !HAVE_LOCALE_T, we could define a locale_t, a pg_newlocale_from_collation that merely calls check_default_collation, #define strcoll_l(a, b, c) strcoll(a, b), etc.  I can see six TODO sites with similar needs, and the abstraction would make the mainline code significantly cleaner.  Even so, it might not pay off.  Thoughts?

Couldn't check_default_collation be a no-op except on assert-only builds?  It's currently only called in !HAVE_LOCALE_T branches, in which case initdb will not have added non-default collations.  CREATE COLLATION will presumably be made to fail unconditionally on such builds, too.

wchar2char and char2wchar take a collation argument, but they only use it to Assert(!lc_ctype_is_c(collation)).  Maybe that's best, but it seems weird.

The documentation for pg_type.typcollation marks it as a bool, but it's an oid.  The documentation also says "An index can only support one collation.", but it's one collation per column.

This surprised me; not sure how much can/should be done:

[local] test=# SELECT prosrc COLLATE "id_ID", count(*) FROM pg_proc GROUP BY 1 COLLATE "id_ID";
ERROR:  collations are not supported by type integer
[local] test=# SELECT prosrc COLLATE "id_ID", count(*) FROM pg_proc GROUP BY prosrc COLLATE "id_ID";  -- worked
...

This too:

[local] test=# SELECT prosrc COLLATE "id_ID", count(*) FROM pg_proc GROUP BY prosrc;  -- worked
...
[local] test=# SELECT prosrc COLLATE "id_ID", count(*) FROM pg_proc GROUP BY prosrc COLLATE "id_ID";  -- worked
...
[local] test=# SELECT prosrc, count(*) FROM pg_proc GROUP BY prosrc COLLATE "id_ID";
ERROR:  column "pg_proc.prosrc" must appear in the GROUP BY clause or be used in an aggregate function
LINE 1: SELECT prosrc, count(*) FROM pg_proc GROUP BY
Re: [HACKERS] Per-column collation, the finale
On tis, 2011-01-25 at 10:14 +0900, Itagaki Takahiro wrote:
> On Sat, Jan 15, 2011 at 06:41, Peter Eisentraut <pete...@gmx.net> wrote:
>> I've been going over this patch with a fine-tooth comb for the last two weeks, fixed some small bugs, added comments, made initdb a little friendlier, but no substantial changes.
>
> What can I do to test the patch?  No regression tests are included in it,

Please refer to the regress.sgml hunk about running the test.

> and I have an almost empty pg_collation catalog with it:
>
> =# SELECT * FROM pg_collation;
>  collname | collnamespace | collencoding | collcollate | collctype
> ----------+---------------+--------------+-------------+-----------
>  default  |            11 |            0 |             |
> (1 row)

The initdb output should say something about how it got there.
Re: [HACKERS] Per-column collation, the finale
On Wed, Jan 26, 2011 at 12:35:07AM +0200, Peter Eisentraut wrote:
> On tis, 2011-01-25 at 10:14 +0900, Itagaki Takahiro wrote:
>> and I have an almost empty pg_collation catalog with it:
>>
>> =# SELECT * FROM pg_collation;
>>  collname | collnamespace | collencoding | collcollate | collctype
>> ----------+---------------+--------------+-------------+-----------
>>  default  |            11 |            0 |             |
>> (1 row)
>
> The initdb output should say something about how it got there.

FWIW, I tried and had the same problem.  initdb gave:

  creating collations ... not supported on this platform

configure was failing to detect locale_t for me, and this fixed it:

*** a/configure.in
--- b/configure.in
***************
*** 1116,1122 ****
  AC_TYPE_INTPTR_T
  AC_TYPE_UINTPTR_T
  AC_TYPE_LONG_LONG_INT
! AC_CHECK_TYPES([locale_t], [#include <locale.h>])
  
  AC_CHECK_TYPES([struct cmsgcred, struct fcred, struct sockcred], [], [],
--- 1116,1122 ----
  AC_TYPE_INTPTR_T
  AC_TYPE_UINTPTR_T
  AC_TYPE_LONG_LONG_INT
! AC_CHECK_TYPES([locale_t], [], [], [#include <locale.h>])
  
  AC_CHECK_TYPES([struct cmsgcred, struct fcred, struct sockcred], [], [],
Re: [HACKERS] Per-column collation, the finale
On Sat, Jan 15, 2011 at 06:41, Peter Eisentraut <pete...@gmx.net> wrote:
> I've been going over this patch with a fine-tooth comb for the last two weeks, fixed some small bugs, added comments, made initdb a little friendlier, but no substantial changes.

What can I do to test the patch?  No regression tests are included in it, and I have an almost empty pg_collation catalog with it:

=# SELECT * FROM pg_collation;
 collname | collnamespace | collencoding | collcollate | collctype
----------+---------------+--------------+-------------+-----------
 default  |            11 |            0 |             |
(1 row)

--
Itagaki Takahiro
Re: [HACKERS] Per-column collation, the finale
On Fri, Jan 14, 2011 at 4:41 PM, Peter Eisentraut <pete...@gmx.net> wrote:
> I've been going over this patch with a fine-tooth comb for the last two weeks, fixed some small bugs, added comments, made initdb a little friendlier, but no substantial changes.  I'm going to start working on SQL-level CREATE/DROP/ALTER COLLATION support and associated things now.

Maybe I'm all wet here, but isn't it time to commit what you've got and punt the things that aren't done to 9.2?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: [HACKERS] Per-column collation, the finale
On 14-01-2011 20:47, Robert Haas wrote:
> On Fri, Jan 14, 2011 at 4:41 PM, Peter Eisentraut <pete...@gmx.net> wrote:
>> I've been going over this patch with a fine-tooth comb for the last two weeks, fixed some small bugs, added comments, made initdb a little friendlier, but no substantial changes.  I'm going to start working on SQL-level CREATE/DROP/ALTER COLLATION support and associated things now.
>
> Maybe I'm all wet here, but isn't it time to commit what you've got and punt the things that aren't done to 9.2?

I think Peter wants another person to take a look at his patch.  I personally would like to eyeball his patch (but it will be during the week).

--
Euler Taveira de Oliveira
http://www.timbira.com/
Re: [HACKERS] Per-column collation
Robert Haas wrote:
> I don't really have a position on whether or not this patch is ready to commit... but I do think that this is the sort of patch that is very likely to have some bugs almost no matter when we commit it

I just updated the CF app to track Peter's latest update, which remains untested by anyone else as to whether it fixes all the issues brought up.  It would be nice to get a re-review to confirm things are still working in advance of CF 2011-01.  Given that the reviewer here is also a committer, it's possible this can go into the tree even outside of the regular CF schedule, if everything checks out.

In the interest of closing out this CF, I'm marking this one as returned for now.  That doesn't prevent it from going in anyway once it's confirmed ready, and I agree the sooner the better to help find breakage.  But I don't think that's so important that it should block the critical path for the next alpha.

--
Greg Smith   2ndQuadrant US   g...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services and Support   www.2ndQuadrant.us
Re: [HACKERS] Per-column collation
On Thu, Dec 16, 2010 at 19:37, Greg Smith <g...@2ndquadrant.com> wrote:
> I just updated the CF app to track Peter's latest update, which remains untested by anyone else as to whether it fixes all the issues brought up.  It would be nice to get a re-review to confirm things are still working in advance of CF 2011-01.

Sorry, I don't have enough time to review it.  Also, multiple reviewers using different kinds of settings would be preferable for such a feature.

# We might need "previous reviewers" and "active reviewers" in the commit-fest
# app.  Or should non-active reviewers delete their names?

I tested it at the program level, but a code-level review is still needed.  I'm worried about whether exhaustive checks are required in the COLLATE hand-off from function to function.  If so, the review will be harder than if the framework guaranteed the hand-off.

--
Itagaki Takahiro
Re: [HACKERS] Per-column collation
Itagaki Takahiro wrote:
> We might need "previous reviewers" and "active reviewers" in the commit-fest app.  Or should non-active reviewers delete their names?

This is only really an issue with patches that get moved from one CF to the next, which doesn't happen that often.  Patches that are marked "Returned With Feedback" instead get a new entry in the next CF, which avoids this problem.

--
Greg Smith   2ndQuadrant US   g...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services and Support   www.2ndQuadrant.us
"PostgreSQL 9.0 High Performance": http://www.2ndQuadrant.com/books
Re: [HACKERS] Per-column collation
On Sun, Dec 12, 2010 at 5:15 PM, Peter Eisentraut <pete...@gmx.net> wrote:
> On lör, 2010-12-04 at 18:04 +0200, Peter Eisentraut wrote:
>> Here is an updated patch to address the issues discussed during this commitfest.
>
> And another one, that fixes the problems pointed out since.

I don't really have a position on whether or not this patch is ready to commit... but I do think that this is the sort of patch that is very likely to have some bugs almost no matter when we commit it.  Therefore, I would argue that we ought not to let the process of getting it committed drag out too long.  Even if we find 3 more bugs, we might have found 10 more bugs if it had been in the tree for the same time period.  Of course, any conceptual issues must be worked out before commit.  But there are likely to be some loose ends however we do it.  So when we think it's pretty close, we should move forward.  All IMHO, of course.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: [HACKERS] Per-column collation
On tis, 2010-12-07 at 11:46 +0900, Itagaki Takahiro wrote:
> On Sun, Dec 5, 2010 at 01:04, Peter Eisentraut <pete...@gmx.net> wrote:
>> Here is an updated patch to address the issues discussed during this commitfest.
>
> I found another issue in the patch; ILIKE in a WHERE clause doesn't work.  It was surprising because LIKE in a WHERE clause and ILIKE in a SELECT list work as expected.
>   - SELECT * FROM pg_class WHERE relname LIKE 'pg%'
>   - SELECT relname ILIKE 'pg%' FROM pg_class;
>
> postgres=# SELECT name, setting FROM pg_settings WHERE name IN ('lc_ctype', 'lc_collate', 'server_encoding');
>       name       | setting
> -----------------+---------
>  lc_collate      | C
>  lc_ctype        | C
>  server_encoding | UTF8
> (3 rows)
>
> postgres=# SELECT * FROM pg_class WHERE relname ILIKE 'pg%';
> ERROR:  no collation was derived

This is fixed in the 20101213 patch I'm about to send out.
Re: [HACKERS] Per-column collation
On mån, 2010-12-06 at 21:26 +0200, Peter Eisentraut wrote:
>> * contrib/citext raises an encoding error when COLLATE is specified, even if it is the same collation as the database default.  We might need some special treatment for the C locale.
>>
>>   =# SHOW lc_collate;  == C
>>   =# SELECT ('A'::citext) = ('a'::citext);  == false
>>   =# SELECT ('A'::citext) = ('a'::citext) COLLATE "C";
>>   ERROR:  invalid multibyte character for locale
>>   HINT:  The server's LC_CTYPE locale is probably incompatible with the database encoding.
>
> OK, I can reproduce that.  That's fallout from the lc_ctype_is_c() optimization that I removed, as explained in another email.  I'll have to think about that again.

This is fixed in the 20101213 patch I'm about to send out.
Re: [HACKERS] Per-column collation
On Sun, Dec 5, 2010 at 01:04, Peter Eisentraut pete...@gmx.net wrote: Here is an updated patch to address the issues discussed during this commitfest. Here are comments and questions after I tested the latest patch: Issues * initdb itself seems to succeed, but it prints could not determine encoding for locale messages for any combination of encoding=utf8/eucjp and locale=ja_JP.utf8/ja_JP.eucjp/C. Is this the expected behavior? creating collations ...initdb: locale name has non-ASCII characters, skipped: bokmål initdb: locale name has non-ASCII characters, skipped: français could not determine encoding for locale hy_AM.armscii8: codeset is ARMSCII-8 ... (a dozen lines) ... could not determine encoding for locale vi_VN.tcvn: codeset is TCVN5712-1 ok * contrib/citext raises an encoding error when COLLATE is specified even if it is the same collation as the database default. We might need some special treatment for the C locale. =# SHOW lc_collate; == C =# SELECT ('A'::citext) = ('a'::citext); == false =# SELECT ('A'::citext) = ('a'::citext) COLLATE C; ERROR: invalid multibyte character for locale HINT: The server's LC_CTYPE locale is probably incompatible with the database encoding. * pg_dump would generate unportable files for different platforms because collation names are platform-specific. Source code * PG_GETARG_COLLATION might be a better name than PG_GET_COLLATION. * What is the difference between InvalidOid and DEFAULT_COLLATION_OID for collation oids? The patch replaces DirectFunctionCall with DirectFunctionCallC in some places, but we could shrink the diff size if we could use InvalidOid instead of DEFAULT_COLLATION_OID. * I still think explicit passing of collations from function to function is horrible, because we might forget it in some places, and almost no existing third-party module will work. Is it possible to make it a global variable, and push/pop the state when it changes? Sorry if I'm missing something, but I think we could treat the collation setting like GUC settings.
-- Itagaki Takahiro
Re: [HACKERS] Per-column collation
On Dec 6, 2010, at 4:06 AM, Itagaki Takahiro wrote: * contrib/citext raises an encoding error when COLLATE is specified even if it is the same collation as the database default. We might need some special treatment for the C locale. I've been wondering if this patch will support case-insensitive collations. If so, then citext should probably be revised to use one. Best, David
Re: [HACKERS] Per-column collation
2010/12/6 David E. Wheeler da...@kineticode.com: On Dec 6, 2010, at 4:06 AM, Itagaki Takahiro wrote: * contrib/citext raises an encoding error when COLLATE is specified even if it is the same collation as the database default. We might need some special treatment for the C locale. I've been wondering if this patch will support case-insensitive collations. If so, then citext should probably be revised to use one. As far as I know, no. It supports only system-based collations. Pavel Best, David
Re: [HACKERS] Per-column collation
On mån, 2010-12-06 at 21:06 +0900, Itagaki Takahiro wrote: On Sun, Dec 5, 2010 at 01:04, Peter Eisentraut pete...@gmx.net wrote: Here is an updated patch to address the issues discussed during this commitfest. Here are comments and questions after I tested the latest patch: Issues * initdb itself seems to succeed, but it prints could not determine encoding for locale messages for any combination of encoding=utf8/eucjp and locale=ja_JP.utf8/ja_JP.eucjp/C. Is this the expected behavior? creating collations ...initdb: locale name has non-ASCII characters, skipped: bokmål initdb: locale name has non-ASCII characters, skipped: français could not determine encoding for locale hy_AM.armscii8: codeset is ARMSCII-8 ... (a dozen lines) ... could not determine encoding for locale vi_VN.tcvn: codeset is TCVN5712-1 ok What this does is take the output of locale -a and populate the pg_collation catalog with the locales it finds. When it finds an operating system locale that uses an encoding that is not recognized, you will see this warning. I understand that that would probably annoy users. We could hide the warning and silently skip those locales. But then could that hide genuine configuration problems? * contrib/citext raises an encoding error when COLLATE is specified even if it is the same collation as the database default. We might need some special treatment for the C locale. =# SHOW lc_collate; == C =# SELECT ('A'::citext) = ('a'::citext); == false =# SELECT ('A'::citext) = ('a'::citext) COLLATE C; ERROR: invalid multibyte character for locale HINT: The server's LC_CTYPE locale is probably incompatible with the database encoding. OK, I can reproduce that. That's fallout from the lc_ctype_is_c() optimization that I removed, as explained in another email. I'll have to think about that again.
* pg_dump would generate unportable files for different platforms because collation names are platform-specific. pg_dump can already produce unportable files for a number of other reasons, including per-database locale, tablespaces, and OS-dependent configuration settings. The way I imagine this working is that someone who wants to design a genuinely portable application using this feature would create their own collation based on the existing, OS-specific collation (using a to-be-added CREATE COLLATION command). As mentioned earlier, however, we can't actually solve the problem that the OS locales may not behave the same across systems. Source code * PG_GETARG_COLLATION might be a better name than PG_GET_COLLATION. It's not the collation of a function argument, it's the collation of a function call. (You could conceivably also fetch the collation of a function argument, but that isn't used in any way.) * What is the difference between InvalidOid and DEFAULT_COLLATION_OID for collation oids? The patch replaces DirectFunctionCall with DirectFunctionCallC in some places, but we could shrink the diff size if we could use InvalidOid instead of DEFAULT_COLLATION_OID. Think of DEFAULT_COLLATION_OID as analogous to UNKNOWNOID. A long time ago we used InvalidOid for all kinds of types, including unknown, pseudotypes, cstring, and no type at all. The reason we changed this was that this masked errors and made processing of the unknown type difficult/impossible. I know this makes the code bigger, but it's necessary. I originally coded the patch using InvalidOid for everything, but that wasn't very robust. This also ties into the next question ... * I still think explicit passing of collations from function to function is horrible, because we might forget it in some places, and almost no existing third-party module will work. Is it possible to make it a global variable, and push/pop the state when it changes?
Sorry if I'm missing something, but I think we could treat the collation setting like GUC settings. A collation is a property of a datum or an expression. You might as well argue that we don't keep track of types of expressions and instead store it globally. Doesn't make sense. Extensions are not required to support collations. Those that might want to will usually end up calling one of the locale-enabled functions such as varstr_cmp(), and there the function prototype will ensure that specifying a collation cannot be missed. Additionally, the distinction between InvalidOid and DEFAULT_COLLATION_OID does a great deal to ensure that in case a collation is unspecified or missing in some new code, you will get a proper error message instead of unspecified behavior.
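[Editor's note] The initdb step discussed above - harvest the output of locale -a, skip names that can't become SQL identifiers, and warn about unrecognized codesets - can be sketched roughly as below. All names and the codeset table here are illustrative stand-ins, not PostgreSQL's actual code:

```python
# Rough sketch of initdb-style filtering of "locale -a" entries, as
# described in the thread.  KNOWN_CODESETS is an illustrative stand-in
# for a real codeset-to-encoding table.
KNOWN_CODESETS = {"utf8": "UTF8", "utf-8": "UTF8", "iso88592": "LATIN2"}

def classify(locale_name):
    """Decide what to do with one "locale -a" entry."""
    if not locale_name.isascii():
        # e.g. the "bokmål"/"français" aliases: skip with a notice
        return ("skip_non_ascii", None)
    if "." in locale_name:
        codeset = locale_name.split(".", 1)[1].lower().replace("-", "")
        encoding = KNOWN_CODESETS.get(codeset)
        if encoding is None:
            # e.g. hy_AM.armscii8: warn "could not determine encoding"
            return ("unknown_codeset", codeset)
        return ("usable", encoding)
    # No codeset suffix: assume it works with the database encoding
    return ("usable", None)

for name in ["cs_CZ.utf8", "français", "hy_AM.armscii8", "C"]:
    print(name, "->", classify(name))
```

This also shows why hiding the warnings is a judgment call: the "unknown_codeset" branch is exactly where a genuine misconfiguration would otherwise surface.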
Re: [HACKERS] Per-column collation
On mån, 2010-12-06 at 10:01 -0800, David E. Wheeler wrote: I've been wondering if this patch will support case-insensitive collations. If so, then citext should probably be revised to use one. This has been touched upon several times during the discussions of past patches. Essentially, the current patch only arranges that you can specify a sort order for data. The system always breaks ties using a binary comparison. This could conceivably be changed, but it's a separate problem. Some of the necessary investigation work has presumably already been done in the context of citext.
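[Editor's note] The behaviour Peter describes - the collation only determines sort order, and ties are always broken with a binary comparison - can be illustrated with a small sketch. Here casefold() is merely a stand-in for a real collation transform such as strxfrm():

```python
# Sketch: a collation-like key orders the data, and the raw bytes
# break ties, so two distinct strings never compare as equal even if
# the "collation" considers them equivalent.
def sort_key(s):
    # casefold() is an illustrative stand-in for a collation transform;
    # the encoded byte string is the binary tie-breaker.
    return (s.casefold(), s.encode("utf-8"))

rows = ["b", "A", "a", "B"]
print(sorted(rows, key=sort_key))  # ['A', 'a', 'B', 'b']
```

This is why, without further changes, such a scheme cannot give citext-style case-insensitive equality: "A" and "a" sort together but never compare equal.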
Re: [HACKERS] Per-column collation
On Dec 6, 2010, at 11:29 AM, Peter Eisentraut wrote: This has been touched upon several times during the discussions of past patches. Essentially, the current patch only arranges that you can specify a sort order for data. The system always breaks ties using a binary comparison. This could conceivably be changed, but it's a separate problem. Some of the necessary investigation work has presumably already been done in the context of citext. Okay, thanks, good to know. Best, David
Re: [HACKERS] Per-column collation
Please, it would be very important to us that LIKE worked with the Brazilian collation, and if possible case-insensitively and accent-insensitively. Thanks Alexandre Riveira Brazil Peter Eisentraut wrote: On mån, 2010-12-06 at 10:01 -0800, David E. Wheeler wrote: I've been wondering if this patch will support case-insensitive collations. If so, then citext should probably be revised to use one. This has been touched upon several times during the discussions of past patches. Essentially, the current patch only arranges that you can specify a sort order for data. The system always breaks ties using a binary comparison. This could conceivably be changed, but it's a separate problem. Some of the necessary investigation work has presumably already been done in the context of citext.
Re: [HACKERS] Per-column collation
On Sun, Dec 5, 2010 at 01:04, Peter Eisentraut pete...@gmx.net wrote: Here is an updated patch to address the issues discussed during this commitfest. I found another issue in the patch; ILIKE in a WHERE clause doesn't work. It was surprising because LIKE in a WHERE clause and ILIKE in a SELECT list work as expected. - SELECT * FROM pg_class WHERE relname LIKE 'pg%' - SELECT relname ILIKE 'pg%' FROM pg_class; postgres=# SELECT name, setting FROM pg_settings WHERE name IN ('lc_ctype', 'lc_collate', 'server_encoding'); name | setting -+- lc_collate | C lc_ctype | C server_encoding | UTF8 (3 rows) postgres=# SELECT * FROM pg_class WHERE relname ILIKE 'pg%'; ERROR: no collation was derived -- Itagaki Takahiro
Re: [HACKERS] Per-column collation
On tor, 2010-11-18 at 21:37 +0200, Heikki Linnakangas wrote: On 15.11.2010 21:42, Peter Eisentraut wrote: On mån, 2010-11-15 at 11:34 +0100, Pavel Stehule wrote: I am checking a patch. I found a problem with initdb Ah, late night brain farts, it appears. Here is a corrected version. Some random comments: In syntax.sgml: +The literalCOLLATE/literal clause overrides the collation of +an expression. It is appended to the expression at applies to: That last sentence doesn't parse. Fixed in the patch I'm about to send out. Would it be possible to eliminate the ExecEvalCollateClause function somehow? It just calls through the argument. How about directly returning the argument ExprState in ExecInitExpr? This currently mirrors the handling of RelabelType. Perhaps both could be improved, but that is an independent issue. get_collation_name() returns the plain name without schema, so it's not good enough for use in ruleutils.c. pg_dump is also ignoring collation's schema. Fixed schema handling in ruleutils and pg_dump in new patch.
Re: [HACKERS] Per-column collation
On ons, 2010-11-24 at 22:22 +0200, Peter Eisentraut wrote: On mån, 2010-11-22 at 11:58 +0900, Itagaki Takahiro wrote: * Did you see any performance regression by collation? I found a bug in lc_collate_is_c(); result >= 0 should be checked before any other checks. SearchSysCache1() here would be a performance regression. That code turned out to be buggy anyway, because it was using the result cache variable independent of the collation parameter. Since I don't have a short-term solution for this, I have ripped out the caching of C-ness for nondefault locales. I did some profiling with this now. The problem is that this function lc_collate_is_c() would need to cache the C-ness property for any number of collations. Depending on what call pattern you expect or want to optimize for, you might end up caching most of the pg_collation catalog, which is actually the mandate of SearchSysCache, but the profile shows that SearchSysCache takes a large chunk of the additional run time. If I remove that branch altogether, that is, don't treat the C locale specially at all in the nondefault collation case, then using non-C locales as a nondefault collation is almost as fast as using non-C locales as the default collation. However, using the C locale as a nondefault collation would then be quite slow (though still faster than non-C locales). The solution would perhaps be a custom, lightweight caching system, but I haven't thought of one yet.
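[Editor's note] The "custom, lightweight caching system" Peter is musing about might look something like the minimal sketch below: memoize the C-ness of each collation OID so repeated comparisons skip the expensive catalog probe. The catalog-lookup function and all names here are hypothetical stand-ins, not PostgreSQL code:

```python
# Minimal sketch of memoizing per-collation C-ness so that repeated
# comparisons skip the (expensive) catalog lookup.
_c_ness = {}             # collation OID -> bool
lookups = {"count": 0}   # counts catalog probes, to show the caching

def fetch_collcollate(oid):
    # hypothetical stand-in for a SearchSysCache1(COLLOID, ...) probe
    lookups["count"] += 1
    catalog = {100: "C", 101: "en_US.utf8"}  # toy catalog
    return catalog[oid]

def lc_collate_is_c(oid):
    try:
        return _c_ness[oid]
    except KeyError:
        result = fetch_collcollate(oid) in ("C", "POSIX")
        _c_ness[oid] = result
        return result

lc_collate_is_c(100)     # first call hits the "catalog"
lc_collate_is_c(100)     # second call is served from the cache
print(lookups["count"])  # 1
```

The trade-off Peter points out still applies: a long-lived process touching many collations ends up caching most of the catalog anyway, which is exactly what SearchSysCache already does.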
Re: [HACKERS] Per-column collation
On mån, 2010-11-22 at 11:58 +0900, Itagaki Takahiro wrote: * Did you see any performance regression by collation? I found a bug in lc_collate_is_c(); result >= 0 should be checked before any other checks. SearchSysCache1() here would be a performance regression. That code turned out to be buggy anyway, because it was using the result cache variable independent of the collation parameter. I did some profiling with this now. The problem is that this function lc_collate_is_c() would need to cache the C-ness property for any number of collations. Depending on what call pattern you expect or want to optimize for, you might end up caching most of the pg_collation catalog, which is actually the mandate of SearchSysCache, but the profile shows that SearchSysCache takes a large chunk of the additional run time. If I remove that branch altogether, that is, don't treat the C locale specially at all in the nondefault collation case, then using non-C locales as a nondefault collation is almost as fast as using non-C locales as the default collation. However, using the C locale as a nondefault collation would then be quite slow (though still faster than non-C locales). The solution would perhaps be a custom, lightweight caching system, but I haven't thought of one yet.
Re: [HACKERS] Per-column collation, work in progress
On tor, 2010-10-14 at 22:54 -0400, Robert Haas wrote: It seems you've falsified the header comment in pathkeys_useful_for_merging(), although I guess it's already false because it doesn't seem to have been updated for the NULLS ASC/DESC stuff, and the interior comment in right_merge_direction() also needs adjusting. But this might be more than a documentation problem, because the choice of merge direction really *is* arbitrary in the case of ASC/DESC and NULLS FIRST/LAST, but I'm not sure whether that's actually true for collation. If collation affects the definition of equality then it certainly isn't true. I did check that again and didn't arrive at the conclusion that the comments would need updating either with respect to this patch or some previous change. Could you check again and possibly provide a suggestion?
Re: [HACKERS] Per-column collation, work in progress
On ons, 2010-09-22 at 19:44 +0900, Itagaki Takahiro wrote: * CREATE TABLE (LIKE table_with_collation) doesn't inherit collations. This was fixed in the CF2010-11 patch. * psql \d needs a separator between collate and not null modifiers. And this as well.
Re: [HACKERS] Per-column collation, work in progress
On Wed, Nov 24, 2010 at 3:37 PM, Peter Eisentraut pete...@gmx.net wrote: On tor, 2010-10-14 at 22:54 -0400, Robert Haas wrote: It seems you've falsified the header comment in pathkeys_useful_for_merging(), although I guess it's already false because it doesn't seem to have been updated for the NULLS ASC/DESC stuff, and the interior comment in right_merge_direction() also needs adjusting. But this might be more than a documentation problem, because the choice of merge direction really *is* arbitrary in the case of ASC/DESC and NULLS FIRST/LAST, but I'm not sure whether that's actually true for collation. If collation affects the definition of equality then it certainly isn't true. I did check that again and didn't arrive at the conclusion that the comments would need updating either with respect to this patch or some previous change. Could you check again and possibly provide a suggestion? I think that you are right and that my previous comment was erroneous. Sorry for the noise. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] Per-column collation
On tor, 2010-11-18 at 21:37 +0200, Heikki Linnakangas wrote: Have you done any performance testing? Functions like text_cmp can be a hotspot in sorting, so any extra overhead there might show up in tests. Without having optimized it very much yet, the performance for a 1GB ORDER BY is * without COLLATE clause: about the same as without the patch * with COLLATE clause: about 30%-50% slower I can imagine that there is some optimization potential in the latter case. But in any case, it's not awfully slow.
Re: [HACKERS] Per-column collation
On mån, 2010-11-22 at 11:58 +0900, Itagaki Takahiro wrote: * COLLATE information must be explicitly passed by the caller in the patch, but we might forget the handover when we write new code. Is it possible to pass it automatically, say using a global variable? If we could do so, existing extensions might work with collation without being rewritten. I don't see how that is supposed to work. I understand the concern, but the system is fairly robust against this becoming a problem. * Did you check the regression test on Windows? We probably cannot use en_US.utf8 on Windows. Also, some output of the test includes non-ASCII characters. How will we test the COLLATE feature on non-UTF8 databases? I attempted to discuss this here: http://archives.postgresql.org/pgsql-hackers/2010-11/msg00464.php For lack of a solution, the approach now is to run the regression test file manually, and we provide separate test files for as many platforms and encodings or whatever we want to support.
Re: [HACKERS] Per-column collation
On Tue, Nov 16, 2010 at 04:42, Peter Eisentraut pete...@gmx.net wrote: On mån, 2010-11-15 at 11:34 +0100, Pavel Stehule wrote: I am checking a patch. I found a problem with initdb Ah, late night brain farts, it appears. Here is a corrected version. This version cannot be applied cleanly any more. Please update it. (I think you don't have to include changes for catversion.h) ./src/backend/optimizer/util/plancat.c.rej ./src/backend/optimizer/plan/createplan.c.rej ./src/backend/optimizer/path/indxpath.c.rej ./src/include/catalog/catversion.h.rej I didn't compile or run the patched server, but I found a couple of issues in the design and source code: * COLLATE information must be explicitly passed by the caller in the patch, but we might forget the handover when we write new code. Is it possible to pass it automatically, say using a global variable? If we could do so, existing extensions might work with collation without being rewritten. * Did you check the regression test on Windows? We probably cannot use en_US.utf8 on Windows. Also, some output of the test includes non-ASCII characters. How will we test the COLLATE feature on non-UTF8 databases? [src/test/regress/sql/collate.sql] +CREATE TABLE collate_test1 ( +a int, +b text COLLATE en_US.utf8 not null +); * Did you see any performance regression by collation? I found a bug in lc_collate_is_c(); result >= 0 should be checked before any other checks. SearchSysCache1() here would be a performance regression. [src/backend/utils/adt/pg_locale.c] -lc_collate_is_c(void) +lc_collate_is_c(Oid collation) { ... + tp = SearchSysCache1(COLLOID, ObjectIdGetDatum(collation)); ... HERE: if (result >= 0) return (bool) result; -- Itagaki Takahiro
Re: [HACKERS] Per-column collation
On 15.11.2010 21:42, Peter Eisentraut wrote: On mån, 2010-11-15 at 11:34 +0100, Pavel Stehule wrote: I am checking a patch. I found a problem with initdb Ah, late night brain farts, it appears. Here is a corrected version. Some random comments: In syntax.sgml: +The literalCOLLATE/literal clause overrides the collation of +an expression. It is appended to the expression at applies to: That last sentence doesn't parse. Would it be possible to eliminate the ExecEvalCollateClause function somehow? It just calls through the argument. How about directly returning the argument ExprState in ExecInitExpr? get_collation_name() returns the plain name without schema, so it's not good enough for use in ruleutils.c. pg_dump is also ignoring the collation's schema. Have you done any performance testing? Functions like text_cmp can be a hotspot in sorting, so any extra overhead there might show up in tests. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
Re: [HACKERS] Per-column collation
On mån, 2010-11-15 at 23:13 +0100, Pavel Stehule wrote: a) the default encoding for a collation isn't the same as the default encoding of the database - it's not exactly friendly - the most used encoding is UTF8, but in most cases users have to write locale.utf8. I don't understand what you are trying to say. Please provide more detail. b) there is a bug - the default collation (the database collation) is ignored postgres=# show lc_collate; lc_collate cs_CZ.UTF8 (1 row) Time: 0.518 ms postgres=# select * from jmena order by v; v ─── Chromečka Crha Drobný Čečetka (4 rows) postgres=# select * from jmena order by v collate cs_CZ.utf8; v ─── Crha Čečetka Drobný Chromečka (4 rows) both results should be the same. I tried to reproduce this here but got the expected results. Could you try to isolate a complete test script? isn't there a problem with case-sensitive collation names? When I use the lc_collate value, I get an error message postgres=# select * from jmena order by v collate cs_CZ.UTF8; ERROR: collation cs_CZ.UTF8 for current database encoding UTF8 does not exist LINE 1: select * from jmena order by v collate cs_CZ.UTF8; The problem arises when a table is created without an explicit collation. Well, I agree it's not totally nice, but we have to do something, and I think it's logical to use the locale names as collation names by default, and collation names are SQL identifiers. Do you have any ideas for improving this?
Re: [HACKERS] Per-column collation
Hello 2010/11/16 Peter Eisentraut pete...@gmx.net: On mån, 2010-11-15 at 23:13 +0100, Pavel Stehule wrote: a) the default encoding for a collation isn't the same as the default encoding of the database - it's not exactly friendly - the most used encoding is UTF8, but in most cases users have to write locale.utf8. I don't understand what you are trying to say. Please provide more detail. See below. b) there is a bug - the default collation (the database collation) is ignored postgres=# show lc_collate; lc_collate cs_CZ.UTF8 (1 row) Time: 0.518 ms postgres=# select * from jmena order by v; v ─── Chromečka Crha Drobný Čečetka (4 rows) postgres=# select * from jmena order by v collate cs_CZ.utf8; v ─── Crha Čečetka Drobný Chromečka (4 rows) both results should be the same. I tried to reproduce this here but got the expected results. Could you try to isolate a complete test script? I can't reproduce it now either, on a different system and computer. Maybe I did something wrong. Sorry. isn't there a problem with case-sensitive collation names? When I use the lc_collate value, I get an error message postgres=# select * from jmena order by v collate cs_CZ.UTF8; ERROR: collation cs_CZ.UTF8 for current database encoding UTF8 does not exist LINE 1: select * from jmena order by v collate cs_CZ.UTF8; The problem arises when a table is created without an explicit collation. Well, I agree it's not totally nice, but we have to do something, and I think it's logical to use the locale names as collation names by default, and collation names are SQL identifiers. Do you have any ideas for improving this? Yes - my first question is: why do we need to specify the encoding when only one encoding is supported?
I can't use cs_CZ.iso88592 when my db uses UTF8 - btw, there is a wrong message: yyy=# select * from jmena order by jmeno collate cs_CZ.iso88592; ERROR: collation cs_CZ.iso88592 for current database encoding UTF8 does not exist LINE 1: select * from jmena order by jmeno collate cs_CZ.iso88592; ^ I don't know why, but the preferred encoding for Czech is iso88592 now - but I can't use it - so I can't use the names czech or cs_CZ. I always have to use the full name cs_CZ.utf8. That's wrong. More - from this moment, my application depends on the first encoding used - I can't change the encoding without refactoring SQL statements - because the encoding is hard-coded there (in the collation clause). So I don't understand why you fill the pg_collation table with thousands of collations that are impossible to use. If I use utf8, then there should be just utf8-based collations. And if you need to work with a wide set of collations, then I am for preferring utf8 - at least for the central European region. If somebody wants to use collations here, then he will use a combination of cs, de, en - so he must use latin2 and latin1, or utf8. I think the encoding should not be a part of the collation when possible. Regards Pavel
Re: [HACKERS] Per-column collation
On tis, 2010-11-16 at 20:00 +0100, Pavel Stehule wrote: yes - my first question is: why do we need to specify the encoding when only one encoding is supported? I can't use cs_CZ.iso88592 when my db uses UTF8 - btw, there is a wrong message: yyy=# select * from jmena order by jmeno collate cs_CZ.iso88592; ERROR: collation cs_CZ.iso88592 for current database encoding UTF8 does not exist LINE 1: select * from jmena order by jmeno collate cs_CZ.iso88592; ^ Sorry, is there some mistake in that message? I don't know why, but the preferred encoding for Czech is iso88592 now - but I can't use it - so I can't use the names czech or cs_CZ. I always have to use the full name cs_CZ.utf8. That's wrong. More - from this moment, my application depends on the first encoding used - I can't change the encoding without refactoring SQL statements - because the encoding is hard-coded there (in the collation clause). I can only look at the locales that the operating system provides. We could conceivably make some simplifications like stripping off the .utf8, but then how far do we go and where do we stop? Locale names on Windows look different too. But in general, how do you suppose we should map an operating system locale name to an acceptable SQL identifier? You might hope, for example, that we could look through the list of operating system locale names and map, say, cs_CZ -> czech cs_CZ.iso88592 -> czech cs_CZ.utf8 -> czech czech -> czech but we have no way to actually know that these are semantically similar, so this illustrated mapping is AI-complete. We need to take the locale names as is, and they may or may not carry encoding information. So I don't understand why you fill the pg_collation table with thousands of collations that are impossible to use. If I use utf8, then there should be just utf8-based collations.
If somebody wants to use collations here, then he will use a combination of cs, de, en - so he must use latin2 and latin1, or utf8. I think the encoding should not be a part of the collation when possible. Different databases can have different encodings, but the pg_collation catalog is copied from the template database in any case. We can't make any changes in system catalogs as we create new databases, so the useless collations have to be there. There are only a few hundred, actually, so it's not really a lot of wasted space.
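[Editor's note] The naive normalization Peter rules out above - strip the codeset suffix and hope the rest names one collation - is easy to sketch, and the sketch also shows why it cannot unify an alias like czech with cs_CZ. All names here are illustrative:

```python
# Sketch: stripping the codeset suffix collapses several OS locale
# names onto one identifier, but an alias like "czech" stays distinct,
# and nothing proves the collapsed locales actually behave alike.
def strip_codeset(locale_name):
    return locale_name.split(".", 1)[0]

names = ["cs_CZ", "cs_CZ.iso88592", "cs_CZ.utf8", "czech"]
print({n: strip_codeset(n) for n in names})
```

Mapping all four of these onto one identifier would require knowing they are semantically the same collation, which is exactly the information the operating system does not expose - hence "AI-complete".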
Re: [HACKERS] Per-column collation
2010/11/16 Peter Eisentraut pete...@gmx.net: On tis, 2010-11-16 at 20:00 +0100, Pavel Stehule wrote: yes - my first question is: why do we need to specify the encoding when only one encoding is supported? I can't use cs_CZ.iso88592 when my db uses UTF8 - btw, there is a wrong message: yyy=# select * from jmena order by jmeno collate cs_CZ.iso88592; ERROR: collation cs_CZ.iso88592 for current database encoding UTF8 does not exist LINE 1: select * from jmena order by jmeno collate cs_CZ.iso88592; ^ Sorry, is there some mistake in that message? It is unclear - I'd expect something like cannot use collation cs_CZ.iso88592, because your database uses the UTF8 encoding. I don't know why, but the preferred encoding for Czech is iso88592 now - but I can't use it - so I can't use the names czech or cs_CZ. I always have to use the full name cs_CZ.utf8. That's wrong. More - from this moment, my application depends on the first encoding used - I can't change the encoding without refactoring SQL statements - because the encoding is hard-coded there (in the collation clause). I can only look at the locales that the operating system provides. We could conceivably make some simplifications like stripping off the .utf8, but then how far do we go and where do we stop? Locale names on Windows look different too. But in general, how do you suppose we should map an operating system locale name to an acceptable SQL identifier? You might hope, for example, that we could look through the list of operating system locale names and map, say, cs_CZ -> czech cs_CZ.iso88592 -> czech cs_CZ.utf8 -> czech czech -> czech but we have no way to actually know that these are semantically similar, so this illustrated mapping is AI-complete. We need to take the locale names as is, and they may or may not carry encoding information. So I don't understand why you fill the pg_collation table with thousands of collations that are impossible to use. If I use utf8, then there should be just utf8-based collations.
And if you need to work with wide collates, then I am for a preferring utf8 - minimally for central europe region. if somebody would to use a collates here, then he will use a combination cs, de, en - so it must to use a latin2 and latin1 or utf8. I think so encoding should not be a part of collation when it is possible. Different databases can have different encodings, but the pg_collation catalog is copied from the template database in any case. We can't do any changes in system catalogs as we create new databases, so the useless collations have to be there. There are only a few hundred, actually, so it's not really a lot of wasted space. I have not a problem with size. Just I think, current behave isn't practical. When database encoding is utf8, then I except, so cs_CZ or czech will be for utf8. I understand, so template0 must have a all locales, and I understand why current behave is, but it is very user unfriendly. Actually, only old application in CR uses latin2, almost all uses a utf, but now latin2 is preferred. This is bad and should be solved. Regards Pavel -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
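The naming rule Peter describes can be seen directly in SQL (hypothetical table jmena; a collation name containing a dot has to be quoted as an identifier):

```sql
-- On a UTF8 database, collation names are the OS locale names as-is,
-- so only the encoding-qualified spelling matches:
SELECT jmeno FROM jmena ORDER BY jmeno COLLATE "cs_CZ.utf8";

-- The iso88592 variant exists as an OS locale, but not for this
-- database encoding, hence the error discussed above:
SELECT jmeno FROM jmena ORDER BY jmeno COLLATE "cs_CZ.iso88592";
-- ERROR: collation cs_CZ.iso88592 for current database encoding UTF8 does not exist
```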
Re: [HACKERS] Per-column collation
I can only look at the locales that the operating system provides. We could conceivably make some simplifications like stripping off the .utf8, but then how far do we go and where do we stop? Locale names on Windows look different too. But in general, how do you suppose we should map an operating system locale name to an acceptable SQL identifier? You might hope, for example, that we could look through the It would be nice if we could have some mapping of locale names built in, so one doesn't have to write alternative SQL depending on the DB server OS: select * from tab order by foo collate Polish, Poland select * from tab order by foo collate pl_PL.UTF-8 (that's how it works now, correct?) Greetings Marcin Mańk
Re: [HACKERS] Per-column collation
2010/11/16 marcin mank marcin.m...@gmail.com: I can only look at the locales that the operating system provides. We could conceivably make some simplifications like stripping off the .utf8, but then how far do we go and where do we stop? Locale names on Windows look different too. But in general, how do you suppose we should map an operating system locale name to an acceptable SQL identifier? You might hope, for example, that we could look through the It would be nice if we could have some mapping of locale names built in, so one doesn't have to write alternative SQL depending on the DB server OS: +1 Pavel select * from tab order by foo collate Polish, Poland select * from tab order by foo collate pl_PL.UTF-8 (that's how it works now, correct?) Greetings Marcin Mańk
Re: [HACKERS] Per-column collation
On tis, 2010-11-16 at 20:59 +0100, Pavel Stehule wrote: 2010/11/16 Peter Eisentraut pete...@gmx.net: On tis, 2010-11-16 at 20:00 +0100, Pavel Stehule wrote: yes - my first question is: why do we need to specify an encoding when only one encoding is supported? I can't use cs_CZ.iso88592 when my db uses UTF8 - and besides, the message is wrong: yyy=# select * from jmena order by jmeno collate cs_CZ.iso88592; ERROR: collation cs_CZ.iso88592 for current database encoding UTF8 does not exist LINE 1: select * from jmena order by jmeno collate cs_CZ.iso88592; ^ Sorry, is there some mistake in that message? It is unclear - I expected something like "cannot use collation cs_CZ.iso88592, because your database uses the utf8 encoding". No, the namespace for collations is per encoding. (This is per SQL standard.) So you *could* have a collation called cs_CZ.iso88592 for the UTF8 encoding. I have no problem with the size. I just think the current behaviour isn't practical. When the database encoding is utf8, then I expect that cs_CZ or czech will be for utf8. I understand that template0 must have all locales, and I understand why the current behaviour is what it is, but it is very user-unfriendly. Actually, only old applications in CR use latin2, almost all use utf - yet latin2 is the preferred name now. This is bad and should be solved. Again, we can only look at the locale names that the operating system gives us. Mapping that to the names you expect is an AI problem. If you have a solution, please share it.
Re: [HACKERS] Per-column collation
On tis, 2010-11-16 at 21:05 +0100, marcin mank wrote: It would be nice if we could have some mapping of locale names bult in, so one doesn`t have to write alternative sql depending on DB server OS: select * from tab order by foo collate Polish, Poland select * from tab order by foo collate pl_PL.UTF-8 Sure that would be nice, but how do you hope to do that? -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Per-column collation
2010/11/16 Peter Eisentraut pete...@gmx.net: On tis, 2010-11-16 at 20:59 +0100, Pavel Stehule wrote: 2010/11/16 Peter Eisentraut pete...@gmx.net: On tis, 2010-11-16 at 20:00 +0100, Pavel Stehule wrote: yes - my first question is: why do we need to specify an encoding when only one encoding is supported? I can't use cs_CZ.iso88592 when my db uses UTF8 - and besides, the message is wrong: yyy=# select * from jmena order by jmeno collate cs_CZ.iso88592; ERROR: collation cs_CZ.iso88592 for current database encoding UTF8 does not exist LINE 1: select * from jmena order by jmeno collate cs_CZ.iso88592; ^ Sorry, is there some mistake in that message? It is unclear - I expected something like "cannot use collation cs_CZ.iso88592, because your database uses the utf8 encoding". No, the namespace for collations is per encoding. (This is per SQL standard.) So you *could* have a collation called cs_CZ.iso88592 for the UTF8 encoding. I have no problem with the size. I just think the current behaviour isn't practical. When the database encoding is utf8, then I expect that cs_CZ or czech will be for utf8. I understand that template0 must have all locales, and I understand why the current behaviour is what it is, but it is very user-unfriendly. Actually, only old applications in CR use latin2, almost all use utf - yet latin2 is the preferred name now. This is bad and should be solved. Again, we can only look at the locale names that the operating system gives us. Mapping that to the names you expect is an AI problem. If you have a solution, please share it. OK, then we should be able to define this alias manually, something like CREATE COLLATE czech FOR LOCALE cs_CZ.UTF8, or similar. Without this, applications or stored procedures can be non-portable between UNIX and WIN. Peter, initdb currently checks the relation between encoding and locale - and that check is portable. Can we use this code?
Pavel
Re: [HACKERS] Per-column collation
On tis, 2010-11-16 at 21:40 +0100, Pavel Stehule wrote: OK, then we should be able to define this alias manually, something like CREATE COLLATE czech FOR LOCALE cs_CZ.UTF8, or similar. Without this, applications or stored procedures can be non-portable between UNIX and WIN. Yes, such a command will be provided. You can already do it manually. Peter, initdb currently checks the relation between encoding and locale - and that check is portable. Can we use this code? Hmm, not really, but something similar, I suppose. Only that the mapping list would be much longer and more volatile.
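Doing it manually, as Peter suggests, amounts to adding a row to pg_collation yourself. A sketch, under the assumption that the patch's catalog has the columns that later shipped (collname, collnamespace, collowner, collencoding, collcollate, collctype); 6 is the internal id of the UTF8 encoding, 10 the bootstrap superuser, and jmena is a hypothetical table:

```sql
-- Register "czech" as an alias for the OS locale cs_CZ.utf8,
-- valid for UTF8 databases only:
INSERT INTO pg_collation
       (collname, collnamespace, collowner,
        collencoding, collcollate, collctype)
SELECT 'czech', n.oid, 10, 6, 'cs_CZ.utf8', 'cs_CZ.utf8'
FROM pg_namespace n
WHERE n.nspname = 'pg_catalog';

-- Afterwards the short name works regardless of the platform's
-- spelling of the underlying locale:
SELECT * FROM jmena ORDER BY jmeno COLLATE czech;
```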
Re: [HACKERS] Per-column collation
On Tue, Nov 16, 2010 at 10:32:01PM +0200, Peter Eisentraut wrote: On tis, 2010-11-16 at 21:05 +0100, marcin mank wrote: It would be nice if we could have some mapping of locale names built in, so one doesn't have to write alternative SQL depending on the DB server OS: select * from tab order by foo collate Polish, Poland select * from tab order by foo collate pl_PL.UTF-8 Sure that would be nice, but how do you hope to do that? Given that each operating system comes with a different set of collations, it seems unlikely you could even find two collations on different OSes that correspond. There's not a lot of standardisation here (well, except for the unicode collation algorithm, but that doesn't help with language variations). I don't think this is a big deal for now; perhaps after per-column collation is implemented we can work on the portability issues. Make it work, then make it better. /me ducks Have a nice day, -- Martijn van Oosterhout klep...@svana.org http://svana.org/kleptog/ Patriotism is when love of your own people comes first; nationalism, when hate for people other than your own comes first. - Charles de Gaulle
Re: [HACKERS] Per-column collation
Martijn van Oosterhout klep...@svana.org writes: On Tue, Nov 16, 2010 at 10:32:01PM +0200, Peter Eisentraut wrote: On tis, 2010-11-16 at 21:05 +0100, marcin mank wrote: It would be nice if we could have some mapping of locale names built in, so one doesn't have to write alternative SQL depending on the DB server OS: Sure that would be nice, but how do you hope to do that? Given that each operating system comes with a different set of collations, it seems unlikely you could even find two collations on different OSes that correspond. Yeah, the *real* portability problem here is that the locale behavior is likely to be different, not just the name. I don't think we'd be doing people many favors by masking behavioral differences behind a forced common name. regards, tom lane
Re: [HACKERS] Per-column collation
2010/11/16 Tom Lane t...@sss.pgh.pa.us: Martijn van Oosterhout klep...@svana.org writes: On Tue, Nov 16, 2010 at 10:32:01PM +0200, Peter Eisentraut wrote: On tis, 2010-11-16 at 21:05 +0100, marcin mank wrote: It would be nice if we could have some mapping of locale names built in, so one doesn't have to write alternative SQL depending on the DB server OS: Sure that would be nice, but how do you hope to do that? Given that each operating system comes with a different set of collations, it seems unlikely you could even find two collations on different OSes that correspond. Yeah, the *real* portability problem here is that the locale behavior is likely to be different, not just the name. I don't think we'd be doing people many favors by masking behavioral differences behind a forced common name. No - at minimum, cs_CZ.utf8 and cs_CZ.iso88592 behave the same. But without an alias, users have to modify their source code whenever they change the encoding. Pavel regards, tom lane
Re: [HACKERS] Per-column collation
Hello I am checking the patch. I found a problem with initdb:

[postg...@pavel-stehule postgresql]$ /usr/local/pgsql/bin/initdb -D /usr/local/pgsql/data/
could not change directory to /home/pavel/src/postgresql
The files belonging to this database system will be owned by user postgres. This user must also own the server process.
The database cluster will be initialized with locale en_US.utf8. The default database encoding has accordingly been set to UTF8. The default text search configuration will be set to english.
fixing permissions on existing directory /usr/local/pgsql/data ... ok
creating subdirectories ... ok
selecting default max_connections ... 100
selecting default shared_buffers ... 24MB
creating configuration files ... ok
creating template1 database in /usr/local/pgsql/data/base/1 ... ok
initializing pg_authid ... ok
initializing dependencies ... ok
creating system views ... ok
loading system objects' descriptions ... ok
creating collations ...
initdb: locale name has non-ASCII characters, skipped: bokmål
initdb: locale name has non-ASCII characters, skipped: français
could not determine encoding for locale hy_AM.armscii8: codeset is ARMSCII-8
could not determine encoding for locale ka_GE: codeset is GEORGIAN-PS
could not determine encoding for locale ka_GE.georgianps: codeset is GEORGIAN-PS
could not determine encoding for locale kk_KZ: codeset is PT154
could not determine encoding for locale kk_KZ.pt154: codeset is PT154
could not determine encoding for locale tg_TJ: codeset is KOI8-T
could not determine encoding for locale tg_TJ.koi8t: codeset is KOI8-T
could not determine encoding for locale thai: codeset is TIS-620
could not determine encoding for locale th_TH: codeset is TIS-620
could not determine encoding for locale th_TH.tis620: codeset is TIS-620
could not determine encoding for locale vi_VN.tcvn: codeset is TCVN5712-1
FATAL: invalid byte sequence for encoding UTF8: 0xe56c27
child process exited with exit code 1
initdb: removing contents of data directory /usr/local/pgsql/data

tested on fedora 13 [postg...@pavel-stehule local]$ locale -a | wc -l 731 Regards Pavel Stehule 2010/11/15 Peter Eisentraut pete...@gmx.net: Here is the next patch in this epic series. [0] I have addressed most of the issues pointed out in previous reviews and removed all major outstanding problems that were marked in the code. So it might just almost really work. The documentation now also covers everything that's interesting, so newcomers can start with that. For those who have previously reviewed this, two major changes: * The locales to be loaded are now computed by initdb, no longer during the build process. * The regression test file has been removed from the main test set. To run it, use make check MULTIBYTE=UTF8 EXTRA_TESTS=collate Stuff that still cannot be expected to work: * no CREATE COLLATION yet, maybe later * no support for regular expression searches * no text search support These would not be release blockers, I think. [0] http://archives.postgresql.org/message-id/1284583568.4696.20.ca...@vanquo.pezone.net
Re: [HACKERS] Per-column collation
Hello 2010/11/15 Peter Eisentraut pete...@gmx.net: On mån, 2010-11-15 at 11:34 +0100, Pavel Stehule wrote: I am checking the patch. I found a problem with initdb Ah, late night brain farts, it appears. Here is a corrected version. Yes, it's OK now. I still see a few issues: a) The default encoding for a collation isn't the same as the default encoding of the database. This is not exactly friendly - the most widely used encoding is UTF8, but in most cases users have to write locale.utf8. b) There is a bug - the default collation (the database collation) is ignored: postgres=# show lc_collate; lc_collate cs_CZ.UTF8 (1 row) Time: 0.518 ms postgres=# select * from jmena order by v; v ─── Chromečka Crha Drobný Čečetka (4 rows) postgres=# select * from jmena order by v collate cs_CZ.utf8; v ─── Crha Čečetka Drobný Chromečka (4 rows) Both results should be the same. Isn't there a problem with case-sensitive collation names? When I use the lc_collate value, I get an error message: postgres=# select * from jmena order by v collate cs_CZ.UTF8; ERROR: collation cs_CZ.UTF8 for current database encoding UTF8 does not exist LINE 1: select * from jmena order by v collate cs_CZ.UTF8; The problem occurs when the table is created without an explicit collation. Regards Pavel Stehule
Re: [HACKERS] Per-column collation, work in progress
On tor, 2010-10-14 at 22:54 -0400, Robert Haas wrote: and maybe not that bad, but I wonder if there is some preparatory refactoring that could be done to trim it down a bit. I notice, for example, that a lot of places that looked at asc/desc, nulls first/last now look at asc/desc, nulls first/last, collation. In particular, all the pathkey stuff is like this. And similarly places that used to care about type, typmod now have to care about type, typmod, collation. There might be ways to restructure some of this code so that these things can be changed without having to touch quite so many places. Yeah, I think that's what I'll try to do next. We already have TypeName as a structure that contains type and typmod (and collation, in my patch). We could make that a primnode instead of a parsenode, and use it in more places, or we could make a new leaner structure that only contains the numeric info. We could then, for example, change things like this: typedef struct Var { Expr xpr; ... Oid vartype; int32 vartypmod; ... } into this: typedef struct Var { Expr xpr; ... TypeName/TypeFoo vartype; ... } This would save boatloads of duplicate code. It looks like you've defined collations as objects that exist within particular namespaces, but there's no CREATE COLLATION statement, so I don't see what purpose this serves. I suppose we could leave that to be added later, but is there actually a use case for having collations in individual schemas, or should we treat them more like we do casts - i.e. as database-global objects? The SQL standard defines it that way, and there should be a CREATE COLLATION statement later. Application-specific collation sequences might not be unreasonable in the future. Why does the executor ever need to see collate clauses? Hmm, maybe not. I think it did in an earlier working draft.
Re: [HACKERS] Per-column collation, work in progress
On Thu, Oct 21, 2010 at 2:44 PM, Peter Eisentraut pete...@gmx.net wrote: On tor, 2010-10-14 at 22:54 -0400, Robert Haas wrote: and maybe not that bad, but I wonder if there is some preparatory refactoring that could be done to trim it down a bit. I notice, for example, that a lot of places that looked at asc/desc, nulls first/last now look at asc/desc, nulls first/last, collation. In particular, all the pathkey stuff is like this. And similarly places that used to care about type, typmod now have to care about type, tymod, collation. There might be ways to restructure some of this code so that these things can be changed without having to touch quite so many places. Yeah, I think that's what I'll try to do next. We already have TypeName as a structure that contains type and typmod (and collation, in my patch). We could make that a primnode instead of a parsenode, and use it in more places, or we could make a new leaner structure that only contains the numeric info. We could then, for example, change things like this: typedef struct Var { Expr xpr; ... Oid vartype; int32 vartypmod; ... } into this typedef struct Var { Expr xpr; ... TypeName/TypeFoo vartype; ... } This would save boatloads of duplicate code. I think that the idea of having a node that represents a type in all its glory is a very good one. I'm somewhat inclined not to reuse TypeName, because I think we'll end up wanting to use this in places where names and location are not available. In fact, judging by some of the logic in LookupTypeNames(), we have some cases like that already, which might be worth trying to clean up. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Per-column collation, work in progress
Peter Eisentraut pete...@gmx.net writes: We already have TypeName as a structure that contains type and typmod (and collation, in my patch). We could make that a primnode instead of a parsenode, and use it in more places, or we could make a new leaner structure that only contains the numeric info. TypeName per se is completely inappropriate for use beyond the first stage of parsing, because it requires catalog lookups to make any sense of. I think the post-parsing representation should still start with a type OID. I can agree with replacing typmod with a struct, though. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Per-column collation, work in progress
On Thu, Oct 21, 2010 at 4:28 PM, Tom Lane t...@sss.pgh.pa.us wrote: Peter Eisentraut pete...@gmx.net writes: We already have TypeName as a structure that contains type and typmod (and collation, in my patch). We could make that a primnode instead of a parsenode, and use it in more places, or we could make a new leaner structure that only contains the numeric info. TypeName per se is completely inappropriate for use beyond the first stage of parsing, because it requires catalog lookups to make any sense of. I think the post-parsing representation should still start with a type OID. I can agree with replacing typmod with a struct, though. I think we should have both the type OID and the typmod in the struct. Carrying the type OID separately from the typmod has caused us enough heartache already. No? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Per-column collation, work in progress
Robert Haas robertmh...@gmail.com writes: On Thu, Oct 21, 2010 at 4:28 PM, Tom Lane t...@sss.pgh.pa.us wrote: TypeName per se is completely inappropriate for use beyond the first stage of parsing, because it requires catalog lookups to make any sense of. I think the post-parsing representation should still start with a type OID. I can agree with replacing typmod with a struct, though. I think we should have both the type OID and the typmod in the struct. Carrying the type OID separately from the typmod has caused us enough heartache already. No? I think that that would probably involve a whole lot more notational busywork than just replacing typmod with something more complicated. However, we're talking about breaking vast amounts of code in either case, so maybe making it even vaster isn't a real consideration. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Per-column collation, work in progress
I think that that would probably involve a whole lot more notational busywork than just replacing typmod with something more complicated. However, we're talking about breaking vast amounts of code in either case, so maybe making it even vaster isn't a real consideration. Gods, yes. Please let's not extend typemod any further without an overhaul. -- -- Josh Berkus PostgreSQL Experts Inc. http://www.pgexperts.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Per-column collation, work in progress
On Fri, Sep 24, 2010 at 1:57 AM, Peter Eisentraut pete...@gmx.net wrote: On fre, 2010-09-24 at 09:32 +0900, Itagaki Takahiro wrote: On Wed, Sep 22, 2010 at 10:29 PM, Peter Eisentraut pete...@gmx.net wrote: We could support it also on MSVC. http://msdn.microsoft.com/en-us/library/a7cwbx4t(v=VS.90).aspx -- _strcoll_l http://msdn.microsoft.com/en-us/library/45119yx3(v=VS.90).aspx -- _towupper_l Great. If we support both glibc and MSVC, how do we handle CREATE TABLE DDLs in pg_dump? Since collation names depend on the platform, a backup dumped on UNIX cannot be reloaded on Windows. We might need to normalize locale names to generate a portable dump. It is not necessary that the SQL collation names be the same as the OS locale names. That is just a convenient way to initialize it. If you need to transport dumps, you can create your own SQL collation entry at the target that locally fits what you are trying to do (or perhaps rename the collation at the source). I don't think there is a universally applicable way to normalize locale names, because we want to support user-defined OS locales. What's the status of this patch? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] Per-column collation, work in progress
On ons, 2010-10-13 at 19:15 -0400, Robert Haas wrote: What's the status of this patch? I would appreciate some more review of the basic architecture. -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Per-column collation, work in progress
On Thu, Oct 14, 2010 at 12:53 PM, Peter Eisentraut pete...@gmx.net wrote: On ons, 2010-10-13 at 19:15 -0400, Robert Haas wrote: What's the status of this patch? I would appreciate some more review of the basic architecture. reads patch Wow, what a patch. On the whole, I think this looks pretty good. Of course, the obvious thing to be dismayed about is how many parts of the system this touches. To some extent, that's probably inevitable and maybe not that bad, but I wonder if there is some preparatory refactoring that could be done to trim it down a bit. I notice, for example, that a lot of places that looked at asc/desc, nulls first/last now look at asc/desc, nulls first/last, collation. In particular, all the pathkey stuff is like this. And similarly places that used to care about type, typmod now have to care about type, typmod, collation. There might be ways to restructure some of this code so that these things can be changed without having to touch quite so many places. If we're extending these lists from two items to three, do we need to worry about what happens when they grow to four or five or six? I particularly think this is an issue for the type information. We are still finding bugs where typmod isn't carried through properly; this kind of thing is only going to make it much worse. We need to encapsulate it in some future-proof way. It seems you've falsified the header comment in pathkeys_useful_for_merging(), although I guess it's already false because it doesn't seem to have been updated for the NULLS ASC/DESC stuff, and the interior comment in right_merge_direction() also needs adjusting. But this might be more than a documentation problem, because the choice of merge direction really *is* arbitrary in the case of ASC/DESC and NULLS FIRST/LAST, but I'm not sure whether that's actually true for collation. If collation affects the definition of equality then it certainly isn't true.
It looks like you've defined collations as objects that exist within particular namespaces, but there's no CREATE COLLATION statement, so I don't see what purpose this serves. I suppose we could leave that to be added later, but is there actually a use case for having collations in individual schemas, or should we treat them more like we do casts - i.e. as database-global objects? Why does the executor ever need to see collate clauses? In the department of minor nits, the use of the word respectively in the CREATE INDEX documentation doesn't make sense to me. The message about has a collation conflict is not self-explanatory. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] Per-column collation, work in progress
Hello Peter Is there any reason why you prohibit different encodings within one database? What people actually expect from per-column collation is the possibility to store two or more different encodings in one database. Without this possibility, only UTF8 is practical - for other encodings, only the pairs (national locale + C) work. Yes - from my perspective (as a Czech programmer) - mixing latin2 and latin1 is a very typical situation and request. I can live with this limit, but it is a very hard limit and should be documented. Regards Pavel 2010/9/15 Peter Eisentraut pete...@gmx.net: Following up on my previous patch [0], here is a fairly complete implementation of this feature. The general description and implementation outline of the previous message still apply. This patch contains documentation and regression tests, which can serve as further explanations. As this patch touches pretty much everything in the system, there are probably countless bugs and bogosities, some of which I have marked with FIXME, TODO, etc. But all the functionality is basically there, so it's time someone else gives this a serious examination. Note: As previously, regression tests only work with make check MULTIBYTE=UTF8 and the feature overall only works on Linux/glibc. [0] http://archives.postgresql.org/message-id/1279045531.32647.14.ca...@vanquo.pezone.net
Re: [HACKERS] Per-column collation, work in progress
On Sun, Sep 26, 2010 at 1:15 PM, Pavel Stehule pavel.steh...@gmail.com wrote: Is there any reason why you prohibit different encodings within one database? What people actually expect from per-column collation is the possibility to store two or more different encodings in one database. These are two completely separate problems that only look related. The main difference is that while collation is a property of the comparison or sort you're performing, encoding is a property of the string itself. It doesn't make sense to specify a different encoding than what the string actually contains. You could actually do what you want now by using bytea columns and convert_to/convert_from, and it wouldn't be much easier if the support were built into text, since you would still have to keep track of the encoding it's in and the encoding you want. We could have an encoded_text data type which includes both the encoding and the string, and whose comparison functions automatically handle conversion based on the encoding of the collation requested -- but I wouldn't want that to be the default text datatype. It would impose a lot of overhead on the basic text operations and magnify the effects of choosing the wrong collation. -- greg
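Greg's workaround can be sketched like this (the table and its columns are hypothetical; convert_from and convert_to are existing PostgreSQL functions):

```sql
-- Store bytes in their original encoding alongside a label saying
-- which encoding that is, tracked by the application:
CREATE TABLE msg (
    id       serial PRIMARY KEY,
    raw      bytea,          -- bytes in their original encoding
    encoding text            -- e.g. 'LATIN2'
);

-- Encode on the way in:
INSERT INTO msg (raw, encoding)
VALUES (convert_to('Čečetka', 'LATIN2'), 'LATIN2');

-- Decode to the database encoding only when needed:
SELECT convert_from(raw, encoding::name) AS txt FROM msg;
```

As Greg notes, this buys per-row encodings at the cost of the application tracking them itself and converting explicitly at every boundary.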
Re: [HACKERS] Per-column collation, work in progress
On 09/26/2010 09:37 AM, Greg Stark wrote: We could have a encoded_text data type which includes both the encoding and the string and which any comparison function automatically handles conversion based on the encoding of the collation requested -- but I wouldn't want that to be the default text datatype. It would impose a lot of overhead on the basic text operations and magnify the effects of choosing the wrong collation. Yeah, but it would be a nice gadget to have. cheers andrew -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Per-column collation, work in progress
On tor, 2010-09-23 at 11:55 +0200, Pavel Stehule wrote: select to_char(current_date,'tmday' collate cs_CZ.utf8); I am thinking, collates can be used for this purpose too. I see some impacts - this syntax changes a stable function to immutable and it cannot be simple to solve. I don't understand how you come to that conclusion.
Re: [HACKERS] Per-column collation, work in progress
2010/9/24 Peter Eisentraut pete...@gmx.net: On tor, 2010-09-23 at 11:55 +0200, Pavel Stehule wrote: select to_char(current_date,'tmday' collate cs_CZ.utf8); I am thinking, collates can be used for this purpose too. I see some impacts - this syntax changes a stable function to immutable and it cannot be simple to solve. I don't understand how you come to that conclusion. sorry, I was wrong - it makes sense for the date output function. to_char is immutable everywhere. Pavel
Re: [HACKERS] Per-column collation, work in progress
Hello, I am playing with your patch now. I found a few issues: 1. It doesn't work with the SQL 92 rules for a sort-by list. I can understand that the explicit COLLATE usage doesn't work, but the implicit usage doesn't work either: CREATE TABLE foo(a text, b text COLLATE cs_CZ.UTF8) SELECT * FROM foo ORDER BY 1 -- produces wrong order 2. Why is the default encoding for a collation static? There are latin2 for Czech, cs_CZ and cs_CZ.iso88592, so any user with UTF8 has to write the encoding explicitly. But the more widely used and preferred encoding now is UTF8. I think cs_CZ on a utf8 database should mean cs_CZ.UTF8. 3. postgres=# select to_char(current_date,'tmday') collate cs_CZ.utf8; to_char ── thursday -- bad result (1 row) 4. Is there a ToDo list somewhere for the collation implementation? 5. postgres=# create table xy(a text, b text collate cs_CZ); ERROR: collation cs_CZ for current database encoding UTF8 does not exist Could there be a friendlier message or hint? Like: you cannot use a different encoding; this collation is in the pg_collation table. -- The patch applied cleanly and works very well. Thank you. Regards Pavel Stehule 2010/9/15 Peter Eisentraut pete...@gmx.net: Following up on my previous patch [0], here is a fairly complete implementation of this feature. The general description and implementation outline of the previous message still apply. This patch contains documentation and regression tests, which can serve as further explanations. As this patch touches pretty much everything in the system, there are probably countless bugs and bogosities, some of which I have marked with FIXME, TODO, etc. But all the functionality is basically there, so it's time someone else gives this a serious examination. Note: As previously, regression tests only work with make check MULTIBYTE=UTF8, and the feature overall only works on Linux/glibc.
[0] http://archives.postgresql.org/message-id/1279045531.32647.14.ca...@vanquo.pezone.net
Re: [HACKERS] Per-column collation, work in progress
On Thu, Sep 23, 2010 at 5:12 PM, Pavel Stehule pavel.steh...@gmail.com wrote: 3. postgres=# select to_char(current_date,'tmday') collate cs_CZ.utf8; to_char ── thursday -- bad result (1 row) COLLATE means collation rather than locale, no? 5. postgres=# create table xy(a text, b text collate cs_CZ); ERROR: collation cs_CZ for current database encoding UTF8 does not exist can be there some more friendly message or hint ? I hope Postgres automatically detects the omitted encoding because it knows the database encoding is UTF8. -- Itagaki Takahiro
Re: [HACKERS] Per-column collation, work in progress
2010/9/23 Itagaki Takahiro itagaki.takah...@gmail.com: On Thu, Sep 23, 2010 at 5:12 PM, Pavel Stehule pavel.steh...@gmail.com wrote: 3. postgres=# select to_char(current_date,'tmday') collate cs_CZ.utf8; to_char ── thursday -- bad result (1 row) COLLATE means collation rather than locale, no? ok. 5. postgres=# create table xy(a text, b text collate cs_CZ); ERROR: collation cs_CZ for current database encoding UTF8 does not exist can be there some more friendly message or hint ? I hope Postgres automatically detects the omitted encoding because it knows the database encoding is UTF8. I know what this issue means, but it needs some detail or a hint, I think. Regards Pavel
Re: [HACKERS] Per-column collation, work in progress
On tor, 2010-09-23 at 10:12 +0200, Pavel Stehule wrote: 1. It's doesn't work with SQL 92 rules for sortby list. I can understand so explicit COLLATE using doesn't work, but the implicit using doesn't work too: CREATE TABLE foo(a text, b text COLLATE cs_CZ.UTF8) SELECT * FROM foo ORDER BY 1 -- produce wrong order I can't reproduce that. Please provide more details. 2. Why default encoding for collate is static? There are latin2 for czech, cs_CZ and cs_CZ.iso88592. So any user with UTF8 has to write encoding explicitly. But the more used and preferred encoding is UTF8 now. I am thinking so cs_CZ on utf8 database should mean cs_CS.UTF8. That's tweakable. One idea I had is to strip the .utf8 suffix from locale names when populating the pg_collation catalog, or create both versions. I agree that the current way is a bit cumbersome. 3. postgres=# select to_char(current_date,'tmday') collate cs_CZ.utf8; to_char ── thursday -- bad result (1 row) As was already pointed out, collation only covers lc_collate and lc_ctype. (It could cover other things, for example an application to the money type was briefly discussed, but that's outside the current mandate.) As a point of order, what you wrote above attaches a collation to the result of the function call. To get the collation to apply to the function call itself, you have to put the collate clause on one of the arguments, e.g., select to_char(current_date,'tmday' collate cs_CZ.utf8); 4. is somewhere ToDo for collation implementation? At the moment it's mostly in the source code. I have a list of notes locally that I can clean up and put in the wiki once we agree on the general direction. 5. postgres=# create table xy(a text, b text collate cs_CZ); ERROR: collation cs_CZ for current database encoding UTF8 does not exist can be there some more friendly message or hint ? like you cannot to use a different encoding. This collate is in pg_collates table. That can surely be polished. 
Re: [HACKERS] Per-column collation, work in progress
On tor, 2010-09-23 at 17:29 +0900, Itagaki Takahiro wrote: On Thu, Sep 23, 2010 at 5:12 PM, Pavel Stehule pavel.steh...@gmail.com wrote: 5. postgres=# create table xy(a text, b text collate cs_CZ); ERROR: collation cs_CZ for current database encoding UTF8 does not exist can be there some more friendly message or hint ? I hope Postgres automatically detects the omitted encoding because it knows the database encoding is UTF8. I would rather not build too many expectations into this yet. The collation names are chosen by the user, the locale names are from the operating system. There is not necessarily a correspondence. The best fix is probably what I described earlier, populate the pg_collation table with the .utf8 suffix stripped.
Re: [HACKERS] Per-column collation, work in progress
2010/9/23 Peter Eisentraut pete...@gmx.net: On tor, 2010-09-23 at 10:12 +0200, Pavel Stehule wrote: 1. It's doesn't work with SQL 92 rules for sortby list. I can understand so explicit COLLATE using doesn't work, but the implicit using doesn't work too: CREATE TABLE foo(a text, b text COLLATE cs_CZ.UTF8) SELECT * FROM foo ORDER BY 1 -- produce wrong order I can't reproduce that. Please provide more details. sorry, it is ok - I was confused 2. Why default encoding for collate is static? There are latin2 for czech, cs_CZ and cs_CZ.iso88592. So any user with UTF8 has to write encoding explicitly. But the more used and preferred encoding is UTF8 now. I am thinking so cs_CZ on utf8 database should mean cs_CS.UTF8. That's tweakable. One idea I had is to strip the .utf8 suffix from locale names when populating the pg_collation catalog, or create both versions. I agree that the current way is a bit cumbersome. yes. now almost all databases are in utf8 3. postgres=# select to_char(current_date,'tmday') collate cs_CZ.utf8; to_char ── thursday -- bad result (1 row) As was already pointed out, collation only covers lc_collate and lc_ctype. (It could cover other things, for example an application to the money type was briefly discussed, but that's outside the current mandate.) ook As a point of order, what you wrote above attaches a collation to the result of the function call. To get the collation to apply to the function call itself, you have to put the collate clause on one of the arguments, e.g., select to_char(current_date,'tmday' collate cs_CZ.utf8); I am thinking, collates can be used for this purpose too. I see some impacts - this syntax changes a stable function to immutable and it cannot be simple to solve. 4. is somewhere ToDo for collation implementation? At the moment it's mostly in the source code. I have a list of notes locally that I can clean up and put in the wiki once we agree on the general direction. 5. 
postgres=# create table xy(a text, b text collate cs_CZ); ERROR: collation cs_CZ for current database encoding UTF8 does not exist can be there some more friendly message or hint ? like you cannot to use a different encoding. This collate is in pg_collates table. That can surely be polished. Regards Pavel Stehule
Re: [HACKERS] Per-column collation, work in progress
On Wed, Sep 22, 2010 at 10:29 PM, Peter Eisentraut pete...@gmx.net wrote: We could support it also on MSVC. http://msdn.microsoft.com/en-us/library/a7cwbx4t(v=VS.90).aspx -- _strcoll_l http://msdn.microsoft.com/en-us/library/45119yx3(v=VS.90).aspx -- _towupper_l Great. If we support both glibc and msvc, how do we handle CREATE TABLE DDLs in pg_dump? Since collation names depend on platforms, a backup dumped on UNIX cannot be reloaded on Windows. We might need to normalize locale names to generate a portable dump. -- Itagaki Takahiro
Re: [HACKERS] Per-column collation, work in progress
On fre, 2010-09-24 at 09:32 +0900, Itagaki Takahiro wrote: On Wed, Sep 22, 2010 at 10:29 PM, Peter Eisentraut pete...@gmx.net wrote: We could support it also on MSVC. http://msdn.microsoft.com/en-us/library/a7cwbx4t(v=VS.90).aspx -- _strcoll_l http://msdn.microsoft.com/en-us/library/45119yx3(v=VS.90).aspx -- _towupper_l Great. If we support both glibc and msvc, how do we handle CREATE TABLE DDLs in pg_dump? Since collation names depend on platforms, a backup dumped on UNIX cannot be reloaded on Windows. We might need to normalize locale names to generate a portable dump. It is not necessary that the SQL collation names are the same as the OS locale names. That is just a convenient way to initialize it. If you need to transport dumps, you can create your own SQL collation entry at the target that locally fits what you are trying to do (or perhaps rename the collation at the source). I don't think there is a universally applicable way to normalize locale names, because we want to support user-defined OS locales.
Re: [HACKERS] Per-column collation, work in progress
On Thu, Sep 16, 2010 at 5:46 AM, Peter Eisentraut pete...@gmx.net wrote: Following up on my previous patch [0], here is a fairly complete implementation of this feature. The general description and implementation outline of the previous message still apply. This patch contains documentation and regression tests, which can serve as further explanations. I tested the patch on database with encoding=UTF8 and locale-C. I have a couple of questions and comments. * CREATE TABLE (LIKE table_with_collation) doesn't inherit collations. We need to copy collations by default, or add INCLUDING COLLATE option. * upper() doesn't work if a column has a collation. It still works if a column doesn't have a collation. postgres=# \d tbl Table public.tbl Column | Type | Modifiers +--+ c | text | collate C ja | text | collate ja_JP.utf8 postgres=# SELECT name, setting FROM pg_settings WHERE name IN ('lc_ctype', 'lc_collate'); name| setting +- lc_collate | C lc_ctype | C (2 rows) postgres=# SELECT upper(c) FROM tbl; ERROR: invalid multibyte character for locale HINT: The server's LC_CTYPE locale is probably incompatible with the database encoding. postgres=# SELECT upper(ja) FROM tbl; ERROR: invalid multibyte character for locale HINT: The server's LC_CTYPE locale is probably incompatible with the database encoding * Comparison of strings with different collations is forbidden, but assignment is allowed, right? postgres=# SELECT * FROM tbl WHERE c = ja; ERROR: collation mismatch between implicit collations C and ja_JP.utf8 LINE 1: SELECT * FROM tbl WHERE c = ja; ^ HINT: You can override the collation by applying the COLLATE clause to one or both expressions. postgres=# INSERT INTO tbl(c, ja) SELECT ja, c FROM tbl; INSERT 0 6 * psql \d needs a separator between collate and not null modifiers. 
postgres=# ALTER TABLE tbl ALTER COLUMN c SET NOT NULL; ALTER TABLE postgres=# \d tbl Table public.tbl Column | Type | Modifiers +--+ c | text | collate Cnot null= HERE ja | text | collate ja_JP.utf8 the feature overall only works on Linux/glibc. We could support it also on MSVC. http://msdn.microsoft.com/en-us/library/a7cwbx4t(v=VS.90).aspx -- _strcoll_l http://msdn.microsoft.com/en-us/library/45119yx3(v=VS.90).aspx -- _towupper_l -- Itagaki Takahiro
Re: [HACKERS] Per-column collation, work in progress
On ons, 2010-09-22 at 19:44 +0900, Itagaki Takahiro wrote: * CREATE TABLE (LIKE table_with_collation) doesn't inherit collations. We need to copy collations by default, or add INCLUDING COLLATE option. OK, should be easy to fix. * upper() doesn't work if a column has a collation. It still works if a column doesn't have a collation. I think what you are observing is the result of mixing C and non-C locales. Of course that should also be fixed, but it doesn't have much to do with what upper() does. Note that there is a regression test case for lower(), which works mostly the same way. * Comparison of strings with different collations is forbidden, but assignment is allowed, right? Correct. * psql \d needs a separator between collate and not null modifiers. OK. We could support it also on MSVC. http://msdn.microsoft.com/en-us/library/a7cwbx4t(v=VS.90).aspx -- _strcoll_l http://msdn.microsoft.com/en-us/library/45119yx3(v=VS.90).aspx -- _towupper_l Great.
Re: [HACKERS] Per-column collation, proof of concept
On tis, 2010-08-17 at 01:16 -0500, Jaime Casanova wrote: creating collations ...FATAL: invalid byte sequence for encoding UTF8: 0xe56c09 CONTEXT: COPY tmp_pg_collation, line 86 STATEMENT: COPY tmp_pg_collation FROM E'/usr/local/pgsql/9.1/share/locales.txt'; Hmm, what is in that file on that line? bokmål ISO-8859-1 Hey, that borders on genius: Use a non-ASCII letter in the name of a locale whose purpose it is to configure how non-ASCII letters are interpreted. :-/ Interestingly, I don't see this on a Debian system. Good thing to know that this needs separate testing on different Linux variants.
Re: [HACKERS] Per-column collation, proof of concept
On Wed, Aug 18, 2010 at 11:29 AM, Peter Eisentraut pete...@gmx.net wrote: On tis, 2010-08-17 at 01:16 -0500, Jaime Casanova wrote: creating collations ...FATAL: invalid byte sequence for encoding UTF8: 0xe56c09 CONTEXT: COPY tmp_pg_collation, line 86 STATEMENT: COPY tmp_pg_collation FROM E'/usr/local/pgsql/9.1/share/locales.txt'; Hmm, what is in that file on that line? bokmål ISO-8859-1 Hey, that borders on genius: Use a non-ASCII letter in the name of a locale whose purpose it is to configure how non-ASCII letters are interpreted. :-/ Interestingly, I don't see this on a Debian system. Good thing to know that this needs separate testing on different Linux variants. Yeah! And when installing CentOS 5 I don't get a chance to choose which locales I want; it just installs all of them. -- Jaime Casanova www.2ndQuadrant.com Soporte y capacitación de PostgreSQL
Re: [HACKERS] Per-column collation, proof of concept
On Mon, Aug 16, 2010 at 10:56 PM, Peter Eisentraut pete...@gmx.net wrote: On lör, 2010-08-14 at 02:05 -0500, Jaime Casanova wrote: BTW, why the double quotes? Because the name contains upper case letters? why everything seems so obvious once someone else state it? :) sorry to state the obvious but this doesn't work on windows, does it? Probably not, but hopefully there is some similar API that could be used on Windows. good luck with that! ;) seriously, maybe this helps http://msdn.microsoft.com/en-us/library/system.windows.forms.inputlanguage.installedinputlanguages.aspx but probably you will need to write the code yourself... at least i don't think there is something like locale -a and for some reason it also didn't work on a centos 5 (this error ocurred when initdb'ing) loading system objects' descriptions ... ok creating collations ...FATAL: invalid byte sequence for encoding UTF8: 0xe56c09 CONTEXT: COPY tmp_pg_collation, line 86 STATEMENT: COPY tmp_pg_collation FROM E'/usr/local/pgsql/9.1/share/locales.txt'; Hmm, what is in that file on that line? 
bokmål ISO-8859-1 (i'm attaching the locales.txt just in case) -- Jaime Casanova www.2ndQuadrant.com Soporte y capacitación de PostgreSQL [attachment: locales.txt -- the full locale -a listing of locale names and their encodings; the relevant line is "bokmål ISO-8859-1"]
Re: [HACKERS] Per-column collation, proof of concept
On lör, 2010-08-14 at 02:05 -0500, Jaime Casanova wrote: btw, the patch no longer apply cleanly but most are just hunks the worst it's in src/backend/catalog/namespace.c because FindConversionByName() is now called get_conversion_oid()... so maybe this function should be named get_collation_oid(), i guess OK, that will need to be adjusted. well at least pg_collation should be a shared catalog, no? and i think we shouldn't be thinking in this without think first how to integrate this with at least per-database configuration Good point. But one might also want to create private collations, so a collation in a schema would also be useful. Tricky. also, it doesn't recognize C collate although it is in the locales.txt test3=# create database test4 with template=template0 encoding 'utf-8' lc_collate='C'; CREATE DATABASE test3=# create table t3 (col1 text collate "C" ); ERROR: collation C does not exist I've fixed this in the meantime. Your version of the patch doesn't support the C locale yet. BTW, why the double quotes? Because the name contains upper case letters? sorry to state the obvious but this doesn't work on windows, does it? Probably not, but hopefully there is some similar API that could be used on Windows. and for some reason it also didn't work on a centos 5 (this error ocurred when initdb'ing) loading system objects' descriptions ... ok creating collations ...FATAL: invalid byte sequence for encoding UTF8: 0xe56c09 CONTEXT: COPY tmp_pg_collation, line 86 STATEMENT: COPY tmp_pg_collation FROM E'/usr/local/pgsql/9.1/share/locales.txt'; Hmm, what is in that file on that line?
Re: [HACKERS] Per-column collation, proof of concept
Hi, sorry for the delay... btw, the patch no longer applies cleanly, but most failures are just hunks; the worst is in src/backend/catalog/namespace.c, because FindConversionByName() is now called get_conversion_oid()... so maybe this function should be named get_collation_oid(), I guess. On Tue, Aug 3, 2010 at 11:32 AM, Peter Eisentraut pete...@gmx.net wrote: On mån, 2010-08-02 at 01:43 -0500, Jaime Casanova wrote: nowadays, CREATE DATABASE has a lc_collate clause. is the new collate clause similar to the lc_collate? i mean, is lc_collate what we will use as a default? Yes, if you do not specify anything per column, the database default is used. How to integrate the per-database or per-cluster configuration with the new system is something to figure out in the future. well, at least pg_collation should be a shared catalog, no? and i think we shouldn't be thinking about this without first thinking how to integrate it with at least per-database configuration if yes, then probably we need to use pg_collation there too because lc_collate and the new collate clause use different collation names. postgres=# create database test with lc_collate 'en_US.UTF-8'; CREATE DATABASE test=# create table t1 (col1 text collate en_US.UTF-8); ERROR: collation en_US.UTF-8 does not exist test=# create table t1 (col1 text collate en_US.utf8); CREATE TABLE This is something that libc does for you. The locale as listed by locale -a is called en_US.utf8, but apparently libc takes en_US.UTF-8 as well. ok, but at least this is confusing also, it doesn't recognize the C collation although it is in the locales.txt test3=# create database test4 with template=template0 encoding 'utf-8' lc_collate='C'; CREATE DATABASE test3=# create table t3 (col1 text collate "C" ); ERROR: collation C does not exist BTW, why the double quotes? also i got errors from regression tests when MULTIBYTE=UTF8 (attached). it seems i was trying to create locales that weren't defined in locales.txt (where was that file fed from?).
i added a line to that file (for es_EC.utf8) then i create a table with a column using that collate and execute select * from t2 where col1 > 'n'; and i got this error: ERROR: could not create locale es_EC.utf8 (of course, that last part was me messing the things up, but it shows we shouldn't be using a file locales.txt, i think) It might be that you don't have those locales installed in your system. locales.txt is created by using locale -a. Check what that gives you. sorry to state the obvious but this doesn't work on windows, does it? and for some reason it also didn't work on a centos 5 (this error ocurred when initdb'ing) loading system objects' descriptions ... ok creating collations ...FATAL: invalid byte sequence for encoding UTF8: 0xe56c09 CONTEXT: COPY tmp_pg_collation, line 86 STATEMENT: COPY tmp_pg_collation FROM E'/usr/local/pgsql/9.1/share/locales.txt'; -- Jaime Casanova www.2ndQuadrant.com Soporte y capacitación de PostgreSQL
Re: [HACKERS] Per-column collation, proof of concept
On mån, 2010-08-02 at 01:43 -0500, Jaime Casanova wrote: nowadays, CREATE DATABASE has a lc_collate clause. is the new collate clause similar to the lc_collate? i mean, is lc_collate what we will use as a default? Yes, if you do not specify anything per column, the database default is used. How to integrate the per-database or per-cluster configuration with the new system is something to figure out in the future. if yes, then probably we need to use pg_collation there too because lc_collate and the new collate clause use different collation names. postgres=# create database test with lc_collate 'en_US.UTF-8'; CREATE DATABASE test=# create table t1 (col1 text collate en_US.UTF-8); ERROR: collation en_US.UTF-8 does not exist test=# create table t1 (col1 text collate en_US.utf8); CREATE TABLE This is something that libc does for you. The locale as listed by locale -a is called en_US.utf8, but apparently libc takes en_US.UTF-8 as well. also i got errors from regression tests when MULTIBYTE=UTF8 (attached). it seems i was trying to create locales that weren't defined in locales.txt (where was that file fed from?). i added a line to that file (for es_EC.utf8) then i create a table with a column using that collate and execute select * from t2 where col1 > 'n'; and i got this error: ERROR: could not create locale es_EC.utf8 (of course, that last part was me messing the things up, but it shows we shouldn't be using a file locales.txt, i think) It might be that you don't have those locales installed in your system. locales.txt is created by using locale -a. Check what that gives you. i can attach a collate to a domain but i can't see where we are storing that info (actually it says it's not collatable): Domain support is not done yet.
Re: [HACKERS] Per-column collation, proof of concept
On Tue, Jul 13, 2010 at 1:25 PM, Peter Eisentraut pete...@gmx.net wrote: Here is a proof of concept for per-column collation support. Hi, i was looking at this. nowadays, CREATE DATABASE has a lc_collate clause. is the new collate clause similar to the lc_collate? i mean, is lc_collate what we will use as a default? if yes, then probably we need to use pg_collation there too because lc_collate and the new collate clause use different collation names. postgres=# create database test with lc_collate 'en_US.UTF-8'; CREATE DATABASE test=# create table t1 (col1 text collate en_US.UTF-8); ERROR: collation en_US.UTF-8 does not exist test=# create table t1 (col1 text collate en_US.utf8); CREATE TABLE also i got errors from regression tests when MULTIBYTE=UTF8 (attached). it seems i was trying to create locales that weren't defined in locales.txt (where was that file fed from?). i added a line to that file (for es_EC.utf8) then i create a table with a column using that collate and execute select * from t2 where col1 > 'n'; and i got this error: ERROR: could not create locale es_EC.utf8 (of course, that last part was me messing the things up, but it shows we shouldn't be using a file locales.txt, i think) i can attach a collate to a domain but i can't see where we are storing that info (actually it says it's not collatable): -- Jaime Casanova www.2ndQuadrant.com Soporte y capacitación de PostgreSQL [attachment: regression.diffs]
Re: [HACKERS] Per-column collation, proof of concept
On tor, 2010-07-15 at 05:57 +0200, Pavel Stehule wrote: :( maybe we have to enhance a locales - or do some work in this way. In Czech's IS is relative often operation some like name = 'Stěhule' COLLATION cs_CZ_cs_ai -- compare case insensitive accent insensitive PostgreSQL is last db, that doesn't integreated support for it Well, the comparison function varstr_cmp() contains this comment: /* * In some locales strcoll() can claim that nonidentical strings are * equal. Believing that would be bad news for a number of reasons, * so we follow Perl's lead and sort equal strings according to * strcmp(). */ This might not be strictly necessary, seeing that citext obviously doesn't work that way, but resolving this is really an orthogonal issue. If you fix that and you have a locale that does what you want, my patch will help you get your example working.
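The tie-break described in that varstr_cmp() comment can be sketched outside of C. This is a hypothetical stand-in, not PostgreSQL's actual code: a toy lower()-based collation plays the role of a strcoll() that reports nonidentical strings as equal, and the wrapper falls back to a plain bytewise comparison in that case:

```python
from functools import cmp_to_key

def collate_ci(a: str, b: str) -> int:
    # Stand-in for strcoll(): a case-insensitive collation that can
    # claim nonidentical strings are equal (e.g. "abc" vs "ABC").
    al, bl = a.lower(), b.lower()
    return (al > bl) - (al < bl)

def varstr_cmp(a: str, b: str) -> int:
    # Mirror of the quoted comment's logic: trust the collation first,
    # but break "equal" ties with a plain strcmp()-style comparison so
    # that only identical strings ever compare equal.
    r = collate_ci(a, b)
    if r == 0 and a != b:
        return (a > b) - (a < b)
    return r

words = ["ABC", "abd", "abc"]
print(sorted(words, key=cmp_to_key(varstr_cmp)))  # ['ABC', 'abc', 'abd']
```

With the fallback, "ABC" and "abc" get a stable relative order instead of being reported equal, which is exactly what the comment says is needed for downstream code.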
Re: [HACKERS] Per-column collation, proof of concept
Peter Eisentraut pete...@gmx.net writes: Well, the comparison function varstr_cmp() contains this comment: /* In some locales strcoll() can claim that nonidentical strings are equal. Believing that would be bad news for a number of reasons, so we follow Perl's lead and sort equal strings according to strcmp(). */ This might not be strictly necessary, seeing that citext obviously doesn't work that way, but resolving this is really an orthogonal issue.

The problem with not doing that is that it breaks hashing --- hash joins and hash aggregation being the real pain points.

citext works around this in a rather klugy fashion by decreeing that two strings are equal iff their str_tolower() conversions are bitwise equal. So it can hash the str_tolower() representation. But that's kinda slow, and it fails in the general case anyhow, I think.

regards, tom lane
Re: [HACKERS] Per-column collation, proof of concept
On Thu, Jul 15, 2010 at 4:24 PM, Tom Lane t...@sss.pgh.pa.us wrote: The problem with not doing that is it breaks hashing --- hash joins and hash aggregation being the real pain points. citext works around this in a rather klugy fashion by decreeing that two strings are equal iff their str_tolower() conversions are bitwise equal. So it can hash the str_tolower() representation. But that's kinda slow and it fails in the general case anyhow, I think.

I think the general equivalent would be to call strxfrm and hash the result of that.

-- greg
Re: [HACKERS] Per-column collation, proof of concept
Peter Eisentraut pete...@gmx.net wrote: Here is a proof of concept for per-column collation support.

Did you want a WIP review of that patch? (CF closing to new submissions soon)

-Kevin
Re: [HACKERS] Per-column collation, proof of concept
Hello

I have only one question - if I understand well, you can use COLLATE just for sorting. What is your plan for range search operations? Sorting is interesting and, I am sure, important for multilingual applications; for me, more important is case-sensitive/case-insensitive and accent-sensitive/accent-insensitive filtering. Do you have a plan for it?

Regards Pavel Stehule

2010/7/13 Peter Eisentraut pete...@gmx.net: Here is a proof of concept for per-column collation support. Here is how it works: When creating a table, an optional COLLATE clause can specify a collation name, which is stored (by OID) in pg_attribute. This becomes part of the type information and is propagated through expression parse analysis, like typmod.

When an operator or function call is parsed (transformed), the collations of the arguments are unified, using some rules (like type analysis, but different in detail). The collations of the function/operator arguments come either from Var nodes, which in turn got them from pg_attribute, or from other function and operator calls, or you can override them with explicit COLLATE clauses (not yet implemented, but will work a bit like RelabelType). At the end, each function or operator call gets one collation to use.

what about DISTINCT clause, maybe GROUP BY clause? regards Pavel

The function call itself can then look up the collation using the fcinfo->flinfo->fn_expr field. (Works for operator calls, but doesn't work for sort operations; needs more thought.)

A collation is in this implementation defined as an lc_collate string and an lc_ctype string. The implementation of functions interested in that information, such as comparison operators, or the upper and lower functions, will take the collation OID that is passed in, look up the locale string, and use the xlocale.h interface (newlocale(), strcoll_l()) to compute the result. (Note that the xlocale stuff is only 10 or so lines in this patch.
It should be feasible to allow other appropriate locale libraries to be used.)

Loose ends:
- Support function calls (currently only operator calls) (easy)
- Implementation of sort clauses
- Indexing support/integration
- Domain support (should be straightforward)
- Make all expression node types deal with collation information appropriately
- Explicit COLLATE clause on expressions
- Caching and not leaking memory of locale lookups
- I have typcollatable to mark which types can accept collation information, but perhaps there should also be proicareaboutcollation to skip collation resolution when none of the functions in the expression tree care.

You can start by reading the collate.sql regression test file to see what it can do. Btw., regression tests only work with make check MULTIBYTE=UTF8. And it (probably) only works with glibc for now.
Re: [HACKERS] Per-column collation, proof of concept
On ons, 2010-07-14 at 19:35 +0200, Pavel Stehule wrote: I have only one question - if I understand well, you can use COLLATE just for sorting. What is your plan for range search operations?

My patch does range searches. Sorting uses the same operators, so both will be supported. (Sorting is not yet implemented, as I had written.)

Sorting is interesting and, I am sure, important for multilingual applications; for me, more important is case-sensitive/case-insensitive and accent-sensitive/accent-insensitive filtering. Do you have a plan for it?

You may be able to do some of these by using appropriate locale definitions. I'd need some examples to be able to tell for sure.

what about DISTINCT clause, maybe GROUP BY clause?

DISTINCT and GROUP BY work with equality, which is not affected by locales (at least under the current rules).
Re: [HACKERS] Per-column collation, proof of concept
2010/7/14 Peter Eisentraut pete...@gmx.net: On ons, 2010-07-14 at 19:35 +0200, Pavel Stehule wrote: I have only one question - if I understand well, you can use COLLATE just for sorting. What is your plan for range search operations? My patch does range searches. Sorting uses the same operators, so both will be supported. (Sorting is not yet implemented, as I had written.) Sorting is interesting and, I am sure, important for multilingual applications; for me, more important is case-sensitive/case-insensitive and accent-sensitive/accent-insensitive filtering. Do you have a plan for it? You may be able to do some of these by using appropriate locale definitions. I'd need some examples to be able to tell for sure. what about DISTINCT clause, maybe GROUP BY clause? DISTINCT and GROUP BY work with equality, which is not affected by locales (at least under the current rules).

:( Maybe we have to enhance the locales, or do some work in that direction. In Czech information systems, an operation like

name = 'Stěhule' COLLATION cs_CZ_cs_ai -- compare case-insensitive, accent-insensitive

is relatively common. PostgreSQL is the last DB that doesn't have integrated support for it.

Regards Pavel