Re: [HACKERS] Per-column collation, the finale

2011-02-09 Thread Peter Eisentraut
On tis, 2011-02-08 at 22:17 +, Thom Brown wrote:
 postgres=# create table meow (id serial, stuff text collate de_XX);
 NOTICE:  CREATE TABLE will create implicit sequence "meow_id_seq" for
 serial column "meow.id"
 ERROR:  collation "de_XX" for current database encoding "UTF8" does not exist
 LINE 1: create table meow (id serial, stuff text collate de_XX);
 
 I wouldn't expect to see that first notice.  Shouldn't that step come
 a bit later?

This isn't much different from writing

create table meow (id serial, stuff nonsense);

You'll still get the notice before it errors out on type-not-found.

 A bit of weirdness, I'm allowed to specify more than one collation on
 a single column ordering...

 Is this the same principle as casting, where they can be chained?
 Which one wins in this case?

Yeah, last one wins.
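A short illustrative sketch (hypothetical, reusing Thom's table and locale names): each COLLATE clause re-labels the expression, so the outermost clause determines the sort.

```sql
-- Chained COLLATE clauses: the last (outermost) one wins, so these two
-- queries should behave identically.  Locale names assume a glibc system.
SELECT * FROM meow ORDER BY stuff COLLATE "en_GB" COLLATE "de_DE" DESC;
SELECT * FROM meow ORDER BY stuff COLLATE "de_DE" DESC;
```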

 Also, if a locale is installed after initdb, is it then impossible to
 get pg_collation to pick up that new locale?

Currently, you can insert it yourself into pg_collation.  I have a
CREATE COLLATION patch in the works.
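For illustration only (the CREATE COLLATION patch mentioned here was not yet committed): the eventual SQL-level command looked roughly like the sketch below, as an alternative to hand-inserting a row with the pg_collation columns shown later in this thread (collname, collnamespace, collencoding, collcollate, collctype).

```sql
-- Hypothetical sketch; the exact syntax and locale names are assumptions.
-- Registers a new collation compatible with the database encoding.
CREATE COLLATION "de_XX" (LC_COLLATE = 'de_DE.utf8', LC_CTYPE = 'de_DE.utf8');
```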

 If a locale is somehow
 removed from the system, what happens on the database side when
 attempting to use a collated column?

Then you're hosed, but that has always been the case, with per-cluster
and per-database locales.



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Per-column collation, the finale

2011-02-08 Thread Peter Eisentraut
On tor, 2011-02-03 at 18:36 -0500, Noah Misch wrote:
 Looks good and tests well.  I've attached the same benchmark script
 with updated timings, and I've marked the patch Ready for Committer.

Committed.  Thanks everyone.




Re: [HACKERS] Per-column collation, the finale

2011-02-08 Thread Thom Brown
On 8 February 2011 21:08, Peter Eisentraut pete...@gmx.net wrote:
 On tor, 2011-02-03 at 18:36 -0500, Noah Misch wrote:
 Looks good and tests well.  I've attached the same benchmark script
 with updated timings, and I've marked the patch Ready for Committer.

 Committed.  Thanks everyone.

Awesome work, Peter.  A few questions:

postgres=# create table meow (id serial, stuff text collate de_XX);
NOTICE:  CREATE TABLE will create implicit sequence "meow_id_seq" for
serial column "meow.id"
ERROR:  collation "de_XX" for current database encoding "UTF8" does not exist
LINE 1: create table meow (id serial, stuff text collate de_XX);

I wouldn't expect to see that first notice.  Shouldn't that step come
a bit later?

A bit of weirdness, I'm allowed to specify more than one collation on
a single column ordering...

postgres=# select * from meow order by stuff collate en_GB collate
de_DE desc;
 id | stuff
----+-------
  2 | meow2
  1 | meow
(2 rows)

Is this the same principle as casting, where they can be chained?
Which one wins in this case?  Although if anyone is actually doing
this, then it's just silly anyway. (says Thom having just done it)

Also, if a locale is installed after initdb, is it then impossible to
get pg_collation to pick up that new locale?  If a locale is somehow
removed from the system, what happens on the database side when
attempting to use a collated column? (I don't wish to TIAS on my own
system)

-- 
Thom Brown
Twitter: @darkixion
IRC (freenode): dark_ixion
Registered Linux user: #516935



Re: [HACKERS] Per-column collation, the finale

2011-02-04 Thread Bruce Momjian
Noah Misch wrote:
 On Thu, Feb 03, 2011 at 05:53:28PM +0200, Peter Eisentraut wrote:
  On tor, 2011-02-03 at 00:10 -0500, Noah Misch wrote:
This is good stuff.  I'll send you a new patch in a day or three for
perhaps another round of performance tests.  Some of the other issues
above can perhaps be postponed for follow-up patches.
   
   I agree -- if the performance-when-unused gets solid, none of my other
   comments ought to hold things up. 
  
  Here is a new patch.
  
  The main change is in varstr_cmp(), avoiding the calls to
  pg_newlocale_from_collation() when the default locale is used.  This
  accounts for the performance regression in my tests.  It also addresses
  some of your refactoring ideas.
 
 Looks good and tests well.  I've attached the same benchmark script with
 updated timings, and I've marked the patch Ready for Committer.

Nice to see that performance hit is removed now!

-- 
  Bruce Momjian  br...@momjian.us        http://momjian.us
  EnterpriseDB http://enterprisedb.com

  + It's impossible for everything to be true. +



Re: [HACKERS] Per-column collation, the finale

2011-02-03 Thread Noah Misch
On Thu, Feb 03, 2011 at 05:53:28PM +0200, Peter Eisentraut wrote:
 On tor, 2011-02-03 at 00:10 -0500, Noah Misch wrote:
   This is good stuff.  I'll send you a new patch in a day or three for
   perhaps another round of performance tests.  Some of the other issues
   above can perhaps be postponed for follow-up patches.
  
  I agree -- if the performance-when-unused gets solid, none of my other
  comments ought to hold things up. 
 
 Here is a new patch.
 
 The main change is in varstr_cmp(), avoiding the calls to
 pg_newlocale_from_collation() when the default locale is used.  This
 accounts for the performance regression in my tests.  It also addresses
 some of your refactoring ideas.

Looks good and tests well.  I've attached the same benchmark script with updated
timings, and I've marked the patch Ready for Committer.

nm
-- Setup
-- SELECT setseed(0);
-- CREATE TABLE t (c) AS
-- SELECT chr(1 + (random() * 65534)::int) FROM generate_series(1,1000) gen(n);

-- NOTE: due to concurrent load, timings typically varied 1-2s between runs.

-- en_US.UTF8, unpatched: 68.75s
-- en_US.UTF8,   patched: 67.95s
-- id_ID.UTF8, unpatched: 67.48s
-- id_ID.UTF8,   patched: 69.34s
SELECT min(c)

-- en_US.UTF8,   patched: 81.59s
-- id_ID.UTF8,   patched: 84.59s
--SELECT min(c COLLATE id_ID)

FROM (
SELECT c FROM t
UNION ALL SELECT c FROM t
UNION ALL SELECT c FROM t
UNION ALL SELECT c FROM t
UNION ALL SELECT c FROM t
UNION ALL SELECT c FROM t
UNION ALL SELECT c FROM t
UNION ALL SELECT c FROM t
) t_all;



Re: [HACKERS] Per-column collation, the finale

2011-02-02 Thread Peter Eisentraut
On lör, 2011-01-29 at 02:52 -0500, Noah Misch wrote:
 I'm reviewing this patch as part of CommitFest 2011-01.

Thank you very much for this thorough review.

 This no longer applies cleanly against git master, so I've done my testing
 against 52713d0. 

I will send an updated patch soon, when I have figured out the issues
discussed below.

 If you go about supporting non-GNU libc systems, you may want to look at the
 gnulib locale module[2] for a starting point.

I don't have a Mac system, but I'll see about figuring this out.

 The new documentation is helpful.  It would be useful to document the
 implications of marking your user-defined type COLLATABLE.  As best I can tell,
 you should only set COLLATABLE if functions that would be expected to respect
 collation, particularly members of a B-tree operator family, do check the fmgr
 collation.  Otherwise, the user might specify collations for data of your type
 and be surprised to get no differentiated behavior.  Are there other
 considerations?  Also, should implementers of COLLATABLE types observe any
 restrictions on when to respect the collation?  For example, is equality
 permitted to be sensitive to collation?

I improved the documentation in the CREATE TYPE man page about this.
Note, however, that the questions you raise are not new.  There already
is a set of types that observe locale information in a set of associated
functions.  The answers are already hidden somewhere; this feature just
exposes them in a different way.  Over time, we may want to document
more of this, but at the moment it's not clear how much to actually
expose.

 On that note, the following is accepted; should it be?
 
   CREATE TABLE parent (key text COLLATE sv_SE);
   CREATE UNIQUE INDEX ON parent(key COLLATE id_ID);
   CREATE TABLE child (key text COLLATE tr_TR REFERENCES parent(key));
 
 Does the system expect any invariants to hold on the platform implementation of
 collations?  For example, that they be consistent with equality in some way?  If
 so, would it be practical and/or helpful to provide a tool for testing a given
 platform locale for conformance?

We discussed this previously in this thread.  Currently, collation
support does not affect equality.  Values are only equal if they are
bytewise equal.  That's why the above is currently OK, but it should
probably be restricted in the future.  I'll make a note about it.
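To make that point concrete, a hedged sketch (assuming a UTF-8 database with a Swedish locale installed): equality stays bytewise, while ordering follows the collation.

```sql
-- Equality is bytewise, regardless of collation:
SELECT 'abc' = ('abc' COLLATE "sv_SE");   -- true
-- Ordering, by contrast, respects the collation; in Swedish,
-- 'å' traditionally sorts after 'z':
SELECT c FROM (VALUES ('å'), ('z')) AS t(c) ORDER BY c COLLATE "sv_SE";
```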

 pg_attribute entries for an index currently retain the attcollation of the table
 columns.  They should have the collation of the index columns.  At that point,
 pg_index.indcollation would exist for optimization only.

Fixed.

 Though there's no reason it ought to happen with the first commit, it would be
 really excellent for make check to detect when the platform has sufficient
 support to run the collation tests (HAVE_LOCALE_T, UTF8, sv_SE and tr_TR
 locales).  Manual execution is fine for something like numeric_big, but this is
 so platform-sensitive that testing it widely is important.

I solicited ideas about this a while ago, but no one came up with a
solution.  If we do get Mac or Windows support, it would be a good time
to revisit this and devise a platform-independent approach, if possible.

 It would likewise be nice to have a way to display the collation of an arbitrary
 expression.  A merger of pg_typeof and format_type would cover that.

Agreed.  I have a follow-up patch pending for that.

 Four call sites gain code like this (example is from varstr_cmp):
 
 +#ifdef HAVE_LOCALE_T
 +	locale_t	mylocale = pg_newlocale_from_collation(collid);
 +#else
 + check_default_collation(collid);
 +#endif
 ...
 +#ifdef HAVE_LOCALE_T
 + result = strcoll_l(a1p, a2p, mylocale);
 +#else
   result = strcoll(a1p, a2p);
 +#endif
 
 Under !HAVE_LOCALE_T, we could define a locale_t, a pg_newlocale_from_collation
 that merely calls check_default_collation, #define strcoll_l(a, b, c)
 strcoll(a, b), etc.  I can see six TODO sites with similar needs, and the
 abstraction would make the mainline code significantly cleaner.  Even so, it
 might not pay off.  Thoughts?

I had considered that, but I'm mainly hesitant about introducing a fake
locale_t type into the public namespace.  Maybe it should be wrapped
into a pg_locale_t; then we can do whatever we want.

 Couldn't check_default_collation be a no-op except on assert-only builds?  It's
 currently only called in !HAVE_LOCALE_T branches, in which case initdb will not
 have added non-default collations.  CREATE COLLATION will presumably be made to
 fail unconditionally on such builds, too.

Unless you register the HAVE_LOCALE_T state in pg_control, you need to
be prepared to deal with schemas that contain nondefault collations even
though the server binary doesn't support it.

 wchar2char and char2wchar take a collation argument, but they only use it to
 Assert(!lc_ctype_is_c(collation)).  Maybe that's best, but it seems weird.

Yeah, it's 

Re: [HACKERS] Per-column collation, the finale

2011-02-02 Thread Noah Misch
On Wed, Feb 02, 2011 at 11:20:44PM +0200, Peter Eisentraut wrote:
On lör, 2011-01-29 at 02:52 -0500, Noah Misch wrote:
  The new documentation is helpful.  It would be useful to document the
  implications of marking your user-defined type COLLATABLE.  As best I can tell,
  you should only set COLLATABLE if functions that would be expected to respect
  collation, particularly members of a B-tree operator family, do check the fmgr
  collation.  Otherwise, the user might specify collations for data of your type
  and be surprised to get no differentiated behavior.  Are there other
  considerations?  Also, should implementers of COLLATABLE types observe any
  restrictions on when to respect the collation?  For example, is equality
  permitted to be sensitive to collation?
 
 I improved the documentation in the CREATE TYPE man page about this.
 Note, however, that the questions you raise are not new.  There already
 is a set of types that observe locale information in a set of associated
 functions.  The answers are already hidden somewhere; this feature just
 exposes them in a different way.  Over time, we may want to document
 more of this, but at the moment it's not clear how much to actually
 expose.

That's a helpful way to think about it.

  On that note, the following is accepted; should it be?
  
CREATE TABLE parent (key text COLLATE sv_SE);
CREATE UNIQUE INDEX ON parent(key COLLATE id_ID);
CREATE TABLE child (key text COLLATE tr_TR REFERENCES parent(key));
  
  Does the system expect any invariants to hold on the platform implementation of
  collations?  For example, that they be consistent with equality in some way?  If
  so, would it be practical and/or helpful to provide a tool for testing a given
  platform locale for conformance?
 
 We discussed this previously in this thread.  Currently, collation
 support does not affect equality.  Values are only equal if they are
 bytewise equal.  That's why the above is currently OK, but it should
 probably be restricted in the future.  I'll make a note about it.

I see that this is true for texteq, bpchareq and citext_eq.  However, nothing
stops a user from creating a COLLATABLE type and making the = operator for its
B-tree operator class sensitive to collation, right?  Could that break any
system assumptions?  That is, do we need to document that user-defined hash and
B-tree operator classes must ignore locale in their = operators?

  Though there's no reason it ought to happen with the first commit, it would be
  really excellent for make check to detect when the platform has sufficient
  support to run the collation tests (HAVE_LOCALE_T, UTF8, sv_SE and tr_TR
  locales).  Manual execution is fine for something like numeric_big, but this is
  so platform-sensitive that testing it widely is important.
 
 I solicited ideas about this a while ago, but no one came up with a
 solution.  If we do get Mac or Windows support, it would be a good time
 to revisit this and devise a platform-independent approach, if possible.

As a rough idea, we could introduce the notion of a "skip condition" SQL command
for a pg_regress test file.  When the command throws an error, pg_regress skips
that file.  For these tests, the skip condition would resemble:
  SELECT 'foo' COLLATE sv_SE, 'bar' COLLATE tr_TR

  Four call sites gain code like this (example is from varstr_cmp):
  
  +#ifdef HAVE_LOCALE_T
  +	locale_t	mylocale = pg_newlocale_from_collation(collid);
  +#else
  +   check_default_collation(collid);
  +#endif
  ...
  +#ifdef HAVE_LOCALE_T
  +   result = strcoll_l(a1p, a2p, mylocale);
  +#else
  result = strcoll(a1p, a2p);
  +#endif
  
  Under !HAVE_LOCALE_T, we could define a locale_t, a pg_newlocale_from_collation
  that merely calls check_default_collation, #define strcoll_l(a, b, c)
  strcoll(a, b), etc.  I can see six TODO sites with similar needs, and the
  abstraction would make the mainline code significantly cleaner.  Even so, it
  might not pay off.  Thoughts?
 
 I had considered that, but I'm mainly hesitant about introducing a fake
 locale_t type into the public namespace.  Maybe it should be wrapped
 into a pg_locale_t; then we can do whatever we want.

True.  There are advantages and disadvantages to both.  No strong opinion here.

  Couldn't check_default_collation be a no-op except on assert-only builds?  It's
  currently only called in !HAVE_LOCALE_T branches, in which case initdb will not
  have added non-default collations.  CREATE COLLATION will presumably be made to
  fail unconditionally on such builds, too.
 
 Unless you register the HAVE_LOCALE_T state in pg_control, you need to
 be prepared to deal with schemas that contain nondefault collations even
 though the server binary doesn't support it.

Makes sense.

[local] test=# SELECT prosrc COLLATE id_ID, count(*) FROM pg_proc GROUP BY prosrc;
-- worked ...
[local] test=# SELECT prosrc 

Re: [HACKERS] Per-column collation, the finale

2011-01-30 Thread Robert Haas
On Sat, Jan 29, 2011 at 2:52 AM, Noah Misch n...@leadboat.com wrote:
 The 13% slowdown with the feature unused seems worrisome

Very worrisome.  This is a frequently-requested feature, but this
seems like a potential show-stopper.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Per-column collation, the finale

2011-01-28 Thread Noah Misch
Hi Peter,

I'm reviewing this patch as part of CommitFest 2011-01.

On Fri, Jan 14, 2011 at 11:41:46PM +0200, Peter Eisentraut wrote:
 I've been going over this patch with a fine-tooth comb for the last two
 weeks, fixed some small bugs, added comments, made initdb a little
 friendlier, but no substantial changes.
 
 I'm going to start working on SQL-level CREATE/DROP/ALTER COLLATION
 support and associated things now.

This no longer applies cleanly against git master, so I've done my testing
against 52713d0.  I tested on Ubuntu 8.04 (GNU libc 2.7) and used the
configure.in change I sent earlier[1] to get locale support detected.  Since the
time to seriously verify every patch hunk would be measured in weeks, I have not
attempted that level of review detail.  I've just commented on what stood out
while reading and using the patch.

If you go about supporting non-GNU libc systems, you may want to look at the
gnulib locale module[2] for a starting point.

The new documentation is helpful.  It would be useful to document the
implications of marking your user-defined type COLLATABLE.  As best I can tell,
you should only set COLLATABLE if functions that would be expected to respect
collation, particularly members of a B-tree operator family, do check the fmgr
collation.  Otherwise, the user might specify collations for data of your type
and be surprised to get no differentiated behavior.  Are there other
considerations?  Also, should implementers of COLLATABLE types observe any
restrictions on when to respect the collation?  For example, is equality
permitted to be sensitive to collation?  On that note, the following is
accepted; should it be?

  CREATE TABLE parent (key text COLLATE sv_SE);
  CREATE UNIQUE INDEX ON parent(key COLLATE id_ID);
  CREATE TABLE child (key text COLLATE tr_TR REFERENCES parent(key));

Does the system expect any invariants to hold on the platform implementation of
collations?  For example, that they be consistent with equality in some way?  If
so, would it be practical and/or helpful to provide a tool for testing a given
platform locale for conformance?

pg_attribute entries for an index currently retain the attcollation of the table
columns.  They should have the collation of the index columns.  At that point,
pg_index.indcollation would exist for optimization only.

Though there's no reason it ought to happen with the first commit, it would be
really excellent for make check to detect when the platform has sufficient
support to run the collation tests (HAVE_LOCALE_T, UTF8, sv_SE and tr_TR
locales).  Manual execution is fine for something like numeric_big, but this is
so platform-sensitive that testing it widely is important.

It would likewise be nice to have a way to display the collation of an arbitrary
expression.  A merger of pg_typeof and format_type would cover that.

Four call sites gain code like this (example is from varstr_cmp):

+#ifdef HAVE_LOCALE_T
+	locale_t	mylocale = pg_newlocale_from_collation(collid);
+#else
+   check_default_collation(collid);
+#endif
...
+#ifdef HAVE_LOCALE_T
+   result = strcoll_l(a1p, a2p, mylocale);
+#else
result = strcoll(a1p, a2p);
+#endif

Under !HAVE_LOCALE_T, we could define a locale_t, a pg_newlocale_from_collation
that merely calls check_default_collation, #define strcoll_l(a, b, c)
strcoll(a, b), etc.  I can see six TODO sites with similar needs, and the
abstraction would make the mainline code significantly cleaner.  Even so, it
might not pay off.  Thoughts?

Couldn't check_default_collation be a no-op except on assert-only builds?  It's
currently only called in !HAVE_LOCALE_T branches, in which case initdb will not
have added non-default collations.  CREATE COLLATION will presumably be made to
fail unconditionally on such builds, too.

wchar2char and char2wchar take a collation argument, but they only use it to
Assert(!lc_ctype_is_c(collation)).  Maybe that's best, but it seems weird.

The documentation for pg_type.typcollation marks it as a bool, but it's an oid.
The documentation also says "An index can only support one collation.", but it's
one collation per column.

This surprised me; not sure how much can/should be done:

  [local] test=# SELECT prosrc COLLATE id_ID, count(*) FROM pg_proc GROUP BY 1 COLLATE id_ID;
  ERROR:  collations are not supported by type integer
  [local] test=# SELECT prosrc COLLATE id_ID, count(*) FROM pg_proc GROUP BY prosrc COLLATE id_ID;
  -- worked ...

This too:

  [local] test=# SELECT prosrc COLLATE id_ID, count(*) FROM pg_proc GROUP BY prosrc;
  -- worked ...
  [local] test=# SELECT prosrc COLLATE id_ID, count(*) FROM pg_proc GROUP BY prosrc COLLATE id_ID;
  -- worked ...
  [local] test=# SELECT prosrc, count(*) FROM pg_proc GROUP BY prosrc COLLATE id_ID;
  ERROR:  column pg_proc.prosrc must appear in the GROUP BY clause or be used in an aggregate function
  LINE 1: SELECT prosrc, count(*) FROM pg_proc GROUP BY 

Re: [HACKERS] Per-column collation, the finale

2011-01-25 Thread Peter Eisentraut
On tis, 2011-01-25 at 10:14 +0900, Itagaki Takahiro wrote:
 On Sat, Jan 15, 2011 at 06:41, Peter Eisentraut pete...@gmx.net wrote:
  I've been going over this patch with a fine-tooth comb for the last two
  weeks, fixed some small bugs, added comments, made initdb a little
  friendlier, but no substantial changes.
 
 What can I do to test the patch?
 No regression tests are included in it,

Please refer to the regress.sgml hunk about running the test.

 and I have an almost empty pg_collation catalog with it:
 
 =# SELECT * FROM pg_collation;
  collname | collnamespace | collencoding | collcollate | collctype
 ----------+---------------+--------------+-------------+-----------
  default  |            11 |            0 |             |
 (1 row)

The initdb output should say something about how it got there.





Re: [HACKERS] Per-column collation, the finale

2011-01-25 Thread Noah Misch
On Wed, Jan 26, 2011 at 12:35:07AM +0200, Peter Eisentraut wrote:
 On tis, 2011-01-25 at 10:14 +0900, Itagaki Takahiro wrote:
  and I have an almost empty pg_collation catalog with it:
  
  =# SELECT * FROM pg_collation;
   collname | collnamespace | collencoding | collcollate | collctype
  ----------+---------------+--------------+-------------+-----------
   default  |            11 |            0 |             |
  (1 row)
 
 The initdb output should say something about how it got there.

FWIW, I tried and had the same problem.  initdb gave:

  creating collations ... not supported on this platform

configure was failing to detect locale_t for me, and this fixed it:

*** a/configure.in
--- b/configure.in
***
*** 1116,1122  AC_TYPE_INTPTR_T
  AC_TYPE_UINTPTR_T
  AC_TYPE_LONG_LONG_INT
  
! AC_CHECK_TYPES([locale_t],
  [#include <locale.h>])
  
  AC_CHECK_TYPES([struct cmsgcred, struct fcred, struct sockcred], [], [],
--- 1116,1122 
  AC_TYPE_UINTPTR_T
  AC_TYPE_LONG_LONG_INT
  
! AC_CHECK_TYPES([locale_t], [], [],
  [#include <locale.h>])
  
  AC_CHECK_TYPES([struct cmsgcred, struct fcred, struct sockcred], [], [],



Re: [HACKERS] Per-column collation, the finale

2011-01-24 Thread Itagaki Takahiro
On Sat, Jan 15, 2011 at 06:41, Peter Eisentraut pete...@gmx.net wrote:
 I've been going over this patch with a fine-tooth comb for the last two
 weeks, fixed some small bugs, added comments, made initdb a little
 friendlier, but no substantial changes.

What can I do to test the patch?
No regression tests are included in it, and I have an almost empty
pg_collation catalog with it:

=# SELECT * FROM pg_collation;
 collname | collnamespace | collencoding | collcollate | collctype
----------+---------------+--------------+-------------+-----------
 default  |            11 |            0 |             |
(1 row)

-- 
Itagaki Takahiro



Re: [HACKERS] Per-column collation, the finale

2011-01-14 Thread Robert Haas
On Fri, Jan 14, 2011 at 4:41 PM, Peter Eisentraut pete...@gmx.net wrote:
 I've been going over this patch with a fine-tooth comb for the last two
 weeks, fixed some small bugs, added comments, made initdb a little
 friendlier, but no substantial changes.

 I'm going to start working on SQL-level CREATE/DROP/ALTER COLLATION
 support and associated things now.

Maybe I'm all wet here, but isn't it time to commit what you've got
and punt the things that aren't done to 9.2?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Per-column collation, the finale

2011-01-14 Thread Euler Taveira de Oliveira

On 14-01-2011 20:47, Robert Haas wrote:

On Fri, Jan 14, 2011 at 4:41 PM, Peter Eisentrautpete...@gmx.net  wrote:

I've been going over this patch with a fine-tooth comb for the last two
weeks, fixed some small bugs, added comments, made initdb a little
friendlier, but no substantial changes.

I'm going to start working on SQL-level CREATE/DROP/ALTER COLLATION
support and associated things now.


Maybe I'm all wet here, but isn't it time to commit what you've got
and punt the things that aren't done to 9.2?

I think Peter wants another person to take a look at his patch.  I personally 
would like to eyeball his patch (but it will be during the week).



--
  Euler Taveira de Oliveira
  http://www.timbira.com/



Re: [HACKERS] Per-column collation

2010-12-16 Thread Greg Smith

Robert Haas wrote:

I don't really have a position on whether or not this patch is ready
to commit... but I do think that this is the sort of patch that is
very likely to have some bugs almost no matter when we commit it


I just updated the CF app to track Peter's latest update, which remains 
untested by anyone else for whether it fixes all the issues brought up.  
It would be nice to get a re-review to confirm things are still working 
in advance of CF 2011-01.  Given the reviewer here is also a committer, 
that means it's possible this can go into the tree after that if 
everything checks out even outside of the regular CF schedule.  In the 
interest of closing out this CF, I'm updating this one as returned for 
now though.  That doesn't prevent it from going in anyway once it's 
confirmed ready, and I agree the sooner the better to help find 
breakage.  But I don't think that's so important that it should block 
the critical path for the next alpha.


--
Greg Smith   2ndQuadrant US   g...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services and Support   www.2ndQuadrant.us





Re: [HACKERS] Per-column collation

2010-12-16 Thread Itagaki Takahiro
On Thu, Dec 16, 2010 at 19:37, Greg Smith g...@2ndquadrant.com wrote:
 I just updated the CF app to track Peter's latest update, which remains
 untested by anyone else for whether it fixes all the issues brought up.  It
 would be nice to get a re-review to confirm things are still working in
 advance of CF 2011-01.

Sorry, I don't have enough time to review it. Also, multiple reviewers
using different kinds of settings would be preferred for such feature.

# We might need previous reviewers and active reviewers in commit-fest
# app. Or, should non-active reviewers delete their names?

I tested it at the program level, but a code-level review is still needed.
I'm worried about whether exhaustive checks are required for the COLLATE
hand-off from function to function.  If so, the review will be harder
than if the framework guaranteed the hand-off.

-- 
Itagaki Takahiro



Re: [HACKERS] Per-column collation

2010-12-16 Thread Greg Smith

Itagaki Takahiro wrote:

We might need previous reviewers and active reviewers in commit-fest
app. Or, should non-active reviewers delete their names?
  


This is only really an issue with patches that get moved from one CF to 
the next, which doesn't happen that often.  Patches that are marked 
Returned With Feedback instead get a new entry in the next CF instead, 
which avoids this problem.


--
Greg Smith   2ndQuadrant US   g...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services and Support   www.2ndQuadrant.us
PostgreSQL 9.0 High Performance: http://www.2ndQuadrant.com/books




Re: [HACKERS] Per-column collation

2010-12-14 Thread Robert Haas
On Sun, Dec 12, 2010 at 5:15 PM, Peter Eisentraut pete...@gmx.net wrote:
 On lör, 2010-12-04 at 18:04 +0200, Peter Eisentraut wrote:
 Here is an updated patch to address the issues discussed during this
 commitfest.

 And another one, that fixes the problems pointed out since.

I don't really have a position on whether or not this patch is ready
to commit... but I do think that this is the sort of patch that is
very likely to have some bugs almost no matter when we commit it.
Therefore, I would argue that we ought not to let the process of
getting it committed drag out too long.  Even if we find 3 more bugs,
we might have found 10 more bugs if it had been in the tree for the
same time period.  Of course, any conceptual issues must be worked out
before commit.  But there are likely to be some loose ends however we
do it.  So when we think it's pretty close, we should move forward.
All IMHO, of course.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Per-column collation

2010-12-12 Thread Peter Eisentraut
On tis, 2010-12-07 at 11:46 +0900, Itagaki Takahiro wrote:
 On Sun, Dec 5, 2010 at 01:04, Peter Eisentraut pete...@gmx.net wrote:
  Here is an updated patch to address the issues discussed during this
  commitfest.
 
 I found another issue in the patch; ILIKE in WHERE clause doesn't work.
 It was surprising because LIKE in a WHERE clause and ILIKE in a SELECT list
 work as expected.
  - SELECT * FROM pg_class WHERE relname LIKE 'pg%'
  - SELECT relname ILIKE 'pg%' FROM pg_class;
 
 
 postgres=# SELECT name, setting FROM pg_settings
  WHERE name IN ('lc_ctype', 'lc_collate', 'server_encoding');
        name       | setting
  -----------------+---------
   lc_collate      | C
   lc_ctype        | C
   server_encoding | UTF8
 (3 rows)
 
 postgres=# SELECT * FROM pg_class WHERE relname ILIKE 'pg%';
 ERROR:  no collation was derived

This is fixed in the 20101213 patch I'm about to send out.




Re: [HACKERS] Per-column collation

2010-12-12 Thread Peter Eisentraut
On mån, 2010-12-06 at 21:26 +0200, Peter Eisentraut wrote:
 
  * contrib/citext raises an encoding error when COLLATE is specified
  even if the collation is the same as the database default.
  We might need some special treatment for the C locale.
  =# SHOW lc_collate;  == C
  =# SELECT ('A'::citext) = ('a'::citext);  == false
  =# SELECT ('A'::citext) = ('a'::citext) COLLATE C;
  ERROR:  invalid multibyte character for locale
  HINT:  The server's LC_CTYPE locale is probably incompatible with
 the
  database encoding.
 
 OK, I can reproduce that.  That's fallout from the lc_ctype_is_c()
 optimization that I removed, as explained in another email.  I'll have
 to think about that again.

This is fixed in the 20101213 patch I'm about to send out.




Re: [HACKERS] Per-column collation

2010-12-06 Thread Itagaki Takahiro
On Sun, Dec 5, 2010 at 01:04, Peter Eisentraut pete...@gmx.net wrote:
 Here is an updated patch to address the issues discussed during this
 commitfest.

Here are comments and questions after I tested the latest patch:

 Issues 
* initdb itself seems to succeed, but it prints could not determine
encoding for locale messages for any combination of encoding=utf8/eucjp
and locale=ja_JP.utf8/ja_JP.eucjp/C. Is this expected behavior?

creating collations ...initdb: locale name has non-ASCII characters,
skipped: bokmål
initdb: locale name has non-ASCII characters, skipped: français
could not determine encoding for locale hy_AM.armscii8: codeset is ARMSCII-8
... (a dozen of lines) ...
could not determine encoding for locale vi_VN.tcvn: codeset is TCVN5712-1
ok


* contrib/citext raises an encoding error when COLLATE is specified
even if the collation is the same as the database default.
We might need some special treatment for the C locale.
=# SHOW lc_collate;  == C
=# SELECT ('A'::citext) = ('a'::citext);  == false
=# SELECT ('A'::citext) = ('a'::citext) COLLATE C;
ERROR:  invalid multibyte character for locale
HINT:  The server's LC_CTYPE locale is probably incompatible with the
database encoding.

* pg_dump would generate unportable files for different platforms
because collation names are platform-specific.

 Source codes 
* PG_GETARG_COLLATION might be a better name rather than PG_GET_COLLATION.

* What is the difference between InvalidOid and DEFAULT_COLLATION_OID
for collation oids? The patch replaces DirectFunctionCall with
DirectFunctionCallC in some places, but we could shrink the diff size
if we could use InvalidOid instead of DEFAULT_COLLATION_OID.

* I still think explicitly passing collations from function to function
is horrible because we might forget it in some places, and almost no existing
third-party module will work.  Is it possible to make it a global variable,
and push/pop the state when it changes? Sorry if I'm missing something, but
I think we could treat the collation setting like a GUC setting.

-- 
Itagaki Takahiro



Re: [HACKERS] Per-column collation

2010-12-06 Thread David E. Wheeler
On Dec 6, 2010, at 4:06 AM, Itagaki Takahiro wrote:

 * contrib/citext raises an encoding error when COLLATE is specified
 even if the collation is the same as the database default.
 We might need some special treatment for the C locale.

I've been wondering if this patch will support case-insensitive collations. If
so, then citext should probably be revised to use one.

Best,

David




Re: [HACKERS] Per-column collation

2010-12-06 Thread Pavel Stehule
2010/12/6 David E. Wheeler da...@kineticode.com:
 On Dec 6, 2010, at 4:06 AM, Itagaki Takahiro wrote:

 * contrib/citext raises an encoding error when COLLATE is specified
 even if the collation is the same as the database default.
 We might need some special treatment for the C locale.

 I've been wondering if this patch will support case-insensitive collations. If
 so, then citext should probably be revised to use one.

As far as I know, no. It supports only system-based collations.

Pavel


 Best,

 David






Re: [HACKERS] Per-column collation

2010-12-06 Thread Peter Eisentraut
On mån, 2010-12-06 at 21:06 +0900, Itagaki Takahiro wrote:
 On Sun, Dec 5, 2010 at 01:04, Peter Eisentraut pete...@gmx.net wrote:
  Here is an updated patch to address the issues discussed during this
  commitfest.
 
 Here are comments and questions after I tested the latest patch:
 
  Issues 
 * initdb itself seems to succeed, but it prints could not determine
 encoding for locale messages for any combination of encoding=utf8/eucjp
 and locale=ja_JP.utf8/ja_JP.eucjp/C. Is this expected behavior?
 
 creating collations ...initdb: locale name has non-ASCII characters,
 skipped: bokmål
 initdb: locale name has non-ASCII characters, skipped: français
 could not determine encoding for locale hy_AM.armscii8: codeset is 
 ARMSCII-8
 ... (a dozen of lines) ...
 could not determine encoding for locale vi_VN.tcvn: codeset is TCVN5712-1
 ok
 

What this does is take the output of locale -a and populate the
pg_collation catalog with the locales it finds.  When it finds an
operating system locale that uses an encoding that is not recognized,
you will see this warning.

I understand that that would probably annoy users.  We could hide the
warning and silently skip those locales.  But then could that hide
genuine configuration problems?
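The filtering described here can be sketched roughly as follows (a Python sketch; the encoding map, the skip reasons, and the seed_collations name are illustrative, not initdb's actual code):

```python
# Rough sketch of initdb-style collation seeding: keep OS locales whose
# codeset maps to a known server encoding, record the rest with a reason.
# KNOWN_ENCODINGS and seed_collations are illustrative names only.
KNOWN_ENCODINGS = {"utf8": "UTF8", "eucjp": "EUC_JP", "iso88591": "LATIN1"}

def seed_collations(os_locales):
    kept, skipped = [], []
    for name in os_locales:
        if not name.isascii():
            # Locale names must be usable as SQL identifiers.
            skipped.append((name, "locale name has non-ASCII characters"))
            continue
        if "." in name:
            codeset = name.split(".", 1)[1].lower().replace("-", "")
            if codeset not in KNOWN_ENCODINGS:
                skipped.append((name, "could not determine encoding"))
                continue
        kept.append(name)  # no codeset suffix, or a recognized one
    return kept, skipped
```

Locales like `hy_AM.armscii8` end up in the skipped list with the "could not determine encoding" reason, which is exactly the warning shown in the initdb output above.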

 * contrib/citext raises an encoding error when COLLATE is specified
 even if the collation is the same as the database default.
 We might need some special treatment for the C locale.
 =# SHOW lc_collate;  == C
 =# SELECT ('A'::citext) = ('a'::citext);  == false
 =# SELECT ('A'::citext) = ('a'::citext) COLLATE C;
 ERROR:  invalid multibyte character for locale
 HINT:  The server's LC_CTYPE locale is probably incompatible with the
 database encoding.

OK, I can reproduce that.  That's fallout from the lc_ctype_is_c()
optimization that I removed, as explained in another email.  I'll have
to think about that again.

 * pg_dump would generate unportable files for different platforms
 because collation names are platform-specific.

pg_dump can already produce unportable files for a number of other
reasons, including per-database locale, tablespaces, and OS-dependent
configuration settings.

The way I imagine this working is that someone who wants to design a
genuinely portable application using this feature would create their own
collation based on the existing, OS-specific collation (using a
to-be-added CREATE COLLATION command).  As mentioned earlier, however,
we can't actually solve the problem that the OS locales may not behave
the same across systems.

  Source codes 
 * PG_GETARG_COLLATION might be a better name rather than PG_GET_COLLATION.

It's not the collation of a function argument, it's the collation of a
function call.  (You could conceivably also fetch the collation of a
function argument, but that isn't used in any way.)

 * What is the difference between InvalidOid and DEFAULT_COLLATION_OID
 for collation oids? The patch replaces DirectFunctionCall with
 DirectFunctionCallC in some places, but we could shrink the diff size
 if we could use InvalidOid instead of DEFAULT_COLLATION_OID.

Think of DEFAULT_COLLATION_OID as analogous to UNKNOWNOID.  A long time
ago we used InvalidOid for all kinds of types, including unknown,
pseudotypes, cstring, and no type at all.  The reason we changed this
was that this masked errors and made processing of the unknown type
difficult/impossible.  I know this makes the code bigger, but it's
necessary.  I originally coded the patch using InvalidOid for
everything, but that wasn't very robust.

This also ties into the next question ...

 * I still think explicitly passing collations from function to function
 is horrible because we might forget it in some places, and almost no existing
 third-party module will work.  Is it possible to make it a global variable,
 and push/pop the state when it changes? Sorry if I'm missing something, but
 I think we could treat the collation setting like a GUC setting.

A collation is a property of a datum or an expression.  You might as
well argue that we don't keep track of types of expressions and instead
store it globally.  Doesn't make sense.

Extensions are not required to support collations.  Those that might
want to will usually end up calling one of the locale-enabled functions
such as varstr_cmp(), and there the function prototype will ensure that
specifying a collation cannot be missed.

Additionally, the distinction of InvalidOid and DEFAULT_COLLATION_OID
does a great deal to ensure that in case a collation is unspecified or
missing in some new code, you will get a proper error message instead of
unspecified behavior.
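The two-sentinel distinction can be illustrated with a small sketch (Python; the numeric values and the resolve_collation helper are hypothetical, only the error text mirrors the message seen earlier in the thread):

```python
# Two distinct sentinels: INVALID_OID means "no collation was derived",
# which must fail loudly, while DEFAULT_COLLATION_OID means "use the
# database default". Values and function are illustrative only.
INVALID_OID = 0
DEFAULT_COLLATION_OID = 100

def resolve_collation(collation_oid, database_default):
    if collation_oid == INVALID_OID:
        # Silently falling back here would mask bugs in new code paths.
        raise ValueError("no collation was derived")
    if collation_oid == DEFAULT_COLLATION_OID:
        return database_default
    return f"collation#{collation_oid}"
```

The point of the sketch is the error branch: collapsing both cases onto a single sentinel would turn a detectable bug into unspecified behavior.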





Re: [HACKERS] Per-column collation

2010-12-06 Thread Peter Eisentraut
On mån, 2010-12-06 at 10:01 -0800, David E. Wheeler wrote:
 I've been wondering if this patch will support case-insensitive
 collations. If so, then citext should probably be revised to use one.

This has been touched upon several times during the discussions on past
patches.

Essentially, the current patch only arranges that you can specify a sort
order for data.  The system always breaks ties using a binary
comparison.  This could conceivably be changed, but it's a separate
problem.  Some of the necessary investigation work has presumably
already been done in the context of citext.
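The "locale order first, binary tie-break" behavior described here can be sketched like this (Python; pinned to the C locale so the result is deterministic — under C, strcoll is itself bytewise, so the tie-break branch only matters for real locales):

```python
import functools
import locale

# Sketch of text comparison with a locale-aware primary order and a
# binary (bytewise) tie-break, as described above. Pinned to the C
# locale for determinism; a real collation would setlocale accordingly.
locale.setlocale(locale.LC_COLLATE, "C")

def text_cmp(a, b):
    primary = locale.strcoll(a, b)   # locale-aware comparison
    if primary != 0:
        return primary
    ab, bb = a.encode(), b.encode()  # break ties with a binary comparison
    return (ab > bb) - (ab < bb)

words = sorted(["b", "A", "a"], key=functools.cmp_to_key(text_cmp))
```

A case-insensitive collation would need the tie-break removed (or replaced) so that equal-collating strings actually compare equal, which is the separate problem mentioned above.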





Re: [HACKERS] Per-column collation

2010-12-06 Thread David E. Wheeler
On Dec 6, 2010, at 11:29 AM, Peter Eisentraut wrote:

 This has been touched upon several times during the discussions on past
 patches.
 
 Essentially, the current patch only arranges that you can specify a sort
 order for data.  The system always breaks ties using a binary
 comparison.  This could conceivably be changed, but it's a separate
 problem.  Some of the necessary investigation work has presumably
 already been done in the context of citext.

Okay, thanks, good to know.

Best,

David




Re: [HACKERS] Per-column collation

2010-12-06 Thread Alexandre Riveira

Please

It would be very important to us that LIKE worked with the Brazilian
collation, and possibly case-insensitive and accent-insensitive matching.


Thanks

Alexandre Riveira
Brazil

Peter Eisentraut wrote:

On mån, 2010-12-06 at 10:01 -0800, David E. Wheeler wrote:
  

I've been wondering if this patch will support case-insensitive
collations. If so, then citext should probably be revised to use one.



This has been touched upon several times during the discussions on past
patches.

Essentially, the current patch only arranges that you can specify a sort
order for data.  The system always breaks ties using a binary
comparison.  This could conceivably be changed, but it's a separate
problem.  Some of the necessary investigation work has presumably
already been done in the context of citext.



  





Re: [HACKERS] Per-column collation

2010-12-06 Thread Itagaki Takahiro
On Sun, Dec 5, 2010 at 01:04, Peter Eisentraut pete...@gmx.net wrote:
 Here is an updated patch to address the issues discussed during this
 commitfest.

I found another issue in the patch; ILIKE in WHERE clause doesn't work.
It was surprising because LIKE in a WHERE clause and ILIKE in a SELECT list
work as expected.
 - SELECT * FROM pg_class WHERE relname LIKE 'pg%'
 - SELECT relname ILIKE 'pg%' FROM pg_class;


postgres=# SELECT name, setting FROM pg_settings
 WHERE name IN ('lc_ctype', 'lc_collate', 'server_encoding');
      name       | setting
-----------------+---------
 lc_collate      | C
 lc_ctype        | C
 server_encoding | UTF8
(3 rows)

postgres=# SELECT * FROM pg_class WHERE relname ILIKE 'pg%';
ERROR:  no collation was derived


-- 
Itagaki Takahiro



Re: [HACKERS] Per-column collation

2010-12-04 Thread Peter Eisentraut
On tor, 2010-11-18 at 21:37 +0200, Heikki Linnakangas wrote:
 On 15.11.2010 21:42, Peter Eisentraut wrote:
  On mån, 2010-11-15 at 11:34 +0100, Pavel Stehule wrote:
  I am checking a patch. I found a problem with initdb
 
  Ah, late night brain farts, it appears.  Here is a corrected version.
 
 Some random comments:
 
 In syntax.sgml:
 
  +The literalCOLLATE/literal clause overrides the collation of
  +an expression.  It is appended to the expression at applies to:
 
 That last sentence doesn't parse.

Fixed in the patch I'm about to send out.

 Would it be possible to eliminate the ExecEvalCollateClause function 
 somehow? It just calls through the argument. How about directly 
 returning the argument ExprState in ExecInitExpr?

This currently mirrors the handling of RelabelType.  Perhaps both could
be improved, but that is an independent issue.

 get_collation_name() returns the plain name without schema, so it's not 
 good enough for use in ruleutils.c. pg_dump is also ignoring collation's 
 schema.

Fixed schema handling in ruleutils and pg_dump in new patch.





Re: [HACKERS] Per-column collation

2010-12-04 Thread Peter Eisentraut
On ons, 2010-11-24 at 22:22 +0200, Peter Eisentraut wrote:
 On mån, 2010-11-22 at 11:58 +0900, Itagaki Takahiro wrote:
  * Did you see any performance regression by collation?
  I found a bug in lc_collate_is_c(); result >= 0 should be
  checked before any other checks. SearchSysCache1() here
  would be a performance regression.
 
 That code turned out to be buggy anyway, because it was using the
 result cache variable independent of the collation parameter.

Since I don't have a short-term solution for this, I have ripped out the
caching of C-ness for nondefault locales.

 I did some profiling with this now.  The problem is that this function
 lc_collate_is_c() would need to cache the C-ness property for any
 number of collations.  Depending on what call pattern you expect or want
 to optimize for, you might end up caching most of the pg_collation
 catalog, which is actually the mandate of SearchSysCache, but the
 profile shows that SearchSysCache takes a large chunk of the additional
 run time.
 
 If I remove that branch altogether, that is, don't treat the C locale
 specially at all in the nondefault collation case, then using non-C
 locales as nondefault collation is almost as fast as using non-C locales
 as default collation.  However, using the C locale as a nondefault
 collation would then be quite slow (still faster than non-C locales).
 
 The solution would perhaps be a custom, lightweight caching system, but
 I haven't thought of one yet.





Re: [HACKERS] Per-column collation

2010-11-24 Thread Peter Eisentraut
On mån, 2010-11-22 at 11:58 +0900, Itagaki Takahiro wrote:
 * Did you see any performance regression by collation?
 I found a bug in lc_collate_is_c(); result >= 0 should be
 checked before any other checks. SearchSysCache1() here
 would be a performance regression.

That code turned out to be buggy anyway, because it was using the
result cache variable independent of the collation parameter.

I did some profiling with this now.  The problem is that this function
lc_collate_is_c() would need to cache the C-ness property for any
number of collations.  Depending on what call pattern you expect or want
to optimize for, you might end up caching most of the pg_collation
catalog, which is actually the mandate of SearchSysCache, but the
profile shows that SearchSysCache takes a large chunk of the additional
run time.

If I remove that branch altogether, that is, don't treat the C locale
specially at all in the nondefault collation case, then using non-C
locales as nondefault collation is almost as fast as using non-C locales
as default collation.  However, using the C locale as a nondefault
collation would then be quite slow (still faster than non-C locales).

The solution would perhaps be a custom, lightweight caching system, but
I haven't thought of one yet.
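For illustration, one minimal shape such a lightweight cache could take (Python sketch; collate_is_c and the lookup callback stand in for lc_collate_is_c() and the SearchSysCache1 probe, and are not PostgreSQL code):

```python
# Sketch of a per-collation C-ness cache: the expensive catalog lookup
# runs once per collation oid; later calls hit the dictionary instead.
_c_ness = {}

def collate_is_c(collation_oid, lookup_collname):
    cached = _c_ness.get(collation_oid)
    if cached is not None:          # check the cache before anything else
        return cached
    collname = lookup_collname(collation_oid)  # stands in for SearchSysCache1
    result = collname in ("C", "POSIX")
    _c_ness[collation_oid] = result
    return result
```

The trade-off discussed above still applies: with many distinct collations the cache grows toward the size of the pg_collation catalog, which is exactly what SearchSysCache already does.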




Re: [HACKERS] Per-column collation, work in progress

2010-11-24 Thread Peter Eisentraut
On tor, 2010-10-14 at 22:54 -0400, Robert Haas wrote:
 It seems you've falsified the header comment in
 pathkeys_useful_for_merging(), although I guess it's already false
 because it doesn't seem to have been updated for the NULLS ASC/DESC
 stuff, and the interior comment in right_merge_direction() also needs
 adjusting.  But this might be more than a documentation problem,
 because the choice of merge direction really *is* arbitrary in the
 case of ASC/DESC and NULLS FIRST/LAST, but I'm not sure whether that's
 actually true for collation.  If collation affects the definition of
 equality then it certainly isn't true.

I did check that again and didn't arrive at the conclusion that the
comments would need updating either with respect to this patch or some
previous change.  Could you check again and possibly provide a
suggestion?




Re: [HACKERS] Per-column collation, work in progress

2010-11-24 Thread Peter Eisentraut
On ons, 2010-09-22 at 19:44 +0900, Itagaki Takahiro wrote:
 * CREATE TABLE (LIKE table_with_collation) doesn't inherit collations.

This was fixed in the CF2010-11 patch.

 * psql \d needs a separator between collate and not null modifiers.

And this as well.



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Per-column collation, work in progress

2010-11-24 Thread Robert Haas
On Wed, Nov 24, 2010 at 3:37 PM, Peter Eisentraut pete...@gmx.net wrote:
 On tor, 2010-10-14 at 22:54 -0400, Robert Haas wrote:
 It seems you've falsified the header comment in
 pathkeys_useful_for_merging(), although I guess it's already false
 because it doesn't seem to have been updated for the NULLS ASC/DESC
 stuff, and the interior comment in right_merge_direction() also needs
 adjusting.  But this might be more than a documentation problem,
 because the choice of merge direction really *is* arbitrary in the
 case of ASC/DESC and NULLS FIRST/LAST, but I'm not sure whether that's
 actually true for collation.  If collation affects the definition of
 equality then it certainly isn't true.

 I did check that again and didn't arrive at the conclusion that the
 comments would need updating either with respect to this patch or some
 previous change.  Could you check again and possibly provide a
 suggestion?

I think that you are right and that my previous comment was erroneous.
 Sorry for the noise.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Per-column collation

2010-11-22 Thread Peter Eisentraut
On tor, 2010-11-18 at 21:37 +0200, Heikki Linnakangas wrote:
 Have you done any performance testing? Functions like text_cmp can be
 a hotspot in sorting, so any extra overhead there might show up in
 tests.

Without having optimized it very much yet, the performance for a 1GB
ORDER BY is

* without COLLATE clause: about the same as without the patch

* with COLLATE clause: about 30%-50% slower

I can imagine that there is some optimization potential in the latter
case.  But in any case, it's not awfully slow.
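A rough way to reproduce this kind of comparison outside the server (Python sketch; the absolute numbers are machine-dependent, and strxfrm under the C locale only stands in for the no-COLLATE case — it does not measure PostgreSQL itself):

```python
import locale
import random
import string
import timeit

# Rough benchmark sketch: binary sort vs. locale-keyed sort (strxfrm).
# Only the ratio between the two timings is meaningful.
random.seed(1)
data = ["".join(random.choices(string.ascii_letters, k=12))
        for _ in range(20_000)]

t_binary = timeit.timeit(lambda: sorted(data), number=3)
locale.setlocale(locale.LC_COLLATE, "C")
t_collated = timeit.timeit(lambda: sorted(data, key=locale.strxfrm), number=3)
# In the C locale both orders agree; a keyed sort still pays for the
# transform, much like the COLLATE-clause overhead described above.
```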





Re: [HACKERS] Per-column collation

2010-11-22 Thread Peter Eisentraut
On mån, 2010-11-22 at 11:58 +0900, Itagaki Takahiro wrote:
 * COLLATE information must be explicitly passed by the caller in the patch,
 but we might forget the handover when we write new code. Is it possible
 to pass it automatically, say using a global variable? If we could do so,
 existing extensions might work with collation without being rewritten.

I don't see how that is supposed to work.  I understand the concern, but
the system is fairly robust against this becoming a problem.

 * Did you check the regression test on Windows? We probably cannot use
 en_US.utf8 on Windows. Also, some output of the test includes non-ASCII
 characters. How will we test the COLLATE feature on non-UTF8 databases?

I attempted to discuss this here:

http://archives.postgresql.org/pgsql-hackers/2010-11/msg00464.php

For lack of a solution, the approach now is to run the regression test
file manually, and we provide separate test files for as many platforms
and encodings or whatever we want to support.





Re: [HACKERS] Per-column collation

2010-11-21 Thread Itagaki Takahiro
On Tue, Nov 16, 2010 at 04:42, Peter Eisentraut pete...@gmx.net wrote:
 On mån, 2010-11-15 at 11:34 +0100, Pavel Stehule wrote:
 I am checking a patch. I found a problem with initdb
 Ah, late night brain farts, it appears.  Here is a corrected version.

This version cannot be applied cleanly any more. Please update it.
(I think you don't have to include changes for catversion.h)
./src/backend/optimizer/util/plancat.c.rej
./src/backend/optimizer/plan/createplan.c.rej
./src/backend/optimizer/path/indxpath.c.rej
./src/include/catalog/catversion.h.rej

I didn't compile nor run the patched server, but I found a couple of
issues in the design and source code:

* COLLATE information must be explicitly passed by the caller in the patch,
but we might forget the handover when we write new code. Is it possible
to pass it automatically, say using a global variable? If we could do so,
existing extensions might work with collation without being rewritten.

* Did you check the regression test on Windows? We probably cannot use
en_US.utf8 on Windows. Also, some output of the test includes non-ASCII
characters. How will we test the COLLATE feature on non-UTF8 databases?

[src/test/regress/sql/collate.sql]
+CREATE TABLE collate_test1 (
+a int,
+b text COLLATE en_US.utf8 not null
+);

* Did you see any performance regression by collation?
I found a bug in lc_collate_is_c(); result >= 0 should be
checked before any other checks. SearchSysCache1() here
would be a performance regression.

[src/backend/utils/adt/pg_locale.c]
-lc_collate_is_c(void)
+lc_collate_is_c(Oid collation)
 {
...
+   tp = SearchSysCache1(COLLOID, ObjectIdGetDatum(collation));
...
HERE => if (result >= 0)
return (bool) result;

-- 
Itagaki Takahiro



Re: [HACKERS] Per-column collation

2010-11-18 Thread Heikki Linnakangas

On 15.11.2010 21:42, Peter Eisentraut wrote:

On mån, 2010-11-15 at 11:34 +0100, Pavel Stehule wrote:

I am checking a patch. I found a problem with initdb


Ah, late night brain farts, it appears.  Here is a corrected version.


Some random comments:

In syntax.sgml:


+The literalCOLLATE/literal clause overrides the collation of
+an expression.  It is appended to the expression at applies to:


That last sentence doesn't parse.


Would it be possible to eliminate the ExecEvalCollateClause function 
somehow? It just calls through the argument. How about directly 
returning the argument ExprState in ExecInitExpr?


get_collation_name() returns the plain name without schema, so it's not 
good enough for use in ruleutils.c. pg_dump is also ignoring collation's 
schema.


Have you done any performance testing? Functions like text_cmp can be a 
hotspot in sorting, so any extra overhead there might show up in tests.


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com



Re: [HACKERS] Per-column collation

2010-11-16 Thread Peter Eisentraut
On mån, 2010-11-15 at 23:13 +0100, Pavel Stehule wrote:
 a) default encoding for collate isn't same as default encoding of database
 
 it's minimally not friendly - mostly used encoding is UTF8, but in
 most cases users should to write locale.utf8.

I don't understand what you are trying to say.  Please provide more
detail.

 b) there is bug - default collate (database collate is ignored)
 
 
 postgres=# show lc_collate;
  lc_collate
 
  cs_CZ.UTF8
 (1 row)
 
 Time: 0.518 ms
 postgres=# select * from jmena order by v;
  v
 ───
  Chromečka
  Crha
  Drobný
  Čečetka
 (4 rows)
 
 postgres=# select * from jmena order by v collate cs_CZ.utf8;
  v
 ───
  Crha
  Čečetka
  Drobný
  Chromečka
 (4 rows)
 
 both result should be same.

I tried to reproduce this here but got the expected results.  Could you
try to isolate a complete test script?

 isn't there a problem with case-sensitive collation names? When I use the
 lc_collate value, I get an error message
 
 postgres=# select * from jmena order by v collate cs_CZ.UTF8;
 ERROR:  collation cs_CZ.UTF8 for current database encoding UTF8
 does not exist
 LINE 1: select * from jmena order by v collate cs_CZ.UTF8;
 
 the problem is when a table is created without an explicit collation.

Well, I agree it's not totally nice, but we have to do something, and I
think it's logical to use the locale names as collation names by
default, and collation names are SQL identifiers.  Do you have any ideas
for improving this?





Re: [HACKERS] Per-column collation

2010-11-16 Thread Pavel Stehule
Hello

2010/11/16 Peter Eisentraut pete...@gmx.net:
 On mån, 2010-11-15 at 23:13 +0100, Pavel Stehule wrote:
 a) default encoding for collate isn't same as default encoding of database

 it's minimally not friendly - mostly used encoding is UTF8, but in
 most cases users should to write locale.utf8.

 I don't understand what you are trying to say.  Please provide more
 detail.

go down.


 b) there is bug - default collate (database collate is ignored)


 postgres=# show lc_collate;
  lc_collate
 
  cs_CZ.UTF8
 (1 row)

 Time: 0.518 ms
 postgres=# select * from jmena order by v;
      v
 ───
  Chromečka
  Crha
  Drobný
  Čečetka
 (4 rows)

 postgres=# select * from jmena order by v collate cs_CZ.utf8;
      v
 ───
  Crha
  Čečetka
  Drobný
  Chromečka
 (4 rows)

 both result should be same.

 I tried to reproduce this here but got the expected results.  Could you
 try to isolate a complete test script?


I can't reproduce it now either, on a different system and computer. Maybe I
did something wrong. Sorry.


 isn't there a problem with case-sensitive collation names? When I use the
 lc_collate value, I get an error message

 postgres=# select * from jmena order by v collate cs_CZ.UTF8;
 ERROR:  collation cs_CZ.UTF8 for current database encoding UTF8
 does not exist
 LINE 1: select * from jmena order by v collate cs_CZ.UTF8;

 the problem is when a table is created without an explicit collation.

 Well, I agree it's not totally nice, but we have to do something, and I
 think it's logical to use the locale names as collation names by
 default, and collation names are SQL identifiers.  Do you have any ideas
 for improving this?

yes - my first question is: why do we need to specify the encoding when only
one encoding is supported? I can't use cs_CZ.iso88592 when my db
uses UTF8 - btw, there is a wrong message:

yyy=# select * from jmena order by jmeno collate cs_CZ.iso88592;
ERROR:  collation cs_CZ.iso88592 for current database encoding
UTF8 does not exist
LINE 1: select * from jmena order by jmeno collate cs_CZ.iso88592;
   ^

I don't know why, but the preferred encoding for Czech is iso88592 now -
but I can't use it - so I can't use the names czech or cs_CZ. I
always have to use the full name cs_CZ.utf8. That's wrong. Moreover, from
this moment my application depends on the encoding used first - I can't
change the encoding without refactoring SQL statements - because the
encoding is hard-coded there (in the collation clause).

So I don't understand why you fill the pg_collation table with thousands
of collations that are impossible to use. If I use utf8, then there
should be only utf8-based collations. And if you need to work with a wide
set of collations, then I am for preferring utf8 - at least for the central
European region. If somebody wants to use collations here, they will
use a combination of cs, de, en - so they must use latin2 and latin1
or utf8. I think the encoding should not be a part of the collation when
possible.

Regards

Pavel











Re: [HACKERS] Per-column collation

2010-11-16 Thread Peter Eisentraut
On tis, 2010-11-16 at 20:00 +0100, Pavel Stehule wrote:
 yes - my first question is: why do we need to specify the encoding when only
 one encoding is supported? I can't use cs_CZ.iso88592 when my db
 uses UTF8 - btw, there is a wrong message:
 
 yyy=# select * from jmena order by jmeno collate cs_CZ.iso88592;
 ERROR:  collation cs_CZ.iso88592 for current database encoding
 UTF8 does not exist
 LINE 1: select * from jmena order by jmeno collate cs_CZ.iso88592;
^

Sorry, is there some mistake in that message?

 I don't know why, but the preferred encoding for Czech is iso88592 now -
 but I can't use it - so I can't use the names czech or cs_CZ. I
 always have to use the full name cs_CZ.utf8. That's wrong. Moreover, from
 this moment my application depends on the encoding used first - I can't
 change the encoding without refactoring SQL statements - because the
 encoding is hard-coded there (in the collation clause).

I can only look at the locales that the operating system provides.  We
could conceivably make some simplifications like stripping off the
.utf8, but then how far do we go and where do we stop?  Locale names
on Windows look different too.  But in general, how do you suppose we
should map an operating system locale name to an acceptable SQL
identifier?  You might hope, for example, that we could look through the
list of operating system locale names and map, say,

cs_CZ   - czech
cs_CZ.iso88592  - czech
cs_CZ.utf8  - czech
czech   - czech

but we have no way to actually know that these are semantically similar,
so this illustrated mapping is AI complete.  We need to take the locale
names as is, and that may or may not carry encoding information.
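The simplification Peter mentions - stripping the encoding suffix - is purely mechanical: it gets you from cs_CZ.utf8 to cs_CZ, but never to a word like czech. A minimal sketch of that step (hypothetical helper name, not part of any patch):

```python
def strip_encoding(locale_name: str) -> str:
    """Drop a trailing .codeset suffix from an OS locale name, if present."""
    base, _sep, _codeset = locale_name.partition(".")
    return base

# The suffix can be removed mechanically...
assert strip_encoding("cs_CZ.utf8") == "cs_CZ"
assert strip_encoding("cs_CZ.iso88592") == "cs_CZ"
# ...but nothing tells us that an alias like "czech" names the same rules:
assert strip_encoding("czech") == "czech"
```

This illustrates the limit of any syntactic normalization: collapsing cs_CZ and czech into one name requires semantic knowledge the operating system does not expose.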

 So I don't understand, why you fill a table pg_collation with thousand
 collated that are not possible to use? If I use a utf8, then there
 should be just utf8 based collates. And if you need to work with wide
 collates, then I am for a preferring utf8 - minimally for central
 europe region. if somebody would to use a collates here, then he will
 use a combination cs, de, en - so it must to use a latin2 and latin1
 or utf8. I think so encoding should not be a part of collation when it
 is possible.

Different databases can have different encodings, but the pg_collation
catalog is copied from the template database in any case.  We can't do
any changes in system catalogs as we create new databases, so the
useless collations have to be there.  There are only a few hundred,
actually, so it's not really a lot of wasted space.





Re: [HACKERS] Per-column collation

2010-11-16 Thread Pavel Stehule
2010/11/16 Peter Eisentraut pete...@gmx.net:
 On tis, 2010-11-16 at 20:00 +0100, Pavel Stehule wrote:
 yes - my first question is: Why we need to specify encoding, when only
 one encoding is supported? I can't to use a cs_CZ.iso88592 when my db
 use a UTF8 - btw there is wrong message:

 yyy=# select * from jmena order by jmeno collate cs_CZ.iso88592;
 ERROR:  collation cs_CZ.iso88592 for current database encoding
 UTF8 does not exist
 LINE 1: select * from jmena order by jmeno collate cs_CZ.iso88592;
                                            ^

 Sorry, is there some mistake in that message?


it is unclear - I expected something like: cannot use collation
cs_CZ.iso88592, because your database uses the utf8 encoding.

 I don't know why, but preferred encoding for czech is iso88592 now -
 but I can't to use it - so I can't to use a names czech, cs_CZ. I
 always have to use a full name cs_CZ.utf8. It's wrong. More - from
 this moment, my application depends on firstly used encoding - I can't
 to change encoding without refactoring of SQL statements - because
 encoding is hardly there (in collation clause).

 I can only look at the locales that the operating system provides.  We
 could conceivably make some simplifications like stripping off the
 .utf8, but then how far do we go and where do we stop?  Locale names
 on Windows look different too.  But in general, how do you suppose we
 should map an operating system locale name to an acceptable SQL
 identifier?  You might hope, for example, that we could look through the
 list of operating system locale names and map, say,

 cs_CZ           - czech
 cs_CZ.iso88592  - czech
 cs_CZ.utf8      - czech
 czech           - czech

 but we have no way to actually know that these are semantically similar,
 so this illustrated mapping is AI complete.  We need to take the locale
 names as is, and that may or may not carry encoding information.

 So I don't understand, why you fill a table pg_collation with thousand
 collated that are not possible to use? If I use a utf8, then there
 should be just utf8 based collates. And if you need to work with wide
 collates, then I am for a preferring utf8 - minimally for central
 europe region. if somebody would to use a collates here, then he will
 use a combination cs, de, en - so it must to use a latin2 and latin1
 or utf8. I think so encoding should not be a part of collation when it
 is possible.

 Different databases can have different encodings, but the pg_collation
 catalog is copied from the template database in any case.  We can't do
 any changes in system catalogs as we create new databases, so the
 useless collations have to be there.  There are only a few hundred,
 actually, so it's not really a lot of wasted space.


I don't have a problem with the size. I just think the current
behaviour isn't practical. When the database encoding is utf8, I
expect that cs_CZ or czech will mean utf8. I understand that
template0 must have all locales, and I understand why the current
behaviour is what it is, but it is very user-unfriendly. Actually, only
old applications in the Czech Republic use latin2 - almost all use
utf8 - yet latin2 is now the preferred name. This is bad and should be
solved.

Regards

Pavel






Re: [HACKERS] Per-column collation

2010-11-16 Thread marcin mank
 I can only look at the locales that the operating system provides.  We
 could conceivably make some simplifications like stripping off the
 .utf8, but then how far do we go and where do we stop?  Locale names
 on Windows look different too.  But in general, how do you suppose we
 should map an operating system locale name to an acceptable SQL
 identifier?  You might hope, for example, that we could look through the

It would be nice if we could have some built-in mapping of locale
names, so one doesn't have to write alternative SQL depending on the DB
server OS:
select * from tab order by foo collate "Polish, Poland"
select * from tab order by foo collate "pl_PL.UTF-8"

(that's how it works now, correct?)

Greetings
Marcin Mańk



Re: [HACKERS] Per-column collation

2010-11-16 Thread Pavel Stehule
2010/11/16 marcin mank marcin.m...@gmail.com:
 I can only look at the locales that the operating system provides.  We
 could conceivably make some simplifications like stripping off the
 .utf8, but then how far do we go and where do we stop?  Locale names
 on Windows look different too.  But in general, how do you suppose we
 should map an operating system locale name to an acceptable SQL
 identifier?  You might hope, for example, that we could look through the

 It would be nice if we could have some mapping of locale names bult
 in, so one doesn`t have to write alternative sql depending on DB
 server OS:

+1

Pavel

 select * from tab order by foo collate Polish, Poland
 select * from tab order by foo collate pl_PL.UTF-8

 (that`s how it works now, correct?)

 Greetings
 Marcin Mańk




Re: [HACKERS] Per-column collation

2010-11-16 Thread Peter Eisentraut
On tis, 2010-11-16 at 20:59 +0100, Pavel Stehule wrote:
 2010/11/16 Peter Eisentraut pete...@gmx.net:
  On tis, 2010-11-16 at 20:00 +0100, Pavel Stehule wrote:
  yes - my first question is: Why we need to specify encoding, when only
  one encoding is supported? I can't to use a cs_CZ.iso88592 when my db
  use a UTF8 - btw there is wrong message:
 
  yyy=# select * from jmena order by jmeno collate cs_CZ.iso88592;
  ERROR:  collation cs_CZ.iso88592 for current database encoding
  UTF8 does not exist
  LINE 1: select * from jmena order by jmeno collate cs_CZ.iso88592;
 ^
 
  Sorry, is there some mistake in that message?
 
 
 it is unclean - I expect some like cannot to use collation
 cs_CZ.iso88502, because your database use a utf8 encoding.

No, the namespace for collations is per encoding.  (This is per SQL
standard.)  So you *could* have a collation called cs_CZ.iso88592 for
the UTF8 encoding.

 I have not a problem with size. Just I think, current behave isn't
 practical. When database encoding is utf8, then I except, so cs_CZ
 or czech will be for utf8. I understand, so template0 must have a
 all locales, and I understand why current behave is, but it is very
 user unfriendly. Actually, only old application in CR uses latin2,
 almost all uses a utf, but now latin2 is preferred. This is bad and
 should be solved.

Again, we can only look at the locale names that the operating system
gives us.  Mapping that to the names you expect is an AI problem.  If
you have a solution, please share it.





Re: [HACKERS] Per-column collation

2010-11-16 Thread Peter Eisentraut
On tis, 2010-11-16 at 21:05 +0100, marcin mank wrote:
 It would be nice if we could have some mapping of locale names bult
 in, so one doesn`t have to write alternative sql depending on DB
 server OS:
 select * from tab order by foo collate Polish, Poland
 select * from tab order by foo collate pl_PL.UTF-8

Sure that would be nice, but how do you hope to do that?




Re: [HACKERS] Per-column collation

2010-11-16 Thread Pavel Stehule
2010/11/16 Peter Eisentraut pete...@gmx.net:
 On tis, 2010-11-16 at 20:59 +0100, Pavel Stehule wrote:
 2010/11/16 Peter Eisentraut pete...@gmx.net:
  On tis, 2010-11-16 at 20:00 +0100, Pavel Stehule wrote:
  yes - my first question is: Why we need to specify encoding, when only
  one encoding is supported? I can't to use a cs_CZ.iso88592 when my db
  use a UTF8 - btw there is wrong message:
 
  yyy=# select * from jmena order by jmeno collate cs_CZ.iso88592;
  ERROR:  collation cs_CZ.iso88592 for current database encoding
  UTF8 does not exist
  LINE 1: select * from jmena order by jmeno collate cs_CZ.iso88592;
                                             ^
 
  Sorry, is there some mistake in that message?
 

 it is unclean - I expect some like cannot to use collation
 cs_CZ.iso88502, because your database use a utf8 encoding.

 No, the namespace for collations is per encoding.  (This is per SQL
 standard.)  So you *could* have a collation called cs_CZ.iso88502 for
 the UTF8 encoding.

 I have not a problem with size. Just I think, current behave isn't
 practical. When database encoding is utf8, then I except, so cs_CZ
 or czech will be for utf8. I understand, so template0 must have a
 all locales, and I understand why current behave is, but it is very
 user unfriendly. Actually, only old application in CR uses latin2,
 almost all uses a utf, but now latin2 is preferred. This is bad and
 should be solved.

 Again, we can only look at the locale names that the operating system
 gives us.  Mapping that to the names you expect is an AI problem.  If
 you have a solution, please share it.


ok, then we should be able to define such an alias manually,

something like: CREATE COLLATION czech FOR LOCALE cs_CZ.UTF8

or similar. Without this, applications or stored procedures may not be
portable between UNIX and Windows.

Peter, initdb already checks the relationship between encoding and
locale - and that check is portable. Can we reuse this code?

Pavel






Re: [HACKERS] Per-column collation

2010-11-16 Thread Peter Eisentraut
On tis, 2010-11-16 at 21:40 +0100, Pavel Stehule wrote:
 ok, then we should to define this alias manually
 
 some like - CREATE COLLATE czech FOR LOCALE cs_CZ.UTF8
 
 or some similar. Without this, the application or stored procedures
 can be non portable between UNIX and WIN.

Yes, such a command will be provided.  You can already do it manually.

 Peter, now initdb check relation between encoding and locale - and
 this check is portable. Can we use this code?

Hmm, not really, but something similar, I suppose.  Only that the
mapping list would be much longer and more volatile.
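For reference, the manual route Peter mentions amounts to copying an existing pg_collation row under a new name. A hedged sketch - the column list below is an assumption based on the patch's catalog layout, and the commented command is only a guess at the eventual CREATE COLLATION grammar:

```sql
-- Hypothetical sketch: create an alias "czech" for an existing collation
-- in the database's own encoding.
INSERT INTO pg_collation (collname, collnamespace, collowner,
                          collencoding, collcollate, collctype)
SELECT 'czech', collnamespace, collowner,
       collencoding, collcollate, collctype
FROM pg_collation
WHERE collname = 'cs_CZ.utf8';

-- A future command might reduce this to something like:
-- CREATE COLLATION czech (LOCALE = 'cs_CZ.utf8');
```

Since the alias is a per-database catalog entry, a dump/restore target would need the same alias created locally for the SQL to stay portable.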





Re: [HACKERS] Per-column collation

2010-11-16 Thread Martijn van Oosterhout
On Tue, Nov 16, 2010 at 10:32:01PM +0200, Peter Eisentraut wrote:
 On tis, 2010-11-16 at 21:05 +0100, marcin mank wrote:
  It would be nice if we could have some mapping of locale names bult
  in, so one doesn`t have to write alternative sql depending on DB
  server OS:
  select * from tab order by foo collate Polish, Poland
  select * from tab order by foo collate pl_PL.UTF-8
 
 Sure that would be nice, but how do you hope to do that?

Given that each operating system comes with a different set of
collations, it seems unlikely you could find two collations on
different OSes that even correspond. There's not a lot of
standardisation here (well, except for the Unicode collation
algorithm, but that doesn't help with language variations).

I don't think this is a big deal for now, perhaps after per-column
collation is implemented we can work on the portability issues. Make it
work, then make it better.

/me ducks

Have a nice day,
-- 
Martijn van Oosterhout   klep...@svana.org   http://svana.org/kleptog/
 Patriotism is when love of your own people comes first; nationalism,
 when hate for people other than your own comes first. 
   - Charles de Gaulle




Re: [HACKERS] Per-column collation

2010-11-16 Thread Tom Lane
Martijn van Oosterhout klep...@svana.org writes:
 On Tue, Nov 16, 2010 at 10:32:01PM +0200, Peter Eisentraut wrote:
 On tis, 2010-11-16 at 21:05 +0100, marcin mank wrote:
 It would be nice if we could have some mapping of locale names bult
 in, so one doesn`t have to write alternative sql depending on DB
 server OS:

 Sure that would be nice, but how do you hope to do that?

 Given that each operating system comes with a different set of
 collations, it seems unlikely you could even find two collations on
 different OSes that even correspond.

Yeah, the *real* portability problem here is that the locale behavior is
likely to be different, not just the name.  I don't think we'd be doing
people many favors by masking behavioral differences behind a forced
common name.

regards, tom lane



Re: [HACKERS] Per-column collation

2010-11-16 Thread Pavel Stehule
2010/11/16 Tom Lane t...@sss.pgh.pa.us:
 Martijn van Oosterhout klep...@svana.org writes:
 On Tue, Nov 16, 2010 at 10:32:01PM +0200, Peter Eisentraut wrote:
 On tis, 2010-11-16 at 21:05 +0100, marcin mank wrote:
 It would be nice if we could have some mapping of locale names bult
 in, so one doesn`t have to write alternative sql depending on DB
 server OS:

 Sure that would be nice, but how do you hope to do that?

 Given that each operating system comes with a different set of
 collations, it seems unlikely you could even find two collations on
 different OSes that even correspond.

 Yeah, the *real* portability problem here is that the locale behavior is
 likely to be different, not just the name.  I don't think we'd be doing
 people many favors by masking behavioral differences between a forced
 common name.


no, at minimum cs_CZ.utf8 and cs_CZ.iso88592 behave the same. But
without an alias, the user has to modify source code whenever he
changes the encoding.

Pavel

                        regards, tom lane




Re: [HACKERS] Per-column collation

2010-11-15 Thread Pavel Stehule
Hello

I am checking the patch. I found a problem with initdb:

[postg...@pavel-stehule postgresql]$ /usr/local/pgsql/bin/initdb -D
/usr/local/pgsql/data/
could not change directory to /home/pavel/src/postgresql
The files belonging to this database system will be owned by user postgres.
This user must also own the server process.

The database cluster will be initialized with locale en_US.utf8.
The default database encoding has accordingly been set to UTF8.
The default text search configuration will be set to english.

fixing permissions on existing directory /usr/local/pgsql/data ... ok
creating subdirectories ... ok
selecting default max_connections ... 100
selecting default shared_buffers ... 24MB
creating configuration files ... ok
creating template1 database in /usr/local/pgsql/data/base/1 ... ok
initializing pg_authid ... ok
initializing dependencies ... ok
creating system views ... ok
loading system objects' descriptions ... ok
creating collations ... initdb: locale name has non-ASCII characters, skipped: bokmål
initdb: locale name has non-ASCII characters, skipped: français
could not determine encoding for locale hy_AM.armscii8: codeset is ARMSCII-8
could not determine encoding for locale ka_GE: codeset is GEORGIAN-PS
could not determine encoding for locale ka_GE.georgianps: codeset is
GEORGIAN-PS
could not determine encoding for locale kk_KZ: codeset is PT154
could not determine encoding for locale kk_KZ.pt154: codeset is PT154
could not determine encoding for locale tg_TJ: codeset is KOI8-T
could not determine encoding for locale tg_TJ.koi8t: codeset is KOI8-T
could not determine encoding for locale thai: codeset is TIS-620
could not determine encoding for locale th_TH: codeset is TIS-620
could not determine encoding for locale th_TH.tis620: codeset is TIS-620
could not determine encoding for locale vi_VN.tcvn: codeset is TCVN5712-1
FATAL:  invalid byte sequence for encoding UTF8: 0xe56c27
child process exited with exit code 1
initdb: removing contents of data directory /usr/local/pgsql/data

tested on fedora 13

[postg...@pavel-stehule local]$ locale -a| wc -l
731

Regards

Pavel Stehule



2010/11/15 Peter Eisentraut pete...@gmx.net:
 Here is the next patch in this epic series. [0]

 I have addressed most of the issues pointed out in previous reviews and
 removed all major outstanding problems that were marked in the code.  So
 it might just almost really work.

 The documentation now also covers everything that's interesting, so
 newcomers can start with that.


 For those who have previously reviewed this, two major changes:

 * The locales to be loaded are now computed by initdb, no longer during
 the build process.

 * The regression test file has been removed from the main test set.  To
 run it, use

 make check MULTIBYTE=UTF8 EXTRA_TESTS=collate


 Stuff that still cannot be expected to work:

 * no CREATE COLLATION yet, maybe later

 * no support for regular expression searches

 * not text search support

 These would not be release blockers, I think.


 [0] 
 http://archives.postgresql.org/message-id/1284583568.4696.20.ca...@vanquo.pezone.net







Re: [HACKERS] Per-column collation

2010-11-15 Thread Pavel Stehule
Hello

2010/11/15 Peter Eisentraut pete...@gmx.net:
 On mån, 2010-11-15 at 11:34 +0100, Pavel Stehule wrote:
 I am checking a patch. I found a problem with initdb

 Ah, late night brain farts, it appears.  Here is a corrected version.



yes, it's ok now.

I still see a few issues:

a) the default encoding for a collation isn't the same as the default
encoding of the database

it's unfriendly, to say the least - the most widely used encoding is
UTF8, but in most cases users still have to write locale.utf8.

b) there is a bug - the default collation (the database collation) is
ignored


postgres=# show lc_collate;
 lc_collate
────────────
 cs_CZ.UTF8
(1 row)

Time: 0.518 ms
postgres=# select * from jmena order by v;
 v
───
 Chromečka
 Crha
 Drobný
 Čečetka
(4 rows)

postgres=# select * from jmena order by v collate cs_CZ.utf8;
 v
───
 Crha
 Čečetka
 Drobný
 Chromečka
(4 rows)

both result should be same.
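For anyone reproducing this: the two orderings differ because Czech treats č as a letter between c and d, and the digraph ch as a letter after h, which no codepoint sort can know. A self-contained toy sketch (a hand-rolled key over a simplified alphabet, not the OS locale, and not how PostgreSQL implements it):

```python
# Toy illustration of why codepoint order differs from Czech collation rules.
names = ["Chromečka", "Crha", "Drobný", "Čečetka"]

# Codepoint order puts Č (U+010C) after all ASCII letters - the wrong result:
print(sorted(names))  # ['Chromečka', 'Crha', 'Drobný', 'Čečetka']

# Simplified Czech alphabet: č sits between c and d, the digraph "ch" after h.
ALPHABET = ["a", "b", "c", "č", "d", "e", "é", "ě", "f", "g", "h", "ch",
            "i", "í", "j", "k", "l", "m", "n", "ň", "o", "ó", "p", "q",
            "r", "ř", "s", "š", "t", "ť", "u", "ú", "ů", "v", "w", "x",
            "y", "ý", "z", "ž"]
RANK = {letter: i for i, letter in enumerate(ALPHABET)}

def czech_key(word):
    w = word.lower()
    key, i = [], 0
    while i < len(w):
        if w[i:i + 2] == "ch":          # digraph counts as one letter
            key.append(RANK["ch"])
            i += 2
        else:
            key.append(RANK.get(w[i], len(ALPHABET)))
            i += 1
    return key

# Matches the order a correct cs_CZ collation produces for these names:
print(sorted(names, key=czech_key))  # ['Crha', 'Čečetka', 'Drobný', 'Chromečka']
```

The point of the sketch is only that the correct answer depends on language-specific tailoring that must come from the collation, not from the byte values.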

isn't there a problem with case-sensitive collation names? When I use
the lc_collate value, I get an error message:

postgres=# select * from jmena order by v collate cs_CZ.UTF8;
ERROR:  collation cs_CZ.UTF8 for current database encoding UTF8
does not exist
LINE 1: select * from jmena order by v collate cs_CZ.UTF8;

the problem occurs when the table is created without an explicit
collation.

Regards

Pavel Stehule



Re: [HACKERS] Per-column collation, work in progress

2010-10-21 Thread Peter Eisentraut
On tor, 2010-10-14 at 22:54 -0400, Robert Haas wrote:
 and maybe not that bad, but I wonder if there is some preparatory
 refactoring that could be done to trim it down a bit.  I notice, for
 example, that a lot of places that looked at asc/desc, nulls
 first/last now look at asc/desc, nulls first/last, collation.  In
 particular, all the pathkey stuff is like this.  And similarly places
 that used to care about type, typmod now have to care about type,
 tymod, collation.  There might be ways to restructure some of this
 code so that these things can be changed without having to touch quite
 so many places.

Yeah, I think that's what I'll try to do next.

We already have TypeName as a structure that contains type and typmod
(and collation, in my patch).  We could make that a primnode instead of
a parsenode, and use it in more places, or we could make a new leaner
structure that only contains the numeric info.

We could then, for example, change things like this:

typedef struct Var
{
    Expr        xpr;
    ...
    Oid         vartype;
    int32       vartypmod;
    ...
}

into this

typedef struct Var
{
    Expr        xpr;
    ...
    TypeName/TypeFoo vartype;
    ...
}

This would save boatloads of duplicate code.

 It looks like you've define collations as objects that exist within
 particular namespaces, but there's no CREATE COLLATION statement, so I
 don't see what purpose this serves.  I suppose we could leave that to
 be added later, but is there actually a use case for having collations
 in individual schemas, or should we treat them more like we do casts -
 i.e. as database-global objects?

The SQL standard defines it that way, and there should be a CREATE
COLLATION statement later.  Application-specific collation sequences
might not be unreasonable in the future.

 Why does the executor ever need to see collate clauses?

Hmm, maybe not.  I think it did in an earlier working draft.





Re: [HACKERS] Per-column collation, work in progress

2010-10-21 Thread Robert Haas
On Thu, Oct 21, 2010 at 2:44 PM, Peter Eisentraut pete...@gmx.net wrote:
 On tor, 2010-10-14 at 22:54 -0400, Robert Haas wrote:
 and maybe not that bad, but I wonder if there is some preparatory
 refactoring that could be done to trim it down a bit.  I notice, for
 example, that a lot of places that looked at asc/desc, nulls
 first/last now look at asc/desc, nulls first/last, collation.  In
 particular, all the pathkey stuff is like this.  And similarly places
 that used to care about type, typmod now have to care about type,
 tymod, collation.  There might be ways to restructure some of this
 code so that these things can be changed without having to touch quite
 so many places.

 Yeah, I think that's what I'll try to do next.

 We already have TypeName as a structure that contains type and typmod
 (and collation, in my patch).  We could make that a primnode instead of
 a parsenode, and use it in more places, or we could make a new leaner
 structure that only contains the numeric info.

 We could then, for example, change things like this:

 typedef struct Var
 {
    Expr        xpr;
    ...
    Oid         vartype;
    int32       vartypmod;
    ...
 }

 into this

 typedef struct Var
 {
    Expr        xpr;
    ...
    TypeName/TypeFoo vartype;
    ...
 }

 This would save boatloads of duplicate code.

I think that the idea of having a node that represents a type in all
its glory is a very good one.  I'm somewhat inclined not to reuse
TypeName, because I think we'll end up wanting to use this in places
where names and location are not available.  In fact, judging by
some of the logic in LookupTypeNames(), we have some cases like that
already, which might be worth trying to clean up.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Per-column collation, work in progress

2010-10-21 Thread Tom Lane
Peter Eisentraut pete...@gmx.net writes:
 We already have TypeName as a structure that contains type and typmod
 (and collation, in my patch).  We could make that a primnode instead of
 a parsenode, and use it in more places, or we could make a new leaner
 structure that only contains the numeric info.

TypeName per se is completely inappropriate for use beyond the first
stage of parsing, because it requires catalog lookups to make any sense
of.  I think the post-parsing representation should still start with a
type OID.  I can agree with replacing typmod with a struct, though.

regards, tom lane



Re: [HACKERS] Per-column collation, work in progress

2010-10-21 Thread Robert Haas
On Thu, Oct 21, 2010 at 4:28 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 Peter Eisentraut pete...@gmx.net writes:
 We already have TypeName as a structure that contains type and typmod
 (and collation, in my patch).  We could make that a primnode instead of
 a parsenode, and use it in more places, or we could make a new leaner
 structure that only contains the numeric info.

 TypeName per se is completely inappropriate for use beyond the first
 stage of parsing, because it requires catalog lookups to make any sense
 of.  I think the post-parsing representation should still start with a
 type OID.  I can agree with replacing typmod with a struct, though.

I think we should have both the type OID and the typmod in the struct.
Carrying the type OID separately from the typmod has caused us enough
heartache already.  No?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Per-column collation, work in progress

2010-10-21 Thread Tom Lane
Robert Haas robertmh...@gmail.com writes:
 On Thu, Oct 21, 2010 at 4:28 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 TypeName per se is completely inappropriate for use beyond the first
 stage of parsing, because it requires catalog lookups to make any sense
 of.  I think the post-parsing representation should still start with a
 type OID.  I can agree with replacing typmod with a struct, though.

 I think we should have both the type OID and the typmod in the struct.
  Carrying the type OID separately from the typmod has caused us enough
 heartache already.  No?

I think that that would probably involve a whole lot more notational
busywork than just replacing typmod with something more complicated.
However, we're talking about breaking vast amounts of code in either
case, so maybe making it even vaster isn't a real consideration.

regards, tom lane



Re: [HACKERS] Per-column collation, work in progress

2010-10-21 Thread Josh Berkus

 I think that that would probably involve a whole lot more notational
 busywork than just replacing typmod with something more complicated.
 However, we're talking about breaking vast amounts of code in either
 case, so maybe making it even vaster isn't a real consideration.

Gods, yes.  Please let's not extend typmod any further without an overhaul.

-- 
  -- Josh Berkus
 PostgreSQL Experts Inc.
 http://www.pgexperts.com



Re: [HACKERS] Per-column collation, work in progress

2010-10-14 Thread Robert Haas
On Fri, Sep 24, 2010 at 1:57 AM, Peter Eisentraut pete...@gmx.net wrote:
 On fre, 2010-09-24 at 09:32 +0900, Itagaki Takahiro wrote:
 On Wed, Sep 22, 2010 at 10:29 PM, Peter Eisentraut pete...@gmx.net wrote:
  We could support it also on MSVC.
  http://msdn.microsoft.com/en-us/library/a7cwbx4t(v=VS.90).aspx -- 
  _strcoll_l
  http://msdn.microsoft.com/en-us/library/45119yx3(v=VS.90).aspx -- 
  _towupper_l
 
  Great.

 If we support both glibc and msvc, how to we handle CREATE TABLE DDLs
 in pg_dump? Since collation names depend on platforms, a backup dumped
 at UNIX cannot be reloaded to Windows. We might need to normalize
 locale names to generate a portable dump.

 It is not necessary that the SQL collation names are the same as the OS
 locale names.  That is just a convenient way to initialize it.  If you
 need to transport dumps, you can create your own SQL collation entry at
 the target that locally fits what you are trying to do (or perhaps
 rename the collation at the source).

 I don't think there is a universally applicable way to normalize
 locale names, because we want to support user-defined OS locales.

What's the status of this patch?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Per-column collation, work in progress

2010-10-14 Thread Peter Eisentraut
On ons, 2010-10-13 at 19:15 -0400, Robert Haas wrote:
 What's the status of this patch?

I would appreciate some more review of the basic architecture.




Re: [HACKERS] Per-column collation, work in progress

2010-10-14 Thread Robert Haas
On Thu, Oct 14, 2010 at 12:53 PM, Peter Eisentraut pete...@gmx.net wrote:
 On ons, 2010-10-13 at 19:15 -0400, Robert Haas wrote:
 What's the status of this patch?

 I would appreciate some more review of the basic architecture.

*reads patch*

Wow, what a patch.  On the whole, I think this looks pretty good.  Of
course, the obvious thing to be dismayed about is how many parts of
the system this touches.  To some extent, that's probably inevitable
and maybe not that bad, but I wonder if there is some preparatory
refactoring that could be done to trim it down a bit.  I notice, for
example, that a lot of places that looked at (asc/desc, nulls
first/last) now look at (asc/desc, nulls first/last, collation).  In
particular, all the pathkey stuff is like this.  And similarly, places
that used to care about (type, typmod) now have to care about (type,
typmod, collation).  There might be ways to restructure some of this
code so that these things can be changed without having to touch quite
so many places.  If we're extending these lists from two items to
three, do we need to worry about what happens when they grow to four
or five or six?  I particularly think this is an issue for the type
information.  We are still finding bugs where typmod isn't carried
through properly; this kind of thing is only going to make it much
worse.  We need to encapsulate it in some future-proof way.

It seems you've falsified the header comment in
pathkeys_useful_for_merging(), although I guess it's already false
because it doesn't seem to have been updated for the NULLS ASC/DESC
stuff, and the interior comment in right_merge_direction() also needs
adjusting.  But this might be more than a documentation problem,
because the choice of merge direction really *is* arbitrary in the
case of ASC/DESC and NULLS FIRST/LAST, but I'm not sure whether that's
actually true for collation.  If collation affects the definition of
equality then it certainly isn't true.

It looks like you've defined collations as objects that exist within
particular namespaces, but there's no CREATE COLLATION statement, so I
don't see what purpose this serves.  I suppose we could leave that to
be added later, but is there actually a use case for having collations
in individual schemas, or should we treat them more like we do casts -
i.e. as database-global objects?

Why does the executor ever need to see collate clauses?

In the department of minor nits, the use of the word "respectively" in
the CREATE INDEX documentation doesn't make sense to me.  The message
about "has a collation conflict" is not self-explanatory.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Per-column collation, work in progress

2010-09-26 Thread Pavel Stehule
Hello Peter

Is there any reason why you prohibit different encodings within one
database?  What people actually expect from per-column collation is
the possibility to store two or more different encodings in one
database.  Without this possibility, only UTF8 is practical to work
with - and for other encodings, only pairs (national locale + C).
From my perspective (as a Czech programmer) a very typical situation
and request is to mix latin2 and latin1.  I can live with the limit,
but it is a very hard limit and should be documented.

Regards

Pavel

2010/9/15 Peter Eisentraut pete...@gmx.net:
 Following up on my previous patch [0], here is a fairly complete
 implementation of this feature.  The general description and
 implementation outline of the previous message still apply.  This patch
 contains documentation and regression tests, which can serve as further
 explanations.

 As this patch touches pretty much everything in the system, there are
 probably countless bugs and bogosities, some of which I have marked with
 FIXME, TODO, etc.  But all the functionality is basically there, so it's
 time someone else gives this a serious examination.

 Note: As previously, regression tests only work with make check
 MULTIBYTE=UTF8 and the feature overall only works on Linux/glibc.

 [0]
 http://archives.postgresql.org/message-id/1279045531.32647.14.ca...@vanquo.pezone.net







Re: [HACKERS] Per-column collation, work in progress

2010-09-26 Thread Greg Stark
On Sun, Sep 26, 2010 at 1:15 PM, Pavel Stehule pavel.steh...@gmail.com wrote:
 Is there any reason why you prohibit a different encodings per one
 database? Actually people expect from collate per column a possibility
 to store a two or more different encodings per one database.

These are two completely separate problems that only look related. The
main difference is that while collation is a property of the
comparison or sort you're performing, encoding is actually a property
of the string itself. It doesn't make sense to specify a different
encoding than what the string actually contains.

You could actually do what you want now by using bytea columns and
convert_to/convert_from and it wouldn't be much easier if the support
were built into text since you would still have to keep track of the
encoding it's in and the encoding you want. We could have an
encoded_text data type which includes both the encoding and the string
and which any comparison function automatically handles conversion
based on the encoding of the collation requested -- but I wouldn't
want that to be the default text datatype. It would impose a lot of
overhead on the basic text operations and magnify the effects of
choosing the wrong collation.
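The bytea workaround described above can be sketched as follows; convert_to() and convert_from() are stock PostgreSQL functions, while the table and column names here are purely illustrative:

```sql
-- Keep raw bytes in a bytea column and track the encoding yourself.
CREATE TABLE mixed_text (
    id   serial PRIMARY KEY,
    raw  bytea,   -- bytes in some external encoding
    enc  text     -- e.g. 'LATIN2', tracked by the application
);

-- Going in: convert a server-encoding string to latin2 bytes.
INSERT INTO mixed_text (raw, enc)
VALUES (convert_to('příliš žluťoučký', 'LATIN2'), 'LATIN2');

-- Coming out: decode back into the server encoding before
-- comparing or sorting.
SELECT convert_from(raw, enc) FROM mixed_text;
```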

-- 
greg



Re: [HACKERS] Per-column collation, work in progress

2010-09-26 Thread Andrew Dunstan



On 09/26/2010 09:37 AM, Greg Stark wrote:

We could have an
encoded_text data type which includes both the encoding and the string
and which any comparison function automatically handles conversion
based on the encoding of the collation requested -- but I wouldn't
want that to be the default text datatype. It would impose a lot of
overhead on the basic text operations and magnify the effects of
choosing the wrong collation.


Yeah, but it would be a nice gadget to have.

cheers

andrew



Re: [HACKERS] Per-column collation, work in progress

2010-09-24 Thread Peter Eisentraut
On tor, 2010-09-23 at 11:55 +0200, Pavel Stehule wrote:
  select to_char(current_date,'tmday' collate cs_CZ.utf8);
 
 I am thinking, collates can be used for this purpose too. I see some
 impacts - this syntax changes a stable function to immutable and it
 cannot be simple to solve.

I don't understand how you come to that conclusion.





Re: [HACKERS] Per-column collation, work in progress

2010-09-24 Thread Pavel Stehule
2010/9/24 Peter Eisentraut pete...@gmx.net:
 On tor, 2010-09-23 at 11:55 +0200, Pavel Stehule wrote:
  select to_char(current_date,'tmday' collate cs_CZ.utf8);

 I am thinking, collates can be used for this purpose too. I see some
 impacts - this syntax changes a stable function to immutable and it
 cannot be simple to solve.

 I don't understand how you come to that conclusion.

sorry, I was wrong - it makes sense for the date output function.
to_char is immutable everywhere

Pavel







Re: [HACKERS] Per-column collation, work in progress

2010-09-23 Thread Pavel Stehule
Hello

I am playing with your patch now. I found a few issues:

1. It doesn't work with the SQL-92 rules for the sort-by list. I could
understand if explicit COLLATE didn't work, but implicit use doesn't
work either:

CREATE TABLE foo(a text, b text COLLATE cs_CZ.UTF8)

SELECT * FROM foo ORDER BY 1 -- produce wrong order

2. Why is the default encoding for a collation static? There are
latin2 for Czech, cs_CZ and cs_CZ.iso88592, so any user with UTF8 has
to write the encoding explicitly. But the more used and preferred
encoding now is UTF8. I think cs_CZ on a UTF8 database should mean
cs_CZ.UTF8.

3. postgres=# select to_char(current_date,'tmday') collate cs_CZ.utf8;
 to_char
──
 thursday -- bad result
(1 row)

4. Is there a ToDo somewhere for the collation implementation?

5.

postgres=# create table xy(a text, b text collate cs_CZ);
ERROR:  collation cs_CZ for current database encoding UTF8 does not exist

could there be some more friendly message or hint? Like: you cannot
use a different encoding; this collation is in the pg_collation table.

--
The patch applied cleanly and works very well. Thank you.

Regards

Pavel Stehule


2010/9/15 Peter Eisentraut pete...@gmx.net:
 Following up on my previous patch [0], here is a fairly complete
 implementation of this feature.  The general description and
 implementation outline of the previous message still apply.  This patch
 contains documentation and regression tests, which can serve as further
 explanations.

 As this patch touches pretty much everything in the system, there are
 probably countless bugs and bogosities, some of which I have marked with
 FIXME, TODO, etc.  But all the functionality is basically there, so it's
 time someone else gives this a serious examination.

 Note: As previously, regression tests only work with make check
 MULTIBYTE=UTF8 and the feature overall only works on Linux/glibc.

 [0]
 http://archives.postgresql.org/message-id/1279045531.32647.14.ca...@vanquo.pezone.net







Re: [HACKERS] Per-column collation, work in progress

2010-09-23 Thread Itagaki Takahiro
On Thu, Sep 23, 2010 at 5:12 PM, Pavel Stehule pavel.steh...@gmail.com wrote:
 3. postgres=# select to_char(current_date,'tmday') collate cs_CZ.utf8;
  to_char
 ──
  thursday -- bad result
 (1 row)

COLLATE means collation rather than locale, no?

 5.
 postgres=# create table xy(a text, b text collate cs_CZ);
 ERROR:  collation cs_CZ for current database encoding UTF8 does not exist
 can be there some more friendly message or hint ?

I hope Postgres automatically detects the omitted encoding
because it knows the database encoding is UTF8.

-- 
Itagaki Takahiro



Re: [HACKERS] Per-column collation, work in progress

2010-09-23 Thread Pavel Stehule
2010/9/23 Itagaki Takahiro itagaki.takah...@gmail.com:
 On Thu, Sep 23, 2010 at 5:12 PM, Pavel Stehule pavel.steh...@gmail.com 
 wrote:
 3. postgres=# select to_char(current_date,'tmday') collate cs_CZ.utf8;
  to_char
 ──
  thursday -- bad result
 (1 row)

 COLLATE means collation rather than locale, no?

ok.


 5.
 postgres=# create table xy(a text, b text collate cs_CZ);
 ERROR:  collation cs_CZ for current database encoding UTF8 does not exist
 can be there some more friendly message or hint ?

 I hope Postgres automatically detects the omitted encoding
 because it knows the database encoding is UTF8.

I know what this issue means, but it needs some detail or hint I think

Regards

Pavel


 --
 Itagaki Takahiro




Re: [HACKERS] Per-column collation, work in progress

2010-09-23 Thread Peter Eisentraut
On tor, 2010-09-23 at 10:12 +0200, Pavel Stehule wrote:
 1. It's doesn't work with SQL 92 rules for sortby list. I can
 understand so explicit COLLATE using doesn't work, but the implicit
 using doesn't work too:
 
 CREATE TABLE foo(a text, b text COLLATE cs_CZ.UTF8)
 
 SELECT * FROM foo ORDER BY 1 -- produce wrong order

I can't reproduce that.  Please provide more details.

 2. Why default encoding for collate is static? There are latin2 for
 czech, cs_CZ and cs_CZ.iso88592. So any user with UTF8 has to write
 encoding explicitly. But the more used and preferred encoding is UTF8
 now. I am thinking so cs_CZ on utf8 database should mean cs_CS.UTF8.

That's tweakable.  One idea I had is to strip the .utf8 suffix from
locale names when populating the pg_collation catalog, or create both
versions.  I agree that the current way is a bit cumbersome.

 3. postgres=# select to_char(current_date,'tmday') collate cs_CZ.utf8;
  to_char
 ──
  thursday -- bad result
 (1 row)

As was already pointed out, collation only covers lc_collate and
lc_ctype.  (It could cover other things, for example an application to
the money type was briefly discussed, but that's outside the current
mandate.)

As a point of order, what you wrote above attaches a collation to the
result of the function call.  To get the collation to apply to the
function call itself, you have to put the collate clause on one of the
arguments, e.g.,

select to_char(current_date,'tmday' collate cs_CZ.utf8);

 4. is somewhere ToDo for collation implementation?

At the moment it's mostly in the source code.  I have a list of notes
locally that I can clean up and put in the wiki once we agree on the
general direction.

 5.
 
 postgres=# create table xy(a text, b text collate cs_CZ);
 ERROR:  collation cs_CZ for current database encoding UTF8 does not exist
 
 can be there some more friendly message or hint ? like you cannot to
 use a different encoding. This collate is in pg_collates table.

That can surely be polished.





Re: [HACKERS] Per-column collation, work in progress

2010-09-23 Thread Peter Eisentraut
On tor, 2010-09-23 at 17:29 +0900, Itagaki Takahiro wrote:
 On Thu, Sep 23, 2010 at 5:12 PM, Pavel Stehule pavel.steh...@gmail.com 
 wrote:
  5.
  postgres=# create table xy(a text, b text collate cs_CZ);
  ERROR:  collation cs_CZ for current database encoding UTF8 does not 
  exist
  can be there some more friendly message or hint ?
 
 I hope Postgres automatically detects the omitted encoding
 because it knows the database encoding is UTF8.

I would rather not build too many expectations into this yet.  The
collation names are chosen by the user, the locale names are from the
operating system.  There is not necessarily a correspondence.  The best
fix is probably what I described earlier, populate the pg_collation
table with the .utf8 suffix stripped.
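
As an illustration of that proposed normalization (only the name
transformation; in the patch the actual catalog population would
happen at initdb time), the mapping would look like:

```sql
-- Derive a bare collation name by stripping a trailing ".utf8"
-- from the OS locale name; names without the suffix pass through.
SELECT locname,
       regexp_replace(locname, '\.utf8$', '') AS collation_name
FROM (VALUES ('cs_CZ.utf8'), ('de_DE.utf8'), ('C')) AS t(locname);
-- cs_CZ.utf8 -> cs_CZ, de_DE.utf8 -> de_DE, C stays C
```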




Re: [HACKERS] Per-column collation, work in progress

2010-09-23 Thread Pavel Stehule
2010/9/23 Peter Eisentraut pete...@gmx.net:
 On tor, 2010-09-23 at 10:12 +0200, Pavel Stehule wrote:
 1. It's doesn't work with SQL 92 rules for sortby list. I can
 understand so explicit COLLATE using doesn't work, but the implicit
 using doesn't work too:

 CREATE TABLE foo(a text, b text COLLATE cs_CZ.UTF8)

 SELECT * FROM foo ORDER BY 1 -- produce wrong order

 I can't reproduce that.  Please provide more details.

sorry, it is ok - I was confused


 2. Why default encoding for collate is static? There are latin2 for
 czech, cs_CZ and cs_CZ.iso88592. So any user with UTF8 has to write
 encoding explicitly. But the more used and preferred encoding is UTF8
 now. I am thinking so cs_CZ on utf8 database should mean cs_CS.UTF8.

 That's tweakable.  One idea I had is to strip the .utf8 suffix from
 locale names when populating the pg_collation catalog, or create both
 versions.  I agree that the current way is a bit cumbersome.


Yes; almost all databases are in UTF8 now.

 3. postgres=# select to_char(current_date,'tmday') collate cs_CZ.utf8;
  to_char
 ──
  thursday -- bad result
 (1 row)

 As was already pointed out, collation only covers lc_collate and
 lc_ctype.  (It could cover other things, for example an application to
 the money type was briefly discussed, but that's outside the current
 mandate.)

ook

 As a point of order, what you wrote above attaches a collation to the
 result of the function call.  To get the collation to apply to the
 function call itself, you have to put the collate clause on one of the
 arguments, e.g.,

 select to_char(current_date,'tmday' collate cs_CZ.utf8);

I am thinking, collates can be used for this purpose too. I see some
impacts - this syntax changes a stable function to immutable and it
cannot be simple to solve.


 4. is somewhere ToDo for collation implementation?

 At the moment it's mostly in the source code.  I have a list of notes
 locally that I can clean up and put in the wiki once we agree on the
 general direction.

 5.

 postgres=# create table xy(a text, b text collate cs_CZ);
 ERROR:  collation cs_CZ for current database encoding UTF8 does not exist

 can be there some more friendly message or hint ? like you cannot to
 use a different encoding. This collate is in pg_collates table.

 That can surely be polished.



Regards

Pavel Stehule



Re: [HACKERS] Per-column collation, work in progress

2010-09-23 Thread Itagaki Takahiro
On Wed, Sep 22, 2010 at 10:29 PM, Peter Eisentraut pete...@gmx.net wrote:
 We could support it also on MSVC.
 http://msdn.microsoft.com/en-us/library/a7cwbx4t(v=VS.90).aspx -- _strcoll_l
 http://msdn.microsoft.com/en-us/library/45119yx3(v=VS.90).aspx -- _towupper_l

 Great.

If we support both glibc and MSVC, how do we handle CREATE TABLE DDLs
in pg_dump? Since collation names depend on the platform, a backup
dumped on UNIX cannot be reloaded on Windows. We might need to
normalize locale names to generate a portable dump.

-- 
Itagaki Takahiro



Re: [HACKERS] Per-column collation, work in progress

2010-09-23 Thread Peter Eisentraut
On fre, 2010-09-24 at 09:32 +0900, Itagaki Takahiro wrote:
 On Wed, Sep 22, 2010 at 10:29 PM, Peter Eisentraut pete...@gmx.net wrote:
  We could support it also on MSVC.
  http://msdn.microsoft.com/en-us/library/a7cwbx4t(v=VS.90).aspx -- 
  _strcoll_l
  http://msdn.microsoft.com/en-us/library/45119yx3(v=VS.90).aspx -- 
  _towupper_l
 
  Great.
 
 If we support both glibc and msvc, how to we handle CREATE TABLE DDLs
 in pg_dump? Since collation names depend on platforms, a backup dumped
 at UNIX cannot be reloaded to Windows. We might need to normalize
 locale names to generate a portable dump.

It is not necessary that the SQL collation names are the same as the OS
locale names.  That is just a convenient way to initialize it.  If you
need to transport dumps, you can create your own SQL collation entry at
the target that locally fits what you are trying to do (or perhaps
rename the collation at the source).

I don't think there is a universally applicable way to normalize
locale names, because we want to support user-defined OS locales.





Re: [HACKERS] Per-column collation, work in progress

2010-09-22 Thread Itagaki Takahiro
On Thu, Sep 16, 2010 at 5:46 AM, Peter Eisentraut pete...@gmx.net wrote:
 Following up on my previous patch [0], here is a fairly complete
 implementation of this feature.  The general description and
 implementation outline of the previous message still apply.  This patch
 contains documentation and regression tests, which can serve as further
 explanations.

I tested the patch on a database with encoding=UTF8 and locale=C.
I have a couple of questions and comments.

* CREATE TABLE (LIKE table_with_collation) doesn't inherit collations.
  We need to copy collations by default, or add INCLUDING COLLATE option.

* upper() doesn't work if a column has a collation.
  It still works if a column doesn't have a collation.
postgres=# \d tbl
         Table "public.tbl"
 Column | Type |      Modifiers
--------+------+--------------------
 c      | text | collate C
 ja     | text | collate ja_JP.utf8

postgres=# SELECT name, setting FROM pg_settings WHERE name IN
('lc_ctype', 'lc_collate');
    name    | setting
------------+---------
 lc_collate | C
 lc_ctype   | C
(2 rows)

postgres=# SELECT upper(c) FROM tbl;
ERROR:  invalid multibyte character for locale
HINT:  The server's LC_CTYPE locale is probably incompatible with the
database encoding.
postgres=# SELECT upper(ja) FROM tbl;
ERROR:  invalid multibyte character for locale
HINT:  The server's LC_CTYPE locale is probably incompatible with the
database encoding

* Comparison of strings with different collations is forbidden,
  but assignment is allowed, right?

postgres=# SELECT * FROM tbl WHERE c = ja;
ERROR:  collation mismatch between implicit collations C and ja_JP.utf8
LINE 1: SELECT * FROM tbl WHERE c = ja;
^
HINT:  You can override the collation by applying the COLLATE clause
to one or both expressions.
postgres=# INSERT INTO tbl(c, ja) SELECT ja, c FROM tbl;
INSERT 0 6

* psql \d needs a separator between collate and not null modifiers.
postgres=# ALTER TABLE tbl ALTER COLUMN c SET NOT NULL;
ALTER TABLE
postgres=# \d tbl
         Table "public.tbl"
 Column | Type |      Modifiers
--------+------+----------------------
 c      | text | collate Cnot null    <= HERE
 ja     | text | collate ja_JP.utf8


 the feature overall only works on Linux/glibc.

We could support it also on MSVC.
http://msdn.microsoft.com/en-us/library/a7cwbx4t(v=VS.90).aspx -- _strcoll_l
http://msdn.microsoft.com/en-us/library/45119yx3(v=VS.90).aspx -- _towupper_l

-- 
Itagaki Takahiro



Re: [HACKERS] Per-column collation, work in progress

2010-09-22 Thread Peter Eisentraut
On ons, 2010-09-22 at 19:44 +0900, Itagaki Takahiro wrote:
 * CREATE TABLE (LIKE table_with_collation) doesn't inherit collations.
   We need to copy collations by default, or add INCLUDING COLLATE option.

OK, should be easy to fix.

 * upper() doesn't work if a column has a collation.
   It still works if a column doesn't have a collation.

I think what you are observing is the result of mixing C and non-C
locales.  Of course that should also be fixed, but it doesn't have much
to do with what upper() does.  Note that there is a regression test case
for lower(), which works mostly the same way.

 * Comparison of strings with different collations is forbidden,
   but assignment is allowed, right?

Correct.

 * psql \d needs a separator between collate and not null modifiers.

OK.

 We could support it also on MSVC.
 http://msdn.microsoft.com/en-us/library/a7cwbx4t(v=VS.90).aspx -- _strcoll_l
 http://msdn.microsoft.com/en-us/library/45119yx3(v=VS.90).aspx -- _towupper_l

Great.





Re: [HACKERS] Per-column collation, proof of concept

2010-08-18 Thread Peter Eisentraut
On tis, 2010-08-17 at 01:16 -0500, Jaime Casanova wrote:
  creating collations ...FATAL:  invalid byte sequence for encoding
  UTF8: 0xe56c09
  CONTEXT:  COPY tmp_pg_collation, line 86
  STATEMENT:  COPY tmp_pg_collation FROM
  E'/usr/local/pgsql/9.1/share/locales.txt';
  
 
  Hmm, what is in that file on that line?
 
 
 
 bokmål  ISO-8859-1

Hey, that borders on genius: Use a non-ASCII letter in the name of a
locale whose purpose it is to configure how non-ASCII letters are
interpreted. :-/

Interestingly, I don't see this on a Debian system.  Good thing to know
that this needs separate testing on different Linux variants.




Re: [HACKERS] Per-column collation, proof of concept

2010-08-18 Thread Jaime Casanova
On Wed, Aug 18, 2010 at 11:29 AM, Peter Eisentraut pete...@gmx.net wrote:
 On tis, 2010-08-17 at 01:16 -0500, Jaime Casanova wrote:
  creating collations ...FATAL:  invalid byte sequence for encoding
  UTF8: 0xe56c09
  CONTEXT:  COPY tmp_pg_collation, line 86
  STATEMENT:  COPY tmp_pg_collation FROM
  E'/usr/local/pgsql/9.1/share/locales.txt';
  
 
  Hmm, what is in that file on that line?
 
 

 bokmål  ISO-8859-1

 Hey, that borders on genius: Use a non-ASCII letter in the name of a
 locale whose purpose it is to configure how non-ASCII letters are
 interpreted. :-/

 Interestingly, I don't see this on a Debian system.  Good thing to know
 that this needs separate testing on different Linux variants.



Yeah! And when installing CentOS 5 I don't get a chance to choose
which locales I want; it just installs all of them.

-- 
Jaime Casanova         www.2ndQuadrant.com
Soporte y capacitación de PostgreSQL



Re: [HACKERS] Per-column collation, proof of concept

2010-08-17 Thread Jaime Casanova
On Mon, Aug 16, 2010 at 10:56 PM, Peter Eisentraut pete...@gmx.net wrote:
 On lör, 2010-08-14 at 02:05 -0500, Jaime Casanova wrote:

 BTW, why the double quotes?

 Because the name contains upper case letters?


why does everything seem so obvious once someone else states it? :)

 sorry to state the obvious but this doesn't work on windows, does it?

 Probably not, but hopefully there is some similar API that could be used
 on Windows.


good luck with that! ;)
seriously, maybe this helps
http://msdn.microsoft.com/en-us/library/system.windows.forms.inputlanguage.installedinputlanguages.aspx
but probably you will need to write the code yourself... at least I
don't think there is something like "locale -a"

 and for some reason it also didn't work on a centos 5 (this error
 ocurred when initdb'ing)
 
 loading system objects' descriptions ... ok
 creating collations ...FATAL:  invalid byte sequence for encoding
 UTF8: 0xe56c09
 CONTEXT:  COPY tmp_pg_collation, line 86
 STATEMENT:  COPY tmp_pg_collation FROM
 E'/usr/local/pgsql/9.1/share/locales.txt';
 

 Hmm, what is in that file on that line?



bokmål  ISO-8859-1

(I'm attaching the locales.txt just in case)

-- 
Jaime Casanova         www.2ndQuadrant.com
Soporte y capacitación de PostgreSQL
aa_DJ   ISO-8859-1
aa_DJ.iso88591  ISO-8859-1
aa_DJ.utf8  UTF-8
aa_ER   UTF-8
aa...@saaho UTF-8
aa_ER.utf8  UTF-8
aa_er.u...@saahoUTF-8
aa_ET   UTF-8
aa_ET.utf8  UTF-8
af_ZA   ISO-8859-1
af_ZA.iso88591  ISO-8859-1
af_ZA.utf8  UTF-8
am_ET   UTF-8
am_ET.utf8  UTF-8
an_ES   ISO-8859-15
an_ES.iso885915 ISO-8859-15
an_ES.utf8  UTF-8
ar_AE   ISO-8859-6
ar_AE.iso88596  ISO-8859-6
ar_AE.utf8  UTF-8
ar_BH   ISO-8859-6
ar_BH.iso88596  ISO-8859-6
ar_BH.utf8  UTF-8
ar_DZ   ISO-8859-6
ar_DZ.iso88596  ISO-8859-6
ar_DZ.utf8  UTF-8
ar_EG   ISO-8859-6
ar_EG.iso88596  ISO-8859-6
ar_EG.utf8  UTF-8
ar_IN   UTF-8
ar_IN.utf8  UTF-8
ar_IQ   ISO-8859-6
ar_IQ.iso88596  ISO-8859-6
ar_IQ.utf8  UTF-8
ar_JO   ISO-8859-6
ar_JO.iso88596  ISO-8859-6
ar_JO.utf8  UTF-8
ar_KW   ISO-8859-6
ar_KW.iso88596  ISO-8859-6
ar_KW.utf8  UTF-8
ar_LB   ISO-8859-6
ar_LB.iso88596  ISO-8859-6
ar_LB.utf8  UTF-8
ar_LY   ISO-8859-6
ar_LY.iso88596  ISO-8859-6
ar_LY.utf8  UTF-8
ar_MA   ISO-8859-6
ar_MA.iso88596  ISO-8859-6
ar_MA.utf8  UTF-8
ar_OM   ISO-8859-6
ar_OM.iso88596  ISO-8859-6
ar_OM.utf8  UTF-8
ar_QA   ISO-8859-6
ar_QA.iso88596  ISO-8859-6
ar_QA.utf8  UTF-8
ar_SA   ISO-8859-6
ar_SA.iso88596  ISO-8859-6
ar_SA.utf8  UTF-8
ar_SD   ISO-8859-6
ar_SD.iso88596  ISO-8859-6
ar_SD.utf8  UTF-8
ar_SY   ISO-8859-6
ar_SY.iso88596  ISO-8859-6
ar_SY.utf8  UTF-8
ar_TN   ISO-8859-6
ar_TN.iso88596  ISO-8859-6
ar_TN.utf8  UTF-8
ar_YE   ISO-8859-6
ar_YE.iso88596  ISO-8859-6
ar_YE.utf8  UTF-8
as_IN.utf8  UTF-8
az_AZ.utf8  UTF-8
be_BY   CP1251
be_BY.cp1251CP1251
be...@latin UTF-8
be_BY.utf8  UTF-8
be_by.u...@latinUTF-8
bg_BG   CP1251
bg_BG.cp1251CP1251
bg_BG.utf8  UTF-8
bn_BD   UTF-8
bn_BD.utf8  UTF-8
bn_IN   UTF-8
bn_IN.utf8  UTF-8
bokmal  ISO-8859-1
bokmål  ISO-8859-1
br_FR   ISO-8859-1
br...@euro  ISO-8859-15
br_FR.iso88591  ISO-8859-1
br_fr.iso885...@euroISO-8859-15
br_FR.utf8  UTF-8
bs_BA   ISO-8859-2
bs_BA.iso88592  ISO-8859-2
bs_BA.utf8  UTF-8
byn_ER  UTF-8
byn_ER.utf8 UTF-8
C   ANSI_X3.4-1968
ca_AD   ISO-8859-15
ca_AD.iso885915 ISO-8859-15
ca_AD.utf8  UTF-8
ca_ES   ISO-8859-1
ca...@euro  ISO-8859-15
ca_ES.iso88591  ISO-8859-1
ca_es.iso885...@euroISO-8859-15
ca_ES.utf8  UTF-8
ca_FR   ISO-8859-15
ca_FR.iso885915 ISO-8859-15
ca_FR.utf8  UTF-8
ca_IT   ISO-8859-15
ca_IT.iso885915 ISO-8859-15
ca_IT.utf8  UTF-8
catalan ISO-8859-1
croatianISO-8859-2
csb_PL  UTF-8
csb_PL.utf8 UTF-8
cs_CZ   ISO-8859-2
cs_CZ.iso88592  ISO-8859-2
cs_CZ.utf8  UTF-8
cy_GB   ISO-8859-14
cy_GB.iso885914 ISO-8859-14
cy_GB.utf8  UTF-8
czech   ISO-8859-2
da_DK   ISO-8859-1
da_DK.iso88591  ISO-8859-1
da_DK.iso885915 ISO-8859-15
da_DK.utf8  UTF-8
danish  ISO-8859-1
dansk   ISO-8859-1
de_AT   ISO-8859-1
de...@euro  ISO-8859-15
de_AT.iso88591  ISO-8859-1
de_at.iso885...@euroISO-8859-15
de_AT.utf8  UTF-8
de_BE   ISO-8859-1
de...@euro  ISO-8859-15
de_BE.iso88591  ISO-8859-1
de_be.iso885...@euroISO-8859-15
de_BE.utf8  UTF-8
de_CH   ISO-8859-1
de_CH.iso88591  ISO-8859-1
de_CH.utf8  UTF-8
de_DE   ISO-8859-1
de...@euro  ISO-8859-15
de_DE.iso88591  ISO-8859-1
de_de.iso885...@euroISO-8859-15
de_DE.utf8  UTF-8
de_LU   ISO-8859-1
de...@euro  ISO-8859-15
de_LU.iso88591  ISO-8859-1
de_lu.iso885...@euroISO-8859-15
de_LU.utf8  UTF-8
deutsch ISO-8859-1
dutch   ISO-8859-1
dz_BT   UTF-8
dz_BT.utf8  UTF-8
eesti   ISO-8859-1
el_CY   ISO-8859-7
el_CY.iso88597  ISO-8859-7
el_CY.utf8  UTF-8
el_GR   ISO-8859-7
el_GR.iso88597  ISO-8859-7
el_GR.utf8  UTF-8
en_AU   ISO-8859-1
en_AU.iso88591  ISO-8859-1
en_AU.utf8  

Re: [HACKERS] Per-column collation, proof of concept

2010-08-16 Thread Peter Eisentraut
On lör, 2010-08-14 at 02:05 -0500, Jaime Casanova wrote:
 BTW, the patch no longer applies cleanly, but most failures are just
 hunks; the worst is in src/backend/catalog/namespace.c, because
 FindConversionByName() is now called get_conversion_oid()... so maybe
 this function should be named get_collation_oid(), I guess

OK, that will need to be adjusted.

 well, at least pg_collation should be a shared catalog, no?
 And I think we shouldn't be thinking about this without first thinking
 about how to integrate it with at least per-database configuration

Good point.  But one might also want to create private collations, so
a collation in a schema would also be useful.  Tricky.

 also, it doesn't recognize the "C" collation although it is in locales.txt
 
 test3=# create database test4 with template=template0 encoding 'utf-8'
 lc_collate='C';
 CREATE DATABASE
 test3=# create table t3 (col1 text collate "C");
 ERROR:  collation "C" does not exist
 

I've fixed this in the meantime.  Your version of the patch doesn't
support the C locale yet.

 BTW, why the double quotes?

Because the name contains upper case letters?

 sorry to state the obvious but this doesn't work on windows, does it?

Probably not, but hopefully there is some similar API that could be used
on Windows.

 and for some reason it also didn't work on a centos 5 (this error
 ocurred when initdb'ing)
 
 loading system objects' descriptions ... ok
 creating collations ...FATAL:  invalid byte sequence for encoding
 UTF8: 0xe56c09
 CONTEXT:  COPY tmp_pg_collation, line 86
 STATEMENT:  COPY tmp_pg_collation FROM
 E'/usr/local/pgsql/9.1/share/locales.txt';
 

Hmm, what is in that file on that line?





Re: [HACKERS] Per-column collation, proof of concept

2010-08-14 Thread Jaime Casanova
Hi,

sorry for the delay...
btw, the patch no longer applies cleanly, but most failures are just
hunks; the worst is in src/backend/catalog/namespace.c because
FindConversionByName() is now called get_conversion_oid()... so maybe
this function should be named get_collation_oid(), i guess

On Tue, Aug 3, 2010 at 11:32 AM, Peter Eisentraut pete...@gmx.net wrote:
 On mån, 2010-08-02 at 01:43 -0500, Jaime Casanova wrote:
 nowadays, CREATE DATABASE has a lc_collate clause. is the new collate
 clause similar to lc_collate?
 i mean, is lc_collate what we will use as a default?

 Yes, if you do not specify anything per column, the database default is
 used.

 How to integrate the per-database or per-cluster configuration with the
 new system is something to figure out in the future.


well at least pg_collation should be a shared catalog, no?
and i think we shouldn't be working on this without first thinking
about how to integrate it with at least per-database configuration

 if yes, then probably we need to use pg_collation there too because
 lc_collate and the new collate clause use different collation names.
 
 postgres=# create database test with lc_collate 'en_US.UTF-8';
 CREATE DATABASE
 test=# create table t1 (col1 text collate "en_US.UTF-8");
 ERROR:  collation "en_US.UTF-8" does not exist
 test=# create table t1 (col1 text collate "en_US.utf8");
 CREATE TABLE
 

 This is something that libc does for you.  The locale as listed by
 "locale -a" is called "en_US.utf8", but apparently libc accepts
 "en_US.UTF-8" as well.


ok, but at least this is confusing

also, it doesn't recognize the "C" collation although it is in locales.txt

test3=# create database test4 with template=template0 encoding 'utf-8'
lc_collate='C';
CREATE DATABASE
test3=# create table t3 (col1 text collate "C");
ERROR:  collation "C" does not exist


BTW, why the double quotes?

 also i got errors from regression tests when MULTIBYTE=UTF8
 (attached). it seems it was trying to create locales that weren't
 defined in locales.txt (where is that file generated from?). i added
 a line to that file (for es_EC.utf8), then created a table with a
 column using that collation and executed select * from t2 where col1 > 'n';
 and i got this error: ERROR:  could not create locale "es_EC.utf8"
 (of course, that last part was me messing things up, but it shows
 we shouldn't be using a locales.txt file, i think)

 It might be that you don't have those locales installed on your system.
 locales.txt is created using "locale -a".  Check what that gives you.


sorry to state the obvious but this doesn't work on windows, does it?
and for some reason it also didn't work on a CentOS 5 system (this
error occurred when initdb'ing)

loading system objects' descriptions ... ok
creating collations ...FATAL:  invalid byte sequence for encoding
UTF8: 0xe56c09
CONTEXT:  COPY tmp_pg_collation, line 86
STATEMENT:  COPY tmp_pg_collation FROM
E'/usr/local/pgsql/9.1/share/locales.txt';


-- 
Jaime Casanova         www.2ndQuadrant.com
Soporte y capacitación de PostgreSQL



Re: [HACKERS] Per-column collation, proof of concept

2010-08-03 Thread Peter Eisentraut
On mån, 2010-08-02 at 01:43 -0500, Jaime Casanova wrote:
 nowadays, CREATE DATABASE has a lc_collate clause. is the new collate
 clause similar to lc_collate?
 i mean, is lc_collate what we will use as a default?

Yes, if you do not specify anything per column, the database default is
used.

How to integrate the per-database or per-cluster configuration with the
new system is something to figure out in the future.

 if yes, then probably we need to use pg_collation there too because
 lc_collate and the new collate clause use different collation names.
 
 postgres=# create database test with lc_collate 'en_US.UTF-8';
 CREATE DATABASE
 test=# create table t1 (col1 text collate "en_US.UTF-8");
 ERROR:  collation "en_US.UTF-8" does not exist
 test=# create table t1 (col1 text collate "en_US.utf8");
 CREATE TABLE
 

This is something that libc does for you.  The locale as listed by
"locale -a" is called "en_US.utf8", but apparently libc accepts
"en_US.UTF-8" as well.

 also i got errors from regression tests when MULTIBYTE=UTF8
 (attached). it seems it was trying to create locales that weren't
 defined in locales.txt (where is that file generated from?). i added
 a line to that file (for es_EC.utf8), then created a table with a
 column using that collation and executed select * from t2 where col1 > 'n';
 and i got this error: ERROR:  could not create locale "es_EC.utf8"
 (of course, that last part was me messing things up, but it shows
 we shouldn't be using a locales.txt file, i think)

It might be that you don't have those locales installed on your system.
locales.txt is created using "locale -a".  Check what that gives you.

 i can attach a collation to a domain but i can't see where we are
 storing that info (actually it says it's not collatable):

Domain support is not done yet.





Re: [HACKERS] Per-column collation, proof of concept

2010-08-02 Thread Jaime Casanova
On Tue, Jul 13, 2010 at 1:25 PM, Peter Eisentraut pete...@gmx.net wrote:
 Here is a proof of concept for per-column collation support.


Hi,

i was looking at this.

nowadays, CREATE DATABASE has a lc_collate clause. is the new collate
clause similar to lc_collate?
i mean, is lc_collate what we will use as a default?

if yes, then probably we need to use pg_collation there too because
lc_collate and the new collate clause use different collation names.

postgres=# create database test with lc_collate 'en_US.UTF-8';
CREATE DATABASE
test=# create table t1 (col1 text collate "en_US.UTF-8");
ERROR:  collation "en_US.UTF-8" does not exist
test=# create table t1 (col1 text collate "en_US.utf8");
CREATE TABLE


also i got errors from regression tests when MULTIBYTE=UTF8
(attached). it seems it was trying to create locales that weren't
defined in locales.txt (where is that file generated from?). i added
a line to that file (for es_EC.utf8), then created a table with a
column using that collation and executed select * from t2 where col1 > 'n';
and i got this error: ERROR:  could not create locale "es_EC.utf8"
(of course, that last part was me messing things up, but it shows
we shouldn't be using a locales.txt file, i think)

i can attach a collation to a domain but i can't see where we are
storing that info (actually it says it's not collatable):

-- 
Jaime Casanova         www.2ndQuadrant.com
Soporte y capacitación de PostgreSQL


regression.diffs
Description: Binary data



Re: [HACKERS] Per-column collation, proof of concept

2010-07-15 Thread Peter Eisentraut
On tor, 2010-07-15 at 05:57 +0200, Pavel Stehule wrote:
 :( maybe we have to enhance the locales - or do some work in this
 direction.  In Czech information systems an operation like the
 following is relatively common:

 name = 'Stěhule' COLLATION cs_CZ_cs_ai -- compare case-insensitive,
 accent-insensitive

 PostgreSQL is the last database that doesn't have integrated support for it

Well, the comparison function varstr_cmp() contains this comment:

/*
 * In some locales strcoll() can claim that nonidentical strings are
 * equal.  Believing that would be bad news for a number of reasons,
 * so we follow Perl's lead and sort equal strings according to
 * strcmp().
 */

This might not be strictly necessary, seeing that citext obviously
doesn't work that way, but resolving this is really an orthogonal issue.
If you fix that and you have a locale that does what you want, my patch
will help you get your example working.





Re: [HACKERS] Per-column collation, proof of concept

2010-07-15 Thread Tom Lane
Peter Eisentraut pete...@gmx.net writes:
 Well, the comparison function varstr_cmp() contains this comment:

 /*
  * In some locales strcoll() can claim that nonidentical strings are
  * equal.  Believing that would be bad news for a number of reasons,
  * so we follow Perl's lead and sort equal strings according to
  * strcmp().
  */

 This might not be strictly necessary, seeing that citext obviously
 doesn't work that way, but resolving this is really an orthogonal issue.

The problem with not doing that is it breaks hashing --- hash joins and
hash aggregation being the real pain points.

citext works around this in a rather klugy fashion by decreeing that two
strings are equal iff their str_tolower() conversions are bitwise equal.
So it can hash the str_tolower() representation.  But that's kinda slow
and it fails in the general case anyhow, I think.

regards, tom lane



Re: [HACKERS] Per-column collation, proof of concept

2010-07-15 Thread Greg Stark
On Thu, Jul 15, 2010 at 4:24 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 The problem with not doing that is it breaks hashing --- hash joins and
 hash aggregation being the real pain points.

 citext works around this in a rather klugy fashion by decreeing that two
 strings are equal iff their str_tolower() conversions are bitwise equal.
 So it can hash the str_tolower() representation.  But that's kinda slow
 and it fails in the general case anyhow, I think.

I think the general equivalent would be to call strxfrm and hash the
result of that.



-- 
greg



Re: [HACKERS] Per-column collation, proof of concept

2010-07-14 Thread Kevin Grittner
Peter Eisentraut pete...@gmx.net wrote:
 
 Here is a proof of concept for per-column collation support.
 
Did you want a WIP review of that patch?  (CF closing to new
submissions soon)
 
-Kevin



Re: [HACKERS] Per-column collation, proof of concept

2010-07-14 Thread Pavel Stehule
Hello

I have only one question - if I understand correctly, you can use
collations just for sorting.  What is your plan for range search
operations?  Sorting is interesting and surely important for
multilingual applications, but for me case-sensitive/insensitive and
accent-sensitive/insensitive filtering is more important - do you have
a plan for it?

Regards

Pavel Stehule

2010/7/13 Peter Eisentraut pete...@gmx.net:
 Here is a proof of concept for per-column collation support.

 Here is how it works: When creating a table, an optional COLLATE clause
 can specify a collation name, which is stored (by OID) in pg_attribute.
 This becomes part of the type information and is propagated through the
 expression parse analysis, like typmod.  When an operator or function
 call is parsed (transformed), the collations of the arguments are
 unified, using some rules (like type analysis, but different in detail).
 The collations of the function/operator arguments come either from Var
 nodes which in turn got them from pg_attribute, or from other
 function and operator calls, or you can override them with explicit
 COLLATE clauses (not yet implemented, but will work a bit like
 RelabelType).  At the end, each function or operator call gets one
 collation to use.


what about the DISTINCT clause, and maybe GROUP BY?

regards

Pavel

 The function call itself can then look up the collation using the
 fcinfo->flinfo->fn_expr field.  (Works for operator calls, but doesn't
 work for sort operations, needs more thought.)

 A collation is in this implementation defined as an lc_collate string
 and an lc_ctype string.  The implementation of functions interested in
 that information, such as comparison operators, or upper and lower
 functions, will take the collation OID that is passed in, look up the
 locale string, and use the xlocale.h interface (newlocale(),
 strcoll_l()) to compute the result.

 (Note that the xlocale stuff is only 10 or so lines in this patch.  It
 should be feasible to allow other appropriate locale libraries to be
 used.)

 Loose ends:

 - Support function calls (currently only operator calls) (easy)

 - Implementation of sort clauses

 - Indexing support/integration

 - Domain support (should be straightforward)

 - Make all expression node types deal with collation information
  appropriately

 - Explicit COLLATE clause on expressions

 - Caching and not leaking memory of locale lookups

 - I have typcollatable to mark which types can accept collation
  information, but perhaps there should also be proicareaboutcollation
  to skip collation resolution when none of the functions in the
  expression tree care.

 You can start by reading the collate.sql regression test file to see
 what it can do.  Btw., regression tests only work with make check
 MULTIBYTE=UTF8.  And it (probably) only works with glibc for now.








Re: [HACKERS] Per-column collation, proof of concept

2010-07-14 Thread Peter Eisentraut
On ons, 2010-07-14 at 19:35 +0200, Pavel Stehule wrote:
 I have only one question - if I understand correctly, you can use
 collations just for sorting.  What is your plan for range search operations?

My patch does range searches.  Sorting uses the same operators, so both
will be supported.  (Sorting is not yet implemented, as I had written.)

 Sorting is interesting and surely important for multilingual
 applications, but for me case-sensitive/insensitive and
 accent-sensitive/insensitive filtering is more important - do you have
 a plan for it?

You may be able to do some of these by using appropriate locale
definitions.  I'd need some examples to be able to tell for sure.

 what about the DISTINCT clause, and maybe GROUP BY?

DISTINCT and GROUP BY work with equality, which is not affected by
locales (at least under the current rules).





Re: [HACKERS] Per-column collation, proof of concept

2010-07-14 Thread Pavel Stehule
2010/7/14 Peter Eisentraut pete...@gmx.net:
 On ons, 2010-07-14 at 19:35 +0200, Pavel Stehule wrote:
 I have only one question - if I understand correctly, you can use
 collations just for sorting.  What is your plan for range search operations?

 My patch does range searches.  Sorting uses the same operators, so both
 will be supported.  (Sorting is not yet implemented, as I had written.)

 Sorting is interesting and surely important for multilingual
 applications, but for me case-sensitive/insensitive and
 accent-sensitive/insensitive filtering is more important - do you have
 a plan for it?

 You may be able to do some of these by using appropriate locale
 definitions.  I'd need some examples to be able to tell for sure.

 what about the DISTINCT clause, and maybe GROUP BY?

 DISTINCT and GROUP BY work with equality, which is not affected by
 locales (at least under the current rules).


:( maybe we have to enhance the locales - or do some work in this
direction.  In Czech information systems an operation like the
following is relatively common:

name = 'Stěhule' COLLATION cs_CZ_cs_ai -- compare case-insensitive,
accent-insensitive

PostgreSQL is the last database that doesn't have integrated support for it

Regards

Pavel



