Re: Add standard collation UNICODE

2023-05-12 Thread Peter Eisentraut

On 08.05.23 17:48, Peter Eisentraut wrote:

On 27.04.23 13:44, Daniel Verite wrote:

This collation has an empty pg_collation.collversion column, instead
of being set to the same value as "und-x-icu" to track its version.


The original patch implements this as an INSERT in which it would be 
easy to

fix I guess, but in current HEAD it comes as an entry in
include/catalog/pg_collation.dat:

{ oid => '963',
   descr => 'sorts using the Unicode Collation Algorithm with default
settings',
   collname => 'unicode', collprovider => 'i', collencoding => '-1',
   colliculocale => 'und' },

Should it be converted back into an INSERT or better left
in this file and collversion being updated afterwards?


How about we do it with an UPDATE command.  We already do this for 
pg_database in a similar way.  See attached patch.


This has been committed.




Re: Add standard collation UNICODE

2023-05-08 Thread Peter Eisentraut

On 27.04.23 13:44, Daniel Verite wrote:

This collation has an empty pg_collation.collversion column, instead
of being set to the same value as "und-x-icu" to track its version.



The original patch implements this as an INSERT in which it would be easy to
fix I guess, but in current HEAD it comes as an entry in
include/catalog/pg_collation.dat:

{ oid => '963',
   descr => 'sorts using the Unicode Collation Algorithm with default
settings',
   collname => 'unicode', collprovider => 'i', collencoding => '-1',
   colliculocale => 'und' },

Should it be converted back into an INSERT or better left
in this file and collversion being updated afterwards?


How about we do it with an UPDATE command.  We already do this for 
pg_database in a similar way.  See attached patch.
From 91f2aff04f9bf137e4ac6e7df624dde770503bfd Mon Sep 17 00:00:00 2001
From: Peter Eisentraut 
Date: Mon, 8 May 2023 17:45:46 +0200
Subject: [PATCH] initdb: Set collversion for standard collation UNICODE

Discussion: 
https://www.postgresql.org/message-id/49417853-7bdd-4b23-a4e9-04c7aff33...@manitou-mail.org
---
 src/bin/initdb/initdb.c | 7 +++
 1 file changed, 7 insertions(+)

diff --git a/src/bin/initdb/initdb.c b/src/bin/initdb/initdb.c
index 2c208ead01..632f6c9c72 100644
--- a/src/bin/initdb/initdb.c
+++ b/src/bin/initdb/initdb.c
@@ -1695,6 +1695,13 @@ setup_description(FILE *cmdfd)
 static void
 setup_collation(FILE *cmdfd)
 {
+   /*
+* Set the collation version for collations defined in pg_collation.dat,
+* except the ones where we know that the collation behavior will never
+* change.
+*/
+   PG_CMD_PUTS("UPDATE pg_collation SET collversion = 
pg_collation_actual_version(oid) WHERE collname = 'unicode';\n\n");
+
/* Import all collations we can find in the operating system */
PG_CMD_PUTS("SELECT pg_import_system_collations('pg_catalog');\n\n");
 }
-- 
2.40.0



Re: Add standard collation UNICODE

2023-04-27 Thread Daniel Verite
Peter Eisentraut wrote:

>  COLLATE UNICODE
> 
> instead of
> 
>  COLLATE "und-x-icu"
> 
> or whatever it is, is pretty useful.
> 
> So, attached is a small patch to add this.

This collation has an empty pg_collation.collversion column, instead
of being set to the same value as "und-x-icu" to track its version.


postgres=# select * from pg_collation  where collname='unicode' \gx
-[ RECORD 1 ]---+
oid | 963
collname| unicode
collnamespace   | 11
collowner   | 10
collprovider| i
collisdeterministic | t
collencoding| -1
collcollate | 
collctype   | 
colliculocale   | und
collicurules| 
collversion | 

The original patch implements this as an INSERT in which it would be easy to
fix I guess, but in current HEAD it comes as an entry in
include/catalog/pg_collation.dat:

{ oid => '963',
  descr => 'sorts using the Unicode Collation Algorithm with default
settings',
  collname => 'unicode', collprovider => 'i', collencoding => '-1',
  colliculocale => 'und' },

Should it be converted back into an INSERT or better left
in this file and collversion being updated afterwards?


Best regards,
-- 
Daniel Vérité
https://postgresql.verite.pro/
Twitter: @DanielVerite




Re: Add standard collation UNICODE

2023-03-28 Thread Jeff Davis
On Tue, 2023-03-28 at 08:46 -0400, Joe Conway wrote:
> > As long as we still have to initialize the libc locale fields to
> > some
> > language, I think it would be less confusing to keep the ICU locale
> > on
> > the same language.
> 
> I definitely agree with that.

Sounds good -- no changes then.

Regards,
Jeff Davis

> 




Re: Add standard collation UNICODE

2023-03-28 Thread Joe Conway

On 3/28/23 06:07, Peter Eisentraut wrote:

On 23.03.23 21:16, Jeff Davis wrote:

Another thought: for ICU, do we want the default collation to be
UNICODE (root collation)? What we have now gets the default from the
environment, which is consistent with the libc provider.

But now that we have the UNICODE collation, it makes me wonder if we
should just default to that. The server's environment doesn't
necessarily say much about the locale of the data stored in it or the
locale of the applications accessing it.


As long as we still have to initialize the libc locale fields to some
language, I think it would be less confusing to keep the ICU locale on
the same language.


I definitely agree with that.


If we ever manage to get rid of that, then I would also support making
the ICU locale the root collation by default.


+1

--
Joe Conway
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com





Re: Add standard collation UNICODE

2023-03-28 Thread Peter Eisentraut

On 23.03.23 21:16, Jeff Davis wrote:

Another thought: for ICU, do we want the default collation to be
UNICODE (root collation)? What we have now gets the default from the
environment, which is consistent with the libc provider.

But now that we have the UNICODE collation, it makes me wonder if we
should just default to that. The server's environment doesn't
necessarily say much about the locale of the data stored in it or the
locale of the applications accessing it.


As long as we still have to initialize the libc locale fields to some 
language, I think it would be less confusing to keep the ICU locale on 
the same language.


If we ever manage to get rid of that, then I would also support making 
the ICU locale the root collation by default.







Re: Add standard collation UNICODE

2023-03-28 Thread Laurenz Albe
On Thu, 2023-03-23 at 13:16 -0700, Jeff Davis wrote:
> Another thought: for ICU, do we want the default collation to be
> UNICODE (root collation)? What we have now gets the default from the
> environment, which is consistent with the libc provider.
> 
> But now that we have the UNICODE collation, it makes me wonder if we
> should just default to that. The server's environment doesn't
> necessarily say much about the locale of the data stored in it or the
> locale of the applications accessing it.
> 
> I don't have a strong opinion here, but I thought I'd raise the issue.
> 
> By my count, >50% of locales are actually just the root locale. I'm not
> sure if that should matter or not -- we don't want to weigh some
> locales over others -- but I found it interesting.

I second that.  Most people don't pay attention to that when creating a
cluster, so having a locale-agnostic collation is often better than
inheriting whatever default happened to be set in your shell.
For example, the Debian/Ubuntu binary packages create a cluster when
you install the server package, and most people just go on using that.

Yours,
Laurenz Albe




Re: Add standard collation UNICODE

2023-03-23 Thread Jeff Davis
On Thu, 2023-03-09 at 11:23 -0800, Jeff Davis wrote:
> Looks good to me.

Another thought: for ICU, do we want the default collation to be
UNICODE (root collation)? What we have now gets the default from the
environment, which is consistent with the libc provider.

But now that we have the UNICODE collation, it makes me wonder if we
should just default to that. The server's environment doesn't
necessarily say much about the locale of the data stored in it or the
locale of the applications accessing it.

I don't have a strong opinion here, but I thought I'd raise the issue.

By my count, >50% of locales are actually just the root locale. I'm not
sure if that should matter or not -- we don't want to weigh some
locales over others -- but I found it interesting.

Regards,
Jeff Davis





Re: Add standard collation UNICODE

2023-03-10 Thread Peter Eisentraut

On 09.03.23 20:23, Jeff Davis wrote:

On Thu, 2023-03-09 at 11:21 +0100, Peter Eisentraut wrote:

How about this patch version?


Looks good to me.


Committed, after adding a test.





Re: Add standard collation UNICODE

2023-03-09 Thread Jeff Davis
On Thu, 2023-03-09 at 11:21 +0100, Peter Eisentraut wrote:
> How about this patch version?

Looks good to me.

Regards,
Jeff Davis






Re: Add standard collation UNICODE

2023-03-09 Thread Peter Eisentraut

On 08.03.23 19:25, Jeff Davis wrote:

Why is "unicode" only provided for the UTF-8 encoding? For "ucs_basic"
that makes some sense, because the implementation only works in UTF-8.
But here we are using ICU, and the "und" locale should work for any
ICU-supported encoding. I suggest that we use collencoding=-1 for
"unicode", and the docs can just add a note next to "ucs_basic" that it
only works for UTF-8, because that's the weird case.


make sense


For the docs, I suggest that you clarify that "ucs_basic" has the same
behavior as the C locale does *in the UTF-8 encoding*. Not all users
might pick up on the subtlety that the C locale has different behaviors
in different encodings.


Ok, word-smithed a bit more.

How about this patch version?

From a8e33d010f60cceb9442123bd0531451875df313 Mon Sep 17 00:00:00 2001
From: Peter Eisentraut 
Date: Thu, 9 Mar 2023 11:14:28 +0100
Subject: [PATCH v2] Add standard collation UNICODE

Discussion: 
https://www.postgresql.org/message-id/flat/1293e382-2093-a2bf-a397-c04e8f83d...@enterprisedb.com
---
 doc/src/sgml/charset.sgml | 31 ---
 src/bin/initdb/initdb.c   | 10 +++---
 2 files changed, 35 insertions(+), 6 deletions(-)

diff --git a/doc/src/sgml/charset.sgml b/doc/src/sgml/charset.sgml
index 3032392b80..12fabb7372 100644
--- a/doc/src/sgml/charset.sgml
+++ b/doc/src/sgml/charset.sgml
@@ -659,9 +659,34 @@ Standard Collations

 

-Additionally, the SQL standard collation name ucs_basic
-is available for encoding UTF8.  It is equivalent
-to C and sorts by Unicode code point.
+Additionally, two SQL standard collation names are available:
+
+
+ 
+  unicode
+  
+   
+This collation sorts using the Unicode Collation Algorithm with the
+Default Unicode Collation Element Table.  It is available in all
+encodings.  ICU support is required to use this collation.  (This
+collation has the same behavior as the ICU root locale; see .)
+   
+  
+ 
+
+ 
+  ucs_basic
+  
+   
+This collation sorts by Unicode code point.  It is only available for
+encoding UTF8.  (This collation has the same
+behavior as the libc locale specification C in
+UTF8 encoding.)
+   
+  
+ 
+

   
 
diff --git a/src/bin/initdb/initdb.c b/src/bin/initdb/initdb.c
index 5e3c6a27c4..d303cc5609 100644
--- a/src/bin/initdb/initdb.c
+++ b/src/bin/initdb/initdb.c
@@ -1486,10 +1486,14 @@ static void
 setup_collation(FILE *cmdfd)
 {
/*
-* Add an SQL-standard name.  We don't want to pin this, so it doesn't 
go
-* in pg_collation.h.  But add it before reading system collations, so
-* that it wins if libc defines a locale named ucs_basic.
+* Add SQL-standard names.  We don't want to pin these, so they don't go
+* in pg_collation.dat.  But add them before reading system collations, 
so
+* that they win if libc defines a locale with the same name.
 */
+   PG_CMD_PRINTF("INSERT INTO pg_collation (oid, collname, collnamespace, 
collowner, collprovider, collisdeterministic, collencoding, colliculocale)"
+ "VALUES 
(pg_nextoid('pg_catalog.pg_collation', 'oid', 
'pg_catalog.pg_collation_oid_index'), 'unicode', 'pg_catalog'::regnamespace, 
%u, '%c', true, -1, 'und');\n\n",
+ BOOTSTRAP_SUPERUSERID, COLLPROVIDER_ICU);
+
PG_CMD_PRINTF("INSERT INTO pg_collation (oid, collname, collnamespace, 
collowner, collprovider, collisdeterministic, collencoding, collcollate, 
collctype)"
  "VALUES 
(pg_nextoid('pg_catalog.pg_collation', 'oid', 
'pg_catalog.pg_collation_oid_index'), 'ucs_basic', 'pg_catalog'::regnamespace, 
%u, '%c', true, %d, 'C', 'C');\n\n",
  BOOTSTRAP_SUPERUSERID, COLLPROVIDER_LIBC, 
PG_UTF8);

base-commit: 36ea345f8fa616fd9b40576310e54145aa70c1a1
-- 
2.39.2



Re: Add standard collation UNICODE

2023-03-08 Thread Jeff Davis
On Wed, 2023-03-08 at 07:21 +0100, Peter Eisentraut wrote:
> On 04.03.23 19:29, Jeff Davis wrote:
> > It looks like the way you've handled this is by inserting the
> > collation
> > with collprovider=icu even if built without ICU support. I think
> > that's
> > a new case, so we need to make sure it throws reasonable user-
> > facing
> > errors.
> 
> It would look like this:
> 
> => select * from t1 order by b collate unicode;
> ERROR:  0A000: ICU is not supported in this build

Right, the error looks good. I'm just pointing out that before this
patch, having provider='i' in a build without ICU was a configuration
mistake; whereas afterward every database will have a collation with
provider='i' whether it has ICU support or not. I think that's fine,
I'm just double-checking.

Why is "unicode" only provided for the UTF-8 encoding? For "ucs_basic"
that makes some sense, because the implementation only works in UTF-8.
But here we are using ICU, and the "und" locale should work for any
ICU-supported encoding. I suggest that we use collencoding=-1 for
"unicode", and the docs can just add a note next to "ucs_basic" that it
only works for UTF-8, because that's the weird case.

For the docs, I suggest that you clarify that "ucs_basic" has the same
behavior as the C locale does *in the UTF-8 encoding*. Not all users
might pick up on the subtlety that the C locale has different behaviors
in different encodings.

Other than that, it looks good.

-- 
Jeff Davis
PostgreSQL Contributor Team - AWS






Re: Add standard collation UNICODE

2023-03-07 Thread Peter Eisentraut

On 04.03.23 19:29, Jeff Davis wrote:

I do like your approach though because, if someone is using a standard
collation, I think "not built with ICU" (feature not supported) is a
better error than "collation doesn't exist". It also effectively
reserves the name "unicode".


By the way, speaking of reserving names, I don't remember the reason for 
this bit in initdb.c:


/*
 * Add SQL-standard names.  We don't want to pin these, so they don't go
 * in pg_collation.h.  But add them before reading system collations, so
 * that they win if libc defines a locale with the same name.
 */

Why don't we want them pinned?

If we add them instead as entries into pg_collation.dat, it seems to 
work for me.


Another question: What is our current thinking on using BCP 47 names? 
The documentation says for example


"""
The first example selects the ICU locale using a “language tag” per BCP 
47. The second example uses the traditional ICU-specific locale syntax. 
The first style is preferred going forward, but it is not supported by 
older ICU versions.

"""

My patch uses 'und' [BCP 47 style], which appears to be in conflict with 
that statement.


But we have had some discussions on how correct that statement is, but I 
don't remember the outcome.






Re: Add standard collation UNICODE

2023-03-07 Thread Peter Eisentraut

On 04.03.23 19:29, Jeff Davis wrote:

It looks like the way you've handled this is by inserting the collation
with collprovider=icu even if built without ICU support. I think that's
a new case, so we need to make sure it throws reasonable user-facing
errors.


It would look like this:

=> select * from t1 order by b collate unicode;
ERROR:  0A000: ICU is not supported in this build


I do like your approach though because, if someone is using a standard
collation, I think "not built with ICU" (feature not supported) is a
better error than "collation doesn't exist". It also effectively
reserves the name "unicode".


right





Re: Add standard collation UNICODE

2023-03-04 Thread Tom Lane
Jeff Davis  writes:
> On Sun, 2023-03-05 at 08:27 +1300, Thomas Munro wrote:
>> It's created for UTF-8 only, and UTF-8 sorts the same way as the
>> encoded code points, when interpreted as a sequence of unsigned char
>> by memcmp(), strcmp() etc.  Seems right?

> Right, makes sense.

> Though in principle, shouldn't someone using another encoding also be
> able to use ucs_basic? I'm not sure if that's a practical problem or
> not; I'm just curious. Does ICU provide a locale for sorting by code
> point?

ISTM we could trivially allow it in LATIN1 encoding as well;
strcmp would still have the effect of sorting by unicode code points.

Given the complete lack of field demand for making it work in
other encodings, I'm unexcited about spending more effort than that.

regards, tom lane




Re: Add standard collation UNICODE

2023-03-04 Thread Jeff Davis
On Sun, 2023-03-05 at 08:27 +1300, Thomas Munro wrote:
> It's created for UTF-8 only, and UTF-8 sorts the same way as the
> encoded code points, when interpreted as a sequence of unsigned char
> by memcmp(), strcmp() etc.  Seems right?

Right, makes sense.

Though in principle, shouldn't someone using another encoding also be
able to use ucs_basic? I'm not sure if that's a practical problem or
not; I'm just curious. Does ICU provide a locale for sorting by code
point?

Regards,
Jeff Davis





Re: Add standard collation UNICODE

2023-03-04 Thread Thomas Munro
On Sun, Mar 5, 2023 at 7:30 AM Jeff Davis  wrote:
> Sorting by codepoint should be encoding-independent (i.e. decode to
> codepoint first); but the C collation is just strcmp, which is
> encoding-dependent. So is UCS_BASIC wrong today?

It's created for UTF-8 only, and UTF-8 sorts the same way as the
encoded code points, when interpreted as a sequence of unsigned char
by memcmp(), strcmp() etc.  Seems right?




Re: Add standard collation UNICODE

2023-03-04 Thread Jeff Davis
On Wed, 2023-03-01 at 11:09 +0100, Peter Eisentraut wrote:

> When collation support was added to PostgreSQL, we added UCS_BASIC, 
> since that could easily be mapped to the C locale.

Sorting by codepoint should be encoding-independent (i.e. decode to
codepoint first); but the C collation is just strcmp, which is
encoding-dependent. So is UCS_BASIC wrong today?

(Aside: I wonder whether we should differentiate between the libc
provider, which uses strcoll(), and the provider of non-localized
comparisons that just use strcmp(). That would be a better reflection
of what the code actually does.)

> With ICU support, we can provide the UNICODE collation, since it's
> just 
> the root locale.

+1

>   I suppose one hesitation was that ICU was not a 
> standard feature, so this would create variations in the default
> catalog 
> contents, or something like that.

It looks like the way you've handled this is by inserting the collation
with collprovider=icu even if built without ICU support. I think that's
a new case, so we need to make sure it throws reasonable user-facing
errors.

I do like your approach though because, if someone is using a standard
collation, I think "not built with ICU" (feature not supported) is a
better error than "collation doesn't exist". It also effectively
reserves the name "unicode".


-- 
Jeff Davis
PostgreSQL Contributor Team - AWS






Re: Add standard collation UNICODE

2023-03-01 Thread Vik Fearing

On 3/1/23 11:09, Peter Eisentraut wrote:
The SQL standard defines several standard collations.  Most of them are 
only of legacy interest (IMO), but two are currently relevant: UNICODE 
and UCS_BASIC.  UNICODE sorts by the default Unicode collation algorithm 
specifications and UCS_BASIC sorts by codepoint.


When collation support was added to PostgreSQL, we added UCS_BASIC, 
since that could easily be mapped to the C locale.  But there was no 
straightforward way to provide the UNICODE collation.  (Recall that 
collation support came several releases before ICU support.)


With ICU support, we can provide the UNICODE collation, since it's just 
the root locale.  I suppose one hesitation was that ICU was not a 
standard feature, so this would create variations in the default catalog 
contents, or something like that.  But I think now that we are drifting 
to make ICU more prominent, we can just add that anyway.  I think being 
able to say


     COLLATE UNICODE

instead of

     COLLATE "und-x-icu"

or whatever it is, is pretty useful.

So, attached is a small patch to add this.


I don't feel competent to review the patch (simple as it is), but +1 on 
the principle.

--
Vik Fearing





Add standard collation UNICODE

2023-03-01 Thread Peter Eisentraut
The SQL standard defines several standard collations.  Most of them are 
only of legacy interest (IMO), but two are currently relevant: UNICODE 
and UCS_BASIC.  UNICODE sorts by the default Unicode collation algorithm 
specifications and UCS_BASIC sorts by codepoint.


When collation support was added to PostgreSQL, we added UCS_BASIC, 
since that could easily be mapped to the C locale.  But there was no 
straightforward way to provide the UNICODE collation.  (Recall that 
collation support came several releases before ICU support.)


With ICU support, we can provide the UNICODE collation, since it's just 
the root locale.  I suppose one hesitation was that ICU was not a 
standard feature, so this would create variations in the default catalog 
contents, or something like that.  But I think now that we are drifting 
to make ICU more prominent, we can just add that anyway.  I think being 
able to say


COLLATE UNICODE

instead of

COLLATE "und-x-icu"

or whatever it is, is pretty useful.

So, attached is a small patch to add this.From 98fce30e4997253e5b010b6ac72d66255bc77bf6 Mon Sep 17 00:00:00 2001
From: Peter Eisentraut 
Date: Wed, 1 Mar 2023 10:38:19 +0100
Subject: [PATCH] Add standard collation UNICODE

---
 doc/src/sgml/charset.sgml | 30 +++---
 src/bin/initdb/initdb.c   | 10 +++---
 2 files changed, 34 insertions(+), 6 deletions(-)

diff --git a/doc/src/sgml/charset.sgml b/doc/src/sgml/charset.sgml
index 3032392b80..13ec238a81 100644
--- a/doc/src/sgml/charset.sgml
+++ b/doc/src/sgml/charset.sgml
@@ -659,9 +659,33 @@ Standard Collations

 

-Additionally, the SQL standard collation name ucs_basic
-is available for encoding UTF8.  It is equivalent
-to C and sorts by Unicode code point.
+Additionally, two SQL standard collation names are available for encoding
+UTF8:
+
+
+ 
+  unicode
+  
+   
+This collation sorts using the Unicode Collation Algorithm with the
+Default Unicode Collation Element Table.  (This is the same behavior
+as the ICU root locale; see .)  This
+collation is only available when ICU support is configured.
+   
+  
+ 
+
+ 
+  ucs_basic
+  
+   
+This collation sorts by Unicode code point.  (This is the same
+behavior as the libc locale specification C.)
+   
+  
+ 
+

   
 
diff --git a/src/bin/initdb/initdb.c b/src/bin/initdb/initdb.c
index 7a58c33ace..525bec4b44 100644
--- a/src/bin/initdb/initdb.c
+++ b/src/bin/initdb/initdb.c
@@ -1482,10 +1482,14 @@ static void
 setup_collation(FILE *cmdfd)
 {
/*
-* Add an SQL-standard name.  We don't want to pin this, so it doesn't 
go
-* in pg_collation.h.  But add it before reading system collations, so
-* that it wins if libc defines a locale named ucs_basic.
+* Add SQL-standard names.  We don't want to pin these, so they don't go
+* in pg_collation.h.  But add them before reading system collations, so
+* that they win if libc defines a locale with the same name.
 */
+   PG_CMD_PRINTF("INSERT INTO pg_collation (oid, collname, collnamespace, 
collowner, collprovider, collisdeterministic, collencoding, colliculocale)"
+ "VALUES 
(pg_nextoid('pg_catalog.pg_collation', 'oid', 
'pg_catalog.pg_collation_oid_index'), 'unicode', 'pg_catalog'::regnamespace, 
%u, '%c', true, %d, 'und');\n\n",
+ BOOTSTRAP_SUPERUSERID, COLLPROVIDER_ICU, 
PG_UTF8);
+
PG_CMD_PRINTF("INSERT INTO pg_collation (oid, collname, collnamespace, 
collowner, collprovider, collisdeterministic, collencoding, collcollate, 
collctype)"
  "VALUES 
(pg_nextoid('pg_catalog.pg_collation', 'oid', 
'pg_catalog.pg_collation_oid_index'), 'ucs_basic', 'pg_catalog'::regnamespace, 
%u, '%c', true, %d, 'C', 'C');\n\n",
  BOOTSTRAP_SUPERUSERID, COLLPROVIDER_LIBC, 
PG_UTF8);
-- 
2.39.2