Re: Built-in CTYPE provider
On Fri, 2024-04-05 at 11:22 +1300, Thomas Munro wrote:
> Hi,
>
> +command_ok(
> +	[
> +		'initdb', '--no-sync',
> +		'--locale-provider=builtin', '-E UTF-8',
> +		'--builtin-locale=C.UTF-8', "$tempdir/data8"
> +	],
> +	'locale provider builtin with -E UTF-8 --builtin-locale=C.UTF-8');
...
> LC_COLLATE:  en_US
> LC_CTYPE:    en_US
> LC_MESSAGES: C
> LC_MONETARY: en_US
> LC_NUMERIC:  en_US
> LC_TIME:     en_US
> initdb: error: encoding mismatch
> initdb: detail: The encoding you selected (UTF8) and the encoding that
> the selected locale uses (LATIN1) do not match.

Thank you for the report. I fixed it in e2a2357671 by forcing the
environment locale to C, which is compatible with any encoding. The
test still forces the encoding to UTF-8 and the collation to the
builtin C.UTF-8.

In passing, I noticed some unrelated regression test failures when I
set LANG=tr_TR: tsearch, tsdict, json, and jsonb. There's an additional
failure in the updatable_views test when LANG=tr_TR.utf8. I haven't
looked into the details yet.

Regards,
	Jeff Davis

From a7bf58f27bcd60bf79c097c7f462fa183264e9cd Mon Sep 17 00:00:00 2001
From: Jeff Davis
Date: Thu, 4 Apr 2024 15:55:03 -0700
Subject: [PATCH v1] Fix test failure in 001_initdb.pl.
Reported-by: Thomas Munro
Discussion: https://postgr.es/m/CA+hUKGK-ZqV1njkG_=xcCqXh2fcMkz85FTMnhS2opm4ZerH=x...@mail.gmail.com
---
 src/bin/initdb/t/001_initdb.pl | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/src/bin/initdb/t/001_initdb.pl b/src/bin/initdb/t/001_initdb.pl
index c63d3206d9..b31dad2464 100644
--- a/src/bin/initdb/t/001_initdb.pl
+++ b/src/bin/initdb/t/001_initdb.pl
@@ -200,6 +200,7 @@ command_ok(
 	[
 		'initdb', '--no-sync',
 		'--locale-provider=builtin', '-E UTF-8',
+		'--lc-collate=C', '--lc-ctype=C',
 		'--builtin-locale=C.UTF-8', "$tempdir/data8"
 	],
 	'locale provider builtin with -E UTF-8 --builtin-locale=C.UTF-8');
@@ -208,6 +209,7 @@ command_fails(
 	[
 		'initdb', '--no-sync',
 		'--locale-provider=builtin', '-E SQL_ASCII',
+		'--lc-collate=C', '--lc-ctype=C',
 		'--builtin-locale=C.UTF-8', "$tempdir/data9"
 	],
 	'locale provider builtin with --builtin-locale=C.UTF-8 fails for SQL_ASCII'
-- 
2.34.1
Re: Built-in CTYPE provider
Hi,

+command_ok(
+	[
+		'initdb', '--no-sync',
+		'--locale-provider=builtin', '-E UTF-8',
+		'--builtin-locale=C.UTF-8', "$tempdir/data8"
+	],
+	'locale provider builtin with -E UTF-8 --builtin-locale=C.UTF-8');

This Sun animal recently turned on --enable-tap-tests, and that ↑ failed[1]:

# Running: initdb --no-sync --locale-provider=builtin -E UTF-8 --builtin-locale=C.UTF-8 /home/marcel/build-farm-15/buildroot/HEAD/pgsql.build/src/bin/initdb/tmp_check/tmp_test_XvK1/data8
The files belonging to this database system will be owned by user "marcel".
This user must also own the server process.

The database cluster will be initialized with this locale configuration:
  locale provider:   builtin
  default collation: C.UTF-8
  LC_COLLATE:  en_US
  LC_CTYPE:    en_US
  LC_MESSAGES: C
  LC_MONETARY: en_US
  LC_NUMERIC:  en_US
  LC_TIME:     en_US
initdb: error: encoding mismatch
initdb: detail: The encoding you selected (UTF8) and the encoding that the selected locale uses (LATIN1) do not match.  This would lead to misbehavior in various character string processing functions.
initdb: hint: Rerun initdb and either do not specify an encoding explicitly, or choose a matching combination.
[14:04:12.462](0.036s) not ok 28 - locale provider builtin with -E UTF-8 --builtin-locale=C.UTF-8

[1] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=margay&dt=2024-04-04%2011%3A42%3A40
Re: Built-in CTYPE provider
On 01.04.24 21:52, Jeff Davis wrote:
> On Tue, 2024-03-26 at 08:04 +0100, Peter Eisentraut wrote:
>> The patch set v27 is ok with me, modulo (a) discussion about initcap
>> semantics, and (b) what collation to assign to ucs_basic, which can
>> be revisited later.
>
> Attached v28.
>
> The remaining patches are for full case mapping and PG_UNICODE_FAST.
> I am fine waiting until July to get these remaining patches committed.
> That would give us time to sort out details like:
>
> * Get consensus that it's OK to change UCS_BASIC.
> * Figure out if we need a pg-specific locale and whether
>   PG_UNICODE_FAST is the right name.
> * Make sure that full case mapping interacts with regexes in a sane
>   way (probably it needs to just fall back to simple case mapping, but
>   perhaps that's worth a discussion).
> * Implement case folding.
> * Implement a more unicode-friendly TITLECASE() function, which could
>   offer a number of options that don't fit well with INITCAP().
> * Figure out if UPPER()/LOWER() should also have some of those
>   options.
>
> Thoughts?

Yeah, I think it's good to give some more time to work out these
things. The features committed for PG17 so far are solid, so it's a
good point to pause.
Re: Built-in CTYPE provider
On Tue, 2024-03-26 at 08:14 +0100, Peter Eisentraut wrote:
> Full vs. simple case mapping is more of a legacy compatibility
> question, in my mind. There is some expectation/precedent that C.UTF-8
> uses simple case mapping, but beyond that, I don't see a reason why
> someone would want to explicitly opt for simple case mapping, other
> than if they need length preservation or something, but if they need
> that, then they are going to be in a world of pain in Unicode anyway.

I mostly agree, though there are some other purposes for the simple
mapping:

* a substitute for case folding: lower() with simple case mapping will
  work better for that purpose than lower() with full case mapping
  (after we have casefold(), this won't be a problem)

* simple case mapping is conceptually simpler, and that's a benefit by
  itself in some situations -- maybe the 1:1 assumption exists other
  places in their application

>> There's also another reason to consider it an argument rather than a
>> collation property, which is that it might be dependent on some other
>> field in a row. I could imagine someone wanting to do:
>>
>>   SELECT
>>     UPPER(some_field,
>>           full => true,
>>           dotless_i => CASE other_field WHEN ...)
>>   FROM ...
>
> Can you index this usefully? It would only work if the user query
> matches exactly this pattern?

In that example, UPPER is used in the target list -- the WHERE clause
might be indexable. The UPPER is just used for display purposes, and
may depend on some locale settings stored in another table associated
with a particular user.

Regards,
	Jeff Davis
Re: Built-in CTYPE provider
On Tue, 2024-03-26 at 08:04 +0100, Peter Eisentraut wrote:
> The patch set v27 is ok with me, modulo (a) discussion about initcap
> semantics, and (b) what collation to assign to ucs_basic, which can be
> revisited later.

I held off on the refactoring patch for lc_{ctype|collate}_is_c().
There's an explicit "NB: pg_newlocale_from_collation is only supposed
to be called on non-C-equivalent locales" comment in DefineCollation().

What I'd like to do is make it possible to create valid pg_locale_t
objects out of C locales, which can be used anywhere a real locale can
be used. Callers can still check lc_{collate|ctype}_is_c() for various
reasons; but if they did call pg_newlocale_from_collation on a C locale
it would at least work for the pg_locale.h APIs. That would be a
slightly simpler and safer API, and make it easier to do the collation
version check consistently.

That's not very complicated, but it's a bit invasive and probably out
of scope for v17. It might be part of another change I had intended for
a while, which is to make NULL an invalid pg_locale_t, and use a
different representation to mean "use the server environment". That
would clean up a lot of checks for NULL.

For now, we'd still like to add the version number to the builtin
collations, so that leaves us with two options:

(a) Perform the version check in lc_{collate|ctype}_is_c(), which
duplicates some code and creates some inconsistency in how the version
is checked for different providers.

(b) Don't worry about it and just commit the version change in
v27-0001. The version check is already performed correctly on the
database without changes, even if the locale is "C". And there are
already three built-in "C" collations: "C", "POSIX", and UCS_BASIC; so
it's not clear why someone would create even more of them. And even if
they did, there would be no reason to give them a warning because we
haven't incremented the version, so there's no chance of a mismatch.

I'm inclined toward (b). Thoughts?
Regards, Jeff Davis
Re: Built-in CTYPE provider
On Wed, 2024-03-27 at 16:53 +0100, Daniel Verite wrote:
>  provider | isalpha | isdigit
> ----------+---------+---------
>  ICU      | f       | t
>  glibc    | t       | f
>  builtin  | f       | f

The "ICU" above is really the behavior of the Postgres ICU provider as
we implemented it; it's not something forced on us by ICU.

For the ICU provider, pg_wc_isalpha() is defined as u_isalpha()[1] and
pg_wc_isdigit() is defined as u_isdigit()[2]. Those, in turn, are
defined by ICU to be equivalent to java.lang.Character.isLetter() and
java.lang.Character.isDigit().

ICU documents[3] how regex character classes should be implemented
using the ICU APIs, and cites Unicode TR#18 [4] as the source. Despite
being under the heading "...for C/POSIX character classes...", [3] says
it's based on the "Standard" variant of [4], rather than "POSIX
Compatible".

(Aside: the Postgres ICU provider doesn't match what [3] suggests for
the "alpha" class. For the character U+FF11 it doesn't matter, but I
suspect there are differences for other characters. This should be
fixed.)

The differences between PG_C_UTF8 and what ICU suggests are just
because the former uses the "POSIX Compatible" definitions and the
latter uses "Standard". I implemented both the "Standard" and "POSIX
Compatible" compatibility properties in ad49994538, so it would be easy
to change what PG_C_UTF8 uses.

[1] https://unicode-org.github.io/icu-docs/apidoc/dev/icu4c/uchar_8h.html#aecff8611dfb1814d1770350378b3b283
[2] https://unicode-org.github.io/icu-docs/apidoc/dev/icu4c/uchar_8h.html#a42b37828d86daa0fed18b381130ce1e6
[3] https://unicode-org.github.io/icu-docs/apidoc/dev/icu4c/uchar_8h.html#details
[4] http://www.unicode.org/reports/tr18/#Compatibility_Properties

> Are we fine with pg_c_utf8 differing from both ICU's point of view
> (U+ff11 is digit and not alpha) and glibc point of view (U+ff11 is not
> digit, but it's alpha)?

Yes, some differences are to be expected.
But I'm fine making a change to PG_C_UTF8 if it makes sense, as long as we can point to something other than "glibc version 2.35 does it this way". Regards, Jeff Davis
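(Outside the thread, the "Standard" classification that ICU follows is easy to reproduce with Python's built-in Unicode tables; this is only an illustrative sketch, not PostgreSQL code, showing why U+FF11 lands in different classes depending on which property set a provider consults.)

```python
import unicodedata

# U+FF11 FULLWIDTH DIGIT ONE has general category "Nd" (decimal number),
# so providers following Unicode's "Standard" compatibility properties
# treat it as a digit, while POSIX-compatible classifications may not.
ch = "\uff11"
print(unicodedata.name(ch))      # FULLWIDTH DIGIT ONE
print(unicodedata.category(ch))  # Nd

# Python's str methods follow the Unicode character properties:
print(ch.isdigit())              # True
print(ch.isalpha())              # False
```

This matches the "ICU" column of the table quoted above: digit but not alpha.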
Re: Built-in CTYPE provider
	Jeff Davis wrote:

> The tests include initcap('123abc') which is '123abc' in the PG_C_UTF8
> collation vs '123Abc' in PG_UNICODE_FAST.
>
> The reason for the latter behavior is that the Unicode Default Case
> Conversion algorithm for toTitlecase() advances to the next Cased
> character before mapping to titlecase, and digits are not Cased. ICU
> has a configurable adjustment, and defaults in a way that produces
> '123abc'.

Even aside from ICU, there's a different behavior between glibc and
pg_c_utf8 for codepoints in the decimal digit category outside of the
US-ASCII range '0'..'9':

  select initcap(concat(chr(0xff11), 'a') collate "C.utf8"); -- glibc 2.35
   initcap
  ---------
   1a

  select initcap(concat(chr(0xff11), 'a') collate "pg_c_utf8");
   initcap
  ---------
   1A

Both collations consider that chr(0xff11) is not a digit
(isdigit()=>false) but C.utf8 says that it's alpha, whereas pg_c_utf8
says it's neither digit nor alpha.

AFAIU this is why in the above initcap() call, pg_c_utf8 considers that
'a' is the first alphanumeric, whereas C.utf8 considers that '1' is the
first alphanumeric, leading to different capitalizations.

Comparing the 3 providers:

  WITH v(provider,type,result) AS (values
   ('ICU', 'isalpha', chr(0xff11) ~ '[[:alpha:]]' collate "unicode"),
   ('glibc', 'isalpha', chr(0xff11) ~ '[[:alpha:]]' collate "C.utf8"),
   ('builtin', 'isalpha', chr(0xff11) ~ '[[:alpha:]]' collate "pg_c_utf8"),
   ('ICU', 'isdigit', chr(0xff11) ~ '[[:digit:]]' collate "unicode"),
   ('glibc', 'isdigit', chr(0xff11) ~ '[[:digit:]]' collate "C.utf8"),
   ('builtin', 'isdigit', chr(0xff11) ~ '[[:digit:]]' collate "pg_c_utf8")
  )
  select * from v \crosstabview

   provider | isalpha | isdigit
  ----------+---------+---------
   ICU      | f       | t
   glibc    | t       | f
   builtin  | f       | f

Are we fine with pg_c_utf8 differing from both ICU's point of view
(U+ff11 is digit and not alpha) and glibc point of view (U+ff11 is not
digit, but it's alpha)?

Aside from initcap(), this is going to be significant for regular
expressions.
Best regards,
-- 
Daniel Vérité
https://postgresql.verite.pro/
Twitter: @DanielVerite
Re: Built-in CTYPE provider
On 25.03.24 18:52, Jeff Davis wrote:
> OK, I'll propose a "title" or "titlecase" function for 18, along with
> "casefold" (which I was already planning to propose).

(Yay, casefold will be useful.)

> What do you think about UPPER/LOWER and full case mapping? Should
> there be extra arguments for full vs simple case mapping, or should it
> come from the collation?
>
> It makes sense that the "dotted vs dotless i" behavior comes from the
> collation because that depends on locale. But full-vs-simple case
> mapping is not really a locale question. For instance:
>
>   select lower('0Σ' collate "en-US-x-icu") AS lower_sigma,
>          lower('ΑΣ' collate "en-US-x-icu") AS lower_final_sigma,
>          upper('ß' collate "en-US-x-icu") AS upper_eszett;
>    lower_sigma | lower_final_sigma | upper_eszett
>   -------------+-------------------+--------------
>    0σ          | ας                | SS
>
> produces the same results for any ICU collation.

I think of a collation describing what language a text is in. So it
makes sense that "dotless i" depends on the locale/collation.

Full vs. simple case mapping is more of a legacy compatibility
question, in my mind. There is some expectation/precedent that C.UTF-8
uses simple case mapping, but beyond that, I don't see a reason why
someone would want to explicitly opt for simple case mapping, other
than if they need length preservation or something, but if they need
that, then they are going to be in a world of pain in Unicode anyway.

> There's also another reason to consider it an argument rather than a
> collation property, which is that it might be dependent on some other
> field in a row. I could imagine someone wanting to do:
>
>   SELECT
>     UPPER(some_field,
>           full => true,
>           dotless_i => CASE other_field WHEN ...)
>   FROM ...

Can you index this usefully? It would only work if the user query
matches exactly this pattern?

> That makes sense for a function in the target list, because different
> customers might be from different locales and therefore want different
> treatment of the dotted-vs-dotless-i.
There is also the concept of a session collation, which we haven't implemented, but it would address this kind of use. But there again the problem is indexing. But maybe indexing isn't as important for case conversion as it is for sorting.
Re: Built-in CTYPE provider
On 21.03.24 01:13, Jeff Davis wrote:
> The v26 patch was not quite complete, so I didn't commit it yet.
>
> Attached v27-0001 and 0002. 0002 is necessary because otherwise
> lc_collate_is_c() short-circuits the version check in
> pg_newlocale_from_collation(). With 0002, the code is simpler and all
> paths go through pg_newlocale_from_collation(), and the version check
> happens even when lc_collate_is_c(). But perhaps there was a reason
> the code was the way it was, so submitting for review in case I missed
> something.
>
>> 0005 and 0006 don't contain any test cases. So I guess they are
>> really only usable via 0007. Is that understanding correct?
>
> 0005 is not a functional change, it's just a refactoring to use a
> callback, which is preparation for 0007.
>
>> Are there any test cases that illustrate the word boundary changes in
>> patch 0005? It might be useful to test those against Oracle as well.
>
> The tests include initcap('123abc') which is '123abc' in the PG_C_UTF8
> collation vs '123Abc' in PG_UNICODE_FAST.
>
> The reason for the latter behavior is that the Unicode Default Case
> Conversion algorithm for toTitlecase() advances to the next Cased
> character before mapping to titlecase, and digits are not Cased. ICU
> has a configurable adjustment, and defaults in a way that produces
> '123abc'.
>
> New rebased series attached.

The patch set v27 is ok with me, modulo (a) discussion about initcap
semantics, and (b) what collation to assign to ucs_basic, which can be
revisited later.
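(The toTitlecase() behavior described above — skip ahead to the next Cased character, so a leading digit doesn't block capitalization of the following letter — shows up in any implementation of the Unicode Default Case Conversion. As an outside illustration, not PostgreSQL code, Python's str.title() behaves like PG_UNICODE_FAST here:)

```python
# Unicode titlecasing maps the first *cased* character of each word to
# titlecase; digits are not Cased, so they are skipped over.
print("123abc".title())  # 123Abc  (like PG_UNICODE_FAST)

# A plain "uppercase the first character" rule leaves the string
# unchanged, which matches PG_C_UTF8 and the ICU default adjustment:
s = "123abc"
print(s[:1].upper() + s[1:])  # 123abc
```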
Re: Built-in CTYPE provider
On Mon, 2024-03-25 at 08:29 +0100, Peter Eisentraut wrote:
> Right. I thought when you said there is an ICU configuration for it,
> that it might be like collation options that you specify in the locale
> string. But it appears it is only an internal API setting. So that, in
> my mind, reinforces the opinion that we should leave initcap() as is
> and make a new function that exposes the new functionality. (This does
> not have to be part of this patch set.)

OK, I'll propose a "title" or "titlecase" function for 18, along with
"casefold" (which I was already planning to propose).

What do you think about UPPER/LOWER and full case mapping? Should there
be extra arguments for full vs simple case mapping, or should it come
from the collation?

It makes sense that the "dotted vs dotless i" behavior comes from the
collation because that depends on locale. But full-vs-simple case
mapping is not really a locale question. For instance:

  select lower('0Σ' collate "en-US-x-icu") AS lower_sigma,
         lower('ΑΣ' collate "en-US-x-icu") AS lower_final_sigma,
         upper('ß' collate "en-US-x-icu") AS upper_eszett;
   lower_sigma | lower_final_sigma | upper_eszett
  -------------+-------------------+--------------
   0σ          | ας                | SS

produces the same results for any ICU collation.

There's also another reason to consider it an argument rather than a
collation property, which is that it might be dependent on some other
field in a row. I could imagine someone wanting to do:

  SELECT
    UPPER(some_field,
          full => true,
          dotless_i => CASE other_field WHEN ...)
  FROM ...

That makes sense for a function in the target list, because different
customers might be from different locales and therefore want different
treatment of the dotted-vs-dotless-i.

Thoughts? Should we use the collation by default but then allow
parameters to override? Or should we just consider this a new set of
functions? (All of this is v18 material, of course.)

Regards,
	Jeff Davis
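(For reference, the full mappings in the example above are not ICU-specific; Python's str.upper()/str.lower() also implement the full, context-sensitive mappings. This is just an outside illustration, not the proposed PostgreSQL API:)

```python
# Full case mapping can change string length and is context-sensitive:
print("ß".upper())    # SS  (one code point becomes two)
print("ΑΣ".lower())   # ας  (word-final sigma)
print("0Σ".lower())   # 0σ  (sigma not preceded by a cased letter)

# Case folding -- the casefold() proposed in the thread -- is yet
# another mapping, designed for caseless matching:
print("ß".casefold()) # ss
```

A simple (1:1) mapping, by contrast, would leave 'ß' alone and always map Σ to σ, which is exactly the length-preserving behavior discussed for C.UTF-8.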
Re: Built-in CTYPE provider
On 22.03.24 18:26, Jeff Davis wrote:
> On Fri, 2024-03-22 at 15:51 +0100, Peter Eisentraut wrote:
>> I think this might be too big of a compatibility break. So far,
>> initcap('123abc') has always returned '123abc'. If the new collation
>> returns '123Abc' now, then that's quite a change. These are not some
>> obscure Unicode special case characters, after all.
>
> It's a new collation, so I'm not sure it's a compatibility break. But
> you are right that it is against documentation and expectations for
> INITCAP().
>
>> What is the ICU configuration incantation for this? Maybe we could
>> have the builtin provider understand some of that, too.
>
> https://unicode-org.github.io/icu-docs/apidoc/dev/icu4c/stringoptions_8h.html#a4975f537b9960f0330b233061ef0608d
> https://unicode-org.github.io/icu-docs/apidoc/dev/icu4c/stringoptions_8h.html#afc65fa226cac9b8eeef0e877b8a7744e
>
>> Or we should create a function separate from initcap.
>
> If we create a new function, that also gives us the opportunity to
> accept optional arguments to control the behavior rather than relying
> on collation for every decision.

Right. I thought when you said there is an ICU configuration for it,
that it might be like collation options that you specify in the locale
string. But it appears it is only an internal API setting. So that, in
my mind, reinforces the opinion that we should leave initcap() as is
and make a new function that exposes the new functionality. (This does
not have to be part of this patch set.)
Re: Built-in CTYPE provider
There is no technical content in this mail, but I'd like to show
appreciation for your work on this. I hope this will eventually remove
one of the great embarrassments when using PostgreSQL: the dependency
on operating system collations.

Yours,
Laurenz Albe
Re: Built-in CTYPE provider
On Sun, 2024-03-24 at 14:00 +0300, Alexander Lakhin wrote:
> Please look at a Valgrind-detected error caused by the following query
> (starting from f69319f2f):
> SELECT lower('Π' COLLATE pg_c_utf8);

Thank you for the report! Fixed in 503c0ad976.

Valgrind did not detect the problem in my setup, so I added a unit test
in case_test.c where it's easier to see the valgrind problem.

Regards,
	Jeff Davis
Re: Built-in CTYPE provider
Hello Jeff,

21.03.2024 03:13, Jeff Davis wrote:
> On Tue, 2024-03-19 at 13:41 +0100, Peter Eisentraut wrote:
>> * v25-0002-Support-C.UTF-8-locale-in-the-new-builtin-collat.patch
>>
>> Looks ok.
>
> Committed.

Please look at a Valgrind-detected error caused by the following query
(starting from f69319f2f):

SELECT lower('Π' COLLATE pg_c_utf8);

==00:00:00:03.487 1429669== Invalid read of size 1
==00:00:00:03.487 1429669==    at 0x7C64A5: convert_case (unicode_case.c:107)
==00:00:00:03.487 1429669==    by 0x7C: unicode_strlower (unicode_case.c:70)
==00:00:00:03.488 1429669==    by 0x66B218: str_tolower (formatting.c:1698)
==00:00:00:03.488 1429669==    by 0x6D6C55: lower (oracle_compat.c:55)

Best regards,
Alexander
Re: Built-in CTYPE provider
On Fri, 2024-03-22 at 15:51 +0100, Peter Eisentraut wrote:
> I think this might be too big of a compatibility break. So far,
> initcap('123abc') has always returned '123abc'. If the new collation
> returns '123Abc' now, then that's quite a change. These are not some
> obscure Unicode special case characters, after all.

It's a new collation, so I'm not sure it's a compatibility break. But
you are right that it is against documentation and expectations for
INITCAP().

> What is the ICU configuration incantation for this? Maybe we could
> have the builtin provider understand some of that, too.

https://unicode-org.github.io/icu-docs/apidoc/dev/icu4c/stringoptions_8h.html#a4975f537b9960f0330b233061ef0608d
https://unicode-org.github.io/icu-docs/apidoc/dev/icu4c/stringoptions_8h.html#afc65fa226cac9b8eeef0e877b8a7744e

> Or we should create a function separate from initcap.

If we create a new function, that also gives us the opportunity to
accept optional arguments to control the behavior rather than relying
on collation for every decision.

Regards,
	Jeff Davis
Re: Built-in CTYPE provider
On 21.03.24 01:13, Jeff Davis wrote:
>> Are there any test cases that illustrate the word boundary changes in
>> patch 0005? It might be useful to test those against Oracle as well.
>
> The tests include initcap('123abc') which is '123abc' in the PG_C_UTF8
> collation vs '123Abc' in PG_UNICODE_FAST.
>
> The reason for the latter behavior is that the Unicode Default Case
> Conversion algorithm for toTitlecase() advances to the next Cased
> character before mapping to titlecase, and digits are not Cased. ICU
> has a configurable adjustment, and defaults in a way that produces
> '123abc'.

I think this might be too big of a compatibility break. So far,
initcap('123abc') has always returned '123abc'. If the new collation
returns '123Abc' now, then that's quite a change. These are not some
obscure Unicode special case characters, after all.

What is the ICU configuration incantation for this? Maybe we could have
the builtin provider understand some of that, too.

Or we should create a function separate from initcap.
Re: Built-in CTYPE provider
* v25-0001-Address-more-review-comments-on-commit-2d819a08a.patch

This was committed.

* v25-0002-Support-C.UTF-8-locale-in-the-new-builtin-collat.patch

Looks ok.

* v25-0003-Inline-basic-UTF-8-functions.patch

ok

* v25-0004-Use-version-for-builtin-collations.patch

Not sure about the version format "1.0", which implies some sort of
major/minor or component-based system. I would just use "1".

* v25-0005-Add-unicode_strtitle-for-Unicode-Default-Case-Co.patch
* v25-0006-Support-Unicode-full-case-mapping-and-conversion.patch
* v25-0007-Support-PG_UNICODE_FAST-locale-in-the-builtin-co.patch

0005 and 0006 don't contain any test cases. So I guess they are really
only usable via 0007. Is that understanding correct?

Btw., tested initcap() on Oracle:

select initcap('džudo') from dual;

(which uses the precomposed U+01F3) and the result is DŽudo (with the
precomposed uppercase character). So that matches the behavior proposed
in your 0002 patch.

Are there any test cases that illustrate the word boundary changes in
patch 0005? It might be useful to test those against Oracle as well.
Re: Built-in CTYPE provider
Thomas Munro writes:
> On Tue, Mar 19, 2024 at 11:55 AM Tom Lane wrote:
>>>> This is causing all CI jobs to fail the "compiler warnings" check.
>>> I did run CI before checkin, and it passed:
> Maybe I misunderstood this exchange but ...
> Currently Windows warnings don't make any CI tasks fail ie turn red,
> which is why Jeff's run is all green in his personal github repo.
> ...
> But I did teach cfbot to do some extra digging through the logs,

Ah. What I should have said was "it's causing cfbot to complain about
every patch". Seems like the divergence in the pass criterion is not
such a great idea.

			regards, tom lane
Re: Built-in CTYPE provider
On Tue, Mar 19, 2024 at 11:55 AM Tom Lane wrote:
> Jeff Davis writes:
>> On Mon, 2024-03-18 at 18:04 -0400, Tom Lane wrote:
>>> This is causing all CI jobs to fail the "compiler warnings" check.
>
>> I did run CI before checkin, and it passed:
>> https://cirrus-ci.com/build/5382423490330624
>
> Weird, why did it not report with the same level of urgency?
> But anyway, thanks for fixing.

Maybe I misunderstood this exchange but ...

Currently Windows warnings don't make any CI tasks fail ie turn red,
which is why Jeff's run is all green in his personal github repo.
Unlike gcc and clang, and MinGW cross-build warnings which cause the
special "CompilerWarnings" CI task to fail (red). That task is running
on a Linux system so it can't use MSVC. The idea of keeping it separate
from the "main" Linux, FreeBSD, macOS tasks (which use gcc, clang,
clang respectively) was that it's nicer to try to run the actual tests
even if there is a pesky warning, so having it in a separate task gets
you that info without blocking other progress, and it also tries with
and without assertions (a category of warning hazard, eg unused
variables when assertions are off).

But I did teach cfbot to do some extra digging through the logs,
looking for various interesting patterns[1], including non-error
warnings, and if it finds anything interesting it shows a little
clickable ⚠ symbol on the front page.

If there is something like -Werror on MSVC we could turn that on for
the main Windows test, but that might also be a bit annoying. Perhaps
there is another way: we could have it compile and test everything,
allowing warnings, but also then grep the build log afterwards in a new
step that fails if any warnings were there? Then Jeff would have got a
failure in his personal CI run. Or something like that.

[1] https://github.com/macdice/cfbot/blob/master/cfbot_work_queue.py
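(The "grep the build log afterwards" step can be prototyped in a few lines. This is a hypothetical sketch — the patterns and function name are mine, not cfbot's actual code from cfbot_work_queue.py:)

```python
import re

# Patterns that indicate a non-fatal compiler warning slipped through.
# MSVC emits "warning C####"; gcc/clang emit "warning:".
WARNING_PATTERNS = [
    re.compile(r"warning C\d{4}"),  # MSVC
    re.compile(r"\bwarning:"),      # gcc/clang
]

def find_warnings(build_log: str) -> list[str]:
    """Return the log lines that look like compiler warnings."""
    return [
        line
        for line in build_log.splitlines()
        if any(p.search(line) for p in WARNING_PATTERNS)
    ]

log = """\
pg_locale.c
c:\\cirrus\\src\\backend\\utils\\adt\\pg_locale.c(2515) : warning C4715: 'builtin_locale_encoding': not all control paths return a value
linking ...
"""
for line in find_warnings(log):
    print(line)
```

A CI step built on this would simply exit non-zero when the list is non-empty, which is the "fails if any warnings were there" behavior suggested above.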
Re: Built-in CTYPE provider
Jeff Davis writes:
> On Mon, 2024-03-18 at 18:04 -0400, Tom Lane wrote:
>> This is causing all CI jobs to fail the "compiler warnings" check.

> I did run CI before checkin, and it passed:
> https://cirrus-ci.com/build/5382423490330624

Weird, why did it not report with the same level of urgency?
But anyway, thanks for fixing.

			regards, tom lane
Re: Built-in CTYPE provider
On Mon, 2024-03-18 at 18:04 -0400, Tom Lane wrote:
> This is causing all CI jobs to fail the "compiler warnings" check.

I did run CI before checkin, and it passed:

https://cirrus-ci.com/build/5382423490330624

If I open up the windows build, I see the warning:

https://cirrus-ci.com/task/5199979044667392

but I didn't happen to check this time.

> Probably the best fix is the traditional
>
>     return ;	/* keep compiler quiet */
>
> but I'm not sure what the best default result is in this function.

I inverted the check so that I didn't have to choose a default.

Regards,
	Jeff Davis
Re: Built-in CTYPE provider
Jeff Davis writes:
> It may be moot soon, but I committed a fix now.

Thanks, but it looks like 846311051 introduced a fresh issue. MSVC is
complaining about

[21:37:15.349] c:\cirrus\src\backend\utils\adt\pg_locale.c(2515) : warning C4715: 'builtin_locale_encoding': not all control paths return a value

This is causing all CI jobs to fail the "compiler warnings" check.

Probably the best fix is the traditional

	return ;	/* keep compiler quiet */

but I'm not sure what the best default result is in this function.

			regards, tom lane
Re: Built-in CTYPE provider
On Sun, 2024-03-17 at 17:46 -0400, Tom Lane wrote:
> Jeff Davis writes:
>> New series attached.
>
> Coverity thinks there's something wrong with builtin_validate_locale,
> and as far as I can tell it's right: the last ereport is unreachable,
> because required_encoding is never changed from its initial -1 value.
> It looks like there's a chunk of logic missing there, or else that the
> code could be simplified further.

Thank you, it was a bit of over-generalization in anticipation of
future patches. It may be moot soon, but I committed a fix now.

Regards,
	Jeff Davis
Re: Built-in CTYPE provider
Jeff Davis writes:
> New series attached.

Coverity thinks there's something wrong with builtin_validate_locale,
and as far as I can tell it's right: the last ereport is unreachable,
because required_encoding is never changed from its initial -1 value.
It looks like there's a chunk of logic missing there, or else that the
code could be simplified further.

/srv/coverity/git/pgsql-git/postgresql/src/backend/utils/adt/pg_locale.c: 2519 in builtin_validate_locale()
>>> CID 1594398: Control flow issues (DEADCODE)
>>> Execution cannot reach the expression "encoding != required_encoding"
>>> inside this statement: "if (required_encoding >= 0 ...".
2519	if (required_encoding >= 0 && encoding != required_encoding)
2520		ereport(ERROR,
2521			(errcode(ERRCODE_WRONG_OBJECT_TYPE),
2522			 errmsg("encoding \"%s\" does not match locale \"%s\"",
2523				pg_encoding_to_char(encoding), locale)));

			regards, tom lane
Re: Built-in CTYPE provider
On Thu, 2024-03-14 at 15:38 +0100, Peter Eisentraut wrote:
> On 14.03.24 09:08, Jeff Davis wrote:
>> 0001 (the C.UTF-8 locale) is also close...
>
> I have tested this against the libc locale C.utf8 that was available
> on the OS, and the behavior is consistent.

That was the goal, in spirit. But to clarify: it's not guaranteed that
the built-in C.UTF-8 is always the same as the libc UTF-8, because
different implementations do different things. For instance, I saw
significant differences on MacOS.

> I wonder if we should version the builtin locales too. We might make a
> mistake and want to change something sometime?

I'm fine with that, see v25-0004 in the reply to your other mail. The
version only tracks sort order, and all of the builtin locales sort
based on memcmp(). But it's possible there are bugs in the
optimizations around memcmp() (e.g. abbreviated keys, or some future
optimization).

> Tiny comments:
>
> * src/bin/scripts/t/020_createdb.pl
>
> The two added tests should have different names that tell them apart
> (like the new initdb tests).
>
> * src/include/catalog/pg_collation.dat

Done in v25-0002 (in reply to your other mail).

Regards,
	Jeff Davis
Re: Built-in CTYPE provider
On 14.03.24 09:08, Jeff Davis wrote:
> 0001 (the C.UTF-8 locale) is also close. Considering that most of the
> infrastructure is already in place, that's not a large patch. You may
> have some comments about the way I'm canonicalizing and validating in
> initdb -- that could be cleaner, but it feels like I should refactor
> the surrounding code separately first.

I have tested this against the libc locale C.utf8 that was available on
the OS, and the behavior is consistent.

I wonder if we should version the builtin locales too. We might make a
mistake and want to change something sometime?

Tiny comments:

* src/bin/scripts/t/020_createdb.pl

The two added tests should have different names that tell them apart
(like the new initdb tests).

* src/include/catalog/pg_collation.dat

Maybe use 'and' instead of '&' in the description.

> 0002 (inlining utf8 functions) is also ready.

Seems ok.

> For 0003 and beyond, I'd like some validation that it's what you had
> in mind.

I'll look into those later.
Re: Built-in CTYPE provider
On 14.03.24 09:08, Jeff Davis wrote:
> On Wed, 2024-03-13 at 00:44 -0700, Jeff Davis wrote:
> > New series attached. I plan to commit 0001 very soon.
>
> Committed the basic builtin provider, supporting only the "C" locale.

As you were committing this, I had another review of v23-0001-Introduce-collation-provider-builtin.patch in progress. Some of the things I found you have already addressed in what you committed. Please check the remaining comments.

* doc/src/sgml/charset.sgml

I don't understand the purpose of this sentence: "When using this locale, the behavior may depend on the database encoding."

* doc/src/sgml/ref/create_database.sgml

The new parameter builtin_locale is not documented.

* src/backend/commands/collationcmds.c

I think DefineCollation() should set collencoding = -1 for the COLLPROVIDER_BUILTIN case. -1 stands for any encoding. Or at least explain why not?

* src/backend/utils/adt/pg_locale.c

This part is a bit confusing:

+ cache_entry->collate_is_c = true;
+ cache_entry->ctype_is_c = (strcmp(colllocale, "C") == 0);

Is collate always C but ctype only sometimes? Does this anticipate future patches in this series? Maybe in this patch it should always be true?

* src/bin/initdb/initdb.c

+ printf(_("      --builtin-locale=LOCALE   set builtin locale name for new databases\n"));

Put in a line break so that the right "column" lines up.

This output should line up better:

  The database cluster will be initialized with this locale configuration:
  default collation provider:  icu
  default collation locale:en
  LC_COLLATE: C
  LC_CTYPE:C
  ...

Also, why are there two spaces after "provider: "? Also, we call these locale providers on input, so why are they collation providers on output? What is a "collation locale"?

* src/bin/pg_upgrade/t/002_pg_upgrade.pl

+if ($oldnode->pg_version >= '17devel')

This is weird. >= is a numeric comparison, so providing a string with non-digits is misleading at best.
* src/test/icu/t/010_database.pl

-# Test that LOCALE works for ICU locales if LC_COLLATE and LC_CTYPE
-# are specified

Why remove this test?

+my ($ret, $stdout, $stderr) = $node1->psql('postgres',
+	q{CREATE DATABASE dbicu LOCALE_PROVIDER builtin LOCALE 'C' TEMPLATE dbicu}
+);

Change the name of the new database to be different from the name of the template database.
Re: Built-in CTYPE provider
On 08.03.24 02:00, Jeff Davis wrote:
> And here's v22 (I didn't post v21). I committed Unicode property tables and functions, and the simple case mapping. I separated out the full case mapping changes (based on SpecialCasing.txt) into patch 0006.

0002: Basic builtin collation provider that only supports "C".

Overall, this patch looks sound.

In the documentation, let's make the list of locale providers an actual list instead of a sequence of <para>s.

We had some discussion on the initdb option --builtin-locale and whether it should be something more general. I'm ok with leaving it like this for now and maybe considering it as an "open item" for PG17.

In errmsg("parameter \"locale\" must be specified"), make "locale" a placeholder. (See commit 36a14afc076.)

It seems the builtin provider accepts both "C" and "POSIX" as locale names, but the documentation says it must be "C". Maybe we don't need to accept "POSIX"? (Seeing that there are no plans for "POSIX.UTF-8", maybe we just ignore the "POSIX" spelling altogether?)

Speaking of which, the code in postinit.c is inconsistent in that respect with builtin_validate_locale(). Shouldn't postinit.c use builtin_validate_locale(), to keep it consistent? Or, there could be a general function that accepts a locale provider and a locale string and validates everything together?

In initdb.c, this message

  printf(_("The database cluster will be initialized with no locale.\n"));

sounds a bit confusing. I think it's ok to show "C" as a locale. I'm not sure we need to change the logic here.

Also in initdb.c, this message

  pg_fatal("locale must be specified unless provider is libc");

should be flipped around, like

  locale must be specified if provider is %s

In pg_dump.c, dumpDatabase(), there are some new warning messages that are not specifically about the builtin provider. Are those existing deficiencies? It's not clear to me.

What are the changes in the pg_upgrade test about? Maybe explain the scenario it is trying to test briefly?
0004: Inline some UTF-8 functions to improve performance

Makes sense that inlining can be effective here. But why aren't you just inlining the existing function pg_utf_mblen()? Now we have two functions that do the same thing. And the comment at pg_utf_mblen() is removed completely, so it's not clear anymore why it exists.

0005: Add a unicode_strtitle() function and move the implementation for the builtin provider out of formatting.c.

In the recent discussion you had expressed some uncertainty about the detailed semantics of this. INITCAP() was copied from Oracle, so we could check there for reference, too. Or we go with full Unicode semantics. I'm not clear on all the differences and tradeoffs, if there are any. In any case, it would be good if there were documentation or a comment that somehow wrote down the resolution of this.
Re: Built-in CTYPE provider
On 13.02.24 03:01, Jeff Davis wrote:
> 1. The SQL spec mentions the capitalization of "ß" as "SS" specifically. Should UCS_BASIC use the unconditional mappings in SpecialCasing.txt? I already have some code to do that (not posted yet).

It is my understanding that "correct" Unicode case conversion needs to use at least some parts of SpecialCasing.txt. The header of the file says:

  "For compatibility, the UnicodeData.txt file only contains simple case
  mappings for characters where they are one-to-one and independent of
  context and language. The data in this file, combined with the simple
  case mappings in UnicodeData.txt, defines the full case mappings [...]"

I read this as: just using UnicodeData.txt by itself is incomplete. I think we need to use the "Unconditional" mappings and the "Conditional Language-Insensitive" mappings (which is just the Greek sigma). Obviously, skip the "Language-Sensitive" mappings.
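The two SpecialCasing.txt categories mentioned above can be seen in any implementation of full case mapping. Python's built-in str methods follow UnicodeData.txt plus SpecialCasing.txt, so they make a handy illustration (a sketch, not the patch's implementation):

```python
# Full case mapping per SpecialCasing.txt, illustrated with Python's
# built-in str methods (a sketch, not the patch's implementation).

# Unconditional mapping: U+00DF ("ß") uppercases to the two-character
# string "SS"; the one-to-one mappings in UnicodeData.txt cannot
# express this length-changing mapping.
assert "ß".upper() == "SS"

# Conditional language-insensitive mapping (Final_Sigma): capital sigma
# lowercases to U+03C2 at the end of a word, U+03C3 elsewhere.
assert "ΣΑ".lower() == "σα"   # non-final position: U+03C3
assert "ΑΣ".lower() == "ας"   # final position: U+03C2
```

The language-sensitive mappings (e.g. Turkish dotless i) are the ones a language-agnostic builtin locale would skip, as described above.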
Re: Built-in CTYPE provider
On Wed, 2024-02-07 at 10:53 +0100, Peter Eisentraut wrote:
> Various comments are updated to include the term "character class". I don't recognize that as an official Unicode term. There are categories and properties. Let's check this.

It's based on https://www.unicode.org/reports/tr18/#Compatibility_Properties so I suppose the right name is "properties".

> Is it potentially confusing that only some pg_u_prop_* have a posix variant? Would it be better for a consistent interface to have a "posix" argument for each and just ignore it if not used? Not sure.

I thought about it but didn't see a clear advantage one way or another.

> About this initdb --builtin-locale option and analogous options elsewhere: Maybe we should flip this around and provide a --libc-locale option, and have all the other providers just use the --locale option. This would be more consistent with the fact that it's libc that is special in this context.

Would --libc-locale affect all the environment variables or just LC_CTYPE/LC_COLLATE? How do we avoid breakage? I like the general direction here but we might need to phase in the option or come up with a new name. Suggestions welcome.

> Do we even need the "C" locale? We have established that "C.UTF-8" is useful, but if that is easily available, who would need "C"?

I don't think we should encourage its use generally, but I also don't think it will disappear any time soon. Some people will want it on simplicity grounds. I hope fewer people will use "C" when we have a better builtin option.

> Some changes in this patch appear to be just straight renamings, like in src/backend/utils/init/postinit.c and src/bin/pg_upgrade/t/002_pg_upgrade.pl. Maybe those should be put into the previous patch instead.
>
> On the collation naming: My expectation would have been that the "C.UTF-8" locale would be exposed as the UCS_BASIC collation.

I'd like that. We have to sort out a couple things first, though:

1.
The SQL spec mentions the capitalization of "ß" as "SS" specifically. Should UCS_BASIC use the unconditional mappings in SpecialCasing.txt? I already have some code to do that (not posted yet).

2. Should UCS_BASIC use the "POSIX" or "Standard" variant of regex properties? (The main difference seems to be whether symbols get treated as punctuation or not.)

3. What do we do about potential breakage for existing users of UCS_BASIC who might be expecting C-like behavior?

Regards,
Jeff Davis
Re: Built-in CTYPE provider
Review of the v16 patch set:

(Btw., I suppose you started this patch series with 0002 because some 0001 was committed earlier. But I have found this rather confusing. I think it's ok to renumber from 0001 for each new version.)

* v16-0002-Add-Unicode-property-tables.patch

Various comments are updated to include the term "character class". I don't recognize that as an official Unicode term. There are categories and properties. Let's check this.

Some files need heavy pgindent and perltidy. You were surely going to do this eventually, but maybe you want to do this sooner to check whether you like the results.

- src/common/unicode/Makefile

This patch series adds some new post-update-unicode tests. Should we have a separate target for each or just one common "unicode test" target? Not sure.

- .../generate-unicode_category_table.pl

The trailing-commas handling ($firsttime etc.) is not necessary with C99. The code can be simplified.

For this kind of code:

+print $OT <<"HEADER";

let's use a common marker like EOS instead of a different one for each block. That just introduces unnecessary variations.

- src/common/unicode_category.c

The mask stuff at the top could use more explanation. It's impossible to figure out exactly what, say, PG_U_PC_MASK does.

Line breaks in the different pg_u_prop_* functions are gratuitously different.

Is it potentially confusing that only some pg_u_prop_* have a posix variant? Would it be better for a consistent interface to have a "posix" argument for each and just ignore it if not used? Not sure.

Let's use size_t instead of Size for new code.

* v16-0003-Add-unicode-case-mapping-tables-and-functions.patch

Several of the above points apply here analogously.

* v16-0004-Catalog-changes-preparing-for-builtin-collation-.patch

This is mostly a straightforward renaming patch, but there are some changes in initdb and pg_dump that pre-assume the changes in the next patch, like which locale columns apply for which providers.
I think it would be better for the historical record to make this a straight renaming patch and move those semantic changes to the next patch (or a separate intermediate patch, if you prefer).

- src/bin/psql/describe.c
- src/test/regress/expected/psql.out

This would be a good opportunity to improve the output columns for collations. The updated view is now:

 Schema | Name | Provider | Collate | Ctype | Locale | ICU Rules | Deterministic?
--------+------+----------+---------+-------+--------+-----------+----------------

This is historically grown but suboptimal. Why is Locale after Collate and Ctype, and why does it show both? I think we could have just the Locale column, and if the libc provider is used with different collate/ctype (very rare!), we somehow write that into the single locale column. (A change like this would be a separate patch.)

* v16-0005-Introduce-collation-provider-builtin-for-C-and-C.patch

About this initdb --builtin-locale option and analogous options elsewhere: Maybe we should flip this around and provide a --libc-locale option, and have all the other providers just use the --locale option. This would be more consistent with the fact that it's libc that is special in this context.

Do we even need the "C" locale? We have established that "C.UTF-8" is useful, but if that is easily available, who would need "C"?

Some changes in this patch appear to be just straight renamings, like in src/backend/utils/init/postinit.c and src/bin/pg_upgrade/t/002_pg_upgrade.pl. Maybe those should be put into the previous patch instead.

On the collation naming: My expectation would have been that the "C.UTF-8" locale would be exposed as the UCS_BASIC collation. And the "C" locale as some other name (or not at all, see above). You have this the other way around.
Re: Built-in CTYPE provider
On Mon, 2024-01-22 at 19:49 +0100, Peter Eisentraut wrote:
> I don't get this argument. Of course, people care about sorting and sort order. Whether you consider this part of Unicode or adjacent to it, people still want it.

You said that my proposal sends a message that we somehow don't care about Unicode, and I strongly disagree. The built-in provider I'm proposing does implement Unicode semantics. Surely a database that offers UCS_BASIC (a SQL spec feature) isn't sending a message that it doesn't care about Unicode, and neither is my proposal.

> > * ICU offers COLLATE UNICODE, locale tailoring, case-insensitive matching, and customization with rules. It's the solution for everything from "slightly more advanced" to "very advanced".
>
> I am astonished by this. In your world, do users not want their text data sorted? Do they not care what the sort order is?

I obviously care about Unicode and collation. I've put a lot of effort recently into contributions in this area, and I wouldn't have done that if I thought users didn't care. You've made much greater contributions, and I thank you for that.

The logical conclusion of your line of argument would be that libc's "C.UTF-8" locale and UCS_BASIC simply should not exist. But they do exist, and for good reason. One of those good reasons is that only *human* users care about the human-friendliness of sort order. If Postgres is just feeding the results to another system -- or an application layer that re-sorts the data anyway -- then stability, performance, and interoperability matter more than human-friendliness. (Though Unicode character semantics are still useful even when the data is not going directly to a human.)

> You consider UCA sort order an "advanced" feature?

I said "slightly more advanced" compared with "basic". "Advanced" can be taken in either a positive way ("more useful") or a negative way ("complex").
I'm sorry for the misunderstanding, but my point was this:

* The builtin provider is for people who are fine with code point order and no tailoring, but want Unicode character semantics, collation stability, and performance.

* ICU is the right solution for anyone who wants human-friendly collation or tailoring, and is willing to put up with some collation stability risk and lower collation performance.

Both have their place, and the user is free to mix and match as needed, thanks to the COLLATE clause for columns and queries.

Regards,
Jeff Davis
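The contrast between the two bullets above can be made concrete: in code point order (what a memcmp sort over UTF-8 yields), accented Latin letters sort after the entire unaccented alphabet, which a linguistic (UCA/ICU) sort would not do. A rough sketch in Python:

```python
# Code point order: stable and fast, but not human-friendly.
# A UCA/ICU linguistic collation would instead place "éclair"
# between "apple" and "zebra". Illustrative sketch only.
words = ["zebra", "éclair", "apple"]
print(sorted(words))  # é (U+00E9) sorts after z (U+007A)
```

Running this prints `['apple', 'zebra', 'éclair']`, which is exactly the tradeoff being discussed: predictable and version-stable, but surprising to a human reader expecting dictionary order.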
Re: Built-in CTYPE provider
On 18.01.24 23:03, Jeff Davis wrote:
> On Thu, 2024-01-18 at 13:53 +0100, Peter Eisentraut wrote:
> > I think that would be a terrible direction to take, because it would regress the default sort order from "correct" to "useless".
>
> I don't agree that the current default is "correct". There are a lot of ways it can be wrong:
>
> * the environment variables at initdb time don't reflect what the users of the database actually want
> * there are so many different users using so many different applications connected to the database that no one "correct" sort order exists
> * libc has some implementation quirks
> * the version of Unicode that libc is based on is not what you expect
> * the version of libc is not what you expect

These are arguments why the current defaults are not universally perfect, but I'd argue that they are still most often the right thing as the default.

> > Aside from the overall message this sends about how PostgreSQL cares about locales and Unicode and such.
>
> Unicode is primarily about the semantics of characters and their relationships. The patches I propose here do a great job of that. Collation (relationships between *strings*) is a part of Unicode, but not the whole thing or even the main thing.

I don't get this argument. Of course, people care about sorting and sort order. Whether you consider this part of Unicode or adjacent to it, people still want it.

> > Maybe you don't intend for this to be the default provider?
>
> I am not proposing that this provider be the initdb-time default.

ok

> > But then who would really use it? I mean, sure, some people would, but how would you even explain, in practice, the particular niche of users or use cases?
>
> It's for users who want to respect Unicode when supporting text from international sources in their database, but who are not experts on the subject and don't know precisely what they want or understand the consequences.
> If and when such users do notice a problem with the sort order, they'd handle it at that time (perhaps with a COLLATE clause, or sorting in the application).

> Vision:
>
> * ICU offers COLLATE UNICODE, locale tailoring, case-insensitive matching, and customization with rules. It's the solution for everything from "slightly more advanced" to "very advanced".

I am astonished by this. In your world, do users not want their text data sorted? Do they not care what the sort order is? You consider UCA sort order an "advanced" feature?
Re: Built-in CTYPE provider
On Thu, 2024-01-18 at 13:53 +0100, Peter Eisentraut wrote:
> I think that would be a terrible direction to take, because it would regress the default sort order from "correct" to "useless".

I don't agree that the current default is "correct". There are a lot of ways it can be wrong:

* the environment variables at initdb time don't reflect what the users of the database actually want
* there are so many different users using so many different applications connected to the database that no one "correct" sort order exists
* libc has some implementation quirks
* the version of Unicode that libc is based on is not what you expect
* the version of libc is not what you expect

> Aside from the overall message this sends about how PostgreSQL cares about locales and Unicode and such.

Unicode is primarily about the semantics of characters and their relationships. The patches I propose here do a great job of that. Collation (relationships between *strings*) is a part of Unicode, but not the whole thing or even the main thing.

> Maybe you don't intend for this to be the default provider?

I am not proposing that this provider be the initdb-time default.

> But then who would really use it? I mean, sure, some people would, but how would you even explain, in practice, the particular niche of users or use cases?

It's for users who want to respect Unicode when supporting text from international sources in their database, but who are not experts on the subject and don't know precisely what they want or understand the consequences. If and when such users do notice a problem with the sort order, they'd handle it at that time (perhaps with a COLLATE clause, or sorting in the application).

> Maybe if this new provider would be called "minimal", it might describe the purpose better.

"Builtin" communicates that it's available everywhere (not a dependency), that specific behaviors can be documented and tested, and that behavior doesn't change within a major version.
I want to communicate all of those things.

> I could see a use for this builtin provider if it also included the default UCA collation (what COLLATE UNICODE does now).

I won't rule that out, but I'm not proposing that right now, and my proposal is already offering useful functionality.

> There would still be a risk with that approach, since it would permanently marginalize ICU functionality

Yeah, ICU already does a good job offering the root collation. I don't think the builtin provider needs to do so.

> I would be curious what your overall vision is here?

Vision:

* The builtin provider will offer Unicode character semantics, basic collation, platform-independence, and high performance. It can be used on its own or in combination with ICU via the COLLATE clause.

* ICU offers COLLATE UNICODE, locale tailoring, case-insensitive matching, and customization with rules. It's the solution for everything from "slightly more advanced" to "very advanced".

* libc would be for databases serving applications on the same machine, where a matching sort order is helpful, risks to indexes are acceptable, and performance is not important.

> Is switching the default to ICU still your goal? Or do you want the builtin provider to be the default?

It's hard to answer this question while initdb chooses the database default collation based on the environment. Neither ICU nor the builtin provider can reasonably handle whatever those environment variables might be set to.

Stepping back from the focus on what initdb does, we should be providing the right encouragement in documentation and packaging to guide users toward the right provider based on their needs and the vision outlined above.

Regards,
Jeff Davis
Re: Built-in CTYPE provider
Peter Eisentraut wrote:
> > If the Postgres default was bytewise sorting + locale-agnostic ctype functions directly derived from Unicode data files, as opposed to libc/$LANG at initdb time, the main annoyance would be that "ORDER BY textcol" would no longer be the human-favored sort.
>
> I think that would be a terrible direction to take, because it would regress the default sort order from "correct" to "useless". Aside from the overall message this sends about how PostgreSQL cares about locales and Unicode and such.

Well, offering a viable solution to avoid as much as possible the dreaded:

  WARNING: collation "xyz" has version mismatch ...
  HINT: Rebuild all objects affected by this collation...

that doesn't sound like a bad message to send.

Currently, to get indexes in codepoint order when they don't need a linguistic order, you're supposed to use collate "C", which then means that upper(), lower(), etc. don't work beyond ASCII. Here our Unicode support is not good enough, and the proposal addresses that.

Best regards,
--
Daniel Vérité
https://postgresql.verite.pro/
Twitter: @DanielVerite
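The gap described above -- collate "C" giving codepoint order but ASCII-only ctype behavior -- can be illustrated by analogy in Python, whose bytes.upper() maps only a-z (much like toupper() in the "C" locale), while str.upper() applies Unicode case mapping. A sketch, not PostgreSQL code:

```python
# "C"-locale-style case mapping is ASCII-only; Unicode-aware mapping
# handles the full repertoire. Here bytes.upper() (ASCII-only by
# definition) stands in for toupper() in the "C" locale.
s = "café"
assert s.upper() == "CAFÉ"                          # Unicode-aware: é -> É
assert s.encode("utf-8").upper() == b"CAF\xc3\xa9"  # ASCII-only: é untouched
assert b"CAF\xc3\xa9".decode("utf-8") == "CAFé"
```

A builtin C.UTF-8 locale is meant to close exactly this gap: codepoint-order collation plus Unicode-aware ctype functions, with no external dependency.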
Re: Built-in CTYPE provider
On 12.01.24 03:02, Jeff Davis wrote:
> New version attached. Changes:
>
> * Named collation object PG_C_UTF8, which seems like a good idea to prevent name conflicts with existing collations. The locale name is still C.UTF-8, which still makes sense to me because it matches the behavior of the libc locale of the same name so closely.

I am catching up on this thread. The discussions have been very complicated, so maybe I didn't get it all. The patches look pretty sound, but I'm questioning how useful this feature is and where you plan to take it.

Earlier in the thread, the aim was summarized as

> If the Postgres default was bytewise sorting + locale-agnostic ctype functions directly derived from Unicode data files, as opposed to libc/$LANG at initdb time, the main annoyance would be that "ORDER BY textcol" would no longer be the human-favored sort.

I think that would be a terrible direction to take, because it would regress the default sort order from "correct" to "useless". Aside from the overall message this sends about how PostgreSQL cares about locales and Unicode and such.

Maybe you don't intend for this to be the default provider? But then who would really use it? I mean, sure, some people would, but how would you even explain, in practice, the particular niche of users or use cases?

Maybe if this new provider would be called "minimal", it might describe the purpose better.

I could see a use for this builtin provider if it also included the default UCA collation (what COLLATE UNICODE does now). Then it would provide a "common" default behavior out of the box, and if you want more fine-tuning, you can go to ICU. There would still be some questions about making sure the builtin behavior and the ICU behavior are consistent (different Unicode versions, stock UCA vs CLDR, etc.). But for practical purposes, it might work.
There would still be a risk with that approach, since it would permanently marginalize ICU functionality, in the sense that only some locales would need ICU, and so we might not pay the same amount of attention to the ICU functionality. I would be curious what your overall vision is here? Is switching the default to ICU still your goal? Or do you want the builtin provider to be the default? Or something else?
Re: Built-in CTYPE provider
On Mon, 2024-01-15 at 15:30 +0100, Daniel Verite wrote:
> Concerning the target category_test, it produces failures with versions of ICU with Unicode < 15. The first one I see with Ubuntu 22.04 (ICU 70.1) is: ...
>
> I find these results interesting because they tell us what contents can break regexp-based check constraints on upgrades.

Thank you for collecting and consolidating this information.

> But about category_test as a pass-or-fail kind of test, it can only be used when the Unicode version in ICU is the same as in Postgres.

The test has a few potential purposes:

1. To see if there is some error in parsing the Unicode files and building the arrays in the .h file. For instance, let's say the perl parser I wrote works fine on the Unicode 15.1 data file, but does something wrong on the 16.0 data file: the test would fail and we'd investigate. This is the most important reason for the test.

2. To notice any quirks between how we interpret Unicode vs how ICU does.

3. To help see "interesting" differences between different Unicode versions.

For #1 and #2, the best way to test is by using a version of ICU that uses the same Unicode version as Postgres. The one running update-unicode can try to recompile with the right one for the purposes of the test.

NB: There might be no version of ICU where the Unicode version exactly matches what we'd like to update to. In that case, we'd need to use the closest version and do some manual validation that the generated tables are sane.

For #3, that is also interesting information to know about, but it's not directly actionable. As you point out, Unicode does not guarantee that these properties are static forever, so regexes can change behavior when we update Unicode for the next PG version. That is a much lower risk than a collation change, but as you point out, is a risk for regexes inside of a CHECK constraint. If a user needs zero risk of semantic changes for regexes, the only option is "C".
Perhaps there should be a separate test target for this mode so that it doesn't exit early? (Note: case mapping has much stronger guarantees than the character classification.) I will update the README to document how someone running update-unicode should interpret the test results. Regards, Jeff Davis
Re: Built-in CTYPE provider
Jeff Davis wrote:
> New version attached. [v16]

Concerning the target category_test, it produces failures with versions of ICU with Unicode < 15. The first one I see with Ubuntu 22.04 (ICU 70.1) is:

  category_test: Postgres Unicode version:15.1
  category_test: ICU Unicode version: 14.0
  category_test: FAILURE for codepoint 0x000c04
  category_test: Postgres property alphabetic/lowercase/uppercase/white_space/hex_digit/join_control: 1/0/0/0/0/0
  category_test: ICU property alphabetic/lowercase/uppercase/white_space/hex_digit/join_control: 0/0/0/0/0/0

U+0C04 is a codepoint added in Unicode 11. https://en.wikipedia.org/wiki/Telugu_(Unicode_block)

In UnicodeData.txt:

  0C04;TELUGU SIGN COMBINING ANUSVARA ABOVE;Mn;0;NSM;N;

In Unicode 15, it has been assigned Other_Alphabetic in PropList.txt:

  $ grep 0C04 PropList.txt
  0C04 ; Other_Alphabetic # Mn TELUGU SIGN COMBINING ANUSVARA ABOVE

But in Unicode 14 it was not there. As a result, its binary property UCHAR_ALPHABETIC has changed from false to true in ICU 72 vs previous versions.

As I understand, the stability policy says that such things happen. From https://www.unicode.org/policies/stability_policy.html

  Once a character is encoded, its properties may still be changed, but
  not in such a way as to change the fundamental identity of the
  character. The Consortium will endeavor to keep the values of the
  other properties as stable as possible, but some circumstances may
  arise that require changing them. Particularly in the situation where
  the Unicode Standard first encodes less well-documented characters
  and scripts, the exact character properties and behavior initially
  may not be well known. As more experience is gathered in implementing
  the characters, adjustments in the properties may become necessary.
  Examples of such properties include, but are not limited to, the
  following:
  - General_Category
  - Case mappings
  - Bidirectional properties
  [...]
I've commented out the exit(1) in category_test to collect all errors, and built it with versions of ICU from 74 down to 60 (that is, Unicode 10.0). Results are attached. As expected, the older the ICU version, the more differences are found against Unicode 15.1.

I find these results interesting because they tell us what contents can break regexp-based check constraints on upgrades.

But about category_test as a pass-or-fail kind of test: it can only be used when the Unicode version in ICU is the same as in Postgres.

Best regards,
--
Daniel Vérité
https://postgresql.verite.pro/

[Attachment: results-category-tests-multiple-icu-versions.tar.bz2]
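The version skew described in this exchange is easy to reproduce with any library that pins a Unicode version; Python's unicodedata module is one such (a sketch for illustration, unrelated to the category_test binary):

```python
import unicodedata

# Each Python release pins one Unicode version, just as each ICU
# release does -- so property checks can disagree across versions.
print(unicodedata.unidata_version)

# U+0C04 was added in Unicode 11 with General_Category Mn; per the
# report above, it additionally gained Other_Alphabetic in Unicode 15,
# flipping ICU's UCHAR_ALPHABETIC from false to true between releases.
assert unicodedata.name("\u0c04") == "TELUGU SIGN COMBINING ANUSVARA ABOVE"
assert unicodedata.category("\u0c04") == "Mn"
```

The General_Category stays stable here while a derived binary property changed underneath it, which is exactly why a property-level comparison test only gives a clean pass/fail when both sides use the same Unicode version.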
Re: Built-in CTYPE provider
On Fri, Jan 12, 2024 at 01:13:04PM -0500, Robert Haas wrote:
> On Fri, Jan 12, 2024 at 1:00 PM Daniel Verite wrote:
> > ISTM that in general the behavior of old psql vs new server does not weigh much against choosing optimal catalog changes.
>
> +1.

+1. There is a good amount of effort put into maintaining downward compatibility in psql. Upward compatibility would require more manipulations of the stable branches to make older versions of psql compatible with newer server versions. Brr.
--
Michael
Re: Built-in CTYPE provider
On Fri, 2024-01-12 at 19:00 +0100, Daniel Verite wrote:
> Another one is that version 12 broke \d in older psql by removing pg_class.relhasoids.
>
> ISTM that in general the behavior of old psql vs new server does not weigh much against choosing optimal catalog changes.
>
> There's also that warning at start informing users about it:
> WARNING: psql major version X, server major version Y.
> Some psql features might not work.

Good point, I'll leave it as-is then. If someone complains, I can rework it. Also, the output of \l changes from version to version, so if there are automated tools processing the output then they'd have to change anyway.

Regards,
Jeff Davis
Re: Built-in CTYPE provider
On Fri, Jan 12, 2024 at 1:00 PM Daniel Verite wrote:
> ISTM that in general the behavior of old psql vs new server does not weigh much against choosing optimal catalog changes.

+1.
--
Robert Haas
EDB: http://www.enterprisedb.com
Re: Built-in CTYPE provider
Jeff Davis wrote:
> > Jeremy also raised a problem with old versions of psql connecting to a new server: the \l and \dO won't work. Not sure exactly what to do there, but I could work around it by adding a new field rather than renaming (though that's not ideal).
>
> I did a bit of research for a precedent, and the closest I could find is that \dp was broken between 14 and 15 by commit 07eee5a0dc.

Another one is that version 12 broke \d in older psql by removing pg_class.relhasoids.

ISTM that in general the behavior of old psql vs new server does not weigh much against choosing optimal catalog changes.

There's also that warning at start informing users about it:

  WARNING: psql major version X, server major version Y.
  Some psql features might not work.

Best regards,
--
Daniel Vérité
https://postgresql.verite.pro/
Re: Built-in CTYPE provider
On Thu, 2024-01-11 at 18:02 -0800, Jeff Davis wrote:
> Jeremy also raised a problem with old versions of psql connecting to a new server: the \l and \dO won't work. Not sure exactly what to do there, but I could work around it by adding a new field rather than renaming (though that's not ideal).

I did a bit of research for a precedent, and the closest I could find is that \dp was broken between 14 and 15 by commit 07eee5a0dc.

That is some precedent, but it's more narrow. I think that might justify breaking \dO in older clients, but \l is used frequently.

It seems safer to just introduce new columns "datbuiltinlocale" and "collbuiltinlocale". They'll be nullable anyway. If we want to clean this up, we can do so as a separate commit. There are other parts of the catalog representation related to collation that we might want to clean up as well.

Regards,
Jeff Davis
Re: Built-in CTYPE provider
On Tue, 2024-01-09 at 14:17 -0800, Jeremy Schneider wrote: > I think we missed something in psql, pretty sure I applied all the > patches but I see this error: > > =# \l > ERROR: 42703: column d.datlocale does not exist > LINE 8: d.datlocale as "Locale", > ^ > HINT: Perhaps you meant to reference the column "d.daticulocale". > LOCATION: errorMissingColumn, parse_relation.c:3720 I think you're connecting to a patched server with an older version of psql, so it doesn't know the catalog column was renamed. pg_dump and pg_upgrade don't have that problem because they throw an error when connecting to a newer server. But for psql, it's perfectly reasonable to connect to a newer server. Have we dropped or renamed catalog columns used by psql backslash commands before, and if so, how do we handle that? I can just not rename that column, but that's a bit frustrating because it means I'd need to add a new column to pg_database, which seems redundant. Regards, Jeff Davis
Re: Built-in CTYPE provider
On Wed, 2024-01-10 at 23:56 +0100, Daniel Verite wrote: > $ bin/initdb --locale=C.UTF-8 --locale-provider=builtin -D/tmp/pgdata > > The database cluster will be initialized with this locale > configuration: > default collation provider: builtin > default collation locale: C.UTF-8 > LC_COLLATE: C.UTF-8 > LC_CTYPE: C.UTF-8 > LC_MESSAGES: C.UTF-8 > LC_MONETARY: C.UTF-8 > LC_NUMERIC: C.UTF-8 > LC_TIME: C.UTF-8 > The default database encoding has accordingly been set to "UTF8". > The default text search configuration will be set to "english". > > This is from an environment where LANG=fr_FR.UTF-8 > > I would expect all LC_* variables to be fr_FR.UTF-8, and the default > text search configuration to be "french". You can get the behavior you want by doing: initdb --builtin-locale=C.UTF-8 --locale-provider=builtin \ -D/tmp/pgdata where "--builtin-locale" is analogous to "--icu-locale". It looks like I forgot to document the new initdb option, which seems to be the source of the confusion. Sorry, I'll fix that in the next patch set. (See examples in the initdb tests.) I think this answers some of your follow-up questions as well. > A related comment is about naming the builtin locale C.UTF-8, the > same > name as in libc. On one hand this is semantically sound, but on the > other hand, it's likely to confuse people. What about using > completely > different names, like "pg_unicode" or something else prefixed by > "pg_" > both for the locale name and the collation name (currently > C.UTF-8/c_utf8)? I'm flexible on naming, but here are my thoughts: * A "pg_" prefix makes sense. * If we named it something like "pg_unicode" someone might expect it to sort using the root collation. * The locale name "C.UTF-8" is nice because it implies things about both the collation and the character behavior. It's also nice because on at least some platforms, the behavior is almost identical to the libc locale of the same name. 
* UCS_BASIC might be a good name, because it also seems to carry the right meanings, but that name is already taken. * We might also want to support variations, such as full case mapping (which uppercases "ß" to "SS", as the SQL standard requires), or perhaps the "standard" flavor of regexes (which don't count all symbols as punctuation). Leaving some room to name those variations would be a good idea. Regards, Jeff Davis
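The "full case mapping" variation mentioned above is easy to see outside Postgres. Python's str.upper() implements the Unicode full case mappings, so it serves as a quick reference here (this illustrates the Unicode behavior only, not any PostgreSQL API):

```python
# Full case mapping may change the length of a string: the SQL
# standard's own example is U+00DF (ß), which uppercases to "SS".
word = "straße"
upper = word.upper()
print(upper)  # STRASSE -- one character longer than the input

# A simple (1:1) per-character mapping table, by contrast, has to
# leave ß unchanged, since there is no single-character uppercase form.
assert "ß".upper() == "SS"
assert len(word) == 6 and len(upper) == 7
```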
Re: Built-in CTYPE provider
Jeff Davis wrote:

> Attached a more complete version that fixes a few bugs
[v15 patch]

When selecting the builtin provider with initdb, I'm getting the following setup:

$ bin/initdb --locale=C.UTF-8 --locale-provider=builtin -D/tmp/pgdata

The database cluster will be initialized with this locale configuration:
  default collation provider: builtin
  default collation locale:   C.UTF-8
  LC_COLLATE:  C.UTF-8
  LC_CTYPE:    C.UTF-8
  LC_MESSAGES: C.UTF-8
  LC_MONETARY: C.UTF-8
  LC_NUMERIC:  C.UTF-8
  LC_TIME:     C.UTF-8
The default database encoding has accordingly been set to "UTF8".
The default text search configuration will be set to "english".

This is from an environment where LANG=fr_FR.UTF-8. I would expect all LC_* variables to be fr_FR.UTF-8, and the default text search configuration to be "french". That is what happens when selecting ICU as the provider in the same environment:

$ bin/initdb --icu-locale=en --locale-provider=icu -D/tmp/pgdata

Using language tag "en" for ICU locale "en".
The database cluster will be initialized with this locale configuration:
  default collation provider: icu
  default collation locale:   en
  LC_COLLATE:  fr_FR.UTF-8
  LC_CTYPE:    fr_FR.UTF-8
  LC_MESSAGES: fr_FR.UTF-8
  LC_MONETARY: fr_FR.UTF-8
  LC_NUMERIC:  fr_FR.UTF-8
  LC_TIME:     fr_FR.UTF-8
The default database encoding has accordingly been set to "UTF8".
The default text search configuration will be set to "french".

The collation setup does not influence the rest of the localization. The problem AFAIU is that --locale has two distinct meanings in the v15 patch: --locale-provider=X --locale=Y means both use "X" as the provider with "Y" as datlocale, and use "Y" as the locale for all localized libc functionalities. I wonder what would happen when invoking bin/initdb --locale=C.UTF-8 --locale-provider=builtin -D/tmp/pgdata on a system where C.UTF-8 does not exist as a libc locale. Would it fail? (I don't have an OS like this to test ATM, will try later). 
A related comment is about naming the builtin locale C.UTF-8, the same name as in libc. On one hand this is semantically sound, but on the other hand, it's likely to confuse people. What about using completely different names, like "pg_unicode" or something else prefixed by "pg_" both for the locale name and the collation name (currently C.UTF-8/c_utf8)? Best regards, -- Daniel Vérité https://postgresql.verite.pro/ Twitter: @DanielVerite
Re: Built-in CTYPE provider
On 1/9/24 2:31 PM, Jeff Davis wrote: > On Tue, 2024-01-09 at 14:17 -0800, Jeremy Schneider wrote: >> I think we missed something in psql, pretty sure I applied all the >> patches but I see this error: >> >> =# \l >> ERROR: 42703: column d.datlocale does not exist >> LINE 8: d.datlocale as "Locale", >> > > Thank you. I'll fix this in the next patch set.

Very minor nitpick/request, not directly with this patch set but with the category_test, which is related and also recently committed. I'm testing out the "make update-unicode" scripts, and because my system ICU version differs from the PostgreSQL Unicode version I get the expected warnings from category_test:

Postgres Unicode Version:	15.1
ICU Unicode Version:		14.0
Skipped 5116 codepoints unassigned in ICU due to Unicode version mismatch.
category_test: All tests successful!

I know it's minor, but I saw the 5116 skipped codepoints next to "all tests successful" and thought the output might be a little nicer if we showed the count of tests that succeeded. For example:

Postgres Unicode Version:	15.1
ICU Unicode Version:		14.0
Skipped 5116 codepoints unassigned in ICU due to Unicode version mismatch.
category_test: All 1108996 tests successful!

It's a minor tweak to a script that I don't think even runs in the standard build; any objections? Patch attached. -Jeremy -- http://about.me/jeremy_schneider From 5ff09eb7371abc14cd8537b4e3265e35f030794f Mon Sep 17 00:00:00 2001 From: Jeremy Schneider Date: Wed, 10 Jan 2024 07:25:17 + Subject: [PATCH] Output count of checked codepoints. 
--- src/common/unicode/category_test.c | 8 +++- 1 file changed, 7 insertions(+), 1 deletion(-) diff --git a/src/common/unicode/category_test.c b/src/common/unicode/category_test.c index 6cd7cd1a5f..62ff345066 100644 --- a/src/common/unicode/category_test.c +++ b/src/common/unicode/category_test.c @@ -53,6 +53,7 @@ main(int argc, char **argv) int icu_unicode_version = parse_unicode_version(U_UNICODE_VERSION); int pg_skipped_codepoints = 0; int icu_skipped_codepoints = 0; + int checked_codepoints = 0; printf("Postgres Unicode Version:\t%s\n", PG_UNICODE_VERSION); printf("ICU Unicode Version:\t\t%s\n", U_UNICODE_VERSION); @@ -90,6 +91,10 @@ main(int argc, char **argv) exit(1); } } + else + { + checked_codepoints++; + } } if (pg_skipped_codepoints > 0) @@ -99,7 +104,8 @@ main(int argc, char **argv) printf("Skipped %d codepoints unassigned in ICU due to Unicode version mismatch.\n", icu_skipped_codepoints); - printf("category_test: All tests successful!\n"); + printf("category_test: All %d tests successful!\n", + checked_codepoints); exit(0); #else printf("ICU support required for test; skipping.\n"); -- 2.34.1
Re: Built-in CTYPE provider
On Mon, 2024-01-08 at 17:17 -0800, Jeremy Schneider wrote: > I agree with merging the threads, even though it makes for a larger > patch set. It would be great to get a unified "builtin" provider in > place for the next major. I believe that's possible and that this proposal is quite close (hoping to get something in this 'fest). The tables I'm introducing have exhaustive test coverage, so there's not a lot of risk there. And the builtin provider itself is an optional feature, so it won't be disruptive. > > In the first list it seems that some callers might be influenced by a > COLLATE clause or table definition while others always take the > database > default? It still seems a bit odd to me if different providers can be > used for different parts of a single SQL. Right, that can happen today, and my proposal doesn't change that. Basically those are cases where the caller was never properly onboarded to our collation system, like the ts_locale.c routines. > Is there any reason we couldn't commit the minor cleanup (patch 0001) > now? It's less than 200 lines and pretty straightforward. Sure, I'll commit that fairly soon then. > I wonder if, after a year of running the builtin provider in > production, > whether we might consider adding to the builtin provider a few > locales > with simple but more reasonable ordering for european and asian > languages? I won't rule that out completely, but there's a lot we would need to do to get there. Even assuming we implement that perfectly, we'd need to make sure it's a reasonable scope for Postgres as a project and that we have more than one person willing to maintain it. Similar things have been rejected before for similar reasons. What I'm proposing for v17 is much simpler: basically some lookup tables, which is just an extension of what we're already doing for normalization. 
> https://jeremyhussell.blogspot.com/2017/11/falsehoods-programmers-believe-about.html#main > > Make sure to click the link to show the counterexamples and > discussion, > that's the best part. Yes, it can be hard to reason about this stuff but I believe Unicode provides a lot of good data and guidance to work from. If you think my proposal relies on one of those assumptions let me know. To the extent that I do rely on any of those assumptions, it's mostly to match libc's "C.UTF-8" behavior. Regards, Jeff Davis
Re: Built-in CTYPE provider
On Tue, 2024-01-09 at 14:17 -0800, Jeremy Schneider wrote: > I think we missed something in psql, pretty sure I applied all the > patches but I see this error: > > =# \l > ERROR: 42703: column d.datlocale does not exist > LINE 8: d.datlocale as "Locale", > Thank you. I'll fix this in the next patch set. > This is interesting. Jeff your original email didn't explicitly show > any > other initcap() results, but on Ubuntu 22.04 (glibc 2.35) I see > different results: > > =# SELECT initcap('axxE áxxÉ DŽxxDŽ Džxxx džxxx'); > initcap > -- > Axxe Áxxé DŽxxdž DŽxxx DŽxxx > > =# SELECT initcap('axxE áxxÉ DŽxxDŽ Džxxx džxxx' COLLATE C_UTF8); > initcap > -- > Axxe Áxxé Džxxdž Džxxx Džxxx The reason for this difference is because libc doesn't support titlecase. In the next patch set, I'll not use titlecase (only uppercase/lowercase even for initcap()), and then it will match libc 100%. > The COLLATE sql syntax feels awkward to me. In this example, we're > just > using it to attach locale info to the string, and there's not > actually > any collation involved here. Not sure if COLLATE comes from the > standard, and even if it does I'm not sure whether the standard had > upper/lowercase in mind. The standard doesn't use the COLLATE clause for case mapping, but also doesn't offer any other mechanism to, e.g., get case mapping according to the "tr_TR" locale. I think what Postgres does here, re-purposing the COLLATE clause to allow tailoring of case mapping, is imperfect but reasonable. My proposal doesn't change that. Regards, Jeff Davis
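The titlecase behavior Jeff describes (and the glibc difference Jeremy observed) can be reproduced with the U+01C4..U+01C6 digraph from the example. Python's str.title() applies the Unicode titlecase mapping, so it serves as a reference here (an illustration of the Unicode data, not the patch's code):

```python
# dž (U+01C6) has three case forms: DŽ (U+01C4, uppercase),
# Dž (U+01C5, titlecase), and dž (U+01C6, lowercase).
word = "\u01c6xxx"  # "džxxx"

# libc-style initcap: uppercase the first character, lowercase the rest.
libc_style = word[0].upper() + word[1:].lower()
# Unicode titlecase mapping, which the earlier patch set used for initcap().
unicode_title = word.title()

assert libc_style == "\u01c4xxx"     # "DŽxxx", matching the glibc output above
assert unicode_title == "\u01c5xxx"  # "Džxxx"
```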
Re: Built-in CTYPE provider
On 12/28/23 6:57 PM, Jeff Davis wrote: > On Wed, 2023-12-27 at 17:26 -0800, Jeff Davis wrote: > Attached a more complete version that fixes a few bugs, stabilizes the > tests, and improves the documentation. I optimized the performance, too > -- now it's beating both libc's "C.utf8" and ICU "en-US-x-icu" for both > collation and case mapping (numbers below). > > It's really nice to finally be able to have platform-independent tests > that work on any UTF-8 database.

I think we missed something in psql, pretty sure I applied all the patches but I see this error:

=# \l
ERROR: 42703: column d.datlocale does not exist
LINE 8: d.datlocale as "Locale",
        ^
HINT: Perhaps you meant to reference the column "d.daticulocale".
LOCATION: errorMissingColumn, parse_relation.c:3720

This is interesting. Jeff, your original email didn't explicitly show any other initcap() results, but on Ubuntu 22.04 (glibc 2.35) I see different results:

=# SELECT initcap('axxE áxxÉ DŽxxDŽ Džxxx džxxx');
       initcap
----------------------
 Axxe Áxxé DŽxxdž DŽxxx DŽxxx

=# SELECT initcap('axxE áxxÉ DŽxxDŽ Džxxx džxxx' COLLATE C_UTF8);
       initcap
----------------------
 Axxe Áxxé Džxxdž Džxxx Džxxx

The COLLATE SQL syntax feels awkward to me. In this example, we're just using it to attach locale info to the string, and there's not actually any collation involved here. Not sure if COLLATE comes from the standard, and even if it does I'm not sure whether the standard had upper/lowercase in mind. That said, I think the thing that mainly matters will be the CREATE DATABASE syntax and the database default. I want to try a few things with table-level defaults that differ from database-level defaults, especially table-level ICU defaults, because I think a number of PostgreSQL users set that up in the years before we supported DB-level ICU. Some people will probably keep using their old/existing schema-creation scripts even after they begin provisioning new systems with new database-level defaults. -Jeremy -- http://about.me/jeremy_schneider
Re: Built-in CTYPE provider
On 12/28/23 6:57 PM, Jeff Davis wrote: > > Attached a more complete version that fixes a few bugs, stabilizes the > tests, and improves the documentation. I optimized the performance, too > -- now it's beating both libc's "C.utf8" and ICU "en-US-x-icu" for both > collation and case mapping (numbers below). > > It's really nice to finally be able to have platform-independent tests > that work on any UTF-8 database. Thanks for all your work on this, Jeff. I didn't know about the Unicode stability policy. Since it's formal policy, I agree this provides some assumptions we can safely build on. I'm working my way through these patches but it's taking a little time for me. I hadn't tracked with the "builtin" thread last summer so I'm coming up to speed on that now too. I'm hopeful that something along these lines gets into pg17. The pg17 cycle is going to start heating up pretty soon. I agree with merging the threads, even though it makes for a larger patch set. It would be great to get a unified "builtin" provider in place for the next major. I also still want to parse my way through your email reply about the two groups of callers, and what this means for user experience. https://www.postgresql.org/message-id/7774b3a64f51b3375060c29871cf2b02b3e85dab.camel%40j-davis.com > Let's separate it into groups. > (1) Callers that use a collation OID or pg_locale_t: > (2) A long tail of callers that depend on what LC_CTYPE/LC_COLLATE are > set to, or use ad-hoc ASCII-only semantics: In the first list it seems that some callers might be influenced by a COLLATE clause or table definition while others always take the database default? It still seems a bit odd to me if different providers can be used for different parts of a single SQL statement. But it might not be so bad - I haven't fully thought through it yet and I'm still kicking the tires on my test build over here. Is there any reason we couldn't commit the minor cleanup (patch 0001) now? 
It's less than 200 lines and pretty straightforward. I wonder whether, after a year of running the builtin provider in production, we might consider adding to the builtin provider a few locales with simple but more reasonable ordering for European and Asian languages? Maybe just grabbing a current version of DUCET with no intention of ever updating it? I don't know how bad sorting is with plain DUCET for things like French or Spanish or German, but surely it's not as unusable as code point order? Anyone who needs truly accurate or updated or customized linguistic sorting can always go to ICU, and take responsibility for their ICU upgrades, but something basic and static might meet the needs of 99% of Postgres users indefinitely. By the way - my coworker Josh (who I don't think posts much on the hackers list here, but shares an unhealthy inability to look away from database unicode disasters) passed along this link today which I think is a fantastic list of surprises about programming and strings (generally unicode). https://jeremyhussell.blogspot.com/2017/11/falsehoods-programmers-believe-about.html#main Make sure to click the link to show the counterexamples and discussion; that's the best part. -Jeremy PS. I was joking around today that the second-best part is that it's proof that people named Jeremy are always brilliant within their field. Josh said it's just a subset of "always trust people whose names start with J" which seems fair. Unfortunately I can't yet think of a way to shoehorn the rest of the amazing PG hackers on this thread into the joke. -- http://about.me/jeremy_schneider
Re: Built-in CTYPE provider
Robert Haas wrote: > For someone who is currently defaulting to es_ES.utf8 or fr_FR.utf8, > a change to C.utf8 would be a much bigger problem, I would > think. Their alphabet isn't in code point order, and so things would > be alphabetized wrongly. > That might be OK if they don't care about ordering for any purpose > other than equality lookups, but otherwise it's going to force them > to change the default, where today they don't have to do that. Sure, in whatever collation setup we expose, we need to keep it possible and even easy to sort properly with linguistic rules. But some reasons to use $LANG as the default locale/collation are no longer as good as they used to be, I think. Starting with v10/ICU we have many pre-created ICU locales with fixed names, and starting with v16, we can simply write "ORDER BY textfield COLLATE unicode" which is good enough in most cases. So the configuration "bytewise sort by default" / "linguistic sort on-demand" has become more realistic. By contrast in the pre-v10 days with only libc collations, an application could have no idea which collations were going to be available on the server, and how they were named precisely, as this varies across OSes and across installs even with the same OS. On Windows, I think that before v16 initdb did not create any libc collation beyond C/POSIX and the default language/region of the OS. In that libc context, if a db wants the C locale by default for performance and truly immutable indexes, but the client app needs to occasionally do in-db linguistic sorts, the app needs to figure out which collation name will work for that. This is hard if you don't target a specific installation that guarantees that such or such collation is going to be installed. Whereas if the linguistic locale is the default, the app never needs to know its name or anything about it. So it's done that way, linguistic by default. 
But that leaves databases with many indexes sorted linguistically instead of bytewise for fields that semantically never need any linguistic sort. Best regards, -- Daniel Vérité https://postgresql.verite.pro/ Twitter: @DanielVerite
Re: Built-in CTYPE provider
On Wed, 2023-12-20 at 15:47 -0800, Jeremy Schneider wrote: > One other thing that comes to mind: how does the parser do case > folding > for relation names? Is that using OS-provided libc as of today? Or > did > we code it to use ICU if that's the DB default? I'm guessing libc, > and > global catalogs probably need to be handled in a consistent manner, > even > across different encodings.

The code is in downcase_identifier():

    /*
     * SQL99 specifies Unicode-aware case normalization, which we don't
     * yet have the infrastructure for...
     */
    if (ch >= 'A' && ch <= 'Z')
        ch += 'a' - 'A';
    else if (enc_is_single_byte && IS_HIGHBIT_SET(ch) && isupper(ch))
        ch = tolower(ch);
    result[i] = (char) ch;

My proposal would add the infrastructure that the comment above says is missing. It seems like we should be using the database collation at this point, because you don't want inconsistency between the catalogs and the parser here. Then again, the SQL spec doesn't seem to support tailoring of case conversions, so maybe we are avoiding it for that reason? Or maybe we're avoiding catalog access? Or perhaps the work for ICU just wasn't done here yet?

> (Kind of related... did you ever see the demo where I create a user > named > '' and then I try to connect to a database with non-unicode > encoding? > ...at least it seems to be able to walk the index without > decoding > strings to find other users - but the way these global catalogs work > scares me a little bit)

I didn't see that specific demo, but in general we seem to change between pg_wchar and Unicode code points too freely, so I'm not surprised that something went wrong. Regards, Jeff Davis
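For illustration, the ASCII-only folding rule in the snippet above can be sketched in Python (the function name is mine, and this models only the multibyte-encoding path, where the single-byte tolower() branch never fires):

```python
def downcase_identifier_ascii(name: str) -> str:
    # Only A-Z is folded; any non-ASCII character passes through
    # untouched -- the Unicode-aware normalization the comment in
    # downcase_identifier() says is still missing.
    return "".join(
        chr(ord(c) + (ord("a") - ord("A"))) if "A" <= c <= "Z" else c
        for c in name
    )

assert downcase_identifier_ascii("MyTable") == "mytable"
# An accented capital survives unchanged in a multibyte encoding:
assert downcase_identifier_ascii("Äpfel") == "Äpfel"
```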
Re: Built-in CTYPE provider
On Wed, 2023-12-20 at 16:29 -0800, Jeremy Schneider wrote: > found some more. here's my running list of everything user-facing I > see > in core PG code so far that might involve case: > > * upper/lower/initcap > * regexp_*() and *_REGEXP() > * ILIKE, operators ~* !~* ~~ !~~ ~~* !~~* > * citext + replace(), split_part(), strpos() and translate() > * full text search - everything is case folded > * unaccent? not clear to me whether CTYPE includes accent folding No, ctype has nothing to do with accents as far as I can tell. I don't know if I'm using the right terminology, but I think "case" is a variant of a character whereas "accent" is a modifier/mark, and the mark is a separate concept from the character itself. > * ltree > * pg_trgm > * core PG parser, case folding of relation names Let's separate it into groups. (1) Callers that use a collation OID or pg_locale_t: * collation & hashing * upper/lower/initcap * regex, LIKE, formatting * pg_trgm (which uses regexes) * maybe postgres_fdw, but might just be a passthrough * catalog cache (always uses DEFAULT_COLLATION_OID) * citext (always uses DEFAULT_COLLATION_OID, but probably shouldn't) (2) A long tail of callers that depend on what LC_CTYPE/LC_COLLATE are set to, or use ad-hoc ASCII-only semantics: * core SQL parser downcase_identifier() * callers of pg_strcasecmp() (DDL, etc.) * GUC name case folding * full text search ("mylocale = 0 /* TODO */") * a ton of stuff uses isspace(), isdigit(), etc. * various callers of tolower()/toupper() * some selfuncs.c stuff * ... Might have missed some places. The user impact of a new builtin provider would affect (1), but only for those actually using the provider. So there's no compatibility risk there, but it's good to understand what it will affect. We can, on a case-by-case basis, also consider using the new APIs I'm proposing for instances of (2). There would be some compatibility risk there for existing callers, and we'd have to consider whether it's worth it or not. 
Ideally, new callers would either use the new APIs or use the pg_ascii_* APIs. Regards, Jeff Davis
Re: Built-in CTYPE provider
On Wed, Dec 20, 2023 at 5:57 PM Jeff Davis wrote: > Those locales all have the same ctype behavior. Sigh. I keep getting confused about how that works... -- Robert Haas EDB: http://www.enterprisedb.com
Re: Built-in CTYPE provider
On 12/20/23 4:04 PM, Jeremy Schneider wrote: > On 12/20/23 3:47 PM, Jeremy Schneider wrote: >> On 12/5/23 3:46 PM, Jeff Davis wrote: >>> CTYPE, which handles character classification and upper/lowercasing >>> behavior, may be simpler than it first appears. We may be able to get >>> a net decrease in complexity by just building in most (or perhaps all) >>> of the functionality. >> >> I'll be honest, even though this is primarily about CTYPE and not >> collation, I still need to keep re-reading the initial email slowly to >> let it sink in and better understand it... at least for me, it's complex >> to reason through. >> >> I'm trying to make sure I understand clearly what the user impact/change >> is that we're talking about: after a little bit of brainstorming and >> looking through the PG docs, I'm actually not seeing much more than >> these two things you've mentioned here: the set of regexp_* functions PG >> provides, and these three generic functions. That alone doesn't seem >> highly concerning. > > I missed citext, which extends impact to replace(), split_part(), > strpos() and translate(). There are also the five *_REGEX() functions > from the SQL standard which I assume are just calling the PG functions. found some more. here's my running list of everything user-facing I see in core PG code so far that might involve case: * upper/lower/initcap * regexp_*() and *_REGEXP() * ILIKE, operators ~* !~* ~~ !~~ ~~* !~~* * citext + replace(), split_part(), strpos() and translate() * full text search - everything is case folded * unaccent? not clear to me whether CTYPE includes accent folding * ltree * pg_trgm * core PG parser, case folding of relation names -- http://about.me/jeremy_schneider
Re: Built-in CTYPE provider
On 12/20/23 3:47 PM, Jeremy Schneider wrote: > On 12/5/23 3:46 PM, Jeff Davis wrote: >> CTYPE, which handles character classification and upper/lowercasing >> behavior, may be simpler than it first appears. We may be able to get >> a net decrease in complexity by just building in most (or perhaps all) >> of the functionality. > > I'll be honest, even though this is primarily about CTYPE and not > collation, I still need to keep re-reading the initial email slowly to > let it sink in and better understand it... at least for me, it's complex > to reason through. > > I'm trying to make sure I understand clearly what the user impact/change > is that we're talking about: after a little bit of brainstorming and > looking through the PG docs, I'm actually not seeing much more than > these two things you've mentioned here: the set of regexp_* functions PG > provides, and these three generic functions. That alone doesn't seem > highly concerning. I missed citext, which extends impact to replace(), split_part(), strpos() and translate(). There are also the five *_REGEX() functions from the SQL standard which I assume are just calling the PG functions. I just saw the krb_caseins_users GUC, which reminds me that PLs also have their own case functions. And of course extensions. I'm not saying any of this is in scope for the change here, but I'm just trying to wrap my brain around all the places we've got CTYPE processing happening, to better understand the big picture. It might help tease out unexpected small glitches from changing one thing but not another one. -Jeremy -- http://about.me/jeremy_schneider
Re: Built-in CTYPE provider
On 12/5/23 3:46 PM, Jeff Davis wrote: > CTYPE, which handles character classification and upper/lowercasing > behavior, may be simpler than it first appears. We may be able to get > a net decrease in complexity by just building in most (or perhaps all) > of the functionality. > > === Character Classification === > > Character classification is used for regexes, e.g. whether a character > is a member of the "[[:digit:]]" ("\d") or "[[:punct:]]" > class. Unicode defines what character properties map into these > classes in TR #18 [1], specifying both a "Standard" variant and a > "POSIX Compatible" variant. The main difference with the POSIX variant > is that symbols count as punctuation. > > === LOWER()/INITCAP()/UPPER() === > > The LOWER() and UPPER() functions are defined in the SQL spec with > surprising detail, relying on specific Unicode General Category > assignments. How to map characters seems to be left (implicitly) up to > Unicode. If the input string is normalized, the output string must be > normalized, too. Weirdly, there's no room in the SQL spec to localize > LOWER()/UPPER() at all to handle issues like [1]. Also, the standard > specifies one example, which is that "ß" becomes "SS" when folded to > upper case. INITCAP() is not in the SQL spec. I'll be honest, even though this is primarily about CTYPE and not collation, I still need to keep re-reading the initial email slowly to let it sink in and better understand it... at least for me, it's complex to reason through. I'm trying to make sure I understand clearly what the user impact/change is that we're talking about: after a little bit of brainstorming and looking through the PG docs, I'm actually not seeing much more than these two things you've mentioned here: the set of regexp_* functions PG provides, and these three generic functions. That alone doesn't seem highly concerning. 
I haven't checked the source code for the regexp_* functions yet, but are these just passing through to an external library? Are we actually able to easily change the CTYPE provider for them? If nobody knows/replies then I'll find some time to look. One other thing that comes to mind: how does the parser do case folding for relation names? Is that using OS-provided libc as of today? Or did we code it to use ICU if that's the DB default? I'm guessing libc, and global catalogs probably need to be handled in a consistent manner, even across different encodings. (Kind of related... did you ever see the demo where I create a user named '' and then I try to connect to a database with non-unicode encoding? ...at least it seems to be able to walk the index without decoding strings to find other users - but the way these global catalogs work scares me a little bit) -Jeremy -- http://about.me/jeremy_schneider
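The "Standard" vs. "POSIX Compatible" [[:punct:]] distinction quoted above comes down to Unicode General Category: ! is Po while + is Sm, and the POSIX-compatible variant of TR #18 folds symbols into punctuation. Python's unicodedata module exposes the same UCD category data, so a rough sketch (my own illustration, not the regex engine's code):

```python
import unicodedata

# General Category drives the regex character classes in TR #18.
assert unicodedata.category("!") == "Po"  # punctuation proper
assert unicodedata.category("+") == "Sm"  # a math symbol, not punctuation

def is_punct(ch: str, posix_compatible: bool) -> bool:
    # Standard variant: only the P* categories count as punctuation.
    # POSIX-compatible variant: the S* (symbol) categories count too.
    cat = unicodedata.category(ch)
    return cat.startswith("P") or (posix_compatible and cat.startswith("S"))

assert is_punct("+", posix_compatible=True)
assert not is_punct("+", posix_compatible=False)
```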
Re: Built-in CTYPE provider
On Wed, 2023-12-20 at 14:24 -0500, Robert Haas wrote: > This makes sense to me, too, but it feels like it might work out > better for speakers of English than for speakers of other languages. There's very little in the way of locale-specific tailoring for ctype behaviors in ICU or glibc -- only for the 'az', 'el', 'lt', and 'tr' locales. While English speakers like us may benefit from being aligned with the default ctype behaviors, those behaviors are not at all specific to 'en' locales in ICU or glibc. Collation varies a lot more between locales. I wouldn't call memcmp ideal for English ('Zebra' comes before 'apple', which seems wrong to me). If memcmp sorting does favor any particular group, I would say it favors programmers more than English speakers. But that could just be my perspective and I certainly understand the point that memcmp ordering is more tolerable for some languages than others. > Right now, I tend to get databases that default to en_US.utf8, and if > the default changed to C.utf8, then the case-comparison behavior > might > be different. en_US.UTF-8 and C.UTF-8 have the same ctype behavior. > For > someone who is currently defaulting to es_ES.utf8 or fr_FR.utf8, a > change to C.utf8 would be a much bigger problem, I would think. Those locales all have the same ctype behavior. It turns out that en_US.UTF-8 and fr_FR.UTF-8 also have the same collation order -- no tailoring beyond root collation according to CLDR files for 'en' and 'fr' (though note that 'fr_CA' does have tailoring). That doesn't mean the experience of switching to memcmp order is exactly the same for a French speaker and an English speaker, but I think it's interesting.
This thread is about adding a builtin provider with builtin ctype, which I believe a lot of users would like. It also might be the best chance we have to get to a reasonable default behavior at some point in the future. It would always be available, fast, and stable, with better semantics than "C" for many locales, and we can document it. In any case, we don't need to decide that now. If the builtin provider is useful, we should do it.

Regards, Jeff Davis
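The memcmp ordering discussed above can be reproduced with a bytewise sort key (a Python sketch, standing in for a C memcmp comparison on UTF-8 strings):

```python
# memcmp-style ordering: compare the raw UTF-8 bytes.  All uppercase
# ASCII letters (0x41..0x5A) sort before all lowercase ones
# (0x61..0x7A), so 'Zebra' precedes 'apple'.
words = ["apple", "Zebra", "banana"]
print(sorted(words, key=lambda s: s.encode("utf-8")))
# ['Zebra', 'apple', 'banana']
```

This is the "tolerable but not linguistic" behavior being traded for stability and speed.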
Re: Built-in CTYPE provider
On Wed, Dec 20, 2023 at 2:13 PM Jeff Davis wrote:
> On Wed, 2023-12-20 at 13:49 +0100, Daniel Verite wrote:
> > If the Postgres default was bytewise sorting+locale-agnostic
> > ctype functions directly derived from Unicode data files,
> > as opposed to libc/$LANG at initdb time, the main
> > annoyance would be that "ORDER BY textcol" would no
> > longer be the human-favored sort.
> > For the presentation layer, we would have to write for instance
> > ORDER BY textcol COLLATE "unicode" for the root collation
> > or a specific region-country if needed.
> > But all the rest seems better, especially cross-OS compatibility,
> > truly immutable and faster indexes for fields that
> > don't require linguistic ordering, alignment between Unicode
> > updates and Postgres updates.
>
> Thank you, that summarizes exactly the compromise that I'm trying to
> reach.

This makes sense to me, too, but it feels like it might work out better for speakers of English than for speakers of other languages. Right now, I tend to get databases that default to en_US.utf8, and if the default changed to C.utf8, then the case-comparison behavior might be different but the letters would still sort in the right order. For someone who is currently defaulting to es_ES.utf8 or fr_FR.utf8, a change to C.utf8 would be a much bigger problem, I would think. Their alphabet isn't in code point order, and so things would be alphabetized wrongly. That might be OK if they don't care about ordering for any purpose other than equality lookups, but otherwise it's going to force them to change the default, where today they don't have to do that.

--
Robert Haas
EDB: http://www.enterprisedb.com
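The concern about alphabets that aren't in code point order can be illustrated with the classic French accent example (a Python sketch; plain sorted() compares code points, which matches what a memcmp/C-collation index would do):

```python
# Code point order vs. French dictionary order.  French expects
# cote < côte < coté < côté (accent differences weighted from the
# end of the word), but code point order compares 'o' (U+006F)
# against 'ô' (U+00F4) before it ever looks at the final letter:
words = ["côté", "cote", "coté", "côte"]
print(sorted(words))   # ['cote', 'coté', 'côte', 'côté']
# Getting the linguistic order back requires a real collation
# (e.g. ICU, or a COLLATE clause in the database).
```

So with a memcmp default, a French speaker would need an explicit COLLATE to get dictionary order, exactly the trade-off under discussion.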
Re: Built-in CTYPE provider
On Wed, 2023-12-20 at 13:49 +0100, Daniel Verite wrote:
> If the Postgres default was bytewise sorting+locale-agnostic
> ctype functions directly derived from Unicode data files,
> as opposed to libc/$LANG at initdb time, the main
> annoyance would be that "ORDER BY textcol" would no
> longer be the human-favored sort.
> For the presentation layer, we would have to write for instance
> ORDER BY textcol COLLATE "unicode" for the root collation
> or a specific region-country if needed.
> But all the rest seems better, especially cross-OS compatibility,
> truly immutable and faster indexes for fields that
> don't require linguistic ordering, alignment between Unicode
> updates and Postgres updates.

Thank you, that summarizes exactly the compromise that I'm trying to reach.

Regards, Jeff Davis
Re: Built-in CTYPE provider
Jeff Davis wrote:
> But there are a lot of users for whom neither of those things are
> true, and it makes zero sense to order all of the text indexes in the
> database according to any one particular locale. I think these users
> would prioritize stability and performance for the database
> collation, and then use COLLATE clauses with ICU collations where
> necessary.

+1

> I am also still concerned that we have the wrong defaults. Almost
> nobody thinks libc is a great provider, but that's the default, and
> there were problems trying to change that default to ICU in 16. If we
> had a builtin provider, that might be a better basis for a default
> (safe, fast, always available, and documentable). Then, at least if
> someone picks a different locale at initdb time, they would be doing
> so intentionally, rather than implicitly accepting index corruption
> risks based on an environment variable.

Yes. The introduction of the bytewise-sorting, locale-agnostic C.UTF-8 in glibc is also a step in the direction of providing better defaults for apps like Postgres, that need both long-term stability in sorts and Unicode coverage for ctype-dependent functions. But C.UTF-8 is not available everywhere, and there's still the problem that Unicode updates through libc are not aligned with Postgres releases.

ICU has the advantage of cross-OS compatibility, but it does not provide any collation with bytewise sorting like C or C.UTF-8, and we don't allow a combination like "C" for sorting and ICU for ctype operations. When opting for a locale provider, it has to be for both sorting and ctype, so an installation that needs cross-OS compatibility, good Unicode support and long-term stability of indexes cannot get that with ICU as we expose it today.
If the Postgres default was bytewise sorting+locale-agnostic ctype functions directly derived from Unicode data files, as opposed to libc/$LANG at initdb time, the main annoyance would be that "ORDER BY textcol" would no longer be the human-favored sort. For the presentation layer, we would have to write for instance ORDER BY textcol COLLATE "unicode" for the root collation or a specific region-country if needed.

But all the rest seems better, especially cross-OS compatibility, truly immutable and faster indexes for fields that don't require linguistic ordering, alignment between Unicode updates and Postgres updates.

Best regards,

--
Daniel Vérité
https://postgresql.verite.pro/
Twitter: @DanielVerite
Re: Built-in CTYPE provider
On Tue, 2023-12-19 at 15:59 -0500, Robert Haas wrote:
> FWIW, the idea that we're going to develop a built-in provider seems
> to be solid, for the reasons Jeff mentions: it can be stable, and
> under our control. But it seems like we might need built-in providers
> for everything rather than just CTYPE to get those advantages, and I
> fear we'll get sucked into needing a lot of tailoring rather than
> just being able to get by with one "vanilla" implementation.

For the database default collation, I suspect a lot of users would jump at the chance to have "vanilla" semantics. Tailoring is more important for individual collation objects than for the database-level collation.

There are reasons you might select a tailored database collation, like if the set of users accessing it are mostly from a single locale, or if the application connected to the database is expecting it in a certain form. But there are a lot of users for whom neither of those things are true, and it makes zero sense to order all of the text indexes in the database according to any one particular locale. I think these users would prioritize stability and performance for the database collation, and then use COLLATE clauses with ICU collations where necessary.

The question for me is how good the "vanilla" semantics need to be to be useful as a database-level collation. Most of the performance and stability problems come from collation, so it makes sense to me to provide a fast and stable memcmp collation paired with richer ctype semantics (as proposed here). Users who want something more probably want the Unicode "root" collation, which can be provided by ICU today.

I am also still concerned that we have the wrong defaults. Almost nobody thinks libc is a great provider, but that's the default, and there were problems trying to change that default to ICU in 16. If we had a builtin provider, that might be a better basis for a default (safe, fast, always available, and documentable).
Then, at least if someone picks a different locale at initdb time, they would be doing so intentionally, rather than implicitly accepting index corruption risks based on an environment variable. Regards, Jeff Davis
Re: Built-in CTYPE provider
On Mon, Dec 18, 2023 at 2:46 PM Jeff Davis wrote:
> The whole concept of "providers" is that they aren't consistent with
> each other. ICU, libc, and the builtin provider will all be based on
> different versions of Unicode. That's by design.
>
> The built-in provider will be a bit better in the sense that it's
> consistent with the normalization functions, and the other providers
> aren't.

FWIW, the idea that we're going to develop a built-in provider seems to be solid, for the reasons Jeff mentions: it can be stable, and under our control. But it seems like we might need built-in providers for everything rather than just CTYPE to get those advantages, and I fear we'll get sucked into needing a lot of tailoring rather than just being able to get by with one "vanilla" implementation.

--
Robert Haas
EDB: http://www.enterprisedb.com
Re: Built-in CTYPE provider
On Fri, 2023-12-15 at 16:30 -0800, Jeremy Schneider wrote:
> Looking closer, patches 3 and 4 look like an incremental extension of
> this earlier idea;

Yes, it's essentially the same thing extended to a few more files. I don't know if "incremental" is the right word though; this is a substantial extension of the idea.

> the perl scripts download data from unicode.org and
> we've specifically defined Unicode version 15.1 and the scripts turn
> the data tables inside-out into C data structures optimized for
> lookup. That C code is then checked in to the PostgreSQL source code
> files unicode_category.h and unicode_case_table.h - right?

Yes. The standard build process shouldn't be downloading files, so the static tables are checked in. Also, seeing the diffs of the static tables improves the visibility of changes in case there's some mistake or big surprise.

> Am I reading correctly that these two patches add C functions
> pg_u_prop_* and pg_u_is* (patch 3) and unicode_*case (patch 4) but we
> don't yet reference these functions anywhere? So this is just getting
> some plumbing in place?

Correct. Perhaps I should combine these into the builtin provider thread, but these are independently testable and reviewable.

> My prediction is that updating this built-in provider eventually
> won't be any different from ICU or glibc.

The built-in provider will have several advantages because it's tied to a PG major version:

* A physical replica can't have different semantics than the primary.
* Easier to document and test.
* Changes are more transparent and can be documented in the release notes, so that administrators can understand the risks and blast radius at pg_upgrade time.

> Later on down the road, from a user perspective, I think we should be
> careful about confusion where providers are used inconsistently.
> It's not great if one function follows built-in Unicode 15.1 rules
> but another function uses Unicode 13 rules because it happened to
> call an ICU function or a glibc function. We could easily end up with
> multiple providers processing different parts of a single SQL
> statement, which could lead to strange results in some cases.

The whole concept of "providers" is that they aren't consistent with each other. ICU, libc, and the builtin provider will all be based on different versions of Unicode. That's by design.

The built-in provider will be a bit better in the sense that it's consistent with the normalization functions, and the other providers aren't.

Regards, Jeff Davis
Re: Built-in CTYPE provider
On 12/13/23 5:28 AM, Jeff Davis wrote:
> On Tue, 2023-12-12 at 13:14 -0800, Jeremy Schneider wrote:
>> My biggest concern is around maintenance. Every year Unicode is
>> assigning new characters to existing code points, and those existing
>> code points can of course already be stored in old databases before
>> libs are updated.
>
> Is the concern only about unassigned code points?
>
> I already committed a function "unicode_assigned()" to test whether a
> string contains only assigned code points, which can be used in a
> CHECK() constraint. I also posted[5] an idea about a per-database
> option that could reject the storage of any unassigned code point,
> which would make it easier for users highly concerned about
> compatibility.

I didn't know about this. Did a few smoke tests against today's head on git and it's nice to see the function working as expected. :)

test=# select unicode_version();
 unicode_version
-----------------
 15.1

test=# select chr(3212), unicode_assigned(chr(3212));
 chr | unicode_assigned
-----+------------------
 ಌ   | t

-- unassigned code point inside an assigned block
test=# select chr(3213), unicode_assigned(chr(3213));
 chr | unicode_assigned
-----+------------------
     | f

test=# select chr(3214), unicode_assigned(chr(3214));
 chr | unicode_assigned
-----+------------------
 ಎ   | t

-- unassigned block
test=# select chr(67024), unicode_assigned(chr(67024));
 chr | unicode_assigned
-----+------------------
 א   | f

test=# select chr(67072), unicode_assigned(chr(67072));
 chr | unicode_assigned
-----+------------------
     | t

Looking closer, patches 3 and 4 look like an incremental extension of this earlier idea; the perl scripts download data from unicode.org and we've specifically defined Unicode version 15.1 and the scripts turn the data tables inside-out into C data structures optimized for lookup. That C code is then checked in to the PostgreSQL source code files unicode_category.h and unicode_case_table.h - right?
Am I reading correctly that these two patches add C functions pg_u_prop_* and pg_u_is* (patch 3) and unicode_*case (patch 4) but we don't yet reference these functions anywhere? So this is just getting some plumbing in place?

>> And we may end up with
>> something like the timezone database where we need to periodically
>> add a more current ruleset - albeit alongside as a new version in
>> this case.
>
> There's a build target "update-unicode" which is run to pull in new
> Unicode data files and parse them into static C arrays (we already do
> this for the Unicode normalization tables). So I agree that the
> tables should be updated but I don't understand why that's a problem.

I don't want to get stuck on this. I agree with the general approach of beginning to add a provider for locale functions inside the database. We have a while before Unicode 16 comes out. Plenty of time for bikeshedding.

My prediction is that updating this built-in provider eventually won't be any different from ICU or glibc. It depends a bit on how we specifically build on this plumbing - but when Unicode 16 comes out, I'll try to come up with a simple repro on a default DB config where changing the Unicode version causes corruption (it was pretty easy to demonstrate for ICU collation, if you knew where to look)... but I don't think that discussion should derail this commit, because for now we're just starting the process of getting Unicode 15.1 into the PostgreSQL code base. We can cross the "update" bridge when we come to it.

Later on down the road, from a user perspective, I think we should be careful about confusion where providers are used inconsistently. It's not great if one function follows built-in Unicode 15.1 rules but another function uses Unicode 13 rules because it happened to call an ICU function or a glibc function. We could easily end up with multiple providers processing different parts of a single SQL statement, which could lead to strange results in some cases.
Ideally a user just specifies a default provider for their database, and the rules for that version of Unicode are used as consistently as possible - unless a user explicitly overrides their choice in a table/column definition, query, etc. But it might take a little time and work to get to this point.

-Jeremy

--
http://about.me/jeremy_schneider
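As a rough cross-check of what unicode_assigned() tests, the same classification can be reproduced from Unicode's General Category data (a Python sketch using the interpreter's bundled tables, which may track a different Unicode version than PostgreSQL's; the helper name `assigned` is just for illustration):

```python
import unicodedata

def assigned(s: str) -> bool:
    # True if no code point in s is unassigned (General Category Cn).
    # Uses the Unicode version bundled with this Python build, which
    # may differ from the version PostgreSQL's tables are built from.
    return all(unicodedata.category(ch) != "Cn" for ch in s)

print(assigned(chr(0x0C8C)))   # True:  KANNADA LETTER VOCALIC L
print(assigned(chr(0x0C8D)))   # False: gap inside the Kannada block
print(assigned(chr(0x0C8E)))   # True:  KANNADA LETTER E
```

These are the same code points (3212-3214) exercised in the psql smoke test above.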
Re: Built-in CTYPE provider
On Wed, 2023-12-13 at 16:34 +0100, Daniel Verite wrote:
> In particular "el" (modern greek) has case mapping rules that
> ICU seems to implement, but "el" is missing from the list
> ("lt", "tr", and "az") you identified.

I compared with glibc el_GR.UTF-8 and el_CY.UTF-8 locales, and the ctype semantics match C.UTF-8 for all code points. glibc is not doing this additional tailoring for "el".

Therefore I believe the builtin CTYPE would be very useful for case mapping (both "simple" and "full") even without this additional tailoring. You are correct that ICU will still have some features that won't be supported by the builtin provider. Better word boundary semantics in INITCAP() are another advantage.

Regards, Jeff Davis
Re: Built-in CTYPE provider
On Wed, 2023-12-13 at 16:34 +0100, Daniel Verite wrote: > But there are CLDR mappings on top of that. I see, thank you. Would it still be called "full" case mapping to only use the mappings in SpecialCasing.txt? And would that be useful? Regards, Jeff Davis
Re: Built-in CTYPE provider
Jeff Davis wrote:
> While "full" case mapping sounds more complex, there are actually
> very few cases to consider and they are covered in another (small)
> data file. That data file covers ~100 code points that convert to
> multiple code points when the case changes (e.g. "ß" -> "SS"), 7
> code points that have context-sensitive mappings, and three locales
> which have special conversions ("lt", "tr", and "az") for a few code
> points.

But there are CLDR mappings on top of that. According to the Unicode FAQ, https://unicode.org/faq/casemap_charprop.html#5:

    Q: Does the default case mapping work for every language? What
    about the default case folding?
    [...]
    To make case mapping language sensitive, the Unicode Standard
    specifically allows implementations to tailor the mappings for
    each language, but does not provide the necessary data. The file
    SpecialCasing.txt is included in the Standard as a guide to a few
    of the more important individual character mappings needed for
    specific languages, notably the Greek script and the Turkic
    languages. However, for most language-specific mappings and
    tailoring, users should refer to CLDR and other resources.

In particular "el" (modern greek) has case mapping rules that ICU seems to implement, but "el" is missing from the list ("lt", "tr", and "az") you identified.

The CLDR case mappings seem to be found in https://github.com/unicode-org/cldr/tree/main/common/transforms in *-Lower.xml and *-Upper.xml

Best regards,

--
Daniel Vérité
https://postgresql.verite.pro/
Twitter: @DanielVerite
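The SpecialCasing.txt behaviors under discussion -- expansion and context sensitivity, before any CLDR tailoring -- can be observed in any implementation of the default full case mapping; for example, Python's str methods (illustrative only, not PostgreSQL code):

```python
# Default (untailored) full case mapping, per SpecialCasing.txt, as
# implemented by Python's str methods.

# Expansion: one code point can become several on case change.
print("ß".upper())              # SS
print("\ufb01".upper())         # FI (U+FB01 LATIN SMALL LIGATURE FI)

# Context sensitivity: capital sigma lowercases to final sigma at the
# end of a word, and to the ordinary sigma elsewhere.
print("\u0391\u03a3".lower())   # ας (ends in final sigma, U+03C2)
print("\u03a3".lower())         # σ  (U+03C3)
```

Everything beyond this, such as the Greek uppercasing that strips accents, lives in CLDR tailoring rather than SpecialCasing.txt.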
Re: Built-in CTYPE provider
On Tue, 2023-12-12 at 13:14 -0800, Jeremy Schneider wrote:
> My biggest concern is around maintenance. Every year Unicode is
> assigning new characters to existing code points, and those existing
> code points can of course already be stored in old databases before
> libs are updated.

Is the concern only about unassigned code points?

I already committed a function "unicode_assigned()" to test whether a string contains only assigned code points, which can be used in a CHECK() constraint. I also posted[5] an idea about a per-database option that could reject the storage of any unassigned code point, which would make it easier for users highly concerned about compatibility.

> And we may end up with
> something like the timezone database where we need to periodically
> add a more current ruleset - albeit alongside as a new version in
> this case.

There's a build target "update-unicode" which is run to pull in new Unicode data files and parse them into static C arrays (we already do this for the Unicode normalization tables). So I agree that the tables should be updated but I don't understand why that's a problem.

> If I'm reading the Unicode 15 update correctly, PostgreSQL regex
> expressions with [[:digit:]] will not correctly identify Kaktovik or
> Nag Mundari or Kawi digits without that update to character type
> specs.

Yeah, if we are behind in the Unicode version, then results won't be the most up-to-date. But ICU or libc could also be behind in the Unicode version.

> But let's remember that people like to build indexes on character
> classification functions like upper/lower, for case insensitive
> searching.

UPPER()/LOWER() are based on case mapping, not character classification. I intend to introduce a SQL-level CASEFOLD() function that would obey Unicode casefolding rules, which have very strong compatibility guarantees[6] (essentially, if you are only using assigned code points, you are fine).
> It's another case where the index will be corrupted if
> someone happened to store Latin Glottal vowels in their database and
> then we update libs to the latest character type rules.

I don't agree with this characterization at all.

(a) It's not "another case". Corruption of an index on LOWER() can happen today. My proposal makes the situation better, not worse.

(b) These aren't libraries; I am proposing built-in Unicode tables that only get updated in a new major PG version.

(c) It likely only affects a small number of indexes and it's easier for an administrator to guess which ones might be affected, making it easier to just rebuild those indexes.

(d) It's not a problem if you stick to assigned code points.

> So even with something as basic as character type, if we're going to
> do it right, we still need to either version it or definitively
> decide that we're not going to ever support newly added Unicode
> characters like Latin Glottals.

If, by "version it", you mean "update the data tables in new Postgres versions", then I agree. If you mean that one PG version would need to support many versions of Unicode, I don't agree.

Regards, Jeff Davis

[5] https://postgr.es/m/c5e9dac884332824e0797937518da0b8766c1238.ca...@j-davis.com
[6] https://www.unicode.org/policies/stability_policy.html#Case_Folding
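For reference, case folding -- the operation a CASEFOLD() function would expose -- is distinct from LOWER()/UPPER(), and its output is designed for caseless equality rather than display. A sketch of the difference using Python's casefold(), which implements Unicode default case folding (illustrative; not the proposed PostgreSQL function):

```python
# lower() and casefold() differ for characters like 'ß', which folds
# to 'ss'; folding is the right basis for caseless equality.
print("Straße".lower())        # straße
print("STRASSE".lower())       # strasse
print("Straße".lower() == "STRASSE".lower())        # False
print("Straße".casefold() == "STRASSE".casefold())  # True ('strasse')
```

The folding of assigned code points is covered by the Unicode stability policy cited in [6], which is what makes an index on such a function comparatively safe across Unicode updates.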
Re: Built-in CTYPE provider
On 12/5/23 3:46 PM, Jeff Davis wrote: > === Character Classification === > > Character classification is used for regexes, e.g. whether a character > is a member of the "[[:digit:]]" ("\d") or "[[:punct:]]" > class. Unicode defines what character properties map into these > classes in TR #18 [1], specifying both a "Standard" variant and a > "POSIX Compatible" variant. The main difference with the POSIX variant > is that symbols count as punctuation. > > === LOWER()/INITCAP()/UPPER() === > > The LOWER() and UPPER() functions are defined in the SQL spec with > surprising detail, relying on specific Unicode General Category > assignments. How to map characters seems to be left (implicitly) up to > Unicode. If the input string is normalized, the output string must be > normalized, too. Weirdly, there's no room in the SQL spec to localize > LOWER()/UPPER() at all to handle issues like [1]. Also, the standard > specifies one example, which is that "ß" becomes "SS" when folded to > upper case. INITCAP() is not in the SQL spec. > > === Questions === > > * Is a built-in ctype provider a reasonable direction for Postgres as > a project? > * Does it feel like it would be simpler or more complex than what > we're doing now? > * Do we want to just try to improve our ICU support instead? > * Do we want the built-in provider to be one thing, or have a few > options (e.g. "standard" or "posix" character classification; > "simple" or "full" case mapping)? Generally, I am in favor of this - I think we need to move in the direction of having an in-database option around unicode for PG users, given how easy it is for administrators to mis-manage dependencies. Especially when OS admins can be different from DB admins, and when nobody really understands risks of changing libs with in-place moves to new operating systems - except for like 4 of us on the mailing lists. My biggest concern is around maintenance. 
Every year Unicode is assigning new characters to existing code points, and those existing code points can of course already be stored in old databases before libs are updated. When users start to notice that regex [[:digit:]] or upper/lower functions aren't working correctly with characters in their DB, they'll probably come asking for fixes. And we may end up with something like the timezone database where we need to periodically add a more current ruleset - albeit alongside as a new version in this case.

Here are direct links to charts of newly assigned characters from the last few Unicode updates:

2022: https://www.unicode.org/charts/PDF/Unicode-15.0/
2021: https://www.unicode.org/charts/PDF/Unicode-14.0/
2020: https://www.unicode.org/charts/PDF/Unicode-13.0/
2019: https://www.unicode.org/charts/PDF/Unicode-12.0/

If I'm reading the Unicode 15 update correctly, PostgreSQL regex expressions with [[:digit:]] will not correctly identify Kaktovik or Nag Mundari or Kawi digits without that update to character type specs. If I'm reading the Unicode 12 update correctly, then upper/lower functions aren't going to work correctly on Latin Glottal A and I and U characters without that update to character type specs.

Overall I see a lot fewer Unicode updates involving upper/lower than I do with digits - especially since new scripts often involve their own numbering characters, which makes new digits more common. But let's remember that people like to build indexes on character classification functions like upper/lower, for case insensitive searching. It's another case where the index will be corrupted if someone happened to store Latin Glottal vowels in their database and then we update libs to the latest character type rules.

So even with something as basic as character type, if we're going to do it right, we still need to either version it or definitively decide that we're not going to ever support newly added Unicode characters like Latin Glottals.
-Jeremy -- http://about.me/jeremy_schneider
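The [[:digit:]] versioning concern above is easy to reproduce outside PostgreSQL: any regex engine's \d is pinned to whatever Unicode tables its runtime ships. A Python sketch (the Kawi check is printed rather than asserted because its result depends on the interpreter's Unicode version):

```python
import re
import unicodedata

# In Python's re module, \d matches any code point whose General
# Category is Nd, per the Unicode tables bundled with the interpreter.
print(bool(re.fullmatch(r"\d", "3")))                  # True
print(bool(re.fullmatch(r"\d", "\u0663")))             # True: ARABIC-INDIC DIGIT THREE
print(bool(re.fullmatch(r"\d", "\u0663", re.ASCII)))   # False: ASCII-only mode

# Whether the Kawi digits added in Unicode 15.0 (U+11F50..U+11F59)
# count as \d depends on the runtime's Unicode version:
kawi_zero = "\U00011F50"
print(unicodedata.unidata_version, bool(re.fullmatch(r"\d", kawi_zero)))
```

The same dependency applies to PostgreSQL regardless of provider; the question in this thread is only whose tables (libc's, ICU's, or built-in ones tied to the PG major version) answer the classification query.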