Re: Built-in CTYPE provider
On Fri, 2024-04-05 at 11:22 +1300, Thomas Munro wrote:
> Hi,
>
> +command_ok(
> +	[
> +		'initdb', '--no-sync',
> +		'--locale-provider=builtin', '-E UTF-8',
> +		'--builtin-locale=C.UTF-8', "$tempdir/data8"
> +	],
> +	'locale provider builtin with -E UTF-8 --builtin-locale=C.UTF-8');
...
> LC_COLLATE:  en_US
> LC_CTYPE:    en_US
> LC_MESSAGES: C
> LC_MONETARY: en_US
> LC_NUMERIC:  en_US
> LC_TIME:     en_US
> initdb: error: encoding mismatch
> initdb: detail: The encoding you selected (UTF8) and the encoding that
> the selected locale uses (LATIN1) do not match.

Thank you for the report. I fixed it in e2a2357671 by forcing the
environment locale to C, which is compatible with any encoding. The
test still forces the encoding to UTF-8 and the collation to the
builtin C.UTF-8.

In passing, I noticed some unrelated regression test failures when I
set LANG=tr_TR: tsearch, tsdict, json, and jsonb. There's an additional
failure in the updatable_views test when LANG=tr_TR.utf8. I haven't
looked into the details yet.

Regards,
	Jeff Davis

From a7bf58f27bcd60bf79c097c7f462fa183264e9cd Mon Sep 17 00:00:00 2001
From: Jeff Davis
Date: Thu, 4 Apr 2024 15:55:03 -0700
Subject: [PATCH v1] Fix test failure in 001_initdb.pl.
Reported-by: Thomas Munro
Discussion: https://postgr.es/m/CA+hUKGK-ZqV1njkG_=xcCqXh2fcMkz85FTMnhS2opm4ZerH=x...@mail.gmail.com
---
 src/bin/initdb/t/001_initdb.pl | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/src/bin/initdb/t/001_initdb.pl b/src/bin/initdb/t/001_initdb.pl
index c63d3206d9..b31dad2464 100644
--- a/src/bin/initdb/t/001_initdb.pl
+++ b/src/bin/initdb/t/001_initdb.pl
@@ -200,6 +200,7 @@ command_ok(
 	[
 		'initdb', '--no-sync',
 		'--locale-provider=builtin', '-E UTF-8',
+		'--lc-collate=C', '--lc-ctype=C',
 		'--builtin-locale=C.UTF-8', "$tempdir/data8"
 	],
 	'locale provider builtin with -E UTF-8 --builtin-locale=C.UTF-8');
@@ -208,6 +209,7 @@ command_fails(
 	[
 		'initdb', '--no-sync',
 		'--locale-provider=builtin', '-E SQL_ASCII',
+		'--lc-collate=C', '--lc-ctype=C',
 		'--builtin-locale=C.UTF-8', "$tempdir/data9"
 	],
 	'locale provider builtin with --builtin-locale=C.UTF-8 fails for SQL_ASCII'
-- 
2.34.1
Re: Built-in CTYPE provider
Hi,

+command_ok(
+	[
+		'initdb', '--no-sync',
+		'--locale-provider=builtin', '-E UTF-8',
+		'--builtin-locale=C.UTF-8', "$tempdir/data8"
+	],
+	'locale provider builtin with -E UTF-8 --builtin-locale=C.UTF-8');

This Sun animal recently turned on --enable-tap-tests, and that ↑ failed[1]:

# Running: initdb --no-sync --locale-provider=builtin -E UTF-8 --builtin-locale=C.UTF-8 /home/marcel/build-farm-15/buildroot/HEAD/pgsql.build/src/bin/initdb/tmp_check/tmp_test_XvK1/data8
The files belonging to this database system will be owned by user "marcel".
This user must also own the server process.

The database cluster will be initialized with this locale configuration:
  locale provider:   builtin
  default collation: C.UTF-8
  LC_COLLATE:  en_US
  LC_CTYPE:    en_US
  LC_MESSAGES: C
  LC_MONETARY: en_US
  LC_NUMERIC:  en_US
  LC_TIME:     en_US
initdb: error: encoding mismatch
initdb: detail: The encoding you selected (UTF8) and the encoding that the selected locale uses (LATIN1) do not match.  This would lead to misbehavior in various character string processing functions.
initdb: hint: Rerun initdb and either do not specify an encoding explicitly, or choose a matching combination.
[14:04:12.462](0.036s) not ok 28 - locale provider builtin with -E UTF-8 --builtin-locale=C.UTF-8

[1] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=margay&dt=2024-04-04%2011%3A42%3A40
Re: Built-in CTYPE provider
On 01.04.24 21:52, Jeff Davis wrote:
> On Tue, 2024-03-26 at 08:04 +0100, Peter Eisentraut wrote:
>> The patch set v27 is ok with me, modulo (a) discussion about initcap
>> semantics, and (b) what collation to assign to ucs_basic, which can
>> be revisited later.
>
> Attached v28.
>
> The remaining patches are for full case mapping and PG_UNICODE_FAST.
> I am fine waiting until July to get these remaining patches committed.
> That would give us time to sort out details like:
>
> * Get consensus that it's OK to change UCS_BASIC.
> * Figure out if we need a pg-specific locale and whether
>   PG_UNICODE_FAST is the right name.
> * Make sure that full case mapping interacts with regexes in a sane
>   way (probably it needs to just fall back to simple case mapping, but
>   perhaps that's worth a discussion).
> * Implement case folding.
> * Implement a more unicode-friendly TITLECASE() function, which could
>   offer a number of options that don't fit well with INITCAP().
> * Figure out if UPPER()/LOWER() should also have some of those
>   options.
>
> Thoughts?

Yeah, I think it's good to give some more time to work out these
things. The features committed for PG17 so far are solid, so it's a
good point to pause.
Re: Built-in CTYPE provider
On Tue, 2024-03-26 at 08:14 +0100, Peter Eisentraut wrote:
> Full vs. simple case mapping is more of a legacy compatibility
> question, in my mind. There is some expectation/precedent that C.UTF-8
> uses simple case mapping, but beyond that, I don't see a reason why
> someone would want to explicitly opt for simple case mapping, other
> than if they need length preservation or something, but if they need
> that, then they are going to be in a world of pain in Unicode anyway.

I mostly agree, though there are some other purposes for the simple
mapping:

* a substitute for case folding: lower() with simple case mapping will
  work better for that purpose than lower() with full case mapping
  (after we have casefold(), this won't be a problem)

* simple case mapping is conceptually simpler, and that's a benefit by
  itself in some situations -- maybe the 1:1 assumption exists other
  places in their application

>> There's also another reason to consider it an argument rather than a
>> collation property, which is that it might be dependent on some other
>> field in a row. I could imagine someone wanting to do:
>>
>>   SELECT
>>     UPPER(some_field,
>>           full => true,
>>           dotless_i => CASE other_field WHEN ...)
>>   FROM ...
>
> Can you index this usefully? It would only work if the user query
> matches exactly this pattern?

In that example, UPPER is used in the target list -- the WHERE clause
might be indexable. The UPPER is just used for display purposes, and
may depend on some locale settings stored in another table associated
with a particular user.

Regards,
	Jeff Davis
Re: Built-in CTYPE provider
On Tue, 2024-03-26 at 08:04 +0100, Peter Eisentraut wrote:
> The patch set v27 is ok with me, modulo (a) discussion about initcap
> semantics, and (b) what collation to assign to ucs_basic, which can be
> revisited later.

I held off on the refactoring patch for lc_{ctype|collate}_is_c().
There's an explicit "NB: pg_newlocale_from_collation is only supposed
to be called on non-C-equivalent locales" comment in DefineCollation().

What I'd like to do is make it possible to create valid pg_locale_t
objects out of C locales, which can be used anywhere a real locale can
be used. Callers can still check lc_{collate|ctype}_is_c() for various
reasons; but if they did call pg_newlocale_from_collation on a C locale
it would at least work for the pg_locale.h APIs. That would be a
slightly simpler and safer API, and make it easier to do the collation
version check consistently.

That's not very complicated, but it's a bit invasive and probably out
of scope for v17. It might be part of another change I had intended for
a while, which is to make NULL an invalid pg_locale_t, and use a
different representation to mean "use the server environment". That
would clean up a lot of checks for NULL.

For now, we'd still like to add the version number to the builtin
collations, so that leaves us with two options:

(a) Perform the version check in lc_{collate|ctype}_is_c(), which
duplicates some code and creates some inconsistency in how the version
is checked for different providers.

(b) Don't worry about it and just commit the version change in
v27-0001. The version check is already performed correctly on the
database without changes, even if the locale is "C". And there are
already three built-in "C" collations: "C", "POSIX", and UCS_BASIC; so
it's not clear why someone would create even more of them. And even if
they did, there would be no reason to give them a warning because we
haven't incremented the version, so there's no chance of a mismatch.

I'm inclined toward (b). Thoughts?
Regards, Jeff Davis
Re: Built-in CTYPE provider
On Wed, 2024-03-27 at 16:53 +0100, Daniel Verite wrote:
>  provider | isalpha | isdigit
> ----------+---------+---------
>  ICU      | f       | t
>  glibc    | t       | f
>  builtin  | f       | f

The "ICU" above is really the behavior of the Postgres ICU provider as
we implemented it; it's not something forced on us by ICU.

For the ICU provider, pg_wc_isalpha() is defined as u_isalpha()[1] and
pg_wc_isdigit() is defined as u_isdigit()[2]. Those, in turn, are
defined by ICU to be equivalent to java.lang.Character.isLetter() and
java.lang.Character.isDigit().

ICU documents[3] how regex character classes should be implemented
using the ICU APIs, and cites Unicode TR#18 [4] as the source. Despite
being under the heading "...for C/POSIX character classes...", [3] says
it's based on the "Standard" variant of [4], rather than "POSIX
Compatible".

(Aside: the Postgres ICU provider doesn't match what [3] suggests for
the "alpha" class. For the character U+FF11 it doesn't matter, but I
suspect there are differences for other characters. This should be
fixed.)

The differences between PG_C_UTF8 and what ICU suggests are just
because the former uses the "POSIX Compatible" definitions and the
latter uses "Standard". I implemented both the "Standard" and "POSIX
Compatible" compatibility properties in ad49994538, so it would be easy
to change what PG_C_UTF8 uses.

[1] https://unicode-org.github.io/icu-docs/apidoc/dev/icu4c/uchar_8h.html#aecff8611dfb1814d1770350378b3b283
[2] https://unicode-org.github.io/icu-docs/apidoc/dev/icu4c/uchar_8h.html#a42b37828d86daa0fed18b381130ce1e6
[3] https://unicode-org.github.io/icu-docs/apidoc/dev/icu4c/uchar_8h.html#details
[4] http://www.unicode.org/reports/tr18/#Compatibility_Properties

> Are we fine with pg_c_utf8 differing from both ICU's point of view
> (U+ff11 is digit and not alpha) and glibc point of view (U+ff11 is not
> digit, but it's alpha)?

Yes, some differences are to be expected.
But I'm fine making a change to PG_C_UTF8 if it makes sense, as long as we can point to something other than "glibc version 2.35 does it this way". Regards, Jeff Davis
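(Outside the thread, the "Standard" classification that ICU follows is easy to reproduce with Python's built-in Unicode tables; this is only an illustrative sketch, not PostgreSQL code, showing why U+FF11 lands in different classes depending on which property set a provider consults.)

```python
import unicodedata

# U+FF11 FULLWIDTH DIGIT ONE has general category "Nd" (decimal number),
# so providers following Unicode's "Standard" compatibility properties
# treat it as a digit, while POSIX-compatible classifications may not.
ch = "\uff11"
print(unicodedata.name(ch))      # FULLWIDTH DIGIT ONE
print(unicodedata.category(ch))  # Nd

# Python's str methods follow the Unicode character properties:
print(ch.isdigit())              # True
print(ch.isalpha())              # False
```

This matches the "ICU" column of the table quoted above: digit but not alpha.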
Re: Built-in CTYPE provider
	Jeff Davis wrote:

> The tests include initcap('123abc') which is '123abc' in the PG_C_UTF8
> collation vs '123Abc' in PG_UNICODE_FAST.
>
> The reason for the latter behavior is that the Unicode Default Case
> Conversion algorithm for toTitlecase() advances to the next Cased
> character before mapping to titlecase, and digits are not Cased. ICU
> has a configurable adjustment, and defaults in a way that produces
> '123abc'.

Even aside from ICU, there's a different behavior between glibc and
pg_c_utf8 for codepoints in the decimal digit category outside of the
US-ASCII range '0'..'9':

  select initcap(concat(chr(0xff11), 'a') collate "C.utf8"); -- glibc 2.35
   initcap
  ---------
   1a

  select initcap(concat(chr(0xff11), 'a') collate "pg_c_utf8");
   initcap
  ---------
   1A

Both collations consider that chr(0xff11) is not a digit
(isdigit()=>false) but C.utf8 says that it's alpha, whereas pg_c_utf8
says it's neither digit nor alpha.

AFAIU this is why in the above initcap() call, pg_c_utf8 considers that
'a' is the first alphanumeric, whereas C.utf8 considers that '1' is the
first alphanumeric, leading to different capitalizations.

Comparing the 3 providers:

  WITH v(provider,type,result) AS (values
   ('ICU', 'isalpha', chr(0xff11) ~ '[[:alpha:]]' collate "unicode"),
   ('glibc', 'isalpha', chr(0xff11) ~ '[[:alpha:]]' collate "C.utf8"),
   ('builtin', 'isalpha', chr(0xff11) ~ '[[:alpha:]]' collate "pg_c_utf8"),
   ('ICU', 'isdigit', chr(0xff11) ~ '[[:digit:]]' collate "unicode"),
   ('glibc', 'isdigit', chr(0xff11) ~ '[[:digit:]]' collate "C.utf8"),
   ('builtin', 'isdigit', chr(0xff11) ~ '[[:digit:]]' collate "pg_c_utf8")
  )
  select * from v \crosstabview

   provider | isalpha | isdigit
  ----------+---------+---------
   ICU      | f       | t
   glibc    | t       | f
   builtin  | f       | f

Are we fine with pg_c_utf8 differing from both ICU's point of view
(U+ff11 is digit and not alpha) and glibc point of view (U+ff11 is not
digit, but it's alpha)?

Aside from initcap(), this is going to be significant for regular
expressions.
Best regards,
-- 
Daniel Vérité
https://postgresql.verite.pro/
Twitter: @DanielVerite
Re: Built-in CTYPE provider
On 25.03.24 18:52, Jeff Davis wrote:
> OK, I'll propose a "title" or "titlecase" function for 18, along with
> "casefold" (which I was already planning to propose).

(Yay, casefold will be useful.)

> What do you think about UPPER/LOWER and full case mapping? Should
> there be extra arguments for full vs simple case mapping, or should it
> come from the collation?
>
> It makes sense that the "dotted vs dotless i" behavior comes from the
> collation because that depends on locale. But full-vs-simple case
> mapping is not really a locale question. For instance:
>
>   select lower('0Σ' collate "en-US-x-icu") AS lower_sigma,
>          lower('ΑΣ' collate "en-US-x-icu") AS lower_final_sigma,
>          upper('ß' collate "en-US-x-icu") AS upper_eszett;
>    lower_sigma | lower_final_sigma | upper_eszett
>   -------------+-------------------+--------------
>    0σ          | ας                | SS
>
> produces the same results for any ICU collation.

I think of a collation describing what language a text is in. So it
makes sense that "dotless i" depends on the locale/collation.

Full vs. simple case mapping is more of a legacy compatibility
question, in my mind. There is some expectation/precedent that C.UTF-8
uses simple case mapping, but beyond that, I don't see a reason why
someone would want to explicitly opt for simple case mapping, other
than if they need length preservation or something, but if they need
that, then they are going to be in a world of pain in Unicode anyway.

> There's also another reason to consider it an argument rather than a
> collation property, which is that it might be dependent on some other
> field in a row. I could imagine someone wanting to do:
>
>   SELECT
>     UPPER(some_field,
>           full => true,
>           dotless_i => CASE other_field WHEN ...)
>   FROM ...

Can you index this usefully? It would only work if the user query
matches exactly this pattern?

> That makes sense for a function in the target list, because different
> customers might be from different locales and therefore want different
> treatment of the dotted-vs-dotless-i.
There is also the concept of a session collation, which we haven't implemented, but it would address this kind of use. But there again the problem is indexing. But maybe indexing isn't as important for case conversion as it is for sorting.
Re: Built-in CTYPE provider
On 21.03.24 01:13, Jeff Davis wrote:
> The v26 patch was not quite complete, so I didn't commit it yet.
>
> Attached v27-0001 and 0002. 0002 is necessary because otherwise
> lc_collate_is_c() short-circuits the version check in
> pg_newlocale_from_collation(). With 0002, the code is simpler and all
> paths go through pg_newlocale_from_collation(), and the version check
> happens even when lc_collate_is_c(). But perhaps there was a reason
> the code was the way it was, so submitting for review in case I missed
> something.
>
>> 0005 and 0006 don't contain any test cases. So I guess they are
>> really only usable via 0007. Is that understanding correct?
>
> 0005 is not a functional change, it's just a refactoring to use a
> callback, which is preparation for 0007.
>
>> Are there any test cases that illustrate the word boundary changes in
>> patch 0005? It might be useful to test those against Oracle as well.
>
> The tests include initcap('123abc') which is '123abc' in the PG_C_UTF8
> collation vs '123Abc' in PG_UNICODE_FAST.
>
> The reason for the latter behavior is that the Unicode Default Case
> Conversion algorithm for toTitlecase() advances to the next Cased
> character before mapping to titlecase, and digits are not Cased. ICU
> has a configurable adjustment, and defaults in a way that produces
> '123abc'.
>
> New rebased series attached.

The patch set v27 is ok with me, modulo (a) discussion about initcap
semantics, and (b) what collation to assign to ucs_basic, which can be
revisited later.
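(The toTitlecase() behavior described above — skip ahead to the next Cased character, so a leading digit doesn't block capitalization of the following letter — shows up in any implementation of the Unicode Default Case Conversion. As an outside illustration, not PostgreSQL code, Python's str.title() behaves like PG_UNICODE_FAST here:)

```python
# Unicode titlecasing maps the first *cased* character of each word to
# titlecase; digits are not Cased, so they are skipped over.
print("123abc".title())  # 123Abc  (like PG_UNICODE_FAST)

# A plain "uppercase the first character" rule leaves the string
# unchanged, which matches PG_C_UTF8 and the ICU default adjustment:
s = "123abc"
print(s[:1].upper() + s[1:])  # 123abc
```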
Re: Built-in CTYPE provider
On Mon, 2024-03-25 at 08:29 +0100, Peter Eisentraut wrote:
> Right. I thought when you said there is an ICU configuration for it,
> that it might be like collation options that you specify in the locale
> string. But it appears it is only an internal API setting. So that, in
> my mind, reinforces the opinion that we should leave initcap() as is
> and make a new function that exposes the new functionality. (This does
> not have to be part of this patch set.)

OK, I'll propose a "title" or "titlecase" function for 18, along with
"casefold" (which I was already planning to propose).

What do you think about UPPER/LOWER and full case mapping? Should there
be extra arguments for full vs simple case mapping, or should it come
from the collation?

It makes sense that the "dotted vs dotless i" behavior comes from the
collation because that depends on locale. But full-vs-simple case
mapping is not really a locale question. For instance:

  select lower('0Σ' collate "en-US-x-icu") AS lower_sigma,
         lower('ΑΣ' collate "en-US-x-icu") AS lower_final_sigma,
         upper('ß' collate "en-US-x-icu") AS upper_eszett;
   lower_sigma | lower_final_sigma | upper_eszett
  -------------+-------------------+--------------
   0σ          | ας                | SS

produces the same results for any ICU collation.

There's also another reason to consider it an argument rather than a
collation property, which is that it might be dependent on some other
field in a row. I could imagine someone wanting to do:

  SELECT
    UPPER(some_field,
          full => true,
          dotless_i => CASE other_field WHEN ...)
  FROM ...

That makes sense for a function in the target list, because different
customers might be from different locales and therefore want different
treatment of the dotted-vs-dotless-i.

Thoughts? Should we use the collation by default but then allow
parameters to override? Or should we just consider this a new set of
functions? (All of this is v18 material, of course.)

Regards,
	Jeff Davis
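(For reference, the full mappings in the example above are not ICU-specific; Python's str.upper()/str.lower() also implement the full, context-sensitive mappings. This is just an outside illustration, not the proposed PostgreSQL API:)

```python
# Full case mapping can change string length and is context-sensitive:
print("ß".upper())    # SS  (one code point becomes two)
print("ΑΣ".lower())   # ας  (word-final sigma)
print("0Σ".lower())   # 0σ  (sigma not preceded by a cased letter)

# Case folding -- the casefold() proposed in the thread -- is yet
# another mapping, designed for caseless matching:
print("ß".casefold()) # ss
```

A simple (1:1) mapping, by contrast, would leave 'ß' alone and always map Σ to σ, which is exactly the length-preserving behavior discussed for C.UTF-8.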
Re: Built-in CTYPE provider
On 22.03.24 18:26, Jeff Davis wrote:
> On Fri, 2024-03-22 at 15:51 +0100, Peter Eisentraut wrote:
>> I think this might be too big of a compatibility break. So far,
>> initcap('123abc') has always returned '123abc'. If the new collation
>> returns '123Abc' now, then that's quite a change. These are not some
>> obscure Unicode special case characters, after all.
>
> It's a new collation, so I'm not sure it's a compatibility break. But
> you are right that it is against documentation and expectations for
> INITCAP().
>
>> What is the ICU configuration incantation for this? Maybe we could
>> have the builtin provider understand some of that, too.
>
> https://unicode-org.github.io/icu-docs/apidoc/dev/icu4c/stringoptions_8h.html#a4975f537b9960f0330b233061ef0608d
> https://unicode-org.github.io/icu-docs/apidoc/dev/icu4c/stringoptions_8h.html#afc65fa226cac9b8eeef0e877b8a7744e
>
>> Or we should create a function separate from initcap.
>
> If we create a new function, that also gives us the opportunity to
> accept optional arguments to control the behavior rather than relying
> on collation for every decision.

Right. I thought when you said there is an ICU configuration for it,
that it might be like collation options that you specify in the locale
string. But it appears it is only an internal API setting. So that, in
my mind, reinforces the opinion that we should leave initcap() as is
and make a new function that exposes the new functionality. (This does
not have to be part of this patch set.)
Re: Built-in CTYPE provider
There is no technical content in this mail, but I'd like to show
appreciation for your work on this. I hope this will eventually remove
one of the great embarrassments when using PostgreSQL: the dependency
on operating system collations.

Yours,
Laurenz Albe
Re: Built-in CTYPE provider
On Sun, 2024-03-24 at 14:00 +0300, Alexander Lakhin wrote:
> Please look at a Valgrind-detected error caused by the following query
> (starting from f69319f2f):
> SELECT lower('Π' COLLATE pg_c_utf8);

Thank you for the report! Fixed in 503c0ad976.

Valgrind did not detect the problem in my setup, so I added a unit test
in case_test.c where it's easier to see the valgrind problem.

Regards,
	Jeff Davis
Re: Built-in CTYPE provider
Hello Jeff,

21.03.2024 03:13, Jeff Davis wrote:
> On Tue, 2024-03-19 at 13:41 +0100, Peter Eisentraut wrote:
>> * v25-0002-Support-C.UTF-8-locale-in-the-new-builtin-collat.patch
>>
>> Looks ok.
>
> Committed.

Please look at a Valgrind-detected error caused by the following query
(starting from f69319f2f):

SELECT lower('Π' COLLATE pg_c_utf8);

==00:00:00:03.487 1429669== Invalid read of size 1
==00:00:00:03.487 1429669==    at 0x7C64A5: convert_case (unicode_case.c:107)
==00:00:00:03.487 1429669==    by 0x7C: unicode_strlower (unicode_case.c:70)
==00:00:00:03.488 1429669==    by 0x66B218: str_tolower (formatting.c:1698)
==00:00:00:03.488 1429669==    by 0x6D6C55: lower (oracle_compat.c:55)

Best regards,
Alexander
Re: Built-in CTYPE provider
On Fri, 2024-03-22 at 15:51 +0100, Peter Eisentraut wrote:
> I think this might be too big of a compatibility break. So far,
> initcap('123abc') has always returned '123abc'. If the new collation
> returns '123Abc' now, then that's quite a change. These are not some
> obscure Unicode special case characters, after all.

It's a new collation, so I'm not sure it's a compatibility break. But
you are right that it is against documentation and expectations for
INITCAP().

> What is the ICU configuration incantation for this? Maybe we could
> have the builtin provider understand some of that, too.

https://unicode-org.github.io/icu-docs/apidoc/dev/icu4c/stringoptions_8h.html#a4975f537b9960f0330b233061ef0608d
https://unicode-org.github.io/icu-docs/apidoc/dev/icu4c/stringoptions_8h.html#afc65fa226cac9b8eeef0e877b8a7744e

> Or we should create a function separate from initcap.

If we create a new function, that also gives us the opportunity to
accept optional arguments to control the behavior rather than relying
on collation for every decision.

Regards,
	Jeff Davis
Re: Built-in CTYPE provider
On 21.03.24 01:13, Jeff Davis wrote:
>> Are there any test cases that illustrate the word boundary changes in
>> patch 0005? It might be useful to test those against Oracle as well.
>
> The tests include initcap('123abc') which is '123abc' in the PG_C_UTF8
> collation vs '123Abc' in PG_UNICODE_FAST.
>
> The reason for the latter behavior is that the Unicode Default Case
> Conversion algorithm for toTitlecase() advances to the next Cased
> character before mapping to titlecase, and digits are not Cased. ICU
> has a configurable adjustment, and defaults in a way that produces
> '123abc'.

I think this might be too big of a compatibility break. So far,
initcap('123abc') has always returned '123abc'. If the new collation
returns '123Abc' now, then that's quite a change. These are not some
obscure Unicode special case characters, after all.

What is the ICU configuration incantation for this? Maybe we could have
the builtin provider understand some of that, too.

Or we should create a function separate from initcap.
Re: Built-in CTYPE provider
* v25-0001-Address-more-review-comments-on-commit-2d819a08a.patch

This was committed.

* v25-0002-Support-C.UTF-8-locale-in-the-new-builtin-collat.patch

Looks ok.

* v25-0003-Inline-basic-UTF-8-functions.patch

ok

* v25-0004-Use-version-for-builtin-collations.patch

Not sure about the version format "1.0", which implies some sort of
major/minor or component-based system. I would just use "1".

* v25-0005-Add-unicode_strtitle-for-Unicode-Default-Case-Co.patch
* v25-0006-Support-Unicode-full-case-mapping-and-conversion.patch
* v25-0007-Support-PG_UNICODE_FAST-locale-in-the-builtin-co.patch

0005 and 0006 don't contain any test cases. So I guess they are really
only usable via 0007. Is that understanding correct?

Btw., tested initcap() on Oracle:

select initcap('džudo') from dual;

(which uses the precomposed U+01F3) and the result is DŽudo (with the
precomposed uppercase character). So that matches the behavior proposed
in your 0002 patch.

Are there any test cases that illustrate the word boundary changes in
patch 0005? It might be useful to test those against Oracle as well.
Re: Built-in CTYPE provider
Thomas Munro writes:
> On Tue, Mar 19, 2024 at 11:55 AM Tom Lane wrote:
>>>> This is causing all CI jobs to fail the "compiler warnings" check.
>>> I did run CI before checkin, and it passed:
> Maybe I misunderstood this exchange but ...
> Currently Windows warnings don't make any CI tasks fail ie turn red,
> which is why Jeff's run is all green in his personal github repo.
> ...
> But I did teach cfbot to do some extra digging through the logs,

Ah. What I should have said was "it's causing cfbot to complain about
every patch". Seems like the divergence in the pass criterion is not
such a great idea.

			regards, tom lane
Re: Built-in CTYPE provider
On Tue, Mar 19, 2024 at 11:55 AM Tom Lane wrote:
> Jeff Davis writes:
>> On Mon, 2024-03-18 at 18:04 -0400, Tom Lane wrote:
>>> This is causing all CI jobs to fail the "compiler warnings" check.
>
>> I did run CI before checkin, and it passed:
>> https://cirrus-ci.com/build/5382423490330624
>
> Weird, why did it not report with the same level of urgency?
> But anyway, thanks for fixing.

Maybe I misunderstood this exchange but ...

Currently Windows warnings don't make any CI tasks fail ie turn red,
which is why Jeff's run is all green in his personal github repo.
Unlike gcc and clang, and MinGW cross-build warnings which cause the
special "CompilerWarnings" CI task to fail (red). That task is running
on a Linux system so it can't use MSVC. The idea of keeping it separate
from the "main" Linux, FreeBSD, macOS tasks (which use gcc, clang,
clang respectively) was that it's nicer to try to run the actual tests
even if there is a pesky warning, so having it in a separate task gets
you that info without blocking other progress, and it also tries with
and without assertions (a category of warning hazard, eg unused
variables when assertions are off).

But I did teach cfbot to do some extra digging through the logs,
looking for various interesting patterns[1], including non-error
warnings, and if it finds anything interesting it shows a little
clickable ⚠ symbol on the front page.

If there is something like -Werror on MSVC we could turn that on for
the main Windows test, but that might also be a bit annoying. Perhaps
there is another way: we could have it compile and test everything,
allowing warnings, but also then grep the build log afterwards in a new
step that fails if any warnings were there? Then Jeff would have got a
failure in his personal CI run. Or something like that.

[1] https://github.com/macdice/cfbot/blob/master/cfbot_work_queue.py
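(The "grep the build log afterwards" step can be prototyped in a few lines. This is a hypothetical sketch — the patterns and function name are mine, not cfbot's actual code from cfbot_work_queue.py:)

```python
import re

# Patterns that indicate a non-fatal compiler warning slipped through.
# MSVC emits "warning C####"; gcc/clang emit "warning:".
WARNING_PATTERNS = [
    re.compile(r"warning C\d{4}"),  # MSVC
    re.compile(r"\bwarning:"),      # gcc/clang
]

def find_warnings(build_log: str) -> list[str]:
    """Return the log lines that look like compiler warnings."""
    return [
        line
        for line in build_log.splitlines()
        if any(p.search(line) for p in WARNING_PATTERNS)
    ]

log = """\
pg_locale.c
c:\\cirrus\\src\\backend\\utils\\adt\\pg_locale.c(2515) : warning C4715: 'builtin_locale_encoding': not all control paths return a value
linking ...
"""
for line in find_warnings(log):
    print(line)
```

A CI step built on this would simply exit non-zero when the list is non-empty, which is the "fails if any warnings were there" behavior suggested above.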
Re: Built-in CTYPE provider
Jeff Davis writes:
> On Mon, 2024-03-18 at 18:04 -0400, Tom Lane wrote:
>> This is causing all CI jobs to fail the "compiler warnings" check.

> I did run CI before checkin, and it passed:
> https://cirrus-ci.com/build/5382423490330624

Weird, why did it not report with the same level of urgency?
But anyway, thanks for fixing.

			regards, tom lane
Re: Built-in CTYPE provider
On Mon, 2024-03-18 at 18:04 -0400, Tom Lane wrote:
> This is causing all CI jobs to fail the "compiler warnings" check.

I did run CI before checkin, and it passed:

https://cirrus-ci.com/build/5382423490330624

If I open up the windows build, I see the warning:

https://cirrus-ci.com/task/5199979044667392

but I didn't happen to check this time.

> Probably the best fix is the traditional
>
>     return ;	/* keep compiler quiet */
>
> but I'm not sure what the best default result is in this function.

I inverted the check so that I didn't have to choose a default.

Regards,
	Jeff Davis
Re: Built-in CTYPE provider
Jeff Davis writes:
> It may be moot soon, but I committed a fix now.

Thanks, but it looks like 846311051 introduced a fresh issue. MSVC is
complaining about

[21:37:15.349] c:\cirrus\src\backend\utils\adt\pg_locale.c(2515) : warning C4715: 'builtin_locale_encoding': not all control paths return a value

This is causing all CI jobs to fail the "compiler warnings" check.

Probably the best fix is the traditional

	return ;	/* keep compiler quiet */

but I'm not sure what the best default result is in this function.

			regards, tom lane
Re: Built-in CTYPE provider
On Sun, 2024-03-17 at 17:46 -0400, Tom Lane wrote:
> Jeff Davis writes:
>> New series attached.
>
> Coverity thinks there's something wrong with builtin_validate_locale,
> and as far as I can tell it's right: the last ereport is unreachable,
> because required_encoding is never changed from its initial -1 value.
> It looks like there's a chunk of logic missing there, or else that the
> code could be simplified further.

Thank you, it was a bit of over-generalization in anticipation of
future patches. It may be moot soon, but I committed a fix now.

Regards,
	Jeff Davis
Re: Built-in CTYPE provider
Jeff Davis writes:
> New series attached.

Coverity thinks there's something wrong with builtin_validate_locale,
and as far as I can tell it's right: the last ereport is unreachable,
because required_encoding is never changed from its initial -1 value.
It looks like there's a chunk of logic missing there, or else that the
code could be simplified further.

/srv/coverity/git/pgsql-git/postgresql/src/backend/utils/adt/pg_locale.c: 2519 in builtin_validate_locale()
>>> CID 1594398: Control flow issues (DEADCODE)
>>> Execution cannot reach the expression "encoding != required_encoding"
>>> inside this statement: "if (required_encoding >= 0 ...".
2519	if (required_encoding >= 0 && encoding != required_encoding)
2520		ereport(ERROR,
2521			(errcode(ERRCODE_WRONG_OBJECT_TYPE),
2522			 errmsg("encoding \"%s\" does not match locale \"%s\"",
2523				pg_encoding_to_char(encoding), locale)));

			regards, tom lane
Re: Built-in CTYPE provider
On Thu, 2024-03-14 at 15:38 +0100, Peter Eisentraut wrote:
> On 14.03.24 09:08, Jeff Davis wrote:
>> 0001 (the C.UTF-8 locale) is also close...
>
> I have tested this against the libc locale C.utf8 that was available
> on the OS, and the behavior is consistent.

That was the goal, in spirit. But to clarify: it's not guaranteed that
the built-in C.UTF-8 is always the same as the libc UTF-8, because
different implementations do different things. For instance, I saw
significant differences on MacOS.

> I wonder if we should version the builtin locales too. We might make a
> mistake and want to change something sometime?

I'm fine with that, see v25-0004 in the reply to your other mail. The
version only tracks sort order, and all of the builtin locales sort
based on memcmp(). But it's possible there are bugs in the
optimizations around memcmp() (e.g. abbreviated keys, or some future
optimization).

> Tiny comments:
>
> * src/bin/scripts/t/020_createdb.pl
>
> The two added tests should have different names that tell them apart
> (like the new initdb tests).
>
> * src/include/catalog/pg_collation.dat

Done in v25-0002 (in reply to your other mail).

Regards,
	Jeff Davis
Re: Built-in CTYPE provider
On 14.03.24 09:08, Jeff Davis wrote:
> 0001 (the C.UTF-8 locale) is also close. Considering that most of the
> infrastructure is already in place, that's not a large patch. You may
> have some comments about the way I'm canonicalizing and validating in
> initdb -- that could be cleaner, but it feels like I should refactor
> the surrounding code separately first.

I have tested this against the libc locale C.utf8 that was available on
the OS, and the behavior is consistent.

I wonder if we should version the builtin locales too. We might make a
mistake and want to change something sometime?

Tiny comments:

* src/bin/scripts/t/020_createdb.pl

The two added tests should have different names that tell them apart
(like the new initdb tests).

* src/include/catalog/pg_collation.dat

Maybe use 'and' instead of '&' in the description.

> 0002 (inlining utf8 functions) is also ready.

Seems ok.

> For 0003 and beyond, I'd like some validation that it's what you had
> in mind.

I'll look into those later.
Re: Built-in CTYPE provider
On 14.03.24 09:08, Jeff Davis wrote:
> On Wed, 2024-03-13 at 00:44 -0700, Jeff Davis wrote:
> > New series attached. I plan to commit 0001 very soon.
>
> Committed the basic builtin provider, supporting only the "C" locale.

As you were committing this, I had another review of v23-0001-Introduce-collation-provider-builtin.patch in progress. Some of the things I found you have already addressed in what you committed. Please check the remaining comments.

* doc/src/sgml/charset.sgml

I don't understand the purpose of this sentence: "When using this locale, the behavior may depend on the database encoding."

* doc/src/sgml/ref/create_database.sgml

The new parameter builtin_locale is not documented.

* src/backend/commands/collationcmds.c

I think DefineCollation() should set collencoding = -1 for the COLLPROVIDER_BUILTIN case. -1 stands for any encoding. Or at least explain why not?

* src/backend/utils/adt/pg_locale.c

This part is a bit confusing:

+ cache_entry->collate_is_c = true;
+ cache_entry->ctype_is_c = (strcmp(colllocale, "C") == 0);

Is collate always C but ctype only sometimes? Does this anticipate future patches in this series? Maybe in this patch it should always be true?

* src/bin/initdb/initdb.c

+ printf(_("      --builtin-locale=LOCALE   set builtin locale name for new databases\n"));

Put in a line break so that the right "column" lines up.

This output should line up better:

  The database cluster will be initialized with this locale configuration:
  default collation provider:  icu
  default collation locale:en
  LC_COLLATE: C
  LC_CTYPE:C
  ...

Also, why are there two spaces after "provider: "? Also, we call these locale providers on input, so why are they collation providers on output? What is a "collation locale"?

* src/bin/pg_upgrade/t/002_pg_upgrade.pl

+if ($oldnode->pg_version >= '17devel')

This is weird. >= is a numeric comparison, so providing a string with non-digits is misleading at best.
* src/test/icu/t/010_database.pl

-# Test that LOCALE works for ICU locales if LC_COLLATE and LC_CTYPE
-# are specified

Why remove this test?

+my ($ret, $stdout, $stderr) = $node1->psql('postgres',
+	q{CREATE DATABASE dbicu LOCALE_PROVIDER builtin LOCALE 'C' TEMPLATE dbicu}
+);

Change the name of the new database to be different from the name of the template database.
Re: Built-in CTYPE provider
On 08.03.24 02:00, Jeff Davis wrote:
> And here's v22 (I didn't post v21). I committed Unicode property tables and functions, and the simple case mapping. I separated out the full case mapping changes (based on SpecialCasing.txt) into patch 0006.

0002: Basic builtin collation provider that only supports "C".

Overall, this patch looks sound.

In the documentation, let's make the list of locale providers an actual list instead of a sequence of <para>s.

We had some discussion on the initdb option --builtin-locale and whether it should be something more general. I'm ok with leaving it like this for now and maybe considering it as an "open item" for PG17.

In errmsg("parameter \"locale\" must be specified"), make "locale" a placeholder. (See commit 36a14afc076.)

It seems the builtin provider accepts both "C" and "POSIX" as locale names, but the documentation says it must be "C". Maybe we don't need to accept "POSIX"? (Seeing that there are no plans for "POSIX.UTF-8", maybe we just ignore the "POSIX" spelling altogether?)

Speaking of which, the code in postinit.c is inconsistent in that respect with builtin_validate_locale(). Shouldn't postinit.c use builtin_validate_locale(), to keep it consistent? Or, there could be a general function that accepts a locale provider and a locale string and validates everything together?

In initdb.c, this message

  printf(_("The database cluster will be initialized with no locale.\n"));

sounds a bit confusing. I think it's ok to show "C" as a locale. I'm not sure we need to change the logic here.

Also in initdb.c, this message

  pg_fatal("locale must be specified unless provider is libc");

should be flipped around, like

  locale must be specified if provider is %s

In pg_dump.c, dumpDatabase(), there are some new warning messages that are not specifically about the builtin provider. Are those existing deficiencies? It's not clear to me.

What are the changes in the pg_upgrade test about? Maybe explain the scenario it is trying to test briefly?
0004: Inline some UTF-8 functions to improve performance

Makes sense that inlining can be effective here. But why aren't you just inlining the existing function pg_utf_mblen()? Now we have two functions that do the same thing. And the comment at pg_utf_mblen() is removed completely, so it's not clear anymore why it exists.

0005: Add a unicode_strtitle() function and move the implementation for the builtin provider out of formatting.c.

In the recent discussion you had expressed some uncertainty about the detailed semantics of this. INITCAP() was copied from Oracle, so we could check there for reference, too. Or we go with full Unicode semantics. I'm not clear on all the differences and tradeoffs, if there are any. In any case, it would be good if there were documentation or a comment that somehow wrote down the resolution of this.
Re: Built-in CTYPE provider
On 13.02.24 03:01, Jeff Davis wrote:
> 1. The SQL spec mentions the capitalization of "ß" as "SS" specifically. Should UCS_BASIC use the unconditional mappings in SpecialCasing.txt? I already have some code to do that (not posted yet).

It is my understanding that "correct" Unicode case conversion needs to use at least some parts of SpecialCasing.txt. The header of the file says:

  "For compatibility, the UnicodeData.txt file only contains simple case
  mappings for characters where they are one-to-one and independent of
  context and language. The data in this file, combined with the simple
  case mappings in UnicodeData.txt, defines the full case mappings [...]"

I read this as: just using UnicodeData.txt by itself is incomplete. I think we need to use the "Unconditional" mappings and the "Conditional Language-Insensitive" mappings (which is just the Greek sigma). Obviously, skip the "Language-Sensitive" mappings.
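The two SpecialCasing.txt categories mentioned above can be seen in any implementation of full case mapping. Python's built-in str methods follow UnicodeData.txt plus SpecialCasing.txt, so they make a handy illustration (a sketch, not the patch's implementation):

```python
# Full case mapping per SpecialCasing.txt, illustrated with Python's
# built-in str methods (a sketch, not the patch's implementation).

# Unconditional mapping: U+00DF ("ß") uppercases to the two-character
# string "SS"; the one-to-one mappings in UnicodeData.txt cannot
# express this length-changing mapping.
assert "ß".upper() == "SS"

# Conditional language-insensitive mapping (Final_Sigma): capital sigma
# lowercases to U+03C2 at the end of a word, U+03C3 elsewhere.
assert "ΣΑ".lower() == "σα"   # non-final position: U+03C3
assert "ΑΣ".lower() == "ας"   # final position: U+03C2
```

The language-sensitive mappings (e.g. Turkish dotless i) are the ones a language-agnostic builtin locale would skip, as described above.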
Re: Built-in CTYPE provider
On Wed, 2024-02-07 at 10:53 +0100, Peter Eisentraut wrote:
> Various comments are updated to include the term "character class". I don't recognize that as an official Unicode term. There are categories and properties. Let's check this.

It's based on https://www.unicode.org/reports/tr18/#Compatibility_Properties so I suppose the right name is "properties".

> Is it potentially confusing that only some pg_u_prop_* have a posix variant? Would it be better for a consistent interface to have a "posix" argument for each and just ignore it if not used? Not sure.

I thought about it but didn't see a clear advantage one way or another.

> About this initdb --builtin-locale option and analogous options elsewhere: Maybe we should flip this around and provide a --libc-locale option, and have all the other providers just use the --locale option. This would be more consistent with the fact that it's libc that is special in this context.

Would --libc-locale affect all the environment variables or just LC_CTYPE/LC_COLLATE? How do we avoid breakage? I like the general direction here but we might need to phase in the option or come up with a new name. Suggestions welcome.

> Do we even need the "C" locale? We have established that "C.UTF-8" is useful, but if that is easily available, who would need "C"?

I don't think we should encourage its use generally, but I also don't think it will disappear any time soon. Some people will want it on simplicity grounds. I hope fewer people will use "C" when we have a better builtin option.

> Some changes in this patch appear to be just straight renamings, like in src/backend/utils/init/postinit.c and src/bin/pg_upgrade/t/002_pg_upgrade.pl. Maybe those should be put into the previous patch instead.
>
> On the collation naming: My expectation would have been that the "C.UTF-8" locale would be exposed as the UCS_BASIC collation.

I'd like that. We have to sort out a couple things first, though:

1.
The SQL spec mentions the capitalization of "ß" as "SS" specifically. Should UCS_BASIC use the unconditional mappings in SpecialCasing.txt? I already have some code to do that (not posted yet).

2. Should UCS_BASIC use the "POSIX" or "Standard" variant of regex properties? (The main difference seems to be whether symbols get treated as punctuation or not.)

3. What do we do about potential breakage for existing users of UCS_BASIC who might be expecting C-like behavior?

Regards,
Jeff Davis
Re: Built-in CTYPE provider
Review of the v16 patch set:

(Btw., I suppose you started this patch series with 0002 because some 0001 was committed earlier. But I have found this rather confusing. I think it's ok to renumber from 0001 for each new version.)

* v16-0002-Add-Unicode-property-tables.patch

Various comments are updated to include the term "character class". I don't recognize that as an official Unicode term. There are categories and properties. Let's check this.

Some files need heavy pgindent and perltidy. You were surely going to do this eventually, but maybe you want to do this sooner to check whether you like the results.

- src/common/unicode/Makefile

This patch series adds some new post-update-unicode tests. Should we have a separate target for each or just one common "unicode test" target? Not sure.

- .../generate-unicode_category_table.pl

The trailing-commas handling ($firsttime etc.) is not necessary with C99. The code can be simplified.

For this kind of code:

+print $OT <<"HEADER";

let's use a common marker like EOS instead of a different one for each block. That just introduces unnecessary variations.

- src/common/unicode_category.c

The mask stuff at the top could use more explanation. It's impossible to figure out exactly what, say, PG_U_PC_MASK does.

Line breaks in the different pg_u_prop_* functions are gratuitously different.

Is it potentially confusing that only some pg_u_prop_* have a posix variant? Would it be better for a consistent interface to have a "posix" argument for each and just ignore it if not used? Not sure.

Let's use size_t instead of Size for new code.

* v16-0003-Add-unicode-case-mapping-tables-and-functions.patch

Several of the above points apply here analogously.

* v16-0004-Catalog-changes-preparing-for-builtin-collation-.patch

This is mostly a straightforward renaming patch, but there are some changes in initdb and pg_dump that pre-assume the changes in the next patch, like which locale columns apply for which providers.
I think it would be better for the historical record to make this a straight renaming patch and move those semantic changes to the next patch (or a separate intermediate patch, if you prefer).

- src/bin/psql/describe.c
- src/test/regress/expected/psql.out

This would be a good opportunity to improve the output columns for collations. The updated view is now:

 Schema | Name | Provider | Collate | Ctype | Locale | ICU Rules | Deterministic?
--------+------+----------+---------+-------+--------+-----------+----------------

This is historically grown but suboptimal. Why is Locale after Collate and Ctype, and why does it show both? I think we could have just the Locale column, and if the libc provider is used with different collate/ctype (very rare!), we somehow write that into the single locale column. (A change like this would be a separate patch.)

* v16-0005-Introduce-collation-provider-builtin-for-C-and-C.patch

About this initdb --builtin-locale option and analogous options elsewhere: Maybe we should flip this around and provide a --libc-locale option, and have all the other providers just use the --locale option. This would be more consistent with the fact that it's libc that is special in this context.

Do we even need the "C" locale? We have established that "C.UTF-8" is useful, but if that is easily available, who would need "C"?

Some changes in this patch appear to be just straight renamings, like in src/backend/utils/init/postinit.c and src/bin/pg_upgrade/t/002_pg_upgrade.pl. Maybe those should be put into the previous patch instead.

On the collation naming: My expectation would have been that the "C.UTF-8" locale would be exposed as the UCS_BASIC collation. And the "C" locale as some other name (or not at all, see above). You have this the other way around.
Re: Built-in CTYPE provider
On Mon, 2024-01-22 at 19:49 +0100, Peter Eisentraut wrote:
> I don't get this argument. Of course, people care about sorting and sort order. Whether you consider this part of Unicode or adjacent to it, people still want it.

You said that my proposal sends a message that we somehow don't care about Unicode, and I strongly disagree. The built-in provider I'm proposing does implement Unicode semantics. Surely a database that offers UCS_BASIC (a SQL spec feature) isn't sending a message that it doesn't care about Unicode, and neither is my proposal.

> > * ICU offers COLLATE UNICODE, locale tailoring, case-insensitive matching, and customization with rules. It's the solution for everything from "slightly more advanced" to "very advanced".
>
> I am astonished by this. In your world, do users not want their text data sorted? Do they not care what the sort order is?

I obviously care about Unicode and collation. I've put a lot of effort recently into contributions in this area, and I wouldn't have done that if I thought users didn't care. You've made much greater contributions, and I thank you for that.

The logical conclusion of your line of argument would be that libc's "C.UTF-8" locale and UCS_BASIC simply should not exist. But they do exist, and for good reason. One of those good reasons is that only *human* users care about the human-friendliness of sort order. If Postgres is just feeding the results to another system -- or an application layer that re-sorts the data anyway -- then stability, performance, and interoperability matter more than human-friendliness. (Though Unicode character semantics are still useful even when the data is not going directly to a human.)

> You consider UCA sort order an "advanced" feature?

I said "slightly more advanced" compared with "basic". "Advanced" can be taken in either a positive way ("more useful") or a negative way ("complex").
I'm sorry for the misunderstanding, but my point was this:

* The builtin provider is for people who are fine with code point order and no tailoring, but want Unicode character semantics, collation stability, and performance.

* ICU is the right solution for anyone who wants human-friendly collation or tailoring, and is willing to put up with some collation stability risk and lower collation performance.

Both have their place, and the user is free to mix and match as needed, thanks to the COLLATE clause for columns and queries.

Regards,
Jeff Davis
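The contrast between the two bullets above can be made concrete: in code point order (what a memcmp sort over UTF-8 yields), accented Latin letters sort after the entire unaccented alphabet, which a linguistic (UCA/ICU) sort would not do. A rough sketch in Python:

```python
# Code point order: stable and fast, but not human-friendly.
# A UCA/ICU linguistic collation would instead place "éclair"
# between "apple" and "zebra". Illustrative sketch only.
words = ["zebra", "éclair", "apple"]
print(sorted(words))  # é (U+00E9) sorts after z (U+007A)
```

Running this prints `['apple', 'zebra', 'éclair']`, which is exactly the tradeoff being discussed: predictable and version-stable, but surprising to a human reader expecting dictionary order.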
Re: Built-in CTYPE provider
On 18.01.24 23:03, Jeff Davis wrote:
> On Thu, 2024-01-18 at 13:53 +0100, Peter Eisentraut wrote:
> > I think that would be a terrible direction to take, because it would regress the default sort order from "correct" to "useless".
>
> I don't agree that the current default is "correct". There are a lot of ways it can be wrong:
>
> * the environment variables at initdb time don't reflect what the users of the database actually want
> * there are so many different users using so many different applications connected to the database that no one "correct" sort order exists
> * libc has some implementation quirks
> * the version of Unicode that libc is based on is not what you expect
> * the version of libc is not what you expect

These are arguments why the current defaults are not universally perfect, but I'd argue that they are still most often the right thing as the default.

> > Aside from the overall message this sends about how PostgreSQL cares about locales and Unicode and such.
>
> Unicode is primarily about the semantics of characters and their relationships. The patches I propose here do a great job of that. Collation (relationships between *strings*) is a part of Unicode, but not the whole thing or even the main thing.

I don't get this argument. Of course, people care about sorting and sort order. Whether you consider this part of Unicode or adjacent to it, people still want it.

> > Maybe you don't intend for this to be the default provider?
>
> I am not proposing that this provider be the initdb-time default.

ok

> > But then who would really use it? I mean, sure, some people would, but how would you even explain, in practice, the particular niche of users or use cases?
>
> It's for users who want to respect Unicode when supporting text from international sources in their database, but who are not experts on the subject and don't know precisely what they want or understand the consequences.
> If and when such users do notice a problem with the sort order, they'd handle it at that time (perhaps with a COLLATE clause, or sorting in the application).

> Vision:
>
> * ICU offers COLLATE UNICODE, locale tailoring, case-insensitive matching, and customization with rules. It's the solution for everything from "slightly more advanced" to "very advanced".

I am astonished by this. In your world, do users not want their text data sorted? Do they not care what the sort order is? You consider UCA sort order an "advanced" feature?
Re: Built-in CTYPE provider
On Thu, 2024-01-18 at 13:53 +0100, Peter Eisentraut wrote:
> I think that would be a terrible direction to take, because it would regress the default sort order from "correct" to "useless".

I don't agree that the current default is "correct". There are a lot of ways it can be wrong:

* the environment variables at initdb time don't reflect what the users of the database actually want
* there are so many different users using so many different applications connected to the database that no one "correct" sort order exists
* libc has some implementation quirks
* the version of Unicode that libc is based on is not what you expect
* the version of libc is not what you expect

> Aside from the overall message this sends about how PostgreSQL cares about locales and Unicode and such.

Unicode is primarily about the semantics of characters and their relationships. The patches I propose here do a great job of that. Collation (relationships between *strings*) is a part of Unicode, but not the whole thing or even the main thing.

> Maybe you don't intend for this to be the default provider?

I am not proposing that this provider be the initdb-time default.

> But then who would really use it? I mean, sure, some people would, but how would you even explain, in practice, the particular niche of users or use cases?

It's for users who want to respect Unicode when supporting text from international sources in their database, but who are not experts on the subject and don't know precisely what they want or understand the consequences. If and when such users do notice a problem with the sort order, they'd handle it at that time (perhaps with a COLLATE clause, or sorting in the application).

> Maybe if this new provider would be called "minimal", it might describe the purpose better.

"Builtin" communicates that it's available everywhere (not a dependency), that specific behaviors can be documented and tested, and that behavior doesn't change within a major version.
I want to communicate all of those things.

> I could see a use for this builtin provider if it also included the default UCA collation (what COLLATE UNICODE does now).

I won't rule that out, but I'm not proposing that right now, and my proposal is already offering useful functionality.

> There would still be a risk with that approach, since it would permanently marginalize ICU functionality

Yeah, ICU already does a good job offering the root collation. I don't think the builtin provider needs to do so.

> I would be curious what your overall vision is here?

Vision:

* The builtin provider will offer Unicode character semantics, basic collation, platform-independence, and high performance. It can be used on its own or in combination with ICU via the COLLATE clause.

* ICU offers COLLATE UNICODE, locale tailoring, case-insensitive matching, and customization with rules. It's the solution for everything from "slightly more advanced" to "very advanced".

* libc would be for databases serving applications on the same machine, where a matching sort order is helpful, risks to indexes are acceptable, and performance is not important.

> Is switching the default to ICU still your goal? Or do you want the builtin provider to be the default?

It's hard to answer this question while initdb chooses the database default collation based on the environment. Neither ICU nor the builtin provider can reasonably handle whatever those environment variables might be set to.

Stepping back from the focus on what initdb does, we should be providing the right encouragement in documentation and packaging to guide users toward the right provider based on their needs and the vision outlined above.

Regards,
Jeff Davis
Re: Built-in CTYPE provider
Peter Eisentraut wrote:
> > If the Postgres default was bytewise sorting + locale-agnostic ctype functions directly derived from Unicode data files, as opposed to libc/$LANG at initdb time, the main annoyance would be that "ORDER BY textcol" would no longer be the human-favored sort.
>
> I think that would be a terrible direction to take, because it would regress the default sort order from "correct" to "useless". Aside from the overall message this sends about how PostgreSQL cares about locales and Unicode and such.

Well, offering a viable solution to avoid as much as possible the dreaded:

  WARNING: collation "xyz" has version mismatch ...
  HINT: Rebuild all objects affected by this collation...

that doesn't sound like a bad message to send.

Currently, to get indexes in codepoint order when they don't need a linguistic order, you're supposed to use collate "C", which then means that upper(), lower(), etc. don't work beyond ASCII. Here our Unicode support is not good enough, and the proposal addresses that.

Best regards,
--
Daniel Vérité
https://postgresql.verite.pro/
Twitter: @DanielVerite
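The gap described above -- collate "C" giving codepoint order but ASCII-only ctype behavior -- can be illustrated by analogy in Python, whose bytes.upper() maps only a-z (much like toupper() in the "C" locale), while str.upper() applies Unicode case mapping. A sketch, not PostgreSQL code:

```python
# "C"-locale-style case mapping is ASCII-only; Unicode-aware mapping
# handles the full repertoire. Here bytes.upper() (ASCII-only by
# definition) stands in for toupper() in the "C" locale.
s = "café"
assert s.upper() == "CAFÉ"                          # Unicode-aware: é -> É
assert s.encode("utf-8").upper() == b"CAF\xc3\xa9"  # ASCII-only: é untouched
assert b"CAF\xc3\xa9".decode("utf-8") == "CAFé"
```

A builtin C.UTF-8 locale is meant to close exactly this gap: codepoint-order collation plus Unicode-aware ctype functions, with no external dependency.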
Re: Built-in CTYPE provider
On 12.01.24 03:02, Jeff Davis wrote:
> New version attached. Changes:
>
> * Named collation object PG_C_UTF8, which seems like a good idea to prevent name conflicts with existing collations. The locale name is still C.UTF-8, which still makes sense to me because it matches the behavior of the libc locale of the same name so closely.

I am catching up on this thread. The discussions have been very complicated, so maybe I didn't get it all. The patches look pretty sound, but I'm questioning how useful this feature is and where you plan to take it.

Earlier in the thread, the aim was summarized as

> If the Postgres default was bytewise sorting + locale-agnostic ctype functions directly derived from Unicode data files, as opposed to libc/$LANG at initdb time, the main annoyance would be that "ORDER BY textcol" would no longer be the human-favored sort.

I think that would be a terrible direction to take, because it would regress the default sort order from "correct" to "useless". Aside from the overall message this sends about how PostgreSQL cares about locales and Unicode and such.

Maybe you don't intend for this to be the default provider? But then who would really use it? I mean, sure, some people would, but how would you even explain, in practice, the particular niche of users or use cases?

Maybe if this new provider would be called "minimal", it might describe the purpose better.

I could see a use for this builtin provider if it also included the default UCA collation (what COLLATE UNICODE does now). Then it would provide a "common" default behavior out of the box, and if you want more fine-tuning, you can go to ICU. There would still be some questions about making sure the builtin behavior and the ICU behavior are consistent (different Unicode versions, stock UCA vs CLDR, etc.). But for practical purposes, it might work.
There would still be a risk with that approach, since it would permanently marginalize ICU functionality, in the sense that only some locales would need ICU, and so we might not pay the same amount of attention to the ICU functionality. I would be curious what your overall vision is here? Is switching the default to ICU still your goal? Or do you want the builtin provider to be the default? Or something else?
Re: Built-in CTYPE provider
On Mon, 2024-01-15 at 15:30 +0100, Daniel Verite wrote:
> Concerning the target category_test, it produces failures with versions of ICU with Unicode < 15. The first one I see with Ubuntu 22.04 (ICU 70.1) is: ...
>
> I find these results interesting because they tell us what contents can break regexp-based check constraints on upgrades.

Thank you for collecting and consolidating this information.

> But about category_test as a pass-or-fail kind of test, it can only be used when the Unicode version in ICU is the same as in Postgres.

The test has a few potential purposes:

1. To see if there is some error in parsing the Unicode files and building the arrays in the .h file. For instance, let's say the perl parser I wrote works fine on the Unicode 15.1 data file, but does something wrong on the 16.0 data file: the test would fail and we'd investigate. This is the most important reason for the test.

2. To notice any quirks between how we interpret Unicode vs how ICU does.

3. To help see "interesting" differences between different Unicode versions.

For #1 and #2, the best way to test is by using a version of ICU that uses the same Unicode version as Postgres. The one running update-unicode can try to recompile with the right one for the purposes of the test.

NB: There might be no version of ICU where the Unicode version exactly matches what we'd like to update to. In that case, we'd need to use the closest version and do some manual validation that the generated tables are sane.

For #3, that is also interesting information to know about, but it's not directly actionable. As you point out, Unicode does not guarantee that these properties are static forever, so regexes can change behavior when we update Unicode for the next PG version. That is a much lower risk than a collation change, but as you point out, is a risk for regexes inside of a CHECK constraint. If a user needs zero risk of semantic changes for regexes, the only option is "C".
Perhaps there should be a separate test target for this mode so that it doesn't exit early? (Note: case mapping has much stronger guarantees than the character classification.) I will update the README to document how someone running update-unicode should interpret the test results. Regards, Jeff Davis
Re: Built-in CTYPE provider
Jeff Davis wrote:
> New version attached. [v16]

Concerning the target category_test, it produces failures with versions of ICU with Unicode < 15. The first one I see with Ubuntu 22.04 (ICU 70.1) is:

  category_test: Postgres Unicode version:15.1
  category_test: ICU Unicode version: 14.0
  category_test: FAILURE for codepoint 0x000c04
  category_test: Postgres property alphabetic/lowercase/uppercase/white_space/hex_digit/join_control: 1/0/0/0/0/0
  category_test: ICU property alphabetic/lowercase/uppercase/white_space/hex_digit/join_control: 0/0/0/0/0/0

U+0C04 is a codepoint added in Unicode 11. https://en.wikipedia.org/wiki/Telugu_(Unicode_block)

In UnicodeData.txt:

  0C04;TELUGU SIGN COMBINING ANUSVARA ABOVE;Mn;0;NSM;N;

In Unicode 15, it has been assigned Other_Alphabetic in PropList.txt:

  $ grep 0C04 PropList.txt
  0C04 ; Other_Alphabetic # Mn TELUGU SIGN COMBINING ANUSVARA ABOVE

But in Unicode 14 it was not there. As a result, its binary property UCHAR_ALPHABETIC has changed from false to true in ICU 72 vs previous versions.

As I understand, the stability policy says that such things happen. From https://www.unicode.org/policies/stability_policy.html

  Once a character is encoded, its properties may still be changed, but
  not in such a way as to change the fundamental identity of the
  character. The Consortium will endeavor to keep the values of the
  other properties as stable as possible, but some circumstances may
  arise that require changing them. Particularly in the situation where
  the Unicode Standard first encodes less well-documented characters
  and scripts, the exact character properties and behavior initially
  may not be well known. As more experience is gathered in implementing
  the characters, adjustments in the properties may become necessary.
  Examples of such properties include, but are not limited to, the
  following:
  - General_Category
  - Case mappings
  - Bidirectional properties
  [...]
I've commented out the exit(1) in category_test to collect all errors, and built it with versions of ICU from 74 down to 60 (that is, Unicode 10.0). Results are attached. As expected, the older the ICU version, the more differences are found against Unicode 15.1.

I find these results interesting because they tell us what contents can break regexp-based check constraints on upgrades.

But about category_test as a pass-or-fail kind of test: it can only be used when the Unicode version in ICU is the same as in Postgres.

Best regards,
--
Daniel Vérité
https://postgresql.verite.pro/

[Attachment: results-category-tests-multiple-icu-versions.tar.bz2]
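The version skew described in this exchange is easy to reproduce with any library that pins a Unicode version; Python's unicodedata module is one such (a sketch for illustration, unrelated to the category_test binary):

```python
import unicodedata

# Each Python release pins one Unicode version, just as each ICU
# release does -- so property checks can disagree across versions.
print(unicodedata.unidata_version)

# U+0C04 was added in Unicode 11 with General_Category Mn; per the
# report above, it additionally gained Other_Alphabetic in Unicode 15,
# flipping ICU's UCHAR_ALPHABETIC from false to true between releases.
assert unicodedata.name("\u0c04") == "TELUGU SIGN COMBINING ANUSVARA ABOVE"
assert unicodedata.category("\u0c04") == "Mn"
```

The General_Category stays stable here while a derived binary property changed underneath it, which is exactly why a property-level comparison test only gives a clean pass/fail when both sides use the same Unicode version.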
Re: Built-in CTYPE provider
On Fri, Jan 12, 2024 at 01:13:04PM -0500, Robert Haas wrote:
> On Fri, Jan 12, 2024 at 1:00 PM Daniel Verite wrote:
> > ISTM that in general the behavior of old psql vs new server does not weigh much against choosing optimal catalog changes.
>
> +1.

+1. There is a good amount of effort put into maintaining downward compatibility in psql. Upward compatibility would require more manipulations of the stable branches to make older versions of psql compatible with newer server versions. Brr.
--
Michael
Re: Built-in CTYPE provider
On Fri, 2024-01-12 at 19:00 +0100, Daniel Verite wrote:
> Another one is that version 12 broke \d in older psql by removing pg_class.relhasoids.
>
> ISTM that in general the behavior of old psql vs new server does not weigh much against choosing optimal catalog changes.
>
> There's also that warning at start informing users about it:
> WARNING: psql major version X, server major version Y.
> Some psql features might not work.

Good point, I'll leave it as-is then. If someone complains, I can rework it. Also, the output of \l changes from version to version, so if there are automated tools processing the output then they'd have to change anyway.

Regards,
Jeff Davis
Re: Built-in CTYPE provider
On Fri, Jan 12, 2024 at 1:00 PM Daniel Verite wrote:
> ISTM that in general the behavior of old psql vs new server does not weigh much against choosing optimal catalog changes.

+1.
--
Robert Haas
EDB: http://www.enterprisedb.com
Re: Built-in CTYPE provider
Jeff Davis wrote:
> > Jeremy also raised a problem with old versions of psql connecting to a new server: the \l and \dO won't work. Not sure exactly what to do there, but I could work around it by adding a new field rather than renaming (though that's not ideal).
>
> I did a bit of research for a precedent, and the closest I could find is that \dp was broken between 14 and 15 by commit 07eee5a0dc.

Another one is that version 12 broke \d in older psql by removing pg_class.relhasoids.

ISTM that in general the behavior of old psql vs new server does not weigh much against choosing optimal catalog changes.

There's also that warning at start informing users about it:

  WARNING: psql major version X, server major version Y.
  Some psql features might not work.

Best regards,
--
Daniel Vérité
https://postgresql.verite.pro/
Re: Built-in CTYPE provider
On Thu, 2024-01-11 at 18:02 -0800, Jeff Davis wrote:
> Jeremy also raised a problem with old versions of psql connecting to a new server: the \l and \dO won't work. Not sure exactly what to do there, but I could work around it by adding a new field rather than renaming (though that's not ideal).

I did a bit of research for a precedent, and the closest I could find is that \dp was broken between 14 and 15 by commit 07eee5a0dc.

That is some precedent, but it's more narrow. I think that might justify breaking \dO in older clients, but \l is used frequently.

It seems safer to just introduce new columns "datbuiltinlocale" and "collbuiltinlocale". They'll be nullable anyway. If we want to clean this up, we can do so as a separate commit. There are other parts of the catalog representation related to collation that we might want to clean up as well.

Regards,
Jeff Davis
Re: Built-in CTYPE provider
On Tue, 2024-01-09 at 14:17 -0800, Jeremy Schneider wrote: > I think we missed something in psql, pretty sure I applied all the > patches but I see this error: > > =# \l > ERROR: 42703: column d.datlocale does not exist > LINE 8: d.datlocale as "Locale", > ^ > HINT: Perhaps you meant to reference the column "d.daticulocale". > LOCATION: errorMissingColumn, parse_relation.c:3720 I think you're connecting to a patched server with an older version of psql, so it doesn't know the catalog column was renamed. pg_dump and pg_upgrade don't have that problem because they throw an error when connecting to a newer server. But for psql, it's perfectly reasonable to connect to a newer server. Have we dropped or renamed catalog columns used by psql backslash commands before, and if so, how do we handle that? I can just not rename that column, but that's a bit frustrating because it means I'd need to add a new column to pg_database, which seems redundant. Regards, Jeff Davis
Re: Built-in CTYPE provider
On Wed, 2024-01-10 at 23:56 +0100, Daniel Verite wrote: > $ bin/initdb --locale=C.UTF-8 --locale-provider=builtin -D/tmp/pgdata > > The database cluster will be initialized with this locale > configuration: > default collation provider: builtin > default collation locale: C.UTF-8 > LC_COLLATE: C.UTF-8 > LC_CTYPE: C.UTF-8 > LC_MESSAGES: C.UTF-8 > LC_MONETARY: C.UTF-8 > LC_NUMERIC: C.UTF-8 > LC_TIME: C.UTF-8 > The default database encoding has accordingly been set to "UTF8". > The default text search configuration will be set to "english". > > This is from an environment where LANG=fr_FR.UTF-8 > > I would expect all LC_* variables to be fr_FR.UTF-8, and the default > text search configuration to be "french". You can get the behavior you want by doing: initdb --builtin-locale=C.UTF-8 --locale-provider=builtin \ -D/tmp/pgdata where "--builtin-locale" is analogous to "--icu-locale". It looks like I forgot to document the new initdb option, which seems to be the source of the confusion. Sorry, I'll fix that in the next patch set. (See examples in the initdb tests.) I think this answers some of your follow-up questions as well. > A related comment is about naming the builtin locale C.UTF-8, the > same > name as in libc. On one hand this is semantically sound, but on the > other hand, it's likely to confuse people. What about using > completely > different names, like "pg_unicode" or something else prefixed by > "pg_" > both for the locale name and the collation name (currently > C.UTF-8/c_utf8)? I'm flexible on naming, but here are my thoughts: * A "pg_" prefix makes sense. * If we named it something like "pg_unicode" someone might expect it to sort using the root collation. * The locale name "C.UTF-8" is nice because it implies things about both the collation and the character behavior. It's also nice because on at least some platforms, the behavior is almost identical to the libc locale of the same name. 
* UCS_BASIC might be a good name, because it also seems to carry the right meanings, but that name is already taken. * We might also want to support variations, such as full case mapping (which uppercases "ß" to "SS", as the SQL standard requires), or perhaps the "standard" flavor of regexes (which don't count all symbols as punctuation). Leaving some room to name those variations would be a good idea. Regards, Jeff Davis
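The "full case mapping" variation mentioned above is easy to see outside Postgres. Python's str.upper() implements the Unicode full case mappings, so it serves as a quick reference here (this illustrates the Unicode behavior only, not any PostgreSQL API):

```python
# Full case mapping may change the length of a string: the SQL
# standard's own example is U+00DF (ß), which uppercases to "SS".
word = "straße"
upper = word.upper()
print(upper)  # STRASSE -- one character longer than the input

# A simple (1:1) per-character mapping table, by contrast, has to
# leave ß unchanged, since there is no single-character uppercase form.
assert "ß".upper() == "SS"
assert len(word) == 6 and len(upper) == 7
```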
Re: Built-in CTYPE provider
Jeff Davis wrote:

> Attached a more complete version that fixes a few bugs
[v15 patch]

When selecting the builtin provider with initdb, I'm getting the following setup:

$ bin/initdb --locale=C.UTF-8 --locale-provider=builtin -D/tmp/pgdata

The database cluster will be initialized with this locale configuration:
  default collation provider: builtin
  default collation locale:   C.UTF-8
  LC_COLLATE:  C.UTF-8
  LC_CTYPE:    C.UTF-8
  LC_MESSAGES: C.UTF-8
  LC_MONETARY: C.UTF-8
  LC_NUMERIC:  C.UTF-8
  LC_TIME:     C.UTF-8
The default database encoding has accordingly been set to "UTF8".
The default text search configuration will be set to "english".

This is from an environment where LANG=fr_FR.UTF-8. I would expect all LC_* variables to be fr_FR.UTF-8, and the default text search configuration to be "french". That is what happens when selecting ICU as the provider in the same environment:

$ bin/initdb --icu-locale=en --locale-provider=icu -D/tmp/pgdata

Using language tag "en" for ICU locale "en".
The database cluster will be initialized with this locale configuration:
  default collation provider: icu
  default collation locale:   en
  LC_COLLATE:  fr_FR.UTF-8
  LC_CTYPE:    fr_FR.UTF-8
  LC_MESSAGES: fr_FR.UTF-8
  LC_MONETARY: fr_FR.UTF-8
  LC_NUMERIC:  fr_FR.UTF-8
  LC_TIME:     fr_FR.UTF-8
The default database encoding has accordingly been set to "UTF8".
The default text search configuration will be set to "french".

The collation setup does not influence the rest of the localization. The problem AFAIU is that --locale has two distinct meanings in the v15 patch: --locale-provider=X --locale=Y means both use "X" as the provider with "Y" as datlocale, and use "Y" as the locale for all localized libc functionalities. I wonder what would happen when invoking bin/initdb --locale=C.UTF-8 --locale-provider=builtin -D/tmp/pgdata on a system where C.UTF-8 does not exist as a libc locale. Would it fail? (I don't have an OS like this to test ATM, will try later). 
A related comment is about naming the builtin locale C.UTF-8, the same name as in libc. On one hand this is semantically sound, but on the other hand, it's likely to confuse people. What about using completely different names, like "pg_unicode" or something else prefixed by "pg_" both for the locale name and the collation name (currently C.UTF-8/c_utf8)? Best regards, -- Daniel Vérité https://postgresql.verite.pro/ Twitter: @DanielVerite
Re: Built-in CTYPE provider
On 1/9/24 2:31 PM, Jeff Davis wrote: > On Tue, 2024-01-09 at 14:17 -0800, Jeremy Schneider wrote: >> I think we missed something in psql, pretty sure I applied all the >> patches but I see this error: >> >> =# \l >> ERROR: 42703: column d.datlocale does not exist >> LINE 8: d.datlocale as "Locale", >> > > Thank you. I'll fix this in the next patch set.

Very minor nitpick/request, not directly with this patch set but with the category_test, which is related and also recently committed. I'm testing out the "make update-unicode" scripts, and because my system ICU version differs from the PostgreSQL Unicode version I get the expected warnings from category_test:

Postgres Unicode Version:	15.1
ICU Unicode Version:		14.0
Skipped 5116 codepoints unassigned in ICU due to Unicode version mismatch.
category_test: All tests successful!

I know it's minor, but I saw the 5116 skipped codepoints next to "all tests successful" and thought the output might be a little nicer if we showed the count of tests that succeeded. For example:

Postgres Unicode Version:	15.1
ICU Unicode Version:		14.0
Skipped 5116 codepoints unassigned in ICU due to Unicode version mismatch.
category_test: All 1108996 tests successful!

It's a minor tweak to a script that I don't think even runs in the standard build; any objections? Patch attached. -Jeremy -- http://about.me/jeremy_schneider From 5ff09eb7371abc14cd8537b4e3265e35f030794f Mon Sep 17 00:00:00 2001 From: Jeremy Schneider Date: Wed, 10 Jan 2024 07:25:17 + Subject: [PATCH] Output count of checked codepoints. 
--- src/common/unicode/category_test.c | 8 +++- 1 file changed, 7 insertions(+), 1 deletion(-) diff --git a/src/common/unicode/category_test.c b/src/common/unicode/category_test.c index 6cd7cd1a5f..62ff345066 100644 --- a/src/common/unicode/category_test.c +++ b/src/common/unicode/category_test.c @@ -53,6 +53,7 @@ main(int argc, char **argv) int icu_unicode_version = parse_unicode_version(U_UNICODE_VERSION); int pg_skipped_codepoints = 0; int icu_skipped_codepoints = 0; + int checked_codepoints = 0; printf("Postgres Unicode Version:\t%s\n", PG_UNICODE_VERSION); printf("ICU Unicode Version:\t\t%s\n", U_UNICODE_VERSION); @@ -90,6 +91,10 @@ main(int argc, char **argv) exit(1); } } + else + { + checked_codepoints++; + } } if (pg_skipped_codepoints > 0) @@ -99,7 +104,8 @@ main(int argc, char **argv) printf("Skipped %d codepoints unassigned in ICU due to Unicode version mismatch.\n", icu_skipped_codepoints); - printf("category_test: All tests successful!\n"); + printf("category_test: All %d tests successful!\n", + checked_codepoints); exit(0); #else printf("ICU support required for test; skipping.\n"); -- 2.34.1
Re: Built-in CTYPE provider
On Mon, 2024-01-08 at 17:17 -0800, Jeremy Schneider wrote: > I agree with merging the threads, even though it makes for a larger > patch set. It would be great to get a unified "builtin" provider in > place for the next major. I believe that's possible and that this proposal is quite close (hoping to get something in this 'fest). The tables I'm introducing have exhaustive test coverage, so there's not a lot of risk there. And the builtin provider itself is an optional feature, so it won't be disruptive. > > In the first list it seems that some callers might be influenced by a > COLLATE clause or table definition while others always take the > database > default? It still seems a bit odd to me if different providers can be > used for different parts of a single SQL. Right, that can happen today, and my proposal doesn't change that. Basically those are cases where the caller was never properly onboarded to our collation system, like the ts_locale.c routines. > Is there any reason we couldn't commit the minor cleanup (patch 0001) > now? It's less than 200 lines and pretty straightforward. Sure, I'll commit that fairly soon then. > I wonder if, after a year of running the builtin provider in > production, > whether we might consider adding to the builtin provider a few > locales > with simple but more reasonable ordering for european and asian > languages? I won't rule that out completely, but there's a lot we would need to do to get there. Even assuming we implement that perfectly, we'd need to make sure it's a reasonable scope for Postgres as a project and that we have more than one person willing to maintain it. Similar things have been rejected before for similar reasons. What I'm proposing for v17 is much simpler: basically some lookup tables, which is just an extension of what we're already doing for normalization. 
> https://jeremyhussell.blogspot.com/2017/11/falsehoods-programmers-believe-about.html#main > > Make sure to click the link to show the counterexamples and > discussion, > that's the best part. Yes, it can be hard to reason about this stuff but I believe Unicode provides a lot of good data and guidance to work from. If you think my proposal relies on one of those assumptions let me know. To the extent that I do rely on any of those assumptions, it's mostly to match libc's "C.UTF-8" behavior. Regards, Jeff Davis
Re: Built-in CTYPE provider
On Tue, 2024-01-09 at 14:17 -0800, Jeremy Schneider wrote: > I think we missed something in psql, pretty sure I applied all the > patches but I see this error: > > =# \l > ERROR: 42703: column d.datlocale does not exist > LINE 8: d.datlocale as "Locale", > Thank you. I'll fix this in the next patch set. > This is interesting. Jeff your original email didn't explicitly show > any > other initcap() results, but on Ubuntu 22.04 (glibc 2.35) I see > different results: > > =# SELECT initcap('axxE áxxÉ DŽxxDŽ Džxxx džxxx'); > initcap > -- > Axxe Áxxé DŽxxdž DŽxxx DŽxxx > > =# SELECT initcap('axxE áxxÉ DŽxxDŽ Džxxx džxxx' COLLATE C_UTF8); > initcap > -- > Axxe Áxxé Džxxdž Džxxx Džxxx The reason for this difference is because libc doesn't support titlecase. In the next patch set, I'll not use titlecase (only uppercase/lowercase even for initcap()), and then it will match libc 100%. > The COLLATE sql syntax feels awkward to me. In this example, we're > just > using it to attach locale info to the string, and there's not > actually > any collation involved here. Not sure if COLLATE comes from the > standard, and even if it does I'm not sure whether the standard had > upper/lowercase in mind. The standard doesn't use the COLLATE clause for case mapping, but also doesn't offer any other mechanism to, e.g., get case mapping according to the "tr_TR" locale. I think what Postgres does here, re-purposing the COLLATE clause to allow tailoring of case mapping, is imperfect but reasonable. My proposal doesn't change that. Regards, Jeff Davis
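The titlecase behavior Jeff describes (and the glibc difference Jeremy observed) can be reproduced with the U+01C4..U+01C6 digraph from the example. Python's str.title() applies the Unicode titlecase mapping, so it serves as a reference here (an illustration of the Unicode data, not the patch's code):

```python
# dž (U+01C6) has three case forms: DŽ (U+01C4, uppercase),
# Dž (U+01C5, titlecase), and dž (U+01C6, lowercase).
word = "\u01c6xxx"  # "džxxx"

# libc-style initcap: uppercase the first character, lowercase the rest.
libc_style = word[0].upper() + word[1:].lower()
# Unicode titlecase mapping, which the earlier patch set used for initcap().
unicode_title = word.title()

assert libc_style == "\u01c4xxx"     # "DŽxxx", matching the glibc output above
assert unicode_title == "\u01c5xxx"  # "Džxxx"
```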
Re: Built-in CTYPE provider
On 12/28/23 6:57 PM, Jeff Davis wrote: > On Wed, 2023-12-27 at 17:26 -0800, Jeff Davis wrote: > Attached a more complete version that fixes a few bugs, stabilizes the > tests, and improves the documentation. I optimized the performance, too > -- now it's beating both libc's "C.utf8" and ICU "en-US-x-icu" for both > collation and case mapping (numbers below). > > It's really nice to finally be able to have platform-independent tests > that work on any UTF-8 database.

I think we missed something in psql, pretty sure I applied all the patches but I see this error:

=# \l
ERROR: 42703: column d.datlocale does not exist
LINE 8: d.datlocale as "Locale",
        ^
HINT: Perhaps you meant to reference the column "d.daticulocale".
LOCATION: errorMissingColumn, parse_relation.c:3720

This is interesting. Jeff, your original email didn't explicitly show any other initcap() results, but on Ubuntu 22.04 (glibc 2.35) I see different results:

=# SELECT initcap('axxE áxxÉ DŽxxDŽ Džxxx džxxx');
       initcap
----------------------
 Axxe Áxxé DŽxxdž DŽxxx DŽxxx

=# SELECT initcap('axxE áxxÉ DŽxxDŽ Džxxx džxxx' COLLATE C_UTF8);
       initcap
----------------------
 Axxe Áxxé Džxxdž Džxxx Džxxx

The COLLATE SQL syntax feels awkward to me. In this example, we're just using it to attach locale info to the string, and there's not actually any collation involved here. Not sure if COLLATE comes from the standard, and even if it does I'm not sure whether the standard had upper/lowercase in mind. That said, I think the thing that mainly matters will be the CREATE DATABASE syntax and the database default. I want to try a few things with table-level defaults that differ from database-level defaults, especially table-level ICU defaults, because I think a number of PostgreSQL users set that up in the years before we supported DB-level ICU. Some people will probably keep using their old/existing schema-creation scripts even after they begin provisioning new systems with new database-level defaults. -Jeremy -- http://about.me/jeremy_schneider
Re: Built-in CTYPE provider
On 12/28/23 6:57 PM, Jeff Davis wrote: > > Attached a more complete version that fixes a few bugs, stabilizes the > tests, and improves the documentation. I optimized the performance, too > -- now it's beating both libc's "C.utf8" and ICU "en-US-x-icu" for both > collation and case mapping (numbers below). > > It's really nice to finally be able to have platform-independent tests > that work on any UTF-8 database. Thanks for all your work on this, Jeff. I didn't know about the Unicode stability policy. Since it's formal policy, I agree this provides some assumptions we can safely build on. I'm working my way through these patches but it's taking a little time for me. I hadn't tracked with the "builtin" thread last summer so I'm coming up to speed on that now too. I'm hopeful that something along these lines gets into pg17. The pg17 cycle is going to start heating up pretty soon. I agree with merging the threads, even though it makes for a larger patch set. It would be great to get a unified "builtin" provider in place for the next major. I also still want to parse my way through your email reply about the two groups of callers, and what this means for user experience. https://www.postgresql.org/message-id/7774b3a64f51b3375060c29871cf2b02b3e85dab.camel%40j-davis.com > Let's separate it into groups. > (1) Callers that use a collation OID or pg_locale_t: > (2) A long tail of callers that depend on what LC_CTYPE/LC_COLLATE are > set to, or use ad-hoc ASCII-only semantics: In the first list it seems that some callers might be influenced by a COLLATE clause or table definition while others always take the database default? It still seems a bit odd to me if different providers can be used for different parts of a single SQL statement. But it might not be so bad - I haven't fully thought through it yet and I'm still kicking the tires on my test build over here. Is there any reason we couldn't commit the minor cleanup (patch 0001) now? 
It's less than 200 lines and pretty straightforward. I wonder whether, after a year of running the builtin provider in production, we might consider adding to the builtin provider a few locales with simple but more reasonable ordering for European and Asian languages? Maybe just grabbing a current version of DUCET with no intention of ever updating it? I don't know how bad sorting is with plain DUCET for things like French or Spanish or German, but surely it's not as unusable as code point order? Anyone who needs truly accurate or updated or customized linguistic sorting can always go to ICU, and take responsibility for their ICU upgrades, but something basic and static might meet the needs of 99% of Postgres users indefinitely. By the way - my coworker Josh (who I don't think posts much on the hackers list here, but shares an unhealthy inability to look away from database unicode disasters) passed along this link today which I think is a fantastic list of surprises about programming and strings (generally unicode). https://jeremyhussell.blogspot.com/2017/11/falsehoods-programmers-believe-about.html#main Make sure to click the link to show the counterexamples and discussion; that's the best part. -Jeremy PS. I was joking around today that the second-best part is that it's proof that people named Jeremy are always brilliant within their field. Josh said it's just a subset of "always trust people whose names start with J" which seems fair. Unfortunately I can't yet think of a way to shoehorn the rest of the amazing PG hackers on this thread into the joke. -- http://about.me/jeremy_schneider
Re: Built-in CTYPE provider
Robert Haas wrote: > For someone who is currently defaulting to es_ES.utf8 or fr_FR.utf8, > a change to C.utf8 would be a much bigger problem, I would > think. Their alphabet isn't in code point order, and so things would > be alphabetized wrongly. > That might be OK if they don't care about ordering for any purpose > other than equality lookups, but otherwise it's going to force them > to change the default, where today they don't have to do that. Sure, in whatever collation setup we expose, we need to keep it possible and even easy to sort properly with linguistic rules. But some reasons to use $LANG as the default locale/collation are no longer as good as they used to be, I think. Starting with v10/ICU we have many pre-created ICU locales with fixed names, and starting with v16, we can simply write "ORDER BY textfield COLLATE unicode" which is good enough in most cases. So the configuration "bytewise sort by default" / "linguistic sort on-demand" has become more realistic. By contrast in the pre-v10 days with only libc collations, an application could have no idea which collations were going to be available on the server, and how they were named precisely, as this varies across OSes and across installs even with the same OS. On Windows, I think that before v16 initdb did not create any libc collation beyond C/POSIX and the default language/region of the OS. In that libc context, if a db wants the C locale by default for performance and truly immutable indexes, but the client app needs to occasionally do in-db linguistic sorts, the app needs to figure out which collation name will work for that. This is hard if you don't target a specific installation that guarantees that such or such collation is going to be installed. Whereas if the linguistic locale is the default, the app never needs to know its name or anything about it. So it's done that way, linguistic by default. 
But that leaves databases with many indexes sorted linguistically instead of bytewise for fields that semantically never need any linguistic sort. Best regards, -- Daniel Vérité https://postgresql.verite.pro/ Twitter: @DanielVerite
Re: Built-in CTYPE provider
On Wed, 2023-12-20 at 15:47 -0800, Jeremy Schneider wrote: > One other thing that comes to mind: how does the parser do case > folding > for relation names? Is that using OS-provided libc as of today? Or > did > we code it to use ICU if that's the DB default? I'm guessing libc, > and > global catalogs probably need to be handled in a consistent manner, > even > across different encodings.

The code is in downcase_identifier():

    /*
     * SQL99 specifies Unicode-aware case normalization, which we don't
     * yet have the infrastructure for...
     */
    if (ch >= 'A' && ch <= 'Z')
        ch += 'a' - 'A';
    else if (enc_is_single_byte && IS_HIGHBIT_SET(ch) && isupper(ch))
        ch = tolower(ch);
    result[i] = (char) ch;

My proposal would add the infrastructure that the comment above says is missing. It seems like we should be using the database collation at this point, because you don't want inconsistency between the catalogs and the parser here. Then again, the SQL spec doesn't seem to support tailoring of case conversions, so maybe we are avoiding it for that reason? Or maybe we're avoiding catalog access? Or perhaps the work for ICU just wasn't done here yet?

> (Kind of related... did you ever see the demo where I create a user > named > '' and then I try to connect to a database with non-unicode > encoding? > ...at least it seems to be able to walk the index without > decoding > strings to find other users - but the way these global catalogs work > scares me a little bit)

I didn't see that specific demo, but in general we seem to change between pg_wchar and Unicode code points too freely, so I'm not surprised that something went wrong. Regards, Jeff Davis
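For illustration, the ASCII-only folding rule in the snippet above can be sketched in Python (the function name is mine, and this models only the multibyte-encoding path, where the single-byte tolower() branch never fires):

```python
def downcase_identifier_ascii(name: str) -> str:
    # Only A-Z is folded; any non-ASCII character passes through
    # untouched -- the Unicode-aware normalization the comment in
    # downcase_identifier() says is still missing.
    return "".join(
        chr(ord(c) + (ord("a") - ord("A"))) if "A" <= c <= "Z" else c
        for c in name
    )

assert downcase_identifier_ascii("MyTable") == "mytable"
# An accented capital survives unchanged in a multibyte encoding:
assert downcase_identifier_ascii("Äpfel") == "Äpfel"
```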
Re: Built-in CTYPE provider
On Wed, 2023-12-20 at 16:29 -0800, Jeremy Schneider wrote: > found some more. here's my running list of everything user-facing I > see > in core PG code so far that might involve case: > > * upper/lower/initcap > * regexp_*() and *_REGEXP() > * ILIKE, operators ~* !~* ~~ !~~ ~~* !~~* > * citext + replace(), split_part(), strpos() and translate() > * full text search - everything is case folded > * unaccent? not clear to me whether CTYPE includes accent folding No, ctype has nothing to do with accents as far as I can tell. I don't know if I'm using the right terminology, but I think "case" is a variant of a character whereas "accent" is a modifier/mark, and the mark is a separate concept from the character itself. > * ltree > * pg_trgm > * core PG parser, case folding of relation names Let's separate it into groups. (1) Callers that use a collation OID or pg_locale_t: * collation & hashing * upper/lower/initcap * regex, LIKE, formatting * pg_trgm (which uses regexes) * maybe postgres_fdw, but might just be a passthrough * catalog cache (always uses DEFAULT_COLLATION_OID) * citext (always uses DEFAULT_COLLATION_OID, but probably shouldn't) (2) A long tail of callers that depend on what LC_CTYPE/LC_COLLATE are set to, or use ad-hoc ASCII-only semantics: * core SQL parser downcase_identifier() * callers of pg_strcasecmp() (DDL, etc.) * GUC name case folding * full text search ("mylocale = 0 /* TODO */") * a ton of stuff uses isspace(), isdigit(), etc. * various callers of tolower()/toupper() * some selfuncs.c stuff * ... Might have missed some places. The user impact of a new builtin provider would affect (1), but only for those actually using the provider. So there's no compatibility risk there, but it's good to understand what it will affect. We can, on a case-by-case basis, also consider using the new APIs I'm proposing for instances of (2). There would be some compatibility risk there for existing callers, and we'd have to consider whether it's worth it or not. 
Ideally, new callers would either use the new APIs or use the pg_ascii_* APIs. Regards, Jeff Davis
Re: Built-in CTYPE provider
On Wed, Dec 20, 2023 at 5:57 PM Jeff Davis wrote: > Those locales all have the same ctype behavior. Sigh. I keep getting confused about how that works... -- Robert Haas EDB: http://www.enterprisedb.com
Re: Built-in CTYPE provider
On 12/20/23 4:04 PM, Jeremy Schneider wrote: > On 12/20/23 3:47 PM, Jeremy Schneider wrote: >> On 12/5/23 3:46 PM, Jeff Davis wrote: >>> CTYPE, which handles character classification and upper/lowercasing >>> behavior, may be simpler than it first appears. We may be able to get >>> a net decrease in complexity by just building in most (or perhaps all) >>> of the functionality. >> >> I'll be honest, even though this is primarily about CTYPE and not >> collation, I still need to keep re-reading the initial email slowly to >> let it sink in and better understand it... at least for me, it's complex >> to reason through. >> >> I'm trying to make sure I understand clearly what the user impact/change >> is that we're talking about: after a little bit of brainstorming and >> looking through the PG docs, I'm actually not seeing much more than >> these two things you've mentioned here: the set of regexp_* functions PG >> provides, and these three generic functions. That alone doesn't seem >> highly concerning. > > I missed citext, which extends impact to replace(), split_part(), > strpos() and translate(). There are also the five *_REGEX() functions > from the SQL standard which I assume are just calling the PG functions. found some more. here's my running list of everything user-facing I see in core PG code so far that might involve case: * upper/lower/initcap * regexp_*() and *_REGEXP() * ILIKE, operators ~* !~* ~~ !~~ ~~* !~~* * citext + replace(), split_part(), strpos() and translate() * full text search - everything is case folded * unaccent? not clear to me whether CTYPE includes accent folding * ltree * pg_trgm * core PG parser, case folding of relation names -- http://about.me/jeremy_schneider
Re: Built-in CTYPE provider
On 12/20/23 3:47 PM, Jeremy Schneider wrote: > On 12/5/23 3:46 PM, Jeff Davis wrote: >> CTYPE, which handles character classification and upper/lowercasing >> behavior, may be simpler than it first appears. We may be able to get >> a net decrease in complexity by just building in most (or perhaps all) >> of the functionality. > > I'll be honest, even though this is primarily about CTYPE and not > collation, I still need to keep re-reading the initial email slowly to > let it sink in and better understand it... at least for me, it's complex > to reason through. > > I'm trying to make sure I understand clearly what the user impact/change > is that we're talking about: after a little bit of brainstorming and > looking through the PG docs, I'm actually not seeing much more than > these two things you've mentioned here: the set of regexp_* functions PG > provides, and these three generic functions. That alone doesn't seem > highly concerning. I missed citext, which extends impact to replace(), split_part(), strpos() and translate(). There are also the five *_REGEX() functions from the SQL standard which I assume are just calling the PG functions. I just saw the krb_caseins_users GUC, which reminds me that PLs also have their own case functions. And of course extensions. I'm not saying any of this is in scope for the change here, but I'm just trying to wrap my brain around all the places we've got CTYPE processing happening, to better understand the big picture. It might help tease out unexpected small glitches from changing one thing but not another one. -Jeremy -- http://about.me/jeremy_schneider
Re: Built-in CTYPE provider
On 12/5/23 3:46 PM, Jeff Davis wrote: > CTYPE, which handles character classification and upper/lowercasing > behavior, may be simpler than it first appears. We may be able to get > a net decrease in complexity by just building in most (or perhaps all) > of the functionality. > > === Character Classification === > > Character classification is used for regexes, e.g. whether a character > is a member of the "[[:digit:]]" ("\d") or "[[:punct:]]" > class. Unicode defines what character properties map into these > classes in TR #18 [1], specifying both a "Standard" variant and a > "POSIX Compatible" variant. The main difference with the POSIX variant > is that symbols count as punctuation. > > === LOWER()/INITCAP()/UPPER() === > > The LOWER() and UPPER() functions are defined in the SQL spec with > surprising detail, relying on specific Unicode General Category > assignments. How to map characters seems to be left (implicitly) up to > Unicode. If the input string is normalized, the output string must be > normalized, too. Weirdly, there's no room in the SQL spec to localize > LOWER()/UPPER() at all to handle issues like [1]. Also, the standard > specifies one example, which is that "ß" becomes "SS" when folded to > upper case. INITCAP() is not in the SQL spec. I'll be honest, even though this is primarily about CTYPE and not collation, I still need to keep re-reading the initial email slowly to let it sink in and better understand it... at least for me, it's complex to reason through. I'm trying to make sure I understand clearly what the user impact/change is that we're talking about: after a little bit of brainstorming and looking through the PG docs, I'm actually not seeing much more than these two things you've mentioned here: the set of regexp_* functions PG provides, and these three generic functions. That alone doesn't seem highly concerning. 
I haven't checked the source code for the regexp_* functions yet, but are these just passing through to an external library? Are we actually able to easily change the CTYPE provider for them? If nobody knows/replies then I'll find some time to look. One other thing that comes to mind: how does the parser do case folding for relation names? Is that using OS-provided libc as of today? Or did we code it to use ICU if that's the DB default? I'm guessing libc, and global catalogs probably need to be handled in a consistent manner, even across different encodings. (Kind of related... did you ever see the demo where I create a user named '' and then I try to connect to a database with non-unicode encoding? ...at least it seems to be able to walk the index without decoding strings to find other users - but the way these global catalogs work scares me a little bit) -Jeremy -- http://about.me/jeremy_schneider
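The "Standard" vs. "POSIX Compatible" [[:punct:]] distinction quoted above comes down to Unicode General Category: ! is Po while + is Sm, and the POSIX-compatible variant of TR #18 folds symbols into punctuation. Python's unicodedata module exposes the same UCD category data, so a rough sketch (my own illustration, not the regex engine's code):

```python
import unicodedata

# General Category drives the regex character classes in TR #18.
assert unicodedata.category("!") == "Po"  # punctuation proper
assert unicodedata.category("+") == "Sm"  # a math symbol, not punctuation

def is_punct(ch: str, posix_compatible: bool) -> bool:
    # Standard variant: only the P* categories count as punctuation.
    # POSIX-compatible variant: the S* (symbol) categories count too.
    cat = unicodedata.category(ch)
    return cat.startswith("P") or (posix_compatible and cat.startswith("S"))

assert is_punct("+", posix_compatible=True)
assert not is_punct("+", posix_compatible=False)
```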
Re: Built-in CTYPE provider
On Wed, 2023-12-20 at 14:24 -0500, Robert Haas wrote: > This makes sense to me, too, but it feels like it might work out > better for speakers of English than for speakers of other languages. There's very little in the way of locale-specific tailoring for ctype behaviors in ICU or glibc -- only for the 'az', 'el', 'lt', and 'tr' locales. While English speakers like us may benefit from being aligned with the default ctype behaviors, those behaviors are not at all specific to 'en' locales in ICU or glibc. Collation varies a lot more between locales. I wouldn't call memcmp ideal for English ('Zebra' comes before 'apple', which seems wrong to me). If memcmp sorting does favor any particular group, I would say it favors programmers more than English speakers. But that could just be my perspective and I certainly understand the point that memcmp ordering is more tolerable for some languages than others. > Right now, I tend to get databases that default to en_US.utf8, and if > the default changed to C.utf8, then the case-comparison behavior > might > be different. en_US.UTF-8 and C.UTF-8 have the same ctype behavior. > For > someone who is currently defaulting to es_ES.utf8 or fr_FR.utf8, a > change to C.utf8 would be a much bigger problem, I would think. Those locales all have the same ctype behavior. It turns out that en_US.UTF-8 and fr_FR.UTF-8 also have the same collation order -- no tailoring beyond root collation according to CLDR files for 'en' and 'fr' (though note that 'fr_CA' does have tailoring). That doesn't mean the experience of switching to memcmp order is exactly the same for a French speaker and an English speaker, but I think it's interesting.
This thread is about adding a builtin provider with builtin ctype, which I believe a lot of users would like. It also might be the best chance we have to get to a reasonable default behavior at some point in the future. It would always be available, fast, and stable, with better semantics than "C" for many locales, and we can document it. In any case, we don't need to decide that now. If the builtin provider is useful, we should do it.

Regards, Jeff Davis
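The memcmp ordering discussed above can be reproduced with a bytewise sort key (a Python sketch, standing in for a C memcmp comparison on UTF-8 strings):

```python
# memcmp-style ordering: compare the raw UTF-8 bytes.  All uppercase
# ASCII letters (0x41..0x5A) sort before all lowercase ones
# (0x61..0x7A), so 'Zebra' precedes 'apple'.
words = ["apple", "Zebra", "banana"]
print(sorted(words, key=lambda s: s.encode("utf-8")))
# ['Zebra', 'apple', 'banana']
```

This is the "tolerable but not linguistic" behavior being traded for stability and speed.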
Re: Built-in CTYPE provider
On Wed, Dec 20, 2023 at 2:13 PM Jeff Davis wrote:
> On Wed, 2023-12-20 at 13:49 +0100, Daniel Verite wrote:
> > If the Postgres default was bytewise sorting+locale-agnostic
> > ctype functions directly derived from Unicode data files,
> > as opposed to libc/$LANG at initdb time, the main
> > annoyance would be that "ORDER BY textcol" would no
> > longer be the human-favored sort.
> > For the presentation layer, we would have to write for instance
> > ORDER BY textcol COLLATE "unicode" for the root collation
> > or a specific region-country if needed.
> > But all the rest seems better, especially cross-OS compatibility,
> > truly immutable and faster indexes for fields that
> > don't require linguistic ordering, alignment between Unicode
> > updates and Postgres updates.
>
> Thank you, that summarizes exactly the compromise that I'm trying to
> reach.

This makes sense to me, too, but it feels like it might work out better for speakers of English than for speakers of other languages. Right now, I tend to get databases that default to en_US.utf8, and if the default changed to C.utf8, then the case-comparison behavior might be different but the letters would still sort in the right order. For someone who is currently defaulting to es_ES.utf8 or fr_FR.utf8, a change to C.utf8 would be a much bigger problem, I would think. Their alphabet isn't in code point order, and so things would be alphabetized wrongly. That might be OK if they don't care about ordering for any purpose other than equality lookups, but otherwise it's going to force them to change the default, where today they don't have to do that.

--
Robert Haas
EDB: http://www.enterprisedb.com
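The concern about alphabets that aren't in code point order can be illustrated with the classic French accent example (a Python sketch; plain sorted() compares code points, which matches what a memcmp/C-collation index would do):

```python
# Code point order vs. French dictionary order.  French expects
# cote < côte < coté < côté (accent differences weighted from the
# end of the word), but code point order compares 'o' (U+006F)
# against 'ô' (U+00F4) before it ever looks at the final letter:
words = ["côté", "cote", "coté", "côte"]
print(sorted(words))   # ['cote', 'coté', 'côte', 'côté']
# Getting the linguistic order back requires a real collation
# (e.g. ICU, or a COLLATE clause in the database).
```

So with a memcmp default, a French speaker would need an explicit COLLATE to get dictionary order, exactly the trade-off under discussion.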
Re: Built-in CTYPE provider
On Wed, 2023-12-20 at 13:49 +0100, Daniel Verite wrote:
> If the Postgres default was bytewise sorting+locale-agnostic
> ctype functions directly derived from Unicode data files,
> as opposed to libc/$LANG at initdb time, the main
> annoyance would be that "ORDER BY textcol" would no
> longer be the human-favored sort.
> For the presentation layer, we would have to write for instance
> ORDER BY textcol COLLATE "unicode" for the root collation
> or a specific region-country if needed.
> But all the rest seems better, especially cross-OS compatibility,
> truly immutable and faster indexes for fields that
> don't require linguistic ordering, alignment between Unicode
> updates and Postgres updates.

Thank you, that summarizes exactly the compromise that I'm trying to reach.

Regards, Jeff Davis
Re: Built-in CTYPE provider
Jeff Davis wrote:
> But there are a lot of users for whom neither of those things are
> true, and it makes zero sense to order all of the text indexes in the
> database according to any one particular locale. I think these users
> would prioritize stability and performance for the database
> collation, and then use COLLATE clauses with ICU collations where
> necessary.

+1

> I am also still concerned that we have the wrong defaults. Almost
> nobody thinks libc is a great provider, but that's the default, and
> there were problems trying to change that default to ICU in 16. If we
> had a builtin provider, that might be a better basis for a default
> (safe, fast, always available, and documentable). Then, at least if
> someone picks a different locale at initdb time, they would be doing
> so intentionally, rather than implicitly accepting index corruption
> risks based on an environment variable.

Yes. The introduction of the bytewise-sorting, locale-agnostic C.UTF-8 in glibc is also a step in the direction of providing better defaults for apps like Postgres, that need both long-term stability in sorts and Unicode coverage for ctype-dependent functions. But C.UTF-8 is not available everywhere, and there's still the problem that Unicode updates through libc are not aligned with Postgres releases.

ICU has the advantage of cross-OS compatibility, but it does not provide any collation with bytewise sorting like C or C.UTF-8, and we don't allow a combination like "C" for sorting and ICU for ctype operations. When opting for a locale provider, it has to be for both sorting and ctype, so an installation that needs cross-OS compatibility, good Unicode support and long-term stability of indexes cannot get that with ICU as we expose it today.
If the Postgres default was bytewise sorting+locale-agnostic ctype functions directly derived from Unicode data files, as opposed to libc/$LANG at initdb time, the main annoyance would be that "ORDER BY textcol" would no longer be the human-favored sort. For the presentation layer, we would have to write for instance ORDER BY textcol COLLATE "unicode" for the root collation or a specific region-country if needed.

But all the rest seems better, especially cross-OS compatibility, truly immutable and faster indexes for fields that don't require linguistic ordering, alignment between Unicode updates and Postgres updates.

Best regards,

--
Daniel Vérité
https://postgresql.verite.pro/
Twitter: @DanielVerite
Re: Built-in CTYPE provider
On Tue, 2023-12-19 at 15:59 -0500, Robert Haas wrote:
> FWIW, the idea that we're going to develop a built-in provider seems
> to be solid, for the reasons Jeff mentions: it can be stable, and
> under our control. But it seems like we might need built-in providers
> for everything rather than just CTYPE to get those advantages, and I
> fear we'll get sucked into needing a lot of tailoring rather than
> just being able to get by with one "vanilla" implementation.

For the database default collation, I suspect a lot of users would jump at the chance to have "vanilla" semantics. Tailoring is more important for individual collation objects than for the database-level collation.

There are reasons you might select a tailored database collation, like if the set of users accessing it are mostly from a single locale, or if the application connected to the database is expecting it in a certain form. But there are a lot of users for whom neither of those things are true, and it makes zero sense to order all of the text indexes in the database according to any one particular locale. I think these users would prioritize stability and performance for the database collation, and then use COLLATE clauses with ICU collations where necessary.

The question for me is how good the "vanilla" semantics need to be to be useful as a database-level collation. Most of the performance and stability problems come from collation, so it makes sense to me to provide a fast and stable memcmp collation paired with richer ctype semantics (as proposed here). Users who want something more probably want the Unicode "root" collation, which can be provided by ICU today.

I am also still concerned that we have the wrong defaults. Almost nobody thinks libc is a great provider, but that's the default, and there were problems trying to change that default to ICU in 16. If we had a builtin provider, that might be a better basis for a default (safe, fast, always available, and documentable).
Then, at least if someone picks a different locale at initdb time, they would be doing so intentionally, rather than implicitly accepting index corruption risks based on an environment variable. Regards, Jeff Davis
Re: Built-in CTYPE provider
On Mon, Dec 18, 2023 at 2:46 PM Jeff Davis wrote:
> The whole concept of "providers" is that they aren't consistent with
> each other. ICU, libc, and the builtin provider will all be based on
> different versions of Unicode. That's by design.
>
> The built-in provider will be a bit better in the sense that it's
> consistent with the normalization functions, and the other providers
> aren't.

FWIW, the idea that we're going to develop a built-in provider seems to be solid, for the reasons Jeff mentions: it can be stable, and under our control. But it seems like we might need built-in providers for everything rather than just CTYPE to get those advantages, and I fear we'll get sucked into needing a lot of tailoring rather than just being able to get by with one "vanilla" implementation.

--
Robert Haas
EDB: http://www.enterprisedb.com
Re: Built-in CTYPE provider
On Fri, 2023-12-15 at 16:30 -0800, Jeremy Schneider wrote:
> Looking closer, patches 3 and 4 look like an incremental extension of
> this earlier idea;

Yes, it's essentially the same thing extended to a few more files. I don't know if "incremental" is the right word though; this is a substantial extension of the idea.

> the perl scripts download data from unicode.org and
> we've specifically defined Unicode version 15.1 and the scripts turn
> the data tables inside-out into C data structures optimized for
> lookup. That C code is then checked in to the PostgreSQL source code
> files unicode_category.h and unicode_case_table.h - right?

Yes. The standard build process shouldn't be downloading files, so the static tables are checked in. Also, seeing the diffs of the static tables improves the visibility of changes in case there's some mistake or big surprise.

> Am I reading correctly that these two patches add C functions
> pg_u_prop_* and pg_u_is* (patch 3) and unicode_*case (patch 4) but we
> don't yet reference these functions anywhere? So this is just getting
> some plumbing in place?

Correct. Perhaps I should combine these into the builtin provider thread, but these are independently testable and reviewable.

> My prediction is that updating this built-in provider eventually
> won't be any different from ICU or glibc.

The built-in provider will have several advantages because it's tied to a PG major version:

* A physical replica can't have different semantics than the primary.
* Easier to document and test.
* Changes are more transparent and can be documented in the release notes, so that administrators can understand the risks and blast radius at pg_upgrade time.

> Later on down the road, from a user perspective, I think we should be
> careful about confusion where providers are used inconsistently.
> It's not great if one function follows built-in Unicode 15.1 rules
> but another function uses Unicode 13 rules because it happened to
> call an ICU function or a glibc function. We could easily end up with
> multiple providers processing different parts of a single SQL
> statement, which could lead to strange results in some cases.

The whole concept of "providers" is that they aren't consistent with each other. ICU, libc, and the builtin provider will all be based on different versions of Unicode. That's by design.

The built-in provider will be a bit better in the sense that it's consistent with the normalization functions, and the other providers aren't.

Regards, Jeff Davis
Re: Built-in CTYPE provider
On 12/13/23 5:28 AM, Jeff Davis wrote:
> On Tue, 2023-12-12 at 13:14 -0800, Jeremy Schneider wrote:
>> My biggest concern is around maintenance. Every year Unicode is
>> assigning new characters to existing code points, and those existing
>> code points can of course already be stored in old databases before
>> libs are updated.
>
> Is the concern only about unassigned code points?
>
> I already committed a function "unicode_assigned()" to test whether a
> string contains only assigned code points, which can be used in a
> CHECK() constraint. I also posted[5] an idea about a per-database
> option that could reject the storage of any unassigned code point,
> which would make it easier for users highly concerned about
> compatibility.

I didn't know about this. Did a few smoke tests against today's head on git and it's nice to see the function working as expected. :)

test=# select unicode_version();
 unicode_version
-----------------
 15.1

test=# select chr(3212), unicode_assigned(chr(3212));
 chr | unicode_assigned
-----+------------------
 ಌ   | t

-- unassigned code point inside an assigned block
test=# select chr(3213), unicode_assigned(chr(3213));
 chr | unicode_assigned
-----+------------------
     | f

test=# select chr(3214), unicode_assigned(chr(3214));
 chr | unicode_assigned
-----+------------------
 ಎ   | t

-- unassigned block
test=# select chr(67024), unicode_assigned(chr(67024));
 chr | unicode_assigned
-----+------------------
 א   | f

test=# select chr(67072), unicode_assigned(chr(67072));
 chr | unicode_assigned
-----+------------------
     | t

Looking closer, patches 3 and 4 look like an incremental extension of this earlier idea; the perl scripts download data from unicode.org and we've specifically defined Unicode version 15.1 and the scripts turn the data tables inside-out into C data structures optimized for lookup. That C code is then checked in to the PostgreSQL source code files unicode_category.h and unicode_case_table.h - right?
Am I reading correctly that these two patches add C functions pg_u_prop_* and pg_u_is* (patch 3) and unicode_*case (patch 4) but we don't yet reference these functions anywhere? So this is just getting some plumbing in place?

>> And we may end up with
>> something like the timezone database where we need to periodically
>> add a more current ruleset - albeit alongside as a new version in
>> this case.
>
> There's a build target "update-unicode" which is run to pull in new
> Unicode data files and parse them into static C arrays (we already do
> this for the Unicode normalization tables). So I agree that the
> tables should be updated but I don't understand why that's a problem.

I don't want to get stuck on this. I agree with the general approach of beginning to add a provider for locale functions inside the database. We have a while before Unicode 16 comes out. Plenty of time for bikeshedding.

My prediction is that updating this built-in provider eventually won't be any different from ICU or glibc. It depends a bit on how we specifically build on this plumbing - but when Unicode 16 comes out, I'll try to come up with a simple repro on a default DB config where changing the Unicode version causes corruption (it was pretty easy to demonstrate for ICU collation, if you knew where to look)... but I don't think that discussion should derail this commit, because for now we're just starting the process of getting Unicode 15.1 into the PostgreSQL code base. We can cross the "update" bridge when we come to it.

Later on down the road, from a user perspective, I think we should be careful about confusion where providers are used inconsistently. It's not great if one function follows built-in Unicode 15.1 rules but another function uses Unicode 13 rules because it happened to call an ICU function or a glibc function. We could easily end up with multiple providers processing different parts of a single SQL statement, which could lead to strange results in some cases.
Ideally a user just specifies a default provider for their database, and the rules for that version of Unicode are used as consistently as possible - unless a user explicitly overrides their choice in a table/column definition, query, etc. But it might take a little time and work to get to this point.

-Jeremy

--
http://about.me/jeremy_schneider
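As a rough cross-check of what unicode_assigned() tests, the same classification can be reproduced from Unicode's General Category data (a Python sketch using the interpreter's bundled tables, which may track a different Unicode version than PostgreSQL's; the helper name `assigned` is just for illustration):

```python
import unicodedata

def assigned(s: str) -> bool:
    # True if no code point in s is unassigned (General Category Cn).
    # Uses the Unicode version bundled with this Python build, which
    # may differ from the version PostgreSQL's tables are built from.
    return all(unicodedata.category(ch) != "Cn" for ch in s)

print(assigned(chr(0x0C8C)))   # True:  KANNADA LETTER VOCALIC L
print(assigned(chr(0x0C8D)))   # False: gap inside the Kannada block
print(assigned(chr(0x0C8E)))   # True:  KANNADA LETTER E
```

These are the same code points (3212-3214) exercised in the psql smoke test above.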
Re: Built-in CTYPE provider
On Wed, 2023-12-13 at 16:34 +0100, Daniel Verite wrote:
> In particular "el" (modern greek) has case mapping rules that
> ICU seems to implement, but "el" is missing from the list
> ("lt", "tr", and "az") you identified.

I compared with glibc el_GR.UTF-8 and el_CY.UTF-8 locales, and the ctype semantics match C.UTF-8 for all code points. glibc is not doing this additional tailoring for "el".

Therefore I believe the builtin CTYPE would be very useful for case mapping (both "simple" and "full") even without this additional tailoring. You are correct that ICU will still have some features that won't be supported by the builtin provider. Better word boundary semantics in INITCAP() are another advantage.

Regards, Jeff Davis
Re: Built-in CTYPE provider
On Wed, 2023-12-13 at 16:34 +0100, Daniel Verite wrote: > But there are CLDR mappings on top of that. I see, thank you. Would it still be called "full" case mapping to only use the mappings in SpecialCasing.txt? And would that be useful? Regards, Jeff Davis
Re: Built-in CTYPE provider
Jeff Davis wrote:
> While "full" case mapping sounds more complex, there are actually
> very few cases to consider and they are covered in another (small)
> data file. That data file covers ~100 code points that convert to
> multiple code points when the case changes (e.g. "ß" -> "SS"), 7
> code points that have context-sensitive mappings, and three locales
> which have special conversions ("lt", "tr", and "az") for a few code
> points.

But there are CLDR mappings on top of that. According to the Unicode FAQ, https://unicode.org/faq/casemap_charprop.html#5:

    Q: Does the default case mapping work for every language? What
    about the default case folding?
    [...]
    To make case mapping language sensitive, the Unicode Standard
    specifically allows implementations to tailor the mappings for
    each language, but does not provide the necessary data. The file
    SpecialCasing.txt is included in the Standard as a guide to a few
    of the more important individual character mappings needed for
    specific languages, notably the Greek script and the Turkic
    languages. However, for most language-specific mappings and
    tailoring, users should refer to CLDR and other resources.

In particular "el" (modern greek) has case mapping rules that ICU seems to implement, but "el" is missing from the list ("lt", "tr", and "az") you identified.

The CLDR case mappings seem to be found in https://github.com/unicode-org/cldr/tree/main/common/transforms in *-Lower.xml and *-Upper.xml

Best regards,

--
Daniel Vérité
https://postgresql.verite.pro/
Twitter: @DanielVerite
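The SpecialCasing.txt behaviors under discussion -- expansion and context sensitivity, before any CLDR tailoring -- can be observed in any implementation of the default full case mapping; for example, Python's str methods (illustrative only, not PostgreSQL code):

```python
# Default (untailored) full case mapping, per SpecialCasing.txt, as
# implemented by Python's str methods.

# Expansion: one code point can become several on case change.
print("ß".upper())              # SS
print("\ufb01".upper())         # FI (U+FB01 LATIN SMALL LIGATURE FI)

# Context sensitivity: capital sigma lowercases to final sigma at the
# end of a word, and to the ordinary sigma elsewhere.
print("\u0391\u03a3".lower())   # ας (ends in final sigma, U+03C2)
print("\u03a3".lower())         # σ  (U+03C3)
```

Everything beyond this, such as the Greek uppercasing that strips accents, lives in CLDR tailoring rather than SpecialCasing.txt.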
Re: Built-in CTYPE provider
On Tue, 2023-12-12 at 13:14 -0800, Jeremy Schneider wrote:
> My biggest concern is around maintenance. Every year Unicode is
> assigning new characters to existing code points, and those existing
> code points can of course already be stored in old databases before
> libs are updated.

Is the concern only about unassigned code points?

I already committed a function "unicode_assigned()" to test whether a string contains only assigned code points, which can be used in a CHECK() constraint. I also posted[5] an idea about a per-database option that could reject the storage of any unassigned code point, which would make it easier for users highly concerned about compatibility.

> And we may end up with
> something like the timezone database where we need to periodically
> add a more current ruleset - albeit alongside as a new version in
> this case.

There's a build target "update-unicode" which is run to pull in new Unicode data files and parse them into static C arrays (we already do this for the Unicode normalization tables). So I agree that the tables should be updated but I don't understand why that's a problem.

> If I'm reading the Unicode 15 update correctly, PostgreSQL regex
> expressions with [[:digit:]] will not correctly identify Kaktovik or
> Nag Mundari or Kawi digits without that update to character type
> specs.

Yeah, if we are behind in the Unicode version, then results won't be the most up-to-date. But ICU or libc could also be behind in the Unicode version.

> But let's remember that people like to build indexes on character
> classification functions like upper/lower, for case insensitive
> searching.

UPPER()/LOWER() are based on case mapping, not character classification. I intend to introduce a SQL-level CASEFOLD() function that would obey Unicode casefolding rules, which have very strong compatibility guarantees[6] (essentially, if you are only using assigned code points, you are fine).
> It's another case where the index will be corrupted if
> someone happened to store Latin Glottal vowels in their database and
> then we update libs to the latest character type rules.

I don't agree with this characterization at all.

(a) It's not "another case". Corruption of an index on LOWER() can happen today. My proposal makes the situation better, not worse.

(b) These aren't libraries; I am proposing built-in Unicode tables that only get updated in a new major PG version.

(c) It likely only affects a small number of indexes and it's easier for an administrator to guess which ones might be affected, making it easier to just rebuild those indexes.

(d) It's not a problem if you stick to assigned code points.

> So even with something as basic as character type, if we're going to
> do it right, we still need to either version it or definitively
> decide that we're not going to ever support newly added Unicode
> characters like Latin Glottals.

If, by "version it", you mean "update the data tables in new Postgres versions", then I agree. If you mean that one PG version would need to support many versions of Unicode, I don't agree.

Regards, Jeff Davis

[5] https://postgr.es/m/c5e9dac884332824e0797937518da0b8766c1238.ca...@j-davis.com
[6] https://www.unicode.org/policies/stability_policy.html#Case_Folding
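For reference, case folding -- the operation a CASEFOLD() function would expose -- is distinct from LOWER()/UPPER(), and its output is designed for caseless equality rather than display. A sketch of the difference using Python's casefold(), which implements Unicode default case folding (illustrative; not the proposed PostgreSQL function):

```python
# lower() and casefold() differ for characters like 'ß', which folds
# to 'ss'; folding is the right basis for caseless equality.
print("Straße".lower())        # straße
print("STRASSE".lower())       # strasse
print("Straße".lower() == "STRASSE".lower())        # False
print("Straße".casefold() == "STRASSE".casefold())  # True ('strasse')
```

The folding of assigned code points is covered by the Unicode stability policy cited in [6], which is what makes an index on such a function comparatively safe across Unicode updates.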
Re: Built-in CTYPE provider
On 12/5/23 3:46 PM, Jeff Davis wrote: > === Character Classification === > > Character classification is used for regexes, e.g. whether a character > is a member of the "[[:digit:]]" ("\d") or "[[:punct:]]" > class. Unicode defines what character properties map into these > classes in TR #18 [1], specifying both a "Standard" variant and a > "POSIX Compatible" variant. The main difference with the POSIX variant > is that symbols count as punctuation. > > === LOWER()/INITCAP()/UPPER() === > > The LOWER() and UPPER() functions are defined in the SQL spec with > surprising detail, relying on specific Unicode General Category > assignments. How to map characters seems to be left (implicitly) up to > Unicode. If the input string is normalized, the output string must be > normalized, too. Weirdly, there's no room in the SQL spec to localize > LOWER()/UPPER() at all to handle issues like [1]. Also, the standard > specifies one example, which is that "ß" becomes "SS" when folded to > upper case. INITCAP() is not in the SQL spec. > > === Questions === > > * Is a built-in ctype provider a reasonable direction for Postgres as > a project? > * Does it feel like it would be simpler or more complex than what > we're doing now? > * Do we want to just try to improve our ICU support instead? > * Do we want the built-in provider to be one thing, or have a few > options (e.g. "standard" or "posix" character classification; > "simple" or "full" case mapping)? Generally, I am in favor of this - I think we need to move in the direction of having an in-database option around unicode for PG users, given how easy it is for administrators to mis-manage dependencies. Especially when OS admins can be different from DB admins, and when nobody really understands risks of changing libs with in-place moves to new operating systems - except for like 4 of us on the mailing lists. My biggest concern is around maintenance. 
Every year Unicode is assigning new characters to existing code points, and those existing code points can of course already be stored in old databases before libs are updated. When users start to notice that regex [[:digit:]] or upper/lower functions aren't working correctly with characters in their DB, they'll probably come asking for fixes. And we may end up with something like the timezone database where we need to periodically add a more current ruleset - albeit alongside as a new version in this case.

Here are direct links to charts of newly assigned characters from the last few Unicode updates:

2022: https://www.unicode.org/charts/PDF/Unicode-15.0/
2021: https://www.unicode.org/charts/PDF/Unicode-14.0/
2020: https://www.unicode.org/charts/PDF/Unicode-13.0/
2019: https://www.unicode.org/charts/PDF/Unicode-12.0/

If I'm reading the Unicode 15 update correctly, PostgreSQL regex expressions with [[:digit:]] will not correctly identify Kaktovik or Nag Mundari or Kawi digits without that update to character type specs. If I'm reading the Unicode 12 update correctly, then upper/lower functions aren't going to work correctly on Latin Glottal A and I and U characters without that update to character type specs.

Overall I see a lot fewer Unicode updates involving upper/lower than I do with digits - especially since new scripts often involve their own numbering characters, which makes new digits more common. But let's remember that people like to build indexes on character classification functions like upper/lower, for case insensitive searching. It's another case where the index will be corrupted if someone happened to store Latin Glottal vowels in their database and then we update libs to the latest character type rules.

So even with something as basic as character type, if we're going to do it right, we still need to either version it or definitively decide that we're not going to ever support newly added Unicode characters like Latin Glottals.
-Jeremy -- http://about.me/jeremy_schneider
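The [[:digit:]] versioning concern above is easy to reproduce outside PostgreSQL: any regex engine's \d is pinned to whatever Unicode tables its runtime ships. A Python sketch (the Kawi check is printed rather than asserted because its result depends on the interpreter's Unicode version):

```python
import re
import unicodedata

# In Python's re module, \d matches any code point whose General
# Category is Nd, per the Unicode tables bundled with the interpreter.
print(bool(re.fullmatch(r"\d", "3")))                  # True
print(bool(re.fullmatch(r"\d", "\u0663")))             # True: ARABIC-INDIC DIGIT THREE
print(bool(re.fullmatch(r"\d", "\u0663", re.ASCII)))   # False: ASCII-only mode

# Whether the Kawi digits added in Unicode 15.0 (U+11F50..U+11F59)
# count as \d depends on the runtime's Unicode version:
kawi_zero = "\U00011F50"
print(unicodedata.unidata_version, bool(re.fullmatch(r"\d", kawi_zero)))
```

The same dependency applies to PostgreSQL regardless of provider; the question in this thread is only whose tables (libc's, ICU's, or built-in ones tied to the PG major version) answer the classification query.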