patch suggestion: Fix citext_utf8 test's "Turkish I" with ICU collation provider

Anton Voloshin Fri, 21 Oct 2022 10:23:51 -0700

Hello, hackers.

In current master, as well as in REL_15_STABLE, installcheck in contrib/citext fails in most locales if we use ICU as the locale provider:

$ rm -fr data; initdb --locale-provider icu --icu-locale en-US -D data && pg_ctl -D data -l logfile start && make -C contrib/citext installcheck; pg_ctl -D data stop; cat contrib/citext/regression.diffs

...
test citext                       ... ok          457 ms
test citext_utf8                  ... FAILED       21 ms
...

diff -u /home/ashutosh/pg/REL_15_STABLE/contrib/citext/expected/citext_utf8.out /home/ashutosh/pg/REL_15_STABLE/contrib/citext/results/citext_utf8.out
--- /home/ashutosh/pg/REL_15_STABLE/contrib/citext/expected/citext_utf8.out	2022-07-14 17:45:31.747259743 +0300
+++ /home/ashutosh/pg/REL_15_STABLE/contrib/citext/results/citext_utf8.out	2022-10-21 19:43:21.146044062 +0300
@@ -54,7 +54,7 @@
 SELECT 'i'::citext = 'İ'::citext AS t;
  t
 ---
- t
+ f
 (1 row)

The reason is that in ICU, lowercasing the Unicode character "İ" (U+0130
"LATIN CAPITAL LETTER I WITH DOT ABOVE") can give two valid results,
depending on the locale:
- "i", i.e. "U+0069 LATIN SMALL LETTER I" in "tr" and "az" locales.
- "i̇", i.e. "U+0069 LATIN SMALL LETTER I" followed by "U+0307 COMBINING
  DOT ABOVE" in all other locales I've tried (including "en-US", "de",
  "ru", etc).

And the way this test is currently written, it accepts only the plain Latin "i", which might be correct with glibc, but is not so with ICU. I verified this on ICU 70.1, but I've seen it on a few other ICU versions as well, so I think this is probably an ICU feature, not a bug(?).


Since we probably want installcheck in contrib/citext to pass on
databases with various locales, including reasonable ICU-based ones,
I suggest fixing this test by accepting either output as valid.

I can see two ways of doing that:
1. change SQL in the test to use "IN" instead of "=";
2. add an alternative output.

I think in this case "IN" is better, because that allows a single comment to address both possible outputs and avoids unnecessary duplication.
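
A rough Python model of why "IN" passes under either mapping, for illustration only. It assumes citext equality behaves like comparing the lowercased strings; the helper names (lower_default, lower_turkic_sketch, citext_eq, citext_in) are hypothetical and not part of PostgreSQL or ICU, and lower_turkic_sketch handles only the one character relevant here, not the full Turkic mapping.

```python
from typing import Callable, Iterable

CAPITAL_DOTTED_I = "\u0130"   # "İ"
PLAIN_I = "i"                 # U+0069
DOTTED_I = "i\u0307"          # U+0069 U+0307

def lower_default(s: str) -> str:
    """Locale-independent lowercasing (stand-in for non-Turkic locales)."""
    return s.lower()

def lower_turkic_sketch(s: str) -> str:
    """Hypothetical stand-in for "tr"/"az": only maps U+0130 to plain "i"."""
    return s.replace(CAPITAL_DOTTED_I, PLAIN_I).lower()

def citext_eq(a: str, b: str, lower: Callable[[str], str]) -> bool:
    """Model of citext '=': compare the lowercased forms."""
    return lower(a) == lower(b)

def citext_in(value: str, candidates: Iterable[str],
              lower: Callable[[str], str]) -> bool:
    """Model of citext 'IN': true if any candidate compares equal."""
    return any(citext_eq(value, c, lower) for c in candidates)

results = {}
for name, lower in (("default", lower_default), ("turkic", lower_turkic_sketch)):
    old_test = citext_eq(PLAIN_I, CAPITAL_DOTTED_I, lower)
    new_test = citext_in(CAPITAL_DOTTED_I, [PLAIN_I, DOTTED_I], lower)
    results[name] = (old_test, new_test)
    print(name, results[name])
```

In this model the "=" form is true only under the Turkic mapping, while the "IN" form is true under both, which is what lets a single expected output cover both cases.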

I've attached a patch authored mostly by my colleague, Roman Zharkov, as one possible fix.


Only versions 15+ are affected.

Any comments?

--
Anton Voloshin
Postgres Professional, The Russian Postgres Company
https://postgrespro.ru

From 5cfbb59a11d099fa9b8703502fd8aac936a02c4d Mon Sep 17 00:00:00 2001
From: Roman Zharkov <r.zhar...@postgrespro.ru>
Date: Fri, 21 Oct 2022 19:56:19 +0300
Subject: [PATCH] Fix citext_utf8 test's "Turkish I" with ICU collation
 provider
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

With the ICU collation provider the Turkish unicode symbol "İ" (U+0130
"LATIN CAPITAL LETTER I WITH DOT ABOVE") has two lowercase variants:
- "i", i.e. "U+0069 LATIN SMALL LETTER I", in "tr" and "az" locales.
- "i̇", i.e. "U+0069 LATIN SMALL LETTER I" followed by "U+0307 COMBINING
  DOT ABOVE" in all other locales I've tried (including "en-US", "de",
  "ru", etc).

So, add both variants to the test.
---
 contrib/citext/expected/citext_utf8.out | 6 ++++--
 contrib/citext/sql/citext_utf8.sql      | 6 ++++--
 2 files changed, 8 insertions(+), 4 deletions(-)

diff --git a/contrib/citext/expected/citext_utf8.out b/contrib/citext/expected/citext_utf8.out
index 666b07ccec4..62f0794028f 100644
--- a/contrib/citext/expected/citext_utf8.out
+++ b/contrib/citext/expected/citext_utf8.out
@@ -48,10 +48,12 @@ SELECT 'Ä'::citext <> 'Ä'::citext AS t;
  t
 (1 row)
 
--- Test the Turkish dotted I. The lowercase is a single byte while the
+-- Test the Turkish dotted I. The lowercase might be a single byte while the
 -- uppercase is multibyte. This is why the comparison code can't be optimized
 -- to compare string lengths.
-SELECT 'i'::citext = 'İ'::citext AS t;
+-- Note that lower('İ') is 'i' (U+0069) in tr and az locales,
+-- but 'i̇' (U+0069 U+0307) in C and most (all?) other locales.
+SELECT 'İ'::citext in ('i'::citext, 'i̇'::citext) AS t;
  t 
 ---
  t
diff --git a/contrib/citext/sql/citext_utf8.sql b/contrib/citext/sql/citext_utf8.sql
index d068000b423..942daa9ce50 100644
--- a/contrib/citext/sql/citext_utf8.sql
+++ b/contrib/citext/sql/citext_utf8.sql
@@ -24,10 +24,12 @@ SELECT 'À'::citext <> 'B'::citext AS t;
 SELECT 'Ä'::text   <> 'Ä'::text   AS t;
 SELECT 'Ä'::citext <> 'Ä'::citext AS t;
 
--- Test the Turkish dotted I. The lowercase is a single byte while the
+-- Test the Turkish dotted I. The lowercase might be a single byte while the
 -- uppercase is multibyte. This is why the comparison code can't be optimized
 -- to compare string lengths.
-SELECT 'i'::citext = 'İ'::citext AS t;
+-- Note that lower('İ') is 'i' (U+0069) in tr and az locales,
+-- but 'i̇' (U+0069 U+0307) in C and most (all?) other locales.
+SELECT 'İ'::citext in ('i'::citext, 'i̇'::citext) AS t;
 
 -- Regression.
 SELECT 'láska'::citext <> 'laská'::citext AS t;
-- 
2.38.1
