Re: encoding affects ICU regex character classification

2023-12-18 Thread Jeff Davis
On Fri, 2023-12-15 at 16:48 -0800, Jeremy Schneider wrote:
> This goes back to my other thread (which sadly got very little
> discussion): PostgreSQL really needs to be safe by /default/

Doesn't a built-in provider help create a safer option?

The built-in provider's version of Unicode will be consistent with
unicode_assigned(), which is a first step toward rejecting code points
that the provider doesn't understand. And by rejecting unassigned code
points, we get all kinds of Unicode compatibility guarantees that avoid
the kinds of change risks that you are worried about.
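
For example, a sketch of how that could be used (this assumes the
unicode_assigned() function mentioned above; the table and column are
hypothetical, shown only for illustration):

  -- U+0378 is an unassigned code point as of Unicode 15.1
  SELECT unicode_assigned('abc');        -- true
  SELECT unicode_assigned(U&'ab\0378');  -- false

  -- reject unassigned code points at the boundary (hypothetical table)
  ALTER TABLE documents
    ADD CONSTRAINT documents_assigned_only CHECK (unicode_assigned(body));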

Regards,
Jeff Davis





Re: encoding affects ICU regex character classification

2023-12-15 Thread Thomas Munro
On Sat, Dec 16, 2023 at 1:48 PM Jeremy Schneider wrote:
> On 12/14/23 7:12 AM, Jeff Davis wrote:
> > The concern over unassigned code points is misplaced. The application
> > may be aware of newly-assigned code points, and there's no way they
> > will be mapped correctly in Postgres if the provider is not aware of
> > those code points. The user can either proceed in using unassigned code
> > points and accept the risk of future changes, or wait for the provider
> > to be upgraded.
>
> This does not seem to me like a good way to view the situation.
>
> Earlier this summer, a day or two after writing a document, I was
> completely surprised to open it on my work computer and see "unknown
> character" boxes. When I had previously written the document on my home
> computer and when I had viewed it from my cell phone, everything was
> fine. Apple does a very good job of always keeping iPhones and macOS
> versions up-to-date with the latest versions of Unicode and latest
> characters. iPhone keyboards make it very easy to access any character.
> Emojis are the canonical example here. My work computer was one major
> version of macOS behind my home computer.

That "SQUARE ERA NAME REIWA" codepoint we talked about in one of the
multi-version ICU threads was an interesting case study.  It's not an
emoji, it entered real/serious use suddenly, landed in a quickly
wrapped minor release of Unicode, and then arrived in locale
definitions via regular package upgrades on various OSes AFAICT (ie
didn't require a major version upgrade of the OS).

https://en.wikipedia.org/wiki/Reiwa_era#Announcement
https://en.wikipedia.org/wiki/Reiwa_era#Technology
https://unicode.org/versions/Unicode12.1.0/




Re: encoding affects ICU regex character classification

2023-12-15 Thread Jeremy Schneider
On 12/14/23 7:12 AM, Jeff Davis wrote:
> The concern over unassigned code points is misplaced. The application
> may be aware of newly-assigned code points, and there's no way they
> will be mapped correctly in Postgres if the provider is not aware of
> those code points. The user can either proceed in using unassigned code
> points and accept the risk of future changes, or wait for the provider
> to be upgraded.

This does not seem to me like a good way to view the situation.

Earlier this summer, a day or two after writing a document, I was
completely surprised to open it on my work computer and see "unknown
character" boxes. When I had previously written the document on my home
computer and when I had viewed it from my cell phone, everything was
fine. Apple does a very good job of always keeping iPhones and macOS
versions up-to-date with the latest versions of Unicode and latest
characters. iPhone keyboards make it very easy to access any character.
Emojis are the canonical example here. My work computer was one major
version of macOS behind my home computer.

And I'm probably one of the few people on this hackers email list who
even understands what the words "unassigned code point" mean. Generally,
the DBAs, sysadmins, architects, and developers who are all part of the
tangled web of building and maintaining systems that use PostgreSQL on
the backend are never going to think about Unicode characters proactively.

This goes back to my other thread (which sadly got very little
discussion): PostgreSQL really needs to be safe by /default/ ... having
GUCs is fine though; we can put an explanation in the docs about what
users should consider if they change a setting.

-Jeremy


-- 
http://about.me/jeremy_schneider





Re: encoding affects ICU regex character classification

2023-12-14 Thread Jeff Davis
On Tue, 2023-12-12 at 14:35 -0800, Jeremy Schneider wrote:
> Is someone able to test out upper & lower functions on U+A7BA ...
> U+A7BF
> across a few libs/versions?

Those code points are unassigned in Unicode 11.0 and assigned in
Unicode 12.0.

In ICU 63-2 (based on Unicode 11.0), they just get mapped to
themselves. In ICU 64-2 (based on Unicode 12.1) they get mapped the
same way the builtin CTYPE maps them (based on Unicode 15.1).
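
A query along these lines shows the difference; the collation name and
the results depend on the installed ICU version:

  -- U+A7BA/U+A7BB: LATIN CAPITAL/SMALL LETTER GLOTTAL A, added in Unicode 12.0
  SELECT lower(U&'\A7BA' COLLATE "en-x-icu") = U&'\A7BB';
  -- ICU 64+ (Unicode 12+): true, the new case pair is known
  -- ICU 63  (Unicode 11):  false, the code point just maps to itself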

The concern over unassigned code points is misplaced. The application
may be aware of newly-assigned code points, and there's no way they
will be mapped correctly in Postgres if the provider is not aware of
those code points. The user can either proceed in using unassigned code
points and accept the risk of future changes, or wait for the provider
to be upgraded.

If the user doesn't have many expression indexes dependent on ctype
behavior, it doesn't matter much. If they do have such indexes, the
best we can offer is a controlled process, and the builtin provider
allows the most visibility and control.
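
To make that concrete, this is the kind of object I mean (table and
column names are hypothetical):

  -- an expression index whose contents depend on case-mapping behavior
  CREATE INDEX users_email_lower_idx ON users (lower(email));
  -- if the provider's case mapping changes across an upgrade, the stored
  -- values may no longer match what lower() now returns, so:
  REINDEX INDEX users_email_lower_idx;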

(Aside: case mapping has very strong compatibility guarantees, but not
perfect. For better compatibility guarantees, we should support case
folding.)

> And I have no idea if or when
> glibc might have picked up the new Unicode characters.

That's a strong argument in favor of a builtin provider.

Regards,
Jeff Davis





Re: encoding affects ICU regex character classification

2023-12-12 Thread Jeremy Schneider
On 12/12/23 1:39 PM, Jeff Davis wrote:
> On Sun, 2023-12-10 at 10:39 +1300, Thomas Munro wrote:
>> Unless you also
>> implement built-in case mapping, you'd still have to call libc or ICU
>> for that, right?
> 
> We can do built-in case mapping, see:
> 
> https://postgr.es/m/ff4c2f2f9c8fc7ca27c1c24ae37ecaeaeaff6b53.ca...@j-davis.com
> 
>>   It seems a bit strange to use different systems for
>> classification and mapping.  If you do implement mapping too, you
>> have
>> to decide if you believe it is language-dependent or not, I think?
> 
> A complete solution would need to do the language-dependent case
> mapping. But that seems to only be 3 locales ("az", "lt", and "tr"),
> and only a handful of mapping changes, so we can handle that with the
> builtin provider as well.

This thread has me second-guessing the reply I just sent on the other
thread.

Is someone able to test out upper & lower functions on U+A7BA ... U+A7BF
across a few libs/versions?  Theoretically the upper/lower behavior
should change in ICU between Ubuntu 18.04 LTS and Ubuntu 20.04 LTS
(specifically in ICU 64 / Unicode 12).  And I have no idea if or when
glibc might have picked up the new Unicode characters.
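
Something along these lines is what I have in mind, assuming a UTF-8
database and the predefined "en-x-icu" collation (swap in a libc
collation to compare):

  SELECT c,
         upper(c COLLATE "en-x-icu") AS upper,
         lower(c COLLATE "en-x-icu") AS lower
    FROM unnest(ARRAY[U&'\A7BA', U&'\A7BB', U&'\A7BC',
                      U&'\A7BD', U&'\A7BE', U&'\A7BF']) AS t(c);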

-Jeremy


-- 
http://about.me/jeremy_schneider





Re: encoding affects ICU regex character classification

2023-12-12 Thread Jeff Davis
On Sun, 2023-12-10 at 10:39 +1300, Thomas Munro wrote:

> 
> How would you specify what you want?

One proposal would be to have a builtin collation provider:

https://postgr.es/m/9d63548c4d86b0f820e1ff15a83f93ed9ded4543.ca...@j-davis.com

I don't think there are very many ctype options, but they could be
specified as part of the locale, or perhaps even as some provider-
specific options specified at CREATE COLLATION time.
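
As a sketch only (the syntax here is hypothetical and would be settled
as part of that proposal):

  -- a collation using the builtin provider; ctype behavior is selected
  -- via the locale string
  CREATE COLLATION builtin_unicode (
    provider = builtin,
    locale = 'C.UTF-8'
  );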

> As with collating, I like the
> idea of keeping support for libc even if it is terrible (some libcs
> more than others) and eventually not the default, because I think
> optional agreement with other software on the same host is a feature.

Of course we should keep the libc support around. I'm not sure how
relevant such a feature is, but I don't think we actually have to
remove it.

> Unless you also
> implement built-in case mapping, you'd still have to call libc or ICU
> for that, right?

We can do built-in case mapping, see:

https://postgr.es/m/ff4c2f2f9c8fc7ca27c1c24ae37ecaeaeaff6b53.ca...@j-davis.com

>   It seems a bit strange to use different systems for
> classification and mapping.  If you do implement mapping too, you
> have
> to decide if you believe it is language-dependent or not, I think?

A complete solution would need to do the language-dependent case
mapping. But that seems to only be 3 locales ("az", "lt", and "tr"),
and only a handful of mapping changes, so we can handle that with the
builtin provider as well.
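
For example, the Turkish dotted/dotless i, which "full" case mapping
already handles today through ICU (this assumes the predefined
"tr-x-icu" collation exists):

  SELECT upper('i' COLLATE "tr-x-icu");  -- 'İ' (U+0130), not 'I'
  SELECT lower('I' COLLATE "tr-x-icu");  -- 'ı' (U+0131), not 'i'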

> Hmm, let's see what we're doing now... for ICU the regex code is
> using
> "simple" case mapping functions like u_toupper(c) that don't take a
> locale, so no Turkish i/İ conversion for you, unlike our SQL
> upper()/lower(), which this is supposed to agree with according to
> the
> comments at the top.  I see why: POSIX can only do one-by-one
> character mappings (which cannot handle Greek's context-sensitive
> Σ->σ/ς or German's multi-character ß->SS)

Regexes are inherently character-by-character, so transformations like
ß->SS are not going to work for case-insensitive regex matching
regardless of the provider.
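
For instance (the upper() result assumes an ICU collation, where full
case mapping applies):

  SELECT upper('ß' COLLATE "en-x-icu");  -- 'SS': full mapping is one-to-many
  SELECT 'SS' ~* 'ß';                    -- false: regexes only expand 1:1 case variants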

Σ->σ/ς does make sense, and what we have seems to be just broken:

  select 'ς' ~* 'Σ'; -- false in both libc and ICU
  select 'Σ' ~* 'ς'; -- true in both libc and ICU

Similarly for titlecase variants:

  select 'Dž' ~* 'dž'; -- false in libc and ICU
  select 'dž' ~* 'Dž'; -- true in libc and ICU

If we do the case mapping ourselves, we can make those work. We'd just
have to modify the APIs a bit so that allcases() can actually get all
of the case variants, rather than relying on just towupper/towlower.


Regards,
Jeff Davis





Re: encoding affects ICU regex character classification

2023-12-09 Thread Thomas Munro
On Sat, Dec 2, 2023 at 9:49 AM Jeff Davis wrote:
> Your definition is too wide in my opinion, because it mixes together
> different sources of variation that are best left separate:
>  a. region/language
>  b. technical requirements
>  c. versioning
>  d. implementation variance
>
> (a) is not a true source of variation (please correct me if I'm wrong)
>
> (b) is perhaps interesting. The "C" locale is one example, and perhaps
> there are others, but I doubt very many others that we want to support.
>
> (c) is not a major concern in my opinion. The impact of Unicode changes
> is usually not dramatic, and it only affects regexes so it's much more
> contained than collation, for example. And if you really care, just use
> the "C" locale.
>
> (d) is mostly a bug

I get you.  I was mainly commenting on what POSIX APIs allow, which is
much wider than what you might observe on any given system, and also
end-user-customisable.  But I agree that Unicode is all-pervasive and
authoritative in practice, to the point that if your libc disagrees
with it, it's probably just wrong.  (I guess site-local locales were
essential for bootstrapping in the early days of computers in a
language/territory but I can't find much discussion of the tools being
used by non-libc-maintainers today.)

> I think we only need 2 main character classification schemes: "C" and
> Unicode (TR #18 Compatibility Properties[1], either the "Standard"
> variant or the "POSIX Compatible" variant or both). The libc and ICU
> ones should be there only for compatibility and discouraged and
> hopefully eventually removed.

How would you specify what you want?  As with collating, I like the
idea of keeping support for libc even if it is terrible (some libcs
more than others) and eventually not the default, because I think
optional agreement with other software on the same host is a feature.

In the regex code we see not only class membership tests eg
iswlower_l(), but also conversions eg towlower_l().  Unless you also
implement built-in case mapping, you'd still have to call libc or ICU
for that, right?  It seems a bit strange to use different systems for
classification and mapping.  If you do implement mapping too, you have
to decide if you believe it is language-dependent or not, I think?

Hmm, let's see what we're doing now... for ICU the regex code is using
"simple" case mapping functions like u_toupper(c) that don't take a
locale, so no Turkish i/İ conversion for you, unlike our SQL
upper()/lower(), which this is supposed to agree with according to the
comments at the top.  I see why: POSIX can only do one-by-one
character mappings (which cannot handle Greek's context-sensitive
Σ->σ/ς or German's multi-character ß->SS), while ICU offers only
language-aware "full" string conversation (which does not guarantee
1:1 mapping for each character in a string) OR non-language-aware
"simple" character conversion (which does not handle Turkish's i->İ).
ICU has no middle ground for language-aware mapping with just the 1:1
cases only, probably because that doesn't really make total sense as a
concept (as I assume Greek speakers would agree).
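
At the SQL level you can see the "full" flavour in action; this assumes
an ICU collation such as the predefined "en-x-icu":

  -- context-sensitive mapping: word-final Σ lowercases to ς
  SELECT lower('ΟΔΟΣ' COLLATE "en-x-icu");  -- 'οδος' with ICU full mapping
  -- a libc locale's towlower() is strictly 1:1, so you'd get 'οδοσ' instead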

> > > Not knowing anything about how glibc generates its charmaps,
> > > Unicode
> > > or pre-Unicode, I could take a wild guess that maybe in LATIN9 they
> > > have an old hand-crafted table, but for UTF-8 encoding it's fully
> > > outsourced to Unicode, and that's why you see a difference.
>
> No, the problem is that we're passing a pg_wchar to an ICU function
> that expects a 32-bit code point. Those two things are equivalent in
> the UTF8 encoding, but not in the LATIN9 encoding.

Ah right, I get that now (sorry, I confused myself by forgetting we
were talking about ICU).




Re: encoding affects ICU regex character classification

2023-11-29 Thread Thomas Munro
On Thu, Nov 30, 2023 at 1:23 PM Jeff Davis wrote:
> Character classification is not localized at all in libc or ICU as far
> as I can tell.

Really?  POSIX isalpha()/isalpha_l() and friends clearly depend on a
locale.  See eg d522b05c for a case where that broke something.
Perhaps you mean glibc wouldn't do that to you because you know that,
as an unstandardised detail, it sucks in (some version of) Unicode's
data which shouldn't vary between locales.  But you are allowed to
make your own locales, including putting whatever classifications you
want into the LC_CTYPE file using POSIX-standardised tools like
localedef.  Perhaps that is a bit of a stretch, and no one really does
that in practice, but anyway it's still "localized".
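
You can see the locale-dependence from SQL too (this assumes a UTF-8
database where a glibc "en_US.utf8" collation was imported at initdb):

  SELECT U&'\00E9' ~ '[[:alpha:]]' COLLATE "C";           -- false: "C" classifies only ASCII letters
  SELECT U&'\00E9' ~ '[[:alpha:]]' COLLATE "en_US.utf8";  -- true: glibc's Unicode-derived tables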

Not knowing anything about how glibc generates its charmaps, Unicode
or pre-Unicode, I could take a wild guess that maybe in LATIN9 they
have an old hand-crafted table, but for UTF-8 encoding it's fully
outsourced to Unicode, and that's why you see a difference.  Another
problem seen in a few parts of our tree is that we sometimes feed
individual UTF-8 bytes to the isXXX() functions which is about as well
defined as trying to pay for a pint with the left half of a $10 bill.

As for ICU, it's "not localized" only if there is only one ICU library
in the universe, but of course different versions of ICU might give
different answers because they correspond to different versions of
Unicode (as do glibc versions, FreeBSD libc versions, etc) and also
might disagree with tables built by PostgreSQL.  Maybe irrelevant for
now, but I think with thus-far-imagined variants of the multi-version
ICU proposal, you have to choose whether to call u_isUAlphabetic() in
the library we're linked against, or via the dlsym() we look up in a
particular dlopen'd library.  So I guess we'd have to access it via
our pg_locale_t, so again it'd be "localized" by some definitions.

Thinking about how to apply that thinking to libc, ... this is going
to sound far fetched and handwavy but here goes:  we could even
imagine a multi-version system based on different base locale paths.
Instead of using the system-provided locales under /usr/share/locale
to look in when we call newlocale(..., "en_NZ.UTF-8", ...), POSIX says
we're allowed to specify an absolute path eg newlocale(...,
"/foo/bar/unicode11/en_NZ.UTF-8", ...).  If it is possible to use
$DISTRO's localedef to compile $OLD_DISTRO's locale sources to get
historical behaviour, that might provide a way to get them without
assuming the binary format is stable (it definitely isn't, but the
source format is nailed down by POSIX).  One fly in the ointment is
that glibc failed to implement absolute path support, so you might
need to use versioned locale names instead, or see if the LOCPATH
environment variable can be swizzled around without confusing glibc's
locale cache.  That wouldn't be fundamentally different from the
hypothesised multi-version ICU case: you could probably come up with
different isalpha_l() results for different locales because you have
different LC_CTYPE versions (for example Unicode 15.0 added new
extended Cyrillic characters 1E030..1E08F, they look alphabetical to
me but what would I know).  That is an extremely hypothetical
pie-in-the-sky thought and I don't know if it'd really work very well,
but it is a concrete way that someone might finish up getting
different answers out of isalpha_l(), to observe that it really is
localised.  And localized.




Re: encoding affects ICU regex character classification

2023-11-29 Thread Tom Lane
Jeff Davis  writes:
> The problem seems to be confusion between pg_wchar and a unicode code
> point in pg_wc_isalpha() and related functions.

Yeah, that's an ancient sore spot: we don't really know what the
representation of wchar is.  We assume it's Unicode code points
for UTF8 locales, but libc isn't required to do that AFAIK.  See
comment block starting about line 20 in regc_pg_locale.c.

I doubt that ICU has much to do with this directly.

We'd have to find an alternate source of knowledge to replace the
<wctype.h> functions if we wanted to fix it fully ... can ICU do that?

regards, tom lane




encoding affects ICU regex character classification

2023-11-29 Thread Jeff Davis
The following query:

SELECT U&'\017D' ~ '[[:alpha:]]' collate "en-US-x-icu";

returns true if the server encoding is UTF8, and false if the server
encoding is LATIN9. That's a bug -- any behavior involving ICU should
be encoding-independent.

The problem seems to be confusion between pg_wchar and a unicode code
point in pg_wc_isalpha() and related functions.
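
For a concrete illustration: in LATIN9, Ž is stored as the single byte
0xB4, and the pg_wchar for a single-byte encoding is just that byte
value, so ICU is effectively being asked about U+00B4 (ACUTE ACCENT)
rather than U+017D:

  -- reproduces the LATIN9 result from a UTF-8 database
  SELECT U&'\00B4' ~ '[[:alpha:]]' collate "en-US-x-icu";  -- false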

It might be good to introduce some infrastructure here that can convert
a pg_wchar into a Unicode code point, or decode a string of bytes into
a string of 32-bit code points. Right now, that's possible, but it
involves pg_wchar2mb() followed by encoding conversion to UTF8,
followed by decoding the UTF8 to a code point. (Is there an easier path
that I missed?)

One wrinkle is MULE_INTERNAL, which doesn't have any conversion path to
UTF8. That's not important for ICU (because ICU is not allowed for that
encoding), but I'd like it if we could make this infrastructure
independent of ICU, because I have some follow-up proposals to simplify
character classification here and in ts_locale.c.

Thoughts?

Regards,
Jeff Davis