On 20.08.21 at 11:03, Helge Oldach wrote:
> Hi all,
>
> I'm confused about the FreeBSD behaviour with respect to locale's
> and grep - specifically, it seems case sensitivity is not handled
> consistently when grepping character ranges. It looks to me like 11 and
> 13 are not behaving consistently however I'm unclear why.
>
> # uname -a
> FreeBSD 11STABLE 11.4-STABLE FreeBSD 11.4-STABLE #1059 r368289M: Thu Dec 3
> 01:48:30 UTC 2020 root@XXX amd64
> # export LANG=en_US.ISO8859-1
> # (echo bla; echo Bla) | grep '[A-Z]'
> Bla
> # export LANG=C
> # (echo bla; echo Bla) | grep '[A-Z]'
> Bla
> # export LANG=en_US.UTF-8
> # (echo bla; echo Bla) | grep '[A-Z]'
> bla
> Bla
This is not unexpected: the default collating sequence for many UTF-8
locales places each lower case letter directly before its upper case
version, i.e. "aAbBcC...", so the range [A-Z] also covers the lower case
letters b through z.
https://developer.mimer.com/services/sql-unicode-collation-charts/
Here is a collation chart for English:
https://download.mimer.com/pub/developer/charts/english.htm
But POSIX makes no guarantees about range expressions in locales other
than POSIX or C.
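The ordering a locale uses can be made visible with sort. A minimal sketch: the helper name collate_order is made up for illustration, and en_US.UTF-8 is only assumed to be installed. Only the C result is certain; the other one depends on the locale definitions on the system.

```shell
#!/bin/sh
# collate_order LOCALE: print four letters in LOCALE's collation order.
# (Hypothetical helper, not part of any standard tool.)
collate_order() {
    printf 'b\nA\na\nB\n' | LC_ALL="$1" sort | paste -s -d ' ' -
}

# C locale sorts by byte value, so upper case comes first.
collate_order C              # prints: A B a b

# Result depends on the installed locale definition (assumed to exist).
collate_order en_US.UTF-8
```

If the second line comes out as "a A b B", then [A-Z] in that locale spans b through z as well.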
> # uname -a
> FreeBSD 13STABLE 13.0-STABLE FreeBSD 13.0-STABLE #49
> stable/13-n246779-64085efb677-dirty: Mon Aug 16 08:42:53 CEST 2021
> root@XXX amd64
> # export LANG=en_US.ISO8859-1
> # (echo bla; echo Bla) | grep '[A-Z]'
> bla
> Bla
This one is unexpected: the upper case letters should form a range of
their own, one that does not include any lower case letters.
> # export LANG=C
> # (echo bla; echo Bla) | grep '[A-Z]'
> Bla
Correct.
> # export LANG=en_US.UTF-8
> # (echo bla; echo Bla) | grep '[A-Z]'
> Bla
Here I had expected the result you got with en_US.ISO8859-1 ...
> For comparison, a Linux RHEL box delivers the expected results:
>
> # uname -a
> Linux rhel.local 3.10.0-1062.9.1.el7.x86_64 #1 SMP Mon Dec 2 08:31:54 EST
> 2019 x86_64 x86_64 x86_64 GNU/Linux
> # export LANG=en_US.ISO8859-1
> # (echo bla; echo Bla) | grep '[A-Z]'
> Bla
> # export LANG=C
> # (echo bla; echo Bla) | grep '[A-Z]'
> Bla
> # export LANG=en_US.UTF-8
> # (echo bla; echo Bla) | grep '[A-Z]'
> Bla
It seems that this version uses a POSIX style collating sequence for UTF-8.
It would be interesting to test with ranges that contain accented
characters or German Umlaut characters.
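Such a test could look roughly like the sketch below. The helper in_range is a made-up name, and the de_DE locales are assumed to be installed (if one is missing, grep silently falls back to C). The "ö" is written as explicit bytes so that each locale sees it in its own encoding. No expected results are given for the German locales, since they depend on the system.

```shell
#!/bin/sh
# in_range LOCALE PATTERN BYTES: print "yes" if BYTES match PATTERN under
# LOCALE, else "no". BYTES is used as a printf format string, so octal
# escapes such as '\366' work (and a literal % would need to be doubled).
in_range() {
    if printf "$3\n" | LC_ALL="$1" grep -q "$2"; then
        echo yes
    else
        echo no
    fi
}

# "ö" is byte 0xF6 (octal 366) in ISO8859-1 and 0xC3 0xB6 in UTF-8.
echo "de_DE.ISO8859-1: $(in_range de_DE.ISO8859-1 '[a-z]' '\366')"
echo "de_DE.UTF-8:     $(in_range de_DE.UTF-8 '[a-z]' '\303\266')"
```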
> There is nothing special in the environment, specifically no LC_xxx nor
> MM_CHARSET in either case.
LANG determines LC_COLLATE unless LC_COLLATE or LC_ALL overrides it.
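The precedence is LC_ALL over LC_COLLATE over LANG, so the collation can be pinned for a single command without touching the rest of the environment. A short sketch, again assuming en_US.UTF-8 exists on the system:

```shell
#!/bin/sh
# LC_ALL overrides every locale category, so ranges follow byte order here.
(echo bla; echo Bla) | LC_ALL=C grep '[A-Z]'
# prints: Bla

# Only the collation is pinned to C; character classification still comes
# from LANG. The output depends on the grep and locale data in use.
(echo bla; echo Bla) | LANG=en_US.UTF-8 LC_COLLATE=C grep '[A-Z]'
```

In scripts that rely on [A-Z] meaning "upper case ASCII", setting LC_ALL=C for the grep call is the defensive choice.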
> Any guidance is appreciated... Thanks!
Definitely a bug in the definition of the collating sequences.
And I have just verified that de_DE.ISO8859-1 wrongly considers "ö"
to be within [a-z], while de_DE.UTF-8 does not (but should).
It seems that the collating sequences for ISO8859-1 and UTF-8 have been
swapped: each one has the sequence that would be correct for the other.
Some platforms have switched to the POSIX style collating sequence so that
the traditional [A-Z] keeps working as a synonym for [[:upper:]], since a
lot of shell scripts have been written with that assumption for decades.
BTW, character classes work for your examples and more:
# (echo bla; echo Bla) | LANG=en_US.ISO8859-1 grep '[[:upper:]]'
Bla
# (echo bla; echo Bla) | LANG=en_US.UTF-8 grep '[[:upper:]]'
Bla
# (echo "o"; echo "ö") | LANG=de_DE.ISO8859-1 grep '[[:lower:]]'
o
# (echo "o"; echo "ö") | LANG=de_DE.UTF-8 grep '[[:lower:]]'
o
ö
Regards, STefan
