On 19/04/2022 01:52, Harald van Dijk via austin-group-l at The Open
Group wrote:
On 15/04/2022 04:57, Christoph Anton Mitterer wrote:
On Fri, 2022-04-15 at 00:44 +0100, Harald van Dijk via austin-group-l
at The Open Group wrote:
If there is interest in getting this standardised, I can spend some more
time on creating some hopefully comprehensive tests for this to confirm
in what cases shells agree and disagree, and use that as a basis for
proposing wording to cover it.
I'd love to see that and if you'd actually do so,.... I'd kindly ask
Geoff to defer any changes in the ticket #1564 of mine, until it can be
said whether it might be possible to get that standardised.
Very well, I will post tests and test results as soon as I can make the
time for it.
Please see the tests and results here. Apologies for the HTML mail but
this is hard to make readable in plain text.
String         Pattern             dash, busybox ash,   glibc     bash      bosh      gwsh      ksh       zsh
                                   mksh, posh, pdksh    fnmatch

\303\244       [\303\244]          no match             match     match     match     match     match     match
\303\244       ?                   no match             match     match     match     match     match     match
\303           [\303]              match                match     match     match     match     match     match
\303           ?                   match                match     match     match     match     match     match
\303.\303\244  [\303].[\303\244]   no match             no match  no match  match     match     match     match
\303.\303\244  ?.?                 no match             no match  no match  match     match     match     match
\303\303\244   [\303][\303\244]    no match             no match  no match  match     match     match     match
\303\303\244   ??                  no match             no match  no match  match     match     match     match
\303\244.\303  [\303\244].[\303]   no match             no match  no match  match     match     match     match
\303\244.\303  ?.?                 no match             no match  no match  match     match     match     match
\303\244\303   [\303\244][\303]    no match             no match  no match  match     match     match     match
\303\244\303   ??                  no match             no match  no match  match     match     match     match
\303\244       \303*               match                match     match     match     no match  match     no match
\303\244       \303?               match                match     match     no match  no match  match     no match
\303\244       [\303]*             match                match     match     match     no match  match     no match
\303\244       [\303]?             match                match     match     no match  no match  match     no match
\303\244       *\204               match                match     match     no match  no match  no match  match
\303\244       ?\204               match                match     match     no match  no match  no match  no match
\303\244       *[\204]             match                match     match     no match  no match  no match  no match
\303\244       ?[\204]             match                match     match     no match  no match  no match  no match
\243]          [\243]]             match                match     match     match     match     match     match
\243]          ?                   no match             match     match     match     match     match     match
\243           ?                   match                match     match     match     match     match     match
\243           [\243]              match                match     match     match     no match  no match  error
\243           [\243!]             match                match     match     match     match     match     match
\243]          [\243!]]            match                match     no match  no match  no match  match     no match
\243]          ?]                  match                match     no match  no match  no match  no match  no match
\243]          *]                  match                match     no match  no match  no match  no match  match

The tests involving \303 and \244 are run in a UTF-8 environment. In
UTF-8, \303\244 is the representation of ä, a single valid character.
The tests involving \243 are run in a Big5 environment. In Big5,
\243\135 is the representation of β, a single valid character, even
though \135 on its own is still the single character ].
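To make the Big5 point concrete, here is a rough sketch of how a byte string splits into characters under a simplified Big5 rule. It is not taken from any of the implementations tested; the function names are made up for illustration, and the lead and trail byte ranges are a simplification rather than a full decoder.

```c
#include <stddef.h>

/* Simplified Big5 rule (an assumption for illustration, not a full
 * decoder): a lead byte in 0x81..0xFE followed by a trail byte in
 * 0x40..0x7E or 0xA1..0xFE forms one two-byte character. */
static int big5_is_lead(unsigned char b)
{
    return b >= 0x81 && b <= 0xFE;
}

static int big5_is_trail(unsigned char b)
{
    return (b >= 0x40 && b <= 0x7E) || (b >= 0xA1 && b <= 0xFE);
}

/* Count the characters in a NUL-terminated Big5 byte string. A lead byte
 * without a valid trail byte is counted as one unit by itself. */
size_t big5_count_chars(const char *s)
{
    const unsigned char *p = (const unsigned char *)s;
    size_t n = 0;

    while (*p != '\0') {
        if (big5_is_lead(p[0]) && big5_is_trail(p[1]))
            p += 2;   /* e.g. \243\135: one character, not \243 then ] */
        else
            p += 1;   /* single byte, e.g. a lone ] */
        n++;
    }
    return n;
}
```

With this rule, the string \243] counts as one character, so a locale-aware matcher would be expected to match it against ? but not against ?].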
dash, busybox ash, mksh, posh and pdksh give the results they do because
they appear to implement no locale support at all for pattern matching.
The results for the other shells show that they perform locale-sensitive
pattern matching if both the string and the pattern are valid character
strings in the current locale, with one exception.
zsh and glibc fnmatch do not handle the case of the byte representation
of one valid character appearing within the byte representation of
another valid character. This cannot happen with single-byte character
sets, and cannot happen with UTF-8, but can happen with Big5. I believe
the current wording requires this to be supported and that this should
continue to be supported, and I would suggest treating this as a bug.
zsh does not implement the rule that a [ that does not start a bracket
expression matches itself, and treats it as an invalid pattern instead.
This is clearly contrary to what POSIX requires and unrelated to the
locale handling.
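For comparison, the same rule can be checked against fnmatch() with a few lines of C. This is only a sketch; pattern_matches is a made-up helper name, and the check is independent of the locale questions discussed in the rest of this mail.

```c
#include <fnmatch.h>

/* Return 1 if string matches pattern under fnmatch() with no flags,
 * 0 otherwise (no match, or an error return from fnmatch()). */
int pattern_matches(const char *pattern, const char *string)
{
    return fnmatch(pattern, string, 0) == 0;
}
```

A pattern such as "[" or "a[b" contains a [ that does not start a bracket expression, so pattern_matches("[", "[") and pattern_matches("a[b", "a[b") are the interesting cases to try.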
The other shells, when either the string or the pattern is not valid in
the current locale, are not in agreement on whether parts of the rest of
the string or the pattern are still interpreted according to the current
locale, and if so, which parts.
bash and glibc fnmatch appear to mostly fall back to entirely
single-byte-character-set based pattern matching, as if the whole match
were performed under LC_ALL=C. The test involving the [\243!]] pattern
is an exception in bash, but it's notable that the tests involving \243
are buggy in general in glibc fnmatch. I have not checked whether these
implementations share code and, if so, whether bash partially fixed
glibc's bugs.
gwsh and zsh treat invalid bytes as single characters distinct from all
other characters, and treat the rest of the string and the pattern
according to the current locale.
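One way to picture the "invalid bytes are single characters" behaviour is as a splitting rule for UTF-8 input. The sketch below is my own illustration, not code from gwsh or zsh; the names utf8_seq_len and count_units are hypothetical, and the validity check is deliberately simplified (it does not reject overlong forms or surrogates).

```c
#include <stddef.h>

/* Length of the well-formed UTF-8 sequence at p, or 0 if the bytes at p do
 * not start one. Simplified: checks lead/continuation byte structure only. */
static size_t utf8_seq_len(const unsigned char *p)
{
    size_t len, i;

    if (p[0] < 0x80)
        return 1;
    else if ((p[0] & 0xE0) == 0xC0)
        len = 2;
    else if ((p[0] & 0xF0) == 0xE0)
        len = 3;
    else if ((p[0] & 0xF8) == 0xF0)
        len = 4;
    else
        return 0;               /* stray continuation or invalid lead byte */

    for (i = 1; i < len; i++)
        if ((p[i] & 0xC0) != 0x80)
            return 0;           /* truncated sequence (also stops at NUL) */
    return len;
}

/* Count matching units: each valid character is one unit, and each byte
 * that is not part of a valid character is one unit by itself. */
size_t count_units(const char *s)
{
    const unsigned char *p = (const unsigned char *)s;
    size_t n = 0;

    while (*p != '\0') {
        size_t len = utf8_seq_len(p);
        p += (len != 0) ? len : 1;
        n++;
    }
    return n;
}
```

Under this rule \303\303\244 splits into two units (the stray \303, then ä), which is consistent with ?? matching it in the gwsh and zsh columns.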
ksh treats invalid bytes as single characters that may match the start
of other characters, and treats the rest of the string and the pattern
according to the current locale.
bosh does something I do not understand, and I would welcome an
explanation from someone who does.
The shell tests are all performed with the case statement. The fnmatch
tests are performed with a trivial wrapper program that calls
setlocale() followed by fnmatch(). Running these same tests in other
environments and posting the results would be much appreciated.
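For anyone who wants to repeat the fnmatch tests, a wrapper along the following lines should be enough. This is a minimal sketch and not necessarily the exact program used for the results above; match_in_locale and the RESULT_* names are hypothetical, and the locale names to pass in depend on what is installed on the system.

```c
#include <fnmatch.h>
#include <locale.h>
#include <stddef.h>

/* Result codes for match_in_locale(); the names are hypothetical. */
enum { RESULT_MATCH, RESULT_NO_MATCH, RESULT_ERROR, RESULT_NO_LOCALE };

/* Switch LC_ALL to the named locale, then match string against pattern
 * with fnmatch() and no flags. */
int match_in_locale(const char *locale, const char *pattern, const char *string)
{
    if (setlocale(LC_ALL, locale) == NULL)
        return RESULT_NO_LOCALE;    /* locale not installed on this system */

    switch (fnmatch(pattern, string, 0)) {
    case 0:
        return RESULT_MATCH;
    case FNM_NOMATCH:
        return RESULT_NO_MATCH;
    default:
        return RESULT_ERROR;        /* any other non-zero return value */
    }
}
```

A call such as match_in_locale("en_US.UTF-8", "?", "\303\244") would then correspond to one cell of the table, assuming a locale of that name exists; a RESULT_NO_LOCALE return means the test could not be run at all and should not be reported as "no match".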
Cheers,
Harald van Dijk