On 19/04/2022 01:52, Harald van Dijk via austin-group-l at The Open Group wrote:
> On 15/04/2022 04:57, Christoph Anton Mitterer wrote:
>> On Fri, 2022-04-15 at 00:44 +0100, Harald van Dijk via austin-group-l
>> at The Open Group wrote:
>>> If there is interest in getting this standardised, I can spend some
>>> more time on creating some hopefully comprehensive tests for this to
>>> confirm in what cases shells agree and disagree, and use that as a
>>> basis for proposing wording to cover it.
>>
>> I'd love to see that and if you'd actually do so,.... I'd kindly ask
>> Geoff to defer any changes in the ticket #1564 of mine, until it can be
>> said whether it might be possible to get that standardised.
>
> Very well, I will post tests and test results as soon as I can make the time for it.

Please see the tests and results here. Apologies for the HTML mail but this is hard to make readable in plain text.

String         Pattern            dash, busybox ash,  glibc     bash      bosh      gwsh      ksh       zsh
                                  mksh, posh, pdksh   fnmatch
----------------------------------------------------------------------------------------------------------------
\303\244       [\303\244]         no match            match     match     match     match     match     match
\303\244       ?                  no match            match     match     match     match     match     match
\303           [\303]             match               match     match     match     match     match     match
\303           ?                  match               match     match     match     match     match     match
\303.\303\244  [\303].[\303\244]  no match            no match  no match  match     match     match     match
\303.\303\244  ?.?                no match            no match  no match  match     match     match     match
\303\303\244   [\303][\303\244]   no match            no match  no match  match     match     match     match
\303\303\244   ??                 no match            no match  no match  match     match     match     match
\303\244.\303  [\303\244].[\303]  no match            no match  no match  match     match     match     match
\303\244.\303  ?.?                no match            no match  no match  match     match     match     match
\303\244\303   [\303\244][\303]   no match            no match  no match  match     match     match     match
\303\244\303   ??                 no match            no match  no match  match     match     match     match
\303\244       \303*              match               match     match     match     no match  match     no match
\303\244       \303?              match               match     match     no match  no match  match     no match
\303\244       [\303]*            match               match     match     match     no match  match     no match
\303\244       [\303]?            match               match     match     no match  no match  match     no match
\303\244       *\204              match               match     match     no match  no match  no match  match
\303\244       ?\204              match               match     match     no match  no match  no match  no match
\303\244       *[\204]            match               match     match     no match  no match  no match  no match
\303\244       ?[\204]            match               match     match     no match  no match  no match  no match
\243]          [\243]]            match               match     match     match     match     match     match
\243]          ?                  no match            match     match     match     match     match     match
\243           ?                  match               match     match     match     match     match     match
\243           [\243]             match               match     match     match     no match  no match  error
\243           [\243!]            match               match     match     match     match     match     match
\243]          [\243!]]           match               match     no match  no match  no match  match     no match
\243]          ?]                 match               match     no match  no match  no match  no match  no match
\243]          *]                 match               match     no match  no match  no match  no match  match

The tests involving \303 and \244 are run in a UTF-8 environment. In UTF-8, \303\244 is the representation of ä, a single valid character. The tests involving \243 are run in a Big5 environment. In Big5, \243\135 is the representation of β, a single valid character, even though \135 on its own is still the single character ].

The results for dash, busybox ash, mksh, posh and pdksh are as they are because these shells appear to implement no locale support at all for pattern matching.

The results for the other shells show that they perform locale-sensitive pattern matching if both the string and the pattern are valid character strings in the current locale, with one exception: zsh and glibc fnmatch do not handle the case of the byte representation of one valid character appearing within the byte representation of another valid character. This cannot happen with single-byte character sets, and cannot happen with UTF-8, but can happen with Big5. I believe the current wording requires this to be supported and that it should continue to be supported, so I would suggest treating this as a bug.

zsh does not implement the rule that a [ that does not start a bracket expression matches itself, and treats it as an invalid pattern instead. This is clearly contrary to what POSIX requires and is unrelated to the locale handling.

The other shells, when either the string or the pattern is not valid in the current locale, do not agree on whether parts of the rest of the string or the pattern are still interpreted according to the current locale, and if so, which parts. bash and glibc fnmatch appear to mostly fall back to entirely single-byte pattern matching, as if the whole match were performed under LC_ALL=C. The test involving the [\243!]] pattern is an exception in bash, but it is notable that the tests involving \243 are buggy in general in glibc fnmatch. I have not checked whether these implementations share code and, if so, whether bash partially fixed glibc's bugs.

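To make the Big5 overlap concrete, here is a minimal sketch that inspects the raw bytes only. It assumes that under LC_ALL=C the shell matches byte by byte; the helper name ends_in_bracket is made up for illustration and is not part of the tests above.

```shell
#!/bin/sh
# Assumption: under LC_ALL=C the shell matches patterns byte by byte.
LC_ALL=C
export LC_ALL

# Hypothetical helper: report whether the last byte of the argument is the
# ASCII character ]. The argument is used as a printf format string so that
# bytes can be written as octal escapes; it must not contain '%'.
ends_in_bracket() {
    bytes=$(printf "$1")
    case $bytes in
        *']') echo "yes" ;;
        *)    echo "no" ;;
    esac
}

ends_in_bracket '\243\135'   # Big5 encoding of β; second byte is \135, i.e. ]
ends_in_bracket '\303\244'   # UTF-8 encoding of ä; neither byte is ASCII
```

With byte-wise matching, the first call reports yes and the second reports no. An implementation that scans for ] without tracking character boundaries can therefore mistake the second byte of β for the end of a bracket expression, which cannot happen with UTF-8 input.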
gwsh and zsh treat invalid bytes as single characters distinct from all other characters, and treat the rest of the string and the pattern according to the current locale. ksh treats invalid bytes as single characters that may match the start of other characters, and treats the rest of the string and the pattern according to the current locale. bosh does something I do not understand, and I would welcome an explanation from someone who does.

The shell tests are all performed with the case statement. The fnmatch tests are performed with a trivial wrapper program that calls setlocale() followed by fnmatch(). Running these same tests in other environments and posting the results would be much appreciated.
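For anyone who wants to rerun the shell side, a case-based test could look roughly like the sketch below. This is not the script used for the table above. The helper name try_match is hypothetical, and it assumes that the test strings contain no % character, because they are passed to printf as format strings so that the octal escapes get expanded.

```shell
#!/bin/sh
# Hypothetical helper: try_match STRING PATTERN
# Both arguments are printf format strings, so bytes can be written as
# octal escapes such as '\303\244'. They must not contain '%'.
try_match() {
    string=$(printf "$1")
    pattern=$(printf "$2")
    # $pattern is left unquoted so that the shell treats it as a pattern
    # rather than as a literal string.
    case $string in
        $pattern) echo "match" ;;
        *)        echo "no match" ;;
    esac
}

# A few rows from the table. The result depends on the shell and on the
# locale in effect when the script runs.
try_match '\303\244' '[\303\244]'
try_match '\303\244' '?'
try_match '\303' '?'
```

The locale is best set in the environment before the shell starts, for example by running the script with LC_ALL set to a UTF-8 or Big5 locale that is installed on the system, since not every shell picks up a locale change made while it is running.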

Cheers,
Harald van Dijk