[Issue 8 drafts 0001564]: clariy on what (character/byte) strings pattern matching notation should work
The following issue has a resolution that has been APPLIED.
==
https://austingroupbugs.net/view.php?id=1564
==
Reported By:          calestyo
Assigned To:
==
Project:              Issue 8 drafts
Issue ID:             1564
Category:             Shell and Utilities
Type:                 Clarification Requested
Severity:             Editorial
Priority:             normal
Status:               Applied
Name:                 Christoph Anton Mitterer
Organization:
User Reference:
Section:              2.13 Pattern Matching Notation
Page Number:          2351
Line Number:          76099
Final Accepted Text:  https://austingroupbugs.net/view.php?id=1564#c5796
Resolution:           Accepted As Marked
Fixed in Version:
==
Date Submitted:       2022-02-23 01:54 UTC
Last Modified:        2022-11-30 16:37 UTC
==
Summary:              clariy on what (character/byte) strings pattern matching notation should work
==
Relationships    ID       Summary
--
related to       0001561  clarify what kind of data shell variabl...
==
Issue History
Date Modified     Username    Field                Change
==
2022-02-23 01:54  calestyo    New Issue
2022-02-23 01:54  calestyo    Name                 => Christoph Anton Mitterer
2022-02-23 01:54  calestyo    Section              => 2.13 Pattern Matching Notation
2022-02-23 01:54  calestyo    Page Number          => 2351
2022-02-23 01:54  calestyo    Line Number          => 76099
2022-02-25 04:57  calestyo    Note Added: 0005716
2022-02-25 20:54  mirabilos   Note Added: 0005719
2022-03-03 03:37  calestyo    Note Added: 0005729
2022-04-07 16:30  geoffclare  Relationship added   related to 0001561
2022-04-11 13:55  geoffclare  Note Added: 0005796
2022-04-11 22:58  kre         Note Added: 0005797
2022-04-12 08:51  geoffclare  Note Added: 0005798
2022-04-15 02:12  calestyo    Note Added: 0005804
2022-04-15 02:17  calestyo    Note Added: 0005805
2022-10-31 16:13  geoffclare  Final Accepted Text  => https://austingroupbugs.net/view.php?id=1564#c5796
2022-10-31 16:13  geoffclare  Status               New => Resolved
2022-10-31 16:13  geoffclare  Resolution           Open => Accepted As Marked
2022-10-31 16:13  geoffclare  Tag Attached: tc3-2008
2022-11-30 16:37  geoffclare  Status               Resolved => Applied
==
Re: [Issue 8 drafts 0001564]: clariy on what (character/byte) strings pattern matching notation should work
Hey folks.

A while ago we had this discussion about pattern matching notation and characters vs. bytes.

Back then, Harald van Dijk did some investigation on whether the standard could be changed to allow for bytes (and not just characters) without breaking all kinds of shells. IIRC, when he presented his results, these were that there are some obstacles (in the sense of some shells behaving differently), but he considered them bugs rather than behaviour actually desired by these shells.

A while ago I had a short off-list mail exchange with him, and if I understood correctly (please correct me if not ^^), Harald would still be willing to track things down further and get them resolved (i.e. allowing any bytes and not just characters), which I think would be really good for the standard and the ecosystem[0].

However, I think he wanted to get some kind of blessing/support from people who have a stronger say in the matter (I guess people like shell implementers and representatives from the POSIX WG) before actually putting a lot of effort into that.

So the question is: What do people here on the list think? Would they find it useful to resolve any open issues and strive to have the standard define pattern matching notation on strings that may contain any bytes (and not just those that form characters in the current locale), and is there any support for this?

Do we have any shell implementers/maintainers around here who could comment on what they think (especially with respect to "their" shells), or do we have some means of contacting such folks?

I guess it would be quite worthwhile to actually get to that state,... and while of course this could also be done in years, it may be harder by then, in case new shells come up that handle things in some incompatible way.

Thanks,
Chris.

[0] My personal motivation was to get some (portable) command substitution *with* trailing newlines, which proved to be quite hard to actually do in POSIX. In the current standard, with pattern matching notation working on characters only, using the special properties of the '.' character (as a sentinel) may not be enough... and the games with LC_ALL=C introduce all kinds of further subtle issues, especially when one wants to make the whole thing a function and has to live with shells and their different ways of handling `local` variables.
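For readers unfamiliar with the sentinel technique the footnote refers to, here is a minimal sketch. It is not code from the thread: the function name `capture` and the variable `captured` are made up for illustration, and the sketch assumes the command's output consists only of bytes that form valid characters in the current locale, which is exactly the limitation being discussed.

```shell
# Hypothetical helper (name and variable are illustrative, not from the thread).
# Runs "$@" and stores its output, trailing newlines included, in $captured.
capture() {
    # Command substitution strips trailing newlines, so a '.' sentinel is
    # printed after the command's output to protect them.
    captured=$("$@"; printf '.')
    # Remove the sentinel again with a suffix pattern.
    captured=${captured%.}
}

capture printf 'line\n\n'
printf 'captured %d characters\n' "${#captured}"
```

Note that the exit status of the captured command is lost in this form, and that the final `${captured%.}` is itself a use of pattern matching notation, which is why its behaviour on arbitrary bytes matters here.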
Re: [Issue 8 drafts 0001564]: clariy on what (character/byte) strings pattern matching notation should work
On 5/18/22 9:46 PM, Christoph Anton Mitterer via austin-group-l at The Open Group wrote:

> The above, I'm not quite sure what these tell/prove... I assume the
> ones with '?': that for all except bash/fnmatch '?' matches both, valid
> characters and a single byte that is no character.
> And the ones with bracket expression, that these also work when the BE
> has either a valid character or a byte (that is not a character) and
> vice-versa?
>
> If Chet is reading along, is the above intended in bash, or considered
> a bug?

The bash matcher falls back to C-locale-like behavior only if the pattern and the string both do not contain any valid multibyte characters. So if, for example, the string contains a valid multibyte character, but the pattern does not, the matcher will attempt multibyte (wide character, really) matches.

This is why the string \243] (a valid multibyte character in Big5) does not match [\243!]]: nothing in the bracket expression will match that character, and that string will never match a pattern ending in `]'.

> IMO it would have been interesting to see whether ? would also match
> multiple bytes that are each for themselves and together no valid
> character...

No, it wouldn't. You can make a case for `?' matching a single byte that is not part of a valid multibyte character (there is no such thing as a single byte that is "no valid character" when you are matching), but you cannot make one for `?' matching more than one byte that does not compose a valid multibyte character.

> > The tests involving \243 are run in a Big5 environment. In Big5,
> > \243\135 is the representation of β, a single valid character, even
> > though \135 on its own is still the single character ].
>
> Seem also a bit strange to me,... all shells match \243 against ? ...
> i.e. ? matches a single byte that is not a character... but later on it
> doesn't work again with \243] and ?]

Because, as Harald says, \243] is a valid multibyte character in Big5 locales.

--
``The lyf so short, the craft so long to lerne.'' - Chaucer
                 ``Ars longa, vita brevis'' - Hippocrates
Chet Ramey, UTech, CWRU    c...@case.edu    http://tiswww.cwru.edu/~chet/
Re: [Issue 8 drafts 0001564]: clariy on what (character/byte) strings pattern matching notation should work
On 20/05/2022 01:11, Christoph Anton Mitterer wrote:
> On Thu, 2022-05-19 at 09:05 +0100, Harald van Dijk wrote:
>>> The above, AFAIU, mean that any shell/fnmatch matches a valid
>>> multibyte character... but also a byte that is not a character in
>>> the locale.
>>
>> Correct, though as I wrote later on, the way they go about it is
>> different.
>
> And I think, for any real standardisation of this (which I'd still love
> to see) quite a few things would need to be reasonably defined,
> including but most likely not limited to:
>
> - Does * match bytes (by which I mean 1-n which don't form valid
>   characters in the current locale).

This is another one of those that seemed obvious enough to me that I did not think to check explicitly. As far as I can tell at a quick glance, * matches any number (zero or more) of ?, whatever ? means, except in the case of a particular shell bug that also breaks scripts already required by the standard to work.

> - The same for ? ... and if that matches bytes at all - only 1 or n?

It matches a single character or a single byte that is not part of a character.

> - In which "direction" is the matching done, which AFAIU would be
>   important, e.g.
>   \303\244\244
>   *if* '?' were to match also bytes, is '?\244' meant to be matching
>   the character followed by the byte:
>   (\303\244)\244
>   or could it non-match the byte followed by more bytes:
>   (\303)\244\244

In an UTF-8 locale, \303\244\244 is an invalid character string. As the test results have shown, in some of the implementations, that causes pattern matching to be done as if under the C locale. In those implementations, it does not match ?\244, but it does match ??\244. In other implementations, only the final invalid byte \244 is given special treatment, in which case the whole string does match ?\244.

> - And I guess these questions would also pop up for the ##, #, %% and %
>   forms of parameter expansions, especially when one has a locale like
>   Big-5.
>   In the sense of, can one strip off a character (or byte) that forms
>   part of another character.

This does pop up there too but the questions are not new, they are the same questions that already pop up for regular pattern matching. ${var#pat} strips a leading pat of $var if and only if $var matches pat*. ${var%pat} strips a trailing pat of $var if and only if $var matches *pat. That said, in those cases where shells disagree over whether $var matches pat* / *pat, that is those cases where I would propose making the result unspecified, the results may also be inconsistent with the same shell's pattern matching in other contexts.

>   If shells were required not to decompose such valid characters (that
>   contain another valid character, when looked at it from right to
>   left), then it would also need to be defined how the strings needed
>   to be interpreted (most likely of course: as defined by the
>   respective char encoding).

This is where the example with β comes in. The current standard, as far as I can tell, *already* requires

    var=β
    echo ${var%]}
    case $var in
    *]) echo match
    esac

to print "β", and not print "match", regardless of how that β is encoded. There are no invalid bytes here. This can only be done by processing the string left-to-right.

>   So for all these cases it might additionally be required to check
>   how the different shells behave when trying to ##, #, %%, % ...
>   And AFAIU, some actually allow to "decompose" a character.

Yes, this is expected and consistent with regular pattern matching.

>   And even if the standard were to say, that it must check whether the
>   matched part is part of a bigger multibyte character (like in the
>   BIG5 case) and then not allow to decompose that would it still be
>   allowed to do so when the pattern contains bytes that are themselves
>   not valid characters)

Yes, they should be allowed to do so. As we have seen, bash and GNU fnmatch() simply fall back to single-byte-character-set matching if the string or pattern is not valid in the current locale, and what you describe would be the natural result of that.

> - Are there any undesired side effects?
>   Like bash, has the nocasematch shell option... which IIRC affects
>   patterns... would we break anything in such fields?

How could we? What bash does if a non-standard shell option is set is not covered by POSIX, nor should it be.

> - I think it already is defined (more or less) which locale is
>   actually used for the matching, i.e. the current one as set by
>   LC_CTYPE and not e.g. the "lexical" locale defined on the start of
>   the shell.

Agreed.

>> I tested this now. In that same list of shells, and in glibc
>> fnmatch(), ? only matches a single invalid byte. Tested in an UTF-8
>> locale with the string \200\200 and the patterns ? and ??. With ?,
>> they do not match. With ??, they do.
>
> The next question that would come to my mind:
> Do these tests really give us a definite answer on the behaviour... or
> may some things be
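The relationship stated in this message between ${var#pat} / ${var%pat} and matching against pat* / *pat can be illustrated with plain ASCII strings, where all shells agree. This is only a sketch; the variable names and the sample value are invented for the example, and it says nothing about the multibyte cases under discussion.

```shell
var=archive.tar.gz

# ${var%pat} removes a trailing pat exactly when $var matches *pat.
suffix_matches=no
case $var in
    *.gz) suffix_matches=yes ;;
esac
without_suffix=${var%.gz}

# ${var#pat} removes a leading pat exactly when $var matches pat*.
prefix_matches=no
case $var in
    archive*) prefix_matches=yes ;;
esac
without_prefix=${var#archive}

# A pattern that does not match leaves the value unchanged.
unchanged=${var%.zip}

printf '%s %s\n' "$suffix_matches" "$without_suffix"
printf '%s %s\n' "$prefix_matches" "$without_prefix"
printf '%s\n' "$unchanged"
```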
Re: [Issue 8 drafts 0001564]: clariy on what (character/byte) strings pattern matching notation should work
On Thu, 2022-05-19 at 09:05 +0100, Harald van Dijk wrote:
> > The above, AFAIU, mean that any shell/fnmatch matches a valid
> > multibyte character... but also a byte that is not a character in
> > the locale.
>
> Correct, though as I wrote later on, the way they go about it is
> different.

And I think, for any real standardisation of this (which I'd still love to see) quite a few things would need to be reasonably defined, including but most likely not limited to:

- Does * match bytes (by which I mean 1-n which don't form valid characters in the current locale).

- The same for ? ... and if that matches bytes at all - only 1 or n?

- In which "direction" is the matching done, which AFAIU would be important, e.g.
  \303\244\244
  *if* '?' were to match also bytes, is '?\244' meant to be matching the character followed by the byte:
  (\303\244)\244
  or could it non-match the byte followed by more bytes:
  (\303)\244\244

- And I guess these questions would also pop up for the ##, #, %% and % forms of parameter expansions, especially when one has a locale like Big-5.
  In the sense of, can one strip off a character (or byte) that forms part of another character.

  If shells were required not to decompose such valid characters (that contain another valid character, when looked at it from right to left), then it would also need to be defined how the strings needed to be interpreted (most likely of course: as defined by the respective char encoding).

  So for all these cases it might additionally be required to check how the different shells behave when trying to ##, #, %%, % ...
  And AFAIU, some actually allow to "decompose" a character.

  And even if the standard were to say, that it must check whether the matched part is part of a bigger multibyte character (like in the BIG5 case) and then not allow to decompose that would it still be allowed to do so when the pattern contains bytes that are themselves not valid characters)

- Are there any undesired side effects?
  Like bash, has the nocasematch shell option... which IIRC affects patterns... would we break anything in such fields?

- I think it already is defined (more or less) which locale is actually used for the matching, i.e. the current one as set by LC_CTYPE and not e.g. the "lexical" locale defined on the start of the shell.

> I tested this now. In that same list of shells, and in glibc
> fnmatch(), ? only matches a single invalid byte. Tested in an UTF-8
> locale with the string \200\200 and the patterns ? and ??. With ?,
> they do not match. With ??, they do.

The next question that would come to my mind:
Do these tests really give us a definite answer on the behaviour... or may some things be dependent on the specific locale?
Maybe the above behaviour is *only* with UTF-8? Or can this be ruled out?

> > So unlike before, in the above bash/fnmatch do seem to let '?'
> > match a single byte that is not a character... and the remaining
> > ones have quite mixed feelings
>
> Not quite: all of them always let ? match a single invalid byte, but
> here we have a single byte that is invalid on its own, valid as part
> of a character, and appears in the string as part of that character.
> When processing \303\244, most shells don't process this as the single
> byte \303 followed by the single byte \244, they preprocess this so
> that by the time they actually check whether it matches, they just see
> the character U+00E4, so that if a pattern looks for \303 on its own,
> it will not be found.

Hmm... seems a bit strange to me... I mean above you had:

    string           pattern
    \303.\303\244    ?.?

And e.g. bash didn't match.. my assumption was, because the first \303 is not a character.

But later on you had:

    \303\244         \303*
    \303\244         \303?

which bash *did* match. Sure, the 2 bytes together are already one character, but bash had to match the single \303 plus * or ? ... and if above ? didn't match the single invalid \303 it did match the single \244 here (which ain't a character either).

Now, even if one says it's the direction,... there was also:

    \303\244.\303    ?.?

with no match in bash... the first ? should be okay, because it's a char,... the 2nd one would be the lone \303 byte.

> > Seem also a bit strange to me,... all shells match \243 against ?
> > ... i.e. ? matches a single byte that is not a character... but
> > later on it doesn't work again with \243] and ?]
>
> Remember that \243] is a single character β. \243] is not supposed to
> match when given a pattern ?]. The pattern ?] means "any character,
> followed by ]". "β" is a character not followed by ]. This is similar
> to how in UTF-8 environments, ä should not match against the pattern
> ?? even though both of the bytes that make up ä individually do match
> against the pattern ?.

Okay but isn't that then the case where the matching
Re: [Issue 8 drafts 0001564]: clariy on what (character/byte) strings pattern matching notation should work
On 19/05/2022 02:46, Christoph Anton Mitterer wrote:
> On Sun, 2022-05-15 at 16:14 +0100, Harald van Dijk wrote:
>> Please see the tests and results here.
>
> So dash/ash/mksh/posh/pdksh,... and every other shell that doesn't
> handle locales at all (and thus works in the C locale)... is anyway
> always right (except for bugs), since any (non-NUL) byte is treated as
> a character.

Correct.

> For the other shells (and fnmatch):
>
>> String         Pattern            dash, busybox ash,  glibc     bash      bosh      gwsh      ksh       zsh
>>                                   mksh, posh, pdksh   fnmatch
>> \303\244       [\303\244]         no match            match     match     match     match     match     match
>> \303\244       ?                  no match            match     match     match     match     match     match
>> \303           [\303]             match               match     match     match     match     match     match
>> \303           ?                  match               match     match     match     match     match     match
>
> The above, AFAIU, mean that any shell/fnmatch matches a valid multibyte
> character... but also a byte that is not a character in the locale.

Correct, though as I wrote later on, the way they go about it is different.

>> String         Pattern            dash, busybox ash,  glibc     bash      bosh      gwsh      ksh       zsh
>>                                   mksh, posh, pdksh   fnmatch
>> \303.\303\244  [\303].[\303\244]  no match            no match  no match  match     match     match     match
>> \303.\303\244  ?.?                no match            no match  no match  match     match     match     match
>> \303\303\244   [\303][\303\244]   no match            no match  no match  match     match     match     match
>> \303\303\244   ??                 no match            no match  no match  match     match     match     match
>> \303\244.\303  [\303\244].[\303]  no match            no match  no match  match     match     match     match
>> \303\244.\303  ?.?                no match            no match  no match  match     match     match     match
>> \303\244\303   [\303\244][\303]   no match            no match  no match  match     match     match     match
>> \303\244\303   ??                 no match            no match  no match  match     match     match     match
>
> The above, I'm not quite sure what these tell/prove... I assume the
> ones with '?': that for all except bash/fnmatch '?' matches both, valid
> characters and a single byte that is no character.

Correct.

> And the ones with bracket expression, that these also work when the BE
> has either a valid character or a byte (that is not a character) and
> vice-versa?

Correct.

> If Chet is reading along, is the above intended in bash, or considered
> a bug?
>
> IMO it would have been interesting to see whether ? would also match
> multiple bytes that are each for themselves and together no valid
> character... cause for '*' one can kinda assume that it has this "match
> anything" meaning... one could also say that is more or less reasonable
> that '?' matches a single invalid byte... but why not several of them?

I tested this now. In that same list of shells, and in glibc fnmatch(), ? only matches a single invalid byte. Tested in an UTF-8 locale with the string \200\200 and the patterns ? and ??. With ?, they do not match. With ??, they do.

>> String         Pattern            dash, busybox ash,  glibc     bash      bosh      gwsh      ksh       zsh
>>                                   mksh, posh, pdksh   fnmatch
>> \303\244       \303*              match               match     match     match     no match  match     no match
>> \303\244       \303?              match               match     match     no match  no match  match     no match
>> \303\244       [\303]*            match               match     match     match     no match  match     no match
>> \303\244       [\303]?            match               match     match     no match  no match  match     no match
>> \303\244       *\204              match               match     match     no match  no match  no match  match
>> \303\244       ?\204              match               match     match     no match  no match  no match  no match
>> \303\244       *[\204]            match               match     match     no match  no match  no match  no match
>> \303\244       ?[\204]            match               match     match     no match  no match  no match  no match
>
> So unlike before, in the above bash/fnmatch do seem to let '?' match a
> single byte that is not a character... and the remaining ones have
> quite mixed feelings

Not quite: all of them always let ? match a single invalid byte, but here we have a single byte that is invalid on its own, valid as part of a character, and appears in the string as part of that character. When processing \303\244, most shells don't process this as the single byte \303 followed by the single byte \244, they preprocess this so that by the time they actually check whether it matches, they just see the character U+00E4, so that if a pattern looks for \303 on its own, it will not be found.

>> String         Pattern            dash, busybox ash,  glibc     bash      bosh      gwsh      ksh       zsh
>>                                   mksh, posh, pdksh   fnmatch
>> \243]          [\243]]            match               match     match     match     match     match     match
>> \243]          ?                  no match            match     match     match     match     match     match
>> \243           ?                  match               match     match     match     match     match     match
>> \243           [\243]             match               match     match     match     no match  no match  error
>> \243           [\243!]            match               match     match     match     match     match     match
>> \243]          [\243!]]           match               match     no match  no match  no match  match     no match
>> \243]          ?]                 match               match     no match  no match  no match  no match  no match
>> \243]          *]                 match               match     no match  no match  no match  no match  match
>>
>> The tests involving \243 are run in a Big5 environment. In Big5,
>> \243\135 is the representation of β, a single valid character, even
>> though \135 on its own is still the single character ].
>
> Seem also a bit strange to me,... all shells match \243 against ? ...
> i.e. ? matches a single byte that is not a character... but later on it
> doesn't work again with \243] and ?]

Remember that \243] is a single character β. \243] is not
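The \200\200 check described in this message can be reproduced with a few lines of shell. This is a sketch written for illustration, not the script actually used for the tests; the variable names are invented. In the C locale the two bytes are simply two single-byte characters, and according to the results reported above, the locale-aware shells tested give the same answers in a UTF-8 locale.

```shell
# Two bytes that are each invalid on their own in UTF-8
# (and two single-byte characters in the C locale).
s=$(printf '\200\200')

one='no match'
case $s in
    ?) one=match ;;
esac

two='no match'
case $s in
    ??) two=match ;;
esac

printf 'pattern ?  gives: %s\n' "$one"
printf 'pattern ?? gives: %s\n' "$two"
```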
Re: [Issue 8 drafts 0001564]: clariy on what (character/byte) strings pattern matching notation should work
On Sun, 2022-05-15 at 16:14 +0100, Harald van Dijk wrote:
> Please see the tests and results here.

So dash/ash/mksh/posh/pdksh,... and every other shell that doesn't handle locales at all (and thus works in the C locale)... is anyway always right (except for bugs), since any (non-NUL) byte is treated as a character.

For the other shells (and fnmatch):

> String         Pattern            dash, busybox ash,  glibc     bash      bosh      gwsh      ksh       zsh
>                                   mksh, posh, pdksh   fnmatch
> \303\244       [\303\244]         no match            match     match     match     match     match     match
> \303\244       ?                  no match            match     match     match     match     match     match
> \303           [\303]             match               match     match     match     match     match     match
> \303           ?                  match               match     match     match     match     match     match

The above, AFAIU, mean that any shell/fnmatch matches a valid multibyte character... but also a byte that is not a character in the locale.

> String         Pattern            dash, busybox ash,  glibc     bash      bosh      gwsh      ksh       zsh
>                                   mksh, posh, pdksh   fnmatch
> \303.\303\244  [\303].[\303\244]  no match            no match  no match  match     match     match     match
> \303.\303\244  ?.?                no match            no match  no match  match     match     match     match
> \303\303\244   [\303][\303\244]   no match            no match  no match  match     match     match     match
> \303\303\244   ??                 no match            no match  no match  match     match     match     match
> \303\244.\303  [\303\244].[\303]  no match            no match  no match  match     match     match     match
> \303\244.\303  ?.?                no match            no match  no match  match     match     match     match
> \303\244\303   [\303\244][\303]   no match            no match  no match  match     match     match     match
> \303\244\303   ??                 no match            no match  no match  match     match     match     match

The above, I'm not quite sure what these tell/prove... I assume the ones with '?': that for all except bash/fnmatch '?' matches both, valid characters and a single byte that is no character.
And the ones with bracket expression, that these also work when the BE has either a valid character or a byte (that is not a character) and vice-versa?

If Chet is reading along, is the above intended in bash, or considered a bug?

IMO it would have been interesting to see whether ? would also match multiple bytes that are each for themselves and together no valid character... cause for '*' one can kinda assume that it has this "match anything" meaning... one could also say that is more or less reasonable that '?' matches a single invalid byte... but why not several of them?

> String         Pattern            dash, busybox ash,  glibc     bash      bosh      gwsh      ksh       zsh
>                                   mksh, posh, pdksh   fnmatch
> \303\244       \303*              match               match     match     match     no match  match     no match
> \303\244       \303?              match               match     match     no match  no match  match     no match
> \303\244       [\303]*            match               match     match     match     no match  match     no match
> \303\244       [\303]?            match               match     match     no match  no match  match     no match
> \303\244       *\204              match               match     match     no match  no match  no match  match
> \303\244       ?\204              match               match     match     no match  no match  no match  no match
> \303\244       *[\204]            match               match     match     no match  no match  no match  no match
> \303\244       ?[\204]            match               match     match     no match  no match  no match  no match

So unlike before, in the above bash/fnmatch do seem to let '?' match a single byte that is not a character... and the remaining ones have quite mixed feelings

> String         Pattern            dash, busybox ash,  glibc     bash      bosh      gwsh      ksh       zsh
>                                   mksh, posh, pdksh   fnmatch
> \243]          [\243]]            match               match     match     match     match     match     match
> \243]          ?                  no match            match     match     match     match     match     match
> \243           ?                  match               match     match     match     match     match     match
> \243           [\243]             match               match     match     match     no match  no match  error
> \243           [\243!]            match               match     match     match     match     match     match
> \243]          [\243!]]           match               match     no match  no match  no match  match     no match
> \243]          ?]                 match               match     no match  no match  no match  no match  no match
> \243]          *]                 match               match     no match  no match  no match  no match  match
>
> The tests involving \243 are run in a Big5 environment. In Big5,
> \243\135 is the representation of β, a single valid character, even
> though \135 on its own is still the single character ].

Seem also a bit strange to me,... all shells match \243 against ? ... i.e. ? matches a single byte that is not a character... but later on it doesn't work again with \243] and ?]

> The other shells, when either the string or the pattern are not valid
> in the current locale, are not in agreement on whether parts of the
> rest of the string or the pattern are still interpreted according to
> the current locale, and if so, which parts.

I assume this effectively puts an end to any efforts of standardising this for byte strings, or what is your conclusion?
Re: [Issue 8 drafts 0001564]: clariy on what (character/byte) strings pattern matching notation should work
On 19/04/2022 01:52, Harald van Dijk via austin-group-l at The Open Group wrote:
> On 15/04/2022 04:57, Christoph Anton Mitterer wrote:
>> On Fri, 2022-04-15 at 00:44 +0100, Harald van Dijk via austin-group-l at The Open Group wrote:
>>> If there is interest in getting this standardised, I can spend some
>>> more time on creating some hopefully comprehensive tests for this to
>>> confirm in what cases shells agree and disagree, and use that as a
>>> basis for proposing wording to cover it.
>>
>> I'd love to see that and if you'd actually do so, I'd kindly ask Geoff
>> to defer any changes in the ticket #1564 of mine, until it can be said
>> whether it might be possible to get that standardised.
>
> Very well, I will post tests and test results as soon as I can make the
> time for it.

Please see the tests and results here. Apologies for the HTML mail but this is hard to make readable in plain text.

Column (1) covers dash, busybox ash, mksh, posh and pdksh.

String         Pattern            (1)       glibc fnmatch  bash      bosh      gwsh      ksh       zsh
\303\244       [\303\244]         no match  match          match     match     match     match     match
\303\244       ?                  no match  match          match     match     match     match     match
\303           [\303]             match     match          match     match     match     match     match
\303           ?                  match     match          match     match     match     match     match
\303.\303\244  [\303].[\303\244]  no match  no match       no match  match     match     match     match
\303.\303\244  ?.?                no match  no match       no match  match     match     match     match
\303\303\244   [\303][\303\244]   no match  no match       no match  match     match     match     match
\303\303\244   ??                 no match  no match       no match  match     match     match     match
\303\244.\303  [\303\244].[\303]  no match  no match       no match  match     match     match     match
\303\244.\303  ?.?                no match  no match       no match  match     match     match     match
\303\244\303   [\303\244][\303]   no match  no match       no match  match     match     match     match
\303\244\303   ??                 no match  no match       no match  match     match     match     match
\303\244       \303*              match     match          match     match     no match  match     no match
\303\244       \303?              match     match          match     no match  no match  match     no match
\303\244       [\303]*            match     match          match     match     no match  match     no match
\303\244       [\303]?            match     match          match     no match  no match  match     no match
\303\244       *\204              match     match          match     no match  no match  no match  match
\303\244       ?\204              match     match          match     no match  no match  no match  no match
\303\244       *[\204]            match     match          match     no match  no match  no match  no match
\303\244       ?[\204]            match     match          match     no match  no match  no match  no match
\243]          [\243]]            match     match          match     match     match     match     match
\243]          ?                  no match  match          match     match     match     match     match
\243           ?                  match     match          match     match     match     match     match
\243           [\243]             match     match          match     match     no match  no match  error
\243           [\243!]            match     match          match     match     match     match     match
\243]          [\243!]]           match     match          no match  no match  no match  match     no match
\243]          ?]                 match     match          no match  no match  no match  no match  no match
\243]          *]                 match     match          no match  no match  no match  no match  match

The tests involving \303 and \244 are run in an UTF-8 environment. In UTF-8, \303\244 is the
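For readers who want to try a row of the table in their own shell, the following is a minimal sketch. It is not the script that produced the posted results; the helper names try_match and bytes are made up for illustration, and it assumes a shell that treats an unquoted parameter in a case label as a pattern (POSIX sh behaviour). Outside the C locale the answers are expected to differ between shells, which is the point of the table.

```sh
# bytes ESCAPES
# Hypothetical helper: turns octal escapes such as '\303\244' into raw bytes.
# The argument is used as a printf format, so it must not contain '%'.
bytes() {
    printf "$1"
}

# try_match STRING PATTERN
# Prints "match" or "no match" using the shell's own case matching,
# evaluated in whatever locale is in effect when it is called.
try_match() {
    case $1 in
        $2) printf '%s\n' "match" ;;
        *)  printf '%s\n' "no match" ;;
    esac
}

# Example: the first row of the table, in the caller's locale.
try_match "$(bytes '\303\244')" "$(bytes '[\303\244]')"
```

Running the same call under several shells and locales gives one cell of the table per run.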
Re: [Issue 8 drafts 0001564]: clariy on what (character/byte) strings pattern matching notation should work
[accidentally replied privately, re-sending to list]

On 22/04/2022 10:54, Geoff Clare via austin-group-l at The Open Group wrote:
> Harald van Dijk wrote, on 15 Apr 2022:
>> For the most part(*), those shells that support locales appear to
>> already be in agreement that single bytes that do not form a valid
>> multi-byte character are interpreted as single characters that can be
>> matched with *, with ?, and with those single bytes themselves. Shells
>> are not in agreement on whether such single bytes can be matched with
>> [...], nor in those shells where they can be, whether multiple bracket
>> expressions can be used to match the individual bytes of a valid
>> multi-byte character.
>
> Shells are not the only issue here. Pattern matching or expansion is
> also performed by find, pax, fnmatch() and glob(), and there is no
> agreement between those. E.g. GNU find -name does not match such bytes
> with * or ?, whereas Solaris find does, and the glibc fnmatch() returns
> an error (not just "no match") if the string to be matched does not
> consist of valid characters.

Good point that those need to be checked too. However, what you describe is only what older versions of GNU fnmatch() did. It was regarded as a bug, and fixed.

Cheers,
Harald van Dijk
Re: [Issue 8 drafts 0001564]: clariy on what (character/byte) strings pattern matching notation should work
Harald van Dijk wrote, on 15 Apr 2022:
>
> For the most part(*), those shells that support locales appear to already be
> in agreement that single bytes that do not form a valid multi-byte character
> are interpreted as single characters that can be matched with *, with ?, and
> with those single bytes themselves. Shells are not in agreement on whether
> such single bytes can be matched with [...], nor in those shells where they
> can be, whether multiple bracket expressions can be used to match the
> individual bytes of a valid multi-byte character.

Shells are not the only issue here. Pattern matching or expansion is also performed by find, pax, fnmatch() and glob(), and there is no agreement between those. E.g. GNU find -name does not match such bytes with * or ?, whereas Solaris find does, and the glibc fnmatch() returns an error (not just "no match") if the string to be matched does not consist of valid characters.

--
Geoff Clare
The Open Group, Apex Plaza, Forbury Road, Reading, RG1 1AX, England
Re: [Issue 8 drafts 0001564]: clariy on what (character/byte) strings pattern matching notation should work
Robert Elz wrote, on 15 Apr 2022:
>
> | That is how things are at present. The suggested changes just make it
> | explicit.
>
> Yes, I know, but that's what I am suggesting that we not do in this one case.
>
> | Do you have an alternative proposal?
>
> Only to the extent of "do nothing". I am certainly not suggesting that
> we attempt to solve the problem.
>
> Except perhaps it might be worth adding something to the Rationale (but
> about what, ie: where there, I have no idea) along the lines of:
>
>     It is often unclear whether a string is to be interpreted as
>     characters in some locale, or as an arbitrary byte string.
>     While it would have been possible to arbitrarily make the various
>     cases more explicit, or explicitly unspecified, it was considered
>     better, in this version of to
>     make no changes, as it is believed that much additional work is
>     required to enable a standards-worthy specification possible.
>     This work is beyond the scope of this standard.

It makes no difference to the requirements of the standard whether we state explicitly in normative text that something is unspecified or acknowledge in rationale that it is implicitly unspecified. Personally I prefer to have it explicit in normative text so that readers don't have to dig through rationale to find out about it (or worse, report that the normative text is unclear because they didn't see the rationale).

> The problem I see, is that any specification at all of any of this,
> allows implementors to just say "that is what posix requires" and do
> nothing at all, where we really need some innovation, by someone who
> actually understands the issues and how to deal with them in a rational
> way - or at least who can come up with some kind of plan, and without
> any possibility of being considered a non-conformant implementation
> because of it.

I don't see why a statement in normative text about unspecified behaviour would have any effect on implementors' attitudes to changing their implementation from one allowed behaviour to another.

> | The application can document that it requires pathnames to be in the
> | same encoding as the user's locale.
>
> That's not sufficient. Try encoding a find command to look for pathnames
> containing currency symbols. It should be just a simple
>     find -name '*[ABCD]*'
> type operation, with appropriate substitutions for the ABCD chars.

If all filenames encountered are encoded in the user's locale (as per the application's documented requirement) and the pattern is encoded in the user's locale, then POSIX requires this use of find by the application to work.

> | > Even worse perhaps, ???.doc which should match 7 char
> | > names that end in ".doc" (or is that 7 byte names?) (not counting the \0).
> |
> | It would match 7-byte names.
>
> Yes, in the C locale it would. But do you believe that is what the user
> would have intended? Are they to be required to work out how many bytes
> their local filenames are encoded as, and enter the appropriate number of '?'
> chars?

The point of adding an explicit statement to normative text is to draw attention to the issue. Thus users should be aware that ???.doc will match 7-byte names, and if that's not what they want then they will need to find a different way to obtain the result they want.

--
Geoff Clare
The Open Group, Apex Plaza, Forbury Road, Reading, RG1 1AX, England
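To make the "7-byte names" point concrete, here is a small sketch of matching in the C locale, where every byte is one character. The function name matches_in_c_locale is hypothetical, and the sample name assumes UTF-8 encoding, in which the two bytes \303\244 encode a single character (so the name below is 6 characters but 7 bytes).

```sh
# matches_in_c_locale STRING PATTERN
# Succeeds if STRING matches PATTERN when each byte counts as one character.
# The function body is a subshell, so the caller's locale is left untouched.
matches_in_c_locale() (
    LC_ALL=C
    export LC_ALL
    case $1 in
        $2) exit 0 ;;
        *)  exit 1 ;;
    esac
)

# \303\244 followed by "b.doc": 6 characters in UTF-8, but 7 bytes.
name=$(printf '\303\244b.doc')
if matches_in_c_locale "$name" '???.doc'; then
    printf '%s\n' "matched as a 7-byte name"
fi
```

A user who wanted "three characters, then .doc" would have to match in their own locale instead, where the result for names that are not valid character strings is the unspecified case under discussion.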
Re: [Issue 8 drafts 0001564]: clariy on what (character/byte) strings pattern matching notation should work
Hey.

On Tue, 2022-04-19 at 01:52 +0100, Harald van Dijk wrote:
> Even I did not to apply this to pattern matching. The
> lexical locale, the locale used for lexing, is only used for lexing,
> i.e. for recognising tokens, not to how those tokens are then
> interpreted later on. If locale comes into play for that, as it does
> in pattern matching, it is the then-current value of LC_CTYPE that
> comes into play, as it does in other shells.

So... how is (as per the standard) it intended to work?

My understanding was that if during lexing it sees a pattern '*∈' it would store the binary representation (as following from the lexical locale, in which the shell script/input is in principle expected to be) of these characters for the pattern.
But when the actual pattern matching is done, it would interpret that binary representation with respect to the current locale (LC_CTYPE).
So if by then, the binary representation of the script's '*∈' would mean '*z?' in the current locale, it would use that meaning as the pattern.

Does that sound right?

'∈' not being a member of the portable character set would make it, AFAIU, in principle valid for being mapped to `z?` in another locale - while changing the mapping of '*' would be possible, but according to POSIX produce undefined results.
("If the encoded values associated with each member of the portable character set are not invariant across all locales supported by the implementation, if an application uses any pair of locales where the character encodings differ, or accesses data from an application using a locale which has different encodings from the locales used by the application, the results are unspecified.")

> As for future directions, no opinion on that from me.

That would IMO only make sense, if e.g. there was only one and not even well maintained shell that behaves different from all others.
The "future directions" would indicate to possible new implementers where things may go and what they should do.
10 years later, one could re-visit the topic, and if that one shell that behaved different from all others had died in the meantime, and any possible new ones followed the future directions... one could standardise it.
If not, one could simply leave everything as is and no one would get into troubles.

Whether such approach actually works out as intended is of course not guaranteed.

> I would not think this should be a special case: «${foo%.}» should
> strip a trailing «.» in exactly those cases where the shell considers
> foo to match the pattern «*.». However, I can see value in doing some
> extra tests to verify that this matches what shells do.

Remember that it might not be enough to check whether such shell strip off correctly when one has the case but also the case where one or more trailing bytes of the first group and the bytes of the valid character form a new valid character.
While this wouldn't be possible if '.' is the character (because of its special properties)... it can happen with other characters in some special locales.

> Very well, I will post tests and test results as soon I can make the
> time for it.

Thanks.

FYI: I think the outcome will also affect the current proposal for #1561:
https://www.austingroupbugs.net/view.php?id=1561#c5795
in specific the part:
On page 2321 line 74857 section 2.6.2 Parameter Expansion, change:

Thanks,
Chris.
Re: [Issue 8 drafts 0001564]: clariy on what (character/byte) strings pattern matching notation should work
On 15/04/2022 04:57, Christoph Anton Mitterer wrote:
> On Fri, 2022-04-15 at 00:44 +0100, Harald van Dijk via austin-group-l at The Open Group wrote:
>> Hmm, I would.
>
> I like that :-D
> This would have been the preferred alternative I've asked for to look
> at, in the ticket.
>
>> Shells are not in agreement on whether such single bytes can be
>> matched with [...], nor in those shells where they can be, whether
>> multiple bracket expressions can be used to match the individual bytes
>> of a valid multi-byte character.
>>
>> The cases with [...] only come up when scripts themselves use patterns
>> that are not valid character strings
>
> You mean in the lexical locale?

I do not, but interesting question. I am one of the few, if not only, shell authors that actually implemented the "Changing the value of LC_CTYPE after the shell has started shall not affect the lexical processing of shell commands in the current shell execution environment or its subshells" rule. Even I did not to apply this to pattern matching. The lexical locale, the locale used for lexing, is only used for lexing, i.e. for recognising tokens, not to how those tokens are then interpreted later on. If locale comes into play for that, as it does in pattern matching, it is the then-current value of LC_CTYPE that comes into play, as it does in other shells.

>> they are unlikely to affect existing scripts and I imagine there is
>> not much harm in leaving those unspecified.
>
> It should however be clearly described that behaviour in this field is
> undefined, perhaps with some "future directions" that this might change
> some day.

I prefer explicit over implicit as well myself. Perhaps it does not even need to be undefined though, perhaps unspecified with a few limited options is good enough. I am not sure at this time whether that is feasible. As for future directions, no opinion on that from me.

>> The cases with * and ? do come up in existing scripts, but if shells
>> are in agreement as they appear to be, there is no need to coordinate
>> with shell authors on whether they would be willing to change their
>> implementations, it is possible to change POSIX to describe the
>> shells' current behaviour.
>
> Well but it's not only * and ? ... it's also a single character
> matching that character in a byte string that contains bytes or
> sequences thereof which do not form any valid character ... both before
> or after that character to be matched.

Yes, I did mention those earlier on in my message but forgot to repeat it here. It's where shells also appear to be in agreement, except in the same corner case that also applies to [...] where an invalid byte in a pattern is used to match part of a valid character in the string.

> And since pattern matching notation isn't just used for matching alone,
> but e.g. also for string manipulation in parameter expansion (e.g.
> "${foo%.}" case)... these shells would also need to agree how to handle
> that, wouldn't they?

I would not think this should be a special case: «${foo%.}» should strip a trailing «.» in exactly those cases where the shell considers foo to match the pattern «*.». However, I can see value in doing some extra tests to verify that this matches what shells do.

>> If there is interest in getting this standardised, I can spend some
>> more time on creating some hopefully comprehensive tests for this to
>> confirm in what cases shells agree and disagree, and use that as a
>> basis for proposing wording to cover it.
>
> I'd love to see that and if you'd actually do so, I'd kindly ask Geoff
> to defer any changes in the ticket #1564 of mine, until it can be said
> whether it might be possible to get that standardised.

Very well, I will post tests and test results as soon as I can make the time for it.

Cheers,
Harald van Dijk
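The expectation that «${foo%.}» strips a trailing «.» exactly when foo matches «*.» can be checked mechanically. The following is only a sketch of such a check, with a made-up function name; it says nothing about which shells pass it for values that are not valid character strings in the current locale.

```sh
# strip_agrees_with_match VALUE
# Succeeds if ${VALUE%.} removed something exactly when VALUE matched
# the pattern *. in the current locale, and fails if the two disagree.
strip_agrees_with_match() {
    value=$1
    case $value in
        *.) matched=yes ;;
        *)  matched=no ;;
    esac
    if [ "${value%.}" != "$value" ]; then
        stripped=yes
    else
        stripped=no
    fi
    [ "$matched" = "$stripped" ]
}

# Example: a value with a stray byte before the trailing dot.
if strip_agrees_with_match "$(printf 'abc\303.')"; then
    printf '%s\n' "consistent"
else
    printf '%s\n' "inconsistent"
fi
```

Feeding it the strings from the posted test table, with a «.» appended, would be one way to extend those tests to parameter expansion.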
Re: [Issue 8 drafts 0001564]: clariy on what (character/byte) strings pattern matching notation should work
On Fri, 2022-04-15 at 00:44 +0100, Harald van Dijk via austin-group-l at The Open Group wrote:
> Hmm, I would.

I like that :-D
This would have been the preferred alternative I've asked for to look at, in the ticket.

> Shells are not in agreement on whether such single bytes can be
> matched with [...], nor in those shells where they can be, whether
> multiple bracket expressions can be used to match the individual bytes
> of a valid multi-byte character.
>
> The cases with [...] only come up when scripts themselves use patterns
> that are not valid character strings

You mean in the lexical locale?

> they are unlikely to affect existing scripts and I imagine there is
> not much harm in leaving those unspecified.

It should however be clearly described that behaviour in this field is undefined, perhaps with some "future directions" that this might change some day.

> The cases with * and ? do come up in existing scripts, but if shells
> are in agreement as they appear to be, there is no need to coordinate
> with shell authors on whether they would be willing to change their
> implementations, it is possible to change POSIX to describe the
> shells' current behaviour.

Well but it's not only * and ? ... it's also a single character matching that character in a byte string that contains bytes or sequences thereof which do not form any valid character ... both before or after that character to be matched.

And since pattern matching notation isn't just used for matching alone, but e.g. also for string manipulation in parameter expansion (e.g. "${foo%.}" case)... these shells would also need to agree how to handle that, wouldn't they?

> If there is interest in getting this standardised, I can spend some
> more time on creating some hopefully comprehensive tests for this to
> confirm in what cases shells agree and disagree, and use that as a
> basis for proposing wording to cover it.

I'd love to see that and if you'd actually do so, I'd kindly ask Geoff to defer any changes in the ticket #1564 of mine, until it can be said whether it might be possible to get that standardised.

Thanks,
Chris.
Re: [Issue 8 drafts 0001564]: clariy on what (character/byte) strings pattern matching notation should work
On Tue, 2022-04-12 at 18:54 +0700, Robert Elz via austin-group-l at The Open Group wrote:
> The point was that, at least as I read the proposed text, you're
> defining things like '*' to only work (reliably as specified) when the
> locale is POSIX (aka C). In the user's locale, who knows what happens?

TBH, I'm not sure whether I understand the problem.

What means "reliably"? That it always matches any file not starting with a "."? One could say it's simply not defined as that.

*If* pattern matching notation is considered to work on character strings only, then the logical consequence would be that * doesn't necessarily match $'\xFF'.doc in an UTF-8 locale.

*Not necessarily* - since Geoff's proposal defines the behaviour in that case clearly as unspecified, and thus a shell could do what e.g. bash seems to do (or at least that was my understanding) ... and simply carry on any invalid bytes in strings.
But Shell XYZ may choose to not match.

Geoff's proposal doesn't seem to codify anything, which isn't already (and unfortunately) allowed anyway... it just clarifies the ambiguity by the inconsistent use of defined terms.

Which makes it clear to e.g. some random guy like me and the command substitution with trailing newline case - that I *cannot* simply assume, that (because of the special properties of '.' or '/' as characters) it would be enough to simply use one of these two and stripping them off would work for sure in *any* conforming locale...

On Fri, 2022-04-15 at 06:03 +0700, Robert Elz via austin-group-l at The Open Group wrote:
> | Do you have an alternative proposal?
>
> Only to the extent of "do nothing". I am certainly not suggesting
> that we attempt to solve the problem.

... without such clarification (as made by the proposal) I could have just read into the standard what I wanted it to say, i.e. that the matching *has* to work ("as expected") on (byte-)strings and that it would be enough to just strip off a trailing '.' without any LC_ALL=C games.
But others may - in the current form of the standard - choose to interpret it as defined-on-character-strings-only.

So I think it's better to clarify (even if it's just the "it's unspecified"), than to leave it ambiguous.
One could however add a future direction, telling that this might be defined some day.

> The problem I see, is that any specification at all of any of this,
> allows implementors to just say "that is what posix requires" and do
> nothing at all, where we really need some innovation, by someone who
> actually understands the issues and how to deal with them in a
> rational way - or at least who can come up with some kind of plan, and
> without any possibility of being considered a non-conformant
> implementation because of it.

I'd also prefer something that doesn't result in "undefined behaviour", "implementation defined" or similar.

Despite being not an expert, it seems unlikely that different character encodings will ever go away - even if all people would actually use UTF-8, how would you ever manage to get rid of the whole framework for different char encodings?

So the only other way would seem to be to specify how pattern matching notation should work on byte strings.
Which I'd also prefer to have... but is there any consensus for this (which will then actually be adhered to by shell implementors)?

Cheers,
Chris.
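For context, the "command substitution with trailing newline case" mentioned above is usually handled with a sentinel character that is stripped afterwards. A minimal sketch of the variant with the "LC_ALL=C games" follows; the function name capture_exact and the variable names are made up for illustration, and it assumes the shell applies an assignment to LC_ALL to its own subsequent pattern matching.

```sh
# capture_exact COMMAND [ARG...]
# Runs COMMAND and stores its output, including trailing newlines, in the
# variable "captured". A "." sentinel protects the newlines from being
# removed by command substitution and is stripped in the C locale, where
# the final byte is always a character of its own.
capture_exact() {
    captured=$("$@"; printf .)
    if [ "${LC_ALL+set}" = set ]; then
        saved_lc_all=$LC_ALL
        LC_ALL=C
        captured=${captured%.}
        LC_ALL=$saved_lc_all        # restore the previous value
    else
        LC_ALL=C
        captured=${captured%.}
        unset LC_ALL                # it was unset before, so unset it again
    fi
}

capture_exact printf 'line\n\n'
printf '%s' "$captured" | od -c     # shows both newlines were kept
```

Whether the locale switch can be dropped is exactly what depends on the outcome of this issue.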
[Issue 8 drafts 0001564]: clariy on what (character/byte) strings pattern matching notation should work
A NOTE has been added to this issue.
https://www.austingroupbugs.net/view.php?id=1564

(0005805) calestyo (reporter) - 2022-04-15 02:17
https://www.austingroupbugs.net/view.php?id=1564#c5805

Re: https://www.austingroupbugs.net/view.php?id=1561#c5796

In principle that clarifies my original point. (Although I'd probably have preferred if all (relevant) implementations just behave the same already... and this could be made non-undefined.)

Do you think something should be done about fnmatch(), page 879? While it refers to sections 2.13.1 and 2.13.2 (which also fall under your proposed changes in 2.13) it still uses "string" all over and doesn't mention any dependency on the locale's character encoding?
Issue History
Date Modified     Username    Field               Change
2022-02-23 01:54  calestyo    New Issue
2022-02-23 01:54  calestyo    Name                => Christoph Anton Mitterer
2022-02-23 01:54  calestyo    Section             => 2.13 Pattern Matching Notation
2022-02-23 01:54  calestyo    Page Number         => 2351
2022-02-23 01:54  calestyo    Line Number         => 76099
2022-02-25 04:57  calestyo    Note Added: 0005716
2022-02-25 20:54  mirabilos   Note Added: 0005719
2022-03-03 03:37  calestyo    Note Added: 0005729
2022-04-07 16:30  geoffclare  Relationship added  related to 0001561
2022-04-11 13:55  geoffclare  Note Added: 0005796
2022-04-11 22:58  kre         Note Added: 0005797
2022-04-12 08:51  geoffclare  Note Added: 0005798
2022-04-15 02:12  calestyo    Note Added: 0005804
2022-04-15 02:17  calestyo    Note Added: 0005805
[Issue 8 drafts 0001564]: clariy on what (character/byte) strings pattern matching notation should work
A NOTE has been added to this issue.
https://www.austingroupbugs.net/view.php?id=1564

(0005804) calestyo (reporter) - 2022-04-15 02:12
https://www.austingroupbugs.net/view.php?id=1564#c5804

Re: https://www.austingroupbugs.net/view.php?id=1561#c5797

Well this proposal is not really changing anything, is it? Why do you think it's worse to name that something results in undefined behaviour than not saying anything at all and leave it ambiguous (especially when the wording is already contradictory)?

Also... it doesn't seem as if locales would ever go away. And even if the POSIX/C community (and all other affected groups) would decide tomorrow to abolish locales or at least the choice of character encodings and make all UTF-8... there would be still millions of lines of code which assume different character encodings to exist and which thus somehow need to be defined in a proper manner.
Re: [Issue 8 drafts 0001564]: clariy on what (character/byte) strings pattern matching notation should work
    Date:        Thu, 14 Apr 2022 09:42:37 +0100
    From:        "Geoff Clare via austin-group-l at The Open Group"
    Message-ID:  <20220414084237.GA15370@localhost>

  | That is how things are at present. The suggested changes just make it
  | explicit.

Yes, I know, but that's what I am suggesting that we not do in this one case.

  | Do you have an alternative proposal?

Only to the extent of "do nothing". I am certainly not suggesting that we attempt to solve the problem.

Except perhaps it might be worth adding something to the Rationale (but about what, ie: where there, I have no idea) along the lines of:

    It is often unclear whether a string is to be interpreted as
    characters in some locale, or as an arbitrary byte string.
    While it would have been possible to arbitrarily make the various
    cases more explicit, or explicitly unspecified, it was considered
    better, in this version of to
    make no changes, as it is believed that much additional work is
    required to enable a standards-worthy specification possible.
    This work is beyond the scope of this standard.

The problem I see, is that any specification at all of any of this, allows implementors to just say "that is what posix requires" and do nothing at all, where we really need some innovation, by someone who actually understands the issues and how to deal with them in a rational way - or at least who can come up with some kind of plan, and without any possibility of being considered a non-conformant implementation because of it.

  | The application can document that it requires pathnames to be in the
  | same encoding as the user's locale.

That's not sufficient. Try encoding a find command to look for pathnames containing currency symbols. It should be just a simple

    find -name '*[ABCD]*'

type operation, with appropriate substitutions for the ABCD chars. No problem if not all the world's currency symbols are encoded, if we find one that has been forgotten, it can simply be added.

Currency symbols are things like the $ sign, British pound, Euro, Yen, Baht, ... (there are a whole bunch of them). If there were a [:currency:] class, it would be easy (and I'd need to come up with a different example). But there isn't.

If we cannot do something this simple, and expect it to work reliably, everywhere, then what we have is useless, and needs to be replaced or reworked. That's not a standards' body type task. But we should be doing nothing to interfere with the production of a solution.

  | The C locale is specified as containing 256 single-byte characters.
  | Thus in the C locale all pathnames are valid character strings.

Sure, understood.

  | > Even worse perhaps, ???.doc which should match 7 char
  | > names that end in ".doc" (or is that 7 byte names?) (not counting the \0).
  |
  | It would match 7-byte names.

Yes, in the C locale it would. But do you believe that is what the user would have intended? Are they to be required to work out how many bytes their local filenames are encoded as, and enter the appropriate number of '?' chars?

kre
Re: [Issue 8 drafts 0001564]: clariy on what (character/byte) strings pattern matching notation should work
Robert Elz wrote, on 12 Apr 2022:
>
> | 1. The vast majority of apps will never need to do that because they know
> | (or can assume) that the pathnames they handle either always use the
> | portable filename character set or use the user's locale.
>
> The latter, perhaps, the former, certainly not in an international context.
> The point was that, at least as I read the proposed text, you're defining
> things like '*' to only work (reliably as specified) when the locale is
> POSIX (aka C). In the user's locale, who knows what happens?

That is how things are at present. The suggested changes just make it explicit.

Do you have an alternative proposal?

> | I.e. the pathnames are not arbitrary (a word I was careful to
> | include in the proposed changes).
>
> Sure, the problem is that when dealing with user input (as in, for example,
> the command line args) the application cannot assume that the pathnames are
> not arbitrary. They're anything that's OK for the user.

The application can document that it requires pathnames to be in the same encoding as the user's locale.

> | 2. In apps that truly do need to do matching or expansion on arbitrary
> | pathnames, a C program can call uselocale() before and after calls to
> | fnmatch(), glob(), and wordexp(). A shell script can set LC_ALL=C before
> | handling pathnames (and unset it or restore it afterwards).
>
> But how does that help *.doc (in a defined way, as opposed to "of course
> that works in all glob implementations") match a filename that isn't
> entirely ascii (by which I mean, using characters only from the portable
> character set)?

The C locale is specified as containing 256 single-byte characters. Thus in the C locale all pathnames are valid character strings.

> Even worse perhaps, ???.doc which should match 7 char
> names that end in ".doc" (or is that 7 byte names?) (not counting the \0).

It would match 7-byte names.

--
Geoff Clare
The Open Group, Apex Plaza, Forbury Road, Reading, RG1 1AX, England
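As an illustration of the "set LC_ALL=C before handling pathnames" suggestion for shell scripts, the sketch below does the pathname expansion inside a subshell, so that restoring the locale afterwards happens implicitly. The function name list_docs_bytewise is made up for this example, and it assumes the shell honours a change of LC_ALL for later expansions in the same (sub)shell.

```sh
# list_docs_bytewise DIR
# Prints the pathnames in DIR that end in ".doc", one per line. The pattern
# is expanded in the C locale, where every pathname is a valid string of
# single-byte characters. The function body is a subshell, so the caller's
# locale is unchanged on return.
list_docs_bytewise() (
    LC_ALL=C
    export LC_ALL
    for f in "$1"/*.doc; do
        [ -e "$f" ] || continue     # an unmatched pattern stays a literal word
        printf '%s\n' "$f"
    done
)

list_docs_bytewise .
```

Output is one name per line, so names containing newlines would need a different output convention; that is outside what this sketch tries to show.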
Re: [Issue 8 drafts 0001564]: clariy on what (character/byte) strings pattern matching notation should work
    Date:        Tue, 12 Apr 2022 08:51:51 +
    From:        "Austin Group Bug Tracker via austin-group-l at The Open Group"
    Message-ID:  <1541e949d4c9cd28467acf6033bfd...@austingroupbugs.net>

That is, Geoff Clare:

  | 1. The vast majority of apps will never need to do that because they know
  | (or can assume) that the pathnames they handle either always use the
  | portable filename character set or use the user's locale.

The latter, perhaps, the former, certainly not in an international context. The point was that, at least as I read the proposed text, you're defining things like '*' to only work (reliably as specified) when the locale is POSIX (aka C). In the user's locale, who knows what happens?

  | I.e. the pathnames are not arbitrary (a word I was careful to
  | include in the proposed changes).

Sure, the problem is that when dealing with user input (as in, for example, the command line args) the application cannot assume that the pathnames are not arbitrary. They're anything that's OK for the user.

  | 2. In apps that truly do need to do matching or expansion on arbitrary
  | pathnames, a C program can call uselocale() before and after calls to
  | fnmatch(), glob(), and wordexp(). A shell script can set LC_ALL=C before
  | handling pathnames (and unset it or restore it afterwards).

But how does that help *.doc (in a defined way, as opposed to "of course that works in all glob implementations") match a filename that isn't entirely ascii (by which I mean, using characters only from the portable character set)? Even worse perhaps, ???.doc which should match 7 char names that end in ".doc" (or is that 7 byte names?) (not counting the \0).

Anyone from outside the English speaking world is likely to encounter many of those.

kre
[Issue 8 drafts 0001564]: clariy on what (character/byte) strings pattern matching notation should work
A NOTE has been added to this issue.
https://austingroupbugs.net/view.php?id=1564

(0005798) geoffclare (manager) - 2022-04-12 08:51
https://austingroupbugs.net/view.php?id=1564#c5798

> How can a conforming application possibly (sanely) ensure the C locale
> is in use when performing pathname expansion using user input that has
> been presented in the user's locale

1. The vast majority of apps will never need to do that because they know (or can assume) that the pathnames they handle either always use the portable filename character set or use the user's locale. I.e. the pathnames are not arbitrary (a word I was careful to include in the proposed changes).

2. In apps that truly do need to do matching or expansion on arbitrary pathnames, a C program can call uselocale() before and after calls to fnmatch(), glob(), and wordexp(). A shell script can set LC_ALL=C before handling pathnames (and unset it or restore it afterwards).
[Issue 8 drafts 0001564]: clariy on what (character/byte) strings pattern matching notation should work
A NOTE has been added to this issue.
==
https://austingroupbugs.net/view.php?id=1564
==
--
(0005797) kre (reporter) - 2022-04-11 22:58
https://austingroupbugs.net/view.php?id=1564#c5797
--
How can a conforming application possibly (sanely) ensure the C locale is in use when performing pathname expansion using user input that has been presented in the user's locale (and if that is not to be allowed, how can the user ever sanely use pathnames containing characters that are not ASCII, and if that is not to be allowed, what good are locales?)

I truly wish we could simply stop attempting to make the standard consistent with regard to the current locale mess; it is all way too broken to be useful.
[Issue 8 drafts 0001564]: clariy on what (character/byte) strings pattern matching notation should work
A NOTE has been added to this issue.
==
https://austingroupbugs.net/view.php?id=1564
==
--
(0005796) geoffclare (manager) - 2022-04-11 13:55
https://austingroupbugs.net/view.php?id=1564#c5796
--
Suggested changes...

On page 2351 line 76098 section 2.13 Pattern Matching Notation, change:

    The pattern matching notation described in this section is used to specify patterns for matching strings in the shell.

to:

    The pattern matching notation described in this section is used to specify patterns for matching character strings in the shell.

After page 2351 line 76102 section 2.13 Pattern Matching Notation, add a new paragraph:

    If an attempt is made to use pattern matching notation to match a string that contains one or more bytes that do not form part of a valid character, the behavior is unspecified. Since pathnames can contain such bytes, portable applications need to ensure that the current locale is the C or POSIX locale when performing pattern matching (or expansion) on arbitrary pathnames.
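The condition named in the proposed paragraph (a string containing bytes that do not form part of a valid character) can be checked explicitly. The following is only a sketch of one possible application-level policy, not something the proposed text requires: is_character_string and match_pathname are hypothetical helpers, UTF-8 stands in for the encoding of the current locale, and Python's fnmatch module stands in for POSIX pattern matching. Names that decode cleanly are matched as characters; anything else falls back to byte-wise matching, in the spirit of switching to the C locale.

```python
import fnmatch


def is_character_string(name: bytes, encoding: str = "utf-8") -> bool:
    """Return True if every byte of name is part of a valid character."""
    try:
        name.decode(encoding)
    except UnicodeDecodeError:
        return False
    return True


def match_pathname(name: bytes, pattern: str, encoding: str = "utf-8") -> bool:
    """Match as characters when possible, otherwise match as raw bytes."""
    if is_character_string(name, encoding):
        return fnmatch.fnmatchcase(name.decode(encoding), pattern)
    # Byte-wise fallback: each '?' now consumes one byte, not one character.
    return fnmatch.fnmatchcase(name, pattern.encode(encoding))


print(match_pathname(b"\xff.doc", "*.doc"))
print(match_pathname("\u00e9.doc".encode("utf-8"), "?.doc"))
print(match_pathname(b"\xff\xfe.doc", "??.doc"))
```

A policy like this keeps '*' usable on arbitrary pathnames, at the cost of '?' meaning "one byte" for names that are not valid character strings.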
[Issue 8 drafts 0001564]: clariy on what (character/byte) strings pattern matching notation should work
The following issue has been set as RELATED TO issue 0001561.
==
https://austingroupbugs.net/view.php?id=1564
==
Relationships        ID       Summary
--
related to           0001561  clarify what kind of data shell variabl...
==
Issue History
Date Modified     Username    Field                Change
2022-04-07 16:30  geoffclare  Relationship added   related to 0001561
==
[Issue 8 drafts 0001564]: clariy on what (character/byte) strings pattern matching notation should work
A NOTE has been added to this issue.
==
https://www.austingroupbugs.net/view.php?id=1564
==
--
(0005729) calestyo (reporter) - 2022-03-03 03:37
https://www.austingroupbugs.net/view.php?id=1564#c5729
--
Well, I guess this whole matter is also why you argued earlier that '.' as a sentinel would be enough, and that any implementation that doesn't carry invalid encodings through (i.e. bytes that do not form characters) would already be buggy in that respect.

I guess, on the one hand, Geoff is clearly right when he says that an implementation is not expected to carry out any such behaviour (especially the complicated mappings that you explained above), at least not according to the current standard. As such, '.' would in fact not be enough as the sentinel for command substitution with trailing newlines, and LC_ALL=C would be required.

OTOH, the '*' example above was intended to question whether there really are any implementations that filter out filenames containing bytes that do not form characters. I'd guess not.

So this is one more reason why I think this should be clearly specified (which is what I requested with this issue): if there's consensus that pattern matching operates on character strings only, state this clearly, declare operation on non-character strings unspecified (with respect to the results), and remove all current references that indicate it would operate on bytes.
[Issue 8 drafts 0001564]: clariy on what (character/byte) strings pattern matching notation should work
A NOTE has been added to this issue.
==
https://www.austingroupbugs.net/view.php?id=1564
==
--
(0005719) mirabilos (reporter) - 2022-02-25 20:54
https://www.austingroupbugs.net/view.php?id=1564#c5719
--
「would a "ls *" in principle be expected to only match filenames who are character strings in the current locale?」

That’d be the logical consequence of treating it as characters in the current encoding.

When I added locale “things” to my projects, I extended the definition of character. Instead of just accepting the characters that are valid in the current encoding (which is either 8-bit-transparent 7-bit ASCII or (back then BMP-only, but I’m moving to full 21-bit) UTF-8), so-called “raw octets” are also mapped into the wide character range. Every time a conversion fails, the first octet of it is handled as a raw octet, then the conversion restarts on the next one. (This can obviously be optimised for illegal UTF-8 sequences if one is careful about the beginning of the next possibly valid sequence.)

In 16-bit wchar_t times (basically “until 2022Q1”), this is mapped into a PUA range reserved by the CSUR for this. (Not quite optimal.) This is U+EF80‥U+EFFF. (What happens when you encounter \xEE\xBE\x80 can only be described as fun.)

In the new scheme, I’m mapping them to U-1080‥U-10FF which is outside of the range of things, so not a problem (except now I’m wondering what to set WCHAR_MAX to, but I think 0x10U still, because only these are, strictly speaking, valid?)

There’s a complication that has to do with the idiotic Standard C API for mbrtowc(3) in that “return value == 0” is the sole test for “*pwc == L'\0'” and so cannot be used to signal that 0 octets have been eaten, which means I might need to use even higher numbers for 2‑, 3‑ and 4-byte raw octet sequences. (The latter of which has wcwidth() == 4…) But that’s detail.

The thing relevant here is that this is (could be, but anything else is either discarding the notion of character here (which would be hard to make congruent with the existence of character classes) or an active and certainly harmful disservice to users (the “only match filenames that are valid” I quoted above)) a middle ground between characters and bytes: bytes that are characters if possible and have character semantics applied, but may not.

This is currently unspecified. I’d like to (continue) treat(ing) things in a way that means that, for example, ? is either a character or a single byte from an invalid multibyte sequence (of length 1 or more), with a subsequent ? catching a possible second byte, and so on.

Raw octets are displayed as � with a wcwidth() of 1 each (or some application-local suitable encoding, where that is possible).
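The raw-octet scheme described in note 0005719 can be approximated in a few lines. This is a sketch under stated assumptions, not mirabilos's implementation: it assumes UTF-8 as the encoding, assumes that a raw octet b in the range 0x80..0xFF maps to code point U+EF00 + b (which covers exactly U+EF80..U+EFFF), and uses a hypothetical error-handler name ("rawoctet_sketch") and helper (to_wide). Python's built-in "surrogateescape" handler follows a similar idea with a different target range.

```python
import codecs
import fnmatch

RAW_BASE = 0xEF00  # assumed mapping: raw octet b (0x80..0xFF) -> U+EF00 + b


def _raw_octet_handler(err):
    """Map the first undecodable octet to the PUA, then restart after it."""
    if not isinstance(err, UnicodeDecodeError):
        raise err
    octet = err.object[err.start]
    # Resume at the very next octet, not at the end of the failed sequence.
    return chr(RAW_BASE + octet), err.start + 1


codecs.register_error("rawoctet_sketch", _raw_octet_handler)


def to_wide(data: bytes) -> str:
    """Decode UTF-8, turning each invalid octet into one 'raw octet' character."""
    return data.decode("utf-8", errors="rawoctet_sketch")


# Each '?' matches either a real character or one raw octet.
print(fnmatch.fnmatchcase(to_wide(b"\xff\xfe.doc"), "??.doc"))

# The collision mentioned in the note: the valid sequence EE BE 80 decodes
# to U+EF80, the same code point that the raw octet 0x80 is mapped to.
print(to_wide(b"\xee\xbe\x80") == to_wide(b"\x80"))
```

With this mapping, a truncated sequence such as E2 82 becomes two raw-octet characters, so two '?' are needed to match it, as the note describes.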
[Issue 8 drafts 0001564]: clariy on what (character/byte) strings pattern matching notation should work
A NOTE has been added to this issue.
==
https://www.austingroupbugs.net/view.php?id=1564
==
--
(0005716) calestyo (reporter) - 2022-02-25 04:57
https://www.austingroupbugs.net/view.php?id=1564#c5716
--
3) A third aspect, which should perhaps be considered by someone more knowledgeable than me:

One main use of pattern matching is matching filenames (2.13.3 Patterns Used for Filename Expansion). Filenames, however, are explicitly byte strings.

So if pattern matching is indeed intended to have a specified meaning only on character strings (in the current locale), then 2.13.3 should somehow explain this, especially how such patterns should deal with filenames that aren't character strings in the current locale (e.g. error, unspecified, ignored?).

For example, would a "ls *" in principle be expected to only match filenames that are character strings in the current locale?
[Issue 8 drafts 0001564]: clariy on what (character/byte) strings pattern matching notation should work
The following issue has been SUBMITTED.
==
https://www.austingroupbugs.net/view.php?id=1564
==
Date Submitted:      2022-02-23 01:54 UTC
Last Modified:       2022-02-23 01:54 UTC
==
Summary:             clariy on what (character/byte) strings pattern matching notation should work

Description:
On the mailing list, the question arose (from my side) what the current wording in the standard implies as to whether pattern matching works on byte or character strings.

- In some earlier discussion it was pointed out that shell variables should be strings (of bytes, other than NUL)
  => which could lead one to think that pattern matching must work on any such strings

- 2.6.2 Parameter Expansion doesn't seem to say what the #, ##, % and %% special forms of expansion work on, bytes or characters; it just refers to the pattern matching chapter

- 2.13 Pattern Matching Notation says: "The pattern matching notation described in this section is used to specify patterns for matching strings in the shell."
  => strings... would mean bytes (as per 3.375 String)

- 2.13.1 Patterns Matching a Single Character however says: "The following patterns matching a single character shall match a single character: ordinary characters,..."

I questioned whether one could deduce from that that pattern matching is required to cope with any non-characters in the string it operates upon.

This was however rejected on the list, and Geoff Clare pointed out that, since no behaviour is specified (i.e. how the implementation would need to handle such invalidly encoded characters), the use of pattern matching on arbitrary byte strings is undefined behaviour.

Desired Action:
Either:

1) - In line 76099, replace "strings" with "character strings" and perhaps mention that when this is done on strings that contain any byte sequence that is not a character in the current locale, the results are undefined.
   Perhaps also clarify this in fnmatch() (page 879). It doesn't seem to mention locales at all, but if the above assumption is true and pattern matching operates on characters only, wouldn't it then need to be subject to the current LC_CTYPE?

2) Alternatively, some expert could check whether there are any shell/fnmatch() implementations which do not simply carry on any bytes that do not form characters. Probably there are (yash?). But if there weren't, POSIX might even choose to standardise that behaviour, which would probably be better than leaving it unspecified?!
==
Issue History
Date Modified     Username  Field        Change
2022-02-23 01:54  calestyo  New Issue
2022-02-23 01:54  calestyo  Name         => Christoph Anton Mitterer
2022-02-23 01:54  calestyo  Section      => 2.13 Pattern Matching Notation
2022-02-23 01:54  calestyo  Page Number  => 2351
2022-02-23 01:54  calestyo  Line Number  => 76099
==