[Issue 8 drafts 0001564]: clariy on what (character/byte) strings pattern matching notation should work

2022-11-30 Thread Austin Group Bug Tracker via austin-group-l at The Open Group


The following issue has a resolution that has been APPLIED. 
== 
https://austingroupbugs.net/view.php?id=1564 
== 
Reported By:                calestyo
Assigned To:
== 
Project:                    Issue 8 drafts
Issue ID:                   1564
Category:                   Shell and Utilities
Type:                       Clarification Requested
Severity:                   Editorial
Priority:                   normal
Status:                     Applied
Name:                       Christoph Anton Mitterer
Organization:
User Reference:
Section:                    2.13 Pattern Matching Notation
Page Number:                2351
Line Number:                76099
Final Accepted Text:        https://austingroupbugs.net/view.php?id=1564#c5796
Resolution:                 Accepted As Marked
Fixed in Version:
== 
Date Submitted:             2022-02-23 01:54 UTC
Last Modified:              2022-11-30 16:37 UTC
== 
Summary:                    clariy on what (character/byte) strings pattern matching notation should work
==
Relationships       ID      Summary
--
related to          0001561 clarify what kind of data shell variabl...
== 

Issue History 
Date Modified     Username    Field                 Change
== 
2022-02-23 01:54  calestyo    New Issue
2022-02-23 01:54  calestyo    Name                  => Christoph Anton Mitterer
2022-02-23 01:54  calestyo    Section               => 2.13 Pattern Matching Notation
2022-02-23 01:54  calestyo    Page Number           => 2351
2022-02-23 01:54  calestyo    Line Number           => 76099
2022-02-25 04:57  calestyo    Note Added: 0005716
2022-02-25 20:54  mirabilos   Note Added: 0005719
2022-03-03 03:37  calestyo    Note Added: 0005729
2022-04-07 16:30  geoffclare  Relationship added    related to 0001561
2022-04-11 13:55  geoffclare  Note Added: 0005796
2022-04-11 22:58  kre         Note Added: 0005797
2022-04-12 08:51  geoffclare  Note Added: 0005798
2022-04-15 02:12  calestyo    Note Added: 0005804
2022-04-15 02:17  calestyo    Note Added: 0005805
2022-10-31 16:13  geoffclare  Final Accepted Text   => https://austingroupbugs.net/view.php?id=1564#c5796
2022-10-31 16:13  geoffclare  Status                New => Resolved
2022-10-31 16:13  geoffclare  Resolution            Open => Accepted As Marked
2022-10-31 16:13  geoffclare  Tag Attached: tc3-2008
2022-11-30 16:37  geoffclare  Status                Resolved => Applied
==




Re: [Issue 8 drafts 0001564]: clariy on what (character/byte) strings pattern matching notation should work

2022-10-31 Thread Christoph Anton Mitterer via austin-group-l at The Open Group
Hey folks.


A while ago we had this discussion about pattern matching notation and
characters vs. bytes.

Back then, Harald van Dijk did some investigation on whether the
standard could be changed to allow for bytes (and not just characters)
without breaking all kinds of shells.

IIRC, when he presented his results, these were that there are some
obstacles (in the sense of some shells behaving differently) but he
rather considered them bugs than actually desired behaviour by these
shells.


A while ago I had a short off-list mail exchange with him, and if I
understood correctly (please correct me if not ^^), Harald would be
still willing to track things further down and get them resolved (i.e.
allowing any bytes and not just characters) - which I think would be
really good for the standard and the ecosystem[0].

However, I think he wanted to get some kind of blessing/support by
people who have a stronger say in the matter (I guess people like shell
implementers and representatives from the POSIX WG) before actually
putting a lot of effort into that.


So the question is:
What do people here at the list think, would they find it useful to
resolve any open issues and strive to have the standard define pattern
matching notation on strings that may contain any bytes (and not just
such that form characters in the current locale) and is there any
support for this?

Do we have any shell implementers/maintainers around here, who could
comment on what they think (especially with respect to "their" shells)
or do we have some means of contacting such folks?



I guess it would be well worth actually getting to that state... and
while of course this could also be done years from now, it may be
harder by then, in case new shells come up that handle things in some
incompatible way.



Thanks,
Chris.



[0] My personal motivation was to get some (portable) command
substitution *with* trailing newlines, which proved to be quite hard to
actually do in POSIX.
In the current standard, with pattern matching notation working on
characters only, using the special properties of the '.' character (as
a sentinel) may not be enough... and the games with LC_ALL=C introduce
all kinds of further subtle issues, especially when one wants to make
the whole thing a function and has to live with shells and their
different way of handling `local` variables.
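
For readers who have not seen the sentinel trick mentioned in [0], here is
a minimal sketch. The names capture and captured are made up for
illustration. It deliberately leaves out the LC_ALL=C games, so the
${captured%.} step depends on pattern matching coping with whatever bytes
precede the sentinel, which is exactly the open question.

```shell
# Hypothetical helper (names are illustrative, not from any standard):
# run a command and keep its output, trailing newlines included.
capture() {
    # Command substitution strips trailing newlines, so append a
    # sentinel '.' after the command's output to protect them.
    captured=$("$@"; printf .)
    # Strip the sentinel again; this is the pattern-matching step whose
    # behaviour on non-character bytes is under discussion.
    captured=${captured%.}
}

capture printf 'a\n\n'
printf '%s' "$captured" | wc -c    # counts the bytes kept: a, newline, newline
```

Note that the sketch discards the exit status of the captured command; a
real helper would probably want to preserve it.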



[Issue 8 drafts 0001564]: clariy on what (character/byte) strings pattern matching notation should work

2022-10-31 Thread Austin Group Bug Tracker via austin-group-l at The Open Group


The following issue has been RESOLVED. 
== 
https://austingroupbugs.net/view.php?id=1564 
== 
Reported By:                calestyo
Assigned To:
== 
Project:                    Issue 8 drafts
Issue ID:                   1564
Category:                   Shell and Utilities
Type:                       Clarification Requested
Severity:                   Editorial
Priority:                   normal
Status:                     Resolved
Name:                       Christoph Anton Mitterer
Organization:
User Reference:
Section:                    2.13 Pattern Matching Notation
Page Number:                2351
Line Number:                76099
Final Accepted Text:        https://austingroupbugs.net/view.php?id=1564#c5796
Resolution:                 Accepted As Marked
Fixed in Version:
== 
Date Submitted:             2022-02-23 01:54 UTC
Last Modified:              2022-10-31 16:13 UTC
== 
Summary:                    clariy on what (character/byte) strings pattern matching notation should work
==
Relationships       ID      Summary
--
related to          0001561 clarify what kind of data shell variabl...
== 

Issue History 
Date Modified     Username    Field                 Change
== 
2022-02-23 01:54  calestyo    New Issue
2022-02-23 01:54  calestyo    Name                  => Christoph Anton Mitterer
2022-02-23 01:54  calestyo    Section               => 2.13 Pattern Matching Notation
2022-02-23 01:54  calestyo    Page Number           => 2351
2022-02-23 01:54  calestyo    Line Number           => 76099
2022-02-25 04:57  calestyo    Note Added: 0005716
2022-02-25 20:54  mirabilos   Note Added: 0005719
2022-03-03 03:37  calestyo    Note Added: 0005729
2022-04-07 16:30  geoffclare  Relationship added    related to 0001561
2022-04-11 13:55  geoffclare  Note Added: 0005796
2022-04-11 22:58  kre         Note Added: 0005797
2022-04-12 08:51  geoffclare  Note Added: 0005798
2022-04-15 02:12  calestyo    Note Added: 0005804
2022-04-15 02:17  calestyo    Note Added: 0005805
2022-10-31 16:13  geoffclare  Final Accepted Text   => https://austingroupbugs.net/view.php?id=1564#c5796
2022-10-31 16:13  geoffclare  Status                New => Resolved
2022-10-31 16:13  geoffclare  Resolution            Open => Accepted As Marked
==




Re: [Issue 8 drafts 0001564]: clariy on what (character/byte) strings pattern matching notation should work

2022-05-23 Thread Chet Ramey via austin-group-l at The Open Group
On 5/18/22 9:46 PM, Christoph Anton Mitterer via austin-group-l at The Open 
Group wrote:




> The above, I'm not quite sure what these tell/prove...
>
> I assume the ones with '?': that for all except bash/fnmatch '?'
> matches both, valid characters and a single byte that is no character.
>
> And the ones with bracket expression, that these also work when the BE
> has either a valid character or a byte (that is not a character) and
> vice-versa?
>
> If Chet is reading along, is the above intended in bash, or considered
> a bug?


The bash matcher falls back to C-locale-like behavior only if the pattern
and the string both do not contain any valid multibyte characters. So if,
for example, the string contains a valid multibyte character, but the
pattern does not, the matcher will attempt multibyte (wide character,
really) matches.

This is why the string \243] (a valid multibyte character in Big5) does not
match [\243!]]: nothing in the bracket expression will match that
character, and that string will never match a pattern ending in `]'.



> IMO it would have been interesting to see whether ? would also match
> multiple bytes that are each for themselves and together no valid
> character...


No, it wouldn't. You can make a case for `?' matching a single byte that is
not part of a valid multibyte character (there is no such thing as a single
byte that is "no valid character" when you are matching), but you cannot
make one for `?' matching more than one byte that does not compose a valid
multibyte character.



> > The tests involving \243 are run in a Big5 environment. In Big5,
> > \243\135 is the representation of β, a single valid character, even
> > though \135 on its own is still the single character ].
>
> Seem also a bit strange to me,... all shells match \243 against ? ...
> i.e. ? matches a single byte that is not a character... but later on it
> doesn't work again with \243] and ?]


Because, as Harald says, \243] is a valid multibyte character in Big5
locales.

--
``The lyf so short, the craft so long to lerne.'' - Chaucer
 ``Ars longa, vita brevis'' - Hippocrates
Chet Ramey, UTech, CWRU    c...@case.edu    http://tiswww.cwru.edu/~chet/



Re: [Issue 8 drafts 0001564]: clariy on what (character/byte) strings pattern matching notation should work

2022-05-19 Thread Harald van Dijk via austin-group-l at The Open Group

On 20/05/2022 01:11, Christoph Anton Mitterer wrote:

On Thu, 2022-05-19 at 09:05 +0100, Harald van Dijk wrote:


The above, AFAIU, mean that any shell/fnmatch matches a valid
multibyte
character... but also a byte that is not a character in the locale.


Correct, though as I wrote later on, the way they go about it is
different.


And I think, for any real standardisation of this (which I'd still love
to see) quite a few things would need to be reasonably defined,
including but most likely not limited to:
- Does * match bytes (by which I mean 1-n which don't form valid
   characters in the current locale).


This is another one of those that seemed obvious enough to me that I did 
not think to check explicitly. As far as I can tell at a quick glance, * 
matches any number (zero or more) of ?, whatever ? means, except in the 
case of a particular shell bug that also breaks scripts already required 
by the standard to work.



- The same for ? ... and if that matches bytes at all - only 1 or n?


It matches a single character or a single byte that is not part of a 
character.



- In which "direction" is the matching done, which AFAIU would be
   important, e.g.
   \303\244\244
   *if* '?' were to match also bytes, is '?\244' meant to be matching
   the character followed by the byte:
 (\303\244)\244
   or could it non-match the byte followed by more bytes:
 (\303)\244\244


In a UTF-8 locale, \303\244\244 is an invalid character string. As the 
test results have shown, in some of the implementations, that causes 
pattern matching to be done as if under the C locale. In those 
implementations, it does not match ?\244, but it does match ??\244. In 
other implementations, only the final invalid byte \244 is given special 
treatment, in which case the whole string does match ?\244.



- And I guess these questions would also pop up for the ##, #, %% and %
   forms of parameter expansions, especially when one has a locale like
   Big5.
   In the sense of, can one strip off a character (or byte) that forms
   part of another character.


This does pop up there too but the questions are not new, they are the 
same questions that already pop up for regular pattern matching. 
${var#pat} strips a leading pat of $var if and only if $var matches 
pat*. ${var%pat} strips a trailing pat of $var if and only if $var 
matches *pat. That said, in those cases where shells disagree over 
whether $var matches pat* / *pat, that is those cases where I would 
propose making the result unspecified, the results may also be 
inconsistent with the same shell's pattern matching in other contexts.
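
A small sketch of that equivalence, using plain ASCII so that the result
does not depend on the locale. The variable names are only illustrative:

```shell
# Illustrative values; plain ASCII, so every shell should agree here.
var=foobar
pat=foo

# ${var#pat} removes a leading match of pat exactly when $var matches pat*.
case $var in
    $pat*) prefix_result="stripped: ${var#$pat}" ;;
    *)     prefix_result="unchanged: ${var#$pat}" ;;
esac

# ${var%pat} removes a trailing match of pat exactly when $var matches *pat.
case $var in
    *$pat) suffix_result="stripped: ${var%$pat}" ;;
    *)     suffix_result="unchanged: ${var%$pat}" ;;
esac

printf '%s\n' "$prefix_result" "$suffix_result"
```

The interesting cases are of course the ones where $var or pat contain
bytes that do not form characters; there the two branches are where shells
may disagree.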



   If shells were required not to decompose such valid characters (that
   contain another valid character, when looked at it from right to
   left), then it would also need to be defined how the strings needed
   to be interpreted (most likely of course: as defined by the
   respective char encoding).


This is where the example with β comes in. The current standard, as far 
as I can tell, *already* requires


  var=β
  echo ${var%]}
  case $var in
  *]) echo match
  esac

to print "β", and not print "match", regardless of how that β is 
encoded. There are no invalid bytes here. This can only be done by 
processing the string left-to-right.
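
To try this without depending on how the script file itself is encoded,
the two Big5 bytes can be produced with printf. The sketch below forces
LC_ALL=C purely so that its outcome is predictable: there every byte is a
character, the value ends in ], and the byte-wise answers appear. Under a
Big5 locale (its name is system-specific and not assumed here) the answer
required by the standard, as argued above, is the opposite one.

```shell
# Assumption: LC_ALL=C is forced only to make the outcome predictable;
# remove these two lines to observe the behaviour of the current locale.
LC_ALL=C
export LC_ALL

# \243\135 is the Big5 encoding of the single character discussed above;
# \135 on its own is ']'.
var=$(printf '\243\135')

case $var in
    *']') tail_result=match ;;
    *)    tail_result="no match" ;;
esac
stripped=${var%']'}

printf '%s\n' "$tail_result"
printf '%s' "$stripped" | od -A n -t o1   # bytes left after ${var%]}
```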



   So for all these cases it might additionally be required to check how
   the different shells behave when trying to ##, #, %%, % ...
   And AFAIU, some actually allow to "decompose" a character.


Yes, this is expected and consistent with regular pattern matching.


   And even if the standard were to say that it must check whether the
   matched part is part of a bigger multibyte character (like in the
   Big5 case) and then not allow decomposing it, would it still be
   allowed to do so when the pattern contains bytes that are themselves
   not valid characters?


Yes, they should be allowed to do so. As we have seen, bash and GNU 
fnmatch() simply fall back to single-byte-character-set matching if the 
string or pattern is not valid in the current locale, and what you 
describe would be the natural result of that.



- Are there any undesired side effects? Like bash, has the nocasematch
   shell option... which IIRC affects patterns... would we break
   anything in such fields?


How could we? What bash does if a non-standard shell option is set is 
not covered by POSIX, nor should it be.



- I think it already is defined (more or less) which locale is actually
   used for the matching, i.e. the current one as set by LC_CTYPE and
   not e.g. the "lexical" locale defined on the start of the shell.


Agreed.


I tested this now. In that same list of shells, and in glibc fnmatch(),
? only matches a single invalid byte. Tested in a UTF-8 locale with the
string \200\200 and the patterns ? and ??. With ?, they do not match.
With ??, they do.


The next question that would come to my mind:
Do these tests really give us a definite answer on the behaviour... or
may some things be 

Re: [Issue 8 drafts 0001564]: clariy on what (character/byte) strings pattern matching notation should work

2022-05-19 Thread Christoph Anton Mitterer via austin-group-l at The Open Group
On Thu, 2022-05-19 at 09:05 +0100, Harald van Dijk wrote:
> > 
> > The above, AFAIU, mean that any shell/fnmatch matches a valid
> > multibyte
> > character... but also a byte that is not a character in the locale.
> 
> Correct, though as I wrote later on, the way they go about it is
> different.

And I think, for any real standardisation of this (which I'd still love
to see) quite a few things would need to be reasonably defined,
including but most likely not limited to:
- Does * match bytes (by which I mean 1-n which don't form valid
  characters in the current locale).

- The same for ? ... and if that matches bytes at all - only 1 or n?

- In which "direction" is the matching done, which AFAIU would be
  important, e.g.
  \303\244\244
  *if* '?' were to match also bytes, is '?\244' meant to be matching
  the character followed by the byte:
(\303\244)\244
  or could it non-match the byte followed by more bytes:
(\303)\244\244

- And I guess these questions would also pop up for the ##, #, %% and %
  forms of parameter expansions, especially when one has a locale like
  Big5.
  In the sense of, can one strip off a character (or byte) that forms
  part of another character.
  If shells were required not to decompose such valid characters (that
  contain another valid character, when looked at it from right to
  left), then it would also need to be defined how the strings needed
  to be interpreted (most likely of course: as defined by the
  respective char encoding).

  So for all these cases it might additionally be required to check how
  the different shells behave when trying to ##, #, %%, % ...
  And AFAIU, some actually allow to "decompose" a character.

  And even if the standard were to say that it must check whether the
  matched part is part of a bigger multibyte character (like in the
  Big5 case) and then not allow decomposing it, would it still be
  allowed to do so when the pattern contains bytes that are themselves
  not valid characters?

- Are there any undesired side effects? Like bash, has the nocasematch
  shell option... which IIRC affects patterns... would we break
  anything in such fields?

- I think it already is defined (more or less) which locale is actually
  used for the matching, i.e. the current one as set by LC_CTYPE and
  not e.g. the "lexical" locale defined on the start of the shell. 



> I tested this now. In that same list of shells, and in glibc fnmatch(),
> ? only matches a single invalid byte. Tested in a UTF-8 locale with the
> string \200\200 and the patterns ? and ??. With ?, they do not match.
> With ??, they do.

The next question that would come to my mind:
Do these tests really give us a definite answer on the behaviour... or
may some things be dependent on the specific locale? Maybe the above
behaviour is *only* with UTF-8?
Or can this be ruled out?

> 


> > So unlike before, in the above bash/fnmatch do seem to let '?' match a
> > single byte that is not a character... and the remaining ones have
> > quite mixed feelings
>
> Not quite: all of them always let ? match a single invalid byte, but
> here we have a single byte that is invalid on its own, valid as part of
> a character, and appears in the string as part of that character. When
> processing \303\244, most shells don't process this as the single byte
> \303 followed by the single byte \244, they preprocess this so that by
> the time they actually check whether it matches, they just see the
> character U+00E4, so that if a pattern looks for \303 on its own, it
> will not be found.

Hmm... seems a bit strange to me... I mean above you had:

string  pattern
\303.\303\244   ?.?

And e.g. bash didn't match.. my assumption was, because the first \303
is not a character.

But later on you had:
\303\244        \303*
\303\244        \303?
which bash *did* match.

Sure, the 2 bytes together are already one character, but bash had to
match the single \303 plus * or ? ... and if above ? didn't match the
single invalid \303 it did match the single \244 here (which ain't a
character either).

Now, even if one says it's the direction,... there was also:
\303\244.\303   ?.?
with no match in bash... the first ? should be okay, because it's a
char,... the 2nd one would be the lone \303 byte.

> 

> > Seem also a bit strange to me,... all shells match \243 against ?
> > ...
> > i.e. ? matches a single byte that is not a character... but later
> > on it
> > doesn't work again with \243] and ?]
> 
> Remember that \243] is a single character β. \243] is not supposed to
> match when given a pattern ?]. The pattern ?] means "any character,
> followed by ]". "β" is a character not followed by ]. This is similar to
> how in UTF-8 environments, ä should not match against the pattern ??
> even though both of the bytes that make up ä individually do match
> against the pattern ?.

Okay but isn't that then the case where the matching 

Re: [Issue 8 drafts 0001564]: clariy on what (character/byte) strings pattern matching notation should work

2022-05-19 Thread Harald van Dijk via austin-group-l at The Open Group
On 15/05/2022 16:14, Harald van Dijk via austin-group-l at The Open 
Group wrote:
On 19/04/2022 01:52, Harald van Dijk via austin-group-l at The Open 
Group wrote:

On 15/04/2022 04:57, Christoph Anton Mitterer wrote:

On Fri, 2022-04-15 at 00:44 +0100, Harald van Dijk via austin-group-l
at The Open Group wrote:

If there is interest in getting this standardised, I can spend some
more
time on creating some hopefully comprehensive tests for this to
confirm
in what cases shells agree and disagree, and use that as a basis for
proposing wording to cover it.


I'd love to see that and if you'd actually do so, I'd kindly ask
Geoff to defer any changes in the ticket #1564 of mine, until it can be
said whether it might be possible to get that standardised.


Very well, I will post tests and test results as soon I can make the 
time for it.


Please see the tests and results here. Apologies for the HTML mail but 
this is hard to make readable in plain text.


String        | Pattern           | dash, busybox ash, mksh, posh, pdksh | glibc fnmatch | bash     | bosh     | gwsh     | ksh      | zsh
--------------+-------------------+--------------------------------------+---------------+----------+----------+----------+----------+---------
\303\244      | [\303\244]        | no match                             | match         | match    | match    | match    | match    | match
\303\244      | ?                 | no match                             | match         | match    | match    | match    | match    | match
\303          | [\303]            | match                                | match         | match    | match    | match    | match    | match
\303          | ?                 | match                                | match         | match    | match    | match    | match    | match
\303.\303\244 | [\303].[\303\244] | no match                             | no match      | no match | match    | match    | match    | match
\303.\303\244 | ?.?               | no match                             | no match      | no match | match    | match    | match    | match
\303\303\244  | [\303][\303\244]  | no match                             | no match      | no match | match    | match    | match    | match
\303\303\244  | ??                | no match                             | no match      | no match | match    | match    | match    | match
\303\244.\303 | [\303\244].[\303] | no match                             | no match      | no match | match    | match    | match    | match
\303\244.\303 | ?.?               | no match                             | no match      | no match | match    | match    | match    | match
\303\244\303  | [\303\244][\303]  | no match                             | no match      | no match | match    | match    | match    | match
\303\244\303  | ??                | no match                             | no match      | no match | match    | match    | match    | match
\303\244      | \303*             | match                                | match         | match    | match    | no match | match    | no match
\303\244      | \303?             | match                                | match         | match    | no match | no match | match    | no match
\303\244      | [\303]*           | match                                | match         | match    | match    | no match | match    | no match
\303\244      | [\303]?           | match                                | match         | match    | no match | no match | match    | no match
\303\244      | *\204             | match                                | match         | match    | no match | no match | no match | match
\303\244      | ?\204             | match                                | match         | match    | no match | no match | no match | no match
\303\244      | *[\204]           | match                                | match         | match    | no match | no match | no match | no match
\303\244      | ?[\204]           | match                                | match         | match    | no match | no match | no match | no match
\243]         | [\243]]           | match                                | match         | match    | match    | match    | match    | match
\243]         | ?                 | no match                             | match         | match    | match    | match    | match    | match
\243          | ?                 | match                                | match         | match    | match    | match    | match    | match
\243          | [\243]            | match                                | match         | match    | match    | no match | no match | error
\243          | [\243!]           | match                                | match         | match    | match    | match    | match    | match
\243]         | [\243!]]          | match                                | match         | no match | no match | no match | match    | no match
\243]         | ?]                | match                                | match         | no match | no match | no match | no match | no match
\243]         | *]                | match                                | match         | no match | no match | no match | no match | match

The tests 

Re: [Issue 8 drafts 0001564]: clariy on what (character/byte) strings pattern matching notation should work

2022-05-19 Thread Harald van Dijk via austin-group-l at The Open Group

On 19/05/2022 02:46, Christoph Anton Mitterer wrote:

On Sun, 2022-05-15 at 16:14 +0100, Harald van Dijk wrote:

Please see the tests and results here.


So dash/ash/mksh/posh/pdksh,... and every other shell that doesn't
handle locales at all (and thus works in the C locale)... is anyway
always right (except for bugs), since any (non-NUL) byte is treated as
a character.


Correct.


For the other shells (and fnmatch):


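String        | Pattern           | dash, busybox ash, mksh, posh, pdksh | glibc fnmatch | bash     | bosh     | gwsh     | ksh      | zsh
--------------+-------------------+--------------------------------------+---------------+----------+----------+----------+----------+---------
\303\244      | [\303\244]        | no match                             | match         | match    | match    | match    | match    | match
\303\244      | ?                 | no match                             | match         | match    | match    | match    | match    | match
\303          | [\303]            | match                                | match         | match    | match    | match    | match    | match
\303          | ?                 | match                                | match         | match    | match    | match    | match    | match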


The above, AFAIU, mean that any shell/fnmatch matches a valid multibyte
character... but also a byte that is not a character in the locale.


Correct, though as I wrote later on, the way they go about it is different.


String        | Pattern           | dash, busybox ash, mksh, posh, pdksh | glibc fnmatch | bash     | bosh     | gwsh     | ksh      | zsh
--------------+-------------------+--------------------------------------+---------------+----------+----------+----------+----------+---------
\303.\303\244 | [\303].[\303\244] | no match                             | no match      | no match | match    | match    | match    | match
\303.\303\244 | ?.?               | no match                             | no match      | no match | match    | match    | match    | match
\303\303\244  | [\303][\303\244]  | no match                             | no match      | no match | match    | match    | match    | match
\303\303\244  | ??                | no match                             | no match      | no match | match    | match    | match    | match
\303\244.\303 | [\303\244].[\303] | no match                             | no match      | no match | match    | match    | match    | match
\303\244.\303 | ?.?               | no match                             | no match      | no match | match    | match    | match    | match
\303\244\303  | [\303\244][\303]  | no match                             | no match      | no match | match    | match    | match    | match
\303\244\303  | ??                | no match                             | no match      | no match | match    | match    | match    | match



The above, I'm not quite sure what these tell/prove...

I assume the ones with '?': that for all except bash/fnmatch   '?'
matches both, valid characters and a single byte that is no character.


Correct.


And the ones with bracket expression, that these also work when the BE
has either a valid character or a byte (that is not a character) and
vice-versa?


Correct.


If Chet is reading along, is the above intended in bash, or considered
a bug?


IMO it would have been interesting to see whether ? would also match
multiple bytes that are each for themselves and together no valid
character... cause for '*' one can kinda assume that it has this "match
anything" meaning... one could also say that is more or less reasonable
that '?' matches a single invalid byte... but why not several of them?


I tested this now. In that same list of shells, and in glibc fnmatch(), 
? only matches a single invalid byte. Tested in a UTF-8 locale with the 
string \200\200 and the patterns ? and ??. With ?, they do not match. 
With ??, they do.
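
A sketch of how such a check can be scripted. The function matches is a
hypothetical helper written for this illustration, not the harness that
produced the tables. Going by the results reported above (and by plain
byte-wise matching in the C locale), it should print "no match" for ? and
"match" for ??.

```shell
# Hypothetical helper: print "match" or "no match" for string $1 and pattern $2.
matches() {
    case $1 in
        $2) echo match ;;       # $2 is left unquoted so it is used as a pattern
        *)  echo "no match" ;;
    esac
}

# Two 0x80 bytes: continuation bytes with no lead byte, so they form no
# valid UTF-8 character, either alone or together.
s=$(printf '\200\200')

matches "$s" '?'
matches "$s" '??'
```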



String        | Pattern           | dash, busybox ash, mksh, posh, pdksh | glibc fnmatch | bash     | bosh     | gwsh     | ksh      | zsh
--------------+-------------------+--------------------------------------+---------------+----------+----------+----------+----------+---------
\303\244      | \303*             | match                                | match         | match    | match    | no match | match    | no match
\303\244      | \303?             | match                                | match         | match    | no match | no match | match    | no match
\303\244      | [\303]*           | match                                | match         | match    | match    | no match | match    | no match
\303\244      | [\303]?           | match                                | match         | match    | no match | no match | match    | no match
\303\244      | *\204             | match                                | match         | match    | no match | no match | no match | match
\303\244      | ?\204             | match                                | match         | match    | no match | no match | no match | no match
\303\244      | *[\204]           | match                                | match         | match    | no match | no match | no match | no match
\303\244      | ?[\204]           | match                                | match         | match    | no match | no match | no match | no match




So unlike before, in the above bash/fnmatch do seem to let '?' match a
single byte that is not a character... and the remaining ones have
quite mixed feelings


Not quite: all of them always let ? match a single invalid byte, but 
here we have a single byte that is invalid on its own, valid as part of 
a character, and appears in the string as part of that character. When 
processing \303\244, most shells don't process this as the single byte 
\303 followed by the single byte \244, they preprocess this so that by 
the time they actually check whether it matches, they just see the 
character U+00E4, so that if a pattern looks for \303 on its own, it 
will not be found.



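String        | Pattern           | dash, busybox ash, mksh, posh, pdksh | glibc fnmatch | bash     | bosh     | gwsh     | ksh      | zsh
--------------+-------------------+--------------------------------------+---------------+----------+----------+----------+----------+---------
\243]         | [\243]]           | match                                | match         | match    | match    | match    | match    | match
\243]         | ?                 | no match                             | match         | match    | match    | match    | match    | match
\243          | ?                 | match                                | match         | match    | match    | match    | match    | match
\243          | [\243]            | match                                | match         | match    | match    | no match | no match | error
\243          | [\243!]           | match                                | match         | match    | match    | match    | match    | match
\243]         | [\243!]]          | match                                | match         | no match | no match | no match | match    | no match
\243]         | ?]                | match                                | match         | no match | no match | no match | no match | no match
\243]         | *]                | match                                | match         | no match | no match | no match | no match | match
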
The tests involving \243 are run in a Big5 environment. In Big5,
\243\135 is the representation of β, a single valid character, even
though \135 on its own is still the single character ].


Seem also a bit strange to me,... all shells match \243 against ? ...
i.e. ? matches a single byte that is not a character... but later on it
doesn't work again with \243] and ?]


Remember that \243] is a single character β. \243] is not 

Re: [Issue 8 drafts 0001564]: clariy on what (character/byte) strings pattern matching notation should work

2022-05-18 Thread Christoph Anton Mitterer via austin-group-l at The Open Group
On Sun, 2022-05-15 at 16:14 +0100, Harald van Dijk wrote:
> Please see the tests and results here.

So dash/ash/mksh/posh/pdksh,... and every other shell that doesn't
handle locales at all (and thus works in the C locale)... is anyway
always right (except for bugs), since any (non-NUL) byte is treated as
a character.


For the other shells (and fnmatch):

> String        | Pattern           | dash, busybox ash, mksh, posh, pdksh | glibc fnmatch | bash     | bosh     | gwsh     | ksh      | zsh
> --------------+-------------------+--------------------------------------+---------------+----------+----------+----------+----------+---------
> \303\244      | [\303\244]        | no match                             | match         | match    | match    | match    | match    | match
> \303\244      | ?                 | no match                             | match         | match    | match    | match    | match    | match
> \303          | [\303]            | match                                | match         | match    | match    | match    | match    | match
> \303          | ?                 | match                                | match         | match    | match    | match    | match    | match

The above, AFAIU, mean that any shell/fnmatch matches a valid multibyte
character... but also a byte that is not a character in the locale.



> String        | Pattern           | dash, busybox ash, mksh, posh, pdksh | glibc fnmatch | bash     | bosh     | gwsh     | ksh      | zsh
> --------------+-------------------+--------------------------------------+---------------+----------+----------+----------+----------+---------
> \303.\303\244 | [\303].[\303\244] | no match                             | no match      | no match | match    | match    | match    | match
> \303.\303\244 | ?.?               | no match                             | no match      | no match | match    | match    | match    | match
> \303\303\244  | [\303][\303\244]  | no match                             | no match      | no match | match    | match    | match    | match
> \303\303\244  | ??                | no match                             | no match      | no match | match    | match    | match    | match
> \303\244.\303 | [\303\244].[\303] | no match                             | no match      | no match | match    | match    | match    | match
> \303\244.\303 | ?.?               | no match                             | no match      | no match | match    | match    | match    | match
> \303\244\303  | [\303\244][\303]  | no match                             | no match      | no match | match    | match    | match    | match
> \303\244\303  | ??                | no match                             | no match      | no match | match    | match    | match    | match

The above, I'm not quite sure what these tell/prove...

I assume the ones with '?': that for all except bash/fnmatch   '?'
matches both, valid characters and a single byte that is no character.

And the ones with bracket expression, that these also work when the BE
has either a valid character or a byte (that is not a character) and
vice-versa?

If Chet is reading along, is the above intended in bash, or considered
a bug?


IMO it would have been interesting to see whether ? would also match
multiple bytes that are each for themselves and together no valid
character... cause for '*' one can kinda assume that it has this "match
anything" meaning... one could also say that is more or less reasonable
that '?' matches a single invalid byte... but why not several of them?





> String        | Pattern           | dash, busybox ash, mksh, posh, pdksh | glibc fnmatch | bash     | bosh     | gwsh     | ksh      | zsh
> --------------+-------------------+--------------------------------------+---------------+----------+----------+----------+----------+---------
> \303\244      | \303*             | match                                | match         | match    | match    | no match | match    | no match
> \303\244      | \303?             | match                                | match         | match    | no match | no match | match    | no match
> \303\244      | [\303]*           | match                                | match         | match    | match    | no match | match    | no match
> \303\244      | [\303]?           | match                                | match         | match    | no match | no match | match    | no match
> \303\244      | *\204             | match                                | match         | match    | no match | no match | no match | match
> \303\244      | ?\204             | match                                | match         | match    | no match | no match | no match | no match
> \303\244      | *[\204]           | match                                | match         | match    | no match | no match | no match | no match
> \303\244      | ?[\204]           | match                                | match         | match    | no match | no match | no match | no match

So unlike before, in the above bash/fnmatch do seem to let '?' match a
single byte that is not a character... and the remaining ones have
quite mixed feelings





> String        | Pattern           | dash, busybox ash, mksh, posh, pdksh | glibc fnmatch | bash     | bosh     | gwsh     | ksh      | zsh
> --------------+-------------------+--------------------------------------+---------------+----------+----------+----------+----------+---------
> \243]         | [\243]]           | match                                | match         | match    | match    | match    | match    | match
> \243]         | ?                 | no match                             | match         | match    | match    | match    | match    | match
> \243          | ?                 | match                                | match         | match    | match    | match    | match    | match
> \243          | [\243]            | match                                | match         | match    | match    | no match | no match | error
> \243          | [\243!]           | match                                | match         | match    | match    | match    | match    | match
> \243]         | [\243!]]          | match                                | match         | no match | no match | no match | match    | no match
> \243]         | ?]                | match                                | match         | no match | no match | no match | no match | no match
> \243]         | *]                | match                                | match         | no match | no match | no match | no match | match
>
> The tests involving \243 are run in a Big5 environment. In Big5,
> \243\135 is the representation of β, a single valid character, even
> though \135 on its own is still the single character ].

This also seems a bit strange to me: all shells match \243 against ?,
i.e. ? matches a single byte that is not a character, but later on it
no longer works with \243] and ?]



> The other shells, when either the string or the pattern are not valid
> in the current locale, are not in agreement on whether parts of the
> rest of the string or the pattern are still interpreted according to
> the current locale, and if so, which parts.
I assume this effectively puts an end to any efforts of standardising
this for byte strings, or what is your conclusion?
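
For anyone who wants to reproduce individual rows of the quoted tables, here is a minimal sketch. It is not the harness that produced the results above (the thread does not show it); the helper name try_match is made up for illustration, and it assumes both arguments are written with printf-style octal escapes and contain no '%'.

```shell
# try_match STRING PATTERN
# Hypothetical helper: both arguments use printf octal escapes, e.g. '\303\244'.
# Assumption: neither argument contains '%', since both are used as printf formats.
try_match() {
    str=$(printf "$1")
    pat=$(printf "$2")
    case $str in
        $pat) echo "match" ;;      # unquoted, so $pat is interpreted as a pattern
        *)    echo "no match" ;;
    esac
}

# The results of these calls depend on the shell and on LC_CTYPE,
# which is exactly what the tables document.
try_match '\303\244' '\303*'
try_match '\303\244' '[\303]?'
```

Running the same calls under several shells, once in a UTF-8 locale and once with LC_ALL=C, should show whether a given shell treats the stray byte as a character of its own.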

Re: [Issue 8 drafts 0001564]: clariy on what (character/byte) strings pattern matching notation should work

2022-05-15 Thread Harald van Dijk via austin-group-l at The Open Group
On 19/04/2022 01:52, Harald van Dijk via austin-group-l at The Open
Group wrote:
> On 15/04/2022 04:57, Christoph Anton Mitterer wrote:
> > On Fri, 2022-04-15 at 00:44 +0100, Harald van Dijk via austin-group-l
> > at The Open Group wrote:
> > > If there is interest in getting this standardised, I can spend some
> > > more time on creating some hopefully comprehensive tests for this to
> > > confirm in what cases shells agree and disagree, and use that as a
> > > basis for proposing wording to cover it.
> >
> > I'd love to see that and if you'd actually do so, I'd kindly ask
> > Geoff to defer any changes in the ticket #1564 of mine, until it can be
> > said whether it might be possible to get that standardised.
>
> Very well, I will post tests and test results as soon I can make the
> time for it.


Please see the tests and results here. Apologies for the HTML mail but 
this is hard to make readable in plain text.


("dash etc." = dash, busybox ash, mksh, posh, pdksh; "fnmatch" = glibc fnmatch)

String         Pattern            dash etc.  fnmatch    bash       bosh       gwsh       ksh        zsh
\303\244       [\303\244]         no match   match      match      match      match      match      match
\303\244       ?                  no match   match      match      match      match      match      match
\303           [\303]             match      match      match      match      match      match      match
\303           ?                  match      match      match      match      match      match      match
\303.\303\244  [\303].[\303\244]  no match   no match   no match   match      match      match      match
\303.\303\244  ?.?                no match   no match   no match   match      match      match      match
\303\303\244   [\303][\303\244]   no match   no match   no match   match      match      match      match
\303\303\244   ??                 no match   no match   no match   match      match      match      match
\303\244.\303  [\303\244].[\303]  no match   no match   no match   match      match      match      match
\303\244.\303  ?.?                no match   no match   no match   match      match      match      match
\303\244\303   [\303\244][\303]   no match   no match   no match   match      match      match      match
\303\244\303   ??                 no match   no match   no match   match      match      match      match
\303\244       \303*              match      match      match      match      no match   match      no match
\303\244       \303?              match      match      match      no match   no match   match      no match
\303\244       [\303]*            match      match      match      match      no match   match      no match
\303\244       [\303]?            match      match      match      no match   no match   match      no match
\303\244       *\204              match      match      match      no match   no match   no match   match
\303\244       ?\204              match      match      match      no match   no match   no match   no match
\303\244       *[\204]            match      match      match      no match   no match   no match   no match
\303\244       ?[\204]            match      match      match      no match   no match   no match   no match
\243]          [\243]]            match      match      match      match      match      match      match
\243]          ?                  no match   match      match      match      match      match      match
\243           ?                  match      match      match      match      match      match      match
\243           [\243]             match      match      match      match      no match   no match   error
\243           [\243!]            match      match      match      match      match      match      match
\243]          [\243!]]           match      match      no match   no match   no match   match      no match
\243]          ?]                 match      match      no match   no match   no match   no match   no match
\243]          *]                 match      match      no match   no match   no match   no match   match

The tests involving \303 and \244 are run in an UTF-8 environment. In 
UTF-8, \303\244 is the 

Re: [Issue 8 drafts 0001564]: clariy on what (character/byte) strings pattern matching notation should work

2022-04-22 Thread Harald van Dijk via austin-group-l at The Open Group

[accidentally replied privately, re-sending to list]

On 22/04/2022 10:54, Geoff Clare via austin-group-l at The Open Group wrote:

> Harald van Dijk wrote, on 15 Apr 2022:
> >
> > For the most part(*), those shells that support locales appear to already be
> > in agreement that single bytes that do not form a valid multi-byte character
> > are interpreted as single characters that can be matched with *, with ?, and
> > with those single bytes themselves. Shells are not in agreement on whether
> > such single bytes can be matched with [...], nor in those shells where they
> > can be, whether multiple bracket expressions can be used to match the
> > individual bytes of a valid multi-byte character.
>
> Shells are not the only issue here.  Pattern matching or expansion is also
> performed by find, pax, fnmatch() and glob(), and there is no agreement
> between those.  E.g. GNU find -name does not match such bytes with * or ?,
> whereas Solaris find does, and the glibc fnmatch() returns an error (not
> just "no match") if the string to be matched does not consist of valid
> characters.


Good point that those need to be checked too. However, what you describe 
is only what older versions of GNU fnmatch() did. It was regarded as a 
bug, and fixed.


Cheers,
Harald van Dijk



Re: [Issue 8 drafts 0001564]: clariy on what (character/byte) strings pattern matching notation should work

2022-04-22 Thread Geoff Clare via austin-group-l at The Open Group
Harald van Dijk wrote, on 15 Apr 2022:
>
> For the most part(*), those shells that support locales appear to already be
> in agreement that single bytes that do not form a valid multi-byte character
> are interpreted as single characters that can be matched with *, with ?, and
> with those single bytes themselves. Shells are not in agreement on whether
> such single bytes can be matched with [...], nor in those shells where they
> can be, whether multiple bracket expressions can be used to match the
> individual bytes of a valid multi-byte character.

Shells are not the only issue here.  Pattern matching or expansion is also
performed by find, pax, fnmatch() and glob(), and there is no agreement
between those.  E.g. GNU find -name does not match such bytes with * or ?,
whereas Solaris find does, and the glibc fnmatch() returns an error (not
just "no match") if the string to be matched does not consist of valid
characters.

-- 
Geoff Clare 
The Open Group, Apex Plaza, Forbury Road, Reading, RG1 1AX, England



Re: [Issue 8 drafts 0001564]: clariy on what (character/byte) strings pattern matching notation should work

2022-04-22 Thread Geoff Clare via austin-group-l at The Open Group
Robert Elz wrote, on 15 Apr 2022:
>
>   | That is how things are at present. The suggested changes just make it
>   | explicit.
> 
> Yes, I know, but that's what I am suggesting that we not do in this one case.
> 
>   | Do you have an alternative proposal?
> 
> Only to the extent of "do nothing".   I am certainly not suggesting that
> we attempt to solve the problem.
> 
> Except perhaps it might be worth adding something to the Rationale (but
> about what, ie: where there, I have no idea) along the lines of:
> 
>   It is often unclear whether a string is to be interpreted as
>   characters in some locale, or as an arbitrary byte string.
>   While it would have been possible to arbitrarily make the various
>   cases more explicit, or explicitly unspecified, it was considered
>   better, in this version of  to
>   make no changes, as it is believed that much additional work is
>   required to enable a standards-worthy specification possible.
>   This work is beyond the scope of this standard.

It makes no difference to the requirements of the standard whether we
state explicitly in normative text that something is unspecified or
acknowledge in rationale that it is implicitly unspecified.

Personally I prefer to have it explicit in normative text so that
readers don't have to dig through rationale to find out about it
(or worse, report that the normative text is unclear because they
didn't see the rationale).

> The problem I see, is that any specification at all of any of this,
> allows implementors to just say "that is what posix requires" and do
> nothing at all, where we really need some innovation, by someone who
> actually understands the issues and how to deal with them in a rational
> way - or at least who can come up with some kind of plan, and without
> any possibility of being considered a non-conformant implementation
> because of it.

I don't see why a statement in normative text about unspecified
behaviour would have any effect on implementors' attitudes to changing
their implementation from one allowed behaviour to another.

>   | The application can document that it requires pathnames to be in the
>   | same encoding as the user's locale.
> 
> That's not sufficient.   Try encoding a find command to look for pathnames
> containing currency symbols.   It should be just a simple find -name 
> '*[ABCD]*'
> type operation, with appropriate substitutions for the ABCD chars.

If all filenames encountered are encoded in the user's locale (as per
the application's documented requirement) and the pattern is encoded in
the user's locale, then POSIX requires this use of find by the
application to work.

>   | > Even worse perhaps, ???.doc which should match 7 char
>   | > names that end in ".doc" (or is that 7 byte names?) (not counting the 
> \0).
>   |
>   | It would match 7-byte names.
> 
> Yes, in the C locale it would.   But do you believe that is what the user
> would have intended?   Are they to be required to work out how many bytes
> their local filenames are encoded as, and enter the appropriate number of '?'
> chars?

The point of adding an explicit statement to normative text is to draw
attention to the issue. Thus users should be aware that ???.doc will
match 7-byte names, and if that's not what they want then they will need
to find a different way to obtain the result they want.
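
To make the byte counting concrete, here is a small sketch of what ???.doc does once the locale is C. The helper name three_before_doc is made up for illustration, and the sketch assumes that assigning LC_ALL inside the running shell takes effect for subsequent pattern matching.

```shell
# three_before_doc NAME
# Hypothetical helper: reports whether NAME matches ???.doc in the C locale,
# where every byte is a single character.
three_before_doc() {
    (
        # Subshell, so the locale change does not leak into the caller.
        LC_ALL=C
        export LC_ALL
        case $1 in
            ???.doc) echo "match" ;;
            *)       echo "no match" ;;
        esac
    )
}

three_before_doc 'abc.doc'                          # 3 bytes before ".doc"
three_before_doc "$(printf '\303\244bc.doc')"       # 4 bytes before ".doc"
three_before_doc "$(printf '\303\244c.doc')"        # 3 bytes before ".doc"
```

In a UTF-8 locale the second name would display as three characters followed by ".doc", yet in the C locale it is the third name, two displayed characters, that has the three bytes ???.doc asks for.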

-- 
Geoff Clare 
The Open Group, Apex Plaza, Forbury Road, Reading, RG1 1AX, England



Re: [Issue 8 drafts 0001564]: clariy on what (character/byte) strings pattern matching notation should work

2022-04-18 Thread Christoph Anton Mitterer via austin-group-l at The Open Group
Hey.


On Tue, 2022-04-19 at 01:52 +0100, Harald van Dijk wrote:
> Even I did not apply this to pattern matching. The 
> lexical locale, the locale used for lexing, is only used for lexing, 
> i.e. for recognising tokens, not for how those tokens are then 
> interpreted later on. If locale comes into play for that, as it does
> in 
> pattern matching, it is the then-current value of LC_CTYPE that comes
> into play, as it does in other shells.

So... how is (as per the standard) it intended to work?

My understanding was that if during lexing it sees a pattern '*∈' it
would store the binary representation (as following from the lexical
locale, in which the shell script/input is in principle expected to be)
of these characters for the pattern.

But when the actual pattern matching is done, it would interpret that
binary representation with respect to the current locale (LC_CTYPE).
So if by then, the binary representation of the script's '*∈' would
mean '*z?' in the current locale, it would use that meaning as the
pattern.

Does that sound right?


'∈' not being a member of the portable character set would make it,
AFAIU, in principle valid for being mapped to `z?` in another locale -
while changing the mapping of '*' would be possible, but according to
POSIX produce undefined results.

("If the encoded values associated with each member of the portable
character set are not invariant across all locales supported by the
implementation, if an application uses any pair of locales where the
character encodings differ, or accesses data from an application using
a locale which has different encodings from the locales used by the
application, the results are unspecified.")


> As for future directions, no opinion on that from me.

That would IMO only make sense if e.g. there was only one, not even
well maintained, shell that behaves differently from all others.

The "future directions" would indicate to possible new implementers
where things may go and what they should do.
10 years later, one could re-visit the topic, and if that one shell
that behaved differently from all others had died in the meantime, and
any possible new ones followed the future directions... one could
standardise it. If not, one could simply leave everything as is and no
one would get into trouble.

Whether such approach actually works out as intended is of course not
guaranteed.


> I would not think this should be a special case: «${foo%.}» should
> strip 
> a trailing «.» in exactly those cases where the shell considers foo
> to 
> match the pattern «*.». However, I can see value in doing some extra 
> tests to verify that this matches what shells do.

Remember that it might not be enough to check whether such shell strip
off correctly when one has the case
  
but also the case where one or more trailing bytes of the first group
and the bytes of the valid character form a new valid character.

While this wouldn't be possible if '.' is the character (because of
its special properties)... it can happen with other characters in some
special locales.


> Very well, I will post tests and test results as soon I can make the 
> time for it.

Thanks.


FYI: I think the outcome will also affect the current proposal for
#1561:
https://www.austingroupbugs.net/view.php?id=1561#c5795

in specific the part:
On page 2321 line 74857 section 2.6.2 Parameter Expansion, change:


Thanks,
Chris.



Re: [Issue 8 drafts 0001564]: clariy on what (character/byte) strings pattern matching notation should work

2022-04-18 Thread Harald van Dijk via austin-group-l at The Open Group

On 15/04/2022 04:57, Christoph Anton Mitterer wrote:
> On Fri, 2022-04-15 at 00:44 +0100, Harald van Dijk via austin-group-l
> at The Open Group wrote:
> > Hmm, I would.
>
> I like that :-D This would have been the preferred alternative I've
> asked for to look at, in the ticket.




> > Shells are not in agreement on whether such single bytes can be
> > matched with [...], nor in those shells where they can be, whether
> > multiple bracket expressions can be used to match the individual
> > bytes of a valid multi-byte character.
> >
> > The cases with [...] only come up when scripts themselves use
> > patterns that are not valid character strings
>
> You mean in the lexical locale?


I do not, but interesting question. I am one of the few, if not the only,
shell authors that actually implemented the "Changing the value of LC_CTYPE
after the shell has started shall not affect the lexical processing of
shell commands in the current shell execution environment or its
subshells" rule. Even I did not apply this to pattern matching. The
lexical locale, the locale used for lexing, is only used for lexing,
i.e. for recognising tokens, not for how those tokens are then
interpreted later on. If locale comes into play for that, as it does in
pattern matching, it is the then-current value of LC_CTYPE that comes
into play, as it does in other shells.



> > they are unlikely to affect existing scripts and I imagine there is
> > not much harm in leaving those unspecified.
>
> It should however be clearly described that behaviour in this field is
> undefined, perhaps with some "future directions" that this might change
> some day.


I prefer explicit over implicit as well myself. Perhaps it does not even 
need to be undefined though, perhaps unspecified with a few limited 
options is good enough. I am not sure at this time whether that is feasible.


As for future directions, no opinion on that from me.


> > The cases with * and ? do come up in existing scripts, but
> > if shells are in agreement as they appear to be, there is no need to
> > coordinate with shell authors on whether they would be willing to
> > change their implementations, it is possible to change POSIX to
> > describe the shells' current behaviour.
>
> Well but it's not only * and ? ... it's also a single character
> matching that character in a byte string that contains bytes or
> sequences thereof which do not form any valid character ... both before
> or after that character to be matched.


Yes, I did mention those earlier on in my message but forgot to repeat 
it here. It's where shells also appear to be in agreement, except in the 
same corner case that also applies to [...] where an invalid byte in a 
pattern is used to match part of a valid character in the string.



> And since pattern matching notation isn't just used for matching alone,
> but e.g. also for string manipulation in parameter expansion (e.g.
> "${foo%.}" case)... these shells would also need to agree how to handle
> that, wouldn't they?


I would not think this should be a special case: «${foo%.}» should strip 
a trailing «.» in exactly those cases where the shell considers foo to 
match the pattern «*.». However, I can see value in doing some extra 
tests to verify that this matches what shells do.
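
The kind of check described here could look like the following sketch. The helper name compare_dot is made up for illustration; it only reports whether the case match against *. and the ${foo%.} expansion agree for a given string, and makes no claim about what any particular shell prints for byte strings that are invalid in the current locale.

```shell
# compare_dot STRING
# Hypothetical helper: reports whether STRING matches the pattern *. and
# whether ${STRING%.} actually removed something.
compare_dot() {
    s=$1
    case $s in
        *.) by_case=yes ;;
        *)  by_case=no ;;
    esac
    if [ "${s%.}" != "$s" ]; then
        by_strip=yes
    else
        by_strip=no
    fi
    printf 'case=%s strip=%s\n' "$by_case" "$by_strip"
}

compare_dot 'report.'
compare_dot 'report'
# A lone \303 followed by '.': the result depends on the shell and on LC_CTYPE.
compare_dot "$(printf '\303.')"
```

A shell that prints differing values for case= and strip= on any input would be treating parameter expansion as a special case.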



> > If there is interest in getting this standardised, I can spend some
> > more time on creating some hopefully comprehensive tests for this to
> > confirm in what cases shells agree and disagree, and use that as a
> > basis for proposing wording to cover it.
>
> I'd love to see that and if you'd actually do so, I'd kindly ask
> Geoff to defer any changes in the ticket #1564 of mine, until it can be
> said whether it might be possible to get that standardised.


Very well, I will post tests and test results as soon I can make the 
time for it.


Cheers,
Harald van Dijk



Re: [Issue 8 drafts 0001564]: clariy on what (character/byte) strings pattern matching notation should work

2022-04-14 Thread Christoph Anton Mitterer via austin-group-l at The Open Group
On Fri, 2022-04-15 at 00:44 +0100, Harald van Dijk via austin-group-l
at The Open Group wrote:
> Hmm, I would.

I like that :-D This would have been the preferred alternative I've
asked for to look at, in the ticket.



> Shells 
> are not in agreement on whether such single bytes can be matched with
> [...], nor in those shells where they can be, whether multiple
> bracket 
> expressions can be used to match the individual bytes of a valid 
> multi-byte character.
> 
> The cases with [...] only come up when scripts themselves use
> patterns 
> that are not valid character strings

You mean in the lexical locale?


> they are unlikely to affect 
> existing scripts and I imagine there is not much harm in leaving
> those 
> unspecified.

It should however be clearly described that behaviour in this field is
undefined, perhaps with some "future directions" that this might change
some day.


> The cases with * and ? do come up in existing scripts, but 
> if shells are in agreement as they appear to be, there is no need to 
> coordinate with shell authors on whether they would be willing to
> change 
> their implementations, it is possible to change POSIX to describe the
> shells' current behaviour.

Well but it's not only * and ? ... it's also a single character
matching that character in a byte string that contains bytes or
sequences thereof which do not form any valid character ... both before
or after that character to be matched.

And since pattern matching notation isn't just used for matching alone,
but e.g. also for string manipulation in parameter expansion (e.g.
"${foo%.}" case)... these shells would also need to agree how to handle
that, wouldn't they?


> If there is interest in getting this standardised, I can spend some
> more 
> time on creating some hopefully comprehensive tests for this to
> confirm 
> in what cases shells agree and disagree, and use that as a basis for 
> proposing wording to cover it.

I'd love to see that and if you'd actually do so, I'd kindly ask
Geoff to defer any changes in the ticket #1564 of mine, until it can be
said whether it might be possible to get that standardised.



Thanks,
Chris.




Re: [Issue 8 drafts 0001564]: clariy on what (character/byte) strings pattern matching notation should work

2022-04-14 Thread Christoph Anton Mitterer via austin-group-l at The Open Group
On Tue, 2022-04-12 at 18:54 +0700, Robert Elz via austin-group-l at The
Open Group wrote:
> The point was that, at least as I read the proposed text, you're
> defining
> things like '*' to only work (reliably as specified) when the locale
> is
> POSIX (aka C).   In the user's locale, who knows what happens?

TBH, I'm not sure whether I understand the problem.

What does "reliably" mean? That it always matches any file not starting
with a "."?
One could say it's simply not defined as that.

*If* pattern matching notation is considered to work on character
strings only, then the logical consequence would be that * doesn't
necessarily match $'\xFF'.doc in an UTF-8 locale.

*Not necessarily* - since Geoff's proposal defines the behaviour in
that case clearly as unspecified, and thus a shell could do what e.g.
bash seems to do (or at least that was my understanding) ... and simply
carry on any invalid bytes in strings.

But Shell XYZ may choose to not match.
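
A small sketch of that case, for illustration only. The helper name doc_match is made up; the file name is the \xFF example from above, built with printf so that the sketch does not depend on $'...' support.

```shell
# doc_match NAME
# Hypothetical helper: reports whether NAME matches *.doc in whatever
# locale is in effect when it is called.
doc_match() {
    case $1 in
        *.doc) echo "match" ;;
        *)     echo "no match" ;;
    esac
}

name=$(printf '\377.doc')     # byte 0xFF never occurs in valid UTF-8

# Current locale: if it is UTF-8, the name is not a character string,
# so a shell may print either result.
doc_match "$name"

# C locale: every byte is a character, so the name is a valid string.
( LC_ALL=C; export LC_ALL; doc_match "$name" )
```

Only the second call has an outcome one can rely on; the first shows whatever the shell at hand chose to do.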


Geoff's proposal doesn't seem to codify anything, which isn't already
(and unfortunately) allowed anyway... it just clarifies the ambiguity
by the inconsistent use of defined terms.

Which makes it clear to e.g. some random guy like me and the command
substitution with trailing newline case - that I *cannot* simply
assume, that (because of the special properties of '.' or '/' as
characters) it would be enough to simply use one of these two and
stripping them off would work for sure in *any* conforming locale...


On Fri, 2022-04-15 at 06:03 +0700, Robert Elz via austin-group-l at The
Open Group wrote:
>   | Do you have an alternative proposal?
> 
> Only to the extent of "do nothing".   I am certainly not suggesting
> that
> we attempt to solve the problem.

... without such clarification (as made by the proposal) I could have
just read into the standard what I wanted it to say, i.e. that the
matching *has* to work ("as expected") on (byte-)strings and that it
would be enough to just strip off a trailing '.' without any LC_ALL=C
games.

But others may - in the current form of the standard - choose to
interpret it as defined-on-character-strings-only.
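
For reference, the "LC_ALL=C games" around the trailing-newline case could look like this sketch. The helper name capture and the variable names are made up for illustration; it assumes that assigning LC_ALL in the running shell affects the ${...%.} expansion that follows.

```shell
# capture COMMAND [ARG...]
# Hypothetical helper: stores COMMAND's output in $captured, including any
# trailing newlines that command substitution would otherwise strip.
capture() {
    # The appended '.' is a sentinel that protects trailing newlines.
    captured=$("$@"; printf .)
    # Remove the sentinel while every byte counts as a character, then
    # put LC_ALL back the way it was.
    if [ "${LC_ALL+set}" = set ]; then
        saved_lc_all=$LC_ALL
        LC_ALL=C
        captured=${captured%.}
        LC_ALL=$saved_lc_all
    else
        LC_ALL=C
        captured=${captured%.}
        unset LC_ALL
    fi
}

capture printf 'two\n\n'
echo "${#captured}"           # prints 5: "two" plus both newlines
```

Whether the detour through the C locale is actually required is the open question in this thread; the sketch simply takes the cautious reading.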


So I think it's better to clarify (even if it's just the "it's
unspecified"), than to leave it ambiguous.
One could however add a future direction, telling that this might be
defined some day.


> The problem I see, is that any specification at all of any of this,
> allows implementors to just say "that is what posix requires" and do
> nothing at all, where we really need some innovation, by someone who
> actually understands the issues and how to deal with them in a
> rational
> way - or at least who can come up with some kind of plan, and without
> any possibility of being considered a non-conformant implementation
> because of it.

I'd also prefer something that doesn't result in "undefined behaviour",
"implementation defined" or similar.

Despite being not an expert, it seems unlikely that different character
encodings will ever go away - even if all people would actually use
UTF-8, how would you ever manage to get rid of the whole framework for
different char encodings?

So the only other way would seem to be to specify how pattern matching
notation should work on byte strings.
Which I'd also prefer to have... but is there any consensus for this
(which will then actually be adhered to by shell implementors)?


Cheers,
Chris.



[Issue 8 drafts 0001564]: clariy on what (character/byte) strings pattern matching notation should work

2022-04-14 Thread Austin Group Bug Tracker via austin-group-l at The Open Group


A NOTE has been added to this issue. 
== 
https://www.austingroupbugs.net/view.php?id=1564 
== 
Reported By:calestyo
Assigned To:
== 
Project:Issue 8 drafts
Issue ID:   1564
Category:   Shell and Utilities
Type:   Clarification Requested
Severity:   Editorial
Priority:   normal
Status: New
Name:   Christoph Anton Mitterer 
Organization:
User Reference:  
Section:2.13 Pattern Matching Notation 
Page Number:2351 
Line Number:76099 
Final Accepted Text: 
== 
Date Submitted: 2022-02-23 01:54 UTC
Last Modified:  2022-04-15 02:17 UTC
== 
Summary:clariy on what (character/byte) strings pattern
matching notation should work
==
Relationships   ID  Summary
--
related to  0001561 clarify what kind of data shell variabl...
== 

-- 
 (0005805) calestyo (reporter) - 2022-04-15 02:17
 https://www.austingroupbugs.net/view.php?id=1564#c5805 
-- 
Re: https://www.austingroupbugs.net/view.php?id=1561#c5796

In principle that clarifies my original point.

(Although I'd probably have preferred it if all (relevant) implementations
just behaved the same already... and this could be made non-undefined.)


Do you think something should be done about fnmatch(), page 879?

While it refers to sections 2.13.1 and 2.13.2 (which also fall under your
proposed changes in 2.13) it still uses "string" all over and doesn't
mention any dependency on the locale's character encoding? 





[Issue 8 drafts 0001564]: clariy on what (character/byte) strings pattern matching notation should work

2022-04-14 Thread Austin Group Bug Tracker via austin-group-l at The Open Group


A NOTE has been added to this issue. 
== 
https://www.austingroupbugs.net/view.php?id=1564 
== 
Reported By:calestyo
Assigned To:
== 
Project:Issue 8 drafts
Issue ID:   1564
Category:   Shell and Utilities
Type:   Clarification Requested
Severity:   Editorial
Priority:   normal
Status: New
Name:   Christoph Anton Mitterer 
Organization:
User Reference:  
Section:2.13 Pattern Matching Notation 
Page Number:2351 
Line Number:76099 
Final Accepted Text: 
== 
Date Submitted: 2022-02-23 01:54 UTC
Last Modified:  2022-04-15 02:12 UTC
== 
Summary:clariy on what (character/byte) strings pattern
matching notation should work
==
Relationships   ID  Summary
--
related to  0001561 clarify what kind of data shell variabl...
== 

-- 
 (0005804) calestyo (reporter) - 2022-04-15 02:12
 https://www.austingroupbugs.net/view.php?id=1564#c5804 
-- 
Re: https://www.austingroupbugs.net/view.php?id=1561#c5797

Well this proposal is not really changing anything, is it? Why do you think
it's worse to name that something results in undefined behaviour than not
saying anything at all and leave it ambiguous (especially when the wording
is already contradictory)?

Also... it doesn't seem as if locales would ever go away. And even if the
POSIX/C community (and all other affected groups) would decide tomorrow to
abolish locales or at least the choice of character encodings and make all
UTF-8... there would be still millions of lines of code which assume
different character encodings to exist and which thus somehow need to be
defined in a proper manner. 





Re: [Issue 8 drafts 0001564]: clariy on what (character/byte) strings pattern matching notation should work

2022-04-14 Thread Robert Elz via austin-group-l at The Open Group
Date:Thu, 14 Apr 2022 09:42:37 +0100
From:"Geoff Clare via austin-group-l at The Open Group" 

Message-ID:  <20220414084237.GA15370@localhost>

  | That is how things are at present. The suggested changes just make it
  | explicit.

Yes, I know, but that's what I am suggesting that we not do in this one case.

  | Do you have an alternative proposal?

Only to the extent of "do nothing".   I am certainly not suggesting that
we attempt to solve the problem.

Except perhaps it might be worth adding something to the Rationale (but
about what, ie: where there, I have no idea) along the lines of:

It is often unclear whether a string is to be interpreted as
characters in some locale, or as an arbitrary byte string.
While it would have been possible to arbitrarily make the various
cases more explicit, or explicitly unspecified, it was considered
better, in this version of  to
make no changes, as it is believed that much additional work is
required to enable a standards-worthy specification possible.
This work is beyond the scope of this standard.

The problem I see, is that any specification at all of any of this,
allows implementors to just say "that is what posix requires" and do
nothing at all, where we really need some innovation, by someone who
actually understands the issues and how to deal with them in a rational
way - or at least who can come up with some kind of plan, and without
any possibility of being considered a non-conformant implementation
because of it.

  | The application can document that it requires pathnames to be in the
  | same encoding as the user's locale.

That's not sufficient.   Try encoding a find command to look for pathnames
containing currency symbols.   It should be just a simple find -name '*[ABCD]*'
type operation, with appropriate substitutions for the ABCD chars.

No problem if not all the world's currency symbols are encoded, if we find
one that has been forgotten, it can simply be added.  Currency symbols are
things like the $ sign, British pound, Euro, Yen, Baht, ... (there are a
whole bunch of them).   If there were a [:currency:] class, it would be easy
(and I'd need to come up with a different example).   But there isn't.

If we cannot do something this simple, and expect it to work reliably,
everywhere, then what we have is useless, and needs to be replaced or
reworked.   That's not a standards' body type task.   But we should be
doing nothing to interfere with the production of a solution.

  | The C locale is specified as containing 256 single-byte characters.
  | Thus in the C locale all pathnames are valid character strings.

Sure, understood.

  | > Even worse perhaps, ???.doc which should match 7 char
  | > names that end in ".doc" (or is that 7 byte names?) (not counting the \0).
  |
  | It would match 7-byte names.

Yes, in the C locale it would.   But do you believe that is what the user
would have intended?   Are they to be required to work out how many bytes
their local filenames are encoded as, and enter the appropriate number of '?'
chars?
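
The difference between "7 char names" and "7-byte names" can be sketched
in Python. This is a model under stated assumptions, not the behaviour of
any particular shell: Python's fnmatch module ignores the locale, so str
matching stands in for a UTF-8 locale and bytes matching stands in for the
C locale. The function names and sample file names are hypothetical.

```python
import fnmatch


def matches_by_character(pattern: str, name: str) -> bool:
    """'?' consumes one character (a model of matching in a UTF-8 locale)."""
    return fnmatch.fnmatchcase(name, pattern)


def matches_by_byte(pattern: str, name: str) -> bool:
    """'?' consumes one byte (a model of matching in the C locale)."""
    return fnmatch.fnmatchcase(name.encode("utf-8"), pattern.encode("utf-8"))


if __name__ == "__main__":
    # a-umlaut followed by "bc.doc", then a-umlaut followed by "b.doc"
    for name in ("\u00e4bc.doc", "\u00e4b.doc"):
        print(ascii(name),
              len(name), "chars,",
              len(name.encode("utf-8")), "bytes:",
              matches_by_character("???.doc", name),
              matches_by_byte("???.doc", name))
```

The first name has 7 characters but 8 bytes, so only the character-wise
match accepts it; the second has 6 characters but 7 bytes, so only the
byte-wise match accepts it.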

kre



Re: [Issue 8 drafts 0001564]: clariy on what (character/byte) strings pattern matching notation should work

2022-04-14 Thread Geoff Clare via austin-group-l at The Open Group
Robert Elz wrote, on 12 Apr 2022:
>
>   | 1. The vast majority of apps will never need to do that because they know
>   | (or can assume) that the pathnames they handle either always use the
>   | portable filename character set or use the user's locale.
> 
> The latter, perhaps, the former, certainly not in an international context.
> The point was that, at least as I read the proposed text, you're defining
> things like '*' to only work (reliably as specified) when the locale is
> POSIX (aka C).   In the user's locale, who knows what happens?

That is how things are at present. The suggested changes just make it
explicit.

Do you have an alternative proposal?

>   | I.e. the pathnames are not arbitrary (a word I was careful to
>   | include in the proposed changes).
> 
> Sure, the problem is that when dealing with user input (as in, for example,
> the command line args) the application cannot assume that the pathnames are
> not arbitrary.   They're anything that's OK for the user.

The application can document that it requires pathnames to be in the
same encoding as the user's locale.

>   | 2. In apps that truly do need to do matching or expansion on arbitrary
>   | pathnames, a C program can call uselocale() before and after calls to
>   | fnmatch(), glob(), and wordexp(). A shell script can set LC_ALL=C before
>   | handling pathnames (and unset it or restore it afterwards). 
> 
> But how does that help *.doc (in a defined way, as opposed to "of course
> that works in all glob implementations") match a filename that isn't
> entirely ascii (by which I mean, using characters only from the portable
> character set)?

The C locale is specified as containing 256 single-byte characters.
Thus in the C locale all pathnames are valid character strings.

> Even worse perhaps, ???.doc which should match 7 char
> names that end in ".doc" (or is that 7 byte names?) (not counting the \0).

It would match 7-byte names.

-- 
Geoff Clare 
The Open Group, Apex Plaza, Forbury Road, Reading, RG1 1AX, England



Re: [Issue 8 drafts 0001564]: clariy on what (character/byte) strings pattern matching notation should work

2022-04-12 Thread Robert Elz via austin-group-l at The Open Group
Date:        Tue, 12 Apr 2022 08:51:51 +
From:        "Austin Group Bug Tracker via austin-group-l at The Open Group"
Message-ID:  <1541e949d4c9cd28467acf6033bfd...@austingroupbugs.net>

That is, Geoff Clare:

  | 1. The vast majority of apps will never need to do that because they know
  | (or can assume) that the pathnames they handle either always use the
  | portable filename character set or use the user's locale.

The latter, perhaps, the former, certainly not in an international context.
The point was that, at least as I read the proposed text, you're defining
things like '*' to only work (reliably as specified) when the locale is
POSIX (aka C).   In the user's locale, who knows what happens?

  | I.e. the pathnames are not arbitrary (a word I was careful to
  | include in the proposed changes).

Sure, the problem is that when dealing with user input (as in, for example,
the command line args) the application cannot assume that the pathnames are
not arbitrary.   They're anything that's OK for the user.

  | 2. In apps that truly do need to do matching or expansion on arbitrary
  | pathnames, a C program can call uselocale() before and after calls to
  | fnmatch(), glob(), and wordexp(). A shell script can set LC_ALL=C before
  | handling pathnames (and unset it or restore it afterwards). 

But how does that help *.doc (in a defined way, as opposed to "of course
that works in all glob implementations") match a filename that isn't
entirely ascii (by which I mean, using characters only from the portable
character set)?   Even worse perhaps, ???.doc which should match 7 char
names that end in ".doc" (or is that 7 byte names?) (not counting the \0).

Anyone from outside the English speaking world is likely to encounter many
of those.

kre



[Issue 8 drafts 0001564]: clariy on what (character/byte) strings pattern matching notation should work

2022-04-12 Thread Austin Group Bug Tracker via austin-group-l at The Open Group


A NOTE has been added to this issue. 
== 
https://austingroupbugs.net/view.php?id=1564 
== 

-- 
 (0005798) geoffclare (manager) - 2022-04-12 08:51
 https://austingroupbugs.net/view.php?id=1564#c5798 
-- 
> How can a conforming application possibly (sanely) ensure the C locale is
> in use when performing pathname expansion using user input that has been
> presented in the user's locale

1. The vast majority of apps will never need to do that because they know
(or can assume) that the pathnames they handle either always use the
portable filename character set or use the user's locale. I.e. the
pathnames are not arbitrary (a word I was careful to include in the
proposed changes).

2. In apps that truly do need to do matching or expansion on arbitrary
pathnames, a C program can call uselocale() before and after calls to
fnmatch(), glob(), and wordexp(). A shell script can set LC_ALL=C before
handling pathnames (and unset it or restore it afterwards). 
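
The save, switch and restore pattern described in point 2 can be sketched
as follows. This is only an illustration in Python: locale.setlocale()
wraps the process-wide setlocale(), not the per-thread uselocale() named
above, and the sketch switches LC_CTYPE only rather than LC_ALL. The name
c_ctype_locale is hypothetical.

```python
import contextlib
import locale


@contextlib.contextmanager
def c_ctype_locale():
    """Switch LC_CTYPE to "C" for the duration of a with-block, then restore it."""
    saved = locale.setlocale(locale.LC_CTYPE)      # query only, changes nothing
    locale.setlocale(locale.LC_CTYPE, "C")
    try:
        yield
    finally:
        locale.setlocale(locale.LC_CTYPE, saved)   # restore even if the body raised


if __name__ == "__main__":
    print("before:", locale.setlocale(locale.LC_CTYPE))
    with c_ctype_locale():
        # Locale-sensitive matching on arbitrary pathnames would go here.
        print("inside:", locale.setlocale(locale.LC_CTYPE))
    print("after: ", locale.setlocale(locale.LC_CTYPE))
```

Restoring in a finally clause corresponds to the "restore it afterwards"
step: the caller's locale comes back even if the matching code fails.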





[Issue 8 drafts 0001564]: clariy on what (character/byte) strings pattern matching notation should work

2022-04-11 Thread Austin Group Bug Tracker via austin-group-l at The Open Group


A NOTE has been added to this issue. 
== 
https://austingroupbugs.net/view.php?id=1564 
== 

-- 
 (0005797) kre (reporter) - 2022-04-11 22:58
 https://austingroupbugs.net/view.php?id=1564#c5797 
-- 
How can a conforming application possibly (sanely) ensure the C locale is
in use when performing pathname expansion using user input that has been
presented in the user's locale (and if that is not to be allowed, how
can the user ever sanely use pathnames containing characters that are not
ASCII, and if that is not to be allowed, what good are locales?)

I truly wish we could simply stop attempting to make the standard
consistent with regard to the current locale mess; it is all way too
broken to be useful.





[Issue 8 drafts 0001564]: clariy on what (character/byte) strings pattern matching notation should work

2022-04-11 Thread Austin Group Bug Tracker via austin-group-l at The Open Group


A NOTE has been added to this issue. 
== 
https://austingroupbugs.net/view.php?id=1564 
== 

-- 
 (0005796) geoffclare (manager) - 2022-04-11 13:55
 https://austingroupbugs.net/view.php?id=1564#c5796 
-- 
Suggested changes...

On page 2351 line 76098 section 2.13 Pattern Matching Notation, change:

    The pattern matching notation described in this section is used to
    specify patterns for matching strings in the shell.

to:

    The pattern matching notation described in this section is used to
    specify patterns for matching character strings in the shell.

After page 2351 line 76102 section 2.13 Pattern Matching Notation, add a
new paragraph:

    If an attempt is made to use pattern matching notation to match a
    string that contains one or more bytes that do not form part of a
    valid character, the behavior is unspecified. Since pathnames can
    contain such bytes, portable applications need to ensure that the
    current locale is the C or POSIX locale when performing pattern
    matching (or expansion) on arbitrary pathnames.
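
What "bytes that do not form part of a valid character" means in practice
can be sketched with a small Python check. It is a model only: a named
codec stands in for the character encoding of the current locale, and the
function name is hypothetical.

```python
def is_character_string(name: bytes, encoding: str = "utf-8") -> bool:
    """Return True if every byte of `name` is part of a valid character
    in `encoding`, False if any byte does not form part of one."""
    try:
        name.decode(encoding, errors="strict")
    except UnicodeDecodeError:
        return False
    return True


if __name__ == "__main__":
    utf8_name = b"caf\xc3\xa9.txt"    # e-acute encoded as UTF-8
    latin1_name = b"caf\xe9.txt"      # e-acute encoded as ISO 8859-1
    print(is_character_string(utf8_name))               # valid in UTF-8
    print(is_character_string(latin1_name))             # invalid in UTF-8
    print(is_character_string(latin1_name, "latin-1"))  # valid byte by byte
```

With a single-byte encoding in which every byte value is a character (here
ISO 8859-1, as a rough analogue of the C locale), no pathname can fail the
check, which is why the proposed text points portable applications at the
C or POSIX locale.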





[Issue 8 drafts 0001564]: clariy on what (character/byte) strings pattern matching notation should work

2022-04-07 Thread Austin Group Bug Tracker via austin-group-l at The Open Group


The following issue has been set as RELATED TO issue 0001561. 
== 
https://austingroupbugs.net/view.php?id=1564 
== 





[Issue 8 drafts 0001564]: clariy on what (character/byte) strings pattern matching notation should work

2022-03-02 Thread Austin Group Bug Tracker via austin-group-l at The Open Group


A NOTE has been added to this issue. 
== 
https://www.austingroupbugs.net/view.php?id=1564 
== 

-- 
 (0005729) calestyo (reporter) - 2022-03-03 03:37
 https://www.austingroupbugs.net/view.php?id=1564#c5729 
-- 
Well, I guess this whole thing is also why your earlier point was that
'.' as a sentinel would be enough, and that any implementation that doesn't
carry along invalid encodings (i.e. bytes that do not form characters) would
already be buggy in that respect.


I guess on the one hand, Geoff is clearly right when he says that no such
behaviour (especially the complicated mappings that you explained above)
is expected of an implementation (at least not by the current
standard)... and as such '.' would in fact not be enough as the sentinel
for command substitution with trailing newlines, and LC_ALL=C would be
required.


OTOH... the '*' example above was intended to question whether there are
really any implementations which would filter out filenames which contain
bytes that do not form characters. I'd guess not.


So one more reason, why I think that this should be clearly specified
(which I've requested with this issue)... i.e. if there's consensus that it
(pattern matching) operates on character strings only - clearly name this,
declare operation on non-character strings unspecified (with respect to
their results) and remove all current references that indicate that it
would operate on bytes. 





[Issue 8 drafts 0001564]: clariy on what (character/byte) strings pattern matching notation should work

2022-02-25 Thread Austin Group Bug Tracker via austin-group-l at The Open Group


A NOTE has been added to this issue. 
== 
https://www.austingroupbugs.net/view.php?id=1564 
== 

-- 
 (0005719) mirabilos (reporter) - 2022-02-25 20:54
 https://www.austingroupbugs.net/view.php?id=1564#c5719 
-- 
「would a "ls *" in principle be expected to only match filenames that are
character strings in the current locale?」

That’d be the logical consequence of treating it as characters in the
current encoding.

When I added locale “things” to my projects, I extended the definition
of character. Instead of just accepting the characters that are valid in
the current encoding (which is either 8-bit-transparent 7-bit ASCII or
(back then BMP-only, but I’m moving to full 21-bit) UTF-8), so-called
“raw octets” are also mapped into the wide character range.

Every time a conversion fails, the first octet of it is handled as raw
octet, then the conversion restarts on the next one. (This can obviously be
optimised for illegal UTF-8 sequences if one is careful about the beginning
of the next possibly valid sequence.)

In 16-bit wchar_t times (basically “until 2022Q1”), this is mapped into
a PUA range reserved by the CSUR for this. (Not quite optimal.) This is
U+EF80‥U+EFFF. (What happens when you encounter \xEE\xBE\x80 can only be
described as fun.)
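
The decoding rule just described can be sketched in Python as follows.
This is a reconstruction under assumptions, not the actual code: it
assumes UTF-8 as the locale encoding and assumes a raw octet 0xNN is mapped
to U+EFNN, which is one natural way to fill U+EF80..U+EFFF. The function
name is hypothetical.

```python
def decode_with_raw_octets(data: bytes) -> str:
    """Decode UTF-8, mapping each undecodable octet into U+EF80..U+EFFF."""
    out = []
    i = 0
    while i < len(data):
        # Try the shortest sequence first; a UTF-8 character is 1 to 4 octets.
        for length in (1, 2, 3, 4):
            chunk = data[i:i + length]
            try:
                out.append(chunk.decode("utf-8"))
            except UnicodeDecodeError:
                continue
            i += len(chunk)
            break
        else:
            # Conversion failed: treat the first octet as a raw octet and
            # restart on the next one. Only octets >= 0x80 can end up here.
            out.append(chr(0xEF00 + data[i]))
            i += 1
    return "".join(out)


if __name__ == "__main__":
    print(ascii(decode_with_raw_octets(b"a\xffb")))
    # A raw 0x80 and the valid UTF-8 encoding of U+EF80 become the same string.
    print(decode_with_raw_octets(b"\x80") == decode_with_raw_octets(b"\xee\xbe\x80"))
```

The last line shows the ambiguity behind the \xEE\xBE\x80 remark: that
sequence is the valid UTF-8 encoding of U+EF80, so after decoding it cannot
be told apart from a raw 0x80 octet. Python's own "surrogateescape" error
handler uses the same idea with the range U+DC80..U+DCFF.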

In the new scheme, I’m mapping them to U-1080‥U-10FF which is
outside of the range of things, so not a problem (except now I’m
wondering what to set WCHAR_MAX to, but I think 0x10U still, because
only these are, strictly speaking, valid?)

There’s a complication that has to do with the idiotic Standard C API for
mbrtowc(3) in that “return value == 0” is the sole test for “*pwc ==
L'\0'” and so cannot be used to signal that 0 octets have been eaten,
which means I might need to use even higher numbers for 2‑, 3‑ and
4-byte raw octet sequences. (The latter of which has wcwidth() == 4…)

But that’s detail. The thing relevant here is that this is (could be, but
anything else is either discarding the notion of character here (which
would be hard to make congruent with the existence of character classes) or
an active and certainly harmful disservice to users (the “only match
filenames that are valid” I quoted above)) a middle ground between
characters and bytes: bytes that are characters if possible and have
character semantics applied, but may not.

This is currently unspecified. I’d like to (continue) treat(ing) things
in a way that means that, for example, ? is either a character or a single
byte from an invalid multibyte sequence (of length 1 or more), with a
subsequent ? catching a possible second byte, and so on. Raw octets are
displayed as � with a wcwidth() of 1 each (or some application-local
suitable encoding, where that is possible). 
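
A minimal Python sketch of that middle ground, where ? takes either one
valid character or one stray byte. It is an approximation, not the
implementation described in this note: it assumes UTF-8, uses Python's
built-in "surrogateescape" error handler (undecodable bytes become
U+DC80..U+DCFF) in place of the mapping described above, and uses Python's
fnmatch module, which is not the POSIX fnmatch(). The function name is
hypothetical.

```python
import fnmatch


def matches_with_raw_octets(pattern: str, name: bytes) -> bool:
    """Match `name` so that '?' consumes either one valid UTF-8 character
    or one byte that is not part of a valid character."""
    # surrogateescape turns each undecodable byte into one lone surrogate,
    # so every stray byte counts as a single unit for '?'.
    text = name.decode("utf-8", errors="surrogateescape")
    return fnmatch.fnmatchcase(text, pattern)


if __name__ == "__main__":
    print(matches_with_raw_octets("???.doc", b"\xc3\xa4bc.doc"))  # a-umlaut, b, c
    print(matches_with_raw_octets("???.doc", b"\xe4bc.doc"))      # raw 0xE4, b, c
    print(matches_with_raw_octets("???.doc", b"\xc3\xa4b.doc"))   # a-umlaut, b
```

Both the valid UTF-8 name and the name with a stray ISO 8859-1 byte are
three units long before ".doc", so ???.doc accepts both; the third name is
only two units long and is rejected.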





[Issue 8 drafts 0001564]: clariy on what (character/byte) strings pattern matching notation should work

2022-02-24 Thread Austin Group Bug Tracker via austin-group-l at The Open Group


A NOTE has been added to this issue. 
== 
https://www.austingroupbugs.net/view.php?id=1564 
== 

-- 
 (0005716) calestyo (reporter) - 2022-02-25 04:57
 https://www.austingroupbugs.net/view.php?id=1564#c5716 
-- 
3) A third aspect, that should perhaps be considered by some more
knowledgeable person than me:

One main use of pattern matching is matching filenames (2.13.3 Patterns
Used for Filename Expansion).
Filenames however are explicitly byte strings.


So if pattern matching is indeed intended to only have a specified meaning
on character strings (in the current locale), then 2.13.3 should somehow
explain this.

Especially how such patterns should deal with filenames that aren't
character strings in the current locale (e.g. error, unspecified,
ignored?).

For example, would a "ls *" in principle be expected to only match
filenames that are character strings in the current locale?





[Issue 8 drafts 0001564]: clariy on what (character/byte) strings pattern matching notation should work

2022-02-22 Thread Austin Group Bug Tracker via austin-group-l at The Open Group


The following issue has been SUBMITTED. 
== 
https://www.austingroupbugs.net/view.php?id=1564 
== 
Reported By:                calestyo
Assigned To:
==
Project:                    Issue 8 drafts
Issue ID:                   1564
Category:                   Shell and Utilities
Type:                       Clarification Requested
Severity:                   Editorial
Priority:                   normal
Status:                     New
Name:                       Christoph Anton Mitterer
Organization:
User Reference:
Section:                    2.13 Pattern Matching Notation
Page Number:                2351
Line Number:                76099
Final Accepted Text:
==
Date Submitted:             2022-02-23 01:54 UTC
Last Modified:              2022-02-23 01:54 UTC
==
Summary:                    clariy on what (character/byte) strings pattern
matching notation should work
Description: 
On the mailing list, the question arose (from my side) what the current
wording in the standard implies as to whether pattern matching works on
byte or character strings.


- In some earlier discussion it was pointed out that shell variables
  should be strings (of bytes, other than NUL)
  => which could lead one to think that pattern
     matching must work on any such strings

- 2.6.2 Parameter Expansion
  doesn't seem to say, what the #, ##, % and %% special forms of
  expansion work on: bytes or characters, it just
  refers to the pattern matching chapter


- 2.13. Pattern Matching Notation says:
  "The pattern matching notation described in this section is used to
  specify patterns for matching strings in the shell."
  => strings... would mean bytes (as per 3.375 String)

- 2.13.1 Patterns Matching a Single Character however says:
  "The following patterns matching a single character shall match a
  single character: ordinary characters,..."


I questioned whether one could deduce from that, that pattern matching is
required to cope with any non-characters in the string it operates upon.

This was however rejected on the list, and Geoff Clare pointed out that,
since no behaviour is specified (i.e. how the implementation would need to
handle such invalidly encoded characters), the use of pattern matching on
arbitrary byte strings is undefined behaviour.
Desired Action: 
Either:
1) - In line 76099, replace "strings" with "character strings" and perhaps
mention that when this is done on strings that contain any byte sequence
that is not a character in the current locale, the results are undefined.

Perhaps also clarify this in fnmatch() (page 879); this doesn't seem to
mention locales at all, but if the above assumption is true, and pattern
matching operates on characters only, wouldn't it then need to be subject
to the current LC_CTYPE?


2) Alternatively, some expert could check whether there are any
shell/fnmatch() implementations which do not simply carry along any bytes
that do not form characters. Probably there are (yash?). But if there
weren't, POSIX might even choose to standardise that behaviour, which would
probably be better than leaving it unspecified?!
== 
