Re: More issues with pattern matching

2019-09-25 Thread Harald van Dijk

On 26/09/2019 02:36, Harald van Dijk wrote:
POSIX mentions the possibility of locale-specific character classes and 
they are required to be recognised in regular expressions and therefore 
in shell glob patterns:



In addition, character class expressions of the form:

[:name:]

are recognized in those locales where the name keyword has been given 
a charclass definition in the LC_CTYPE category.


I have not checked which shells implement this correctly. I know mine 
does not. I was assuming a locale that does not define [:bogus:] as a 
character class, but should have specified.


I meant to include here that under a locale that does not define 
[:bogus:] as a character class, I expect [[:bogus:]] to silently not 
match anything, like you. I referred back to it later, but forgot to 
actually write it.


Cheers,
Harald van Dijk



Re: More issues with pattern matching

2019-09-25 Thread Harald van Dijk

On 26/09/2019 01:47, Robert Elz wrote:

 Date:Wed, 25 Sep 2019 22:29:36 +0100
 From:Harald van Dijk 
 Message-ID:  

   | These all seem reasonable choices. regcomp() would reject the whole
   | pattern as an error, and character classes are supposed to behave as
   | they do in regular expressions, so I believe yash's behaviour makes the
   | most sense. Is that correct?

Character classes (and bracket expressions) behave like they do in
regcomps, but in glob patterns there is no such thing as "invalid",
patterns that one might assume to be invalid are in reality patterns
that match something different than looks like might have been intended
at first glance.


The possibility of invalid patterns is explicitly acknowledged in the 
description of patterns ending with an unescaped backslash:



If a pattern ends with an unescaped <backslash>, it is unspecified whether the 
pattern does not match anything or the pattern is treated as invalid.


However, yash treats the patterns I described as not matching anything, 
it does not raise any errors for them. I had not considered the 
possibility of shells raising errors for them and I agree that that is 
not desirable.



So, if one were to decide that [:bogus:] is not a valid character
class, as the name is not valid -- which I think would be a truly poor
choice, as locales are free to define new character classes, and this
approach would make it impossible to ever safely attempt to use such
class ... eg: some languages might (I have no idea if they do or not)
define a character class tonemark (with whatever spelling) to match the
"characters" that indicate the "tone" (which I kind of understand, but
am unable to describe) that is to me used (in Thai there are I believe 5
different words that all sound like "ma" with different tones, and wildly
different meanings, dog, horse, come (I don't know the other two) they
are pronounced with different tones (high, low, rising, falling and normal).
I think there are 7 tones in Vietnamese.

There are glyphs that are written above the consonant (the 'm' in this
case - the Thai 'm' obviously) that indicate which tone (actually take
the base tone implied by the consonant in question and modify it, rather
than being absolute).  Those glyphs are present in written text as a
character following the consonant.

It would be entirely reasonable for a script to look for something
like (using Thai chars for the 'm' and 'a' of course)
m[[:tonemark:]]a
to match any of those words (except the one that has no mark, never
mind, we'd need  ma | m[[:tonemark:]]a ).

If we were to treat this as invalid, that is, generate an error (and would
it be a "compile time" error, or execution time?) just because we happen to
be in some non tone mark using locale, rather than simply not matching, things
get very difficult.


POSIX mentions the possibility of locale-specific character classes and 
they are required to be recognised in regular expressions and therefore 
in shell glob patterns:



In addition, character class expressions of the form:

[:name:]

are recognized in those locales where the name keyword has been given a 
charclass definition in the LC_CTYPE category.


I have not checked which shells implement this correctly. I know mine 
does not. I was assuming a locale that does not define [:bogus:] as a 
character class, but should have specified.



But if that were to happen and the character class is invalid, then the
bracket expression is

[[:tonemark:]

and is the set of any of the characters '[' ':' 't' 'o' ... 'r' 'k'
(and an extra redundant ':' - since it is a set, duplicates are ignored
just as they are in [aaa])

The pattern above would be matched by a word that contains an 'm' followed
by one from that set, followed by a ']' followed by an 'a'.

That's even worse than generating an error.   Character classes really
must be treated as character classes, regardless of whether they're
recognised in the current locale or not.  This should not be unspecified.


Agreed, regardless of what POSIX currently says. This is what I was 
referring to with "The handling of this in dash, inherited by my shell, 
is just buggy and should be ignored." I intend to fix this, but am not 
sure of the details yet.



[...]
   | As with invalid character classes:
   |
   |case x in [x[.xy.]]) echo x ;; esac # bash, ksh, mksh
   |
   | This would be rejected with an error by regcomp(), so rejecting the
   | whole pattern makes most sense to me.

Same as above, no shell pattern is ever rejected, no matter what.


Same as above, by "rejected" I meant that it is treated as a 
never-matching pattern.



Eg:

case x in [xabc) echo x;; esac

is not "invalid" because  the "bracket expression" has no terminating ']',
rather it simply has no bracket expression at all, and fails to match
here because it only matches the literal string '[xabc'.
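
The literal reading can be checked directly. The following is a minimal
sketch, not taken from the thread; the function name unterminated_match is
hypothetical. The pattern is kept in a variable and expanded unquoted, so it
is still treated as a pattern.

```shell
# Hypothetical helper: report whether $1 matches the pattern '[xabc'.
# With no closing ']' there is no bracket expression, so the '[' is an
# ordinary character and only the literal string '[xabc' can match.
unterminated_match() {
    pattern='[xabc'
    case $1 in
        $pattern) echo "match" ;;
        *) echo "no match" ;;
    esac
}

unterminated_match x         # the single character x
unterminated_match '[xabc'   # the literal five-character string
```

Only the second call is expected to report a match.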


This is a special exception, a deviation from the regular expression 

Re: More issues with pattern matching

2019-09-25 Thread Robert Elz
Date:Wed, 25 Sep 2019 22:29:36 +0100
From:Harald van Dijk 
Message-ID:  

  | These all seem reasonable choices. regcomp() would reject the whole 
  | pattern as an error, and character classes are supposed to behave as 
  | they do in regular expressions, so I believe yash's behaviour makes the 
  | most sense. Is that correct?

Character classes (and bracket expressions) behave like they do in
regcomps, but in glob patterns there is no such thing as "invalid",
patterns that one might assume to be invalid are in reality patterns
that match something different than looks like might have been intended
at first glance.

So, if one were to decide that [:bogus:] is not a valid character
class, as the name is not valid -- which I think would be a truly poor
choice, as locales are free to define new character classes, and this
approach would make it impossible to ever safely attempt to use such
class ... eg: some languages might (I have no idea if they do or not)
define a character class tonemark (with whatever spelling) to match the
"characters" that indicate the "tone" (which I kind of understand, but
am unable to describe) that is to me used (in Thai there are I believe 5
different words that all sound like "ma" with different tones, and wildly
different meanings, dog, horse, come (I don't know the other two) they
are pronounced with different tones (high, low, rising, falling and normal).
I think there are 7 tones in Vietnamese.

There are glyphs that are written above the consonant (the 'm' in this
case - the Thai 'm' obviously) that indicate which tone (actually take
the base tone implied by the consonant in question and modify it, rather
than being absolute).  Those glyphs are present in written text as a
character following the consonant.

It would be entirely reasonable for a script to look for something
like (using Thai chars for the 'm' and 'a' of course)
m[[:tonemark:]]a
to match any of those words (except the one that has no mark, never
mind, we'd need  ma | m[[:tonemark:]]a ).

If we were to treat this as invalid, that is, generate an error (and would
it be a "compile time" error, or execution time?) just because we happen to
be in some non tone mark using locale, rather than simply not matching, things
get very difficult.

But if that were to happen and the character class is invalid, then the
bracket expression is

[[:tonemark:]

and is the set of any of the characters '[' ':' 't' 'o' ... 'r' 'k'
(and an extra redundant ':' - since it is a set, duplicates are ignored
just as they are in [aaa])

The pattern above would be matched by a word that contains an 'm' followed
by one from that set, followed by a ']' followed by an 'a'.

That's even worse than generating an error.   Character classes really
must be treated as character classes, regardless of whether they're
recognised in the current locale or not.  This should not be unspecified.
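
To make the undesirable reading concrete, it can be forced in any shell by
escaping the inner '['. This is only an illustrative sketch: the function
name fallback_match is hypothetical, and Latin 'm' and 'a' stand in for the
Thai characters. With the escape, no character class is attempted, so the
pattern is the set described above followed by a literal ']' and 'a'.

```shell
# Illustration only: the inner '[' is escaped so that no shell tries to
# parse a character class. The bracket expression is then the set
# '[' ':' 't' 'o' 'n' 'e' 'm' 'a' 'r' 'k', closed by the first ']',
# and the pattern continues with a literal ']' followed by 'a'.
fallback_match() {
    case $1 in
        m[\[:tonemark:]]a) echo "match" ;;
        *) echo "no match" ;;
    esac
}

fallback_match 'mt]a'   # 't' is in the set, then the literal ']a'
fallback_match 'mta'    # the literal ']' is missing
```

The first call should match and the second should not, which is clearly not
what the author of m[[:tonemark:]]a would have meant.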

Ignoring "errors" is the way glob matching has always worked, something
that cannot be interpreted as what it appears to be simply means something
different.

Regular expressions are different, they are more formally defined, and
have always had valid, and invalid cases - so what regcomp() happens to
do isn't really relevant to what happens with shell patterns.  They are
entirely different beasts.

  | 1b. Quoted character classes:
  |

  | I believe that as the special characters to indicate a character class 
  | are "[:" and ":]", the osh behaviour is correct, the character class 
  | name is allowed to be quoted.

I agree, there is nothing that suggests it should be otherwise.

  | The dash/nbsh behaviour, 
  | again inherited by my shell, is close, but the fact that the type of 
  | quoting affects how the character class is treated looks like a bug.

I would agree with that, and for nbsh I will take a look (I haven't yet
verified this, but I certainly believe it could be that way) and fix it.

  | 2. Collating symbols and equivalence classes
  |
  | Collating symbols and equivalence classes are less widely implemented.

nbsh is one that doesn't implement them at all.  That's a defect,
they should be supported, but fixing this is kind of low priority
as in practice, nothing seems to use them (other than tests to see
if they are implemented) so don't necessarily expect a resolution
any time soon (it isn't just the shell, I am not sure if there's
any support for those in NetBSD's locale system at all).


  | As with invalid character classes:
  |
  |case x in [x[.xy.]]) echo x ;; esac # bash, ksh, mksh
  |
  | This would be rejected with an error by regcomp(), so rejecting the 
  | whole pattern makes most sense to me.

Same as above, no shell pattern is ever rejected, no matter what.
Eg:

case x in [xabc) echo x;; esac

is not "invalid" because  the "bracket expression" has no terminating ']',
rather it simply has no bracket expression at all, and fails to match
here because it only matches the literal 

Re: More issues with pattern matching

2019-09-25 Thread Harald van Dijk

On 26/09/2019 00:18, Shware Systems wrote:
While it may not be mentioned in that thread, P182, L6005 explicitly has 
the blanket "violations ... produce undefined results" that I see could 
apply for bogus names. It would be a semantics error more than a syntax 
one, but language is there.


However, 9.3.5 could also be construed as all other names represent an 
empty set to be unioned with the set of elements for 9.3.5, 6b. as the 
class checked, and if an implementation does not provide any global set 
(since locale definitions have no means to define a per locale set) then 
no match is the required behavior. If a global set is provided, a match 
may occur, but either way all values for name are potentially valid.


This looks like a contradiction between the specification of regular 
expressions and the specification of regcomp().


I agree that the specification for regular expressions does not say that 
unrecognised names are "invalid" and that the failure to say so renders 
the results undefined. At the same time, regcomp() says:



The following constants are defined as the minimum set of error return values, 
although other errors listed as implementation extensions in <regex.h> are 
possible:



REG_ECTYPE
Invalid character class type referenced.


An implementation that allows any bogus name by either not parsing it as 
a character class or by parsing it as a character class that never 
matches any character would never have regcomp() return REG_ECTYPE, but 
by stating that the minimum set of error return values includes 
REG_ECTYPE, the standard requires regcomp() to return REG_ECTYPE for at 
least some patterns.
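
How a given regular expression implementation treats an unknown class name
can be probed from the shell through grep, which compiles its pattern as a
regular expression. This is a rough sketch under that assumption, with a
hypothetical function name. POSIX gives grep an exit status of 0 when lines
are selected, 1 when none are, and greater than 1 on error, so the status
hints at whether the name was rejected or merely matched nothing.

```shell
# Illustration only: the bogus class name is handled by the regular
# expression implementation behind grep, not by the shell. The exact
# status and any diagnostic are implementation details.
bogus_class_status() {
    printf 'x\n' | grep '[[:bogus:]]' >/dev/null 2>&1
    echo "$?"
}

bogus_class_status
```

A status greater than 1 would be consistent with an implementation that
reports an error such as REG_ECTYPE; a status of 1 would suggest the name
was accepted as a class that matches no character.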


Cheers,
Harald van Dijk



Re: More issues with pattern matching

2019-09-25 Thread Harald van Dijk

On 25/09/2019 22:49, Stephane Chazelas wrote:

2019-09-25 22:29:36 +0100, Harald van Dijk:
[...]

1a. Invalid character classes:

   case x in [x[:bogus:]]) echo x ;; esac # bash,bosh,mksh,nbsh,osh,zsh
   case x in [![:bogus:]]) echo x ;; esac # above except osh

[...]

See also

https://www.mail-archive.com/austin-group-l%40opengroup.org/msg02247.html

(and the rest of that thread).


Thanks. This does not cover all of my questions, but does cover some of 
them. I agree with Robert Elz's comment there:



I truly dislike that kind of approach in the standard - particularly if it
is deliberate.   Readers of the text don't know that it is actually
unspecified, as it might be specified somewhere else they haven't
found yet.


The fact that character classes in patterns are defined by reference to 
regular expressions, and in regular expressions they render the whole 
regular expression invalid per regcomp()'s REG_ECTYPE error return 
value, does not appear to be mentioned in that thread. This may change 
the conclusion that the behaviour is implicitly unspecified.


Cheers,
Harald van Dijk



Re: More issues with pattern matching

2019-09-25 Thread Stephane Chazelas
2019-09-25 22:29:36 +0100, Harald van Dijk:
[...]
> 1a. Invalid character classes:
> 
>   case x in [x[:bogus:]]) echo x ;; esac # bash,bosh,mksh,nbsh,osh,zsh
>   case x in [![:bogus:]]) echo x ;; esac # above except osh
[...]

See also

https://www.mail-archive.com/austin-group-l%40opengroup.org/msg02247.html

(and the rest of that thread).

-- 
Stephane



More issues with pattern matching

2019-09-25 Thread Harald van Dijk
After comparing what my shell does now during pattern matching to what 
it should, I found a few more cases where I do not believe POSIX is 
clear about what is required and where shells are not in agreement. 
These are not related to the backslash handling.


1. Character classes.

Most shells support character classes during pattern matching, as is 
required. They print "x" for


  case x in [[:alnum:]]) echo x ;; esac
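
For repeating checks like this across shells, a small helper can be handy.
The following is only a sketch; the name matches_pattern is hypothetical and
not part of any shell. It relies on an unquoted expansion in a case pattern
being treated as a pattern itself.

```shell
# Hypothetical helper: succeed if string $1 matches shell pattern $2.
# The unquoted $2 in the case pattern is interpreted as a pattern.
matches_pattern() {
    case $1 in
        $2) return 0 ;;
    esac
    return 1
}

if matches_pattern x '[[:alnum:]]'; then
    echo x
fi
```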

1a. Invalid character classes:

  case x in [x[:bogus:]]) echo x ;; esac # bash,bosh,mksh,nbsh,osh,zsh
  case x in [![:bogus:]]) echo x ;; esac # above except osh

The handling of this in dash, inherited by my shell, is just buggy and 
should be ignored.


In bash, bosh, mksh, nbsh, zsh, a character does not match an invalid 
character class. In osh, a character neither matches nor fails to match 
an invalid character class, but the pattern is still valid. In yash, the 
use of [:bogus:] renders the whole pattern invalid.


These all seem reasonable choices. regcomp() would reject the whole 
pattern as an error, and character classes are supposed to behave as 
they do in regular expressions, so I believe yash's behaviour makes the 
most sense. Is that correct?


1b. Quoted character classes:

Shells agree that quoting disables the recognition of character classes, 
but they disagree on how much quoting disables it.


  case x in ["[:alnum:]"]) echo x ;; esac # none
  case x in [[:"alnum:]"]) echo x ;; esac # none
  case x in [[:"alnum:"]]) echo x ;; esac # ksh, mksh, yash, zsh
  case x in [[:\alnum:]])  echo x ;; esac # above plus osh
  case x in [[:"alnum":]]) echo x ;; esac # above plus dash, nbsh

I believe that as the special characters to indicate a character class 
are "[:" and ":]", the osh behaviour is correct, the character class 
name is allowed to be quoted. Is that correct? The dash/nbsh behaviour, 
again inherited by my shell, is close, but the fact that the type of 
quoting affects how the character class is treated looks like a bug.


2. Collating symbols and equivalence classes

Collating symbols and equivalence classes are less widely implemented.

  case x in [[.x.]]) echo x ;; esac # bash, ksh, mksh, osh, yash
  case x in [[=x=]]) echo x ;; esac # same
  case ä in [[=a=]]) echo x ;; esac # bash, ksh, yash
  case a in [[=ä=]]) echo x ;; esac # same

The handling of brackets in pattern matching is defined by reference to 
RE Bracket Expression and no exception has been made for them, so these 
are supposed to be handled in pattern matching as well.


2a. Multi-character collating symbols and equivalence classes

Multi-character support seems impossible to implement portably other 
than by translating patterns to regular expressions as yash does. POSIX 
does not provide any other means to ask the implementation enough 
information about what is supported in the current locale. And when 
things do get translated to regular expressions, it relies on libc 
support, with glibc behaving strangely, but this may just be my limited 
understanding of how things are supposed to work.


  LANG=cy_GB.UTF-8
  case  ch in  [[=ch=]]) echo x ;; esac # none
  case  ch in  [[.ch.]]) echo x ;; esac # yash
  case xch in x[[=ch=]]) echo x ;; esac # yash

Are shells required to support this, and are shells therefore implicitly 
required to translate patterns to regular expressions, or should it be 
okay to implement this with single character support only?


2b. Invalid collating elements

As with invalid character classes:

  case x in [x[.xy.]]) echo x ;; esac # bash, ksh, mksh

This would be rejected with an error by regcomp(), so rejecting the 
whole pattern makes most sense to me. This appears to be what osh is 
doing as well, in a change from how it handles invalid character 
classes, and as expected it is what yash does. Is it the right approach?


2c. Quoted equivalence classes and collating symbols

The same question of quoting applies to these, but here too osh no 
longer behaves the way it did with character classes:


  case x in [[="x="]]) echo x ;; esac # ksh, mksh, osh, yash
  case x in [[."x."]]) echo x ;; esac # same

I believe this is incorrect for the same reason as the quoting in 
character classes. The quoting of "x" should be okay, but the quoting of 
"=" or "." should disable the recognition as an equivalence class or 
collating symbol, so the meaning of the pattern [[="x="]] should change 
to "one of [=x=, followed by ]", just like how the pattern [["=x="]] is 
treated already. Does that sound right?


Cheers,
Harald van Dijk



Re: [1003.1(2016)/Issue7+TC2 0001234]: in most shells, backslash doesn't have two meaning wrt pattern matching

2019-09-25 Thread Harald van Dijk

On 25/09/2019 15:49, Geoff Clare wrote:

There are differences due to the version number change and there are
differences due to the build configuration being different.  I only
mentioned the build configuration in order to preempt a response
claiming that the differences between bash 3 and bash 4 were not
sufficient to justify treating them as different shells.


I still do not understand this, but seeing how little relevance this has 
to the discussion, I am okay with dropping this.



I see it as a separate decision whether to do matching against pathnames
or not.  If matching is done, the treatment of backslash is then the same
as in glob(), find, etc.  If matching is not done, the result is the
same as if matching had been done and no matching pathnames were found.


I guess that makes sense. I was still thinking from the perspective of 
pathname expansion always being performed (aside perhaps from fully 
quoted words) in theory, but optimised away by shells in practice, that 
POSIX currently describes.



Personally I would prefer the backslash-is-always-special option, but
breaking autoconf when a %sn file exists was enough for me to accept
the bash2/3/4 behaviour as a compromise.


Earlier you wrote "the likelihood of this causing problems is extremely
small". This applies here as well. How likely is it for a '%sn' file to
exist? Other than as a deliberate attempt to cause the configure script to
fail, that is, in which case it is doing exactly what the user wanted.


For '%sn' perhaps not very likely, but the fact that this case came
to light in a widely-used open source application means that other
similar cases are likely to exist in other open source applications
and in closed source applications, user's private scripts, etc.


Agreed that similar cases are likely to exist (both with backslashes and 
with other special characters), but are there any cases that we can 
expect to cause problems for users that do not specifically create files 
to break scripts? I suspect the answer to that is no.



Those scripts can be fixed simply by adding quoting.  The autoconf
problem with bash5 can't be fixed that way.


Not the same way, but it could still be trivially fixed: instead of

  as_echo='printf %s\n'

configure scripts could do

  as_echo() { printf '%s\n' "$@"; }
  as_echo=as_echo

Incidentally, this problematic use of $as_echo had already been dropped 
in autoconf more than five years ago, replaced by a direct printf '%s\n' 
without any helper variable, it's just that there has not been a new 
release of autoconf since then, so bash and other software never picked 
up that update. On the autoconf front, there is nothing that needs to be 
done to ensure compatibility with the bash 5 behaviour even in the 
presence of a %sn file aside from getting out a new release. This does 
not really help us today though.
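
As a quick check of the function form, the sketch below runs it in a
scratch directory that contains a %sn file. The directory handling via
mktemp is an assumption added for illustration; the as_echo definition is
the one given above.

```shell
# Function form: the format string is quoted inside the function body,
# so it never undergoes pathname expansion.
as_echo() { printf '%s\n' "$@"; }
as_echo=as_echo

# Scratch directory containing a file named %sn (illustration only).
demo_dir=$(mktemp -d) || exit 1
(
    cd "$demo_dir" || exit 1
    : > '%sn'
    # $as_echo expands to the word as_echo, which contains no
    # backslash or other pattern character.
    $as_echo "hello"
)
rm -rf "$demo_dir"
```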



All the problems of all approaches are corner cases that are unlikely to
cause real problems in practice.


And yet, as Stephane reports, there have been several bug reports
against bash5 because of the new behaviour.


Those bug reports are about the unfortunate interaction between this 
treatment of backslashes and non-standard options, not about problems 
with scripts written for POSIX sh or invoking bash in POSIX mode, from 
what I have seen. Yes, I can agree that it is a problem for bash that


  var='printf %s\n hello'
  $var

errors out when the failglob option is enabled, or drops the newline 
when the nullglob option is enabled. I can think of some possible ways 
to handle that, but unless POSIX adds these options, determining how 
they should interact with indirect backslashes should probably not be 
done on this list.


Cheers,
Harald van Dijk



Re: [1003.1(2016)/Issue7+TC2 0001234]: in most shells, backslash doesn't have two meaning wrt pattern matching

2019-09-25 Thread Geoff Clare
Harald van Dijk  wrote, on 25 Sep 2019:
>
> On 25/09/2019 10:22, Geoff Clare wrote:
> >Harald van Dijk  wrote, on 24 Sep 2019:
> >>
> >>>>Regardless, a single shell is not enough to say "most shells", not even if
> >>>>it is multiple versions of that single shell.
> >>>
> >>>I consider bash 4 on Linux and bash 3 on macOS to be different shells.
> >>>(Their build configuration is different.)
> >>
> >>I do not understand this logic. The build configuration does not differ in
> >>any way that is relevant to pathname expansion. Surely the NetBSD shell is
> >>not counted separately for each port listed on
> >>, so why is bash different?
> >
> >Their behaviour is sufficiently different (in areas other than pathname
> >expansion) to consider them to be different shells.  The same is true
> >for ksh88 and ksh93.
> 
> So it is just that bash 3 and bash 4 are significantly different and both
> versions are still used on current versions of operating systems, it is not
> about build configuration?

There are differences due to the version number change and there are
differences due to the build configuration being different.  I only
mentioned the build configuration in order to preempt a response
claiming that the differences between bash 3 and bash 4 were not
sufficient to justify treating them as different shells.

> >Okay, I see your point now.  When putting part of a pathname in a
> >variable you have to know how it is going to be used in order to know
> >how backslash will be handled.  But this is just one aspect of a wider
> >problem - e.g. you have to know if the variable will be quoted or not
> >when used, which applies to the backslash-is-always-special behaviour
> >as well.
> 
> The shell script author does not necessarily have full control over this,
> though. In $dir/$file, how $dir is treated depends on whether $file contains
> metacharacters, and vice versa. Quoted vs. unquoted is something the shell
> script author does have full control over, and it is easy to check in
> typical scripts that all uses of $dir are quoted, or that all uses of $dir
> are unquoted.

Okay, I guess this counts as an entry in the "cons" columns for the
bash2/3/4 behaviour then.  I'm sure Stephane and others will argue
that it is outweighed by the "cons" for the bash5 behaviour, and I'm
inclined to agree.

> >In any case I see this as a very minor issue.  Putting a whole pattern
> >in a variable is a rare thing to do.  Putting part in a variable and
> >part direct is even more rare.  Coupled with the fact that using
> >backslash in patterns (that you want to be expanded) is also rare, the
> >likelihood of this causing problems is extremely small.
> 
> Putting a pattern in a variable is not that rare. The rest probably is, but
> see below.
> 
> >I wrote the above before I had fully thought it through, and having slept
> >on it my preference is now much stronger, and I certainly would object to
> >specifying the NetBSD sh behaviour.  The reason is because treating
> >backslash differently in different components in indirect shell patterns
> >is inconsistent with direct shell patterns, glob(), find -path, and the
> >pax pattern operand, none of which vary their treatment of backslash
> >across different components of a pattern that contains slashes.
> 
> Likewise, none of them vary their treatment of backslash according to
> whether (other) metacharacters are present. If a file named 'x' exists,
> find . -name '\x' will find it, despite '\x' not containing any
> metacharacters. The proposed resolution already treats backslashes
> differently to how they are treated in glob(), find, etc.

I see it as a separate decision whether to do matching against pathnames
or not.  If matching is done, the treatment of backslash is then the same
as in glob(), find, etc.  If matching is not done, the result is the
same as if matching had been done and no matching pathnames were found.

> >Personally I would prefer the backslash-is-always-special option, but
> >breaking autoconf when a %sn file exists was enough for me to accept
> >the bash2/3/4 behaviour as a compromise.
> 
> Earlier you wrote "the likelihood of this causing problems is extremely
> small". This applies here as well. How likely is it for a '%sn' file to
> exist? Other than as a deliberate attempt to cause the configure script to
> fail, that is, in which case it is doing exactly what the user wanted.

For '%sn' perhaps not very likely, but the fact that this case came
to light in a widely-used open source application means that other
similar cases are likely to exist in other open source applications
and in closed source applications, user's private scripts, etc.

> If you do think that is a problem, it is already a problem regardless of how
> backslash is handled in existing scripts, which pass URLs with query strings
> unquoted to curl or wget. That is, if a script contains
> 
>   curl https://some.site/path?name=value
> 
> you can break that 

Re: [1003.1(2016)/Issue7+TC2 0001234]: in most shells, backslash doesn't have two meaning wrt pattern matching

2019-09-25 Thread Stephane Chazelas
2019-09-25 10:22:07 +0100, Geoff Clare:
[...]
> I wrote the above before I had fully thought it through, and having slept
> on it my preference is now much stronger, and I certainly would object to
> specifying the NetBSD sh behaviour.  The reason is because treating
> backslash differently in different components in indirect shell patterns
> is inconsistent with direct shell patterns, glob(), find -path, and the
> pax pattern operand, none of which vary their treatment of backslash
> across different components of a pattern that contains slashes.

For the record, I agree the NetBSD 8.1 sh behaviour is
undesirable (I believe I made and expanded that case earlier)

[...]
> Personally I would prefer the backslash-is-always-special option, but
> breaking autoconf when a %sn file exists was enough for me to accept
> the bash2/3/4 behaviour as a compromise.
[...]

Note that the new bash5 behaviour has already been the subject
of several bug reports on the bash mailing list, not so
much about the type of case where a %sn exists as those are
dormant kind of issues that are hard to detect, but because it
becomes much more visible when the nullglob or failglob options
are enabled.

As in:

$ NL='\n' bash5 -O failglob -O xpg_echo -c 'echo $NL'
bash5: no match: \n
$ touch n
$ NL='\n' bash5 -O failglob -O xpg_echo -c 'echo $NL'
n

(and yes, one should use printf '%s\n' "$NL", not echo $NL, but
unfortunately not many people are aware that echo mustn't be
used or that parameter expansions must always be quoted in list
contexts; even the bash documentation and the POSIX standard text
make those mistakes).
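
A minimal sketch of the quoted form, with a hypothetical function name.
A quoted expansion is not subject to pathname expansion, and the %s
conversion does not interpret backslashes in its argument, so the value is
printed as it is stored.

```shell
# Hypothetical helper: print $1 exactly as stored, followed by a newline.
print_literal() {
    printf '%s\n' "$1"
}

NL='\n'
print_literal "$NL"   # prints a backslash followed by n
```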

-- 
Stephane



Re: [1003.1(2016)/Issue7+TC2 0001234]: in most shells, backslash doesn't have two meaning wrt pattern matching

2019-09-25 Thread Harald van Dijk

On 25/09/2019 10:22, Geoff Clare wrote:

Harald van Dijk  wrote, on 24 Sep 2019:



Regardless, a single shell is not enough to say "most shells", not even if
it is multiple versions of that single shell.


I consider bash 4 on Linux and bash 3 on macOS to be different shells.
(Their build configuration is different.)


I do not understand this logic. The build configuration does not differ in
any way that is relevant to pathname expansion. Surely the NetBSD shell is
not counted separately for each port listed on
, so why is bash different?


Their behaviour is sufficiently different (in areas other than pathname
expansion) to consider them to be different shells.  The same is true
for ksh88 and ksh93.


So it is just that bash 3 and bash 4 are significantly different and 
both versions are still used on current versions of operating systems, 
it is not about build configuration?



Okay, I see your point now.  When putting part of a pathname in a
variable you have to know how it is going to be used in order to know
how backslash will be handled.  But this is just one aspect of a wider
problem - e.g. you have to know if the variable will be quoted or not
when used, which applies to the backslash-is-always-special behaviour
as well.


The shell script author does not necessarily have full control over 
this, though. In $dir/$file, how $dir is treated depends on whether 
$file contains metacharacters, and vice versa. Quoted vs. unquoted is 
something the shell script author does have full control over, and it is 
easy to check in typical scripts that all uses of $dir are quoted, or 
that all uses of $dir are unquoted.



In any case I see this as a very minor issue.  Putting a whole pattern
in a variable is a rare thing to do.  Putting part in a variable and
part direct is even more rare.  Coupled with the fact that using
backslash in patterns (that you want to be expanded) is also rare, the
likelihood of this causing problems is extremely small.


Putting a pattern in a variable is not that rare. The rest probably is, 
but see below.



I wrote the above before I had fully thought it through, and having slept
on it my preference is now much stronger, and I certainly would object to
specifying the NetBSD sh behaviour.  The reason is because treating
backslash differently in different components in indirect shell patterns
is inconsistent with direct shell patterns, glob(), find -path, and the
pax pattern operand, none of which vary their treatment of backslash
across different components of a pattern that contains slashes.


Likewise, none of them vary their treatment of backslash according to 
whether (other) metacharacters are present. If a file named 'x' exists,
find . -name '\x' will find it, despite '\x' not containing any 
metacharacters. The proposed resolution already treats backslashes 
differently to how they are treated in glob(), find, etc.
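
A minimal sketch of the find case above, assuming a find that follows
POSIX pattern matching for -name and a mktemp that supports -d. The
helper name find_backslash_demo and the scratch directory are made up
for illustration only:

```shell
# Hypothetical helper: shows that find -name treats '\x' as an escaped 'x'
# even though the pattern contains no other metacharacters.
find_backslash_demo() {
    tmp=$(mktemp -d) || return 1      # scratch directory (assumes mktemp -d)
    (
        cd "$tmp" || exit 1
        : > x                         # create an empty file named 'x'
        find . -name '\x'             # the backslash escapes the 'x'
    )
    rm -rf "$tmp"
}
```

Calling find_backslash_demo should list the file 'x', since the 
backslash is consumed as an escape rather than matched literally.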



Personally I would prefer the backslash-is-always-special option, but
breaking autoconf when a %sn file exists was enough for me to accept
the bash2/3/4 behaviour as a compromise.


Earlier you wrote "the likelihood of this causing problems is extremely 
small". This applies here as well. How likely is it for a '%sn' file to 
exist? Other than as a deliberate attempt to cause the configure script 
to fail, that is, in which case it is doing exactly what the user wanted.


If you do think that is a problem, it is already a problem regardless of 
how backslash is handled in existing scripts, which pass URLs with query 
strings unquoted to curl or wget. That is, if a script contains


  curl https://some.site/path?name=value

you can break that script by creating a 'https:' directory, a 
'some.site' directory in that, and a 'pathXname=value' file in that. 
This is not hypothetical, I have seen multiple scripts that did this. I 
have seen that they did this because I was experimenting with bash's 
failglob option, which of course reported it as not matching anything.
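
As a rough sketch of that layout, assuming a POSIX-like sh with default 
options (no failglob or nullglob) and a mktemp that supports -d. The 
helper name url_glob_demo is hypothetical, and printf stands in for 
curl so that nothing touches the network:

```shell
# Hypothetical helper: builds the directory layout described above and
# prints what the unquoted and the quoted URL turn into.
url_glob_demo() {
    tmp=$(mktemp -d) || return 1      # scratch directory (assumes mktemp -d)
    (
        cd "$tmp" || exit 1
        mkdir -p https:/some.site
        : > 'https:/some.site/pathXname=value'
        # Unquoted: '?' is a pattern character, so the word undergoes
        # pathname expansion and can match the file created above.
        printf 'unquoted: %s\n' https://some.site/path?name=value
        # Quoted: no pathname expansion takes place.
        printf 'quoted:   %s\n' 'https://some.site/path?name=value'
    )
    rm -rf "$tmp"
}
```

Quoting the URL is the fix for such scripts; the unquoted form only 
works as long as no matching file happens to exist.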


We are not changing the shell semantics to say that pathname expansion 
is no longer performed on words that look like URLs, we just accept that 
this is technically a bug in those scripts, but that it is a bug that is 
so unlikely to cause real problems that for practical purposes we can 
ignore it.


All the problems of all approaches are corner cases that are unlikely to 
cause real problems in practice.



I agree there's a problem.  The proposed wording implies that the indirect
backslash escapes the shell-quoting backslash.

Here's a suggestion for how to fix that in the 1st bullet in 2.13.1:

 A <backslash> character that is not inside a bracket expression
 shall preserve the literal value of the following character, unless
 the following character is in a part of the pattern where shell
 quoting can be used and is a shell quoting character, in which case
 the behavior is unspecified.

It says the behaviour is unspecified because it seems to cause 

Re: [1003.1(2016)/Issue7+TC2 0001234]: in most shells, backslash doesn't have two meaning wrt pattern matching

2019-09-25 Thread Geoff Clare
Harald van Dijk  wrote, on 24 Sep 2019:
>
> >>Regardless, a single shell is not enough to say "most shells", not even if
> >>it is multiple versions of that single shell.
> >
> >I consider bash 4 on Linux and bash 3 on macOS to be different shells.
> >(Their build configuration is different.)
> 
> I do not understand this logic. The build configuration does not differ in
> any way that is relevant to pathname expansion. Surely the NetBSD shell is
> not counted separately for each port listed on
> , so why is bash different?

Their behaviour is sufficiently different (in areas other than pathname
expansion) to consider them to be different shells.  The same is true
for ksh88 and ksh93.

> >>>I think the way the bug 1234 resolution specifies it (as per bash2/3/4) is
> >>>more straightforward and easier for users to understand. It's a simple
> >>>binary choice: either matching against existing pathnames is performed
> >>>or it isn't.
> >>>If it is performed, all special pattern-matching characters, including
> >>>backslash, have their special meaning in all components of the pattern.
> >>
> >>You get situations where $dir/file1 and $dir/file2 name two files, but
> >>$dir/file[12] cannot be used to match them both in a single word, though.
> >
> >Can you be more specific?  Perhaps I'm missing something obvious, but
> >I can't think of a case that *cannot* be matched somehow.  E.g. to match
> >a backslash in a filename you can use [\\].
> 
> If dir='\x' and files 'x/file1', 'x/file2', '\x/file1', and '\x/file2' all
> exist, then under the proposed wording, $dir/file1 and $dir/file2 name the
> latter two files, but when you try to combine them in $dir/file[12], the
> meaning changes so that it names the former two. Yes, the value stored in
> the dir variable can be modified to avoid this inconsistency, but that does
> not change that there is an inconsistency in the shell's pathname expansion.
> 
> If indirect \ is always treated as literal, all would match the latter two
> files.
> 
> If indirect \ is always treated as a metacharacter, all would match the
> former two files.
> 
> If indirect \ is determined per pathname component, all would match the
> latter two files.
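
The four-file example quoted above can be tried with something like the
following sketch. It assumes a POSIX-like sh and a mktemp that supports
-d; the helper name indirect_backslash_demo is hypothetical. The output
is deliberately not stated here, because which files each word names is
exactly what differs between shells:

```shell
# Hypothetical helper: recreates the four files from the example and prints
# how the running shell expands each word. Results differ between shells.
indirect_backslash_demo() {
    tmp=$(mktemp -d) || return 1      # scratch directory (assumes mktemp -d)
    (
        cd "$tmp" || exit 1
        mkdir x '\x'                  # one directory 'x', one named backslash-x
        : > x/file1;    : > x/file2
        : > '\x/file1'; : > '\x/file2'
        dir='\x'
        # Unquoted on purpose: the point is to observe pathname expansion.
        # printf repeats its format once per resulting word.
        printf 'file1:    %s\n' $dir/file1
        printf 'file2:    %s\n' $dir/file2
        printf 'file[12]: %s\n' $dir/file[12]
    )
    rm -rf "$tmp"
}
```

Comparing the first two lines of output with the file[12] lines shows
whether the running shell has the inconsistency described above.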

Okay, I see your point now.  When putting part of a pathname in a
variable you have to know how it is going to be used in order to know
how backslash will be handled.  But this is just one aspect of a wider
problem - e.g. you have to know if the variable will be quoted or not
when used, which applies to the backslash-is-always-special behaviour
as well.

In any case I see this as a very minor issue.  Putting a whole pattern
in a variable is a rare thing to do.  Putting part in a variable and
part direct is even more rare.  Coupled with the fact that using
backslash in patterns (that you want to be expanded) is also rare, the
likelihood of this causing problems is extremely small.

> >>That is a problem that NetBSD sh does not have, and one that the
> >>alternatives of always treating indirect \ as literal or never doing so also
> >>do not have.
> >
> >The three choices that were considered were always treating indirect \
> >as literal, never treating it as literal, or the middle option that has
> >been in use for many years in bash2/3/4.  Given the problems with the
> >first two choices that were discussed at great length on the mailing
> >list, the middle option was felt to be the one that has the best chance
> >of achieving consensus.
> 
> For completeness, problems with the middle option were also discussed at
> great length on the mailing list.
> 
> >I must have overlooked the fact that NetBSD sh behaves slightly differently
> >in among the deluge of emails. (I note that Stephane says he did mention it).

On reflection, I think that the reason I haven't been considering the
NetBSD sh behaviour as an option is because kre said at one point in the
discussion that he would change it after the NetBSD 9 release.

Presumably he is waiting to see how bug 1234 is resolved before deciding
how to change it.

> >I have a slight preference for the bash2/3/4 behaviour for the reasons I
> >stated above, but would not object if others would prefer it.

I wrote the above before I had fully thought it through, and having slept
on it my preference is now much stronger, and I certainly would object to
specifying the NetBSD sh behaviour.  The reason is because treating
backslash differently in different components in indirect shell patterns
is inconsistent with direct shell patterns, glob(), find -path, and the
pax pattern operand, none of which vary their treatment of backslash
across different components of a pattern that contains slashes.

> You probably know my position, which is that it is too late for POSIX to
> change the requirements after shells had started implementing the previously
> required behaviour that under the new wording would no longer be permitted,
> and that aside from that it complicates the shell