Re: More issues with pattern matching

2020-08-05 Thread Harald van Dijk via austin-group-l at The Open Group

On 05/08/2020 15:54, Geoff Clare via austin-group-l at The Open Group wrote:

Harald van Dijk  wrote, on 31 Jul 2020:


Take the previous example glibc's cy_GB.UTF-8 locale, but with a different
collating element: in this locale, "dd" is a single collating element too.
Therefore, this must be matchable by bracket expressions.


Incorrect.

I think you overlooked these statements in XBD 9.3.5 items 2 and 3:

 It is unspecified whether a matching list expression matches a
 multi-character collating element that is matched by one of the
 expressions.

 It is unspecified whether a non-matching list expression matches a
 multi-character collating element that is not matched by any of
 the expressions.


My message was indirectly in reply to your message where you claimed 
that shells were required to support this. I'm very happy to see that 
this is actually not true, thanks for that.


Cheers,
Harald van Dijk



Re: More issues with pattern matching

2020-08-05 Thread Geoff Clare via austin-group-l at The Open Group
Harald van Dijk  wrote, on 31 Jul 2020:
>
> Take the previous example glibc's cy_GB.UTF-8 locale, but with a different
> collating element: in this locale, "dd" is a single collating element too.
> Therefore, this must be matchable by bracket expressions.

Incorrect.

I think you overlooked these statements in XBD 9.3.5 items 2 and 3:

It is unspecified whether a matching list expression matches a
multi-character collating element that is matched by one of the
expressions.

It is unspecified whether a non-matching list expression matches a
multi-character collating element that is not matched by any of
the expressions.

-- 
Geoff Clare 
The Open Group, Apex Plaza, Forbury Road, Reading, RG1 1AX, England



Re: More issues with pattern matching

2020-08-01 Thread Harald van Dijk

On 31/07/2020 00:10, Harald van Dijk wrote:
Take the previous example glibc's cy_GB.UTF-8 locale, but with a 
different collating element: in this locale, "dd" is a single collating 
element too. Therefore, this must be matchable by bracket expressions. 
However, "d" individually must *also* be matched by pattern expressions. 
"dd" can be matched by both [!x] and [!x][!x]. A shell cannot use 
regcomp()+regexec() to find the longest match for [!x] and assume that 
that is matched: a shell where


   case dd in [!x]d) echo match ;; esac

does not print "match" does not implement what POSIX requires. A shell 
where


   case dd in [!x]) echo match ;; esac

does not print "match" does not implement what POSIX requires either. 
Using regcomp()+regexec() to bind [!x] to either "d" or "dd" without 
taking the rest of the pattern into account will fail to match in one of 
these cases. And it needn't be the same way for all bracket expressions 
in a single pattern:


   case ddd in [!x][!x]) echo match ;; esac

Shells are required by POSIX to consider both the possibility that [!x] 
picks up "d" and that it picks up "dd" for each bracket expression 
individually.


A followup example: it seems downright crazy that POSIX would require that

  case ddd in *[!d]*) echo match ;; esac

prints "match", yet that appears to be exactly what it does require, and 
exactly what yash implements: "dd" is a collating element which is not 
"d", and therefore must be matched by [!d].


And this is something where GNU fail to implement the POSIX-specified 
behaviour even in regular expressions. If the regular expression support 
does not work as specified, shells cannot implement pattern matching on 
top of regular expressions and expect correct results.


  $ echo ddd | LC_ALL=cy_GB.UTF-8 grep '[^d]'
  $ echo ddd | LC_ALL=cy_GB.UTF-8 grep '.[^d]'
  ddd

It's clear that if the second prints 'ddd', so should the first, so it's 
clear that this result indicates a bug.


What's not clear to me is whether the second should print 'ddd'. When 
the string 'ddd' is part of a set of strings to be sorted, the first 
collating element is 'dd' and the second is 'd'. The second and third 
character do not together form a collating element. Is it correct that 
grep nevertheless uses that second and third 'd' to match '[^d]'?


If that is not correct, then shells cannot use regexec() at a given 
starting position: that given starting position may yield different 
collating elements compared to when the string is searched from the 
beginning.
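
The starting-position concern can be illustrated with a small model. This is only a sketch: it assumes a locale whose single multi-character collating element is "dd" and assumes elements are picked greedily from left to right, as in the sorting description above. The names MULTI_CHAR_ELEMENTS and split_collating_elements are made up for illustration and are not part of any C library interface.

```python
# Model locale: the only multi-character collating element is "dd".
# This set is an assumption for illustration, not real locale data.
MULTI_CHAR_ELEMENTS = {"dd"}


def split_collating_elements(text, start=0):
    """Greedily split text[start:] into collating elements, longest first."""
    longest = max((len(e) for e in MULTI_CHAR_ELEMENTS), default=1)
    elements = []
    pos = start
    while pos < len(text):
        # Try the longest candidate first, falling back to one character.
        for width in range(longest, 0, -1):
            piece = text[pos:pos + width]
            if width == 1 or piece in MULTI_CHAR_ELEMENTS:
                elements.append(piece)
                pos += len(piece)
                break
    return elements


print(split_collating_elements("ddd"))           # ['dd', 'd']
print(split_collating_elements("ddd", start=1))  # ['dd']
```

Under these assumptions the second and third characters form an element only when scanning starts at the second character, which is the discrepancy described above.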


Cheers,
Harald van Dijk



Re: More issues with pattern matching

2020-07-30 Thread Harald van Dijk

On 26/09/2019 10:20, Geoff Clare wrote:

Geoff Clare  wrote, on 26 Sep 2019:



Are shells required to support this, and are shells therefore implicitly
required to translate patterns to regular expressions, or should it be okay
to implement this with single character support only?


Shells are required to support it.  They don't need to translate
entire patterns to regular expressions - they can use either
regcomp()+regexec() or fnmatch() to see if the bracket expression
matches the next character.


Sorry, I should have written "matches *at* the next character" here;
I didn't mean to imply checking against a single character.

For example, if using regcomp()+regexec() the shell could try to
match the bracket expression against the remainder of the string and
see how much of it regexec() reported as matching.  To use fnmatch()
I suppose you would have to use it in a loop, passing it first one
character, then two, etc. (stopping at the number of characters
between the '.'s).


As I had replied at the time, it is fundamentally impossible in the 
general case as POSIX does not provide any mechanism to escape 
characters and there is nothing in POSIX that rules out the possibility 
of a collating element containing "=]" or ".]".


However, ignoring that aspect of it, looking at implementing this once 
again, implementing it the way you specified is incorrect, fixing it to 
make it correct cannot possibly be done efficiently with standard 
library support, and shells in general don't bother to implement what 
POSIX specifies here.


Take the previous example glibc's cy_GB.UTF-8 locale, but with a 
different collating element: in this locale, "dd" is a single collating 
element too. Therefore, this must be matchable by bracket expressions. 
However, "d" individually must *also* be matched by pattern expressions. 
"dd" can be matched by both [!x] and [!x][!x]. A shell cannot use 
regcomp()+regexec() to find the longest match for [!x] and assume that 
that is matched: a shell where


  case dd in [!x]d) echo match ;; esac

does not print "match" does not implement what POSIX requires. A shell where

  case dd in [!x]) echo match ;; esac

does not print "match" does not implement what POSIX requires either. 
Using regcomp()+regexec() to bind [!x] to either "d" or "dd" without 
taking the rest of the pattern into account will fail to match in one of 
these cases. And it needn't be the same way for all bracket expressions 
in a single pattern:


  case ddd in [!x][!x]) echo match ;; esac

Shells are required by POSIX to consider both the possibility that [!x] 
picks up "d" and that it picks up "dd" for each bracket expression 
individually. This means that in the worst case, if every bracket 
expression in a pattern has X ways to match, and a pattern has Y bracket 
expressions, the shell is required to consider X^Y possibilities. This 
is completely unreasonable and it's obvious why no shell actually does 
this. The complexity can be reduced in theory, but POSIX does not expose 
enough information to allow that to be implemented in a shell. The only 
way around this mess is by translating the whole pattern to a regular 
expression, as only the C library has enough detailed knowledge about 
the locale that it can implement it efficiently.[*] Doing that has its 
own new set of problems though: translating the whole pattern to a 
regular expression means the shell no longer has the option to decide 
how to handle invalid byte sequences (byte sequences that lead to 
EILSEQ) that shells in general try to tolerate, and the shell no longer 
has the option to decide how to handle invalid patterns (patterns 
containing non-existent character classes or collating elements) which 
shells in general also aim to tolerate.
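
A rough model of the problem, as a sketch only: it assumes a locale whose sole multi-character collating element is "dd", and it handles just literal characters and [!c] brackets. The helper names (parse, candidates, matches) and the commit parameter are hypothetical. With commit=None every candidate element is tried for every bracket, which is the X^Y search described above; "longest" and "shortest" model a matcher that binds each bracket without looking at the rest of the pattern.

```python
MULTI_CHAR_ELEMENTS = {"dd"}  # assumed model locale, not real locale data


def parse(pattern):
    """Parse a tiny pattern subset: literal characters and [!c] brackets."""
    tokens, i = [], 0
    while i < len(pattern):
        if pattern.startswith("[!", i):
            end = pattern.index("]", i + 2)  # raises ValueError if unterminated
            tokens.append(("not", set(pattern[i + 2:end])))
            i = end + 1
        else:
            tokens.append(("lit", pattern[i]))
            i += 1
    return tokens


def candidates(text, pos):
    """Every collating element that could start at text[pos]."""
    found = [text[pos:pos + 1]] if pos < len(text) else []
    found += [e for e in MULTI_CHAR_ELEMENTS if text.startswith(e, pos)]
    return found


def matches(tokens, text, pos=0, commit=None):
    """Match all tokens against text[pos:], backtracking over element widths."""
    if not tokens:
        return pos == len(text)
    kind, value = tokens[0]
    if kind == "lit":
        return (text.startswith(value, pos)
                and matches(tokens[1:], text, pos + 1, commit))
    options = sorted((e for e in candidates(text, pos) if e not in value), key=len)
    if commit == "longest":
        options = options[-1:]   # bind to the longest element only
    elif commit == "shortest":
        options = options[:1]    # bind to the shortest element only
    return any(matches(tokens[1:], text, pos + len(e), commit) for e in options)


print(matches(parse("[!x]d"), "dd"))                    # True
print(matches(parse("[!x]d"), "dd", commit="longest"))  # False
print(matches(parse("[!x]"), "dd", commit="shortest"))  # False
print(matches(parse("[!x][!x]"), "ddd"))                # True
```

In this model neither fixed binding handles both [!x]d and [!x] against "dd"; only the exhaustive search does.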


Cheers,
Harald van Dijk

[*] I have not investigated whether implementations actually do do this 
efficiently.




Re: More issues with pattern matching

2019-09-27 Thread Joerg Schilling
"Schwarz, Konrad"  wrote:

> > -----Original Message-----
> > From: Robert Elz 
>
> > So, is [[:"alpha":]] required to be treated the same as [[:alpha:]] , not 
> > allowed to be treated the same,
> > explicitly unspecified, or simply never considered (previously) ?
>
> An argument for requiring [[:"alpha":]] to be the same as [[:alpha:]] is that 
> it would allow character-class names
> with white space, e.g., "title case".

There is no need to do this since my implementations for [[:alpha:]] first
check for the presence of ":]" and then use the text between [[: and :]] as
the character class name.

I expect other implementations to do the same.
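
One possible reading of that approach, as a sketch rather than the actual implementation: look for the first ":]" after the opening "[:" and take everything in between as the class name. The function name class_name_at is made up for illustration.

```python
def class_name_at(pattern, pos):
    """Return (name, end) if a "[:" ... ":]" pair starts at pos, else None."""
    if not pattern.startswith("[:", pos):
        return None
    close = pattern.find(":]", pos + 2)
    if close == -1:
        return None  # no terminator: "[" and ":" stay ordinary set members
    # end is the index just past the closing ":]"
    return pattern[pos + 2:close], close + 2


print(class_name_at("[[:alpha:]]", 1))       # ('alpha', 10)
print(class_name_at("[[:title case:]]", 1))  # ('title case', 15)
print(class_name_at("[[:alpha]:]]", 1))      # ('alpha]', 11)
print(class_name_at("[[:alpha]", 1))         # None
```

With this scan a name containing white space needs no quoting, and [[:alpha]:]] yields the (invalid) name "alpha]".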

Jörg

-- 
 EMail:jo...@schily.net(home) Jörg Schilling D-13353 Berlin
joerg.schill...@fokus.fraunhofer.de (work) Blog: http://schily.blogspot.com/
 URL: http://cdrecord.org/private/ http://sf.net/projects/schilytools/files/'



Re: More issues with pattern matching

2019-09-27 Thread Geoff Clare
Robert Elz  wrote, on 27 Sep 2019:
>
>   | In the case of [x[:bogus:]], the use of both colons clearly indicates
>   | the intention to use the new character-class feature.  If the name
>   | between the colons is not a valid class name, that is likely due to
>   | an error on the user or application writer's part when typing the name.
> 
> I had been waiting for that argument, it is the only one that is
> half way rational, and supports that position.  But half way is as
> far as it gets.
> 
> POSIX allows locales to define new char classes, it says so, XBD 7.3,
> page 141, lines 4218-4226.
> 
> Since a locale is allowed to define a new char class name, the shell
> (or regcomp() for the RE case) cannot know whether the user here:
> 
>   | For example, if a user types:
>   |
>   | grep '[[:alhpa:]]' file
> 
> made a typo for alpha (the standard posix defined char class), or really
> intended alhpa a locale specific char class in some locale which is not
> the current one.
> 
> Making this some kind of error, in either REs, or shell patterns (whatever
> the effect of that is) makes it impossible for users to ever safely, and
> simply, use the locale specific locale name.
> 
> They cannot even test which locale is in use as (aside from it being impossible
> to be sure which locales have added this new char class to their definitions)
> there's no guarantee that even if we know that LC_CTYPE=EN_dislexic
> contains the alhpa character class, in some implementations, there is no
> sane way to know whether the current implementation does.
> 
> That is, unless you're requiring that before a locale specific char class
> can be used, the user (on the command line) or script, is required to
> query the locale and test whether the char class is defined there or not.
> 
> Requiring that would be absurd.

You might consider it absurd, but it is what the standard requires
applications and users to do in order to avoid "undefined results"
(as per XBD 9.1 under "invalid").  The standard even acknowledges that
applications need to be able to do that, in the APPLICATION USAGE for
the locale utility:

Implementations are not required to write out the actual values
for keywords in the categories LC_CTYPE and LC_COLLATE ; however,
they must write out the categories (allowing an application to
determine, for example, which character classes are available).

In a C program, finding out if a character class name is valid for the
current locale is simply a matter of calling wctype(name) and checking
whether it returns (wctype_t)0.  So applications using fnmatch(), glob()
or regcomp() can do that before using a name that isn't one of the
mandated ones.
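
For illustration, the same check can be sketched from Python by calling the C library's wctype() through ctypes. The assumptions here are a POSIX system where ctypes.CDLL(None) exposes libc, and a wctype_t that fits in an unsigned long (as on glibc); class_is_valid is a made-up helper name.

```python
import ctypes

# Assumption: a POSIX system where the C library is reachable through the
# running process via ctypes.CDLL(None); this does not work on Windows.
libc = ctypes.CDLL(None)
libc.wctype.argtypes = [ctypes.c_char_p]
# Assumption: wctype_t fits in an unsigned long (true on glibc).
libc.wctype.restype = ctypes.c_ulong


def class_is_valid(name):
    """True if wctype() knows this class name in the current LC_CTYPE locale."""
    return libc.wctype(name.encode("ascii")) != 0


for name in ("alpha", "alhpa"):
    print(name, class_is_valid(name))
```

A caller could run such a check before building a pattern that uses a class name outside the mandated set.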

>   | > So, is [[:"alpha":]] required to be treated the same as [[:alpha:]] ,
>   | > not allowed to be treated the same, explicitly unspecified, or simply
>   | > never considered (previously) ?
>   |
>   | I believe the intention is that it be treated the same as [[:alpha:]].
> 
> Good, that is what I would have hoped.   Now maybe we should add something
> to make that explicit.

Yes, I think an addition is warranted. Maybe we should add a new
paragraph to 2.13 (before the 2.13.1 heading) along the lines of:

In the shell, any quoting characters (see [xref to 2.2]) that are
present in a word to be used as a pattern, and are treated as
special, shall participate in pattern matching only through their
effects on other characters; they shall not themselves be treated
as pattern characters. For example:

ls -ld \\*

lists files with names that begin with a single <backslash>,

ls -ld "?"*

lists files with names that begin with a <question-mark>,

ls -ld [[:'alpha':]]*

lists files with names that begin with an alphabetic character in
the current locale, and

ls -ld [[':alpha:']]*

lists files with names that begin with a character from the set
{ '[', ':', 'a', 'l', 'p', 'h' } followed by a ']'.
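
A sketch of how a shell might track this, under simplifying assumptions (no expansions, no line continuation handling): split the word into characters, each flagged as quoted or not, and let the pattern matcher treat only unquoted characters as special. The function mark_quoted is hypothetical and is not taken from any shell.

```python
def mark_quoted(word):
    """Split a shell word into (char, quoted) pairs, dropping quote characters."""
    result = []
    i, in_single, in_double = 0, False, False
    while i < len(word):
        c = word[i]
        if in_single:
            if c == "'":
                in_single = False
            else:
                result.append((c, True))
        elif c == "\\" and i + 1 < len(word):
            nxt = word[i + 1]
            if in_double and nxt not in '$`"\\':
                result.append((c, True))  # backslash stays literal inside "..."
            else:
                result.append((nxt, True))
                i += 1
        elif c == '"':
            in_double = not in_double
        elif c == "'" and not in_double:
            in_single = True
        else:
            result.append((c, in_double))
        i += 1
    return result


def unquoted(word):
    """Characters that may still act as pattern characters."""
    return "".join(c for c, quoted in mark_quoted(word) if not quoted)


print(mark_quoted('"?"*'))         # [('?', True), ('*', False)]
print(unquoted("[[:'alpha':]]*"))  # [[::]]*
print(unquoted("[[':alpha:']]*"))  # [[]]*
```

In the last example both colons are quoted, so no unquoted "[:" remains to introduce a character class, whereas in the second example only the name itself is quoted.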

> 
>   | The word "may" has a strict usage.  See XBD 1.5 - it "Describes a
>   | feature or behavior that is optional for an implementation that
>   | conforms to POSIX.1-2017."
>   |
>   | However, there have been cases in the past where incorrect uses of
>   | "may" have been found and changed to "can".
>   |
>   | In any case, the "shall" in XCU 2.13.1 overrides it.
> 
> Only for shell patterns, we still need to decide whether it was the
> defined "may" or an erroneous use which should be replaced by "can"
> for regular expressions.   Given the shell imperative, and the desire
> to make bracket expressions in sh patterns and REs as equivalent as
> possible, I suspect the latter.

The rationale in XRAT says the opposite (A.9.1):

The ISO POSIX-2:1993 standard required bracket expressions like
"[^[:lower:]]" to match multi-character collating elements such as
"ij". However, this requirement led to behavior that many users
did not expect and that could not feasibly be mimicked in user
code, and it was rarely if ever 

Re: More issues with pattern matching

2019-09-26 Thread Harald van Dijk

On 27/09/2019 02:26, Robert Elz wrote:

 Date:Thu, 26 Sep 2019 22:58:10 +0100
 From:Harald van Dijk 
 Message-ID:  


   | 9.3.5 rule 1:

   | "shall be followed by" is a requirement on applications, is it not?

It is.

   | When that requirement is violated, the regular expression or shell
   | pattern is  undefined,

 From where do you draw that conclusion, I see nothing to that effect.

Rather, when that requirement is violated, what exists is not
a character class (or one of the others).   That is all one can
conclude from that text.


Combine it with what is said about violations, which had been referenced 
in this thread already:



When invalid is not used, violations of the specified syntax or semantics for 
REs produce undefined results: this may entail an error, enabling an extended 
syntax for that RE, or using the construct in error as literal characters to be 
matched.

Cheers,
Harald van Dijk



Re: More issues with pattern matching

2019-09-26 Thread Robert Elz
Date:Thu, 26 Sep 2019 22:58:10 +0100
From:Harald van Dijk 
Message-ID:  


  | 9.3.5 rule 1:

  | "shall be followed by" is a requirement on applications, is it not?

It is.

  | When that requirement is violated, the regular expression or shell
  | pattern is  undefined,

From where do you draw that conclusion, I see nothing to that effect.

Rather, when that requirement is violated, what exists is not
a character class (or one of the others).   That is all one can
conclude from that text.

kre




Re: More issues with pattern matching

2019-09-26 Thread Harald van Dijk

On 26/09/2019 22:12, Robert Elz wrote:

a...@gigawatt.nl said:
   | If this is the whole pattern, then agreed, but if this is only part of the
   | pattern, I am not sure. [[:alpha]:]] is interpreted by many shells (bash,
   | bosh, mksh, zsh) as a character class containing an invalid character class
   | name "alpha]".

The part about the invalid class name is certainly correct, but the
interpretation cannot be, XBD 9.3.5 page 185, lines 6136-6138:

A character class expression is expressed as a character class
name enclosed within bracket- ("[:" and ":]") delimiters.

Since "alpha]" is not (cannot be) a character class name, we do not
have a character class expression at all, as a character class name
is required to exist between the delimiters for a character class
expression to exist.


9.3.5 rule 1:


The character sequences "[.", "[=", and "[:" [...]. These symbols shall be followed by a valid expression and 
the matching terminating sequence ".]", "=]", or ":]", as described in the following items.


"shall be followed by" is a requirement on applications, is it not? When 
that requirement is violated, the regular expression or shell pattern is 
undefined, and interpreting alpha] as an invalid character class name is 
a reasonable result.


Cheers,
Harald van Dijk



Re: More issues with pattern matching

2019-09-26 Thread Robert Elz
Date:Thu, 26 Sep 2019 17:54:21 +0100
From:Geoff Clare 
Message-ID:  <20190926165421.GA32280@lt2.masqnet>

  | In the case of [x[:bogus:]], the use of both colons clearly indicates
  | the intention to use the new character-class feature.  If the name
  | between the colons is not a valid class name, that is likely due to
  | an error on the user or application writer's part when typing the name.

I had been waiting for that argument, it is the only one that is
half way rational, and supports that position.  But half way is as
far as it gets.

POSIX allows locales to define new char classes, it says so, XBD 7.3,
page 141, lines 4218-4226.

Since a locale is allowed to define a new char class name, the shell
(or regcomp() for the RE case) cannot know whether the user here:

  | For example, if a user types:
  |
  | grep '[[:alhpa:]]' file

made a typo for alpha (the standard posix defined char class), or really
intended alhpa a locale specific char class in some locale which is not
the current one.

Making this some kind of error, in either REs, or shell patterns (whatever
the effect of that is) makes it impossible for users to ever safely, and
simply, use the locale specific locale name.

They cannot even test which locale is in use as (aside from it being impossible
to be sure which locales have added this new char class to their definitions)
there's no guarantee that even if we know that LC_CTYPE=EN_dislexic
contains the alhpa character class, in some implementations, there is no
sane way to know whether the current implementation does.

That is, unless you're requiring that before a locale specific char class
can be used, the user (on the command line) or script, is required to
query the locale and test whether the char class is defined there or not.

Requiring that would be absurd.   Disallowing users from using locale specific
char classes even though the locale is free to provide them would be absurd.
A non-absurd outcome is achieved only when unknown char class names are
treated as missing empty class definitions in the current locale.  That
works, is easy, and clean.

And yes, it means that what are really user errors cannot be trivially
diagnosed, but simply produce unexpected results.   But this is far from
the only case where that happens - sh is a very forgiving language, vast
numbers of obvious user errors are allowed to pass undiagnosed, because the
shell cannot know that what the user entered is not actually what they
intended to enter, and preventing genuine work in order to improve error
diagnosis is not the direction that the shell has ever taken.   Regular
expressions are more strictly interpreted, and do have detectable error
cases, so in theory those could give errors for the case you describe
(using the char class in the pattern arg to grep, for example) but the
same arguments as above apply here as well, so that is not a desirable
outcome.  Further, even if it were, it would be difficult to achieve,
given that POSIX has merged the definitions of bracket expressions in
shell patterns and regular expressions into one definition, and we really
want unknown char class names to be usable (if not matching anything) in
shell patterns (as they don't generate error messages if invalid, they
simply match different things - or sometimes nothing - which is even harder
to diagnose than noticing a mistyped class name).

So to:

  | It is not absurd, it makes perfect sense.

I could not disagree more.


  | This is the same reason we added item 8,

I have no problems with item 8, and I understand it (even though I
don't implement it) it simply is not relevant to anything we're currently
talking about.

However:

  | but POSIX was preventing them from behaving in a
  | way that is more useful to the user.

was a very good argument.   So let's adopt the same one for locale defined
class names, and make sure they work well, and are useful to the user.

  | > So, is [[:"alpha":]] required to be treated the same as [[:alpha:]] ,
  | > not allowed to be treated the same, explicitly unspecified, or simply
  | > never considered (previously) ?
  |
  | I believe the intention is that it be treated the same as [[:alpha:]].

Good, that is what I would have hoped.   Now maybe we should add something
to make that explicit.

  | This is the only reasonable conclusion if you consider the similarity to:
  | ls *"a"*

That is a good analogy.

  | Clearly the intention here is that the quotes are not treated as part of
  | the pattern, even though pathname expansion is done before quote removal.

Yes, agreed.   That is what I do, and I wish everyone would do the same.

An interlude before we continue:

konrad.schw...@siemens.com said:
  | An argument for requiring [[:"alpha":]] to be the same as [[:alpha:]]
  | is that it would allow character-class names with white space, e.g.,
  | "title case". 

That would be a nice argument, except that it isn't possible, there actually
is a syntax for 

Re: More issues with pattern matching

2019-09-26 Thread Geoff Clare
Robert Elz  wrote, on 26 Sep 2019:
>
>   | Good point.  I think that this, and the behaviour I described, are
>   | both allowed by the standard.
> 
> If they are, they shouldn't be.
> 
> Before char classes, equiv classes, and collating elements were
> invented, bracket expressions could contain anything (so could
> patterns in general).   That makes it hard to add anything new
> without potentially invalidating previously valid code.
> 
> The solution to that relies upon bracket expressions being sets,
> where while legal, putting an element in the set more than once
> is a waste of time, and accomplishes nothing.
> 
> That's why these new forms are defined only inside bracket expressions,
> and all have the property of a duplicated character in their syntax, that
> is, it isn't just that [: :] looks pretty, whereas [: ] doesn't, it is the
> only way to more or less safely add this new form to patterns.
> 
> So, if we have
> 
>   [[:alpha]
> 
> there is absolutely no question but that this is a bracket expr
> that matches one of the 7 chars
>   [ : a l p h a
> and is in no way any kind of character class reference, whatever it
> looks like its author may have intended, and regardless of what comes
> after it.
> 
> If the standard says any different, or implies different, or even allows
> different, it is simply wrong.

> Now if this kind of "invalid char class" (invalid because the terminating
> : is missing) is to not cause the bracket expression to be invalid, it is
> absurd to believe that the simpler case of an unknown class name could do
> so - simply absurd.

It is not absurd, it makes perfect sense.

In the case of [x[:bogus:]], the use of both colons clearly indicates
the intention to use the new character-class feature.  If the name
between the colons is not a valid class name, that is likely due to
an error on the user or application writer's part when typing the name.

For example, if a user types:

grep '[[:alhpa:]]' file

it is much more useful if grep reports that the RE is invalid, than if it
looks for a [, :, a, l, h, or p character followed by a ].

This is the same reason we added item 8, because utility implementers
recognised that a user typing [:alpha:] is much more likely to be due to
the user forgetting the outer brackets than intending to match one of the
characters in that set, but POSIX was preventing them from behaving in a
way that is more useful to the user.

>   | My point was that ksh93 treats [a"-"b] the same as [a-b] so trying
>   | to test something more specific to do with character classes in ksh93
>   | is not going to yield any useful information.
> 
> Again, sure, and again, not helpful for answering the question asked.
> What buggy implementations happen to do is not really interesting.
> What we want to know here is what the standard says should be done,
> and perhaps also what it should say should be done.
> 
> So, is [[:"alpha":]] required to be treated the same as [[:alpha:]] ,
> not allowed to be treated the same, explicitly unspecified, or simply
> never considered (previously) ?

I believe the intention is that it be treated the same as [[:alpha:]].

This is the only reasonable conclusion if you consider the similarity to:

ls *"a"*

Clearly the intention here is that the quotes are not treated as part of
the pattern, even though pathname expansion is done before quote removal.

>   | My previous reply was based on XBD 9.3.5 item 4, but I have just spotted
>   | that the intro paragraph of 9.3.5 uses the word "may":
> 
> After I saw your updated reply on this, which arrived while I was composing
> my previous message, I also went and looked at the standard, but I looked
> at XCU 2.13.1:
>   The pattern bracket expression also shall match a single
>   collating element.
> So there in the specific to the shell section, we have a "shall".

Good catch.

> Now both of those sections are poorly worded.   In XBD 9.3.5 one might
> interpret it as being "may" because not all bracket expressions match
> collating elements, so it would be absurd to require them to do so.
> 
> That is [abc] matches one of 'a' 'b' or 'c' and no collating elements
> at all, and it would be absurd if the language in 9.3.5 required that
> a specific set of multi-character collating elements shall be matched.
> 
> Or perhaps the "may" there is as you just interpreted it, and means that
> matching multi-char collating elements is optional, even when the
> bracket expression is
>   [[=ch=]]
> 
> Who knows?

The word "may" has a strict usage.  See XBD 1.5 - it "Describes a
feature or behavior that is optional for an implementation that
conforms to POSIX.1-2017."

However, there have been cases in the past where incorrect uses of "may"
have been found and changed to "can".

In any case, the "shall" in XCU 2.13.1 overrides it.

-- 
Geoff Clare 
The Open Group, Apex Plaza, Forbury Road, Reading, RG1 1AX, England



Re: More issues with pattern matching

2019-09-26 Thread Stephane Chazelas
2019-09-26 16:28:27 +, Schwarz, Konrad:
[...]
> POSIX should disallow `:' and `]' in character class names.
[...]

While I would not disagree with that, I don't think POSIX
prevents implementations from using [[:foo:]] for other things
than character classes so that would not be enough to address
concerns here.

In practice, some implementations support [[:<:]] as the
equivalent of the standard ex utility regexp \< operator.

In those implementations, that doesn't even match a character
let alone collating element let alone a class of them. (note
that it's only for [[:<:]], not [x[:<:]] for instance in those
implementations).

One could choose to implement a [[:[<:>]:]] to match on smileys
for instance :-) independently of whether there's a class by
that name.

-- 
Stephane



RE: More issues with pattern matching

2019-09-26 Thread Schwarz, Konrad



> -----Original Message-----
> From: Harald van Dijk 
> Sent: Thursday, September 26, 2019 4:39 PM
> To: austin-group-l@opengroup.org
> Cc: austin-group-l@opengroup.org
> Subject: Re: More issues with pattern matching
> 
> On 26/09/2019 13:13, Robert Elz wrote:
> > So, if we have
> >
> > [[:alpha]
> >
> > there is absolutely no question but that this is a bracket expr that
> > matches one of the 7 chars
> > [ : a l p h a
> > and is in no way any kind of character class reference, whatever it
> > looks like its author may have intended, and regardless of what comes
> > after it.
> >
> > If the standard says any different, or implies different, or even
> > allows different, it is simply wrong.
> 
> If this is the whole pattern, then agreed, but if this is only part of the 
> pattern, I am not sure. [[:alpha]:]]
> is interpreted by many shells (bash, bosh, mksh, zsh) as a character class 
> containing an invalid character class
> name "alpha]". It may also be treated as such in ksh and yash, but as the 
> whole pattern fails to match anything,
> it is hard to tell how exactly they interpret it. The interpretation as "any 
> of the characters in '[:alpha',
> followed by ':]]'" is something I only see in osh and in your shell.

POSIX should disallow `:' and `]' in character class names.





RE: More issues with pattern matching

2019-09-26 Thread Schwarz, Konrad
> -----Original Message-----
> From: Robert Elz 

> So, is [[:"alpha":]] required to be treated the same as [[:alpha:]] , not 
> allowed to be treated the same,
> explicitly unspecified, or simply never considered (previously) ?

An argument for requiring [[:"alpha":]] to be the same as [[:alpha:]] is that 
it would allow character-class names
with white space, e.g., "title case".

Regards

KAS



Re: More issues with pattern matching

2019-09-26 Thread Harald van Dijk

On 26/09/2019 11:43, Geoff Clare wrote:

My previous reply was based on XBD 9.3.5 item 4, but I have just spotted
that the intro paragraph of 9.3.5 uses the word "may":

 A bracket expression ... is an RE that shall match a specific set
 of single characters, and may match a specific set of
 multi-character collating elements, ...

So it appears that it is optional whether matching a bracket expression
against more than one character is supported.


This is a relief! Multi-character collating elements in the patterns may 
still need to be supported, but if they are only required to match 
single characters in the text being matched, that is doable.


Cheers,
Harald van Dijk



Re: More issues with pattern matching

2019-09-26 Thread Harald van Dijk

On 26/09/2019 10:20, Geoff Clare wrote:

Geoff Clare  wrote, on 26 Sep 2019:



Are shells required to support this, and are shells therefore implicitly
required to translate patterns to regular expressions, or should it be okay
to implement this with single character support only?


Shells are required to support it.  They don't need to translate
entire patterns to regular expressions - they can use either
regcomp()+regexec() or fnmatch() to see if the bracket expression
matches the next character.


Sorry, I should have written "matches *at* the next character" here;
I didn't mean to imply checking against a single character.

For example, if using regcomp()+regexec() the shell could try to
match the bracket expression against the remainder of the string and
see how much of it regexec() reported as matching.  To use fnmatch()
I suppose you would have to use it in a loop, passing it first one
character, then two, etc. (stopping at the number of characters
between the '.'s).


Oh, I forgot about fnmatch(), I suppose that is generally an 
alternative. Both regcomp() and fnmatch() have a problem, which is that 
characters cannot be escaped by a backslash as it loses its special 
meaning in bracket expressions, but shell quoting allows arbitrary 
characters to appear. What should the shell do when fed [[=".=]"=]]? How 
is this implementable?


Cheers,
Harald van Dijk



Re: More issues with pattern matching

2019-09-26 Thread Harald van Dijk

On 26/09/2019 13:13, Robert Elz wrote:

So, if we have

[[:alpha]

there is absolutely no question but that this is a bracket expr
that matches one of the 7 chars
[ : a l p h a
and is in no way any kind of character class reference, whatever it
looks like its author may have intended, and regardless of what comes
after it.

If the standard says any different, or implies different, or even allows
different, it is simply wrong.


If this is the whole pattern, then agreed, but if this is only part of 
the pattern, I am not sure. [[:alpha]:]] is interpreted by many shells 
(bash, bosh, mksh, zsh) as a character class containing an invalid 
character class name "alpha]". It may also be treated as such in ksh and 
yash, but as the whole pattern fails to match anything, it is hard to 
tell how exactly they interpret it. The interpretation as "any of the 
characters in '[:alpha', followed by ':]]', is something I only see in 
osh and in your shell.


Cheers,
Harald van Dijk



Re: More issues with pattern matching

2019-09-26 Thread Robert Elz
Date:Thu, 26 Sep 2019 11:43:37 +0100
From:Geoff Clare 
Message-ID:  <20190926104337.GA25231@lt2.masqnet>

  | Good point.  I think that this, and the behaviour I described, are
  | both allowed by the standard.

If they are, they shouldn't be.

Before char classes, equiv classes, and collating elements were
invented, bracket expressions could contain anything (so could
patterns in general).   That makes it hard to add anything new
without potentially invalidating previously valid code.

The solution to that relies upon bracket expressions being sets,
where while legal, putting an element in the set more than once
is a waste of time, and accomplishes nothing.

That's why these new forms are defined only inside bracket expressions,
and all have the property of a duplicated character in their syntax, that
is, isn't just that [: :] looks pretty, whereas [: ] doesn't, it is the
only way to more or less safely add this new form to patterns.

So, if we have

[[:alpha]

there is absolutely no question but that this is a bracket expr
that matches one of the 7 chars
[ : a l p h a
and is in no way any kind of character class reference, whatever it
looks like its author may have intended, and regardless of what comes
after it.

If the standard says any different, or implies different, or even allows
different, it is simply wrong.
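
The whole-pattern case can be checked directly in any shell. What follows is only a sketch: in_set is a hypothetical helper name that does not come from the thread, and it tests [[:alpha] solely as a complete case pattern, which is the situation described above.

```shell
# in_set is a hypothetical helper: it reports whether its argument
# matches the pattern [[:alpha] used as a whole case pattern.
in_set() {
    case $1 in
        [[:alpha]) echo yes ;;
        *)         echo no ;;
    esac
}

# Try each character that appears in the set, plus one that does not.
for word in '[' : a l p h x; do
    printf '%s -> %s\n' "$word" "$(in_set "$word")"
done
```

Under the set reading, every word except x would be expected to report yes. A shell that reports something else is taking a different interpretation of the unterminated [: sequence.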

Now if this kind of "invalid char class" (invalid because the terminating
: is missing) is to not cause the bracket expression to be invalid, it is
absurd to believe that the simpler case of an unknown class name could do
so - simply absurd.

Either the unknown class name means that there is no character class, and
all the text which looks like a character class is really just elements of
the bracket expression, or the unknown class name is treated as probably
being a class in some other locale, which has no members in the current
locale, could be an interpretation which makes sense (though the latter
is more useful, IMO).  Invalidating the bracket expression makes no sense.

  | >   | XBD 9.3.5 item 8 says it is unspecified whether [:bogus:] is treated as
  | >   | a character class, treated as a matching list expression, or rejected
  | >   | as an error.
  | > 
  | > Yes, that is unfortunate, it should be specified that an unknown (but
  | > syntactically valid) class name in a character class is simply to be
  | > treated as a class containing no characters,
  |
  | Item 8 isn't about what's between the ':'s in [[:...:]], it's about
  | an RE that contains [:...:] without the outer pair of square brackets.

Sure.  But as I interpreted Harald's question, to which we are attempting
to reply, things that look like char classes, but are not in a bracket
expression, aren't relevant (nor is 9.3.5 item 8).

The question was entirely about [x[:bogus:]] and [![:bogus:]] so perhaps
we should stick to answering that, and avoid deviating into side issues.

  | My point was that ksh93 treats [a"-"b] the same as [a-b] so trying
  | to test something more specific to do with character classes in ksh93
  | is not going to yield any useful information.

Again, sure, and again, not helpful for answering the question asked.
What buggy implementations happen to do is not really interesting.
What we want to know here is what the standard says should be done,
and perhaps also what it should say should be done.

So, is [[:"alpha":]] required to be treated the same as [[:alpha:]] ,
not allowed to be treated the same, explicitly unspecified, or simply
never considered (previously) ?

  | My previous reply was based on XBD 9.3.5 item 4, but I have just spotted
  | that the intro paragraph of 9.3.5 uses the word "may":

After I saw your updated reply on this, which arrived while I was composing
my previous message, I also went and looked at the standard, but I looked
at XCU 2.13.1:
The pattern bracket expression also shall match a single
collating element.
So there in the specific to the shell section, we have a "shall".

Which means

  | So it appears that it is optional whether matching a bracket expression
  | against more than one character is supported.

perhaps not.

Now both of those sections are poorly worded.   In XBD 9.3.5 one might
interpret it as being "may" because not all bracket expressions match
collating elements, so it would be absurd to require them to do so.

That is [abc] matches one of 'a' 'b' or 'c' and no collating elements
at all, and it would be absurd if the language in 9.3.5 required that
a specific set of multi-character collating elements shall be matched.

Or perhaps the "may" there is as you just interpreted it, and means that
matching multi-char collating elements is optional, even when the
bracket expression is
[[=ch=]]

Who knows?

XCU 2.13.1 is just as badly written, in the opposite direction.  It
(seems to) require every bracket expression to match a collating element.
I doubt that is what it really intends 

Re: More issues with pattern matching

2019-09-26 Thread Geoff Clare
Robert Elz  wrote, on 26 Sep 2019:
>
> So, if bogus is not a valid char class for the locale (and if that is
> treated as meaning the [:...:] is not a character class element of the
> bracket expression), then the bracket expression is
>   [x[:bogus:]
> where all chars between the initial '[' and the terminating ']' are
> simply literal chars.   So this will match one char that is any of
>   : [ b g o s u x
> and the pattern will match a word that starts with one of those 8 chars
> and is followed by a ']' char.

Good point.  I think that this, and the behaviour I described, are
both allowed by the standard.

>   | XBD 9.3.5 item 8 says it is unspecified whether [:bogus:] is treated as
>   | a character class, treated as a matching list expression, or rejected
>   | as an error.
> 
> Yes, that is unfortunate, it should be specified that an unknown (but
> syntactically valid) class name in a character class is simply to be
> treated as a class containing no characters,

Item 8 isn't about what's between the ':'s in [[:...:]], it's about
an RE that contains [:...:] without the outer pair of square brackets.
I.e. it is unspecified whether [:alpha:] is treated as [[:alpha:]],
treated as [:alph], or rejected.

>   | > 1b. Quoted character classes:
> 
>   | Some shells are known not to handle shell quoting correctly in bracket
>   | expressions (in general, not specific to character classes).
> 
> This issue is specific to character classes (and is subtly different
> than equivalence classes and collating symbols, as the syntax of the
> name is defined, so we know quoting is never actually required for it,
> unlike the others ... though I don't really believe that should make
> a difference.
> 
> The question is whether [:"alpha":] is the same as [:alpha:] or not.

My point was that ksh93 treats [a"-"b] the same as [a-b] so trying
to test something more specific to do with character classes in ksh93
is not going to yield any useful information.

>   | > 2a. Multi-character collating symbols and equivalence classes
>   | > 
> 
>   | >   LANG=cy_GB.UTF-8
>   | >   case  ch in  [[=ch=]]) echo x ;; esac # none
>   | >   case  ch in  [[.ch.]]) echo x ;; esac # yash
>   | >   case xch in x[[=ch=]]) echo x ;; esac # yash
> 
>   | Shells are required to support it.  They don't need to translate
>   | entire patterns to regular expressions - they can use either
>   | regcomp()+regexec() or fnmatch() to see if the bracket expression
>   | matches the next character.
(I later corrected this to "matches at the next character")
> 
> The question here relates to "next character" - in the "case ch" where
> the word being matched is "ch" is that one character, or two?   A bracket
> expression matches just one, but an equivalence class may, as I understand
> it, include diphthongs (so u-umlaut and ue might be treated the same, where
> the former is one character, and the latter is two).
> 
> Harald's question is whether shells are required to attempt to match
> such things, rather than just "matches the next character" ?

My previous reply was based on XBD 9.3.5 item 4, but I have just spotted
that the intro paragraph of 9.3.5 uses the word "may":

A bracket expression ... is an RE that shall match a specific set
of single characters, and may match a specific set of
multi-character collating elements, ...

So it appears that it is optional whether matching a bracket expression
against more than one character is supported.

-- 
Geoff Clare 
The Open Group, Apex Plaza, Forbury Road, Reading, RG1 1AX, England



Re: More issues with pattern matching

2019-09-26 Thread Robert Elz
Date:Thu, 26 Sep 2019 09:49:17 +0100
From:Geoff Clare 
Message-ID:  <20190926084917.GA23815@lt2.masqnet>


  | The key here is the way 2.13.1 words the description of '[':
  |
  | If an open bracket introduces a bracket expression as in XBD
  | Section 9.3.5, except [...]. Otherwise, '[' shall match the
  | character itself.
  |
  | (This wording is being improved via bug 985 but that change does not
  | affect how it applies here.)
  |
  | If "bogus" is not a valid character class for the current locale,
  | then the "If" is not satisfied and [x[:bogus:]] is treated as a
  | literal [, a literal x, the bracket expression [:bogus:] and a
  | literal ].

That is most certainly not what would happen.   If an unknown character
class for the current locale is to be treated as invalid (which I think
is an unworkable specification, but might be what the standard currently
allows at least) then the [:bogus:] would not be a character class.

That has no effect on the bracket expression itself, except for where
it terminates.   A bracket expression is simply an opening '[' followed
by an optional ! followed by at least one more character, and then
terminating with a ']' (all unquoted).   I am ignoring the effects of a
leading ^ for this, that's not material here as the example has no ^ in
it at all.

So, if bogus is not a valid char class for the locale (and if that is
treated as meaning the [:...:] is not a character class element of the
bracket expression), then the bracket expression is
[x[:bogus:]
where all chars between the initial '[' and the terminating ']' are
simply literal chars.   So this will match one char that is any of
: [ b g o s u x
and the pattern will match a word that starts with one of those 8 chars
and is followed by a ']' char.

  | XBD 9.3.5 item 8 says it is unspecified whether [:bogus:] is treated as
  | a character class, treated as a matching list expression, or rejected
  | as an error.

Yes, that is unfortunate, it should be specified that an unknown (but
syntactically valid) class name in a character class is simply to be
treated as a class containing no characters, and which consequently
cannot match anything, so that
[x[:bogus:]]
simply matches 'x' and nothing else.   We should, at the very least,
add "future directions" that indicates that the standard will move in
that direction in a later revision.

  | If it is not treated as a matching list, then the "If" in
  | 2.13.1 is again not satisfied and [:bogus:] is treated as a sequence
  | of literal characters.

Not quite, the ']' would be the terminator for the bracket expression.
  |
  | > 1b. Quoted character classes:

  | Some shells are known not to handle shell quoting correctly in bracket
  | expressions (in general, not specific to character classes).

This issue is specific to character classes (and is subtly different
than equivalence classes and collating symbols, as the syntax of the
name is defined, so we know quoting is never actually required for it,
unlike the others ... though I don't really believe that should make
a difference).

The question is whether [:"alpha":] is the same as [:alpha:] or not.
Quoting the characters a l p h a doesn't alter their interpretation
anywhere else, is it reasonable for it to do so here?   This is not
an issue of treating special chars as literals when they are quoted,
as none of them are special, though I guess if we had
IFS=a
we may need to quote "alpha" to avoid [[:alpha:]] being field split into
3 words before pathname expansion gets a chance to interpret it as 
a pattern  seeking files with one char alphabetic names.

Since quote removal has not been performed at the time pathname expansion
is done, and the standard says that until that happens, quote characters
remain in the word, some shells have interpreted this as being a
request to match the class named "alpha" (literally, that is, a 7 char
long name) which is syntactically invalid (class names cannot contain '"'
characters) and therefore not a valid char class, and certainly not the
same as the unquoted form.   We should make it clear that is not the
case (similarly [[:\alph\a:]] [[:'a'lph'a':]] and all other variations)
and these are to be treated the same as the quote removed version for
the purposes of looking up the class name.
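
A small probe can show how a particular shell treats each quoting variation. This is a sketch only: probe is a hypothetical helper name, and apart from the unquoted and fully quoted forms, the thread reports that results differ from shell to shell, so no particular output is implied for the others.

```shell
# probe is a hypothetical helper. $1 is the pattern source text with its
# quoting intact; eval lets the running shell parse that quoting itself.
probe() {
    if eval "case x in $1) true ;; *) false ;; esac" 2>/dev/null; then
        echo "match:    $1"
    else
        echo "no match: $1"
    fi
}

probe '[[:alnum:]]'       # unquoted character class
probe '["[:alnum:]"]'     # whole class quoted: literal characters only
probe '[[:"alnum":]]'     # only the class name quoted
probe '[[:\alnum:]]'      # backslash inside the class name
probe "[[:'a'lph'a':]]"   # one of the variations mentioned above
```

Running the same file under several shells and comparing the output side by side gives a quick picture of where they disagree.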



  | > 2a. Multi-character collating symbols and equivalence classes
  | > 

  | >   LANG=cy_GB.UTF-8
  | >   case  ch in  [[=ch=]]) echo x ;; esac # none
  | >   case  ch in  [[.ch.]]) echo x ;; esac # yash
  | >   case xch in x[[=ch=]]) echo x ;; esac # yash

  | Shells are required to support it.  They don't need to translate
  | entire patterns to regular expressions - they can use either
  | regcomp()+regexec() or fnmatch() to see if the bracket expression
  | matches the next character.

The question here relates to "next character" - in the "case ch" where
the word being matched is "ch" is that one character, or 

Re: More issues with pattern matching

2019-09-26 Thread Geoff Clare
Geoff Clare  wrote, on 26 Sep 2019:
>
> > Are shells required to support this, and are shells therefore implicitly
> > required to translate patterns to regular expressions, or should it be okay
> > to implement this with single character support only?
> 
> Shells are required to support it.  They don't need to translate
> entire patterns to regular expressions - they can use either
> regcomp()+regexec() or fnmatch() to see if the bracket expression
> matches the next character.

Sorry, I should have written "matches *at* the next character" here;
I didn't mean to imply checking against a single character.

For example, if using regcomp()+regexec() the shell could try to
match the bracket expression against the remainder of the string and
see how much of it regexec() reported as matching.  To use fnmatch()
I suppose you would have to use it in a loop, passing it first one
character, then two, etc. (stopping at the number of characters
between the '.'s).
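
The idea of matching the bracket expression at the current position and seeing how much was consumed can be sketched from the shell with expr, whose ":" operator performs an anchored basic regular expression match and prints the number of characters matched. This swaps expr in for regcomp()+regexec() purely for illustration; bracket_len is a hypothetical helper name, and whether a multi-character collating element is matched at all depends on the locale and the C library.

```shell
# bracket_len is a hypothetical helper. $1 is a bracket expression and
# $2 the remaining text; it prints how many leading characters of $2
# the bracket expression matched (0 if none, or if expr rejects it).
bracket_len() {
    # Prefix both operands with X so that text starting with '-' or an
    # expr operator is not misparsed; the X is subtracted again below.
    n=$(expr "X$2" : "X$1" 2>/dev/null) || n=0
    [ "${n:-0}" -gt 0 ] && n=$((n - 1))
    echo "${n:-0}"
}

bracket_len '[ab]' abc          # a single character is consumed
bracket_len '[ab]' xyz          # nothing matches at this position
bracket_len '[[.ch.]]' chwarae  # locale-dependent, e.g. under cy_GB.UTF-8
```

A result greater than 1 for the last call would indicate that the implementation matched a multi-character collating element; a result of 0 means either no match or that the collating symbol was rejected in the current locale.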

-- 
Geoff Clare 
The Open Group, Apex Plaza, Forbury Road, Reading, RG1 1AX, England



Re: More issues with pattern matching

2019-09-26 Thread Geoff Clare
Harald van Dijk  wrote, on 26 Sep 2019:
>
> >Eg:
> >
> > case x in [xabc) echo x;; esac
> >
> >is not "invalid" because  the "bracket expression" has no terminating ']',
> >rather it simply has no bracket expression at all, and fails to match
> >here because it only matches the literal string '[xabc'.
> 
> This is a special exception, a deviation from the regular expression syntax,
> see 2.13.1:
> 
> >If an open bracket introduces a bracket expression as in XBD RE Bracket 
> >Expression, except that the <exclamation-mark> character ( '!' ) shall 
> >replace the <circumflex> character ( '^' ) in its role in a non-matching 
> >list in the regular expression notation, it shall introduce a pattern 
> >bracket expression. A bracket expression starting with an unquoted 
> ><circumflex> character produces unspecified results. Otherwise, '[' shall 
> >match the character itself.
> 
> No such exception has been written for character classes and collating
> elements.

The reference to XBD RE Bracket Expression (9.3.5) applies to the whole
of 9.3.5, which includes the descriptions of what constitutes a valid
character class or collating element.

-- 
Geoff Clare 
The Open Group, Apex Plaza, Forbury Road, Reading, RG1 1AX, England



Re: More issues with pattern matching

2019-09-26 Thread Geoff Clare
Harald van Dijk  wrote, on 25 Sep 2019:
>
> After comparing what my shell does now during pattern matching to what it
> should, I found a few more cases where I do not believe POSIX is clear about
> what is required and where shells are not in agreement. These are not
> related to the backslash handling.

This isn't a complete response to all the points - I'm just noting
some things that I don't think other responders have mentioned.

> 1a. Invalid character classes:
> 
>   case x in [x[:bogus:]]) echo x ;; esac # bash,bosh,mksh,nbsh,osh,zsh
>   case x in [![:bogus:]]) echo x ;; esac # above except osh
> 
> The handling of this in dash, inherited by my shell, is just buggy and
> should be ignored.
> 
> In bash, bosh, mksh, nbsh, zsh, a character does not match an invalid
> character class. In osh, a character neither matches nor fails to match an
> invalid character class, but the pattern is still valid. In yash, the use of
> [:bogus:] renders the whole pattern invalid.
> 
> These all seem reasonable choices. regcomp() would reject the whole pattern
> as an error, and character classes are supposed to behave as they do in
> regular expressions, so I believe yash's behaviour makes the most sense. Is
> that correct?

The key here is the way 2.13.1 words the description of '[':

If an open bracket introduces a bracket expression as in XBD
Section 9.3.5, except [...]. Otherwise, '[' shall match the
character itself.

(This wording is being improved via bug 985 but that change does not
affect how it applies here.)

If "bogus" is not a valid character class for the current locale,
then the "If" is not satisfied and [x[:bogus:]] is treated as a
literal [, a literal x, the bracket expression [:bogus:] and a
literal ].

XBD 9.3.5 item 8 says it is unspecified whether [:bogus:] is treated as
a character class, treated as a matching list expression, or rejected
as an error.  If it is not treated as a matching list, then the "If" in
2.13.1 is again not satisfied and [:bogus:] is treated as a sequence
of literal characters.
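
The competing readings of [x[:bogus:]] predict different matches, so a shell's choice can be observed with a few test words. This is only a sketch: classify is a hypothetical helper name, it assumes a locale that does not define a class named "bogus", and the output is expected to vary between shells.

```shell
# classify is a hypothetical helper: it reports whether its argument
# matches the pattern [x[:bogus:]] in the running shell.
classify() {
    case $1 in
        [x[:bogus:]]) echo "matched: $1" ;;
        *)            echo "not matched: $1" ;;
    esac
}

classify x       # one bracket expression; [:bogus:] is a class with no members
classify 'x]'    # bracket expression [x[:bogus:] of literals, then a literal ]
classify '[xb]'  # literal [, literal x, matching list [:bogus:], literal ]
```

Each comment names the reading under which that word would match. A shell that treats the whole pattern as never matching reports "not matched" for all three.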

> 1b. Quoted character classes:
> 
> Shells agree that quoting disables the recognition of character classes, but
> they disagree on how much quoting disables it.
> 
>   case x in ["[:alnum:]"]) echo x ;; esac # none
>   case x in [[:"alnum:]"]) echo x ;; esac # none
>   case x in [[:"alnum:"]]) echo x ;; esac # ksh, mksh, yash, zsh
>   case x in [[:\alnum:]])  echo x ;; esac # above plus osh
>   case x in [[:"alnum":]]) echo x ;; esac # above plus dash, nbsh
> 
> I believe that as the special characters to indicate a character class are
> "[:" and ":]", the osh behaviour is correct, the character class name is
> allowed to be quoted. Is that correct? The dash/nbsh behaviour, again
> inherited by my shell, is close, but the fact that the type of quoting
> affects how the character class is treated looks like a bug.

Some shells are known not to handle shell quoting correctly in bracket
expressions (in general, not specific to character classes).  I think
this came to light during discussion of bug 1190.  I seem to recall ksh93
being the main culprit, but other shells may have had bugs as well.

> 2. Collating symbols and equivalence classes
> 
> Collating symbols and equivalence classes are less widely implemented.
> 
>   case x in [[.x.]]) echo x ;; esac # bash, ksh, mksh, osh, yash
>   case x in [[=x=]]) echo x ;; esac # same
>   case ä in [[=a=]]) echo x ;; esac # bash, ksh, yash
>   case a in [[=ä=]]) echo x ;; esac # same
> 
> The handling of brackets in pattern matching is defined by reference to RE
> Bracket Expression and no exception has been made for them, so these are
> supposed to be handled in pattern matching as well.
> 
> 2a. Multi-character collating symbols and equivalence classes
> 
> Multi-character support seems impossible to implement portably other than by
> translating patterns to regular expressions as yash does. POSIX does not
> provide any other means to ask the implementation enough information about
> what is supported in the current locale. And when things do get translated
> to regular expressions, it relies on libc support, with glibc behaving
> strangely, but this may just be my limited understanding of how things are
> supposed to work.
> 
>   LANG=cy_GB.UTF-8
>   case  ch in  [[=ch=]]) echo x ;; esac # none
>   case  ch in  [[.ch.]]) echo x ;; esac # yash
>   case xch in x[[=ch=]]) echo x ;; esac # yash
> 
> Are shells required to support this, and are shells therefore implicitly
> required to translate patterns to regular expressions, or should it be okay
> to implement this with single character support only?

Shells are required to support it.  They don't need to translate
entire patterns to regular expressions - they can use either
regcomp()+regexec() or fnmatch() to see if the bracket expression
matches the next character.

> 2b. Invalid collating elements
> 
> As with invalid character classes:
> 
>   case x in [x[.xy.]]) echo x ;; esac # bash, 

Re: More issues with pattern matching

2019-09-25 Thread Harald van Dijk

On 26/09/2019 02:36, Harald van Dijk wrote:
POSIX mentions the possibility of locale-specific character classes and 
they are required to be recognised in regular expressions and therefore 
in shell glob patterns:



In addition, character class expressions of the form:

[:name:]

are recognized in those locales where the name keyword has been given 
a charclass definition in the LC_CTYPE category.


I have not checked which shells implement this correctly. I know mine 
does not. I was assuming a locale that does not define [:bogus:] as a 
character class, but should have specified.


I meant to include here that under a locale that does not define 
[:bogus:] as a character class, I expect [[:bogus:]] to silently not 
match anything, like you. I referred back to it later, but forgot to 
actually write it.


Cheers,
Harald van Dijk



Re: More issues with pattern matching

2019-09-25 Thread Harald van Dijk

On 26/09/2019 01:47, Robert Elz wrote:

 Date:Wed, 25 Sep 2019 22:29:36 +0100
 From:Harald van Dijk 
 Message-ID:  

   | These all seem reasonable choices. regcomp() would reject the whole
   | pattern as an error, and character classes are supposed to behave as
   | they do in regular expressions, so I believe yash's behaviour makes the
   | most sense. Is that correct?

Character classes (and bracket expressions) behave like they do in
regcomps, but in glob patterns there is no such thing as "invalid",
patterns that one might assume to be invalid are in reality patterns
that match something different than looks like might have been intended
at first glance.


The possibility of invalid patterns is explicitly acknowledged in the 
description of patterns ending with an unescaped backslash:



If a pattern ends with an unescaped <backslash>, it is unspecified whether the 
pattern does not match anything or the pattern is treated as invalid.


However, yash treats the patterns I described as not matching anything, 
it does not raise any errors for them. I had not considered the 
possibility of shells raising errors for them and I agree that that is 
not desirable.



So, if one were to decide that [:bogus:] is not a valid character
class, as the name is not valid -- which I think would be a truly poor
choice, as locales are free to define new character classes, and this
approach would make it impossible to ever safely attempt to use such
class ... eg: some languages might (I have no idea if they do or not)
define a character class tonemark (with whatever spelling) to match the
"characters" that indicate the "tone" (which I kind of understand, but
am unable to describe) that is to be used (in Thai there are I believe 5
different words that all sound like "ma" with different tones, and wildly
different meanings, dog, horse, come (I don't know the other two) they
are pronounced with different tones (high, low, rising, falling and normal).
I think there are 7 tones in Vietnamese.

There are glyphs that are written above the consonant (the 'm' in this
case - the Thai 'm' obviously) that indicate which tone (actually take
the base tone implied by the consonant in question and modify it, rather
than being absolute).  Those glyphs are present in written text as a
character following the consonant.

It would be entirely reasonable for a script to look for something
like (using Thai chars for the 'm' and 'a' of course)
m[[:tonemark:]]a
to match any of those words (except the one that has no mark, never
mind, we'd need  ma | m[[:tonemark:]]a ).

If we were to treat this as invalid, that is, generate an error (and would
it be a "compile time" error, or execution time?) just because we happen to
be in some non tone mark using locale, rather than simply not matching, things
get very difficult.


POSIX mentions the possibility of locale-specific character classes and 
they are required to be recognised in regular expressions and therefore 
in shell glob patterns:



In addition, character class expressions of the form:

[:name:]

are recognized in those locales where the name keyword has been given a 
charclass definition in the LC_CTYPE category.


I have not checked which shells implement this correctly. I know mine 
does not. I was assuming a locale that does not define [:bogus:] as a 
character class, but should have specified.



But if that were to happen and the character class is invalid, then the
bracket expression is

[[:tonemark:]

and is the set of any of the characters '[' ':' 't' 'o' ... 'r' 'k'
(and an extra redundant ':' - since it is a set, duplicates are ignored
just as they are in [aaa])

The pattern above would be matched by a word that contains an 'm' followed
by one from that set, followed by a ']' followed by an 'a'.

That's even worse than generating an error.   Character classes really
must be treated as character classes, regardless of whether they're
recognised in the current locale or not.  This should not be unspecified.


Agreed, regardless of what POSIX currently says. This is what I was 
referring to with "The handling of this in dash, inherited by my shell, 
is just buggy and should be ignored." I intend to fix this, but am not 
sure of the details yet.



[...]
   | As with invalid character classes:
   |
   |case x in [x[.xy.]]) echo x ;; esac # bash, ksh, mksh
   |
   | This would be rejected with an error by regcomp(), so rejecting the
   | whole pattern makes most sense to me.

Same as above, no shell pattern is ever rejected, no matter what.


Same as above, by "rejected" I meant that it is treated as a 
never-matching pattern.



Eg:

case x in [xabc) echo x;; esac

is not "invalid" because  the "bracket expression" has no terminating ']',
rather it simply has no bracket expression at all, and fails to match
here because it only matches the literal string '[xabc'.
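
The unterminated-bracket case is easy to confirm. The following is a sketch; literal_bracket is a hypothetical helper name used only for this illustration.

```shell
# literal_bracket is a hypothetical helper: it reports whether its
# argument matches the pattern [xabc, which has no closing ']'.
literal_bracket() {
    case $1 in
        [xabc) echo yes ;;
        *)     echo no ;;
    esac
}

literal_bracket x        # not matched: there is no bracket expression here
literal_bracket '[xabc'  # matched: the pattern is the literal string [xabc
```

This relies on the "Otherwise, '[' shall match the character itself" rule in 2.13.1 rather than on any notion of an invalid pattern.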


This is a special exception, a deviation from the regular expression 

Re: More issues with pattern matching

2019-09-25 Thread Robert Elz
Date:Wed, 25 Sep 2019 22:29:36 +0100
From:Harald van Dijk 
Message-ID:  

  | These all seem reasonable choices. regcomp() would reject the whole 
  | pattern as an error, and character classes are supposed to behave as 
  | they do in regular expressions, so I believe yash's behaviour makes the 
  | most sense. Is that correct?

Character classes (and bracket expressions) behave like they do in
regcomps, but in glob patterns there is no such thing as "invalid",
patterns that one might assume to be invalid are in reality patterns
that match something different than looks like might have been intended
at first glance.

So, if one were to decide that [:bogus:] is not a valid character
class, as the name is not valid -- which I think would be a truly poor
choice, as locales are free to define new character classes, and this
approach would make it impossible to ever safely attempt to use such
class ... eg: some languages might (I have no idea if they do or not)
define a character class tonemark (with whatever spelling) to match the
"characters" that indicate the "tone" (which I kind of understand, but
am unable to describe) that is to be used (in Thai there are I believe 5
different words that all sound like "ma" with different tones, and wildly
different meanings, dog, horse, come (I don't know the other two) they
are pronounced with different tones (high, low, rising, falling and normal).
I think there are 7 tones in Vietnamese.

There are glyphs that are written above the consonant (the 'm' in this
case - the Thai 'm' obviously) that indicate which tone (actually take
the base tone implied by the consonant in question and modify it, rather
than being absolute).  Those glyphs are present in written text as a
character following the consonant.

It would be entirely reasonable for a script to look for something
like (using Thai chars for the 'm' and 'a' of course)
m[[:tonemark:]]a
to match any of those words (except the one that has no mark, never
mind, we'd need  ma | m[[:tonemark:]]a ).

If we were to treat this as invalid, that is, generate an error (and would
it be a "compile time" error, or execution time?) just because we happen to
be in some non tone mark using locale, rather than simply not matching, things
get very difficult.

But if that were to happen and the character class is invalid, then the
bracket expression is

[[:tonemark:]

and is the set of any of the characters '[' ':' 't' 'o' ... 'r' 'k'
(and an extra redundant ':' - since it is a set, duplicates are ignored
just as they are in [aaa])

The pattern above would be matched by a word that contains an 'm' followed
by one from that set, followed by a ']' followed by an 'a'.

That's even worse than generating an error.   Character classes really
must be treated as character classes, regardless of whether they're
recognised in the current locale or not.  This should not be unspecified.

Ignoring "errors" is the way glob matching has always worked, something
that cannot be interpreted as what it appears to be simply means something
different.

Regular expressions are different, they are more formally defined, and
have always had valid, and invalid cases - so what regcomp() happens to
do isn't really relevant to what happens with shell patterns.  They are
entirely different beasts.

  | 1b. Quoted character classes:
  |

  | I believe that as the special characters to indicate a character class 
  | are "[:" and ":]", the osh behaviour is correct, the character class 
  | name is allowed to be quoted.

I agree, there is nothing that suggests it should be otherwise.

  | The dash/nbsh behaviour, 
  | again inherited by my shell, is close, but the fact that the type of 
  | quoting affects how the character class is treated looks like a bug.

I would agree with that, and for nbsh I will take a look (I haven't yet
verified this, but I certainly believe it could be that way) and fix it.

  | 2. Collating symbols and equivalence classes
  |
  | Collating symbols and equivalence classes are less widely implemented.

nbsh is one that doesn't implement them at all.  That's a defect,
they should be supported, but fixing this is kind of low priority
as in practice, nothing seems to use them (other than tests to see
if they are implemented) so don't necessarily expect a resolution
any time soon (it isn't just the shell, I am not sure if there's
any support for those in NetBSD's locale system at all).


  | As with invalid character classes:
  |
  |case x in [x[.xy.]]) echo x ;; esac # bash, ksh, mksh
  |
  | This would be rejected with an error by regcomp(), so rejecting the 
  | whole pattern makes most sense to me.

Same as above, no shell pattern is ever rejected, no matter what.
Eg:

case x in [xabc) echo x;; esac

is not "invalid" because  the "bracket expression" has no terminating ']',
rather it simply has no bracket expression at all, and fails to match
here because it only matches the literal 
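
The unterminated-bracket case can be tried directly. A minimal sketch,
assuming a shell that behaves as described here (one that reports a
pattern error instead would not get this far); unterminated_bracket is a
made-up helper name.

```shell
# unterminated_bracket is a hypothetical helper. With no closing ']',
# the '[' is an ordinary character and the pattern is a plain string.
unterminated_bracket() {
    case $1 in
        [xabc) echo "match" ;;
        *) echo "no match" ;;
    esac
}

unterminated_bracket x        # the example from the message
unterminated_bracket '[xabc'  # the same five characters as the pattern
```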

Re: More issues with pattern matching

2019-09-25 Thread Harald van Dijk

On 26/09/2019 00:18, Shware Systems wrote:
While it may not be mentioned in that thread, P182, L6005 explicitly has 
the blanket "violations ... produce undefined results" that I see could 
apply for bogus names. It would be a semantics error more than a syntax 
one, but language is there.


However, 9.3.5 could also be construed as all other names represent an 
empty set to be unioned with the set of elements for 9.3.5, 6b. as the 
class checked, and if an implementation does not provide any global set 
(since locale definitions have no means to define a per locale set) then 
no match is the required behavior. If a global set is provided, a match 
may occur, but either way all values for name are potentially valid.


This looks like a contradiction between the specification of regular 
expressions and the specification of regcomp().


I agree that the specification for regular expressions does not say that 
unrecognised names are "invalid" and that the failure to say so renders 
the results undefined. At the same time, regcomp() says:



The following constants are defined as the minimum set of error return values, 
although other errors listed as implementation extensions in <regex.h> are 
possible:



REG_ECTYPE
Invalid character class type referenced.


An implementation that allows any bogus name by either not parsing it as 
a character class or by parsing it as a character class that never 
matches any character would never have regcomp() return REG_ECTYPE, but 
by stating that the minimum set of error return values includes 
REG_ECTYPE, the standard requires regcomp() to return REG_ECTYPE for at 
least some patterns.
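
One way to see what a given system does with an unknown class name in a
regular expression is to go through grep, which reports errors with an
exit status greater than 1. This is only a sketch: whether a particular
grep calls regcomp() at all is an implementation detail, and probe_class
is a made-up helper name.

```shell
# probe_class is a hypothetical helper. grep is used only as a convenient
# front end to the system's regular expression handling; it may or may
# not be built on regcomp().
probe_class() {
    printf 'x\n' | grep "[[:$1:]]" >/dev/null 2>&1
    # $? is the exit status of grep, the last command in the pipeline:
    # 0 = a line matched, 1 = nothing matched, >1 = error.
    echo "class=$1 status=$?"
}

probe_class alpha   # a standard class that matches 'x'
probe_class bogus   # an unrecognised class name
```

A status of 1 for the second call would point to an implementation that
treats the unknown class as matching nothing; a status greater than 1
would point to one that rejects the expression.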


Cheers,
Harald van Dijk



Re: More issues with pattern matching

2019-09-25 Thread Harald van Dijk

On 25/09/2019 22:49, Stephane Chazelas wrote:

2019-09-25 22:29:36 +0100, Harald van Dijk:
[...]

1a. Invalid character classes:

   case x in [x[:bogus:]]) echo x ;; esac # bash,bosh,mksh,nbsh,osh,zsh
   case x in [![:bogus:]]) echo x ;; esac # above except osh

[...]

See also

https://www.mail-archive.com/austin-group-l%40opengroup.org/msg02247.html

(and the rest of that thread).


Thanks. This does not cover all of my questions, but does cover some of 
them. I agree with Robert Elz's comment there:



I truly dislike that kind of approach in the standard - particularly if it
is deliberate.   Readers of the text don't know that it is actually
unspecified, as it might be specified somewhere else they haven't
found yet.


The fact that character classes in patterns are defined by reference to 
regular expressions, and in regular expressions they render the whole 
regular expression invalid per regcomp()'s REG_ECTYPE error return 
value, does not appear to be mentioned in that thread. This may change 
the conclusion that the behaviour is implicitly unspecified.


Cheers,
Harald van Dijk



Re: More issues with pattern matching

2019-09-25 Thread Stephane Chazelas
2019-09-25 22:29:36 +0100, Harald van Dijk:
[...]
> 1a. Invalid character classes:
> 
>   case x in [x[:bogus:]]) echo x ;; esac # bash,bosh,mksh,nbsh,osh,zsh
>   case x in [![:bogus:]]) echo x ;; esac # above except osh
[...]

See also

https://www.mail-archive.com/austin-group-l%40opengroup.org/msg02247.html

(and the rest of that thread).

-- 
Stephane