Re: bash "extglob" needs to upgrade at least like zsh "kshglob"

2022-12-01 Thread Koichi Murase
2022年12月2日(金) 0:59 Chet Ramey :
> > Thank you for the discussion and for applying the changes.  Besides, I
> > am sorry that I still have a discussion on the behavior of BRACKMATCH,
> > so it was not the last set of changes.  After the above fix, I moved
> > to check the behavior related to PATSCAN, where I found inconsistent
> > results related to the difference between BRACKMATCH and PATSCAN in
> > parsing the bracket expressions.  I checked also other parts of the
> > codebase and found additional inconsistencies.
>
> I have to say I appreciate your comprehensive and detailed reports.

Thank you for your patience. I also appreciate your careful review and
applying patches every time.

--
Koichi



Re: bash "extglob" needs to upgrade at least like zsh "kshglob"

2022-12-01 Thread Chet Ramey

On 11/28/22 5:51 AM, Koichi Murase wrote:

2022年11月23日(水) 5:24 Chet Ramey :

I attached the latest patch against bash-5.2.9.




commit 3c9dd4565792bc53de3a94ec38a65a1989f3fe2f (upstream/devel)

 associative array elements; last set of changes to globbing
 bracket expressions; fix for timing subshell commands


Thank you for the discussion and for applying the changes.  Besides, I
am sorry that I still have a discussion on the behavior of BRACKMATCH,
so it was not the last set of changes.  After the above fix, I moved
to check the behavior related to PATSCAN, where I found inconsistent
results related to the difference between BRACKMATCH and PATSCAN in
parsing the bracket expressions.  I checked also other parts of the
codebase and found additional inconsistencies.


Thanks for the detailed analysis. I applied these patches.

Chet

--
``The lyf so short, the craft so long to lerne.'' - Chaucer
 ``Ars longa, vita brevis'' - Hippocrates
Chet Ramey, UTech, CWRUc...@case.eduhttp://tiswww.cwru.edu/~chet/




Re: bash "extglob" needs to upgrade at least like zsh "kshglob"

2022-12-01 Thread Chet Ramey

On 11/28/22 5:51 AM, Koichi Murase wrote:

2022年11月23日(水) 5:24 Chet Ramey :

I attached the latest patch against bash-5.2.9.




commit 3c9dd4565792bc53de3a94ec38a65a1989f3fe2f (upstream/devel)

 associative array elements; last set of changes to globbing
 bracket expressions; fix for timing subshell commands


Thank you for the discussion and for applying the changes.  Besides, I
am sorry that I still have a discussion on the behavior of BRACKMATCH,
so it was not the last set of changes.  After the above fix, I moved
to check the behavior related to PATSCAN, where I found inconsistent
results related to the difference between BRACKMATCH and PATSCAN in
parsing the bracket expressions.  I checked also other parts of the
codebase and found additional inconsistencies.


I have to say I appreciate your comprehensive and detailed reports.

--
``The lyf so short, the craft so long to lerne.'' - Chaucer
 ``Ars longa, vita brevis'' - Hippocrates
Chet Ramey, UTech, CWRUc...@case.eduhttp://tiswww.cwru.edu/~chet/




Re: bash "extglob" needs to upgrade at least like zsh "kshglob"

2022-11-28 Thread Koichi Murase
2022年11月28日(月) 19:51 Koichi Murase :
> The fix for PATSCAN (C) is included in the second patch
> [r0037.patscan1.parse_subbracket.patch.txt], which solves "Repeat-By 1
> and 4" by using PARSE_SUBBRACKET of the first patch. [...]

Sorry, there was an oversight in the second patch
[r0037.patscan1.parse_subbracket.patch.txt]. We need to also update
the function signatures of glob_patscan{,_wc} (expanded from PATSCAN)
in other files, gmisc.c and glob.c. I attach another patch
[r0037.patscan3.extern.patch.txt] for additional fixes to the second
patch, where I passed 0 as FLAGS so the original behavior is
preserved.

--
Koichi
From ddc72aeb2cd6962e85bdc0819419db3d8cb6576d Mon Sep 17 00:00:00 2001
From: Koichi Murase 
Date: Mon, 28 Nov 2022 21:42:14 +0900
Subject: [PATCH 1/2] fix(lib/glob/glob): update uses of glob_patscan

---
 lib/glob/glob.c  | 12 ++--
 lib/glob/gmisc.c |  4 ++--
 2 files changed, 8 insertions(+), 8 deletions(-)

diff --git a/lib/glob/glob.c b/lib/glob/glob.c
index b66af85c..686d0f6b 100644
--- a/lib/glob/glob.c
+++ b/lib/glob/glob.c
@@ -127,8 +127,8 @@ static int glob_testdir PARAMS((char *, int));
 static char **glob_dir_to_array PARAMS((char *, char **, int));
 
 /* Make sure these names continue to agree with what's in smatch.c */
-extern char *glob_patscan PARAMS((char *, char *, int));
-extern wchar_t *glob_patscan_wc PARAMS((wchar_t *, wchar_t *, int));
+extern char *glob_patscan PARAMS((char *, char *, int, int));
+extern wchar_t *glob_patscan_wc PARAMS((wchar_t *, wchar_t *, wint_t, int));
 
 /* And this from gmisc.c/gm_loop.c */
 extern int wextglob_pattern_p PARAMS((wchar_t *));
@@ -207,7 +207,7 @@ extglob_skipname (pat, dname, flags)
   wild = *pat == '*' || *pat == '?';
   pp = pat + 2;
   se = pp + strlen (pp);   /* end of pattern string */
-  pe = glob_patscan (pp, se, 0);   /* end of extglob pattern */
+  pe = glob_patscan (pp, se, 0, 0);/* end of extglob pattern */
 
   /* if pe == 0, this is an invalid extglob pattern */
   if (pe == 0)
@@ -234,7 +234,7 @@ extglob_skipname (pat, dname, flags)
   nullpat = pe >= (pat + 2) && pe[-2] == '(' && pe[-1] == ')';
 
   /* check every subpattern */
-  while (t = glob_patscan (pp, pe, '|'))
+  while (t = glob_patscan (pp, pe, '|', 0))
 {
   /* If T == PE and *T == 0 (&& PE[-1] == RPAREN), we have hit the end
 of a pattern with no trailing characters. */
@@ -358,7 +358,7 @@ wextglob_skipname (pat, dname, flags)
   wild = *pat == L'*' || *pat == L'?';
   pp = pat + 2;
   se = pp + wcslen (pp);
-  pe = glob_patscan_wc (pp, se, 0);
+  pe = glob_patscan_wc (pp, se, 0, 0);
 
   /* if pe == 0, this is an invalid extglob pattern */
   if (pe == 0)
@@ -382,7 +382,7 @@ wextglob_skipname (pat, dname, flags)
   nullpat = pe >= (pat + 2) && pe[-2] == L'(' && pe[-1] == L')';
 
   /* check every subpattern */
-  while (t = glob_patscan_wc (pp, pe, '|'))
+  while (t = glob_patscan_wc (pp, pe, '|', 0))
 {
   n = t[-1];   /* ( */
   if (wextglob_pattern_p (pp) && n == L')')/* nested 
extglob? */
diff --git a/lib/glob/gmisc.c b/lib/glob/gmisc.c
index f3d74cea..24fdf746 100644
--- a/lib/glob/gmisc.c
+++ b/lib/glob/gmisc.c
@@ -38,7 +38,7 @@
 #include "glob.h"
 
 /* Make sure these names continue to agree with what's in smatch.c */
-extern char *glob_patscan PARAMS((char *, char *, int));
+extern char *glob_patscan PARAMS((char *, char *, int, int));
 
 /* Compile `gm_loop.c' for single-byte characters. */
 #define CHAR   char
@@ -92,7 +92,7 @@ glob_dirscan (pat, dirsep)
{
  if (se == 0)
se = p + strlen (p) - 1;
- pe = glob_patscan (p + 2, se, 0);
+ pe = glob_patscan (p + 2, se, 0, 0);
  if (pe == 0)
continue;
  else if (*pe == 0)
-- 
2.37.2



Re: bash "extglob" needs to upgrade at least like zsh "kshglob"

2022-11-28 Thread Koichi Murase
2022年11月23日(水) 5:24 Chet Ramey :
> I attached the latest patch against bash-5.2.9.


> commit 3c9dd4565792bc53de3a94ec38a65a1989f3fe2f (upstream/devel)
>
> associative array elements; last set of changes to globbing
> bracket expressions; fix for timing subshell commands

Thank you for the discussion and for applying the changes.  Besides, I
am sorry that I still have a discussion on the behavior of BRACKMATCH,
so it was not the last set of changes.  After the above fix, I moved
to check the behavior related to PATSCAN, where I found inconsistent
results related to the difference between BRACKMATCH and PATSCAN in
parsing the bracket expressions.  I checked also other parts of the
codebase and found additional inconsistencies.


Description
---

First, let me introduce the symbols (A)..(D) to later reference the
implementations of the bracket expression in the codebase.  There are
four independent codes that implement rules for extracting the bracket
expression in the current codebase:

- (A) The main loop of BRACKMATCH: This handles sub-expressions of a
  bracket expression when a matching sub-expression is not found.

- (B) The section of the `matched' label in BRACKMATCH: This handles
  sub-expressions of the bracket expression after a matching
  sub-expression is found.

- (C) PATSCAN: This skips bracket expressions to determine the end of
  the extglob constructs of the form @(...), ?(...), +(...), etc.

- (D) MATCHLEN (lib/glob/gm_loop.c): This function handles bracket
  expressions to count the number of characters that a fixed-length
  pattern can match.

Actually, each of the four implements a distinct rule, which does not
match any of the other three.  These implementations need to be
adjusted to support an identical and consistent rule.


Repeat-By
-

The differences between (A)..(D) cause various strange behaviors.

1. Strange behavior caused by an inconsistency between (A/B) and (C)

  This is what I was first faced with.  The following shows the result
  of [example4.sh] with the current devel, where column 4
  `{yes,no}/{yes,no}' shows `(result)/(what I expect)':

  --- PATSCAN vs BRACKMATCH ---
  #1: pat=@([[.].])A]) str=]no/yes
  #2: pat=@([[.].])A]) str===]A])   no/no
  #3: pat=@([[.].])A]) str=AA]) yes/no
  #4: pat=@([[=]=])A]) str=]no/no
  #5: pat=@([[=]=])A]) str===]A])   no/yes
  #6: pat=@([[=]=])A]) str=AA]) yes/no

  Obvious strange behaviors can be found in cases #3 and #6, where `A'
  matches twice even if there is only one `A' and no repetition such
  as `*()' or `+()' in the pattern.  This is because PATSCAN (C)
  considers the bracket expression to be `[[.].]' while BRACKMATCH
  (A/B) considers the bracket expression to be `[[.].])A]'.  First,
  PATSCAN extracts `@([[.].])', but BRACKMATCH next matches the first
  `A' in the input string using a pattern character `A' outside the
  construct `@()'.  Finally, the remaining part `A])' in the pattern
  is matched literally.

2. Inconsistency between (A) and (B):

  To fix the above item for (A/B) vs (C), I checked the detailed
  behaviors of both and found this.  The parsing of [.ch.], [=a=], and
  [:var:] are not totally consistent before and after a matching
  sub-expression is found.  The following is the second section of the
  result of [example4.sh]:

  --- BRACKMATCH: after match vs before match ---
  #7: pat=[[=]=]ab]str=ayes/no
  #8: pat=[[.[=.]ab]   str=ayes/yes
  #9: pat=[[.[==].]ab] str=ayes/yes

  #10: pat=[a[=]=]b]str=ano/no
  #11: pat=[a[.[=.]b]   str=ano/yes
  #12: pat=[a[.[==].]b] str=ano/yes

  #13: pat=[a[=]=]b]str=byes/no
  #14: pat=[a[=]=]b]str=a=]b]yes/yes
  #15: pat=[a[.[=.]b]   str=byes/yes
  #16: pat=[a[.[=.]b]   str=ab]  yes/no
  #17: pat=[a[.[==].]b] str=byes/yes
  #18: pat=[a[.[==].]b] str=ab]  yes/no

  Cases #7..#9 succeeds, which means that `[=]=]', `[.[=.]', and
  `[.[==].]' form an equivalence class and collating symbols in
  BRACKMATCH (A).  However, cases #10..#12 (which are the bracket
  expressions of the same sub-expression with different ordering)
  fail, which means that `[=]=]', `[.[=.]', and `[.[==].]'  do not
  form an equivalence class or a collating symbol in BRACKMATCH (B).

  Also, cases #13 vs #14, #15 vs #16, and #17 vs #18 demonstrate that
  the same pattern consisting of bracket expressions and normal
  characters can match different numbers of characters.  This means
  that the boundary of the bracket expression can change depending on
  the input string.

  Actually, these patterns are undefined by the standard because an
  equivalence class shall 

Re: bash "extglob" needs to upgrade at least like zsh "kshglob"

2022-11-22 Thread Chet Ramey

On 11/20/22 7:50 AM, Koichi Murase wrote:


   This difference is caused because the slash after the backslash is
   only checked after a matching character is found
   (lib/glob/sm_loop.c:703).  The same check should be applied also
   before a matching character is found (lib/glob/sm_loop.c:573).  I
   attach a patch for this [r0037.brackmatch6.remaining-slash.patch].


I agree with this; it was an inadvertent omission.



There is another related inconsistency.  I just modified my new
extglob engine to follow Bash's choice described above, but then the
behavior became different from that of the actual implementation of
Bash of the current devel.


"If a pattern ends with an unescaped , it is unspecified whether
the pattern does not match anything or the pattern is treated as invalid."

[...]

b. If this is the behavior for the unescaped backslashes outside the
   bracket expressions, which is intensionally different from those in
   the bracket expressions, would it be possible to change the
   treatment of the unescaped backslashes inside the bracket
   expression the same as that of outside so the bracket `[' matches
   literally (as expected in cases #28..#31 of my previous reply [1])?
   The attached [r0037.brackmatch7.unescaped-backslash-option-b.patch]
   is the corresponding patch.


I have changed my mind, and I agree with this. I think the latest draft
of the POSIX standard requires it unambiguously:

"A  that does not introduce a valid bracket expression
shall match the character itself."

and while it does not say so explicitly (the description defers to the
regular expression description, which describes valid bracket expressions
as having a closing bracket), I think it's reasonable to conclude that an
incomplete bracket expression without a `]' is invalid, this text from the
regular expression description aside:

"An expression containing a '[' that is unescaped and is not part of a
bracket expression produces undefined results."

since the text in "Patterns Matching a Single Character" supersedes it.

I attached the latest patch against bash-5.2.9.

Chet
--
``The lyf so short, the craft so long to lerne.'' - Chaucer
 ``Ars longa, vita brevis'' - Hippocrates
Chet Ramey, UTech, CWRUc...@case.eduhttp://tiswww.cwru.edu/~chet/
*** ../bash-5.2-patched/lib/glob/sm_loop.c  2021-08-03 10:24:49.0 
-0400
--- lib/glob/sm_loop.c  2022-11-22 14:51:12.0 -0500
***
*** 344,347 
--- 344,352 
  return (FNM_NOMATCH);
  
+   /* If we are matching pathnames, we can't match a slash with a
+  bracket expression. */
+   if (sc == L('/') && (flags & FNM_PATHNAME))
+ return (FNM_NOMATCH);
+ 
/* `?' cannot match `.' or `..' if it is the first character of the
   string or if it is the first character following a slash and
***
*** 404,407 
--- 409,414 
  }
  
+ #define SLASH_PATHNAME(c) (c == L('/') && (flags & FNM_PATHNAME))
+ 
  /* Use prototype definition here because of type promotion. */
  static CHAR *
***
*** 452,455 
--- 459,468 
  pc = FOLD (p[1]);
  p += 4;
+ 
+ /* Finding a slash in a bracket expression means you have to
+match the bracket as an ordinary character (see below). */
+ if (pc == L('/') && (flags & FNM_PATHNAME))
+   return ((test == L('[')) ? savep : (CHAR *)0); /*]*/
+ 
  if (COLLEQUIV (test, pc))
{
***
*** 464,467 
--- 477,484 
  if (c == L('\0'))
return ((test == L('[')) ? savep : (CHAR *)0); /*]*/
+ else if (c == L('/') && (flags & FNM_PATHNAME))
+   return ((test == L('[')) ? savep : (CHAR *)0); /*]*/
+ else if (c == L(']'))
+   break;
  c = FOLD (c);
  continue;
***
*** 476,484 
  pc = 0;   /* make sure invalid char classes don't match. */
  /* Find end of character class name */
! for (close = p + 1; *close != '\0'; close++)
if (*close == L(':') && *(close+1) == L(']'))
  break;
  
! if (*close != L('\0'))
{
  ccname = (CHAR *)malloc ((close - p) * sizeof (CHAR));
--- 493,501 
  pc = 0;   /* make sure invalid char classes don't match. */
  /* Find end of character class name */
! for (close = p + 1; *close != '\0' && SLASH_PATHNAME(*close) == 0; 
close++)
if (*close == L(':') && *(close+1) == L(']'))
  break;
  
! if (*close != L('\0') && SLASH_PATHNAME(*close) == 0)
{
  ccname = (CHAR *)malloc ((close - p) * sizeof (CHAR));
***
*** 527,530 
--- 544,549 
  if (c == L('\0'))
return ((test == L('[')) ? savep : (CHAR *)0);
+ else if (c == L('/') && (flags & 

Re: bash "extglob" needs to upgrade at least like zsh "kshglob"

2022-11-20 Thread Koichi Murase
2022年11月18日(金) 2:11 Chet Ramey :
> "If a pattern ends with an unescaped , it is unspecified whether
> the pattern does not match anything or the pattern is treated as invalid."
>
> Bash uses the former interpretation. If "the pattern is treated as invalid"
> means trying to literally match the open bracket and going on from there,
> your interpretation is valid as well. The standard doesn't use that
> language in other places it specifies to treat the bracket as an ordinary
> character to be matched literally, however.

There seem to be still remaining issues.  It is fine for me if Bash
chooses the former, ``the pattern does not match anything'' with a
backslash followed by NUL, but the following cases (see the attached
[reduced3.sh]) with a backslash followed by a slash should still be
fixed:

  #1: pat=a[b\/c]  str=a[b/c]   no/yes
  #2: pat=a[b\/c]  str=ab   no/no
  #3: pat=a[b\/c]  str=ac   yes/no
  [...]

Where the fourth column  shows the result of the current
devel 407d9afc with FNM_PATHNAME (xxx) and the result I expect
(yyy). "yes" means the pattern matches the string, and "no" means the
pattern does not match.

* I expect "yes" for #1 because the bracket expression contains a
  slash before its closing right bracket `]' and thus the beginning
  `[' should be matched literally.  However, the actual behavior is
  "no".

* I expect "no" for both #2 and #3 because the beginning bracket `['
  should be matched literally.  Even when an escaped slash would be
  allowed in the bracket expression so that [b\/c] forms a complete
  bracket expression, the results of #2 and #3 being "no" and "yes",
  respectively, are inconsistent.

  This difference is caused because the slash after the backslash is
  only checked after a matching character is found
  (lib/glob/sm_loop.c:703).  The same check should be applied also
  before a matching character is found (lib/glob/sm_loop.c:573).  I
  attach a patch for this [r0037.brackmatch6.remaining-slash.patch].

--

There is another related inconsistency.  I just modified my new
extglob engine to follow Bash's choice described above, but then the
behavior became different from that of the actual implementation of
Bash of the current devel.

> "If a pattern ends with an unescaped , it is unspecified whether
> the pattern does not match anything or the pattern is treated as invalid."
>
> Bash uses the former interpretation.

The corresponding sentence in the POSIX standard describes the
unescaped backslashes in the general context of the pattern instead of
that in the bracket expression, so I applied this to the new extglob
engine.  However, ``the former interpretation'' that Bash adopts
turned out to be only applied to the unescaped backslashes *inside a
bracket expression*.  This is the remaining part of the output of the
attached [example3.sh] with the current devel 407d9afc:

  [...]
  #4: pat=a\   str=a\   yes/???

So the pattern terminated with unescaped backslash actually matches a
string, where the backslash is treated as a literally-matching
backslash.

a. Is this difference between outside and inside of the bracket
  expressions intensional? I.e., the former interpretation "the
  pattern does not match anything" seems to only apply to the inside
  of bracket expressions.

b. If this is the behavior for the unescaped backslashes outside the
  bracket expressions, which is intensionally different from those in
  the bracket expressions, would it be possible to change the
  treatment of the unescaped backslashes inside the bracket
  expression the same as that of outside so the bracket `[' matches
  literally (as expected in cases #28..#31 of my previous reply [1])?
  The attached [r0037.brackmatch7.unescaped-backslash-option-b.patch]
  is the corresponding patch.

  [1] https://lists.gnu.org/archive/html/bug-bash/2022-11/msg00070.html

c. If the behavior of the unescaped backslash of the outside should
  also be modified to follow the former interpretation "the pattern
  does not match anything", another patch is
  [r0037.brackmatch7.unescaped-backslash-option-c.patch].  However,
  the current behavior outside the bracket expression seems to be
  explicitly required by the tests on tests/glob2.sub:32 and
  tests/glob2.sub:41.

I prefer option b, which keeps the behavior required by
tests/glob2.sub and also consistent between the inside and the outside
of bracket expressions.  It is also consistent with the behavior for
the string end inside bracket expressions.

--
Koichi
From 828b93de72263785d93f86a285d919fdc5be156d Mon Sep 17 00:00:00 2001
From: Koichi Murase 
Date: Sun, 20 Nov 2022 16:09:09 +0900
Subject: [PATCH 1/2] fix(BRACKMATCH): fix remaining slash check

---
 lib/glob/sm_loop.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/lib/glob/sm_loop.c b/lib/glob/sm_loop.c
index fa350daa..151d10bd 100644
--- 

Re: bash "extglob" needs to upgrade at least like zsh "kshglob"

2022-11-18 Thread Koichi Murase
2022年11月18日(金) 2:11 Chet Ramey :
> "If a pattern ends with an unescaped , it is unspecified whether
> the pattern does not match anything or the pattern is treated as invalid."
>
> Bash uses the former interpretation. If "the pattern is treated as invalid"
> means trying to literally match the open bracket and going on from there,
> your interpretation is valid as well. The standard doesn't use that
> language in other places it specifies to treat the bracket as an ordinary
> character to be matched literally, however.

Thank you for the explanation. I haven't been considering the
consistency with the treatment of the slashes outside the bracket
expression. I think I need to adjust the treatment of the slash in my
new engine by introducing e.g. a special flag for . Thank you.

--
Koichi



Re: bash "extglob" needs to upgrade at least like zsh "kshglob"

2022-11-18 Thread Koichi Murase
2022年11月18日(金) 6:29 Chet Ramey :
> The same is true for ranges. If either the first character or the second
> character in the range is a slash it should cause the bracket to be matched
> literally.
>
> Incidentally, I made an additional change so that an incomplete range
> expression causes the bracket to be matched literally.
>
> Thanks for the discussion. I attached an updated version of the patch I
> sent yesterday. I wasn't quite as aggressive as you with macros.

Thank you for the new version of the patch. I have tried it. Thank you
for considering the change to the incomplete range in the bracket
expression.

--
Koichi



Re: bash "extglob" needs to upgrade at least like zsh "kshglob"

2022-11-17 Thread Chet Ramey

On 11/17/22 6:05 AM, Koichi Murase wrote:


Thank you for the patch.  I applied it locally and tried it.  I attach
a test script that I used: [bracket-slash.sh].  Now I switched to a
loadable builtin that directly calls strmatch instead of hacking
GLOBIGNORE.  Here are the results:

   ---Tests for a slash in bracket expressions---
   #1: pat=ab[/]ef  str=ab[/]ef  yes/yes
   #2: pat=ab[/]ef  str=ab/efno/no
   #3: pat=ab[c/d]efstr=ab[c/d]efyes/yes
   #4: pat=ab[c/d]efstr=abcefyes/no
   #5: pat=ab[.-/]efstr=ab[.-/]efno/yes
   #6: pat=ab[.-/]efstr=ab.efyes/no
   #7: pat=ab[[=/=]]ef  str=ab[[=/=]]ef  yes/yes
   #8: pat=ab[[=/=]]ef  str=ab/efno/no
   #9: pat=ab[[=c=]/]ef str=ab[=/]ef yes/yes
   #10: pat=ab[[=c=]/]ef str=abcefyes/no
   #11: pat=ab[[:alpha:]/]ef str=ab[:/]ef yes/yes
   #12: pat=ab[[:alpha:]/]ef str=abxefyes/no
   #13: pat=ab[/[abc]]ef str=ab[/c]ef yes/yes
   #14: pat=ab[/[abc]]ef str=abc]ef   no/no
   #15: pat=ab[c[=/=]]ef str=ab[c[=/=]]ef yes/yes
   #16: pat=ab[c[=/=]]ef str=abc[=/=]ef   no/no
   #17: pat=ab[c[=/=]]ef str=abcefyes/no
   ---Tests for incomplete bracket expressions---
   #18: pat=ab[c str=ab[c yes/yes
   #19: pat=ab[c str=abc  no/no
   #20: pat=ab[c[=d= str=ab[c[=d= yes/yes
   #21: pat=ab[c[=d= str=abc  no/no
   #22: pat=ab[c[.d  str=ab[c[.d  yes/yes
   #23: pat=ab[c[.d  str=abc  no/no
   #24: pat=ab[c[:alpha: str=ab[c[:alpha: yes/yes
   #25: pat=ab[c[:alpha: str=abc  no/no
   #26: pat=ab[c-str=ab[c-no/yes
   #27: pat=ab[c-str=abc  no/no
   #28: pat=ab[c\str=ab[c\no/yes
   #29: pat=ab[c\str=abc  no/no
   #30: pat=ab[[\str=ab[[\no/yes
   #31: pat=ab[[\str=ab[  no/no

"yes" and "no" on the left of / in the fourth column are the results
after applying the patch you provided. "yes" and "no" on the right of
/ are the results that *I* expect (see below paragraphs).

The new treatment seems to only handle a slash that directly appears
as an independent character in bracket expressions (cases
#1,#2,#13,#14), but if I literally read the standard you quoted, I
feel we should also handle other slashes (#3..#6,#9..#12,#15..#17).


Yes, I was undecided at the time whether character classes and equivalence
classes should have the same treatment or fall back to invalid classes. I
think your interpretation there is the right one.

The same is true for ranges. If either the first character or the second
character in the range is a slash it should cause the bracket to be matched
literally.

Incidentally, I made an additional change so that an incomplete range
expression causes the bracket to be matched literally.

Thanks for the discussion. I attached an updated version of the patch I
sent yesterday. I wasn't quite as aggressive as you with macros.

Chet

--
``The lyf so short, the craft so long to lerne.'' - Chaucer
 ``Ars longa, vita brevis'' - Hippocrates
Chet Ramey, UTech, CWRUc...@case.eduhttp://tiswww.cwru.edu/~chet/
*** ../bash-5.2-patched/lib/glob/sm_loop.c  2021-08-03 10:24:49.0 
-0400
--- lib/glob/sm_loop.c  2022-11-17 15:42:42.0 -0500
***
*** 344,347 
--- 344,352 
  return (FNM_NOMATCH);
  
+   /* If we are matching pathnames, we can't match a slash with a
+  bracket expression. */
+   if (sc == L('/') && (flags & FNM_PATHNAME))
+ return (FNM_NOMATCH);
+ 
/* `?' cannot match `.' or `..' if it is the first character of the
   string or if it is the first character following a slash and
***
*** 404,407 
--- 409,414 
  }
  
+ #define SLASH_PATHNAME(c) (c == L('/') && (flags & FNM_PATHNAME))
+ 
  /* Use prototype definition here because of type promotion. */
  static CHAR *
***
*** 452,455 
--- 459,468 
  pc = FOLD (p[1]);
  p += 4;
+ 
+ /* Finding a slash in a bracket expression means you have to
+match the bracket as an ordinary character (see below). */
+ if (pc == L('/') && (flags & FNM_PATHNAME))
+   return ((test == L('[')) ? savep : (CHAR *)0); /*]*/
+ 
  if (COLLEQUIV (test, pc))
{
***
*** 464,467 
--- 477,484 
  if (c == L('\0'))
return ((test == L('[')) ? savep : (CHAR *)0); /*]*/
+ else if (c == L('/') && (flags & FNM_PATHNAME))
+   return ((test == L('[')) ? savep : (CHAR *)0); /*]*/
+ else if (c == L(']'))
+   break;
 

Re: bash "extglob" needs to upgrade at least like zsh "kshglob"

2022-11-17 Thread Chet Ramey

On 11/17/22 6:05 AM, Koichi Murase wrote:


Also, I have a suggestion of changes for more consistent handling of
incomplete bracket expressions.  Currently, incomplete bracket
expressions sometimes fail unconditionally (#26..#31) and sometimes
fall back to a literal `[' plus remaining pattern (#18..#25).  I
attach another patch for this:
[r0037.brackmatch5.incomplete-bracket.patch.txt].


"If a pattern ends with an unescaped , it is unspecified whether
the pattern does not match anything or the pattern is treated as invalid."

Bash uses the former interpretation. If "the pattern is treated as invalid"
means trying to literally match the open bracket and going on from there,
your interpretation is valid as well. The standard doesn't use that
language in other places it specifies to treat the bracket as an ordinary
character to be matched literally, however.


--
``The lyf so short, the craft so long to lerne.'' - Chaucer
 ``Ars longa, vita brevis'' - Hippocrates
Chet Ramey, UTech, CWRUc...@case.eduhttp://tiswww.cwru.edu/~chet/




Re: bash "extglob" needs to upgrade at least like zsh "kshglob"

2022-11-17 Thread Koichi Murase
2022年11月17日(木) 6:47 Chet Ramey :
> The cleverness is due to Russ Cox, a really smart guy who figured this
> stuff out first:
>
> https://research.swtch.com/glob
>
> (https://swtch.com/~rsc/regexp/ is a collection of his writing on regular
> expressions. It's well worth reading.)

Thank you for the references to interesting readings!  Ah, so he is
the developer of re2.  I think this is the first time for me to read
articles written by the developer of re2.  I have looked inside the
codes there.  The idea of my new implementation [1] is closer to the
implementation "bounded-memory DFA" [2] linked from the regexp page
[3] (though I allocate a sufficient memory block calculated by the
length of the pattern string instead of using a fixed size of static
storage).  These do not require the full ``compilation'' of the DFA so
it requires a minimal preprocessing time, but instead, these are not
as fast as compiled DFAs (when we focus on the proportional
coefficients of the linear time complexity).  The new implementation
is still work-in-progress, so I later write the reasoning for the
choice of the implementation strategy when I submit the version ready
for review.

[1] https://gitlab.com/akinomyoga/bash/-/merge_requests/1; If some of
  you want to try it, you can clone it by

  $ git clone https://gitlab.com/akinomyoga/bash.git -b extglob

  but be careful that this is now a moving target occasionally squashed
  and force-pushed.
[2] https://swtch.com/~rsc/regexp/dfa1.c.txt
[3] https://swtch.com/~rsc/regexp/

I attach some benchmarks of the current state [dfaglob.pdf], where
"strmatch_ex" shown by solid lines is the new implementation.  Now
everything is linear with respect to the input string length, but I
would like to add some optimizations for simple patterns after
settling bracket-expression matters.  Also, I think I'd later add the
cases used in https://research.swtch.com/glob to the benchmarks.

--
Koichi


dfaglob.pdf
Description: Adobe PDF document


Re: bash "extglob" needs to upgrade at least like zsh "kshglob"

2022-11-17 Thread Koichi Murase
2022年11月17日(木) 6:47 Chet Ramey :
> fnmatch with FNM_PATHNAME only has to avoid matching the slash with a
> bracket expression. The shell has an additional constraint: a slash that
> appears in a bracket expression renders the bracket expression void and
> requires the `[' to be matched explicitly. That's why there have to be
> tests for slash in BRACKMATCH. There are two bugs: the off-by-one error
> you note and matching the open bracket explicitly.

Thank you for the explanation.  I did not know the special rule for
the slash in the bracket expression.  I assumed that that part of the
code in BRACKMATCH is related to the behavior of the bracket
expression never matching a slash so just removed it, which was wrong.

> I attached the patch I applied. I didn't include your fix to issue 1 above.

Thank you for the patch.  I applied it locally and tried it.  I attach
a test script that I used: [bracket-slash.sh].  Now I switched to a
loadable builtin that directly calls strmatch instead of hacking
GLOBIGNORE.  Here are the results:

  ---Tests for a slash in bracket expressions---
  #1: pat=ab[/]ef  str=ab[/]ef  yes/yes
  #2: pat=ab[/]ef  str=ab/efno/no
  #3: pat=ab[c/d]efstr=ab[c/d]efyes/yes
  #4: pat=ab[c/d]efstr=abcefyes/no
  #5: pat=ab[.-/]efstr=ab[.-/]efno/yes
  #6: pat=ab[.-/]efstr=ab.efyes/no
  #7: pat=ab[[=/=]]ef  str=ab[[=/=]]ef  yes/yes
  #8: pat=ab[[=/=]]ef  str=ab/efno/no
  #9: pat=ab[[=c=]/]ef str=ab[=/]ef yes/yes
  #10: pat=ab[[=c=]/]ef str=abcefyes/no
  #11: pat=ab[[:alpha:]/]ef str=ab[:/]ef yes/yes
  #12: pat=ab[[:alpha:]/]ef str=abxefyes/no
  #13: pat=ab[/[abc]]ef str=ab[/c]ef yes/yes
  #14: pat=ab[/[abc]]ef str=abc]ef   no/no
  #15: pat=ab[c[=/=]]ef str=ab[c[=/=]]ef yes/yes
  #16: pat=ab[c[=/=]]ef str=abc[=/=]ef   no/no
  #17: pat=ab[c[=/=]]ef str=abcefyes/no
  ---Tests for incomplete bracket expressions---
  #18: pat=ab[c str=ab[c yes/yes
  #19: pat=ab[c str=abc  no/no
  #20: pat=ab[c[=d= str=ab[c[=d= yes/yes
  #21: pat=ab[c[=d= str=abc  no/no
  #22: pat=ab[c[.d  str=ab[c[.d  yes/yes
  #23: pat=ab[c[.d  str=abc  no/no
  #24: pat=ab[c[:alpha: str=ab[c[:alpha: yes/yes
  #25: pat=ab[c[:alpha: str=abc  no/no
  #26: pat=ab[c-str=ab[c-no/yes
  #27: pat=ab[c-str=abc  no/no
  #28: pat=ab[c\str=ab[c\no/yes
  #29: pat=ab[c\str=abc  no/no
  #30: pat=ab[[\str=ab[[\no/yes
  #31: pat=ab[[\str=ab[  no/no

"yes" and "no" on the left of / in the fourth column are the results
after applying the patch you provided. "yes" and "no" on the right of
/ are the results that *I* expect (see below paragraphs).

The new treatment seems to only handle a slash that directly appears
as an independent character in bracket expressions (cases
#1,#2,#13,#14), but if I literally read the standard you quoted, I
feel we should also handle other slashes (#3..#6,#9..#12,#15..#17).
Cases #7 and #8 seem to be already processed in the current devel.  I
think we can understand the rule of the standard by a mental
processing model to first split the pattern with `/' and then to match
each segment with the corresponding segment of the target string.
Thinking in that way, we can actually just replace the NUL test for
the string end with a similar test for NUL or a slash.  I attach a
possible patch for the additional fix
[r0037.brackmatch4.slash-terminated.patch.txt], which applies after
your patch [bracket-slash.patch].

Also, I have a suggestion of changes for more consistent handling of
incomplete bracket expressions.  Currently, incomplete bracket
expressions sometimes fail unconditionally (#26..#31) and sometimes
fall back to a literal `[' plus remaining pattern (#18..#25).  I
attach another patch for this:
[r0037.brackmatch5.incomplete-bracket.patch.txt].

--
Koichi
From cbee6f5a782e9ce1b431704f33a3e03de85c4460 Mon Sep 17 00:00:00 2001
From: Koichi Murase 
Date: Thu, 17 Nov 2022 18:44:19 +0900
Subject: [PATCH 1/3] fix(BRACKMATCH): terminate bracket expression by any
 slash

---
 lib/glob/sm_loop.c | 45 -
 1 file changed, 24 insertions(+), 21 deletions(-)

diff --git a/lib/glob/sm_loop.c b/lib/glob/sm_loop.c
index a448c155..f4d8d1f2 100644
--- a/lib/glob/sm_loop.c
+++ b/lib/glob/sm_loop.c
@@ -439,6 +439,16 @@ BRACKMATCH (p, test, flags)
   if (not = (*p == L('!') || *p == L('^')))
 ++p;
 
+  /* POSIX.2 2.13.3 says: `If a  character is found following an
+ unescaped  character before a corresponding
+  is found, the open bracket shall be treated as an
+ 

Re: bash "extglob" needs to upgrade at least like zsh "kshglob"

2022-11-16 Thread Chet Ramey

On 11/14/22 6:06 AM, Koichi Murase wrote:


However, I noticed two strange behaviors of the current engine.
Before adjusting the behavior of the new engine and submitting it for
review, I would like to confirm the (expected) behavior of the current
engine in the current devel.

These two behaviors finally turned out to be both related to the
matching of bracket expression by the function `BRACKMATCH'
(lib/glob/sm_loop.c).

--
1. pattern [[=B=]][c] matches with c

   $ bash-devel --norc
   $ [[ Bc == [[=B=]][c] ]]; echo $?
   0  <-- OK. This is expected.
   $ [[ c == [[=B=]][c] ]]; echo $?
   0  <-- This is unexpected.


This is clearly a problem, and your fix is a good one.



--
2. bracket expression sometimes match or unmatch the slash with
   FNM_PATHNAME.

   FNM_PATHNAME is only used in two places in the Bash codebase.  1)
   One is for the glob matching for the filenames in the directory
   (lib/glob/glob.c).  However, this actually does not seem to have an
   actual effect because FNM_PATHNAME only causes differences in the
   matching when the target string contains a slash but the filenames
   do not contain any slashes.  2) The other is the filtering of the
   pathname-expansion results with GLOBIGNORE (pathexp.c).  So the only
   way to test the behavior of Bash's strmatch with FNM_PATHNAME
   (without writing a C program to directly use the function
   `strmatch') is to use GLOBIGNORE.


This is true, but the two uses above are not directly analogous to
fnmatch(). The bash implementation, due to its origin as an fnmatch()
implementation, uses the same flag names but uses FNM_PATHNAME to signal
the matching engine that it's matching filenames for pathname expansion.

That triggers the difference. The first part of your patch is correct:
a bracket expression can never match a slash in the string. This is
common to both fnmatch and what POSIX calls "Patterns Used For Filename
Expansion."

https://pubs.opengroup.org/onlinepubs/9699919799/utilities/V3_chap02.html#tag_18_13_03

fnmatch with FNM_PATHNAME only has to avoid matching the slash with a
bracket expression. The shell has an additional constraint: a slash that
appears in a bracket expression renders the bracket expression void and
requires the `[' to be matched explicitly. That's why there have to be
tests for slash in BRACKMATCH. There are two bugs: the off-by-one error
you note and matching the open bracket explicitly.



   * In fnmatch(3), a bracket expression never matches a / with
 FNM_PATHNAME.


This is one part of the fix.

   * In Bash's strmatch, a bracket expression sometimes matches `/' and
 sometimes does not.  In the codebase, `/' is excluded only when it
 explicitly appears after another character in the bracket
 expression (lib/glob/sm_loop.c:574) even though the comment
 mentions [/].  This is the reason that only [a/] fails with Bash's
 implementation in cases #2..#6 in the above result.


That's the consequence of the test being in the wrong place in BRACKMATCH.



   * What is happening with Bash's implementation in cases #7..#12 is
 related the assumption of the backtracking trick for `*': The
 trick for `*' backtracking explained in the code comment
 lib/glob/sm_loop.c:320---"This is a clever idea from
 glibc"---relies on the behavior that the bracket expression never
 matches a slash that `*' cannot match. 


These two changes should satisfy that.

The cleverness is due to Russ Cox, a really smart guy who figured this
stuff out first:

https://research.swtch.com/glob

(https://swtch.com/~rsc/regexp/ is a collection of his writing on regular
expressions. It's well worth reading.)

I attached the patch I applied. I didn't include your fix to issue 1 above.

Chet
--
``The lyf so short, the craft so long to lerne.'' - Chaucer
 ``Ars longa, vita brevis'' - Hippocrates
Chet Ramey, UTech, CWRUc...@case.eduhttp://tiswww.cwru.edu/~chet/
*** ../bash-5.2-patched/lib/glob/sm_loop.c  2021-08-03 10:24:49.0 
-0400
--- lib/glob/sm_loop.c  2022-11-16 16:22:42.0 -0500
***
*** 344,347 
--- 344,352 
  return (FNM_NOMATCH);
  
+   /* If we are matching pathnames, we can't match a slash with a
+  bracket expression. */
+   if (sc == L('/') && (flags & FNM_PATHNAME))
+ return (FNM_NOMATCH);
+ 
/* `?' cannot match `.' or `..' if it is the first character of the
   string or if it is the first character following a slash and
***
*** 452,455 
--- 457,466 
  pc = FOLD (p[1]);
  p += 4;
+ 
+ /* Finding a slash in a bracket expression means you have to
+match the bracket as an ordinary character (see below). */
+ if (c == L('/') && (flags & FNM_PATHNAME))
+ 

Re: bash "extglob" needs to upgrade at least like zsh "kshglob"

2022-11-14 Thread Koichi Murase
2022年11月1日(火) 0:59 Chet Ramey :
> If someone wanted to do this, I would take a look at incorporating the
> results, as long as it didn't add dependencies on, say, pcre or gnulib
> regex.

Instead of translating the pattern to a regular expression and
compiling it by regcomp (), I have experimentally implemented
a new extglob engine based on a naive DFA and was comparing the
behavior and the performance with the current implementations of
devel.  [Note: In this report hereafter, ``the current
implementation/engine'' always means the implementation of strmatch
(lib/glob/{strmatch,smatch,sm_loop.c}) in the current devel 31f4d468.]

However, I noticed two strange behaviors of the current engine.
Before adjusting the behavior of the new engine and submitting it for
review, I would like to confirm the (expected) behavior of the current
engine in the current devel.

These two behaviors finally turned out to be both related to the
matching of bracket expression by the function `BRACKMATCH'
(lib/glob/sm_loop.c).

--
1. pattern [[=B=]][c] matches with c

  $ bash-devel --norc
  $ [[ Bc == [[=B=]][c] ]]; echo $?
  0  <-- OK. This is expected.
  $ [[ c == [[=B=]][c] ]]; echo $?
  0  <-- This is unexpected.

  See the attached [r0037.brackmatch1.equivalence-class.patch] for the
  fix.  The problem is caused because [[=B=]][c] is treated as a
  single bracket expression [ « [=B=]][c » ] when the equivalence
  class [=B=] does not match.  This is because `]' after `[[=B=]' is
  treated as if it is the first `]' in the bracket expression (such as
  `]' after `[' in the pattern « []aaa] »).  In the patch, when the
  next character after a failing equivalence class [=B=] is `]', the
  bracket expression has been changed to fail just the same as the
  case for a failing character class [:alpha:]
  (lib/glob/sm_loop.c:530; line number is that in the current devel
  31f4d468).

--
2. bracket expression sometimes match or unmatch the slash with
  FNM_PATHNAME.

  FNM_PATHNAME is only used in two places in the Bash codebase.  1)
  One is for the glob matching for the filenames in the directory
  (lib/glob/glob.c).  However, this actually does not seem to have an
  actual effect because FNM_PATHNAME only causes differences in the
  matching when the target string contains a slash but the filenames
  do not contain any slashes.  2) The other is the filtering of the
  pathname-expansion results with GLOBIGNORE (pathexp.c).  So the only
  way to test the behavior of Bash's strmatch with FNM_PATHNAME
  (without writing a C program to directly use the function
  `strmatch') is to use GLOBIGNORE.

  To demonstrate the behavior of the current implementation, I attach
  a script [fnmpath.sh], which utilizes GLOBIGNORE for the Bash
  implementation.  It compares the result with fnmatch(3).  The result
  in my environment (x86_64-redhat-linux-gnu [Fedora Linux 36 (Server
  Edition)]) is this:

  $ bash-devel fnmpath.sh
  #1: pat=ab/cd/efgyes/yes
  #2: pat=ab[/]cd/efg  yes/no
  #3: pat=ab[/a]cd/efg yes/no
  #4: pat=ab[a/]cd/efg no/no
  #5: pat=ab[!a]cd/efg yes/no
  #6: pat=ab[.-0]cd/efgyes/no
  #7: pat=*/*/efg  yes/yes
  #8: pat=*[/]*/efgno/no
  #9: pat=*[/a]*/efg   no/no
  #10: pat=*[a/]*/efg   no/no
  #11: pat=*[!a]*/efg   no/no
  #12: pat=*[.-0]*/efg  no/no
  #13: pat=ab@(/)cd/efg yes/yes
  #14: pat=*@(/)cd/efg  no/no
  #15: pat=*/cd/efg yes/yes

  This tests whether each pattern matches the string "ab/cd/efg".  Two
  results by Bash's strmatch and fnmatch(3) are connected with
  `/'. "yes" means the pattern matches the string "ab/cd/efg" and "no"
  means it does not match.  Some observations are

  * In fnmatch(3), a bracket expression never matches a / with
FNM_PATHNAME.

  * In Bash's strmatch, a bracket expression sometimes matches `/' and
sometimes does not.  In the codebase, `/' is excluded only when it
explicitly appears after another character in the bracket
expression (lib/glob/sm_loop.c:574) even though the comment
mentions [/].  This is the reason that only [a/] fails with Bash's
implementation in cases #2..#6 in the above result.

  * What is happening with Bash's implementation in cases #7..#12 is
related the assumption of the backtracking trick for `*': The
trick for `*' backtracking explained in the code comment
lib/glob/sm_loop.c:320---"This is a clever idea from
glibc"---relies on the behavior that the bracket expression never
matches a slash that `*' cannot match.  [Note: The exact
requirements for this trick is actually slightly weaker: each
bracket expression needs to match a fixed number of `/', 0 or 1,
when FNM_PATHNAME is specified; it should never match a slash if
it can match other characters, and it should never match other

Re: bash "extglob" needs to upgrade at least like zsh "kshglob"

2022-11-03 Thread Chet Ramey

On 10/29/22 1:50 AM, Martin D Kealey wrote:


PS: While we're about it, can we please fix the expansion of « a/*/b/c/d/*
» so it only calls readdir on a/ and each dir matching a/*/b/c/d/ and NOT
call readdir on dirs matching a/*/, a/*/b/, or a/*/b/c/.


It already does this.

Given

$ find . -print
.
./a
./a/z
./a/z/b
./a/z/b/c
./a/x
./a/x/b
./a/x/b/c
./a/x/b/c/d
./a/x/b/c/d/1
./a/x/b/c/d/4
./a/x/b/c/d/3
./a/x/b/c/d/2
./a/x/b/c/d/5
./a/y
./a/y/b

Running `echo a/*/b/c/d/*' calls opendir(2) twice:

$ ./bash ./x1
opendir: a/
opendir: a/x/b/c/d
a/x/b/c/d/1 a/x/b/c/d/2 a/x/b/c/d/3 a/x/b/c/d/4 a/x/b/c/d/5


--
``The lyf so short, the craft so long to lerne.'' - Chaucer
 ``Ars longa, vita brevis'' - Hippocrates
Chet Ramey, UTech, CWRUc...@case.eduhttp://tiswww.cwru.edu/~chet/




Re: bash "extglob" needs to upgrade at least like zsh "kshglob"

2022-10-31 Thread Chet Ramey

On 10/30/22 9:59 PM, Martin D Kealey wrote:


With the exception of the !(LIST) negation, there's a direct correspondence
between extglob and any other regex format. Translating between them is
trivial.


As I said before, I would gladly look at incorporating such a change, as
long as it didn't increase the number of dependencies.

--
``The lyf so short, the craft so long to lerne.'' - Chaucer
 ``Ars longa, vita brevis'' - Hippocrates
Chet Ramey, UTech, CWRUc...@case.eduhttp://tiswww.cwru.edu/~chet/




Re: bash "extglob" needs to upgrade at least like zsh "kshglob"

2022-10-31 Thread Chet Ramey

On 10/30/22 6:32 AM, Oğuz wrote:


Or
(g) make the existing extglob code faster and avoid introducing more
complexity into the shell.


The extglob code has reasonable performance when it's asked to do the most
common thing: check whether a given null-terminated string matches a given
(possibly extended) pattern, anchored at start and end.

The problem arises when you want to support the leftmost-longest substring
match semantics required by pattern substitution. There's no
straightforward way to simply extend the model -- you have to implement
something new that supports those semantics to replace the simple way it's
done now.

If you want to stick with bash when performance for your use case becomes
unacceptable and eschew the (easy, simple, fast) sed workarounds, you're
going to be frustrated until something changes.

--
``The lyf so short, the craft so long to lerne.'' - Chaucer
 ``Ars longa, vita brevis'' - Hippocrates
Chet Ramey, UTech, CWRUc...@case.eduhttp://tiswww.cwru.edu/~chet/




Re: bash "extglob" needs to upgrade at least like zsh "kshglob"

2022-10-31 Thread Chet Ramey

On 10/29/22 1:50 AM, Martin D Kealey wrote:

This seems like a good reason to simply translate extglobs into regexes,
which should run in linear time, rather than put effort into building and
debugging a parallel implementation.


If someone wanted to do this, I would take a look at incorporating the
results, as long as it didn't add dependencies on, say, pcre or gnulib
regex.

Chet

--
``The lyf so short, the craft so long to lerne.'' - Chaucer
 ``Ars longa, vita brevis'' - Hippocrates
Chet Ramey, UTech, CWRUc...@case.eduhttp://tiswww.cwru.edu/~chet/




Re: bash "extglob" needs to upgrade at least like zsh "kshglob"

2022-10-31 Thread Chet Ramey

On 10/28/22 2:44 AM, Hyunho Cho wrote:


Bash Version: 5.2
Patch Level: 2
Release Status: release

##

bash "extglob" almost unusable except with very tiny string.
there is no such problems in zsh "kshglob".


Nothing has changed since you brought this up last month. The same two-line
sed workaround works on the 90K string from `gcc --help'.

https://lists.gnu.org/archive/html/bug-bash/2022-09/msg00029.html

--
``The lyf so short, the craft so long to lerne.'' - Chaucer
 ``Ars longa, vita brevis'' - Hippocrates
Chet Ramey, UTech, CWRUc...@case.eduhttp://tiswww.cwru.edu/~chet/




Re: bash "extglob" needs to upgrade at least like zsh "kshglob"

2022-10-31 Thread Oğuz İsmail Uysal

On 10/31/22 2:20 PM, Greg Wooledge wrote:

Bash already uses the POSIX regex functions (regcomp(3) et al.) to do
[[ =~ ]] using POSIX ERE. 
Yeah, and offers no more than what your libc's regex engine has. For 
example, you can't tell what `[[ x =~ .{256} ]]' (or even `[[ x =~ "" 
]]') would return without knowing the operating system it's run on. 
extglobs aren't like that and shouldn't be either.




Re: bash "extglob" needs to upgrade at least like zsh "kshglob"

2022-10-31 Thread Greg Wooledge
On Mon, Oct 31, 2022 at 05:00:40AM +0200, Oğuz wrote:
> 31 Ekim 2022 Pazartesi tarihinde Martin D Kealey 
> yazdı:
> > If we use the standard POSIX BRE or ERE, then there's no additional code to
> > ship; it's included as part of the OS. The hard part is what to do with
> > (!LIST), which was the point of my previous post.
> 
> That'd be clunkier than what we already have. Bash targets many platforms
> and it'd have to target as many regex engines if it were to translate
> extglobs to posix regexes. You can't expect all of them to be compatible
> with each other, and they are not. So, if we wish to translate extglobs to
> regexes and have them work regardless of the platform, the easiest way
> forward is to adopt a third party regex engine; about which I said enough
> in my previous email.

Bash already uses the POSIX regex functions (regcomp(3) et al.) to do
[[ =~ ]] using POSIX ERE.  If it weren't for !(list) it would be possible,
even easy, for bash to convert extglobs into POSIX EREs, syntactically,
and then call the same libraries it's already calling.

I'm not sure where your claim of incompatibility comes from.

Regardling !(list), I know that many people use it.  It's fairly
popular, at least among the sort of people who come to IRC asking how
they can match all files except foo and bar.  So, simply dropping it
would not be a viable option.



Re: bash "extglob" needs to upgrade at least like zsh "kshglob"

2022-10-30 Thread Oğuz
31 Ekim 2022 Pazartesi tarihinde Martin D Kealey 
yazdı:
>
> With the exception of the !(LIST) negation, there's a direct
> correspondence between extglob and any other regex format. Translating
> between them is trivial.
>
If we use the standard POSIX BRE or ERE, then there's no additional code to
> ship; it's included as part of the OS. The hard part is what to do with
> (!LIST), which was the point of my previous post.
>

That'd be clunkier than what we already have. Bash targets many platforms
and it'd have to target as many regex engines if it were to translate
extglobs to posix regexes. You can't expect all of them to be compatible
with each other, and they are not. So, if we wish to translate extglobs to
regexes and have them work regardless of the platform, the easiest way
forward is to adopt a third party regex engine; about which I said enough
in my previous email.

The problem is that it DOESN'T work fine. In practice people encounter
> abysmally slow extglob matching.
>

*when matching against a huge string. Which is rare in my experience, but
of course should be taken into consideration if there are multiple bug
reports; I didn't say anything against that.


-- 
Oğuz


Re: bash "extglob" needs to upgrade at least like zsh "kshglob"

2022-10-30 Thread Martin D Kealey
I'm top-quoting this because the entire response below seems to be
predicated on a misconception, or perhaps several misconceptions.

Exactly NONE of my suggestions involves expanding the Shell language.
Users would continue to write extglob exactly as they do now, and they
would remain blissfully ignorant of the regex engine underneath. The
"translation" is entirely hidden from them.

Extglob is amenable to compilation into a FSA, which leads to the
conclusion that it's functionally a form of regular expression, even if its
syntax is rather different from regexes that people are familiar with. This
is why I said that writing our own regular/linear extglob implementation
would be equivalent to writing a regex compiler and engine. It's also why I
think the shortest route to having a working implementation would be to
write a translation layer from extglob to one of the existing RE formats.

(The only option I suggested that would modify the shell language would be
the option to remove the !(LIST) negation; I include that for completeness,
I wouldn't actually recommend it.)

The idea that "these are only technical details that nobody cares about" is
just wrong.
There have been bugs filed because the performance is not just "a bit
slow", but "slow by many orders of magnitude".

If discussion of implementation details is out of scope for the bash-bug
mailing list, is there a bash-dev list, or similar, where they should be
discussed?

With the exception of the !(LIST) negation, there's a direct correspondence
between extglob and any other regex format. Translating between them is
trivial.
If we use the standard POSIX BRE or ERE, then there's no additional code to
ship; it's included as part of the OS. The hard part is what to do with
(!LIST), which was the point of my previous post.

-Martin

On Sun, 30 Oct 2022 at 23:35, Oğuz İsmail Uysal 
wrote:

> On 10/30/22 3:25 PM, Martin D Kealey wrote:
> > How much faster do you think it can be made?
> I don't know, irrelevant though.
> > The problem is not that individual steps are slow, but rather [...]

These are technical details; no user cares about them.
>
> The purpose of my suggestions was to /minimize/ the complexity that
> becomes part of Bash's codebase [...]
> I meant complexity of the language, not the codebase.
>
> In my opinion "make the existing extglob code faster" is a wasted effort
> if it doesn't get us to "run in at-most quadratic time" and preferably "run
> in regular (linear) time", and so that basically amounts to "write our own
> regex state machine compiler and regex engine". This is a non-trivial task,
> and would fairly obviously add
> > *more* complex code into Bash's codebase than any of my suggested
> > alternatives.
> extglobs are already a part of the bash language. All of your suggested
> alternatives involve expanding the language in question.


No. Just No.

That's why I
> disagree with all of them.
>






> > (Even my options of "postprocess the codebase" or "modify an existing
> > regex compiler" would leave their execution components untouched; only
> > the compilation phase would be modified, and a modified regex compiler
> > would at least stand a chance of existing as a stand-alone library
> > project.)
> If you mean bash should start shipping a huge library like pcre for
> solving an edge case, I don't think that's reasonable at all; why take
> on such a burden when you already have something that works fine in
> practice?
>

Using PCRE was only one option, and not my preferred one.
POSIX RE is provided by the standard C library, we wouldn't have to ship
anything.

The problem is that it DOESN'T work fine. In practice people encounter
abysmally slow extglob matching.


Re: bash "extglob" needs to upgrade at least like zsh "kshglob"

2022-10-30 Thread Oğuz İsmail Uysal

On 10/30/22 3:25 PM, Martin D Kealey wrote:

How much faster do you think it can be made?

I don't know, irrelevant though.
The problem is not that individual steps are slow, but rather that it 
takes at least a higher-order-polynomial number of steps, possibly 
more (such as exponential or factorial).
Speeding up the individual steps will make no practical difference, 
while pin-hole optimisations may dramatically speed up some common 
cases, but still leave the most general cases catastrophically slow.

These are technical details; no user cares about them.
The purpose of my suggestions was to /minimize/ the complexity that 
becomes part of Bash's codebase, while leaving as few pathological 
cases as possible - preferably none.

I meant complexity of the language, not the codebase.
In my opinion "make the existing extglob code faster" is a wasted 
effort if it doesn't get us to "run in at-most quadratic time" and 
preferably "run in regular (linear) time", and so that basically 
amounts to "write our own regex state machine compiler and regex 
engine". This is a non-trivial task, and would fairly obviously add 
*more* complex code into Bash's codebase than any of my suggested 
alternatives.
extglobs are already a part of the bash language. All of your suggested 
alternatives involve expanding the language in question. That's why I 
disagree with all of them.
(Even my options of "postprocess the codebase" or "modify an existing 
regex compiler" would leave their execution components untouched; only 
the compilation phase would be modified, and a modified regex compiler 
would at least stand a chance of existing as a stand-alone library 
project.)
If you mean bash should start shipping a huge library like pcre for 
solving an edge case, I don't think that's reasonable at all; why take 
on such a burden when you already have something that works fine in 
practice?




Re: bash "extglob" needs to upgrade at least like zsh "kshglob"

2022-10-30 Thread Alex fxmbsw7 Ratchev
On Sun, Oct 30, 2022, 11:33 Oğuz  wrote:

> 30 Ekim 2022 Pazar tarihinde Martin D Kealey 
> yazdı:
> >
> > So the options would seem to be:
> > (a) prohibit inversions (you get to pick EITHER extglob or rexglob, not
> > both);
> > (b) bypass convert-to-regex when inversions are present;
> > (c) use PCRE or Vim RE, which already support negations (though not in
> the
> > same form); note that these do not have linear ("regular") time
> performance
> > in any but the trivial cases;
> > (d) compute the "inverse" regex for a given negation (this may require
> Vim
> > RE, see below);
> > (e) post-process the compiled regex (this would be highly dependent on
> the
> > specific RE implementation);
> > (f) pick an existing regex engine, and add the necessary logic to handle
> > negations and conjunctions (see below);
> >
>
> Or
> (g) make the existing extglob code faster and avoid introducing more
> complexity into the shell.
>
> Which, in my opinion, is the easiest and most realistic option.
>

last three times i bugged about glob ( like reuse groupings ) it sounded
'do-it-yourself' back only
i would , have done , if i could ( .c )

-- 
> Oğuz
>


Re: bash "extglob" needs to upgrade at least like zsh "kshglob"

2022-10-30 Thread Oğuz
30 Ekim 2022 Pazar tarihinde Martin D Kealey 
yazdı:
>
> So the options would seem to be:
> (a) prohibit inversions (you get to pick EITHER extglob or rexglob, not
> both);
> (b) bypass convert-to-regex when inversions are present;
> (c) use PCRE or Vim RE, which already support negations (though not in the
> same form); note that these do not have linear ("regular") time performance
> in any but the trivial cases;
> (d) compute the "inverse" regex for a given negation (this may require Vim
> RE, see below);
> (e) post-process the compiled regex (this would be highly dependent on the
> specific RE implementation);
> (f) pick an existing regex engine, and add the necessary logic to handle
> negations and conjunctions (see below);
>

Or
(g) make the existing extglob code faster and avoid introducing more
complexity into the shell.

Which, in my opinion, is the easiest and most realistic option.


-- 
Oğuz


Re: bash "extglob" needs to upgrade at least like zsh "kshglob"

2022-10-30 Thread Martin D Kealey
On Sat, 29 Oct 2022 at 22:15, Greg Wooledge  wrote:

> On Sat, Oct 29, 2022 at 04:50:00PM +1100, Martin D Kealey wrote:
> > This seems like a good reason to simply translate extglobs into regexes,
> > which should run in linear time, rather than put effort into building and
> > debugging a parallel implementation.
>
> This isn't straightforward, because of the !(list) feature of extglob.
> There's no analogous construct for that in standard regexes.
>

That's true. and my "simply" is understating the complexity of the task,
but it's not impossible.

There was a recent discussion about this very point on
irc://libera.chat#Bash where it was pointed out that supporting the !(list)
style of inversions in the regex compiler is simply a matter of inverting
the "success" or "failure" indications attached to the states when
compiling a regex, which means that it works correctly when these
constructs are nested.

So the options would seem to be:
(a) prohibit inversions (you get to pick EITHER extglob or rexglob, not
both);
(b) bypass convert-to-regex when inversions are present;
(c) use PCRE or Vim RE, which already support negations (though not in the
same form); note that these do not have linear ("regular") time performance
in any but the trivial cases;
(d) compute the "inverse" regex for a given negation (this may require Vim
RE, see below);
(e) post-process the compiled regex (this would be highly dependent on the
specific RE implementation);
(f) pick an existing regex engine, and add the necessary logic to handle
negations and conjunctions (see below);

A particular difficulty is handling !(!(X)|!(Y)) which by de Morgan's law
translates into @(X) - meaning a target needs to match BOTH X and Y. Only
Vim's Regex engine has direct support for that construct, and it exhibits
non-linear timing when using it.

A naïve implementation could easily require state table space that's
exponential in the breadth of the conjunctions, meaning that the user could
very easily write a rexglob that cannot be compiled with available memory.
So there needs to be a fallback position: either fail, telling the user we
cannot honour the linear speed guarantee, or "try anyway" with a procedure
that could exhibit truly horrible time complexity.

-Martin

PS: I say "we" and "us" advisedly; let's not leave everything to Chet.


Re: bash "extglob" needs to upgrade at least like zsh "kshglob"

2022-10-29 Thread Greg Wooledge
On Sat, Oct 29, 2022 at 04:50:00PM +1100, Martin D Kealey wrote:
> This seems like a good reason to simply translate extglobs into regexes,
> which should run in linear time, rather than put effort into building and
> debugging a parallel implementation.

This isn't straightforward, because of the !(list) feature of extglob.
There's no analogous construct for that in standard regexes.



Re: bash "extglob" needs to upgrade at least like zsh "kshglob"

2022-10-28 Thread Martin D Kealey
This seems like a good reason to simply translate extglobs into regexes,
which should run in linear time, rather than put effort into building and
debugging a parallel implementation.

-Martin

PS: While we're about it, can we please fix the expansion of « a/*/b/c/d/*
» so it only calls readdir on a/ and each dir matching a/*/b/c/d/ and NOT
call readdir on dirs matching a/*/, a/*/b/, or a/*/b/c/.


Re: bash "extglob" needs to upgrade at least like zsh "kshglob"

2022-10-28 Thread Greg Wooledge
On Fri, Oct 28, 2022 at 03:44:34PM +0900, Hyunho Cho wrote:
> # this is just for testing purposes.
> $ foo=$( gcc -v --help 2> /dev/null | sed -En 's/^\s*(-[^ ]+).*/\1/p' )

The last time this issue was reported, we mentioned that "gcc --help"
is not a standard document that appears the same on everyone's systems.

I'm not disputing that bash has performance issues here, but in the
interests of getting a coherent bug report (or feature request), please
use input that's available to everyone.

At the very least, show us the output of "gcc -v --help 2>/dev/null | wc"
on your system, so we have a ballpark estimate of the size of the input
and the number of words in it.

On my system, as an example:

unicorn:~$ gcc -v --help 2>/dev/null | wc
   3867   22464  250924
unicorn:~$ gcc --help 2>/dev/null | wc
 61 4303995