Re: bash "extglob" needs to upgrade at least like zsh "kshglob"
2022年12月2日(金) 0:59 Chet Ramey : > > Thank you for the discussion and for applying the changes. Besides, I > > am sorry that I still have a discussion on the behavior of BRACKMATCH, > > so it was not the last set of changes. After the above fix, I moved > > to check the behavior related to PATSCAN, where I found inconsistent > > results related to the difference between BRACKMATCH and PATSCAN in > > parsing the bracket expressions. I checked also other parts of the > > codebase and found additional inconsistencies. > > I have to say I appreciate your comprehensive and detailed reports. Thank you for your patience. I also appreciate your careful review and applying patches every time. -- Koichi
Re: bash "extglob" needs to upgrade at least like zsh "kshglob"
On 11/28/22 5:51 AM, Koichi Murase wrote: 2022年11月23日(水) 5:24 Chet Ramey : I attached the latest patch against bash-5.2.9. commit 3c9dd4565792bc53de3a94ec38a65a1989f3fe2f (upstream/devel) associative array elements; last set of changes to globbing bracket expressions; fix for timing subshell commands Thank you for the discussion and for applying the changes. Besides, I am sorry that I still have a discussion on the behavior of BRACKMATCH, so it was not the last set of changes. After the above fix, I moved to check the behavior related to PATSCAN, where I found inconsistent results related to the difference between BRACKMATCH and PATSCAN in parsing the bracket expressions. I checked also other parts of the codebase and found additional inconsistencies. Thanks for the detailed analysis. I applied these patches. Chet -- ``The lyf so short, the craft so long to lerne.'' - Chaucer ``Ars longa, vita brevis'' - Hippocrates Chet Ramey, UTech, CWRUc...@case.eduhttp://tiswww.cwru.edu/~chet/
Re: bash "extglob" needs to upgrade at least like zsh "kshglob"
On 11/28/22 5:51 AM, Koichi Murase wrote: 2022年11月23日(水) 5:24 Chet Ramey : I attached the latest patch against bash-5.2.9. commit 3c9dd4565792bc53de3a94ec38a65a1989f3fe2f (upstream/devel) associative array elements; last set of changes to globbing bracket expressions; fix for timing subshell commands Thank you for the discussion and for applying the changes. Besides, I am sorry that I still have a discussion on the behavior of BRACKMATCH, so it was not the last set of changes. After the above fix, I moved to check the behavior related to PATSCAN, where I found inconsistent results related to the difference between BRACKMATCH and PATSCAN in parsing the bracket expressions. I checked also other parts of the codebase and found additional inconsistencies. I have to say I appreciate your comprehensive and detailed reports. -- ``The lyf so short, the craft so long to lerne.'' - Chaucer ``Ars longa, vita brevis'' - Hippocrates Chet Ramey, UTech, CWRUc...@case.eduhttp://tiswww.cwru.edu/~chet/
Re: bash "extglob" needs to upgrade at least like zsh "kshglob"
2022年11月28日(月) 19:51 Koichi Murase : > The fix for PATSCAN (C) is included in the second patch > [r0037.patscan1.parse_subbracket.patch.txt], which solves "Repeat-By 1 > and 4" by using PARSE_SUBBRACKET of the first patch. [...] Sorry, there was an oversight in the second patch [r0037.patscan1.parse_subbracket.patch.txt]. We need to also update the function signatures of glob_patscan{,_wc} (expanded from PATSCAN) in other files, gmisc.c and glob.c. I attach another patch [r0037.patscan3.extern.patch.txt] for additional fixes to the second patch, where I passed 0 as FLAGS so the original behavior is preserved. -- Koichi From ddc72aeb2cd6962e85bdc0819419db3d8cb6576d Mon Sep 17 00:00:00 2001 From: Koichi Murase Date: Mon, 28 Nov 2022 21:42:14 +0900 Subject: [PATCH 1/2] fix(lib/glob/glob): update uses of glob_patscan --- lib/glob/glob.c | 12 ++-- lib/glob/gmisc.c | 4 ++-- 2 files changed, 8 insertions(+), 8 deletions(-) diff --git a/lib/glob/glob.c b/lib/glob/glob.c index b66af85c..686d0f6b 100644 --- a/lib/glob/glob.c +++ b/lib/glob/glob.c @@ -127,8 +127,8 @@ static int glob_testdir PARAMS((char *, int)); static char **glob_dir_to_array PARAMS((char *, char **, int)); /* Make sure these names continue to agree with what's in smatch.c */ -extern char *glob_patscan PARAMS((char *, char *, int)); -extern wchar_t *glob_patscan_wc PARAMS((wchar_t *, wchar_t *, int)); +extern char *glob_patscan PARAMS((char *, char *, int, int)); +extern wchar_t *glob_patscan_wc PARAMS((wchar_t *, wchar_t *, wint_t, int)); /* And this from gmisc.c/gm_loop.c */ extern int wextglob_pattern_p PARAMS((wchar_t *)); @@ -207,7 +207,7 @@ extglob_skipname (pat, dname, flags) wild = *pat == '*' || *pat == '?'; pp = pat + 2; se = pp + strlen (pp); /* end of pattern string */ - pe = glob_patscan (pp, se, 0); /* end of extglob pattern */ + pe = glob_patscan (pp, se, 0, 0);/* end of extglob pattern */ /* if pe == 0, this is an invalid extglob pattern */ if (pe == 0) @@ -234,7 +234,7 @@ extglob_skipname (pat, dname, flags) nullpat = pe >= (pat + 2) && pe[-2] == '(' && pe[-1] == ')'; /* check every subpattern */ - while (t = glob_patscan (pp, pe, '|')) + while (t = glob_patscan (pp, pe, '|', 0)) { /* If T == PE and *T == 0 (&& PE[-1] == RPAREN), we have hit the end of a pattern with no trailing characters. */ @@ -358,7 +358,7 @@ wextglob_skipname (pat, dname, flags) wild = *pat == L'*' || *pat == L'?'; pp = pat + 2; se = pp + wcslen (pp); - pe = glob_patscan_wc (pp, se, 0); + pe = glob_patscan_wc (pp, se, 0, 0); /* if pe == 0, this is an invalid extglob pattern */ if (pe == 0) @@ -382,7 +382,7 @@ wextglob_skipname (pat, dname, flags) nullpat = pe >= (pat + 2) && pe[-2] == L'(' && pe[-1] == L')'; /* check every subpattern */ - while (t = glob_patscan_wc (pp, pe, '|')) + while (t = glob_patscan_wc (pp, pe, '|', 0)) { n = t[-1]; /* ( */ if (wextglob_pattern_p (pp) && n == L')')/* nested extglob? */ diff --git a/lib/glob/gmisc.c b/lib/glob/gmisc.c index f3d74cea..24fdf746 100644 --- a/lib/glob/gmisc.c +++ b/lib/glob/gmisc.c @@ -38,7 +38,7 @@ #include "glob.h" /* Make sure these names continue to agree with what's in smatch.c */ -extern char *glob_patscan PARAMS((char *, char *, int)); +extern char *glob_patscan PARAMS((char *, char *, int, int)); /* Compile `gm_loop.c' for single-byte characters. */ #define CHAR char @@ -92,7 +92,7 @@ glob_dirscan (pat, dirsep) { if (se == 0) se = p + strlen (p) - 1; - pe = glob_patscan (p + 2, se, 0); + pe = glob_patscan (p + 2, se, 0, 0); if (pe == 0) continue; else if (*pe == 0) -- 2.37.2
Re: bash "extglob" needs to upgrade at least like zsh "kshglob"
2022年11月23日(水) 5:24 Chet Ramey : > I attached the latest patch against bash-5.2.9. > commit 3c9dd4565792bc53de3a94ec38a65a1989f3fe2f (upstream/devel) > > associative array elements; last set of changes to globbing > bracket expressions; fix for timing subshell commands Thank you for the discussion and for applying the changes. Besides, I am sorry that I still have a discussion on the behavior of BRACKMATCH, so it was not the last set of changes. After the above fix, I moved to check the behavior related to PATSCAN, where I found inconsistent results related to the difference between BRACKMATCH and PATSCAN in parsing the bracket expressions. I checked also other parts of the codebase and found additional inconsistencies. Description --- First, let me introduce the symbols (A)..(D) to later reference the implementations of the bracket expression in the codebase. There are four independent codes that implement rules for extracting the bracket expression in the current codebase: - (A) The main loop of BRACKMATCH: This handles sub-expressions of a bracket expression when a matching sub-expression is not found. - (B) The section of the `matched' label in BRACKMATCH: This handles sub-expressions of the bracket expression after a matching sub-expression is found. - (C) PATSCAN: This skips bracket expressions to determine the end of the extglob constructs of the form @(...), ?(...), +(...), etc. - (D) MATCHLEN (lib/glob/gm_loop.c): This function handles bracket expressions to count the number of characters that a fixed-length pattern can match. Actually, each of the four implements a distinct rule, which does not match any of the other three. These implementations need to be adjusted to support an identical and consistent rule. Repeat-By - The differences between (A)..(D) cause various strange behaviors. 1. Strange behavior caused by an inconsistency between (A/B) and (C) This is what I was first faced with. The following shows the result of [example4.sh] with the current devel, where column 4 `{yes,no}/{yes,no}' shows `(result)/(what I expect)': --- PATSCAN vs BRACKMATCH --- #1: pat=@([[.].])A]) str=]no/yes #2: pat=@([[.].])A]) str===]A]) no/no #3: pat=@([[.].])A]) str=AA]) yes/no #4: pat=@([[=]=])A]) str=]no/no #5: pat=@([[=]=])A]) str===]A]) no/yes #6: pat=@([[=]=])A]) str=AA]) yes/no Obvious strange behaviors can be found in cases #3 and #6, where `A' matches twice even if there is only one `A' and no repetition such as `*()' or `+()' in the pattern. This is because PATSCAN (C) considers the bracket expression to be `[[.].]' while BRACKMATCH (A/B) considers the bracket expression to be `[[.].])A]'. First, PATSCAN extracts `@([[.].])', but BRACKMATCH next matches the first `A' in the input string using a pattern character `A' outside the construct `@()'. Finally, the remaining part `A])' in the pattern is matched literally. 2. Inconsistency between (A) and (B): To fix the above item for (A/B) vs (C), I checked the detailed behaviors of both and found this. The parsing of [.ch.], [=a=], and [:var:] are not totally consistent before and after a matching sub-expression is found. The following is the second section of the result of [example4.sh]: --- BRACKMATCH: after match vs before match --- #7: pat=[[=]=]ab]str=ayes/no #8: pat=[[.[=.]ab] str=ayes/yes #9: pat=[[.[==].]ab] str=ayes/yes #10: pat=[a[=]=]b]str=ano/no #11: pat=[a[.[=.]b] str=ano/yes #12: pat=[a[.[==].]b] str=ano/yes #13: pat=[a[=]=]b]str=byes/no #14: pat=[a[=]=]b]str=a=]b]yes/yes #15: pat=[a[.[=.]b] str=byes/yes #16: pat=[a[.[=.]b] str=ab] yes/no #17: pat=[a[.[==].]b] str=byes/yes #18: pat=[a[.[==].]b] str=ab] yes/no Cases #7..#9 succeeds, which means that `[=]=]', `[.[=.]', and `[.[==].]' form an equivalence class and collating symbols in BRACKMATCH (A). However, cases #10..#12 (which are the bracket expressions of the same sub-expression with different ordering) fail, which means that `[=]=]', `[.[=.]', and `[.[==].]' do not form an equivalence class or a collating symbol in BRACKMATCH (B). Also, cases #13 vs #14, #15 vs #16, and #17 vs #18 demonstrate that the same pattern consisting of bracket expressions and normal characters can match different numbers of characters. This means that the boundary of the bracket expression can change depending on the input string. Actually, these patterns are undefined by the standard because an equivalence class shall
Re: bash "extglob" needs to upgrade at least like zsh "kshglob"
On 11/20/22 7:50 AM, Koichi Murase wrote: This difference is caused because the slash after the backslash is only checked after a matching character is found (lib/glob/sm_loop.c:703). The same check should be applied also before a matching character is found (lib/glob/sm_loop.c:573). I attach a patch for this [r0037.brackmatch6.remaining-slash.patch]. I agree with this; it was an inadvertent omission. There is another related inconsistency. I just modified my new extglob engine to follow Bash's choice described above, but then the behavior became different from that of the actual implementation of Bash of the current devel. "If a pattern ends with an unescaped , it is unspecified whether the pattern does not match anything or the pattern is treated as invalid." [...] b. If this is the behavior for the unescaped backslashes outside the bracket expressions, which is intensionally different from those in the bracket expressions, would it be possible to change the treatment of the unescaped backslashes inside the bracket expression the same as that of outside so the bracket `[' matches literally (as expected in cases #28..#31 of my previous reply [1])? The attached [r0037.brackmatch7.unescaped-backslash-option-b.patch] is the corresponding patch. I have changed my mind, and I agree with this. I think the latest draft of the POSIX standard requires it unambiguously: "A that does not introduce a valid bracket expression shall match the character itself." and while it does not say so explicitly (the description defers to the regular expression description, which describes valid bracket expressions as having a closing bracket), I think it's reasonable to conclude that an incomplete bracket expression without a `]' is invalid, this text from the regular expression description aside: "An expression containing a '[' that is unescaped and is not part of a bracket expression produces undefined results." since the text in "Patterns Matching a Single Character" supersedes it. I attached the latest patch against bash-5.2.9. Chet -- ``The lyf so short, the craft so long to lerne.'' - Chaucer ``Ars longa, vita brevis'' - Hippocrates Chet Ramey, UTech, CWRUc...@case.eduhttp://tiswww.cwru.edu/~chet/ *** ../bash-5.2-patched/lib/glob/sm_loop.c 2021-08-03 10:24:49.0 -0400 --- lib/glob/sm_loop.c 2022-11-22 14:51:12.0 -0500 *** *** 344,347 --- 344,352 return (FNM_NOMATCH); + /* If we are matching pathnames, we can't match a slash with a + bracket expression. */ + if (sc == L('/') && (flags & FNM_PATHNAME)) + return (FNM_NOMATCH); + /* `?' cannot match `.' or `..' if it is the first character of the string or if it is the first character following a slash and *** *** 404,407 --- 409,414 } + #define SLASH_PATHNAME(c) (c == L('/') && (flags & FNM_PATHNAME)) + /* Use prototype definition here because of type promotion. */ static CHAR * *** *** 452,455 --- 459,468 pc = FOLD (p[1]); p += 4; + + /* Finding a slash in a bracket expression means you have to +match the bracket as an ordinary character (see below). */ + if (pc == L('/') && (flags & FNM_PATHNAME)) + return ((test == L('[')) ? savep : (CHAR *)0); /*]*/ + if (COLLEQUIV (test, pc)) { *** *** 464,467 --- 477,484 if (c == L('\0')) return ((test == L('[')) ? savep : (CHAR *)0); /*]*/ + else if (c == L('/') && (flags & FNM_PATHNAME)) + return ((test == L('[')) ? savep : (CHAR *)0); /*]*/ + else if (c == L(']')) + break; c = FOLD (c); continue; *** *** 476,484 pc = 0; /* make sure invalid char classes don't match. */ /* Find end of character class name */ ! for (close = p + 1; *close != '\0'; close++) if (*close == L(':') && *(close+1) == L(']')) break; ! if (*close != L('\0')) { ccname = (CHAR *)malloc ((close - p) * sizeof (CHAR)); --- 493,501 pc = 0; /* make sure invalid char classes don't match. */ /* Find end of character class name */ ! for (close = p + 1; *close != '\0' && SLASH_PATHNAME(*close) == 0; close++) if (*close == L(':') && *(close+1) == L(']')) break; ! if (*close != L('\0') && SLASH_PATHNAME(*close) == 0) { ccname = (CHAR *)malloc ((close - p) * sizeof (CHAR)); *** *** 527,530 --- 544,549 if (c == L('\0')) return ((test == L('[')) ? savep : (CHAR *)0); + else if (c == L('/') && (flags &
Re: bash "extglob" needs to upgrade at least like zsh "kshglob"
2022年11月18日(金) 2:11 Chet Ramey : > "If a pattern ends with an unescaped , it is unspecified whether > the pattern does not match anything or the pattern is treated as invalid." > > Bash uses the former interpretation. If "the pattern is treated as invalid" > means trying to literally match the open bracket and going on from there, > your interpretation is valid as well. The standard doesn't use that > language in other places it specifies to treat the bracket as an ordinary > character to be matched literally, however. There seem to be still remaining issues. It is fine for me if Bash chooses the former, ``the pattern does not match anything'' with a backslash followed by NUL, but the following cases (see the attached [reduced3.sh]) with a backslash followed by a slash should still be fixed: #1: pat=a[b\/c] str=a[b/c] no/yes #2: pat=a[b\/c] str=ab no/no #3: pat=a[b\/c] str=ac yes/no [...] Where the fourth column shows the result of the current devel 407d9afc with FNM_PATHNAME (xxx) and the result I expect (yyy). "yes" means the pattern matches the string, and "no" means the pattern does not match. * I expect "yes" for #1 because the bracket expression contains a slash before its closing right bracket `]' and thus the beginning `[' should be matched literally. However, the actual behavior is "no". * I expect "no" for both #2 and #3 because the beginning bracket `[' should be matched literally. Even when an escaped slash would be allowed in the bracket expression so that [b\/c] forms a complete bracket expression, the results of #2 and #3 being "no" and "yes", respectively, are inconsistent. This difference is caused because the slash after the backslash is only checked after a matching character is found (lib/glob/sm_loop.c:703). The same check should be applied also before a matching character is found (lib/glob/sm_loop.c:573). I attach a patch for this [r0037.brackmatch6.remaining-slash.patch]. -- There is another related inconsistency. I just modified my new extglob engine to follow Bash's choice described above, but then the behavior became different from that of the actual implementation of Bash of the current devel. > "If a pattern ends with an unescaped , it is unspecified whether > the pattern does not match anything or the pattern is treated as invalid." > > Bash uses the former interpretation. The corresponding sentence in the POSIX standard describes the unescaped backslashes in the general context of the pattern instead of that in the bracket expression, so I applied this to the new extglob engine. However, ``the former interpretation'' that Bash adopts turned out to be only applied to the unescaped backslashes *inside a bracket expression*. This is the remaining part of the output of the attached [example3.sh] with the current devel 407d9afc: [...] #4: pat=a\ str=a\ yes/??? So the pattern terminated with unescaped backslash actually matches a string, where the backslash is treated as a literally-matching backslash. a. Is this difference between outside and inside of the bracket expressions intensional? I.e., the former interpretation "the pattern does not match anything" seems to only apply to the inside of bracket expressions. b. If this is the behavior for the unescaped backslashes outside the bracket expressions, which is intensionally different from those in the bracket expressions, would it be possible to change the treatment of the unescaped backslashes inside the bracket expression the same as that of outside so the bracket `[' matches literally (as expected in cases #28..#31 of my previous reply [1])? The attached [r0037.brackmatch7.unescaped-backslash-option-b.patch] is the corresponding patch. [1] https://lists.gnu.org/archive/html/bug-bash/2022-11/msg00070.html c. If the behavior of the unescaped backslash of the outside should also be modified to follow the former interpretation "the pattern does not match anything", another patch is [r0037.brackmatch7.unescaped-backslash-option-c.patch]. However, the current behavior outside the bracket expression seems to be explicitly required by the tests on tests/glob2.sub:32 and tests/glob2.sub:41. I prefer option b, which keeps the behavior required by tests/glob2.sub and also consistent between the inside and the outside of bracket expressions. It is also consistent with the behavior for the string end inside bracket expressions. -- Koichi From 828b93de72263785d93f86a285d919fdc5be156d Mon Sep 17 00:00:00 2001 From: Koichi Murase Date: Sun, 20 Nov 2022 16:09:09 +0900 Subject: [PATCH 1/2] fix(BRACKMATCH): fix remaining slash check --- lib/glob/sm_loop.c | 2 ++ 1 file changed, 2 insertions(+) diff --git a/lib/glob/sm_loop.c b/lib/glob/sm_loop.c index fa350daa..151d10bd 100644 ---
Re: bash "extglob" needs to upgrade at least like zsh "kshglob"
2022年11月18日(金) 2:11 Chet Ramey : > "If a pattern ends with an unescaped , it is unspecified whether > the pattern does not match anything or the pattern is treated as invalid." > > Bash uses the former interpretation. If "the pattern is treated as invalid" > means trying to literally match the open bracket and going on from there, > your interpretation is valid as well. The standard doesn't use that > language in other places it specifies to treat the bracket as an ordinary > character to be matched literally, however. Thank you for the explanation. I haven't been considering the consistency with the treatment of the slashes outside the bracket expression. I think I need to adjust the treatment of the slash in my new engine by introducing e.g. a special flag for . Thank you. -- Koichi
Re: bash "extglob" needs to upgrade at least like zsh "kshglob"
2022年11月18日(金) 6:29 Chet Ramey : > The same is true for ranges. If either the first character or the second > character in the range is a slash it should cause the bracket to be matched > literally. > > Incidentally, I made an additional change so that an incomplete range > expression causes the bracket to be matched literally. > > Thanks for the discussion. I attached an updated version of the patch I > sent yesterday. I wasn't quite as aggressive as you with macros. Thank you for the new version of the patch. I have tried it. Thank you for considering the change to the incomplete range in the bracket expression. -- Koichi
Re: bash "extglob" needs to upgrade at least like zsh "kshglob"
On 11/17/22 6:05 AM, Koichi Murase wrote: Thank you for the patch. I applied it locally and tried it. I attach a test script that I used: [bracket-slash.sh]. Now I switched to a loadable builtin that directly calls strmatch instead of hacking GLOBIGNORE. Here are the results: ---Tests for a slash in bracket expressions--- #1: pat=ab[/]ef str=ab[/]ef yes/yes #2: pat=ab[/]ef str=ab/efno/no #3: pat=ab[c/d]efstr=ab[c/d]efyes/yes #4: pat=ab[c/d]efstr=abcefyes/no #5: pat=ab[.-/]efstr=ab[.-/]efno/yes #6: pat=ab[.-/]efstr=ab.efyes/no #7: pat=ab[[=/=]]ef str=ab[[=/=]]ef yes/yes #8: pat=ab[[=/=]]ef str=ab/efno/no #9: pat=ab[[=c=]/]ef str=ab[=/]ef yes/yes #10: pat=ab[[=c=]/]ef str=abcefyes/no #11: pat=ab[[:alpha:]/]ef str=ab[:/]ef yes/yes #12: pat=ab[[:alpha:]/]ef str=abxefyes/no #13: pat=ab[/[abc]]ef str=ab[/c]ef yes/yes #14: pat=ab[/[abc]]ef str=abc]ef no/no #15: pat=ab[c[=/=]]ef str=ab[c[=/=]]ef yes/yes #16: pat=ab[c[=/=]]ef str=abc[=/=]ef no/no #17: pat=ab[c[=/=]]ef str=abcefyes/no ---Tests for incomplete bracket expressions--- #18: pat=ab[c str=ab[c yes/yes #19: pat=ab[c str=abc no/no #20: pat=ab[c[=d= str=ab[c[=d= yes/yes #21: pat=ab[c[=d= str=abc no/no #22: pat=ab[c[.d str=ab[c[.d yes/yes #23: pat=ab[c[.d str=abc no/no #24: pat=ab[c[:alpha: str=ab[c[:alpha: yes/yes #25: pat=ab[c[:alpha: str=abc no/no #26: pat=ab[c-str=ab[c-no/yes #27: pat=ab[c-str=abc no/no #28: pat=ab[c\str=ab[c\no/yes #29: pat=ab[c\str=abc no/no #30: pat=ab[[\str=ab[[\no/yes #31: pat=ab[[\str=ab[ no/no "yes" and "no" on the left of / in the fourth column are the results after applying the patch you provided. "yes" and "no" on the right of / are the results that *I* expect (see below paragraphs). The new treatment seems to only handle a slash that directly appears as an independent character in bracket expressions (cases #1,#2,#13,#14), but if I literally read the standard you quoted, I feel we should also handle other slashes (#3..#6,#9..#12,#15..#17). Yes, I was undecided at the time whether character classes and equivalence classes should have the same treatment or fall back to invalid classes. I think your interpretation there is the right one. The same is true for ranges. If either the first character or the second character in the range is a slash it should cause the bracket to be matched literally. Incidentally, I made an additional change so that an incomplete range expression causes the bracket to be matched literally. Thanks for the discussion. I attached an updated version of the patch I sent yesterday. I wasn't quite as aggressive as you with macros. Chet -- ``The lyf so short, the craft so long to lerne.'' - Chaucer ``Ars longa, vita brevis'' - Hippocrates Chet Ramey, UTech, CWRUc...@case.eduhttp://tiswww.cwru.edu/~chet/ *** ../bash-5.2-patched/lib/glob/sm_loop.c 2021-08-03 10:24:49.0 -0400 --- lib/glob/sm_loop.c 2022-11-17 15:42:42.0 -0500 *** *** 344,347 --- 344,352 return (FNM_NOMATCH); + /* If we are matching pathnames, we can't match a slash with a + bracket expression. */ + if (sc == L('/') && (flags & FNM_PATHNAME)) + return (FNM_NOMATCH); + /* `?' cannot match `.' or `..' if it is the first character of the string or if it is the first character following a slash and *** *** 404,407 --- 409,414 } + #define SLASH_PATHNAME(c) (c == L('/') && (flags & FNM_PATHNAME)) + /* Use prototype definition here because of type promotion. */ static CHAR * *** *** 452,455 --- 459,468 pc = FOLD (p[1]); p += 4; + + /* Finding a slash in a bracket expression means you have to +match the bracket as an ordinary character (see below). */ + if (pc == L('/') && (flags & FNM_PATHNAME)) + return ((test == L('[')) ? savep : (CHAR *)0); /*]*/ + if (COLLEQUIV (test, pc)) { *** *** 464,467 --- 477,484 if (c == L('\0')) return ((test == L('[')) ? savep : (CHAR *)0); /*]*/ + else if (c == L('/') && (flags & FNM_PATHNAME)) + return ((test == L('[')) ? savep : (CHAR *)0); /*]*/ + else if (c == L(']')) + break;
Re: bash "extglob" needs to upgrade at least like zsh "kshglob"
On 11/17/22 6:05 AM, Koichi Murase wrote: Also, I have a suggestion of changes for more consistent handling of incomplete bracket expressions. Currently, incomplete bracket expressions sometimes fail unconditionally (#26..#31) and sometimes fall back to a literal `[' plus remaining pattern (#18..#25). I attach another patch for this: [r0037.brackmatch5.incomplete-bracket.patch.txt]. "If a pattern ends with an unescaped , it is unspecified whether the pattern does not match anything or the pattern is treated as invalid." Bash uses the former interpretation. If "the pattern is treated as invalid" means trying to literally match the open bracket and going on from there, your interpretation is valid as well. The standard doesn't use that language in other places it specifies to treat the bracket as an ordinary character to be matched literally, however. -- ``The lyf so short, the craft so long to lerne.'' - Chaucer ``Ars longa, vita brevis'' - Hippocrates Chet Ramey, UTech, CWRUc...@case.eduhttp://tiswww.cwru.edu/~chet/
Re: bash "extglob" needs to upgrade at least like zsh "kshglob"
2022年11月17日(木) 6:47 Chet Ramey : > The cleverness is due to Russ Cox, a really smart guy who figured this > stuff out first: > > https://research.swtch.com/glob > > (https://swtch.com/~rsc/regexp/ is a collection of his writing on regular > expressions. It's well worth reading.) Thank you for the references to interesting readings! Ah, so he is the developer of re2. I think this is the first time for me to read articles written by the developer of re2. I have looked inside the codes there. The idea of my new implementation [1] is closer to the implementation "bounded-memory DFA" [2] linked from the regexp page [3] (though I allocate a sufficient memory block calculated by the length of the pattern string instead of using a fixed size of static storage). These do not require the full ``compilation'' of the DFA so it requires a minimal preprocessing time, but instead, these are not as fast as compiled DFAs (when we focus on the proportional coefficients of the linear time complexity). The new implementation is still work-in-progress, so I later write the reasoning for the choice of the implementation strategy when I submit the version ready for review. [1] https://gitlab.com/akinomyoga/bash/-/merge_requests/1; If some of you want to try it, you can clone it by $ git clone https://gitlab.com/akinomyoga/bash.git -b extglob but be careful that this is now a moving target occasionally squashed and force-pushed. [2] https://swtch.com/~rsc/regexp/dfa1.c.txt [3] https://swtch.com/~rsc/regexp/ I attach some benchmarks of the current state [dfaglob.pdf], where "strmatch_ex" shown by solid lines is the new implementation. Now everything is linear with respect to the input string length, but I would like to add some optimizations for simple patterns after settling bracket-expression matters. Also, I think I'd later add the cases used in https://research.swtch.com/glob to the benchmarks. -- Koichi dfaglob.pdf Description: Adobe PDF document
Re: bash "extglob" needs to upgrade at least like zsh "kshglob"
2022年11月17日(木) 6:47 Chet Ramey : > fnmatch with FNM_PATHNAME only has to avoid matching the slash with a > bracket expression. The shell has an additional constraint: a slash that > appears in a bracket expression renders the bracket expression void and > requires the `[' to be matched explicitly. That's why there have to be > tests for slash in BRACKMATCH. There are two bugs: the off-by-one error > you note and matching the open bracket explicitly. Thank you for the explanation. I did not know the special rule for the slash in the bracket expression. I assumed that that part of the code in BRACKMATCH is related to the behavior of the bracket expression never matching a slash so just removed it, which was wrong. > I attached the patch I applied. I didn't include your fix to issue 1 above. Thank you for the patch. I applied it locally and tried it. I attach a test script that I used: [bracket-slash.sh]. Now I switched to a loadable builtin that directly calls strmatch instead of hacking GLOBIGNORE. Here are the results: ---Tests for a slash in bracket expressions--- #1: pat=ab[/]ef str=ab[/]ef yes/yes #2: pat=ab[/]ef str=ab/efno/no #3: pat=ab[c/d]efstr=ab[c/d]efyes/yes #4: pat=ab[c/d]efstr=abcefyes/no #5: pat=ab[.-/]efstr=ab[.-/]efno/yes #6: pat=ab[.-/]efstr=ab.efyes/no #7: pat=ab[[=/=]]ef str=ab[[=/=]]ef yes/yes #8: pat=ab[[=/=]]ef str=ab/efno/no #9: pat=ab[[=c=]/]ef str=ab[=/]ef yes/yes #10: pat=ab[[=c=]/]ef str=abcefyes/no #11: pat=ab[[:alpha:]/]ef str=ab[:/]ef yes/yes #12: pat=ab[[:alpha:]/]ef str=abxefyes/no #13: pat=ab[/[abc]]ef str=ab[/c]ef yes/yes #14: pat=ab[/[abc]]ef str=abc]ef no/no #15: pat=ab[c[=/=]]ef str=ab[c[=/=]]ef yes/yes #16: pat=ab[c[=/=]]ef str=abc[=/=]ef no/no #17: pat=ab[c[=/=]]ef str=abcefyes/no ---Tests for incomplete bracket expressions--- #18: pat=ab[c str=ab[c yes/yes #19: pat=ab[c str=abc no/no #20: pat=ab[c[=d= str=ab[c[=d= yes/yes #21: pat=ab[c[=d= str=abc no/no #22: pat=ab[c[.d str=ab[c[.d yes/yes #23: pat=ab[c[.d str=abc no/no #24: pat=ab[c[:alpha: str=ab[c[:alpha: yes/yes #25: pat=ab[c[:alpha: str=abc no/no #26: pat=ab[c-str=ab[c-no/yes #27: pat=ab[c-str=abc no/no #28: pat=ab[c\str=ab[c\no/yes #29: pat=ab[c\str=abc no/no #30: pat=ab[[\str=ab[[\no/yes #31: pat=ab[[\str=ab[ no/no "yes" and "no" on the left of / in the fourth column are the results after applying the patch you provided. "yes" and "no" on the right of / are the results that *I* expect (see below paragraphs). The new treatment seems to only handle a slash that directly appears as an independent character in bracket expressions (cases #1,#2,#13,#14), but if I literally read the standard you quoted, I feel we should also handle other slashes (#3..#6,#9..#12,#15..#17). Cases #7 and #8 seem to be already processed in the current devel. I think we can understand the rule of the standard by a mental processing model to first split the pattern with `/' and then to match each segment with the corresponding segment of the target string. Thinking in that way, we can actually just replace the NUL test for the string end with a similar test for NUL or a slash. I attach a possible patch for the additional fix [r0037.brackmatch4.slash-terminated.patch.txt], which applies after your patch [bracket-slash.patch]. Also, I have a suggestion of changes for more consistent handling of incomplete bracket expressions. Currently, incomplete bracket expressions sometimes fail unconditionally (#26..#31) and sometimes fall back to a literal `[' plus remaining pattern (#18..#25). I attach another patch for this: [r0037.brackmatch5.incomplete-bracket.patch.txt]. -- Koichi From cbee6f5a782e9ce1b431704f33a3e03de85c4460 Mon Sep 17 00:00:00 2001 From: Koichi Murase Date: Thu, 17 Nov 2022 18:44:19 +0900 Subject: [PATCH 1/3] fix(BRACKMATCH): terminate bracket expression by any slash --- lib/glob/sm_loop.c | 45 - 1 file changed, 24 insertions(+), 21 deletions(-) diff --git a/lib/glob/sm_loop.c b/lib/glob/sm_loop.c index a448c155..f4d8d1f2 100644 --- a/lib/glob/sm_loop.c +++ b/lib/glob/sm_loop.c @@ -439,6 +439,16 @@ BRACKMATCH (p, test, flags) if (not = (*p == L('!') || *p == L('^'))) ++p; + /* POSIX.2 2.13.3 says: `If a character is found following an + unescaped character before a corresponding + is found, the open bracket shall be treated as an +
Re: bash "extglob" needs to upgrade at least like zsh "kshglob"
On 11/14/22 6:06 AM, Koichi Murase wrote: However, I noticed two strange behaviors of the current engine. Before adjusting the behavior of the new engine and submitting it for review, I would like to confirm the (expected) behavior of the current engine in the current devel. These two behaviors finally turned out to be both related to the matching of bracket expression by the function `BRACKMATCH' (lib/glob/sm_loop.c). -- 1. pattern [[=B=]][c] matches with c $ bash-devel --norc $ [[ Bc == [[=B=]][c] ]]; echo $? 0 <-- OK. This is expected. $ [[ c == [[=B=]][c] ]]; echo $? 0 <-- This is unexpected. This is clearly a problem, and your fix is a good one. -- 2. bracket expression sometimes match or unmatch the slash with FNM_PATHNAME. FNM_PATHNAME is only used in two places in the Bash codebase. 1) One is for the glob matching for the filenames in the directory (lib/glob/glob.c). However, this actually does not seem to have an actual effect because FNM_PATHNAME only causes differences in the matching when the target string contains a slash but the filenames do not contain any slashes. 2) The other is the filtering of the pathname-expansion results with GLOBIGNORE (pathexp.c). So the only way to test the behavior of Bash's strmatch with FNM_PATHNAME (without writing a C program to directly use the function `strmatch') is to use GLOBIGNORE. This is true, but the two uses above are not directly analogous to fnmatch(). The bash implementation, due to its origin as an fnmatch() implementation, uses the same flag names but uses FNM_PATHNAME to signal the matching engine that it's matching filenames for pathname expansion. That triggers the difference. The first part of your patch is correct: a bracket expression can never match a slash in the string. This is common to both fnmatch and what POSIX calls "Patterns Used For Filename Expansion." https://pubs.opengroup.org/onlinepubs/9699919799/utilities/V3_chap02.html#tag_18_13_03 fnmatch with FNM_PATHNAME only has to avoid matching the slash with a bracket expression. The shell has an additional constraint: a slash that appears in a bracket expression renders the bracket expression void and requires the `[' to be matched explicitly. That's why there have to be tests for slash in BRACKMATCH. There are two bugs: the off-by-one error you note and matching the open bracket explicitly. * In fnmatch(3), a bracket expression never matches a / with FNM_PATHNAME. This is one part of the fix. * In Bash's strmatch, a bracket expression sometimes matches `/' and sometimes does not. In the codebase, `/' is excluded only when it explicitly appears after another character in the bracket expression (lib/glob/sm_loop.c:574) even though the comment mentions [/]. This is the reason that only [a/] fails with Bash's implementation in cases #2..#6 in the above result. That's the consequence of the test being in the wrong place in BRACKMATCH. * What is happening with Bash's implementation in cases #7..#12 is related the assumption of the backtracking trick for `*': The trick for `*' backtracking explained in the code comment lib/glob/sm_loop.c:320---"This is a clever idea from glibc"---relies on the behavior that the bracket expression never matches a slash that `*' cannot match. These two changes should satisfy that. The cleverness is due to Russ Cox, a really smart guy who figured this stuff out first: https://research.swtch.com/glob (https://swtch.com/~rsc/regexp/ is a collection of his writing on regular expressions. It's well worth reading.) I attached the patch I applied. I didn't include your fix to issue 1 above. Chet -- ``The lyf so short, the craft so long to lerne.'' - Chaucer ``Ars longa, vita brevis'' - Hippocrates Chet Ramey, UTech, CWRUc...@case.eduhttp://tiswww.cwru.edu/~chet/ *** ../bash-5.2-patched/lib/glob/sm_loop.c 2021-08-03 10:24:49.0 -0400 --- lib/glob/sm_loop.c 2022-11-16 16:22:42.0 -0500 *** *** 344,347 --- 344,352 return (FNM_NOMATCH); + /* If we are matching pathnames, we can't match a slash with a + bracket expression. */ + if (sc == L('/') && (flags & FNM_PATHNAME)) + return (FNM_NOMATCH); + /* `?' cannot match `.' or `..' if it is the first character of the string or if it is the first character following a slash and *** *** 452,455 --- 457,466 pc = FOLD (p[1]); p += 4; + + /* Finding a slash in a bracket expression means you have to +match the bracket as an ordinary character (see below). */ + if (c == L('/') && (flags & FNM_PATHNAME)) +
Re: bash "extglob" needs to upgrade at least like zsh "kshglob"
2022年11月1日(火) 0:59 Chet Ramey : > If someone wanted to do this, I would take a look at incorporating the > results, as long as it didn't add dependencies on, say, pcre or gnulib > regex. Instead of translating the pattern to a regular expression and compiling it by regcomp (), I have experimentally implemented a new extglob engine based on a naive DFA and was comparing the behavior and the performance with the current implementations of devel. [Note: In this report hereafter, ``the current implementation/engine'' always means the implementation of strmatch (lib/glob/{strmatch,smatch,sm_loop.c}) in the current devel 31f4d468.] However, I noticed two strange behaviors of the current engine. Before adjusting the behavior of the new engine and submitting it for review, I would like to confirm the (expected) behavior of the current engine in the current devel. These two behaviors finally turned out to be both related to the matching of bracket expression by the function `BRACKMATCH' (lib/glob/sm_loop.c). -- 1. pattern [[=B=]][c] matches with c $ bash-devel --norc $ [[ Bc == [[=B=]][c] ]]; echo $? 0 <-- OK. This is expected. $ [[ c == [[=B=]][c] ]]; echo $? 0 <-- This is unexpected. See the attached [r0037.brackmatch1.equivalence-class.patch] for the fix. The problem is caused because [[=B=]][c] is treated as a single bracket expression [ « [=B=]][c » ] when the equivalence class [=B=] does not match. This is because `]' after `[[=B=]' is treated as if it is the first `]' in the bracket expression (such as `]' after `[' in the pattern « []aaa] »). In the patch, when the next character after a failing equivalence class [=B=] is `]', the bracket expression has been changed to fail just the same as the case for a failing character class [:alpha:] (lib/glob/sm_loop.c:530; line number is that in the current devel 31f4d468). -- 2. bracket expression sometimes match or unmatch the slash with FNM_PATHNAME. FNM_PATHNAME is only used in two places in the Bash codebase. 1) One is for the glob matching for the filenames in the directory (lib/glob/glob.c). However, this actually does not seem to have an actual effect because FNM_PATHNAME only causes differences in the matching when the target string contains a slash but the filenames do not contain any slashes. 2) The other is the filtering of the pathname-expansion results with GLOBIGNORE (pathexp.c). So the only way to test the behavior of Bash's strmatch with FNM_PATHNAME (without writing a C program to directly use the function `strmatch') is to use GLOBIGNORE. To demonstrate the behavior of the current implementation, I attach a script [fnmpath.sh], which utilizes GLOBIGNORE for the Bash implementation. It compares the result with fnmatch(3). The result in my environment (x86_64-redhat-linux-gnu [Fedora Linux 36 (Server Edition)]) is this: $ bash-devel fnmpath.sh #1: pat=ab/cd/efgyes/yes #2: pat=ab[/]cd/efg yes/no #3: pat=ab[/a]cd/efg yes/no #4: pat=ab[a/]cd/efg no/no #5: pat=ab[!a]cd/efg yes/no #6: pat=ab[.-0]cd/efgyes/no #7: pat=*/*/efg yes/yes #8: pat=*[/]*/efgno/no #9: pat=*[/a]*/efg no/no #10: pat=*[a/]*/efg no/no #11: pat=*[!a]*/efg no/no #12: pat=*[.-0]*/efg no/no #13: pat=ab@(/)cd/efg yes/yes #14: pat=*@(/)cd/efg no/no #15: pat=*/cd/efg yes/yes This tests whether each pattern matches the string "ab/cd/efg". Two results by Bash's strmatch and fnmatch(3) are connected with `/'. "yes" means the pattern matches the string "ab/cd/efg" and "no" means it does not match. Some observations are * In fnmatch(3), a bracket expression never matches a / with FNM_PATHNAME. * In Bash's strmatch, a bracket expression sometimes matches `/' and sometimes does not. In the codebase, `/' is excluded only when it explicitly appears after another character in the bracket expression (lib/glob/sm_loop.c:574) even though the comment mentions [/]. This is the reason that only [a/] fails with Bash's implementation in cases #2..#6 in the above result. * What is happening with Bash's implementation in cases #7..#12 is related the assumption of the backtracking trick for `*': The trick for `*' backtracking explained in the code comment lib/glob/sm_loop.c:320---"This is a clever idea from glibc"---relies on the behavior that the bracket expression never matches a slash that `*' cannot match. [Note: The exact requirements for this trick is actually slightly weaker: each bracket expression needs to match a fixed number of `/', 0 or 1, when FNM_PATHNAME is specified; it should never match a slash if it can match other characters, and it should never match other
Re: bash "extglob" needs to upgrade at least like zsh "kshglob"
On 10/29/22 1:50 AM, Martin D Kealey wrote: PS: While we're about it, can we please fix the expansion of « a/*/b/c/d/* » so it only calls readdir on a/ and each dir matching a/*/b/c/d/ and NOT call readdir on dirs matching a/*/, a/*/b/, or a/*/b/c/. It already does this. Given $ find . -print . ./a ./a/z ./a/z/b ./a/z/b/c ./a/x ./a/x/b ./a/x/b/c ./a/x/b/c/d ./a/x/b/c/d/1 ./a/x/b/c/d/4 ./a/x/b/c/d/3 ./a/x/b/c/d/2 ./a/x/b/c/d/5 ./a/y ./a/y/b Running `echo a/*/b/c/d/*' calls opendir(2) twice: $ ./bash ./x1 opendir: a/ opendir: a/x/b/c/d a/x/b/c/d/1 a/x/b/c/d/2 a/x/b/c/d/3 a/x/b/c/d/4 a/x/b/c/d/5 -- ``The lyf so short, the craft so long to lerne.'' - Chaucer ``Ars longa, vita brevis'' - Hippocrates Chet Ramey, UTech, CWRUc...@case.eduhttp://tiswww.cwru.edu/~chet/
Re: bash "extglob" needs to upgrade at least like zsh "kshglob"
On 10/30/22 9:59 PM, Martin D Kealey wrote: With the exception of the !(LIST) negation, there's a direct correspondence between extglob and any other regex format. Translating between them is trivial. As I said before, I would gladly look at incorporating such a change, as long as it didn't increase the number of dependencies. -- ``The lyf so short, the craft so long to lerne.'' - Chaucer ``Ars longa, vita brevis'' - Hippocrates Chet Ramey, UTech, CWRUc...@case.eduhttp://tiswww.cwru.edu/~chet/
Re: bash "extglob" needs to upgrade at least like zsh "kshglob"
On 10/30/22 6:32 AM, Oğuz wrote: Or (g) make the existing extglob code faster and avoid introducing more complexity into the shell. The extglob code has reasonable performance when it's asked to do the most common thing: check whether a given null-terminated string matches a given (possibly extended) pattern, anchored at start and end. The problem arises when you want to support the leftmost-longest substring match semantics required by pattern substitution. There's no straightforward way to simply extend the model -- you have to implement something new that supports those semantics to replace the simple way it's done now. If you want to stick with bash when performance for your use case becomes unacceptable and eschew the (easy, simple, fast) sed workarounds, you're going to be frustrated until something changes. -- ``The lyf so short, the craft so long to lerne.'' - Chaucer ``Ars longa, vita brevis'' - Hippocrates Chet Ramey, UTech, CWRUc...@case.eduhttp://tiswww.cwru.edu/~chet/
Re: bash "extglob" needs to upgrade at least like zsh "kshglob"
On 10/29/22 1:50 AM, Martin D Kealey wrote: This seems like a good reason to simply translate extglobs into regexes, which should run in linear time, rather than put effort into building and debugging a parallel implementation. If someone wanted to do this, I would take a look at incorporating the results, as long as it didn't add dependencies on, say, pcre or gnulib regex. Chet -- ``The lyf so short, the craft so long to lerne.'' - Chaucer ``Ars longa, vita brevis'' - Hippocrates Chet Ramey, UTech, CWRUc...@case.eduhttp://tiswww.cwru.edu/~chet/
Re: bash "extglob" needs to upgrade at least like zsh "kshglob"
On 10/28/22 2:44 AM, Hyunho Cho wrote: Bash Version: 5.2 Patch Level: 2 Release Status: release ## bash "extglob" almost unusable except with very tiny string. there is no such problems in zsh "kshglob". Nothing has changed since you brought this up last month. The same two-line sed workaround works on the 90K string from `gcc --help'. https://lists.gnu.org/archive/html/bug-bash/2022-09/msg00029.html -- ``The lyf so short, the craft so long to lerne.'' - Chaucer ``Ars longa, vita brevis'' - Hippocrates Chet Ramey, UTech, CWRUc...@case.eduhttp://tiswww.cwru.edu/~chet/
Re: bash "extglob" needs to upgrade at least like zsh "kshglob"
On 10/31/22 2:20 PM, Greg Wooledge wrote: Bash already uses the POSIX regex functions (regcomp(3) et al.) to do [[ =~ ]] using POSIX ERE. Yeah, and offers no more than what your libc's regex engine has. For example, you can't tell what `[[ x =~ .{256} ]]' (or even `[[ x =~ "" ]]') would return without knowing the operating system it's run on. extglobs aren't like that and shouldn't be either.
Re: bash "extglob" needs to upgrade at least like zsh "kshglob"
On Mon, Oct 31, 2022 at 05:00:40AM +0200, Oğuz wrote: > 31 Ekim 2022 Pazartesi tarihinde Martin D Kealey > yazdı: > > If we use the standard POSIX BRE or ERE, then there's no additional code to > > ship; it's included as part of the OS. The hard part is what to do with > > (!LIST), which was the point of my previous post. > > That'd be clunkier than what we already have. Bash targets many platforms > and it'd have to target as many regex engines if it were to translate > extglobs to posix regexes. You can't expect all of them to be compatible > with each other, and they are not. So, if we wish to translate extglobs to > regexes and have them work regardless of the platform, the easiest way > forward is to adopt a third party regex engine; about which I said enough > in my previous email. Bash already uses the POSIX regex functions (regcomp(3) et al.) to do [[ =~ ]] using POSIX ERE. If it weren't for !(list) it would be possible, even easy, for bash to convert extglobs into POSIX EREs, syntactically, and then call the same libraries it's already calling. I'm not sure where your claim of incompatibility comes from. Regardling !(list), I know that many people use it. It's fairly popular, at least among the sort of people who come to IRC asking how they can match all files except foo and bar. So, simply dropping it would not be a viable option.
Re: bash "extglob" needs to upgrade at least like zsh "kshglob"
31 Ekim 2022 Pazartesi tarihinde Martin D Kealey yazdı: > > With the exception of the !(LIST) negation, there's a direct > correspondence between extglob and any other regex format. Translating > between them is trivial. > If we use the standard POSIX BRE or ERE, then there's no additional code to > ship; it's included as part of the OS. The hard part is what to do with > (!LIST), which was the point of my previous post. > That'd be clunkier than what we already have. Bash targets many platforms and it'd have to target as many regex engines if it were to translate extglobs to posix regexes. You can't expect all of them to be compatible with each other, and they are not. So, if we wish to translate extglobs to regexes and have them work regardless of the platform, the easiest way forward is to adopt a third party regex engine; about which I said enough in my previous email. The problem is that it DOESN'T work fine. In practice people encounter > abysmally slow extglob matching. > *when matching against a huge string. Which is rare in my experience, but of course should be taken into consideration if there are multiple bug reports; I didn't say anything against that. -- Oğuz
Re: bash "extglob" needs to upgrade at least like zsh "kshglob"
I'm top-quoting this because the entire response below seems to be predicated on a misconception, or perhaps several misconceptions. Exactly NONE of my suggestions involves expanding the Shell language. Users would continue to write extglob exactly as they do now, and they would remain blissfully ignorant of the regex engine underneath. The "translation" is entirely hidden from them. Extglob is amenable to compilation into a FSA, which leads to the conclusion that it's functionally a form of regular expression, even if its syntax is rather different from regexes that people are familiar with. This is why I said that writing our own regular/linear extglob implementation would be equivalent to writing a regex compiler and engine. It's also why I think the shortest route to having a working implementation would be to write a translation layer from extglob to one of the existing RE formats. (The only option I suggested that would modify the shell language would be the option to remove the !(LIST) negation; I include that for completeness, I wouldn't actually recommend it.) The idea that "these are only technical details that nobody cares about" is just wrong. There have been bugs filed because the performance is not just "a bit slow", but "slow by many orders of magnitude". If discussion of implementation details is out of scope for the bash-bug mailing list, is there a bash-dev list, or similar, where they should be discussed? With the exception of the !(LIST) negation, there's a direct correspondence between extglob and any other regex format. Translating between them is trivial. If we use the standard POSIX BRE or ERE, then there's no additional code to ship; it's included as part of the OS. The hard part is what to do with (!LIST), which was the point of my previous post. -Martin On Sun, 30 Oct 2022 at 23:35, Oğuz İsmail Uysal wrote: > On 10/30/22 3:25 PM, Martin D Kealey wrote: > > How much faster do you think it can be made? > I don't know, irrelevant though. > > The problem is not that individual steps are slow, but rather [...] These are technical details; no user cares about them. > > The purpose of my suggestions was to /minimize/ the complexity that > becomes part of Bash's codebase [...] > I meant complexity of the language, not the codebase. > > In my opinion "make the existing extglob code faster" is a wasted effort > if it doesn't get us to "run in at-most quadratic time" and preferably "run > in regular (linear) time", and so that basically amounts to "write our own > regex state machine compiler and regex engine". This is a non-trivial task, > and would fairly obviously add > > *more* complex code into Bash's codebase than any of my suggested > > alternatives. > extglobs are already a part of the bash language. All of your suggested > alternatives involve expanding the language in question. No. Just No. That's why I > disagree with all of them. > > > (Even my options of "postprocess the codebase" or "modify an existing > > regex compiler" would leave their execution components untouched; only > > the compilation phase would be modified, and a modified regex compiler > > would at least stand a chance of existing as a stand-alone library > > project.) > If you mean bash should start shipping a huge library like pcre for > solving an edge case, I don't think that's reasonable at all; why take > on such a burden when you already have something that works fine in > practice? > Using PCRE was only one option, and not my preferred one. POSIX RE is provided by the standard C library, we wouldn't have to ship anything. The problem is that it DOESN'T work fine. In practice people encounter abysmally slow extglob matching.
Re: bash "extglob" needs to upgrade at least like zsh "kshglob"
On 10/30/22 3:25 PM, Martin D Kealey wrote: How much faster do you think it can be made? I don't know, irrelevant though. The problem is not that individual steps are slow, but rather that it takes at least a higher-order-polynomial number of steps, possibly more (such as exponential or factorial). Speeding up the individual steps will make no practical difference, while pin-hole optimisations may dramatically speed up some common cases, but still leave the most general cases catastrophically slow. These are technical details; no user cares about them. The purpose of my suggestions was to /minimize/ the complexity that becomes part of Bash's codebase, while leaving as few pathological cases as possible - preferably none. I meant complexity of the language, not the codebase. In my opinion "make the existing extglob code faster" is a wasted effort if it doesn't get us to "run in at-most quadratic time" and preferably "run in regular (linear) time", and so that basically amounts to "write our own regex state machine compiler and regex engine". This is a non-trivial task, and would fairly obviously add *more* complex code into Bash's codebase than any of my suggested alternatives. extglobs are already a part of the bash language. All of your suggested alternatives involve expanding the language in question. That's why I disagree with all of them. (Even my options of "postprocess the codebase" or "modify an existing regex compiler" would leave their execution components untouched; only the compilation phase would be modified, and a modified regex compiler would at least stand a chance of existing as a stand-alone library project.) If you mean bash should start shipping a huge library like pcre for solving an edge case, I don't think that's reasonable at all; why take on such a burden when you already have something that works fine in practice?
Re: bash "extglob" needs to upgrade at least like zsh "kshglob"
On Sun, Oct 30, 2022, 11:33 Oğuz wrote: > 30 Ekim 2022 Pazar tarihinde Martin D Kealey > yazdı: > > > > So the options would seem to be: > > (a) prohibit inversions (you get to pick EITHER extglob or rexglob, not > > both); > > (b) bypass convert-to-regex when inversions are present; > > (c) use PCRE or Vim RE, which already support negations (though not in > the > > same form); note that these do not have linear ("regular") time > performance > > in any but the trivial cases; > > (d) compute the "inverse" regex for a given negation (this may require > Vim > > RE, see below); > > (e) post-process the compiled regex (this would be highly dependent on > the > > specific RE implementation); > > (f) pick an existing regex engine, and add the necessary logic to handle > > negations and conjunctions (see below); > > > > Or > (g) make the existing extglob code faster and avoid introducing more > complexity into the shell. > > Which, in my opinion, is the easiest and most realistic option. > last three times i bugged about glob ( like reuse groupings ) it sounded 'do-it-yourself' back only i would , have done , if i could ( .c ) -- > Oğuz >
Re: bash "extglob" needs to upgrade at least like zsh "kshglob"
30 Ekim 2022 Pazar tarihinde Martin D Kealey yazdı: > > So the options would seem to be: > (a) prohibit inversions (you get to pick EITHER extglob or rexglob, not > both); > (b) bypass convert-to-regex when inversions are present; > (c) use PCRE or Vim RE, which already support negations (though not in the > same form); note that these do not have linear ("regular") time performance > in any but the trivial cases; > (d) compute the "inverse" regex for a given negation (this may require Vim > RE, see below); > (e) post-process the compiled regex (this would be highly dependent on the > specific RE implementation); > (f) pick an existing regex engine, and add the necessary logic to handle > negations and conjunctions (see below); > Or (g) make the existing extglob code faster and avoid introducing more complexity into the shell. Which, in my opinion, is the easiest and most realistic option. -- Oğuz
Re: bash "extglob" needs to upgrade at least like zsh "kshglob"
On Sat, 29 Oct 2022 at 22:15, Greg Wooledge wrote: > On Sat, Oct 29, 2022 at 04:50:00PM +1100, Martin D Kealey wrote: > > This seems like a good reason to simply translate extglobs into regexes, > > which should run in linear time, rather than put effort into building and > > debugging a parallel implementation. > > This isn't straightforward, because of the !(list) feature of extglob. > There's no analogous construct for that in standard regexes. > That's true. and my "simply" is understating the complexity of the task, but it's not impossible. There was a recent discussion about this very point on irc://libera.chat#Bash where it was pointed out that supporting the !(list) style of inversions in the regex compiler is simply a matter of inverting the "success" or "failure" indications attached to the states when compiling a regex, which means that it works correctly when these constructs are nested. So the options would seem to be: (a) prohibit inversions (you get to pick EITHER extglob or rexglob, not both); (b) bypass convert-to-regex when inversions are present; (c) use PCRE or Vim RE, which already support negations (though not in the same form); note that these do not have linear ("regular") time performance in any but the trivial cases; (d) compute the "inverse" regex for a given negation (this may require Vim RE, see below); (e) post-process the compiled regex (this would be highly dependent on the specific RE implementation); (f) pick an existing regex engine, and add the necessary logic to handle negations and conjunctions (see below); A particular difficulty is handling !(!(X)|!(Y)) which by de Morgan's law translates into @(X) - meaning a target needs to match BOTH X and Y. Only Vim's Regex engine has direct support for that construct, and it exhibits non-linear timing when using it. A naïve implementation could easily require state table space that's exponential in the breadth of the conjunctions, meaning that the user could very easily write a rexglob that cannot be compiled with available memory. So there needs to be a fallback position: either fail, telling the user we cannot honour the linear speed guarantee, or "try anyway" with a procedure that could exhibit truly horrible time complexity. -Martin PS: I say "we" and "us" advisedly; let's not leave everything to Chet.
Re: bash "extglob" needs to upgrade at least like zsh "kshglob"
On Sat, Oct 29, 2022 at 04:50:00PM +1100, Martin D Kealey wrote: > This seems like a good reason to simply translate extglobs into regexes, > which should run in linear time, rather than put effort into building and > debugging a parallel implementation. This isn't straightforward, because of the !(list) feature of extglob. There's no analogous construct for that in standard regexes.
Re: bash "extglob" needs to upgrade at least like zsh "kshglob"
This seems like a good reason to simply translate extglobs into regexes, which should run in linear time, rather than put effort into building and debugging a parallel implementation. -Martin PS: While we're about it, can we please fix the expansion of « a/*/b/c/d/* » so it only calls readdir on a/ and each dir matching a/*/b/c/d/ and NOT call readdir on dirs matching a/*/, a/*/b/, or a/*/b/c/.
Re: bash "extglob" needs to upgrade at least like zsh "kshglob"
On Fri, Oct 28, 2022 at 03:44:34PM +0900, Hyunho Cho wrote: > # this is just for testing purposes. > $ foo=$( gcc -v --help 2> /dev/null | sed -En 's/^\s*(-[^ ]+).*/\1/p' ) The last time this issue was reported, we mentioned that "gcc --help" is not a standard document that appears the same on everyone's systems. I'm not disputing that bash has performance issues here, but in the interests of getting a coherent bug report (or feature request), please use input that's available to everyone. At the very least, show us the output of "gcc -v --help 2>/dev/null | wc" on your system, so we have a ballpark estimate of the size of the input and the number of words in it. On my system, as an example: unicorn:~$ gcc -v --help 2>/dev/null | wc 3867 22464 250924 unicorn:~$ gcc --help 2>/dev/null | wc 61 4303995