Re: bug with special bracket expressions in regular expressions
On 02/09/2013 19:45, Andriy Gapon wrote:
> It seems that the code works like this:
> - first it matches cd0 and removes it
> - then it passes "cd1 xx" for matching with a flag that tells that this
>   is not a real start of the string
> - thus the matching code
>   o knows that this is not a real line start, so it can't match [[:<:]]
>     just for that reason
>   o it does _not_ know what was the character before the start of the
>     given substring, so it can not know if it could match [[:<:]]
>
> So matching fails.
> Not sure if this is an internal problem of regex(3) or a problem of how
> sed(1) uses regex(3).

I've come up with a patch to fix this problem:
https://reviews.freebsd.org/D2792

I am not sure who among the developers is interested in the regexp code, so currently the request does not have any reviewers. If you know that code well or care about its correctness, please add yourself to the review request.

All testers are welcome. The issue could be quite an edge case, but I am more interested in seeing that no regressions are introduced.

Thanks.
-- 
Andriy Gapon
___
freebsd-current@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-current
To unsubscribe, send any mail to freebsd-current-unsubscr...@freebsd.org
Re: bug with special bracket expressions in regular expressions
On 02/09/2013 16:09, Damian Weber wrote:
> On Mon, 2 Sep 2013, Andriy Gapon wrote:
>> re_format(7) says:
>>     There are two special cases‡ of bracket expressions: the bracket
>>     expressions ‘[[:<:]]’ and ‘[[:>:]]’ match the null string at the
>>     beginning and end of a word respectively.  A word is defined as a
>>     sequence of word characters which is neither preceded nor followed
>>     by word characters.  A word character is an alnum character (as
>>     defined by ctype(3)) or an underscore.  This is an extension,
>>     compatible with but not specified by IEEE Std 1003.2 (“POSIX.2”),
>>     and should be used with caution in software intended to be
>>     portable to other systems.
>>
>> However I observe the following:
>> $ echo cd0 cd1 xx | sed 's/cd[0-9][^ ]* *//g'
>> xx
>> $ echo cd0 cd1 xx | sed 's/[[:<:]]cd[0-9][^ ]* *//g'
>> cd1 xx
>>
>> In my opinion '[[:<:]]' should not affect how the pattern is matched
>> in this case.
>> Any thoughts, suggestions?
>
> there are two simpler expressions, whose difference
> I don't understand either (tested on 8.4-PRERELEASE)
>
> $ echo cd0 cd1 xx | sed 's/cd[0-9] //g'
> xx
> $ echo cd0 cd1 xx | sed 's/[[:<:]]cd[0-9] //g'
> cd1 xx

Well, I agree with your analysis, and I think it's certainly a bug.

Do you think that the BUGS line in regex(3) should perhaps be extended to "never works properly"?:

    Word-boundary matching does not work properly in multibyte locales.

[[:<:]] can be replaced by \b in a pcre, which works perfectly fine (of course):

$ echo this word word should be deleted | perl -pe 's,\bword ,,g'
this should be deleted

Chris
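For comparison, the same two substitutions can be tried in another engine that supports \b. The following is only a sketch using Python's standard re module, not sed(1) or regex(3); the helper name strip_matches is made up for this illustration. Python's \b matches at either edge of a word, so it is a looser assertion than [[:<:]], but here it sits in front of a word character and therefore acts as a start-of-word check.

```python
import re

def strip_matches(pattern, line):
    """Delete every match of pattern from line (like sed's s/.../​/g)."""
    # Strip the zero-width space that may sneak into docs; pattern is used as given.
    return re.sub(pattern, "", line)

# \b in front of "cd" or "word" behaves as a start-of-word assertion here.
print(strip_matches(r"\bcd[0-9][^ ]* *", "cd0 cd1 xx"))
print(strip_matches(r"\bword ", "this word word should be deleted"))
```

re.sub scans the original string with its full context, so the word boundary in front of the second match is still visible after the first match has been consumed.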
bug with special bracket expressions in regular expressions
re_format(7) says:
    There are two special cases‡ of bracket expressions: the bracket
    expressions ‘[[:<:]]’ and ‘[[:>:]]’ match the null string at the
    beginning and end of a word respectively.  A word is defined as a
    sequence of word characters which is neither preceded nor followed
    by word characters.  A word character is an alnum character (as
    defined by ctype(3)) or an underscore.  This is an extension,
    compatible with but not specified by IEEE Std 1003.2 (“POSIX.2”),
    and should be used with caution in software intended to be
    portable to other systems.

However I observe the following:
$ echo cd0 cd1 xx | sed 's/cd[0-9][^ ]* *//g'
xx
$ echo cd0 cd1 xx | sed 's/[[:<:]]cd[0-9][^ ]* *//g'
cd1 xx

In my opinion '[[:<:]]' should not affect how the pattern is matched in this case.
Any thoughts, suggestions?
Thank you!
-- 
Andriy Gapon
Re: bug with special bracket expressions in regular expressions
On Mon, 2 Sep 2013, Andriy Gapon wrote:
> re_format(7) says:
>     There are two special cases‡ of bracket expressions: the bracket
>     expressions ‘[[:<:]]’ and ‘[[:>:]]’ match the null string at the
>     beginning and end of a word respectively.  A word is defined as a
>     sequence of word characters which is neither preceded nor followed
>     by word characters.  A word character is an alnum character (as
>     defined by ctype(3)) or an underscore.  This is an extension,
>     compatible with but not specified by IEEE Std 1003.2 (“POSIX.2”),
>     and should be used with caution in software intended to be
>     portable to other systems.
>
> However I observe the following:
> $ echo cd0 cd1 xx | sed 's/cd[0-9][^ ]* *//g'
> xx
> $ echo cd0 cd1 xx | sed 's/[[:<:]]cd[0-9][^ ]* *//g'
> cd1 xx
>
> In my opinion '[[:<:]]' should not affect how the pattern is matched
> in this case.
> Any thoughts, suggestions?

there are two simpler expressions, whose difference
I don't understand either (tested on 8.4-PRERELEASE)

$ echo cd0 cd1 xx | sed 's/cd[0-9] //g'
xx
$ echo cd0 cd1 xx | sed 's/[[:<:]]cd[0-9] //g'
cd1 xx

-- Damian
Re: bug with special bracket expressions in regular expressions
on 02/09/2013 17:54 Andriy Gapon said the following:
> re_format(7) says:
>     There are two special cases‡ of bracket expressions: the bracket
>     expressions ‘[[:<:]]’ and ‘[[:>:]]’ match the null string at the
>     beginning and end of a word respectively.  A word is defined as a
>     sequence of word characters which is neither preceded nor followed
>     by word characters.  A word character is an alnum character (as
>     defined by ctype(3)) or an underscore.  This is an extension,
>     compatible with but not specified by IEEE Std 1003.2 (“POSIX.2”),
>     and should be used with caution in software intended to be
>     portable to other systems.
>
> However I observe the following:
> $ echo cd0 cd1 xx | sed 's/cd[0-9][^ ]* *//g'
> xx
> $ echo cd0 cd1 xx | sed 's/[[:<:]]cd[0-9][^ ]* *//g'
> cd1 xx
>
> In my opinion '[[:<:]]' should not affect how the pattern is matched
> in this case.

It seems that the code works like this:
- first it matches cd0 and removes it
- then it passes "cd1 xx" for matching with a flag that tells that this is
  not a real start of the string
- thus the matching code
  o knows that this is not a real line start, so it can't match [[:<:]]
    just for that reason
  o it does _not_ know what was the character before the start of the
    given substring, so it can not know if it could match [[:<:]]

So matching fails.
Not sure if this is an internal problem of regex(3) or a problem of how sed(1) uses regex(3).
-- 
Andriy Gapon
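The steps described above can be modelled in a few lines. This is only an illustrative Python sketch of the suspected logic, not the actual regex(3) or sed(1) code: the names word_start_ok and substitute_all, the not_bol flag and the pass_context switch are all hypothetical, and the word-start test is written out by hand rather than taken from any library.

```python
import re

# The part of the pattern that follows the word-start assertion.
BODY = re.compile(r"cd[0-9][^ ]* *")

def is_word_char(ch):
    # re_format(7): a word character is an alnum character or an underscore.
    return ch.isalnum() or ch == "_"

def word_start_ok(text, pos, not_bol, prev_char=None):
    """Hypothetical model of the start-of-word check at text[pos]."""
    if pos >= len(text) or not is_word_char(text[pos]):
        return False
    if pos > 0:
        return not is_word_char(text[pos - 1])
    if not not_bol:
        return True  # genuine start of the line
    if prev_char is None:
        return False  # models the observed failure: no context available
    return not is_word_char(prev_char)

def substitute_all(line, pass_context):
    """Model of a global substitution that re-matches on the remainder."""
    out = []
    offset = 0
    while offset < len(line):
        rest = line[offset:]
        not_bol = offset > 0
        prev = line[offset - 1] if (pass_context and offset > 0) else None
        hit = None
        for i in range(len(rest)):
            if word_start_ok(rest, i, not_bol, prev):
                hit = BODY.match(rest, i)
                if hit:
                    break
        if hit is None:
            out.append(rest)  # nothing more to delete
            break
        out.append(rest[:hit.start()])  # keep the text before the match
        offset += hit.end()
    return "".join(out)

print(substitute_all("cd0 cd1 xx", pass_context=False))
print(substitute_all("cd0 cd1 xx", pass_context=True))
```

With pass_context=False the second round sees "cd1 xx" with no knowledge of the preceding space and refuses the word start, which reproduces the "cd1 xx" output; once the preceding character is made available, the result is "xx".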
Re: bug with special bracket expressions in regular expressions
On Mon, Sep 2, 2013 at 7:45 PM, Andriy Gapon a...@freebsd.org wrote:
> on 02/09/2013 17:54 Andriy Gapon said the following:
>> re_format(7) says:
>>     There are two special cases‡ of bracket expressions: the bracket
>>     expressions ‘[[:<:]]’ and ‘[[:>:]]’ match the null string at the
>>     beginning and end of a word respectively.  A word is defined as a
>>     sequence of word characters which is neither preceded nor followed
>>     by word characters.  A word character is an alnum character (as
>>     defined by ctype(3)) or an underscore.  This is an extension,
>>     compatible with but not specified by IEEE Std 1003.2 (“POSIX.2”),
>>     and should be used with caution in software intended to be
>>     portable to other systems.
>>
>> However I observe the following:
>> $ echo cd0 cd1 xx | sed 's/cd[0-9][^ ]* *//g'
>> xx
>> $ echo cd0 cd1 xx | sed 's/[[:<:]]cd[0-9][^ ]* *//g'
>> cd1 xx
>>
>> In my opinion '[[:<:]]' should not affect how the pattern is matched
>> in this case.
>
> It seems that the code works like this:
> - first it matches cd0 and removes it
> - then it passes "cd1 xx" for matching with a flag that tells that this
>   is not a real start of the string
> - thus the matching code
>   o knows that this is not a real line start, so it can't match [[:<:]]
>     just for that reason
>   o it does _not_ know what was the character before the start of the
>     given substring, so it can not know if it could match [[:<:]]
>
> So matching fails.
> Not sure if this is an internal problem of regex(3) or a problem of how
> sed(1) uses regex(3).
> --
> Andriy Gapon

In my opinion this is a bug. The [[:<:]] operator is said to match the empty string at the beginning of a word, with no mention that the word has to be at the beginning of the whole string that is matched.

OS X version of sed(1) works differently:

$ echo cd0 cd1 xx | sed 's/cd[0-9][^ ]* *//g'
xx
$ echo cd0 cd1 xx | sed 's/[[:<:]]cd[0-9][^ ]* *//g'
xx

-Kimmo