Re: bug with special bracket expressions in regular expressions

2015-06-12 Thread Andriy Gapon
On 02/09/2013 19:45, Andriy Gapon wrote:
> It seems that the code works like this:
> - first it matches "cd0 " and removes it
> - then it passes "cd1 xx" for matching with a flag that tells that this is
>   not a real start of the string
> - thus the matching code
>   o knows that this is not a real line start, so it can't match [[:<:]]
>     just for that reason
>   o it does _not_ know what was the character before the start of the given
>     substring, so it can not know if it could match [[:<:]]
>
> So matching fails.
> Not sure if this is an internal problem of regex(3) or a problem of how
> sed(1) uses regex(3).

I've come up with a patch to fix this problem:
https://reviews.freebsd.org/D2792

I am not sure who among the developers is interested in the regexp code, so
currently the request does not have any reviewers.  If you know that code
well or care about its correctness, please add yourself to the review request.

All testers are welcome.  The issue could be quite an edge case, but I am more
interested in seeing that no regressions are introduced.

Thanks.
-- 
Andriy Gapon
___
freebsd-current@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-current
To unsubscribe, send any mail to freebsd-current-unsubscr...@freebsd.org


Re: bug with special bracket expressions in regular expressions

2013-10-01 Thread Chris Rees

On 02/09/2013 16:09, Damian Weber wrote:
> On Mon, 2 Sep 2013, Andriy Gapon wrote:
> > re_format(7) says:
> >   There are two special cases‡ of bracket expressions: the bracket
> >   expressions ‘[[:<:]]’ and ‘[[:>:]]’ match the null string at the
> >   beginning and end of a word respectively.  A word is defined as a
> >   sequence of word characters which is neither preceded nor followed by
> >   word characters.  A word character is an alnum character (as defined by
> >   ctype(3)) or an underscore.  This is an extension, compatible with but
> >   not specified by IEEE Std 1003.2 (“POSIX.2”), and should be used with
> >   caution in software intended to be portable to other systems.
> >
> > However I observe the following:
> > $ echo cd0 cd1 xx | sed 's/cd[0-9][^ ]* *//g'
> > xx
> > $ echo cd0 cd1 xx | sed 's/[[:<:]]cd[0-9][^ ]* *//g'
> > cd1 xx
> >
> > In my opinion '[[:<:]]' should not affect how the pattern is matched in
> > this case.
> >
> > Any thoughts, suggestions?
>
> there are two simpler expressions, whose difference I don't understand
> either (tested on 8.4-PRERELEASE)
>
> $ echo cd0 cd1 xx | sed 's/cd[0-9] //g'
> xx
> $ echo cd0 cd1 xx | sed 's/[[:<:]]cd[0-9] //g'
> cd1 xx

Well, I agree with your analysis, and I think it's certainly a bug.

Do you think that the BUGS line in regex(3) should perhaps be extended
to say "never works properly"?  It currently reads:

  Word-boundary matching does not work properly in multibyte locales.

[[:<:]] can be replaced by \b in a PCRE, which works perfectly fine (of
course):

$ echo this word word should be deleted | perl -pe 's,\bword ,,g'
this should be deleted
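
For comparison, the same substitution can be tried with Python's re module,
which also provides \b.  This is only a sketch for illustration: the helper
name delete_matches and the WORD_START constant are made up for this example.
Note that \b matches at both ends of a word, so \b(?=\w) is the closer
counterpart of [[:<:]].

```python
import re

# Closer equivalent of [[:<:]]: a word boundary that is followed by a word
# character, i.e. the start of a word (assumption: ASCII-style input).
WORD_START = r"\b(?=\w)"


def delete_matches(pattern, line):
    """Delete every match of pattern in line, like sed's s/pattern//g."""
    return re.sub(pattern, "", line)


print(delete_matches(r"\bword ", "this word word should be deleted"))
print(delete_matches(WORD_START + r"cd[0-9][^ ]* *", "cd0 cd1 xx"))
```

Both calls remove every occurrence, because re.sub scans the original string
and can always look at the character before each candidate match.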


Chris



bug with special bracket expressions in regular expressions

2013-09-02 Thread Andriy Gapon

re_format(7) says:
 There are two special cases‡ of bracket expressions: the bracket
 expressions ‘[[:<:]]’ and ‘[[:>:]]’ match the null string at the beginning and
 end of a word respectively.  A word is defined as a sequence of word
 characters which is neither preceded nor followed by word characters.  A
 word character is an alnum character (as defined by ctype(3)) or an
 underscore.  This is an extension, compatible with but not specified by
 IEEE Std 1003.2 (“POSIX.2”), and should be used with caution in software
 intended to be portable to other systems.
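
To make the quoted definition concrete, here is a small Python sketch that
lists the offsets at which the two expressions should match according to the
text above.  It is my own illustration of the definition, not code from
regex(3); the function names are hypothetical, and str.isalnum() stands in for
the ctype(3) classification.

```python
def is_word_char(ch):
    # Word character per the quoted text: alnum or underscore.
    return ch.isalnum() or ch == "_"


def word_starts(s):
    """Offsets where [[:<:]] should match: a word char follows, none precedes."""
    return [i for i in range(len(s))
            if is_word_char(s[i]) and (i == 0 or not is_word_char(s[i - 1]))]


def word_ends(s):
    """Offsets where [[:>:]] should match: a word char precedes, none follows."""
    return [i for i in range(1, len(s) + 1)
            if is_word_char(s[i - 1]) and (i == len(s) or not is_word_char(s[i]))]


print(word_starts("cd0 cd1 xx"))  # [0, 4, 8]
print(word_ends("cd0 cd1 xx"))    # [3, 7, 10]
```

By this definition offset 4, the start of "cd1", is a word start just like
offset 0.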

However I observe the following:
$ echo cd0 cd1 xx | sed 's/cd[0-9][^ ]* *//g'
xx
$ echo cd0 cd1 xx | sed 's/[[:<:]]cd[0-9][^ ]* *//g'
cd1 xx

In my opinion '[[:<:]]' should not affect how the pattern is matched in this
case.

Any thoughts, suggestions?
Thank you!
-- 
Andriy Gapon

Re: bug with special bracket expressions in regular expressions

2013-09-02 Thread Damian Weber


On Mon, 2 Sep 2013, Andriy Gapon wrote:

> re_format(7) says:
>   There are two special cases‡ of bracket expressions: the bracket
>   expressions ‘[[:<:]]’ and ‘[[:>:]]’ match the null string at the
>   beginning and end of a word respectively.  A word is defined as a
>   sequence of word characters which is neither preceded nor followed by
>   word characters.  A word character is an alnum character (as defined by
>   ctype(3)) or an underscore.  This is an extension, compatible with but
>   not specified by IEEE Std 1003.2 (“POSIX.2”), and should be used with
>   caution in software intended to be portable to other systems.
>
> However I observe the following:
> $ echo cd0 cd1 xx | sed 's/cd[0-9][^ ]* *//g'
> xx
> $ echo cd0 cd1 xx | sed 's/[[:<:]]cd[0-9][^ ]* *//g'
> cd1 xx
>
> In my opinion '[[:<:]]' should not affect how the pattern is matched in
> this case.
>
> Any thoughts, suggestions?

there are two simpler expressions, whose difference I don't understand either
(tested on 8.4-PRERELEASE)

$ echo cd0 cd1 xx | sed 's/cd[0-9] //g'
xx
$ echo cd0 cd1 xx | sed 's/[[:<:]]cd[0-9] //g'
cd1 xx

-- Damian



Re: bug with special bracket expressions in regular expressions

2013-09-02 Thread Andriy Gapon
on 02/09/2013 17:54 Andriy Gapon said the following:
> re_format(7) says:
>   There are two special cases‡ of bracket expressions: the bracket
>   expressions ‘[[:<:]]’ and ‘[[:>:]]’ match the null string at the
>   beginning and end of a word respectively.  A word is defined as a
>   sequence of word characters which is neither preceded nor followed by
>   word characters.  A word character is an alnum character (as defined by
>   ctype(3)) or an underscore.  This is an extension, compatible with but
>   not specified by IEEE Std 1003.2 (“POSIX.2”), and should be used with
>   caution in software intended to be portable to other systems.
>
> However I observe the following:
> $ echo cd0 cd1 xx | sed 's/cd[0-9][^ ]* *//g'
> xx
> $ echo cd0 cd1 xx | sed 's/[[:<:]]cd[0-9][^ ]* *//g'
> cd1 xx
>
> In my opinion '[[:<:]]' should not affect how the pattern is matched in
> this case.

It seems that the code works like this:
- first it matches "cd0 " and removes it
- then it passes "cd1 xx" for matching with a flag that tells that this is not
  a real start of the string
- thus the matching code
 o knows that this is not a real line start, so it can't match [[:<:]]
   just for that reason
 o it does _not_ know what was the character before the start of the given
   substring, so it can not know if it could match [[:<:]]

So matching fails.
Not sure if this is an internal problem of regex(3) or a problem of how sed(1)
uses regex(3).
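
The behaviour described above can be modelled with a short Python sketch.
This is only an illustration of the reasoning, not the actual regex(3) or
sed(1) code: the function delete_all, the prev_char_known switch and the
hand-coded word-start check are all assumptions made for the example.

```python
import re

# Pattern body without the word-start assertion; the assertion is checked
# by hand below so that the missing context can be modelled.
BODY = re.compile(r"cd[0-9][^ ]* *")


def is_word_char(ch):
    # Word character as in re_format(7): alnum or underscore.
    return ch.isalnum() or ch == "_"


def delete_all(line, prev_char_known):
    """Model of s/[[:<:]]cd[0-9][^ ]* *//g (hypothetical helper).

    prev_char_known=False models the reported behaviour: after the first
    match, the rest of the line is searched as a substring flagged "not the
    real start", and the character before it is not available.
    """
    kept = []
    pos = 0  # where the current search (the "substring") starts
    while pos < len(line):
        hit = None
        for i in range(pos, len(line)):
            if i == 0:
                preceded_by_word = False   # real start of the line
            elif i == pos and not prev_char_known:
                continue                   # context lost: assertion cannot match
            else:
                preceded_by_word = is_word_char(line[i - 1])
            m = BODY.match(line, i)
            if m and not preceded_by_word and is_word_char(line[i]):
                hit = m
                break
        if hit is None:
            break
        kept.append(line[pos:hit.start()])  # keep the text before the match
        pos = hit.end()                     # continue right after the match
    kept.append(line[pos:])
    return "".join(kept)


print(delete_all("cd0 cd1 xx", prev_char_known=False))  # cd1 xx
print(delete_all("cd0 cd1 xx", prev_char_known=True))   # xx
```

With the context dropped, the model reproduces the "cd1 xx" output; once the
matcher can see the character before the restart offset, it yields "xx".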

-- 
Andriy Gapon

Re: bug with special bracket expressions in regular expressions

2013-09-02 Thread Kimmo Paasiala
On Mon, Sep 2, 2013 at 7:45 PM, Andriy Gapon a...@freebsd.org wrote:
> on 02/09/2013 17:54 Andriy Gapon said the following:
> >
> > re_format(7) says:
> >   There are two special cases‡ of bracket expressions: the bracket
> >   expressions ‘[[:<:]]’ and ‘[[:>:]]’ match the null string at the
> >   beginning and end of a word respectively.  A word is defined as a
> >   sequence of word characters which is neither preceded nor followed by
> >   word characters.  A word character is an alnum character (as defined by
> >   ctype(3)) or an underscore.  This is an extension, compatible with but
> >   not specified by IEEE Std 1003.2 (“POSIX.2”), and should be used with
> >   caution in software intended to be portable to other systems.
> >
> > However I observe the following:
> > $ echo cd0 cd1 xx | sed 's/cd[0-9][^ ]* *//g'
> > xx
> > $ echo cd0 cd1 xx | sed 's/[[:<:]]cd[0-9][^ ]* *//g'
> > cd1 xx
> >
> > In my opinion '[[:<:]]' should not affect how the pattern is matched in
> > this case.
>
> It seems that the code works like this:
> - first it matches "cd0 " and removes it
> - then it passes "cd1 xx" for matching with a flag that tells that this is
>   not a real start of the string
> - thus the matching code
>   o knows that this is not a real line start, so it can't match [[:<:]]
>     just for that reason
>   o it does _not_ know what was the character before the start of the given
>     substring, so it can not know if it could match [[:<:]]
>
> So matching fails.
> Not sure if this is an internal problem of regex(3) or a problem of how
> sed(1) uses regex(3).
>
> --
> Andriy Gapon

In my opinion this is a bug. The [[:<:]] operator is said to match the
empty string at the beginning of a word, with no mention that the word
has to be at the beginning of the whole string that is matched. The OS X
version of sed(1) works differently:

$ echo cd0 cd1 xx | sed 's/cd[0-9][^ ]* *//g'
xx
$ echo cd0 cd1 xx | sed 's/[[:<:]]cd[0-9][^ ]* *//g'
xx

-Kimmo