Re: Flex / Regexps (Re: awk regex bug)

2015-06-08 Thread hruodr
I wrote:

 [...] Why should
 it be difficult to track the indices in yytext of the beginning and the end 
 of each matching subexpression, in two arrays of integers (one for
 the beginning and one for the end)? [...] 

More exactly: in the first array the index of the first element of
the matching subexpression, in the second the index of the last
element plus one. When both indices are equal, then the subexpression
is void. 

If the character at the second index is irrelevant in yytext, then
one can set yytext there to 0 to easily obtain a pointer to a
null-terminated string equal to the subexpression.
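
The two-array bookkeeping described above can be sketched in Python, whose
`re` module already exposes such offsets through `Match.start()` and
`Match.end()`. This is only an illustration of the data layout, not of how
flex works or could be extended; `submatch_spans` and the sample pattern are
hypothetical names chosen for the example.

```python
import re

def submatch_spans(pattern, text):
    """Return (starts, ends) for group 0 and every subexpression, or None.

    starts[i] is the index of the first character of group i,
    ends[i] is the index of its last character plus one.
    A group that did not take part in the match is reported as -1 by `re`.
    """
    m = re.match(pattern, text)
    if m is None:
        return None
    n = m.re.groups  # number of parenthesised subexpressions
    starts = [m.start(i) for i in range(n + 1)]
    ends = [m.end(i) for i in range(n + 1)]
    return starts, ends

yytext = "key=value"  # stands in for flex's yytext
starts, ends = submatch_spans(r"([a-z]+)=([a-z]*)", yytext)

# Each subexpression is the half-open slice yytext[start:end];
# equal indices mean the subexpression is empty.
subs = [yytext[s:e] for s, e in zip(starts, ends)]
print(starts, ends, subs)
```

With half-open ranges the length of a subexpression is simply the end index
minus the start index, which is why an empty one shows up as two equal
indices.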

Just dreaming.

Rodrigo.



Flex / Regexps (Re: awk regex bug)

2015-06-08 Thread hruodr
Otto Moerbeek o...@drijf.net wrote:

 Referring to subpatterns is not available in flex.  I suppose it is not
 available since it would require a more complex re engine.
 Interpretation of the lexical value should be hand-crafted. 

I also thought complexity could be the reason, but I have doubts. Why should
it be difficult to track the indices in yytext of the beginning and the end 
of each matching subexpression, in two arrays of integers (one for
the beginning and one for the end)? Neither memory nor time seems to
be a problem. And hand crafting means not only avoidable programming
work and unreadability, but a second pass that adds complexity.
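
The "second pass" meant by hand crafting can be sketched as follows, in
Python rather than flex and under the assumption of a made-up token rule:
the scanner matches a whole `name=value` token without any subexpression
information, and the action then has to scan the matched text again to
split it. `TOKEN` and `action` are hypothetical names for this example.

```python
import re

# Hypothetical token rule: the whole pair is one token, with no groups.
TOKEN = re.compile(r"[a-z]+=[a-z]*")

def action(yytext):
    """Hand-crafted second pass over the already matched token text."""
    eq = yytext.index("=")  # scan the token again to find the separator
    return yytext[:eq], yytext[eq + 1:]

m = TOKEN.match("key=value rest")
name, value = action(m.group(0))
print(name, value)
```

The split logic in `action` duplicates knowledge that is already encoded in
the pattern, which is the extra work and loss of readability complained
about above.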

A nice source on regexps is here:  https://swtch.com/~rsc/regexp/

In the first article listed there you read:


While writing the text editor sam [6] in the early 1980s, Rob Pike wrote a 
new regular expression implementation, which Dave Presotto extracted into 
a library that appeared in the Eighth Edition. Pike's implementation 
incorporated submatch tracking [sic!] into an efficient NFA simulation but, 
like the rest of the Eighth Edition source, was not widely distributed. 
Pike himself did not realize that his technique was anything new. Henry 
Spencer reimplemented the Eighth Edition library interface from scratch, 
but using backtracking, and released his implementation into the public 
domain. It became very widely used, eventually serving as the basis for 
the slow regular expression implementations mentioned earlier: Perl, PCRE, 
Python, and so on. (In his defense, Spencer knew the routines could be 
slow, and he didn't know that a more efficient algorithm existed. He 
even warned in the documentation, "Many users have found the speed 
perfectly adequate, although replacing the insides of egrep with this 
code would be a mistake.") Pike's regular expression implementation, 
extended to support Unicode, was made freely available with sam in late 
1992, but the particularly efficient regular expression search algorithm 
went unnoticed. The code is now available in many forms: as part of sam, 
as Plan 9's regular expression library, or packaged separately for Unix. 
Ville Laurikari independently discovered Pike's algorithm in 1999, 
developing a theoretical foundation as well [2]. 


Note that OpenBSD's regex library seems to use the slow Spencer 
implementation.

Rodrigo.



Re: awk regex bug

2015-06-08 Thread Otto Moerbeek
On Mon, Jun 08, 2015 at 02:49:44PM +, hru...@gmail.com wrote:

 Otto Moerbeek o...@drijf.net wrote:
 
  Traditionally, { } patterns are not part of awk re's.
 
  Posix added them, but we do not include them afaik. Gnu awk only accepts
  them if given an extra arg (--posix or --re-interval).
 
  I think this should be documented.
 
 Although there is a clear theory about regular expressions, I have the
 impression that there is no standard syntax. One needs to read again and
 again the documentation of programs that use them.
 
 I am just missing a way to reference in a (f)lex action a previously
 matched subexpression (like with \m in a substitution with ed).
 
 Why is this? Because lex is so old? And what do people do in these
 cases?
 
 Rodrigo

Referring to subpatterns is not available in flex.  I suppose it is not
available since it would require a more complex re engine.
Interpretation of the lexical value should be hand-crafted. 

-Otto



Re: awk regex bug

2015-06-08 Thread hruodr
Otto Moerbeek o...@drijf.net wrote:

 Traditionally, { } patterns are not part of awk re's.

 Posix added them, but we do not include them afaik. Gnu awk only accepts
 them if given an extra arg (--posix or --re-interval).

 I think this should be documented.

Although there is a clear theory about regular expressions, I have the
impression that there is no standard syntax. One needs to read again and
again the documentation of programs that use them.

I am just missing a way to reference in a (f)lex action a previously
matched subexpression (like with \m in a substitution with ed).

Why is this? Because lex is so old? And what do people do in these
cases?

Rodrigo



Re: awk regex bug

2015-05-28 Thread Otto Moerbeek
On Thu, May 28, 2015 at 02:08:47AM -0500, cwl...@mst.edu wrote:

 Hi misc,
 
 I'm running a 5.7 release, and I'm wondering if anyone can confirm
 an awk bug I found.
 
 Curly brackets are treated as literal characters instead of bounds
 as specified by re_format(7).
 
 Reproduction:
 
 echo aa | awk '/a{2}/'
 
 produces no output instead of printing aa as expected.
 
 echo 'a{2}' | awk '/a{2}/'
 
 produces output when none is expected.
 
 This bug seems awk specific since the equivalents using grep
 
 echo aa | grep -E 'a{2}'
 
 echo 'a{2}' | grep -E 'a{2}'
 
 work as expected.

Traditionally, { } patterns are not part of awk re's.

Posix added them, but we do not include them afaik. Gnu awk only accepts
them if given an extra arg (--posix or --re-interval).

I think this should be documented.

-Otto



awk regex bug

2015-05-28 Thread cwlmb3
Hi misc,

I'm running a 5.7 release, and I'm wondering if anyone can confirm
an awk bug I found.

Curly brackets are treated as literal characters instead of bounds
as specified by re_format(7).

Reproduction:

echo aa | awk '/a{2}/'

produces no output instead of printing aa as expected.

echo 'a{2}' | awk '/a{2}/'

produces output when none is expected.

This bug seems awk specific since the equivalents using grep

echo aa | grep -E 'a{2}'

echo 'a{2}' | grep -E 'a{2}'

work as expected.
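
The two readings of `a{2}` contrasted in this report can be sketched in
Python. Its `re` module treats braces as bounds, and the literal reading is
modelled here by escaping the pattern with `re.escape`. This only
illustrates the two behaviours on the two test inputs; it says nothing
about how awk or grep implement matching, and the function names are made
up for the example.

```python
import re

def matches_as_bound(line):
    # Interval reading: a{2} means two consecutive 'a' characters.
    return re.search(r"a{2}", line) is not None

def matches_as_literal(line):
    # Literal reading: the braces are ordinary characters.
    return re.search(re.escape("a{2}"), line) is not None

for line in ("aa", "a{2}"):
    print(line, matches_as_bound(line), matches_as_literal(line))
```

The literal reading accepts exactly the input that the bound reading
rejects, which matches the pair of symptoms described above.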