Re: RFC 158 (v1) Regular Expression Special Variables

2000-09-11 Thread Hugo

Mark-Jason Dominus writes:
: There's also long been talk/thought about making $& and $1 
: and friends magic aliases into the original string, which would
: save that cost.
:
:Please correct me if I'm mistaken, but I believe that that's the way
:they are implemented now.  A regex match populates the ->startp and
:->endp parts of the regex structure, and the elements of these items
:are byte offsets into the original string.

I went on a briefish trawl for this the other day, and as far as I
can tell what happens is this:
- during matching, the startp/endp pairs are populated with offsets
into the target string
- immediately after matching, the target string is copied if needed,
and the PL_curpm object is updated to refer to the copy
- the copy is needed if any of the special variables can be referred
to: $`, $&, $', $1, $2, ...

The result of that is that if there are backreferences in the regexp,
the copy is always needed; if not, the copy is needed only if $& or
her kin have been seen. So regexps with backrefs should suffer no
slowdown from use of $& in the same program, but regexps without
backrefs will get a (potentially) unnecessary copy.

The other problem with this, of course, is that the compiler may not
yet have seen the $& we intend to use:
  crypt% perl -wle '$_="foo"; /.*/; $_="bar"; print eval q{$&}'
  bar
  crypt% 
.. and I think coredumps may be possible from this. (Hmm, perlbug
upcoming.)

Hugo



Re: RFC 158 (v1) Regular Expression Special Variables

2000-08-25 Thread Uri Guttman

>>>>> "TC" == Tom Christiansen [EMAIL PROTECTED] writes:

  >> $`, $& and $' are useful variables which are never used by any
  >> experienced Perl hacker since they have well known problems with
  >> efficiency. 

  TC> That's hardly true.  I could show you plenty of code from
  TC> inexperienced Perl hackers like lwall that use them.  But
  TC> the cost is understood.  :-)

those early perl3 scripts by lwall floating around in /etc were poorly
written. i am glad they are finally out of the distribution.

  TC> The rest of what you said probably is reasonable, however.

  TC> The (.*?)(blah)(.*) solution kind of works sometimes, but is 
  TC> hardly pleasant.  Likewise the @+ and @- stuff.

i would like to see the @+ and @- stuff made to work faster or better or
something. they have merit but not practicality.
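
For reference, here is a minimal sketch of what "the @+ and @- stuff" looks like in practice: rebuilding the prematch, match and postmatch pieces with substr instead of naming $`, $& or $'. The sample string and the variable names are made up for illustration.

```perl
use strict;
use warnings;

my $text = 'say blah now';   # hypothetical sample input

my ($pre, $match, $post);
if ($text =~ /blah/) {
    # $-[0] is the offset where the whole match starts, $+[0] where it ends.
    $pre   = substr($text, 0, $-[0]);
    $match = substr($text, $-[0], $+[0] - $-[0]);
    $post  = substr($text, $+[0]);
    printf "[%s][%s][%s]\n", $pre, $match, $post;
}
```

It works, but three substr calls in place of three variables is arguably the lack of practicality being complained about.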

another related grabbing issue is grabbing repeated groups like

@all_words = /(\w+\s+)+/ ;

we only get the last match from that. but that should be a separate rfc.
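
A small sketch of the difference, with a made-up input line: the repeated group hands back only its final iteration, while a /g match in list context returns every match of the group. The /g form is a workaround rather than an equivalent, since it does not require the words to be adjacent.

```perl
use strict;
use warnings;

my $line = 'one two three ';   # hypothetical input; note the trailing space

# A repeated group keeps only its final iteration.
my @last_only = $line =~ /(\w+\s+)+/;

# With /g in list context, each match of the group is returned in turn.
my @all_words = $line =~ /(\w+\s+)/g;

print scalar(@last_only), " vs ", scalar(@all_words), " captures\n";
```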

  TC> There's also long been talk/thought about making $& and $1 
  TC> and friends magic aliases into the original string, which would
  TC> save that cost.

but if you modify that string with s/// you lose unless you make a
copy. in fact $`, $& and $' should just be aliases if the op was
m///. it is the s/// case that is the problem. 
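
a minimal sketch of the s/// case, using $1 rather than $& so the example does not itself trigger the penalty under discussion. the strings and variable names are invented for illustration.

```perl
use strict;
use warnings;

my $string = 'hello world';
$string =~ s/(wor)ld/there/;

# $string has been rewritten, yet $1 still holds text from the old value,
# so it cannot simply be an alias into the (now modified) target.
my $captured = $1;
print "string is '$string', captured is '$captured'\n";
```

after the substitution the target no longer contains the captured text at all, which is why a plain alias into it would not do here.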

that brings up the question about how often is $& needed after a s///?
it almost makes little sense since you are matching and modifying. maybe
we can also remove support for them with s/// and thereby remove the
copy penalty. but my idea would work in both cases and puts it under
program control so we could just use that.

uri

-- 
Uri Guttman  -  [EMAIL PROTECTED]  --  http://www.sysarch.com
SYStems ARCHitecture, Software Engineering, Perl, Internet, UNIX Consulting
The Perl Books Page  ---  http://www.sysarch.com/cgi-bin/perl_books
The Best Search Engine on the Net  --  http://www.northernlight.com



Re: RFC 158 (v1) Regular Expression Special Variables

2000-08-25 Thread Tom Christiansen

> those early perl3 scripts by lwall floating around in /etc were poorly
> written. i am glad they are finally out of the distribution.

Those weren't the scripts I was thinking about, and it is *NOT*
ipso facto true that something which uses $& or $` is poorly
written.

--tom



Re: RFC 158 (v1) Regular Expression Special Variables

2000-08-25 Thread David L. Nicol

Tom Christiansen wrote:

> There's also long been talk/thought about making $& and $1
> and friends magic aliases into the original string, which would
> save that cost.


I was distressed to discover that s///g does not rebuild the
old string between matches, but only at the end.  It broke my
random anagram generator which was depending on instant updates.
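
A minimal sketch of the difference, not the anagram generator itself: s///g makes a single pass and never rescans text it has already replaced, whereas an explicit loop applies each substitution to the updated string. The input is made up for illustration.

```perl
use strict;
use warnings;

# Single pass: /g matches 'ab' once, then stops at the end of the string.
my $once = 'aab';
$once =~ s/ab/ba/g;

# Explicit loop: every substitution starts over on the updated string.
my $looped = 'aab';
1 while $looped =~ s/ab/ba/;

print "once=$once looped=$looped\n";
```

Code that needs every replacement to see the previous one has to use something like the second form.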


If STRING was a linked list of partially full blocks rather than
a big piece of contiguous space, we could do length-altering substitutions
without copying.
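
A toy sketch of the idea in Perl itself, assuming a hypothetical representation where the string is kept as a list of separate blocks. It is only an illustration: a real implementation would live inside the interpreter and would have to cope with matches that span block boundaries, which this does not.

```perl
use strict;
use warnings;

# Hypothetical representation: a string held as a list of separate blocks.
my @blocks = ('The quick ', 'brown fox ', 'jumps');

# A length-changing substitution touches only the block that matches;
# the other blocks are not moved or copied.
for my $block (@blocks) {
    last if $block =~ s/brown/reddish-brown/;
}

my $joined = join '', @blocks;
print "$joined\n";
```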

-- 
  David Nicol 816.235.1187 [EMAIL PROTECTED]
   safety first: Republicans for Nader in 2000



Re: RFC 158 (v1) Regular Expression Special Variables

2000-08-25 Thread Mark-Jason Dominus


> > Please correct me if I'm mistaken, but I believe that that's the way
> > they are implemented now.  A regex match populates the ->startp and
> > ->endp parts of the regex structure, and the elements of these items
> > are byte offsets into the original string.  
> 
> I haven't looked at it at all, and perhaps that's something Ilya
> did when creating @+ etc.  So you might be right.  

As far as I know it's the same in 5.000.

I thought the problem with $& was that the regex engine has to adjust
the offsets in the startp/endp arrays every time it scans forward a
character or backtracks a character.  

But maybe the effect of $& is greatly exaggerated or is a relic from
perl4?  Has anyone actually benchmarked this recently?
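
One way such a benchmark might be set up, as a rough sketch only. Because the penalty is program-wide, the two cases have to be run as separate processes; here a hypothetical command-line switch decides whether $& is ever compiled, using a string eval so the compiler does not see it otherwise. The string length and iteration count are arbitrary choices, and no particular result is claimed.

```perl
use strict;
use warnings;
use Time::HiRes qw(time);

# Hypothetical switch: pass any true argument to make the compiler see $&.
my $see_ampersand = shift @ARGV;
eval q{ my $unused = $& } if $see_ampersand;   # compiled only at run time

my $target = 'x' x 100_000;   # long string, so any copy is expensive
my $count  = 0;
my $start  = time;
for (1 .. 2_000) {
    $count++ if $target =~ /x/;   # no capture groups in the pattern
}
my $elapsed = time - $start;

# Single-quoted format so that $& is not interpolated here.
printf '%s $&: %d matches in %.4f s' . "\n",
    ($see_ampersand ? 'with' : 'without'), $count, $elapsed;
```

Run it once with no argument and once with an argument of 1, on the same perl, and compare the two timings.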