Re: hyperoperators (was: Apocalypse)

2001-10-10 Thread Hugo van der Sanden

Alberto Simoes wrote:
:2) using ^ for mapping operators.. this only works with two lists.
:The problem here is that we have a pair of lists, and want a
:list of pairs. There can be other situations where we have
:three lists, instead of a list of triplets... I thought it was
:better to have a 'evidence' or 'factorize' for lists in a way
:((a,b,c),(1,2,3)) will become ((a,1),(b,2),(c,3)) and
:((1,2,3),(4,5,6),(7,8,9)) will become ((1,4,7),(2,5,8),(3,6,9)).
:This way, the ^ operator could be replaced with a simple map...
:More generic, less operators confusion... better? maybe...

I'm hoping we'll get the facility to add user-defined hyperoperators,
in which case it will be easy to add other list-manipulation
strategies in a generic manner. With a bit of luck, the commonest
such hyperoperators will get to be in a standard class that everyone
uses, rather than everyone going off to invent their own symbols.
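
For comparison, the regrouping Alberto describes can already be written
as an ordinary function. This is only a sketch: the name transpose() is
made up for illustration and is not part of any proposal, and it assumes
all the lists have the same length.

```perl
use strict;
use warnings;

# Hypothetical helper: turn a list of equal-length array refs into a
# list of tuples, so ([a,b,c],[1,2,3]) becomes ([a,1],[b,2],[c,3]).
sub transpose {
    my @lists = @_;
    return () unless @lists;
    my $len = @{ $lists[0] };
    return map {
        my $i = $_;
        [ map { $_->[$i] } @lists ];
    } 0 .. $len - 1;
}

my @pairs   = transpose([qw(a b c)], [1, 2, 3]);
my @triples = transpose([1, 2, 3], [4, 5, 6], [7, 8, 9]);

print join(" ", map { "(" . join(",", @$_) . ")" } @pairs), "\n";
print join(" ", map { "(" . join(",", @$_) . ")" } @triples), "\n";
```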

Hugo



Re: redraft (v2) for RFC 348 Regex assertions in plain Perl code

2000-10-01 Thread Hugo

In [EMAIL PROTECTED], Bart Lateur writes:
:Likely the most justifiable reason to want to be able to execute Perl code in a [...]

:This makes the implementation very tricky. I
:wouldn't be surprised if precisely this feature is the main reason why
:the current implementation is so notoriously unstable.

I'm not aware of any instability caused by this. The instability is
caused by various other factors, discussed at length on p5p.

:The fact that the embedded code is called 3 times, not more, surely
:surprised me. It probably will surprise many people. Apparently, it is
:only executed once for every lowercase letter, not just for any
:character.
:
:This unpredictability is yet another reason to discourage incrementally
:modifying global data structures.

I think this is precisely why the non-assertion form encourages use of
local() - in general, the local() constructs will have executed a
predictable number of times _that have not been unwound_ by the time
a successful match is achieved. I don't think this observation (of
mine) is particularly relevant to the proposal, however.
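
A sketch of the difference, adapted from the counting example in
perlre; the variable names are illustrative only. The local()ised
counter ends up reflecting only the iterations that survive
backtracking, while a plain increment also counts the iterations that
were later unwound.

```perl
use strict;
use warnings;

our $cnt;          # package variable, so that local() can be applied to it
my ($res, $runs);  # $runs counts every execution, unwound or not

$runs = 0;
my $matched = ('a' x 8) =~ m{
    (?{ $cnt = 0 })
    ( a (?{ local $cnt = $cnt + 1; $runs++ }) )*
    aaaa
    (?{ $res = $cnt })      # copy out before the localisations are unwound
}x;

print "iterations kept by the successful match: $res\n";
print "code block executions in total: $runs\n";
```

The loop has to give back four of the eight 'a's before the trailing
aaaa can match, so $res reflects four iterations even though the block
ran more often than that.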

:=head2 /(?(condition)yes-pattern|no-pattern)/

The simplest form of this is (?(1)yes|no). This is rather harder to
emulate with other mechanisms without running to eval. OTTOMH it is
equivalent to (??{ defined($1) ? 'yes' : 'no' }).
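
A small runnable sketch of that equivalence, with an invented pattern:
the closing character must be '>' if group 1 matched an opening '<',
and ';' otherwise.

```perl
use strict;
use warnings;

# (?(1)yes|no): expect ">" if group 1 matched "<", otherwise expect ";"
my $cond     = qr/^(<)?\w+(?(1)>|;)$/;
# The same test written as a postponed subexpression
my $deferred = qr/^(<)?\w+(??{ defined($1) ? '>' : ';' })$/;

for my $s ('<foo>', 'foo;', '<foo;', 'foo>') {
    printf "%-6s cond=%d deferred=%d\n", $s,
        ($s =~ $cond ? 1 : 0), ($s =~ $deferred ? 1 : 0);
}
```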

Hugo



More on RFC 93 (was Re: RFC 316 (v1) ...)

2000-09-30 Thread Hugo

In [EMAIL PROTECTED], Bart Lateur writes:
:Yes, but RFC 93 has some other disadvantages.

In respect of the number of calls, there seems nothing in RFC 93
to stop us permitting the callback to return more or fewer than the
requested number of characters. So a filehandle, for example, could
choose to return some multiple of 4K blocks for every request. A
socket connection that applies a line-based protocol would probably
read a line at a time, while another socket might return just those
characters available to read without blocking.

:Furthermore, where is the resulting buffer stored? People usually still
:want a copy of their data, to do yet other things with. Here, the data
:has disappeared into thin air. The only way to get it, is putting
:capturing parens in the regex.

It seems to me that $` and $& are the right solutions here. I assume
that perl6 will not allow this to cause an overreaching performance
problem. In this context we have the additional advantage that the
only copy of the accumulated string is owned by the regexp engine,
so no additional copy need be made to protect it.

:Compared to that, RFC 93 feels like a straightjacket. To me.

Strangely it feels uncommonly liberating to me.

:You may have to completely rewrite your script. So much for code reuse.

I don't believe that it need be so painful to take advantage of it
in existing code. We can ease that by providing a selection of
helpful ready-rolled routines for common tasks.

Hugo



Re: RFC 112 (v3) Assignment within a regex

2000-09-29 Thread Hugo

In [EMAIL PROTECTED], "Richard Proctor" writes:
:In general all assignments should wait to the very end, and then assign
:them all. [...] If the expression finally fails the localised values
:would unroll.

Ah, I hadn't anticipated that - I had assumed you would get whatever
was the last value set. Please can you make sure this is clearly
explained in the next version of the RFC?

Hugo



Re: RFC 348 (v1) Regex assertions in plain Perl code

2000-09-29 Thread Hugo

In [EMAIL PROTECTED], Perl6 RFC Librarian writes:
:=item assertion in Perl5
:
: (?(?{not COND})(?!))
: (?(?{not do { COND }})(?!))

Or (?(?{COND})|(?!)).

Migration could consider replacing detectable equivalents of such
constructs with the favoured new construct.
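
As a concrete instance of the first Perl5 form, here is a sketch that
accepts only numbers below 256. The pattern and test strings are
invented for illustration; the trailing (?!\d) is an addition so that
the engine cannot backtrack into a shorter run of digits.

```perl
use strict;
use warnings;

# Perl 5 emulation of the assertion "the captured number is below 256":
# the conditional takes the always-failing (?!) branch when the code is true.
my $byte = qr/(?<!\d)(\d+)(?!\d)(?(?{ $1 >= 256 })(?!))/;

for my $s ('255', '256', 'a 300 b 12') {
    if ($s =~ $byte) { print "'$s' -> matched $1\n" }
    else             { print "'$s' -> no match\n" }
}
```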

:"local" inside embedded code will no longer be supported, nor will
:conditional regexes. The Perl5 -> Perl6 translator should warn if it
:ever encounters one of these.

I'm not convinced that removing either of these is necessary to the
main thrust of the proposal. They may both still be useful in their
own right, and you seem to offer little evidence against them other
than that you don't like them.

I do like the idea of making (?{...}) an assertion, all the more
because we have a simple migration path that avoids unnecessarily
breaking existing scripts: wrap $code as '$^R = do { $code }; 1'.

If you want to remove support for 'local' in embedded code, it is
worth a full proposal in its own right that will explain what will
happen if people try to do that. (I think it will make perl
unnecessarily more complex to detect and disable it in this case.)
Similarly if you want to remove support for (?(...)) completely,
you need to address the utility and options for migration for all
the available uses of it, not just the one addressed by the new
handling of (?{...}).

Hugo



Re: RFC 308 (v1) Ban Perl hooks into regexes

2000-09-28 Thread Hugo

In [EMAIL PROTECTED], Tom Christiansen writes:
:I consider recursive regexps very useful:
:
: $a = qr{ (?> [^()]+ ) | \( (??{ $a }) \) };
:
:Yes, they're "useful", but darned tricky sometimes, and in
:ways other than simple regex-related stuff.  For example,
:consider what happens if you do
:
:my $regex = qr{ (?> [^()]+ ) | \( (??{ $regex }) \) };
:
:That doesn't work due to differing scopings on either side
:of the assignment.

Yes, this is a problem. But it bites people in other situations
as well:
  my $fib = sub { $_[0] < 2 ? 1 : $fib->($_[0] - 1) };

I haven't kept up with the non-regexp RFCs, but I hope someone
has suggested an alternative scoping that would permit these
cases to refer to the just-introduced variable. Perhaps we
should special-case qr{} and sub{} - I can't offhand think of
another area that suffers from this, and I don't think these
two areas would suffer from an inability to refer to the same-name
variable in an outlying scope.
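
For reference, the usual Perl 5 workaround is to split the declaration
from the assignment, so that the name is already in scope when the body
is compiled. A minimal sketch, with the Fibonacci body filled out and a
balanced-parens pattern standing in for the recursive regexp:

```perl
use strict;
use warnings;

# Declare first, assign second: the name is then visible inside the body.
my $fib;
$fib = sub { $_[0] < 2 ? 1 : $fib->($_[0] - 1) + $fib->($_[0] - 2) };

my $nested;
$nested = qr{ \( (?: (?> [^()]+ ) | (??{ $nested }) )* \) }x;

print "fib(10) = ", $fib->(10), "\n";
print "balanced\n"     if     '(a(b)c)' =~ /^$nested$/;
print "not balanced\n" unless '(a(b)c'  =~ /^$nested$/;
```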

A useful alternative might be a different special case. Plucking
random grammar, perhaps:
  my $regex = qr{ (?> [^()]+ ) | \( ^^ \) }x;

Certainly I think a simple self-reference is likely to be a
common enough use that it would help to avoid the full deferred
eval infrastructure, even when it works properly.
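
Later Perl 5 releases (5.10 onward) did grow a construct of this kind:
(?R) re-enters the whole pattern without any deferred eval. A sketch,
assuming such a perl; the ^^ spelling above remains hypothetical.

```perl
use strict;
use warnings;

# (?R) recurses into the whole pattern, so no variable needs to be in scope.
my $balanced = qr{ \( (?: (?> [^()]+ ) | (?R) )* \) }x;

for my $s ('(a(b)c)', '(a(b)c') {
    # Require the match to span the entire string.
    my $ok = $s =~ $balanced && $-[0] == 0 && $+[0] == length $s;
    print "$s: ", ($ok ? "balanced" : "not balanced"), "\n";
}
```

Note that (?R) always means "the whole pattern", so a regexp like this
is matched on its own here rather than interpolated into a larger one.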

:And clearly a non-regex approach could be more legible for
:recursive parsing.

Like any aspect of programming, if you use it regularly it will
become easier to read. And comments are a wonderful thing.

Hugo



Re: RFC 331 (v1) Consolidate the $1 and C<\1> notations

2000-09-28 Thread Hugo

:=item *
:/(foo)_$1_bar/
:
:=item *
:/(foo)_C<\1>_bar/

Please don't do this: write C</(foo)_\1_bar/> or /(foo)_\1_bar/, but
don't insert C<> in the middle: that makes it much more difficult to
read.

:mean different things:  the second will match 'foo_foo_bar', while the
:first will match 'foo[SOMETHING]bar' where [SOMETHING] is whatever was

should be: foo_[SOMETHING]_bar

:captured in the B<previous> match...which could be a long, long way away,
:possibly even in some module that you didn't even realize you were
:including (because it was included by a module that was included by a
:module that was included by a...). 

This seems a bit unfair. It is just another variable. Any variable
you include in a pattern, you are assumed to know that it contains
the intended value - there is nothing special about $1 in this regard.

:The key fact here is that, in the first section of a s/// you are supposed
:to use C<\1>, but in the second portion you are supposed to use $1.  If
:you understand the whole logical structure behind it and understand how an
:s/// works (i.e., the right hand side of an s/// is a double-quoted
:string, not a regex), you will understand the distinction.  For newbies,
:however, it is apt to be quite confusing.

I think the whole idea that the LHS of s/// is a pattern, but the
RHS is a string (modulo /e, of course) is apt to be confusing when
you first encounter it. You won't be able to make sense of any but
the simplest use of s/// until you understand it, I think, and the
documentation expresses it quite clearly.

:=item *
:${P1} means what $1 currently means (first match in last regex)

Do you understand that this is the same variable as $P1? Traditionally,
perl very rarely coopts variable names that start with alphanumerics,
and (off the top of my head) all the ones it does so coopt are letters
only (ARGV, AUTOLOAD, STDOUT etc). I think we need better reasons to
extend that to all $P1-style variables.

If you are suggesting that they should have a special meaning only
in regexps, and only if braced, then I'd find it even more confusing.
The use of braces is usually the easiest (and only?) way to split
out a variable from following alphanumerics:
  /foo${P1}bar/

:These changes eliminate a potential source of confusion, retain all
:functionality, provide an easy migration path for P526, and the last
:notation (${P1}) serves as a clear indicator that you are talking about
:something from outside the current regex.

What is the migration path for existing uses of $P1-style variables?

:=item *
:s/(bar)(bell)/${P1}$2/ # changes "barbell" to "foobell"

Note that in the current regexp engine, ${P1} has disappeared by the
time matching starts. Can you explain why we need to change this?
Note also that if you are sticking with ${P1} either we need to
rename all existing user variables of this form, or we can no longer
use the existing 'interpolate this string' (or eval, double-eval etc)
routines, and have to roll our own for this (these) as well.

:=head1 IMPLEMENTATION
:
:This may require significant changes to the regex engine, which is a topic
:on which I am not qualified to speak.  Could someone with more
:knowledge/experience please chime in?

Currently the regexp compiler is handed a string in which $variables
have already interpolated. We'd need to avoid that and get either
the raw data for the string or some list that has undergone a
minimum of preparation. It is possible we need that anyway - it is
a prerequisite for some of the other proposed enhancements (such as
the meta-referred-to RFC 112) and would certainly make the regexp
engine more flexible - but it is certainly substantial work. I don't
know what gotchas may arise. In general it seems a shame to recreate
large parts of the existing string parsing/interpolation code, but
it may not be possible to avoid it.

Changing the lifetime of backreferences feels likely to be difficult,
but it isn't clear to me what you are trying to achieve here. I think
you at least need to add an example of how it would act under s///g
and s///ge.

:=head1 REFERENCES
:
:RFC 112: Assignment within a regex
:
:RFC 276: Localising Paren Counts in qr()s.

I didn't see a mention of these in the body of the proposal.

To me, the prime issue is with \1. The backslash is heavily overloaded
in perl, and that makes it difficult to suggest a consistent and
legible extension that would allow us to refer back to either variables
(RFC 112) or hash keys (RFC 150). I don't think switching to $1 is any
help for those, though.

Hugo



Re: RFC 332 (v1) Regex: Make /$/ equivalent to /\z/ under the '/s' modifier

2000-09-28 Thread Hugo

In [EMAIL PROTECTED], Bart Lateur writes:
:I'll try to find that "thread" back.

This was my message:

  http://www.mail-archive.com/perl6-language-regex%40perl.org/msg00354.html

:I don't think changing /s is the right solution. I think this will
:incline people to try and fix their problems by adding /s, without
:realising that this changes the definition of every . in their
:regexp as well.
:
:Perhaps. I do think that, in general, textual data falls into one of
:three categories:
:
: * text with possibly embedded newlines
: * text with no embedded newlines
: * text with an irrelevant newline at the very end.
:
:The '/s' option is for the 1st case. No '/s' for the 3rd. As for #2: you
:don't care.

I'd distinguish the first case further into 'the newlines are
significant' or not - /s is often desired for the first case,
and /m often for the second. And then I'd be tempted to repeat
the whole list, replacing 'newline' with 'record separator'.

I have to say I'm quite prejudiced against /s - I consider myself
reasonably knowledgeable about regexps, but on average about once
a month I find myself unsure enough about which is /m and which
is /s that I need to check the top of perlre to be sure. I think
we've appreciated for some time that it was a mistake to name them
as if they were opposites, but if anything I'd like to reduce the
need for them rather than to increase it.

Hugo



Re: RFC 308 (v1) Ban Perl hooks into regexes

2000-09-26 Thread Hugo

In 005501c027eb$43bafe60$[EMAIL PROTECTED], "Michael Maraist" writes:
:As you said, we shouldn't encourage full-fledged execution (since core dumps
:are common).

Let's not redefine the language just because there are bugs to fix.
Surely it is better to concentrate first on fixing the bugs so that
we can then more fairly judge whether the feature is useful enough
to justify its existence.

:One restriction might be to disallow various op-codes within the reg-ex
:assertion.  Namely user-function calls, reg-ex's, and most OS or IO
:operations.

That seems quite unreasonable. Why do you _want_ to restrict someone
from calling isKeyword($1) within the regexp, which will then read
the keyword patterns from a file and check $1 against those patterns
using regexps? It seems like an entirely reasonable and useful thing
to do.

Hugo



Re: RFC 308 (v1) Ban Perl hooks into regexes

2000-09-26 Thread Hugo

In [EMAIL PROTECTED], Bart Lateur writes:
:On 25 Sep 2000 20:14:52 -, Perl6 RFC Librarian wrote:
:
:Remove C<?{ code }>, C<??{ code }> and friends.
:
:I'm putting the finishing touches on an RFC to drop (?{...}) and replace
:it with something far more localized, hence cleaner: assertions, also in
:Perl code. That way,
:
:   /(?<!\d)(\d+)(?{$1 < 256})/
:
:would only match integers between 0 and 255.

I'd like to suggest an alternative semantic for this: rename
(??{ code }) to (?{ code }), and use the newly freed (??{ code })
for the assertions. (I was about to write an RFC for just that, so
I'm glad I can save a bit of time. :)

Hugo



Re: RFC 308 (v1) Ban Perl hooks into regexes

2000-09-25 Thread Hugo

In [EMAIL PROTECTED], Perl6 RFC Librarian writes:
:It would be preferable to keep the regular expression engine as
:self-contained as possible, if nothing else to enable it to be used
:either outside Perl or inside standalone translated Perl programs
:without a Perl runtime.
:
:To do this, we'll have to remove the bits of the engine that call 
:Perl code. In short: C<?{ code }> and C<??{ code }> must die.

I would have thought it more reasonable, if you wish to create
standalone translated Perl programs without a Perl runtime, to fail
with a helpful error if you encounter a construct that won't permit
it. You'll need to remove chunks of eval() and do() as well,
otherwise, and probably more besides.

In the context of a more shareable regexp engine, I would like to
see (? and (?? stay, but they need to be implemented more cleanly.
You could handle them quite nicely, I think, with just three
well-defined external hooks: one to find the matching brace at the
end of the code, one to parse the code, and one to run the code.
Anyone wishing to re-use the regexp library could then choose either
to keep the default drop-in replacements for those hooks (that die)
or provide their own equivalents to the perl usage.

I consider recursive regexps very useful:

 $a = qr{ (?> [^()]+ ) | \( (??{ $a }) \) };

.. and I class re-eval in general in the arena of 'making hard
things possible'. But whether or not they stay, it would probably
also be useful to have a more direct way of expressing simple
recursive regexps such as the above without resorting to a costly
eval. When I've tried to come up with an appropriate restriction,
however, I find it very difficult to pick a dividing line.

Hugo



Re: RFC 308 (v1) Ban Perl hooks into regexes

2000-09-25 Thread Hugo

In [EMAIL PROTECTED], Perl6 RFC Librarian writes:
:=head1 ABSTRACT
:
:Remove C<?{ code }>, C<??{ code }> and friends.

Whoops, I missed this bit - what 'friends' do you mean?

Hugo



Re: Perlstorm #0040

2000-09-23 Thread Hugo

In [EMAIL PROTECTED], Richard Proctor writes
:
:TomCs perl storm has:
:
: Figure out way to do 
: 
: /$e1 $e2/
: 
: safely, where $e1 might have '(foo) \1' in it. 
: and $e2 might have '(bar) \1' in it.  Those won't work.
:
:If e1 and e2 are qr// type things the answer might be to localise 
:the backref numbers in each qr// expression.  
:
:If they are not qr//s it might still be possible to achieve if the expansion
:of variables in regexes is done by the regex compiler it could recognise
:this context and localise the backrefs.
:
:Any code like this is going to have real problem with $1 etc if used later,
:use of assignment in a regex and named backrefs (RFC 112) would make this
:a lot safer.

I think it is reasonable to ask whether the current handling of qr{}
subpatterns is correct:

perl -wle '$a=qr/(a)\1/; $b=qr/(b).*\1/; /$a($b)/g and print join ":", $1, pos for "aabbac"'
a:5

I'm tempted to suggest it isn't; that the paren count should be local
to each qr{}, so that the above prints 'bb:4'. I think that most people
currently construct their qr{} patterns as if they are going to be
handled in isolation, without regard to the context in which they are
embedded - why else do they override the embedder's flags if not to
achieve that?

The problem then becomes: do we provide a mechanism to access the
nested backreferences outside of the qr{} in which they were referenced,
and if so what syntax do we offer to achieve that? I don't have an answer
to the latter, which tempts me to answer 'no' to the former for all the
wrong reasons. I suspect (and suggest) that complication is the only
reason we don't currently have the behaviour I suggest the rest of the
semantics warrant - that backreferences are localised within a qr().

I lie: the other reason qr{} currently doesn't behave like that is that
when we interpolate a compiled regexp into a context that requires it be
recompiled, we currently ignore the compiled form and act only on the
original string. Perhaps this is also an insufficiently intelligent thing
to do.
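
A sketch of the numbering problem, together with one way to keep a
reference local to its own subpattern: the relative backreference
\g{-1}, which later Perl 5 releases (5.10 onward) provide. The variable
names are illustrative, and this is not the localised paren count
suggested above, only something close to it in effect.

```perl
use strict;
use warnings;

my $e1 = qr/(a)\1/;
my $e2 = qr/(b).*\1/;        # \1 is meant to refer to (b)
my $r2 = qr/(b).*\g{-1}/;    # relative form: "the group just before me"

# Absolute numbering: once embedded, \1 in $e2 means the (a) from $e1.
'aabbac' =~ /$e1($e2)/
    and print "absolute: '$2' ends at ", $+[0], "\n";

# Relative numbering keeps the reference inside its own subpattern.
'aabbac' =~ /$e1($r2)/
    and print "relative: '$2' ends at ", $+[0], "\n";
```

The second match covers 'bb' and ends at offset 4, which is the result
argued for above.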

Hugo



Re: \z vs \Z vs $

2000-09-20 Thread Hugo

In 12839.969393548@chthon, Tom Christiansen writes:
:What can be done to make $ work "better", so we don't have to
:make people use /foo\z/ to mean /foo$/?  They'll keep writing
:the $ for things that probably oughtn't abide optional newlines.
:
:Remember that /$/ really means /(?=\n?\z)/. And likewise with \Z.

It might be reasonable to redefine $ to mean the same as \z whenever
the /s flag is supplied. Another possibility would be to have a
scoped 'use re qw/simple_anchor/' pragma to achieve the same. And
another would be simply to switch the meaning of $ and \z.

None of these feel particularly satisfactory, however, and I think
any change to the current semantics would be difficult for existing
perl programmers.
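
For reference, the current Perl 5 behaviour being discussed can be
checked directly. A short sketch; the hash keys are just labels chosen
for this example.

```perl
use strict;
use warnings;

# All tests run against a string with one trailing newline.
my %matches = (
    'dollar'   => ("foo\n" =~ /foo$/          ? 1 : 0),
    'dollar_s' => ("foo\n" =~ /foo$/s         ? 1 : 0),
    'big_Z'    => ("foo\n" =~ /foo\Z/         ? 1 : 0),
    'small_z'  => ("foo\n" =~ /foo\z/         ? 1 : 0),
    'explicit' => ("foo\n" =~ /foo(?=\n?\z)/  ? 1 : 0),
);

printf "%-9s %d\n", $_, $matches{$_} for sort keys %matches;
```

Only \z refuses the string with the trailing newline; adding /s makes
no difference to $.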

Perhaps '$$' to mean 'match at end of string (without /m) or at end
of any line (with /m)'? The p52p6 translator can easily replace
references to $$ with ${$}. I can't think of a usefully different
meaning for ^^, but as currently defined it will already do the
right thing.

I don't know what proposals have come out of the other wgs, but if
we know when a variable has been read from a line-oriented input
medium, then we could turn on the special meaning of $ only in such
cases and define it as $$ above in all other cases. I think this
would be more confusing, though.

We could also consider changing the base definition to (?=($/)?\z),
particularly if $/ is to be seen as a regexp.

I think I like $$ the best.

Hugo



perl6-language-regex summary for 20000920

2000-09-20 Thread Hugo

perl6-language-regex

Summary report 20000920

Mark-Jason Dominus has relinquished the wg chair due to the pressure
of other commitments; I'll be taking over the chair for the short
time remaining. Thanks to Mark-Jason for all his hard work.

I'll be contacting the authors of all outstanding RFCs shortly to
encourage them to work towards freezing them as soon as practical.

Hugo


RFC 72: The regexp engine should go backward as well as
forward. (Peter Heslin)

Peter says (edited):
:If the regexp code is unlikely to be rewritten from the ground up, then
:there may be little chance of this feature being implemented. I'll make
:a pitch for it anyway at the end of my talk at YAPC::Europe, and then
:I'll freeze the RFC.

RFC 93: Regex: Support for incremental pattern matching  (Damian Conway)

Now frozen at v3 with no changes; I don't think there was a v2.

RFC 110: counting matches  (Richard Proctor)

Richard added my suggestions about the interaction between /t, /g
and \G, and froze the RFC soon after.

RFC 112: Assignment within a regex  (Richard Proctor)

No discussion.

RFC 138: Eliminate =~ operator.  (Steve Fink)

Withdrawn.

RFC 144: Behavior of empty regex should be simple  (Mark Dominus)

Frozen.

RFC 145: Brace-matching for Perl Regular Expressions  (Eric Roode)

No discussion directly about this RFC. The discussion of
XML/HTML-specific extensions continued for a short while, but has not
resulted in an RFC.

The closest we have to an emerging consensus appears to be that
it is very difficult to pin down a precise problem to solve - the
areas in which we want to match pairs of delimiters (such as
numeric expressions, C code, perl code, HTML and XML) each seem
to require a variety of special cases, each different from the
other.

RFC 150: Extend regex syntax to provide for return of a hash of
 matched subpatterns  (Kevin Walker)

One suggestion from me of (?\%key) for backreferencing, but no
substantive discussion.

RFC 158: Regular Expression Special Variables  (Uri Guttman)

No discussion.

RFC 164: Replace =~, !~, m//, s///, and tr// with match(), subst(),
 and trade()  (Nathan Wiger)

This RFC has now been frozen; the frozen version included some
rewording and a couple of additional explanatory notes, as well
as introducing a typo ('$gotis') in an example.

RFC 165: Allow variables in tr///  (Richard Proctor)

Surprisingly, no discussion.

RFC 166: Alternative lists and quoting of things  (Richard Proctor)

New version, with a new name (was 'Additions to regexs'). This RFC
is not currently available from the archive due to a misfiling, but
you'll find it here:
  http://www.mail-archive.com/perl6-language-regex@perl.org/msg00350.html

This removes two of the three original suggestions, and expands on
the remaining one. Mark-Jason pointed out that the (new) extension
to (?\Q$foo) is not needed.

RFC 170: Generalize =~ to a special-purpose assignment operator
 (Nathan Wiger)

Now frozen, with some modifications.

RFC 197: Numeric Value Ranges In Regular Expressions (David Nichol)
  
No discussion.

RFC 198: Boolean Regexes (Richard Proctor)

No discussion.

New RFCS

Of the other discussions that may still spawn a new RFC, most have been
mentioned previously. One new one: Tom Christiansen has asked '[w]hat
can be done to make $ work "better", so we don't have to make people
use /foo\z/ to mean /foo$/'.



Re: RFC 72 (v3) Variable-length lookbehind: the regexp engine should also go backward.

2000-09-17 Thread Hugo

mike mulligan writes:
:From: Hugo [EMAIL PROTECTED]
:Sent: Tuesday, September 12, 2000 2:54 PM
:
: 3. The regexp is matched left to right: first the lookbehind, then 'X',
: then '[yz]'.
:
:Thanks for the insight - I was stuck in my bad assumption that the optimized
:behavior was the only behavior.
:
:What I am not sure of is whether the "optimization" is ever dangerous.  In
:other words, is there ever a difference in end-result between, doing at each
:point: 1. test look-behind and then test the remainder of the regex, vs 2.
:test the remainder of the regex, and then test the look-behind?

Sometimes it may not be possible at all:
  "axbcxd" =~ /(?<= a(.)b ) c\1d/x;

:I am without a motivating example, but can anyone see utility in a
:non-greedy look-behind that operates in sense "2" above?   Syntax:
:(?<=pat)? (?<!pat)?  Currently, a question-mark like this on a
:look-behind makes it optional, defeating the assertion's purpose.  If anyone
:has a good example, I'll take on writing a RFC.

Currently, a question mark like this on a lookbehind is apparently
ignored:
  crypt% ./perl -wle '/(?<=test)?/'
  Quantifier unexpected on zero-length expression before << HERE mark in regex m/(?<=test)? << HERE / at -e line 1.
  Use of uninitialized value in pattern match (m//) at -e line 1.
  crypt% 

.. but I don't know why, since it could arguably be useful:
  / (?<= (\+|-) )? \d+ /x;
  print defined($1) ? "sign: '$1'\n" : "no sign\n";

Note that you can rewrite /(?<=[aeiou])X[yz]/ as /X[yz](?<=[aeiou]..)/
if you really want ...
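
A quick check that the two fixed-width lookbehind forms find the same
match, using the test string from earlier in this thread:

```perl
use strict;
use warnings;

my $s = 'aXuhXvoXyz';
my @starts;

# Lookbehind before the 'X', then the same test moved after 'X[yz]'.
for my $re (qr/(?<=[aeiou])X[yz]/, qr/X[yz](?<=[aeiou]..)/) {
    push @starts, $s =~ $re ? $-[0] : -1;
}

print "match offsets: @starts\n";   # 7 7
```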

Hugo



negative variable-length lookbehind example

2000-09-14 Thread Hugo

In RFC 72, Peter Heslin gives this example:
:Imagine a very long input string containing data such as this:
:
:... GCAAGAATTGAACTGTAG ...
:
:If you want to match text that matches /GA+C/, but not when it
:follows /G+A+T+/, you cannot at present do so easily.

I haven't tried to work it out exactly, but I think you can
achieve this (and fairly efficiently) with something like:
  /
    (?: ^ |                        # else we won't match at start
      (?: (?> G+ A+ T+) | (.) )*
      (?(1) | . )
    )
    G A+ C
  /x

This requires that the regexp engine reliably leaves $1 unset if
we took the G+A+T+ branch last time through the (...)*, which
has been an area of many bugs and no little discussion in perl5;
I'm not sure of the status of that in latest perls.

It isn't particularly relevant to this proposal since there are
other combinations that can't be resolved in this way; I thought
it might be of interest nonetheless.

Hugo



Re: RFC 72 (v3) Variable-length lookbehind: the regexp engine should also go backward.

2000-09-12 Thread Hugo

In 085601c01cc8$2c94f390$[EMAIL PROTECTED], "mike mulligan" writes:
:From: Hugo [EMAIL PROTECTED]
:Sent: Monday, September 11, 2000 11:59 PM
:
:
: mike mulligan replied to Peter Heslin:
: : ... it is greedy in the sense of the forward matching "*" or "+"
:constructs.
: : [snip]
:
: This is nothing to do with greediness and everything to do with
: left-to-rightness. The regexp engine does not look for x* except
: in those positions where the lookbehind has already matched.
:
:I was trying to understand at what point the lookbehind was attempted, and
:confused myself and posted a bad example.  My apologies to everyone.  Let's
:see if I can make sense of it on a second try.
:
:My question is: if I have the regex  /(?<=[aeiou])X[yz]+/  then does Perl: 1.
:scan first for 'X', test the lookbehind, and then test the '[yz]',  or 2.
:scan for 'X[yz]' and then test the lookbehind?

3. The regexp is matched left to right: first the lookbehind, then 'X',
then '[yz]'.

:I am expecting these two alternatives to give the same result, but certain
:test strings might run slower or faster depending on the approach.
:
:Running perl -Dr shows that alternative 1 is used:

Running perl -Dr shows that alternative 3 is used. However the -Dr data
is confused by the optimiser, which happens to have chosen the fixed
string 'X' as something worth searching for first. So the optimiser
permits the main matching engine to look only at those positions where
there is an 'X' immediately following.

I've annotated the -Dr output below to try and clarify. Note that if
you replace 'X' with '(x|X)', no optimisations take place (other than
a 'minimum length' check) and -Dr will give a much clearer picture of
the flow; again, if you replace 'X[yz]' with '(x|X)y' the optimiser
will now pick 'y' as the most significant thing worth searching for.

Hope this helps,

Hugo
---
:qq(aXuhXvoXyz) =~ /(?<=[aeiou])X[yz]/
:
:Guessing start of match, REx `(?<=[aeiou])X[yz]' against `aXuhXvoXyz'...

The optimiser is entered.

:Found anchored substr `X' at offset 1...

This is what the optimiser is looking for.

:Guessed: match at offset 1

This is what the optimiser found.

:Matching REx `(?<=[aeiou])X[yz]' against `XuhXvoXyz'

The real matcher is entered.

:  Setting an EVAL scope, savestack=3
:   1 <a> <XuhXvoXyz>  |  1:  IFMATCH[-1]
:   0 <> <aXuhXvoXyz>  |  3:ANYOF[aeiou]

Checking lookbehind ...

:   1 <a> <XuhXvoXyz>  | 12:SUCCEED

Ok.

:  could match...
:   1 <a> <XuhXvoXyz>  | 14:  EXACT <X>

Checking 'X' ...

:   2 <aX> <uhXvoXyz>  | 16:  ANYOF[yz]

Checking '[yz]' ...

:failed...

Failed: try the next position permitted by the optimiser.

:  Setting an EVAL scope, savestack=3
:   4 <aXuh> <XvoXyz>  |  1:  IFMATCH[-1]
:   3 <aXu> <hXvoXyz>  |  3:ANYOF[aeiou]

Checking lookbehind ...

:  failed...

Failed.

:failed...

Try the next position permitted by the optimiser.

:  Setting an EVAL scope, savestack=3
:   7 <aXuhXvo> <Xyz>  |  1:  IFMATCH[-1]
:   6 <aXuhXv> <oXyz>  |  3:ANYOF[aeiou]

Checking lookbehind ...

:   7 <aXuhXvo> <Xyz>  | 12:SUCCEED

Ok.

:  could match...
:   7 <aXuhXvo> <Xyz>  | 14:  EXACT <X>

Checking 'X' ...

:   8 <aXuhXvoX> <yz>  | 16:  ANYOF[yz]

Checking '[yz]' ...

:   9 <aXuhXvoXy> <z>  | 25:  END
:Match successful!

Match successful.



Re: RFC 158 (v1) Regular Expression Special Variables

2000-09-11 Thread Hugo

Mark-Jason Dominus writes:
: There's also long been talk/thought about making $& and $1 
: and friends magic aliases into the original string, which would
: save that cost.
:
:Please correct me if I'm mistaken, but I believe that that's the way
:they are implemented now.  A regex match populates the ->startp and
:->endp parts of the regex structure, and the elements of these items
:are byte offsets into the original string.

I went on a briefish trawl for this the other day, and as far as I
can tell what happens is this:
- during matching, the startp/endp pairs are populated with offsets
into the target string
- immediately after matching, the target string is copied if needed,
and the PL_curpm object is updated to refer to the copy
- the copy is needed if any of the special variables can be referred
to: $`, $&, $', $1, $2, ...

The result of that is that if there are backreferences in the regexp,
the copy is always needed; if not, the copy is needed only if $& or
her kin have been seen. So regexps with backrefs should suffer no
slowdown from use of $& in the same program, but regexps without
backrefs will get a (potentially) unnecessary copy.

The other problem with this, of course, is that the compiler may not
yet have seen the $& we intend to use:
  crypt% perl -wle '$_="foo"; /.*/; $_="bar"; print eval q{$&}'
  bar
  crypt% 
.. and I think coredumps may be possible from this. (Hmm, perlbug
upcoming.)
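
For comparison, the offsets themselves are visible from Perl through
the @- and @+ arrays, so the same substrings can be extracted without
mentioning $& at all. A minimal sketch with an invented target string:

```perl
use strict;
use warnings;

my $str = 'say foobar now';
my ($pre, $match, $post, $one);

if ($str =~ /o+(b)/) {
    # $-[N] and $+[N] are the start and end offsets of group N (0 = whole match)
    $pre   = substr($str, 0, $-[0]);                  # like $`
    $match = substr($str, $-[0], $+[0] - $-[0]);      # like $&
    $post  = substr($str, $+[0]);                     # like $'
    $one   = substr($str, $-[1], $+[1] - $-[1]);      # like $1
    print "[$pre][$match][$post][$one]\n";
}
```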

Hugo



all regexp RFCs

2000-09-08 Thread Hugo

Hi guys, I'm sorry that time has not permitted me to join and take an
active part in the perl6-language-regex list; however, I have grabbed
an opportunity to look through the RFCs generated to date, and thought
I should throw some comments at you.

Apologies in advance for so rudely dumping this lot and _still_ not
joining the list; sorry also if I duplicate stuff that's already
been said. Feel free to ignore all or any of this. You'll need to cc
me if you want me to see replies, and in that case you might want to
do what I didn't, and tailor the subject to be more specific.

I've tried in particular to add a note about implementation issues
in each case.

Enjoy,

Hugo
---
RFC 72: Variable-length lookbehind: the regexp engine should also go backward.
==

This is an interesting idea. However, it is not obvious to me that
there is any practical difference between the existing:
  /(?<= a+ ) b/x
.. and the proposed:
  /b (?`= a+ )/x
.. which implies that implementing one would be as difficult as the
other. And if that is the case, fixing (?<=...) to support variable
length would be preferable, since it is more general. (Consider
/\d+ (?<! 00) \. \d+/x, for example: AFAICS the proposed (?`=...)
does not allow the lookbehind to be anchored anywhere other than the
start of the match.)

While it would be great to have a working variable-length lookbehind,
it is not obvious how you would implement it: the internal structure
of a compiled regexp, as currently implemented, does not (I believe)
hold enough information to allow you to walk it backwards. It might
still be possible, though, with a fair amount of effort; you would,
for example, have to rewrite (?<= ([abc]) ([def]) g \2 \1 ) into
(?<= \1 \2 g ([def]) ([abc]) ), or maybe swap the \1 and \2.

RFC 93: Regex: Support for incremental pattern matching
==

I love this to bits. You might consider changing the arguments to
the fetcher($n;$s), such that if $n is positive it requests the
next $n characters, else it is a final call returning the -$n bytes
of $s to the stream. Not sure if this is any better than the current
proposal, but it might be easier to understand if the first argument
always represented a number of bytes.
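
To make the suggested calling convention concrete, here is a minimal
sketch of such a fetcher. Everything in it is an assumption for the
sake of illustration: make_fetcher is a hypothetical helper, the
"stream" is just an in-memory string, and the characters handed back on
the final call are taken to be the tail of $s.

```perl
use strict;
use warnings;

# make_fetcher is a hypothetical helper: it wraps an in-memory string
# standing in for a real input stream.
sub make_fetcher {
    my ($stream) = @_;
    return sub {
        my ($n, $s) = @_;
        if ($n > 0) {
            # Request: hand over (and consume) up to $n characters.
            return substr($stream, 0, $n, '');
        }
        # Final call: put the trailing -$n characters of $s back
        # (assumed here to be the unconsumed tail).
        $stream = substr($s, length($s) + $n) . $stream if $n < 0;
        return;
    };
}

my $fetch = make_fetcher("abcdefgh");
my $got   = $fetch->(5);     # consume five characters
$fetch->(-2, $got);          # final call: give the last two back
my $rest  = $fetch->(10);    # whatever is left in the stream
print "$got $rest\n";
```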

I do not think implementation should be too difficult, though I
assume all optimisation should be turned off for such matches. It
might also be desirable to have a new regexp flag 'no optimisation
desired' to avoid the compile-time work done for optimisation's
sake, for optimisation's sake. IYSWIM.

RFC 110: counting matches
===

I like this too. I'd suggest /t should mean a) return a scalar of
the number of matches and b) don't set any special variables. Then
/t without /g would return 0 or 1, but be faster since no extra
information need be captured (except internally for (.)\1 type
matching - compile time checks could determine if these are needed,
though (?{..}) and (??{..}) patterns would require disabling of
that optimisation). /tg would give a scalar count of the total
number of matches. \G would retain its meaning.

Any which way, implementation should be fairly straightforward,
though ensuring that optimisations occurred precisely when they
are safe would probably involve a few bug-chasing cycles.
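
For comparison, the usual way to count matches in current perl is the
list-assignment-in-scalar-context idiom, sketched below with a made-up
target string. Unlike the /t suggested above, it still builds the list
of matches and still sets the special variables.

```perl
use strict;
use warnings;

my $text = "aXuhXvoXyz";

# List assignment in scalar context yields the number of elements
# on the right-hand side, i.e. the number of matches found by /g.
my $count = () = $text =~ /X/g;

# Without /g, a match in scalar context is just true or false.
my $any = ($text =~ /X/) ? 1 : 0;

print "$count $any\n";
```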

RFC 112: Assignment within a regex
===

This is cool, and has been requested several times in the past.
There is an outstanding issue of how variable references should
be scoped when encountered within regexps, however. Consider:

  {
    local $a = 1;
    my $re = qr{ (?$a = .) }x;
    {
      my $a = 2;
      "3" =~ $re;
      print $a;
    }
    print $a;
  }

This is a problem that needs to be solved in any case, for proper
understanding of how (?{..}) and (??{..}) should be interpreted,
and I assume this proposed feature should be handled the same way.
Implementation should not be particularly difficult once that
knotty issue is resolved.

RFC 144: Behavior of empty regex should be simple
===

Absolutely. snip

RFC 145: Brace-matching for Perl Regular Expressions
===

This is an interesting idea. I'm not sure how useful it would
actually be: as far as I can see it would not match the block
on code such as:

  use matchpairs '{' => '}';
  <<EOF =~ /\m.*\M/;
  {
    my $brace = '{';
    ...
  }
  EOF

.. and most of the pair-matching patterns I've tried to write in
the past have needed to cope with embedded oddities such as
quoted-strings, comments etc.
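
The same trap can be demonstrated without the proposed \m and \M. The
sketch below swaps in a different technique, the recursive subpattern
(?1) that later perls (5.10 and up) provide, purely as a stand-in for
any naive brace matcher; the pattern and variable names are invented
for the example. The quoted '{' throws the pairing off, so the match
starts at the brace inside the string rather than at the block.

```perl
use strict;
use warnings;

# Naive balanced-brace matcher: knows nothing about quoting.
my $balanced = qr/ ( \{ (?: [^{}]++ | (?1) )* \} ) /x;

my $code = "{\n  my \$brace = '{';\n  ...\n}\n";

my ($start, $matched) = (-1, '');
if ($code =~ $balanced) {
    $start   = $-[0];    # offset where the "block" was found
    $matched = $1;
}

# The real block starts at offset 0; the matcher starts later,
# at the brace inside the quoted string.
print "match starts at offset $start\n";
```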

It might be useful to add some more complex examples to show
how you'd deal with such things. Another type of example that
would be useful is HTML parsing:
  <table border=1>
    <tr>stuff...</tr>
    <tr>stuff...
  </TABLE>
.. since it also isn't clear to me whether you'd be able to
extract the table contents, or the rows, using the mechanisms
of this proposal.

RFC 150: Extend regex syntax to provide for return of a hash of matched subpatterns
===

This is cool - I don't think I've seen this suggested before.

Implementation might be a bit more work: the back