
Re: what (?x) are in use? (was RFC 145 (alternate approach))

2000-09-11 Thread Mark-Jason Dominus



> In theory, all letters should be reserved to map to future flags for
> the same reason. 

My recollection is that Larry specifically mandated this, and that's
why (?p...) was changed to (??...) in 5.6.0.

what (?x) are in use? (was RFC 145 (alternate approach))

2000-09-11 Thread Hugo


[resent, 'cos I can't spell "perl6"]

Richard Proctor wrote:
:The whole (?x set of thingies is getting complicated...  The list of what is
:used at present (and in current suggestions is:
:
:Current Use in perl5
:
:(?# comment
:(?imsx  flags
:(?-imsx flags

That's actually (?iogcmsx and (?-iogcmsx. ('o' is ignored; I'm not sure
what if any effect 'g' and 'c' have, but it probably ain't pretty.)

In theory, all letters should be reserved to map to future flags for
the same reason. That's why we've been accumulating multi-punctuation
signifiers, so it may be time to go for a new paradigm with more room
for expansion: (+keyword) or (*keyword) would seem to be candidates.

Hugo

Re: XML/HTML-specific ?< and ?> operators? (was Re: RFC 145 (alternate approach))

2000-09-07 Thread David L. Nicol


Bart Lateur wrote:
> 
> On 06 Sep 2000 18:04:18 -0700, Randal L. Schwartz wrote:
> 
> >I think the -1 indexing for "end of array" came from there.  Or at
> >least, it was in Perl long before it was in Python, and it was in Icon
> >before it was in Perl, so I had always presumed Larry had seen Icon.
> >Larry?


I thought he got it from the substr function in CDC mainframe BASIC, in
which negative positions mean "from the end of the string".

Re: XML/HTML-specific ?< and ?> operators? (was Re: RFC 145 (alternate approach))

2000-09-07 Thread Michael Maraist

- Original Message -
From: "Jonathan Scott Duff" <[EMAIL PROTECTED]>
Subject: Re: XML/HTML-specific ?< and ?> operators? (was Re: RFC 145
(alternate approach))

> How about qy() for Quote Yacc  :-)  This stuff is starting to look
> more and more like we're trying to fold lex and yacc into perl.  We
> already have lex through (?{code}) in REs, but we have to hand-write
> our own yacc-a-likes.

Though you can do cool stuff in (?{code}), I wouldn't quite call it lex.
First off we're dealing with NFA instead of DFA, and at the very least, that
gives you back-tracking.  True, local's allow you to preserve state to some
degree.  But the following is as close as I can consider (?{code}) a lexer:

sub lex_init {
    my $str = shift;
    our @tokens;
    $str =~ / \G (?{ local @tokens; })
              (?: TokenDelim(\d+) (?{ push @tokens, [ 'digit', $1 ] })
                | TokenDelim(\w+) (?{ push @tokens, [ 'word', $1 ] })
              )
            /gx;
}

sub getNextToken {  shift @tokens; }

I'm not even suggesting this is a good design.  Just showing how awkward it
is.

Another problem with lexing in Perl is that you pretty much need the
entire string before you begin processing, while a good lexer only needs the
next character.  Ideally, the input is a character stream.  Already we're
talking about a lot of alteration and work here.  Not something I'd be crazy
about putting into the core.
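
For comparison, here is a rough sketch of how a small lexer is usually written in Perl 5 without (?{code}): ordinary code drives the loop and \G with /gc keeps the scan position.  The sub name tokenize and the two token types are only assumptions for illustration, and it still needs the whole string in memory, as noted above.

```perl
use strict;
use warnings;

# Hypothetical helper: split a string into [type, text] pairs.
sub tokenize {
    my ($str) = @_;
    my @tokens;
    pos($str) = 0;
    while (pos($str) < length $str) {
        # /c keeps pos() where it was when a match fails,
        # so each branch retries from the same offset.
        if    ($str =~ /\G\s+/gc)   { next }
        elsif ($str =~ /\G(\d+)/gc) { push @tokens, ['digit', $1] }
        elsif ($str =~ /\G(\w+)/gc) { push @tokens, ['word',  $1] }
        else  { die "Unexpected character at offset " . pos($str) . "\n" }
    }
    return @tokens;
}

for my $token (tokenize("foo 42 bar")) {
    print "$token->[0]: $token->[1]\n";
}
```

Because the control flow lives in Perl rather than inside the pattern, there is no backtracking over already-emitted tokens to worry about.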

-Michael

Re: XML/HTML-specific ?< and ?> operators? (was Re: RFC 145 (alternate approach))

2000-09-07 Thread Mark-Jason Dominus



> I think what is needed is something along the line of :

Joe McMahon and I are working on something along these lines.

Re: XML/HTML-specific ?< and ?> operators? (was Re: RFC 145 (alternate approach))

2000-09-07 Thread Jarkko Hietaniemi


On Thu, Sep 07, 2000 at 03:42:01PM -0400, Eric Roode wrote:
> Richard Proctor wrote:
> >
> >I think what is needed is something along the line of :
> >
> >   $re = qz{ '(' \$re ')'
> >| \$re \$re
> >| [^()]+
> >   };
> >   
> >Where qz is some hypothetical new quoting syntax
> 
> Well, we currently have qr{}, and ??{} does something like your \$re.
> 
> Warning: radical ideas ahead.
> 
> What would be useful, would be to leave REs the hell alone; they're 
> great as-is, and are only getting hairier and hairier. What would be
> useful, would be to create a new non-regular pattern-matching/parsing
> "language" within Perl, that combines the best of Perl REs, lex, 
> SNOBOL, Icon, state machines, and what have you. 

Agreed.  "Yet another quoting construct", "yet another \construct",
"yet another (? construct".  Argh, please, no.  Make all the above and
all we've learned from Parse::RecDescent et alia to collide at light
speed and see what new cool particles will spring forth.


-- 
$jhi++; # http://www.iki.fi/jhi/
# There is this special biologist word we use for 'stable'.
# It is 'dead'. -- Jack Cohen

Re: XML/HTML-specific ?< and ?> operators? (was Re: RFC 145 (alternate approach))

2000-09-07 Thread Damian Conway


   > What would be useful, would be to leave REs the hell alone; they're 
   > great as-is, and are only getting hairier and hairier.

Amen!
   
   > What would be useful, would be to create a new non-regular
   > pattern-matching/parsing "language" within Perl, that combines
   > the best of Perl REs, lex, SNOBOL, Icon, state machines, and what
   > have you.

Have you seen Parse::RecDescent? I think it -- and other parsing
modules such as Parse::Yapp and Parse::YALALR -- represents the
direction we should be heading.

Damian

Re: XML/HTML-specific ?< and ?> operators? (was Re: RFC 145 (alternate approach))

2000-09-07 Thread Jonathan Scott Duff

On Thu, Sep 07, 2000 at 08:20:42PM +0100, Richard Proctor wrote:
> I think what is needed is something along the line of :
> 
>$re = qz{ '(' \$re ')'
> | \$re \$re
> | [^()]+
>};
>
> Where qz is some hypothetical new quoting syntax

How about qy() for Quote Yacc  :-)  This stuff is starting to look
more and more like we're trying to fold lex and yacc into perl.  We
already have lex through (?{code}) in REs, but we have to hand-write
our own yacc-a-likes.

-Scott
-- 
Jonathan Scott Duff
[EMAIL PROTECTED]

Re: XML/HTML-specific ?< and ?> operators? (was Re: RFC 145 (alternate approach))

2000-09-07 Thread Eric Roode

Richard Proctor wrote:
>
>I think what is needed is something along the line of :
>
>   $re = qz{ '(' \$re ')'
>| \$re \$re
>| [^()]+
>   };
>   
>Where qz is some hypothetical new quoting syntax

Well, we currently have qr{}, and ??{} does something like your \$re.

Warning: radical ideas ahead.

What would be useful, would be to leave REs the hell alone; they're 
great as-is, and are only getting hairier and hairier. What would be
useful, would be to create a new non-regular pattern-matching/parsing
"language" within Perl, that combines the best of Perl REs, lex, 
SNOBOL, Icon, state machines, and what have you. 

$string =~ qz< $start_num           F:No_Num     # Start with a number
-(\d|) => $end_num                  F:One_Num    # Look for an end num
{$start_num > $end_num?}            S:Got_Range  # comparison

No_Num: (\$\w+) => $start_var   S:Its_A_Var
exit (status=error yada yada)

One_Num: etc etc

PARSE_THE_HELL_OUT_OF_THIS

This new sub-language would, like SNOBOL, allow you to piece together
patterns into powerful expressions, would be more readable than Perl's
line-noise expressions (which are only getting worse with (?this) and
(?that)!), would allow recursive processing, maybe even looping.

Now _that_ would rule :-)

 --
 Eric J. Roode,  [EMAIL PROTECTED]   print  scalar  reverse  sort
 Senior Software Engineer            'tona ', 'reh', 'ekca', 'lre',
 Myxa Corporation                    '.r', 'h ', 'uj', 'p ', 'ts';

Re: XML/HTML-specific ?< and ?> operators? (was Re: RFC 145 (alternate approach))

2000-09-07 Thread Richard Proctor


On Wed 06 Sep, Mark-Jason Dominus wrote:
> 
> I've been thinking the same thing.  It seems to me that the attempts to
> shoehorn parsers into regex syntax have either been unsuccessful
> (yielding an underpowered extension) or illegible or both.
> 
>SNOBOL: 
> parenstring = '(' *parenstring ')'  
> | *parenstring *parenstring
> | span('()')
> 
> 
> This is not exactly the same, but I tried a direct translation:
> 
>  $re = qr{ \( (??{$re}) \)
>  | (??{$re}) (??{$re})
>  | (?> [^()]+)
>  }x;
> 

I think what is needed is something along the line of :

   $re = qz{ '(' \$re ')'
| \$re \$re
| [^()]+
   };
   
Where qz is some hypothetical new quoting syntax

Richard

-- 

[EMAIL PROTECTED]

Re: XML/HTML-specific ?< and ?> operators? (was Re: RFC 145 (alternate approach))

2000-09-07 Thread Bart Lateur

On 06 Sep 2000 18:04:18 -0700, Randal L. Schwartz wrote:

>I think the -1 indexing for "end of array" came from there.  Or at
>least, it was in Perl long before it was in Python, and it was in Icon
>before it was in Perl, so I had always presumed Larry had seen Icon.
>Larry?

Do not assume that these are the only languages that exist. There must
be hundreds of languages; see the famous "Free Compilers" list
(). At least a few of these do
support -1 for last array index.

p.s. Shall I bring up the "@array[2 .. -1] should do the proper thing"
requested feature again? Oops, I just did. I think implementing this
basically requires lazy evaluation of the (2 .. -1) thing, so when it
eventually needs to be turned into a list of numbers, [a] it is aware of
the fact that it's in a "list indexing context", and [b] it knows the
number of list items.

And yes, some of the other languages do properly support this feature.
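
To make the request concrete, a small sketch of what Perl 5 does today (the variable names are just for illustration): the range 2 .. -1 is evaluated before the slice sees it and is empty, so the last index has to be spelled out.

```perl
use strict;
use warnings;

my @array = (10, 20, 30, 40, 50);

# 2 .. -1 counts upward from 2 and never reaches -1, so it is an empty list.
my @empty = @array[2 .. -1];

# Today's idiom: name the last index explicitly.
my @tail = @array[2 .. $#array];

# A single negative subscript already counts from the end.
my $last = $array[-1];

print scalar(@empty), " elements; tail = @tail; last = $last\n";
```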

-- 
Bart.

Re: XML/HTML-specific ?< and ?> operators? (was Re: RFC 145 (alternate approach))

2000-09-06 Thread Randal L. Schwartz

> "Jarkko" == Jarkko Hietaniemi <[EMAIL PROTECTED]> writes:

>> "You want Icon, you know where to find it..." :)

Jarkko> Hey, it's one of the few languages we haven't yet stolen a
Jarkko> neat feature or few from...  (I don't really count the few
Jarkko> regex thingies as full-fledged stealing, more like an
Jarkko> experimental sleight-of-hand.)

I think the -1 indexing for "end of array" came from there.  Or at
least, it was in Perl long before it was in Python, and it was in Icon
before it was in Perl, so I had always presumed Larry had seen Icon.
Larry?

-- 
Randal L. Schwartz - Stonehenge Consulting Services, Inc. - +1 503 777 0095
<[EMAIL PROTECTED]> http://www.stonehenge.com/merlyn/>
Perl/Unix/security consulting, Technical writing, Comedy, etc. etc.
See PerlTraining.Stonehenge.com for onsite and open-enrollment Perl training!

Re: XML/HTML-specific ?< and ?> operators? (was Re: RFC 145 (alternate approach))

2000-09-06 Thread Jarkko Hietaniemi


On Wed, Sep 06, 2000 at 03:47:57PM -0700, Randal L. Schwartz wrote:
> > "Mark-Jason" == Mark-Jason Dominus <[EMAIL PROTECTED]> writes:
> 
> Mark-Jason> I have some ideas about how to do this, and I will try to
> Mark-Jason> write up an RFC this week.
> 
> "You want Icon, you know where to find it..." :)

Hey, it's one of the few languages we haven't yet stolen a neat
feature or few from...  (I don't really count the few regex thingies
as full-fledged stealing, more like an experimental sleight-of-hand.)

> But yes, a way that allows programmatic backtracking sort of "inside out"
> from a regex would be nice.

-- 
$jhi++; # http://www.iki.fi/jhi/
# There is this special biologist word we use for 'stable'.
# It is 'dead'. -- Jack Cohen

Re: XML/HTML-specific ?< and ?> operators? (was Re: RFC 145 (alternate approach))

2000-09-06 Thread Mark-Jason Dominus

> > "Mark-Jason" == Mark-Jason Dominus <[EMAIL PROTECTED]> writes:
> 
> Mark-Jason> I have some ideas about how to do this, and I will try to
> Mark-Jason> write up an RFC this week.
> 
> "You want Icon, you know where to find it..." :)

That's exactly my motivation.  It seems to me that trying to cram Icon
into regexes isn't working well, but that a small transplant of Icon
into the core language might suffice instead of a lot of cramming.

Re: XML/HTML-specific ?< and ?> operators? (was Re: RFC 145 (alternate approach))

2000-09-06 Thread Randal L. Schwartz


> "Mark-Jason" == Mark-Jason Dominus <[EMAIL PROTECTED]> writes:

Mark-Jason> I have some ideas about how to do this, and I will try to
Mark-Jason> write up an RFC this week.

"You want Icon, you know where to find it..." :)

But yes, a way that allows programmatic backtracking sort of "inside out"
from a regex would be nice.

-- 
Randal L. Schwartz - Stonehenge Consulting Services, Inc. - +1 503 777 0095
<[EMAIL PROTECTED]> http://www.stonehenge.com/merlyn/>
Perl/Unix/security consulting, Technical writing, Comedy, etc. etc.
See PerlTraining.Stonehenge.com for onsite and open-enrollment Perl training!

Re: XML/HTML-specific ?< and ?> operators? (was Re: RFC 145 (alternate approach))

2000-09-06 Thread Mark-Jason Dominus



> >...My point is that I think we're approaching this
> >the wrong way.  We're trying to apply more and more parser power into what
> >classically has been the lexer / tokenizer, namely our beloved
> >regular-expression engine.

I've been thinking the same thing.  It seems to me that the attempts to
shoehorn parsers into regex syntax have either been unsuccessful
(yielding an underpowered extension) or illegible or both.

An approach that appears to have been more successful is to find ways
to integrate regexes *into* parser code more effectively.  Damian
Conway's Parse::RecDescent module does this, and so does SNOBOL.

In SNOBOL, if you want to write a pattern that matches balanced
parentheses, it's easy and straightforward and legible:

parenstring = '(' *parenstring ')'  
| *parenstring *parenstring
| span('()')


(span('()') is like [^()]* in Perl.)

The solution in Parse::RecDescent is similar.

Compare this with the solutions that work now:

 # man page solution
 $re = qr{
  \(
(?:
   (?> [^()]+ )# Non-parens without backtracking
 |
   (??{ $re }) # Group with matching parens
 )*
  \)
}x;

This is not exactly the same, but I tried a direct translation:

 $re = qr{ \( (??{$re}) \)
 | (??{$re}) (??{$re})
 | (?> [^()]+)
 }x;

and it looks worse and dumps core.  

This works:

qr{
  ^
  (?{ local $d=0 })
  (?:   
  \(
  (?{$d++}) 
   |  
  \)
  (?{$d--})
  (?
(?{$d<0})
(?!) 
  )  
   |  
  (?> [^()]* )
  
  )* 


  (?
(?{$d!=0})  
(?!)
  )
 $
}x;

but it's rather difficult to take seriously.
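
As a point of comparison, here is a sketch that runs the man-page pattern next to the same depth-counting idea written as ordinary Perl.  The names $whole and is_balanced are assumptions made for this example, and the two are not equivalent in general: the anchored pattern wants the whole string to be one parenthesized group, while the loop only checks that the parentheses balance.

```perl
use strict;
use warnings;

# Pattern from the man page; $re is declared first so it can refer to itself.
my $re;
$re = qr{
    \(
    (?:
        (?> [^()]+ )    # non-parens without backtracking
      |
        (??{ $re })     # nested group with matching parens
    )*
    \)
}x;

# Hypothetical wrapper: the whole string must be a single balanced group.
my $whole = qr{ \A (??{ $re }) \z }x;

# Hypothetical helper: the depth counter as plain Perl code.
sub is_balanced {
    my ($str) = @_;
    my $depth = 0;
    while ($str =~ /([()])/g) {
        $depth += $1 eq '(' ? 1 : -1;
        return 0 if $depth < 0;     # a ')' with no matching '('
    }
    return $depth == 0 ? 1 : 0;
}

for my $s ('(a(b)c)', '(a(b)c', '())(') {
    printf "%-8s regex=%d loop=%d\n",
        $s, ($s =~ $whole ? 1 : 0), is_balanced($s);
}
```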

The solution proposed in the recent RFC 145:

/([^\m]*)(\m)(.*?)(\M)([^\m\M]*)/g

is not a lot better.  David Corbin's alternative looks about the same.

On a different topic from the same barrel, we just got a proposal that
([23,39]) should match only numbers between 23 and 39.  It seems to me
that rather than trying to shoehorn one special-purpose syntax after
another into the regex language, which is already overloaded, that it
would be better to try to integrate regex matching better with Perl
itself.  Then you could use regular Perl code to control things like
numeric ranges.  

Note that at present, you can get the effect of [(23,39)] by writing
this:

(\d+)(?(?{$1 < 23 || $1 > 39})(?!))

which isn't pleasant to look at, but I think it points in the right
direction, because it is a lot more flexible than [(23,39)].  If you
need to fix it to match 23.2 but not 39.5, it is straightforward to do
that:  
  
(\d+(\.\d*)?)(?(?{$1 < 23 || $1 > 39})(?!))

The [(23,39)] notation, however, is doomed.  All you can do is
propose Yet Another Extension for Perl 7.
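
A small sketch of the two styles side by side, assuming we want to accept a string only if it is a whole number from 23 to 39.  The names $in_range and in_range are made up for the example.  The anchors matter: without them the engine can backtrack to a shorter run of digits, or start later in the string, and find a piece that happens to be in range (the "23" inside "123", say).

```perl
use strict;
use warnings;

# Regex-only version: the conditional forces a failure when $1 is out of range.
my $in_range = qr/\A(\d+)(?(?{ $1 < 23 || $1 > 39 })(?!))\z/;

# Same test with the range check moved out into ordinary Perl.
sub in_range {
    my ($s) = @_;
    return $s =~ /\A(\d+)\z/ && $1 >= 23 && $1 <= 39;
}

for my $s (qw(22 23 30 39 40 123)) {
    printf "%-4s regex=%s perl=%s\n",
        $s,
        ($s =~ $in_range ? 'yes' : 'no'),
        (in_range($s)    ? 'yes' : 'no');
}
```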

The big problem with 

(\d+)(?(?{$1 < 23 || $1 > 39})(?!))

is that it is hard to read and understand.

The real problem here is that regexes are single strings.  When you
try to compress a programming language into a single string this way,
you end up with something that looks like Befunge or TECO.  We are
going in the same direction here.

Suppose there were an alternative syntax for regexes that did *not*
require that everything be compressed into a single string?  Rather
than trying to pack all of Perl into the regex syntax, bit by bit,
using ever longer and more bizarre punctuation sequences, I think a
better solution would be to try to expose the parts of the regex
engine that we are trying to control.

I have some ideas about how to do this, and I will try to write up an
RFC this week.

Re: XML/HTML-specific ?< and ?> operators? (was Re: RFC 145 (alternate approach))

2000-09-06 Thread David Corbin


Jonathan Scott Duff wrote:
> 
> On Wed, Sep 06, 2000 at 08:40:37AM -0700, Nathan Wiger wrote:
> > What if we added special XML/HTML-parsing ?< and ?> operators?
> 
> What if we just provided deep enough hooks into the RE engine that
> specialized parsing constructs like these could easily be added by
> those who need them?
> 

In principle, that's a very Perlish thing to do...

> -Scott
> --
> Jonathan Scott Duff
> [EMAIL PROTECTED]

-- 
David Corbin
Mach Turtle Technologies, Inc.
http://www.machturtle.com
[EMAIL PROTECTED]

Re: RFC 145 (alternate approach)

2000-09-06 Thread Michael Maraist

- Original Message -
From: "Richard Proctor" <[EMAIL PROTECTED]>
Sent: Tuesday, September 05, 2000 1:49 PM
Subject: Re: RFC 145 (alternate approach)

> On Tue 05 Sep, David Corbin wrote:
> > Nathan Wiger wrote:
> > > But, how about a new ?m operator?
> > >/(?m<<|[).*?(?M>>|])/;
> There already is a (?m
> Current Use in perl5
> (?# comment
> (?imsx flags
> (?-imsx flags
> (?: subexpression without bracket capture
> (?= zero-width positive look ahead
> (?! zero width negative look ahead
> (?<= zero-width positive look behind
> (?<! zero-width negative look behind
> (?{code} Execute code
> (??{code} Execute code and use result as pattern
> (?> Independent subexpression
> (?(condition)yes-pattern
> (?(condition)yes-pattern|no-pattern
>
> Suggested in RFCs either current or in development
>
> (?$foo= suggested for assignment (RFC 112)
> (?%foo= suggested for hash assignment (RFC 150?)
>
> (?@foo suggested list expansion (?:$foo[0] | $foo[1] | ...) ? (RFC 166)
> (?Q@foo) Quote each item of lists (RFC 166)
> (?^pattern) matches anything that does not match pattern
> (RFC 166 but will be somewhere else on next rewrite [1])
> (?F Failure tokens (RFC in development by me [1])
> (?r),(?f) Suggested in Direction Control RFC 1
> (?& Boolean regexes (RFC in development [1])
> (?*{code}) Execute code with pass/fail result (RFC in development [1])
>
> a,b,c,d,e, ,g,h, ,j,k,l, ,n,o,p,q, , ,t,u,v,w,x,y,z
> A,B,C,D,E, ,G,H,I,J,K,L,M,N,O,P, ,R,S,T,U,V,W,X,Y,Z
> 0,1,2,3,4,5,6,7,8,9
> `_,."+[];'~)

Ok, I've read through some of the archives, and thought this was a good
starting point.
I haven't seen any discussion on an obvious solution (though in another
email, I suggested that this approach should be foregone in favor of a
parsing approach.. But one thing at a time).

There are two general problems as I see it.  First, you have to be able to
specify exactly what you're matching.  Obviously generically matching "[<(`"
etc is going to be upset if your nesting has simple things like " a < 5 " or
"I'm going home, it's hot".  A design goal, therefore should be to
explicitly state the matching characters.  Second, you need to be able to
apply additional expression-syntax to match inside the nesting.

An additional problem occurs when you suggest using pragmas to specify
delimiters.  It could be a performance hit, if not a developer's nightmare.
When I run eval, must I always set the pragma, just in case there is some
weird scoping problem?  Same problem as when using all global variables (and
the 'local' keyword.  God I hate that thing).

Therefore, I suggest a commonly used form:

/(?N [ { ] . )/x

Note that I use N which stands for nesting instead of the redundant 'M'atch.
I don't know how well character-based op-codes will be accepted.  As pointed
out above, the symbol-space is shrinking fast.

The dots describe further matching / capturing within the delimiters.  Thus
/A (?N [ { ] ) B/x
will match 'A' followed by a bracket grouping (anything therein is fine),
then followed by 'B'.

/A (?N [ { ] ( .* ) ) B/x
does the same as above, but captures the internal contents (excluding the
delimiters).

/A ( (?N [ { ]  ) ) B/x
will capture all the contents, including the delimiters.

/A (?N [ [ ( ]  ( .* )  ) B/x
Same as before, but with squares and parentheses.  Note delim specifiers can
obey the same rules as normal character classes, thus [ [ ( { < ] means
collect the entire group.  POSIX classes can be used for all of them, as in
[=open_braces=] (don't care what the phrase actually is).  The reason I
chose this is because we are essentially doing a character class, so we
might as well explicitly use one; it makes more logical sense.  Note that to
make emacs happy, you should be able to escape all the one-way delimiters,
as in [ \[ \( \{ \< ].  That might also make it easier to read, explicitly
showing that these are being treated as characters, and not as actual
operators.

As for special operations such as (/* ... */ ), then I would recommend the
usage of named-character classes.  [=c_comment=], for example.  I'm not sure
how those classes are defined, but this obviously requires the system to be
extensible (RFC anyone?).  Course this violates my issue of using pragmas to
alter the operation of reg-ex's.  Most likely only built-in types should
work.

Another feature could be to treat the end of matching-brace as an
end-of-line.  Thus the above .* will properly exit.  If this turns out to
not work, then .* can necessarily be replaced by .*?.  The advantage of this
is in nested expressions, as in:

$r_kw = qr/Keyword \s* .* /x;
$r_lisp_expr = qr/ (?N [ ( ] $r_kw ) /x;
$line = <>;
$line =~ $r_lisp_expr;

But this would also have worked wit

Re: XML/HTML-specific ?< and ?> operators? (was Re: RFC 145 (alternate approach))

2000-09-06 Thread Tom Christiansen


>I am working on an RFC
>to allow boolean logic ( && and || and !) to apply a number of patterns to
>the same substring to allow easier mining of information out of such
>constructs. 

What, you don't like: :-)

$pattern = $conjunction eq "AND"
    ? join(''  => map { "(?=.*$_)" } @patterns)
    : join("|" => @patterns);
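
A runnable sketch of the same trick, written with the ? : conditional operator.  The sub name combine_patterns and the sample patterns are assumptions for illustration.  Each (?=.*PAT) lookahead has to succeed from the same starting position, which is what gives the AND behaviour.

```perl
use strict;
use warnings;

# Hypothetical helper: build one pattern string from several.
# Patterns that contain a top-level | would need wrapping in (?:...) first.
sub combine_patterns {
    my ($conjunction, @patterns) = @_;
    return $conjunction eq 'AND'
        ? join('',  map { "(?=.*$_)" } @patterns)
        : join('|', @patterns);
}

my $and = combine_patterns('AND', 'foo', 'bar');
my $or  = combine_patterns('OR',  'foo', 'bar');

for my $text ('bar then foo', 'only foo', 'neither') {
    printf "%-14s AND=%s OR=%s\n",
        $text,
        ($text =~ /$and/ ? 'yes' : 'no'),
        ($text =~ /$or/  ? 'yes' : 'no');
}
```

Without /s the .* in each lookahead stops at a newline, so this only combines patterns that match within one line.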

--tom

Re: XML/HTML-specific ?< and ?> operators? (was Re: RFC 145 (alternate approach))

2000-09-06 Thread Richard Proctor

On Wed 06 Sep, David Corbin wrote:
> Nathan Wiger wrote:
> > 
> > > It would be useful (and increasingly more common) to be able to match
> > > qr|<\s*(\w+)([^>]*)>| to qr|<\s*/\1\s*>|, and handle the case where
> > > those can nest as well.  Something like
> > >
> > > match this with
> > >
> > >   not this but
> > >this.
> > 
> > I suspect this is going to need a ?[ and ?] of its own. I've been
> > thinking about this since your email on the subject yesterday, and I
> > don't see how either RFC 145 or this alternative method could support
> > it, since there are two tags - > and  > asymmetrically, and neither approach gives any credence to what's
> > contained inside the tag. So  would be matched itself as "< matches
> > >".
> 
> Actually, in one of my responses I did outline a syntax which would handle
> this with reasonable ease, I think.  If the contents of (?[) is considered
> a pattern, then you can define a matching pattern.

I think it should be a list of patterns rather than a single pattern.

Each pattern in the list is attempted left to right until one matches.  I now
don't think it should be a hash as it needs to be ordered.  But using the =>
as the l/r separator does make it clear.

> 
> m:(?['<\w+>' => '').*(?]):
> 
> 
> I'll grant you it's not the simplest syntax, but it's a lot simpler than
> using the 5.6 method... :)

Actually that simple case is handled as m:<(\w+)>.*: but I 
think this is getting somewhere.  This is a rich syntax that has lots of
potential uses, not just for html.
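
For reference, a sketch of how far a plain backreference gets in Perl 5 today, built from the qr|<\s*(\w+)([^>]*)>| and qr|<\s*/\1\s*>| pieces quoted above.  The name $tag_pair and the sample strings are assumptions for the example.  It pairs an opening tag with the first closing tag of the same name, so it does not handle nesting of same-named elements, which is the case the (?[ ... (?]) proposals are aimed at.

```perl
use strict;
use warnings;

# $1 = tag name, $2 = everything up to the first matching close tag.
# The \b stops \w+ from matching only a prefix of a longer tag name.
my $tag_pair = qr{<\s*(\w+)\b[^>]*>(.*?)<\s*/\1\s*>}s;

for my $html ('<b class="x">bold</b>', '<b>bold</i>', '<div><div>x</div></div>') {
    if ($html =~ $tag_pair) {
        print "$html: tag=$1 content=$2\n";
    }
    else {
        print "$html: no match\n";
    }
}
```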

> > 
> > What if we added special XML/HTML-parsing ?< and ?> operators?
> > Unfortunately, as Richard notes, ?> is already taken, but I will use it
> > for the examples to make things symmetrical.
> > 
> >?<  =  opening tag (with name specified)
> >?>  =  closing tag (matches based on nesting)

We are running out of (? syntax, we might want to find some other construct
before long.  But anyway, XML/HTML is important, but I am not convinced
that what is being covered here really helps.  I am working on an RFC
to allow boolean logic ( && and || and !) to apply a number of patterns to
the same substring to allow easier mining of information out of such
constructs. 

Richard

-- 

[EMAIL PROTECTED]

Re: XML/HTML-specific ?< and ?> operators? (was Re: RFC 145 (alternate approach))

2000-09-06 Thread Tom Christiansen


>...My point is that I think we're approaching this
>the wrong way.  We're trying to apply more and more parser power into what
>classically has been the lexer / tokenizer, namely our beloved
>regular-expression engine.

>A great deal of string processing is possible with Perl's enhanced NFA
>engine, but at some point we're looking at perl code that is inside out: all
>code embedded within a reg-ex.  That, boys and girls, is a parser, and I'm
>not convinced it's the right approach for rapid design, and definitely not
>for large-scale robust design.

What you say has, I think, a great deal of sense.  While Jon and
I--with Nathan, actually (see inside page credits)--were trying to
figure out how to go about presenting all this wacky stuff for the
final section of the new regex chapter in the Camel:

Fancy Patterns
Lookaround Assertions
Non-Backtracking Subpatterns
Programmatic Patterns
Generated patterns
Substitution evaluations
Match-time code evaluation
Match-time pattern interpolation
Conditional interpolation
Defining Your Own Assertions

We kept coming back to sentiments remarkably similar to those you
yourself have just expressed: although I think we managed to put a
decently positive shine on the matter for the print version, it
still really seems that the inside-outness of this is very
hard on your brain, and of remarkably abstruse appeal to the
incredibly few.  (Names of the usual suspects omitted to avoid using
four-letter words in public forums. :-)

I would welcome a less inside-out approach, as well as one that
were more procedural--or at least more symbolic and less punctuational.

--tom

Re: XML/HTML-specific ?< and ?> operators? (was Re: RFC 145 (alternate approach))

2000-09-06 Thread Nathan Wiger

David Corbin wrote:
>
> m:(?['' => '').*(?]):
> 
> or more generically
> 
> m:(?['<\w+>' => '').*(?]):

I think these are good; but I do also like the idea of "automatic
reversing" by default, since that's a common operation.

Let's combine the ideas, as Richard suggests. How about:

   1. When a scalar value is provided as the argument
  to ?[, then that value is automatically reversed
  character-wise and bracket-wise.

   2. When a list is provided, each pair in the list
  is what to match.

So here are some examples:

   m/(?[<<<)Some stuff(?])/;# <<>>
   m/\@(?[{[)weird perl(?])/;   # @{[weird perl]}

   m/(?['<\w+>' => '').*(?])/;  # Text

   # less verbose, more robust
   my @tag = qw('<(\w+)\s*.*?>' => '');
   m/(?[@tag)Some title(?])(?[@tag)Open(?[@tag)Embedded(?])(?]);

That last one would match

   Some title

  Open
  Embedded

So really, all RFC 145 needs to do is introduce ?[ and ?], which do a
couple things by default (like brace-matching and character reversing),
but are actually general-purpose nesting operators when provided with a
list of things to match.
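
Rule 1 is easy to prototype in ordinary Perl.  This is only a sketch of the "reverse character-wise and bracket-wise" default, with a made-up helper name (closing_for) and a bracket set limited to the four ASCII pairs:

```perl
use strict;
use warnings;

# Hypothetical helper: derive the closing delimiter from the opening one.
sub closing_for {
    my ($open) = @_;
    my $close = scalar reverse $open;      # reverse character-wise
    $close =~ tr/([{<)]}>/)]}>([{</;       # then mirror each bracket
    return $close;
}

for my $open ('103', '((', '<+', '{{[!<_', '@{[') {
    printf "%-8s closes with %s\n", $open, closing_for($open);
}
```

Anything outside the tr/// list is left alone, which is why an explicit list of pairs is still needed for cases like tags.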

-Nate

Re: XML/HTML-specific ?< and ?> operators? (was Re: RFC 145 (alternate approach))

2000-09-06 Thread Michael Maraist

- Original Message -
From: "Jonathan Scott Duff" <[EMAIL PROTECTED]>
Subject: Re: XML/HTML-specific ?< and ?> operators? (was Re: RFC 145
(alternate approach))

> On Wed, Sep 06, 2000 at 08:40:37AM -0700, Nathan Wiger wrote:
> > What if we added special XML/HTML-parsing ?< and ?> operators?
>
> What if we just provided deep enough hooks into the RE engine that
> specialized parsing constructs like these could easily be added by
> those who need them?
>
> -Scott

Ok, I've avoided this thread for a while, but I'll make my comment now.
I've played with several ideas of reg-ex extensions that would allow
arbitrary "parsing".  My first goal was to be able to parse perl-like text,
then later a simple nested parentheses, then later nested xml as with this
thread.

I have been able to solve these problems using perl5.6's recursive reg-ex's,
and inserted procedure code.  Unfortunately this isn't very safe, nor is it
'pretty' to figure out by a non-perl-guru.  What's more, what I'm attempting
to do with these nested parens and xml is to _parse_ the data.. Well, guess
what guys, we've had decades of research into the area of parsing, and we
came out with yacc and lex.  My point is that I think we're approaching this
the wrong way.  We're trying to apply more and more parser power into what
classically has been the lexer / tokenizer, namely our beloved
regular-expression engine.

A great deal of string processing is possible with Perl's enhanced NFA
engine, but at some point we're looking at perl code that is inside out: all
code embedded within a reg-ex.  That, boys and girls, is a parser, and I'm
not convinced it's the right approach for rapid design, and definitely not
for large-scale robust design.

As for XML, we already have lovely C modules that take care of that.  You even
get your choice.  Call per tag, or generate a tree (where you can search for
sub-trees).  What else could you want?  (Ok, stupid question, but you could
still accomplish it via a customized parser).

My suggestion, therefore, would be to discuss a method of incorporating more
powerful and convenient parsing within _perl_; not necessarily directly
within the reg-ex engine, and most likely not within a reg-ex statement.  I
know we have Yacc and Parser modules.  But try this out for size: Perl's
very name is about extraction and reporting.  Reg-ex's are fundamental to
this, but for complex jobs, so is parsing.  After I think about this some
more, I'm going to make an RFC for it.  If anyone has any hardened opinions
on the matter, I'd like to hear from you while my brain churns.

-Michael

Re: RFC 145 (alternate approach)

2000-09-06 Thread Richard Proctor

On Tue 05 Sep, Nathan Wiger wrote:
>"normal"   "reversed"
>--------   ----------
>103        301
>99a        a99
>((         ))
><+         +>
>{{[!<_     _>!]}}
>{__A1(     )A1__}
> 
> That is, when a bracket is encountered, the "reverse" of that is
> automatically interpreted as its closing counterpart. This is the same
> reason why qq// and qq() and qq{} all work without special notation. 
> 
> So we can replace @^g and @^G with simple precedence rules, the same
> that are actually invoked automatically throughout Perl already.
> 
> >   (?[( => ),{ => }, 01 => 10)
> > 
> > sort of hashish in style.
> 
> I actually think this is redundant, for the reasons I mentioned above.
> I'm not striking it down outright, but it seems simple rules could make
> all this unnecessary. 

I don't think you will ever come up with a set of rules that will satisfy
everybody all the time.  What about html comments  are they
brackets?  What about people doing 66/99 pairs?  The best you could
achieve is a set of default rules as you have suggested AND a way
of overriding them with an explicit hash of what is the closing
bracket for each opening bracket.

Which of the two methods applies depends on what follows the (?[: is it a hash or not?

For the "Default" method the list of brackets could be as has been
suggested a regex, or perhaps a simple comma separated list.  For this
you should define what is the "reverse" of each character, at
least for latin-1, what do you do about the full utf-8...?  An \X type
construct that covers all the common brackets might be a useful addition
({

Re: XML/HTML-specific ?< and ?> operators? (was Re: RFC 145 (alternate approach))

2000-09-06 Thread Jonathan Scott Duff


On Wed, Sep 06, 2000 at 08:40:37AM -0700, Nathan Wiger wrote:
> What if we added special XML/HTML-parsing ?< and ?> operators?

What if we just provided deep enough hooks into the RE engine that
specialized parsing constructs like these could easily be added by
those who need them?

-Scott
-- 
Jonathan Scott Duff
[EMAIL PROTECTED]

Re: XML/HTML-specific ?< and ?> operators? (was Re: RFC 145 (alternate approach))

2000-09-06 Thread David Corbin


Nathan Wiger wrote:
> 
> > It would be useful (and increasingly more common) to be able to match
> > qr|<\s*(\w+)([^>]*)>| to qr|<\s*/\1\s*>|, and handle the case where those
> > can nest as well.  Something like
> >
> > match this with
> >
> >   not this but
> >this.
> 
> I suspect this is going to need a ?[ and ?] of its own. I've been
> thinking about this since your email on the subject yesterday, and I
> don't see how either RFC 145 or this alternative method could support
> it, since there are two tags - > and  asymmetrically, and neither approach gives any credence to what's
> contained inside the tag. So  would be matched itself as "< matches
> >".

Actually, in one of my responses I did outline a syntax which would
handle this with reasonable ease, I think.  If the contents of (?[) are
considered a pattern, then you can define a matching pattern.

Consider either of these.

m:(?[]).*?(?]): 

or

m:(?['' => '').*(?]):# really ought to include (?i:) in
there, but left out for readability

or more generically

m:(?['<\w+>' => '').*(?]):


I'll grant you it's not the simplest syntax, but it's a lot simpler than
using the 5.6 method... :)
> 
> What if we added special XML/HTML-parsing ?< and ?> operators?
> Unfortunately, as Richard notes, ?> is already taken, but I will use it
> for the examples to make things symmetrical.
> 
>?<  =  opening tag (with name specified)
>?>  =  closing tag (matches based on nesting)
> 
> Your example would simply be:
> 
>/(?)[\s\w]*(?>)/;
> 
> What makes me nervous about this is that ?< and ?> seem special-case.
> They are, but then again XML and HTML are also pervasive. So a
> special-case for something like this might not be any stranger than
> having a special-case for sin() and cos() - they're extremely important
> operations.
> 
> The other thing that this doesn't handle is tags with no closing
> counterpart, like:
> 
>
> 
> Perhaps for these the easiest thing is to tell people not to use ?< and
> ?>:
> 
>/(?)(?>)/;
> 
> Would match
> 
>
>   Some stuff
>
> 
> Finally, tags which take arguments:
> 
>Stuff
> 
> Would require some type of "this is optional" syntax:
> 
>/(?)/
> 
> Perhaps only the first word specified is taken as the tag name? This is
> the XML/HTML spec anyways.
> 
> -Nate

-- 
David Corbin
Mach Turtle Technologies, Inc.
http://www.machturtle.com
[EMAIL PROTECTED]

XML/HTML-specific ?< and ?> operators? (was Re: RFC 145 (alternate approach))

2000-09-06 Thread Nathan Wiger


> It would be useful (and increasingly more common) to be able to match
> qr|<\s*(\w+)([^>]*)>| to qr|<\s*/\1\s*>|, and handle the case where those
> can nest as well.  Something like
> 
> match this with
>
>   not this but
>this.

I suspect this is going to need a ?[ and ?] of its own. I've been
thinking about this since your email on the subject yesterday, and I
don't see how either RFC 145 or this alternative method could support
it, since there are two tags - > and  would be matched itself as "< matches
>".

What if we added special XML/HTML-parsing ?< and ?> operators?
Unfortunately, as Richard notes, ?> is already taken, but I will use it
for the examples to make things symmetrical.

   ?<  =  opening tag (with name specified)
   ?>  =  closing tag (matches based on nesting)

Your example would simply be:

   /(?)[\s\w]*(?>)/;

What makes me nervous about this is that ?< and ?> seem special-case.
They are, but then again XML and HTML are also pervasive. So a
special-case for something like this might not be any stranger than
having a special-case for sin() and cos() - they're extremely important
operations.

The other thing that this doesn't handle is tags with no closing
counterpart, like:

   

Perhaps for these the easiest thing is to tell people not to use ?< and
?>:

   /(?)(?>)/;

Would match

   
  Some stuff
   

Finally, tags which take arguments:

   Stuff

Would require some type of "this is optional" syntax:

   /(?)/

Perhaps only the first word specified is taken as the tag name? This is
the XML/HTML spec anyways.

-Nate

Re: RFC 145 (alternate approach)

2000-09-06 Thread Buddha Buck


At 09:05 AM 9/6/00 -0400, David Corbin wrote:
>I'd suggest also, that (?[) (with no specified brackets) have the
>default meaning
>of the "four standard brackets" :
>
>(?['('=>')','{'=>'}','['=>']','<'=>'>')
>
>Note also the subtle syntax change.  We are either dealing with strings
>or with patterns.  The consensus seems to be against patterns (I can
>understand that).  Given that, we need  to quote the right hand side of
>the => operator I think.  The quotes on the left side would be optional,
>I think.

It would be useful (and increasingly more common) to be able to match 
qr|<\s*(\w+)([^>]*)>| to qr|<\s*/\1\s*>|, and handle the case where those 
can nest as well.  Something like

match this with
   
  not this but
   this.

>Richard Proctor wrote:
> >
> > On Tue 05 Sep, Nathan Wiger wrote:
> > > Eric Roode wrote:
> > > Now *that* sounds cool, I like it!
> > >
> > > What if the RFC only suggested the addition of two new constructs, (?[)
> > > and (?]), which did nested matches. The rest would be bound by standard
> > > regex constructs and your imagination!
> > >
> > > That is, the ?] simply takes whatever the closest ?[ matched and
> > > reverses it, verbatim, including ordering, case, and number of
> > > characters. The only trick would be a way to get what "reverses it"
> > > means correct.
> > >
> >
> > No ?] should match the closest ?[ it should nest the ?[s bound by any
> > brackets in the regex and act accordingly.
> >
> > Also this does not work as a definition of simple bracket matching as you
> > need ( to match ) not ( to match (.  A ?[ list should specify for each
> > element what the matching element is perhaps
> >
> >   (?[( => ),{ => }, 01 => 10)
> >
> > sort of hashish in style.
> >
> > Perhaps the brackets could be defined as a hash allowing (?[%Hash)
> >
> > Richard
> >
> > --
> >
> > [EMAIL PROTECTED]
>
>--
>David Corbin
>Mach Turtle Technologies, Inc.
>http://www.machturtle.com
>[EMAIL PROTECTED]

Re: RFC 145 (alternate approach)

2000-09-06 Thread David Corbin


I'd suggest also that (?[) (with no specified brackets) have the
default meaning of the "four standard brackets":

(?['('=>')','{'=>'}','['=>']','<'=>'>')

Note also the subtle syntax change.  We are either dealing with strings
or with patterns.  The consensus seems to be against patterns (I can
understand that).  Given that, we need  to quote the right hand side of
the => operator I think.  The quotes on the left side would be optional,
I think.
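
For comparison, here is a rough sketch of how an explicit opener => closer table might be turned into a nesting-aware pattern. It does not use the proposed (?[ ... ) syntax; it assumes the named-recursion features of later perls (5.10 and up). The names balanced_re, group and body are made up for illustration, and the sketch assumes no delimiter is a prefix of another.

```perl
use strict;
use warnings;

# Hypothetical helper: build a pattern matching one balanced group from an
# explicit opener => closer table.  Assumes Perl 5.10+ (named recursion) and
# that no delimiter is a prefix of another.
sub balanced_re {
    my (%pair) = @_;

    # Any opener or closer; used to stop plain text at the next delimiter.
    my $any = join '|', map { quotemeta } %pair;

    # One alternative per pair: opener, nested body, matching closer.
    my $group = join '|',
        map { quotemeta($_) . '(?&body)' . quotemeta($pair{$_}) }
        sort keys %pair;

    return qr/
        (?&group)
        (?(DEFINE)
            (?<group> $group )
            (?<body>  (?: (?&group) | (?! $any ) . )*+ )
        )
    /xs;
}

# The "four standard brackets" as the default table.
my $re = balanced_re('(' => ')', '{' => '}', '[' => ']', '<' => '>');

for my $text ('(a{b}c)', '(a{b)c}') {
    printf "%-8s %s\n", $text, $text =~ /\A$re\z/ ? 'balanced' : 'not balanced';
}
```

The first string should be reported as balanced and the second as not, since the { is closed by a ) there. Keeping the table inside the call mirrors the point of the proposal: the pairing is visible where the pattern is built.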

Richard Proctor wrote:
> 
> On Tue 05 Sep, Nathan Wiger wrote:
> > Eric Roode wrote:
> > Now *that* sounds cool, I like it!
> >
> > What if the RFC only suggested the addition of two new constructs, (?[)
> > and (?]), which did nested matches. The rest would be bound by standard
> > regex constructs and your imagination!
> >
> > That is, the ?] simply takes whatever the closest ?[ matched and
> > reverses it, verbatim, including ordering, case, and number of
> > characters. The only trick would be a way to get what "reverses it"
> > means correct.
> >
> 
> No ?] should match the closest ?[ it should nest the ?[s bound by any
> brackets in the regex and act accordingly.
> 
> Also this does not work as a definition of simple bracket matching as you
> need ( to match ) not ( to match (.  A ?[ list should specify for each
> element what the matching element is perhaps
> 
>   (?[( => ),{ => }, 01 => 10)
> 
> sort of hashish in style.
> 
> Perhaps the brackets could be defined as a hash allowing (?[%Hash)
> 
> Richard
> 
> --
> 
> [EMAIL PROTECTED]

-- 
David Corbin
Mach Turtle Technologies, Inc.
http://www.machturtle.com
[EMAIL PROTECTED]

Re: RFC 145 (alternate approach)

2000-09-05 Thread David L. Nicol

David Corbin wrote:

> > I've got some vague ideas on solving all of these, I'll go into if
> > people like the basic concept enough.

Not just in regexes, but in general, a way to extend the set of bratches
that Perl knows about would be very nice.  For instance, it is very difficult
for people using European keyboards to produce curlies.  It would help if it
were possible to say that Q is the opening brace and it matches against q
later, or any arbitrary characters, such as the single-character versions of
<< and >> which I am not capable of producing.  This could be specified in
the code somewhere, for instance

$CORE::BRATCH{'Q'} = 'q';

(or maybe lexically scoped)

after that one could say 

$isafromline = qrQ^Fromq;

for instance.

-- 
  David Nicol 816.235.1187 [EMAIL PROTECTED]
   perl -e'@w=<>;for(;;){sleep print[rand@w]}' /usr/dict/words

Re: RFC 145 (alternate approach)

2000-09-05 Thread Nathan Wiger


Nathan Wiger wrote:
>
>    "normal"       "reversed"
>    --------       ----------
>    {__A1(         )A1__}

That should be:

    {__A1(         )1A__}

Why you would delimit text this way I have no idea, but it could still
work...

-Nate

Re: RFC 145 (alternate approach)

2000-09-05 Thread Nathan Wiger

Richard Proctor wrote:
> 
> No ?] should match the closest ?[ it should nest the ?[s bound by any
> brackets in the regex and act accordingly.

Good point.

> Also this does not work as a definition of simple bracket matching as you
> need ( to match ) not ( to match (.  A ?[ list should specify for each
> element what the matching element is perhaps

Actually, it should with some simple precedence rules. If ?] reverses
the ordering of ?[, *and* we define "reversing" for bracketed pairs
consistent with the current Perl definition in other contexts, then this
is all automatic:

   "normal"       "reversed"
   --------       ----------
   103            301
   99aa           aa99
   ((             ))
   <+             +>
   {{[!<_         _>!]}}
   {__A1(         )A1__}

That is, when a bracket is encountered, the "reverse" of that is
automatically interpreted as its closing counterpart. This is the same
reason why qq// and qq() and qq{} all work without special notation. 

So we can replace @^g and @^G with simple precedence rules, the same
that are actually invoked automatically throughout Perl already.
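
As a rough illustration of that rule in ordinary Perl 5 (the helper name closer_for is made up, and only the four ASCII bracket pairs are mirrored, which is an assumption), the closer could be computed like this:

```perl
use strict;
use warnings;

# Hypothetical helper: the "reversed" closer for an opener, i.e. the string
# reversed character by character with each bracket swapped for its mirror.
sub closer_for {
    my ($open) = @_;
    my $close = reverse $open;          # scalar context: reverses the string
    $close =~ tr/()[]{}<>/)(][}{></;    # mirror the four bracket pairs
    return $close;
}

print closer_for($_), "\n" for '103', '99aa', '((', '<+', '{{[!<_';
```

Under this reading, {__A1( comes out as )1A__}, a strict character-by-character reversal.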

>   (?[( => ),{ => }, 01 => 10)
> 
> sort of hashish in style.

I actually think this is redundant, for the reasons I mentioned above.
I'm not striking it down outright, but it seems simple rules could make
all this unnecessary. 

-Nate

Re: RFC 145 (alternate approach)

2000-09-05 Thread Richard Proctor

On Tue 05 Sep, Nathan Wiger wrote:
> Eric Roode wrote:
> Now *that* sounds cool, I like it!
> 
> What if the RFC only suggested the addition of two new constructs, (?[)
> and (?]), which did nested matches. The rest would be bound by standard
> regex constructs and your imagination!
> 
> That is, the ?] simply takes whatever the closest ?[ matched and
> reverses it, verbatim, including ordering, case, and number of
> characters. The only trick would be a way to get what "reverses it"
> means correct.
> 

No ?] should match the closest ?[ it should nest the ?[s bound by any
brackets in the regex and act accordingly.  

Also this does not work as a definition of simple bracket matching as you
need ( to match ) not ( to match (.  A ?[ list should specify for each
element what the matching element is perhaps 

  (?[( => ),{ => }, 01 => 10)

sort of hashish in style.

Perhaps the brackets could be defined as a hash allowing (?[%Hash)

Richard

-- 

[EMAIL PROTECTED]

Re: RFC 145 (alternate approach)

2000-09-05 Thread Jonathan Scott Duff

On Tue, Sep 05, 2000 at 02:12:23PM -0400, Eric Roode wrote:
> Unfortunately, as Richard Proctor pointed out, ?m is taken. Perhaps
> (?[list|of|openers)  and  (?]list|of|closers)   ?

That breaks the visual meaning of "|" as alternation if the RE engine
is to be smart enough to match the closers with the right openers.
Plus, it leaves it up to the programmer to get his openers and closers
in the right position in the list which seems error prone to me.
How about these?

/(?[open0,close0|open1,close1)...(?])/  # preserves alternation
/(?[open0,close0,open1,close1)...(?])/
/(?[ open0 => close0, open1 => close1)...(?])/
/(?[open0 ]close0 [open1 ]close1)...(?])/
/(?[open0.close0 open1.close1)...(?])/

Blah!  I see no good short syntax for this.  Is it really that common
an operation to match paired delimiters (SGML and its progeny not
withstanding) ?  I mean, we can already do it in the language, why do
we want a shortcut?
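
For reference, a sketch of the kind of thing that can already be done, using a postponed subexpression so that the pattern refers to itself. This follows the well-known balanced-parentheses idiom; the variable name $np is arbitrary, and a reasonably recent perl is assumed so that the lexical is visible inside the code block.

```perl
use strict;
use warnings;

# Balanced parentheses with a postponed subexpression, (??{ ... }), which
# lets the pattern refer to itself.
my $np;
$np = qr/
    \(                      # an opening paren
    (?:
        (?> [^()]+ )        # a run of non-parens, without backtracking
      |
        (??{ $np })         # or a nested balanced group
    )*
    \)                      # the matching closing paren
/x;

for my $text ('(a(b)c)', '(a(b c)') {
    print "$text: ", ($text =~ /\A$np\z/ ? 'balanced' : 'not balanced'), "\n";
}
```

It works, but it handles one bracket pair only and has to be spelled out again for each additional pair.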

-Scott
-- 
Jonathan Scott Duff
[EMAIL PROTECTED]

Re: RFC 145 (alternate approach)

2000-09-05 Thread Nathan Wiger

Eric Roode wrote:
>
> Unfortunately, as Richard Proctor pointed out, ?m is taken. Perhaps
> (?[list|of|openers)  and  (?]list|of|closers)   ?
>
> Does that look too bizarre, with the lone square bracket in each?
> Or does that serve to make it mnemonic (which is my intention)?

Actually, I personally like this, and was on the verge of suggesting
similar myself, believe it or not. It makes a lot of sense to me. I
don't like the m vs M because that smacks too much of negation for my
tastes.

> And --- can-of-worms time --- we're only intending the list elements
> to be constant characters, but that syntax *looks* like it can take a
> regular expression for any of the list elements

Now *that* sounds cool, I like it!

What if the RFC only suggested the addition of two new constructs, (?[)
and (?]), which did nested matches. The rest would be bound by standard
regex constructs and your imagination!

   /(?[\d+)[\s\w]+?(?])/

Would match

  01HelloThere10
  999 Important Message 999

But not

  01HelloThere01
  999 Important Message 9

That is, the ?] simply takes whatever the closest ?[ matched and
reverses it, verbatim, including ordering, case, and number of
characters. The only trick would be a way to get what "reverses it"
means correct.
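
A minimal sketch of that behaviour in today's Perl, for the digits case only. The helper name reversed_digits_match is made up, and anchoring to the whole string is an assumption. It does the match in two steps, so the opener is always the full leading run of digits.

```perl
use strict;
use warnings;

# Hypothetical helper emulating /(?[\d+)[\s\w]+?(?])/ anchored to the whole
# string: the closer must be the opening digits reversed.
sub reversed_digits_match {
    my ($text) = @_;
    my ($open, $rest) = $text =~ /\A(\d+)(.*)\z/s
        or return 0;
    my $close = reverse $open;          # scalar context: "01" becomes "10"
    return $rest =~ /\A[\s\w]+?\Q$close\E\z/ ? 1 : 0;
}

for my $text ('01HelloThere10', '999 Important Message 999',
              '01HelloThere01', '999 Important Message 9') {
    print "$text: ", (reversed_digits_match($text) ? 'match' : 'no match'), "\n";
}
```

The first two strings match and the last two do not. Note that a single backtracking regex could let \d+ give back digits, so that 9 ... 9 would pair up in the last string; a real (?[ ... ) would need to say whether that is allowed.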

> Sound about right?

I think I'm really starting to like this now... :-)

-Nate

Re: RFC 145 (alternate approach)

2000-09-05 Thread Eric Roode

I think David's on to something good here. A major problem with 
holding the bracket-matching possibilities in a special variable
(or a pair of them) is that one can't figure out what the RE is
going to do just by looking at it -- you have to look elsewhere.

Nathan Wiger wrote:
>I think it's cool too, I don't like the @^g and @^G either. But I worry
>about the double-meaning of the []'s in your solution, and the fact that
>these:
>
>   /\m[...]...\M/;
>   /\d[...]...\D/;
>
>Will work so differently. 

Yes. Things that look similar should act similar. Things that act
differently should look different.

>But, how about a new ?m operator?
>
>   /(?m<<|[).*?(?M>>|])/;
>
>Then the ?M matches pairs with the previous ?m, if there was one that
>was matched. The | character separates or'ed sets consistent with other
>regex patterns.

Ah, this is a neat idea! 

Unfortunately, as Richard Proctor pointed out, ?m is taken. Perhaps
(?[list|of|openers)  and  (?]list|of|closers)   ?

Does that look too bizarre, with the lone square bracket in each?
Or does that serve to make it mnemonic (which is my intention)?

And --- can-of-worms time --- we're only intending the list elements
to be constant characters, but that syntax *looks* like it can take a
regular expression for any of the list elements, so people are going
to try to do that someday. I cannot imagine why someone would want
to use a regexp in such a construct, but abuses of the language are
not limited to *my* imagination :-)  

(?[list|of|openers) would match any expression in the alternation
list. Subsequently, (?]list|of|closers) would match the *corresponding*
expression, but would keep track of the nesting level of the originally-
matching open-bracket expression. 

Sound about right?
 --
 Eric J. Roode,  [EMAIL PROTECTED]   print  scalar  reverse  sort
 Senior Software Engineer'tona ', 'reh', 'ekca', 'lre',
 Myxa Corporation'.r', 'h ', 'uj', 'p ', 'ts';

Re: RFC 145 (alternate approach)

2000-09-05 Thread Richard Proctor

On Tue 05 Sep, David Corbin wrote:
> Nathan Wiger wrote:
> > 
> > But, how about a new ?m operator?
> > 
> >/(?m<<|[).*?(?M>>|])/;
> > 
> 
> Let's combine your operator with my example from above, where everything
> inside the (?m) or the (?M) fits the syntax of an RE.
> 
>   /(?m(<<)|\[).*?(?M(>>)|(\]))
> 
> > Then the ?M matches pairs with the previous ?m, if there was one that
> > was matched. The | character separates or'ed sets consistent with other
> > regex patterns.

There already is a (?m

The whole (?x set of thingies is getting complicated...  The list of what is
used at present (and in current suggestions) is:

Current Use in perl5

(?#          comment
(?imsx       flags
(?-imsx      flags
(?:          subexpression without bracket capture
(?=          zero-width positive look ahead
(?!          zero-width negative look ahead
(?<=         zero-width positive look behind
(?<!         zero-width negative look behind
(?>          Independent subexpression
(?(condition)yes-pattern
(?(condition)yes-pattern|no-pattern

Suggested in RFCs either current or in development

(?$foo=      suggested for assignment (RFC 112)
(?%foo=      suggested for hash assignment (RFC 150?)

(?@foo       suggested list expansion (?:$foo[0] | $foo[1] | ...) ? (RFC 166)
(?Q@foo)     Quote each item of lists (RFC 166)
(?^pattern)  matches anything that does not match pattern
             (RFC 166 but will be somewhere else on next rewrite [1])
(?F          Failure tokens (RFC in development by me [1])
(?r),(?f)    Suggested in Direction Control RFC 1
(?&          Boolean regexes (RFC in development [1])
(?*{code})   Execute code with pass/fail result (RFC in development [1])

[1] these will all be in an RFC which will probably be out in a day or so.

Unused (? sequences

a,b,c,d,e, ,g,h, ,j,k,l, ,n,o,p,q, , ,t,u,v,w,x,y,z
A,B,C,D,E, ,G,H,I,J,K,L,M,N,O,P, ,R,S,T,U,V,W,X,Y,Z
0,1,2,3,4,5,6,7,8,9
`_,."+[];'~)

(If I have forgotten any, do tell, and I will try to keep this list up to
date.)

Richard

-- 

[EMAIL PROTECTED]

Re: RFC 145 (alternate approach)

2000-09-05 Thread David Corbin


Nathan Wiger wrote:
> 
> I think it's cool too, I don't like the @^g and @^G either. But I worry
> about the double-meaning of the []'s in your solution, and the fact that
> these:
> 
>/\m[...]...\M/;
>/\d[...]...\D/;

Well, it's not really a double meaning.  It's a set of characters, just
like '[]' always means.  Granted, the meaning between upper & lower case
characters is not the same here, but I don't think it always is the same
currently (positive/negative).

> 
> Will work so differently. Maybe another character like ()'s that takes a
> list:
> 
>/\m(<<,[).*?\M(>>,])/;
> 
If you don't want to use [] (which limits it to single character
"para-brace-ets"), then I'd suggest using {} as that is already
established for use with \? type escapes.

Maybe:  m/\m{(<<)|(\[)}.*?\M{(>>)|(])}/;

Essentially everything inside the {} is in fact another pattern, and the
back-references within match "1-for-1".  Of course, with this syntax you'd
have to escape actual braces m{\{} which I don't much care for...

> That solves the multiple characters problem at least. However, we still
> have a \M and \m, which isn't consistent if they're going to take
> arguments.

I'm not sure I understand your point here.


> 
> But, how about a new ?m operator?
> 
>/(?m<<|[).*?(?M>>|])/;
> 

Let's combine your operator with my example from above, where everything
inside the (?m) or the (?M) fits the syntax of an RE.

/(?m(<<)|\[).*?(?M(>>)|(\]))

> Then the ?M matches pairs with the previous ?m, if there was one that
> was matched. The | character separates or'ed sets consistent with other
> regex patterns.

You can do that, or you can say it's done with backreferences (as noted
above)
> 
> -Nate
> 
> David Corbin wrote:
> >
> > I never saw one comment on this, and the more I think about it, the more
> > I like it. So,
> > I thought I'd throw it back out one more time...(If I get no comments
> > this time, I'll
> > be quiet :)
> >
> > David Corbin wrote:
> > >
> > > I haven't given this a WHOLE lot of thought, so please, shoot it full
> > > of holes.
> > >
> > > I certainly like the goal of this RFC, but I dislike the idea that the
> > > specification for what characters are going to match is given outside
> > > of the RE.

-- 
David Corbin
Mach Turtle Technologies, Inc.
http://www.machturtle.com
[EMAIL PROTECTED]

Re: RFC 145 (alternate approach)

2000-09-05 Thread Nathan Wiger

I think it's cool too, I don't like the @^g and @^G either. But I worry
about the double-meaning of the []'s in your solution, and the fact that
these:

   /\m[...]...\M/;
   /\d[...]...\D/;

Will work so differently. Maybe another character like ()'s that takes a
list:

   /\m(<<,[).*?\M(>>,])/;

That solves the multiple characters problem at least. However, we still
have a \M and \m, which isn't consistent if they're going to take
arguments.

But, how about a new ?m operator?

   /(?m<<|[).*?(?M>>|])/;

Then the ?M matches pairs with the previous ?m, if there was one that
was matched. The | character separates or'ed sets consistent with other
regex patterns.

-Nate

David Corbin wrote:
> 
> I never saw one comment on this, and the more I think about it, the more
> I like it. So,
> I thought I'd throw it back out one more time...(If I get no comments
> this time, I'll
> be quiet :)
> 
> David Corbin wrote:
> >
> > I haven't given this a WHOLE lot of thought, so please, shoot it full
> > of holes.
> >
> > I certainly like the goal of this RFC, but I dislike the idea that the
> > specification for what characters are going to match is given outside
> > of the RE.

Re: RFC 145 (alternate approach)

2000-09-05 Thread David Corbin


I never saw one comment on this, and the more I think about it, the more
I like it. So, I thought I'd throw it back out one more time... (If I get
no comments this time, I'll be quiet :)

David Corbin wrote:
> 
> I haven't given this a WHOLE lot of thought, so please, shoot it full
> of holes.
> 
> I certainly like the goal of this RFC, but I dislike the idea that the
> specification for what characters are going to match is given outside
> of the RE.
> 
> I want to be able specify a character, set of characters or maybe even
> another RE, in the primary RE that specifies what an open
> "brace-athen-acket" looks like, and then a common symbol that is used to
> say "the matching brace-athen-acket".
> 
> This is a quick guess at a syntax that I have no great attachment to
> (though I think it works).  Consider this example:
> 
> m/\m[{(].*\M/;
> 
> the \m[{(] says I want to match on either open paren or open-brace.
> the \M indicates the matching close for whatever was found in the
> appropriate \m.
> 
> Possible problems here are:
> - matching multiple character "opens" like "<<" or "/*".
> - knowing what the closing match should be (when it's not obvious) as
> in the above cases.
> - (possibly) a problem when you've got many \m-\M pairs in a single RE
> 
> I've got some vague ideas on solving all of these, which I'll go into if
> people like the basic concept enough.
> --
> David Corbin
> Mach Turtle Technologies, Inc.
> http://www.machturtle.com
> [EMAIL PROTECTED]

-- 
David Corbin
Mach Turtle Technologies, Inc.
http://www.machturtle.com
[EMAIL PROTECTED]

RFC 145 (alternate approach)

2000-08-25 Thread David Corbin


I haven't given this a WHOLE lot of thought, so please, shoot it full 
of holes.

I certainly like the goal of this RFC, but I dislike the idea that the
specification for what characters are going to match is given outside
of the RE.

I want to be able specify a character, set of characters or maybe even
another RE, in the primary RE that specifies what an open
"brace-athen-acket" looks like, and then a common symbol that is used to
say "the matching brace-athen-acket".  

This is a quick guess at a syntax that I have no great attachment to
(though I think it works).  Consider this example:

m/\m[{(].*\M/;

the \m[{(] says I want to match on either open paren or open-brace.
the \M indicates the matching close for whatever was found in the
appropriate \m.

Possible problems here are:
- matching multiple character "opens" like "<<" or "/*".
- knowing what the closing match should be (when it's not obvious) as
in the above cases.
- (possibly) a problem when you've got many \m-\M pairs in a single RE

I've got some vague ideas on solving all of these, which I'll go into if
people like the basic concept enough.
-- 
David Corbin
Mach Turtle Technologies, Inc.
http://www.machturtle.com
[EMAIL PROTECTED]
