Re: RFC 145 (alternate approach)

2000-09-06 Thread David Corbin

I'd also suggest that (?[) (with no brackets specified) default to the
"four standard brackets":

(?['('=')','{'='}','['=']','<'='>')

Note also the subtle syntax change.  We are dealing either with strings
or with patterns.  The consensus seems to be against patterns (I can
understand that).  Given that, I think we need to quote the right-hand
side of the = operator.  The quotes on the left side would be optional.
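
A minimal sketch of how the string-based reading might behave, written in
Python purely as executable notation since none of the (?[ ... ) syntax
exists. The names DEFAULT_PAIRS and match_nested are hypothetical; the
table is the "four standard brackets" default, and a caller can pass its
own opener-to-closer table instead.

```python
# Hypothetical default table for a bare (?[): the "four standard brackets".
DEFAULT_PAIRS = {"(": ")", "{": "}", "[": "]", "<": ">"}


def match_nested(text, start=0, pairs=None):
    """Return the index just past the closer of the opener at text[start].

    Returns None when text[start:] does not begin with an opener or when the
    nesting is unbalanced. Openers and closers are literal strings.
    """
    pairs = DEFAULT_PAIRS if pairs is None else pairs
    openers = sorted(pairs, key=len, reverse=True)  # longest opener first
    closers = set(pairs.values())
    stack = []  # closers still owed, innermost last
    i = start
    while i < len(text):
        if stack and text.startswith(stack[-1], i):
            i += len(stack.pop())
            if not stack:
                return i
            continue
        opener = next((o for o in openers if text.startswith(o, i)), None)
        if opener is not None:
            stack.append(pairs[opener])
            i += len(opener)
        elif not stack or any(text.startswith(c, i) for c in closers):
            return None  # no opener at start, or a closer of the wrong kind
        else:
            i += 1
    return None
```

Because the table maps strings to strings, a multi-character pair such
as '01' = '10' needs no extra machinery: match_nested("01 x 10",
pairs={"01": "10"}) consumes the whole string.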

Richard Proctor wrote:
 
 On Tue 05 Sep, Nathan Wiger wrote:
  Eric Roode wrote:
  Now *that* sounds cool, I like it!
 
  What if the RFC only suggested the addition of two new constructs, (?[)
  and (?]), which did nested matches. The rest would be bound by standard
  regex constructs and your imagination!
 
  That is, the ?] simply takes whatever the closest ?[ matched and
  reverses it, verbatim, including ordering, case, and number of
  characters. The only trick would be a way to get what "reverses it"
  means correct.
 
 
 No ?] should match the closest ?[ it should nest the ?[s bound by any
 brackets in the regex and act accordingly.
 
 Also, this does not work as a definition of simple bracket matching, as
 you need ( to match ), not ( to match (.  A ?[ list should specify, for
 each element, what the matching element is, perhaps
 
   (?[( = ),{ = }, 01 = 10)
 
 sort of hashish in style.
 
 Perhaps the brackets could be defined as a hash allowing (?[%Hash)
 
 Richard
 

-- 
David Corbin
Mach Turtle Technologies, Inc.
http://www.machturtle.com
[EMAIL PROTECTED]



Re: XML/HTML-specific ?< and ?> operators? (was Re: RFC 145 (alternate approach))

2000-09-06 Thread David Corbin

Nathan Wiger wrote:
 
  It would be useful (and increasingly more common) to be able to match
  qr|<\s*(\w+)([^>]*)>| to qr|<\s*/\1\s*>|, and handle the case where those
  can nest as well.  Something like
 
  <list>match this with
   <list>
   </list>   not this but
  </list>   this.
 
 I suspect this is going to need a ?[ and ?] of its own. I've been
 thinking about this since your email on the subject yesterday, and I
 don't see how either RFC 145 or this alternative method could support
 it, since there are two tags - <> and </> - which are paired
 asymmetrically, and neither approach gives any credence to what's
 contained inside the tag. So <tag> would be matched itself as
 "< matches >".

Actually, in one of my responses I did outline a syntax which would
handle this with reasonable ease, I think.  If the contents of (?[) are
considered a pattern, then you can define a matching pattern.

Consider either of these.

m:(?[<list>]).*?(?]</list>):

or

m:(?['<list>' = '</list>').*(?]):   # really ought to include (?i:) in
                                    # there, but left out for readability

or more generically

m:(?['<\w+>' = '</\1>').*(?]):


I'll grant you it's not the simplest syntax, but it's a lot simpler than
using the 5.6 method... :)
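
To make the intended semantics of the generic form concrete, here is a
rough sketch in Python (used only as executable notation; the proposed
(?[ ... ) syntax does not exist). It assumes plain tags without
attributes, and the names TAG and match_element are hypothetical. Each
opening tag pushes its name, and a closing tag must carry the name on
top of the stack, which is the role '</\1>' plays in the pattern above.

```python
import re

# Matches either an opening tag like <list> or a closing tag like </list>.
TAG = re.compile(r"<(/?)(\w+)>")


def match_element(text, start=0):
    """Return the end index of the element that opens at text[start], or None."""
    first = TAG.match(text, start)
    if first is None or first.group(1):
        return None  # text[start:] does not begin with an opening tag
    stack = []  # names of the elements still open, innermost last
    for m in TAG.finditer(text, start):
        closing, name = m.group(1) == "/", m.group(2)
        if not closing:
            stack.append(name)
        elif not stack or stack.pop() != name:
            return None  # closing tag does not pair with the innermost opener
        if not stack:
            return m.end()
    return None
```

On the nested list example quoted above, a plain non-greedy
<list>.*?</list> stops at the inner closing tag, while match_element
runs on to the outer one.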
 
 What if we added special XML/HTML-parsing ?< and ?> operators?
 Unfortunately, as Richard notes, ?< is already taken, but I will use it
 for the examples to make things symmetrical.
 
?<  =  opening tag (with name specified)
?>  =  closing tag (matches based on nesting)
 
 Your example would simply be:
 
/(?<list)[\s\w]*(?<list)[\s\w]*(?>)[\s\w]*(?>)/;
 
 What makes me nervous about this is that ?< and ?> seem special-case.
 They are, but then again XML and HTML are also pervasive. So a
 special-case for something like this might not be any stranger than
 having a special-case for sin() and cos() - they're extremely important
 operations.
 
 The other thing that this doesn't handle is tags with no closing
 counterpart, like:
 
<br>
 
 Perhaps for these the easiest thing is to tell people not to use ?< and
 ?>:
 
/(?<p)[\s*\w](?:<br>)(?>)/;
 
 Would match
 
<p>
   Some stuff<br>
</p>
 
 Finally, tags which take arguments:
 
<div align="center">Stuff</div>
 
 Would require some type of "this is optional" syntax:
 
/(?<div\s*\w*)Stuff(?>)/
 
 Perhaps only the first word specified is taken as the tag name? This is
 the XML/HTML spec anyways.
 
 -Nate
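
The two wrinkles Nate raises (tags with no closing counterpart, and tags
that take arguments) can be sketched as follows. This is Python used as
executable notation, not the proposed operators; VOID and is_balanced
are hypothetical names, and the set of unclosed tags is an assumption
limited to the one example from the message. Only the first word inside
the brackets is taken as the tag name.

```python
import re

# Group 2 is the tag name (first word); group 3 swallows any arguments.
TAG = re.compile(r"<\s*(/?)\s*(\w+)([^>]*)>")
VOID = {"br"}  # assumed set of tags that never get a closing counterpart


def is_balanced(html, void=VOID):
    """Return True if every non-void opening tag is closed in nesting order."""
    stack = []
    for m in TAG.finditer(html):
        closing, name = bool(m.group(1)), m.group(2).lower()
        if name in void:
            continue  # takes no part in the pairing
        if not closing:
            stack.append(name)
        elif not stack or stack.pop() != name:
            return False
    return not stack
```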




Re: XML/HTML-specific ?< and ?> operators? (was Re: RFC 145 (alternate approach))

2000-09-06 Thread David Corbin

Jonathan Scott Duff wrote:
 
 On Wed, Sep 06, 2000 at 08:40:37AM -0700, Nathan Wiger wrote:
  What if we added special XML/HTML-parsing ?< and ?> operators?
 
 What if we just provided deep enough hooks into the RE engine that
 specialized parsing constructs like these could easily be added by
 those who need them?
 

In principle, that's a very Perlish thing to do...

 -Scott
 --
 Jonathan Scott Duff
 [EMAIL PROTECTED]




Re: RFC 145 (alternate approach)

2000-09-05 Thread David Corbin

Nathan Wiger wrote:
 
 I think it's cool too, I don't like the @^g and ^@G either. But I worry
 about the double-meaning of the []'s in your solution, and the fact that
 these:
 
/\m[...]...\M/;
/\d[...]...\D/;

Well, it's not really a double meaning.  It's a set of characters, just
like '[]' always means.  Granted, the meaning between upper & lower case
characters is not the same here, but I don't think it is always the same
currently (positive/negative).

 
 Will work so differently. Maybe another character like ()'s that takes a
 list:
 
/\m(<,[).*?\M(>,])/;
 
If you don't want to use [] (which limits it to single-character
"para-brace-ets"), then I'd suggest using {}, as that is already
established for use with \? type escapes.

Maybe:  m/\m{(<)|(\[)}.*?\M{(>)|(])}/;

Essentially everything inside the {} is in fact another pattern, and the
back-references within match "1-for-1".  Of course, with this syntax
you'd have to escape actual braces, m{\{}, which I don't much care for...
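
A small sketch of the "1-for-1" idea, in Python as executable notation
since \m{...} and \M{...} are only proposals. OPEN and CLOSE stand in
for the two inner patterns, and match_paired is a hypothetical name. The
number of the capture group that matched on the opening side is
remembered, and a closer is accepted only if the same-numbered group
matches on the closing side.

```python
import re

OPEN = re.compile(r"(<)|(\[)")   # stands in for the pattern inside \m{...}
CLOSE = re.compile(r"(>)|(\])")  # stands in for the pattern inside \M{...}


def match_paired(text, start=0):
    """Return the end index of the group opened at text[start], or None."""
    stack = []  # capture-group numbers of the openers still pending
    i = start
    while i < len(text):
        opened = OPEN.match(text, i)
        closed = CLOSE.match(text, i)
        if opened:
            stack.append(opened.lastindex)
            i = opened.end()
        elif closed:
            # The closer must come from the same-numbered alternative.
            if not stack or stack.pop() != closed.lastindex:
                return None
            i = closed.end()
            if not stack:
                return i
        elif not stack:
            return None  # text[start:] does not begin with an opener
        else:
            i += 1
    return None
```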

 That solves the multiple characters problem at least. However, we still
 have a \M and \m, which isn't consistent if they're going to take
 arguments.

I'm not sure I understand your point here.


 
 But, how about a new ?m operator?
 
/(?m<|[).*?(?M>|])/;
 

Let's combine your operator with my example from above, where everything
inside the (?m) or the (?M) fits the syntax of a RE.

/(?m(<)|(\[)).*?(?M(>)|(\]))/

 Then the ?M matches pairs with the previous ?m, if there was one that
 was matched. The | character separates or'ed sets consistent with other
 regex patterns.

You can do that, or you can say it's done with backreferences (as noted
above).
 
 -Nate
 
 David Corbin wrote:
 
  I never saw one comment on this, and the more I think about it, the more
  I like it. So,
  I thought I'd throw it back out one more time...(If I get no comments
  this time, I'll
  be quiet :)
 
  David Corbin wrote:
  
   I haven't given this a WHOLE lot of thought, so please, shoot it full
   of holes.
  
   I certainly like the goal of this RFC, but I dislike the idea that the
   specification for
   what characters are going to match is specified outside of the RE.
