Re: XML/HTML-specific ?< and ?> operators? (was Re: RFC 145 (alternate approach))

2000-09-07 Thread Bart Lateur

On 06 Sep 2000 18:04:18 -0700, Randal L. Schwartz wrote:

> I think the -1 indexing for "end of array" came from there.  Or at
> least, it was in Perl long before it was in Python, and it was in Icon
> before it was in Perl, so I had always presumed Larry had seen Icon.
> Larry?

Do not assume that these are the only languages that exist. There must
be hundreds of languages; see the famous "Free Compilers" list
(http://www.idiom.com/free-compilers/). At least a few of these do
support -1 for last array index.

p.s. Shall I bring up the "@array[2 .. -1] should do the proper thing"
requested feature again? Oops, I just did. I think implementing this
basically requires lazy evaluation of the (2 .. -1) thing, so that when it
eventually needs to be turned into a list of numbers, [a] it is aware of
the fact that it's in a "list indexing context", and [b] it knows the
number of list items.

And yes, some of the other languages do properly support this feature.
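
For readers who have not run into this, a minimal sketch of the current behaviour and the usual workaround may help. The array contents and variable names are made up for illustration.

```perl
use strict;
use warnings;

my @array = qw(a b c d e);

# Today the range 2 .. -1 is evaluated before the slice sees it.
# Because 2 > -1 the range is empty, and so is the slice.
my @empty = @array[2 .. -1];

# The usual workaround spells out the last index explicitly.
my @tail = @array[2 .. $#array];

print scalar(@empty), "\n";
print "@tail\n";
```

The requested feature would make the first slice behave like the second, which is why the range would have to know it is being used as a list index.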

-- 
Bart.



Re: XML/HTML-specific ?< and ?> operators? (was Re: RFC 145 (alternate approach))

2000-09-07 Thread Richard Proctor

On Wed 06 Sep, Mark-Jason Dominus wrote:
 
> I've been thinking the same thing.  It seems to me that the attempts to
> shoehorn parsers into regex syntax have either been unsuccessful
> (yielding an underpowered extension) or illegible or both.
> 
> SNOBOL: 
>     parenstring = '(' *parenstring ')'
>                 | *parenstring *parenstring
>                 | span('()')
> 
> 
> This is not exactly the same, but I tried a direct translation:
> 
>      $re = qr{ \( (??{$re}) \)
>              | (??{$re}) (??{$re})
>              | (?> [^()]+)
>              }x;
 

I think what is needed is something along the line of :

   $re = qz{ '(' \$re ')'
           | \$re \$re
           | [^()]+
           };

Where qz is some hypothetical new quoting syntax.

Richard

-- 

[EMAIL PROTECTED]




Re: XML/HTML-specific ?< and ?> operators? (was Re: RFC 145 (alternate approach))

2000-09-07 Thread Jarkko Hietaniemi

On Thu, Sep 07, 2000 at 03:42:01PM -0400, Eric Roode wrote:
> Richard Proctor wrote:
> >
> > I think what is needed is something along the line of :
> >
> >    $re = qz{ '(' \$re ')'
> >            | \$re \$re
> >            | [^()]+
> >            };
> >
> > Where qz is some hypothetical new quoting syntax
> 
> Well, we currently have qr{}, and ??{} does something like your \$re.
> 
> Warning: radical ideas ahead.
> 
> What would be useful, would be to leave REs the hell alone; they're 
> great as-is, and are only getting hairier and hairier. What would be
> useful, would be to create a new non-regular pattern-matching/parsing
> "language" within Perl, that combines the best of Perl REs, lex, 
> SNOBOL, Icon, state machines, and what have you. 

Agreed.  "Yet another quoting construct", "yet another \construct",
"yet another (? construct".  Argh, please, no.  Make all the above and
all we've learned from Parse::RecDescent et alia to collide at light
speed and see what new cool particles will spring forth.


-- 
$jhi++; # http://www.iki.fi/jhi/
# There is this special biologist word we use for 'stable'.
# It is 'dead'. -- Jack Cohen



Re: XML/HTML-specific ?< and ?> operators? (was Re: RFC 145 (alternate approach))

2000-09-07 Thread Michael Maraist


- Original Message -
From: "Jonathan Scott Duff" <[EMAIL PROTECTED]>
Subject: Re: XML/HTML-specific ?< and ?> operators? (was Re: RFC 145
(alternate approach))


> How about qy() for Quote Yacc  :-)  This stuff is starting to look
> more and more like we're trying to fold lex and yacc into perl.  We
> already have lex through (?{code}) in REs, but we have to hand-write
> our own yacc-a-likes.

Though you can do cool stuff in (?{code}), I wouldn't quite call it lex.
First off we're dealing with NFA instead of DFA, and at the very least, that
gives you back-tracking.  True, local's allow you to preserve state to some
degree.  But the following is as close as I can consider (?{code}) a lexer:

sub lex_init {
    my $str = shift;
    our @tokens;
    $str =~ / \G (?{ local @tokens; })
              (?: TokenDelim(\d+) (?{ push @tokens, [ 'digit', $1 ] })
                | TokenDelim(\w+) (?{ push @tokens, [ 'word', $1 ] })
              )
            /gx;
}

sub getNextToken {  shift @tokens; }

I'm not even suggesting this is a good design.  Just showing how awkward it
is.
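
For comparison, here is a rough sketch of the more conventional way to tokenize in Perl, using \G with the /gc flags in ordinary code rather than code blocks inside one pattern. The tokenize() name and the token types are hypothetical, and this still needs the whole string in memory.

```perl
use strict;
use warnings;

# Hypothetical tokenizer: returns a list of [type, text] pairs.
sub tokenize {
    my ($str) = @_;
    my @tokens;
    pos($str) = 0;
    while (pos($str) < length $str) {
        # /c keeps pos() where it was when a match attempt fails,
        # so the next alternative starts from the same offset.
        if    ($str =~ /\G\s+/gc)   { next }
        elsif ($str =~ /\G(\d+)/gc) { push @tokens, [ digit => $1 ] }
        elsif ($str =~ /\G(\w+)/gc) { push @tokens, [ word  => $1 ] }
        else  { die "Unexpected character at offset " . pos($str) . "\n" }
    }
    return @tokens;
}

my @tokens = tokenize("foo 42 bar7");
print join(" ", map { "$_->[0]:$_->[1]" } @tokens), "\n";
```

Each branch advances the match position by one token, so there is no backtracking across tokens to worry about.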

Another problem with lexing in Perl is that you pretty much need the
entire string before you begin processing, while a good lexer only needs the
next character.  Ideally, the input is a character stream.  Already we're
talking about a lot of alteration and work here.  Not something I'd be crazy
about putting into the core.

-Michael






Re: XML/HTML-specific ?< and ?> operators? (was Re: RFC 145 (alternate approach))

2000-09-07 Thread David L. Nicol

Bart Lateur wrote:
 
> On 06 Sep 2000 18:04:18 -0700, Randal L. Schwartz wrote:
> 
> > I think the -1 indexing for "end of array" came from there.  Or at
> > least, it was in Perl long before it was in Python, and it was in Icon
> > before it was in Perl, so I had always presumed Larry had seen Icon.
> > Larry?


I thought he got it from the substr function in CDC mainframe BASIC, in
which negative positions mean "from the end of the string".



XML/HTML-specific ?< and ?> operators? (was Re: RFC 145 (alternate approach))

2000-09-06 Thread Nathan Wiger

> It would be useful (and increasingly more common) to be able to match
> qr|<\s*(\w+)([^>]*)>| to qr|<\s*/\1\s*>|, and handle the case where those
> can nest as well.  Something like
> 
> <list>match this with
>    <list>
>    </list>   not this but
> </list>   this.

I suspect this is going to need a ?[ and ?] of its own. I've been
thinking about this since your email on the subject yesterday, and I
don't see how either RFC 145 or this alternative method could support
it, since there are two tags - <> and </> - which are paired
asymmetrically, and neither approach gives any credence to what's
contained inside the tag. So <tag> would be matched itself as "< matches
>".

What if we added special XML/HTML-parsing ?< and ?> operators?
Unfortunately, as Richard notes, ?> is already taken, but I will use it
for the examples to make things symmetrical.

   ?<  =  opening tag (with name specified)
   ?>  =  closing tag (matches based on nesting)

Your example would simply be:

   /(?<list>)[\s\w]*(?<list>)[\s\w]*(?>)[\s\w]*(?>)/;

What makes me nervous about this is that ?< and ?> seem special-case.
They are, but then again XML and HTML are also pervasive. So a
special-case for something like this might not be any stranger than
having a special-case for sin() and cos() - they're extremely important
operations.

The other thing that this doesn't handle is tags with no closing
counterpart, like:

   <br>

Perhaps for these the easiest thing is to tell people not to use ?< and
?>:

   /(?<p>)[\s*\w](?:<br>)(?>)/;

Would match

   <p>
      Some stuff<br>
   </p>

Finally, tags which take arguments:

   <div align="center">Stuff</div>

Would require some type of "this is optional" syntax:

   /(?<div\s*\w*>)Stuff(?>)/

Perhaps only the first word specified is taken as the tag name? This is
the XML/HTML spec anyways.

-Nate



Re: XML/HTML-specific ?< and ?> operators? (was Re: RFC 145 (alternate approach))

2000-09-06 Thread David Corbin

Nathan Wiger wrote:
 
> > It would be useful (and increasingly more common) to be able to match
> > qr|<\s*(\w+)([^>]*)>| to qr|<\s*/\1\s*>|, and handle the case where those
> > can nest as well.  Something like
> >
> > <list>match this with
> >    <list>
> >    </list>   not this but
> > </list>   this.
> 
> I suspect this is going to need a ?[ and ?] of its own. I've been
> thinking about this since your email on the subject yesterday, and I
> don't see how either RFC 145 or this alternative method could support
> it, since there are two tags - <> and </> - which are paired
> asymmetrically, and neither approach gives any credence to what's
> contained inside the tag. So <tag> would be matched itself as "< matches
> >".

Actually, in one of my responses I did outline a syntax which would
handle this with reasonable ease, I think.  If the contents of (?[) is
considered a pattern, then you can define a matching pattern.

Consider either of these.

m:(?[<list>]).*?(?]</list>): 

or

m:(?['<list>' => '</list>').*(?]):   # really ought to include (?i:) in
                                     # there, but left out for readability

or more generically

m:(?['<\w+>' => '</\1>').*(?]):


I'll grant you it's not the simplest syntax, but it's a lot simpler than
using the 5.6 method... :)
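
For reference, "the 5.6 method" presumably means a self-referencing pattern built with (??{ }). The following is only a sketch of how that might look for the nested list example; it assumes a fixed tag name, no attributes, and no other tags in the text.

```perl
use strict;
use warnings;

our $list;   # package variable so the pattern can refer to itself
$list = qr{
    <list>
    (?:
        (?> [^<]+ )       # text that contains no tag
      |
        (??{ $list })     # a nested <list> ... </list>
    )*
    </list>
}x;

for my $text ('<list>a<list>b</list>c</list>', '<list>a<list>b</list>') {
    my $verdict = $text =~ m{\A $list \z}x ? 'balanced' : 'unbalanced';
    print "$text: $verdict\n";
}
```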
 
> What if we added special XML/HTML-parsing ?< and ?> operators?
> Unfortunately, as Richard notes, ?> is already taken, but I will use it
> for the examples to make things symmetrical.
> 
>    ?<  =  opening tag (with name specified)
>    ?>  =  closing tag (matches based on nesting)
> 
> Your example would simply be:
> 
>    /(?<list>)[\s\w]*(?<list>)[\s\w]*(?>)[\s\w]*(?>)/;
> 
> What makes me nervous about this is that ?< and ?> seem special-case.
> They are, but then again XML and HTML are also pervasive. So a
> special-case for something like this might not be any stranger than
> having a special-case for sin() and cos() - they're extremely important
> operations.
> 
> The other thing that this doesn't handle is tags with no closing
> counterpart, like:
> 
>    <br>
> 
> Perhaps for these the easiest thing is to tell people not to use ?< and
> ?>:
> 
>    /(?<p>)[\s*\w](?:<br>)(?>)/;
> 
> Would match
> 
>    <p>
>       Some stuff<br>
>    </p>
> 
> Finally, tags which take arguments:
> 
>    <div align="center">Stuff</div>
> 
> Would require some type of "this is optional" syntax:
> 
>    /(?<div\s*\w*>)Stuff(?>)/
> 
> Perhaps only the first word specified is taken as the tag name? This is
> the XML/HTML spec anyways.
> 
> -Nate

-- 
David Corbin
Mach Turtle Technologies, Inc.
http://www.machturtle.com
[EMAIL PROTECTED]



Re: XML/HTML-specific ?< and ?> operators? (was Re: RFC 145 (alternate approach))

2000-09-06 Thread Michael Maraist


- Original Message -
From: "Jonathan Scott Duff" <[EMAIL PROTECTED]>
Subject: Re: XML/HTML-specific ?< and ?> operators? (was Re: RFC 145
(alternate approach))


> On Wed, Sep 06, 2000 at 08:40:37AM -0700, Nathan Wiger wrote:
> > What if we added special XML/HTML-parsing ?< and ?> operators?
>
> What if we just provided deep enough hooks into the RE engine that
> specialized parsing constructs like these could easily be added by
> those who need them?
>
> -Scott

Ok, I've avoided this thread for a while, but I'll make my comment now.
I've played with several ideas of reg-ex extensions that would allow
arbitrary "parsing".  My first goal was to be able to parse perl-like text,
then later a simple nested parentheses, then later nested xml as with this
thread.

I have been able to solve these problems using perl5.6's recursive reg-ex's,
and inserted procedure code.  Unfortunately this isn't very safe, nor is it
'pretty' to figure out by a non-perl-guru.  What's more, what I'm attempting
to do with these nested parens and xml is to _parse_ the data.. Well, guess
what guys, we've had decades of research into the area of parsing, and we
came out with yacc and lex.  My point is that I think we're approaching this
the wrong way.  We're trying to apply more and more parser power into what
classically has been the lexer / tokenizer, namely our beloved
regular-expression engine.

A great deal of string processing is possible with Perl's enhanced NFA
engine, but at some point we're looking at Perl code that is inside out: all
code embedded within a reg-ex.  That, boys and girls, is a parser, and I'm
not convinced it's the right approach for rapid design, and definitely not
for large-scale robust design.

As for XML, we already have lovely C modules that take care of that.  You
even get your choice: call per tag, or generate a tree (where you can search
for sub-trees).  What else could you want?  (Ok, stupid question, but you
could still accomplish it via a customized parser.)

My suggestion, therefore, would be to discuss a method of incorporating more
powerful and convenient parsing within _perl_; not necessarily directly
within the reg-ex engine, and most likely not within a reg-ex statement.  I
know we have Yacc and Parser modules.  But try this out for size: Perl's
very name is about extraction and reporting.  Reg-ex's are fundamental to
this, but for complex jobs, so is parsing.  After I think about this some
more, I'm going to make an RFC for it.  If anyone has any hardened opinions
on the matter, I'd like to hear from you while my brain churns.

-Michael





Re: XML/HTML-specific ?< and ?> operators? (was Re: RFC 145 (alternate approach))

2000-09-06 Thread Tom Christiansen

> I am working on an RFC
> to allow boolean logic (&& and || and !) to apply a number of patterns to
> the same substring to allow easier mining of information out of such
> constructs. 

What, you don't like: :-)

$pattern = $conjunction eq "AND"
    ? join(''  => map { "(?=.*$_)" } @patterns)
    : join("|" => @patterns);
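
A small sketch of how that expression behaves in practice, wrapped in a hypothetical combine() helper (the helper name, the sample patterns and the /s flag are assumptions for illustration):

```perl
use strict;
use warnings;

# Hypothetical helper: build one pattern that requires all ("AND")
# or any ("OR") of the given sub-patterns to match.
sub combine {
    my ($conjunction, @patterns) = @_;
    my $pattern = $conjunction eq "AND"
        ? join(''  => map { "(?=.*$_)" } @patterns)   # one lookahead per pattern
        : join("|" => @patterns);                     # plain alternation
    return qr/$pattern/s;
}

my $all = combine(AND => qw(foo bar));
my $any = combine(OR  => qw(foo bar));

print "AND matches\n" if     "bar then foo" =~ $all;
print "OR matches\n"  if     "only foo"     =~ $any;
print "AND fails\n"   unless "only foo"     =~ $all;
```

The lookaheads all start from the same position, which is why the order of the words in the string does not matter for the AND case.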

--tom



Re: XML/HTML-specific ?< and ?> operators? (was Re: RFC 145 (alternate approach))

2000-09-06 Thread David Corbin

Jonathan Scott Duff wrote:
 
> On Wed, Sep 06, 2000 at 08:40:37AM -0700, Nathan Wiger wrote:
> > What if we added special XML/HTML-parsing ?< and ?> operators?
> 
> What if we just provided deep enough hooks into the RE engine that
> specialized parsing constructs like these could easily be added by
> those who need them?
 

In principle, that's a very Perlish thing to do...

> -Scott
> --
> Jonathan Scott Duff
> [EMAIL PROTECTED]

-- 
David Corbin
Mach Turtle Technologies, Inc.
http://www.machturtle.com
[EMAIL PROTECTED]



Re: XML/HTML-specific ?< and ?> operators? (was Re: RFC 145 (alternate approach))

2000-09-06 Thread Mark-Jason Dominus


> ...My point is that I think we're approaching this
> the wrong way.  We're trying to apply more and more parser power into what
> classically has been the lexer / tokenizer, namely our beloved
> regular-expression engine.

I've been thinking the same thing.  It seems to me that the attempts to
shoehorn parsers into regex syntax have either been unsuccessful
(yielding an underpowered extension) or illegible or both.

An approach that appears to have been more successful is to find ways
to integrate regexes *into* parser code more effectively.  Damian
Conway's Parse::RecDescent module does this, and so does SNOBOL.

In SNOBOL, if you want to write a pattern that matches balanced
parentheses, it's easy and straightforward and legible:

    parenstring = '(' *parenstring ')'
                | *parenstring *parenstring
                | span('()')


(span('()') is like [^()]* in Perl.)

The solution in Parse::RecDescent is similar.

Compare this with the solutions that work now:

 # man page solution
 $re = qr{
          \(
            (?:
               (?> [^()]+ )    # Non-parens without backtracking
             |
               (??{ $re })     # Group with matching parens
            )*
          \)
        }x;
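
To see the man page pattern in action, here is a minimal self-contained sketch. The test strings and the use of a package variable for $re are choices made for this illustration.

```perl
use strict;
use warnings;

our $re;   # package variable, so (??{ $re }) sees the finished pattern at match time
$re = qr{
    \(
    (?:
        (?> [^()]+ )    # non-parens without backtracking
      |
        (??{ $re })     # group with matching parens
    )*
    \)
}x;

for my $text ('(a(b)c)', '(a(b c)', '()') {
    my $verdict = $text =~ m{\A $re \z}x ? 'balanced' : 'not balanced';
    print "$text: $verdict\n";
}
```

Note that the pattern only matches one parenthesized group; the anchors in the loop are what turn it into a whole-string check.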

This is not exactly the same, but I tried a direct translation:

 $re = qr{ \( (??{$re}) \)
         | (??{$re}) (??{$re})
         | (?> [^()]+)
         }x;

and it looks worse and dumps core.  

This works:

qr{
  ^
  (?{ local $d=0 })
  (?:   
  \(
  (?{$d++}) 
   |  
  \)
  (?{$d--})
  (?
(?{$d<0})
(?!) 
  )  
   |  
  (? [^()]* )
  
  )* 


  (?
(?{$d!=0})  
(?!)
  )
 $
}x;

but it's rather difficult to take seriously.

The solution proposed in the recent RFC 145:

/([^\m]*)(\m)(.*?)(\M)([^\m\M]*)/g

is not a lot better.  David Corbin's alternative looks about the same.

On a different topic from the same barrel, we just got a proposal that
([23,39]) should match only numbers between 23 and 39.  It seems to me
that rather than trying to shoehorn one special-purpose syntax after
another into the regex language, which is already overloaded, that it
would be better to try to integrate regex matching better with Perl
itself.  Then you could use regular Perl code to control things like
numeric ranges.  

Note that at present, you can get the effect of [(23,39)] by writing
this:

(\d+)(?(?{$1 < 23 || $1 > 39})(?!))

which isn't pleasant to look at, but I think it points in the right
direction, because it is a lot more flexible than [(23,39)].  If you
need to fix it to match 23.2 but not 39.5, it is straightforward to do
that:  
  
(\d+(\.\d*)?)(?(?{$1 < 23 || $1 > 39})(?!))
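
As a rough illustration, the sketch below runs an anchored version of the integer pattern next to the same range test written as ordinary Perl after a plain match. The anchors, the in_range() helper and the sample numbers are assumptions added for the example.

```perl
use strict;
use warnings;

# Anchored form of the pattern: the conditional forces a failure
# whenever the captured number falls outside 23..39.
my $in_regex = qr{ \A (\d+) (?(?{ $1 < 23 || $1 > 39 })(?!)) \z }x;

# The same test written as ordinary Perl after a plain match.
sub in_range {
    my ($text) = @_;
    return $text =~ /\A(\d+)\z/ && $1 >= 23 && $1 <= 39;
}

for my $n (qw(22 23 39 40)) {
    printf "%s: regex=%s perl=%s\n", $n,
        ($n =~ $in_regex ? 'yes' : 'no'),
        (in_range($n)    ? 'yes' : 'no');
}
```

Without the anchors the regex version can backtrack into a shorter run of digits, or skip ahead to a later one, and match part of a larger number, which is one more reason the plain-Perl form is easier to reason about.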

The [(23,39)] notation, however, is doomed.  All you can do is
propose Yet Another Extension for Perl 7.

The big problem with 

(\d+)(?(?{$1 < 23 || $1 > 39})(?!))

is that it is hard to read and understand.

The real problem here is that regexes are single strings.  When you
try to compress a programming language into a single string this way,
you end up with something that looks like Befunge or TECO.  We are
going in the same direction here.

Suppose there were an alternative syntax for regexes that did *not*
require that everything be compressed into a single string?  Rather
than trying to pack all of Perl into the regex syntax, bit by bit,
using ever longer and more bizarre punctuation sequences, I think a
better solution would be to try to expose the parts of the regex
engine that we are trying to control.

I have some ideas about how to do this, and I will try to write up an
RFC this week.



Re: XML/HTML-specific ?< and ?> operators? (was Re: RFC 145 (alternate approach))

2000-09-06 Thread Jarkko Hietaniemi

On Wed, Sep 06, 2000 at 03:47:57PM -0700, Randal L. Schwartz wrote:
> >>>>> "Mark-Jason" == Mark-Jason Dominus <[EMAIL PROTECTED]> writes:
> 
> Mark-Jason> I have some ideas about how to do this, and I will try to
> Mark-Jason> write up an RFC this week.
> 
> "You want Icon, you know where to find it..." :)

Hey, it's one of the few languages we haven't yet stolen a neat
feature or few from...  (I don't really count the few regex thingies
as full-fledged stealing, more like an experimental sleight-of-hand.)

> But yes, a way that allows programmatic backtracking sort of "inside out"
> from a regex would be nice.

-- 
$jhi++; # http://www.iki.fi/jhi/
# There is this special biologist word we use for 'stable'.
# It is 'dead'. -- Jack Cohen