Re: XML/HTML-specific ?< and ?> operators? (was Re: RFC 145 (alternate approach))
> "Jarkko" == Jarkko Hietaniemi <[EMAIL PROTECTED]> writes: >> "You want Icon, you know where to find it..." :) Jarkko> Hey, it's one of the few languages we haven't yet stolen a Jarkko> neat feature or few from... (I don't really count the few Jarkko> regex thingies as full-fledged stealing, more like an Jarkko> experimental sleight-of-hand.) I think the -1 indexing for "end of array" came from there. Or at least, it was in Perl long before it was in Python, and it was in Icon before it was in Perl, so I had always presumed Larry had seen Icon. Larry? -- Randal L. Schwartz - Stonehenge Consulting Services, Inc. - +1 503 777 0095 <[EMAIL PROTECTED]> http://www.stonehenge.com/merlyn/> Perl/Unix/security consulting, Technical writing, Comedy, etc. etc. See PerlTraining.Stonehenge.com for onsite and open-enrollment Perl training!
Re: XML/HTML-specific ?< and ?> operators? (was Re: RFC 145 (alternate approach))
On Wed, Sep 06, 2000 at 03:47:57PM -0700, Randal L. Schwartz wrote: > > "Mark-Jason" == Mark-Jason Dominus <[EMAIL PROTECTED]> writes: > > Mark-Jason> I have some ideas about how to do this, and I will try to > Mark-Jason> write up an RFC this week. > > "You want Icon, you know where to find it..." :) Hey, it's one of the few languages we haven't yet stolen a neat feature or few from... (I don't really count the few regex thingies as full-fledged stealing, more like an experimental sleight-of-hand.) > But yes, a way that allows programmatic backtracking sort of "inside out" > from a regex would be nice. -- $jhi++; # http://www.iki.fi/jhi/ # There is this special biologist word we use for 'stable'. # It is 'dead'. -- Jack Cohen
Re: XML/HTML-specific ?< and ?> operators? (was Re: RFC 145 (alternate approach))
> > "Mark-Jason" == Mark-Jason Dominus <[EMAIL PROTECTED]> writes: > > Mark-Jason> I have some ideas about how to do this, and I will try to > Mark-Jason> write up an RFC this week. > > "You want Icon, you know where to find it..." :) That's exactly my motivation. It seems to me that trying to cram Icon into regexes isn't working well, but that a small transplant of Icon into the core language might suffice instead of a lot of cramming.
Re: XML/HTML-specific ?< and ?> operators? (was Re: RFC 145 (alternate approach))
> "Mark-Jason" == Mark-Jason Dominus <[EMAIL PROTECTED]> writes: Mark-Jason> I have some ideas about how to do this, and I will try to Mark-Jason> write up an RFC this week. "You want Icon, you know where to find it..." :) But yes, a way that allows programmatic backtracking sort of "inside out" from a regex would be nice. -- Randal L. Schwartz - Stonehenge Consulting Services, Inc. - +1 503 777 0095 <[EMAIL PROTECTED]> http://www.stonehenge.com/merlyn/> Perl/Unix/security consulting, Technical writing, Comedy, etc. etc. See PerlTraining.Stonehenge.com for onsite and open-enrollment Perl training!
RFC 198 (v1) Boolean Regexes
This and other RFCs are available on the web at
  http://dev.perl.org/rfc/

=head1 TITLE

Boolean Regexes

=head1 VERSION

  Maintainer: Richard Proctor <[EMAIL PROTECTED]>
  Date: 6 Sep 2000
  Mailing List: [EMAIL PROTECTED]
  Version: 1
  Number: 198
  Status: Developing

=head1 ABSTRACT

This is a development of the proposal for the "not a pattern" concept in RFC 166 V1. Looking deeper into the handling of advanced regexes, there are potential needs for many other concepts that would allow a regex to extract information directly from a complex file in one go, rather than through the mixture of splits and nested regexes that is typically needed today. With these, parsing data should become easier (in some cases).

=head1 DESCRIPTION

It would be nice (in my opinion) to be able to build more elaborate regexes, allowing data to be mined out of a string in one go. These ideas allow you to apply several patterns to one substring (each must match), to fail a match from within, to look for patterns that do not contain other patterns, and to handle cases such as (foo.*bar)|(bar.*foo) in a more general way, by saying "a substring that contains both foo and bar".

These are ideas, at present with some proposed syntax. The ideas are more important than the exact syntax at this stage. This is very much work in progress.

I have called these boolean regexes because they bring the concepts of and (&&), or (||) and not (!) into the realm of regexes.

Within a boolean regex (or the boolean part of a regex), several new symbols have meanings, and some have enhanced meanings.

=head2 The Ideas

Are these part of a boolean (?&...) construct within an existing regex, or is the advanced syntax (and the meaning of &!^$) invoked by a new flag such as /B?

These can look like line noise, so white space with /x is used throughout, and it might be appropriate to enforce (or assume) /x within these. They are all shown here using (?&...) with /x assumed.

=head3 Boolean construct (?&...)

Grabs a substring and applies one or more tests to that substring.

=head3 Substring matching multiple patterns (&&)

  (?& pattern1 && pattern2 && pattern3 )

A substring is defined that matches each pattern. For example, the first pattern may specify a substring of at least 30 chars, and the next two that it has a foo and a bar.

=head3 Substring matching alternative patterns (||)

  (?& pattern1 || pattern2 || pattern3)

This is similar to the existing alternative syntax "|", but each alternative behaves as /^pattern$/ rather than /pattern/ (^ and $ taken as referring to the substring in this case - see below).

(pattern1 || pattern2 || pattern3) can be mixed in with the && case above to build up more advanced cases. The && and || operators can be nested with brackets in the normal ways.

=head3 Brackets within boolean regexes

Within a complex boolean regex there are likely to be lots and lots of brackets to nest and control the behaviour of the regex. Rather than having to sprinkle the regex with (?:) line noise, it would be nicer to just use ordinary brackets () and only support capturing of elements by using one of the (?$=) or (?%=) constructs that have been proposed elsewhere.

=head3 Substring not matching a pattern

In RFC 166 I originally proposed (?^ pattern ). This proposal replaces that.

!pattern matches anything that does not match the pattern. On its own it is not terribly useful, but in conjunction with && and || one can do things such as (?& /ix;

It might be possible to have a regex that simply matches valid perl6 out of this (though it would be large...)!

=head1 IMPLEMENTATION

Implementation detail is not appropriate for this stage in the development of this RFC. If the concepts gain approval then detailed implementation issues become relevant.

There are two aspects to regexes - compiling and executing:

Compiling of these extended forms should be relatively straightforward, but would need some extensions to recognise the regex as being within the (?&) state to handle the extended syntax.

Executing - No thoughts at all at present.

=head1 REFERENCES

RFC 166: Additions to regexs

RFC 112: Assignment within a regex

RFC 150: Extend regex syntax to provide for return of a hash of matched subpatterns

RFC 145: Brace-matching for Perl Regular Expressions (or at least the discussion that followed it)
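For comparison, part of what RFC 198 asks for can already be approximated in Perl 5 with lookahead assertions, at least when the "substring" being tested is the whole string. The following is only a sketch under that assumption; the variable name and sample strings are illustrative and not part of the RFC, and "does not contain baz" is a loose stand-in for the proposed !pattern.

```perl
use strict;
use warnings;

# Rough Perl 5 stand-in for something like
#   (?& .{30,} && foo && bar && !baz )
# applied to the whole string rather than to a substring:
#   (?=.*foo)  the string contains foo
#   (?=.*bar)  the string contains bar (order does not matter)
#   (?!.*baz)  the string does not contain baz
#   .{30,}     the string is at least 30 characters long
my $both = qr/^(?=.*foo)(?=.*bar)(?!.*baz).{30,}$/s;

for my $s ('bar ' . ('x' x 30) . ' foo',
           'foo and bar',
           'foo bar baz ' . ('x' x 30)) {
    printf "%-12.12s... %s\n", $s, $s =~ $both ? 'match' : 'no match';
}
```

The lookaheads only cover the case where every test is anchored to the same starting point, which is why the RFC's substring-oriented (?&...) would be more general.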
What's in a Regex (was RFC 145)
I've been tossing an idea around in my head, and I've not yet decided if this is the most brilliant idea I've ever come up with :), or perhaps the lamest. I'm sure it would be cool, but that doesn't mean it should be pursued. I'm going to throw this one out in the open, and if it's not shot full of holes, I'll see if I can RFC it.

A string is conceptually a list of characters. The perl pattern engine is designed to recognize patterns in strings (i.e. a list of characters).

Question: Is there value in extending the regex/pattern engine to support matching patterns in a list of foobars?

I can see this taking two forms (beyond the strings we have today). One is matching number patterns (Fibonacci, etc.), and the other is matching lists of objects. The goal here would be to leverage all the GENERIC knowledge/power such as +, *, ?, {x,y}, [], |, (?=) etc., and use this power where the fundamental thingy being matched is not a character.

Now, if you want the syntax for this, that's an entirely separate issue (well, not really). But let's start with the question of whether this is Einsteinian or ElmerFuddian.

--
David Corbin
Mach Turtle Technologies, Inc.
http://www.machturtle.com
[EMAIL PROTECTED]
Re: XML/HTML-specific ?< and ?> operators? (was Re: RFC 145 (alternate approach))
> >...My point is that I think we're approaching this
> >the wrong way. We're trying to apply more and more parser power into what
> >classically has been the lexer / tokenizer, namely our beloved
> >regular-expression engine.

I've been thinking the same thing. It seems to me that the attempts to shoehorn parsers into regex syntax have either been unsuccessful (yielding an underpowered extension) or illegible or both.

An approach that appears to have been more successful is to find ways to integrate regexes *into* parser code more effectively. Damian Conway's Parse::RecDescent module does this, and so does SNOBOL.

In SNOBOL, if you want to write a pattern that matches balanced parentheses, it's easy and straightforward and legible:

        parenstring = '(' *parenstring ')'
                    | *parenstring *parenstring
                    | span('()')

(span('()') is like [^()]* in Perl.)

The solution in Parse::RecDescent is similar.

Compare this with the solutions that work now:

        # man page solution
        $re = qr{
                  \(
                  (?:
                     (?> [^()]+ )    # Non-parens without backtracking
                   |
                     (??{ $re })     # Group with matching parens
                  )*
                  \)
                }x;

This is not exactly the same, but I tried a direct translation:

        $re = qr{ \( (??{$re}) \)
                | (??{$re}) (??{$re})
                | (?> [^()]+)
                }x;

and it looks worse and dumps core.

This works:

        qr{
          ^
          (?{ local $d=0 })
          (?:
              \(
              (?{$d++})
           |
              \)
              (?{$d--})
              (?(?{$d<0}) (?!) )
           |
              (?> [^()]* )
          )*
          (?(?{$d!=0}) (?!) )
          $
         }x;

but it's rather difficult to take seriously.

The solution proposed in the recent RFC 145:

        /([^\m]*)(\m)(.*?)(\M)([^\m\M]*)/g

is not a lot better. David Corbin's alternative looks about the same.

On a different topic from the same barrel, we just got a proposal that ([23,39]) should match only numbers between 23 and 39. It seems to me that rather than trying to shoehorn one special-purpose syntax after another into the regex language, which is already overloaded, it would be better to try to integrate regex matching better with Perl itself. Then you could use regular Perl code to control things like numeric ranges.

Note that at present, you can get the effect of [(23,39)] by writing this:

        (\d+)(?(?{$1 < 23 || $1 > 39})(?!))

which isn't pleasant to look at, but I think it points in the right direction, because it is a lot more flexible than [(23,39)]. If you need to fix it to match 23.2 but not 39.5, it is straightforward to do that:

        (\d+(\.\d*)?)(?(?{$1 < 23 || $1 > 39})(?!))

The [(23,39)] notation, however, is doomed. All you can do is propose Yet Another Extension for Perl 7.

The big problem with

        (\d+)(?(?{$1 < 23 || $1 > 39})(?!))

is that it is hard to read and understand.

The real problem here is that regexes are single strings. When you try to compress a programming language into a single string this way, you end up with something that looks like Befunge or TECO. We are going in the same direction here.

Suppose there were an alternative syntax for regexes that did *not* require that everything be compressed into a single string? Rather than trying to pack all of Perl into the regex syntax, bit by bit, using ever longer and more bizarre punctuation sequences, I think a better solution would be to try to expose the parts of the regex engine that we are trying to control.

I have some ideas about how to do this, and I will try to write up an RFC this week.
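The conditional-plus-(?!) idiom from the message above can be tried directly. This is a sketch rather than anything from the original post: the ^ and $ anchors, the non-capturing group around the fraction, and the name $in_range are assumptions added so that a whole token is tested instead of a prefix (without the anchors, "230" would succeed by backtracking to "23").

```perl
use strict;
use warnings;

# Match a whole number (optionally with a fraction) only if its value
# lies between 23 and 39 inclusive. The code condition looks at $1
# and, when the value is out of range, (?!) forces a failure so the
# engine has to backtrack.
my $in_range = qr/^(\d+(?:\.\d*)?)(?(?{ $1 < 23 || $1 > 39 })(?!))$/;

for my $n (qw(22 23 23.2 39 39.5 230)) {
    printf "%-5s %s\n", $n, $n =~ $in_range ? 'match' : 'no match';
}
```

The range bounds live in ordinary Perl expressions here, which is the flexibility the message argues for: changing the test means editing the code block, not inventing new regex syntax.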
Re: XML/HTML-specific ?< and ?> operators? (was Re: RFC 145 (alternate approach))
Jonathan Scott Duff wrote: > > On Wed, Sep 06, 2000 at 08:40:37AM -0700, Nathan Wiger wrote: > > What if we added special XML/HTML-parsing ?< and ?> operators? > > What if we just provided deep enough hooks into the RE engine that > specialized parsing constructs like these could easily be added by > those who need them? > In principle, that's a very Perlish thing to do... > -Scott > -- > Jonathan Scott Duff > [EMAIL PROTECTED] -- David Corbin Mach Turtle Technologies, Inc. http://www.machturtle.com [EMAIL PROTECTED]
Re: RFC 145 (alternate approach)
- Original Message -
From: "Richard Proctor" <[EMAIL PROTECTED]>
Sent: Tuesday, September 05, 2000 1:49 PM
Subject: Re: RFC 145 (alternate approach)

> On Tue 05 Sep, David Corbin wrote:
> > Nathan Wiger wrote:
> > > But, how about a new ?m operator?
> > >    /(?m<<|[).*?(?M>>|])/;
>
> There already is a (?m
>
> Current Use in perl5
>
> (?#       comment
> (?imsx    flags
> (?-imsx   flags
> (?:       subexpression without bracket capture
> (?=       zero-width positive look ahead
> (?!       zero-width negative look ahead
> (?<=      zero-width positive look behind
> (?<!      zero-width negative look behind
> (?{code}  Execute code
> (??{code} Execute code and use result as pattern
> (?>       Independent subexpression
> (?(condition)yes-pattern
> (?(condition)yes-pattern|no-pattern
>
> Suggested in RFCs either current or in development
>
> (?$foo=   suggested for assignment (RFC 112)
> (?%foo=   suggested for hash assignment (RFC 150?)
>
> (?@foo    suggested list expansion (?:$foo[0] | $foo[1] | ...) ? (RFC 166)
> (?Q@foo)  Quote each item of lists (RFC 166)
> (?^pattern) matches anything that does not match pattern
>           (RFC 166 but will be somewhere else on next rewrite [1])
> (?F       Failure tokens (RFC in development by me [1])
> (?r),(?f) Suggested in Direction Control RFC 1
> (?&       Boolean regexes (RFC in development [1])
> (?*{code}) Execute code with pass/fail result (RFC in development [1])
>
> a,b,c,d,e, ,g,h, ,j,k,l, ,n,o,p,q, , ,t,u,v,w,x,y,z
> A,B,C,D,E, ,G,H,I,J,K,L,M,N,O,P, ,R,S,T,U,V,W,X,Y,Z
> 0,1,2,3,4,5,6,7,8,9
> `_,."+[];'~)

Ok, I've read through some of the archives, and thought this was a good starting point. I haven't seen any discussion on an obvious solution (though in another email, I suggested that this approach should be forgone in favor of a parsing approach. But one thing at a time).

There are two general problems as I see it. First, you have to be able to specify exactly what you're matching. Obviously, generically matching "[<(`" etc. is going to be upset if your nesting has simple things like " a < 5 " or "I'm going home, it's hot". A design goal, therefore, should be to explicitly state the matching characters. Second, you need to be able to apply additional expression syntax to match inside the nesting.

An additional problem occurs when you suggest using pragmas to specify delimiters. It could be a performance hit, if not a developer's nightmare. When I run eval, must I always set the pragma, just in case there is some weird scoping problem? Same problem as when using all global variables (and the 'local' keyword. God I hate that thing).

Therefore, I suggest a commonly used form:

    /(?N [ { ] ... )/x

Note that I use N, which stands for nesting, instead of the redundant 'M'atch. I don't know how well character-based op-codes will be accepted. As pointed out above, the symbol-space is shrinking fast.

The dots describe further matching / capturing within the delimiters. Thus

    /A (?N [ { ] ) B/x

will match 'A' followed by a bracket grouping (anything therein is fine), then followed by 'B'.

    /A (?N [ { ] ( .* ) ) B/x

does the same as above, but captures the internal contents (excluding the delimiters).

    /A ( (?N [ { ] ) ) B/x

will capture all the contents, including the delimiters.

    /A (?N [ [ ( ] ( .* ) ) B/x

Same as before, but with squares and parentheses.

Note delim specifiers can obey the same rules as normal character classes, thus [ [ ( { < ] means collect the entire group. POSIX classes can be used for all of them, as in [=open_braces=] (don't care what the phrase actually is).

The reason I chose this is because we are essentially doing a character class, so we might as well explicitly use one; it makes more logical sense.

Note that to make emacs happy, you should be able to escape all the one-way delimiters, as in [ \[ \( \{ \< ]. That might also make it easier to read, explicitly showing that these are being treated as characters, and not as actual operators.

As for special operations such as (/* ... */ ), I would recommend the usage of named character classes: [=c_comment=], for example. I'm not sure how those classes are defined, but this obviously requires the system to be extensible (RFC anyone?). Of course this violates my issue of using pragmas to alter the operation of reg-ex's. Most likely only built-in types should work.

Another feature could be to treat the end of matching-brace as an end-of-line. Thus the above .* will properly exit. If this turns out to not work, then .* can necessarily be replaced by .*?. The advantage of this is in nested expressions, as in:

    $r_kw = qr/Keyword \s* .* /x;
    $r_lisp_expr = qr/ (?N [ ( ] $r_kw ) /x;
    $line = <>;
    $line =~ $r_lisp_expr;

But this would also have worked with:

    $r_kw = qr/Keyword \s* .* $/x;

Since '$' would treat ')' as '\n'.

The main advantages of this approach are:

That you can still pre-compile an expression and guarantee that it won't need recompiling, and that it'll always act the same.

That you can nest the puppies with complete lack of ambiguity, and littl
RFC 197 (v1) Numeric Value Ranges In Regular Expressions
This and other RFCs are available on the web at
  http://dev.perl.org/rfc/

=head1 TITLE

Numeric Value Ranges In Regular Expressions

=head1 VERSION

  Maintainer: David Nicol <[EMAIL PROTECTED]>
  Date: 5 September 2000
  Mailing List: [EMAIL PROTECTED]
  Version: 1
  Number: 197
  Status: Developing

=head1 ABSTRACT

Round and square brackets mated around two optional comma-separated numbers match iff a gobbled number is within the described range.

=head1 DESCRIPTION

=head2 the syntax of the numeric range regex element

Given a passage of regex text matching

  ($B1,$N1,$N2,$B2) = /(\[|\()(\-?\d*\.?\d*),(\-?\d*\.?\d*)(\]|\))/

and

  ($N1 <= $N2 or $N1 eq '' or $N2 eq '')

we've got something we hereinafter call a "range."

=head2 what the range matches

A range matches, in the target string, a passage C<(\-?\d*\.?\d*)>, also known as a "number", if and only if the number is within the range, in the normal algebraic sense.

=head2 "within the range"

A square bracket means that end of the range includes the range-specifying number itself. A round parenthesis means that end of the range includes numbers of value up to (or down to) the range-specifying number, but not equal to it.

=head2 infinity

In the event that one or the other of the range-specifying numbers is the empty string, that end of the range is unbounded. In the further event that we have defined infinity and negative infinity on our numbers, the square/round distinction will come into play.

=head1 COMPATIBILITY

To disambiguate ranges from character sets including digits, commas, and parentheses, either put a backslash on the right parenthesis or on the comma, or arrange things so the left-hand side of the comma is greater than the right-hand side; that way this special case will not apply:

  /(37.3,200)/;   # matches any number x, 37.3 < x < 200
  /([37,))/;      # matches and saves any number >= 37.
  /(37\,200)/;    # matches and saves the literal text '37,200'
  /[-35,9)]/;     # matches any number x, -35 <= x < 9; followed by a ]
  /[3-5,9)]/;     # matches a string containing any of 3,4,5,,,9 or )

=head1 IMPLEMENTATION

When applying regular expressions to numeric data, ranges may optimize away all of the digit lookahead we must currently indulge in to implement them in perl5.

If we have infinity defined, we'll have to recognize it in strings.

=head1 BUT WAIT THERE'S MORE

It is possible that the syntax described in this document may help slice multidimensional containers. (RFC 191)

=head1 REFERENCES

high school algebra
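The bracket semantics described above (square means inclusive, round means exclusive, an empty bound means unbounded) can be spelled out in ordinary Perl. This is a sketch only: `in_range` is a hypothetical helper, not part of the RFC, and it checks a number against a range written as a string rather than extending the regex engine. It reuses the RFC's range-recognising pattern, anchored, and does not enforce the $N1 <= $N2 condition or reject degenerate bounds such as a lone "-".

```perl
use strict;
use warnings;

# Hypothetical helper: does $x fall inside the range written as $spec?
# Returns 1 or 0, or nothing at all if $spec is not a range.
sub in_range {
    my ($spec, $x) = @_;

    # Same shape as the RFC's range-recognising pattern, anchored.
    my ($b1, $n1, $n2, $b2) =
        $spec =~ /^(\[|\()(-?\d*\.?\d*),(-?\d*\.?\d*)(\]|\))$/
        or return;

    # An empty bound leaves that end of the range unbounded.
    if ($n1 ne '') {
        return 0 if $b1 eq '[' ? $x < $n1 : $x <= $n1;
    }
    if ($n2 ne '') {
        return 0 if $b2 eq ']' ? $x > $n2 : $x >= $n2;
    }
    return 1;
}

for my $case (['(37.3,200)', 100], ['(37.3,200)', 37.3],
              ['[37,)', 37],       ['[-35,9)', 9]) {
    my ($spec, $x) = @$case;
    printf "%-12s %-5s %s\n", $spec, $x, in_range($spec, $x) ? 'in' : 'out';
}
```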
Re: XML/HTML-specific ?< and ?> operators? (was Re: RFC 145 (alternate approach))
>I am working on an RFC
>to allow boolean logic ( && and || and !) to apply a number of patterns to
>the same substring to allow easier mining of information out of such
>constructs.

What, you don't like: :-)

    $pattern = $conjunction eq "AND"
                 ? join(''  => map { "(?=.*$_)" } @patterns)
                 : join("|" => @patterns);

--tom
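The one-liner above can be wrapped up and exercised as follows. This is a sketch: `build_pattern` is a hypothetical name, the /s flag is an added assumption so that .* can cross newlines, and the patterns are interpolated as regex source, so literal strings would need quotemeta first.

```perl
use strict;
use warnings;

# Hypothetical wrapper around the snippet above.
# AND: one lookahead per pattern, so all must occur somewhere.
# OR:  plain alternation, so any one is enough.
sub build_pattern {
    my ($conjunction, @patterns) = @_;
    my $src = $conjunction eq 'AND'
        ? join('',  map { "(?=.*$_)" } @patterns)
        : join('|', @patterns);
    return qr/$src/s;
}

my $and = build_pattern('AND', 'foo', 'bar');
my $or  = build_pattern('OR',  'foo', 'bar');

for my $s ('xx bar yy foo', 'only foo', 'neither') {
    printf "%-14s AND:%-3s OR:%s\n", $s,
        ($s =~ $and ? 'yes' : 'no'),
        ($s =~ $or  ? 'yes' : 'no');
}
```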
Re: XML/HTML-specific ?< and ?> operators? (was Re: RFC 145 (alternate approach))
On Wed 06 Sep, David Corbin wrote: > Nathan Wiger wrote: > > > > > It would be useful (and increasingly more common) to be able to match > > > qr|<\s*(\w+)([^>]*)>| to qr|<\s*/\1\s*>|, and handle the case where > > > those can nest as well. Something like > > > > > > match this with > > > > > > not this but > > >this. > > > > I suspect this is going to need a ?[ and ?] of its own. I've been > > thinking about this since your email on the subject yesterday, and I > > don't see how either RFC 145 or this alternative method could support > > it, since there are two tags - > and > asymmetrically, and neither approach gives any credence to what's > > contained inside the tag. So would be matched itself as "< matches > > >". > > Actually, in one of my responses I did outline a syntax which would handle > this with reasonably ease, I think. If the contents of (?[) is considered > a pattern, then you can define a matching pattern. I think it should be a list of patterns rather than a single pattern. Each pattern in the list is attempted left to right until one matches. I now dont think it should be a hash as it needs to be ordered. But using the => as the l/r separateor does make it clear. > > m:(?['<\w+>' => '').*(?]): > > > I'll grant you it's not the simplest syntax, but it's a lot simpler than > using the 5.6 method... :) Actually that simple case is handled as m:<(\w+)>.*: but I think this is getting somewhere. This is a rich syntax that has lots of potential uses, not just for html. > > > > What if we added special XML/HTML-parsing ?< and ?> operators? > > Unfortunately, as Richard notes, ?> is already taken, but I will use it > > for the examples to make things symmetrical. > > > >?< = opening tag (with name specified) > >?> = closing tag (matches based on nesting) We are running out of (? syntax, we might want to find some other construct before long. But anyway, XML/HTML is important, but I am not convinced that what is being covered here really helps. 
I am working on an RFC to allow boolean logic ( && and || and !) to apply a number of patterns to the same substring to allow easier mining of information out of such constructs. Richard -- [EMAIL PROTECTED]
Re: XML/HTML-specific ?< and ?> operators? (was Re: RFC 145 (alternate approach))
>...My point is that I think we're approaching this >the wrong way. We're trying to apply more and more parser power into what >classically has been the lexer / tokenizer, namely our beloved >regular-expression engine. >A great deal of string processing is possible with perls enhanced NFA >engine, but at some point we're looking at perl code that is inside out: all >code embedded within a reg-ex. That, boys and girls, is a parser, and I'm >not convinced it's the right approach for rapid design, and definately not >for large-scale robust design. What you say has, I think, a great deal of sense. While Jon and I--with Nathan, actually (see inside page credits)--were trying to figure out how to go about presenting all this wacky stuff for the final section of the new regex chapter in the Camel: Fancy Patterns Lookaround Assertions Non-Backtracking Subpatterns Programmatic Patterns Generated patterns Substitution evaluations Match-time code evaluation Match-time pattern interpolation Conditional interpolation Defining Your Own Assertions We kept coming back to sentiments remarkably similar to those you yourself have just expressed: although I think we managed to put a decently positive shine on the matter for the print version, it still really seems that that the inside-outness of this is very hard on your brain, and of remarkably abstruse appeal to the incredibly few. (Names of the usual suspects omitted to avoid using four-letter words in public forums. :-) I would welcome a less inside-out approach, as well as one that were more procedural--or at least more symbolic and less punctuational. --tom
Re: XML/HTML-specific ?< and ?> operators? (was Re: RFC 145 (alternate approach))
David Corbin wrote: > > m:(?['' => '').*(?]): > > or more generically > > m:(?['<\w+>' => '').*(?]): I think these are good; but I do also like the idea of "automatic reversing" by default, since that's a common operation. Let's combine the ideas, as Richard suggests. How about: 1. When a scalar value is provided as the argument to ?[, then that value is automatically reversed character-wise and bracket-wise. 2. When a list is provided, each pair in the list is what to match. So here are some examples: m/(?[<<<)Some stuff(?])/;# <<>> m/\@(?[{[)weird perl(?])/; # @{[weird perl]} m/(?['<\w+>' => '').*(?])/; # Text # less verbose, more robust my @tag = qw('<(\w+)\s*.*?>' => ''); m/(?[@tag)Some title(?])(?[@tag)Open(?[@tag)Embedded(?])(?]); That last one would match Some title Open Embedded So really, all RFC 145 needs to do is introduce ?[ and ?], which do a couple things by default (like brace-matching and character reversing), but are actually general-purpose nesting operators when provided with a list of things to match. -Nate
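Rule 1 above ("reversed character-wise and bracket-wise") can be sketched in plain Perl. The helper name `mirror` and the choice of the four ASCII bracket pairs are assumptions made for illustration; nothing here is proposed syntax, and the resulting pattern does not handle nesting.

```perl
use strict;
use warnings;

# Hypothetical helper: the closing delimiter implied by an opening one
# under rule 1: reverse the characters, then flip each bracket.
sub mirror {
    my ($open) = @_;
    my $close = reverse $open;          # scalar context: reverses the string
    $close =~ tr/()[]{}<>/)(][}{></;    # swap each bracket for its partner
    return $close;
}

# Build an ordinary (non-nesting) pattern from an opening delimiter.
my $open  = '{[';
my $close = mirror($open);
if ('@{[weird perl]}' =~ /\Q$open\E(.*?)\Q$close\E/) {
    print "open='$open' close='$close' inside='$1'\n";
}
```

Characters with no bracket partner, such as digits, are simply reversed, which is the behaviour the "normal"/"reversed" pairs discussed elsewhere in the thread rely on.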
Re: XML/HTML-specific ?< and ?> operators? (was Re: RFC 145 (alternate approach))
- Original Message -
From: "Jonathan Scott Duff" <[EMAIL PROTECTED]>
Subject: Re: XML/HTML-specific ?< and ?> operators? (was Re: RFC 145 (alternate approach))

> On Wed, Sep 06, 2000 at 08:40:37AM -0700, Nathan Wiger wrote:
> > What if we added special XML/HTML-parsing ?< and ?> operators?
>
> What if we just provided deep enough hooks into the RE engine that
> specialized parsing constructs like these could easily be added by
> those who need them?
>
> -Scott

Ok, I've avoided this thread for a while, but I'll make my comment now. I've played with several ideas of reg-ex extensions that would allow arbitrary "parsing". My first goal was to be able to parse perl-like text, then later simple nested parentheses, then later nested xml as with this thread.

I have been able to solve these problems using perl5.6's recursive reg-ex's and inserted procedure code. Unfortunately this isn't very safe, nor is it 'pretty' for a non-perl-guru to figure out. What's more, what I'm attempting to do with these nested parens and xml is to _parse_ the data. Well, guess what guys, we've had decades of research into the area of parsing, and we came out with yacc and lex. My point is that I think we're approaching this the wrong way. We're trying to apply more and more parser power into what classically has been the lexer / tokenizer, namely our beloved regular-expression engine.

A great deal of string processing is possible with perl's enhanced NFA engine, but at some point we're looking at perl code that is inside out: all code embedded within a reg-ex. That, boys and girls, is a parser, and I'm not convinced it's the right approach for rapid design, and definitely not for large-scale robust design.

As for XML, we already have lovely C modules that take care of that. You even get your choice: a call per tag, or generate a tree (where you can search for sub-trees). What else could you want? (Ok, stupid question, but you could still accomplish it via a customized parser.)

My suggestion, therefore, would be to discuss a method of incorporating more powerful and convenient parsing within _perl_; not necessarily directly within the reg-ex engine, and most likely not within a reg-ex statement. I know we have Yacc and Parser modules. But try this out for size: Perl's very name is about extraction and reporting. Reg-ex's are fundamental to this, but for complex jobs, so is parsing.

After I think about this some more, I'm going to make an RFC for it. If anyone has any hardened opinions on the matter, I'd like to hear from you while my brain churns.

-Michael
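As one small illustration of keeping the regex as the tokenizer and doing the nesting in Perl, here is a sketch of a tag-balance check. `tags_balanced` is a hypothetical helper, the tag pattern is deliberately simplified (no comments, no attributes containing ">", no self-closing tags), and it is not meant as a real XML or HTML parser.

```perl
use strict;
use warnings;

# Hypothetical helper: the regex only recognises one tag at a time;
# nesting is tracked with an ordinary Perl stack.
sub tags_balanced {
    my ($text) = @_;
    my @open;
    while ($text =~ m{<\s*(/?)(\w+)[^>]*>}g) {
        my ($closing, $name) = ($1, lc $2);
        if ($closing) {
            # A closing tag must match the most recently opened one.
            return 0 unless @open && pop(@open) eq $name;
        }
        else {
            push @open, $name;
        }
    }
    return @open ? 0 : 1;    # leftover open tags mean imbalance
}

for my $doc ('<b>x<i>y</i>z</b>', '<b><i></b></i>', '<b>unclosed') {
    printf "%-20s %s\n", $doc, tags_balanced($doc) ? 'balanced' : 'not balanced';
}
```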
Re: RFC 145 (alternate approach)
On Tue 05 Sep, Nathan Wiger wrote:
>     "normal"       "reversed"
>     --------       ----------
>     103            301
>     99aa           aa99
>     ((             ))
>     <+             +>
>     {{[!<_         _>!]}}
>     {__A1(         )A1__}
>
> That is, when a bracket is encountered, the "reverse" of that is
> automatically interpreted as its closing counterpart. This is the same
> reason why qq// and qq() and qq{} all work without special notation.
>
> So we can replace @^g and @^G with simple precedence rules, the same
> that are actually invoked automatically throughout Perl already.
>
> > (?[( => ),{ => }, 01 => 10)
> >
> > sort of hashish in style.
>
> I actually think this is redundant, for the reasons I mentioned above.
> I'm not striking it down outright, but it seems simple rules could make
> all this unnecessary.

I don't think you will ever come up with a set of rules that will satisfy everybody all the time. What about html comments, are they brackets? What about people doing 66/99 pairs?

The best you could achieve is a set of default rules as you have suggested AND a way of overriding them with an explicit hash of what is the closing bracket for each opening bracket. Which of the two methods applies depends on what follows the (?[: is it a hash or not?

For the "Default" method the list of brackets could be, as has been suggested, a regex, or perhaps a simple comma-separated list. For this you should define what is the "reverse" of each character, at least for latin-1. What do you do about the full utf-8...?

An \X type construct that covers all the common brackets might be a useful addition ({
Re: XML/HTML-specific ?< and ?> operators? (was Re: RFC 145 (alternate approach))
On Wed, Sep 06, 2000 at 08:40:37AM -0700, Nathan Wiger wrote: > What if we added special XML/HTML-parsing ?< and ?> operators? What if we just provided deep enough hooks into the RE engine that specialized parsing constructs like these could easily be added by those who need them? -Scott -- Jonathan Scott Duff [EMAIL PROTECTED]
Re: XML/HTML-specific ?< and ?> operators? (was Re: RFC 145 (alternate approach))
Nathan Wiger wrote: > > > It would be useful (and increasingly more common) to be able to match > > qr|<\s*(\w+)([^>]*)>| to qr|<\s*/\1\s*>|, and handle the case where those > > can nest as well. Something like > > > > match this with > > > > not this but > >this. > > I suspect this is going to need a ?[ and ?] of its own. I've been > thinking about this since your email on the subject yesterday, and I > don't see how either RFC 145 or this alternative method could support > it, since there are two tags - > and asymmetrically, and neither approach gives any credence to what's > contained inside the tag. So would be matched itself as "< matches > >". Actually, in one of my responses I did outline a syntax which would handle this with reasonably ease, I think. If the contents of (?[) is considered a pattern, then you can define a matching pattern. Consider either of these. m:(?[]).*?(?]): or m:(?['' => '').*(?]):# really ought to include (?i:) in there, but left out for readablity or more generically m:(?['<\w+>' => '').*(?]): I'll grant you it's not the simplest syntax, but it's a lot simpler than using the 5.6 method... :) > > What if we added special XML/HTML-parsing ?< and ?> operators? > Unfortunately, as Richard notes, ?> is already taken, but I will use it > for the examples to make things symmetrical. > >?< = opening tag (with name specified) >?> = closing tag (matches based on nesting) > > Your example would simply be: > >/(?)[\s\w]*(?>)/; > > What makes me nervous about this is that ?< and ?> seem special-case. > They are, but then again XML and HTML are also pervasive. So a > special-case for something like this might not be any stranger than > having a special-case for sin() and cos() - they're extremely important > operations. 
> > The other thing that this doesn't handle is tags with no closing > counterpart, like: > > > > Perhaps for these the easiest thing is to tell people not to use ?< and > ?>: > >/(?)(?>)/; > > Would match > > > Some stuff > > > Finally, tags which take arguments: > >Stuff > > Would require some type of "this is optional" syntax: > >/(?)/ > > Perhaps only the first word specified is taken as the tag name? This is > the XML/HTML spec anyways. > > -Nate -- David Corbin Mach Turtle Technologies, Inc. http://www.machturtle.com [EMAIL PROTECTED]
XML/HTML-specific ?< and ?> operators? (was Re: RFC 145 (alternate approach))
> It would be useful (and increasingly more common) to be able to match
> qr|<\s*(\w+)([^>]*)>| to qr|<\s*/\1\s*>|, and handle the case where those
> can nest as well. Something like
>
> match this with
>
> not this but
>this.

I suspect this is going to need a ?[ and ?] of its own. I've been
thinking about this since your email on the subject yesterday, and I
don't see how either RFC 145 or this alternative method could support
it, since there are two tags -
and would be matched itself as "< matches >".

What if we added special XML/HTML-parsing ?< and ?> operators?
Unfortunately, as Richard notes, ?> is already taken, but I will use it
for the examples to make things symmetrical.

   ?< = opening tag (with name specified)
   ?> = closing tag (matches based on nesting)

Your example would simply be:

   /(?)[\s\w]*(?>)/;

What makes me nervous about this is that ?< and ?> seem special-case.
They are, but then again XML and HTML are also pervasive. So a
special-case for something like this might not be any stranger than
having a special-case for sin() and cos() - they're extremely important
operations.

The other thing that this doesn't handle is tags with no closing
counterpart, like:

Perhaps for these the easiest thing is to tell people not to use ?< and
?>:

   /(?)(?>)/;

Would match

   Some stuff

Finally, tags which take arguments:

   Stuff

Would require some type of "this is optional" syntax:

   /(?)/

Perhaps only the first word specified is taken as the tag name? This is
the XML/HTML spec anyways.

-Nate
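The pairing rule described here (an opening tag identified by its first word, a closing tag matched according to nesting) can be illustrated outside the regex engine with an explicit stack. This is only a sketch of that rule, not the proposed ?< and ?> syntax; the function name tags_nest_properly is hypothetical, and it deliberately ignores tags with no closing counterpart, which the message notes as an open problem.

```perl
use strict;
use warnings;

# Returns 1 if every <name ...> has a matching </name> in proper nesting
# order, 0 otherwise. Only the first word inside the tag is treated as its
# name; any arguments after it are skipped. Names compare case-insensitively.
sub tags_nest_properly {
    my ($text) = @_;
    my @open;    # names of tags opened but not yet closed
    while ($text =~ m{<\s*(/?)\s*(\w+)[^>]*>}g) {
        my ($closing, $name) = ($1, lc $2);
        if ($closing) {
            # A closing tag must match the most recently opened tag.
            return 0 unless @open && pop(@open) eq $name;
        }
        else {
            push @open, $name;
        }
    }
    return @open ? 0 : 1;
}
```

A call such as tags_nest_properly('<b>bold <i>both</i></b>') succeeds, while interleaved tags like '<b><i></b></i>' are rejected.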
Re: RFC 145 (alternate approach)
At 09:05 AM 9/6/00 -0400, David Corbin wrote:
>I'd suggest also, that (?[) (with no specified brackets) have the
>default meaning
>of the "four standard brackets" :
>
>(?['('=>')','{'=>'}','['=>']','<'=>'>')
>
>Note also the subtle syntax change. We are either dealing with strings
>or with patterns. The consensus seems to be against patterns (I can
>understand that). Given that, we need to quote the right hand side of
>the => operator I think. The quotes on the left side would be optional,
>I think.

It would be useful (and increasingly more common) to be able to match
qr|<\s*(\w+)([^>]*)>| to qr|<\s*/\1\s*>|, and handle the case where those
can nest as well. Something like

match this with

not this but
this.

>Richard Proctor wrote:
> >
> > On Tue 05 Sep, Nathan Wiger wrote:
> > > Eric Roode wrote:
> > > Now *that* sounds cool, I like it!
> > >
> > > What if the RFC only suggested the addition of two new constructs, (?[)
> > > and (?]), which did nested matches. The rest would be bound by standard
> > > regex constructs and your imagination!
> > >
> > > That is, the ?] simply takes whatever the closest ?[ matched and
> > > reverses it, verbatim, including ordering, case, and number of
> > > characters. The only trick would be a way to get what "reverses it"
> > > means correct.
> > >
> >
> > No ?] should match the closest ?[ it should nest the ?[s bound by any
> > brackets in the regex and act accordingly.
> >
> > Also this does not work as a definition of simple bracket matching as you
> > need ( to match ) not ( to match (. A ?[ list should specify for each
> > element what the matching element is perhaps
> >
> > (?[( => ),{ => }, 01 => 10)
> >
> > sort of hashish in style.
> >
> > Perhaps the brackets could be defined as a hash allowing (?[%Hash)
> >
> > Richard
> >
> > --
> >
> > [EMAIL PROTECTED]
>
>--
>David Corbin
>Mach Turtle Technologies, Inc.
>http://www.machturtle.com
>[EMAIL PROTECTED]
Re: RFC 145 (alternate approach)
I'd suggest also, that (?[) (with no specified brackets) have the
default meaning of the "four standard brackets" :

(?['('=>')','{'=>'}','['=>']','<'=>'>')

Note also the subtle syntax change. We are either dealing with strings
or with patterns. The consensus seems to be against patterns (I can
understand that). Given that, we need to quote the right hand side of
the => operator I think. The quotes on the left side would be optional,
I think.

Richard Proctor wrote:
>
> On Tue 05 Sep, Nathan Wiger wrote:
> > Eric Roode wrote:
> > Now *that* sounds cool, I like it!
> >
> > What if the RFC only suggested the addition of two new constructs, (?[)
> > and (?]), which did nested matches. The rest would be bound by standard
> > regex constructs and your imagination!
> >
> > That is, the ?] simply takes whatever the closest ?[ matched and
> > reverses it, verbatim, including ordering, case, and number of
> > characters. The only trick would be a way to get what "reverses it"
> > means correct.
> >
>
> No ?] should match the closest ?[ it should nest the ?[s bound by any
> brackets in the regex and act accordingly.
>
> Also this does not work as a definition of simple bracket matching as you
> need ( to match ) not ( to match (. A ?[ list should specify for each
> element what the matching element is perhaps
>
> (?[( => ),{ => }, 01 => 10)
>
> sort of hashish in style.
>
> Perhaps the brackets could be defined as a hash allowing (?[%Hash)
>
> Richard
>
> --
>
> [EMAIL PROTECTED]

--
David Corbin
Mach Turtle Technologies, Inc.
http://www.machturtle.com
[EMAIL PROTECTED]
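The hash-style pairing discussed above, with the "four standard brackets" as the default, can be sketched as an ordinary stack-based check driven by an opener => closer hash. This illustrates the intended semantics only, under the assumption that delimiters are literal strings rather than patterns; the function name brackets_balanced is hypothetical and is not part of any RFC.

```perl
use strict;
use warnings;

# Returns 1 if the delimiters in $text nest properly according to %pairs
# (opener => closer), 0 otherwise. With no pairs given, the "four standard
# brackets" are used. Delimiters are literal strings and may be longer than
# one character, as in Richard's '01' => '10' example.
sub brackets_balanced {
    my ($text, %pairs) = @_;
    %pairs = ('(' => ')', '{' => '}', '[' => ']', '<' => '>') unless %pairs;

    # One alternation over all delimiters, longest first.
    my $delim = join '|',
        map  { quotemeta($_) }
        sort { length($b) <=> length($a) } keys(%pairs), values(%pairs);

    my @expected;    # closers we are still waiting for, innermost last
    while ($text =~ /($delim)/g) {
        my $tok = $1;
        if (exists $pairs{$tok}) {
            push @expected, $pairs{$tok};
        }
        else {
            return 0 unless @expected && pop(@expected) eq $tok;
        }
    }
    return @expected ? 0 : 1;
}
```

Passing a hash, as in brackets_balanced($text, %Hash), corresponds to the (?[%Hash) form suggested by Richard.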