Re: Store captures and non-captures in source-string order

2008-10-13 Thread Larry Wall
Or maybe we're not thinking big enough here.  Maybe we're looking at
a generalized tree query language that, as limiting cases, defines the
.splits and .allsplits as (re)linearized query results, where .splits
linearizes the top level nodes, and .allsplits linearizes the leaves,
but many intermediate linearizations are possible.  Don't want to
get stuck into binary thinking here...

Larry


Re: Store captures and non-captures in source-string order

2008-10-13 Thread Larry Wall
On Sun, Oct 12, 2008 at 05:34:49PM +0200, Moritz Lenz wrote:
: Patrick R. Michaud wrote:
:  On Sun, Oct 12, 2008 at 11:44:05AM +0200, Moritz Lenz wrote:
:  When we write regexes, we generally capture stuff in a way that makes
:  the following semantic analysis easier. For example we could have a
:  regex m/ <this>+ <that>? <this>*/ if we're only interested in the match
:  trees of what <this> and <that> matches, not their respective order.
:  [...]
:  But if you want to reuse the match tree for something different (say,
:  instead of doing a semantic analysis we want to do syntax highlighting)
:  it's rather hard to reconstruct the original text, and what part of it
:  was matched by which subrule. 
:  
:  Perhaps aliases...?
:  
:  m/ <this>+ <that>? <andthen=this>* /
:  
:  This is probably not exactly what you're looking for, but
:  that would be what I would look at for this specific example.
: 
: I'm looking more for a general solution for which you don't have to
: manipulate the rule itself, and which should ideally work with as little
: knowledge of the rule as possible.
: 
: Just see through which hoops STD5_dump_match (in the same dir as STD.pm)
: has to jump to get hold of the parse tree in the right order.
: 
: Moritz

Yes, funny thing is I was just thinking about the same thing this
morning after Mitchell Charity noticed that elsifs were missing
from the tree.  It will be relatively trivial to do this with STD,
since it already produces a general mapping from position to hash,
which it uses to cache whitespace matches and line numbers, but could
easily record what matched where.  (See the ._ hash for that.)
In my case, I was wanting to find the set of non-whitespace things
that are parsed but don't end up in the parse tree.  Maybe the :keepall
modifier needs access to something like this as well.
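
A minimal sketch of that "parsed but not in the tree" check, in Python
rather than Perl 6.  It assumes the tree can be reduced to a list of
(start, end) spans; the function name and data shape are hypothetical
and are not STD's actual ._ hash.

```python
import re

def uncovered_text(source, spans):
    """Return (start, end, text) for non-whitespace runs no span covers.

    `spans` is a list of (start, end) pairs standing in for what the
    parse-tree nodes matched (an assumption made for illustration).
    """
    covered = [False] * len(source)
    for start, end in spans:
        for i in range(start, end):
            covered[i] = True

    gaps = []
    for m in re.finditer(r"\S+", source):
        # Walk each non-whitespace run and collect its uncovered stretches.
        i = m.start()
        while i < m.end():
            if covered[i]:
                i += 1
                continue
            j = i
            while j < m.end() and not covered[j]:
                j += 1
            gaps.append((i, j, source[i:j]))
            i = j
    return gaps

source = "if a { } elsif b { }"
# Pretend the tree recorded the first condition and block, and the
# second condition and block, but lost the keyword between them.
spans = [(0, 8), (15, 20)]
print(uncovered_text(source, spans))  # [(9, 14, 'elsif')]
```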

It may also let me remove the kludge whereby ~ remembers the delimiters
on either side.

It could also revolutionize the implementation of split. :)

My big question is how best to make this ordered info available within
a Match, given that we currently use the Positional role for something
else.  An argument could be made that this info is more important than
revealing $0,$1 etc at the top level of the Match, that is, that split
semantics are more natural than comb semantics for @($/).  One data
point is that the STD grammar uses very little $0 and then only as
a named parameter that happens to have a numeric name.  So we could
easily demote $0 etc to meaning $/.numbered[0] or some such.  Of course,
it goes the other way too, and we can reveal the splits via a .split
method or some such.  Plus we can have multiple levels of splitting
semantics, so then *they'd* be fighting over Positional if we made
one of them default.

So I'm thinking @($/) stays the way it is, but .splits might return
the top-level splits for a given rule, where strings are intermixed
with child tree nodes, whereas something like .allsplits might return
all the ordered strings along with mappings to what parsed them.
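
One way to picture the two views, sketched in Python under the
assumption that each tree node knows its rule name, its start and end
offsets, and its children.  The Node class and both functions are
hypothetical stand-ins for the proposed .splits and .allsplits, not an
existing API.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    rule: str
    start: int
    end: int
    children: list = field(default_factory=list)

def splits(source, node):
    """Top-level view: literal strings intermixed with child nodes."""
    out, pos = [], node.start
    for child in sorted(node.children, key=lambda c: c.start):
        if child.start > pos:
            out.append(source[pos:child.start])  # text no child claimed
        out.append(child)
        pos = child.end
    if pos < node.end:
        out.append(source[pos:node.end])
    return out

def allsplits(source, node, path=()):
    """Leaf view: every string in source order, with the rules that parsed it."""
    path = path + (node.rule,)
    out = []
    for item in splits(source, node):
        if isinstance(item, Node):
            out.extend(allsplits(source, item, path))
        else:
            out.append((item, path))
    return out

source = "a=1;"
tree = Node("stmt", 0, 4, [Node("name", 0, 1), Node("value", 2, 3)])

for text, path in allsplits(source, tree):
    print(repr(text), "/".join(path))
```

Joining the strings from the leaf view reproduces the original text,
which is the property a syntax highlighter would rely on.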

If we did that, then there's the question of whether .splits needs to
run the pattern lazily so that we can do a limited /':'/.splits(4)
and such.  That may turn out to be abuse of the lazy system though.
And technically, that regex *isn't* binding the colons to a child
node, so there's a little semantic mismatch there as well, since a
split implemented in terms of .splits would look more like /.*?(':')/.
So maybe .splits is the wrong name.  Suggestions welcome.
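
The lazy, limited case can be sketched in Python with a generator.
This is only an illustration of the idea: lazy_splits is a
hypothetical name, and the limit here simply counts pieces taken from
the front, which may not be the limit semantics .splits would want.

```python
import re
from itertools import islice

def lazy_splits(pattern, text):
    """Yield ('text', s) and ('sep', s) pieces in source order.

    re.finditer searches on demand, so the pattern is only run as far
    as the caller actually consumes pieces.
    """
    pos = 0
    for m in re.finditer(pattern, text):
        yield ("text", text[pos:m.start()])
        yield ("sep", m.group())
        pos = m.end()
    yield ("text", text[pos:])

# Take only the first four pieces; the rest of the string is never scanned.
pieces = list(islice(lazy_splits(":", "a:b:c:d:e"), 4))
print(pieces)
```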

The cool thing about .allsplits is that if you're doing, say, syntax
highlighting on the fly in an editor, it might be relatively easy to
run down the list and determine top-level nodes that limit how much
needs to be reparsed.  Contrariwise, with the fate system of STD it
might even be relatively easy to put the parser back into a state
that was deeply recursive and restart the parse at any point.
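
A rough Python sketch of the first half of that idea: given the
top-level nodes as sorted, non-overlapping (start, end, rule) spans,
find the one an edit falls in, so only that span needs reparsing.  The
function name and span format are assumptions, and restarting the
parser from a saved state is not modeled here.

```python
from bisect import bisect_right

def reparse_range(top_level, edit_pos):
    """Return the top-level span containing edit_pos, or None.

    `top_level` must be sorted by start offset and non-overlapping.
    """
    starts = [start for start, _, _ in top_level]
    i = bisect_right(starts, edit_pos) - 1
    if i >= 0 and top_level[i][0] <= edit_pos < top_level[i][1]:
        return top_level[i]
    return None

nodes = [(0, 10, "statement"), (10, 25, "statement"), (25, 40, "statement")]
print(reparse_range(nodes, 12))
```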

'Course, relatively easy is one o' them relative concepts...  :)

Larry


Re: Store captures and non-captures in source-string order

2008-10-13 Thread Aristotle Pagaltzis
* Larry Wall [2008-10-13 19:00]:
 Maybe we're looking at a generalized tree query language

That’s an intriguing observation. Another case for having some
XPath-ish facility in the language?

Regards,
-- 
Aristotle Pagaltzis // http://plasmasturm.org/


Re: Store captures and non-captures in source-string order

2008-10-12 Thread Patrick R. Michaud
On Sun, Oct 12, 2008 at 11:44:05AM +0200, Moritz Lenz wrote:
 When we write regexes, we generally capture stuff in a way that makes
 the following semantic analysis easier. For example we could have a
 regex m/ <this>+ <that>? <this>*/ if we're only interested in the match
 trees of what <this> and <that> matches, not their respective order.
 [...]
 But if you want to reuse the match tree for something different (say,
 instead of doing a semantic analysis we want to do syntax highlighting)
 it's rather hard to reconstruct the original text, and what part of it
 was matched by which subrule. 

Perhaps aliases...?

m/ <this>+ <that>? <andthen=this>* /

This is probably not exactly what you're looking for, but
that would be what I would look at for this specific example.

Pm