subject:"Re\: Parsing indent\-sensitive languages"

Re: Parsing indent-sensitive languages

2005-09-09 Thread Peri Hankey


Many apologies for triple posting - short-circuit between ears.

Peri Hankey

--
http://languagemachine.sourceforge.net - The language machine

Re: Parsing indent-sensitive languages

2005-09-09 Thread Peri Hankey


Dave Whipp wrote:
If I want to parse a language that is sensitive to whitespace 
indentation (e.g. Python, Haskell), how do I do it using P6 rules/grammars?


The way I'd usually handle it is to have a lexer that examines leading 
whitespace and converts it into "indent" and "unindent" tokens. The 
grammer can then use these tokens in the same way that it would any 
other block-delimiter.


This requires a stateful lexer, because to work out the number of 
"unindent" tokens on a line, it needs to know what the indentation 
positions are. How would I write a P6 rule that defines  and 
 tokens? Alternatively (if a different approach is needed) how 
would I use P6 to parse such a language?


In this context, I thought readers of this list might be interested in 
the following short extract from mediawiki.lmn (rules in the 
metalanguage of my language machine) which translate a subset of the 
mediawiki markup notation to HTML. The extract deals with bulleted and 
numbered lists, where consecutive prefix characters '*' and '#' are used 
to indicate the level of nesting of eacn entry:


- start of extract from mediawiki.lmn 
== bulleted and numbered lists ==
Unordered and ordered lists are a bit tricky - essentially they are like 
indented blocks in Python, but a little more complex because of the way 
ordered and unordered lists can be combined with each other. The 
solution is that at each level, the prefix pattern of '#' and '*' 
characters is known, and the level continues while that pattern is 
recognised. This can be done by matching the value of a variable which 
holds the pattern for the current level.


'*'  <- unit - ulist :'*';
'#'  <- unit - olist :'#';
ulist :A item :X repeat more item :Y <- unit ul :{X each Y} eom;
olist :A item :X repeat more item :Y <- unit ol :{X each Y} eom;

'*'  <- item - ulist :{A'*'};
'#'  <- item - olist :{A'#'};
ulist :A item :X repeat more item :Y <- item :{ ul :{X each Y}};
olist :A item :X repeat more item :Y <- item :{ ol :{X each Y}};
- wikitext :X<- item :{ li :X };

The following rule permits a level to continue as long as the input 
matches the current prefix. We recurse for each level before getting 
here, so we will always try to match the innermost levels first - they 
have the longest prefix strings, and so there is no danger of a 
premature match


- A  <- more ;
- end of extract from mediawiki.lmn --

The complete ruleset is visible at:
http://languagemachine.sourceforge.net/website.html   - summary
http://languagemachine.sourceforge.net/mediawiki.html - markup
http://languagemachine.sourceforge.net/sitehtml.html  - wrappings

I have fairly recently published the the language machine under Gnu GPL 
at sourceforge. It is implemented as a shared library written in the D 
language using the gdc frontend to gnu gcc. There are several flavours 
of the lmn metalanguage compiler: these are all written in lmn and share 
a common frontend. These and a number of examples are on the website as 
pages that have been generated directly from the source text.


My intention in creating the language machine has been to create 
something that can be combined with other free software languages and 
toolchains. I have recently asked the grants-secretary of the Perl 
Foundation for feedback on a proposal for implementing a language 
machine extension module for perl.


The language machine is not much like any other language toolkit that I 
know of. There is a page which tries to explain how it relates ot the 
received wisdom about language and language implementations at:


http://languagemachine.sourceforge.net/grammar.html

The language machine can produce a good deal of diagnostic information, 
including a very useful diagram which shows exactly what happens when 
unrestricted grammatical substitution rules are applied to an input stream:


http://languagemachine.sourceforge.net/lm-diagram.html

I would be interested to hear what you think.

Regards
Peri Hankey

--
http://languagemachine.sourceforge.net - The language machine

Re: Parsing indent-sensitive languages

2005-09-09 Thread Peri Hankey


Dave Whipp wrote:

If I want to parse a language that is sensitive to whitespace 
indentation (e.g. Python, Haskell), how do I do it using P6 
rules/grammars?


The way I'd usually handle it is to have a lexer that examines leading 
whitespace and converts it into "indent" and "unindent" tokens. The 
grammer can then use these tokens in the same way that it would any 
other block-delimiter.


This requires a stateful lexer, because to work out the number of 
"unindent" tokens on a line, it needs to know what the indentation 
positions are. How would I write a P6 rule that defines  and 
 tokens? Alternatively (if a different approach is needed) 
how would I use P6 to parse such a language?



In this context, I thought readers of this list might be interested in 
the following extract from mediawiki.lmn, a ruleset for generating html 
pages from a subset of mediawiki markup. These rules are written in lmn, 
the metalanguage of the language machine, and the extract deals with 
unordered and ordered lists, where entries are prefixed by '*' and '#' 
characters, and repeated prefix characters indicate nesting.


Not that the source text of lmn rules is written using a subset of the 
mediawiki markup, with preformatted text (lines that start with at least 
one space) treated as actual source with no markup and everything else 
treated as annotation:


- start of extract from mediawiki.lmn --
== bulleted and numbered lists ==
Unordered and ordered lists are a bit tricky - essentially they are like 
indented blocks in Python, but a little more complex because of the way 
ordered and unordered lists can be combined with each other. The 
solution is that at each level, the prefix pattern of '#' and '*' 
characters is known, and the level continues while that pattern is 
recognised. This can be done by matching the value of a variable which 
holds the pattern for the current level.


   '*'  <- unit - ulist :'*';
   '#'  <- unit - olist :'#';
   ulist :A item :X repeat more item :Y <- unit ul :{X each Y} eom;
   olist :A item :X repeat more item :Y <- unit ol :{X each Y} eom;

   '*'  <- item - ulist :{A'*'};
   '#'  <- item - olist :{A'#'};
   ulist :A item :X repeat more item :Y <- item :{ ul :{X each Y}};
   olist :A item :X repeat more item :Y <- item :{ ol :{X each Y}};
   - wikitext :X<- item :{ li :X };

The following rule permits a level to continue as long as the input 
matches the current prefix. We recurse for each level before getting 
here, so we will always try to match the innermost levels first - they 
have the longest prefix strings, and so there is no danger of a 
premature match


   - A  <- more ;
-  end of extract from mediawiki.lmn  --

The complete ruleset can be seen at:
http://languagemachine.sourceforge.net/website.html- summary
http://languagemachine.sourceforge.net/mediawiki.html  - markup
http://languagemachine.sourceforge.net/sitehtml.html   - wrappings

I have fairly recently published the language machine under Gnu GPL at 
sourceforge. It consists of a minimal main program, a shared library 
written in D using the gdc frontend to gnu gcc, and several flavours of 
an lmn metalanguage compiler - these are all written in lmn and share a 
common frontend.


The metalanguage compiler sources are on the website (with many other 
examples) as web pages that have been generated directly from lmn source 
text by applying  the markup-to-html translation rules.


The language machine in previous incarnations has a long history, but it 
is not much like any other language toolkit that I know of. This is a 
page that relates it to received wisdom about language and language 
implementations:


http://languagemachine.sourceforge.net/grammar.html

There is an extremely useful diagram which shows what happens when 
unrestricted grammatical substitution rules are applied to an input 
stream - this is explained here in relation to a couple of trivially 
simple examples:


http://languagemachine.sourceforge.net/lm-diagram.html

My intention in creating this implementation has been to make something 
that can be combined with other free languages and toolchains, and I 
have recently asked the grants-secretary at the Perl Foundation for 
feedback on a draft proposal to create a language machine extension for 
perl.


I would be very interested to hear what you think.

Regards
Peri Hankey

--
http://languagemachine.sourceforge.net - The language machine

Re: Parsing indent-sensitive languages

2005-09-09 Thread Peri Hankey


Dave Whipp wrote:
If I want to parse a language that is sensitive to whitespace 
indentation (e.g. Python, Haskell), how do I do it using P6 rules/grammars?


The way I'd usually handle it is to have a lexer that examines leading 
whitespace and converts it into "indent" and "unindent" tokens. The 
grammer can then use these tokens in the same way that it would any 
other block-delimiter.


This requires a stateful lexer, because to work out the number of 
"unindent" tokens on a line, it needs to know what the indentation 
positions are. How would I write a P6 rule that defines  and 
 tokens? Alternatively (if a different approach is needed) how 
would I use P6 to parse such a language?


In this context, I thought readers of this list might be interested in 
the following extract from mediawiki.lmn, a ruleset for generating html 
pages from a subset of mediawiki markup. These rules are written in lmn, 
the metalanguage of the language machine, and the extract deals with 
unordered and ordered lists, where entries are prefixed by '*' and '#' 
characters, and repeated prefix characters indicate nesting.


NB the source text of lmn rules is written using a subset of the 
mediawiki markup, with preformatted text (lines that start with at least 
one space) treated as actual source with no markup and everything else 
treated as annotation:


- start of extract from mediawiki.lmn --
== bulleted and numbered lists ==
Unordered and ordered lists are a bit tricky - essentially they are like 
indented blocks in Python, but a little more complex because of the way 
ordered and unordered lists can be combined with each other. The 
solution is that at each level, the prefix pattern of '#' and '*' 
characters is known, and the level continues while that pattern is 
recognised. This can be done by matching the value of a variable which 
holds the pattern for the current level.


'*'  <- unit - ulist :'*';
'#'  <- unit - olist :'#';
ulist :A item :X repeat more item :Y <- unit ul :{X each Y} eom;
olist :A item :X repeat more item :Y <- unit ol :{X each Y} eom;

'*'  <- item - ulist :{A'*'};
'#'  <- item - olist :{A'#'};
ulist :A item :X repeat more item :Y <- item :{ ul :{X each Y}};
olist :A item :X repeat more item :Y <- item :{ ol :{X each Y}};
- wikitext :X<- item :{ li :X };

The following rule permits a level to continue as long as the input 
matches the current prefix. We recurse for each level before getting 
here, so we will always try to match the innermost levels first - they 
have the longest prefix strings, and so there is no danger of a 
premature match


- A  <- more ;
-  end of extract from mediawiki.lmn  --

The complete ruleset can be seen at:
http://languagemachine.sourceforge.net/website.html- summary
http://languagemachine.sourceforge.net/mediawiki.html  - markup
http://languagemachine.sourceforge.net/sitehtml.html   - wrappings

I have fairly recently published the language machine under Gnu GPL at 
sourceforge. It consists of a minimal main program, a shared library 
written in D using the gdc frontend to gnu gcc, and several flavours of 
an lmn metalanguage compiler - these are all written in lmn and share a 
common frontend.


The metalanguage compiler sources are on the website (with many other 
examples) as web pages that have been generated directly from lmn source 
text by applying  the markup-to-html translation rules.


The language machine in previous incarnations has a long history, but it 
is not much like any other language toolkit that I know of. This is a 
page that relates it to received wisdom about language and language 
implementations:


http://languagemachine.sourceforge.net/grammar.html

There is an extremely useful diagram which shows what happens when 
unrestricted grammatical substitution rules are applied to an input 
stream - this is explained here in relation to a couple of trivially 
simple examples:


http://languagemachine.sourceforge.net/lm-diagram.html

My intention in creating this implementation has been to make something 
that can be combined with other free languages and toolchains, and I 
have recently asked the grants-secretary at the Perl Foundation for 
feedback on a draft proposal to create a language machine extension for 
perl.


I would be very interested to hear what you think.

Regards
Peri Hankey

--
http://languagemachine.sourceforge.net - The language machine

Re: Parsing indent-sensitive languages

2005-09-08 Thread Dave Whipp


Damian Conway wrote:


Alternatively, you could define separate rules for the three cases:

{
state @indents = 0;

rule indent {
^^ $:=(\h*)
{ $ = expand_tabs($).chars }
<( $ > @indents[-1] )>
{ let @indents = (@indents, $) }
}

rule outdent {
^^ $:=(\h*)
{ $ = expand_tabs($).chars }
<( $ < @indents[-1] )>
{ pop @indents while @indents && $ < @indents[-1];
  let @indents = (@indents, $);
}
}

rule samedent { ... }
}


I have a couple of questions about this:

1.  It's quite possible that a code-block in a parser could call a 
function that reads a different file (e.g. for an "include " 
statement). How does the state, @indents, get associated with a 
particular match? (Sure, I could do an explicit save/restore; but things 
might get harder if I was using coroutines to get concurrent matches to 
implement, say, a smart-diff script)


2. How does the outdent rule work in the case where a line does 2 
outdents? It looks to me as if I'd only get one match of : the 
/\h*/ match will advance the match pos, so /^^/ won't match for the 
second  on the same line, which would cause problems if I'm 
trying to match up nested blocks.



Dave.

Re: Parsing indent-sensitive languages

2005-09-08 Thread Collin Winter

On 9/8/05, Larry Wall <[EMAIL PROTECTED]> wrote:
> Okay, how do you tell the difference between
> 
> if foo1
> bar1
> if foo2
> bar2
> if foo3
> bar3
> else
> baz2
> 
> and
> 
> if foo1
> bar1
> if foo2
> bar2
> if foo3
> bar3
>  else
> baz2
> 
> unless you remember the exact columns that "if foo1" and "if foo2"
> were at?

I had misunderstood what you were intending to do with the stack.
Unfortunately it dawned on me a little too late to unclick "Send" : )

Collin

Re: Parsing indent-sensitive languages

2005-09-08 Thread Larry Wall

On Thu, Sep 08, 2005 at 07:57:43PM -0400, Collin Winter wrote:
: On 9/8/05, Larry Wall <[EMAIL PROTECTED]> wrote:
: > It seems to me you need a stack of levels so you know how many
: > indentation levels to pop off.  Otherwise you can't parse this:
: > 
: > if foo1
: > bar1
: > if foo2
: > bar2
: > if foo3
: > bar3
: > else
: > baz2
: 
: Sure you can. Since each item on the stack would just mean "add
: another indentation level", you could use the same pop-triggering
: logic to decrement a simple counter.

Okay, how do you tell the difference between

if foo1
bar1
if foo2
bar2
if foo3
bar3
else
baz2

and

if foo1
bar1
if foo2
bar2
if foo3
bar3
 else
baz2

unless you remember the exact columns that "if foo1" and "if foo2"
were at?  We're counting spaces here, not tabs.  Sure, you could
throw in extra indent/outdent tokens for every column, but that's
just silly--you'd have to do massive tree rewriting when you find an
unattached "else", and you'd get zillions of useless "blocks".

(And don't even begin to think about allowing Unicode space characters.
It's bad enough when people remap their tabs to something other than
multiples of 8 and don't tell you...)

Larry

Re: Parsing indent-sensitive languages

2005-09-08 Thread Collin Winter

On 9/8/05, Larry Wall <[EMAIL PROTECTED]> wrote:
> It seems to me you need a stack of levels so you know how many
> indentation levels to pop off.  Otherwise you can't parse this:
> 
> if foo1
> bar1
> if foo2
> bar2
> if foo3
> bar3
> else
> baz2

Sure you can. Since each item on the stack would just mean "add
another indentation level", you could use the same pop-triggering
logic to decrement a simple counter.

Collin

Re: Parsing indent-sensitive languages

2005-09-08 Thread Damian Conway


To solve Dave's particular problem, you don't need any new features. Just:

rule indentation {
^^ $:=(\h*)
  { state @indents = 0;
my $new_indent = expand_tabs($).chars;
let @indents = @indents;
pop @indents while @indents && $new_indent <= @indents[-1];
push @indents, $new_indent;
$/ = [EMAIL PROTECTED];
  }
}

where every line has an  and the subsequent 
$ value tells you how deep it is.


Alternatively, you could define separate rules for the three cases:

{
state @indents = 0;

rule indent {
^^ $:=(\h*)
{ $ = expand_tabs($).chars }
<( $ > @indents[-1] )>
{ let @indents = (@indents, $) }
}

rule outdent {
^^ $:=(\h*)
{ $ = expand_tabs($).chars }
<( $ < @indents[-1] )>
{ pop @indents while @indents && $ < @indents[-1];
  let @indents = (@indents, $);
}
}

rule samedent {
^^ $:=(\h*)
{ $ = expand_tabs($).chars }
<( $ == @indents[-1] )>
}
}


Damian

Re: Parsing indent-sensitive languages

2005-09-08 Thread chromatic

On Thu, 2005-09-08 at 14:59 -0700, Greg Woodhouse wrote:

> I agree that simply using terms like this means indentation grammars
> are problematic -- or does it? One thing that bothers me is that
> *people* don't seem to have a great deal of difficulty with them. Why
> not?

People can parse multi-dimensionally.  Computers cannot... yet.

-- c

Re: Parsing indent-sensitive languages

2005-09-08 Thread Greg Woodhouse

Come to think of it...I had in mind a sequence of "skip" statements,
that would back out of a level one at a time, until you finally reached
the desired level. But, I think maybe these "skip" statements
essentially play the role of what you called "positive unindent tokens"
(I like that term).

I agree that simply using terms like this means indentation grammars
are problematic -- or does it? One thing that bothers me is that
*people* don't seem to have a great deal of difficulty with them. Why
not?




===
Gregory Woodhouse  <[EMAIL PROTECTED]>



"Without the requirement of mathematical aesthetics a great many discoveries 
would not have been made."

-- Albert Einstein

Re: Parsing indent-sensitive languages

2005-09-08 Thread Greg Woodhouse

What I had in mind is really no different from the stateful lexer
previously proposed. Unless I'm mistaken, an abstract model might be a
language over {0, 1, 2} where each 1 or 2 must be prececed by a run of
1 or more 0's, but each run differ in length from the preceding one by
0, 1 or -1. But that's only a local constraint. You also also want to
eventually get back to a run of 1 as the string ends (You don't want
the program to end in the middle of a nested block.) So, maybe a single
register would work locally (so long as transitions can be conditioned
on its value), but you still need a stack for global correctness. 

--- Larry Wall <[EMAIL PROTECTED]> wrote:

> On Thu, Sep 08, 2005 at 02:16:33PM -0700, Greg Woodhouse wrote:
> : In the case of the
> : "indentation grammar", then the (one) stack in a push-down
> automaton is
> : basically used up keeping track of the indentation level. But you
> don't
> : need a whole stack to keep track of indntation level, just a
> register
> : that can be used to track the current level.
> 
> It seems to me you need a stack of levels so you know how many
> indentation levels to pop off.  Otherwise you can't parse this:
> 
>   if foo1
>   bar1
>   if foo2
>   bar2
>   if foo3
>   bar3
>   else
>   baz2
>  
> Larry
> 

===
Gregory Woodhouse  <[EMAIL PROTECTED]>

"Without the requirement of mathematical aesthetics a great many discoveries 
would not have been made."

-- Albert Einstein

Re: Parsing indent-sensitive languages

2005-09-08 Thread Larry Wall

On Thu, Sep 08, 2005 at 02:16:33PM -0700, Greg Woodhouse wrote:
: In the case of the
: "indentation grammar", then the (one) stack in a push-down automaton is
: basically used up keeping track of the indentation level. But you don't
: need a whole stack to keep track of indntation level, just a register
: that can be used to track the current level.

It seems to me you need a stack of levels so you know how many
indentation levels to pop off.  Otherwise you can't parse this:

if foo1
bar1
if foo2
bar2
if foo3
bar3
else
baz2
 
Larry

Re: Parsing indent-sensitive languages

2005-09-08 Thread Greg Woodhouse

That's something I've been thinking about, too. There are a lot of
"interesting" languages that cannot be described by context free
grammars (such as {empty, 012, 001122, 000111222, ...} but very simple
enhancements do make them easy to recognize. In the case of the
"indentation grammar", then the (one) stack in a push-down automaton is
basically used up keeping track of the indentation level. But you don't
need a whole stack to keep track of indntation level, just a register
that can be used to track the current level.

BTW, I'm new to this list. I haven't done that much with Perl recently,
but of late, I've become a lot more interested in the language. I
recently picked ot the Perl 6/Parrot book from O'Reilly but had really
meant to finish reading the book before jumping in. It's just that this
topic is too interesting!

--- Larry Wall <[EMAIL PROTECTED]> wrote:

> On Thu, Sep 08, 2005 at 08:37:21AM -0700, Dave Whipp wrote:
> : If I want to parse a language that is sensitive to whitespace 
> : indentation (e.g. Python, Haskell), how do I do it using P6
> rules/grammars?
> : 
> : The way I'd usually handle it is to have a lexer that examines
> leading 
> : whitespace and converts it into "indent" and "unindent" tokens. The
> 
> : grammer can then use these tokens in the same way that it would any
> 
> : other block-delimiter.
> 
> This is the multi-pass approach, which even in Perl 6 is still
> certainly one way to do it.  Or actually, two ways, one of which is
> to use source filters to mung the input text, and the other way of
> which is to make one lexer pass to transform into a list or tree of
> tokens, and then do a list/tree transformation on that.  Given that
> tree transformation is something that a number of us are currently
> thinking about for various other reasons, I suspect that can be made
> to work pretty well.  But we will have to describe how rule matching
> can be extended to lists and trees, and what happens when you
> intermix
> text elements with non-textual objects.  But essentially, just think
> what it would take to match against the tree of match objects
> returned
> by a prior match instead of against a string.
> 
> : This requires a stateful lexer, because to work out the number of 
> : "unindent" tokens on a line, it needs to know what the indentation 
> : positions are. How would I write a P6 rule that defines 
> and 
> :  tokens? Alternatively (if a different approach is
> needed) how 
> : would I use P6 to parse such a language?
> 
> I can think of two other approaches as well.  One would be to allow
> "pushback" on the queued input so that when we hit a line transition,
> we can immediately analyze the leading whitespace and replace it
> conceptually with the appropriate number of indent/unindent objects.
> It's not yet clear whether that is a good approach.  It might be
> rather inefficient to splice faked up objects into the middle of the
> input stream.  On the other hand, we don't actually have to splice
> the input, only pretend that we did.  And certainly, the Perl 5 lexer
> makes heavy use of this kind of we'll-fake-the-next-N-tokens queueing
> (though I actually botched the Perl 5 implementation of it by making
> it a stack instead of a queue).
> 
> My final idea is that you can treat it as a fancy kind of lookbehind
> assertion, provided you have some landmark that will stop the
> normal analysis from running over the newline boundary.  With the
> other approaches, you have a precomputed positive unindent token
> to function as a "stopper", but with this approach, the only thing
> you can depend on is the newline itself.  So when you first hit a
> newline, you don't progress beyond it, but instead look ahead at the
> indentation of the following line and queue up the right number of
> indent/unindent objects in your own data structure (as well as
> pushing
> or popping your stack of current indents appropriately so that the
> new indent is on top).  Then as long as there are queued up objects,
> the newline doesn't report itself as a newline, but as the next
> object,
> until you run out, and then the newline allows itself to be skipped,
> along with the subsequent whitespace.  (The difference between this
> and the previous approach is that you're not relying on the rules
> engine to keep track of the queueing for you.)
> 
> The very fact that I had to use a phrase like "positive unindent
> token"
> says a lot about why indentation as syntax is problematic in general.
> 
> Larry
> 



===
Gregory Woodhouse  <[EMAIL PROTECTED]>



"Without the requirement of mathematical aesthetics a great many discoveries 
would not have been made."

-- Albert Einstein

Re: Parsing indent-sensitive languages

2005-09-08 Thread Larry Wall

On Thu, Sep 08, 2005 at 08:37:21AM -0700, Dave Whipp wrote:
: If I want to parse a language that is sensitive to whitespace 
: indentation (e.g. Python, Haskell), how do I do it using P6 rules/grammars?
: 
: The way I'd usually handle it is to have a lexer that examines leading 
: whitespace and converts it into "indent" and "unindent" tokens. The 
: grammer can then use these tokens in the same way that it would any 
: other block-delimiter.

This is the multi-pass approach, which even in Perl 6 is still
certainly one way to do it.  Or actually, two ways, one of which is
to use source filters to mung the input text, and the other way of
which is to make one lexer pass to transform into a list or tree of
tokens, and then do a list/tree transformation on that.  Given that
tree transformation is something that a number of us are currently
thinking about for various other reasons, I suspect that can be made
to work pretty well.  But we will have to describe how rule matching
can be extended to lists and trees, and what happens when you intermix
text elements with non-textual objects.  But essentially, just think
what it would take to match against the tree of match objects returned
by a prior match instead of against a string.

: This requires a stateful lexer, because to work out the number of 
: "unindent" tokens on a line, it needs to know what the indentation 
: positions are. How would I write a P6 rule that defines  and 
:  tokens? Alternatively (if a different approach is needed) how 
: would I use P6 to parse such a language?

I can think of two other approaches as well.  One would be to allow
"pushback" on the queued input so that when we hit a line transition,
we can immediately analyze the leading whitespace and replace it
conceptually with the appropriate number of indent/unindent objects.
It's not yet clear whether that is a good approach.  It might be
rather inefficient to splice faked up objects into the middle of the
input stream.  On the other hand, we don't actually have to splice
the input, only pretend that we did.  And certainly, the Perl 5 lexer
makes heavy use of this kind of we'll-fake-the-next-N-tokens queueing
(though I actually botched the Perl 5 implementation of it by making
it a stack instead of a queue).

My final idea is that you can treat it as a fancy kind of lookbehind
assertion, provided you have some landmark that will stop the
normal analysis from running over the newline boundary.  With the
other approaches, you have a precomputed positive unindent token
to function as a "stopper", but with this approach, the only thing
you can depend on is the newline itself.  So when you first hit a
newline, you don't progress beyond it, but instead look ahead at the
indentation of the following line and queue up the right number of
indent/unindent objects in your own data structure (as well as pushing
or popping your stack of current indents appropriately so that the
new indent is on top).  Then as long as there are queued up objects,
the newline doesn't report itself as a newline, but as the next object,
until you run out, and then the newline allows itself to be skipped,
along with the subsequent whitespace.  (The difference between this
and the previous approach is that you're not relying on the rules
engine to keep track of the queueing for you.)

The very fact that I had to use a phrase like "positive unindent token"
says a lot about why indentation as syntax is problematic in general.

Larry

Re: Parsing indent-sensitive languages

Re: Parsing indent-sensitive languages

Re: Parsing indent-sensitive languages

Re: Parsing indent-sensitive languages

Re: Parsing indent-sensitive languages

Re: Parsing indent-sensitive languages

Re: Parsing indent-sensitive languages

Re: Parsing indent-sensitive languages

Re: Parsing indent-sensitive languages

Re: Parsing indent-sensitive languages

Re: Parsing indent-sensitive languages

Re: Parsing indent-sensitive languages

Re: Parsing indent-sensitive languages

Re: Parsing indent-sensitive languages

Re: Parsing indent-sensitive languages

15 matches

Site Navigation

Mail list logo

Footer information