Lexing requires execution (was Re: Will _anything_ be able to truly parse and understand perl?)

2004-11-26 Thread Randal L. Schwartz
>>>>> "Luke" == Luke Palmer [EMAIL PROTECTED] writes:

Luke> But you don't really need to parse to syntax highlight, either.  You
Luke> just need to tokenize.

Unfortunately, to tokenize, you also have to know the state of the parse.
As long as / is both divide and begin regex, you're toasted.

Please see my long post on parsing Perl at PerlMonks,
http://www.perlmonks.org/index.pl?node_id=44722, for examples of
*why* you need to notice whether you have a divide or a regex match.

Perl is fundamentally resistant to lexing.  As noted at the beginning of
this thread, one of the RFCs suggested the possibility of making Perl
lexable, but apparently the designers said "no, we think the / duality
is worth keeping".  And that seals the fate of Perl6, just like all
Perl before it.

To properly lex a Perl program (Perl6 included), you *must* execute
BEGIN blocks.  That's the end of that tune.  Anything else is just an
approximation.

-- 
Randal L. Schwartz - Stonehenge Consulting Services, Inc. - +1 503 777 0095
[EMAIL PROTECTED] URL:http://www.stonehenge.com/merlyn/
Perl/Unix/security consulting, Technical writing, Comedy, etc. etc.
See PerlTraining.Stonehenge.com for onsite and open-enrollment Perl training!


Re: Lexing requires execution (was Re: Will _anything_ be able to truly parse and understand perl?)

2004-11-26 Thread Matthew Walton
Randal L. Schwartz wrote:
> >>>>> "Luke" == Luke Palmer [EMAIL PROTECTED] writes:
>
> Luke> But you don't really need to parse to syntax highlight, either.  You
> Luke> just need to tokenize.
>
> Unfortunately, to tokenize, you also have to know the state of the parse.
> As long as / is both divide and begin regex, you're toasted.

So you're saying that in Perl 6 it will be entirely impossible to 
determine if / appears as the division operator or as the beginning of a 
regex from a purely syntactic examination of the source code?

I'm finding that very, very hard to believe. Regexps aren't valid where 
/-the-operator is, after all.

Please correct me if I'm wrong, but I've got the impression that Perl 6 
is tokenisable without requiring BEGIN blocks to be run - provided no 
grammars which the tokeniser doesn't already know about are used, of 
course, that one will never be avoidable.



Re: Lexing requires execution (was Re: Will _anything_ be able to truly parse and understand perl?)

2004-11-26 Thread Randal L. Schwartz
>>>>> "Matthew" == Matthew Walton [EMAIL PROTECTED] writes:

Matthew> So you're saying that in Perl 6 it will be entirely impossible to
Matthew> determine if / appears as the division operator or as the beginning of
Matthew> a regex from a purely syntactic examination of the source code?

Yes.

Matthew> I'm finding that very, very hard to believe. Regexps aren't valid
Matthew> where /-the-operator is, after all.

And that's precisely why Perl can work as it does.  If an operator is
expected, / is divide.  If a term is expected, / is the beginning of a
regex.  This has been true since Perl1 (maybe 0).  There are a few
other characters that also work similarly, but / is the most frequent
and most troublesome.  And it got worse for Perl5, because of
user-defined prototypes, which as far as I can tell, are still present
in Perl6.
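
A minimal sketch of that rule, written as a toy tokenizer in Python rather than anything resembling perl's real lexer. It tracks only the kind of the previous token and uses it to decide whether / is a divide or opens a regex. The token names, the TOKEN pattern and the ENDS_TERM set are assumptions made for this illustration, and escapes inside a regex are ignored.

```python
import re

# Toy token patterns; real Perl has far more (this is an illustration only).
TOKEN = re.compile(r"""
    (?P<NUMBER>\d+)
  | (?P<IDENT>\$?[A-Za-z_]\w*)
  | (?P<LPAREN>\()
  | (?P<RPAREN>\))
  | (?P<OP>=~|[-+*=;,])
""", re.VERBOSE)

# After one of these token kinds an operator is expected, so "/" is a divide.
ENDS_TERM = {"NUMBER", "IDENT", "RPAREN", "REGEX"}

def tokenize(src):
    """Split src into (kind, text) pairs, classifying "/" by parse state."""
    tokens, pos, prev = [], 0, None
    while pos < len(src):
        if src[pos].isspace():
            pos += 1
            continue
        if src[pos] == "/":
            if prev in ENDS_TERM:
                kind, text = "DIVIDE", "/"
            else:
                # A term is expected: scan to the closing slash (no escapes).
                end = src.index("/", pos + 1)
                kind, text = "REGEX", src[pos:end + 1]
        else:
            m = TOKEN.match(src, pos)
            if m is None:
                raise ValueError(f"unexpected character {src[pos]!r}")
            kind, text = m.lastgroup, m.group()
        tokens.append((kind, text))
        pos += len(text)
        prev = kind
    return tokens
```

With this rule, `$x / 2` yields a DIVIDE token while `$x =~ /2/` yields a REGEX token. Note that the sketch treats every bareword as a finished term, which is exactly the assumption that user-defined subs without parens break.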

Matthew> Please correct me if I'm wrong, but I've got the impression that Perl
Matthew> 6 is tokenisable without requiring BEGIN blocks to be run - provided
Matthew> no grammars which the tokeniser doesn't already know about are used,
Matthew> of course, that one will never be avoidable.

Your impression is wrong.  In the presence of user-defined prototypes,
you *must* execute the code that might alter a prototype in order to
determine whether / is a divide (and therefore standalone token) or
the beginning of a regex (and therefore must locate the end of the
regex to properly be a token).

Please see the referenced perlmonks article.

All the handwaving in the world won't fix this.  As long as we have
dual-natured characters like /, and user-defined prototypes, Perl
cannot be lexed without also parsing, and therefore without also
running BEGIN blocks.



Re: Lexing requires execution (was Re: Will _anything_ be able to truly parse and understand perl?)

2004-11-26 Thread Matthew Walton
Randal L. Schwartz wrote:
> >>>>> "Matthew" == Matthew Walton [EMAIL PROTECTED] writes:
>
> Matthew> So you're saying that in Perl 6 it will be entirely impossible to
> Matthew> determine if / appears as the division operator or as the beginning of
> Matthew> a regex from a purely syntactic examination of the source code?
>
> Yes.
>
> Matthew> I'm finding that very, very hard to believe. Regexps aren't valid
> Matthew> where /-the-operator is, after all.
>
> And that's precisely why Perl can work as it does.  If an operator is
> expected, / is divide.  If a term is expected, / is the beginning of a
> regex.  This has been true since Perl1 (maybe 0).  There are a few
> other characters that also work similarly, but / is the most frequent
> and most troublesome.  And it got worse for Perl5, because of
> user-defined prototypes, which as far as I can tell, are still present
> in Perl6.

Perl 6 has formal parameters for subs, methods etc. I don't see any 
mention of Perl 5-style prototypes in S6, and I honestly can't see how 
they could possibly fit with formal parameters. Hopefully Larry or 
someone can clarify whether they still exist or not.

If they don't still exist, this eases the problem somewhat, but not
entirely, as I understand it. Being able to call subs and methods without
parentheses around the argument lists causes problems; a quick scan of
the updated Synopses failed to reveal the rules for that in Perl 6.

> Your impression is wrong.  In the presence of user-defined prototypes,
> you *must* execute the code that might alter a prototype in order to
> determine whether / is a divide (and therefore standalone token) or
> the beginning of a regex (and therefore must locate the end of the
> regex to properly be a token).

Since Perl 5 style prototypes don't appear to exist anymore, this may be
easier. I don't believe that the addition of the // operator compounds
the problem any more, because hopefully by that point it is possible to
determine that you've seen an operator.

The Perlmonks article throws up a lot of very nasty cases. Not knowing
the entire current language definition by heart, I can't say this with
absolute certainty, but I retain the belief that Perl 6 is at least
*easier* to deal with than Perl 5.

It is also possible that telling the difference between /-as-divide and 
/-as-regex becomes much easier if lookahead is employed in the 
tokeniser. Unfortunately, that makes the tokeniser much more 
complicated, and it's just a vague and random idea.




Re: Lexing requires execution (was Re: Will _anything_ be able to truly parse and understand perl?)

2004-11-26 Thread Randal L. Schwartz
>>>>> "Matthew" == Matthew Walton [EMAIL PROTECTED] writes:

Matthew> Perl 6 has formal parameters for subs, methods etc. I don't see any
Matthew> mention of Perl 5-style prototypes in S6, and I honestly can't see how
Matthew> they could possibly fit with formal parameters. Hopefully Larry or
Matthew> someone can clarify whether they still exist or not.

As long as you can have a user-defined null-prototyped subroutine (one
that doesn't need parens following), you have the problem.  See the
sin/time examples in the monk article, and then consider user-defined
functions that have no args (like time) and those that do (like sin).

Matthew> The Perlmonks article throws up a lot of very nasty cases. Not knowing
Matthew> the entire current language definition by heart, I can't say this with
Matthew> absolute certainty, but I retain the belief that Perl 6 is at least
Matthew> *easier* to deal with than Perl 5.

I believe you have a false belief.  I don't know anything in the new
prototypes-which-became-full-formal-arguments that made it any
*easier* to recognize the ending of a subroutine argument list without
knowing its precise definition.

In Perl6:

sub no_args () { ... }
sub list_args ([EMAIL PROTECTED]) { ... }

no_args / # this is a divide
list_args / # this is the start of a regex

See, it's still there. :)
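
The same point as a toy model in Python (not Perl 6). It is only a sketch: the helper `slash_meaning` and the `takes_args` table are hypothetical names invented for this illustration, and the table stands in for whatever the declarations, once processed, tell the lexer about each sub.

```python
def slash_meaning(src, takes_args):
    """Classify the first "/" that follows a bareword sub call.

    takes_args maps a sub name to True if the sub accepts arguments.
    Without an entry for the sub, the "/" cannot be classified at all.
    """
    name = src.partition("/")[0].strip()
    if name not in takes_args:
        raise LookupError(f"unknown sub {name!r}: cannot classify '/'")
    # A sub that takes arguments is still waiting for a term, so "/" opens
    # a regex; a sub with no arguments is a complete term, so "/" divides.
    return "regex-start" if takes_args[name] else "divide"

# Stand-in for what the two declarations above would record.
declarations = {"no_args": False, "list_args": True}
```

Calling `slash_meaning("no_args / 2", declarations)` gives "divide", while `slash_meaning("list_args / 2 /", declarations)` gives "regex-start". For a sub that is not in the table the function can only give up, which is the situation a tokenizer is in when it has not seen (or cannot run) the declaring code.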

Matthew> It is also possible that telling the difference between /-as-divide
Matthew> and /-as-regex becomes much easier if lookahead is employed in the
Matthew> tokeniser.

No, not possible at all.  The entire rest of the program may be valid
either way.  You *must* know by the time you're done with /, or
/-and-more.  The rest of the code cannot be a hint.  Again, see my
article.



Re: Lexing requires execution (was Re: Will _anything_ be able to truly parse and understand perl?)

2004-11-26 Thread Matthew Walton
Randal L. Schwartz wrote:
> >>>>> "Matthew" == Matthew Walton [EMAIL PROTECTED] writes:
>
> Matthew> Perl 6 has formal parameters for subs, methods etc. I don't see any
> Matthew> mention of Perl 5-style prototypes in S6, and I honestly can't see how
> Matthew> they could possibly fit with formal parameters. Hopefully Larry or
> Matthew> someone can clarify whether they still exist or not.
>
> As long as you can have a user-defined null-prototyped subroutine (one
> that doesn't need parens following), you have the problem.  See the
> sin/time examples in the monk article, and then consider user-defined
> functions that have no args (like time) and those that do (like sin).
>
> Matthew> The Perlmonks article throws up a lot of very nasty cases. Not knowing
> Matthew> the entire current language definition by heart, I can't say this with
> Matthew> absolute certainty, but I retain the belief that Perl 6 is at least
> Matthew> *easier* to deal with than Perl 5.
>
> I believe you have a false belief.  I don't know anything in the new
> prototypes-which-became-full-formal-arguments that made it any
> *easier* to recognize the ending of a subroutine argument list without
> knowing its precise definition.
>
> In Perl6:
>
> sub no_args () { ... }
> sub list_args ([EMAIL PROTECTED]) { ... }
>
> no_args / # this is a divide
> list_args / # this is the start of a regex
>
> See, it's still there. :)

I believe I did mention that being able to call functions without parens 
is a problem.

> Matthew> It is also possible that telling the difference between /-as-divide
> Matthew> and /-as-regex becomes much easier if lookahead is employed in the
> Matthew> tokeniser.
>
> No, not possible at all.  The entire rest of the program may be valid
> either way.  You *must* know by the time you're done with /, or
> /-and-more.  The rest of the code cannot be a hint.  Again, see my
> article.

I read the article. I believe I mentioned that as well.

But I will have to concede that it is impossible to correctly determine 
the structure of an arbitrary Perl 6 program without having to hand the 
definitions of all functions used and also any grammars and macros used. 
Sometimes you will be able to do it, sometimes you won't, but you can't 
operate on the assumption that you can.

It's quite a disappointment in some ways, but we've lived with it in 
Perl 5, and I'm sure we can live with it in Perl 6.

And I still think Perl 6 will have fewer cases in which it's completely 
impossible for not-Perl to parse it. Unfortunately, fewer still implies 
some, and some is still a problem.



Re: Lexing requires execution (was Re: Will _anything_ be able to truly parse and understand perl?)

2004-11-26 Thread James Mastros
Randal L. Schwartz wrote:
> All the handwaving in the world won't fix this.  As long as we have
> dual-natured characters like /, and user-defined prototypes, Perl
> cannot be lexed without also parsing, and therefore without also
> running BEGIN blocks.

And user-defined prototypes that change when the argument list of a
function ends, that is.  If we forced the argument list for all
functions to have parens (including empty parens for argumentless
functions), then we'd be OK, I'm fairly certain.

For that matter, if we stick to declaration syntax for declarations, and 
not BEGIN blocks and reflection, then we're OK -- you have to do some 
execution, but of a minilanguage that can't express concepts that you 
wouldn't be OK running... though you do still have to descend through 
require/use, and thus have to have the files being required or used (or 
at least a description of their declarations).

-=- James Mastros,
theorbtwo


Re: Lexing requires execution (was Re: Will _anything_ be able to truly parse and understand perl?)

2004-11-26 Thread Juerd
James Mastros skribis 2004-11-26 14:36 (+0100):
> And user-defined prototypes that change when the argument list of a
> function ends, that is.  If we forced the argument list for all
> functions to have parens (including empty parens for argumentless
> functions), then we'd be OK, I'm fairly certain.

While that is true, please realise that many people like that in Perl,
parens are optional. I am one of those people who dislike typing and
counting too many balanced symbol sets.

If only method and function syntax could be the same, and methods would
also not require parens... Ah well, that's what we have mutable grammar
for.

> For that matter, if we stick to declaration syntax for declarations, and
> not BEGIN blocks and reflection

Macros are somewhat like BEGIN blocks and may be needed to turn invalid
syntax into something that is valid.


Juerd


Re: Will _anything_ be able to truly parse and understand perl?

2004-11-25 Thread Michele Dondi
On Thu, 25 Nov 2004, Adam Kennedy wrote:
> I thought it was about time I brought some concerns I've been having lately
> to the list. Not so much on any particular problem with perl6, but on
> problems with perl5 we would seem to have the opportunity to fix but aren't.
> (So far as I can tell).

So why not discuss this somewhere else? (e.g. clpmisc)

> One of the biggest problems I have had with perl5 is that nothing, not even
> perl itself, can truly actually parse Perl source. By this, I mean parse

False:

[Nothing but] perl can parse Perl. (Tom Christiansen)

Michele
--
# This prints: Just another Perl hacker,
seek DATA,15,0 and  print   q... <DATA>;
__END__


Re: Will _anything_ be able to truly parse and understand perl?

2004-11-25 Thread Adam Kennedy
> Let's say you want to write a yacc grammar to parse Perl 6, or
> Parse::RecDescent, or whatever you're going to use.  Yes, that will be
> hard in Perl 6.  Certainly harder than it was in Perl 5.

In the end, I concluded there was _no_ way to write even a Perl 5 parser 
using any sort of pre-rolled grammar system, as the language does not 
have that sort of structure.

PPI was done the hard way. Manually stepping through line by line and 
using a variety of cruft (some stolen from the perl source, some my own) 
to make it just work.

I would envisage that the same would be true of writing a PPI6, except 
with a hell of a lot more operators :)

> However, Perl 6 comes packaged with its own grammar, in Perl's own rule
> format.  So now the quote "only perl can parse Perl" may become "only
> Perl can parse Perl"  (And even "only Perl can parse perl", since it's
> written in itself :-).
>
> Perl's contextual sensitivity is part of the language.  So the best you
> can do is to track everything like you mentioned.  It's going to be
> impossible to parse Perl without having perl around to do it for you.
>
> But using the built-in grammar, you can read in a program, macros and
> all, and get an annotated source tree back, that you could rebuild the
> source out of.

Again, this is of very little use, effectively destroying the source 
code and replacing it with different source that is a serialised version 
of the tree.

For a current notional example, it would be like loading a simple...

try {
  $object->$do_something;
} catch (Exception $problem) {
  handle($problem);
}

... changing ->$do_something to ->$do_something() to make it
back-portable, and then ending up with...

Module::Exceptions::initialize('line 98');
my $exceptionhandler = Module::Exceptions::prepare();
eval {
  $exceptionhandler->update_status('in try');
  $object->do_something();
};
if ( $@ ) {
  if ( ref $exceptionhandler ) {
    require Scalar::Util ();
    if ( Scalar::Util::blessed $exceptionhandler eq 'Exception' ) {
      do {
        my $problem = $exceptionhandler->fetch_exception_as('$problem');
        # handler starts here
        handle($problem);
        $problem->clean_up;
      };
    }
  } else {
    # Just die as normal
    die $@;
  }
}

While technically they may be identical once they get through the parser
and into tree form, trying to change ->$do_something to
->$do_something() and getting back some huge monster chunk of code you
didn't expect is definitely not what the intent of parsing it in the
first place was.

This is what I am talking about when I refer to the "Frontpage effect",
the habit Microsoft's HTML editor (especially the early versions) had of
rebuilding your HTML document from scratch, deleting all your template
variables and PHP code and generally making it impossible to write HTML
by hand. For HTML, where you aren't MEANT to be writing stuff by hand
under normal circumstances, that wasn't always a problem, but perl _is_
meant to be written by hand.

> You could even grab the comments and do something sick
> with them (see Damian :-).  Or better yet, do something that PPI
> doesn't, and add some sub call around all statements, or determine the
> meaning of brackets in a particular context.
>
> The question of whether to execute BEGIN blocks is a tricky one.
> Sometimes they change the parse of the program. Sometimes they do other
> stuff.  All you can hope for is that people understand the difference
> between BEGIN (change parsing) and INIT (do before the program starts).

Frankly that is a gaping security hole... not only do I still have to
deal with the problem of loading every single dependency or having no
parsing ability otherwise, but I am required to trust every perl
programmer on the planet :(

> I love PPI, by the way :-)

Thank you, I do too :)

But I'd like to still have something like it in perl6 :(

Adam


Re: Will _anything_ be able to truly parse and understand perl?

2004-11-25 Thread Adam Kennedy
Michele Dondi wrote:
> On Thu, 25 Nov 2004, Adam Kennedy wrote:
>> I thought it was about time I brought some concerns I've been having
>> lately to the list. Not so much on any particular problem with perl6,
>> but on problems with perl5 we would seem to have the opportunity to
>> fix but aren't. (So far as I can tell).
>
> So why not discuss this somewhere else? (e.g. clpmisc)
>
>> One of the biggest problems I have had with perl5 is that nothing, not
>> even perl itself, can truly actually parse Perl source. By this, I
>> mean parse
>
> False:
>
> [Nothing but] perl can parse Perl. (Tom Christiansen)

Please see Acme::BadExample. perl itself cannot parse this at all, and
yet it follows the absolutely most basic syntax for the language.

And just after the snip you will see I qualify "parse" in this context
as loading the perl in some form of DOM-type tree.

Adam


Re: Will _anything_ be able to truly parse and understand perl?

2004-11-25 Thread Adam Kennedy
Smylers wrote:
> Adam Kennedy writes:
>
>> perl itself would also appear unable to understand perl source,
>> instead doing what I would call "RIBRIB" parsing, "Read a bit, run a
>> bit".
>
> RIBRIB?  RABRAB, surely!
>
> Smylers

Yes, you are right, typo.


Re: Will _anything_ be able to truly parse and understand perl?

2004-11-25 Thread Herbert Snorrason
On Thu, 25 Nov 2004 22:00:03 +1100, Adam Kennedy [EMAIL PROTECTED] wrote:
> And just after the snip you will see I qualify "parse" in this context
> as loading the perl in some form of DOM-type tree.

And yet you disqualify the Perl6 rule system, with its tree of match
objects? What, exactly, is it that you want?

-- 
To show weakness is to lose;
to be hard is to rule.
  - Glas und Tränen, Megaherz


Re: Will _anything_ be able to truly parse and understand perl?

2004-11-25 Thread Larry Wall
On Thu, Nov 25, 2004 at 02:31:46PM +1100, Adam Kennedy wrote:
: Let's say you want to write a yacc grammar to parse Perl 6, or
: Parse::RecDescent, or whatever you're going to use.  Yes, that will be
: hard in Perl 6.  Certainly harder than it was in Perl 5.
: 
: In the end, I concluded there was _no_ way to write even a Perl 5 parser 
: using any sort of pre-rolled grammar system, as the language does not 
: have that sort of structure.

On that level you have to think of Perl as multiple languages, not a
single language.  That in itself should not be a problem, though.

: PPI was done the hard way. Manually stepping through line by line and 
: using a variety of cruft (some stolen from the perl source, some my own) 
: to make it just work.
: 
: I would envisage that the same would be true of writing a PPI6, except 
: with a hell of a lot more operators :)

The number of operators is a bit of a red herring.  What you really
don't like is that there aren't a fixed number of them.  :-)

: However, Perl 6 comes packaged with its own grammar, in Perl's own rule
: format.  So now the quote "only perl can parse Perl" may become "only
: Perl can parse Perl"  (And even "only Perl can parse perl", since it's
: written in itself :-).
: 
: Perl's contextual sensitivity is part of the language.  So the best you
: can do is to track everything like you mentioned.  It's going to be
: impossible to parse Perl without having perl around to do it for you.
: 
: But using the built-in grammar, you can read in a program, macros and
: all, and get an annotated source tree back, that you could rebuild the
: source out of.
: 
: Again, this is of very little use, effectively destroying the source 
: code and replacing it with different source that is a serialised version 
: of the tree.

And there you put your finger onto the real problem, which is not that
Perl is a mutating language or that it has a lot of operators, but that
in the process of getting from here to there, it *forgets* how it got
there, so there's no way of getting back to here.

: This is what I am talking about when I refer to the "Frontpage effect",
: the habit Microsoft's HTML editor (especially the early versions) had of
: rebuilding your HTML document from scratch, deleting all your template
: variables and PHP code and generally making it impossible to write HTML
: by hand. For HTML, where you aren't MEANT to be writing stuff by hand
: under normal circumstances, that wasn't always a problem, but perl _is_
: meant to be written by hand.

But under another view, explosions of opcodes are just part of the
compilation process.  Again, the real problem is the forgetting of
both the original structure and what it means in the context of the
language that was being parsed at the time.

There is no doubt that source filters are much too crude, and forget
way too much.  That's why we're trying to kill them dead in Perl 6.
I think the real question is how far we can push Perl 6's macro system
without forgetting anything you want to know about the structure of
the program.  Obviously AST macros will have an easier time of it
than textual macros.  An AST macro can just automatically attach the
original parse and context as properties on the top of the new AST.

To keep this info around for textual macros will require a bit more
trickery, but we have to do it anyway for activities like debugging.
So if we can see that in the larger context of preserving the entire
compilation audit trail, all the better.

:  You could even grab the comments and do something sick
: with them (see Damian :-).  Or better yet, do something that PPI
: doesn't, and add some sub call around all statements, or determine the
: meaning of brackets in a particular context.
: 
: The question of whether to execute BEGIN blocks is a tricky one.
: Sometimes they change the parse of the program. Sometimes they do other
: stuff.  All you can hope for is that people understand the difference
: between BEGIN (change parsing) and INIT (do before the program starts).
: 
: Frankly that is a gaping security hole... not only do I have to still 
: deal with the problem of loading every single dependency or having no 
: parsing ability otherwise, but I am required to trust every perl 
: programmer on the planet :(

Another red herring--we've always had fairly strict accountability on
the language warping dependencies at the use level.  We're improving
that in Perl 6 by requiring a decision on version at use time, and
making that version a part of the metadata.

But it's no accident that one of the places that Perl 5's B::Deparse
has troubles is right at the BEGIN boundaries.  Wherever Deparse has
troubles, you can read that to mean "I didn't understand that I should
put something into Perl 5 to remember something important."  The final
metadata for the compiled program has to be able to tell you which
chunks of program were compiled under which language.  That's just
as important as being able to track back to the 

Re: Will _anything_ be able to truly parse and understand perl?

2004-11-25 Thread Adam Kennedy
Herbert Snorrason wrote:
> On Thu, 25 Nov 2004 22:00:03 +1100, Adam Kennedy [EMAIL PROTECTED] wrote:
>> And just after the snip you will see I qualify "parse" in this context
>> as loading the perl in some form of DOM-type tree.
>
> And yet you disqualify the Perl6 rule system, with its tree of match
> objects? What, exactly, is it that you want?

What I'm after are 3 critical features.

1. You always get back out what you put in.
   $source eq serialize(parse($source)).

2. No side effects. Autrijus Tang suggests this may be workable.

3. You can parse a document with broken dependencies.
   There are a myriad of these situations, such as
   - Dependencies you don't have
   - Editing on a different platform to the execution platform (think
     Win32:: or S390/mainframe/GridComputing)
   - Unfinished code
   - Things you can't get installed (ImageMagick etc)
   - Example code that will never be executed

(Imagine if you will a mod_perl syntax highlighting module for
search.cpan.org. Should the search.cpan.org host have to _install_ every
single one of the modules in CPAN?)
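
To make feature 1 concrete, here is a minimal sketch in Python of the round-trip property $source eq serialize(parse($source)). It is not how PPI works internally: a flat token list stands in for the document tree, and the LOSSLESS pattern is an assumption chosen for this illustration. The idea it shows is that if every character of the input, whitespace and comments included, lands in exactly one token, then nothing can be lost on the way back out.

```python
import re

# Each alternative consumes at least one character and the final "." matches
# anything (DOTALL includes newlines), so no input character is ever skipped.
LOSSLESS = re.compile(r"\s+|#[^\n]*|\w+|.", re.DOTALL)

def parse(source):
    """Return a flat list of tokens that together cover the whole source."""
    return LOSSLESS.findall(source)

def serialize(tokens):
    """Rebuild the source text by concatenating the tokens unchanged."""
    return "".join(tokens)

source = "my $x = 1;  # keep me\nprint $x;\n"
round_trip_ok = serialize(parse(source)) == source
```

Editing one token in the list and serializing again changes only that token's text; the formatting and comments around it survive untouched, which is the opposite of the regenerate-everything behaviour described earlier in the thread.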

PPI can do all of these 3 things. Not 100% reliably, but for "normal"
code (where normal is actually defined fairly broadly).

In any case, I would like to suspend this debate for a week, as I'll be 
talking with Damian (hopefully) at YAPC.AU. I'll report back afterwards, 
having hopefully imparted the full extent of my problem.

Perl 6 rules or some variation therein may indeed be what I'm after, 
although I need to find out more about the internals.

Do we have a working version yet I can create some demonstrations with?
Adam


Re: Will _anything_ be able to truly parse and understand perl?

2004-11-25 Thread Damian Conway
Adam Kennedy wrote:

> What I'm after are 3 critical features.
> 1. You always get back out what you put in.
> $source eq serialize(parse($source)).

As Larry pointed out, this will depend on how much metadata your parser
augments your parse-tree with. I think it will be doable (probably by
subclassing the standard Perl parsing grammar).

> 2. No side effects. Autrijus Tang suggests this may be workable

No side effects of what? Of parsing? I don't think that's possible. Perl is
defined such that compile-time side-effects can alter the syntax (and hence
the parsing) of a program.

> 3. You can parse a document with broken dependencies.

This is clearly not possible in Perl (5 or 6) in the general case. Perl isn't
that kind of language.

> PPI can do all of these 3 things. Not 100% reliably, but for normal
> code (where normal is actually defined fairly broadly).

And Perl 6's C<grammar Perl> will be able to give you the same.

> In any case, I would like to suspend this debate for a week, as I'll be
> talking with Damian (hopefully) at YAPC.AU. I'll report back afterwards,
> having hopefully imparted the full extent of my problem.

I believe I understand your problem pretty well already.
But I'll be more than happy to discuss this whole issue with you.

> Perl 6 rules or some variation therein may indeed be what I'm after,
> although I need to find out more about the internals.

So do we. That's why we're building it at the moment. ;-)
See the mailto:[EMAIL PROTECTED] mailing list.

> Do we have a working version yet I can create some demonstrations with?

No. We're working on that. There's a partial prototype that runs under 5.8.3
(but is broken under earlier and later releases) on CPAN as Perl6::Rules.
You should also (re)read Apocalypse 6 and especially Synopsis 6 on 
http://dev.perl.org

Damian


Re: Will _anything_ be able to truly parse and understand perl?

2004-11-25 Thread Luke Palmer
Adam Kennedy writes:
> Herbert Snorrason wrote:
> > On Thu, 25 Nov 2004 22:00:03 +1100, Adam Kennedy [EMAIL PROTECTED] wrote:
> >
> > > And just after the snip you will see I qualify "parse" in this context
> > > as loading the perl in some form of DOM-type tree.
> >
> > And yet you disqualify the Perl6 rule system, with its tree of match
> > objects? What, exactly, is it that you want?
>
> What I'm after are 3 critical features.
>
> 1. You always get back out what you put in.
> $source eq serialize(parse($source)).
>
> 2. No side effects. Autrijus Tang suggests this may be workable
>
> 3. You can parse a document with broken dependencies.
> There are a myriad of these situations, such as

I'm afraid that's just not possible.  And by that I don't mean very
hard to implement.  I mean impossible, in a halting problem sort of
way.  Take this module:

module OrBar;

macro *infix:{''} (AST $left, AST $right) {
return {
my ($le, $re) = ($left.run, $right.run);
$le * $re - ($le + $re);
}
}

And this program:

use OrBar;

print 3  4;

There's no way you can not have that dependency and still parse the
program.  What if the macro declaration declared it to be lower
precedence than listops?
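
The precedence point can be sketched with a tiny precedence-climbing parser in Python. This is a toy under stated assumptions, not Perl 6's grammar: the "@" operator is a hypothetical stand-in for the user-defined infix, "+" stands in for an existing operator, and the `prec` table plays the role of whatever the dependency's declarations would have told the parser.

```python
def parse_expr(tokens, prec, min_prec=1):
    """Precedence climbing over a flat token list (left-associative ops).

    tokens is consumed in place; prec maps operator -> binding strength.
    Returns nested (op, left, right) tuples.
    """
    left = tokens.pop(0)
    while tokens and tokens[0] in prec and prec[tokens[0]] >= min_prec:
        op = tokens.pop(0)
        # Operands of op may only contain operators that bind tighter.
        right = parse_expr(tokens, prec, prec[op] + 1)
        left = (op, left, right)
    return left

source = "3 @ 4 + 5"
tight = parse_expr(source.split(), {"@": 2, "+": 1})
loose = parse_expr(source.split(), {"@": 1, "+": 2})
```

Here `tight` comes out as `("+", ("@", "3", "4"), "5")` and `loose` as `("@", "3", ("+", "4", "5"))`. The source text is identical in both cases; only the declared precedence differs, so a tool that cannot see the declaration has no basis for choosing between the two trees.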

And as far as not executing BEGIN blocks, well, you can't do that either.
A BEGIN block might execute and do the same thing as that declaration
(all declarations are just shorthands for the appropriate BEGIN blocks).
And then you can't parse without executing it.

You might be able to use whatever Parrot decides the alternative to
Safe.pm is in order to make sure nobody writes:

BEGIN { system 'cat /dev/urandom > /dev/mem' }

But, AFAICT, there's nothing else you can do.  And the language isn't
going to change to solve this problem.  Munging the grammar and the
operators at compile-time is simply too cool.  The single feature I like
most about perl (it's very hard to decide) is its ability to execute
code at compile time... on top of everything you can do at compile time,
of course.

Or you could tell the parser not to execute BEGIN blocks, and hope that
works.

> (Imagine if you will a mod_perl syntax highlighting module for
> search.cpan.org. Should the search.cpan.org host have to _install_ every
> single one of the modules in CPAN?)

No, it'll just have to guess, as all syntax highlighters do.  Chances
are, most modules aren't going to change things drastically enough to
make your syntax highlighting all wrong.

But you don't really need to parse to syntax highlight, either.  You
just need to tokenize.  The way the parser will be designed includes an
explicit tokenizer, and don't worry, we'll let you hook into it.  In
fact, it seems PPI is more of a tokenizer than a parser, and that's much
more easily done.  It's also quite easy to recover if it messes up.

Oh, I wrote this after missing your "I'd like to suspend this debate for
a week".  So DELETED!

Luke


Will _anything_ be able to truly parse and understand perl?

2004-11-24 Thread Adam Kennedy
Hi folks
I thought it was about time I brought some concerns I've been having 
lately to the list. Not so much on any particular problem with perl6, 
but on problems with perl5 we would seem to have the opportunity to fix 
but aren't. (So far as I can tell).

One of the biggest problems I have had with perl5 is that nothing, not
even perl itself, can truly parse Perl source. By this, I
mean "parse" in the sense of reading a chunk of bytes of Perl source and
understanding what they mean. Call it "document parsing" if you wish.

perl itself would also appear unable to understand perl source, instead
doing what I would call "RIBRIB" parsing, "Read a bit, run a bit". The
parsing of perl source is itself merely part of the first execution
phase (BEGIN).

If we leave source filters out of it for now, the main problems with
document parsing in the current Perl are caused by the
interaction of prototypes and operator/operand context.

As the most common example, in order to know what the slash / 
character is (division or regex), you have to know whether you are in 
operator or operand context, which requires knowing which things are
parameters for subroutines and which aren't, and you can't keep track of 
that without tracking all of the prototypes for every function both CORE 
and in the symbol table as you go, and you can't do THAT without loading 
every single module dependency and running a parse/BEGIN-phase-execution 
on all of the files, and you can't do THAT without having a perl 
interpreter to execute it all in.
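
[Illustration] A small Perl 5 sketch of the prototype problem described above. The two sub names are invented for the example; they differ only in their prototype, yet the identical text after them is read in two different ways:

```perl
use strict;
use warnings;

sub takes_nothing() { 100 }               # empty prototype, like time()
sub takes_one($)    { "matched=$_[0]" }   # one scalar argument, like sin()

$_ = "1 2 ; # 3";   # the string the regex below is matched against

# After takes_nothing the parser expects an operator: "/" is division,
# and everything from "#" onwards is a comment.
my $divided = takes_nothing / 2 ; # / ;

# After takes_one the parser expects an operand: "/ 2 ; # /" is a regex
# matched against $_, and the result of the match is passed to the sub.
my $matched = takes_one / 2 ; # / ;

print "$divided\n";   # 50
print "$matched\n";   # matched=1
```

If the two subs lived in another module, the only way to know which reading applies would be to load that module and look at the prototypes.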

Any attempt to use the perl interpreter to parse code to understand it 
in any way is both unpredictable and dangerous, due to the common 
situation of not having a platform that can fully run the code (and all 
its dependencies) and all the potentially dangerous side-effects.

If you can't load and BEGIN-phase-execute every single one of the 
dependencies, you can't parse. At all... ever!

use Win32::Something;
1;

Unparsable on Unix...

use Win32::Something;
use Proc::ProcessTable;
1;

Unparsable on anything... even if I just want 1 syntax highlighted as
a number.

BEGIN { system 'rm -rf /'; }
1;

eeep!

If anyone checks a broken version of a module into CVS that is part
of some large project you are working on, sorry, you can't parse anything
any more, even to try to hunt down the problem.

For a more comprehensive example, take a look at Acme::BadExample,
which uses absolutely plain and simple syntax, yet is completely 
unparsable. (The reward is as-yet unclaimed)

Go have a look now, I'll wait...

All of this creates HUGE headaches if we want to start adding some 
intelligence to code analysis and manipulation. We've all seen the 
lengths Komodo has had to go to by continuously running the code.

What few attempts there have been to modify code are fairly impotent. 
While you _can_ get source back from B:: it isn't particularly useful 
except as a way to serialise anonymous code for storage or transport.

Given $source, B:: throws away all syntax and commenting and POD and 
__DATA__ and whatever is after __END__ and then dumps back out something 
quite different from what went in. A sort of Frontpage effect. It's what 
the program thinks is close enough to be the same, but definitely not 
what you wanted.
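
[Illustration] A short sketch of that effect using the core B::Deparse module and its coderef2text method. The sub body is made up for the example, and the exact text that comes back varies between perl versions, so no output is shown here:

```perl
use strict;
use warnings;
use B::Deparse;

# A sub with comments and hand-chosen layout.
my $code = sub {
    # an important comment
    my $total = 1 + 2;      # this expression may come back rewritten
    return $total;
};

# Regenerate source text from the compiled form of the sub.
my $text = B::Deparse->new->coderef2text($code);

print $text, "\n";
print "comment survived\n" if $text =~ /important comment/;
```

The regenerated text is rebuilt from the compiled op tree, so the comments and the original layout are not there to be recovered.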

Now, despite _all_ of these problems, the continuous insistence from the 
entire Perl community that writing a perl parser is impossible ("only
perl can parse Perl"), and despite the fact I had seen several other
people try and fail, or give up, or not really get started, I decided to 
redefine the problem slightly and have managed to get a working Perl 
parser up and running.

Or at least, well enough to handle selfgol and 90% of CPAN (I haven't 
really started working on corner cases yet, after which I expect to 
reach about 99%), and to do so needing ONLY what is contained in the .pm 
 or .pl file and nothing else. The parser can read all of 
Acme::BadExample safely and write it back out again unchanged.

In any case, it works and works well enough to start building a number 
of cool toys on, such as normalisation and comparison of code, code 
metrics (Leon had a play with this), syntax highlighting, style 
analysis, the CPAN Cross Reference, and various other stuff that is 
staying on the whiteboard until API-freeze is finished. 
(back/forwardporting, refactoring, auto-documentation, smart perl diffs, 
safe testing of code, better dependency and version extraction, 
checkstyle, a refactoring perl editor ala IntelliJ IDEA, etc etc etc).

And most importantly, because it treats the Perl source as a document 
(data structure) and not as code (procedural execution) it can serialise 
back to the source code which will be identical to what it read in. That 
is, it is totally round-trip safe (100% in testing of a CPAN subset of
5,500 perl files).
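
[Illustration] A toy Perl 5 sketch of the "source as a document" idea. It is not the parser described in this message; the tokenize sub and its token types are invented for the example and are far cruder than anything real. The only property it demonstrates is that when whitespace and comments are kept as tokens, joining the tokens gives back the input exactly, and nothing has to be loaded or executed:

```perl
use strict;
use warnings;

# Toy lexer: every character of the input ends up in exactly one token,
# including whitespace and comments, so nothing is thrown away.
sub tokenize {
    my ($source) = @_;
    my @tokens;
    pos($source) = 0;
    while (pos($source) < length($source)) {
        if    ($source =~ /\G(\s+)/gc)     { push @tokens, [ whitespace => $1 ] }
        elsif ($source =~ /\G(#[^\n]*)/gc) { push @tokens, [ comment    => $1 ] }
        elsif ($source =~ /\G(\d+)/gc)     { push @tokens, [ number     => $1 ] }
        elsif ($source =~ /\G(\w+)/gc)     { push @tokens, [ word       => $1 ] }
        elsif ($source =~ /\G(.)/gcs)      { push @tokens, [ other      => $1 ] }
    }
    return @tokens;
}

# The module named here does not need to be installed: it is never loaded.
my $source  = "use Win32::Something;\n1;  # just a number\n";
my @tokens  = tokenize($source);
my $rebuilt = join '', map { $_->[1] } @tokens;

print $rebuilt eq $source ? "round trip ok\n" : "round trip FAILED\n";
print "number token: $_->[1]\n" for grep { $_->[0] eq 'number' } @tokens;
```

The price of never executing anything is that such a lexer must guess wherever the meaning depends on prototypes or BEGIN blocks.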

So $source -> $DocumentObject -> $source is safe.

Now of course, it is completely unable to deal with source filters.
There is some 

Re: Will _anything_ be able to truly parse and understand perl?

2004-11-24 Thread Luke Palmer
Adam Kennedy writes:
> Getting (finally) to perl6, I could have sworn I saw an RFC early on
> which said "Make perl6 easier to parse".
>
> But it would appear the opposite is occurring. Source filters have
> become grammars and will now be officially approved and acceptable
> (yes?) while so far as I can tell the problem of prototype vs
> operator/operand interaction is not being addressed. (I'm a little in
> the dark here, perhaps it is and nobody has noticed enough)

Let's say you want to write a yacc grammar to parse Perl 6, or
Parse::RecDescent, or whatever you're going to use.  Yes, that will be
hard in Perl 6.  Certainly harder than it was in Perl 5.

However, Perl 6 comes packaged with its own grammar, in Perl's own rule
format.  So now the quote "only perl can parse Perl" may become "only
Perl can parse Perl".  (And even "only Perl can parse perl", since it's
written in itself :-).

Perl's contextual sensitivity is part of the language.  So the best you
can do is to track everything like you mentioned.  It's going to be
impossible to parse Perl without having perl around to do it for you.

But using the built-in grammar, you can read in a program, macros and
all, and get an annotated source tree back, that you could rebuild the
source out of.  You could even grab the comments and do something sick
with them (see Damian :-).  Or better yet, do something that PPI
doesn't, and add some sub call around all statements, or determine the
meaning of brackets in a particular context.

The question of whether to execute BEGIN blocks is a tricky one.
Sometimes they change the parse of the program. Sometimes they do other
stuff.  All you can hope for is that people understand the difference
between BEGIN (change parsing) and INIT (do before the program starts).
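
[Illustration] A sketch of the timing difference, shown in Perl 5, where both block types already exist. It assumes the file is run directly as a script, since INIT blocks are only run for code compiled before the main program starts:

```perl
use strict;
use warnings;

our @order;

push @order, 'run time';         # ordinary statement, first in the file

INIT  { push @order, 'INIT'  }   # after compilation, just before run time
BEGIN { push @order, 'BEGIN' }   # as soon as the parser has read this block

print join(' -> ', @order), "\n";   # BEGIN -> INIT -> run time
```

Only the BEGIN block has a chance to affect how the rest of the file is parsed; by the time INIT runs, compilation of the main program is already finished.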

I love PPI, by the way :-)

Luke



Re: Will _anything_ be able to truly parse and understand perl?

2004-11-24 Thread Damian Conway
Luke has answered this better than I would have. In particular, he wrote:

> Perl's contextual sensitivity is part of the language.  So the best you
> can do is to track everything like you mentioned.  It's going to be
> impossible to parse Perl without having perl around to do it for you.

That first sentence is the critical point to remember. Without C<use> and
C<BEGIN> and the new macro facilities, Perl just wouldn't be Perl.

I would just add that we have indeed "Ma[d]e Perl 6 easier to parse". In
concrete terms, to parse Perl 6 (in Perl 6) you'll simply write the following
one line:

$parse_tree = ( $source_text ~~ m:keepall/ <Perl.prog> / );

which will completely parse your source. Because *during* its parse it will
run any C<BEGIN> blocks or macros and C<use> any modules, thereby *lexically*
adapting its own grammar as it goes. What you'll get back is a tree of 
submatches (including whitespace and comments, if you want them), which you 
can then reprocess as you wish.

Damian
PS: Yes, I'll be at OSDC <www.osdc.com.au> next week (giving the opening and
closing keynotes, in fact). And, yes, you're most welcome to have a chat with
me about Perl 6 sometime during the conference.


Re: Will _anything_ be able to truly parse and understand perl?

2004-11-24 Thread Smylers
Adam Kennedy writes:

> perl itself would also appear unable to understand perl source,
> instead doing what I would call "RIBRIB" parsing, "Read a bit, run a
> bit".

RIBRIB?  RABRAB, surely!

Smylers