Lexing requires execution (was Re: Will _anything_ be able to truly parse and understand perl?)
Luke == Luke Palmer [EMAIL PROTECTED] writes:

Luke> But you don't really need to parse to syntax highlight, either. You
Luke> just need to tokenize.

Unfortunately, to tokenize, you also have to know the state of the parse. As long as / is both divide and begin regex, you're toasted.

Please see my long post on parsing Perl at perlmonks, http://www.perlmonks.org/index.pl?node_id=44722, for examples of *why* you need to notice whether you have a divide or a regex match.

Perl is fundamentally resistant to lexing. As in the beginning of this thread, one of the RFCs suggested the possibility of making Perl lexable, but apparently the designers said "no, we think the / duality is worth keeping". And that seals the fate for Perl6 just like all Perl before it.

To properly lex a Perl program (Perl6 included), you *must* execute BEGIN blocks. That's the end of that tune. Anything else is just an approximation.

-- 
Randal L. Schwartz - Stonehenge Consulting Services, Inc. - +1 503 777 0095
[EMAIL PROTECTED] URL:http://www.stonehenge.com/merlyn/
Perl/Unix/security consulting, Technical writing, Comedy, etc. etc.
See PerlTraining.Stonehenge.com for onsite and open-enrollment Perl training!
Re: Lexing requires execution (was Re: Will _anything_ be able to truly parse and understand perl?)
Randal L. Schwartz wrote: Luke == Luke Palmer [EMAIL PROTECTED] writes: Luke But you don't really need to parse to syntax highlight, either. You Luke just need to tokenize. Unfortunately, to tokenize, you also have to know the state of the parse. As long as / is both divide and begin regex, you're toasted. So you're saying that in Perl 6 it will be entirely impossible to determine if / appears as the division operator or as the beginning of a regex from a purely syntactic examination of the source code? I'm finding that very, very hard to believe. Regexps aren't valid where /-the-operator is, after all. Please correct me if I'm wrong, but I've got the impression that Perl 6 is tokenisable without requiring BEGIN blocks to be run - provided no grammars which the tokeniser doesn't already know about are used, of course, that one will never be avoidable.
Re: Lexing requires execution (was Re: Will _anything_ be able to truly parse and understand perl?)
Matthew == Matthew Walton [EMAIL PROTECTED] writes:

Matthew> So you're saying that in Perl 6 it will be entirely impossible to
Matthew> determine if / appears as the division operator or as the beginning of
Matthew> a regex from a purely syntactic examination of the source code?

Yes.

Matthew> I'm finding that very, very hard to believe. Regexps aren't valid
Matthew> where /-the-operator is, after all.

And that's precisely why Perl can work as it does. If an operator is expected, / is divide. If a term is expected, / is the beginning of a regex. This has been true since Perl1 (maybe 0). There are a few other characters that also work similarly, but / is the most frequent and most troublesome.

And it got worse for Perl5, because of user-defined prototypes, which as far as I can tell, are still present in Perl6.

Matthew> Please correct me if I'm wrong, but I've got the impression that Perl
Matthew> 6 is tokenisable without requiring BEGIN blocks to be run - provided
Matthew> no grammars which the tokeniser doesn't already know about are used,
Matthew> of course, that one will never be avoidable.

Your impression is wrong. In the presence of user-defined prototypes, you *must* execute the code that might alter a prototype in order to determine whether / is a divide (and therefore standalone token) or the beginning of a regex (and therefore must locate the end of the regex to properly be a token). Please see the referenced perlmonks article.

All the handwaving in the world won't fix this. As long as we have dual-natured characters like /, and user-defined prototypes, Perl cannot be lexed without also parsing, and therefore without also running BEGIN blocks.

-- Randal L. Schwartz
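The point about having to execute the code that might alter a prototype can be sketched in Perl 5 roughly as follows. This is an illustration only, not code from the perlmonks article: the sub name `whatever` and the environment variable `DEMO_TAKES_LIST` are made up for the sketch. The BEGIN block decides at compile time whether `whatever` takes no arguments or a list, and that decision changes how the one line that follows is split into tokens.

```perl
use strict;
use warnings;

my @seen;

BEGIN {
    # Which definition gets installed depends on code that runs at compile
    # time. DEMO_TAKES_LIST is a hypothetical environment variable.
    if ( $ENV{DEMO_TAKES_LIST} ) {
        *whatever = sub { return scalar @_ };     # no prototype: takes a list
    }
    else {
        *whatever = sub () { return 100 };        # empty prototype: no arguments
    }
}

$_ = "x 25 ; # y";

# Empty prototype: "/" is division and everything after "#" is a comment.
# No prototype:    "/ 25 ; # /" is a regex and the second push is live code.
push @seen, whatever / 25 ; # / ; push @seen, "regex branch";

print "@seen\n";
```

Run with `DEMO_TAKES_LIST` unset, this should print `4`; with it set to a true value, the same source line is tokenised differently and the "regex branch" string is pushed as well. A tool that does not run the BEGIN block has no way to tell which of the two it is looking at.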
Re: Lexing requires execution (was Re: Will _anything_ be able to truly parse and understand perl?)
Randal L. Schwartz wrote: Matthew == Matthew Walton [EMAIL PROTECTED] writes: Matthew So you're saying that in Perl 6 it will be entirely impossible to Matthew determine if / appears as the division operator or as the beginning of Matthew a regex from a purely syntactic examination of the source code? Yes. Matthew I'm finding that very, very hard to believe. Regexps aren't valid Matthew where /-the-operator is, after all. And that's precisely why Perl can work as it does. If an operator is expected, / is divide. If a term is expected, / is the beginning of a regex. This has been true since Perl1 (maybe 0). There are a few other characters that also work similarly, but / is the most frequent and most troublesome. And it got worse for Perl5, because of user-defined prototypes, which as far as I can tell, are still present in Perl6. Perl 6 has formal parameters for subs, methods etc. I don't see any mention of Perl 5-style prototypes in S6, and I honestly can't see how they could possibly fit with formal parameters. Hopefully Larry or someone can clarify whether they still exist or not. If they don't still exist, this eases the problem somewhat, but not entirely I understand. Being able to call subs and methods without parentheses around the argument lists causes problems; a quick scan of the updated Synopses failed to reveal the rules for that in Perl 6. Your impression is wrong. In the presence of user-defined prototypes, you *must* execute the code that might alter a prototype in order to determine whether / is a divide (and therefore standalone token) or the beginning of a regex (and therefore must locate the end of the regex to properly be a token). Since Perl 5 style prototypes don't appear to exist anymore, this may be easier. I don't believe that the addition of the // operator compounds the problem anymore, because hopefully by that point it was possible to determine that you've seen an operator. The Perlmonks article throws up a lot of very nasty cases. 
Not knowing the entire current language definition by heart, I can't say this with absolute certainty, but I retain the belief that Perl 6 is at least *easier* to deal with than Perl 5. It is also possible that telling the difference between /-as-divide and /-as-regex becomes much easier if lookahead is employed in the tokeniser. Unfortunately, that makes the tokeniser much more complicated, and it's just a vague and random idea.
Re: Lexing requires execution (was Re: Will _anything_ be able to truly parse and understand perl?)
Matthew == Matthew Walton [EMAIL PROTECTED] writes:

Matthew> Perl 6 has formal parameters for subs, methods etc. I don't see any
Matthew> mention of Perl 5-style prototypes in S6, and I honestly can't see how
Matthew> they could possibly fit with formal parameters. Hopefully Larry or
Matthew> someone can clarify whether they still exist or not.

As long as you can have a user-defined null-prototyped subroutine (one that doesn't need parens following), you have the problem. See the sin/time examples in the monk article, and then consider user-defined functions that have no args (like time) and those that do (like sin).

Matthew> The Perlmonks article throws up a lot of very nasty cases. Not knowing
Matthew> the entire current language definition by heart, I can't say this with
Matthew> absolute certainty, but I retain the belief that Perl 6 is at least
Matthew> *easier* to deal with than Perl 5.

I believe you have a false belief. I don't know anything in the new prototypes-which-became-full-formal-arguments that made it any *easier* to recognize the ending of a subroutine argument list without knowing its precise definition. In Perl6:

    sub no_args () { ... }
    sub list_args ([EMAIL PROTECTED]) { ... }
    no_args / # this is a divide
    list_args / # this is the start of a regex

See, it's still there. :)

Matthew> It is also possible that telling the difference between /-as-divide
Matthew> and /-as-regex becomes much easier if lookahead is employed in the
Matthew> tokeniser.

No, not possible at all. The entire rest of the program may be valid either way. You *must* know by the time you're done with /, or /-and-more. The rest of the code cannot be a hint. Again, see my article.

-- Randal L. Schwartz
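For readers without the article to hand, the sin/time case can be reproduced with Perl 5 builtins alone. The sketch below is modelled on that example but is not copied from it; the variable names are made up. The two statements have identical text after the function name, yet the rest of each line is a comment in one case and live code in the other, which is why lookahead cannot settle the question.

```perl
use strict;
use warnings;

$_ = "x 25 ; # y";
my @trace;

# time takes no arguments, so an operator is expected next:
# "/" is division and the text after "#" is a comment.
my $ratio = time / 25 ; # / ; push @trace, "after time";

# sin takes an argument, so a term is expected next:
# "/ 25 ; # /" is a regex matched against $_, and the push is live code.
my $sine  = sin  / 25 ; # / ; push @trace, "after sin";

print "@trace\n";
```

Only the second push should run, so the output is expected to be `after sin`.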
Re: Lexing requires execution (was Re: Will _anything_ be able to truly parse and understand perl?)
Randal L. Schwartz wrote: Matthew == Matthew Walton [EMAIL PROTECTED] writes: Matthew Perl 6 has formal parameters for subs, methods etc. I don't see any Matthew mention of Perl 5-style prototypes in S6, and I honestly can't see how Matthew they could possibly fit with formal parameters. Hopefully Larry or Matthew someone can clarify whether they still exist or not. As long as you can have a user-defined null-prototyped subroutine (one that doesn't need parens following), you have the problem. See the sin/time examples in the monk article, and then consider user-defined functions that have no args (like time) and those that do (like sin). Matthew The Perlmonks article throws up a lot of very nasty cases. Not knowing Matthew the entire current language definition by heart, I can't say this with Matthew absolutely certainty, but I retain the belief that Perl 6 is at least Matthew *easier* to deal with than Perl 5. I believe you have a false belief. I don't know anything in the new prototypes-which-became-full-formal-arguments that made it any *easier* to recognize the ending of a subroutine argument list without knowing its precise definition. In Perl6: sub no_args () { ... } sub list_args ([EMAIL PROTECTED]) { ... } no_args / # this is a divide list_args / # this is the start of a regex See, it's still there. :) I believe I did mention that being able to call functions without parens is a problem. Matthew It is also possible that telling the difference between /-as-divide Matthew and /-as-regex becomes much easier if lookahead is employed in the Matthew tokeniser. No, not possible at all. The entire rest of the program may be valid either way. You *must* know by the time you're done with /, or /-and-more. The rest of the code cannot be a hint. Again, see my article. I read the article. I believe I mentioned that as well. 
But I will have to concede that it is impossible to correctly determine the structure of an arbitrary Perl 6 program without having to hand the definitions of all functions used and also any grammars and macros used. Sometimes you will be able to do it, sometimes you won't, but you can't operate on the assumption that you can. It's quite a disappointment in some ways, but we've lived with it in Perl 5, and I'm sure we can live with it in Perl 6. And I still think Perl 6 will have fewer cases in which it's completely impossible for not-Perl to parse it. Unfortunately, fewer still implies some, and some is still a problem.
Re: Lexing requires execution (was Re: Will _anything_ be able to truly parse and understand perl?)
Randal L. Schwartz wrote:
> All the handwaving in the world won't fix this. As long as we have dual-natured characters like /, and user-defined prototypes, Perl cannot be lexed without also parsing, and therefore without also running BEGIN blocks.

And user-defined prototypes that change when the argument list of a function ends, that is. If we forced the argument list for all functions to have parens (including empty parens for argumentless functions), then we'd be OK, I'm fairly certain.

For that matter, if we stick to declaration syntax for declarations, and not BEGIN blocks and reflection, then we're OK -- you have to do some execution, but of a minilanguage that can't express concepts that you wouldn't be OK running... though you do still have to descend through require/use, and thus have to have the files being required or used (or at least a description of their declarations).

-=- James Mastros, theorbtwo
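A small Perl 5 sketch of the mandatory-parens idea, with made-up sub names: once every call carries explicit parentheses, the token before each / already says whether an operator or a term comes next, so the sub definitions no longer matter to the tokeniser.

```perl
use strict;
use warnings;

# Hypothetical subs; neither has a prototype, and it makes no difference here.
sub no_args   { return 10 }
sub list_args { return scalar @_ }

$_ = "abc";

my $quotient = no_args() / 2;      # after ")" an operator is expected: division
my $count    = list_args(/b/);     # after "(" a term is expected: regex

print "$quotient $count\n";
```

This should print `5 1`. The cost, as the replies in the thread point out, is giving up the paren-free call style that many Perl programmers like.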
Re: Lexing requires execution (was Re: Will _anything_ be able to truly parse and understand perl?)
James Mastros skribis 2004-11-26 14:36 (+0100): And user-defined prototypes that change when the argument list of a function ends, that is. If we forced the argument list for all functions to have parens (including empty parens for argument less functions), then we'd be OK, I'm fairly certain. While that is true, please realise that many people like that in Perl, parens are optional. I am one of those people who dislike typing and counting too many balanced symbol sets. If only method and function syntax could be the same, and methods would also not require parens... Ah well, that's what we have mutable grammar for. For that matter, if we stick to declaration syntax for declarations, and not BEGIN blocks and reflection Macros are somewhat like BEGIN blocks and may be needed to turn invalid syntax into something that is valid. Juerd
Re: Will _anything_ be able to truly parse and understand perl?
On Thu, 25 Nov 2004, Adam Kennedy wrote: I thought it was about time I brought some concerns I've been having lately to the list. Not so much on any particular problem with perl6, but on problems with perl5 we would seem to have the opportunity to fix but aren't. (So far as I can tell). So why not discussing this somewhere else? (e.g. clpmisc) One of the biggest problems I have had with perl5 is that nothing, not even perl itself, can truly actually parse Perl source. By this, I mean parse False: [Nothing but] perl can parse Perl. (Tom Christiansen) Michele -- # This prints: Just another Perl hacker, seek DATA,15,0 and print q... DATA; __END__
Re: Will _anything_ be able to truly parse and understand perl?
> Let's say you want to write a yacc grammar to parse Perl 6, or Parse::RecDescent, or whatever you're going to use. Yes, that will be hard in Perl 6. Certainly harder than it was in Perl 5.

In the end, I concluded there was _no_ way to write even a Perl 5 parser using any sort of pre-rolled grammar system, as the language does not have that sort of structure.

PPI was done the hard way. Manually stepping through line by line and using a variety of cruft (some stolen from the perl source, some my own) to make it just work.

I would envisage that the same would be true of writing a PPI6, except with a hell of a lot more operators :)

> However, Perl 6 comes packaged with its own grammar, in Perl's own rule format. So now the quote "only perl can parse Perl" may become "only Perl can parse Perl" (And even "only Perl can parse perl", since it's written in itself :-).
>
> Perl's contextual sensitivity is part of the language. So the best you can do is to track everything like you mentioned. It's going to be impossible to parse Perl without having perl around to do it for you.
>
> But using the built-in grammar, you can read in a program, macros and all, and get an annotated source tree back, that you could rebuild the source out of.

Again, this is of very little use, effectively destroying the source code and replacing it with different source that is a serialised version of the tree.

For a current notional example, it would be like loading a simple...

    try {
        $object->$do_something;
    } catch (Exception $problem) {
        handle($problem);
    }

... changing ->$do_something to ->$do_something() to make it back-portable, and then ending up with...
    Module::Exceptions::initialize('line 98');
    my $exceptionhandler = Module::Exceptions::prepare();
    eval {
        $exceptionhandler->update_status('in try');
        $object->do_something();
    };
    if ( $@ ) {
        if ( ref $exceptionhandler ) {
            require Scalar::Util;
            if ( Scalar::Util::blessed($exceptionhandler) eq 'Exception' ) {
                do {
                    my $problem = $exceptionhandler->fetch_exception_as('$problem');
                    # handler starts here
                    handler($problem);
                    $problem->clean_up;
                };
            }
        } else {
            # Just die as normal
            die $@;
        }
    }

While technically they may be identical once they get through the parser and into tree form, trying to change ->$do_something to ->$do_something() and getting back some huge monster chunk of code you didn't expect is definitely not what the intent of parsing it in the first place was.

This is what I am talking about when I refer to the Frontpage effect, the habit Microsoft's HTML editor (especially the early versions) had of rebuilding your HTML document from scratch, deleting all your template variables and PHP code and generally making it impossible to write HTML by hand. For HTML, where you aren't MEANT to be writing stuff by hand under normal circumstances, that wasn't always a problem, but perl _is_ meant to be written by hand.

> You could even grab the comments and do something sick with them (see Damian :-). Or better yet, do something that PPI doesn't, and add some sub call around all statements, or determine the meaning of brackets in a particular context.
>
> The question of whether to execute BEGIN blocks is a tricky one. Sometimes they change the parse of the program. Sometimes they do other stuff. All you can hope for is that people understand the difference between BEGIN (change parsing) and INIT (do before the program starts).

Frankly that is a gaping security hole...
not only do I have to still deal with the problem of loading every single dependency or having no parsing ability otherwise, but I am required to trust every perl programmer on the planet :(

> I love PPI, by the way :-)

Thank you, I do too :) But I'd like to still have something like it in perl6 :(

Adam
Re: Will _anything_ be able to truly parse and understand perl?
Michele Dondi wrote: On Thu, 25 Nov 2004, Adam Kennedy wrote: I thought it was about time I brought some concerns I've been having lately to the list. Not so much on any particular problem with perl6, but on problems with perl5 we would seem to have the opportunity to fix but aren't. (So far as I can tell). So why not discussing this somewhere else? (e.g. clpmisc) One of the biggest problems I have had with perl5 is that nothing, not even perl itself, can truly actually parse Perl source. By this, I mean parse False: [Nothing but] perl can parse Perl. (Tom Christiansen) Please see Acme::BadExample. perl itself cannot parse this at all, and yet it follows the absolutely most basic syntax for the language. And just after the snip you will see I qualify parse in this context as loading the perl in some form of DOM-type tree. Adam
Re: Will _anything_ be able to truly parse and understand perl?
Smylers wrote: Adam Kennedy writes: perl itself would also appear unable to understand perl source, instead doing what I would call RIBRIB parsing, Read a bit, run a bit. RIBRIB? RABRAB, surely! Smylers Yes, you are right, typo.
Re: Will _anything_ be able to truly parse and understand perl?
On Thu, 25 Nov 2004 22:00:03 +1100, Adam Kennedy [EMAIL PROTECTED] wrote: And just after the snip you will see I qualify parse in this context as loading the perl in some form of DOM-type tree. And yet you disqualify the Perl6 rule system, with its tree of match objects? What, exactly, is it that you want? -- Schwäche zeigen heißt verlieren; härte heißt regieren. - Glas und Tränen, Megaherz
Re: Will _anything_ be able to truly parse and understand perl?
On Thu, Nov 25, 2004 at 02:31:46PM +1100, Adam Kennedy wrote: : Let's say you want to write a yacc grammar to parse Perl 6, or : Parse::RecDescent, or whatever you're going to use. Yes, that will be : hard in Perl 6. Certainly harder than it was in Perl 5. : : In the end, I concluded there was _no_ way to write even a Perl 5 parser : using any sort of pre-rolled grammar system, as the language does not : have that sort of structure. On that level you have to think of Perl as multiple languages, not a single language. That in itself should not be a problem, though. : PPI was done the hard way. Manually stepping through line by line and : using a variety of cruft (some stolen from the perl source, some my own) : to make it just work. : : I would envisage that the same would be true of writing a PPI6, except : with a hell of a lot more operators :) The number of operators is a bit of a red herring. What you really don't like is that there aren't a fixed number of them. :-) : However, Perl 6 comes packaged with its own grammar, in Perl's own rule : format. So now the quote only perl can parse Perl may become only : Perl can parse Perl (And even only Perl can parse perl, since it's : written in itself :-). : : Perl's contextual sensitivity is part of the language. So the best you : can do is to track everything like you mentioned. It's going to be : impossible to parse Perl without having perl around to do it for you. : : But using the built-in grammar, you can read in a program, macros and : all, and get an annotated source tree back, that you could rebuild the : source out of. : : Again, this is of very little use, effectively destroying the source : code and replacing it with different source that is a serialised version : of the tree. 
And there you put your finger onto the real problem, which is not that Perl is a mutating language or that it has a lot of operators, but that in the process of getting from here to there, it *forgets* how it got there, so there's no way of getting back to here.

: This is what I am talking about when I refer to the Frontpage effect,
: the habit Microsoft's HTML editor (especially the early versions) had of
: rebuilding your HTML document from scratch, deleting all your template
: variables and PHP code and generally making it impossible to write HTML
: by hand. For HTML where you aren't MEANT to be writing stuff by hand
: under normal circumstances that wasn't always a problem, but perl _is_
: meant to be written by hand.

But under another view, explosions of opcodes are just part of the compilation process. Again, the real problem is the forgetting of both the original structure and what it means in the context of the language that was being parsed at the time.

There is no doubt that source filters are much too crude, and forget way too much. That's why we're trying to kill them dead in Perl 6.

I think the real question is how far we can push Perl 6's macro system without forgetting anything you want to know about the structure of the program. Obviously AST macros will have an easier time of it than textual macros. An AST macro can just automatically attach the original parse and context as properties on the top of the new AST. To keep this info around for textual macros will require a bit more trickery, but we have to do it anyway for activities like debugging. So if we can see that in the larger context of preserving the entire compilation audit trail, all the better.

: You could even grab the comments and do something sick
: with them (see Damian :-). Or better yet, do something that PPI
: doesn't, and add some sub call around all statements, or determine the
: meaning of brackets in a particular context.
: : The question of whether to execute BEGIN blocks is a tricky one. : Sometimes they change the parse of the program. Sometimes they do other : stuff. All you can hope for is that people understand the difference : between BEGIN (change parsing) and INIT (do before the program starts). : : Frankly that is a gaping security hole... not only do I have to still : deal with the problem of loading every single dependency or having no : parsing ability otherwise, but I am required to trust every perl : programmer on the planet :( Another red herring--we've always had fairly strict accountability on the language warping dependencies at the use level. We're improving that in Perl 6 by requiring a decision on version at use time, and making that version a part of the metadata. But it's no accident that one of the places that Perl 5's B::Deparse has troubles is right at the BEGIN boundaries. Wherever Deparse has troubles, you can read that to mean I didn't understand that I should put something into Perl 5 to remember something important. The final metadata for the compiled program has to be able to tell you which chunks of program were compiled under which language. That's just as important as being able to track back to the
Re: Will _anything_ be able to truly parse and understand perl?
Herbert Snorrason wrote: On Thu, 25 Nov 2004 22:00:03 +1100, Adam Kennedy [EMAIL PROTECTED] wrote: And just after the snip you will see I qualify parse in this context as loading the perl in some form of DOM-type tree. And yet you disqualify the Perl6 rule system, with its tree of match objects? What, exactly, is it that you want? What I'm after are 3 critical features. 1. You always get back out what you put in. $source eq serialize(parse($source)). 2. No side effects. Autrijus Tang suggests this may be workable 3. You can parse a document with broken dependencies. There are a myriad of these situations, such as - Dependencies you don't have - Editing on different platform to execution platform (think Win32:: or S390/mainframe/GridComputing) - Unfinished code - Things you can't get installed (ImageMagick etc) - Example code that will never be executed (Imagine if you will a mod_perl syntax highlighting module for search.cpan.org. Should the search.cpan.org host have to _install_ every single one of the modules in CPAN?) PPI can do all of these 3 things. Not 100% reliably, but for normal code (where normal is actually defined fairly broadly). In any case, I would like to suspend this debate for a week, as I'll be talking with Damian (hopefully) at YAPC.AU. I'll report back afterwards, having hopefully imparted the full extent of my problem. Perl 6 rules or some variation therein may indeed be what I'm after, although I need to find out more about the internals. Do we have a working version yet I can create some demonstrations with? Adam
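The first requirement, $source eq serialize(parse($source)), can be illustrated with a deliberately tiny lossless tokeniser. This is only a sketch of the property, assuming nothing about how PPI or the Perl 6 grammar actually work: `tokenize` and `serialize` are hypothetical helpers, and the token classes are far too crude for real Perl. The point is that whitespace and comments are kept as tokens instead of being thrown away, so joining the tokens gives back the input byte for byte, and nothing from the source is ever executed.

```perl
use strict;
use warnings;

# Split source text into tokens without discarding anything.
# Every character lands in exactly one token.
sub tokenize {
    my ($source) = @_;
    my @tokens;
    while ( $source =~ m{\G( \s+ | \#[^\n]* | \w+ | . )}gcxs ) {
        push @tokens, $1;
    }
    return @tokens;
}

# Rebuild the document from its tokens.
sub serialize {
    return join '', @_;
}

my $source = "my \$x = 1;   # keep this comment\nprint \$x / 2;\n";
my @tokens = tokenize($source);

print "round trip ok\n" if serialize(@tokens) eq $source;
```

A tool built this way can still guess wrong about what a given / means, but a wrong guess only mislabels a token; it does not lose or rewrite any of the original text.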
Re: Will _anything_ be able to truly parse and understand perl?
Adam Kennedy wrote:

> What I'm after are 3 critical features.
>
> 1. You always get back out what you put in. $source eq serialize(parse($source)).

As Larry pointed out, this will depend on how much metadata your parser augments your parse-tree with. I think it will be doable (probably by subclassing the standard Perl parsing grammar).

> 2. No side effects. Autrijus Tang suggests this may be workable

No side effects of what? Of parsing? I don't think that's possible. Perl is defined such that compile-time side-effects can alter the syntax (and hence the parsing) of a program.

> 3. You can parse a document with broken dependencies.

This is clearly not possible in Perl (5 or 6) in the general case. Perl isn't that kind of language.

> PPI can do all of these 3 things. Not 100% reliably, but for normal code (where normal is actually defined fairly broadly).

And Perl 6's "grammar Perl" will be able to give you the same.

> In any case, I would like to suspend this debate for a week, as I'll be talking with Damian (hopefully) at YAPC.AU. I'll report back afterwards, having hopefully imparted the full extent of my problem.

I believe I understand your problem pretty well already. But I'll be more than happy to discuss this whole issue with you.

> Perl 6 rules or some variation therein may indeed be what I'm after, although I need to find out more about the internals.

So do we. That's why we're building it at the moment. ;-) See the mailto:[EMAIL PROTECTED] mailing list.

> Do we have a working version yet I can create some demonstrations with?

No. We're working on that. There's a partial prototype that runs under 5.8.3 (but is broken under earlier and later releases) on CPAN as Perl6::Rules. You should also (re)read Apocalypse 6 and especially Synopsis 6 on http://dev.perl.org

Damian
Re: Will _anything_ be able to truly parse and understand perl?
Adam Kennedy writes:
> Herbert Snorrason wrote:
> > On Thu, 25 Nov 2004 22:00:03 +1100, Adam Kennedy [EMAIL PROTECTED] wrote:
> > > And just after the snip you will see I qualify parse in this context as loading the perl in some form of DOM-type tree.
> > And yet you disqualify the Perl6 rule system, with its tree of match objects? What, exactly, is it that you want?
>
> What I'm after are 3 critical features.
>
> 1. You always get back out what you put in. $source eq serialize(parse($source)).
> 2. No side effects. Autrijus Tang suggests this may be workable
> 3. You can parse a document with broken dependencies. There are a myriad of these situations, such as

I'm afraid that's just not possible. And by that I don't mean "very hard to implement". I mean impossible, in a halting problem sort of way. Take this module:

    module OrBar;
    macro *infix:{''} (AST $left, AST $right) {
        return {
            my ($le, $re) = ($left.run, $right.run);
            $le * $re - ($le + $re);
        }
    }

And this program:

    use OrBar;
    print 3 4;

There's no way you can not have that dependency and still parse the program. What if the macro declaration declared it to be lower precedence than listops?

And as far as not executing BEGIN blocks, well, you can't do that either. A BEGIN block might execute and do the same thing as that declaration (all declarations are just shorthands for the appropriate BEGIN blocks). And then you can't parse without executing it. You might be able to use whatever Parrot decides the alternative to Safe.pm is in order to make sure nobody writes:

    BEGIN { system 'cat /dev/urandom /dev/mem' }

But, AFAICT, there's nothing else you can do. And the language isn't going to change to solve this problem. Munging the grammar and the operators at compile-time is simply too cool. The single feature I like most about perl (it's very hard to decide) is its ability to execute code at compile time... on top of everything you can do at compile time, of course.

Or you could tell the parser not to execute BEGIN blocks, and hope that works.
(Imagine if you will a mod_perl syntax highlighting module for search.cpan.org. Should the search.cpan.org host have to _install_ every single one of the modules in CPAN?) No, it'll just have to guess, as all syntax highlighters do. Chances are, most modules aren't going to change things drastically enough to make your syntax highlighting all wrong. But you don't really need to parse to syntax highlight, either. You just need to tokenize. The way the parser will be designed includes an explicit tokenizer, and don't worry, we'll let you hook into it. In fact, it seems PPI is more of a tokenizer than a parser, and that's much more easily done. It's also quite easy to recover if it messes up. Oh, I wrote this after missing your I'd like to suspend this debate for a week. So DELETED! Luke
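On the Safe.pm remark earlier in this message: Perl 5's core Safe module gives a rough feel for the kind of restriction meant. The sketch below uses only `Safe->new` and `reval`; whatever Perl 6 or Parrot ends up providing may look quite different, and a Safe compartment should not be read as a complete security boundary. Code compiled inside the compartment may only use permitted opcodes, and `system` is not among the defaults.

```perl
use strict;
use warnings;
use Safe;

my $compartment = Safe->new;

# Harmless code compiles and runs inside the compartment.
my $sum = $compartment->reval('1 + 2');
print "harmless code returned: $sum\n";

# Code that uses a forbidden opcode is rejected when it is compiled,
# so the command is never run. reval returns undef and sets $@.
my $result = $compartment->reval(q{ system 'echo should-not-run' });
print "blocked: $@" if !defined $result && $@;
```

This limits what compile-time code can do; it does not remove the need to run that code in order to know how the rest of the file parses.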
Will _anything_ be able to truly parse and understand perl?
Hi folks

I thought it was about time I brought some concerns I've been having lately to the list. Not so much on any particular problem with perl6, but on problems with perl5 we would seem to have the opportunity to fix but aren't. (So far as I can tell).

One of the biggest problems I have had with perl5 is that nothing, not even perl itself, can truly actually parse Perl source. By this, I mean parse in the sense of reading a chunk of bytes of Perl source and understanding what they mean. Call it "document parsing" if you wish.

perl itself would also appear unable to understand perl source, instead doing what I would call RIBRIB parsing, "Read a bit, run a bit". The parsing of perl source is itself merely part of the first execution phase (BEGIN).

If we leave source filters out of it for now, the main problems in regard to document parsing in the current Perl are caused by the interaction of prototypes and operator/operand context.

As the most common example, in order to know what the slash / character is (division or regex), you have to know whether you are in operator or operand context, which requires knowing which things are parameters for subroutines and which aren't, and you can't keep track of that without tracking all of the prototypes for every function both CORE and in the symbol table as you go, and you can't do THAT without loading every single module dependency and running a parse/BEGIN-phase-execution on all of the files, and you can't do THAT without having a perl interpreter to execute it all in.

Any attempt to use the perl interpreter to parse code to understand it in any way is both unpredictable and dangerous, due to the common situation of not having a platform that can fully run the code (and all its dependencies) and all the potentially dangerous side-effects.

If you can't load and BEGIN-phase-execute every single one of the dependencies, you can't parse. At all... ever!

    use Win32::Something;
    1;

Unparsable on Unix...
    use Win32::Something;
    use Proc::ProcessTable;
    1;

Unparsable on anything... even if I just want 1 syntax highlighted as a number.

    BEGIN { system 'rm -rf /'; }
    1;

eeep!

If anyone checks a broken version of a module into CVS that is part of some large project you are working on, sorry, you can't parse anything any more, even to try to hunt down the problem.

For a more comprehensive example, take a look at Acme::BadExample, which uses absolutely plain and simple syntax, yet is completely unparsable. (The reward is as yet unclaimed.) Go have a look now, I'll wait...

All of this creates HUGE headaches if we want to start adding some intelligence to code analysis and manipulation. We've all seen the lengths Komodo has had to go to by continuously running the code. What few attempts there have been to modify code are fairly impotent. While you _can_ get source back from B::, it isn't particularly useful except as a way to serialise anonymous code for storage or transport. Given $source, B:: throws away all syntax and commenting and POD and __DATA__ and whatever is after __END__, and then dumps back out something quite different from what went in. A sort of Frontpage effect. It's what the program thinks is close enough to be the same, but definitely not what you wanted.

Now, despite _all_ of these problems, despite the continuous insistence from the entire Perl community that writing a perl parser is impossible (only perl can parse Perl), and despite the fact I had seen several other people try and fail, or give up, or not really get started, I decided to redefine the problem slightly and have managed to get a working Perl parser up and running. Or at least, working well enough to handle selfgol and 90% of CPAN (I haven't really started working on corner cases yet, after which I expect to reach about 99%), and to do so needing ONLY what is contained in the .pm or .pl file and nothing else. The parser can read all of Acme::BadExample safely and write it back out again unchanged.
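As a small illustration of the B:: complaint earlier in this message, here is a minimal sketch. It assumes the module meant is B::Deparse (the message only says "B::"), and the sub body, comment text and variable names are invented for the example.

```perl
use strict;
use warnings;
use B::Deparse;

# A hypothetical sub whose comments carry information a human would want kept.
my $original = sub {
    # Why: totals must include the handling fee.
    my $total = 40 + 2;    # the arithmetic may also come back rewritten
    return $total;
};

# Ask B::Deparse to regenerate source text from the compiled code.
my $deparser  = B::Deparse->new;
my $roundtrip = $deparser->coderef2text($original);

print $roundtrip, "\n";
print "comment survived: ",
    ( $roundtrip =~ /handling fee/ ? "yes" : "no" ), "\n";
```

The regenerated text still declares and returns $total, but the comments never reached the compiled form, so they cannot come back. That is the gap between "close enough to be the same" and round-trip safety.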
In any case, it works, and works well enough to start building a number of cool toys on, such as normalisation and comparison of code, code metrics (Leon had a play with this), syntax highlighting, style analysis, the CPAN Cross Reference, and various other stuff that is staying on the whiteboard until API-freeze is finished (back/forwardporting, refactoring, auto-documentation, smart perl diffs, safe testing of code, better dependency and version extraction, checkstyle, a refactoring perl editor a la IntelliJ IDEA, etc etc etc).

And most importantly, because it treats the Perl source as a document (data structure) and not as code (procedural execution), it can serialise back to source code that is identical to what it read in. That is, it is totally round-trip safe (100% in testing of a CPAN subset of 5,500 perl files). So $source -> $DocumentObject -> $source is safe.

Now of course, it is completely unable to deal with source filters. There is some
Re: Will _anything_ be able to truly parse and understand perl?
Adam Kennedy writes:
> Getting (finally) to perl6, I could have sworn I saw an RFC early on which said "Make perl6 easier to parse". But it would appear the opposite is occurring. Source filters have become grammars and will now be officially approved and acceptable (yes?), while so far as I can tell the problem of prototype vs operator/operand interaction is not being addressed. (I'm a little in the dark here, perhaps it is and nobody has noticed enough.)

Let's say you want to write a yacc grammar to parse Perl 6, or use Parse::RecDescent, or whatever you're going to use. Yes, that will be hard in Perl 6. Certainly harder than it was in Perl 5.

However, Perl 6 comes packaged with its own grammar, in Perl's own rule format. So now the quote "only perl can parse Perl" may become "only Perl can parse Perl" (and even "only Perl can parse perl", since it's written in itself :-).

Perl's contextual sensitivity is part of the language. So the best you can do is to track everything like you mentioned. It's going to be impossible to parse Perl without having perl around to do it for you.

But using the built-in grammar, you can read in a program, macros and all, and get an annotated source tree back, from which you could rebuild the source. You could even grab the comments and do something sick with them (see Damian :-). Or better yet, do something that PPI doesn't, and add some sub call around all statements, or determine the meaning of brackets in a particular context.

The question of whether to execute BEGIN blocks is a tricky one. Sometimes they change the parse of the program. Sometimes they do other stuff. All you can hope for is that people understand the difference between BEGIN (change parsing) and INIT (do before the program starts).

I love PPI, by the way :-)

Luke
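To make "sometimes they change the parse of the program" concrete, here is a minimal Perl 5 sketch. Everything in it is invented for illustration (the names maybe, $PROTOTYPED, compile_and_run and the two package names). It compiles one fixed piece of source text twice with string eval; the only difference is what the BEGIN block sees when it runs.

```perl
use strict;
use warnings;

our $PROTOTYPED;    # read by the BEGIN block inside $source
our @log;           # records statements that actually ran

# Identical source text for both runs. The BEGIN block installs maybe()
# either with an empty prototype or with no prototype at all.
my $source = <<'END_SOURCE';
BEGIN { *maybe = $main::PROTOTYPED ? sub () { 100 } : sub { scalar @_ } }
my $r = maybe / 25 ; # / ; push @main::log, 'slash began a regex';
$r;
END_SOURCE

sub compile_and_run {
    my ($prototyped) = @_;
    $PROTOTYPED = $prototyped;
    @log        = ();
    local $_ = '';    # a defined string for any regex to match against
    my $package = $prototyped ? 'WithProto' : 'WithoutProto';
    my $result  = eval "package $package;\n$source";
    die $@ if $@;
    return ( $result, scalar @log );
}

# Empty prototype: maybe is a term, so "/" is division, "#" starts a comment.
my ( $quotient, $regex_runs ) = compile_and_run(1);

# No prototype: maybe is a list operator, so "/ 25 ; # /" is a regex
# passed as its argument, and the push after it is live code.
my ( $arg_count, $regex_runs2 ) = compile_and_run(0);

print "prototyped:   result=$quotient, statements after regex=$regex_runs\n";
print "unprototyped: result=$arg_count, statements after regex=$regex_runs2\n";
```

A tokeniser that only looks at the text of $source cannot tell these two readings apart, because the deciding fact is a value that exists only once the BEGIN block has executed.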
Re: Will _anything_ be able to truly parse and understand perl?
Luke has answered this better than I would have. In particular, he wrote:
> Perl's contextual sensitivity is part of the language. So the best you can do is to track everything like you mentioned. It's going to be impossible to parse Perl without having perl around to do it for you.

That first sentence is the critical point to remember. Without C<use> and C<BEGIN> and the new macro facilities, Perl just wouldn't be Perl.

I would just add that we have indeed "Ma[d]e Perl 6 easier to parse". In concrete terms, to parse Perl 6 (in Perl 6) you'll simply write the following one line:

    $parse_tree = ( $source_text ~~ m:keepall/ <Perl.prog> / );

which will completely parse your source, because *during* its parse it will run any C<BEGIN> blocks or macros and C<use> any modules, thereby *lexically* adapting its own grammar as it goes. What you'll get back is a tree of submatches (including whitespace and comments, if you want them), which you can then reprocess as you wish.

Damian

PS: Yes, I'll be at OSDC (www.osdc.com.au) next week (giving the opening and closing keynotes, in fact). And, yes, you're most welcome to have a chat with me about Perl 6 sometime during the conference.
Re: Will _anything_ be able to truly parse and understand perl?
Adam Kennedy writes:
> perl itself would also appear unable to understand perl source, instead doing what I would call RIBRIB parsing, Read a bit, run a bit.

RIBRIB? RABRAB, surely!

Smylers