Lexing requires execution (was Re: Will _anything_ be able to truly parse and understand perl?)

2004-11-26 Thread Randal L. Schwartz
 Luke == Luke Palmer [EMAIL PROTECTED] writes:

Luke But you don't really need to parse to syntax highlight, either.  You
Luke just need to tokenize.

Unfortunately, to tokenize, you also have to know the state of the parse.
As long as / is both divide and begin regex, you're toasted.

Please see my long post on parsing Perl at perlmonks, at
http://www.perlmonks.org/index.pl?node_id=44722 for examples of
*why* you need to notice whether you have a divide or a regex match.

Perl is fundamentally resistant to lexing.  As in the beginning of
this thread, one of the RFCs suggested the possibility of making Perl
lexable, but apparently the designers said no, we think the / duality
is worth keeping.  And that seals the fate for Perl6 just like all
Perl before it.

To properly lex a Perl program (Perl6 included), you *must* execute
BEGIN blocks.  That's the end of that tune.  Anything else is just an
approximation.

-- 
Randal L. Schwartz - Stonehenge Consulting Services, Inc. - +1 503 777 0095
[EMAIL PROTECTED] URL:http://www.stonehenge.com/merlyn/
Perl/Unix/security consulting, Technical writing, Comedy, etc. etc.
See PerlTraining.Stonehenge.com for onsite and open-enrollment Perl training!


Re: Lexing requires execution (was Re: Will _anything_ be able to truly parse and understand perl?)

2004-11-26 Thread Matthew Walton
Randal L. Schwartz wrote:
> Luke == Luke Palmer [EMAIL PROTECTED] writes:
>
> Luke But you don't really need to parse to syntax highlight, either.  You
> Luke just need to tokenize.
>
> Unfortunately, to tokenize, you also have to know the state of the parse.
> As long as / is both divide and begin regex, you're toasted.

So you're saying that in Perl 6 it will be entirely impossible to 
determine if / appears as the division operator or as the beginning of a 
regex from a purely syntactic examination of the source code?

I'm finding that very, very hard to believe. Regexps aren't valid where 
/-the-operator is, after all.

Please correct me if I'm wrong, but I've got the impression that Perl 6 
is tokenisable without requiring BEGIN blocks to be run - provided no 
grammars which the tokeniser doesn't already know about are used, of 
course, that one will never be avoidable.



Re: Lexing requires execution (was Re: Will _anything_ be able to truly parse and understand perl?)

2004-11-26 Thread Randal L. Schwartz
 Matthew == Matthew Walton [EMAIL PROTECTED] writes:

Matthew So you're saying that in Perl 6 it will be entirely impossible to
Matthew determine if / appears as the division operator or as the beginning of
Matthew a regex from a purely syntactic examination of the source code?

Yes.

Matthew I'm finding that very, very hard to believe. Regexps aren't valid
Matthew where /-the-operator is, after all.

And that's precisely why Perl can work as it does.  If an operator is
expected, / is divide.  If a term is expected, / is the beginning of a
regex.  This has been true since Perl1 (maybe 0).  There are a few
other characters that also work similarly, but / is the most frequent
and most troublesome.  And it got worse for Perl5, because of
user-defined prototypes, which as far as I can tell, are still present
in Perl6.
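
A minimal Perl 5 sketch of the operator-versus-term rule, using the builtins
time (no argument) and sin (one argument), in the spirit of the sin/time
examples attributed to the perlmonks article elsewhere in this thread. The
variable names and the contents of $_ are assumptions made up for the
illustration:

```perl
use strict;
use warnings;

# $_ is chosen so that the pattern on the sin line matches; it is made up.
$_ = "x 25 ; # y";

my @ran;   # records which trailing statements actually executed

# time takes no argument, so an operator is expected next: the slash
# divides, and everything after the hash sign is an ordinary comment.
my $t = time / 25 ; # / ; push @ran, 'time';

# sin wants an argument, so a term is expected next: the text between the
# two slashes is a pattern matched against $_, and the push is live code.
my $s = sin  / 25 ; # / ; push @ran, 'sin';

print "ran: @ran\n";
```

Only the push on the sin line runs. The two lines differ in nothing but the
name of the function, which is why a tokenizer has to know what each name
expects after it.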

Matthew Please correct me if I'm wrong, but I've got the impression that Perl
Matthew 6 is tokenisable without requiring BEGIN blocks to be run - provided
Matthew no grammars which the tokeniser doesn't already know about are used,
Matthew of course, that one will never be avoidable.

Your impression is wrong.  In the presence of user-defined prototypes,
you *must* execute the code that might alter a prototype in order to
determine whether / is a divide (and therefore standalone token) or
the beginning of a regex (and therefore must locate the end of the
regex to properly be a token).

Please see the referenced perlmonks article.

All the handwaving in the world won't fix this.  As long as we have
dual-natured characters like /, and user-defined prototypes, Perl
cannot be lexed without also parsing, and therefore without also
running BEGIN blocks.
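
A hedged Perl 5 sketch of why the BEGIN block has to run. The sub name
maybe and the environment variable DEMO_NULLARY are hypothetical; any
compile-time computation could stand in for the test on %ENV:

```perl
use strict;
use warnings;

BEGIN {
    # Hypothetical switch: the prototype is only known once this has run.
    if ($ENV{DEMO_NULLARY}) {
        *maybe = sub () { 10 };           # nullary: an operator must follow
    }
    else {
        *maybe = sub (@) { scalar @_ };   # list operator: a term must follow
    }
}

$_ = "x 2 ; # y";    # made-up text for the pattern to match

my $how = 'divide';
my $v = maybe / 2 ; # / ; $how = 'regex';

print "$how\n";
```

With DEMO_NULLARY unset the slash opens a pattern and the assignment after
it is live code. With it set, the same source line is a division followed by
a comment. Both readings are valid programs, so the text alone does not
decide which one is meant.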



Re: Lexing requires execution (was Re: Will _anything_ be able to truly parse and understand perl?)

2004-11-26 Thread Matthew Walton
Randal L. Schwartz wrote:
> Matthew == Matthew Walton [EMAIL PROTECTED] writes:
>
> Matthew So you're saying that in Perl 6 it will be entirely impossible to
> Matthew determine if / appears as the division operator or as the beginning of
> Matthew a regex from a purely syntactic examination of the source code?
>
> Yes.
>
> Matthew I'm finding that very, very hard to believe. Regexps aren't valid
> Matthew where /-the-operator is, after all.
>
> And that's precisely why Perl can work as it does.  If an operator is
> expected, / is divide.  If a term is expected, / is the beginning of a
> regex.  This has been true since Perl1 (maybe 0).  There are a few
> other characters that also work similarly, but / is the most frequent
> and most troublesome.  And it got worse for Perl5, because of
> user-defined prototypes, which as far as I can tell, are still present
> in Perl6.

Perl 6 has formal parameters for subs, methods etc. I don't see any 
mention of Perl 5-style prototypes in S6, and I honestly can't see how 
they could possibly fit with formal parameters. Hopefully Larry or 
someone can clarify whether they still exist or not.

If they don't still exist, this eases the problem somewhat, but as I 
understand it, not entirely. Being able to call subs and methods without 
parentheses around the argument lists causes problems; a quick scan of 
the updated Synopses failed to reveal the rules for that in Perl 6.

> Your impression is wrong.  In the presence of user-defined prototypes,
> you *must* execute the code that might alter a prototype in order to
> determine whether / is a divide (and therefore standalone token) or
> the beginning of a regex (and therefore must locate the end of the
> regex to properly be a token).

Since Perl 5 style prototypes don't appear to exist anymore, this may be 
easier. I don't believe that the addition of the // operator compounds 
the problem, because by that point it should already be possible to 
determine that you are looking at an operator.

The Perlmonks article throws up a lot of very nasty cases. Not knowing 
the entire current language definition by heart, I can't say this with 
absolute certainty, but I retain the belief that Perl 6 is at least 
*easier* to deal with than Perl 5.

It is also possible that telling the difference between /-as-divide and 
/-as-regex becomes much easier if lookahead is employed in the 
tokeniser. Unfortunately, that makes the tokeniser much more 
complicated, and it's just a vague and random idea.




Re: Lexing requires execution (was Re: Will _anything_ be able to truly parse and understand perl?)

2004-11-26 Thread Randal L. Schwartz
 Matthew == Matthew Walton [EMAIL PROTECTED] writes:

Matthew Perl 6 has formal parameters for subs, methods etc. I don't see any
Matthew mention of Perl 5-style prototypes in S6, and I honestly can't see how
Matthew they could possibly fit with formal parameters. Hopefully Larry or
Matthew someone can clarify whether they still exist or not.

As long as you can have a user-defined null-prototyped subroutine (one
that doesn't need parens following), you have the problem.  See the
sin/time examples in the monk article, and then consider user-defined
functions that have no args (like time) and those that do (like sin).

Matthew The Perlmonks article throws up a lot of very nasty cases. Not knowing
Matthew the entire current language definition by heart, I can't say this with
Matthew absolute certainty, but I retain the belief that Perl 6 is at least
Matthew *easier* to deal with than Perl 5.

I believe you have a false belief.  I don't know anything in the new
prototypes-which-became-full-formal-arguments that made it any
*easier* to recognize the ending of a subroutine argument list without
knowing its precise definition.

In Perl6:

sub no_args () { ... }
sub list_args ([EMAIL PROTECTED]) { ... }

no_args / # this is a divide
list_args / # this is the start of a regex

See, it's still there. :)
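
For comparison, a Perl 5 analogue of the two declarations above that can
actually be run. It uses Perl 5 prototypes rather than Perl 6 formal
parameters, and the (@) prototype, the contents of $_ and the bookkeeping
variables are assumptions added for the sketch:

```perl
use strict;
use warnings;

sub no_args ()    { 10 }           # empty prototype: takes no arguments
sub list_args (@) { scalar @_ }    # slurps a list: a term may follow

$_ = "x 2 ; # y";    # made-up text for the pattern to match

my @ran;   # records which trailing statements actually executed
my $q = no_args   / 2 ; # / ; push @ran, 'no_args';
my $n = list_args / 2 ; # / ; push @ran, 'list_args';

print "q=$q n=$n ran=@ran\n";
```

On the no_args line the slash divides, so $q is 5 and the rest of the line
is a comment. On the list_args line the slash opens a pattern, the match
result is passed as the single argument, and the push after it executes.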

Matthew It is also possible that telling the difference between /-as-divide
Matthew and /-as-regex becomes much easier if lookahead is employed in the
Matthew tokeniser.

No, not possible at all.  The entire rest of the program may be valid
either way.  You *must* know by the time you're done with /, or
/-and-more.  The rest of the code cannot be a hint.  Again, see my
article.



Re: Lexing requires execution (was Re: Will _anything_ be able to truly parse and understand perl?)

2004-11-26 Thread Matthew Walton
Randal L. Schwartz wrote:
> Matthew == Matthew Walton [EMAIL PROTECTED] writes:
>
> Matthew Perl 6 has formal parameters for subs, methods etc. I don't see any
> Matthew mention of Perl 5-style prototypes in S6, and I honestly can't see how
> Matthew they could possibly fit with formal parameters. Hopefully Larry or
> Matthew someone can clarify whether they still exist or not.
>
> As long as you can have a user-defined null-prototyped subroutine (one
> that doesn't need parens following), you have the problem.  See the
> sin/time examples in the monk article, and then consider user-defined
> functions that have no args (like time) and those that do (like sin).
>
> Matthew The Perlmonks article throws up a lot of very nasty cases. Not knowing
> Matthew the entire current language definition by heart, I can't say this with
> Matthew absolute certainty, but I retain the belief that Perl 6 is at least
> Matthew *easier* to deal with than Perl 5.
>
> I believe you have a false belief.  I don't know anything in the new
> prototypes-which-became-full-formal-arguments that made it any
> *easier* to recognize the ending of a subroutine argument list without
> knowing its precise definition.
>
> In Perl6:
>
> sub no_args () { ... }
> sub list_args ([EMAIL PROTECTED]) { ... }
>
> no_args / # this is a divide
> list_args / # this is the start of a regex
>
> See, it's still there. :)

I believe I did mention that being able to call functions without parens 
is a problem.

> Matthew It is also possible that telling the difference between /-as-divide
> Matthew and /-as-regex becomes much easier if lookahead is employed in the
> Matthew tokeniser.
>
> No, not possible at all.  The entire rest of the program may be valid
> either way.  You *must* know by the time you're done with /, or
> /-and-more.  The rest of the code cannot be a hint.  Again, see my
> article.

I read the article. I believe I mentioned that as well.
But I will have to concede that it is impossible to correctly determine 
the structure of an arbitrary Perl 6 program without having to hand the 
definitions of all functions used and also any grammars and macros used. 
Sometimes you will be able to do it, sometimes you won't, but you can't 
operate on the assumption that you can.

It's quite a disappointment in some ways, but we've lived with it in 
Perl 5, and I'm sure we can live with it in Perl 6.

And I still think Perl 6 will have fewer cases in which it's completely 
impossible for not-Perl to parse it. Unfortunately, fewer still implies 
some, and some is still a problem.



Re: $ @ and %

2004-11-26 Thread Larry Wall
On Fri, Nov 26, 2004 at 10:29:52AM +0300, Alexey Trofimenko wrote:
: I'm talking about unifying namespaces of arrays, hashes and scalars. I  
: could swear I've seen some RFC about it..

Yes, that's RFC 9, which was discussed and rejected long ago in A2.
I just find that I prefer to think of the sigils as part of the name.
That's doubly true now that we have various secondary sigils.

Larry


Re: Angle quotes and pointy brackets

2004-11-26 Thread Larry Wall
On Fri, Nov 26, 2004 at 07:32:58AM +0300, Alexey Trofimenko wrote:
: ah, I forget, how could I do qx'echo $VAR' in Perl6? something like  
: qx:noparse 'echo $VAR' ?

Hmm, well, with the currently defined adverbs you'd have to say

qx:s(0)'echo $VAR'

but that doesn't give you protection from other kinds of interpolation.
I think we need two more adverbs that add the special features of qx and qw,
so that you could write that:

q:x/echo $VAR/

where ordinary qx/$cmd/ is short for

qq:x/$cmd/

Likewise a qw/a b/ is short for

q:w/a b/
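
For reference, the Perl 5 behaviours that these proposed adverbs are meant
to express can be sketched as follows. This is Perl 5, not the Perl 6 syntax
under discussion, and it assumes a POSIX shell with an echo command; the
variable names and values are made up:

```perl
use strict;
use warnings;

my $VAR = 'from perl';             # Perl-side variable
local $ENV{VAR} = 'from shell';    # shell-side variable of the same name

my $perl_side  = qx/echo $VAR/;    # Perl interpolates $VAR before the shell runs
my $shell_side = qx'echo $VAR';    # single quotes: $VAR reaches the shell untouched
my @words      = qw/a b/;          # no interpolation, split on whitespace

print $perl_side, $shell_side, scalar(@words), "\n";
```

The first command prints the Perl value and the second prints the shell
value, which is the distinction the question about qx'echo $VAR' turns on.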

: (Note: I like those adverbs.. I could imagine that in Perl6 if you want  
: to have something done in some_other_way, you should just insert the  
: :some_other_way adverb, and that's it! perl will DWIM happily :)

Well, that's perhaps a bit underspecified from the computer's point of view.

: I notice that in Perl6 those funny « and » could be much more common 
: than other paired brackets. And some people like how they look, but 
: nobody likes the fact that there's no (and won't be!) consistent way to type 
: them in different applications, whether it's hard or easy.
: 
: But to swap «» with [] or {} could be a real shock for the major part of 
: people..
: We also have another ascii pair, < and >. maybe they could be better than  
: « and » ?:) i'm not that farseeing, but isn't the problem of distinguishing <  
: as a bracket and < as a comparison operator no harder than distinguishing  
: << as bracket and as part of heredoc?..

It would get very confusing visually, even if the computer could sort it out:

@a = @b
@a = @b

But there are some things that would be completely ambiguous:

%hashfoobar
%hashfoobaz()

: or maybe even we could see consistent to go after + + and alike, and  
: make old  and  written as + and + (and then lt and gt suddenly could  
: become ~ and ~ :)

I think people would rise up and slay us if we did that.  We're already
getting sufficiently risen up and slain over Perl 6.

: But I'm certain Larry already weighed exactly that solution years ago..

Well, yes, but sometimes the weights change over time, so it doesn't
hurt (much) to reevaluate occasionally.  But in this case, I think I
still prefer to attach the exotic characters to the exotic behaviors,
and leave the angles with their customary uses.

: P.S. If you have an urgent need to throw spoiled eggs at me, consider all  
: above as very late or very early fools day joke.. or you could try, but  
: i've never heard about ballistic transcontinental eggs.

If you're a White Russian I suppose the yolk is on me.

Larry


Re: Angle quotes and pointy brackets

2004-11-26 Thread Juerd
Larry Wall skribis 2004-11-26  9:33 (-0800):
> but that doesn't give you protection from other kinds of interpolation.
> I think we need two more adverbs that add the special features of qx and qw,
> so that you could write that: q:x/echo $VAR/ where ordinary qx/$cmd/
> is short for qq:x/$cmd/ Likewise a qw/a b/ is short for q:w/a b/

With x and w as adverbs to q and qq, are qx and qw still worth keeping?
It's only one character less, qx isn't used terribly often and qw will
probably be written mostly as  anyway.

And perhaps qq:x is a bit too dangerous. Suppose someone meant to type
qq:z[$foo] (where z is a defined adverb that does something useful to
the return value, but has no side effects) and mistypes it as
qq:x[$foo]. Instant hard-to-spot security danger.


Juerd


Re: Lexing requires execution (was Re: Will _anything_ be able to truly parse and understand perl?)

2004-11-26 Thread James Mastros
Randal L. Schwartz wrote:
> All the handwaving in the world won't fix this.  As long as we have
> dual-natured characters like /, and user-defined prototypes, Perl
> cannot be lexed without also parsing, and therefore without also
> running BEGIN blocks.

And user-defined prototypes that change when the argument list of a 
function ends, that is.  If we forced the argument list for all 
functions to have parens (including empty parens for argumentless 
functions), then we'd be OK, I'm fairly certain.
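
A small Perl 5 sketch of the mandatory-parens idea. It shows only these
three call shapes and is not a proof of the general claim; the sub names
are borrowed from the earlier Perl 6 example and the contents of $_ are
made up:

```perl
use strict;
use warnings;

sub no_args ()    { 10 }
sub list_args (@) { scalar @_ }

$_ = "x 2 y";    # made-up text for the pattern to match

my $q = no_args() / 2;       # after ")" only an operator can follow: divide
my $n = list_args(/ 2 /);    # after "(" only a term can follow: pattern match
my $m = list_args() / 2;     # divide again, whatever the prototype says

print "$q $n $m\n";
```

Here the closing paren marks the end of the argument list, so in these
cases the role of the slash can be read off the punctuation without knowing
either prototype.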

For that matter, if we stick to declaration syntax for declarations, and 
not BEGIN blocks and reflection, then we're OK -- you have to do some 
execution, but of a minilanguage that can't express concepts that you 
wouldn't be OK running... though you do still have to descend through 
require/use, and thus have to have the files being required or used (or 
at least a description of their declarations).

-=- James Mastros,
theorbtwo


Re: Lexing requires execution (was Re: Will _anything_ be able to truly parse and understand perl?)

2004-11-26 Thread Juerd
James Mastros skribis 2004-11-26 14:36 (+0100):
> And user-defined prototypes that change when the argument list of a 
> function ends, that is.  If we forced the argument list for all 
> functions to have parens (including empty parens for argumentless 
> functions), then we'd be OK, I'm fairly certain.

While that is true, please realise that many people like that in Perl,
parens are optional. I am one of those people who dislike typing and
counting too many balanced symbol sets.

If only method and function syntax could be the same, and methods would
also not require parens... Ah well, that's what we have mutable grammar
for.

> For that matter, if we stick to declaration syntax for declarations, and 
> not BEGIN blocks and reflection

Macros are somewhat like BEGIN blocks and may be needed to turn invalid
syntax into something that is valid.


Juerd


Re: Angle quotes and pointy brackets

2004-11-26 Thread Larry Wall
On Fri, Nov 26, 2004 at 07:31:09PM +0100, Juerd wrote:
: Larry Wall skribis 2004-11-26  9:33 (-0800):
: > but that doesn't give you protection from other kinds of interpolation.
: > I think we need two more adverbs that add the special features of qx and qw,
: > so that you could write that: q:x/echo $VAR/ where ordinary qx/$cmd/
: > is short for qq:x/$cmd/ Likewise a qw/a b/ is short for q:w/a b/
: 
: With x and w as adverbs to q and qq, are qx and qw still worth keeping?
: It's only one character less, qx isn't used terribly often and qw will
: probably be written mostly as  anyway.

I might be happy to remove them, though people will write q:x instead
of qq:x and wonder why it doesn't interpolate.  What I think is fun is
qq:x:w, which presumably runs the command and then splits the result
into words.
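
In Perl 5 terms, the effect presumed for qq:x:w would look roughly like the
following. This is a sketch of the intent only, not of any specified Perl 6
behaviour; the command string is hypothetical and a POSIX shell with echo is
assumed:

```perl
use strict;
use warnings;

my $cmd   = 'echo alpha  beta gamma';    # hypothetical command
my @words = split ' ', qx/$cmd/;         # run it, then split output on whitespace

print scalar(@words), ": @words\n";
```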

I know everyone has their reflexes tuned to type qw currently, but
how many of you Gentle Readers would feel blighted if we turned it
into q:w instead?

: And perhaps qq:x is a bit too dangerous. Suppose someone meant to type
: qq:z[$foo] (where z is a defined adverb that does something useful to
: the return value, but has no side effects) and mistypes it as
: qq:x[$foo]. Instant hard-to-spot security danger.

Seems rather unlikely.  And presumably tainting should catch it
if it's really a security issue.

Larry