Re: "rule" declarator: Different results for 'unadorned' match vs unnamed/named captures? (in re Grammars...).

2021-03-12 Thread Ralph Mellor
On Thu, Mar 11, 2021 at 8:53 PM William Michels  wrote:
>
> I think there's something going on with the examples below, as I'm
> seeing different results when comparing a basic "rule" match vs either
> unnamed or named "rule" captures.

All your examples are correct according to my current understanding.

The key issue is that the patterns you're sub-capturing exclude the
significant space whereas the non-captured ones don't.

I'll respond only to your TL;DR in this comment. Perhaps give yourself
time to have a couple repl sessions, then time away from the computer
doing something completely different, before responding to go through
any residual confusion, or if it turns out I've gotten something wrong.



> say $/ if 'ab'  ~~ rule  {.? .?}
「a」

The first `.? ` (including the token boundary implied by the space)
matches against the `a`. It fails.

The second `.?` then matches the `a`.

The `~~` succeeds and reports the match of `a`.

> say $/ if 'ab'  ~~ rule  {(.?) (.?)}

The first `.?` (excluding the token boundary) matches against the `a`
and succeeds.

`:ratchet` is in effect so the engine will not reconsider this decision.

The space in the pattern then fails to match a token boundary.

Due to `:ratchet` being in effect, the entire rule fails.

Because you're using `~~`, the engine then moves forward one character
in the input, pretending the input begins with `b`.

The first `.?` matches against the `b` and succeeds.

The space in the pattern successfully matches a token boundary (end of
input).

Thus the rule succeeds, and so the `~~` succeeds, and reports a match of `b`.
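
One way to check this reading is to spell the token boundary out by hand. The sketch below assumes that, inside a rule, whitespace after an atom behaves like a call to `<.ws>`, so a token with an explicit `<.ws>` between the two capture groups should mirror `rule {(.?) (.?)}`. The variable name `$explicit` is only for illustration.

```raku
# Assumption: in a rule, the space after a capture group acts like <.ws>,
# so this token is meant to mirror  rule {(.?) (.?)}  from the thread.
my $explicit = token { (.?) <.ws> (.?) };

# No boundary between 'a' and 'b': the first attempt should fail under
# ratchet, and the match should restart one character later.
say 'ab'  ~~ $explicit;

# A space between the letters gives <.ws> something to match.
say 'a b' ~~ $explicit;
```

If the assumption holds, the two results should agree with the 「b」 and 「a b」 outputs reported for the capturing rule.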

--
love, raiph


"rule" declarator: Different results for 'unadorned' match vs unnamed/named captures? (in re Grammars...).

2021-03-11 Thread William Michels via perl6-users
Hello,

I've been chatting with raiph on SO regarding Grammar "tokens" vs
"rules". The OP is here https://stackoverflow.com/q/62051742 and our
discussion is here https://stackoverflow.com/a/62053666 .

I think there's something going on with the examples below, as I'm
seeing different results when comparing a basic "rule" match vs either
unnamed or named "rule" captures. The focus of the tests below is to
look at changes with/without whitespace, and comparing "tokens" vs
"rules".

TL;DR version:

> say $/ if 'ab'  ~~ rule  {.? .?}
「a」
> say $/ if 'ab'  ~~ rule  {(.?) (.?)}
「b」
 0 => 「b」
 1 => 「」
> say $/ if 'ab'  ~~ rule  {$<first>=.? $<second>=.?}
「b」
 first => 「b」
 second => 「」

So for the first example (an unadorned "rule" match), Raku returns 「a」.
But for the second and third examples ("rule" captures), Raku returns 「b」.
(Also confirmed the three lines above with Raku one-liners).

Full list of examples:

Last login: Thu Mar 11 11:50:11 on ttys042
user@mbook:~$ raku
Welcome to 𝐑𝐚𝐤𝐮𝐝𝐨™ v2020.10.
Implementing the 𝐑𝐚𝐤𝐮™ programming language v6.d.
Built on MoarVM version 2020.10.

To exit type 'exit' or '^D'
> say $/ if 'a'  ~~ token  {.?.?}
「a」
> say $/ if 'ab'  ~~ token  {.?.?}
「ab」
> say $/ if 'ab'  ~~ token  {.? .?}
「ab」
> say $/ if 'a b'  ~~ token  {.? .?}
「a 」
> say $/ if 'a'  ~~ rule  {.?.?}
「a」
> say $/ if 'ab'  ~~ rule  {.?.?}
「ab」
> say $/ if 'ab'  ~~ rule  {.? .?}
「a」
> say $/ if 'a b'  ~~ rule  {.? .?}
「a b」
> 
Nil
> say $/ if 'a'  ~~ token  {(.?)(.?)}
「a」
 0 => 「a」
 1 => 「」
> say $/ if 'ab'  ~~ token  {(.?)(.?)}
「ab」
 0 => 「a」
 1 => 「b」
> say $/ if 'ab'  ~~ token  {(.?) (.?)}
「ab」
 0 => 「a」
 1 => 「b」
> say $/ if 'a b'  ~~ token  {(.?) (.?)}
「a 」
 0 => 「a」
 1 => 「 」
> say $/ if 'a'  ~~ rule  {(.?)(.?)}
「a」
 0 => 「a」
 1 => 「」
> say $/ if 'ab'  ~~ rule  {(.?)(.?)}
「ab」
 0 => 「a」
 1 => 「b」
> say $/ if 'ab'  ~~ rule  {(.?) (.?)}
「b」
 0 => 「b」
 1 => 「」
> say $/ if 'a b'  ~~ rule  {(.?) (.?)}
「a b」
 0 => 「a」
 1 => 「b」
> 
Nil
> say $/ if 'a'  ~~ token  {$<first>=.?$<second>=.?}
「a」
 first => 「a」
 second => 「」
> say $/ if 'ab'  ~~ token  {$<first>=.?$<second>=.?}
「ab」
 first => 「a」
 second => 「b」
> say $/ if 'ab'  ~~ token  {$<first>=.? $<second>=.?}
「ab」
 first => 「a」
 second => 「b」
> say $/ if 'a b'  ~~ token  {$<first>=.? $<second>=.?}
「a 」
 first => 「a」
 second => 「 」
> say $/ if 'a'  ~~ rule  {$<first>=.?$<second>=.?}
「a」
 first => 「a」
 second => 「」
> say $/ if 'ab'  ~~ rule  {$<first>=.?$<second>=.?}
「ab」
 first => 「a」
 second => 「b」
> say $/ if 'ab'  ~~ rule  {$<first>=.? $<second>=.?}
「b」
 first => 「b」
 second => 「」
> say $/ if 'a b'  ~~ rule  {$<first>=.? $<second>=.?}
「a b」
 first => 「a」
 second => 「b」
> 
Nil

Any advice appreciated, Thx, Bill.


Re: grammars and indentation of input

2016-09-13 Thread Theo van den Heuvel
As so often it turned out that the reason my program did not work was 
elsewhere (in the grammar).

My approach worked all along.
It was instructive to look at the examples you guys mentioned. Thanks
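
For readers landing on this message first, here is a minimal sketch of the kind of left-margin rule discussed in this thread. It assumes the current indentation is kept in a dynamic variable; the names `Margin`, `$*CUR-INDENT` and `parse-at-indent` are made up for the example and are not taken from the original grammar.

```raku
grammar Margin {
    token TOP  { <line>+ }
    # One line: the required left margin, then the rest of the line.
    token line { <.lm> $<text>=[\N+] \n? }
    # Left margin: exactly $*CUR-INDENT spaces at the start of a line.
    token lm   { ^^ ' ' ** {$*CUR-INDENT} }
}

# Hypothetical helper: sets the dynamic variable, then parses.
sub parse-at-indent(Str $text, Int $indent) {
    my $*CUR-INDENT = $indent;
    Margin.parse($text);
}

say so parse-at-indent("  foo\n  bar\n", 2);   # both lines have the margin
say so parse-at-indent("  foo\n bar\n", 2);    # second line is too shallow
```

The block after `**` is evaluated when the token runs, so the repeat count follows whatever the surrounding code has put into the variable at that point.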

Theo


Re: Fwd: Re: grammars and indentation of input

2016-09-13 Thread Bennett Todd
Well put.

The clearest description of Python's approach I've read explained it as a 
lexer that tracked the indentation level and inserted appropriate tokens when 
it changed.


Re: Fwd: Re: grammars and indentation of input

2016-09-13 Thread Aaron Sherman
Oh, a side point: there's some confusion introduced by the lack of a
scanner/lexer in modern all-in-one-parsers.

Python, for example, uses a scanner and so its grammar is nominally not
context sensitive, but its scanner very much is (maintaining a stack of
indentation exactly as OP was asking about). When you do everything in one
place, the result must have the ability to maintain and respond to global
state. There's really no other way around it. A pure BNF cannot parse
Python.
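
A rough sketch of such a scanner, written in Raku rather than taken from Python's own tokenizer: it keeps a stack of indentation widths and emits INDENT and DEDENT markers when the width changes. The name `scan-indentation` and the plain-string token format are assumptions for illustration, and unlike a real scanner it does not reject a dedent to a width that was never opened.

```raku
# Hypothetical scanner: turns leading-space changes into INDENT/DEDENT markers.
sub scan-indentation(Str $source) {
    my @levels = 0;     # stack of open indentation widths
    my @tokens;
    for $source.lines -> $line {
        next unless $line ~~ /\S/;                   # skip blank lines
        my $width = ($line ~~ /^ ' '*/).Str.chars;   # count leading spaces
        if $width > @levels[*-1] {
            @levels.push($width);
            @tokens.push('INDENT');
        }
        while $width < @levels[*-1] {
            @levels.pop;
            @tokens.push('DEDENT');
        }
        @tokens.push($line.trim);
    }
    # Close any blocks still open at end of input.
    @tokens.push('DEDENT') for 1 ..^ @levels.elems;
    return @tokens;
}

.say for scan-indentation("if x:\n    y\n    z\nw\n");
```

A grammar that only ever sees the INDENT and DEDENT markers can stay context free; the state lives entirely in `@levels`.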




Aaron Sherman, M.:
P: 617-440-4332 Google Talk, Email and Google Plus: a...@ajs.com
Toolsmith, developer, gamer and life-long student.


On Tue, Sep 13, 2016 at 12:26 PM, Bennett Todd 
wrote:

> Hostile or not, thanks for your informative reply.
>


Re: grammars and indentation of input

2016-09-13 Thread Moritz Lenz
Hi,

On 13.09.2016 18:55, Patrick R. Michaud wrote:
> I don't have an example handy, but I can categorically say that
> Perl 6 grammars are designed to support exactly this form of parsing.
> It's almost exactly what I did in "pynie" -- a Python implementation
> on top of Perl 6.  The parsing was done using a Perl 6 grammar.

See https://github.com/arnsholt/snake for a newer implementation that
parses python (or subsets thereof). Its grammar is likely inspired by
pynie, but your chances to get it to run are much better, due to the
more recent development that has gone into it.

Cheers,
Moritz

-- 
Moritz Lenz
https://deploybook.com/ -- https://perlgeek.de/ -- https://perl6.org/


Re: grammars and indentation of input

2016-09-13 Thread Patrick R. Michaud
I don't have an example handy, but I can categorically say that
Perl 6 grammars are designed to support exactly this form of parsing.
It's almost exactly what I did in "pynie" -- a Python implementation
on top of Perl 6.  The parsing was done using a Perl 6 grammar.

If I remember correctly, Pynie had three grammar rules for this: one
for an indent, one for unchanged indentation, and one for a dedent.
The grammar kept a stack of known indentation levels.
The indent rule was a zero-width match that would succeed when it
found leading whitespace greater than the current indentation level
(and push the new level onto the stack).  The same-indentation rule
was a zero-width match that succeeded when the leading whitespace
exactly matched the current indentation level.  And the dedent
rule would be called when the other two no longer
matched, popping the top level off the stack.

So the grammar rule to match an indented block ended up looking
something like (I've shortened the example here):

token suite {
 
[   ]*
[  |  ]
}

A python "if statement" then looked like:

rule if_stmt {
'if'  ':' 
[ 'elif'  ':'  ]*
[ 'else' ':'  ]?
}

where the <suite> subrules would match the statements or block
of statements indented within the "if" statement.

However, all three indentation rules were written using
"normal" (non-regular expression) code.  Perl 6 makes this easy; since 
grammar rules are just methods in a class (that have a different code
syntax), you can create your own methods to emulate a grammar rule.  
The methods simply need to follow the Cursor protocol; that is, 
return Match objects indicating success/failure/length of whatever has 
been parsed at that point.

I hope this is a little useful.  If I can dig up or recreate a more 
complete Python implementation example sometime, I'll post it.

Pm


On Tue, Sep 13, 2016 at 01:13:45PM +0200, Theo van den Heuvel wrote:
> Hi all,
> 
> I am beginning to appreciate the power of grammars and the Match class. This
> is truly a major asset within Perl6.
> 
> I have a question on an edge case. I was hoping to use a grammar for an
> input that has meaningful indented blocks.
> I was trying something like this:
> 
>   token element { <.lm> [  | $=[ ' '+ ] )> ] }
>   token lm { ^^ ' '**{$cur-indent} } # skip up to current indent level
> 
> My grammar has a method called within the level rule that maintains a stack
> of indentations and sets a $cur-indent.
> I can imagine that the inner workings of the parser (i.e. optimization)
> frustrate this approach.
> Is there a way to make something like this work?
> 
> Thanks,
> Theo
> 
> -- 
> Theo van den Heuvel
> Van den Heuvel HLT Consultancy


Re: Fwd: Re: grammars and indentation of input

2016-09-13 Thread Bennett Todd
Hostile or not, thanks for your informative reply.


Re: Fwd: Re: grammars and indentation of input

2016-09-13 Thread Bennett Todd
Thank you, very much. Yes, I'm disappointed, but I'd rather know.


Re: Fwd: Re: grammars and indentation of input

2016-09-13 Thread Aaron Sherman
>
> Having the minutia of the programmatic run-time state of the parse then
> influence the parse itself, is at the heart of the perl5 phenomenon "only
> Perl can parse perl"


I don't mean to be hostile, but you're demonstrably wrong, here. (also it's
"only perl can parse Perl" as in, only the "perl" implementation can parse
the Perl language).

There are currently about 2.5 implementations of Perl 6, and while you
could backhandedly claim that only Perl 6 can parse Perl 6 (because it's
specced as a self-hosting language whose spec is actually written in Perl
6), the reality is that the parts that aren't written in Perl 6 can be
written in just about anything (with C/MoarVM and JVM implementations
working just fine).

It's not a context sensitive grammar that was the issue with Perl 5, it was
the lack of a specification outside of the primary implementation.





On Tue, Sep 13, 2016 at 10:35 AM, Bennett Todd 
wrote:

> Having the minutia of the programmatic run-time state of the parse then
> influence the parse itself, is at the heart of the perl5 phenomenon "only
> Perl can parse perl", which I rather hope isn't going to be preserved in
> perl6.
>


Re: Fwd: Re: grammars and indentation of input

2016-09-13 Thread Patrick R. Michaud
On Tue, Sep 13, 2016 at 10:35:01AM -0400, Bennett Todd wrote:
> Having the minutia of the programmatic run-time state of the parse then 
> influence the parse itself, is at the heart of the perl5 phenomenon "only 
> Perl can parse perl", which I rather hope isn't going to be preserved in 
> perl6.

You may be disappointed to read this:  Not only is this feature preserved in 
Perl 6... it's something of a prerequisite.  It is what is required for a truly 
dynamic language that has many things happening in BEGIN blocks (i.e., things 
get executed even before you finish compiling the thing you're working on) and 
that allows dynamically adding new statement types and language features to the 
grammar.

When implementing Perl 6, I think many of us aimed to minimize the amount of 
"runtime" things that happened during the parse... only to discover that we 
actually had to embrace and formalize it instead.

Pm


Re: Fwd: Re: grammars and indentation of input

2016-09-13 Thread Theo van den Heuvel

Hi Bennett,

There are many situations that require non-context-free languages. Even 
though many of these could be solved in the AST-building step (called 
'transduction' in my days) instead of the parsing step, that does not 
solve all cases. I am just wondering if and to what extent we can parse 
non-CF languages with perl6. The reference to Perl5 is not appropriate, 
because a programming language is supposed to be designed to be 
relatively easy to parse. The language designer will have the need for 
an interpreter for the language high on his/her list of priorities.


It all boils down to the question: does a Grammar allow a rule that 
depends on a function instead of being constant?
I am not shocked if it is impossible, I just want to know how far perl6 
takes me.


Theo

Bennett Todd wrote on 2016-09-13 16:35:

> Having the minutia of the programmatic run-time state of the parse
> then influence the parse itself, is at the heart of the perl5
> phenomenon "only Perl can parse perl", which I rather hope isn't going
> to be preserved in perl6.




Re: Fwd: Re: grammars and indentation of input

2016-09-13 Thread Bennett Todd
Having the minutia of the programmatic run-time state of the parse then 
influence the parse itself, is at the heart of the perl5 phenomenon "only Perl 
can parse perl", which I rather hope isn't going to be preserved in perl6.


Fwd: Re: grammars and indentation of input

2016-09-13 Thread Theo van den Heuvel



Thanks Timo and Brian,

both examples are educational. However, they have a common limitation in 
that they both perform their magic after a Match object has been 
created. I was trying to influence the parsing step itself.
I am experimenting to find if I can influence the parsing process 
programmatically. Indentation is just an example here.
The stack of indentation levels is maintained fine, but I cannot seem to 
use the knowledge of the current indentation to affect the rule for the 
left margin.

Theo

Timo Paulssen wrote on 2016-09-13 13:56:
> I haven't read your code, but your question immediately made me think of
> this module:
>
> https://github.com/masak/text-indented
>
> Would be interested to hear if this helps you!
>   - Timo




Re: grammars and indentation of input

2016-09-13 Thread Aaron Sherman
I don't see why optimization would frustrate this approach. You are doing
the correct thing as far as I can tell, but with one exception. The current
implementation (last I checked) was sometimes slow in binding values. You
might need to force it between an assignment and passing a bound match as a
parameter by inserting an empty block. You can see this documented and used
here:

http://examples.perl6.org/categories/parsers/SimpleStrings.html





On Tue, Sep 13, 2016 at 7:13 AM, Theo van den Heuvel 
wrote:

> Hi all,
>
> I am beginning to appreciate the power of grammars and the Match class.
> This is truly a major asset within Perl6.
>
> I have a question on an edge case. I was hoping to use a grammar for an
> input that has meaningful indented blocks.
> I was trying something like this:
>
>   token element { <.lm> [  | $=[ ' '+ ] )> ] }
>   token lm { ^^ ' '**{$cur-indent} } # skip up to current indent level
>
> My grammar has a method called within the level rule that maintains a
> stack of indentations and sets a $cur-indent.
> I can imagine that the inner workings of the parser (i.e. optimization)
> frustrate this approach.
> Is there a way to make something like this work?
>
> Thanks,
> Theo
>
> --
> Theo van den Heuvel
> Van den Heuvel HLT Consultancy
>


Re: grammars and indentation of input

2016-09-13 Thread Brian Duggan
I've also recently been experimenting with parsing an
indent-based language -- specifically, a small subset
of Slim -- I push to a stack
when I see a tag, and pop based on the depth of the
indentation.

Here's a working example:

  https://git.io/vig93

Brian


Re: grammars and indentation of input

2016-09-13 Thread Timo Paulssen
I haven't read your code, but your question immediately made me think of
this module:

https://github.com/masak/text-indented

Would be interested to hear if this helps you!
  - Timo


Re: Grammars

2015-04-20 Thread Larry Wall
On Sun, Apr 19, 2015 at 06:31:30PM +0200, mt1957 wrote:
: L.s.,
: 
: I found a small problem when writing a piece of grammar. A
: simplified part of it is shown here;
: ...
: token tag-body   { <body-start> ~ <body-end> <body-text> }
: token body-start { '[' }
: token body-end  { ']' }
: token body-text  { .*?  }
: ...
: 

A couple of things:

The ~ is intended primarily for literal delimiters, so you'd typically just
see something like:

token tag-body   { '[' ~ ']' <body-text> }
token body-text  { .*?  }

In this case there would be no body-end rule at all--which means you'd
hang the action routine somewhere else.  So you could just as easily
hang your action routine on tag-body or on body-text, depending on
whether you care about whether the match object includes the delimiters.
In either case, it doesn't have to attach to the final delimiter.

:  * Is there a possibility to give the method more information in the
:form of boolean flags saying for example that there was a look ahead
:match, all in all the parser knows about the way it must seek!

One could always set a dynamic variable inside the "not really" rule:

token body-text {
:my $*NOT-REALLY = 1;
.*?

}

but it's easier to just move the reduction action.
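
As a small illustration of moving the reduction action, the sketch below hangs the action on tag-body rather than on a closing-delimiter rule. The `Tag` and `TagActions` names are hypothetical, and body-text is simplified to "anything but a closing bracket", which is an assumption and not the pattern from the original post.

```raku
grammar Tag {
    token TOP       { <tag-body> }
    token tag-body  { '[' ~ ']' <body-text> }
    # Assumption: body text is any run of characters other than ']'.
    token body-text { <-[\]]>* }
}

class TagActions {
    # Runs once tag-body has matched, delimiters included.
    method tag-body($/) { make ~$<body-text> }
    method TOP($/)      { make $<tag-body>.made }
}

say Tag.parse('[hello]', actions => TagActions).made;
```

Because the action sits on tag-body, it fires only after the closing `]` has been seen, so no separate body-end rule is needed to know the body is complete.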

Larry


Re: Grammars and biological data formats

2014-08-09 Thread Fields, Christopher J
On Aug 9, 2014, at 8:51 PM, "Fields, Christopher J"  
wrote:
> 
> 
>> On Aug 9, 2014, at 5:25 PM, "t...@wakelift.de"  wrote:
>> 
>> 
>>> On 08/10/2014 12:21 AM, t...@wakelift.de wrote:
>>> Something that does surprise me is that your tests seem to imply that :p
>>> for subparse doesn't work. I'll look into that, because I believe it
>>> ought to be implemented already. Perhaps not properly hooked up, though.
>> 
>> On #perl6 I got corrected quite quickly: subparse is anchored to the
>> start and end of the target string, so :pos doesn't make sense. In this
>> case, you want just .parse
> 
> I mainly tested subparse() to see if it would find the second FASTA record 
> (which works if using :p and not :pos).
> 
> Sorry, I should have updated that, but subparse() with :p works fine; the 
> spec mentions :pos though (I plan on submitting a pull request on that).
> 
>> Another thing is that if lines() does keep all data around, it should be
>> considered a bug, as we should be able to infer that we don't keep the
>> list itself around and thus won't be able to refer to its previous
>> values later on. Thus, we should free the memory for the earlier lines
>> in the target string after the loop is done with them.
>> 
>> I have not yet tested, if this is the case, though.
>> 
>> Hope that clears up a bit of potential confusion before it can arise
>> - Timo
> 
> I can try that out.
> 
> Chris

Oh, and thanks everyone for the quick replies!

Chris

Re: Grammars and biological data formats

2014-08-09 Thread Fields, Christopher J

> On Aug 9, 2014, at 5:25 PM, "t...@wakelift.de"  wrote:
> 
> 
>> On 08/10/2014 12:21 AM, t...@wakelift.de wrote:
>> Something that does surprise me is that your tests seem to imply that :p
>> for subparse doesn't work. I'll look into that, because I believe it
>> ought to be implemented already. Perhaps not properly hooked up, though.
> 
> On #perl6 I got corrected quite quickly: subparse is anchored to the
> start and end of the target string, so :pos doesn't make sense. In this
> case, you want just .parse

I mainly tested subparse() to see if it would find the second FASTA record 
(which works if using :p and not :pos).

Sorry, I should have updated that, but subparse() with :p works fine; the spec 
mentions :pos though (I plan on submitting a pull request on that).

> Another thing is that if lines() does keep all data around, it should be
> considered a bug, as we should be able to infer that we don't keep the
> list itself around and thus won't be able to refer to its previous
> values later on. Thus, we should free the memory for the earlier lines
> in the target string after the loop is done with them.
> 
> I have not yet tested, if this is the case, though.
> 
> Hope that clears up a bit of potential confusion before it can arise
>  - Timo

I can try that out.

Chris




Re: Grammars and biological data formats

2014-08-09 Thread Darren Duncan
I've already been thinking for awhile now that parsers need to be able to 
operate in a streaming fashion (when the grammars lend themselves to it, by not 
needing to lookahead, much if at all, to understand what they've already seen) 
so that strings that don't fit in memory all at once can be parsed.


Any parser that returns results piecewise to the caller rather than all at once, 
such as by supporting callbacks, already makes for a streaming interface on that 
end, so it just needs to be lazy on the input end as well, and then one can 
parse arbitrary sized inputs while using little memory.


Christopher's example is a good one.

Another example that I would deal with is database dumps; the parsers in psql or 
mysql or others can obviously handle SQL dump files that are many gigabytes and 
are obviously parsing them in a streaming manner, but SQL files are really just 
program source code files.


-- Darren Duncan

On 2014-08-09, 3:09 PM, Fields, Christopher J wrote:

(accidentally sent to perl6-lang, apologies for cross-posting but this seems 
more appropriate)

I have a fairly simple question regarding the feasibility of using grammars 
with commonly used biological data formats.

My main question: if I wanted to parse() or subparse() very large files (not 
unheard of to have FASTA/FASTQ or other similar data files exceed 100’s of GB) 
would a grammar be the best solution?  For instance, based on what I am reading 
the semantics appear to be greedy; for instance:

Grammar.parsefile($file)

appears to be a convenient shorthand for:

Grammar.parse($file.slurp)

since Grammar.parse() works on a Str, not a IO::Handle or Buf.  Or am I 
misunderstanding how this could be accomplished?

(just to point out, I know I can subparse() as well but that also appears to 
act on a string…)

As an example, I have a simple grammar for parsing FASTA, which is a (deceptively) 
simple format for storing sequence data:

http://en.wikipedia.org/wiki/FASTA_format

I have a simple grammar here:

https://github.com/cjfields/bioperl6/blob/master/lib/Bio/Grammar/Fasta.pm6

and tests here:

https://github.com/cjfields/bioperl6/blob/master/t/Grammar/fasta.t

Tests pass with the latest Rakudo just fine.

chris






Re: Grammars and biological data formats

2014-08-09 Thread timo

On 08/10/2014 12:21 AM, t...@wakelift.de wrote:
> Something that does surprise me is that your tests seem to imply that :p
> for subparse doesn't work. I'll look into that, because I believe it
> ought to be implemented already. Perhaps not properly hooked up, though.

On #perl6 I got corrected quite quickly: subparse is anchored to the
start and end of the target string, so :pos doesn't make sense. In this
case, you want just .parse

Another thing is that if lines() does keep all data around, it should be
considered a bug, as we should be able to infer that we don't keep the
list itself around and thus won't be able to refer to its previous
values later on. Thus, we should free the memory for the earlier lines
in the target string after the loop is done with them.

I have not yet tested, if this is the case, though.

Hope that clears up a bit of potential confusion before it can arise
  - Timo



Re: Grammars and biological data formats

2014-08-09 Thread timo
(accidentally sent this privately only, now re-sending to the list)

Hello Christopher,

In the Perl 6 specification, there are plans for lazy and
memory-releasing ways to parse strings that are either too large to fit
into memory at once or that are generated lazily (like being streamed in
through the network or using "live" data sources). Sadly, none of those
features are implemented in either of our backends.

The simplest thing we have is the <cut> rule, which should instruct the
grammar engine to deallocate the parts of the input data that are before
the current cursor. Sadly, this is not going to help you much at this stage.

Another thing that will be unhelpful is that our lazy lists (such as the
ones you can generate with gather/take or what lines() will give you)
will keep all items from the very first to the last you've requested
around until the whole list becomes garbage and gets collected.

It would seem like you'll want to do a line-by-line iteration through
the data using not lines() but get() and manually parse the individual
lines; the grammar seems sufficiently simple for that to work.
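
A minimal sketch of that get()-based loop, assuming the simple FASTA layout of a `>` header line followed by sequence lines. The sub name `read-fasta`, the callback signature and the temporary file are made up for the example and are not part of the bioperl6 code.

```raku
# Hypothetical reader: calls &on-record with (id, sequence) for each record.
sub read-fasta(IO::Handle $fh, &on-record) {
    my $id;
    my $seq = '';
    loop {
        my $line = $fh.get;
        last unless $line.defined;                 # end of file
        if $line ~~ /^ '>' (\S+) / {
            on-record($id, $seq) if $id.defined;   # flush previous record
            $id  = ~$0;
            $seq = '';
        }
        else {
            $seq ~= $line.trim;                    # append sequence data
        }
    }
    on-record($id, $seq) if $id.defined;           # flush the final record
}

# Usage with a small temporary file.
my $path = $*TMPDIR.add('sketch.fa');
$path.spurt(">seq1 demo\nACGT\nTTAA\n>seq2\nGGCC\n");
my $fh = $path.open;
read-fasta($fh, -> $id, $seq { say "{$id}: {$seq.chars} characters" });
$fh.close;
```

Only the current record is held in memory here, which still means a very long single sequence is accumulated in full before the callback sees it.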

Something that does surprise me is that your tests seem to imply that :p
for subparse doesn't work. I'll look into that, because I believe it
ought to be implemented already. Perhaps not properly hooked up, though.

Hope to help!
- Timo