Re: Grammars and biological data formats

2014-08-16 Thread Martin D Kealey

Hmmm, what about just implementing mmap-as-string?

Then, assuming the parsing process is somewhat stream-like, the OS will take
care of swapping in chunks as you need them. You don't even need anything
special to support backtracking -- it's just a memory address, after all.

-Martin

On Thu, 14 Aug 2014, Fields, Christopher J wrote:
 Yeah, I'm thinking of a Cat-like class that would chunkify the data and check 
 for matches.

 The main reason I would like to stick with a consistent grammar-based 
 approach is I have seen many instances in BioPerl where a parser is 
 essentially rewritten based on its purpose (full parsing, lazy parsing, 
 indexing of flat files, adding to a persistent data store, etc).  Having a 
 way to both parse a full grammar but also subparse for a specific token/rule 
 is very handy, and when Cat comes around even more so.

 Chris




Re: Grammars and biological data formats

2014-08-16 Thread Fields, Christopher J
Yes, that looks like an even better option.  I see that this is implemented in 
Perl 5 as File::Map, which is a nice portable option.

Chris

 On Aug 16, 2014, at 7:51 AM, Martin D Kealey mar...@kurahaupo.gen.nz 
 wrote:
 
 
 Hmmm, what about just implementing mmap-as-string?
 
 Then, assuming the parsing process is somewhat stream-like, the OS will take
 care of swapping in chunks as you need them. You don't even need anything
 special to support backtracking -- it's just a memory address, after all.
 
 -Martin
 


Re: Grammars and biological data formats

2014-08-14 Thread Carl Mäsak
I was going to pipe in and say that I wouldn't wait around for Cat,
I'd write something that reads chunks and then parses that. It'll be a
bit more code, but it'll work today. But I see you reached that
conclusion already. :)

Lately I've found myself writing more and more grammars that parse
just one line of some input. Provided that the same action object gets
attached to the parse each time, that's an excellent place to store
information that you want to persist between lines. Actually, action
objects started to make a whole lot more sense to me after I found
that use case, because it takes on the role of a session/lifetime
object for the parse process itself.

// Carl
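
A minimal sketch of the per-line approach described above, assuming a
current Rakudo. The FastaLine grammar, the FastaActions class, and the
sample lines are hypothetical illustrations, not code from this thread or
from bioperl6. The point it tries to show is that one actions object,
passed to every parse, can carry state from line to line:

```raku
# Hypothetical one-line grammar: a line is either a header or a sequence run.
grammar FastaLine {
    token TOP      { ^ [ <header> | <sequence> ] $ }
    token header   { '>' <seqid> [ \h+ <desc> ]? }
    token seqid    { \S+ }
    token desc     { \N+ }
    token sequence { <[A..Za..z]>+ }
}

# The actions object outlives each individual parse, so it can act as a
# session object for the whole input.
class FastaActions {
    has %.lengths;   # record id => residues seen so far
    has $.current;   # id of the record currently being read

    method header($/) {
        $!current = ~$<seqid>;
        %!lengths{$!current} = 0;
    }
    method sequence($/) {
        %!lengths{$!current} += $/.chars if $!current.defined;
    }
}

my $actions = FastaActions.new;
for ">seq1 first record", "ACGT", "ACG", ">seq2", "TT" -> $line {
    # The same $actions instance is attached to every parse.
    FastaLine.parse($line, :$actions)
        or warn "unparsed line: $line";
}
say $actions.lengths;
```

In real use the literal list would presumably be replaced by lazily read
lines (for example $path.IO.lines), so that only one line is held in
memory at a time.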

On Wed, Aug 13, 2014 at 3:19 PM, Fields, Christopher J
cjfie...@illinois.edu wrote:
 Ah, nevermind.  I did a search of the IRC channel and found it’s considered 
 to be a ‘6.1’ feature:

 http://irclog.perlgeek.de/perl6/2014-07-06#i_8978974

 It is mentioned a few times in the specs, I’m guessing based on where it’s 
 thought to fit in best.  For the moment the proposal is to run grammar 
 parsing on sized chunks of the input data, which might be how Cat would be 
 implemented anyway.

 chris



Re: Grammars and biological data formats

2014-08-14 Thread Fields, Christopher J
Yeah, I'm thinking of a Cat-like class that would chunkify the data and check 
for matches.

The main reason I would like to stick with a consistent grammar-based approach 
is I have seen many instances in BioPerl where a parser is essentially 
rewritten based on its purpose (full parsing, lazy parsing, indexing of flat 
files, adding to a persistent data store, etc).  Having a way to both parse a 
full grammar but also subparse for a specific token/rule is very handy, and 
when Cat comes around even more so.  

Chris

 On Aug 14, 2014, at 6:40 AM, Carl Mäsak cma...@gmail.com wrote:
 
 I was going to pipe in and say that I wouldn't wait around for Cat,
 I'd write something that reads chunks and then parses that. It'll be a
 bit more code, but it'll work today. But I see you reached that
 conclusion already. :)
 
 Lately I've found myself writing more and more grammars that parse
 just one line of some input. Provided that the same action object gets
 attached to the parse each time, that's an excellent place to store
 information that you want to persist between lines. Actually, action
 objects started to make a whole lot more sense to me after I found
 that use case, because it takes on the role of a session/lifetime
 object for the parse process itself.
 
 // Carl
 


Grammars and biological data formats

2014-08-13 Thread Fields, Christopher J
I have a fairly simple question regarding the feasibility of using grammars 
with commonly used biological data formats.  

My main question: if I wanted to parse() or subparse() very large files (it is 
not unheard of for FASTA/FASTQ or other similar data files to exceed hundreds 
of GB), would a grammar be the best solution?  Based on what I am reading, the 
semantics appear to be greedy; for instance:

Grammar.parsefile($file)

appears to be a convenient shorthand for:

Grammar.parse($file.slurp)

since Grammar.parse() works on a Str, not an IO::Handle or Buf.  Or am I 
misunderstanding how this could be accomplished?

(just to point out, I know I can subparse() as well but that also appears to 
act on a string…)

As an example, I have a simple grammar for parsing FASTA, which is a 
(deceptively) simple format for storing sequence data:

http://en.wikipedia.org/wiki/FASTA_format

I have a simple grammar here:

https://github.com/cjfields/bioperl6/blob/master/lib/Bio/Grammar/Fasta.pm6

and tests here:

https://github.com/cjfields/bioperl6/blob/master/t/Grammar/fasta.t

Tests pass with the latest Rakudo just fine.

chris

Re: Grammars and biological data formats

2014-08-13 Thread Solomon Foster
On Sat, Aug 9, 2014 at 7:26 PM, Fields, Christopher J
cjfie...@illinois.edu wrote:
 I have a fairly simple question regarding the feasibility of using grammars 
 with commonly used biological data formats.

 My main question: if I wanted to parse() or subparse() very large files (it 
 is not unheard of for FASTA/FASTQ or other similar data files to exceed 
 hundreds of GB), would a grammar be the best solution?  Based on what I am 
 reading, the semantics appear to be greedy; for instance:

 Grammar.parsefile($file)

 appears to be a convenient shorthand for:

 Grammar.parse($file.slurp)

 since Grammar.parse() works on a Str, not an IO::Handle or Buf.  Or am I 
 misunderstanding how this could be accomplished?

My understanding is it is intended that parsing can work on Cats
(hypothetical lazy strings) but this hasn't been implemented yet
anywhere.

-- 
Solomon Foster: colo...@gmail.com
HarmonyWare, Inc: http://www.harmonyware.com


Re: Grammars and biological data formats

2014-08-13 Thread Fields, Christopher J
On Aug 13, 2014, at 4:50 AM, Solomon Foster colo...@gmail.com wrote:

 My understanding is it is intended that parsing can work on Cats
 (hypothetical lazy strings) but this hasn't been implemented yet
 anywhere.
 
 -- 
 Solomon Foster: colo...@gmail.com
 HarmonyWare, Inc: http://www.harmonyware.com

Yeah, that’s what I recall as well.  I see very little in the specs re: Cat 
unfortunately.

chris

Re: Grammars and biological data formats

2014-08-13 Thread Fields, Christopher J
On Aug 13, 2014, at 8:11 AM, Christopher Fields cjfie...@illinois.edu wrote:

 Yeah, that’s what I recall as well.  I see very little in the specs re: Cat 
 unfortunately.
 
 chris

Ah, nevermind.  I did a search of the IRC channel and found it’s considered to 
be a ‘6.1’ feature:

http://irclog.perlgeek.de/perl6/2014-07-06#i_8978974

It is mentioned a few times in the specs, I’m guessing based on where it’s 
thought to fit in best.  For the moment the proposal is to run grammar parsing 
on sized chunks of the input data, which might be how Cat would be implemented 
anyway.

chris
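
A rough sketch of what parsing on chunks of the input could look like,
assuming a current Rakudo and assuming that one FASTA record is a sensible
chunk. FastaRecord, parse-records, and the temporary file name are
hypothetical; the grammar is a simplified stand-in and not the one in
Bio::Grammar::Fasta:

```raku
# Hypothetical single-record grammar: one header line, then sequence lines.
grammar FastaRecord {
    token TOP     { <header> \n [ <seqline> \n ]* }
    token header  { '>' <seqid> [ \h+ <desc> ]? }
    token seqid   { \S+ }
    token desc    { \N+ }
    token seqline { <[A..Za..z]>+ }
}

# Read lines lazily, buffer one record at a time, and hand each parsed
# record to the callback. Only the current record is held as a Str.
sub parse-records(IO::Path $path, &on-record) {
    my $chunk = '';
    sub finish-record() {
        with FastaRecord.parse($chunk) { on-record($_) }
        else { warn "unparsed record: {$chunk.lines.head}" }
        $chunk = '';
    }
    for $path.lines -> $line {
        # A new header closes the record collected so far.
        finish-record() if $chunk.chars && $line.starts-with('>');
        $chunk ~= "$line\n";
    }
    finish-record() if $chunk.chars;
}

# Small self-contained demonstration using a temporary file.
my $path = $*TMPDIR.add('chunked-example.fa');
$path.spurt(">seq1 first record\nACGT\nACG\n>seq2\nTT\n");
parse-records($path, -> $m {
    say "{$m<header><seqid>}: {$m<seqline>.join.chars} residues";
});
$path.unlink;
```

Splitting on record boundaries keeps every chunk parseable on its own, but
memory use is then bounded by the largest record rather than by a fixed
chunk size, which may still be too much for chromosome-sized sequences.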