Re: Grammars and biological data formats
Hmmm, what about just implementing mmap-as-string? Then, assuming the parsing process is somewhat stream-like, the OS will take care of swapping in chunks as you need them. You don't even need anything special to support backtracking -- it's just a memory address, after all.

-Martin

On Thu, 14 Aug 2014, Fields, Christopher J wrote:

Yeah, I'm thinking of a Cat-like class that would chunkify the data and check for matches. The main reason I would like to stick with a consistent grammar-based approach is I have seen many instances in BioPerl where a parser is essentially rewritten based on its purpose (full parsing, lazy parsing, indexing of flat files, adding to a persistent data store, etc). Having a way to both parse a full grammar but also subparse for a specific token/rule is very handy, and when Cat comes around even more so.

Chris
Re: Grammars and biological data formats
Yes, that looks like an even better option. I see that this is implemented in Perl 5 as File::Map, which is nicely portable.

Chris

On Aug 16, 2014, at 7:51 AM, Martin D Kealey mar...@kurahaupo.gen.nz wrote:

Hmmm, what about just implementing mmap-as-string? Then, assuming the parsing process is somewhat stream-like, the OS will take care of swapping in chunks as you need them. You don't even need anything special to support backtracking -- it's just a memory address, after all.

-Martin
Re: Grammars and biological data formats
I was going to pipe in and say that I wouldn't wait around for Cat; I'd write something that reads chunks and then parses them. It'll be a bit more code, but it'll work today. But I see you reached that conclusion already. :)

Lately I've found myself writing more and more grammars that parse just one line of some input. Provided that the same action object gets attached to the parse each time, that object is an excellent place to store information that you want to persist between lines. Actually, action objects started to make a whole lot more sense to me after I found that use case, because the action object takes on the role of a session/lifetime object for the parse process itself.

// Carl

On Wed, Aug 13, 2014 at 3:19 PM, Fields, Christopher J cjfie...@illinois.edu wrote:

Ah, never mind. I did a search of the IRC channel and found it’s considered to be a ‘6.1’ feature:

http://irclog.perlgeek.de/perl6/2014-07-06#i_8978974

It is mentioned a few times in the specs, I’m guessing based on where it’s thought to fit in best. For the moment the proposal is to run grammar parsing on sized chunks of the input data, which might be how Cat would be implemented anyway.

chris
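The one-line-grammar pattern Carl describes could look roughly like the sketch below. Everything in it is hypothetical and for illustration only: the grammar FastaLine, the class Collector, and its attributes are made-up names, not code from the thread or from bioperl6. It assumes each input line is either a FASTA header or a line of letters.

```raku
# Hypothetical one-line grammar: every call to .parse sees a single line.
grammar FastaLine {
    token TOP      { <header> | <residues> }
    token header   { '>' $<id>=[\S+] [ \h+ \N* ]? }
    token residues { <[A..Z a..z]>+ }
}

# The actions object outlives each individual parse, so it can carry
# state from one line to the next (the "session object" role).
class Collector {
    has %.seq-len;     # id => number of residues seen so far
    has $!current;     # id of the record currently being read

    method header($/)   { $!current = ~$<id>; %!seq-len{$!current} = 0 }
    method residues($/) { %!seq-len{$!current} += $/.chars }
}

my $actions = Collector.new;

# In real use the lines would come lazily from something like
# 'reads.fa'.IO.lines (hypothetical path); a literal list keeps this runnable.
for ">seq1 first example", "ACGT", "TTGA", ">seq2", "GGCC" -> $line {
    FastaLine.parse($line, :$actions) or die "cannot parse: $line";
}

say $actions.seq-len<seq1>;   # 8
say $actions.seq-len<seq2>;   # 4
```

Because only one line is in memory at a time, the size of the file stops mattering; the cost is that anything spanning several lines has to be reassembled in the actions object.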
Re: Grammars and biological data formats
Yeah, I'm thinking of a Cat-like class that would chunkify the data and check for matches. The main reason I would like to stick with a consistent grammar-based approach is I have seen many instances in BioPerl where a parser is essentially rewritten based on its purpose (full parsing, lazy parsing, indexing of flat files, adding to a persistent data store, etc). Having a way to both parse a full grammar but also subparse for a specific token/rule is very handy, and when Cat comes around even more so.

Chris

On Aug 14, 2014, at 6:40 AM, Carl Mäsak cma...@gmail.com wrote:

I was going to pipe in and say that I wouldn't wait around for Cat, I'd write something that reads chunks and then parses that. It'll be a bit more code, but it'll work today. But I see you reached that conclusion already. :)

Lately I've found myself writing more and more grammars that parse just one line of some input. Provided that the same action object gets attached to the parse each time, that's an excellent place to store information that you want to persist between lines. Actually, action objects started to make a whole lot more sense to me after I found that use case, because it takes on the role of a session/lifetime object for the parse process itself.

// Carl
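As a rough illustration of reusing one grammar for different purposes, the sketch below runs the same rules once as a full record parse and once as a header-only pass of the kind an indexer might want. MiniFasta is a made-up, letters-only grammar written for this example; it is not the Bio::Grammar::Fasta grammar from bioperl6.

```raku
# Hypothetical minimal grammar; real FASTA allows more than plain letters.
grammar MiniFasta {
    token record   { <header> <sequence> }
    token header   { '>' <id> [ \h+ <desc> ]? \n }
    token id       { \S+ }
    token desc     { \N+ }
    token sequence { [ <[A..Z a..z]>+ \n? ]+ }
}

my $text = ">seq1 first example\nACGT\nTTGA\n";

# Full parse: .parse must consume the whole string; :rule picks the
# starting rule, so no TOP is needed here.
my $full = MiniFasta.parse($text, :rule<record>);
say $full<sequence>.Str.comb(/\S/).elems;   # 8

# Indexing-style pass: .subparse may stop before the end of the string,
# so matching only <header> never touches the sequence lines.
my $head = MiniFasta.subparse($text, :rule<header>);
say ~$head<id>;      # seq1
say ~$head<desc>;    # first example
```

Both calls still take a Str, so this addresses the "one grammar, several uses" point but not the memory question; that part needs chunked input.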
Grammars and biological data formats
I have a fairly simple question regarding the feasibility of using grammars with commonly used biological data formats.

My main question: if I wanted to parse() or subparse() very large files (it is not unheard of for FASTA/FASTQ or similar data files to exceed hundreds of GB), would a grammar be the best solution? Based on what I am reading, the semantics appear to be greedy; for instance:

Grammar.parsefile($file)

appears to be a convenient shorthand for:

Grammar.parse($file.slurp)

since Grammar.parse() works on a Str, not an IO::Handle or Buf. Or am I misunderstanding how this could be accomplished? (Just to point out, I know I can subparse() as well, but that also appears to act on a string…)

As an example, I have a simple grammar for parsing FASTA, which is a (deceptively) simple format for storing sequence data:

http://en.wikipedia.org/wiki/FASTA_format

The grammar is here:

https://github.com/cjfields/bioperl6/blob/master/lib/Bio/Grammar/Fasta.pm6

and tests here:

https://github.com/cjfields/bioperl6/blob/master/t/Grammar/fasta.t

Tests pass with the latest Rakudo just fine.

chris
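For readers without the linked repository at hand, here is a minimal sketch of what parsing FASTA from a Str looks like. The grammar below is a made-up, letters-only stand-in written for this example, not the Bio::Grammar::Fasta grammar linked above.

```raku
# Hypothetical minimal FASTA grammar, for illustration only.
grammar MiniFasta {
    token TOP      { <record>+ }
    token record   { '>' <id> [ \h+ <desc> ]? \n <sequence> }
    token id       { \S+ }
    token desc     { \N+ }
    token sequence { [ <[A..Z a..z]>+ \n? ]+ }
}

my $text = ">seq1 first example\nACGT\nTTGA\n>seq2\nGGCC\n";

# .parse needs the complete input as one Str, which is the concern raised
# above: for a file, that means slurping all of it into memory first.
my $match = MiniFasta.parse($text);

say $match<record>.elems;       # 2
say ~$match<record>[0]<id>;     # seq1
```

The whole input and the whole match tree live in memory at once here, which is fine for a test file and is exactly what becomes a problem at hundreds of GB.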
Re: Grammars and biological data formats
On Sat, Aug 9, 2014 at 7:26 PM, Fields, Christopher J cjfie...@illinois.edu wrote:

I have a fairly simple question regarding the feasibility of using grammars with commonly used biological data formats. My main question: if I wanted to parse() or subparse() very large files (it is not unheard of for FASTA/FASTQ or similar data files to exceed hundreds of GB), would a grammar be the best solution? Based on what I am reading, the semantics appear to be greedy; for instance:

Grammar.parsefile($file)

appears to be a convenient shorthand for:

Grammar.parse($file.slurp)

since Grammar.parse() works on a Str, not an IO::Handle or Buf. Or am I misunderstanding how this could be accomplished?

My understanding is that parsing is intended to work on Cats (hypothetical lazy strings), but this hasn't been implemented anywhere yet.

--
Solomon Foster: colo...@gmail.com
HarmonyWare, Inc: http://www.harmonyware.com
Re: Grammars and biological data formats
On Aug 13, 2014, at 4:50 AM, Solomon Foster colo...@gmail.com wrote:

My understanding is it is intended that parsing can work on Cats (hypothetical lazy strings) but this hasn't been implemented yet anywhere.

Yeah, that’s what I recall as well. I see very little in the specs re: Cat unfortunately.

chris
Re: Grammars and biological data formats
On Aug 13, 2014, at 8:11 AM, Christopher Fields cjfie...@illinois.edu wrote:

Yeah, that’s what I recall as well. I see very little in the specs re: Cat unfortunately.

Ah, never mind. I did a search of the IRC channel and found it’s considered to be a ‘6.1’ feature:

http://irclog.perlgeek.de/perl6/2014-07-06#i_8978974

It is mentioned a few times in the specs, I’m guessing based on where it’s thought to fit in best. For the moment the proposal is to run grammar parsing on sized chunks of the input data, which might be how Cat would be implemented anyway.

chris
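One way the chunked approach could be sketched is shown below. It assumes a chunk is one FASTA record (everything from one '>' line up to the next), which is an interpretation rather than something the thread specifies. The sub records and the grammar MiniFasta are made-up names for this example.

```raku
# Hypothetical minimal grammar for a single record.
grammar MiniFasta {
    token record   { '>' <id> \N* \n <sequence> }
    token id       { \S+ }
    token sequence { [ <[A..Z a..z]>+ \n? ]+ }
}

# Lazily group lines into one Str per record. gather/take produces the
# records on demand, so only the current record is built up in memory.
sub records(\source) {
    gather {
        my @buf;
        for source -> $line {
            if $line.starts-with('>') and @buf {
                take @buf.join("\n") ~ "\n";
                @buf = ();
            }
            @buf.push($line);
        }
        take @buf.join("\n") ~ "\n" if @buf;
    }
}

# In real use: records('reads.fa'.IO.lines), with a hypothetical path.
my @demo = ">seq1 first example", "ACGT", "TTGA", ">seq2", "GGCC";

for records(@demo) -> $text {
    my $m = MiniFasta.parse($text, :rule<record>) or die "bad record";
    say ~$m<id>, ": ", $m<sequence>.Str.comb(/\S/).elems, " residues";
}
# seq1: 8 residues
# seq2: 4 residues
```

Record-sized chunks keep each parse small for read data such as FASTQ, but a single record can itself be very large (a whole chromosome, say), in which case fixed-size chunks or the mmap idea discussed elsewhere in the thread would still be needed.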