Binning k-mers, open file handle question

2015-07-25 Thread Youens-Clark, Charles Kenneth - (kyclark)
I have an idea to unique and combine the k-mers of many (extremely large) FASTA files for a project. I have a proof of concept working in Perl that works fairly quickly because of the ability to cache the open file handles (at most 1024 [4^5 as I'm using 5-mers as the bins]). I can get a Has

Re: Binning k-mers, open file handle question

2015-07-25 Thread Nicholas Ingolia
I would suggest opening all files at the beginning. Generate all k-mers, open a file for each, and make a HashMap from k-mer to file handle. When you're done, run through the HashMap to close all file handles. Also, this lets you keep your sequences/k-mers/hash keys as byte strings for iterating
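
A minimal Haskell sketch of the suggestion above, under a few stated assumptions: each bin gets its own output file named after the bin, a k-mer is routed by its first n bases, and Data.Map stands in for the "HashMap" mentioned. The names allBins, openBins, writeKmer and closeBins, and the ".txt" naming, are illustrative and do not come from the thread.

```haskell
import qualified Data.ByteString.Char8 as BS
import qualified Data.Map.Strict as M
import Control.Monad (forM, replicateM)
import System.IO (Handle, IOMode (WriteMode), hClose, openFile)

-- | All DNA words of length n over ACGT (4^n of them; 1024 for n = 5).
allBins :: Int -> [BS.ByteString]
allBins n = map BS.pack (replicateM n "ACGT")

-- | Open one output file per bin up front and index the handles by bin name.
-- The file naming scheme (dir/BIN.txt) is an assumption for illustration.
openBins :: FilePath -> Int -> IO (M.Map BS.ByteString Handle)
openBins dir n = do
  pairs <- forM (allBins n) $ \bin -> do
    h <- openFile (dir ++ "/" ++ BS.unpack bin ++ ".txt") WriteMode
    return (bin, h)
  return (M.fromList pairs)

-- | Write a k-mer to the bin named by its first n bases.
-- K-mers whose prefix has no bin (for example, one containing N) are skipped.
writeKmer :: Int -> M.Map BS.ByteString Handle -> BS.ByteString -> IO ()
writeKmer n bins kmer =
  case M.lookup (BS.take n kmer) bins of
    Just h  -> BS.hPutStrLn h kmer
    Nothing -> return ()

-- | Close every handle once all k-mers have been written.
closeBins :: M.Map BS.ByteString Handle -> IO ()
closeBins = mapM_ hClose
```

With 5-mer bins this keeps 1024 files open at once, so the per-process open-file limit may need to be raised before running it.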

Re: Binning k-mers, open file handle question

2015-07-25 Thread Michał J Gajda
Could You provide some larger test cases that You mention?

On Sunday, July 26, 2015, Youens-Clark, Charles Kenneth - (kyclark) <kycl...@email.arizona.edu> wrote:
> I have an idea to unique and combine the k-mers of many (extremely large) FASTA files for a project. I have a proof of concept wo

Re: Binning k-mers, open file handle question

2015-07-25 Thread Youens-Clark, Charles Kenneth - (kyclark)
On Jul 25, 2015, at 5:21 PM, Michał J Gajda wrote:
> Could You provide some larger test cases that You mention?

This is a medium-sized set of the Pacific Ocean Virome:
http://mirrors.iplantcollaborative.org/browse/iplant/home/shared/imicrobe/pov/fasta/reads
I could put up another se

Re: Binning k-mers, open file handle question

2015-07-25 Thread Youens-Clark, Charles Kenneth - (kyclark)
On Jul 25, 2015, at 10:43 AM, Nicholas Ingolia wrote:
> I would suggest opening all files at the beginning. Generate all k-mers, open a file for each, and make a HashMap from k-mer to file handle. When you're done, run through the HashMap to close all file handles.

I really like this idea

Re: Binning k-mers, open file handle question

2015-07-25 Thread Michał J Gajda
Hi Ken,

I'd suggest using ByteString instead of String with Sequence at all times - it is much more efficient, and uses less memory. Just:

import Data.ByteString.Char8 as BS

and instead of findKMers 20 . toString, rewrite findKMersBS 20 with the following sig:

findKMers :: Int -> [BS.ByteString]
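
One way a ByteString-based k-mer finder could look is sketched below. The function name kmersBS and its full signature are assumptions made for illustration (the signature in the message above is cut off), and the import is written qualified to avoid name clashes with the Prelude. Since take and drop on a strict ByteString return slices that share the original buffer, no k-mer is copied.

```haskell
import qualified Data.ByteString.Char8 as BS

-- | All overlapping windows of length k, in order of position.
-- Returns [] when k is not positive or the input is shorter than k.
kmersBS :: Int -> BS.ByteString -> [BS.ByteString]
kmersBS k s
  | k <= 0    = []
  | otherwise = [ BS.take k (BS.drop i s) | i <- [0 .. BS.length s - k] ]
```

For example, kmersBS 3 applied to the sequence ACGTA yields the three windows ACG, CGT and GTA.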