I have an idea to deduplicate and combine the k-mers of many (extremely large)
FASTA files for a project. I have a proof of concept in Perl that runs
fairly quickly because it can cache the open file handles (at most
1024, i.e. 4^5, as I'm using 5-mers as the bins).
I can get a Has
I would suggest opening all files at the beginning. Generate all k-mers,
open a file for each, and make a HashMap from k-mer to file handle. When
you're done, run through the HashMap to close all file handles.
Also, this lets you keep your sequences/k-mers/hash keys as byte strings
for iterating
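
A minimal sketch of the "open every bin up front" suggestion above, under a few assumptions that are not from the thread: the output directory and the `<kmer>.bin` file naming are hypothetical, the alphabet is taken to be plain "ACGT", and `Data.Map.Strict` (from containers) stands in for the HashMap mentioned in the message so the example depends only on packages that ship with GHC. The function names `allKMers`, `openBins`, `writeToBin` and `closeBins` are illustrative, not part of any library.

```haskell
import           Control.Monad         (replicateM)
import qualified Data.ByteString.Char8 as BS
import qualified Data.Map.Strict       as Map
import           System.FilePath       ((</>))
import           System.IO             (Handle, IOMode (WriteMode), hClose, openFile)

-- | Every k-mer over the DNA alphabet "ACGT"; there are 4^k of them
-- (1024 for k = 5).
allKMers :: Int -> [BS.ByteString]
allKMers k = map BS.pack (replicateM k "ACGT")

-- | Open one output file per bin before any input is read.
-- The directory and the "<kmer>.bin" naming are hypothetical.
openBins :: FilePath -> Int -> IO (Map.Map BS.ByteString Handle)
openBins dir k = Map.fromList <$> mapM open (allKMers k)
  where
    open kmer = do
      h <- openFile (dir </> BS.unpack kmer ++ ".bin") WriteMode
      return (kmer, h)

-- | Write a k-mer to the bin selected by its first binLen characters.
-- K-mers whose prefix is not a known bin (for example one containing N)
-- are silently skipped in this sketch.
writeToBin :: Map.Map BS.ByteString Handle -> Int -> BS.ByteString -> IO ()
writeToBin bins binLen kmer =
  case Map.lookup (BS.take binLen kmer) bins of
    Just h  -> BS.hPutStrLn h kmer
    Nothing -> return ()

-- | Close every handle once all input has been processed.
closeBins :: Map.Map BS.ByteString Handle -> IO ()
closeBins = mapM_ hClose
```

With 5-mer bins this keeps 1024 handles open at once, so the per-process open-file limit (`ulimit -n` on Unix-like systems) may need to be raised, since stdin, stdout and stderr also count against it.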
Could You provide some larger test cases that You mention?
On Sunday, July 26, 2015, Youens-Clark, Charles Kenneth - (kyclark) <
kycl...@email.arizona.edu> wrote:
> I have an idea to unique and combine the k-mers of many (extremely large)
> FASTA files for a project. I have a proof of concept wo
On Jul 25, 2015, at 5:21 PM, Michał J Gajda wrote:
>
> Could You provide some larger test cases that You mention?
This is a medium-sized set of the Pacific Ocean Virome:
http://mirrors.iplantcollaborative.org/browse/iplant/home/shared/imicrobe/pov/fasta/reads
I could put up another se
On Jul 25, 2015, at 10:43 AM, Nicholas Ingolia wrote:
>
> I would suggest opening all files at the beginning. Generate all k-mers,
> open a file for each, and make a HashMap from k-mer to file handle. When
> you're done, run through the HashMap to close all file handles.
I really like this idea
Hi Ken,
I'd suggest using ByteString instead of String with Sequence at all times
- it is much more efficient and uses less memory.
Just:
import qualified Data.ByteString.Char8 as BS
and instead of findKMers 20 . toString, write findKMersBS 20 with the
following signature:
findKMersBS :: Int -> BS.ByteString -> [BS.ByteString]
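
One way such a ByteString-based k-mer function could look is sketched below. The message only gives a signature, so the body is an assumption: it takes the sequence as a second `BS.ByteString` argument and returns every overlapping window of length k, using `BS.take` and `BS.drop`, which slice a strict ByteString without copying.

```haskell
import qualified Data.ByteString.Char8 as BS

-- | All overlapping k-mers of a sequence, left to right.
-- Returns [] when k <= 0 or the sequence is shorter than k.
-- Each result is a slice that shares the input's buffer.
findKMersBS :: Int -> BS.ByteString -> [BS.ByteString]
findKMersBS k sq
  | k <= 0    = []
  | otherwise = [ BS.take k (BS.drop i sq) | i <- [0 .. BS.length sq - k] ]

-- Example:
--   map BS.unpack (findKMersBS 3 (BS.pack "ACGTA"))
--   == ["ACG","CGT","GTA"]
```

Because the slices share memory with the original sequence, holding on to a few k-mers keeps the whole read alive; `BS.copy` can be applied to a k-mer if it needs to outlive its source.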