Re: [OT] Re: How to read fastly files ( I/O operation)

2013-12-18 Thread Jay Norwood
On Friday, 8 February 2013 at 06:22:18 UTC, Denis Shelomovskij 
wrote:

On 06.02.2013 19:40, bioinfornatics wrote:
On Wednesday, 6 February 2013 at 13:20:58 UTC, bioinfornatics 
wrote:
I agree the spec format is really bad, but it is heavily used 
in biology, so I would like a fast parser in order to develop 
some D applications instead of using C++.


Yes, let's also create 1 GiB XML files and ask for fast 
encoding/decoding!


The situation can be improved only if:
1. We find and kill every text format creator;
2. We create a really good binary format for each such task and 
support it in every application we create. Then, after some 
time, text formats will simply die out through evolution, as 
everything will support the better formats.


(the second proposal is a real recommendation)


There is a binary resource format for EMF models, which normally 
use XML files, with some timing improvements reported at the 
link below.  It might be worth looking at if you are thinking 
about writing your own binary format.

http://www.slideshare.net/kenn.hussey/performance-and-extensibility-with-emf

There is also a fast binary compression library named blosc that 
is used in some Python utilities, measured and presented at the 
link below, showing that it is faster than doing a memcpy if you 
have multiple cores.

http://blosc.pytables.org/trac

On the sequential accesses ... I found that Windows writes blocks 
of data all over the place, but the best way to get it to write 
something in more contiguous locations is to modify the file 
output routines to specify write-through.  The sequential 
accesses didn't improve read times on the SSD.


Most of the decent SSDs can read big files at 300 MB/sec or more 
now, and you can RAID 0 a few of them and read 800 MB/sec.




Re: How to read fastly files ( I/O operation)

2013-12-18 Thread Jay Norwood
On Wednesday, 13 February 2013 at 17:39:11 UTC, monarch_dodra 
wrote:
On Tuesday, 12 February 2013 at 22:06:48 UTC, monarch_dodra 
wrote:
On Tuesday, 12 February 2013 at 21:41:14 UTC, bioinfornatics 
wrote:


Sometimes FASTQ files are compressed to gz, bz2 or xz, as they 
are often huge files.
Maybe we need to keep this in mind early in development and use 
std.zlib.


While working on making the parser multi-threaded compatible, 
I was able to separate the part that feeds data from the part 
that parses data.


Long story short, the parser operates on an input range of 
ubyte[]: it is no longer responsible for acquiring the data.


The range can be a simple (wrapped) File, a byChunk, an 
asynchronous file reader, a zip decompressor, or just stdin, 
I guess. The range can be transient.


However, now that you mention it, I'll make sure it is 
correctly supported.


I'll *try* to show you what I have so far tomorrow (in about 
18h).


Yeah... I played around too much, and the file is dirtier than 
ever.


The good news is that I was able to test out what I was telling 
you about: accepting any range is OK:


I used your ZFile range to plug it into my parser: I can now 
parse zipped files directly.


The good news is that now I'm not bottlenecked by IO anymore! 
The bad news is that I'm now bottlenecked by the CPU 
decompressing. But since I'm using dmd, you may get better 
results with LDC or GDC.


In any case, I am now parsing the 6 GiB file packed into 1.5 GiB 
in about 53 seconds (down from 61). I also tried a dual-threaded 
approach (1 thread to unzip, 1 thread to parse), but again, the 
actual *parse* phase is so ridiculously fast that it changes 
*nothing* in the final result.


Long story short: 99% of the time is spent acquiring data. The 
last 1% is just copying it into local buffers.


The last good news, though, is that a CPU bottleneck is always 
better than an IO bottleneck. If you have multiple cores, you 
should be able to run multiple *instances* (not threads), and 
be able to process several files at once, multiplying your 
throughput.


I modified the library unzip to make a parallel unzip a while 
back (at the link below).  The execution time scaled very well 
with the number of CPUs for the test case I was using, which was 
a 2 GB unzipped distribution containing many small files and 
subdirectories.  The parallel operations were by file.  I think 
the execution time gains on SSD drives were from having multiple 
cores scheduling the writes to separate files in parallel.

https://github.com/jnorwood/file_parallel/blob/master/unzip_parallel.d
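
As an illustration only (not the code at that link), a rough 
sketch of the per-file pattern using std.zip and std.parallelism; 
it skips the splitting and synchronization details that the real 
unzip_parallel.d deals with:

import std.algorithm : endsWith;
import std.file : mkdirRecurse, read, write;
import std.parallelism : parallel;
import std.path : buildPath, dirName;
import std.zip : ZipArchive;

void unzipParallel(string zipPath, string destDir)
{
    // One archive object; each member is expanded and written by
    // whichever worker thread picks it up.
    auto archive = new ZipArchive(read(zipPath));
    foreach (member; parallel(archive.directory.values))
    {
        if (member.name.endsWith("/"))      // skip directory entries
            continue;
        auto target = buildPath(destDir, member.name);
        mkdirRecurse(dirName(target));
        write(target, archive.expand(member)); // decompress + write one file
    }
}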




Re: How to read fastly files ( I/O operation)

2013-02-22 Thread monarch_dodra

On Friday, 22 February 2013 at 08:53:35 UTC, bioinfornatics wrote:

Arf, I am still on dmdfe 2.060.


AFAIK, the problems are mostly the nothrows, and maybe 1 or 2 
new-style alias declarations.


That said, what's stopping you from upgrading? We are at 2.062 
right now. Does upgrading break anything for you?


Re: How to read fastly files ( I/O operation)

2013-02-19 Thread monarch_dodra
On Thursday, 14 February 2013 at 18:31:35 UTC, bioinfornatics 
wrote:


Mr. Bio, what use cases will you be interested in, other than 
those counters?


Some ideas, such as:
letter counting
rename identifiers
trim sequences below a quality-value cutoff
convert to a binary format
convert to FASTA + SFF
merge close sequences into one consensus
create a de Bruijn graph
more ideas later


OK. I posted the parser here:
http://dpaste.dzfl.pl/37b893ed

This runs on 2.061. I'll have to make a few changes if you 
need it to run on 2.060, to get around some 2.060-specific bugs.


This contains strictly only the parser. If you want, I'll post 
the async file reading stuff I wrote to interface with it.


The example sections should give you a quick idea of how to use 
it.


Tell me what you think about it.


Re: How to read fastly files ( I/O operation)

2013-02-14 Thread bioinfornatics


Mr. Bio, what use cases will you be interested in, other than 
those counters?


Some ideas, such as:
letter counting
rename identifiers
trim sequences below a quality-value cutoff
convert to a binary format
convert to FASTA + SFF
merge close sequences into one consensus
create a de Bruijn graph
more ideas later


Re: How to read fastly files ( I/O operation)

2013-02-13 Thread monarch_dodra

On Tuesday, 12 February 2013 at 22:06:48 UTC, monarch_dodra wrote:
On Tuesday, 12 February 2013 at 21:41:14 UTC, bioinfornatics 
wrote:


Sometimes FASTQ files are compressed to gz, bz2 or xz, as they 
are often huge files.
Maybe we need to keep this in mind early in development and use 
std.zlib.


While working on making the parser multi-threaded compatible, I 
was able to separate the part that feeds data from the part 
that parses data.


Long story short, the parser operates on an input range of 
ubyte[]: it is no longer responsible for acquiring the data.


The range can be a simple (wrapped) File, a byChunk, an 
asynchronous file reader, a zip decompressor, or just stdin, 
I guess. The range can be transient.


However, now that you mention it, I'll make sure it is 
correctly supported.


I'll *try* to show you what I have so far tomorrow (in about 
18h).


Yeah... I played around too much, and the file is dirtier than 
ever.


The good news is that I was able to test out what I was telling 
you about: accepting any range is OK:


I used your ZFile range to plug it into my parser: I can now 
parse zipped files directly.


The good news is that now I'm not bottlenecked by IO anymore! 
The bad news is that I'm now bottlenecked by the CPU decompressing. 
But since I'm using dmd, you may get better results with LDC or 
GDC.


In any case, I am now parsing the 6 GiB file packed into 1.5 GiB 
in about 53 seconds (down from 61). I also tried a dual-threaded 
approach (1 thread to unzip, 1 thread to parse), but again, the 
actual *parse* phase is so ridiculously fast that it changes 
*nothing* in the final result.


Long story short: 99% of the time is spent acquiring data. The 
last 1% is just copying it into local buffers.


The last good news, though, is that a CPU bottleneck is always 
better than an IO bottleneck. If you have multiple cores, you 
should be able to run multiple *instances* (not threads), and be 
able to process several files at once, multiplying your throughput.


Re: How to read fastly files ( I/O operation)

2013-02-13 Thread FG

On 2013-02-13 18:39, monarch_dodra wrote:

In any case, I am now parsing the 6 GiB file packed into 1.5 GiB in about 53 seconds
(down from 61). I also tried a dual-threaded approach (1 thread to unzip,
1 thread to parse), but again, the actual *parse* phase is so ridiculously fast
that it changes *nothing* in the final result.


Great. Performance aside, we didn't talk much about how this data will be used 
- should it only be read sequentially forward or both ways, would there be a 
need to place some markers or slice the sequence, etc. Our small test case was 
only about counting nucleotides, so reading order and the possibility of further 
processing were irrelevant.


Mr. Bio, what use cases will you be interested in, other than those counters?



Re: How to read fastly files ( I/O operation)

2013-02-12 Thread monarch_dodra
On Tuesday, 12 February 2013 at 12:02:59 UTC, bioinfornatics 
wrote:

Instead of using memcpy I tried slicing (~ line 136):
_hardBuffer[0 .. moveSize] = _hardBuffer[_bufPosition .. moveSize + _bufPosition];


I get the same performance.


I think I figured out why I'm getting different results than you 
guys are, on my Windows machine.


AFAIK, file reads on Windows are natively asynchronous.

I wrote a multi-threaded version of the parser, with a thread 
dedicated to reading the file, while the main thread parses the 
read buffers.


I'm getting EXACTLY 0% performance improvement. Not better, not 
worse, just 0%.


I'd have to try again on my SSD. Right now, I'm parsing the 
6 GiB file in 60 seconds, which is the limit of my HDD. As a 
matter of fact, just *reading* the file takes the EXACT same 
amount of time as parsing it...


This takes 60 seconds.
//
auto input = File(args[1], "rb");
ubyte[] buffer = new ubyte[](BufferSize);
do{
    buffer = input.rawRead(buffer);
}while(buffer.length);
//

This takes 60 seconds too.
//
Parser parser = new Parser(args[1]);
foreach(q; parser)
    foreach(char c; q.sequence)
        globalNucleic.collect(c);
//

So at this point, I'd need to test on my Linux box, or publish 
the code so you can tell me how I'm doing.


I'm still tweaking the code to publish something readable, as 
there is a lot of sketchy code right now.


I'm also implementing correct exception handling, so that if 
there is an erroneous entry, an exception is thrown. However, all 
the erroneous data is parsed out of the file and placed inside 
the exception. This means that:

a) You can inspect the erroneous data
b) You can skip the erroneous data, and parse the rest of the 
file.


Once I deliver the code with the multi-threaded code activated, 
you should get some better performance on Linux.


When 1.0 is ready, I'll create a GitHub project for it, so work 
can be done on it in parallel.


Re: How to read fastly files ( I/O operation)

2013-02-12 Thread bioinfornatics

On Tuesday, 12 February 2013 at 12:45:26 UTC, monarch_dodra wrote:
On Tuesday, 12 February 2013 at 12:02:59 UTC, bioinfornatics 
wrote:

Instead of using memcpy I tried slicing (~ line 136):
_hardBuffer[0 .. moveSize] = _hardBuffer[_bufPosition .. moveSize + _bufPosition];


I get the same performance.


I think I figured out why I'm getting different results than 
you guys are, on my Windows machine.


AFAIK, file reads on Windows are natively asynchronous.

I wrote a multi-threaded version of the parser, with a thread 
dedicated to reading the file, while the main thread parses the 
read buffers.


I'm getting EXACTLY 0% performance improvement. Not better, not 
worse, just 0%.


I'd have to try again on my SSD. Right now, I'm parsing the 
6 GiB file in 60 seconds, which is the limit of my HDD. As 
a matter of fact, just *reading* the file takes the EXACT same 
amount of time as parsing it...


This takes 60 seconds.
//
auto input = File(args[1], "rb");
ubyte[] buffer = new ubyte[](BufferSize);
do{
    buffer = input.rawRead(buffer);
}while(buffer.length);
//

This takes 60 seconds too.
//
Parser parser = new Parser(args[1]);
foreach(q; parser)
    foreach(char c; q.sequence)
        globalNucleic.collect(c);
//

So at this point, I'd need to test on my Linux box, or publish 
the code so you can tell me how I'm doing.


I'm still tweaking the code to publish something readable, as 
there is a lot of sketchy code right now.


I'm also implementing correct exception handling, so that if 
there is an erroneous entry, an exception is thrown. However, 
all the erroneous data is parsed out of the file and placed 
inside the exception. This means that:

a) You can inspect the erroneous data
b) You can skip the erroneous data, and parse the rest of the 
file.


Once I deliver the code with the multi-threaded code activated, 
you should get some better performance on Linux.


When 1.0 is ready, I'll create a GitHub project for it, so 
work can be done on it in parallel.


About the threaded version: it should be possible to get the 
file size and split the file across several threads.
Use fseek to read to the end of the current section and return 
that position, to detect where each split should end.
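
As an illustration of that idea, a rough sketch (assuming a 
line-oriented file where a record header starts with '@' on a 
fresh line; real FASTQ needs more care, since '@' can also start 
a quality line):

import std.stdio : File;

// Compute per-thread [start, end) byte offsets: cut the file into
// nearly equal slices, then push each cut forward to the next line
// that looks like a record header.
ulong[] splitOffsets(string path, uint parts)
{
    auto f = File(path, "rb");
    ulong size = f.size;
    auto offsets = new ulong[parts + 1];
    offsets[0] = 0;
    offsets[$ - 1] = size;
    foreach (i; 1 .. parts)
    {
        f.seek(size * i / parts);
        char[] line;
        f.readln(line);                     // skip the partial line we landed in
        while (f.readln(line) > 0 && line[0] != '@')
        {
            // keep reading until the next record header
        }
        offsets[i] = f.tell - line.length;  // start of that header line
    }
    return offsets;
}

Each thread can then seek to its own offset and parse up to the 
next one.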


Re: How to read fastly files ( I/O operation)

2013-02-12 Thread monarch_dodra
On Tuesday, 12 February 2013 at 21:41:14 UTC, bioinfornatics 
wrote:


Sometimes FASTQ files are compressed to gz, bz2 or xz, as they are 
often huge files.
Maybe we need to keep this in mind early in development and use 
std.zlib.


While working on making the parser multi-threaded compatible, I 
was able to separate the part that feeds data from the part that 
parses data.


Long story short, the parser operates on an input range of 
ubyte[]: it is no longer responsible for acquiring the data.


The range can be a simple (wrapped) File, a byChunk, an 
asynchronous file reader, a zip decompressor, or just stdin, I 
guess. The range can be transient.
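
As an illustration (not the actual posted parser), a minimal 
sketch of that decoupling, with a hypothetical parseRecords that 
only requires an input range of ubyte[] chunks:

import std.range : ElementType, isInputRange;
import std.stdio : File, stdin;

// Hypothetical parser entry point: it only consumes chunks, so the
// caller decides where the bytes come from (file, decompressor, ...).
void parseRecords(Range)(Range chunks)
    if (isInputRange!Range && is(ElementType!Range : const(ubyte)[]))
{
    foreach (const(ubyte)[] chunk; chunks)
    {
        // ... feed the chunk to the record parser ...
    }
}

void main(string[] args)
{
    // A plain file, read by chunk:
    parseRecords(File(args[1], "rb").byChunk(64 * 1024));
    // Or stdin, with the exact same parser:
    // parseRecords(stdin.byChunk(64 * 1024));
}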


However, now that you mention it, I'll make sure it is correctly 
supported.


I'll *try* to show you what I have so far tomorrow (in about 18h).


Re: How to read fastly files ( I/O operation)

2013-02-09 Thread bioinfornatics

Some ideas, such as:
letter counting
rename identifiers
trim sequences below a quality-value cutoff
convert to a binary format
more ideas later


Re: How to read fastly files ( I/O operation)

2013-02-07 Thread FG

On 2013-02-07 08:26, monarch_dodra wrote:

You have timed the same file SRR077487_1.filt.fastq at 67s?


Yes, that file exactly. That said, I'm working on an SSD, so maybe I'm less IO
bound than you are?


Ah, now that you mention SSD, I moved the file onto one and it's even more
clear that I am CPU-bound here on the Intel E6600 system. Compare:

7200rpm: MS 4m30s / FG 1m55s
SSD: MS 4m14s / FG 1m44s

Almost the same, but running the utility wc -l on the file gives:

7200rpm: 1m45s
SSD: 0m33s

In my case threads would be beneficial, but only when using the SSD.
Reading the file by chunk in D takes 33s on the SSD and 1m44s on the HDD.
Slicing the file in half and reading it in two threads would also
be fine only on the SSD, because on an HDD I'd lose sequential disk
reads jumping between threads (expecting lower performance).

Therefore - threads: yes, but gotta use an SSD. :)
Also, threads: yes, if there's gonna be more processing than just
counting letters.


Re: How to read fastly files ( I/O operation)

2013-02-07 Thread monarch_dodra
On Thursday, 7 February 2013 at 14:30:11 UTC, bioinfornatics 
wrote:

A little feedback:
I named FG's script f and monarch's script monarch.

 gdmd -O -w -release f.d
~ $ time ./f bigFastq.fastq
['T':999786820, 'A':1007129068, 'N':39413, 'C':1350576504, 
'G':1353023772]


real    2m14.966s
user    0m47.168s
sys     0m15.379s
~ $ gdmd -O -w -release monarch.d
monarch.d:117: no identifier for declarator Lines
monarch.d:117: alias cannot have initializer
monarch.d:130: identifier or integer expected, not assert


I haven't taken the time to look into it further.

But in any case it seems memory-mapped files are really slow, 
whereas they are said to be the fastest way to read a file. 
Creating an index where just reading the file takes 12 min is 
useless, since reading and computing should only need 2 min.


You must be using dmd 2.060. I'm using some 2.061 features, 
namely the new-style alias.


Just change line 117:
alias Lines = typeof(File.init.byLine());
to
alias typeof(File.init.byLine()) Lines;

As for line 130, it's a version(assert), i.e. code that does not 
get executed in release mode. Just remove the version(assert); if 
the code gets executed, it is not a big deal.


In any case, I think the code is mostly a proof of concept; I 
wouldn't use it as-is.




BTW, I've started working on my library. How would users expect 
the quality data to be served? As an array of characters, or as an 
array of integral values (ubytes)?


Re: How to read fastly files ( I/O operation)

2013-02-07 Thread bioinfornatics

On Thursday, 7 February 2013 at 14:42:57 UTC, monarch_dodra wrote:
On Thursday, 7 February 2013 at 14:30:11 UTC, bioinfornatics 
wrote:

A little feedback:
I named FG's script f and monarch's script monarch.

gdmd -O -w -release f.d
~ $ time ./f bigFastq.fastq
['T':999786820, 'A':1007129068, 'N':39413, 'C':1350576504, 
'G':1353023772]


real    2m14.966s
user    0m47.168s
sys     0m15.379s
~ $ gdmd -O -w -release monarch.d
monarch.d:117: no identifier for declarator Lines
monarch.d:117: alias cannot have initializer
monarch.d:130: identifier or integer expected, not assert


I haven't taken the time to look into it further.

But in any case it seems memory-mapped files are really slow, 
whereas they are said to be the fastest way to read a file. 
Creating an index where just reading the file takes 12 min is 
useless, since reading and computing should only need 2 min.


You must be using dmd 2.060. I'm using some 2.061 features, 
namely the new-style alias.


Just change line 117:
alias Lines = typeof(File.init.byLine());
to
alias typeof(File.init.byLine()) Lines;

As for line 130, it's a version(assert), i.e. code that does not 
get executed in release mode. Just remove the version(assert); if 
the code gets executed, it is not a big deal.


In any case, I think the code is mostly a proof of concept; I 
wouldn't use it as-is.




BTW, I've started working on my library. How would users expect 
the quality data to be served? As an array of characters, or as 
an array of integral values (ubytes)?


ubyte, as a number is maybe easier to understand and to cut off 
at some value.


[OT] Re: How to read fastly files ( I/O operation)

2013-02-07 Thread Denis Shelomovskij

On 06.02.2013 19:40, bioinfornatics wrote:

On Wednesday, 6 February 2013 at 13:20:58 UTC, bioinfornatics wrote:
I agree the spec format is really bad, but it is heavily used in biology,
so I would like a fast parser in order to develop some D applications
instead of using C++.


Yes, let's also create 1 GiB XML files and ask for fast encoding/decoding!

The situation can be improved only if:
1. We find and kill every text format creator;
2. We create a really good binary format for each such task and 
support it in every application we create. Then, after some time, 
text formats will simply die out through evolution, as everything 
will support the better formats.


(the second proposal is a real recommendation)

--
Денис В. Шеломовский
Denis V. Shelomovskij


Re: How to read fastly files ( I/O operation)

2013-02-06 Thread monarch_dodra
On Wednesday, 6 February 2013 at 10:43:02 UTC, bioinfornatics 
wrote:
Instead of calling mmFile's opIndex to read ubyte by ubyte, I 
tried putting the data into a buffer array of length PAGESIZE.


code here: http://dpaste.dzfl.pl/25ee34fc

and it is not faster: to parse 12 GB I need 11 minutes. I do not 
see how I could read the file any faster!


As a reminder, fastxtoolkit needs 2 min!


This might be stupid, but I see a writeln in your inner loop. 
You aren't slowed down just by your console by any chance?


If I were you, I'd start benchmarking to try and see what is 
slowing you down.


I'd reorganize the code to parse a file that is, say, 512 MB. The 
rationale being that you can place it entirely in memory at once. 
Then, I'd shift the logic from "fully process each character 
before moving to the next character" to "make a full processing 
pass on the entire data structure before moving to the next pass".


The steps I see that need to be measured are:

* Raw read of file
* Iterating on your file to extract it as a raw array of Data 
objects

* Processing the Data objects
* Outputting the data

Also (of course), you need to make sure you are compiling in 
release mode (might sound obvious, but you never know). Are you 
using dmd? I heard the other compilers are faster.


I'm going to try and see with some example files if I can't get 
something running faster.


Re: How to read fastly files ( I/O operation)

2013-02-06 Thread monarch_dodra
On Wednesday, 6 February 2013 at 11:15:22 UTC, monarch_dodra 
wrote:
I'm going to try and see with some example files if I can't get 
something running faster.


Benchmarking and tweaking, I was able to find 3 things that 
speed up your program:


1) Make computeLocal a compile-time constant. This will give 
you a tiny bit of performance. It depends on whether you plan to 
make it a run-time argument switch, I guess.


2) Makes things about 10%-20% faster:
Your nucleic and amino hash tables map a character to an 
index. However, given the range of the characters ('A' to 'Z'), 
you are better off using a flat array, where each index 
represents a character, e.g. A is index 0, B is index 1. This 
way, a lookup is a simple array indexing, as opposed to a hash 
table lookup (see the sketch after point 3).


You may even get a bigger bang for your buck by simply giving 
your _stats structure a simple "A is index 0, B is index 1" 
layout, and only re-ordering the data at the end, when you want 
to read it. (I haven't done this though.)


3) Makes things about 100% faster (ran in half the time on my 
machine): I don't know how mmFile works, but a simple File + 
rawRead seems to get the job done fast. Also, instead of 
keeping track of one (or several) indexes, I merely keep a single 
slice. The only thing I care about is whether my slice is empty, 
in which case I re-fill it.
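
As a tiny illustration of the flat-array lookup from point 2 
(counting on 'A'..'Z' with a fixed-size array instead of an 
associative array):

import std.stdio : writefln;

void main()
{
    auto sequence = "GATTACANNN";
    ulong[26] counts;                  // one slot per letter 'A'..'Z'
    foreach (char c; sequence)
    {
        if (c >= 'A' && c <= 'Z')
            ++counts[c - 'A'];         // plain array indexing, no hashing
    }
    foreach (i, n; counts)
        if (n)
            writefln("%s: %s", cast(char)('A' + i), n);
}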


The modified code is here. I'm apparently getting the same output 
you are, but that doesn't mean there might not be bugs in it. For 
example, I noticed that you don't strip leading whitespace, if 
any, before the first read.

http://dpaste.dzfl.pl/9b9353b8


I'd be tempted to re-write the parser using a byLine approach, 
since my quick reading about FASTQ seems to imply it is a 
line-based format. Or just plain try to write a parser from 
scratch, putting my own logic and thought into it (all I did was 
modify your code, without caring about the actual algorithm).


Re: How to read fastly files ( I/O operation)

2013-02-06 Thread bioinfornatics

I use both GDC and LDC with the -w -O -release flags.

The writeln inside the loop is never evaluated, as the computeLocal 
boolean is always false.


Thanks in any case; I keep reading all your answers :-)


Re: How to read fastly files ( I/O operation)

2013-02-06 Thread bioinfornatics
On Wednesday, 6 February 2013 at 13:20:58 UTC, bioinfornatics 
wrote:

I use both GDC and LDC with the -w -O -release flags.

The writeln inside the loop is never evaluated, as the computeLocal 
boolean is always false.


Thanks in any case; I keep reading all your answers :-)


Just to add more information about FASTQ:
http://www.biomedsearch.com/nih/Sanger-FASTQ-file-format-sequences/20015970.html

And here is a set of FASTQ files on which a parser should succeed or fail:
http://www.biomedsearch.com/attachments/00/20/01/59/20015970/gkp1137_nar-02248-d-2009-File005.gz


The problem is that a sequence line could be split over several 
lines, and the same goes for the quality line. And I think 
whitespace is allowed in these lines.


The @ is used to identify an identifier line and the + is used to 
identify a description line, but these characters can also appear 
as a quality value (ubyte).
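
For illustration, a minimal made-up record showing the ambiguity 
(the quality line here happens to start with '@', the same 
character that marks identifier lines):

// A hypothetical FASTQ record as a D string literal; note the
// quality line starting with '@'.
enum sampleRecord =
    "@SEQ_1 some description\n" ~
    "GATTTGGGGTTCAAAGCAGT\n" ~
    "+\n" ~
    "@''*((((***+))%%%++)\n";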

I agree the spec format is really bad, but it is heavily used in 
biology, so I would like a fast parser in order to develop some D 
applications instead of using C++.


I will try all the previous recommendations later; thanks to all.

It seems that, in any case, it is not easy to parse a file fast in D.

Note: is it possible to lock a file, so as to be able to use pure methods?


Re: How to read fastly files ( I/O operation)

2013-02-06 Thread monarch_dodra
On Wednesday, 6 February 2013 at 15:40:39 UTC, bioinfornatics 
wrote:

It seems that, in any case, it is not easy to parse a file fast in D.


I don't think that's true. D provides the same FILE primitive 
you'd get in C, so there is no reason for it to be slower than C.


It is the range approach that, as convenient as it is, is not 
well adapted for certain things.


As I had said, I tried to write my own program. In it, I devised 
a range that, instead of exposing things to parse character by 
character, parses an entire object (a ... genome ... maybe? I 
called them "Q" in my program) at once into an object. I 
decided to use the very simple byLine primitive.


From there, you can query the object for its 
name/sequence/quality. The irony is that by parsing twice (once 
to do the IO read, once to do the actual processing of the text), 
and taking into account that I'm allocating each object 
individually, I'm running twice as fast as my original, already 
improved, implementation. Not only is it faster, it is also more 
convenient, since you can extract an entire Q object at once, and 
then operate on it as you please: separation of algorithm and 
parsing.
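
As an illustration only (not the posted code), a minimal sketch 
of that record-at-a-time idea built on byLine, assuming the 
simple four-line record layout:

import std.stdio : File;

// Illustrative record object: one parsed entry with its three fields.
struct Q
{
    string name;
    string sequence;
    string quality;
}

// Naive extraction: assumes one line per field, no wrapped lines.
Q[] readRecords(string path)
{
    Q[] result;
    auto lines = File(path, "r").byLine();
    while (!lines.empty)
    {
        Q q;
        q.name = lines.front.idup;     lines.popFront(); // @identifier line
        q.sequence = lines.front.idup; lines.popFront(); // bases
        lines.popFront();                                // '+' separator line
        q.quality = lines.front.idup;  lines.popFront(); // quality string
        result ~= q;
    }
    return result;
}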


It correctly takes into account that a sequence can span multiple 
lines. It does not strip whitespace because, according to 
http://maq.sourceforge.net/fastq.shtml, whitespace is not a legal 
character.


Now, keep in mind that this approach allocates (3) new strings 
for each Q. You could *try* an approach with a pre-allocated, 
re-usable buffer. This would mean you can only operate on 1 Q at 
a time, but you'd probably iterate over them faster.


In any case, you can try it out:
http://dpaste.dzfl.pl/8bdd0c84



Re: How to read fastly files ( I/O operation)

2013-02-06 Thread monarch_dodra
On Wednesday, 6 February 2013 at 16:06:20 UTC, monarch_dodra 
wrote:
It correctly takes into account that a sequence can span multiple 
lines. It does not strip whitespace because, according to 
http://maq.sourceforge.net/fastq.shtml, whitespace is not a 
legal character.


Hum, just read your example files. I guess you can have 
whitespace. In any case, that should not pose any real problem; 
http://dlang.org/phobos/std_string.html#.removechars 
would come in handy here.
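
For example (a hypothetical quality line with stray blanks):

import std.stdio : writeln;
import std.string : removechars;

void main()
{
    auto line = "IIII IIHH\tGG FF";         // made-up data with blanks
    writeln(removechars(line, " \t\r\n"));  // prints IIIIIIHHGGFF
}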


Re: How to read fastly files ( I/O operation)

2013-02-06 Thread FG

On 2013-02-04 15:04, bioinfornatics wrote:

I am looking to parse a huge file efficiently, but I think D is lacking for this 
purpose.
To parse 12 GB I need 11 minutes, whereas fastxtoolkit (written in C++) needs 
2 min.


Haven't compared to fastxtoolkit, but I have some code for you.
I have processed the file SRR077487_1.filt.fastq from
ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/data/HG00096/sequence_read/
and it expects this syntax (no multiline sequences or whitespace).
The file takes up almost 6 GB; processing took 1m45s - twice as fast as the
fastest D solution so far -- all compiled with gdc -O3.
I bet your computer has better specs than mine.

The program uses a buffer that should be twice the size of the largest sequence
record (counting id, comment and quality data). A chunk of the file is read,
then records are scanned in the buffer until the record start pointer passes
the middle of the buffer -- then memcpy is used to move all the rest to
the beginning of the buffer, and the remaining space at the end is filled with
another chunk read from the file.
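
A condensed sketch of that buffer-recycling idea (line counting 
only, so it stays short; the real record scanning is in the 
dpaste link below):

import core.stdc.string : memmove;
import std.stdio : File, writeln;

void scanLines(string path, size_t bufSize)
{
    auto f = File(path, "rb");
    auto buf = new ubyte[bufSize];
    size_t filled = f.rawRead(buf).length;  // initial chunk
    size_t pos = 0;                         // current scan position
    size_t lines = 0;

    while (pos < filled)
    {
        if (buf[pos] == '\n')
            ++lines;
        ++pos;

        // Past the middle of a full buffer: slide the unscanned tail
        // to the front and top it up with the next chunk from the file.
        if (pos > filled / 2 && filled == buf.length)
        {
            size_t rest = filled - pos;
            memmove(buf.ptr, buf.ptr + pos, rest);
            filled = rest + f.rawRead(buf[rest .. $]).length;
            pos = 0;
        }
    }
    writeln(lines, " lines");
}

void main(string[] args) { scanLines(args[1], 1 << 20); }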

Data contains both the sequence letters and the associated quality information.
The sequence ID and comment are slices of the buffer, so they hold valid info
only until you move to the next sequence (when the number increments).

This is the code: http://dpaste.1azy.net/8424d4ac
Tell me what timings you can get now.


Re: How to read fastly files ( I/O operation)

2013-02-06 Thread monarch_dodra

On Wednesday, 6 February 2013 at 19:19:52 UTC, FG wrote:

On 2013-02-04 15:04, bioinfornatics wrote:
I am looking to parse a huge file efficiently, but I think D is 
lacking for this purpose.
To parse 12 GB I need 11 minutes, whereas fastxtoolkit (written 
in C++) needs 2 min.


Haven't compared to fastxtoolkit, but I have some code for you.
I have processed the file SRR077487_1.filt.fastq from
ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/data/HG00096/sequence_read/
and it expects this syntax (no multiline sequences or whitespace).
The file takes up almost 6 GB; processing took 1m45s - twice as 
fast as the fastest D solution so far.


Do you mean my solution above? I tried your solution with dmd, 
with -release -O -inline, and both gave about the same result 
(69s yours, 67s mine).


Data contains both the sequence letters and the associated 
quality information.
The sequence ID and comment are slices of the buffer, so they 
hold valid info only until you move to the next sequence (when 
the number increments).


Hum. Mine allocates new slices, so they are never invalidated :)
Mine also takes into account newlines and lowercase sequences.

Still, it seems you and I took different approaches. I had 
mentioned using a re-usable buffer. I'm going to try to consume 
some of your code to see if I can't improve my implementation.


@bioinfornatics

I'm getting really interested in the subject. I'm going to try to 
write an actual library/framework for working with FASTQ files in 
a D environment.


This means I'll try to write robust and usable code, with both 
stability and performance in mind, as opposed to the proofs of 
concept so far.


For now, I'd like to keep it simple: Would something that only 
knows how to parse/write Sanger FASTQ files be of help to you?


If I write something, can I have you review it?


Re: How to read fastly files ( I/O operation)

2013-02-06 Thread bioinfornatics

Thanks monarch and FG,
I will read your code to see where I am failing :-)
And of course, if you are interested in bio formats, I will be 
really happy to work / review together.


In any case, big thanks; that is a very interesting subject.


Re: How to read fastly files ( I/O operation)

2013-02-06 Thread FG

On 2013-02-06 21:43, monarch_dodra wrote:

On Wednesday, 6 February 2013 at 19:19:52 UTC, FG wrote:

I have processed the file SRR077487_1.filt.fastq from
ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/data/HG00096/sequence_read/
and it expects this syntax (no multiline sequences or whitespace).
The file takes up almost 6 GB; processing took 1m45s - twice as fast as the
fastest D solution so far.


Do you mean my solution above? I tried your solution with dmd, with -release -O
-inline, and both gave about the same result (69s yours, 67s mine).


Yes. Maybe the CPU is the bottleneck on my end.
With DMD32 2.060 on win7-64, compiled with the same flags, I got:
MD: 4m30s / FG: 1m55s - both using 100% of one core.
Quite similar results with GDC64.

You have timed the same file SRR077487_1.filt.fastq at 67s?



I'm getting really interested in the subject. I'm going to try to write an actual
library/framework for working with FASTQ files in a D environment.


Those fastq are contagious. ;)


This means I'll try to write robust and usable code, with both stability and
performance in mind, as opposed to the proofs of concept so far.


Yeah, but the big deal was that D is 5.5x slower than C++.

You have mentioned something about using byLine. Well, I would have gladly used
it instead of looking for line ends myself and pushing stuff around with memcpy.
But the thing is that, while the fgets(char *buf, int bufSize, FILE *f) in fastx
is fast at reading a file line by line, using file.readln(buf) is unpredictable. :)
I mean that in DMD it's only a bit slower than file.rawRead(buf), but in GDC it
can be several times slower. For example, just reading in a loop:

import std.stdio;
enum uint bufferSize = 4096 - 16;
void main(string[] args) {
    char[] tmp, buf = new char[bufferSize];
    size_t cnt;
    auto f = File(args[1], "r");
    switch(args[2]) {
        case "raw":
            do tmp = f.rawRead(buf); while (tmp.length);
            break;

        case "readln":
            do cnt = f.readln(buf); while (cnt);
            break;

        default: writeln("Use parameters: filename raw|readln");
    }
}

Tested on a much smaller SRR077487.filt.fastq:
DMD32 -release -O -inline: raw 94ms / readln 450ms
GDC64 -O3: raw 94ms / readln 6.76s

Tested on SRR077487_1.filt.fastq:
DMD32 -release -O -inline: raw 1m44s / readln  1m55s
GDC64 -O3: raw 1m48s / readln 14m16s

Why such a big difference between DMD and GDC (on Windows)?
(Or have I missed some switch in GDC?)



Re: How to read fastly files ( I/O operation)

2013-02-06 Thread Ali Çehreli

On 02/06/2013 12:43 PM, monarch_dodra wrote:

 with dmd, with -release -O -inline

Going off topic a little: in a recent experiment, I noticed that 
adding -inline made a range solution twice as slow. -O -release still 
helped, but -inline was the culprit.


Ali



Re: How to read fastly files ( I/O operation)

2013-02-06 Thread Lee Braiden

On 06/02/13 22:21, bioinfornatics wrote:

Thanks monarch and FG,
i will read your code to see where i failing :-)


I wasn't going to mention this as I thought the CPU usage might be 
trivial, but if both CPU and IO are factors, then it would probably be 
beneficial to have a separate IO thread/task.


I guess you'd need a big task: the task would need to load and return n 
chunks or n lines, rather than just one line at a time, for example, 
and the processing/parsing thread (main thread or otherwise) could then 
churn through that while more IO was done.
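
A rough sketch of that shape using std.concurrency (one spawned 
reader task streaming chunks to the parsing thread; the idup 
copies are there so the buffers can cross threads):

import std.concurrency : ownerTid, receive, send, spawn;
import std.stdio : File, writeln;

// Reader task: stream the file in chunks and hand each one to the owner.
void readerTask(string path)
{
    auto f = File(path, "rb");
    foreach (chunk; f.byChunk(1 << 20))
        ownerTid.send(chunk.idup);   // immutable copy, safe to pass along
    ownerTid.send(true);             // done marker
}

void main(string[] args)
{
    spawn(&readerTask, args[1]);

    ulong bytes;
    bool done;
    while (!done)
    {
        receive(
            (immutable(ubyte)[] chunk) { bytes += chunk.length; /* parse here */ },
            (bool _) { done = true; }
        );
    }
    writeln(bytes, " bytes processed");
}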


It would also depend on the size of the file: no point firing up a 
thread just to read a tiny file that the filesystem can return in a 
millisecond.  If you're talking about 1+ minutes of loading though, a 
thread should definitely help.


Also, if you don't strictly need to parse the file in order, then you 
could divide and conquer it by breaking it into more sections/tasks. For 
example, if you're parsing records, you could split the file in half, 
find the remaining parts of the record in the second half, move them to 
the first, and then process the two halves in two threads.  If you've got 
a nice function to do that split cleanly, and n CPUs, then just call it 
some more.




--
Lee



Re: How to read fastly files ( I/O operation)

2013-02-06 Thread FG

On 2013-02-07 00:41, Lee Braiden wrote:

I wasn't going to mention this as I thought the CPU usage might be trivial, but
if both CPU and IO are factors, then it would probably be beneficial to have a
separate IO thread/task.


This wasn't an issue in my version of the program. It took 1m55s to process the
file, but then again it takes 1m44s just to read it (as shown previously).


Also, if you don't strictly need to parse the file in order, then you could
divide and conquer it by breaking it into more sections/tasks. For example, if
you're parsing records, you could split the file in half, find the remaining
parts of the record in the second half, move them to the first, and then process
the two halves in two threads.  If you've got a nice function to do that split
cleanly, and n CPUs, then just call it some more.


Now, this could make a big difference!
That is, if out-of-order parsing is acceptable in this case.



Re: How to read fastly files ( I/O operation)

2013-02-06 Thread monarch_dodra

On Wednesday, 6 February 2013 at 22:55:14 UTC, FG wrote:

On 2013-02-06 21:43, monarch_dodra wrote:

On Wednesday, 6 February 2013 at 19:19:52 UTC, FG wrote:

I have processed the file SRR077487_1.filt.fastq from
ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/data/HG00096/sequence_read/
and it expects this syntax (no multiline sequences or whitespace).
The file takes up almost 6 GB; processing took 1m45s - twice as 
fast as the fastest D solution so far.


Do you mean my solution above? I tried your solution with dmd, 
with -release -O
-inline, and both gave about the same result (69s yours, 67s 
mine).


Yes. Maybe the CPU is the bottleneck on my end.
With DMD32 2.060 on win7-64, compiled with the same flags, I got:
MD: 4m30s / FG: 1m55s - both using 100% of one core.
Quite similar results with GDC64.

You have timed the same file SRR077487_1.filt.fastq at 67s?


Yes, that file exactly. That said, I'm working on an SSD, so 
maybe I'm less IO bound than you are?


My attempt was mostly to try and see how fast we could go while 
doing it only with high-level stuff (e.g. no f-something calls).


Probably, going lower level and parsing the text manually, 
waiting for magic characters, could yield better results (like 
what you did).


I'm going to also try playing around with threads: Just last week 
I wrote a program that did exactly this (asynchronous file reads).


That said, I'll be making this priority n°2. I'd like to make the 
parser work perfectly first, and in a way that is easily 
upgradeable/usable. Mr. Bio made it perfectly clear that he 
needed support for whitespace and line feeds ;)


How to read fastly files ( I/O operation)

2013-02-04 Thread bioinfornatics

Dear,

I am looking to parse a huge file efficiently, but I think D is 
lacking for this purpose.
To parse 12 GB I need 11 minutes, whereas fastxtoolkit (written in 
C++) needs 2 min.


My code is maybe not simple, as it is not easy to parse a FASTQ 
file, and it is even harder when using a memory-mapped file.


I do not see where I can gain some performance, as I do not do 
many copies and I use mmfile.
fastxtoolkit does not use mmfile and stores its results into a 
struct array for each sequence, but it is still faster!!!


Thanks for any help; I hope we can create a faster parser, 
otherwise it is too slow to use D instead of C++.


Re: How to read fastly files ( I/O operation)

2013-02-04 Thread FG

On 2013-02-04 15:04, bioinfornatics wrote:

I am looking to parse a huge file efficiently, but I think D is lacking for this 
purpose.
To parse 12 GB I need 11 minutes, whereas fastxtoolkit (written in C++) needs 
2 min.

My code is maybe not simple, as it is not easy to parse a FASTQ file, and it is
even harder when using a memory-mapped file.


Why are you using mmap? Don't you just go through the file sequentially?
In that case it should be faster to read in chunks:

foreach (ubyte[] buffer; file.byChunk(chunkSize)) { ... }



Re: How to read fastly files ( I/O operation)

2013-02-04 Thread Dejan Lekic
FG wrote:

 On 2013-02-04 15:04, bioinfornatics wrote:
 I am looking to parse a huge file efficiently, but I think D is lacking for this
 purpose. To parse 12 GB I need 11 minutes, whereas fastxtoolkit (written in C++)
 needs 2 min.

 My code is maybe not simple, as it is not easy to parse a FASTQ file, and it is
 even harder when using a memory-mapped file.
 
 Why are you using mmap? Don't you just go through the file sequentially?
 In that case it should be faster to read in chunks:
 
  foreach (ubyte[] buffer; file.byChunk(chunkSize)) { ... }

I would go even further, and organise the file so that N Data objects fit in one 
page, and read the file page by page. The page size can easily be obtained from 
the system. IMHO that would beat this fastxtoolkit. :)

-- 
Dejan Lekic
dejan.lekic (a) gmail.com
http://dejan.lekic.org


Re: How to read fastly files ( I/O operation)

2013-02-04 Thread monarch_dodra

On Monday, 4 February 2013 at 19:30:59 UTC, Dejan Lekic wrote:

FG wrote:


On 2013-02-04 15:04, bioinfornatics wrote:
I am looking to parse a huge file efficiently, but I think D is 
lacking for this purpose. To parse 12 GB I need 11 minutes, 
whereas fastxtoolkit (written in C++) needs 2 min.

My code is maybe not simple, as it is not easy to parse a FASTQ 
file, and it is even harder when using a memory-mapped file.


Why are you using mmap? Don't you just go through the file 
sequentially?

In that case it should be faster to read in chunks:

 foreach (ubyte[] buffer; file.byChunk(chunkSize)) { ... }


I would go even further, and organise the file so that N Data 
objects fit in one page, and read the file page by page. The page 
size can easily be obtained from the system. IMHO that would beat 
this fastxtoolkit. :)


AFAIK, he is reading text data that needs to be parsed line by 
line, so byChunk may not be the best approach. Or at least, not 
the easiest approach.


I'm just wondering if maybe the reason the D code is slow is not 
just because of:

- Unicode.
- front + popFront.

Ranges in D are notorious for being slow when iterating over text, 
due to the double decode.


If you are *certain* that the file contains nothing but ASCII 
(which should be the case for FASTQ, right?), you can get more 
bang for your buck if you iterate over it as an array of bytes, 
and convert the bytes to char on the fly, bypassing any and all 
Unicode processing.
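
A small sketch of that byte-level approach (counting straight 
from ubyte chunks, never going through the auto-decoding 
front/popFront path):

import std.stdio : File, writeln;

void main(string[] args)
{
    ulong[256] counts;                       // one counter per possible byte
    auto f = File(args[1], "rb");
    foreach (ubyte[] chunk; f.byChunk(64 * 1024))
        foreach (b; chunk)
            ++counts[b];                     // each byte treated as an ASCII char

    foreach (c; "ACGTN")
        writeln(c, ": ", counts[c]);
}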


Re: How to read fastly files ( I/O operation)

2013-02-04 Thread Brad Roberts
On Mon, 4 Feb 2013, monarch_dodra wrote:

 AFAIK, he is reading text data that needs to be parsed line by line, so
 byChunk may not be the best approach. Or at least, not the easiest approach.
 
 I'm just wondering if maybe the reason the D code is slow is not just because
 of:
 - unicode.
 - front + popFront.

First rule of performance analysis: don't guess, measure.


Re: How to read fastly files ( I/O operation)

2013-02-04 Thread Jacob Carlborg

On 2013-02-04 20:39, monarch_dodra wrote:


AFAIK, he is reading text data that needs to be parsed line by line, so
byChunk may not be the best approach. Or at least, not the easiest
approach.


He can still read a chunk from the file, or the whole file and then read 
that chunk line by line.



I'm just wondering if maybe the reason the D code is slow is not just
because of:
- Unicode.
- front + popFront.

Ranges in D are notorious for being slow when iterating over text, due to
the double decode.

If you are *certain* that the file contains nothing but ASCII (which
should be the case for FASTQ, right?), you can get more bang for your
buck if you iterate over it as an array of bytes, and convert
the bytes to char on the fly, bypassing any and all Unicode processing.


Depending on what you're doing, you can blast through the bytes even if 
it's Unicode. It will of course not validate the Unicode.


--
/Jacob Carlborg