Re: Speeding up text file parser (BLAST tabular format)

2015-09-15 Thread H. S. Teoh via Digitalmars-d-learn
On Tue, Sep 15, 2015 at 08:55:43AM +, Fredrik Boulund via Digitalmars-d-learn wrote: On Monday, 14 September 2015 at 18:31:38 UTC, H. S. Teoh wrote: I tried implementing a crude version of this (see code below), and found that manually calling GC.collect() even as frequently as once

Re: Speeding up text file parser (BLAST tabular format)

2015-09-15 Thread Andrew Brown via Digitalmars-d-learn
I had some luck building a local copy of LLVM in my home directory, using a Linux version about as old as yours (I used LLVM 3.5) specifying: --configure --prefix=/home/andrew/llvm so make install would install it somewhere I had permissions. Then I changed the cmake command to: cmake -L

Re: Speeding up text file parser (BLAST tabular format)

2015-09-15 Thread Rikki Cattermole via Digitalmars-d-learn
On 15/09/15 9:00 PM, Kagamin wrote: On Tuesday, 15 September 2015 at 08:51:02 UTC, Fredrik Boulund wrote: Using char[] all around might be a good idea, but it doesn't seem like the string conversions are really that taxing. What are the arguments for working on char[] arrays rather than

Re: Speeding up text file parser (BLAST tabular format)

2015-09-15 Thread John Colvin via Digitalmars-d-learn
On Tuesday, 15 September 2015 at 09:09:00 UTC, Kagamin wrote: On Tuesday, 15 September 2015 at 08:53:37 UTC, Fredrik Boulund wrote: my favourite for streaming a file: enum chunkSize = 4096; File(fileName).byChunk(chunkSize).map!"cast(char[])a".joiner() Is this an efficient way of reading this
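
A minimal sketch of the chunked-streaming idiom quoted above. The helper name countNewlines and the idea of counting newlines are illustrative assumptions, not code from the thread. It shows that the joined range hands out one character at a time, so any line handling has to be built on top of it.

```d
import std.algorithm : joiner, map;
import std.stdio : File;

/// Hypothetical helper: streams a file in fixed-size chunks and counts
/// newline characters without loading the whole file into memory.
size_t countNewlines(string path)
{
    enum chunkSize = 4096;
    size_t newlines;
    // byChunk yields ubyte[] buffers; the cast reinterprets each buffer as
    // text and joiner flattens the chunks into a single character range.
    foreach (c; File(path, "rb").byChunk(chunkSize)
                                .map!(a => cast(char[]) a)
                                .joiner)
    {
        if (c == '\n')
            ++newlines;
    }
    return newlines;
}

void main(string[] args)
{
    import std.stdio : writeln;

    if (args.length > 1)
        writeln(countNewlines(args[1]), " lines");
}
```

Note that joiner decodes the text as it goes, so this sketch assumes ASCII input (as BLAST tabular output normally is); a multi-byte character split across a chunk boundary would not decode cleanly.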

Re: Speeding up text file parser (BLAST tabular format)

2015-09-15 Thread Fredrik Boulund via Digitalmars-d-learn
On Monday, 14 September 2015 at 18:31:38 UTC, H. S. Teoh wrote: I tried implementing a crude version of this (see code below), and found that manually calling GC.collect() even as frequently as once every 5000 loop iterations (for a 500,000 line test input file) still gives about 15%
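
A rough sketch of the manual-collection pattern described above. It assumes automatic collections are switched off with GC.disable() for the duration of the loop, which the snippet does not state. The function name parseWithManualGC, its parameters, and the dummy allocation are hypothetical stand-ins for the real per-line parsing work.

```d
import core.memory : GC;

/// Hypothetical stand-in for a parsing loop: allocates per iteration and
/// runs a collection only every `collectEvery` iterations.
size_t parseWithManualGC(size_t iterations, size_t collectEvery)
{
    GC.disable();             // assumption: suppress automatic collections
    scope (exit) GC.enable(); // restore normal behaviour on exit

    size_t totalFields;
    foreach (i; 0 .. iterations)
    {
        auto fields = new int[](16); // placeholder for per-line allocations
        totalFields += fields.length;

        if (collectEvery != 0 && (i + 1) % collectEvery == 0)
            GC.collect();     // explicit collection at a chosen interval
    }
    return totalFields;
}

void main()
{
    import std.stdio : writeln;

    writeln(parseWithManualGC(500_000, 5_000));
}
```

The interval is a tuning knob: collecting less often trades memory for time, so it is worth measuring on real input rather than assuming a value.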

Re: Speeding up text file parser (BLAST tabular format)

2015-09-15 Thread Fredrik Boulund via Digitalmars-d-learn
On Monday, 14 September 2015 at 16:13:14 UTC, Edwin van Leeuwen wrote: See this link for clarification on what the columns/numbers in the profile file mean http://forum.dlang.org/post/f9gjmo$2gce$1...@digitalmars.com It is still difficult to parse though. I myself often use sysprof (only

Re: Speeding up text file parser (BLAST tabular format)

2015-09-15 Thread Kagamin via Digitalmars-d-learn
On Tuesday, 15 September 2015 at 08:53:37 UTC, Fredrik Boulund wrote: my favourite for streaming a file: enum chunkSize = 4096; File(fileName).byChunk(chunkSize).map!"cast(char[])a".joiner() Is this an efficient way of reading this type of file? What should one keep in mind when choosing

Re: Speeding up text file parser (BLAST tabular format)

2015-09-15 Thread Fredrik Boulund via Digitalmars-d-learn
On Monday, 14 September 2015 at 15:04:12 UTC, John Colvin wrote: I've had nothing but trouble when using different versions of libc. It would be easier to do this instead: http://wiki.dlang.org/Building_LDC_from_source I'm running a build of LDC git HEAD right now on an old server with

Re: Speeding up text file parser (BLAST tabular format)

2015-09-15 Thread Fredrik Boulund via Digitalmars-d-learn
On Monday, 14 September 2015 at 18:08:31 UTC, John Colvin wrote: On Monday, 14 September 2015 at 17:51:43 UTC, CraigDillabaugh wrote: On Monday, 14 September 2015 at 12:30:21 UTC, Fredrik Boulund wrote: [...] I am going to go off the beaten path here. If you really want speed for a file

Re: Speeding up text file parser (BLAST tabular format)

2015-09-15 Thread Fredrik Boulund via Digitalmars-d-learn
On Monday, 14 September 2015 at 16:33:23 UTC, Rikki Cattermole wrote: A lot of this hasn't been covered I believe. http://dpaste.dzfl.pl/f7ab2915c3e1 1) You don't need to convert char[] to string via to. No. Too much. Cast it. 2) You don't need byKey, use foreach key, value syntax. That way

Re: Speeding up text file parser (BLAST tabular format)

2015-09-15 Thread Kagamin via Digitalmars-d-learn
On Tuesday, 15 September 2015 at 08:51:02 UTC, Fredrik Boulund wrote: Using char[] all around might be a good idea, but it doesn't seem like the string conversions are really that taxing. What are the arguments for working on char[] arrays rather than strings? No, casting to string would be

Re: Speeding up text file parser (BLAST tabular format)

2015-09-15 Thread John Colvin via Digitalmars-d-learn
On Tuesday, 15 September 2015 at 08:45:00 UTC, Fredrik Boulund wrote: On Monday, 14 September 2015 at 15:04:12 UTC, John Colvin wrote: [...] Thanks for the offer, but don't go out of your way for my sake. Maybe I'll just build this in a clean environment instead of on my work computer to

Re: Speeding up text file parser (BLAST tabular format)

2015-09-15 Thread John Colvin via Digitalmars-d-learn
On Tuesday, 15 September 2015 at 13:49:04 UTC, Fredrik Boulund wrote: On Tuesday, 15 September 2015 at 10:01:30 UTC, John Colvin wrote: [...] Nope, :( [...] Oh well, worth a try I guess.

Re: Speeding up text file parser (BLAST tabular format)

2015-09-15 Thread Fredrik Boulund via Digitalmars-d-learn
On Tuesday, 15 September 2015 at 18:42:29 UTC, Andrew Brown wrote: I had some luck building a local copy of LLVM in my home directory, using a Linux version about as old as yours (I used LLVM 3.5) specifying: --configure --prefix=/home/andrew/llvm so make install would install it somewhere I

Re: Speeding up text file parser (BLAST tabular format)

2015-09-15 Thread Fredrik Boulund via Digitalmars-d-learn
On Tuesday, 15 September 2015 at 10:01:30 UTC, John Colvin wrote: try this: https://dlangscience.github.io/resources/ldc-0.16.0-a2_glibc2.11.3.tar.xz Nope, :( $ ldd ldc2 ./ldc2: /usr/lib64/libstdc++.so.6: version `GLIBCXX_3.4.20' not found (required by ./ldc2) linux-vdso.so.1 =>

Re: Speeding up text file parser (BLAST tabular format)

2015-09-15 Thread John Colvin via Digitalmars-d-learn
On Tuesday, 15 September 2015 at 13:01:06 UTC, Kagamin wrote: On Tuesday, 15 September 2015 at 09:19:29 UTC, John Colvin wrote: It provides you only one char at a time instead of a whole line. It will be quite constraining for your code if not mind-bending.

Re: Speeding up text file parser (BLAST tabular format)

2015-09-15 Thread Kagamin via Digitalmars-d-learn
On Tuesday, 15 September 2015 at 09:19:29 UTC, John Colvin wrote: It provides you only one char at a time instead of a whole line. It will be quite constraining for your code if not mind-bending. http://dlang.org/phobos/std_string.html#.lineSplitter
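
For reference, a small sketch of what the linked lineSplitter offers when the text is already in memory. The function name parseRows and the sample data are made up for illustration; only lineSplitter, splitter and array are real Phobos calls.

```d
import std.algorithm : splitter;
import std.array : array;
import std.string : lineSplitter;

/// Hypothetical helper: splits tab-separated text into rows of fields.
/// Each field is a slice of the input, so no field data is copied.
string[][] parseRows(string text)
{
    string[][] rows;
    foreach (line; text.lineSplitter)
    {
        if (line.length == 0)
            continue; // skip blank lines
        rows ~= line.splitter('\t').array;
    }
    return rows;
}

void main()
{
    import std.stdio : writeln;

    writeln(parseRows("q1\ts1\t98.5\nq2\ts2\t77.0\n"));
}
```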

Re: Speeding up text file parser (BLAST tabular format)

2015-09-14 Thread Fredrik Boulund via Digitalmars-d-learn
On Monday, 14 September 2015 at 12:30:21 UTC, Fredrik Boulund wrote: [...] Example output might be useful for you to see as well: 10009.1.1:5.2e-02_13: 16 10014.1.1:2.9e-03_11: 44 10017.1.1:4.1e-02_13: 16 10026.1.1:5.8e-03_12: 27 10027.1.1:6.6e-04_13: 16 10060.1.1:2.7e-03_14: 2

Re: Speeding up text file parser (BLAST tabular format)

2015-09-14 Thread Fredrik Boulund via Digitalmars-d-learn
On Monday, 14 September 2015 at 12:44:22 UTC, Edwin van Leeuwen wrote: Sounds like this program is actually IO bound. In that case I would not really expect an improvement by using D. What is the CPU usage like when you run this program? Also, which dmd version are you using? I think

Re: Speeding up text file parser (BLAST tabular format)

2015-09-14 Thread Andrea Fontana via Digitalmars-d-learn
On Monday, 14 September 2015 at 13:05:32 UTC, Andrea Fontana wrote: On Monday, 14 September 2015 at 12:30:21 UTC, Fredrik Boulund wrote: [...] Also, if the problem is probably I/O related, have you tried with: -O -inline -release -noboundscheck ? Anyway, I think it's a good idea to test it

Re: Speeding up text file parser (BLAST tabular format)

2015-09-14 Thread Andrea Fontana via Digitalmars-d-learn
On Monday, 14 September 2015 at 12:30:21 UTC, Fredrik Boulund wrote: [...] Also, if the problem is probably I/O related, have you tried with: -O -inline -release -noboundscheck ? Anyway, I think it's a good idea to test it against gdc and ldc, which are known to generate faster executables.

Speeding up text file parser (BLAST tabular format)

2015-09-14 Thread Fredrik Boulund via Digitalmars-d-learn
Hi, This is my first post on Dlang forums and I don't have a lot of experience with D (yet). I mainly code bioinformatics-stuff in Python on my day-to-day job, but I've been toying with D for a couple of years now. I had this idea that it'd be fun to write a parser for a text-based tabular

Re: Speeding up text file parser (BLAST tabular format)

2015-09-14 Thread Edwin van Leeuwen via Digitalmars-d-learn
On Monday, 14 September 2015 at 12:50:03 UTC, Fredrik Boulund wrote: On Monday, 14 September 2015 at 12:44:22 UTC, Edwin van Leeuwen wrote: Sounds like this program is actually IO bound. In that case I would not really expect an improvement by using D. What is the CPU usage like when

Re: Speeding up text file parser (BLAST tabular format)

2015-09-14 Thread Edwin van Leeuwen via Digitalmars-d-learn
On Monday, 14 September 2015 at 12:30:21 UTC, Fredrik Boulund wrote: Hi, Using a small test file (~550 MB) on my machine (2x Xeon(R) CPU E5-2670 with RAID6 SAS disks and 192GB of RAM), the D version runs in about 20 seconds and the Python version less than 16 seconds. I've repeated runs at

Re: Speeding up text file parser (BLAST tabular format)

2015-09-14 Thread Fredrik Boulund via Digitalmars-d-learn
On Monday, 14 September 2015 at 14:14:18 UTC, John Colvin wrote: what system are you on? What are the error messages you are getting? I really appreciate your willingness to help me out. This is what ldd shows on the latest binary release of LDC on my machine. I'm on a Red Hat Enterprise

Re: Speeding up text file parser (BLAST tabular format)

2015-09-14 Thread Fredrik Boulund via Digitalmars-d-learn
On Monday, 14 September 2015 at 14:15:25 UTC, Laeeth Isharc wrote: I picked up D to start learning maybe a couple of years ago. I found Ali's book, Andrei's book, github source code (including for Phobos), and asking here to be the best resources. The docs make perfect sense when you have

Re: Speeding up text file parser (BLAST tabular format)

2015-09-14 Thread John Colvin via Digitalmars-d-learn
On Monday, 14 September 2015 at 13:58:33 UTC, Fredrik Boulund wrote: On Monday, 14 September 2015 at 13:37:18 UTC, John Colvin wrote: On Monday, 14 September 2015 at 13:05:32 UTC, Andrea Fontana wrote: On Monday, 14 September 2015 at 12:30:21 UTC, Fredrik Boulund wrote: [...] Also if

Re: Speeding up text file parser (BLAST tabular format)

2015-09-14 Thread Laeeth Isharc via Digitalmars-d-learn
On Monday, 14 September 2015 at 13:55:50 UTC, Fredrik Boulund wrote: On Monday, 14 September 2015 at 13:10:50 UTC, Edwin van Leeuwen wrote: Two things that you could try: First hitlists.byKey can be expensive (especially if hitlists is big). Instead use: foreach( key, value ; hitlists )

Re: Speeding up text file parser (BLAST tabular format)

2015-09-14 Thread Fredrik Boulund via Digitalmars-d-learn
On Monday, 14 September 2015 at 14:28:41 UTC, John Colvin wrote: Yup, glibc is too old for those binaries. What does "ldd --version" say? It says "ldd (GNU libc) 2.12". Hmm... The most recent version in RHEL's repo is "2.12-1.166.el6_7.1", which is what is installed. Can this be side-loaded

Re: Speeding up text file parser (BLAST tabular format)

2015-09-14 Thread H. S. Teoh via Digitalmars-d-learn
On Mon, Sep 14, 2015 at 02:34:41PM +, Fredrik Boulund via Digitalmars-d-learn wrote: On Monday, 14 September 2015 at 14:18:58 UTC, John Colvin wrote: Range-based code like you are using leads to *huge* numbers of function calls to get anything done. The advantage of inlining is

Re: Speeding up text file parser (BLAST tabular format)

2015-09-14 Thread Fredrik Boulund via Digitalmars-d-learn
On Monday, 14 September 2015 at 13:05:32 UTC, Andrea Fontana wrote: On Monday, 14 September 2015 at 12:30:21 UTC, Fredrik Boulund wrote: [...] Also, if the problem is probably I/O related, have you tried with: -O -inline -release -noboundscheck ? Anyway, I think it's a good idea to test it

Re: Speeding up text file parser (BLAST tabular format)

2015-09-14 Thread Fredrik Boulund via Digitalmars-d-learn
On Monday, 14 September 2015 at 13:37:18 UTC, John Colvin wrote: On Monday, 14 September 2015 at 13:05:32 UTC, Andrea Fontana wrote: On Monday, 14 September 2015 at 12:30:21 UTC, Fredrik Boulund wrote: [...] Also, if the problem is probably I/O related, have you tried with: -O -inline -release

Re: Speeding up text file parser (BLAST tabular format)

2015-09-14 Thread John Colvin via Digitalmars-d-learn
On Monday, 14 September 2015 at 13:50:22 UTC, Fredrik Boulund wrote: On Monday, 14 September 2015 at 13:05:32 UTC, Andrea Fontana wrote: [...] Thanks for the suggestions! I'm not too familiar with compiled languages like this, I've mainly written small programs in D and run them via `rdmd`

Re: Speeding up text file parser (BLAST tabular format)

2015-09-14 Thread John Colvin via Digitalmars-d-learn
On Monday, 14 September 2015 at 14:25:04 UTC, Fredrik Boulund wrote: On Monday, 14 September 2015 at 14:14:18 UTC, John Colvin wrote: what system are you on? What are the error messages you are getting? I really appreciate your willingness to help me out. This is what ldd shows on the latest

Re: Speeding up text file parser (BLAST tabular format)

2015-09-14 Thread Fredrik Boulund via Digitalmars-d-learn
On Monday, 14 September 2015 at 14:40:29 UTC, H. S. Teoh wrote: If performance is a problem, the first thing I'd recommend is to use a profiler to find out where the hotspots are. (More often than not, I have found that the hotspots are not where I expected them to be; sometimes a 1-line

Re: Speeding up text file parser (BLAST tabular format)

2015-09-14 Thread Fredrik Boulund via Digitalmars-d-learn
On Monday, 14 September 2015 at 13:10:50 UTC, Edwin van Leeuwen wrote: Two things that you could try: First hitlists.byKey can be expensive (especially if hitlists is big). Instead use: foreach( key, value ; hitlists ) Also the filter.array.length is quite expensive. You could use count
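
A short sketch of the two suggestions quoted above: iterating the associative array with key and value together, and counting matches directly instead of building an array just to take its length. The function name countStrongHits, the element type, and the threshold are illustrative assumptions; the thread's actual hitlists type is not shown in this snippet.

```d
import std.algorithm : count;

/// Hypothetical helper: for each query, counts the scores at or above
/// `threshold` without allocating an intermediate array.
size_t[string] countStrongHits(double[][string] hitlists, double threshold)
{
    size_t[string] result;
    // Key and value arrive together, so no second lookup by key is needed.
    foreach (query, scores; hitlists)
    {
        // count!pred walks the range once; filter!pred.array.length would
        // allocate a temporary array only to read its length.
        result[query] = scores.count!(s => s >= threshold);
    }
    return result;
}

void main()
{
    import std.stdio : writeln;

    double[][string] hitlists = ["q1": [98.5, 70.0, 99.1], "q2": [50.0]];
    writeln(countStrongHits(hitlists, 90.0));
}
```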

Re: Speeding up text file parser (BLAST tabular format)

2015-09-14 Thread Fredrik Boulund via Digitalmars-d-learn
On Monday, 14 September 2015 at 14:18:58 UTC, John Colvin wrote: Range-based code like you are using leads to *huge* numbers of function calls to get anything done. The advantage of inlining is twofold: 1) you don't have to pay the cost of the function call itself and 2) often more

Re: Speeding up text file parser (BLAST tabular format)

2015-09-14 Thread John Colvin via Digitalmars-d-learn
On Monday, 14 September 2015 at 13:05:32 UTC, Andrea Fontana wrote: On Monday, 14 September 2015 at 12:30:21 UTC, Fredrik Boulund wrote: [...] Also, if the problem is probably I/O related, have you tried with: -O -inline -release -noboundscheck ? -inline in particular is likely to have a strong

Re: Speeding up text file parser (BLAST tabular format)

2015-09-14 Thread John Colvin via Digitalmars-d-learn
On Monday, 14 September 2015 at 14:35:26 UTC, Fredrik Boulund wrote: On Monday, 14 September 2015 at 14:28:41 UTC, John Colvin wrote: Yup, glibc is too old for those binaries. What does "ldd --version" say? It says "ldd (GNU libc) 2.12". Hmm... The most recent version in RHEL's repo is

Re: Speeding up text file parser (BLAST tabular format)

2015-09-14 Thread Edwin van Leeuwen via Digitalmars-d-learn
On Monday, 14 September 2015 at 14:54:34 UTC, Fredrik Boulund wrote: On Monday, 14 September 2015 at 14:40:29 UTC, H. S. Teoh wrote: I agree with you on that. I used Python's cProfile module to find the performance bottleneck in the Python version I posted, and shaved off 8-10 seconds of

Re: Speeding up text file parser (BLAST tabular format)

2015-09-14 Thread John Colvin via Digitalmars-d-learn
On Monday, 14 September 2015 at 16:33:23 UTC, Rikki Cattermole wrote: On 15/09/15 12:30 AM, Fredrik Boulund wrote: [...] A lot of this hasn't been covered I believe. http://dpaste.dzfl.pl/f7ab2915c3e1 1) You don't need to convert char[] to string via to. No. Too much. Cast it. Not a good
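
One possible reason for caution with the cast, offered as an assumption rather than as what the truncated reply goes on to say: a cast from char[] to string does not copy, so the resulting string still points at the mutable buffer, whereas idup makes an independent copy. The helper sharesMemory is hypothetical and exists only to make that visible.

```d
/// Hypothetical helper: true when both slices start at the same address.
bool sharesMemory(const(char)[] a, const(char)[] b)
{
    return a.ptr is b.ptr;
}

void main()
{
    char[] buf = "query1".dup;          // mutable buffer, e.g. a reused line buffer

    string aliased = cast(string) buf;  // no copy: same memory, now typed immutable
    string copied = buf.idup;           // fresh immutable copy

    assert(sharesMemory(aliased, buf));
    assert(!sharesMemory(copied, buf));
}
```

If the buffer is later overwritten, as happens with readers that reuse their line buffer, the aliased string changes underneath any code that stored it, for example as an associative array key.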

Re: Speeding up text file parser (BLAST tabular format)

2015-09-14 Thread H. S. Teoh via Digitalmars-d-learn
On Mon, Sep 14, 2015 at 04:13:12PM +, Edwin van Leeuwen via Digitalmars-d-learn wrote: On Monday, 14 September 2015 at 14:54:34 UTC, Fredrik Boulund wrote: [...] I tried using the built-in profiler in DMD on the D program but to no avail. I couldn't really make any sense of the output

Re: Speeding up text file parser (BLAST tabular format)

2015-09-14 Thread Rikki Cattermole via Digitalmars-d-learn
On 15/09/15 12:30 AM, Fredrik Boulund wrote: Hi, This is my first post on Dlang forums and I don't have a lot of experience with D (yet). I mainly code bioinformatics-stuff in Python on my day-to-day job, but I've been toying with D for a couple of years now. I had this idea that it'd be fun to

Re: Speeding up text file parser (BLAST tabular format)

2015-09-14 Thread NX via Digitalmars-d-learn
On Monday, 14 September 2015 at 16:33:23 UTC, Rikki Cattermole wrote: A lot of this hasn't been covered I believe. http://dpaste.dzfl.pl/f7ab2915c3e1 I believe that should be: foreach (query, ref value; hitlists) Since an assignment is happening there..?
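
A small sketch of the point being raised: without ref, the loop variable is a copy, so assigning to it leaves the associative array untouched. The function name scaleInPlace and the int value type are assumptions for illustration; the code in the linked paste is not reproduced here.

```d
/// Hypothetical helper: multiplies every value of the associative array
/// in place. `ref` makes `value` an alias of the stored element.
void scaleInPlace(int[string] counts, int factor)
{
    foreach (query, ref value; counts)
        value *= factor;
}

void main()
{
    int[string] hitlists = ["a": 1, "b": 2];

    // Without ref, only the loop-local copy changes.
    foreach (query, value; hitlists)
        value *= 10;
    assert(hitlists["a"] == 1);

    // With ref, the stored values are updated.
    scaleInPlace(hitlists, 10);
    assert(hitlists["a"] == 10 && hitlists["b"] == 20);
}
```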

Re: Speeding up text file parser (BLAST tabular format)

2015-09-14 Thread H. S. Teoh via Digitalmars-d-learn
On Mon, Sep 14, 2015 at 08:07:45PM +, Kapps via Digitalmars-d-learn wrote: On Monday, 14 September 2015 at 18:31:38 UTC, H. S. Teoh wrote: I decided to give the code a spin with `gdc -O3 -pg`. Turns out that the hotspot is in std.array.split, contrary to expectations. :-) Here are

Re: Speeding up text file parser (BLAST tabular format)

2015-09-14 Thread Rikki Cattermole via Digitalmars-d-learn
On 15/09/15 5:41 AM, NX wrote: On Monday, 14 September 2015 at 16:33:23 UTC, Rikki Cattermole wrote: A lot of this hasn't been covered I believe. http://dpaste.dzfl.pl/f7ab2915c3e1 I believe that should be: foreach (query, ref value; hitlists) Since an assignment is happening there..?

Re: Speeding up text file parser (BLAST tabular format)

2015-09-14 Thread CraigDillabaugh via Digitalmars-d-learn
On Monday, 14 September 2015 at 12:30:21 UTC, Fredrik Boulund wrote: Hi, This is my first post on Dlang forums and I don't have a lot of experience with D (yet). I mainly code bioinformatics-stuff in Python on my day-to-day job, but I've been toying with D for a couple of years now. I had

Re: Speeding up text file parser (BLAST tabular format)

2015-09-14 Thread John Colvin via Digitalmars-d-learn
On Monday, 14 September 2015 at 17:51:43 UTC, CraigDillabaugh wrote: On Monday, 14 September 2015 at 12:30:21 UTC, Fredrik Boulund wrote: [...] I am going to go off the beaten path here. If you really want speed for a file like this one way of getting that is to read the file in as a

Re: Speeding up text file parser (BLAST tabular format)

2015-09-14 Thread H. S. Teoh via Digitalmars-d-learn
I decided to give the code a spin with `gdc -O3 -pg`. Turns out that the hotspot is in std.array.split, contrary to expectations. :-) Here are the first few lines of the gprof output: -snip- Each sample counts as 0.01 seconds. % cumulative self self total

Re: Speeding up text file parser (BLAST tabular format)

2015-09-14 Thread Kapps via Digitalmars-d-learn
On Monday, 14 September 2015 at 18:31:38 UTC, H. S. Teoh wrote: I decided to give the code a spin with `gdc -O3 -pg`. Turns out that the hotspot is in std.array.split, contrary to expectations. :-) Here are the first few lines of the gprof output: [...] Perhaps using the new rangified