Re: Sort characters in string
On Wednesday, 6 December 2017 at 10:42:31 UTC, Dgame wrote: Or you simply do writeln("longword".array.sort); This is so strange. I was dead sure I tried that but it failed for some reason. But after trying it just now it also seems to work just fine. Thanks! :)
Re: Sort characters in string
On Wednesday, 6 December 2017 at 09:25:20 UTC, Biotronic wrote: In addition, sort does in-place sorting, so the input range is changed. Since D strings are immutable(char)[], changing the elements is disallowed. So in total, you'll need to convert from a string (immutable(char)[]) to a dchar[]. std.conv.to to the rescue: import std.stdio : writeln; import std.conv : to; import std.algorithm.sorting : sort; string word = "longword"; writeln(sort(word.to!(dchar[]))); // dglnoorw Also very useful information! Thanks. I was just realizing that sort was in-place as I finished writing my first post. It got me really confused as I expected it to return a sorted array (but I do realize that is a strange assumption to make).
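To make the in-place behaviour concrete, here is a minimal sketch (variable names are illustrative) showing that sort both mutates its argument and returns a sorted view of it:

```d
import std.algorithm.sorting : sort;
import std.conv : to;
import std.stdio : writeln;

void main()
{
    string word = "longword";            // immutable(char)[]: cannot be sorted in place
    dchar[] letters = word.to!(dchar[]); // make a mutable copy, one element per code point
    auto sorted = sort(letters);         // sorts in place AND returns a SortedRange
    writeln(sorted);                     // dglnoorw
    writeln(letters);                    // dglnoorw -- the original array was mutated too
}
```

So the "strange assumption" is not far off: sort does return something (a SortedRange wrapping the input, useful for later binary searches), it just also reorders the underlying array.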
Re: Sort characters in string
On Wednesday, 6 December 2017 at 09:24:33 UTC, Jonathan M Davis wrote: If you have a string, and you _know_ that it's only ASCII, then either use representation or byCodeUnit to wrap it for the call to sort, but it _will_ have to be mutable, so string won't actually work. e.g. char[] str = "hello world".dup; sort(str.representation); // str is now sorted However, if the string could actually contain Unicode, then you'll have to use std.uni's grapheme support to wrap the string in a range that operates at the grapheme level; otherwise you could rip pieces of characters apart. Thanks so much for such an elaborate reply! I was really close with str.representation in my code, and the little code example you gave helped me fix my problem! Thanks a lot
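A self-contained sketch of the representation approach from the quoted post (ASCII input assumed; representation reinterprets the char[] as ubyte[], so sort works on raw bytes and no auto-decoding gets in the way):

```d
import std.algorithm.sorting : sort;
import std.stdio : writeln;
import std.string : representation;

void main()
{
    char[] str = "hello world".dup;  // string literals are immutable, so dup first
    sort(str.representation);        // sorts the underlying ubyte[]; ASCII only!
    writeln(str);                    // " dehllloorw" (the space sorts first)
}
```

For non-ASCII text this would scramble multi-byte UTF-8 sequences, which is exactly why the grapheme-level approach is needed for general Unicode.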
Sort characters in string
Hi, I'm having some trouble sorting the individual characters in a string. Searching around, I found this thread (http://forum.dlang.org/post/mailman.612.1331659665.4860.digitalmars-d-le...@puremagic.com) about a similar issue, but it feels quite old so I wanted to check if there is a clear-cut way of sorting the characters of a string nowadays? I was expecting to be able to do something like this: string word = "longword"; writeln(sort(word)); But that doesn't work because I guess a string is not the type of range required for sort? I tried converting it to a char[] array (to!(char[])(word)) but that doesn't seem to work either. I get: Error: template std.algorithm.sorting.sort cannot deduce function from argument types !()(char[]) Best regards, Fredrik
Re: Speeding up text file parser (BLAST tabular format)
On Monday, 14 September 2015 at 18:31:38 UTC, H. S. Teoh wrote: I tried implementing a crude version of this (see code below), and found that manually calling GC.collect() even as frequently as once every 5000 loop iterations (for a 500,000 line test input file) still gives about 15% performance improvement over completely disabling the GC. Since most of the arrays involved here are pretty small, the frequency could be reduced to once every 50,000 iterations and you'd pretty much get the 20% performance boost for free, and still not run out of memory too quickly. Interesting, I'll have to go through your code to understand exactly what's going on. I also noticed some GC-related stuff high up in my profiling, but had no idea what could be done about that. Appreciate the suggestions!
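The pattern being described can be sketched roughly like this (the file name and the collection interval are illustrative; the interval is a tuning knob that depends entirely on the workload):

```d
import core.memory : GC;
import std.stdio : File;

void main()
{
    GC.disable();                  // no automatic collections during the hot loop
    enum collectInterval = 50_000; // made-up number; tune against your input size
    size_t i = 0;
    foreach (line; File("input.txt").byLine)
    {
        // ... parse and store the line ...
        if (++i % collectInterval == 0)
            GC.collect();          // reclaim garbage at a moment we choose,
                                   // instead of mid-allocation at random times
    }
    GC.enable();
}
```

The idea is that collections still happen, so memory does not grow without bound, but they happen on a predictable schedule rather than being triggered by allocation pressure.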
Re: Speeding up text file parser (BLAST tabular format)
On Monday, 14 September 2015 at 16:13:14 UTC, Edwin van Leeuwen wrote: See this link for clarification on what the columns/numbers in the profile file mean http://forum.dlang.org/post/f9gjmo$2gce$1...@digitalmars.com It is still difficult to parse though. I myself often use sysprof (only available on linux), which automatically ranks by time spent. Thanks for the link. I read up on what everything means, but I think the problem isn't finding what consumes the most time, the problem is me not knowing the standard library well enough to translate the largest consumers to actual parts of my code :).
Re: Speeding up text file parser (BLAST tabular format)
On Monday, 14 September 2015 at 15:04:12 UTC, John Colvin wrote: I've had nothing but trouble when using different versions of libc. It would be easier to do this instead: http://wiki.dlang.org/Building_LDC_from_source I'm running a build of LDC git HEAD right now on an old server with 2.11, I'll upload the result somewhere once it's done if it might be useful Thanks for the offer, but don't go out of your way for my sake. Maybe I'll just build this in a clean environment instead of on my work computer to get rid of all the hassle. The Red Hat llvm-devel packages are broken, dependent on libffi-devel which is unavailable. Getting the build environment up to speed on my main machine would take me a lot more time than I have right now. Tried building LDC from scratch but it fails because of missing LLVM components, despite having LLVM 3.4.2 installed (though lacking devel components).
Re: Speeding up text file parser (BLAST tabular format)
On Monday, 14 September 2015 at 18:08:31 UTC, John Colvin wrote: On Monday, 14 September 2015 at 17:51:43 UTC, CraigDillabaugh wrote: On Monday, 14 September 2015 at 12:30:21 UTC, Fredrik Boulund wrote: [...] I am going to go off the beaten path here. If you really want speed for a file like this one way of getting that is to read the file in as a single large binary array of ubytes (or in blocks if its too big) and parse the lines yourself. Should be fairly easy with D's array slicing. my favourite for streaming a file: enum chunkSize = 4096; File(fileName).byChunk(chunkSize).map!"cast(char[])a".joiner() Is this an efficient way of reading this type of file? What should one keep in mind when choosing chunkSize?
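Expanded into a compilable sketch, the quoted one-liner looks like this (file name illustrative). A reasonable rule of thumb for chunkSize is a small multiple of the filesystem block size, trading syscall overhead against buffer memory:

```d
import std.algorithm.iteration : joiner, map;
import std.stdio : File;

void main()
{
    enum chunkSize = 4096; // often 4 KiB .. 64 KiB; benchmark to pick a value
    auto chars = File("input.txt")
        .byChunk(chunkSize)                 // range of ubyte[] blocks
        .map!(chunk => cast(char[]) chunk)  // reinterpret each block as chars
        .joiner;                            // lazily flatten into a range of char
    // `chars` is now an input range over the whole file, read block by block
}
```

One caveat worth knowing: byChunk reuses its internal buffer, so anything that needs to keep data beyond the current chunk must copy it.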
Re: Speeding up text file parser (BLAST tabular format)
On Monday, 14 September 2015 at 16:33:23 UTC, Rikki Cattermole wrote: A lot of this hasn't been covered I believe. http://dpaste.dzfl.pl/f7ab2915c3e1

1) You don't need to convert char[] to string via to. No. Too much. Cast it.
2) You don't need byKey, use foreach key, value syntax. That way you won't go around modifying things unnecessarily.

Ok, I disabled GC + reserved a bunch of memory. It probably won't help much actually. In fact it may make it fail, so keep that in mind. Hmm, what else. I'm worried about that first foreach. I don't think it needs to exist as it does. I believe an input range would be far better. Use a buffer to store the Hit[]'s. Have a subset per set of them. If the first foreach is an input range, then things become slightly easier in the second. Now you can turn that into its own input range. Also that .array usage concerns me. Many an allocation there! Hence why the input range should be the return from it. The last foreach, let's assume, is a dummy. Keep in mind, stdout is expensive here. DO NOT USE. If you must buffer output, then do it in large quantities. Based upon what I can see, you are definitely not able to use your CPUs to the max. There is no way that is the limiting factor here. Maybe your usage of a core is. But not the CPUs themselves. The thing is, you cannot use multiple threads on that first foreach loop to speed things up. No. That needs to happen all on one thread. Instead, after that thread you need to push the result into another. Perhaps, per thread, one lock (mutex) + buffer for hits. Go round robin over all the threads. If a mutex is still locked, you'll need to wait. In this situation a locked mutex means all your worker threads are working, so you can't do anything more (anyway). Of course after all this, the HDD may still be getting hit too hard. In which case I would recommend memory mapping it, which should allow the OS to more efficiently handle reading it into memory. But you'll need to rework .byLine for that.
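The memory-mapping suggestion at the end can be sketched with std.mmfile (file name and line handling are illustrative; this is one way to replace .byLine over a mapped file, not the only one):

```d
import std.algorithm.iteration : splitter;
import std.mmfile : MmFile;

void main()
{
    // Map the file so the OS pages it in on demand instead of read()-ing it.
    auto mmf = new MmFile("input.txt");
    auto data = cast(const(char)[]) mmf[]; // the whole file as one char slice
    foreach (line; data.splitter('\n'))    // lazy, zero-copy line iteration
    {
        // ... parse `line`; it is a slice into the mapped file, no copying ...
    }
}
```
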
Wow that was a lot at 4:30am! So don't take it too seriously. I'm sure somebody else will rip that to shreds! Thanks for your suggestions! That sure is a lot of details. I'll have to go through them carefully to understand what to do with all this. Going multithreaded sounds fun but would effectively kill off all of my spare time, so I might have to skip that. :) Using char[] all around might be a good idea, but it doesn't seem like the string conversions are really that taxing. What are the arguments for working on char[] arrays rather than strings? I'm aware that printing output like that is a performance killer, but it's not supposed to write anything in the final program. It's just there for me to be able to compare the results to my Python code.
Re: Speeding up text file parser (BLAST tabular format)
On Tuesday, 15 September 2015 at 18:42:29 UTC, Andrew Brown wrote: I had some luck building a local copy of llvm in my home directory, using a linux version about as old as yours (llvm 3.5 I used), specifying: ./configure --prefix=/home/andrew/llvm so make install would install it somewhere I had permissions. Then I changed the cmake command to: cmake -L -DLLVM_CONFIG="/home/andrew/llvm/bin/llvm-config" .. and I got a working install of ldc. Make yourself a cup of tea while you wait though if you try it; llvm took about an hour and a half to compile. Thanks for your suggestion. I'm amazed by the amount of effort you guys put into helping me. Unfortunately the only precompiled version of libstdc++ available for the system in question via Red Hat repos is 4.4.7, and compiling llvm from scratch requires at least 4.7. I'll be fine using DMD for now as I'm still learning more about D :).
Re: Speeding up text file parser (BLAST tabular format)
On Tuesday, 15 September 2015 at 10:01:30 UTC, John Colvin wrote: try this: https://dlangscience.github.io/resources/ldc-0.16.0-a2_glibc2.11.3.tar.xz

Nope, :(

$ ldd ldc2
./ldc2: /usr/lib64/libstdc++.so.6: version `GLIBCXX_3.4.20' not found (required by ./ldc2)
	linux-vdso.so.1 => (0x7fff2ffd8000)
	libpthread.so.0 => /lib64/libpthread.so.0 (0x00318a00)
	libdl.so.2 => /lib64/libdl.so.2 (0x00318a40)
	libncurses.so.5 => /lib64/libncurses.so.5 (0x00319bc0)
	librt.so.1 => /lib64/librt.so.1 (0x00318a80)
	libz.so.1 => /lib64/libz.so.1 (0x00318ac0)
	libstdc++.so.6 => /usr/lib64/libstdc++.so.6 (0x00318dc0)
	libm.so.6 => /lib64/libm.so.6 (0x003189c0)
	libgcc_s.so.1 => /lib64/libgcc_s.so.1 (0x00318c00)
	libc.so.6 => /lib64/libc.so.6 (0x00318980)
	/lib64/ld-linux-x86-64.so.2 (0x00318940)
	libtinfo.so.5 => /lib64/libtinfo.so.5 (0x00319900)

Thanks for trying though!
Re: Speeding up text file parser (BLAST tabular format)
On Monday, 14 September 2015 at 12:30:21 UTC, Fredrik Boulund wrote: [...] Example output might be useful for you to see as well:

10009.1.1:5.2e-02_13: 16
10014.1.1:2.9e-03_11: 44
10017.1.1:4.1e-02_13: 16
10026.1.1:5.8e-03_12: 27
10027.1.1:6.6e-04_13: 16
10060.1.1:2.7e-03_14: 2
10061.1.1:5.1e-07_13: 41

Worth noticing is that it is essentially impossible to predict how many "hits"/records there are for each query; it varies wildly from 0 to 1000+ in some cases.
Re: Speeding up text file parser (BLAST tabular format)
On Monday, 14 September 2015 at 12:44:22 UTC, Edwin van Leeuwen wrote: Sounds like this program is actually IO bound. In that case I would not really expect an improvement by using D. What is the CPU usage like when you run this program? Also, which dmd version are you using? I think there were some performance improvements for file reading in the latest version (2.068) Hi Edwin, thanks for your quick reply! I'm using v2.068.1; I actually got inspired to try this out after skimming the changelog :). Regarding whether it is IO-bound: I actually expected it would be, but both the Python and the D version consume 100% CPU while running, and just copying the file around only takes a few seconds (cf. 15-20 sec in runtime for the two programs). There's bound to be some aggressive file caching going on, but I figure that would rather normalize program runtimes at lower times after running them a few times, and I see nothing indicating that.
Speeding up text file parser (BLAST tabular format)
Hi, This is my first post on Dlang forums and I don't have a lot of experience with D (yet). I mainly code bioinformatics-stuff in Python on my day-to-day job, but I've been toying with D for a couple of years now. I had this idea that it'd be fun to write a parser for a text-based tabular data format I tend to read a lot of in my programs, but I was a bit stumped that the D implementation I created was slower than my Python version. I tried running `dmd -profile` on it but didn't really understand what I can do to make it go faster. I guess there's some unnecessary dynamic array extensions being made but I can't figure out how to do without them, maybe someone can help me out? I tried making the examples as small as possible. Here's the D code: http://dpaste.com/2HP0ZVA Here's my Python code for comparison: http://dpaste.com/0MPBK67 Using a small test file (~550 MB) on my machine (2x Xeon(R) CPU E5-2670 with RAID6 SAS disks and 192GB of RAM), the D version runs in about 20 seconds and the Python version in less than 16 seconds. I've repeated runs at least thrice when testing. This holds true even if the D version is compiled with -O. The file being parsed is the output of a DNA/protein sequence mapping algorithm called BLAT, but the tabular output format is originally known from the famous BLAST algorithm. Here's a short example of what the input files look like: http://dpaste.com/017N58F The format is TAB-delimited: query, target, percent_identity, alignment_length, mismatches, gaps, query_start, query_end, target_start, target_end, e-value, bitscore In the example the output is sorted by query, but this cannot be assumed to hold true for all cases. The input file varies in size from several hundred megabytes to several gigabytes (10+ GiB).
A brief explanation of what the code does:

1. Parse each line.
2. Only accept records with percent_identity >= min_identity (90.0) and alignment_length >= min_matches (10).
3. Store all such records as tuples (in the D code this is a struct) in an array in an associative array indexed by 'query'.
4. For each query, remove any records with percent_id more than 5 percentage points below the highest value observed for that query.
5. Write results to stdout (in my real code the data is subject to further downstream processing).

This was all just for me learning to do some basic stuff in D, e.g. file handling, streaming data from disk, etc. I'm really curious what I can do to improve the D code. My original idea was that maybe I should compile the performance-critical parts of my Python codebase to D and call them with PyD or something, but now I'm not so sure anymore. Help and suggestions appreciated!
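For readers without access to the paste links, the described pipeline can be sketched roughly like this (column indices follow the field list in the post; the `Hit` struct name comes from later replies in the thread, the rest of the names and the file name are illustrative):

```d
import std.algorithm.iteration : filter, map, splitter;
import std.algorithm.searching : maxElement;
import std.array : array;
import std.conv : to;
import std.stdio : File;

struct Hit { string target; float pid; int matches; }

void main()
{
    enum minIdentity = 90.0, minMatches = 10, maxPidDiff = 5.0;
    Hit[][string] hitlists; // records grouped by query

    foreach (line; File("blat_output.tab").byLine)
    {
        auto fields = line.splitter('\t').array;
        auto pid = fields[2].to!float;       // percent_identity
        auto len = fields[3].to!int;         // alignment_length
        if (pid >= minIdentity && len >= minMatches)
            hitlists[fields[0].idup] ~= Hit(fields[1].idup, pid, len);
    }

    foreach (query, hits; hitlists)
    {
        auto maxPid = hits.map!(h => h.pid).maxElement;
        auto kept = hits.filter!(h => h.pid >= maxPid - maxPidDiff);
        // ... downstream processing / output of `kept` ...
    }
}
```

The .idup calls are needed because byLine reuses its buffer between iterations; that copying is one of the costs the thread goes on to discuss.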
Re: Speeding up text file parser (BLAST tabular format)
On Monday, 14 September 2015 at 14:14:18 UTC, John Colvin wrote: what system are you on? What are the error messages you are getting? I really appreciate your will to try to help me out. This is what ldd shows on the latest binary release of LDC on my machine. I'm on a Red Hat Enterprise Linux 6.6 system.

[boulund@terra ~]$ ldd ~/apps/ldc2-0.16.0-alpha2-linux-x86_64/bin/ldc2
/home/boulund/apps/ldc2-0.16.0-alpha2-linux-x86_64/bin/ldc2: /lib64/libc.so.6: version `GLIBC_2.14' not found (required by /home/boulund/apps/ldc2-0.16.0-alpha2-linux-x86_64/bin/ldc2)
/home/boulund/apps/ldc2-0.16.0-alpha2-linux-x86_64/bin/ldc2: /lib64/libc.so.6: version `GLIBC_2.15' not found (required by /home/boulund/apps/ldc2-0.16.0-alpha2-linux-x86_64/bin/ldc2)
	linux-vdso.so.1 => (0x7fff623ff000)
	libconfig.so.8 => /home/boulund/apps/ldc2-0.16.0-alpha2-linux-x86_64/bin/libconfig.so.8 (0x7f7f716e1000)
	libpthread.so.0 => /lib64/libpthread.so.0 (0x7f7f714a3000)
	libdl.so.2 => /lib64/libdl.so.2 (0x7f7f7129f000)
	libstdc++.so.6 => /usr/lib64/libstdc++.so.6 (0x0032cde0)
	libm.so.6 => /lib64/libm.so.6 (0x7f7f7101a000)
	libgcc_s.so.1 => /lib64/libgcc_s.so.1 (0x0032cca0)
	libc.so.6 => /lib64/libc.so.6 (0x7f7f70c86000)
	/lib64/ld-linux-x86-64.so.2 (0x7f7f718ec000)

As you can see it lacks something related to GLIBC, but I'm not sure how to fix that.
Re: Speeding up text file parser (BLAST tabular format)
On Monday, 14 September 2015 at 14:15:25 UTC, Laeeth Isharc wrote: I picked up D to start learning maybe a couple of years ago. I found Ali's book, Andrei's book, github source code (including for Phobos), and asking here to be the best resources. The docs make perfect sense when you have got to a certain level (or perhaps if you have a computer sciencey background), but can be tough before that (though they are getting better). You should definitely take a look at the dlangscience project organized by John Colvin and others. If you like ipython/jupyter also see his pydmagic - write D inline in a notebook. I saw the dlangscience project on GitHub the other day. I've yet to venture deeper. The inlining of D in jupyter notebooks sure is cool, but I'm not sure it's very useful for me, Python feels more succinct for notebook use. Still, I really appreciate the effort put into that, it's really cool! You may find this series of posts interesting too - another bioinformatics guy migrating from Python: http://forum.dlang.org/post/akzdstfiwwzfeoudh...@forum.dlang.org I'll have a look at that series of posts, thanks for the heads-up! Unfortunately I haven't time to read your code, and others will do better. But do you use .reserve() ? Also these are a nice fast container library based on Andrei Alexandrescu's allocator: https://github.com/economicmodeling/containers Not familiar with .reserve(), nor Andrei's allocator library. I'll put that in the stuff-to-read-about-queue for now. :) Thanks for your tips!
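For context, .reserve is a one-line change: it pre-allocates capacity for a dynamic array so that repeated appends don't keep reallocating and copying. A tiny sketch (the size is a made-up example):

```d
void main()
{
    int[] hits;
    hits.reserve(1_000); // one up-front allocation with room for ~1000 elements
    foreach (i; 0 .. 1_000)
        hits ~= i;       // appends now extend in place, no reallocation/copying
    assert(hits.length == 1_000);
}
```

This helps most when you have a rough idea of the final size, as with the per-query hit lists in this parser.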
Re: Speeding up text file parser (BLAST tabular format)
On Monday, 14 September 2015 at 14:28:41 UTC, John Colvin wrote: Yup, glibc is too old for those binaries. What does "ldd --version" say? It says "ldd (GNU libc) 2.12". Hmm... The most recent version in RHEL's repo is "2.12-1.166.el6_7.1", which is what is installed. Can this be side-loaded without too much hassle and manual effort?
Re: Speeding up text file parser (BLAST tabular format)
On Monday, 14 September 2015 at 13:05:32 UTC, Andrea Fontana wrote: On Monday, 14 September 2015 at 12:30:21 UTC, Fredrik Boulund wrote: [...] Also if the problem probably is I/O related, have you tried with: -O -inline -release -noboundscheck ? Anyway I think it's a good idea to test it against gdc and ldc that are known to generate faster executables. Andrea Thanks for the suggestions! I'm not too familiar with compiled languages like this, I've mainly written small programs in D and run them via `rdmd` in a scripting-language fashion. I'll read up on what the different compile flags do (I knew about -O, but I'm not sure what the others do). Unfortunately I cannot get LDC working on my system. It seems to fail finding some shared library when I download the binary release, and I can't figure out how to make it compile. I haven't really given GDC a try yet. I'll see what I can do. Running the original D code I posted before with the flags you suggested reduced the runtime by about 2 seconds on average.
Re: Speeding up text file parser (BLAST tabular format)
On Monday, 14 September 2015 at 13:37:18 UTC, John Colvin wrote: On Monday, 14 September 2015 at 13:05:32 UTC, Andrea Fontana wrote: On Monday, 14 September 2015 at 12:30:21 UTC, Fredrik Boulund wrote: [...] Also if the problem probably is I/O related, have you tried with: -O -inline -release -noboundscheck ? -inline in particular is likely to have a strong impact here Why would -inline be particularly likely to make a big difference in this case? I'm trying to learn, but I don't see what inlining could be done in this case. Anyway I think it's a good idea to test it against gdc and ldc that are known to generate faster executables. Andrea +1 I would expect ldc or gdc to strongly outperform dmd on this code. Why is that? I would love to learn to understand why they could be expected to perform much better on this kind of code.
Re: Speeding up text file parser (BLAST tabular format)
On Monday, 14 September 2015 at 14:40:29 UTC, H. S. Teoh wrote: If performance is a problem, the first thing I'd recommend is to use a profiler to find out where the hotspots are. (More often than not, I have found that the hotspots are not where I expected them to be; sometimes a 1-line change to an unanticipated hotspot can result in a huge performance boost.) I agree with you on that. I used Python's cProfile module to find the performance bottleneck in the Python version I posted, and shaved off 8-10 seconds of runtime on an extraneous str.split() I had missed. I tried using the built-in profiler in DMD on the D program but to no avail. I couldn't really make any sense of the output other than that there were enormous numbers of calls to lots of functions I couldn't find a way to remove from the code. Here's a paste of the trace output from the version I posted in the original post: http://dpaste.com/1AXPK9P The next thing I'd try is to use gdc instead of dmd. ;-) IME, code produced by `gdc -O3` is at least 20-30% faster than code produced by `dmd -O -inline`. Sometimes the difference can be up to 40-50%, depending on the kind of code you're compiling. Yes, it really seems that gdc or ldc is the way to go.
Re: Speeding up text file parser (BLAST tabular format)
On Monday, 14 September 2015 at 13:10:50 UTC, Edwin van Leeuwen wrote: Two things that you could try: First, hitlists.byKey can be expensive (especially if hitlists is big). Instead use: foreach( key, value ; hitlists ) Also the filter.array.length is quite expensive. You could use count instead. import std.algorithm : count; value.count!(h => h.pid >= (max_pid - max_pid_diff)); I didn't know that hitlists.byKey was that expensive, that's just the kind of feedback I was hoping for. I'm just grasping at straws in the online documentation when I want to do things. With my Python background it feels as if I can still get things that work that way. I realize the filter.array.length thing is indeed expensive. I find it especially horrendous that the code I've written needs to allocate a big dynamic array that will most likely be cut down quite drastically in this step. Unfortunately I haven't figured out a good way to do this without storing the intermediary results, since I cannot know if there might be yet another hit for any encountered "query" given that the input file might not be sorted. But the main reason I didn't just count the values like you suggest is actually that I need the filtered hits in later downstream analysis. The filtered hits for each query are used as input to a lowest common ancestor algorithm on the taxonomic tree (of life).
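Both suggestions combined fit in a small sketch (the data, max_pid value, and names are made up for illustration; count walks the range once without allocating the intermediate array that filter.array would build):

```d
import std.algorithm.searching : count;
import std.stdio : writeln;

struct Hit { float pid; }

void main()
{
    Hit[][string] hitlists = ["q1": [Hit(99.0), Hit(95.0), Hit(85.0)]];
    float max_pid = 99.0, max_pid_diff = 5.0; // max_pid precomputed per query

    foreach (key, value; hitlists) // iterates keys and values; no byKey + lookup
    {
        // count applies the predicate lazily: no intermediate array allocated
        auto n = value.count!(h => h.pid >= (max_pid - max_pid_diff));
        writeln(key, ": ", n); // q1: 2
    }
}
```

As the reply notes, count only helps when the number is all you need; when the filtered hits themselves feed downstream analysis, keeping filter lazy (without .array) can still avoid the big intermediate allocation.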
Re: Speeding up text file parser (BLAST tabular format)
On Monday, 14 September 2015 at 14:18:58 UTC, John Colvin wrote: Range-based code like you are using leads to *huge* numbers of function calls to get anything done. The advantage of inlining is twofold: 1) you don't have to pay the cost of the function call itself and 2) often more optimisation can be done once a function is inlined. Thanks for that explanation! Now that you mention it, it makes perfect sense. I never considered it, but of course *huge* numbers of function calls to e.g. popFront() and other range methods will be made. Because they are much better at inlining. dmd is quick to compile your code and is most up-to-date, but ldc and gdc will produce somewhat faster code in almost all cases, sometimes dramatically faster. Sure sounds like I could have more fun with LDC and GDC on my system in addition to DMD :).