Re: Nim vs. Python & Groovy (string splitting): Why is string splitting so slow in Nim?

2019-10-03 Thread brentp
I finally wrote a quick wrapper for gzfile that looks a lot like Nim's `File`. You can use it like this:

    import gzfile
    import strutils

    var vcf: GZFile
    doAssert vcf.open("test.vcf.gz")
    for line in vcf.lines:
      if not line.startsWith('#'):
        discard # do something

Re: Nim vs. Python & Groovy (string splitting): Why is string splitting so slow in Nim?

2019-08-27 Thread cblake
It looks like `nim-faststreams` is just to bridge the API gap between mmap IO and streams interfaces which is nice and all, but won't help for `startProcess` or other `popen`-like contexts where in this discussion streams slowness was a problem. { @mratsim didn't say it would, exactly, but I

Re: Nim vs. Python & Groovy (string splitting): Why is string splitting so slow in Nim?

2019-08-26 Thread mratsim
We have an alternative streams implementation that uses memfiles for speed: [nim-faststreams](https://github.com/status-im/nim-faststreams)
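For readers who only need line iteration over a memory-mapped file, a minimal sketch using the standard library's `memfiles` module is shown here. This is not the nim-faststreams API; the helper name `countDataLines` and the file path are assumptions made for illustration.

```nim
import memfiles

# Hypothetical helper: count the non-header lines of a text file via mmap.
proc countDataLines(path: string): int =
  var mf = memfiles.open(path)   # maps the file read-only by default
  defer: mf.close()
  for line in lines(mf):         # yields each line as a string
    # Skip empty lines and '#' header lines, as in the VCF examples.
    if line.len > 0 and line[0] != '#':
      inc result
```

Note that mapping only applies to regular files, so this does not help with pipes or `popen`-style input.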

Re: Nim vs. Python & Groovy (string splitting): Why is string splitting so slow in Nim?

2019-08-25 Thread cdunn2001
The slowness of the Nim **streams** library is frustrating, but I've learned to avoid `FileStream`. I simply readAll() into memory and then use `StringStream`, where the unbuffered implementation is fine.
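A minimal sketch of that workaround, assuming the input fits comfortably in memory. The proc name `countLines` is hypothetical, and `readFile` is used here as the bulk read instead of calling `readAll()` on an open stream.

```nim
import streams

# Hypothetical helper: read the whole file once, then parse it from memory.
proc countLines(path: string): int =
  let data = readFile(path)        # one bulk read into a string
  let s = newStringStream(data)    # in-memory stream over that string
  defer: s.close()
  var line = ""
  while s.readLine(line):          # reuses `line` on every iteration
    inc result
```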

Re: Nim vs. Python & Groovy (string splitting): Why is string splitting so slow in Nim?

2019-08-22 Thread brentp
hi @markebbert, re Zstd and htslib, see: [https://github.com/samtools/htslib/issues/530](https://github.com/samtools/htslib/issues/530) ping me if you have any questions on hts-nim.

Re: Nim vs. Python & Groovy (string splitting): Why is string splitting so slow in Nim?

2019-08-22 Thread markebbert
Thanks @jyapayne. It's nice to have this code I can look back on for future use. @brentp, fancy seeing you around here. I blame you for starting me on this little journey. :-P Thanks for pointing me towards the multi-threaded `.gz` decompression in `hts-nim`. I will definitely use that. Did

Re: Nim vs. Python & Groovy (string splitting): Why is string splitting so slow in Nim?

2019-08-22 Thread jyapayne
@markebbert Good catch! Thanks for debugging :P Now that you mention it, the whole section

    if data[last] == '\l':
      buffer.add data[pos+1 ..< pos+bufSize]
    else:
      buffer.add data[pos ..< pos+bufSize]

can just be replaced with buffer.add

Re: Nim vs. Python & Groovy (string splitting): Why is string splitting so slow in Nim?

2019-08-22 Thread brentp
Just to add my 0.02 here, as I happened on this and I have avoided using gzip stuff from Nim as it is too slow. Mark (hi!), I know you linked to [hts-nim](https://github.com/brentp/hts-nim) but that will give you multi-threaded decompression for bgzipped

Re: Nim vs. Python & Groovy (string splitting): Why is string splitting so slow in Nim?

2019-08-21 Thread markebbert
Thanks @jyapayne. I think we (i.e., you) are really close. The first character of each line that exceeds the buffer was getting cut off (or maybe if the prior line exceeded buffer?). Looks like we're off by one at the same spot. I believe: `buffer.add data[pos+1 ..< pos+bufSize]`

Re: Nim vs. Python & Groovy (string splitting): Why is string splitting so slow in Nim?

2019-08-21 Thread jyapayne
@markebbert, yes you are right! That should be

    if data[last] == '\l':
      buffer.add data[pos+1 ..< pos+bufSize]
    else:
      buffer.add data[pos ..< pos+bufSize]
    pos += bufSize

which will account for the buffer increase. So the code now for the `lines`

Re: Nim vs. Python & Groovy (string splitting): Why is string splitting so slow in Nim?

2019-08-21 Thread cblake
For what it's worth, and for completeness if Windows portability even matters in this case (as @markebbert mentioned, these science things are often one time deals), this works but is 6x slower (405 sec aka 6min 45sec) than the `popen`/`mSlices` variant:

    import strutils, osproc,

Re: Nim vs. Python & Groovy (string splitting): Why is string splitting so slow in Nim?

2019-08-20 Thread jyapayne
@markebbert, No problem! No need to explain yourself. I asked for feedback :) It's all for the service of making things better! As for your comment, though, I don't think it is duplicating lines. I extracted the file and compared the output (adding an echo statement to the code) and they are

Re: Nim vs. Python & Groovy (string splitting): Why is string splitting so slow in Nim?

2019-08-20 Thread markebbert
@jyapayne, I don't mean to be a backseat driver, and I realize you're doing this on the fly in your 'spare' time, but I think your updated code is now duplicating lines (without a newline). For example, I'm seeing close to 400k columns on the final header column that identifies sample IDs

Re: Nim vs. Python & Groovy (string splitting): Why is string splitting so slow in Nim?

2019-08-20 Thread cblake
You're welcome. @jyapayne - Well, there is this

    import strutils, posix
    proc main() =
      for line in lines(popen("gzip -dc < big.vcf.gz".cstring, "r".cstring)):
        if line.startsWith('#'): continue
        var i = 0
        for col in line.split('\t'):

Re: Nim vs. Python & Groovy (string splitting): Why is string splitting so slow in Nim?

2019-08-20 Thread markebbert
@cblake, Thank you for all of the detailed comparisons, and thanks for pointing out the counter bug. I did catch that yesterday, but it didn't make a meaningful difference for the timing. I plan to update it in my original post. Thanks for pointing me towards `Zstd`. That's remarkable, and

Re: Nim vs. Python & Groovy (string splitting): Why is string splitting so slow in Nim?

2019-08-20 Thread jyapayne
@markebbert Here is a rough version of the working code. Unfortunately, the speed is quite a bit slower. And it does indeed look like most of the time is spent in the split function, because without it, this code runs in ~39 seconds. With it, it takes about 7 minutes on my machine.

Re: Nim vs. Python & Groovy (string splitting): Why is string splitting so slow in Nim?

2019-08-20 Thread cblake
After `-d:danger -d:release` and gcc-9.2 on an i7-6700k at 4.8GHz this runs in about 46 seconds for me against the decompressed file in a RAM filesystem:

    import cligen/[mfile, mslice]
    proc main() =
      for line in mSlices(mopen("big.vcf")):
        if line.len > 0 and line[0]

Re: Nim vs. Python & Groovy (string splitting): Why is string splitting so slow in Nim?

2019-08-20 Thread jyapayne
@markebbert, yes I've actually made quite a serious error in the code I posted. The way streams work is a bit different from files and I can't actually do what I did to the buffer size. What actually happens is that the code reads in 100 MB of file, parses up to one line, and then throws the

Re: Nim vs. Python & Groovy (string splitting): Why is string splitting so slow in Nim?

2019-08-20 Thread markebbert
I'm glad to see that my silly compiling error has proven useful in an unintended way. :-) @jyapayne, I was adding in my other logic but I noticed some lines were missing. I went back to the minimal example and echoed the line right inside the for loop. It prints the first header line

Re: Nim vs. Python & Groovy (string splitting): Why is string splitting so slow in Nim?

2019-08-20 Thread cblake
Because various Zstd ratios are all so large, it helps in many practical circumstances more than choice of programming language (which often plays in the 2x-5x range). Continuing with that one data file example, with just 4 cores the output rate is 7.3 GB/s. On an otherwise idle 16 core system,

Re: Nim vs. Python & Groovy (string splitting): Why is string splitting so slow in Nim?

2019-08-20 Thread jyapayne
@cblake damn! I had no idea zstd was so awesome. Thanks for those numbers. I will definitely use that for compression in the future. Also your mmap code using cligen is really nice. I'll probably use it for parsing files in the future ;) @rayman22201 I'll file a PR for the problem. Just gotta

Re: Nim vs. Python & Groovy (string splitting): Why is string splitting so slow in Nim?

2019-08-20 Thread mratsim
Note: It's not `--d:release` or `-d:danger`, it can be both. `-d:release` turns on GCC/Clang/MSVC `-O3` optimizations and removes stacktraces. `-d:danger` turns off runtime checks like asserts, checking that all array or string accesses are within bounds, that your `Natural` is >= 0 and so
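To check which of these defines a given binary was actually built with, a small sketch like the following can help. The proc name `buildMode` is hypothetical; `defined` is the built-in compile-time check, and the example compile commands in the comments use a placeholder file name.

```nim
# Example invocations (placeholder file name):
#   nim c prog.nim                          -> "debug"
#   nim c -d:release prog.nim               -> "release"
#   nim c -d:release -d:danger prog.nim     -> "danger"

# Hypothetical helper: report the build mode chosen at compile time.
proc buildMode(): string =
  when defined(danger):
    result = "danger"      # runtime checks are turned off
  elif defined(release):
    result = "release"     # optimized, runtime checks still on
  else:
    result = "debug"       # default build, unoptimized

when isMainModule:
  echo "build mode: ", buildMode()
```

Printing this at the start of a benchmark run makes it harder to compare a debug build against other languages by accident.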

Re: Nim vs. Python & Groovy (string splitting): Why is string splitting so slow in Nim?

2019-08-19 Thread rayman22201
Streams in the stdlib definitely need some optimization love. I've known about the char by char readline problem forever, and just never got around to fixing it. See here:

Re: Nim vs. Python & Groovy (string splitting): Why is string splitting so slow in Nim?

2019-08-19 Thread jyapayne
@markebbert I had a crack at making your script faster. Some std libs need to be altered, but here's the work in a standalone file:

    import zip/gzipfiles
    import strutils

    # This is from lib/pure/streams.nim. Optimized to take a buffer
    proc readStr*(s: Stream, length:

Re: Nim vs. Python & Groovy (string splitting): Why is string splitting so slow in Nim?

2019-08-19 Thread Stefan_Salewski
> There are so many cool languages, but learning a new language and all of its nuances can be so time consuming.

If you have some basic CS background, then the effort of learning Nim is minimal; it is basically reading Tutorial 1 and 2 only, and maybe some of the other free resources listed on

Re: Nim vs. Python & Groovy (string splitting): Why is string splitting so slow in Nim?

2019-08-19 Thread cblake
@jyapayne - not according to him. :-) But your two messages clearly passed in flight. As Araq mentioned, that `gzipfiles` module may need some work. You should do it! Personally, I have been avoiding gzip since at least 2007. There are just much better algos & tools now across every dimension

Re: Nim vs. Python & Groovy (string splitting): Why is string splitting so slow in Nim?

2019-08-19 Thread markebbert
@siloamx, thank you for pointing out that Python list comprehensions are faster than loops! I only recently started using them for simple list manipulations (e.g., vector math), but I did not know they were faster. Also looks like you can use functions within them, which makes them useful for

Re: Nim vs. Python & Groovy (string splitting): Why is string splitting so slow in Nim?

2019-08-19 Thread cblake
You're welcome. The usual behavior for compilers is to generate slow code about as quickly as possible, maybe also with the best debuggability, and do things like add `-O`/`-O2` to get faster code generated more slowly. I don't think Nim should vary from that. I am not entirely sure how much

Re: Nim vs. Python & Groovy (string splitting): Why is string splitting so slow in Nim?

2019-08-19 Thread zulu
Wouldn't it be good to make at least `-d:release` the default compile option? I get nervous every time someone posts a perf issue like this, as I do not know whether it would affect any of my code or not. Thank you @cblake for testing this out.

Re: Nim vs. Python & Groovy (string splitting): Why is string splitting so slow in Nim?

2019-08-19 Thread cblake
I don't know. Thanks for the implicit compliments and all, but it's an awfully specific set of circumstances, not a general performance analysis even of VCF never mind DSV parsing. The performance will be closely related to how long various column substrings are. That will be specific to the

Re: Nim vs. Python & Groovy (string splitting): Why is string splitting so slow in Nim?

2019-08-19 Thread cblake
Well, I don't really have a blog. So, this is what you get. ;-) Someone else can, though. Ideally, just give me credit by linking back to here. Or if you can make any of that `cligen/mslice.*split*` faster then a PR is welcome. As a slight update, storing `big.vcf` in `/tmp` (a tmpfs aka

Re: Nim vs. Python & Groovy (string splitting): Why is string splitting so slow in Nim?

2019-08-19 Thread zetashift
@cblake maybe a worthy post to the nim homepage blog: [https://nim-lang.org/blog.html](https://nim-lang.org/blog.html) ?

Re: Nim vs. Python & Groovy (string splitting): Why is string splitting so slow in Nim?

2019-08-19 Thread cblake
FWIW, I suspect the answer to all this noise is that @markebbert was simply not using an optimized compile (as suggested by the very first line of the very first response to him). @jyapayne - what I did was go to [https://vcftools.github.io/index.html](https://vcftools.github.io/index.html)

Re: Nim vs. Python & Groovy (string splitting): Why is string splitting so slow in Nim?

2019-08-18 Thread jyapayne
Mark, Do you have a sample file that we can use to try to optimize the code? I'd like to try but I'd need to test it out on a real file.

Re: Nim vs. Python & Groovy (string splitting): Why is string splitting so slow in Nim?

2019-08-18 Thread federico3
Nim is usually among the fastest languages around. Occasionally you run into a shockingly slow proc without warnings in its documentation. Having a large set of language/stdlib benchmarks could really help, but so far there's been little interest in creating and maintaining such a set.
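As an illustration of what one entry in such a benchmark set might look like, here is a minimal sketch that times `strutils.split` on a synthetic tab-separated line. The proc name `benchSplit`, the field content, and the sizes are all assumptions; no timing results are claimed.

```nim
import strutils, times

# Hypothetical micro-benchmark: split a synthetic line `reps` times.
proc benchSplit(cols, reps: int): tuple[fields: int, seconds: float] =
  # Build one line with `cols` tab-separated fields.
  var parts: seq[string] = @[]
  for i in 0 ..< cols:
    parts.add "0|1"
  let line = parts.join("\t")

  let start = cpuTime()
  for rep in 0 ..< reps:
    for field in line.split('\t'):   # iterator form of split
      inc result.fields              # count fields so the loop does work
  result.seconds = cpuTime() - start
```

Calling, say, `benchSplit(1000, 1000)` and printing both tuple fields gives a rough number to compare across compiler flags; it should be run on an optimized build to be meaningful.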

Re: Nim vs. Python & Groovy (string splitting): Why is string splitting so slow in Nim?

2019-08-18 Thread jlhouchin
I want to encourage you on your journey with Nim. Nim can be a challenge, especially if you are coming from a dynamic language like Python. Although Nim has a nice and reasonably friendly syntax for most cases, it is not a Python-like language. You cannot approach Nim like Python. It is a system's

Re: Nim vs. Python & Groovy (string splitting): Why is string splitting so slow in Nim?

2019-08-18 Thread cblake
On slightly closer inspection of that spec, it seems that backslash quoting only happens in the `##` comment sections ignored above. So, maybe they aren't buggy after all if that data is not important to the calculation in question. A further point along the lines of "if we're going to provide

Re: Nim vs. Python & Groovy (string splitting): Why is string splitting so slow in Nim?

2019-08-18 Thread Araq
1. I suspect what's really slow here is Nim's IO or unzipping and this should definitely be looked into and fixed. 2. Well, yes, strutils is for quick & dirty hacking, not for "quick runtimes". And optimizing splits is easier said than done, effectively you need 2 different versions so that

Re: Nim vs. Python & Groovy (string splitting): Why is string splitting so slow in Nim?

2019-08-18 Thread cblake
In fact, that VCF format does use `\` escapes ([https://samtools.github.io/hts-specs/VCFv4.3.pdf](https://samtools.github.io/hts-specs/VCFv4.3.pdf)). So, basically all the above code examples are indeed wrong, and @Araq's advice is the right general advice (I did say they should probably

Re: Nim vs. Python & Groovy (string splitting): Why is string splitting so slow in Nim?

2019-08-18 Thread Araq
You can optimize it all you want, in the end it's still naive because it's **wrong** : What if data spans multiple lines? What if there is some escape mechanism via `\`? What if the data can be encoded via `%xx` (byte in hex). There are not many file formats around that have no escape/quoting
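To make the backslash case concrete, here is a minimal sketch of a splitter that treats `\` followed by any character as a literal character. The proc name `splitEscaped` and this particular escape convention are assumptions for illustration, not the rules of any specific file format.

```nim
# Hypothetical helper: split on `sep`, honoring backslash escapes.
proc splitEscaped(line: string, sep = '\t'): seq[string] =
  var cur = ""
  var i = 0
  while i < line.len:
    let c = line[i]
    if c == '\\' and i + 1 < line.len:
      cur.add line[i + 1]   # keep the escaped character literally
      inc i, 2
    elif c == sep:
      result.add cur        # an unescaped separator ends the field
      cur = ""
      inc i
    else:
      cur.add c
      inc i
  result.add cur            # last field (may be empty)
```

A plain `split('\t')` on the same input would cut a field in two wherever an escaped tab appears, which is the kind of silent error being described.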

Re: Nim vs. Python & Groovy (string splitting): Why is string splitting so slow in Nim?

2019-08-18 Thread cblake
I agree with @siloamx. I would also point out that for many less compiler/parsing-sophisticated programmers, splitting is conceptually simple enough to make all the difference. Such programmers may never even have heard of "lexing". This is just to re-express @Araq's point about it being

Re: Nim vs. Python & Groovy (string splitting): Why is string splitting so slow in Nim?

2019-08-18 Thread siloamx
Maybe 'Parsing with `split` is naive', but it will be very common, especially among users who migrate from Python/JavaScript. It would be nice if you could optimize it.

Re: Nim vs. Python & Groovy (string splitting): Why is string splitting so slow in Nim?

2019-08-18 Thread Araq
I use `lexbase` and `strscans` modules and never had performance problems. Parsing with `split` is naive, it hardly works, so we never really optimized it.
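For a flavor of the `strscans` approach, here is a minimal sketch that reads an identifier and an integer separated by a tab from the front of a line using the `scanf` macro. The proc name `parseLocus` and the two-column layout are assumptions; `$w` only matches identifier-like text, so a purely numeric first column would need a different pattern.

```nim
import strscans

# Hypothetical helper: parse "name<TAB>integer" from the start of a line.
proc parseLocus(line: string): tuple[ok: bool, chrom: string, position: int] =
  var chrom = ""
  var position = 0
  # $w matches an ASCII identifier, $i an integer; the tab matches verbatim.
  if scanf(line, "$w\t$i", chrom, position):
    result.ok = true
    result.chrom = chrom
    result.position = position
```

Because the pattern does not end in `$.`, trailing columns after the integer are simply left unread.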

Re: Nim vs. Python & Groovy (string splitting): Why is string splitting so slow in Nim?

2019-08-17 Thread erikenglund
I just wrote an OBJ loader for my game last week; OBJ files are text files that describe a 3D model. To stress test it I tried the common large Minecraft model Rungholt, a text file with over 9 million lines. Blender took 2 minutes to load the file and Windows preview gave up. I load

Re: Nim vs. Python & Groovy (string splitting): Why is string splitting so slow in Nim?

2019-08-17 Thread Stefan_Salewski
Have you compiled your Nim program with option `-d:release` or `-d:danger`? Just to ensure that it is not a default debug build. Of course Nim should not be slower than Python, but I guess it may not really be faster for plain string split. The reason is that these basic operations are generally

Nim vs. Python & Groovy (string splitting): Why is string splitting so slow in Nim?

2019-08-17 Thread markebbert
Hello, I'm new to Nim, but was tempted to give it a go because I've heard it has the simplicity of Python and the speed of C. I sat down to write my first Nim script last week, where I mimicked a script I had already written in Python. I was excited to see just how fast Nim would be. The