Re: Nim version of
@Stefan_Salewski, you can also just call the libc memchr (which is what the current memfiles does to delimit lines aka slices until string conversion). A good memchr will do the SSE/AVX internally. For example this program: import memfiles, os, times var f = memfiles.open(paramStr(1)) var cnt = 0 let t0 = epochTime() for x in memSlices(f): inc(cnt) # The action here is just one line let dt = (epochTime() - t0) * 1e9 echo "cnt: ", cnt, " ns: ", dt, " ns/cnt: ", dt / float(cnt), " B/ns: ", float(f.size) / dt f.close On my Linux machine this Nim program runs about as fast as the `memchrcount` version in Dan Lemiere's approach (which is about 4x the "naive"/"strawman" approach). While Dan does get a speed-up of another 3x more over that `memchr` approach with vector code, that also loses the feature to do anything besides _count_ newlines. Since some kind of non-"mere counting" processing is usually desired, the `memchr` approach is probably what most people would want.
Re: Fun with deduplicate
@Stefan - this is a classic application of `containsOrImpl` in the set APIs (or `getOrPut` in the table APIs) which are very close to the core "search or insert" hash table algorithm. proc uniqCp[T](s: openArray[T]): seq[T] = newSeq(result, s.len) var h = initSet[T](rightSize(s.len)) var i = 0 for el in s: if not h.containsOrIncl(el): result[i] = el inc(i) proc uniqInPlace[T](s: var seq[T]) = var h = initSet[T](rightSize(s.len)) var i = 0 for el in s: if not h.containsOrIncl(el): s[i] = el inc(i) s.setLen(h.len) These were also about 20% faster on my system than their `OrderedSet` counterparts.
Re: unescape \n \r etc in string
Perhaps you want strutils.escape? I.e.: import strutils echo escape("\n\t") Output is "\x0A\x09" Its output is designed to be as portable an 'input' to de-escapers/escape-interpreters as possible. So you will see things like \x0A and \x09 instead of the \n \t because the former syntax is accepted by more string parsers.
Re: To optimize operation of replacement of lines in the file
If you use memfiles then you can just map the whole file and let the OS page things in from disk on demand as necessary. This gives you a "whole file buffer for almost free" kind of user interface. After the first run or file creation/etc. there may be zero/little IO and also very little copying of data from kernel space to user space. You do need a real file name that is seekable (i.e. "stdin" may not work), but if that constraint is ok for you then mapped files is almost always the fastest way to go and among the easiest to use, too. You can look at the memSlices iterator in memfiles.nim for one example.
Re: How to properly bind a function to a compiler buildin?
> For my use case it is very important, that I can get all the sizes at compile > time You may not be able to make all these things available at Nim-compile-time because they are not fully determined until the C (or whatever) backend compile is reached. For example, type foo object = bar: char baz: int will induce generation by Nim of some struct in the C backend. This creates a choice for the C of how to layout the struct - is it "1 byte packed next to 4 or 8 bytes" or rather is there a 3 or 7 byte gap between `bar` and `baz`. Many back ends will put in that 3 or 7 byte padding to have the int field be aligned which can result in faster generated assembly code on some CPUs. Almost all C compilers allow some kind of compiler directives like GCC's `__attribute__((__packed__))` or an analogue to kill all padding and let users control layout. For the C backend (but maybe not others..?) it may be possible for Nim to use those directives for all its structs to workaround the ambiguities of backend struct layout and resolve these to Nim-compile-time values. It is very debatable if that is desirable, though. It is definitely more complexity than just "binding" calls. That binding is really more delegation than resolution to a value. What the compiler currently does seems the right thing to me which is to delegate the complex/compound cases to the backend and "optimize" the cases for simple/non-composite types where it can reliably know the answer. For compound types, Nim can know these backend layout sensitive values are Nim- and backend-compile-time constants, just not what the values are to do fully resolved at compile-time calculations. So, a higher level question is if this resolution to a value known at Nim compile time is really a strict requirement for your purposes or if you can relax that to delegated backend work so the calculation can be done at "backend compile time".
Re: how to read/write object from/to binary file?
While doofenstein makes some good points especially about fragility with regard to adding new fields and old data files, there is some real efficiency charm to a "just memory map and go" pure binary native data approach. It's not always fun to be parsing data files again and again. So, I would also add another way to read the data. Note that you should probably use the `{.packed.}` pragma, note the use of `ptr UncheckedArray` which should really alert the reader to be careful, note the commented out inference of the number of objects in the file. My example code assumes you wrote out at least two objects from code like you had above (with a `{.packed.}` pragma). It will probably crash if the file is too short, etc. Just to give you a flavor. import memfiles type MyType* {.packed.} = object somefield*: array[10, int16] MyTypes* = ptr UncheckedArray[MyType] var t: MyType var f = memfiles.open("/tmp/tmp.mytype") # infer n object from file size (may need packed); Do own range checking! # let n = f.size / sizeof(MyType) let myobs = cast[MyTypes](f.mem) echo myobs[0].somefield[1] echo myobs[1].somefield[0]
Re: Execution speed Nim vs. Python
flyx wrote: > And yes, something between array and seq would be nice. But currently, Nim's > type system does not allow a type with a runtime-calculated length parameter. > Allowing it would be a hack similar to C VLAs. No hack involved here. The length parameter can be part of the object like seq rather than part of the type like array. openarray-accepting procs should not care since they can receive seqs anyway with run-time variable everything or arrays with compile-time fixed everything. One might argue that I/others could just do their own concept which is true. The standard lib has lots of openarray-accepting routines that would work fine with post-init-fixed seq-like objects, though. So, not having such a type in the standard lib that's part of openarray feels unnecessarily inflexible, but on this I think we agree since you said it would be nice. It may just be easier than first imagined. :)
Re: How to be more concise in code [NEWBIE]
Not sure what will be easiest for you to use/understand..I thought the patty approach pretty nice. That said, to answer your last question, the macros module has a few interesting things along the lines of what you were asking for: `parseExpr`, `parseStmt`, and `emit`. The first two can compile strings to ASTs and the last can do that and then put the resulting AST/code in your program. `emit` has a little example right in its definition in `macros.nim`. I think `emit` is probably what you were asking for.
Re: Perfecting Nim
While excess can be bad, and there are inconsistencies that should be repaired, I think a "batteries included" stdlib is a good thing. E.g., the recent `enumerate` for loop macro in `tests/macros/tforloop_macro1.nim` should probably be in `lib/pure/sugar.nim`. I think the duplication (`re` and `nre`, `parseopt` and `parseopt2`) should go. `mersenne` does seem pointless. Maybe game programmers use `basic[23]d`? I think `coro` is a good example of low level programming in Nim, but I don't know if needs to be in the stdlib. I like `set[]`, "do", converters, and exceptions. Exception-free alternatives in the stdlib sounds good, though. I think `using` is a good idea. I don't usually use dynamic dispatch of any kind, but the Lisp world swears by multi-methods for such and they are more general than the usual first-slot-only OO dispatch. No opinion on holey enums vs distinct ints or `discardable`. I never liked non-case-sensitivity or `TaintedString`. I think it would be nice if `openArray` could be more "open". It could become a first class `concept` needing only `[]`, `len`, etc. when concepts are ready enough. That may be already planned.
Re: Looking for a set that sorts and deduplicates
`CountTable` is just a histogram and the sorting is by the bin counts not by the keys. Keys are visited in "hash order". `hashes` does use an identity hash for integers which could create some confusion from a simple test program if the modulo the table size post hash transform doesn't change the order (though it surely can). Of course, the larger idea embedded in this suggestion is quite legitimate - you could use a `HashSet[int]` (from `sets`) to manage the set until you are ready to create a `seq[int]` via `HashSet.toSeq` and then sort that seq at once. This is probably the approach that most would use and is probably faster for most access patterns. The heap/tree approach would likely win if you had dynamism -- insert a bunch, sweep in key order, insert some more, sweep in key order again, etc. This complex dynamic pattern wasn't really specified in your question, though.
Re: Execution speed Nim vs. Python
Nim `Table` and `HashSet` should be caching the hash code values in their tables and also using that as a "comparison prefix" (comparing string values only if the integer hash codes already match). The string hash of the new inputs is ineliminable - or rather has to be done at least once anyway. My guess is the lines are long-ish (greater than 10 chars, say). The default Nim hash string function could be faster, especially for longer strings. E.g., the current default does byte-at-a-time manipulations and usually 8-byte/word-at-a-time can get about as good bit mixing much faster, especially for long strings like lines in files. Such a replacement might take that 60% hashing time in this particular benchmark down to 10-15% for a 2X-ish overall speed up. I have benchmarked C++ for these kinds of tests and I suspect the current STL would not be faster than Nim with the -d:release in this case. (C++, too could benefit from faster default string hash).
Re: could Nim benefit from upcoming C++ modules to speedup compilation times?
@timothee - @amalek beat me to the punch, but user-visible compilation time is very sensitive to backend compiler/gcc options. Consider two invocations compiling my [cligen](https://github.com/c-blake/cligen) test suite (30 programs with somewhat complex macros running at compile-time): cligen$ ./test.sh ** ./test.sh * Time: 10.43s (u) + 0.92s (s)=11.05s (102%) cligen$ ./test.sh -d:r ** ./test.sh -d:r * Time: 57.54s (u) + 2.85s (s)=56.96s (106%) My `nim.cfg` is set up to have the former use tcc/TinyCC as a backend ([the mob branch](http://repo.or.cz/tinycc.git)), but to use -d:release and gcc on max optimization for the latter. So, basically a huge 5X difference in total user-visible compilation time. The Linux perf record/perf report tell me that in the former case Nim is about 82% and 14% in the latter case. So, in both cases Nim is 9 seconds or about 300 milliseconds per program (probably 10% more with kernel/libc activity...). That's not very "real-time" (like re-compile every keystroke), but it's pretty good. tcc is much faster (only 3% of the time for that faster run), but I think there are enough open issues with the Nim compiler that correctness should be the focus more than compile-time performance.
Re: To optimize operation of replacement of lines in the file
You're welcome, Garry. Stefan - a simple check for whether pattern has any regex metacharacters (e.g. '.', '*', etc.) can decide if a pattern is definitely not a regular expression. If there are any metacharacters in pattern, well only the user can know and there would have to be a command-line switch to force one interpretation vs. another. In terms of absolute maximum performance, one usually wants to not do the search "line at a time" at all. Rather, one wants to use an assembly-optimized "strchr"/"memchr"-like search on the biggest chunks of input possible. In this context that means scanning the input for the least likely character that is definitely necessary for a match. There may be characters much less likely than newlines, and so a big boost may be possible (though note that input data sets do vary). Once that byte offset is known, from there one determines if it is part of a real match. That is all quite a bit more bookkeeping/logic/hassle, obviously.
Command-line Parsing Preferences
So, there have been various on again/off again conversations on github about how the stdlib parseopt could/should behave. Those conversations seemed to have a pretty narrow audience..basically participants on these github issue threads: [4620](https://github.com/nim-lang/Nim/issues/4620) [6818](https://github.com/nim-lang/Nim/issues/6818) . It made sense to me to raise this topic in the forum where a wider audience might weigh in before 0.18/1.0/whatever stabilizing might happen. The basic issue is whether command users have to separate option keys from option values (e.g. with a ':' as the current code does). In order to avoid that requirement, command authors need to update a list of option keys whenever they change them which is a hassle. So, there is a tension between convenience to CLI authors and CLI users. As a user, I often forget to type the ':' and would be delighted if more people chose the symbol table route. Myself, I think having an **optional** symbol table is the most programmer and user-friendly solution. That 's what I did in parseopt3 in [cligen](https://github.com/c-blake/cligen). If the programmer wants to provide more POSIX-like command syntax then that's easy. OTOH, if like Araq they very understandably prefer not to have to update a symbol table when they add new option keys then they can just ignore that feature and make their users provide a ':' to separate option keys and option values. To me this kind of optional symbol table seems both backward compatible and forward looking and strikes a good balance for a stdlib. I'm just one person, though, and that's just my opinion. The point of this thread is to solicit other opinions/perspectives/input to help figure out the best resolution of [6818](https://github.com/nim-lang/Nim/issues/6818) .
Re: Fastest way to count number of lines
@jlp765 - good catch. I thought of that, too (I actually wrote that `memSlices` stuff), and almost went back and added a note later, but you beat me to it. I still am unaware about relative timings on platforms other than what I personally use and would be interested to hear reports, but on Linux/glibc `memSlices` (or more generally `mmap+memchr` however that is invoked) is always fastest in my tests.
Re: use fork() to speed up compilation testing.
@twetzel59 - I use fork (and even vfork) all the time. They work just fine. @Krux02 - I have never used compiler-as-a-service mode which seems originally targeted at IDEs, but that feature may apply to testing as well. See `compiler/service.nim` and `tests/testament/caasdriver.nim`..maybe search for "caas". That said, I suspect a lot of the time is actually the C compiler, not the Nim compiler. If you can set up a nim.cfg to run tests with TinyCC/tcc (the mob branch), things could probably go much faster.
Re: Trouble with reading from stdin
I believe what you are running into is this terminal driver issue: > [http://unix.stackexchange.com/questions/131105/how-to-read-over-4k-input-without-new-lines-on-a-terminal](http://unix.stackexchange.com/questions/131105/how-to-read-over-4k-input-without-new-lines-on-a-terminal)
Re: Nim vs D
Oh, sure. There isn't _usually_ a 1.618**4 kind of work reduction in play, though. Araq would know, but I doubt the reason NimMainInnner is called through a volatile function pointer is to trick gcc's optimizer, though that is a happy side effect. To me, this was just a performance mystery that I found curious and thought others might appreciate the answer to. Also, more Nim-relevant might be the [automatic memoization thread](https://forum.nim-lang.org/t/1343) which is a more dramatic optimization aedt might appreciate. Of course, an in-line array or even just the closed-form formula are faster still for Fibonacci, but presumably the point of this benchmark from xyz32/aedt was to stress test recursion in a programming language. They may want to reconsider that benchmark to something with less exponential sensitivity to unrolling tricks - unless, of course, the intent is to measure unrolling tricks.
Re: Reason for -fno-strict-aliasing?
I did a little searching and compiling and I'd have to say Jehan is right here about "absence of evidence". See, [this stackoverflow thread](https://stackoverflow.com/questions/21214875/gcc-accuracy-of-strict-aliasing-warnings) for a very simple example to people that is (still in 2017) too complex for (at least gcc's) Wstrict-aliasing heuristics. It sure seems like strict-aliasing is a real morass. @cdome - perhaps a better solution would be some kind of `emit` and/or `macro` machinery that turns on `-fstrict-aliasing` just for the procs you need to recover performance. gcc has had this `#pragma` or `__attribute__` way to do that for quite a few years now (2011, I think). See [here](https://gcc.gnu.org/wiki/FunctionSpecificOpt).
Re: Execution speed Nim vs. Python
@flyx - c99 (and gcc long before it) have VLAs of the fixed-after-initialization variety...mostly syntactic sugar for alloca, of course. All that is surely more commonly used than Ada and required no fancy containers or any constraints about being a leaf function. Such an array style may be different enough from the post-init-resizable Nim seqs to warrant a different type, but it could still be an openarray. I think Nim needs a kind of openarray between array and seq - one which is run-time-sized, but fixed after initialization and with backing memory not necessarily on the heap. For example, an array of char that is backed by a read-only memory mapped file and so cannot be zero-terminated, but being from a file its size cannot change after initialization and yet the size can also never be known at compile time. Then it could get bounds checking like other openarrays and otherwise behave similarly. Maybe that's not the best example, but I'm sure there are many more. Lots of times the sizes of things are only known at run-time, but don't change once known/computed. Also, many openarray accepting routines need not modify their inputs -- why, they cannot change lengths of arrays, for example. So, it would be nice if all openarray accepting routines could work with such a type, whatever it might be called. Maybe there is one and I am just unaware.
Re: Nim vs D
This recursion unpacking/unrolling trick that gcc does (at call-site if insulated by call via volatile function ptr, and always inside the recursive impl) is, in my experience, a rare compiler optimization, but maybe it will catch on. clang does _neither_. If you `objdump -D` the executable (or disassemble in `gdb`/etc.) you will see the single `callq` to Fibonacci at the entry with the full N and a pair of `callq` inside the impl. So with clang the full [1.618**n work](https://stackoverflow.com/questions/360748/computational-complexity-of-fibonacci-sequence) happens. On my i7-6700K Linux with gcc-7.1.0, I get a time ratio of about 15.3 to 1 (53.5s/3.5s). For what it's worth, it's likely the compiler writer guys think of this as mostly a "remove a few funcalls here and there" type optimization..maybe use SSE/AVX/branch predictor a bit better, maybe save a little dynamic stack usage, etc. This doubly recursive Fibonacci approach just happens to be about as sensitive as possible to that particular optimization because it can save exponentially many funcalls.
Re: Looking for a set that sorts and deduplicates
Memory hierarchies since the 1990s have broadened the relevance of 1970s practical wisdom for disk storage. A solid hash table with locality and solid B-Tree often win today. A great stdlib should have both. A more obscure bonus of a B-Tree (when compared to, say, balanced binary trees) is cheaper amortized cost to maintain simultaneous access by rank (e.g. quickly find 95th percentile, top 10, etc. as well as query by key). With B-Trees you need only one rank counter per big node of O(M) elements instrumentation, not per element as with binary trees. So, B-Trees afford O(M)-times less space overhead and O(log(N)/log(M))-times less time overhead than a rank-instrumented binary tree of N elements. This can be useful for, e.g. tracking quantiles over moving data windows.
Re: Fastest way to count number of lines
Yeah..Depending on what he's doing, same-file dynamic estimation might also work. Good point, @jlp765. On my system the 5.4 MB of `/usr/share/nim/**.nim` gets counted in about 4 milliseconds - over 1.3 GB/sec, probably faster than all but the most powerhouse nvme/disk array IO. This is why I suspect @alfrednewman might be re-calculating things instead of saving the answer either in RAM or as files. I'm sure a pre-pass calculating the number of lines can avoid certain complexities. However, once you start doing assembly hijinks that are not even portable through a given CPU family (e.g., using SSE, AVX2, AVX512, ...) performance becomes very deployment sensitive. Meanwhile, eliminating the entire pre-pass by merging it with per-line allocations/whatever costs complexity, too, but yields portable performance gains. If it's really ineliminable then have at it, asm-wise, I guess. I just suspect it's off-track.
Re: Remarks on standard library file functions (system, sysio)
Yeah. I haven't tried out your code, but an API idea might be to have the caller instruct the iterator what to do with any improper/undelimited record at the end -- includeUnterminated=False or something. Only the caller really knows what is desired. One observation, algo performance-wise, is that the fastest substring searches are usually achieved by using a fast libc SSE/AVX vectorized `memchr` to find the _least likely_ character in the substring. Sometimes one knows this from vague properties of the distribution. For example, framing the email messages in UUCP-style mailbox with '^From ' kinds of delimiting, `memchr` for the capital 'F' skips over most text most of the time. This idea may not work in a corpus of emails in ALL CAPS ALL THE TIME, but such a corpus should be rare. Anyway, this idea is just sort of the simplest/first step into a wider world of efficient substring search algorithms that you might enjoy reading about some day.
Re: How to properly bind a function to a compiler buildin?
You are right that C compilers already strive to ensure ABI struct compatiblity which is why I think having Nim leverage that effort is better than reproducing it at the Nim level, though using backends themselves to pre-generate things may help. My point was to get truly Nim-time resolved values was trickier than it seemed was being framed by the discussion. Another thought along those lines might be to resolve things to Nim-time values only for types using Nim's `{.packed.}` pragma. Maybe only `{.packed.}` types would get the best error reports.
Re: Fastest way to count number of lines
So, one other thing that is _probably_ obvious but bears mentioning just in case it didn't occur to @alfrednewman - the number of addressable/seekable bytes in a file is usually maintained by any modern filesystem on any modern OS as cheaply accessed metadata. So, if what you really need is not an exact count but a "reasonable bound" that could be violated once in a while then you may be able to _really_ accelerate your processing. As one example text corpus that we all have access to, the current Nim git repo's `lib/nim/**.nim` files have an average length of 33 bytes and a standard deviation of about 22 bytes as assessed by this little program: import os, memfiles when isMainModule: var sumSq = 0 for slc in memSlices(inp): inc(counter) sumSq += slc.size * slc.size echo "num lines: ", counter let mean = float(inp.size) / float(counter) echo "mean length: ", mean let meanSq = float(sumSq) / float(counter) echo "var(length): ", meanSq - mean*mean You could probably reasonably bound the average number of bytes per line as, say (average + 4*stdev) which in this case is about 33+22*4 =~ 121 bytes..maybe round up to 128. Then you could do something like: import os var reasonableUpperBoundOnLineCount = int(float(getFileInfo(myPath).size) / float(128)) If you use that bound to allocate something then you are unlikely to over allocate memory by more than about 4X which isn't usually considered "that bad" in this kind of problem space. Depending on what you are doing you can tune that parameter and you might need to be prepared in your code to "spill" past a very, very rare 4 standard deviations tail event. This optimization will beat the pants off any even AVX512 deal that iterates over all the file bytes at least for this aspect of the calculation. It basically eliminates a whole pass over the input data in a case that is made common by construction. Since you have seemed pretty dead set on an exact calculation in other posts, a small elaboration upon the "embedded assumptions" in this optimization may be warranted. All that is really relied upon is that some initial sample of files can predict the distribution of line lengths "well enough" to estimate some threshold (that "128" divisor") that has "tunably rare" spill overs where they are rare enough to not cause much slowdown in whatever ultimate calculation you are actually doing which you have not been clear about. Another idea along these lines, if, say, the files are processed over and over again, is to avoid re-computing all those `memchr()` s by writing a little proc/program to maintain an off-to-the-side file of byte indexes to the beginnings of lines. The idea here would be that you have two sets of files, your actual input files and some paired file "foo.idx" with foo.idx containing just a bunch of binary ints in the native format of the CPU that are either byte offsets or line lengths effectively caching the answer of the `memchr`. If you had such index files then when you want to know how many lines a file is you can `getFileInfo` on the ".idx" file and know immediately. You could be careful and check modification times on the .idx and the original data file and that sort of thing, too. Why, you could even add a "smarter" `memSlices` that checked for such a file and skipped almost all its work and an API call `numSlices` that skipped all the work if the ".idx" file is up-to-date according to time stamps. Basically, except for actually "just computing file statistics", it seems highly unlikely to me that you should really be optimizing the heck out of newline counting in and of itself beyond what `memSlices/memchr()` already do.
Re: Looking for a set that sorts and deduplicates
@Araq - fabulous. I think a B-Tree in the stdlib would be great. @mratsim - I think you meant `collections.intsets` with a 't' which does _not_ yield items() in key order. The built-in `set[up to int16]` type _does_ (implicitly) order its keys, though. @cdome also did not mention how large the integers he needed to support were, but he said "int" which I took to be the standard Nim `int` which is too big for a `set[]`. Also, while the set `incl` operations of `set[int16]` are unbeatable performance-wise, iterating over the set can be relatively slow if the set is not populated -- e.g., if you do int16 and only have 10 unique numbers, you still have to loop over 65536 entries to convert to a seq[int]. You can play around with this little program to see the various effects. import sets, intsets, times, random, sequtils, algorithm let trials = 1 let numberRange = 1000 # 1 let doPrint = false randomize() var Is = initIntSet() let tI0 = epochTime() for i in 1..trials: Is.incl(random(numberRange)) var Iss = toSeq(items(Is)) Iss.sort(cmp[int]) echo "IntSet time: ", epochTime() - tI0 if doPrint: echo Iss var Hs = initSet[int]() let tH0 = epochTime() for i in 1..trials: Hs.incl(random(numberRange)) var Hss = toSeq(items(Hs)) Hss.sort(cmp[int]) echo "HashSet[int] time: ", epochTime() - tH0 if doPrint: echo Hss var Ss: set[int16] let tS0 = epochTime() for i in 1..trials: Ss.incl(int16(random(numberRange))) var Sss = toSeq(items(Ss)) echo "set[int] time: ", epochTime() - tS0 if doPrint: echo Sss
Re: Reason for -fno-strict-aliasing?
@Tiberium, in the `csources/build.sh` context you should only need to remove the `-w` and change the `-fno-strict-aliasing` in `COMP_FLAGS` since it's just a shell script with a zillion gcc invocations. Compiling other nim code, I only had to change my nim.cfg's `gcc.options.always = "-Wall 2>>/tmp/gcc.log"` or something similar to catch the warning outputs. Still yet to see any warnings related to aliasing.
Re: OrderedTable is not an ordered table
@lightness1024 - good point. Another argument in favor of "KeyOrderedTable" would be that it would probably show up in searches of the more generic "OrderedTable". I don't think Nim has a key-ordered collection (EDIT - in the stdlib) right now, but Araq has said he has a B-Tree waiting in the wings.
Re: Perfecting Nim
Patching the stdlib to use concepts where it makes sense instead of `openArray` is also a fine idea. It's really used a lot -- (`algorithms`, `strutils`, `sequtils`, `random`, `stats`, etc.). `Iterable` & `Indexable` concepts make sense to me as a decomposition for all those interface use-cases. Something so basic as a dependency should not be in something called `sugar`, though. Some new stdlib module (`concepts`, `stdtypes`, `typeclassess`?) would be a better place for standard concepts that many other modules depended upon. But this is off-topic for "what to remove".
Re: Surprises with Generics
One other wrinkle more in line with this thread topic is making things work for general object types but specific standard key types. In C one might handle that by taking an `offsetof` for where they key field is within an object and `sizeof` for the size of object in the array. In Nim, one could do similar, but it might be more "Nimonic" to more safely dispatch via some `sort template` instead of a `sort proc`, e.g.: template sort[T](inp: seq[T], keyField: untyped) = when T.`keyField` is byte: ## XXX Handle all the standard types ... ## XXX good & stable algos for each one myArray.sort(myKey) The implementation might look nicer with a family of overloaded calls rather than a big `when T.`keyField` is` dispatcher. I am not sure there is a clean way to do that in this case. Each `when` clause should be able to dispatch to a type-specialized case, though. So, it's not too bad.
Re: Nim vs D
@aedt - with that C program I posted and just changing the numbers, I see no change in behavior at all: N fib(N) log_2(Fib(N)) WallTimeUsrCPU log_1.618(Tm) 57 365435256832 38.410825 88.338572 88.29 9.31268150798 56 225851408384 37.716583 54.570624 54.54 8.31166257771 55 139583848448 37.022341 33.718429 33.7 7.31112152655 54 86267559936 36.328100 20.863663 20.85 6.31352254278 53 53316284416 35.633857 12.885360 12.87 5.31201286047 52 32951275520 34.939615 7.9617497 7.95 4.31148872869 51 20365008896 34.245373 4.9201932 4.91 3.31125979281 50 12586267648 33.551131 3.0422103 3.04 2.31214787174 49 7778741248 32.856890 1.8785994 1.87 1.31034617400 48 4807526400 32.162648 1.1626232 1.16 0.31313742467 47 2971214848 31.468406 0.7182663 0.71 -0.68769978950 46 1836311680 30.774164 0.4441032 0.44 -1.68685323404 45 1134903040 30.079922 0.275319635 0.27 -2.68048037749 Indeed the fit is basically a perfect straight line: log_1.618(time) = N - 47.67. The integer is only passed around as an argument. The arithmetic is all float32 (or maybe float64 depending on impl). So, to me this is as should be expected. I believe that is the case in all your benchmark versions, too. Not sure what is going on for you.
Re: Creating a new seq is not that fast
If zeroed memory is required semantically, then it's better to compare against `calloc` not `malloc` in an apples-to-apples sense. Also, at least on Linux on you need to be careful in benchmarking this sort of thing. As you get to larger allocations the OS can actually give the process many linked copies of a copy-on-write 4096 byte page of all zero bytes. Because of the C-O-W, the initial allocation alone can seem much cheaper than it really is. The work is deferred to being done lazily as memory is written to. This zero page business can also cause confusion with calloc-never-write read-only workloads in benchmarks (say `calloc` and then sum all the ints) because on x86 the cache is physically mapped and that single zero page can be fully L1 resident. Such L1 residence can give a "false" speed advantage compared to more aged/populated memory. [ This latter point is not relevant here, but seemed worth pointing out anyway -- it was how I personally discovered Linux was finally doing this optimization. ]
Re: Execution speed Nim vs. Python
(I should perhaps qualify beyond -d:release using gcc or clang with high optimization levels on the back end since jil210 revealed in [another thread](https://forum.nim-lang.org/t/3198) the baseline 5X performance mystery arose from using tcc as a back end.) Also, for the curious, the reason C++ STL will usually underperform Nim's hash tables in benchmarks like this is that STL iterator deletion semantics make the natural STL hash table collision resolution implementation choice be external chaining. Those extra linked list indirections add latency, especially when tables do not fit in on-CPU cache.
Re: Remarks on standard library file functions (system, sysio)
Some greps can indeed be quite good. Really, I think this algorithmic intelligence should be able to be just used from Nim `strutils`. Nim strings know their length. Many/most of those read-only procs like `strutils.find` "should" be able to work on types like `MemSlice` or really some kind of `openarray[char]` \-- a pointer and length pair. Scare quotes for the "should" since the current Nim `openarray` does not include arrays of dynamic/run-time extent but with non-heap/GC backing memory like a memory map, or some other blob of memory. There is ongoing work on concepts that could probably profitably replace `openarray` in many use cases. So, maybe some day in the not too distant future, as they say. Concept support may already be good enough to start an ambitious project along those lines, but yeah, for you/your situation, just re-doing memSlices is a fine approach, esp. if absolute performance is not a priority.
Re: Command-line Parsing Preferences
It really shouldn't _need_ to break anything, though mistakes can always be made in some initial pass. I would think you might also like long option key normalization so that one doesn't have to remember when typing commands `--how-toSpell_OptKeys-exactly`. Entering commands is often more ad hoc and personal than writing code, to the extent they are even distinct. Also, if it matters, I'm happy to donate [parseopt3.nim](https://github.com/c-blake/cligen/blob/master/parseopt3.nim) to the cause. That adds a few more features beyond optional symbols like "\--"/`stopWords`, and `optionNormalize`. It should be API/backward compatible by default, but I really haven't tested it much outside being driven by [cligen](https://github.com/c-blake/cligen) macros. E.g., I'm not sure how it fares against quoting issues that arose recently.
Re: writeFile with iterator
Also, in terms of optimization just replacing: for str in iter(): file.writeLine(str) with for str in iter(): file.write(str & "\n") yielded a 1.3X speed-up (I.e., it took the time down to 0.14 seconds).
Re: Nim vs D
In case anyone else is curious about aedt's fibonacci benchmarks, I found an explanation for the approx 6..8X time difference of C vs Nim (with gcc as a backend, on Linux). It took digging into assembly to figure it out, though. With the Nim version, the nested calls/various module boilerplate allow gcc to unpack the top several entries of the recursion _at the call site_. It is not a properly memoized/fully made iterative tail call, though as in aedt's editted comment. (Both versions unpack a few levels in the fibonacci function itself). So, really the doubly recursive call is invoked with a number 4 or 5 smaller than the pure C implementation and so runs much faster since the running time is exponential in N. So, this is mostly just good luck for Nim-gcc in this highly specific benchmark, not something fundamental in the language or impls that would generalize all that well. (I haven't figured out how to tweak the pure C to get gcc to engage that optimization, but presumably someday the gcc folks will do so behind the scenes or maybe there's already a magic -f flag I'm unaware of.)
Re: OrderedTable is not an ordered table
Thanks for the compliments. You should probably learn about backward shift deletion. It's a little tricky the first time you see it, but ultimately it's better than tombstones and a bunch of ad hoc garbage collection rules. You can look at `lib/pure/collections/tables.nim`. There are really only a few improvements to Nim's `Table` that I think might be good. We could do Robin-Hood reorganization on inserts, being sure to retain linear probing and backward shift deletion. In my tests, Robin Hood costs about 10% more time on workloads which are mostly successful lookup but saves about 50% of the time for workloads that are mostly failed lookups. (Note that every insert of a novel element starts with an unsuccessful lookup.) So, Robin Hood never adds much on average and is often a net win on "many workloads" and as a bonus squashes the variance of lookup times. The second idea to speed up might be storing the hash codes in a separate array from the keys so that a single cache line can fit a larger collision cluster. That speeds up searches, but it also slows down inserts and deletes which need to hit twice as many memory spots. So, your mileage may vary even more than RH. Finally, Tables could also elide saving the hash code for `SomeInteger`-like "cheaply hashable/comparable" keys. Unlike the prior two ideas, the speed up from this would not be workload dependent. However, to do it there needs to be either a separate occupied bit vector (two cache lines again, like a segregated hash code array) or a special reserved key value signifying "empty". The reserved value is the efficient way to go and just one is usually not too onerous (zero, -1 or -2**31, etc.), **but** it 's not an easy thing to default. So, that expands the API to the user - for just some type classes of key - to specify the empty value. So, a specialized name like `SomeIntTable` analogous to the specialized histogram `CountTable` is probably warranted.
Re: Fastest way to count number of lines
Nice, boia01! In my same `/usr/share/nim/**.nim` test I get 768 microseconds for your version and 2080us for just doing the memSlices approach..So, 2.7X speed up, a bit less than the 4X I saw when I last compared the approach in two C versions..maybe unrolling. Dunno. @alfrednewman - if the statistics of these data files are stable over the whole file, you could always stop after the first gigabyte (or maybe less), figure out the average line length and use the file size to estimate the number of lines, and then it would only be a ~1 second delay. A MemFile knows its size, as does each slice...So, this is pretty easy to code. Of course, if the first gig is not always representative of the remainder that might not work, but it sounds like you probably don't need an exact answer. This is sort of a simplified version of one of my/jlp765's suggestions. 2.7 * 1.3 GB/s =~ 3.5 GB/s is faster IO than many people have. So, even using boia01's code you may not see a great speed-up.
Re: Surprises with Generics
@Stefan_Salewski - I think this idea of having a family of overloaded procs to sort on standard types is very good. I have almost raised it myself several times. It would be useful for `cmp`-less `sort` procs on such type to be _stable_ sorts (meaning elements with identical keys retain relative positions in the array). Then a user could implement a simple 2-level or 3-level sort on distinct embedded keys by just calling such sort procs 2 or 3 times. (A multi-level sort orders elements first by a primary key, then a secondary key for equal primary keys, and so on). This arrangement has another performance bonus potentially much greater than the inlining effect mentioned so far in this thread. It allows one to tune which sorting algorithm is used based on the nature of the key, e.g. how long it is, integer or floating point or string, etc. For example, for single-byte integer keys it is hard to be faster than [counting sort](https://en.wikipedia.org/wiki/Counting_sort) which does two pretty linear sweeps through the input - the first to histogram byte values and the second to output the answer (with a keyspace-sized prefix sum in between to convert counts to output offsets). The simplest version of that does require an additional copy of the input array which may have some consequences on the proc signature/interface/specification.
Re: unescape \n \r etc in string
Looking at the implementation of strutils.unescape, it seems to only interpret the xHH syntax that escape outputs. Of course, the compiler proper is often changing those raw/quoted string forms into special characters. So, maybe there is some other approach/trick that could work..sort of like an "eval" in interpreted languages.
Re: Add Nim to The Computer Language Benchmarks Game?
So, building on the 40% or so of the benchmarks def did, I rounded out the other 60% and gave a PR to his repo. With more like hours than years of my time per program profiling/tweaking/optimizing, I got the Nim implementations to mostly parity with the top performers (Nim was in the top 3..6 impls on my machine for almost all tests). Of course, I did start most of those impls from "for years tweaked impls". Anyway, shortly after I finished that work, I also saw the annoying comment on Alioth about "No more languages! Do your own website", but I was lazy and did not. Kudos to you, bluenote for doing something. Looks pretty nice.
Re: writeFile with iterator
Garry: --opt:size (and things like gcc -Os) instructs compilers to choose trade-offs to make object code smaller rather than faster. E.g., on Oderwat's code, I get 0.19 seconds with just -d:release (with gcc-6.1 on Linux x86_64, i7-6700k @4.5GHz). If I additionally specify --opt:size the time rises to 0.28 seconds, almost 1.5X slower. The program text segment is 80KB in the fast case but 33KB in the slow case. So, all is behaving as expected/requested.
Re: how to skip all the `Hint: foo [Processing]` during compilation?
You can also put in [any of your nim.cfgs](https://nim-lang.org/docs/nimc.html#compiler-usage-configuration-files) hint[Processing]=off You can also disable other categories of diagnostics besides "Processing" similarly.
Re: writeFile with iterator
Yeah...I thought about \n higher up but didn't want to change the semantics too much. With that change my time goes down to 0.10 seconds (almost a full 3X faster than the original 0.28 --opt:size version on the same machine). About 55..60% of that 0.10 seconds is all just libc fwrite stuff. The remaining Nim probably can't be sped up much more..maybe 10% total runtime at most. file.write is just a very thin wrapper for stdC fwrite. You can see if you look at lib/system/sysio.nim (search for proc.*write.*File).
Re: Nim vs D
I get the same perfect line with the Nim fib(47..57) as well. { Well, slightly slower: log_1.618(wall) =~ N - 47.35. So, 1.618**(47.67-47.35) = 1.16 or around 16% slower. }
Re: OrderedTable is not an ordered table
@lightness1024 - fair criticism of C++ STL. modulo prime reduction of the hash code to a table address/array index is slow because division is slow. In my tests using modulo can sometimes **triple** the entire lookup time for simple integer keys due to the slowness of division! I recommend you consult Knuth 's Art Of Computer Programming Volume 3 (section 6.4 in my 1998 edition, but his treatment mostly dates back to the 60s). He motivates why prime may be a good choice and then immediately moves on to a multiplicative hash. If you really need to mix the bits a multiply & shift is much faster than modulo. While the upper bits of the half-product do have more entropy, masking seems about as good in my experience and the code is slightly simpler due to the close relation of the mask and table size. Knuth goes on at some length about choosing multipliers - about 10X as much length as the module prime. For some reason (perhaps an overzealous affection for number theory by CS theorists that write algo books or maybe just simplicity?) "modulo prime" got repeated more (everywhere - in classes/lectures/textbooks/code, etc.). modulo prime sadly "stuck" in programming culture needlessly. Python even has some array of primes less than powers of two. The pedagogical situation seems to be improving in the past decade or so. In the context of Nim's Table, for integers the absolute most trivial identity hash is used. If that is not good enough, a user should override system hash functions with their own better scrambled hash code producer. Believe it or not (I always recommend testing on your own real data..), the trivial hash is often plenty fast. While adequacy of randomization is data dependent, with linear probing and modern CPU cache prefetching even large collision clusters can be scanned quickly (especially compared to many "improved randomness" mitigations). Here are two examples of improved randomness failing. You could use a super expensive cryptographic hash function to get super random hash code bits, but the time you spend doing that would likely be much more than the extra time from scanning bigger clusters. Another example is double hashing which hops all over the table randomly in a very cache unfriendly way. The clusters are smaller, but the number of cache misses goes way up. To avoid dealing with hardware variation, people, including smart researchers, often use **machine-neutral surrogates** for actual time like "number of comparisons" or "cluster sizes". This can create all sorts of confusion and cross-talk when the surrogates are not very faithful to real machines. Also, this is slightly off-topic for Nim, but I looked at your package and you probably have a subtle bug. I don't see you counting "tombstone" or deleted markers - usually desirable to reorganize the table if there are too many. See [this thread](https://forum.nim-lang.org/t/829) for details. You also might like to read the [thread on hash functions](https://forum.nim-lang.org/t/1730).
Re: Looking for a set that sorts and deduplicates
`collections.heapqueue` does mostly what you want. You would have to remove duplicates yourself, like below, but probably with an iterator called something like `unique`. import collections.heapqueue var heap = newHeapQueue[int]() let data = [1, 3, 5, 1, 7, 9, 2, 4, 6, 6, 8, 0] for item in data: push(heap, item) var last = heap.pop() #NOTE: assumes heap.len >= 1 echo last while heap.len > 0: let next = heap.pop() if next != last: echo next last = next Note that this does actually store duplicates until the final iteration because heaps do not possess a fast { i.e. O(log-set-size) } test for membership. So, if you had a great many copies and/or memory constraints that could be an issue. You didn't mention how high the duplication rate would be, but I bet something like the above is fast enough. A more memory optimal structure for this is probably a simple B-Tree or some kind of balanced binary tree (less cache friendly than a B-Tree, but typically more available/implemented more often) or possibly an adaptation of `collections.critbits` or some other digital search tree. I think there is a red-black tree in Nim floating around.
Re: Reason for -fno-strict-aliasing?
This may be obvious, but has anyone else tried changing `-fno-strict-aliasing` to `-fstrict-aliasing -Wstrict-aliasing` in their `nim.cfg` and seeing if any gcc warnings arise? When I try this on Linux with gcc-6.3 and gcc-7.2 and devel nim I don't see any warnings at all. You may also need to turn off the `-w` general gcc warning suppression, too (maybe in the system level nim.cfg as well as $HOME one) and double check with `nim c --verbosity:2` that the compiler is getting invoked as you think it should. I tried a few different garbage collector impls and a variety of small Nim programs. It's hardly exhaustive and I'm not sure that this particular gcc warning has a 0% false-negative rate (does someone know?). We may be so close to this to be almost already there.
Re: Fastest way to count number of lines
It sounds like you will have many regular files (i.e., random access/seekable inputs as opposed to things like Unix pipes). On Linux with glibc, memfiles.open is probably the fastest approach which uses memchr internally to find line boundaries. E.g. (right from memfiles documentation), import memfiles for line in lines(memfiles.open("foo")): inc(i) Your mileage on this may vary from OS to OS or libc to libc. I have no idea which if any Microsoft/Windows versions have well-optimized libc memchr() implementations.
Re: Loop backward through array -- howto do it brachless
`for i in 1..12: echo -i and 3 ` Run
Re: [poll] Moving all RFCs in a separate repo
+1 for labels rather than alternate repos (or more labels such as "random idea" or "half-baked" or such). But, also +1 if Nim core devs want to aggressively close issues that are too incomplete to be a "real" RFC or are considered settled questions.
Re: `import foo {.private.}` to allows access to private fields (eg: package-level visibility)
+1 on some kind of invisibility/private-ness override feature. Another bonus of that is to use it as a temporary workaround if some 3rd party package has made something "overly" private as viewed by an "expert" user. While this has come up in the context of library writing, sometimes users of libraries understand many of its internals and sometimes they don't. Private, not exported, is probably the right default or at least too late to change, but it's easy to forget a `*` export marker and easy to disagree. This proposed mechanism would allow temporary workarounds much closer to a likely more sanitary future usage. It might be better to call it something scarier/less similar to a token like `private` which can be _so_ common in other prog.lang's as to lead to questions like "why does `private` mean this negative sense in Nim but positive sense in Foo?". `import foo {.overrideInvisibility.}` or `overridePrivate` or something. Just some kind of word to indicate bypassing of normal-ness (to act as a redundant cue over there being a pragma there at all). I agree that `{.breakTheGlassAndMakeEverythingVisible.}` is severe overkill. :-)
Re: the Fibonacci benchmark
I also got `nim c` 's code running twice as fast as `nim cpp` 's code. See some previous discussion: [https://forum.nim-lang.org/t/1779](https://forum.nim-lang.org/t/1779) \- there are cases where Nim's generated C code "inspires" the C compiler to "unroll the last few recursions" and save 4x or 8x or maybe even 16x depending the number of funcalls (depending on the unroll amount). It's not quite constant folding. disassembly and looking for `callq` can be helpful here. I think the result is correct by the more old school definition starting from "1 1 2 3 5" (with indexing starting at zero), not "0 1 1 2 3 5" as per [https://en.wikipedia.org/wiki/Fibonacci_number](https://en.wikipedia.org/wiki/Fibonacci_number) { or maybe the indexing is not what @miran expects, but he uses a ";-)" }. Anyway, there is sort of "more than one right answer" in a couple of ways depending on index origin or 0 inclusion.
Re: the Fibonacci benchmark
No `gcc` doesn't -- at least not for me with a factor of about 7X, almost identical to @adrianv 's results. I used both an earlier gcc-7 and a gcc-8.2 just now. If you don't want to disassemble your binary, that's ok. You can check this with `gcc -pg` and `gprof` which will count calls for you. I also gave a complete example in C that does the optimization I mention in the prior thread which I will reproduce here because the new forum code hides it behind "load more posts". With `fib(30)` I get 118793 calls. With `fib(40)` I get 14612525 calls. With even a low `fib(20)` I get 950 calls. Does 950 calls seem like compile-time calculation to you? With `fib(46)` I get 262211393 calls. #include float fibonacci(int x) { return x < 2 ? (float)x : fibonacci(x - 1) + fibonacci(x - 2); } void NimMainInner() { printf("%.0f\n", fibonacci(46)); } int main() { #ifdef FIB_SLOW NimMainInner(); #else void (*volatile inner)(void) = NimMainInner; inner(); #endif return 0; } Run If one simply calls `NimMainInner()` directly instead of through the volatile function pointer `inner()`, `fib(20)` generates 10942 calls instead of 950. I understand my example uses floating point, not integer arithmetic, but @adrianv 's numbers make it sure seem like the same situation. @adrianv can also compile with `-pg` and run `gprof` to check. Or he could adapt this C and compile it the same way his `nim.cfg` or whatever compiles his Nim generated C. I'm not quite sure what more evidence you would require. It isn't hard to calculate how many function calls doubly recursive Fibonacci requires.
Re: the Fibonacci benchmark
It might be a lot of things, but I'm telling you - there is a `gcc` optimization, very finicky about context (and probably compiler flags, too), that reduces function calls by almost exactly his speed-up factor. There's really no need to guess if I'm right, though. Just add `-pg` to the gcc flags (in both cases), run `gprof` on the output, and count funcalls in both cases to confirm.
Re: the Fibonacci benchmark
I am not quite sure what "original" benchmark you mean - perhaps the 4.6 s for Nim vs 6.0 s for C on the original github page at the top of this thread. That may be small scale code layout/caching/branch predictor effects as you say. However, the difference in this thread of 8X is _almost certainly not_ compile-time evaluation of fib(46), nor a coincidence, nor some generalizable Nim is effectively faster than C { though ;) is noted! :-) }. Compile-time evaluation would likely be many thousands of times or more faster than that - faster even than the memoized/linearized versions of the calculation which are already O(1000x) faster. Indeed, one gets "too many iterations" trying to `const x = fib(46)` from the Nim VM. The merely ~8X faster observed twice in this thread is something else which I have already explained in the prior thread and in this thread linking to that. And that explanation/approach does generalize a little, and even across programming languages. Certain iterative/recursive code that does very little work like simple adds or swaps { permutations, I'm looking at you :-) } can often be sped up by either manually or automatically unpacking/unrolling the last few recursions so that more work is done on a per-function call basis or said otherwise there are fewer function calls.
Re: setLen without 0-initialization (for efficiency)
+1 on the feature. I also sometimes re-use buffers that would benefit from this.
Re: [help needed] nim version of: COMPARING PYTHAGOREAN TRIPLES IN C++, D, AND RUST
Oh, I got your point and tried to emphasize that. I wasn't arguing against you anywhere that I know of. I totally agree with your a,b, I suspect we don't disagree on anything real at all, but are just discussing different aspects. I tried to express praise for your approach (if you happened to need to do this _particular_ thing) and gave a reference link. :-) Please note that nothing I say below should be taken as any specific rebuttal to something you said. My point was less about exactly zero vs non-zero abstraction cost (I even noted my own 5% variation) and more about the cost being very sensitive to compilation strategy, PGO as a particular example. I would say, looking at the generated C code, that this is not much about "Nim language iterator overhead aka 'cost'" and more just about "C compiler assembly code generation variability". That's an important distinction as slowness does not necessarily translate into "work for Nim core to improve anything". Backends vary and could change a bunch of stuff and be different in the next version of gcc/clang/whatever, for example. So, Nim kind of needs to target a "broad cross section" of what works well on the various backends, not overfit to one. (Not saying you said otherwise, just backing up importance of the distinction.) You mentioned compiling my code on Ryzen, but not using gcc's PGO. PGO is mostly just a very strong version of your (c) hints idea where the compiler generates extra code to measure what it thinks will help it make better code generation decisions. (Yeah, a person or some GA picking optim flags/luck/etc. could all still maybe do even better sometimes as I already alluded to, but at ever increasing developer costs.) To add a little more color quantitatively, I just did the same PGO experiment with the same version of gcc on a Ryzen 2950X clocked at 3.80 GHz and saw times go from 140 ms down to 85 ms..(best of 10 trials for both). That's not quite the same 2x speed-up as for the Intel case. Yeah, maybe Ryzen has better CPU predictors or predictor warm-ups or maybe it gets a bit luckier with "non-PGO" -O3 yadda-yadda defaults, etc. Whatever the causes, 1.65X is still a lot more than PGO boosts I usually see of 1.05x to 1.15x (such as with your algo, for example). { Of course, the boost is still not as good as a better algorithm. I never said so, nor meant to imply it. If a better algo is handy and speed matters then by all means use it! At least some of the time, a better algo will not be as easy to come by as PGO. } Anyway, I realize that it's a tangent from your original interest in this problem, but if you're as serious about (c) as a general guideline as you seem, and if you haven't already, then you should look into PGO sometime. My bet is that for this problem and for most backend compilers that a PGO Nim would be faster than non-PGO "pure C". Even though such a goal is explicitly _not_ your objective, I think it's still interesting. Nim is one of the very few languages I've tried that in "typical" practice is often "about as efficient as C" (with all the usual disclaimers as to algos, etc.) PGO can also be very little developer time extra cost. Once you get a little script set up, you just type `nim-pgo pythag './pythag > /dev/null'` for some `pythag.nim` file instead of `nim c -d:release pythag.nim`. If the program had less output (say the sum of the c in c^2=a^2+b^2) the shell IO re-direct wouldn't even be there. For this program, it took me almost no time to try PGO. And, of course, YMMV - that 2x on i7-6700k effect is among the biggest PGO boosts that I've ever seen. That seemed noteworthy. So, I noted to y'all. :-) In terms of comparing Ryzen timings, running in a VM should not make much difference on this problem as it is a pure user-space CPU hot loop except for the time queries..That's really a best case scenario for VMs. Of course, if your host OS is time-sharing the CPU between your VM instance and some other process then that could slow things down a lot - arbitrarily much, really if the host OS is overloaded. Might help to take the minimum time out of more trials. You should be able to get 100..200 ms of one core all to yourself at _some_ point. :-) My guess is that you could get that 210 ms down to under 100 ms (not that it really matters for anything practical in this exact case. A better algo is better, but a lot of people - not just in this forum - are, rightly or wrongly, discussing this problem in the context of "programming language performance" which is the only reason I bothered to write both of these posts. The clang/ldc/rust/etc. numbers were all more closely clustered than 1.65x of the Ryzen PGO experiment in Timothee's very first link (217 ms/153 ms = only 1.42x). TL;DR - A) For this problem, compiler back-end generation effects dominate Nim front end overheads (and possibly front-end overheads for other languages) for which the huge PGO boost
Re: [help needed] nim version of: COMPARING PYTHAGOREAN TRIPLES IN C++, D, AND RUST
Those are all fine points. Asm can sometimes make a bigger difference than people conditioned to not question compiler output expect (and there are many such people). This is especially with vector units. A few years back, I wrote an AVX2 vectorized "minimum" function that ran 24x (twenty four times) faster than a C one. That's a much bigger time ratio than most "typical use" comparisons of _many_ programming language pairs (though obviously most programs do more things than compute minima). Auto-vectorization in gcc (at least) has gotten better since for that precise problem, though it can still miss many opportunities. If you ever do need to write asm, those "intrinsics functions" like `_mm256_add_ps` (there are a hundred others) are often an easier entry point/way to integrate with your program than raw asm/`.s` files. You can use them from Nim, too. :-) See, e.g., [https://github.com/numforge/laser](https://github.com/numforge/laser) README/code/etc. for how to use SIMD intrinsics.
Re: [help needed] nim version of: COMPARING PYTHAGOREAN TRIPLES IN C++, D, AND RUST
Also, I guess a TL;DR part C) - I was never arguing against the tautology that a faster algorithm is faster. That is kind of a weird straw man position. No idea how quoting 600 microseconds for it left that impression, but maybe bears correcting the record. (I didn't assess correctness, though perhaps I should have. To me that whole algo area is besides the point of the prevailing discussion of an artificial benchmark. I never need Pythagorean triples, myself.)
Re: Advent of Code 2018 megathread
By the way, if you are trying to "right size" a set or table to streamline resizing costs, you can use rightSize(number of members to support). This is easier than trying to guess the power of two at which such a population is supported for both initSet and initTable. rightSize uses the same formula (its inverse, really) as the internal formula used to decide when doubling capacity is needed.
Re: Should we get rid of style insensitivity?
Does anyone out there routinely use this feature of diverging from the style of an `import` or as I mentioned just follow the lib's conventions? Part of @dom96's survey should perhaps ask if that aspect is just a "theoretical nicety". I mean, someone cared enough about project-internal style consistency to write NEP1 and someone also cared enough to add that `nim c --nep1` system. I think if most/all example code/standard and popular libraries use one style that most people will copy it. The reasons they would _not_ imitate it are more likely to be either A) technical like wrapping sensitive libs or B) cultural like some corporate style guide legislation (of which idents are just one dimension with spacing, comments, function size, etc. being many others) or C) strong personal preferences. In those cases, they seem more likely to just reject Nim outright, but that's just my hunch. C) could cut either way. So, call it 5 out of 6 cases of rejection for 1 case of ecstatic acceptance. A 6x larger community would make a world of difference... So, if it's the primary motivation is just that Nim core personally does not want to deal with other identifier styles, I doubt there is much value realized by style insensitivity. In a fully sensitive world they might have to deal with it 5-10% of the time. Yeah, more than zero. It's hard to know for sure and definitely late in the game, of course. It's certainly unique. Maybe it's the killer feature! If it were _easy_ to predict popularity then the world would be a _very_ different place.
Re: Should we get rid of style insensitivity?
I think a vote would be fine. Ballot box stuffing is possible, but probably identifiable via sheer numbers since the Nim community is so small unless the vote gets HackerNews'd or Slashdotted. I also agree that the people who matter most would never participate in such a vote because they've already dismissed Nim for being "too creative" on the identifier front. Wording matters a lot, too, as in any survey. I'd wager that there would be almost zero "yes" answers to a "would you stop using Nim if it became fully sensitive?" question. Also, a "do you know at least one developer who never tried Nim because one of the first things they heard about was its 'weird' ident rules?" would probably get an average of >50% yes's. Developers in very niche internet languages are not always the most social creatures or my estimate would be higher. Personally, I like identifier _sensitivity_. I think things that look different to a person should be treated as different by the system, especially a system used by programmers who know one wrong character somewhere breaks a program. Ease of use studies about filesystems by non-programmers are at best weakly relevant - and anyway often there is a wrinkle of "creation time casing" and a "reference time casing". Studies in text also show using ALL CAPS ALL THE TIME makes characters most easy to distinguish, but programmers tend to not want to do that. In physics and math (where the underlying syntax of adjacency implies multiplication) single character variable names like F=ma and subscripting prevail. The handwritten/specialty script nature of those fields then pushes users to use _more fonts_ and italics and bold face and so on, and you often see textbook authors declare their style conventions. In general, more terse languages without a lot of syntactic noise (like Nim) benefit more from shorter identifiers which in turn benefit more from sensitivity. Speaking in voice about "cap A" is as easy as "A zero". Giving more picky people flexibility to define their own conventions seems good to me, and to me that means sensitivity not the current rules. Less picky or more overall consistency-oriented people will just go with the flow and imitate Nim standard library conventions. Many will always go for camelCase since they know they'll have a lot of stdlib calls and the stdlib does that and so they'll want their own code to match that "larger world convention". C programmers imitating C stdlib and Java the Java APIs and C++ or POSIX programmers the POSIX apis and so on drive all this enormously. So, the core Nim community can probably get most of what they want just by controlling the style used by the stdlib. Also, one has to pick which battles one fights! _Injudicious_ choice of characters will _always_ be a problem. In typical fonts for Latin script languages the upper- and lowercase versions of a letter "visually collide" for about 50% of the 26 letters. For another 50% there is a lot of visual ambiguity, like the similarity of O/0, 1/l { big-Oh vs zero and one vs lowercase-ell }. Colon and semi-colon, commas and periods are often really hard to see much difference in but make the world of difference. To my knowledge, no one suggests in any seriousness that Nim should forbid big-Oh or little-ell from identifiers or that writing Nim should require a certain font because otherwise it's hard to tell what means what. Sometimes you have to just rely on users of any language to not choose to write deliberately confusing things or use programming-hostile fonts. However, it is perfectly fine ( _fantastic_ , even!) if the compiler has one or several warning systems (that can be easily turned off by a flag) for _ALL SORTS_ of confusing scenarios that "encourage" clarity/simplicity. People "in the thick of it" can have the warning be active. People more in their own closed world or with habits/inclinations/fonts that make certain mistakes less likely can turn it off. Problem solved. Besides naming convention issues, uneven spacing around operators is a good example. There's a warning for that in place now. Another example without a current warning in place would be spaces of leading indentation -- 1 spaces vs 2 vs 3 vs 4 vs 8 -- all currently treated the same by Nim. A shift from indents of 1 to 8 is probably some kind of bug or at least a highly erratic formatting style. The compiler warning could even take some limit where a change of indent shift by <=N was ok and warn only for changes bigger than that.
Re: Should we get rid of style insensitivity?
Well said, @arnetheduck. Honestly, for the true master, you could do some pluggable system that allowed macro/compile-time proc-driven identifier transformation with the current rules as a simple activation maybe as a pragma on `import`. Then people could override that current "compromise" batch of ident rules with something even more idiosyncratic (e.g., that handled ALL_CAPS or maybe CAPPREFIX_Foo identifiers differently, as needed/wanted). You might need to declare a convention inside defining modules, too, though, if you depart from the default. Anyway, instead of quashing admittedly often time-waste-y arguments about (just some dimensions) of style with a totally unique set of ident rules (more than just "case insensitive" as @dom96 points out), give devs tools to manage disagreements.
Re: [help needed] nim version of: COMPARING PYTHAGOREAN TRIPLES IN C++, D, AND RUST
Every language has nested loops. My view is that the original article about C++20 ranges conceived this test/benchmark to be about the cost, if any, of abstractions, not exactly the performance of "nested loops however you write it" as suggested by Timothee's code or "the fastest algorithm for generating Pythagorean triples" (see [https://en.wikipedia.org/wiki/Pythagorean_triple](https://en.wikipedia.org/wiki/Pythagorean_triple) for an explanation of moerm's much better algorithm - you should definitely use something like moerm's algo _if you actually had a need for many triples_ for some reason). A lot of what is creating variation in results here, and why I am posting this reply, is that this particular case is more sensitive than most to compiler optimizations and assumptions/code layout, etc. With the below program, I got a full 2.02x speed-up using gcc's profile-guided optimization (145.0 ms to 71.8 ms; both best of 10 trials on a i7-6700K at 4.85 GHz on Linux, gcc-8.2.0 with Gentoo patches rev6). Usually, I only get 1.10x to 1.15x speed-ups. Thus, this case is more sensitive to optimizations. iterator toInfinity(n=1): int = var z = n while true: yield z inc(z) iterator triples(n=1000): array[3, int] = var i = 0 block all: for z in toInfinity(1): for x in 1 .. z-1: for y in x+1 .. z-1: if x*x + y*y == z*z: yield [x, y, z] inc(i) if i == n: break all import times proc main() = let t0 = now() for triple in triples(): echo triple echo now() - t0 main() Run When Timothee's current code is instrumented with `times.now()` and not compiled with PGO it is 1.5x faster than the above program (96 ms). When both programs are similarly compiled with PGO the performance basically matched to almost within measurement error (68.13 ms best of 10 trials, but noise/range over those trials was over 2 ms). (All I did was fiddle with the loop ranges.) In other words, Nim's iterators seem to be a zero cost abstraction, as advertised, at least when compiled with PGO on gcc in this use case. This also makes sense if you look at the C code Nim generates -- just 3 nested loops anyway. Maybe one could pick nits that 71.8 vs 68.13 is not quite zero. Fair enough, but 5% is much less than the 2x variation sensitivity from PGO. Heck, given the giant 2x sensitivity, one of those "genetic algorithm to optimize the optimization flags" approaches could probably do even better than PGO and have the two Nim versions jockey back & forth for fastest. Looking at the generated C code, that seems likely to me. The point of this post is that in this case, compilation strategy/optimization options matter more than people might expect (though several commenters did sort of seemed to know that already). For the curious, moerm's better algorithm with PGO runs in just 600 _microseconds_ on the same machine, and 680 microseconds without PGO, a more typical 1.13x factor, and much smaller than just using a better algo. Personally, I think that it might be a little nicer to be able to `return` to exit iterators rather than using the `block all: ... break all` construct. I could see bug-finding arguments (iterators != proc/func) for not allowing that, though. Anyway, I think Nim mostly wins (or at least ties) in the "elegance contest" here. Also, for anyone trying to reproduce these effects/do profile-guided-optimization with Nim in general, the way I drive my PGO compilation is with a script that boils down to: nim c -c -d:r PROG.nim $cc -fprofile-generate $nimcache/r/PROG/*.c -o PROG ./PROG $cc -fprofile-use $nimcache/r/PROG/*.c -o PROG Run You may also need some `-O3 -fwhole-program, -flto, -march=native` in that `$cc` variable, and definitely need at least `-fno-strict-aliasing`.
Re: beat c in fefes "competition"
Just a quick follow-up, with gcc-8.3 on Linux x86_64 Skylake CPU, profile-guided optimization did get that 176ms time down to 150 ms (With -d:useNimHash it was 152 ms), a 1.17x boost to 1.43x faster than the C in @enthus1ast 's `wp.c`. Of course, with 291,000 unique keys the average external chain length in the C is going to be 291./65.5 =~ 4.4 which is, um, not great. So, maybe this is only faster than a strawman C impl. YMMV. If you really want that particular C impl to be slow, just run it with 1e6 unique words. ;-) (But then you really better not use Nim's current `CountTable.sort`!)
Re: beat c in fefes "competition"
As @Stefan mentioned the string stuff can be slow. My `MemSlice` techniques might be more broadly interesting. The tokenizer below only works for "strict" one-character delimiting, not, e.g., repeated whitespace. Also, I agree completely with @arnetheduck that algorithm choices matter more than languages for Nim vs C comparisons. That said, the below Nim is faster than the C version on my machine (215 ms for C, 176 ms for Nim). @enthus1ast can try that on his data and report back if he likes. There are several little `defined` switches to play with. FWIW, the C version referenced is a rather lame one (fixed 64 KiEntry hash table with external chaining resolution). I didn't try profile-guided-optimization which could improve the Nim results some. Since the C version "cheats" a bit by pre-sizing its hash table, my version below takes any estimate on the number of unique words that will be found as input and pre-sizes the Nim table to fit that estimate. Also, **beware** ` CountTable.sort()`. It uses Shell's sort, not a fast algorithm. They're always integers and typically small. A radix sort would be the best choice, but regular old merge sort from `algorithm` is a lot better. On my test input with 291,000 unique words, the version using `count.sort` took 120 seconds instead of 176 milliseconds, almost 3 full orders of magnitude slower. Times seemed insensitive to hash functions (order 1% variations), though both the sorting and hash-sensitivity will be data set dependent, obviously. For very high entropy distributions you are going to want to avoid the current `CountTable.sort()`. Also, `CountTable` itself was a little slower than regular `Table` using the same sorting procedure. So, you might want to just avoid `CountTable` in general in peformance-sensitive cases. import memfiles, hashes, tables, os, algorithm, strutils proc hash*(ms: MemSlice): Hash {.inline.} = when defined(useNimHash): result = hashData(ms.data, ms.size) elif defined(useCB2): proc c_memhash(h0: int, a: pointer, size: int): int {. nodecl, importc: "memhash_cb2", header: getHomeDir() & "s/cb/hash_cb2.c".} result = c_memhash(0, ms.data, ms.size) elif defined(useCB4): proc c_memhash(h0: int, a: pointer, size: int): int {. nodecl, importc: "memhash_cb4", header: getHomeDir() & "s/cb/hash_cb4.c".} result = c_memhash(0, ms.data, ms.size) else: result = 0 for i in 0 ..< ms.size: result = (result + (result shl 5)) xor int(cast[cstring](ms.data)[i]) proc split*(ms: MemSlice, fields: var seq[MemSlice], nA: int, sep=' ', size = -1): int = proc c_memchr(s: cstring, c: char, n: csize): cstring {. importc: "memchr", header: "".} proc `+!`(p: cstring, i: int): cstring {.inline.} = cast[cstring](cast[int](p) +% i) proc `-!`(p, q: cstring): int {.inline.} = cast[int](p) -% cast[int](q) var b = cast[cstring](ms.data) var eob = cast[cstring](ms.data) +! (if size != -1: size else: ms.size) var e = c_memchr(b, sep, eob -! b) while e != nil: fields[result].data = cast[pointer](b) fields[result].size = e -! b result += 1 if result == nA - 2: #max ix nA-1; Need one more slot for final field break b = e +! 1 e = c_memchr(b, sep, eob -! b) fields[result].data = cast[pointer](b) fields[result].size = eob -! b result += 1 proc main(input: string, sizeGuess: int) = const mxCols = 1024 #file format-dependent max when defined(useCountTab): var count = initCountTable[MemSlice](rightSize(sizeGuess)) else: var count = initTable[MemSlice, int](rightSize(sizeGuess)) var fields = newSeq[MemSlice](mxCols) for line in memSlices(memfiles.open(input)): fields.setLen(split(line, fields, mxCols, ' ')) for word in fields: when defined(useCountTab): count.inc(word) else: inc(count.mgetOrPut(word, 0)) stderr.write count.len, " unique words\n" when defined(useCountTab) and defined(useCountSort): count.sort() #uses Shell's sort internally for key, cnt in count: stdout.write cnt, " " discard stdout.writeBuffer(cast[cstring](key.data), key.size) stdout.write "\n" else: var answer = newSeq[tuple[cnt: int, key: MemSlice]](count.len) var i = 0 for key, cnt in count.pairs(): answer[i].cnt = cnt answer[i].key = key inc(i) proc myCmp(x, y: tuple[cnt: int, key: MemSlice]): int = - cmp(x.cnt, y.cnt) answer.sort(myCmp) for ans in answer: stdout.write ans.cnt, " " discard stdout.writeBuffer(cast[cstring](ans.key.data), ans.key.size) stdout.write "\n" #Must call with
Re: beat c in fefes "competition"
I wrote: "most lookups are failed lookups", but Oops! 291e3 unique/4.8e6 total = 6% ==> 94% of lookups are successful. So, the `memcmp` that happens only after hash codes match does happen almost all the time, and so my particular data set is more like 16B/(32B+9B) = 39% cached, not 50%. This doesn't really alter the rest of my analysis, though since 39% and 50% are pretty close. A corollary of this is that (for that style of data set), Robin Hood hashing w/linear probing would not help. RH can help for failed lookups because probe sequences for missing items become about 50% as long (once the search hash code is > table hash code you know it's missing and where to put the new key). It costs a little work { O(1.15x) } to keep entries along probe sequences sorted by hash code. So, RH only pays off if you care about worst case times or have a workload dominated by failed lookups over long probe sequences. At 56% table load (291./512) the sequences aren't long unless the hash function is weak relative to the data (not the case for me unless all 4 functions were equally weak). Anyway, RH would be about the only algorithmic speed-up I think could help in the poorly cached large table case, but only for very low hit rate data. It would not be crazy to add a `defined(useRobinHood)` in the Nim hash table impl (or even have a per-instance run-time flag) since whether RH helps or hurts is so dependent upon workload.
Re: beat c in fefes "competition"
Oh, and two more stats about my input data important to reason about my timings - 43 MB and 4.8e6 total words total (coincidentally close to 1e-3 times my 1/4.8GHz clock cycle). So, average string length around 43e6/4.8e6 =~ 9 bytes (also a web server log), and about 150e-3/4.8e6 =~ 31 nsec =~ 150 CPU clock cycles per word. A little more fiddling (commenting out various sections) shows that the 150 cycles per word breaks down to: 22% parsing (no hash table or output, just the MemSlice stuff) 55% hash table counting 23% output (sorting + binary->ascii on the numbers). Run So, 150 cycles per input word might sound like a lot, but I personally doubt it could be improved much. 43MB does not fit in my 8MB L3 cache, but 2**ceil(lg(291e3*9B)) =~ 512kEntry*16 just barely does assuming about 16B per entry. Really, 41B per hash slot is more realistic (8B for the hash code, maybe 16B more for MemSlice, 8B for the count and about 9B for the string data). The string data is indirect, and most lookups are failed lookups, so 32B is probably the appropriate number. So, that 55% of 150 = 82 ms / 4.8e6 is =~ 17 ns ~ 80 cycles. The hash table is probably only about 50% L3 resident (16B/32B) - not bad, but not great. Quoted latency on Skylake L3 is supposedly about 40 cycles while my DIMMs are probably about 100 cycles (20ns). So, the (AFAIK unavoidable 4.8e6 lookup latencies) is some kind of average of 40 and 100 cycles. 80 cycles is not implausible. .5*100 + .5*40 = 70 cycles after all (and 100 may be..optimistic for my DIMMs). The table is only 56% full (291/512). So, probe sequences should be short and fit in one or two L3 cache lines, the 2nd of which is often preloaded by the hardware in a linear search. Of course, the table accesses are in dynamic competition with the streaming read from the parser because CPU designers sadly don't let user code control cache policies. So, _maybe_ some hand-rolled assembly could shave off 20 cycles in the hash phase and maybe 15..20 cycles in each of the other two phases for maybe 60 cycles better than the Nim's 150. 150/90 is another 1.6x of potential improvement. It's probably a ton of work to get that. Also, I haven't done it. So 60 cycles is just some nominal upper bound guess. 150 cycles might also be just about as good as can be done. If the table cannot be kept 50% resident in L3, but more like 10% L3 resident then most of that "assumed 60 cycles" improvement would be lost just to the DIMM vs L3 latency. Meanwhile, just packing those hash entries into 16B from 32B (using e.g. a 32-bit hash code and counter and inline string data) could potentially make almost as much a difference in pure Nim by keeping it 100% L3 resident and maybe getting down to 40 cycles from 80. There is, of course, a hazard of focusing too much on the dimensions of this one input data set here which is unlikely to be very informative. The "usual" informative way to benchmark hash tables is to do (at least) a "fits in L1" case to check code generation effects without a lot of memory system interference and then also a "doesn't fit in Lany" case to check latency/probe sequence effects. A more extreme variant of the latter would be a cold-cache spinning rust Winchester disk that doesn't fit in RAM, but such are usually painfully slow. Everything in between those two extremes gets a funky average of inter-cache-level behaviors that is harder to reason about. I figured I'd parse a web log like @enthus1ast. Generated synthetic data would be more reproducible, although also **not** exercise the hash function well. Sorry for the verbosity, but it's tricky to benchmark this stuff in a way that actually generalizes, and I've seen a lot of not so informative benchmarks get a lot of readership.
Re: Running Selector on device file in linux. How to get file descriptor (int fd) from file object?
You probably want `proc getFileHandle*(f: File): FileHandle` from the system module (no need to import). That just calls through to the C fileno().
Re: Running Selector on device file in linux. How to get file descriptor (int fd) from file object?
There is also `proc open*(a1: cstring, a2: cint)` in the `posix` module: import posix let fd = open("/dev/mydevice", O_RDONLY) Run if you want an unbuffered raw file handle which sounds like it might be the case (and you don't need portability which is implied by your Linux device file use-case context).
Re: Running Selector on device file in linux. How to get file descriptor (int fd) from file object?
0.19.0 is very old in "Nim years" { like "dog years", but even more compressed ;-) }. 0.19.2 works better, but personally I would recommend `git clone https://github.com/Araq/Nim` and learning how to build that/run right out of the build. It's not too hard to follow the instructions. I think `sh ci/build.sh` almost totally automates the process.
Re: Running Selector on device file in linux. How to get file descriptor (int fd) from file object?
Well, to make it work without the rest of your type environment, I changed just the first couple lines of your posted code to: import posix, selectors proc readDevice(dev: string) = let devfd = posix.open(dev, posix.O_RDWR) Run Then it compiled fine for me. I usually run a pretty fresh devel version of Nim. You may be on some earlier Nim. It's probably more useful long-term for you to learn to install a devel version of Nim than for me to test on back-dated versions.
Re: Running Selector on device file in linux. How to get file descriptor (int fd) from file object?
You're welcome, and I'm very glad you didn't give up so quickly! FWIW, I was not sure it was a problem with that version of Nim -- only that when I changed the type environment enough to compile your code that it worked for me on a devel version. In the future, providing more whole code context (even if only a link to a github repo) may help you get more precise guidance. Having confidence getting a fresh devel build going will pay off someday, though, esp. if your OS package manager has one as old as 0.19.0. Even just more people submitting pull requests for stdlib documentation/API tweaks can help, and PRs usually make more sense to do against devel HEAD.
Re: Creating C array
Nim defines a proc allocCStringArray*(a: openArray[string]): cstringArray Run in the `system` module (no need to explicitly `import` it). There is also a corresponding: proc deallocCStringArray*(a: cstringArray) Run You probably want to use those two calls.
Re: Creating C array
You're welcome. Nim core devs are **_very_** willing to work with any and all comers on pull requests to improve documentation. Ignorance by otherwise reasonably patient and resourceful newcomers is **_not_** easy to simulate (in any project, really). In some ways it's a resource that decays with value as people learn their way around. I agree that foreign function interface helper procs like those mentioned probably deserve a section somewhere, probably linked to from the area you mention. I mean, there is the `system.nim` module documentation, but it's kind of its own problem area.
Re: Noob question: proper way to read binary files byte by byte
A short reply like this may be inadequate to explain virtual memory mechanisms if you have never heard of them before. That said, if you have heard in the past and forgotten this may help. The `newMemMapFileStream` will call `memfiles.open` with default flags. Default flags typically just lookup the size of the file and create an address range in your process that -- when page faulted by the virtual memory hardware -- will (transparently to your process inside the OS kernels "page fault handler") cause loading/population of 4k (or possibly larger) "pages" of memory, on-demand with file contents for the corresponding spot. This is all fairly portable behavior. So, in light of that, at the beginning of your loop, nothing will be "loaded". By the end of the loop, as much will be loaded as can fit in the RAM of your machine. The actual fact of the matter of "being loaded" depends upon the sometimes highly dynamic competition for physical memory among all the programs on a system. This is generally also true of any buffered IO mechanism when "swap files" or "page files" or partitions are enabled. Each page will be loaded for a little while or your program cannot make progress, but by the time you get to the end of your loop if the file is gigantic/larger than RAM or if some other process is demanding a lot of RAM then the beginning may no longer be "resident" in RAM. Certain operating systems allow you to "tune" the on-demand loading behavior with "flag" arguments to the API that sets up this "auto-loading" mechanism. For example, Linux allows you to specify `MAP_POPULATE` which will, in effect, pre-load the whole file into RAM **before** your program loop/without your program making the CPU dereference any of those file data addresses. You may want to do this for example if the persistent backing store is a magnetic spinning disk, the file is small and yo want to avoid "seeking" the disk head around. Similarly, on Unix, there are also the `madvise/posix_madvise` interfaces which lets a program advise the OS that memory accesses are likely to be sequential (your case) or random, or even specify certain ranges as candidates for preloading. These little tweaks tend to be very non-portable, though, and the default behavior probably does what you want. If it does not do what you want, `MemMapFileStream` does not (presently) support adding "flags" to the OS mapping calls. I did recently improve the `memfiles.open` interface to allow just that. You might like the non-stream API better anyway. You can `cast[ptr UncheckedArray[char]]` the `MemFile.pointer` and just use the file as an array of bytes if you like. You do have to be careful not to overrun the end of the file. And another recent addition I got in was to allow `toOpenArray(cast[ptr UncheckedArray[char]](ThePointer), 0, TheFileSize-1)` style passing of such arguments to Nim procs expecting OpenArray[char] parameters.
Re: Can I access arrays faster than this?
Either automatic or manual vectorization can also allow twice as many `float32` numbers to be handled per vector instruction vs `float64` on x86 just like the ARM or GPU cases. You may need `-march=native` or `-mavx` compiler flags (or manual intrinsics/assembly) to activate that feature, though, instead of targeting some lowest common denominator x86 cpu and C compiler autovectorization can be finicky. It is true that for many calculations things are memory bandwidth bound where you still get 2x improvement. However, many are not membw bound or may be in fast caches. For those the 2x wider vectors help. (Funny - caches used to be almost entirely about latency but have become about both latency & bandwidth in recent times). Obviously, the wrong answer faster is not helpful, but it often is close to 2x faster, depending on how vectorizable what you're doing is, compiler, and compiler flags (and/or manual assembly). Excess precision is also not helpful, if the cost is not minimal.
Re: Noob question: proper way to read binary files byte by byte
You can also use `memfiles`. There writing/reading is the same as accessing memory. Besides being possibly simpler presenting an "as if you already read the whole file into a buffer" view, it may also be much more efficient, especially for byte-at-a-time operation where other APIs might do a lot of behind the scenes work on a per-IO basis. Of course, to be usable as a `MemFile`, the data needs to be random access (e.g. on the disk as opposed to a network socket or pipe or some other unseekable input).
Re: Can I access arrays faster than this?
If you can't break your habit, you can always add a few lines near the top (before all the `release`-dependent switching) of your `nim.cfg`: @if r: #Allow short alias -d:r to activate fast release mode define:release @end Run Perhaps somewhere you picked up a `$HOME/.config/nim.cfg` that does exactly this, and then lost it somehow moving between accounts/machines or maybe `nim.cfg` to `.nims`? There's surely also some similar Nim Script/`.nims` variant
Re: Can I access arrays faster than this?
@mratsim is probably intending to refer to the `..` including `50` in Nim while the C `for` with a `<` excludes `50` doing about 2% less work, but the terseness and style of of his comment may leave the wrong impression. for i in 0..50: echo i Run indeed compiles down (in release mode) to simply: NI res = ((NI) 0); while (1) { if (!(res <= ((NI) 50))) goto LA3; res += ((NI) 1); } LA3: ; Run plus some stuff only related to my choice of `echo` for the body (which I removed for clarity). Any decent optimizing C compiler should treat those two ways to spell the loop (@mratsim's `for` and the above `while`) the same. TL;DR the extra time is from the bounds of the iteration, not the language or iterator overhead.
Re: Can I access arrays faster than this?
With such a brief comment, it's hard to know which is why I said "probably intending". Only one person knows. ;) Maybe he did think iterators cost more. You are right I did misread his 1..50 as 0..50 {after looking at the first version of the ggibson Nim code, not the 2nd where he confusingly switched to 1..50 not paralleling the C as well, but correcting his amount-of-work mismatch}.
Re: "Nim needs better documentation" - share your thoughts
@akavel \- you may (or may not) have noticed the "Group By Type" drop-down menu on the doc page for any given module (e.g. clicking on `tables` from `theindex`). That's sort of like what you're asking for. The big `theindex.html` doesn't have that Group By feature, but perhaps should. The option there should maybe mean a multi-level sort -- first by the name of the defining module and then the way a given module would sort when in Group By Type mode. (Not advocating dropping the current order, just adding an option for that new one, too). You might want to create a github issue for that (if there isn't one already). Also, I _do_ see type signatures in [https://nim-lang.org/docs/theindex.html](https://nim-lang.org/docs/theindex.html) , e.g. `system: add(x: var string; y: char)` . So, not sure what you mean there.
Re: "Nim needs better documentation" - share your thoughts
@akavel \- you're welcome. @both \- I'd have to agree that in a language like Nim which relies heavily on type-based overloads that long lists of the same ident in that side-bar/ToC section is not helpful. Right now each is a hyperlink to the specific one. So, collapsing would remove that choice, while including the signature would probably make them long and ugly. Not sure what the best choice to fix all that might be. To the extent many use/advocate using just the big `theindex.html` to find things in the stdlib, adding GroupBy there might help. The core maintainers are very accepting of documentation upgrades. Someone passionate about the output/formatting should work on that or at least file a Github issue.
Re: Fuzzy matching on table keys
What mashigan wrote may well be best suited to your exact purpose. Another possibility at least worth a shout out is `collections/critbits` which implements an efficient `keysWithPrefix`. For more fuzzy still than prefix matching, you could also do some kind of efficient edit distance-based thing, but data-structures to (attempt to) optimize that are complicated. (There is `std/editdistance`, though).
Re: Speeding up Python with Nim
If you want to _actually_ calculate Fibonacci numbers not use it as a (poor) benchmark for funcalls/recursive overheads or for other purposes, then you may also find this interesting: [https://sahandsaba.com/five-ways-to-calculate-fibonacci-numbers-with-python-code.html](https://sahandsaba.com/five-ways-to-calculate-fibonacci-numbers-with-python-code.html) In terms of the closed form formula, it would be nice if someone wrapped `mpfr` or similar someday for a `bigfloats`.
Re: len [0, 1, 2] fails
Or `[1,2,3].len` which is curiously not among the various listed choices for whatever reason, and which is even more visually distinct from array indexing, IMO.
Re: What is the purpose of tables.allValues()?
A documentation PR seems reasonable. Go for it. If you need to undo multiple add calls to `Table` then you should be able loop until you don't find the key anymore, but yeah, you don't know what order values for the key will arrive in that loop. There's no secondary order discipline like FIFO or anything. If you need that you should instead use a `Table[key, seq[valType]]` or `OrderedTable`.
Re: Macro to create a dictionary (table) like in python!
I would have to agree with `@`. +1 vote.
Re: Macro to create a dictionary (table) like in python!
These association lists, often abbreviated "alist", get a lot of play in lisp, but also OCaml, Haskell, etc.. [https://en.wikipedia.org/wiki/Association_list](https://en.wikipedia.org/wiki/Association_list)
Re: What is the purpose of tables.allValues()?
While it is a little unusual to see it, open addressed tables can support multiple keys (as a set or associatively as a table). I believe most other collision resolution schemes can, as well. So, a Nim `Table` is what some people call a "multi-set". Whether you have unique keys depends upon the history of an instance. When I was sprucing up the impls, I mentioned to Araq this might confuse some, but he said he uses this feature in the compiler proper. I think his was the right judgement call..Better to just inform about more than usual functionality when it comes up. To my memory, this is the first time it's come up 4 years. ;-)
Re: What is the purpose of tables.allValues()?
The relevant discussion: [https://github.com/nim-lang/Nim/pull/2078#issuecomment-73744634](https://github.com/nim-lang/Nim/pull/2078#issuecomment-73744634)
Re: Nim vs. Python & Groovy (string splitting): Why is string splitting so slow in Nim?
On slightly closer inspection of that spec, it seems that backslash quoting only happens in the `##` comment sections ignored above. So, maybe they aren't buggy after all if that data is not important to the calculation in question. A further point along the lines of "if we're going to provide `split`, maybe we should provide faster variants" is that the times when it is going to matter most will be for huge inputs where the columns may well have a more regular nature like numbers and specifically forbid more complex lexical structure. There it might A) be correct, B) performance might matter a lot in human terms, and C) the programmer might mostly be non-sophisticated with regard to even terminology like "lexing" or "vectorized memchr". This all seems to be the case with this VCF thing, but I feel like it's come up quite a few times over the years. It may not **always** be "bugs running faster". :-) Anyway, I don't think "to force people to learn new terminology/techniques" is a very welcoming answer. So, I tried to provide something more welcoming. Even if their parsing is sloppy & error prone, I think naive programmers facing consequent errors on their own data sets rather than complaining about Nim library performance is better optics for Nim. All that said, I think we agree 100% that we probably need more information from @markebbert to help him any more with his actual problem. Maybe it is IO. Maybe he didn't even compile with `-d:danger`. If he's on Linux, I would suggest him decompressing first and trying my `mmap` versions. 90 GB/(100 MB/s) =~ 900 seconds =~ 15 minutes. Heck, some people even have 90GB of RAM. :-)
Re: Nim vs. Python & Groovy (string splitting): Why is string splitting so slow in Nim?
I agree with @siloamx. I would also point out that for many less compiler/parsing-sophisticated programmers, splitting is conceptually simple enough to make all the difference. Such programmers may never even have heard of "lexing". This is just to re-express @Araq's point about it being "naive" in maybe a slightly more accomodating way. Probably, they should learn, but maybe they want to focus on some simple problem. Formats aren't always in a strict/pure table format amenable to `parsecsv`. I have some routines in `cligen/mslice` ([https://github.com/c-blake/cligen](https://github.com/c-blake/cligen)) that may help for this. Some example usage is in `examples/cols.nim`. In the `mmap` mode of `cols`, I get ~50% the run-time of GNU `awk` for field splitting, though I have admittedly never tried with 15,000 columns. One final, related point is that `gunzip` can be very slow at decompressing and is single-threaded. If the average column width is ~20 Bytes, 300k*15k*20 =~ 90 GB. Working with the uncompressed file directly or using something that can decompress in parallel like `Zstd` ([https://en.wikipedia.org/wiki/Zstandard](https://en.wikipedia.org/wiki/Zstandard)) may help **a lot** , especially if you have 4-12 idle cores as many do these days. On Linux, you should be able to just let f = popen("pzstd -cdqp8 < myfile.zs".cstring, "r".cstring) Run (You may have to, just once, `zcat file.gz | pzstd -f -p8 -19 > file.zs` and, of course, this will not help if you have a pile of `.gz` files only to be processed once.)
Re: Nim vs. Python & Groovy (string splitting): Why is string splitting so slow in Nim?
In fact, that VCF format does use `\` escapes ([https://samtools.github.io/hts-specs/VCFv4.3.pdf)](https://samtools.github.io/hts-specs/VCFv4.3.pdf\)). So, basically all the above code examples are indeed wrong, and @Araq's advice is the right general advice (I did say they should probably learn). That said, I also still think there are many situations both quick & dirty and otherwise where simple splitting makes sense and is not wrong. Why else provide it in `strutils`?
Re: Compile C file together with Nim one
Yet another approach is to have the `nim c`-generated C code just `#include "foo.c"` via `{.importc: "func", header: "foo.c".}`.
Re: What do you think about the programming language NIM?
I do not agree that lazy linear lists are the "main example" of recursion. They may be the _simplest_ example, thusly allow several easy workarounds. I mentioned "trees" right in my comment. Most of the (admittedly subjective) elegance of tree algorithms comes from the way recursive code structure mirrors recursive data structure. `lib/pure/collections/critbits.nim:iterator leaves` shows some ugliness required to workaround there, for example. I agree it may not be a high priority or even in the top 10. I do not think it would be considered a "no real need" extra/extraneous fluff/"feature bloat". I know you didn't say that, exactly, but I felt your post risked leaving that impression.
Re: What do you think about the programming language NIM?
There are always workarounds since CPUs are little state machines. That should be news to no one. Recursion working makes code look much nicer, IMO. For what it's worth, the example considered among the more compelling by the Python guys back in the Python 2.3 days when they added generators was in-order binary search tree traversal: iterator inorder[T](t: Tree[T]): T = if t: for x in inorder(t.left): yield x yield t.label for x in inorder(t.right): yield x Run which parallels the usual recursive call structure quite well. I think it would be great if the above Nim actually worked. Equivalent state machines are often ugly, trickier to get right, and a pain point (and yeah, usually faster). But "expressive" has always been a main goal of Nim. :-)