Re: Nim version of

2017-03-28 Thread cblake
@Stefan_Salewski, you can also just call the libc memchr (which is what the 
current memfiles does to delimit lines aka slices until string conversion). A 
good memchr will do the SSE/AVX internally. For example this program: 


import memfiles, os, times

var f = memfiles.open(paramStr(1))
var cnt = 0
let t0 = epochTime()
for x in memSlices(f): inc(cnt)  # The action here is just one line
let dt = (epochTime() - t0) * 1e9
echo "cnt: ", cnt, " ns: ", dt, " ns/cnt: ", dt / float(cnt),
     " B/ns: ", float(f.size) / dt
f.close


On my Linux machine this Nim program runs about as fast as the `memchrcount` 
version in Daniel Lemire's approach (which is about 4x the "naive"/"strawman" 
approach). While Lemire does get a speed-up of another 3x over that `memchr` 
approach with vector code, that also loses the ability to do anything besides 
_count_ newlines. Since some kind of non-"mere counting" processing is usually 
desired, the `memchr` approach is probably what most people would want.


Re: Fun with deduplicate

2016-11-15 Thread cblake
@Stefan - this is a classic application of `containsOrIncl` in the set APIs (or 
`getOrPut` in the table APIs), which are very close to the core "search or 
insert" hash table algorithm.


import sets

proc uniqCp[T](s: openArray[T]): seq[T] =
  newSeq(result, s.len)
  var h = initSet[T](rightSize(s.len))
  var i = 0
  for el in s:
    if not h.containsOrIncl(el):
      result[i] = el
      inc(i)
  result.setLen(i)   # trim the unused tail

proc uniqInPlace[T](s: var seq[T]) =
  var h = initSet[T](rightSize(s.len))
  var i = 0
  for el in s:
    if not h.containsOrIncl(el):
      s[i] = el
      inc(i)
  s.setLen(h.len)


These were also about 20% faster on my system than their `OrderedSet` 
counterparts.


Re: unescape \n \r etc in string

2016-12-15 Thread cblake
Perhaps you want strutils.escape? I.e.: 


import strutils
echo escape("\n\t")


Output is "\x0A\x09"

Its output is designed to be as portable an 'input' to 
de-escapers/escape-interpreters as possible. So you will see things like \x0A 
and \x09 instead of the \n \t because the former syntax is accepted by more 
string parsers.


Re: To optimize operation of replacement of lines in the file

2016-07-14 Thread cblake
If you use memfiles then you can just map the whole file and let the OS page 
things in from disk on demand as necessary. This gives you a "whole file buffer 
for almost free" kind of user interface. After the first run or file 
creation/etc. there may be zero/little IO and also very little copying of data 
from kernel space to user space. You do need a real file name that is seekable 
(i.e. "stdin" may not work), but if that constraint is ok for you then mapped 
files are almost always the fastest way to go and among the easiest to use, too. 
You can look at the memSlices iterator in memfiles.nim for one example.
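
For a flavor, here is a minimal sketch of in-place editing via a writable 
mapping (assuming a real, seekable "data.txt"; as written this only works for 
same-length replacements): 


import memfiles

var mf = memfiles.open("data.txt", mode = fmReadWrite)
let bytes = cast[ptr UncheckedArray[char]](mf.mem)
for i in 0 ..< mf.size:                  # the OS pages data in on demand
  if bytes[i] == '\t': bytes[i] = ' '    # e.g., replace tabs with spaces
mf.close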


Re: How to properly bind a function to a compiler buildin?

2017-04-09 Thread cblake
> For my use case it is very important, that I can get all the sizes at compile 
> time

You may not be able to make all these things available at Nim-compile-time 
because they are not fully determined until the C (or whatever) backend compile 
is reached. For example, 


type foo = object
  bar: char
  baz: int


will induce Nim to generate some struct in the C backend. This leaves the C 
compiler a choice of how to lay out the struct - is it "1 byte packed next to 4 
or 8 bytes", or is there a 3 or 7 byte gap between `bar` and `baz`? Many 
backends will insert that 3 or 7 bytes of padding so that the int field is 
aligned, which can result in faster generated assembly code on some CPUs.

Almost all C compilers allow some kind of compiler directive like GCC's 
`__attribute__((__packed__))` or an analogue to kill all padding and let users 
control layout. For the C backend (but maybe not others..?) it may be possible 
for Nim to use those directives for all its structs to work around the 
ambiguities of backend struct layout and resolve these to Nim-compile-time 
values. It is very debatable whether that is desirable, though. It is definitely 
more complexity than just "binding" calls. That binding is really more 
delegation than resolution to a value.
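
To make the padding concrete, a small illustration (the exact numbers are 
typical of x86-64 C backends, not guaranteed by Nim): 


type
  Padded = object
    bar: char
    baz: int
  Packed {.packed.} = object
    bar: char
    baz: int

echo sizeof(Padded)   # usually 16: 7 bytes of padding inserted before `baz`
echo sizeof(Packed)   # 9: no padding, so the value is predictable up front
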

What the compiler currently does seems the right thing to me: delegate the 
complex/compound cases to the backend and "optimize" the cases for 
simple/non-composite types where it can reliably know the answer. For compound 
types, Nim can know these backend-layout-sensitive values are Nim- and 
backend-compile-time constants - just not what the values are - so it cannot 
use them in fully resolved compile-time calculations.

So, a higher level question is if this resolution to a value known at Nim 
compile time is really a strict requirement for your purposes or if you can 
relax that to delegated backend work so the calculation can be done at "backend 
compile time".


Re: how to read/write object from/to binary file?

2018-03-21 Thread cblake
While doofenstein makes some good points especially about fragility with regard 
to adding new fields and old data files, there is some real efficiency charm to 
a "just memory map and go" pure binary native data approach. It's not always 
fun to be parsing data files again and again. So, I would also add another way 
to read the data. Note that you should probably use the `{.packed.}` pragma, 
note the use of `ptr UncheckedArray` which should really alert the reader to be 
careful, note the commented out inference of the number of objects in the file. 
My example code assumes you wrote out at least two objects from code like you 
had above (with a `{.packed.}` pragma). It will probably crash if the file is 
too short, etc. Just to give you a flavor. 


import memfiles

type
  MyType* {.packed.} = object
    somefield*: array[10, int16]
  MyTypes* = ptr UncheckedArray[MyType]

var f = memfiles.open("/tmp/tmp.mytype")
# infer the number of objects from file size (needs packed); do your own range checking!
# let n = f.size div sizeof(MyType)
let myobs = cast[MyTypes](f.mem)
echo myobs[0].somefield[1]
echo myobs[1].somefield[0]



Re: Execution speed Nim vs. Python

2016-08-11 Thread cblake
flyx wrote:

> And yes, something between array and seq would be nice. But currently, Nim's 
> type system does not allow a type with a runtime-calculated length parameter. 
> Allowing it would be a hack similar to C VLAs.

No hack involved here. The length parameter can be part of the object like seq 
rather than part of the type like array. openarray-accepting procs should not 
care since they can receive seqs anyway with run-time variable everything or 
arrays with compile-time fixed everything.

One might argue that I/others could just do their own concept, which is true. 
The standard lib has lots of openarray-accepting routines that would work fine 
with post-init-fixed seq-like objects, though. So, not having such a type in 
the standard lib that is compatible with openarray feels unnecessarily 
inflexible, but on this I think we agree since you said it would be nice. It 
may just be easier than first imagined. :)


Re: How to be more concise in code [NEWBIE]

2017-02-16 Thread cblake
Not sure what will be easiest for you to use/understand..I thought the patty 
approach pretty nice. That said, to answer your last question, the macros 
module has a few interesting things along the lines of what you were asking 
for: `parseExpr`, `parseStmt`, and `emit`. The first two can compile strings to 
ASTs and the last can do that and then put the resulting AST/code in your 
program. `emit` has a little example right in its definition in `macros.nim`. I 
think `emit` is probably what you were asking for.
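
For instance, here is a tiny sketch of the string-to-code direction (my own toy 
example, not the one in `macros.nim`): 


import macros

macro evalStr(s: static[string]): untyped =
  parseStmt(s)           # compile the string to an AST, spliced in right here

evalStr("echo 1 + 2")    # behaves exactly like writing: echo 1 + 2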


Re: Perfecting Nim

2018-04-25 Thread cblake
While excess can be bad, and there are inconsistencies that should be repaired, 
I think a "batteries included" stdlib is a good thing. E.g., the recent 
`enumerate` for loop macro in `tests/macros/tforloop_macro1.nim` should 
probably be in `lib/pure/sugar.nim`. I think the duplication (`re` and `nre`, 
`parseopt` and `parseopt2`) should go. `mersenne` does seem pointless. Maybe 
game programmers use `basic[23]d`? I think `coro` is a good example of low 
level programming in Nim, but I don't know if it needs to be in the stdlib.

I like `set[]`, "do", converters, and exceptions. Exception-free alternatives 
in the stdlib sounds good, though. I think `using` is a good idea. I don't 
usually use dynamic dispatch of any kind, but the Lisp world swears by 
multi-methods for such and they are more general than the usual first-slot-only 
OO dispatch. No opinion on holey enums vs distinct ints or `discardable`. I 
never liked non-case-sensitivity or `TaintedString`.

I think it would be nice if `openArray` could be more "open". It could become a 
first class `concept` needing only `[]`, `len`, etc. when concepts are ready 
enough. That may be already planned.


Re: Looking for a set that sorts and deduplicates

2017-11-23 Thread cblake
`CountTable` is just a histogram and the sorting is by the bin counts, not by 
the keys. Keys are visited in "hash order". `hashes` does use an identity hash 
for integers, which could create some confusion in a simple test program if the 
modulo-table-size reduction applied after the hash doesn't change the order 
(though it surely can).

Of course, the larger idea embedded in this suggestion is quite legitimate - 
you could use a `HashSet[int]` (from `sets`) to manage the set until you are 
ready to create a `seq[int]` via `HashSet.toSeq` and then sort that seq at 
once. This is probably the approach that most would use and is probably faster 
for most access patterns. The heap/tree approach would likely win if you had 
dynamism -- insert a bunch, sweep in key order, insert some more, sweep in key 
order again, etc. This complex dynamic pattern wasn't really specified in your 
question, though.
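
In code, the common pattern is something like this minimal sketch: 


import sets, sequtils, algorithm

var s = initSet[int]()
for x in [3, 1, 4, 1, 5, 9, 2, 6, 5]:
  s.incl(x)                    # deduplicate as you go
var ordered = toSeq(items(s))  # then sort once, when you need key order
ordered.sort(cmp[int])
echo ordered                   # @[1, 2, 3, 4, 5, 6, 9]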


Re: Execution speed Nim vs. Python

2017-09-27 Thread cblake
Nim `Table` and `HashSet` should be caching the hash code values in their 
tables and also using that as a "comparison prefix" (comparing string values 
only if the integer hash codes already match). The string hash of the new 
inputs is ineliminable - or rather has to be done at least once anyway.

My guess is the lines are long-ish (greater than 10 chars, say). The default 
Nim hash string function could be faster, especially for longer strings. E.g., 
the current default does byte-at-a-time manipulations and usually 
8-byte/word-at-a-time can get about as good bit mixing much faster, especially 
for long strings like lines in files. Such a replacement might take that 60% 
hashing time in this particular benchmark down to 10-15% for a 2X-ish overall 
speed up.
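
To sketch the word-at-a-time idea (FNV-style constants for illustration only; 
this is not the stdlib's hash): 


proc hash8(s: string): uint64 =
  result = 14695981039346656037'u64      # FNV-1a 64-bit offset basis
  var i = 0
  while i + 8 <= s.len:                  # bulk: consume one 8-byte word per step
    var w: uint64
    copyMem(addr w, unsafeAddr s[i], 8)
    result = (result xor w) * 1099511628211'u64
    i += 8
  while i < s.len:                       # tail: byte at a time
    result = (result xor uint64(s[i])) * 1099511628211'u64
    inc i


A real replacement would need more careful mixing than this, but the bulk loop 
shows where the roughly 8x fewer memory operations come from.
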

I have benchmarked C++ for these kinds of tests and I suspect the current STL 
would not be faster than Nim with -d:release in this case. (C++, too, could 
benefit from a faster default string hash.)


Re: could Nim benefit from upcoming C++ modules to speedup compilation times?

2018-04-05 Thread cblake
@timothee - @amalek beat me to the punch, but user-visible compilation time is 
very sensitive to backend compiler/gcc options. Consider two invocations 
compiling my [cligen](https://github.com/c-blake/cligen) test suite (30 
programs with somewhat complex macros running at compile-time): 


cligen$ ./test.sh
** ./test.sh * Time: 10.43s (u) + 0.92s (s)=11.05s (102%)
cligen$ ./test.sh -d:r
** ./test.sh -d:r * Time: 57.54s (u) + 2.85s (s)=56.96s (106%)


My `nim.cfg` is set up to have the former use tcc/TinyCC as a backend ([the mob 
branch](http://repo.or.cz/tinycc.git)), but to use -d:release and gcc on max 
optimization for the latter. So, basically a huge 5X difference in total 
user-visible compilation time.
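
The relevant part of my setup looks roughly like this (a sketch of one way to 
wire it up, not my exact config): 


@if r:        # nim c -d:r => optimized gcc build
  cc = gcc
  gcc.options.speed = "-O3 -march=native"
  define:release
@else:        # default: tcc for fast edit-compile cycles
  cc = tcc
@end
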

Linux perf record/perf report tell me that Nim itself is about 82% of the time 
in the former case and 14% in the latter. So, in both cases Nim is about 9 
seconds, or about 300 milliseconds per program (probably 10% more with 
kernel/libc activity...). That's not very "real-time" (like re-compile every 
keystroke), but it's pretty good. tcc is much faster (only 3% of the time for 
that faster run), but I think there are enough open issues with the Nim 
compiler that correctness should be the focus more than compile-time 
performance.


Re: To optimize operation of replacement of lines in the file

2016-07-14 Thread cblake
You're welcome, Garry.

Stefan - a simple check for whether pattern has any regex metacharacters (e.g. 
'.', '*', etc.) can decide if a pattern is definitely not a regular expression. 
If there are any metacharacters in pattern, well only the user can know and 
there would have to be a command-line switch to force one interpretation vs. 
another.

In terms of absolute maximum performance, one usually wants to not do the 
search "line at a time" at all. Rather, one wants to use an assembly-optimized 
"strchr"/"memchr"-like search on the biggest chunks of input possible. In this 
context that means scanning the input for the least likely character that is 
definitely necessary for a match. There may be characters much less likely than 
newlines, and so a big boost may be possible (though note that input data sets 
do vary). Once that byte offset is known, from there one determines if it is 
part of a real match. That is all quite a bit more bookkeeping/logic/hassle, 
obviously.
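
A sketch of that idea (a plain Nim byte scan standing in for an 
assembly-optimized memchr; `rare` is the index in `pat` of its least likely 
character): 


proc findViaRare(hay, pat: string, rare: int): int =
  var i = rare
  while i <= hay.len - pat.len + rare:
    if hay[i] == pat[rare]:                    # cheap scan for the rare byte
      let start = i - rare
      if hay[start ..< start + pat.len] == pat:
        return start                           # verify full match around it
    inc i
  return -1

echo findViaRare("some text\nFrom here on", "\nFrom ", 1)  # -> 9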


Command-line Parsing Preferences

2018-03-01 Thread cblake
So, there have been various on again/off again conversations on github about 
how the stdlib parseopt could/should behave. Those conversations seemed to have 
a pretty narrow audience..basically participants on these github issue threads:
[4620](https://github.com/nim-lang/Nim/issues/4620) 
[6818](https://github.com/nim-lang/Nim/issues/6818) .

It made sense to me to raise this topic in the forum where a wider audience 
might weigh in before 0.18/1.0/whatever stabilizing might happen.

The basic issue is whether command users have to separate option keys from 
option values (e.g. with a ':' as the current code does). In order to avoid 
that requirement, command authors need to update a list of option keys whenever 
they change them which is a hassle. So, there is a tension between convenience 
to CLI authors and CLI users. As a user, I often forget to type the ':' and 
would be delighted if more people chose the symbol table route.

Myself, I think having an **optional** symbol table is the most programmer and 
user-friendly solution. That's what I did in parseopt3 in 
[cligen](https://github.com/c-blake/cligen). If the programmer wants to provide 
more POSIX-like command syntax then that's easy. OTOH, if like Araq they very 
understandably prefer not to have to update a symbol table when they add new 
option keys then they can just ignore that feature and make their users provide 
a ':' to separate option keys and option values. To me this kind of optional 
symbol table seems both backward compatible and forward looking and strikes a 
good balance for a stdlib.

I'm just one person, though, and that's just my opinion. The point of this 
thread is to solicit other opinions/perspectives/input to help figure out the 
best resolution of [6818](https://github.com/nim-lang/Nim/issues/6818) .


Re: Fastest way to count number of lines

2017-10-20 Thread cblake
@jlp765 - good catch. I thought of that, too (I actually wrote that `memSlices` 
stuff), and almost went back and added a note later, but you beat me to it. 

I am still unaware of relative timings on platforms other than those I 
personally use and would be interested to hear reports, but on Linux/glibc 
`memSlices` (or more generally `mmap+memchr` however that is invoked) is always 
fastest in my tests.


Re: use fork() to speed up compilation testing.

2018-03-30 Thread cblake
@twetzel59 - I use fork (and even vfork) all the time. They work just fine.

@Krux02 - I have never used compiler-as-a-service mode which seems originally 
targeted at IDEs, but that feature may apply to testing as well. See 
`compiler/service.nim` and `tests/testament/caasdriver.nim`..maybe search for 
"caas".

That said, I suspect a lot of the time is actually the C compiler, not the Nim 
compiler. If you can set up a nim.cfg to run tests with TinyCC/tcc (the mob 
branch), things could probably go much faster.


Re: Trouble with reading from stdin

2017-02-12 Thread cblake
I believe what you are running into is this terminal driver issue:

> [http://unix.stackexchange.com/questions/131105/how-to-read-over-4k-input-without-new-lines-on-a-terminal](http://unix.stackexchange.com/questions/131105/how-to-read-over-4k-input-without-new-lines-on-a-terminal)


Re: Nim vs D

2017-07-07 Thread cblake
Oh, sure. There isn't _usually_ a 1.618**4 kind of work reduction in play, 
though. Araq would know, but I doubt the reason NimMainInner is called 
through a volatile function pointer is to trick gcc's optimizer, though that is 
a happy side effect. 

To me, this was just a performance mystery that I found curious and thought 
others might appreciate the answer to. Also, more Nim-relevant might be the 
[automatic memoization thread](https://forum.nim-lang.org/t/1343) which is a 
more dramatic optimization aedt might appreciate.

Of course, an in-line array or even just the closed-form formula are faster 
still for Fibonacci, but presumably the point of this benchmark from xyz32/aedt 
was to stress test recursion in a programming language. They may want to 
reconsider that benchmark to something with less exponential sensitivity to 
unrolling tricks - unless, of course, the intent is to measure unrolling 
tricks. 


Re: Reason for -fno-strict-aliasing?

2017-08-26 Thread cblake
I did a little searching and compiling and I'd have to say Jehan is right here 
about "absence of evidence". See [this stackoverflow 
thread](https://stackoverflow.com/questions/21214875/gcc-accuracy-of-strict-aliasing-warnings)
 for a very simple example that is (still in 2017) too complex for (at least 
gcc's) -Wstrict-aliasing heuristics. It sure seems like strict-aliasing is a 
real morass.

@cdome - perhaps a better solution would be some kind of `emit` and/or `macro` 
machinery that turns on `-fstrict-aliasing` just for the procs you need to 
recover performance. gcc has had this `#pragma` or `__attribute__` way to do 
that for quite a few years now (2011, I think). See 
[here](https://gcc.gnu.org/wiki/FunctionSpecificOpt).
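
If one went that route, the per-proc version might look something like this 
untested sketch (the attribute spelling is gcc-specific): 


proc hotSum(a: openArray[float64]): float64 {.codegenDecl:
    "__attribute__((optimize(\"strict-aliasing\"))) $# $#$#".} =
  # Only this proc's generated C gets strict aliasing re-enabled.
  for x in a: result += x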


Re: Execution speed Nim vs. Python

2016-08-11 Thread cblake
@flyx - C99 (and gcc long before it) has VLAs of the 
fixed-after-initialization variety...mostly syntactic sugar for alloca, of 
course. All that is surely more commonly used than Ada and required no fancy 
containers or any constraints about being a leaf function. Such an array style 
may be different enough from the post-init-resizable Nim seqs to warrant a 
different type, but it could still be an openarray.

I think Nim needs a kind of openarray between array and seq - one which is 
run-time-sized, but fixed after initialization and with backing memory not 
necessarily on the heap. For example, an array of char that is backed by a 
read-only memory mapped file and so cannot be zero-terminated, but being from a 
file its size cannot change after initialization and yet the size can also 
never be known at compile time. Then it could get bounds checking like other 
openarrays and otherwise behave similarly. Maybe that's not the best example, 
but I'm sure there are many more. Lots of times the sizes of things are only 
known at run-time, but don't change once known/computed. Also, many openarray 
accepting routines need not modify their inputs -- why, they cannot change 
lengths of arrays, for example. So, it would be nice if all openarray accepting 
routines could work with such a type, whatever it might be called. Maybe there 
is one and I am just unaware.


Re: Nim vs D

2017-07-08 Thread cblake
This recursion unpacking/unrolling trick that gcc does (at call-site if 
insulated by call via volatile function ptr, and always inside the recursive 
impl) is, in my experience, a rare compiler optimization, but maybe it will 
catch on. clang does _neither_. If you `objdump -D` the executable (or 
disassemble in `gdb`/etc.) you will see the single `callq` to Fibonacci at the 
entry with the full N and a pair of `callq` inside the impl. So with clang the 
full [1.618**n 
work](https://stackoverflow.com/questions/360748/computational-complexity-of-fibonacci-sequence)
 happens. On my i7-6700K Linux with gcc-7.1.0, I get a time ratio 
of about 15.3 to 1 (53.5s/3.5s).

For what it's worth, it's likely the compiler writer guys think of this as 
mostly a "remove a few funcalls here and there" type optimization..maybe use 
SSE/AVX/branch predictor a bit better, maybe save a little dynamic stack usage, 
etc. This doubly recursive Fibonacci approach just happens to be about as 
sensitive as possible to that particular optimization because it can save 
exponentially many funcalls.


Re: Looking for a set that sorts and deduplicates

2017-11-24 Thread cblake
Memory hierarchies since the 1990s have broadened the relevance of 1970s 
practical wisdom for disk storage. A solid hash table with locality and solid 
B-Tree often win today. A great stdlib should have both.

A more obscure bonus of a B-Tree (when compared to, say, balanced binary trees) 
is cheaper amortized cost to maintain simultaneous access by rank (e.g. quickly 
find 95th percentile, top 10, etc. as well as query by key). With B-Trees you 
need only one rank counter per big node of O(M) elements instrumentation, not 
per element as with binary trees. So, B-Trees afford O(M)-times less space 
overhead and O(log(N)/log(M))-times less time overhead than a rank-instrumented 
binary tree of N elements. This can be useful for, e.g. tracking quantiles over 
moving data windows.


Re: Fastest way to count number of lines

2017-10-21 Thread cblake
Yeah..Depending on what he's doing, same-file dynamic estimation might also 
work. Good point, @jlp765.

On my system the 5.4 MB of `/usr/share/nim/**.nim` gets counted in about 4 
milliseconds - over 1.3 GB/sec, probably faster than all but the most 
powerhouse nvme/disk array IO. This is why I suspect @alfrednewman might be 
re-calculating things instead of saving the answer either in RAM or as files.

I'm sure a pre-pass calculating the number of lines can avoid certain 
complexities. However, once you start doing assembly hijinks that are not even 
portable through a given CPU family (e.g., using SSE, AVX2, AVX512, ...) 
performance becomes very deployment sensitive. Meanwhile, eliminating the 
entire pre-pass by merging it with per-line allocations/whatever costs 
complexity, too, but yields portable performance gains. If it's really 
ineliminable then have at it, asm-wise, I guess. I just suspect it's off-track.


Re: Remarks on standard library file functions (system, sysio)

2017-02-09 Thread cblake
Yeah. I haven't tried out your code, but an API idea might be to have the 
caller instruct the iterator what to do with any improper/undelimited record at 
the end -- includeUnterminated=False or something. Only the caller really knows 
what is desired.

One observation, algo performance-wise, is that the fastest substring searches 
are usually achieved by using a fast libc SSE/AVX vectorized `memchr` to find 
the _least likely_ character in the substring. Sometimes one knows this from 
vague properties of the distribution. For example, framing the email messages 
in UUCP-style mailbox with '^From ' kinds of delimiting, `memchr` for the 
capital 'F' skips over most text most of the time. This idea may not work in a 
corpus of emails in ALL CAPS ALL THE TIME, but such a corpus should be rare.  
Anyway, this idea is just sort of the simplest/first step into a wider world of 
efficient substring search algorithms that you might enjoy reading about some 
day.


Re: How to properly bind a function to a compiler buildin?

2017-04-10 Thread cblake
You are right that C compilers already strive to ensure ABI struct compatibility 
which is why I think having Nim leverage that effort is better than reproducing 
it at the Nim level, though using backends themselves to pre-generate things 
may help. My point was to get truly Nim-time resolved values was trickier than 
it seemed was being framed by the discussion. Another thought along those lines 
might be to resolve things to Nim-time values only for types using Nim's 
`{.packed.}` pragma. Maybe only `{.packed.}` types would get the best error 
reports.


Re: Fastest way to count number of lines

2017-10-20 Thread cblake
So, one other thing that is _probably_ obvious but bears mentioning just in 
case it didn't occur to @alfrednewman - the number of addressable/seekable 
bytes in a file is usually maintained by any modern filesystem on any modern OS 
as cheaply accessed metadata. So, if what you really need is not an exact count 
but a "reasonable bound" that could be violated once in a while then you may be 
able to _really_ accelerate your processing.

As one example text corpus that we all have access to, the current Nim git 
repo's `lib/nim/**.nim` files have an average length of 33 bytes and a standard 
deviation of about 22 bytes as assessed by this little program: 


import os, memfiles

when isMainModule:
  var inp = memfiles.open(paramStr(1))
  var counter, sumSq = 0
  for slc in memSlices(inp):
    inc(counter)
    sumSq += slc.size * slc.size
  echo "num lines: ", counter
  let mean = float(inp.size) / float(counter)
  echo "mean length: ", mean
  let meanSq = float(sumSq) / float(counter)
  echo "var(length): ", meanSq - mean*mean


You could probably reasonably bound the average number of bytes per line as, 
say (average + 4*stdev) which in this case is about 33+22*4 =~ 121 bytes..maybe 
round up to 128. Then you could do something like: 


import os

let myPath = "some/data/file"   # stand-in path; use your own
var reasonableUpperBoundOnLineCount =
  int(float(getFileInfo(myPath).size) / float(128))


If you use that bound to allocate something then you are unlikely to over 
allocate memory by more than about 4X which isn't usually considered "that bad" 
in this kind of problem space. Depending on what you are doing you can tune 
that parameter and you might need to be prepared in your code to "spill" past a 
very, very rare 4 standard deviations tail event. This optimization will beat 
the pants off any even AVX512 deal that iterates over all the file bytes at 
least for this aspect of the calculation. It basically eliminates a whole pass 
over the input data in a case that is made common by construction.

Since you have seemed pretty dead set on an exact calculation in other posts, a 
small elaboration upon the "embedded assumptions" in this optimization may be 
warranted. All that is really relied upon is that some initial sample of files 
can predict the distribution of line lengths "well enough" to estimate some 
threshold (that "128" divisor) that has "tunably rare" spill-overs where they 
are rare enough to not cause much slowdown in whatever ultimate calculation you 
are actually doing which you have not been clear about.

Another idea along these lines, if, say, the files are processed over and over 
again, is to avoid re-computing all those `memchr()` s by writing a little 
proc/program to maintain an off-to-the-side file of byte indexes to the 
beginnings of lines. The idea here would be that you have two sets of files, 
your actual input files and some paired file "foo.idx" with foo.idx containing 
just a bunch of binary ints in the native format of the CPU that are either 
byte offsets or line lengths effectively caching the answer of the `memchr`.

If you had such index files then when you want to know how many lines a file is 
you can `getFileInfo` on the ".idx" file and know immediately. You could be 
careful and check modification times on the .idx and the original data file and 
that sort of thing, too. Why, you could even add a "smarter" `memSlices` that 
checked for such a file and skipped almost all its work and an API call 
`numSlices` that skipped all the work if the ".idx" file is up-to-date 
according to time stamps.
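
A rough sketch of that sidecar idea (the file naming and format here are my own 
invention): 


import memfiles, os, times

proc writeIdx(path: string) =
  var mf = memfiles.open(path)
  let f = open(path & ".idx", fmWrite)
  var off = 0
  for slc in memSlices(mf):            # cache each line-start byte offset
    discard f.writeBuffer(addr off, sizeof(off))
    off += slc.size + 1                # +1 for the '\n' terminator
  f.close
  mf.close

proc numLines(path: string): int =
  ## Answer instantly if the index is newer than the data file.
  let idx = path & ".idx"
  if fileExists(idx) and
     getLastModificationTime(idx) >= getLastModificationTime(path):
    result = int(getFileSize(idx)) div sizeof(int)
  else:
    result = -1                        # caller should (re)build the index
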

Basically, except for actually "just computing file statistics", it seems 
highly unlikely to me that you should really be optimizing the heck out of 
newline counting in and of itself beyond what `memSlices/memchr()` already do.


Re: Looking for a set that sorts and deduplicates

2017-11-23 Thread cblake
@Araq - fabulous. I think a B-Tree in the stdlib would be great.

@mratsim - I think you meant `collections.intsets` with a 't' which does _not_ 
yield items() in key order.

The built-in `set[up to int16]` type _does_ (implicitly) order its keys, 
though. @cdome also did not mention how large the integers he needed to support 
were, but he said "int" which I took to be the standard Nim `int` which is too 
big for a `set[]`. Also, while the set `incl` operations of `set[int16]` are 
unbeatable performance-wise, iterating over the set can be relatively slow if 
the set is sparsely populated -- e.g., if you do int16 and only have 10 unique 
numbers, you still have to loop over 65536 entries to convert to a seq[int].

You can play around with this little program to see the various effects. 


import sets, intsets, times, random, sequtils, algorithm
let trials = 1
let numberRange = 1000 # 1
let doPrint = false

randomize()
var Is = initIntSet()
let tI0 = epochTime()
for i in 1..trials: Is.incl(random(numberRange))
var Iss = toSeq(items(Is))
Iss.sort(cmp[int])
echo "IntSet time:   ", epochTime() - tI0
if doPrint: echo Iss

var Hs = initSet[int]()
let tH0 = epochTime()
for i in 1..trials: Hs.incl(random(numberRange))
var Hss = toSeq(items(Hs))
Hss.sort(cmp[int])
echo "HashSet[int] time: ", epochTime() - tH0
if doPrint: echo Hss

var Ss: set[int16]
let tS0 = epochTime()
for i in 1..trials: Ss.incl(int16(random(numberRange)))
var Sss = toSeq(items(Ss))
echo "set[int] time: ", epochTime() - tS0
if doPrint: echo Sss



Re: Reason for -fno-strict-aliasing?

2017-08-25 Thread cblake
@Tiberium, in the `csources/build.sh` context you should only need to remove 
the `-w` and change the `-fno-strict-aliasing` in `COMP_FLAGS` since it's just 
a shell script with a zillion gcc invocations. Compiling other nim code, I only 
had to change my nim.cfg's `gcc.options.always = "-Wall 2>>/tmp/gcc.log"` or 
something similar to catch the warning outputs. Still yet to see any warnings 
related to aliasing.


Re: OrderedTable is not an ordered table

2018-03-21 Thread cblake
@lightness1024 - good point. Another argument in favor of "KeyOrderedTable" 
would be that it would probably show up in searches of the more generic 
"OrderedTable". I don't think Nim has a key-ordered collection (EDIT - in the 
stdlib) right now, but Araq has said he has a B-Tree waiting in the wings. 


Re: Perfecting Nim

2018-04-25 Thread cblake
Patching the stdlib to use concepts where it makes sense instead of `openArray` 
is also a fine idea. It's really used a lot -- (`algorithms`, `strutils`, 
`sequtils`, `random`, `stats`, etc.). `Iterable` & `Indexable` concepts make 
sense to me as a decomposition for all those interface use-cases.
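
For a sketch of what that might look like (assuming 0.18-era concepts behave as 
documented): 


type
  Indexable[T] = concept c
    c.len is int
    c[int] is T

proc second[T](c: Indexable[T]): T = c[1]

echo second(@[10, 20, 30])   # 20


Then `openArray`-style procs could accept anything indexable, heap-backed or 
not.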

Something so basic as a dependency should not be in something called `sugar`, 
though. Some new stdlib module (`concepts`, `stdtypes`, `typeclasses`?) would 
be a better place for standard concepts that many other modules depended upon. 
But this is off-topic for "what to remove". 


Re: Surprises with Generics

2017-11-12 Thread cblake
One other wrinkle more in line with this thread topic is making things work for 
general object types but specific standard key types. In C one might handle 
that by taking an `offsetof` for where the key field is within an object and 
`sizeof` for the size of object in the array. In Nim, one could do similar, but 
it might be more "Nimonic" to more safely dispatch via some `sort template` 
instead of a `sort proc`, e.g.: 


template sort[T](inp: seq[T], keyField: untyped) =
  when T.`keyField` is byte: ## XXX Handle all the standard types
    ...  ## XXX good & stable algos for each one

myArray.sort(myKey)  # usage


The implementation might look nicer with a family of overloaded calls rather 
than a big `when T.keyField is ...` dispatcher. I am not sure there is a clean 
way to do that in this case. Each `when` clause should be able to dispatch to a 
type-specialized case, though. So, it's not too bad.


Re: Nim vs D

2017-07-07 Thread cblake
@aedt - with that C program I posted and just changing the numbers, I see no 
change in behavior at all: 


N  fib(N)       log_2(fib(N))  WallTime    UsrCPU  log_1.618(Tm)
57 365435256832 38.410825 88.338572   88.29  9.31268150798
56 225851408384 37.716583 54.570624   54.54  8.31166257771
55 139583848448 37.022341 33.718429   33.7   7.31112152655
54 86267559936  36.328100 20.863663   20.85  6.31352254278
53 53316284416  35.633857 12.885360   12.87  5.31201286047
52 32951275520  34.939615 7.9617497   7.95   4.31148872869
51 20365008896  34.245373 4.9201932   4.91   3.31125979281
50 12586267648  33.551131 3.0422103   3.04   2.31214787174
49 7778741248   32.856890 1.8785994   1.87   1.31034617400
48 4807526400   32.162648 1.1626232   1.16   0.31313742467
47 2971214848   31.468406 0.7182663   0.71  -0.68769978950
46 1836311680   30.774164 0.4441032   0.44  -1.68685323404
45 1134903040   30.079922 0.275319635 0.27  -2.68048037749


Indeed the fit is basically a perfect straight line: log_1.618(time) = N - 
47.67. The integer is only passed around as an argument. The arithmetic is all 
float32 (or maybe float64 depending on impl). So, to me this is as should be 
expected. I believe that is the case in all your benchmark versions, too. Not 
sure what is going on for you.


Re: Creating a new seq is not that fast

2017-04-18 Thread cblake
If zeroed memory is required semantically, then it's better to compare against 
`calloc` not `malloc` in an apples-to-apples sense.

Also, at least on Linux, you need to be careful in benchmarking this sort of 
thing. As you get to larger allocations the OS can actually give the process 
many linked copies of a copy-on-write 4096 byte page of all zero bytes. Because 
of the C-O-W, the initial allocation alone can seem much cheaper than it really 
is. The work is deferred to being done lazily as memory is written to.

This zero page business can also cause confusion with calloc-never-write 
read-only workloads in benchmarks (say `calloc` and then sum all the ints) 
because on x86 the cache is physically mapped and that single zero page can be 
fully L1 resident. Such L1 residence can give a "false" speed advantage 
compared to more aged/populated memory. [ This latter point is not relevant 
here, but seemed worth pointing out anyway -- it was how I personally 
discovered Linux was finally doing this optimization. ]


Re: Execution speed Nim vs. Python

2017-09-27 Thread cblake
(I should perhaps qualify: beyond -d:release, that means gcc or clang with high 
optimization levels on the back end, since jil210 revealed in [another 
thread](https://forum.nim-lang.org/t/3198) that the baseline 5X performance 
mystery arose from using tcc as a back end.)

Also, for the curious, the reason C++ STL will usually underperform Nim's hash 
tables in benchmarks like this is that STL iterator deletion semantics make the 
natural STL hash table collision resolution implementation choice be external 
chaining. Those extra linked list indirections add latency, especially when 
tables do not fit in on-CPU cache.


Re: Remarks on standard library file functions (system, sysio)

2017-02-09 Thread cblake
Some greps can indeed be quite good. 

Really, I think this algorithmic intelligence should be able to be just used 
from Nim `strutils`. Nim strings know their length. Many/most of those 
read-only procs like `strutils.find` "should" be able to work on types like 
`MemSlice` or really some kind of `openarray[char]` \-- a pointer and length 
pair. Scare quotes for the "should" since the current Nim `openarray` does not 
include arrays of dynamic/run-time extent but with non-heap/GC backing memory 
like a memory map, or some other blob of memory.

There is ongoing work on concepts that could probably profitably replace 
`openarray` in many use cases. So, maybe some day in the not too distant 
future, as they say. Concept support may already be good enough to start an 
ambitious project along those lines, but yeah, for you/your situation, just 
re-doing memSlices is a fine approach, esp. if absolute performance is not a 
priority.


Re: Command-line Parsing Preferences

2018-03-02 Thread cblake
It really shouldn't _need_ to break anything, though mistakes can always be 
made in some initial pass. 

I would think you might also like long option key normalization so that one 
doesn't have to remember when typing commands `--how-toSpell_OptKeys-exactly`. 
Entering commands is often more ad hoc and personal than writing code, to the 
extent they are even distinct.

Also, if it matters, I'm happy to donate 
[parseopt3.nim](https://github.com/c-blake/cligen/blob/master/parseopt3.nim) to 
the cause. That adds a few more features beyond optional symbols like 
"\--"/`stopWords`, and `optionNormalize`. It should be API/backward compatible 
by default, but I really haven't tested it much outside being driven by 
[cligen](https://github.com/c-blake/cligen) macros. E.g., I'm not sure how it 
fares against quoting issues that arose recently.


Re: writeFile with iterator

2016-07-25 Thread cblake
Also, in terms of optimization just replacing: 


for str in iter(): file.writeLine(str)


with 


for str in iter(): file.write(str & "\n")


yielded a 1.3X speed-up (I.e., it took the time down to 0.14 seconds).


Re: Nim vs D

2017-07-07 Thread cblake
In case anyone else is curious about aedt's fibonacci benchmarks, I found an 
explanation for the approx 6..8X time difference of C vs Nim (with gcc as a 
backend, on Linux). It took digging into assembly to figure it out, though. 
With the Nim version, the nested calls/various module boilerplate allow gcc to 
unpack the top several entries of the recursion _at the call site_. It is not a 
properly memoized/fully iterative tail call, though, as in aedt's edited 
comment. (Both versions unpack a few levels in the fibonacci function itself.) 
So, really the doubly recursive call is invoked with a number 4 or 5 smaller 
than the pure C implementation and so runs much faster since the running time 
is exponential in N. So, this is mostly just good luck for Nim-gcc in this 
highly specific benchmark, not something fundamental in the language or impls 
that would generalize all that well. (I haven't figured out how to tweak the 
pure C to get gcc to engage that optimization, but presumably someday the gcc 
folks will do so behind the scenes or maybe there's already a magic -f flag I'm 
unaware of.)


Re: OrderedTable is not an ordered table

2018-03-21 Thread cblake
Thanks for the compliments. You should probably learn about backward shift 
deletion. It's a little tricky the first time you see it, but ultimately it's 
better than tombstones and a bunch of ad hoc garbage collection rules. You can 
look at `lib/pure/collections/tables.nim`. There are really only a few 
improvements to Nim's `Table` that I think might be good.

We could do Robin-Hood reorganization on inserts, being sure to retain linear 
probing and backward shift deletion. In my tests, Robin Hood costs about 10% 
more time on workloads which are mostly successful lookup but saves about 50% 
of the time for workloads that are mostly failed lookups. (Note that every 
insert of a novel element starts with an unsuccessful lookup.) So, Robin Hood 
never adds much on average and is often a net win on "many workloads" and as a 
bonus squashes the variance of lookup times.

The second idea to speed up might be storing the hash codes in a separate array 
from the keys so that a single cache line can fit a larger collision cluster. 
That speeds up searches, but it also slows down inserts and deletes which need 
to hit twice as many memory spots. So, your mileage may vary even more than RH.

Finally, Tables could also elide saving the hash code for `SomeInteger`-like 
"cheaply hashable/comparable" keys. Unlike the prior two ideas, the speed up 
from this would not be workload dependent. However, to do it there needs to be 
either a separate occupied bit vector (two cache lines again, like a segregated 
hash code array) or a special reserved key value signifying "empty". The 
reserved value is the efficient way to go and just one is usually not too 
onerous (zero, -1 or -2**31, etc.), **but** it 's not an easy thing to default. 
So, that expands the API to the user - for just some type classes of key - to 
specify the empty value. So, a specialized name like `SomeIntTable` analogous 
to the specialized histogram `CountTable` is probably warranted.


Re: Fastest way to count number of lines

2017-10-22 Thread cblake
Nice, boia01! In my same `/usr/share/nim/**.nim` test I get 768 microseconds 
for your version and 2080us for just doing the memSlices approach..So, 2.7X 
speed up, a bit less than the 4X I saw when I last compared the approach in two 
C versions..maybe unrolling. Dunno.

@alfrednewman - if the statistics of these data files are stable over the whole 
file, you could always stop after the first gigabyte (or maybe less), figure 
out the average line length and use the file size to estimate the number of 
lines, and then it would only be a ~1 second delay. A MemFile knows its size, 
as does each slice...So, this is pretty easy to code. Of course, if the first 
gig is not always representative of the remainder that might not work, but it 
sounds like you probably don't need an exact answer. This is sort of a 
simplified version of one of my/jlp765's suggestions. 2.7 * 1.3 GB/s =~ 3.5 
GB/s is faster IO than many people have. So, even using boia01's code you may 
not see a great speed-up.


Re: Surprises with Generics

2017-11-12 Thread cblake
@Stefan_Salewski - I think this idea of having a family of overloaded procs to 
sort on standard types is very good. I have almost raised it myself several 
times.

It would be useful for `cmp`-less `sort` procs on such types to be _stable_ 
sorts (meaning elements with identical keys retain relative positions in the 
array). Then a user could implement a simple 2-level or 3-level sort on 
distinct embedded keys by just calling such sort procs 2 or 3 times. (A 
multi-level sort orders elements first by a primary key, then a secondary key 
for equal primary keys, and so on).

This arrangement has another performance bonus potentially much greater than 
the inlining effect mentioned so far in this thread. It allows one to tune 
which sorting algorithm is used based on the nature of the key, e.g. how long 
it is, integer or floating point or string, etc. For example, for single-byte 
integer keys it is hard to be faster than [counting 
sort](https://en.wikipedia.org/wiki/Counting_sort) which does two pretty linear 
sweeps through the input - the first to histogram byte values and the second to 
output the answer (with a keyspace-sized prefix sum in between to convert 
counts to output offsets). The simplest version of that does require an 
additional copy of the input array which may have some consequences on the proc 
signature/interface/specification.
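
A minimal counting sort sketch for single-byte keys, per that description (the 
extra output buffer is the additional copy just mentioned): 


proc countingSort(a: openArray[byte]): seq[byte] =
  var cnt: array[256, int]
  for x in a: inc cnt[x.int]            # sweep 1: histogram byte values
  var off: array[256, int]              # keyspace-sized prefix sum:
  for k in 1 .. 255:                    #   counts -> starting output offsets
    off[k] = off[k-1] + cnt[k-1]
  result = newSeq[byte](a.len)
  for x in a:                           # sweep 2: place each element, stably
    result[off[x.int]] = x
    inc off[x.int]

echo countingSort([3'u8, 1, 2, 1, 0])   # @[0, 1, 1, 2, 3]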


Re: unescape \n \r etc in string

2016-12-16 Thread cblake
Looking at the implementation of strutils.unescape, it seems to only interpret 
the \xHH syntax that escape outputs. Of course, the compiler proper is often 
changing those raw/quoted string forms into special characters. So, maybe there 
is some other approach/trick that could work..sort of like an "eval" in 
interpreted languages.


Re: Add Nim to The Computer Language Benchmarks Game?

2017-06-25 Thread cblake
So, building on the 40% or so of the benchmarks def did, I rounded out the 
other 60% and gave a PR to his repo. With more like hours than years of my time 
per program profiling/tweaking/optimizing, I got the Nim implementations to 
mostly parity with the top performers (Nim was in the top 3..6 impls on my 
machine for almost all tests). Of course, I did start most of those impls from 
"for years tweaked impls".

Anyway, shortly after I finished that work, I also saw the annoying comment on 
Alioth about "No more languages! Do your own website", but I was lazy and did 
not.  Kudos to you, bluenote for doing something.  Looks pretty nice.


Re: writeFile with iterator

2016-07-25 Thread cblake
Garry: --opt:size (and things like gcc -Os) instructs compilers to choose 
trade-offs to make object code smaller rather than faster. E.g., on Oderwat's 
code, I get 0.19 seconds with just -d:release (with gcc-6.1 on Linux x86_64, 
i7-6700k @4.5GHz). If I additionally specify --opt:size the time rises to 0.28 
seconds, almost 1.5X slower. The program text segment is 80KB in the fast case 
but 33KB in the slow case. So, all is behaving as expected/requested.


Re: how to skip all the `Hint: foo [Processing]` during compilation?

2018-04-03 Thread cblake
You can also put in [any of your 
nim.cfgs](https://nim-lang.org/docs/nimc.html#compiler-usage-configuration-files)


hint[Processing]=off


You can also disable other categories of diagnostics besides "Processing" 
similarly.


Re: writeFile with iterator

2016-07-25 Thread cblake
Yeah...I thought about \n higher up but didn't want to change the semantics too 
much. With that change my time goes down to 0.10 seconds (almost a full 3X 
faster than the original 0.28 --opt:size version on the same machine). About 
55..60% of that 0.10 seconds is all just libc fwrite stuff. The remaining Nim 
probably can't be sped up much more..maybe 10% total runtime at most. 
file.write is just a very thin wrapper for stdC fwrite. You can see if you look 
at lib/system/sysio.nim (search for proc.*write.*File).


Re: Nim vs D

2017-07-07 Thread cblake
I get the same perfect line with the Nim fib(47..57) as well. { Well, slightly 
slower: log_1.618(wall) =~ N - 47.35. So, 1.618**(47.67-47.35) = 1.16 or around 
16% slower. }


Re: OrderedTable is not an ordered table

2018-03-21 Thread cblake
@lightness1024 - fair criticism of C++ STL.

Modulo-prime reduction of the hash code to a table address/array index is slow 
because division is slow. In my tests using modulo can sometimes **triple** the 
entire lookup time for simple integer keys due to the slowness of division! I 
recommend you consult Knuth's Art Of Computer Programming Volume 3 (section 
6.4 in my 1998 edition, but his treatment mostly dates back to the 60s). He 
motivates why prime may be a good choice and then immediately moves on to a 
multiplicative hash. If you really need to mix the bits a multiply & shift is 
much faster than modulo. While the upper bits of the half-product do have more 
entropy, masking seems about as good in my experience and the code is slightly 
simpler due to the close relation of the mask and table size.
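
For concreteness, the two reductions look like this (constants illustrative; 
the multiplier is the well-known 2^64/golden-ratio value): 


const A = 0x9E3779B97F4A7C15'u64   # ~2^64/phi

proc maskIndex(h: uint64, len: int): int =
  int(h and uint64(len - 1))       # len must be a power of two

proc fibIndex13(h: uint64): int =
  int((h * A) shr 51)              # multiply & shift: top 13 bits -> 8192 slots


Both compile to a cycle or three instead of the dozens a division can cost.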

Knuth goes on at some length about choosing multipliers - about 10X as much 
length as for the modulo prime. For some reason (perhaps an overzealous affection 
for number theory by CS theorists that write algo books or maybe just 
simplicity?) "modulo prime" got repeated more (everywhere - in 
classes/lectures/textbooks/code, etc.). modulo prime sadly "stuck" in 
programming culture needlessly. Python even has some array of primes less than 
powers of two. The pedagogical situation seems to be improving in the past 
decade or so.

In the context of Nim's Table, for integers the absolute most trivial identity 
hash is used. If that is not good enough, a user should override system hash 
functions with their own better scrambled hash code producer. Believe it or not 
(I always recommend testing on your own real data..), the trivial hash is often 
plenty fast. While adequacy of randomization is data dependent, with linear 
probing and modern CPU cache prefetching even large collision clusters can be 
scanned quickly (especially compared to many "improved randomness" mitigations).

Here are two examples of improved randomness failing. You could use a super 
expensive cryptographic hash function to get super random hash code bits, but 
the time you spend doing that would likely be much more than the extra time 
from scanning bigger clusters. Another example is double hashing which hops all 
over the table randomly in a very cache unfriendly way. The clusters are 
smaller, but the number of cache misses goes way up. To avoid dealing with 
hardware variation, people, including smart researchers, often use 
**machine-neutral surrogates** for actual time like  "number of comparisons" or 
"cluster sizes". This can create all sorts of confusion and cross-talk when the 
surrogates are not very faithful to real machines.

Also, this is slightly off-topic for Nim, but I looked at your package and you 
probably have a subtle bug. I don't see you counting "tombstone" or deleted 
markers - usually desirable to reorganize the table if there are too many. See 
[this thread](https://forum.nim-lang.org/t/829) for details. You also might 
like to read the [thread on hash functions](https://forum.nim-lang.org/t/1730).


Re: Looking for a set that sorts and deduplicates

2017-11-22 Thread cblake
`collections.heapqueue` does mostly what you want. You would have to remove 
duplicates yourself, like below, but probably with an iterator called something 
like `unique`. 


import collections.heapqueue

var heap = newHeapQueue[int]()
let data = [1, 3, 5, 1, 7, 9, 2, 4, 6, 6, 8, 0]
for item in data:
  push(heap, item)

var last = heap.pop()  # NOTE: assumes heap.len >= 1
echo last
while heap.len > 0:
  let next = heap.pop()
  if next != last:
    echo next
  last = next


Note that this does actually store duplicates until the final iteration because 
heaps do not possess a fast { i.e. O(log-set-size) } test for membership. So, 
if you had a great many copies and/or memory constraints that could be an 
issue. You didn't mention how high the duplication rate would be, but I bet 
something like the above is fast enough.

A more memory optimal structure for this is probably a simple B-Tree or some 
kind of balanced binary tree (less cache friendly than a B-Tree, but typically 
more available/implemented more often) or possibly an adaptation of 
`collections.critbits` or some other digital search tree. I think there is a 
red-black tree in Nim floating around.


Re: Reason for -fno-strict-aliasing?

2017-08-25 Thread cblake
This may be obvious, but has anyone else tried changing `-fno-strict-aliasing` 
to `-fstrict-aliasing -Wstrict-aliasing` in their `nim.cfg` and seeing if any 
gcc warnings arise? When I try this on Linux with gcc-6.3 and gcc-7.2 and devel 
nim I don't see any warnings at all. You may also need to turn off the `-w` 
general gcc warning suppression, too (maybe in the system level nim.cfg as well 
as $HOME one) and double check with `nim c --verbosity:2` that the compiler is 
getting invoked as you think it should. I tried a few different garbage 
collector impls and a variety of small Nim programs. It's hardly exhaustive and 
I'm not sure that this particular gcc warning has a 0% false-negative rate 
(does someone know?). We may be so close to this as to already be almost there.


Re: Fastest way to count number of lines

2017-10-20 Thread cblake
It sounds like you will have many regular files (i.e., random access/seekable 
inputs as opposed to things like Unix pipes). On Linux with glibc, 
memfiles.open is probably the fastest approach which uses memchr internally to 
find line boundaries. E.g. (right from memfiles documentation), 


import memfiles

var i = 0
for line in lines(memfiles.open("foo")):
  inc(i)
echo i, " lines"


Your mileage on this may vary from OS to OS or libc to libc. I have no idea 
which if any Microsoft/Windows versions have well-optimized libc memchr() 
implementations.


Re: Loop backward through array -- howto do it brachless

2018-10-05 Thread cblake
`for i in 1..12: echo -i and 3 `



Re: [poll] Moving all RFCs in a separate repo

2018-10-15 Thread cblake
+1 for labels rather than alternate repos (or more labels such as "random idea" 
or "half-baked" or such).

But, also +1 if Nim core devs want to aggressively close issues that are too 
incomplete to be a "real" RFC or are considered settled questions.


Re: `import foo {.private.}` to allows access to private fields (eg: package-level visibility)

2018-10-10 Thread cblake
+1 on some kind of invisibility/private-ness override feature. Another bonus of 
that is to use it as a temporary workaround if some 3rd party package has made 
something "overly" private as viewed by an "expert" user. While this has come 
up in the context of library writing, sometimes users of libraries understand 
many of its internals and sometimes they don't. Private, not exported, is 
probably the right default or at least too late to change, but it's easy to 
forget a `*` export marker and easy to disagree. This proposed mechanism would 
allow temporary workarounds much closer to a likely more sanitary future usage.

It might be better to call it something scarier/less similar to a token like 
`private` which can be _so_ common in other prog.lang's as to lead to questions 
like "why does `private` mean this negative sense in Nim but positive sense in 
Foo?". `import foo {.overrideInvisibility.}` or `overridePrivate` or something. 
Just some kind of word to indicate bypassing of normal-ness (to act as a 
redundant cue over there being a pragma there at all). I agree that 
`{.breakTheGlassAndMakeEverythingVisible.}` is severe overkill. :-)


Re: the Fibonacci benchmark

2018-09-29 Thread cblake
I also got `nim c`'s code running twice as fast as `nim cpp`'s code. See some 
previous discussion: 
[https://forum.nim-lang.org/t/1779](https://forum.nim-lang.org/t/1779) - there 
are cases where Nim's generated C code "inspires" the C compiler to "unroll the 
last few recursions" and save 4x or 8x or maybe even 16x of the function calls 
(depending on the unroll amount). It's not quite constant folding. Disassembly 
and looking for `callq` can be helpful here.

I think the result is correct by the more old school definition starting from 
"1 1 2 3 5" (with indexing starting at zero), not "0 1 1 2 3 5" as per 
[https://en.wikipedia.org/wiki/Fibonacci_number](https://en.wikipedia.org/wiki/Fibonacci_number)
 { or maybe the indexing is not what @miran expects, but he uses a ";-)" }. 
Anyway, there is sort of "more than one right answer" in a couple of ways 
depending on index origin or 0 inclusion.


Re: the Fibonacci benchmark

2018-09-29 Thread cblake
No `gcc` doesn't -- at least not for me with a factor of about 7X, almost 
identical to @adrianv's results. I used both an earlier gcc-7 and a gcc-8.2 
just now.

If you don't want to disassemble your binary, that's ok. You can check this 
with `gcc -pg` and `gprof` which will count calls for you.

I also gave a complete example in C that does the optimization I mention in the 
prior thread which I will reproduce here because the new forum code hides it 
behind "load more posts". With `fib(30)` I get 118793 calls. With `fib(40)` I 
get 14612525 calls. With even a low `fib(20)` I get 950 calls. Does 950 calls 
seem like compile-time calculation to you? With `fib(46)` I get 262211393 calls.


#include <stdio.h>

float fibonacci(int x) {
    return x < 2 ? (float)x : fibonacci(x - 1) + fibonacci(x - 2);
}

void NimMainInner() {
    printf("%.0f\n", fibonacci(46));
}

int main() {
#ifdef FIB_SLOW
    NimMainInner();
#else
    void (*volatile inner)(void) = NimMainInner;
    inner();
#endif
    return 0;
}



If one simply calls `NimMainInner()` directly instead of through the volatile 
function pointer `inner()`, `fib(20)` generates 10942 calls instead of 950.

I understand my example uses floating point, not integer arithmetic, but 
@adrianv's numbers sure make it seem like the same situation. @adrianv can 
also compile with `-pg` and run `gprof` to check. Or he could adapt this C and 
compile it the same way his `nim.cfg` or whatever compiles his Nim generated C.

I'm not quite sure what more evidence you would require. It isn't hard to 
calculate how many function calls doubly recursive Fibonacci requires.


Re: the Fibonacci benchmark

2018-09-29 Thread cblake
It might be a lot of things, but I'm telling you - there is a `gcc` 
optimization, very finicky about context (and probably compiler flags, too), 
that reduces function calls by almost exactly his speed-up factor. There's 
really no need to guess if I'm right, though. Just add `-pg` to the gcc flags 
(in both cases), run `gprof` on the output, and count funcalls in both cases to 
confirm.


Re: the Fibonacci benchmark

2018-09-29 Thread cblake
I am not quite sure what "original" benchmark you mean - perhaps the 4.6 s for 
Nim vs 6.0 s for C on the original github page at the top of this thread. That 
may be small scale code layout/caching/branch predictor effects as you say.

However, the difference in this thread of 8X is _almost certainly not_ compile-time evaluation of fib(46), nor a coincidence, nor some generalizable "Nim is effectively faster than C" effect { though the ;) is noted! :-) }. Compile-time
evaluation would likely be many thousands of times or more faster than that - 
faster even than the memoized/linearized versions of the calculation which are 
already O(1000x) faster. Indeed, one gets "too many iterations" trying to 
`const x = fib(46)` from the Nim VM.

The merely ~8X faster observed twice in this thread is something else which I 
have already explained in the prior thread and in this thread linking to that. 
And that explanation/approach does generalize a little, and even across 
programming languages. Certain iterative/recursive code that does very little 
work like simple adds or swaps { permutations, I'm looking at you :-) } can 
often be sped up by either manually or automatically unpacking/unrolling the 
last few recursions so that more work is done on a per-function call basis or 
said otherwise there are fewer function calls.


Re: setLen without 0-initialization (for efficiency)

2018-09-26 Thread cblake
+1 on the feature. I also sometimes re-use buffers that would benefit from this.


Re: [help needed] nim version of: COMPARING PYTHAGOREAN TRIPLES IN C++, D, AND RUST

2019-01-02 Thread cblake
Oh, I got your point and tried to emphasize that. I wasn't arguing against you 
anywhere that I know of. I totally agree with your a and b; I suspect we don't disagree on anything real at all, but are just discussing different aspects. I
tried to express praise for your approach (if you happened to need to do this 
_particular_ thing) and gave a reference link. :-) Please note that nothing I 
say below should be taken as any specific rebuttal to something you said.

My point was less about exactly zero vs non-zero abstraction cost (I even noted 
my own 5% variation) and more about the cost being very sensitive to 
compilation strategy, PGO as a particular example. I would say, looking at the 
generated C code, that this is not much about "Nim language iterator overhead 
aka 'cost'" and more just about "C compiler assembly code generation 
variability". That's an important distinction as slowness does not necessarily 
translate into "work for Nim core to improve anything". Backends vary and could 
change a bunch of stuff and be different in the next version of 
gcc/clang/whatever, for example. So, Nim kind of needs to target a "broad cross 
section" of what works well on the various backends, not overfit to one. (Not 
saying you said otherwise, just backing up importance of the distinction.)

You mentioned compiling my code on Ryzen, but not using gcc's PGO. PGO is 
mostly just a very strong version of your (c) hints idea where the compiler 
generates extra code to measure what it thinks will help it make better code 
generation decisions. (Yeah, a person or some GA picking optim flags/luck/etc. 
could all still maybe do even better sometimes as I already alluded to, but at 
ever increasing developer costs.)

To add a little more color quantitatively, I just did the same PGO experiment 
with the same version of gcc on a Ryzen 2950X clocked at 3.80 GHz and saw times 
go from 140 ms down to 85 ms..(best of 10 trials for both). That's not quite 
the same 2x speed-up as for the Intel case. Yeah, maybe Ryzen has better CPU 
predictors or predictor warm-ups or maybe it gets a bit luckier with "non-PGO" 
-O3 yadda-yadda defaults, etc. Whatever the causes, 1.65X is still a lot more 
than PGO boosts I usually see of 1.05x to 1.15x (such as with your algo, for 
example). { Of course, the boost is still not as good as a better algorithm. I 
never said so, nor meant to imply it. If a better algo is handy and speed 
matters then by all means use it! At least some of the time, a better algo will 
not be as easy to come by as PGO. }

Anyway, I realize that it's a tangent from your original interest in this 
problem, but if you're as serious about (c) as a general guideline as you seem, 
and if you haven't already, then you should look into PGO sometime. My bet is 
that for this problem and for most backend compilers that a PGO Nim would be 
faster than non-PGO "pure C". Even though such a goal is explicitly _not_ your 
objective, I think it's still interesting. Nim is one of the very few languages 
I've tried that in "typical" practice is often "about as efficient as C" (with 
all the usual disclaimers as to algos, etc.)

PGO can also be very little developer time extra cost. Once you get a little 
script set up, you just type `nim-pgo pythag './pythag > /dev/null'` for some 
`pythag.nim` file instead of `nim c -d:release pythag.nim`. If the program had 
less output (say the sum of the c in c^2=a^2+b^2) the shell IO re-direct 
wouldn't even be there. For this program, it took me almost no time to try PGO. 
And, of course, YMMV - that 2x on i7-6700k effect is among the biggest PGO 
boosts that I've ever seen. That seemed noteworthy. So, I noted to y'all. :-)

In terms of comparing Ryzen timings, running in a VM should not make much 
difference on this problem as it is a pure user-space CPU hot loop except for 
the time queries. That's really a best case scenario for VMs. Of course, if
your host OS is time-sharing the CPU between your VM instance and some other 
process then that could slow things down a lot - arbitrarily much, really if 
the host OS is overloaded. Might help to take the minimum time out of more 
trials. You should be able to get 100..200 ms of one core all to yourself at 
_some_ point. :-) My guess is that you could get that 210 ms down to under 100 ms (not that it really matters for anything practical in this exact case).

A better algo is better, but a lot of people - not just in this forum - are, 
rightly or wrongly, discussing this problem in the context of "programming 
language performance" which is the only reason I bothered to write both of 
these posts. The clang/ldc/rust/etc. numbers were all more closely clustered 
than 1.65x of the Ryzen PGO experiment in Timothee's very first link (217 
ms/153 ms = only 1.42x).

TL;DR - A) For this problem, compiler back-end code generation effects dominate Nim front end overheads (and possibly front-end overheads for other languages), for which the huge PGO boost is strong evidence.

Re: [help needed] nim version of: COMPARING PYTHAGOREAN TRIPLES IN C++, D, AND RUST

2019-01-02 Thread cblake
Those are all fine points. Asm can sometimes make a bigger difference than 
people conditioned to not question compiler output expect (and there are many 
such people). This is especially true with vector units. A few years back, I wrote
an AVX2 vectorized "minimum" function that ran 24x (twenty four times) faster 
than a C one. That's a much bigger time ratio than most "typical use" 
comparisons of _many_ programming language pairs (though obviously most 
programs do more things than compute minima). Auto-vectorization in gcc (at 
least) has gotten better since for that precise problem, though it can still 
miss many opportunities.

If you ever do need to write asm, those "intrinsics functions" like 
`_mm256_add_ps` (there are a hundred others) are often an easier entry 
point/way to integrate with your program than raw asm/`.s` files. You can use 
them from Nim, too. :-) See, e.g., 
[https://github.com/numforge/laser](https://github.com/numforge/laser) 
README/code/etc. for how to use SIMD intrinsics.


Re: [help needed] nim version of: COMPARING PYTHAGOREAN TRIPLES IN C++, D, AND RUST

2019-01-02 Thread cblake
Also, I guess a TL;DR part C) - I was never arguing against the tautology that 
a faster algorithm is faster. That is kind of a weird straw man position. No 
idea how quoting 600 microseconds for it left that impression, but it maybe bears correcting the record. (I didn't assess correctness, though perhaps I should have. To me that whole algo area is beside the point of the prevailing
discussion of an artificial benchmark. I never need Pythagorean triples, 
myself.)


Re: Advent of Code 2018 megathread

2018-12-02 Thread cblake
By the way, if you are trying to "right size" a set or table to streamline resizing costs, you can use `rightSize(number of members to support)`. This is easier than trying to guess the power of two at which such a population is supported, for both `initSet` and `initTable`. `rightSize` uses the same formula (its inverse, really) as the internal formula used to decide when doubling capacity is needed.
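
A minimal sketch of what that looks like (module-qualified calls just to be explicit about where each `rightSize` comes from):


import sets, tables

# pre-size for ~5000 members; no doubling/rehashing until past that
var s = initSet[int](sets.rightSize(5000))
var t = initTable[string, int](tables.rightSize(5000))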


Re: Should we get rid of style insensitivity?

2018-11-23 Thread cblake
Does anyone out there routinely use this feature of diverging from the style of an `import`, or do they, as I mentioned, just follow the lib's conventions? Part of @dom96's survey should perhaps ask if that aspect is just a "theoretical nicety".

I mean, someone cared enough about project-internal style consistency to write 
NEP1 and someone also cared enough to add that `nim c --nep1` system. I think 
if most/all example code/standard and popular libraries use one style that most 
people will copy it.

The reasons they would _not_ imitate it are more likely to be either A) 
technical like wrapping sensitive libs or B) cultural like some corporate style 
guide legislation (of which idents are just one dimension with spacing, 
comments, function size, etc. being many others) or C) strong personal 
preferences. In those cases, they seem more likely to just reject Nim outright, 
but that's just my hunch. C) could cut either way. So, call it 5 out of 6 cases 
of rejection for 1 case of ecstatic acceptance. A 6x larger community would 
make a world of difference...

So, if the primary motivation is just that Nim core personally does not
want to deal with other identifier styles, I doubt there is much value realized 
by style insensitivity. In a fully sensitive world they might have to deal with 
it 5-10% of the time. Yeah, more than zero.

It's hard to know for sure and definitely late in the game, of course. It's 
certainly unique. Maybe it's the killer feature! If it were _easy_ to predict 
popularity then the world would be a _very_ different place.


Re: Should we get rid of style insensitivity?

2018-11-23 Thread cblake
I think a vote would be fine. Ballot box stuffing is possible, but probably 
identifiable via sheer numbers since the Nim community is so small unless the 
vote gets HackerNews'd or Slashdotted. I also agree that the people who matter 
most would never participate in such a vote because they've already dismissed 
Nim for being "too creative" on the identifier front.

Wording matters a lot, too, as in any survey. I'd wager that there would be 
almost zero "yes" answers to a "would you stop using Nim if it became fully 
sensitive?" question. Also, a "do you know at least one developer who never 
tried Nim because one of the first things they heard about was its 'weird' 
ident rules?" would probably get an average of >50% yes's. Developers in very 
niche internet languages are not always the most social creatures or my 
estimate would be higher.

Personally, I like identifier _sensitivity_. I think things that look different 
to a person should be treated as different by the system, especially a system 
used by programmers who know one wrong character somewhere breaks a program. 
Ease of use studies about filesystems by non-programmers are at best weakly 
relevant - and anyway often there is a wrinkle of  "creation time casing" and a 
"reference time casing". Studies in text also show using ALL CAPS ALL THE TIME 
makes characters most easy to distinguish, but programmers tend to not want to 
do that.

In physics and math (where the underlying syntax of adjacency implies 
multiplication) single character variable names like F=ma and subscripting 
prevail. The handwritten/specialty script nature of those fields then pushes 
users to use _more fonts_ and italics and bold face and so on, and you often 
see textbook authors declare their style conventions. In general, more terse 
languages without a lot of syntactic noise (like Nim) benefit more from shorter 
identifiers which in turn benefit more from sensitivity. Speaking in voice 
about "cap A" is as easy as "A zero".

Giving more picky people flexibility to define their own conventions seems good 
to me, and to me that means sensitivity not the current rules. Less picky or 
more overall consistency-oriented people will just go with the flow and imitate 
Nim standard library conventions. Many will always go for camelCase since they 
know they'll have a lot of stdlib calls and the stdlib does that and so they'll 
want their own code to match that "larger world convention". C programmers 
imitating C stdlib and Java the Java APIs and C++ or POSIX programmers the 
POSIX apis and so on drive all this enormously. So, the core Nim community can 
probably get most of what they want just by controlling the style used by the 
stdlib.

Also, one has to pick which battles one fights! _Injudicious_ choice of 
characters will _always_ be a problem. In typical fonts for Latin script 
languages the upper- and lowercase versions of a letter "visually collide" for 
about 50% of the 26 letters. For another 50% there is a lot of visual 
ambiguity, like the similarity of O/0, 1/l { big-Oh vs zero and one vs 
lowercase-ell }. Colon and semi-colon, commas and periods are often really hard 
to see much difference in but make the world of difference. To my knowledge, no 
one suggests in any seriousness that Nim should forbid big-Oh or little-ell 
from identifiers or that writing Nim should require a certain font because 
otherwise it's hard to tell what means what. Sometimes you have to just rely on 
users of any language to not choose to write deliberately confusing things or 
use programming-hostile fonts.

However, it is perfectly fine ( _fantastic_ , even!) if the compiler has one or 
several warning systems (that can be easily turned off by a flag) for _ALL 
SORTS_ of confusing scenarios that "encourage" clarity/simplicity. People "in 
the thick of it" can have the warning be active. People more in their own 
closed world or with habits/inclinations/fonts that make certain mistakes less 
likely can turn it off. Problem solved. Besides naming convention issues, 
uneven spacing around operators is a good example. There's a warning for that 
in place now. Another example without a current warning in place would be 
spaces of leading indentation -- 1 space vs 2 vs 3 vs 4 vs 8 -- all currently
treated the same by Nim. A shift from indents of 1 to 8 is probably some kind 
of bug or at least a highly erratic formatting style. The compiler warning 
could even take some limit where a change of indent shift by <=N was ok and 
warn only for changes bigger than that.


Re: Should we get rid of style insensitivity?

2018-11-24 Thread cblake
Well said, @arnetheduck.

Honestly, for the true master, you could do some pluggable system that allowed 
macro/compile-time proc-driven identifier transformation with the current rules 
as a simple activation maybe as a pragma on `import`. Then people could 
override that current "compromise" batch of ident rules with something even 
more idiosyncratic (e.g., that handled ALL_CAPS or maybe CAPPREFIX_Foo 
identifiers differently, as needed/wanted). You might need to declare a 
convention inside defining modules, too, though, if you depart from the default.

Anyway, instead of quashing admittedly often time-waste-y arguments about (just 
some dimensions) of style with a totally unique set of ident rules (more than 
just "case insensitive" as @dom96 points out), give devs tools to manage 
disagreements.


Re: [help needed] nim version of: COMPARING PYTHAGOREAN TRIPLES IN C++, D, AND RUST

2019-01-02 Thread cblake
Every language has nested loops. My view is that the original article about 
C++20 ranges conceived this test/benchmark to be about the cost, if any, of 
abstractions, not exactly the performance of "nested loops however you write 
it" as suggested by Timothee's code or "the fastest algorithm for generating 
Pythagorean triples" (see 
[https://en.wikipedia.org/wiki/Pythagorean_triple](https://en.wikipedia.org/wiki/Pythagorean_triple)
 for an explanation of moerm's much better algorithm - you should definitely 
use something like moerm's algo _if you actually had a need for many triples_ 
for some reason).

A lot of what is creating variation in results here, and why I am posting this 
reply, is that this particular case is more sensitive than most to compiler 
optimizations and assumptions/code layout, etc. With the below program, I got a 
full 2.02x speed-up using gcc's profile-guided optimization (145.0 ms to 71.8 
ms; both best of 10 trials on a i7-6700K at 4.85 GHz on Linux, gcc-8.2.0 with 
Gentoo patches rev6). Usually, I only get 1.10x to 1.15x speed-ups. Thus, this 
case is more sensitive to optimizations. 


iterator toInfinity(n=1): int =
  var z = n
  while true:
yield z
inc(z)

iterator triples(n=1000): array[3, int] =
  var i = 0
  block all:
for z in toInfinity(1):
  for x in 1 .. z-1:
for y in x+1 .. z-1:
  if x*x + y*y == z*z:
yield [x, y, z]
inc(i)
if i == n:
  break all

import times

proc main() =
  let t0 = now()
  for triple in triples():
echo triple
  echo now() - t0

main()



When Timothee's current code is instrumented with `times.now()` and not 
compiled with PGO it is 1.5x faster than the above program (96 ms). When both 
programs are similarly compiled with PGO the performance basically matched to 
almost within measurement error (68.13 ms best of 10 trials, but noise/range 
over those trials was over 2 ms). (All I did was fiddle with the loop ranges.)

In other words, Nim's iterators seem to be a zero cost abstraction, as 
advertised, at least when compiled with PGO on gcc in this use case. This also 
makes sense if you look at the C code Nim generates -- just 3 nested loops 
anyway. Maybe one could pick nits that 71.8 vs 68.13 is not quite zero. Fair 
enough, but 5% is much less than the 2x variation sensitivity from PGO. Heck, 
given the giant 2x sensitivity, one of those "genetic algorithm to optimize the 
optimization flags" approaches could probably do even better than PGO and have 
the two Nim versions jockey back & forth for fastest. Looking at the generated 
C code, that seems likely to me. The point of this post is that in this case, 
compilation strategy/optimization options matter more than people might expect 
(though several commenters did sort of seem to know that already).

For the curious, moerm's better algorithm with PGO runs in just 600 _microseconds_ on the same machine, and 680 microseconds without PGO - a more typical 1.13x PGO factor, and a much smaller win than the one from just using the better algo in the first place.

Personally, I think that it might be a little nicer to be able to `return` to 
exit iterators rather than using the `block all: ... break all` construct. I 
could see bug-finding arguments (iterators != proc/func) for not allowing that, 
though. Anyway, I think Nim mostly wins (or at least ties) in the "elegance 
contest" here.

Also, for anyone trying to reproduce these effects/do 
profile-guided-optimization with Nim in general, the way I drive my PGO 
compilation is with a script that boils down to: 


nim c -c -d:r PROG.nim
$cc -fprofile-generate $nimcache/r/PROG/*.c -o PROG
./PROG
$cc -fprofile-use  $nimcache/r/PROG/*.c -o PROG



You may also need some `-O3 -fwhole-program -flto -march=native` in that `$cc` variable, and definitely need at least `-fno-strict-aliasing`.


Re: beat c in fefes "competition"

2019-03-25 Thread cblake
Just a quick follow-up: with gcc-8.3 on a Linux x86_64 Skylake CPU, profile-guided optimization did get that 176 ms time down to 150 ms (with -d:useNimHash it was 152 ms), a 1.17x boost, for 1.43x faster than the C in @enthus1ast 's `wp.c`.

Of course, with 291,000 unique keys the average external chain length in the C 
is going to be 291e3/65.5e3 =~ 4.4, which is, um, not great. So, maybe this is only
faster than a strawman C impl. YMMV. If you really want that particular C impl 
to be slow, just run it with 1e6 unique words. ;-) (But then you really better 
not use Nim's current `CountTable.sort`!)


Re: beat c in fefes "competition"

2019-03-25 Thread cblake
As @Stefan mentioned the string stuff can be slow. My `MemSlice` techniques 
might be more broadly interesting. The tokenizer below only works for "strict" 
one-character delimiting, not, e.g., repeated whitespace. Also, I agree 
completely with @arnetheduck that algorithm choices matter more than languages 
for Nim vs C comparisons.

That said, the below Nim is faster than the C version on my machine (215 ms for 
C, 176 ms for Nim). @enthus1ast can try that on his data and report back if he 
likes. There are several little `defined` switches to play with.

FWIW, the C version referenced is a rather lame one (fixed 64 KiEntry hash 
table with external chaining resolution). I didn't try 
profile-guided-optimization which could improve the Nim results some. Since the 
C version "cheats" a bit by pre-sizing its hash table, my version below takes 
any estimate on the number of unique words that will be found as input and 
pre-sizes the Nim table to fit that estimate.

Also, **beware** `CountTable.sort()`. It uses Shell's sort, not a fast algorithm. The counts are always integers and typically small, so a radix sort would be the best choice, but regular old merge sort from `algorithm` is a lot better than Shell's sort.
On my test input with 291,000 unique words, the version using `count.sort` took 
120 seconds instead of 176 milliseconds, almost 3 full orders of magnitude 
slower. Times seemed insensitive to hash functions (order 1% variations), 
though both the sorting and hash-sensitivity will be data set dependent, 
obviously. For very high entropy distributions you are going to want to avoid 
the current `CountTable.sort()`. Also, `CountTable` itself was a little slower 
than regular `Table` using the same sorting procedure. So, you might want to 
just avoid `CountTable` in general in performance-sensitive cases.


import memfiles, hashes, tables, os, algorithm, strutils

proc hash*(ms: MemSlice): Hash {.inline.} =
  when defined(useNimHash):
result = hashData(ms.data, ms.size)
  elif defined(useCB2):
proc c_memhash(h0: int, a: pointer, size: int): int {. nodecl,
importc: "memhash_cb2", header: getHomeDir() & 
"s/cb/hash_cb2.c".}
result = c_memhash(0, ms.data, ms.size)
  elif defined(useCB4):
proc c_memhash(h0: int, a: pointer, size: int): int {. nodecl,
importc: "memhash_cb4", header: getHomeDir() & 
"s/cb/hash_cb4.c".}
result = c_memhash(0, ms.data, ms.size)
  else:
result = 0
for i in 0 ..< ms.size:
  result = (result + (result shl 5)) xor int(cast[cstring](ms.data)[i])

proc split*(ms: MemSlice, fields: var seq[MemSlice], nA: int,
sep=' ', size = -1): int =
  proc c_memchr(s: cstring, c: char, n: csize): cstring {.
    importc: "memchr", header: "<string.h>".}
  proc `+!`(p: cstring, i: int): cstring {.inline.} = 
cast[cstring](cast[int](p) +% i)
  proc `-!`(p, q: cstring): int {.inline.} = cast[int](p) -% cast[int](q)
  var b   = cast[cstring](ms.data)
  var eob = cast[cstring](ms.data) +! (if size != -1: size else: ms.size)
  var e   = c_memchr(b, sep, eob -! b)
  while e != nil:
fields[result].data = cast[pointer](b)
fields[result].size = e -! b
result += 1
if result == nA - 2:  #max ix nA-1; Need one more slot for final field
  break
b = e +! 1
e = c_memchr(b, sep, eob -! b)
  fields[result].data = cast[pointer](b)
  fields[result].size = eob -! b
  result += 1

proc main(input: string, sizeGuess: int) =
  const mxCols = 1024 #file format-dependent max
  when defined(useCountTab):
var count = initCountTable[MemSlice](rightSize(sizeGuess))
  else:
var count = initTable[MemSlice, int](rightSize(sizeGuess))
  var fields = newSeq[MemSlice](mxCols)
  for line in memSlices(memfiles.open(input)):
fields.setLen(split(line, fields, mxCols, ' '))
for word in fields:
  when defined(useCountTab):
count.inc(word)
  else:
inc(count.mgetOrPut(word, 0))
  stderr.write count.len, " unique words\n"
  when defined(useCountTab) and defined(useCountSort):
count.sort()  #uses Shell's sort internally
for key, cnt in count:
  stdout.write cnt, " "
  discard stdout.writeBuffer(cast[cstring](key.data), key.size)
  stdout.write "\n"
  else:
var answer = newSeq[tuple[cnt: int, key: MemSlice]](count.len)
var i = 0
for key, cnt in count.pairs():
  answer[i].cnt = cnt
  answer[i].key = key
  inc(i)
proc myCmp(x, y: tuple[cnt: int, key: MemSlice]): int = - cmp(x.cnt, 
y.cnt)
answer.sort(myCmp)
for ans in answer:
  stdout.write ans.cnt, " "
  discard stdout.writeBuffer(cast[cstring](ans.key.data), ans.key.size)
  stdout.write "\n"

#Must call with 

Re: beat c in fefes "competition"

2019-03-25 Thread cblake
I wrote: "most lookups are failed lookups", but Oops! 291e3 unique/4.8e6 total 
= 6% ==> 94% of lookups are successful. So, the `memcmp` that happens only 
after hash codes match does happen almost all the time, and so my particular 
data set is more like 16B/(32B+9B) = 39% cached, not 50%. This doesn't really 
alter the rest of my analysis, though since 39% and 50% are pretty close.

A corollary of this is that (for that style of data set), Robin Hood hashing 
w/linear probing would not help. RH can help for failed lookups because probe 
sequences for missing items become about 50% as long (once the search hash code 
is > table hash code you know it's missing and where to put the new key). It 
costs a little work { O(1.15x) } to keep entries along probe sequences sorted 
by hash code. So, RH only pays off if you care about worst case times or have a 
workload dominated by failed lookups over long probe sequences. At 56% table 
load (291./512) the sequences aren't long unless the hash function is weak 
relative to the data (not the case for me unless all 4 functions were equally 
weak).

Anyway, RH would be about the only algorithmic speed-up I think could help in 
the poorly cached large table case, but only for very low hit rate data. It 
would not be crazy to add a `defined(useRobinHood)` in the Nim hash table impl 
(or even have a per-instance run-time flag) since whether RH helps or hurts is 
so dependent upon workload.


Re: beat c in fefes "competition"

2019-03-25 Thread cblake
Oh, and two more stats about my input data important for reasoning about my timings - 43 MB and 4.8e6 words total (coincidentally close to 1e-3 times my 4.8 GHz clock frequency).

So, average string length around 43e6/4.8e6 =~ 9 bytes (also a web server log), 
and about 150e-3/4.8e6 =~ 31 nsec =~ 150 CPU clock cycles per word.

A little more fiddling (commenting out various sections) shows that the 150 
cycles per word breaks down to:


  22% parsing (no hash table or output, just the MemSlice stuff)
  55% hash table counting
  23% output (sorting + binary->ascii on the numbers).



So, 150 cycles per input word might sound like a lot, but I personally doubt it 
could be improved much. 43MB does not fit in my 8MB L3 cache, but 2**ceil(lg(291e3)) = 512 kEntry at about 16B per entry =~ 8MB just barely does. Really, 41B per hash slot is more realistic (8B for the hash code, maybe
16B more for MemSlice, 8B for the count and about 9B for the string data). The 
string data is indirect, and most lookups are failed lookups, so 32B is 
probably the appropriate number.

So, that 55% of 150 = 82 ms / 4.8e6 is =~ 17 ns ~ 80 cycles. The hash table is 
probably only about 50% L3 resident (16B/32B) - not bad, but not great. Quoted 
latency on Skylake L3 is supposedly about 40 cycles while my DIMMs are probably 
about 100 cycles (20ns). So, the (AFAIK unavoidable 4.8e6 lookup latencies) is 
some kind of average of 40 and 100 cycles. 80 cycles is not implausible. .5*100 
+ .5*40 = 70 cycles after all (and 100 may be..optimistic for my DIMMs). The 
table is only 56% full (291/512). So, probe sequences should be short and fit 
in one or two L3 cache lines, the 2nd of which is often preloaded by the 
hardware in a linear search. Of course, the table accesses are in dynamic 
competition with the streaming read from the parser because CPU designers sadly 
don't let user code control cache policies.

So, _maybe_ some hand-rolled assembly could shave off 20 cycles in the hash 
phase and maybe 15..20 cycles in each of the other two phases for maybe 60 
cycles better than the Nim's 150. 150/90 is another 1.6x of potential 
improvement. It's probably a ton of work to get that. Also, I haven't done it. 
So 60 cycles is just some nominal upper bound guess. 150 cycles might also be 
just about as good as can be done.

If the table cannot be kept 50% resident in L3, but more like 10% L3 resident 
then most of that "assumed 60 cycles" improvement would be lost just to the 
DIMM vs L3 latency. Meanwhile, just packing those hash entries into 16B from 
32B (using e.g. a 32-bit hash code and counter and inline string data) could 
potentially make almost as much a difference in pure Nim by keeping it 100% L3 
resident and maybe getting down to 40 cycles from 80. There is, of course, a 
hazard of focusing too much on the dimensions of this one input data set here 
which is unlikely to be very informative.

The "usual" informative way to benchmark hash tables is to do (at least) a 
"fits in L1" case to check code generation effects without a lot of memory 
system interference and then also a "doesn't fit in Lany" case to check 
latency/probe sequence effects. A more extreme variant of the latter would be a 
cold-cache spinning rust Winchester disk that doesn't fit in RAM, but such are 
usually painfully slow. Everything in between those two extremes gets a funky 
average of inter-cache-level behaviors that is harder to reason about. I 
figured I'd parse a web log like @enthus1ast. Generated synthetic data would be 
more reproducible, although also **not** exercise the hash function well. Sorry 
for the verbosity, but it's tricky to benchmark this stuff in a way that 
actually generalizes, and I've seen a lot of not so informative benchmarks get 
a lot of readership.


Re: Running Selector on device file in linux. How to get file descriptor (int fd) from file object?

2019-04-05 Thread cblake
You probably want `proc getFileHandle*(f: File): FileHandle` from the system 
module (no need to import). That just calls through to the C fileno().


Re: Running Selector on device file in linux. How to get file descriptor (int fd) from file object?

2019-04-05 Thread cblake
There is also `proc open*(a1: cstring, a2: cint)` in the `posix` module: 


import posix
let fd = open("/dev/mydevice", O_RDONLY)



if you want an unbuffered raw file handle which sounds like it might be the 
case (and you don't need portability which is implied by your Linux device file 
use-case context).
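
A hedged sketch of wiring such a descriptor into `selectors` (again with `/dev/mydevice` as a stand-in path; untested here):


import posix, selectors

let fd = posix.open("/dev/mydevice", O_RDONLY)  # raw, unbuffered descriptor
var sel = newSelector[int]()
sel.registerHandle(fd.int, {Event.Read}, 0)     # the int data payload is unused here
for ready in sel.select(1000):                  # millisecond timeout
  echo "fd ", ready.fd, " is readable"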


Re: Running Selector on device file in linux. How to get file descriptor (int fd) from file object?

2019-04-05 Thread cblake
0.19.0 is very old in "Nim years" { like "dog years", but even more compressed 
;-) }. 0.19.2 works better, but personally I would recommend `git clone 
https://github.com/Araq/Nim` and learning how to build that/run right out of 
the build. It's not too hard to follow the instructions. I think `sh 
ci/build.sh` almost totally automates the process.


Re: Running Selector on device file in linux. How to get file descriptor (int fd) from file object?

2019-04-05 Thread cblake
Well, to make it work without the rest of your type environment, I changed just 
the first couple lines of your posted code to: 


import posix, selectors
proc readDevice(dev: string) =
  let devfd = posix.open(dev, posix.O_RDWR)



Then it compiled fine for me. I usually run a pretty fresh devel version of 
Nim. You may be on some earlier Nim. It's probably more useful long-term for 
you to learn to install a devel version of Nim than for me to test on 
back-dated versions.


Re: Running Selector on device file in linux. How to get file descriptor (int fd) from file object?

2019-04-06 Thread cblake
You're welcome, and I'm very glad you didn't give up so quickly!

FWIW, I was not sure it was a problem with that version of Nim -- only that 
when I changed the type environment enough to compile your code that it worked 
for me on a devel version. In the future, providing more whole code context 
(even if only a link to a github repo) may help you get more precise guidance.

Having confidence getting a fresh devel build going will pay off someday, 
though, esp. if your OS package manager has one as old as 0.19.0. Even just 
more people submitting pull requests for stdlib documentation/API tweaks can 
help, and PRs usually make more sense to do against devel HEAD.


Re: Creating C array

2019-03-29 Thread cblake
Nim defines a 


proc allocCStringArray*(a: openArray[string]): cstringArray



in the `system` module (no need to explicitly `import` it). There is also a 
corresponding: 


proc deallocCStringArray*(a: cstringArray)



You probably want to use those two calls.
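
For example (a sketch; `use` and `use.h` are hypothetical stand-ins for whatever C function actually consumes the `char **`):


proc use(a: cstringArray) {.importc: "use", header: "use.h".}

let args = @["alpha", "beta", "gamma"]
var csa = allocCStringArray(args)  # deep-copies into a NULL-terminated char**
use(csa)
deallocCStringArray(csa)           # free once the C side is done with it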


Re: Creating C array

2019-03-29 Thread cblake
You're welcome. Nim core devs are **_very_** willing to work with any and all 
comers on pull requests to improve documentation. Ignorance by otherwise 
reasonably patient and resourceful newcomers is **_not_** easy to simulate (in 
any project, really). In some ways it's a resource that decays with value as 
people learn their way around.

I agree that foreign function interface helper procs like those mentioned 
probably deserve a section somewhere, probably linked to from the area you 
mention. I mean, there is the `system.nim` module documentation, but it's kind 
of its own problem area.


Re: Noob question: proper way to read binary files byte by byte

2019-02-27 Thread cblake
A short reply like this may be inadequate to explain virtual memory mechanisms if you have never heard of them before. That said, if you have heard of them in the past and forgotten, this may help.

The `newMemMapFileStream` will call `memfiles.open` with default flags. Default flags typically just look up the size of the file and create an address range in your process that -- when page faulted by the virtual memory hardware -- will (transparently to your process, inside the OS kernel's "page fault handler") cause loading/population of 4k (or possibly larger) "pages" of memory, on-demand, with file contents for the corresponding spot. This is all fairly portable behavior.

So, in light of that, at the beginning of your loop, nothing will be "loaded". 
By the end of the loop, as much will be loaded as can fit in the RAM of your 
machine. The actual fact of the matter of "being loaded" depends upon the 
sometimes highly dynamic competition for physical memory among all the programs 
on a system. This is generally also true of any buffered IO mechanism when 
"swap files" or "page files" or partitions are enabled. Each page will be 
loaded for a little while or your program cannot make progress, but by the time 
you get to the end of your loop if the file is gigantic/larger than RAM or if 
some other process is demanding a lot of RAM then the beginning may no longer 
be "resident" in RAM.

Certain operating systems allow you to "tune" the on-demand loading behavior 
with "flag" arguments to the API that sets up this "auto-loading" mechanism. 
For example, Linux allows you to specify `MAP_POPULATE` which will, in effect, 
pre-load the whole file into RAM **before** your program loop/without your 
program making the CPU dereference any of those file data addresses. You may 
want to do this for example if the persistent backing store is a magnetic 
spinning disk, the file is small and you want to avoid "seeking" the disk head
around. Similarly, on Unix, there are also the `madvise/posix_madvise` 
interfaces which let a program advise the OS that memory accesses are likely
to be sequential (your case) or random, or even specify certain ranges as 
candidates for preloading. These little tweaks tend to be very non-portable, 
though, and the default behavior probably does what you want.

If it does not do what you want, `MemMapFileStream` does not (presently) 
support adding "flags" to the OS mapping calls. I did recently improve the 
`memfiles.open` interface to allow just that. You might like the non-stream API 
better anyway. You can `cast[ptr UncheckedArray[char]]` the `MemFile.pointer` 
and just use the file as an array of bytes if you like. You do have to be 
careful not to overrun the end of the file. And another recent addition I got 
in was to allow `toOpenArray(cast[ptr UncheckedArray[char]](ThePointer), 0, 
TheFileSize-1)` style passing of such arguments to Nim procs expecting 
`openArray[char]` parameters.
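
Putting those pieces together, a minimal sketch of that non-stream, byte-array style (assuming some seekable file `data.bin`):


import memfiles

var mf = memfiles.open("data.bin")             # pages fault in on demand
let p = cast[ptr UncheckedArray[char]](mf.mem)
var sum = 0
for i in 0 ..< mf.size:                        # careful: never index past size-1
  sum += int(p[i])
echo sum
mf.close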


Re: Can I access arrays faster than this?

2019-03-06 Thread cblake
Either automatic or manual vectorization can also allow twice as many `float32` 
numbers to be handled per vector instruction vs `float64` on x86 just like the 
ARM or GPU cases. You may need `-march=native` or `-mavx` compiler flags (or 
manual intrinsics/assembly) to activate that feature, though, instead of 
targeting some lowest common denominator x86 cpu and C compiler 
autovectorization can be finicky.

It is true that for many calculations things are memory bandwidth bound where 
you still get 2x improvement. However, many are not membw bound or may be in 
fast caches. For those the 2x wider vectors help. (Funny - caches used to be 
almost entirely about latency but have become about both latency & bandwidth in 
recent times).

Obviously, getting the wrong answer faster is not helpful, but float32 often is close to 2x faster, depending on how vectorizable what you're doing is, the compiler, and compiler flags (and/or manual assembly). Excess precision is also not helpful if the cost is not minimal.
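
As a concrete sketch (whether the reduction below actually vectorizes depends on the backend and flags; float reductions often additionally need `-ffast-math`-style flags due to FP reassociation rules):


# e.g.: nim c -d:danger --passC:"-march=native" sum32.nim
proc sum32(xs: openArray[float32]): float32 =
  for x in xs:
    result += x        # candidate loop for 8-wide AVX float32 adds

var a = newSeq[float32](1_000_000)
for i in 0 ..< a.len: a[i] = float32(i mod 7)
echo sum32(a)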


Re: Noob question: proper way to read binary files byte by byte

2019-02-24 Thread cblake
You can also use `memfiles`. There writing/reading is the same as accessing 
memory. Besides being possibly simpler presenting an "as if you already read 
the whole file into a buffer" view, it may also be much more efficient, 
especially for byte-at-a-time operation where other APIs might do a lot of 
behind the scenes work on a per-IO basis. Of course, to be usable as a 
`MemFile`, the data needs to be random access (e.g. on the disk as opposed to a 
network socket or pipe or some other unseekable input).


Re: Can I access arrays faster than this?

2019-03-05 Thread cblake
If you can't break your habit, you can always add a few lines near the top 
(before all the `release`-dependent switching) of your `nim.cfg`: 


@if r:   #Allow short alias -d:r to activate fast release mode
  define:release
@end



Perhaps somewhere you picked up a `$HOME/.config/nim.cfg` that does exactly this, and then lost it somehow moving between accounts/machines, or maybe in converting `nim.cfg` to `.nims`? There's surely also some similar NimScript/`.nims` variant.
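
(An untested sketch of such a `config.nims` variant, assuming a command-line `-d:r` is visible to NimScript via `defined`:)


when defined(r):              # allow short alias -d:r, as with the cfg @if above
  switch("define", "release")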


Re: Can I access arrays faster than this?

2019-03-07 Thread cblake
@mratsim is probably intending to refer to the `..` including `50` in Nim while the C `for` with a `<` excludes `50`, doing about 2% less work, but the terseness and style of his comment may leave the wrong impression.


for i in 0..50: echo i



indeed compiles down (in release mode) to simply: 


NI res = ((NI) 0);
while (1) {
    if (!(res <= ((NI) 50))) goto LA3;
    res += ((NI) 1);
}
LA3: ;



plus some stuff only related to my choice of `echo` for the body (which I 
removed for clarity). Any decent optimizing C compiler should treat those two 
ways to spell the loop (@mratsim's `for` and the above `while`) the same.

TL;DR the extra time is from the bounds of the iteration, not the language or 
iterator overhead.


Re: Can I access arrays faster than this?

2019-03-07 Thread cblake
With such a brief comment, it's hard to know which is why I said "probably 
intending". Only one person knows. ;) Maybe he did think iterators cost more.

You are right, I did misread his 1..50 as 0..50 { after looking at the first version of the ggibson Nim code, not the 2nd, where he confusingly switched to 1..50, no longer paralleling the C as well, but correcting his amount-of-work mismatch }.


Re: "Nim needs better documentation" - share your thoughts

2019-02-06 Thread cblake
@akavel \- you may (or may not) have noticed the "Group By Type" drop-down menu 
on the doc page for any given module (e.g. clicking on `tables` from 
`theindex`). That's sort of like what you're asking for.

The big `theindex.html` doesn't have that Group By feature, but perhaps should. 
The option there should maybe mean a multi-level sort -- first by the name of 
the defining module and then the way a given module would sort when in Group By 
Type mode. (Not advocating dropping the current order, just adding an option 
for that new one, too). You might want to create a github issue for that (if 
there isn't one already).

Also, I _do_ see type signatures in 
[https://nim-lang.org/docs/theindex.html](https://nim-lang.org/docs/theindex.html)
 , e.g. `system: add(x: var string; y: char)` . So, not sure what you mean 
there.


Re: "Nim needs better documentation" - share your thoughts

2019-02-06 Thread cblake
@akavel \- you're welcome.

@both \- I'd have to agree that in a language like Nim which relies heavily on 
type-based overloads that long lists of the same ident in that side-bar/ToC 
section are not helpful. Right now each is a hyperlink to the specific one. So,
collapsing would remove that choice, while including the signature would 
probably make them long and ugly. Not sure what the best choice to fix all that 
might be.

To the extent many use/advocate using just the big `theindex.html` to find 
things in the stdlib, adding GroupBy there might help. The core maintainers are 
very accepting of documentation upgrades. Someone passionate about the 
output/formatting should work on that or at least file a Github issue.


Re: Fuzzy matching on table keys

2019-02-12 Thread cblake
What mashigan wrote may well be best suited to your exact purpose. Another 
possibility at least worth a shout out is `collections/critbits` which 
implements an efficient `keysWithPrefix`. For more fuzzy still than prefix 
matching, you could also do some kind of efficient edit distance-based thing, 
but data-structures to (attempt to) optimize that are complicated. (There is 
`std/editdistance`, though).
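
A tiny `critbits` prefix-matching sketch:


import critbits

var t: CritBitTree[int]
t["apple"] = 1
t["apply"] = 2
t["banana"] = 3
for k in t.keysWithPrefix("appl"):
  echo k                    # apple, apply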


Re: Speeding up Python with Nim

2019-02-14 Thread cblake
If you want to _actually_ calculate Fibonacci numbers, not use them as a (poor) benchmark for funcall/recursion overheads or for other purposes, then you may also find this interesting: 
[https://sahandsaba.com/five-ways-to-calculate-fibonacci-numbers-with-python-code.html](https://sahandsaba.com/five-ways-to-calculate-fibonacci-numbers-with-python-code.html)

In terms of the closed form formula, it would be nice if someone wrapped `mpfr` 
or similar someday for a `bigfloats`.
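
(And if you just need the numbers fast without big floats/ints, the simple O(n) iteration is sketch enough:)


proc fib(n: int): int64 =
  # fib(0)=0, fib(1)=1; int64 overflows past fib(92)
  var a = 0'i64
  var b = 1'i64
  for _ in 1 .. n:
    let nxt = a + b
    a = b
    b = nxt
  result = a

echo fib(46)   # 1836311903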


Re: len [0, 1, 2] fails

2019-02-11 Thread cblake
Or `[1,2,3].len` which is curiously not among the various listed choices for 
whatever reason, and which is even more visually distinct from array indexing, 
IMO.


Re: What is the purpose of tables.allValues()?

2019-04-14 Thread cblake
A documentation PR seems reasonable. Go for it. If you need to undo multiple `add` calls to a `Table` then you should be able to loop until you don't find the key
anymore, but yeah, you don't know what order values for the key will arrive in 
that loop. There's no secondary order discipline like FIFO or anything. If you 
need that you should instead use a `Table[key, seq[valType]]` or `OrderedTable`.
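
A sketch of that "loop until gone" undo:


import tables

var t = initTable[string, int]()
t.add("k", 1)            # add can insert duplicate keys
t.add("k", 2)
while t.hasKey("k"):     # undo multiple adds; retrieval order unspecified
  t.del("k")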


Re: Macro to create a dictionary (table) like in python!

2019-07-10 Thread cblake
I would have to agree with `@`. +1 vote.


Re: Macro to create a dictionary (table) like in python!

2019-07-12 Thread cblake
These association lists, often abbreviated "alist", get a lot of play in lisp, 
but also OCaml, Haskell, etc.. 
[https://en.wikipedia.org/wiki/Association_list](https://en.wikipedia.org/wiki/Association_list)
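
In Nim terms, an alist is just a seq of pairs with linear-scan lookup (a sketch):


let alist = @[("one", 1), ("two", 2), ("three", 3)]  # seq of (key, value) pairs

proc lookup(al: seq[(string, int)], k: string): int =
  result = -1                  # not-found sentinel; O(n) scan, fine for small n
  for kv in al:
    if kv[0] == k: return kv[1]

echo lookup(alist, "two")      # 2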


Re: What is the purpose of tables.allValues()?

2019-04-14 Thread cblake
While it is a little unusual to see it, open addressed tables can support 
multiple keys (as a set or associatively as a table). I believe most other 
collision resolution schemes can, as well. So, a Nim `Table` is what some people call a "multi-map" (a "multi-set" on the key side). Whether you have unique keys depends upon the
history of an instance.

When I was sprucing up the impls, I mentioned to Araq that this might confuse some, but he said he uses this feature in the compiler proper. I think his was the right judgement call. Better to just inform about more-than-usual functionality when it comes up. To my memory, this is the first time it's come up in 4 years. ;-)


Re: What is the purpose of tables.allValues()?

2019-04-14 Thread cblake
The relevant discussion: 
[https://github.com/nim-lang/Nim/pull/2078#issuecomment-73744634](https://github.com/nim-lang/Nim/pull/2078#issuecomment-73744634)


Re: Nim vs. Python & Groovy (string splitting): Why is string splitting so slow in Nim?

2019-08-18 Thread cblake
On slightly closer inspection of that spec, it seems that backslash quoting 
only happens in the `##` comment sections ignored above. So, maybe they aren't 
buggy after all if that data is not important to the calculation in question.

A further point along the lines of "if we're going to provide `split`, maybe we 
should provide faster variants" is that the times when it is going to matter 
most will be for huge inputs where the columns may well have a more regular 
nature like numbers and specifically forbid more complex lexical structure. 
There, splitting might A) be correct, B) performance might matter a lot in human terms, 
and C) the programmer might mostly be non-sophisticated with regard to even 
terminology like "lexing" or "vectorized memchr". This all seems to be the case 
with this VCF thing, but I feel like it's come up quite a few times over the 
years. It may not **always** be "bugs running faster". :-)

Anyway, I don't think "to force people to learn new terminology/techniques" is 
a very welcoming answer. So, I tried to provide something more welcoming. Even 
if their parsing is sloppy & error prone, I think naive programmers facing 
consequent errors on their own data sets rather than complaining about Nim 
library performance is better optics for Nim.

All that said, I think we agree 100% that we probably need more information 
from @markebbert to help him any more with his actual problem. Maybe it is IO. 
Maybe he didn't even compile with `-d:danger`. If he's on Linux, I would 
suggest him decompressing first and trying my `mmap` versions. 90 GB/(100 MB/s) 
=~ 900 seconds =~ 15 minutes. Heck, some people even have 90GB of RAM. :-)


Re: Nim vs. Python & Groovy (string splitting): Why is string splitting so slow in Nim?

2019-08-18 Thread cblake
I agree with @siloamx. I would also point out that for many less 
compiler/parsing-sophisticated programmers, splitting is conceptually simple 
enough to make all the difference. Such programmers may never even have heard 
of "lexing". This is just to re-express @Araq's point about it being "naive" in 
maybe a slightly more accomodating way. Probably, they should learn, but maybe 
they want to focus on some simple problem. Formats aren't always in a 
strict/pure table format amenable to `parsecsv`.

I have some routines in `cligen/mslice` 
([https://github.com/c-blake/cligen](https://github.com/c-blake/cligen)) that 
may help for this. Some example usage is in `examples/cols.nim`. In the `mmap` 
mode of `cols`, I get ~50% the run-time of GNU `awk` for field splitting, 
though I have admittedly never tried with 15,000 columns.

One final, related point is that `gunzip` can be very slow at decompressing and 
is single-threaded. If the average column width is ~20 Bytes, 300k*15k*20 =~ 90 
GB. Working with the uncompressed file directly or using something that can 
decompress in parallel like `Zstd` 
([https://en.wikipedia.org/wiki/Zstandard](https://en.wikipedia.org/wiki/Zstandard))
 may help **a lot** , especially if you have 4-12 idle cores as many do these 
days. On Linux, you should be able to just 


let f = popen("pzstd -cdqp8 < myfile.zs".cstring, "r".cstring)



(You may have to, just once, `zcat file.gz | pzstd -f -p8 -19 > file.zs` and, 
of course, this will not help if you have a pile of `.gz` files only to be 
processed once.)


Re: Nim vs. Python & Groovy (string splitting): Why is string splitting so slow in Nim?

2019-08-18 Thread cblake
In fact, that VCF format does use `\` escapes 
([https://samtools.github.io/hts-specs/VCFv4.3.pdf](https://samtools.github.io/hts-specs/VCFv4.3.pdf)).
 So, basically all the above code examples are indeed wrong, and @Araq's advice 
is the right general advice (I did say they should probably learn).

That said, I also still think there are many situations both quick & dirty and 
otherwise where simple splitting makes sense and is not wrong. Why else provide 
it in `strutils`?


Re: Compile C file together with Nim one

2019-08-15 Thread cblake
Yet another approach is to have the `nim c`-generated C code just `#include 
"foo.c"` via `{.importc: "func", header: "foo.c".}`.


Re: What do you think about the programming language NIM?

2019-08-15 Thread cblake
I do not agree that lazy linear lists are the "main example" of recursion. They 
may be the _simplest_ example, and thus allow several easy workarounds. I 
mentioned "trees" right in my comment. Most of the (admittedly subjective) 
elegance of tree algorithms comes from the way recursive code structure mirrors 
recursive data structure. `lib/pure/collections/critbits.nim:iterator leaves` 
shows some of the ugliness required to work around that, for example.

I agree it may not be a high priority or even in the top 10. I do not think it 
would be considered a "no real need" extra/extraneous fluff/"feature bloat". I 
know you didn't say that, exactly, but I felt your post risked leaving that 
impression.


Re: What do you think about the programming language NIM?

2019-08-15 Thread cblake
There are always workarounds since CPUs are little state machines. That should 
be news to no one. Recursion working makes code look much nicer, IMO. For what 
it's worth, the example considered among the more compelling by the Python guys 
back in the Python 2.3 days when they added generators was in-order binary 
search tree traversal: 


iterator inorder[T](t: Tree[T]): T =
  if t != nil:
for x in inorder(t.left): yield x
yield t.label
for x in inorder(t.right): yield x



which parallels the usual recursive call structure quite well. I think it would 
be great if the above Nim actually worked. Equivalent state machines are often 
ugly, trickier to get right, and a pain point (and yeah, usually faster). But 
"expressive" has always been a main goal of Nim. :-)


  1   2   3   >