Re: Perf: Table.del(key) is taking 85% of my code's time

2020-06-18 Thread Yardanico
By "main config" I mean 
[https://github.com/nim-lang/Nim/blob/devel/config/nim.cfg](https://github.com/nim-lang/Nim/blob/devel/config/nim.cfg)


Re: Perf: Table.del(key) is taking 85% of my code's time

2020-06-18 Thread Yardanico
That's not a nimble bug - see 
[https://github.com/nim-lang/Nim/issues/14272](https://github.com/nim-lang/Nim/issues/14272)


Re: Perf: Table.del(key) is taking 85% of my code's time

2020-06-18 Thread snej
> Are you putting -d:release in the config file by any chance? If you do - that 
> wouldn't work, you must specify it in the call to the compiler/nimble build

That's it! I put it in `nim.cfg` like all the other compile options. If instead 
I run `nimble build -d:release`, I get an optimized build.

The help text for `nimble` doesn't describe the options to `build`; it just 
says `[opts, ...] [bin]`. And since I already knew I could put `nim c` option 
flags in the config file, it seemed obvious to put `-d:release` there.

Help me understand the build process better -- why is putting it on the 
command line different from putting it in the config file?


Re: Perf: Table.del(key) is taking 85% of my code's time

2020-06-18 Thread Stefan_Salewski
> key distribution matters a lot

That is true, but it is obviously not the problem for Mr snej. He gets good 
performance when he disables all checks one by one, but not with -d:danger. So 
his configuration is somewhat broken.

And note: -d:danger alone is enough; there is no need for -d:danger 
-d:release. We should not confuse people too much. The same goes for 
-d:release --opt:speed: -d:release already includes --opt:speed.
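
If in doubt about what a given flag combination actually enables, one quick check (a sketch, not from this thread; the file name is made up) is to compile a tiny probe and print the compile-time option state. Compiling it once with the flag on the command line and once with the flag only in nim.cfg should also make the difference discussed above visible.

# probe.nim (hypothetical name); compile e.g. with
#   nim c -r -d:release probe.nim    vs.    nim c -r -d:danger probe.nim
echo "release defined:  ", defined(release)
echo "danger defined:   ", defined(danger)
echo "opt:speed:        ", compileOption("opt", "speed")
echo "boundChecks:      ", compileOption("boundChecks")
echo "overflowChecks:   ", compileOption("overflowChecks")
echo "assertions:       ", compileOption("assertions")
echo "stackTrace:       ", compileOption("stackTrace")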


Re: Perf: Table.del(key) is taking 85% of my code's time

2020-06-18 Thread coffeepot
Just wanted to add to what @treeform mentioned: key distribution matters a lot 
when deleting from tables. Check out the timings of sequential keys vs. keys 
with a more random distribution.


import tables, times, random, sugar

template bench(code: untyped) =
  let t1 = cpuTime()
  code
  let t2 = cpuTime()
  echo "Line ", instantiationInfo().line, " took ", t2 - t1

var t: Table[int, int]
const tests = 100_000

bench:
  for i in 0 ..< tests:
    t.add i, 1

bench:
  for i in 0 ..< tests:
    t.del i

let randomKeys = collect(newSeqOfCap(tests)):
  for i in 0 ..< tests:
    rand int.high

bench:
  for i in 0 ..< tests:
    t[randomKeys[i]] = 1

bench:
  for i in 0 ..< tests:
    t.del randomKeys[i]



I get the following output with -d:danger -d:release:

  * Line 12 (sequential insert) took 0.003
  * Line 16 (sequential delete) took 7.209
  * Line 24 (random insert) took 0.00178
  * Line 28 (random delete) took 0.003114



In this example, deleting sequential keys is about 2400 times slower than 
deleting keys with a more random distribution.
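
If the key pattern itself can't be changed, the usual lever is the key type's hash. Below is a hypothetical sketch (not from the thread) of where a custom hash plugs in, using a distinct wrapper around int so the stdlib integer hash is untouched elsewhere; whether it actually helps depends on the real key pattern, so measure it with the benchmark above.

import tables, hashes

type SpreadInt = distinct int

# Table needs both `==` and `hash` for the key type.
proc `==`(a, b: SpreadInt): bool {.borrow.}

proc hash(x: SpreadInt): Hash =
  # Mix the raw value through the combinators from std's hashes module
  # instead of using it as-is.
  result = !$(Hash(0) !& int(x))

var t: Table[SpreadInt, int]
for i in 0 ..< 100:
  t[SpreadInt(i)] = i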


Re: Perf: Table.del(key) is taking 85% of my code's time

2020-06-18 Thread Yardanico
Are you putting -d:release in the config file by any chance? If you do - that 
wouldn't work, you must specify it in the call to the compiler/nimble build


Re: Perf: Table.del(key) is taking 85% of my code's time

2020-06-17 Thread timothee
@snej

  * What is `nim -v`? (I'm assuming it's 1.2.2, but please double check.)
  * Please post a minimum reproducible example, otherwise it's hard to do 
anything about your problem.




Re: Perf: Table.del(key) is taking 85% of my code's time

2020-06-17 Thread Stefan_Salewski
> use binary size as a rough indicator of optimization

That does not make much sense. For gcc, -O3 generally gives large but fast 
executables: large because gcc unrolls loops, applies SIMD instructions, and 
much more.

Have you tested -d:danger? That option should really give you the fastest code 
and turn off all checks. If it doesn't, then there is something wrong. If 
-d:danger is fine, then you may try without danger and turn off the checks one 
by one to find out which check makes it slow, or create a minimal Table example 
that we can investigate.
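
For reference, a sketch of the individual switches to flip off one build at a time on top of plain -d:release (names as listed in the compiler's help; as far as I know this is roughly the set that -d:danger disables):

--objChecks:off
--fieldChecks:off
--rangeChecks:off
--boundChecks:off
--overflowChecks:off
--assertions:off
--stackTrace:off
--lineTrace:off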


Re: Perf: Table.del(key) is taking 85% of my code's time

2020-06-17 Thread snej
> But specifying --opt:size at the same time may not be that good: do you use 
> -O3 and -Os for gcc at the same time? Most people do not.

They're mutually exclusive, so the compiler's going to pick one if both are 
given. It looks as though the -Os wins since I do see a large reduction in code 
size.

I'm of the belief that -Os is the best setting in most cases. (That was 
definitely the rule when I worked at Apple in the '00s.) Smaller code is more 
cache-friendly, launches faster, and contributes less to memory issues. If 
necessary you can use Clang or GCC function attributes to force individual 
functions to optimize for speed.


Re: Perf: Table.del(key) is taking 85% of my code's time

2020-06-17 Thread snej
I looked at the Compiler User Guide, and it also implies that `-d:release` is 
sufficient to turn on compiler optimizations. As do your comments above. 
Weird...

Here's my environment:

  * MacBook Pro, macOS 10.15.5
  * Xcode 11.5; `clang --version` reports "Apple clang version 11.0.3 
(clang-1103.0.32.62)"
  * I used to have Nim 1.2 installed from HomeBrew; this morning I uninstalled 
that, installed choosenim, and installed 1.2.2.



I'm building a Nimble package that I created a few days ago, which imports 
asynchttpserver and the 'news' WebSocket library. I'm using `nimble build` to 
compile. My `nim.cfg` file contains: 


--path:src
--outdir:bin
--hints:off



I'll use binary size as a rough indicator of optimization, for now.

  * Default (debug): binary size is 752K
  * With `-d:release`: 701K
  * With `-d:release --opt:speed`: 608K
  * With `-d:release --opt:size`: 528K
  * With `-d:release --opt:size --stackTrace:off --lineTrace:off 
--assertions:off --checks:off`: 273K
  * With `-d:danger`: 752K
  * With `-d:danger --opt:speed`: 647K
  * With `-d:danger --opt:size`: 570K
  * With `-d:danger --opt:size --stackTrace:off --lineTrace:off 
--assertions:off --checks:off`: 307K



Conclusions: It definitely appears `-d:release` is not turning on 
optimizations. And it's weird that `-d:danger` results in a larger binary; it 
should be the other way around since it disables generating runtime checks, 
right? 


Re: Perf: Table.del(key) is taking 85% of my code's time

2020-06-17 Thread juancarlospaco
`--opt:size` can make stuff slower; it trades speed for a smaller file size.


Re: Perf: Table.del(key) is taking 85% of my code's time

2020-06-17 Thread treeform
I also experienced really bad Table performance, which was related to really 
bad hash generation. Print out `hash(key)` and see if your hashes are usually 
bad.
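
A minimal sketch of that check (the key values here are made up; snej's keys are `uint64`): print the hash of a few representative keys and eyeball how they spread. With the default integer hash in this Nim version the output is, as far as I know, just the key value itself.

import hashes

# Made-up sample keys; substitute a handful of keys your code really uses.
for key in [1'u64, 2, 3, 42, 1000, 1001]:
  echo key, " -> ", hash(key)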


Re: Perf: Table.del(key) is taking 85% of my code's time

2020-06-17 Thread SolitudeSF
-d:danger -d:lto


Re: Perf: Table.del(key) is taking 85% of my code's time

2020-06-17 Thread Stefan_Salewski
> and everything got about 20 times faster.

We have never seen such behaviour before, so it would be great if you could 
investigate it in more detail.

Maybe a strange bug? Did you use gcc 10 or clang, or maybe the Microsoft or 
Intel compiler as the backend?

-d:release is fine; -d:danger can be a bit faster as it turns off all runtime 
checks. Both use -O3 for gcc by default. But specifying --opt:size at the same 
time may not be that good: do you use -O3 and -Os for gcc at the same time? 
Most people do not.

Try --passC:-flto for LTO; that gives you inlining across all procs.

I generally enable -march=native as well. And you may try PGO; there was a 
post about it recently on the forum.
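
For concreteness, a sketch of how those backend flags might be combined in one invocation (the program name is a placeholder; with gcc/clang the LTO flag is usually wanted at link time as well, hence the --passL):

nim c -d:danger --passC:"-flto -march=native" --passL:"-flto" myprogram.nim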


Re: Perf: Table.del(key) is taking 85% of my code's time

2020-06-17 Thread snej
Sorry, false alarm! I was misled by the `nim` command's help text:


  --opt:none|speed|size     optimize not at all or for speed|size
                            Note: use -d:release for a release build!



**I took this to mean that `-d:release` was *sufficient* to get a fully 
optimized release build. Nope!** Once I started to look at the x86 disassembly, 
it was painfully non-optimal.

I changed the flags to `-d:release --opt:size --stackTrace:off --lineTrace:off 
--assertions:off --checks:off`, and everything got about 20 times faster... Now 
the code is spending almost all its time in `kevent` waiting for I/O, as it 
should be.

**Question:** Is this the optimal set of flags for a release build? Is there an 
easy way to toggle between debug and release builds, without having to edit a 
bunch of lines in my `nim.cfg` file?
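
Not an answer from this thread, but one possible way to avoid editing nim.cfg by hand for every build, assuming the config file's conditional @if sections see a -d:release passed on the command line (which I have not verified): keep the release-only switches behind @if release and toggle them by adding or omitting -d:release in the nimble build call. A sketch:

--path:src
--outdir:bin
--hints:off
@if release:
  --opt:size
  --stackTrace:off
  --lineTrace:off
  --assertions:off
  --checks:off
@end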


Perf: Table.del(key) is taking 85% of my code's time

2020-06-17 Thread snej
I've got some networking code running now (implementation of a protocol based 
on WebSockets) and I have a little client and server that send each other 
messages as fast as they can. Now I get to profile it! (With `-d:release` of 
course.)

Disappointingly, **about 85% of the total time is being spent in a single call 
to `del(table, key)`**, where the Table's keys are `uint64` and the values are 
a `ref object` type. Here's a snippet of what the macOS `sample` tool is 
showing me:


1614 pendingResponse__9bNzW9abL3s854sblP5selTw_2  (in client) + 434  
[0x10d342b72]
  1096 del__F9aUzT5FfSHbDaq0vvIxNng  (in client) + 1935,1769,...  
[0x10d2e292f,0x10d2e2889,...]
  518 del__F9aUzT5FfSHbDaq0vvIxNng  (in client) + 1419  [0x10d2e272b]
277 isEmpty__5Qpmu5QqwDY0esVXn6wf7w_2tables  (in client) + 48,23,...  
[0x10d2e2cc0,0x10d2e2ca7,...]
180 isEmpty__5Qpmu5QqwDY0esVXn6wf7w_2tables  (in client) + 57  
[0x10d2e2cc9]
  180 nimFrame  (in client) + 1,15,...  [0x10d2e04d1,0x10d2e04df,...]
61 isEmpty__5Qpmu5QqwDY0esVXn6wf7w_2tables  (in client) + 102  
[0x10d2e2cf6]
  61 popFrame  (in client) + 1,14,...  [0x10d2e0651,0x10d2e065e,...]



_(Indentation shows the call nesting. The numbers on the left are the number of 
times the given stack frame was observed; there were 1886 samples total over 2 
seconds.)_

I haven't dug into the Table source code yet, but it looks like there may be 
something really suboptimal there.

The name-mangling is weird and very random-looking: are those base64 digests 
being appended to function names? **Is there a tool to unmangle names** so I 
can find out exactly what a specific function is?

PS: From my past experience with Go, I was expecting GC to be the culprit; 
there are a few stack frames that look like GC, but they're way down low, with 
only a handful of samples. Nice! Is Nim using ARC by default now?