I asked Emery Berger for more information about his "Performance Matters" talk, in which he said they got a 25% improvement in SQLite performance. Here is the reply I received.
I know there has been a lot of talk about what can and cannot be done with the C calling interface because of compatibility issues and the myriad wrappers in various forms. I'm having a hard time letting go of a possible 25% performance improvement. I don't have the slightest idea how to run a benchmark (but I could learn).

I wonder whether the current set of benchmarks used by SQLite developers actually measures throughput using wall-clock numbers. It might be a good idea to put a wrapper around all the benchmarks to capture how long they took to run (wall-clock) and to record things like the number and type of CPU cores, average CPU busy time, and other relevant figures. If the benchmarks were run on many different machines (all over the world?), that would provide an excellent view of which changes to SQLite made a difference in performance.

Doug

From: Curtsinger, Charlie <curtsin...@grinnell.edu>
Sent: Thursday, January 02, 2020 10:55 AM
To: dougf....@comcast.net
Cc: Emery D Berger <em...@cs.umass.edu>
Subject: Re: Questions about your "Performance Matters" talk re SQLite

Hello Doug,

I was able to track down the sqlite benchmark I ran for the paper, and I've checked it into the github repository at https://github.com/plasma-umass/coz/tree/master/benchmarks/sqlite. This benchmark creates 64 threads that operate on independent tables in the sqlite database, performing operations that should be almost entirely independent. The benchmark exposes contention inside of sqlite, since running it with a larger number of hardware threads actually hurts performance: I see a performance improvement of nearly 5x when I run it on a two-core linux VM versus a 64-thread Xeon machine, since there are fewer opportunities for the threads to interfere with each other.

You can also find the modified version of sqlite with the same benchmark at https://github.com/plasma-umass/coz/tree/master/benchmarks/sqlite-modified.
There are just a few changes from indirect to direct calls in the sqlite3.c file. I reran the experiment on the same machine we used for the original Coz paper and saw a performance improvement of around 20% with the modified version of sqlite. That's slightly less than what we originally found, but I didn't do many runs (just five) and there's quite a bit of variability. The compiler has also been upgraded on this machine, so there could be some effect there as well. On a much newer 64-thread Xeon machine I see a difference of just 5%, still in favor of the modified version of sqlite. That's not terribly surprising, since Intel has baked a lot of extra pointer-chasing and branch-prediction smarts into its processors in the years since we set up the 64-core AMD machine we originally used for the Coz benchmarks.

As far as measuring performance goes, I'd encourage you *not* to use CPU cycles as a proxy for runtime. Dynamic frequency scaling can throw off these measurements, especially if the clock frequency is dropped in response to the program's behavior. Putting many threads to sleep might allow the OS to drop the CPU frequency, thereby reducing the number of CPU cycles, but that doesn't mean the program will actually run in less wall-clock time. Some CPUs have a hardware event that counts "clock cycles" at a constant rate even under frequency scaling; these are really just high-precision timers and would be perfectly fine for measuring runtime. I'm thinking of the "ref-cycles" event from perf here.

Hope this helps,
- Charlie

_______________________________________________
sqlite-users mailing list
sqlite-users@mailinglists.sqlite.org
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users