2016-07-05 10:08 GMT+02:00 Antoine Pitrou <solip...@pitrou.net>:
>> When the system noise is high, the skewness is much larger. In this
>> case, median looks "more correct".
>
> It "looks" more correct?
My main worry is to get reproducible, "stable" benchmark results. I
started to work on perf because most results of the CPython benchmark
suite just looked like pure noise. It became very hard for me to
decide whether it was my fault, or whether my change really made
Python slower or faster. I'm not talking about specific benchmarks
which are obviously much faster or much slower, but about all the
small changes between -5% and +5%.

It looks like the median helps to reduce the effect of outliers.

> Let's say your Python implementation has a flaw: it is almost always
> fast, but every 10 runs, it becomes 3x slower. Taking the mean will
> reflect the occasional slowness. Taking the median will completely
> hide it.

I'm not sure that the median would completely hide such behaviour.
Moreover, I modified the benchmark suite to always display the
standard deviation just after the median. The standard deviation
should help to detect a large variation.

In practice, it almost never happens that all samples have the same
value. There is always a statistical distribution, usually a Gaussian
curve. The question is what the best way is to summarize a curve with
two numbers. I add a constraint: I also want to reduce the system
noise.

> Then of course, since you have several processes and several runs per
> process, you could try something more convoluted, such as
> mean-of-medians or mean-of-mins or...

I don't know these functions. I prefer to consider each sample
individually and to only apply a function to the whole series of all
samples.

> However, if you're concerned by system noise, there may be other ways
> to avoid it. For example, measure both CPU time and wall time, and if
> CPU time < 0.9 * wall time (for example), ignore the number and take
> another measurement.
>
> (this assumes all benchmarks are CPU-bound - which they should be here
> - and single-threaded - which they *probably* are, except in a
> hypothetical parallelizing Python implementation ;-)))

CPU isolation helps a lot to reduce the system noise, but it requires
"complex" system tuning. I don't expect that users will use it,
especially users of timeit.

I don't think that CPU time is generic enough to put it in the perf
module. I would prefer not to restrict myself to CPU-bound benchmarks.
But the perf module already warns users when it detects that the
benchmark looks too unstable. See the example at the end of:
http://perf.readthedocs.io/en/latest/perf.html#runs-samples-warmups-outter-and-inner-loops

Or try: "python3 -m perf.timeit --loops=10 pass".

Currently, the checks are the shortest raw sample (>= 1 ms) and
standard deviation / median (< 10%). Someone suggested that I compare
the minimum and the maximum to the median. You can already see that
using perf stats:
------------------
$ python3 -m perf show --stats perf/tests/telco.json
Number of samples: 250 (50 runs x 5 samples; 1 warmup)
Standard deviation / median: 1%
Shortest raw sample: 264 ms (10 loops)

Minimum: 26.4 ms (-1.8%)
Median +- std dev: 26.9 ms +- 0.2 ms
Maximum: 27.3 ms (+1.7%)
------------------
=> the -1.8% and +1.7% numbers compare the minimum and the maximum to
the median

When you get outliers, these numbers go up to 20% or much more for the
maximum.

Victor
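
PS: A quick illustration of the scenario Antoine describes, using only
the plain statistics module (not perf itself): a benchmark that is
fast 9 runs out of 10 and 3x slower on the tenth. The timings are made
up for the example.
------------------
import statistics

# hypothetical timings in ms: nine fast runs, one 3x slower
samples = [10.0] * 9 + [30.0]

print(statistics.mean(samples))    # 12.0 -> reflects the slow run
print(statistics.median(samples))  # 10.0 -> hides it completely
print(statistics.stdev(samples))   # ~6.3 -> but the stdev exposes it
------------------
This is why displaying the standard deviation right next to the median
matters: the median alone would hide the occasional slowness.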
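
PPS: A sketch of the "mean of medians" aggregation versus applying a
single function to the flat series of all samples, assuming samples
are grouped per process run (the numbers are made up):
------------------
import statistics

# hypothetical samples (ms), grouped by process run
runs = [
    [26.8, 26.9, 27.0],
    [26.9, 27.1, 26.7],
    [27.3, 26.4, 26.9],
]

# convoluted: median inside each run, then mean across runs
mean_of_medians = statistics.mean(statistics.median(run) for run in runs)

# what I prefer: one function over the whole series of all samples
flat = [sample for run in runs for sample in run]
overall_median = statistics.median(flat)
------------------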
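
PPPS: For reference, here is roughly what Antoine's CPU time / wall
time filter could look like. It is only a sketch: the 0.9 threshold
comes from his email, and bench_once() stands for any single benchmark
iteration.
------------------
import time

def measure(bench_once, threshold=0.9):
    """Take one wall-clock sample, retrying while the process was
    scheduled out for too long (CPU time < threshold * wall time)."""
    while True:
        wall_start = time.perf_counter()
        cpu_start = time.process_time()
        bench_once()
        wall = time.perf_counter() - wall_start
        cpu = time.process_time() - cpu_start
        if cpu >= threshold * wall:
            return wall
        # too much system noise: discard the sample and measure again
------------------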