2016-07-05 10:08 GMT+02:00 Antoine Pitrou <solip...@pitrou.net>:
>> When the system noise is high, the skewness is much larger. In this
>> case, median looks "more correct".
>
> It "looks" more correct?

My main worry is to get reproducible, "stable" benchmark results. I
started to work on perf because most results of the CPython benchmark
suite just looked like pure noise. It became very hard for me to
decide whether it was my fault, or whether my change made Python
slower or faster. I'm not talking about specific benchmarks which are
obviously much faster or much slower, but about all the small changes
between -5% and +5%.

It looks like the median helps to reduce the effect of outliers.
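For example, here is a minimal sketch with made-up timings, using
Python's statistics module:
------------------
import statistics

# Hypothetical timings in seconds: one sample out of ten hit system noise.
samples = [1.00, 1.02, 0.99, 1.01, 1.00, 1.03, 0.98, 1.01, 1.00, 3.00]

print(statistics.mean(samples))    # ~1.204: pulled up by the single outlier
print(statistics.median(samples))  # ~1.005: barely affected by the outlier
------------------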


> Let's say your Python implementation has a flaw: it is almost always
> fast, but every 10 runs, it becomes 3x slower.  Taking the mean will
> reflect the occasional slowness.  Taking the median will completely
> hide it.

I'm not sure that the median would completely hide such behaviour.
Moreover, I modified the benchmark suite to always display the
standard deviation just after the median. The standard deviation
should help to detect such a large variation.
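Here is a sketch of your scenario with made-up numbers: the median
hides the periodic slowdown, but the standard deviation makes it very
visible:
------------------
import statistics

# Hypothetical timings in seconds: every 10th run is 3x slower.
samples = [0.1] * 9 + [0.3]

median = statistics.median(samples)  # 0.1: the slowdown is invisible
stdev = statistics.stdev(samples)    # ~0.063, i.e. 63% of the median
print("median=%.3f stdev=%.3f (%.0f%% of median)"
      % (median, stdev, 100 * stdev / median))
------------------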

In practice, the samples almost never all have the same value. There
is always a statistical distribution, usually a Gaussian curve. The
question is how best to summarize such a curve with two numbers. I
add a constraint: I also want to reduce the system noise.


> Then of course, since you have several processes and several runs per
> process, you could try something more convoluted, such as
> mean-of-medians or mean-of-mins or...

I don't know these functions. I prefer to treat each sample
individually and to only apply a function to the whole series of all
samples.
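To illustrate the difference with made-up numbers (the run layout and
values are hypothetical):
------------------
import statistics

# Hypothetical results: 3 runs (processes) of 4 samples each.
runs = [
    [1.00, 1.01, 0.99, 1.02],
    [1.05, 1.04, 1.06, 1.05],  # this process was a bit slower
    [1.00, 1.00, 1.01, 0.99],
]

# Per-run aggregation, as suggested above:
mean_of_medians = statistics.mean(statistics.median(run) for run in runs)

# Flat aggregation over the whole series, which is what I prefer:
flat_median = statistics.median(s for run in runs for s in run)

print(mean_of_medians, flat_median)  # ~1.018 vs 1.01
------------------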



> However, if you're concerned by system noise, there may be other ways
> to avoid it. For example, measure both CPU time and wall time, and if
> CPU time < 0.9 * wall time (for example), ignore the number and take
> another measurement.
>
> (this assumes all benchmarks are CPU-bound - which they should be here
> - and single-threaded - which they *probably* are, except in a
> hypothetical parallelizing Python implementation ;-)))

CPU isolation helps a lot to reduce the system noise, but it requires
"complex" system tuning. I don't expect that users will use it,
especially users of timeit.

I don't think that CPU time is generic enough to put it in the perf
module. I would prefer not to restrict myself to CPU-bound benchmarks.
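For illustration, the check you describe could look like this sketch
(the helper name is made up; it is not part of perf):
------------------
import time

def timed_sample(func, max_retries=5):
    """Measure func() and return its wall-clock duration, retrying when
    the process did not get at least 90% of the CPU during the
    measurement. Only meaningful for single-threaded, CPU-bound code."""
    for _ in range(max_retries):
        wall0 = time.perf_counter()
        cpu0 = time.process_time()
        func()
        wall = time.perf_counter() - wall0
        cpu = time.process_time() - cpu0
        if cpu >= 0.9 * wall:
            return wall
    raise RuntimeError("system too noisy: CPU time < 90% of wall time")
------------------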

But the perf module already warns users when it detects that the
benchmark looks too unstable. See the example at the end of:
http://perf.readthedocs.io/en/latest/perf.html#runs-samples-warmups-outter-and-inner-loops

Or try: "python3 -m perf.timeit --loops=10 pass".

Currently, I'm checking the shortest raw sample (it must be >= 1 ms)
and the standard deviation / median ratio (it must be < 10%).
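These two checks look roughly like this (an approximation of what the
perf module does, not its exact code):
------------------
import statistics

def check_stability(samples, shortest_raw_sample):
    # Warn when the shortest raw sample took less than 1 ms: the
    # timings may be dominated by the timer resolution.
    if shortest_raw_sample < 1e-3:
        print("WARNING: shortest raw sample < 1 ms")
    # Warn when the standard deviation exceeds 10% of the median:
    # the benchmark looks unstable.
    median = statistics.median(samples)
    stdev = statistics.stdev(samples)
    if stdev > 0.10 * median:
        print("WARNING: std dev is %.0f%% of the median"
              % (100 * stdev / median))
------------------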

Someone suggested that I compare the minimum and the maximum to the
median. You can already see that using perf stats:
------------------
$ python3 -m perf show --stats perf/tests/telco.json
Number of samples: 250 (50 runs x 5 samples; 1 warmup)
Standard deviation / median: 1%
Shortest raw sample: 264 ms (10 loops)

Minimum: 26.4 ms (-1.8%)
Median +- std dev: 26.9 ms +- 0.2 ms
Maximum: 27.3 ms (+1.7%)

Median +- std dev: 26.9 ms +- 0.2 ms
------------------
=> the minimum is -1.8% and the maximum +1.7%, relative to the median

When you get outliers, the maximum can be 20% above the median, or much more.
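Computing these percentages by hand is straightforward (a sketch, not
the perf code):
------------------
import statistics

def min_max_vs_median(samples):
    # Return (min, max) as percentages relative to the median, like
    # the "perf show --stats" output above.
    median = statistics.median(samples)
    return (100.0 * (min(samples) - median) / median,
            100.0 * (max(samples) - median) / median)
------------------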

Victor
_______________________________________________
Speed mailing list
Speed@python.org
https://mail.python.org/mailman/listinfo/speed
