On Tue, 7 Mar 2017 01:03:23 +0100
Victor Stinner <victor.stin...@gmail.com>
wrote:
> Another example on the same computer. It's interesting:
> * MAD and std dev are half of those in result 1
> * the benchmark is less unstable
> * median is very close to result 1
> * mean changed much more than median
> 
> Benchmark result 1:
> 
> Median +- MAD: 276 ns +- 10 ns
> Mean +- std dev: 371 ns +- 196 ns
> 
> Benchmark result 2:
> 
> Median +- MAD: 278 ns +- 5 ns
> Mean +- std dev: 303 ns +- 103 ns
> 
> If the goal is to get reproducible results, Median +- MAD seems better.

Getting reproducible results is only half of the goal. Getting
meaningful (i.e. informative) results is the other half.

The mean approximates the expected performance over multiple runs (note
"expected" is a rigorously defined term in statistics here: see
https://en.wikipedia.org/wiki/Expected_value).  The median doesn't tell
you anything about the expected value (*).  So the mean is more
informative for the task at hand.
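
To make this concrete, here is a small sketch in plain Python (the
80%/20% timing distribution below is made up for illustration, not
taken from any real benchmark): if what you ultimately care about is
how long many runs will take, the sample mean answers that question
and the median does not.

import random
import statistics

random.seed(0)

# Hypothetical benchmark: 80% of runs take ~1 s, 20% hit a slow path
# and take ~10 s.  Expected value of one run: 0.8*1 + 0.2*10 = 2.8 s.
def one_run():
    return 10.0 if random.random() < 0.2 else 1.0

sample = [one_run() for _ in range(100)]
mean = statistics.mean(sample)      # close to 2.8
median = statistics.median(sample)  # almost certainly 1.0

# Estimating the cost of 1000 further runs: the mean-based estimate
# tracks the actual total, the median-based one is wildly off.
print(1000 * mean)
print(1000 * median)
print(sum(one_run() for _ in range(1000)))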

Additionally, while the mean and standard deviation are generally quite
well understood, the properties of the median absolute deviation are
much less widely known.

So my vote goes to mean +/- std dev.


(*) Quick example: let's say your runtimes in seconds are
[1, 1, 1, 1, 1, 1, 10, 10, 10, 10].
Evidently, there are four outliers (out of 10 measurements) that indicate
a huge performance regression occurring at random points.  However, the
median here is 1 and the median absolute deviation (the median of
absolute deviations from the median, i.e. the median of [0, 0, 0, 0, 0,
0, 9, 9, 9, 9]) is 0: the information about possible performance
regressions is entirely lost, and the numbers (median +/- MAD) make it
look like the benchmark reliably takes 1 s to run.
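
The same numbers, computed with the standard statistics module (just a
sketch of the arithmetic, not the benchmark tool's actual reporting
code):

import statistics

runtimes = [1, 1, 1, 1, 1, 1, 10, 10, 10, 10]

mean = statistics.mean(runtimes)    # 4.6
stdev = statistics.stdev(runtimes)  # ~4.6

median = statistics.median(runtimes)  # 1.0
# MAD = median of the absolute deviations from the median.
mad = statistics.median([abs(x - median) for x in runtimes])  # 0.0

print("Mean +- std dev: %.1f +- %.1f" % (mean, stdev))   # 4.6 +- 4.6
print("Median +- MAD:   %.1f +- %.1f" % (median, mad))   # 1.0 +- 0.0

Mean +- std dev immediately flags that something irregular is going on,
while median +- MAD reports a misleadingly clean 1 +- 0.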

Regards

Antoine.

