So I did a bit more research. Check out how LNT does this:

https://github.com/llvm-mirror/lnt 
<https://github.com/llvm-mirror/lnt/search?utf8=%E2%9C%93&q=mann-whitney&type=>

I talked with Chris Matthews (+CC) about how LNT uses Mann-Whitney. In the
following, let n be the number of samples taken. From what he told me, this is
what LNT does:

1. If n < 5, then some sort of computation around confidence intervals is
used.
2. If n > 5, then a Mann-Whitney U test is done.
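
For reference, the Mann-Whitney U test in item 2 can be sketched in a few lines of Python. This is a minimal illustration, not LNT's code: it gives tied values their average rank and uses the large-sample normal approximation for the two-sided p-value, with no tie correction to the variance and no exact small-sample table (LNT may treat small samples differently). The names `mann_whitney_u`, `differs`, and `alpha` are hypothetical.

```python
import math


def mann_whitney_u(a, b):
    """Return (U, two-sided p-value) for samples a and b.

    Sketch only: ties get average ranks, and the p-value uses the
    normal approximation without a tie correction to the variance.
    """
    n1, n2 = len(a), len(b)
    if n1 == 0 or n2 == 0:
        raise ValueError("both samples must be non-empty")
    # Pool the samples, remembering which group each value came from.
    pooled = sorted([(x, 0) for x in a] + [(y, 1) for y in b])
    # Assign 1-based ranks, giving tied values the average of their ranks.
    ranks = [0.0] * len(pooled)
    i = 0
    while i < len(pooled):
        j = i
        while j + 1 < len(pooled) and pooled[j + 1][0] == pooled[i][0]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[k] = avg_rank
        i = j + 1
    # Rank sum of sample a, converted to its U statistic.
    r1 = sum(r for r, (_, group) in zip(ranks, pooled) if group == 0)
    u1 = r1 - n1 * (n1 + 1) / 2
    u = min(u1, n1 * n2 - u1)
    # Normal approximation to the null distribution of U.
    mean = n1 * n2 / 2
    sd = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (u - mean) / sd
    p = math.erfc(abs(z) / math.sqrt(2))  # two-sided p-value
    return u, p


def differs(a, b, alpha=0.05):
    """True if the two samples look significantly different at level alpha."""
    _, p = mann_whitney_u(a, b)
    return p < alpha
```

Because the test only looks at ranks, a single wild outlier in either sample moves U by at most a few units, which is the main reason to prefer it over a mean-based comparison for noisy benchmark timings.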

I am not 100% sure what 1 is, but I think it has to do with some sort of
quartile measurement, i.e. find the median of the new data and make sure it is
within +/- one median absolute deviation of the old median (basically
mean +/- std-dev, but more robust to outliers). I believe the code is in LNT,
so we can check for sure.
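
If item 1 really is a median/MAD comparison, it might look roughly like the following. This is a hypothetical sketch of the guess above, not LNT's logic: it assumes the new median is compared against the baseline median plus or minus `k` times the baseline's median absolute deviation. The function names and the multiplier `k` are made up for illustration.

```python
from statistics import median


def median_absolute_deviation(samples):
    """Median of the absolute deviations from the sample median."""
    m = median(samples)
    return median([abs(x - m) for x in samples])


def within_mad_band(baseline, new, k=1.0):
    """True if median(new) lies within median(baseline) +/- k * MAD(baseline).

    Hypothetical small-n check: a False result would flag a possible
    regression (or improvement) worth re-measuring.
    """
    center = median(baseline)
    spread = k * median_absolute_deviation(baseline)
    return abs(median(new) - center) <= spread
```

One caveat of this design: if all baseline samples are identical, the MAD is zero and any change at all gets flagged.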

Thus, in my mind, the natural experiments here in terms of Mann-Whitney U are:

1. This suggests that for small n we do the kind of simple comparison we do
today, and if a regression is "identified", we grab more before/after samples
and run Mann-Whitney U.
2. Try out different values of n. I am not 100% sure whether 5 is the right
threshold or whether it should depend on the test.
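
The two experiments above could be wired together roughly as follows. This is a hypothetical sketch: `detect_change`, `collect_samples`, `quick_check`, and `full_test` are placeholders for whatever the benchmark scripts would provide (for example, today's simple comparison, a Mann-Whitney U test, and a function that reruns one benchmark on one revision), and the `n_small` / `n_large` defaults are arbitrary knobs for experiment 2 to vary.

```python
from typing import Callable, List, Sequence

Samples = List[float]


def detect_change(
    benchmark: str,
    collect_samples: Callable[[str, str, int], Samples],
    quick_check: Callable[[Sequence[float], Sequence[float]], bool],
    full_test: Callable[[Sequence[float], Sequence[float]], bool],
    n_small: int = 3,
    n_large: int = 20,
) -> bool:
    """Two-stage change detection (hypothetical sketch).

    quick_check and full_test both return True when they think the
    before/after samples differ.

    Stage 1: take n_small samples per side and run the cheap check.
    Stage 2: only if stage 1 flags a change, take n_large samples per
    side and let the statistical test make the final call.
    """
    before = collect_samples(benchmark, "master", n_small)
    after = collect_samples(benchmark, "branch", n_small)
    if not quick_check(before, after):
        return False  # nothing suspicious; skip the expensive rerun
    before = collect_samples(benchmark, "master", n_large)
    after = collect_samples(benchmark, "branch", n_large)
    return full_test(before, after)
```

Passing the checks in as parameters keeps the driver fixed while trying different thresholds and tests, and the expensive large-sample rerun only happens for benchmarks the cheap check already considers suspicious.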

Chris, did I get it right?

Michael

> On Jun 13, 2017, at 7:11 AM, Pavol Vaskovic via swift-dev 
> <swift-dev@swift.org> wrote:
> 
> On Tue, Jun 13, 2017 at 8:51 AM, Andrew Trick <atr...@apple.com 
> <mailto:atr...@apple.com>> wrote:
> I’m confused, though, because I thought we agreed that all samples need to run
> with exactly the same number of iterations. So, there would be one short run 
> to find the desired `num_iters` for each benchmark, then each subsequent 
> invocation of the benchmark harness would be handed `num_iters` as input.
> 
> That was agreed on in the discussion about measuring memory consumption (PR 
> 8793) <https://github.com/apple/swift/pull/8793#issuecomment-297834790>. 
> MAX_RSS was variable between runs, due to dynamic `num_iters` adjustment 
> inside `DriverUtils` to fit the ~1s budget.
> 
> This could work for keeping num_iters the same during comparison between
> [master] and [branch], given that we log the num_iters from [master] and use
> them to drive the [branch] MAX_RSS measurement. I don't know how to extend
> this to make memory consumption comparable between different measurement
> runs (over time...), though.
> 
> --Pavol
> _______________________________________________
> swift-dev mailing list
> swift-dev@swift.org
> https://lists.swift.org/mailman/listinfo/swift-dev
