So I did a bit more research. Check out how LNT does this: https://github.com/llvm-mirror/lnt <https://github.com/llvm-mirror/lnt/search?utf8=%E2%9C%93&q=mann-whitney&type=>
I talked with Chris Matthews (+CC) about how LNT uses Mann-Whitney. In the
following, let n be the number of samples taken. From what he told me, this is
what LNT does:

1. If n is < 5, then some sort of computation around confidence intervals is
   used.
2. If the number of samples is > 5, then Mann-Whitney U is done.

I am not 100% sure what 1 is, but I think it has to do with some sort of
quartile measurements, i.e. find the median of the new data and make sure it is
within +- the median absolute deviation (basically mean +- std-dev, but more
robust to outliers). I believe the code is in LNT, so we can find it for sure.

Thus, in my mind, the natural experiment here in terms of Mann-Whitney U is:

1. This seems to suggest that for small numbers of samples we do the sort of
   simple comparison that we do today, and if a regression is "identified", we
   grab more samples of before/after and run Mann-Whitney U.
2. Try out different values of n. I am not 100% sure whether 5 is the right
   answer or whether it should depend on the test.

Chris, did I get it right?

Michael

> On Jun 13, 2017, at 7:11 AM, Pavol Vaskovic via swift-dev
> <swift-dev@swift.org> wrote:
>
> On Tue, Jun 13, 2017 at 8:51 AM, Andrew Trick <atr...@apple.com
> <mailto:atr...@apple.com>> wrote:
> I’m confused though because I thought we agreed that all samples need to run
> with exactly the same number of iterations. So, there would be one short run
> to find the desired `num_iters` for each benchmark, then each subsequent
> invocation of the benchmark harness would be handed `num_iters` as input.
>
> That was agreed on in the discussion about measuring memory consumption (PR
> 8793) <https://github.com/apple/swift/pull/8793#issuecomment-297834790>.
> MAX_RSS was variable between runs, due to dynamic `num_iters` adjustment
> inside `DriverUtils` to fit the ~1s budget.
>
> This could work for keeping the num_iters the same during comparison between
> [master] and [branch], given we logged the num_iters from [master] and used
> them to drive [branch] MAX_RSS memory. I don't know how to extend this to
> make memory consumption comparable between different measurement runs (over
> time...), though.
>
> --Pavol
> _______________________________________________
> swift-dev mailing list
> swift-dev@swift.org
> https://lists.swift.org/mailman/listinfo/swift-dev
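
P.S. To make the two-branch procedure from my reply above concrete, here is a rough Python sketch. It is not LNT's code. The names (`looks_changed`, `min_samples`, `alpha`) are made up for illustration, the small-sample branch is only my guess at a median +- MAD check, and sending n == 5 to the Mann-Whitney branch is my own assumption since I was not told what happens there. The U test uses the plain normal approximation with no tie or continuity correction, so treat the p-values as approximate.

```python
import math
from statistics import median


def median_abs_deviation(samples):
    """Median of the absolute deviations from the sample median."""
    m = median(samples)
    return median([abs(x - m) for x in samples])


def mann_whitney_u(a, b):
    """Return (U, approximate two-sided p-value) for two non-empty samples.

    Tied values receive the average of the ranks they span. The p-value
    comes from the normal approximation without tie or continuity
    correction, which is an assumption made to keep this sketch short.
    """
    n1, n2 = len(a), len(b)
    pooled = sorted([(x, 0) for x in a] + [(x, 1) for x in b])
    rank_sum_a = 0.0
    i = 0
    while i < len(pooled):
        j = i
        while j < len(pooled) and pooled[j][0] == pooled[i][0]:
            j += 1
        avg_rank = (i + 1 + j) / 2.0  # average of the 1-based ranks i+1..j
        count_a = sum(1 for k in range(i, j) if pooled[k][1] == 0)
        rank_sum_a += avg_rank * count_a
        i = j
    u_a = rank_sum_a - n1 * (n1 + 1) / 2.0
    u = min(u_a, n1 * n2 - u_a)
    mean_u = n1 * n2 / 2.0
    sd_u = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12.0)
    z = (u - mean_u) / sd_u
    p = math.erfc(abs(z) / math.sqrt(2.0))  # equals 2 * (1 - Phi(|z|))
    return u, p


def looks_changed(old, new, min_samples=5, alpha=0.05):
    """Hypothetical decision rule; both thresholds are assumptions.

    Few samples: flag a change when the new median falls outside
    old median +- MAD(old). Enough samples: use Mann-Whitney U.
    """
    if min(len(old), len(new)) < min_samples:
        tolerance = median_abs_deviation(old)
        return abs(median(new) - median(old)) > tolerance
    _, p = mann_whitney_u(old, new)
    return p < alpha
```

With something like this, the experiment would be to call `looks_changed` on the initial few samples, and only when it reports a change gather more before/after samples and call it again so that the Mann-Whitney branch is taken. Trying different values of `min_samples` is then a one-argument change.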