Re: [Wikimedia-search] On frequency of A/B tests and peeking at the data early

Kevin Smith Tue, 01 Sep 2015 11:39:46 -0700

We will, of course, continue to sanity-check the data within a day or so
after a new test starts to run, to make sure that we are logging the
information that we will need to perform analyses, that our bucket sizes
appear to be working as designed, etc.
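
A bucket-size sanity check of this kind could be sketched as below. This is only an illustration, not the team's actual tooling: the function name, the 50/50 default split, and the example counts are all assumptions. It compares the observed bucket counts with the designed split using a chi-square goodness-of-fit test with one degree of freedom. Because it looks only at how many users landed in each bucket, not at the outcome being tested, it is not the kind of peeking discussed further down the thread.

```python
import math


def bucket_split_p_value(observed_a, observed_b, expected_share_a=0.5):
    """P-value for 'the buckets follow the designed split'.

    Chi-square goodness-of-fit test with 1 degree of freedom.
    expected_share_a is the designed fraction of traffic in bucket A
    (hypothetical default: an even 50/50 split).
    """
    total = observed_a + observed_b
    expected_a = total * expected_share_a
    expected_b = total - expected_a
    chi2 = ((observed_a - expected_a) ** 2 / expected_a
            + (observed_b - expected_b) ** 2 / expected_b)
    # Survival function of a chi-square variable with 1 degree of freedom.
    return math.erfc(math.sqrt(chi2 / 2))


if __name__ == "__main__":
    # Made-up counts, for illustration only.
    print(bucket_split_p_value(5030, 4970))  # small imbalance
    print(bucket_split_p_value(5000, 5400))  # large imbalance
```

A very small p-value here would suggest the bucketing is not working as designed and is worth investigating before the test runs much longer.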




Kevin Smith
Agile Coach, Wikimedia Foundation


On Tue, Sep 1, 2015 at 11:28 AM, Trey Jones <[email protected]> wrote:

> Well, peeking is okay as long as you don't act on it:
>
>> “Peeking” at the data is OK as long as you can restrain yourself from
>> stopping an experiment before it has run its course. I know this goes
>> against something in human nature, so perhaps the best advice is: no
>> peeking!
>
>
> It does take up time, though, and based only on data from the morning of
> the deployment it may not give a representative preview. It's still fun to
> peek, though. ;)
>
> Trey Jones
> Software Engineer, Discovery
> Wikimedia Foundation
>
> On Mon, Aug 31, 2015 at 2:05 PM, Mikhail Popov <[email protected]>
> wrote:
>
>> Hi all,
>>
>> Last week we discussed our approach to A/B testing and we've decided to
>> have a week (at least) between tests.
>>
>> A two-week-minimum cadence will give the analysis team enough time to
>> thoroughly think about the experimental design of each test, as well as
>> give the engineers enough time to implement it. Which is great because some
>> of the changes we are planning to test are not trivial and we don't want to
>> rush a test out and realize halfway through that we should have been
>> tracking something we're not.
>>
>> We are also going to move away from doing initial analyses (analysis of
>> the data from the morning of a launch) for practical and scientific
>> reasons. Practical in the sense that we've been putting time and effort
>> into getting preliminary results that are not representative of final
>> results whatsoever while putting other work on the backburner. Scientific
>> in the sense that peeking at the data mid-experiment is bad science:
>>
>> *Repeated significance testing always increases the rate of false
>> positives, that is, you’ll think many insignificant results are significant
>> (but not the other way around). The problem will be present if you ever
>> find yourself “peeking” at the data and stopping an experiment that seems
>> to be giving a significant result. The more you peek, the more your
>> significance levels will be off. For example, if you peek at an ongoing
>> experiment ten times, then what you think is 1% significance is actually
>> just 5% significance.* – Evan Miller, How Not To Run An A/B Test
>> <http://www.evanmiller.org/how-not-to-run-an-ab-test.html>
>>
>>
>> In science, it's a problem called multiple comparisons. The more tests
>> you perform, the more likely you are to see something where there is
>> nothing. Going forward, we are going to wait until we have collected all
>> the data before analyzing it.
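
The effect of peeking described above can be illustrated with a small simulation. This is a minimal sketch under assumed settings, not an analysis from this thread: the sample size, conversion rate, number of looks, and function names are all hypothetical. It runs many A/A tests (both arms have the same true rate, so every "significant" result is a false positive) and compares testing once at the end with testing at several evenly spaced checkpoints and counting the test as significant if any look crosses the threshold.

```python
import math
import random


def two_sided_p(successes_a, successes_b, n):
    """Two-proportion z-test p-value for two arms of equal size n."""
    pooled = (successes_a + successes_b) / (2 * n)
    se = math.sqrt(2 * pooled * (1 - pooled) / n)
    if se == 0:
        return 1.0
    z = (successes_a - successes_b) / n / se
    return math.erfc(abs(z) / math.sqrt(2))


def false_positive_rate(looks, n_per_arm=1000, rate=0.2, alpha=0.05,
                        n_sims=1000, seed=0):
    """Share of A/A tests declared significant at any of `looks` checkpoints."""
    rng = random.Random(seed)
    checkpoints = {n_per_arm * (i + 1) // looks for i in range(looks)}
    hits = 0
    for _ in range(n_sims):
        a = b = 0
        significant = False
        for n in range(1, n_per_arm + 1):
            # Both arms draw from the same true rate: there is no real effect.
            a += rng.random() < rate
            b += rng.random() < rate
            if n in checkpoints and two_sided_p(a, b, n) < alpha:
                significant = True  # a peeker would have stopped here
        hits += significant
    return hits / n_sims


if __name__ == "__main__":
    print("one look at the end:", false_positive_rate(looks=1))
    print("ten looks:          ", false_positive_rate(looks=10))
```

With a single look, the false positive rate should land near the nominal alpha; with ten looks it should come out noticeably higher, even though nothing differs between the two arms.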
>>
>> Cheers,
>> Mikhail, Junior Swifty
>> Discovery // The Swifties
>>
>> _______________________________________________
>> Wikimedia-search mailing list
>> [email protected]
>> https://lists.wikimedia.org/mailman/listinfo/wikimedia-search
>>
>>
>

