Matt, I have specific answers to most of your questions, but I don't
know whether others on wikimedia-l would be interested in them, and
I'm not sure about the specifics of a couple terms you used relative
to what I remember of the testing harness, so I'll reply in more
detail off-list with some questions about the terms over the weekend.
For now, I think the banner text message has always been the most
important part of any appeal, and that if you were to take all 300 of
the existing volunteer submissions (and accept more -- e.g. "How much
you donate may help determine how much we pay our programmers" would
be incredibly effective, and I hope you will measure it) and if you were
or other changes over a one week period with about 3000 impressions
each at random times of day and days of week for each, you would have
plenty to work with. That's about a million impressions, or a 0.3%
impressions test, which I believe will give you well over 95%
confidence in the results.
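As a rough sketch of that arithmetic (not anything from the actual
testing harness), the Python below multiplies out the impression
count and, using the 30-banners-at-a-time and four-bucket figures
from Matt's message quoted below, estimates how many rounds 300
banners would need. The function names and the 1% response rate are
assumptions made up purely for illustration; the margin-of-error
helper uses the ordinary normal approximation for a proportion.

```python
import math

# Figures taken from this thread.
BANNERS = 300                   # existing volunteer submissions
IMPRESSIONS_PER_BANNER = 3000   # proposed impressions per banner
BANNERS_PER_BUCKET = 30         # weighting limit mentioned by Matt
BUCKETS = 4                     # independent buckets mentioned by Matt

# ASSUMPTION: hypothetical response rate, for illustration only.
ASSUMED_RATE = 0.01


def total_impressions(banners, per_banner):
    """Total impressions needed to show every banner per_banner times."""
    return banners * per_banner


def rounds_needed(banners, per_bucket, buckets):
    """Number of test rounds if every bucket runs a full, distinct set."""
    return math.ceil(banners / (per_bucket * buckets))


def margin_of_error(rate, n, z=1.96):
    """Half-width of a normal-approximation 95% interval for a proportion."""
    return z * math.sqrt(rate * (1 - rate) / n)


if __name__ == "__main__":
    print("total impressions:", total_impressions(BANNERS, IMPRESSIONS_PER_BANNER))
    print("rounds needed:", rounds_needed(BANNERS, BANNERS_PER_BUCKET, BUCKETS))
    print("margin of error per banner:",
          margin_of_error(ASSUMED_RATE, IMPRESSIONS_PER_BANNER))
```

The per-banner margin depends heavily on the true response rate, so
the assumed 1% should be replaced with a measured figure before
drawing any conclusion from it.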
That would not account for banner fatigue, which may be significant
all the way from timezone-to-timezone up to year-to-year, but I have
no idea how to account for that other than to do a multivariate
test shortly before beginning fundraising in earnest.
On Fri, Dec 28, 2012 at 3:46 PM, Matthew Walker mwal...@wikimedia.org wrote:
On Fri, Dec 28, 2012 at 2:11 PM, James Salsman jsals...@gmail.com wrote:
I mean as in the tests done May 16, September 20, and October 9
without adjusting the best performing pull-down delivery combined
banner/landing page from the beginning of this month
I obviously cannot speak for what Zack will end up doing but let's talk shop
for a moment on how this would be implemented.
The tests you indicated play banner, landing page impressions, and donation
amount against each other. It appears that everyone saw a collection of
random banners (i.e., the test was not bucketed). Are these the same variables
you want to test?
Regardless of the answer to the above, how do you propose we normalize our
tests across time of day, day of week, and day of month factors? We've seen
evidence that these all play a role. I don't know how many banner variations
we actually have to test, but it's likely we won't be able to test them all
at the same time (in fact, with the current weighting setup we can only test
30 banners at a time). Do we just take each group as it stands, find the
best performers in the group, and then test the winners against each other?
An additional consideration is that we have four buckets to play with; buckets
are independent, so we could potentially test 120 banners at a time to four
different groups. Presumably if we did this we would want a couple of
control banners in each to normalize with?
Another thing to consider is how long we have to run these tests to reach
statistical significance. At least a day, I'm guessing. Are we
going to account for banner fatigue at all, i.e., show banners during only the
first 10 visits like we just did with this most recent campaign?