> On Thu, Nov 21, 2013 at 9:57 AM, Dario Taraborelli <[email protected]> wrote:
>
>> I tried uploading a cohort from a recent A/B test (1,780 unique
>> user_id’s). The async validation took about 5 minutes to complete.
>>
>> If I create a temporary table with the data in my CSV and run a join with
>> the user table against a slave, the query to validate that these users
>> exist takes about 400ms if I use user_id (primary key in enwiki.user) and
>> about 3s using user_name (unique in enwiki.user).
>>
>> What’s the reason why it takes so long to validate a cohort in the
>> application?
>
> My understanding is that this is due to Labs being slow compared to stat1?

I don't think Labs is that much slower, though; we're talking orders of
magnitude here. I think the reason is that it currently validates one user
at a time. Since each record has to be checked against both a potential
user_id match and a potential user_name match, this takes forever.

Two ways to make it much faster:

* batch every X users and do a "where user_id in (...) or user_name in (...)"
  query instead of checking each one
* create temporary tables, just like Dario did

The problem is that cohorts can have users from multiple projects. That makes
both approaches harder, but it should still be doable.

The reason I haven't done this yet is that when we scheduled 818 we broke out
the performance issue and agreed we'd work on it later. Sounds important
though, so I'll look at it now.
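
To make the batching option concrete, here is a rough sketch of what I have in mind. It is not the Wikimetrics code: the function names, the batch size, and the use of sqlite3 as a stand-in database are all assumptions for illustration, and it only handles a single project's user table. Cohorts spanning several projects would need the records grouped by project first.

```python
import sqlite3


def chunks(items, size):
    """Yield consecutive slices of `items`, each at most `size` long."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def validate_cohort(conn, records, batch_size=400):
    """Return the set of records that match a user_id or a user_name.

    Hypothetical helper: `records` are the raw strings from the uploaded
    CSV, and `conn` is a DB-API connection to one project's database
    with a `user` table (user_id, user_name).
    """
    valid = set()
    for batch in chunks(list(records), batch_size):
        # A record can only be a user_id if it is numeric, but any
        # record (numeric or not) could be a user_name.
        ids = [int(r) for r in batch if r.isdecimal()]
        names = list(batch)

        clauses, params = [], []
        if ids:
            clauses.append("user_id IN (%s)" % ",".join("?" * len(ids)))
            params.extend(ids)
        clauses.append("user_name IN (%s)" % ",".join("?" * len(names)))
        params.extend(names)

        # One round trip per batch instead of one per record.
        sql = "SELECT user_id, user_name FROM user WHERE " + " OR ".join(clauses)
        rows = conn.execute(sql, params).fetchall()

        found_ids = {str(row[0]) for row in rows}
        found_names = {row[1] for row in rows}
        valid.update(r for r in batch if r in found_ids or r in found_names)
    return valid
```

With 1,780 records and a batch size of 400 this issues 5 queries rather than 1,780. The batch size is kept small here because each batch binds up to two parameters per record, and some drivers cap the number of bound parameters per statement.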

_______________________________________________
Wikimetrics mailing list
[email protected]
https://lists.wikimedia.org/mailman/listinfo/wikimetrics
