Fantastic. Is there any chance we could get even better performance if we allowed users to specify the field type in the upload form? If a cohort is just user_ids, validation will be faster because the app doesn't need to check every single entry for a valid user_name as well. I understand that by design the application makes no assumption about the type of that field (and in fact it accepts a mix of user_ids and user_names), correct?
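Something like the following sketch is what I have in mind (rough illustration only; the field_type value and function name are made up, not actual Wikimetrics code):

    def plan_validation(field_type):
        """Map a hypothetical upload-form choice to the lookups to run.

        field_type: 'user_id', 'user_name', or None when the uploader
        leaves it unspecified; None keeps today's behavior of checking
        every entry against both columns.
        """
        check_user_id = field_type in (None, 'user_id')
        check_user_name = field_type in (None, 'user_name')
        return check_user_id, check_user_name

    # A cohort declared as pure user_ids skips the slower
    # user_name lookup entirely:
    assert plan_validation('user_id') == (True, False)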
On Nov 21, 2013, at 1:52 PM, Dan Andreescu <[email protected]> wrote:

> OK, I got 10k users to validate in about 30 seconds. Not instant, but it
> does have to do a bunch of duplicate checks, multi-project batching, etc.
> Let me know how it works for you, and if there are any problems.
>
> On Thu, Nov 21, 2013 at 1:53 PM, Dan Andreescu <[email protected]> wrote:
>
>> On Thu, Nov 21, 2013 at 9:57 AM, Dario Taraborelli
>> <[email protected]> wrote:
>>
>>> I tried uploading a cohort from a recent A/B test (1,780 unique user_ids).
>>> The async validation took about 5 minutes to complete.
>>>
>>> If I create a temporary table with the data in my CSV and run a join with
>>> the user table against a slave, the query to validate that these users
>>> exist takes about 400ms if I use user_id (the primary key in enwiki.user)
>>> and about 3s using user_name (unique in enwiki.user).
>>>
>>> What's the reason why it takes so long to validate a cohort in the
>>> application?
>>>
>>> My understanding is that this is due to Labs being slow compared to stat1?
>>
>> I don't think Labs is that much slower, though; we're talking orders of
>> magnitude here. So I think the reason is that it currently validates one
>> user at a time. Since for each record I have to check against a potential
>> user_id and a potential user_name match, this takes forever.
>>
>> Two ways to make it much faster:
>>
>> * batch every X users and do a "where user_id in (...)" or
>>   "where user_name in (...)" query instead of checking each one
>> * create temporary tables, just like Dario did
>>
>> The problem is that cohorts can have users from multiple projects. That
>> makes both approaches harder, but they should still be doable. The reason I
>> haven't done this yet is that when we scheduled 818 we broke out the
>> performance issue and agreed we'd work on it later. It sounds important,
>> though, so I'll look at it now.
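For concreteness, Dan's first suggestion (batching with IN (...) queries, grouped per project) might look roughly like the sketch below. This is not the actual Wikimetrics code; it assumes one DB-API connection per project's replica, MySQL-style %s placeholders, raw values arriving as strings from the CSV, and hypothetical names throughout:

    from itertools import islice

    BATCH_SIZE = 500  # the "X" above; worth tuning against a replica

    def chunks(values, size):
        """Yield successive lists of at most `size` values."""
        it = iter(values)
        while True:
            chunk = list(islice(it, size))
            if not chunk:
                return
            yield chunk

    def validate_chunk(cursor, chunk):
        """Check one chunk of raw values with a single query.

        A raw value may be a user_id or a user_name, so numeric values
        are checked against user_id and every value against user_name.
        """
        ids = [v for v in chunk if v.isdigit()]
        clauses, params = [], []
        if ids:
            clauses.append(
                'user_id IN (%s)' % ','.join(['%s'] * len(ids)))
            params.extend(ids)
        clauses.append(
            'user_name IN (%s)' % ','.join(['%s'] * len(chunk)))
        params.extend(chunk)
        cursor.execute(
            'SELECT user_id, user_name FROM user WHERE '
            + ' OR '.join(clauses), params)
        found = set()
        for user_id, user_name in cursor.fetchall():
            found.add(str(user_id))
            found.add(user_name)
        return [v for v in chunk if v in found]

    def validate_cohort(connections, records):
        """records: (project, raw_value) pairs; cohorts can mix projects.

        connections: project -> DB-API connection to that project's
        replica (a stand-in for however the app reaches each wiki).
        Returns the set of (project, raw_value) pairs that matched.
        """
        by_project = {}
        for project, raw in records:
            by_project.setdefault(project, []).append(raw)
        valid = set()
        for project, values in by_project.items():
            cursor = connections[project].cursor()
            for chunk in chunks(values, BATCH_SIZE):
                for raw in validate_chunk(cursor, chunk):
                    valid.add((project, raw))
        return valid

Dan's second option (the temporary-table variant I tried by hand) would replace the IN (...) query with a bulk insert into a temp table plus one join per project; the per-project grouping would stay the same.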
_______________________________________________ Wikimetrics mailing list [email protected] https://lists.wikimedia.org/mailman/listinfo/wikimetrics
