Fantastic. Is there any chance we could get even better performance if we 
allowed users to specify the field type in the upload form? If it's just 
user_ids, validation will be faster, since the app doesn't need to check every 
single entry for a valid user_name too. My understanding is that by design the 
application makes no assumptions about the type of that field, and in fact 
accepts a mix of user_ids and user_names. Is that correct?

On Nov 21, 2013, at 1:52 PM, Dan Andreescu <[email protected]> wrote:

> OK, I got 10k users to validate in about 30 seconds.  Not instant, but it does 
> have to do a bunch of duplicate checks, multi-project batching, etc.  Let me 
> know how it works for you, and whether there are any problems.
> 
> 
> On Thu, Nov 21, 2013 at 1:53 PM, Dan Andreescu <[email protected]> 
> wrote:
> 
> 
> On Thu, Nov 21, 2013 at 9:57 AM, Dario Taraborelli 
> <[email protected]> wrote:
> I tried uploading a cohort from a recent A/B test (1,780 unique user_ids). 
> The async validation took about 5 minutes to complete.
> 
> If I create a temporary table with the data from my CSV and join it with the 
> user table against a slave, the query to validate that these users exist 
> takes about 400ms using user_id (the primary key in enwiki.user) and about 
> 3s using user_name (unique in enwiki.user). 
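Dario's experiment can be reproduced in miniature as follows. This is an in-memory SQLite sketch, not the actual enwiki setup: only the table/column names follow the schema above, and the rows are made up.

```python
# Miniature version of the temp-table-plus-join validation described above.
# SQLite stands in for the MySQL slave; the rows are made up.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE user (user_id INTEGER PRIMARY KEY, user_name TEXT UNIQUE)")
conn.executemany("INSERT INTO user VALUES (?, ?)", [(1, "Alice"), (2, "Bob")])

# Load the CSV's user_ids into a temporary table...
conn.execute("CREATE TEMP TABLE cohort (user_id INTEGER)")
conn.executemany("INSERT INTO cohort VALUES (?)", [(1,), (2,), (999,)])

# ...then a single join validates the whole cohort via the primary-key index.
valid = sorted(row[0] for row in conn.execute(
    "SELECT c.user_id FROM cohort c JOIN user u ON u.user_id = c.user_id"))
print(valid)  # [1, 2]; 999 has no matching user row
```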
> 
> What's the reason it takes so long to validate a cohort in the 
> application?
> 
> Is my understanding correct that this is because Labs is slow compared to stat1?
> 
> I don't think Labs is that much slower, though; we're talking orders of 
> magnitude here.  So I think the reason is that we're currently validating one 
> user at a time.  Since each record has to be checked against both a potential 
> user_id match and a user_name match, this takes forever.
> 
> Two ways to make it much faster:
> 
> * batch every X users and do a single WHERE user_id IN (...) OR user_name 
> IN (...) query instead of checking each one
> * create temporary tables just like Dario did
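The first approach above can be sketched roughly as follows. This is illustrative only: SQLite in place of MySQL, a made-up batch size and data, and `validate_batch` is a hypothetical name, not the Wikimetrics code.

```python
# Sketch of batched validation: one IN (...) query per chunk of entries
# instead of one query per user.  Table/column names follow enwiki.user;
# everything else is illustrative.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE user (user_id INTEGER PRIMARY KEY, user_name TEXT UNIQUE)")
conn.executemany("INSERT INTO user VALUES (?, ?)",
                 [(1, "Alice"), (2, "Bob"), (3, "Carol")])

def validate_batch(conn, entries, batch_size=500):
    """Return the entries that match an existing user_id or user_name."""
    valid = set()
    for i in range(0, len(entries), batch_size):
        chunk = entries[i:i + batch_size]
        ph = ",".join("?" * len(chunk))
        # One round trip checks the whole chunk against both columns.
        rows = conn.execute(
            f"SELECT user_id, user_name FROM user "
            f"WHERE user_id IN ({ph}) OR user_name IN ({ph})",
            chunk + chunk).fetchall()
        for uid, uname in rows:
            valid.update({str(uid), uname})
    return {e for e in entries if e in valid}
```

Here `validate_batch(conn, ["1", "Bob", "999", "Mallory"])` keeps only the entries backed by a real user row, whether they matched by id or by name.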
> 
> The problem is that cohorts can have users from multiple projects.  That 
> makes both approaches harder, but should still be doable.  The reason I 
> haven't done this yet is that when we scheduled 818 we broke out the 
> performance issue and agreed we'd work on it later.  Sounds important though, 
> I'll look at it now.
> 
> _______________________________________________
> Wikimetrics mailing list
> [email protected]
> https://lists.wikimedia.org/mailman/listinfo/wikimetrics

