OK, I got 10k users to validate in about 30 seconds. Not instant, but it
does have to do a bunch of duplicate checks, multi-project batching, etc.
Let me know how it works for you, and whether there are any problems.
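The batching boils down to one IN-list query per chunk instead of a round
trip per user. As a rough standalone sketch (sqlite3 standing in for a
project slave; validate_batch and the toy setup are hypothetical names for
illustration, not the actual Wikimetrics code):

```python
import sqlite3

def validate_batch(conn, records, batch_size=500):
    """Return the records whose user_id or user_name exists in the
    user table, using one IN-list query per batch instead of one
    query per record."""
    valid = []
    for start in range(0, len(records), batch_size):
        batch = records[start:start + batch_size]
        ids = [uid for uid, _ in batch]
        names = [name for _, name in batch]
        # One query covers the whole batch for both match columns.
        query = (
            "SELECT user_id, user_name FROM user"
            " WHERE user_id IN ({}) OR user_name IN ({})"
        ).format(",".join("?" * len(ids)), ",".join("?" * len(names)))
        rows = conn.execute(query, ids + names).fetchall()
        found_ids = {uid for uid, _ in rows}
        found_names = {name for _, name in rows}
        valid.extend((uid, name) for uid, name in batch
                     if uid in found_ids or name in found_names)
    return valid

# Toy in-memory stand-in for one project's user table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE user (user_id INTEGER PRIMARY KEY,"
             " user_name TEXT UNIQUE)")
conn.executemany("INSERT INTO user VALUES (?, ?)",
                 [(1, "Alice"), (2, "Bob"), (3, "Carol")])

cohort = [(1, "Alice"), (99, "Nobody"), (None, "Bob")]
valid = validate_batch(conn, cohort, batch_size=2)
```

Per-project grouping would wrap this in a loop over the cohort's projects,
running the batches against each project's database connection.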


On Thu, Nov 21, 2013 at 1:53 PM, Dan Andreescu <[email protected]> wrote:

>
>
>> On Thu, Nov 21, 2013 at 9:57 AM, Dario Taraborelli <
>> [email protected]> wrote:
>>
>>> I tried uploading a cohort from a recent A/B test (1,780 unique
>>> user_ids). The async validation took about 5 minutes to complete.
>>>
>>> If I create a temporary table with the data in my CSV and run a join
>>> with the user table against a slave, the query to validate that these users
>>> exist takes about 400ms if I use user_id (primary key in enwiki.user) and
>>> about 3s using user_name (unique in enwiki.user).
>>>
>>> What’s the reason it takes so long to validate a cohort in the
>>> application?
>>>
>>
>> My understanding is that this is due to Labs being slow compared to
>> stat1?
>>
>
> I don't think Labs is that much slower, though; we're talking orders of
> magnitude here. So I think the reason is that it currently validates
> one user at a time. Since each record has to be checked against both a
> potential user_id and a potential user_name match, this takes forever.
>
> Two ways to make it much faster:
>
> * batch every X users and run one WHERE user_id IN (...) OR user_name
> IN (...) query per batch instead of checking each record separately
> * create temporary tables just like Dario did
>
> The problem is that cohorts can have users from multiple projects. That
> makes both approaches harder, but they should still be doable. The
> reason I haven't done this yet is that when we scheduled 818, we broke
> out the performance issue and agreed we'd work on it later. It sounds
> important, though, so I'll look at it now.
>
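For comparison, the temporary-table approach Dario measured has roughly
this shape as a standalone sketch (sqlite3 again standing in for the
enwiki slave; on a real MariaDB slave you would load the CSV rows into
the temporary table, but the validating join is the same):

```python
import sqlite3

# Toy in-memory stand-in for the enwiki user table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE user (user_id INTEGER PRIMARY KEY,"
             " user_name TEXT UNIQUE)")
conn.executemany("INSERT INTO user VALUES (?, ?)",
                 [(1, "Alice"), (2, "Bob"), (3, "Carol")])

# Load the cohort CSV's ids into a temporary table, then validate the
# whole cohort with a single join against the user table.
conn.execute("CREATE TEMPORARY TABLE cohort (user_id INTEGER)")
conn.executemany("INSERT INTO cohort VALUES (?)", [(1,), (3,), (99,)])
valid_ids = sorted(row[0] for row in conn.execute(
    "SELECT c.user_id FROM cohort c"
    " JOIN user u ON u.user_id = c.user_id"))
```

The join hits the user_id primary key index once per cohort row, which is
why the whole check comes back in milliseconds rather than minutes.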
_______________________________________________
Wikimetrics mailing list
[email protected]
https://lists.wikimedia.org/mailman/listinfo/wikimetrics
