Dan,

I tried uploading a cohort from a recent A/B test (1,780 unique user_ids). The
async validation took about 5 minutes to complete.

If I load the CSV into a temporary table and join it against the user table
on a slave, the query that checks whether these users exist takes about
400 ms when joining on user_id (primary key in enwiki.user) and about 3 s
when joining on user_name (unique in enwiki.user).
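
A set-based check of this kind can be sketched as follows. This is only an
illustration under stated assumptions: it uses Python's sqlite3 with an
in-memory database as a stand-in for the enwiki slave, and the names
cohort_upload, validate_cohort and make_demo_db are hypothetical, not taken
from Wikimetrics.

```python
import sqlite3


def make_demo_db():
    """Build an in-memory stand-in for enwiki.user (illustrative data only)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        'CREATE TABLE "user" (user_id INTEGER PRIMARY KEY, user_name TEXT UNIQUE)'
    )
    conn.executemany(
        'INSERT INTO "user" (user_id, user_name) VALUES (?, ?)',
        [(i, "User%d" % i) for i in range(1, 6)],
    )
    return conn


def validate_cohort(conn, user_ids):
    """Split user_ids into (valid, invalid) with one join instead of one query per user."""
    cur = conn.cursor()
    # Hypothetical temporary table holding the uploaded CSV rows.
    cur.execute("CREATE TEMP TABLE cohort_upload (user_id INTEGER PRIMARY KEY)")
    cur.executemany(
        "INSERT INTO cohort_upload (user_id) VALUES (?)",
        [(u,) for u in sorted(set(user_ids))],
    )
    # LEFT JOIN keeps every uploaded id; a NULL on the user side means "not found".
    cur.execute(
        """
        SELECT c.user_id, u.user_id IS NOT NULL
        FROM cohort_upload AS c
        LEFT JOIN "user" AS u ON u.user_id = c.user_id
        ORDER BY c.user_id
        """
    )
    rows = cur.fetchall()
    cur.execute("DROP TABLE cohort_upload")
    valid = [uid for uid, found in rows if found]
    invalid = [uid for uid, found in rows if not found]
    return valid, invalid


if __name__ == "__main__":
    demo_conn = make_demo_db()
    print(validate_cohort(demo_conn, [2, 4, 99]))
```

Joining on user_name instead of user_id would follow the same pattern with a
text column in the temporary table.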

Why does it take so long to validate a cohort in the application?

Dario

On Nov 21, 2013, at 7:45 AM, Dario Taraborelli <[email protected]> wrote:

> thanks Dan, this is awesome – I’ll give it a try this morning with some of 
> the recent mobile cohorts.
> 
> On Nov 21, 2013, at 7:39 AM, Dan Andreescu <[email protected]> wrote:
> 
>> Dear Wikimetrics users,
>> 
>> I've just deployed asynchronous cohort upload.  This is feature #818: 
>> https://mingle.corp.wikimedia.org/projects/analytics/cards/818 and basically 
>> allows you to upload larger cohorts because validation is happening behind 
>> the scenes.  I'll go over how the new functionality works here, and will 
>> rely on one of you to point me to the appropriate on-wiki place to update 
>> documentation.
>> 
>> So basically, visiting /cohorts and clicking "Upload Cohort" works as 
>> before.  But once you click "Upload CSV", your form is validated, processed, 
>> and you're taken back to the cohorts page.  Your new cohort is immediately 
>> created but is not yet validated.  While it validates, you'll see the 
>> validation status and have a few options:
>> 
>> * Remove Cohort.  This is destructive and will remove this cohort from your 
>> list.  Use this in case you made a mistake, uploaded the wrong file, etc.
>> * Validate Again.  This will run validation again.  One possible use for it 
>> is, let's say you upload a cohort with some *very* newly registered users.  
>> And because of replication lag to the labsdb databases, most of them come up 
>> invalid.  You can then run validation again.
>> * Refresh.  This just refreshes the status of the validation and will update 
>> the counts that show up below.
>> 
>> You will not have the "Create Report" option until validation is done.  And 
>> when you do create a report, only valid users will be considered and used in 
>> the output.
>> 
>> One caveat.  Validation is still slow.  And the time limit for the 
>> asynchronous task is set to 1 hour.  I have some ideas for making this 
>> faster by batching, and I can increase the time limit per task (but that has 
>> other repercussions).  For now, just keep in mind that the theoretical 
>> maximum cohort size you should upload is roughly 18,000 users.  I would love 
>> some feedback about whether it's ok to increase the time limit or if people 
>> want me to focus on making validation faster.
>> 
>> Dan
>> _______________________________________________
>> Wikimetrics mailing list
>> [email protected]
>> https://lists.wikimedia.org/mailman/listinfo/wikimetrics
> 

