Dan, I tried uploading a cohort from a recent A/B test (1,780 unique user_ids). The async validation took about 5 minutes to complete.
If I create a temporary table with the data in my CSV and run a join with the user table against a slave, the query to validate that these users exist takes about 400ms if I use user_id (primary key in enwiki.user) and about 3s using user_name (unique in enwiki.user). What’s the reason why it takes so long to validate a cohort in the application?

Dario

On Nov 21, 2013, at 7:45 AM, Dario Taraborelli <[email protected]> wrote:

> thanks Dan, this is awesome – I’ll give it a try this morning with some of
> the recent mobile cohorts.
>
> On Nov 21, 2013, at 7:39 AM, Dan Andreescu <[email protected]> wrote:
>
>> Dear Wikimetrics users,
>>
>> I've just deployed asynchronous cohort upload. This is feature #818:
>> https://mingle.corp.wikimedia.org/projects/analytics/cards/818 and basically
>> allows you to upload larger cohorts because validation is happening behind
>> the scenes. I'll go over how the new functionality works here, and will
>> rely on one of you to point me to the appropriate on-wiki place to update
>> documentation.
>>
>> So basically, visiting /cohorts and clicking "Upload Cohort" works as
>> before. But once you click "Upload CSV", your form is validated, processed,
>> and you're taken back to the cohorts page. Your new cohort is immediately
>> created but is not yet validated. While it validates, you'll see the
>> validation status and have a few options:
>>
>> * Remove Cohort. This is destructive and will remove this cohort from your
>> list. Use this in case you made a mistake, uploaded the wrong file, etc.
>> * Validate Again. This will run validation again. One possible use for it
>> is, let's say you upload a cohort with some *very* newly registered users.
>> And because of replication lag to the labsdb databases, most of them come up
>> invalid. You can then run validation again.
>> * Refresh. This just refreshes the status of the validation and will update
>> the counts that show up below.
>>
>> You will not have the "Create Report" option until validation is done. And
>> when you do create a report, only valid users will be considered and used in
>> the output.
>>
>> One caveat. Validation is still slow. And the time limit for the
>> asynchronous task is set to 1 hour. I have some ideas for making this
>> faster by batching, and I can increase the time limit per task (but that has
>> other repercussions). For now, just keep in mind that the theoretical
>> maximum cohort size you should upload is roughly 18,000 users. I would love
>> some feedback about whether it's ok to increase the time limit or if people
>> want me to focus on making validation faster.
>>
>> Dan
>> _______________________________________________
>> Wikimetrics mailing list
>> [email protected]
>> https://lists.wikimedia.org/mailman/listinfo/wikimetrics
>
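
For reference, the temporary-table join described at the top of this message can be sketched roughly as below. This is only a minimal sketch, not the Wikimetrics validation code: it uses Python's sqlite3 with an in-memory stand-in for enwiki.user, and the names validate_cohort and cohort_upload are hypothetical. As a side note on the numbers in this thread, a 1 hour limit for roughly 18,000 users is about 0.2 s per user, close to the 5 minutes observed for 1,780 users (about 0.17 s each). That would be consistent with users being checked one at a time, though that is an inference and not something stated in the thread.

```python
import sqlite3


def validate_cohort(conn, user_ids):
    """Split user_ids into (valid, invalid) with one set-based join.

    Hypothetical helper: loads the uploaded IDs into a temporary table
    and LEFT JOINs it against the user table, instead of issuing one
    lookup query per user.
    """
    cur = conn.cursor()
    cur.execute("CREATE TEMP TABLE cohort_upload (user_id INTEGER PRIMARY KEY)")
    # INSERT OR IGNORE drops duplicate IDs from the uploaded CSV.
    cur.executemany(
        "INSERT OR IGNORE INTO cohort_upload (user_id) VALUES (?)",
        [(uid,) for uid in user_ids],
    )
    cur.execute(
        '''
        SELECT c.user_id, u.user_id IS NOT NULL AS found
        FROM cohort_upload AS c
        LEFT JOIN "user" AS u ON u.user_id = c.user_id
        ORDER BY c.user_id
        '''
    )
    rows = cur.fetchall()
    cur.execute("DROP TABLE cohort_upload")
    valid = [uid for uid, found in rows if found]
    invalid = [uid for uid, found in rows if not found]
    return valid, invalid


# In-memory stand-in for enwiki.user (assumed columns: user_id, user_name).
conn = sqlite3.connect(":memory:")
conn.execute(
    'CREATE TABLE "user" (user_id INTEGER PRIMARY KEY, user_name TEXT UNIQUE)'
)
conn.executemany(
    'INSERT INTO "user" (user_id, user_name) VALUES (?, ?)',
    [(1, "Alice"), (2, "Bob"), (3, "Carol")],
)

valid, invalid = validate_cohort(conn, [1, 3, 42])
print(valid, invalid)
```

The same shape would apply to validating by user_name, joining on the unique user_name column instead of the primary key.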
