https://bugzilla.wikimedia.org/show_bug.cgi?id=60826
Web browser: ---
            Bug ID: 60826
           Summary: Enable parallel processing of stub dump and full
                    archive dump for same wiki.
           Product: Analytics
           Version: unspecified
          Hardware: All
                OS: All
            Status: NEW
          Severity: normal
          Priority: Unprioritized
         Component: Wikimetrics
          Assignee: wikibugs-l@lists.wikimedia.org
          Reporter: ezac...@wikimedia.org
                CC: christ...@quelltextlich.at, dandree...@wikimedia.org,
                    dvanli...@gmail.com, nu...@wikimedia.org,
                    tneg...@wikimedia.org
    Classification: Unclassified
   Mobile Platform: ---

Years ago Wikistats used to process the full archive dump for each wiki:
the dump which contains the full text of every revision of every article.
Only that type of dump file can yield word counts, average article size,
and some other content-based metrics. For a list of affected metrics, see
the partially empty columns at e.g.
http://stats.wikimedia.org/EN/TablesWikipediaEN.htm (first table).

As the dumps grew larger and larger, this was no longer possible on a
monthly schedule, at least for the largest Wikipedia wikis. Processing the
English full archive dump now takes more than a month by itself; some very
heavy regexps are partially to blame.

Many people have asked when the missing metrics will be revived. A
pressing case was brought forward in the first days of 2014 in
https://nl.wikipedia.org/wiki/Overleg_gebruiker:Erik_Zachte#Does German
Wikipedia have a crisis? For example: "Can you find out if the growth of
average size has significantly changed in 2013?"

At the moment there is limited parallelism within Wikistats dump
processing. Two wikis from different projects can be processed in
parallel, as each project has its own set of input/output folders. But
processing two Wikipedia wikis at the same time could cause interference
problems, as there are some project-wide csv files. Not to mention
processing the stub dump and full archive dump for the same wiki at the
same time, where all files for that wiki would be updated by two
processes.

The simplest solution is to schedule full archive dump processing on a
different server than stub dump processing (e.g. stat1 instead of
stat1001?) and merge the few metrics that can only be collected from the
full archive dumps into the csv files generated from the stub dumps.

This merge would require a separate script, which can fetch a csv file
from one server and merge specific columns into the equivalent csv files
on another server. The csv file(s) should be protected against concurrent
access (semaphore? how?), or the merge step should be part of the
round-robin job which processes dumps whenever they become available (the
latter being slightly less safe, as there is a theoretical chance that
concurrent access could still occur, since extra runs are occasionally
scheduled manually).
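A minimal sketch of such a merge script follows, assuming the archive-run
csv has already been fetched to the local server (e.g. via rsync/scp) and
that both files share a common key column. The file names, the key column
('month'), and the merged columns ('words', 'avg_article_size') are all
illustrative stand-ins; the actual Wikistats csv layout is not specified
in this report.

    #!/usr/bin/env python
    # Hypothetical sketch: copy selected content-based columns from the
    # csv produced by the full-archive run into the equivalent csv
    # produced by the stub-dump run, leaving all other columns untouched.
    import csv
    import os

    def merge_columns(stub_csv, archive_csv, key_col, cols_to_merge):
        # Index the archive rows by key (e.g. month).
        with open(archive_csv, newline='') as f:
            archive = {row[key_col]: row for row in csv.DictReader(f)}

        with open(stub_csv, newline='') as f:
            reader = csv.DictReader(f)
            rows = list(reader)
            fields = reader.fieldnames

        # Overwrite only the content-based columns, and only for rows
        # that also exist in the archive-run csv.
        for row in rows:
            src = archive.get(row[key_col])
            if src is not None:
                for col in cols_to_merge:
                    row[col] = src[col]

        # Write to a temp file and rename, so a crashed merge never
        # leaves a half-written csv behind.
        tmp = stub_csv + '.tmp'
        with open(tmp, 'w', newline='') as f:
            writer = csv.DictWriter(f, fieldnames=fields)
            writer.writeheader()
            writer.writerows(rows)
        os.replace(tmp, stub_csv)

    if __name__ == '__main__':
        merge_columns('StatisticsMonthly.csv',
                      'StatisticsMonthlyArchive.csv',
                      'month', ['words', 'avg_article_size'])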
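As for the "semaphore? how?" question, one common option on Linux is an
advisory lock file taken around every read-modify-write of the shared csv
files. This is only a sketch of one possible approach, not the chosen
design: it uses fcntl.flock rather than a true semaphore, and it only
helps if every job that touches those files (dump processing and merge
alike) takes the same lock.

    #!/usr/bin/env python
    # Hypothetical sketch: serialize access to the project-wide csv
    # files with an exclusive advisory file lock (Linux/Unix only).
    import fcntl
    from contextlib import contextmanager

    @contextmanager
    def locked(lock_path):
        with open(lock_path, 'w') as lock_file:
            # Blocks until no other process holds the lock.
            fcntl.flock(lock_file, fcntl.LOCK_EX)
            try:
                yield
            finally:
                fcntl.flock(lock_file, fcntl.LOCK_UN)

    # Usage (lock path is illustrative):
    #   with locked('/var/lock/wikistats-csv.lock'):
    #       merge_columns(...)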