https://bugzilla.wikimedia.org/show_bug.cgi?id=60826

       Web browser: ---
            Bug ID: 60826
           Summary: Enable parallel processing of stub dump and full
                    archive dump for same wiki.
           Product: Analytics
           Version: unspecified
          Hardware: All
                OS: All
            Status: NEW
          Severity: normal
          Priority: Unprioritized
         Component: Wikimetrics
          Assignee: wikibugs-l@lists.wikimedia.org
          Reporter: ezac...@wikimedia.org
                CC: christ...@quelltextlich.at, dandree...@wikimedia.org,
                    dvanli...@gmail.com, nu...@wikimedia.org,
                    tneg...@wikimedia.org
    Classification: Unclassified
   Mobile Platform: ---

Years ago Wikistats used to process the full archive dump for each wiki, the
dump which contains the full text of each revision of each article. Only that
type of dump file can yield word count, average article size and some other
content-based metrics. For a list of affected metrics see the partially empty
columns at e.g. http://stats.wikimedia.org/EN/TablesWikipediaEN.htm (first
table).

As the dumps grew larger and larger this was no longer possible on a monthly
schedule, at least for the largest Wikipedia wikis. Processing the English full
archive dump alone now takes more than a month; some very heavy regexps are
partly to blame.

Many people have asked when the missing metrics will be revived. A pressing
case was brought forward in the first days of 2014 in
https://nl.wikipedia.org/wiki/Overleg_gebruiker:Erik_Zachte under the heading
"Does German Wikipedia have a crisis?", for example: "Can you find out if the
growth of average size has significantly changed in 2013?"

At the moment there is limited parallelism within Wikistats dump processing.
Two wikis from different projects can be processed in parallel, as each project
has its own set of input/output folders. But processing two Wikipedia wikis at
the same time could bring interference problems, as there are some project-wide
csv files. Processing the stub dump and the full archive dump for the same wiki
at the same time would be worse still, as all files for that wiki would be
updated by two processes.

The simplest solution is to schedule full archive dump processing on a
different server than stub dump processing (e.g. stat1 instead of stat1001?)
and merge the few metrics that can only be collected from the full archive
dumps into the csv files generated from the stub dumps.

This merge would require a separate script that can fetch a csv file from one
server and merge specific columns into the equivalent csv files on another
server.
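
A minimal sketch of such a merge script is shown below, in Python. The file
paths, the remote host, the join key ("month") and the column names
("word_count", "avg_article_size") are illustrative assumptions, not existing
Wikistats conventions:

    #!/usr/bin/env python
    # Sketch only: paths, column names and the remote host are assumptions.
    import csv
    import subprocess

    REMOTE  = "stat1:/a/wikistats/csv_wp/StatisticsMonthly.csv"  # full archive results (assumed path)
    FETCHED = "/tmp/StatisticsMonthly_fullarchive.csv"
    LOCAL   = "/a/wikistats/csv_wp/StatisticsMonthly.csv"        # stub dump results (assumed path)
    KEY_COLUMN    = "month"                                      # assumed join key
    MERGE_COLUMNS = ["word_count", "avg_article_size"]           # assumed content-based metrics

    def fetch_remote_csv():
        # Copy the full-archive csv from the other server.
        subprocess.check_call(["rsync", "-a", REMOTE, FETCHED])

    def merge_columns():
        # Overwrite only the content-based columns in the stub-based csv,
        # matching rows on the shared key column.
        with open(FETCHED, newline="") as f:
            remote_rows = {row[KEY_COLUMN]: row for row in csv.DictReader(f)}

        with open(LOCAL, newline="") as f:
            reader = csv.DictReader(f)
            fieldnames = reader.fieldnames
            merged = []
            for row in reader:
                source = remote_rows.get(row[KEY_COLUMN])
                if source:
                    for col in MERGE_COLUMNS:
                        row[col] = source[col]
                merged.append(row)

        with open(LOCAL, "w", newline="") as f:
            writer = csv.DictWriter(f, fieldnames=fieldnames)
            writer.writeheader()
            writer.writerows(merged)

    if __name__ == "__main__":
        fetch_remote_csv()
        merge_columns()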

This/these csv file(s) should be protected against concurrent access
(a semaphore? how?), or the merge step should be made part of the round-robin
job which processes dumps whenever they become available. The latter is
slightly less safe, as there is a theoretical chance that concurrent access
could still occur, since extra runs are occasionally scheduled manually.
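
Below is a minimal sketch of such a protection, using an advisory file lock
(fcntl.flock) rather than a full semaphore; the lock file location is an
assumption. Both the merge script and the round-robin dump job would have to
take the same lock around their csv updates:

    # Sketch only: the lock file path is an assumption.
    import fcntl
    from contextlib import contextmanager

    LOCK_FILE = "/a/wikistats/csv_wp/.csv_update.lock"  # assumed location

    @contextmanager
    def csv_lock():
        # Block until no other job (merge script or dump run) holds the lock.
        with open(LOCK_FILE, "w") as lock:
            fcntl.flock(lock, fcntl.LOCK_EX)  # exclusive lock, blocks until free
            try:
                yield
            finally:
                fcntl.flock(lock, fcntl.LOCK_UN)

    # Usage: wrap every csv update in the lock, e.g.
    # with csv_lock():
    #     merge_columns()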
