Robert Rohde:

Getting back to Wikimedia, it appears correct that the Wikistats code
is designed to run from the compressed files (source linked from [1]).
As you suggest, one could use the properties of the .bz2 format to
parallelize that.  I would also observe that parsers tend to be
relatively slow, while decompressors tend to be relatively fast.
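One way to exploit the .bz2 format here: Wikimedia also publishes "multistream" bzip2 dumps, where independent streams are simply concatenated and each begins at a byte boundary. Below is a minimal sketch (in Python, as an illustration only; wikistats itself is not written this way) that splits such a blob at stream headers and decompresses the pieces in parallel. The 10-byte marker is the standard bzip2 header plus block magic; scanning raw bytes for it can in principle hit false positives inside compressed data, so a robust splitter would verify each candidate offset.

```python
import bz2
from concurrent.futures import ThreadPoolExecutor

# A bzip2 stream at compress level 9 starts with "BZh9" followed by the
# block magic 0x314159265359.  In a multistream file every stream begins
# at a byte boundary, so splitting on this marker is feasible.
MARKER = b"BZh9\x31\x41\x59\x26\x53\x59"

def split_streams(data: bytes):
    """Yield each independent bz2 stream in a multistream blob."""
    offsets = [i for i in range(len(data)) if data.startswith(MARKER, i)]
    for start, end in zip(offsets, offsets[1:] + [len(data)]):
        yield data[start:end]

def parallel_decompress(data: bytes, workers: int = 4) -> bytes:
    # CPython's bz2 releases the GIL during decompression,
    # so a thread pool gives real parallelism here.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return b"".join(pool.map(bz2.decompress, split_streams(data)))

# Build a small multistream blob in memory and round-trip it.
parts = [b"page one " * 100, b"page two " * 100, b"page three " * 100]
multistream = b"".join(bz2.compress(p) for p in parts)
result = parallel_decompress(multistream)
```

Since the streams are independent, the per-stream work can be handed to any pool of workers; the join at the end is the only serial step.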

Some additional notes:

Yes, wikistats processes compressed dumps.
Nowadays these are mostly stub dumps.
Most monthly metrics can be collected from these, with a few exceptions 
like word count.

For stub dumps decompression is the major resource hog;
for full dumps some heavy regexps also contribute considerably.
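The decompress-then-regexp pipeline described above can be sketched as follows. The XML here is a simplified stand-in, not the real stub-dump schema, and `count_pages` is a hypothetical name; the point is only the shape of the loop: stream-decompress line by line and let a compiled regexp do the matching.

```python
import bz2
import os
import re
import tempfile

# Simplified stand-in for a stub dump; the real stub-meta-history XML
# carries more fields, but the processing pattern is the same.
SAMPLE = """<page><title>Foo</title><revision><id>1</id></revision>
<revision><id>2</id></revision></page>
<page><title>Bar</title><revision><id>3</id></revision></page>
"""

TITLE_RE = re.compile(r"<title>(.*?)</title>")

def count_pages(path: str) -> int:
    """Stream-decompress a .bz2 dump and count <title> tags line by line."""
    n = 0
    with bz2.open(path, "rt", encoding="utf-8") as f:
        for line in f:
            n += len(TITLE_RE.findall(line))
    return n

# Write the sample as a compressed file, then process it.
path = os.path.join(tempfile.mkdtemp(), "stub.xml.bz2")
with bz2.open(path, "wt", encoding="utf-8") as f:
    f.write(SAMPLE)
pages = count_pages(path)
```

On a stub dump the regexps stay cheap and decompression dominates; on a full dump the matched text per line is far larger, which is where the heavy regexps start to bite.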

Wikistats could benefit a lot from parallelization (although these days 
dump production for the larger wikis is generally the bottleneck).
The first thing I would want to look into (some day) is running the 
count scripts for several wikis in parallel.
All intermediate data are stored in csv files, often one file per 
metric for all languages.
Decoupling the counts and aggregating them in a post-processing step 
would be simple.
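That decoupling could look roughly like this sketch: each worker counts one wiki and writes its own csv file, and a separate post-processing step merges them. The wiki names, the canned counts, and the function names are all hypothetical; in the real setup each worker would be parsing that wiki's dump.

```python
import csv
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor

# Hypothetical per-wiki data: monthly edit counts that a real worker
# would derive from the dump.
DUMPS = {
    "en": {"2010-01": 120, "2010-02": 150},
    "de": {"2010-01": 40, "2010-02": 55},
}

outdir = tempfile.mkdtemp()

def count_wiki(wiki: str) -> str:
    """Write one wiki's monthly edit counts to its own csv file."""
    path = os.path.join(outdir, f"edits_{wiki}.csv")
    with open(path, "w", newline="") as f:
        w = csv.writer(f)
        for month, edits in sorted(DUMPS[wiki].items()):
            w.writerow([wiki, month, edits])
    return path

def aggregate(paths):
    """Post-processing: merge per-wiki csv files into totals per month."""
    totals = {}
    for path in paths:
        with open(path, newline="") as f:
            for wiki, month, edits in csv.reader(f):
                totals[month] = totals.get(month, 0) + int(edits)
    return totals

# Count several wikis in parallel, then aggregate serially.
with ThreadPoolExecutor(max_workers=2) as pool:
    files = list(pool.map(count_wiki, DUMPS))
totals = aggregate(files)
```

Because the workers never share state (each writes its own file), the aggregation step is the only place the results meet, which keeps the parallel part trivially safe.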

Running several count threads on one machine might tax memory.
Some hashes are huge (much has been externalized to disk, but e.g. 
edits per user per namespace is still held in a hash).
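An in-memory hash keyed by (user, namespace) grows with the wiki, which is what makes several concurrent counts expensive. One way such a counter could be externalized is a small on-disk SQLite table with an upsert per edit; this is a sketch of the idea, not how wikistats does it, and all names here are made up.

```python
import os
import sqlite3
import tempfile

# On-disk counter table keyed by (user, namespace).
db = sqlite3.connect(os.path.join(tempfile.mkdtemp(), "counts.db"))
db.execute("CREATE TABLE edits (user TEXT, ns INTEGER, n INTEGER, "
           "PRIMARY KEY (user, ns))")

def record_edit(user: str, ns: int) -> None:
    # Upsert: insert a first edit, or bump the existing counter.
    db.execute("INSERT INTO edits VALUES (?, ?, 1) "
               "ON CONFLICT(user, ns) DO UPDATE SET n = n + 1",
               (user, ns))

# Hypothetical edit stream: (user, namespace) pairs.
for user, ns in [("alice", 0), ("alice", 0), ("alice", 1), ("bob", 0)]:
    record_edit(user, ns)
db.commit()

counts = {(u, ns): n for u, ns, n in db.execute("SELECT * FROM edits")}
```

The trade-off is the classic one: memory use stays flat regardless of wiki size, at the cost of disk I/O per increment (batching the upserts inside a transaction keeps that tolerable).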

The basic structure dates from the time when a full archive dump of the 
English Wikipedia was processed in minutes rather than months.
There have been a lot of optimizations, but the general setup is still 
like this:
every month all counts for the past 10 years are reproduced from 
scratch.
Wikistats basically has no memory.
This probably sounds crazy, and incremental processing has been 
suggested more than once.

The main reason to keep it this way: every so often new functionality 
is added to the scripts (and the occasional bug fix).
In order to have the new counts for the full history we would need to 
rerun from scratch every so often anyway.

People have asked me how the counts can change from month to month.
Same answer: counts are redone for all months, and newer dumps will 
contain more deletions for earlier months.
This mostly affects the last two months, though: nearly all deletions 
occur within a month or two.

In the early years deletions were very rare; most were done to prevent 
court orders (privacy).
Nowadays deletionism has taken hold.
Still, wikistats treats deleted content as 'should not have been there 
in the first place'.
This makes our editor counts somewhat conservative, and basically skews 
the activity patterns in favor of good content contributors.

Erik Zachte



_______________________________________________
Wikitech-l mailing list
[email protected]
https://lists.wikimedia.org/mailman/listinfo/wikitech-l
