https://bugzilla.wikimedia.org/show_bug.cgi?id=60826

Erik Zachte <ezac...@wikimedia.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|dvanli...@gmail.com         |ezac...@wikimedia.org

--- Comment #3 from Erik Zachte <ezac...@wikimedia.org> ---
As discussed with Toby off-line, given the current functionality, replacing it
with Hadoop will not be so simple. Possibly opportune, but some caution regarding
the ETA seems warranted.

The new job will need to incorporate several filters. In Wikistats, countable
namespaces are determined dynamically, and redirects are filtered out with
awareness of language-specific tags harvested from PHP files and WikiTranslate.
Dumps need to be vetted for validity (ideally such housekeeping would be done
by the dump process, but given the low bandwidth for dump maintenance for many
years that might take a while, so right now the ugly approach of parsing HTML
status files is used). Also, word count is far from the straightforward function
implemented in some languages. Here markup, headers, links, etc. are first
stripped, and the current approach is also aware of ideographic languages and
their different content density. This list is probably not exhaustive. Any
rebuild will probably be less ambitious in some aspects (e.g. word count), but
it will not be trivial.
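
To give a feel for why word count is more than splitting on whitespace, here is
a rough Python sketch of the idea: strip markup first, then count ideographic
characters separately from space-delimited words. This is not the Wikistats
code. The regular expressions, the function names and the CJK_CHARS_PER_WORD
factor are all assumptions made up for illustration, and nested templates,
tables and many other wikitext constructs are ignored.

```python
import re

# Hypothetical density factor: how many ideographic characters count as one
# "word". The value 2.0 is an assumption for illustration only.
CJK_CHARS_PER_WORD = 2.0

# CJK Unified Ideographs plus Hiragana/Katakana; a deliberately narrow range.
_CJK = re.compile(r"[\u4e00-\u9fff\u3040-\u30ff]")


def strip_markup(wikitext: str) -> str:
    """Remove a few common wikitext constructs (very incomplete)."""
    text = re.sub(r"<[^>]+>", " ", wikitext)                # HTML-style tags
    text = re.sub(r"\{\{[^{}]*\}\}", " ", text)             # simple, non-nested templates
    text = re.sub(r"^=+.*?=+\s*$", " ", text, flags=re.M)   # section headers
    # Internal links: keep the label, drop the target.
    text = re.sub(r"\[\[(?:[^|\]]*\|)?([^\]]*)\]\]", r"\1", text)
    # External links: keep the label, drop the URL.
    text = re.sub(r"\[https?://\S+\s*([^\]]*)\]", r"\1", text)
    text = re.sub(r"'{2,}", "", text)                       # bold/italic quotes
    return text


def count_words(wikitext: str) -> int:
    """Count words, weighting ideographic characters by an assumed density."""
    text = strip_markup(wikitext)
    cjk_chars = len(_CJK.findall(text))
    # Count the remaining space-delimited words with ideographs blanked out.
    rest = _CJK.sub(" ", text)
    plain_words = len(re.findall(r"\w+", rest))
    return plain_words + round(cjk_chars / CJK_CHARS_PER_WORD)
```

Even this toy version needs several passes over the text before counting,
which is part of why a Hadoop port has to decide which of these refinements to
keep.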

