On Nov 15, 2012, at 2:51 PM, MZMcBride <z...@mzmcbride.com> wrote:

> Max Semenik wrote:
>> On 15.11.2012, 4:06 Diederik wrote:
>>> I think that the Analytics team would prefer either:
>>> 1) detect source of edit in the URL
>>> Or 
>>> 2) have a hook activated after a successful edit and have the data send to
>>> the pixel service
>> 
>>> Having this data in a MySQL table poses a lot of challenges with
>>> respect of importing that data into the analytics cluster
>> 
>> That's for analytics purposes. However, there can be other use cases
>> for which tags in the DB are perfect, for example filter recent
>> changes for edits made only via a particular channel.

        Max is right.

        The general issue is that the revision table could use a generalized 
metadata store the same way that the page table has page_props[1]. This is not 
the same as analytical needs, but sometimes coincides with them (I assume that 
if we come up with a way to attach revision-based metadata, it would be easy to 
expose that same data to the analytics pipeline for RevTagging).

        To Amir's original suggestion, I think that hacking a rev_mobile field 
into the revision table sounds extremely clunky. I'd be worried that over time 
this will end up with an explosion of columns that will resemble our 
Recentchanges table[2]. I assume that's why Amir brought this up for discussion.

        On the other hand, a "revision_props" modeled on page_props would be a 
terrible waste of space and performance: imagine storing a boolean in a BLOB 
with every revision.

        Perhaps we could use a small varchar or smallint in place of the BLOB? 
That would not be too high impact, but flexible enough to handle both existing 
(key: mobile_edit; value: 1) and future needs, especially if mobile_edit=0 
isn't actually stored as an entry at all.
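
        A minimal sketch of that idea, using an in-memory SQLite database so 
it is self-contained. The table and column names (revision_props, rp_rev, 
rp_key, rp_value) and the helper functions are hypothetical, not an existing 
MediaWiki schema; the point is only that a zero value costs no row at all.

```python
import sqlite3


def make_db():
    """Create a toy revision table plus a hypothetical sparse props table."""
    db = sqlite3.connect(":memory:")
    db.executescript("""
        CREATE TABLE revision (rev_id INTEGER PRIMARY KEY, rev_comment TEXT);
        -- Hypothetical: one row per (revision, key) that is actually set.
        CREATE TABLE revision_props (
            rp_rev   INTEGER NOT NULL,
            rp_key   VARCHAR(32) NOT NULL,
            rp_value SMALLINT NOT NULL,
            PRIMARY KEY (rp_rev, rp_key)
        );
        CREATE INDEX rp_key_value ON revision_props (rp_key, rp_value);
    """)
    return db


def save_revision(db, comment, props=None):
    """Insert a revision; store only the props whose value is non-zero."""
    cur = db.execute("INSERT INTO revision (rev_comment) VALUES (?)", (comment,))
    rev_id = cur.lastrowid
    for key, value in (props or {}).items():
        if value:  # e.g. mobile_edit=0 is simply not stored
            db.execute(
                "INSERT INTO revision_props VALUES (?, ?, ?)",
                (rev_id, key, int(value)),
            )
    return rev_id


def revisions_with(db, key, value=1):
    """Return the ids of revisions carrying the given key/value pair."""
    rows = db.execute(
        "SELECT rp_rev FROM revision_props"
        " WHERE rp_key = ? AND rp_value = ? ORDER BY rp_rev",
        (key, value),
    )
    return [row[0] for row in rows]
```

        Filtering for edits made through a given channel then only touches 
the small props table via its (rp_key, rp_value) index.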


> Right, which is why a revision tagging system exists in MediaWiki core
> currently. If someone wanted to, for example, modify the MobileFrontend
> extension to add a "mobile" tag to edits, it would be trivial to do. The
> tagging infrastructure is already in place.

        It's unfortunate that RevTagging got mixed into this discussion, but I 
hope this clarifies the distinction between mobile's needs and RevTagging.

        Currently, MW has a very limited ability to attach metadata to a 
revision: it takes the form of new cols on the revision table (existing cols 
are… limited[3]). The issue is that this data is prioritized for transactional 
use, not necessarily analytical use (in wiki[4]: "is needed to operate the 
website and, in particular, to populate article revision histories").

        In analytical systems, data is fed down a different pipeline in order 
to be "online" and have no impact to the web transactions. Naïvely, that's 
because analytical questions on transactional databases look like "COUNT * FROM 
sometable" which are full table scans (or thereabouts) and are expensive. 
Adding the metadata for analytical purposes based on the OLTP store would then 
be "COUNT * FROM sometable GROUP BY datafromothertable JOIN awholemessoftables" 
which are multiple full table scans, and pretty soon that would require a 
dedicated offline read-only DB, and still be terribly slow.
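
        For illustration, here are the two query shapes written out as real 
SQL against a toy SQLite database. The campaign_edit table and its columns are 
hypothetical stand-ins for "datafromothertable"; plan() just returns what 
SQLite's EXPLAIN QUERY PLAN reports, so you can see the scans for yourself. 
A production MySQL setup will plan differently, so treat this as a sketch of 
the shape of the problem rather than a measurement.

```python
import sqlite3


def make_db():
    """Toy OLTP store: revisions plus an unindexed side table of metadata."""
    db = sqlite3.connect(":memory:")
    db.executescript("""
        CREATE TABLE revision (rev_id INTEGER PRIMARY KEY, rev_page INTEGER);
        -- Hypothetical metadata table, deliberately left without an index.
        CREATE TABLE campaign_edit (ce_rev INTEGER, ce_campaign TEXT);
    """)
    db.executemany("INSERT INTO revision VALUES (?, ?)", [(1, 10), (2, 10), (3, 11)])
    db.executemany(
        "INSERT INTO campaign_edit VALUES (?, ?)", [(1, "a"), (3, "a"), (2, "b")]
    )
    return db


# The plain analytical question: count everything.
COUNT_ALL = "SELECT COUNT(*) FROM revision"

# The same question once metadata from another table is involved.
COUNT_BY_CAMPAIGN = """
    SELECT ce_campaign, COUNT(*)
    FROM revision JOIN campaign_edit ON ce_rev = rev_id
    GROUP BY ce_campaign
    ORDER BY ce_campaign
"""


def plan(db, sql):
    """Return the human-readable steps of SQLite's query plan for sql."""
    return [row[-1] for row in db.execute("EXPLAIN QUERY PLAN " + sql)]
```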

        So there is a need to attach metadata needed for analytics (which may 
or may not be the same metadata "needed to operate the website") at runtime so 
that it can be run down the analytical data pipeline without needing to hit the 
live OLTP store continually asking things like "give me the campaign that this 
revision occurred under?" especially when things like "campaign" probably have 
no importance at all to the website itself.

        My thinking is that if we had a way of attaching arbitrary meta to 
revisions, then, in cases where the two needs are coincident, all we have to do 
is expose that same meta to analytics through their pixel service (revtagging) 
and we're good to go. If revtagging isn't up, or hasn't recorded it, we could 
still go back to the transactional store offline and amortize the missing 
information.
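
        Roughly, the flow I have in mind looks like the sketch below. 
Everything here is hypothetical: on_revision_saved stands in for a post-save 
hook, store for the transactional metadata, and emit for whatever sends an 
event to the pixel service. None of these are existing MediaWiki APIs.

```python
def on_revision_saved(rev_id, meta, store, emit):
    """Hypothetical post-save hook.

    Always records the metadata in the transactional store, then tries to
    push the same metadata down the analytics pipeline. Returns True if the
    event was emitted, False if the pipeline was unavailable.
    """
    store[rev_id] = dict(meta)
    try:
        emit({"rev_id": rev_id, **meta})
        return True
    except OSError:
        # Pixel service is down: the data still lives in the store.
        return False


def backfill(store, seen_rev_ids):
    """Offline pass: rebuild events for revisions analytics never received."""
    return [
        {"rev_id": rev_id, **meta}
        for rev_id, meta in sorted(store.items())
        if rev_id not in seen_rev_ids
    ]
```

        The live site never gets queried by analytics in the normal case; the 
transactional store is only read in the offline backfill pass.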

> Going back to the broader point, I'm completely lost as to why the Analytics
> team can't handle a structured database.

        I assume this last is a bit tongue-in-cheek, but I LOL'd… for 
completely different reasons.

[1]: http://www.mediawiki.org/wiki/Manual:Page_props_table
[2]: http://www.mediawiki.org/wiki/Manual:Recentchanges_table
[3]: http://www.mediawiki.org/wiki/Manual:Revision_table
[4]: http://www.mediawiki.org/wiki/Revtagging



terry chay  최태리
Director of Features Engineering
Wikimedia Foundation
“Imagine a world in which every single human being can freely share in the sum 
of all knowledge. That's our commitment.”

p: +1 (415) 839-6885 x6832
m: +1 (408) 480-8902
e: tc...@wikimedia.org
i: http://terrychay.com/
w: http://meta.wikimedia.org/wiki/User:Tychay
aim: terrychay
_______________________________________________
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l
