Re: [Wikitech-l] RDFa and Microdata in MediaWiki

Conrad Irwin Wed, 20 Jan 2010 10:41:00 -0800

On 01/20/2010 04:47 PM, Happy-melon wrote:
> 
> "Aryeh Gregor" <[email protected]> wrote in message 
> news:[email protected]...
>> On Mon, Jan 18, 2010 at 7:34 PM, Happy-melon <[email protected]> wrote:
>>
>> I bet very few people would bother adding metadata without a concrete
>> use.  And they'd probably get into fights with other people annoyed at
>> them for making it harder to edit wikitext.  This would all be
>> irrelevant if we only supported a few whitelisted vocabularies,
>> though, as the current microdata implementation does.  We should
>> encourage bulky and not-so-useful stuff to go in a separate stream.
> 
> Yes, very few people would bother.  Those few people would still introduce a 
> monstrous amount of extra markup by working deep in the template stack. 
> Doesn't take much to add kilobytes to large articles; I've added 5kb to 
> [[Barack Obama]] myself just by adding a span round reference brackets. 
> Just adding author metadata to citation templates would add seconds to load 
> times for large articles.
> 
>>> I would say it's definitely 'worth' exposing license metadata on every
>>> use of an image; the status of a page's images affects our whole terms
>>> of use, whether we can say "yes you can use all this in this fashion"
>>> versus "you have to jump through these hoops for these images because
>>> they're different".  Author, location, capture date; yes these probably
>>> aren't 'worth' the cost of exposing on pages.  But being able to search
>>> commons for all photos taken in Berlin between 1989 and 1991 would be
>>> worth its weight in gold.
>>
>> Sure -- but that can be exposed in a separate data stream, since >99.9%
>> of page views won't need it.
> 
> I'm not talking about exposing it in a data stream per se, I'm suggesting 
> that that's what our internal search would be able to achieve if the 
> metadata was accessible to MediaWiki.
> 
>>> Indeed, but that's data *output*, not input.  Currently our categories
>>> are input via [[Category:Foo]] and output via some HTML at the bottom of
>>> the page, but also via the API in a variety of formats; people use both
>>> methods to extract the metadata.  Once MW knows what data an object has,
>>> how it outputs that data back is totally open as you say.  So given that
>>> a translation into a format that MW understands is desirable for its own
>>> sake, and that from there it's trivial to translate back into whatever
>>> output format(s) the current web demands, why would we choose an input
>>> format like
>>>
>>> <span xmlns:dc="http://purl.org/dc/elements/1.1/"
>>> href="http://purl.org/dc/dcmitype/StillImage" property="dc:title"
>>> rel="dc:type">EmeryMolyneux-terrestrialglobe-1592-20061127.jpg</span>
>>> by <span xmlns:cc="http://creativecommons.org/ns#" href="#mw-image"
>>> property="cc:attributionName" rel="cc:attributionURL">Bob Smith</span>
>>> is licensed under a <a rel="license"
>>> href="http://creativecommons.org/licenses/by-sa/3.0/us/">Creative
>>> Commons Attribution-Share Alike 3.0 United States License</a>
>>>
>>> Rather than an input format like [[License::CC-BY-SA-3.0]]??
>>
>> First, why are you asking me why we would choose RDFa when I don't
>> think we should?  At least quote microdata.
>>
>> Second, this is apples to oranges.  Your RDFa sample a) says that the
>> work is a still image, b) gives its name, c) gives the author's name,
>> d) gives the URL of the license, e) contains user-visible prose.  Your
>> wikitext sample just gives the license name (not even a license URL!).
>> No kidding the latter is shorter.  A more realistic comparison might
>> be
>>
>> <p><span 
>> itemprop="title">EmeryMolyneux-terrestrialglobe-1592-20061127.jpg</span>
>> by <span itemprop="author">Bob Smith</span> is licensed under a <a
>> itemprop="license"
>> href="http://creativecommons.org/licenses/by-sa/3.0/us/">Creative
>> Commons Attribution-Share Alike 3.0 United States License</a>.</p>
>>
>> vs.
>>
>> <p>[[title::EmeryMolyneux-terrestrialglobe-1592-20061127.jpg|]]
>> by [[author::Bob Smith|]] is licensed under a
>> [[license::http://creativecommons.org/licenses/by-sa/3.0/us/|[http://creativecommons.org/licenses/by-sa/3.0/us/
>> Creative Commons Attribution-Share Alike 3.0 United States License]]].</p>
>>
>> or something, which is not such an easy call.  The wikitext is not
>> that much shorter or simpler -- particularly when you account for the
>> fact that you'd have to separately define mappings to concrete
>> microdata/RDFa/RDF vocabularies for output.  (Yes, I left out the
>> itemtype on the microdata, but again, that would have to be defined
>> somewhere for the wikisyntax too.)
> 
> True, the markup Dmitry offered is more suitable.  But Ryan is absolutely
> right.  You're only thinking about the *current* generation of formats,
> and assuming (maybe legitimately, I don't know) that microdata is the best
> format for us to use.  What happens when the next generation of format(s)
> comes out?  With a format-neutral input format, MW sites can quickly adapt
> to accommodate it.  Plus this method of data-injection will take much more
> work to allow MW to extract the data from the wikitext, which puts our
> searching-for-photos-in-Berlin issue further out of reach.
> 
> You could say that we're talking about different things again; that you're 
> talking about marking up data for external use.  But there's no reason why a 
> {{#prop:foo|bar}} magic word can't *also* output some appropriate metadata 
> format into the wikitext.  Marking up in a format-neutral syntax allows us 
> to output metadata from wikitext *and* from MW generally, and to change 
> *both* formats at the drop of a hat.  Marking up in a particular format, 
> whatever the format is, makes it damn near impossible (or at least 
> hopelessly hackish) to change wikitext output from one format to another, 
> and equally horrible for MW to collect data at all.


I do not like the idea of having a parser function that outputs the data
into the article - if people want the meta-data they can query it from
an API, or a dump, as opposed to screen-scraping. Perhaps meta-data on
image pages is useful, but if someone wants to get licenses of all the
images, surely providing a single file containing all is better than
screen-scraping for it (even RDFa/microdata is screen scraping, in my
opinion; it's just done with the hope that a developer has made it easy
for you - you will still have to deal with invalid uses of markup, and
the more complicated the markup, the more it will be used invalidly).

I would not be against whitelisting the necessary attributes to allow
wikis to put in these formats manually.

I do like the idea (a lot) of having a parser function that can put data
into a storage model inside MediaWiki (probably tabular, ideally
relational) that can be dumped like the current articles or queried
using the API. My original thoughts [0] had the wiki's technocrats
define a few "tables" which could be populated with the {{#store}} command.
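
A rough sketch of how such a {{#store}} call might feed a tabular store,
written in Python with SQLite purely for illustration (it is not MediaWiki
code). The argument syntax {{#store:table|column=value|...}}, the "images"
table, and the function names are all assumptions made up for this example;
nothing here is an existing MediaWiki API.

```python
import re
import sqlite3

# Assumed (hypothetical) syntax: {{#store:table|column=value|column=value}}
STORE_RE = re.compile(r"\{\{#store:([^|{}]+)\|([^{}]*)\}\}")

# Tables the wiki's administrators have agreed on (illustrative schema only).
SCHEMA = {"images": ("title", "author", "license")}


def make_db():
    """Create an in-memory database with one table per SCHEMA entry."""
    conn = sqlite3.connect(":memory:")
    for table, columns in SCHEMA.items():
        cols = ", ".join(["page"] + list(columns))
        conn.execute(f"CREATE TABLE {table} ({cols})")
    return conn


def parse_store_calls(wikitext):
    """Yield (table, {column: value}) for each {{#store}} call in the text."""
    for match in STORE_RE.finditer(wikitext):
        table = match.group(1).strip()
        row = {}
        for part in match.group(2).split("|"):
            if "=" in part:
                column, value = part.split("=", 1)
                row[column.strip()] = value.strip()
        yield table, row


def store_page(conn, page, wikitext):
    """Insert rows for calls that name a predefined table; return the count."""
    stored = 0
    for table, row in parse_store_calls(wikitext):
        columns = SCHEMA.get(table)
        if columns is None:
            continue  # table not defined by the wiki: ignore the call
        # Table names come only from SCHEMA, so the f-string is not fed
        # user input; the values themselves are bound as parameters.
        values = [page] + [row.get(c) for c in columns]
        placeholders = ", ".join("?" for _ in values)
        conn.execute(f"INSERT INTO {table} VALUES ({placeholders})", values)
        stored += 1
    return stored


if __name__ == "__main__":
    conn = make_db()
    text = "{{#store:images|title=Globe.jpg|author=Bob Smith|license=CC-BY-SA-3.0}}"
    store_page(conn, "File:Globe.jpg", text)
    query = "SELECT page, author FROM images WHERE license = ?"
    print(conn.execute(query, ("CC-BY-SA-3.0",)).fetchall())
```

Once the rows are in a table like this, "licenses of all the images"
becomes an ordinary query or a dump of one table, with no scraping of
rendered pages.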

Conrad

[0] http://en.wiktionary.org/w/index.php?oldid=6304302

> 
> --HM
>  
> 
> 
> 
> _______________________________________________
> Wikitech-l mailing list
> [email protected]
> https://lists.wikimedia.org/mailman/listinfo/wikitech-l
