On 03/04/2012 05:09 AM, Gabriel Wicke wrote:
> Hello Sebastian,
>
>> It comes down to these two options:
>> a) create one scraper configuration for each template, which captures
>> the creator's intention and allows one to "correctly" scrape the data
>> from all pages.
>> b) load all necessary template definitions into MediaWiki and then do a
>> transformation to HTML or XML and use XPath (or jQuery)
>>
>> On 01/12/2012 03:38 PM, Oren Bochman wrote:
>>> 2. the only application which (correctly!?) expands templates is
>>> MediaWiki itself.
>> (Thanks for your answer) I agree that only MediaWiki can "correctly"
>> expand templates, as it can interpret the code on the template pages.
>> The MediaWiki parser can transform Wiki Markup into XML and HTML. (I am
>> currently not aware of any other transformation options.)
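Option (b) above, transforming a rendered page to XML and querying it with XPath, might look like the following minimal sketch. The element and attribute names here are hypothetical; the real markup depends on how the expanded page is serialized.

```python
# Sketch of option (b): query a rendered-XML page with XPath-style
# selectors.  The <infobox>/<param> structure below is invented for
# illustration -- real MediaWiki output will differ.
import xml.etree.ElementTree as ET

rendered = """
<page title="Douglas Adams">
  <infobox template="Infobox person">
    <param name="name">Douglas Adams</param>
    <param name="born">1952-03-11</param>
  </infobox>
</page>
"""

root = ET.fromstring(rendered)
# ElementTree supports a limited XPath subset; full XPath needs lxml.
params = {p.get("name"): p.text for p in root.findall(".//infobox/param")}
print(params)  # {'name': 'Douglas Adams', 'born': '1952-03-11'}
```

The appeal of this option is that template expansion is delegated to MediaWiki itself, and the scraper only has to know the output structure, not the template code.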
>
> We are currently working on http://www.mediawiki.org/wiki/Parsoid, a
> JS parser that already expands templates well and also supports a few
> parser functions. We need to mark up template parameters for the
> visual editor in any case, and plan to employ HTML5 microdata or RDFa
> for this purpose (see
> http://www.mediawiki.org/wiki/Parsoid/HTML5_DOM_with_microdata). I
> intend to start implementing this sometime this month. Let us know if
> you have feedback / ideas on the microdata or RDFa design.
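To illustrate what scraping against microdata-annotated output could look like: the fragment and attribute vocabulary below (`mw:Template`, `data-template`, the `itemprop` names) are made up for this sketch, not taken from the Parsoid proposal, but the idea is that a consumer can recover template parameters from the HTML without parsing any WikiText.

```python
# Sketch: pull template parameters out of a microdata-annotated HTML
# fragment using only the standard library.  All attribute names here
# are hypothetical placeholders for whatever vocabulary Parsoid adopts.
from html.parser import HTMLParser

annotated = (
    '<div itemscope itemtype="mw:Template" data-template="Infobox person">'
    '<span itemprop="name">Douglas Adams</span>'
    '<span itemprop="born">1952-03-11</span>'
    '</div>'
)

class PropCollector(HTMLParser):
    """Collect itemprop -> text pairs from a fragment."""
    def __init__(self):
        super().__init__()
        self.props = {}
        self._current = None

    def handle_starttag(self, tag, attrs):
        # Remember the itemprop name (if any) of the tag just opened.
        self._current = dict(attrs).get("itemprop")

    def handle_data(self, data):
        if self._current:
            self.props[self._current] = data
            self._current = None

collector = PropCollector()
collector.feed(annotated)
print(collector.props)  # {'name': 'Douglas Adams', 'born': '1952-03-11'}
```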
>
>> To ask more precisely:
>> Is there a best practice for scraping data from Wikipedia? What is the
>> smartest way to resolve templates for scraping? Am I not seeing any
>> third option?
>
> AFAIK most scraping is based on parsing the WikiText source. This gets
> you the top-most template parameters, which might already be good
> enough for many of your applications.
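A minimal sketch of that WikiText-source approach, extracting the top-most parameters of a single template call. A real scraper needs a proper parser (piped links, `<nowiki>`, comments, and so on); this only handles nesting of further `{{...}}` calls inside parameter values.

```python
# Extract the top-level parameters of one template call from WikiText.
# Nested templates are kept verbatim as the parameter value, matching
# the "top-most parameters" behaviour described above.
def template_params(wikitext):
    inner = wikitext.strip()[2:-2]          # drop the outer {{ }}
    parts, depth, buf = [], 0, ""
    for ch in inner:
        if ch == "|" and depth == 0:        # split only at top level
            parts.append(buf)
            buf = ""
            continue
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
        buf += ch
    parts.append(buf)
    params = {}
    for part in parts[1:]:                  # parts[0] is the template name
        name, _, value = part.partition("=")
        params[name.strip()] = value.strip()
    return params

source = "{{Infobox person|name=Douglas Adams|born={{birth date|1952|3|11}}}}"
print(template_params(source))
# {'name': 'Douglas Adams', 'born': '{{birth date|1952|3|11}}'}
```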
>
> We try to provide provenance information for expanded content in the
> HTML DOM produced by Parsoid. Initially this will likely focus on
> top-level arguments, as that is all we need for the editor. Extending
> this to nested expansions should be quite straightforward however, as
> provenance is tracked per-token internally.
>
> Gabriel


_______________________________________________
Wikitext-l mailing list
[email protected]
https://lists.wikimedia.org/mailman/listinfo/wikitext-l
