On 03/04/2012 05:09 AM, Gabriel Wicke wrote:
> Hello Sebastian,
>
>> It comes down to these two options:
>> a) create one scraper configuration for each template, which captures
>> the intention of the creator and allows one to "correctly" scrape the
>> data from all pages.
>> b) load all necessary template definitions into MediaWiki, then do a
>> transformation to HTML or XML and use XPath (or jQuery)
>>
>> On 01/12/2012 03:38 PM, Oren Bochman wrote:
>>> 2. the only application which (correctly!?) expands templates is
>>> MediaWiki itself.
>> (Thanks for your answer.) I agree that only MediaWiki can "correctly"
>> expand templates, as it can interpret the code on the template pages.
>> The MediaWiki parser can transform wiki markup into XML and HTML. (I am
>> currently not aware of any other transformation options.)
>
> We are currently working on http://www.mediawiki.org/wiki/Parsoid, a
> JS parser that by now expands templates well and also supports a few
> parser functions. We need to mark up template parameters for the
> visual editor in any case, and plan to employ HTML5 microdata or RDFa
> for this purpose (see
> http://www.mediawiki.org/wiki/Parsoid/HTML5_DOM_with_microdata). I
> intend to start implementing this sometime this month. Let us know if
> you have feedback / ideas on the microdata or RDFa design.
>
>> To ask more precisely:
>> Is there a best practice for scraping data from Wikipedia? What is the
>> smartest way to resolve templates for scraping? Am I not seeing a
>> third option?
>
> AFAIK most scraping is based on parsing the WikiText source. This gets
> you the top-most template parameters, which might already be good
> enough for many of your applications.
>
> We try to provide provenance information for expanded content in the
> HTML DOM produced by Parsoid. Initially this will likely focus on
> top-level arguments, as that is all we need for the editor.
> Extending this to nested expansions should be quite straightforward,
> however, as provenance is tracked per-token internally.
>
> Gabriel
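The WikiText-source approach Gabriel mentions (extracting only the top-most template parameters) can be sketched roughly as below. This is a minimal illustration, not a full wikitext parser: it only tracks `{{`/`}}` nesting depth so that nested transclusions survive intact inside their parent's parameter values, and it ignores parser functions, `<nowiki>`, tables, and many other constructs that a real parser (e.g. the mwparserfromhell library) handles.

```python
def top_level_templates(wikitext):
    """Return a list of (name, params) for top-level {{...}} transclusions.

    A sketch only: assumes well-formed, balanced braces and none of the
    trickier wikitext constructs (parser functions, <nowiki>, tables).
    """
    templates = []
    depth = 0
    start = None
    i = 0
    while i < len(wikitext) - 1:
        pair = wikitext[i:i + 2]
        if pair == '{{':
            if depth == 0:
                start = i + 2
            depth += 1
            i += 2
        elif pair == '}}':
            depth -= 1
            if depth == 0:
                templates.append(parse_template(wikitext[start:i]))
            i += 2
        else:
            i += 1
    return templates

def split_params(body):
    """Split on '|' only at nesting depth 0, keeping nested {{...}} whole."""
    parts, current, depth, i = [], [], 0, 0
    while i < len(body):
        pair = body[i:i + 2]
        if pair in ('{{', '}}'):
            depth += 1 if pair == '{{' else -1
            current.append(pair)
            i += 2
        elif body[i] == '|' and depth == 0:
            parts.append(''.join(current))
            current = []
            i += 1
        else:
            current.append(body[i])
            i += 1
    parts.append(''.join(current))
    return parts

def parse_template(body):
    """Split a template body into its name and a dict of parameters."""
    parts = split_params(body)
    name = parts[0].strip()
    params, unnamed = {}, 0
    for part in parts[1:]:
        key, sep, value = part.partition('=')
        if sep:
            params[key.strip()] = value.strip()
        else:
            # Unnamed parameters get positional keys, as MediaWiki does.
            unnamed += 1
            params[str(unnamed)] = part.strip()
    return name, params
```

As the thread notes, this only recovers the parameters as written in the source; anything produced inside the template expansion itself is invisible at this level.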
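Once template parameters are marked up in Parsoid's expanded HTML DOM, option (b) from the thread (XPath over the output) becomes straightforward. A small sketch, with the caveat that the microdata attribute names below are purely illustrative placeholders, not the actual vocabulary from the linked Parsoid/HTML5_DOM_with_microdata proposal:

```python
import xml.etree.ElementTree as ET

# Hypothetical Parsoid-style output fragment: template arguments carried
# as HTML5 microdata. Attribute names and values here are invented for
# illustration only.
HTML = """
<div itemscope="itemscope" itemtype="mw:Transclusion">
  <span itemprop="name">Ada Lovelace</span>
  <span itemprop="birth_date">10 December 1815</span>
</div>
"""

root = ET.fromstring(HTML)
# ElementTree supports a limited XPath subset, enough for this query:
values = {el.get('itemprop'): el.text
          for el in root.findall(".//span[@itemprop]")}
```

The same query works in any XPath-capable tool, which is the appeal of option (b): the scraper no longer needs to understand template syntax at all, only the (stable) markup conventions of the expanded DOM.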
_______________________________________________
Wikitext-l mailing list
[email protected]
https://lists.wikimedia.org/mailman/listinfo/wikitext-l
