On 01/12/2012 05:37 AM, Sebastian Hellmann wrote:
> Hello all,
> is there a query language for wiki syntax?
> (NOTE: I really do not mean the Wikipedia API here.)
> 
> I am looking for an easy way to scrape data from Wiki pages.
> In this way, we could apply a crowd-sourcing approach to knowledge 
> extraction from Wikis.
> 
> There must be thousands of data scraping approaches. But is there one 
> amongst them that has developed a "wiki scraper language" ?
> Maybe with some sort of fuzziness involved, if the pages are too messy.
> I have not yet worked with the XML transformation of the wiki markup:
> 
> * action=expandtemplates *
>    generatexml         - Generate XML parse tree
> 
> Is it any good for issuing XPATH queries ?

You could use an HTML parser to build a DOM of the rendered page and
then process it with plain DOM methods or jQuery. One example is the
'html5' node.js module, which produces a DOM compatible with jQuery.
More specialized HTML scraping libraries are also available in various
languages.

The rendered HTML lacks some of the information available in the wiki
source, so you may have to rely on CSS class / tag pairs to identify
template output.
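
To make the class / tag idea concrete, here is a minimal sketch. It uses
Python's standard-library html.parser instead of the node.js module
mentioned above, purely to stay self-contained. The names scrape and
ClassTagScraper, the "infobox" class and the sample HTML are all
assumptions for illustration, not something a particular wiki guarantees.

```python
from html.parser import HTMLParser

# Elements that never get a closing tag, so they must not affect nesting depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class ClassTagScraper(HTMLParser):
    """Collect the text inside every <tag> element carrying css_class."""

    def __init__(self, tag, css_class):
        super().__init__()
        self.tag = tag
        self.css_class = css_class
        self.depth = 0     # > 0 while we are inside a matching element
        self.chunks = []   # text pieces of the match in progress
        self.matches = []  # text of every finished match

    def handle_starttag(self, tag, attrs):
        if self.depth:
            # Already inside a match: only track nesting.
            if tag not in VOID_TAGS:
                self.depth += 1
            return
        classes = (dict(attrs).get("class") or "").split()
        if tag == self.tag and self.css_class in classes:
            self.depth = 1
            self.chunks = []

    def handle_endtag(self, tag):
        if self.depth and tag not in VOID_TAGS:
            self.depth -= 1
            if self.depth == 0:
                self.matches.append(" ".join(self.chunks))

    def handle_data(self, data):
        if self.depth and data.strip():
            self.chunks.append(data.strip())


def scrape(html, tag, css_class):
    """Return the text of each <tag class="... css_class ..."> element."""
    parser = ClassTagScraper(tag, css_class)
    parser.feed(html)
    parser.close()
    return parser.matches


# Hypothetical rendered output of an infobox template.
SAMPLE_HTML = (
    '<div><table class="infobox vcard">'
    '<tr><th>Capital</th><td>Berlin<br>(city)</td></tr>'
    '</table><p class="note">unrelated</p></div>'
)

print(scrape(SAMPLE_HTML, "table", "infobox"))
```

This assumes reasonably well-nested markup. For messy pages, a real
HTML5 parser that repairs the tree first would be the safer choice.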

Gabriel


_______________________________________________
Wikitext-l mailing list
[email protected]
https://lists.wikimedia.org/mailman/listinfo/wikitext-l
