Robert Ullmann wrote:
>> I've been spending hours on the parsing now and don't find it simple
>> at all due to the fact that templates can be nested. Just extracting
>> the Infobox as one big lump is hard due to the need to match nested {{
>> and }}
>>
>> Andrew Dunbar (hippietrail)
>>
>
> Hi,
>
> Come now, you are over-thinking it. Find "{{Infobox [Ll]anguage" in
> the text, then count braces. Start at depth=2, count up and down until
> you reach 0, and you are at the end of the template. (You can be picky
> about only counting them if paired, if you like ;-)
>
> Then just regex match the lines/parameters you want.
>
> However, if you are pulling the wikitext with the API, the XML parse
> tree option sounds good; then you can just use ElementTree (or the
> like) and pull out the parameters directly.
>
> Robert
>
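If you go the XML route, pulling the parameters out with ElementTree might look roughly like this. Treat it as a sketch under assumptions: the sample XML is hand-written, the element names (template, title, part, name, value) are my assumption about the shape of the parse tree, and template_params is a made-up helper name.

```python
import xml.etree.ElementTree as ET

# Hand-written sample; an assumed shape, not real API output.
SAMPLE_XML = (
    "<root><template><title>Infobox language</title>"
    "<part><name>name</name>=<value>English</value></part>"
    "<part><name>iso3</name>=<value>eng</value></part>"
    "</template></root>"
)

# Hypothetical helper; the name and signature are not from this thread.
def template_params(xml_text, title_prefix="Infobox language"):
    """Return {name: value} for the first matching template, or None."""
    root = ET.fromstring(xml_text)
    for template in root.iter("template"):
        title = (template.findtext("title") or "").strip()
        if not title.lower().startswith(title_prefix.lower()):
            continue
        params = {}
        for part in template.findall("part"):
            name = (part.findtext("name") or "").strip()
            value_el = part.find("value")
            # itertext() flattens any nested elements inside the value
            value = "".join(value_el.itertext()).strip() if value_el is not None else ""
            params[name] = value
        return params
    return None

params = template_params(SAMPLE_XML)
```

Nested templates inside a value are flattened to their text here, which loses structure; walking the child elements instead would keep it.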
Or you could use the pyparsing Python library, which lets you
implement the grammar of your choice and makes extracting nested
templates trivial. With the psyco package to accelerate it, you can
parse a whole en: dump in a few hours.
See the code below for a sample grammar...
-- Neil
------------------------------------------------
# Use pyparsing, enablePackrat() _and_ psyco for a considerable speed-up
from pyparsing import *
import psyco
# These two must be in the correct order, or bad things will happen
ParserElement.enablePackrat()
psyco.full()
# Forward declaration: templates can contain other templates
wikitemplate = Forward()
wikilink = Combine("[[" + SkipTo("]]") + "]]")
wikiargname = CharsNotIn("|{}=")
wikiargval = ZeroOrMore(
    wikilink | Group(wikitemplate) | CharsNotIn("[|{}") | "[" | "{" |
    Regex("}[^}]"))
# Arguments without a "name=" prefix get the placeholder name "??"
wikiarg = Group(Optional(wikiargname + Suppress("="), default="??") +
                wikiargval)
wikitemplate << (Suppress("{{") + wikiargname + Optional(Suppress("|") +
                 delimitedList(wikiarg, "|")) + Suppress("}}"))
wikitext = ZeroOrMore(CharsNotIn("{") | Group(wikitemplate) | "{")

def parse_page(text):
    return wikitext.parseString(text)