On Feb 8, 2008 3:31 PM, Steve Bennett <[EMAIL PROTECTED]> wrote:
> On 2/9/08, Magnus Manske <[EMAIL PROTECTED]> wrote:
> > My
> > http://tools.wikimedia.de/~magnus/wiki2xml/w2x.php
> > parses it correctly as well, but it's still manual PHP hacks, while
> > yours is a real parser - respect!
>
> Not too much respect. I think I have only *just* worked out why I need
> all these syntactic predicates and what backtracking is used for.
>
> Throughout my grammar everywhere I have had to place these predicates like
> this:
>
> ((LEFT_BRACKET LEFT_BRACKET LEFT_BRACKET) => literal_left_bracket
>     // try and save it some time on [[[foo]]]?
> |(literal_left_bracket bracketed_url) => literal_left_bracket
> |(image) => image
> |(category) => category
> |(external_link) => external_link
> |(internal_link) => internal_link
> |(magic_link) => magic_link
> |pre_block
> |(formatted_text_elem) => formatted_text_elem
> )
>
>
> The bit before the => on each line basically says "look ahead, and if
> the syntax matches the bit in brackets, then go ahead and parse it as
> the bit after the =>".
>
> I never knew why I needed them to make it work, but now I see: in the
> case of an image, if it just dove straight into trying to parse a
> string like [[image:foo]] (not a valid image), it would hit the first
> [[, think the image rule matched, and keep going. Eventually it would
> realise the rule didn't match but it would be too late: because the
> grammar is blatantly not LALR (I think?), it would just fail (unless
> it could backtrack, which I'm not using). By using the syntactic
> predicate, it's able to prevent itself from falling in a hole - it
> looks ahead, sees "that looks like an image...oh wait, no it's not!",
> and tries the next rule instead.
>
> There's a huge amount of messiness in the grammar so far caused by me
> not really understanding this stuff. I also haven't been very clean
> about where newlines and whitespace are handled exactly.
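
The lookahead behaviour described above can be sketched in a few lines of Python. This is only an illustration of the idea, not the ANTLR grammar from the thread: the Parser class, its rule methods and the "an image name needs a file extension" check are all hypothetical stand-ins, chosen so that [[image:foo]] fails the image rule partway through.

```python
class ParseError(Exception):
    """Raised when a rule discovers, partway through, that it does not match."""


class Parser:
    def __init__(self, text):
        self.text = text
        self.pos = 0

    def expect(self, literal):
        # Consume an exact literal or fail.
        if not self.text.startswith(literal, self.pos):
            raise ParseError(f"expected {literal!r} at {self.pos}")
        self.pos += len(literal)

    def until(self, stop_chars):
        # Consume at least one character up to (not including) a stop character.
        start = self.pos
        while self.pos < len(self.text) and self.text[self.pos] not in stop_chars:
            self.pos += 1
        if self.pos == start:
            raise ParseError(f"empty name at {start}")
        return self.text[start:self.pos]

    def image(self):
        self.expect("[[")
        self.expect("image:")
        name = self.until("]|")
        # Assumption for illustration only: a valid image needs an extension.
        if "." not in name:
            raise ParseError("image name needs an extension")
        self.expect("]]")
        return ("image", name)

    def internal_link(self):
        self.expect("[[")
        target = self.until("]|")
        self.expect("]]")
        return ("internal_link", target)

    def matches(self, rule):
        # Syntactic predicate: run the rule speculatively, then always rewind.
        saved = self.pos
        try:
            rule()
            return True
        except ParseError:
            return False
        finally:
            self.pos = saved

    def element(self):
        # Mirrors "(image) => image | (internal_link) => internal_link | ..."
        for rule in (self.image, self.internal_link):
            if self.matches(rule):
                return rule()
        # Nothing matched: fall back to a single literal character.
        ch = self.text[self.pos]
        self.pos += 1
        return ("text", ch)


def parse(text):
    parser = Parser(text)
    nodes = []
    while parser.pos < len(parser.text):
        nodes.append(parser.element())
    return nodes
```

Without matches(), calling image() directly on [[image:foo]] would raise halfway through the input; with it, the parser rewinds and tries internal_link instead. The cost is that every successful alternative is parsed twice, once speculatively and once for real.
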
>
>
> Anyway, my latest rant about tables (sorry Magnus :)) In the following
> table, which part is the style attribute for a table cell, and which
> part is the cell contents:
>
> {|
> |an [[image:foo.jpg|thumb|blah|]] or [[blaah|moo|wah]] floop | moop
> |}
>
> (reminder: cell definitions with style attributes look like this: |
> style | contents ||...)
>
> Buggered if I know. I might have to impose a rule involving the range
> of possible characters that could appear in the style attribute. I
> didn't really want to have to actually parse that bit properly...

That's exactly what I did in wiki2xml, and it works (yesss, still ahead ;-)

Of course, I cheap out in another regard there: wiki2xml parses images
and links alike, and parses even links with "too many" parameters. My
reasons for that:
* Laziness
* No need to know the language/wiki settings (which make "Image:"
  special for en)
* Flexible for "add-ons" (who knows, we might use three-part links
  someday...)
* Not much additional burden for the next level (XML-to-something)

Cheers,
Magnus

_______________________________________________
Wikitext-l mailing list
[email protected]
https://lists.wikimedia.org/mailman/listinfo/wikitext-l
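
The character-range rule for style attributes discussed above could look roughly like the Python sketch below. It is not taken from wiki2xml or from the ANTLR grammar: the split_cell name and the ATTRS pattern are hypothetical, and the pattern assumes that a style attribute is nothing but name=value pairs, so anything containing brackets or free text is treated as cell contents.

```python
import re

# Assumption: a style attribute is one or more name=value pairs, where the
# value is double-quoted, single-quoted, or a bare token without brackets.
ATTRS = re.compile(
    r'''\s*(?:[A-Za-z_:][\w:.-]*\s*=\s*(?:"[^"]*"|'[^']*'|[^\s"'|\[\]]+)\s*)+'''
)


def split_cell(cell):
    """Split cell text (leading '|' already removed) into (style, contents).

    The part before the first '|' counts as a style attribute only if it
    consists entirely of attribute-like characters; otherwise the whole
    cell is contents and style is None.
    """
    head, sep, tail = cell.partition("|")
    if sep and ATTRS.fullmatch(head):
        return head.strip(), tail.strip()
    return None, cell.strip()
```

Under this rule the cell from the example table has no style attribute at all, because the text before its first pipe starts with "an [[image:foo.jpg". A quoted attribute value that itself contains a pipe would be split in the wrong place, so this is a heuristic rather than a real attribute parser.
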
