On Feb 8, 2008 3:31 PM, Steve Bennett <[EMAIL PROTECTED]> wrote:
> On 2/9/08, Magnus Manske <[EMAIL PROTECTED]> wrote:
> > My
> > http://tools.wikimedia.de/~magnus/wiki2xml/w2x.php
> > parses it correctly as well, but it's still manual PHP hacks, while
> > yours is a real parser - respect!
>
> Not too much respect. I think I have only *just* worked out why I need
> all these syntactic predicates and what backtracking is used for.
>
> Throughout my grammar I have had to place predicates like this:
>
>     ((LEFT_BRACKET LEFT_BRACKET LEFT_BRACKET) => literal_left_bracket
> // try and save it some time on [[[foo]]]?
>     |(literal_left_bracket bracketed_url) => literal_left_bracket
>     |(image)              => image
>     |(category)           => category
>     |(external_link)      => external_link
>     |(internal_link)      => internal_link
>     |(magic_link)         => magic_link
>     |pre_block
>     |(formatted_text_elem) => formatted_text_elem
>     )
>
>
> The bit before the => on each line basically says "look ahead, and if
> the syntax matches the bit in brackets, then go ahead and parse it as
> the bit after the =>".
>
> I never knew why I needed them to make it work, but now I see: in the
> case of an image, if it just dove straight into trying to parse a
> string like [[image:foo]] (not a valid image), it would hit the first
> [[, think the image rule matched, and keep going. Eventually it would
> realise the rule didn't match but it would be too late: because the
> grammar is blatantly not LALR (I think?), it would just fail (unless
> it could backtrack, which I'm not using). By using the syntactic
> predicate, it's able to prevent itself from falling in a hole - it
> looks ahead, sees "that looks like an image...oh wait, no it's not!",
> and tries the next rule instead.
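
The look-ahead behaviour described above can be sketched in Python. This is
only an illustration, not the ANTLR grammar from this thread: the rule
functions, the regular expressions (including the assumption that a valid
image needs a file extension) and the name parse_element are all
hypothetical. Each rule is tried speculatively and consumes nothing unless
it matches completely, so a failed attempt simply falls through to the next
alternative.

```python
import re

# Hypothetical, heavily simplified stand-ins for grammar rules.
# Assumption: a "valid" image needs a file extension; plain links do not.
IMAGE_RE = re.compile(
    r"\[\[image:[^\[\]|]+\.(?:jpg|png|gif)(?:\|[^\[\]]*)?\]\]", re.I
)
LINK_RE = re.compile(r"\[\[[^\[\]]+\]\]")


def image(text, pos):
    """Return the end offset of an image at pos, or None if there is none."""
    m = IMAGE_RE.match(text, pos)
    return m.end() if m else None


def internal_link(text, pos):
    """Return the end offset of an internal link at pos, or None."""
    m = LINK_RE.match(text, pos)
    return m.end() if m else None


def literal_char(text, pos):
    """Fallback rule: consume a single character as plain text."""
    return pos + 1 if pos < len(text) else None


# Ordered alternatives, tried top to bottom like the predicated rule list.
ALTERNATIVES = [
    ("image", image),
    ("internal_link", internal_link),
    ("text", literal_char),
]


def parse_element(text, pos):
    """Try each alternative speculatively and commit to the first full match."""
    for name, rule in ALTERNATIVES:
        end = rule(text, pos)  # the "predicate": look ahead, consume nothing
        if end is not None:
            return name, text[pos:end], end
    raise SyntaxError(f"no rule matches at offset {pos}")


print(parse_element("[[image:foo]]", 0))
print(parse_element("[[image:foo.jpg|thumb]]", 0))
```

With these assumed rules, [[image:foo]] is rejected by the image rule and
picked up by the internal link rule instead of causing a parse failure.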
>
> There's a huge amount of messiness in the grammar so far caused by me
> not really understanding this stuff. I also haven't been very clean
> about where newlines and whitespace are handled exactly.
>
>
> Anyway, my latest rant about tables (sorry Magnus :)) In the following
> table, which part is the style attribute for a table cell, and which
> part is the cell contents:
>
> {|
> |an [[image:foo.jpg|thumb|blah|]] or [[blaah|moo|wah]] floop | moop
> |}
>
> (reminder: cell definitions with style attributes look like this:  |
> style | contents ||...)
>
> Buggered if I know. I might have to impose a rule involving the range
> of possible characters that could appear in the style attribute. I
> didn't really want to have to actually parse that bit properly...

That's exactly what I did in wiki2xml, and it works (yesss, still ahead;-)
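
A rule of that kind could look roughly like the Python sketch below. It is
not the wiki2xml code (which is PHP); the function name split_cell and the
set of characters allowed in the attribute part are assumptions made for
illustration. The text before the first single "|" counts as the style
attribute only if every character in it falls within the allowed range.

```python
import re

# Assumption: characters that may appear in a cell's attribute part,
# e.g. style="color:red". Brackets are deliberately excluded.
ATTR_CHARS = re.compile(r"""[\w\s="':;#%.,()-]+""")


def split_cell(cell):
    """Split the text after a cell's leading '|' into (attributes, contents).

    Returns (None, cell) when the text before the first '|' does not look
    like an attribute list.
    """
    idx = cell.find("|")
    # "||" starts the next cell on the same line; it is not a separator
    # between attributes and contents.
    if idx == -1 or cell.startswith("||", idx):
        return None, cell
    prefix = cell[:idx]
    if ATTR_CHARS.fullmatch(prefix):
        return prefix.strip(), cell[idx + 1:]
    return None, cell


row = "an [[image:foo.jpg|thumb|blah|]] or [[blaah|moo|wah]] floop | moop"
print(split_cell(row))
print(split_cell('style="color:red" | contents'))
```

For the table quoted above, the text before the first "|" contains "[[",
so under this assumed rule the whole cell is treated as contents. The rule
is coarse: plain text followed by a "|" would still pass as an attribute.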

Of course, I cheap out in another regard there: wiki2xml parses images
and links alike, and parses even links with "too many" parameters. My
reasons for that:
* Laziness
* No need to know the language/wiki settings (which make "Image:"
special for en)
* Flexible for "add-ons" (who knows, we might use three-part links someday...)
* Not much additional burden for the next level (XML-to-something)
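
As a rough illustration of treating images and links alike, here is a
minimal Python sketch. It is not the wiki2xml implementation; the function
name parse_link and the returned dictionary layout are hypothetical. The
parser does not interpret the namespace prefix and accepts any number of
"|"-separated parameters, leaving their meaning to the next stage.

```python
def parse_link(wikitext):
    """Parse a [[...]] construct into a target and a list of parameters.

    No distinction is made between images, categories and ordinary links,
    and nested links inside parameters are not handled.
    """
    if not (wikitext.startswith("[[") and wikitext.endswith("]]")):
        raise ValueError("not a [[...]] link")
    # Everything before the first '|' is the target; the rest are parameters.
    target, *params = wikitext[2:-2].split("|")
    return {"target": target, "params": params}


print(parse_link("[[image:foo.jpg|thumb|blah|]]"))
print(parse_link("[[blaah|moo|wah]]"))
```

A later XML-to-something step can then decide whether "image:" is special
on a given wiki and what to do with extra parameters.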

Cheers,
Magnus
