2010-08-22 17:34, David Gerard wrote:
> On 22 August 2010 16:09, Andreas Jonsson <[email protected]> wrote:
>
>> The parser I am writing follows a four layered design:
>
> ... I feel like I just looked Cthulhu in the eye.

:-) The complexity can't really be helped, though. I can't find any
simpler way of structuring the parser.

> If you survive this, you'll deserve a Wikipedia holiday declared in your name.

I have already implemented the lexer the way I described it. It works
fine. I did some profiling, and it's still the parser that is the slowest
component, not the lexer. I find that a bit surprising myself.

The current status is that it supports all HTML listed in Sanitizer.php
except <pre>, <img>, and <hr>, as well as lists, tables, headings, table
of contents, indented text, apostrophe formatting, nowiki, horizontal
rule, and internal links (except that the link title isn't validated, and
the trail and prefix aren't implemented).

As far as I can see, the major missing things are image/media links and
external links. Also, indented text shouldn't be treated as such if it
contains a block-level HTML element, so I must introduce another
lookahead for this.

I think I have solved the really hard parts. Now it's just a matter of
polishing the details. I'll try to upload the whole thing to Wikimedia's
Subversion repository sometime in the next few days.

/Andreas

_______________________________________________
Wikitext-l mailing list
[email protected]
https://lists.wikimedia.org/mailman/listinfo/wikitext-l
