Re: [Wikitext-l] On tokenizing wikitext

Andreas Jonsson Sun, 22 Aug 2010 12:44:07 -0700

2010-08-22 17:34, David Gerard wrote:
> On 22 August 2010 16:09, Andreas Jonsson<[email protected]>  wrote:
>
>    
>> The parser I am writing follows a four layered design:
>>      
>
> ... I feel like I just looked Cthulhu in the eye.
>
>


:-)

The complexity can't really be helped, though.  I can't find any simpler
way of structuring the parser.


> If you survive this, you'll deserve a Wikipedia holiday declared in your name.
>
>    
I have already implemented the lexer the way I described
it.  It works fine.

I did some profiling, and it's still the parser that is the slowest
component, not the lexer.  I find that a bit surprising myself.

The current status is that it supports all HTML listed in
Sanitizer.php except <pre>, <img>, and <hr>.  It also supports lists,
tables, headings, table of contents, indented text, apostrophe
formatting, nowiki, horizontal rules, and internal links (except that
the link title isn't validated, and the trail and prefix aren't
implemented).

As far as I can see, the major missing things are image/media links
and external links.  Also, the indented text shouldn't be considered
as such if it contains a block html element.  Thus, I must introduce
another lookahead for this.
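
To illustrate the kind of lookahead meant here, a minimal sketch in
Python follows.  It is not the actual lexer code: the function name,
the regular expression, and the short list of block elements are all
assumptions made up for this example.  The idea is only that a line
starting with a space is scanned ahead for a block HTML tag before the
lexer commits to treating it as indented text.

```python
import re

# Hypothetical subset of block-level HTML element names. The real list
# would come from the parser's own tables, not from this sketch.
BLOCK_ELEMENTS = ("div", "p", "table", "blockquote", "ul", "ol", "h1", "h2")

# Matches an opening or closing tag for any of the names above.
# The \b keeps "<p" from matching the start of "<pre>".
BLOCK_TAG_RE = re.compile(
    r"</?(?:" + "|".join(BLOCK_ELEMENTS) + r")\b[^>]*>", re.IGNORECASE
)


def starts_indented_text(line: str) -> bool:
    """Return True if `line` should open an indented-text block."""
    if not line.startswith(" "):
        return False
    # Lookahead: scan the whole line before committing to the token.
    # A block HTML element anywhere on the line disqualifies it.
    return BLOCK_TAG_RE.search(line) is None
```

With these assumptions, a line such as " plain text" or
" <span>inline</span>" counts as indented text, while " <div>x</div>"
does not.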

I think I have solved the really hard parts.  Now it's just a matter of
polishing the details.  I'll try to upload the whole thing to Wikimedia's
subversion repository sometime in the next few days.


/Andreas


_______________________________________________
Wikitext-l mailing list
[email protected]
https://lists.wikimedia.org/mailman/listinfo/wikitext-l
