On 2010-08-12 00:13, Platonides wrote:
> Andreas Jonsson wrote:
>    
>> I have previously stated that the most complex thing in the MediaWiki
>> wikitext seemed to be the apostrophes.  I was wrong.
>>
>> Links are divided into internal and external.  Here I will write about
>> internal links.  Internal links can further be subdivided into regular
>> and image-links, image-links complicate things even further, but I
>> will not discuss them here.
>>
>>      
> (...)
>    
>> Before the pattern matching, the whole article is split into pieces
>> using the '[[' token as delimiter.  A pattern match for all variants
>> of internal link patterns (sans the '[[') is attempted at the
>> beginning of each piece of text.  The consequence is that the closing
>> ']]' will be matched with the last opening '[[' preceding it.
>>
>> If the pattern match fails, no link is produced; the wikitext is
>> outputted and may be processed again to produce other constructions
>> later on.
>>
>> The fundamental problem with all this is the fact that when looking at
>> the wikitext '[[Link|', it cannot easily be determined if this is
>> actually a link, or if it is just the text '[[Link|'.  It is not a
>> link unless there is a closing ']]' token that is not preceded by
>> another '[[' token.  The closing token may appear almost anywhere:
>>      
> (funny links)
>
> That's because after preprocessing, tables are done before anything
> else. This wasn't like this originally, but after some vulnerabilities
> based on parsing after tables, they were moved there.
>
> They cannot appear inside of ids or classes because that would be an
> illegal id/class name. So it'll be moved into the parameter, then the
> Sanitizer strips them, I suppose.
>    
The Sanitizer allows both "]]" and "</a>" as class names.

> Don't rely on these things. Consider it unsupported.
>
>
>    
>> In fact, the current MediaWiki parser does not seem to parse
>> links in linear time using linear amount of memory.  My test server
>> failed to process a preview of an article consisting of about 24000
>> links of the form [[a]]. It was working hard before it, I
>> guess, ran out of memory.  As a comparison it parsed over 38000 italic
>> a's, ''a'', without problems.
>>      
> You can parse them more or less easily because on each '' you can output
> a <i> or </i>.
>
> We could make the parser output a link each time it finds one, but that
> would require a db query per link (to see if it's blue or red) on the
> page. So instead they are added to a list, checked in one big query
> replaced back at the end.
> Your OOM will be noticing that listing and the replacements.
>
>    

I don't think that this really explains how the server can run out of
memory on such a small input (less than 150 kB).  The splitting of the
input and the regexp should run in linear time as far as I can tell,
but unless there is a very large constant overhead per link, I believe
that something is consuming a super-linear amount of memory.  An
n log n step somewhere might explain it, though.
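For reference, the split-and-match strategy described earlier in the
thread can be modeled in a few lines.  This is a toy Python sketch,
not MediaWiki's actual PHP code, and the link pattern is a
simplification; it only shows why the closing ']]' pairs with the last
preceding '[[', and why the whole pass is linear in the input size:

```python
import re

# Simplified model of the split-and-match strategy: split on '[[' and
# try to match a link pattern at the start of each piece.  Each piece
# is examined once, so the pass is linear in the input length.
LINK_RE = re.compile(r'^([^\[\]|]+)(?:\|([^\[\]]*))?\]\]')

def parse_links(text):
    pieces = text.split('[[')
    out = [pieces[0]]
    for piece in pieces[1:]:
        m = LINK_RE.match(piece)
        if m:
            target = m.group(1)
            label = m.group(2) if m.group(2) is not None else target
            out.append('<a href="%s">%s</a>' % (target, label))
            out.append(piece[m.end():])
        else:
            # Pattern failed: the '[[' is emitted as literal text.
            out.append('[[' + piece)
    return ''.join(out)
```

On input like '[[a [[b]]' this produces '[[a <a href="b">b</a>': the
']]' is consumed by the last '[[' before it, as described above.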

>> So, what is the reasonable thing to do?  First of all it should be
>> pointed out that block elements are not allowed inside link text:
>>
>> http://www.w3.org/TR/2002/REC-xhtml1-20020801/dtds.html#dtdentry_xhtml1-strict.dtd_a
>>      
> Lots of things should be made context aware.
> '''<div>''Foo</div>''''' will happily give you
> <b><div><i>Foo</div></i></b>. You can close an italic *inside a link*.
> ''Foo [[bar|bar'' baz]] is expected to work (I changed that recently,
> but should readd the 'feature').
>
>    

I'm aware of this, and it is not really a problem to handle.

>> This suggests that any sane wikitext should not allow a link to
>> continue past the end of the inlined text where it is located.  Even
>> better is to say that the sequence [[Link| always opens up a new link
>> and that 'end of inline text' will implicitly close the link if it is
>> still open.  That will not require any lookahead to parse.  It would
>> be consistent with the format parsing to only allow it to run to the
>> end of line, though.  Also, currently paragraphs and list elements
>> aren't rendered inside link text, unless enclosed or preceded by a
>> table.  So, unless tables inside link text are a widely used feature,
>> such a change might not break that many pages.
>>
>> /Andreas
>>      
> I agree. The link text shouldn't span multiple lines. Some page designs
> (eg. main pages) could be using it, though.
>    
I can think of a number of variations with increasing implementation
complexity:

* [[Link| opens a link, unless a link is already opened, end of line
   closes unterminated link text.

* [[Link| opens a link, unless a link is already opened, end of inline
   text closes unterminated link text.

* [[Link| opens a link, unless a link is already opened, the start of
   a non-paragraph block element closes the link.

* [[Link| opens a link, unless a link is already opened, list elements
   are disabled in the lexer, and non-paragraph block elements close the
   link.

* [[Link| opens a link, unless a link is already opened, end of
   article closes unterminated link text.  The closing token ']]' may
   only appear inside inline text.

* [[Link| opens a link, unless there is another [[Link| not preceded
   by a ]] on the same line.  End of line closes unterminated link
   text.

* [[Link| opens a link, unless there is another [[Link| not preceded
   by a ]] in the same inlined text.  End of inline text closes
   unterminated link text.

* etc.

As long as the link opening and closing tokens are only allowed inside
inlined text, it can be implemented.
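As a sketch of the simplest variation, where end of line closes an
unterminated link: this is illustrative Python (MediaWiki itself is
PHP) using a hypothetical, heavily simplified grammar that only
handles the [[Target| form, not the [[Target]] shorthand:

```python
import re

# '[[Target|' opens a link unless one is already open; ']]' closes it;
# end of line implicitly closes any link still open.
OPEN_RE = re.compile(r'\[\[([^\[\]|\n]+)\|')

def render_line(line):
    out = []
    pos = 0
    link_open = False
    while pos < len(line):
        if not link_open:
            m = OPEN_RE.search(line, pos)
            if m is None:
                out.append(line[pos:])
                break
            out.append(line[pos:m.start()])
            out.append('<a href="%s">' % m.group(1))
            link_open = True
            pos = m.end()
        else:
            end = line.find(']]', pos)
            if end == -1:
                out.append(line[pos:])
                out.append('</a>')   # end of line closes the link
                pos = len(line)
            else:
                out.append(line[pos:end])
                out.append('</a>')
                pos = end + 2
            link_open = False
    return ''.join(out)
```

Note that no lookahead is needed: the decision to open a link is made
as soon as the '[[Target|' token is seen.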

However, requiring a link to be properly closed in order to be a link
is fairly complex.  What should the parser do with the link
title, if it decides that it is not really a link title after all?  It
may contain tokens.  Thus, the lexer must use lookahead and not
may contain tokens.  Thus, the lexer must use lookahead and not
produce any spurious link open tokens.  To avoid the n^2 worst case, a
full extra pass to compute hints would be necessary before doing the
actual lexing.
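Such a hint-computing pre-pass could look roughly like this (a
hypothetical Python sketch): one linear scan records, for each '[['
token, whether a ']]' follows before another '[[' does, so the lexer
can later decide in constant time whether to emit a link-open token:

```python
# First pass: mark which '[[' tokens have a matching ']]' with no
# intervening '[['.  A single linear scan, so the n^2 worst case of
# re-lexing failed link candidates is avoided.
def link_open_hints(text):
    hints = {}           # position of '[[' -> True if it forms a link
    open_pos = None
    i = 0
    while i < len(text):
        if text.startswith('[[', i):
            if open_pos is not None:
                hints[open_pos] = False   # superseded before any ']]'
            open_pos = i
            i += 2
        elif text.startswith(']]', i):
            if open_pos is not None:
                hints[open_pos] = True
                open_pos = None
            i += 2
        else:
            i += 1
    if open_pos is not None:
        hints[open_pos] = False           # never closed
    return hints
```

For '[[a [[b]]' this marks the first '[[' as literal text and the
second as a real link open, consistent with the "last preceding '[['"
behaviour discussed above.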

/Andreas


_______________________________________________
Wikitext-l mailing list
[email protected]
https://lists.wikimedia.org/mailman/listinfo/wikitext-l
