2010-08-12 00:13, Platonides wrote:
> Andreas Jonsson wrote:
>> I have previously stated that the most complex thing in the MediaWiki
>> wikitext seemed to be the apostrophes. I was wrong.
>>
>> Links are divided into internal and external. Here I will write about
>> internal links. Internal links can further be subdivided into regular
>> links and image links; image links complicate things even further, but
>> I will not discuss them here.
>>
> (...)
>
>> Before the pattern matching, the whole article is split into pieces
>> using the '[[' token as delimiter. A pattern match for all variants
>> of internal link patterns (sans the '[[') is attempted at the
>> beginning of each piece of text. The consequence is that the closing
>> ']]' will be matched with the last opening '[[' preceding it.
>>
>> If the pattern match fails, no link is produced; the wikitext is
>> output as-is and may be processed again to produce other constructions
>> later on.
>>
>> The fundamental problem with all this is that when looking at the
>> wikitext '[[Link|', it cannot easily be determined whether this is
>> actually a link or just the literal text '[[Link|'. It is not a link
>> unless there is a closing ']]' token that is not preceded by another
>> '[[' token. The closing token may appear almost anywhere:
>>
> (funny links)
>
> That's because after preprocessing, tables are done before anything
> else. This wasn't like this originally, but after some vulnerabilities
> based on parsing after tables, they were moved there.
>
> They cannot appear inside ids or classes, because that would be an
> illegal id/class name. So it'll be moved into the parameter, and then
> the Sanitizer strips them, I suppose.

The Sanitizer allows both "]]" and "</a>" as class names.
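
As an aside, to make the splitting strategy quoted above concrete, here
is a rough Python sketch of it. This is my own approximation, not the
actual Parser.php code:

    import re

    # Rough approximation of the described strategy: split on '[[',
    # then try a link pattern (sans the leading '[[') at the start of
    # each piece.  Not MediaWiki's actual code.
    LINK_RE = re.compile(r'([^\[\]|]+)(?:\|([^\]]*))?\]\]')

    def replace_internal_links(text):
        parts = text.split('[[')
        out = [parts[0]]
        for piece in parts[1:]:
            m = LINK_RE.match(piece)
            if m:
                target = m.group(1)
                label = m.group(2) or target
                out.append('<a href="/wiki/%s">%s</a>%s'
                           % (target, label, piece[m.end():]))
            else:
                # No match: the token is emitted back as literal text.
                out.append('[[' + piece)
        return ''.join(out)

    print(replace_internal_links('x [[a [[b]] y'))
    # -> 'x [[a <a href="/wiki/b">b</a> y'
    # The ']]' binds to the *last* preceding '[['.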
> Don't rely on these things. Consider it unsupported.
>
>> In fact, the current MediaWiki parser does not seem to parse links in
>> linear time using a linear amount of memory. My test server failed to
>> process a preview of an article consisting of about 24000 links of
>> the form [[a]]. It was working hard before it, I guess, ran out of
>> memory. As a comparison, it parsed over 38000 italic a's, ''a'',
>> without problems.
>>
> You can parse them more or less easily because on each '' you can
> output a <i> or </i>.
>
> We could make the parser output a link each time it finds one, but
> that would require a db query per link (to see if it's blue or red) on
> the page. So instead they are added to a list, checked in one big
> query and replaced back at the end.
> Your OOM will be from that listing and the replacements.

I don't think that this really explains how the server can run out of
memory on such a small input (less than 150 kB). The splitting of the
input and the regexp should run in linear time as far as I can tell,
but unless there is a very large constant overhead per link, I believe
that something is consuming a super-linear amount of memory. An
n log n step somewhere might explain it, though.

>> So, what is the reasonable thing to do? First of all, it should be
>> pointed out that block elements are not allowed inside link text:
>>
>> http://www.w3.org/TR/2002/REC-xhtml1-20020801/dtds.html#dtdentry_xhtml1-strict.dtd_a
>>
> Lots of things should be made context aware.
> '''<div>''Foo</div>''''' will happily give you
> <b><div><i>Foo</div></i></b>. You can close an italic *inside a link*.
> ''Foo [[bar|bar'' baz]] is expected to work (I changed that recently,
> but should re-add the 'feature').

I'm aware of this, and it is not really a problem to handle.

>> This suggests that any sane wikitext should not allow a link to
>> continue past the end of the inline text where it is located. Even
>> better is to say that the sequence [[Link| always opens a new link
>> and that 'end of inline text' will implicitly close the link if it is
>> still open. That will not require any lookahead to parse. It would
>> be consistent with the format parsing to only allow it to run to the
>> end of the line, though. Also, currently paragraphs and list elements
>> aren't rendered inside link text, unless enclosed or preceded by a
>> table. So, unless tables inside link text are a widely used feature,
>> such a change might not break that many pages.
>>
>> /Andreas
>>
> I agree. The link text shouldn't span multiple lines. Some page
> designs (e.g. main pages) could be using it, though.

I can think of a number of variations with increasing implementation
complexity:

* [[Link| opens a link, unless a link is already open; end of line
  closes unterminated link text (see the sketch after this list).
* [[Link| opens a link, unless a link is already open; end of inline
  text closes unterminated link text.
* [[Link| opens a link, unless a link is already open; the start of a
  non-paragraph block element closes the link.
* [[Link| opens a link, unless a link is already open; list elements
  are disabled in the lexer, and non-paragraph block elements close
  the link.
* [[Link| opens a link, unless a link is already open; end of article
  closes unterminated link text. The closing token ']]' may only
  appear inside inline text.
* [[Link| opens a link, unless there is another [[Link| not preceded
  by a ]] on the same line; end of line closes unterminated link text.
* [[Link| opens a link, unless there is another [[Link| not preceded
  by a ]] in the same inline text; end of inline text closes
  unterminated link text.
* etc.
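
For concreteness, here is a rough sketch of the first (simplest)
variation. This is my own Python illustration with a made-up token
representation, not a proposal for the actual implementation:

    import re

    OPEN_RE = re.compile(r'\[\[([^\[\]|\n]+)\|')

    def lex_line(line):
        # '[[Target|' opens a link unless one is already open; ']]'
        # closes it; reaching the end of the line force-closes an
        # unterminated link.  TEXT tokens are per-character here purely
        # for simplicity.
        tokens, pos, open_link = [], 0, False
        while pos < len(line):
            m = OPEN_RE.match(line, pos) if not open_link else None
            if m:
                tokens.append(('LINK_OPEN', m.group(1)))
                open_link = True
                pos = m.end()
            elif open_link and line.startswith(']]', pos):
                tokens.append(('LINK_CLOSE', None))
                open_link = False
                pos += 2
            else:
                tokens.append(('TEXT', line[pos]))
                pos += 1
        if open_link:
            # Implicit close at end of line.
            tokens.append(('LINK_CLOSE', None))
        return tokens

    print(lex_line('[[Foo|bar baz'))
    # [('LINK_OPEN', 'Foo'), ('TEXT', 'b'), ..., ('LINK_CLOSE', None)]

The point of this variation is that every decision is purely local:
no lookahead beyond the current position is needed.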
As long as the link opening and closing tokens are only allowed inside
inline text, all of these can be implemented. However, requiring a link
to be properly closed in order to be a link is fairly complex. What
should the parser do with the link title if it decides that it is not
really a link title after all? It may contain tokens. Thus, the lexer
must use lookahead and not produce any spurious link-open tokens. To
avoid the n^2 worst case, a full extra pass to compute hints would be
necessary before doing the actual lexing.
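
To sketch what such a hint pass could look like (Python, my own
simplified model; the real rules are messier), a single right-to-left
scan suffices:

    def link_open_hints(text):
        # Simplified model of a hint pass: a '[[' is a genuine link
        # opener only if the next ']]' after it comes before the next
        # '[[' after it.  One right-to-left scan, so linear time.
        hints = {}
        next_open = next_close = None
        i = len(text) - 2
        while i >= 0:
            pair = text[i:i + 2]
            if pair == ']]':
                next_close = i
            elif pair == '[[':
                hints[i] = (next_close is not None and
                            (next_open is None or next_close < next_open))
                next_open = i
            i -= 1
        return hints

    print(link_open_hints('x [[a [[b]] y'))
    # -> {6: True, 2: False}: only the second '[[' opens a link

With the hints precomputed, the lexer never has to back up, so the
whole thing stays linear.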
/Andreas

_______________________________________________
Wikitext-l mailing list
[email protected]
https://lists.wikimedia.org/mailman/listinfo/wikitext-l