Gabriel Wicke, 16/03/2014 22:52:
In any case there needs to be some analysis on how much existing
wikitext would be affected by this. This can be done with a dump grepper
(we have one in the parsoid repo).

I didn't manage to get that working (<https://www.mediawiki.org/wiki/Talk:Parsoid/Setup#dumpGrepper.js>), but I tried some grepping instead. Do you have something specific in mind? As long as we terminate the list item when the next line begins with another list prefix (i.e. a new list item), disruption should be minimal, I'd think?

From the looks of it, most such unclosed tags are <small> tags applied to multiple items of a list. I don't know how legal or sane that can be considered, but for "multiline tags" I think we can settle on a stricter definition if one doesn't exist yet: mostly, I'd say, blockquote, pre, span and div (pre is already handled, though buggily).[1]

As for the PHP parser, if we're lucky maybe it's enough to combine the lines in question after
  $textLines = StringUtils::explode( "\n", $text );
and before the "List generation" block? It might also be an occasion to fix some of the bugs with that <pre> block as a byproduct.
<https://git.wikimedia.org/blob/mediawiki%2Fcore.git/HEAD/includes%2Fparser%2FParser.php#L2368>
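To make the idea concrete, here is a rough sketch of the line-combining step, written in Python rather than PHP and not taken from Parser.php. The names LIST_PREFIX, MULTILINE_TAGS, open_tag and combine_lines are made up for illustration, and joining continuation lines with a single space is an assumption (it would not be right for <pre> content, for one). A list item that leaves one of the "multiline tags" open absorbs the following lines until the tag is closed or the next line starts with a list prefix.

```python
import re

# Hypothetical names for illustration; none of this comes from Parser.php.
LIST_PREFIX = re.compile(r"^[#*:;]+")
MULTILINE_TAGS = ("blockquote", "pre", "span", "div")


def open_tag(line, tags=MULTILINE_TAGS):
    """Return the name of a tag opened more often than closed on this line, or None."""
    for tag in tags:
        opens = len(re.findall(r"<%s[ >]" % tag, line))
        closes = len(re.findall(r"</%s" % tag, line))
        if opens > closes:
            return tag
    return None


def combine_lines(text):
    """Merge continuation lines into a list item that left a multiline tag open."""
    logical = []
    pending = False  # True while the last list item still has an open tag
    for line in text.split("\n"):
        if pending and not LIST_PREFIX.match(line):
            # Continuation of the previous list item (joined with a space: an assumption).
            logical[-1] += " " + line
        else:
            # Either nothing is pending, or a new list prefix terminates the item.
            logical.append(line)
        last = logical[-1]
        pending = bool(LIST_PREFIX.match(last)) and open_tag(last) is not None
    return logical


sample = "* a <div>b\nc</div>\n* d <span>e\n* f\ng"
for item in combine_lines(sample):
    print(item)
```

In the sample, the second line is folded into the first item because its <div> is still open, while the unclosed <span> in the third line is cut off by the following "* f" item, which is the "new list prefix terminates the item" rule discussed above.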

Nemo

[1] $ time bzgrep --perl-regexp -c '^[#*:;]+.*&lt;(blockquote|span|div|pre)( |&gt;)((?!&lt;/\1).)*$' /public/dumps/public/itwiki/20140302/itwiki-20140302-pages-articles.xml.bz2
2602

real    7m40.029s
user    7m24.904s
sys     0m14.621s

vs.

$ time bzgrep --perl-regexp -c '^[#*:;]+.*&lt;(blockquote|span|div|pre|center|code|del|b|em|i|u|font|s|small|strike|strong)( |&gt;)((?!&lt;/\1).)*$' /public/dumps/public/itwiki/20140302/itwiki-20140302-pages-articles.xml.bz2
7344

real    8m7.467s
user    7m52.158s
sys     0m14.813s
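
For trying the two patterns on a handful of lines before running them over a whole dump, a small Python sketch follows. It is only an approximation of the bzgrep calls above: the function names and the sample lines are invented, and it assumes Python's re module treats this particular pattern the same way as grep's --perl-regexp. The &lt; and &gt; entities are kept because the wikitext inside the XML dump is entity-escaped.

```python
import re

# Tag lists copied from the two commands above; everything else is illustrative.
STRICT_TAGS = "blockquote|span|div|pre"
WIDE_TAGS = ("blockquote|span|div|pre|center|code|del|b|em|i|u|font|s|small"
             "|strike|strong")


def unclosed_tag_pattern(tags):
    """Build the pattern: list prefix, an opening tag, and no matching close after it."""
    return re.compile(r"^[#*:;]+.*&lt;(" + tags + r")( |&gt;)((?!&lt;/\1).)*$")


def count_matches(lines, tags):
    """Count matching lines, as grep -c does."""
    pattern = unclosed_tag_pattern(tags)
    return sum(1 for line in lines if pattern.search(line))


# Made-up sample lines, escaped the way they would appear in the dump.
sample = [
    "* item &lt;div&gt;unclosed",               # list item, <div> left open
    "* item &lt;div&gt;closed&lt;/div&gt;",     # list item, <div> closed on the same line
    "plain &lt;div&gt;text",                    # not a list item
    "# note &lt;small&gt;fine print",           # only the wider tag list covers <small>
]

print(count_matches(sample, STRICT_TAGS))
print(count_matches(sample, WIDE_TAGS))
```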

_______________________________________________
Wikitext-l mailing list
[email protected]
https://lists.wikimedia.org/mailman/listinfo/wikitext-l
