Gabriel Wicke, 16/03/2014 22:52:
> On 03/16/2014 01:49 PM, Federico Leva (Nemo) wrote:
>> How hard is it to implement? (Probably rather or
>> very hard, but who knows.)
>
> In general this is pretty hard to implement cleanly without a DOM. In
> Parsoid we currently implement list handling on the token stream, but
> could probably move it to the DOM in the longer run. Doing the same in
> the PHP parser is harder, and might not be worth it.
>
> In any case there needs to be some analysis on how much existing
> wikitext would be affected by this. This can be done with a dump grepper
> (we have one in the parsoid repo).

I wasn't able to use that one, but I ran some simple counts, which I then left in a screen session and forgot to post.

nemobis@dumps-2:~$ time bzgrep --perl-regexp -c '^[#*:;]+.*<(blockquote|span|div|pre)( |>)((?!</\1).)*$' /public/dumps/public/itwiki/20140302/itwiki-20140302-pages-articles.xml.bz2
2602

real    7m40.029s
user    7m24.904s
sys     0m14.621s

nemobis@dumps-2:~$ time bzgrep --perl-regexp -c '^[#*:;]+.*<(blockquote|span|div|pre|center|code|del|b|em|i|u|font|s|small|strike|strong)( |>)((?!</\1).)*$' /public/dumps/public/itwiki/20140302/itwiki-20140302-pages-articles.xml.bz2
7344

real    8m7.467s
user    7m52.158s
sys     0m14.813s

nemobis@dumps-2:~$ time bzgrep --perl-regexp -c '^[#*:;]+.*<(blockquote|span|div)( |>)((?!</\1).)*$' /public/dumps/public/itwiki/20140302/itwiki-20140302-pages-articles.xml.bz2
2588

real    7m45.508s
user    7m29.936s
sys     0m14.581s

nemobis@dumps-2:~$ time pbzip2 -d -c /public/dumps/public/itwiki/20140302/itwiki-20140302-pages-articles.xml.bz2 | grep -c -E '^#' ; time pbzip2 -d -c /public/dumps/public/itwiki/20140302/itwiki-20140302-pages-articles.xml.bz2 | grep -c -E '^*'
1655833

real    6m10.518s
user    6m10.935s
sys     0m8.141s
154647451

real    6m13.748s
user    6m18.012s
sys     0m8.377s

nemobis@dumps-2:~$ time pbzip2 -d -c /public/dumps/public/itwiki/20140302/itwiki-20140302-pages-articles.xml.bz2 | grep -c -E '^:' ; time pbzip2 -d -c /public/dumps/public/itwiki/20140302/itwiki-20140302-pages-articles.xml.bz2 | grep -c -E '^;'
981035

real    6m2.725s
user    6m11.815s
sys     0m7.804s
148563

real    6m6.082s
user    6m14.855s
sys     0m8.157s

_______________________________________________
Wikitext-l mailing list
[email protected]
https://lists.wikimedia.org/mailman/listinfo/wikitext-l
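
The four per-marker counts above each decompress the dump separately, and the `^*` figure (154647451) is about two orders of magnitude larger than the others, which suggests the unescaped `*` may not have been matched as a literal asterisk. A minimal sketch of how the same tallies could be collected in a single pass follows; the function name `count_markers` is made up for illustration, and it compares the first character of each line as a plain string, so no regex quoting is involved. It counts raw lines of the XML file, exactly as the `grep -c` calls do, and makes no attempt to parse pages.

```shell
#!/bin/sh
# Hypothetical helper (not part of any existing tool): reads text on stdin
# and prints one "marker<TAB>count" line for each of the four list markers.
count_markers() {
    awk '
        # Take the first character of the line as a plain string.
        { c = substr($0, 1, 1) }
        # Tally only the wikitext list markers # * : ;
        c == "#" || c == "*" || c == ":" || c == ";" { n[c]++ }
        END {
            # Markers that never occurred print as 0.
            printf "#\t%d\n*\t%d\n:\t%d\n;\t%d\n", n["#"], n["*"], n[":"], n[";"]
        }
    '
}

# Assumed usage, with the same dump path as in the session above:
#   pbzip2 -d -c /public/dumps/public/itwiki/20140302/itwiki-20140302-pages-articles.xml.bz2 | count_markers
```

This should take roughly the time of one decompression pass instead of four, though I have not timed it on the dump.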
