Gabriel Wicke, 16/03/2014 22:52:
> On 03/16/2014 01:49 PM, Federico Leva (Nemo) wrote:
>> How hard is it to implement? (Probably rather or
>> very hard, but who knows.)
> 
> In general this is pretty hard to implement cleanly without a DOM. In
> Parsoid we currently implement list handling on the token stream, but
> could probably move it to the DOM in the longer run. Doing the same in
> the PHP parser is harder, and might not be worth it.
> 
> In any case there needs to be some analysis on how much existing
> wikitext would be affected by this. This can be done with a dump grepper
> (we have one in the parsoid repo).

I wasn't able to use that one, but I ran some simple counts, which I
then left sitting in a screen session and forgot to post.

nemobis@dumps-2:~$ time bzgrep --perl-regexp -c \
  '^[#*:;]+.*<(blockquote|span|div|pre)( |>)((?!</\1).)*$' \
  /public/dumps/public/itwiki/20140302/itwiki-20140302-pages-articles.xml.bz2
2602

real    7m40.029s
user    7m24.904s
sys     0m14.621s
nemobis@dumps-2:~$ time bzgrep --perl-regexp -c \
  '^[#*:;]+.*<(blockquote|span|div|pre|center|code|del|b|em|i|u|font|s|small|strike|strong)( |>)((?!</\1).)*$' \
  /public/dumps/public/itwiki/20140302/itwiki-20140302-pages-articles.xml.bz2
7344

real    8m7.467s
user    7m52.158s
sys     0m14.813s
nemobis@dumps-2:~$ time bzgrep --perl-regexp -c \
  '^[#*:;]+.*<(blockquote|span|div)( |>)((?!</\1).)*$' \
  /public/dumps/public/itwiki/20140302/itwiki-20140302-pages-articles.xml.bz2
2588

real    7m45.508s
user    7m29.936s
sys     0m14.581s
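
For reference, a minimal sketch of what the first of these patterns
counts, ported to Python's `re` module (a substitution for illustration;
the runs above used `bzgrep --perl-regexp`). The helper name
`has_unclosed_tag` and the sample lines are made up and do not come from
the dump. A line counts when it starts with a list marker and opens one
of the listed tags without a matching closing tag later on the same line.

```python
import re

# Python port of the first pattern above; the tag list is copied from it.
UNCLOSED = re.compile(
    r'^[#*:;]+.*<(blockquote|span|div|pre)( |>)((?!</\1).)*$'
)


def has_unclosed_tag(line: str) -> bool:
    """True if a list-item line opens a listed tag that is not closed
    later on the same line. Expects the line without its newline."""
    return UNCLOSED.search(line) is not None


# Made-up sample lines for illustration only.
samples = [
    "* <span>closed on the same line</span>",
    '* <div class="x">left open',
    ": <span>a</span> and <span>b",
    "<div>not a list item",
]

for s in samples:
    print(has_unclosed_tag(s), s)
```

Only the second and third sample lines are reported as matches: the
first closes its tag and the fourth has no list marker.
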
nemobis@dumps-2:~$ time pbzip2 -d -c \
  /public/dumps/public/itwiki/20140302/itwiki-20140302-pages-articles.xml.bz2 \
  | grep -c -E '^#' ; time pbzip2 -d -c \
  /public/dumps/public/itwiki/20140302/itwiki-20140302-pages-articles.xml.bz2 \
  | grep -c -E '^*'
1655833

real    6m10.518s
user    6m10.935s
sys     0m8.141s
154647451

real    6m13.748s
user    6m18.012s
sys     0m8.377s
nemobis@dumps-2:~$ time pbzip2 -d -c \
  /public/dumps/public/itwiki/20140302/itwiki-20140302-pages-articles.xml.bz2 \
  | grep -c -E '^:' ; time pbzip2 -d -c \
  /public/dumps/public/itwiki/20140302/itwiki-20140302-pages-articles.xml.bz2 \
  | grep -c -E '^;'
981035

real    6m2.725s
user    6m11.815s
sys     0m7.804s
148563

real    6m6.082s
user    6m14.855s
sys     0m8.157s
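
A caveat on the count for `*`: in `grep -c -E '^*'` the asterisk is not
escaped, so it is probably not matching a literal `*`, and 154647451
looks more like the number of all lines in the dump than the number of
`*` list items. Escaping it as `'^\*'` would match a literal asterisk.
As a hedged sketch, all four markers could also be counted in a single
decompression pass. This uses Python instead of grep, and the names
`count_markers` and `count_markers_in_dump` are hypothetical helpers,
not something from the parsoid repo.

```python
import bz2
from collections import Counter
from typing import Iterable

MARKERS = "#*:;"  # the four wikitext list markers counted above


def count_markers(lines: Iterable[str]) -> Counter:
    """Count lines by first character, for the four list markers only."""
    counts = Counter()
    for line in lines:
        if line and line[0] in MARKERS:
            counts[line[0]] += 1
    return counts


def count_markers_in_dump(path: str) -> Counter:
    """Stream a .bz2 dump line by line and count the markers in it."""
    with bz2.open(path, "rt", encoding="utf-8") as dump:
        return count_markers(dump)


# Made-up sample lines for illustration only.
sample = ["* item", "** nested", "# numbered", ": indented",
          "; term", "plain text", ""]
print(count_markers(sample))
```

On the real data this would be called as `count_markers_in_dump(...)`
with the dump path used in the commands above. It reads the file once
rather than four times.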

_______________________________________________
Wikitext-l mailing list
https://lists.wikimedia.org/mailman/listinfo/wikitext-l
