On 05/11/2012 05:02 PM, Jon Robson wrote:
> It would be good if possible to get a more accurate feel for what
> percentage of articles use inline styles.
> e.g. articles that contain style= vs articles that don't
> This would help us get a better idea of what we are dealing with.

I just grepped an enwiki dump using the dumpGrepper tool in
extensions/VisualEditor/tests/parser:

  zcat enwiki-latest-pages-articles.xml.gz \
    | node dumpGrepper.js "\bstyle\s*=\s*['\"]"

  (..all matches..)

  ################################################
  Total revisions: 11687077
  Total matches: 675254
  Ratio: 5.77%
  ################################################

This includes templates, and counts all matches against all revisions, so
the number of matched articles will be even lower. It is therefore safe to
assume that most content pages don't contain any inline styles.

The bzip2-compressed output (1.2GB uncompressed) can soon be found here:
http://dev.wikidev.net/gabriel/tmp/style.txt.bz2

Gabriel
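
P.S. A rough sketch of why the match ratio is only an upper bound on the share of revisions that contain inline styles. This is not dumpGrepper's code: the countMatches helper, its result fields, and the sample strings are made up for illustration, and it works on an in-memory array instead of streaming a dump. Only the regular expression is taken from the command above.

```javascript
// Hypothetical helper (not part of dumpGrepper.js): tally a pattern over
// an in-memory list of revision texts.
function countMatches(revisionTexts, source) {
  let totalMatches = 0;      // every hit, several per revision possible
  let matchedRevisions = 0;  // revisions with at least one hit
  for (const text of revisionTexts) {
    // With the 'g' flag, match() returns all hits, or null if there are none.
    const hits = text.match(new RegExp(source, 'g'));
    if (hits !== null) {
      totalMatches += hits.length;
      matchedRevisions += 1;
    }
  }
  const total = revisionTexts.length;
  return {
    totalRevisions: total,
    totalMatches,
    matchedRevisions,
    matchRatio: total === 0 ? 0 : totalMatches / total,
    revisionRatio: total === 0 ? 0 : matchedRevisions / total,
  };
}

// Same pattern as on the command line, written as a JavaScript string.
const STYLE_ATTR = "\\bstyle\\s*=\\s*['\"]";

// Made-up sample revisions, for illustration only.
const sample = [
  'Plain paragraph without any attributes.',
  '<span style="color: red">a</span> <div style=\'margin: 0\'>b</div>',
  '{| class="wikitable"\n|-\n| cell\n|}',
  'lifestyle="x" is not a style attribute',
];

const stats = countMatches(sample, STYLE_ATTR);
console.log(stats);
```

In this sample one revision accounts for both matches, so the per-revision share is half the match ratio. The \b anchor also keeps a word such as "lifestyle" from being counted.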
