On 05/11/2012 05:02 PM, Jon Robson wrote:
> It would be good if possible to get a more accurate feel for what
> percentage of articles use inline styles.
> e.g. articles that contain style= vs articles that don't
> This would help us get a better idea of what we are dealing with.

I just grepped an enwiki dump using the dumpGrepper tool in
extensions/VisualEditor/tests/parser:

zcat enwiki-latest-pages-articles.xml.gz \
| node dumpGrepper.js "\bstyle\s*=\s*['\"]"

(..all matches..)
################################################
Total revisions: 11687077
Total matches: 675254
Ratio: 5.77%
################################################
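
For anyone who wants to check what the pattern does and does not catch, here is a minimal sketch in node. It assumes the regexp reaches JavaScript exactly as the shell passes it on (\bstyle\s*=\s*['"]) and is applied without flags; I have not verified that against dumpGrepper itself. The hasInlineStyle helper and the sample strings are made up for illustration.

```javascript
// Pattern as it arrives after shell unescaping (assumption: no flags added).
const stylePattern = /\bstyle\s*=\s*['"]/;

// Hypothetical helper, not part of dumpGrepper.
function hasInlineStyle(text) {
  return stylePattern.test(text);
}

// Made-up wikitext fragments.
const samples = [
  '<span style="color: red">text</span>',        // quoted attribute
  "{| class='wikitable' style = 'width:100%'",   // whitespace around =
  '<div style=color:red>',                       // unquoted value
  'A lifestyle="x" example',                     // no word boundary before "style"
  'Plain paragraph without attributes.',
];

for (const s of samples) {
  console.log(hasInlineStyle(s), JSON.stringify(s));
}
```

Note that unquoted values (style=color:red) are not matched, so the grep slightly undercounts in that respect, and that matching is case-sensitive under the no-flags assumption.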

This includes templates, and counts all matches against all revisions, so
the number of matched articles will be even lower.
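
To illustrate the difference between counting matches and counting matched pages, here is a rough sketch. It is not how dumpGrepper works internally; the countMatches function, the page objects and their contents are all hypothetical.

```javascript
// Global flag so that String.prototype.match returns every occurrence.
const pattern = /\bstyle\s*=\s*['"]/g;

// Hypothetical helper: tallies total matches and pages with at least one match.
function countMatches(pages) {
  let totalMatches = 0;
  let matchedPages = 0;
  for (const page of pages) {
    const hits = (page.text.match(pattern) || []).length;
    totalMatches += hits;
    if (hits > 0) {
      matchedPages += 1;
    }
  }
  return { totalPages: pages.length, totalMatches, matchedPages };
}

// Made-up sample data standing in for dump revisions.
const pages = [
  { title: 'A', text: '<div style="a"><span style="b">x</span></div>' },
  { title: 'B', text: 'plain text' },
  { title: 'C', text: "{| style='width:100%'\n|}" },
  { title: 'D', text: 'more plain text' },
];

const stats = countMatches(pages);
console.log(stats);
```

In this toy data, page A contributes two matches but only one matched page, which is why a ratio of matches to revisions overstates the share of pages that contain inline styles.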

So it is safe to assume that most content pages don't contain any inline
styles.

The bzip-compressed (1.2GB uncompressed) output can soon be found here:
http://dev.wikidev.net/gabriel/tmp/style.txt.bz2

Gabriel


_______________________________________________
Wikitech-l mailing list
https://lists.wikimedia.org/mailman/listinfo/wikitech-l
