I am looking into the feasibility of writing a comprehensive parser regression
test (CPRT). Before writing code, I thought I would try to get some idea of how
well such a tool would perform and what gotchas might pop up. An easy first
step is to run dumpHTML and capture some data and statistics.
I tried to run the version of dumpHTML in r54724, but it failed. So, I went
back to MediaWiki 1.14 and ran that version against a small personal wiki
database I have. I did this to get an idea of what structures dumpHTML
produces and to gather some performance data from which to project runtime
and resource usage.
I ran dumpHTML twice using the same MW version and same database. I then diff'd
the two directories produced. One would expect no differences, but that
expectation is wrong. I got a bunch of diffs of the following form (I have put
a newline between the two file names to shorten the line length):
diff -r
HTML_Dump/articles/d/n/e/User~Dnessett_Bref_Examples_Example1_Chapter_1_4083.html
HTML_Dump2/articles/d/n/e/User~Dnessett_Bref_Examples_Example1_Chapter_1_4083.html
77,78c77,78
< Post-expand include size: 16145/2097152 bytes
< Template argument size: 12139/2097152 bytes
---
> Post-expand include size: 16235/2097152 bytes
> Template argument size: 12151/2097152 bytes
I looked at one of the HTML files to see where these differences appear. They
occur in an HTML comment:
<!--
NewPP limit report
Preprocessor node count: 1891/1000000
Post-expand include size: 16145/2097152 bytes
Template argument size: 12139/2097152 bytes
Expensive parser function count: 0/100
-->
Does anyone have an idea of what this is for? Is there any way to configure MW
so it isn't produced?
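If it can't be disabled, one workaround for the CPRT comparison would be to
strip that comment out before diffing. Here is a rough Python sketch of the
idea (the directory names are the ones from the dump above, and I am assuming
the report always shows up as a single HTML comment beginning with "NewPP
limit report"):

import os, re, sys

# The NewPP limit report varies between runs, so drop it before comparing.
# The comment format is assumed to match the sample above.
LIMIT_REPORT = re.compile(r'<!--\s*NewPP limit report.*?-->\s*', re.DOTALL)

def normalize(path):
    with open(path) as f:
        return LIMIT_REPORT.sub('', f.read())

def compare_trees(old_root, new_root):
    # Only checks files present in the first tree.
    for dirpath, dirnames, filenames in os.walk(old_root):
        for name in filenames:
            old_file = os.path.join(dirpath, name)
            new_file = os.path.join(new_root,
                                    os.path.relpath(old_file, old_root))
            if not os.path.exists(new_file):
                print('missing: %s' % new_file)
            elif normalize(old_file) != normalize(new_file):
                print('differs: %s' % old_file)

if __name__ == '__main__':
    compare_trees(sys.argv[1], sys.argv[2])  # e.g. HTML_Dump HTML_Dump2

That, at least, would keep spurious differences like the one above from
drowning out real parser regressions.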
I will post some performance data later.
Dan