I am looking into the feasibility of writing a comprehensive parser regression 
test (CPRT). Before writing code, I thought I would try to get some idea of how 
well such a tool would perform and what gotchas might pop up. An easy first 
step is to run dumpHTML and capture some data and statistics.

I tried to run the version of dumpHTML in r54724, but it failed. So, I went 
back to MediaWiki 1.14 and ran that version against a small personal wiki 
database I have. I did this to get an idea of what structures dumpHTML 
produces and also to get some performance data from which to project runtime 
and resource usage.

I ran dumpHTML twice using the same MediaWiki version and the same database, 
then diff'd the two directories produced. One would expect no differences, but 
that expectation is wrong. I got a bunch of diffs of the following form (I 
have put a newline between the two file names to shorten the line length):

diff -r 
HTML_Dump/articles/d/n/e/User~Dnessett_Bref_Examples_Example1_Chapter_1_4083.html
 
HTML_Dump2/articles/d/n/e/User~Dnessett_Bref_Examples_Example1_Chapter_1_4083.html
77,78c77,78
< Post-expand include size: 16145/2097152 bytes
< Template argument size: 12139/2097152 bytes
---
> Post-expand include size: 16235/2097152 bytes
> Template argument size: 12151/2097152 bytes

I looked at one of the HTML files to see where these differences appear. They 
occur in an HTML comment:

<!-- 
NewPP limit report
Preprocessor node count: 1891/1000000
Post-expand include size: 16145/2097152 bytes
Template argument size: 12139/2097152 bytes
Expensive parser function count: 0/100
-->

Does anyone know what this report is for? Is there any way to configure 
MediaWiki so it isn't produced?
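Failing a configuration switch, a regression tool could normalize each page 
before diffing by stripping the comment out. A minimal sketch (the regex is 
my assumption, based on the comment shown above):

```python
import re

# Strip the parser's "NewPP limit report" HTML comment so that two dumpHTML
# runs of the same page compare equal. The comment's shape is taken from the
# example above; the exact regex is an assumption.
NEWPP_RE = re.compile(r"<!--\s*NewPP limit report.*?-->", re.DOTALL)

def strip_limit_report(html: str) -> str:
    """Remove the NewPP limit report comment from a rendered page, if present."""
    return NEWPP_RE.sub("", html)

# Two renderings that differ only in the reported byte counts:
run1 = ("<p>text</p>\n<!-- \nNewPP limit report\n"
        "Post-expand include size: 16145/2097152 bytes\n-->\n")
run2 = ("<p>text</p>\n<!-- \nNewPP limit report\n"
        "Post-expand include size: 16235/2097152 bytes\n-->\n")
assert strip_limit_report(run1) == strip_limit_report(run2)
```

With a filter like this applied to both trees, the spurious diffs above 
should disappear while real rendering differences still show up.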

I will post some performance data later.

Dan

_______________________________________________
Wikitech-l mailing list
[email protected]
https://lists.wikimedia.org/mailman/listinfo/wikitech-l
