https://bugzilla.wikimedia.org/show_bug.cgi?id=47407

--- Comment #2 from Kiran Mathew Koshy <[email protected]> ---
I have implemented a primitive version of the above tool...

https://github.com/kiranmathewkoshy/zimcheck/

It implements the following checks:
1- Internal checksum
2- Verify that there are no online dependencies
3- Check for all metadata entries
4- Verify favicon.png
5- Main page header
6- Duplicate content


Although the search for duplicate content was initially slow on large files, I
have managed to speed it up so that it runs in less than 2 minutes on the
2.6 GB Wikipedia ZIM file.
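
For illustration, one common way to keep a duplicate search fast is to bucket
articles by a hash of their content and only compare articles byte by byte
when their hashes match. The sketch below assumes the article bodies are
already in memory as strings; find_duplicates is a hypothetical helper and is
not necessarily how zimcheck does it.

```cpp
#include <cstddef>
#include <functional>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Hypothetical helper, not taken from zimcheck.
// Returns pairs (first, later) of indices whose contents are byte-identical.
// Each duplicate is reported once, against the earliest identical article.
std::vector<std::pair<std::size_t, std::size_t>>
find_duplicates(const std::vector<std::string>& contents) {
    // Map from content hash to the indices of articles with that hash.
    std::unordered_map<std::size_t, std::vector<std::size_t>> buckets;
    std::hash<std::string> hasher;
    std::vector<std::pair<std::size_t, std::size_t>> duplicates;

    for (std::size_t i = 0; i < contents.size(); ++i) {
        std::vector<std::size_t>& bucket = buckets[hasher(contents[i])];
        for (std::size_t j : bucket) {
            // Equal hashes can be a collision, so confirm with a full compare.
            if (contents[j] == contents[i]) {
                duplicates.emplace_back(j, i);
                break;
            }
        }
        bucket.push_back(i);
    }
    return duplicates;
}
```

With this layout the expensive full comparison only happens inside a bucket,
so the cost is close to linear in the number of articles when few hashes
collide.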

However, checking internal URLs is still slow. Since it is a CPU-intensive
process, I have decided to split the work across a few threads.
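
A minimal sketch of how such a split could look with C++11 std::thread is
given below. It assumes the items to check are already in memory and that the
check function is safe to call from several threads at once;
find_failures_parallel and its parameters are illustrative names, not part of
zimcheck or zimlib.

```cpp
#include <algorithm>
#include <cstddef>
#include <functional>
#include <string>
#include <thread>
#include <vector>

// Illustrative helper, not taken from zimcheck.
// Runs `check` on every item using `num_threads` workers and returns the
// indices of the items for which `check` returned false, in ascending order.
std::vector<std::size_t> find_failures_parallel(
        const std::vector<std::string>& items,
        const std::function<bool(const std::string&)>& check,
        std::size_t num_threads) {
    if (num_threads == 0) num_threads = 1;

    // Each worker writes only to its own vector, so no locking is needed.
    std::vector<std::vector<std::size_t>> per_thread(num_threads);
    std::vector<std::thread> workers;
    const std::size_t chunk = (items.size() + num_threads - 1) / num_threads;

    for (std::size_t t = 0; t < num_threads; ++t) {
        // Worker t handles the contiguous index range [begin, end).
        const std::size_t begin = std::min(t * chunk, items.size());
        const std::size_t end = std::min(begin + chunk, items.size());
        workers.emplace_back([&items, &check, &per_thread, t, begin, end] {
            for (std::size_t i = begin; i < end; ++i) {
                if (!check(items[i])) per_thread[t].push_back(i);
            }
        });
    }
    for (std::thread& w : workers) w.join();

    // Chunks are contiguous, so concatenating them keeps the indices sorted.
    std::vector<std::size_t> failures;
    for (const std::vector<std::size_t>& part : per_thread) {
        failures.insert(failures.end(), part.begin(), part.end());
    }
    return failures;
}
```

Splitting into contiguous chunks is the simplest scheme; if some items are
much more expensive to check than others, a shared work queue would balance
the load better.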

Also note that the regex library used is part of C++11, and I don't know
whether the rest of zimlib is compatible with C++11.
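
To show the kind of C++11 dependency involved, here is a small sketch that
uses std::regex to look for online dependencies in an HTML string. The
pattern and the find_external_links name are assumptions made for this
example, not the ones used in zimcheck; it only looks at src attributes, so
stylesheets referenced through href would need additional handling.

```cpp
#include <regex>
#include <string>
#include <vector>

// Illustrative helper, not taken from zimcheck.
// Returns the values of src attributes that point to an http(s) or
// protocol-relative URL, in the order they appear in the input.
std::vector<std::string> find_external_links(const std::string& html) {
    // ECMAScript grammar (the std::regex default), matched case-insensitively.
    static const std::regex pattern(
        R"re(\bsrc\s*=\s*["']((?:https?:)?//[^"']+)["'])re",
        std::regex::icase);

    std::vector<std::string> links;
    for (std::sregex_iterator it(html.begin(), html.end(), pattern), end;
         it != end; ++it) {
        links.push_back((*it)[1].str()); // capture group 1 is the URL
    }
    return links;
}
```

Building this requires a compiler and standard library with working C++11
regex support.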
