On Mon, Sep 14, 2015 at 4:49 PM, Platonides <[email protected]> wrote:

> You know it will fail for all kind of images included through templates
> (particularly infoboxes), right?


Indeed, it is not possible to find out which thumbnails a page uses
without actually parsing it. Your best bet is to wait until Parsoid dumps
become available (T17017 <https://phabricator.wikimedia.org/T17017>), then
go through those with an XML parser and extract the thumb URLs. That's
still slow, but not as slow as the MediaWiki parser. (Or you can try to find
a regexp that matches thumbnail URLs, but we all know what happens
<http://stackoverflow.com/a/1732454/323407> when you use a regexp to parse
HTML.) After that, just throw those URLs at the 404 handler.
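
A rough sketch of the extraction step, in case it helps. It assumes the
Parsoid output is HTML in which thumbnails appear as <img> elements, and
that a thumbnail URL can be recognised by a "/thumb/" path segment; both
are my assumptions and should be checked against real dump content. It
uses Python's standard html.parser rather than a strict XML parser, and
the names ThumbCollector and extract_thumb_urls are made up for
illustration.

```python
from html.parser import HTMLParser


class ThumbCollector(HTMLParser):
    """Collect the src of every <img> that looks like a thumbnail."""

    def __init__(self):
        super().__init__()
        self.thumbs = []

    def handle_starttag(self, tag, attrs):
        # Also called for self-closing tags such as <img ... />.
        if tag != "img":
            return
        src = dict(attrs).get("src")
        # Assumption: thumbnail URLs contain a "/thumb/" path segment.
        if src and "/thumb/" in src:
            self.thumbs.append(src)


def extract_thumb_urls(html):
    """Return thumbnail URLs found in one page's HTML, in document order."""
    collector = ThumbCollector()
    collector.feed(html)
    collector.close()
    return collector.thumbs
```

Because this reads the src attribute from parsed tags instead of matching
on raw text, it avoids the regexp problem mentioned above, at the cost of
depending on the "/thumb/" heuristic.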