On Mon, Sep 14, 2015 at 4:49 PM, Platonides <[email protected]> wrote:
> You know it will fail for all kinds of images included through templates
> (particularly infoboxes), right?

Indeed, it is not possible to find out which thumbnails a page uses without actually parsing it. Your best bet is to wait until Parsoid dumps become available (T17017 <https://phabricator.wikimedia.org/T17017>), then go through those with an XML parser and extract the thumb URLs. That's still slow, but not as slow as the MediaWiki parser. (Or you can try to find a regexp which matches thumbnail URLs, but we all know what happens <http://stackoverflow.com/a/1732454/323407> when you use a regexp to parse HTML.)

After that, just throw those URLs at the 404 handler.

_______________________________________________
Wikitech-l mailing list
[email protected]
https://lists.wikimedia.org/mailman/listinfo/wikitech-l
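
For illustration, here is a rough sketch of the "parse the Parsoid HTML and pull out the thumb URLs" step. It is only a sketch under assumptions: it uses Python's standard-library html.parser rather than a strict XML parser (I have not checked whether the dumps will be well-formed XML), it assumes thumbnail URLs can be recognised by a "/thumb/" path segment in an <img> src attribute, and the names ThumbCollector and extract_thumb_urls are made up for this example.

```python
from html.parser import HTMLParser


class ThumbCollector(HTMLParser):
    """Collect <img src> values that look like thumbnail URLs."""

    def __init__(self):
        super().__init__()
        self.thumbs = []

    def handle_starttag(self, tag, attrs):
        # Also invoked for self-closing tags such as <img ... />.
        if tag != "img":
            return
        src = dict(attrs).get("src")
        # Assumption: thumbnail URLs contain a "/thumb/" path segment.
        if src and "/thumb/" in src:
            # The src may be protocol-relative ("//host/..."); make it absolute.
            if src.startswith("//"):
                src = "https:" + src
            self.thumbs.append(src)


def extract_thumb_urls(html):
    """Return the thumbnail-looking image URLs found in an HTML string."""
    collector = ThumbCollector()
    collector.feed(html)
    collector.close()
    return collector.thumbs
```

The resulting list is what you would then request one URL at a time, so that the 404 handler gets hit for each of them. Rate limiting and error handling are left out here on purpose.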
