Well, using a regex to catch these troublemakers isn't going to be useful. Although i caught the first faulty url's, there can be many more and it's unpredictable; here's just a random pick from the list of errors:
http://www.trouw.nl/achtergrond/Dossiers/article1851907.ece/www.invest.is/Key-Sectors/Data-Centers-in-Iceland/www.invest.is/Key-Sectors/Data-Centers-in-Iceland/www.invest.is/Key-Sectors/Data-Centers-in-Iceland/www.invest.is/Key-Sectors/Data-Centers-in-Iceland/www.invest.is/Key-Sectors/Data-Centers-in-Iceland/www.invest.is/Key-Sectors/Data-Centers-in-Iceland/ Here's another very disturbing url it's trying to fetch: http://www.nrc.nl/krant/article1860140.ece/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/ I'm seems these bad url's are somehow found by the parser and get fetched the next time, and the next time making the url grow longer and longer for each fetch and parse and updateDB cycle. http://www.nrc.nl/dossiers/computerbeveiliging/virussen/melissa_maart_1999/article1513468.ece/www.microsoft.com/office/www.microsoft.com/office/www.microsoft.com/office/www.microsoft.com/office/www.microsoft.com/office/www.microsoft.com/office/www.microsoft.com/office/www.microsoft.com/office/www.microsoft.com/office/www.microsoft.com/office/www.microsoft.com/office/www.microsoft.com/office/www.microsoft.com/office/www.microsoft.com/office/www.microsoft.com/office/www.microsoft.com/office/www.microsoft.com/office/www.microsoft.com/office/www.microsoft.com/office/www.microsoft.com/office/www.microsoft.com/office/www.microsoft.com/office/www.microsoft.com/office/www.microsoft.com/office/www.microsoft.com/office/www.microsoft.com/office/www.microsoft.com/office/www.microsoft.com/office/www.microsoft.com/office/www.microsoft.com/office/www.microsoft.com/office/antivirus This doesn't look good at all. Anyone got a suggestion or some pointer? -----Original message----- From: Markus Jelsma <markus.jel...@buyways.nl> Sent: Wed 22-09-2010 12:12 To: user@nutch.apache.org; Subject: Funky duplicate url's Hi, This is not about deduplication, but about preventing certain url's to end up in the CrawlDB. I'm crawling a news site for testing purposes, it has the usual categories etc. News item pages feature a gray text block that's got some url's as well. See http://www.trouw.nl/opinie/columnisten/article2018983.ece for an example. The problem is, the parser somehow manages to concatenate the href with the inner anchor text (which happens to be an url as you can see). So, subsequent fetches are completely messed up, i'm almost only fetching duplicates: fetching http://www.trouw.nl/opinie/columnisten/article2018983.ece/www.trouw.nl/nieuws/economie/www.trouw.nl/opinie/weblogs/www.trouw.nl/opinie/weblogs/www.trouw.nl/nieuws/economie/www.trouw.nl/opinie/weblogs/www.trouw.nl/nieuws/economie/www.trouw.nl/opinie/weblogs/www.trouw.nl/opinie/weblogs/www.trouw.nl/opinie/weblogs/www.trouw.nl/opinie/weblogs/www.trouw.nl/opinie/weblogs/www.trouw.nl/nieuws/economie/www.trouw.nl/opinie/weblogs/www.trouw.nl/nieuws/economie/www.trouw.nl/opinie/weblogs/article2012945.ece fetching http://www.trouw.nl/opinie/columnisten/article2018983.ece/www.trouw.nl/nieuws/economie/www.trouw.nl/nieuws/economie/www.trouw.nl/opinie/weblogs/www.trouw.nl/opinie/weblogs/www.trouw.nl/nieuws/economie/www.trouw.nl/opinie/weblogs/www.trouw.nl/nieuws/economie/www.trouw.nl/opinie/weblogs/www.trouw.nl/nieuws/economie/www.trouw.nl/nieuws/economie/www.trouw.nl/nieuws/economie/www.trouw.nl/opinie/weblogs/www.trouw.nl/nieuws/economie/www.trouw.nl/nieuws/economie/www.trouw.nl/nieuws/economie/article1504915.ece fetching http://www.trouw.nl/opinie/columnisten/article2018983.ece/www.trouw.nl/opinie/weblogs/www.trouw.nl/nieuws/economie/www.trouw.nl/nieuws/economie/www.trouw.nl/opinie/weblogs/www.trouw.nl/nieuws/economie/www.trouw.nl/opinie/weblogs/www.trouw.nl/nieuws/economie/www.trouw.nl/opinie/weblogs/www.trouw.nl/nieuws/economie/www.trouw.nl/nieuws/economie/www.trouw.nl/opinie/weblogs/www.trouw.nl/nieuws/economie/www.trouw.nl/opinie/weblogs/www.trouw.nl/opinie/weblogs/www.trouw.nl/nieuws/economie/article1504915.ece This is not desired behavior, as you'd expect. The question is, where to fix and how to fix it? Is it a problem with the parser? Or is it fixable using some freaky url filter for this one? Cheers,