Yes, you can stop them, but how do you know whether a URL is good or not? You can use a URL filter to discard unwanted URLs.
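As a minimal sketch of the kind of URL filter mentioned here: the pattern below rejects URLs where a path segment occurs three times, similar in spirit to the repeating-segment rule shipped in Nutch's default regex-urlfilter.txt. The exact pattern and the `accept_url` helper are illustrations, not Nutch's actual configuration.

```python
import re

# Hypothetical deny pattern: a path segment that appears three times,
# with at most one other segment between occurrences -- a common
# symptom of a crawler trap (e.g. /a/b/a/c/a/).
REPEATING_SEGMENT = re.compile(r"(/[^/]+)/[^/]+\1/[^/]+\1/")

def accept_url(url: str) -> bool:
    """Return True to keep the URL, False to discard it (sketch only)."""
    return REPEATING_SEGMENT.search(url) is None

print(accept_url("http://example.com/a/b/c"))           # kept
print(accept_url("http://example.com/x/a/b/a/c/a/"))    # discarded
```

In Nutch itself this kind of rule would live in regex-urlfilter.txt as a `-` (deny) line rather than in code.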
Dedup works to remove old/obsolete content, and you cannot check it without downloading it.

Best Regards,
Alexander Aristov

On 13 September 2011 14:57, Markus Jelsma <[email protected]> wrote:

> Yes, we use several deduplication mechanisms and they work fine. The problem
> is wasting a lot of CPU cycles for nothing. Why not stop those unwanted URL's
> from entering the CrawlDB in the first place instead of getting rid of them
> afterwards?
>
> Growth of the CrawlDB is something very significant, especially with
> thousands of long URL's.
>
> On Tuesday 13 September 2011 12:54:21 Dinçer Kavraal wrote:
> > Hi Markus,
> >
> > Please correct me if I'm wrong, but isn't there a document signature check
> > to detect whether the page contains the same content as some other page
> > that was already parsed and indexed?
> >
> > Dinçer
> >
> > 2011/9/12 Markus Jelsma <[email protected]>
> >
> > > Hi,
> > >
> > > Would it not be a good idea to patch DomContentUtils with an option not
> > > to consider relative outlinks without a base URL? This example [1] will
> > > currently quickly take over the crawl db and produce countless unique
> > > URL's that cannot be filtered out with the regex that detects repeating
> > > URI segments.
> > >
> > > There are many websites on the internet that suffer from this problem.
> > >
> > > A patch would protect against this common crawler trap, but not against
> > > incorrect absolute URL's - one that is supposed to be absolute but for
> > > example has an incorrect protocol scheme.
> > >
> > > [1]:
> > > http://www.hollandopera.nl/voorstellingen/archief/voorstellingen/item/1/
> > >
> > > Cheers,
> > > --
> > > Markus Jelsma - CTO - Openindex
> > > http://www.linkedin.com/in/markus17
> > > 050-8536620 / 06-50258350
>
> --
> Markus Jelsma - CTO - Openindex
> http://www.linkedin.com/in/markus17
> 050-8536620 / 06-50258350
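The trap described in the quoted message can be sketched with standard URL resolution. When a page emits a relative outlink without a leading slash and sets no base URL, each hop appends to the current path instead of replacing it, so the frontier fills with ever-longer unique URLs. The page URL below is hypothetical; the path shape mimics the example in [1].

```python
from urllib.parse import urljoin

# Hypothetical page emitting a relative outlink with no leading slash
# and no <base href>: each resolution appends to the current path.
url = "http://www.example.com/voorstellingen/archief/"
for _ in range(3):
    url = urljoin(url, "voorstellingen/item/1/")
    print(url)
# Each iteration yields a longer, unique URL -- none of which a
# simple repeating-segment regex necessarily catches early enough.
```

This is exactly why filtering after the fact is expensive: every one of these URLs is syntactically valid and unique, so it enters the CrawlDB unless the outlink extractor refuses base-less relative links up front.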

