On Tuesday 13 September 2011 13:12:41 Alexander Aristov wrote:
> yes you can stop but how do you know if a URL is good or not?
> You can use URL filter to discard unwanted URLs.
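
The regex URL filter is what we fall back on today. From memory (so treat
this as a sketch of the idea rather than a copy of our exact config), the
repeating-segment rule shipped in conf/regex-urlfilter.txt looks roughly
like this:

  # skip URLs with a slash-delimited segment that repeats 3+ times,
  # e.g. .../foo/1/foo/2/foo/, to break loops
  -.*(/[^/]+)/[^/]+\1/[^/]+\1/

It only fires once the repetition is already three segments deep, so a site
like the example further down still floods the CrawlDB with bogus URLs
before the rule catches anything.
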
We see that many sites with relative URLs without a base href produce
erroneous links. As in the example there is a pattern, but sometimes the
pattern is hard to find. If we can stop relative URLs without a base href
for now, we can at least continue crawling. Right now we have to manually
check samples of the crawled URLs (millions of them) for crawler traps,
which we (for now) add to the regex filter. We do need to look for a better
solution, but we cannot work on a solution and manually test the CrawlDB at
the same time. One related problem is detecting calendars / agendas.

> Dedup works to remove old/obsolete content and you cannot check it without
> downloading it.
> 
> Best Regards
> Alexander Aristov
> 
> On 13 September 2011 14:57, Markus Jelsma <[email protected]> wrote:
> > Yes, we use several deduplication mechanisms and they work fine. The
> > problem is wasting a lot of CPU cycles for nothing. Why not stop those
> > unwanted URLs from entering the CrawlDB in the first place instead of
> > getting rid of them afterwards?
> > 
> > Growth of the CrawlDB is something very significant, especially with
> > thousands of long URLs.
> > 
> > On Tuesday 13 September 2011 12:54:21 Dinçer Kavraal wrote:
> > > Hi Markus,
> > > 
> > > Please correct me if I'm wrong, but isn't there a document signature
> > > check to detect if the page contains the same content as some other
> > > page that has already been parsed and indexed?
> > > 
> > > Dinçer
> > > 
> > > 2011/9/12 Markus Jelsma <[email protected]>
> > > 
> > > > Hi,
> > > > 
> > > > Would it not be a good idea to patch DomContentUtils with an option
> > > > not to consider relative outlinks without a base URL? This example
> > > > [1] will currently quickly take over the CrawlDB and produce
> > > > countless unique URLs that cannot be filtered out with the regex
> > > > that detects repeating URI segments.
> > > > 
> > > > There are many websites on the internet that suffer from this
> > > > problem.
> > > > 
> > > > A patch would protect against this common crawler trap but not
> > > > against incorrect absolute URLs - ones that are supposed to be
> > > > absolute but, for example, have an incorrect protocol scheme.
> > > > 
> > > > [1]:
> > > > http://www.hollandopera.nl/voorstellingen/archief/voorstellingen/item/1/
> > > > 
> > > > Cheers,
> > > > 
> > > > --
> > > > Markus Jelsma - CTO - Openindex
> > > > http://www.linkedin.com/in/markus17
> > > > 050-8536620 / 06-50258350
> > 
> > --
> > Markus Jelsma - CTO - Openindex
> > http://www.linkedin.com/in/markus17
> > 050-8536620 / 06-50258350

-- 
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350
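
For reference, a minimal sketch of the DomContentUtils option proposed in
the first message of the thread (quoted above): drop relative outlinks when
a page declares no <base href>. The class, method and property names here
are illustrative only and are not Nutch's actual API.

  import java.net.MalformedURLException;
  import java.net.URL;
  import java.util.ArrayList;
  import java.util.List;

  /**
   * Sketch only: illustrates the proposed behaviour, not Nutch's real
   * DOMContentUtils code. When a page has no <base href>, relative
   * outlinks are dropped instead of being resolved against the fetch URL.
   */
  public class RelativeOutlinkSketch {

      // stand-in for a hypothetical config switch such as
      // "parser.ignore.relative.links.without.base" (name is made up)
      private final boolean ignoreRelativeWithoutBase;

      public RelativeOutlinkSketch(boolean ignoreRelativeWithoutBase) {
          this.ignoreRelativeWithoutBase = ignoreRelativeWithoutBase;
      }

      /**
       * @param baseFromTag URL from a <base href> tag, or null if absent
       * @param fetchUrl    URL the page was fetched from
       * @param hrefs       raw href values extracted from the document
       */
      public List<URL> resolveOutlinks(URL baseFromTag, URL fetchUrl,
                                       List<String> hrefs) {
          List<URL> outlinks = new ArrayList<URL>();
          boolean hasBase = baseFromTag != null;
          URL base = hasBase ? baseFromTag : fetchUrl;

          for (String href : hrefs) {
              boolean absolute = href.matches("^[a-zA-Z][a-zA-Z0-9+.-]*:.*");
              // the proposed option: skip relative links entirely when the
              // page gave us no <base href> to resolve them against
              if (!absolute && !hasBase && ignoreRelativeWithoutBase) {
                  continue;
              }
              try {
                  outlinks.add(new URL(base, href));
              } catch (MalformedURLException e) {
                  // ignore hrefs that cannot be parsed at all
              }
          }
          return outlinks;
      }
  }

With the switch off nothing changes (relative links are still resolved
against the fetch URL); with it on, pages like [1] no longer generate the
ever-deepening paths described above, at the cost of missing some
legitimate relative links on pages without a base href.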

