Yes, you can stop them, but how do you know whether a URL is good or not?
You can use URL filters to discard unwanted URLs.
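
In Nutch this is typically the regex-urlfilter plugin: it reads rules from
regex-urlfilter.txt, where each line is a '+' (accept) or '-' (reject) prefix
followed by a Java regex, and the first matching rule wins. A rough sketch of
such rules (close to, though not necessarily identical to, the defaults
shipped with Nutch) might be:

    # skip file:, ftp: and mailto: URLs
    -^(file|ftp|mailto):
    # skip URLs with a slash-delimited segment repeating 3+ times (loop breaker)
    -.*(/[^/]+)/[^/]+\1/[^/]+\1/
    # accept everything else
    +.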

Dedup works to remove old/obsolete content, and you cannot check that without
downloading it first.
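
To make that concrete: signature-based dedup hashes the fetched content and
compares hashes across documents, so a duplicate can only be detected after
the page has been downloaded and parsed. A minimal, hypothetical sketch in
plain Java (illustrative only, not Nutch's actual MD5Signature or
TextProfileSignature classes):

    import java.math.BigInteger;
    import java.nio.charset.StandardCharsets;
    import java.security.MessageDigest;

    public class ContentSignature {

        // MD5 hex digest of the page text; two pages with the same digest
        // would be treated as duplicates by a dedup job.
        static String signature(String pageText) throws Exception {
            MessageDigest md5 = MessageDigest.getInstance("MD5");
            byte[] digest = md5.digest(pageText.getBytes(StandardCharsets.UTF_8));
            return String.format("%032x", new BigInteger(1, digest));
        }

        public static void main(String[] args) throws Exception {
            String a = "<html><body>same content</body></html>";
            String b = "<html><body>same content</body></html>";
            // Identical content gives identical signatures, so one copy can be
            // dropped - but only after both URLs have already been fetched.
            System.out.println(signature(a).equals(signature(b)));
        }
    }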

Best Regards
Alexander Aristov


On 13 September 2011 14:57, Markus Jelsma <[email protected]> wrote:

> Yes, we use several deduplication mechanisms and they work fine. The
> problem is that we waste a lot of CPU cycles for nothing. Why not stop
> those unwanted URLs from entering the CrawlDB in the first place instead
> of getting rid of them afterwards?
>
> Growth of the CrawlDB is very significant, especially with thousands of
> long URLs.
>
> On Tuesday 13 September 2011 12:54:21 Dinçer Kavraal wrote:
> > Hi Markus,
> >
> > Please correct me if I'm wrong, but isn't there a document signature
> > check to detect whether a page contains the same content as another
> > that has already been parsed and indexed?
> >
> > Dinçer
> >
> > 2011/9/12 Markus Jelsma <[email protected]>
> >
> > > Hi,
> > >
> > > Would it not be a good idea to patch DomContentUtils with an option not
> > > to consider relative outlinks without a base URL? This example [1] will
> > > currently take over the CrawlDB very quickly and produce countless
> > > unique URLs that cannot be filtered out with the regex that detects
> > > repeating URI segments.
> > >
> > > There are many websites on the internet that suffer from this problem.
> > >
> > > A patch would protect against this common crawler trap, but not against
> > > incorrect absolute URLs - ones that are supposed to be absolute but,
> > > for example, have an incorrect protocol scheme.
> > >
> > > [1]:
> > > http://www.hollandopera.nl/voorstellingen/archief/voorstellingen/item/1/
> > >
> > > Cheers,
> > > --
> > > Markus Jelsma - CTO - Openindex
> > > http://www.linkedin.com/in/markus17
> > > 050-8536620 / 06-50258350
>
> --
> Markus Jelsma - CTO - Openindex
> http://www.linkedin.com/in/markus17
> 050-8536620 / 06-50258350
>
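
For what it's worth, a hypothetical sketch of the option Markus proposes above
(skip relative outlinks when no base URL is known) could look roughly like the
following - illustrative plain Java, not the actual DomContentUtils code:

    import java.net.MalformedURLException;
    import java.net.URL;
    import java.util.ArrayList;
    import java.util.List;

    public class OutlinkExtractorSketch {

        // Hypothetical flag mirroring the proposed option: when true, relative
        // hrefs are dropped unless an explicit base URL is known for the page.
        private final boolean ignoreRelativeWithoutBase;

        public OutlinkExtractorSketch(boolean ignoreRelativeWithoutBase) {
            this.ignoreRelativeWithoutBase = ignoreRelativeWithoutBase;
        }

        public List<String> resolve(URL base, List<String> hrefs) {
            List<String> outlinks = new ArrayList<>();
            for (String href : hrefs) {
                boolean absolute = href.matches("^[a-zA-Z][a-zA-Z0-9+.-]*:.*");
                if (!absolute && base == null && ignoreRelativeWithoutBase) {
                    // No base URL to resolve against: skip the link entirely.
                    continue;
                }
                try {
                    URL resolved = (base != null) ? new URL(base, href) : new URL(href);
                    outlinks.add(resolved.toString());
                } catch (MalformedURLException e) {
                    // Unresolvable link: ignore it.
                }
            }
            return outlinks;
        }
    }

With the flag enabled, relative hrefs on pages like the hollandopera.nl
example above would simply be dropped instead of generating new unique URLs on
every fetch; incorrect absolute URLs (e.g. a broken protocol scheme) would
still get through, as noted above.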
