Hi,

This is a tough problem indeed. We partially mitigate it by using several regular expressions, LinkRank scores and a domain-limiting generator for regular crawls, plus a second shallow crawl that only follows links from the home page.
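
For the regular expression part, a couple of exclude rules in regex-urlfilter.txt along these lines (the patterns are just examples, adjust them to the URLs you actually see) already catch the most common calendar layouts:

  # skip paths ending in a year-month segment, e.g. /category/events/2015-11
  -.*/[0-9]{4}-[0-9]{2}/?$
  # skip obvious calendar/date query parameters
  -[?&](month|year|date|calendar)=
  # (these must come before the final +. accept-everything rule)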

A custom URLFilter, as Ferdy explains, is a good idea indeed. However, URLFilters operate on single URLs only, which is as difficult as creating regular expressions. If we could process all outlinks of a given page at the same time, it would be easier to compare them, calculate their similarity and, if needed, discard the ones we consider unwanted calendar links.
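
To illustrate the idea, here is a rough sketch in plain Java (not an actual Nutch plugin, and the digit-collapsing heuristic is only an example) that groups a page's outlinks by their "shape" and flags groups that look like a calendar grid:

  import java.util.ArrayList;
  import java.util.HashMap;
  import java.util.List;
  import java.util.Map;

  // Sketch only: group the outlinks of a single page by their "shape"
  // (digits collapsed), so that links which differ only in numbers,
  // e.g. /events/2015-10 and /events/2015-11, end up in one group.
  // A group that dominates the page's outlinks is probably a calendar.
  public class CalendarOutlinkDetector {

    // Collapse every run of digits into a single token.
    static String shape(String url) {
      return url.replaceAll("[0-9]+", "#");
    }

    // Return the outlinks belonging to groups of at least minGroupSize
    // digit-only variations; the threshold is arbitrary and tunable.
    public static List<String> suspectedCalendarLinks(List<String> outlinks,
        int minGroupSize) {
      Map<String, List<String>> groups = new HashMap<String, List<String>>();
      for (String url : outlinks) {
        String key = shape(url);
        List<String> group = groups.get(key);
        if (group == null) {
          group = new ArrayList<String>();
          groups.put(key, group);
        }
        group.add(url);
      }
      List<String> suspects = new ArrayList<String>();
      for (List<String> group : groups.values()) {
        if (group.size() >= minGroupSize) {
          suspects.addAll(group);
        }
      }
      return suspects;
    }

    public static void main(String[] args) {
      List<String> outlinks = java.util.Arrays.asList(
          "http://example.com/category/events/2015-10",
          "http://example.com/category/events/2015-11",
          "http://example.com/category/events/2015-12",
          "http://example.com/about");
      // With a threshold of 3, the three date-like links are reported.
      System.out.println(suspectedCalendarLinks(outlinks, 3));
    }
  }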

Can you explain how Heritrix does it? Perhaps we can learn from it.

Cheers,
Markus

On Sat, 5 May 2012 12:44:27 +0200, Ferdy Galema <[email protected]> wrote:
Hi,

Fetching unwanted pages, such as these dynamically generated pages, is a general problem. Currently I'm not aware of any pending improvements in this area, but feel free to contribute if you have a solution. Probably the best way to solve such a problem is by implementing a custom URLFilter. This filter could use heuristics that are able to detect dynamically
generated URLs.
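
A bare-bones sketch of what such a filter could look like (assuming the Nutch 1.x URLFilter interface; the date pattern is only an example heuristic, and the class still needs the usual plugin descriptor to be registered):

  import java.util.regex.Pattern;

  import org.apache.hadoop.conf.Configuration;
  import org.apache.nutch.net.URLFilter;

  // Example heuristic: reject URLs whose path ends in a year-month
  // segment such as /events/2015-11. Real calendars will need more rules.
  public class CalendarURLFilter implements URLFilter {

    private static final Pattern CALENDAR_SEGMENT =
        Pattern.compile(".*/[0-9]{4}-[0-9]{2}/?$");

    private Configuration conf;

    // Returning null tells Nutch to drop the URL; returning it keeps it.
    public String filter(String urlString) {
      if (CALENDAR_SEGMENT.matcher(urlString).matches()) {
        return null;
      }
      return urlString;
    }

    public void setConf(Configuration conf) {
      this.conf = conf;
    }

    public Configuration getConf() {
      return conf;
    }
  }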

Ferdy.

On Fri, May 4, 2012 at 9:13 PM, Xiao Li <[email protected]> wrote:

Hi Nutch people,

I am using Nutch to index a website. I notice that Nutch has crawled
some junk webpages, such as
http://**************/category/events/2015-11. This webpage is about
events occurring in November 2015, which is of no use to me. I want to
know whether it is possible for Nutch to intelligently skip such
webpages. It may be argued that I could use a regex to avoid this.
However, as the naming patterns of calendar webpages are not the same
all the time, there is no way to write a perfect regex for this. I
know Heritrix (an Internet Archive crawler) has the capability to
avoid crawling nonsense calendar webpages. Has anyone solved this
issue?

Regards
Xiao

