Hi,

This is a tough problem indeed. We partially mitigate it by using several regular expressions, LinkRank scores and a domain-limiting generator for regular crawls, plus a second shallow crawl that only follows links from the home page.
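
For the regular expression part, a couple of exclude rules in regex-urlfilter.txt along these lines (the patterns are just examples, adjust them to the URLs you actually see) already catch the most common calendar layouts:

  # skip paths ending in a year-month segment, e.g. /category/events/2015-11
  -.*/[0-9]{4}-[0-9]{2}/?$
  # skip obvious calendar/date query parameters
  -[?&](month|year|date|calendar)=
  # (these must come before the final +. accept-everything rule)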

A custom URLFilter, as Ferdy explains, is a good idea indeed. However, URLFilters operate on single URLs only, which is as difficult as creating regular expressions. If we could process all outlinks of a given page at the same time, it would be easier to compare them, calculate their similarity and, if needed, discard the ones we consider unwanted calendar links.
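
To illustrate the idea, here is a rough sketch in plain Java (not an actual Nutch plugin, and the digit-collapsing heuristic is only an example) that groups a page's outlinks by their "shape" and flags groups that look like a calendar grid:

  import java.util.ArrayList;
  import java.util.HashMap;
  import java.util.List;
  import java.util.Map;

  // Sketch only: group the outlinks of a single page by their "shape"
  // (digits collapsed), so that links which differ only in numbers,
  // e.g. /events/2015-10 and /events/2015-11, end up in one group.
  // A group that dominates the page's outlinks is probably a calendar.
  public class CalendarOutlinkDetector {

    // Collapse every run of digits into a single token.
    static String shape(String url) {
      return url.replaceAll("[0-9]+", "#");
    }

    // Return the outlinks belonging to groups of at least minGroupSize
    // digit-only variations; the threshold is arbitrary and tunable.
    public static List<String> suspectedCalendarLinks(List<String> outlinks,
        int minGroupSize) {
      Map<String, List<String>> groups = new HashMap<String, List<String>>();
      for (String url : outlinks) {
        String key = shape(url);
        List<String> group = groups.get(key);
        if (group == null) {
          group = new ArrayList<String>();
          groups.put(key, group);
        }
        group.add(url);
      }
      List<String> suspects = new ArrayList<String>();
      for (List<String> group : groups.values()) {
        if (group.size() >= minGroupSize) {
          suspects.addAll(group);
        }
      }
      return suspects;
    }

    public static void main(String[] args) {
      List<String> outlinks = java.util.Arrays.asList(
          "http://example.com/category/events/2015-10",
          "http://example.com/category/events/2015-11",
          "http://example.com/category/events/2015-12",
          "http://example.com/about");
      // With a threshold of 3, the three date-like links are reported.
      System.out.println(suspectedCalendarLinks(outlinks, 3));
    }
  }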

Can you explain how Heritrix does it? Perhaps we can learn from it.

Cheers,
Markus

On Sat, 5 May 2012 12:44:27 +0200, Ferdy Galema <[email protected]> wrote:
Hi,

Fetching unwanted pages, such as these dynamically generated pages, is a general problem. Currently I'm not aware of any pending improvements in this area, but feel free to contribute if you have a solution. Probably the best way to solve such a problem is by implementing a custom URLFilter. This filter could use heuristics that are able to detect dynamically
generated URLs.
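
A bare-bones sketch of what such a filter could look like (assuming the Nutch 1.x URLFilter interface; the date pattern is only an example heuristic, and the class still needs the usual plugin descriptor to be registered):

  import java.util.regex.Pattern;

  import org.apache.hadoop.conf.Configuration;
  import org.apache.nutch.net.URLFilter;

  // Example heuristic: reject URLs whose path ends in a year-month
  // segment such as /events/2015-11. Real calendars will need more rules.
  public class CalendarURLFilter implements URLFilter {

    private static final Pattern CALENDAR_SEGMENT =
        Pattern.compile(".*/[0-9]{4}-[0-9]{2}/?$");

    private Configuration conf;

    // Returning null tells Nutch to drop the URL; returning it keeps it.
    public String filter(String urlString) {
      if (CALENDAR_SEGMENT.matcher(urlString).matches()) {
        return null;
      }
      return urlString;
    }

    public void setConf(Configuration conf) {
      this.conf = conf;
    }

    public Configuration getConf() {
      return conf;
    }
  }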

Ferdy.

On Fri, May 4, 2012 at 9:13 PM, Xiao Li <[email protected]> wrote:

Hi Nutch people,

I am using Nutch to index a website. I notice that Nutch has crawled
some junk webpages, such as
http://**************/category/events/2015-11. This webpage is about
events occurring in November 2015, which is of no use to me. I want to
know whether it is possible for Nutch to intelligently skip such
webpages. It may be argued that I could use a regex to avoid this.
However, as the naming patterns of calendar webpages are not the same
all the time, there is no way to write a perfect regex for this. I
know Heritrix (an Internet Archive crawler) has the capability to
avoid crawling nonsense calendar webpages. Has anyone solved this
issue?

Regards
Xiao

