Hi Nutch people,

I am using Nutch to index a website. I notice that Nutch has crawled
some junk webpages, such as
http://**************/category/events/2015-11. This page lists the
events for November 2015, which is of no use to me. I would like to
know whether Nutch can intelligently skip such webpages. One could
argue that I can use a regex URL filter to exclude them. However,
because the naming pattern of calendar pages is not always the same,
there is no way to write a single regex that covers every case. I
know that Heritrix (the Internet Archive's crawler) is able to avoid
crawling such calendar pages. Has anyone solved this
issue?
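
For reference, the kind of rule I mean would look roughly like this in
conf/regex-urlfilter.txt. This is only a sketch: it assumes the
regex URL filter plugin is enabled, and the pattern is my own guess
based on the single URL above, so it would miss any calendar page that
is named differently.

```
# Hypothetical rule: skip monthly event archive pages such as
# /category/events/2015-11 (four-digit year, two-digit month).
-/category/events/\d{4}-\d{2}/?$

# Accept everything else. The first matching rule wins, so the
# exclusion above has to come before this line.
+.
```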

Regards
Xiao
