Hi Nutch people, I am using Nutch to index a website. I have noticed that Nutch crawls some junk pages, such as http://**************/category/events/2015-11. This page lists the events that took place in November 2015, which is of no use to me. Is it possible for Nutch to skip such pages intelligently? It may be argued that I could use a regex filter to exclude them. However, the naming pattern of calendar pages is not always the same, so there is no way to write a regex that catches all of them. I know that Heritrix (the Internet Archive's crawler) is able to avoid crawling such useless calendar pages. Has anyone solved this issue?
Regards, Xiao

