Hi,

Fetching unwanted pages, such as these dynamically generated ones, is a general problem. I'm currently not aware of any pending improvements in this area, but feel free to contribute if you have a solution. Probably the best way to solve such a problem is to implement a custom URLFilter. This filter could apply heuristics that detect dynamically generated URLs.
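As a minimal sketch, assuming the standard org.apache.nutch.net.URLFilter extension point: the heuristic below (rejecting URLs whose path ends in a calendar-like year-month segment) is only illustrative, and a real filter would want the pattern and any cut-off year to be configurable.

import java.util.regex.Matcher;
import java.util.regex.Pattern;

import org.apache.hadoop.conf.Configuration;
import org.apache.nutch.net.URLFilter;

/**
 * Illustrative URLFilter that drops URLs ending in a calendar-like
 * year-month path segment, e.g. /category/events/2015-11.
 */
public class CalendarURLFilter implements URLFilter {

  // Matches a trailing segment such as /2015-11 or /2015/11
  private static final Pattern CALENDAR =
      Pattern.compile("/(19|20)\\d{2}[-/](0?[1-9]|1[0-2])/?$");

  private Configuration conf;

  @Override
  public String filter(String urlString) {
    Matcher m = CALENDAR.matcher(urlString);
    // Returning null tells Nutch to reject the URL;
    // otherwise pass it through unchanged.
    return m.find() ? null : urlString;
  }

  @Override
  public void setConf(Configuration conf) {
    this.conf = conf;
  }

  @Override
  public Configuration getConf() {
    return conf;
  }
}

Like any URLFilter, the class would need to be packaged as a plugin (declared as an extension of org.apache.nutch.net.URLFilter in the plugin's plugin.xml) and the plugin added to plugin.includes in nutch-site.xml before the crawler picks it up.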
Ferdy.

On Fri, May 4, 2012 at 9:13 PM, Xiao Li <[email protected]> wrote:
> Hi Nutch people,
>
> I am using Nutch to index a website. I notice that Nutch has crawled
> some junk webpages, such as
> http://**************/category/events/2015-11. This webpage is about
> events occurring in November 2015, which is complete nonsense to me. I
> want to know if it is possible for Nutch to intelligently skip such
> webpages. It may be argued that I can use a regex to avoid this.
> However, as the naming patterns of calendar webpages are not the same
> all the time, there is no way to write a perfect regex for this. I
> know Heritrix (an Internet Archive crawler) has the capability to
> avoid crawling nonsense calendar webpages. Has anyone solved this
> issue?
>
> Regards,
> Xiao

