Hi, I am crawling pages that are organized into parent category pages, each of which lists the items in that category. A category page can also list subcategories, and only at the deepest level do you reach the item lists.
The URLs generally have the following form: domain/index/[category/subcategory]*/item

Nutch with a sufficiently high depth parameter crawls such pages just fine, but the resulting index gets polluted with terms from the category pages listing their items. Is there a way to prevent category pages from being indexed?

Using a urlfilter that omits category pages (+http://domain.*/item$ || -.*) yields 0 pages fetched, because only those category pages contain the links to the items.

I have seen suggestions to create an indexing plugin that returns null if a page should not be indexed. But then that plugin would work only with the pages it was programmed for.

regards
zm
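For what it's worth, the URL test such an indexing plugin would apply can be kept generic by driving it with a configurable regex instead of hard-coding page knowledge. A minimal standalone sketch of that decision logic, assuming the URL form described above (the pattern string and class name here are my own illustration, not Nutch API):

```java
import java.util.regex.Pattern;

// Sketch: decide at index time whether a URL is an item page.
// An indexing filter could call shouldIndex() and return null
// for category pages, while the crawler still follows their links.
public class CategoryPageCheck {
    // Hypothetical pattern for domain/index/[category/subcategory]*/item:
    // item pages end in "/item"; category listings do not.
    private static final Pattern ITEM_PAGE =
        Pattern.compile("^https?://[^/]+/index(?:/[^/]+)*/item$");

    public static boolean shouldIndex(String url) {
        return ITEM_PAGE.matcher(url).matches();
    }

    public static void main(String[] args) {
        System.out.println(shouldIndex("http://domain/index/cat/sub/item")); // item page
        System.out.println(shouldIndex("http://domain/index/cat"));          // category page
    }
}
```

The point of separating this from the urlfilter is that the fetcher still crawls category pages (so the item links are discovered), while only the indexer skips them; the regex could be read from configuration so the plugin is not tied to one site.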

