Hi,

I am crawling pages organized as a parent category page which lists the
items in that category. A category page can also list subcategories, and
only then do you reach the item lists.

The URL will generally be in the following form:
domain/index/[category/subcategory]*/item

Nutch with a sufficiently high depth parameter will crawl such pages just
fine, but the resulting index will be polluted by terms from the category
pages listing their items. Is there a way to prevent category pages from
being indexed?

Using a urlfilter to omit category pages, e.g.

+http://domain.*/item$
-.*

yields 0 pages fetched, because only those category pages contain the
links to the items.

I have seen suggestions to create an indexing plugin that returns null if
the page should not be indexed. But then this plugin will only work with
the pages it was programmed for.
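For what it's worth, here is a minimal sketch of the URL test such an
indexing-time filter could apply, kept separate from the crawl-time
urlfilter so that category pages are still fetched and their links
followed. The class name, the literal "/item" suffix, and the pattern
itself are assumptions based on the URL shape described above; a real
Nutch indexing filter would return null from its filter() method when
this check returns false.

```java
import java.util.regex.Pattern;

// Hypothetical sketch: decide at indexing time whether a URL is an
// item page (index it) or a category/subcategory page (skip it).
// Assumed URL shape: domain/index/[category/subcategory]*/item
public class ItemPageCheck {

    // Matches URLs whose path is /index, any number of intermediate
    // segments, and a final literal "item" segment (an assumption).
    private static final Pattern ITEM_URL =
            Pattern.compile("^https?://[^/]+/index(?:/[^/]+)*/item$");

    public static boolean shouldIndex(String url) {
        return ITEM_URL.matcher(url).matches();
    }

    public static void main(String[] args) {
        // Item page: indexed. Category page: fetched and its links
        // followed by the crawler, but never indexed.
        System.out.println(shouldIndex("http://domain/index/cat/sub/item")); // true
        System.out.println(shouldIndex("http://domain/index/cat"));          // false
    }
}
```

Keeping the pattern in a configuration property rather than hard-coded
would address the "only works with the pages it was programmed for"
concern, since the same plugin could then be pointed at other sites.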

regards
zm
