Hi,

I am using Nutch 1.10, and we are planning to crawl only URLs that match a 
certain pattern. 
The problem is that we cannot do this with regex-urlfilter.txt, because that 
way the seeds themselves would be rejected.

For example, the seed is apple.com <http://apple.com/> and we want to crawl 
only URLs that have /mac/ in the URL string. Maybe we have to filter the URLs 
at generate or fetch time.
Any thoughts? Can we customize the Generate or Fetch phases?
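
To make the intent concrete, here is a rough sketch in plain Java of the rule 
we are after: keep a URL if it is one of the seeds or if it contains /mac/. 
It does not use any Nutch API; the class name, the hard-coded seed set and 
the accept method are made up purely for illustration.

```java
import java.util.Set;

public class SeedAwareFilterSketch {

    // Hypothetical seed list; in a real crawl this would come from the seed file.
    static final Set<String> SEEDS = Set.of("http://apple.com/");

    // Returns true if the URL should be kept: either it is a seed,
    // or its string contains the "/mac/" path segment.
    static boolean accept(String url) {
        return SEEDS.contains(url) || url.contains("/mac/");
    }

    public static void main(String[] args) {
        String[] candidates = {
            "http://apple.com/",
            "http://apple.com/mac/",
            "http://apple.com/iphone/"
        };
        for (String url : candidates) {
            System.out.println(url + " kept: " + accept(url));
        }
    }
}
```

A plain /mac/ pattern on its own would drop http://apple.com/ as well, which 
is exactly the problem described above.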

Thanks
Manish Verma

