Putting 5000 domains or paths in regex-urlfilter may sound slow, but once you try it you may find the performance acceptable. It works for me, anyway.

-aj
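
For what it's worth, the filter file does not have to be written by hand. Below is a minimal sketch, not a tested Nutch recipe, of generating one accept rule per seed domain plus a final catch-all reject line. The function names (host_of, build_filter_rules) are made up for illustration, the rule shape follows the "+^http://([a-z0-9]*\.)*MY.DOMAIN.NAME/" style commonly shown for regex-urlfilter.txt, and stripping a leading "www." so that sibling subdomains are accepted is an assumption about what is wanted here.

```python
import re
from urllib.parse import urlparse


def host_of(seed):
    """Return the lowercase host of a seed URL or bare domain ('' if none)."""
    seed = seed.strip()
    if "://" not in seed:
        # Bare domains need a scheme before urlparse will see a host.
        seed = "http://" + seed
    return urlparse(seed).hostname or ""


def build_filter_rules(seeds):
    """Build regex-urlfilter style lines: one '+' rule per domain, then '-.'."""
    rules = []
    seen = set()
    for seed in seeds:
        host = host_of(seed)
        # Assumption: treat www.example.com as example.com so that
        # other subdomains (blog.example.com, ...) are accepted too.
        if host.startswith("www."):
            host = host[4:]
        if not host or host in seen:
            continue
        seen.add(host)
        escaped = host.replace(".", r"\.")
        rule = r"+^https?://([a-z0-9-]+\.)*" + escaped + "/"
        re.compile(rule[1:])  # sanity check: the pattern part must compile
        rules.append(rule)
    # Reject everything that matched no accept rule (external links).
    rules.append("-.")
    return rules


if __name__ == "__main__":
    # Hypothetical seeds; in practice read the 5000 URLs from a file.
    for line in build_filter_rules(["http://www.example.com/", "example.org"]):
        print(line)
```

The printed lines would go into regex-urlfilter.txt in place of the default accept rule. Whether 5000 such rules are fast enough is something to measure on a real crawl.
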
On Fri, Aug 20, 2010 at 12:12 PM, Sonal Goyal <[email protected]> wrote:

> Hi,
>
> I have a list of about 5000 URLs which I need to crawl and fetch using
> Nutch. I want to do a very deep crawl on each and I want subdomains, but I
> don't want external links. If I set db.ignore.external.links, I don't get the
> subdomains. So I can't use that. If I set the domain in regex-urlfilter, I
> can avoid the external links and get the subdomains, but it does not seem
> right to include so many URLs in the filter. Am I missing some configuration
> or am I using Nutch wrongly?
>
> I would appreciate any help. Thanks in advance.
>
> Thanks and Regards,
> Sonal
>

--
AJ Chen, PhD
Chair, Semantic Web SIG, sdforum.org
http://web2express.org
twitter @web2express
Palo Alto, CA, USA

