Putting 5000 domains or paths in regex-urlfilter may sound slow, but if you
try it, you may find the performance acceptable. It works for me, anyway.
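
If it helps, here is a minimal sketch of how such a filter file could be
generated rather than typed by hand. It assumes the usual regex-urlfilter
format (one rule per line, "+" to accept, "-" to reject, first match wins)
and that each seed can be reduced to a bare domain. The function name
build_filter_rules and the example domains are hypothetical, not part of
Nutch.

```python
def build_filter_rules(domains):
    """Return regex-urlfilter lines that accept each domain and its subdomains.

    Assumption: `domains` holds bare host names such as "example.com".
    """
    rules = []
    for domain in domains:
        # Escape dots so "example.com" does not also match "exampleXcom".
        escaped = domain.replace(".", r"\.")
        # Optional subdomain labels, then the domain itself, then a slash.
        rules.append(r"+^https?://([a-z0-9-]+\.)*" + escaped + "/")
    # Final catch-all: reject every URL not accepted by a rule above.
    rules.append("-.")
    return rules


if __name__ == "__main__":
    # Hypothetical domains; in practice these would come from the seed list.
    for line in build_filter_rules(["example.com", "example.org"]):
        print(line)
```

The printed lines would go into the regex-urlfilter file in place of the
default accept-everything rule. Whether 5000 such rules are fast enough is
something to measure on your own crawl.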
-aj

On Fri, Aug 20, 2010 at 12:12 PM, Sonal Goyal <[email protected]> wrote:

> Hi,
>
> I have a list of about 5000 URLs which I need to crawl and fetch using
> Nutch. I want to do a very deep crawl on each, and I want subdomains, but I
> don't want external links. If I set db.ignore.external.links, I don't get the
> subdomains, so I can't use that. If I set the domain in regex-urlfilter, I
> can avoid the external links and get the subdomains, but it does not seem
> right to include so many URLs in the filter. Am I missing some configuration,
> or am I using Nutch wrongly?
>
> I would appreciate any help. Thanks in advance.
>
> Thanks and Regards,
> Sonal
>



-- 
AJ Chen, PhD
Chair, Semantic Web SIG, sdforum.org
http://web2express.org
twitter @web2express
Palo Alto, CA, USA
