I haven't really focused my time on subdomains. I think I saw some in my crawl data, but can't confirm at the moment. One question: are you putting "www." in your injected URLs, or just http://[domain]?
If that doesn't make a difference, then it seems to me that the regex handler should be the target of a patch. Perhaps something like:

http://([\w-]+\.)*<%= inject.base_fqdn %>*

I'd really like to see more datum exposure in the regex parsers, rather than churning out XML for every mundane use case. Or just a "...\$domain..." etc.

Sent from my iPhone

On Aug 20, 2010, at 12:12 PM, Sonal Goyal <[email protected]> wrote:

> Hi,
>
> I have a list of about 5000 URLs which I need to crawl and fetch using
> Nutch. I want to do a very deep crawl on each and I want subdomains, but I
> don't want external links. If I set db.ignore.external.links, I don't get the
> subdomains. So I can't use that. If I set the domain in regex-urlfilter, I
> can avoid the external links and get the subdomains, but it does not seem
> right to include so many URLs in the filter. Am I missing some configuration
> or am I using Nutch wrongly?
>
> I would appreciate any help. Thanks in advance.
>
> Thanks and Regards,
> Sonal
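
Until something like a per-domain template exists in the regex handler, one workaround is to generate the filter lines from the seed list instead of writing them by hand. Below is a minimal Python sketch of that idea, not anything Nutch ships: the function names (subdomain_pattern, filter_rules, accepts) are hypothetical, and the only thing borrowed from Nutch is the regex-urlfilter.txt convention that a leading "+" accepts and a leading "-" rejects. It also assumes both http and https should match and that an optional port is acceptable.

```python
import re


def subdomain_pattern(base_fqdn: str) -> str:
    """Regex for URLs on base_fqdn or any subdomain of it (hypothetical helper)."""
    # ([\w-]+\.)* allows zero or more subdomain labels, so both
    # example.com and a.b.example.com match.
    # (/|$) stops example.com.evil.org from slipping through.
    return r"^https?://([\w-]+\.)*" + re.escape(base_fqdn) + r"(:\d+)?(/|$)"


def filter_rules(domains):
    """Build accept lines for every seed domain, then reject everything else."""
    rules = ["+" + subdomain_pattern(d) for d in domains]
    rules.append("-.")  # assumption: anything not accepted above is dropped
    return rules


def accepts(url: str, domains) -> bool:
    """Check locally whether a URL would pass the generated accept rules."""
    return any(re.search(subdomain_pattern(d), url) for d in domains)


if __name__ == "__main__":
    seeds = ["example.com", "example.org"]
    print("\n".join(filter_rules(seeds)))
    print(accepts("http://blog.example.com/post/1", seeds))
    print(accepts("http://example.com.evil.org/", seeds))
```

With 5000 seeds this still produces 5000 accept lines, so it only removes the manual editing, not the size of the filter; whether that many rules is acceptable for filter performance is something to measure on your own crawl.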

