I haven't really spent much time on subdomains. I think I saw some in my crawl 
data, but I can't confirm that at the moment. One question: are you putting 
"www." in your injected URLs, or just http://[domain]?

If that doesn't make a difference, then it seems to me that the regex 
handler should be the target of a patch.

Perhaps something like:
  http://([\w-]+\.)*<%= inject.base_fqdn %>.*
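
To show what a pattern of that shape accepts, here is a minimal sketch in Python rather than in the filter itself. The function name `subdomain_pattern` and the example hosts are made up for illustration, and the tail of the pattern is tightened from a bare wildcard to "end of host or a path" so that a look-alike host such as example.com.evil.org is rejected. The syntax used here should behave the same in Java regexes, but I have not run it through Nutch.

```python
import re


def subdomain_pattern(base_fqdn):
    """Build a regex that accepts base_fqdn and any of its subdomains.

    Hypothetical helper: base_fqdn stands in for the injected domain.
    """
    return re.compile(
        r"^http://([\w-]+\.)*"   # zero or more subdomain labels, each followed by a dot
        + re.escape(base_fqdn)   # the base domain, with its dots taken literally
        + r"(/.*)?$"             # either nothing more, or a path
    )


if __name__ == "__main__":
    pattern = subdomain_pattern("example.com")
    for url in (
        "http://example.com/",
        "http://www.example.com/page",
        "http://a.b.example.com",
        "http://example.org/",
        "http://example.com.evil.org/",
    ):
        print(url, bool(pattern.match(url)))
```

The `+` after `[\w-]` matters: without it each subdomain label could only be one character long, so "www." would never match.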

I'd really like to see more data exposed to the regex parsers, rather than 
churning out XML for every mundane use case.

Or just a "...\$domain..." etc.
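
Until something like that exists, the placeholder idea can be approximated outside Nutch by expanding a `$domain` template over the seed domains and pasting the result into regex-urlfilter.txt. This is only a sketch: the `RULE` template, the `filter_rules` helper, and the sample domains are hypothetical, and the closing `-.` line assumes you want everything else rejected.

```python
import re
from string import Template

# One accept rule per domain; $domain is the only placeholder in the template.
RULE = Template(r"+^https?://([\w-]+\.)*$domain/")


def filter_rules(domains):
    """Return regex-urlfilter style lines for the given domains (hypothetical helper)."""
    rules = [RULE.substitute(domain=re.escape(d)) for d in domains]
    rules.append("-.")  # assumption: reject anything not matched above
    return rules


if __name__ == "__main__":
    for line in filter_rules(["example.com", "apache.org"]):
        print(line)
```

With 5000 domains this still produces 5000 rules, so it automates the file rather than shrinking it.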

Sent from my iPhone

On Aug 20, 2010, at 12:12 PM, Sonal Goyal <[email protected]> wrote:

> Hi,
> 
> I have a list of about 5000 URLs which I need to crawl and fetch using
> Nutch. I want to do a very deep crawl on each and I want subdomains, but I
> don't want external links. If I set db.ignore.external.links, I don't get the
> subdomains. So I can't use that. If I set the domain in regex-urlfilter, I
> can avoid the external links and get the subdomains, but it does not seem
> right to include so many urls in the filter. Am I missing some configuration
> or am I using Nutch wrongly?
> 
> I would appreciate any help. Thanks in advance.
> 
> Thanks and Regards,
> Sonal
