Re: crawling a subdomain

Sergey A Volkov Sun, 06 Nov 2011 23:23:04 -0800

If I understand correctly,

you can run the inject job on your crawldb with the new URLs in a new input file; the old URLs will still be in the crawldb.
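
One way to produce that new input file is sketched below. It is only an illustration, not something from this thread: the function names, the "www" and "abc" defaults, and the idea of deriving the extra seeds with a script are all assumptions. It swaps the leading host label of each existing seed and keeps only URLs that are not already in the list.

```python
from urllib.parse import urlsplit, urlunsplit


def subdomain_variant(url, old_label="www", new_label="abc"):
    """Return url with its leading host label swapped, or None if it does not start with old_label."""
    parts = urlsplit(url)
    host = parts.hostname or ""
    labels = host.split(".")
    if labels[0] != old_label:
        return None
    new_host = ".".join([new_label] + labels[1:])
    if parts.port:
        # Preserve an explicit port if the seed URL had one.
        new_host += f":{parts.port}"
    return urlunsplit((parts.scheme, new_host, parts.path, parts.query, parts.fragment))


def extra_seeds(urls):
    """Build the list of additional seed URLs, skipping any that already exist."""
    seen = set(urls)
    out = []
    for u in urls:
        v = subdomain_variant(u.strip())
        if v and v not in seen:
            seen.add(v)
            out.append(v)
    return out
```

The result of extra_seeds() could be written one URL per line to a file in a separate directory and passed to the inject job, for example "bin/nutch inject crawl/crawldb new_urls/", where both paths are placeholders.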


On Mon 07 Nov 2011 10:15:26 AM MSK, Peyman Mohajerian wrote:

Thanks Sergey,
I don't think I was clear on the issue. The subdomain I'm speaking of
won't be found by the crawler, so I have to add it somehow. With my
original input URL of http://www.xyz.com/stuff
there is absolutely no way the crawler would know about http://abc.xyz.com/stuff,
so I have to add the subdomain dynamically.
I also don't have the option of actually adding
'http://abc.xyz.com/stuff' in my input file (a bit of an extra
convolution I don't want to bore you with!!).

Thanks,
Peyman

On Sun, Nov 6, 2011 at 1:21 PM, Sergey A Volkov
<[email protected]>  wrote:

Hi!

I think you should use urlfilter-regex with a pattern like
"http://\w+\.xyz\.com/stuff.*" instead of urlfilter-domain, and set
db.ignore.external.links to false. This will work, but it is quite slow if
you have many regexes.
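
A quick way to sanity-check such a pattern before putting it into the filter file is sketched below. This is only an approximation: Nutch evaluates its rules with Java regular expressions, and Python's re module is used here on the assumption that \w, + and \. behave the same way for this pattern. The names PATTERN, SINGLE_CHAR and accepted are made up for the example. Note that the subdomain part needs a quantifier (\w+), since a bare \w matches only a single character.

```python
import re

# Assumed pattern: one or more word characters as the subdomain label.
PATTERN = re.compile(r"http://\w+\.xyz\.com/stuff.*")

# Same pattern without the quantifier, for comparison only.
SINGLE_CHAR = re.compile(r"http://\w\.xyz\.com/stuff.*")


def accepted(url, pattern=PATTERN):
    """Return True if the pattern matches from the start of the URL."""
    return pattern.match(url) is not None


if __name__ == "__main__":
    for candidate in (
        "http://www.xyz.com/stuff",
        "http://abc.xyz.com/stuff/page",
        "http://www.other.com/stuff",
    ):
        print(candidate, accepted(candidate))
```

In regex-urlfilter.txt each rule line starts with + (accept) or - (reject) followed by the expression, so the pattern would go on a + line placed before any catch-all reject rule.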

You may also try adding xyz.com to domain-suffixes.xml. This may cause some
side effects; I have never tested it, I just looked at the DomainURLFilter
source, so it's probably not a really good idea.

Sergey Volkov

On Mon 07 Nov 2011 12:35:30 AM MSK, Peyman Mohajerian wrote:


Hi Guys,

Let's say my input file is:
http://www.xyz.com/stuff

and I have thousands of these URLs in my input. How do I configure
Nutch to also crawl this subdomain for each input:
http://abc.xyz.com/stuff

I don't want to just replace 'www' with 'abc'; I want to crawl both.

Thanks
Peyman
