Hello - you would need to set db.ignore.external.links.mode; check the
description of that parameter in nutch-default.xml. You can also use the
domain URL filter plugin (urlfilter-domain).
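
To make the host versus domain distinction in this thread concrete, here is a
minimal sketch in Python. It is not Nutch code. The function names are
hypothetical, and the domain is approximated as the last two labels of the
hostname, which is an assumption that only holds for simple suffixes such as
.com. The mode names byHost and byDomain are the ones discussed in this thread
for queue mode; whether db.ignore.external.links.mode takes the same values
should be checked against nutch-default.xml.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def host_of(url):
    """Return the lowercased hostname of a URL (empty string if missing)."""
    return (urlsplit(url).hostname or "").lower()


def naive_domain_of(url):
    """Approximate the domain as the last two labels of the hostname.

    Assumption: fine for simple suffixes like .com, wrong for suffixes
    such as .co.uk. A real crawler needs a proper suffix list.
    """
    labels = host_of(url).split(".")
    return ".".join(labels[-2:])


def queue_key(url, mode):
    """Key used to group a URL: its hostname or its (naive) domain."""
    if mode == "byHost":
        return host_of(url)
    if mode == "byDomain":
        return naive_domain_of(url)
    raise ValueError("unsupported mode: " + mode)


def group_into_queues(urls, mode):
    """Group URLs into queues keyed by host or by domain."""
    queues = defaultdict(list)
    for url in urls:
        queues[queue_key(url, mode)].append(url)
    return dict(queues)


def is_external(from_url, to_url, mode):
    """Treat a link as external when source and target keys differ."""
    return queue_key(from_url, mode) != queue_key(to_url, mode)


urls = [
    "http://www.apple.com/ipad/",
    "http://www.apple.com/iphone/",
    "http://itunes.apple.com",
]
by_host = group_into_queues(urls, "byHost")      # keys: the two hostnames
by_domain = group_into_queues(urls, "byDomain")  # single key: apple.com
```

Under these assumptions a link from www.abc.com to x.abc.com is external by
host but internal by domain. A link to abc.ax.com is external in both modes,
because its domain is ax.com, so it would have to be allowed through a URL
filter rather than through the mode setting alone.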
Markus
-----Original message-----
> From:harsh <[email protected]>
> Sent: Thursday 25th February 2016 7:53
> To: [email protected]
> Subject: Re: How does fetcher.queue.mode seprates url for queues when it is
> set byhost
>
> Hi Markus,
> If db.ignore.internal.links=false, will the URL "x.abc.com" be ignored
> when the seed URL is "www.abc.com"? If yes, what do I have to do
> to include subdomains? I just want to crawl the outlinks of the
> page that belong to "abc.com", "x.abc.com", and "abc.ax.com" only. (In
> this case we cannot use db.ignore.internal.links=true, as it will
> allow all the outlinks of entirely different hosts and domains.)
> Can this problem be solved by editing the url-filter.txt
> accordingly? If yes, is there any other way to resolve it?
>
> Thanks
>
> On Thursday 25 February 2016 04:11 AM, Manish Verma wrote:
> > Thanks Markus, yes, this answered my question.
> > So basically one queue is created for each subdomain when queue mode is
> > set to byHost.
> >
> > Thanks
> >
> >
> >> On Feb 24, 2016, at 1:54 PM, Markus Jelsma <[email protected]>
> >> wrote:
> >>
> >> Hello - separated by name means by hostname. In your example, with
> >> queue mode byHost there are only two queues, www.apple.com and
> >> itunes.apple.com. When queued by domain, there is just one queue, the
> >> apple.com queue.
> >>
> >> Does this answer your question?
> >> Markus
> >>
> >> -----Original message-----
> >>> From:Manish Verma <[email protected]>
> >>> Sent: Wednesday 24th February 2016 22:36
> >>> To: [email protected]
> >>> Subject: Re: How does fetcher.queue.mode seprates url for queues when it
> >>> is set byhost
> >>>
> >>> What do you mean by "separate by name only" here?
> >>> I have the URLs below; can you please tell me how many queues there
> >>> will be if queue mode is byHost?
> >>>
> >>> http://www.apple.com/ipad/
> >>> http://www.apple.com/iphone/
> >>> http://itunes.apple.com
> >>>
> >>> Thanks
> >>>
> >>>> On Feb 24, 2016, at 12:52 PM, Markus Jelsma <[email protected]>
> >>>> wrote:
> >>>>
> >>>> Hello Manish - byHost in fetcher|generate.queue.mode means queue/separate
> >>>> by name only. Neither the generator nor the fetcher uses IP address
> >>>> information for queuing purposes. I am not sure what you mean by working
> >>>> with a load balancer. A hostname resolves to one or more IPs, possibly
> >>>> anycast addresses as well. As far as I know/remember, a single IP is used
> >>>> during the fetch, without any DNS round robin, but this might be
> >>>> different between protocol plugins.
> >>>>
> >>>> Do you have a concrete problem to solve?
> >>>>
> >>>> Markus
> >>>>
> >>>> -----Original message-----
> >>>>> From:Manish Verma <[email protected]>
> >>>>> Sent: Wednesday 24th February 2016 21:45
> >>>>> To: [email protected]
> >>>>> Subject: How does fetcher.queue.mode seprates url for queues when it
> >>>>> is set byhost
> >>>>>
> >>>>> Hi,
> >>>>>
> >>>>> I am a little bit confused about how the fetcher.queue.mode property
> >>>>> identifies the URLs.
> >>>>> How does it work when the value is "byHost"? Does it identify
> >>>>> URLs by IP? How does it work with a load balancer?
> >>>>> I know it creates queues based on host, but what is meant by host here?
> >>>>>
> >>>>> Is there any other property that has an impact on this?
> >>>>>
> >>>>> Thanks
> >>>
> >
>
>