Hello - you would need to set db.ignore.external.links.mode; check the 
description of that parameter in nutch-default.xml. You can also use the domain 
URL filter plugin.
Markus
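
To illustrate the difference between grouping by host and grouping by domain discussed in this thread, here is a minimal Python sketch. It is not Nutch code: the helper names (host_of, naive_domain_of, queue_key, group_into_queues, is_external) are hypothetical, and the domain is approximated as the last two labels of the hostname, which is an assumption that breaks for suffixes such as co.uk.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def host_of(url):
    """Return the lowercased hostname of a URL."""
    return (urlsplit(url).hostname or "").lower()


def naive_domain_of(url):
    """Approximate the domain as the last two labels of the hostname.

    Assumption: fine for apple.com or abc.com, wrong for suffixes
    such as co.uk. This is an illustration, not Nutch's own logic.
    """
    labels = host_of(url).split(".")
    return ".".join(labels[-2:])


def queue_key(url, mode):
    """Grouping key: the hostname for byHost, the domain otherwise."""
    return host_of(url) if mode == "byHost" else naive_domain_of(url)


def group_into_queues(urls, mode):
    """Group URLs into queues keyed by host or by domain."""
    queues = defaultdict(list)
    for url in urls:
        queues[queue_key(url, mode)].append(url)
    return dict(queues)


def is_external(from_url, to_url, mode):
    """True if the outlink leaves the source page's host (or domain)."""
    return queue_key(from_url, mode) != queue_key(to_url, mode)


urls = [
    "http://www.apple.com/ipad/",
    "http://www.apple.com/iphone/",
    "http://itunes.apple.com",
]
by_host = group_into_queues(urls, "byHost")
by_domain = group_into_queues(urls, "byDomain")

print(sorted(by_host))    # ['itunes.apple.com', 'www.apple.com']
print(sorted(by_domain))  # ['apple.com']
print(is_external("http://www.abc.com/", "http://x.abc.com/", "byHost"))    # True
print(is_external("http://www.abc.com/", "http://x.abc.com/", "byDomain"))  # False
```

Under the same naive rule, abc.ax.com falls under ax.com rather than abc.com, so in this sketch it still counts as external to www.abc.com even when comparing by domain.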
 
-----Original message-----
> From:harsh <[email protected]>
> Sent: Thursday 25th February 2016 7:53
> To: [email protected]
> Subject: Re: How does fetcher.queue.mode seprates  url for queues when it is 
> set byhost
> 
> Hi Markus
> If db.ignore.internal.links = false, will the URL "x.abc.com" be ignored 
> when the seed URL is "www.abc.com"? If yes, what do I have to do 
> to include subdomains? I just want to crawl the outlinks of the 
> page that belong to "abc.com", "x.abc.com", and "abc.ax.com" only. (In 
> this case we cannot use db.ignore.internal.links = true, as it will 
> allow all the outlinks of entirely different hosts and domains.)
> Can this problem be solved by editing url-filter.txt 
> accordingly? If yes, is there any other way to resolve this problem?
> 
> Thanks
> 
> 
> 
> 
> On Thursday 25 February 2016 04:11 AM, Manish Verma wrote:
> > Thanks Markus, yes, this answered my question.
> > So basically one queue is created for each subdomain when the queue mode 
> > is set to byHost.
> >
> > Thanks
> >
> >
> >> On Feb 24, 2016, at 1:54 PM, Markus Jelsma <[email protected]> 
> >> wrote:
> >>
> >> Hello - separated by name means by hostname. In your example, with 
> >> queue mode byHost there are only two queues, www.apple.com and 
> >> itunes.apple.com. When queued by domain, there is obviously just one 
> >> queue, the apple.com queue.
> >>
> >> Does this answer your question?
> >> Markus
> >>
> >> -----Original message-----
> >>> From:Manish Verma <[email protected]>
> >>> Sent: Wednesday 24th February 2016 22:36
> >>> To: [email protected]
> >>> Subject: Re: How does fetcher.queue.mode seprates  url for queues when it 
> >>> is set byhost
> >>>
> >>> What do you mean by "separate by name only" here?
> >>> I have the URLs below. Can you please tell me how many queues there 
> >>> will be if the queue mode is byHost?
> >>>
> >>> http://www.apple.com/ipad/
> >>> http://www.apple.com/iphone/
> >>> http://itunes.apple.com
> >>>
> >>> Thanks
> >>>
> >>>> On Feb 24, 2016, at 12:52 PM, Markus Jelsma <[email protected]> 
> >>>> wrote:
> >>>>
> >>>> Hello Manish - byHost in fetcher|generate.queue.mode means queue/separate 
> >>>> by name only. Neither the generator nor the fetcher uses IP address 
> >>>> information for queuing purposes. I am not sure what you mean by working 
> >>>> with a load balancer. A hostname resolves to one or more IPs, possibly 
> >>>> anycast addresses as well. As far as I know/remember, a single IP is 
> >>>> used during the fetch, without any DNS round robin, but this might be 
> >>>> different between protocol plugins.
> >>>>
> >>>> Do you have a concrete problem to solve?
> >>>>
> >>>> Markus
> >>>>
> >>>> -----Original message-----
> >>>>> From:Manish Verma <[email protected]>
> >>>>> Sent: Wednesday 24th February 2016 21:45
> >>>>> To: [email protected]
> >>>>> Subject: How does fetcher.queue.mode seprates  url for queues when it 
> >>>>> is set byhost
> >>>>>
> >>>>> Hi,
> >>>>>
> >>>>> I am a little bit confused about how the fetcher.queue.mode property 
> >>>>> identifies URLs.
> >>>>> How does it work when the value is “byhost”? Does it identify 
> >>>>> URLs by IP? How does it work with a load balancer?
> >>>>> I know it creates queues based on host, but what is meant by host here?
> >>>>>
> >>>>> Is there any other property that has an impact on this?
> >>>>>
> >>>>> Thanks
> >>>
> >
> 
> 
