Hi Markus,
If db.ignore.internal.links = false, will the URL "x.abc.com" be ignored
when the seed URL is "www.abc.com"? If yes, what do I have to do
to include subdomains? I just want to crawl the outlinks of the
page that belong to "abc.com", "x.abc.com" and "abc.ax.com" only. (In
this case we cannot use db.ignore.internal.links = true, as it will
allow all the outlinks of entirely different hosts and domains.)
Can this problem be solved by editing url-filter.txt
accordingly? If yes, is there any other way to resolve it?
Thanks
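To make the url-filter idea above concrete, here is a rough Python sketch of how a first-match-wins regex filter could restrict a crawl to the three hosts named in the question. It is not Nutch code: the rule list, the `accepts` helper and the exact patterns are assumptions for illustration, and it assumes the filter tries rules in order, lets the first matching rule decide, and rejects a URL when nothing matches.

```python
import re

# Hypothetical rule list: "+" accepts, "-" rejects, first match wins.
RULES = [
    # abc.com itself and any subdomain such as www.abc.com or x.abc.com
    ("+", re.compile(r"^https?://([a-z0-9-]+\.)*abc\.com(/|$)")),
    # the one extra host from the question
    ("+", re.compile(r"^https?://abc\.ax\.com(/|$)")),
    # everything else is rejected
    ("-", re.compile(r".")),
]


def accepts(url):
    """Return True if the first rule matching the URL is an accept rule."""
    for sign, pattern in RULES:
        if pattern.search(url):
            return sign == "+"
    # Assumption: a URL that matches no rule is rejected.
    return False


for candidate in ("http://www.abc.com/page", "http://x.abc.com/",
                  "http://abc.ax.com/a", "http://www.other.com/"):
    print(candidate, accepts(candidate))
```

The same two accept patterns followed by a catch-all reject line would be the natural thing to try in the regex URL filter file, but please verify that against your own Nutch configuration before relying on it.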
On Thursday 25 February 2016 04:11 AM, Manish Verma wrote:
Thanks Markus, yes, this answered my question.
So basically one queue is created for each subdomain when the queue mode is
set to byHost.
Thanks
On Feb 24, 2016, at 1:54 PM, Markus Jelsma <[email protected]> wrote:
Hello - separated by name means by hostname. In your example, with queue mode
byHost there are only two queues, www.apple.com and itunes.apple.com. When
queued by domain, there is obviously just one queue, the apple.com queue.
Does this answer your question?
Markus
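The grouping described above can be sketched in a few lines of Python. This is only an illustration, not Nutch's implementation: `queue_key` and `build_queues` are made-up names, and the byDomain branch naively assumes the domain is the last two labels of the hostname, which would be wrong for suffixes such as co.uk.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def queue_key(url, mode):
    """Return the queue name for a URL under the given queue mode."""
    host = urlsplit(url).hostname
    if mode == "byHost":
        return host
    if mode == "byDomain":
        # Naive assumption: the domain is the last two labels of the hostname.
        return ".".join(host.split(".")[-2:])
    raise ValueError("unsupported mode: " + mode)


def build_queues(urls, mode):
    """Group URLs into queues keyed by hostname or by domain."""
    queues = defaultdict(list)
    for url in urls:
        queues[queue_key(url, mode)].append(url)
    return dict(queues)


URLS = [
    "http://www.apple.com/ipad/",
    "http://www.apple.com/iphone/",
    "http://itunes.apple.com",
]

print(sorted(build_queues(URLS, "byHost")))    # two queue names
print(sorted(build_queues(URLS, "byDomain")))  # one queue name
```

With the three URLs from the question, byHost gives the queues www.apple.com and itunes.apple.com, and byDomain gives the single apple.com queue.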
-----Original message-----
From:Manish Verma <[email protected]>
Sent: Wednesday 24th February 2016 22:36
To: [email protected]
Subject: Re: How does fetcher.queue.mode seprates url for queues when it is
set byhost
What do you mean by "separate by name only" here?
I have the URLs below. Can you please tell me how many queues there will be
if the queue mode is byHost?
http://www.apple.com/ipad/
http://www.apple.com/iphone/
http://itunes.apple.com
Thanks
On Feb 24, 2016, at 12:52 PM, Markus Jelsma <[email protected]> wrote:
Hello Manish - byHost in fetcher|generate.queue.mode means queue/separate by
name only. Neither the generator nor the fetcher uses IP address information
for queuing purposes. I am not sure what you mean by working with a load
balancer. A hostname resolves to one or more IPs, possibly anycast addresses
as well. As far as I know/remember, a single IP is used during the fetch,
without any DNS round robin, but this might differ between protocol plugins.
Do you have a concrete problem to solve?
Markus
-----Original message-----
From:Manish Verma <[email protected]>
Sent: Wednesday 24th February 2016 21:45
To: [email protected]
Subject: How does fetcher.queue.mode seprates url for queues when it is set
byhost
Hi,
I am a little bit confused about how the fetcher.queue.mode property
identifies URLs.
How does it work when the value is “byhost”? Does it identify URLs by IP?
How does it work with a load balancer?
I know it creates queues based on host, but what is meant by host here?
Is there any other property that has an impact on this?
Thanks