Re: Reduce phase in Fetcher taking excessive time to finish.

2014-11-02 Thread Meraj A. Khan
Julien, do we need to consider any data loss (URLs) in this scenario? No, why? Thank you for confirming. J. On Thu, Oct 30, 2014 at 6:22 AM, Julien Nioche lists.digitalpeb...@gmail.com wrote: Hi Meraj, you can control the # of URLs per segment with property

Re: Reduce phase in Fetcher taking excessive time to finish.

2014-10-31 Thread Julien Nioche
On 30 October 2014 22:58, Meraj A. Khan mera...@gmail.com wrote: Thanks for the info, Julien. For the hypothetical example below: topN = 200,000, generate.max.count = 10,000, generate.count.mode = host. If the number of hosts is 10, and let us assume that each one of those hosts has more than

Re: Reduce phase in Fetcher taking excessive time to finish.

2014-10-30 Thread Julien Nioche
Hi Meraj, you can control the # of URLs per segment with property <name>generate.max.count</name> <value>-1</value> <description>The maximum number of urls in a single fetchlist. -1 if unlimited. The urls are counted according to the value of the parameter generator.count.mode.</description>

Re: Reduce phase in Fetcher taking excessive time to finish.

2014-10-30 Thread Meraj A. Khan
Thanks for the info, Julien. For the hypothetical example below: topN = 200,000, generate.max.count = 10,000, generate.count.mode = host. If the number of hosts is 10, and let us assume that each one of those hosts has more than 10,000 unfetched URLs in CrawlDB, since we have set generate.max.count to
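
The hypothetical numbers in this message can be checked with a quick back-of-the-envelope calculation. The sketch below is a simplified model, not Nutch's Generator code: it assumes a single segment, that URLs beyond the per-host cap are simply left out of the fetchlist, and that score ordering does not change the total. The function name and the figure of 50,000 unfetched URLs per host are made up for illustration.

```python
def estimated_fetchlist_size(unfetched_per_host, top_n, max_count):
    """Rough upper bound on the URLs in one fetchlist (simplified model).

    unfetched_per_host: unfetched URL counts, one entry per host
    top_n:              the topN limit passed to the generator
    max_count:          generate.max.count; a negative value means unlimited
    """
    if max_count < 0:
        # No per-host cap: every unfetched URL is a candidate.
        capped_total = sum(unfetched_per_host)
    else:
        # Each host contributes at most max_count URLs.
        capped_total = sum(min(n, max_count) for n in unfetched_per_host)
    # topN caps the overall size of the fetchlist.
    return min(top_n, capped_total)


# 10 hosts, each assumed to hold 50,000 unfetched URLs (illustrative figure).
hosts = [50_000] * 10
print(estimated_fetchlist_size(hosts, top_n=200_000, max_count=10_000))  # prints 100000
```

Under these assumptions the per-host cap, not topN, is the binding limit: the segment holds at most 10 x 10,000 = 100,000 URLs even though topN is 200,000.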

Re: Reduce phase in Fetcher taking excessive time to finish.

2014-10-26 Thread Meraj A. Khan
Julien, on further analysis, I found that it was not a delay at reduce time but a long-running fetch map task. When I have multiple fetch map tasks running on a single segment, I see that one of the map tasks runs for an excessively longer period of time than the other fetch map tasks, it

Re: Reduce phase in Fetcher taking excessive time to finish.

2014-10-18 Thread Julien Nioche
Hi Meraj, what do the logs for the map tasks tell you about the URLs being fetched? J. On 17 October 2014 19:08, Meraj A. Khan mera...@gmail.com wrote: Julien, thanks for your suggestion. I looked at the jstack thread dumps, and I could see that the fetcher threads are in a waiting state

Re: Reduce phase in Fetcher taking excessive time to finish.

2014-10-17 Thread Meraj A. Khan
Julien, thanks for your suggestion. I looked at the jstack thread dumps, and I could see that the fetcher threads are in a waiting state and, looking at the JobClient console, the map phase is actually not yet complete. 14/10/15 12:09:48 INFO mapreduce.Job: map 95% reduce 31% 14/10/16 07:11:20

Re: Reduce phase in Fetcher taking excessive time to finish.

2014-10-16 Thread Julien Nioche
Hi Meraj, you could call jstack on the Java process a couple of times to see what it is busy doing; that will be a simple way of checking that this is indeed the source of the problem. See https://issues.apache.org/jira/browse/NUTCH-1314 for a possible solution. J. On 16 October 2014 06:08,
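
The suggestion to call jstack a couple of times can be scripted. The following is a minimal sketch, assuming the JDK's `jstack` tool is on the PATH and that the PID of the task JVM is already known; the helper names `count_thread_states` and `sample_jstack` are hypothetical and not part of Nutch or Hadoop.

```python
import re
import subprocess
import time
from collections import Counter

# jstack prints one "java.lang.Thread.State: <STATE>" line per Java thread.
STATE_RE = re.compile(r"java\.lang\.Thread\.State: (\w+)")


def count_thread_states(dump_text):
    """Count how many threads are in each state in one jstack dump."""
    return Counter(STATE_RE.findall(dump_text))


def sample_jstack(pid, samples=3, interval_s=10):
    """Take several thread dumps of one JVM and summarise each by state."""
    summaries = []
    for i in range(samples):
        dump = subprocess.run(
            ["jstack", str(pid)],
            capture_output=True, text=True, check=True,
        ).stdout
        summaries.append(count_thread_states(dump))
        if i < samples - 1:
            time.sleep(interval_s)  # wait before taking the next dump
    return summaries
```

Calling `sample_jstack(12345)` (12345 being a placeholder PID) returns one state summary per dump. If the counts barely change between samples, the full dumps are worth reading to see what the stuck threads are waiting on.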

Reduce phase in Fetcher taking excessive time to finish.

2014-10-15 Thread Meraj A. Khan
Hi All, I am running into a situation where the reduce phase of the fetch job, with parsing enabled at fetch time, is taking an excessively long time. I have seen recommendations to filter the URLs based on length to avoid normalization-related delays. I am not filtering any URLs