Julien,
Do we need to consider any data loss (URLs) in this scenario?
no, why?
Thank you for confirming.
J.
On Thu, Oct 30, 2014 at 6:22 AM, Julien Nioche
lists.digitalpeb...@gmail.com wrote:
Hi Meraj
You can control the # of URLs per segment with this property:

<property>
  <name>generate.max.count</name>
  <value>-1</value>
  <description>The maximum number of urls in a single
  fetchlist. -1 if unlimited. The urls are counted according
  to the value of the parameter generator.count.mode.
  </description>
</property>
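For the hypothetical figures discussed in this thread, the override in nutch-site.xml would look something like the sketch below (the property names come from Nutch's nutch-default.xml; the values are just the example numbers from this conversation, not recommendations):

```xml
<!-- nutch-site.xml: cap each generated fetchlist at 10,000 URLs per host -->
<property>
  <name>generate.max.count</name>
  <value>10000</value>
</property>
<property>
  <name>generate.count.mode</name>
  <value>host</value>
</property>
```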
Thanks for the info Julien. For the hypothetical example below:
topN 200,000
generate.max.count = 10,000
generate.count.mode = host
If the number of hosts is 10, and we assume that each one of those hosts
has more than 10,000 unfetched URLs in the CrawlDB, since we have set
generate.max.count to
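A minimal sketch of the arithmetic implied by this example, assuming the generator simply truncates each host's contribution at the per-host cap:

```python
# Sketch: segment size under the example settings in this thread,
# assuming each host's URL list is cut off at generate.max.count.
top_n = 200_000              # requested segment size (topN)
generate_max_count = 10_000  # per-host cap (generate.count.mode = host)
hosts = 10                   # each assumed to hold > 10,000 unfetched URLs

# Each host contributes at most the cap, so the segment can hold at most:
segment_size = min(top_n, hosts * generate_max_count)
print(segment_size)  # -> 100000, well under the requested topN of 200,000
```

So with these numbers the per-host cap, not topN, would bound the segment.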
Julien,
On further analysis, I found that it was not a delay at reduce time but
a long-running fetch map task. When I have multiple fetch map tasks
running on a single segment, I see that one of the map tasks runs for an
excessively longer period of time than the other fetch map tasks; it
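One common cause of a single long-running fetch map task is host skew: the fetcher drains per-host queues with a politeness delay between requests, so the task holding the dominant host finishes last. A rough illustration (the URLs are fabricated, and the 5-second delay is an assumed value for fetcher.server.delay, not necessarily your configuration):

```python
from collections import Counter
from urllib.parse import urlparse

# Hypothetical fetchlist where one host dominates the segment.
urls = ["http://big-site.example/page%d" % i for i in range(1000)] \
     + ["http://small-%d.example/" % i for i in range(10)]

per_host = Counter(urlparse(u).netloc for u in urls)

# With an assumed politeness delay of 5.0s between requests to one host,
# a host's queue needs roughly count * delay seconds to drain.
delay = 5.0
est_seconds = {host: n * delay for host, n in per_host.items()}
print(max(est_seconds.values()))  # -> 5000.0: the dominant host drives runtime
```

The map task that happens to own big-site.example would run ~1000x longer than its siblings, which matches the symptom described above.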
Hi Meraj,
What do the logs for the map tasks tell you about the URLs being fetched?
J.
On 17 October 2014 19:08, Meraj A. Khan mera...@gmail.com wrote:
Julien,
Thanks for your suggestion. I looked at the jstack thread dumps and could
see that the fetcher threads are in a waiting state; in fact, the map
phase is not yet complete, judging by the JobClient console:
14/10/15 12:09:48 INFO mapreduce.Job: map 95% reduce 31%
14/10/16 07:11:20
Hi Meraj
You could call jstack on the Java process a couple of times to see what it
is busy doing; that will be a simple way of checking that this is indeed
the source of the problem.
See https://issues.apache.org/jira/browse/NUTCH-1314 for a possible
solution
J.
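When comparing several jstack dumps, tallying thread states makes the pattern easier to see. A minimal sketch that counts states in a dump; the sample text is a fabricated, heavily abbreviated jstack excerpt, not real output from this job:

```python
import re
from collections import Counter

# Fabricated, abbreviated jstack-style output for illustration only.
sample_dump = """\
"FetcherThread" #12 daemon prio=5 tid=0x1 nid=0x2 waiting on condition
   java.lang.Thread.State: WAITING (parking)
"FetcherThread" #13 daemon prio=5 tid=0x3 nid=0x4 runnable
   java.lang.Thread.State: RUNNABLE
"FetcherThread" #14 daemon prio=5 tid=0x5 nid=0x6 waiting on condition
   java.lang.Thread.State: WAITING (parking)
"""

# jstack prints a java.lang.Thread.State line per thread; tally them.
states = Counter(re.findall(r"java\.lang\.Thread\.State: (\w+)", sample_dump))
print(states)  # -> Counter({'WAITING': 2, 'RUNNABLE': 1})
```

If repeated dumps show most fetcher threads parked in WAITING while one queue stays busy, that points at a starved or skewed fetch rather than a reduce-side problem.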
On 16 October 2014 06:08, Meraj A. Khan wrote:
Hi All,
I am running into a situation where the reduce phase of the fetch job,
with parsing enabled at fetch time, is taking an excessively long amount
of time. I have seen recommendations to filter URLs based on length to
avoid normalization-related delays; I am not filtering any URLs
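The length-based filtering mentioned above boils down to a simple predicate. This sketch mimics the idea outside Nutch (inside Nutch it would be a urlfilter plugin or a regex-urlfilter.txt rule); the 512-character threshold is an assumed value for illustration, not a Nutch default:

```python
MAX_URL_LEN = 512  # assumed threshold, not a Nutch default

def keep_url(url: str) -> bool:
    """Drop abnormally long URLs before normalization to avoid
    pathological regex/normalizer behavior on huge inputs."""
    return len(url) <= MAX_URL_LEN

urls = ["http://example.com/ok", "http://example.com/" + "a" * 600]
print([u for u in urls if keep_url(u)])  # -> ['http://example.com/ok']
```

A handful of multi-kilobyte URLs (session IDs, redirect loops) can dominate normalization time, which is why this filter is commonly recommended.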