Re: Nutch 1.3 Fetching where does this happen?

Markus Jelsma Wed, 05 Oct 2011 05:33:20 -0700


On Wednesday 05 October 2011 14:26:01 webdev1977 wrote:
> Hello All!
> 
> When using Nutch 1.3 in fully distributed mode, where does the fetching
> occur? Does each node get a list of URLs to fetch?  What property in
> Hadoop/MapReduce, etc. decides how many URLs a node gets to fetch?


Check the numFetchers parameter of the generator. If you set it equal to the 
number of nodes, the entire fetch list is split into that many parts, so each 
node can take one part.
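
As a rough illustration of what splitting the fetch list means, here is a 
minimal Python sketch. It is not Nutch's actual code. It assumes URLs are 
grouped by host and hashed into numFetchers partitions, and the function name 
split_fetch_list and the example URLs are made up for this example.

```python
from collections import defaultdict
from urllib.parse import urlparse
import zlib


def split_fetch_list(urls, num_fetchers):
    """Group URLs into num_fetchers partitions, keeping each host together.

    Hypothetical helper for illustration only; the hashing scheme is an
    assumption, not the one Nutch uses.
    """
    partitions = defaultdict(list)
    for url in urls:
        host = urlparse(url).hostname or ""
        # crc32 is stable across runs, so a host always maps to the same part.
        index = zlib.crc32(host.encode("utf-8")) % num_fetchers
        partitions[index].append(url)
    # Return exactly num_fetchers lists, some of which may be empty.
    return [partitions[i] for i in range(num_fetchers)]


if __name__ == "__main__":
    urls = [
        "http://a.example/1",
        "http://a.example/2",
        "http://b.example/1",
        "http://c.example/1",
    ]
    for i, part in enumerate(split_fetch_list(urls, 3)):
        print(i, part)
```

With few hosts the parts can be very uneven, since every URL of a host lands 
in the same part under this assumption.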

> I am
> worried about memory on my nodes.  Some of the files in our enterprise are
> very, very large, like 800 MB PDF files.

I would be worried about that too, especially if multiple files are downloaded 
at the same time on the same node. Limit the number of fetcher threads and 
check the memory settings.
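
A back-of-the-envelope check, as a minimal Python sketch. It assumes the worst 
case where every fetcher thread on a node holds one complete document in 
memory at the same time. The function name and the thread counts are 
illustrative; only the 800 MB figure comes from this thread.

```python
def worst_case_buffered_mb(threads, max_doc_mb):
    """Upper bound on memory held by in-flight downloads on one node.

    Assumes each thread buffers one full document of max_doc_mb megabytes.
    """
    return threads * max_doc_mb


if __name__ == "__main__":
    # Compare a few thread counts against the 800 MB files mentioned above.
    for threads in (1, 2, 10):
        print(threads, "threads:", worst_case_buffered_mb(threads, 800), "MB")
```

If that bound is larger than the heap available to a fetch task, either lower 
the thread count or cap the maximum content size per document.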

> 
> I am able to run inject on my cluster, but then the generate step fails and
> I always lose one node from the cluster.

Can you give more details?

> 
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/Nutch-1-3-Fetching-where-does-this-happen-tp3396326p3396326.html
> Sent from the Nutch - User mailing list archive at Nabble.com.

-- 
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350
