On Wednesday 05 October 2011 14:26:01 webdev1977 wrote:
> Hello All!
>
> When using nutch 1.3 in fully distributed mode, where does the fetching
> occur? Does each node get a list of urls to fetch? What property in
> hadoop/mapreduce, etc decides how many urls that a node gets to fetch?

Check the numFetchers parameter of the generator. If you set it equal to the number of nodes, the entire fetch list is split into that many parts.

> I am
> worried about memory on my nodes. Some of the files in our enterprise are
> very, very large. Like 800mb pdf files.

I would be worried about that too, especially if multiple files are downloaded at the same time on the same node. Limit the number of fetcher threads and check the memory settings.

>
> I am able to run inject on my cluster, but then the generate step fails and
> I always lose one node from the cluster.

More details?

-- 
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350
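
To illustrate the numFetchers point, here is a rough sketch of how a fetch list could be split into a fixed number of parts while keeping all URLs of one host in the same part. It is not Nutch's actual code (Nutch is written in Java): the function name partition_by_host, the example URLs and the use of crc32 as the hash are all assumptions made for illustration only.

```python
from collections import defaultdict
from urllib.parse import urlparse
import zlib


def partition_by_host(urls, num_fetchers):
    """Split urls into num_fetchers lists, keeping each host in one list.

    Hypothetical helper for illustration; not part of Nutch.
    """
    parts = defaultdict(list)
    for url in urls:
        host = urlparse(url).hostname or ""
        # crc32 is only a stand-in hash that is stable across runs.
        index = zlib.crc32(host.encode("utf-8")) % num_fetchers
        parts[index].append(url)
    # Return one list per fetcher, including empty ones.
    return [parts[i] for i in range(num_fetchers)]


# Made-up URLs; with num_fetchers set to the number of nodes (3 here),
# the fetch list is split into 3 parts.
urls = [
    "http://a.example.com/1.pdf",
    "http://a.example.com/2.pdf",
    "http://b.example.com/index.html",
    "http://c.example.com/report.pdf",
]
fetch_lists = partition_by_host(urls, 3)
for i, part in enumerate(fetch_lists):
    print(i, part)
```

Note that a split like this balances hosts, not bytes: a part that happens to contain several very large files still ends up on a single node, which is why the thread count and memory settings matter.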

