Thanks Andrzej, I will make an effort to get it running on Hadoop, but I'd rather go for a fully distributed setup (although with only a single node for now) so I can add more machines later. Will the HadoopNutch tutorial on the wiki allow me to set up a cluster on a single node? Also, will it then still make use of multiple cores?
Cheers,

On Tuesday 28 September 2010 14:20:02 Andrzej Bialecki wrote:
> On 2010-09-28 14:02, Markus Jelsma wrote:
> > Hi,
> >
> > My test setup (only local) now has just over 20 million URLs, I fetched
> > 3m already and the rest needs to be fetched. It's now less time wasting
> > to fetch for 12 hours because merging now takes over 5.5 hours!
> >
> > I've searched but found little information so far. Would now be a good
> > time to try running Nutch on a Hadoop cluster (which I don't have) or try
> > to let Hadoop take advantage of my multiple cores?
>
> Even running Hadoop in pseudo-distributed mode (on a single node but
> with real JobTracker/TaskTracker) would be much better. The reason is
> that in local mode tasks are NOT executed in parallel but serially.
>

Markus Jelsma - Technisch Architect - Buyways BV
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350