Fwd: Optimizing Nutch 2.2.1

BlackIce Tue, 18 Mar 2014 05:06:22 -0700

Hi,

I'm using Nutch 2.2.1, HBase 0.90.6 in pseudo-distributed mode, Hadoop
1.2.1, Oracle Java 8, Intel i5 quad-core, 16 GB RAM.


Currently the fetch cycle is limited by my Internet connection.

The parse cycle uses an average of 10% per CPU core.

The updatedb cycle uses an average of 3% per CPU core.

Currently I'm only running HBase in pseudo-distributed mode, not Nutch.

As the DB grows, everything slows down significantly, but as you can see, CPU
resources are barely used. During updatedb, my web browsing creates higher
utilization spikes than the updatedb process itself. I feel that my hardware
is very underutilized and that adding more physical machines would be a
waste.

What are the bottlenecks? How can I optimize them? Should I run a cluster
on 3 virtual machines?

Thank you for any help you can give!


Ralf R. Kotowski
