Hi,

running Nutch 1.2 on a single machine, I produce a 2 GB index (157 segments, slice size 50,000).
Because the performance is low, I would like to test further crawls on Amazon EC2.
Q1: If I start with 4 nodes, should I divide the segments proportionally among the nodes and then start a new crawl?
Q2: Analysing the log files from the single-machine run, I found that the most time-consuming steps are indexing (13 hours), updating the CrawlDB (11 hours), and updating the LinkDB (7 hours 24 minutes), etc. With the 4-node setup, would the indexing and database-update times decrease proportionally?

Thanks
Patricio

