Hello,

After a few days of testing Nutch on Amazon EMR (1 master and 2 slaves) I
had to give up. It was extremely slow (an average fetching speed of 8
URLs/sec across both slaves), and together with the map-reduce overhead
the whole setup didn't satisfy me at all. I moved the Nutch crawl database
and segments to a single EC2 instance, and it now runs considerably
faster, peaking at 35 fetched pages/sec with an average of 25/sec. I know
that Nutch is designed to run in a Hadoop environment, and I regret that
it didn't work out in my case.

Anyway, I would like to know whether I'm alone with this approach or
whether everybody sets up Nutch with Hadoop. If some of you do run Nutch
on a single instance, perhaps you could share some best practices, e.g. do
you use the crawl script, or run generate/fetch/update continuously,
perhaps via cron jobs?
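For context, the kind of continuous loop I mean could be sketched roughly
like this; it's a minimal sketch assuming a local (non-Hadoop) Nutch 1.x
install, and the NUTCH, CRAWL_DB, and SEGMENTS_DIR paths are placeholders
for whatever layout you actually use:

```shell
#!/bin/sh
# Paths below are assumptions -- adjust to your own install/layout.
NUTCH=${NUTCH:-bin/nutch}
CRAWL_DB=${CRAWL_DB:-crawl/crawldb}
SEGMENTS_DIR=${SEGMENTS_DIR:-crawl/segments}

crawl_cycle() {
    # 1. Generate a new fetch list (-topN caps the batch size).
    "$NUTCH" generate "$CRAWL_DB" "$SEGMENTS_DIR" -topN 10000 || return 1
    # 2. Pick the newest segment that the generate step just created.
    segment=$(ls -d "$SEGMENTS_DIR"/* | tail -1)
    # 3. Fetch it.
    "$NUTCH" fetch "$segment" || return 1
    # 4. Parse it (not needed if fetcher.parse=true).
    "$NUTCH" parse "$segment" || return 1
    # 5. Fold the results back into the crawldb.
    "$NUTCH" updatedb "$CRAWL_DB" "$segment" || return 1
}
```

A cron entry could then invoke a wrapper that calls crawl_cycle in a loop
(or once per scheduled run), instead of using the all-in-one crawl script.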

Btw, I can see retry 0, retry 1, retry 2 and so on in the crawldb stats --
what exactly do they mean?

Regards,
Tomasz

Here are my current crawldb stats:
TOTAL urls:     16347942
retry 0:        16012503
retry 1:        134346
retry 2:        106037
retry 3:        95056
min score:      0.0
avg score:      0.04090025
max score:      331.052
status 1 (db_unfetched):        14045806
status 2 (db_fetched):  1769382
status 3 (db_gone):     160768
status 4 (db_redir_temp):       68104
status 5 (db_redir_perm):       151944
status 6 (db_notmodified):      151938
