Hi,
I'm using Nutch 0.9 to crawl about 400 hosts with an average of 600 pages each.
That makes a volume of about 240,000 fetched pages, and I want to get all of them.
Can anyone give me advice on the right threads/delay/per-host configuration
for this environment?
My current conf:
<property>
<name>fetcher.server.delay</name>
<value>1.0</value>
</property>
<property>
<name>fetcher.threads.fetch</name>
<value>90</value>
</property>
<property>
<name>fetcher.threads.per.host</name>
<value>45</value>
</property>
<property>
<name>fetcher.threads.per.host.by.ip</name>
<value>false</value>
</property>
The total runtime is about 5 hours.
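To put that in perspective, here is a rough back-of-the-envelope calculation from the figures above. It is only a sketch: the function name is made up for illustration, and it assumes all 90 fetcher threads are busy for the whole 5 hours, which is probably not quite true in practice.

```python
def crawl_throughput(hosts, pages_per_host, runtime_hours, threads):
    """Estimate overall fetch rate from the figures quoted in this post."""
    total_pages = hosts * pages_per_host
    runtime_seconds = runtime_hours * 3600
    # Average pages fetched per second across the whole crawl.
    pages_per_second = total_pages / runtime_seconds
    # Average wall-clock time one thread spends per page,
    # assuming every thread is busy for the entire run.
    seconds_per_page_per_thread = threads / pages_per_second
    return total_pages, pages_per_second, seconds_per_page_per_thread


total, rate, per_thread = crawl_throughput(
    hosts=400, pages_per_host=600, runtime_hours=5, threads=90
)
print(f"total pages: {total}")
print(f"overall rate: {rate:.1f} pages/s")
print(f"time per page per thread: {per_thread:.2f} s")
```

So the crawl averages roughly 13 pages per second overall, which works out to several seconds per page per thread if the threads were all kept busy.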
How can performance be improved? (I still have spare CPU and bandwidth.)
Note: This runs on a single machine; distribution to other machines is not
planned.
Thanks and regards,
Hannes