Hi,
I'm using Nutch 0.9 to crawl about 400 hosts with an average of 600 pages each.
That makes a volume of about 240,000 pages to fetch, and I want to get all of them.

Can anyone give me advice on the right threads/delay/per-host configuration
for this environment?

My current configuration:

<property>
        <name>fetcher.server.delay</name>
        <value>1.0</value>
</property>

<property>
        <name>fetcher.threads.fetch</name>
        <value>90</value>
</property>

<property>
        <name>fetcher.threads.per.host</name>
        <value>45</value>
</property>

<property>
        <name>fetcher.threads.per.host.by.ip</name>
        <value>false</value>
</property>

The total runtime is about 5 hours.
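
To put that runtime in perspective, here is a rough back-of-the-envelope calculation using only the figures from this post. It is a sketch, not a measurement: the function names are made up for illustration, and the per-host estimate assumes a simplified politeness model (one request at a time per host, each followed by fetcher.server.delay, download time ignored), which may not match how the Nutch fetcher actually schedules requests with the settings above.

```python
# Rough throughput arithmetic from the figures in this post.
HOSTS = 400
PAGES_PER_HOST = 600      # average, as stated above
RUNTIME_HOURS = 5         # observed total runtime
SERVER_DELAY_S = 1.0      # fetcher.server.delay


def total_pages(hosts, pages_per_host):
    """Total number of pages to fetch."""
    return hosts * pages_per_host


def pages_per_second(pages, hours):
    """Average fetch rate over the whole run."""
    return pages / (hours * 3600)


def per_host_floor_minutes(pages_per_host, delay_s):
    """Time one host would need if requests to it were strictly serial.

    ASSUMPTION: one request at a time per host, each followed by
    `delay_s` seconds of waiting; download time is ignored.
    """
    return pages_per_host * delay_s / 60


if __name__ == "__main__":
    pages = total_pages(HOSTS, PAGES_PER_HOST)
    print("pages to fetch:      ", pages)
    print("average pages/second:", round(pages_per_second(pages, RUNTIME_HOURS), 1))
    print("per-host floor (min):", per_host_floor_minutes(PAGES_PER_HOST, SERVER_DELAY_S))
```

Under those assumptions the delay alone would account for roughly ten minutes per host, so the 5 hours presumably come from somewhere else.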

How can performance be improved? (I still have spare CPU and bandwidth.)

Note: This runs on a single machine; distribution to other machines is not
planned.

Thanks and Regards

Hannes
