If you're hitting each host with 45 threads, you'd better be on really good terms with those webmasters :)

With 90 total threads, that means as few as 2 hosts are active at any time, yes?
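
A rough sketch of that arithmetic in Python, using the numbers from the quoted configuration below. The function name is hypothetical, and it assumes each busy host uses its full per-host thread allowance:

```python
import math

def min_active_hosts(total_threads: int, threads_per_host: int) -> int:
    """Fewest hosts that can occupy every fetcher thread at once."""
    return math.ceil(total_threads / threads_per_host)

# fetcher.threads.fetch = 90, fetcher.threads.per.host = 45
print(min_active_hosts(90, 45))  # 2

# With one thread per host, all 90 threads need 90 distinct hosts
print(min_active_hosts(90, 1))   # 90
```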

-- Ken


On Nov 18, 2010, at 3:51am, Hannes Carl Meyer wrote:

Hi,
I'm using Nutch 0.9 to crawl about 400 hosts with an average of 600 pages each. That makes a volume of 240,000 fetched pages, and I want to get all of them.

Can anyone give me advice on the right threads/delay/per-host configuration
for this environment?

My current conf:

<property>
       <name>fetcher.server.delay</name>
       <value>1.0</value>
</property>

<property>
       <name>fetcher.threads.fetch</name>
       <value>90</value>
</property>

<property>
       <name>fetcher.threads.per.host</name>
       <value>45</value>
</property>

<property>
       <name>fetcher.threads.per.host.by.ip</name>
       <value>false</value>
</property>

The total runtime is about 5 hours.
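
For reference, those figures work out to roughly 13 pages per second overall. A quick sketch of the arithmetic, assuming the averages above hold (the variable names are only illustrative):

```python
hosts = 400
pages_per_host = 600   # average, per the figures above
runtime_hours = 5      # approximate total runtime

total_pages = hosts * pages_per_host
pages_per_second = total_pages / (runtime_hours * 3600)

print(total_pages)                 # 240000
print(round(pages_per_second, 1))  # 13.3
```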

How can performance be improved? (I still have spare CPU and bandwidth.)

Note: This runs on a single machine; distribution to other machines is not
planned.

Thanks and Regards

Hannes

--------------------------
Ken Krugler
+1 530-210-6378
http://bixolabs.com
e l a s t i c   w e b   m i n i n g




