If you're hitting each host with 45 threads, you'd better be on really
good terms with those webmasters :)
With 90 total threads, that means as few as 2 hosts are active at any
time, yes?
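
To spell out that arithmetic, here is a rough Python sketch. The function
names are made up for illustration, and the inputs are just the figures
from your message (90 total threads, 45 per host, 400 hosts x 600 pages,
about 5 hours). It only does the division; it is not a model of how the
Nutch fetcher actually schedules its queues.

```python
def min_active_hosts(total_threads: int, threads_per_host: int) -> int:
    """Fewest hosts needed to keep every fetcher thread busy."""
    # Ceiling division: one host can occupy at most threads_per_host threads.
    return -(-total_threads // threads_per_host)


def pages_per_second(total_pages: int, runtime_hours: float) -> float:
    """Average fetch rate over the whole crawl."""
    return total_pages / (runtime_hours * 3600)


# Figures taken from the original message.
print(min_active_hosts(90, 45))                   # 2
print(round(pages_per_second(400 * 600, 5), 1))   # 13.3
```

So the crawl averages roughly 13 pages per second overall, and all 90
threads can be tied up by just 2 hosts.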
-- Ken
On Nov 18, 2010, at 3:51am, Hannes Carl Meyer wrote:
Hi,
I'm using Nutch 0.9 to crawl about 400 hosts with an average of 600
pages each.
That makes a volume of 240,000 fetched pages, and I want to get all of
them.
Can anyone give me advice on the right threads/delay/per-host
configuration for this environment?
My current conf:
<property>
  <name>fetcher.server.delay</name>
  <value>1.0</value>
</property>
<property>
  <name>fetcher.threads.fetch</name>
  <value>90</value>
</property>
<property>
  <name>fetcher.threads.per.host</name>
  <value>45</value>
</property>
<property>
  <name>fetcher.threads.per.host.by.ip</name>
  <value>false</value>
</property>
The total runtime is about 5 hours.
How can performance be improved? (I still have enough CPU and bandwidth.)
Note: This runs on a single machine; distribution to other machines
is not planned.
Thanks and Regards
Hannes
--------------------------
Ken Krugler
+1 530-210-6378
http://bixolabs.com
e l a s t i c w e b m i n i n g