Hi all,
I wanted to know how long a Nutch crawl should run.
I changed the configuration and ran it distributed - one master node and
three slaves - and it ran for about a day on 20k links (depth 15).
Is that normal, or should it take less?
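In case it matters, I launch the crawl more or less like this (from memory; "urls" and "crawl" are just placeholder names for my seed directory and crawl directory):

bin/nutch crawl urls -dir crawl -depth 15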
This is my configuration:
<property>
  <name>db.ignore.external.links</name>
  <value>true</value>
  <description>If true, outlinks leading from a page to external hosts
  will be ignored. This is an effective way to limit the crawl to include
  only initially injected hosts, without creating complex URLFilters.
  </description>
</property>
<property>
  <name>db.max.outlinks.per.page</name>
  <value>1000</value>
  <description>The maximum number of outlinks that we'll process for a page.
  If this value is nonnegative (>=0), at most db.max.outlinks.per.page
  outlinks will be processed for a page; otherwise, all outlinks will be
  processed.
  </description>
</property>
<property>
  <name>fetcher.threads.fetch</name>
  <value>100</value>
  <description>The number of FetcherThreads the fetcher should use.
  This also determines the maximum number of requests that are made
  at once (each FetcherThread handles one connection). The total
  number of threads running in distributed mode will be the number of
  fetcher threads * number of nodes, as the fetcher has one map task
  per node.
  </description>
</property>
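(If I read that description right, with my 3 slaves that means up to 100 * 3 = 300 fetch threads running at once across the cluster, assuming the fetcher map tasks run only on the slaves and not on the master.)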
<property>
  <name>fetcher.queue.depth.multiplier</name>
  <value>150</value>
  <description>(EXPERT) The fetcher buffers the incoming URLs into queues
  based on the [host|domain|IP] (see param fetcher.queue.mode). The depth
  of the queue is the number of threads times the value of this parameter.
  A large value requires more memory but can improve the performance of
  the fetch when the order of the URLs in the fetch list is not optimal.
  </description>
</property>
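(And if I calculate correctly, with fetcher.threads.fetch = 100 that gives a queue depth of 100 * 150 = 15,000 buffered URLs per fetch task.)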
<property>
  <name>fetcher.threads.per.queue</name>
  <value>10</value>
  <description>This number is the maximum number of threads that
  should be allowed to access a queue at one time. Setting it to
  a value > 1 will cause the Crawl-Delay value from robots.txt to
  be ignored and the value of fetcher.server.min.delay to be used
  as a delay between successive requests to the same server instead
  of fetcher.server.delay.
  </description>
</property>
<property>
  <name>fetcher.server.min.delay</name>
  <value>0.0</value>
  <description>The minimum number of seconds the fetcher will delay between
  successive requests to the same server. This value is applicable ONLY
  if fetcher.threads.per.queue is greater than 1 (i.e. the host blocking
  is turned off).
  </description>
</property>
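(As I understand the two descriptions above, because fetcher.threads.per.queue is 10, i.e. greater than 1, the Crawl-Delay from robots.txt is ignored, and with fetcher.server.min.delay = 0.0 there is effectively no delay between successive requests to the same server.)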
<property>
  <name>fetcher.max.crawl.delay</name>
  <value>5</value>
  <description>
  If the Crawl-Delay in robots.txt is set to greater than this value (in
  seconds) then the fetcher will skip this page, generating an error report.
  If set to -1 the fetcher will never skip such pages and will wait the
  amount of time retrieved from robots.txt Crawl-Delay, however long that
  might be.
  </description>
</property>