Hi all,
I wanted to know how long a Nutch crawl should run.
I changed the configuration and ran it distributed - one master node and
three slaves - and it ran for about a day on 20k links (depth 15).
Is that normal, or should it take less?
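In case it matters, I launch the crawl more or less like this (from memory; "urls" and "crawl" are just placeholder names for my seed directory and crawl directory):

bin/nutch crawl urls -dir crawl -depth 15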
This is my configuration:
<property>
  <name>db.ignore.external.links</name>
  <value>true</value>
  <description>If true, outlinks leading from a page to external hosts
  will be ignored. This is an effective way to limit the crawl to include
  only initially injected hosts, without creating complex URLFilters.
  </description>
</property>
<property>
  <name>db.max.outlinks.per.page</name>
  <value>1000</value>
  <description>The maximum number of outlinks that we'll process for a page.
  If this value is nonnegative (>=0), at most db.max.outlinks.per.page
  outlinks will be processed for a page; otherwise, all outlinks will be
  processed.
  </description>
</property>
<property>
  <name>fetcher.threads.fetch</name>
  <value>100</value>
  <description>The number of FetcherThreads the fetcher should use.
  This also determines the maximum number of requests that are made
  at once (each FetcherThread handles one connection). The total
  number of threads running in distributed mode will be the number of
  fetcher threads * number of nodes, as the fetcher has one map task
  per node.
  </description>
</property>
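(If I read that description right, with my 3 slaves that means up to 100 * 3 = 300 fetch threads running at once across the cluster, assuming the fetcher map tasks run only on the slaves and not on the master.)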
<property>
  <name>fetcher.queue.depth.multiplier</name>
  <value>150</value>
  <description>(EXPERT) The fetcher buffers the incoming URLs into queues
  based on the [host|domain|IP] (see param fetcher.queue.mode). The depth
  of the queue is the number of threads times the value of this parameter.
  A large value requires more memory but can improve the performance of
  the fetch when the order of the URLs in the fetch list is not optimal.
  </description>
</property>
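(And if I calculate correctly, with fetcher.threads.fetch = 100 that gives a queue depth of 100 * 150 = 15,000 buffered URLs per fetch task.)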
<property>
  <name>fetcher.threads.per.queue</name>
  <value>10</value>
  <description>This number is the maximum number of threads that
  should be allowed to access a queue at one time. Setting it to
  a value > 1 will cause the Crawl-Delay value from robots.txt to
  be ignored and the value of fetcher.server.min.delay to be used
  as a delay between successive requests to the same server instead
  of fetcher.server.delay.
  </description>
</property>
<property>
  <name>fetcher.server.min.delay</name>
  <value>0.0</value>
  <description>The minimum number of seconds the fetcher will delay between
  successive requests to the same server. This value is applicable ONLY
  if fetcher.threads.per.queue is greater than 1 (i.e. the host blocking
  is turned off).
  </description>
</property>
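(As I understand the two descriptions above, because fetcher.threads.per.queue is 10, i.e. greater than 1, the Crawl-Delay from robots.txt is ignored, and with fetcher.server.min.delay = 0.0 there is effectively no delay between successive requests to the same server.)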
<property>
  <name>fetcher.max.crawl.delay</name>
  <value>5</value>
  <description>
  If the Crawl-Delay in robots.txt is set to greater than this value (in
  seconds) then the fetcher will skip this page, generating an error report.
  If set to -1 the fetcher will never skip such pages and will wait the
  amount of time retrieved from robots.txt Crawl-Delay, however long that
  might be.
  </description>
</property>