Hi All,
I have 10 domains in the seed list, but *Nutch 1.7* consistently crawls only 5
of those domains and ignores the other 5. Can you please let me
know what is preventing it from crawling all the domains?
I am running this on *Hadoop 2.3.0* in cluster mode, with a *depth
of 10* given when submitting the job. I have already set the
*db.ignore.external.links* property to true, as I only intend to crawl the
domains in the seed list.
Some relevant properties that I have set are shown below; *please
advise*.
<property>
<name>fetcher.threads.per.queue</name>
<value>5</value>
<description>This number is the maximum number of threads that
should be allowed to access a queue at one time. Replaces
deprecated parameter 'fetcher.threads.per.host'.
</description>
</property>
<property>
<name>db.ignore.external.links</name>
<value>true</value>
<description>If true, outlinks leading from a page to external hosts
will be ignored. This is an effective way to limit the crawl to include
only initially injected hosts, without creating complex URLFilters.
</description>
</property>