Hi All,

I have 10 domains in the seed list, but *Nutch 1.7* consistently crawls only 5
of those domains and ignores the other 5. Can you please let me
know what is preventing it from crawling all the domains?

I am running this on *Hadoop 2.3.0* in cluster mode, with a *depth
of 10* when submitting the job. I have already set the
*db.ignore.external.links* property to true, as I only intend to crawl the
domains in the seed list.

Some relevant properties that I have set are shown below. *Please advise*.

    <property>
        <name>fetcher.threads.per.queue</name>
        <value>5</value>
        <description>This number is the maximum number of threads that
            should be allowed to access a queue at one time. Replaces
            deprecated parameter 'fetcher.threads.per.host'.
        </description>
    </property>

    <property>
        <name>db.ignore.external.links</name>
        <value>true</value>
        <description>If true, outlinks leading from a page to external hosts
            will be ignored. This is an effective way to limit the crawl to
            include only initially injected hosts, without creating complex
            URLFilters.
        </description>
    </property>
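
To make the db.ignore.external.links description above concrete, here is a minimal Python sketch of a host-based outlink filter. It is an illustration only, not Nutch's actual code: the function names host_of and keep_internal_outlinks are hypothetical, and it assumes that "external" means the outlink's host differs from the host of the page it was found on.

```python
from urllib.parse import urlsplit


def host_of(url):
    """Return the lowercased host of a URL, or '' if it has none."""
    return (urlsplit(url).hostname or "").lower()


def keep_internal_outlinks(page_url, outlinks):
    """Keep only outlinks whose host exactly matches the page's host.

    Assumption: an exact host comparison, so 'example.com' and
    'www.example.com' are treated as different hosts.
    """
    page_host = host_of(page_url)
    if not page_host:
        return []
    return [link for link in outlinks if host_of(link) == page_host]
```

Under that assumption, a page on example.com that links to www.example.com would have that outlink dropped, because the two host strings are not identical.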
