Any help, guys?
On Wed, Aug 20, 2014 at 12:13 PM, S.L <[email protected]> wrote:

> Thanks, the problem is that if I reduce the URLs in the seed list to any 5,
> all of them are crawled, which tells me it's not a URL-filtering issue; it
> just seems Nutch is not able to crawl more than 5 domains from the seed
> list. Is there a property I am setting by mistake that's causing this
> behavior?
>
> On Wed, Aug 20, 2014 at 11:38 AM, Bin Wang <[email protected]> wrote:
>
>> Hi S.L.,
>>
>> 1. Nutch follows a site's robots.txt file by default; maybe you can take
>> a look at the robots rules for the missing domains by going to
>> http://example.com/robots.txt?
>>
>> 2. Also, some URL filters will be applied; maybe you can paste the output
>> after you inject the seed.txt (nutch inject), so you can make sure all
>> the URLs passed the filtering process.
>>
>> Bin
>>
>> On Tue, Aug 19, 2014 at 11:03 PM, S.L <[email protected]> wrote:
>>
>>> Hi All,
>>>
>>> I have 10 domains in the seed list; *Nutch 1.7* consistently crawls only
>>> 5 of those domains and ignores the other 5. Can you please let me know
>>> what's preventing it from crawling all the domains?
>>>
>>> I am running this on *Hadoop 2.3.0* in cluster mode and giving a *depth
>>> of 10* when submitting the job. I have already set the
>>> *db.ignore.external.links* property to true, as I only intend to crawl
>>> the domains in the seed list.
>>>
>>> Some relevant properties that I have set are mentioned below; *please
>>> advise*.
>>>
>>> <property>
>>>   <name>fetcher.threads.per.queue</name>
>>>   <value>5</value>
>>>   <description>This number is the maximum number of threads that
>>>   should be allowed to access a queue at one time. Replaces
>>>   deprecated parameter 'fetcher.threads.per.host'.
>>>   </description>
>>> </property>
>>>
>>> <property>
>>>   <name>db.ignore.external.links</name>
>>>   <value>true</value>
>>>   <description>If true, outlinks leading from a page to external
>>>   hosts will be ignored. This is an effective way to limit the
>>>   crawl to include only initially injected hosts, without creating
>>>   complex URLFilters.
>>>   </description>
>>> </property>
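For what it's worth, Bin's two suggestions can be checked from the command line. This is a minimal sketch assuming a standard Nutch 1.x layout, with the seed list at urls/seed.txt and the crawldb at crawl/crawldb (both paths are placeholders; example.com stands in for one of the missing domains):

```shell
# 1. Inspect the robots rules for one of the domains that never gets crawled:
curl -s http://example.com/robots.txt

# 2. Run every seed URL through the configured URL-filter chain; in Nutch 1.x
#    the checker marks accepted URLs with a '+' prefix and rejected ones
#    with a '-' prefix:
bin/nutch org.apache.nutch.net.URLFilterChecker -allcombined < urls/seed.txt

# 3. Inject the seeds, then confirm how many URLs actually made it into
#    the crawldb:
bin/nutch inject crawl/crawldb urls
bin/nutch readdb crawl/crawldb -stats
```

If the `-stats` count already shows fewer than 10 distinct hosts right after injection, the URLs are being dropped at filter time rather than during the fetch cycle.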

