Any help, guys?
On Wed, Aug 20, 2014 at 12:13 PM, S.L <[email protected]> wrote:

> Thanks, the problem is that if I reduce the URLs in the seed list to any 5,
> all of them are crawled, which tells me it's not a URL-filtering issue; it
> just seems Nutch is not able to crawl more than 5 domains from the seed
> list. Is there a property I am setting by mistake that's causing this
> behavior?
>
> On Wed, Aug 20, 2014 at 11:38 AM, Bin Wang <[email protected]> wrote:
>
>> Hi S.L.,
>>
>> 1. Nutch follows a site's robots.txt file by default; maybe you can take
>> a look at the robots rules for the missing domains by going to
>> http://example.com/robots.txt?
>>
>> 2. Also, some URL filters will be applied; maybe you can paste the output
>> after you inject the seed.txt (nutch inject), so you can make sure all
>> the URLs passed the filtering process.
>>
>> Bin
>>
>> On Tue, Aug 19, 2014 at 11:03 PM, S.L <[email protected]> wrote:
>>
>>> Hi All,
>>>
>>> I have 10 domains in the seed list; *Nutch 1.7* consistently crawls only
>>> 5 of those domains and ignores the other 5. Can you please let me know
>>> what's preventing it from crawling all the domains?
>>>
>>> I am running this on *Hadoop 2.3.0* in cluster mode and giving a *depth
>>> of 10* when submitting the job. I have already set the
>>> *db.ignore.external.links* property to true, as I only intend to crawl
>>> the domains in the seed list.
>>>
>>> Some relevant properties that I have set are mentioned below; *please
>>> advise*.
>>>
>>> <property>
>>>   <name>fetcher.threads.per.queue</name>
>>>   <value>5</value>
>>>   <description>This number is the maximum number of threads that
>>>   should be allowed to access a queue at one time. Replaces
>>>   deprecated parameter 'fetcher.threads.per.host'.
>>>   </description>
>>> </property>
>>>
>>> <property>
>>>   <name>db.ignore.external.links</name>
>>>   <value>true</value>
>>>   <description>If true, outlinks leading from a page to external
>>>   hosts will be ignored. This is an effective way to limit the
>>>   crawl to include only initially injected hosts, without creating
>>>   complex URLFilters.
>>>   </description>
>>> </property>
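For what it's worth, Bin's two suggestions can be checked from the command line. This is a minimal sketch assuming a standard Nutch 1.x layout, with the seed list at urls/seed.txt and the crawldb at crawl/crawldb (both paths are placeholders; example.com stands in for one of the missing domains):

```shell
# 1. Inspect the robots rules for one of the domains that never gets crawled:
curl -s http://example.com/robots.txt

# 2. Run every seed URL through the configured URL-filter chain; in Nutch 1.x
#    the checker marks accepted URLs with a '+' prefix and rejected ones
#    with a '-' prefix:
bin/nutch org.apache.nutch.net.URLFilterChecker -allcombined < urls/seed.txt

# 3. Inject the seeds, then confirm how many URLs actually made it into
#    the crawldb:
bin/nutch inject crawl/crawldb urls
bin/nutch readdb crawl/crawldb -stats
```

If the `-stats` count already shows fewer than 10 distinct hosts right after injection, the URLs are being dropped at filter time rather than during the fetch cycle.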

