<name>generate.max.count</name>
<value>1</value>

I think this is the problem. Please increase it: since you crawl only one host and your count mode is set to host, each generate cycle will contain only one page for that host. Set it to -1 (unlimited) or a higher value.

On Wednesday 16 November 2011 14:54:43 Rafael Pappert wrote:
> Hello List,
>
> I am trying to set up a crawler for ~10K URLs and their subpages (just the
> internal ones). I set topN to 10000 (nutch crawl urls -dir crawl -depth 1
> -topN 10000 -threads 10) but the fetch job only fetches 2 *
> generate.max.count pages per run.
>
> The Hadoop map task list looks like this:
>
> task_201111161348_0005_m_000000 100.00%
> 0 threads, 0 queues, 0 URLs queued, 1 pages, 0 errors, 1.0 (1) pages/s,
> 524.0 (65536) kbits/s, 16-Nov-2011 13:53:16
> 16-Nov-2011 13:53:22 (6sec)
>
> task_201111161348_0005_m_000001 100.00%
> 0 threads, 0 queues, 0 URLs queued, 1 pages, 0 errors, 0.0 (1) pages/s,
> 92.0 (34552) kbits/s, 16-Nov-2011 13:53:19
> 16-Nov-2011 13:53:28 (9sec)
>
> task_201111161348_0005_m_000002 100.00%
> 0 threads, 0 queues, 0 URLs queued, 0 pages, 0 errors, 0.0 (0) pages/s,
> 0.0 (0) kbits/s, 16-Nov-2011 13:53:22
> 16-Nov-2011 13:53:28 (6sec)
>
> task_201111161348_0005_m_000003 100.00%
> 0 threads, 0 queues, 0 URLs queued, 0 pages, 0 errors, 0.0 (0) pages/s,
> 0.0 (0) kbits/s, 16-Nov-2011 13:53:25
> 16-Nov-2011 13:53:31 (6sec)
>
> task_201111161348_0005_m_000004 100.00%
> 0 threads, 0 queues, 0 URLs queued, 0 pages, 0 errors, 0.0 (0) pages/s,
> 0.0 (0) kbits/s, 16-Nov-2011 13:53:28
> 16-Nov-2011 13:53:34 (6sec)
>
> task_201111161348_0005_m_000005 100.00%
> 0 threads, 0 queues, 0 URLs queued, 0 pages, 0 errors, 0.0 (0) pages/s,
> 0.0 (0) kbits/s, 16-Nov-2011 13:53:31
> 16-Nov-2011 13:53:37 (6sec)
>
> task_201111161348_0005_m_000006 100.00%
> 0 threads, 0 queues, 0 URLs queued, 0 pages, 0 errors, 0.0 (0) pages/s,
> 0.0 (0) kbits/s, 16-Nov-2011 13:53:34
> 16-Nov-2011 13:53:40 (6sec)
>
> task_201111161348_0005_m_000007 100.00%
> 0 threads, 0 queues, 0 URLs queued, 0 pages, 0 errors, 0.0 (0) pages/s,
> 0.0 (0) kbits/s, 16-Nov-2011 13:53:37
> 16-Nov-2011 13:53:43 (6sec)
>
> But "readdb -stats" looks like this, after a few runs:
>
> 11/11/16 14:45:13 INFO crawl.CrawlDbReader: status 1 (db_unfetched): 14653
> 11/11/16 14:45:13 INFO crawl.CrawlDbReader: status 2 (db_fetched): 17
>
> Server:
>
> One node running Hadoop 0.20.203.0 (later I will add more nodes) and
> Nutch 1.4.
>
> Config files:
>
> // nutch-site.xml
>
> <property>
>   <name>http.accept.language</name>
>   <value>de-de,de,en-us,en-gb,en;q=0.7,*;q=0.3</value>
>   <description>Value of the "Accept-Language" request header field.
>   This allows selecting non-English language as default one to retrieve.
>   It is a useful setting for search engines built for a certain national
>   group.</description>
> </property>
>
> <property>
>   <name>plugin.folders</name>
>   <value>plugins</value>
> </property>
>
> <property>
>   <name>plugin.includes</name>
>   <value>linkbutler|language-identifier|protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor|more)|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
>   <description>Regular expression naming plugin directory
>   names to include. Any plugin not matching this expression is excluded.
>   In any case you need at least include the nutch-extensionpoints plugin.
>   By default Nutch includes crawling just HTML and plain text via HTTP,
>   and basic indexing and search plugins.</description>
> </property>
>
> <property>
>   <name>db.ignore.external.links</name>
>   <value>true</value>
>   <description></description>
> </property>
>
> <property>
>   <name>generate.count.mode</name>
>   <value>host</value>
>   <description>Determines how the URLs are counted for generator.max.count.
>   Default value is 'host' but can be 'domain'. Note that we do not count
>   per IP in the new version of the Generator.</description>
> </property>
>
> <property>
>   <name>generate.max.count</name>
>   <value>1</value>
>   <description>The maximum number of urls in a single
>   fetchlist. -1 if unlimited. The urls are counted according
>   to the value of the parameter generator.count.mode.</description>
> </property>
>
> // mapred-site.xml
>
> <property>
>   <name>mapred.tasktracker.map.tasks.maximum</name>
>   <value>8</value>
>   <description>The maximum number of map tasks that will be run
>   simultaneously by a task tracker.</description>
> </property>
>
> <property>
>   <name>mapred.tasktracker.reduce.tasks.maximum</name>
>   <value>8</value>
>   <description>The maximum number of reduce tasks that will be run
>   simultaneously by a task tracker.</description>
> </property>
>
> <property>
>   <name>mapred.map.tasks</name>
>   <value>8</value>
>   <description></description>
> </property>
>
> <property>
>   <name>mapred.reduce.tasks</name>
>   <value>8</value>
>   <description></description>
> </property>
>
> What's wrong with my configuration? Please correct me if I'm wrong, but I
> guess topN = 10k and depth = 1 means Nutch should fetch 10k pages in one
> run?
>
> Thanks in advance,
> Rafael

--
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350
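As a nutch-site.xml fragment, the suggested change would look something like this (the value 1000 is purely illustrative; -1 removes the per-host cap entirely):

```xml
<!-- Illustrative values: allow up to 1000 URLs per host in each
     fetchlist, or set -1 for no limit at all. -->
<property>
  <name>generate.max.count</name>
  <value>1000</value>
</property>

<property>
  <name>generate.count.mode</name>
  <value>host</value>
</property>
```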
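To see why generate.max.count=1 with count mode 'host' starves the fetcher no matter how large topN is, here is a small sketch of the generator's per-host counting. This is plain illustrative Python, not Nutch code; the function name and simplified logic are my own:

```python
from collections import defaultdict
from urllib.parse import urlparse

def generate_fetchlist(urls, top_n, max_count):
    """Illustrative sketch of a generator with count mode 'host':
    emit at most max_count URLs per host (unlimited if max_count == -1),
    and at most top_n URLs overall per cycle."""
    per_host = defaultdict(int)
    fetchlist = []
    for url in urls:
        if len(fetchlist) >= top_n:
            break  # topN reached for this cycle
        host = urlparse(url).netloc
        if max_count != -1 and per_host[host] >= max_count:
            continue  # host quota exhausted; URL waits for a later cycle
        per_host[host] += 1
        fetchlist.append(url)
    return fetchlist

# One host, many unfetched URLs: max_count=1 yields a single page per
# cycle, regardless of topN; -1 lets topN take effect.
urls = [f"http://example.com/page{i}" for i in range(10000)]
print(len(generate_fetchlist(urls, top_n=10000, max_count=1)))   # 1
print(len(generate_fetchlist(urls, top_n=10000, max_count=-1)))  # 10000
```

This matches the symptom above: with ~15K unfetched URLs on one host, each generate cycle still hands the fetcher only a handful of pages.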

