Hi, I changed generate.max.count to -1, but the result is nearly the same: the fetch job now fetches 500 URLs, but there are still 6 map tasks with 0 queues.
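For reference, the two generator properties now look like this (a minimal sketch of the relevant nutch-site.xml entries; names and semantics are the ones from the config quoted below, only the value of generate.max.count changed):

<property>
  <name>generate.count.mode</name>
  <value>host</value>
  <!-- unchanged: URLs are counted per host when building fetchlists -->
</property>

<property>
  <name>generate.max.count</name>
  <value>-1</value>
  <!-- was 1; per the description quoted below, -1 means unlimited URLs per host -->
</property>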
The map task list of the new run looks like this:

task_201111161504_0005_m_000002 100.00%
0 threads, 0 queues, 0 URLs queued, 1 pages, 0 errors, 0.0 (1) pages/s, 92.0 (34540) kbits/s
16-Nov-2011 15:08:24 16-Nov-2011 15:08:33 (9sec)

task_201111161504_0005_m_000003 100.00%
0 threads, 0 queues, 0 URLs queued, 0 pages, 0 errors, 0.0 (0) pages/s, 0.0 (0) kbits/s
16-Nov-2011 15:08:27 16-Nov-2011 15:08:33 (6sec)

task_201111161504_0005_m_000004 100.00%
0 threads, 0 queues, 0 URLs queued, 0 pages, 0 errors, 0.0 (0) pages/s, 0.0 (0) kbits/s
16-Nov-2011 15:08:30 16-Nov-2011 15:08:36 (6sec)

task_201111161504_0005_m_000005 100.00%
0 threads, 0 queues, 0 URLs queued, 0 pages, 0 errors, 0.0 (0) pages/s, 0.0 (0) kbits/s
16-Nov-2011 15:08:33 16-Nov-2011 15:08:39 (6sec)

task_201111161504_0005_m_000006 100.00%
0 threads, 0 queues, 0 URLs queued, 0 pages, 0 errors, 0.0 (0) pages/s, 0.0 (0) kbits/s
16-Nov-2011 15:08:36 16-Nov-2011 15:08:42 (6sec)

task_201111161504_0005_m_000007 100.00%
0 threads, 0 queues, 0 URLs queued, 0 pages, 0 errors, 0.0 (0) pages/s, 0.0 (0) kbits/s
16-Nov-2011 15:08:39 16-Nov-2011 15:08:45 (6sec)

task_201111161504_0005_m_000000 42.83%
10 threads, 1 queues, 500 URLs queued, 38 pages, 0 errors, 0.0 (0) pages/s, 96.0 (0) kbits/s
16-Nov-2011 15:08:18

task_201111161504_0005_m_000001 98.43%
10 threads, 1 queues, 500 URLs queued, 32 pages, 0 errors, 0.0 (0) pages/s, 76.0 (0) kbits/s
16-Nov-2011 15:08:21

Why does the generator create only 2 fetch lists, and why does it pick the same 2 hosts again and again? After 20 runs I now have 2000 pages fetched, but only from 2 different hosts.
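In case it helps to reproduce this, here is roughly one crawl iteration spelled out as single steps instead of the all-in-one crawl command (a sketch only; the segment name is an example, in practice it is the timestamped directory that the generate step creates under crawl/segments):

# inject the seed list into the crawldb (only needed once)
bin/nutch inject crawl/crawldb urls

# build a fetchlist: -topN caps the total number of URLs,
# generate.max.count (counted per generate.count.mode) caps them per host
bin/nutch generate crawl/crawldb crawl/segments -topN 10000

# fetch and parse the new segment, then write the results back
# into the crawldb (replace the example segment name with the newest one)
bin/nutch fetch crawl/segments/20111116150824 -threads 10
bin/nutch parse crawl/segments/20111116150824
bin/nutch updatedb crawl/crawldb crawl/segments/20111116150824

# compare db_fetched vs. db_unfetched
bin/nutch readdb crawl/crawldb -stats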
best regards,
rafael.

On 16/Nov/2011, at 15:01, Markus Jelsma wrote:

> <name>generate.max.count</name>
> <value>1</value>
>
> I think this is the problem. Please increase it: as you crawl only one
> host, each generate cycle will contain only 1 page for this host, since
> your mode is set to host.
>
> Set it to -1 or a higher value.
>
> On Wednesday 16 November 2011 14:54:43 Rafael Pappert wrote:
>> Hello list,
>>
>> I am trying to set up a crawler for ~10K URLs and their subpages (just
>> the internal ones). I set topN to 10000 (nutch crawl urls -dir crawl
>> -depth 1 -topN 10000 -threads 10), but the fetch job only fetches
>> 2 * generate.max.count pages per run.
>>
>> The hadoop map task list looks like this:
>>
>> task_201111161348_0005_m_000000 100.00%
>> 0 threads, 0 queues, 0 URLs queued, 1 pages, 0 errors, 1.0 (1) pages/s, 524.0 (65536) kbits/s
>> 16-Nov-2011 13:53:16 16-Nov-2011 13:53:22 (6sec)
>>
>> task_201111161348_0005_m_000001 100.00%
>> 0 threads, 0 queues, 0 URLs queued, 1 pages, 0 errors, 0.0 (1) pages/s, 92.0 (34552) kbits/s
>> 16-Nov-2011 13:53:19 16-Nov-2011 13:53:28 (9sec)
>>
>> task_201111161348_0005_m_000002 100.00%
>> 0 threads, 0 queues, 0 URLs queued, 0 pages, 0 errors, 0.0 (0) pages/s, 0.0 (0) kbits/s
>> 16-Nov-2011 13:53:22 16-Nov-2011 13:53:28 (6sec)
>>
>> task_201111161348_0005_m_000003 100.00%
>> 0 threads, 0 queues, 0 URLs queued, 0 pages, 0 errors, 0.0 (0) pages/s, 0.0 (0) kbits/s
>> 16-Nov-2011 13:53:25 16-Nov-2011 13:53:31 (6sec)
>>
>> task_201111161348_0005_m_000004 100.00%
>> 0 threads, 0 queues, 0 URLs queued, 0 pages, 0 errors, 0.0 (0) pages/s, 0.0 (0) kbits/s
>> 16-Nov-2011 13:53:28 16-Nov-2011 13:53:34 (6sec)
>>
>> task_201111161348_0005_m_000005 100.00%
>> 0 threads, 0 queues, 0 URLs queued, 0 pages, 0 errors, 0.0 (0) pages/s, 0.0 (0) kbits/s
>> 16-Nov-2011 13:53:31 16-Nov-2011 13:53:37 (6sec)
>>
>> task_201111161348_0005_m_000006 100.00%
>> 0 threads, 0 queues, 0 URLs queued, 0 pages, 0 errors, 0.0 (0) pages/s, 0.0 (0) kbits/s
>> 16-Nov-2011 13:53:34 16-Nov-2011 13:53:40 (6sec)
>>
>> task_201111161348_0005_m_000007 100.00%
>> 0 threads, 0 queues, 0 URLs queued, 0 pages, 0 errors, 0.0 (0) pages/s, 0.0 (0) kbits/s
>> 16-Nov-2011 13:53:37 16-Nov-2011 13:53:43 (6sec)
>>
>> But "readdb -stats" looks like this after a few runs:
>>
>> 11/11/16 14:45:13 INFO crawl.CrawlDbReader: status 1 (db_unfetched): 14653
>> 11/11/16 14:45:13 INFO crawl.CrawlDbReader: status 2 (db_fetched): 17
>>
>> Server:
>>
>> One node running Hadoop 0.20.203.0 (later I will add more nodes) and
>> Nutch 1.4.
>>
>> Config files:
>>
>> // nutch-site.xml
>>
>> <property>
>>   <name>http.accept.language</name>
>>   <value>de-de,de,en-us,en-gb,en;q=0.7,*;q=0.3</value>
>>   <description>Value of the "Accept-Language" request header field.
>>   This allows selecting a non-English language as the default one to
>>   retrieve. It is a useful setting for search engines built for a
>>   certain national group.</description>
>> </property>
>>
>> <property>
>>   <name>plugin.folders</name>
>>   <value>plugins</value>
>> </property>
>>
>> <property>
>>   <name>plugin.includes</name>
>>   <value>linkbutler|language-identifier|protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor|more)|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
>>   <description>Regular expression naming plugin directory names to
>>   include. Any plugin not matching this expression is excluded. In any
>>   case you need at least the nutch-extensionpoints plugin. By default
>>   Nutch includes crawling just HTML and plain text via HTTP, and basic
>>   indexing and search plugins.</description>
>> </property>
>>
>> <property>
>>   <name>db.ignore.external.links</name>
>>   <value>true</value>
>> </property>
>>
>> <property>
>>   <name>generate.count.mode</name>
>>   <value>host</value>
>>   <description>Determines how the URLs are counted for generator.max.count.
>>   Default value is 'host' but can be 'domain'. Note that we do not count
>>   per IP in the new version of the Generator.</description>
>> </property>
>> <property>
>>   <name>generate.max.count</name>
>>   <value>1</value>
>>   <description>The maximum number of urls in a single fetchlist.
>>   -1 if unlimited. The urls are counted according to the value of the
>>   parameter generator.count.mode.</description>
>> </property>
>>
>> // mapred-site.xml
>>
>> <property>
>>   <name>mapred.tasktracker.map.tasks.maximum</name>
>>   <value>8</value>
>>   <description>The maximum number of map tasks that will be run
>>   simultaneously by a task tracker.</description>
>> </property>
>>
>> <property>
>>   <name>mapred.tasktracker.reduce.tasks.maximum</name>
>>   <value>8</value>
>>   <description>The maximum number of reduce tasks that will be run
>>   simultaneously by a task tracker.</description>
>> </property>
>>
>> <property>
>>   <name>mapred.map.tasks</name>
>>   <value>8</value>
>> </property>
>>
>> <property>
>>   <name>mapred.reduce.tasks</name>
>>   <value>8</value>
>> </property>
>>
>> What's wrong with my configuration? Please correct me if I'm wrong, but
>> I guess topN = 10k and depth = 1 means Nutch should fetch 10k pages in
>> one run?
>>
>> Thanks in advance,
>> Rafael
>
> --
> Markus Jelsma - CTO - Openindex
> http://www.linkedin.com/in/markus17
> 050-8536620 / 06-50258350