Hello List,

I am trying to set up a crawler for ~10K URLs and their subpages (internal
links only).
I set topN to 10000 (nutch crawl urls -dir crawl -depth 1 -topN 10000 -threads
10),
but the fetch job only fetches 2 * generate.max.count pages per run.
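To narrow it down, I also ran the same cycle step by step instead of using the crawl wrapper (the segment path below is just an example; this is my understanding of what the wrapper does internally):

```shell
# Run one generate/fetch/update cycle by hand to see how large
# the generated fetchlist actually is.
bin/nutch generate crawl/crawldb crawl/segments -topN 10000
# Pick the newest segment (example; adjust to your layout)
SEGMENT=crawl/segments/$(ls crawl/segments | tail -1)
bin/nutch fetch $SEGMENT -threads 10
bin/nutch updatedb crawl/crawldb $SEGMENT
```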

The Hadoop map task list looks like this:

task_201111161348_0005_m_000000 100.00%
0 threads, 0 queues, 0 URLs queued, 1 pages, 0 errors, 1.0 (1) pages/s, 524.0 
(65536) kbits/s, 
16-Nov-2011 13:53:16
16-Nov-2011 13:53:22 (6sec)

task_201111161348_0005_m_000001 100.00%
0 threads, 0 queues, 0 URLs queued, 1 pages, 0 errors, 0.0 (1) pages/s, 92.0 
(34552) kbits/s, 
16-Nov-2011 13:53:19
16-Nov-2011 13:53:28 (9sec)

task_201111161348_0005_m_000002 100.00%
0 threads, 0 queues, 0 URLs queued, 0 pages, 0 errors, 0.0 (0) pages/s, 0.0 (0) 
kbits/s, 
16-Nov-2011 13:53:22
16-Nov-2011 13:53:28 (6sec)

task_201111161348_0005_m_000003 100.00%
0 threads, 0 queues, 0 URLs queued, 0 pages, 0 errors, 0.0 (0) pages/s, 0.0 (0) 
kbits/s, 
16-Nov-2011 13:53:25
16-Nov-2011 13:53:31 (6sec)

task_201111161348_0005_m_000004 100.00%
0 threads, 0 queues, 0 URLs queued, 0 pages, 0 errors, 0.0 (0) pages/s, 0.0 (0) 
kbits/s, 
16-Nov-2011 13:53:28
16-Nov-2011 13:53:34 (6sec)

task_201111161348_0005_m_000005 100.00%
0 threads, 0 queues, 0 URLs queued, 0 pages, 0 errors, 0.0 (0) pages/s, 0.0 (0) 
kbits/s, 
16-Nov-2011 13:53:31
16-Nov-2011 13:53:37 (6sec)

task_201111161348_0005_m_000006 100.00%
0 threads, 0 queues, 0 URLs queued, 0 pages, 0 errors, 0.0 (0) pages/s, 0.0 (0) 
kbits/s, 
16-Nov-2011 13:53:34
16-Nov-2011 13:53:40 (6sec)

task_201111161348_0005_m_000007 100.00%
0 threads, 0 queues, 0 URLs queued, 0 pages, 0 errors, 0.0 (0) pages/s, 0.0 (0) 
kbits/s, 
16-Nov-2011 13:53:37
16-Nov-2011 13:53:43 (6sec)

But "readdb -stats" looks like this after a few runs:

11/11/16 14:45:13 INFO crawl.CrawlDbReader: status 1 (db_unfetched):    14653
11/11/16 14:45:13 INFO crawl.CrawlDbReader: status 2 (db_fetched):      17

Server:

One node running Hadoop 0.20.203.0 (later I will add more nodes) and Nutch 1.4.

Config files:

// nutch-site.xml

<property>
  <name>http.accept.language</name>
  <value>de-de,de,en-us,en-gb,en;q=0.7,*;q=0.3</value>
  <description>Value of the "Accept-Language" request header field.
  This allows selecting a non-English language as the default one to retrieve.
  It is a useful setting for search engines built for a certain national group.
  </description>
</property>

<property>
  <name>plugin.folders</name>
  <value>plugins</value>
</property>

<property>
  <name>plugin.includes</name>
  
<value>linkbutler|language-identifier|protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor|more)|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
  <description>Regular expression naming plugin directory names to
  include.  Any plugin not matching this expression is excluded.
  In any case you need to at least include the nutch-extensionpoints plugin. By
  default Nutch includes crawling just HTML and plain text via HTTP,
  and basic indexing and search plugins.
  </description>
</property>

<property>
  <name>db.ignore.external.links</name>
  <value>true</value>
  <description></description>
</property>

<property>
  <name>generate.count.mode</name>
  <value>host</value>
  <description>Determines how the URLs are counted for generate.max.count.
  Default value is 'host' but can be 'domain'. Note that we do not count
  per IP in the new version of the Generator.
  </description>
</property>

<property>
  <name>generate.max.count</name>
  <value>1</value>
  <description>The maximum number of URLs in a single
  fetchlist. -1 if unlimited. The URLs are counted according
  to the value of the parameter generate.count.mode.
  </description>
</property>
// mapred-site.xml

<property>
  <name>mapred.tasktracker.map.tasks.maximum</name>
  <value>8</value>
  <description>The maximum number of map tasks that will be run simultaneously 
by a task tracker. </description>
</property>

<property>
  <name>mapred.tasktracker.reduce.tasks.maximum</name>
  <value>8</value>
  <description>The maximum number of reduce tasks that will be run 
simultaneously by a task tracker. </description>
</property>

<property>
  <name>mapred.map.tasks</name>
  <value>8</value>
  <description></description>
</property>

<property>
  <name>mapred.reduce.tasks</name>
  <value>8</value>
  <description></description>
</property>

What's wrong with my configuration? Please correct me if I'm wrong, but I
assume topN = 10k and depth = 1 means Nutch should fetch 10k pages in one run?

Thanks in advance,
Rafael
