<name>generate.max.count</name>
  <value>1</value>

I think this is the problem. Please increase it: since your count mode is set
to host, each generate cycle will contain only 1 page per host.

Set it to -1 (unlimited) or a higher value.
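
For example, in nutch-site.xml (just a sketch; -1 means unlimited per the
property's own description, or pick whatever per-host limit suits your
politeness requirements):

<property>
  <name>generate.max.count</name>
  <value>-1</value>
  <description>Maximum number of URLs per host in a single fetchlist;
  -1 means unlimited.</description>
</property>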


On Wednesday 16 November 2011 14:54:43 Rafael Pappert wrote:
> Hello List,
> 
> I'm trying to set up a crawler for ~10K URLs and their subpages (just the
> internal ones). I set topN to 10000 (nutch crawl urls -dir crawl -depth 1
> -topN 10000 -threads 10) but the fetch job only fetches 2 *
> generate.max.count pages per run.
> 
> The hadoop map task list looks like that:
> 
> task_201111161348_0005_m_000000       100.00%
> 0 threads, 0 queues, 0 URLs queued, 1 pages, 0 errors, 1.0 (1) pages/s,
> 524.0 (65536) kbits/s, 16-Nov-2011 13:53:16
> 16-Nov-2011 13:53:22 (6sec)
> 
> task_201111161348_0005_m_000001       100.00%
> 0 threads, 0 queues, 0 URLs queued, 1 pages, 0 errors, 0.0 (1) pages/s,
> 92.0 (34552) kbits/s, 16-Nov-2011 13:53:19
> 16-Nov-2011 13:53:28 (9sec)
> 
> task_201111161348_0005_m_000002       100.00%
> 0 threads, 0 queues, 0 URLs queued, 0 pages, 0 errors, 0.0 (0) pages/s, 0.0
> (0) kbits/s, 16-Nov-2011 13:53:22
> 16-Nov-2011 13:53:28 (6sec)
> 
> task_201111161348_0005_m_000003       100.00%
> 0 threads, 0 queues, 0 URLs queued, 0 pages, 0 errors, 0.0 (0) pages/s, 0.0
> (0) kbits/s, 16-Nov-2011 13:53:25
> 16-Nov-2011 13:53:31 (6sec)
> 
> task_201111161348_0005_m_000004       100.00%
> 0 threads, 0 queues, 0 URLs queued, 0 pages, 0 errors, 0.0 (0) pages/s, 0.0
> (0) kbits/s, 16-Nov-2011 13:53:28
> 16-Nov-2011 13:53:34 (6sec)
> 
> task_201111161348_0005_m_000005       100.00%
> 0 threads, 0 queues, 0 URLs queued, 0 pages, 0 errors, 0.0 (0) pages/s, 0.0
> (0) kbits/s, 16-Nov-2011 13:53:31
> 16-Nov-2011 13:53:37 (6sec)
> 
> task_201111161348_0005_m_000006       100.00%
> 0 threads, 0 queues, 0 URLs queued, 0 pages, 0 errors, 0.0 (0) pages/s, 0.0
> (0) kbits/s, 16-Nov-2011 13:53:34
> 16-Nov-2011 13:53:40 (6sec)
> 
> task_201111161348_0005_m_000007       100.00%
> 0 threads, 0 queues, 0 URLs queued, 0 pages, 0 errors, 0.0 (0) pages/s, 0.0
> (0) kbits/s, 16-Nov-2011 13:53:37
> 16-Nov-2011 13:53:43 (6sec)
> 
> But "readdb -stats" is like that, after a few runs:
> 
> 11/11/16 14:45:13 INFO crawl.CrawlDbReader: status 1 (db_unfetched):  14653
> 11/11/16 14:45:13 INFO crawl.CrawlDbReader: status 2 (db_fetched):    17
> 
> Server:
> 
> One node running Hadoop 0.20.203.0 (later I will add more nodes) and Nutch
> 1.4.
> 
> Config files:
> 
> // nutch-site.xml
> 
> <property>
>   <name>http.accept.language</name>
>   <value>de-de,de,en-us,en-gb,en;q=0.7,*;q=0.3</value>
>   <description>Value of the "Accept-Language" request header field.
>   This allows selecting a non-English language as the default one to
>   retrieve. It is a useful setting for search engines built for a certain
>   national group.</description>
> </property>
> 
> <property>
>   <name>plugin.folders</name>
>   <value>plugins</value>
> </property>
> 
> <property>
>   <name>plugin.includes</name>
>   <value>linkbutler|language-identifier|protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor|more)|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
>   <description>Regular expression naming plugin directory names to include.
>   Any plugin not matching this expression is excluded. In any case you need
>   to at least include the nutch-extensionpoints plugin. By default Nutch
>   includes crawling just HTML and plain text via HTTP, and basic indexing
>   and search plugins.
>   </description>
> </property>
> 
> <property>
>   <name>db.ignore.external.links</name>
>   <value>true</value>
>   <description></description>
> </property>
> 
> <property>
>   <name>generate.count.mode</name>
>   <value>host</value>
>   <description>Determines how the URLs are counted for generate.max.count.
>   Default value is 'host' but can be 'domain'. Note that we do not count
>   per IP in the new version of the Generator.
>   </description>
> </property>
> 
> <property>
>   <name>generate.max.count</name>
>   <value>1</value>
>   <description>The maximum number of URLs in a single
>   fetchlist. -1 if unlimited. The URLs are counted according
>   to the value of the parameter generate.count.mode.
>   </description>
> </property>
> 
> // mapred-site.xml
> 
> <property>
>   <name>mapred.tasktracker.map.tasks.maximum</name>
>   <value>8</value>
>   <description>The maximum number of map tasks that will be run
>   simultaneously by a task tracker.</description>
> </property>
> 
> <property>
>   <name>mapred.tasktracker.reduce.tasks.maximum</name>
>   <value>8</value>
>   <description>The maximum number of reduce tasks that will be run
>   simultaneously by a task tracker.</description>
> </property>
> 
> <property>
>   <name>mapred.map.tasks</name>
>   <value>8</value>
>   <description></description>
> </property>
> 
> <property>
>   <name>mapred.reduce.tasks</name>
>   <value>8</value>
>   <description></description>
> </property>
> 
> What's wrong with my configuration? Please correct me if I'm wrong, but I
> guess topN = 10k and depth = 1 means Nutch should fetch 10k pages in one
> run?
> 
> Thanks in advance,
> Rafael
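
On the topN question: topN is only an upper bound on the fetchlist size. The
per-host cap is applied first, so a cycle's fetchlist holds at most (number of
hosts selected in that cycle) * generate.max.count URLs. With max.count = 1
and apparently two hosts per cycle, that works out to 2 * 1 = 2 pages per run,
exactly what you are seeing, no matter how large topN is.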

-- 
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350
