hi,

I changed generate.max.count to -1, but the result is nearly the same.
The fetch task now fetches 500 URLs, yet there are still 6 map tasks
with 0 queues.
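
For completeness, this is how the override looks now (the rest of nutch-site.xml is unchanged from what I posted below):

<property>
 <name>generate.max.count</name>
 <value>-1</value>
 <description>The maximum number of URLs in a single
 fetchlist. -1 if unlimited.</description>
</property>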

task_201111161504_0005_m_000002 100.00%
0 threads, 0 queues, 0 URLs queued, 1 pages, 0 errors, 0.0 (1) pages/s, 92.0 (34540) kbits/s,
16-Nov-2011 15:08:24
16-Nov-2011 15:08:33 (9sec)

task_201111161504_0005_m_000003 100.00%
0 threads, 0 queues, 0 URLs queued, 0 pages, 0 errors, 0.0 (0) pages/s, 0.0 (0) kbits/s,
16-Nov-2011 15:08:27
16-Nov-2011 15:08:33 (6sec)

task_201111161504_0005_m_000004 100.00%
0 threads, 0 queues, 0 URLs queued, 0 pages, 0 errors, 0.0 (0) pages/s, 0.0 (0) kbits/s,
16-Nov-2011 15:08:30
16-Nov-2011 15:08:36 (6sec)

task_201111161504_0005_m_000005 100.00%
0 threads, 0 queues, 0 URLs queued, 0 pages, 0 errors, 0.0 (0) pages/s, 0.0 (0) kbits/s,
16-Nov-2011 15:08:33
16-Nov-2011 15:08:39 (6sec)

task_201111161504_0005_m_000006 100.00%
0 threads, 0 queues, 0 URLs queued, 0 pages, 0 errors, 0.0 (0) pages/s, 0.0 (0) kbits/s,
16-Nov-2011 15:08:36
16-Nov-2011 15:08:42 (6sec)

task_201111161504_0005_m_000007 100.00%
0 threads, 0 queues, 0 URLs queued, 0 pages, 0 errors, 0.0 (0) pages/s, 0.0 (0) kbits/s,
16-Nov-2011 15:08:39
16-Nov-2011 15:08:45 (6sec)

task_201111161504_0005_m_000000 42.83%
10 threads, 1 queues, 500 URLs queued, 38 pages, 0 errors, 0.0 (0) pages/s, 96.0 (0) kbits/s,
16-Nov-2011 15:08:18

task_201111161504_0005_m_000001 98.43%
10 threads, 1 queues, 500 URLs queued, 32 pages, 0 errors, 0.0 (0) pages/s, 76.0 (0) kbits/s,
16-Nov-2011 15:08:21

Why does the generator create only 2 fetch lists, and why does it pick the
same 2 hosts again and again? After 20 runs I have 2000 pages fetched, but
only from 2 different hosts.
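
In case it helps, this is how I am checking what the generator actually wrote (readseg and readdb are the stock Nutch tools; the paths assume the crawl/ layout from my crawl command quoted below):

# list the segments and how many URLs each one contains
bin/nutch readseg -list -dir crawl/segments

# dump the 20 top-scored crawldb entries to see which hosts
# the generator keeps picking first
bin/nutch readdb crawl/crawldb -topN 20 topUrls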

Best regards,
Rafael.

On 16 Nov 2011, at 15:01, Markus Jelsma wrote:

>  <name>generate.max.count</name>
>  <value>1</value>
> 
> I think this is the problem. Please increase it: as you crawl only one host
> and your mode is set to host, each generate cycle will contain only 1 page
> for this host.
> 
> Set to -1 or a higher value.
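
Spelled out as a full property block, I read the suggestion like this (the value 1000 is just an example of "a higher value"; with generate.count.mode=host it caps the URLs per host in each fetchlist):

<property>
 <name>generate.max.count</name>
 <value>1000</value>
</property>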
> 
> 
> On Wednesday 16 November 2011 14:54:43 Rafael Pappert wrote:
>> Hello List,
>> 
>> I am trying to set up a crawler for ~10K URLs and their subpages (just the
>> internal ones). I set topN to 10000 (nutch crawl urls -dir crawl -depth 1
>> -topN 10000 -threads 10), but the fetch job only fetches 2 *
>> generate.max.count pages per run.
>> 
>> The Hadoop map task list looks like this:
>> 
>> task_201111161348_0005_m_000000      100.00%
>> 0 threads, 0 queues, 0 URLs queued, 1 pages, 0 errors, 1.0 (1) pages/s,
>> 524.0 (65536) kbits/s, 16-Nov-2011 13:53:16
>> 16-Nov-2011 13:53:22 (6sec)
>> 
>> task_201111161348_0005_m_000001      100.00%
>> 0 threads, 0 queues, 0 URLs queued, 1 pages, 0 errors, 0.0 (1) pages/s,
>> 92.0 (34552) kbits/s, 16-Nov-2011 13:53:19
>> 16-Nov-2011 13:53:28 (9sec)
>> 
>> task_201111161348_0005_m_000002      100.00%
>> 0 threads, 0 queues, 0 URLs queued, 0 pages, 0 errors, 0.0 (0) pages/s, 0.0
>> (0) kbits/s, 16-Nov-2011 13:53:22
>> 16-Nov-2011 13:53:28 (6sec)
>> 
>> task_201111161348_0005_m_000003      100.00%
>> 0 threads, 0 queues, 0 URLs queued, 0 pages, 0 errors, 0.0 (0) pages/s, 0.0
>> (0) kbits/s, 16-Nov-2011 13:53:25
>> 16-Nov-2011 13:53:31 (6sec)
>> 
>> task_201111161348_0005_m_000004      100.00%
>> 0 threads, 0 queues, 0 URLs queued, 0 pages, 0 errors, 0.0 (0) pages/s, 0.0
>> (0) kbits/s, 16-Nov-2011 13:53:28
>> 16-Nov-2011 13:53:34 (6sec)
>> 
>> task_201111161348_0005_m_000005      100.00%
>> 0 threads, 0 queues, 0 URLs queued, 0 pages, 0 errors, 0.0 (0) pages/s, 0.0
>> (0) kbits/s, 16-Nov-2011 13:53:31
>> 16-Nov-2011 13:53:37 (6sec)
>> 
>> task_201111161348_0005_m_000006      100.00%
>> 0 threads, 0 queues, 0 URLs queued, 0 pages, 0 errors, 0.0 (0) pages/s, 0.0
>> (0) kbits/s, 16-Nov-2011 13:53:34
>> 16-Nov-2011 13:53:40 (6sec)
>> 
>> task_201111161348_0005_m_000007      100.00%
>> 0 threads, 0 queues, 0 URLs queued, 0 pages, 0 errors, 0.0 (0) pages/s, 0.0
>> (0) kbits/s, 16-Nov-2011 13:53:37
>> 16-Nov-2011 13:53:43 (6sec)
>> 
>> But "readdb -stats" is like that, after a few runs:
>> 
>> 11/11/16 14:45:13 INFO crawl.CrawlDbReader: status 1 (db_unfetched): 14653
>> 11/11/16 14:45:13 INFO crawl.CrawlDbReader: status 2 (db_fetched):   17
>> 
>> Server:
>> 
>> One node running Hadoop 0.20.203.0 (later I will add more nodes) and Nutch 1.4.
>> 
>> Config files:
>> 
>> // nutch-site.xml
>> 
>> <property>
>>  <name>http.accept.language</name>
>>  <value>de-de,de,en-us,en-gb,en;q=0.7,*;q=0.3</value>
>>  <description>Value of the "Accept-Language" request header field.
>>  This allows selecting non-English language as default one to retrieve.
>>  It is a useful setting for search engines built for a certain national
>> group.</description>
>> </property>
>> 
>> <property>
>>  <name>plugin.folders</name>
>>  <value>plugins</value>
>> </property>
>> 
>> <property>
>>  <name>plugin.includes</name>
>> 
>> <value>linkbutler|language-identifier|protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor|more)|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
>> <description>Regular expression naming plugin directory names
>> to include. Any plugin not matching this expression is excluded.
>>  In any case you need at least include the nutch-extensionpoints plugin.
>> By default Nutch includes crawling just HTML and plain text via HTTP, and
>> basic indexing and search plugins.
>>  </description>
>> </property>
>> 
>> <property>
>>  <name>db.ignore.external.links</name>
>>  <value>true</value>
>>  <description></description>
>> </property>
>> 
>> <property>
>>  <name>generate.count.mode</name>
>>  <value>host</value>
>>  <description>Determines how the URLs are counted for generate.max.count.
>>  Default value is 'host' but can be 'domain'. Note that we do not count
>>  per IP in the new version of the Generator.
>>  </description>
>> </property>
>> 
>> <property>
>>  <name>generate.max.count</name>
>>  <value>1</value>
>>  <description>The maximum number of URLs in a single
>>  fetchlist. -1 if unlimited. The URLs are counted according
>>  to the value of the parameter generate.count.mode.
>>  </description>
>> </property>
>> 
>> // mapred-site.xml
>> 
>> <property>
>>  <name>mapred.tasktracker.map.tasks.maximum</name>
>>  <value>8</value>
>>  <description>The maximum number of map tasks that will be run
>>  simultaneously by a task tracker.</description>
>> </property>
>> 
>> <property>
>>  <name>mapred.tasktracker.reduce.tasks.maximum</name>
>>  <value>8</value>
>>  <description>The maximum number of reduce tasks that will be run
>>  simultaneously by a task tracker.</description>
>> </property>
>> 
>> <property>
>>  <name>mapred.map.tasks</name>
>>  <value>8</value>
>>  <description></description>
>> </property>
>> 
>> <property>
>>  <name>mapred.reduce.tasks</name>
>>  <value>8</value>
>>  <description></description>
>> </property>
>> 
>> What's wrong with my configuration? Please correct me if I'm wrong, but I
>> guess topN = 10k and depth = 1 means Nutch should fetch 10k pages in one
>> run?
>> 
>> Thanks in advance,
>> Rafael
> 
> -- 
> Markus Jelsma - CTO - Openindex
> http://www.linkedin.com/in/markus17
> 050-8536620 / 06-50258350
