On 02.11.2011 14:17, Markus Jelsma wrote:
Hi Marek,
With your settings the generator should select all records that are _eligible_
for fetching because their fetch time has expired. I suspect that you generate,
fetch, update and generate again. In the meantime the DB may have changed,
which would explain this behaviour.
Indeed, I do, but I run the cycles at 15 to 30 minute intervals (thanks to
the small Hadoop cluster ;-) )
My fetch intervals are:
<property>
<name>db.fetch.interval.max</name>
<value>1209600</value>
<description>
1209600 s = 14 days
</description>
</property>
<property>
<name>db.fetch.interval.default</name>
<value>603450</value>
<description>
603450 s ≈ 7 days
</description>
</property>
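As a sanity check on those intervals (the class name below is just for illustration): 14 days is exactly 1209600 s, which matches db.fetch.interval.max, but exactly 7 days would be 604800 s, slightly more than the 603450 s configured for db.fetch.interval.default:

```java
// Quick arithmetic check of the fetch intervals, in seconds (illustrative only).
public class FetchIntervals {
    public static void main(String[] args) {
        long secondsPerDay = 24L * 3600L;
        long max = 14 * secondsPerDay;  // 1209600 s, matches db.fetch.interval.max
        long week = 7 * secondsPerDay;  // 604800 s; the configured default is 603450 s
        System.out.println(max + " " + week); // prints "1209600 604800"
    }
}
```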
I think the status "unfetched" is for URLs that have never been
fetched, am I right?
So what I expect is that when, after a Generate-Fetch-Parse-Update cycle,
there are 20k unfetched URLs, the generator should add all of them to the fetch list.
An example:
Started with:
11/11/02 14:48:14 INFO crawl.CrawlDbReader: Statistics for CrawlDb: crawldb
11/11/02 14:48:14 INFO crawl.CrawlDbReader: TOTAL urls: 241798
11/11/02 14:48:14 INFO crawl.CrawlDbReader: retry 0: 236834
11/11/02 14:48:14 INFO crawl.CrawlDbReader: retry 1: 4794
11/11/02 14:48:14 INFO crawl.CrawlDbReader: retry 2: 170
11/11/02 14:48:14 INFO crawl.CrawlDbReader: min score: 0.0
11/11/02 14:48:14 INFO crawl.CrawlDbReader: avg score: 2.48141E-5
11/11/02 14:48:14 INFO crawl.CrawlDbReader: max score: 1.0
11/11/02 14:48:14 INFO crawl.CrawlDbReader: status 1 (db_unfetched): 18314
11/11/02 14:48:14 INFO crawl.CrawlDbReader: status 2 (db_fetched): 202241
11/11/02 14:48:14 INFO crawl.CrawlDbReader: status 3 (db_gone): 8369
11/11/02 14:48:14 INFO crawl.CrawlDbReader: status 4 (db_redir_temp): 9181
11/11/02 14:48:14 INFO crawl.CrawlDbReader: status 5 (db_redir_perm): 2896
11/11/02 14:48:14 INFO crawl.CrawlDbReader: status 6 (db_notmodified): 797
11/11/02 14:48:14 INFO crawl.CrawlDbReader: CrawlDb statistics: done
I ran a GFPU cycle and then:
11/11/02 15:07:58 INFO crawl.CrawlDbReader: TOTAL urls: 246753
11/11/02 15:07:58 INFO crawl.CrawlDbReader: retry 0: 241755
11/11/02 15:07:58 INFO crawl.CrawlDbReader: retry 1: 4810
11/11/02 15:07:58 INFO crawl.CrawlDbReader: retry 2: 188
11/11/02 15:07:58 INFO crawl.CrawlDbReader: min score: 0.0
11/11/02 15:07:58 INFO crawl.CrawlDbReader: avg score: 2.4315814E-5
11/11/02 15:07:58 INFO crawl.CrawlDbReader: max score: 1.0
11/11/02 15:07:58 INFO crawl.CrawlDbReader: status 1 (db_unfetched): 13753
11/11/02 15:07:58 INFO crawl.CrawlDbReader: status 2 (db_fetched): 211389
11/11/02 15:07:58 INFO crawl.CrawlDbReader: status 3 (db_gone): 8602
11/11/02 15:07:58 INFO crawl.CrawlDbReader: status 4 (db_redir_temp): 9303
11/11/02 15:07:58 INFO crawl.CrawlDbReader: status 5 (db_redir_perm): 2909
11/11/02 15:07:58 INFO crawl.CrawlDbReader: status 6 (db_notmodified): 797
11/11/02 15:07:58 INFO crawl.CrawlDbReader: CrawlDb statistics: done
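Comparing the two stat dumps may explain part of the puzzle: fetched grew by roughly the number of successes in the cycle, while ~5k newly discovered URLs replenished the unfetched pool, so unfetched shrank by less than the number fetched. A quick diff of the figures above (class name is illustrative):

```java
// Differences between the two CrawlDb stat dumps quoted above.
public class CrawlDbDiff {
    public static void main(String[] args) {
        System.out.println(246753 - 241798); // total:     +4955 new URLs discovered
        System.out.println(211389 - 202241); // fetched:   +9148
        System.out.println(18314 - 13753);   // unfetched: -4561 (≈9.1k fetched minus ≈4.6k new)
    }
}
```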
As you can see, there were ~18k unfetched URLs but only ~9.5k have been
processed (from the Hadoop job details):
FetcherStatus:
moved 16
exception 85
access_denied 109
success 9214
temp_moved 135
notfound 111
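Summing those fetcher counters gives the number of records the fetcher actually attempted in that cycle, of which 9214 succeeded (class name is illustrative):

```java
// Sum of the per-status fetcher counters from the Hadoop job details above.
public class FetcherCounters {
    public static void main(String[] args) {
        int moved = 16, exception = 85, accessDenied = 109;
        int success = 9214, tempMoved = 135, notFound = 111;
        int attempted = moved + exception + accessDenied + success + tempMoved + notFound;
        System.out.println(attempted); // prints 9670
    }
}
```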
Thank you once again, Markus
PS: What's the magic trick the generator uses to determine whether a URL is
eligible? :)
If you do not update the DB it will (by default) always generate identical
fetch lists under similar circumstances.
I think it sometimes generates only ~1k because you already fetched all other
records.
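There is no real magic: a record becomes eligible roughly when its scheduled fetch time has passed. A hedged sketch of that core test (not the actual Nutch source; class and method names are invented for illustration):

```java
// Illustrative sketch: the generator selects records whose fetch time has
// expired; URL filters, scoring thresholds and topN / generate.max.count
// limits are then applied on top of this basic test.
public class EligibilitySketch {
    static boolean eligible(long fetchTimeMs, long nowMs) {
        return fetchTimeMs <= nowMs; // scheduled fetch time has passed
    }

    public static void main(String[] args) {
        long now = System.currentTimeMillis();
        System.out.println(eligible(now - 1_000L, now));      // due for fetch: true
        System.out.println(eligible(now + 86_400_000L, now)); // scheduled tomorrow: false
    }
}
```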
Cheers
On Wednesday 02 November 2011 14:03:08 Marek Bachmann wrote:
Hello people,
can someone explain to me how the generator generates the fetch lists?
In particular:
I don't understand why it generates fetch lists with very different
numbers of URLs.
Sometimes it generates > 25k URLs and sometimes only ~1k.
In every case there were more than 25k URLs unfetched in the crawldb.
So I was expecting it to always generate ~25k URLs. But as I said
before, sometimes it's only ~1k.
In my nutch-site.xml I have defined the following values:
<property>
<name>generate.max.count</name>
<value>-1</value>
<description>The maximum number of urls in a single
fetchlist. -1 if unlimited. The urls are counted according
to the value of the parameter generator.count.mode.
</description>
</property>
Any ideas?
Thanks