On 02.11.2011 14:17, Markus Jelsma wrote:
Hi Marek,
With your settings the generator should select all records that are _eligible_
for fetching because their fetch time has expired. I suspect that you generate,
fetch, update and generate again. In the meantime the DB may have changed,
which would explain this behaviour.
Indeed, I do, but I run the cycles at 15 to 30 minute intervals (thanks to
the small Hadoop cluster ;-) )
My fetch intervals are:
<property>
<name>db.fetch.interval.max</name>
<value>1209600</value>
<description>
1209600 s = 14 days
</description>
</property>
<property>
<name>db.fetch.interval.default</name>
<value>603450</value>
<description>
603450 s ≈ 7 days
</description>
</property>
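As a sanity check on those intervals (the class name below is just for illustration): 14 days is exactly 1209600 s, which matches db.fetch.interval.max, but exactly 7 days would be 604800 s, slightly more than the 603450 s configured for db.fetch.interval.default:

```java
// Quick arithmetic check of the fetch intervals, in seconds (illustrative only).
public class FetchIntervals {
    public static void main(String[] args) {
        long secondsPerDay = 24L * 3600L;
        long max = 14 * secondsPerDay;  // 1209600 s, matches db.fetch.interval.max
        long week = 7 * secondsPerDay;  // 604800 s; the configured default is 603450 s
        System.out.println(max + " " + week); // prints "1209600 604800"
    }
}
```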
I think the status "unfetched" is for URLs that have never been
fetched, am I right?
So what I expect is that when, after a Generate-Fetch-Parse-Update cycle,
there are 20k unfetched URLs, the generator should add all of them to the fetch list.
An example:
Started with:
11/11/02 14:48:14 INFO crawl.CrawlDbReader: Statistics for CrawlDb: crawldb
11/11/02 14:48:14 INFO crawl.CrawlDbReader: TOTAL urls: 241798
11/11/02 14:48:14 INFO crawl.CrawlDbReader: retry 0: 236834
11/11/02 14:48:14 INFO crawl.CrawlDbReader: retry 1: 4794
11/11/02 14:48:14 INFO crawl.CrawlDbReader: retry 2: 170
11/11/02 14:48:14 INFO crawl.CrawlDbReader: min score: 0.0
11/11/02 14:48:14 INFO crawl.CrawlDbReader: avg score: 2.48141E-5
11/11/02 14:48:14 INFO crawl.CrawlDbReader: max score: 1.0
11/11/02 14:48:14 INFO crawl.CrawlDbReader: status 1 (db_unfetched): 18314
11/11/02 14:48:14 INFO crawl.CrawlDbReader: status 2 (db_fetched): 202241
11/11/02 14:48:14 INFO crawl.CrawlDbReader: status 3 (db_gone): 8369
11/11/02 14:48:14 INFO crawl.CrawlDbReader: status 4 (db_redir_temp): 9181
11/11/02 14:48:14 INFO crawl.CrawlDbReader: status 5 (db_redir_perm): 2896
11/11/02 14:48:14 INFO crawl.CrawlDbReader: status 6 (db_notmodified): 797
11/11/02 14:48:14 INFO crawl.CrawlDbReader: CrawlDb statistics: done
I ran a GFPU cycle and then:
11/11/02 15:07:58 INFO crawl.CrawlDbReader: TOTAL urls: 246753
11/11/02 15:07:58 INFO crawl.CrawlDbReader: retry 0: 241755
11/11/02 15:07:58 INFO crawl.CrawlDbReader: retry 1: 4810
11/11/02 15:07:58 INFO crawl.CrawlDbReader: retry 2: 188
11/11/02 15:07:58 INFO crawl.CrawlDbReader: min score: 0.0
11/11/02 15:07:58 INFO crawl.CrawlDbReader: avg score: 2.4315814E-5
11/11/02 15:07:58 INFO crawl.CrawlDbReader: max score: 1.0
11/11/02 15:07:58 INFO crawl.CrawlDbReader: status 1 (db_unfetched): 13753
11/11/02 15:07:58 INFO crawl.CrawlDbReader: status 2 (db_fetched): 211389
11/11/02 15:07:58 INFO crawl.CrawlDbReader: status 3 (db_gone): 8602
11/11/02 15:07:58 INFO crawl.CrawlDbReader: status 4 (db_redir_temp): 9303
11/11/02 15:07:58 INFO crawl.CrawlDbReader: status 5 (db_redir_perm): 2909
11/11/02 15:07:58 INFO crawl.CrawlDbReader: status 6 (db_notmodified): 797
11/11/02 15:07:58 INFO crawl.CrawlDbReader: CrawlDb statistics: done
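Comparing the two stat dumps may explain part of the puzzle: fetched grew by roughly the number of successes in the cycle, while ~5k newly discovered URLs replenished the unfetched pool, so unfetched shrank by less than the number fetched. A quick diff of the figures above (class name is illustrative):

```java
// Differences between the two CrawlDb stat dumps quoted above.
public class CrawlDbDiff {
    public static void main(String[] args) {
        System.out.println(246753 - 241798); // total:     +4955 new URLs discovered
        System.out.println(211389 - 202241); // fetched:   +9148
        System.out.println(18314 - 13753);   // unfetched: -4561 (≈9.1k fetched minus ≈4.6k new)
    }
}
```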
As you can see, there were ~18k unfetched URLs but only ~9.5k have been
processed (from the Hadoop job details):
FetcherStatus:
moved 16
exception 85
access_denied 109
success 9214
temp_moved 135
notfound 111
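Summing those fetcher counters gives the number of records the fetcher actually attempted in that cycle, of which 9214 succeeded (class name is illustrative):

```java
// Sum of the per-status fetcher counters from the Hadoop job details above.
public class FetcherCounters {
    public static void main(String[] args) {
        int moved = 16, exception = 85, accessDenied = 109;
        int success = 9214, tempMoved = 135, notFound = 111;
        int attempted = moved + exception + accessDenied + success + tempMoved + notFound;
        System.out.println(attempted); // prints 9670
    }
}
```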
Thank you once again, Markus
PS: What's the magic trick the generator uses to determine whether a URL is
eligible? :)
If you do not update the DB it will (by default) always generate identical
fetch lists under similar circumstances.
I think it sometimes generates only ~1k because you already fetched all other
records.
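There is no real magic: a record becomes eligible roughly when its scheduled fetch time has passed. A hedged sketch of that core test (not the actual Nutch source; class and method names are invented for illustration):

```java
// Illustrative sketch: the generator selects records whose fetch time has
// expired; URL filters, scoring thresholds and topN / generate.max.count
// limits are then applied on top of this basic test.
public class EligibilitySketch {
    static boolean eligible(long fetchTimeMs, long nowMs) {
        return fetchTimeMs <= nowMs; // scheduled fetch time has passed
    }

    public static void main(String[] args) {
        long now = System.currentTimeMillis();
        System.out.println(eligible(now - 1_000L, now));      // due for fetch: true
        System.out.println(eligible(now + 86_400_000L, now)); // scheduled tomorrow: false
    }
}
```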
Cheers
On Wednesday 02 November 2011 14:03:08 Marek Bachmann wrote:
Hello people,
can someone explain to me how the generator generates the fetch lists?
In particular:
I don't understand why it generates fetch lists with very different
numbers of URLs.
Sometimes it generates > 25k URLs and sometimes only ~1k.
In every case there were more than 25k URLs unfetched in the crawldb.
So I was expecting it to always generate ~25k URLs. But as I said
before, sometimes it's only ~1k.
In my nutch-site.xml I have defined the following values:
<property>
<name>generate.max.count</name>
<value>-1</value>
<description>The maximum number of urls in a single
fetchlist. -1 if unlimited. The urls are counted according
to the value of the parameter generator.count.mode.
</description>
</property>
Any ideas?
Thanks