Hi Markus, hi List,

I used the CrawlDBScanner to look at the remaining unfetched urls.

Markus, you are, as usual, absolutely correct. The one and only reason the urls weren't scheduled was that their refetch time hadn't come yet.

As I inspected the unfetched URLs, I noticed that they all have java.net errors, either SocketTimeoutException or UnknownHostException.

Here are two examples:

http://cape.gforge.cs.uni-kassel.de/
Version: 7
Status: 1 (db_unfetched)
Fetch time: Wed Nov 02 21:19:07 CET 2011
Modified time: Thu Jan 01 01:00:00 CET 1970
Retries since fetch: 1
Retry interval: 603450 seconds (6 days)
Score: 0.0
Signature: null
Metadata: _pst_: exception(16), lastModified=0: java.net.UnknownHostException: cape.gforge.cs.uni-kassel.de

and

http://bst-ws1.statik.bauingenieure.uni-kassel.de/web/Mitarbeiter       
Version: 7
Status: 1 (db_unfetched)
Fetch time: Thu Nov 03 13:28:45 CET 2011
Modified time: Thu Jan 01 01:00:00 CET 1970
Retries since fetch: 1
Retry interval: 603450 seconds (6 days)
Score: 0.0
Signature: null
Metadata: _pst_: exception(16), lastModified=0: java.net.SocketTimeoutException: connect timed out


For some reason I thought that if a page couldn't be loaded it would disappear from the list of unfetched urls.

I know better now. :)

But now another question comes up for me. I set

<property>
  <name>db.fetch.retry.max</name>
  <value>2</value>
  <description>The maximum number of times a url that has encountered
  recoverable errors is generated for fetch.</description>
</property>

but in my (old) crawldb there are urls that have up to "retry 11" status.
Does db.fetch.retry.max mean how often a url is selected for retry even if its recrawl time hasn't come? And if so, when will urls that can't be loaded be deleted from the crawldb?
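To make my question concrete, here is a sketch of how I currently understand the retry handling during updatedb. This is NOT the actual Nutch code, just my mental model; the function name and constants are invented for illustration:

```python
# Illustrative sketch of my understanding of db.fetch.retry.max during
# updatedb. NOT the actual Nutch code; names here are invented.

DB_UNFETCHED, DB_GONE = 1, 3  # numeric codes as shown by readdb -stats
RETRY_MAX = 2                 # db.fetch.retry.max from nutch-site.xml

def update_after_transient_error(status, retries):
    """A record that hit a recoverable error (e.g. SocketTimeoutException)
    stays db_unfetched and its retry counter is bumped; only when the
    counter exceeds db.fetch.retry.max would it be marked db_gone. Gone
    records would still be kept in the crawldb, not deleted."""
    retries += 1
    if retries > RETRY_MAX:
        return DB_GONE, retries
    return DB_UNFETCHED, retries

# Simulate four consecutive failed fetch rounds for one record:
status, retries = DB_UNFETCHED, 0
for _ in range(4):
    status, retries = update_after_transient_error(status, retries)
print(status, retries)  # the record ends up db_gone once RETRY_MAX is exceeded
```

If this model is right, the "retry 11" records in my old crawldb would have to come from a time before I lowered db.fetch.retry.max, which is exactly what I'd like to have confirmed.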


Thank you very much


On 02.11.2011 17:16, Markus Jelsma wrote:


On Wednesday 02 November 2011 16:24:09 Marek Bachmann wrote:
Is there a config value that could be setting the topN value? I
definitely don't use it in my script:

-topN as command parameter



#!/bin/bash

HADOOP_DIR=/nutch/hadoop/

./nutch generate crawldb segs
newSeg=`/nutch/hadoop/bin/hadoop dfs -ls segs/ | tail -1 | awk '{print $8}'`
echo $newSeg

./nutch fetch $newSeg
./nutch parse $newSeg
./nutch updatedb crawldb $newSeg

Are there any tests for the generator, so that I can see what it will
select?

Thank You

On 02.11.2011 15:30, Markus Jelsma wrote:
On Wednesday 02 November 2011 15:08:42 Marek Bachmann wrote:
On 02.11.2011 14:17, Markus Jelsma wrote:
Hi Marek,

With your settings the generator should select all records that are
_eligible_ for fetch due to their fetch time being expired. I suspect
that you generate, fetch, update and generate again. In the meanwhile
the DB may have changed so this would explain this behaviour.

Indeed, I do so, but I do the cycles in 15 to 30 min intervals (thx to
the small hadoop cluster ;-) )

My fetch intervals are:

<property>

     <name>db.fetch.interval.max</name>
     <value>1209600</value>
     <description>

           1209600 s =  14 days

     </description>

</property>


<property>

     <name>db.fetch.interval.default</name>
     <value>603450</value>
     <description>

          603450 s ≈ 7 days

     </description>

</property>
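As a quick sanity check on those two values (plain arithmetic, nothing Nutch-specific):

```python
# Convert the configured fetch intervals from seconds to days.
SECONDS_PER_DAY = 24 * 60 * 60  # 86400

fetch_interval_max = 1209600     # db.fetch.interval.max
fetch_interval_default = 603450  # db.fetch.interval.default

print(fetch_interval_max / SECONDS_PER_DAY)                 # 14.0 days exactly
print(round(fetch_interval_default / SECONDS_PER_DAY, 2))   # 6.98, i.e. roughly 7 days
```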

I think that the status "unfetched" is for urls that have never been
fetched, am I right?

Yes. See the CrawlDatum source for more descriptions on all status codes.
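Until you do, here is a little cheat sheet I keep for the numeric codes; the names are simply collected from the readdb -stats lines in this thread, not taken from the CrawlDatum source:

```python
# Cheat sheet for the CrawlDatum status codes as they appear in
# readdb -stats output; compiled from the stats lines in this thread.
CRAWL_DATUM_STATUS = {
    1: "db_unfetched",    # never successfully fetched
    2: "db_fetched",
    3: "db_gone",
    4: "db_redir_temp",
    5: "db_redir_perm",
    6: "db_notmodified",
}
print(CRAWL_DATUM_STATUS[1])  # db_unfetched
```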

So, what I expect is that when, after a Generate-Fetch-Parse-Update cycle,
there are 20k unfetched urls, the generator should add all of them to the
fetch list.

An example:

Started with:
11/11/02 14:48:14 INFO crawl.CrawlDbReader: Statistics for CrawlDb: crawldb
11/11/02 14:48:14 INFO crawl.CrawlDbReader: TOTAL urls: 241798
11/11/02 14:48:14 INFO crawl.CrawlDbReader: retry 0:    236834
11/11/02 14:48:14 INFO crawl.CrawlDbReader: retry 1:    4794
11/11/02 14:48:14 INFO crawl.CrawlDbReader: retry 2:    170
11/11/02 14:48:14 INFO crawl.CrawlDbReader: min score:  0.0
11/11/02 14:48:14 INFO crawl.CrawlDbReader: avg score:  2.48141E-5
11/11/02 14:48:14 INFO crawl.CrawlDbReader: max score:  1.0
11/11/02 14:48:14 INFO crawl.CrawlDbReader: status 1 (db_unfetched): 18314
11/11/02 14:48:14 INFO crawl.CrawlDbReader: status 2 (db_fetched): 202241
11/11/02 14:48:14 INFO crawl.CrawlDbReader: status 3 (db_gone): 8369
11/11/02 14:48:14 INFO crawl.CrawlDbReader: status 4 (db_redir_temp): 9181
11/11/02 14:48:14 INFO crawl.CrawlDbReader: status 5 (db_redir_perm): 2896
11/11/02 14:48:14 INFO crawl.CrawlDbReader: status 6 (db_notmodified): 797
11/11/02 14:48:14 INFO crawl.CrawlDbReader: CrawlDb statistics: done

I ran a GFPU cycle and then:

11/11/02 15:07:58 INFO crawl.CrawlDbReader: TOTAL urls: 246753
11/11/02 15:07:58 INFO crawl.CrawlDbReader: retry 0:    241755
11/11/02 15:07:58 INFO crawl.CrawlDbReader: retry 1:    4810
11/11/02 15:07:58 INFO crawl.CrawlDbReader: retry 2:    188
11/11/02 15:07:58 INFO crawl.CrawlDbReader: min score:  0.0
11/11/02 15:07:58 INFO crawl.CrawlDbReader: avg score:  2.4315814E-5
11/11/02 15:07:58 INFO crawl.CrawlDbReader: max score:  1.0
11/11/02 15:07:58 INFO crawl.CrawlDbReader: status 1 (db_unfetched): 13753
11/11/02 15:07:58 INFO crawl.CrawlDbReader: status 2 (db_fetched): 211389
11/11/02 15:07:58 INFO crawl.CrawlDbReader: status 3 (db_gone): 8602
11/11/02 15:07:58 INFO crawl.CrawlDbReader: status 4 (db_redir_temp): 9303
11/11/02 15:07:58 INFO crawl.CrawlDbReader: status 5 (db_redir_perm): 2909
11/11/02 15:07:58 INFO crawl.CrawlDbReader: status 6 (db_notmodified): 797
11/11/02 15:07:58 INFO crawl.CrawlDbReader: CrawlDb statistics: done

As you can see, there were ~18k unfetched urls but only ~9.5k have been
processed (from the Hadoop job details):
Yes, I would expect it to generate all db_unfetched records too, but I
cannot reproduce such behaviour. If I don't use topN to cut it off, I get
fetch lists with 100 million URLs, incl. all db_unfetched.

FetcherStatus:
moved           16
exception       85
access_denied   109
success         9.214
temp_moved      135
notfound        111


Thank you once again, Markus

PS: What's the magic trick the generator does to determine a url as
eligible? :)

You should check the mapper method in the source to get a full picture.
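In the meantime, a very simplified sketch of what that selection step presumably does (my own model, not the actual Generator code; the real mapper also applies URL filters, scoring thresholds, per-host limits, etc.):

```python
import time

# Simplified sketch of generator selection: a record is eligible when its
# fetch time has expired; topN then keeps only the highest-scoring
# eligible records. NOT the actual Nutch Generator code.

def generate(records, cur_time, top_n=None):
    eligible = [r for r in records if r["fetch_time"] <= cur_time]
    eligible.sort(key=lambda r: r["score"], reverse=True)
    return eligible if top_n is None else eligible[:top_n]

now = time.time()
records = [
    {"url": "http://a.example/", "fetch_time": now - 3600, "score": 1.0},
    {"url": "http://b.example/", "fetch_time": now + 603450, "score": 0.5},
    {"url": "http://c.example/", "fetch_time": now - 60, "score": 0.0},
]
print([r["url"] for r in generate(records, now)])     # a and c are due
print([r["url"] for r in generate(records, now, 1)])  # topN=1 keeps only a
```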

If you do not update the DB it will (by default) always generate
identical fetch lists under similar circumstances.

I think it sometimes generates only ~1k because you already fetched all
other records.

Cheers

On Wednesday 02 November 2011 14:03:08 Marek Bachmann wrote:
Hello people,

can someone explain to me how the generator generates the fetch lists?

In particular:

I don't understand why it generates fetch lists with very different
amounts of urls.

Sometimes it generates > 25k urls and sometimes > 1k.

In every case there were more than 25k urls unfetched in the crawldb.
So I was expecting that it always generates ~25k urls. But as I said
before, sometimes only ~1k.

In my nutch-site.xml I have defined following values:

<property>

      <name>generate.max.count</name>
      <value>-1</value>
      <description>The maximum number of urls in a single
      fetchlist.  -1 if unlimited. The urls are counted according
      to the value of the parameter generator.count.mode.
      </description>

</property>


Any ideas?

Thanks

