1) generate.max.count sets a limit on the number of URLs per host
or domain - this is different from the overall limit on the segment size
set by the generator's -topN parameter.

2) the generator only skips the URLs which are beyond the max number allowed
for that host (in your case 3K). It does not mean that ALL URLs for that
host are skipped
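
To illustrate point 1, a per-host cap would look something like this in
nutch-site.xml (property names as in nutch-default.xml; the value of 3000 is
just your example figure):

```xml
<!-- Cap the number of URLs generated per host in a single generate run.
     The default of -1 means no limit. -->
<property>
  <name>generate.max.count</name>
  <value>3000</value>
</property>

<!-- Whether the cap above is counted per "host" or per "domain". -->
<property>
  <name>generate.count.mode</name>
  <value>host</value>
</property>
```

The overall segment size is controlled separately on the command line, e.g.
-topN 20000 when calling the generate step, which would explain fetching ~20K
pages per cycle regardless of the per-host cap.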

Makes sense?

On 16 August 2011 14:16, Marek Bachmann <[email protected]> wrote:

> Hello,
>
> there are two things I don't understand regarding the generator:
>
> 1.) If I set generate.max.count to a value, e.g. 3000, it seems that
> this value is ignored. In every run about 20000 pages are fetched.
>
> TOTAL urls: 102396
> retry 0:    101679
> retry 1:    325
> retry 2:    392
> min score:  1.0
> avg score:  1.0
> max score:  1.0
> status 1 (db_unfetched):    33072
> status 2 (db_fetched):      57146
> status 3 (db_gone): 6878
> status 4 (db_redir_temp):   2510
> status 5 (db_redir_perm):   2509
> status 6 (db_notmodified):  281
> CrawlDb statistics: done
>
> After a generate / fetch / parse / update cycle:
>
> TOTAL urls:     122885
> retry 0:        121816
> retry 1:        677
> retry 2:        392
> min score:      1.0
> avg score:      1.0
> max score:      1.0
> status 1 (db_unfetched):        32153
> status 2 (db_fetched):  75366
> status 3 (db_gone):     9167
> status 4 (db_redir_temp):       2979
> status 5 (db_redir_perm):       2878
> status 6 (db_notmodified):      342
> CrawlDb statistics: done
>
> 2.) The next thing is related to the first one:
>
> The generator tells me in the log files:
> 2011-08-16 13:55:55,087 INFO  crawl.Generator - Host or domain
> cms.uni-kassel.de has more than 3000 URLs for all 1 segments - skipping
>
> But when the fetcher is running it fetches many urls which the generator
> told me it had skipped before, like:
>
> 2011-08-16 13:56:31,119 INFO  fetcher.Fetcher - fetching
> http://cms.uni-kassel.de/unicms/index.php?id=27436
> 2011-08-16 13:56:31,706 INFO  fetcher.Fetcher - fetching
> http://cms.uni-kassel.de/unicms/index.php?id=24287&L=1
>
> A second example:
>
> 2011-08-16 13:55:59,362 INFO  crawl.Generator - Host or domain
> www.iset.uni-kassel.de has more than 3000 URLs for all 1 segments -
> skipping
>
> 2011-08-16 13:56:30,783 INFO  fetcher.Fetcher - fetching
> http://www.iset.uni-kassel.de/abt/FB-I/publication/2011-017_Visualizing_and_Optimizing-Paper.pdf
> 2011-08-16 13:56:30,813 INFO  fetcher.Fetcher - fetching
> http://www.iset.uni-kassel.de/abt/FB-A/publication/2010/2010_Degner_Staffelstein.pdf
>
> Did I do something wrong? I don't get it :)
>
> Thank you all
>
>


-- 
Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
