Hello,
there are two things I don't understand about the generator:
1.) If I set generate.max.count to a value such as 3000, the value seems to be ignored: in every run roughly 20,000 pages are fetched.
Here are the CrawlDb statistics (from readdb -stats) before a generate/fetch/parse/update cycle:
TOTAL urls: 102396
retry 0: 101679
retry 1: 325
retry 2: 392
min score: 1.0
avg score: 1.0
max score: 1.0
status 1 (db_unfetched): 33072
status 2 (db_fetched): 57146
status 3 (db_gone): 6878
status 4 (db_redir_temp): 2510
status 5 (db_redir_perm): 2509
status 6 (db_notmodified): 281
CrawlDb statistics: done
After a generate / fetch / parse / update cycle:
TOTAL urls: 122885
retry 0: 121816
retry 1: 677
retry 2: 392
min score: 1.0
avg score: 1.0
max score: 1.0
status 1 (db_unfetched): 32153
status 2 (db_fetched): 75366
status 3 (db_gone): 9167
status 4 (db_redir_temp): 2979
status 5 (db_redir_perm): 2878
status 6 (db_notmodified): 342
CrawlDb statistics: done
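For context, the setting I mean is roughly the following in conf/nutch-site.xml (property names as in nutch-default.xml; the generate.count.mode value of "host" is the default, if I read the description correctly):

```xml
<!-- Sketch of the properties I believe control the per-host limit,
     copied/adapted from nutch-default.xml. -->
<property>
  <name>generate.max.count</name>
  <value>3000</value>
  <description>Maximum number of URLs per host/domain in a single fetchlist.</description>
</property>
<property>
  <name>generate.count.mode</name>
  <value>host</value>
  <description>Whether generate.max.count is applied per host or per domain.</description>
</property>
```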
2.) The next point is related to the first one:
the generator reports in the log files:
2011-08-16 13:55:55,087 INFO crawl.Generator - Host or domain
cms.uni-kassel.de has more than 3000 URLs for all 1 segments - skipping
But when the fetcher runs, it fetches many URLs from hosts the generator claimed to have skipped, e.g.:
2011-08-16 13:56:31,119 INFO fetcher.Fetcher - fetching
http://cms.uni-kassel.de/unicms/index.php?id=27436
2011-08-16 13:56:31,706 INFO fetcher.Fetcher - fetching
http://cms.uni-kassel.de/unicms/index.php?id=24287&L=1
A second example:
2011-08-16 13:55:59,362 INFO crawl.Generator - Host or domain
www.iset.uni-kassel.de has more than 3000 URLs for all 1 segments - skipping
2011-08-16 13:56:30,783 INFO fetcher.Fetcher - fetching
http://www.iset.uni-kassel.de/abt/FB-I/publication/2011-017_Visualizing_and_Optimizing-Paper.pdf
2011-08-16 13:56:30,813 INFO fetcher.Fetcher - fetching
http://www.iset.uni-kassel.de/abt/FB-A/publication/2010/2010_Degner_Staffelstein.pdf
Did I do something wrong? I don't get it. :)
Thank you all