Hello,
there are two things I don't understand about the generator:
1.) If I set generate.max.count to a value such as 3000, the value seems to be ignored: in every run roughly 20,000 pages are fetched.
Here are the CrawlDb statistics (from readdb -stats) before a generate/fetch/parse/update cycle:
TOTAL urls: 102396
retry 0: 101679
retry 1: 325
retry 2: 392
min score: 1.0
avg score: 1.0
max score: 1.0
status 1 (db_unfetched): 33072
status 2 (db_fetched): 57146
status 3 (db_gone): 6878
status 4 (db_redir_temp): 2510
status 5 (db_redir_perm): 2509
status 6 (db_notmodified): 281
CrawlDb statistics: done
After a generate / fetch / parse / update cycle:
TOTAL urls: 122885
retry 0: 121816
retry 1: 677
retry 2: 392
min score: 1.0
avg score: 1.0
max score: 1.0
status 1 (db_unfetched): 32153
status 2 (db_fetched): 75366
status 3 (db_gone): 9167
status 4 (db_redir_temp): 2979
status 5 (db_redir_perm): 2878
status 6 (db_notmodified): 342
CrawlDb statistics: done
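For context, the setting I mean is roughly the following in conf/nutch-site.xml (property names as in nutch-default.xml; the generate.count.mode value of "host" is the default, if I read the description correctly):

```xml
<!-- Sketch of the properties I believe control the per-host limit,
     copied/adapted from nutch-default.xml. -->
<property>
  <name>generate.max.count</name>
  <value>3000</value>
  <description>Maximum number of URLs per host/domain in a single fetchlist.</description>
</property>
<property>
  <name>generate.count.mode</name>
  <value>host</value>
  <description>Whether generate.max.count is applied per host or per domain.</description>
</property>
```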
2.) The next point is related to the first one:
the generator reports in the log files:
2011-08-16 13:55:55,087 INFO crawl.Generator - Host or domain
cms.uni-kassel.de has more than 3000 URLs for all 1 segments - skipping
But when the fetcher runs, it fetches many URLs from hosts the generator claimed to have skipped, e.g.:
2011-08-16 13:56:31,119 INFO fetcher.Fetcher - fetching
http://cms.uni-kassel.de/unicms/index.php?id=27436
2011-08-16 13:56:31,706 INFO fetcher.Fetcher - fetching
http://cms.uni-kassel.de/unicms/index.php?id=24287&L=1
A second example:
2011-08-16 13:55:59,362 INFO crawl.Generator - Host or domain
www.iset.uni-kassel.de has more than 3000 URLs for all 1 segments - skipping
2011-08-16 13:56:30,783 INFO fetcher.Fetcher - fetching
http://www.iset.uni-kassel.de/abt/FB-I/publication/2011-017_Visualizing_and_Optimizing-Paper.pdf
2011-08-16 13:56:30,813 INFO fetcher.Fetcher - fetching
http://www.iset.uni-kassel.de/abt/FB-A/publication/2010/2010_Degner_Staffelstein.pdf
Did I do something wrong? I don't get it. :)
Thank you all