1) generate.max.count sets a limit on the number of URLs for a single host or domain. This is different from the overall limit on the size of the fetchlist, which is set by the -topN parameter of the generate command.
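For reference, the per-host cap can be set in conf/nutch-site.xml roughly like this (a sketch based on the Nutch 1.x default property names; the value 3000 and the "host" mode mirror your scenario):

```xml
<!-- Cap the number of URLs selected per host (or per domain) in each
     generate run. -->
<property>
  <name>generate.max.count</name>
  <value>3000</value>
</property>
<property>
  <!-- "host" or "domain": the unit the cap applies to -->
  <name>generate.count.mode</name>
  <value>host</value>
</property>
```

The overall fetchlist size is capped separately on the command line, e.g. bin/nutch generate crawl/crawldb crawl/segments -topN 20000.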
2) the generator skips only the URLs which are beyond the max number allowed for the host (3K in your case). This does not mean that ALL URLs for that host are skipped. Makes sense?

On 16 August 2011 14:16, Marek Bachmann <[email protected]> wrote:

> Hello,
>
> there are two things I don't understand regarding the generator:
>
> 1.) If I set generate.max.count to a value, e.g. 3000, it seems
> that this value is ignored. In every run about 20000 pages are fetched.
>
> TOTAL urls: 102396
> retry 0: 101679
> retry 1: 325
> retry 2: 392
> min score: 1.0
> avg score: 1.0
> max score: 1.0
> status 1 (db_unfetched): 33072
> status 2 (db_fetched): 57146
> status 3 (db_gone): 6878
> status 4 (db_redir_temp): 2510
> status 5 (db_redir_perm): 2509
> status 6 (db_notmodified): 281
> CrawlDb statistics: done
>
> After a generate / fetch / parse / update cycle:
>
> TOTAL urls: 122885
> retry 0: 121816
> retry 1: 677
> retry 2: 392
> min score: 1.0
> avg score: 1.0
> max score: 1.0
> status 1 (db_unfetched): 32153
> status 2 (db_fetched): 75366
> status 3 (db_gone): 9167
> status 4 (db_redir_temp): 2979
> status 5 (db_redir_perm): 2878
> status 6 (db_notmodified): 342
> CrawlDb statistics: done
>
> 2.) The next thing is related to the first one:
>
> The generator tells me in the log files:
>
> 2011-08-16 13:55:55,087 INFO crawl.Generator - Host or domain
> cms.uni-kassel.de has more than 3000 URLs for all 1 segments - skipping
>
> But when the fetcher is running it fetches many URLs which the generator
> told me it had skipped before, like:
>
> 2011-08-16 13:56:31,119 INFO fetcher.Fetcher - fetching
> http://cms.uni-kassel.de/unicms/index.php?id=27436
> 2011-08-16 13:56:31,706 INFO fetcher.Fetcher - fetching
> http://cms.uni-kassel.de/unicms/index.php?id=24287&L=1
>
> A second example:
>
> 2011-08-16 13:55:59,362 INFO crawl.Generator - Host or domain
> www.iset.uni-kassel.de has more than 3000 URLs for all 1 segments -
> skipping
>
> 2011-08-16 13:56:30,783 INFO fetcher.Fetcher - fetching
> http://www.iset.uni-kassel.de/abt/FB-I/publication/2011-017_Visualizing_and_Optimizing-Paper.pdf
> 2011-08-16 13:56:30,813 INFO fetcher.Fetcher - fetching
> http://www.iset.uni-kassel.de/abt/FB-A/publication/2010/2010_Degner_Staffelstein.pdf
>
> Did I do something wrong? I don't get it :)
>
> Thank you all

--
Open Source Solutions for Text Engineering
http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
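To make point 2 concrete, here is a rough sketch (plain Java, not actual Nutch code) of how a per-host cap behaves during selection: once a host has hit the cap, only its overflow URLs are dropped from the fetchlist; the URLs selected before the cap was reached are still fetched, which is why the fetcher logs show URLs from a "skipped" host.

```java
import java.util.*;

public class MaxCountSketch {
    // Select URLs for a fetchlist, keeping at most maxCount per host.
    static List<String> select(List<String> urls, int maxCount) {
        Map<String, Integer> perHost = new HashMap<>();
        List<String> fetchlist = new ArrayList<>();
        for (String url : urls) {
            String host = url.split("/")[2];       // naive host extraction
            int seen = perHost.getOrDefault(host, 0);
            if (seen >= maxCount) continue;        // drop only the overflow URLs
            perHost.put(host, seen + 1);
            fetchlist.add(url);
        }
        return fetchlist;
    }

    public static void main(String[] args) {
        List<String> urls = Arrays.asList(
            "http://cms.uni-kassel.de/a",
            "http://cms.uni-kassel.de/b",
            "http://cms.uni-kassel.de/c",
            "http://example.org/x");
        // With a cap of 2, the third cms.uni-kassel.de URL is dropped,
        // but the first two are still in the fetchlist.
        System.out.println(select(urls, 2));
    }
}
```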

