On Tuesday 16 August 2011 16:17:26 Marek Bachmann wrote:
> On 16.08.2011 15:53, Julien Nioche wrote:
> > 1) generate.max.count sets a limit on the number of URLs for a single
> > host or domain - this is different from the overall limit set by the
> > generate -topN parameter.
> >
> > 2) the generator only skips the URLs which are beyond the max number
> > allowed for the host (in your case 3K). This does not mean that ALL
> > URLs for that host are skipped.
> >
> > Makes sense?
>
> Hey Julien, thank you. Yes, your description makes sense to me. So if I
> want to fetch a list with only 3k URLs, I just have to run:
>
> ./nutch parse $seg -topN 3000
No, topN applies to the generator.

> right?
>
> But I still don't get this message:
>
> 2011-08-16 13:55:55,087 INFO crawl.Generator - Host or domain
> cms.uni-kassel.de has more than 3000 URLs for all 1 segments - skipping
>
> What is meant by "more than 3000 URLs for all 1 segments"? Does
> "skipping" mean that it will skip URLs after the first 3k?

If generate.max.count=3000, then all URLs above the first 3000 for a given
host/domain are skipped when generating the segment.

> But for now you helped to solve my problem. :)
>
> > On 16 August 2011 14:16, Marek Bachmann <[email protected]> wrote:
> >> Hello,
> >>
> >> there are two things I don't understand regarding the generator:
> >>
> >> 1.) If I set generate.max.count to a value, e.g. 3000, it seems to be
> >> ignored. In every run about 20000 pages are fetched.
> >>
> >> TOTAL urls: 102396
> >> retry 0: 101679
> >> retry 1: 325
> >> retry 2: 392
> >> min score: 1.0
> >> avg score: 1.0
> >> max score: 1.0
> >> status 1 (db_unfetched): 33072
> >> status 2 (db_fetched): 57146
> >> status 3 (db_gone): 6878
> >> status 4 (db_redir_temp): 2510
> >> status 5 (db_redir_perm): 2509
> >> status 6 (db_notmodified): 281
> >> CrawlDb statistics: done
> >>
> >> After a generate / fetch / parse / update cycle:
> >>
> >> TOTAL urls: 122885
> >> retry 0: 121816
> >> retry 1: 677
> >> retry 2: 392
> >> min score: 1.0
> >> avg score: 1.0
> >> max score: 1.0
> >> status 1 (db_unfetched): 32153
> >> status 2 (db_fetched): 75366
> >> status 3 (db_gone): 9167
> >> status 4 (db_redir_temp): 2979
> >> status 5 (db_redir_perm): 2878
> >> status 6 (db_notmodified): 342
> >> CrawlDb statistics: done
> >>
> >> 2.)
> >> The next thing is related to the first one:
> >>
> >> The generator tells me in the log files:
> >>
> >> 2011-08-16 13:55:55,087 INFO crawl.Generator - Host or domain
> >> cms.uni-kassel.de has more than 3000 URLs for all 1 segments - skipping
> >>
> >> But when the fetcher is running, it fetches many URLs which the
> >> generator told me it had skipped before, like:
> >>
> >> 2011-08-16 13:56:31,119 INFO fetcher.Fetcher - fetching
> >> http://cms.uni-kassel.de/unicms/index.php?id=27436
> >> 2011-08-16 13:56:31,706 INFO fetcher.Fetcher - fetching
> >> http://cms.uni-kassel.de/unicms/index.php?id=24287&L=1
> >>
> >> A second example:
> >>
> >> 2011-08-16 13:55:59,362 INFO crawl.Generator - Host or domain
> >> www.iset.uni-kassel.de has more than 3000 URLs for all 1 segments -
> >> skipping
> >>
> >> 2011-08-16 13:56:30,783 INFO fetcher.Fetcher - fetching
> >> http://www.iset.uni-kassel.de/abt/FB-I/publication/2011-017_Visualizing_and_Optimizing-Paper.pdf
> >> 2011-08-16 13:56:30,813 INFO fetcher.Fetcher - fetching
> >> http://www.iset.uni-kassel.de/abt/FB-A/publication/2010/2010_Degner_Staffelstein.pdf
> >>
> >> Did I do something wrong? I don't get it :)
> >>
> >> Thank you all

-- 
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350
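The per-host/domain cap discussed above is set in conf/nutch-site.xml, not on the command line. A minimal sketch, assuming the standard Nutch 1.x property names generate.max.count and generate.count.mode (the value 3000 matches the thread; "host" is one of the two supported modes):

```xml
<!-- Sketch of a nutch-site.xml fragment; property names assume Nutch 1.x. -->
<property>
  <name>generate.max.count</name>
  <!-- At most this many URLs per host (or domain) go into one segment. -->
  <value>3000</value>
</property>
<property>
  <name>generate.count.mode</name>
  <!-- Count per "host" or per "domain". -->
  <value>host</value>
</property>
```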

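The overall segment size, by contrast, is capped on the command line, and -topN belongs to the generate step (not parse, as the thread clarifies). A hedged sketch of one crawl cycle, with example crawldb/segment paths:

```shell
# Sketch only; crawl/crawldb and crawl/segments are example paths.
# -topN caps the TOTAL number of URLs in the new segment, while
# generate.max.count caps URLs per host/domain within it.
bin/nutch generate crawl/crawldb crawl/segments -topN 3000

# Pick the segment just created (newest directory name sorts last).
SEG=$(ls -d crawl/segments/* | sort | tail -1)

bin/nutch fetch "$SEG"                    # fetch the generated URLs
bin/nutch parse "$SEG"                    # parse takes no -topN flag
bin/nutch updatedb crawl/crawldb "$SEG"   # fold results back into the crawldb
```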
