On Tuesday 16 August 2011 16:17:26 Marek Bachmann wrote:
> On 16.08.2011 15:53, Julien Nioche wrote:
> > 1) generate.max.count sets a limit on the number of URLs for a single
> > host or domain - this is different from the overall limit set by the
> > generate -top parameter.
> > 
> > 2) the generator only skips the URLs which are beyond the max number
> > allowed for the host (in your case 3K). This does not mean that ALL URLs
> > for that host are skipped.
> > 
> > Makes sense?
> 
> Hey Julien, thank you. Yes, your description makes sense to me. So if I
> want to fetch only 3k URLs, I just have to run:
> 
> ./nutch parse $seg -topN 3000

No, -topN applies to the generator, not the parser.
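For example, something like this (a sketch; the crawldb and segment paths are
placeholders for your own layout):

  ./nutch generate crawl/crawldb crawl/segments -topN 3000
  # $seg = the segment directory that generate just created
  ./nutch fetch $seg
  ./nutch parse $seg
  ./nutch updatedb crawl/crawldb $seg

-topN caps the total number of URLs put into the new segment, while
generate.max.count caps the number of URLs per host/domain within it.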

> 
> right?
> 
> But I still don't get this message:
> 2011-08-16 13:55:55,087 INFO  crawl.Generator - Host or domain
> cms.uni-kassel.de has more than 3000 URLs for all 1 segments - skipping
> 
> What is meant by "more than 3000 URLs for all 1 segments"? Does "skipping"
> then mean that it will skip everything after 3k URLs?

With generate.max.count=3000, all URLs above 3000 for a given host/domain are
skipped when generating the segment.
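Note that generate.max.count is a configuration property, not a command-line
option; assuming a standard setup it goes into conf/nutch-site.xml, roughly:

  <property>
    <name>generate.max.count</name>
    <value>3000</value>
  </property>
  <property>
    <name>generate.count.mode</name>
    <value>host</value>
  </property>

If I remember correctly, generate.count.mode controls whether that limit is
counted per host or per domain.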

> 
> But for now you helped to solve my problem. :)
> 
> > On 16 August 2011 14:16, Marek Bachmann <[email protected]> wrote:
> >> Hello,
> >> 
> >> there are two things I don't understand regarding the generator:
> >> 
> >> 1.) If I set generate.max.count to a value, e.g. 3000, it seems that
> >> this value is ignored. In every run about 20000 pages are fetched.
> >> 
> >> TOTAL urls: 102396
> >> retry 0:    101679
> >> retry 1:    325
> >> retry 2:    392
> >> min score:  1.0
> >> avg score:  1.0
> >> max score:  1.0
> >> status 1 (db_unfetched):    33072
> >> status 2 (db_fetched):      57146
> >> status 3 (db_gone): 6878
> >> status 4 (db_redir_temp):   2510
> >> status 5 (db_redir_perm):   2509
> >> status 6 (db_notmodified):  281
> >> CrawlDb statistics: done
> >> 
> >> After a generate / fetch / parse / update cycle:
> >> 
> >> TOTAL urls:     122885
> >> retry 0:        121816
> >> retry 1:        677
> >> retry 2:        392
> >> min score:      1.0
> >> avg score:      1.0
> >> max score:      1.0
> >> status 1 (db_unfetched):        32153
> >> status 2 (db_fetched):  75366
> >> status 3 (db_gone):     9167
> >> status 4 (db_redir_temp):       2979
> >> status 5 (db_redir_perm):       2878
> >> status 6 (db_notmodified):      342
> >> CrawlDb statistics: done
> >> 
> >> 2.) The next thing is related to the first one:
> >> 
> >> The generator tells me in the log files:
> >> 2011-08-16 13:55:55,087 INFO  crawl.Generator - Host or domain
> >> cms.uni-kassel.de has more than 3000 URLs for all 1 segments - skipping
> >> 
> >> But when the fetcher is running, it fetches many URLs which the generator
> >> told me it had skipped before, like:
> >> 
> >> 2011-08-16 13:56:31,119 INFO  fetcher.Fetcher - fetching
> >> http://cms.uni-kassel.de/unicms/index.php?id=27436
> >> 2011-08-16 13:56:31,706 INFO  fetcher.Fetcher - fetching
> >> http://cms.uni-kassel.de/unicms/index.php?id=24287&L=1
> >> 
> >> A second example:
> >> 
> >> 2011-08-16 13:55:59,362 INFO  crawl.Generator - Host or domain
> >> www.iset.uni-kassel.de has more than 3000 URLs for all 1 segments -
> >> skipping
> >> 
> >> 2011-08-16 13:56:30,783 INFO  fetcher.Fetcher - fetching
> >> http://www.iset.uni-kassel.de/abt/FB-I/publication/2011-017_Visualizing_and_Optimizing-Paper.pdf
> >> 2011-08-16 13:56:30,813 INFO  fetcher.Fetcher - fetching
> >> http://www.iset.uni-kassel.de/abt/FB-A/publication/2010/2010_Degner_Staffelstein.pdf
> >> 
> >> Did I do something wrong? I don't get it :)
> >> 
> >> Thank you all

-- 
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350
