> > 1) generate.max.count sets a limit on the number of URLs for a single
> > > host or domain - this is different from the overall limit set by the
> > > generate -top parameter.
> > >
> > > 2) the generator only skips the URLs which are beyond the max number
> > > allowed for the host (in your case 3K). This does not mean that ALL
> > > urls for that host are skipped
> > >
> > > Makes sense?
> >
> > Hey Julien, thank you. Yes, your description makes sense to me. So if I
> > want to fetch a list with only 3k urls, I just have to run:
> >
> > ./nutch parse $seg -topN 3000
>
> No, topN applies to the generator.
>

Good catch Markus - I'd read it as generate.
Marek - this has nothing to do with parsing.
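To make the distinction concrete: -topN belongs on the generate step, not on parse. A minimal sketch of the invocation, assuming a standard crawl/crawldb and crawl/segments layout (adjust the paths to your own setup):

```shell
# -topN caps the TOTAL number of URLs selected into the new segment,
# across all hosts. The per-host/per-domain cap comes from
# generate.max.count in nutch-site.xml, not from the command line.
# Paths below are assumptions - adjust to your crawl layout.
./nutch generate crawl/crawldb crawl/segments -topN 3000
```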


>
> >
> > right?
> >
> > But I still don't get this message:
> > 2011-08-16 13:55:55,087 INFO  crawl.Generator - Host or domain
> > cms.uni-kassel.de has more than 3000 URLs for all 1 segments - skipping
> >
> > What is meant by "more than 3000 URLs for all 1 segments"? Does
> > "skipping" mean that it will skip after 3k urls?
>
> If generate.max.count=3000, then all URLs above 3000 for a given
> host/domain are skipped when generating the segment.
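The interplay of the two limits can be illustrated with a small simulation. This is plain Python, not Nutch code; the function name and the URL-order selection are my own simplifications (the real generator selects by score), but it shows why individual URLs over the per-host cap are dropped while the host itself still appears in the segment:

```python
from collections import defaultdict
from urllib.parse import urlparse

def select_urls(urls, max_count, top_n):
    """Illustrative sketch only: keep at most max_count URLs per host
    (like generate.max.count) and at most top_n URLs overall (like -topN)."""
    per_host = defaultdict(int)
    selected = []
    for url in urls:
        if len(selected) >= top_n:        # overall -topN limit reached
            break
        host = urlparse(url).netloc
        if per_host[host] >= max_count:   # host over its cap:
            continue                      # skip this URL, not the whole host
        per_host[host] += 1
        selected.append(url)
    return selected

# a.example offers 5 URLs but max_count=3, so two of them are skipped;
# b.example's URLs still get through.
urls = [f"http://a.example/page{i}" for i in range(5)] + \
       [f"http://b.example/page{i}" for i in range(2)]
print(select_urls(urls, max_count=3, top_n=100))
```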
>
> >
> > But for now you helped to solve my problem. :)
> >
> > > On 16 August 2011 14:16, Marek Bachmann <[email protected]> wrote:
> > >> Hello,
> > >>
> > >> there are two things I don't understand regarding the generator:
> > >>
> > >> 1.) If I set generate.max.count to a value, e.g. 3000, it seems
> > >> that this value is ignored. In every run about 20000 pages are
> > >> fetched.
> > >>
> > >> TOTAL urls: 102396
> > >> retry 0:    101679
> > >> retry 1:    325
> > >> retry 2:    392
> > >> min score:  1.0
> > >> avg score:  1.0
> > >> max score:  1.0
> > >> status 1 (db_unfetched):    33072
> > >> status 2 (db_fetched):      57146
> > >> status 3 (db_gone): 6878
> > >> status 4 (db_redir_temp):   2510
> > >> status 5 (db_redir_perm):   2509
> > >> status 6 (db_notmodified):  281
> > >> CrawlDb statistics: done
> > >>
> > >> After a generate / fetch / parse / update cycle:
> > >>
> > >> TOTAL urls:     122885
> > >> retry 0:        121816
> > >> retry 1:        677
> > >> retry 2:        392
> > >> min score:      1.0
> > >> avg score:      1.0
> > >> max score:      1.0
> > >> status 1 (db_unfetched):        32153
> > >> status 2 (db_fetched):  75366
> > >> status 3 (db_gone):     9167
> > >> status 4 (db_redir_temp):       2979
> > >> status 5 (db_redir_perm):       2878
> > >> status 6 (db_notmodified):      342
> > >> CrawlDb statistics: done
> > >>
> > >> 2.) The next thing is related to the first one:
> > >>
> > >> The generator tells me in the log files:
> > >> 2011-08-16 13:55:55,087 INFO  crawl.Generator - Host or domain
> > >> cms.uni-kassel.de has more than 3000 URLs for all 1 segments - skipping
> > >>
> > >> But when the fetcher is running it fetches many urls which the
> > >> generator told me it had skipped before, like:
> > >>
> > >> 2011-08-16 13:56:31,119 INFO  fetcher.Fetcher - fetching
> > >> http://cms.uni-kassel.de/unicms/index.php?id=27436
> > >> 2011-08-16 13:56:31,706 INFO  fetcher.Fetcher - fetching
> > >> http://cms.uni-kassel.de/unicms/index.php?id=24287&L=1
> > >>
> > >> A second example:
> > >>
> > >> 2011-08-16 13:55:59,362 INFO  crawl.Generator - Host or domain
> > >> www.iset.uni-kassel.de has more than 3000 URLs for all 1 segments -
> > >> skipping
> > >>
> > >> 2011-08-16 13:56:30,783 INFO  fetcher.Fetcher - fetching
> > >> http://www.iset.uni-kassel.de/abt/FB-I/publication/2011-017_Visualizing_and_Optimizing-Paper.pdf
> > >> 2011-08-16 13:56:30,813 INFO  fetcher.Fetcher - fetching
> > >> http://www.iset.uni-kassel.de/abt/FB-A/publication/2010/2010_Degner_Staffelstein.pdf
> > >>
> > >> Did I do something wrong? I don't get it :)
> > >>
> > >> Thank you all
>
> --
> Markus Jelsma - CTO - Openindex
> http://www.linkedin.com/in/markus17
> 050-8536620 / 06-50258350
>



-- 
Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
