On 16.08.2011 16:27, Julien Nioche wrote:
1) generate.max.count sets a limit on the number of URLs for a single host or domain - this is different from the overall limit set by the generate -topN parameter (see the config sketch below).
2) the generator only skips the URLs which are beyond the max number allowed for the host (in your case 3K). This does not mean that ALL URLs for that host are skipped.
Makes sense?
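
For reference, a minimal sketch of how the per-host limit would be configured in conf/nutch-site.xml; generate.max.count and generate.count.mode are the property names from nutch-default.xml, and counting per host (rather than per domain) is assumed here:

    <property>
      <name>generate.max.count</name>
      <value>3000</value>
    </property>
    <property>
      <name>generate.count.mode</name>
      <value>host</value>
    </property>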
Hey Julien, thank you. Yes, your description makes sense to me. So if I
want to fetch a list with only 3k URLs, I just have to run:
./nutch parse $seg -topN 3000
right?

No, topN applies to the generator.

Good catch Markus - I'd read generate.
Marek - this has nothing to do with the parsing.

Yeah, right, I meant generate. My fault. :-)
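
For clarity, the corrected call puts topN on the generate step; a minimal sketch, assuming the standard crawl/crawldb and crawl/segments layout:

    ./nutch generate crawl/crawldb crawl/segments -topN 3000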
But I still don't get this message:

2011-08-16 13:55:55,087 INFO crawl.Generator - Host or domain cms.uni-kassel.de has more than 3000 URLs for all 1 segments - skipping

What is meant by "more than 3000 URLs for all 1 segments"? Does "skipping" mean that it skips everything after the first 3k URLs?
If generate.max.count=3000, then all URLs above 3000 for a given host/domain are skipped when generating the segment. The first 3000 URLs for that host still end up in the segment and get fetched; only the surplus stays behind in the crawldb for a later cycle.
But for now you helped to solve my problem. :)
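
One way to verify what actually went into each segment is the segment reader; a sketch, assuming the default crawl/segments layout:

    ./nutch readseg -list -dir crawl/segments

This prints, per segment, how many URLs were generated, fetched and parsed.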
On 16 August 2011 14:16, Marek Bachmann <[email protected]> wrote:
Hello,
there are two things I don't understand regarding the generator:
1.) If I set generate.max.count to a value, e.g. 3000, it seems to be ignored. In every run about 20000 pages are fetched. CrawlDb statistics before the cycle:
TOTAL urls: 102396
retry 0: 101679
retry 1: 325
retry 2: 392
min score: 1.0
avg score: 1.0
max score: 1.0
status 1 (db_unfetched): 33072
status 2 (db_fetched): 57146
status 3 (db_gone): 6878
status 4 (db_redir_temp): 2510
status 5 (db_redir_perm): 2509
status 6 (db_notmodified): 281
CrawlDb statistics: done
After a generate / fetch / parse / update cycle:
TOTAL urls: 122885
retry 0: 121816
retry 1: 677
retry 2: 392
min score: 1.0
avg score: 1.0
max score: 1.0
status 1 (db_unfetched): 32153
status 2 (db_fetched): 75366
status 3 (db_gone): 9167
status 4 (db_redir_temp): 2979
status 5 (db_redir_perm): 2878
status 6 (db_notmodified): 342
CrawlDb statistics: done
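
For reference, the generate / fetch / parse / update cycle referred to above is roughly the following sketch; crawl/crawldb and crawl/segments are assumed paths, and the statistics come from the readdb tool:

    ./nutch generate crawl/crawldb crawl/segments -topN 3000
    seg=$(ls -d crawl/segments/* | tail -1)
    ./nutch fetch $seg
    ./nutch parse $seg
    ./nutch updatedb crawl/crawldb $seg
    ./nutch readdb crawl/crawldb -stats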
2.) The next thing is related to the first one. The generator tells me in the log files:

2011-08-16 13:55:55,087 INFO crawl.Generator - Host or domain cms.uni-kassel.de has more than 3000 URLs for all 1 segments - skipping

But when the fetcher runs, it fetches many URLs from that host which the generator told me it had skipped, like:
2011-08-16 13:56:31,119 INFO fetcher.Fetcher - fetching http://cms.uni-kassel.de/unicms/index.php?id=27436
2011-08-16 13:56:31,706 INFO fetcher.Fetcher - fetching http://cms.uni-kassel.de/unicms/index.php?id=24287&L=1
A second example:
2011-08-16 13:55:59,362 INFO crawl.Generator - Host or domain www.iset.uni-kassel.de has more than 3000 URLs for all 1 segments - skipping

2011-08-16 13:56:30,783 INFO fetcher.Fetcher - fetching http://www.iset.uni-kassel.de/abt/FB-I/publication/2011-017_Visualizing_and_Optimizing-Paper.pdf
2011-08-16 13:56:30,813 INFO fetcher.Fetcher - fetching http://www.iset.uni-kassel.de/abt/FB-A/publication/2010/2010_Degner_Staffelstein.pdf
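
A quick way to count how many URLs of a given host were actually fetched in a run, assuming the default log location logs/hadoop.log, is a simple grep over the fetcher output:

    grep -c "fetching http://www.iset.uni-kassel.de" logs/hadoop.log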
Did I do something wrong? I don't get it :)
Thank you all