On 16.08.2011 16:27, Julien Nioche wrote:
1) generate.max.count sets a limit on the number of URLs for a single host or domain - this is different from the overall limit set by the generate -topN parameter (see the config sketch below).
2) the generator only skips the URLs which are beyond the max number allowed for the host (in your case 3K). This does not mean that ALL URLs for that host are skipped.
Makes sense?
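
For reference, a minimal sketch of how the per-host limit would be configured in conf/nutch-site.xml; generate.max.count and generate.count.mode are the property names from nutch-default.xml, and counting per host (rather than per domain) is assumed here:

    <property>
      <name>generate.max.count</name>
      <value>3000</value>
    </property>
    <property>
      <name>generate.count.mode</name>
      <value>host</value>
    </property>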
Hey Julien, thank you. Yes, your description makes sense to me. So if I
want to fetch a list with only 3k URLs, I just have to run:
./nutch parse $seg -topN 3000
right?

No, topN applies to the generator.

Good catch Markus - I'd read generate.
Marek - this has nothing to do with the parsing.

Yeah, right, I meant generate. My fault. :-)
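
For clarity, the corrected call puts topN on the generate step; a minimal sketch, assuming the standard crawl/crawldb and crawl/segments layout:

    ./nutch generate crawl/crawldb crawl/segments -topN 3000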
But I still don't get this message:

2011-08-16 13:55:55,087 INFO crawl.Generator - Host or domain cms.uni-kassel.de has more than 3000 URLs for all 1 segments - skipping

What is meant by "more than 3000 URLs for all 1 segments"? Does "skipping" mean that it skips everything after the first 3k URLs?
If generate.max.count=3000, then all URLs above 3000 for a given host/domain are skipped when generating the segment. The first 3000 URLs for that host still end up in the segment and get fetched; only the surplus stays behind in the crawldb for a later cycle.
But for now you helped to solve my problem. :)
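
One way to verify what actually went into each segment is the segment reader; a sketch, assuming the default crawl/segments layout:

    ./nutch readseg -list -dir crawl/segments

This prints, per segment, how many URLs were generated, fetched and parsed.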
On 16 August 2011 14:16, Marek Bachmann <[email protected]> wrote:
Hello,
there are two things I don't understand regarding the generator:
1.) If I set generate.max.count to a value, e.g. 3000, it seems to be ignored. In every run about 20000 pages are fetched. CrawlDb statistics before the cycle:
TOTAL urls: 102396
retry 0: 101679
retry 1: 325
retry 2: 392
min score: 1.0
avg score: 1.0
max score: 1.0
status 1 (db_unfetched): 33072
status 2 (db_fetched): 57146
status 3 (db_gone): 6878
status 4 (db_redir_temp): 2510
status 5 (db_redir_perm): 2509
status 6 (db_notmodified): 281
CrawlDb statistics: done
After a generate / fetch / parse / update cycle:
TOTAL urls: 122885
retry 0: 121816
retry 1: 677
retry 2: 392
min score: 1.0
avg score: 1.0
max score: 1.0
status 1 (db_unfetched): 32153
status 2 (db_fetched): 75366
status 3 (db_gone): 9167
status 4 (db_redir_temp): 2979
status 5 (db_redir_perm): 2878
status 6 (db_notmodified): 342
CrawlDb statistics: done
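
For reference, the generate / fetch / parse / update cycle referred to above is roughly the following sketch; crawl/crawldb and crawl/segments are assumed paths, and the statistics come from the readdb tool:

    ./nutch generate crawl/crawldb crawl/segments -topN 3000
    seg=$(ls -d crawl/segments/* | tail -1)
    ./nutch fetch $seg
    ./nutch parse $seg
    ./nutch updatedb crawl/crawldb $seg
    ./nutch readdb crawl/crawldb -stats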
2.) The next thing is related to the first one. The generator tells me in the log files:

2011-08-16 13:55:55,087 INFO crawl.Generator - Host or domain cms.uni-kassel.de has more than 3000 URLs for all 1 segments - skipping

But when the fetcher runs, it fetches many URLs from that host which the generator told me it had skipped, like:
2011-08-16 13:56:31,119 INFO fetcher.Fetcher - fetching http://cms.uni-kassel.de/unicms/index.php?id=27436
2011-08-16 13:56:31,706 INFO fetcher.Fetcher - fetching http://cms.uni-kassel.de/unicms/index.php?id=24287&L=1
A second example:
2011-08-16 13:55:59,362 INFO crawl.Generator - Host or domain www.iset.uni-kassel.de has more than 3000 URLs for all 1 segments - skipping

2011-08-16 13:56:30,783 INFO fetcher.Fetcher - fetching http://www.iset.uni-kassel.de/abt/FB-I/publication/2011-017_Visualizing_and_Optimizing-Paper.pdf
2011-08-16 13:56:30,813 INFO fetcher.Fetcher - fetching http://www.iset.uni-kassel.de/abt/FB-A/publication/2010/2010_Degner_Staffelstein.pdf
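
A quick way to count how many URLs of a given host were actually fetched in a run, assuming the default log location logs/hadoop.log, is a simple grep over the fetcher output:

    grep -c "fetching http://www.iset.uni-kassel.de" logs/hadoop.log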
Did I do something wrong? I don't get it :)
Thank you all