Thank you for your response, but it still re-fetches all webpages...
This is the code that I'm using:
status.put(Nutch.STAT_PHASE, "generate " + i);
jobRes = runTool(GeneratorJob.class, args);
if (jobRes != null) {
    subTools.put("generate " + i, jobRes);
}
status.put(Nutch.STAT_PROGRESS, ++phase / totalPhases);
if (shouldStop) {
    return results;
}

status.put(Nutch.STAT_PHASE, "fetch " + i);
jobRes = runTool(FetcherJob.class, args);
if (jobRes != null) {
    subTools.put("fetch " + i, jobRes);
}
status.put(Nutch.STAT_PROGRESS, ++phase / totalPhases);
if (shouldStop) {
    return results;
}

if (!parse) {
    status.put(Nutch.STAT_PHASE, "parse " + i);
    jobRes = runTool(ParserJob.class, args);
    if (jobRes != null) {
        subTools.put("parse " + i, jobRes);
    }
    status.put(Nutch.STAT_PROGRESS, ++phase / totalPhases);
    if (shouldStop) {
        return results;
    }
}

status.put(Nutch.STAT_PHASE, "updatedb " + i);
jobRes = runTool(DbUpdaterJob.class, args);
if (jobRes != null) {
    subTools.put("updatedb " + i, jobRes);
}
status.put(Nutch.STAT_PROGRESS, ++phase / totalPhases);
if (shouldStop) {
    return results;
}
And these are the results I obtain each time:
{jobs={generate 0={jobs={generate: 1353001293-689397482={jobID=null,
jobName=generate: 1353001293-689397482, counters={Map-Reduce
Framework={REDUCE_INPUT_GROUPS=0, COMBINE_OUTPUT_RECORDS=0,
MAP_INPUT_RECORDS=28, REDUCE_SHUFFLE_BYTES=0, REDUCE_OUTPUT_RECORDS=0,
SPILLED_RECORDS=0, MAP_OUTPUT_BYTES=0, MAP_OUTPUT_RECORDS=0,
COMBINE_INPUT_RECORDS=0, REDUCE_INPUT_RECORDS=0},
FileSystemCounters={FILE_BYTES_READ=49432, FILE_BYTES_WRITTEN=102464}}}},
generate.batch.id=1353001293-689397482}, fetch 0={jobs={fetch={jobID=null,
jobName=fetch, counters={Map-Reduce Framework={REDUCE_INPUT_GROUPS=28,
COMBINE_OUTPUT_RECORDS=0, MAP_INPUT_RECORDS=28, REDUCE_SHUFFLE_BYTES=0,
REDUCE_OUTPUT_RECORDS=28, SPILLED_RECORDS=56, MAP_OUTPUT_BYTES=14032,
MAP_OUTPUT_RECORDS=28, COMBINE_INPUT_RECORDS=0, REDUCE_INPUT_RECORDS=28},
FileSystemCounters={FILE_BYTES_READ=113078, FILE_BYTES_WRITTEN=233408},
FetcherStatus={HitByTimeLimit-QueueFeeder=0, SUCCESS=28}}}}}, parse
0={jobs={parse={jobID=null, jobName=parse,
counters={ParserStatus={success=28}, Map-Reduce
Framework={MAP_INPUT_RECORDS=28, SPILLED_RECORDS=0, MAP_OUTPUT_RECORDS=28},
FileSystemCounters={FILE_BYTES_READ=88124, FILE_BYTES_WRITTEN=167612}}}}}}}
I have put a hit counter on the pages of my domain, and Nutch is fetching every
page each time I run it.
I do not pass a batch_id to the fetcher, and I do not pass the "resume" parameter.
Do you have any idea?
I'm working with nutch 2.1
Thanks!!
--
View this message in context:
http://lucene.472066.n3.nabble.com/re-Crawl-re-fetch-all-pages-each-time-tp4020464p4020564.html
Sent from the Nutch - User mailing list archive at Nabble.com.