Thank you for your response, but it still re-fetches all webpages...
This is the code that I'm using:
status.put(Nutch.STAT_PHASE, "generate " + i);
jobRes = runTool(GeneratorJob.class, args);
if (jobRes != null) {
    subTools.put("generate " + i, jobRes);
}
status.put(Nutch.STAT_PROGRESS, ++phase / totalPhases);
if (shouldStop) {
    return results;
}

status.put(Nutch.STAT_PHASE, "fetch " + i);
jobRes = runTool(FetcherJob.class, args);
if (jobRes != null) {
    subTools.put("fetch " + i, jobRes);
}
status.put(Nutch.STAT_PROGRESS, ++phase / totalPhases);
if (shouldStop) {
    return results;
}

if (!parse) {
    status.put(Nutch.STAT_PHASE, "parse " + i);
    jobRes = runTool(ParserJob.class, args);
    if (jobRes != null) {
        subTools.put("parse " + i, jobRes);
    }
    status.put(Nutch.STAT_PROGRESS, ++phase / totalPhases);
    if (shouldStop) {
        return results;
    }
}

status.put(Nutch.STAT_PHASE, "updatedb " + i);
jobRes = runTool(DbUpdaterJob.class, args);
if (jobRes != null) {
    subTools.put("updatedb " + i, jobRes);
}
status.put(Nutch.STAT_PROGRESS, ++phase / totalPhases);
if (shouldStop) {
    return results;
}
And these are the results I obtain each time:
{jobs={generate 0={jobs={generate: 1353001293-689397482={jobID=null,
jobName=generate: 1353001293-689397482, counters={Map-Reduce
Framework={REDUCE_INPUT_GROUPS=0, COMBINE_OUTPUT_RECORDS=0,
MAP_INPUT_RECORDS=28, REDUCE_SHUFFLE_BYTES=0, REDUCE_OUTPUT_RECORDS=0,
SPILLED_RECORDS=0, MAP_OUTPUT_BYTES=0, MAP_OUTPUT_RECORDS=0,
COMBINE_INPUT_RECORDS=0, REDUCE_INPUT_RECORDS=0},
FileSystemCounters={FILE_BYTES_READ=49432, FILE_BYTES_WRITTEN=102464}}}},
generate.batch.id=1353001293-689397482}, fetch 0={jobs={fetch={jobID=null,
jobName=fetch, counters={Map-Reduce Framework={REDUCE_INPUT_GROUPS=28,
COMBINE_OUTPUT_RECORDS=0, MAP_INPUT_RECORDS=28, REDUCE_SHUFFLE_BYTES=0,
REDUCE_OUTPUT_RECORDS=28, SPILLED_RECORDS=56, MAP_OUTPUT_BYTES=14032,
MAP_OUTPUT_RECORDS=28, COMBINE_INPUT_RECORDS=0, REDUCE_INPUT_RECORDS=28},
FileSystemCounters={FILE_BYTES_READ=113078, FILE_BYTES_WRITTEN=233408},
FetcherStatus={HitByTimeLimit-QueueFeeder=0, SUCCESS=28}}}}}, parse
0={jobs={parse={jobID=null, jobName=parse,
counters={ParserStatus={success=28}, Map-Reduce
Framework={MAP_INPUT_RECORDS=28, SPILLED_RECORDS=0, MAP_OUTPUT_RECORDS=28},
FileSystemCounters={FILE_BYTES_READ=88124, FILE_BYTES_WRITTEN=167612}}}}}}}
I have put a hit counter on the pages of my domain, and Nutch is fetching every
page each time I run it.
I do not pass a batch_id to the fetcher, and I do not pass the "resume" parameter.
Do you have any idea?
I'm working with nutch 2.1
Thanks!!
--
View this message in context:
http://lucene.472066.n3.nabble.com/re-Crawl-re-fetch-all-pages-each-time-tp4020464p4020564.html
Sent from the Nutch - User mailing list archive at Nabble.com.