Hi,

I found that after running

bin/nutch generate -topN 1000

only 1000 of the URLs are marked with the generate mark (gnmrk).

Then

bin/nutch fetch -all

skips every URL that does not have gnmrk, according to this code:

    Utf8 mark = Mark.GENERATE_MARK.checkMark(page);
    if (!NutchJob.shouldProcess(mark, batchId)) {
      if (LOG.isDebugEnabled()) {
        LOG.debug("Skipping " + TableUtil.unreverseUrl(key)
            + "; different batch id (" + mark + ")");
      }
      return;
    }

since shouldProcess(mark, batchId) returns false if mark is null.
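
To illustrate the check, here is a minimal standalone sketch. It is not the actual Nutch source: it uses String in place of Utf8, the ALL_BATCH_ID sentinel standing in for -all is an assumption, and the batch ids "batch-A" and "batch-B" are made up.

```java
public class BatchMarkSketch {
    // Hypothetical sentinel standing in for the "-all" option (an assumption).
    static final String ALL_BATCH_ID = "-all";

    // Sketch of the behavior described above: a page with no mark is never processed.
    static boolean shouldProcess(String mark, String batchId) {
        if (mark == null) {
            return false; // the previous step never marked this page, so it is skipped
        }
        // Assumption: "-all" accepts any marked page; otherwise the ids must match.
        return ALL_BATCH_ID.equals(batchId) || mark.equals(batchId);
    }

    public static void main(String[] args) {
        // Unmarked page: skipped even with -all.
        System.out.println(shouldProcess(null, ALL_BATCH_ID));
        // Marked page with -all: processed.
        System.out.println(shouldProcess("batch-A", ALL_BATCH_ID));
        // Marked page, but a different batch id was requested: skipped.
        System.out.println(shouldProcess("batch-A", "batch-B"));
    }
}
```

Under these assumptions, a URL that generate did not select has no mark at all, so it is skipped no matter which batch id is passed.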

Then

bin/nutch parse -all

skips every URL that does not have the fetch mark, according to this code:

    Utf8 mark = Mark.FETCH_MARK.checkMark(page);
    String unreverseKey = TableUtil.unreverseUrl(key);
    if (!NutchJob.shouldProcess(mark, batchId)) {
      LOG.info("Skipping " + unreverseKey + "; different batch id");
      return;
    }

These skips are logged at INFO level, and they are the messages you see in the log file.

So it seems to me that the -all option to fetch, parse and solrindex does not
work as expected.

Alex. 



-----Original Message-----
From: Bai Shen <[email protected]>
To: user <[email protected]>
Sent: Thu, Aug 2, 2012 5:59 am
Subject: Re: Different batch id


I just tried running this with the actual batch Id instead of using -all,
and I'm still getting similar results.

On Mon, Jul 30, 2012 at 1:12 PM, Bai Shen <[email protected]> wrote:

> I set up Nutch 2.x with a new instance of HBase.  I ran the following
> commands.
>
> bin/nutch inject urls
> bin/nutch generate -topN 1000
> bin/nutch fetch -all
> bin/nutch parse -all
>
> When looking at the parse log, I'm seeing a bunch of "different batch id"
> messages.  These are all on urls that I did not inject into the database.
>
> Any ideas what's causing this?
>
> Thanks.
>

 
