Hi,
I have found that after running
bin/nutch generate -topN 1000
only 1000 of the URLs are marked with gnmrk.
Then
bin/nutch fetch -all
skips every URL that does not have gnmrk,
according to this code:
Utf8 mark = Mark.GENERATE_MARK.checkMark(page);
if (!NutchJob.shouldProcess(mark, batchId)) {
  if (LOG.isDebugEnabled()) {
    LOG.debug("Skipping " + TableUtil.unreverseUrl(key)
        + "; different batch id (" + mark + ")");
  }
  return;
}
since shouldProcess(mark, batchId) returns false if mark is null.
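
To make the null case concrete, here is a minimal sketch of how such a check could behave. It is not the actual NutchJob source: the class name, the use of String instead of Utf8, and the ALL_BATCH_ID constant are assumptions made for illustration. The only behaviour taken from the code above is that a null mark is never processed; treating "-all" as matching any non-null mark is my assumption about the intent.

```java
// Illustrative sketch only; not the real NutchJob implementation.
final class BatchFilterSketch {
    // Assumed sentinel value for the -all option (hypothetical constant).
    static final String ALL_BATCH_ID = "-all";

    static boolean shouldProcess(String mark, String batchId) {
        if (mark == null) {
            // A page that was never marked is skipped, even with -all.
            return false;
        }
        // Assumption: -all accepts any marked page; otherwise ids must match.
        return ALL_BATCH_ID.equals(batchId) || mark.equals(batchId);
    }

    public static void main(String[] args) {
        // "batch-1" and "batch-2" are made-up batch ids.
        System.out.println(shouldProcess(null, ALL_BATCH_ID));      // false
        System.out.println(shouldProcess("batch-1", ALL_BATCH_ID)); // true
        System.out.println(shouldProcess("batch-1", "batch-1"));    // true
        System.out.println(shouldProcess("batch-1", "batch-2"));    // false
    }
}
```

Under these assumptions, -all only widens the match among pages that already carry a mark; it never picks up unmarked pages.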
Then
bin/nutch parse -all
skips every URL that does not have the fetch mark,
according to this code:
Utf8 mark = Mark.FETCH_MARK.checkMark(page);
String unreverseKey = TableUtil.unreverseUrl(key);
if (!NutchJob.shouldProcess(mark, batchId)) {
  LOG.info("Skipping " + unreverseKey + "; different batch id");
  return;
}
This message is logged at INFO level (the fetch one is only DEBUG), so these are the "different batch id" lines that you see in the parse log.
So it seems to me that the -all option to fetch, parse and solrindex does not work
as expected.
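
The whole flow can be modelled with a toy example. This is a rough sketch, not Nutch code: the Page class, the method names and the example URLs are all hypothetical, generate here simply marks the first topN rows (the real job selects by score), and the fetch step is assumed to copy the batch id into the fetch mark.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Toy model of generate -> fetch -> parse; none of this is Nutch code.
final class MarkFlowSketch {
    static final String ALL = "-all"; // assumed sentinel for the -all option

    // Hypothetical stand-in for a stored row: only the two marks.
    static final class Page {
        String generateMark; // null until generate selects the page
        String fetchMark;    // null until fetch processes the page
    }

    static boolean shouldProcess(String mark, String batchId) {
        return mark != null && (ALL.equals(batchId) || mark.equals(batchId));
    }

    // Marks at most topN pages, in insertion order, with the batch id.
    static void generate(Map<String, Page> table, int topN, String batchId) {
        int marked = 0;
        for (Page p : table.values()) {
            if (marked == topN) break;
            p.generateMark = batchId;
            marked++;
        }
    }

    // Returns the fetched URLs; pages without a generate mark are skipped.
    static List<String> fetch(Map<String, Page> table, String batchId) {
        List<String> fetched = new ArrayList<>();
        for (Map.Entry<String, Page> e : table.entrySet()) {
            Page p = e.getValue();
            if (!shouldProcess(p.generateMark, batchId)) continue;
            p.fetchMark = p.generateMark;
            fetched.add(e.getKey());
        }
        return fetched;
    }

    // Returns the URLs parse would skip, i.e. those without a fetch mark.
    static List<String> parseSkipped(Map<String, Page> table, String batchId) {
        List<String> skipped = new ArrayList<>();
        for (Map.Entry<String, Page> e : table.entrySet()) {
            if (!shouldProcess(e.getValue().fetchMark, batchId)) {
                skipped.add(e.getKey());
            }
        }
        return skipped;
    }

    public static void main(String[] args) {
        Map<String, Page> table = new LinkedHashMap<>();
        for (int i = 0; i < 5; i++) {
            table.put("http://example.com/" + i, new Page());
        }
        generate(table, 2, "batch-1"); // like generate -topN 2
        int fetched = fetch(table, ALL).size();
        int skipped = parseSkipped(table, ALL).size();
        System.out.println("fetched=" + fetched + " skippedByParse=" + skipped);
    }
}
```

With five rows and topN 2, this model fetches two pages and parse reports the other three as skipped, which matches the pattern of "different batch id" lines in the parse log.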
Alex.
-----Original Message-----
From: Bai Shen <[email protected]>
To: user <[email protected]>
Sent: Thu, Aug 2, 2012 5:59 am
Subject: Re: Different batch id
I just tried running this with the actual batch Id instead of using -all,
and I'm still getting similar results.
On Mon, Jul 30, 2012 at 1:12 PM, Bai Shen <[email protected]> wrote:
> I set up Nutch 2.x with a new instance of HBase. I ran the following
> commands.
>
> bin/nutch inject urls
> bin/nutch generate -topN 1000
> bin/nutch fetch -all
> bin/nutch parse -all
>
> When looking at the parse log, I'm seeing a bunch of "different batch id"
> messages. These are all on urls that I did not inject into the database.
>
> Any ideas what's causing this?
>
> Thanks.
>