Hi,

It depends on the expectation ;)

I agree that it may be confusing, but currently the -all option in the
various Nutch tools only processes "all with a mark". There is a separate
option that processes "all, regardless of whether a mark is present or not".
For the parser this is -reparse; for the indexer, -reindex (at least in the
current branch). There is no such option for the fetcher. It is up for
discussion whether a "-refetch" option would be useful here. If there were
such an option, the generator would no longer serve a purpose.
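
To make the distinction concrete, here is a minimal sketch of the check as
described in this thread. It is not the actual Nutch source: the class name,
the plain String types, the ALL_BATCH_ID value and the ignoreMark flag
(standing in for -reparse / -reindex) are all assumptions made for
illustration.

```java
// Illustrative sketch only; NOT the real NutchJob implementation.
class MarkCheckSketch {
    // Assumed placeholder for whatever value the tools use internally for -all.
    static final String ALL_BATCH_ID = "-all";

    /**
     * @param mark       mark stored on the row by the previous step, or null
     * @param batchId    batch id passed to the tool (or ALL_BATCH_ID)
     * @param ignoreMark hypothetical flag modelling -reparse / -reindex
     */
    static boolean shouldProcess(String mark, String batchId, boolean ignoreMark) {
        if (ignoreMark) {
            return true;               // process everything, marked or not
        }
        if (mark == null) {
            return false;              // no mark: skipped, even with -all
        }
        // With -all any marked row qualifies; otherwise the ids must match.
        return ALL_BATCH_ID.equals(batchId) || mark.equals(batchId);
    }

    public static void main(String[] args) {
        System.out.println(shouldProcess(null, ALL_BATCH_ID, false));   // unmarked row, -all
        System.out.println(shouldProcess("b1", ALL_BATCH_ID, false));   // marked row, -all
        System.out.println(shouldProcess("b1", "b2", false));           // different batch id
        System.out.println(shouldProcess(null, ALL_BATCH_ID, true));    // forced reprocessing
    }
}
```

Under these assumptions, a row that was never selected by the generator has
no mark, so it is skipped by -all and reported as belonging to a different
batch.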

Ferdy.

On Thu, Aug 2, 2012 at 8:47 PM, <[email protected]> wrote:

> Hi,
>
> I have found out that, what happens after
>
> bin/nutch generate -topN 1000
>
> is that only 1000 of the urls have been marked by gnmrk
>
> Then
> bin/nutch fetch -all
>
> skips all urls that do not have gnmrk
> according to the code
> Utf8 mark = Mark.GENERATE_MARK.checkMark(page);
> if (!NutchJob.shouldProcess(mark, batchId)) {
>   if (LOG.isDebugEnabled()) {
>     LOG.debug("Skipping " + TableUtil.unreverseUrl(key)
>         + "; different batch id (" + mark + ")");
>   }
>   return;
> }
>
> since shouldProcess(mark, batchId) returns false if mark is null.
>
> Then
>
> bin/nutch parse -all
> skips all urls that do not have fetch mark
> according to the code
> Utf8 mark = Mark.FETCH_MARK.checkMark(page);
> String unreverseKey = TableUtil.unreverseUrl(key);
> if (!NutchJob.shouldProcess(mark, batchId)) {
>   LOG.info("Skipping " + unreverseKey + "; different batch id");
>   return;
> }
>
> This is logged at INFO level, and these are the messages that you see in
> the log file.
>
> So, it seems to me that the -all option to fetch, parse and solrindex does
> not work as expected.
>
> Alex.
>
>
>
> -----Original Message-----
> From: Bai Shen <[email protected]>
> To: user <[email protected]>
> Sent: Thu, Aug 2, 2012 5:59 am
> Subject: Re: Different batch id
>
>
> I just tried running this with the actual batch Id instead of using -all,
> and I'm still getting similar results.
>
> On Mon, Jul 30, 2012 at 1:12 PM, Bai Shen <[email protected]> wrote:
>
> > I set up Nutch 2.x with a new instance of HBase.  I ran the following
> > commands.
> >
> > bin/nutch inject urls
> > bin/nutch generate -topN 1000
> > bin/nutch fetch -all
> > bin/nutch parse -all
> >
> > When looking at the parse log, I'm seeing a bunch of "different batch id"
> > messages.  These are all on urls that I did not inject into the database.
> >
> > Any ideas what's causing this?
> >
> > Thanks.
> >
>
>
>
