Hi,

It depends on the expectation ;)
I agree that it may be confusing, but currently the -all option in the various Nutch tools only processes "all with a mark". There is a separate option that processes all rows regardless of whether a mark is present. For the parser this is -reparse; for the indexer, -reindex (at least in the current branch). There is no such option for the fetcher. It is up for discussion whether a "-refetch" option would be useful here. If there were such an option, the generator would no longer serve a purpose.

Ferdy.

On Thu, Aug 2, 2012 at 8:47 PM, <[email protected]> wrote:

> Hi,
>
> I have found out that, what happens after
>
> bin/nutch generate -topN 1000
>
> is that only 1000 of the urls have been marked by gnmrk
>
> Then
> bin/nutch fetch -all
>
> skips all urls that do not have gnmrk
> according to the code
>     Utf8 mark = Mark.GENERATE_MARK.checkMark(page);
>     if (!NutchJob.shouldProcess(mark, batchId)) {
>       if (LOG.isDebugEnabled()) {
>         LOG.debug("Skipping " + TableUtil.unreverseUrl(key) + "; different batch id (" + mark + ")");
>       }
>       return;
>     }
>
> since shouldProcess(mark, batchId) returns false if mark is null.
>
> Then
>
> bin/nutch parse -all
> skips all urls that do not have fetch mark
> according to the code
>     Utf8 mark = Mark.FETCH_MARK.checkMark(page);
>     String unreverseKey = TableUtil.unreverseUrl(key);
>     if (!NutchJob.shouldProcess(mark, batchId)) {
>       LOG.info("Skipping " + unreverseKey + "; different batch id");
>       return;
>     }
>
> this outputs to log as INFO and are those that you see in log file.
>
> So, it seems to me that -all option to fetch, parse and solrindex do not
> work as expected.
>
> Alex.
>
>
> -----Original Message-----
> From: Bai Shen <[email protected]>
> To: user <[email protected]>
> Sent: Thu, Aug 2, 2012 5:59 am
> Subject: Re: Different batch id
>
>
> I just tried running this with the actual batch Id instead of using -all,
> and I'm still getting similar results.
>
> On Mon, Jul 30, 2012 at 1:12 PM, Bai Shen <[email protected]> wrote:
>
> > I set up Nutch 2.x with a new instance of HBase. I ran the following
> > commands.
> >
> > bin/nutch inject urls
> > bin/nutch generate -topN 1000
> > bin/nutch fetch -all
> > bin/nutch parse -all
> >
> > When looking at the parse log, I'm seeing a bunch of "different batch id"
> > messages. These are all on urls that I did not inject into the database.
> >
> > Any ideas what's causing this?
> >
> > Thanks.
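
For reference, the mark check discussed in this thread can be sketched roughly as follows. This is a simplified, self-contained illustration of the behaviour described above (a null mark is always skipped, and -all only matches rows that carry a mark); it is not the actual Nutch source. It uses plain String instead of Utf8, and the class name MarkCheckSketch and the ALL_BATCH_ID sentinel value are assumptions made up for the example.

```java
// Simplified sketch of the mark check; NOT the actual Nutch source.
final class MarkCheckSketch {

    // Hypothetical sentinel standing in for whatever value "-all" maps to.
    static final String ALL_BATCH_ID = "-all";

    // Returns true only if the row carries a mark and that mark either
    // matches the requested batch id or the caller asked for "all".
    static boolean shouldProcess(String mark, String batchId) {
        if (mark == null) {
            // No mark at all: skipped even when "-all" is given.
            return false;
        }
        return ALL_BATCH_ID.equals(batchId) || mark.equals(batchId);
    }

    public static void main(String[] args) {
        // Row never selected by the generator: no mark, so it is skipped.
        System.out.println(shouldProcess(null, ALL_BATCH_ID));      // false
        // Row marked by an earlier step: processed under "-all".
        System.out.println(shouldProcess("batch-1", ALL_BATCH_ID)); // true
        // Row marked with a different batch id: skipped.
        System.out.println(shouldProcess("batch-1", "batch-2"));    // false
    }
}
```

Under these assumptions, the rows that generate -topN 1000 did not select have no generate mark, so fetch -all skips them, and they then have no fetch mark for parse -all either.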

