The crawl script doesn't accept a Batch ID.  To use a Batch ID you have to
run the commands separately, and depth does not come into it.  Depth is
just the number of times to run the generate, fetch, parse, update cycle.
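
Roughly, running the cycle by hand could look like the sketch below.  The
flags are assumptions on my part (in particular -batchId on generate; some
releases instead report the generated batch id in the generate output), so
check the usage message of each command for your version.  NUTCH, DEPTH and
run_cycle are just names made up for this example, and by default the
script only prints the commands instead of running them.

```shell
#!/bin/sh
# Sketch only: the flags are assumptions, verify them against your Nutch 2.x release.
# NUTCH defaults to a dry run that just prints each command line.
NUTCH="${NUTCH:-echo bin/nutch}"
# "Depth" is nothing more than how many times the cycle is repeated.
DEPTH="${DEPTH:-3}"

# Run one generate, fetch, parse, update cycle for a single batch id.
run_cycle() {
    batch_id="$1"
    # -batchId on generate is assumed here; -topN 1000 is an arbitrary example value.
    $NUTCH generate -topN 1000 -batchId "$batch_id"
    # fetch and parse are restricted to the pages marked with this batch id
    $NUTCH fetch "$batch_id"
    $NUTCH parse "$batch_id"
    $NUTCH updatedb
}

i=1
while [ "$i" -le "$DEPTH" ]; do
    run_cycle "batch-$i"
    i=$((i + 1))
done
```

Set NUTCH=bin/nutch to actually run the commands once the flags have been
checked.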

Any unfetched pages will not have a Batch ID.  The Batch ID only applies to
the pages that were generated.  So by default all of the unfetched pages
from the first crawl and the newly injected pages are available to be
generated with Batch ID 2.

Batch ID is useful because it allows you to run the fetch, parse, and index
commands only on the generated URLs instead of the entire database.

Hope that makes sense.


On Wed, Jul 10, 2013 at 3:52 PM, Mariam Salloum <[email protected]> wrote:

> Hi All,
>
>
> I'm using Nutch 2.x along with Hbase and SOLR. I have the following
> question.
>
> (a) Let's say I run a crawl (generate, fetch, parse, update, etc.) with
> Batch ID '1' and set the depth to 3.
> (b) After this, I may still have some pages unfetched and they should be
> marked with Batch ID '1'
>
> (c) I then inject additional URLs
> (d) Run a crawl (generate, fetch, parse, update, etc.) with Batch ID '2'
>
> My question is what pages get assigned this new batch id? Do the pages from
> the previous crawl (unfetched pages) get assigned this new batch id? Or
> only newly injected pages.
>
> I guess I don't fully understand the concept of batch id and how to utilize
> it. I already searched the Nutch site and past posts, but could not find
> clarification on this.
>
> Thank you for your help
>
