Thanks, Bai, for your explanation; it makes a lot of sense.
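To check that I've understood, here is how I now picture it. This is only a toy model in plain Ruby, not Nutch code: the `Page` struct and the `generate` and `fetch` helpers are names I made up, and I'm assuming generate simply marks every unfetched page with the new Batch ID.

```ruby
# Toy model only; this is not Nutch code and all names are hypothetical.
Page = Struct.new(:url, :status, :batch_id)

# "generate": mark every unfetched page with the given batch ID.
def generate(pages, batch_id)
  pages.select { |p| p.status == :unfetched }
       .each { |p| p.batch_id = batch_id }
end

# "fetch": only touch pages that carry the given batch ID.
def fetch(pages, batch_id)
  pages.select { |p| p.batch_id == batch_id }
       .each { |p| p.status = :fetched }
end

pages = [
  Page.new('http://a.example/', :unfetched, nil), # left over from an earlier crawl
  Page.new('http://b.example/', :unfetched, nil), # newly injected
  Page.new('http://c.example/', :fetched, '1')    # fetched in an earlier batch
]

generate(pages, '2')
fetch(pages, '2')
pages.each { |p| puts "#{p.url} status=#{p.status} batch=#{p.batch_id}" }
```

If that is right, both the leftover and the newly injected pages end up in batch '2', and the page fetched earlier keeps batch '1'.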
I had another question. I see you had posted a question on how to query all
unfetched pages from HBase. Were you able to get the query below to work?
<<I'm trying to check hbase for urls that have unfetched status but my
query isn't working correctly. No matter what I don't get a match.
scan 'webpage', {COLUMNS=>['f:bas', 'f:st'],
FILTER=>SingleColumnValueFilter.new(Bytes.toBytes('f'),
Bytes.toBytes('st'), CompareFilter::CompareOp.valueOf('EQUAL'),
Bytes.toBytes('1'))} >>
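One guess on my side, which I have not verified: if the status in 'f:st' is stored as a 4-byte integer rather than as text (that is an assumption about the storage format), then comparing against the string '1' could never match. The plain Ruby sketch below does not touch HBase; `string_bytes` and `int_bytes` are hypothetical helpers that just show how different the two encodings are.

```ruby
# Illustration only; does not talk to HBase.
# Assumption: the status is stored as a 4-byte big-endian integer.

# Bytes produced by the text form of a value (one ASCII byte for '1').
def string_bytes(value)
  value.to_s.bytes
end

# Bytes produced by a 4-byte big-endian integer encoding.
def int_bytes(value)
  [value].pack('N').bytes
end

string_form = string_bytes(1)
int_form = int_bytes(1)

puts "string '1' -> #{string_form.inspect}"
puts "int 1      -> #{int_form.inspect}"
puts "equal?     -> #{string_form == int_form}"
```

If the assumption holds, the comparison value in the filter would have to be built from an integer instead of a string.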
Thanks a lot for your help
Mariam
On Thu, Jul 11, 2013 at 4:16 AM, Bai Shen <[email protected]> wrote:
> The crawl script doesn't accept Batch ID. So in order to use Batch ID you
> would run the commands separately which would not involve depth. Depth is
> just the number of times to run the generate, fetch, parse, update cycle.
>
> Any unfetched pages will not have a Batch ID. The Batch ID only applies to
> the pages that were generated. By default all of the unfetched and
> injected pages are available to be generated with Batch ID 2.
>
> Batch ID is useful because it allows you to run fetch, parse, and index
> commands only on the generated urls instead of the entire database.
>
> Hope that makes sense.
>
>
> On Wed, Jul 10, 2013 at 3:52 PM, Mariam Salloum <[email protected]> wrote:
>
> > Hi All,
> >
> >
> > I'm using Nutch 2.x along with HBase and Solr. I have the following
> > question.
> >
> > (a) Let's say I run a crawl (generate, fetch, parse, update, etc.) with
> > Batch ID '1' and set the depth to 3.
> > (b) After this, I may still have some pages unfetched and they should be
> > marked with Batch ID '1'.
> >
> > (c) I then inject additional URLs.
> > (d) Run a crawl (generate, fetch, parse, update, etc.) with Batch ID '2'.
> >
> > My question is: which pages get assigned this new batch ID? Do the
> > unfetched pages from the previous crawl get assigned it as well, or
> > only the newly injected pages?
> >
> > I guess I don't fully understand the concept of batch ID and how to
> > utilize it. I already searched the Nutch site and past posts, but could
> > not find clarification on this.
> >
> > Thank you for your help
> >
>