Re: Programmatically determining crawlIds in Nutch 2.x

Ferdy Galema Mon, 30 Jul 2012 08:47:07 -0700

Hi,

Just to avoid confusion: there are two separate concepts, batchId and crawlId.
The crawlId determines which table is used. A batchId identifies a subset of
entries within that one table. Not all stores honor crawlId, because it
requires store-specific support. HBaseStore, at least, supports it.

I'm not sure what the best way to collect all crawlIds is. You could always
list the tables in your database. I've made some improvements to NutchJob that
include the crawlId in each job name. That definitely helps when managing
multiple crawls. I will commit this soon.
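
For what it's worth, here is a rough sketch of the "listing of your database"
idea. It assumes the store names each table "<crawlId>_webpage". That naming
convention is an assumption on my part and is not confirmed anywhere in this
thread, so check it against your own tables first. CrawlIdLister,
extractCrawlIds and the sample table names are made up for illustration; you
would obtain the real table names however your backend lists them.

```java
import java.util.ArrayList;
import java.util.List;

final class CrawlIdLister {
    // ASSUMPTION: each crawl lives in a table named "<crawlId>_webpage".
    static final String TABLE_SUFFIX = "_webpage";

    // Returns the crawlId prefix of every table name that matches the
    // assumed "<crawlId>_webpage" pattern; other names are skipped.
    static List<String> extractCrawlIds(List<String> tableNames) {
        List<String> crawlIds = new ArrayList<>();
        for (String name : tableNames) {
            boolean hasPrefix = name.length() > TABLE_SUFFIX.length();
            if (hasPrefix && name.endsWith(TABLE_SUFFIX)) {
                crawlIds.add(name.substring(0, name.length() - TABLE_SUFFIX.length()));
            }
        }
        return crawlIds;
    }

    public static void main(String[] args) {
        // Hypothetical table listing, hard-coded for the example.
        List<String> tables = List.of("news_webpage", "blogs_webpage", "webpage", "host");
        System.out.println(extractCrawlIds(tables)); // prints [news, blogs]
    }
}
```

A table called plain "webpage" (no crawlId) is skipped on purpose, since it
carries no prefix to report.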

Ferdy.

On Mon, Jul 30, 2012 at 5:28 PM, Lewis John Mcgibbney <
[email protected]> wrote:

> Hi Bai,
>
> This is something which I've also wanted but haven't got around to
> sorting out. It's 'kinda' similar to the problem we addressed in
> NUTCH-1349 [0].
>
> AFAIK the only solution I have is to hack a logging implementation
> which would provide the crawlIds in a more verbose manner, then you
> could pick them up from your log(s) output. Other than that the next
> step would be to introduce a dedicated tool for the job?
>
> Any other thoughts?
>
> Lewis
>
> [0] https://issues.apache.org/jira/browse/NUTCH-1349
>
> On Mon, Jul 30, 2012 at 4:17 PM, Bai Shen <[email protected]> wrote:
> > How do I check what crawlIds currently exist?  Previously I could look in
> > my segments directory to see what needed to be processed.
> >
> > Thanks.
>
>
>
> --
> Lewis
>
