Hi,

Just to avoid confusion: there are two concepts, namely batchId and crawlId. A batchId identifies a subset of rows within a single table, whereas the table itself is determined by the crawlId. Not all stores honor crawlId, as this requires a store-specific implementation. At least HBaseStore supports it.
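
Since the table is determined by the crawlId, one rough way to recover the crawlIds is to take a listing of table names and strip the common table suffix. The sketch below assumes tables are named `<crawlId>_webpage` and that a bare `webpage` table belongs to a crawl with no crawlId; both the naming convention and the helper name `crawl_ids_from_tables` are assumptions for illustration, so check them against your own store before relying on this.

```python
def crawl_ids_from_tables(table_names, base_table="webpage"):
    """Extract crawlIds from table names, assuming '<crawlId>_<base_table>' naming.

    The naming convention is an assumption; adjust base_table to match your schema.
    """
    suffix = "_" + base_table
    ids = set()
    for name in table_names:
        # Skip the bare base table (no crawlId) and unrelated tables.
        if name.endswith(suffix) and len(name) > len(suffix):
            ids.add(name[: -len(suffix)])
    return sorted(ids)
```

The list of table names would come from whatever listing your database offers; the function only does the string handling.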

Not sure what the best way is to collect all crawlIds. You could always do a listing of your database. I've made some improvements to NutchJob that include the crawlId within each job name. That definitely helps when managing multiple crawls. Will commit this soon.

Ferdy.

On Mon, Jul 30, 2012 at 5:28 PM, Lewis John Mcgibbney <[email protected]> wrote:
> Hi Bai,
>
> This is something which I've also wanted but haven't got around to
> sorting out. It's 'kinda' similar to the problem we addressed in
> NUTCH-1349 [0].
>
> AFAIK the only solution I have is to hack a logging implementation
> which would provide the crawlId's in a more verbose manner, then you
> could pick them up from your log(s) output. Other than that the next
> step would be to introduce a dedicated tool for the job?
>
> Any other thoughts?
>
> Lewis
>
> [0] https://issues.apache.org/jira/browse/NUTCH-1349
>
> On Mon, Jul 30, 2012 at 4:17 PM, Bai Shen <[email protected]> wrote:
> > How do I check what crawlIds currently exist? Previously I could look in
> > my segments directory to see what needed to be processed.
> >
> > Thanks.
> >
>
> --
> Lewis

