Hello Mark - why? Although this is possible, what is the reason? It makes no
sense to me. Gone records are not reindexed; they are ignored or, with the
correct flags, even removed from the index.

In any case, in Nutch 1.x the CrawlDB is read (optionally in trunk, I believe)
and the number of 404s in the segment is passed as well. With some clever
key/value passing in IndexerMapReduce, it is straightforward to get that value
beforehand.
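
As a rough illustration only, and not the IndexerMapReduce route described
above: a minimal sketch that assumes you capture the text printed by
"bin/nutch readdb <crawldb> -stats" before indexing, and that it contains a
line of the form "status 3 (db_gone): 50". The class and method names
(GoneThreshold, parseGoneCount, shouldIndex) are hypothetical and not part of
Nutch, and treating a missing db_gone line as zero is also an assumption.

```java
import java.util.OptionalLong;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not a Nutch class: decides whether to index based on
// the db_gone count found in captured "readdb -stats" text.
final class GoneThreshold {
    // Assumed line shape: "status 3 (db_gone):<whitespace><count>"
    private static final Pattern GONE_LINE =
            Pattern.compile("\\(db_gone\\):\\s*(\\d+)");

    // Returns the db_gone count, or empty if no such line is present.
    static OptionalLong parseGoneCount(String statsOutput) {
        Matcher m = GONE_LINE.matcher(statsOutput);
        return m.find()
                ? OptionalLong.of(Long.parseLong(m.group(1)))
                : OptionalLong.empty();
    }

    // True when the gone count does not exceed maxGone.
    // Assumption: a missing db_gone line means zero gone records.
    static boolean shouldIndex(String statsOutput, long maxGone) {
        return parseGoneCount(statsOutput).orElse(0L) <= maxGone;
    }

    public static void main(String[] args) {
        // Made-up sample text in the assumed format, for illustration only.
        String sample = "status 2 (db_fetched):\t400\nstatus 3 (db_gone):\t50\n";
        System.out.println("index? " + shouldIndex(sample, 100));
    }
}
```

Your indexing wrapper would call shouldIndex(...) with the captured stats text
and your own threshold, and skip the index step when it returns false.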

M.

-----Original message-----
> From:mark mark <[email protected]>
> Sent: Thursday 23rd June 2016 19:52
> To: [email protected]
> Subject: Nutch db_gone
> 
> Hi,
> 
> I am using Nutch 1.x and need a way, from code (a plugin), to get the total
> number of db_gone documents.
> 
> We want to set a threshold on db_gone documents: before indexing, we want
> to check the number of gone documents and, if it is more than our
> threshold, we don't want to index.
> 
> We want to do this from code.
> 
> Thanks Mark
> 
