Re: How to configure SolrDeDup Job to run per batch Id not entire index?

Lewis John Mcgibbney Thu, 18 Jul 2013 07:44:02 -0700

Hi Tony,

On Thursday, July 18, 2013, Tony Mullins <[email protected]> wrote:
> Currently in Nutch 2.x the SolrDeDup job runs on the entire index.
> Is it possible to configure it to run against the current batch ID?


It will be possible. There are several open issues (with patches) targeted
at 2.3 which deal with improving the solr* jobs:

https://issues.apache.org/jira/issues/?jql=project%20%3D%20NUTCH%20AND%20fixVersion%20%3D%20%222.3%22%20AND%20status%20%3D%20Open%20ORDER%20BY%20priority%20DESC

Of particular relevance is NUTCH-1556, which aims to make updatedb do
exactly the same thing, i.e. run against a batch ID rather than against
everything. Maybe you can take some inspiration from it?
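
To make the idea concrete, here is a rough sketch of what batch-scoped
deduplication could look like. It is plain Python over in-memory dicts, not
Nutch or Solr code. The field names (id, digest, batchId, tstamp) and the
"keep the newest document" rule are assumptions for illustration only and
may not match your schema or what the SolrDeDup job actually does.

```python
from collections import defaultdict


def find_duplicates_in_batch(docs, batch_id):
    """Return the ids of documents in `batch_id` that duplicate a newer
    document from the same batch. Other batches are never touched.

    Each doc is a dict with the (assumed) keys: id, digest, batchId, tstamp.
    """
    # Group only the documents of the requested batch by content digest.
    by_digest = defaultdict(list)
    for doc in docs:
        if doc["batchId"] == batch_id:
            by_digest[doc["digest"]].append(doc)

    to_delete = []
    for group in by_digest.values():
        # Keep the most recently fetched document, mark the rest as duplicates.
        group.sort(key=lambda d: d["tstamp"], reverse=True)
        to_delete.extend(d["id"] for d in group[1:])
    return sorted(to_delete)


if __name__ == "__main__":
    docs = [
        {"id": "a", "digest": "x", "batchId": "b1", "tstamp": 1},
        {"id": "b", "digest": "x", "batchId": "b1", "tstamp": 2},
        {"id": "c", "digest": "x", "batchId": "b2", "tstamp": 3},
        {"id": "d", "digest": "y", "batchId": "b1", "tstamp": 1},
    ]
    print(find_duplicates_in_batch(docs, "b1"))
```

The point is that a document with the same digest in another batch (id "c"
above) is left alone, so copies crawled on different dates would survive.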

> We are trying to maintain historical data in Solr, crawled by Nutch, on the
> basis of the date on which it was crawled.
>
> So in this scenario, when I run the Nutch crawl script it removes all
> duplicate docs against all dates (in the entire index), and if I remove the
> SolrDeDup command from the crawl script and run it with numberOfRounds >= 2
> then I get duplicate docs for each (generate -> fetch -> parse ->
> dbupdate -> solrindex) cycle.
>
> Thanks,
> Tony.
>

-- 
*Lewis*
