Hi Tony,

On Thursday, July 18, 2013, Tony Mullins <[email protected]> wrote:
> Currently in Nutch2.x SolrDeDup job runs on entire index.
> Is it possible to configure it to run against the current batch Id ?

It will be possible. There are various open issues (with patches) for 2.3
which deal with improving the solr* jobs:

https://issues.apache.org/jira/issues/?jql=project%20%3D%20NUTCH%20AND%20fixVersion%20%3D%20%222.3%22%20AND%20status%20%3D%20Open%20ORDER%20BY%20priority%20DESC

Of particular relevance will be NUTCH-1556, which aims to make updatedb do
exactly the same thing, i.e. run against the current batch ID. Maybe you can
take some inspiration from this?

> We are trying to maintain historical data in Solr, crawled by nutch on the
> basis of the date on which it was crawled.
>
> So in this scenario when I run the nutch crawl script it removes all
> duplicate docs against all dates (in entire index) and if I remove the
> SolrDeDup command from crawl script and run it with numberOfRounds >= 2
> then I get duplicate docs against each (generate -> fetch -> parse ->
> dbupdate -> solrindex) cycle.
>
> Thanks,
> Tony.
>

--
*Lewis*

