We have a Hadoop process that produces a set of Solr indexes from a cluster of HBase documents. After the job runs, we pull the indexes from HDFS and merge them together locally. The issue we're running into is that sometimes we'll have duplicate occurrences of a primary key across indexes that we want merged out. For example, a set of directories with:
./dir00/
    doc_id=0  PK=1
./dir01/
    doc_id=0  PK=1

should merge into a Solr index containing a single document for PK=1, rather than one with two Lucene documents that each contain PK=1.

The Lucene-level merge code -- i.e., oal.index.SegmentMerger.merge() -- doesn't know about the Solr schema, so it will merge these two directories into an index with two duplicate documents. It doesn't appear that Solr's oas.handler.admin.CoreAdminHandler.handleMergeAction(SolrQueryRequest, SolrQueryResponse) handles this either, as it ends up passing the list of merge directories down to oal.index.IndexWriter.addIndexes(IndexReader...) via oas.update.DirectUpdateHandler2.mergeIndexes(MergeIndexesCommand).

So: if I want to merge multiple Solr directories in a way that respects primary-key uniqueness, is there any more efficient approach than re-adding every document from each directory to a new Solr index to avoid PK duplicates?

Thanks.

--Gregg

Gregg Donovan
Senior Software Engineer, Etsy.com
gr...@etsy.com
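P.S. To make the semantics I'm after concrete, here's a toy Python sketch -- obviously not Lucene code, and all names here are illustrative only -- of what I mean by merging out PK duplicates, with a later directory winning over an earlier one (like an update on Solr's uniqueKey field):

```python
# Toy model of a PK-aware index merge. Each "index directory" is modeled
# as a list of {'doc_id': ..., 'PK': ...} records; the merge keeps exactly
# one document per PK, last occurrence winning.

def merge_indexes(indexes):
    """Merge lists of doc dicts, deduplicating on the 'PK' field."""
    merged = {}
    for index in indexes:
        for doc in index:
            # Later occurrences of a PK overwrite earlier ones,
            # mirroring an update-by-uniqueKey.
            merged[doc["PK"]] = doc
    return list(merged.values())

# The example from above: two directories, same PK in each.
dir00 = [{"doc_id": 0, "PK": 1}]
dir01 = [{"doc_id": 0, "PK": 1}]

result = merge_indexes([dir00, dir01])
print(len(result))  # 1 -- a single document for PK=1
```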