We have a Hadoop process that produces a set of Solr indexes from a cluster
of HBase documents. After the job runs, we pull the indexes from HDFS and
merge them together locally. The issue we're running into is that
sometimes we'll have duplicate occurrences of a primary key across indexes
that we'll want merged out. For example, a set of directories with:

./dir00/
doc_id=0
PK=1

./dir01/
doc_id=0
PK=1

should merge into a Solr index containing a single document rather than one
with two Lucene documents each containing PK=1.

The Lucene-level merge code -- i.e., oal.index.SegmentMerger.merge() --
doesn't know about the Solr schema, so it will merge these two directories
into two duplicate documents. Solr's
oas.handler.admin.CoreAdminHandler.handleMergeAction(SolrQueryRequest,
SolrQueryResponse) doesn't appear to handle this either, as it ends up
passing the list of merge directories to
oal.index.IndexWriter.addIndexes(IndexReader...) via
oas.update.DirectUpdateHandler2.mergeIndexes(MergeIndexesCommand).
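To make the semantics we're after concrete, here's a toy sketch in plain
Java -- no Lucene involved; field maps stand in for Lucene documents, and
the PkMerge class name is purely illustrative. Merging the example
directories above should collapse on the PK field, with the last occurrence
winning (i.e., what updateDocument-by-term semantics would give us):

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class PkMerge {
    // Merge "directories" (lists of field->value maps), keeping exactly
    // one document per primary key. Later directories win on collision.
    public static List<Map<String, String>> merge(
            String pkField, List<List<Map<String, String>>> dirs) {
        Map<String, Map<String, String>> byPk = new LinkedHashMap<>();
        for (List<Map<String, String>> dir : dirs) {
            for (Map<String, String> doc : dir) {
                byPk.put(doc.get(pkField), doc); // last occurrence wins
            }
        }
        return new ArrayList<>(byPk.values());
    }
}
```

Running this over the ./dir00 and ./dir01 example yields a single document
with PK=1, which is the result we'd like the real index merge to produce.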

So, if I want to merge multiple Solr directories in a way that respects
primary-key uniqueness, is there a more efficient approach than re-adding
every document from each directory to a new Solr index to weed out the PK
duplicates?

Thanks.

--Gregg

Gregg Donovan
Senior Software Engineer, Etsy.com
gr...@etsy.com
