: I'm doing near duplication detection on a fairly large number of documents.
: Each document to be added to Solr will be compared with sample documents
: from all clusters in the index. I could of course, dedupe documents at
: client side but the performance will not be as good.

have you considered the UpdateRequestProcessor API as an alternative to mucking with DUH2 (DirectUpdateHandler2) directly?

http://lucene.apache.org/solr/api/org/apache/solr/update/processor/package-summary.html

(i don't really know much of the details about it, but i know it was added specifically to support more "biz logic" related tasks at update time -- as opposed to the really low level nitty gritty updating that DUH2 worries about directly)

-Hoss
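
To make the idea concrete, here is a rough sketch of the chain pattern: a "biz logic" link sits in front of the low-level indexing step and decides whether a document gets passed on. This is plain Java with made-up stand-in classes (Processor, IndexingProcessor, NearDuplicateFilter), not Solr's real classes or signatures, and the token-set Jaccard check with a 0.8 threshold is just an assumed placeholder for whatever cluster comparison is actually used.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Plain-Java stand-ins, NOT Solr classes: they only mimic the chain idea.
class DedupeChainSketch {

    /** One link in the chain; by default it passes the document on. */
    static abstract class Processor {
        protected final Processor next;

        Processor(Processor next) { this.next = next; }

        void processAdd(String doc) {
            if (next != null) next.processAdd(doc);
        }
    }

    /** Terminal link: stands in for the low-level index write. */
    static class IndexingProcessor extends Processor {
        final List<String> indexed = new ArrayList<>();

        IndexingProcessor() { super(null); }

        @Override
        void processAdd(String doc) { indexed.add(doc); }
    }

    /** Business-logic link: drops documents too similar to a cluster sample. */
    static class NearDuplicateFilter extends Processor {
        private final List<Set<String>> samples = new ArrayList<>();
        private final double threshold; // assumed similarity cutoff

        NearDuplicateFilter(double threshold, Processor next) {
            super(next);
            this.threshold = threshold;
        }

        @Override
        void processAdd(String doc) {
            Set<String> tokens = tokenize(doc);
            for (Set<String> sample : samples) {
                if (jaccard(tokens, sample) >= threshold) {
                    return; // near duplicate: never reaches the index
                }
            }
            samples.add(tokens);   // first document of a new cluster becomes its sample
            super.processAdd(doc); // hand over to the next link
        }
    }

    /** Lower-cases the text and splits it into a set of word tokens. */
    static Set<String> tokenize(String text) {
        return new HashSet<>(Arrays.asList(text.toLowerCase().split("\\W+")));
    }

    /** Jaccard similarity: |intersection| / |union| of two token sets. */
    static double jaccard(Set<String> a, Set<String> b) {
        Set<String> inter = new HashSet<>(a);
        inter.retainAll(b);
        Set<String> union = new HashSet<>(a);
        union.addAll(b);
        return union.isEmpty() ? 0.0 : (double) inter.size() / union.size();
    }

    public static void main(String[] args) {
        IndexingProcessor index = new IndexingProcessor();
        Processor chain = new NearDuplicateFilter(0.8, index);
        chain.processAdd("the quick brown fox jumps over the lazy dog");
        chain.processAdd("the quick brown fox jumps over the lazy dog!");
        chain.processAdd("an entirely different document about solr");
        System.out.println(index.indexed.size() + " documents indexed");
    }
}
```

In a real setup the filter would presumably be a subclass of Solr's UpdateRequestProcessor registered in solrconfig.xml, and the cluster samples would come from the index rather than an in-memory list; check the javadocs linked above for the actual method signatures.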