Solr has unique keys, which do that "avoid duplicates" work for you, so you could try to build some kind of unique identifier out of the text you're going to index and use that as the Solr <uniqueKey>.
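A minimal sketch of deriving such an identifier on the client side, assuming Python with SHA-1 over lightly normalized text. The helper name `content_key`, the normalization steps, and the `id` and `text` field names are all hypothetical choices for illustration, not anything Solr prescribes. Note that a hash like this only catches exact duplicates (after normalization), not near duplicates.

```python
import hashlib
import re


def content_key(text: str) -> str:
    """Return a deterministic key derived from the document text."""
    # Assumed normalization: collapse whitespace and lowercase, so that
    # trivially different copies of the same text map to the same key.
    normalized = re.sub(r"\s+", " ", text).strip().lower()
    # SHA-1 hex digest: the same input always yields the same 40-char key.
    return hashlib.sha1(normalized.encode("utf-8")).hexdigest()


# Hypothetical document; "id" stands in for whatever field the schema
# declares as its <uniqueKey>.
doc_text = "Some  text to\nindex"
doc = {"id": content_key(doc_text), "text": doc_text}
```

Sending `doc` to Solr twice would then hit the same key both times, as long as the schema's unique key field is the one populated here.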
You could try to create some kind of hash code from the text you are going to index and use that as the uniqueKey of the schema. The next time you add the same text you should get the same key, so Solr will not add it again but just update the existing document (or at least it will be a lot simpler to tell whether that document is already present in the index).

Any other thoughts?

-- Walter

climbingrose wrote:
>>> You would get autowarming, etc, by default though - not what you want
>>> from a searcher that is only used for deletions.
>
> As a workaround, I manually initialise an LRUCache instance in the DUH2
> constructor. It works but is not very elegant because you can't view the
> cache's statistics info in Solr admin...
>
>>> What problem are you trying to solve that requires directly using or
>>> modifying DUH2?
>
> I'm doing near duplication detection on a fairly large number of documents.
> Each document to be added to Solr will be compared with sample documents
> from all clusters in the index. I could, of course, dedupe documents on the
> client side but the performance will not be as good.
>
> BTW, has anyone here done any serious near duplication detection with Solr?
> If yes, what approaches did you use?
>
> Thanks.