Thanks Walter. Unfortunately, some of our documents are near duplicates: they are mostly identical (>75%) but usually not 100% identical. hashCode is very sensitive to small changes, so it can't be used in our case.
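
To illustrate the difference, here is a minimal sketch of one common way to score near duplicates: Jaccard similarity over word shingles. It is not what I actually run inside DUH2, and nothing in it is Solr API. The class and method names (NearDup, shingles, jaccard, isNearDuplicate), the shingle size of 3, and the reading of ">75% identical" as "Jaccard >= 0.75" are all assumptions made for the example.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical helper class, not part of Solr or Lucene.
final class NearDup {

    // Split text into lowercase word k-shingles (runs of k consecutive words).
    static Set<String> shingles(String text, int k) {
        String[] words = text.toLowerCase().trim().split("\\s+");
        Set<String> out = new HashSet<>();
        if (words.length < k) {
            // Text shorter than one shingle: treat the whole text as a single shingle.
            out.add(String.join(" ", words));
            return out;
        }
        for (int i = 0; i + k <= words.length; i++) {
            out.add(String.join(" ", Arrays.copyOfRange(words, i, i + k)));
        }
        return out;
    }

    // Jaccard similarity: |A intersect B| / |A union B|, in the range [0, 1].
    static double jaccard(Set<String> a, Set<String> b) {
        if (a.isEmpty() && b.isEmpty()) {
            return 1.0;
        }
        Set<String> inter = new HashSet<>(a);
        inter.retainAll(b);
        int union = a.size() + b.size() - inter.size();
        return (double) inter.size() / union;
    }

    // Assumption: "near duplicate" means shingle similarity at or above the threshold.
    static boolean isNearDuplicate(String x, String y, int k, double threshold) {
        return jaccard(shingles(x, k), shingles(y, k)) >= threshold;
    }

    public static void main(String[] args) {
        String a = "the quick brown fox jumps over the lazy dog today";
        String b = a + " again";
        // The two strings differ, so an exact-match key built from hashCode
        // would treat them as unrelated documents.
        System.out.println("same hashCode: " + (a.hashCode() == b.hashCode()));
        System.out.println("jaccard: " + jaccard(shingles(a, 3), shingles(b, 3)));
        System.out.println("near duplicate: " + isNearDuplicate(a, b, 3, 0.75));
    }
}
```

A hash-derived uniqueKey only answers "is this exactly the same text?", whereas a similarity score like this one degrades gradually as the text changes. Comparing every incoming document against every indexed document this way would be too slow, which is why I compare against sample documents from each cluster instead.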

Walter Ferrara-2 wrote:
>
> solr have unique keys, which do that "avoid duplicate" work for you, so
> you may try to make some kind of unique identifier out of the text your
> going to index, and use that as a solr <uniqueKey>.
>
> You could try to create a sort of hashCode or something like that from
> the text your are going to index, and use that as uniquekey of the
> schema - the next time you're going to add the same text, you should
> get the same key, and so solr will not add it again, but just update it
> (or at least it will be a lot simpler to understand if that document is
> already present in the index).
>
> any other thoughts?
> --
> Walter
>
> climbingrose wrote:
>>
>>>> You would get autowarming, etc, by default though - not what you want
>>>> from a searcher that is only used for deletions.
>>
>> As a work around, I manually initialise LRUCache instance in DUH2
>> constructor. It works but not very elegant because you can't view cache's
>> statistics info in Solr admin...
>>
>>>> What problem are you trying to solve that requires directly using or
>>>> modifying DUH2?
>>
>> I'm doing near duplication detection on a fairly large number of
>> documents.
>> Each document to be added to Solr will be compared with sample documents
>> from all clusters in the index. I could of course, dedupe documents at
>> client side but the performance will not be as good.
>>
>> BTW, has anyone here done any serious near duplication detection with
>> Solr?
>> If yes, what approaches did you use?
>>
>> Thanks.
>>
>
>

--
View this message in context: http://www.nabble.com/Implication-of-not-calling-closeSearcher%28%29-in-DirectUpdateHandler2--tf4508411.html#a12874713
Sent from the Solr - Dev mailing list archive at Nabble.com.