Hi,

Cuong Hoang wrote:
> BTW, has anyone here done any serious near duplication detection with
> Solr? If yes, what approaches did you use?
[...]
> Unfortunately some of our documents are "near duplications" which means
> they are mostly identical (>75%) but usually not 100% identical.
> hashCode is very sensitive to small changes so it can't be used in our
> case.
You may be interested in this Lucene java-user ML thread:

<http://www.gossamer-threads.com/lists/lucene/java-user/41103>

The Nutch TextProfileSignature implementation[1] mentioned in the
above-linked thread appears to take an MD5 signature of the
frequency-ordered, downcased, whitespace-separated tokens from a
document (see the first sketch at the end of this message).  This
approach is not quite as sensitive to small changes as a direct hash of
the content, but it will likely fail fairly often if your documents
differ by more than a few percent, as your ">75% identical" seems to
indicate they do.

I have done some small-scale deduplication work (without Solr), and
found that a small preprocessing step using regular expressions to
remove changeable content that was not meaningful for the purposes of
comparison (e.g. hit counters and date/time stamps) was fairly
successful at reducing the error rate of a brute-force term frequency
comparison approach, i.e., direct calculation of the angle between each
document pair's term vectors (see the second sketch below).

Steve

[1] API doc for the Nutch TextProfileSignature class:
<http://lucene.apache.org/nutch/apidocs/org/apache/nutch/crawl/TextProfileSignature.html>
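P.S.: In case it's useful, here is a back-of-the-envelope Java sketch of
the general idea behind TextProfileSignature as I understand it.  The
class and method names below are mine, not Nutch's, and IIRC the real
class also quantizes the frequencies and drops very short tokens, which
I've left out -- the Nutch source is the authoritative reference:

import java.security.MessageDigest;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class ProfileSignatureSketch {
  // Downcase, tokenize on whitespace, count term frequencies, sort the
  // terms by descending frequency, and MD5 the resulting profile.
  public static String signature(String text) throws Exception {
    Map<String,Integer> freqs = new HashMap<String,Integer>();
    for (String token : text.toLowerCase().split("\\s+")) {
      if (token.length() == 0) continue;
      Integer count = freqs.get(token);
      freqs.put(token, count == null ? 1 : count + 1);
    }
    List<Map.Entry<String,Integer>> profile =
      new ArrayList<Map.Entry<String,Integer>>(freqs.entrySet());
    // Descending frequency; ties broken alphabetically so the profile
    // is deterministic.
    Collections.sort(profile, new Comparator<Map.Entry<String,Integer>>() {
      public int compare(Map.Entry<String,Integer> a,
                         Map.Entry<String,Integer> b) {
        int byFreq = b.getValue() - a.getValue();
        return byFreq != 0 ? byFreq : a.getKey().compareTo(b.getKey());
      }
    });
    StringBuilder sb = new StringBuilder();
    for (Map.Entry<String,Integer> e : profile) {
      sb.append(e.getKey()).append(' ').append(e.getValue()).append('\n');
    }
    byte[] digest = MessageDigest.getInstance("MD5")
                                 .digest(sb.toString().getBytes("UTF-8"));
    StringBuilder hex = new StringBuilder();
    for (byte b : digest) {
      hex.append(String.format("%02x", b));
    }
    return hex.toString();
  }
}

Note that word order doesn't matter here, only the frequency profile,
which is why this tolerates reorderings but breaks as soon as the
profile itself changes at all (the real class's frequency quantization
softens that somewhat).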
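And a similarly rough sketch of the brute-force term vector comparison
I described above.  The regexes are made-up examples; in practice you
would tune them to whatever changeable content your documents actually
contain:

import java.util.HashMap;
import java.util.Map;

public class TermVectorAngleSketch {
  // Strip content that changes between otherwise-identical pages
  // before comparing.  These patterns are illustrative only.
  static String preprocess(String text) {
    return text
      // date/time stamps like "2007-05-14 10:32:05"
      .replaceAll("\\d{4}-\\d{2}-\\d{2}( \\d{2}:\\d{2}:\\d{2})?", " ")
      // hit-counter boilerplate like "Viewed 12345 times"
      .replaceAll("(?i)viewed \\d+ times", " ");
  }

  static Map<String,Integer> termFreqs(String text) {
    Map<String,Integer> freqs = new HashMap<String,Integer>();
    for (String token : preprocess(text).toLowerCase().split("\\W+")) {
      if (token.length() == 0) continue;
      Integer count = freqs.get(token);
      freqs.put(token, count == null ? 1 : count + 1);
    }
    return freqs;
  }

  // Cosine of the angle between the two term vectors: 1.0 means the
  // frequency profiles are identical; near-duplicates score close to 1.
  static double cosine(Map<String,Integer> a, Map<String,Integer> b) {
    long dot = 0;
    for (Map.Entry<String,Integer> e : a.entrySet()) {
      Integer other = b.get(e.getKey());
      if (other != null) dot += (long) e.getValue() * other;
    }
    double normA = 0.0, normB = 0.0;
    for (int v : a.values()) normA += (double) v * v;
    for (int v : b.values()) normB += (double) v * v;
    if (normA == 0.0 || normB == 0.0) return 0.0;
    return dot / (Math.sqrt(normA) * Math.sqrt(normB));
  }

  public static void main(String[] args) {
    String a = "Big news story text here. Viewed 101 times. 2007-05-14";
    String b = "Big news story text here. Viewed 999 times. 2007-05-15";
    System.out.println(cosine(termFreqs(a), termFreqs(b))); // 1.0
  }
}

As I said, this was small-scale work -- comparing all document pairs is
quadratic, so it won't take you very far without some way of limiting
which pairs get compared.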