For reference, you can get a rental copy of this article for less than the cost of the full PDF download here:
http://www.deepdyve.com/lp/association-for-computing-machinery/collection-statistics-for-fast-duplicate-document-detection-0o7i3Sx0Wd

(joining the ACM is also a good thing to do)

(and yes, this is licensed by the ACM)

On Wed, Mar 24, 2010 at 2:28 AM, Thomas Heigl <thomas.he...@systemone.at> wrote:

> Hello,
>
> For my current project I need to implement an index-time mechanism to
> detect (near) duplicate documents. The TextProfileSignature available
> out-of-the-box (http://wiki.apache.org/solr/Deduplication) seems alright
> but does not use global collection statistics in deciding which terms
> will be used for calculating the signature.
> Most state-of-the-art hash-based duplication detection algorithms make
> use of this information to improve precision and recall (e.g.
> http://portal.acm.org/citation.cfm?id=506311&dl=GUIDE&coll=GUIDE&CFID=83187370&CFTOKEN=47052122
> )
>
> Is it possible to access collection statistics - especially IDF values
> for all non-discarded terms in the current document - from within an
> implementation of the Signature class?
>
> Kind regards,
>
> Thomas
>
> --
> DDI Thomas Heigl
> Software Engineer
> --------------------------------------------
> System One
> Gesellschaft für technologiegestützte
> Kommunikationsprozesse m.b.H.
> Stiftgasse 6/2/6
> thomas.he...@systemone.at
> http://www.systemone.at
> Powered by Open-Xchange.com
>
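
To make the idea in the quoted question concrete, here is a minimal sketch of how collection statistics could feed into a signature: terms whose IDF falls below a threshold (that is, terms that occur in almost every document) are dropped before hashing. This is an illustration only and is not Solr's Signature API or the exact algorithm from the paper. The names tokenize, compute_idf, idf_signature and the min_idf parameter are hypothetical, and treating unseen terms as discarded is an assumption made for simplicity.

```python
import hashlib
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def compute_idf(docs):
    """Return a term -> IDF map, using idf = ln(N / document_frequency)."""
    n = len(docs)
    df = Counter()
    for doc in docs:
        # Count each term once per document.
        df.update(set(tokenize(doc)))
    return {term: math.log(n / count) for term, count in df.items()}


def idf_signature(text, idf, min_idf=0.1):
    """Hash the sorted set of terms whose IDF is at least min_idf.

    Assumption: terms missing from the idf map are treated as IDF 0
    and therefore discarded.
    """
    kept = sorted({t for t in tokenize(text) if idf.get(t, 0.0) >= min_idf})
    return hashlib.sha1(" ".join(kept).encode("utf-8")).hexdigest()
```

With IDF values computed over the collection, two documents that differ only in very common terms hash to the same signature, while documents with different distinctive terms do not. Whether these statistics are reachable from inside a Signature implementation at index time is the open question above; the sketch assumes they are already available as a plain dictionary.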