Implementing near duplicate detection algorithm using IDF statistics

Thomas Heigl Wed, 24 Mar 2010 02:28:59 -0700

Hello,

For my current project I need to implement an index-time mechanism to
detect (near-)duplicate documents. The TextProfileSignature that ships
with Solr (http://wiki.apache.org/solr/Deduplication) works, but it does
not use global collection statistics when deciding which terms are used
to calculate the signature.
Most state-of-the-art hash-based duplicate detection algorithms use this
information to improve precision and recall (e.g.
http://portal.acm.org/citation.cfm?id=506311&dl=GUIDE&coll=GUIDE&CFID=83187370&CFTOKEN=47052122).
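
To make the idea concrete, here is a minimal sketch of what such an
IDF-filtered signature could look like. It is plain Java and deliberately
independent of Solr: the class name IdfSignatureSketch, the docFreq map,
numDocs, the minIdf/maxIdf thresholds and the idf formula are all
assumptions for illustration, not part of any existing API. How to obtain
these statistics inside a Signature implementation is exactly the open
question.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.List;
import java.util.Map;
import java.util.TreeSet;

/** Hypothetical sketch: hash only the terms whose IDF lies in a chosen band. */
final class IdfSignatureSketch {

    /** Assumed formula: idf = ln(numDocs / docFreq). Not Solr's similarity. */
    static double idf(int numDocs, int docFreq) {
        return Math.log((double) numDocs / docFreq);
    }

    /**
     * Keeps the unique terms with minIdf <= idf <= maxIdf, sorts them,
     * and returns the MD5 digest of the sorted terms as a hex string.
     * Terms missing from docFreq are skipped (an assumption of this sketch).
     */
    static String signature(List<String> tokens, Map<String, Integer> docFreq,
                            int numDocs, double minIdf, double maxIdf) {
        // TreeSet removes duplicates and makes the result order-independent.
        TreeSet<String> kept = new TreeSet<>();
        for (String term : tokens) {
            Integer df = docFreq.get(term);
            if (df == null || df <= 0) {
                continue;
            }
            double value = idf(numDocs, df);
            if (value >= minIdf && value <= maxIdf) {
                kept.add(term);
            }
        }
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            for (String term : kept) {
                md.update(term.getBytes(StandardCharsets.UTF_8));
                md.update((byte) 0); // separator so "ab","c" differs from "a","bc"
            }
            StringBuilder hex = new StringBuilder();
            for (byte b : md.digest()) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // MD5 is required on every Java platform, so this should not happen.
            throw new IllegalStateException(e);
        }
    }
}
```

With thresholds chosen so that very common and very rare terms fall
outside the band, two documents that differ only in such terms, in term
order, or in term repetition would receive the same signature.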


Is it possible to access collection statistics - especially IDF values
for all non-discarded terms in the current document - from within an
implementation of the Signature class?

Kind regards,

Thomas

--
DDI Thomas Heigl
Software Engineer
--------------------------------------------
System One
Gesellschaft für technologiegestützte
Kommunikationsprozesse m.b.H.
Stiftgasse 6/2/6
thomas.he...@systemone.at
http://www.systemone.at
