Hello,

For my current project I need to implement an index-time mechanism to detect (near-)duplicate documents. The TextProfileSignature that is available out of the box (http://wiki.apache.org/solr/Deduplication) seems adequate, but it does not use global collection statistics when deciding which terms are used to calculate the signature. Most state-of-the-art hash-based duplicate detection algorithms use this information to improve precision and recall (e.g. http://portal.acm.org/citation.cfm?id=506311&dl=GUIDE&coll=GUIDE&CFID=83187370&CFTOKEN=47052122).
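
To make the idea concrete, here is a rough sketch of IDF-based term selection for a signature. It is only an illustration under stated assumptions: it does not use Solr's Signature API, the class name IdfSignatureSketch, the in-memory docFreq map, numDocs, and the minIdf/maxIdf thresholds are all hypothetical, and MD5 is just one possible hash.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Collection;
import java.util.Map;
import java.util.TreeSet;

// Hypothetical sketch, not part of Solr: selects signature terms by IDF and hashes them.
final class IdfSignatureSketch {

    // idf = ln(numDocs / docFreq). Terms missing from the map are treated as
    // having docFreq 1 (an assumption made for this sketch).
    static double idf(String term, Map<String, Integer> docFreq, int numDocs) {
        int df = Math.max(1, docFreq.getOrDefault(term, 1));
        return Math.log((double) numDocs / df);
    }

    // Keeps the distinct terms whose IDF lies in [minIdf, maxIdf], sorts them,
    // and returns the hex MD5 digest of the sorted terms.
    static String signature(Collection<String> terms, Map<String, Integer> docFreq,
                            int numDocs, double minIdf, double maxIdf) {
        TreeSet<String> kept = new TreeSet<>();
        for (String t : terms) {
            double v = idf(t, docFreq, numDocs);
            if (v >= minIdf && v <= maxIdf) {
                kept.add(t);
            }
        }
        try {
            MessageDigest md5 = MessageDigest.getInstance("MD5");
            for (String t : kept) {
                md5.update(t.getBytes(StandardCharsets.UTF_8));
                md5.update((byte) 0); // separator so "ab","c" hashes differently from "a","bc"
            }
            StringBuilder hex = new StringBuilder();
            for (byte b : md5.digest()) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 not available", e);
        }
    }
}
```

With such a scheme, two documents that differ only in very common terms (IDF below minIdf) would get the same signature. What is missing is a way to obtain the document frequencies from the index at signature time.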

Is it possible to access collection statistics, especially IDF values for all non-discarded terms in the current document, from within an implementation of the Signature class?

Kind regards,
Thomas

--
DDI Thomas Heigl
Software Engineer
System One Gesellschaft für technologiegestützte Kommunikationsprozesse m.b.H.
Stiftgasse 6/2/6
thomas.he...@systemone.at
http://www.systemone.at