Hello folks, I have a question about MLT and deduplication, and which would be the best choice in my case.
Case: I index 1000 docs, and 5 of them are about 95% the same (for example, copy-pasted blog articles from different sources with slight changes such as the author name). They are not identical, though. *I would like to see 1 doc in my result set, with the other 4 marked as similar.*

With *MLT*:

    <str name="mlt.fl">text</str>
    <int name="mlt.minwl">5</int>
    <int name="mlt.maxwl">50</int>
    <int name="mlt.maxqt">3</int>
    <int name="mlt.maxntp">5000</int>
    <bool name="mlt.boost">true</bool>
    <str name="mlt.qf">text</str>
    </lst>

With this config I get about 500 similar docs for this 1 doc, which is unfortunately far too many.

*Deduplication*: I now index these docs with a signature, using TextProfileSignature.

    <updateRequestProcessorChain name="dedupe">
      <processor class="solr.processor.SignatureUpdateProcessorFactory">
        <bool name="enabled">true</bool>
        <str name="signatureField">signature_t</str>
        <bool name="overwriteDupes">false</bool>
        <str name="fields">text</str>
        <str name="signatureClass">solr.processor.TextProfileSignature</str>
      </processor>
      <processor class="solr.LogUpdateProcessorFactory" />
      <processor class="solr.RunUpdateProcessorFactory" />
    </updateRequestProcessorChain>

How can I compare the created signatures? I only want to see the 5 similar docs, nothing else.

Which of these two approaches is relevant for me? Can I tune MLT for my requirement, or should I use dedupe?

Thanks and regards,
Vadim
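
P.S. To make clearer what I mean by "comparing the signatures", here is a minimal Python sketch of the result I am after. It assumes the docs have already been fetched from Solr as plain dicts, and that near-duplicate docs end up with exactly the same signature_t value, which I have not verified for my data. The function name, ids and signature values are made up for illustration.

```python
from collections import defaultdict


def group_by_signature(docs, field="signature_t"):
    """Group doc ids by signature value and keep only groups with more than one doc.

    `docs` is assumed to be a list of dicts as they might come back from a
    Solr query; `field` is the signature field from the dedupe chain config.
    """
    groups = defaultdict(list)
    for doc in docs:
        sig = doc.get(field)
        if sig is not None:  # skip docs that were indexed without a signature
            groups[sig].append(doc["id"])
    # A signature shared by several docs marks them as "similar".
    return {sig: ids for sig, ids in groups.items() if len(ids) > 1}


# Made-up sample data: docs 1, 2 and 4 share a signature, 3 and 5 are unique.
docs = [
    {"id": "1", "signature_t": "a1f3"},
    {"id": "2", "signature_t": "a1f3"},
    {"id": "3", "signature_t": "9c0e"},
    {"id": "4", "signature_t": "a1f3"},
    {"id": "5", "signature_t": "77b2"},
]

similar = group_by_signature(docs)
print(similar)
```

Ideally I would get this kind of grouping from Solr itself at query time, showing the first doc of each group and marking the rest as similar, instead of doing it on the client.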