Re: Similar documents and advantages / disadvantages of MLT / Deduplication
: I index 1000 docs, 5 of them are 95% the same (for example: copy pasted
: blog articles from different sources, with slight changes (author name,
: etc..)).
: But they have differences.
: *Now i like to see 1 doc in my result set and the other 4 should be marked
: as similar.*

Do you actually want all 1000 docs in your index, or do you want to prevent 4 of the 5 copies of the doc from being indexed?

Either way, if the TextProfileSignature is doing a good job of identifying the 5 similar docs, then use that at index time.

If you want to keep 4/5 out of the index, then use the Deduplication features to prevent the duplicates from being indexed and you're done.

If you want all docs in the index, then you have to decide how you want to mark docs as similar ... do you want to only have one of those docs appear in all of your results, or do you want all of them in the results but with an indication that there are other similar docs?

If the former: then take a look at Grouping and group on your signature field.

If the latter: use the MLT component to find similar docs based on the signature field (ie: mlt.fl=signature_t)

https://wiki.apache.org/solr/FieldCollapsing

-Hoss
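For illustration, the two query-side options mentioned in the reply (Grouping on the signature field, or MLT with mlt.fl=signature_t) could be expressed as request URLs roughly like this. It is only a sketch: the host, port and handler path are hypothetical, and the mlt.mintf / mlt.mindf / mlt.count values are assumptions chosen so that a signature term occurring once per document is not filtered out, not settings taken from this thread.

```python
from urllib.parse import urlencode

# Hypothetical endpoint; adjust host, port and core/handler to your setup.
SOLR_SELECT = "http://localhost:8983/solr/select"


def grouped_query_url(q, signature_field="signature_t"):
    """Build a query that returns one representative doc per signature value."""
    params = {
        "q": q,
        "group": "true",                 # enable Result Grouping
        "group.field": signature_field,  # docs sharing a signature form one group
        "group.limit": 1,                # show one doc per group
    }
    return SOLR_SELECT + "?" + urlencode(params)


def mlt_query_url(q, signature_field="signature_t"):
    """Build a query that returns all docs, each with its similar docs attached."""
    params = {
        "q": q,
        "mlt": "true",               # enable the MoreLikeThis component
        "mlt.fl": signature_field,   # compare on the signature field only
        "mlt.mintf": 1,              # assumed: signature term occurs once per doc
        "mlt.mindf": 1,              # assumed: allow very small duplicate clusters
        "mlt.count": 10,             # assumed: max similar docs listed per result
    }
    return SOLR_SELECT + "?" + urlencode(params)


if __name__ == "__main__":
    print(grouped_query_url("text:solr"))
    print(mlt_query_url("text:solr"))
```

With grouping, the number of docs found in each group tells you how many near-copies sit behind the one that is shown; with MLT, every copy still appears in the main result list.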
Similar documents and advantages / disadvantages of MLT / Deduplication
Hello folks,

I have questions about MLT and Deduplication and which would be the best choice in my case.

Case:

I index 1000 docs, 5 of them are 95% the same (for example: copy-pasted blog articles from different sources, with slight changes (author name, etc.)). But they have differences.

*Now I would like to see 1 doc in my result set, and the other 4 should be marked as similar.*

With *MLT*:

    <str name="mlt.fl">text</str>
    <int name="mlt.minwl">5</int>
    <int name="mlt.maxwl">50</int>
    <int name="mlt.maxqt">3</int>
    <int name="mlt.maxntp">5000</int>
    <bool name="mlt.boost">true</bool>
    <str name="mlt.qf">text</str>
    </lst>

With this config I get about 500 similar docs for this 1 doc, which is unfortunately far too many.

*Deduplication*:

I now index these docs with a signature, using TextProfileSignature.

    <updateRequestProcessorChain name="dedupe">
      <processor class="solr.processor.SignatureUpdateProcessorFactory">
        <bool name="enabled">true</bool>
        <str name="signatureField">signature_t</str>
        <bool name="overwriteDupes">false</bool>
        <str name="fields">text</str>
        <str name="signatureClass">solr.processor.TextProfileSignature</str>
      </processor>
      <processor class="solr.LogUpdateProcessorFactory" />
      <processor class="solr.RunUpdateProcessorFactory" />
    </updateRequestProcessorChain>

How can I compare the created signatures? I only want to see the 5 similar docs, nothing else.

Which of these two approaches is relevant to me? Can I tune MLT for my requirement, or should I use Dedupe?

Thanks and Regards
Vadim
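On the question of comparing signatures: the dedupe chain stores one signature value per document, so one simple way to compare is plain equality on that field. The sketch below does this client-side on documents that have already been fetched. It assumes each doc is a dict carrying the signature_t field from the config above; the ids and signature strings in the sample are made up for illustration.

```python
def mark_similar(docs, signature_field="signature_t"):
    """Keep the first doc per signature; attach later docs with the same
    signature to it as 'similar'. Docs without a signature stay on their own."""
    first_by_signature = {}
    results = []
    for doc in docs:
        sig = doc.get(signature_field)
        if sig is not None and sig in first_by_signature:
            # Same signature as an earlier doc: mark as similar, do not list it.
            first_by_signature[sig]["similar"].append(doc)
            continue
        entry = {"doc": doc, "similar": []}
        results.append(entry)
        if sig is not None:
            first_by_signature[sig] = entry
    return results


if __name__ == "__main__":
    # Made-up sample data: docs 1, 2 and 4 share a signature.
    sample = [
        {"id": "1", "signature_t": "aaa111"},
        {"id": "2", "signature_t": "aaa111"},
        {"id": "3", "signature_t": "bbb222"},
        {"id": "4", "signature_t": "aaa111"},
    ]
    for entry in mark_similar(sample):
        similar_ids = [d["id"] for d in entry["similar"]]
        print(entry["doc"]["id"], "similar:", similar_ids)
```

This only works as well as the signature itself: two docs are treated as similar exactly when TextProfileSignature produced the same value for both.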