Re: Similar documents and advantages / disadvantages of MLT / Deduplication

2011-11-16 Thread Chris Hostetter

: I index 1000 docs, 5 of them are 95% the same (for example: copy pasted
: blog articles from different sources, with slight changes (author name,
: etc..)).
: But they have differences.
: *Now I would like to see 1 doc in my result set and the other 4 should be
: marked as similar.*

Do you actually want all 1000 docs in your index, or do you want to prevent
4 of the 5 copies of the doc from being indexed?

Either way, if the TextProfileSignature is doing a good job of
identifying the 5 similar docs, then use that at index time.
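
To illustrate why a fuzzy signature can give near-identical docs the same
value, here is a rough Python sketch of a frequency-profile hash. It is only
loosely modelled on the idea behind TextProfileSignature and is NOT Solr's
actual implementation; the function name, the tokenizer regex and the default
values for min_token_len and quant_rate are assumptions made for this example.

```python
import hashlib
import re
from collections import Counter


def fuzzy_signature(text, min_token_len=2, quant_rate=0.01):
    """Hypothetical, simplified profile signature (not Solr's code)."""
    # Lowercase, split on non-alphanumerics, drop very short tokens.
    tokens = [t for t in re.findall(r"[a-z0-9]+", text.lower())
              if len(t) >= min_token_len]
    counts = Counter(tokens)
    if not counts:
        return hashlib.md5(b"").hexdigest()

    # Quantization step derived from the most frequent token; never below 2,
    # so tokens that occur only once (e.g. an author name) are ignored.
    quant = max(2, round(max(counts.values()) * quant_rate))

    profile = []
    for token, count in counts.items():
        quantized = (count // quant) * quant
        if quantized >= quant:
            profile.append((token, quantized))

    # Sort by quantized frequency (descending), then token, for stable output.
    profile.sort(key=lambda item: (-item[1], item[0]))
    joined = "\n".join(f"{token} {freq}" for token, freq in profile)
    return hashlib.md5(joined.encode("utf-8")).hexdigest()


original = "solr solr search search written by alice"
copied = "solr solr search search written by bob"
unrelated = "lucene lucene index index"
print(fuzzy_signature(original) == fuzzy_signature(copied))
print(fuzzy_signature(original) == fuzzy_signature(unrelated))
```

In this toy version only the rarely occurring words differ between the first
two texts, so they fall below the quantization step and both texts hash to
the same value.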

If you want to keep 4/5 out of the index, then use the Deduplication
features to prevent the duplicates from being indexed and you're done.

If you want all docs in the index, then you have to decide how you want to
mark docs as similar ... do you want to only have one of those docs
appear in all of your results, or do you want all of them in the results
but with an indication that there are other similar docs?  If the former,
then take a look at Grouping and group on your signature field.  If the
latter, use the MLT component to find similar docs based on the signature
field (i.e. mlt.fl=signature_t).

https://wiki.apache.org/solr/FieldCollapsing
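
As a rough sketch of what the two request styles could look like, the Python
below just builds the query URLs. The host, port and /select path are
assumptions about a default local install, the helper names are made up for
this example, and mlt.mintf / mlt.mindf are set to 1 on the assumption that
the defaults would otherwise filter out a signature term that occurs only
once per doc.

```python
from urllib.parse import urlencode

# Assumed location of the select handler; adjust for your own setup.
SOLR_SELECT = "http://localhost:8983/solr/select"


def grouping_url(query, signature_field="signature_t"):
    """One representative doc per signature: group results on the field."""
    params = [
        ("q", query),
        ("group", "true"),
        ("group.field", signature_field),
    ]
    return SOLR_SELECT + "?" + urlencode(params)


def mlt_url(query, signature_field="signature_t"):
    """All docs in the results, each with a list of docs sharing its signature."""
    params = [
        ("q", query),
        ("mlt", "true"),
        ("mlt.fl", signature_field),
        # Assumption: lowered so a term occurring once per doc still counts.
        ("mlt.mintf", "1"),
        ("mlt.mindf", "1"),
    ]
    return SOLR_SELECT + "?" + urlencode(params)


print(grouping_url("text:solr"))
print(mlt_url("text:solr"))
```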

-Hoss


Similar documents and advantages / disadvantages of MLT / Deduplication

2011-11-07 Thread Vadim Kisselmann
Hello folks,

I have questions about MLT and Deduplication and which would be the best
choice in my case.

Case:

I index 1000 docs, 5 of them are 95% the same (for example: copy pasted
blog articles from different sources, with slight changes (author name,
etc..)).
But they have differences.
*Now I would like to see 1 doc in my result set and the other 4 should be
marked as similar.*

With *MLT*:
  <str name="mlt.fl">text</str>
  <int name="mlt.minwl">5</int>
  <int name="mlt.maxwl">50</int>
  <int name="mlt.maxqt">3</int>
  <int name="mlt.maxntp">5000</int>
  <bool name="mlt.boost">true</bool>
  <str name="mlt.qf">text</str>
</lst>

With this config I get about 500 similar docs for this 1 doc, which is
unfortunately too many.


*Deduplication*:
I now index these docs with a signature, and I'm using TextProfileSignature.

<updateRequestProcessorChain name="dedupe">
  <processor class="solr.processor.SignatureUpdateProcessorFactory">
    <bool name="enabled">true</bool>
    <str name="signatureField">signature_t</str>
    <bool name="overwriteDupes">false</bool>
    <str name="fields">text</str>
    <str name="signatureClass">solr.processor.TextProfileSignature</str>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory" />
  <processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>

How can I compare the created signatures?


I only want to see the 5 similar docs, nothing else.
Which of these two approaches is relevant to me? Can I tune MLT for my
requirement? Or should I use Dedupe?

Thanks and Regards
Vadim