Re: remove answers with identical scores

2011-11-25 Thread Ted Dunning
You can do that pretty easily by just retrieving extra documents and post-processing the results list. You are likely to see a significant number of apparent duplicates this way. To really get rid of duplicates in results, it might be better to remove them from the corpus by deploying something
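
A minimal sketch of the post-processing idea above, assuming the results arrive as a list of dicts sorted by descending score, each with a "score" key. The function name dedupe_by_score, the "id" field and the tolerance value are hypothetical, not part of Solr. The caller asks Solr for more rows than needed and keeps only the first document from each run of equal scores. Distinct documents can legitimately share a score, so this may also drop some real results.

```python
def dedupe_by_score(results, wanted, tolerance=1e-6):
    """Keep the first document of each run of (nearly) equal scores.

    Assumes `results` is sorted by descending score, so documents with
    identical scores are adjacent. Stops once `wanted` documents are kept.
    """
    unique = []
    last_score = None
    for doc in results:
        score = doc["score"]
        # Same score as the previously kept document: treat as a duplicate.
        if last_score is not None and abs(score - last_score) <= tolerance:
            continue
        unique.append(doc)
        last_score = score
        if len(unique) == wanted:
            break
    return unique


# Hypothetical over-fetched result list (e.g. rows=50 when 10 are wanted).
results = [
    {"id": "a", "score": 2.5},
    {"id": "b", "score": 2.5},  # same score as "a", dropped
    {"id": "c", "score": 1.0},
    {"id": "d", "score": 1.0},  # same score as "c", dropped
    {"id": "e", "score": 0.5},
]
top = dedupe_by_score(results, wanted=10)
```

If too few documents survive, the caller would have to fetch a further page and repeat, which is the main cost of doing this outside the search engine.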

Re: remove answers with identical scores

2011-11-25 Thread Fred Zimmerman
Thanks. I did consider post-processing and may wind up doing that; I was hoping there was a way to have Solr do it for me! That I have to ask this question is probably not a good sign, but what is LSH clustering? On Fri, Nov 25, 2011 at 4:34 AM, Ted Dunning ted.dunn...@gmail.com wrote: You can

Re: remove answers with identical scores

2011-11-25 Thread Ted Dunning
See http://en.wikipedia.org/wiki/Locality-sensitive_hashing The obvious thought that I had just after hitting send was that you could put the LSH signatures on the documents. That would let you do the scan at low volume and using LSH would make the duplicate scan almost as fast as your score
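
To illustrate what an LSH signature stored on a document might look like, here is a rough sketch using MinHash over word shingles, which is one common LSH scheme and not necessarily the one meant above. All names (shingles, minhash_signature, lsh_band_keys) and the parameter values are assumptions for illustration. Near-duplicate documents tend to share band keys, so a duplicate scan only needs to compare documents that collide on at least one key.

```python
import hashlib


def shingles(text, k=3):
    """Return the set of k-word shingles of a text."""
    words = text.lower().split()
    if len(words) < k:
        return {" ".join(words)}
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def _hash(seed, shingle):
    """Deterministic 64-bit hash of a shingle, varied by seed."""
    digest = hashlib.md5(f"{seed}:{shingle}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def minhash_signature(text, num_hashes=64, k=3):
    """MinHash signature: the minimum hash per seed over all shingles."""
    sh = shingles(text, k)
    return tuple(min(_hash(seed, s) for s in sh) for seed in range(num_hashes))


def lsh_band_keys(signature, bands=16):
    """Split a signature into bands; each (band index, rows) pair is a bucket key."""
    rows = len(signature) // bands
    return [(b, signature[b * rows:(b + 1) * rows]) for b in range(bands)]


def estimated_similarity(sig_a, sig_b):
    """Fraction of matching positions, an estimate of Jaccard similarity."""
    matches = sum(1 for a, b in zip(sig_a, sig_b) if a == b)
    return matches / len(sig_a)


doc = "the quick brown fox jumps over the lazy dog near the river bank"
other = "solr returns ranked documents for every query it receives today"
sig_doc = minhash_signature(doc)
sig_copy = minhash_signature(doc)
sig_other = minhash_signature(other)
```

The band keys could be stored as an extra field on each document at index time, so that candidates for duplicate checking are found by an exact match on that field.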

Re: remove answers with identical scores

2011-11-25 Thread Erick Erickson
Have you considered removing them at index time? See: http://wiki.apache.org/solr/Deduplication Best, Erick On Fri, Nov 25, 2011 at 3:13 PM, Ted Dunning ted.dunn...@gmail.com wrote: See http://en.wikipedia.org/wiki/Locality-sensitive_hashing The obvious thought that I had just after hitting

remove answers with identical scores

2011-11-24 Thread Fred Zimmerman
I have a corpus that has a lot of identical or nearly identical documents. I'd like to return only the unique ones (excluding the nearly identical ones, which are redirects). I notice that all the identical or nearly identical documents have identical Solr scores. How can I tell Solr to throw out all the