See http://en.wikipedia.org/wiki/Locality-sensitive_hashing

The obvious thought that I had just after hitting send was that you could
put the LSH signatures on the documents themselves.  That would let you run
the duplicate scan over a small result set, and comparing LSH signatures
would make that scan almost as fast as your score scan idea.
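
A rough sketch of what such a signature could look like, using MinHash
over word shingles (one common LSH scheme, not necessarily the one you
would end up with).  The names minhash_signature, estimated_similarity,
band_keys, NUM_HASHES and SHINGLE_SIZE are all made up for illustration,
the parameter values are guesses, and storing the result in a Solr field
is assumed rather than shown:

```python
import hashlib
import re

NUM_HASHES = 64    # signature length; an assumed value, tune for your corpus
SHINGLE_SIZE = 3   # words per shingle; also an assumption


def shingles(text, k=SHINGLE_SIZE):
    """Return the set of k-word shingles in text (case and spacing ignored)."""
    words = re.findall(r"\w+", text.lower())
    if len(words) < k:
        return {" ".join(words)}
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def _hash(seed, shingle):
    """Seeded 64-bit hash; each seed stands in for one hash function."""
    digest = hashlib.md5(f"{seed}:{shingle}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def minhash_signature(text, num_hashes=NUM_HASHES):
    """MinHash signature: the minimum hash of any shingle, per seed."""
    sh = shingles(text)
    return tuple(min(_hash(seed, s) for s in sh) for seed in range(num_hashes))


def estimated_similarity(sig_a, sig_b):
    """Fraction of matching positions, an estimate of Jaccard similarity."""
    matches = sum(a == b for a, b in zip(sig_a, sig_b))
    return matches / len(sig_a)


def band_keys(signature, rows_per_band=4):
    """Split a signature into bands; near-duplicates tend to share a band."""
    return [
        (i // rows_per_band, signature[i:i + rows_per_band])
        for i in range(0, len(signature), rows_per_band)
    ]
```

Two documents whose signatures agree in most positions are probably near
duplicates, so the scan only has to compare short signatures instead of
full text.  The band keys are what you would index if you wanted to look up
candidate duplicates directly.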

Whether Solr will do this for you is really neither here nor there.  Solr
does an awful lot of stuff for an awful lot of people who find it very
congenial.  They probably don't have lots of duplicate documents.  If you
really think that this capability is core, then you can contribute an
implementation to Solr and all will be made whole.  In the short term, I
would recommend you prototype independently.
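
For the simple post-processing route, an independent prototype can be very
small.  This is only a sketch: it assumes you have already pulled back more
rows than you need as (doc id, score) pairs sorted by descending score, and
drop_successive_duplicates, wanted and tolerance are names I made up.
Remember that equal scores only suggest duplicates, they don't prove it:

```python
def drop_successive_duplicates(results, wanted=10, tolerance=1e-6):
    """Keep the first document of each run of (nearly) equal scores.

    results: iterable of (doc_id, score) pairs, sorted by descending score.
    wanted: stop once this many documents have been kept.
    tolerance: scores closer than this are treated as identical.
    """
    unique = []
    last_score = None
    for doc_id, score in results:
        # Skip a document whose score matches the one we just kept.
        if last_score is not None and abs(score - last_score) <= tolerance:
            continue
        unique.append((doc_id, score))
        last_score = score
        if len(unique) == wanted:
            break
    return unique
```

Fed the four documents from your example, this keeps doc 1 and doc 4 and
skips docs 2 and 3.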

On Fri, Nov 25, 2011 at 4:47 AM, Fred Zimmerman <zimzaz....@gmail.com> wrote:

> thanks.  i did consider postprocessing and may wind up doing that, i was
> hoping there was a way to have Solr do it for me! that I have to ask this
> question is probably not a good sign, but what is LSH clustering?
>
> On Fri, Nov 25, 2011 at 4:34 AM, Ted Dunning <ted.dunn...@gmail.com>
> wrote:
>
> > You can do that pretty easily by just retrieving extra documents and post
> > processing the results list.
> >
> > You are likely to have a significant number of apparent duplicates this
> > way.
> >
> > To really get rid of duplicates in results, it might be better to remove
> > them from the corpus by deploying something like LSH clustering.
> >
> > On Thu, Nov 24, 2011 at 5:04 PM, Fred Zimmerman <zimzaz....@gmail.com> wrote:
> >
> > > I have a corpus that has a lot of identical or nearly identical documents.
> > > I'd like to return only the unique ones (excluding the "nearly identical"
> > > which are redirects).  I notice that all the identical/nearly identicals
> > > have identical Solr scores. How can I tell Solr to throw out all the
> > > successive documents in an answer set that have identical scores?
> > >
> > > doc 1 score 5.0
> > > doc 2 score 5.0
> > > doc 3 score 5.0
> > > doc 4 score 4.9
> > >
> > > skip docs 2 and 3
> > >
> > > bring back 10 docs with unique scores
> > >
> >
>
