Re: using score to find high confidence duplicates

Peter Karich Wed, 13 Oct 2010 12:00:50 -0700

Hi,

Are you using MoreLikeThis for that feature?
I have no suggestion for a reliable threshold. I think it depends
on the domain you are operating in and is IMO only solvable with a heuristic.
It also depends on the fields, boosts, ...
There may be a 'score gap' between duplicates and non-duplicates
that you could try to find, but I don't know whether one exists in your data.
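
To make the 'score gap' idea concrete, here is a rough sketch in Python.
It assumes the hits come back sorted by descending score with the target
document first, as in your description. The function names and the
min_gap default are made up for illustration and would need tuning
against your own data; this is one possible heuristic, not a Solr feature.

```python
def normalize_scores(raw_scores):
    """Scale raw scores to 0-100 relative to the top hit (the target itself)."""
    top = raw_scores[0]
    if top <= 0:
        raise ValueError("top score must be positive")
    return [100.0 * s / top for s in raw_scores]


def find_score_gap(scores, min_gap=10.0):
    """Return the index of the first hit after the largest drop in score.

    `scores` must be sorted in descending order. Returns None when the
    largest drop is smaller than `min_gap` (a hypothetical tuning knob),
    i.e. when no clear gap is visible.
    """
    best_gap, cut = 0.0, None
    for i in range(1, len(scores)):
        gap = scores[i - 1] - scores[i]
        if gap > best_gap:
            best_gap, cut = gap, i
    return cut if best_gap >= min_gap else None


# Invented example scores: target first, then similar documents.
raw = [8.0, 7.6, 7.4, 3.2, 2.8]
scores = normalize_scores(raw)
cut = find_score_gap(scores)
# Hits between the target (index 0) and the gap are duplicate candidates.
candidates = scores[1:cut] if cut is not None else []
```

If the largest drop comes right after the target, the candidate list is
empty, which simply means nothing scored close to the target.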


BTW: did you check: http://wiki.apache.org/solr/Deduplication

If you need deduplication at query time, you could compute
a hash value with the procedure described on that page and index it into a separate field.
Then you can use the field collapsing feature on that field to remove duplicates.
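
To illustrate the idea of a hash value that tolerates small differences,
here is a minimal Python sketch. It is not the Solr implementation; it
just hashes the most frequent tokens of a text, so documents that share
the same dominant terms get the same value. The names fuzzy_signature and
collapse_by_signature, and the top_n and min_len parameters, are
assumptions for this example only.

```python
import hashlib
import re
from collections import Counter


def fuzzy_signature(text, top_n=10, min_len=3):
    """Hash the `top_n` most frequent tokens of `text` (case-insensitive)."""
    tokens = [t for t in re.findall(r"[a-z0-9]+", text.lower())
              if len(t) >= min_len]
    counts = Counter(tokens)
    # Sort by descending frequency, then alphabetically, so the result is
    # deterministic when several tokens have the same count.
    top = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))[:top_n]
    profile = " ".join(token for token, _ in top)
    return hashlib.md5(profile.encode("utf-8")).hexdigest()


def collapse_by_signature(docs, top_n=10):
    """Keep the first document seen for each signature (mimics collapsing)."""
    first_seen = {}
    for doc in docs:
        first_seen.setdefault(fuzzy_signature(doc, top_n=top_n), doc)
    return list(first_seen.values())
```

In Solr itself the value would be computed at index time and stored in
its own field, and the collapsing would happen on that field at query
time. How tolerant the value is depends entirely on top_n and min_len,
so those would need experimenting with.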

Regards,
Peter.

> I have a solr index full of documents that contain lots of duplicates.
> The duplicates are not exact duplicates though. Each may vary slightly
> in content.
>
> After indexing, I have a bit of code that loops through the entire
> index just to get what I'm calling "target" documents. For each target
> document, I then send another query to find similar documents to the
> "target". This similarity query includes a clause to match the target
> to itself, so I can have a normalized max score. This was the only way
> I could figure out how to reasonably fix the scoring range. The
> response always includes the target at the top, and similar documents
> afterward. So I take the scores and scale to 0-100, where 100 is
> always the target matching itself. So far so good...
>
> What I want to do is create a confidence score threshold, so I can
> automatically accept similar documents that have a score above the
> threshold. If my query *structure* never changes, but only the values
> in the query change... is it possible to produce a reliable
> "threshold" score that I could use?
>
> Hope this makes sense :)
>
> Matt
>   


-- 
http://jetwick.com twitter search prototype
