Hi,

are you using moreLikeThis for that feature? I have no suggestion for a reliable threshold. I think this depends on the domain you are operating in and is IMO only solvable with a heuristic. It also depends on fields, boosts, ... There may be a 'score gap' between duplicates and non-duplicates which you can try to find, but I don't know for sure.
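
To make the 'score gap' idea a bit more concrete, here is a rough sketch in Python. It is only an illustration under assumptions: it takes the raw scores of one similarity query (target first), scales them to 0-100 as you describe, and looks for the largest drop between neighbouring scores. The names `normalize_scores` and `find_score_gap`, the `min_gap` default of 10 and the sample scores are all made up; nothing here is a Solr API.

```python
from typing import List, Optional, Tuple


def normalize_scores(raw_scores: List[float]) -> List[float]:
    """Scale raw scores to 0-100 so that the highest score maps to 100."""
    if not raw_scores:
        return []
    top = max(raw_scores)
    if top <= 0:
        return [0.0] * len(raw_scores)
    return [100.0 * score / top for score in raw_scores]


def find_score_gap(
    normalized: List[float], min_gap: float = 10.0
) -> Optional[Tuple[float, float]]:
    """Find the largest drop between neighbouring candidate scores.

    Returns (threshold, gap), with the threshold at the midpoint of the
    drop, or None if there are fewer than two candidates or no drop
    reaches min_gap (an arbitrary, hypothetical default).
    """
    # Sort descending and skip the first entry: it is the target itself.
    candidates = sorted(normalized, reverse=True)[1:]
    best: Optional[Tuple[float, float]] = None
    for higher, lower in zip(candidates, candidates[1:]):
        gap = higher - lower
        if gap >= min_gap and (best is None or gap > best[1]):
            best = ((higher + lower) / 2.0, gap)
    return best


# Made-up raw scores for one target query; the first entry is the target.
raw = [8.0, 7.6, 7.2, 3.2, 2.4, 1.6]
scores = normalize_scores(raw)
result = find_score_gap(scores)
if result is None:
    print("no clear gap, fall back to manual review")
else:
    threshold, gap = result
    accepted = [s for s in scores[1:] if s > threshold]
    print(f"threshold={threshold:.1f} gap={gap:.1f} accepted={len(accepted)}")
```

This only finds a gap for a single target. Whether the gaps line up across many targets, so that one global threshold works, is something you would have to check on your own data.
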

BTW: did you check http://wiki.apache.org/solr/Deduplication ?

If you need deduplication while querying, you could compute a hash value with the procedure described there and index it into a separate field. Then you can use the collapse feature on that field to remove duplicates.

Regards,
Peter.

> I have a solr index full of documents that contain lots of duplicates.
> The duplicates are not exact duplicates though. Each may vary slightly
> in content.
>
> After indexing, I have a bit of code that loops through the entire
> index just to get what I'm calling "target" documents. For each target
> document, I then send another query to find similar documents to the
> "target". This similarity query includes a clause to match the target
> to itself, so I can have a normalized max score. This was the only way
> I could figure out how to reasonably fix the scoring range. The
> response always includes the target at the top, and similar documents
> afterward. So I take the scores and scale to 0-100, where 100 is
> always the target matching itself. So far so good...
>
> What I want to do is create a confidence score threshold, so I can
> automatically accept similar documents that have a score above the
> threshold. If my query *structure* never changes, but only the values
> in the query change... is it possible to produce a reliable
> "threshold" score that I could use?
>
> Hope this makes sense :)
>
> Matt
>

--
http://jetwick.com twitter search prototype