Re: using score to find high confidence duplicates

2010-10-13 Thread Matt Mitchell
No this isn't the MLT, just the standard query parser for now. I did try the heuristic approach and I might stick with that actually. I ran the process on known duplicates and created a collection of all scores. I was then able to see how well the query worked. The scores seemed focused to one rang

Re: using score to find high confidence duplicates

2010-10-13 Thread Peter Karich
Hi, are you using moreLikeThis for that feature? I have no suggestion for a reliable threshold, I think this depends on the domain you are operating and is IMO only solvable with a heuristic. It also depends on fields, boosts, ... It could be that there is a 'score gap' between duplicates and none

using score to find high confidence duplicates

2010-10-13 Thread Matt Mitchell
I have a solr index full of documents that contain lots of duplicates. The duplicates are not exact duplicates though. Each may vary slightly in content. After indexing, I have a bit of code that loops through the entire index just to get what I'm calling "target" documents. For each target docume