No, this isn't MLT, just the standard query parser for now. I did
try the heuristic approach, and I might actually stick with that. I ran
the process on known duplicates and collected all of the resulting
scores. I was then able to see how well the query worked. The scores
seemed focused to one rang
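
For illustration only, here is a minimal sketch of how a cutoff could be derived from scores collected on known duplicates. The function name, the `keep_fraction` parameter and the sample numbers are hypothetical and not taken from the thread; it assumes the scores have already been pulled out of the query results into a plain list.

```python
import math


def threshold_from_known_duplicates(scores, keep_fraction=0.95):
    """Return the score that at least `keep_fraction` of the known
    duplicates meet or exceed (hypothetical helper)."""
    if not scores:
        raise ValueError("need at least one score")
    if not 0 < keep_fraction <= 1:
        raise ValueError("keep_fraction must be in (0, 1]")
    # Highest scores first, so the kept fraction is a prefix of the list.
    ordered = sorted(scores, reverse=True)
    # Index of the last score that still falls inside the kept fraction.
    cutoff_index = max(0, math.ceil(len(ordered) * keep_fraction) - 1)
    return ordered[cutoff_index]


if __name__ == "__main__":
    # Made-up scores for four known duplicate pairs.
    sample = [5.0, 1.0, 4.0, 3.0]
    print(threshold_from_known_duplicates(sample, keep_fraction=0.5))
```

A lower `keep_fraction` ignores more low-scoring outliers among the known duplicates, at the cost of missing some real duplicates later.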
Hi,
Are you using MoreLikeThis for that feature?
I have no suggestion for a reliable threshold. I think this depends
on the domain you are operating in and is IMO only solvable with a heuristic.
It also depends on fields, boosts, ...
It could be that there is a 'score gap' between duplicates and none
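
As a rough sketch of the 'score gap' idea, the snippet below sorts a list of scores and places a cutoff in the middle of the widest gap between neighbouring values. The function name and the sample scores are hypothetical; it assumes the scores for one target document are already available as a plain list, and it says nothing about whether such a gap actually exists in a given index.

```python
def largest_score_gap(scores):
    """Return (threshold, gap), where threshold is the midpoint of the
    widest gap between adjacent sorted scores (hypothetical helper)."""
    if len(scores) < 2:
        raise ValueError("need at least two scores")
    ordered = sorted(scores, reverse=True)
    best_gap = -1.0
    threshold = None
    # Walk over neighbouring pairs and remember the widest drop.
    for higher, lower in zip(ordered, ordered[1:]):
        gap = higher - lower
        if gap > best_gap:
            best_gap = gap
            threshold = (higher + lower) / 2
    return threshold, best_gap


if __name__ == "__main__":
    # Made-up scores: two high-scoring hits, then a clear drop.
    print(largest_score_gap([9.0, 8.0, 2.0, 1.0]))
```

If the widest gap is not much larger than the other gaps, the scores probably do not separate cleanly and a gap-based cutoff would be unreliable.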
I have a Solr index full of documents that contains lots of duplicates.
The duplicates are not exact duplicates, though. Each may vary slightly
in content.
After indexing, I have a bit of code that loops through the entire
index just to get what I'm calling "target" documents. For each target
docume