from all the examples of what you've described, i'm fairly certain all you
really need is a TFIDF based Similarity where coord(), idf(), tf() and
queryNorm() always return 1, and you omitNorms from all fields.

Yeah, that's what I did in the very first iteration. It works only for cases #1 and #2. If you try queries #3 and #4 with such a Similarity, you'll get:

3. place:(34\ High\ Street)^3 => doc1(score=9), doc2(score=9)
4. name:DocumentOne^7 OR place:(34\ High\ Street)^3 => doc1(score=16), doc2(score=9)

That is not what I need. As I described above, when multiple tokens match in a field, SimScorer.score is called X times, where X is the number of matched tokens (3 tokens in cases #3 and #4), so the scores add up. I need to score only once in this case, regardless of the number of tokens.
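To make the arithmetic concrete, here is a small plain-Java sketch (no Lucene classes involved). It assumes every matched token contributes exactly the field boost, which is what happens when tf(), idf(), coord() and queryNorm() all return 1. The class and method names are made up for illustration, and the lowercase token texts are only an assumption about what the analyzer produces.

```java
import java.util.Arrays;
import java.util.List;

public class SummedVsOnce {

    // Mimics the current behaviour: one score() call per matched token,
    // each contributing the boost, so the contributions add up.
    static float summed(List<String> matchedTokens, float boost) {
        float total = 0f;
        for (int i = 0; i < matchedTokens.size(); i++) {
            total += boost;
        }
        return total;
    }

    // Mimics the behaviour I am after: the field contributes its boost
    // once if at least one token matched, no matter how many matched.
    static float once(List<String> matchedTokens, float boost) {
        return matchedTokens.isEmpty() ? 0f : boost;
    }

    public static void main(String[] args) {
        // Hypothetical tokens for place:(34\ High\ Street)
        List<String> tokens = Arrays.asList("34", "high", "street");
        System.out.println("summed = " + summed(tokens, 3f)); // 3 tokens * boost 3
        System.out.println("once   = " + once(tokens, 3f));   // boost counted once
    }
}
```

With three matched tokens and a boost of 3, summed gives the 9 seen in query 3, while once gives 3.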

How to do it? My first idea was a HashSet keyed on fieldName, so that after scoring once it doesn't score again. But then only the first document got a score (since the second and all other documents have the same field name). So I realized I also need the docID in the key. That worked fine until I found out (thank you for that) that docID is segment-specific. So now I need a segment ID as well (or something similar).
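A rough sketch of the key I have in mind, again in plain Java without any Lucene classes. ScoreOnceTracker, Key and segmentOrd are hypothetical names. I am assuming the scorer can get some per-segment identifier from its reader context (for example the segment's ordinal or docBase); how that value is obtained is not shown here. The tracker would also have to be created fresh for each query, and it is not thread-safe.

```java
import java.util.HashSet;
import java.util.Objects;
import java.util.Set;

public class ScoreOnceTracker {

    // Hypothetical composite key: segment identifier + segment-local docID + field name.
    static final class Key {
        final int segmentOrd;
        final int docId;
        final String field;

        Key(int segmentOrd, int docId, String field) {
            this.segmentOrd = segmentOrd;
            this.docId = docId;
            this.field = field;
        }

        @Override
        public boolean equals(Object o) {
            if (this == o) return true;
            if (!(o instanceof Key)) return false;
            Key other = (Key) o;
            return segmentOrd == other.segmentOrd
                    && docId == other.docId
                    && field.equals(other.field);
        }

        @Override
        public int hashCode() {
            return Objects.hash(segmentOrd, docId, field);
        }
    }

    private final Set<Key> seen = new HashSet<>();

    // Returns the boost the first time a (segment, doc, field) triple is seen
    // and 0 for every later call with the same triple.
    float scoreOnce(int segmentOrd, int docId, String field, float boost) {
        return seen.add(new Key(segmentOrd, docId, field)) ? boost : 0f;
    }
}
```

With this key, docID 0 in segment 0 and docID 0 in segment 1 are treated as different documents, which is the case the fieldName + docID version got wrong.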


(You didn't give any examples of what you expect to happen with exclusion
clauses in your BooleanQueries

I won't need exclusion clauses for my purposes, but the same thing would happen there: the document would be scored according to the clause's weight, because the condition is true:

5. (NOT name:DocumentOne)^7 => doc2(score=7)
