Hi everyone,

Following up on my earlier post about hybrid search implementation. After more 
production testing, we've made two important discoveries that significantly 
improved our results.

Update #1: mm (minimum match) breaks hybrid search

This was a subtle but critical finding.

With strict mm (3<90% 5<75% 8<60%), we saw completely irrelevant results 
winning. Example: searching "bed linens 180cm" returned tyres (which had "180" 
in their dimensions).

The root cause:

In a function query like the one below (normalized() is shorthand for the 
normalization formula, not a built-in Solr function):

q={!func}sum(
    product($vw, query($vectorQuery)),
    product($lw, normalized(query($lexicalQuery)))
)

Solr evaluates ALL documents, but:

- KNN returns >0 only for topK docs, else 0.0
- eDisMax returns >0 only if mm is satisfied, else 0.0

Strict mm excluded our good matches ("bed linens 160cm" failed because it 
didn't have "180"), so they lost their entire lexical contribution. Meanwhile, 
irrelevant docs in the KNN topK won with pure vector scores.

Solution: Use mm=1 for hybrid mode. Vectors already handle relevance ranking - 
mm just needs to anchor to some query terms.
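
To make the mm effect concrete, here is a small Python sketch (not Solr code) 
of how an mm spec turns into a required clause count. It follows the 
documented mm rules in simplified form, and it assumes the query is split on 
whitespace into three optional clauses; a real analysis chain may tokenize 
"180cm" differently, so treat the counts as illustrative. The function names 
min_should_match and passes_mm are made up for this example.

```python
def min_should_match(clause_count: int, spec: str) -> int:
    """Required number of matching optional clauses for an mm spec.

    Simplified: supports integers, percentages (rounded down), negative
    values, and space-separated "n<value" conditions without inner spaces.
    """
    spec = spec.strip()
    if "<" in spec:
        result = clause_count  # at or below the lowest bound, every clause is required
        for part in spec.split():
            bound, value = part.split("<", 1)
            if clause_count <= int(bound):
                return result
            result = min_should_match(clause_count, value)
        return result
    if spec.endswith("%"):
        calc = clause_count * int(spec[:-1]) / 100
    else:
        calc = int(spec)
    # Negative values mean "this many clauses may be missing".
    required = clause_count + int(calc) if calc < 0 else int(calc)
    return max(0, min(clause_count, required))


def passes_mm(query_terms, doc_terms, spec: str) -> bool:
    """True if enough query terms occur in the document to satisfy mm."""
    matched = sum(1 for term in query_terms if term in doc_terms)
    return matched >= min_should_match(len(query_terms), spec)


STRICT = "3<90% 5<75% 8<60%"
query = ["bed", "linens", "180cm"]      # assumed whitespace tokenization
doc = {"bed", "linens", "160cm"}        # the good match that was excluded

print(min_should_match(len(query), STRICT))  # 3: all three terms are required
print(passes_mm(query, doc, STRICT))         # False: lexical score drops to 0.0
print(passes_mm(query, doc, "1"))            # True: lexical score survives
```

Under these assumptions a three-term query needs all three terms with the 
strict spec, which is why a near-perfect match contributes nothing lexically 
while mm=1 keeps it in play.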

Update #2: Linear normalization saturates - use logarithmic compression instead

Our original formula:

div(query($lexicalQuery), sum(query($lexicalQuery), 10))

Failed when BM25 scores hit thousands (common with phrase boosts):

score=7320 → normalized=0.9986
score=9960 → normalized=0.9990

A 36% raw difference became 0.0004 after normalization - meaningless.
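
The saturation is easy to reproduce outside Solr. The snippet below is a 
plain Python restatement of the score/(score+10) formula using the two raw 
scores quoted above; saturating_norm is just a name chosen for this sketch.

```python
def saturating_norm(score: float, k: float = 10.0) -> float:
    """score / (score + k): approaches 1.0 once score is much larger than k."""
    return score / (score + k)


low, high = 7320.0, 9960.0
print(f"{saturating_norm(low):.4f}")    # 0.9986
print(f"{saturating_norm(high):.4f}")   # 0.9990

# Raw scores differ by about 36%, normalized values by less than 0.0004.
gap = saturating_norm(high) - saturating_norm(low)
print(f"raw ratio={high / low:.2f} normalized gap={gap:.5f}")
```

Once the raw score is hundreds of times larger than the constant, almost all 
of the ranking signal between two strong lexical matches is lost.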

Solution: Logarithmic compression. The numbers below use the natural log, 
which is ln() in Solr function queries (Solr's log() is base 10):

div(ln(sum(1, query($lexicalQuery))), sum(ln(sum(1, query($lexicalQuery))), $k))

The natural log compresses any score magnitude (tens to thousands) into 
roughly the 0-10 range. The same k then works regardless of field weights or 
phrase boost configuration.

With k=20:

ln(7321)/(ln(7321)+20) ≈ 0.31
ln(9961)/(ln(9961)+20) ≈ 0.32

Difference is now about 0.007 (0.308 vs 0.315), roughly 20x the previous gap - 
visible and meaningful.
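
The k=20 figures can be checked with a few lines of Python. This sketch also 
evaluates base 10 for comparison, because the quoted values only come out 
with a natural log, and in Solr function queries that is ln() while log() is 
base 10. The helper name log_norm is an assumption of this example.

```python
import math


def log_norm(score: float, k: float, log=math.log) -> float:
    """log(1 + score) / (log(1 + score) + k), natural log by default."""
    compressed = log(1.0 + score)
    return compressed / (compressed + k)


for score in (7320.0, 9960.0):
    natural = log_norm(score, k=20)
    base10 = log_norm(score, k=20, log=math.log10)
    print(f"score={score:.0f} natural={natural:.3f} base10={base10:.3f}")
```

With the natural log the two scores land near 0.308 and 0.315; with base 10 
both come out noticeably lower, so the choice of function changes what a 
given k means.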

Updated recommendation

vectorQuery = {!knn f=embeddings topK=200}[...]
lexicalQuery = {!edismax qf="..." pf="..." mm="1"}user query

q = {!func}sum(
    product(1.0, query($vectorQuery)),
    product(1.0, div(ln(sum(1, query($lexicalQuery))), sum(ln(sum(1, query($lexicalQuery))), 20)))
)
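
For anyone wiring this into a client, here is a hedged Python sketch that 
only assembles the request parameters and sends nothing. It uses ln() 
(Solr's natural log) because the k percentages quoted in this post match a 
natural log. build_hybrid_params, the vw/lw/k parameter names and the 
example qf/pf value "title" are hypothetical; substitute your own fields and 
embedding vector.

```python
from urllib.parse import urlencode


def build_hybrid_params(user_query, vector, qf, pf,
                        top_k=200, k=20, vw=1.0, lw=1.0, field="embeddings"):
    """Assemble Solr params for the hybrid function query (no HTTP call)."""
    vec = "[" + ",".join(repr(float(x)) for x in vector) + "]"
    # ln(1 + lexical) / (ln(1 + lexical) + k), written as Solr functions.
    lex_norm = ("div(ln(sum(1,query($lexicalQuery))),"
                "sum(ln(sum(1,query($lexicalQuery))),$k))")
    return {
        "q": "{!func}sum(product($vw,query($vectorQuery)),"
             "product($lw," + lex_norm + "))",
        "vectorQuery": f"{{!knn f={field} topK={top_k}}}{vec}",
        "lexicalQuery": f'{{!edismax qf="{qf}" pf="{pf}" mm="1"}}{user_query}',
        "vw": vw,
        "lw": lw,
        "k": k,
    }


# Hypothetical field name and a toy two-dimensional vector.
params = build_hybrid_params("bed linens 180cm", [0.1, 0.2], qf="title", pf="title")
print(params["vectorQuery"])    # {!knn f=embeddings topK=200}[0.1,0.2]
print(params["lexicalQuery"])
print(urlencode(params))        # query string for a /select request
```

Passing vw, lw and k as separate parameters keeps the function query string 
fixed, so weights can be tuned per request without rebuilding q.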

Tuning k (normalized lexical score for raw BM25 scores in the thousands):

- k=5: lexical-heavy (~65% contribution)
- k=20: vector-heavy (~32% contribution) ← our sweet spot
- k=50: nearly pure semantic (~15% lexical)
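
To see how k interacts with score magnitude, the following Python sketch 
tabulates the normalized lexical value for a few raw scores. The raw scores 
are arbitrary sample magnitudes, not measurements, and log_norm is a helper 
defined only for this example (natural log, matching the figures above).

```python
import math


def log_norm(score: float, k: float) -> float:
    """ln(1 + score) / (ln(1 + score) + k)."""
    compressed = math.log(1.0 + score)
    return compressed / (compressed + k)


raw_scores = (10, 100, 1000, 7320)   # sample magnitudes only
print("k".rjust(4) + "".join(f"{s:>8}" for s in raw_scores))
for k in (5, 20, 50):
    row = "".join(f"{log_norm(s, k):8.2f}" for s in raw_scores)
    print(f"{k:>4}{row}")
```

The percentages listed for each k correspond to the right-hand column (raw 
scores in the thousands); lower raw scores yield smaller lexical values for 
the same k, so it is worth rerunning the table with your own score range.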

Questions

- Anyone else hit the mm issue with hybrid? Curious if this is a common gotcha.
- Are there other normalization approaches worth exploring? RRF? Different log 
  bases?
- Would there be value in a built-in {!hybrid} query parser that handles this 
  automatically? Perhaps in future Solr versions?

Happy to share more details or test cases if useful.



Opensolr.com



