Hi everyone,

Following up on my earlier post about hybrid search implementation: after more
production testing, we've made two important discoveries that significantly
improved our results.
Update #1: mm (minimum match) breaks hybrid search
This was a subtle but critical finding.
With strict mm (3<90% 5<75% 8<60%), we saw completely irrelevant results
winning. Example: searching "bed linens 180cm" returned tyres (which had "180"
in their dimensions).
The root cause:
In a function query like:
q={!func}sum(
  product($vw, query($vectorQuery)),
  product($lw, normalized(query($lexicalQuery)))
)
Solr evaluates ALL documents, but:
the KNN subquery scores > 0 only for its topK docs; every other doc gets 0.0
the eDisMax subquery scores > 0 only for docs that satisfy mm; every other doc gets 0.0
Strict mm excluded our good matches ("bed linens 160cm" failed because it
didn't have "180"). Meanwhile, irrelevant docs in the KNN topK won with pure
vector scores.
Solution: use mm=1 in hybrid mode. The vector side already handles relevance
ranking - mm just needs to anchor the lexical score to at least one query term.
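To make this concrete, here's a rough sketch in plain Python with made-up
scores (the tyre/bed-linen numbers are illustrative, not measured):

def hybrid_score(vector_score, lexical_score, vw=1.0, lw=1.0):
    # Mirrors sum(product($vw, vector), product($lw, lexical))
    return vw * vector_score + lw * lexical_score

# Strict mm: "bed linens 160cm" misses the "180" term, so its whole lexical
# contribution is zeroed and it has to compete on vector score alone.
relevant_strict = hybrid_score(vector_score=0.55, lexical_score=0.0)
tyre_strict     = hybrid_score(vector_score=0.62, lexical_score=0.0)

# mm=1: one matching term ("linens") is enough to keep the lexical signal.
relevant_mm1 = hybrid_score(vector_score=0.55, lexical_score=0.31)
tyre_mm1     = hybrid_score(vector_score=0.62, lexical_score=0.05)

print(relevant_strict, tyre_strict)  # 0.55 vs 0.62 -> the tyre wins
print(relevant_mm1, tyre_mm1)        # 0.86 vs 0.67 -> the relevant doc wins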
Update #2: Linear normalization saturates - use ln() instead
Our original formula:
div(query($lexicalQuery), sum(query($lexicalQuery), 10))
Failed when BM25 scores hit thousands (common with phrase boosts):
score=7320 → normalized=0.9986
score=9960 → normalized=0.9990
A 36% raw difference became 0.0004 after normalization - meaningless.
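To see the saturation in isolation, a quick standalone check in plain Python
(the raw scores are just examples):

# Linear form x / (x + 10): once scores reach the thousands, everything
# lands within a hair of 1.0 and ranking differences vanish.
for raw in (73.2, 732.0, 7320.0, 9960.0):
    print(raw, raw / (raw + 10))
# 73.2    0.8798...
# 732.0   0.9865...
# 7320.0  0.9986...   <- the 36% raw gap between 7320 and 9960
# 9960.0  0.9990...      collapses to ~0.0004 here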
Solution: Logarithmic compression:
div(ln(sum(1, query($lexicalQuery))), sum(ln(sum(1, query($lexicalQuery))), $k))
The natural log compresses any score magnitude (tens to thousands) into roughly
the 0-10 range, so k works universally regardless of field weights or phrase
boost configuration. (Note: Solr's log() function is base 10; ln() is the
natural log used in the numbers below.)
With k=20:
ln(7321) / (ln(7321) + 20) ≈ 0.31
ln(9961) / (ln(9961) + 20) ≈ 0.32
The difference is now ~0.01 - visible and meaningful.
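And the same two scores through the log form, reproduced in plain Python
(natural log, k=20):

import math

def log_norm(raw, k=20):
    # Mirrors div(ln(sum(1, score)), sum(ln(sum(1, score)), k))
    x = math.log(1 + raw)
    return x / (x + k)

print(log_norm(7320))  # ~0.308
print(log_norm(9960))  # ~0.315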
Updated recommendation
vectorQuery = {!knn f=embeddings topK=200}[...]
lexicalQuery = {!edismax qf="..." pf="..." mm="1"}user query
q = {!func}sum(
  product(1.0, query($vectorQuery)),
  product(1.0, div(ln(sum(1, query($lexicalQuery))), sum(ln(sum(1, query($lexicalQuery))), 20)))
)
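If it helps, here's a minimal sketch of wiring this up as request parameters
(Solr parameter dereferencing via $vectorQuery / $lexicalQuery). The host,
collection name, fields, and query vector below are placeholders, not our
actual setup:

import requests

params = {
    "q": (
        "{!func}sum("
        "product(1.0, query($vectorQuery)),"
        "product(1.0, div(ln(sum(1, query($lexicalQuery))),"
        " sum(ln(sum(1, query($lexicalQuery))), 20))))"
    ),
    # Real query embedding goes here; "[...]" is a placeholder
    "vectorQuery": "{!knn f=embeddings topK=200}[...]",
    "lexicalQuery": "{!edismax qf='name description' pf='name' mm='1'}bed linens 180cm",
    "fl": "id,name,score",
    "rows": 10,
}

resp = requests.get("http://localhost:8983/solr/products/select", params=params)
print(resp.json()["response"]["docs"])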
Tuning k:
k=5: lexical-heavy (~65% contribution)
k=20: vector-heavy (~32% contribution) ← our sweet spot
k=50: nearly pure semantic (~15% lexical)
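For reference, this is roughly how those percentages fall out for a boosted
BM25 score in the thousands (plain Python, natural log; the raw score is an
assumed example):

import math

raw = 8000  # assumed "typical" boosted BM25 score, for illustration only
x = math.log(1 + raw)
for k in (5, 20, 50):
    print(k, round(x / (x + k), 2))
# 5   0.64  -> lexical-heavy
# 20  0.31  -> our sweet spot
# 50  0.15  -> nearly pure semantic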
Questions
Anyone else hit the mm issue with hybrid? Curious if this is a common gotcha.
Are there other normalization approaches worth exploring? RRF? Different log
bases?
Would there be value in a built-in {!hybrid} query parser that handles this
automatically? Perhaps in future Solr versions?
Happy to share more details or test cases if useful.
Opensolr.com
Your Path to AI Search
https://opensolr.com/faq/view/web-crawler/46/Opensolr-Web-Crawler-Site-Search-Solution
[email protected]
https://opensolr.com
VAT: RO-35410526
