RE: TopK strategy for vectorized chunks in Solr

Burgmans, Tom Thu, 06 Nov 2025 13:24:16 -0800

I don't think the same could be achieved without real nesting, because it 
relies on the Block Join query parser (though I'd be happy if you could prove 
me wrong).
Some 'limitations' of this construction:
- The vector side searches for the top K children. If multiple children 
belong to the same parent, we'll get fewer than K parent vector matches back.
- Updating a single child cannot be done without updating its parent as well. 
Child documents are essentially metadata of the parent.
Those side effects are acceptable for us though.
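
To illustrate the first limitation, here is a minimal sketch in plain Python (not Solr code). The hit list and its scores are made up for illustration; taking the maximum child score per parent mirrors the score=max setting of the {!parent} query quoted further down in this thread.

```python
from typing import Dict, List, Tuple

# (child_id, parent_id, score): hypothetical top K=5 child hits from the ANN search
ChildHit = Tuple[str, str, float]


def collapse_to_parents(child_hits: List[ChildHit]) -> Dict[str, float]:
    """Keep one entry per parent, scored by its best-matching child."""
    best: Dict[str, float] = {}
    for _child_id, parent_id, score in child_hits:
        if parent_id not in best or score > best[parent_id]:
            best[parent_id] = score
    return best


top_k_children: List[ChildHit] = [
    ("doc-1.1", "doc-1", 0.91),
    ("doc-1.2", "doc-1", 0.88),
    ("doc-2.4", "doc-2", 0.85),
    ("doc-1.7", "doc-1", 0.80),
    ("doc-3.1", "doc-3", 0.79),
]

parents = collapse_to_parents(top_k_children)
# Three of the five child hits share the parent doc-1, so only 3 parents remain.
print(len(top_k_children), "child hits ->", len(parents), "parents")
```

In this toy case topK=5 yields only 3 parent matches, so topK may need to be set higher than the number of parent vector matches you actually want.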


A small addition to my previous mail: it is possible to run a hybrid search 
that returns chunks only while filtering on parent metadata. Look at the last 
4 lines of:

{
  params:{
    vectorq:"{!knn f=vector_field topK=30}[13, 4, 73, 66, ...]",
    q:"{!bool filter=$hybridlogic must=$hybridscore}",
    hybridlogic:"{!bool should=$kwq should=$vectorq}",
    hybridscore:"{!func}sum(product($kwweight,$kwq),product($vectorweight,query($vectorq)))",
    kwq:"{!type=edismax qf=\"chunk_body\" v=$qq}",
    qq:"Energy",
    kwweight:1,
    vectorweight:9,
    fq:"{!child of=type_s:parent v=$parentfq}",
    parentfq:"(type_s:parent AND title_en:\"chapter 17\")",
    fq:"{!child of=type_s:parent v=$parentfq2}",
    parentfq2:"(type_s:parent AND subscription_ss:(\"123\" OR \"456\"))"
  }
}

This is really practical because we can keep the child documents small and 
clean and don't need to replicate metadata.
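
As a rough illustration of what the two {!child} filters select, here is a toy in-memory model in Python. It is not Solr code: the sample documents are invented, the parentDoc lookup stands in for Solr's block structure, and the exact-match test on title_en simplifies what would be a phrase query in Solr.

```python
from typing import Callable, Dict, List

# Hypothetical parents: the metadata lives here only, not on the chunks.
parents: Dict[str, dict] = {
    "doc-1": {"title_en": "chapter 17", "subscription_ss": ["123"]},
    "doc-2": {"title_en": "chapter 17", "subscription_ss": ["789"]},
    "doc-3": {"title_en": "chapter 3", "subscription_ss": ["456"]},
}

# Hypothetical chunks: small documents that only point at their parent.
chunks: List[dict] = [
    {"id": "doc-1.1", "parentDoc": "doc-1"},
    {"id": "doc-1.2", "parentDoc": "doc-1"},
    {"id": "doc-2.1", "parentDoc": "doc-2"},
    {"id": "doc-3.1", "parentDoc": "doc-3"},
]


def child_filter(candidates: List[dict],
                 parent_matches: Callable[[dict], bool]) -> List[dict]:
    """Keep the chunks whose parent satisfies the given condition."""
    return [c for c in candidates if parent_matches(parents[c["parentDoc"]])]


def in_chapter_17(parent: dict) -> bool:
    return parent["title_en"] == "chapter 17"


def has_subscription(parent: dict) -> bool:
    return bool({"123", "456"} & set(parent["subscription_ss"]))


# Two fq clauses intersect, so the filters are applied one after the other.
visible = child_filter(child_filter(chunks, in_chapter_17), has_subscription)
print([c["id"] for c in visible])  # ['doc-1.1', 'doc-1.2']
```

Only the chunks of doc-1 pass both filters, although none of the chunks carries title or subscription fields itself.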

Regards, Tom

-----Original Message-----
From: Rahul Goswami <[email protected]>
Sent: Thursday, October 23, 2025 12:10 AM
To: [email protected]
Subject: Re: TopK strategy for vectorized chunks in Solr

Caution, this email may be from a sender outside Wolters Kluwer. Verify the 
sender and know the content is safe.

Guillaume Hoss, Tom,
Thank you for your inputs.

Tom,
Thanks for the detailed explanation. I am also going over your talk as we 
speak. A follow-up on your index design: what advantage does the nested doc 
design provide in this case?

If my understanding is correct, had the parent and child docs been unrelated 
docs connected by a "secondary key" in the child docs (say "ParentId"), you 
could still have used the "join" parser and achieved the same result as the 
"parent" parser, no?

Especially since the JIRA for getting topK parent hits is still in progress 
(https://issues.apache.org/jira/browse/SOLR-17736).

How are you handling any changes to your child docs? (Since you'd need to 
reindex the whole block, I assume?)

Thanks,
Rahul

On Mon, Oct 6, 2025 at 2:49 PM Burgmans, Tom 
<[email protected]> wrote:

> We managed to get our required flavor of hybrid search working in Solr
> via a nested index.
>
> The required flavor: applying both lexical search and vector search in
> a single search call with a logical OR (a document could match purely
> lexically, purely as a vector match, or both). Our documents are large
> enough that chunking is needed and of course no duplication of results
> is allowed.
>
> The nested index is a construction where the parent documents are the
> large original documents and the children are the chunks that are
> vectorized:
>
> <doc>
>         <field name="id">doc-1</field>
>         <field name="type_s">parent</field>
>         <field name="full_doc_title">This is the title text</field>
>         <field name="full_doc_body">This is the full body text</field>
>         <field name="metadata1_s">some metadata</field>
>         <doc>
>                 <field name="id">doc-1.1</field>
>                 <field name="parentDoc">doc-1</field>
>                 <field name="type_s">child</field>
>                 <field name="chunk_body">This is the chunk body text</field>
>                 <field name="chunkoffsets_s">8123-12123</field>
>                 <field name="vector_field"><![CDATA[-0.0037859276]]></field>
>                 <field name="vector_field"><![CDATA[-0.012503299]]></field>
>                 <field name="vector_field"><![CDATA[0.018080892]]></field>
>                 <field name="vector_field"><![CDATA[0.0024048693]]></field>
>                 ...
>         </doc>
>         <doc>
>                 <field name="id">doc-1.2</field>
>                 <field name="parentDoc">doc-1</field>
>                 <field name="type_s">child</field>
>                 <field name="chunk_body">This is the body text of another chunk</field>
>                 <field name="chunkoffsets_s">12200-12788</field>
>                 <field name="vector_field"><![CDATA[-0.0034859276]]></field>
>                 <field name="vector_field"><![CDATA[0.0024048693]]></field>
>                 <field name="vector_field"><![CDATA[-0.016224038]]></field>
>                 <field name="vector_field"><![CDATA[0.025224038]]></field>
>                 ...
>         </doc>
>         <doc>
>         ...
>         </doc>
> </doc>
>
> The query construction below searches the parents lexically and the
> children via ANN search. The result set contains full documents only.
> Balancing the impact of lexical vs vector happens via kwweight and
> vectorweight (these values may change per query, depending on its
> nature). Note that this construction doesn't include score
> normalization, because this is an expensive operation when there are
> many results and, moreover, normalization doesn't guarantee proper
> blending of relevant lexical and vector results.
>
> params:{
>   uf:"* _query_",
>   q:"{!bool filter=$hybridlogic must=$hybridscore}",
>   hybridlogic:"{!bool should=$kwq should=$vectorq}",
>   hybridscore:"{!func}sum(product($kwweight,$kwq),product($vectorweight,query($vectorq)))",
>   kwq:"{!type=edismax qf=\"full_doc_body full_doc_title^3\" v=$qq}",
>   qq:"What is the income tax in New York?",
>   vectorq:"{!parent which=\"type_s:parent\" score=max v=$childq}",
>   childq:"{!knn f=vector_field topK=10}[-0.0034859276,-0.028224038,0.0024048693,...]",
>   kwweight:1,
>   vectorweight:4
> }
>
> This nested index is multi-purpose: for hybrid searching full
> documents (the construction above) and for hybrid searching the chunks
> only (see below).
>
> The following query construction searches the chunks both lexically
> and via ANN search. The result set contains chunks only. This is meant
> for RAG use cases where we're only interested in document chunks as
> context for the LLM.
>
> params:{
>   uf:"* _query_",
>   q:"{!bool filter=$hybridlogic must=$hybridscore}",
>   hybridlogic:"{!bool should=$kwq should=$vectorq}",
>   hybridscore:"{!func}sum(product($kwweight,$kwq),product($vectorweight,query($vectorq)))",
>   kwq:"{!type=edismax qf=\"chunk_body\" v=$qq}",
>   qq:"What is the income tax in New York?",
>   vectorq:"{!knn f=vector_field topK=10}[-0.002503299,-0.001550957,0.018080892,...]",
>   kwweight:1,
>   vectorweight:4
> }
>
> We recently gave a presentation about this and other things at the
> Haystack EU 2025 conference:
> https://www.youtube.com/watch?v=3CPa1MpnLlI
>
>
> Regards, Tom
>
>
>
>
> -----Original Message-----
> From: Rahul Goswami <[email protected]>
> Sent: Sunday, August 31, 2025 2:08 PM
> To: [email protected]
> Subject: Re: TopK strategy for vectorized chunks in Solr
>
> Hello,
> Floating this up again in case anyone has any insights. Thanks.
>
> Rahul
>
> On Fri, Aug 15, 2025 at 11:45 AM Rahul Goswami <[email protected]>
> wrote:
>
> > Hello,
> > A question for folks using Solr as the vector db in their solutions.
> > Since Solr doesn't currently support parent/child or multi-valued
> > vector fields for vector search, what are some strategies that can
> > be used to avoid duplicates in the top K results when you have
> > vectorized chunks for the same (large) document?
> >
> > Would be also helpful to know how folks are doing this when storing
> > vectors in the same docs as the lexical index vs when having the
> > vectorized chunks in a separate index.
> >
> > Thanks.
> > Rahul
> >
>
