Hi Rahul,

Here we go!

1) Currently, I have a total of 1.8 million Solr documents (approximately
600,000 parent documents and 1.2 million chunks). Some documents have only
one chunk, while others have dozens.

2) Using docValues on the join fields greatly improves performance, but
joining large volumes is inevitably costly. It is still possible to use
multiple nodes within a Solr cluster to divide the load, by simply placing
each document and its chunks in the same shard (see the routing sketch
after this list).

3) Injecting document properties into the chunks would eliminate the most
costly join. However, it would restrict the filters that can be applied to
parent documents to the fields selected and cloned into the chunks. That is
a compromise I am unable to make at this time.

4) Not really. We only control the number of chunks returned; the number of
documents is generally lower, since several chunks from the same document
may be found. In my use case this isn't a problem, and I increase the topK
a little to compensate.

5) Pagination is not a problem, because we actually paginate on the final
join (chunks -> documents). Pagination is then identical to a classic
search on documents.
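On 2), here is a minimal sketch of the co-location idea, assuming a
collection that uses the default compositeId router (a hypothetical id
scheme, just to illustrate):

Document:  id = DOC_1!DOC_1
Chunk 1:   id = DOC_1!CHUNK_1_1
Chunk 2:   id = DOC_1!CHUNK_1_2

The part before the "!" determines which shard a record lands on, so a
document and all of its chunks end up on the same shard, and each per-shard
join still sees every chunk next to its parent.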
Perhaps indexing nested documents
(https://solr.apache.org/guide/solr/latest/indexing-guide/indexing-nested-documents.html)
would make it possible to do away with the joins. I haven't tested it yet,
because it would require a compromise: reindexing a document whenever its
chunks change (which isn't necessary with my current approach). But if it
eliminates the joins, it could be an interesting option; a sketch of what
it might look like follows below.
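Untested on my side, but based on the nested-documents guide it might look
like this (field names reused from the example in my earlier message quoted
below; the content_type discriminator field is an assumption on my part):

{
  "id": "DOC_1",
  "content_type": "document",
  "title": "2025 Annual Report",
  "document_type": "PDF",
  "chunks": [
    { "id": "CHUNK_1_1", "content_type": "chunk",
      "text": "<text of the first chunk>",
      "vector": [<embedding of the first chunk>], "position": 0 },
    { "id": "CHUNK_1_2", "content_type": "chunk",
      "text": "<text of the second chunk>",
      "vector": [<embedding of the second chunk>], "position": 1 }
  ]
}

The chunk search would then use the block-join parent parser instead of
{!join}:

q={!parent which="content_type:document" score=max}{!knn f=vector topK=100
preFilter=$type_prefilter}[0.255,0.36,...]&type_prefilter={!child
of="content_type:document"}document_type:PDF

{!parent} rolls the chunks up to their enclosing document (again keeping
the best chunk score with score=max), and {!child} replaces the second join
for pre-filtering on parent fields. The trade-off remains: the whole block
must be reindexed together whenever a chunk changes.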
Again, here is the link to the video where Alessandro Benedetti discusses
possible solutions to this problem:
https://youtu.be/9KJTbgtFWOU?si=YAUPNvfDhlX3NmJc&t=1450
The "Block Join" approach is discussed there.

Guillaume

On Thu, Sep 18, 2025 at 06:15, Rahul Goswami <rahul196...@gmail.com> wrote:

> Hi Guillaume,
> Thank you for a detailed and thoughtfully crafted response with examples.
> And apologies for not being able to respond sooner.
>
> I do have a few follow-up questions:
>
> 1) How many docs (parents + chunks) does your index hold?
>
> 2) Is the query-time join scaling well?
>
> 3) For pre-filtering, did you consider duplicating minimal metadata in
> the chunk docs to be able to directly pre-filter on chunks instead of
> joining back to the parent?
>
> 4) Do you have a use case for fetching the top K parents instead? If so,
> how are you reliably achieving this, since multiple chunks could
> correspond to the same parent, shrinking your result set?
>
> 5) As a follow-up to #4, do you have a use case for pagination based on
> vector search? If so, how are you achieving it (because of the same
> constraint as in #4)?
>
> Thanks in advance!
>
> -Rahul
>
> On Tue, Sep 2, 2025 at 1:16 PM Guillaume <gjac...@gmail.com> wrote:
>
> > Hello Rahul,
> >
> > Currently, I'm using the following topology:
> >
> > * I index my document records in the usual way.
> > * I index the chunk records by referencing their parent record id.
> >
> > Concretely, this looks like (simplified version):
> >
> > Document 1
> > -id: DOC_1
> > -title: 2025 Annual Report
> > -document_type: PDF
> >
> > Chunk 1 of document 1
> > -id: CHUNK_1_1
> > -text: <text of the first chunk>
> > -vector: <embedding of the first chunk>
> > -parent_id: DOC_1
> > -position: 0
> >
> > Chunk 2 of document 1
> > -id: CHUNK_1_2
> > -text: <text of the second chunk>
> > -vector: <embedding of the second chunk>
> > -parent_id: DOC_1
> > -position: 1
> > ...
> >
> > When I want to retrieve documents via a semantic search on the chunks,
> > I use a join, like this:
> >
> > q={!join from=parent_id to=id score=max}{!knn f=vector
> > topK=100}[0.255,0.36,...]
> >
> > Using the aggregation guarantees that I won't get duplicate documents
> > in the result set. However, even though I request 100 chunks (topK),
> > I'll probably get fewer documents, because several chunks may belong
> > to the same document. I use the "max" aggregation to rank documents by
> > their best chunk.
> >
> > If I need to apply a filter on the *documents* (e.g., restrict the
> > semantic search to PDF documents), things get a bit more complicated,
> > because the filtering must happen in the preFilter of the KNN search.
> > Here is an example:
> >
> > q={!join from=parent_id to=id score=max}{!knn f=vector topK=100
> > preFilter=$type_prefilter}[0.255,0.36,...]&type_prefilter={!join
> > from=id to=parent_id score=none}document_type:PDF
> >
> > The pre-filtering is performed on the *documents*. Then, the join
> > fetches the chunks associated with the documents that satisfy the
> > constraint (document_type:PDF). Those resulting chunks become the
> > corpus for the main semantic search (through preFilter).
> >
> > This indexing system works great for me because it lets me manage
> > document indexing and chunk indexing in a completely decoupled way.
> > Solutions based on "partial updates" or "nested documents" are
> > problematic for me because I can't guarantee that all fields are
> > stored, and I don't want to have to rebuild the documents when I
> > index chunks.
> >
> > I'm sure a better way to do this must exist, especially because joins
> > always end up becoming a problem as the number of documents grows
> > (even with docValues).
> >
> > Hope this helps you!
> >
> > By the way, here is an excellent video by Alessandro Benedetti that I
> > thought you might like:
> > https://youtu.be/9KJTbgtFWOU?si=YAUPNvfDhlX3NmJc&t=1450
> >
> > Guillaume
> >
> > On Sun, Aug 31, 2025 at 16:08, Sergio García Maroto
> > <marot...@gmail.com> wrote:
> >
> > > Hi Rahul.
> > >
> > > Have you explored the possibility of using streaming expressions?
> > > You could get back tuples and group them.
> > >
> > > Regards,
> > > Sergio
> > >
> > > On Sun, Aug 31, 2025 at 14:09, Rahul Goswami <rahul196...@gmail.com>
> > > wrote:
> > >
> > > > Hello,
> > > > Floating this up again in case anyone has any insights. Thanks.
> > > >
> > > > Rahul
> > > >
> > > > On Fri, Aug 15, 2025 at 11:45 AM Rahul Goswami
> > > > <rahul196...@gmail.com> wrote:
> > > >
> > > > > Hello,
> > > > > A question for folks using Solr as the vector db in their
> > > > > solutions. As of now, since Solr doesn't support parent/child
> > > > > relationships or multi-valued vector fields for vector search,
> > > > > what are some strategies that can be used to avoid duplicates in
> > > > > the top K results when you have vectorized chunks for the same
> > > > > (large) document?
> > > > >
> > > > > It would also be helpful to know how folks are doing this when
> > > > > storing vectors in the same docs as the lexical index vs. when
> > > > > having the vectorized chunks in a separate index.
> > > > >
> > > > > Thanks.
> > > > > Rahul