We are facing a performance issue when searching in child documents. In order 
to explain the issue, I will provide a very simplified excerpt of our data 
model.

We are making a search engine for libraries. What we want to deliver to the 
users are "works". An example of a work could be Harry Potter and the Goblet of 
Fire. Each work can have several manifestations; for example there is a book 
version of the work, an audiobook, and maybe an e-book. Of course, there are 
properties at the work level (like creator, title, subjects, etc) and other 
properties at the manifestation level (like publication year, material type, 
etc).

We have modelled this with parent documents and child documents in Solr, and 
have built a search engine on it. The search engine can search for things like 
creators, titles, and subjects at the work level, but users should also be 
allowed to search for things from a specific year or be able to specify that 
they are only interested in things that are available as e-books.
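
To make the model concrete, here is a rough sketch of how one work with its 
manifestations could look as a nested document. Only `doc_type` and 
`materialType` appear in the query quoted in this message; every other field 
name, the `manifestation` value, the IDs, and the years are placeholders chosen 
for illustration, and `_childDocuments_` is just one of the ways Solr accepts 
child documents in JSON.

```python
import json

# One parent ("work") with three children ("manifestations").
# Field names other than doc_type and materialType are hypothetical.
work = {
    "id": "work-1",
    "doc_type": "work",
    "title": "Harry Potter and the Goblet of Fire",
    "creator": "J. K. Rowling",
    "_childDocuments_": [
        {"id": "work-1-m1", "doc_type": "manifestation",
         "materialType": "book", "publicationYear": 2000},
        {"id": "work-1-m2", "doc_type": "manifestation",
         "materialType": "audiobook", "publicationYear": 2001},
        {"id": "work-1-m3", "doc_type": "manifestation",
         "materialType": "e-book", "publicationYear": 2012},
    ],
}

# The JSON update handler takes a list of such documents.
payload = json.dumps([work], indent=2)
print(payload)
```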

We have around 28 million works in Solr and 41 million manifestations, 
indexed as child documents (so many works have only one manifestation).

As long as the user searches for things at the work level, the performance 
is fine. But as you can imagine, when users search for things at the 
manifestation level, the performance worsens. As an example, if we make a 
search for a creator, the search executes in less than 200 ms and results in 
maybe 30 hits. If we add a clause for a material type (with a 
`{!parent which='doc_type:work'}materialType:"book"` construction), the search 
takes several seconds. In this case we want the restriction to books to be part 
of the ranking, so putting it in a filter query (which does not affect scoring) 
would pose a problem.
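
For reference, a minimal sketch of how the combined request could be assembled. 
It assumes the default lucene query parser (so that an embedded `{!parent ...}` 
clause is accepted), a hypothetical work-level field named `creator`, and a 
hypothetical request parameter `childq` that carries the child clause via 
parameter dereferencing. The function names are made up for this example.

```python
def quote_phrase(value):
    """Wrap a value in double quotes, escaping backslashes and quotes."""
    escaped = value.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'


def build_work_query_params(creator, material_type):
    """Build Solr request parameters for a scored parent + child search.

    The child clause is kept in the main query (not in fq), so that it
    still takes part in scoring.
    """
    return {
        # Work-level clause plus a block join clause on the children.
        "q": f'+creator:{quote_phrase(creator)} '
             '+{!parent which="doc_type:work" v=$childq}',
        # Dereferenced by v=$childq above; the parameter name is arbitrary.
        "childq": f"materialType:{quote_phrase(material_type)}",
        "fl": "id,score",
    }


params = build_work_query_params("J. K. Rowling", "book")
for name, value in params.items():
    print(f"{name}={value}")
```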

I am wondering if there is some way to evaluate the search at the work level 
first, and only afterwards filter down to those works that have manifestations 
matching the child requirements. I could run the search on work-level 
properties first and fetch only the IDs, and then run the full search, 
including the manifestation-level requirements, with an added filter query on 
those IDs. But I am wondering if there is a better way to do this.
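
The two-step workaround described above could be sketched roughly as follows. 
The `search` argument is a hypothetical callable that sends a parameter dict to 
Solr and returns the parsed JSON response; `childq` and `max_ids` are likewise 
names invented for this sketch. The `{!terms f=id}` filter assumes the unique 
key field is called `id`. Whether this is actually faster than the single 
query is something that would have to be measured.

```python
def two_phase_search(search, work_query, child_query, max_ids=1000):
    """Run a cheap work-level search first, then the full scored search
    restricted to the work IDs found in the first step."""
    # Phase 1: work-level clauses only, fetching nothing but IDs.
    first = search({"q": work_query, "fl": "id", "rows": max_ids})
    ids = [doc["id"] for doc in first["response"]["docs"]]
    if not ids:
        return []

    # Phase 2: full query, with the child clause still in q so that it
    # contributes to ranking, and the candidate IDs as a filter query.
    second = search({
        "q": f'+({work_query}) +{{!parent which="doc_type:work" v=$childq}}',
        "childq": child_query,
        "fq": "{!terms f=id}" + ",".join(ids),
        "fl": "id,score",
        "rows": len(ids),
    })
    return second["response"]["docs"]
```

Note that phase 1 caps the candidate set at `max_ids`, so a broad work-level 
query would silently lose hits unless that limit is raised or paged through.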

I have also looked at denormalizing 
(https://blog.innoventsolutions.com/innovent-solutions-blog/2018/05/avoid-the-parentchild-trap-tips-and-tricks-for-denormalizing-solr-data.html)
and it helps when doing it for a few child fields. But as mentioned, the real 
model has more properties than those I have described here, so that approach 
also involves some complications.
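
As an illustration of the denormalizing idea, the sketch below copies selected 
child fields up to the work as multi-valued fields. The `manifestation_` prefix, 
the `publicationYear` field, and the `_childDocuments_` key are assumptions made 
for this example, not our actual schema.

```python
def denormalize(work, child_fields, prefix="manifestation_"):
    """Return a flat copy of `work` in which the given child fields are
    collected into multi-valued fields on the parent."""
    # Copy every parent field except the nested children themselves.
    flat = {k: v for k, v in work.items() if k != "_childDocuments_"}
    for child in work.get("_childDocuments_", []):
        for field in child_fields:
            if field in child:
                values = flat.setdefault(prefix + field, [])
                if child[field] not in values:  # keep distinct values only
                    values.append(child[field])
    return flat


example = {
    "id": "work-1",
    "doc_type": "work",
    "_childDocuments_": [
        {"id": "m1", "materialType": "book", "publicationYear": 2000},
        {"id": "m2", "materialType": "e-book", "publicationYear": 2012},
    ],
}
print(denormalize(example, ["materialType", "publicationYear"]))
```

One of the complications is visible here: once the values are flattened, a 
query for "book" and "2012" matches this work even though no single 
manifestation is both.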

Kind regards,

/Noah


--

Noah Torp-Smith ([email protected])
