We are facing a performance issue when searching in child documents. In order
to explain the issue, I will provide a very simplified excerpt of our data
model.
We are making a search engine for libraries. What we want to deliver to the
users are "works". An example of a work could be Harry Potter and the Goblet of
Fire. Each work can have several manifestations; for example there is a book
version of the work, an audiobook, and maybe an e-book. Of course, there are
properties at the work level (like creator, title, subjects, etc) and other
properties at the manifestation level (like publication year, material type,
etc).
We have modelled this with parent documents and child documents in Solr, and
have built a search engine on it. The search engine can search for things like
creators, titles, and subjects at the work level, but users should also be
allowed to search for things from a specific year or be able to specify that
they are only interested in things that are available as e-books.
We have around 28 million works in Solr and 41 million manifestations,
indexed as child documents (so many works have only one manifestation).
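To make the structure concrete, here is a minimal sketch of how one work with its manifestations might look as a nested document. Only `doc_type:work` and `materialType` are taken from the description above; the `_childDocuments_` key is one of the ways Solr accepts nested documents in JSON, and every other field name and value (including `doc_type` "manifestation" and the years) is a placeholder, not the real schema.

```python
import json

# Illustrative only: apart from doc_type "work" and materialType, all field
# names and values below are placeholders, not the real schema or real data.
work = {
    "id": "work-1",
    "doc_type": "work",
    "title": "Harry Potter and the Goblet of Fire",
    "creator": "J. K. Rowling",
    # Child documents (manifestations) nested under the parent (work).
    "_childDocuments_": [
        {"id": "work-1-m1", "doc_type": "manifestation",
         "materialType": "book", "publicationYear": 2000},
        {"id": "work-1-m2", "doc_type": "manifestation",
         "materialType": "audiobook", "publicationYear": 2001},
        {"id": "work-1-m3", "doc_type": "manifestation",
         "materialType": "e-book", "publicationYear": 2012},
    ],
}

# Serialize the nested document as it could be sent to an update handler.
print(json.dumps(work, indent=2))
```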
As long as the user searches for things at the work level, the performance
is fine. But as you can imagine, when users search for things at the
manifestation level, the performance worsens. As an example, if we make a
search for a creator, the search executes in less than 200 ms and results in
maybe 30 hits. If we add a clause for a material type (with a `{!parent
which='doc_type:work'}materialType:"book"` construction), the search takes
several seconds. In this case we want the restriction to books to be part of
the ranking, so we cannot simply move it to a filter query, since filter
queries do not contribute to the score.
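As a rough sketch, the request parameters for such a combined search could be assembled as below. The `creator` field name, the `childq` parameter name, and the helper functions are hypothetical; the block join clause is written in the `key=value` local-params form with the child query passed by parameter dereferencing (`v=$childq`).

```python
from urllib.parse import urlencode


def quote_phrase(value: str) -> str:
    """Wrap a value in double quotes, escaping backslashes and embedded quotes."""
    escaped = value.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'


def build_params(creator: str, material_type: str) -> dict:
    """Build Solr request parameters for a work-level clause plus a child clause.

    Hypothetical helper: 'creator' is an assumed work-level field name and
    'childq' is an arbitrary name for the dereferenced child query.
    """
    return {
        # Both clauses are required and both take part in scoring.
        "q": f"+creator:{quote_phrase(creator)} "
             "+{!parent which='doc_type:work' v=$childq}",
        "childq": f"materialType:{quote_phrase(material_type)}",
        "fl": "id,score",
    }


params = build_params("Rowling", "book")
# The encoded form is what would go after '/select?' on a Solr core URL.
print(urlencode(params))
```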
I am wondering if there is some way to evaluate the search at the work level
first, and then do the filtering for those works that have manifestations
matching the child requirements afterwards? I could try to do the search for
work-level properties first and only fetch IDs and then do the full search with
the manifestation-level requirements afterwards and an added filter query with
the IDs, but I am wondering if there is a better way to do this.
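The two-step workaround described above could be sketched like this. It only builds the request parameters for each step; the function names, the `childq` parameter name, and the `max_ids` cap are hypothetical, and the ID filter uses the terms query parser (`{!terms f=id}`), which assumes that the IDs contain no commas. Whether this is actually faster would have to be measured.

```python
def phase_one_params(work_query: str, max_ids: int = 1000) -> dict:
    """Step 1: search at the work level only and fetch nothing but IDs."""
    return {
        "q": work_query,
        "fq": "doc_type:work",  # restrict to parent documents
        "fl": "id",
        "rows": max_ids,        # hypothetical cap on candidate works
    }


def phase_two_params(work_query: str, child_query: str, work_ids: list) -> dict:
    """Step 2: run the full query, restricted to the IDs found in step 1."""
    return {
        # The child clause stays in q, so it still contributes to ranking.
        "q": "+(" + work_query + ") +{!parent which='doc_type:work' v=$childq}",
        "childq": child_query,
        # Filter query limiting the result to the candidate works.
        "fq": "{!terms f=id}" + ",".join(work_ids),
    }


step1 = phase_one_params('creator:"Rowling"')
# In a real client, the IDs would come from the response to step 1.
candidate_ids = ["work-1", "work-2"]
step2 = phase_two_params('creator:"Rowling"', 'materialType:"book"', candidate_ids)
print(step1)
print(step2)
```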
I have also looked at denormalizing
(https://blog.innoventsolutions.com/innovent-solutions-blog/2018/05/avoid-the-parentchild-trap-tips-and-tricks-for-denormalizing-solr-data.html)
and it helps when doing it for a few child fields. But as said, there are more
properties in the real model than those I have mentioned here, so that also
involves some complications.
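For illustration, denormalizing a few child fields onto the work could look roughly like the sketch below. The `denormalize` helper and all field names except `doc_type` and `materialType` are hypothetical. Note that flattening loses the information about which values belonged to the same manifestation, so a query for "e-book from 2000" can no longer be answered exactly from the flattened fields alone.

```python
def denormalize(work: dict, child_fields: list) -> dict:
    """Copy selected child fields onto the parent as multi-valued fields.

    Hypothetical helper: expects children under the '_childDocuments_' key.
    """
    # Start from the parent's own fields, without the nested children.
    flat = {k: v for k, v in work.items() if k != "_childDocuments_"}
    for field in child_fields:
        values = []
        for child in work.get("_childDocuments_", []):
            # Collect each distinct value once, preserving first-seen order.
            if field in child and child[field] not in values:
                values.append(child[field])
        flat[field] = values
    return flat


# Placeholder data, not real records.
work = {
    "id": "work-1",
    "doc_type": "work",
    "_childDocuments_": [
        {"id": "m1", "materialType": "book", "publicationYear": 2000},
        {"id": "m2", "materialType": "e-book", "publicationYear": 2012},
        {"id": "m3", "materialType": "book", "publicationYear": 2012},
    ],
}

flat = denormalize(work, ["materialType", "publicationYear"])
print(flat)
```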
Kind regards,
/Noah
--
Noah Torp-Smith ([email protected])