Hello, Noah.

A few notes: Query time depends on the number of results. When one query is
slower than another, the usual explanation is that it has to enumerate more
docs.
Examine how the query is parsed in the debugQuery output. There are many tricks
and pitfalls in query parsers. E.g. I'm not sure why you put a colon after
"which", whether you send it to Solr that way, and how Solr interprets it.
Which version of Solr/Lucene are you running? Some time ago Lucene had no
two-phase iteration and was prone to redundant enumerations.
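
For the debugQuery check, a request could be put together roughly like the
sketch below. The base URL and the core name "works" are made up for
illustration, and the clause uses which= instead of which: because = is the
documented local-params form; replace it with whatever you actually send. The
parsed form of the query should then appear under "debug" / "parsedquery" in
the response.

```python
from urllib.parse import urlencode

# Hypothetical Solr location; replace with your own host and core/collection.
BASE_URL = "http://localhost:8983/solr/works"

# Child clause from the original question, written with which= (assumption:
# this is the form that is meant to reach Solr).
PARENT_CLAUSE = '{!parent which="doc_type:work"}materialType:"book"'


def debug_query_url(base_url, q, rows=0):
    """Build a /select URL that asks Solr to include debug output."""
    params = {"q": q, "debugQuery": "true", "rows": rows}
    return f"{base_url}/select?{urlencode(params)}"


url = debug_query_url(BASE_URL, PARENT_CLAUSE)
print(url)
```

rows=0 keeps the response small, since only the debug section is of interest
here.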

> if there is some way to evaluate the search at the work level first, and
> then do the filtering for those works that have manifestations matching the
> child requirements afterwards?
That's how it's expected to work. You can confirm your hypothesis by
intersecting {!parent ..}.. with work_id:123, either via fq or with +. It
should return almost instantly.
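
The two variants of that test could look roughly like this. It is a sketch
under assumptions: work_id is taken to be a field on the work (parent)
documents, childq is a parameter name I made up, and the + variant assumes the
default lucene parser, which accepts an embedded {!parent ...} clause with the
child query dereferenced through v=$childq.

```python
from urllib.parse import urlencode

# Child clause from the original question, with which= instead of which:
PARENT_CLAUSE = '{!parent which="doc_type:work"}materialType:"book"'


def narrowed_by_fq(work_id):
    """Intersect the block join with a single work through a filter query."""
    return {"q": PARENT_CLAUSE, "fq": f"work_id:{work_id}", "rows": 1}


def narrowed_by_must(work_id):
    """Intersect through two mandatory (+) clauses in the main query.

    The child query travels in a separate parameter (childq, a hypothetical
    name) and is referenced from the local params with v=$childq.
    """
    return {
        "q": f'+work_id:{work_id} +{{!parent which="doc_type:work" v=$childq}}',
        "childq": 'materialType:"book"',
        "rows": 1,
    }


for params in (narrowed_by_fq(123), narrowed_by_must(123)):
    print("/select?" + urlencode(params))
```

If either request is slow even though it can match at most one work, the
parent clause is probably being enumerated in full rather than being driven by
the work-level clause.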

So, if everything is right, your indices might simply be too large, and you
would have to break them into many shards.


On Tue, Jan 3, 2023 at 1:12 PM Noah Torp-Smith <[email protected]> wrote:

> We are facing a performance issue when searching in child documents. In
> order to explain the issue, I will provide a very simplified excerpt of our
> data model.
>
> We are making a search engine for libraries. What we want to deliver to
> the users are "works". An example of a work could be Harry Potter and the
> Goblet of fire. Each work can have several manifestations; for example
> there is a book version of the work, an audiobook, and maybe an e-book. Of
> course, there are properties at the work level (like creator, title,
> subjects, etc) and other properties at the manifestation level (like
> publication year, material type, etc).
>
> We have modelled this with parent documents and child documents in solr,
> and have built a search engine on it. The search engine can search for
> things like creators, titles, and subjects at the work level, but users
> should also be allowed to search for things from a specific year or be able
> to specify that they are only interested in things that are available as
> e-books.
>
> We have around 28 million works in the solr and 41 million manifestations,
> indexed as child documents (so many works have only one manifestation).
>
> As long as the user searches for things at the work level, the
> performance is fine. But as you can imagine, when users search for things
> at the manifestation level, the performance worsens. As an example, if we
> make a search for a creator, the search executes in less than 200 ms and
> results in maybe 30 hits. If we add a clause for a material type (with a
> `{!parent which:'doc_type:work'}materialType:"book"` construction), the
> search takes several seconds. In this case we want the filtering to books
> to be part of the ranking, so putting it in a filter query will pose a
> problem.
>
> I am wondering if there is some way to evaluate the search at the work
> level first, and then do the filtering for those works that have
> manifestations matching the child requirements afterwards? I could try to
> do the search for work-level properties first and only fetch IDs and then
> do the full search with the manifestation-level requirements afterwards and
> an added filter query with the IDs, but I am wondering if there is a better
> way to do this.
>
> I have also looked at denormalizing (
> https://blog.innoventsolutions.com/innovent-solutions-blog/2018/05/avoid-the-parentchild-trap-tips-and-tricks-for-denormalizing-solr-data.html)
> and it helps when doing it for a few child fields. But as said, there are
> more properties in the real model than those I have mentioned here, so that
> also involves some complications.
>
> Kind regards,
>
> /Noah
>
>
> --
>
> Noah Torp-Smith ([email protected])
>


-- 
Sincerely yours
Mikhail Khludnev
https://t.me/MUST_SEARCH
A caveat: Cyrillic!
