One other thing to check is the performance on each node. You can do this by running the query with the parameter distrib=false on each node. A distributed search is only as fast as the slowest node. So you'll want to rule out an underpowered node.
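A minimal sketch of building those per-node checks, assuming hypothetical node hostnames and a placeholder collection name; `distrib=false` is the only parameter that matters here, since it keeps each node from fanning the query back out to the other shards:

```python
from urllib.parse import urlencode

# Hypothetical hostnames and collection name - substitute your own.
nodes = ["solr-node1:8983", "solr-node2:8983", "solr-node3:8983"]
params = {"q": "hello world", "rows": 10, "distrib": "false"}

# One single-node query per host; compare the QTime each one reports.
urls = {host: f"http://{host}/solr/mycollection/select?{urlencode(params)}"
        for host in nodes}
for host, url in urls.items():
    print(host, url)
```

Running the same query on every node and comparing the reported QTime values makes an underpowered node stand out immediately.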
Joel Bernstein
http://joelsolr.blogspot.com/

On Wed, Jan 19, 2022 at 4:13 PM Ashwin Ramesh <ash...@canva.com.invalid> wrote:

> Hi all,
>
> Thanks for the feedback!
>
> @Alessandro - The additive boost is primarily a legacy function that we are
> moving away from. It allowed us to rank specific documents higher than the
> remaining potential result set. We do this by ensuring the score of those
> documents is an order of magnitude higher than the remaining scores.
>
> @Charlie & @Joel - I've run another query, which has 21M matches:
>
> 500 rows - 1800ms
> 200 rows - 1650ms
> 100 rows - 1500ms
> 10 rows - 900ms
>
> You are correct that rows has an impact on latency; however, the base case
> is still high! My assumption is that scoring 21M docs (across 8 shards) is
> computationally expensive?
>
> Regarding why we ask for 500 results - it is so we can do second-phase
> ranking on the top N (N=500) with features that are not in Solr.
>
> My current hypotheses are:
> 1. Our Solr configuration (hardware and/or solrconfig.xml, solr.xml) is
> misconfigured for our use case.
> 2. Our boosting functions & schema (field types) are misconfigured -
> however, after this thread, I'm fairly certain that the field types we
> have for the boosts are as optimized as possible.
> 3. We have to change our scoring function so that a given query does not
> match 20+ million documents - we probably need more AND clauses to cut
> the result set down. This is something we are already working on.
>
> For context on the 3rd point, I changed my query to make every term in
> the query mandatory:
>
> Total scored: 37,000
>
> 500 rows - 30ms
> 10 rows - 30ms (approx. the same)
>
> Obviously this can't be done across the board, otherwise recall will drop
> too drastically for some query sets.
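The "every term mandatory" experiment above can be sketched as request parameters. The thread doesn't say how the terms were made mandatory; assuming the edismax parser, setting mm=100% is one way to require all terms (field names and values below are the ones from the thread, the rest is placeholder):

```python
from urllib.parse import urlencode

# The original query: edismax ORs terms together by default (per q.op/mm).
base = {
    "defType": "edismax",
    "q": "hello world",
    "qf": "title description keywords",
    "rows": 500,
    "fl": "id,score",
}

# Variant with every term mandatory: mm=100% shrinks the match set,
# which is what dropped qTime from ~1800ms to ~30ms in the experiment.
strict = dict(base, mm="100%")

print(urlencode(base))
print(urlencode(strict))
```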
> Regards,
>
> Ash
>
> On Thu, Jan 20, 2022 at 5:52 AM Alessandro Benedetti <a.benede...@sease.io>
> wrote:
>
> > On top of the already good suggestions to reduce the scope of your
> > experiment, let's see:
> >
> > boost:def(boostFieldA,1) // boostFieldA is docValue float type
> >
> > The first part looks all right to me; it's expensive, though,
> > independently of the number of rows returned (as the boost request
> > parameter is parsed as an additional query that affects the score).
> > Enabling doc-values on such a field is probably the best option you have.
> >
> > In regards to the second part:
> > bf=mul(termfreq(termScoreFieldB,$q),1000.0) // termScoreFieldB is a
> > textField. No docValue, just indexed
> >
> > This *adds* to the score. termfreq returns the number of times the term
> > appears in the field for that document, e.g.:
> >
> > termfreq(text,'memory')
> >
> > So I am not even sure how multi-term input is managed (of course this
> > also depends on the tokenization of termScoreFieldB).
> > The *1000* there smells a lot like bad practice: you are adding to your
> > score, and your score is not probabilistic, nor limited to a constant
> > range of values (the main Lucene score value depends on the query and
> > the index).
> > It feels like you are going to get better behaviour modelling such a
> > requirement as an additional boost query rather than a boost function,
> > but I am curious to know what it is you are attempting to do.
> >
> > Cheers
> > --------------------------
> > Alessandro Benedetti
> > Apache Lucene/Solr PMC member and Committer
> > Director, R&D Software Engineer, Search Consultant
> >
> > www.sease.io
> >
> > On Wed, 19 Jan 2022 at 13:44, Joel Bernstein <joels...@gmail.com> wrote:
> >
> > > Testing out a smaller "rows" param is key. Then you can isolate the
> > > performance difference due to the 500 rows.
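Alessandro's point about the additive *1000* can be sketched numerically. The scores below are made-up illustrations, not measurements, and the multiplicative alternative is one hypothetical formulation, not what the thread prescribes:

```python
# Toy sketch: why bf=mul(termfreq(...),1000.0) swamps the main score.
# BM25-style scores are unbounded and query/index dependent, so an
# additive constant of this size dominates relevance entirely.
base_score = 7.3   # illustrative main Lucene score for one document
tf = 2             # illustrative termfreq(termScoreFieldB, $q)

# Additive bf, as in the thread: the 2000.0 term drowns out base_score.
additive = base_score + tf * 1000.0

# A hypothetical multiplicative alternative (e.g. boost=sum(1,termfreq(...)))
# scales the relevance score instead of replacing it.
multiplicative = base_score * (1 + tf)

print(additive, multiplicative)
```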
> > > Adding more shards is going to increase the penalty for having 500
> > > rows, so it's good to understand how big that penalty is.
> > >
> > > Then test out smaller result sets by adjusting the query. Gradually
> > > increase the result set size by adjusting the query. You can then get
> > > a feel for how result set size affects performance. This will give you
> > > an indication of how much it will help to have more shards.
> > >
> > > Joel Bernstein
> > > http://joelsolr.blogspot.com/
> > >
> > > On Wed, Jan 19, 2022 at 6:19 AM Charlie Hull <
> > > ch...@opensourceconnections.com> wrote:
> > >
> > > > Hi Ashwin,
> > > >
> > > > What happens if you reduce the number of rows requested? Do you
> > > > really need 500 results each time? I think this will ask for 500
> > > > results from *each shard* too.
> > > > https://solr.apache.org/guide/8_7/pagination-of-results.html
> > > >
> > > > Also, it looks like you mean boost=def(boostFieldA,1), not
> > > > boost:def(boostFieldA,1), am I right?
> > > >
> > > > Cheers
> > > >
> > > > Charlie
> > > >
> > > > On 19/01/2022 02:43, Ashwin Ramesh wrote:
> > > > > Gentle ping! Promise it's my final one! :)
> > > > >
> > > > > On Thu, Jan 13, 2022 at 8:01 AM Ashwin Ramesh <ash...@canva.com>
> > > > > wrote:
> > > > >
> > > > >> Hi everyone,
> > > > >>
> > > > >> I have a few questions about how we can improve our Solr query
> > > > >> performance, especially for boosts (bf, bq, boost, etc.).
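Charlie's note that 500 results are requested from *each shard* can be sketched as a back-of-envelope merge cost. The shard count and rows value come from the thread; the scores are random placeholders standing in for per-shard top-N score lists:

```python
import heapq
import random

shards, rows = 8, 500  # from the thread's setup
random.seed(0)

# Each shard returns its own top-`rows` (score-sorted) candidates.
per_shard = [sorted((random.random() for _ in range(rows)), reverse=True)
             for _ in range(shards)]

# The coordinating node then merges shards * rows = 4000 candidates
# just to produce the global top-500 - which is why both more rows and
# more shards increase the merge penalty.
merged = heapq.nlargest(rows, (s for shard in per_shard for s in shard))
print(len(merged), shards * rows)
```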
> > > > >>
> > > > >> *System Specs:*
> > > > >> Solr Version: 7.7.x
> > > > >> Heap Size: 31gb
> > > > >> Num Docs: >100M
> > > > >> Shards: 8
> > > > >> Replication Factor: 6
> > > > >> Index is completely mapped into memory
> > > > >>
> > > > >> Example query:
> > > > >> {
> > > > >> q=hello world
> > > > >> qf=title description keywords
> > > > >> pf=title^0.5
> > > > >> ps=0
> > > > >> fq=type:P
> > > > >> boost:def(boostFieldA,1) // boostFieldA is docValue float type
> > > > >> bf=mul(termfreq(termScoreFieldB,$q),1000.0) // termScoreFieldB is
> > > > >> a textField. No docValue, just indexed
> > > > >> rows:500
> > > > >> fl=id,score
> > > > >> }
> > > > >>
> > > > >> numFound: >21M
> > > > >> qTime: 800ms
> > > > >>
> > > > >> Experimentation with params:
> > > > >>
> > > > >> - When I remove the boost parameter, the qTime drops to 525ms
> > > > >> - When I remove the bf parameter, the qTime drops to 650ms
> > > > >> - When I remove both the boost & bf parameters, the qTime drops
> > > > >> to 400ms
> > > > >>
> > > > >> Questions:
> > > > >>
> > > > >> 1. Is there any way to improve the performance of the boosts
> > > > >> (specific field types, etc.)?
> > > > >> 2. Will sharding further, such that each core only has to score
> > > > >> a smaller subset of documents, help with query performance?
> > > > >> 3. Is there any performance impact when boosting/querying against
> > > > >> sparse fields, either indexed=true or docValues=true?
> > > > >> 4. It seems the base-case scoring is 400ms, which is already
> > > > >> quite high. Is this because the query (hello world) implicitly
> > > > >> gets parsed as (hello OR world), and is thus more computationally
> > > > >> expensive?
> > > > >> 5. Any other advice :) ?
> > > > >>
> > > > >> Thanks in advance,
> > > > >>
> > > > >> Ash
> > > >
> > > > --
> > > > Charlie Hull - Managing Consultant at OpenSource Connections Limited
> > > > Founding member of The Search Network <http://www.thesearchnetwork.com>
> > > > and co-author of Searching the Enterprise
> > > > <https://opensourceconnections.com/wp-content/uploads/2020/08/ES_book_final_journal_version.pdf>
> > > > tel/fax: +44 (0)8700 118334
> > > > mobile: +44 (0)7767 825828
> > > >
> > > > OpenSource Connections Europe GmbH | Pappelallee 78/79 | 10437 Berlin
> > > > Amtsgericht Charlottenburg | HRB 230712 B
> > > > Geschäftsführer: John M. Woodell | David E. Pugh
> > > > Finanzamt: Berlin Finanzamt für Körperschaften II
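For reference, the original post's example query can be sketched as literal request parameters. The host and collection name are placeholders, and this sketch uses boost= rather than the boost: spelling shown in the thread, per Charlie's comment:

```python
from urllib.parse import urlencode

# All field names and values are from the thread's example query;
# host and collection are hypothetical placeholders.
params = {
    "defType": "edismax",
    "q": "hello world",
    "qf": "title description keywords",
    "pf": "title^0.5",
    "ps": 0,
    "fq": "type:P",
    "boost": "def(boostFieldA,1)",
    "bf": "mul(termfreq(termScoreFieldB,$q),1000.0)",
    "rows": 500,
    "fl": "id,score",
}
qs = urlencode(params)
print("http://localhost:8983/solr/mycollection/select?" + qs)
```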