Hi all,

Thanks for the feedback!

@Alessandro - The additive boost is primarily a legacy function that we are
moving away from. It allowed us to rank specific documents higher than the
rest of the potential result set. We do this by ensuring the score of those
documents is an order of magnitude higher than the remaining scores.
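
To make the legacy behaviour concrete, here is a minimal sketch of the idea
(document IDs, scores and the boost constant are all hypothetical): pinned
documents get an additive bump far above any organic score, so they always
sort first.

```python
# Legacy additive-boost idea: pinned docs get a constant added to their
# organic score that is an order of magnitude above any organic score,
# so they always rank first.
organic_scores = {"docA": 7.2, "docB": 5.8, "docC": 6.1}
PIN_BOOST = 100.0  # >> any organic score in this (hypothetical) index

def final_score(doc_id, pinned):
    return organic_scores[doc_id] + (PIN_BOOST if doc_id in pinned else 0.0)

pinned = {"docC"}
ranked = sorted(organic_scores, key=lambda d: final_score(d, pinned), reverse=True)
print(ranked)  # docC first despite a lower organic score
```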

@Charlie & @Joel - I've run another query, which has 21M matches:

500 rows - 1800ms
200 rows - 1650ms
100 rows - 1500ms
10 rows - 900ms

You are correct that the rows parameter has an impact on latency; however,
the base case is still high! My assumption is that scoring 21M docs (across
8 shards) is computationally expensive?
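
A back-of-envelope decomposition of those numbers (using just the two
extreme data points, so take it as a rough estimate) suggests most of the
latency is a fixed cost independent of rows:

```python
# Split the measured latencies into a fixed cost (scoring/merging the
# 21M matches) plus a per-row cost, from the 10-row and 500-row points.
t_10, t_500 = 900, 1800  # ms
per_row = (t_500 - t_10) / (500 - 10)   # ~1.8 ms per extra row
fixed = t_10 - 10 * per_row             # ~880 ms regardless of rows
print(f"per-row: {per_row:.2f} ms, fixed: {fixed:.0f} ms")
```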

Regarding why we ask for 500 results - it is so we can do second-phase
ranking on the top N (N=500) with features that are not in Solr.
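
For clarity, the second-phase step looks roughly like this (the external
scoring function is hypothetical; Solr just supplies the top-500 candidate
set by its own score):

```python
# Second-phase ranking sketch: take Solr's top-500 (doc_id, solr_score)
# pairs, rescore them with features Solr doesn't have, keep the final page.
def rerank(solr_hits, external_score, page_size=10):
    """solr_hits: list of (doc_id, solr_score) from a rows=500 request."""
    rescored = [(doc_id, external_score(doc_id, s)) for doc_id, s in solr_hits]
    rescored.sort(key=lambda t: t[1], reverse=True)
    return [doc_id for doc_id, _ in rescored[:page_size]]

hits = [("a", 3.0), ("b", 2.0), ("c", 1.0)]
print(rerank(hits, lambda d, s: s * (2.5 if d == "c" else 1.0), page_size=2))
```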

My current hypotheses are:
1. Our Solr configuration (hardware and/or the solrconfig.xml / solr.xml
files) is misconfigured for our use case.
2. Our boosting functions & schema (field types) are misconfigured.
However, after this thread, I'm fairly certain that the field types we
have for the boosts are as optimized as possible.
3. We have to change our scoring function so that a given query does not
match against 20+ million documents, probably by adding more AND clauses
to cut the result set down. This is something we are already working on.

For context on the third point, I changed my query to make every term in
the query mandatory:

Total scored: 37,000

500 rows - 30ms
10 rows - 30ms (approx. the same)

Obviously this can't be done across the board, otherwise recall will drop
too drastically for some query sets.
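
One middle ground between full OR and mandatory-everything is edismax's mm
(minimum-should-match) parameter, which requires only a fraction of the
query terms. A small helper sketching the request params (the mm value is
illustrative, not tuned for our traffic):

```python
# Build edismax params with a minimum-should-match rule instead of
# making every term mandatory. mm="2<75%" means: queries of 1-2 terms
# require all terms; longer queries require 75% of them, shrinking the
# match set without collapsing recall to the all-AND case.
def build_params(query, mm="2<75%"):
    return {
        "q": query,
        "defType": "edismax",
        "qf": "title description keywords",
        "mm": mm,
        "rows": 500,
        "fl": "id,score",
    }

print(build_params("hello world brochure design")["mm"])
```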

Regards,

Ash

On Thu, Jan 20, 2022 at 5:52 AM Alessandro Benedetti <[email protected]>
wrote:

> On top of the already good suggestions to reduce the scope of your
> experiment, let's see:
>
> boost:def(boostFieldA,1) // boostFieldA is docValue float type
>
> The first part looks all right to me, it's expensive though, independently
> of the number of rows returned (as the boost request parameter is parsed as
> an additional query that affects the score).
> Enabling doc-values on such a field is probably the best option you have.
>
> In regards to the second part:
> bf=mul(termfreq(termScoreFieldB,$q),1000.0) // termScoreFieldB is a
> textField. No docValue, just indexed
>
> This *adds* to the score:
>
> Returns the number of times the term appears in the field for that
> document.
>
> termfreq(text,'memory')
> So I am not even sure how multi-term is managed (of course this depends
> also on the tokenization of termScoreFieldB).
> The *1000* there smells a lot like bad practice, as you are adding to your
> score, and your score is not probabilistic, nor limited to a constant range
> of values (the main Lucene score value depends on the query and the index).
> It feels like you are likely going to get better behaviour modelling such a
> requirement as an additional boost query rather than a boost function, but
> I am curious to know what it is that you are attempting to do.
>
> Cheers
> --------------------------
> Alessandro Benedetti
> Apache Lucene/Solr PMC member and Committer
> Director, R&D Software Engineer, Search Consultant
>
> www.sease.io
>
>
> On Wed, 19 Jan 2022 at 13:44, Joel Bernstein <[email protected]> wrote:
>
> > Testing out a smaller "rows" param is key. Then you can isolate the
> > performance difference due to the 500 rows. Adding more shards is going
> to
> > increase the penalty for having 500 rows, so it's good to understand how
> > big that penalty is.
> >
> > Then test out smaller result sets by adjusting the query. Gradually
> > increase the result set size by adjusting the query. You then can get a
> > feel for how result set size affects performance. This will give you an
> > indication how much it will help to have more shards.
> >
> >
> >
> >
> >
> > Joel Bernstein
> > http://joelsolr.blogspot.com/
> >
> >
> > On Wed, Jan 19, 2022 at 6:19 AM Charlie Hull <
> > [email protected]> wrote:
> >
> > > Hi Ashwin,
> > >
> > > What happens if you reduce the number of rows requested? Do you really
> > > need 500 results each time? I think this will ask for 500 results from
> > > *each shard* too.
> > > https://solr.apache.org/guide/8_7/pagination-of-results.html
> > >
> > > Also it looks like you mean boost=def(boostFieldA,1) not
> > > boost:def(boostFieldA,1), am I right?
> > >
> > > Cheers
> > >
> > > Charlie
> > >
> > > On 19/01/2022 02:43, Ashwin Ramesh wrote:
> > > > Gentle ping! Promise it's my final one! :)
> > > >
> > > > On Thu, Jan 13, 2022 at 8:01 AM Ashwin Ramesh<[email protected]>
> > wrote:
> > > >
> > > >> Hi everyone,
> > > >>
> > > >> I have a few questions about how we can improve our solr query
> > > >> performance, especially for boosts (BF, BQ, boost, etc).
> > > >>
> > > >> *System Specs:*
> > > >> Solr Version: 7.7.x
> > > >> Heap Size: 31gb
> > > >> Num Docs: >100M
> > > >> Shards: 8
> > > >> Replication Factor: 6
> > > >> Index is completely mapped into memory
> > > >>
> > > >>
> > > >> Example query:
> > > >> {
> > > >> q=hello world
> > > >> qf=title description keywords
> > > >> pf=title^0.5
> > > >> ps=0
> > > >> fq=type:P
> > > >> boost:def(boostFieldA,1) // boostFieldA is docValue float type
> > > >> bf=mul(termfreq(termScoreFieldB,$q),1000.0) // termScoreFieldB is a
> > > >> textField. No docValue, just indexed
> > > >> rows:500
> > > >> fl=id,score
> > > >> }
> > > >>
> > > >> numFound: >21M
> > > >> qTime: 800ms
> > > >>
> > > >> Experimentation of params:
> > > >>
> > > >>     - When I remove the boost parameter, the qTime drops to 525ms
> > > >>     - When I remove the bf parameter, the qTime dropes to 650ms
> > > >>     - When I remove both the boost & bf parameters, the qTime drops
> to
> > > >>     400ms
> > > >>
> > > >>
> > > >> Questions:
> > > >>
> > > >>     1. Is there any way to improve the performance of the boosts
> > > (specific
> > > >>     field types, etc)?
> > > >>     2. Will sharding further such that each core only has to score a
> > > >>     smaller subset of documents help with query performance?
> > > >>     3. Is there any performance impact when boosting/querying
> against
> > > >>     sparse fields, both indexed=true or docValues=true?
> > > >>     4. It seems the base case scoring is 400ms, which is already
> quite
> > > >>     high. Is this because the query (hello world) implicitly gets
> > > parsed as
> > > >>     (hello OR world)? Thus it would be more computationally
> expensive?
> > > >>     5. Any other advice :) ?
> > > >>
> > > >>
> > > >> Thanks in advance,
> > > >>
> > > >> Ash
> > > >>
> > > >>
> > > >>
> > > >>
> > > >>
> > > >>
> > > >>
> > > --
> > > Charlie Hull - Managing Consultant at OpenSource Connections Limited
> > > Founding member of The Search Network <http://www.thesearchnetwork.com
> >
> > > and co-author of Searching the Enterprise
> > > <
> > >
> >
> https://opensourceconnections.com/wp-content/uploads/2020/08/ES_book_final_journal_version.pdf
> > > >
> > > tel/fax: +44 (0)8700 118334
> > > mobile: +44 (0)7767 825828
> > >
> > > OpenSource Connections Europe GmbH | Pappelallee 78/79 | 10437 Berlin
> > > Amtsgericht Charlottenburg | HRB 230712 B
> > > Geschäftsführer: John M. Woodell | David E. Pugh
> > > Finanzamt: Berlin Finanzamt für Körperschaften II
> > >
> > >
> >
>