One other thing to check is the performance on each node. You can do this by running the query with the parameter distrib=false on each node. A distributed search is only as fast as the slowest node. So you'll want to rule out an underpowered node.
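A minimal sketch of building those per-node checks, assuming hypothetical node hostnames and a placeholder collection name; `distrib=false` is the only parameter that matters here, since it keeps each node from fanning the query back out to the other shards:

```python
from urllib.parse import urlencode

# Hypothetical hostnames and collection name - substitute your own.
nodes = ["solr-node1:8983", "solr-node2:8983", "solr-node3:8983"]
params = {"q": "hello world", "rows": 10, "distrib": "false"}

# One single-node query per host; compare the QTime each one reports.
urls = {host: f"http://{host}/solr/mycollection/select?{urlencode(params)}"
        for host in nodes}
for host, url in urls.items():
    print(host, url)
```

Running the same query on every node and comparing the reported QTime values makes an underpowered node stand out immediately.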
Joel Bernstein
http://joelsolr.blogspot.com/

On Wed, Jan 19, 2022 at 4:13 PM Ashwin Ramesh <ash...@canva.com.invalid> wrote:

> Hi all,
>
> Thanks for the feedback!
>
> @Alessandro - The additive boost is primarily a legacy function that we are
> moving away from. It allowed us to rank specific documents higher than the
> remaining potential result set. We do this by ensuring the score of those
> documents is an order of magnitude higher than the remaining scores.
>
> @Charlie & @Joel - I've run another query, which has 21M matches:
>
> 500 rows - 1800ms
> 200 rows - 1650ms
> 100 rows - 1500ms
> 10 rows - 900ms
>
> You are correct that rows has an impact on latency; however, the base case
> is still high! My assumption is that scoring 21M docs (across 8 shards) is
> computationally expensive?
>
> Regarding why we ask for 500 results - it is so we can do second-phase
> ranking on the top N (N=500) with features that are not in Solr.
>
> My current hypotheses are:
> 1. Our Solr configuration (hardware and/or solrconfig.xml, solr.xml) is
> misconfigured for our use case.
> 2. Our boosting functions & schema (field types) are misconfigured -
> however, after this thread, I'm fairly certain that the field types we
> have for the boosts are as optimized as possible.
> 3. We have to change our scoring function so that a given query does not
> match 20+ million documents - we probably need more AND clauses to cut
> the result set down. This is something we are already working on.
>
> For context on the 3rd point, I changed my query to make every term in
> the query mandatory:
>
> Total scored: 37,000
>
> 500 rows - 30ms
> 10 rows - 30ms (approx. the same)
>
> Obviously this can't be done across the board, otherwise recall will drop
> too drastically for some query sets.
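The "every term mandatory" experiment above can be sketched as request parameters. The thread doesn't say how the terms were made mandatory; assuming the edismax parser, setting mm=100% is one way to require all terms (field names and values below are the ones from the thread, the rest is placeholder):

```python
from urllib.parse import urlencode

# The original query: edismax ORs terms together by default (per q.op/mm).
base = {
    "defType": "edismax",
    "q": "hello world",
    "qf": "title description keywords",
    "rows": 500,
    "fl": "id,score",
}

# Variant with every term mandatory: mm=100% shrinks the match set,
# which is what dropped qTime from ~1800ms to ~30ms in the experiment.
strict = dict(base, mm="100%")

print(urlencode(base))
print(urlencode(strict))
```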
> Regards,
>
> Ash
>
> On Thu, Jan 20, 2022 at 5:52 AM Alessandro Benedetti <a.benede...@sease.io>
> wrote:
>
> > On top of the already good suggestions to reduce the scope of your
> > experiment, let's see:
> >
> > boost:def(boostFieldA,1) // boostFieldA is docValue float type
> >
> > The first part looks all right to me; it's expensive, though,
> > independently of the number of rows returned (as the boost request
> > parameter is parsed as an additional query that affects the score).
> > Enabling doc-values on such a field is probably the best option you have.
> >
> > In regards to the second part:
> > bf=mul(termfreq(termScoreFieldB,$q),1000.0) // termScoreFieldB is a
> > textField. No docValue, just indexed
> >
> > This *adds* to the score. termfreq returns the number of times the term
> > appears in the field for that document, e.g.:
> >
> > termfreq(text,'memory')
> >
> > So I am not even sure how multi-term input is managed (of course this
> > also depends on the tokenization of termScoreFieldB).
> > The *1000* there smells a lot like bad practice: you are adding to your
> > score, and your score is not probabilistic, nor limited to a constant
> > range of values (the main Lucene score value depends on the query and
> > the index).
> > It feels like you are going to get better behaviour modelling such a
> > requirement as an additional boost query rather than a boost function,
> > but I am curious to know what it is you are attempting to do.
> >
> > Cheers
> > --------------------------
> > Alessandro Benedetti
> > Apache Lucene/Solr PMC member and Committer
> > Director, R&D Software Engineer, Search Consultant
> >
> > www.sease.io
> >
> > On Wed, 19 Jan 2022 at 13:44, Joel Bernstein <joels...@gmail.com> wrote:
> >
> > > Testing out a smaller "rows" param is key. Then you can isolate the
> > > performance difference due to the 500 rows.
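Alessandro's point about the additive *1000* can be sketched numerically. The scores below are made-up illustrations, not measurements, and the multiplicative alternative is one hypothetical formulation, not what the thread prescribes:

```python
# Toy sketch: why bf=mul(termfreq(...),1000.0) swamps the main score.
# BM25-style scores are unbounded and query/index dependent, so an
# additive constant of this size dominates relevance entirely.
base_score = 7.3   # illustrative main Lucene score for one document
tf = 2             # illustrative termfreq(termScoreFieldB, $q)

# Additive bf, as in the thread: the 2000.0 term drowns out base_score.
additive = base_score + tf * 1000.0

# A hypothetical multiplicative alternative (e.g. boost=sum(1,termfreq(...)))
# scales the relevance score instead of replacing it.
multiplicative = base_score * (1 + tf)

print(additive, multiplicative)
```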
> > > Adding more shards is going to increase the penalty for having 500
> > > rows, so it's good to understand how big that penalty is.
> > >
> > > Then test out smaller result sets by adjusting the query. Gradually
> > > increase the result set size by adjusting the query. You can then get
> > > a feel for how result set size affects performance. This will give you
> > > an indication of how much it will help to have more shards.
> > >
> > > Joel Bernstein
> > > http://joelsolr.blogspot.com/
> > >
> > > On Wed, Jan 19, 2022 at 6:19 AM Charlie Hull <
> > > ch...@opensourceconnections.com> wrote:
> > >
> > > > Hi Ashwin,
> > > >
> > > > What happens if you reduce the number of rows requested? Do you
> > > > really need 500 results each time? I think this will ask for 500
> > > > results from *each shard* too.
> > > > https://solr.apache.org/guide/8_7/pagination-of-results.html
> > > >
> > > > Also, it looks like you mean boost=def(boostFieldA,1), not
> > > > boost:def(boostFieldA,1), am I right?
> > > >
> > > > Cheers
> > > >
> > > > Charlie
> > > >
> > > > On 19/01/2022 02:43, Ashwin Ramesh wrote:
> > > > > Gentle ping! Promise it's my final one! :)
> > > > >
> > > > > On Thu, Jan 13, 2022 at 8:01 AM Ashwin Ramesh <ash...@canva.com>
> > > > > wrote:
> > > > >
> > > > >> Hi everyone,
> > > > >>
> > > > >> I have a few questions about how we can improve our Solr query
> > > > >> performance, especially for boosts (bf, bq, boost, etc.).
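Charlie's note that 500 results are requested from *each shard* can be sketched as a back-of-envelope merge cost. The shard count and rows value come from the thread; the scores are random placeholders standing in for per-shard top-N score lists:

```python
import heapq
import random

shards, rows = 8, 500  # from the thread's setup
random.seed(0)

# Each shard returns its own top-`rows` (score-sorted) candidates.
per_shard = [sorted((random.random() for _ in range(rows)), reverse=True)
             for _ in range(shards)]

# The coordinating node then merges shards * rows = 4000 candidates
# just to produce the global top-500 - which is why both more rows and
# more shards increase the merge penalty.
merged = heapq.nlargest(rows, (s for shard in per_shard for s in shard))
print(len(merged), shards * rows)
```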
> > > > >>
> > > > >> *System Specs:*
> > > > >> Solr Version: 7.7.x
> > > > >> Heap Size: 31gb
> > > > >> Num Docs: >100M
> > > > >> Shards: 8
> > > > >> Replication Factor: 6
> > > > >> Index is completely mapped into memory
> > > > >>
> > > > >> Example query:
> > > > >> {
> > > > >> q=hello world
> > > > >> qf=title description keywords
> > > > >> pf=title^0.5
> > > > >> ps=0
> > > > >> fq=type:P
> > > > >> boost:def(boostFieldA,1) // boostFieldA is docValue float type
> > > > >> bf=mul(termfreq(termScoreFieldB,$q),1000.0) // termScoreFieldB is
> > > > >> a textField. No docValue, just indexed
> > > > >> rows:500
> > > > >> fl=id,score
> > > > >> }
> > > > >>
> > > > >> numFound: >21M
> > > > >> qTime: 800ms
> > > > >>
> > > > >> Experimentation with params:
> > > > >>
> > > > >> - When I remove the boost parameter, the qTime drops to 525ms
> > > > >> - When I remove the bf parameter, the qTime drops to 650ms
> > > > >> - When I remove both the boost & bf parameters, the qTime drops
> > > > >> to 400ms
> > > > >>
> > > > >> Questions:
> > > > >>
> > > > >> 1. Is there any way to improve the performance of the boosts
> > > > >> (specific field types, etc.)?
> > > > >> 2. Will sharding further, such that each core only has to score
> > > > >> a smaller subset of documents, help with query performance?
> > > > >> 3. Is there any performance impact when boosting/querying against
> > > > >> sparse fields, either indexed=true or docValues=true?
> > > > >> 4. It seems the base-case scoring is 400ms, which is already
> > > > >> quite high. Is this because the query (hello world) implicitly
> > > > >> gets parsed as (hello OR world), and is thus more computationally
> > > > >> expensive?
> > > > >> 5. Any other advice :) ?
> > > > >>
> > > > >> Thanks in advance,
> > > > >>
> > > > >> Ash
> > > >
> > > > --
> > > > Charlie Hull - Managing Consultant at OpenSource Connections Limited
> > > > Founding member of The Search Network <http://www.thesearchnetwork.com>
> > > > and co-author of Searching the Enterprise
> > > > <https://opensourceconnections.com/wp-content/uploads/2020/08/ES_book_final_journal_version.pdf>
> > > > tel/fax: +44 (0)8700 118334
> > > > mobile: +44 (0)7767 825828
> > > >
> > > > OpenSource Connections Europe GmbH | Pappelallee 78/79 | 10437 Berlin
> > > > Amtsgericht Charlottenburg | HRB 230712 B
> > > > Geschäftsführer: John M. Woodell | David E. Pugh
> > > > Finanzamt: Berlin Finanzamt für Körperschaften II
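For reference, the original post's example query can be sketched as literal request parameters. The host and collection name are placeholders, and this sketch uses boost= rather than the boost: spelling shown in the thread, per Charlie's comment:

```python
from urllib.parse import urlencode

# All field names and values are from the thread's example query;
# host and collection are hypothetical placeholders.
params = {
    "defType": "edismax",
    "q": "hello world",
    "qf": "title description keywords",
    "pf": "title^0.5",
    "ps": 0,
    "fq": "type:P",
    "boost": "def(boostFieldA,1)",
    "bf": "mul(termfreq(termScoreFieldB,$q),1000.0)",
    "rows": 500,
    "fl": "id,score",
}
qs = urlencode(params)
print("http://localhost:8983/solr/mycollection/select?" + qs)
```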