Hmm... if you're manually constructing phrase queries during pre-parsing, and you have sow=true and autogeneratePhraseQueries=true set, then despite the lack of pf, phrase queries could still be key to this. Would any of the phrase queries explicitly introduced by your pre-parsing actually trigger autogeneratePhraseQueries to kick in? (i.e., would any of the whitespace-separated tokens in your phrases be further split by your Solr-internal analysis chain -- WordDelimiter, (Solr-internal) Synonym, etc.?) Would you be able to share the analysis chain on the relevant fields, and perhaps (forgiving readability challenges) an example of pre-parsed input that suffers particularly from performance degradation?
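To make the kind of input I'm asking about concrete, here is a minimal sketch of how application-side pre-parsing might assemble an OR'd set of quoted synonym phrases and the matching edismax request parameters. The helper names (quote_term, build_or_query, build_params) are hypothetical, and the qf and tie values are only inferred from the debugQuery output quoted below; your custom parser may well work differently.

```python
from urllib.parse import urlencode


def quote_term(term):
    """Quote multi-word terms so the query parser treats them as phrases."""
    term = term.strip()
    if len(term.split()) > 1:
        escaped = term.replace('"', '\\"')
        return '"' + escaped + '"'
    return term


def build_or_query(terms):
    """OR together the original term and its application-supplied synonyms."""
    return " OR ".join(quote_term(t) for t in terms)


def build_params(terms):
    """Request parameters for an edismax query (values are assumptions)."""
    return {
        "defType": "edismax",
        "q": build_or_query(terms),
        "qf": "wkxmlsource title",  # fields seen in the parsed query
        "tie": "1.0",               # assumed from the ~1.0 in the debug output
        "sow": "true",              # split on whitespace, as in Solr 6
        "debugQuery": "true",       # return parsedquery for comparison
    }


if __name__ == "__main__":
    params = build_params(["new york", "ny", "big apple"])
    print(params["q"])
    print(urlencode(params))
```

Note that autogeneratePhraseQueries is a fieldType attribute in the schema rather than a request parameter, so it does not appear here; whether it kicks in depends on what the field's analysis chain does to each token inside those quoted phrases.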
On Thu, Aug 20, 2020 at 2:28 PM Elaine Cario <etca...@gmail.com> wrote:
>
> Thanks Michael, I took a look, but we don't have any pf or pf1,2,3 phrase
> params set at all. Also, we don't add synonyms through Solr filters,
> rather we parse the user's query in our own application and add synonyms
> there, before it gets to Solr.
>
> Some additional info: we have sow=true (to be compatible with Solr 6), and
> autogeneratePhraseQueries=true. In our A/B testing, we didn't see any
> difference in search results (aside from some minor scoring variations), so
> functionally everything is working fine.
>
> I compared the debugQuery results between Solr 6 and 8 on a somewhat
> simplified query (they quickly become unreadable otherwise):
>
> Solr 6:
> <str name="parsedquery">(+(DisjunctionMaxQuery((wkxmlsource:"new york" |
> title:"new york")~1.0) DisjunctionMaxQuery((wkxmlsource:ny | title:ny)~1.0)
> DisjunctionMaxQuery((wkxmlsource:"big apple" | title:"big
> apple")~1.0)))/no_coord</str>
> <str name="parsedquery_toString">+((wkxmlsource:"new york" | title:"new
> york")~1.0 (wkxmlsource:ny | title:ny)~1.0 (wkxmlsource:"big apple" |
> title:"big apple")~1.0)</str>
>
> Solr 8:
> <str name="parsedquery">+(DisjunctionMaxQuery((wkxmlsource:"new york" |
> title:"new york")~1.0) DisjunctionMaxQuery((wkxmlsource:ny | title:ny)~1.0)
> DisjunctionMaxQuery((wkxmlsource:"big apple" | title:"big
> apple")~1.0))</str>
> <str name="parsedquery_toString">+((wkxmlsource:"new york" | title:"new
> york")~1.0 (wkxmlsource:ny | title:ny)~1.0 (wkxmlsource:"big apple" |
> title:"big apple")~1.0)</str>
>
> The only substantial difference is the removal of /no_coord (which is
> probably a result of LUCENE-7347 and likely accounts also for scoring
> variations).
>
> We do see generally higher CPU load with Solr 8 (although it is well within
> tolerance), and we do see much higher thread count (60 for Solr 6 vs 150
> for Solr 8 on average) even on a relatively quiet system. That seems an
> interesting statistic, but not really sure what it signifies. We mostly
> take the OOTB defaults for most everything, and config changes were
> minimal, mostly to maintain Solr 6 query behavior (uf=*_query_, sow=true).
>
> On Wed, Aug 19, 2020 at 5:46 PM Michael Gibney <mich...@michaelgibney.net>
> wrote:
>
> > Hi Elaine,
> > I'm curious what happens if you remove "pf" (phrase field) setting
> > from your edismax config?
> >
> > This question brought to mind
> > https://issues.apache.org/jira/browse/SOLR-12243?focusedCommentId=16836448#comment-16836448
> > and https://issues.apache.org/jira/browse/LUCENE-8531. This *could*
> > have directly explained the behavior you're observing, except for the
> > fact that pre-6.5.0, analyzeGraphPhrase(...) generated a
> > fully-enumerated Lucene "GraphQuery" (since removed, but afaict
> > similar to MultiPhraseQuery). But the direct topic of SOLR-12243 was
> > that SpanNearQuery, nevermind its performance characteristics, was
> > getting completely ignored by edismax. Curious about your case, I
> > looked at ExtendedDismaxQParser for 6.4.2, and it appears that
> > GraphQuery was similarly ignored?:
> >
> > https://github.com/apache/lucene-solr/blob/releases/lucene-solr/6.4.2/solr/core/src/java/org/apache/solr/search/ExtendedDismaxQParser.java#L1219-L1252
> >
> > If this is in fact the case (and I could well be overlooking
> > something), then it's possible that 6.4.2 was more performant mainly
> > because edismax was completely ignoring the more complex phrase
> > queries generated by analyzeGraphPhrase(...).
> >
> > I'll be curious to hear what you find, and eager to be corrected if
> > the above speculation is off-base!
> >
> > Michael
> >
> >
> > On Wed, Aug 19, 2020 at 10:56 AM Elaine Cario <etca...@gmail.com> wrote:
> > >
> > > Hi Solr experts,
> > >
> > > We're in the process of upgrading SolrCloud from 6.4.2 to 8.3.1, and our
> > > performance testing is consistently showing search latencies are measurably
> > > higher in 8.3.1, for certain kinds of queries it may be as much as 200 ms
> > > higher on average.
> > >
> > > We've seen this now in 2 different environments. In one environment, we
> > > effectively doubled the OS memory for Solr 8 (by removing a replica set),
> > > and saw little improvement.
> > >
> > > The specs on the VM's we're using are the same from Solr 6 and 8, and the
> > > index sizes and shard distribution are also the same. We reviewed garbage
> > > collection logs, and didn't see any red flags there. We're still using
> > > Java 8 (sorry!). Content was re-fed into Solr 8 from scratch.
> > >
> > > We re-ran queries removing all the usual suspects for high latencies:
> > > grouping, faceting, highlighting. We saw some improvement (as we would
> > > expect), but nothing approaching the average Solr 6 latencies with all
> > > those features turned on.
> > >
> > > We've narrowed the largest overall latencies to queries which contain many
> > > terms OR'd together (essentially synonyms we add to the query ourselves);
> > > there may be as many as 0-38 or more quoted phrases OR'd together.
> > > Latencies increase the more synonyms we add (we always knew this), but it
> > > seems much worse in Solr 8. (It is an unfortunate quirk of our content that
> > > these terms often have pretty high frequencies). But it's not clear if
> > > this is just amplifying an underlying issue, or if something fundamental
> > > changed in the way Solr (or Lucene) resolves queries with OR'd terms. We
> > > use a custom variant of edismax (but we also modified the queries to enable
> > > use of OOTB edismax, and still saw no improvement).
> > >
> > > We also noted that 0-term queries (*:*) with lots of facets perform as well
> > > as Solr 6, so it definitely seems related to searching for terms.
> > >
> > > I'm out of ideas here. Has anyone experienced similar degradation from
> > > older Solr versions?
> > >
> > > Thanks in advance for any help you can provide.
> >