Re: Want zero results from SOLR when there are no matches for "querystring"

John Bickerstaff Fri, 12 Aug 2016 12:25:22 -0700

Thanks - I'll look at it...

On Fri, Aug 12, 2016 at 1:21 PM, Erick Erickson <erickerick...@gmail.com>
wrote:


> Maybe ReRankQParserPlugin?
>
> On Aug 12, 2016 11:54, "John Bickerstaff" <j...@johnbickerstaff.com>
> wrote:
>
> > @Hossman --  thanks again.
> >
> > I've made the following change and so far things look good.  I couldn't
> see
> > debug or find results for what I put in for $func, so I just removed it,
> > but making modifications as you suggested appears to be working.
> >
> > Including the actual line from my endpoint XML in case this thread helps
> > someone else...
> >
> > <str name="q">{!boost defType=synonym_edismax qf='title' synonyms='true'
> > synonyms.originalBoost='2.5' synonyms.synonymBoost='1.1' bf='' bq=''
> > v=$q}</str>
> >
> > On Fri, Aug 12, 2016 at 12:09 PM, John Bickerstaff <
> > j...@johnbickerstaff.com
> > > wrote:
> >
> > > Thanks!  I'll check it out.
> > >
> > > On Fri, Aug 12, 2016 at 12:05 PM, Susheel Kumar <susheel2...@gmail.com
> >
> > > wrote:
> > >
> > >> Not exactly sure what you are looking for from chaining the results but
> > >> similar
> > >> functionality is available in Streaming Expressions where results of
> > inner
> > >> expressions are passed to outer expressions and so on
> > >> https://cwiki.apache.org/confluence/display/solr/Streaming+Expressions
> > >>
> > >> HTH
> > >> Susheel
> > >>
> > >> On Fri, Aug 12, 2016 at 1:08 PM, John Bickerstaff <
> > >> j...@johnbickerstaff.com>
> > >> wrote:
> > >>
> > >> > Hossman - many thanks again for your comprehensive and very helpful
> > >> answer!
> > >> >
> > >> > All,
> > >> >
> > >> > I am (possibly mis-remembering) reading something about being able
> to
> > >> pass
> > >> > the results of one query to another query...  Essentially "chaining"
> > >> result
> > >> > sets.
> > >> >
> > >> > I have looked in docs and can't find anything on a quick search -- I
> > may
> > >> > have been reading about the Re-Ranking feature, which doesn't help
> me
> > (I
> > >> > know because I just tried and it seems to return all results anyway,
> > >> just
> > >> > re-ranking the number specified in the reRankDocs flag...)
> > >> >
> > >> > Is there a way to (cleanly) send the results of one query to another
> > >> query
> > >> > for further processing?  Essentially, pass ONLY the results
> (including
> > >> an
> > >> > empty set of results) to another query for processing?
> > >> >
> > >> > thanks...
> > >> >
> > >> > On Thu, Aug 11, 2016 at 6:19 PM, John Bickerstaff <
> > >> > j...@johnbickerstaff.com>
> > >> > wrote:
> > >> >
> > >> > > Thanks!
> > >> > >
> > >> > > To answer your questions, while I digest the rest of that
> > >> information...
> > >> > >
> > >> > > I'm using the hon-lucene-synonyms.5.0.4.jar from here:
> > >> > > https://github.com/healthonnet/hon-lucene-synonyms
> > >> > >
> > >> > > The config looks like this - and IIRC, is simply a copy from the
> > >> > > recommended config on the site mentioned above.
> > >> > >
> > >> > >  <queryParser name="synonym_edismax"
> > >> > >     class="com.github.healthonnet.search.SynonymExpandingExtendedDismaxQParserPlugin">
> > >> > >     <!-- You can define more than one synonym analyzer in the
> > >> following
> > >> > > list.
> > >> > >          For example, you might have one set of synonyms for
> > English,
> > >> one
> > >> > > for French,
> > >> > >          one for Spanish, etc.
> > >> > >       -->
> > >> > >     <lst name="synonymAnalyzers">
> > >> > >       <!-- Name your analyzer something useful, e.g.
> "analyzer_en",
> > >> > > "analyzer_fr", "analyzer_es", etc.
> > >> > >            If you only have one, the name doesn't matter (hence
> > >> > > "myCoolAnalyzer").
> > >> > >         -->
> > >> > >       <lst name="myCoolAnalyzer">
> > >> > >         <!-- We recommend a PatternTokenizerFactory that tokenizes
> > >> based
> > >> > > on whitespace and quotes.
> > >> > >              This seems to work best with most people's synonym
> > files.
> > >> > >              For details, read the discussion here:
> > >> > > http://github.com/healthonnet/hon-lucene-synonyms/issues/26
> > >> > >           -->
> > >> > >         <lst name="tokenizer">
> > >> > >           <str name="class">solr.PatternTokenizerFactory</str>
> > >> > >           <str name="pattern"><![CDATA[(?:\s|\")+]]></str>
> > >> > >         </lst>
> > >> > >         <!-- The ShingleFilterFactory outputs synonyms of multiple
> > >> token
> > >> > > lengths (e.g. unigrams, bigrams, trigrams, etc.).
> > >> > >              The default here is to assume you don't have any
> > synonyms
> > >> > > longer than 4 tokens.
> > >> > >              You can tweak this depending on what your synonyms
> look
> > >> > like.
> > >> > > E.g. if you only have unigrams, you can remove
> > >> > >              it entirely, and if your synonyms are up to 7 tokens
> in
> > >> > > length, you should set the maxShingleSize to 7.
> > >> > >           -->
> > >> > >         <lst name="filter">
> > >> > >           <str name="class">solr.ShingleFilterFactory</str>
> > >> > >           <str name="outputUnigramsIfNoShingles">true</str>
> > >> > >           <str name="outputUnigrams">true</str>
> > >> > >           <str name="minShingleSize">2</str>
> > >> > >           <str name="maxShingleSize">4</str>
> > >> > >         </lst>
> > >> > >         <!-- This is where you set your synonym file.  For the
> unit
> > >> tests
> > >> > > and "Getting Started" examples, we use example_synonym_file.txt.
> > >> > >              This plugin will work best if you keep expand set to
> > true
> > >> > and
> > >> > > have all your synonyms comma-separated (rather than =>-separated).
> > >> > >           -->
> > >> > >         <lst name="filter">
> > >> > >           <str name="class">solr.SynonymFilterFactory</str>
> > >> > >           <str name="tokenizerFactory">solr.KeywordTokenizerFactory</str>
> > >> > >           <str name="synonyms">example_synonym_file.txt</str>
> > >> > >           <str name="expand">true</str>
> > >> > >           <str name="ignoreCase">true</str>
> > >> > >         </lst>
> > >> > >       </lst>
> > >> > >     </lst>
> > >> > >   </queryParser>
> > >> > >
> > >> > >
> > >> > >
> > >> > > On Thu, Aug 11, 2016 at 6:01 PM, Chris Hostetter <
> > >> > hossman_luc...@fucit.org
> > >> > > > wrote:
> > >> > >
> > >> > >>
> > >> > >> : First let me say that this is very possibly the "x - y problem"
> > so
> > >> let
> > >> > >> me
> > >> > >> : state up front what my ultimate need is -- then I'll ask about
> > the
> > >> > >> thing I
> > >> > >> : imagine might help...  which, of course, is heavily biased in
> the
> > >> > >> direction
> > >> > >> : of my experience coding Java and writing SQL...
> > >> > >>
> > >> > >> Thank you so much for asking your question this way!
> > >> > >>
> > >> > >> Right off the bat, the background you've provided seems
> > suspicious...
> > >> > >>
> > >> > >> : I have a piece of a query that calculates a score based on a
> > >> > "weighting"
> > >> > >>         ...
> > >> > >> : The specific line is this:
> > >> > >> : <str name="bf">product(field(category_weight),20)</str>
> > >> > >> :
> > >> > >> : What I just realized is that when I query Solr for a string
> that
> > >> has
> > >> > NO
> > >> > >> : matches in the entire corpus, I still get a slew of results
> > because
> > >> > >> EVERY
> > >> > >> : doc has the weighting value in the category_weight field - and
> > >> > therefore
> > >> > >> : every doc gets some score.
> > >> > >>
> > >> > >> ...that is *NOT* how dismax and edismax normally work.
> > >> > >>
> > >> > >> While both the "bf" and "bq" params result in "additive"
> boosting,
> > >> and
> > >> > the
> > >> > >> implementation of that "additive boost" comes from adding new
> > >> optional
> > >> > >> clauses to the top level BooleanQuery that is executed, that only
> > >> > happens
> > >> > >> after the "main" query (from your "q" param) is added to that top
> > >> level
> > >> > >> BooleanQuery as a "mandatory" clause.
> > >> > >>
> > >> > >> So, for example, "bf=true()" and "bq=*:*" should match & boost
> > every
> > >> > doc,
> > >> > >> but with the techproducts configs/data these requests still don't
> > >> match
> > >> > >> anything...
> > >> > >>
> > >> > >> /select?defType=edismax&q=bogus&bf=true()&bq=*:*&debug=query
> > >> > >> /select?defType=dismax&q=bogus&bf=true()&bq=*:*&debug=query
> > >> > >>
> > >> > >> ...and if you look at the debug output, the parsed query shows
> > that
> > >> > the
> > >> > >> "bogus" part of the query is mandatory...
> > >> > >>
> > >> > >> +DisjunctionMaxQuery((text:bogus)) MatchAllDocsQuery(*:*)
> > >> > >> FunctionQuery(const(true))
> > >> > >>
> > >> > >> (i didn't use "pf" in that example, but the effect is the same,
> the
> > >> "pf"
> > >> > >> based clauses are optional, while the "qf" based clauses are
> > >> mandatory)
> > >> > >>
> > >> > >> If you compare that example to your debug output, you'll notice a
> > >> > >> difference in structure -- it's a bit hard to see in your
> example,
> > >> but
> > >> > if
> > >> > >> you simplify your qf, pf, and q fields it should be more obvious,
> > but
> > >> > >> AFAICT the "main" parts of your query are getting wrapped in an
> > extra
> > >> > >> layer of parens (ie: an extra BooleanQuery) which is *not*
> > >> mandatory in
> > >> > >> the top level query ... i don't see *any* mandatory clauses in
> your
> > >> top
> > >> > >> level BooleanQuery, which is why any match on a bf or bq function
> > is
> > >> > >> enough to cause a document to match.
> > >> > >>
> > >> > >> I suspect the reason your parsed query structure is so diff has
> to
> > do
> > >> > with
> > >> > >> this...
> > >> > >>
> > >> > >> :        <str name="defType">synonym_edismax</str>
> > >> > >>
> > >> > >>
> > >> > >> 1) how exactly is "synonym_edismax" defined in your
> solrconfig.xml?
> > >> > >> 2) what QParserPlugin are you using to implement that?
> > >> > >>
> > >> > >> I suspect whatever QParserPlugin you are using has a bug in it :)
> > >> > >>
> > >> > >>
> > >> > >> If you can't fix the bug, one possible workaround would be to
> > >> abandon
> > >> > bf
> > >> > >> and bq params completely, and instead wrap the query it produces
> in
> > >> a
> > >> > >> {!boost} parser with whatever function you want (using functions
> > like
> > >> > >> sum() or product() to combine multiple functions, and query() to
> > >> > incorporate
> > >> > >> your current bq param).  Doing this will require changing how you
> > >> specify
> > >> > >> your input (example below) and it will result in *multiplicative*
> > >> boosts
> > >> > --
> > >> > >> so your scores will be much diff, and you will likely have to
> > adjust
> > >> > your
> > >> > >> constants, but: 1) multiplicative boosts are almost always what
> > >> people
> > >> > >> *really* want anyway; 2) it will ensure the boosts are only
> applied
> > >> for
> > >> > >> things matching your main query, no matter how that query parser
> > >> works
> > >> > or
> > >> > >> what bugs it has.
> > >> > >>
> > >> > >> Example of using {!boost} to wrap an arbitrary other parser...
> > >> > >>
> > >> > >> instead of...
> > >> > >>   defType=foofoo
> > >> > >>   q=barbarbar
> > >> > >>
> > >> > >> use...
> > >> > >>    q={!boost b=$func defType=foofoo v=$qq}
> > >> > >>   qq=barbarbar
> > >> > >> func=sum(something,somethingelse)
> > >> > >>
> > >> > >> https://cwiki.apache.org/confluence/display/solr/Other+Parsers
> > >> > >> https://cwiki.apache.org/confluence/display/solr/Function+Queries
> > >> > >>
> > >> > >>
> > >> > >>
> > >> > >>
> > >> > >> :
> > >> > >> : What I would like is to return zero results if there is no
> match
> > >> for
> > >> > the
> > >> > >> : querystring.  My collection is small enough that I don't care
> if
> > >> the
> > >> > >> actual
> > >> > >> : calculation runs on each doc (although that's wasteful) -- I
> just
> > >> > don't
> > >> > >> : want to see results come back for zero matches to the
> querystring
> > >> > >> :
> > >> > >> : (The /select endpoint does this of course, but my custom
> endpoint
> > >> > >> includes
> > >> > >> : this "weighting" piece and therefore returns every doc in the
> > >> corpus
> > >> > >> : because they all have the weighting.
> > >> > >> :
> > >> > >> : ====================
> > >> > >> : Enter my imagined solution...  The potential X-Y problem...
> > >> > >> : ====================
> > >> > >> :
> > >> > >> : So - given that I come from a programming background, I
> > immediately
> > >> > >> start
> > >> > >> : thinking of an if statement ...
> > >> > >> :
> > >> > >> :      if(some_score_for_the_primary_search_string) {
> > >> > >> :           run_the_category_weight_calculation;
> > >> > >> :      } else {
> > >> > >> :           do_NOT_run_category_weight_calc;
> > >> > >> :      }
> > >> > >> :
> > >> > >> :
> > >> > >> : Another way of thinking of it would be something like the
> "WHERE"
> > >> > >> clause in
> > >> > >> : SQL...
> > >> > >> :
> > >> > >> :  run_category_weight_calculation WHERE "searchstring" is found
> > in
> > >> the
> > >> > >> : document, not otherwise.
> > >> > >> :
> > >> > >> : I'm aware that things could be handled in the client-side of my
> > web
> > >> > app,
> > >> > >> : but if possible, I'd like the interface to SOLR to be as clean
> as
> > >> > >> possible,
> > >> > >> : and massage incoming SOLR data as little as possible.
> > >> > >> :
> > >> > >> : In other words, do NOT return any docs if the querystring (and
> > any
> > >> > >> : synonyms) match zero docs.
> > >> > >> :
> > >> > >> : Here is the endpoint XML for the query.  I've highlighted the
> > >> specific
> > >> > >> line
> > >> > >> : that is causing the unintended results...
> > >> > >> :
> > >> > >> :
> > >> > >> :  <requestHandler name="/foo" class="solr.SearchHandler">
> > >> > >> :     <!-- default values for query parameters can be specified,
> > >> these
> > >> > >> :          will be overridden by parameters in the request
> > >> > >> :       -->
> > >> > >> :      <lst name="defaults">
> > >> > >> :        <str name="echoParams">all</str>
> > >> > >> :        <int name="rows">20</int>
> > >> > >> :        <!-- Query settings -->
> > >> > >> :        <str name="df">text</str>
> > >> > >> :       <!-- <str name="df">title</str> -->
> > >> > >> :        <str name="defType">synonym_edismax</str>
> > >> > >> :        <str name="synonyms">true</str>
> > >> > >> :     <!-- The line below balances out the weighting of exact
> > >> matches to
> > >> > >> the
> > >> > >> : synonym phrase entered by the user
> > >> > >> :          with the category_weight calculation and the
> titleQuery
> > >> calc.
> > >> > >> : These numbers exist in a balance and
> > >> > >> :          if one is raised or lowered, the others (probably)
> need
> > to
> > >> > >> change
> > >> > >> : as well.  It may be better to go with decimals
> > >> > >> :          for all of them... .4 instead of 4 and 2 instead of 20
> > and
> > >> > 2.5
> > >> > >> : instead of 25.
> > >> > >> :          In the end, I'm not sure it really matters, but don't
> > >> change
> > >> > >> one
> > >> > >> : without changing the others
> > >> > >> :          unless you've tested and are sure you want the results
> > >> -->
> > >> > >> :        <float name="synonyms.originalBoost">1.5</float>
> > >> > >> :        <float name="synonyms.synonymBoost">1.1</float>
> > >> > >> :        <str name="mm">75%</str>
> > >> > >> :        <str name="q.alt">*:*</str>
> > >> > >> :        <str name="rows">20</str>
> > >> > >> :        <str name="fq">meta_doc_type:chapterDoc</str>
> > >> > >> :        <str name="bq">{!synonym_edismax qf='title'
> > synonyms='true'
> > >> > >> : synonyms.originalBoost='2.5' synonyms.synonymBoost='1.1' bf=''
> > >> bq=''
> > >> > >> : v=$q}</str>
> > >> > >> :        <str name="fl">id category_weight title category_ss
> score
> > >> > >> : contentType</str>
> > >> > >> :        <str name="titleQuery">{!edismax qf='title' bf='' bq=''
> > >> > >> v=$q}</str>
> > >> > >> : =====================================================
> > >> > >> :        *<str name="bf">product(field(category_weight),20)</str>*
> > >> > >> : =====================================================
> > >> > >> :        <str name="bf">product(query($titleQuery),4)</str>
> > >> > >> :        <str name="qf">text contentType^1000</str>
> > >> > >> :        <str name="wt">python</str>
> > >> > >> :        <str name="debug">true</str>
> > >> > >> :        <str name="debug.explain.structured">true</str>
> > >> > >> :        <str name="indent">true</str>
> > >> > >> :        <str name="echoParams">all</str>
> > >> > >> :      </lst>
> > >> > >> :   </requestHandler>
> > >> > >> :
> > >> > >> : And here is the debug output for a query.  (This was a test for
> > >> > >> synonyms,
> > >> > >> : which you'll see in the output.) The original query string was,
> > of
> > >> > >> : course, "μ-heavy
> > >> > >> : chain disease"
> > >> > >> :
> > >> > >> : You'll note that although there is no score in the first doc
> > >> explain
> > >> > for
> > >> > >> : the actual querystring, the highlighted section does get a
> score
> > >> for
> > >> > >> : product(double(category_weight)=1.5,const(20))
> > >> > >> :
> > >> > >> : ... which is the thing that is currently causing all the docs
> in
> > >> the
> > >> > >> : collection to "match" even though the querystring is not in any
> > of
> > >> > them.
> > >> > >> :
> > >> > >> : "debug":{ "rawquerystring":"\"μ-heavy chain disease\"",
> > >> > >> : "querystring":"\"μ-heavy
> > >> > >> : chain disease\"", "parsedquery":"(
> DisjunctionMaxQuery((text:\"μ
> > >> heavy
> > >> > >> chain
> > >> > >> : disease\" | (contentType:\"μ heavy chain
> disease\")^1000.0))^1.5
> > >> > >> : ((+DisjunctionMaxQuery((text:\"mu heavy chain disease\" |
> > >> > >> (contentType:\"mu
> > >> > >> : heavy chain disease\")^1000.0)))/no_coord^1.1)
> > >> > >> : ((+DisjunctionMaxQuery((text:\"μ hcd\" | (contentType:\"μ
> > >> > >> : hcd\")^1000.0)))/no_coord^1.1) ((+DisjunctionMaxQuery((text:\
> "μ
> > >> heavy
> > >> > >> chain
> > >> > >> : disease\" | (contentType:\"μ heavy chain
> > >> > disease\")^1000.0)))/no_coord^
> > >> > >> 1.1)
> > >> > >> : ((+DisjunctionMaxQuery((text:\"μ hcd\" | (contentType:\"μ
> > >> > >> : hcd\")^1000.0)))/no_coord^1.1)) ((DisjunctionMaxQuery((title:\
> "μ
> > >> > heavy
> > >> > >> : chain disease\"))^2.5 ((+DisjunctionMaxQuery((title:\"mu heavy
> > >> chain
> > >> > >> : disease\")))/no_coord^1.1) ((+DisjunctionMaxQuery((title:\"μ
> > >> > >> : hcd\")))/no_coord^1.1) ((+DisjunctionMaxQuery((title:\"μ heavy
> > >> chain
> > >> > >> : disease\")))/no_coord^1.1) ((+DisjunctionMaxQuery((title:\"μ
> > >> > >> : hcd\")))/no_coord^1.1)))
> > >> > >> : FunctionQuery(product(double(category_weight),const(20)))
> > >> > >> : FunctionQuery(product(query(+(title:\"μ heavy chain
> > >> > >> : disease\"),def=0.0),const(4)))",
> "parsedquery_toString":"(((tex
> > >> t:\"μ
> > >> > >> heavy
> > >> > >> : chain disease\" | (contentType:\"μ heavy chain
> > >> disease\")^1000.0))^1.5
> > >> > >> : ((+(text:\"mu heavy chain disease\" | (contentType:\"mu heavy
> > chain
> > >> > >> : disease\")^1000.0))^1.1) ((+(text:\"μ hcd\" | (contentType:\"μ
> > >> > >> : hcd\")^1000.0))^1.1) ((+(text:\"μ heavy chain disease\" |
> > >> > >> (contentType:\"μ
> > >> > >> : heavy chain disease\")^1000.0))^1.1) ((+(text:\"μ hcd\" |
> > >> > >> (contentType:\"μ
> > >> > >> : hcd\")^1000.0))^1.1)) ((((title:\"μ heavy chain disease\"))^2.5
> > >> > >> : ((+(title:\"mu heavy chain disease\"))^1.1) ((+(title:\"μ
> > >> hcd\"))^1.1)
> > >> > >> : ((+(title:\"μ heavy chain disease\"))^1.1) ((+(title:\"μ
> > >> > hcd\"))^1.1)))
> > >> > >> : product(double(category_weight),const(20))
> > >> product(query(+(title:\"μ
> > >> > >> heavy
> > >> > >> : chain disease\"),def=0.0),const(4))", "explain":{ "
> > >> > >> : 33d808fe-6ccf-4305-a643-48e94de34d18":{ "match":true,
> > >> "value":30.0, "
> > >> > >> : description":"sum of:", "details":[{ "match":true,
> "value":30.0,
> > "
> > >> > >> : description":"FunctionQuery(product(double(category_weight),
> > >> > >> const(20))),
> > >> > >> : product of:",
> > >> > >> : =====================================================
> > >> > >> : *"details":**[{ "match":true, "value":30.0,
> > >> > >> : "description":"product(double(category_weight)=1.5,const(20)
> )"},
> > >> {*
> > >> > >> : =====================================================
> > >> > >> :
> > >> > >> : "match":true, "value":1.0, "description":"boost"}, {
> > "match":true,
> > >> > >> "value":
> > >> > >> : 1.0, "description":"queryNorm"}]}, {
> > >> > >> :
> > >> > >>
> > >> > >> -Hoss
> > >> > >> http://www.lucidworks.com/
> > >> > >
> > >> > >
> > >> > >
> > >> >
> > >>
> > >
> > >
> >
>
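
For anyone landing on this thread later, here is a rough sketch of the clause behaviour Hoss describes: when the main query is a mandatory clause, a bf/bq clause that matches every document cannot pull in unrelated docs, but when nothing in the top-level query is mandatory, any optional clause is enough to match. This is a toy Python model, not Lucene or Solr code. The class name BooleanQueryModel, the clause functions, and the sample document are all hypothetical and exist only for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

Doc = Dict[str, object]
Clause = Callable[[Doc], bool]


@dataclass
class BooleanQueryModel:
    """Toy stand-in for a top-level boolean query (hypothetical, not a Lucene class)."""
    must: List[Clause] = field(default_factory=list)    # mandatory ("+") clauses
    should: List[Clause] = field(default_factory=list)  # optional clauses

    def matches(self, doc: Doc) -> bool:
        if self.must:
            # With mandatory clauses present, optional clauses only affect
            # scoring; they cannot make a document match on their own.
            return all(clause(doc) for clause in self.must)
        # With no mandatory clauses, one matching optional clause is enough.
        return any(clause(doc) for clause in self.should)


def main_query(doc: Doc) -> bool:
    # Stands in for the user's "q"; this doc does not contain "bogus".
    return "bogus" in str(doc["text"])


def weight_boost(doc: Doc) -> bool:
    # Stands in for the bf function query, which matches every document.
    return True


doc = {"text": "heavy chain disease overview", "category_weight": 1.5}

# Normal dismax/edismax shape: main query mandatory, boost optional.
expected = BooleanQueryModel(must=[main_query], should=[weight_boost])

# Shape seen in the debug output above: no mandatory clause at all.
buggy = BooleanQueryModel(should=[main_query, weight_boost])

print(expected.matches(doc), buggy.matches(doc))
```

In this model the first structure rejects the document and the second accepts it, which mirrors the symptom reported in the thread.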

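The {!boost} workaround from Hoss's example can also be assembled on the client side. The following is a minimal sketch, assuming the request parameters are built in Python; the helper name boost_wrapped_params is made up for illustration, while the parameter names q, qq and func and the local-params string follow Hoss's example. The boost function shown is the category_weight function from the original handler config; the constant would likely need retuning because the boost becomes multiplicative.

```python
from urllib.parse import urlencode


def boost_wrapped_params(user_query: str, inner_parser: str, boost_function: str) -> dict:
    """Build request params that wrap another parser in {!boost}.

    Hypothetical helper: it only formats strings and does not talk to Solr.
    """
    return {
        # Same shape as the example in the thread:
        #   q={!boost b=$func defType=foofoo v=$qq}
        "q": "{!boost b=$func defType=%s v=$qq}" % inner_parser,
        "qq": user_query,        # the raw user query, dereferenced by v=$qq
        "func": boost_function,  # the boost function, dereferenced by b=$func
    }


params = boost_wrapped_params(
    user_query="bogus",
    inner_parser="synonym_edismax",
    boost_function="product(field(category_weight),20)",
)

# URL-encode for use after "/select?" or a custom handler path.
query_string = urlencode(params)
print(query_string)
```

Keeping the user input in its own qq parameter means it never has to be escaped into the local-params string itself.
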