Re: Want zero results from SOLR when there are no matches for "querystring"

John Bickerstaff Fri, 12 Aug 2016 10:09:18 -0700

Hossman - many thanks again for your comprehensive and very helpful answer!


All,

I recall (possibly mis-remembering) reading something about being able to pass
the results of one query to another query...  Essentially "chaining" result
sets.

I have looked in docs and can't find anything on a quick search -- I may
have been reading about the Re-Ranking feature, which doesn't help me (I
know because I just tried and it seems to return all results anyway, just
re-ranking the number specified in the reRankDocs flag...)

Is there a way to (cleanly) send the results of one query to another query
for further processing?  Essentially, pass ONLY the results (including an
empty set of results) to another query for processing?

thanks...

On Thu, Aug 11, 2016 at 6:19 PM, John Bickerstaff <j...@johnbickerstaff.com>
wrote:

> Thanks!
>
> To answer your questions, while I digest the rest of that information...
>
> I'm using the hon-lucene-synonyms.5.0.4.jar from here:
> https://github.com/healthonnet/hon-lucene-synonyms
>
> The config looks like this - and IIRC, is simply a copy from the
> recommended config on the site mentioned above.
>
>  <queryParser name="synonym_edismax" class="com.github.healthonnet.search.SynonymExpandingExtendedDismaxQParserPlugin">
>     <!-- You can define more than one synonym analyzer in the following list.
>          For example, you might have one set of synonyms for English, one for French,
>          one for Spanish, etc.
>       -->
>     <lst name="synonymAnalyzers">
>       <!-- Name your analyzer something useful, e.g. "analyzer_en", "analyzer_fr", "analyzer_es", etc.
>            If you only have one, the name doesn't matter (hence "myCoolAnalyzer").
>         -->
>       <lst name="myCoolAnalyzer">
>         <!-- We recommend a PatternTokenizerFactory that tokenizes based on whitespace and quotes.
>              This seems to work best with most people's synonym files.
>              For details, read the discussion here: http://github.com/healthonnet/hon-lucene-synonyms/issues/26
>           -->
>         <lst name="tokenizer">
>           <str name="class">solr.PatternTokenizerFactory</str>
>           <str name="pattern"><![CDATA[(?:\s|\")+]]></str>
>         </lst>
>         <!-- The ShingleFilterFactory outputs synonyms of multiple token lengths (e.g. unigrams, bigrams, trigrams, etc.).
>              The default here is to assume you don't have any synonyms longer than 4 tokens.
>              You can tweak this depending on what your synonyms look like. E.g. if you only have unigrams, you can remove
>              it entirely, and if your synonyms are up to 7 tokens in length, you should set the maxShingleSize to 7.
>           -->
>         <lst name="filter">
>           <str name="class">solr.ShingleFilterFactory</str>
>           <str name="outputUnigramsIfNoShingles">true</str>
>           <str name="outputUnigrams">true</str>
>           <str name="minShingleSize">2</str>
>           <str name="maxShingleSize">4</str>
>         </lst>
>         <!-- This is where you set your synonym file.  For the unit tests and "Getting Started" examples, we use example_synonym_file.txt.
>              This plugin will work best if you keep expand set to true and have all your synonyms comma-separated (rather than =>-separated).
>           -->
>         <lst name="filter">
>           <str name="class">solr.SynonymFilterFactory</str>
>           <str name="tokenizerFactory">solr.KeywordTokenizerFactory</str>
>           <str name="synonyms">example_synonym_file.txt</str>
>           <str name="expand">true</str>
>           <str name="ignoreCase">true</str>
>         </lst>
>       </lst>
>     </lst>
>   </queryParser>
>
>
>
> On Thu, Aug 11, 2016 at 6:01 PM, Chris Hostetter <hossman_luc...@fucit.org
> > wrote:
>
>>
>> : First let me say that this is very possibly the "x - y problem" so let
>> me
>> : state up front what my ultimate need is -- then I'll ask about the
>> thing I
>> : imagine might help...  which, of course, is heavily biased in the
>> direction
>> : of my experience coding Java and writing SQL...
>>
>> Thank you so much for asking your question this way!
>>
>> Right off the bat, the background you've provided seems suspicious...
>>
>> : I have a piece of a query that calculates a score based on a "weighting"
>>         ...
>> : The specific line is this:
>> : <str name="bf">product(field(category_weight),20)</str>
>> :
>> : What I just realized is that when I query Solr for a string that has NO
>> : matches in the entire corpus, I still get a slew of results because
>> EVERY
>> : doc has the weighting value in the category_weight field - and therefore
>> : every doc gets some score.
>>
>> ...that is *NOT* how dismax and edismax normally work.
>>
>> While both the "bf" and "bq" params result in "additive" boosting, and the
>> implementation of that "additive boost" comes from adding new optional
>> clauses to the top level BooleanQuery that is executed, that only happens
>> after the "main" query (from your "q" param) is added to that top level
>> BooleanQuery as a "mandatory" clause.
>>
>> So, for example, "bf=true()" and "bq=*:*" should match & boost every doc,
>> but with the techproducts configs/data these requests still don't match
>> anything...
>>
>> /select?defType=edismax&q=bogus&bf=true()&bq=*:*&debug=query
>> /select?defType=dismax&q=bogus&bf=true()&bq=*:*&debug=query
>>
>> ...and if you look at the debug output, the parsed query shows that the
>> "bogus" part of the query is mandatory...
>>
>> +DisjunctionMaxQuery((text:bogus)) MatchAllDocsQuery(*:*)
>> FunctionQuery(const(true))
>>
>> (i didn't use "pf" in that example, but the effect is the same, the "pf"
>> based clauses are optional, while the "qf" based clauses are mandatory)
>>
>> If you compare that example to your debug output, you'll notice a
>> difference in structure.  It's a bit hard to see in your example (it
>> should be more obvious if you simplify your qf, pf, and q params), but
>> AFAICT the "main" parts of your query are getting wrapped in an extra
>> layer of parentheses (ie: an extra BooleanQuery) which is *not* mandatory in
>> the top level query ... i don't see *any* mandatory clauses in your top
>> level BooleanQuery, which is why any match on a bf or bq function is
>> enough to cause a document to match.
>>
>> I suspect the reason your parsed query structure is so different has to do
>> with this...
>>
>> :        <str name="defType">synonym_edismax</str>
>>
>>
>> 1) how exactly is "synonym_edismax" defined in your solrconfig.xml?
>> 2) what QParserPlugin are you using to implement that?
>>
>> I suspect whatever QParserPlugin you are using has a bug in it :)
>>
>>
>> If you can't fix the bug, one possible workaround would be to abandon bf
>> and bq params completely, and instead wrap the query it produces in a
>> {!boost} parser with whatever function you want (using functions like
>> sum() or product() to combine multiple functions, and query() to incorporate
>> your current bq param).  Doing this will require changing how you specify
>> your input (example below) and it will result in *multiplicative* boosts --
>> so your scores will be much different, and you will likely have to adjust your
>> constants, but: 1) multiplicative boosts are almost always what people
>> *really* want anyway; 2) it will ensure the boosts are only applied for
>> things matching your main query, no matter how that query parser works or
>> what bugs it has.
>>
>> Example of using {!boost} to wrap an arbitrary other parser...
>>
>> instead of...
>>   defType=foofoo
>>   q=barbarbar
>>
>> use...
>>   q={!boost b=$func defType=foofoo v=$qq}
>>   qq=barbarbar
>>   func=sum(something,somethingelse)
>>
>> https://cwiki.apache.org/confluence/display/solr/Other+Parsers
>> https://cwiki.apache.org/confluence/display/solr/Function+Queries
>>
>>
>>
>>
>> :
>> : What I would like is to return zero results if there is no match for the
>> : querystring.  My collection is small enough that I don't care if the
>> actual
>> : calculation runs on each doc (although that's wasteful) -- I just don't
>> : want to see results come back for zero matches to the querystring
>> :
>> : (The /select endpoint does this of course, but my custom endpoint
>> includes
>> : this "weighting" piece and therefore returns every doc in the corpus
>> : because they all have the weighting.)
>> :
>> : ====================
>> : Enter my imagined solution...  The potential X-Y problem...
>> : ====================
>> :
>> : So - given that I come from a programming background, I immediately
>> start
>> : thinking of an if statement ...
>> :
>> :      if(some_score_for_the_primary_search_string) {
>> :           run_the_category_weight_calculation;
>> :      } else {
>> :           do_NOT_run_category_weight_calc;
>> :      }
>> :
>> :
>> : Another way of thinking of it would be something like the "WHERE"
>> clause in
>> : SQL...
>> :
>> :  run_category_weight_calculation WHERE "searchstring" is found in the
>> : document, not otherwise.
>> :
>> : I'm aware that things could be handled in the client-side of my web app,
>> : but if possible, I'd like the interface to SOLR to be as clean as
>> possible,
>> : and massage incoming SOLR data as little as possible.
>> :
>> : In other words, do NOT return any docs if the querystring (and any
>> : synonyms) match zero docs.
>> :
>> : Here is the endpoint XML for the query.  I've highlighted the specific
>> line
>> : that is causing the unintended results...
>> :
>> :
>> :  <requestHandler name="/foo" class="solr.SearchHandler">
>> :     <!-- default values for query parameters can be specified, these
>> :          will be overridden by parameters in the request
>> :       -->
>> :      <lst name="defaults">
>> :        <str name="echoParams">all</str>
>> :        <int name="rows">20</int>
>> :        <!-- Query settings -->
>> :        <str name="df">text</str>
>> :       <!-- <str name="df">title</str> -->
>> :        <str name="defType">synonym_edismax</str>
>> :        <str name="synonyms">true</str>
>> :     <!-- The line below balances out the weighting of exact matches to
>> the
>> : synonym phrase entered by the user
>> :          with the category_weight calculation and the titleQuery calc.
>> : These numbers exist in a balance and
>> :          if one is raised or lowered, the others (probably) need to
>> change
>> : as well.  It may be better to go with decimals
>> :          for all of them... .4 instead of 4 and 2 instead of 20 and 2.5
>> : instead of 25.
>> :          In the end, I'm not sure it really matters, but don't change
>> one
>> : without changing the others
>> :          unless you've tested and are sure you want the results  -->
>> :        <float name="synonyms.originalBoost">1.5</float>
>> :        <float name="synonyms.synonymBoost">1.1</float>
>> :        <str name="mm">75%</str>
>> :        <str name="q.alt">*:*</str>
>> :        <str name="rows">20</str>
>> :        <str name="fq">meta_doc_type:chapterDoc</str>
>> :        <str name="bq">{!synonym_edismax qf='title' synonyms='true'
>> : synonyms.originalBoost='2.5' synonyms.synonymBoost='1.1' bf='' bq=''
>> : v=$q}</str>
>> :        <str name="fl">id category_weight title category_ss score
>> : contentType</str>
>> :        <str name="titleQuery">{!edismax qf='title' bf='' bq=''
>> v=$q}</str>
>> : =====================================================
>> :        *<str name="bf">product(field(category_weight),20)</str>*
>> : =====================================================
>> :        <str name="bf">product(query($titleQuery),4)</str>
>> :        <str name="qf">text contentType^1000</str>
>> :        <str name="wt">python</str>
>> :        <str name="debug">true</str>
>> :        <str name="debug.explain.structured">true</str>
>> :        <str name="indent">true</str>
>> :        <str name="echoParams">all</str>
>> :      </lst>
>> :   </requestHandler>
>> :
>> : And here is the debug output for a query.  (This was a test for
>> synonyms,
>> : which you'll see in the output.) The original query string was, of
>> : course, "μ-heavy
>> : chain disease"
>> :
>> : You'll note that although there is no score in the first doc explain for
>> : the actual querystring, the highlighted section does get a score for
>> : product(double(category_weight)=1.5,const(20))
>> :
>> : ... which is the thing that is currently causing all the docs in the
>> : collection to "match" even though the querystring is not in any of them.
>> :
>> : "debug":{ "rawquerystring":"\"μ-heavy chain disease\"",
>> : "querystring":"\"μ-heavy
>> : chain disease\"", "parsedquery":"(DisjunctionMaxQuery((text:\"μ heavy
>> chain
>> : disease\" | (contentType:\"μ heavy chain disease\")^1000.0))^1.5
>> : ((+DisjunctionMaxQuery((text:\"mu heavy chain disease\" |
>> (contentType:\"mu
>> : heavy chain disease\")^1000.0)))/no_coord^1.1)
>> : ((+DisjunctionMaxQuery((text:\"μ hcd\" | (contentType:\"μ
>> : hcd\")^1000.0)))/no_coord^1.1) ((+DisjunctionMaxQuery((text:\"μ heavy
>> chain
>> : disease\" | (contentType:\"μ heavy chain disease\")^1000.0)))/no_coord^
>> 1.1)
>> : ((+DisjunctionMaxQuery((text:\"μ hcd\" | (contentType:\"μ
>> : hcd\")^1000.0)))/no_coord^1.1)) ((DisjunctionMaxQuery((title:\"μ heavy
>> : chain disease\"))^2.5 ((+DisjunctionMaxQuery((title:\"mu heavy chain
>> : disease\")))/no_coord^1.1) ((+DisjunctionMaxQuery((title:\"μ
>> : hcd\")))/no_coord^1.1) ((+DisjunctionMaxQuery((title:\"μ heavy chain
>> : disease\")))/no_coord^1.1) ((+DisjunctionMaxQuery((title:\"μ
>> : hcd\")))/no_coord^1.1)))
>> : FunctionQuery(product(double(category_weight),const(20)))
>> : FunctionQuery(product(query(+(title:\"μ heavy chain
>> : disease\"),def=0.0),const(4)))", "parsedquery_toString":"(((text:\"μ
>> heavy
>> : chain disease\" | (contentType:\"μ heavy chain disease\")^1000.0))^1.5
>> : ((+(text:\"mu heavy chain disease\" | (contentType:\"mu heavy chain
>> : disease\")^1000.0))^1.1) ((+(text:\"μ hcd\" | (contentType:\"μ
>> : hcd\")^1000.0))^1.1) ((+(text:\"μ heavy chain disease\" |
>> (contentType:\"μ
>> : heavy chain disease\")^1000.0))^1.1) ((+(text:\"μ hcd\" |
>> (contentType:\"μ
>> : hcd\")^1000.0))^1.1)) ((((title:\"μ heavy chain disease\"))^2.5
>> : ((+(title:\"mu heavy chain disease\"))^1.1) ((+(title:\"μ hcd\"))^1.1)
>> : ((+(title:\"μ heavy chain disease\"))^1.1) ((+(title:\"μ hcd\"))^1.1)))
>> : product(double(category_weight),const(20)) product(query(+(title:\"μ
>> heavy
>> : chain disease\"),def=0.0),const(4))", "explain":{ "
>> : 33d808fe-6ccf-4305-a643-48e94de34d18":{ "match":true, "value":30.0, "
>> : description":"sum of:", "details":[{ "match":true, "value":30.0, "
>> : description":"FunctionQuery(product(double(category_weight),
>> const(20))),
>> : product of:",
>> : =====================================================
>> : *"details":**[{ "match":true, "value":30.0,
>> : "description":"product(double(category_weight)=1.5,const(20))"}, {*
>> : =====================================================
>> :
>> : "match":true, "value":1.0, "description":"boost"}, { "match":true,
>> "value":
>> : 1.0, "description":"queryNorm"}]}, {
>> :
>>
>> -Hoss
>> http://www.lucidworks.com/
>
>
>
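
For readers who want to see why the missing mandatory clause in the quoted
debug output matters, here is a toy model of the clause semantics Hoss
describes.  This is not Lucene code: the function name boolean_query_matches
and its arguments are made up for illustration.  It assumes the usual rule
that a BooleanQuery needs every mandatory clause to match and, when it has no
mandatory clauses at all, needs at least one optional clause to match.

```python
def boolean_query_matches(must, should):
    """Toy model of BooleanQuery matching for a single document.

    must   -- list of booleans, one per mandatory (+) clause
    should -- list of booleans, one per optional clause
    (Hypothetical helper for illustration only; not a Lucene/Solr API.)
    """
    if must:
        # With mandatory clauses present, optional clauses only add to the
        # score; they cannot make a non-matching document match.
        return all(must)
    # With no mandatory clauses, one matching optional clause is enough.
    return any(should)


# Stock edismax with q=bogus: the main query is mandatory and fails,
# while the bf and bq clauses are optional and match every document.
stock = boolean_query_matches(must=[False], should=[True, True])

# Structure in the quoted debug output: the main query and the bq query are
# optional and fail, but the two bf function queries still match.
wrapped = boolean_query_matches(must=[], should=[False, False, True, True])

print(stock, wrapped)
```

In the second case the function queries alone are enough for the document to
match, which is the behaviour reported at the start of the thread.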

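The {!boost} workaround from the quoted message can also be written out as
request parameters.  The sketch below only builds the parameter strings; it
does not talk to Solr.  The helper name boosted_params is hypothetical, the
local-param form is copied from Hoss's foofoo example rather than verified
against the synonym_edismax plugin, and reusing the existing bf function as
the boost function is an assumption (the boost becomes multiplicative, so the
constant would likely need retuning).

```python
from urllib.parse import urlencode


def boosted_params(user_query,
                   parser="synonym_edismax",
                   func="product(field(category_weight),20)"):
    """Build params that wrap the main query in a {!boost} parser.

    Hypothetical helper: the parser name and function are taken from the
    thread, but this combination has not been tested against a live Solr.
    """
    return {
        # Same shape as the quoted example: q={!boost b=$func defType=... v=$qq}
        "q": "{!boost b=$func defType=%s v=$qq}" % parser,
        # The user's actual search string goes in a separate parameter.
        "qq": user_query,
        # Boost function, referenced from q through $func.
        "func": func,
    }


params = boosted_params("bogus")
# /foo is the custom handler path shown in the quoted config.
print("/foo?" + urlencode(params))
```

With this shape the bf and bq defaults on the handler would have to be removed
or overridden, since the point of the workaround is to stop using them.
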