Thanks - I'll look at it...

On Fri, Aug 12, 2016 at 1:21 PM, Erick Erickson <erickerick...@gmail.com> wrote:
> Maybe ReRankQParserPlugin?
>
> On Aug 12, 2016 11:54, "John Bickerstaff" <j...@johnbickerstaff.com> wrote:
>
> > @Hossman -- thanks again.
> >
> > I've made the following change and so far things look good. I couldn't see
> > debug or find results for what I put in for $func, so I just removed it,
> > but making modifications as you suggested appears to be working.
> >
> > Including the actual line from my endpoint XML in case this thread helps
> > someone else...
> >
> > <str name="q">{!boost defType=synonym_edismax qf='title' synonyms='true'
> > synonyms.originalBoost='2.5' synonyms.synonymBoost='1.1' bf='' bq=''
> > v=$q}</str>
> >
> > On Fri, Aug 12, 2016 at 12:09 PM, John Bickerstaff <j...@johnbickerstaff.com> wrote:
> >
> > > Thanks! I'll check it out.
> > >
> > > On Fri, Aug 12, 2016 at 12:05 PM, Susheel Kumar <susheel2...@gmail.com> wrote:
> > >
> > >> Not exactly sure what you are looking for from chaining the results, but
> > >> similar functionality is available in Streaming Expressions, where the
> > >> results of inner expressions are passed to outer expressions and so on:
> > >> https://cwiki.apache.org/confluence/display/solr/Streaming+Expressions
> > >>
> > >> HTH
> > >> Susheel
> > >>
> > >> On Fri, Aug 12, 2016 at 1:08 PM, John Bickerstaff <j...@johnbickerstaff.com> wrote:
> > >>
> > >> > Hossman - many thanks again for your comprehensive and very helpful answer!
> > >> >
> > >> > All,
> > >> >
> > >> > I am (possibly mis-remembering) reading something about being able to pass
> > >> > the results of one query to another query... Essentially "chaining" result
> > >> > sets.
> > >> >
> > >> > I have looked in docs and can't find anything on a quick search -- I may
> > >> > have been reading about the Re-Ranking feature, which doesn't help me (I
> > >> > know because I just tried and it seems to return all results anyway, just
> > >> > re-ranking the number specified in the reRankDocs flag...)
> > >> >
> > >> > Is there a way to (cleanly) send the results of one query to another query
> > >> > for further processing? Essentially, pass ONLY the results (including an
> > >> > empty set of results) to another query for processing?
> > >> >
> > >> > thanks...
> > >> >
> > >> > On Thu, Aug 11, 2016 at 6:19 PM, John Bickerstaff <j...@johnbickerstaff.com> wrote:
> > >> >
> > >> > > Thanks!
> > >> > >
> > >> > > To answer your questions, while I digest the rest of that information...
> > >> > >
> > >> > > I'm using the hon-lucene-synonyms.5.0.4.jar from here:
> > >> > > https://github.com/healthonnet/hon-lucene-synonyms
> > >> > >
> > >> > > The config looks like this - and IIRC, is simply a copy from the
> > >> > > recommended config on the site mentioned above.
> > >> > >
> > >> > > <queryParser name="synonym_edismax" class="com.github.healthonnet.search.SynonymExpandingExtendedDismaxQParserPlugin">
> > >> > >   <!-- You can define more than one synonym analyzer in the following list.
> > >> > >        For example, you might have one set of synonyms for English, one for French,
> > >> > >        one for Spanish, etc.
> > >> > >   -->
> > >> > >   <lst name="synonymAnalyzers">
> > >> > >     <!-- Name your analyzer something useful, e.g. "analyzer_en", "analyzer_fr", "analyzer_es", etc.
> > >> > >          If you only have one, the name doesn't matter (hence "myCoolAnalyzer").
> > >> > >     -->
> > >> > >     <lst name="myCoolAnalyzer">
> > >> > >       <!-- We recommend a PatternTokenizerFactory that tokenizes based on whitespace and quotes.
> > >> > >            This seems to work best with most people's synonym files.
> > >> > >            For details, read the discussion here:
> > >> > >            http://github.com/healthonnet/hon-lucene-synonyms/issues/26
> > >> > >       -->
> > >> > >       <lst name="tokenizer">
> > >> > >         <str name="class">solr.PatternTokenizerFactory</str>
> > >> > >         <str name="pattern"><![CDATA[(?:\s|\")+]]></str>
> > >> > >       </lst>
> > >> > >       <!-- The ShingleFilterFactory outputs synonyms of multiple token lengths
> > >> > >            (e.g. unigrams, bigrams, trigrams, etc.).
> > >> > >            The default here is to assume you don't have any synonyms longer than 4 tokens.
> > >> > >            You can tweak this depending on what your synonyms look like.
> > >> > >            E.g. if you only have unigrams, you can remove it entirely, and if your
> > >> > >            synonyms are up to 7 tokens in length, you should set the maxShingleSize to 7.
> > >> > >       -->
> > >> > >       <lst name="filter">
> > >> > >         <str name="class">solr.ShingleFilterFactory</str>
> > >> > >         <str name="outputUnigramsIfNoShingles">true</str>
> > >> > >         <str name="outputUnigrams">true</str>
> > >> > >         <str name="minShingleSize">2</str>
> > >> > >         <str name="maxShingleSize">4</str>
> > >> > >       </lst>
> > >> > >       <!-- This is where you set your synonym file. For the unit tests and
> > >> > >            "Getting Started" examples, we use example_synonym_file.txt.
> > >> > >            This plugin will work best if you keep expand set to true and have all
> > >> > >            your synonyms comma-separated (rather than =>-separated).
> > >> > >       -->
> > >> > >       <lst name="filter">
> > >> > >         <str name="class">solr.SynonymFilterFactory</str>
> > >> > >         <str name="tokenizerFactory">solr.KeywordTokenizerFactory</str>
> > >> > >         <str name="synonyms">example_synonym_file.txt</str>
> > >> > >         <str name="expand">true</str>
> > >> > >         <str name="ignoreCase">true</str>
> > >> > >       </lst>
> > >> > >     </lst>
> > >> > >   </lst>
> > >> > > </queryParser>
> > >> > >
> > >> > > On Thu, Aug 11, 2016 at 6:01 PM, Chris Hostetter <hossman_luc...@fucit.org> wrote:
> > >> > >
> > >> > >>
> > >> > >> : First let me say that this is very possibly the "x - y problem" so let me
> > >> > >> : state up front what my ultimate need is -- then I'll ask about the thing I
> > >> > >> : imagine might help... which, of course, is heavily biased in the direction
> > >> > >> : of my experience coding Java and writing SQL...
> > >> > >>
> > >> > >> Thank you so much for asking your question this way!
> > >> > >>
> > >> > >> Right off the bat, the background you've provided seems suspicious...
> > >> > >>
> > >> > >> : I have a piece of a query that calculates a score based on a "weighting"
> > >> > >> ...
> > >> > >> : The specific line is this:
> > >> > >> : <str name="bf">product(field(category_weight),20)</str>
> > >> > >> :
> > >> > >> : What I just realized is that when I query Solr for a string that has NO
> > >> > >> : matches in the entire corpus, I still get a slew of results because EVERY
> > >> > >> : doc has the weighting value in the category_weight field - and therefore
> > >> > >> : every doc gets some score.
> > >> > >>
> > >> > >> ...that is *NOT* how dismax and edismax normally work.
> > >> > >>
> > >> > >> While both the "bf" and "bq" params result in "additive" boosting, and the
> > >> > >> implementation of that "additive boost" comes from adding new optional
> > >> > >> clauses to the top level BooleanQuery that is executed, that only happens
> > >> > >> after the "main" query (from your "q" param) is added to that top level
> > >> > >> BooleanQuery as a "mandatory" clause.
> > >> > >>
> > >> > >> So, for example, "bf=true()" and "bq=*:*" should match & boost every doc,
> > >> > >> but with the techproducts configs/data these requests still don't match
> > >> > >> anything...
> > >> > >>
> > >> > >> /select?defType=edismax&q=bogus&bf=true()&bq=*:*&debug=query
> > >> > >> /select?defType=dismax&q=bogus&bf=true()&bq=*:*&debug=query
> > >> > >>
> > >> > >> ...and if you look at the debug output, the parsed queries show that the
> > >> > >> "bogus" part of the query is mandatory...
> > >> > >>
> > >> > >> +DisjunctionMaxQuery((text:bogus)) MatchAllDocsQuery(*:*)
> > >> > >> FunctionQuery(const(true))
> > >> > >>
> > >> > >> (i didn't use "pf" in that example, but the effect is the same, the "pf"
> > >> > >> based clauses are optional, while the "qf" based clauses are mandatory)
> > >> > >>
> > >> > >> If you compare that example to your debug output, you'll notice a
> > >> > >> difference in structure -- it's a bit hard to see in your example, but if
> > >> > >> you simplify your qf, pf, and q fields it should be more obvious, but
> > >> > >> AFAICT the "main" parts of your query are getting wrapped in an extra
> > >> > >> layer of parens (ie: an extra BooleanQuery) which is *not* mandatory in
> > >> > >> the top level query ... i don't see *any* mandatory clauses in your top
> > >> > >> level BooleanQuery, which is why any match on a bf or bq function is
> > >> > >> enough to cause a document to match.
> > >> > >>
> > >> > >> I suspect the reason your parsed query structure is so diff has to do with
> > >> > >> this...
> > >> > >>
> > >> > >> : <str name="defType">synonym_edismax</str>>
> > >> > >>
> > >> > >> 1) how exactly is "synonym_edismax" defined in your solrconfig.xml?
> > >> > >> 2) what QParserPlugin are you using to implement that?
> > >> > >>
> > >> > >> I suspect whatever QParserPlugin you are using has a bug in it :)
> > >> > >>
> > >> > >> If you can't fix the bug, one possible workaround would be to abandon bf
> > >> > >> and bq params completely, and instead wrap the query it produces in a
> > >> > >> {!boost} parser with whatever function you want (using functions like
> > >> > >> sum() or product() to combine multiple functions, and query() to incorporate
> > >> > >> your current bq param). Doing this will require changing how you specify
> > >> > >> your input (example below) and it will result in *multiplicative* boosts --
> > >> > >> so your scores will be much diff, and you will likely have to adjust your
> > >> > >> constants, but: 1) multiplicative boosts are almost always what people
> > >> > >> *really* want anyway; 2) it will ensure the boosts are only applied for
> > >> > >> things matching your main query, no matter how that query parser works or
> > >> > >> what bugs it has.
> > >> > >>
> > >> > >> Example of using {!boost} to wrap an arbitrary other parser...
> > >> > >>
> > >> > >> instead of...
> > >> > >> defType=foofoo
> > >> > >> q=barbarbar
> > >> > >>
> > >> > >> use...
> > >> > >> q={!boost b=$func defType=foofoo v=$qq}
> > >> > >> qq=barbarbar
> > >> > >> func=sum(something,somethingelse)
> > >> > >>
> > >> > >> https://cwiki.apache.org/confluence/display/solr/Other+Parsers
> > >> > >> https://cwiki.apache.org/confluence/display/solr/Function+Queries
> > >> > >>
> > >> > >> :
> > >> > >> : What I would like is to return zero results if there is no match for the
> > >> > >> : querystring. My collection is small enough that I don't care if the actual
> > >> > >> : calculation runs on each doc (although that's wasteful) -- I just don't
> > >> > >> : want to see results come back for zero matches to the querystring
> > >> > >> :
> > >> > >> : (The /select endpoint does this of course, but my custom endpoint includes
> > >> > >> : this "weighting" piece and therefore returns every doc in the corpus
> > >> > >> : because they all have the weighting.)
> > >> > >> :
> > >> > >> : ====================
> > >> > >> : Enter my imagined solution... The potential X-Y problem...
> > >> > >> : ====================
> > >> > >> :
> > >> > >> : So - given that I come from a programming background, I immediately start
> > >> > >> : thinking of an if statement ...
> > >> > >> :
> > >> > >> : if(some_score_for_the_primary_search_string) {
> > >> > >> :   run_the_category_weight_calculation;
> > >> > >> : } else {
> > >> > >> :   do_NOT_run_category_weight_calc;
> > >> > >> : }
> > >> > >> :
> > >> > >> :
> > >> > >> : Another way of thinking of it would be something like the "WHERE" clause in
> > >> > >> : SQL...
> > >> > >> :
> > >> > >> : run_category_weight_calculation WHERE "searchstring" is found in the
> > >> > >> : document, not otherwise.
> > >> > >> :
> > >> > >> : I'm aware that things could be handled in the client-side of my web app,
> > >> > >> : but if possible, I'd like the interface to SOLR to be as clean as possible,
> > >> > >> : and massage incoming SOLR data as little as possible.
> > >> > >> :
> > >> > >> : In other words, do NOT return any docs if the querystring (and any
> > >> > >> : synonyms) match zero docs.
> > >> > >> :
> > >> > >> : Here is the endpoint XML for the query. I've highlighted the specific line
> > >> > >> : that is causing the unintended results...
> > >> > >> :
> > >> > >> :
> > >> > >> : <requestHandler name="/foo" class="solr.SearchHandler">
> > >> > >> :   <!-- default values for query parameters can be specified, these
> > >> > >> :        will be overridden by parameters in the request
> > >> > >> :   -->
> > >> > >> :   <lst name="defaults">
> > >> > >> :     <str name="echoParams">all</str>
> > >> > >> :     <int name="rows">20</int>
> > >> > >> :     <!-- Query settings -->
> > >> > >> :     <str name="df">text</str>
> > >> > >> :     <!-- <str name="df">title</str> -->
> > >> > >> :     <str name="defType">synonym_edismax</str>>
> > >> > >> :     <str name="synonyms">true</str>
> > >> > >> :     <!-- The line below balances out the weighting of exact matches to the
> > >> > >> :          synonym phrase entered by the user
> > >> > >> :          with the category_weight calculation and the titleQuery calc.
> > >> > >> :          These numbers exist in a balance and
> > >> > >> :          if one is raised or lowered, the others (probably) need to change
> > >> > >> :          as well. It may be better to go with decimals
> > >> > >> :          for all of them... .4 instead of 4 and 2 instead of 20 and 2.5
> > >> > >> :          instead of 25.
> > >> > >> :          In the end, I'm not sure it really matters, but don't change one
> > >> > >> :          without changing the others
> > >> > >> :          unless you've tested and are sure you want the results -->
> > >> > >> :     <float name="synonyms.originalBoost">1.5</float>
> > >> > >> :     <float name="synonyms.synonymBoost">1.1</float>
> > >> > >> :     <str name="mm">75%</str>
> > >> > >> :     <str name="q.alt">*:*</str>
> > >> > >> :     <str name="rows">20</str>
> > >> > >> :     <str name="fq">meta_doc_type:chapterDoc</str>
> > >> > >> :     <str name="bq">{!synonym_edismax qf='title' synonyms='true'
> > >> > >> : synonyms.originalBoost='2.5' synonyms.synonymBoost='1.1' bf='' bq=''
> > >> > >> : v=$q}</str>
> > >> > >> :     <str name="fl">id category_weight title category_ss score
> > >> > >> : contentType</str>
> > >> > >> :     <str name="titleQuery">{!edismax qf='title' bf='' bq='' v=$q}</str>
> > >> > >> : =====================================================
> > >> > >> :     *<str name="bf">product(field(category_weight),20)</str>*
> > >> > >> : =====================================================
> > >> > >> :     <str name="bf">product(query($titleQuery),4)</str>
> > >> > >> :     <str name="qf">text contentType^1000</str>
> > >> > >> :     <str name="wt">python</str>
> > >> > >> :     <str name="debug">true</str>
> > >> > >> :     <str name="debug.explain.structured">true</str>
> > >> > >> :     <str name="indent">true</str>
> > >> > >> :     <str name="echoParams">all</str>
> > >> > >> :   </lst>
> > >> > >> : </requestHandler>
> > >> > >> :
> > >> > >> : And here is the debug output for a query. (This was a test for synonyms,
> > >> > >> : which you'll see in the output.)
> > >> > >> : The original query string was, of course, "μ-heavy chain disease"
> > >> > >> :
> > >> > >> : You'll note that although there is no score in the first doc explain for
> > >> > >> : the actual querystring, the highlighted section does get a score for
> > >> > >> : product(double(category_weight)=1.5,const(20))
> > >> > >> :
> > >> > >> : ... which is the thing that is currently causing all the docs in the
> > >> > >> : collection to "match" even though the querystring is not in any of them.
> > >> > >> :
> > >> > >> : "debug":{
> > >> > >> : "rawquerystring":"\"μ-heavy chain disease\"",
> > >> > >> : "querystring":"\"μ-heavy chain disease\"",
> > >> > >> : "parsedquery":"(DisjunctionMaxQuery((text:\"μ heavy chain disease\" | (contentType:\"μ heavy chain disease\")^1000.0))^1.5
> > >> > >> : ((+DisjunctionMaxQuery((text:\"mu heavy chain disease\" | (contentType:\"mu heavy chain disease\")^1000.0)))/no_coord^1.1)
> > >> > >> : ((+DisjunctionMaxQuery((text:\"μ hcd\" | (contentType:\"μ hcd\")^1000.0)))/no_coord^1.1)
> > >> > >> : ((+DisjunctionMaxQuery((text:\"μ heavy chain disease\" | (contentType:\"μ heavy chain disease\")^1000.0)))/no_coord^1.1)
> > >> > >> : ((+DisjunctionMaxQuery((text:\"μ hcd\" | (contentType:\"μ hcd\")^1000.0)))/no_coord^1.1))
> > >> > >> : ((DisjunctionMaxQuery((title:\"μ heavy chain disease\"))^2.5
> > >> > >> : ((+DisjunctionMaxQuery((title:\"mu heavy chain disease\")))/no_coord^1.1)
> > >> > >> : ((+DisjunctionMaxQuery((title:\"μ hcd\")))/no_coord^1.1)
> > >> > >> : ((+DisjunctionMaxQuery((title:\"μ heavy chain disease\")))/no_coord^1.1)
> > >> > >> : ((+DisjunctionMaxQuery((title:\"μ hcd\")))/no_coord^1.1)))
> > >> > >> : FunctionQuery(product(double(category_weight),const(20)))
> > >> > >> : FunctionQuery(product(query(+(title:\"μ heavy chain disease\"),def=0.0),const(4)))",
> > >> > >> : "parsedquery_toString":"(((text:\"μ heavy chain disease\" | (contentType:\"μ heavy chain disease\")^1000.0))^1.5
> > >> > >> : ((+(text:\"mu heavy chain disease\" | (contentType:\"mu heavy chain disease\")^1000.0))^1.1)
> > >> > >> : ((+(text:\"μ hcd\" | (contentType:\"μ hcd\")^1000.0))^1.1)
> > >> > >> : ((+(text:\"μ heavy chain disease\" | (contentType:\"μ heavy chain disease\")^1000.0))^1.1)
> > >> > >> : ((+(text:\"μ hcd\" | (contentType:\"μ hcd\")^1000.0))^1.1))
> > >> > >> : ((((title:\"μ heavy chain disease\"))^2.5
> > >> > >> : ((+(title:\"mu heavy chain disease\"))^1.1)
> > >> > >> : ((+(title:\"μ hcd\"))^1.1)
> > >> > >> : ((+(title:\"μ heavy chain disease\"))^1.1)
> > >> > >> : ((+(title:\"μ hcd\"))^1.1)))
> > >> > >> : product(double(category_weight),const(20))
> > >> > >> : product(query(+(title:\"μ heavy chain disease\"),def=0.0),const(4))",
> > >> > >> : "explain":{
> > >> > >> : "33d808fe-6ccf-4305-a643-48e94de34d18":{ "match":true, "value":30.0,
> > >> > >> : "description":"sum of:", "details":[{ "match":true, "value":30.0,
> > >> > >> : "description":"FunctionQuery(product(double(category_weight),const(20))), product of:",
> > >> > >> : =====================================================
> > >> > >> : *"details":**[{ "match":true, "value":30.0,
> > >> > >> : "description":"product(double(category_weight)=1.5,const(20))"}, {*
> > >> > >> : =====================================================
> > >> > >> :
> > >> > >> : "match":true, "value":1.0, "description":"boost"}, { "match":true, "value":1.0,
> > >> > >> : "description":"queryNorm"}]}, {
> > >> > >> :
> > >> > >>
> > >> > >> -Hoss
> > >> > >> http://www.lucidworks.com/
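
For anyone finding this thread later: the {!boost} wrapping Hoss describes above can be sketched as a set of request parameters built in Python. This is only a sketch under assumptions. The helper name boost_wrapped_params, the collection name "mycollection", and the choice to reuse the category_weight function as the multiplicative boost are illustrative and not confirmed anywhere in the thread; the q/qq/func parameter layout and the synonym_edismax parser name are taken from the messages above.

```python
from urllib.parse import urlencode, parse_qs


def boost_wrapped_params(user_query, boost_func, parser="synonym_edismax"):
    """Build params that wrap `parser` in {!boost}, following the thread's pattern.

    Hypothetical helper: the name and defaults are illustrative only.
    """
    return {
        # The boost function and the user's query are passed by reference
        # ($func, $qq) so neither needs escaping inside the local params.
        "q": "{!boost b=$func defType=%s v=$qq}" % parser,
        "qq": user_query,
        "func": boost_func,
        # Ask Solr for the parsed query so the structure can be inspected.
        "debug": "query",
    }


params = boost_wrapped_params(
    '"μ-heavy chain disease"',
    "product(field(category_weight),20)",
)
query_string = urlencode(params)

# "mycollection" is an assumed collection name; /select is used so that
# handler defaults such as bf and bq from a custom endpoint do not apply.
print("/solr/mycollection/select?" + query_string)
```

Because the boost is multiplicative, a document that does not match the wrapped query gets no score from the function at all, which is the behaviour asked for in the original question. As noted above, the constant (20 here) will likely need re-tuning once the boost is no longer additive.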