RE: Combined Dismax and Block Join Scoring on nested documents

Mike Allen Wed, 23 Nov 2016 06:56:52 -0800

Will do once I've validated what I've done. As I'm a total Solr novice, on 
numerous occasions I've done stuff I thought was right but which was actually 
returning incorrect, yet difficult to notice, results.  In fact, I've just 
noticed I'm having issues with adding filter queries, so perhaps I still have 
it all wrong after all.  But if it helps anybody at all in the future, I THINK 
- large pinch of salt - the anatomy of a dismax and scored block join query in 
the Solr Admin Console (note you'll have to encode a GET request) is like:


q=+{!dismax v="search words" qf="name^0 searchtext^0"}
EXACTLY ONE WHITESPACE
+{!parent which=content_type:product score=min v=$bjv}
&bjv=
EXACTLY ONE +
(base_colour:(blue)^0 AND (in_stock:(true)^0)) 
{!func}list_price_gbp
&sort=score asc
&fl=*,doc:[subquery]
&doc.q={!terms f="productid" v=$row.id}
&doc.rows=1000
&doc.fq=(base_colour:(blue) AND (in_stock:(true)))

Where
"search words" are what you are searching for, 
the qf's are the query fields on the parent doc, i.e. which fields to search, 
the ^0 means Boost (i.e. multiply the match score by) 0, i.e. don't contribute 
to the final score because I purely want price, 
the WHITESPACE is super important, without it, rather than getting a subset of 
documents that match BOTH the dismax and the block join query, I was getting 
all documents that matched the dismax,
the "which=content_type:product" would vary for your own documents, it's 
basically a field which ONLY matches parent documents and never children, so 
here the field is content_type and its value is product, but it's entirely 
arbitrary, it could be "which=docscope:company", etc.,
score=min means score will be the minimum value, on a per Parent basis, of a 
Function Query on the child documents that match the "v" query,
the "v" is a syntax thing, I presume means value, whereby in a lot of stuff 
where you can write {!....}SomeValOrOther, you can instead write 
{!..v="SomeValOrOther"}, so, for a made-up example, {!terms 
f="device_type"}ipod can be rewritten {!terms f="device_type" v="ipod"}, it's 
the same thing, but crucially, it avoids parser errors in this block join 
example [thanks to Mikhail for that], I wasted hours trying to figure out how 
+{!parent which=content_type:product score=min}(base_colour:(blue) AND 
(in_stock:(true))) could work, and it never did when combined with dismax,
the $bjv is an arbitrary parameter, you don't have to use it, you can have the 
"v" inline as in the above, but again, it helps ease parsing, so if you are 
going to use a parameter, the syntax is that they always start with a $, so 
here $bjv, but it could be $fred or $jane, the point being that you "fill out" 
the parameter with a query string parameter of the same name minus the $, i.e. 
&bjv=+(base_colour:(blue)^=0 AND (in_stock:(true)^=0)) {!func}list_price_gbp, 
BUT IT MUST START WITH A + as far as I can tell in this particular instance,
and talking of the bjv, it is constructed from a Solr query which ONLY 
matches child documents, in this case "in stock" and "blue", which will 
determine which parents to return, again the ^0 cancels the contribution to the 
scoring, although I don't know the difference, if there is one, between ^0 and 
^=0,
then comes the Function Query, which in this case is just the child field I 
wish to score on, {!func}list_price_gbp,
then sort=score asc refers to the in-built "score" value, which in this case, 
because we've cancelled all other inputs, will be equal to list_price_gbp,
then comes the fl, fields list to return, parameter, here I'm being lazy and 
using * - but not in the real code - plus telling the parser I'm making a 
subquery called "doc", hence doc:[subquery], but you can call it anything you 
like, say mysubquery:[subquery].
The subquery executes for every input row, and here the input rows are the 
PARENT docs that matched the CHILD queries in the $bjv parameter, so only 
"product" docs that contain at least one variant of colour "blue" that is in 
stock will execute a subquery,
the subquery is a whole solr query in its own right, you just need to qualify 
each aspect of it with the name you chose, here "doc", but "mysubquery" is 
equally valid if that's what you called it in the fl parameter,
so  doc.q={!terms f="productid" v=$row.id} is how I'm joining to the children I 
want, in this case productid is a field on each child doc which matches its 
parent doc id field, note that the special $row is used, essentially solr 
injects every field of each parent row into the query parser prepended by 
"$row.", so $row.id is the parent id value, $row.name would be the parent doc 
name field value, etc. Don't put quotes around the value though, v=$row.id 
works, v="$row.id" does not as that would look for a literal value "$row.id",
doc.rows=1000 controls the maximum children returned, 1000 for me is way more 
than I know will ever exist for this particular client,
finally doc.fq=(base_colour:(blue) AND (in_stock:(true))) filters the children 
to only those I really want, so only the "blue" in stock documents are returned.
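
For anyone assembling this from code rather than typing it into the Admin 
Console, here is a minimal Python sketch of the anatomy above. The function 
name build_params and its arguments are made up for illustration, and the 
field names (name, searchtext, productid, list_price_gbp, content_type) are 
the ones from this example and will differ for other schemas. It groups the 
child clauses slightly more simply than the hand-written version (no extra 
brackets around each clause), which should be equivalent, but treat that as 
an assumption. It does not escape double quotes inside the search words.

```python
def build_params(search_words, child_terms, sub="doc"):
    """Return the unencoded request parameters for the dismax + block join query.

    child_terms is a list of (field, value) pairs that only match child docs.
    sub is the arbitrary subquery name used in fl and as the parameter prefix.
    """
    # Child clauses with ^0 so they select children without adding to the score.
    scored = " AND ".join(f"{field}:({value})^0" for field, value in child_terms)
    # The same clauses without boosts, used to filter the subquery children.
    plain = " AND ".join(f"{field}:({value})" for field, value in child_terms)
    return {
        # Exactly one space between the two mandatory (+) clauses.
        "q": f'+{{!dismax v="{search_words}" qf="name^0 searchtext^0"}} '
             "+{!parent which=content_type:product score=min v=$bjv}",
        # Starts with "+", followed by the function query that supplies the score.
        "bjv": f"+({scored}) {{!func}}list_price_gbp",
        "sort": "score asc",
        "fl": f"*,{sub}:[subquery]",
        # Every subquery parameter is prefixed with the chosen subquery name.
        f"{sub}.q": '{!terms f="productid" v=$row.id}',
        f"{sub}.rows": "1000",
        f"{sub}.fq": f"({plain})",
    }


params = build_params("skirt", [("base_colour", "blue"), ("in_stock", "true")])
for name, value in params.items():
    print(f"{name}={value}")
```

The values are deliberately left unencoded here; they still need URL encoding 
before being sent as a GET request.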

So here's the raw string I'm putting in the "q" box in Solr Admin Console:

q=+{!dismax v="skirt" qf="name^0 searchtext^0"} +{!parent 
which=content_type:product score=min v=$bjv}&bjv=+(base_colour:(blue)^0 AND 
(in_stock:(true)^0)) {!func}list_price_gbp&sort=score 
asc&fl=*,doc:[subquery]&doc.q={!terms f="productid" 
v=$row.id}&doc.rows=1000&doc.fq=(base_colour:(blue) AND (in_stock:(true)))

By stepping into my Visual Studio code, the encoded request looks like this:

http://localhost:8983/solr/test_core/select?q=%2b%7b!dismax+v%3d%22skirt%22+qf%3d%22name%5e0+searchtext%5e0%22+%7d+%2b%7b!parent+which%3dcontent_type%3aproduct+score%3dmin+v%3d%24bjv%7d&bjv=%2b(base_colour%3a(blue)+AND+(in_stock%3a(true)))+%7b!func%7dlist_price_gbp&doc.q=%7b!terms+f%3d%22productid%22+v%3d%24row.id%7d&doc.rows=1000&doc.fq=(base_colour%3a(blue)+AND+(in_stock%3a(true)))&start=0&rows=103&fl=*%2cdoc%3a%5bsubquery%5d&sort=score+asc

So you'll notice the explicit "+"s have been encoded as %2B and spaces are "+". 
Correct encoding seems half the battle to be honest.
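
As a rough illustration of the encoding step, here is a small Python sketch 
using only the standard library. The helper name encode_query is made up; the 
host and core name are the ones from the request above. Python's urlencode 
escapes a few more characters than the request shown above (for example "!" 
and the brackets) and writes the hex digits in upper case, but both forms 
decode to the same parameter values.

```python
from urllib.parse import urlencode, parse_qs


def encode_query(params):
    """Percent-encode Solr parameters for a GET request."""
    # urlencode uses quote_plus by default: a literal "+" becomes %2B
    # and a space becomes "+", which is the distinction that matters here.
    return urlencode(params)


params = {
    "q": '+{!dismax v="skirt" qf="name^0 searchtext^0"} '
         "+{!parent which=content_type:product score=min v=$bjv}",
    "bjv": "+(base_colour:(blue)^0 AND (in_stock:(true)^0)) {!func}list_price_gbp",
    "sort": "score asc",
}

encoded = encode_query(params)
url = "http://localhost:8983/solr/test_core/select?" + encoded
print(url)

# Decoding the query string again should give back the original values.
decoded = parse_qs(encoded)
print(decoded["bjv"][0] == params["bjv"])
```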

So that's what I've got for now, but I wouldn't take it as gospel that it's 
working correctly. I'm still validating by hand checking the results I would 
expect versus the results I actually get. For instance I need to know for sure 
it's scoring on only matched variants, not all children of a parent - which 
would completely blow the whole thing out of the water. And as I said, I'm 
pretty sure I've yet to figure out applying a query filter to parent docs.
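
One way to do part of that hand checking in code is to compare each parent's 
score with the minimum list_price_gbp of the children returned by its 
subquery. The sketch below is only that, a sketch: the function names are 
made up, the sample data is invented, and it assumes the response was 
requested as JSON with score included in fl, and that each parent carries its 
subquery result under "doc" as an object with a "docs" list.

```python
def min_child_price(parent, sub="doc", price_field="list_price_gbp"):
    """Lowest price among the subquery children of one parent, or None."""
    children = parent.get(sub, {}).get("docs", [])
    prices = [child[price_field] for child in children if price_field in child]
    return min(prices) if prices else None


def mismatches(parents, tolerance=1e-3):
    """Ids of parents whose score is not the minimum price of their children."""
    bad = []
    for parent in parents:
        expected = min_child_price(parent)
        if expected is None or abs(parent["score"] - expected) > tolerance:
            bad.append(parent["id"])
    return bad


# Invented sample: the first parent is scored on its cheapest child,
# the second has a score that does not come from its child price.
sample = [
    {"id": "12345", "score": 40.0,
     "doc": {"numFound": 2, "start": 0,
             "docs": [{"list_price_gbp": 65.0}, {"list_price_gbp": 40.0}]}},
    {"id": "67890", "score": 2.76,
     "doc": {"numFound": 1, "start": 0,
             "docs": [{"list_price_gbp": 14.0}]}},
]
print(mismatches(sample))
```

Because the subquery children are filtered with the same child conditions as 
the block join, an empty result from mismatches would suggest, though not 
prove, that only matched variants are being scored.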

When I'm a bit less clueless about what I'm actually doing, I'll try and write 
it up properly somewhere.

Cheers all,

Mike

-----Original Message-----
From: Alexandre Rafalovitch [mailto:arafa...@gmail.com] 
Sent: 21 November 2016 21:59
To: solr-user
Subject: Re: Combined Dismax and Block Join Scoring on nested documents

You could do:
*) LinkedIn
*) Wiki
*) Write it up, give it to me and I'll stick it as a guest post on my blog 
(with attribution of your choice)
*) Write it up, give it to Lucidworks and they may (not sure about rules) 
stick it on their blog

Regards,
    Alex.
----
http://www.solr-start.com/ - Resources for Solr users, new and experienced


On 22 November 2016 at 02:36, Mike Allen 
<mike.al...@thecommercepartnership.com> wrote:
> Sure thing Alex. I don't actually do any personal blogging, but if there's a 
> suitable place - the Solr Wiki perhaps - you'd suggest I can write something 
> up I'd be more than happy to. What goes around comes around!
>
> -----Original Message-----
> From: Alexandre Rafalovitch [mailto:arafa...@gmail.com]
> Sent: 21 November 2016 13:01
> To: solr-user
> Subject: Re: Combined Dismax and Block Join Scoring on nested 
> documents
>
> A blog article about what you learned would be very welcome. These edge cases 
> are something other people could certainly learn from.
> Share the knowledge forward etc.
>
> Regards,
>    Alex.
> ----
> http://www.solr-start.com/ - Resources for Solr users, new and 
> experienced
>
>
> On 21 November 2016 at 23:57, Mike Allen 
> <mike.al...@thecommercepartnership.com> wrote:
>> Hi Mikhail,
>>
>> Thanks for your advice, it went a long way towards helping me get the right 
>> documents in the first place, especially parameterising the block join with 
>> an explicit v, as otherwise it was a nightmare of parser errors.  Not to 
>> mention I'm still figuring out the nuances of where I need a whitespace and 
>> where I don't! However, I spent a part of the weekend fiddling around with 
>> spaces and +'s and I believe I've got it working as I'd hoped.
>>
>> Again, many thanks,
>>
>> Mike
>>
>> -----Original Message-----
>> From: Mikhail Khludnev [mailto:m...@apache.org]
>> Sent: 18 November 2016 12:58
>> To: solr-user
>> Subject: Re: Combined Dismax and Block Join Scoring on nested 
>> documents
>>
>> Hello Mike,
>> Structured queries in Solr are way cumbersome.
>> Start from:
>> q=+{!dismax v="skirt" qf="name"} +{!parent which=content_type:product 
>> score=min v=$childq}&childq=+in_stock:true^=0 {!func}list_price_gbp&...
>>
>> Besides "explain", there is a parsed query entry in the debug output that's 
>> more useful for troubleshooting purposes.
>> Please also make sure that + is properly encoded as %2B so that it gets 
>> past the HTTP hurdle.
>>
>> On Fri, Nov 18, 2016 at 2:14 PM, Mike Allen < 
>> mike.al...@thecommercepartnership.com> wrote:
>>
>>> Apologies if I'm doing something incredibly stupid as I'm new to Solr.
>>> I am having an issue with scoring child documents in a block join 
>>> query when including a dismax query. I'm actually a little unclear 
>>> on whether or not that's a complete oxymoron, combining dismax and block 
>>> join.
>>>
>>>
>>>
>>> Problem statement: Given a set of Product documents - which contain 
>>> the product names and descriptions - which contain nested variant 
>>> documents (see below for abridged example) - which contain the 
>>> boolean stock status (in_stock) and the variant prices 
>>> (list_price_gbp) - I want to do a Dismax query of, say, "skirt" on 
>>> the product name (name) and sort the resulting product documents by 
>>> the minimum price (list_price_gbp) of their child variant documents. 
>>> Note that, although the abridged document doesn't show them, there 
>>> are a number of other arbitrary fields which may be used as filter 
>>> queries on the child documents, for example size or colour, which 
>>> will in effect change the "active" minimum price of a product. 
>>> Hence, denormalizing, or flattening, the documents is not really an 
>>> option I want to pursue.
>>>
>>>
>>>
>>> An abridged example document returned by the Solr Admin Query 
>>> console which I am querying:
>>>
>>>
>>>
>>> <doc>
>>>   <str name="id">12345</str>
>>>   <str name="content_type">product</str>
>>>   <str name="name">black flared skirt</str>
>>>   <float name="min_list_price_gbp">40.0</float>
>>>   <result name="doc" numFound="2" start="0">
>>>     <doc>
>>>       <str name="skuid">12345abcd</str>
>>>       <str name="productid">12345</str>
>>>       <str name="content_type">variant</str>
>>>       <float name="list_price_gbp">65.0</float>
>>>       <bool name="in_stock">true</bool>
>>>     </doc>
>>>     <doc>
>>>       <str name="skuid">12345fghi</str>
>>>       <str name="productid">12345</str>
>>>       <str name="content_type">variant</str>
>>>       <float name="list_price_gbp">40.0</float>
>>>       <bool name="in_stock">true</bool>
>>>     </doc>
>>>   </result>
>>> </doc>
>>>
>>>
>>>
>>> So I am familiar with the block join score mode; setting aside the 
>>> dismax aspect for now, this query, using the Function Query 
>>> {!func}list_price_gbp, with score ascending, returns documents 
>>> ordered correctly, with a £2.00 (cheapest) product first:
>>>
>>>
>>>
>>> q={!parent which=content_type:product score=min}+(in_stock:(true)){!func}list_price_gbp&doc.q={!terms f="productid" v=$row.id}&doc.rows=1000&doc.fl=score,*&doc.fq=(in_stock:(true))&start=0&rows=103&fl=score,*,doc:[subquery]&sort=score asc&debugQuery=on&wt=xml
>>>
>>>
>>>
>>> The "explain" for this is:
>>>
>>>
>>>
>>> 2.0000184 = Score based on 1 child docs in range from 26752 to 26752, best match:
>>>
>>>   2.0000184 = sum of:
>>>
>>>     1.8374416E-5 = weight(in_stock:T in 26752) [], result of:
>>>
>>>       1.8374416E-5 = score(doc=26752,freq=1.0 = termFreq=1.0
>>>
>>> ), product of:
>>>
>>>         1.8374416E-5 = idf(docFreq=27211, docCount=27211)
>>>
>>>         1.0 = tfNorm, computed from:
>>>
>>>           1.0 = termFreq=1.0
>>>
>>>           1.2 = parameter k1
>>>
>>>           0.0 = parameter b (norms omitted for field)
>>>
>>>     2.0 = FunctionQuery(float(list_price_gbp)), product of:
>>>
>>>       2.0 = float(list_price_gbp)=2.0
>>>
>>>       1.0 = boost
>>>
>>>       1.0 = queryNorm
>>>
>>>
>>>
>>> Even though this is doing what I want, I have a slight niggle that 
>>> the overall score is not just the result of the Function Query; 
>>> however, as all results get the same tiny fraction added, it doesn't matter.
>>>
>>>
>>>
>>> However, when I prepend my dismax query:
>>>
>>>
>>>
>>> q={!dismax v="skirt" qf="name"}+{!parent which=content_type:product score=min}+(in_stock:(true)){!func}list_price_gbp&doc.q={!terms f="productid" v=$row.id}&doc.rows=1000&doc.fl=score,*&doc.fq=(in_stock:(true))&start=0&rows=103&fl=score,*,doc:[subquery]&sort=score asc&debugQuery=on&wt=xml
>>>
>>>
>>>
>>> The scoring is only dependent on the dismax scoring, where the "explain" 
>>> for this is:
>>>
>>>
>>>
>>> 2.7600822 = sum of:
>>>
>>>   2.7600822 = weight(name:skirt in 13406) [], result of:
>>>
>>>     2.7600822 = score(doc=13406,freq=1.0 = termFreq=1.0
>>>
>>> ), product of:
>>>
>>>       3.5851278 = idf(docFreq=103, docCount=3731)
>>>
>>>       0.76987 = tfNorm, computed from:
>>>
>>>         1.0 = termFreq=1.0
>>>
>>>         1.2 = parameter k1
>>>
>>>         0.75 = parameter b
>>>
>>>         4.108818 = avgFieldLength
>>>
>>>         7.111111 = fieldLength
>>>
>>>
>>>
>>> So in actual fact, with score ascending, it is ordering the results 
>>> by least matching first and the nested document list_price_gbp is 
>>> irrelevant. I strongly suspect I am being totally dumb and that this 
>>> is expected behaviour for an obvious reason that escapes me, apart 
>>> from perhaps it's because the two scoring methods are just plainly 
>>> incompatible.
>>>
>>>
>>>
>>> I have additionally tried just doing a lucene query instead:
>>>
>>>
>>>
>>> q=+name:skirt +{!parent which=content_type:product score=min} (in_stock:(true)){!func}list_price_gbp&doc.q={!terms f="productid" v=$row.id}&doc.rows=1000&doc.fl=score,*&doc.fq=(in_stock:(true))&start=0&rows=103&fl=score,*,doc:[subquery]&sort=score asc&debugQuery=on&wt=xml
>>>
>>>
>>>
>>> The "explain" of this indicates it's scoring products, for which 
>>> list_price_gbp simply does not exist, as the Function Query always 
>>> returns zero.
>>>
>>>
>>>
>>> 4.6243963 = sum of:
>>>
>>>   3.624396 = weight(name:skirt in 18113) [], result of:
>>>
>>>     3.624396 = score(doc=18113,freq=1.0 = termFreq=1.0
>>>
>>> ), product of:
>>>
>>>       3.5851278 = idf(docFreq=103, docCount=3731)
>>>
>>>       1.0109531 = tfNorm, computed from:
>>>
>>>         1.0 = termFreq=1.0
>>>
>>>         1.2 = parameter k1
>>>
>>>         0.75 = parameter b
>>>
>>>         4.108818 = avgFieldLength
>>>
>>>         4.0 = fieldLength
>>>
>>>   1.0 = {!cache=false}ConstantScore(BitDocIdSetFilterWrapper(QueryBitSetProducer(content_type:product))), product of:
>>>
>>>     1.0 = boost
>>>
>>>     1.0 = queryNorm
>>>
>>>   0.0 = FunctionQuery(float(list_price_gbp)), product of:
>>>
>>>     0.0 = float(list_price_gbp)=0.0
>>>
>>>     1.0 = boost
>>>
>>>     1.0 = queryNorm
>>>
>>>
>>>
>>> Indeed, if I change the Function Query field to a product scoped 
>>> field, min_list_price_gbp, like so:
>>>
>>>
>>>
>>> q=+name:skirt +{!parent which=content_type:product score=min}+(in_stock:(true)){!func}min_list_price_gbp&doc.q={!terms f="productid" v=$row.id}&doc.rows=1000&doc.fl=score,*&doc.fq=(in_stock:(true))&start=0&rows=103&fl=score,*,doc:[subquery]&sort=score asc&debugQuery=on&wt=xml
>>>
>>>
>>>
>>> then the "explain" certainly does show the Function Query evaluating
>>>
>>>
>>>
>>> 18.624397 = sum of:
>>>
>>>   3.624396 = weight(name:skirt in 17890) [], result of:
>>>
>>>     3.624396 = score(doc=17890,freq=1.0 = termFreq=1.0
>>>
>>> ), product of:
>>>
>>>       3.5851278 = idf(docFreq=103, docCount=3731)
>>>
>>>       1.0109531 = tfNorm, computed from:
>>>
>>>         1.0 = termFreq=1.0
>>>
>>>         1.2 = parameter k1
>>>
>>>         0.75 = parameter b
>>>
>>>         4.108818 = avgFieldLength
>>>
>>>         4.0 = fieldLength
>>>
>>>   1.0 = {!cache=false}ConstantScore(BitDocIdSetFilterWrapper(QueryBitSetProducer(content_type:product))), product of:
>>>
>>>     1.0 = boost
>>>
>>>     1.0 = queryNorm
>>>
>>>   14.0 = FunctionQuery(float(min_list_price_gbp)), product of:
>>>
>>>     14.0 = float(min_list_price_gbp)=14.0
>>>
>>>     1.0 = boost
>>>
>>>     1.0 = queryNorm
>>>
>>>
>>>
>>> My grasp of the syntax is pretty flakey, so I would be immensely 
>>> grateful if someone could point out if I'm just doing something 
>>> incredibly dumb. In my head, I see what I am trying to do as
>>>
>>>
>>>
>>> (some dismax or lucene query on parent document [e.g. "skirt"])
>>>   => (get a subset of these parent docs based on a block join)
>>>     => (where the children match a bunch of arbitrary filter queries [e.g. "colour:red"])
>>>       => (then subquery the child docs that match the same filter queries [e.g. "colour:red"])
>>>         => (then score this subset of child documents)
>>>           => (and order by that score)
>>>
>>>
>>>
>>>
>>> Is this actually possible? I've been googling about this for a day 
>>> or so and can't quite find anything definitive. I'm going to maybe 
>>> try and dive into the Solr source code, but I'm a C# guy, not Java, 
>>> and I have no debuggable environment set up as I haven't needed one 
>>> yet, so that could prove pretty painful.
>>>
>>>
>>>
>>> Any help would be appreciated, even if it is just "can't be done", 
>>> as at least I could stop chasing my tail.
>>>
>>>
>>>
>>> Mike
>>>
>>
>>
>> --
>> Sincerely yours
>> Mikhail Khludnev
>>
>
