Hello, James.

Excuse me if I didn't fully grasp every point of your inquiry.
As I understand the problem: one cannot filter/select certain parent
types with the `which` param, because block join is a plain nextSetBit()
over dense ordinals.
So the parents bitset should include all parents (a disjunction of all
parent types), and then a parent-level filter should select the desired
parent type:
q={!parent which=$dads}chld_name:ABC&dads=doc_type:(t2 p1)&fq=doc_type:t2
This should be explained somewhere around
https://solr.apache.org/guide/8_8/other-parsers.html#block-mask
Please let me know if we can add some more caveats there to cover your case.
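
To illustrate why a partial parent mask picks the wrong parent, here is a
minimal Python sketch. It is a simplified model of the idea, not Lucene's
actual implementation. The document order and the `to_parent` helper are
assumptions made for illustration; the order is chosen to agree with
"range from 0 to 3" in your debug output.

```python
# Simplified model of a to-parent block join (NOT Lucene's real code).
# Assumed segment order: children precede their parent within each block.
DOCS = [
    {"doc_id": "/p1/chld/1/", "doc_type": "chld", "chld_name": "ABC"},  # 0
    {"doc_id": "/p1/chld/2/", "doc_type": "chld", "chld_name": "DEF"},  # 1
    {"doc_id": "/p1/1/", "doc_type": "p1"},                             # 2
    {"doc_id": "/t2/chld/1/", "doc_type": "chld", "chld_name": "DEF"},  # 3
    {"doc_id": "/t2/1/", "doc_type": "t2"},                             # 4
]


def to_parent(which_types, child_name, parent_filter=None):
    """Hypothetical helper: map matching children to the next masked parent."""
    # The "which" mask: ordinals of every doc the join treats as a parent.
    parent_bits = sorted(
        i for i, d in enumerate(DOCS) if d["doc_type"] in which_types
    )
    hits = []
    for i, doc in enumerate(DOCS):
        if doc.get("chld_name") != child_name:
            continue
        # Equivalent of nextSetBit(i): first masked parent at or after the child.
        nxt = next((p for p in parent_bits if p >= i), None)
        if nxt is None:
            continue
        parent = DOCS[nxt]
        # Optional parent-level filter, playing the role of fq=doc_type:...
        if parent_filter is not None and parent["doc_type"] != parent_filter:
            continue
        if parent["doc_id"] not in hits:
            hits.append(parent["doc_id"])
    return hits


partial_mask = to_parent({"t2"}, "ABC")
full_mask = to_parent({"t2", "p1"}, "ABC")
full_mask_filtered = to_parent({"t2", "p1"}, "ABC", parent_filter="t2")
```

In this model the partial mask skips over the p1 parent and returns
["/t2/1/"], the full mask returns ["/p1/1/"], and the full mask combined
with the parent-level filter returns nothing, which is the result you
expected.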

Have a good join!

On Thu, Apr 28, 2022 at 5:43 PM James Greene <ja...@jamesaustingreene.com>
wrote:

> My team is in the process of moving from Solr 6.6 to 8.11.1 and has
> noticed some weirdness (wrong parent docs in the result) when using the
> {!parent} block join query parser.  We have multiple 'root' entities
> configured in DIH and I'm wondering if this could be the cause or if
> there is a bug at play with the block join.  Any more info on how to
> diagnose the issue is appreciated!
>
> -----------------------------------
> Example data:
>
> [
>     {
>         "_root_": "/t2/1/",
>         "doc_id": "/t2/1/",
>         "doc_type": "t2",
>         "t2_id":1,
>         "chldrn": [
>             {
>                 "_root_": "/t2/1/",
>                 "_nest_path_": "/chldrn#1",
>                 "doc_id": "/t2/chld/1/",
>                 "doc_type": "chld",
>                 "chld_name": "DEF",
>                 "chld_t2_id":1
>             }
>         ]
>     },
>     {
>         "_root_": "/p1/1/",
>         "doc_id": "/p1/1/",
>         "doc_type": "p1",
>         "p1_id":1,
>         "chldrn": [
>             {
>                 "_root_": "/p1/1/",
>                 "_nest_path_": "/chldrn#1",
>                 "doc_id": "/p1/chld/1/",
>                 "doc_type": "chld",
>                 "chld_name": "ABC",
>                 "chld_p1_id":1
>             },
>             {
>                 "_root_": "/p1/1/",
>                 "_nest_path_": "/chldrn#2",
>                 "doc_id": "/p1/chld/2/",
>                 "doc_type": "chld",
>                 "chld_name": "DEF",
>                 "chld_p1_id": 1
>             }
>         ]
>     }
> ]
>
>
> -----------------------------------
> Queries giving the wrong result:
>
> q={!parent which=doc_type:t2}chld_name:ABC
>
> q={!parent which=doc_type:t2}(doc_type:chld AND chld_name:ABC)
>
> q={!parent which=doc_type:t2 v=$qq}chld_name:ABC
> ?qq=doc_type:chld
>
>
> -----------------------------------
> I found an old thread saying that child docs shouldn't have the same
> field name as the parent doc (even with different values) here:
>
> https://stackoverflow.com/questions/36602638/solr-returning-incorrect-results-when-filtering-child-docuements
> But I got the same results when trying to filter by children using a
> different field:
>
> q={!parent which=doc_type:t2}(_nest_path_:/chldrn AND chld_name:ABC)
>
> I would expect no match, since the parent (doc_type:t2) does not have
> a child (chld_name:ABC), but I'm actually getting t2 in the result:
> [
>     {
>         "_root_": "/t2/1/",
>         "doc_id": "/t2/1/",
>         "doc_type": "t2",
>         "t2_id":1,
>         "chldrn": [
>             {
>                 "_root_": "/t2/1/",
>                 "_nest_path_": "/chldrn#1",
>                 "doc_id": "/t2/chld/1/",
>                 "doc_type": "chld",
>                 "chld_name": "DEF",
>                 "chld_t2_id":1
>             }
>         ]
>     }
> ]
>
> -----------------------------------
> Debug for query returning the wrong document when 0 docs are expected:
>
> "debug":{
>     "rawquerystring":"{!parent which=doc_type:t2}chld_name:ABC",
>     "querystring":"{!parent which=doc_type:t2}chld_name:ABC",
>     "parsedquery":"AllParentsAware(ToParentBlockJoinQuery
> (+chld_name:abc))",
>     "parsedquery_toString":"ToParentBlockJoinQuery (+chld_name:abc)",
>     "explain":{
>       "/t2/1/":"\n0.0 = Score based on 1 child docs in range from 0 to 3,
> best match:\n  0.0 = ConstantScore(chld_name:abc)^0.0\n"},
>     "QParser":"BlockJoinParentQParser",
>     ...
> }
>
>
> -----------------------------------
> If I query using a different parent doc_type (doc_type:p1) and child name
> (chld_name:DEF) I get the expected result (0 docs returned) using query:
>
> q={!parent which=doc_type:p1}chld_name:DEF
>
>
> -----------------------------------
> If I query using a different parent doc_type (doc_type:p1) and child name
> (chld_name:ABC) I get the expected result (1 doc returned) using query:
>
> q={!parent which=doc_type:p1}chld_name:ABC
>
> ^^Debug query of getting expected 1 doc back (docs in range is 2 to 3 but
> yet the original problematic query has 0 to 3 whatever that means):
> "debug":{
>     "rawquerystring":"{!parent which=doc_type:p1}chld_name:ABC",
>     "querystring":"{!parent which=doc_type:p1}chld_name:ABC",
>     "parsedquery":"AllParentsAware(ToParentBlockJoinQuery
> (+chld_name:abc))",
>     "parsedquery_toString":"ToParentBlockJoinQuery (+chld_name:abc)",
>     "explain":{
>       "/t2/1/":"\n0.0 = Score based on 2 child docs in range from 2 to 3,
> best match:\n  0.0 = ConstantScore(chld_name:abc)^0.0\n"},
>     "QParser":"BlockJoinParentQParser",
>     ...
> }
>
>
> -----------------------------------
> I have a 'work around' which seems to do the trick but it feels hacky and I
> wonder if having to qualify the child docs more will affect query
> performance. If I further qualify the child doc using a field that doesn't
> exist in the other child docs I get the expected (0 matches) result with
> query:
>
> q={!parent which=doc_type:t2}(chld_name:ABC AND chld_t2_id:*)
>
>
> -----------------------------------
> What's also interesting is that if I remove the child doc
> {"doc_id":"/p1/chld/1/","chld_name":"ABC"} of parent
> {"doc_id":"/p1/1/","doc_type":"p1"} out of the index so that my collection
> has:
>
> [
>     {
>         "_root_": "/t2/1/",
>         "doc_id": "/t2/1/",
>         "doc_type": "t2",
>         "t2_id":1,
>         "chldrn": [
>             {
>                 "_root_": "/t2/1/",
>                 "_nest_path_": "/chldrn#1",
>                 "doc_id": "/t2/chld/1/",
>                 "doc_type": "chld",
>                 "chld_name": "DEF",
>                 "chld_t2_id":1
>             }
>         ]
>     },
>     {
>         "_root_": "/p1/1/",
>         "doc_id": "/p1/1/",
>         "doc_type": "p1",
>         "p1_id":1,
>         "chldrn": [
>             {
>                 "_root_": "/p1/1/",
>                 "_nest_path_": "/chldrn#2",
>                 "doc_id": "/p1/chld/2/",
>                 "doc_type": "chld",
>                 "chld_name": "DEF",
>                 "chld_p1_id": 1
>             }
>         ]
>     }
> ]
>
> I get the expected results (no matches found) when I use the query:
>
> q={!parent which=doc_type:t2}chld_name:ABC
>
>
> -----------------------------------
> Other Notes:
>
> - I've blown away and recreated the index multiple times (always using DIH
> to re-import the data) which should rule out an anomaly with index
> linking/block merge.
> - Solrcloud mode is not being used.
> - I have <uniqueKey>doc_id</uniqueKey> in managed-schema and have no docs
> with duplicate doc_id in the index (sample config below).
> - I have _root_ as indexed only (changed it to stored=true for debugging
> but the issue remains).
> - We use the DIH (data import handler) to import the data (sample config
> below).
> - The 't2' doc_type appears as first entity in the DIH so I *think* it's the
> doc that gets indexed first during the DIH full import (may be relevant in
> identifying a bug with block join/indexing?).
>
>
> -----------------------------------
> Relevant entries in managed-schema:
>
> <uniqueKey>doc_id</uniqueKey>
> ...
> <fieldType name="nest_path" class="solr.NestPathField" stored="false" />
> <fieldType name="lowercase" class="solr.TextField"
> positionIncrementGap="100">
>     <analyzer>
>     <tokenizer class="solr.KeywordTokenizerFactory"/>
>     <filter class="solr.LengthFilterFactory" min="1" max="32766"/>
>     <filter class="solr.LowerCaseFilterFactory"/>
>     </analyzer>
> </fieldType>
> <fieldType name="plong" class="solr.LongPointField" docValues="true"
> stored="false"/>
> <fieldType name="string" class="solr.StrField" sortMissingLast="true"
> docValues="true" stored="false"/>
> ...
> <field name="_root_" type="string" docValues="false"/>
> <field name="_nest_path_" type="nest_path"/>
> <field name="_version_" type="plong" indexed="false"/>
> ...
> <field name="doc_id" type="string" stored="true" docValues="false"/>
> <field name="doc_type" type="string"/>
> <field name="chld_name" type="lowercase" stored="true" docValues="false"/>
> ...
> <dynamicField name="*_id" type="plong"/>
>
>
> -----------------------------------
> Relevant entries in data-config.xml:
>
> <?xml version="1.0"?>
> <dataConfig>
>     <dataSource name="mariadb" driver="org.mariadb.jdbc.Driver"
> batchSize="-1"
> url="jdbc:mysql://host:3306/db?sessionVariables=net_write_timeout=3600"
> user="" password="" />
>     <document>
>         <entity dataSource="mariadb" pk="id" name="t2"
>             deletedPkQuery="select concat('/t2/',`id`,'/') as id from `t2`
> where `deleted_at` &gt;= convert_tz('${dataimporter.last_index_time}',
> '+00:00', @@global.time_zone)"
>             query="select concat('/t2/',`id`,'/') as `doc_id`, 't2' as
> `doc_type`, `id` as `t2_id` where `deleted_at`is null"
>             deltaImportQuery="select concat('/t2/',`id`,'/') as `doc_id`,
> 't2' as `doc_type`, `id` as `t2_id` where `deleted_at` is null and `id` =
> '${dataimporter.delta.id}'"
>             deltaQuery="select `id` from `t2` where `updated_at` &gt;
> convert_tz('${dataimporter.last_index_time}', '+00:00',
> @@global.time_zone)">
>                 <entity name="chldrn" child="true" query="select
> concat('/t2/chld/',`id`,'/') as `doc_id`, 'chld' as `doc_type`,
> concat('/chldrn#',`id`) as `_nest_path_`, `name` as `chld_name`, `t2_id` as
> `chld_t2_id` where `t2_id` = ${t2.t2_id} and `deleted_at` is null" />
>         </entity>
>         <entity dataSource="mariadb" pk="id" name="p1"
>             deletedPkQuery="select concat('/p1/',`id`,'/') as `id` from
> `p1` where `deleted_at` &gt;= convert_tz('${dataimporter.last_index_time}',
> '+00:00', @@global.time_zone)"
>             query="select concat('/p1/',`id`,'/') as `doc_id`, 'p1' as
> `doc_type`, `id` as `p1_id` where `deleted_at`is null"
>             deltaImportQuery="select concat('/p1/',`id`,'/') as `doc_id`,
> 'p1' as `doc_type`, `id` as `p1_id` where `deleted_at` is null and `id` =
> '${dataimporter.delta.id}'"
>             deltaQuery="select `id` from `p1` where `updated_at` &gt;
> convert_tz('${dataimporter.last_index_time}', '+00:00',
> @@global.time_zone)">
>                 <entity name="chldrn" child="true" query="select
> concat('/p1/chld/',`id`,'/') as `doc_id`, 'chld' as `doc_type`,
> concat('/chldrn#',`id`) as `_nest_path_`, `name` as `chld_name`, `p1_id` as
> `chld_p1_id` where `p1_id` = ${p1.p1_id} and `deleted_at` is null" />
>     </entity>
>     </document>
> </dataConfig>
>


-- 
Sincerely yours
Mikhail Khludnev
