Re: Join query fails on coordinator node even when collections are co-located

Mikhail Khludnev Wed, 04 Mar 2026 23:23:36 -0800

The coordinator should fan out per-shard requests, and that is what happens in
https://github.com/apache/solr/pull/4186
but not in https://github.com/apache/solr/pull/4184. I now barely know
how #4184 works; it probably forwards to a data node.
With regard to https://github.com/apache/solr/pull/4186,
the stack trace of the failure is:
org.apache.solr.common.SolrException: SolrCloud join: To join with a collection that might not be co-located, use method=crossCollection.
    at org.apache.solr.search.join.ScoreJoinQParserPlugin.getLocalSingleShard(ScoreJoinQParserPlugin.java:523)
    at org.apache.solr.search.join.ScoreJoinQParserPlugin.findLocalReplicaForFromIndex(ScoreJoinQParserPlugin.java:391)
    at org.apache.solr.search.join.ScoreJoinQParserPlugin.getCoreName(ScoreJoinQParserPlugin.java:346)
    at org.apache.solr.search.join.ScoreJoinQParserPlugin$1.createQuery(ScoreJoinQParserPlugin.java:277)
    at org.apache.solr.search.join.ScoreJoinQParserPlugin$1.parse(ScoreJoinQParserPlugin.java:253)
    at org.apache.solr.search.JoinQParserPlugin$1.parse(JoinQParserPlugin.java:227)
    at org.apache.solr.search.QParser.getQuery(QParser.java:196)
    at org.apache.solr.handler.component.QueryComponent.prepare(QueryComponent.java:191)
    at org.apache.solr.handler.component.SearchHandler.prepareComponents(SearchHandler.java:427)
    at org.apache.solr.handler.component.SearchHandler.processComponents(SearchHandler.java:406)
    at org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:239)
    at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:260)
    at org.apache.solr.core.SolrCore.execute(SolrCore.java:2953)
    at org.apache.solr.servlet.HttpSolrCall.executeCoreRequest(HttpSolrCall.java:719)
    at org.apache.solr.servlet.HttpSolrCall.call(HttpSolrCall.java:484)
    at org.apache.solr.servlet.SolrDispatchFilter.dispatch(SolrDispatchFilter.java:183)


This failure occurs on a real, data-less coordinator node. It is caused by an
awkward flow in which QueryComponent triggers query parsing even though the
parsed Lucene query will be thrown away, because the request then steps into
distributed processing (fanning out per-shard requests). It would be great to
redesign this old flaw. The other part of the trouble is that JoinQP is too
eager: it checks indices at parse time even though the query will be thrown
away on a coordinator node.
Meanwhile, I'll think about quickly hacking JoinQP to make it lazy,
deferring query creation.
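
The lazy deferral mentioned above could look roughly like the following
minimal sketch. It is not Solr code: SketchQuery, LazyQuery, LazyJoinSketch,
buildJoin and parse are hypothetical names invented for illustration, and a
real Lucene Query wrapper would presumably also have to deal with rewriting,
equals and hashCode. The only point is that parsing returns a cheap
placeholder, and the co-location check runs when the query is first used,
which never happens on a coordinator that discards the parsed query.

```java
import java.util.function.Supplier;

/** Hypothetical stand-in for a parsed query; not a Lucene or Solr class. */
interface SketchQuery {
    String describe();
}

/** Defers a failure-prone query construction until the query is first used. */
final class LazyQuery implements SketchQuery {
    private final Supplier<SketchQuery> factory;
    private SketchQuery delegate; // built on first use, then cached

    LazyQuery(Supplier<SketchQuery> factory) {
        this.factory = factory;
    }

    private SketchQuery resolve() {
        if (delegate == null) {
            delegate = factory.get();
        }
        return delegate;
    }

    @Override
    public String describe() {
        return resolve().describe();
    }
}

final class LazyJoinSketch {
    /** Simulates the eager check that fails when no local "from" replica exists. */
    static SketchQuery buildJoin(boolean fromIndexIsLocal) {
        if (!fromIndexIsLocal) {
            throw new IllegalStateException("from collection is not co-located on this node");
        }
        return () -> "join(from=id, to=to_s, fromIndex=fromData)";
    }

    /** "Parse" step: returns a placeholder without touching any index. */
    static SketchQuery parse(boolean fromIndexIsLocal) {
        return new LazyQuery(() -> buildJoin(fromIndexIsLocal));
    }

    public static void main(String[] args) {
        // Coordinator: parses, then discards the query and fans out. No exception,
        // because the placeholder is never resolved.
        SketchQuery onCoordinator = parse(false);

        // Data node: the query is actually used, so the check runs there.
        SketchQuery onDataNode = parse(true);
        System.out.println(onDataNode.describe());
    }
}
```

With this shape, an eager parser would fail inside parse(false), while the
lazy one fails only if describe() is called on a node that lacks the "from"
replica.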

On Wed, Mar 4, 2026 at 4:58 PM Gus Heck <[email protected]> wrote:

> That begins to sound like it should have a JIRA. A coordinator node should
> probably be forwarding the request without any sort of interference.
>
> On Wed, Mar 4, 2026 at 7:05 AM Endika Posadas <[email protected]>
> wrote:
>
> > https://github.com/apache/solr/pull/4186 There seems to be a difference. I
> > have modified the tests by creating a dedicated coordinator node and then
> > they fail when I target the coordinator but succeed when I target the data
> > nodes. I'll continue in github.
> >
> > Thanks
> >
> > On Tue, 3 Mar 2026 at 22:11, Mikhail Khludnev <[email protected]> wrote:
> >
> > > I tried to reproduce join on the coord node, and test passed
> > > https://github.com/apache/solr/pull/4184/changes
> > > I propose to double check the cluster setup, and usage of the coord
> node
> > >
> > >
> >
> https://solr.apache.org/guide/solr/latest/deployment-guide/node-roles.html#the-work-flow-in-a-coordinator-node
> > > Once again the exception above might only occur in the data node with
> > > "to"-side where query parser is actually executed.
> > >
> > > On Tue, Mar 3, 2026 at 8:00 PM Endika Posadas <[email protected]>
> > > wrote:
> > >
> > > > Sorry, I'll add more context. The main collection is a sharded
> > collection
> > > > with over ten shards and where each shard has 2 replicas. The from
> > > > collection (fromData) has a single shard and one replica in each of
> the
> > > > solr nodes.
> > > > The query I send is a Json Query, looking like:
> > > >
> > > > {
> > > >   "filter":[{"join":{
> > > >         "query":{"lucene":{
> > > >             "query":"\"test\"",
> > > >             "df":"value_s"}},
> > > >         "from":"id",
> > > >         "to":"to_s",
> > > >         "fromIndex":"fromData"}},
> > > >     ],
> > > >   "offset":0,
> > > >   "query":"*:*",
> > > >   "limit":1,
> > > >   "params":{
> > > >     "TZ":"GMT+01:00",
> > > >     "timeAllowed":1800000},
> > > >   "fields":["id"]
> > > > }
> > > >
> > > > It works perfectly fine when sending it to any random solr node, but
> it
> > > > fails when it gets sent from the coordinator query. Every other query
> > > that
> > > > doesn't have a join works fine, or at least I haven't found any other
> > > > problems.
> > > >
> > > > Thanks
> > > >
> > > > On Tue, 3 Mar 2026 at 17:38, Mikhail Khludnev <[email protected]>
> wrote:
> > > >
> > > > > Hello,
> > > > > I'm in doubt. Assuming you use
> > > > >
> > > > >
> > > >
> > >
> >
> https://solr.apache.org/guide/solr/latest/query-guide/join-query-parser.html#joining-multiple-shard-collections
> > > > > Please confirm.
> > > > > There's no exact coordinator test for shard joins here
> > > > >
> > > > >
> > > >
> > >
> >
> https://github.com/apache/solr/blob/main/solr/core/src/test/org/apache/solr/search/join/ShardToShardJoinAbstract.java#L58
> > > > > But it creates 5 nodes for 3 shard collections, and I believe pick
> a
> > > > > coordinator randomly. So, we may expect it's working.
> > > > > Then, the error you provide might occur at "to"-node when it didn't
> > > find
> > > > > expected co-shard.
> > > > > I'm afraid we need to check shard alignment across cluster, and
> > > detailed
> > > > > request log across nodes. what exactly happened at coordinator and
> > > > > subordinate nodes.
> > > > > Regarding shards allocation: even if there's a node with a shard1
> of
> > > "to"
> > > > > collection collocated with "from" shard1, nothing will stop the
> > > > coordinator
> > > > > from attempting to search "to" shard1 at another node where "from"
> > > shard1
> > > > > is absent, and got the error like this.
> > > > >
> > > > > On Tue, Mar 3, 2026 at 6:02 PM Endika Posadas <
> [email protected]>
> > > > > wrote:
> > > > >
> > > > > > Hi,
> > > > > >
> > > > > > We're running dedicated coordinator nodes for query performance,
> > with
> > > > > > collections that are properly co-located across data nodes.
> > > > > >
> > > > > >
> > > > > > When sending a join query (fromIndex pointing to a co-located
> > > > collection)
> > > > > > through the coordinator, we get an error:
> > > > > >
> > > > > > "error":{
> > > > > >     "metadata":["error-class","org.apache.solr.common.SolrException","root-error-class","org.apache.solr.common.SolrException"],
> > > > > >     "msg":"SolrCloud join: To join with a collection that might not be co-located, use method=crossCollection.",
> > > > > >     "code":400
> > > > > >   }
> > > > > >
> > > > > >
> > > > > > The same query works fine when sent directly to a data node.
> > > > > >
> > > > > > It seems like the coordinator is trying to resolve the join
> instead
> > > of
> > > > > > delegating it to the data nodes. Is there a workaround around
> this?
> > > > > >
> > > > > > Thanks
> > > > > >
> > > > >
> > > > >
> > > > > --
> > > > > Sincerely yours
> > > > > Mikhail Khludnev
> > > > >
> > > >
> > >
> > >
> > > --
> > > Sincerely yours
> > > Mikhail Khludnev
> > >
> >
>
>
> --
> http://www.needhamsoftware.com (work)
> https://a.co/d/b2sZLD9 (my fantasy fiction book)
>


-- 
Sincerely yours
Mikhail Khludnev
