Stuart Bertram created SOLR-9550:
------------------------------------

             Summary: innerJoin can succeed with bad sorting
                 Key: SOLR-9550
                 URL: https://issues.apache.org/jira/browse/SOLR-9550
             Project: Solr
          Issue Type: Bug
      Security Level: Public (Default Security Level. Issues are Public)
          Components: SolrCloud
    Affects Versions: 6.1
         Environment: CentOS 6.8, OpenJDK 1.8
            Reporter: Stuart Bertram


The innerJoin streaming function requires both streams to be sorted on the 
keys used for the join. In some situations, a stream sorted on the wrong field 
is accepted anyway, and the query returns a successful but incorrect result.
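
To illustrate why the sort order matters, here is a minimal Python sketch of a 
generic sort-merge inner join. It is not Solr's implementation; the function 
name merge_inner_join and the sample tuples are hypothetical. It shows only the 
general hazard, not the exact symptom described below: when an input is not 
sorted on its join key, the join still completes without an error but returns 
the wrong rows.

```python
def merge_inner_join(left, right, left_key, right_key):
    """Generic sort-merge inner join (illustrative only, not Solr code).

    Assumes, without checking, that `left` is sorted ascending on
    `left_key` and `right` is sorted ascending on `right_key`.
    """
    out = []
    i = j = 0
    while i < len(left) and j < len(right):
        lk = left[i][left_key]
        rk = right[j][right_key]
        if lk < rk:
            i += 1          # left is behind, advance it
        elif lk > rk:
            j += 1          # right is behind, advance it
        else:
            # Emit every right row that shares this key value.
            k = j
            while k < len(right) and right[k][right_key] == lk:
                out.append({**left[i], **right[k]})
                k += 1
            i += 1
    return out


# Hypothetical sample data: "node" on the left, "ID" on the right.
users = [
    {"ID": 1, "Username": "alice"},
    {"ID": 2, "Username": "bob"},
    {"ID": 3, "Username": "carol"},
]
sorted_nodes = [{"node": 1}, {"node": 3}]
unsorted_nodes = [{"node": 3}, {"node": 1}]

good = merge_inner_join(sorted_nodes, users, "node", "ID")
bad = merge_inner_join(unsorted_nodes, users, "node", "ID")
```

With the correctly sorted input, both nodes find their user. With the unsorted 
input, the join silently loses the match for node 1, and nothing signals that 
the result is incomplete.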

Example:
 * Collection "UserPosts" has columns: ID, ByUserID
 * Collection "User" has columns: ID, Username, Registered, …
 * Streaming query {{gatherNodes(User, gatherNodes(UserPosts, walk="42 69->ID", 
gather="ByUserID"), walk="node->ID", gather="ID")}} returns the IDs of users 
who made posts 42 and 69, but we want the full user details
 * Streaming query {{innerJoin(sort(gatherNodes(User, gatherNodes(UserPosts, 
walk="42 69->ID", gather="ByUserID"), walk="node->ID", gather="ID"), by="ID 
asc"), search(User,qt="/export",q="*:*",fl="ID, Username, Registered, …", 
sort="ID asc"), on="node=ID")}} (Note that this uses {{sort(…, by="ID asc")}}, 
chosen because we're gathering the ID field. The correct call would be 
{{sort(…, by="node asc")}}, because gatherNodes returns each gathered ID in the 
"node" field of the tuple)

(Note: This example is simplified, so while there may be a better way to 
perform this specific query, the concept and the underlying issue remain)

Expected result: Solr throws a (useful) exception saying that the sort orders 
do not match the join (because the first stream is sorted by ID, but the join 
is *node*=ID), as it does when the sort() call is omitted.
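
The expected check can be sketched in Python as follows. This is an 
assumption about what a useful validation might look like, not Solr's actual 
code; check_join_sorts and its parameters are hypothetical, and the sketch 
handles only a single-field join condition of the form "left=right" or "field".

```python
def check_join_sorts(left_sort_field, right_sort_field, on):
    """Raise ValueError unless each stream is sorted on its join field.

    Hypothetical helper, not part of Solr. `on` is a single-field join
    condition such as "node=ID", or "ID" when both sides share a name.
    """
    if "=" in on:
        left_key, right_key = (part.strip() for part in on.split("=", 1))
    else:
        left_key = right_key = on.strip()

    if left_sort_field != left_key:
        raise ValueError(
            f"left stream is sorted by '{left_sort_field}' "
            f"but the join uses '{left_key}'"
        )
    if right_sort_field != right_key:
        raise ValueError(
            f"right stream is sorted by '{right_sort_field}' "
            f"but the join uses '{right_key}'"
        )


# The corrected query sorts the first stream by "node": accepted.
check_join_sorts("node", "ID", "node=ID")

# The mistaken query sorts the first stream by "ID": should be rejected.
try:
    check_join_sorts("ID", "ID", "node=ID")
    rejected = False
except ValueError as err:
    rejected = True
    message = str(err)
```

The point of the sketch is that the sort field of each stream is compared with 
that stream's own side of the join condition, so a first stream sorted by ID is 
rejected when the join is node=ID.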

Actual result: Solr treats the streams as correctly sorted and joins every 
node from the first stream to a single set of values from the second stream 
(each row is joined to the *same* row). The returned ID and node values 
therefore do not match, even though the join condition equates them.

This is an easy mistake to make: I was gathering IDs, so I automatically 
sorted by ID when I should have sorted by node.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org
