[jira] [Comment Edited] (SOLR-13749) Implement support for joining across collections with multiple shards ( XCJF )

2020-06-05 Thread Kevin Watters (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-13749?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17127100#comment-17127100
 ] 

Kevin Watters edited comment on SOLR-13749 at 6/5/20, 9:29 PM:
---

Perhaps no one expects scoring by default. If they explicitly ask for 
scoring, then the request should fail if it's not supported in cross-collection 
mode.  There seem to be some additional edge cases too if we go with auto-magic.

[~dsmiley], would you be able to help with the log message so we can get the 
code over the line to your liking? Once we are done with the code side, 
we'll update the documentation to be consistent with it.


was (Author: kwatters):
perhaps no one expects scoring.. unless they say they want the score join 
stuff.  but then it's at odds with cross collection.. so users are forced to be 
explicit in that scenario.

Anyway..[~dsmiley]  would you be able to help with the log message so we can 
get the code over the line for your liking, and once we are done with the code 
side, we'll update the documentation to be consistent with it.

> Implement support for joining across collections with multiple shards ( XCJF )
> --
>
> Key: SOLR-13749
> URL: https://issues.apache.org/jira/browse/SOLR-13749
> Project: Solr
>  Issue Type: New Feature
>Reporter: Kevin Watters
>Assignee: Gus Heck
>Priority: Blocker
> Fix For: 8.6
>
> Attachments: 2020-03 Smiley with ASF hat.jpeg
>
>  Time Spent: 1h 50m
>  Remaining Estimate: 0h
>
> This ticket includes 2 query parsers.
> The first one is the "Cross-collection join filter" (XCJF) query parser. It 
> can call out to a remote collection to get a set of join keys to be used as 
> a filter against the local collection.
> The second one is the Hash Range query parser. You specify a field name and 
> a hash range, and only the documents that would have hashed to that range 
> are returned.
> The XCJF query parser computes an intersection based on join keys between 2 
> collections.
> The local collection is the collection that you are searching against.
> The remote collection is the collection that contains the join keys that you 
> want to use as a filter.
> Each shard participating in the distributed request will execute a query 
> against the remote collection.  If the local collection is set up with the 
> compositeId router to be routed on the join key field, a hash range query is 
> applied to the remote collection query so that it only matches the documents 
> that could potentially match documents in the local shard/core. 
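
To illustrate the hash range idea described above, here is a minimal sketch in Python. It is not Solr code: `zlib.crc32` is used purely as a stand-in hash (the compositeId router uses a different hash function), and the names `key_hash`, `shard_ranges` and `hash_range_filter` are hypothetical. It only shows that, when each shard owns a contiguous slice of the hash space, a shard needs to fetch just the remote join keys whose hash falls in its own slice.

```python
import zlib
from typing import Iterable, List, Tuple

HASH_SPACE = 2 ** 32  # size of the 32-bit hash space being partitioned


def key_hash(join_key: str) -> int:
    """Stand-in hash (CRC32). Assumption: NOT the hash Solr actually uses."""
    return zlib.crc32(join_key.encode("utf-8"))


def shard_ranges(num_shards: int) -> List[Tuple[int, int]]:
    """Split the hash space into contiguous, inclusive (lower, upper) ranges."""
    step = HASH_SPACE // num_shards
    ranges = []
    for i in range(num_shards):
        lower = i * step
        # The last range absorbs any remainder so the whole space is covered.
        upper = HASH_SPACE - 1 if i == num_shards - 1 else (i + 1) * step - 1
        ranges.append((lower, upper))
    return ranges


def hash_range_filter(join_keys: Iterable[str], lower: int, upper: int) -> List[str]:
    """Keep only the join keys whose hash falls inside [lower, upper]."""
    return [k for k in join_keys if lower <= key_hash(k) <= upper]


# Hypothetical join keys returned by the remote collection.
remote_keys = [f"user-{n}" for n in range(20)]
ranges = shard_ranges(4)
per_shard = [hash_range_filter(remote_keys, lo, hi) for lo, hi in ranges]
for (lo, hi), keys in zip(ranges, per_shard):
    print(f"{lo:08x}-{hi:08x}: {len(keys)} keys")
```

Because the ranges are disjoint and cover the whole hash space, every remote key is retrieved by exactly one local shard instead of by all of them.
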
>  
>  
> Here's some vocab to help with the descriptions of the various parameters.
> ||Term||Description||
> |Local Collection|This is the main collection that is being queried.|
> |Remote Collection|This is the collection that the XCJFQuery will query to 
> resolve the join keys.|
> |XCJFQuery|The Lucene query that executes a search to get back a set of join 
> keys from a remote collection.|
> |HashRangeQuery|The Lucene query that matches only the documents whose hash 
> code on a field falls within a specified range.|
>  
>  
> ||Param||Required||Description||
> |collection|Required|The name of the external Solr collection to be queried 
> to retrieve the set of join key values.|
> |zkHost|Optional|The connection string to be used to connect to ZooKeeper.  
> zkHost and solrUrl are both optional parameters, and at most one of them 
> should be specified.  
> If neither zkHost nor solrUrl is specified, the local ZooKeeper cluster 
> will be used.|
> |solrUrl|Optional|The URL of the external Solr node to be queried.|
> |from|Required|The join key field name in the external collection.|
> |to|Required|The join key field name in the local collection.|
> |v|See Note|The query to be executed against the external Solr collection to 
> retrieve the set of join key values.  
> Note:  The original query can be passed at the end of the string or as the 
> "v" parameter.  
> It's recommended to use query parameter substitution with the "v" parameter 
> to ensure no issues arise with the default query parsers.|
> |routed|Optional|true / false.  If true, the XCJF query will use each 
> shard's hash range to determine the set of join keys to retrieve for that 
> shard.
> This parameter improves the performance of the cross-collection join, but 
> it depends on the local collection being routed by the "to" field.  If this 
> parameter is not specified, 
> the XCJF query will try to determine the correct value automatically.|
> |ttl| |The length of time that an XCJF query in the cache will be considered 
> valid, in seconds.  
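
As a rough illustration of how the parameters in the table above fit together, the following Python sketch assembles request parameters for an XCJF filter. Several things here are assumptions and not confirmed by the ticket text: the parser name `xcjf` in the local-params prefix, the helper name `build_xcjf_request`, the `joinQuery` request parameter used for substitution, and the example collection and field names. The sketch only encodes the rules stated in the table: `collection`, `from` and `to` are required, at most one of `zkHost` and `solrUrl` may be given, and the remote query is passed through `v` by parameter substitution.

```python
from typing import Dict, Optional


def _quote(value: str) -> str:
    """Double-quote a local-param value, escaping backslashes and quotes."""
    return '"' + value.replace("\\", "\\\\").replace('"', '\\"') + '"'


def build_xcjf_request(collection: str, from_field: str, to_field: str,
                       remote_query: str,
                       zk_host: Optional[str] = None,
                       solr_url: Optional[str] = None,
                       routed: Optional[bool] = None,
                       ttl: Optional[int] = None) -> Dict[str, str]:
    """Build Solr request params for a cross-collection join filter (sketch)."""
    if not (collection and from_field and to_field):
        raise ValueError("collection, from and to are required")
    if zk_host and solr_url:
        raise ValueError("specify at most one of zkHost and solrUrl")

    local = [("collection", collection), ("from", from_field), ("to", to_field)]
    if zk_host:
        local.append(("zkHost", zk_host))
    if solr_url:
        local.append(("solrUrl", solr_url))
    if routed is not None:
        local.append(("routed", "true" if routed else "false"))
    if ttl is not None:
        local.append(("ttl", str(ttl)))

    parts = [f"{name}={_quote(value)}" for name, value in local]
    # The remote query goes in via parameter substitution, as the "v" row recommends.
    parts.append("v=$joinQuery")
    return {
        "q": "*:*",
        "fq": "{!xcjf " + " ".join(parts) + "}",  # "xcjf" parser name is assumed
        "joinQuery": remote_query,
    }


params = build_xcjf_request("products", "product_id", "id", "category:books")
print(params["fq"])
```

Passing the remote query as a separate request parameter keeps its own syntax (quotes, braces, spaces) out of the local-params string, which is the concern the note on `v` raises.
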

[jira] [Comment Edited] (SOLR-13749) Implement support for joining across collections with multiple shards ( XCJF )

2020-05-21 Thread Gus Heck (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-13749?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17113517#comment-17113517
 ] 

Gus Heck edited comment on SOLR-13749 at 5/21/20, 8:26 PM:
---

Let me clarify the above. Some of it is forward-looking, in the event that the 
NPE I mentioned above gets changed, or some aspect of when we do/don't 
encode/decode URLs gets changed, or in the event that there are parameter 
hacking/hiding/encoding tricks I didn't think of. HTTP is just too 
ubiquitous, and it initiates the connection with a path string of arbitrary 
size, whereas the ZK protocol is only relevant to ZK servers and there is no 
way (that I know of) to make the initial ZK connection send a lot of data.


was (Author: gus_heck):
Let me clarify the above... some of it is forward looking in the even that the 
NPE I mentioned above gets changed, or some aspect of when we do/don't 
encode/decode URL's gets changed, etc... or in the event that there are 
parameter hacking/hiding/encoding tricks I didn't think of... HTTP is just too 
ubiquitous, and it initiates the connection with a path string of arbitrary 
size... the ZK protocol is only relevant to ZK servers and there is no way 
(that I know of) to make the initial zk connection send a lot of data.

> Implement support for joining across collections with multiple shards ( XCJF )
> --
>
> Key: SOLR-13749
> URL: https://issues.apache.org/jira/browse/SOLR-13749
> Project: Solr
>  Issue Type: New Feature
>Reporter: Kevin Watters
>Assignee: Gus Heck
>Priority: Blocker
> Fix For: 8.6
>
> Attachments: 2020-03 Smiley with ASF hat.jpeg
>
>  Time Spent: 1h 40m
>  Remaining Estimate: 0h
>

[jira] [Comment Edited] (SOLR-13749) Implement support for joining across collections with multiple shards ( XCJF )

2020-03-06 Thread David Smiley (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-13749?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17053883#comment-17053883
 ] 

David Smiley edited comment on SOLR-13749 at 3/7/20, 4:43 AM:
--

The \{{{!join}}} QParser has the params {{fromIndex}}, {{from}}, and {{to}} 
that align with XCJF functionality of similar parameters (to & from are the 
same, fromIndex is "collection").  This is not true of \{{{!parent}}} and 
\{{{!child}}}.  Yes XCJF has _additional_ parameters and a cache etc. but they 
don't change the fundamental semantics (meaning).  For years, users have been 
able to use \{{{!join}}} to match a foreign index to the target index of the 
request and the foreign index has been able to be a collection name.  It has 
limitations (same node).  What's awesome on this issue is that we're lifting 
that same-machine restriction.  I appreciate that the functionality to do that 
requires fundamentally different code (which users don't care about) and there 
are tuning knobs.  This has been the story for \{{{!join}}} for a long time as 
it gained the ability to do scoring which required different code.  [~mkhl] you 
may have an opinion here as someone who has put effort into \{{{!join}}} over 
some years.   
(BTW boy is it hard to type query parser syntax in JIRA with its escaping :-)


was (Author: dsmiley):
The \{{{!join}}} QParser has the params {{fromIndex}}, {{from}}, and {{to}} 
that align with XCJF functionality of similar parameters (to & from are the 
same, fromIndex is "collection").  This is not true of \{{{!parent}}} and 
\{{{!child}}}.  Yes XCJF has _additional_ parameters and a cache etc. but they 
don't change the fundamental semantics (meaning).  For years, users have been 
able to use \{{{!join}}} to match a foreign index to the target index of the 
request and the foreign index has been able to be a collection name.  It has 
limitations (same node).  What's awesome on this issue is that we're lifting 
that same-machine restriction.  I appreciate that the functionality to do that 
requires fundamentally different code (which users don't care about) and there 
are tuning knobs.  This has been the story for {{{!join}}} for a long time as 
it gained the ability to do scoring which required different code.  [~mkhl] you 
may have an opinion here as someone who has put effort into {{{!join}}} over 
some years.

> Implement support for joining across collections with multiple shards ( XCJF )
> --
>
> Key: SOLR-13749
> URL: https://issues.apache.org/jira/browse/SOLR-13749
> Project: Solr
>  Issue Type: New Feature
>Reporter: Kevin Watters
>Assignee: Gus Heck
>Priority: Blocker
> Fix For: 8.5
>
>  Time Spent: 1.5h
>  Remaining Estimate: 0h
>