[jira] [Commented] (SOLR-13749) Implement support for joining across collections with multiple shards ( XCJF )

2020-03-06 Thread Ishan Chattopadhyaya (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-13749?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17053912#comment-17053912
 ] 

Ishan Chattopadhyaya commented on SOLR-13749:
-

Even though I feel there shouldn't be a separate qparser at all, \{!xcjf} is a 
cryptic choice for the query parser name. \{!ccjoin} or \{!xcjoin} would've 
been better names.

> Implement support for joining across collections with multiple shards ( XCJF )
> --
>
> Key: SOLR-13749
> URL: https://issues.apache.org/jira/browse/SOLR-13749
> Project: Solr
>  Issue Type: New Feature
>Reporter: Kevin Watters
>Assignee: Gus Heck
>Priority: Blocker
> Fix For: 8.5
>
>  Time Spent: 1.5h
>  Remaining Estimate: 0h
>
> This ticket includes 2 query parsers.
> The first is the "Cross-Collection Join Filter" (XCJF) query parser. It can 
> call out to a remote collection to retrieve a set of join keys, which are 
> then used as a filter against the local collection.
> The second is the Hash Range query parser, which takes a field name and a 
> hash range; only the documents that would have hashed into that range are 
> returned.
> The XCJF query parser performs an intersection based on join keys between 2 
> collections.
> The local collection is the collection that you are searching against.
> The remote collection is the collection that contains the join keys that you 
> want to use as a filter.
> Each shard participating in the distributed request executes a query against 
> the remote collection.  If the local collection is set up with the 
> compositeId router routed on the join key field, a hash range query is 
> applied to the remote collection query so that it only matches documents 
> that could potentially match documents in the local shard/core.
>  
>  
> Here's some vocabulary to help with the descriptions of the various 
> parameters.
> ||Term||Description||
> |Local Collection|This is the main collection that is being queried.|
> |Remote Collection|This is the collection that the XCJFQuery will query to 
> resolve the join keys.|
> |XCJFQuery|The Lucene query that executes a search to get back a set of join 
> keys from a remote collection.|
> |HashRangeQuery|The Lucene query that matches only the documents whose hash 
> code on a field falls within a specified range.|
>  
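> A hypothetical invocation of the hash range parser (the registered parser 
> name and its parameter names are assumptions, not spelled out in this 
> ticket):
> {code:java}
> fq={!hash_range f=join_key_s l=0 u=1000000}
> {code}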
>  
> ||Param ||Required ||Description||
> |collection|Required|The name of the external Solr collection to be queried 
> to retrieve the set of join key values.|
> |zkHost|Optional|The connection string used to connect to ZooKeeper.  
> zkHost and solrUrl are both optional, and at most one of them should be 
> specified.  
> If neither zkHost nor solrUrl is specified, the local ZooKeeper cluster 
> will be used.|
> |solrUrl|Optional|The URL of the external Solr node to be queried.|
> |from|Required|The join key field name in the external collection.|
> |to|Required|The join key field name in the local collection|
> |v|See Note|The query to be executed against the external Solr collection to 
> retrieve the set of join key values.  
> Note:  The original query can be passed at the end of the string or as the 
> "v" parameter.  
> It's recommended to use query parameter substitution with the "v" parameter 
> to ensure no issues arise with the default query parsers.|
> |routed| |true / false.  If true, the XCJF query will use each shard's hash 
> range to determine the set of join keys to retrieve for that shard.
> This parameter improves the performance of the cross-collection join, but 
> it depends on the local collection being routed by the toField.  If this 
> parameter is not specified, 
> the XCJF query will try to determine the correct value automatically.|
> |ttl| |The length of time that an XCJF query in the cache will be considered 
> valid, in seconds.  Defaults to 3600 (one hour).  
> The XCJF query will not be aware of changes to the remote collection, so 
> if the remote collection is updated, cached XCJF queries may give inaccurate 
> results.  
> After the ttl period has expired, the XCJF query will re-execute the join 
> against the remote collection.|
> |_All others_| |Any normal Solr parameter can also be specified as a local 
> param.|
>  
> Example solrconfig.xml changes:
>  
>  {code:xml}
>  <cache name="hash_vin"
>         class="solr.LRUCache"
>         size="128"
>         initialSize="0"
>         regenerator="solr.NoOpRegenerator"/>
>  
>  <queryParser name="xcjf"
>               class="org.apache.solr.search.join.XCJFQueryParserPlugin">
>  {code}
> 
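> A hypothetical end-to-end sketch in SolrJ (collection names, field names, and 
> the remote filter are made up for illustration):
> {code:java}
> import org.apache.solr.client.solrj.SolrQuery;
> import org.apache.solr.client.solrj.impl.HttpSolrClient;
> 
> public class XcjfExample {
>   public static void main(String[] args) throws Exception {
>     try (HttpSolrClient client =
>         new HttpSolrClient.Builder("http://localhost:8983/solr").build()) {
>       SolrQuery q = new SolrQuery("*:*");
>       // Join keys are fetched from the remote collection and applied as a
>       // filter here; the join query is passed via the recommended "v"
>       // parameter substitution.
>       q.addFilterQuery("{!xcjf collection=remoteProducts from=product_id_s "
>           + "to=product_id_s v=$joinQuery}");
>       q.set("joinQuery", "category:electronics");
>       System.out.println(client.query("localOrders", q).getResults().getNumFound());
>     }
>   }
> }
> {code}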

[jira] [Commented] (SOLR-13749) Implement support for joining across collections with multiple shards ( XCJF )

2020-03-06 Thread Ishan Chattopadhyaya (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-13749?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17053906#comment-17053906
 ] 

Ishan Chattopadhyaya commented on SOLR-13749:
-

bq. Basically, let's enhance JoinQParserPlugin to know when to use this new 
implementation instead of adding a new query parser that looks like the 
current one. The existing one already has a "method" and branches. Can we get 
this in ASAP for 8.5 please?

Fully agree. Users shouldn't have to worry about using {!join} or {!xcjf} based 
on whether the other collection is co-located on the same nodes or not.

bq. There is already a long-standing precedent for having different query 
parsers for join operations that are significantly different: child, parent, 
and join.  Why would the xcjf parser be held to a different set of standards?
All of these are conceptually different and reasonably clear in terms of 
nomenclature ({!child} and {!parent} for nested/block join, {!join} for generic 
joins).

> Implement support for joining across collections with multiple shards ( XCJF )
> --
>
> Key: SOLR-13749
> URL: https://issues.apache.org/jira/browse/SOLR-13749
> Project: Solr
>  Issue Type: New Feature
>Reporter: Kevin Watters
>Assignee: Gus Heck
>Priority: Blocker
> Fix For: 8.5
>
>  Time Spent: 1.5h
>  Remaining Estimate: 0h
>
> This ticket includes 2 query parsers.
> The first is the "Cross-Collection Join Filter" (XCJF) query parser. It can 
> call out to a remote collection to retrieve a set of join keys, which are 
> then used as a filter against the local collection.
> The second is the Hash Range query parser, which takes a field name and a 
> hash range; only the documents that would have hashed into that range are 
> returned.
> The XCJF query parser performs an intersection based on join keys between 2 
> collections.
> The local collection is the collection that you are searching against.
> The remote collection is the collection that contains the join keys that you 
> want to use as a filter.
> Each shard participating in the distributed request executes a query against 
> the remote collection.  If the local collection is set up with the 
> compositeId router routed on the join key field, a hash range query is 
> applied to the remote collection query so that it only matches documents 
> that could potentially match documents in the local shard/core.
>  
>  
> Here's some vocabulary to help with the descriptions of the various 
> parameters.
> ||Term||Description||
> |Local Collection|This is the main collection that is being queried.|
> |Remote Collection|This is the collection that the XCJFQuery will query to 
> resolve the join keys.|
> |XCJFQuery|The Lucene query that executes a search to get back a set of join 
> keys from a remote collection.|
> |HashRangeQuery|The Lucene query that matches only the documents whose hash 
> code on a field falls within a specified range.|
>  
>  
> ||Param ||Required ||Description||
> |collection|Required|The name of the external Solr collection to be queried 
> to retrieve the set of join key values.|
> |zkHost|Optional|The connection string used to connect to ZooKeeper.  
> zkHost and solrUrl are both optional, and at most one of them should be 
> specified.  
> If neither zkHost nor solrUrl is specified, the local ZooKeeper cluster 
> will be used.|
> |solrUrl|Optional|The URL of the external Solr node to be queried.|
> |from|Required|The join key field name in the external collection.|
> |to|Required|The join key field name in the local collection|
> |v|See Note|The query to be executed against the external Solr collection to 
> retrieve the set of join key values.  
> Note:  The original query can be passed at the end of the string or as the 
> "v" parameter.  
> It's recommended to use query parameter substitution with the "v" parameter 
> to ensure no issues arise with the default query parsers.|
> |routed| |true / false.  If true, the XCJF query will use each shard's hash 
> range to determine the set of join keys to retrieve for that shard.
> This parameter improves the performance of the cross-collection join, but 
> it depends on the local collection being routed by the toField.  If this 
> parameter is not specified, 
> the XCJF query will try to determine the correct value automatically.|
> |ttl| |The length of time that an XCJF query in the cache will be considered 
> valid, in seconds.  Defaults to 3600 (one hour).  
> The XCJF query will not be aware of changes to the remote collection, so 
> if the remote collection is updated, cached XCJF 

[jira] [Commented] (SOLR-13749) Implement support for joining across collections with multiple shards ( XCJF )

2020-03-06 Thread Ishan Chattopadhyaya (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-13749?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17053903#comment-17053903
 ] 

Ishan Chattopadhyaya commented on SOLR-13749:
-

+1 to consolidating within the existing JoinQParserPlugin. It already supports 
cross-core join, which shouldn't be conceptually different from 
cross-collection join (from the users' point of view). We should minimize 
"query parser sprawl".

I've not looked at the implementation here, but I really hope this new 
functionality can play well with SOLR-13350. For context, I had to do some 
surgery to the JoinQParserPlugin in order to make it play nicely.

> Implement support for joining across collections with multiple shards ( XCJF )
> --
>
> Key: SOLR-13749
> URL: https://issues.apache.org/jira/browse/SOLR-13749
> Project: Solr
>  Issue Type: New Feature
>Reporter: Kevin Watters
>Assignee: Gus Heck
>Priority: Blocker
> Fix For: 8.5
>
>  Time Spent: 1.5h
>  Remaining Estimate: 0h
>
> This ticket includes 2 query parsers.
> The first is the "Cross-Collection Join Filter" (XCJF) query parser. It can 
> call out to a remote collection to retrieve a set of join keys, which are 
> then used as a filter against the local collection.
> The second is the Hash Range query parser, which takes a field name and a 
> hash range; only the documents that would have hashed into that range are 
> returned.
> The XCJF query parser performs an intersection based on join keys between 2 
> collections.
> The local collection is the collection that you are searching against.
> The remote collection is the collection that contains the join keys that you 
> want to use as a filter.
> Each shard participating in the distributed request executes a query against 
> the remote collection.  If the local collection is set up with the 
> compositeId router routed on the join key field, a hash range query is 
> applied to the remote collection query so that it only matches documents 
> that could potentially match documents in the local shard/core.
>  
>  
> Here's some vocabulary to help with the descriptions of the various 
> parameters.
> ||Term||Description||
> |Local Collection|This is the main collection that is being queried.|
> |Remote Collection|This is the collection that the XCJFQuery will query to 
> resolve the join keys.|
> |XCJFQuery|The Lucene query that executes a search to get back a set of join 
> keys from a remote collection.|
> |HashRangeQuery|The Lucene query that matches only the documents whose hash 
> code on a field falls within a specified range.|
>  
>  
> ||Param ||Required ||Description||
> |collection|Required|The name of the external Solr collection to be queried 
> to retrieve the set of join key values.|
> |zkHost|Optional|The connection string used to connect to ZooKeeper.  
> zkHost and solrUrl are both optional, and at most one of them should be 
> specified.  
> If neither zkHost nor solrUrl is specified, the local ZooKeeper cluster 
> will be used.|
> |solrUrl|Optional|The URL of the external Solr node to be queried.|
> |from|Required|The join key field name in the external collection.|
> |to|Required|The join key field name in the local collection|
> |v|See Note|The query to be executed against the external Solr collection to 
> retrieve the set of join key values.  
> Note:  The original query can be passed at the end of the string or as the 
> "v" parameter.  
> It's recommended to use query parameter substitution with the "v" parameter 
> to ensure no issues arise with the default query parsers.|
> |routed| |true / false.  If true, the XCJF query will use each shard's hash 
> range to determine the set of join keys to retrieve for that shard.
> This parameter improves the performance of the cross-collection join, but 
> it depends on the local collection being routed by the toField.  If this 
> parameter is not specified, 
> the XCJF query will try to determine the correct value automatically.|
> |ttl| |The length of time that an XCJF query in the cache will be considered 
> valid, in seconds.  Defaults to 3600 (one hour).  
> The XCJF query will not be aware of changes to the remote collection, so 
> if the remote collection is updated, cached XCJF queries may give inaccurate 
> results.  
> After the ttl period has expired, the XCJF query will re-execute the join 
> against the remote collection.|
> |_All others_| |Any normal Solr parameter can also be specified as a local 
> param.|
>  
> Example solrconfig.xml changes:
>  
>  <cache name="hash_vin"

[jira] [Commented] (SOLR-13893) BlobRepository looks at the wrong system variable (runtme.lib.size)

2020-03-06 Thread Munendra S N (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-13893?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17053895#comment-17053895
 ] 

Munendra S N commented on SOLR-13893:
-

 [^SOLR-13893.patch] 
Patch against master.

> BlobRepository looks at the wrong system variable (runtme.lib.size)
> ---
>
> Key: SOLR-13893
> URL: https://issues.apache.org/jira/browse/SOLR-13893
> Project: Solr
>  Issue Type: Bug
>Reporter: Erick Erickson
>Assignee: Munendra S N
>Priority: Major
> Attachments: SOLR-13893.patch, SOLR-13893.patch, SOLR-13893.patch
>
>
> Tim Swetland on the user's list pointed out this line in BlobRepository:
> {code:java}
> private static final long MAX_JAR_SIZE =
>     Long.parseLong(System.getProperty("runtme.lib.size", String.valueOf(5 * 1024 * 1024)));
> {code}
> "runtme" can't be right.
> [~ichattopadhyaya][~noblepaul] what's your opinion?
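> Presumably the intended name is "runtime.lib.size" (an assumption; the 
> attached patch is authoritative). A minimal sketch of the fix:
> {code:java}
> // Hypothetical fix: correct the typo in the system property name.
> private static final long MAX_JAR_SIZE =
>     Long.parseLong(System.getProperty("runtime.lib.size", String.valueOf(5 * 1024 * 1024)));
> {code}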






[jira] [Updated] (SOLR-13893) BlobRepository looks at the wrong system variable (runtme.lib.size)

2020-03-06 Thread Munendra S N (Jira)


 [ 
https://issues.apache.org/jira/browse/SOLR-13893?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Munendra S N updated SOLR-13893:

Attachment: SOLR-13893.patch

> BlobRepository looks at the wrong system variable (runtme.lib.size)
> ---
>
> Key: SOLR-13893
> URL: https://issues.apache.org/jira/browse/SOLR-13893
> Project: Solr
>  Issue Type: Bug
>Reporter: Erick Erickson
>Assignee: Munendra S N
>Priority: Major
> Attachments: SOLR-13893.patch, SOLR-13893.patch, SOLR-13893.patch
>
>
> Tim Swetland on the user's list pointed out this line in BlobRepository:
> {code:java}
> private static final long MAX_JAR_SIZE =
>     Long.parseLong(System.getProperty("runtme.lib.size", String.valueOf(5 * 1024 * 1024)));
> {code}
> "runtme" can't be right.
> [~ichattopadhyaya][~noblepaul] what's your opinion?






[jira] [Commented] (SOLR-13944) CollapsingQParserPlugin throws NPE instead of bad request

2020-03-06 Thread Munendra S N (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-13944?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17053894#comment-17053894
 ] 

Munendra S N commented on SOLR-13944:
-

 [^SOLR-13944.patch] 
Thanks [~tflobbe] for the review. I have addressed the comments.

> CollapsingQParserPlugin throws NPE instead of bad request
> -
>
> Key: SOLR-13944
> URL: https://issues.apache.org/jira/browse/SOLR-13944
> Project: Solr
>  Issue Type: Bug
>Affects Versions: 7.3.1
>Reporter: Stefan
>Assignee: Munendra S N
>Priority: Minor
> Attachments: SOLR-13944.patch, SOLR-13944.patch
>
>
>  I noticed the following NPE:
> {code:java}
> java.lang.NullPointerException at 
> org.apache.solr.search.CollapsingQParserPlugin$OrdFieldValueCollector.finish(CollapsingQParserPlugin.java:1021)
>  at 
> org.apache.solr.search.CollapsingQParserPlugin$OrdFieldValueCollector.finish(CollapsingQParserPlugin.java:1081)
>  at 
> org.apache.solr.search.SolrIndexSearcher.buildAndRunCollectorChain(SolrIndexSearcher.java:230)
>  at 
> org.apache.solr.search.SolrIndexSearcher.getDocListNC(SolrIndexSearcher.java:1602)
>  at 
> org.apache.solr.search.SolrIndexSearcher.getDocListC(SolrIndexSearcher.java:1419)
>  at 
> org.apache.solr.search.SolrIndexSearcher.search(SolrIndexSearcher.java:584)
> {code}
> If I am correct, the problem was already addressed in SOLR-8807. The fix was 
> not working in this case though, because of a syntax error in the query 
> (I used the local parameter syntax twice instead of combining it). The 
> relevant part of the query is:
> {code:java}
> fq={!tag=collapser}{!collapse field=productId sort='merchantOrder asc, price 
> asc, id asc'}
> {code}
> After discussing that on the mailing list, I was asked to open a ticket, 
> because this situation should result in a bad request instead of a 
> NullpointerException (see 
> [https://mail-archives.apache.org/mod_mbox/lucene-solr-user/201911.mbox/%3CCAMJgJxTuSb%3D8szO8bvHiAafJOs08O_NMB4pcaHOXME4Jj-GO2A%40mail.gmail.com%3E])
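> For reference, a combined form with a single set of local params should 
> parse cleanly (a sketch, assuming the tag local param is honored by the 
> collapse parser):
> {code:java}
> fq={!collapse tag=collapser field=productId sort='merchantOrder asc, price asc, id asc'}
> {code}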






[jira] [Updated] (SOLR-13944) CollapsingQParserPlugin throws NPE instead of bad request

2020-03-06 Thread Munendra S N (Jira)


 [ 
https://issues.apache.org/jira/browse/SOLR-13944?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Munendra S N updated SOLR-13944:

Attachment: SOLR-13944.patch

> CollapsingQParserPlugin throws NPE instead of bad request
> -
>
> Key: SOLR-13944
> URL: https://issues.apache.org/jira/browse/SOLR-13944
> Project: Solr
>  Issue Type: Bug
>Affects Versions: 7.3.1
>Reporter: Stefan
>Assignee: Munendra S N
>Priority: Minor
> Attachments: SOLR-13944.patch, SOLR-13944.patch
>
>
>  I noticed the following NPE:
> {code:java}
> java.lang.NullPointerException at 
> org.apache.solr.search.CollapsingQParserPlugin$OrdFieldValueCollector.finish(CollapsingQParserPlugin.java:1021)
>  at 
> org.apache.solr.search.CollapsingQParserPlugin$OrdFieldValueCollector.finish(CollapsingQParserPlugin.java:1081)
>  at 
> org.apache.solr.search.SolrIndexSearcher.buildAndRunCollectorChain(SolrIndexSearcher.java:230)
>  at 
> org.apache.solr.search.SolrIndexSearcher.getDocListNC(SolrIndexSearcher.java:1602)
>  at 
> org.apache.solr.search.SolrIndexSearcher.getDocListC(SolrIndexSearcher.java:1419)
>  at 
> org.apache.solr.search.SolrIndexSearcher.search(SolrIndexSearcher.java:584)
> {code}
> If I am correct, the problem was already addressed in SOLR-8807. The fix was 
> not working in this case though, because of a syntax error in the query 
> (I used the local parameter syntax twice instead of combining it). The 
> relevant part of the query is:
> {code:java}
> fq={!tag=collapser}{!collapse field=productId sort='merchantOrder asc, price 
> asc, id asc'}
> {code}
> After discussing that on the mailing list, I was asked to open a ticket, 
> because this situation should result in a bad request instead of a 
> NullpointerException (see 
> [https://mail-archives.apache.org/mod_mbox/lucene-solr-user/201911.mbox/%3CCAMJgJxTuSb%3D8szO8bvHiAafJOs08O_NMB4pcaHOXME4Jj-GO2A%40mail.gmail.com%3E])






[jira] [Commented] (SOLR-13749) Implement support for joining across collections with multiple shards ( XCJF )

2020-03-06 Thread Kevin Watters (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-13749?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17053888#comment-17053888
 ] 

Kevin Watters commented on SOLR-13749:
--

dismax, edismax, payload_check, payload_score ... there's a lot of query 
parser sprawl already.

This query parser doesn't support any of the scoring of the existing join 
query parser, so there would be a lot of details to address before the 
functionality could be merged into the standard join query parser. Otherwise 
we would have to answer questions like "why doesn't score=max work on the 
join query?"  Sure, we could add that in at some point.

The xcjf query is a filtering-only operation.  It does not do any scoring; 
for me that is pretty fundamentally different from the current join query 
parser.

 

> Implement support for joining across collections with multiple shards ( XCJF )
> --
>
> Key: SOLR-13749
> URL: https://issues.apache.org/jira/browse/SOLR-13749
> Project: Solr
>  Issue Type: New Feature
>Reporter: Kevin Watters
>Assignee: Gus Heck
>Priority: Blocker
> Fix For: 8.5
>
>  Time Spent: 1.5h
>  Remaining Estimate: 0h
>
> This ticket includes 2 query parsers.
> The first is the "Cross-Collection Join Filter" (XCJF) query parser. It can 
> call out to a remote collection to retrieve a set of join keys, which are 
> then used as a filter against the local collection.
> The second is the Hash Range query parser, which takes a field name and a 
> hash range; only the documents that would have hashed into that range are 
> returned.
> The XCJF query parser performs an intersection based on join keys between 2 
> collections.
> The local collection is the collection that you are searching against.
> The remote collection is the collection that contains the join keys that you 
> want to use as a filter.
> Each shard participating in the distributed request executes a query against 
> the remote collection.  If the local collection is set up with the 
> compositeId router routed on the join key field, a hash range query is 
> applied to the remote collection query so that it only matches documents 
> that could potentially match documents in the local shard/core.
>  
>  
> Here's some vocabulary to help with the descriptions of the various 
> parameters.
> ||Term||Description||
> |Local Collection|This is the main collection that is being queried.|
> |Remote Collection|This is the collection that the XCJFQuery will query to 
> resolve the join keys.|
> |XCJFQuery|The Lucene query that executes a search to get back a set of join 
> keys from a remote collection.|
> |HashRangeQuery|The Lucene query that matches only the documents whose hash 
> code on a field falls within a specified range.|
>  
>  
> ||Param ||Required ||Description||
> |collection|Required|The name of the external Solr collection to be queried 
> to retrieve the set of join key values.|
> |zkHost|Optional|The connection string used to connect to ZooKeeper.  
> zkHost and solrUrl are both optional, and at most one of them should be 
> specified.  
> If neither zkHost nor solrUrl is specified, the local ZooKeeper cluster 
> will be used.|
> |solrUrl|Optional|The URL of the external Solr node to be queried.|
> |from|Required|The join key field name in the external collection.|
> |to|Required|The join key field name in the local collection|
> |v|See Note|The query to be executed against the external Solr collection to 
> retrieve the set of join key values.  
> Note:  The original query can be passed at the end of the string or as the 
> "v" parameter.  
> It's recommended to use query parameter substitution with the "v" parameter 
> to ensure no issues arise with the default query parsers.|
> |routed| |true / false.  If true, the XCJF query will use each shard's hash 
> range to determine the set of join keys to retrieve for that shard.
> This parameter improves the performance of the cross-collection join, but 
> it depends on the local collection being routed by the toField.  If this 
> parameter is not specified, 
> the XCJF query will try to determine the correct value automatically.|
> |ttl| |The length of time that an XCJF query in the cache will be considered 
> valid, in seconds.  Defaults to 3600 (one hour).  
> The XCJF query will not be aware of changes to the remote collection, so 
> if the remote collection is updated, cached XCJF queries may give inaccurate 
> results.  
> After the ttl period has expired, the XCJF query will re-execute the join 
> against the remote collection.|
> |_All others_| |Any normal Solr parameter can also 

[jira] [Comment Edited] (SOLR-13749) Implement support for joining across collections with multiple shards ( XCJF )

2020-03-06 Thread David Smiley (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-13749?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17053883#comment-17053883
 ] 

David Smiley edited comment on SOLR-13749 at 3/7/20, 4:43 AM:
--

The \{{{!join}}} QParser has the params {{fromIndex}}, {{from}}, and {{to}} 
that align with XCJF functionality of similar parameters (to & from are the 
same, fromIndex is "collection").  This is not true of \{{{!parent}}} and 
\{{{!child}}}.  Yes XCJF has _additional_ parameters and a cache etc. but they 
don't change the fundamental semantics (meaning).  For years, users have been 
able to use \{{{!join}}} to match a foreign index to the target index of the 
request and the foreign index has been able to be a collection name.  It has 
limitations (same node).  What's awesome on this issue is that we're lifting 
that same-machine restriction.  I appreciate that the functionality to do that 
requires fundamentally different code (which users don't care about) and there 
are tuning knobs.  This has been the story for \{{{!join}}} for a long time as 
it gained the ability to do scoring which required different code.  [~mkhl] you 
may have an opinion here as someone who has put effort into \{{{!join}}} over 
some years.   
(BTW boy is it hard to type query parser syntax in JIRA with its escaping :-)


was (Author: dsmiley):
The \{{{!join}}} QParser has the params {{fromIndex}}, {{from}}, and {{to}} 
that align with XCJF functionality of similar parameters (to & from are the 
same, fromIndex is "collection").  This is not true of \{{{!parent}}} and 
\{{{!child}}}.  Yes XCJF has _additional_ parameters and a cache etc. but they 
don't change the fundamental semantics (meaning).  For years, users have been 
able to use \{{{!join}}} to match a foreign index to the target index of the 
request and the foreign index has been able to be a collection name.  It has 
limitations (same node).  What's awesome on this issue is that we're lifting 
that same-machine restriction.  I appreciate that the functionality to do that 
requires fundamentally different code (which users don't care about) and there 
are tuning knobs.  This has been the story for {{{!join}}} for a long time as 
it gained the ability to do scoring which required different code.  [~mkhl] you 
may have an opinion here as someone who has put effort into {{{!join}}} over 
some years.

> Implement support for joining across collections with multiple shards ( XCJF )
> --
>
> Key: SOLR-13749
> URL: https://issues.apache.org/jira/browse/SOLR-13749
> Project: Solr
>  Issue Type: New Feature
>Reporter: Kevin Watters
>Assignee: Gus Heck
>Priority: Blocker
> Fix For: 8.5
>
>  Time Spent: 1.5h
>  Remaining Estimate: 0h
>
> This ticket includes 2 query parsers.
> The first is the "Cross-Collection Join Filter" (XCJF) query parser. It can 
> call out to a remote collection to retrieve a set of join keys, which are 
> then used as a filter against the local collection.
> The second is the Hash Range query parser, which takes a field name and a 
> hash range; only the documents that would have hashed into that range are 
> returned.
> The XCJF query parser performs an intersection based on join keys between 2 
> collections.
> The local collection is the collection that you are searching against.
> The remote collection is the collection that contains the join keys that you 
> want to use as a filter.
> Each shard participating in the distributed request executes a query against 
> the remote collection.  If the local collection is set up with the 
> compositeId router routed on the join key field, a hash range query is 
> applied to the remote collection query so that it only matches documents 
> that could potentially match documents in the local shard/core.
>  
>  
> Here's some vocabulary to help with the descriptions of the various 
> parameters.
> ||Term||Description||
> |Local Collection|This is the main collection that is being queried.|
> |Remote Collection|This is the collection that the XCJFQuery will query to 
> resolve the join keys.|
> |XCJFQuery|The Lucene query that executes a search to get back a set of join 
> keys from a remote collection.|
> |HashRangeQuery|The Lucene query that matches only the documents whose hash 
> code on a field falls within a specified range.|
>  
>  
> ||Param ||Required ||Description||
> |collection|Required|The name of the external Solr collection to be queried 
> to retrieve the set of join key values.|
> |zkHost|Optional|The connection string used to connect to ZooKeeper.  
> zkHost and solrUrl are both optional, and at most one of them
> 

[jira] [Commented] (SOLR-13749) Implement support for joining across collections with multiple shards ( XCJF )

2020-03-06 Thread David Smiley (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-13749?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17053883#comment-17053883
 ] 

David Smiley commented on SOLR-13749:
-

The \{{{!join}}} QParser has the params {{fromIndex}}, {{from}}, and {{to}} 
that align with XCJF functionality of similar parameters (to & from are the 
same, fromIndex is "collection").  This is not true of \{{{!parent}}} and 
\{{{!child}}}.  Yes XCJF has _additional_ parameters and a cache etc. but they 
don't change the fundamental semantics (meaning).  For years, users have been 
able to use \{{{!join}}} to match a foreign index to the target index of the 
request and the foreign index has been able to be a collection name.  It has 
limitations (same node).  What's awesome on this issue is that we're lifting 
that same-machine restriction.  I appreciate that the functionality to do that 
requires fundamentally different code (which users don't care about) and there 
are tuning knobs.  This has been the story for {{{!join}}} for a long time as 
it gained the ability to do scoring which required different code.  [~mkhl] you 
may have an opinion here as someone who has put effort into {{{!join}}} over 
some years.
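To illustrate the overlap, a sketch (collection and field names are made up):

{code:java}
// Existing same-node join; the foreign index may be a collection name:
q={!join fromIndex=products from=product_id_s to=product_id_s}category:electronics

// XCJF form of the same join, minus the same-node restriction (per this ticket):
q={!xcjf collection=products from=product_id_s to=product_id_s}category:electronics
{code}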

> Implement support for joining across collections with multiple shards ( XCJF )
> --
>
> Key: SOLR-13749
> URL: https://issues.apache.org/jira/browse/SOLR-13749
> Project: Solr
>  Issue Type: New Feature
>Reporter: Kevin Watters
>Assignee: Gus Heck
>Priority: Blocker
> Fix For: 8.5
>
>  Time Spent: 1.5h
>  Remaining Estimate: 0h
>
> This ticket includes 2 query parsers.
> The first is the "Cross-Collection Join Filter" (XCJF) query parser. It can 
> call out to a remote collection to retrieve a set of join keys, which are 
> then used as a filter against the local collection.
> The second is the Hash Range query parser, which takes a field name and a 
> hash range; only the documents that would have hashed into that range are 
> returned.
> The XCJF query parser performs an intersection based on join keys between 2 
> collections.
> The local collection is the collection that you are searching against.
> The remote collection is the collection that contains the join keys that you 
> want to use as a filter.
> Each shard participating in the distributed request executes a query against 
> the remote collection.  If the local collection is set up with the 
> compositeId router routed on the join key field, a hash range query is 
> applied to the remote collection query so that it only matches documents 
> that could potentially match documents in the local shard/core.
>  
>  
> Here's some vocabulary to help with the descriptions of the various 
> parameters.
> ||Term||Description||
> |Local Collection|This is the main collection that is being queried.|
> |Remote Collection|This is the collection that the XCJFQuery will query to 
> resolve the join keys.|
> |XCJFQuery|The Lucene query that executes a search to get back a set of join 
> keys from a remote collection.|
> |HashRangeQuery|The Lucene query that matches only the documents whose hash 
> code on a field falls within a specified range.|
>  
>  
> ||Param ||Required ||Description||
> |collection|Required|The name of the external Solr collection to be queried 
> to retrieve the set of join key values.|
> |zkHost|Optional|The connection string used to connect to ZooKeeper.  
> zkHost and solrUrl are both optional, and at most one of them should be 
> specified.  
> If neither zkHost nor solrUrl is specified, the local ZooKeeper cluster 
> will be used.|
> |solrUrl|Optional|The URL of the external Solr node to be queried.|
> |from|Required|The join key field name in the external collection.|
> |to|Required|The join key field name in the local collection|
> |v|See Note|The query to be executed against the external Solr collection to 
> retrieve the set of join key values.  
> Note:  The original query can be passed at the end of the string or as the 
> "v" parameter.  
> It's recommended to use query parameter substitution with the "v" parameter 
> to ensure no issues arise with the default query parsers.|
> |routed| |true / false.  If true, the XCJF query will use each shard's hash 
> range to determine the set of join keys to retrieve for that shard.
> This parameter improves the performance of the cross-collection join, but 
> it depends on the local collection being routed by the toField.  If this 
> parameter is not specified, 
> the XCJF query will try to determine the correct value automatically.|
> |ttl| |The length of time that an XCJF query in the cache 

[jira] [Commented] (SOLR-13749) Implement support for joining across collections with multiple shards ( XCJF )

2020-03-06 Thread Kevin Watters (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-13749?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17053862#comment-17053862
 ] 

Kevin Watters commented on SOLR-13749:
--

I'm a bit surprised by this.  There is already a long-standing precedent for 
having different query parsers for join operations that are significantly 
different: child, parent, and join.  Why would the xcjf parser be held to a 
different set of standards?

Are there other query parsers that have been introduced in the past that were 
forced to be consolidated into an existing parser or is this a new precedent?

 

> Implement support for joining across collections with multiple shards ( XCJF )
> --
>
> Key: SOLR-13749
> URL: https://issues.apache.org/jira/browse/SOLR-13749
> Project: Solr
>  Issue Type: New Feature
>Reporter: Kevin Watters
>Assignee: Gus Heck
>Priority: Blocker
> Fix For: 8.5
>
>  Time Spent: 1.5h
>  Remaining Estimate: 0h
>
> This ticket includes 2 query parsers.
> The first is the "Cross-Collection Join Filter" (XCJF) query parser. It can 
> call out to a remote collection to retrieve a set of join keys, which are 
> then used as a filter against the local collection.
> The second is the Hash Range query parser, which takes a field name and a 
> hash range; only the documents that would have hashed into that range are 
> returned.
> The XCJF query parser performs an intersection based on join keys between 2 
> collections.
> The local collection is the collection that you are searching against.
> The remote collection is the collection that contains the join keys that you 
> want to use as a filter.
> Each shard participating in the distributed request executes a query against 
> the remote collection.  If the local collection is set up with the 
> compositeId router routed on the join key field, a hash range query is 
> applied to the remote collection query so that it only matches documents 
> that could potentially match documents in the local shard/core.
>  
>  
> Here's some vocabulary to help with the descriptions of the various 
> parameters.
> ||Term||Description||
> |Local Collection|This is the main collection that is being queried.|
> |Remote Collection|This is the collection that the XCJFQuery will query to 
> resolve the join keys.|
> |XCJFQuery|The Lucene query that executes a search to get back a set of join 
> keys from a remote collection.|
> |HashRangeQuery|The Lucene query that matches only the documents whose hash 
> code on a field falls within a specified range.|
>  
>  
> ||Param ||Required ||Description||
> |collection|Required|The name of the external Solr collection to be queried 
> to retrieve the set of join key values.|
> |zkHost|Optional|The connection string used to connect to ZooKeeper.  
> zkHost and solrUrl are both optional, and at most one of them should be 
> specified.  
> If neither zkHost nor solrUrl is specified, the local ZooKeeper cluster 
> will be used.|
> |solrUrl|Optional|The URL of the external Solr node to be queried.|
> |from|Required|The join key field name in the external collection.|
> |to|Required|The join key field name in the local collection|
> |v|See Note|The query to be executed against the external Solr collection to 
> retrieve the set of join key values.  
> Note:  The original query can be passed at the end of the string or as the 
> "v" parameter.  
> It's recommended to use query parameter substitution with the "v" parameter 
> to ensure no issues arise with the default query parsers.|
> |routed| |true / false.  If true, the XCJF query will use each shard's hash 
> range to determine the set of join keys to retrieve for that shard.
> This parameter improves the performance of the cross-collection join, but 
> it depends on the local collection being routed by the toField.  If this 
> parameter is not specified, 
> the XCJF query will try to determine the correct value automatically.|
> |ttl| |The length of time that an XCJF query in the cache will be considered 
> valid, in seconds.  Defaults to 3600 (one hour).  
> The XCJF query will not be aware of changes to the remote collection, so 
> if the remote collection is updated, cached XCJF queries may give inaccurate 
> results.  
> After the ttl period has expired, the XCJF query will re-execute the join 
> against the remote collection.|
> |_All others_| |Any normal Solr parameter can also be specified as a local 
> param.|
>  
> Example solrconfig.xml changes:
>  
>  <cache name="hash_vin"
>         class="solr.LRUCache"

[jira] [Commented] (SOLR-14040) solr.xml shareSchema does not work in SolrCloud

2020-03-06 Thread David Smiley (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-14040?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17053754#comment-17053754
 ] 

David Smiley commented on SOLR-14040:
-

I overlooked that it was documented; I don't know why I didn't notice this when 
I looked for it before (at least I thought I did).

[~noble.paul] I propose commenting out the documentation of it so as to reduce 
exposure to the problem, maybe only on the 8x & 8.5 release branches.  I'll 
spend time working on the resolution in SOLR-14232 _(a linked issue, and I 
think really where this whole conversation should be happening, not here)_.

> solr.xml shareSchema does not work in SolrCloud
> ---
>
> Key: SOLR-14040
> URL: https://issues.apache.org/jira/browse/SOLR-14040
> Project: Solr
>  Issue Type: Improvement
>  Components: Schema and Analysis
>Reporter: David Smiley
>Assignee: David Smiley
>Priority: Blocker
> Fix For: 8.5
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> solr.xml has a shareSchema boolean option that can be toggled from the 
> default of false to true in order to share IndexSchema objects within the 
> Solr node.  This is silently ignored in SolrCloud mode.  The pertinent code 
> is {{org.apache.solr.core.ConfigSetService#createConfigSetService}} which 
> creates a CloudConfigSetService that is not related to the SchemaCaching 
> class.  This may not be a big deal in SolrCloud which tends not to deal well 
> with many cores per node but I'm working on changing that.
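> For reference, a minimal sketch of toggling it in solr.xml (the exact 
> element form is an assumption):
> {code:xml}
> <solr>
>   <str name="shareSchema">true</str>
> </solr>
> {code}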






[jira] [Commented] (SOLR-14173) Ref Guide Redesign

2020-03-06 Thread Cassandra Targett (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-14173?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17053789#comment-17053789
 ] 

Cassandra Targett commented on SOLR-14173:
--

I've worked on this on & off for the past couple of months and have it pretty 
close to done. From my list of Known Issues before:

* -The fancy tab thing for multiple code examples in one section isn't styled 
right when you click other tabs- Fixed now.
* -The top nav won't be responsive in smaller screens- Finally figured out how 
to do this responsively
* -Behavior of sidebar on smaller screens could be improved- This is better now
* -Still many overlapping CSS rules for elements and many unused CSS rules to 
be cleaned up- Did a lot of cleanup here
* Sidebar requires too much scrolling - Phase 2 will trim this down
* -Now unused CSS/JS files haven't been deleted yet-
* -Search box shows results in the sidebar nav - I wasn't able to see this 
until yesterday and not sure how I feel about it. At any rate, I haven't worked 
with it much yet and it needs more work- I worked on this extensively and think 
it's OK now
* -Home page (index.html) needs some additional love- I don't remember what I 
meant by this, but I worked on that too and it's fine now.

Besides fixing these things, I also changed the left nav a little bit in terms 
of colors of child pages, etc. I don't want to spend too much time on that 
since Phase 2 will be a re-org that will require some additional work in this 
area.

I've updated the demo site at 
https://people.apache.org/~ctargett/RefGuideRedesign/ as I've gone along; it's 
up to date with my latest changes today. I have not pushed the branch in a 
while; I think I will actually make a new branch, since this one is really far 
behind and trying to update it gives me weird merge conflicts I really don't 
want to deal with.

Comments? Feedback?

> Ref Guide Redesign
> --
>
> Key: SOLR-14173
> URL: https://issues.apache.org/jira/browse/SOLR-14173
> Project: Solr
>  Issue Type: Improvement
>  Components: documentation
>Reporter: Cassandra Targett
>Assignee: Cassandra Targett
>Priority: Major
>
> The current design of the Ref Guide was essentially copied from a 
> Jekyll-based documentation theme 
> (https://idratherbewriting.com/documentation-theme-jekyll/), which had a 
> couple important benefits for that time:
> * It was well-documented and since I had little experience with Jekyll and 
> its Liquid templates and since I was the one doing it, I wanted to make it as 
> easy on myself as possible
> * It was designed for documentation specifically so took care of all the 
> things like inter-page navigation, etc.
> * It helped us get from Confluence to our current system quickly
> It had some drawbacks, though:
> * It wasted a lot of space on the page
> * The theme was built for Markdown files, so did not take advantage of the 
> features of the {{jekyll-asciidoc}} plugin we use (the in-page TOC being one 
> big example - the plugin could create it at build time, but the theme 
> included JS to do it as the page loads, so we use the JS)
> * It had a lot of JS and overlapping CSS files. While it used Bootstrap it 
> used a customized CSS on top of it for theming that made modifications 
> complex (it was hard to figure out how exactly a change would behave)
> * With all the stuff I'd changed in my bumbling way just to get things to 
> work back then, I broke a lot of the stuff Bootstrap is supposed to give us 
> in terms of responsiveness and making the Guide usable even on smaller screen 
> sizes.
> After upgrading the Asciidoctor components in SOLR-12786 and stopping the PDF 
> (SOLR-13782), I wanted to try to set us up for a more flexible system. We 
> need it for things like Joel's work on the visual guide for streaming 
> expressions (SOLR-13105), and in order to implement other ideas we might have 
> on how to present information in the future.
> I view this issue as a phase 1 of an overall redesign that I've already 
> started in a local branch. I'll explain in a comment the changes I've already 
> made, and will use this issue to create and push a branch where we can 
> discuss in more detail.
> Phase 1 here will be under-the-hood CSS/JS changes + overall page layout 
> changes.
> Phase 2 (issue TBD) will be a wholesale re-organization of all the pages of 
> the Guide.
> Phase 3 (issue TBD) will explore moving us from Jekyll to another static site 
> generator that is better suited for our content format, file types, and build 
> conventions.






[jira] [Commented] (SOLR-11359) An autoscaling/suggestions endpoint to recommend operations

2020-03-06 Thread Noble Paul (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-11359?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17053828#comment-17053828
 ] 

Noble Paul commented on SOLR-11359:
---

There is no specific URL. You can hit any node in the cluster with that URI 
and it should work just fine.

No, we don't want this to be run automatically. Users should make a conscious 
decision to run a command that's given by the API. It's safer that way.
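
A SolrJ sketch of fetching the suggestions (the request path is an assumption 
based on the autoscaling API docs; any node's base URL works):

{code:java}
import org.apache.solr.client.solrj.SolrRequest;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.request.GenericSolrRequest;

public class SuggestionsExample {
  public static void main(String[] args) throws Exception {
    try (HttpSolrClient client =
        new HttpSolrClient.Builder("http://localhost:8983/solr").build()) {
      // Each suggestion carries an HTTP endpoint, method, and command payload;
      // the user decides whether to execute it.
      System.out.println(client.request(
          new GenericSolrRequest(SolrRequest.METHOD.GET,
              "/admin/autoscaling/suggestions", null)));
    }
  }
}
{code}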

> An autoscaling/suggestions endpoint to recommend operations
> ---
>
> Key: SOLR-11359
> URL: https://issues.apache.org/jira/browse/SOLR-11359
> Project: Solr
>  Issue Type: New Feature
>  Components: AutoScaling
>Reporter: Noble Paul
>Assignee: Noble Paul
>Priority: Major
> Attachments: SOLR-11359.patch
>
>
> Autoscaling can make suggestions to users on what operations they can perform 
> to improve the health of the cluster
> The suggestions will have the following information
> * http end point
> * http method (POST,DELETE)
> * command payload






[jira] [Commented] (LUCENE-9170) wagon-ssh Maven HTTPS issue

2020-03-06 Thread Ishan Chattopadhyaya (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9170?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17053748#comment-17053748
 ] 

Ishan Chattopadhyaya commented on LUCENE-9170:
--

[~romseygeek], I wasn't able to build the Solr package from the 8.4 release 
branch, so I opened this issue. If you aren't able to build during your 
release process, we can revisit this. In light of that, do you think this is 
a blocker? I leave the judgement to you.
Also, the patch I submitted is the best solution per my extremely basic 
understanding of our build system; it worked for me. I don't know whether it 
is the best solution overall, so I'm not in a position to resolve this issue 
myself without help from experts on the build system.

> wagon-ssh Maven HTTPS issue
> ---
>
> Key: LUCENE-9170
> URL: https://issues.apache.org/jira/browse/LUCENE-9170
> Project: Lucene - Core
>  Issue Type: Bug
>Reporter: Ishan Chattopadhyaya
>Assignee: Ishan Chattopadhyaya
>Priority: Blocker
> Fix For: 8.5
>
> Attachments: LUCENE-9170.patch, LUCENE-9170.patch
>
>
> When I do, from lucene/ in branch_8_4:
> ant -Dversion=8.4.2 generate-maven-artifacts 
> I see that wagon-ssh is being resolved from http://repo1.maven.org/maven2 
> instead of https equivalent. This is surprising to me, since I can't find the 
> http URL anywhere.
> Here's my log:
> https://paste.centos.org/view/be2d3f3f
> This is a critical issue since releases won't work without a fix.






[jira] [Updated] (SOLR-13749) Implement support for joining across collections with multiple shards ( XCJF )

2020-03-06 Thread David Smiley (Jira)


 [ 
https://issues.apache.org/jira/browse/SOLR-13749?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David Smiley updated SOLR-13749:

Priority: Blocker  (was: Major)

I understand your points of view, but they don't sway my opinion: params can 
differ based on the method, and the whitelist thing is optional (it only 
applies to multi-cluster).  

With 8.5 out soon and if nobody has time to develop this further at the moment, 
I think we have to _do something_ here to prevent a back-compat concern:
 * Option A: document in an obvious way (i.e. some call-out box) that the name 
& parameters will likely change without back-compat.  In the project we 
sometimes throw out the word "experimental" a lot but here I'm just claiming 
the syntax/way it's invoked will change; I'm making no quality judgement on 
what's underneath.
 * Option B: comment it out making it invisible
 * Option C: remove from 8x/8.5; leave in master

Please pick and do the one that suits you, Gus.  They are all fine with me.

BTW that whitelist thing reminds me heavily of the _existing_ "shardsWhitelist" 
feature (see distributed-requests.adoc).  It's not clear to me if we need a new 
mechanism here.

> Implement support for joining across collections with multiple shards ( XCJF )
> --
>
> Key: SOLR-13749
> URL: https://issues.apache.org/jira/browse/SOLR-13749
> Project: Solr
>  Issue Type: New Feature
>Reporter: Kevin Watters
>Assignee: Gus Heck
>Priority: Blocker
> Fix For: 8.5
>
>  Time Spent: 1.5h
>  Remaining Estimate: 0h
>
> This ticket includes 2 query parsers.
> The first is the "Cross-Collection Join Filter" (XCJF) query parser. It can 
> call out to a remote collection to retrieve a set of join keys, which are 
> then used as a filter against the local collection.
> The second is the Hash Range query parser, which takes a field name and a 
> hash range; only the documents that would have hashed into that range are 
> returned.
> The XCJF query parser performs an intersection based on join keys between 2 
> collections.
> The local collection is the collection that you are searching against.
> The remote collection is the collection that contains the join keys that you 
> want to use as a filter.
> Each shard participating in the distributed request executes a query against 
> the remote collection.  If the local collection is set up with the 
> compositeId router routed on the join key field, a hash range query is 
> applied to the remote collection query so that it only matches documents 
> that could potentially match documents in the local shard/core.
>  
>  
> Here's some vocabulary to help with the descriptions of the various 
> parameters.
> ||Term||Description||
> |Local Collection|This is the main collection that is being queried.|
> |Remote Collection|This is the collection that the XCJFQuery will query to 
> resolve the join keys.|
> |XCJFQuery|The Lucene query that executes a search to get back a set of join 
> keys from a remote collection.|
> |HashRangeQuery|The Lucene query that matches only the documents whose hash 
> code on a field falls within a specified range.|
>  
>  
> ||Param ||Required ||Description||
> |collection|Required|The name of the external Solr collection to be queried 
> to retrieve the set of join key values.|
> |zkHost|Optional|The connection string used to connect to ZooKeeper.  
> zkHost and solrUrl are both optional, and at most one of them should be 
> specified.  
> If neither zkHost nor solrUrl is specified, the local ZooKeeper cluster 
> will be used.|
> |solrUrl|Optional|The URL of the external Solr node to be queried.|
> |from|Required|The join key field name in the external collection.|
> |to|Required|The join key field name in the local collection|
> |v|See Note|The query to be executed against the external Solr collection to 
> retrieve the set of join key values.  
> Note:  The original query can be passed at the end of the string or as the 
> "v" parameter.  
> It's recommended to use query parameter substitution with the "v" parameter 
> to ensure no issues arise with the default query parsers.|
> |routed| |true / false.  If true, the XCJF query will use each shard's hash 
> range to determine the set of join keys to retrieve for that shard.
> This parameter improves the performance of the cross-collection join, but 
> it depends on the local collection being routed by the toField.  If this 
> parameter is not specified, 
> the XCJF query will try to determine the correct value automatically.|
> |ttl| |The length of time that an XCJF query in the cache will be considered 
> valid, in 

[jira] [Reopened] (SOLR-13749) Implement support for joining across collections with multiple shards ( XCJF )

2020-03-06 Thread David Smiley (Jira)


 [ 
https://issues.apache.org/jira/browse/SOLR-13749?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David Smiley reopened SOLR-13749:
-

> Implement support for joining across collections with multiple shards ( XCJF )
> --
>
> Key: SOLR-13749
> URL: https://issues.apache.org/jira/browse/SOLR-13749
> Project: Solr
>  Issue Type: New Feature
>Reporter: Kevin Watters
>Assignee: Gus Heck
>Priority: Major
> Fix For: 8.5
>
>  Time Spent: 1.5h
>  Remaining Estimate: 0h
>
> This ticket includes two query parsers.
> The first one is the "Cross-Collection Join Filter" (XCJF) query parser. It 
> can call out to a remote collection to get a set of join keys to be used as a 
> filter against the local collection.
> The second one is the Hash Range query parser, which takes a field name and a 
> hash range; only documents whose field value hashes into that range are 
> returned.
> The XCJF query parser does an intersection based on join keys between two 
> collections.
> The local collection is the collection that you are searching against.
> The remote collection is the collection that contains the join keys that you 
> want to use as a filter.
> Each shard participating in the distributed request will execute a query 
> against the remote collection.  If the local collection is set up with the 
> compositeId router to be routed on the join key field, a hash range query is 
> applied to the remote collection query so that it matches only documents that 
> could join against documents in the local shard/core. 
>  
>  
> Here's some vocabulary to help with the descriptions of the various parameters.
> ||Term||Description||
> |Local Collection|This is the main collection that is being queried.|
> |Remote Collection|This is the collection that the XCJFQuery will query to 
> resolve the join keys.|
> |XCJFQuery|The Lucene query that executes a search to get back a set of join 
> keys from a remote collection.|
> |HashRangeQuery|The Lucene query that matches only the documents whose hash 
> code on a field falls within a specified range.|
>  
>  
> ||Param ||Required ||Description||
> |collection|Required|The name of the external Solr collection to be queried 
> to retrieve the set of join key values.|
> |zkHost|Optional|The connection string to be used to connect to ZooKeeper.  
> zkHost and solrUrl are both optional parameters, and at most one of them 
> should be specified.  
> If neither zkHost nor solrUrl is specified, the local ZooKeeper cluster 
> will be used.|
> |solrUrl|Optional|The URL of the external Solr node to be queried.|
> |from|Required|The join key field name in the external collection.|
> |to|Required|The join key field name in the local collection.|
> |v|See Note|The query to be executed against the external Solr collection to 
> retrieve the set of join key values.  
> Note:  The original query can be passed at the end of the string or as the 
> "v" parameter.  
> It's recommended to use query parameter substitution with the "v" parameter 
> to ensure no issues arise with the default query parsers.|
> |routed| |true / false.  If true, the XCJF query will use each shard's hash 
> range to determine the set of join keys to retrieve for that shard.
> This parameter improves the performance of the cross-collection join, but 
> it depends on the local collection being routed by the toField.  If this 
> parameter is not specified, 
> the XCJF query will try to determine the correct value automatically.|
> |ttl| |The length of time that an XCJF query in the cache will be considered 
> valid, in seconds.  Defaults to 3600 (one hour).  
> The XCJF query will not be aware of changes to the remote collection, so 
> if the remote collection is updated, cached XCJF queries may give inaccurate 
> results.  
> After the ttl period has expired, the XCJF query will re-execute the join 
> against the remote collection.|
> |_All others_| |Any normal Solr parameter can also be specified as a local 
> param.|
>  
> Example solrconfig.xml changes:
> {code:xml}
> <cache name="hash_vin"
>        class="solr.LRUCache"
>        size="128"
>        initialSize="0"
>        regenerator="solr.NoOpRegenerator"/>
>
> <queryParser name="xcjf" class="org.apache.solr.search.join.XCJFQueryParserPlugin">
>   <str name="routerField">vin</str>
> </queryParser>
>
> <queryParser name="hash_range" class="org.apache.solr.search.join.HashRangeQueryParserPlugin"/>
> {code}
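To make the parameters concrete, here is a minimal SolrJ sketch of issuing such a join. The collection names ("products", "vehicles"), the "vin" field, and the remote query are hypothetical examples, not part of the patch; only the parser name and the collection/from/to/v parameters come from the description above.

{code:java}
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;

public class XcjfExample {
  public static void main(String[] args) throws Exception {
    // Search the local "products" collection...
    try (HttpSolrClient client =
             new HttpSolrClient.Builder("http://localhost:8983/solr/products").build()) {
      SolrQuery query = new SolrQuery("*:*");
      // ...keeping only documents whose "vin" join key also matches color:red
      // in the remote "vehicles" collection (collection/from/to/v as in the
      // parameter table above).
      query.addFilterQuery("{!xcjf collection=vehicles from=vin to=vin v=\"color:red\"}");
      QueryResponse rsp = client.query(query);
      System.out.println("matches: " + rsp.getResults().getNumFound());
    }
  }
}
{code}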

[jira] [Commented] (SOLR-14040) solr.xml shareSchema does not work in SolrCloud

2020-03-06 Thread Noble Paul (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-14040?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17053847#comment-17053847
 ] 

Noble Paul commented on SOLR-14040:
---

I've created SOLR-14311 to address the shortcomings

> solr.xml shareSchema does not work in SolrCloud
> ---
>
> Key: SOLR-14040
> URL: https://issues.apache.org/jira/browse/SOLR-14040
> Project: Solr
>  Issue Type: Improvement
>  Components: Schema and Analysis
>Reporter: David Smiley
>Assignee: David Smiley
>Priority: Major
> Fix For: 8.5
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> solr.xml has a shareSchema boolean option that can be toggled from the 
> default of false to true in order to share IndexSchema objects within the 
> Solr node.  This is silently ignored in SolrCloud mode.  The pertinent code 
> is {{org.apache.solr.core.ConfigSetService#createConfigSetService}}, which 
> creates a CloudConfigSetService that is not related to the SchemaCaching 
> class.  This may not be a big deal in SolrCloud, which tends not to deal well 
> with many cores per node, but I'm working on changing that.
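As a point of reference, here is a toy sketch (my own illustration, not Solr's actual SchemaCaching code) of what node-level schema sharing means: one parsed schema object per configset, reused by every core on that configset.

{code:java}
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Supplier;

// Toy illustration only. With shareSchema=true, cores on the same configset
// would get the cached object; with shareSchema=false each core parses its own.
class SharedSchemaCache<S> {
  private final ConcurrentHashMap<String, S> cache = new ConcurrentHashMap<>();

  S get(String configSet, Supplier<S> parseSchema) {
    return cache.computeIfAbsent(configSet, k -> parseSchema.get());
  }
}
{code}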






[jira] [Created] (SOLR-14311) Shared schema should not have access to core level classes

2020-03-06 Thread Noble Paul (Jira)
Noble Paul created SOLR-14311:
-

 Summary: Shared schema should not have access to core level classes
 Key: SOLR-14311
 URL: https://issues.apache.org/jira/browse/SOLR-14311
 Project: Solr
  Issue Type: New Feature
  Security Level: Public (Default Security Level. Issues are Public)
Reporter: Noble Paul
Assignee: David Smiley


When a schema is shared, it should not have access to the classes of a specific 
core. The core may come and go but the shared schema may continue to live. So 
how do we implement that?

If a schema is shared, create a new {{SolrResourceLoader}} specifically for 
that schema object. The classpath should be the same as the classpath for the 
SRL in {{CoreContainer}}. As and when we implement loading schema plugins from 
packages, they too should be accessible. The SRL created for this schema should 
be able to load resources from ZK in the path {{/configs/}}. 
This SRL should be discarded when the schema is destroyed.
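A minimal sketch of the lifecycle being proposed (illustrative only; this is a proposal, not committed code, and a real implementation would resolve resources from ZK under the {{/configs/}} path mentioned above rather than a local directory):

{code:java}
import java.nio.file.Paths;
import org.apache.solr.core.SolrResourceLoader;

// Illustrative only: a resource loader owned by the shared schema rather than
// by any core; a local path stands in for the ZK configset location here.
public class SharedSchemaLoaderSketch {
  public static void main(String[] args) throws Exception {
    try (SolrResourceLoader schemaLoader =
             new SolrResourceLoader(Paths.get("/var/solr/shared-configset"))) {
      // build the shared IndexSchema against schemaLoader here; cores come and
      // go, but the schema and its loader outlive any single core
    } // closed (discarded) when the shared schema itself is destroyed
  }
}
{code}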






[jira] [Resolved] (SOLR-14040) solr.xml shareSchema does not work in SolrCloud

2020-03-06 Thread Noble Paul (Jira)


 [ 
https://issues.apache.org/jira/browse/SOLR-14040?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Noble Paul resolved SOLR-14040.
---
Resolution: Resolved

> solr.xml shareSchema does not work in SolrCloud
> ---
>
> Key: SOLR-14040
> URL: https://issues.apache.org/jira/browse/SOLR-14040
> Project: Solr
>  Issue Type: Improvement
>  Components: Schema and Analysis
>Reporter: David Smiley
>Assignee: David Smiley
>Priority: Major
> Fix For: 8.5
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>






[jira] [Commented] (SOLR-13893) BlobRepository looks at the wrong system variable (runtme.lib.size)

2020-03-06 Thread Lucene/Solr QA (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-13893?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17053820#comment-17053820
 ] 

Lucene/Solr QA commented on SOLR-13893:
---

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
|| || || || {color:brown} Prechecks {color} ||
| {color:red}-1{color} | {color:red} test4tests {color} | {color:red}  0m  
0s{color} | {color:red} The patch doesn't appear to include any new or modified 
tests. Please justify why no new tests are needed for this patch. Also please 
list what manual steps were performed to verify this patch. {color} |
|| || || || {color:brown} master Compile Tests {color} ||
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  1m  
5s{color} | {color:green} master passed {color} |
|| || || || {color:brown} Patch Compile Tests {color} ||
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  1m 
54s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green}  1m 
54s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} Release audit (RAT) {color} | 
{color:green}  1m 54s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} Check forbidden APIs {color} | 
{color:green}  1m 54s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} Validate source patterns {color} | 
{color:green}  1m 54s{color} | {color:green} the patch passed {color} |
|| || || || {color:brown} Other Tests {color} ||
| {color:green}+1{color} | {color:green} unit {color} | {color:green} 74m 
43s{color} | {color:green} core in the patch passed. {color} |
| {color:black}{color} | {color:black} {color} | {color:black} 80m  6s{color} | 
{color:black} {color} |
\\
\\
|| Subsystem || Report/Notes ||
| JIRA Issue | SOLR-13893 |
| JIRA Patch URL | 
https://issues.apache.org/jira/secure/attachment/12995879/SOLR-13893.patch |
| Optional Tests |  compile  javac  unit  ratsources  checkforbiddenapis  
validatesourcepatterns  |
| uname | Linux lucene2-us-west.apache.org 4.4.0-170-generic #199-Ubuntu SMP 
Thu Nov 14 01:45:04 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | ant |
| Personality | 
/home/jenkins/jenkins-slave/workspace/PreCommit-SOLR-Build/sourcedir/dev-tools/test-patch/lucene-solr-yetus-personality.sh
 |
| git revision | master / c73d2c1 |
| ant | version: Apache Ant(TM) version 1.9.6 compiled on July 20 2018 |
| Default Java | LTS |
|  Test Results | 
https://builds.apache.org/job/PreCommit-SOLR-Build/701/testReport/ |
| modules | C: solr/core U: solr/core |
| Console output | 
https://builds.apache.org/job/PreCommit-SOLR-Build/701/console |
| Powered by | Apache Yetus 0.7.0   http://yetus.apache.org |


This message was automatically generated.



> BlobRepository looks at the wrong system variable (runtme.lib.size)
> ---
>
> Key: SOLR-13893
> URL: https://issues.apache.org/jira/browse/SOLR-13893
> Project: Solr
>  Issue Type: Bug
>Reporter: Erick Erickson
>Assignee: Munendra S N
>Priority: Major
> Attachments: SOLR-13893.patch, SOLR-13893.patch
>
>
> Tim Swetland on the user's list pointed out this line in BlobRepository:
> private static final long MAX_JAR_SIZE = 
> Long.parseLong(System.getProperty("runtme.lib.size", String.valueOf(5 * 1024 
> * 1024)));
> "runtme" can't be right.
> [~ichattopadhyaya][~noblepaul] what's your opinion?
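For what it's worth, the presumably intended form, assuming the property was meant to be spelled "runtime.lib.size" (my assumption based on the report; the committed fix may choose differently):

{code:java}
private static final long MAX_JAR_SIZE =
    Long.parseLong(System.getProperty("runtime.lib.size", String.valueOf(5 * 1024 * 1024)));
{code}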






[jira] [Commented] (SOLR-13944) CollapsingQParserPlugin throws NPE instead of bad request

2020-03-06 Thread Tomas Eduardo Fernandez Lobbe (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-13944?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17053745#comment-17053745
 ] 

Tomas Eduardo Fernandez Lobbe commented on SOLR-13944:
--

I don't know much about this part of the code, so I'm just replying based on 
the patch.

As an optimization, can we keep the {{if (!fq.startsWith("{!collapse")) {}} and 
*only* do the full parse in case of {{true}}?

{code:java}
+} catch (SyntaxError e) {
+  // shouldn't happen as filters are already validated
 }
{code}
Maybe throw an AssertionError() here, wdyt?
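Something like the following, a sketch of that suggestion with names of my own choosing:

{code:java}
import org.apache.lucene.search.Query;
import org.apache.solr.request.SolrQueryRequest;
import org.apache.solr.search.QParser;
import org.apache.solr.search.SyntaxError;

// Sketch: keep the cheap startsWith check as a fast path, do the full parse
// only for candidate collapse filters, and treat a parse failure as a bug.
class CollapseFilterCheck {
  static Query parseIfCollapse(String fq, SolrQueryRequest req) {
    if (fq == null || !fq.startsWith("{!collapse")) {
      return null; // fast path: clearly not a collapse filter
    }
    try {
      return QParser.getParser(fq, req).getQuery();
    } catch (SyntaxError e) {
      // filters were already validated upstream, so reaching this is a bug
      throw new AssertionError(e);
    }
  }
}
{code}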

> CollapsingQParserPlugin throws NPE instead of bad request
> -
>
> Key: SOLR-13944
> URL: https://issues.apache.org/jira/browse/SOLR-13944
> Project: Solr
>  Issue Type: Bug
>Affects Versions: 7.3.1
>Reporter: Stefan
>Assignee: Munendra S N
>Priority: Minor
> Attachments: SOLR-13944.patch
>
>
>  I noticed the following NPE:
> {code:java}
> java.lang.NullPointerException at 
> org.apache.solr.search.CollapsingQParserPlugin$OrdFieldValueCollector.finish(CollapsingQParserPlugin.java:1021)
>  at 
> org.apache.solr.search.CollapsingQParserPlugin$OrdFieldValueCollector.finish(CollapsingQParserPlugin.java:1081)
>  at 
> org.apache.solr.search.SolrIndexSearcher.buildAndRunCollectorChain(SolrIndexSearcher.java:230)
>  at 
> org.apache.solr.search.SolrIndexSearcher.getDocListNC(SolrIndexSearcher.java:1602)
>  at 
> org.apache.solr.search.SolrIndexSearcher.getDocListC(SolrIndexSearcher.java:1419)
>  at 
> org.apache.solr.search.SolrIndexSearcher.search(SolrIndexSearcher.java:584)
> {code}
> If I am correct, the problem was already addressed in SOLR-8807. The fix was 
> not working in this case though, because of a syntax error in the query 
> (I used the local parameter syntax twice instead of combining it). The 
> relevant part of the query is:
> {code:java}
> fq={!tag=collapser}{!collapse field=productId sort='merchantOrder asc, price 
> asc, id asc'}
> {code}
> After discussing that on the mailing list, I was asked to open a ticket, 
> because this situation should result in a bad request instead of a 
> NullPointerException (see 
> [https://mail-archives.apache.org/mod_mbox/lucene-solr-user/201911.mbox/%3CCAMJgJxTuSb%3D8szO8bvHiAafJOs08O_NMB4pcaHOXME4Jj-GO2A%40mail.gmail.com%3E])
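For reference, a SolrJ fragment showing the combined local-params form that was presumably intended (illustrative only):

{code:java}
SolrQuery q = new SolrQuery("*:*");
// "tag" folded into the collapse parser's own local params:
q.addFilterQuery("{!collapse tag=collapser field=productId sort='merchantOrder asc, price asc, id asc'}");
{code}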






[jira] [Updated] (SOLR-14040) solr.xml shareSchema does not work in SolrCloud

2020-03-06 Thread Noble Paul (Jira)


 [ 
https://issues.apache.org/jira/browse/SOLR-14040?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Noble Paul updated SOLR-14040:
--
Priority: Major  (was: Blocker)

> solr.xml shareSchema does not work in SolrCloud
> ---
>
> Key: SOLR-14040
> URL: https://issues.apache.org/jira/browse/SOLR-14040
> Project: Solr
>  Issue Type: Improvement
>  Components: Schema and Analysis
>Reporter: David Smiley
>Assignee: David Smiley
>Priority: Major
> Fix For: 8.5
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>






[jira] [Updated] (SOLR-14311) Shared schema should not have access to core level classes

2020-03-06 Thread Noble Paul (Jira)


 [ 
https://issues.apache.org/jira/browse/SOLR-14311?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Noble Paul updated SOLR-14311:
--
Fix Version/s: 8.6

> Shared schema should not have access to core level classes
> --
>
> Key: SOLR-14311
> URL: https://issues.apache.org/jira/browse/SOLR-14311
> Project: Solr
>  Issue Type: New Feature
>  Security Level: Public (Default Security Level. Issues are Public) 
>Reporter: Noble Paul
>Assignee: David Smiley
>Priority: Major
> Fix For: 8.6
>
>






[jira] [Commented] (SOLR-11725) json.facet's stddev() function should be changed to use the "Corrected sample stddev" formula

2020-03-06 Thread Yonik Seeley (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-11725?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17053842#comment-17053842
 ] 

Yonik Seeley commented on SOLR-11725:
-

+1 to commit

> json.facet's stddev() function should be changed to use the "Corrected sample 
> stddev" formula
> -
>
> Key: SOLR-11725
> URL: https://issues.apache.org/jira/browse/SOLR-11725
> Project: Solr
>  Issue Type: Sub-task
>  Components: Facet Module
>Reporter: Chris M. Hostetter
>Assignee: Munendra S N
>Priority: Major
> Attachments: SOLR-11725.patch, SOLR-11725.patch, SOLR-11725.patch
>
>
> While working on some equivalence tests/demonstrations for 
> {{facet.pivot+stats.field}} vs {{json.facet}} I noticed that the {{stddev}} 
> calculations done between the two code paths can be measurably different, and 
> realized this is due to them using very different code...
> * {{json.facet=foo:stddev(foo)}}
> ** {{StddevAgg.java}}
> ** {{Math.sqrt((sumSq/count)-Math.pow(sum/count, 2))}}
> * {{stats.field=\{!stddev=true\}foo}}
> ** {{StatsValuesFactory.java}}
> ** {{Math.sqrt(((count * sumOfSquares) - (sum * sum)) / (count * (count - 
> 1.0D)))}}
> Since I"m not really a math guy, I consulting with a bunch of smart math/stat 
> nerds I know online to help me sanity check if these equations (some how) 
> reduced to eachother (In which case the discrepancies I was seeing in my 
> results might have just been due to the order of intermediate operation 
> execution & floating point rounding differences).
> They confirmed that the two bits of code are _not_ equivalent to each other, 
> and explained that the code JSON Faceting is using is equivalent to the 
> "Uncorrected sample stddev" formula, while StatsComponent's code is 
> equivalent to the the "Corrected sample stddev" formula...
> https://en.wikipedia.org/wiki/Standard_deviation#Uncorrected_sample_standard_deviation
> When I told them that stuff like this is why no one likes mathematicians and 
> pressed them to explain which one was the "most canonical" (or "most 
> generally applicable" or "best") definition of stddev, I was told that:
> # This is something statisticians frequently disagree on
> # Practically speaking, the two calculations don't tend to differ 
> significantly when count is "very large"
> # _"Corrected sample stddev" is more appropriate when comparing two 
> distributions_
> Given that:
> * the primary usage of computing the stddev of a field/function against a 
> Solr result set (or against a sub-set of results defined by a facet 
> constraint) is probably to compare that distribution to a different Solr 
> result set (or to compare N sub-sets of results defined by N facet 
> constraints)
> * the size of the sets of documents (values) can be relatively small when 
> computing stats over facet constraint sub-sets
> ...it seems like {{StddevAgg.java}} should be updated to use the "Corrected 
> sample stddev" equation.
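To make the difference concrete, here is a small standalone example (my own illustration, not part of any patch) evaluating both quoted expressions on a three-value sample:

{code:java}
public class StddevDemo {
  public static void main(String[] args) {
    double[] xs = {2.0, 4.0, 6.0};
    double count = xs.length, sum = 0, sumSq = 0;
    for (double x : xs) { sum += x; sumSq += x * x; }
    // Uncorrected (population) form, as in StddevAgg.java:
    double uncorrected = Math.sqrt((sumSq / count) - Math.pow(sum / count, 2));
    // Corrected (sample) form, as in StatsValuesFactory.java:
    double corrected = Math.sqrt(((count * sumSq) - (sum * sum)) / (count * (count - 1.0D)));
    System.out.println(uncorrected + " vs " + corrected); // ~1.63 vs 2.0
  }
}
{code}

With only three values the two answers differ by more than twenty percent, which is why the choice matters for small facet buckets.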






[GitHub] [lucene-solr] beettlle commented on a change in pull request #1297: SOLR-14253 Replace various sleep calls with ZK waits

2020-03-06 Thread GitBox
beettlle commented on a change in pull request #1297: SOLR-14253 Replace 
various sleep calls with ZK waits
URL: https://github.com/apache/lucene-solr/pull/1297#discussion_r389208349
 
 

 ##
 File path: solr/core/src/java/org/apache/solr/cloud/ZkController.java
 ##
 @@ -1684,58 +1685,39 @@ private void 
doGetShardIdAndNodeNameProcess(CoreDescriptor cd) {
   }
 
   private void waitForCoreNodeName(CoreDescriptor descriptor) {
-int retryCount = 320;
-log.debug("look for our core node name");
-while (retryCount-- > 0) {
-  final DocCollection docCollection = zkStateReader.getClusterState()
-  
.getCollectionOrNull(descriptor.getCloudDescriptor().getCollectionName());
-  if (docCollection != null && docCollection.getSlicesMap() != null) {
-final Map<String, Slice> slicesMap = docCollection.getSlicesMap();
-for (Slice slice : slicesMap.values()) {
-  for (Replica replica : slice.getReplicas()) {
-// TODO: for really large clusters, we could 'index' on this
-
-String nodeName = replica.getStr(ZkStateReader.NODE_NAME_PROP);
-String core = replica.getStr(ZkStateReader.CORE_NAME_PROP);
-
-String msgNodeName = getNodeName();
-String msgCore = descriptor.getName();
-
-if (msgNodeName.equals(nodeName) && core.equals(msgCore)) {
-  descriptor.getCloudDescriptor()
-  .setCoreNodeName(replica.getName());
-  getCoreContainer().getCoresLocator().persist(getCoreContainer(), 
descriptor);
-  return;
-}
-  }
+log.debug("waitForCoreNodeName >>> look for our core node name");
+try {
+  zkStateReader.waitForState(descriptor.getCollectionName(), 320, 
TimeUnit.SECONDS, c -> {
 
 Review comment:
  Agreed about having too many settings; we're already drowning in them.  
  
  Looking back, it looks like the number was added as part of SOLR-9140, and 
there's no comment about where the "320" came from.  As well, there's another 
retry number 
[here](https://github.com/apache/lucene-solr/pull/1297/files#diff-d5e1be02f6f0c397e18380598aa62b3dR476)
 of "30" but no idea why.  So we already have two different retry counts.
  
  If the numbers came from empirical experiments then I'd agree with them being 
constants, but because they seem arbitrary they look like good candidates for 
per-application tuning.  
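  A sketch of what that tunable could look like; the property name "solr.waitForStateSeconds" is hypothetical, not an existing Solr setting:

{code:java}
// Hypothetical knob, falling back to today's hardcoded 320-second bound.
class WaitForStateConfig {
  static final int DEFAULT_WAIT_SECONDS = 320;

  static int waitSeconds() {
    return Integer.getInteger("solr.waitForStateSeconds", DEFAULT_WAIT_SECONDS);
  }
}
{code}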





[GitHub] [lucene-solr] madrob commented on a change in pull request #1297: SOLR-14253 Replace various sleep calls with ZK waits

2020-03-06 Thread GitBox
madrob commented on a change in pull request #1297: SOLR-14253 Replace various 
sleep calls with ZK waits
URL: https://github.com/apache/lucene-solr/pull/1297#discussion_r389193738
 
 

 ##
 File path: solr/core/src/java/org/apache/solr/cloud/ZkController.java
 ##
 @@ -1684,58 +1685,39 @@ private void 
doGetShardIdAndNodeNameProcess(CoreDescriptor cd) {
   }
 
   private void waitForCoreNodeName(CoreDescriptor descriptor) {
-int retryCount = 320;
-log.debug("look for our core node name");
-while (retryCount-- > 0) {
-  final DocCollection docCollection = zkStateReader.getClusterState()
-  
.getCollectionOrNull(descriptor.getCloudDescriptor().getCollectionName());
-  if (docCollection != null && docCollection.getSlicesMap() != null) {
-final Map<String, Slice> slicesMap = docCollection.getSlicesMap();
-for (Slice slice : slicesMap.values()) {
-  for (Replica replica : slice.getReplicas()) {
-// TODO: for really large clusters, we could 'index' on this
-
-String nodeName = replica.getStr(ZkStateReader.NODE_NAME_PROP);
-String core = replica.getStr(ZkStateReader.CORE_NAME_PROP);
-
-String msgNodeName = getNodeName();
-String msgCore = descriptor.getName();
-
-if (msgNodeName.equals(nodeName) && core.equals(msgCore)) {
-  descriptor.getCloudDescriptor()
-  .setCoreNodeName(replica.getName());
-  getCoreContainer().getCoresLocator().persist(getCoreContainer(), 
descriptor);
-  return;
-}
-  }
+log.debug("waitForCoreNodeName >>> look for our core node name");
+try {
+  zkStateReader.waitForState(descriptor.getCollectionName(), 320, 
TimeUnit.SECONDS, c -> {
 
 Review comment:
   In general, I think it's good to have knobs, but there's definitely the 
possibility of having too many things available to configure and overwhelming 
operators. Can you describe what conditions would lead to wanting to tweak this?





[GitHub] [lucene-solr] beettlle commented on a change in pull request #1297: SOLR-14253 Replace various sleep calls with ZK waits

2020-03-06 Thread GitBox
beettlle commented on a change in pull request #1297: SOLR-14253 Replace 
various sleep calls with ZK waits
URL: https://github.com/apache/lucene-solr/pull/1297#discussion_r389132693
 
 

 ##
 File path: solr/core/src/java/org/apache/solr/cloud/ZkController.java
 ##
 @@ -1684,58 +1685,39 @@ private void 
doGetShardIdAndNodeNameProcess(CoreDescriptor cd) {
   }
 
   private void waitForCoreNodeName(CoreDescriptor descriptor) {
-int retryCount = 320;
-log.debug("look for our core node name");
-while (retryCount-- > 0) {
-  final DocCollection docCollection = zkStateReader.getClusterState()
-  
.getCollectionOrNull(descriptor.getCloudDescriptor().getCollectionName());
-  if (docCollection != null && docCollection.getSlicesMap() != null) {
-final Map<String, Slice> slicesMap = docCollection.getSlicesMap();
-for (Slice slice : slicesMap.values()) {
-  for (Replica replica : slice.getReplicas()) {
-// TODO: for really large clusters, we could 'index' on this
-
-String nodeName = replica.getStr(ZkStateReader.NODE_NAME_PROP);
-String core = replica.getStr(ZkStateReader.CORE_NAME_PROP);
-
-String msgNodeName = getNodeName();
-String msgCore = descriptor.getName();
-
-if (msgNodeName.equals(nodeName) && core.equals(msgCore)) {
-  descriptor.getCloudDescriptor()
-  .setCoreNodeName(replica.getName());
-  getCoreContainer().getCoresLocator().persist(getCoreContainer(), 
descriptor);
-  return;
-}
-  }
+log.debug("waitForCoreNodeName >>> look for our core node name");
+try {
+  zkStateReader.waitForState(descriptor.getCollectionName(), 320, 
TimeUnit.SECONDS, c -> {
 
 Review comment:
   If this change is being made, should the number of retries be configurable?  
This hardcoded value seems to be used a lot in the code.  





[GitHub] [lucene-solr] dnhatn commented on issue #1215: LUCENE-9164: Ignore ACE on tragic event if IW is closed

2020-03-06 Thread GitBox
dnhatn commented on issue #1215: LUCENE-9164: Ignore ACE on tragic event if IW 
is closed
URL: https://github.com/apache/lucene-solr/pull/1215#issuecomment-595913520
 
 
   Closing in favor of https://github.com/apache/lucene-solr/pull/1319.





[GitHub] [lucene-solr] dnhatn closed pull request #1215: LUCENE-9164: Ignore ACE on tragic event if IW is closed

2020-03-06 Thread GitBox
dnhatn closed pull request #1215: LUCENE-9164: Ignore ACE on tragic event if IW 
is closed
URL: https://github.com/apache/lucene-solr/pull/1215
 
 
   





[jira] [Commented] (SOLR-14040) solr.xml shareSchema does not work in SolrCloud

2020-03-06 Thread Cassandra Targett (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-14040?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17053695#comment-17053695
 ] 

Cassandra Targett commented on SOLR-14040:
--

bq. As I understand shared-schema (SolrCloud or not) is still in development 
and not documented

The page {{format-of-solr-xml.adoc}} does show a {{shareSchema}} parameter, 
which I think is what's being talked about here, so it is actually already 
documented.

Since it is documented, whenever a feature doesn't work in a pretty common use 
case (like running Solr in SolrCloud mode), we should document the limitation 
when it exists in a release that a user would actually use.

bq. If we document the limitation, shouldn't we also document the feature 
itself, but actually it is not ready... difficult. Is there a section specific 
to "coming" features?

You're right, that situation would be awkward and we wouldn't do it - if it's 
not documented it's like it otherwise doesn't exist so mentioning a limitation 
in something that doesn't exist would be misleading, really. We don't put info 
about future features in the Ref Guide as a general rule (unless it's to say 
something like some changes may be coming in the future, but even then, it's 
not usually that relevant - it could be years before it arrives).

If something is implemented but we consider it experimental, that's fine to 
document as long as we're clear about its status as Experimental in that 
documentation. Such docs should also include known limitations (which may or 
may not be resolved in the future) whenever possible.

> solr.xml shareSchema does not work in SolrCloud
> ---
>
> Key: SOLR-14040
> URL: https://issues.apache.org/jira/browse/SOLR-14040
> Project: Solr
>  Issue Type: Improvement
>  Components: Schema and Analysis
>Reporter: David Smiley
>Assignee: David Smiley
>Priority: Blocker
> Fix For: 8.5
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>






[jira] [Commented] (LUCENE-9266) ant nightly-smoke fails due to presence of build.gradle

2020-03-06 Thread Mike Drob (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9266?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17053682#comment-17053682
 ] 

Mike Drob commented on LUCENE-9266:
---

As I'm working on this I'm discovering other issues as well; I'll fix them all 
in a single patch if they remain small enough.

> ant nightly-smoke fails due to presence of build.gradle
> ---
>
> Key: LUCENE-9266
> URL: https://issues.apache.org/jira/browse/LUCENE-9266
> Project: Lucene - Core
>  Issue Type: Sub-task
>Reporter: Mike Drob
>Priority: Major
>
> Seen on Jenkins - 
> [https://builds.apache.org/job/Lucene-Solr-SmokeRelease-master/1617/console]
>  
> Reproduced locally.






[jira] [Updated] (LUCENE-9266) ant nightly-smoke fails due to presence of build.gradle

2020-03-06 Thread Mike Drob (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-9266?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mike Drob updated LUCENE-9266:
--
Parent: LUCENE-9077
Issue Type: Sub-task  (was: Task)

> ant nightly-smoke fails due to presence of build.gradle
> ---
>
> Key: LUCENE-9266
> URL: https://issues.apache.org/jira/browse/LUCENE-9266
> Project: Lucene - Core
>  Issue Type: Sub-task
>Reporter: Mike Drob
>Priority: Major
>
> Seen on Jenkins - 
> [https://builds.apache.org/job/Lucene-Solr-SmokeRelease-master/1617/console]
>  
> Reproduced locally.






[jira] [Commented] (LUCENE-9266) ant nightly-smoke fails due to presence of build.gradle

2020-03-06 Thread Mike Drob (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9266?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17053679#comment-17053679
 ] 

Mike Drob commented on LUCENE-9266:
---

Porting the nightly smoke to gradle is outside of the scope of what I want to 
do here.

> ant nightly-smoke fails due to presence of build.gradle
> ---
>
> Key: LUCENE-9266
> URL: https://issues.apache.org/jira/browse/LUCENE-9266
> Project: Lucene - Core
>  Issue Type: Sub-task
>Reporter: Mike Drob
>Priority: Major
>
> Seen on Jenkins - 
> [https://builds.apache.org/job/Lucene-Solr-SmokeRelease-master/1617/console]
>  
> Reproduced locally.






[jira] [Commented] (LUCENE-9136) Introduce IVFFlat to Lucene for ANN similarity search

2020-03-06 Thread Xin-Chun Zhang (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9136?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17053668#comment-17053668
 ] 

Xin-Chun Zhang commented on LUCENE-9136:


1. My personal git branch: 
[https://github.com/irvingzhang/lucene-solr/tree/jira/lucene-9136-ann-ivfflat].

2. The vector format is as follows, 

!image-2020-03-07-01-25-58-047.png|width=535,height=297!

 

Structure of IVF index meta is as follows,

!image-2020-03-07-01-27-12-859.png|width=606,height=276!

 

Structure of IVF data:

!image-2020-03-07-01-22-06-132.png|width=529,height=309!

3. The ann-benchmarks tool can be found at: 
[https://github.com/irvingzhang/ann-benchmarks].

Benchmark results (single thread, 2.5GHz * 2 CPU, 16GB RAM, 
nprobe=8,16,32,64,128,256, centroids=4*sqrt(N), where N is the size of the dataset):

1) Glove-1.2M-25D-Angular: index build + training cost 706s, qps: 18.8~49.6, 
recall: 76.8%~99.7%

!glove-25-angular.png|width=653,height=450!

 

2) Glove-1.2M-100D-Angular: index build + training cost 2487s, qps: 12.2~38.3, 
recall 65.8%~96.3%

!glove-100-angular.png|width=671,height=462!

3) Sift-1M-128D-Euclidean: index build + training cost 2397s, qps 14.8~38.2, 
recall 71.1%~99.2%

!sift-128-euclidean.png|width=684,height=471!

 

> Introduce IVFFlat to Lucene for ANN similarity search
> -
>
> Key: LUCENE-9136
> URL: https://issues.apache.org/jira/browse/LUCENE-9136
> Project: Lucene - Core
>  Issue Type: New Feature
>Reporter: Xin-Chun Zhang
>Priority: Major
> Attachments: glove-100-angular.png, glove-25-angular.png, 
> image-2020-03-07-01-22-06-132.png, image-2020-03-07-01-25-58-047.png, 
> image-2020-03-07-01-27-12-859.png, sift-128-euclidean.png
>
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
> Representation learning (RL) has been an established discipline in the 
> machine learning space for decades but it draws tremendous attention lately 
> with the emergence of deep learning. The central problem of RL is to 
> determine an optimal representation of the input data. By embedding the data 
> into a high dimensional vector, the vector retrieval (VR) method is then 
> applied to search the relevant items.
> With the rapid development of RL over the past few years, the technique has 
> been used extensively in industry from online advertising to computer vision 
> and speech recognition. There exist many open source implementations of VR 
> algorithms, such as Facebook's FAISS and Microsoft's SPTAG, providing various 
> choices for potential users. However, the aforementioned implementations are 
> all written in C++ with no plan to support a Java interface, making them hard 
> to integrate into Java projects and inaccessible to those who are not 
> familiar with C/C++ [https://github.com/facebookresearch/faiss/issues/105]. 
> The algorithms for vector retrieval can be roughly classified into four 
> categories,
>  # Tree-based algorithms, such as KD-tree;
>  # Hashing methods, such as LSH (Locality-Sensitive Hashing);
>  # Product-quantization-based algorithms, such as IVFFlat;
>  # Graph-based algorithms, such as HNSW, SSG, NSG;
> where IVFFlat and HNSW are the most popular ones among all the VR algorithms.
> IVFFlat is better for high-precision applications such as face recognition, 
> while HNSW performs better in general scenarios including recommendation and 
> personalized advertisement. *The recall ratio of IVFFlat could be gradually 
> increased by adjusting the query parameter (nprobe), while it's hard for HNSW 
> to improve its accuracy*. In theory, IVFFlat could achieve 100% recall ratio. 
> Recently, the implementation of HNSW (Hierarchical Navigable Small World, 
> LUCENE-9004) for Lucene has made great progress. The issue draws the 
> attention of those who are interested in Lucene or hope to use HNSW with 
> Solr/Lucene. 
> As an alternative for solving ANN similarity search problems, IVFFlat is also 
> very popular with many users and supporters. Compared with HNSW, IVFFlat has 
> a smaller index size but requires k-means clustering, while HNSW is faster in 
> query (no training required) but requires extra storage for saving graphs 
> [indexing 1M 
> vectors|https://github.com/facebookresearch/faiss/wiki/Indexing-1M-vectors].
> Another advantage is that IVFFlat can be faster and more accurate when GPU 
> parallel computing is enabled (currently not supported in Java). Both 
> algorithms have their merits and demerits. Since HNSW is now under 
> development, it may be better to provide both implementations (HNSW && 
> IVFFlat) for potential users who are faced with very different scenarios and 
> want more choices.
> The latest branch is 
> [*lucene-9136-ann-ivfflat*|https://github.com/irvingzhang/lucene-solr/commits/jira/lucene-9136-ann-ivfflat].
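To illustrate the IVFFlat idea described above (inverted lists per centroid, exact distances within the nprobe closest lists), here is a toy sketch; it is unrelated to the patch's actual classes, and the centroids would normally come from k-means training:

{code:java}
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Toy IVFFlat: bucket vectors by nearest centroid; search scans only the
// nprobe nearest buckets, so raising nprobe trades speed for recall.
class ToyIvfFlat {
  final float[][] centroids;
  final List<List<float[]>> lists = new ArrayList<>();

  ToyIvfFlat(float[][] centroids) {
    this.centroids = centroids;
    for (int i = 0; i < centroids.length; i++) lists.add(new ArrayList<>());
  }

  void add(float[] v) { lists.get(nearest(v, 1)[0]).add(v); }

  float[] search(float[] q, int nprobe) {
    float best = Float.MAX_VALUE;
    float[] bestVec = null;
    for (int c : nearest(q, nprobe)) {
      for (float[] v : lists.get(c)) {
        float d = dist(q, v);
        if (d < best) { best = d; bestVec = v; }
      }
    }
    return bestVec;
  }

  // indexes of the n centroids closest to q
  int[] nearest(float[] q, int n) {
    Integer[] idx = new Integer[centroids.length];
    for (int i = 0; i < idx.length; i++) idx[i] = i;
    Arrays.sort(idx, Comparator.comparingDouble(i -> dist(q, centroids[i])));
    int[] out = new int[Math.min(n, idx.length)];
    for (int i = 0; i < out.length; i++) out[i] = idx[i];
    return out;
  }

  static float dist(float[] a, float[] b) {
    float s = 0;
    for (int i = 0; i < a.length; i++) { float d = a[i] - b[i]; s += d * d; }
    return s;
  }
}
{code}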

[jira] [Issue Comment Deleted] (LUCENE-9136) Introduce IVFFlat to Lucene for ANN similarity search

2020-03-06 Thread Xin-Chun Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-9136?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xin-Chun Zhang updated LUCENE-9136:
---
Comment: was deleted

(was: 1. My personal git branch: 
[https://github.com/irvingzhang/lucene-solr/tree/jira/lucene-9136-ann-ivfflat].

2. The vector format is as follows, 

!image-2020-03-07-01-25-58-047.png|width=535,height=297!

 

Structure of IVF index meta is as follows,

!image-2020-03-07-01-27-12-859.png|width=606,height=276!

 

Structure of IVF data:

!image-2020-03-07-01-22-06-132.png|width=529,height=309!

3. Ann-benchmark tool could be found in: 
[https://github.com/irvingzhang/ann-benchmarks].

Benchmark results (Single Thread, 2.5GHz * 2CPU, 16GB RAM, 
nprobe=8,16,32,64,128,256, centroids=4*sqrt(N), where N the size of dataset):

1) Glove-1.2M-25D-Angular: index build + training cost 706s, qps: 18.8~49.6, 
recall: 76.8%~99.7%

!https://intranetproxy.alipay.com/skylark/lark/0/2020/png/35769/1583504416262-89784074-c9dc-4489-99a1-5e4b3c76e5fc.png|width=624,height=430!

 

2) Glove-1.2M-100D-Angular: index build + training cost 2487s, qps: 12.2~38.3, 
recall 65.8%~96.3%

!https://intranetproxy.alipay.com/skylark/lark/0/2020/png/35769/1583510066130-b4fbcb29-8ad7-4ff2-99ce-c52f7c27826e.png|width=679,height=468!

3) Sift-1M-128D-Euclidean: index build + training cost 2397s, qps 14.8~38.2, 
recall 71.1%~99.2%

!https://intranetproxy.alipay.com/skylark/lark/0/2020/png/35769/1583515010497-20b74f41-72c3-48ce-a929-1cbfbd6a6423.png|width=691,height=476!

 )

> Introduce IVFFlat to Lucene for ANN similarity search
> -
>
> Key: LUCENE-9136
> URL: https://issues.apache.org/jira/browse/LUCENE-9136
> Project: Lucene - Core
>  Issue Type: New Feature
>Reporter: Xin-Chun Zhang
>Priority: Major
> Attachments: glove-100-angular.png, glove-25-angular.png, 
> image-2020-03-07-01-22-06-132.png, image-2020-03-07-01-25-58-047.png, 
> image-2020-03-07-01-27-12-859.png, sift-128-euclidean.png
>
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
> be better to provide both implementations (HNSW && IVFFlat) for potential 

[jira] [Updated] (LUCENE-9136) Introduce IVFFlat to Lucene for ANN similarity search

2020-03-06 Thread Xin-Chun Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-9136?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xin-Chun Zhang updated LUCENE-9136:
---
Attachment: sift-128-euclidean.png

> Introduce IVFFlat to Lucene for ANN similarity search
> -
>
> Key: LUCENE-9136
> URL: https://issues.apache.org/jira/browse/LUCENE-9136
> Project: Lucene - Core
>  Issue Type: New Feature
>Reporter: Xin-Chun Zhang
>Priority: Major
> Attachments: glove-100-angular.png, glove-25-angular.png, 
> image-2020-03-07-01-22-06-132.png, image-2020-03-07-01-25-58-047.png, 
> image-2020-03-07-01-27-12-859.png, sift-128-euclidean.png
>
>  Time Spent: 50m
>  Remaining Estimate: 0h
>






[jira] [Updated] (LUCENE-9136) Introduce IVFFlat to Lucene for ANN similarity search

2020-03-06 Thread Xin-Chun Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-9136?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xin-Chun Zhang updated LUCENE-9136:
---
Attachment: glove-25-angular.png

> Introduce IVFFlat to Lucene for ANN similarity search
> -
>
> Key: LUCENE-9136
> URL: https://issues.apache.org/jira/browse/LUCENE-9136
> Project: Lucene - Core
>  Issue Type: New Feature
>Reporter: Xin-Chun Zhang
>Priority: Major
> Attachments: glove-100-angular.png, glove-25-angular.png, 
> image-2020-03-07-01-22-06-132.png, image-2020-03-07-01-25-58-047.png, 
> image-2020-03-07-01-27-12-859.png
>
>  Time Spent: 50m
>  Remaining Estimate: 0h
>






[jira] [Updated] (LUCENE-9136) Introduce IVFFlat to Lucene for ANN similarity search

2020-03-06 Thread Xin-Chun Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-9136?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xin-Chun Zhang updated LUCENE-9136:
---
Attachment: glove-100-angular.png

> Introduce IVFFlat to Lucene for ANN similarity search
> -
>
> Key: LUCENE-9136
> URL: https://issues.apache.org/jira/browse/LUCENE-9136
> Project: Lucene - Core
>  Issue Type: New Feature
>Reporter: Xin-Chun Zhang
>Priority: Major
> Attachments: glove-100-angular.png, glove-25-angular.png, 
> image-2020-03-07-01-22-06-132.png, image-2020-03-07-01-25-58-047.png, 
> image-2020-03-07-01-27-12-859.png
>
>  Time Spent: 50m
>  Remaining Estimate: 0h
>






[jira] [Commented] (SOLR-13944) CollapsingQParserPlugin throws NPE instead of bad request

2020-03-06 Thread Lucene/Solr QA (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-13944?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17053653#comment-17053653
 ] 

Lucene/Solr QA commented on SOLR-13944:
---

| (/) *{color:green}+1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
|| || || || {color:brown} Prechecks {color} ||
| {color:green}+1{color} | {color:green} test4tests {color} | {color:green}  0m 
 0s{color} | {color:green} The patch appears to include 1 new or modified test 
files. {color} |
|| || || || {color:brown} master Compile Tests {color} ||
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  2m 
11s{color} | {color:green} master passed {color} |
|| || || || {color:brown} Patch Compile Tests {color} ||
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  1m 
35s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green}  1m 
35s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} Release audit (RAT) {color} | 
{color:green}  1m 35s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} Check forbidden APIs {color} | 
{color:green}  1m 35s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} Validate source patterns {color} | 
{color:green}  1m 35s{color} | {color:green} the patch passed {color} |
|| || || || {color:brown} Other Tests {color} ||
| {color:green}+1{color} | {color:green} unit {color} | {color:green} 74m 
57s{color} | {color:green} core in the patch passed. {color} |
| {color:black}{color} | {color:black} {color} | {color:black} 82m 23s{color} | 
{color:black} {color} |
\\
\\
|| Subsystem || Report/Notes ||
| JIRA Issue | SOLR-13944 |
| JIRA Patch URL | 
https://issues.apache.org/jira/secure/attachment/12995852/SOLR-13944.patch |
| Optional Tests |  compile  javac  unit  ratsources  checkforbiddenapis  
validatesourcepatterns  |
| uname | Linux lucene2-us-west.apache.org 4.4.0-170-generic #199-Ubuntu SMP 
Thu Nov 14 01:45:04 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | ant |
| Personality | 
/home/jenkins/jenkins-slave/workspace/PreCommit-SOLR-Build/sourcedir/dev-tools/test-patch/lucene-solr-yetus-personality.sh
 |
| git revision | master / c73d2c1 |
| ant | version: Apache Ant(TM) version 1.9.6 compiled on July 20 2018 |
| Default Java | LTS |
|  Test Results | 
https://builds.apache.org/job/PreCommit-SOLR-Build/699/testReport/ |
| modules | C: solr/core U: solr/core |
| Console output | 
https://builds.apache.org/job/PreCommit-SOLR-Build/699/console |
| Powered by | Apache Yetus 0.7.0   http://yetus.apache.org |


This message was automatically generated.



> CollapsingQParserPlugin throws NPE instead of bad request
> -
>
> Key: SOLR-13944
> URL: https://issues.apache.org/jira/browse/SOLR-13944
> Project: Solr
>  Issue Type: Bug
>Affects Versions: 7.3.1
>Reporter: Stefan
>Assignee: Munendra S N
>Priority: Minor
> Attachments: SOLR-13944.patch
>
>
>  I noticed the following NPE:
> {code:java}
> java.lang.NullPointerException at 
> org.apache.solr.search.CollapsingQParserPlugin$OrdFieldValueCollector.finish(CollapsingQParserPlugin.java:1021)
>  at 
> org.apache.solr.search.CollapsingQParserPlugin$OrdFieldValueCollector.finish(CollapsingQParserPlugin.java:1081)
>  at 
> org.apache.solr.search.SolrIndexSearcher.buildAndRunCollectorChain(SolrIndexSearcher.java:230)
>  at 
> org.apache.solr.search.SolrIndexSearcher.getDocListNC(SolrIndexSearcher.java:1602)
>  at 
> org.apache.solr.search.SolrIndexSearcher.getDocListC(SolrIndexSearcher.java:1419)
>  at 
> org.apache.solr.search.SolrIndexSearcher.search(SolrIndexSearcher.java:584)
> {code}
> If I am correct, the problem was already addressed in SOLR-8807. The fix 
> was not working in this case though, because of a syntax error in the query 
> (I used the local parameter syntax twice instead of combining it). The 
> relevant part of the query is:
> {code:java}
> fq={!tag=collapser}{!collapse field=productId sort='merchantOrder asc, price 
> asc, id asc'}
> {code}
> After discussing that on the mailing list, I was asked to open a ticket, 
> because this situation should result in a bad request instead of a 
> NullPointerException (see 
> [https://mail-archives.apache.org/mod_mbox/lucene-solr-user/201911.mbox/%3CCAMJgJxTuSb%3D8szO8bvHiAafJOs08O_NMB4pcaHOXME4Jj-GO2A%40mail.gmail.com%3E])
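For comparison, a sketch of the combined local-params form (one block instead 
of two; the exact accepted syntax is whatever the SOLR-8807 validation 
expects):

{code:java}
fq={!collapse tag=collapser field=productId sort='merchantOrder asc, price asc, id asc'}
{code}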



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Updated] (SOLR-14073) Fix segment look ahead NPE in CollapsingQParserPlugin

2020-03-06 Thread Joel Bernstein (Jira)


 [ 
https://issues.apache.org/jira/browse/SOLR-14073?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joel Bernstein updated SOLR-14073:
--
Attachment: SOLR-14073.patch

> Fix segment look ahead NPE in CollapsingQParserPlugin
> -
>
> Key: SOLR-14073
> URL: https://issues.apache.org/jira/browse/SOLR-14073
> Project: Solr
>  Issue Type: Bug
>Reporter: Joel Bernstein
>Assignee: Joel Bernstein
>Priority: Major
> Attachments: SOLR-14073.patch, SOLR-14073.patch, SOLR-14073.patch
>
>
> The CollapsingQParserPlugin has a bug: if not every segment is visited 
> during the collect, it throws an NPE. This causes the CollapsingQParserPlugin 
> to not work when used with any feature that short-circuits the segments 
> during the collect. This includes using the CollapsingQParserPlugin twice in 
> the same query, and the time-limiting collector.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-8962) Can we merge small segments during refresh, for faster searching?

2020-03-06 Thread Michael Froh (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-8962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17053640#comment-17053640
 ] 

Michael Froh commented on LUCENE-8962:
--

bq. With a slightly refactored IW we can share the merge logic and let the 
reader re-write itself since we are talking about very small segments the 
overhead is very small. This would in turn mean that we are doing the work 
twice ie. the IW would do its normal work and might merge later etc.

Just to provide a bit more context, for the case where my team uses this 
change, we're replicating the index (think Solr master/slave) from "writers" to 
many "searchers", so we're avoiding doing the work many times.

An earlier (less invasive) approach I tried to address the small flushed 
segments problem was roughly: call commit on writer, hard link the commit files 
to another filesystem directory to "clone" the index, open an IW on that 
directory, merge small segments on the clone, let searchers replicate from the 
clone. That approach does mean that the merging work happens twice (since the 
"real" index doesn't benefit from the merge on the clone), but it doesn't 
involve any changes in Lucene.

Maybe that less-invasive approach is a better way to address this. It's 
certainly more consistent with [~simonw]'s suggestion above.
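For concreteness, here is a rough sketch of that clone-and-merge flow under 
stated assumptions (the paths, analyzer, and the 16 MB "small segment" floor 
are all invented, the source index must live on the same filesystem, and no 
commit may be in flight while the files are linked); it is a sketch of the 
shape of the approach, not the code we actually run:

{code:java}
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.TieredMergePolicy;
import org.apache.lucene.store.FSDirectory;

public class CloneAndMerge {
  public static void main(String[] args) throws IOException {
    Path src = Paths.get("/indexes/real");    // hypothetical "real" index
    Path clone = Paths.get("/indexes/clone"); // hypothetical clone directory
    Files.createDirectories(clone);
    // Hard-link the committed files into the clone; skip the write lock,
    // which belongs to the live writer on the source index.
    try (DirectoryStream<Path> files = Files.newDirectoryStream(src)) {
      for (Path f : files) {
        if (f.getFileName().toString().equals("write.lock")) {
          continue;
        }
        Files.createLink(clone.resolve(f.getFileName()), f);
      }
    }
    // Open a writer on the clone and let its merge policy coalesce the small
    // flushed segments; searchers then replicate from the clone instead.
    TieredMergePolicy mp = new TieredMergePolicy();
    mp.setFloorSegmentMB(16); // treat segments under ~16 MB as "small"
    IndexWriterConfig iwc = new IndexWriterConfig(new StandardAnalyzer())
        .setMergePolicy(mp);
    try (IndexWriter w = new IndexWriter(FSDirectory.open(clone), iwc)) {
      w.maybeMerge(); // kick off merges according to the policy
      w.commit();
    }
  }
}
{code}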

> Can we merge small segments during refresh, for faster searching?
> -
>
> Key: LUCENE-8962
> URL: https://issues.apache.org/jira/browse/LUCENE-8962
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/index
>Reporter: Michael McCandless
>Priority: Major
> Fix For: 8.5
>
> Attachments: LUCENE-8962_demo.png
>
>  Time Spent: 9.5h
>  Remaining Estimate: 0h
>
> With near-real-time search we ask {{IndexWriter}} to write all in-memory 
> segments to disk and open an {{IndexReader}} to search them, and this is 
> typically a quick operation.
> However, when you use many threads for concurrent indexing, {{IndexWriter}} 
> will accumulate many small segments during {{refresh}}, and this then adds 
> search-time cost as searching must visit all of these tiny segments.
> The merge policy would normally quickly coalesce these small segments if 
> given a little time ... so, could we somehow improve {{IndexWriter}}'s 
> refresh to optionally kick off the merge policy to merge segments below some 
> threshold before opening the near-real-time reader?  It'd be a bit tricky 
> because while we are waiting for merges, indexing may continue, and new 
> segments may be flushed, but those new segments shouldn't be included in the 
> point-in-time segments returned by refresh ...
> One could almost do this on top of Lucene today, with a custom merge policy, 
> and some hackity logic to have the merge policy target small segments just 
> written by refresh, but it's tricky to then open a near-real-time reader, 
> excluding newly flushed but including newly merged segments since the refresh 
> originally finished ...
> I'm not yet sure how best to solve this, so I wanted to open an issue for 
> discussion!



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (SOLR-13749) Implement support for joining across collections with multiple shards ( XCJF )

2020-03-06 Thread Kevin Watters (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-13749?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17053641#comment-17053641
 ] 

Kevin Watters commented on SOLR-13749:
--

A local param like method=xcjf could trigger the xcjf query parser if we 
want, but there are some complications. Currently, XCJF benefits greatly from 
additional configuration on that query parser to specify the field on which a 
collection has been routed. The current join query parsers aren't defined by 
default in solrconfig.xml; if we merge the functionality of these 2 query 
parsers, we might want to explicitly define the join query parser in the Solr 
config by default.

Additionally, there are many query parsers beyond xcjf that are really join 
query parsers.

"child" and "parent" should also be considered "join" query parsers if we want 
to fully go to a consolidated join query parser model.

We'll try to be responsive to issues on this ticket; however, I'm not sure how 
much bandwidth we will have for larger refactors related to xcjf. My 
preference would be to leave it as is. This is what we were asked to develop 
and contribute back, so we'd like to keep it as close to the original 
contribution as possible. If we collectively want to wrangle all of those join 
parsers into a single consolidated join query parser, perhaps we could track 
that as a different issue/ticket.
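For illustration only (the collection names, field names, and ZooKeeper 
address below are invented; the parser name is the one in the current patch), 
invoking the cross-collection join from SolrJ looks roughly like this:

{code:java}
import java.util.Collections;
import java.util.Optional;
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.CloudSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;

public class XcjfExample {
  public static void main(String[] args) throws Exception {
    try (CloudSolrClient client = new CloudSolrClient.Builder(
        Collections.singletonList("localhost:2181"), Optional.empty()).build()) {
      SolrQuery q = new SolrQuery("*:*");
      // The xcjf filter queries the remote "contacts" collection for join
      // keys matching state:CA, then filters the local collection by them.
      q.addFilterQuery("{!xcjf collection=contacts from=contact_id to=contact_id v='state:CA'}");
      QueryResponse rsp = client.query("emails", q);
      System.out.println("hits: " + rsp.getResults().getNumFound());
    }
  }
}
{code}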

> Implement support for joining across collections with multiple shards ( XCJF )
> --
>
> Key: SOLR-13749
> URL: https://issues.apache.org/jira/browse/SOLR-13749
> Project: Solr
>  Issue Type: New Feature
>Reporter: Kevin Watters
>Assignee: Gus Heck
>Priority: Major
> Fix For: 8.5
>
>  Time Spent: 1.5h
>  Remaining Estimate: 0h
>
> This ticket includes 2 query parsers.
> The first one is the "Cross-Collection Join Filter" (XCJF) query parser. It 
> can make a call out to a remote collection to get a set of join keys to be 
> used as a filter against the local collection.
> The second one is the Hash Range query parser, which lets you specify a 
> field name and a hash range; only the documents that would have hashed to 
> that range are returned.
> The XCJF parser does an intersection based on join keys between 2 
> collections.
> The local collection is the collection that you are searching against.
> The remote collection is the collection that contains the join keys that you 
> want to use as a filter.
> Each shard participating in the distributed request will execute a query 
> against the remote collection.  If the local collection is set up with the 
> compositeId router to be routed on the join key field, a hash range query is 
> applied to the remote collection query to only match the documents that 
> contain a potential match for the documents that are in the local shard/core. 
>  
>  
> Here's some vocab to help with the descriptions of the various parameters.
> ||Term||Description||
> |Local Collection|This is the main collection that is being queried.|
> |Remote Collection|This is the collection that the XCJFQuery will query to 
> resolve the join keys.|
> |XCJFQuery|The lucene query that executes a search to get back a set of join 
> keys from a remote collection|
> |HashRangeQuery|The lucene query that matches only the documents whose hash 
> code on a field falls within a specified range.|
>  
>  
> ||Param ||Required ||Description||
> |collection|Required|The name of the external Solr collection to be queried 
> to retrieve the set of join key values.|
> |zkHost|Optional|The connection string to be used to connect to ZooKeeper.  
> zkHost and solrUrl are both optional parameters, and at most one of them 
> should be specified.  
> If neither zkHost nor solrUrl is specified, the local ZooKeeper cluster 
> will be used.|
> |solrUrl|Optional|The URL of the external Solr node to be queried.|
> |from|Required|The join key field name in the external collection.|
> |to|Required|The join key field name in the local collection.|
> |v|See Note|The query to be executed against the external Solr collection to 
> retrieve the set of join key values.  
> Note:  The original query can be passed at the end of the string or as the 
> "v" parameter.  
> It's recommended to use query parameter substitution with the "v" parameter 
> to ensure no issues arise with the default query parsers.|
> |routed| |true / false.  If true, the XCJF query will use each shard's hash 
> range to determine the set of join keys to retrieve for that shard.
> This parameter improves the performance of the cross-collection join, but 
> it depends on the local 

[GitHub] [lucene-solr] dnhatn commented on a change in pull request #1319: LUCENE-9164: process all events before closing gracefully

2020-03-06 Thread GitBox
dnhatn commented on a change in pull request #1319: LUCENE-9164: process all 
events before closing gracefully
URL: https://github.com/apache/lucene-solr/pull/1319#discussion_r389028879
 
 

 ##
 File path: lucene/core/src/java/org/apache/lucene/index/IndexWriter.java
 ##
 @@ -299,7 +300,76 @@ static int getActualMaxDocs() {
   final FieldNumbers globalFieldNumberMap;
 
   final DocumentsWriter docWriter;
-  private final Queue<Event> eventQueue = new ConcurrentLinkedQueue<>();
+  private final CloseableQueue eventQueue = new CloseableQueue(this);
+
+  static final class CloseableQueue implements Closeable {
+    private volatile boolean closed = false;
+    private final Semaphore permits = new Semaphore(Integer.MAX_VALUE);
+    private final Queue<Event> queue = new ConcurrentLinkedQueue<>();
+    private final IndexWriter writer;
+
+    CloseableQueue(IndexWriter writer) {
+      this.writer = writer;
+    }
+
+    private void tryAcquire() {
+      if (permits.tryAcquire() == false) {
+        throw new AlreadyClosedException("queue is closed");
+      }
+      if (closed) {
+        throw new AlreadyClosedException("queue is closed");
+      }
+    }
+
+    boolean add(Event event) {
+      tryAcquire();
+      try {
+        return queue.add(event);
+      } finally {
+        permits.release();
+      }
+    }
+
+    void processEvents() throws IOException {
+      tryAcquire();
+      try {
+        processEventsInternal();
+      }finally {
 
 Review comment:
   nit: space after `}`


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene-solr] dnhatn commented on a change in pull request #1319: LUCENE-9164: process all events before closing gracefully

2020-03-06 Thread GitBox
dnhatn commented on a change in pull request #1319: LUCENE-9164: process all 
events before closing gracefully
URL: https://github.com/apache/lucene-solr/pull/1319#discussion_r389028289
 
 

 ##
 File path: lucene/core/src/java/org/apache/lucene/index/IndexWriter.java
 ##
 @@ -299,7 +300,76 @@ static int getActualMaxDocs() {
   final FieldNumbers globalFieldNumberMap;
 
   final DocumentsWriter docWriter;
-  private final Queue<Event> eventQueue = new ConcurrentLinkedQueue<>();
+  private final CloseableQueue eventQueue = new CloseableQueue(this);
+
+  static final class CloseableQueue implements Closeable {
+    private volatile boolean closed = false;
+    private final Semaphore permits = new Semaphore(Integer.MAX_VALUE);
+    private final Queue<Event> queue = new ConcurrentLinkedQueue<>();
+    private final IndexWriter writer;
+
+    CloseableQueue(IndexWriter writer) {
+      this.writer = writer;
+    }
+
+    private void tryAcquire() {
+      if (permits.tryAcquire() == false) {
+        throw new AlreadyClosedException("queue is closed");
+      }
+      if (closed) {
+        throw new AlreadyClosedException("queue is closed");
+      }
+    }
+
+    boolean add(Event event) {
+      tryAcquire();
+      try {
+        return queue.add(event);
+      } finally {
+        permits.release();
+      }
+    }
+
+    void processEvents() throws IOException {
+      tryAcquire();
+      try {
+        processEventsInternal();
+      }finally {
+        permits.release();
+      }
+    }
+    private void processEventsInternal() throws IOException {
 
 Review comment:
   nit: add a new line


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene-solr] dnhatn commented on a change in pull request #1319: LUCENE-9164: process all events before closing gracefully

2020-03-06 Thread GitBox
dnhatn commented on a change in pull request #1319: LUCENE-9164: process all 
events before closing gracefully
URL: https://github.com/apache/lucene-solr/pull/1319#discussion_r389029514
 
 

 ##
 File path: lucene/core/src/test/org/apache/lucene/index/TestIndexWriter.java
 ##
 @@ -3773,7 +3774,58 @@ public void testRefreshAndRollbackConcurrently() throws Exception {
     stopped.set(true);
     indexer.join();
     refresher.join();
+    if (w.getTragicException() != null) {
+      w.getTragicException().printStackTrace();
 
 Review comment:
   I think we don't need to print the stack trace here.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene-solr] dnhatn commented on a change in pull request #1319: LUCENE-9164: process all events before closing gracefully

2020-03-06 Thread GitBox
dnhatn commented on a change in pull request #1319: LUCENE-9164: process all 
events before closing gracefully
URL: https://github.com/apache/lucene-solr/pull/1319#discussion_r389028879
 
 

 ##
 File path: lucene/core/src/java/org/apache/lucene/index/IndexWriter.java
 ##
 @@ -299,7 +300,76 @@ static int getActualMaxDocs() {
   final FieldNumbers globalFieldNumberMap;
 
   final DocumentsWriter docWriter;
-  private final Queue<Event> eventQueue = new ConcurrentLinkedQueue<>();
+  private final CloseableQueue eventQueue = new CloseableQueue(this);
+
+  static final class CloseableQueue implements Closeable {
+    private volatile boolean closed = false;
+    private final Semaphore permits = new Semaphore(Integer.MAX_VALUE);
+    private final Queue<Event> queue = new ConcurrentLinkedQueue<>();
+    private final IndexWriter writer;
+
+    CloseableQueue(IndexWriter writer) {
+      this.writer = writer;
+    }
+
+    private void tryAcquire() {
+      if (permits.tryAcquire() == false) {
+        throw new AlreadyClosedException("queue is closed");
+      }
+      if (closed) {
+        throw new AlreadyClosedException("queue is closed");
+      }
+    }
+
+    boolean add(Event event) {
+      tryAcquire();
+      try {
+        return queue.add(event);
+      } finally {
+        permits.release();
+      }
+    }
+
+    void processEvents() throws IOException {
+      tryAcquire();
+      try {
+        processEventsInternal();
+      }finally {
 
 Review comment:
   nit: space after `{`


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene-solr] atris commented on issue #1294: LUCENE-9074: Slice Allocation Control Plane For Concurrent Searches

2020-03-06 Thread GitBox
atris commented on issue #1294: LUCENE-9074: Slice Allocation Control Plane For 
Concurrent Searches
URL: https://github.com/apache/lucene-solr/pull/1294#issuecomment-595884480
 
 
   @jpountz Raised another iteration; please let me know your thoughts and 
comments.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-9136) Introduce IVFFlat to Lucene for ANN similarity search

2020-03-06 Thread Xin-Chun Zhang (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9136?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17053623#comment-17053623
 ] 

Xin-Chun Zhang commented on LUCENE-9136:


1. My personal git branch: 
[https://github.com/irvingzhang/lucene-solr/tree/jira/lucene-9136-ann-ivfflat].

2. The vector format is as follows, 

!image-2020-03-07-01-25-58-047.png|width=535,height=297!

 

The structure of the IVF index meta file is as follows:

!image-2020-03-07-01-27-12-859.png|width=606,height=276!

 

The structure of the IVF data file is as follows:

!image-2020-03-07-01-22-06-132.png|width=529,height=309!

3. The ann-benchmarks tool can be found at: 
[https://github.com/irvingzhang/ann-benchmarks].

Benchmark results (single thread, 2.5GHz * 2CPU, 16GB RAM, 
nprobe=8,16,32,64,128,256, centroids=4*sqrt(N), where N is the size of the 
dataset):

1) Glove-1.2M-25D-Angular: index build + training cost 706s, qps: 18.8~49.6, 
recall: 76.8%~99.7%

!https://intranetproxy.alipay.com/skylark/lark/0/2020/png/35769/1583504416262-89784074-c9dc-4489-99a1-5e4b3c76e5fc.png|width=624,height=430!

 

2) Glove-1.2M-100D-Angular: index build + training cost 2487s, qps: 12.2~38.3, 
recall 65.8%~96.3%

!https://intranetproxy.alipay.com/skylark/lark/0/2020/png/35769/1583510066130-b4fbcb29-8ad7-4ff2-99ce-c52f7c27826e.png|width=679,height=468!

3) Sift-1M-128D-Euclidean: index build + training cost 2397s, qps 14.8~38.2, 
recall 71.1%~99.2%

!https://intranetproxy.alipay.com/skylark/lark/0/2020/png/35769/1583515010497-20b74f41-72c3-48ce-a929-1cbfbd6a6423.png|width=691,height=476!
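As a worked instance of the centroids = 4*sqrt(N) rule used in these runs 
(dataset sizes taken from the comment above; this is plain arithmetic, not the 
patch's API):

{code:java}
public class CentroidCount {
  public static void main(String[] args) {
    long[] datasetSizes = {1_200_000L, 1_000_000L}; // Glove-1.2M and Sift-1M
    for (long n : datasetSizes) {
      // centroids = 4 * sqrt(N), per the benchmark setup above
      long c = Math.round(4 * Math.sqrt(n));
      System.out.println("N=" + n + " -> centroids=" + c);
    }
    // Prints roughly 4382 for Glove-1.2M and 4000 for Sift-1M.
  }
}
{code}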

 

> Introduce IVFFlat to Lucene for ANN similarity search
> -
>
> Key: LUCENE-9136
> URL: https://issues.apache.org/jira/browse/LUCENE-9136
> Project: Lucene - Core
>  Issue Type: New Feature
>Reporter: Xin-Chun Zhang
>Priority: Major
> Attachments: image-2020-03-07-01-22-06-132.png, 
> image-2020-03-07-01-25-58-047.png, image-2020-03-07-01-27-12-859.png
>
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
> Representation learning (RL) has been an established discipline in the 
> machine learning space for decades, but it has drawn tremendous attention 
> lately with the emergence of deep learning. The central problem of RL is to 
> determine an optimal representation of the input data. The data are embedded 
> into high-dimensional vectors, and vector retrieval (VR) methods are then 
> applied to search for the relevant items.
> With the rapid development of RL over the past few years, the technique has 
> been used extensively in industry, from online advertising to computer vision 
> and speech recognition. There exist many open source implementations of VR 
> algorithms, such as Facebook's FAISS and Microsoft's SPTAG, providing various 
> choices for potential users. However, the aforementioned implementations are 
> all written in C++ with no plan to support a Java interface, making them hard 
> to integrate into Java projects or for those who are not familiar with C/C++ 
> [https://github.com/facebookresearch/faiss/issues/105]. 
> The algorithms for vector retrieval can be roughly classified into four 
> categories,
>  # Tree-based algorithms, such as KD-tree;
>  # Hashing methods, such as LSH (Locality-Sensitive Hashing);
>  # Product quantization based algorithms, such as IVFFlat;
>  # Graph-based algorithms, such as HNSW, SSG, NSG;
> where IVFFlat and HNSW are the most popular ones among all the VR algorithms.
> IVFFlat is better for high-precision applications such as face recognition, 
> while HNSW performs better in general scenarios including recommendation and 
> personalized advertisement. *The recall ratio of IVFFlat can be gradually 
> increased by adjusting the query parameter (nprobe), while it is hard for 
> HNSW to improve its accuracy*. In theory, IVFFlat could achieve a 100% 
> recall ratio. 
> Recently, the implementation of HNSW (Hierarchical Navigable Small World, 
> LUCENE-9004) for Lucene has made great progress. The issue draws the 
> attention of those who are interested in Lucene or hope to use HNSW with 
> Solr/Lucene. 
> As an alternative for solving ANN similarity search problems, IVFFlat is also 
> very popular with many users and supporters. Compared with HNSW, IVFFlat has 
> a smaller index size but requires k-means clustering, while HNSW is faster in 
> query (no training required) but requires extra storage for saving graphs 
> [indexing 1M 
> vectors|https://github.com/facebookresearch/faiss/wiki/Indexing-1M-vectors]. 
> Another advantage is that IVFFlat can be faster and more accurate when GPU 
> parallel computing is enabled (currently not supported in Java). Both 
> algorithms have their merits and demerits. Since HNSW is now under 
> development, it may be better to provide both implementations (HNSW && 
> IVFFlat) for potential users who are faced with very different scenarios and 
> want to 

[jira] [Commented] (LUCENE-8962) Can we merge small segments during refresh, for faster searching?

2020-03-06 Thread Michael Sokolov (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-8962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17053620#comment-17053620
 ] 

Michael Sokolov commented on LUCENE-8962:
-

Based on [~simonw]'s recent comments on GitHub, plus the difficulty getting 
tests to pass consistently (apparently there are more failing tests in 
Elasticland), we should probably revert for now, at least from the 8.x and 8.5 
branches. I am tied up for the moment, but will be able to do the revert this 
weekend.

> Can we merge small segments during refresh, for faster searching?
> -
>
> Key: LUCENE-8962
> URL: https://issues.apache.org/jira/browse/LUCENE-8962
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/index
>Reporter: Michael McCandless
>Priority: Major
> Fix For: 8.5
>
> Attachments: LUCENE-8962_demo.png
>
>  Time Spent: 9.5h
>  Remaining Estimate: 0h
>
> With near-real-time search we ask {{IndexWriter}} to write all in-memory 
> segments to disk and open an {{IndexReader}} to search them, and this is 
> typically a quick operation.
> However, when you use many threads for concurrent indexing, {{IndexWriter}} 
> will accumulate many small segments during {{refresh}}, and this then adds 
> search-time cost as searching must visit all of these tiny segments.
> The merge policy would normally quickly coalesce these small segments if 
> given a little time ... so, could we somehow improve {{IndexWriter}}'s 
> refresh to optionally kick off the merge policy to merge segments below some 
> threshold before opening the near-real-time reader?  It'd be a bit tricky 
> because while we are waiting for merges, indexing may continue, and new 
> segments may be flushed, but those new segments shouldn't be included in the 
> point-in-time segments returned by refresh ...
> One could almost do this on top of Lucene today, with a custom merge policy, 
> and some hackity logic to have the merge policy target small segments just 
> written by refresh, but it's tricky to then open a near-real-time reader, 
> excluding newly flushed but including newly merged segments since the refresh 
> originally finished ...
> I'm not yet sure how best to solve this, so I wanted to open an issue for 
> discussion!



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Updated] (LUCENE-9136) Introduce IVFFlat to Lucene for ANN similarity search

2020-03-06 Thread Xin-Chun Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-9136?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xin-Chun Zhang updated LUCENE-9136:
---
Attachment: image-2020-03-07-01-27-12-859.png

> Introduce IVFFlat to Lucene for ANN similarity search
> -
>
> Key: LUCENE-9136
> URL: https://issues.apache.org/jira/browse/LUCENE-9136
> Project: Lucene - Core
>  Issue Type: New Feature
>Reporter: Xin-Chun Zhang
>Priority: Major
> Attachments: image-2020-03-07-01-22-06-132.png, 
> image-2020-03-07-01-25-58-047.png, image-2020-03-07-01-27-12-859.png
>
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
> Representation learning (RL) has been an established discipline in the 
> machine learning space for decades, but it has drawn tremendous attention 
> lately with the emergence of deep learning. The central problem of RL is to 
> determine an optimal representation of the input data. The data are embedded 
> into high-dimensional vectors, and vector retrieval (VR) methods are then 
> applied to search for the relevant items.
> With the rapid development of RL over the past few years, the technique has 
> been used extensively in industry, from online advertising to computer vision 
> and speech recognition. There exist many open source implementations of VR 
> algorithms, such as Facebook's FAISS and Microsoft's SPTAG, providing various 
> choices for potential users. However, the aforementioned implementations are 
> all written in C++ with no plan to support a Java interface, making them hard 
> to integrate into Java projects or for those who are not familiar with C/C++ 
> [https://github.com/facebookresearch/faiss/issues/105]. 
> The algorithms for vector retrieval can be roughly classified into four 
> categories,
>  # Tree-based algorithms, such as KD-tree;
>  # Hashing methods, such as LSH (Locality-Sensitive Hashing);
>  # Product quantization based algorithms, such as IVFFlat;
>  # Graph-based algorithms, such as HNSW, SSG, NSG;
> where IVFFlat and HNSW are the most popular ones among all the VR algorithms.
> IVFFlat is better for high-precision applications such as face recognition, 
> while HNSW performs better in general scenarios including recommendation and 
> personalized advertisement. *The recall ratio of IVFFlat can be gradually 
> increased by adjusting the query parameter (nprobe), while it is hard for 
> HNSW to improve its accuracy*. In theory, IVFFlat could achieve a 100% 
> recall ratio. 
> Recently, the implementation of HNSW (Hierarchical Navigable Small World, 
> LUCENE-9004) for Lucene has made great progress. The issue draws the 
> attention of those who are interested in Lucene or hope to use HNSW with 
> Solr/Lucene. 
> As an alternative for solving ANN similarity search problems, IVFFlat is also 
> very popular with many users and supporters. Compared with HNSW, IVFFlat has 
> a smaller index size but requires k-means clustering, while HNSW is faster in 
> query (no training required) but requires extra storage for saving graphs 
> [indexing 1M 
> vectors|https://github.com/facebookresearch/faiss/wiki/Indexing-1M-vectors]. 
> Another advantage is that IVFFlat can be faster and more accurate when GPU 
> parallel computing is enabled (currently not supported in Java). Both 
> algorithms have their merits and demerits. Since HNSW is now under 
> development, it may be better to provide both implementations (HNSW && 
> IVFFlat) for potential users who are faced with very different scenarios and 
> want more choices.
> The latest branch is 
> [*lucene-9136-ann-ivfflat*|https://github.com/irvingzhang/lucene-solr/commits/jira/lucene-9136-ann-ivfflat]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Updated] (LUCENE-9136) Introduce IVFFlat to Lucene for ANN similarity search

2020-03-06 Thread Xin-Chun Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-9136?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xin-Chun Zhang updated LUCENE-9136:
---
Attachment: image-2020-03-07-01-25-58-047.png

> Introduce IVFFlat to Lucene for ANN similarity search
> -
>
> Key: LUCENE-9136
> URL: https://issues.apache.org/jira/browse/LUCENE-9136
> Project: Lucene - Core
>  Issue Type: New Feature
>Reporter: Xin-Chun Zhang
>Priority: Major
> Attachments: image-2020-03-07-01-22-06-132.png, 
> image-2020-03-07-01-25-58-047.png
>
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
> Representation learning (RL) has been an established discipline in the 
> machine learning space for decades, but it has drawn tremendous attention 
> lately with the emergence of deep learning. The central problem of RL is to 
> determine an optimal representation of the input data. The data are embedded 
> into high-dimensional vectors, and vector retrieval (VR) methods are then 
> applied to search for the relevant items.
> With the rapid development of RL over the past few years, the technique has 
> been used extensively in industry, from online advertising to computer vision 
> and speech recognition. There exist many open source implementations of VR 
> algorithms, such as Facebook's FAISS and Microsoft's SPTAG, providing various 
> choices for potential users. However, the aforementioned implementations are 
> all written in C++ with no plan to support a Java interface, making them hard 
> to integrate into Java projects or for those who are not familiar with C/C++ 
> [https://github.com/facebookresearch/faiss/issues/105]. 
> The algorithms for vector retrieval can be roughly classified into four 
> categories,
>  # Tree-based algorithms, such as KD-tree;
>  # Hashing methods, such as LSH (Locality-Sensitive Hashing);
>  # Product quantization based algorithms, such as IVFFlat;
>  # Graph-based algorithms, such as HNSW, SSG, NSG;
> where IVFFlat and HNSW are the most popular ones among all the VR algorithms.
> IVFFlat is better for high-precision applications such as face recognition, 
> while HNSW performs better in general scenarios including recommendation and 
> personalized advertisement. *The recall ratio of IVFFlat can be gradually 
> increased by adjusting the query parameter (nprobe), while it is hard for 
> HNSW to improve its accuracy*. In theory, IVFFlat could achieve a 100% 
> recall ratio. 
> Recently, the implementation of HNSW (Hierarchical Navigable Small World, 
> LUCENE-9004) for Lucene has made great progress. The issue draws the 
> attention of those who are interested in Lucene or hope to use HNSW with 
> Solr/Lucene. 
> As an alternative for solving ANN similarity search problems, IVFFlat is also 
> very popular with many users and supporters. Compared with HNSW, IVFFlat has 
> a smaller index size but requires k-means clustering, while HNSW is faster in 
> query (no training required) but requires extra storage for saving graphs 
> [indexing 1M 
> vectors|https://github.com/facebookresearch/faiss/wiki/Indexing-1M-vectors]. 
> Another advantage is that IVFFlat can be faster and more accurate when GPU 
> parallel computing is enabled (currently not supported in Java). Both 
> algorithms have their merits and demerits. Since HNSW is now under 
> development, it may be better to provide both implementations (HNSW && 
> IVFFlat) for potential users who are faced with very different scenarios and 
> want more choices.
> The latest branch is 
> [*lucene-9136-ann-ivfflat*|https://github.com/irvingzhang/lucene-solr/commits/jira/lucene-9136-ann-ivfflat]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene-solr] atris commented on a change in pull request #1294: LUCENE-9074: Slice Allocation Control Plane For Concurrent Searches

2020-03-06 Thread GitBox
atris commented on a change in pull request #1294: LUCENE-9074: Slice 
Allocation Control Plane For Concurrent Searches
URL: https://github.com/apache/lucene-solr/pull/1294#discussion_r389038791
 
 

 ##
 File path: lucene/core/src/java/org/apache/lucene/search/IndexSearcher.java
 ##
 @@ -211,6 +213,18 @@ public IndexSearcher(IndexReaderContext context, Executor executor) {
     assert context.isTopLevel: "IndexSearcher's ReaderContext must be topLevel for reader" + context.reader();
     reader = context.reader();
     this.executor = executor;
+    this.sliceExecutionControlPlane = executor == null ? null : getSliceExecutionControlPlane(executor);
+    this.readerContext = context;
+    leafContexts = context.leaves();
+    this.leafSlices = executor == null ? null : slices(leafContexts);
+  }
+
+  // Package private for testing
+  IndexSearcher(IndexReaderContext context, Executor executor, SliceExecutionControlPlane sliceExecutionControlPlane) {
+    assert context.isTopLevel: "IndexSearcher's ReaderContext must be topLevel for reader" + context.reader();
+    reader = context.reader();
+    this.executor = executor;
+    this.sliceExecutionControlPlane = executor == null ? null : sliceExecutionControlPlane;
 
 Review comment:
   Not sure if I understood your point. The passed-in instance is the one 
being assigned to the member?


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Updated] (LUCENE-9136) Introduce IVFFlat to Lucene for ANN similarity search

2020-03-06 Thread Xin-Chun Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-9136?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xin-Chun Zhang updated LUCENE-9136:
---
Attachment: image-2020-03-07-01-22-06-132.png

> Introduce IVFFlat to Lucene for ANN similarity search
> -
>
> Key: LUCENE-9136
> URL: https://issues.apache.org/jira/browse/LUCENE-9136
> Project: Lucene - Core
>  Issue Type: New Feature
>Reporter: Xin-Chun Zhang
>Priority: Major
> Attachments: image-2020-03-07-01-22-06-132.png
>
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
> Representation learning (RL) has been an established discipline in the 
> machine learning space for decades, but it has drawn tremendous attention 
> lately with the emergence of deep learning. The central problem of RL is to 
> determine an optimal representation of the input data. The data are embedded 
> into high-dimensional vectors, and vector retrieval (VR) methods are then 
> applied to search for the relevant items.
> With the rapid development of RL over the past few years, the technique has 
> been used extensively in industry, from online advertising to computer vision 
> and speech recognition. There exist many open source implementations of VR 
> algorithms, such as Facebook's FAISS and Microsoft's SPTAG, providing various 
> choices for potential users. However, the aforementioned implementations are 
> all written in C++ with no plan to support a Java interface, making them hard 
> to integrate into Java projects or for those who are not familiar with C/C++ 
> [https://github.com/facebookresearch/faiss/issues/105]. 
> The algorithms for vector retrieval can be roughly classified into four 
> categories,
>  # Tree-based algorithms, such as KD-tree;
>  # Hashing methods, such as LSH (Locality-Sensitive Hashing);
>  # Product quantization based algorithms, such as IVFFlat;
>  # Graph-based algorithms, such as HNSW, SSG, NSG;
> where IVFFlat and HNSW are the most popular ones among all the VR algorithms.
> IVFFlat is better for high-precision applications such as face recognition, 
> while HNSW performs better in general scenarios including recommendation and 
> personalized advertisement. *The recall ratio of IVFFlat can be gradually 
> increased by adjusting the query parameter (nprobe), while it is hard for 
> HNSW to improve its accuracy*. In theory, IVFFlat could achieve a 100% 
> recall ratio. 
> Recently, the implementation of HNSW (Hierarchical Navigable Small World, 
> LUCENE-9004) for Lucene has made great progress. The issue draws the 
> attention of those who are interested in Lucene or hope to use HNSW with 
> Solr/Lucene. 
> As an alternative for solving ANN similarity search problems, IVFFlat is also 
> very popular with many users and supporters. Compared with HNSW, IVFFlat has 
> a smaller index size but requires k-means clustering, while HNSW is faster in 
> query (no training required) but requires extra storage for saving graphs 
> [indexing 1M 
> vectors|https://github.com/facebookresearch/faiss/wiki/Indexing-1M-vectors]. 
> Another advantage is that IVFFlat can be faster and more accurate when GPU 
> parallel computing is enabled (currently not supported in Java). Both 
> algorithms have their merits and demerits. Since HNSW is now under 
> development, it may be better to provide both implementations (HNSW && 
> IVFFlat) for potential users who are faced with very different scenarios and 
> want more choices.
> The latest branch is 
> [*lucene-9136-ann-ivfflat*|https://github.com/irvingzhang/lucene-solr/commits/jira/lucene-9136-ann-ivfflat]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Updated] (LUCENE-9136) Introduce IVFFlat to Lucene for ANN similarity search

2020-03-06 Thread Xin-Chun Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-9136?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xin-Chun Zhang updated LUCENE-9136:
---
Attachment: (was: 
1581409981369-9dea4099-4e41-4431-8f45-a3bb8cac46c0.png)

> Introduce IVFFlat to Lucene for ANN similarity search
> -
>
> Key: LUCENE-9136
> URL: https://issues.apache.org/jira/browse/LUCENE-9136
> Project: Lucene - Core
>  Issue Type: New Feature
>Reporter: Xin-Chun Zhang
>Priority: Major
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
> Representation learning (RL) has been an established discipline in the 
> machine learning space for decades, but it has drawn tremendous attention 
> lately with the emergence of deep learning. The central problem of RL is to 
> determine an optimal representation of the input data. The data are embedded 
> into high-dimensional vectors, and vector retrieval (VR) methods are then 
> applied to search for the relevant items.
> With the rapid development of RL over the past few years, the technique has 
> been used extensively in industry, from online advertising to computer vision 
> and speech recognition. There exist many open source implementations of VR 
> algorithms, such as Facebook's FAISS and Microsoft's SPTAG, providing various 
> choices for potential users. However, the aforementioned implementations are 
> all written in C++ with no plan to support a Java interface, making them hard 
> to integrate into Java projects or for those who are not familiar with C/C++ 
> [https://github.com/facebookresearch/faiss/issues/105]. 
> The algorithms for vector retrieval can be roughly classified into four 
> categories,
>  # Tree-based algorithms, such as KD-tree;
>  # Hashing methods, such as LSH (Locality-Sensitive Hashing);
>  # Product quantization based algorithms, such as IVFFlat;
>  # Graph-base algorithms, such as HNSW, SSG, NSG;
> where IVFFlat and HNSW are the most popular ones among all the VR algorithms.
> IVFFlat is better for high-precision applications such as face recognition, 
> while HNSW performs better in general scenarios including recommendation and 
> personalized advertisement. *The recall ratio of IVFFlat can be gradually 
> increased by adjusting the query parameter (nprobe), while it is hard for 
> HNSW to improve its accuracy*. In theory, IVFFlat could achieve a 100% 
> recall ratio. 
> Recently, the implementation of HNSW (Hierarchical Navigable Small World, 
> LUCENE-9004) for Lucene has made great progress. The issue draws the 
> attention of those who are interested in Lucene or hope to use HNSW with 
> Solr/Lucene. 
> As an alternative for solving ANN similarity search problems, IVFFlat is also 
> very popular with many users and supporters. Compared with HNSW, IVFFlat has 
> a smaller index size but requires k-means clustering, while HNSW is faster in 
> query (no training required) but requires extra storage for saving graphs 
> [indexing 1M 
> vectors|https://github.com/facebookresearch/faiss/wiki/Indexing-1M-vectors]. 
> Another advantage is that IVFFlat can be faster and more accurate when GPU 
> parallel computing is enabled (currently not supported in Java). Both 
> algorithms have their merits and demerits. Since HNSW is now under 
> development, it may be better to provide both implementations (HNSW && 
> IVFFlat) for potential users who are faced with very different scenarios and 
> want more choices.
> The latest branch is 
> [*lucene-9136-ann-ivfflat*|https://github.com/irvingzhang/lucene-solr/commits/jira/lucene-9136-ann-ivfflat]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Issue Comment Deleted] (LUCENE-9136) Introduce IVFFlat to Lucene for ANN similarity search

2020-03-06 Thread Xin-Chun Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-9136?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xin-Chun Zhang updated LUCENE-9136:
---
Comment: was deleted

(was: The index format of IVFFlat is organized as follows: 
!1581409981369-9dea4099-4e41-4431-8f45-a3bb8cac46c0.png!

In general, the number of centroids lies within the interval [4 * sqrt(N), 16 * 
sqrt(N)], where N is the data set size. We use 4 * sqrt(N) as the actual 
number of centroids, denoted by c, to balance accuracy against computational 
load. The full data set is used for training if its size is no larger than 
200,000; otherwise, 128 * c points are selected after shuffling in order to 
accelerate training.

Experiments have been conducted on a large data set (sift1M, 
[http://corpus-texmex.irisa.fr/]) to verify the implementation of IVFFlat. The 
base data set (sift_base.fvecs) contains 1,000,000 vectors with 128 dimensions, 
and 10,000 queries (sift_query.fvecs) are used for recall testing. The recall 
ratio is computed as

Recall = (number of retrieved vectors present in the ground truth) / (number 
of queries * TopK), where number of queries = 10,000 and TopK = 100. The 
results are as follows (single thread, single segment):

 
||nprobe||avg. search time (ms)||recall (%)||
|8|16.3827|44.24|
|16|16.5834|58.04|
|32|19.2031|71.55|
|64|24.7065|83.30|
|128|34.9165|92.03|
|256|60.5844|97.18|

The test code can be found at 
[https://github.com/irvingzhang/lucene-solr/blob/jira/LUCENE-9136/lucene/core/src/test/org/apache/lucene/util/KnnIvfAndGraphPerformTester.java].

 )
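For clarity, the recall definition quoted above reduces to simple arithmetic; 
in this sketch the 'found' count is hypothetical, back-solved from the 
nprobe=8 row of the table:

{code:java}
public class RecallMath {
  public static void main(String[] args) {
    long queries = 10_000;  // number of queries, per the comment above
    long topK = 100;        // TopK, per the comment above
    long found = 442_400;   // hypothetical ground-truth hits over all queries
    double recall = (double) found / (queries * topK);
    System.out.printf("recall = %.2f%%%n", 100 * recall); // prints 44.24%
  }
}
{code}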

> Introduce IVFFlat to Lucene for ANN similarity search
> -
>
> Key: LUCENE-9136
> URL: https://issues.apache.org/jira/browse/LUCENE-9136
> Project: Lucene - Core
>  Issue Type: New Feature
>Reporter: Xin-Chun Zhang
>Priority: Major
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
> Representation learning (RL) has been an established discipline in the 
> machine learning space for decades, but it has drawn tremendous attention 
> lately with the emergence of deep learning. The central problem of RL is to 
> determine an optimal representation of the input data. The data are embedded 
> into high-dimensional vectors, and vector retrieval (VR) methods are then 
> applied to search for the relevant items.
> With the rapid development of RL over the past few years, the technique has 
> been used extensively in industry, from online advertising to computer vision 
> and speech recognition. There exist many open source implementations of VR 
> algorithms, such as Facebook's FAISS and Microsoft's SPTAG, providing various 
> choices for potential users. However, the aforementioned implementations are 
> all written in C++ with no plan to support a Java interface, making them hard 
> to integrate into Java projects or for those who are not familiar with C/C++ 
> [https://github.com/facebookresearch/faiss/issues/105]. 
> The algorithms for vector retrieval can be roughly classified into four 
> categories,
>  # Tree-based algorithms, such as KD-tree;
>  # Hashing methods, such as LSH (Locality-Sensitive Hashing);
>  # Product quantization based algorithms, such as IVFFlat;
>  # Graph-based algorithms, such as HNSW, SSG, NSG;
> where IVFFlat and HNSW are the most popular ones among all the VR algorithms.
> IVFFlat is better for high-precision applications such as face recognition, 
> while HNSW performs better in general scenarios including recommendation and 
> personalized advertisement. *The recall ratio of IVFFlat can be gradually 
> increased by adjusting the query parameter (nprobe), while it is hard for 
> HNSW to improve its accuracy*. In theory, IVFFlat could achieve a 100% 
> recall ratio. 
> Recently, the implementation of HNSW (Hierarchical Navigable Small World, 
> LUCENE-9004) for Lucene has made great progress. The issue draws the 
> attention of those who are interested in Lucene or hope to use HNSW with 
> Solr/Lucene. 
> As an alternative for solving ANN similarity search problems, IVFFlat is also 
> very popular with many users and supporters. Compared with HNSW, IVFFlat has 
> a smaller index size but requires k-means clustering, while HNSW is faster in 
> query (no training required) but requires extra storage for saving graphs 
> [indexing 1M 
> vectors|https://github.com/facebookresearch/faiss/wiki/Indexing-1M-vectors]. 
> Another advantage is that IVFFlat can be faster and more accurate when GPU 
> parallel computing is enabled (currently not supported in Java). Both 
> algorithms have their merits and demerits. Since HNSW is now under 
> development, it may be better to provide both implementations (HNSW && 
> IVFFlat) for potential 
> 

[jira] [Updated] (LUCENE-9136) Introduce IVFFlat to Lucene for ANN similarity search

2020-03-06 Thread Xin-Chun Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-9136?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xin-Chun Zhang updated LUCENE-9136:
---
Attachment: (was: image-2020-02-16-15-05-02-451.png)

> Introduce IVFFlat to Lucene for ANN similarity search
> -
>
> Key: LUCENE-9136
> URL: https://issues.apache.org/jira/browse/LUCENE-9136
> Project: Lucene - Core
>  Issue Type: New Feature
>Reporter: Xin-Chun Zhang
>Priority: Major
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
> Representation learning (RL) has been an established discipline in the 
> machine learning space for decades, but it has drawn tremendous attention 
> lately with the emergence of deep learning. The central problem of RL is to 
> determine an optimal representation of the input data. The data are embedded 
> into high-dimensional vectors, and vector retrieval (VR) methods are then 
> applied to search for the relevant items.
> With the rapid development of RL over the past few years, the technique has 
> been used extensively in industry, from online advertising to computer vision 
> and speech recognition. There exist many open source implementations of VR 
> algorithms, such as Facebook's FAISS and Microsoft's SPTAG, providing various 
> choices for potential users. However, the aforementioned implementations are 
> all written in C++ with no plan to support a Java interface, making them hard 
> to integrate into Java projects or for those who are not familiar with C/C++ 
> [https://github.com/facebookresearch/faiss/issues/105]. 
> The algorithms for vector retrieval can be roughly classified into four 
> categories,
>  # Tree-based algorithms, such as KD-tree;
>  # Hashing methods, such as LSH (Locality-Sensitive Hashing);
>  # Product quantization based algorithms, such as IVFFlat;
>  # Graph-based algorithms, such as HNSW, SSG, NSG;
> where IVFFlat and HNSW are the most popular ones among all the VR algorithms.
> IVFFlat is better for high-precision applications such as face recognition, 
> while HNSW performs better in general scenarios including recommendation and 
> personalized advertisement. *The recall ratio of IVFFlat can be gradually 
> increased by adjusting the query parameter (nprobe), while it is hard for 
> HNSW to improve its accuracy*. In theory, IVFFlat could achieve a 100% 
> recall ratio. 
> Recently, the implementation of HNSW (Hierarchical Navigable Small World, 
> LUCENE-9004) for Lucene has made great progress. The issue draws the 
> attention of those who are interested in Lucene or hope to use HNSW with 
> Solr/Lucene. 
> As an alternative for solving ANN similarity search problems, IVFFlat is also 
> very popular with many users and supporters. Compared with HNSW, IVFFlat has 
> a smaller index size but requires k-means clustering, while HNSW is faster in 
> query (no training required) but requires extra storage for saving graphs 
> [indexing 1M 
> vectors|https://github.com/facebookresearch/faiss/wiki/Indexing-1M-vectors]. 
> Another advantage is that IVFFlat can be faster and more accurate when GPU 
> parallel computing is enabled (currently not supported in Java). Both 
> algorithms have their merits and demerits. Since HNSW is now under 
> development, it may be better to provide both implementations (HNSW && 
> IVFFlat) for potential users who are faced with very different scenarios and 
> want more choices.
> The latest branch is 
> [*lucene-9136-ann-ivfflat*|https://github.com/irvingzhang/lucene-solr/commits/jira/lucene-9136-ann-ivfflat]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Issue Comment Deleted] (LUCENE-9136) Introduce IVFFlat to Lucene for ANN similarity search

2020-03-06 Thread Xin-Chun Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-9136?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xin-Chun Zhang updated LUCENE-9136:
---
Comment: was deleted

(was: Hi, [~jtibshirani], thanks for your suggestions!

??"I wonder if this clustering-based approach could fit more closely in the 
current search framework. In the current prototype, we keep all the cluster 
information on-heap. We could instead try storing each cluster as its own 
'term' with a postings list. The kNN query would then be modelled as an 'OR' 
over these terms."??

In the previous implementation 
([https://github.com/irvingzhang/lucene-solr/commit/eb5f79ea7a705595821f73f80a0c5752061869b2]),
 the cluster information is divided into two parts – meta (.ifi) and data (.ifd) 
as shown in the following figure, where each cluster with a postings list is 
stored in the data file (.ifd) and not kept on-heap. A major concern with this 
implementation is the read performance for cluster data, since reads happen 
very frequently during kNN search. I will test and check the performance. 

!image-2020-02-16-15-05-02-451.png!

??"Because of this concern, it could be nice to include benchmarks for index 
time (in addition to QPS)..."??

Many thanks! I will check the links you mentioned and consider optimizing the 
clustering cost. In addition, more benchmarks will be added soon.

 
h2. *UPDATE – Feb. 24, 2020*

I have added a new implementation of the IVF index, marked as *V2* 
under the package org.apache.lucene.codecs.lucene90. In the current implementation, 
the IVF index has been divided into two files with suffixes .ifi and .ifd, 
respectively. The .ifd file will be read if cluster information is needed. The 
experiments are conducted on dataset sift1M (Test codes: 
[https://github.com/irvingzhang/lucene-solr/blob/jira/LUCENE-9136/lucene/core/src/test/org/apache/lucene/util/ivfflat/KnnIvfPerformTester.java]),
 detailed results are as follows,
 # add document -- 3921 ms;
 # commit -- 3912286 ms (mainly spent on k-means training, 10 iterations, 4000 
centroids, totally 512,000 vectors used for training);
 # R@100 recall time and recall ratio are listed in the following table

 
||nprobe||avg. search time (ms)||recall ratio (%)||
|8|28.0755|44.154|
|16|27.1745|57.9945|
|32|32.986|71.7003|
|64|40.4082|83.50471|
|128|50.9569|92.07929|
|256|73.923|97.150894|

 Compared with the on-heap implementation of the IVF index, the query time 
increases significantly (22%~71%). Actually, the IVF index is comprised only of 
unique docIDs, and will not take up too much memory. *There is a small argument 
about whether to keep the cluster information on-heap or not. Hope to hear more 
suggestions.*

 

 )

> Introduce IVFFlat to Lucene for ANN similarity search
> -
>
> Key: LUCENE-9136
> URL: https://issues.apache.org/jira/browse/LUCENE-9136
> Project: Lucene - Core
>  Issue Type: New Feature
>Reporter: Xin-Chun Zhang
>Priority: Major
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
> Representation learning (RL) has been an established discipline in the 
> machine learning space for decades but it draws tremendous attention lately 
> with the emergence of deep learning. The central problem of RL is to 
> determine an optimal representation of the input data. By embedding the data 
> into a high dimensional vector, the vector retrieval (VR) method is then 
> applied to search the relevant items.
> With the rapid development of RL over the past few years, the technique has 
> been used extensively in industry from online advertising to computer vision 
> and speech recognition. There exist many open source implementations of VR 
> algorithms, such as Facebook's FAISS and Microsoft's SPTAG, providing various 
> choices for potential users. However, the aforementioned implementations are 
> all written in C++, with no plan to support a Java interface, which makes them 
> hard to integrate into Java projects and hard to use for those who are not 
> familiar with C/C++ [[https://github.com/facebookresearch/faiss/issues/105]]. 
> The algorithms for vector retrieval can be roughly classified into four 
> categories,
>  # Tree-based algorithms, such as KD-tree;
>  # Hashing methods, such as LSH (Locality-Sensitive Hashing);
>  # Product quantization based algorithms, such as IVFFlat;
>  # Graph-based algorithms, such as HNSW, SSG, NSG;
> where IVFFlat and HNSW are the most popular ones among all the VR algorithms.
> IVFFlat is better for high-precision applications such as face recognition, 
> while HNSW performs better in general scenarios including recommendation and 
> personalized advertisement. *The recall ratio of IVFFlat could be gradually 
> increased by adjusting the query parameter (nprobe), while it's hard for HNSW 
> to improve its accuracy*. In theory, IVFFlat could achieve 100% recall ratio. 

[jira] [Commented] (LUCENE-9258) DocTermsIndexDocValues should not assume it's operating on a SortedDocValues field

2020-03-06 Thread David Smiley (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9258?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17053589#comment-17053589
 ] 

David Smiley commented on LUCENE-9258:
--

Makes sense to me; your test is perfect.  I'm curious; how did you see this at 
a higher level (e.g. Solr or ES)?  The issue title & details here are a bit 
geeky / low-level and I'm trying to think of a good CHANGES.txt entry that 
might be more meaningful to users.

> DocTermsIndexDocValues should not assume it's operating on a SortedDocValues 
> field
> --
>
> Key: LUCENE-9258
> URL: https://issues.apache.org/jira/browse/LUCENE-9258
> Project: Lucene - Core
>  Issue Type: Bug
>Affects Versions: 7.7.2, 8.4
>Reporter: Michele Palmia
>Assignee: David Smiley
>Priority: Minor
> Attachments: LUCENE-9258.patch
>
>
> When requesting a new _ValueSourceScorer_ (with _getRangeScorer_) from 
> _DocTermsIndexDocValues_ , the latter instantiates a new iterator on 
> _SortedDocValues_ regardless of the fact that the underlying field can 
> actually be of a different type (e.g. a _SortedSetDocValues_ processed 
> through a _SortedSetSelector_).
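
A minimal sketch of the setup that trips this, assuming a SORTED_SET field 
narrowed to a per-document view (illustrative; the attached patch is 
authoritative):

{code:java}
import java.io.IOException;
import org.apache.lucene.index.DocValues;
import org.apache.lucene.index.LeafReader;
import org.apache.lucene.index.SortedDocValues;
import org.apache.lucene.index.SortedSetDocValues;
import org.apache.lucene.search.SortedSetSelector;

// The field is SORTED_SET; SortedSetSelector narrows it to a SortedDocValues
// view. DocTermsIndexDocValues should iterate this view rather than assume the
// field can be re-opened as a plain SORTED field.
static SortedDocValues perDocView(LeafReader reader, String field) throws IOException {
  SortedSetDocValues ssdv = DocValues.getSortedSet(reader, field);
  return SortedSetSelector.wrap(ssdv, SortedSetSelector.Type.MIN);
}
{code}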



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Assigned] (LUCENE-9258) DocTermsIndexDocValues should not assume it's operating on a SortedDocValues field

2020-03-06 Thread David Smiley (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-9258?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David Smiley reassigned LUCENE-9258:


Assignee: David Smiley

> DocTermsIndexDocValues should not assume it's operating on a SortedDocValues 
> field
> --
>
> Key: LUCENE-9258
> URL: https://issues.apache.org/jira/browse/LUCENE-9258
> Project: Lucene - Core
>  Issue Type: Bug
>Affects Versions: 7.7.2, 8.4
>Reporter: Michele Palmia
>Assignee: David Smiley
>Priority: Minor
> Attachments: LUCENE-9258.patch
>
>
> When requesting a new _ValueSourceScorer_ (with _getRangeScorer_) from 
> _DocTermsIndexDocValues_ , the latter instantiates a new iterator on 
> _SortedDocValues_ regardless of the fact that the underlying field can 
> actually be of a different type (e.g. a _SortedSetDocValues_ processed 
> through a _SortedSetSelector_).



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene-solr] bruno-roustant commented on issue #1301: LUCENE-9254: UniformSplit supports FST off-heap.

2020-03-06 Thread GitBox
bruno-roustant commented on issue #1301: LUCENE-9254: UniformSplit supports FST 
off-heap.
URL: https://github.com/apache/lucene-solr/pull/1301#issuecomment-595856255
 
 
   Updated after LUCENE-9257 removed FSTLoadMode. Now FST is off-heap by 
default. It is possible to force it with a boolean in the 
UniformSplitPostingsFormat.
   Also, FST is always on-heap if there is block encoding/decoding.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Comment Edited] (SOLR-11359) An autoscaling/suggestions endpoint to recommend operations

2020-03-06 Thread Megan Carey (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-11359?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17052544#comment-17052544
 ] 

Megan Carey edited comment on SOLR-11359 at 3/6/20, 4:38 PM:
-

Would it be possible to explicitly return the URL to hit for applying the 
suggestion? i.e. rather than return an HTTP method, operation type, etc. just 
return the constructed URL for executing the action?

Also, are you considering writing a cron to periodically execute these 
suggestions? Or was the intention for these to be manually applied? 
[~noble.paul]


was (Author: megancarey):
Would it be possible to explicitly return the URL to hit for applying the 
suggestion? i.e. rather than return an HTTP method, operation type, etc. just 
return the constructed URL for executing the action?

Also, are you considering writing a cron to periodically execute these 
suggestions?

> An autoscaling/suggestions endpoint to recommend operations
> ---
>
> Key: SOLR-11359
> URL: https://issues.apache.org/jira/browse/SOLR-11359
> Project: Solr
>  Issue Type: New Feature
>  Components: AutoScaling
>Reporter: Noble Paul
>Assignee: Noble Paul
>Priority: Major
> Attachments: SOLR-11359.patch
>
>
> Autoscaling can make suggestions to users on what operations they can perform 
> to improve the health of the cluster
> The suggestions will have the following information
> * http end point
> * http method (POST,DELETE)
> * command payload



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Assigned] (SOLR-11725) json.facet's stddev() function should be changed to use the "Corrected sample stddev" formula

2020-03-06 Thread Munendra S N (Jira)


 [ 
https://issues.apache.org/jira/browse/SOLR-11725?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Munendra S N reassigned SOLR-11725:
---

Assignee: Munendra S N

> json.facet's stddev() function should be changed to use the "Corrected sample 
> stddev" formula
> -
>
> Key: SOLR-11725
> URL: https://issues.apache.org/jira/browse/SOLR-11725
> Project: Solr
>  Issue Type: Sub-task
>  Components: Facet Module
>Reporter: Chris M. Hostetter
>Assignee: Munendra S N
>Priority: Major
> Attachments: SOLR-11725.patch, SOLR-11725.patch, SOLR-11725.patch
>
>
> While working on some equivalence tests/demonstrations for 
> {{facet.pivot+stats.field}} vs {{json.facet}} I noticed that the {{stddev}} 
> calculations done between the two code paths can be measurably different, and 
> realized this is due to them using very different code...
> * {{json.facet=foo:stddev(foo)}}
> ** {{StddevAgg.java}}
> ** {{Math.sqrt((sumSq/count)-Math.pow(sum/count, 2))}}
> * {{stats.field=\{!stddev=true\}foo}}
> ** {{StatsValuesFactory.java}}
> ** {{Math.sqrt(((count * sumOfSquares) - (sum * sum)) / (count * (count - 
> 1.0D)))}}
> Since I"m not really a math guy, I consulting with a bunch of smart math/stat 
> nerds I know online to help me sanity check if these equations (some how) 
> reduced to eachother (In which case the discrepancies I was seeing in my 
> results might have just been due to the order of intermediate operation 
> execution & floating point rounding differences).
> They confirmed that the two bits of code are _not_ equivalent to each other, 
> and explained that the code JSON Faceting is using is equivalent to the 
> "Uncorrected sample stddev" formula, while StatsComponent's code is 
> equivalent to the the "Corrected sample stddev" formula...
> https://en.wikipedia.org/wiki/Standard_deviation#Uncorrected_sample_standard_deviation
> When I told them that stuff like this is why no one likes mathematicians and 
> pressed them to explain which one was the "most canonical" (or "most 
> generally applicable" or "best") definition of stddev, I was told that:
> # This is something statisticians frequently disagree on
> # Practically speaking, the two calculations don't tend to differ 
> significantly when count is "very large"
> # _"Corrected sample stddev" is more appropriate when comparing two 
> distributions_
> Given that:
> * the primary usage of computing the stddev of a field/function against a 
> Solr result set (or against a sub-set of results defined by a facet 
> constraint) is probably to compare that distribution to a different Solr 
> result set (or to compare N sub-sets of results defined by N facet 
> constraints)
> * the size of the sets of documents (values) can be relatively small when 
> computing stats over facet constraint sub-sets
> ...it seems like {{StddevAgg.java}} should be updated to use the "Corrected 
> sample stddev" equation.
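
For concreteness, the "Corrected sample stddev" expressed in the same running 
aggregates the snippets above already track would look something like this (a 
sketch, not the attached patch):

{code:java}
// Bessel-corrected sample stddev from running aggregates:
//   sqrt( (sumSq - sum*sum/count) / (count - 1) )
// Algebraically equal to the StatsValuesFactory formula quoted above;
// guard against count <= 1 to avoid dividing by zero.
static double correctedStddev(double count, double sum, double sumSq) {
  if (count <= 1.0) return 0.0;
  return Math.sqrt((sumSq - (sum * sum) / count) / (count - 1.0));
}
{code}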



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Updated] (SOLR-13893) BlobRepository looks at the wrong system variable (runtme.lib.size)

2020-03-06 Thread Munendra S N (Jira)


 [ 
https://issues.apache.org/jira/browse/SOLR-13893?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Munendra S N updated SOLR-13893:

Attachment: SOLR-13893.patch

> BlobRepository looks at the wrong system variable (runtme.lib.size)
> ---
>
> Key: SOLR-13893
> URL: https://issues.apache.org/jira/browse/SOLR-13893
> Project: Solr
>  Issue Type: Bug
>Reporter: Erick Erickson
>Assignee: Munendra S N
>Priority: Major
> Attachments: SOLR-13893.patch, SOLR-13893.patch
>
>
> Tim Swetland on the user's list pointed out this line in BlobRepository:
> private static final long MAX_JAR_SIZE = 
> Long.parseLong(System.getProperty("runtme.lib.size", String.valueOf(5 * 1024 
> * 1024)));
> "runtme" can't be right.
> [~ichattopadhyaya][~noblepaul] what's your opinion?
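
Presumably the intended property name is this (a sketch; the attached patches 
define the actual change, possibly with back-compat for the misspelling):

{code:java}
private static final long MAX_JAR_SIZE =
    Long.parseLong(System.getProperty("runtime.lib.size", String.valueOf(5 * 1024 * 1024)));
{code}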



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (SOLR-14289) Solr may attempt to check Chroot after already having connected once

2020-03-06 Thread Mike Drob (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-14289?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17053561#comment-17053561
 ] 

Mike Drob commented on SOLR-14289:
--

[~dsmiley] - seems like we're working on similar problems around speeding up 
core startup - can you take a look at this and let me know what you think?

> Solr may attempt to check Chroot after already having connected once
> 
>
> Key: SOLR-14289
> URL: https://issues.apache.org/jira/browse/SOLR-14289
> Project: Solr
>  Issue Type: Task
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: Server
>Reporter: Mike Drob
>Assignee: Mike Drob
>Priority: Major
> Attachments: Screen Shot 2020-02-26 at 2.56.14 PM.png
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> On server startup, we will attempt to load the solr.xml from zookeeper if we 
> have the right properties set, and then later when starting up the core 
> container will take time to verify (and create) the chroot even if it is the 
> same string that we already used before. We can likely skip the second 
> short-lived zookeeper connection to speed up our startup sequence a little 
> bit.
>  
> See this attached image from thread profiling during startup.
> !Screen Shot 2020-02-26 at 2.56.14 PM.png!



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Updated] (SOLR-13893) BlobRepository looks at the wrong system variable (runtme.lib.size)

2020-03-06 Thread Munendra S N (Jira)


 [ 
https://issues.apache.org/jira/browse/SOLR-13893?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Munendra S N updated SOLR-13893:

Status: Patch Available  (was: Open)

> BlobRepository looks at the wrong system variable (runtme.lib.size)
> ---
>
> Key: SOLR-13893
> URL: https://issues.apache.org/jira/browse/SOLR-13893
> Project: Solr
>  Issue Type: Bug
>Reporter: Erick Erickson
>Assignee: Munendra S N
>Priority: Major
> Attachments: SOLR-13893.patch, SOLR-13893.patch
>
>
> Tim Swetland on the user's list pointed out this line in BlobRepository:
> private static final long MAX_JAR_SIZE = 
> Long.parseLong(System.getProperty("runtme.lib.size", String.valueOf(5 * 1024 
> * 1024)));
> "runtme" can't be right.
> [~ichattopadhyaya][~noblepaul] what's your opinion?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (SOLR-13893) BlobRepository looks at the wrong system variable (runtme.lib.size)

2020-03-06 Thread Munendra S N (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-13893?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17053562#comment-17053562
 ] 

Munendra S N commented on SOLR-13893:
-

 [^SOLR-13893.patch] 
Slightly modified patch

> BlobRepository looks at the wrong system variable (runtme.lib.size)
> ---
>
> Key: SOLR-13893
> URL: https://issues.apache.org/jira/browse/SOLR-13893
> Project: Solr
>  Issue Type: Bug
>Reporter: Erick Erickson
>Assignee: Munendra S N
>Priority: Major
> Attachments: SOLR-13893.patch, SOLR-13893.patch
>
>
> Tim Swetland on the user's list pointed out this line in BlobRepository:
> private static final long MAX_JAR_SIZE = 
> Long.parseLong(System.getProperty("runtme.lib.size", String.valueOf(5 * 1024 
> * 1024)));
> "runtme" can't be right.
> [~ichattopadhyaya][~noblepaul] what's your opinion?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-8962) Can we merge small segments during refresh, for faster searching?

2020-03-06 Thread Nhat Nguyen (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-8962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17053545#comment-17053545
 ] 

Nhat Nguyen commented on LUCENE-8962:
-

Some engine tests in Elasticsearch are failing because of this change. I am 
working to backport them to Lucene so that we can catch similar issues in 
Lucene.

> Can we merge small segments during refresh, for faster searching?
> -
>
> Key: LUCENE-8962
> URL: https://issues.apache.org/jira/browse/LUCENE-8962
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/index
>Reporter: Michael McCandless
>Priority: Major
> Fix For: 8.5
>
> Attachments: LUCENE-8962_demo.png
>
>  Time Spent: 9.5h
>  Remaining Estimate: 0h
>
> With near-real-time search we ask {{IndexWriter}} to write all in-memory 
> segments to disk and open an {{IndexReader}} to search them, and this is 
> typically a quick operation.
> However, when you use many threads for concurrent indexing, {{IndexWriter}} 
> will accumulate many small segments written during {{refresh}}, and this then 
> adds search-time cost as searching must visit all of these tiny segments.
> The merge policy would normally quickly coalesce these small segments if 
> given a little time ... so, could we somehow improve {{IndexWriter}}'s 
> refresh to optionally kick off the merge policy to merge segments below some 
> threshold before opening the near-real-time reader?  It'd be a bit tricky 
> because while we are waiting for merges, indexing may continue, and new 
> segments may be flushed, but those new segments shouldn't be included in the 
> point-in-time segments returned by refresh ...
> One could almost do this on top of Lucene today, with a custom merge policy, 
> and some hackity logic to have the merge policy target small segments just 
> written by refresh, but it's tricky to then open a near-real-time reader, 
> excluding newly flushed but including newly merged segments since the refresh 
> originally finished ...
> I'm not yet sure how best to solve this, so I wanted to open an issue for 
> discussion!
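
The "custom merge policy" workaround mentioned in the description could look 
roughly like the sketch below; it ignores the point-in-time subtleties the 
issue is really about, and the class name and threshold are made up:

{code:java}
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.lucene.index.*;

// Merge any segments smaller than a fixed threshold; delegate everything else
// to the wrapped policy. Illustration only.
public class TinySegmentMergePolicy extends FilterMergePolicy {
  private static final long THRESHOLD_BYTES = 1L << 20; // 1 MB, arbitrary

  public TinySegmentMergePolicy(MergePolicy in) { super(in); }

  @Override
  public MergeSpecification findMerges(MergeTrigger trigger, SegmentInfos infos,
      MergeContext ctx) throws IOException {
    List<SegmentCommitInfo> tiny = new ArrayList<>();
    for (SegmentCommitInfo sci : infos) {
      if (!ctx.getMergingSegments().contains(sci) && sci.sizeInBytes() < THRESHOLD_BYTES) {
        tiny.add(sci);
      }
    }
    if (tiny.size() < 2) {
      return in.findMerges(trigger, infos, ctx);
    }
    MergeSpecification spec = new MergeSpecification();
    spec.add(new OneMerge(tiny));
    return spec;
  }
}
{code}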



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-8103) QueryValueSource should use TwoPhaseIterator

2020-03-06 Thread David Smiley (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-8103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17053517#comment-17053517
 ] 

David Smiley commented on LUCENE-8103:
--

Notice that {{TwoPhaseIterator.asDocIdSetIterator(tpi)}} will return an 
implementation whose {{advance(docId)}} method will move beyond the passed-in 
docID and call matches until it finds a match.  That is a waste _if the user of 
this DISI doesn't care what the next matching document is when the approximation 
doesn't match_.  So QueryValueSource's exists() method could work with the 
approximation first and, if that matches, then and only then call TPI.matches().  If 
there is no TPI then the scorer's DISI is accurate.
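
Roughly this shape, presumably (a sketch of the suggestion, not the attached 
patch):

{code:java}
import java.io.IOException;
import org.apache.lucene.search.Scorer;
import org.apache.lucene.search.TwoPhaseIterator;

// Inside QueryValueSource's exists(doc): consult the approximation first and
// run the (possibly expensive) second phase only when it lands on doc.
static boolean exists(Scorer scorer, int doc) throws IOException {
  TwoPhaseIterator tpi = scorer.twoPhaseIterator();
  if (tpi == null) {
    // No two-phase support: the scorer's DISI is already exact.
    return scorer.iterator().advance(doc) == doc;
  }
  return tpi.approximation().advance(doc) == doc && tpi.matches();
}
{code}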

> QueryValueSource should use TwoPhaseIterator
> 
>
> Key: LUCENE-8103
> URL: https://issues.apache.org/jira/browse/LUCENE-8103
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/other
>Reporter: David Smiley
>Priority: Minor
> Attachments: LUCENE-8103.patch
>
>
> QueryValueSource (in "queries" module) is a ValueSource representation of a 
> Query; the score is the value.  It ought to try to use a TwoPhaseIterator 
> from the query if it can be offered. This will prevent possibly expensive 
> advancing beyond documents that we aren't interested in.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Assigned] (SOLR-13199) NPE due to unexpected null return value from QueryBitSetProducer.getBitSet

2020-03-06 Thread Munendra S N (Jira)


 [ 
https://issues.apache.org/jira/browse/SOLR-13199?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Munendra S N reassigned SOLR-13199:
---

Assignee: Munendra S N

> NPE due to unexpected null return value from QueryBitSetProducer.getBitSet
> --
>
> Key: SOLR-13199
> URL: https://issues.apache.org/jira/browse/SOLR-13199
> Project: Solr
>  Issue Type: Bug
>  Components: search
>Affects Versions: master (9.0)
> Environment: h1. Steps to reproduce
> * Use a Linux machine.
> *  Build commit {{ea2c8ba}} of Solr as described in the section below.
> * Build the films collection as described below.
> * Start the server using the command {{./bin/solr start -f -p 8983 -s 
> /tmp/home}}
> * Request the URL given in the bug description.
> h1. Compiling the server
> {noformat}
> git clone https://github.com/apache/lucene-solr
> cd lucene-solr
> git checkout ea2c8ba
> ant compile
> cd solr
> ant server
> {noformat}
> h1. Building the collection
> We followed [Exercise 
> 2|http://lucene.apache.org/solr/guide/7_5/solr-tutorial.html#exercise-2] from 
> the [Solr 
> Tutorial|http://lucene.apache.org/solr/guide/7_5/solr-tutorial.html]. The 
> attached file ({{home.zip}}) gives the contents of folder {{/tmp/home}} that 
> you will obtain by following the steps below:
> {noformat}
> mkdir -p /tmp/home
> echo '<solr></solr>' > 
> /tmp/home/solr.xml
> {noformat}
> In one terminal start a Solr instance in foreground:
> {noformat}
> ./bin/solr start -f -p 8983 -s /tmp/home
> {noformat}
> In another terminal, create a collection of movies, with no shards and no 
> replication, and initialize it:
> {noformat}
> bin/solr create -c films
> curl -X POST -H 'Content-type:application/json' --data-binary '{"add-field": 
> {"name":"name", "type":"text_general", "multiValued":false, "stored":true}}' 
> http://localhost:8983/solr/films/schema
> curl -X POST -H 'Content-type:application/json' --data-binary 
> '{"add-copy-field" : {"source":"*","dest":"_text_"}}' 
> http://localhost:8983/solr/films/schema
> ./bin/post -c films example/films/films.json
> {noformat}
>Reporter: Johannes Kloos
>Assignee: Munendra S N
>Priority: Minor
>  Labels: diffblue, newdev
> Attachments: SOLR-13199.patch, home.zip
>
>
> Requesting the following URL causes Solr to return an HTTP 500 error response:
> {noformat}
> http://localhost:8983/solr/films/select?fl=[child%20parentFilter=ge]=*:*
> {noformat}
> The error response seems to be caused by the following uncaught exception:
> {noformat}
> java.lang.NullPointerException
> at 
> org.apache.solr.response.transform.ChildDocTransformer.transform(ChildDocTransformer.java:92)
> at org.apache.solr.response.DocsStreamer.next(DocsStreamer.java:103)
> at org.apache.solr.response.DocsStreamer.next(DocsStreamer.java:1)
> at 
> org.apache.solr.response.TextResponseWriter.writeDocuments(TextResponseWriter.java:184)
> at 
> org.apache.solr.response.TextResponseWriter.writeVal(TextResponseWriter.java:136)
> at 
> org.apache.solr.common.util.JsonTextWriter.writeNamedListAsMapWithDups(JsonTextWriter.java:386)
> at 
> org.apache.solr.common.util.JsonTextWriter.writeNamedList(JsonTextWriter.java:292)
> at org.apache.solr.response.JSONWriter.writeResponse(JSONWriter.java:73)
> {noformat}
> In ChildDocTransformer.transform, we have the following lines:
> {noformat}
> final BitSet segParentsBitSet = parentsFilter.getBitSet(leafReaderContext);
> final int segPrevRootId = segRootId==0? -1: 
> segParentsBitSet.prevSetBit(segRootId - 1); // can return -1 and that's okay
> {noformat}
> But getBitSet can return null if the set of DocIds is empty:
> {noformat}
> return docIdSet == DocIdSet.EMPTY ? null : ((BitDocIdSet) docIdSet).bits();
> {noformat}
> We found this bug using [Diffblue Microservices 
> Testing|https://www.diffblue.com/labs/?utm_source=solr-br]. Find more 
> information on this [fuzz testing 
> campaign|https://www.diffblue.com/blog/2018/12/19/diffblue-microservice-testing-a-sneak-peek-at-our-early-product-and-results?utm_source=solr-br].
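
A minimal defensive sketch for ChildDocTransformer (assumed shape; the patch 
discussed later in the thread instead resolves a null parentFilter to 
{{MatchNoDocsQuery}}):

{code:java}
// Sketch: handle a null bit set (no parent docs in this segment) explicitly
// instead of dereferencing it.
final BitSet segParentsBitSet = parentsFilter.getBitSet(leafReaderContext);
if (segParentsBitSet == null) {
  throw new SolrException(SolrException.ErrorCode.BAD_REQUEST,
      "Parent filter matches no parent documents in this segment");
}
final int segPrevRootId = segRootId == 0 ? -1 : segParentsBitSet.prevSetBit(segRootId - 1);
{code}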



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (SOLR-13199) NPE due to unexpected null return value from QueryBitSetProducer.getBitSet

2020-03-06 Thread Munendra S N (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-13199?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17053452#comment-17053452
 ] 

Munendra S N commented on SOLR-13199:
-

 [^SOLR-13199.patch] 
The NPE still occurs when the transformer is used without a nestedPath field. I 
have removed the version check, which wasn't required.
When a parentFilter string is specified but resolves to {{null}} after parsing, 
the parentFilter is now set to {{MatchNoDocsQuery}}.
[~dsmiley] Could you please review this once?

> NPE due to unexpected null return value from QueryBitSetProducer.getBitSet
> --
>
> Key: SOLR-13199
> URL: https://issues.apache.org/jira/browse/SOLR-13199
> Project: Solr
>  Issue Type: Bug
>  Components: search
>Affects Versions: master (9.0)
> Environment: h1. Steps to reproduce
> * Use a Linux machine.
> *  Build commit {{ea2c8ba}} of Solr as described in the section below.
> * Build the films collection as described below.
> * Start the server using the command {{./bin/solr start -f -p 8983 -s 
> /tmp/home}}
> * Request the URL given in the bug description.
> h1. Compiling the server
> {noformat}
> git clone https://github.com/apache/lucene-solr
> cd lucene-solr
> git checkout ea2c8ba
> ant compile
> cd solr
> ant server
> {noformat}
> h1. Building the collection
> We followed [Exercise 
> 2|http://lucene.apache.org/solr/guide/7_5/solr-tutorial.html#exercise-2] from 
> the [Solr 
> Tutorial|http://lucene.apache.org/solr/guide/7_5/solr-tutorial.html]. The 
> attached file ({{home.zip}}) gives the contents of folder {{/tmp/home}} that 
> you will obtain by following the steps below:
> {noformat}
> mkdir -p /tmp/home
> echo '<solr></solr>' > 
> /tmp/home/solr.xml
> {noformat}
> In one terminal start a Solr instance in foreground:
> {noformat}
> ./bin/solr start -f -p 8983 -s /tmp/home
> {noformat}
> In another terminal, create a collection of movies, with no shards and no 
> replication, and initialize it:
> {noformat}
> bin/solr create -c films
> curl -X POST -H 'Content-type:application/json' --data-binary '{"add-field": 
> {"name":"name", "type":"text_general", "multiValued":false, "stored":true}}' 
> http://localhost:8983/solr/films/schema
> curl -X POST -H 'Content-type:application/json' --data-binary 
> '{"add-copy-field" : {"source":"*","dest":"_text_"}}' 
> http://localhost:8983/solr/films/schema
> ./bin/post -c films example/films/films.json
> {noformat}
>Reporter: Johannes Kloos
>Priority: Minor
>  Labels: diffblue, newdev
> Attachments: SOLR-13199.patch, home.zip
>
>
> Requesting the following URL causes Solr to return an HTTP 500 error response:
> {noformat}
> http://localhost:8983/solr/films/select?fl=[child%20parentFilter=ge]=*:*
> {noformat}
> The error response seems to be caused by the following uncaught exception:
> {noformat}
> java.lang.NullPointerException
> at 
> org.apache.solr.response.transform.ChildDocTransformer.transform(ChildDocTransformer.java:92)
> at org.apache.solr.response.DocsStreamer.next(DocsStreamer.java:103)
> at org.apache.solr.response.DocsStreamer.next(DocsStreamer.java:1)
> at 
> org.apache.solr.response.TextResponseWriter.writeDocuments(TextResponseWriter.java:184)
> at 
> org.apache.solr.response.TextResponseWriter.writeVal(TextResponseWriter.java:136)
> at 
> org.apache.solr.common.util.JsonTextWriter.writeNamedListAsMapWithDups(JsonTextWriter.java:386)
> at 
> org.apache.solr.common.util.JsonTextWriter.writeNamedList(JsonTextWriter.java:292)
> at org.apache.solr.response.JSONWriter.writeResponse(JSONWriter.java:73)
> {noformat}
> In ChildDocTransformer.transform, we have the following lines:
> {noformat}
> final BitSet segParentsBitSet = parentsFilter.getBitSet(leafReaderContext);
> final int segPrevRootId = segRootId==0? -1: 
> segParentsBitSet.prevSetBit(segRootId - 1); // can return -1 and that's okay
> {noformat}
> But getBitSet can return null if the set of DocIds is empty:
> {noformat}
> return docIdSet == DocIdSet.EMPTY ? null : ((BitDocIdSet) docIdSet).bits();
> {noformat}
> We found this bug using [Diffblue Microservices 
> Testing|https://www.diffblue.com/labs/?utm_source=solr-br]. Find more 
> information on this [fuzz testing 
> campaign|https://www.diffblue.com/blog/2018/12/19/diffblue-microservice-testing-a-sneak-peek-at-our-early-product-and-results?utm_source=solr-br].



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Updated] (SOLR-13199) NPE due to unexpected null return value from QueryBitSetProducer.getBitSet

2020-03-06 Thread Munendra S N (Jira)


 [ 
https://issues.apache.org/jira/browse/SOLR-13199?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Munendra S N updated SOLR-13199:

Status: Patch Available  (was: Open)

> NPE due to unexpected null return value from QueryBitSetProducer.getBitSet
> --
>
> Key: SOLR-13199
> URL: https://issues.apache.org/jira/browse/SOLR-13199
> Project: Solr
>  Issue Type: Bug
>  Components: search
>Affects Versions: master (9.0)
> Environment: h1. Steps to reproduce
> * Use a Linux machine.
> *  Build commit {{ea2c8ba}} of Solr as described in the section below.
> * Build the films collection as described below.
> * Start the server using the command {{./bin/solr start -f -p 8983 -s 
> /tmp/home}}
> * Request the URL given in the bug description.
> h1. Compiling the server
> {noformat}
> git clone https://github.com/apache/lucene-solr
> cd lucene-solr
> git checkout ea2c8ba
> ant compile
> cd solr
> ant server
> {noformat}
> h1. Building the collection
> We followed [Exercise 
> 2|http://lucene.apache.org/solr/guide/7_5/solr-tutorial.html#exercise-2] from 
> the [Solr 
> Tutorial|http://lucene.apache.org/solr/guide/7_5/solr-tutorial.html]. The 
> attached file ({{home.zip}}) gives the contents of folder {{/tmp/home}} that 
> you will obtain by following the steps below:
> {noformat}
> mkdir -p /tmp/home
> echo '<solr></solr>' > 
> /tmp/home/solr.xml
> {noformat}
> In one terminal start a Solr instance in foreground:
> {noformat}
> ./bin/solr start -f -p 8983 -s /tmp/home
> {noformat}
> In another terminal, create a collection of movies, with no shards and no 
> replication, and initialize it:
> {noformat}
> bin/solr create -c films
> curl -X POST -H 'Content-type:application/json' --data-binary '{"add-field": 
> {"name":"name", "type":"text_general", "multiValued":false, "stored":true}}' 
> http://localhost:8983/solr/films/schema
> curl -X POST -H 'Content-type:application/json' --data-binary 
> '{"add-copy-field" : {"source":"*","dest":"_text_"}}' 
> http://localhost:8983/solr/films/schema
> ./bin/post -c films example/films/films.json
> {noformat}
>Reporter: Johannes Kloos
>Priority: Minor
>  Labels: diffblue, newdev
> Attachments: SOLR-13199.patch, home.zip
>
>
> Requesting the following URL causes Solr to return an HTTP 500 error response:
> {noformat}
> http://localhost:8983/solr/films/select?fl=[child%20parentFilter=ge]=*:*
> {noformat}
> The error response seems to be caused by the following uncaught exception:
> {noformat}
> java.lang.NullPointerException
> at 
> org.apache.solr.response.transform.ChildDocTransformer.transform(ChildDocTransformer.java:92)
> at org.apache.solr.response.DocsStreamer.next(DocsStreamer.java:103)
> at org.apache.solr.response.DocsStreamer.next(DocsStreamer.java:1)
> at 
> org.apache.solr.response.TextResponseWriter.writeDocuments(TextResponseWriter.java:184)
> at 
> org.apache.solr.response.TextResponseWriter.writeVal(TextResponseWriter.java:136)
> at 
> org.apache.solr.common.util.JsonTextWriter.writeNamedListAsMapWithDups(JsonTextWriter.java:386)
> at 
> org.apache.solr.common.util.JsonTextWriter.writeNamedList(JsonTextWriter.java:292)
> at org.apache.solr.response.JSONWriter.writeResponse(JSONWriter.java:73)
> {noformat}
> In ChildDocTransformer.transform, we have the following lines:
> {noformat}
> final BitSet segParentsBitSet = parentsFilter.getBitSet(leafReaderContext);
> final int segPrevRootId = segRootId==0? -1: 
> segParentsBitSet.prevSetBit(segRootId - 1); // can return -1 and that's okay
> {noformat}
> But getBitSet can return null if the set of DocIds is empty:
> {noformat}
> return docIdSet == DocIdSet.EMPTY ? null : ((BitDocIdSet) docIdSet).bits();
> {noformat}
> We found this bug using [Diffblue Microservices 
> Testing|https://www.diffblue.com/labs/?utm_source=solr-br]. Find more 
> information on this [fuzz testing 
> campaign|https://www.diffblue.com/blog/2018/12/19/diffblue-microservice-testing-a-sneak-peek-at-our-early-product-and-results?utm_source=solr-br].



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Updated] (SOLR-13199) NPE due to unexpected null return value from QueryBitSetProducer.getBitSet

2020-03-06 Thread Munendra S N (Jira)


 [ 
https://issues.apache.org/jira/browse/SOLR-13199?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Munendra S N updated SOLR-13199:

Attachment: SOLR-13199.patch

> NPE due to unexpected null return value from QueryBitSetProducer.getBitSet
> --
>
> Key: SOLR-13199
> URL: https://issues.apache.org/jira/browse/SOLR-13199
> Project: Solr
>  Issue Type: Bug
>  Components: search
>Affects Versions: master (9.0)
> Environment: h1. Steps to reproduce
> * Use a Linux machine.
> *  Build commit {{ea2c8ba}} of Solr as described in the section below.
> * Build the films collection as described below.
> * Start the server using the command {{./bin/solr start -f -p 8983 -s 
> /tmp/home}}
> * Request the URL given in the bug description.
> h1. Compiling the server
> {noformat}
> git clone https://github.com/apache/lucene-solr
> cd lucene-solr
> git checkout ea2c8ba
> ant compile
> cd solr
> ant server
> {noformat}
> h1. Building the collection
> We followed [Exercise 
> 2|http://lucene.apache.org/solr/guide/7_5/solr-tutorial.html#exercise-2] from 
> the [Solr 
> Tutorial|http://lucene.apache.org/solr/guide/7_5/solr-tutorial.html]. The 
> attached file ({{home.zip}}) gives the contents of folder {{/tmp/home}} that 
> you will obtain by following the steps below:
> {noformat}
> mkdir -p /tmp/home
> echo '<solr></solr>' > 
> /tmp/home/solr.xml
> {noformat}
> In one terminal start a Solr instance in foreground:
> {noformat}
> ./bin/solr start -f -p 8983 -s /tmp/home
> {noformat}
> In another terminal, create a collection of movies, with no shards and no 
> replication, and initialize it:
> {noformat}
> bin/solr create -c films
> curl -X POST -H 'Content-type:application/json' --data-binary '{"add-field": 
> {"name":"name", "type":"text_general", "multiValued":false, "stored":true}}' 
> http://localhost:8983/solr/films/schema
> curl -X POST -H 'Content-type:application/json' --data-binary 
> '{"add-copy-field" : {"source":"*","dest":"_text_"}}' 
> http://localhost:8983/solr/films/schema
> ./bin/post -c films example/films/films.json
> {noformat}
>Reporter: Johannes Kloos
>Priority: Minor
>  Labels: diffblue, newdev
> Attachments: SOLR-13199.patch, home.zip
>
>
> Requesting the following URL causes Solr to return an HTTP 500 error response:
> {noformat}
> http://localhost:8983/solr/films/select?fl=[child%20parentFilter=ge]=*:*
> {noformat}
> The error response seems to be caused by the following uncaught exception:
> {noformat}
> java.lang.NullPointerException
> at 
> org.apache.solr.response.transform.ChildDocTransformer.transform(ChildDocTransformer.java:92)
> at org.apache.solr.response.DocsStreamer.next(DocsStreamer.java:103)
> at org.apache.solr.response.DocsStreamer.next(DocsStreamer.java:1)
> at 
> org.apache.solr.response.TextResponseWriter.writeDocuments(TextResponseWriter.java:184)
> at 
> org.apache.solr.response.TextResponseWriter.writeVal(TextResponseWriter.java:136)
> at 
> org.apache.solr.common.util.JsonTextWriter.writeNamedListAsMapWithDups(JsonTextWriter.java:386)
> at 
> org.apache.solr.common.util.JsonTextWriter.writeNamedList(JsonTextWriter.java:292)
> at org.apache.solr.response.JSONWriter.writeResponse(JSONWriter.java:73)
> {noformat}
> In ChildDocTransformer.transform, we have the following lines:
> {noformat}
> final BitSet segParentsBitSet = parentsFilter.getBitSet(leafReaderContext);
> final int segPrevRootId = segRootId==0? -1: 
> segParentsBitSet.prevSetBit(segRootId - 1); // can return -1 and that's okay
> {noformat}
> But getBitSet can return null if the set of DocIds is empty:
> {noformat}
> return docIdSet == DocIdSet.EMPTY ? null : ((BitDocIdSet) docIdSet).bits();
> {noformat}
> We found this bug using [Diffblue Microservices 
> Testing|https://www.diffblue.com/labs/?utm_source=solr-br]. Find more 
> information on this [fuzz testing 
> campaign|https://www.diffblue.com/blog/2018/12/19/diffblue-microservice-testing-a-sneak-peek-at-our-early-product-and-results?utm_source=solr-br].



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene-solr] bruno-roustant closed pull request #1328: LUCENE-9257: Always keep FST off-heap. Remove SegmentReadState.openedFromWriter.

2020-03-06 Thread GitBox
bruno-roustant closed pull request #1328: LUCENE-9257: Always keep FST 
off-heap. Remove SegmentReadState.openedFromWriter.
URL: https://github.com/apache/lucene-solr/pull/1328
 
 
   


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Resolved] (LUCENE-9257) FSTLoadMode should not be BlockTree specific as it is used more generally in index package

2020-03-06 Thread Bruno Roustant (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-9257?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bruno Roustant resolved LUCENE-9257.

Fix Version/s: 8.6
   Resolution: Fixed

Thanks reviewers!

> FSTLoadMode should not be BlockTree specific as it is used more generally in 
> index package
> --
>
> Key: LUCENE-9257
> URL: https://issues.apache.org/jira/browse/LUCENE-9257
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Bruno Roustant
>Priority: Minor
> Fix For: 8.6
>
>  Time Spent: 1.5h
>  Remaining Estimate: 0h
>
> FSTLoadMode and its associate attribute key (static String) are currently 
> defined in BlockTreeTermsReader, but they are actually used outside of 
> BlockTree in the general "index" package.
> CheckIndex and ReadersAndUpdates are using these enum and attribute key to 
> drive the FST load mode through the SegmentReader which is not specific to a 
> postings format. They have an unnecessary dependency to BlockTreeTermsReader.
> We could move FSTLoadMode out of BlockTreeTermsReader, to make it a public 
> enum of the "index" package. That way CheckIndex and ReadersAndUpdates do not 
> import anymore BlockTreeTermsReader.
> This would also allow other postings formats to use the same enum (e.g. 
> LUCENE-9254)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-9257) FSTLoadMode should not be BlockTree specific as it is used more generally in index package

2020-03-06 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9257?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17053432#comment-17053432
 ] 

ASF subversion and git services commented on LUCENE-9257:
-

Commit e7a61eadf6d2f3c722c791e7470a79b2e919cdeb in lucene-solr's branch 
refs/heads/branch_8x from Bruno Roustant
[ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=e7a61ea ]

LUCENE-9257: Always keep FST off-heap. Remove FSTLoadMode, Reader attributes 
and openedFromWriter.


> FSTLoadMode should not be BlockTree specific as it is used more generally in 
> index package
> --
>
> Key: LUCENE-9257
> URL: https://issues.apache.org/jira/browse/LUCENE-9257
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Bruno Roustant
>Priority: Minor
>  Time Spent: 1.5h
>  Remaining Estimate: 0h
>
> FSTLoadMode and its associate attribute key (static String) are currently 
> defined in BlockTreeTermsReader, but they are actually used outside of 
> BlockTree in the general "index" package.
> CheckIndex and ReadersAndUpdates are using these enum and attribute key to 
> drive the FST load mode through the SegmentReader which is not specific to a 
> postings format. They have an unnecessary dependency to BlockTreeTermsReader.
> We could move FSTLoadMode out of BlockTreeTermsReader, to make it a public 
> enum of the "index" package. That way CheckIndex and ReadersAndUpdates do not 
> import anymore BlockTreeTermsReader.
> This would also allow other postings formats to use the same enum (e.g. 
> LUCENE-9254)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-8962) Can we merge small segments during refresh, for faster searching?

2020-03-06 Thread David Smiley (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-8962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17053427#comment-17053427
 ] 

David Smiley commented on LUCENE-8962:
--

Thanks so much for your input Simon!  We need to fight the complexity here.

> Can we merge small segments during refresh, for faster searching?
> -
>
> Key: LUCENE-8962
> URL: https://issues.apache.org/jira/browse/LUCENE-8962
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/index
>Reporter: Michael McCandless
>Priority: Major
> Fix For: 8.5
>
> Attachments: LUCENE-8962_demo.png
>
>  Time Spent: 9.5h
>  Remaining Estimate: 0h
>
> With near-real-time search we ask {{IndexWriter}} to write all in-memory 
> segments to disk and open an {{IndexReader}} to search them, and this is 
> typically a quick operation.
> However, when you use many threads for concurrent indexing, {{IndexWriter}} 
> will accumulate many small segments written during {{refresh}}, and this then 
> adds search-time cost as searching must visit all of these tiny segments.
> The merge policy would normally quickly coalesce these small segments if 
> given a little time ... so, could we somehow improve {{IndexWriter}}'s 
> refresh to optionally kick off the merge policy to merge segments below some 
> threshold before opening the near-real-time reader?  It'd be a bit tricky 
> because while we are waiting for merges, indexing may continue, and new 
> segments may be flushed, but those new segments shouldn't be included in the 
> point-in-time segments returned by refresh ...
> One could almost do this on top of Lucene today, with a custom merge policy, 
> and some hackity logic to have the merge policy target small segments just 
> written by refresh, but it's tricky to then open a near-real-time reader, 
> excluding newly flushed but including newly merged segments since the refresh 
> originally finished ...
> I'm not yet sure how best to solve this, so I wanted to open an issue for 
> discussion!



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-9257) FSTLoadMode should not be BlockTree specific as it is used more generally in index package

2020-03-06 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9257?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17053412#comment-17053412
 ] 

ASF subversion and git services commented on LUCENE-9257:
-

Commit c73d2c15ba7c5936715408807184c99ab7cfdfd4 in lucene-solr's branch 
refs/heads/master from Bruno Roustant
[ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=c73d2c1 ]

LUCENE-9257: Always keep FST off-heap. Remove SegmentReadState.openedFromWriter.


> FSTLoadMode should not be BlockTree specific as it is used more generally in 
> index package
> --
>
> Key: LUCENE-9257
> URL: https://issues.apache.org/jira/browse/LUCENE-9257
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Bruno Roustant
>Priority: Minor
>  Time Spent: 1.5h
>  Remaining Estimate: 0h
>
> FSTLoadMode and its associate attribute key (static String) are currently 
> defined in BlockTreeTermsReader, but they are actually used outside of 
> BlockTree in the general "index" package.
> CheckIndex and ReadersAndUpdates are using these enum and attribute key to 
> drive the FST load mode through the SegmentReader which is not specific to a 
> postings format. They have an unnecessary dependency to BlockTreeTermsReader.
> We could move FSTLoadMode out of BlockTreeTermsReader, to make it a public 
> enum of the "index" package. That way CheckIndex and ReadersAndUpdates do not 
> import anymore BlockTreeTermsReader.
> This would also allow other postings formats to use the same enum (e.g. 
> LUCENE-9254)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene-solr] sigram opened a new pull request #1329: SOLR-14275: Policy calculations are very slow for large clusters and large operations

2020-03-06 Thread GitBox
sigram opened a new pull request #1329: SOLR-14275: Policy calculations are 
very slow for large clusters and large operations
URL: https://github.com/apache/lucene-solr/pull/1329
 
 
   
   
   
   # Description
   
   See JIRA for the explanation of the problem.
   
   # Solution
   
   Try to reduce the combinatorial explosion in the candidate placements. Use 
caching more effectively.
   
   # Tests
   
   Manual performance tests using the scenario.txt attached to JIRA.
   
   # Checklist
   
   Please review the following and check all that apply:
   
   - [ ] I have reviewed the guidelines for [How to 
Contribute](https://wiki.apache.org/solr/HowToContribute) and my code conforms 
to the standards described there to the best of my ability.
   - [ ] I have created a Jira issue and added the issue ID to my pull request 
title.
   - [ ] I have given Solr maintainers 
[access](https://help.github.com/en/articles/allowing-changes-to-a-pull-request-branch-created-from-a-fork)
 to contribute to my PR branch. (optional but recommended)
   - [ ] I have developed this patch against the `master` branch.
   - [ ] I have run `ant precommit` and the appropriate test suite.
   - [ ] I have added tests for my changes.
   - [ ] I have added documentation for the [Ref 
Guide](https://github.com/apache/lucene-solr/tree/master/solr/solr-ref-guide) 
(for Solr changes only).
   


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene-solr] bruno-roustant closed pull request #1305: LUCENE-9257: Make FSTLoadMode enum not BlockTree specific.

2020-03-06 Thread GitBox
bruno-roustant closed pull request #1305: LUCENE-9257: Make FSTLoadMode enum 
not BlockTree specific.
URL: https://github.com/apache/lucene-solr/pull/1305
 
 
   


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene-solr] bruno-roustant commented on issue #1305: LUCENE-9257: Make FSTLoadMode enum not BlockTree specific.

2020-03-06 Thread GitBox
bruno-roustant commented on issue #1305: LUCENE-9257: Make FSTLoadMode enum not 
BlockTree specific.
URL: https://github.com/apache/lucene-solr/pull/1305#issuecomment-595761243
 
 
   Replaced by https://github.com/apache/lucene-solr/pull/1320 to always keep 
FST off-heap.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-9257) FSTLoadMode should not be BlockTree specific as it is used more generally in index package

2020-03-06 Thread Bruno Roustant (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9257?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17053400#comment-17053400
 ] 

Bruno Roustant commented on LUCENE-9257:


While preparing the port to the 8x branch I saw that I forgot a significant 
cleanup: the openedFromWriter boolean, which was also added to support the 
FSTLoadMode logic. So I am removing it as well.
For visibility I added PR #1328, but I'll commit it immediately.

> FSTLoadMode should not be BlockTree specific as it is used more generally in 
> index package
> --
>
> Key: LUCENE-9257
> URL: https://issues.apache.org/jira/browse/LUCENE-9257
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Bruno Roustant
>Priority: Minor
>  Time Spent: 1h 10m
>  Remaining Estimate: 0h
>
> FSTLoadMode and its associate attribute key (static String) are currently 
> defined in BlockTreeTermsReader, but they are actually used outside of 
> BlockTree in the general "index" package.
> CheckIndex and ReadersAndUpdates are using these enum and attribute key to 
> drive the FST load mode through the SegmentReader which is not specific to a 
> postings format. They have an unnecessary dependency to BlockTreeTermsReader.
> We could move FSTLoadMode out of BlockTreeTermsReader, to make it a public 
> enum of the "index" package. That way CheckIndex and ReadersAndUpdates do not 
> import anymore BlockTreeTermsReader.
> This would also allow other postings formats to use the same enum (e.g. 
> LUCENE-9254)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene-solr] bruno-roustant opened a new pull request #1328: LUCENE-9257: Always keep FST off-heap. Remove SegmentReadState.openedFromWriter.

2020-03-06 Thread GitBox
bruno-roustant opened a new pull request #1328: LUCENE-9257: Always keep FST 
off-heap. Remove SegmentReadState.openedFromWriter.
URL: https://github.com/apache/lucene-solr/pull/1328
 
 
   


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (SOLR-11725) json.facet's stddev() function should be changed to use the "Corrected sample stddev" formula

2020-03-06 Thread Munendra S N (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-11725?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17053387#comment-17053387
 ] 

Munendra S N commented on SOLR-11725:
-

I'm planning to commit this weekend (only to master), let me know if there are 
any concerns

> json.facet's stddev() function should be changed to use the "Corrected sample 
> stddev" formula
> -
>
> Key: SOLR-11725
> URL: https://issues.apache.org/jira/browse/SOLR-11725
> Project: Solr
>  Issue Type: Sub-task
>  Components: Facet Module
>Reporter: Chris M. Hostetter
>Priority: Major
> Attachments: SOLR-11725.patch, SOLR-11725.patch, SOLR-11725.patch
>
>
> While working on some equivalence tests/demonstrations for 
> {{facet.pivot+stats.field}} vs {{json.facet}} I noticed that the {{stddev}} 
> calculations done between the two code paths can be measurably different, and 
> realized this is due to them using very different code...
> * {{json.facet=foo:stddev(foo)}}
> ** {{StddevAgg.java}}
> ** {{Math.sqrt((sumSq/count)-Math.pow(sum/count, 2))}}
> * {{stats.field=\{!stddev=true\}foo}}
> ** {{StatsValuesFactory.java}}
> ** {{Math.sqrt(((count * sumOfSquares) - (sum * sum)) / (count * (count - 
> 1.0D)))}}
> Since I"m not really a math guy, I consulting with a bunch of smart math/stat 
> nerds I know online to help me sanity check if these equations (some how) 
> reduced to eachother (In which case the discrepancies I was seeing in my 
> results might have just been due to the order of intermediate operation 
> execution & floating point rounding differences).
> They confirmed that the two bits of code are _not_ equivalent to each other, 
> and explained that the code JSON Faceting is using is equivalent to the 
> "Uncorrected sample stddev" formula, while StatsComponent's code is 
> equivalent to the the "Corrected sample stddev" formula...
> https://en.wikipedia.org/wiki/Standard_deviation#Uncorrected_sample_standard_deviation
> When I told them that stuff like this is why no one likes mathematicians and 
> pressed them to explain which one was the "most canonical" (or "most 
> generally applicable" or "best") definition of stddev, I was told that:
> # This is something statisticians frequently disagree on
> # Practically speaking the diff between the calculations doesn't tend to 
> differ significantly when count is "very large"
> # _"Corrected sample stddev" is more appropriate when comparing two 
> distributions_
> Given that:
> * the primary usage of computing the stddev of a field/function against a 
> Solr result set (or against a sub-set of results defined by a facet 
> constraint) is probably to compare that distribution to a different Solr 
> result set (or to compare N sub-sets of results defined by N facet 
> constraints)
> * the size of the sets of documents (values) can be relatively small when 
> computing stats over facet constraint sub-sets
> ...it seems like {{StddevAgg.java}} should be updated to use the "Corrected 
> sample stddev" equation.






[jira] [Updated] (SOLR-13944) CollapsingQParserPlugin throws NPE instead of bad request

2020-03-06 Thread Munendra S N (Jira)


 [ 
https://issues.apache.org/jira/browse/SOLR-13944?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Munendra S N updated SOLR-13944:

Status: Patch Available  (was: Open)

> CollapsingQParserPlugin throws NPE instead of bad request
> -
>
> Key: SOLR-13944
> URL: https://issues.apache.org/jira/browse/SOLR-13944
> Project: Solr
>  Issue Type: Bug
>Affects Versions: 7.3.1
>Reporter: Stefan
>Assignee: Munendra S N
>Priority: Minor
> Attachments: SOLR-13944.patch
>
>
>  I noticed the following NPE:
> {code:java}
> java.lang.NullPointerException at 
> org.apache.solr.search.CollapsingQParserPlugin$OrdFieldValueCollector.finish(CollapsingQParserPlugin.java:1021)
>  at 
> org.apache.solr.search.CollapsingQParserPlugin$OrdFieldValueCollector.finish(CollapsingQParserPlugin.java:1081)
>  at 
> org.apache.solr.search.SolrIndexSearcher.buildAndRunCollectorChain(SolrIndexSearcher.java:230)
>  at 
> org.apache.solr.search.SolrIndexSearcher.getDocListNC(SolrIndexSearcher.java:1602)
>  at 
> org.apache.solr.search.SolrIndexSearcher.getDocListC(SolrIndexSearcher.java:1419)
>  at 
> org.apache.solr.search.SolrIndexSearcher.search(SolrIndexSearcher.java:584)
> {code}
> If I am correct, the problem was already addressed in SOLR-8807. The fix was
> not working in this case though, because of a syntax error in the query
> (I used the local parameter syntax twice instead of combining it). The
> relevant part of the query is:
> {code:java}
> fq={!tag=collapser}{!collapse field=productId sort='merchantOrder asc, price 
> asc, id asc'}
> {code}
> After discussing that on the mailing list, I was asked to open a ticket, 
> because this situation should result in a bad request instead of a 
> NullpointerException (see 
> [https://mail-archives.apache.org/mod_mbox/lucene-solr-user/201911.mbox/%3CCAMJgJxTuSb%3D8szO8bvHiAafJOs08O_NMB4pcaHOXME4Jj-GO2A%40mail.gmail.com%3E])






[jira] [Commented] (SOLR-13944) CollapsingQParserPlugin throws NPE instead of bad request

2020-03-06 Thread Munendra S N (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-13944?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17053385#comment-17053385
 ] 

Munendra S N commented on SOLR-13944:
-

 [^SOLR-13944.patch] 
Initial patch for fixing NPE. 

This is valid: the defType for fq is lucene by default, so the localParams
syntax is parsed, but the case of tagging the collapse filter wasn't handled in
SOLR-8807 (it was doing a simple string match). Here, I have replaced that with
filter parsing; without it we can't know whether there is a collapse filter or not.
{noformat}
fq={!tag=collapser}{!collapse field=productId sort='merchantOrder asc, price 
asc, id asc'}
{noformat}
[~tflobbe] As you had asked the user to create the JIRA issue, I would prefer
if you could take a look at this patch. A sketch of the parsing approach is below.
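
Roughly, the idea is the following (illustrative sketch only, not the attached
patch; it assumes a SolrQueryRequest {{req}} and the raw fq strings are in
scope, and it elides SyntaxError handling):
{code:java}
// Parse each fq into a Query and check its type, instead of string-matching
// for "!collapse". This also catches a collapse filter wrapped in {!tag=...}.
for (String fqStr : filterQueries) {
  QParser parser = QParser.getParser(fqStr, req); // resolves localParams syntax
  Query q = parser.getQuery();                    // may throw SyntaxError
  if (q instanceof CollapsingQParserPlugin.CollapsingPostFilter) {
    // a collapse filter is present; handle tagging/exclusion accordingly
  }
}
{code}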

> CollapsingQParserPlugin throws NPE instead of bad request
> -
>
> Key: SOLR-13944
> URL: https://issues.apache.org/jira/browse/SOLR-13944
> Project: Solr
>  Issue Type: Bug
>Affects Versions: 7.3.1
>Reporter: Stefan
>Assignee: Munendra S N
>Priority: Minor
> Attachments: SOLR-13944.patch
>
>
>  I noticed the following NPE:
> {code:java}
> java.lang.NullPointerException at 
> org.apache.solr.search.CollapsingQParserPlugin$OrdFieldValueCollector.finish(CollapsingQParserPlugin.java:1021)
>  at 
> org.apache.solr.search.CollapsingQParserPlugin$OrdFieldValueCollector.finish(CollapsingQParserPlugin.java:1081)
>  at 
> org.apache.solr.search.SolrIndexSearcher.buildAndRunCollectorChain(SolrIndexSearcher.java:230)
>  at 
> org.apache.solr.search.SolrIndexSearcher.getDocListNC(SolrIndexSearcher.java:1602)
>  at 
> org.apache.solr.search.SolrIndexSearcher.getDocListC(SolrIndexSearcher.java:1419)
>  at 
> org.apache.solr.search.SolrIndexSearcher.search(SolrIndexSearcher.java:584)
> {code}
> If I am correct, the problem was already addressed in SOLR-8807. The fix was
> not working in this case though, because of a syntax error in the query
> (I used the local parameter syntax twice instead of combining it). The
> relevant part of the query is:
> {code:java}
> fq={!tag=collapser}{!collapse field=productId sort='merchantOrder asc, price 
> asc, id asc'}
> {code}
> After discussing that on the mailing list, I was asked to open a ticket, 
> because this situation should result in a bad request instead of a 
> NullpointerException (see 
> [https://mail-archives.apache.org/mod_mbox/lucene-solr-user/201911.mbox/%3CCAMJgJxTuSb%3D8szO8bvHiAafJOs08O_NMB4pcaHOXME4Jj-GO2A%40mail.gmail.com%3E])






[jira] [Commented] (LUCENE-8962) Can we merge small segments during refresh, for faster searching?

2020-03-06 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-8962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17053381#comment-17053381
 ] 

ASF subversion and git services commented on LUCENE-8962:
-

Commit 90aced5a51f92ffd6e97449eb7c44aacc643c8a3 in lucene-solr's branch 
refs/heads/branch_8x from Michael Sokolov
[ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=90aced5 ]

LUCENE-8962: Split test case (#1313)

* LUCENE-8962: Simplify test case

The testMergeOnCommit test case was trying to verify too many things
at once: basic semantics of merge on commit and proper behavior when
a bunch of indexing threads are writing and committing all at once.

Now we just verify basic behavior, with strict assertions on invariants, while 
leaving it to MockRandomMergePolicy to enable merge on commit in existing
 test cases to verify that indexing generally works as expected and no new
unexpected exceptions are thrown.

* LUCENE-8962: Only update toCommit if merge was committed

The code was previously assuming that if mergeFinished() was called and
isAborted() was false, then the merge must have completed successfully.
Instead, we should know for sure if a given merge was committed, and
only then update our pending commit SegmentInfos.
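
A hedged sketch of that invariant (assumed names and structure; the real
change lives inside IndexWriter and is more involved):
{code:java}
import java.io.IOException;
import java.util.List;
import org.apache.lucene.index.MergePolicy;
import org.apache.lucene.index.SegmentCommitInfo;

// Illustration only: track "committed" explicitly on the merge instead of
// inferring success from mergeFinished() plus !isAborted().
class CommitAwareMerge extends MergePolicy.OneMerge {
  private volatile boolean committed = false;

  CommitAwareMerge(List<SegmentCommitInfo> segments) {
    super(segments);
  }

  void markCommitted() {
    // called only once the merged segment has actually been published
    committed = true;
  }

  @Override
  public void mergeFinished() throws IOException {
    super.mergeFinished();
    if (committed) {
      // only now is it safe to update the pending-commit SegmentInfos
    }
    // finished but never committed (e.g. the commit raced ahead): do nothing
  }
}
{code}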


> Can we merge small segments during refresh, for faster searching?
> -
>
> Key: LUCENE-8962
> URL: https://issues.apache.org/jira/browse/LUCENE-8962
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/index
>Reporter: Michael McCandless
>Priority: Major
> Fix For: 8.5
>
> Attachments: LUCENE-8962_demo.png
>
>  Time Spent: 9h 20m
>  Remaining Estimate: 0h
>
> With near-real-time search we ask {{IndexWriter}} to write all in-memory 
> segments to disk and open an {{IndexReader}} to search them, and this is 
> typically a quick operation.
> However, when you use many threads for concurrent indexing, {{IndexWriter}}
> will accumulate and write many small segments during {{refresh}}, and this then
> adds search-time cost as searching must visit all of these tiny segments.
> The merge policy would normally quickly coalesce these small segments if
> given a little time ... so, could we somehow improve {{IndexWriter}}'s
> refresh to optionally kick off the merge policy to merge segments below some
> threshold before opening the near-real-time reader?  It'd be a bit tricky 
> because while we are waiting for merges, indexing may continue, and new 
> segments may be flushed, but those new segments shouldn't be included in the 
> point-in-time segments returned by refresh ...
> One could almost do this on top of Lucene today, with a custom merge policy, 
> and some hackity logic to have the merge policy target small segments just 
> written by refresh, but it's tricky to then open a near-real-time reader, 
> excluding newly flushed but including newly merged segments since the refresh 
> originally finished ...
> I'm not yet sure how best to solve this, so I wanted to open an issue for 
> discussion!
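
The "custom merge policy" idea could look roughly like this (a sketch under
assumed names and an arbitrary size threshold, not a committed solution):
{code:java}
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.lucene.index.FilterMergePolicy;
import org.apache.lucene.index.MergePolicy;
import org.apache.lucene.index.MergeTrigger;
import org.apache.lucene.index.SegmentCommitInfo;
import org.apache.lucene.index.SegmentInfos;

// Sketch: propose one merge of all "tiny" segments, delegating to the
// wrapped policy when there is nothing small enough to coalesce.
public class SmallSegmentMergePolicy extends FilterMergePolicy {
  private static final int SMALL_MAX_DOC = 10_000; // assumed threshold

  public SmallSegmentMergePolicy(MergePolicy in) {
    super(in);
  }

  @Override
  public MergeSpecification findMerges(MergeTrigger trigger,
      SegmentInfos infos, MergeContext ctx) throws IOException {
    List<SegmentCommitInfo> small = new ArrayList<>();
    for (SegmentCommitInfo sci : infos) {
      if (sci.info.maxDoc() < SMALL_MAX_DOC) {
        small.add(sci);
      }
    }
    if (small.size() < 2) {
      return in.findMerges(trigger, infos, ctx); // nothing worth coalescing
    }
    MergeSpecification spec = new MergeSpecification();
    spec.add(new OneMerge(small));
    return spec;
  }
}
{code}
As the description notes, the hard part is not proposing such merges but
coordinating them with the point-in-time semantics of the refresh.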






[jira] [Updated] (SOLR-13944) CollapsingQParserPlugin throws NPE instead of bad request

2020-03-06 Thread Munendra S N (Jira)


 [ 
https://issues.apache.org/jira/browse/SOLR-13944?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Munendra S N updated SOLR-13944:

Attachment: SOLR-13944.patch

> CollapsingQParserPlugin throws NPE instead of bad request
> -
>
> Key: SOLR-13944
> URL: https://issues.apache.org/jira/browse/SOLR-13944
> Project: Solr
>  Issue Type: Bug
>Affects Versions: 7.3.1
>Reporter: Stefan
>Assignee: Munendra S N
>Priority: Minor
> Attachments: SOLR-13944.patch
>
>
>  I noticed the following NPE:
> {code:java}
> java.lang.NullPointerException at 
> org.apache.solr.search.CollapsingQParserPlugin$OrdFieldValueCollector.finish(CollapsingQParserPlugin.java:1021)
>  at 
> org.apache.solr.search.CollapsingQParserPlugin$OrdFieldValueCollector.finish(CollapsingQParserPlugin.java:1081)
>  at 
> org.apache.solr.search.SolrIndexSearcher.buildAndRunCollectorChain(SolrIndexSearcher.java:230)
>  at 
> org.apache.solr.search.SolrIndexSearcher.getDocListNC(SolrIndexSearcher.java:1602)
>  at 
> org.apache.solr.search.SolrIndexSearcher.getDocListC(SolrIndexSearcher.java:1419)
>  at 
> org.apache.solr.search.SolrIndexSearcher.search(SolrIndexSearcher.java:584)
> {code}
> If I am correct, the problem was already addressed in SOLR-8807. The fix was
> not working in this case though, because of a syntax error in the query
> (I used the local parameter syntax twice instead of combining it). The
> relevant part of the query is:
> {code:java}
> fq={!tag=collapser}{!collapse field=productId sort='merchantOrder asc, price 
> asc, id asc'}
> {code}
> After discussing that on the mailing list, I was asked to open a ticket, 
> because this situation should result in a bad request instead of a 
> NullpointerException (see 
> [https://mail-archives.apache.org/mod_mbox/lucene-solr-user/201911.mbox/%3CCAMJgJxTuSb%3D8szO8bvHiAafJOs08O_NMB4pcaHOXME4Jj-GO2A%40mail.gmail.com%3E])






[jira] [Commented] (LUCENE-9241) fix most memory-hungry tests

2020-03-06 Thread Dawid Weiss (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9241?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17053361#comment-17053361
 ] 

Dawid Weiss commented on LUCENE-9241:
-

I wasn't really that much concerned; just pointing out the (sad) fact of how 
it's implemented for Windows.

> fix most memory-hungry tests
> 
>
> Key: LUCENE-9241
> URL: https://issues.apache.org/jira/browse/LUCENE-9241
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Robert Muir
>Priority: Major
> Attachments: LUCENE-9241.patch
>
>
> Currently each test JVM has an Xmx of 512M. With a modern MacBook Pro, which
> runs many test JVMs in parallel, this adds up to 4GB, which is pretty crazy.
> On the other hand, if we fix a few edge cases, tests can work with lower 
> heaps such as 128M. This can save many gigabytes (also it finds interesting 
> memory waste/issues).






[jira] [Created] (SOLR-14310) Expose solr logs with basic filters via HTTP

2020-03-06 Thread Noble Paul (Jira)
Noble Paul created SOLR-14310:
-

 Summary: Expose solr logs with basic filters via HTTP
 Key: SOLR-14310
 URL: https://issues.apache.org/jira/browse/SOLR-14310
 Project: Solr
  Issue Type: Sub-task
  Security Level: Public (Default Security Level. Issues are Public)
Reporter: Noble Paul









[jira] [Created] (SOLR-14309) Expose GC logs via an HTTP API

2020-03-06 Thread Noble Paul (Jira)
Noble Paul created SOLR-14309:
-

 Summary: Expose GC logs via an HTTP API
 Key: SOLR-14309
 URL: https://issues.apache.org/jira/browse/SOLR-14309
 Project: Solr
  Issue Type: Sub-task
  Security Level: Public (Default Security Level. Issues are Public)
Reporter: Noble Paul









[jira] [Commented] (SOLR-13942) /api/cluster/zk/* to fetch raw ZK data

2020-03-06 Thread Noble Paul (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-13942?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17053344#comment-17053344
 ] 

Noble Paul commented on SOLR-13942:
---

I've opened a new PR and added more tests. Please review.

> /api/cluster/zk/* to fetch raw ZK data
> --
>
> Key: SOLR-13942
> URL: https://issues.apache.org/jira/browse/SOLR-13942
> Project: Solr
>  Issue Type: New Feature
>  Components: v2 API
>Reporter: Noble Paul
>Assignee: Noble Paul
>Priority: Minor
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> Example: download the {{state.json}} of the {{gettingstarted}} collection
> {code}
> GET http://localhost:8983/api/cluster/zk/collections/gettingstarted/state.json
> {code}
> Get a list of all children under {{/live_nodes}}
> {code}
> GET http://localhost:8983/api/cluster/zk/live_nodes
> {code}
> If the requested path is a node with children, show the list of child nodes
> and their metadata
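
A usage sketch from the client side (plain JDK 11 HttpClient, assuming a local
node on port 8983; not part of the patch itself):
{code:java}
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class ZkFetchDemo {
  public static void main(String[] args) throws Exception {
    HttpClient client = HttpClient.newHttpClient();
    HttpRequest req = HttpRequest.newBuilder(
        URI.create("http://localhost:8983/api/cluster/zk/live_nodes")).build();
    HttpResponse<String> rsp =
        client.send(req, HttpResponse.BodyHandlers.ofString());
    System.out.println(rsp.body()); // children of /live_nodes, per the spec above
  }
}
{code}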






[GitHub] [lucene-solr] noblepaul opened a new pull request #1327: SOLR-13942: /api/cluster/zk/* to fetch raw ZK data

2020-03-06 Thread GitBox
noblepaul opened a new pull request #1327: SOLR-13942: /api/cluster/zk/* to 
fetch raw ZK data
URL: https://github.com/apache/lucene-solr/pull/1327
 
 
   





[jira] [Commented] (SOLR-13942) /api/cluster/zk/* to fetch raw ZK data

2020-03-06 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-13942?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17053339#comment-17053339
 ] 

ASF subversion and git services commented on SOLR-13942:


Commit a8e7895c3007f3aa7e58bc52fb610416e80850a6 in lucene-solr's branch 
refs/heads/branch_8x from Noble Paul
[ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=a8e7895 ]

Revert "SOLR-13942: /api/cluster/zk/* to fetch raw ZK data"

This reverts commit 2044f8c83ebb0775d76b1e96c168ca936701abd4.


> /api/cluster/zk/* to fetch raw ZK data
> --
>
> Key: SOLR-13942
> URL: https://issues.apache.org/jira/browse/SOLR-13942
> Project: Solr
>  Issue Type: New Feature
>  Components: v2 API
>Reporter: Noble Paul
>Assignee: Noble Paul
>Priority: Minor
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> Example: download the {{state.json}} of the {{gettingstarted}} collection
> {code}
> GET http://localhost:8983/api/cluster/zk/collections/gettingstarted/state.json
> {code}
> Get a list of all children under {{/live_nodes}}
> {code}
> GET http://localhost:8983/api/cluster/zk/live_nodes
> {code}
> If the requested path is a node with children, show the list of child nodes
> and their metadata






[jira] [Commented] (SOLR-13942) /api/cluster/zk/* to fetch raw ZK data

2020-03-06 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-13942?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17053338#comment-17053338
 ] 

ASF subversion and git services commented on SOLR-13942:


Commit 4cf37ade3531305d508e383b9c16a0c5690bacae in lucene-solr's branch 
refs/heads/master from Noble Paul
[ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=4cf37ad ]

Revert "SOLR-13942: /api/cluster/zk/* to fetch raw ZK data"

This reverts commit bc6fa3b65060b17a88013a0378f4a9d285067d82.


> /api/cluster/zk/* to fetch raw ZK data
> --
>
> Key: SOLR-13942
> URL: https://issues.apache.org/jira/browse/SOLR-13942
> Project: Solr
>  Issue Type: New Feature
>  Components: v2 API
>Reporter: Noble Paul
>Assignee: Noble Paul
>Priority: Minor
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> Example: download the {{state.json}} of the {{gettingstarted}} collection
> {code}
> GET http://localhost:8983/api/cluster/zk/collections/gettingstarted/state.json
> {code}
> Get a list of all children under {{/live_nodes}}
> {code}
> GET http://localhost:8983/api/cluster/zk/live_nodes
> {code}
> If the requested path is a node with children, show the list of child nodes
> and their metadata






[GitHub] [lucene-solr] janhoy commented on issue #404: Comment to explain how to use URLClassifyProcessorFactory

2020-03-06 Thread GitBox
janhoy commented on issue #404: Comment to explain how to use 
URLClassifyProcessorFactory
URL: https://github.com/apache/lucene-solr/pull/404#issuecomment-595723278
 
 
   @ohtwadi Do you want to address the review comment so we can merge this?





[GitHub] [lucene-solr] janhoy closed pull request #880: Tweak header format.

2020-03-06 Thread GitBox
janhoy closed pull request #880: Tweak header format.
URL: https://github.com/apache/lucene-solr/pull/880
 
 
   





[jira] [Updated] (LUCENE-9033) Update Release docs and scripts with new site instructions

2020-03-06 Thread Jira


 [ 
https://issues.apache.org/jira/browse/LUCENE-9033?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jan Høydahl updated LUCENE-9033:

Description: 
*releaseWizard.py:* [PR#1324|https://github.com/apache/lucene-solr/pull/1324] 
Janhoy has started on this, but will likely not finish before the 8.5 release

*[ReleaseTODO|https://cwiki.apache.org/confluence/display/LUCENE/ReleaseTodo] 
page:* I suggest we deprecate this page if folks are happy with releaseWizard, 
which should encapsulate all steps and details, and can also generate an HTML 
TODO document per release.

*publish-solr-ref-guide.sh:* 
[PR#1326|https://github.com/apache/lucene-solr/pull/1326] This script can be
deleted; it is not in use since we no longer publish the PDF.

*(/) solr-ref-guide/src/meta-docs/publish.adoc:* Done

 

There may be other places affected, such as other WIKI pages?

  was:
*releaseWizard.py:* [PR#1324|https://github.com/apache/lucene-solr/pull/1324] 
Janhoy has started on this, but will likely not finish before the 8.5 release

*[ReleaseTODO|https://cwiki.apache.org/confluence/display/LUCENE/ReleaseTodo] 
page:* I suggest we deprecate this page if folks are happy with releaseWizard, 
which should encapsulate all steps and details, and can also generate an HTML 
TODO document per release.

*publish-solr-ref-guide.sh:* This script can be deleted, not in use since we do 
not publish PDF anymore

*(/) solr-ref-guide/src/meta-docs/publish.adoc:*  Done

 

There may be other places affected, such as other WIKI pages?


> Update Release docs and scripts with new site instructions
> -
>
> Key: LUCENE-9033
> URL: https://issues.apache.org/jira/browse/LUCENE-9033
> Project: Lucene - Core
>  Issue Type: Sub-task
>  Components: general/tools
>Reporter: Jan Høydahl
>Assignee: Jan Høydahl
>Priority: Major
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> *releaseWizard.py:* [PR#1324|https://github.com/apache/lucene-solr/pull/1324] 
> Janhoy has started on this, but will likely not finish before the 8.5 release
> *[ReleaseTODO|https://cwiki.apache.org/confluence/display/LUCENE/ReleaseTodo] 
> page:* I suggest we deprecate this page if folks are happy with 
> releaseWizard, which should encapsulate all steps and details, and can also 
> generate an HTML TODO document per release.
> *publish-solr-ref-guide.sh:* 
> [PR#1326|https://github.com/apache/lucene-solr/pull/1326] This script can be
> deleted; it is not in use since we no longer publish the PDF.
> *(/) solr-ref-guide/src/meta-docs/publish.adoc:* Done
>  
> There may be other places affected, such as other WIKI pages?





