[GitHub] [lucene-solr] treygrainger opened a new pull request #1421: SOLR-14396: TaggerRequestHandler should not error on empty collection

2020-04-08 Thread GitBox
treygrainger opened a new pull request #1421: SOLR-14396: TaggerRequestHandler 
should not error on empty collection
URL: https://github.com/apache/lucene-solr/pull/1421
 
 
   
   
   
   # Description
   
   The TaggerRequestHandler currently returns a 400 (Bad Request) if used on a 
collection with no terms in the index. This is inconsistent with how Solr works 
in general, and while it's certainly possible to handle the 400 error 
client-side for empty collections, the incoming requests aren't really "bad" 
requests in my opinion; the index just doesn't have any data yet. This PR 
removes the error and has the TaggerRequestHandler return no matched tags 
instead.
   
   # Solution
   
   The explicit exception was removed since it was agreed that it doesn't make 
sense.
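   For context, here is a minimal sketch of the new behavior (the class and 
method names below are illustrative, not the exact TaggerRequestHandler code):
   
   ```
   import java.io.IOException;
   import java.util.ArrayList;
   import org.apache.lucene.index.Terms;
   import org.apache.solr.response.SolrQueryResponse;
   
   // Sketch only: when the tag field has no indexed terms, answer with zero
   // tags instead of throwing a 400 Bad Request.
   class EmptyIndexTaggingSketch {
     void tag(Terms terms, SolrQueryResponse rsp) throws IOException {
       if (terms == null) {                   // empty collection / empty field
         rsp.add("tagsCount", 0);
         rsp.add("tags", new ArrayList<>());  // zero matched tags
         return;
       }
       // ... normal tagging over the terms enum ...
     }
   }
   ```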
   
   # Tests
   
   I added a unit test (TaggerTest.testEmptyCollection) to verify that a 
response indicating zero tags is now returned.
   
   # Checklist
   
   Please review the following and check all that apply:
   
   - [X] I have reviewed the guidelines for [How to 
Contribute](https://wiki.apache.org/solr/HowToContribute) and my code conforms 
to the standards described there to the best of my ability.
   - [X] I have created a Jira issue and added the issue ID to my pull request 
title.
   - [X] I have given Solr maintainers 
[access](https://help.github.com/en/articles/allowing-changes-to-a-pull-request-branch-created-from-a-fork)
 to contribute to my PR branch. (optional but recommended)
   - [X] I have developed this patch against the `master` branch.
   - [X] I have run `ant precommit` and the appropriate test suite.
   - [X] I have added tests for my changes.
   - [ ] I have added documentation for the [Ref 
Guide](https://github.com/apache/lucene-solr/tree/master/solr/solr-ref-guide) 
(for Solr changes only). **Change too insignificant to warrant addition to 
documentation**
   


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Created] (SOLR-14399) java.nio.channels.AsynchronousCloseException

2020-04-08 Thread Hoan Tran Van (Jira)
Hoan Tran Van created SOLR-14399:


 Summary: java.nio.channels.AsynchronousCloseException
 Key: SOLR-14399
 URL: https://issues.apache.org/jira/browse/SOLR-14399
 Project: Solr
  Issue Type: Bug
  Security Level: Public (Default Security Level. Issues are Public)
  Components: replication (java), update
Affects Versions: 8.5
 Environment: OS: CentOS release 6.8 (Final)

Java: JDK 1.8.0_231
Reporter: Hoan Tran Van
 Attachments: AsynchronousCloseException, Reset cancel_stream_error

I am going to switch from SolrCloud 6.6 to SolrCloud 8.5. I installed Solr 8.5 
in parallel with Solr 6.6; they have the same data and the same configset.

Solr 8.5 logs many errors about java.nio.channels.AsynchronousCloseException 
and Reset cancel_stream_error, while Solr 6.6 does not.

2020-04-09 05:05:00.254 WARN (Thread-34437) [ ] 
o.a.s.u.p.DistributedZkUpdateProcessor Error sending update to 
http://192.168.1.106:8301/solr => 
org.apache.solr.client.solrj.SolrServerException: IOException occured when 
talking to server at: null
 at 
org.apache.solr.client.solrj.impl.Http2SolrClient.request(Http2SolrClient.java:417)
org.apache.solr.client.solrj.SolrServerException: IOException occured when 
talking to server at: null
 at 
org.apache.solr.client.solrj.impl.Http2SolrClient.request(Http2SolrClient.java:417)
 ~[?:?]
 at 
org.apache.solr.client.solrj.impl.Http2SolrClient.request(Http2SolrClient.java:753)
 ~[?:?]
 at 
org.apache.solr.client.solrj.impl.ConcurrentUpdateHttp2SolrClient.request(ConcurrentUpdateHttp2SolrClient.java:369)
 ~[?:?]
 at org.apache.solr.client.solrj.SolrClient.request(SolrClient.java:1290) ~[?:?]
 at 
org.apache.solr.update.SolrCmdDistributor.doRequest(SolrCmdDistributor.java:344)
 ~[?:?]
 at 
org.apache.solr.update.SolrCmdDistributor.lambda$submit$0(SolrCmdDistributor.java:333)
 ~[?:?]
 at java.util.concurrent.FutureTask.run(FutureTask.java:266) ~[?:1.8.0_231]
 at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) 
~[?:1.8.0_231]
 at java.util.concurrent.FutureTask.run(FutureTask.java:266) ~[?:1.8.0_231]
 at 
com.codahale.metrics.InstrumentedExecutorService$InstrumentedRunnable.run(InstrumentedExecutorService.java:180)
 ~[metrics-core-4.1.2.jar:4.1.2]
 at 
org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor.lambda$execute$0(ExecutorUtil.java:210)
 ~[?:?]
 at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) 
~[?:1.8.0_231]
 at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) 
~[?:1.8.0_231]
 at java.lang.Thread.run(Thread.java:748) [?:1.8.0_231]
Caused by: java.nio.channels.AsynchronousCloseException
 at 
org.eclipse.jetty.http2.client.http.HttpConnectionOverHTTP2.close(HttpConnectionOverHTTP2.java:144)
 ~[?:?]
 at 
org.eclipse.jetty.http2.client.http.HttpClientTransportOverHTTP2.onClose(HttpClientTransportOverHTTP2.java:170)
 ~[?:?]
 at 
org.eclipse.jetty.http2.client.http.HttpClientTransportOverHTTP2$SessionListenerPromise.onClose(HttpClientTransportOverHTTP2.java:232)
 ~[?:?]
 at org.eclipse.jetty.http2.api.Session$Listener.onClose(Session.java:206) 
~[http2-common-9.4.24.v20191120.jar:9.4.24.v20191120]
 at org.eclipse.jetty.http2.HTTP2Session.notifyClose(HTTP2Session.java:1131) 
~[http2-common-9.4.24.v20191120.jar:9.4.24.v20191120]
 at org.eclipse.jetty.http2.HTTP2Session.onGoAway(HTTP2Session.java:439) 
~[http2-common-9.4.24.v20191120.jar:9.4.24.v20191120]
 at 
org.eclipse.jetty.http2.parser.Parser$Listener$Wrapper.onGoAway(Parser.java:396)
 ~[http2-common-9.4.24.v20191120.jar:9.4.24.v20191120]
 at org.eclipse.jetty.http2.parser.BodyParser.notifyGoAway(BodyParser.java:192) 
~[http2-common-9.4.24.v20191120.jar:9.4.24.v20191120]
 at 
org.eclipse.jetty.http2.parser.GoAwayBodyParser.onGoAway(GoAwayBodyParser.java:169)
 ~[http2-common-9.4.24.v20191120.jar:9.4.24.v20191120]
 at 
org.eclipse.jetty.http2.parser.GoAwayBodyParser.parse(GoAwayBodyParser.java:139)
 ~[http2-common-9.4.24.v20191120.jar:9.4.24.v20191120]
 at org.eclipse.jetty.http2.parser.Parser.parseBody(Parser.java:198) 
~[http2-common-9.4.24.v20191120.jar:9.4.24.v20191120]
 at org.eclipse.jetty.http2.parser.Parser.parse(Parser.java:127) 
~[http2-common-9.4.24.v20191120.jar:9.4.24.v20191120]
 at 
org.eclipse.jetty.http2.HTTP2Connection$HTTP2Producer.produce(HTTP2Connection.java:248)
 ~[http2-common-9.4.24.v20191120.jar:9.4.24.v20191120]
 at 
org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.produceTask(EatWhatYouKill.java:360)
 ~[jetty-util-9.4.24.v20191120.jar:9.4.24.v20191120]
 at 
org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.doProduce(EatWhatYouKill.java:184)
 ~[jetty-util-9.4.24.v20191120.jar:9.4.24.v20191120]
 at 
org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.tryProduce(EatWhatYouKill.java:171)
 ~[jetty-util-9.4.24.v20191120.jar:9.4.24.v20191120]
 at 

[jira] [Updated] (SOLR-14397) Vector Search in Solr

2020-04-08 Thread Trey Grainger (Jira)


 [ 
https://issues.apache.org/jira/browse/SOLR-14397?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Trey Grainger updated SOLR-14397:
-
Description: 
Search engines have traditionally relied upon token-based matching (typically 
keywords) on an inverted index, plus relevance ranking based upon keyword 
occurrence statistics. This can be viewed as a “sparse vector” match (where 
each term is a one-hot encoded dimension in the vector), since only a few 
keywords out of all possible keywords are considered in each query. With the 
introduction of deep-learning-based transformers over the last few years, 
however, the state of the art in relevance has moved to ranking models based 
upon dense vectors that encode a latent, semantic understanding of both 
language constructs and the underlying domain upon which the model was trained. 
These dense vectors are also referred to as “embeddings”. An example of this 
kind of embedding would be taking the phrase “chief executive officer of the 
tech company” and converting it to [0.03, 1.7, 9.12, 0, 0.3]. Other similar 
phrases should encode to vectors with very similar numbers, so 
we may expect a query like “CEO of a technology org” to generate a vector like 
[0.1, 1.9, 8.9, 0.1, 0.4]. When performing a cosine similarity calculation 
between these vectors, we would expect a number closer to 1.0, whereas a very 
unrelated text blurb would generate a much smaller cosine similarity.
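To make the similarity calculation concrete, here is a minimal, self-contained 
sketch in plain Java using the two example vectors above (illustrative only; 
this is not part of the proposed Solr API):
{code:java}
public class CosineExample {
  static double cosine(double[] a, double[] b) {
    double dot = 0, normA = 0, normB = 0;
    for (int i = 0; i < a.length; i++) {
      dot += a[i] * b[i];
      normA += a[i] * a[i];
      normB += b[i] * b[i];
    }
    return dot / (Math.sqrt(normA) * Math.sqrt(normB));
  }

  public static void main(String[] args) {
    double[] ceoOfTechCompany = {0.03, 1.7, 9.12, 0, 0.3};
    double[] ceoOfTechnologyOrg = {0.1, 1.9, 8.9, 0.1, 0.4};
    // Prints a value very close to 1.0, since the phrases are semantically similar.
    System.out.println(cosine(ceoOfTechCompany, ceoOfTechnologyOrg));
  }
}
{code}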

This is a proposal for how we should implement these vector search capabilities 
in Solr.
h1. Search Process Overview:

In order to implement dense vector search, the following process is typically 
followed:
h2. Offline:

An encoder is built. An encoder can take in text (a query, a sentence, a 
paragraph, a document, etc.) and return a dense vector representing that 
document in a rich semantic space. The semantic space is learned from training 
on textual data (usually, though other sources work, too), typically from the 
domain of the search engine.
h2. Document Ingestion:

When documents are processed, they are passed to the encoder, and the dense 
vector(s) returned are stored as fields on the document. There could be one or 
more vectors per-document, as the granularity of the vectors could be 
per-document, per field, per paragraph, per-sentence, or even per phrase or per 
term.
h2. Query Time:

*Encoding:* The query is translated to a dense vector by passing it to the 
encoder.
 *Quantization:* The query is quantized. Quantization is the process of taking a 
vector with many values and turning it into “terms” in a vector space that 
approximates the full vector space of the dense vectors.
 *ANN Matching:* A query on the quantized vector tokens is executed as an ANN 
(approximate nearest neighbor) search. This allows finding most of the best 
matching documents (typically up to 95%) with a traditional and efficient 
lookup against the inverted index.
 _(optional)_ *ANN Ranking*: ranking may be performed based upon the matched 
quantized tokens to get a rough, initial ranking of documents based upon the 
similarity of the query and document vectors. This allows the next step 
(re-ranking) to be performed on a smaller subset of documents. 
 *Re-Ranking:* Once the initial matching (and optionally ANN ranking) is 
performed, a similarity calculation (cosine, dot-product, or any number of 
other calculations) is typically performed between the full (non-quantized) 
dense vectors for the query and those in the document. This re-ranking will 
typically be on the top-N results for performance reasons.
 *Return Results:* As with any search, the final step is typically to return 
the results in relevance-ranked order. In this case, that would be sorted by 
the re-ranking similarity score (i.e. “cosine descending”).
 --

*Variant:* For small document sets, it may be preferable to rank all documents 
and skip steps 2, 3, and 4. This is because ANN Matching typically 
reduces recall (current state of the art is around 95% recall), so it can be 
beneficial to rank all documents if performance is not a concern. In this case, 
step 5 is performed on the full doc set and would obviously just be considered 
“ranking” instead of “re-ranking”.
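To illustrate the quantization step described above, here is a deliberately 
naive sketch (bucketing each dimension into a discrete token); real 
implementations would use techniques such as locality-sensitive hashing or 
product quantization, so this is for intuition only:
{code:java}
import java.util.ArrayList;
import java.util.List;

public class NaiveQuantizer {
  /**
   * Turns a dense vector into index "terms" such as "d3_b7" (dimension 3,
   * bucket 7) so that an ANN-style candidate match can be executed as a
   * regular lookup against the inverted index.
   */
  static List<String> quantize(double[] vector, double bucketWidth) {
    List<String> tokens = new ArrayList<>();
    for (int dim = 0; dim < vector.length; dim++) {
      int bucket = (int) Math.floor(vector[dim] / bucketWidth);
      tokens.add("d" + dim + "_b" + bucket);
    }
    return tokens;
  }
}
{code}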
h1. Proposed Implementation in Solr:
h2. Phase 1: Storage of Dense Vectors & Scoring on Vector Similarity
h3. Dense Vector Field:

We will add a new dense vector field type in Solr. This field type would be a 
compressed encoding of a dense vector into a BinaryDocValues Field. There are 
other ways to do it, but this is almost certain to be the most efficient.
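A minimal sketch of that encoding (this assumes Lucene's BinaryDocValuesField; 
the field name and the fixed-width float packing are illustrative choices, not 
the final design):
{code:java}
import java.nio.ByteBuffer;

import org.apache.lucene.document.BinaryDocValuesField;
import org.apache.lucene.document.Document;
import org.apache.lucene.util.BytesRef;

public class DenseVectorEncodingSketch {
  /** Packs a dense float vector into bytes and stores it as BinaryDocValues. */
  static void addVector(Document doc, String fieldName, float[] vector) {
    ByteBuffer buf = ByteBuffer.allocate(Float.BYTES * vector.length);
    for (float v : vector) {
      buf.putFloat(v);
    }
    doc.add(new BinaryDocValuesField(fieldName, new BytesRef(buf.array())));
  }
}
{code}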
 Ideally this field is multi-valued. If it is single-valued then we are either 
limited to only document-level vectors, or otherwise we have to create many 
vector fields (e.g. per paragraph or term) and search across them, which will 
never be practical or scale well. BinaryDocValues does not natively support 
multiple values, but 


[jira] [Updated] (SOLR-14398) package store PUT should be idempotent

2020-04-08 Thread Noble Paul (Jira)


 [ 
https://issues.apache.org/jira/browse/SOLR-14398?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Noble Paul updated SOLR-14398:
--
Fix Version/s: 8.6

> package store PUT should be idempotent
> --
>
> Key: SOLR-14398
> URL: https://issues.apache.org/jira/browse/SOLR-14398
> Project: Solr
>  Issue Type: Improvement
>  Security Level: Public(Default Security Level. Issues are Public) 
>Reporter: Noble Paul
>Assignee: Noble Paul
>Priority: Major
>  Labels: packagemanager
> Fix For: 8.6
>
>
> If the same content is posted again with the same metadata it should be 
> idempotent



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene-solr] noblepaul opened a new pull request #1420: package store PUT should be idempotent

2020-04-08 Thread GitBox
noblepaul opened a new pull request #1420: package store PUT should be 
idempotent
URL: https://github.com/apache/lucene-solr/pull/1420
 
 
   





[jira] [Updated] (SOLR-14398) package store PUT should be idempotent

2020-04-08 Thread Noble Paul (Jira)


 [ 
https://issues.apache.org/jira/browse/SOLR-14398?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Noble Paul updated SOLR-14398:
--
Labels: packagemanager  (was: )

> package store PUT should be idempotent
> --
>
> Key: SOLR-14398
> URL: https://issues.apache.org/jira/browse/SOLR-14398
> Project: Solr
>  Issue Type: Improvement
>  Security Level: Public(Default Security Level. Issues are Public) 
>Reporter: Noble Paul
>Assignee: Noble Paul
>Priority: Major
>  Labels: packagemanager
>
> If the same content is posted again with the same metadata it should be 
> idempotent






[jira] [Updated] (SOLR-14398) package store PUT should be idempotent

2020-04-08 Thread Noble Paul (Jira)


 [ 
https://issues.apache.org/jira/browse/SOLR-14398?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Noble Paul updated SOLR-14398:
--
Issue Type: Improvement  (was: New Feature)

> package store PUT should be idempotent
> --
>
> Key: SOLR-14398
> URL: https://issues.apache.org/jira/browse/SOLR-14398
> Project: Solr
>  Issue Type: Improvement
>  Security Level: Public(Default Security Level. Issues are Public) 
>Reporter: Noble Paul
>Assignee: Noble Paul
>Priority: Major
>
> If the same content is posted again with the same metadata it should be 
> idempotent






[GitHub] [lucene-solr] CaoManhDat commented on issue #1395: SOLR-14365: CollapsingQParser - Avoiding always allocate int[] and float[] with size equals to number of unique values (WIP)

2020-04-08 Thread GitBox
CaoManhDat commented on issue #1395: SOLR-14365: CollapsingQParser - Avoiding 
always allocate int[] and float[] with size equals to number of unique values 
(WIP)
URL: https://github.com/apache/lucene-solr/pull/1395#issuecomment-611307658
 
 
   @dsmiley DynamicMap doesn't have any methods that a subclass needs to 
implement. `DynamicMap` is just an interface that can be used to group related 
classes in a logical way.
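   For illustration, a marker-style interface along those lines (the names here 
are hypothetical, purely to show the pattern of grouping related classes 
without imposing any contract on implementors):
   
   ```
   // A marker interface declares no methods; implementing it only documents
   // membership in a family of related classes.
   interface DynamicMapFamily {
   }
   
   class IntIntMapImpl implements DynamicMapFamily { /* primitive int -> int map */ }
   class IntFloatMapImpl implements DynamicMapFamily { /* primitive int -> float map */ }
   ```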





[jira] [Created] (SOLR-14398) package store PUT should be idempotent

2020-04-08 Thread Noble Paul (Jira)
Noble Paul created SOLR-14398:
-

 Summary: package store PUT should be idempotent
 Key: SOLR-14398
 URL: https://issues.apache.org/jira/browse/SOLR-14398
 Project: Solr
  Issue Type: New Feature
  Security Level: Public (Default Security Level. Issues are Public)
Reporter: Noble Paul
Assignee: Noble Paul


If the same content is posted again with the same metadata, it should be 
idempotent.
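A minimal sketch of the intended semantics (illustrative only; the helper 
methods below are hypothetical, not the actual package store code):
{code:java}
import java.io.IOException;
import java.util.Arrays;

class IdempotentPutSketch {
  void put(String path, byte[] content, byte[] metadata) throws IOException {
    if (exists(path)
        && Arrays.equals(content, readContent(path))
        && Arrays.equals(metadata, readMetadata(path))) {
      return; // same content posted again with the same metadata: succeed as a no-op
    }
    write(path, content, metadata);
  }

  // Hypothetical helpers, shown only to keep the sketch self-contained.
  boolean exists(String path) { return false; }
  byte[] readContent(String path) throws IOException { return new byte[0]; }
  byte[] readMetadata(String path) throws IOException { return new byte[0]; }
  void write(String path, byte[] content, byte[] metadata) throws IOException {}
}
{code}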






[jira] [Commented] (SOLR-14397) Vector Search in Solr

2020-04-08 Thread Trey Grainger (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-14397?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17078874#comment-17078874
 ] 

Trey Grainger commented on SOLR-14397:
--

Going to let this sit for a few days to see if anyone has 
questions/concerns/feedback on the overall proposed implementation plan (or 
underlying goals) described above. If it all sounds good, I hope to continue on 
this next week.

Feedback welcome / appreciated!

> Vector Search in Solr
> -
>
> Key: SOLR-14397
> URL: https://issues.apache.org/jira/browse/SOLR-14397
> Project: Solr
>  Issue Type: Improvement
>  Security Level: Public(Default Security Level. Issues are Public) 
>Reporter: Trey Grainger
>Priority: Major
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Search engines have traditionally relied upon token-based matching (typically 
> keywords) on an inverted index, plus relevance ranking based upon keyword 
> occurrence statistics. This can be viewed as a "sparse vector” match (where 
> each term is a one-hot encoded dimension in the vector), since only a few 
> keywords out of all possible keywords are considered in each query. With the 
> introduction of deep-learning-based transformers over the last few years, 
> however, the state of the art in relevance has moved to ranking models based 
> upon dense vectors that encode a latent, semantic understanding of both 
> language constructs and the underlying domain upon which the model was 
> trained. These dense vectors are also referred to as “embeddings”. An example 
> of this kind of embedding would be taking the phrase “chief executive officer 
> of the tech company” and converting it to [0.03, 1.7, 9.12, 0, 0.3]
> . Other similar phrases should encode to vectors with very similar numbers, 
> so we may expect a query like “CEO of a technology org” to generate a vector 
> like [0.1, 1.9, 8.9, 0.1, 0.4]. When performing a cosine similarity 
> calculation between these vectors, we would expect a number closer to 1.0, 
> whereas a very unrelated text blurb would generate a much smaller cosine 
> similarity.
> This is a proposal for how we should implement these vector search 
> capabilities in Solr.
> h1. Search Process Overview:
> In order to implement dense vector search, the following process is typically 
> followed:
> h2. Offline:
> An encoder is built. An encoder can take in text (a query, a sentence, a 
> paragraph, a document, etc.) and return a dense vector representing that 
> document in a rich semantic space. The semantic space is learned from 
> training on textual data (usually, though other sources work, too), typically 
> from the domain of the search engine.
> h2. Document Ingestion:
> When documents are processed, they are passed to the encoder, and the dense 
> vector(s) returned are stored as fields on the document. There could be one 
> or more vectors per-document, as the granularity of the vectors could be 
> per-document, per field, per paragraph, per-sentence, or even per phrase or 
> per term.
> h2. Query Time:
> *Encoding:* The query is translated to a dense vector by passing it to the 
> encoder
>  Quantization: The query is quantized. Quantization is the process of taking 
> a vector with many values and turning it into “terms” in a vector space that 
> approximates the full vector space of the dense vectors.
>  *ANN Matching:* A query on the quantized vector tokens is executed as an ANN 
> (approximate nearest neighbor) search. This allows finding most of the best 
> matching documents (typically up to 95%) with a traditional and efficient 
> lookup against the inverted index.
>  (optional) ANN Ranking: ranking may be performed based upon the matched 
> quantized tokens to get a rough, initial ranking of documents based upon the 
> similarity of the query and document vectors. This allows the next step 
> (re-ranking) to be performed on a smaller subset of documents. 
>  *Re-Ranking:* Once the initial matching (and optionally ANN ranking) is 
> performed, a similarity calculation (cosine, dot-product, or any number of 
> other calculations) is typically performed between the full (non-quantized) 
> dense vectors for the query and those in the document. This re-ranking will 
> typically be on the top-N results for performance reasons.
>  *Return Results:* As with any search, the final step is typically to return 
> the results in relevance-ranked order. In this case, that would be sorted by 
> the re-ranking similarity score (i.e. “cosine descending”).
>  --
> *Variant:* For small document sets, it may be preferable to rank all 
> documents and skip steps 2, 3, and 4. This is because ANN Matching 
> typically reduces recall (current state of the art is around 95% recall), so 
> it can be beneficial to rank all documents if performance is not a concern. 
> In this case, 

[jira] [Commented] (SOLR-14397) Vector Search in Solr

2020-04-08 Thread Trey Grainger (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-14397?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17078871#comment-17078871
 ] 

Trey Grainger commented on SOLR-14397:
--

OK, just pushed the first commit (see linked pull request).
h1. Implementation:

Haven't started creating the DenseVector field yet; decided to create the 
Function Queries first so I can get something functionally working more quickly 
with existing field types (String fields currently).

This adds a {{vector_cosine}} function query and a {{vector_dotproduct}} 
function query, which can operate on ValueSources containing dense vector 
content. It supports multiple vectors per field (separated by {{|}}). For now, 
it's best to use a String field (with {{docValues=true}}) to create the vectors 
on documents, but I will ultimately be implementing a DenseVector field to 
handle this more efficiently.

Since multivalued docvalues don't maintain insertion order, multiple vectors 
are instead encoded into the same docvalue per document, separated by a {{|}} 
character. Currently the vectors are represented as raw strings (no Base64 or 
binary encoding; will do that later). The initial implementation lets you send 
in one or more vectors in the field and, at query time, choose whether the 
returned score is the {{first}}, {{last}}, {{max}}, {{min}}, or {{average}} 
similarity between the first parameter (the query vector) and all of the 
vectors in the {{vectors}} field.
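A hedged sketch of how that {{|}}-separated encoding can be consumed (plain 
Java; the parsing and the {{average}} aggregation shown here are illustrative, 
not the actual function query code):
{code:java}
public class PipeSeparatedVectors {
  static double[] parseVector(String csv) {
    String[] parts = csv.split(",");
    double[] v = new double[parts.length];
    for (int i = 0; i < parts.length; i++) {
      v[i] = Double.parseDouble(parts[i].trim());
    }
    return v;
  }

  static double cosine(double[] a, double[] b) {
    double dot = 0, na = 0, nb = 0;
    for (int i = 0; i < a.length; i++) {
      dot += a[i] * b[i];
      na += a[i] * a[i];
      nb += b[i] * b[i];
    }
    return dot / (Math.sqrt(na) * Math.sqrt(nb));
  }

  /** Average cosine between the query vector and every vector in the docvalue. */
  static double averageCosine(String queryCsv, String docValue) {
    double[] q = parseVector(queryCsv);
    double sum = 0;
    int n = 0;
    for (String stored : docValue.split("\\|")) {
      if (stored.isEmpty()) continue; // tolerate a trailing '|' as in the sample docs
      sum += cosine(q, parseVector(stored));
      n++;
    }
    return n == 0 ? 0.0 : sum / n;
  }
}
{code}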
h1. How to Use:

*Build and Start*
{code:java}
bin/solr stop || ant server && bin/solr start -c -a 
"-agentlib:jdwp=transport=dt_socket,server=y,suspend=n,address=6900"{code}
 
*Create Collection*
{code:java}
curl -H "Content-Type: application/json" \
"http://localhost:8983/solr/admin/collections?action=CREATE=vectors=_default=1"{code}
 
*Index Documents*
{code:java}
curl -X POST -H "Content-Type: application/json" \
http://localhost:8983/solr/vectors/update?commit=true \
--data-binary ' [
{"id": "1", "name_s":"donut", "vectors_s":["5.0,0.0,1.0,5.0,0.0,4.0,5.0,1.0|4.0,0.0,1.2,3.0,0.3,3.0,3.0,0.75|6.0,0.0,2.0,4.0,0.0,5.0,6.0,0.8"]},
{"id": "2", "name_s":"apple juice", "vectors_s":["1.0,5.0,0.0,0.0,0.0,4.0,4.0,3.0|0.0,5.0,0.0,0.0,0.0,3.0,5.0,4.0"]},
{"id": "3", "name_s":"cappuccino", "vectors_s":["0.0,5.0,3.0,0.0,4.0,1.0,2.0,3.0|"]},
{"id": "4", "name_s":"cheese pizza", "vectors_s":["5.0,0.0,4.0,4.0,0.0,1.0,5.0,2.0"]},
{"id": "5", "name_s":"green tea", "vectors_s":["0.0,5.0,0.0,0.0,2.0,1.0,1.0,5.0"]},
{"id": "6", "name_s":"latte", "vectors_s":["0.0,5.0,4.0,0.0,4.0,1.0,3.0,3.0"]},
{"id": "7", "name_s":"soda", "vectors_s":["0.0,5.0,0.0,0.0,3.0,5.0,5.0,0.0"]},
{"id": "8", "name_s":"cheese bread sticks", "vectors_s":["5.0,0.0,4.0,5.0,0.0,1.0,4.0,2.0"]},
{"id": "9", "name_s":"water", "vectors_s":["0.0,5.0,0.0,0.0,0.0,0.0,0.0,5.0"]},
{"id": "10", "name_s":"cinnamon bread sticks", "vectors_s":["5.0,0.0,1.0,5.0,0.0,3.0,4.0,2.0"]}
] '
{code}

*Send Query*
{noformat}
curl -H "Content-Type: application/json" \
"http://localhost:8983/solr/vectors/select?q=*:*=id,name:name_s,cosine:\$func,vectors:vectors_s=vector_cosine(\$donut_vector,vectors_s,average)=\$func%20desc=11_vector=5.0,0.0,1.0,5.0,0.0,4.0,5.0,1.0"{noformat}
 
*Response:* 
{code:java}
{
  "responseHeader":{
"zkConnected":true,
"status":0,
"QTime":1,
"params":{
  "q":"*:*",
  "func":"vector_cosine($donut_vector,vectors_s,average)",
  "donut_vector":"5.0,0.0,1.0,5.0,0.0,4.0,5.0,1.0",
  "fl":"id,name:name_s,cosine:$func,vectors:vectors_s",
  "json":"",
  "sort":"$func desc",
  "rows":"11"}},
  "response":{"numFound":10,"start":0,"docs":[
  {
"id":"1",
"cosine":0.9884526,
"name":"donut",

"vectors":"5.0,0.0,1.0,5.0,0.0,4.0,5.0,1.0|4.0,0.0,1.2,3.0,0.3,3.0,3.0,0.75|6.0,0.0,2.0,4.0,0.0,5.0,6.0,0.8"},
  {
"id":"10",
"cosine":0.98544514,
"name":"cinnamon bread sticks",
"vectors":"5.0,0.0,1.0,5.0,0.0,3.0,4.0,2.0"},
  {
"id":"4",
"cosine":0.88938314,
"name":"cheese pizza",
"vectors":"5.0,0.0,4.0,4.0,0.0,1.0,5.0,2.0"},
  {
"id":"8",
"cosine":0.88938314,
"name":"cheese bread sticks",
"vectors":"5.0,0.0,4.0,5.0,0.0,1.0,4.0,2.0"},
  {
"id":"2",
"cosine":0.524165,
"name":"apple juice",

"vectors":"1.0,5.0,0.0,0.0,0.0,4.0,4.0,3.0|0.0,5.0,0.0,0.0,0.0,3.0,5.0,4.0"},
  {
"id":"7",
"cosine":0.50913316,
"name":"soda",
"vectors":"0.0,5.0,0.0,0.0,3.0,5.0,5.0,0.0"},
  {
"id":"6",
"cosine":0.30926093,
"name":"latte",
"vectors":"0.0,5.0,4.0,0.0,4.0,1.0,3.0,3.0"},
  {
"id":"3",
"cosine":0.25923792,
"name":"cappuccino",
"vectors":"0.0,5.0,3.0,0.0,4.0,1.0,2.0,3.0|"},
  {
"id":"5",
"cosine":0.1939959,
"name":"green tea",

[GitHub] [lucene-solr] dsmiley commented on issue #1412: Add MinimalSolrTest for scale testing

2020-04-08 Thread GitBox
dsmiley commented on issue #1412: Add MinimalSolrTest for scale testing
URL: https://github.com/apache/lucene-solr/pull/1412#issuecomment-611299798
 
 
   Test-framework as a stub makes sense.  Essentially we don't want this 
pretend-test slowing down whole test-runs.  I get the sense there is room for 
improvement in how large projects like ours categorize tests (e.g. via 
annotations) and choose which to run and not to run.
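   For example, something along these lines (a sketch that assumes the existing 
@Nightly annotation in the Lucene/Solr test framework; the class name is 
hypothetical):
   
   ```
   import org.apache.lucene.util.LuceneTestCase;
   import org.apache.lucene.util.LuceneTestCase.Nightly;
   import org.junit.Test;
   
   @Nightly // excluded from the default run; executes only when -Dtests.nightly=true
   public class MinimalScaleSmokeTest extends LuceneTestCase {
     @Test
     public void testScaleScenario() {
       // heavyweight scale work would go here, kept out of normal test runs
     }
   }
   ```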





[GitHub] [lucene-solr] bringyou commented on a change in pull request #1389: LUCENE-9298: fix clearDeletedDocIds in BufferedUpdates

2020-04-08 Thread GitBox
bringyou commented on a change in pull request #1389: LUCENE-9298: fix 
clearDeletedDocIds in BufferedUpdates
URL: https://github.com/apache/lucene-solr/pull/1389#discussion_r405927650
 
 

 ##
 File path: lucene/core/src/java/org/apache/lucene/index/BufferedUpdates.java
 ##
 @@ -176,8 +176,11 @@ void addBinaryUpdate(BinaryDocValuesUpdate update, int docIDUpto) {
   }
 
   void clearDeleteTerms() {
-    deleteTerms.clear();
     numTermDeletes.set(0);
+    deleteTerms.forEach((term, docIDUpto) -> {
 
 Review comment:
   add a counter named `termsBytesUsed`





[jira] [Commented] (SOLR-14396) TaggerRequestHandler Should Not Error on Empty Collection

2020-04-08 Thread David Smiley (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-14396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17078868#comment-17078868
 ] 

David Smiley commented on SOLR-14396:
-

Ah; skip the header.  That could be its own issue for another day.

> TaggerRequestHandler Should Not Error on Empty Collection
> -
>
> Key: SOLR-14396
> URL: https://issues.apache.org/jira/browse/SOLR-14396
> Project: Solr
>  Issue Type: Improvement
>  Security Level: Public(Default Security Level. Issues are Public) 
>Reporter: Trey Grainger
>Priority: Minor
>
> The TaggerRequestHandler (added in SOLR-12376) currently returns a 400 (Bad 
> Request) if used on a collection with no terms in the index. This probably 
> made sense for the use cases for which it was originally written (in the 
> OpenSextant project, before it was contributed to Solr) that focused on on 
> stand-alone document tagging, where the calling application expected there to 
> always be an index.
> More and more use cases are emerging for using the TaggerRequestHandler in 
> real-time for user queries, however. For example, real-time phrase matching 
> and entity resolution in queries. In these cases, the data in the tagger 
> collection may be dynamically updated, and at times, the collection may even 
> be empty.
> While it's certainly possible for the 400 error to be handled client-side for 
> empty collections, the incoming requests aren't really "bad" requests in my 
> opinion, the index just doesn't have any data yet. Sending the same request 
> subsequently once some documents are indexed would result in a success.
> I'm proposing we remove the exception for empty indexes and simply return no 
> matched tags instead.
> If it's important for anyone to preserve the current behavior, we could add a 
> parameter "errorOnEmptyCollection". Does anyone think preserving the error 
> here is needed? What say you [~dsmiley]?






[GitHub] [lucene-solr] noblepaul commented on a change in pull request #1327: SOLR-13942: /api/cluster/zk/* to fetch raw ZK data

2020-04-08 Thread GitBox
noblepaul commented on a change in pull request #1327: SOLR-13942: 
/api/cluster/zk/* to fetch raw ZK data
URL: https://github.com/apache/lucene-solr/pull/1327#discussion_r405924222
 
 

 ##
 File path: solr/core/src/java/org/apache/solr/handler/admin/ZkRead.java
 ##
 @@ -0,0 +1,117 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.solr.handler.admin;
+
+import java.util.Collections;
+import java.util.HashMap;
+import java.util.List;
+import java.util.Map;
+
+import org.apache.solr.api.Command;
+import org.apache.solr.api.EndPoint;
+import org.apache.solr.client.solrj.SolrRequest;
+import org.apache.solr.client.solrj.impl.BinaryResponseParser;
+import org.apache.solr.common.MapWriter;
+import org.apache.solr.common.params.CommonParams;
+import org.apache.solr.common.params.MapSolrParams;
+import org.apache.solr.common.params.SolrParams;
+import org.apache.solr.common.util.ContentStreamBase;
+import org.apache.solr.common.util.Utils;
+import org.apache.solr.core.CoreContainer;
+import org.apache.solr.request.SolrQueryRequest;
+import org.apache.solr.response.SolrQueryResponse;
+import org.apache.zookeeper.data.Stat;
+
+import static org.apache.solr.common.params.CommonParams.OMIT_HEADER;
+import static org.apache.solr.common.params.CommonParams.WT;
+import static org.apache.solr.response.RawResponseWriter.CONTENT;
+import static 
org.apache.solr.security.PermissionNameProvider.Name.COLL_READ_PERM;
+
+/**Exposes the content of the Zookeeper
+ * This is an expert feature that exposes the data inside the back end 
zookeeper. This API may change or
+ * be removed in future versions.
+ * This is not a public API. The data that is returned is not guaranteed to 
remain same
+ * across releases, as the data stored in Zookeeper may change from time to 
time.
+ */
+@EndPoint(path = "/cluster/zk/*",
+method = SolrRequest.METHOD.GET,
+permission = COLL_READ_PERM)
+public class ZkRead {
+  private final CoreContainer coreContainer;
+
+  public ZkRead(CoreContainer coreContainer) {
+this.coreContainer = coreContainer;
+  }
+
+  @Command
+  public void get(SolrQueryRequest req, SolrQueryResponse rsp) {
+String path = req.getPathTemplateValues().get("*");
+if (path == null || path.isEmpty()) path = "/";
+byte[] d = null;
+try {
+  List<String> l = 
coreContainer.getZkController().getZkClient().getChildren(path, null, false);
+  if (l != null && !l.isEmpty()) {
+String prefix = path.endsWith("/") ? path : path + "/";
+
+rsp.add(path, (MapWriter) ew -> {
+  for (String s : l) {
+try {
+  Stat stat = 
coreContainer.getZkController().getZkClient().exists(prefix + s, null, false);
+  ew.put(s, (MapWriter) ew1 -> {
+ew1.put("version", stat.getVersion());
+ew1.put("aversion", stat.getAversion());
+ew1.put("children", stat.getNumChildren());
+ew1.put("ctime", stat.getCtime());
+ew1.put("cversion", stat.getCversion());
+ew1.put("czxid", stat.getCzxid());
+ew1.put("ephemeralOwner", stat.getEphemeralOwner());
+ew1.put("mtime", stat.getMtime());
+ew1.put("mzxid", stat.getMzxid());
+ew1.put("pzxid", stat.getPzxid());
+ew1.put("dataLength", stat.getDataLength());
+  });
+} catch (Exception e) {
+  ew.put("s", Collections.singletonMap("error", e.getMessage()));
+}
+  }
+});
+
+  } else {
+d = coreContainer.getZkController().getZkClient().getData(path, null, 
null, false);
+if (d == null || d.length == 0) {
+  rsp.add(path, null);
+  return;
+}
+
+Map<String, String> map = new HashMap<>(1);
+map.put(WT, "raw");
+map.put(OMIT_HEADER, "true");
+req.setParams(SolrParams.wrapDefaults(new MapSolrParams(map), 
req.getParams()));
 
 Review comment:
   Yes, for list operations, you can get data in any format.



[GitHub] [lucene-solr] noblepaul commented on a change in pull request #1327: SOLR-13942: /api/cluster/zk/* to fetch raw ZK data

2020-04-08 Thread GitBox
noblepaul commented on a change in pull request #1327: SOLR-13942: 
/api/cluster/zk/* to fetch raw ZK data
URL: https://github.com/apache/lucene-solr/pull/1327#discussion_r405924050
 
 

 ##
 File path: solr/core/src/java/org/apache/solr/handler/admin/ZkRead.java
 ##
 @@ -0,0 +1,117 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.solr.handler.admin;
+
+import java.util.Collections;
+import java.util.HashMap;
+import java.util.List;
+import java.util.Map;
+
+import org.apache.solr.api.Command;
+import org.apache.solr.api.EndPoint;
+import org.apache.solr.client.solrj.SolrRequest;
+import org.apache.solr.client.solrj.impl.BinaryResponseParser;
+import org.apache.solr.common.MapWriter;
+import org.apache.solr.common.params.CommonParams;
+import org.apache.solr.common.params.MapSolrParams;
+import org.apache.solr.common.params.SolrParams;
+import org.apache.solr.common.util.ContentStreamBase;
+import org.apache.solr.common.util.Utils;
+import org.apache.solr.core.CoreContainer;
+import org.apache.solr.request.SolrQueryRequest;
+import org.apache.solr.response.SolrQueryResponse;
+import org.apache.zookeeper.data.Stat;
+
+import static org.apache.solr.common.params.CommonParams.OMIT_HEADER;
+import static org.apache.solr.common.params.CommonParams.WT;
+import static org.apache.solr.response.RawResponseWriter.CONTENT;
+import static 
org.apache.solr.security.PermissionNameProvider.Name.COLL_READ_PERM;
+
+/**Exposes the content of the Zookeeper
+ * This is an expert feature that exposes the data inside the back end 
zookeeper. This API may change or
+ * be removed in future versions.
+ * This is not a public API. The data that is returned is not guaranteed to 
remain same
+ * across releases, as the data stored in Zookeeper may change from time to 
time.
+ */
+@EndPoint(path = "/cluster/zk/*",
+method = SolrRequest.METHOD.GET,
+permission = COLL_READ_PERM)
+public class ZkRead {
+  private final CoreContainer coreContainer;
+
+  public ZkRead(CoreContainer coreContainer) {
+this.coreContainer = coreContainer;
+  }
+
+  @Command
+  public void get(SolrQueryRequest req, SolrQueryResponse rsp) {
+String path = req.getPathTemplateValues().get("*");
+if (path == null || path.isEmpty()) path = "/";
+byte[] d = null;
+try {
+  List<String> l = 
coreContainer.getZkController().getZkClient().getChildren(path, null, false);
+  if (l != null && !l.isEmpty()) {
+String prefix = path.endsWith("/") ? path : path + "/";
+
+rsp.add(path, (MapWriter) ew -> {
+  for (String s : l) {
+try {
+  Stat stat = 
coreContainer.getZkController().getZkClient().exists(prefix + s, null, false);
+  ew.put(s, (MapWriter) ew1 -> {
+ew1.put("version", stat.getVersion());
+ew1.put("aversion", stat.getAversion());
+ew1.put("children", stat.getNumChildren());
+ew1.put("ctime", stat.getCtime());
+ew1.put("cversion", stat.getCversion());
+ew1.put("czxid", stat.getCzxid());
+ew1.put("ephemeralOwner", stat.getEphemeralOwner());
+ew1.put("mtime", stat.getMtime());
+ew1.put("mzxid", stat.getMzxid());
+ew1.put("pzxid", stat.getPzxid());
+ew1.put("dataLength", stat.getDataLength());
+  });
+} catch (Exception e) {
 
 Review comment:
   ```
 public void list(SolrQueryRequest req, SolrQueryResponse rsp) {
 String path = req.getPathTemplateValues().get("*");
 if (path == null || path.isEmpty()) path = "/";
 try {
    List<String> l = 
coreContainer.getZkController().getZkClient().getChildren(path, null, false);
   String prefix = path.endsWith("/") ? path : path + "/";
   rsp.add(path, (MapWriter) ew -> {
 for (String s : l) {
   Stat stat = null;
   try {
 stat = 
coreContainer.getZkController().getZkClient().exists(prefix + s, null, false);
   } catch (Exception e) {
 throw new RuntimeException(e);
   }
   printStat(ew, s, stat);

[GitHub] [lucene-solr] treygrainger opened a new pull request #1419: SOLR-14397: Vector Search in Solr

2020-04-08 Thread GitBox
treygrainger opened a new pull request #1419: SOLR-14397: Vector Search in Solr
URL: https://github.com/apache/lucene-solr/pull/1419
 
 
   
   
   
   # Description
   
   *WORK IN PROGRESS, DO NOT MERGE.*
   Adds `vector_cosine` and `vector_dotproduct` function queries, which can 
operate on ValueSources containing dense vector content. Supports multiple 
vectors per field (separated by `|`). For now, it's best to use a String field 
(with `docValues=true`) to create the vectors on documents, but I will 
ultimately be implementing a DenseVector field to handle this more efficiently. 
   
   
   # Solution
   Since multivalued docvalues don't maintain insertion order, multiple vectors 
are instead encoded into the same docvalue per document, separated by a `|` 
character. Currently the vectors are represented as raw strings (no Base64 or 
binary encoding - that will come later). The initial implementation lets you 
send one or more vectors in the field and, at query time, choose to score the 
first parameter (the query vector) using either the `first`, `last`, `max`, 
`min`, or `average` similarity against all of the vectors in the `vectors` 
field. 
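
   To make the scoring concrete, here is a minimal, self-contained sketch (not 
the PR's actual ValueSource code) of parsing a `|`-separated vector string and 
computing the `average` cosine similarity against a query vector:
   ```java
   // Illustrative sketch only; the real implementation lives in the function
   // query ValueSources added by this PR.
   import java.util.Arrays;

   public class VectorCosineSketch {

     // parse one comma-separated vector, e.g. "5.0,0.0,1.0"
     static double[] parse(String csv) {
       return Arrays.stream(csv.split(",")).mapToDouble(Double::parseDouble).toArray();
     }

     static double cosine(double[] a, double[] b) {
       double dot = 0, normA = 0, normB = 0;
       for (int i = 0; i < a.length; i++) {
         dot += a[i] * b[i];
         normA += a[i] * a[i];
         normB += b[i] * b[i];
       }
       return dot / (Math.sqrt(normA) * Math.sqrt(normB));
     }

     // "average" mode: mean cosine between the query vector and every stored vector
     static double averageCosine(String storedValue, double[] query) {
       double sum = 0;
       int count = 0;
       for (String v : storedValue.split("\\|")) {
         if (v.isEmpty()) continue; // a trailing '|' leaves an empty token
         sum += cosine(parse(v), query);
         count++;
       }
       return count == 0 ? 0 : sum / count;
     }

     public static void main(String[] args) {
       String donut = "5.0,0.0,1.0,5.0,0.0,4.0,5.0,1.0|4.0,0.0,1.2,3.0,0.3,3.0,3.0,0.75";
       double[] query = parse("5.0,0.0,1.0,5.0,0.0,4.0,5.0,1.0");
       System.out.println(averageCosine(donut, query)); // high similarity, close to 1.0
     }
   }
   ```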
   
   # Tests
   
   No unit tests yet. You can use/test the functionality as follows:
   
   *Build and Start*
   ```
   bin/solr stop || ant server && bin/solr start -c -a 
"-agentlib:jdwp=transport=dt_socket,server=y,suspend=n,address=6900"
   ```
   
   *Create Collection*
   ```
   curl -H "Content-Type: application/json" \
   
"http://localhost:8983/solr/admin/collections?action=CREATE=vectors=_default=1;
   ```
   
   *Index Documents*
   ```
   curl -X POST -H "Content-Type: application/json" \
http://localhost:8983/solr/vectors/update?commit=true \
--data-binary ' [
{"id": "1", "name_s":"donut", 
"vectors_s":["5.0,0.0,1.0,5.0,0.0,4.0,5.0,1.0|4.0,0.0,1.2,3.0,0.3,3.0,3.0,0.75|6.0,0.0,2.0,4.0,0.0,5.0,6.0,0.8"]},
{"id": "2", "name_s":"apple juice",

"vectors_s":["1.0,5.0,0.0,0.0,0.0,4.0,4.0,3.0|0.0,5.0,0.0,0.0,0.0,3.0,5.0,4.0"]},
{"id": "3", "name_s":"cappuccino",
"vectors_s":["0.0,5.0,3.0,0.0,4.0,1.0,2.0,3.0|"]},
{"id": "4", "name_s":"cheese pizza",
"vectors_s":["5.0,0.0,4.0,4.0,0.0,1.0,5.0,2.0"]},
{"id": "5", "name_s":"green tea",
"vectors_s":["0.0,5.0,0.0,0.0,2.0,1.0,1.0,5.0"]},
{"id": "6", "name_s":"latte", 
"vectors_s":["0.0,5.0,4.0,0.0,4.0,1.0,3.0,3.0"]},
{"id": "7", "name_s":"soda", 
"vectors_s":["0.0,5.0,0.0,0.0,3.0,5.0,5.0,0.0"]},
{"id": "8", "name_s":"cheese bread sticks",
"vectors_s":["5.0,0.0,4.0,5.0,0.0,1.0,4.0,2.0"]},
{"id": "9", "name_s":"water", 
"vectors_s":["0.0,5.0,0.0,0.0,0.0,0.0,0.0,5.0"]},
{"id": "10", "name_s":"cinnamon bread sticks", 
"vectors_s":["5.0,0.0,1.0,5.0,0.0,3.0,4.0,2.0"]}
   ] '
   ```
   
   *Send Query*
   ```
   curl -H "Content-Type: application/json" \
   
"http://localhost:8983/solr/vectors/select?q=*:*=id,name:name_s,cosine:\$func,vectors:vectors_s=vector_cosine(\$donut_vector,vectors_s,average)=\$func%20desc=11_vector=5.0,0.0,1.0,5.0,0.0,4.0,5.0,1.0"
   ```
   
   *Response:*
   ```
   {
 "responseHeader":{
   "zkConnected":true,
   "status":0,
   "QTime":1,
   "params":{
 "q":"*:*",
 "func":"vector_cosine($donut_vector,vectors_s,average)",
 "donut_vector":"5.0,0.0,1.0,5.0,0.0,4.0,5.0,1.0",
 "fl":"id,name:name_s,cosine:$func,vectors:vectors_s",
 "json":"",
 "sort":"$func desc",
 "rows":"11"}},
 "response":{"numFound":10,"start":0,"docs":[
 {
   "id":"1",
   "cosine":0.9884526,
   "name":"donut",
   
"vectors":"5.0,0.0,1.0,5.0,0.0,4.0,5.0,1.0|4.0,0.0,1.2,3.0,0.3,3.0,3.0,0.75|6.0,0.0,2.0,4.0,0.0,5.0,6.0,0.8"},
 {
   "id":"10",
   "cosine":0.98544514,
   "name":"cinnamon bread sticks",
   "vectors":"5.0,0.0,1.0,5.0,0.0,3.0,4.0,2.0"},
 {
   "id":"4",
   "cosine":0.88938314,
   "name":"cheese pizza",
   "vectors":"5.0,0.0,4.0,4.0,0.0,1.0,5.0,2.0"},
 {
   "id":"8",
   "cosine":0.88938314,
   "name":"cheese bread sticks",
   "vectors":"5.0,0.0,4.0,5.0,0.0,1.0,4.0,2.0"},
 {
   "id":"2",
   "cosine":0.524165,
   "name":"apple juice",
   
"vectors":"1.0,5.0,0.0,0.0,0.0,4.0,4.0,3.0|0.0,5.0,0.0,0.0,0.0,3.0,5.0,4.0"},
 {
   "id":"7",
   "cosine":0.50913316,
   "name":"soda",
   "vectors":"0.0,5.0,0.0,0.0,3.0,5.0,5.0,0.0"},
 {
   "id":"6",
   "cosine":0.30926093,
   "name":"latte",
   "vectors":"0.0,5.0,4.0,0.0,4.0,1.0,3.0,3.0"},
 {
   "id":"3",
   "cosine":0.25923792,
   "name":"cappuccino",
   "vectors":"0.0,5.0,3.0,0.0,4.0,1.0,2.0,3.0|"},
 {
   "id":"5",

[GitHub] [lucene-solr] noblepaul commented on a change in pull request #1327: SOLR-13942: /api/cluster/zk/* to fetch raw ZK data

2020-04-08 Thread GitBox
noblepaul commented on a change in pull request #1327: SOLR-13942: 
/api/cluster/zk/* to fetch raw ZK data
URL: https://github.com/apache/lucene-solr/pull/1327#discussion_r405921183
 
 

 ##
 File path: solr/core/src/java/org/apache/solr/handler/admin/ZkRead.java
 ##
 @@ -0,0 +1,117 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.solr.handler.admin;
+
+import java.util.Collections;
+import java.util.HashMap;
+import java.util.List;
+import java.util.Map;
+
+import org.apache.solr.api.Command;
+import org.apache.solr.api.EndPoint;
+import org.apache.solr.client.solrj.SolrRequest;
+import org.apache.solr.client.solrj.impl.BinaryResponseParser;
+import org.apache.solr.common.MapWriter;
+import org.apache.solr.common.params.CommonParams;
+import org.apache.solr.common.params.MapSolrParams;
+import org.apache.solr.common.params.SolrParams;
+import org.apache.solr.common.util.ContentStreamBase;
+import org.apache.solr.common.util.Utils;
+import org.apache.solr.core.CoreContainer;
+import org.apache.solr.request.SolrQueryRequest;
+import org.apache.solr.response.SolrQueryResponse;
+import org.apache.zookeeper.data.Stat;
+
+import static org.apache.solr.common.params.CommonParams.OMIT_HEADER;
+import static org.apache.solr.common.params.CommonParams.WT;
+import static org.apache.solr.response.RawResponseWriter.CONTENT;
+import static 
org.apache.solr.security.PermissionNameProvider.Name.COLL_READ_PERM;
+
+/**Exposes the content of the Zookeeper
+ * This is an expert feature that exposes the data inside the back end 
zookeeper.This API may change or
+ * be removed in future versions.
+ * This is not a public API. The data that is returned is not guaranteed to 
remain same
+ * across releases, as the data stored in Zookeeper may change from time to 
time.
+ */
+@EndPoint(path = "/cluster/zk/*",
+method = SolrRequest.METHOD.GET,
+permission = COLL_READ_PERM)
+public class ZkRead {
+  private final CoreContainer coreContainer;
 
 Review comment:
   We should probably support a `Supplier`.
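
   For illustration, one way to read that suggestion (a sketch only, not the 
final patch) is to accept a `java.util.function.Supplier<CoreContainer>` so the 
container is resolved lazily at request time:
   ```java
   // Sketch: constructor takes a Supplier so the CoreContainer is looked up lazily.
   import java.util.function.Supplier;

   import org.apache.solr.core.CoreContainer;

   public class ZkRead {
     private final Supplier<CoreContainer> coreContainer;

     public ZkRead(Supplier<CoreContainer> coreContainer) {
       this.coreContainer = coreContainer;
     }

     // call sites would then use coreContainer.get() instead of the field directly
   }
   ```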


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene-solr] noblepaul commented on a change in pull request #1327: SOLR-13942: /api/cluster/zk/* to fetch raw ZK data

2020-04-08 Thread GitBox
noblepaul commented on a change in pull request #1327: SOLR-13942: 
/api/cluster/zk/* to fetch raw ZK data
URL: https://github.com/apache/lucene-solr/pull/1327#discussion_r405920137
 
 

 ##
 File path: solr/core/src/java/org/apache/solr/handler/admin/ZkRead.java
 ##
 @@ -0,0 +1,117 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.solr.handler.admin;
+
+import java.util.Collections;
+import java.util.HashMap;
+import java.util.List;
+import java.util.Map;
+
+import org.apache.solr.api.Command;
+import org.apache.solr.api.EndPoint;
+import org.apache.solr.client.solrj.SolrRequest;
+import org.apache.solr.client.solrj.impl.BinaryResponseParser;
+import org.apache.solr.common.MapWriter;
+import org.apache.solr.common.params.CommonParams;
+import org.apache.solr.common.params.MapSolrParams;
+import org.apache.solr.common.params.SolrParams;
+import org.apache.solr.common.util.ContentStreamBase;
+import org.apache.solr.common.util.Utils;
+import org.apache.solr.core.CoreContainer;
+import org.apache.solr.request.SolrQueryRequest;
+import org.apache.solr.response.SolrQueryResponse;
+import org.apache.zookeeper.data.Stat;
+
+import static org.apache.solr.common.params.CommonParams.OMIT_HEADER;
+import static org.apache.solr.common.params.CommonParams.WT;
+import static org.apache.solr.response.RawResponseWriter.CONTENT;
+import static 
org.apache.solr.security.PermissionNameProvider.Name.COLL_READ_PERM;
+
+/**Exposes the content of the Zookeeper
+ * This is an expert feature that exposes the data inside the back end 
zookeeper.This API may change or
+ * be removed in future versions.
+ * This is not a public API. The data that is returned is not guaranteed to 
remain same
+ * across releases, as the data stored in Zookeeper may change from time to 
time.
+ */
+@EndPoint(path = "/cluster/zk/*",
+method = SolrRequest.METHOD.GET,
+permission = COLL_READ_PERM)
+public class ZkRead {
+  private final CoreContainer coreContainer;
+
+  public ZkRead(CoreContainer coreContainer) {
+this.coreContainer = coreContainer;
+  }
+
+  @Command
+  public void get(SolrQueryRequest req, SolrQueryResponse rsp) {
+String path = req.getPathTemplateValues().get("*");
+if (path == null || path.isEmpty()) path = "/";
+byte[] d = null;
+try {
+  List l = 
coreContainer.getZkController().getZkClient().getChildren(path, null, false);
+  if (l != null && !l.isEmpty()) {
+String prefix = path.endsWith("/") ? path : path + "/";
+
+rsp.add(path, (MapWriter) ew -> {
+  for (String s : l) {
+try {
+  Stat stat = 
coreContainer.getZkController().getZkClient().exists(prefix + s, null, false);
+  ew.put(s, (MapWriter) ew1 -> {
+ew1.put("version", stat.getVersion());
+ew1.put("aversion", stat.getAversion());
+ew1.put("children", stat.getNumChildren());
+ew1.put("ctime", stat.getCtime());
+ew1.put("cversion", stat.getCversion());
+ew1.put("czxid", stat.getCzxid());
+ew1.put("ephemeralOwner", stat.getEphemeralOwner());
+ew1.put("mtime", stat.getMtime());
+ew1.put("mzxid", stat.getMzxid());
+ew1.put("pzxid", stat.getPzxid());
+ew1.put("dataLength", stat.getDataLength());
+  });
+} catch (Exception e) {
+  ew.put("s", Collections.singletonMap("error", e.getMessage()));
+}
+  }
+});
+
+  } else {
+d = coreContainer.getZkController().getZkClient().getData(path, null, 
null, false);
+if (d == null || d.length == 0) {
+  rsp.add(path, null);
+  return;
+}
+
+Map map = new HashMap<>(1);
+map.put(WT, "raw");
+map.put(OMIT_HEADER, "true");
+req.setParams(SolrParams.wrapDefaults(new MapSolrParams(map), 
req.getParams()));
 
 Review comment:
   No, if you are requesting data, you should expect raw data; it will not 
honour the `wt` param. If you request `/api/cluster/zk/list` you will get the 
response in any format that you ask for: `javabin`, `xml`, `json`.


[GitHub] [lucene-solr] noblepaul commented on a change in pull request #1327: SOLR-13942: /api/cluster/zk/* to fetch raw ZK data

2020-04-08 Thread GitBox
noblepaul commented on a change in pull request #1327: SOLR-13942: 
/api/cluster/zk/* to fetch raw ZK data
URL: https://github.com/apache/lucene-solr/pull/1327#discussion_r405920325
 
 

 ##
 File path: 
solr/core/src/test/org/apache/solr/handler/admin/ZookeeperStatusHandlerTest.java
 ##
 @@ -74,6 +78,39 @@ public void tearDown() throws Exception {
 super.tearDown();
   }
 
+  @Test
+  public void testZkread() throws Exception {
+URL baseUrl = cluster.getJettySolrRunner(0).getBaseUrl();
+String basezk = baseUrl.toString().replace("/solr", "/api") + 
"/cluster/zk";
+
+try(  HttpSolrClient client = new 
HttpSolrClient.Builder(baseUrl.toString()).build()) {
+  Object o = Utils.executeGET(client.getHttpClient(),
+  basezk + "/security.json",
+  Utils.JSONCONSUMER );
+  assertNotNull(o);
+  o = Utils.executeGET(client.getHttpClient(),
+  basezk + "/configs",
+  Utils.JSONCONSUMER );
+  assertEquals("0", String.valueOf(getObjectByPath(o,true, 
split(":/configs:_default:dataLength",':';
+  assertEquals("0", String.valueOf(getObjectByPath(o,true, 
split(":/configs:conf:dataLength",':';
+  byte[] bytes = new byte[1024*5];
+  for (int i = 0; i < bytes.length; i++) {
+bytes[i] = (byte) random().nextInt(128);
 
 Review comment:
   I wanted a big enough `byte[]`, not a small one.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene-solr] noblepaul commented on a change in pull request #1327: SOLR-13942: /api/cluster/zk/* to fetch raw ZK data

2020-04-08 Thread GitBox
noblepaul commented on a change in pull request #1327: SOLR-13942: 
/api/cluster/zk/* to fetch raw ZK data
URL: https://github.com/apache/lucene-solr/pull/1327#discussion_r405920137
 
 

 ##
 File path: solr/core/src/java/org/apache/solr/handler/admin/ZkRead.java
 ##
 @@ -0,0 +1,117 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.solr.handler.admin;
+
+import java.util.Collections;
+import java.util.HashMap;
+import java.util.List;
+import java.util.Map;
+
+import org.apache.solr.api.Command;
+import org.apache.solr.api.EndPoint;
+import org.apache.solr.client.solrj.SolrRequest;
+import org.apache.solr.client.solrj.impl.BinaryResponseParser;
+import org.apache.solr.common.MapWriter;
+import org.apache.solr.common.params.CommonParams;
+import org.apache.solr.common.params.MapSolrParams;
+import org.apache.solr.common.params.SolrParams;
+import org.apache.solr.common.util.ContentStreamBase;
+import org.apache.solr.common.util.Utils;
+import org.apache.solr.core.CoreContainer;
+import org.apache.solr.request.SolrQueryRequest;
+import org.apache.solr.response.SolrQueryResponse;
+import org.apache.zookeeper.data.Stat;
+
+import static org.apache.solr.common.params.CommonParams.OMIT_HEADER;
+import static org.apache.solr.common.params.CommonParams.WT;
+import static org.apache.solr.response.RawResponseWriter.CONTENT;
+import static 
org.apache.solr.security.PermissionNameProvider.Name.COLL_READ_PERM;
+
+/**Exposes the content of the Zookeeper
+ * This is an expert feature that exposes the data inside the back end 
zookeeper.This API may change or
+ * be removed in future versions.
+ * This is not a public API. The data that is returned is not guaranteed to 
remain same
+ * across releases, as the data stored in Zookeeper may change from time to 
time.
+ */
+@EndPoint(path = "/cluster/zk/*",
+method = SolrRequest.METHOD.GET,
+permission = COLL_READ_PERM)
+public class ZkRead {
+  private final CoreContainer coreContainer;
+
+  public ZkRead(CoreContainer coreContainer) {
+this.coreContainer = coreContainer;
+  }
+
+  @Command
+  public void get(SolrQueryRequest req, SolrQueryResponse rsp) {
+String path = req.getPathTemplateValues().get("*");
+if (path == null || path.isEmpty()) path = "/";
+byte[] d = null;
+try {
+  List l = 
coreContainer.getZkController().getZkClient().getChildren(path, null, false);
+  if (l != null && !l.isEmpty()) {
+String prefix = path.endsWith("/") ? path : path + "/";
+
+rsp.add(path, (MapWriter) ew -> {
+  for (String s : l) {
+try {
+  Stat stat = 
coreContainer.getZkController().getZkClient().exists(prefix + s, null, false);
+  ew.put(s, (MapWriter) ew1 -> {
+ew1.put("version", stat.getVersion());
+ew1.put("aversion", stat.getAversion());
+ew1.put("children", stat.getNumChildren());
+ew1.put("ctime", stat.getCtime());
+ew1.put("cversion", stat.getCversion());
+ew1.put("czxid", stat.getCzxid());
+ew1.put("ephemeralOwner", stat.getEphemeralOwner());
+ew1.put("mtime", stat.getMtime());
+ew1.put("mzxid", stat.getMzxid());
+ew1.put("pzxid", stat.getPzxid());
+ew1.put("dataLength", stat.getDataLength());
+  });
+} catch (Exception e) {
+  ew.put("s", Collections.singletonMap("error", e.getMessage()));
+}
+  }
+});
+
+  } else {
+d = coreContainer.getZkController().getZkClient().getData(path, null, 
null, false);
+if (d == null || d.length == 0) {
+  rsp.add(path, null);
+  return;
+}
+
+Map map = new HashMap<>(1);
+map.put(WT, "raw");
+map.put(OMIT_HEADER, "true");
+req.setParams(SolrParams.wrapDefaults(new MapSolrParams(map), 
req.getParams()));
 
 Review comment:
   No, if you are requesting data, you should expect raw data; it will not 
honour the `wt` param.


This is an automated message from the Apache Git Service.
To 

[GitHub] [lucene-solr] CaoManhDat commented on issue #1395: SOLR-14365: CollapsingQParser - Avoiding always allocate int[] and float[] with size equals to number of unique values (WIP)

2020-04-08 Thread GitBox
CaoManhDat commented on issue #1395: SOLR-14365: CollapsingQParser - Avoiding 
always allocate int[] and float[] with size equals to number of unique values 
(WIP)
URL: https://github.com/apache/lucene-solr/pull/1395#issuecomment-611288608
 
 
   @dsmiley Done!


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene-solr] dsmiley commented on issue #1395: SOLR-14365: CollapsingQParser - Avoiding always allocate int[] and float[] with size equals to number of unique values (WIP)

2020-04-08 Thread GitBox
dsmiley commented on issue #1395: SOLR-14365: CollapsingQParser - Avoiding 
always allocate int[] and float[] with size equals to number of unique values 
(WIP)
URL: https://github.com/apache/lucene-solr/pull/1395#issuecomment-611276129
 
 
   Dat, can you move it out of the numeric package please?


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene-solr] CaoManhDat commented on issue #1395: SOLR-14365: CollapsingQParser - Avoiding always allocate int[] and float[] with size equals to number of unique values (WIP)

2020-04-08 Thread GitBox
CaoManhDat commented on issue #1395: SOLR-14365: CollapsingQParser - Avoiding 
always allocate int[] and float[] with size equals to number of unique values 
(WIP)
URL: https://github.com/apache/lucene-solr/pull/1395#issuecomment-611274493
 
 
   Thanks a lot for your hard work @bruno-roustant 


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene-solr] mayya-sharipova commented on issue #1351: LUCENE-9280: Collectors to skip noncompetitive documents

2020-04-08 Thread GitBox
mayya-sharipova commented on issue #1351: LUCENE-9280: Collectors to skip 
noncompetitive documents
URL: https://github.com/apache/lucene-solr/pull/1351#issuecomment-611260306
 
 
   @romseygeek  I have tried to address your outstanding feedback in 
4448499f0f.  Can you please continue the review when you have time?
   
   > Move the logic that checks whether or not to update the iterator into 
setBottom on the leaf comparator.
   
   In the new `FilteringFieldComparator` class, the iterator is updated in
   - setBottom
   - when we change a segment in `getLeafComparator`, so that we can also 
update iterators of subsequent segments. 
   - and also when the queue first becomes full and hitsThreshold is reached, in 
`setCanUpdateIterator`; this method is called from the collector (a rough sketch 
of these update points is below).
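
   A simplified sketch of those three update points (only the names mentioned 
above come from the PR; everything else is assumed for illustration):
   ```java
   // Simplified sketch of where the competitive-document iterator is refreshed.
   // The real FilteringFieldComparator wraps a FieldComparator; details omitted.
   class FilteringComparatorSketch {
     private boolean canUpdateIterator; // flips to true once hitsThreshold is reached

     // 1. called by the collector when the queue is full and hitsThreshold is reached
     void setCanUpdateIterator() {
       canUpdateIterator = true;
       updateCompetitiveIterator();
     }

     // 2. called whenever the bottom entry of the priority queue changes
     void setBottom(int slot) {
       if (canUpdateIterator) {
         updateCompetitiveIterator();
       }
     }

     // 3. called when moving to a new segment, so later segments also skip docs
     void getLeafComparator(/* LeafReaderContext context */) {
       if (canUpdateIterator) {
         updateCompetitiveIterator();
       }
     }

     private void updateCompetitiveIterator() {
       // rebuild the iterator over documents that can still be competitive
     }
   }
   ```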


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Updated] (LUCENE-9267) The documentation of getQueryBuildTime function reports a wrong time unit.

2020-04-08 Thread Pierre-Luc Perron (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-9267?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pierre-Luc Perron updated LUCENE-9267:
--
Attachment: (was: LUCENE-9267.patch)

> The documentation of getQueryBuildTime function reports a wrong time unit.
> --
>
> Key: LUCENE-9267
> URL: https://issues.apache.org/jira/browse/LUCENE-9267
> Project: Lucene - Core
>  Issue Type: Task
>  Components: modules/other
>Affects Versions: 8.2, 8.3, 8.4
>Reporter: Pierre-Luc Perron
>Priority: Trivial
>  Labels: documentation, newbie, pull-request-available
> Attachments: LUCENE-9267.patch
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> As per documentation, the 
> [MatchingQueries|https://lucene.apache.org/core/8_4_1/monitor/org/apache/lucene/monitor/MatchingQueries.html]
>  class returns both getQueryBuildTime and getSearchTime in milliseconds. The 
> code shows 
> [searchTime|https://github.com/apache/lucene-solr/blob/320578274be74a18ce150b604d28a740545fde48/lucene/monitor/src/java/org/apache/lucene/monitor/CandidateMatcher.java#L112]
>  returning milliseconds. However, the code shows 
> [buildTime|https://github.com/apache/lucene-solr/blob/320578274be74a18ce150b604d28a740545fde48/lucene/monitor/src/java/org/apache/lucene/monitor/QueryIndex.java#L280]
>  returning nanoseconds.
> The patch changes the documentation of getQueryBuildTime to report 
> nanoseconds instead of milliseconds.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Updated] (LUCENE-9267) The documentation of getQueryBuildTime function reports a wrong time unit.

2020-04-08 Thread Pierre-Luc Perron (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-9267?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pierre-Luc Perron updated LUCENE-9267:
--
Attachment: LUCENE-9267.patch

> The documentation of getQueryBuildTime function reports a wrong time unit.
> --
>
> Key: LUCENE-9267
> URL: https://issues.apache.org/jira/browse/LUCENE-9267
> Project: Lucene - Core
>  Issue Type: Task
>  Components: modules/other
>Affects Versions: 8.2, 8.3, 8.4
>Reporter: Pierre-Luc Perron
>Priority: Trivial
>  Labels: documentation, newbie, pull-request-available
> Attachments: LUCENE-9267.patch, LUCENE-9267.patch
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> As per documentation, the 
> [MatchingQueries|https://lucene.apache.org/core/8_4_1/monitor/org/apache/lucene/monitor/MatchingQueries.html]
>  class returns both getQueryBuildTime and getSearchTime in milliseconds. The 
> code shows 
> [searchTime|https://github.com/apache/lucene-solr/blob/320578274be74a18ce150b604d28a740545fde48/lucene/monitor/src/java/org/apache/lucene/monitor/CandidateMatcher.java#L112]
>  returning milliseconds. However, the code shows 
> [buildTime|https://github.com/apache/lucene-solr/blob/320578274be74a18ce150b604d28a740545fde48/lucene/monitor/src/java/org/apache/lucene/monitor/QueryIndex.java#L280]
>  returning nanoseconds.
> The patch changes the documentation of getQueryBuildTime to report 
> nanoseconds instead of milliseconds.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene-solr] plperron commented on issue #1330: LUCENE-9267 Replace getQueryBuildTime time unit from ms to ns

2020-04-08 Thread GitBox
plperron commented on issue #1330: LUCENE-9267 Replace getQueryBuildTime time 
unit from ms to ns
URL: https://github.com/apache/lucene-solr/pull/1330#issuecomment-611254162
 
 
   Should I rebase both commits into a single one in order to keep the 
cohesiveness?


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene-solr] mocobeta commented on a change in pull request #1388: LUCENE-9278: Use -linkoffline instead of relative paths to make links to other projects

2020-04-08 Thread GitBox
mocobeta commented on a change in pull request #1388: LUCENE-9278: Use 
-linkoffline instead of relative paths to make links to other projects
URL: https://github.com/apache/lucene-solr/pull/1388#discussion_r405878977
 
 

 ##
 File path: gradle/render-javadoc.gradle
 ##
 @@ -15,93 +15,105 @@
  * limitations under the License.
  */
 
-// generate javadocs by using Ant javadoc task
+// generate javadocs by calling javadoc tool
+// see https://docs.oracle.com/en/java/javase/11/tools/javadoc.html
+
+// utility function to convert project path to document output dir
+// e.g.: ':lucene:analysis:common' => 'analysis/common'
+def pathToDocdir = { path -> path.split(':').drop(2).join('/') }
 
 allprojects {
   plugins.withType(JavaPlugin) {
-ext {
-  javadocRoot = project.path.startsWith(':lucene') ? 
project(':lucene').file("build/docs") : project(':solr').file("build/docs")
-  javadocDestDir = "${javadocRoot}/${project.name}"
-}
-
 task renderJavadoc {
-  description "Generates Javadoc API documentation for the main source 
code. This invokes Ant Javadoc Task."
+  description "Generates Javadoc API documentation for the main source 
code. This directly invokes javadoc tool."
   group "documentation"
 
   ext {
-linksource = "no"
+linksource = false
 linkJUnit = false
-linkHref = []
+linkLuceneProjects = []
+linkSorlProjects = []
   }
 
   dependsOn sourceSets.main.compileClasspath
 
   inputs.files { sourceSets.main.java.asFileTree }
-  outputs.dir project.javadocRoot
+  outputs.dir project.javadoc.destinationDir
 
   def libName = project.path.startsWith(":lucene") ? "Lucene" : "Solr"
   def title = "${libName} ${project.version} ${project.name} 
API".toString()
 
+  // absolute urls for "-linkoffline" option
+  def javaSEDocUrl = "https://docs.oracle.com/en/java/javase/11/docs/api/"
+  def junitDocUrl = "https://junit.org/junit4/javadoc/4.12/"
+  def luceneDocUrl = 
"https://lucene.apache.org/core/${project.version.replace(".", "_")}".toString()
+  def solrDocUrl = 
"https://lucene.apache.org/solr/${project.version.replace(".", "_")}".toString()
+
+  def javadocCmd = 
org.gradle.internal.jvm.Jvm.current().getJavadocExecutable()
 
 Review comment:
   I just merged it to the master. 
   
   > We may have to do something similar to what ES does since we want to be 
able to run javac, javadocs and tests against new JVMs (which gradle itself may 
not support yet).
   
   Should we open an issue for that, or can it be delayed?


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (SOLR-12890) Vector Search in Solr (Umbrella Issue)

2020-04-08 Thread Trey Grainger (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-12890?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17078818#comment-17078818
 ] 

Trey Grainger commented on SOLR-12890:
--

After reviewing and testing the code in the patch generously contributed on 
this issue (thank you [~moshebla]!) and subsequently thinking through the 
design a lot, I believe there are several limitations to the approach in this 
current code. Specifically, the use of terms as dimensions in the vector with 
attached payload is pretty inefficient and won't work well at scale, and the use 
of a query parser is less flexible and reusable than a function query/value 
source approach would be (in terms of more flexible combination with other 
functions and use in sorting, returned fields, etc.). Additionally, I think an 
optimal design would allow for multi-valued vectors (multiple vectors in a 
field) in order to support things like word embeddings, sentence embeddings, 
paragraph embeddings, etc., as opposed to only one vector per field in each 
document, which is challenging to implement with the current approach.

Instead of hijacking this Jira and replacing the previous work and design, I've 
created a new Jira (SOLR-14397) and submitted a new proposed design there, 
which I plan to work on as next iteration of this Vector Search in Solr 
initiative.

If you're following along with this effort, I'd encourage you to check out 
SOLR-14397 and provide any feedback on the updated design proposed there. 
Thanks!

> Vector Search in Solr (Umbrella Issue)
> --
>
> Key: SOLR-12890
> URL: https://issues.apache.org/jira/browse/SOLR-12890
> Project: Solr
>  Issue Type: New Feature
>Reporter: mosh
>Priority: Major
>
> We have recently come across a need to index documents containing vectors 
> using solr, and have even worked on a small POC. We used an URP to calculate 
> the LSH(we chose to use the superbit algorithm, but the code is designed in a 
> way the algorithm picked can be easily chagned), and stored the vector in 
> either sparse or dense forms, in a binary field.
> Perhaps an addition of an LSH URP in conjunction with a query parser that 
> uses the same properties to calculate LSH(or maybe ktree, or some other 
> algorithm all together) should be considered as a Solr feature?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-9278) Make javadoc folder structure follow Gradle project path

2020-04-08 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9278?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17078815#comment-17078815
 ] 

ASF subversion and git services commented on LUCENE-9278:
-

Commit 4f92cd414c4da6ac6163ff4101b0e07fb94fd067 in lucene-solr's branch 
refs/heads/master from Tomoko Uchida
[ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=4f92cd4 ]

LUCENE-9278: Use -linkoffline instead of relative paths to make links to other 
projects (#1388)



> Make javadoc folder structure follow Gradle project path
> 
>
> Key: LUCENE-9278
> URL: https://issues.apache.org/jira/browse/LUCENE-9278
> Project: Lucene - Core
>  Issue Type: Task
>  Components: general/build
>Reporter: Tomoko Uchida
>Priority: Major
>  Time Spent: 6h
>  Remaining Estimate: 0h
>
> Current javadoc folder structure is derived from Ant project name. e.g.:
> [https://lucene.apache.org/core/8_4_1/analyzers-icu/index.html]
>  [https://lucene.apache.org/solr/8_4_1/solr-solrj/index.html]
> For Gradle build, it should also follow gradle project structure (path) 
> instead of ant one, to keep things simple to manage [1]. Hence, it will look 
> like this:
> [https://lucene.apache.org/core/9_0_0/analysis/icu/index.html]
>  [https://lucene.apache.org/solr/9_0_0/solr/solrj/index.html]
> [1] The change was suggested at the conversation between Dawid Weiss and I on 
> a github pr: [https://github.com/apache/lucene-solr/pull/1304]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene-solr] mocobeta merged pull request #1388: LUCENE-9278: Use -linkoffline instead of relative paths to make links to other projects

2020-04-08 Thread GitBox
mocobeta merged pull request #1388: LUCENE-9278: Use -linkoffline instead of 
relative paths to make links to other projects
URL: https://github.com/apache/lucene-solr/pull/1388
 
 
   


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (SOLR-14370) Refactor bin/solr to allow external override of Jetty modules

2020-04-08 Thread Andy Throgmorton (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-14370?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17078806#comment-17078806
 ] 

Andy Throgmorton commented on SOLR-14370:
-

Sure, I can explain more to solicit alternative solutions. But I understand if 
this type of use case is something the Solr community doesn't want to 
encourage/support.

We have some code that builds a custom SslContext, and we need the Jetty server 
to use that during bootstrap. For this purpose, we use a custom jetty.xml file 
and a custom module that loads/runs this at startup.

> Refactor bin/solr to allow external override of Jetty modules
> -
>
> Key: SOLR-14370
> URL: https://issues.apache.org/jira/browse/SOLR-14370
> Project: Solr
>  Issue Type: Improvement
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: scripts and tools
>Reporter: Andy Throgmorton
>Priority: Minor
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> The bin/solr script currently does not allow for externally overriding the 
> modules passed to Jetty on startup.
> This PR adds the ability to override the Jetty modules on startup by setting 
> {{JETTY_MODULES}} as an environment variable; when passed, bin/solr will pass 
> through (and not clobber) the string verbatim into {{SOLR_JETTY_CONFIG}}. For 
> example, you can now run:
> {{JETTY_MODULES=--module=foo bin/solr start}}
> We've added some custom Jetty modules that can be optionally enabled; this 
> change allows us to keep our logic (regarding which modules to use) in a 
> separate script, rather than maintaining a forked bin/solr.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-9286) FST arc.copyOf clones BitTables and this can lead to excessive memory use

2020-04-08 Thread Dawid Weiss (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9286?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17078792#comment-17078792
 ] 

Dawid Weiss commented on LUCENE-9286:
-

+1!

> FST arc.copyOf clones BitTables and this can lead to excessive memory use
> -
>
> Key: LUCENE-9286
> URL: https://issues.apache.org/jira/browse/LUCENE-9286
> Project: Lucene - Core
>  Issue Type: Bug
>Affects Versions: 8.5
>Reporter: Dawid Weiss
>Assignee: Bruno Roustant
>Priority: Major
> Attachments: screen-[1].png
>
>  Time Spent: 1h 10m
>  Remaining Estimate: 0h
>
> I see a dramatic increase in the amount of memory required for construction 
> of (arguably large) automata. It currently OOMs with 8GB of memory consumed 
> for bit tables. I am pretty sure this didn't require so much memory before 
> (the automaton is ~50MB after construction).
> Something bad happened in between. Thoughts, [~broustant], [~sokolov]?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene-solr] madrob commented on issue #1412: Add MinimalSolrTest for scale testing

2020-04-08 Thread GitBox
madrob commented on issue #1412: Add MinimalSolrTest for scale testing
URL: https://github.com/apache/lucene-solr/pull/1412#issuecomment-611231598
 
 
   Would this be better in test-framework as a stub? My goals here are to 
always have something that I can run against master without needing to recreate 
this class every time I update my branch or constantly rebase a patch or 
whatever. I don't think this makes sense as a JMH bench. We could add a trivial 
assert that the test run times have a total ordering (compare them all in 
`@AfterClass`)?
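
   One way to read that idea (an illustrative sketch only; the timing map and 
scenario registration are assumed, not part of this PR):
   ```java
   // Sketch: record each scenario's elapsed time, then assert in @AfterClass that
   // the times are non-decreasing in the order the scenarios were registered
   // (smallest scale first).
   import java.util.ArrayList;
   import java.util.Collections;
   import java.util.LinkedHashMap;
   import java.util.List;
   import java.util.Map;

   import org.junit.AfterClass;
   import static org.junit.Assert.assertEquals;

   public class MinimalSolrTimingSketch {
     private static final Map<String, Long> TIMES = new LinkedHashMap<>();

     // each test scenario would call record("name", startNanos) when it finishes
     static void record(String scenario, long startNanos) {
       TIMES.put(scenario, System.nanoTime() - startNanos);
     }

     @AfterClass
     public static void assertTotalOrdering() {
       List<Long> observed = new ArrayList<>(TIMES.values());
       List<Long> sorted = new ArrayList<>(observed);
       Collections.sort(sorted);
       assertEquals(sorted, observed);
     }
   }
   ```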


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Created] (SOLR-14397) Vector Search in Solr

2020-04-08 Thread Trey Grainger (Jira)
Trey Grainger created SOLR-14397:


 Summary: Vector Search in Solr
 Key: SOLR-14397
 URL: https://issues.apache.org/jira/browse/SOLR-14397
 Project: Solr
  Issue Type: Improvement
  Security Level: Public (Default Security Level. Issues are Public)
Reporter: Trey Grainger


Search engines have traditionally relied upon token-based matching (typically 
keywords) on an inverted index, plus relevance ranking based upon keyword 
occurrence statistics. This can be viewed as a “sparse vector” match (where 
each term is a one-hot encoded dimension in the vector), since only a few 
keywords out of all possible keywords are considered in each query. With the 
introduction of deep-learning-based transformers over the last few years, 
however, the state of the art in relevance has moved to ranking models based 
upon dense vectors that encode a latent, semantic understanding of both 
language constructs and the underlying domain upon which the model was trained. 
These dense vectors are also referred to as “embeddings”. An example of this 
kind of embedding would be taking the phrase “chief executive officer of the 
tech company” and converting it to [0.03, 1.7, 9.12, 0, 0.3]. Other similar 
phrases should encode to vectors with very similar numbers, so 
we may expect a query like “CEO of a technology org” to generate a vector like 
[0.1, 1.9, 8.9, 0.1, 0.4]. When performing a cosine similarity calculation 
between these vectors, we would expect a number closer to 1.0, whereas a very 
unrelated text blurb would generate a much smaller cosine similarity.
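
For example, a quick back-of-the-envelope check of the two example embeddings 
above (sketch code, not part of the proposal itself):
{code:java}
// Cosine similarity of the two example embeddings; prints roughly 0.9995.
public class CosineExample {
  static double cosine(double[] a, double[] b) {
    double dot = 0, normA = 0, normB = 0;
    for (int i = 0; i < a.length; i++) {
      dot += a[i] * b[i];
      normA += a[i] * a[i];
      normB += b[i] * b[i];
    }
    return dot / (Math.sqrt(normA) * Math.sqrt(normB));
  }

  public static void main(String[] args) {
    double[] doc = {0.03, 1.7, 9.12, 0, 0.3};   // "chief executive officer of the tech company"
    double[] query = {0.1, 1.9, 8.9, 0.1, 0.4}; // "CEO of a technology org"
    System.out.println(cosine(doc, query));     // ~0.9995, i.e. very similar
  }
}
{code}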

This is a proposal for how we should implement these vector search capabilities 
in Solr.
h1. Search Process Overview:

In order to implement dense vector search, the following process is typically 
followed:
h2. Offline:

An encoder is built. An encoder can take in text (a query, a sentence, a 
paragraph, a document, etc.) and return a dense vector representing that 
document in a rich semantic space. The semantic space is learned from training 
on textual data (usually, though other sources work, too), typically from the 
domain of the search engine.
h2. Document Ingestion:

When documents are processed, they are passed to the encoder, and the dense 
vector(s) returned are stored as fields on the document. There could be one or 
more vectors per-document, as the granularity of the vectors could be 
per-document, per field, per paragraph, per-sentence, or even per phrase or per 
term.
h2. Query Time:

*Encoding:* The query is translated to a dense vector by passing it to the 
encoder.
 *Quantization:* The query is quantized. Quantization is the process of taking a 
vector with many values and turning it into “terms” in a vector space that 
approximates the full vector space of the dense vectors.
 *ANN Matching:* A query on the quantized vector tokens is executed as an ANN 
(approximate nearest neighbor) search. This allows finding most of the best 
matching documents (typically up to 95%) with a traditional and efficient 
lookup against the inverted index.
 *(optional) ANN Ranking:* ranking may be performed based upon the matched 
quantized tokens to get a rough, initial ranking of documents based upon the 
similarity of the query and document vectors. This allows the next step 
(re-ranking) to be performed on a smaller subset of documents. 
 *Re-Ranking:* Once the initial matching (and optionally ANN ranking) is 
performed, a similarity calculation (cosine, dot-product, or any number of 
other calculations) is typically performed between the full (non-quantized) 
dense vectors for the query and those in the document. This re-ranking will 
typically be on the top-N results for performance reasons.
 *Return Results:* As with any search, the final step is typically to return 
the results in relevance-ranked order. In this case, that would be sorted by 
the re-ranking similarity score (i.e. “cosine descending”).
 --

*Variant:* For small document sets, it may be preferable to rank all documents 
and skip steps 2, 3, and 4. This is because ANN Matching typically 
reduces recall (current state of the art is around 95% recall), so it can be 
beneficial to rank all documents if performance is not a concern. In this case, 
step 5 is performed on the full doc set and would obviously just be considered 
“ranking” instead of “re-ranking”.
h1. Proposed Implementation in Solr:
h2. Phase 1: Storage of Dense Vectors & Scoring on Vector Similarity
h3. Dense Vector Field:

We will add a new dense vector field type in Solr. This field type would be a 
compressed encoding of a dense vector into a BinaryDocValues Field. There are 
other ways to do it, but this is almost certain to be the most efficient.
 Ideally this field is multi-valued. If it is single-valued then we are either 
limited to only document-level vectors, or otherwise we have to create many 

[jira] [Closed] (LUCENE-4048) Move getLines out of ResourceLoader and require Charset

2020-04-08 Thread David Smiley (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-4048?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David Smiley closed LUCENE-4048.


> Move getLines out of ResourceLoader and require Charset
> ---
>
> Key: LUCENE-4048
> URL: https://issues.apache.org/jira/browse/LUCENE-4048
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/analysis
>Reporter: Chris Male
>Priority: Major
> Fix For: 4.0-BETA
>
> Attachments: LUCENE-4048.patch, LUCENE-4048.patch
>
>
> {{ResourceLoader.getLines()}} is only used by analysis factories.  
> {{SolrResourceLoader}}'s implementation does the job well and it's unlikely 
> that another {{ResourceLoader}} implementation would handle it differently.
> We should extract the {{getLines()}} method out to 
> {{AbstractAnalysisFactory}} so it can be used by the factories.  Additionally 
> we shouldn't assume the files are encoded in UTF-8, instead we should allow a 
> Charset to be specified.
> This would take us one step closer to reducing the {{ResourceLoader}} 
> interface just to what it says, a loader of resources.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Resolved] (LUCENE-4048) Move getLines out of ResourceLoader and require Charset

2020-04-08 Thread David Smiley (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-4048?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David Smiley resolved LUCENE-4048.
--
Fix Version/s: 4.0-BETA
   Resolution: Duplicate

> Move getLines out of ResourceLoader and require Charset
> ---
>
> Key: LUCENE-4048
> URL: https://issues.apache.org/jira/browse/LUCENE-4048
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/analysis
>Reporter: Chris Male
>Priority: Major
> Fix For: 4.0-BETA
>
> Attachments: LUCENE-4048.patch, LUCENE-4048.patch
>
>
> {{ResourceLoader.getLines()}} is only used by analysis factories.  
> {{SolrResourceLoader}}'s implementation does the job well and it's unlikely 
> that another {{ResourceLoader}} implementation would handle it differently.
> We should extract the {{getLines()}} method out to 
> {{AbstractAnalysisFactory}} so it can be used by the factories.  Additionally 
> we shouldn't assume the files are encoded in UTF-8, instead we should allow a 
> Charset to be specified.
> This would take us one step closer to reducing the {{ResourceLoader}} 
> interface just to what it says, a loader of resources.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-9286) FST arc.copyOf clones BitTables and this can lead to excessive memory use

2020-04-08 Thread Bruno Roustant (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9286?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17078707#comment-17078707
 ] 

Bruno Roustant commented on LUCENE-9286:


To complete the perf benchmark, I ran luceneutil on both wikimedium500k and 
wikimediumall. I see a perf slowdown of 4%-5% in PKLookup with FST off-heap 
(and only on PKLookup).

Given that, when it was introduced, this direct addressing node improved the 
PKLookup perf by at least twice this slowdown, and given that this fix greatly 
improves the FSTEnum traversal speed and the memory use for large automata, I 
consider this slowdown acceptable.

I'm going to merge the PR tomorrow.

> FST arc.copyOf clones BitTables and this can lead to excessive memory use
> -
>
> Key: LUCENE-9286
> URL: https://issues.apache.org/jira/browse/LUCENE-9286
> Project: Lucene - Core
>  Issue Type: Bug
>Affects Versions: 8.5
>Reporter: Dawid Weiss
>Assignee: Bruno Roustant
>Priority: Major
> Attachments: screen-[1].png
>
>  Time Spent: 1h 10m
>  Remaining Estimate: 0h
>
> I see a dramatic increase in the amount of memory required for construction 
> of (arguably large) automata. It currently OOMs with 8GB of memory consumed 
> for bit tables. I am pretty sure this didn't require so much memory before 
> (the automaton is ~50MB after construction).
> Something bad happened in between. Thoughts, [~broustant], [~sokolov]?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (SOLR-12005) Solr should have the option of logging all jars loaded

2020-04-08 Thread David Smiley (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-12005?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17078704#comment-17078704
 ] 

David Smiley commented on SOLR-12005:
-

Instead of looking for logging as the tool for this need, we might instead look 
for Solr administrative handlers to expose this information.

> Solr should have the option of logging all jars loaded
> --
>
> Key: SOLR-12005
> URL: https://issues.apache.org/jira/browse/SOLR-12005
> Project: Solr
>  Issue Type: Improvement
>  Components: logging
>Reporter: Shawn Heisey
>Priority: Major
>
> Solr used to explicitly log the filename of every jar it loaded.  It seems 
> that the effort to reduce the verbosity of the logs has changed this, now it 
> just logs the *count* of jars loaded and the paths where they were loaded 
> from.  Here's a log line where Solr is reading from ${solr.solr.home}/lib:
> {code}
> 2018-02-01 17:43:20.043 INFO  (main) [   ] o.a.s.c.SolrResourceLoader [null] 
> Added 8 libs to classloader, from paths: [/index/solr6/data/lib]
> {code}
> When trying to help somebody with classloader issues, it's more difficult to 
> help when the list of jars loaded isn't in the log.
> I would like the more verbose logging to be enabled by default, but I 
> understand that many people would not want that, so I propose this:
>  * Enable verbose logging for ${solr.solr.home}/lib by default.
>  * Disable verbose logging for each core by default.  Allow solrconfig.xml to 
> enable it.
>  * Optionally allow solr.xml to configure verbose logging at the global level.
>  ** This setting would affect both global and per-core jar loading. Each 
> solrconfig.xml could override.
> Rationale: The contents of ${solr.solr.home}/lib are loaded precisely once, 
> and this location doesn't even exist unless a user creates it.  An 
> out-of-the-box config would not have verbose logs from jar loading.
> The solr home lib location is my preferred way of loading custom jars, 
> because they get loaded only once, no matter how many cores you have.  Jars 
> added to this location would add lines to the log, but it would not be logged 
> for every core.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-9286) FST arc.copyOf clones BitTables and this can lead to excessive memory use

2020-04-08 Thread Bruno Roustant (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9286?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17078703#comment-17078703
 ] 

Bruno Roustant commented on LUCENE-9286:


Ah ! :) I see the faulty assertion. I'll remove it now because it's not so 
useful.

> FST arc.copyOf clones BitTables and this can lead to excessive memory use
> -
>
> Key: LUCENE-9286
> URL: https://issues.apache.org/jira/browse/LUCENE-9286
> Project: Lucene - Core
>  Issue Type: Bug
>Affects Versions: 8.5
>Reporter: Dawid Weiss
>Assignee: Bruno Roustant
>Priority: Major
> Attachments: screen-[1].png
>
>  Time Spent: 1h 10m
>  Remaining Estimate: 0h
>
> I see a dramatic increase in the amount of memory required for construction 
> of (arguably large) automata. It currently OOMs with 8GB of memory consumed 
> for bit tables. I am pretty sure this didn't require so much memory before 
> (the automaton is ~50MB after construction).
> Something bad happened in between. Thoughts, [~broustant], [~sokolov]?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (SOLR-8773) Make blob store usage intuitive and robust

2020-04-08 Thread David Smiley (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-8773?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17078701#comment-17078701
 ] 

David Smiley commented on SOLR-8773:


[~noble.paul] I suppose all this might be Won't-Fix given the existing 
so-called Blob Store is deprecated?  However, some of these might be 
re-imagined in the context of its replacement -- the "filestore" and may still 
make sense.

> Make blob store usage intuitive and robust
> --
>
> Key: SOLR-8773
> URL: https://issues.apache.org/jira/browse/SOLR-8773
> Project: Solr
>  Issue Type: Improvement
>Reporter: Noble Paul
>Assignee: Noble Paul
>Priority: Major
>
> blob store is provided as a feature and the only current use is to load jars 
> from there. Ideally, all resources should be loadable from blob store. But, 
> it is not yet ready for prime time because it is just a simple crud handler 
> for binary files. We should provide nice wrappers (java APIs as well as http 
> APIs) which mask the complexity of the underlying storage. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene-solr] s1monw commented on issue #1397: LUCENE-9304: Refactor DWPTPool to pool DWPT directly

2020-04-08 Thread GitBox
s1monw commented on issue #1397: LUCENE-9304: Refactor DWPTPool to pool DWPT 
directly
URL: https://github.com/apache/lucene-solr/pull/1397#issuecomment-611181862
 
 
   @mikemccand I merged master into this branch


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-9271) Make BufferedIndexInput work on a ByteBuffer

2020-04-08 Thread Simon Willnauer (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9271?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17078690#comment-17078690
 ] 

Simon Willnauer commented on LUCENE-9271:
-

thanks [~jpountz]

> Make BufferedIndexInput work on a ByteBuffer
> 
>
> Key: LUCENE-9271
> URL: https://issues.apache.org/jira/browse/LUCENE-9271
> Project: Lucene - Core
>  Issue Type: Task
>Reporter: Adrien Grand
>Priority: Minor
>  Time Spent: 1h 40m
>  Remaining Estimate: 0h
>
> Currently {{BufferedIndexInput}} works on a {{byte[]}} but its main 
> implementation, in NIOFSDirectory, has to implement a hack to maintain a 
> ByteBuffer view of it that it can use in calls to the FileChannel API. Maybe 
> we should instead make {{BufferedIndexInput}} work directly on a 
> {{ByteBuffer}}? This would also help reuse the existing 
> {{ByteBuffer#get(|Short|Int|long)}} methods instead of duplicating them from 
> {{DataInput}}.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-9271) Make BufferedIndexInput work on a ByteBuffer

2020-04-08 Thread Adrien Grand (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9271?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17078688#comment-17078688
 ] 

Adrien Grand commented on LUCENE-9271:
--

Sorry about that, it should have been fixed in 
3363e1aa4897a5eca9f390a9f22cab5686305ef7 yesterday.

> Make BufferedIndexInput work on a ByteBuffer
> 
>
> Key: LUCENE-9271
> URL: https://issues.apache.org/jira/browse/LUCENE-9271
> Project: Lucene - Core
>  Issue Type: Task
>Reporter: Adrien Grand
>Priority: Minor
>  Time Spent: 1h 40m
>  Remaining Estimate: 0h
>
> Currently {{BufferedIndexInput}} works on a {{byte[]}} but its main 
> implementation, in NIOFSDirectory, has to implement a hack to maintain a 
> ByteBuffer view of it that it can use in calls to the FileChannel API. Maybe 
> we should instead make {{BufferedIndexInput}} work directly on a 
> {{ByteBuffer}}? This would also help reuse the existing 
> {{ByteBuffer#get(|Short|Int|long)}} methods instead of duplicating them from 
> {{DataInput}}.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene-solr] jpountz commented on issue #1397: LUCENE-9304: Refactor DWPTPool to pool DWPT directly

2020-04-08 Thread GitBox
jpountz commented on issue #1397: LUCENE-9304: Refactor DWPTPool to pool DWPT 
directly
URL: https://github.com/apache/lucene-solr/pull/1397#issuecomment-611180446
 
 
   @mikemccand I think it's related to the change I merged yesterday, indeed. I 
fixed it shortly after merging, so if you merge master back in, this should 
address the failure.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene-solr] s1monw commented on a change in pull request #1389: LUCENE-9298: fix clearDeletedDocIds in BufferedUpdates

2020-04-08 Thread GitBox
s1monw commented on a change in pull request #1389: LUCENE-9298: fix 
clearDeletedDocIds in BufferedUpdates
URL: https://github.com/apache/lucene-solr/pull/1389#discussion_r405796843
 
 

 ##
 File path: lucene/core/src/java/org/apache/lucene/index/BufferedUpdates.java
 ##
 @@ -176,8 +176,11 @@ void addBinaryUpdate(BinaryDocValuesUpdate update, int 
docIDUpto) {
   }
 
   void clearDeleteTerms() {
-deleteTerms.clear();
 numTermDeletes.set(0);
+deleteTerms.forEach((term, docIDUpto) -> {
 
 Review comment:
   Instead of counting this here on clear, can we use a second counter for the 
deleteTerms next to `bytesUsed`? This would be great. It doesn't need to be 
thread safe IMO
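
   A rough sketch of the suggestion above, assuming a dedicated counter mirrors every addition to `deleteTerms` (field and constant names are illustrative, not the actual `BufferedUpdates` members):

```java
// Illustrative sketch: keep a second counter next to bytesUsed that only tracks
// the bytes attributed to deleteTerms, so clearDeleteTerms() can subtract once
// instead of iterating the map to recompute the total.
private final Counter bytesUsed = Counter.newCounter(true);
private final Counter deleteTermsBytesUsed = Counter.newCounter();  // hypothetical

void addTerm(Term term, int docIDUpto) {
  // ... existing map update and numTermDeletes bookkeeping ...
  long delta = BYTES_PER_DEL_TERM + term.bytes().length;  // same estimate as today
  bytesUsed.addAndGet(delta);
  deleteTermsBytesUsed.addAndGet(delta);   // mirror the same delta
}

void clearDeleteTerms() {
  deleteTerms.clear();
  numTermDeletes.set(0);
  bytesUsed.addAndGet(-deleteTermsBytesUsed.get());        // no iteration needed
  deleteTermsBytesUsed.addAndGet(-deleteTermsBytesUsed.get());
}
```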


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-9309) IW#addIndices(CodecReader) might delete files concurrently to IW#rollback

2020-04-08 Thread Simon Willnauer (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9309?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17078675#comment-17078675
 ] 

Simon Willnauer commented on LUCENE-9309:
-

[~mikemccand] can you take a look at the PR for this issue?

> IW#addIndices(CodecReader) might delete files concurrently to IW#rollback
> -
>
> Key: LUCENE-9309
> URL: https://issues.apache.org/jira/browse/LUCENE-9309
> Project: Lucene - Core
>  Issue Type: Bug
>Reporter: Simon Willnauer
>Priority: Major
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> During work on LUCENE-9304 [~mikemccand] ran into a failure: 
> {noformat}
> org.apache.lucene.index.TestAddIndexes > test suite's output saved to 
> /home/mike/src/simon/lucene/core/build/test-results/test/outputs/OUTPUT-org.apache.lucene.index.TestAddIndexes.txt,
>  copied below:
>> java.nio.file.NoSuchFileException: 
> _gt_Lucene85FieldsIndex-doc_ids_6u.tmp
>> at 
> __randomizedtesting.SeedInfo.seed([4760FA81FBD4B2CE:A147156E5F7BD9B0]:0)
>> at 
> org.apache.lucene.store.ByteBuffersDirectory.deleteFile(ByteBuffersDirectory.java:148)
>> at 
> org.apache.lucene.store.MockDirectoryWrapper.deleteFile(MockDirectoryWrapper.java:607)
>> at 
> org.apache.lucene.store.LockValidatingDirectoryWrapper.deleteFile(LockValidatingDirectoryWrapper.java:38)
>> at 
> org.apache.lucene.index.IndexFileDeleter.deleteFile(IndexFileDeleter.java:696)
>> at 
> org.apache.lucene.index.IndexFileDeleter.deleteFiles(IndexFileDeleter.java:690)
>> at 
> org.apache.lucene.index.IndexFileDeleter.refresh(IndexFileDeleter.java:449)
>> at 
> org.apache.lucene.index.IndexWriter.rollbackInternalNoCommit(IndexWriter.java:2334)
>> at 
> org.apache.lucene.index.IndexWriter.rollbackInternal(IndexWriter.java:2275)
>> at 
> org.apache.lucene.index.IndexWriter.rollback(IndexWriter.java:2268)
>> at 
> org.apache.lucene.index.TestAddIndexes.testAddIndexesWithRollback(TestAddIndexes.java:974)
>   2> NOTE: reproduce with: ant test  -Dtestcase=TestAddIndexes 
> -Dtests.method=testAddIndexesWithRollback -Dtests.seed=4760FA81FBD4B2CE 
> -Dtests.slow=true -Dtests.badapples=true -Dtests.locale=fr-GP -Dtests.t\
> imezone=Asia/Tbilisi -Dtests.asserts=true -Dtests.file.encoding=UTF-8
>   2> NOTE: test params are: codec=Asserting(Lucene84): 
> {c=PostingsFormat(name=LuceneFixedGap), 
> id=PostingsFormat(name=LuceneFixedGap), 
> f1=PostingsFormat(name=LuceneFixedGap), f2=BlockTreeOrds(blocksize=128)\
> , version=BlockTreeOrds(blocksize=128), content=FST50}, 
> docValues:{dv=DocValuesFormat(name=Lucene80), 
> soft_delete=DocValuesFormat(name=Lucene80), 
> doc=DocValuesFormat(name=Lucene80), id=DocValuesFormat(name=\
> Asserting), content=DocValuesFormat(name=Asserting), 
> doc2d=DocValuesFormat(name=Lucene80)}, maxPointsInLeafNode=982, 
> maxMBSortInHeap=5.837219998050092, 
> sim=Asserting(org.apache.lucene.search.similarities.As\
> sertingSimilarity@6ce38471), locale=fr-GP, timezone=Asia/Tbilisi
> {noformat}
> While this unfortunately doesn't reproduce, it's likely a bug that has existed 
> for quite some time but never showed up until LUCENE-9147, which uses a 
> temporary output. That's fine, but because IW#addIndices(CodecReader...) does 
> not register the merge it runs in the IW, we never wait for that merge to 
> finish during rollback, and if the merge finishes concurrently it will also 
> remove these .tmp files. 
> There are many ways to fix this and I can work on a patch, but hey, do we 
> really need to be able to add indices while we index, and do that on an open 
> and live IW, or can it be a tool on top of it?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene-solr] s1monw opened a new pull request #1418: LUCENE-9309: Wait for #addIndexes merges when aborting merges

2020-04-08 Thread GitBox
s1monw opened a new pull request #1418: LUCENE-9309: Wait for #addIndexes 
merges when aborting merges
URL: https://github.com/apache/lucene-solr/pull/1418
 
 
   The SegmentMerger usage in IW#addIndexes(CodecReader...) might make changes
   to the Directory while the IW tries to clean up files on rollback. This
   causes issues like FileNotFoundExceptions when the index file deleter (IFD)
   tries to remove temp files.
   This change adds a waiting mechanism to the abortMerges method that, in
   addition to the running merges, also waits for merges started by
   addIndices(CodecReader...).
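
   A minimal sketch of such a waiting mechanism, for illustration only (the counter and method names are hypothetical, not the actual IndexWriter change):

```java
// Illustrative sketch: count merges running on behalf of addIndexes(CodecReader...)
// and let abortMerges() block until they finish, so rollback's file cleanup cannot
// race with the SegmentMerger still writing temp files.
private int runningAddIndexesMerges;   // guarded by the IndexWriter monitor

private synchronized void addIndexesMergeStarted() {
  runningAddIndexesMerges++;
}

private synchronized void addIndexesMergeFinished() {
  runningAddIndexesMerges--;
  notifyAll();
}

private synchronized void abortMerges() {
  // ... abort and wait for registered merges as before ...
  while (runningAddIndexesMerges > 0) {
    try {
      wait();   // addIndexes(CodecReader...) is still merging; don't delete files yet
    } catch (InterruptedException ie) {
      throw new ThreadInterruptedException(ie);
    }
  }
}
```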


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Comment Edited] (SOLR-11384) add support for distributed graph query

2020-04-08 Thread sambasivarao giddaluri (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-11384?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17078637#comment-17078637
 ] 

sambasivarao giddaluri edited comment on SOLR-11384 at 4/8/20, 7:32 PM:


[~kwatters] Is it possible to share the Kafka approach used to work around the 
graph query parser's limitations? Also, with graph traversal via streaming we 
lose relevance ranking because we have to pass a sort field, adding multiple 
search conditions is awkward, and pagination is not supported. It would be 
really good to have distributed search functionality on the graph query parser. 
Can you share the patch details so I can test it locally and see how it performs?


was (Author: sambasiva12):
[~kwatters]  is it possible to share kafka approach  to over come graph query 
parser and also i can see with graph traversal with streaming it looses 
relevancy as we have to pass the sort field if we have to add multiple search 
conditions and pagination is not supported .. it would be really good to have 
the distributed search functionality on the graph query parser and can you 
share the patch i can test to see the performance of it.

> add support for distributed graph query
> ---
>
> Key: SOLR-11384
> URL: https://issues.apache.org/jira/browse/SOLR-11384
> Project: Solr
>  Issue Type: Improvement
>Reporter: Kevin Watters
>Priority: Minor
>
> Creating this ticket to track the work that I've done on the distributed 
> graph traversal support in solr.
> Current GraphQuery will only work on a single core, which introduces some 
> limits on where it can be used and also complexities if you want to scale it. 
>  I believe there's a strong desire to support a fully distributed method of 
> doing the Graph Query.  I'm working on a patch, it's not complete yet, but if 
> anyone would like to have a look at the approach and implementation,  I 
> welcome much feedback.  
> The flow for the distributed graph query is almost exactly the same as the 
> normal graph query.  The only difference is how it discovers the "frontier 
> query" at each level of the traversal.  
> When a distributed graph query request comes in, each shard begins by running 
> the root query, to know where to start on its shard.  Each participating 
> shard then discovers its edges for the next hop.  Those edges are then 
> broadcast to all other participating shards.  The shard then receives all the 
> parts of the frontier query, assembles it, and executes it.
> This process continues on each shard until there are no new edges left, or 
> the maxDepth of the traversal has finished.
> The approach is to introduce a FrontierBroker that resides as a singleton on 
> each one of the solr nodes in the cluster.  When a graph query is created, it 
> can do a getInstance() on it so it can listen on the frontier parts coming in.
> Initially, I was using an external Kafka broker to handle this, and it did 
> work pretty well.  The new approach is migrating the FrontierBroker to be a 
> request handler in Solr, and potentially to use the SolrJ client to publish 
> the edges to each node in the cluster.
> There are a few outstanding design questions, first being, how do we know 
> what the list of shards are that are participating in the current query 
> request?  Is that easy info to get at?
> Second,  currently, we are serializing a query object between the shards, 
> perhaps we should consider a slightly different abstraction, and serialize 
> lists of "edge" objects between the nodes.   The point of this would be to 
> batch the exploration/traversal of current frontier to help avoid large 
> bursts of memory being required.
> Third, what sort of caching strategy should be introduced for the frontier 
> queries, if any?  And if we do some caching there, how/when should the 
> entries be expired and auto-warmed.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (SOLR-11384) add support for distributed graph query

2020-04-08 Thread sambasivarao giddaluri (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-11384?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17078637#comment-17078637
 ] 

sambasivarao giddaluri commented on SOLR-11384:
---

[~kwatters] Is it possible to share the Kafka approach used to work around the 
graph query parser's limitations? Also, with graph traversal via streaming we 
lose relevance ranking because we have to pass a sort field, adding multiple 
search conditions is awkward, and pagination is not supported. It would be 
really good to have distributed search functionality on the graph query parser. 
Can you share the patch so I can test it and see how it performs?

> add support for distributed graph query
> ---
>
> Key: SOLR-11384
> URL: https://issues.apache.org/jira/browse/SOLR-11384
> Project: Solr
>  Issue Type: Improvement
>Reporter: Kevin Watters
>Priority: Minor
>
> Creating this ticket to track the work that I've done on the distributed 
> graph traversal support in solr.
> Current GraphQuery will only work on a single core, which introduces some 
> limits on where it can be used and also complexities if you want to scale it. 
>  I believe there's a strong desire to support a fully distributed method of 
> doing the Graph Query.  I'm working on a patch, it's not complete yet, but if 
> anyone would like to have a look at the approach and implementation,  I 
> welcome much feedback.  
> The flow for the distributed graph query is almost exactly the same as the 
> normal graph query.  The only difference is how it discovers the "frontier 
> query" at each level of the traversal.  
> When a distributed graph query request comes in, each shard begins by running 
> the root query, to know where to start on its shard.  Each participating 
> shard then discovers its edges for the next hop.  Those edges are then 
> broadcast to all other participating shards.  The shard then receives all the 
> parts of the frontier query, assembles it, and executes it.
> This process continues on each shard until there are no new edges left, or 
> the maxDepth of the traversal has finished.
> The approach is to introduce a FrontierBroker that resides as a singleton on 
> each one of the solr nodes in the cluster.  When a graph query is created, it 
> can do a getInstance() on it so it can listen on the frontier parts coming in.
> Initially, I was using an external Kafka broker to handle this, and it did 
> work pretty well.  The new approach is migrating the FrontierBroker to be a 
> request handler in Solr, and potentially to use the SolrJ client to publish 
> the edges to each node in the cluster.
> There are a few outstanding design questions, first being, how do we know 
> what the list of shards are that are participating in the current query 
> request?  Is that easy info to get at?
> Second,  currently, we are serializing a query object between the shards, 
> perhaps we should consider a slightly different abstraction, and serialize 
> lists of "edge" objects between the nodes.   The point of this would be to 
> batch the exploration/traversal of current frontier to help avoid large 
> bursts of memory being required.
> Third, what sort of caching strategy should be introduced for the frontier 
> queries, if any?  And if we do some caching there, how/when should the 
> entries be expired and auto-warmed.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene-solr] s1monw commented on issue #1397: LUCENE-9304: Refactor DWPTPool to pool DWPT directly

2020-04-08 Thread GitBox
s1monw commented on issue #1397: LUCENE-9304: Refactor DWPTPool to pool DWPT 
directly
URL: https://github.com/apache/lucene-solr/pull/1397#issuecomment-611147359
 
 
   @mikemccand did you run any benchmarks on this change yet?


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-9271) Make BufferedIndexInput work on a ByteBuffer

2020-04-08 Thread Simon Willnauer (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9271?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17078627#comment-17078627
 ] 

Simon Willnauer commented on LUCENE-9271:
-

We ran into a test failure 
[here|https://github.com/apache/lucene-solr/pull/1397#issuecomment-610930218]:

{noformat}
org.apache.lucene.index.TestIndexManyDocuments > test suite's output saved to 
/home/mike/src/simon/lucene/core/build/test-results/test/outputs/OUTPUT-org.apache.lucene.index.TestIndexManyDocuments.txt,
 copi\
ed below:
  1> CheckIndex failed
  1> 0.00% total deletions; 10976 documents; 0 deleteions
  1> Segments file=segments_1 numSegments=4 version=9.0.0 
id=cekay2d5izae12ssuqikoqgoc
  1>   1 of 4: name=_d maxDoc=8700
  1> version=9.0.0
  1> id=cekay2d5izae12ssuqikoqgob
  1> codec=Asserting(Lucene84)
  1> compound=false
  1> numFiles=11
  1> size (MB)=0.003
  1> diagnostics = {os.version=5.5.6-arch1-1, java.vendor=Oracle 
Corporation, source=merge, os.arch=amd64, mergeFactor=10, 
java.runtime.version=11.0.6+8-LTS, os=Linux, timestamp=1586347074798, lucene.ve\
rsion=9.0.0, java.vm.version=11.0.6+8-LTS, java.version=11.0.6, 
mergeMaxNumSegments=-1}
  1> no deletions
  1> test: open reader.OK [took 0.001 sec]
  1> test: check integrity.OK [took 0.000 sec]
  1> test: check live docs.OK [took 0.000 sec]
  1> test: field infos.OK [1 fields] [took 0.000 sec]
  1> test: field norms.OK [1 fields] [took 0.001 sec]
  1> test: terms, freq, prox...ERROR: java.lang.AssertionError: 
buffer=java.nio.HeapByteBuffer[pos=0 lim=0 cap=0] bufferSize=1024 
buffer.length=0
  1> java.lang.AssertionError: buffer=java.nio.HeapByteBuffer[pos=0 lim=0 
cap=0] bufferSize=1024 buffer.length=0
  1>at 
org.apache.lucene.store.BufferedIndexInput.setBufferSize(BufferedIndexInput.java:78)
  1>at 
org.apache.lucene.codecs.MultiLevelSkipListReader.loadSkipLevels(MultiLevelSkipListReader.java:241)
  1>at 
org.apache.lucene.codecs.MultiLevelSkipListReader.init(MultiLevelSkipListReader.java:208)
  1>at 
org.apache.lucene.codecs.lucene84.Lucene84SkipReader.init(Lucene84SkipReader.java:103)
  1>at 
org.apache.lucene.codecs.lucene84.Lucene84PostingsReader$EverythingEnum.advance(Lucene84PostingsReader.java:837)
  1>at 
org.apache.lucene.index.FilterLeafReader$FilterPostingsEnum.advance(FilterLeafReader.java:271)
  1>at 
org.apache.lucene.index.AssertingLeafReader$AssertingPostingsEnum.advance(AssertingLeafReader.java:377)
  1>at org.apache.lucene.index.CheckIndex.checkFields(CheckIndex.java:1426)
  1>at org.apache.lucene.index.CheckIndex.testPostings(CheckIndex.java:1867)
  1>at org.apache.lucene.index.CheckIndex.checkIndex(CheckIndex.java:720)
  1>at org.apache.lucene.util.TestUtil.checkIndex(TestUtil.java:301)
  1>at org.apache.lucene.util.TestUtil.checkIndex(TestUtil.java:286)
  1>at 
org.apache.lucene.store.BaseDirectoryWrapper.close(BaseDirectoryWrapper.java:45)
  1>at org.apache.lucene.util.IOUtils.close(IOUtils.java:89)
  1>at org.apache.lucene.util.IOUtils.close(IOUtils.java:77)
  1>at 
org.apache.lucene.index.TestIndexManyDocuments.test(TestIndexManyDocuments.java:69)
  1>at 
java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
  1>at 
java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
  1>at 
java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
  1>at java.base/java.lang.reflect.Method.invoke(Method.java:566)
  1>at 
com.carrotsearch.randomizedtesting.RandomizedRunner.invoke(RandomizedRunner.java:1754)
  1>at 
com.carrotsearch.randomizedtesting.RandomizedRunner$8.evaluate(RandomizedRunner.java:942)
  1>at 
com.carrotsearch.randomizedtesting.RandomizedRunner$9.evaluate(RandomizedRunner.java:978)
  1>at 
com.carrotsearch.randomizedtesting.RandomizedRunner$10.evaluate(RandomizedRunner.java:992)
  1>at 
org.apache.lucene.util.TestRuleSetupTeardownChained$1.evaluate(TestRuleSetupTeardownChained.java:49)
  1>at 
org.apache.lucene.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:45)
  1>at 
org.apache.lucene.util.TestRuleThreadAndTestName$1.evaluate(TestRuleThreadAndTestName.java:48)
  1>at 
org.apache.lucene.util.TestRuleIgnoreAfterMaxFailures$1.evaluate(TestRuleIgnoreAfterMaxFailures.java:64)
  1>at 
org.apache.lucene.util.TestRuleMarkFailure$1.evaluate(TestRuleMarkFailure.java:47)
  1>at 
com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
  1>at 
com.carrotsearch.randomizedtesting.ThreadLeakControl$StatementRunner.run(ThreadLeakControl.java:370)
  1>at 
com.carrotsearch.randomizedtesting.ThreadLeakControl.forkTimeoutingTask(ThreadLeakControl.java:819)
  1>at 

[jira] [Commented] (SOLR-14396) TaggerRequestHandler Should Not Error on Empty Collection

2020-04-08 Thread Trey Grainger (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-14396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17078620#comment-17078620
 ] 

Trey Grainger commented on SOLR-14396:
--

Cool. I'll take a stab at changing the behavior. If you have a suggestion for 
something specific to return in the header in this case (a warning of some 
sort), I'm happy to do that while I'm in there.
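
A rough sketch of the proposed change, for illustration only (the response keys and the surrounding handler code are approximated here, not the actual patch):

{code:java}
// Illustrative sketch: if the tagged field has no indexed terms (e.g. an empty
// collection), return a normal response with zero tags instead of throwing a 400.
Terms terms = searcher.getSlowAtomicReader().terms(indexedField);
if (terms == null) {
  rsp.add("tagsCount", 0);
  rsp.add("tags", new ArrayList<>());   // no matches; not an error
  return;
}
// ... existing tagging logic runs only when the field has terms ...
{code}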

> TaggerRequestHandler Should Not Error on Empty Collection
> -
>
> Key: SOLR-14396
> URL: https://issues.apache.org/jira/browse/SOLR-14396
> Project: Solr
>  Issue Type: Improvement
>  Security Level: Public(Default Security Level. Issues are Public) 
>Reporter: Trey Grainger
>Priority: Minor
>
> The TaggerRequestHandler (added in SOLR-12376) currently returns a 400 (Bad 
> Request) if used on a collection with no terms in the index. This probably 
> made sense for the use cases for which it was originally written (in the 
> OpenSextant project, before it was contributed to Solr) that focused on 
> stand-alone document tagging, where the calling application expected there to 
> always be an index.
> More and more use cases are emerging for using the TaggerRequestHandler in 
> real-time for user queries, however. For example, real-time phrase matching 
> and entity resolution in queries. In these cases, the data in the tagger 
> collection may be dynamically updated, and at times, the collection may even 
> be empty.
> While it's certainly possible for the 400 error to be handled client-side for 
> empty collections, the incoming requests aren't really "bad" requests in my 
> opinion, the index just doesn't have any data yet. Sending the same request 
> subsequently once some documents are indexed would result in a success.
> I'm proposing we remove the exception for empty indexes and simply return no 
> matched tags instead.
> If it's important for anyone to preserve the current behavior, we could add a 
> parameter "errorOnEmptyCollection". Does anyone think preserving the error 
> here is needed? What say you [~dsmiley]?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (SOLR-14396) TaggerRequestHandler Should Not Error on Empty Collection

2020-04-08 Thread David Smiley (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-14396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17078602#comment-17078602
 ] 

David Smiley commented on SOLR-14396:
-

Woah; I think it's bad this throws an exception on an empty index -- no need to 
convince me of that.  I can see why I added this to begin with, however -- 
imagine you typo a dynamic field or something like that.  Still, that scenario 
is not specific to the tagger; doing plain old search on a field with no data 
does not and should not result in warnings or errors.  We don't need a 
back-compat parameter to toggle this.

It'd be useful during Solr development / ad-hoc queries if Solr had a response 
header notice that informed you that certain fields used by the request have no 
data.  That'd be useful Solr-wide, not just specifically this handler, but also 
a pain to do as I think about it.  Hmm.

> TaggerRequestHandler Should Not Error on Empty Collection
> -
>
> Key: SOLR-14396
> URL: https://issues.apache.org/jira/browse/SOLR-14396
> Project: Solr
>  Issue Type: Improvement
>  Security Level: Public(Default Security Level. Issues are Public) 
>Reporter: Trey Grainger
>Priority: Minor
>
> The TaggerRequestHandler (added in SOLR-12376) currently returns a 400 (Bad 
> Request) if used on a collection with no terms in the index. This probably 
> made sense for the use cases for which it was originally written (in the 
> OpenSextant project, before it was contributed to Solr) that focused on 
> stand-alone document tagging, where the calling application expected there to 
> always be an index.
> More and more use cases are emerging for using the TaggerRequestHandler in 
> real-time for user queries, however. For example, real-time phrase matching 
> and entity resolution in queries. In these cases, the data in the tagger 
> collection may be dynamically updated, and at times, the collection may even 
> be empty.
> While it's certainly possible for the 400 error to be handled client-side for 
> empty collections, the incoming requests aren't really "bad" requests in my 
> opinion, the index just doesn't have any data yet. Sending the same request 
> subsequently once some documents are indexed would result in a success.
> I'm proposing we remove the exception for empty indexes and simply return no 
> matched tags instead.
> If it's important for anyone to preserve the current behavior, we could add a 
> parameter "errorOnEmptyCollection". Does anyone think preserving the error 
> here is needed? What say you [~dsmiley]?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (SOLR-13132) Improve JSON "terms" facet performance when sorted by relatedness

2020-04-08 Thread Michael Gibney (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-13132?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17078567#comment-17078567
 ] 

Michael Gibney commented on SOLR-13132:
---

This is great, thanks! It's taking me a little longer than expected to sharpen 
focus around exactly how to do this, but it's going steadily at this point, and 
I think it will address all the concerns you mentioned in earlier comments. 
More soon ...

> Improve JSON "terms" facet performance when sorted by relatedness 
> --
>
> Key: SOLR-13132
> URL: https://issues.apache.org/jira/browse/SOLR-13132
> Project: Solr
>  Issue Type: Improvement
>  Components: Facet Module
>Affects Versions: 7.4, master (9.0)
>Reporter: Michael Gibney
>Priority: Major
> Attachments: SOLR-13132-with-cache-01.patch, 
> SOLR-13132-with-cache.patch, SOLR-13132.patch, SOLR-13132_testSweep.patch
>
>  Time Spent: 1.5h
>  Remaining Estimate: 0h
>
> When sorting buckets by {{relatedness}}, JSON "terms" facet must calculate 
> {{relatedness}} for every term. 
> The current implementation uses a standard uninverted approach (either 
> {{docValues}} or {{UnInvertedField}}) to get facet counts over the domain 
> base docSet, and then uses that initial pass as a pre-filter for a 
> second-pass, inverted approach of fetching docSets for each relevant term 
> (i.e., {{count > minCount}}?) and calculating intersection size of those sets 
> with the domain base docSet.
> Over high-cardinality fields, the overhead of per-term docSet creation and 
> set intersection operations increases request latency to the point where 
> relatedness sort may not be usable in practice (for my use case, even after 
> applying the patch for SOLR-13108, for a field with ~220k unique terms per 
> core, QTime for high-cardinality domain docSets were, e.g.: cardinality 
> 1816684=9000ms, cardinality 5032902=18000ms).
> The attached patch brings the above example QTimes down to a manageable 
> ~300ms and ~250ms respectively. The approach calculates uninverted facet 
> counts over domain base, foreground, and background docSets in parallel in a 
> single pass. This allows us to take advantage of the efficiencies built into 
> the standard uninverted {{FacetFieldProcessorByArray[DV|UIF]}}), and avoids 
> the per-term docSet creation and set intersection overhead.
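
A conceptual sketch of the single-pass approach described above (all names are illustrative and hypothetical; this is not the actual facet module code):

{code:java}
// Illustrative sketch: accumulate per-term counts for the base, foreground and
// background doc sets in one sweep over the field's ordinals, so relatedness can
// be computed from three count arrays with no per-term DocSet intersections.
int[] baseCounts = new int[numTerms];
int[] foreCounts = new int[numTerms];
int[] backCounts = new int[numTerms];

for (int doc = allDocs.nextDoc(); doc != DocIdSetIterator.NO_MORE_DOCS; doc = allDocs.nextDoc()) {
  int ord = ordForDoc(doc);            // hypothetical docValues lookup
  if (ord < 0) continue;               // doc has no value for the facet field
  if (backSet.exists(doc)) backCounts[ord]++;
  if (foreSet.exists(doc)) foreCounts[ord]++;
  if (baseSet.exists(doc)) baseCounts[ord]++;
}
// relatedness for each term is then derived from foreCounts[ord], backCounts[ord]
// and the set sizes, with no second inverted pass over the index.
{code}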



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Created] (LUCENE-9311) IntelliJ import attempts to compile solr-ref-guide tools/ and fails

2020-04-08 Thread Dawid Weiss (Jira)
Dawid Weiss created LUCENE-9311:
---

 Summary: IntelliJ import attempts to compile solr-ref-guide tools/ 
and fails
 Key: LUCENE-9311
 URL: https://issues.apache.org/jira/browse/LUCENE-9311
 Project: Lucene - Core
  Issue Type: Sub-task
Reporter: Dawid Weiss


This used to work but now doesn't. Don't know why (we exclude customized ant 
tasks but IntelliJ doesn't seem to pick this up).



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Resolved] (LUCENE-9310) IntelliJ attempts to resolve provider property in jar manifest configuration and fails during project import

2020-04-08 Thread Dawid Weiss (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-9310?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dawid Weiss resolved LUCENE-9310.
-
Fix Version/s: master (9.0)
   Resolution: Fixed

> IntelliJ attempts to resolve provider property in jar manifest configuration 
> and fails during project import
> 
>
> Key: LUCENE-9310
> URL: https://issues.apache.org/jira/browse/LUCENE-9310
> Project: Lucene - Core
>  Issue Type: Task
>Reporter: Dawid Weiss
>Assignee: Dawid Weiss
>Priority: Major
> Fix For: master (9.0)
>
>
> It shouldn't be the case but it is. I don't know why.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-9286) FST arc.copyOf clones BitTables and this can lead to excessive memory use

2020-04-08 Thread Dawid Weiss (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9286?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17078564#comment-17078564
 ] 

Dawid Weiss commented on LUCENE-9286:
-

My repro was a test... and I ran it with assertions enabled. Sorry for the 
confusion this might have caused! These assertions are *very* costly - can we 
tone them down just a little bit?

> FST arc.copyOf clones BitTables and this can lead to excessive memory use
> -
>
> Key: LUCENE-9286
> URL: https://issues.apache.org/jira/browse/LUCENE-9286
> Project: Lucene - Core
>  Issue Type: Bug
>Affects Versions: 8.5
>Reporter: Dawid Weiss
>Assignee: Bruno Roustant
>Priority: Major
> Attachments: screen-[1].png
>
>  Time Spent: 1h 10m
>  Remaining Estimate: 0h
>
> I see a dramatic increase in the amount of memory required for construction 
> of (arguably large) automata. It currently OOMs with 8GB of memory consumed 
> for bit tables. I am pretty sure this didn't require so much memory before 
> (the automaton is ~50MB after construction).
> Something bad happened in between. Thoughts, [~broustant], [~sokolov]?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-9286) FST arc.copyOf clones BitTables and this can lead to excessive memory use

2020-04-08 Thread Dawid Weiss (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9286?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17078537#comment-17078537
 ] 

Dawid Weiss commented on LUCENE-9286:
-

Now... this is a head scratcher. I get this on your test code (same fst):
{code}
FST construction (oversizingFactor=1.0)
time = 1753 ms
FST RAM = 54945816 B
FST enum
time = 323 ms
{code}

I'll get to the bottom of this difference, give me some time please.

> FST arc.copyOf clones BitTables and this can lead to excessive memory use
> -
>
> Key: LUCENE-9286
> URL: https://issues.apache.org/jira/browse/LUCENE-9286
> Project: Lucene - Core
>  Issue Type: Bug
>Affects Versions: 8.5
>Reporter: Dawid Weiss
>Assignee: Bruno Roustant
>Priority: Major
> Attachments: screen-[1].png
>
>  Time Spent: 1h 10m
>  Remaining Estimate: 0h
>
> I see a dramatic increase in the amount of memory required for construction 
> of (arguably large) automata. It currently OOMs with 8GB of memory consumed 
> for bit tables. I am pretty sure this didn't require so much memory before 
> (the automaton is ~50MB after construction).
> Something bad happened in between. Thoughts, [~broustant], [~sokolov]?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-9310) IntelliJ attempts to resolve provider property in jar manifest configuration and fails during project import

2020-04-08 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9310?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17078527#comment-17078527
 ] 

ASF subversion and git services commented on LUCENE-9310:
-

Commit dbb4be1ca93607c2555fe8b2b2cb3318be582edb in lucene-solr's branch 
refs/heads/master from Dawid Weiss
[ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=dbb4be1 ]

LUCENE-9310: workaround for IntelliJ gradle import


> IntelliJ attempts to resolve provider property in jar manifest configuration 
> and fails during project import
> 
>
> Key: LUCENE-9310
> URL: https://issues.apache.org/jira/browse/LUCENE-9310
> Project: Lucene - Core
>  Issue Type: Task
>Reporter: Dawid Weiss
>Assignee: Dawid Weiss
>Priority: Major
>
> It shouldn't be the case but it is. I don't know why.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Created] (SOLR-14396) TaggerRequestHandler Should Not Error on Empty Collection

2020-04-08 Thread Trey Grainger (Jira)
Trey Grainger created SOLR-14396:


 Summary: TaggerRequestHandler Should Not Error on Empty Collection
 Key: SOLR-14396
 URL: https://issues.apache.org/jira/browse/SOLR-14396
 Project: Solr
  Issue Type: Improvement
  Security Level: Public (Default Security Level. Issues are Public)
Reporter: Trey Grainger


The TaggerRequestHandler (added in SOLR-12376) currently returns a 400 (Bad 
Request) if used on a collection with no terms in the index. This probably made 
sense for the use cases for which it was originally written (in the OpenSextant 
project, before it was contributed to Solr) that focused on stand-alone 
document tagging, where the calling application expected there to always be an 
index.

More and more use cases are emerging for using the TaggerRequestHandler in 
real-time for user queries, however. For example, real-time phrase matching and 
entity resolution in queries. In these cases, the data in the tagger collection 
may be dynamically updated, and at times, the collection may even be empty.

While it's certainly possible for the 400 error to be handled client-side for 
empty collections, the incoming requests aren't really "bad" requests in my 
opinion, the index just doesn't have any data yet. Sending the same request 
subsequently once some documents are indexed would result in a success.

I'm proposing we remove the exception for empty indexes and simply return no 
matched tags instead.

If it's important for anyone to preserve the current behavior, we could add a 
parameter "errorOnEmptyCollection". Does anyone think preserving the error here 
is needed? What say you [~dsmiley]?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Updated] (LUCENE-9310) IntelliJ attempts to resolve provider property in jar manifest configuration and fails during project import

2020-04-08 Thread Dawid Weiss (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-9310?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dawid Weiss updated LUCENE-9310:

Summary: IntelliJ attempts to resolve provider property in jar manifest 
configuration and fails during project import  (was: IntelliJ attempts to 
resolve provider property in jar manifest configuration and fails)

> IntelliJ attempts to resolve provider property in jar manifest configuration 
> and fails during project import
> 
>
> Key: LUCENE-9310
> URL: https://issues.apache.org/jira/browse/LUCENE-9310
> Project: Lucene - Core
>  Issue Type: Task
>Reporter: Dawid Weiss
>Assignee: Dawid Weiss
>Priority: Major
>
> It shouldn't be the case but it is. I don't know why.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Created] (LUCENE-9310) IntelliJ attempts to resolve provider property in jar manifest configuration and fails

2020-04-08 Thread Dawid Weiss (Jira)
Dawid Weiss created LUCENE-9310:
---

 Summary: IntelliJ attempts to resolve provider property in jar 
manifest configuration and fails
 Key: LUCENE-9310
 URL: https://issues.apache.org/jira/browse/LUCENE-9310
 Project: Lucene - Core
  Issue Type: Task
Reporter: Dawid Weiss
Assignee: Dawid Weiss


It shouldn't be the case but it is. I don't know why.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (SOLR-14385) Add shard name and collection name to split histogram logs

2020-04-08 Thread Saatchi Bhalla (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-14385?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17078516#comment-17078516
 ] 

Saatchi Bhalla commented on SOLR-14385:
---

Hadn't realized that access to MDC variables is actually defined in the 
log4j2.xml files, which include both collection and shard. Thanks for pointing 
that out, Yonik, and for your help, David. Megan and I synced up offline and 
realized that the missing data is specific to our Solr fork, so we can close 
out this JIRA. 

> Add shard name and collection name to split histogram logs
> --
>
> Key: SOLR-14385
> URL: https://issues.apache.org/jira/browse/SOLR-14385
> Project: Solr
>  Issue Type: Improvement
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: SolrCloud
>Reporter: Saatchi Bhalla
>Priority: Trivial
>  Time Spent: 1h
>  Remaining Estimate: 0h
>
> Using shard name from MDC to include in split histogram logs. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Updated] (SOLR-13132) Improve JSON "terms" facet performance when sorted by relatedness

2020-04-08 Thread Chris M. Hostetter (Jira)


 [ 
https://issues.apache.org/jira/browse/SOLR-13132?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris M. Hostetter updated SOLR-13132:
--
Attachment: SOLR-13132_testSweep.patch
Status: Open  (was: Open)


bq. ... In fact, otherAccs and resort, being likely to generate more DocSet 
lookups than refinement, make it all the more important that SKGSlotAcc respect 
cacheDf to control filterCache usage, no?

Off the top of my head, I'm not certain that they will involve _more_ lookups 
than refinement, but it certainly seems like if it's useful for refinement, it 
would also be useful for those cases as well.

bq. ... I plan to work through them in the next day or two and address any 
questions as they come up.

Sweet.

I went ahead and started working on an "equivalence testing" patch to try and 
help definitively prove that using {{sweep: true}} or {{sweep: false}} produces 
the same results on otherwise equivalent (randomly generated) facet requests.  
I'm attaching that as {{SOLR-13132_testSweep.patch}}.  The big missing piece 
here is a stubbed out "whitebox" test (see nocommits) that uses the debug output 
to "prove" that sweep collection is actually being used when/if expected based 
on the {{sweep}} param (and effective processor).

* As is on master this test passes (because nothing looks for a {{sweep}} param, 
so it's just comparing queries with themselves).
* When modifying this patch to use {{disable_sweep_collection}} it passed 
reliably from what I could tell.

...once your major changes to the impl are done, we'll probably want more 
changes to this test to help tickle "edge code paths" once we have a better 
handle on what they are. For instance: right now there is only one sweep 
{{relatedness()}} function per facet, but I'm pretty sure testing multiple 
sweep aggs in a single query, and mixing in some non-sweep functions, will be 
important for code coverage.



> Improve JSON "terms" facet performance when sorted by relatedness 
> --
>
> Key: SOLR-13132
> URL: https://issues.apache.org/jira/browse/SOLR-13132
> Project: Solr
>  Issue Type: Improvement
>  Components: Facet Module
>Affects Versions: 7.4, master (9.0)
>Reporter: Michael Gibney
>Priority: Major
> Attachments: SOLR-13132-with-cache-01.patch, 
> SOLR-13132-with-cache.patch, SOLR-13132.patch, SOLR-13132_testSweep.patch
>
>  Time Spent: 1.5h
>  Remaining Estimate: 0h
>
> When sorting buckets by {{relatedness}}, JSON "terms" facet must calculate 
> {{relatedness}} for every term. 
> The current implementation uses a standard uninverted approach (either 
> {{docValues}} or {{UnInvertedField}}) to get facet counts over the domain 
> base docSet, and then uses that initial pass as a pre-filter for a 
> second-pass, inverted approach of fetching docSets for each relevant term 
> (i.e., {{count > minCount}}?) and calculating intersection size of those sets 
> with the domain base docSet.
> Over high-cardinality fields, the overhead of per-term docSet creation and 
> set intersection operations increases request latency to the point where 
> relatedness sort may not be usable in practice (for my use case, even after 
> applying the patch for SOLR-13108, for a field with ~220k unique terms per 
> core, QTime for high-cardinality domain docSets were, e.g.: cardinality 
> 1816684=9000ms, cardinality 5032902=18000ms).
> The attached patch brings the above example QTimes down to a manageable 
> ~300ms and ~250ms respectively. The approach calculates uninverted facet 
> counts over domain base, foreground, and background docSets in parallel in a 
> single pass. This allows us to take advantage of the efficiencies built into 
> the standard uninverted {{FacetFieldProcessorByArray[DV|UIF]}}), and avoids 
> the per-term docSet creation and set intersection overhead.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene-solr] dweiss commented on a change in pull request #1388: LUCENE-9278: Use -linkoffline instead of relative paths to make links to other projects

2020-04-08 Thread GitBox
dweiss commented on a change in pull request #1388: LUCENE-9278: Use 
-linkoffline instead of relative paths to make links to other projects
URL: https://github.com/apache/lucene-solr/pull/1388#discussion_r405684724
 
 

 ##
 File path: gradle/render-javadoc.gradle
 ##
 @@ -15,93 +15,105 @@
  * limitations under the License.
  */
 
-// generate javadocs by using Ant javadoc task
+// generate javadocs by calling javadoc tool
+// see https://docs.oracle.com/en/java/javase/11/tools/javadoc.html
+
+// utility function to convert project path to document output dir
+// e.g.: ':lucene:analysis:common' => 'analysis/common'
+def pathToDocdir = { path -> path.split(':').drop(2).join('/') }
 
 allprojects {
   plugins.withType(JavaPlugin) {
-ext {
-  javadocRoot = project.path.startsWith(':lucene') ? 
project(':lucene').file("build/docs") : project(':solr').file("build/docs")
-  javadocDestDir = "${javadocRoot}/${project.name}"
-}
-
 task renderJavadoc {
-  description "Generates Javadoc API documentation for the main source 
code. This invokes Ant Javadoc Task."
+  description "Generates Javadoc API documentation for the main source 
code. This directly invokes javadoc tool."
   group "documentation"
 
   ext {
-linksource = "no"
+linksource = false
 linkJUnit = false
-linkHref = []
+linkLuceneProjects = []
+linkSorlProjects = []
   }
 
   dependsOn sourceSets.main.compileClasspath
 
   inputs.files { sourceSets.main.java.asFileTree }
-  outputs.dir project.javadocRoot
+  outputs.dir project.javadoc.destinationDir
 
   def libName = project.path.startsWith(":lucene") ? "Lucene" : "Solr"
   def title = "${libName} ${project.version} ${project.name} 
API".toString()
 
+  // absolute urls for "-linkoffline" option
+  def javaSEDocUrl = "https://docs.oracle.com/en/java/javase/11/docs/api/;
+  def junitDocUrl = "https://junit.org/junit4/javadoc/4.12/;
+  def luceneDocUrl = 
"https://lucene.apache.org/core/${project.version.replace(".", "_")}".toString()
+  def solrDocUrl = 
"https://lucene.apache.org/solr/${project.version.replace(".", "_")}".toString()
+
+  def javadocCmd = 
org.gradle.internal.jvm.Jvm.current().getJavadocExecutable()
 
 Review comment:
   Thanks for looking into this, Tomoko. We may have to do something similar to 
what ES does since we want to be able to run javac, javadocs and tests against 
new JVMs (which gradle itself may not support yet). It's a different issue 
though and it can certainly wait.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (SOLR-14210) Add replica state option for HealthCheckHandler

2020-04-08 Thread Cassandra Targett (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-14210?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17078500#comment-17078500
 ] 

Cassandra Targett commented on SOLR-14210:
--

bq. The "Class & Javadocs" column of the table already provides a link to the 
Javadocs of that class.

Ah OK, I didn't look at it in the context of the whole page - no need to 
duplicate it then IMO.

> Add replica state option for HealthCheckHandler
> ---
>
> Key: SOLR-14210
> URL: https://issues.apache.org/jira/browse/SOLR-14210
> Project: Solr
>  Issue Type: Improvement
>  Security Level: Public(Default Security Level. Issues are Public) 
>Affects Versions: 8.5
>Reporter: Houston Putman
>Assignee: Jan Høydahl
>Priority: Major
> Fix For: 8.6
>
> Attachments: docs.patch
>
>  Time Spent: 4.5h
>  Remaining Estimate: 0h
>
> h2. Background
> As was brought up in SOLR-13055, in order to run Solr in a more cloud-native 
> way, we need some additional features around node-level healthchecks.
> {quote}Like in Kubernetes we need 'liveliness' and 'readiness' probe 
> explained in 
> [https://kubernetes.io/docs/tasks/configure-pod-container/configure-liveness-readiness-probes/n]
>  determine if a node is live and ready to serve live traffic.
> {quote}
>  
> However there are issues around kubernetes managing it's own rolling 
> restarts. With the current healthcheck setup, it's easy to envision a 
> scenario in which Solr reports itself as "healthy" when all of its replicas 
> are actually recovering. Therefore kubernetes, seeing a healthy pod would 
> then go and restart the next Solr node. This can happen until all replicas 
> are "recovering" and none are healthy. (maybe the last one restarted will be 
> "down", but still there are no "active" replicas)
> h2. Proposal
> I propose we make an additional healthcheck handler that returns whether all 
> replicas hosted by that Solr node are healthy and "active". That way we will 
> be able to use the [default kubernetes rolling restart 
> logic|https://kubernetes.io/docs/concepts/workloads/controllers/statefulset/#update-strategies]
>  with Solr.
> To add on to [Jan's point 
> here|https://issues.apache.org/jira/browse/SOLR-13055?focusedCommentId=16716559=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-16716559],
>  this handler should be more friendly for other Content-Types and should use 
> bettter HTTP response statuses.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene-solr] dweiss commented on a change in pull request #1416: LUCENE-9286: FST.Arc.BitTable is read directly from the FST bytes.

2020-04-08 Thread GitBox
dweiss commented on a change in pull request #1416: LUCENE-9286: 
FST.Arc.BitTable is read directly from the FST bytes.
URL: https://github.com/apache/lucene-solr/pull/1416#discussion_r405681455
 
 

 ##
 File path: lucene/core/src/java/org/apache/lucene/util/fst/BitTableUtil.java
 ##
 @@ -0,0 +1,200 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.lucene.util.fst;
+
+import java.io.IOException;
+
+/**
+ * Static helper methods for {@link FST.Arc.BitTable}.
+ *
+ * @lucene.experimental
+ */
+class BitTableUtil {
+
+  /**
+   * Returns whether the bit at given zero-based index is set.
+   * Example: bitIndex 10 means the third bit on the right of the second 
byte.
+   *
+   * @param bitIndex The bit zero-based index. It must be greater than or 
equal to 0, and strictly less than
+   * {@code number of bit-table bytes * Byte.SIZE}.
+   * @param reader   The {@link FST.BytesReader} to read. It must be 
positioned at the beginning of the bit-table.
+   */
+  static boolean isBitSet(int bitIndex, FST.BytesReader reader) throws 
IOException {
+assert bitIndex >= 0 : "bitIndex=" + bitIndex;
+reader.skipBytes(bitIndex >> 3);
+return (readByte(reader) & (1L << (bitIndex & (Byte.SIZE - 1)))) != 0;
+  }
+
+
+  /**
+   * Counts all bits set in the bit-table.
+   *
+   * @param bitTableBytes The number of bytes in the bit-table.
+   * @param readerThe {@link FST.BytesReader} to read. It must be 
positioned at the beginning of the bit-table.
+   */
+  static int countBits(int bitTableBytes, FST.BytesReader reader) throws 
IOException {
+assert bitTableBytes >= 0 : "bitTableBytes=" + bitTableBytes;
 
 Review comment:
   I noticed the byte order difference (note the "for bitcounts" bit). :) My 
gut feeling is that pushing reads so that they're aggregated first, followed by 
a bitcount will still give you a performance improvement. A bit shift and a bit 
mask is probably dwarfed when hotspot compiles and inlines all this but 
single-byte get() methods with conditionals inside will typically perform worse 
than a bulk get. 
   
   This is a scholarly discussion as things will very likely vary from machine 
to machine and even between hotspot runs, depending on the calling code layout. 
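
   A small sketch of the "aggregate reads first, then bitcount" idea, for illustration only (not the committed implementation):

```java
// Illustrative sketch: assemble eight bytes into a long and let Long.bitCount do
// the counting, instead of per-byte reads with conditionals; byte order does not
// matter for a popcount.
static int countBits(int bitTableBytes, FST.BytesReader reader) throws IOException {
  int bits = 0;
  int i = 0;
  for (; i + Long.BYTES <= bitTableBytes; i += Long.BYTES) {
    long word = 0;
    for (int b = 0; b < Long.BYTES; b++) {
      word |= (reader.readByte() & 0xFFL) << (b << 3);   // bulk-assemble one word
    }
    bits += Long.bitCount(word);
  }
  for (; i < bitTableBytes; i++) {                        // remaining tail bytes
    bits += Integer.bitCount(reader.readByte() & 0xFF);
  }
  return bits;
}
```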


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-9286) FST arc.copyOf clones BitTables and this can lead to excessive memory use

2020-04-08 Thread Dawid Weiss (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9286?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17078491#comment-17078491
 ] 

Dawid Weiss commented on LUCENE-9286:
-

Let me double check and get back to you. Sorry for the delays, lots of things 
going on at home when you're locked up with three kids.

> FST arc.copyOf clones BitTables and this can lead to excessive memory use
> -
>
> Key: LUCENE-9286
> URL: https://issues.apache.org/jira/browse/LUCENE-9286
> Project: Lucene - Core
>  Issue Type: Bug
>Affects Versions: 8.5
>Reporter: Dawid Weiss
>Assignee: Bruno Roustant
>Priority: Major
> Attachments: screen-[1].png
>
>  Time Spent: 1h
>  Remaining Estimate: 0h
>
> I see a dramatic increase in the amount of memory required for construction 
> of (arguably large) automata. It currently OOMs with 8GB of memory consumed 
> for bit tables. I am pretty sure this didn't require so much memory before 
> (the automaton is ~50MB after construction).
> Something bad happened in between. Thoughts, [~broustant], [~sokolov]?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (SOLR-13979) Expose separate metrics for distributed and non-distributed requests

2020-04-08 Thread David Smiley (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-13979?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17078489#comment-17078489
 ] 

David Smiley commented on SOLR-13979:
-

Also, distributed search long predated SolrCloud, so why should this be a 
SolrCloud-dependent switch?

> Expose separate metrics for distributed and non-distributed requests
> 
>
> Key: SOLR-13979
> URL: https://issues.apache.org/jira/browse/SOLR-13979
> Project: Solr
>  Issue Type: Bug
>  Components: metrics
>Reporter: Shalin Shekhar Mangar
>Assignee: Andrzej Bialecki
>Priority: Major
> Fix For: 8.4
>
> Attachments: SOLR-13979.patch
>
>
> Currently we expose metrics such as count, rate and latency on a per handler 
> level however for search requests there is no distinction made for distrib vs 
> non-distrib requests. This means that there is no way to find the count, rate 
> or latency of only user-sent queries.
> I propose that we expose distrib vs non-distrib metrics separately.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (SOLR-13979) Expose separate metrics for distributed and non-distributed requests

2020-04-08 Thread David Smiley (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-13979?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17078488#comment-17078488
 ] 

David Smiley commented on SOLR-13979:
-

bq. These metrics are added to all handlers that inherit from 
RequestHandlerBase. This may be probably a bit too many

Yes, it's too many; distrib is implemented by SearchHandler.  I feel that 
defaulting to tracking distrib metrics on the vast majority of request handlers 
that are not SearchHandlers pollutes Solr metrics with junk.

> Expose separate metrics for distributed and non-distributed requests
> 
>
> Key: SOLR-13979
> URL: https://issues.apache.org/jira/browse/SOLR-13979
> Project: Solr
>  Issue Type: Bug
>  Components: metrics
>Reporter: Shalin Shekhar Mangar
>Assignee: Andrzej Bialecki
>Priority: Major
> Fix For: 8.4
>
> Attachments: SOLR-13979.patch
>
>
> Currently we expose metrics such as count, rate and latency at a per-handler 
> level; however, for search requests there is no distinction made between distrib 
> and non-distrib requests. This means that there is no way to find the count, rate 
> or latency of only user-sent queries.
> I propose that we expose distrib vs non-distrib metrics separately.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene-solr] bringyou commented on issue #1389: LUCENE-9298: fix clearDeletedDocIds in BufferedUpdates

2020-04-08 Thread GitBox
bringyou commented on issue #1389: LUCENE-9298: fix clearDeletedDocIds in 
BufferedUpdates
URL: https://github.com/apache/lucene-solr/pull/1389#issuecomment-611049924
 
 
   > Change looks good to me. Would you mind adding a small test for this 
issue? Thanks @bringyou!
   
   Sorry for the delay~ I added a test for `BufferedUpdates` and changed a bit more 
code; please take another look @dnhatn 


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Comment Edited] (LUCENE-9286) FST arc.copyOf clones BitTables and this can lead to excessive memory use

2020-04-08 Thread Bruno Roustant (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9286?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17078431#comment-17078431
 ] 

Bruno Roustant edited comment on LUCENE-9286 at 4/8/20, 4:10 PM:
-

That's strange. In the PR I integrated your code to recompile and walk the FST. 
See TestFSTDirectAddressing.main() with "-recompileAndWalk" as the first arg and 
the path to an FST as the second. I used the FST zip you provided, 
"fst-17291407798783309064.fst.gz".

Before the patch, I got roughly the same perf that you got on your side and 
shared previously. Then with the patch, I can verify that the perf is fixed:
{code:java}
Reading FST
time = 402 ms

FST construction (oversizingFactor=0.0)
time = 1302 ms
FST RAM = 56055936 B
FST enum
time = 322 ms

FST construction (oversizingFactor=1.0)
time = 1235 ms
FST RAM = 54945816 B
FST enum
time = 239 ms 
{code}
Can you run this TestFSTDirectAddressing.main()?

I ran it on the master branch. Should I run it on branch_8x to reproduce your env?
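
As a rough sketch (not taken from the PR), the invocation described above could 
look like the following; the class location and its main() argument handling are 
assumptions based on this comment, not verified code:
{code:java}
// Hypothetical driver for the recompile-and-walk check described above.
// Argument order ("-recompileAndWalk" then the FST path) follows the comment;
// the default path below is only a placeholder.
public class RecompileAndWalkDriver {
  public static void main(String[] args) throws Exception {
    String fstPath = args.length > 0 ? args[0] : "/tmp/fst-17291407798783309064.fst";
    org.apache.lucene.util.fst.TestFSTDirectAddressing.main(
        new String[] { "-recompileAndWalk", fstPath });
  }
}
{code}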


was (Author: broustant):
That's strange. In the PR I integrated your code to recompile and walk the FST. 
See TestFSTDirectAddressing.main() with a first arg "-recompileAndWalk" and a 
second the path to an FST. I used the FST zip you provided 
"fst-17291407798783309064.fst.gz".

Before the patch, I got roughly the same perf as you got on your side and that 
you shared previously. Then with the patch, I can verify that the perf is fixed:

 
{code:java}
Reading FST
time = 402 ms
FST construction (oversizingFactor=0.0)
time = 1302 ms
FST RAM = 56055936 B
FST enum
time = 322 ms
FST construction (oversizingFactor=1.0)
time = 1235 ms
FST RAM = 54945816 B
FST enum
time = 239 ms 
{code}
Can you run this TestFSTDirectAddressing.main()?

I run it on master branch. Should I run it on branch 8x to reproduce your env?

> FST arc.copyOf clones BitTables and this can lead to excessive memory use
> -
>
> Key: LUCENE-9286
> URL: https://issues.apache.org/jira/browse/LUCENE-9286
> Project: Lucene - Core
>  Issue Type: Bug
>Affects Versions: 8.5
>Reporter: Dawid Weiss
>Assignee: Bruno Roustant
>Priority: Major
> Attachments: screen-[1].png
>
>  Time Spent: 1h
>  Remaining Estimate: 0h
>
> I see a dramatic increase in the amount of memory required for construction 
> of (arguably large) automata. It currently OOMs with 8GB of memory consumed 
> for bit tables. I am pretty sure this didn't require so much memory before 
> (the automaton is ~50MB after construction).
> Something bad happened in between. Thoughts, [~broustant], [~sokolov]?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-9286) FST arc.copyOf clones BitTables and this can lead to excessive memory use

2020-04-08 Thread Bruno Roustant (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9286?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17078431#comment-17078431
 ] 

Bruno Roustant commented on LUCENE-9286:


That's strange. In the PR I integrated your code to recompile and walk the FST. 
See TestFSTDirectAddressing.main() with a first arg "-recompileAndWalk" and a 
second the path to an FST. I used the FST zip you provided 
"fst-17291407798783309064.fst.gz".

Before the patch, I got roughly the same perf as you got on your side and that 
you shared previously. Then with the patch, I can verify that the perf is fixed:

 
{code:java}
Reading FST
time = 402 ms
FST construction (oversizingFactor=0.0)
time = 1302 ms
FST RAM = 56055936 B
FST enum
time = 322 ms
FST construction (oversizingFactor=1.0)
time = 1235 ms
FST RAM = 54945816 B
FST enum
time = 239 ms 
{code}
Can you run this TestFSTDirectAddressing.main()?

I run it on master branch. Should I run it on branch 8x to reproduce your env?

> FST arc.copyOf clones BitTables and this can lead to excessive memory use
> -
>
> Key: LUCENE-9286
> URL: https://issues.apache.org/jira/browse/LUCENE-9286
> Project: Lucene - Core
>  Issue Type: Bug
>Affects Versions: 8.5
>Reporter: Dawid Weiss
>Assignee: Bruno Roustant
>Priority: Major
> Attachments: screen-[1].png
>
>  Time Spent: 1h
>  Remaining Estimate: 0h
>
> I see a dramatic increase in the amount of memory required for construction 
> of (arguably large) automata. It currently OOMs with 8GB of memory consumed 
> for bit tables. I am pretty sure this didn't require so much memory before 
> (the automaton is ~50MB after construction).
> Something bad happened in between. Thoughts, [~broustant], [~sokolov]?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene-solr] sigram opened a new pull request #1417: SOLR-12847: Auto-create a policy rule that corresponds to maxShardsPerNode

2020-04-08 Thread GitBox
sigram opened a new pull request #1417: SOLR-12847: Auto-create a policy rule 
that corresponds to maxShardsPerNode
URL: https://github.com/apache/lucene-solr/pull/1417
 
 
   
   
   
   
   # Description
   
   Please provide a short description of the changes you're making with this 
pull request.
   
   # Solution
   
   Please provide a short description of the approach taken to implement your 
solution.
   
   # Tests
   
   Please describe the tests you've developed or run to confirm this patch 
implements the feature or solves the problem.
   
   # Checklist
   
   Please review the following and check all that apply:
   
   - [ ] I have reviewed the guidelines for [How to 
Contribute](https://wiki.apache.org/solr/HowToContribute) and my code conforms 
to the standards described there to the best of my ability.
   - [ ] I have created a Jira issue and added the issue ID to my pull request 
title.
   - [ ] I have given Solr maintainers 
[access](https://help.github.com/en/articles/allowing-changes-to-a-pull-request-branch-created-from-a-fork)
 to contribute to my PR branch. (optional but recommended)
   - [ ] I have developed this patch against the `master` branch.
   - [ ] I have run `ant precommit` and the appropriate test suite.
   - [ ] I have added tests for my changes.
   - [ ] I have added documentation for the [Ref 
Guide](https://github.com/apache/lucene-solr/tree/master/solr/solr-ref-guide) 
(for Solr changes only).
   


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene-solr] mocobeta commented on a change in pull request #1388: LUCENE-9278: Use -linkoffline instead of relative paths to make links to other projects

2020-04-08 Thread GitBox
mocobeta commented on a change in pull request #1388: LUCENE-9278: Use 
-linkoffline instead of relative paths to make links to other projects
URL: https://github.com/apache/lucene-solr/pull/1388#discussion_r405606070
 
 

 ##
 File path: gradle/render-javadoc.gradle
 ##
 @@ -15,93 +15,105 @@
  * limitations under the License.
  */
 
-// generate javadocs by using Ant javadoc task
+// generate javadocs by calling javadoc tool
+// see https://docs.oracle.com/en/java/javase/11/tools/javadoc.html
+
+// utility function to convert project path to document output dir
+// e.g.: ':lucene:analysis:common' => 'analysis/common'
+def pathToDocdir = { path -> path.split(':').drop(2).join('/') }
 
 allprojects {
   plugins.withType(JavaPlugin) {
-ext {
-  javadocRoot = project.path.startsWith(':lucene') ? 
project(':lucene').file("build/docs") : project(':solr').file("build/docs")
-  javadocDestDir = "${javadocRoot}/${project.name}"
-}
-
 task renderJavadoc {
-  description "Generates Javadoc API documentation for the main source 
code. This invokes Ant Javadoc Task."
+  description "Generates Javadoc API documentation for the main source 
code. This directly invokes javadoc tool."
   group "documentation"
 
   ext {
-linksource = "no"
+linksource = false
 linkJUnit = false
-linkHref = []
+linkLuceneProjects = []
+linkSorlProjects = []
   }
 
   dependsOn sourceSets.main.compileClasspath
 
   inputs.files { sourceSets.main.java.asFileTree }
-  outputs.dir project.javadocRoot
+  outputs.dir project.javadoc.destinationDir
 
   def libName = project.path.startsWith(":lucene") ? "Lucene" : "Solr"
   def title = "${libName} ${project.version} ${project.name} 
API".toString()
 
+  // absolute urls for "-linkoffline" option
+  def javaSEDocUrl = "https://docs.oracle.com/en/java/javase/11/docs/api/"
+  def junitDocUrl = "https://junit.org/junit4/javadoc/4.12/"
+  def luceneDocUrl = 
"https://lucene.apache.org/core/${project.version.replace(".", "_")}".toString()
+  def solrDocUrl = 
"https://lucene.apache.org/solr/${project.version.replace(".", "_")}".toString()
+
+  def javadocCmd = 
org.gradle.internal.jvm.Jvm.current().getJavadocExecutable()
 
 Review comment:
   I am going to merge it to master branch since I think I understand what I 
did here with `org.gradle.internal.jvm.Jvm.current()`.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene-solr] bruno-roustant commented on issue #1395: SOLR-14365: CollapsingQParser - Avoiding always allocate int[] and float[] with size equals to number of unique values (WIP)

2020-04-08 Thread GitBox
bruno-roustant commented on issue #1395: SOLR-14365: CollapsingQParser - 
Avoiding always allocate int[] and float[] with size equals to number of unique 
values (WIP)
URL: https://github.com/apache/lucene-solr/pull/1395#issuecomment-611020236
 
 
   Yes, I missed the power of 2. So I'll just let you double check that this works 
without a wasteful map resize.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene-solr] mocobeta commented on a change in pull request #1388: LUCENE-9278: Use -linkoffline instead of relative paths to make links to other projects

2020-04-08 Thread GitBox
mocobeta commented on a change in pull request #1388: LUCENE-9278: Use 
-linkoffline instead of relative paths to make links to other projects
URL: https://github.com/apache/lucene-solr/pull/1388#discussion_r405603524
 
 

 ##
 File path: gradle/render-javadoc.gradle
 ##
 @@ -15,93 +15,105 @@
  * limitations under the License.
  */
 
-// generate javadocs by using Ant javadoc task
+// generate javadocs by calling javadoc tool
+// see https://docs.oracle.com/en/java/javase/11/tools/javadoc.html
+
+// utility function to convert project path to document output dir
+// e.g.: ':lucene:analysis:common' => 'analysis/common'
+def pathToDocdir = { path -> path.split(':').drop(2).join('/') }
 
 allprojects {
   plugins.withType(JavaPlugin) {
-ext {
-  javadocRoot = project.path.startsWith(':lucene') ? 
project(':lucene').file("build/docs") : project(':solr').file("build/docs")
-  javadocDestDir = "${javadocRoot}/${project.name}"
-}
-
 task renderJavadoc {
-  description "Generates Javadoc API documentation for the main source 
code. This invokes Ant Javadoc Task."
+  description "Generates Javadoc API documentation for the main source 
code. This directly invokes javadoc tool."
   group "documentation"
 
   ext {
-linksource = "no"
+linksource = false
 linkJUnit = false
-linkHref = []
+linkLuceneProjects = []
+linkSorlProjects = []
   }
 
   dependsOn sourceSets.main.compileClasspath
 
   inputs.files { sourceSets.main.java.asFileTree }
-  outputs.dir project.javadocRoot
+  outputs.dir project.javadoc.destinationDir
 
   def libName = project.path.startsWith(":lucene") ? "Lucene" : "Solr"
   def title = "${libName} ${project.version} ${project.name} 
API".toString()
 
+  // absolute urls for "-linkoffline" option
+  def javaSEDocUrl = "https://docs.oracle.com/en/java/javase/11/docs/api/"
+  def junitDocUrl = "https://junit.org/junit4/javadoc/4.12/"
+  def luceneDocUrl = 
"https://lucene.apache.org/core/${project.version.replace(".", "_")}".toString()
+  def solrDocUrl = 
"https://lucene.apache.org/solr/${project.version.replace(".", "_")}".toString()
+
+  def javadocCmd = 
org.gradle.internal.jvm.Jvm.current().getJavadocExecutable()
 
 Review comment:
   I did a few more experiments with 
`org.gradle.internal.jvm.Jvm.current()`, which is used for both compilation and 
test execution; the search path is:
   1. org.gradle.java.home
   2. $JAVA_HOME
   3. user's default java (on $PATH)
   
   It's consistent with their documentation.
   
   Elasticsearch's custom build plugin takes a completely different search 
strategy from Gradle's:
   1. "compiler.java" system property
   2. $JAVA_HOME
   3. org.gradle.java.home
   4. user's default java (on $PATH)
   
   (I didn't run it but just interpreted this method 
https://github.com/elastic/elasticsearch/blob/master/buildSrc/src/main/java/org/elasticsearch/gradle/info/GlobalBuildInfoPlugin.java#L209)
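
   A compact sketch of that Elasticsearch-style lookup order (see the list above); 
the property and environment names come from this comment, and the final fallback 
to the current JVM's `java.home` is an illustrative simplification of "user's 
default java (on $PATH)", not the plugin's actual code:
   ```java
   // Hedged sketch of the search order described above, in plain Java.
   final class JavaHomeLookupSketch {
     static String resolveJavaHome() {
       String fromProp = System.getProperty("compiler.java");          // 1. system property
       if (fromProp != null) return fromProp;
       String fromEnv = System.getenv("JAVA_HOME");                    // 2. $JAVA_HOME
       if (fromEnv != null) return fromEnv;
       String fromGradle = System.getProperty("org.gradle.java.home"); // 3. Gradle property
       if (fromGradle != null) return fromGradle;
       return System.getProperty("java.home");                         // 4. default java (simplified)
     }
   }
   ```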


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene-solr] CaoManhDat edited a comment on issue #1395: SOLR-14365: CollapsingQParser - Avoiding always allocate int[] and float[] with size equals to number of unique values (WIP)

2020-04-08 Thread GitBox
CaoManhDat edited a comment on issue #1395: SOLR-14365: CollapsingQParser - 
Avoiding always allocate int[] and float[] with size equals to number of unique 
values (WIP)
URL: https://github.com/apache/lucene-solr/pull/1395#issuecomment-610985910
 
 
   > The javadoc in the constructor is good, but I'm concerned that this util 
class may be used elsewhere without clearly reading/understanding the risk
   
   I'm not very concerned about this point. If we were going to put this class in 
guava or somewhere like that, I think it would be worth spending more time on 
documentation or getting the API right. But these classes will be used inside Solr, 
and whoever uses `DynamicMaps` must have a clear idea of what they are using 
(why they want DynamicMaps instead of hppc maps or java maps). I just 
want to avoid introducing more logic into these classes, since one change needs to 
be propagated and maintained in the others.
   
   > Did you test and debug when the DynamicMap upgrades from a map to an array 
internally? I mean in debug mode step by step here. I think the map first 
enlarges and rehashes just before the upgrade to an array.
   
   It seems that your calculation missed the part that arraySize must be a 
power of two and `initialCapacity` is not equal to `expectedElements`. 
   So I will go backward with
   `arraySize=1024` -> `mapExpectedElements=768 (arraySize = 
expectedElements/0.75)` -> `threshold = 768 * 0.75 = 576`
   -> `resizeAt=768 (arraySize * loadFactor)`
   I realized that `threshold < mapExpectedElements  <= resizeAt`.
   So we actually can compute maxExpectedElements = threshold - 2, right?


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene-solr] CaoManhDat edited a comment on issue #1395: SOLR-14365: CollapsingQParser - Avoiding always allocate int[] and float[] with size equals to number of unique values (WIP)

2020-04-08 Thread GitBox
CaoManhDat edited a comment on issue #1395: SOLR-14365: CollapsingQParser - 
Avoiding always allocate int[] and float[] with size equals to number of unique 
values (WIP)
URL: https://github.com/apache/lucene-solr/pull/1395#issuecomment-610985910
 
 
   > The javadoc in the constructor is good, but I'm concerned that this util 
class may be used elsewhere without clearly reading/understanding the risk
   
   I'm not very concern about this point much. If we gonna put this class in 
guava or some places like that I think it is worth to spend more time to 
documents or make the api right. But these classes will get used in Solr and 
when they use `DynamicMaps` they must have a clear idea of what they gonna use 
(why they want to use DynamicMaps insteads of hppc maps or java maps). I just 
want to avoid introduce more logic to these classes since one change needs to 
propagate and maintain in others.
   
   > Did you test and debug when the DynamicMap upgrades from a map to an array 
internally? I mean in debug mode step by step here. I think the map first 
enlarges and rehashes just before the upgrade to an array.
   
   It seems that your calculation missed the part that arraySize must be 
powerOfTwo and `initialCapacity` is not equals to `expectedElements`. 
   So I will go from backward with
   `arraySize=1024` -> `mapExpectedElements=768 (arraySize = 
expectedElements/0.75)` -> `threshold = 768 * 0.75 = 576`
   -> `resizeAt=768 (arraySize * loadFactor)`
   I realized that `threshold < mapExpectedElements  <= resizeAt`.
   So we actually can compute maxExpectedElements = threshold + 2, right?


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene-solr] CaoManhDat edited a comment on issue #1395: SOLR-14365: CollapsingQParser - Avoiding always allocate int[] and float[] with size equals to number of unique values (WIP)

2020-04-08 Thread GitBox
CaoManhDat edited a comment on issue #1395: SOLR-14365: CollapsingQParser - 
Avoiding always allocate int[] and float[] with size equals to number of unique 
values (WIP)
URL: https://github.com/apache/lucene-solr/pull/1395#issuecomment-610985910
 
 
   > The javadoc in the constructor is good, but I'm concerned that this util 
class may be used elsewhere without clearly reading/understanding the risk
   
   I'm not very concern about this point much. If we gonna put this class in 
guava or some places like that I think it is worth to spend more time to 
documents or make the api right. But these classes will get used in Solr and 
when they use `DynamicMaps` they must have a clear idea of what they gonna use 
(why they want to use DynamicMaps insteads of hppc maps or java maps). I just 
want to avoid introduce more logic to these classes since one change needs to 
propagate and maintain in others.
   
   > Did you test and debug when the DynamicMap upgrades from a map to an array 
internally? I mean in debug mode step by step here. I think the map first 
enlarges and rehashes just before the upgrade to an array.
   
   It seems that your calculation missed the part that arraySize must be 
powerOfTwo and `initialCapacity` is not equals to `expectedElements`. 
   So I will go from backward with
   `arraySize=1024` -> `mapExpectedElements=768 (arraySize = 
expectedElements/0.75)` -> `threshold = 768 * 0.75 = 576`
   -> `resizeAt=768 (arraySize * loadFactor)`
   so basically `threshold < mapExpectedElements  <= resizeAt`.
   We actually can compute maxExpectedElements = (int) threshold / 0.95.
   ( min value of threshold is 64 since we skipping using map when maxSize < 
2^12 )


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (SOLR-14210) Add replica state option for HealthCheckHandler

2020-04-08 Thread Jira


[ 
https://issues.apache.org/jira/browse/SOLR-14210?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17078326#comment-17078326
 ] 

Jan Høydahl commented on SOLR-14210:


bq. Why wouldn't you add the additional param that's in Javadocs to the Ref 
Guide?

I could not find other examples of http params on that page, and wanted to stay 
DRY and just link to the Javadocs.

bq. If it is better to have users review Javadocs instead of adding to the Ref 
Guide, perhaps you could make it a link instead of making them go find the 
Javadocs and then find the class? The link syntax would be like this:

The "Class & Javadocs" column of the table already provides a link to the 
Javadocs of that class. I could of course repeat the same link inline in that 
new paragraph for clarity.



> Add replica state option for HealthCheckHandler
> ---
>
> Key: SOLR-14210
> URL: https://issues.apache.org/jira/browse/SOLR-14210
> Project: Solr
>  Issue Type: Improvement
>  Security Level: Public(Default Security Level. Issues are Public) 
>Affects Versions: 8.5
>Reporter: Houston Putman
>Assignee: Jan Høydahl
>Priority: Major
> Fix For: 8.6
>
> Attachments: docs.patch
>
>  Time Spent: 4.5h
>  Remaining Estimate: 0h
>
> h2. Background
> As was brought up in SOLR-13055, in order to run Solr in a more cloud-native 
> way, we need some additional features around node-level healthchecks.
> {quote}Like in Kubernetes we need 'liveliness' and 'readiness' probe 
> explained in 
> [https://kubernetes.io/docs/tasks/configure-pod-container/configure-liveness-readiness-probes/n]
>  determine if a node is live and ready to serve live traffic.
> {quote}
>  
> However there are issues around kubernetes managing its own rolling 
> restarts. With the current healthcheck setup, it's easy to envision a 
> scenario in which Solr reports itself as "healthy" when all of its replicas 
> are actually recovering. Therefore kubernetes, seeing a healthy pod, would 
> then go and restart the next Solr node. This can happen until all replicas 
> are "recovering" and none are healthy. (Maybe the last one restarted will be 
> "down", but still there are no "active" replicas.)
> h2. Proposal
> I propose we make an additional healthcheck handler that returns whether all 
> replicas hosted by that Solr node are healthy and "active". That way we will 
> be able to use the [default kubernetes rolling restart 
> logic|https://kubernetes.io/docs/concepts/workloads/controllers/statefulset/#update-strategies]
>  with Solr.
> To add on to [Jan's point 
> here|https://issues.apache.org/jira/browse/SOLR-13055?focusedCommentId=16716559=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-16716559],
>  this handler should be more friendly for other Content-Types and should use 
> better HTTP response statuses.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene-solr] CaoManhDat edited a comment on issue #1395: SOLR-14365: CollapsingQParser - Avoiding always allocate int[] and float[] with size equals to number of unique values (WIP)

2020-04-08 Thread GitBox
CaoManhDat edited a comment on issue #1395: SOLR-14365: CollapsingQParser - 
Avoiding always allocate int[] and float[] with size equals to number of unique 
values (WIP)
URL: https://github.com/apache/lucene-solr/pull/1395#issuecomment-610985910
 
 
   > The javadoc in the constructor is good, but I'm concerned that this util 
class may be used elsewhere without clearly reading/understanding the risk
   
   I'm not very concern about this point much. If we gonna put this class in 
guava or some places like that I think it is worth to spend more time to 
documents or make the api right. But these classes will get used in Solr and 
when they use `DynamicMaps` they must have a clear idea of what they gonna use 
(why they want to use DynamicMaps insteads of hppc maps or java maps). I just 
want to avoid introduce more logic to these classes since one change needs to 
propagate and maintain in others.
   
   > Did you test and debug when the DynamicMap upgrades from a map to an array 
internally? I mean in debug mode step by step here. I think the map first 
enlarges and rehashes just before the upgrade to an array.
   
   It seems that your calculation missed the part that arraySize must be 
powerOfTwo and `initialCapacity` is not equals to `expectedElements`. 
   So I will go from backward with
   `arraySize=1024` -> `mapExpectedElements=768 (arraySize = 
expectedElements/0.75)` -> `threshold = 768 * 0.75 = 576`
   -> `resizeAt=768 (arraySize * loadFactor)`
   so we kinda safe here, right?


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene-solr] CaoManhDat edited a comment on issue #1395: SOLR-14365: CollapsingQParser - Avoiding always allocate int[] and float[] with size equals to number of unique values (WIP)

2020-04-08 Thread GitBox
CaoManhDat edited a comment on issue #1395: SOLR-14365: CollapsingQParser - 
Avoiding always allocate int[] and float[] with size equals to number of unique 
values (WIP)
URL: https://github.com/apache/lucene-solr/pull/1395#issuecomment-610985910
 
 
   > The javadoc in the constructor is good, but I'm concerned that this util 
class may be used elsewhere without clearly reading/understanding the risk
   
   I'm not very concern about this point much. If we gonna put this class in 
guava or some places like that I think it is worth to spend more time to 
documents or make the api right. But these classes will get used in Solr and 
when they use `DynamicMaps` they must have a clear idea of what they gonna use. 
I just want to avoid introduce more logic to these classes since one change 
needs to propagate and maintain in others.
   
   > Did you test and debug when the DynamicMap upgrades from a map to an array 
internally? I mean in debug mode step by step here. I think the map first 
enlarges and rehashes just before the upgrade to an array.
   
   It seems that your calculation missed the part that arraySize must be 
powerOfTwo and `initialCapacity` is not equals to `expectedElements`. 
   So I will go from backward with
   `arraySize=1024` -> `mapExpectedElements=768 (arraySize = 
expectedElements/0.75)` -> `threshold = 768 * 0.75 = 576`
   -> `resizeAt=768 (arraySize * loadFactor)`
   so we kinda safe here, right?


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene-solr] CaoManhDat edited a comment on issue #1395: SOLR-14365: CollapsingQParser - Avoiding always allocate int[] and float[] with size equals to number of unique values (WIP)

2020-04-08 Thread GitBox
CaoManhDat edited a comment on issue #1395: SOLR-14365: CollapsingQParser - 
Avoiding always allocate int[] and float[] with size equals to number of unique 
values (WIP)
URL: https://github.com/apache/lucene-solr/pull/1395#issuecomment-610985910
 
 
   > The javadoc in the constructor is good, but I'm concerned that this util 
class may be used elsewhere without clearly reading/understanding the risk
   
   I'm not very concern about this point much. If we gonna put this class in 
guava or some places like that I think it is worth to spend more time to 
documents or make the api right. But these classes will get used in Solr and 
when they use `DynamicMaps` they must have a clear idea of what they gonna use. 
I just want to avoid introduce more logic to these classes since one change 
needs to propagate and maintain in others.
   
   > Did you test and debug when the DynamicMap upgrades from a map to an array 
internally? I mean in debug mode step by step here. I think the map first 
enlarges and rehashes just before the upgrade to an array.
   
   It seems that your calculation missed the part that arraySize must be 
powerOfTwo. So I will go from backward with
   `arraySize=1024` -> `mapExpectedElements=768 (arraySize = 
expectedElements/0.75)` -> `threshold = 768 * 0.75 = 576`
   -> `resizeAt=768 (arraySize * loadFactor)`
   so we kinda safe here, right?


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene-solr] CaoManhDat edited a comment on issue #1395: SOLR-14365: CollapsingQParser - Avoiding always allocate int[] and float[] with size equals to number of unique values (WIP)

2020-04-08 Thread GitBox
CaoManhDat edited a comment on issue #1395: SOLR-14365: CollapsingQParser - 
Avoiding always allocate int[] and float[] with size equals to number of unique 
values (WIP)
URL: https://github.com/apache/lucene-solr/pull/1395#issuecomment-610985910
 
 
   > The javadoc in the constructor is good, but I'm concerned that this util 
class may be used elsewhere without clearly reading/understanding the risk
   
   I'm not very concern about this point much. If we gonna put this class in 
guava or some places like that I think it is worth to spend more time to 
documents or make the api right. But these classes will get used in Solr and 
when they use `DynamicMaps` they must have a clear idea of what they gonna use. 
I just want to avoid introduce more logic to these classes since one change 
needs to propagate and maintain in others.
   
   > Did you test and debug when the DynamicMap upgrades from a map to an array 
internally? I mean in debug mode step by step here. I think the map first 
enlarges and rehashes just before the upgrade to an array.
   
   It seems that your calculation missed the part that arraySize must be 
powerOfTwo. So I will go from backward with
   `arraySize=1024` -> `mapExpectedElements=768 (arraySize = expectedElements / 
0.75)` -> `threshold = 768 * 0.75 = 576`
   -> `resizeAt=768 (arraySize * loadFactor)`
   so we kinda safe here, right?


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene-solr] CaoManhDat commented on issue #1395: SOLR-14365: CollapsingQParser - Avoiding always allocate int[] and float[] with size equals to number of unique values (WIP)

2020-04-08 Thread GitBox
CaoManhDat commented on issue #1395: SOLR-14365: CollapsingQParser - Avoiding 
always allocate int[] and float[] with size equals to number of unique values 
(WIP)
URL: https://github.com/apache/lucene-solr/pull/1395#issuecomment-610985910
 
 
   > The javadoc in the constructor is good, but I'm concerned that this util 
class may be used elsewhere without clearly reading/understanding the risk
   
   I'm not very concern about this point much. If we gonna put this class in 
guava or some places like that I think it is worth to spend more time to 
documents or make the api right. But these classes will get used in Solr and 
when they use `DynamicMaps` they must have a clear idea of what they gonna use. 
I just want to avoid introduce more logic to these classes since one change 
needs to propagate and maintain in others.
   
   > Did you test and debug when the DynamicMap upgrades from a map to an array 
internally? I mean in debug mode step by step here. I think the map first 
enlarges and rehashes just before the upgrade to an array.
   
   It seems that your calculation missed the part that arraySize must be 
powerOfTwo. So I will go from backward with
   `arraySize=1024` -> `mapExpectedElements=768 (768/0.75)` -> `threshold = 768 
* 0.75 = 576`
   -> `resizeAt=768 (arraySize * loadFactor)`
   so we kinda safe here, right?


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene-solr] dsmiley commented on issue #1412: Add MinimalSolrTest for scale testing

2020-04-08 Thread GitBox
dsmiley commented on issue #1412: Add MinimalSolrTest for scale testing
URL: https://github.com/apache/lucene-solr/pull/1412#issuecomment-610972348
 
 
   We *could* have hard timeouts if they are run by a specific CI machine, 
perhaps @sarowe's real hardware?
   
   Before this gets committed, we need to ensure it is not run _yet_ by default 
because it isn't asserting anything.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene-solr] bruno-roustant commented on a change in pull request #1416: LUCENE-9286: FST.Arc.BitTable is read directly from the FST bytes.

2020-04-08 Thread GitBox
bruno-roustant commented on a change in pull request #1416: LUCENE-9286: 
FST.Arc.BitTable is read directly from the FST bytes.
URL: https://github.com/apache/lucene-solr/pull/1416#discussion_r405530113
 
 

 ##
 File path: lucene/core/src/java/org/apache/lucene/util/fst/FST.java
 ##
 @@ -195,10 +209,9 @@
 posArcsStart = other.posArcsStart();
 arcIdx = other.arcIdx();
 numArcs = other.numArcs();
-if (nodeFlags() == ARCS_FOR_DIRECT_ADDRESSING) {
-  bitTable = other.bitTable() == null ? null : other.bitTable().copy();
-  firstLabel = other.firstLabel();
-}
+bitTableStart = other.bitTableStart;
 
 Review comment:
   Ok. You've debugged this code for a long time so now you know :)
   I'll change that and I'll put a comment to explain.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene-solr] bruno-roustant commented on a change in pull request #1416: LUCENE-9286: FST.Arc.BitTable is read directly from the FST bytes.

2020-04-08 Thread GitBox
bruno-roustant commented on a change in pull request #1416: LUCENE-9286: 
FST.Arc.BitTable is read directly from the FST bytes.
URL: https://github.com/apache/lucene-solr/pull/1416#discussion_r405518985
 
 

 ##
 File path: lucene/core/src/java/org/apache/lucene/util/fst/FSTEnum.java
 ##
 @@ -178,7 +179,7 @@ protected void doSeekCeil() throws IOException {
 } else {
   if (targetIndex < 0) {
 targetIndex = -1;
-  } else if (arc.bitTable().isBitSet(targetIndex)) {
 
 Review comment:
   I agree. I'd like to review FSTEnum soon. First, to refactor and share common 
code like this; second, there is still room, I think, for some perf improvement 
in seek floor/ceil.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene-solr] bruno-roustant commented on a change in pull request #1416: LUCENE-9286: FST.Arc.BitTable is read directly from the FST bytes.

2020-04-08 Thread GitBox
bruno-roustant commented on a change in pull request #1416: LUCENE-9286: 
FST.Arc.BitTable is read directly from the FST bytes.
URL: https://github.com/apache/lucene-solr/pull/1416#discussion_r405516244
 
 

 ##
 File path: lucene/core/src/java/org/apache/lucene/util/fst/BitTableUtil.java
 ##
 @@ -0,0 +1,200 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.lucene.util.fst;
+
+import java.io.IOException;
+
+/**
+ * Static helper methods for {@link FST.Arc.BitTable}.
+ *
+ * @lucene.experimental
+ */
+class BitTableUtil {
+
+  /**
+   * Returns whether the bit at given zero-based index is set.
+   * Example: bitIndex 10 means the third bit on the right of the second 
byte.
+   *
+   * @param bitIndex The bit zero-based index. It must be greater than or 
equal to 0, and strictly less than
+   * {@code number of bit-table bytes * Byte.SIZE}.
+   * @param reader   The {@link FST.BytesReader} to read. It must be 
positioned at the beginning of the bit-table.
+   */
+  static boolean isBitSet(int bitIndex, FST.BytesReader reader) throws 
IOException {
+assert bitIndex >= 0 : "bitIndex=" + bitIndex;
+reader.skipBytes(bitIndex >> 3);
+return (readByte(reader) & (1L << (bitIndex & (Byte.SIZE - 1)))) != 0;
+  }
+
+
+  /**
+   * Counts all bits set in the bit-table.
+   *
+   * @param bitTableBytes The number of bytes in the bit-table.
+   * @param readerThe {@link FST.BytesReader} to read. It must be 
positioned at the beginning of the bit-table.
+   */
+  static int countBits(int bitTableBytes, FST.BytesReader reader) throws 
IOException {
+assert bitTableBytes >= 0 : "bitTableBytes=" + bitTableBytes;
 
 Review comment:
   > wouldn't it be more efficient to entire longs + the remainder?
   
   Yes, good point. I don't know why I didn't do it the same way as countBitsUpTo() 
below. This effectively avoids many conditions. I'll change that.
   
   > read8bytes - this is effectively the same as reader.readLong for bitcounts?
   
   Not the same; there is a difference in the byte order. reader.readLong() 
reads 2 ints by reading the high bytes first, while read8Bytes() reads 8 bytes with 
the low byte first. This matters when shifting by the bit index. I agree that this 
does not matter here for the bit count, but this way read8Bytes() is compatible with 
future use of a bit index. In addition it uses fewer operations, since it 
requires one less bit shift and bit mask.
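
   To illustrate the "entire longs + the remainder" idea discussed above, here is a 
rough sketch in plain Java; `ByteSource` is a stand-in for FST.BytesReader (an 
assumption for illustration, not the real reader API):
   ```java
   // Count bits by assembling full 8-byte words (low byte first, as read8Bytes()
   // does above) and then handling the 0..7 trailing bytes individually.
   final class BitCountSketch {
     interface ByteSource {
       int nextByte(); // next bit-table byte as an unsigned value 0..255
     }

     static int countBits(int bitTableBytes, ByteSource reader) {
       int count = 0;
       int fullWords = bitTableBytes >> 3;          // number of complete 8-byte words
       for (int i = 0; i < fullWords; i++) {
         long word = 0;
         for (int b = 0; b < 8; b++) {
           word |= ((long) reader.nextByte()) << (b * 8);
         }
         count += Long.bitCount(word);
       }
       for (int i = fullWords << 3; i < bitTableBytes; i++) {
         count += Integer.bitCount(reader.nextByte());
       }
       return count;
     }
   }
   ```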


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene-solr] uschindler edited a comment on issue #1397: LUCENE-9304: Refactor DWPTPool to pool DWPT directly

2020-04-08 Thread GitBox
uschindler edited a comment on issue #1397: LUCENE-9304: Refactor DWPTPool to 
pool DWPT directly
URL: https://github.com/apache/lucene-solr/pull/1397#issuecomment-610937953
 
 
   This could be related to Adrien's changes yesterday: LUCENE-9271


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene-solr] uschindler edited a comment on issue #1397: LUCENE-9304: Refactor DWPTPool to pool DWPT directly

2020-04-08 Thread GitBox
uschindler edited a comment on issue #1397: LUCENE-9304: Refactor DWPTPool to 
pool DWPT directly
URL: https://github.com/apache/lucene-solr/pull/1397#issuecomment-610937953
 
 
   This could be related to Adrien's changes yesterday: 
https://issues.apache.org/jira/issue/LUCENE-9271


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene-solr] uschindler commented on issue #1397: LUCENE-9304: Refactor DWPTPool to pool DWPT directly

2020-04-08 Thread GitBox
uschindler commented on issue #1397: LUCENE-9304: Refactor DWPTPool to pool 
DWPT directly
URL: https://github.com/apache/lucene-solr/pull/1397#issuecomment-610937953
 
 
   This could be related to Adrien's changes yesterday: 
https://issues.apache.org/jira/browse/issue/LUCENE-9271


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene-solr] mikemccand commented on issue #1397: LUCENE-9304: Refactor DWPTPool to pool DWPT directly

2020-04-08 Thread GitBox
mikemccand commented on issue #1397: LUCENE-9304: Refactor DWPTPool to pool 
DWPT directly
URL: https://github.com/apache/lucene-solr/pull/1397#issuecomment-610930218
 
 
   Hmm here's a more exotic test failure, also likely not caused by the changes 
here.  The fun things you learn when beasting on a 128 core box :)
   
   Hmm, though it is remotely possible my storage device or ECC RAM is flipping 
bits:
   
   ```
   org.apache.lucene.index.TestIndexManyDocuments > test suite's output saved 
to 
/home/mike/src/simon/lucene/core/build/test-results/test/outputs/OUTPUT-org.apache.lucene.index.TestIndexManyDocuments.txt,
 copi\
   ed below:
 1> CheckIndex failed
 1> 0.00% total deletions; 10976 documents; 0 deleteions
 1> Segments file=segments_1 numSegments=4 version=9.0.0 
id=cekay2d5izae12ssuqikoqgoc
 1>   1 of 4: name=_d maxDoc=8700
 1> version=9.0.0
 1> id=cekay2d5izae12ssuqikoqgob
 1> codec=Asserting(Lucene84)
 1> compound=false
 1> numFiles=11
 1> size (MB)=0.003
 1> diagnostics = {os.version=5.5.6-arch1-1, java.vendor=Oracle 
Corporation, source=merge, os.arch=amd64, mergeFactor=10, 
java.runtime.version=11.0.6+8-LTS, os=Linux, timestamp=1586347074798, lucene.ve\
   rsion=9.0.0, java.vm.version=11.0.6+8-LTS, java.version=11.0.6, 
mergeMaxNumSegments=-1}
 1> no deletions
 1> test: open reader.OK [took 0.001 sec]
 1> test: check integrity.OK [took 0.000 sec]
 1> test: check live docs.OK [took 0.000 sec]
 1> test: field infos.OK [1 fields] [took 0.000 sec]
 1> test: field norms.OK [1 fields] [took 0.001 sec]
 1> test: terms, freq, prox...ERROR: java.lang.AssertionError: 
buffer=java.nio.HeapByteBuffer[pos=0 lim=0 cap=0] bufferSize=1024 
buffer.length=0
 1> java.lang.AssertionError: buffer=java.nio.HeapByteBuffer[pos=0 lim=0 
cap=0] bufferSize=1024 buffer.length=0
 1>at 
org.apache.lucene.store.BufferedIndexInput.setBufferSize(BufferedIndexInput.java:78)
 1>at 
org.apache.lucene.codecs.MultiLevelSkipListReader.loadSkipLevels(MultiLevelSkipListReader.java:241)
 1>at 
org.apache.lucene.codecs.MultiLevelSkipListReader.init(MultiLevelSkipListReader.java:208)
 1>at 
org.apache.lucene.codecs.lucene84.Lucene84SkipReader.init(Lucene84SkipReader.java:103)
 1>at 
org.apache.lucene.codecs.lucene84.Lucene84PostingsReader$EverythingEnum.advance(Lucene84PostingsReader.java:837)
 1>at 
org.apache.lucene.index.FilterLeafReader$FilterPostingsEnum.advance(FilterLeafReader.java:271)
 1>at 
org.apache.lucene.index.AssertingLeafReader$AssertingPostingsEnum.advance(AssertingLeafReader.java:377)
 1>at 
org.apache.lucene.index.CheckIndex.checkFields(CheckIndex.java:1426)
 1>at 
org.apache.lucene.index.CheckIndex.testPostings(CheckIndex.java:1867)
 1>at org.apache.lucene.index.CheckIndex.checkIndex(CheckIndex.java:720)
 1>at org.apache.lucene.util.TestUtil.checkIndex(TestUtil.java:301)
 1>at org.apache.lucene.util.TestUtil.checkIndex(TestUtil.java:286)
 1>at 
org.apache.lucene.store.BaseDirectoryWrapper.close(BaseDirectoryWrapper.java:45)
 1>at org.apache.lucene.util.IOUtils.close(IOUtils.java:89)
 1>at org.apache.lucene.util.IOUtils.close(IOUtils.java:77)
 1>at 
org.apache.lucene.index.TestIndexManyDocuments.test(TestIndexManyDocuments.java:69)
 1>at 
java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
 1>at 
java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
 1>at 
java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
 1>at java.base/java.lang.reflect.Method.invoke(Method.java:566)
 1>at 
com.carrotsearch.randomizedtesting.RandomizedRunner.invoke(RandomizedRunner.java:1754)
 1>at 
com.carrotsearch.randomizedtesting.RandomizedRunner$8.evaluate(RandomizedRunner.java:942)
 1>at 
com.carrotsearch.randomizedtesting.RandomizedRunner$9.evaluate(RandomizedRunner.java:978)
 1>at 
com.carrotsearch.randomizedtesting.RandomizedRunner$10.evaluate(RandomizedRunner.java:992)
 1>at 
org.apache.lucene.util.TestRuleSetupTeardownChained$1.evaluate(TestRuleSetupTeardownChained.java:49)
 1>at 
org.apache.lucene.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:45)
 1>at 
org.apache.lucene.util.TestRuleThreadAndTestName$1.evaluate(TestRuleThreadAndTestName.java:48)
 1>at 
org.apache.lucene.util.TestRuleIgnoreAfterMaxFailures$1.evaluate(TestRuleIgnoreAfterMaxFailures.java:64)
 1>at 
org.apache.lucene.util.TestRuleMarkFailure$1.evaluate(TestRuleMarkFailure.java:47)
 1>at 
com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
 1>at 

[GitHub] [lucene-solr] bruno-roustant commented on a change in pull request #1395: SOLR-14365: CollapsingQParser - Avoiding always allocate int[] and float[] with size equals to number of unique value

2020-04-08 Thread GitBox
bruno-roustant commented on a change in pull request #1395: SOLR-14365: 
CollapsingQParser - Avoiding always allocate int[] and float[] with size equals 
to number of unique values (WIP)
URL: https://github.com/apache/lucene-solr/pull/1395#discussion_r405478717
 
 

 ##
 File path: 
solr/core/src/java/org/apache/solr/util/numeric/IntFloatDynamicMap.java
 ##
 @@ -0,0 +1,115 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.solr.util.numeric;
+
+import java.util.Arrays;
+
+import com.carrotsearch.hppc.IntFloatHashMap;
+import com.carrotsearch.hppc.cursors.FloatCursor;
+import com.carrotsearch.hppc.procedures.IntFloatProcedure;
+import org.apache.lucene.util.ArrayUtil;
+
+import static org.apache.solr.util.numeric.DynamicMap.mapExpectedElements;
+import static org.apache.solr.util.numeric.DynamicMap.threshold;
+import static org.apache.solr.util.numeric.DynamicMap.useArrayBased;
+
+public class IntFloatDynamicMap {
+  private int maxSize;
+  private IntFloatHashMap hashMap;
+  private float[] keyValues;
+  private float emptyValue;
+  private int threshold;
+
+  public IntFloatDynamicMap(int expectedMaxSize, float emptyValue) {
+this.threshold = threshold(expectedMaxSize);
+this.maxSize = expectedMaxSize;
+this.emptyValue = emptyValue;
+if (useArrayBased(expectedMaxSize)) {
+  upgradeToArray();
+} else {
+  this.hashMap = new IntFloatHashMap(mapExpectedElements(expectedMaxSize));
+}
+  }
+
+  private void upgradeToArray() {
+keyValues = new float[maxSize];
+if (emptyValue != 0.0f) {
+  Arrays.fill(keyValues, emptyValue);
+}
+if (hashMap != null) {
+  hashMap.forEach((IntFloatProcedure) (key, value) -> keyValues[key] = 
value);
+  hashMap = null;
+}
+  }
+
+  private void growBuffer(int minSize) {
+assert keyValues != null;
+int size = keyValues.length;
+keyValues = ArrayUtil.grow(keyValues, minSize);
+if (emptyValue != 0.0f) {
+  for (int i = size; i < keyValues.length; i++) {
+keyValues[i] = emptyValue;
+  }
+}
+  }
+
+
+  public void set(int key, float value) {
+if (keyValues != null) {
+  if (key >= keyValues.length) {
+growBuffer(key + 1);
+  }
+  keyValues[key] = value;
+} else {
+  this.hashMap.put(key, value);
+  this.maxSize = Math.max(key, maxSize);
+  if (this.hashMap.size() > threshold) {
 
 Review comment:
   Did you test and debug when the DynamicMap upgrades from a map to an array 
internally? I mean in debug mode step by step here. I think the map first 
enlarges and rehashes just before the upgrade to an array.
   
   Let's take an example with expectedKeyMax = 500K.
   this.threshold = threshold(expectedKeyMax) = 500K / 64 = 7812
   IntFloatHashMap initial capacity = mapExpectedElements(expectedKeyMax) = 
(int) (threshold(expectedKeyMax) / 0.75f) = (int) (7812 / 0.75f) = 10416
   IntFloatHashMap internal threshold = ceil(initial capacity * 0.75) = 
ceil(10416 * 0.75) = 7812
   Internally the HPPC map enlarges during a put() when its size == 7812 
*before* incrementing the size.
   Here the condition to upgrade to an array triggers when the map size *after* 
the put is > 7812, so at 7813.
   So I think the map first enlarges and rehashes just before we upgrade to an 
array, which would be wasteful.
   
   Also, the map internal threshold is ceil(initial capacity * 0.75), but it 
could be without ceil() for other implementations. To be safe wrt the float 
rounding, I suggested adding +1 in DynamicMap.mapExpectedElements(int), but it 
is probably better to be safe here.
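
   To make that arithmetic concrete, a small worked example for expectedKeyMax = 500K; 
the helper formulas below mirror this comment and the snippet under review, and are 
only illustrative, not the actual DynamicMap API:
   ```java
   // Worked example of the sizing walk-through above (numbers are illustrative).
   public class DynamicMapSizingExample {
     public static void main(String[] args) {
       int expectedKeyMax = 500_000;

       int threshold = expectedKeyMax / 64;                                // 7812
       int mapExpectedElements = (int) (threshold / 0.75f);                // 10416
       int internalResizeAt = (int) Math.ceil(mapExpectedElements * 0.75); // 7812

       System.out.println("threshold=" + threshold
           + ", mapExpectedElements=" + mapExpectedElements
           + ", internalResizeAt=" + internalResizeAt);
       // The map resizes internally when its size reaches 7812, while the
       // upgrade-to-array check (size > threshold) only fires at 7813, so the
       // map enlarges and rehashes once just before being replaced by the array.
     }
   }
   ```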


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org


